CN110060736A - DNA methylation extension method - Google Patents


Info

Publication number
CN110060736A
Authority
CN
China
Prior art keywords
cpg
site
methylation
similarity
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910289075.9A
Other languages
Chinese (zh)
Other versions
CN110060736B (en
Inventor
凡时财
孙毅
邹见效
徐红兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910289075.9A
Publication of CN110060736A
Application granted
Publication of CN110060736B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Abstract

The invention discloses a DNA methylation extension method. A reference site set is built from the CpG sites covered by an existing methylation level test method, and a training site set is then built. For each CpG site in the training site set, M similarity calculation methods are applied, each selecting the most similar CpG site from the reference site set, which yields M most similar CpG sites. Training data are then extracted from an existing public methylation database and used to train the constructed prediction model. For a site to be extended in a given DNA sequence, the same M similarity calculation methods select its M most similar CpG sites from the reference site set. The methylation levels of these M sites in that DNA sequence are measured with the existing methylation level test method and fed into the prediction model, whose output is the methylation level of the site to be extended. The invention thus uses data on known CpG sites to accurately extend methylation to CpG sites whose methylation level is unknown.

Description

DNA methylation extension method
Technical field
The invention belongs to the technical field of DNA methylation detection and, more specifically, relates to a DNA methylation extension method.
Background
DNA methylation is one of the important means of regulating genome function, so studying it is of great significance. Research on methylation is often a key step in disease prevention and treatment, and differences at CpG sites in human genes are often a direct cause of disease. Methylation has therefore attracted the attention of many researchers, which has made it the most studied form of modification in epigenetics and one of the first phenomena encountered on entering the field of gene research.
The means of obtaining DNA methylation data are currently very limited; DNA methylation levels and related data are mainly acquired with biochemical reagents. Although the data measured in this way are more reliable, the approach is time-consuming, costly and complicated. Obtaining methylation data with the human DNA methylation 450K chip is comparatively affordable, but it yields data for only about 450,000 fixed DNA sites, less than one twentieth of the sites in the whole human genome, so its detection range is very limited. Predicting DNA methylation levels with machine learning, data mining and similar methods in order to extend existing methylation data has therefore become one of the hot topics in bioinformatics, and in DNA methylation research in particular. Predicting and extending DNA methylation data will allow a deeper understanding and explanation of the role of DNA methylation in life science research.
Among existing methods for extending DNA methylation chip data, some integrate 450K chip data and extract two classes of features from it to build a model, while others build a model from tissue features or from the similarity of adjacent sites. All have achieved good results, but they share the following shortcomings: 1) The number of features used is insufficient. DNA with the same sequence but from different tissues can have entirely different methylation patterns, which shows that predicting methylation level from a single kind of feature is far from enough. 2) They do not make full use of the existing methylation level data obtained by whole-genome bisulfite sequencing (WGBS), nor do they fully exploit the public methylation database GEO, so the prediction models still leave room for optimization.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a DNA methylation extension method that uses data on known CpG sites to accurately extend methylation to CpG sites whose methylation level is unknown.
To achieve the above object, the DNA methylation extension method of the invention comprises the following steps:
S1: Construct two sets of CpG sites, a reference site set A and a training site set B. The CpG sites in reference site set A are those covered by an existing methylation level test method. Denote the number of CpG sites in reference site set A by N_A and the number of CpG sites in training site set B by N_B.
S2: For each CpG site CpG_i in training site set B, i = 1, 2, ..., N_B, apply each of M similarity calculation methods to select, from the N_A CpG sites of reference site set A, the CpG site most similar to CpG_i, giving M most similar CpG sites. Denote the most similar CpG site of CpG_i under the m-th similarity calculation method by CpG_{i,m}, m = 1, 2, ..., M.
Select R test samples from an existing public methylation database, each of which covers the methylation levels of all human CpG sites. For each CpG site CpG_i in training site set B, obtain its methylation level λ_{i,r} in the r-th test sample, r = 1, 2, ..., R. For each most similar CpG site CpG_{i,m} of each CpG site CpG_i, obtain its methylation level λ_{i,m,r} in the r-th test sample.
S3: Construct a prediction model. For each CpG site CpG_i in training site set B, train the model with the methylation levels λ_{i,m,r} of its M most similar CpG sites CpG_{i,m} as input and the methylation level λ_{i,r} of CpG_i as the desired output.
S4: For a site S′ to be extended in a given DNA sequence, apply each of the M similarity calculation methods to select, from the N_A CpG sites of reference site set A, the CpG site most similar to S′. Denote the most similar CpG site of S′ under the m-th similarity calculation method by CpG′_m. Measure the methylation level λ_m of each most similar CpG site CpG′_m in that DNA sequence with the existing methylation level test method, and feed the M methylation levels λ_m into the prediction model trained in step S3. The output is the methylation level of the site S′ to be extended.
In summary, the method builds a reference site set from the CpG sites covered by an existing methylation level test method, builds a training site set, selects for each training site its M most similar CpG sites in the reference site set using M similarity calculation methods, and trains the prediction model on data extracted from an existing public methylation database. For a site to be extended in a given DNA sequence, its M most similar CpG sites are selected in the same way, their methylation levels in that DNA sequence are measured with the existing methylation level test method and fed into the prediction model, and the output is the methylation level of the site to be extended. The invention models every human CpG site individually, so that the extension of DNA methylation reaches site-level precision, which is of great significance for DNA methylation research.
Brief description of the drawings
Fig. 1 is a flow chart of a specific embodiment of the DNA methylation extension method of the invention;
Fig. 2 compares the MSE of the four least-squares linear regression models in this embodiment;
Fig. 3 compares the MAE of the four least-squares linear regression models in this embodiment;
Fig. 4 compares the R² of the four least-squares linear regression models in this embodiment.
Detailed description of the embodiments
Specific embodiments of the invention are described below with reference to the drawings so that those skilled in the art can better understand the invention. Note that, in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the main content of the invention.
Fig. 1 is a flow chart of a specific embodiment of the DNA methylation extension method of the invention. As shown in Fig. 1, the method comprises the following steps:
S101: Construct the CpG site sets:
Construct two sets of CpG sites, a reference site set A and a training site set B. The CpG sites in reference site set A are those covered by an existing methylation level test method, that is, sites whose methylation level can be measured with that method. The existing methylation level test method used in this embodiment is the test method based on the 450K methylation chip. Denote the number of CpG sites in reference site set A by N_A and the number of CpG sites in training site set B by N_B.
S102: Obtain training data:
For each CpG site CpG_i in training site set B, i = 1, 2, ..., N_B, apply each of M similarity calculation methods to select, from the N_A CpG sites of reference site set A, the CpG site most similar to CpG_i, giving M most similar CpG sites. Denote the most similar CpG site of CpG_i under the m-th similarity calculation method by CpG_{i,m}, m = 1, 2, ..., M.
Select R test samples from an existing public methylation database (this embodiment uses the GEO database, i.e. Gene Expression Omnibus), each of which covers the methylation levels of all human CpG sites. In other words, every human CpG site has R sample values available for calculation.
For each CpG site CpG_i in training site set B, obtain its methylation level λ_{i,r} in the r-th test sample, r = 1, 2, ..., R. For each most similar CpG site CpG_{i,m} of each CpG site CpG_i, obtain its methylation level λ_{i,m,r} in the r-th test sample. These data constitute the training data.
The type of similarity is an important factor in the technical effect of the invention. After study, three similarities are preferred: methylation similarity, histone modification similarity and sequence composition similarity. The calculation of each is described in detail below.
Methylation similarity
Select P test samples from an existing public methylation database, each of which covers the methylation levels of all human CpG sites, so that every human CpG site has P sample values available for calculation. A P-dimensional feature vector can then be built for each CpG site, whose p-th element is the methylation level of that site in the p-th test sample, p = 1, 2, ..., P.
Denote the two CpG sites whose similarity is to be calculated by CpG_a and CpG_b. From the P test samples, obtain their P-dimensional feature vectors α_a = [α_{a,1}, α_{a,2}, ..., α_{a,P}] and α_b = [α_{b,1}, α_{b,2}, ..., α_{b,P}], where α_{a,p} and α_{b,p} are the methylation levels of CpG_a and CpG_b in the p-th test sample. Then calculate the similarity between α_a and α_b; the greater this similarity, the more similar CpG_a and CpG_b are.
This embodiment uses the Pearson correlation coefficient to measure the similarity between α_a and α_b: the larger the coefficient, the more similar the two feature vectors. The Pearson correlation coefficient is a widely used correlation coefficient, so its formula is not repeated here.
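The selection of a most similar reference site by Pearson correlation can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the site IDs and the toy methylation values are assumptions, and returning 0.0 for a constant vector is a convention chosen here, not something the source specifies.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / sqrt(var_x * var_y)

def most_similar_site(target_vec, reference_vecs):
    """Return the ID of the reference site whose vector correlates best with target_vec."""
    return max(reference_vecs,
               key=lambda site: pearson(target_vec, reference_vecs[site]))

# Toy data: methylation levels across P = 4 samples (values are made up).
reference = {
    "ref_1": [0.10, 0.20, 0.30, 0.40],
    "ref_2": [0.90, 0.70, 0.50, 0.20],
}
target = [0.15, 0.22, 0.35, 0.41]
best = most_similar_site(target, reference)
```

The same `most_similar_site` helper applies unchanged to the histone modification and sequence composition feature vectors, since only the vectors differ between the similarity types.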
Histone modification similarity
Denote the CpG sites whose similarity is to be calculated by CpG_a and CpG_b. Using an existing public methylation database, determine the absolute positions of CpG_a and CpG_b on the DNA and extract the DNA sequences E_a and E_b flanking each position upstream and downstream. The length of the flanking sequence can be chosen as required and is 600 bp in this embodiment. Select Q histone samples as required and count each of them in E_a and E_b; denote the counts for the q-th histone sample in E_a and E_b by β_{a,q} and β_{b,q}, q = 1, 2, ..., Q. This gives two feature vectors of length Q, β_a = [β_{a,1}, β_{a,2}, ..., β_{a,Q}] and β_b = [β_{b,1}, β_{b,2}, ..., β_{b,Q}]. Then calculate the similarity between β_a and β_b; the greater this similarity, the more similar CpG_a and CpG_b are. As before, this embodiment uses the Pearson correlation coefficient to measure the similarity between β_a and β_b.
Sequence composition similarity
Denote the CpG sites whose similarity is to be calculated by CpG_a and CpG_b. Using an existing public methylation database, determine the absolute positions of CpG_a and CpG_b on the DNA and extract the DNA sequences E_a and E_b flanking each position upstream and downstream. As before, the length of the flanking sequence can be chosen as required and is 600 bp in this embodiment. Set the range k = 1, 2, ..., K of the parameter k of the k-mers algorithm as required; the number of distinct compositions of A, T, C and G to be counted in a DNA sequence is then H = Σ_{k=1}^{K} 4^k. Apply the k-mers algorithm to E_a and E_b to obtain the H DNA sequence composition results of CpG_a and CpG_b, and denote the quantity of the h-th composition result in E_a and E_b by ω_{a,h} and ω_{b,h}, h = 1, 2, ..., H. This gives two feature vectors of length H, ω_a = [ω_{a,1}, ω_{a,2}, ..., ω_{a,H}] and ω_b = [ω_{b,1}, ω_{b,2}, ..., ω_{b,H}]. Then calculate the similarity between ω_a and ω_b; the greater this similarity, the more similar CpG_a and CpG_b are. As before, this embodiment uses the Pearson correlation coefficient to measure the similarity between ω_a and ω_b.
S103: Train the prediction model:
Construct a prediction model. For each CpG site CpG_i in training site set B, train the model with the methylation levels λ_{i,m,r} of its M most similar CpG sites CpG_{i,m} as input and the methylation level λ_{i,r} of CpG_i as the desired output. The prediction model in this embodiment is a least-squares linear regression model.
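For a single training site, the fit described above amounts to ordinary least squares on R rows of M inputs. The sketch below solves the normal equations in plain Python so that it needs no third-party package; it is an illustration under assumptions, not the source's code. The function names and the toy data (five samples, M = 3, targets generated from known weights) are hypothetical.

```python
def fit_least_squares(X, y):
    """Fit y ≈ X·a + b by ordinary least squares; returns (coefficients, intercept).

    Solves the normal equations (ZᵀZ)w = Zᵀy by Gauss-Jordan elimination,
    where Z is X with a column of ones appended for the intercept.
    """
    Z = [list(row) + [1.0] for row in X]
    n = len(Z[0])
    # Augmented matrix [ZᵀZ | Zᵀy].
    A = [[sum(z[i] * z[j] for z in Z) for j in range(n)]
         + [sum(z[i] * t for z, t in zip(Z, y))] for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; the fit is not unique")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    w = [A[i][n] / A[i][i] for i in range(n)]
    return w[:-1], w[-1]

def predict(coefficients, intercept, x):
    """Apply the fitted linear model to one input row."""
    return sum(c * v for c, v in zip(coefficients, x)) + intercept

# Toy training rows for one site CpG_i: each row holds the levels of its
# M = 3 most similar reference sites in one sample (values are made up).
X = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.5], [0.7, 0.9, 0.2],
     [0.3, 0.6, 0.8], [0.9, 0.4, 0.6]]
y = [0.5 * a + 0.3 * b + 0.1 * c + 0.05 for a, b, c in X]
coefficients, intercept = fit_least_squares(X, y)
```

Because one model is fitted per training site, the problem stays tiny (M + 1 unknowns), which is why a direct solve of the normal equations is adequate in this sketch.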
S104: Methylation extension:
For a site S′ to be extended in a given DNA sequence, apply each of the M similarity calculation methods to select, from the N_A CpG sites of reference site set A, the CpG site most similar to S′. Denote the most similar CpG site of S′ under the m-th similarity calculation method by CpG′_m. Measure the methylation level λ_m of each most similar CpG site CpG′_m in that DNA sequence with the existing methylation level test method, and feed the M methylation levels λ_m into the prediction model trained in step S103. The output is the methylation level of the site S′ to be extended.
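The extension step can be sketched as a lookup followed by a linear prediction. Everything named here is hypothetical: the dictionary layout, the `extend_site` function, the toy weights, and in particular `neg_sq_dist`, which is only a short stand-in similarity so the example stays compact (the embodiment uses the Pearson correlation coefficient).

```python
def neg_sq_dist(u, v):
    """Stand-in similarity for this sketch: larger means more similar."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))

def extend_site(target_features, reference_features, measured_levels,
                similarity_fns, coefficients, intercept):
    """Predict the methylation level of one unmeasured CpG site.

    target_features and reference_features[site] map a similarity name to the
    feature vector used by that similarity. measured_levels maps each
    reference site to its level measured in the DNA sample at hand.
    similarity_fns and coefficients are keyed by the same similarity names.
    """
    prediction = intercept
    for name, similarity in similarity_fns.items():
        # Most similar reference site under this similarity (CpG'_m).
        best = max(reference_features,
                   key=lambda s: similarity(target_features[name],
                                            reference_features[s][name]))
        prediction += coefficients[name] * measured_levels[best]
    return prediction

# Toy inputs (all values are made up).
reference_features = {
    "ref_1": {"methylation": [0.1, 0.9], "sequence": [0.5, 0.5]},
    "ref_2": {"methylation": [0.8, 0.2], "sequence": [0.1, 0.9]},
}
target_features = {"methylation": [0.2, 0.8], "sequence": [0.2, 0.8]}
measured_levels = {"ref_1": 0.30, "ref_2": 0.70}
similarity_fns = {"methylation": neg_sq_dist, "sequence": neg_sq_dist}
coefficients = {"methylation": 0.6, "sequence": 0.3}
predicted = extend_site(target_features, reference_features, measured_levels,
                        similarity_fns, coefficients, intercept=0.05)
```

In this toy case the two similarities pick different reference sites (ref_1 and ref_2), which illustrates why each similarity type contributes its own input to the model.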
Embodiment
To better illustrate the technical solution of the invention, its implementation is described in detail with a specific example.
In this embodiment the existing methylation level test method is the test method based on the 450K methylation chip, so every site in reference site set A is a 450K site, and the existing public methylation database is the GEO database. Three similarities are used to select the most similar CpG sites: methylation similarity, histone modification similarity and sequence composition similarity.
Methylation similarity
This embodiment selects 101 test samples from the public methylation database GEO, each of which covers the methylation values of all human CpG sites. Table 1 shows part of the data for the 101 test samples in this embodiment.
Table 1
Table 2 shows part of the 101-dimensional feature vectors of the CpG sites in this embodiment.
Table 2
For every CpG site in reference site set A and in training site set B, the 101-dimensional feature vector is obtained as described above. The Pearson correlation coefficient is then calculated between the feature vector of each CpG site in training site set B and that of each CpG site in reference site set A, and the CpG site in reference site set A with the largest correlation coefficient is chosen as the most similar CpG site.
Histone modification similarity
The public methylation database GEO provides much useful information, such as the CpG site ID, the chromosome number and the absolute position on the DNA. This embodiment collects the absolute position of each CpG site on the DNA and takes the 600 bp around it (300 bp upstream and 300 bp downstream) as the position interval. With this interval as input, an R package for retrieving DNA sequences outputs the DNA sequence needed for the site. Table 3 shows part of the upstream and downstream DNA sequences of the sites in this embodiment.
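The position interval can be computed with a few lines of code. The sketch below is an assumption-laden illustration in Python (the source uses an unnamed R package): the function names are hypothetical, coordinates are taken to be 1-based and inclusive, and the chromosome is assumed to be available as an in-memory string.

```python
def flank_window(position, flank=300, chrom_length=None):
    """Return (start, end) of a window spanning `flank` bp on each side of `position`.

    Coordinates are 1-based and inclusive (an assumption; the source does not
    state its coordinate convention). The window is clipped to the chromosome.
    """
    start = max(1, position - flank)
    end = position + flank
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end

def extract_flank(sequence, position, flank=300):
    """Slice the flanking sequence out of an in-memory chromosome string."""
    start, end = flank_window(position, flank, len(sequence))
    return sequence[start - 1:end]  # 1-based inclusive -> Python slice

window = flank_window(1000)
```

With an inclusive window the slice contains 2 × 300 + 1 bases because it includes the site position itself; dropping one end gives exactly 600 bp if that is required.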
Table 3
This embodiment selects 6 samples for each of 5 histones, 30 samples in total, and counts each histone sample in the upstream and downstream sequence of each CpG site. The 30 resulting counts for a site form its feature vector. Table 4 shows part of the 30-dimensional feature vectors of the CpG sites in this embodiment.
Table 4
For every CpG site in reference site set A and in training site set B, the 30-dimensional feature vector is obtained as described above. The Pearson correlation coefficient is then calculated between the feature vector of each CpG site in training site set B and that of each CpG site in reference site set A, and the CpG site in reference site set A with the largest correlation coefficient is chosen as the most similar CpG site.
Sequence composition similarity
Similarly, the upstream and downstream DNA sequence (a string of A, T, C and G) around the absolute position of each CpG site can be obtained from Table 3. In this embodiment the range of the k-mers parameter k is set to k = 1, 2, 3, 4. 1-mers count the proportions of A, T, C and G in the DNA sequence; 2-mers count the proportions of the 16 combinations AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC and GG; 3-mers count the proportions of the 64 combinations; and 4-mers count the proportions of the 256 combinations. In total there are 4 + 16 + 64 + 256 = 340 composition proportions of A, T, C and G, giving a 340-dimensional feature vector. Table 5 shows part of the DNA sequence composition statistics for the upstream and downstream DNA sequences of the sites in this embodiment.
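A 340-dimensional composition vector of this kind can be built as sketched below. The function names and the short input string are assumptions for illustration. Two choices here are also assumptions the source does not spell out: windows are overlapping, and each proportion is taken relative to the number of valid windows of the same k.

```python
from itertools import product

ALPHABET = "ATCG"

def kmer_vocabulary(max_k=4):
    """All k-mers over A, T, C, G for k = 1..max_k, in a fixed order."""
    return ["".join(p)
            for k in range(1, max_k + 1)
            for p in product(ALPHABET, repeat=k)]

def kmer_proportions(sequence, max_k=4):
    """Proportion of each k-mer among all overlapping windows of the same k."""
    sequence = sequence.upper()
    vocabulary = kmer_vocabulary(max_k)
    counts = {kmer: 0 for kmer in vocabulary}
    totals = {k: 0 for k in range(1, max_k + 1)}
    for k in range(1, max_k + 1):
        for i in range(len(sequence) - k + 1):
            window = sequence[i:i + k]
            if window in counts:  # skips windows containing N or other symbols
                counts[window] += 1
                totals[k] += 1
    return [counts[kmer] / totals[len(kmer)] if totals[len(kmer)] else 0.0
            for kmer in vocabulary]

features = kmer_proportions("ACGTACGTTTGACCA")
```

The fixed vocabulary order matters: two sites can only be compared by correlation if position h of both vectors refers to the same k-mer.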
Table 5
For every CpG site in reference site set A and in training site set B, the 340-dimensional feature vector is obtained as described above. The Pearson correlation coefficient is then calculated between the feature vector of each CpG site in training site set B and that of each CpG site in reference site set A, and the CpG site in reference site set A with the largest correlation coefficient is chosen as the most similar CpG site.
For each site in the training site set, this yields three most similar 450K sites: the most similar by methylation, by histone modification and by sequence composition. Using the 101 test samples selected from the public methylation database, the methylation level of each training site and of its three most similar 450K sites is obtained in every test sample. A least-squares linear regression model is fitted to the relationship between the methylation levels of the three most similar 450K sites and the methylation level of the corresponding training site, and the model is trained with cross-validation to obtain an accurate prediction model.
This embodiment uses the least-squares linear regression model with trainable weights provided by the sklearn package in Python, linear_model.LinearRegression(). Four least-squares linear regression models are used, described in detail below:
Model 1: Uses only the most similar 450K site obtained by methylation similarity. The least-squares linear regression model is the linear relationship Y = a_{11}X + b_1, where Y is the output, i.e. the methylation level of the CpG site to be extended, and X is the methylation level of the most similar 450K site obtained by methylation similarity. With 101 test samples in this embodiment, each CpG site in the training site set has 101 rows of data. Rows 1 to 10, rows 11 to 20, ..., rows 91 to 100 of these 101 sample values are used in turn as the validation set, with the remaining 91 rows as the training set, for 10 rounds of cross-validation in total. This yields the values of the parameters a_{11} and b_1 together with error statistics, completing the training of the prediction model.
Model 2: Uses the two most similar 450K sites obtained by methylation similarity and histone modification similarity. The least-squares linear regression model is the linear relationship Y = a_{21}X_1 + a_{22}X_2 + b_2, where Y is the output, i.e. the methylation level of the CpG site to be extended, X_1 is the methylation level of the most similar 450K site obtained by methylation similarity, and X_2 is the methylation level of the most similar 450K site obtained by histone modification similarity. The same 10 rounds of cross-validation as for Model 1 (rows 1 to 10, rows 11 to 20, ..., rows 91 to 100 in turn as the validation set, the remaining 91 rows as the training set) yield the values of the parameters a_{21}, a_{22} and b_2 together with error statistics, completing the training of the prediction model.
Model 3: Uses the two most similar 450K sites obtained by methylation similarity and sequence composition similarity. The least-squares linear regression model is the linear relationship Y = a_{31}X_1 + a_{32}X_2 + b_3, where Y is the output, i.e. the methylation level of the CpG site to be extended, X_1 is the methylation level of the most similar 450K site obtained by methylation similarity, and X_2 is the methylation level of the most similar 450K site obtained by sequence composition similarity. The same 10 rounds of cross-validation as for Model 1 yield the values of the parameters a_{31}, a_{32} and b_3 together with error statistics, completing the training of the prediction model.
Model 4: Uses the three most similar 450K sites obtained by methylation similarity, histone modification similarity and sequence composition similarity. The least-squares linear regression model is the linear relationship Y = a_{41}X_1 + a_{42}X_2 + a_{43}X_3 + b_4, where Y is the output, i.e. the methylation level of the CpG site to be extended, X_1 is the methylation level of the most similar 450K site obtained by methylation similarity, X_2 is the methylation level of the most similar 450K site obtained by histone modification similarity, and X_3 is the methylation level of the most similar 450K site obtained by sequence composition similarity. The same 10 rounds of cross-validation as for Model 1 yield the values of the parameters a_{41}, a_{42}, a_{43} and b_4 together with error statistics, completing the training of the prediction model.
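The validation scheme shared by the four models (ten contiguous blocks of 10 rows, each with the other 91 rows as the training set) can be written out as index lists. This is a sketch with a hypothetical function name; indices are 0-based, so row 1 in the text is index 0.

```python
def contiguous_folds(n_samples=101, fold_size=10, n_folds=10):
    """Yield (train_indices, validation_indices) pairs of 0-based row indices."""
    for fold in range(n_folds):
        start = fold * fold_size
        validation = list(range(start, start + fold_size))
        held_out = set(validation)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, validation

folds = list(contiguous_folds())
```

Under this scheme row 101 (index 100) never appears in a validation block, so it contributes to training in all ten rounds.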
To illustrate the performance of the four models, this embodiment uses three evaluation metrics: MSE (mean squared error), MAE (mean absolute error) and R² (coefficient of determination). Fig. 2 compares the MSE of the four least-squares linear regression models in this embodiment, Fig. 3 compares their MAE, and Fig. 4 compares their R². As shown in Figs. 2 to 4, Model 4, which uses as three features the methylation levels of the three 450K sites most similar by methylation, by histone modification and by sequence composition, performs best, consistent with the general rule that more features give a better prediction model. The performance of the other prediction models is also within an acceptable range, so in practice the number and types of similarities can be chosen as required.
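For reference, the three metrics follow their standard definitions, sketched below in plain Python. The function names and the toy values are illustrative assumptions and are not results from the patent.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination; undefined when y_true is constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy measured and predicted methylation levels (values are made up).
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]
scores = (mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Lower MSE and MAE and higher R² indicate a better fit, which is the sense in which the figures rank the four models.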
Although illustrative specific embodiments of the invention have been described above so that those skilled in the art can understand it, the invention is not limited to the scope of those embodiments. To those of ordinary skill in the art, variations within the spirit and scope of the invention as defined by the appended claims are obvious, and all innovations that make use of the inventive concept fall within the scope of protection.

Claims (6)

1. A DNA methylation extension method, characterized by comprising the following steps:
S1: constructing two sets of CpG sites, a reference site set A and a training site set B, wherein the CpG sites in reference site set A are those covered by an existing methylation level test method, the number of CpG sites in reference site set A is denoted N_A, and the number of CpG sites in training site set B is denoted N_B;
S2: for each CpG site CpG_i in training site set B, i = 1, 2, ..., N_B, applying each of M similarity calculation methods to select, from the N_A CpG sites of reference site set A, the CpG site most similar to CpG_i, giving M most similar CpG sites, the most similar CpG site of CpG_i under the m-th similarity calculation method being denoted CpG_{i,m}, m = 1, 2, ..., M;
selecting R test samples from an existing public methylation database, each of which covers the methylation levels of all human CpG sites; for each CpG site CpG_i in training site set B, obtaining its methylation level λ_{i,r} in the r-th test sample, r = 1, 2, ..., R; for each most similar CpG site CpG_{i,m} of each CpG site CpG_i, obtaining its methylation level λ_{i,m,r} in the r-th test sample;
S3: constructing a prediction model and, for each CpG site CpG_i in training site set B, training the model with the methylation levels λ_{i,m,r} of its M most similar CpG sites CpG_{i,m} as input and the methylation level λ_{i,r} of CpG_i as the desired output;
S4: for a site S′ to be extended in a given DNA sequence, applying each of the M similarity calculation methods to select, from the N_A CpG sites of reference site set A, the CpG site most similar to S′, the most similar CpG site of S′ under the m-th similarity calculation method being denoted CpG′_m; measuring the methylation level λ_m of each most similar CpG site CpG′_m in that DNA sequence with the existing methylation level test method, and feeding the M methylation levels λ_m into the prediction model trained in step S3, the output being the methylation level of the site S′ to be extended.
2. The DNA methylation expansion method according to claim 1, characterized in that the existing methylation level test method is a test method based on the 450K methylation chip.
3. The DNA methylation expansion method according to claim 1, characterized in that the prediction model is a least-squares regression model.
4. The DNA methylation expansion method according to claim 1, characterized in that the similarities in step S2 comprise methylation similarity, histone modification similarity, and sequence composition similarity, wherein:
The methylation similarity is calculated as follows:
Select P test samples from an existing public methylation database, and denote the two CpG sites whose similarity is to be calculated by $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$. From the P test samples, obtain the P-dimensional feature vectors of the two CpG sites, $\alpha_a = [\alpha_{a,1}, \alpha_{a,2}, \ldots, \alpha_{a,P}]$ and $\alpha_b = [\alpha_{b,1}, \alpha_{b,2}, \ldots, \alpha_{b,P}]$, where $\alpha_{a,p}$ and $\alpha_{b,p}$ denote the methylation levels of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$ in the p-th test sample. Then calculate the similarity between the feature vectors $\alpha_a$ and $\alpha_b$; the greater this similarity, the greater the similarity of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$;
The histone modification similarity is calculated as follows:
Denote the two CpG sites whose similarity is to be calculated by $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$. Using an existing public methylation database, determine the absolute positions of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$ on the DNA, and extract the DNA sequences $E_a$ and $E_b$ upstream and downstream of the positions of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$, respectively. Select Q histone samples as needed, and count the occurrences of each of these Q histone samples in $E_a$ and in $E_b$. Denote the counts of the q-th histone sample in $E_a$ and $E_b$ by $\beta_{a,q}$ and $\beta_{b,q}$, where $q = 1, 2, \ldots, Q$, and construct two feature vectors of length Q, $\beta_a = [\beta_{a,1}, \beta_{a,2}, \ldots, \beta_{a,Q}]$ and $\beta_b = [\beta_{b,1}, \beta_{b,2}, \ldots, \beta_{b,Q}]$. Then calculate the similarity between the feature vectors $\beta_a$ and $\beta_b$; the greater this similarity, the greater the similarity of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$;
The sequence composition similarity is calculated as follows:
Denote the two CpG sites whose similarity is to be calculated by $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$. Using an existing public methylation database, determine the absolute positions of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$ on the DNA, and extract the DNA sequences $E_a$ and $E_b$ upstream and downstream of the positions of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$, respectively. Set the value range $k = 1, 2, \ldots, K$ of the parameter k of the k-mers algorithm as needed; the number of composition results of A, T, C, G to be counted in a DNA sequence is then $H = \sum_{k=1}^{K} 4^k$. Apply the k-mers algorithm to $E_a$ and to $E_b$ to count the H kinds of DNA sequence composition results for $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$. Denote the counts of the h-th kind of DNA sequence composition result in $E_a$ and $E_b$ by $\omega_{a,h}$ and $\omega_{b,h}$, where $h = 1, 2, \ldots, H$, and construct two feature vectors of length H, $\omega_a = [\omega_{a,1}, \omega_{a,2}, \ldots, \omega_{a,H}]$ and $\omega_b = [\omega_{b,1}, \omega_{b,2}, \ldots, \omega_{b,H}]$. Then calculate the similarity between the feature vectors $\omega_a$ and $\omega_b$; the greater this similarity, the greater the similarity of $\mathrm{CpG}_a$ and $\mathrm{CpG}_b$.
5. The DNA methylation expansion method according to claim 4, characterized in that the similarity between the feature vectors is measured by the Pearson correlation coefficient; the larger the Pearson correlation coefficient, the greater the similarity between the two feature vectors.
6. The DNA methylation expansion method according to claim 4, characterized in that the prediction model in step S3 is a least-squares regression model expressed as $Y = a_{41} X_1 + a_{42} X_2 + a_{43} X_3 + b_4$, where $Y$ denotes the output, i.e. the methylation level of the CpG site to be expanded, $X_1$ denotes the methylation level of the most similar CpG site obtained using methylation similarity, $X_2$ denotes the methylation level of the most similar CpG site obtained using histone modification similarity, and $X_3$ denotes the methylation level of the most similar CpG site obtained using sequence composition similarity.
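The building blocks recited in claims 4 to 6 (k-mer count vectors, Pearson-based selection of the most similar reference site, and a least-squares fit with an intercept) can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: all function names are hypothetical, NumPy is assumed, k-mers are counted over overlapping windows, and the toy data are invented.

```python
from itertools import product
import numpy as np

def kmer_counts(seq, K):
    """Count every k-mer over A/C/G/T for k = 1..K, in a fixed order.

    The vector has sum(4**k for k in 1..K) entries.
    """
    seq = seq.upper()
    counts = []
    for k in range(1, K + 1):
        # Overlapping windows of length k (an assumption of this sketch).
        windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in product("ACGT", repeat=k):
            counts.append(windows.count("".join(kmer)))
    return np.array(counts, dtype=float)

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors.

    Returns nan if either vector is constant.
    """
    return float(np.corrcoef(u, v)[0, 1])

def most_similar(target_vec, reference_vecs):
    """Index of the reference vector with the largest Pearson coefficient."""
    scores = [pearson(target_vec, ref) for ref in reference_vecs]
    return int(np.argmax(scores))

def fit_least_squares(X, y):
    """Fit y ~ X @ a + b by ordinary least squares; returns (a, b)."""
    X = np.asarray(X, float)
    design = np.column_stack([X, np.ones(len(X))])  # last column: intercept
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, float), rcond=None)
    return coef[:-1], float(coef[-1])

def predict(X, a, b):
    """Predicted methylation levels for rows of most-similar-site levels."""
    return np.asarray(X, float) @ a + b

# Toy training data: each row holds the methylation levels of the three
# most similar reference sites (X1, X2, X3); y is the target site's level.
X_train = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.8], [0.9, 0.7, 0.2],
           [0.5, 0.5, 0.5], [0.3, 0.9, 0.6]]
y_train = [0.5 * x1 + 0.3 * x2 + 0.1 * x3 + 0.05 for x1, x2, x3 in X_train]
a, b = fit_least_squares(X_train, y_train)
print(predict([[0.2, 0.4, 0.6]], a, b))
```

With M = 3 similarity methods, each row of `X_train` corresponds to $(X_1, X_2, X_3)$ in claim 6, and the fitted `a` and `b` play the roles of $(a_{41}, a_{42}, a_{43})$ and $b_4$.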
CN201910289075.9A 2019-04-11 2019-04-11 DNA methylation expansion method Expired - Fee Related CN110060736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910289075.9A CN110060736B (en) 2019-04-11 2019-04-11 DNA methylation expansion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910289075.9A CN110060736B (en) 2019-04-11 2019-04-11 DNA methylation expansion method

Publications (2)

Publication Number Publication Date
CN110060736A true CN110060736A (en) 2019-07-26
CN110060736B CN110060736B (en) 2022-11-22

Family

ID=67317686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910289075.9A Expired - Fee Related CN110060736B (en) 2019-04-11 2019-04-11 DNA methylation expansion method

Country Status (1)

Country Link
CN (1) CN110060736B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021088653A1 (en) * 2019-11-08 2021-05-14 中国科学院北京基因组研究所(国家生物信息中心) Method and device for classification of urine sediment genomic dna, and use of urine sediment genomic dna

Also Published As

Publication number Publication date
CN110060736B (en) 2022-11-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20221122