Summary of the invention
The technical problem to be solved by the present invention is to provide a method for detecting tissue-specific differentially methylated regions (tDMRs) that performs tDMR detection on a whole-genome basis with high accuracy.
The invention provides a method for detecting tissue-specific differentially methylated regions (tDMRs), applied to purposes other than disease diagnosis, comprising:
Obtaining single-base methylation information across the whole genome by whole-genome sequencing;
Determining seed tDMRs on the whole genome from the single-base methylation information, based on preselection conditions;
Extending each seed tDMR in both directions and, based on extension termination conditions, obtaining candidate tDMRs;
Filtering the candidate tDMRs based on filtering conditions to obtain the tDMR result;
Wherein determining seed tDMRs on the whole genome from the single-base methylation information based on preselection conditions comprises: scanning the single-base methylation information across the whole genome with a sliding window and determining seed tDMRs on the whole genome based on the preselection conditions;
Wherein the preselection conditions comprise:
(1) the p-value of the chi-square test (combined with Fisher's exact test) is <= 0.05;
(2) the methylation levels differ by at least two-fold; and
(3) the methylation level of at least one sample is 20% or more;
The extension termination conditions comprise:
(1) the distance between two consecutive CpGs exceeds 200 bp;
(2) the average methylation levels of the two samples differ by less than two-fold;
(3) the methylation levels of both samples in the region are below 20%;
(4) the p-value of the chi-square test is > 0.01;
The filtering conditions comprise:
(1) FDR <= 0.05;
(2) the average coverage of the resulting tDMR region is greater than 20 reads;
(3) the coverage of each individual CG site obtained is greater than 10 reads;
(4) the accuracy of a sampling-with-replacement test performed on the CpG sites within the resulting tDMR is above 95%.
According to one embodiment of the detection method of the present invention, the single-base methylation information across the whole genome is scanned with a sliding window that is 5 CpGs long and advances in steps of 1 CpG.
According to one embodiment of the detection method of the present invention, obtaining single-base methylation information across the whole genome by whole-genome sequencing comprises: treating genomic DNA with bisulfite so that unmethylated cytosines are deaminated and converted to uracil while methylated cytosines remain unchanged; and sequencing the treated whole genome and comparing it with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
The detection method provided by the invention obtains single-base methylation information across the whole genome of each sample by whole-genome sequencing and, on that basis, analyzes two sequenced samples so that tDMRs can be extracted genome-wide; the steps of seed tDMR selection, seed tDMR extension, and candidate tDMR filtering improve the accuracy of detection.
A further technical problem to be solved by the present invention is to provide a system for detecting tissue-specific differentially methylated regions that performs tDMR detection on a whole-genome basis with high accuracy.
The invention provides a system for detecting tissue-specific differentially methylated regions (tDMRs), comprising:
A methylation information acquisition module, configured to obtain single-base methylation information across the whole genome by whole-genome sequencing;
A seed region determination module, configured to scan the single-base methylation information across the whole genome with a sliding window and to determine seed tDMRs on the whole genome based on preselection conditions;
A seed region extension module, configured to extend each seed tDMR in both directions and, based on extension termination conditions, to obtain candidate tDMRs;
A candidate region filtering module, configured to filter the candidate tDMRs based on filtering conditions to obtain the tDMR result;
Wherein the preselection conditions comprise: (1) the p-value of the chi-square test (combined with Fisher's exact test) is <= 0.05; (2) the methylation levels differ by at least two-fold; (3) the methylation level of at least one sample is 20% or more. The extension termination conditions comprise: (1) the distance between two consecutive CpGs exceeds 200 bp; (2) the average methylation levels of the two samples differ by less than two-fold; (3) the methylation levels of both samples in the region are below 20%; (4) the p-value of the chi-square test is > 0.01. The filtering conditions comprise: (1) FDR <= 0.05; (2) the average coverage of the resulting tDMR region is greater than 20 reads; (3) the coverage of each individual CG site obtained is greater than 10 reads; (4) the accuracy of a sampling-with-replacement test performed on the CpG sites within the resulting tDMR is above 95%.
According to one embodiment of the detection system of the present invention, the seed region determination module scans the single-base methylation information across the whole genome with a sliding window and determines seed tDMRs on the whole genome based on the preselection conditions, wherein the sliding window is 5 CpGs long and advances in steps of 1 CpG.
According to one embodiment of the detection system of the present invention, the methylation information acquisition module comprises: a sodium bisulfite treatment device, configured to treat whole-genome DNA with bisulfite so that unmethylated cytosines are deaminated and converted to uracil while methylated cytosines remain unchanged; and a whole-genome alignment device, configured to sequence the treated whole genome and compare it with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
In the detection system provided by the invention, the methylation information acquisition module obtains single-base methylation information across the whole genome of each sample by whole-genome sequencing, and the subsequent modules analyze two sequenced samples on that basis, so that tDMRs can be extracted genome-wide; seed tDMR selection by the seed region determination module, seed tDMR extension by the seed region extension module, and candidate tDMR filtering by the candidate region filtering module improve the accuracy of detection.
Embodiment
The present invention is described more fully below with reference to the accompanying drawings, in which exemplary embodiments of the invention are illustrated. In the drawings, identical reference numerals denote identical or similar components or elements.
Fig. 1 is a flowchart of one embodiment of the tissue-specific differentially methylated region detection method of the present invention.
As shown in Fig. 1, in step 102, single-base methylation information across the whole genome of a sample is obtained by whole-genome sequencing. For example, on the basis of second-generation high-throughput whole-genome sequencing, the single-base methylation information of the sample across the whole genome is obtained by bisulfite sequencing (Bisulfite-sequencing) (reference (2)). After step 102, steps 104 to 108 below are used to extract the differentially methylated regions of two samples.
In step 104, seed tDMRs on the whole genomes of the two samples are determined from the single-base methylation information of the two samples, based on preselection conditions.
In step 106, each seed tDMR is extended in both directions and, based on extension termination conditions, candidate tDMRs are obtained.
In step 108, the candidate tDMRs are filtered based on filtering conditions to obtain the final tDMR result.
Existing chip-based tDMR detection suffers from complicated experimental procedures, a low success rate, and high cost. The embodiment above obtains single-base methylation information across the whole genome of each sample by whole-genome sequencing and, on that basis, analyzes two sequenced samples, so that tDMRs can be found genome-wide and extracted from the whole genome simply and quickly, which greatly improves detection efficiency and reduces cost. In addition, the steps of seed tDMR selection, seed tDMR extension, and candidate tDMR filtering improve the accuracy and sensitivity of detection.
Fig. 2 is a flowchart of another embodiment of the tissue-specific differentially methylated region detection method of the present invention.
As shown in Fig. 2, in step 202, genomic DNA is treated with bisulfite so that unmethylated cytosines are deaminated and converted to uracil while methylated cytosines remain unchanged.
In step 204, the treated whole genome is sequenced and compared with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
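The comparison performed in steps 202 and 204 can be illustrated with a minimal Python sketch. It assumes a single forward-strand read that is already aligned to the untreated reference without mismatches, and relies on the fact that uracil is read as T after amplification and sequencing. The function names `bisulfite_convert` and `call_cpg_methylation` are hypothetical and are not part of the software described in this document.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment of one strand as seen after sequencing.

    Unmethylated C is deaminated to U and read as T; methylated C stays C.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference, read, offset=0):
    """Compare an aligned bisulfite read with the untreated reference.

    Returns {reference position of the C in a CpG: True if methylated}.
    `offset` is the reference position at which the read starts.
    """
    calls = {}
    for i, base in enumerate(read):
        pos = offset + i
        is_cpg = reference[pos] == "C" and reference[pos + 1:pos + 2] == "G"
        if not is_cpg:
            continue
        if base == "C":      # C survived treatment: methylated
            calls[pos] = True
        elif base == "T":    # C was converted: unmethylated
            calls[pos] = False
    return calls


reference = "ACGTCGACCG"                 # CpG cytosines at positions 1, 4, 8
read = bisulfite_convert(reference, methylated_positions={1, 8})
print(read)                              # ACGTTGATCG
print(call_cpg_methylation(reference, read))
```

In real data many reads cover each site, so a per-site methylation level is obtained by counting methylated and unmethylated calls rather than from a single read.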
In step 206, the single-base methylation information across the whole genome is scanned with a sliding window, and seed tDMRs on the whole genome are determined based on preselection conditions.
The sliding window is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions comprise:
(1) the p-value of the chi-square test (combined with Fisher's exact test) is <= 0.05;
When p <= 0.05, methylation in the region can be considered to differ significantly between the two samples being compared.
(2) the methylation levels differ by at least two-fold; and
(3) the methylation level of at least one sample is 20% or more; that is, the methylation rate of at least one sample in a detected tDMR must reach 20%, so that the detected region is biologically meaningful.
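The seed selection of step 206 can be sketched in Python as follows. This is an illustration under stated assumptions, not the described software: each CpG is a hypothetical tuple `(position, methylated reads in sample 1, total reads in sample 1, methylated reads in sample 2, total reads in sample 2)`, the methylation level of a window is taken as pooled methylated reads over pooled total reads, and "combined with Fisher's exact test" is interpreted as falling back to Fisher's exact test when the smallest expected cell count is below 5. The source does not specify how the two tests are combined.

```python
import math


def chi2_p(a, b, c, d):
    """p-value of the 2x2 chi-square test (1 degree of freedom, no correction)."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if 0 in (r1, r2, c1, c2):
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.


def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact test on the table [[a, b], [c, d]]."""
    r1, c1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return math.comb(c1, x) * math.comb(n - c1, r1 - x) / math.comb(n, r1)

    p_obs = prob(a)
    lo, hi = max(0, r1 - (n - c1)), min(r1, c1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def window_p(m1, u1, m2, u2):
    """Assumed combination rule: Fisher when an expected count is below 5."""
    n = m1 + u1 + m2 + u2
    if n == 0:
        return 1.0
    expected_min = min(m1 + u1, m2 + u2) * min(m1 + m2, u1 + u2) / n
    return fisher_p(m1, u1, m2, u2) if expected_min < 5 else chi2_p(m1, u1, m2, u2)


def find_seeds(cpgs, window=5, step=1, p_max=0.05, fold=2.0, min_level=0.20):
    """Return (first index, last index) of every window meeting all three conditions."""
    seeds = []
    for start in range(0, len(cpgs) - window + 1, step):
        win = cpgs[start:start + window]
        m1, t1 = sum(c[1] for c in win), sum(c[2] for c in win)
        m2, t2 = sum(c[3] for c in win), sum(c[4] for c in win)
        if t1 == 0 or t2 == 0:
            continue
        high, low = max(m1 / t1, m2 / t2), min(m1 / t1, m2 / t2)
        if (window_p(m1, t1 - m1, m2, t2 - m2) <= p_max   # condition (1)
                and high >= fold * low                     # condition (2)
                and high >= min_level):                    # condition (3)
            seeds.append((start, start + window - 1))
    return seeds


# Six CpGs, 50 bp apart: sample 1 is 90% methylated, sample 2 is 10% methylated.
differential = [(100 + 50 * i, 18, 20, 2, 20) for i in range(6)]
print(find_seeds(differential))  # [(0, 4), (1, 5)]
```

Overlapping windows that all pass, as in the usage example, would in practice be merged or extended in the next step.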
In step 208, each seed tDMR is extended in both directions to obtain a candidate tDMR. The extension termination conditions are:
(1) the distance between two consecutive CpGs exceeds 200 bp;
If two consecutive CpGs are too far apart, the correlation between them is weak, so extension stops in this case in order to keep the detection result as reliable as possible.
(2) the average methylation levels of the two samples differ by less than two-fold;
(3) the methylation levels of both samples in the region are below 20%;
(4) the p-value of the chi-square test is > 0.01.
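The two-way extension of step 208 can be sketched in Python as follows. The sketch makes several assumptions that the source does not state: CpGs are hypothetical tuples `(position, methylated reads in sample 1, total reads in sample 1, methylated reads in sample 2, total reads in sample 2)` sorted by position, conditions (2) to (4) are evaluated on the pooled reads of the region that would result from adding the next CpG, and a plain chi-square test is used for condition (4).

```python
import math

MAX_GAP = 200      # bp, termination condition (1)
FOLD = 2.0         # termination condition (2)
MIN_LEVEL = 0.20   # termination condition (3)
P_STOP = 0.01      # termination condition (4)


def chi2_p(a, b, c, d):
    """p-value of the 2x2 chi-square test (1 degree of freedom)."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0
    return math.erfc(math.sqrt(n * (a * d - b * c) ** 2 / margins / 2))


def terminates(cpgs, lo, hi, gap):
    """True if any termination condition holds for the region cpgs[lo..hi],
    reached by adding a CpG that lies `gap` bp beyond the current edge."""
    if gap > MAX_GAP:
        return True
    region = cpgs[lo:hi + 1]
    m1, t1 = sum(c[1] for c in region), sum(c[2] for c in region)
    m2, t2 = sum(c[3] for c in region), sum(c[4] for c in region)
    if t1 == 0 or t2 == 0:
        return True
    high, low = max(m1 / t1, m2 / t2), min(m1 / t1, m2 / t2)
    if high < FOLD * low or high < MIN_LEVEL:
        return True
    return chi2_p(m1, t1 - m1, m2, t2 - m2) > P_STOP


def extend_seed(cpgs, start, end):
    """Extend the seed cpgs[start..end] to the right, then to the left."""
    lo, hi = start, end
    while hi + 1 < len(cpgs) and not terminates(
            cpgs, lo, hi + 1, cpgs[hi + 1][0] - cpgs[hi][0]):
        hi += 1
    while lo > 0 and not terminates(
            cpgs, lo - 1, hi, cpgs[lo][0] - cpgs[lo - 1][0]):
        lo -= 1
    return lo, hi


# Seven differential CpGs 50 bp apart, then one CpG 500 bp further away.
cpgs = [(100 + 50 * i, 18, 20, 2, 20) for i in range(7)] + [(900, 18, 20, 2, 20)]
print(extend_seed(cpgs, 1, 5))  # (0, 6): the distant CpG is not absorbed
```

Evaluating the conditions on the pooled region makes the extension tolerant of a single discordant CpG; evaluating them on the newly added CpG alone would be a stricter alternative.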
In step 210, the candidate tDMRs are filtered based on filtering conditions, which comprise:
(1) FDR (false discovery rate) <= 0.05;
(2) the average coverage of the resulting tDMR region is greater than 20 reads (a read is a DNA sequence of a certain length obtained by new-generation sequencing);
(3) the coverage of each individual CpG site obtained is greater than 10 reads;
(4) a sampling-with-replacement test is performed on the CpG sites within the resulting tDMR (that is, any one site is drawn from all the CpG sites and tested, and after the test it is returned to the population so that it can be drawn again), and the accuracy of the result must be above 95%.
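The four filtering conditions of step 210 can be sketched in Python as follows. The source does not name the FDR procedure or define the accuracy of the sampling-with-replacement test, so this sketch assumes the Benjamini-Hochberg adjustment for FDR and takes the accuracy to be the fraction of resamples in which the region is still called differential. The per-resample check `two_fold` is a simplified placeholder for the statistical test, coverage is required in both samples, and all names and the tuple layout `(position, methylated 1, total 1, methylated 2, total 2)` are hypothetical.

```python
import random


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * n / rank)
        adjusted[i] = running
    return adjusted


def coverage_ok(cpgs, min_mean=20, min_site=10):
    """Conditions (2) and (3), checked on the total reads of each sample."""
    for col in (2, 4):
        depths = [c[col] for c in cpgs]
        if sum(depths) / len(depths) <= min_mean or min(depths) <= min_site:
            return False
    return True


def two_fold(cpgs):
    """Placeholder per-resample check: two-fold difference and 20% level."""
    m1, t1 = sum(c[1] for c in cpgs), sum(c[2] for c in cpgs)
    m2, t2 = sum(c[3] for c in cpgs), sum(c[4] for c in cpgs)
    if t1 == 0 or t2 == 0:
        return False
    high, low = max(m1 / t1, m2 / t2), min(m1 / t1, m2 / t2)
    return high >= 2 * low and high >= 0.20


def bootstrap_accuracy(cpgs, is_differential, n_resamples=1000, seed=0):
    """Condition (4): resample CpG sites with replacement and re-test."""
    rng = random.Random(seed)
    hits = sum(is_differential([rng.choice(cpgs) for _ in cpgs])
               for _ in range(n_resamples))
    return hits / n_resamples


def filter_regions(regions, fdr_max=0.05, min_accuracy=0.95):
    """Keep regions that pass all four filtering conditions."""
    q_values = bh_adjust([r["p"] for r in regions])
    return [r for r, q in zip(regions, q_values)
            if q <= fdr_max
            and coverage_ok(r["cpgs"])
            and bootstrap_accuracy(r["cpgs"], two_fold) > min_accuracy]


good = {"name": "good", "p": 1e-6, "cpgs": [(100 + 50 * i, 27, 30, 3, 30) for i in range(6)]}
shallow = {"name": "shallow", "p": 1e-6, "cpgs": [(100 + 50 * i, 7, 8, 1, 8) for i in range(6)]}
weak = {"name": "weak", "p": 0.5, "cpgs": [(100 + 50 * i, 27, 30, 3, 30) for i in range(6)]}
print([r["name"] for r in filter_regions([good, shallow, weak])])  # ['good']
```

In the usage example the second region fails the coverage conditions and the third fails the FDR condition.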
In step 212, the final tDMR result is obtained from the filtering.
It should be noted that although the embodiments above detect CG sites, those skilled in the art will understand that the method of the invention applies equally to CHH and CHG sites, where H denotes any one of A, C, and T.
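The three cytosine contexts mentioned above can be told apart from the two bases that follow a cytosine. The following Python sketch shows this for the forward strand only; the function name `cytosine_context` is hypothetical and not taken from the described software.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index i of a forward-strand sequence.

    Returns "CG", "CHG", or "CHH" (H is A, C, or T), or None when the
    sequence ends before the context can be decided.
    """
    if seq[i] != "C":
        raise ValueError("position %d is not a cytosine" % i)
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CG"
    if len(following) == 2 and following[0] in "ACT":
        return "CHG" if following[1] == "G" else (
            "CHH" if following[1] in "ACT" else None)
    return None


seq = "CGCAGCTTC"
for i, base in enumerate(seq):
    if base == "C":
        print(i, cytosine_context(seq, i))
# 0 CG
# 2 CHG
# 5 CHH
# 8 None
```

Cytosines on the reverse strand would be classified the same way after taking the reverse complement of the sequence.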
In the embodiments above, the specific preselection conditions, extension termination conditions, and filtering conditions for tDMRs were determined through extensive experimental study and creative work, and they yield high accuracy. Subsequent validation shows that the accuracy of the tDMRs found by the method above is more than 85%.
Fig. 3 is a block diagram of one embodiment of the tissue-specific differentially methylated region detection system of the present invention. As shown in Fig. 3, in this embodiment the detection system comprises a methylation information acquisition module 31, a seed region determination module 32, a seed region extension module 33, and a candidate region filtering module 34. The methylation information acquisition module 31 obtains single-base methylation information across the whole genome by whole-genome sequencing; the seed region determination module 32 determines seed tDMRs on the whole genome from the single-base methylation information based on preselection conditions; the seed region extension module 33 extends each seed tDMR in both directions and, based on extension termination conditions, obtains candidate tDMRs; the candidate region filtering module 34 filters the candidate tDMRs based on filtering conditions to obtain the tDMR result.
In the embodiment above, the methylation information acquisition module obtains single-base methylation information across the whole genome of each sample by whole-genome sequencing, and the subsequent modules analyze two sequenced samples on that basis, so that tDMRs can be found genome-wide and extracted from the whole genome simply and quickly, which greatly improves detection efficiency and reduces cost. Seed tDMR selection by the seed region determination module, seed tDMR extension by the seed region extension module, and candidate tDMR filtering by the candidate region filtering module improve the accuracy and sensitivity of detection.
In one embodiment, the seed region determination module 32 scans the single-base methylation information across the whole genome with a sliding window and determines seed tDMRs on the whole genome based on preselection conditions, wherein the sliding window is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions comprise: (1) the p-value of the chi-square test (combined with Fisher's exact test) is <= 0.05; (2) the methylation levels differ by at least two-fold; (3) the methylation level of at least one sample is 20% or more. According to one embodiment of the detection system of the present invention, the extension termination conditions comprise: the distance between two consecutive CpGs exceeds 200 bp; the average methylation levels of the two samples differ by less than two-fold; the methylation levels of both samples in the region are below 20%; the p-value of the chi-square test is > 0.01. According to one embodiment of the detection system of the present invention, the filtering conditions comprise: FDR <= 0.05; the average coverage of the resulting tDMR region is greater than 20 reads; the coverage of each individual CG site obtained is greater than 10 reads; the accuracy of a sampling-with-replacement test performed on the CpG sites within the resulting tDMR is above 95%.
In the embodiment above, the specific preselection conditions, extension termination conditions, and filtering conditions for tDMRs were determined through extensive experimental study and creative work, and they yield high accuracy. Subsequent validation shows that the accuracy of the tDMRs found in this way is more than 85%.
Fig. 4 is a block diagram of another embodiment of the tissue-specific differentially methylated region detection system of the present invention. Modules in Fig. 4 that bear the same reference numerals as in Fig. 3 are described in the corresponding passages for Fig. 3 and, for brevity, are not described again here. Compared with Fig. 3, the methylation information acquisition module 41 in Fig. 4 comprises a sodium bisulfite treatment device 411 and a whole-genome alignment device 412. The sodium bisulfite treatment device 411 treats whole-genome DNA with bisulfite so that unmethylated cytosines are deaminated and converted to uracil while methylated cytosines remain unchanged; the whole-genome alignment device 412 sequences the treated whole genome and compares it with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
In the embodiment above, the sodium bisulfite treatment device uses bisulfite to deaminate the unmethylated cytosines in the DNA and convert them to uracil, while methylated cytosines remain unchanged; the whole-genome alignment device sequences the treated whole genome and compares it with the untreated sequence to judge whether each CpG site is methylated. The methylation state of every CpG site on the genome can thus be established with high reliability and accuracy.
Validation in multiple species (not limited to mammals) shows that the method and system of the present invention have higher accuracy and sensitivity than the method of Vardhman Rakyan et al., and are highly reliable.
An application example of the present invention is described below.
The sample data used in this application example are the fasta data of the fibroblast line imr90 and of YH No. 1. The fasta data of imr90 and YH No. 1 can be downloaded from the following addresses, respectively:
imr90:http://neomorph.salk.edu/human_methylome/data.html
YH:http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ADDF
In this application example, several of the processing steps of the technical scheme are implemented in software. The software can run under a Unix/Linux operating system and is invoked from the Unix/Linux command line. The command-line parameters used to run the software are given in the description below.
Data preparation is performed first. After download, the data are processed by alignment, duplicate removal, extraction of methylation information, and similar steps to obtain the cout files of YH and imr90 (a cout file records the methylation status of cytosine C sites). These two cout files are the sample input files for tDMR extraction. It should be noted that the method of the present invention can be used for any species for which cout files can be obtained and is not limited to what is listed in the embodiments, so its range of application is very wide.
The format of the cout file is as follows:
Table 1
Step 1: selecting seed tDMRs
The single-base methylation information across the whole genome contained in the cout files of the two samples is scanned with a sliding window that is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions are: (1) the p-value of the chi-square test (combined with Fisher's exact test) is <= 0.05; (2) the methylation levels differ by at least two-fold; (3) the methylation level of at least one sample is 20% or more.
The command line is:
./tdmr slide -c CG YH.cout/chr$i.cout imr90.cout/chr$i.cout > outfile/CG/chr$i.CG; echo slide done;
Parameter -c specifies the context of the cytosines to be searched. Three options are available: CG, CHH, and CHG. CpG sites are studied here, so CG is selected.
Output: 916949 tDMR seeds are obtained in total.
The format of the output file is as follows:
Table 2
In the step above, tDMR seeds can be searched for chromosome by chromosome in parallel across the whole genome, which reduces run time and improves efficiency.
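The per-chromosome parallelism described above can be sketched in Python as follows. The command string follows a reading of the step 1 command line in this example with its spaces restored; the chromosome names, the worker count, and the helper names `slide_command` and `run_all` are assumptions made for illustration. The usage line performs a dry run that only prints the commands, since the `tdmr` program itself is not available here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def slide_command(chrom, context="CG"):
    """Build the seed-selection command for one chromosome."""
    return (f"./tdmr slide -c {context} "
            f"YH.cout/chr{chrom}.cout imr90.cout/chr{chrom}.cout "
            f"> outfile/{context}/chr{chrom}.{context}")


def run_all(chroms, workers=4, runner=None):
    """Run one command per chromosome, up to `workers` at a time.

    `runner` receives each command string; by default it executes the
    command in a shell and raises if the command fails.
    """
    if runner is None:
        def runner(cmd):
            subprocess.run(cmd, shell=True, check=True)
    commands = [slide_command(c) for c in chroms]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(runner, commands))  # list() surfaces worker exceptions


# Dry run: print the commands instead of executing them.
run_all([1, 2, "X"], workers=2, runner=print)
```

Threads are sufficient here because each worker only waits for an external process; the chromosomes are independent, so no synchronization between workers is needed.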
Step 2: two-way extension of seed tDMRs
Each seed tDMR is extended in both directions to obtain a candidate tDMR. The extension termination conditions are: (1) the distance between two consecutive CpGs exceeds 200 bp; (2) the average methylation levels of the two samples differ by less than two-fold; (3) the methylation levels of both samples in the region are below 20%; (4) the p-value of the chi-square test is > 0.01.
The command line is:
./tdmr extend -t 8 -c CG YH.cout/chr$i.cout imr90.cout/chr$i.cout outfile/CG/chr$i.CG > outfile/CG/chr$i.CG.ext; echo extend done;
Parameter -t specifies the number of CPUs to use when the program runs.
Parameter -c specifies the cytosine context (three options are available: CG, CHH, and CHG; CG is chosen here).
Output: 279004 tDMRs in total after extension.
The first 10 columns of the output file format are as follows:
Table 3
The last 8 columns are as follows:
Table 4
Step 3: filtering candidate tDMRs
The candidate tDMRs are filtered based on filtering conditions, which comprise: (1) FDR <= 0.05; (2) the average coverage of the resulting tDMR region is greater than 20 reads; (3) the coverage of each individual CG site obtained is greater than 10 reads; (4) the accuracy of a sampling-with-replacement test performed on the CpG sites within the resulting tDMR is above 95%.
The tDMRs are filtered with the following Linux command:
sort -k2 -n outfile/CG/chr$i.CG.ext | awk '$16>0.95{a=$1;for(i=2;i<16;i++){a=a"\t"$i}{print a}}' | uniq > outfile/CG/chr$i.CG.ext.filter; echo filter done
Output: after filtering, 36924 tDMRs are finally obtained in total. The relationship between tDMRs and gene expression was then validated: all genes that fall within a tDMR were identified, bioinformatic analysis was used to check whether the relationship between gene expression level and methylation rate agrees with the known relationship, and the results were tallied. The accuracy of the tDMRs found by this method can be more than 85%.
The output has 15 columns in total.
The first 8 columns of the output file format are as follows:
Table 5
The last 7 columns are as follows:
Table 6
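For readers less familiar with shell pipelines, the filtering command of step 3 (sort, awk, uniq) can be read as the following Python sketch. It assumes whitespace-separated input lines in which column 16 is numeric; that column 16 holds the accuracy of the sampling-with-replacement test is an inference from the threshold of 0.95, not something the source states. The function name `filter_lines` is hypothetical.

```python
def filter_lines(lines, score_col=16, min_score=0.95, keep_cols=15):
    """Approximate `sort -k2 -n | awk '$16>0.95{...}' | uniq` on text lines."""
    rows = [line.split() for line in lines if line.strip()]
    # sort -k2 -n: numeric order on the second column. Python's sort is
    # stable, so ties keep input order; sort(1) breaks ties by whole line.
    rows.sort(key=lambda r: float(r[1]))
    kept = []
    for row in rows:
        # awk: keep rows whose 16th field exceeds 0.95, print fields 1..15
        if len(row) >= score_col and float(row[score_col - 1]) > min_score:
            line = "\t".join(row[:keep_cols])
            # uniq: drop a line identical to the previous output line
            if not kept or kept[-1] != line:
                kept.append(line)
    return kept


def make_line(start, score):
    """Build a 16-column test line: chromosome, start, 13 fillers, score."""
    return "\t".join(["chr1", str(start)] + ["x"] * 13 + [str(score)])


lines = [make_line(200, 0.99), make_line(100, 0.90),
         make_line(150, 0.97), make_line(150, 0.97)]
for line in filter_lines(lines):
    print(line.split("\t")[:2])
# ['chr1', '150']
# ['chr1', '200']
```

The 16th column is used only for the decision and is not written out, which matches the 15 columns of the filtered output.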
In the application example above, the methylation state of every CpG site on the whole genome is determined, and on this basis the tDMR software developed can analyze two sequenced samples at the whole-genome level and find the differentially methylated regions between tissues simply and quickly, which greatly improves detection efficiency and reduces cost. The processing above can be run chromosome by chromosome in parallel across the whole genome, which further improves speed and efficiency.
The description of the present invention is given for purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments with various modifications suited to particular uses.