Summary of the invention
The technical problem to be solved by the present invention is to provide a method for detecting tissue-specific differentially methylated regions (tDMRs) that performs tDMR detection across the whole genome with high accuracy.
The invention provides a method for detecting tissue-specific differentially methylated regions (tDMRs), comprising:
obtaining single-site methylation information across the whole genome by whole-genome sequencing;
determining seed tDMRs on the whole genome from the whole-genome single-site methylation information, based on preselection conditions;
extending each seed tDMR in both directions and obtaining candidate tDMRs based on extension termination conditions; and
filtering the candidate tDMRs based on filter conditions to obtain the tDMR result.
According to one embodiment of the detection method of the present invention, determining seed tDMRs on the whole genome from the whole-genome single-site methylation information based on preselection conditions comprises: scanning the whole-genome single-site methylation information with a sliding window and determining seed tDMRs on the whole genome based on the preselection conditions.
According to one embodiment of the detection method of the present invention, the whole-genome single-site methylation information is scanned with a sliding window that is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions comprise:
(1) p <= 0.05 in the chi-square test (in combination with Fisher's exact test);
(2) at least a two-fold difference in methylation level; and
(3) a methylation level of 20% or more in at least one sample.
According to one embodiment of the detection method of the present invention, the extension termination conditions comprise:
(1) the distance between two consecutive CpGs exceeds 200 bp;
(2) the average methylation levels of the two samples differ by less than two-fold;
(3) the methylation level of the region is below 20% in both samples;
(4) p > 0.01 in the chi-square test.
The filter conditions comprise:
(1) FDR <= 0.05;
(2) the average coverage of the obtained tDMR region is greater than 20 reads;
(3) the coverage of each single CG site obtained is greater than 10 reads;
(4) the accuracy of a sampling-with-replacement test performed on the CpG sites within the obtained tDMR is 95% or more.
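Filter conditions (1) to (3) can be illustrated with a minimal Python sketch. The source gives only the FDR threshold and does not say how the FDR is computed; the Benjamini-Hochberg adjustment used here is an assumption, as are the function names and the representation of a region as a list of per-site read depths. Condition (4) is not covered by this sketch.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the source states only "FDR <= 0.05"; BH is one common
    way to obtain such a value, not necessarily the one used.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def passes_filter(site_coverage, fdr, max_fdr=0.05,
                  min_mean_cov=20, min_site_cov=10):
    """Apply filter conditions (1)-(3) to one candidate region.

    site_coverage -- hypothetical list of read depths, one per CpG site
    fdr           -- FDR value already computed for this region
    """
    if not site_coverage:
        return False
    mean_cov = sum(site_coverage) / len(site_coverage)
    return (fdr <= max_fdr                                    # condition (1)
            and mean_cov > min_mean_cov                       # condition (2)
            and all(c > min_site_cov for c in site_coverage)) # condition (3)
```

A candidate region would be kept only if `passes_filter` returns `True` for its coverage values and adjusted p-value.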
According to one embodiment of the detection method of the present invention, obtaining single-site methylation information across the whole genome by whole-genome sequencing comprises: treating the genomic DNA with bisulfite so that unmethylated cytosines are deaminated and converted into uracil while methylated cytosines remain unchanged; and sequencing the treated whole genome and comparing it with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
The detection method provided by the invention obtains single-site methylation information across the whole genome of each sample by whole-genome sequencing and, on that basis, analyzes two sequenced samples, so that tDMRs can be extracted genome-wide. The steps of seed tDMR selection, seed tDMR extension, and candidate tDMR filtering improve detection accuracy.
The technical problem to be solved by the present invention is also to provide a system for detecting tissue-specific differentially methylated regions that performs tDMR detection across the whole genome with high accuracy.
The invention provides a system for detecting tissue-specific differentially methylated regions (tDMRs), comprising:
a methylation information acquisition module, configured to obtain single-site methylation information across the whole genome by whole-genome sequencing;
a seed region determination module, configured to determine seed tDMRs on the whole genome from the whole-genome single-site methylation information, based on preselection conditions;
a seed region extension module, configured to extend each seed tDMR in both directions and obtain candidate tDMRs based on extension termination conditions; and
a candidate region filtering module, configured to filter the candidate tDMRs based on filter conditions to obtain the tDMR result.
According to one embodiment of the detection system of the present invention, the seed region determination module scans the whole-genome single-site methylation information with a sliding window and determines seed tDMRs on the whole genome based on preselection conditions, wherein the sliding window is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions comprise: (1) p <= 0.05 in the chi-square test (in combination with Fisher's exact test); (2) at least a two-fold difference in methylation level; (3) a methylation level of 20% or more in at least one sample.
According to one embodiment of the detection system of the present invention, the extension termination conditions comprise: (1) the distance between two consecutive CpGs exceeds 200 bp; (2) the average methylation levels of the two samples differ by less than two-fold; (3) the methylation level of the region is below 20% in both samples; (4) p > 0.01 in the chi-square test. The filter conditions comprise: (1) FDR <= 0.05; (2) the average coverage of the obtained tDMR region is greater than 20 reads; (3) the coverage of each single CG site obtained is greater than 10 reads; (4) the accuracy of a sampling-with-replacement test performed on the CpG sites within the obtained tDMR is 95% or more.
According to one embodiment of the detection system of the present invention, the methylation information acquisition module comprises: a sodium bisulfite treatment device, configured to treat the whole-genome DNA with bisulfite so that unmethylated cytosines are deaminated and converted into uracil while methylated cytosines remain unchanged; and a whole-genome comparison device, configured to sequence the treated whole genome and compare it with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
In the detection system provided by the invention, the methylation information acquisition module obtains single-site methylation information across the whole genome of each sample by whole-genome sequencing, and the subsequent modules analyze two sequenced samples on that basis, so that tDMRs can be extracted genome-wide. Seed tDMR selection by the seed region determination module, seed tDMR extension by the seed region extension module, and candidate tDMR filtering by the candidate region filtering module improve detection accuracy.
Embodiment
The present invention is described more fully below with reference to the accompanying drawings, in which exemplary embodiments of the invention are illustrated. In the drawings, identical reference numerals denote identical or similar components or elements.
Fig. 1 is a flowchart of one embodiment of the method of the present invention for detecting tissue-specific differentially methylated regions.
As shown in Fig. 1, in step 102, single-site methylation information across the whole genome of a sample is obtained by whole-genome sequencing. For example, on the basis of second-generation high-throughput whole-genome sequencing, the single-site methylation information of the sample across the whole genome is obtained by bisulfite sequencing (reference (2)). After the processing of step 102, steps 104 to 108 below are used to extract the differentially methylated regions of two samples.
In step 104, seed tDMRs on the whole genomes of the two samples are determined from the whole-genome single-site methylation information of the two samples, based on preselection conditions.
In step 106, each seed tDMR is extended in both directions, and candidate tDMRs are obtained based on extension termination conditions.
In step 108, the candidate tDMRs are filtered based on filter conditions to obtain the final tDMR result.
Existing chip-based techniques for detecting tDMRs suffer from complicated experimental procedures, a low success rate, and high cost. In contrast, the above embodiment obtains single-site methylation information across the whole genome of each sample by whole-genome sequencing and analyzes two sequenced samples on that basis, so that tDMRs can be sought genome-wide and extracted from the whole genome simply and quickly, which greatly improves detection efficiency and reduces cost. In addition, the steps of seed tDMR selection, seed tDMR extension, and candidate tDMR filtering improve the accuracy and sensitivity of detection.
Fig. 2 is a flowchart of another embodiment of the method of the present invention for detecting tissue-specific differentially methylated regions.
As shown in Fig. 2, in step 202, the genomic DNA is treated with bisulfite so that unmethylated cytosines are deaminated and converted into uracil while methylated cytosines remain unchanged.
In step 204, the treated whole genome is sequenced and compared with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
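The comparison in steps 202 and 204 can be sketched as follows. After bisulfite treatment and amplification, an unmethylated cytosine is read as T, while a methylated cytosine is still read as C, so at each reference CpG cytosine the C and T reads can be counted. This is a minimal Python sketch under simplifying assumptions that are not in the source: reads are already aligned without gaps to the forward strand, and the function names and data layout are hypothetical.

```python
from collections import defaultdict


def call_cpg_methylation(reference, aligned_reads):
    """Count methylated and unmethylated reads at forward-strand CpG cytosines.

    reference     -- untreated reference sequence (str)
    aligned_reads -- iterable of (start, read_sequence); 0-based, gapless,
                     forward strand only (simplifying assumptions)
    Returns {position: (methylated_reads, unmethylated_reads)}.
    """
    ref = reference.upper()
    counts = defaultdict(lambda: [0, 0])
    for start, read in aligned_reads:
        for offset, base in enumerate(read.upper()):
            pos = start + offset
            # Only cytosines followed by G (CpG context) are examined.
            if pos + 1 >= len(ref) or ref[pos:pos + 2] != "CG":
                continue
            if base == "C":      # not converted: methylated
                counts[pos][0] += 1
            elif base == "T":    # converted to U, read as T: unmethylated
                counts[pos][1] += 1
    return {pos: tuple(c) for pos, c in sorted(counts.items())}


def methylation_level(methylated, unmethylated):
    """Fraction of reads supporting methylation at one site."""
    total = methylated + unmethylated
    return methylated / total if total else 0.0
```

For a site covered by one C read and two T reads, `methylation_level(1, 2)` gives one third, which is the kind of per-site value the later steps operate on.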
In step 206, the whole-genome single-site methylation information is scanned with a sliding window, and seed tDMRs on the whole genome are determined based on preselection conditions.
The whole-genome single-site methylation information is scanned with a sliding window that is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions comprise:
(1) p <= 0.05 in the chi-square test (in combination with Fisher's exact test);
When p <= 0.05, the methylation in this region can be considered to differ significantly between the two samples.
(2) at least a two-fold difference in methylation level; and
(3) a methylation level of 20% or more in at least one sample; requiring the methylation rate of at least one sample in an identified tDMR region to be 20% or more ensures that the identified region is biologically meaningful.
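The sliding-window scan with the three preselection conditions can be sketched in Python as below. Several details are assumptions rather than statements of the source: read counts are pooled over the window before testing, the chi-square test is applied without continuity correction to the 2x2 table of methylated and unmethylated counts, and the combination with Fisher's exact test is omitted because the source does not say how the two tests are combined. All names are hypothetical.

```python
import math


def chi_square_p(a, b, c, d):
    """P-value of the chi-square test (1 degree of freedom) on [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def find_seeds(sites, window=5, step=1, max_p=0.05, fold=2.0, min_level=0.20):
    """Return (start, end) positions of windows that meet all three conditions.

    sites -- list of (position, meth1, unmeth1, meth2, unmeth2), one per CpG,
             sorted by position; counts are reads in sample 1 and sample 2.
    """
    seeds = []
    for i in range(0, len(sites) - window + 1, step):
        win = sites[i:i + window]
        m1 = sum(s[1] for s in win)
        u1 = sum(s[2] for s in win)
        m2 = sum(s[3] for s in win)
        u2 = sum(s[4] for s in win)
        if m1 + u1 == 0 or m2 + u2 == 0:
            continue  # no coverage in one sample
        level1, level2 = m1 / (m1 + u1), m2 / (m2 + u2)
        high, low = max(level1, level2), min(level1, level2)
        if (chi_square_p(m1, u1, m2, u2) <= max_p   # condition (1)
                and high >= fold * low              # condition (2)
                and high >= min_level):             # condition (3)
            seeds.append((win[0][0], win[-1][0]))
    return seeds
```

Overlapping windows that pass are reported separately here; merging them into a single seed would be a further design choice that the source does not describe.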
In step 208, each seed tDMR is extended in both directions to obtain candidate tDMRs. The extension termination conditions are:
(1) the distance between two consecutive CpGs exceeds 200 bp;
If two consecutive CpGs are too far apart, the association between them is weak, so extension stops when this occurs, which helps ensure the reliability of the detection result.
(2) the average methylation levels of the two samples differ by less than two-fold;
(3) the methylation level of the region is below 20% in both samples;
(4) p > 0.01 in the chi-square test.
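A minimal Python sketch of the bidirectional extension with the four termination conditions follows. The source lists the conditions but not the mechanics, so the following are assumptions: the region grows one CpG at a time, statistics are recomputed on counts pooled over the whole extended region, and the right side is extended before the left. The names and the data layout are hypothetical.

```python
import math


def chi_square_p(a, b, c, d):
    """P-value of the chi-square test (1 degree of freedom) on [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    return math.erfc(math.sqrt(n * (a * d - b * c) ** 2 / denom / 2.0))


def extend_seed(sites, lo, hi, max_gap=200, fold=2.0, min_level=0.20, stop_p=0.01):
    """Extend the seed sites[lo..hi] (inclusive indices) in both directions.

    sites -- list of (position, meth1, unmeth1, meth2, unmeth2) sorted by position.
    Returns the (lo, hi) indices of the extended region.
    """
    def acceptable(new_lo, new_hi, edge_gap):
        if edge_gap > max_gap:                       # termination condition (1)
            return False
        region = sites[new_lo:new_hi + 1]
        m1 = sum(s[1] for s in region)
        u1 = sum(s[2] for s in region)
        m2 = sum(s[3] for s in region)
        u2 = sum(s[4] for s in region)
        if m1 + u1 == 0 or m2 + u2 == 0:
            return False
        level1, level2 = m1 / (m1 + u1), m2 / (m2 + u2)
        high, low = max(level1, level2), min(level1, level2)
        if high < fold * low:                        # termination condition (2)
            return False
        if high < min_level:                         # termination condition (3)
            return False
        return chi_square_p(m1, u1, m2, u2) <= stop_p  # termination condition (4)

    # Extend to the right, then to the left (the order is an assumption).
    while hi + 1 < len(sites) and acceptable(lo, hi + 1, sites[hi + 1][0] - sites[hi][0]):
        hi += 1
    while lo - 1 >= 0 and acceptable(lo - 1, hi, sites[lo][0] - sites[lo - 1][0]):
        lo -= 1
    return lo, hi
```

The returned index pair identifies one candidate tDMR, which would then go on to the filtering step.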
In step 210, the candidate tDMRs are filtered based on filter conditions. The filter conditions comprise:
(1) FDR (false discovery rate) <= 0.05;
(2) the average coverage of the obtained tDMR region is greater than 20 reads (a read is a DNA sequence of a certain read length obtained by next-generation sequencing);
(3) the coverage of each single CpG site obtained is greater than 10 reads;
(4) a sampling-with-replacement test is performed on the CpG sites within the obtained tDMR (that is, one site is drawn at random from all the CpG sites and tested, and after the test it is returned to the population so that it can take part in the next draw); the accuracy of the result must be 95% or more.
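Condition (4) can be sketched in Python as below. The source describes the draw-test-return procedure but does not say what test is applied to each drawn site. This sketch assumes, purely for illustration, that a drawn site "passes" when its methylation difference between the two samples points in the same direction as the difference pooled over the whole region. The function name, the number of draws, and the fixed random seed are likewise assumptions.

```python
import random


def resampling_accuracy(sites, draws=1000, seed=0):
    """Fraction of draws (with replacement) that agree with the region-level call.

    sites -- list of (position, meth1, unmeth1, meth2, unmeth2) for the CpG
             sites inside one tDMR.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    m1 = sum(s[1] for s in sites)
    u1 = sum(s[2] for s in sites)
    m2 = sum(s[3] for s in sites)
    u2 = sum(s[4] for s in sites)
    if m1 + u1 == 0 or m2 + u2 == 0:
        return 0.0
    region_higher_in_1 = m1 / (m1 + u1) > m2 / (m2 + u2)

    hits = 0
    for _ in range(draws):
        _, a, b, c, d = rng.choice(sites)   # draw one site, then "put it back"
        if a + b == 0 or c + d == 0:
            continue                        # uncovered site counts as a miss
        if (a / (a + b) > c / (c + d)) == region_higher_in_1:
            hits += 1
    return hits / draws
```

Under this reading, a tDMR would pass condition (4) when `resampling_accuracy(sites) >= 0.95`.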
In step 212, the final tDMR result is obtained through the filtering.
It should be noted that although the above embodiment detects CG sites, those skilled in the art will understand that the method of the present invention is equally applicable to CHH and CHG sites, where H denotes any one of A, C, and T.
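The three cytosine contexts can be told apart from the two bases that follow the cytosine. The following minimal Python sketch classifies a forward-strand cytosine as CG, CHG, or CHH; the function name is hypothetical, and reverse-strand cytosines are not handled.

```python
def cytosine_context(seq, i):
    """Return "CG", "CHG", or "CHH" for the cytosine at index i, else None.

    H stands for A, C, or T. None is returned when seq[i] is not a cytosine,
    when too few downstream bases are available, or when an ambiguous base
    such as N prevents classification.
    """
    seq = seq.upper()
    if i < 0 or i >= len(seq) or seq[i] != "C":
        return None
    nxt = seq[i + 1:i + 3]
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CG"
    if len(nxt) < 2 or nxt[0] not in "ACT" or nxt[1] not in "ACGT":
        return None
    return "CHG" if nxt[1] == "G" else "CHH"
```

For example, in the sequence `CCG` the first cytosine is in a CHG context and the second is in a CG context.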
In the above embodiment, the preselection conditions, extension termination conditions, and filter conditions for tDMRs were determined through extensive experimental study and creative work, and they give high accuracy. Subsequent validation shows that the accuracy of the tDMRs found by the above method is 85% or more.
Fig. 3 is a block diagram of one embodiment of the system of the present invention for detecting tissue-specific differentially methylated regions. As shown in Fig. 3, the detection system in this embodiment comprises a methylation information acquisition module 31, a seed region determination module 32, a seed region extension module 33, and a candidate region filtering module 34. The methylation information acquisition module 31 obtains single-site methylation information across the whole genome by whole-genome sequencing; the seed region determination module 32 determines seed tDMRs on the whole genome from the whole-genome single-site methylation information, based on preselection conditions; the seed region extension module 33 extends each seed tDMR in both directions and obtains candidate tDMRs based on extension termination conditions; and the candidate region filtering module 34 filters the candidate tDMRs based on filter conditions to obtain the tDMR result.
In the above embodiment, the methylation information acquisition module obtains single-site methylation information across the whole genome of each sample by whole-genome sequencing, and the subsequent modules analyze two sequenced samples on that basis, so that tDMRs can be sought genome-wide and extracted from the whole genome simply and quickly, which greatly improves detection efficiency and reduces cost. Seed tDMR selection by the seed region determination module, seed tDMR extension by the seed region extension module, and candidate tDMR filtering by the candidate region filtering module improve the accuracy and sensitivity of detection.
In one embodiment, the seed region determination module 32 scans the whole-genome single-site methylation information with a sliding window and determines seed tDMRs on the whole genome based on preselection conditions, wherein the sliding window is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions comprise: (1) p <= 0.05 in the chi-square test (in combination with Fisher's exact test); (2) at least a two-fold difference in methylation level; (3) a methylation level of 20% or more in at least one sample. According to one embodiment of the detection system of the present invention, the extension termination conditions comprise: the distance between two consecutive CpGs exceeds 200 bp; the average methylation levels of the two samples differ by less than two-fold; the methylation level of the region is below 20% in both samples; p > 0.01 in the chi-square test. According to one embodiment of the detection system of the present invention, the filter conditions comprise: FDR <= 0.05; the average coverage of the obtained tDMR region is greater than 20 reads; the coverage of each single CG site obtained is greater than 10 reads; the accuracy of a sampling-with-replacement test performed on the CpG sites within the obtained tDMR is 95% or more.
In the above embodiment, the preselection conditions, extension termination conditions, and filter conditions for tDMRs were determined through extensive experimental study and creative work, and they give high accuracy. Subsequent validation shows that the accuracy of the tDMRs found by the above method is 85% or more.
Fig. 4 is a block diagram of another embodiment of the system of the present invention for detecting tissue-specific differentially methylated regions. For modules in Fig. 4 that bear the same reference numerals as in Fig. 3, refer to the corresponding description of Fig. 3; for brevity, they are not described in detail again here. Compared with Fig. 3, the methylation information acquisition module 41 in Fig. 4 comprises a sodium bisulfite treatment device 411 and a whole-genome comparison device 412. The sodium bisulfite treatment device 411 treats the whole-genome DNA with bisulfite so that unmethylated cytosines are deaminated and converted into uracil while methylated cytosines remain unchanged; the whole-genome comparison device 412 sequences the treated whole genome and compares it with the untreated whole-genome sequence to determine the methylated CpG sites on the whole genome.
In the above embodiment, the sodium bisulfite treatment device treats the DNA with bisulfite so that unmethylated cytosines are deaminated and converted into uracil while methylated cytosines remain unchanged; the whole-genome comparison device sequences the treated whole genome, compares it with the untreated sequence, and determines whether each CpG site is methylated. The methylation state of every CpG site on the genome can thus be established with very high reliability and accuracy.
Validation in multiple species (not limited to mammals) shows that the method and system of the present invention have higher accuracy and sensitivity than the method of Vardhman Rakyan et al., together with very high reliability.
An application example of the present invention is described below.
The sample data used in this application example are the fasta data of the fibroblast line imr90 and of YH No. 1. The fasta data of imr90 and YH No. 1 can be downloaded from, respectively:
imr90:http://neomorph.salk.edu/human_methylome/data.html
YH:http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ADDF
In this application example, several processing steps of the technical solution are implemented in software. The software can run under a Unix/Linux operating system and is invoked from the Unix/Linux command line. The description below also gives the command-line parameters used to run the software.
Data preparation is carried out first. After download, the data are aligned, de-duplicated, and processed to extract methylation information, yielding the cout files of YH and imr90 (a cout file records the methylation status of cytosine C sites). These two cout files are the sample input files for tDMR extraction. It should be noted that the method of the present invention can be used for any species for which cout files can be obtained and is not limited to what is listed in the embodiments, so its range of application is very broad.
The specific format of the cout file is as follows:
Table 1
Step 1: selecting seed tDMRs
The whole-genome single-site methylation information contained in the two sample cout files is scanned with a sliding window that is 5 CpGs long and advances in steps of 1 CpG. The preselection conditions are: (1) p <= 0.05 in the chi-square test (in combination with Fisher's exact test); (2) at least a two-fold difference in methylation level; (3) a methylation level of 20% or more in at least one sample.
The command line is:
"./tdmr slide -c CG YH.cout/chr$i.cout imr90.cout/chr$i.cout > outfile/CG/chr$i.CG; echo slide done;
The parameter -c specifies the context of the cytosine C to be searched. Three options are normally available: CG, CHH, and CHG. The CpG site is studied here, so CG is selected.
Output result: a total of 916949 tDMR seeds were obtained.
The format of the output file is as follows:
Table 2
In the above step, the search for tDMR seeds can be run in parallel per chromosome across the whole genome, which reduces run time and improves efficiency.
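Per-chromosome parallelism of the seed-selection step could be driven as in the following Python sketch. The `./tdmr slide` arguments and file paths mirror the seed-selection command line given above; everything else is an assumption, including the helper names, the use of a thread pool to launch one process per chromosome, and the chromosome labels passed in (the source writes only `chr$i`).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def slide_job(chrom):
    """Return (argv, output_path) for one chromosome's seed-selection run."""
    argv = ["./tdmr", "slide", "-c", "CG",
            f"YH.cout/chr{chrom}.cout", f"imr90.cout/chr{chrom}.cout"]
    return argv, f"outfile/CG/chr{chrom}.CG"


def run_job(job):
    """Run one job, redirecting standard output to the job's output file."""
    argv, out_path = job
    with open(out_path, "w") as out:
        return subprocess.run(argv, stdout=out, check=False).returncode


def run_per_chromosome(chroms, runner=run_job, workers=4):
    """Run one job per chromosome concurrently; return {chromosome: result}."""
    jobs = [slide_job(c) for c in chroms]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with chroms.
        return dict(zip(chroms, pool.map(runner, jobs)))
```

Threads are sufficient here because each worker only waits on an external process; the `runner` parameter makes it possible to substitute a stub when the `tdmr` binary is not available.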
Step 2: bidirectional extension of seed tDMRs
Each seed tDMR is extended in both directions to obtain candidate tDMRs. The extension termination conditions are: (1) the distance between two consecutive CpGs exceeds 200 bp; (2) the average methylation levels of the two samples differ by less than two-fold; (3) the methylation level of the region is below 20% in both samples; (4) p > 0.01 in the chi-square test.
The command line is:
./tdmr extend -t 8 -c CG YH.cout/chr$i.cout imr90.cout/chr$i.cout outfile/CG/chr$i.CG > outfile/CG/chr$i.CG.ext; echo extend done;
The parameter -t specifies the number of CPUs to use while the program runs.
The parameter -c specifies the cytosine C context (three options are available: CG, CHH, and CHG; CG is chosen here).
Output result: a total of 279004 tDMRs after extension.
The first 10 columns of the output file format are as follows:
Table 3
The last 8 columns are as follows:
Table 4
Step 3: filtering candidate tDMRs
The candidate tDMRs are filtered based on filter conditions. The filter conditions comprise: (1) FDR <= 0.05; (2) the average coverage of the obtained tDMR region is greater than 20 reads; (3) the coverage of each single CG site obtained is greater than 10 reads; (4) the accuracy of a sampling-with-replacement test performed on the CpG sites within the obtained tDMR is 95% or more.
The tDMRs are filtered with standard Linux commands:
sort -k 2 -n outfile/CG/chr$i.CG.ext | awk '\$16>0.95{a=\$1;for(i=2;i<16;i++){a=a\"\t\"\$i}{print a}}' | uniq > outfile/CG/chr$i.CG.ext.filter; echo filter done"
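The filter pipeline above can be read as: sort the extended records numerically on column 2, keep records whose column 16 exceeds 0.95, print the first 15 columns joined by tabs, and drop adjacent duplicate lines. A rough Python equivalent is sketched below. The function name is hypothetical, column 2 is assumed to be numeric in every record, and the tie-breaking of `sort` is only approximated by a stable sort.

```python
def filter_extended(lines, min_value=0.95):
    """Approximate the sort | awk | uniq pipeline on an iterable of text lines."""
    rows = [ln.split() for ln in lines if ln.strip()]
    # sort -k 2 -n: numeric sort on column 2 (stable sort; ties keep input order)
    rows.sort(key=lambda r: float(r[1]))
    out = []
    for r in rows:
        # awk '$16 > 0.95': keep records whose 16th column exceeds the threshold
        if len(r) >= 16 and float(r[15]) > min_value:
            line = "\t".join(r[:15])          # print columns 1..15, tab-separated
            if not out or out[-1] != line:    # uniq: drop adjacent duplicates
                out.append(line)
    return out
```

Calling `filter_extended(open("outfile/CG/chr1.CG.ext"))` would then return the lines that the shell pipeline writes to the corresponding `.filter` file, under the assumptions stated above.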
Output result: a total of 36924 tDMRs were finally obtained after filtering. The tDMRs were validated by examining the correlation between tDMRs and gene expression: all genes occurring in tDMRs were identified, bioinformatic analysis was used to determine whether the relationship between gene expression level and methylation rate agreed with the known relationship, and the results were tallied. The accuracy of the tDMRs found by this method can be 85% or more.
The output has 15 columns in total.
The first 8 columns of the output file format are as follows:
Table 5
The last 7 columns are as follows:
Table 6
In the above application example, the methylation state of each CpG site on the whole genome is determined. On this basis, the tDMR software developed can analyze two sequenced samples at the whole-genome level and identify tissue-specific differentially methylated regions simply and quickly, which greatly improves detection efficiency and reduces cost. The processing can be run in parallel per chromosome across the whole genome, which further improves speed and efficiency.
The description of the present invention is given for the purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and thereby design various embodiments, with various modifications, suited to particular uses.