CN106650313B - A method for filtering out DNA base preference bias in DNase high-throughput sequencing data - Google Patents
A method for filtering out DNA base preference bias in DNase high-throughput sequencing data
- Publication number
- CN106650313B (application CN201610865814.0A)
- Authority
- CN
- China
- Prior art keywords
- dnase
- base
- dna
- dna base
- binding site
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Organic Chemistry (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Zoology (AREA)
- Biophysics (AREA)
- Biotechnology (AREA)
- Analytical Chemistry (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Wood Science & Technology (AREA)
- Immunology (AREA)
- Biochemistry (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Microbiology (AREA)
- Genetics & Genomics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention belongs to the field of molecular biological information detection and analysis, and specifically relates to a method for filtering out DNA base preference bias in DNase high-throughput sequencing data, which effectively improves the accuracy of the information detected from such data. The method comprises: (1) obtaining the DNA bases in the cut-site region of the DNase-Seq experimental data; (2) obtaining the DNA base preference of the DNase-Seq experimental data; (3) removing the DNA base preference. The method accurately filters out the DNA base preference bias contained in DNase high-throughput sequencing data, yielding more accurate DNase-Seq results and providing data support for subsequent higher-level analysis.
Description
Technical field
The invention belongs to the field of molecular biological information detection and analysis, and specifically relates to a method for filtering out DNA base preference bias in DNase high-throughput sequencing data, which effectively improves the accuracy of the information detected from such data.
Background art
At present, DNA-protein binding sites are detected mainly by chromatin immunoprecipitation (ChIP). ChIP-Seq, which combines ChIP with high-throughput sequencing, can effectively detect the binding sites of a target functional protein on DNA across the whole genome. The principle of ChIP-Seq is as follows. First, ChIP uses a binding agent specific to the target protein to enrich the DNA fragments bound by that protein, which are then purified and built into a library. The enriched DNA fragments are then sequenced at high throughput, and the millions of reads obtained are mapped precisely to the genome, giving the DNA regions bound by the target protein across the whole genome. Various analysis algorithms then derive the DNA binding sites of the target protein.
However, ChIP-Seq has several shortcomings. First, the binding agent used to enrich the target protein is specific, so some proteins cannot be detected because no suitable specific binding agent is available. Second, one experiment detects only one protein, which is time-consuming, laborious, and costly, and prevents large-scale use. Third, and most important, the DNA fragments bound to the target protein are long, so only their two ends can be partially sequenced. Because the sequenced region is not the binding site itself, ChIP-Seq cannot detect DNA-protein binding sites at single-base resolution.
To address these problems, a new technique for detecting DNA-protein binding sites has emerged in recent years: detection based on DNase high-throughput sequencing information, i.e. DNase-Seq. The principle of DNase-Seq is as follows. First, the DNA is digested with the DNase nuclease. DNA in regions without bound protein is cut randomly by DNase, whereas DNA in regions with bound protein is shielded by that protein and is not cut. The digested DNA fragments are then purified, built into a library, and sequenced, giving the DNase cleavage information across the whole genome. In this information the cleavage signal at protein binding sites is specifically weakened, as if footprints had been left on the DNA one after another, so the binding sites of DNA-binding proteins on the DNA molecule can be identified accurately.
Compared with ChIP-Seq, DNase-Seq has prominent advantages. First, because it is not protein-specific, DNase-Seq can detect the binding sites of many DNA-binding proteins across the whole genome in a single experiment. Second, because many proteins are covered at once, DNase-Seq greatly improves detection efficiency and reduces cost, making large-scale detection of DNA-protein binding sites feasible. Third, and most important, because the start position of each sequencing read is exactly the cleavage position, DNase-Seq detects DNA-protein binding sites at single-base resolution.
However, it was recently found that DNase has a certain DNA base preference when cutting DNA, which adversely affects the identification of DNA-protein binding sites. Removing this preference has become a key problem in DNase-Seq based identification of DNA-protein binding sites.
Summary of the invention
The purpose of the present invention is to provide a method for filtering out DNA base preference bias in DNase high-throughput sequencing data.
The object of the present invention is achieved as follows:
(1) Obtaining the DNA bases in the cut-site region of the DNase-Seq experimental data
According to the position of each DNase-Seq read in the genome, extract the DNA bases in the region around the corresponding cut site. The present invention selects the bases at the 6 positions around the cut site: centered on the cut site, 3 bases are taken on the left and 3 on the right.
(2) Obtaining the DNA base preference of the DNase-Seq experimental data
The present invention selects the bases at the 6 positions around the cut site. Each base takes one of 4 values (A, C, G, T), so the 6 positions give 4^6 = 4096 base combinations. Counting how often each of these 4096 combinations occurs at the cut sites of the entire DNase-Seq data set gives the DNA base preference of the DNase-Seq experimental data.
(3) Removing the DNA base preference
Suppose there are m protein binding sites, each comprising n bases. The DNase detection signal of the i-th binding site is then $[S_{i1}, S_{i2}, \dots, S_{in}]$, and its sum is:
$$R_i = \sum_{j=1}^{n} S_{ij}$$
Taking the DNA base preference of DNase into account, the DNase detection signal at position j of the i-th binding site is $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$. Here $P_{ij}$ is the intrinsic DNase cutting probability at position j of the i-th binding site, which corresponds to the protein structure of the DNA-binding protein, and $B_{ij}$ is the DNase cutting probability at the same position that corresponds to the DNA base preference there. $P_{ij}$ is stable and can be used to identify DNA-protein binding sites, whereas $B_{ij}$ is unstable and should be filtered out.
The filtering is performed as follows:
$$P_{ij} = \frac{S_{ij} - wB_{ij}R_i}{(1-w)R_i}$$
where $S_{ij}$ and $R_i$ are obtained directly from the experimental data, and $B_{ij}$ is obtained from the DNA base preference of the DNase-Seq experimental data found in the previous step. w is a weight in the range [0,1] whose value remains to be determined.
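The filtering step amounts to solving the signal model $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$ for $P_{ij}$. A minimal sketch in Python follows. The function name `remove_bias`, the list-of-lists layout (one row per binding site), the exclusion of w = 1 (where the model cannot be inverted), and the handling of sites with no cuts are assumptions made for illustration, not details given in the text.

```python
def remove_bias(S, B, w):
    """Solve S_ij = [(1 - w) * P_ij + w * B_ij] * R_i for P_ij.

    S: m rows of n observed cut counts, one row per binding site.
    B: m rows of n base-preference cutting probabilities.
    w: weight of the base-preference term; w = 1 is rejected (assumption)
       because the model cannot be inverted there.
    """
    if not 0.0 <= w < 1.0:
        raise ValueError("w must lie in [0, 1)")
    P = []
    for s_row, b_row in zip(S, B):
        R = sum(s_row)  # R_i: total signal of the site
        if R == 0:
            # Assumption: a site without any cuts yields an all-zero profile.
            P.append([0.0] * len(s_row))
            continue
        P.append([(s / R - w * b) / (1.0 - w) for s, b in zip(s_row, b_row)])
    return P
```

With w = 0 the function simply returns the normalized signal $S_{ij}/R_i$, which is the uncorrected cutting profile.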
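For the m protein binding sites, each value of the weight w gives a different set of $[P_{i1}, P_{i2}, \dots, P_{in}]$, $1 \le i \le m$. Let
$$P_j = \frac{1}{m}\sum_{i=1}^{m} P_{ij}, \quad 1 \le j \le n$$
Then the value of w that maximizes the median of the m correlation values between the m vectors $[P_{i1}, P_{i2}, \dots, P_{in}]$ and $[P_1, P_2, \dots, P_n]$ is the optimal value.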
The beneficial effect of the present invention is that the method accurately filters out the DNA base preference bias contained in DNase high-throughput sequencing data, yielding more accurate DNase-Seq results and providing data support for subsequent higher-level analysis.
Brief description of the drawings
Fig. 1 is a histogram of the DNA base preference of the DNase-Seq experimental data.
Fig. 2 is the curve of the evaluation value as a function of the weight w.
Fig. 3 is the flow chart of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings.
As a new technique for detecting DNA-protein binding sites, DNase-Seq has many prominent advantages. Because it is not protein-specific, DNase-Seq can detect the binding sites of many DNA-binding proteins across the whole genome in a single experiment. Because many proteins are covered at once, DNase-Seq greatly improves detection efficiency and reduces cost, making large-scale detection of DNA-protein binding sites feasible. Because the start position of each sequencing read is exactly the cleavage position, DNase-Seq detects DNA-protein binding sites at single-base resolution.
However, it was recently found that DNase has a certain DNA base preference when cutting DNA, which adversely affects the identification of DNA-protein binding sites. The present invention is a method, proposed to solve this problem, for filtering out DNA base preference bias in DNase high-throughput sequencing data.
1. Obtaining the DNA bases in the cut-site region of the DNase-Seq experimental data
According to the position of each DNase-Seq read in the genome, extract the DNA bases in the region around the corresponding cut site. The present invention selects the bases at the 6 positions around the cut site: centered on the cut site, 3 bases are taken on the left and 3 on the right.
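The extraction step can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `cut_site_hexamer`, the 0-based coordinate convention (the cut lies immediately to the left of `cut_pos`), and the decision to skip windows that contain non-ACGT bases are assumptions, and strand handling is left out.

```python
from typing import Optional

def cut_site_hexamer(genome: str, cut_pos: int, flank: int = 3) -> Optional[str]:
    """Return the bases around a cut site: `flank` on the left, `flank` on the right.

    `cut_pos` is the 0-based index of the first base to the right of the cut,
    so the window is genome[cut_pos - flank : cut_pos + flank].
    """
    start, end = cut_pos - flank, cut_pos + flank
    if start < 0 or end > len(genome):
        return None  # the window runs off the end of the sequence
    kmer = genome[start:end].upper()
    # Assumption: windows with ambiguous bases (for example N) are discarded.
    return kmer if set(kmer) <= set("ACGT") else None
```

For the sequence `AACCGGTTAC` and a cut position of 5, the window covers indices 2 to 7 and the function returns `CCGGTT`.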
2. Obtaining the DNA base preference of the DNase-Seq experimental data
The present invention selects the bases at the 6 positions around the cut site. Each base takes one of 4 values (A, C, G, T), so the 6 positions give 4^6 = 4096 base combinations. Counting how often each of these 4096 combinations occurs at the cut sites of the entire DNase-Seq data set gives the DNA base preference of the DNase-Seq experimental data.
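Counting the 4096 combinations is a plain frequency table. The sketch below assumes the 6-base strings have already been collected, one per cut site; the function name `hexamer_frequencies` and the choice to report relative frequencies (counts divided by the total) rather than raw counts are assumptions.

```python
from collections import Counter
from itertools import product

def hexamer_frequencies(kmers):
    """Relative frequency of each of the 4**6 = 4096 base combinations."""
    # Keep only well-formed 6-base strings over A, C, G, T.
    valid = [k for k in kmers if len(k) == 6 and set(k) <= set("ACGT")]
    counts = Counter(valid)
    total = len(valid)
    all_kmers = ("".join(p) for p in product("ACGT", repeat=6))
    # Combinations that never occur are reported with frequency 0.
    return {k: (counts[k] / total if total else 0.0) for k in all_kmers}
```

Listing all 4096 keys explicitly, including those with zero count, keeps the table the same shape for every data set, which makes histograms such as the one in Fig. 1 directly comparable.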
3. Removing the DNA base preference
Suppose there are m protein binding sites, each comprising n bases. The DNase detection signal of the i-th binding site is then $[S_{i1}, S_{i2}, \dots, S_{in}]$, and its sum is:
$$R_i = \sum_{j=1}^{n} S_{ij}$$
Taking the DNA base preference of DNase into account, the DNase detection signal at position j of the i-th binding site is $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$. Here $P_{ij}$ is the intrinsic DNase cutting probability at position j of the i-th binding site, which corresponds to the protein structure of the DNA-binding protein, and $B_{ij}$ is the DNase cutting probability at the same position that corresponds to the DNA base preference there. $P_{ij}$ is stable and can be used to identify DNA-protein binding sites, whereas $B_{ij}$ is unstable and should be filtered out.
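The text does not spell out how $B_{ij}$ is computed from the base-combination frequencies. One plausible reading, sketched below purely as an assumption, looks up the frequency of the base combination around each position of the site and normalizes the values over the site so that they sum to 1, like a cutting probability. The function name `background_profile` and the requirement that the input sequence carry `flank` extra bases on each side of the site are hypothetical.

```python
def background_profile(sequence, freqs, flank=3):
    """Base-preference cutting probabilities for one binding site.

    sequence: the n bases of the site plus `flank` extra bases on each side.
    freqs: mapping from a 2*flank-base combination to its frequency.
    Returns n values that sum to 1 (all zeros if no combination is known).
    """
    n = len(sequence) - 2 * flank
    # Position j of the site sits at index j + flank of `sequence`; the
    # combination around it is the `flank` bases on either side of that cut.
    raw = [freqs.get(sequence[j:j + 2 * flank], 0.0) for j in range(n)]
    total = sum(raw)
    # Assumption: normalize over the site so the profile is a probability.
    return [x / total for x in raw] if total else [0.0] * n
```

For a 201-base site and the default `flank` of 3, the input sequence would be 207 bases long.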
The filtering is performed as follows:
$$P_{ij} = \frac{S_{ij} - wB_{ij}R_i}{(1-w)R_i}$$
where $S_{ij}$ and $R_i$ are obtained directly from the experimental data, and $B_{ij}$ is obtained from the DNA base preference of the DNase-Seq experimental data found in the previous step. w is a weight in the range [0,1], determined by the following method.
For the m protein binding sites, each value of the weight w gives a different set of $[P_{i1}, P_{i2}, \dots, P_{in}]$, $1 \le i \le m$. Let
$$P_j = \frac{1}{m}\sum_{i=1}^{m} P_{ij}, \quad 1 \le j \le n$$
Then the value of w that maximizes the median of the m correlation values between the m vectors $[P_{i1}, P_{i2}, \dots, P_{in}]$ and $[P_1, P_2, \dots, P_n]$ is the optimal value.
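The weight selection can be sketched as a simple search over candidate values of w. The sketch assumes that $[P_1, \dots, P_n]$ is the mean of the corrected profiles over all sites, that the correlation is the Pearson correlation, that every site has at least one cut, and that candidates stay below 1; the function names are hypothetical.

```python
from statistics import median

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def corrected_profiles(S, B, w):
    """P_ij = (S_ij / R_i - w * B_ij) / (1 - w); assumes R_i > 0 and w < 1."""
    out = []
    for s_row, b_row in zip(S, B):
        R = sum(s_row)
        out.append([(s / R - w * b) / (1 - w) for s, b in zip(s_row, b_row)])
    return out

def median_correlation(S, B, w):
    """Evaluation value of w: median correlation with the mean profile."""
    P = corrected_profiles(S, B, w)
    m, n = len(P), len(P[0])
    mean_profile = [sum(row[j] for row in P) / m for j in range(n)]
    return median(pearson(row, mean_profile) for row in P)

def select_weight(S, B, candidates):
    """Return the candidate w with the largest evaluation value."""
    return max(candidates, key=lambda w: median_correlation(S, B, w))
```

The idea behind the criterion is that, at the right w, the corrected profiles of sites bound by the same protein should resemble one another, so each correlates strongly with their average. A grid such as `[i / 100 for i in range(100)]` would scan w in steps of 0.01.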
4. Experimental verification
Human genome sequence data were downloaded from the UCSC bioinformatics site, together with the DNase-Seq sequencing data for the human K562 cell line and the ChIP-Seq sequencing data for the NFYA transcription factor measured by UW for the international ENCODE project.
According to the position of each DNase-Seq cut site in the human genome, the bases at the 6 surrounding positions were extracted: centered on the cut site, 3 bases on the left and 3 on the right. Counting how often each of the 4096 base combinations occurs at the cut sites gave the DNA base preference of the DNase-Seq experimental data. The histogram of this preference is shown in Fig. 1 (horizontal axis: base combination; vertical axis: frequency). Fig. 1 shows that the DNase-Seq experimental data have an obvious DNA base preference.
From the ChIP-Seq sequencing data of the NFYA transcription factor, 953 NFYA protein binding sites were identified. Each binding site comprises 201 bases.
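Assembling the signal matrix for these sites can be sketched as slicing a per-base cut-count track into fixed-width windows. The sketch assumes each 201-base site is centered on a known coordinate and that sites too close to the ends of the track are dropped; the name `site_windows` and both conventions are illustrative assumptions, not details from the text.

```python
def site_windows(cut_counts, site_centers, width=201):
    """Slice one fixed-width window of cut counts per binding site.

    cut_counts: per-base DNase cut counts along one sequence (0-based).
    site_centers: 0-based center coordinate of each binding site.
    Returns a list of rows, each of length `width` (odd, so the center
    base has the same number of bases on either side).
    """
    half = width // 2
    windows = []
    for center in site_centers:
        start, end = center - half, center + half + 1
        if start < 0 or end > len(cut_counts):
            continue  # assumption: drop sites that run off the sequence
        windows.append(list(cut_counts[start:end]))
    return windows
```

With 953 retained sites and the default width, the result is a 953 by 201 matrix whose rows are the signals $[S_{i1}, \dots, S_{in}]$ with n = 201.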
The DNA base preference was filtered out of the DNase-Seq experimental data using the method of the present invention. For a given weight w, the DNase detection signal of each binding site with the DNA base preference removed is $[P_{i1}, P_{i2}, \dots, P_{in}]$, $1 \le i \le 953$. The Pearson correlation between each binding site's $[P_{i1}, P_{i2}, \dots, P_{in}]$ and $[P_1, P_2, \dots, P_n]$ was computed, with n = 201. The median of the 953 correlations was taken as the evaluation value of that w. Varying w from 0 to 1 gave the curve of the evaluation value shown in Fig. 2 (horizontal axis: w; vertical axis: evaluation value). Fig. 2 shows that at w = 0.15 the evaluation value reaches its maximum and does not increase further. This w is therefore taken as the optimal value, and the corresponding DNase-Seq detection information with the DNA base preference filtered out is obtained.
Claims (1)
1. A method for filtering out DNA base preference bias in DNase high-throughput sequencing data, characterized by comprising the following steps:
(1) Obtaining the DNA bases in the cut-site region of the DNase-Seq experimental data
According to the position of each DNase-Seq read in the genome, extract the DNA bases in the region around the corresponding cut site; select the bases at the 6 positions around the cut site, i.e. centered on the cut site, 3 bases on the left and 3 on the right;
(2) Obtaining the DNA base preference of the DNase-Seq experimental data
Select the bases at the 6 positions around the cut site; each base takes one of 4 values (A, C, G, T), so the 6 positions give 4096 base combinations; counting how often each of these 4096 combinations occurs at the cut sites of the entire DNase-Seq data set gives the DNA base preference of the DNase-Seq experimental data;
(3) Removing the DNA base preference
Suppose there are m protein binding sites, each comprising n bases; the DNase detection signal of the i-th binding site is then $[S_{i1}, S_{i2}, \dots, S_{in}]$, and its sum is:
$$R_i = \sum_{j=1}^{n} S_{ij}$$
Taking the DNA base preference of DNase into account, the DNase detection signal at position j of the i-th binding site is $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$; here $P_{ij}$ is the intrinsic DNase cutting probability at position j of the i-th binding site, which corresponds to the protein structure of the DNA-binding protein, and $B_{ij}$ is the DNase cutting probability at the same position that corresponds to the DNA base preference there; $P_{ij}$ is stable and can be used to identify DNA-protein binding sites, whereas $B_{ij}$ is unstable and should be filtered out;
The filtering is performed as follows:
$$P_{ij} = \frac{S_{ij} - wB_{ij}R_i}{(1-w)R_i}$$
where $S_{ij}$ and $R_i$ are obtained directly from the experimental data; $B_{ij}$ is obtained from the DNA base preference of the DNase-Seq experimental data found in the previous step; w is a weight in the range [0,1] whose value remains to be determined;
For the m protein binding sites, each value of the weight w gives a different set of $[P_{i1}, P_{i2}, \dots, P_{in}]$, $1 \le i \le m$; let
$$P_j = \frac{1}{m}\sum_{i=1}^{m} P_{ij}, \quad 1 \le j \le n$$
then the value of w that maximizes the median of the m correlation values between the m vectors $[P_{i1}, P_{i2}, \dots, P_{in}]$ and $[P_1, P_2, \dots, P_n]$ is the optimal value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610865814.0A (CN106650313B) | 2016-09-29 | 2016-09-29 | A method for filtering out DNA base preference bias in DNase high-throughput sequencing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106650313A | 2017-05-10 |
CN106650313B | 2019-10-18 |
Family ID: 58853980
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201610865814.0A (CN106650313B) | Active | 2016-09-29 | 2016-09-29 | A method for filtering out DNA base preference bias in DNase high-throughput sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106650313B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280326B (en) * | 2018-01-22 | 2021-06-11 | Harbin Engineering University | Method for eliminating DNA base tendency deviation in DNase high-throughput sequencing data based on deep recurrent neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102120998A (en) * | 2010-12-15 | 2011-07-13 | Harbin Engineering University | Method for perceiving action transcription factor in cell |
CN102622534A (en) * | 2012-04-11 | 2012-08-01 | Harbin Engineering University | Correction method of deoxyribonucleic acid high-pass sequencing data for gene expression detection |
CN103390119A (en) * | 2013-07-03 | 2013-11-13 | Harbin Engineering University | Method for recognizing transcription factor binding site |
CN103810404A (en) * | 2014-01-13 | 2014-05-21 | Harbin Engineering University | High-flux DNA sequencing data matching reinforcement method based on Bayes technology |
CN104131093A (en) * | 2014-07-23 | 2014-11-05 | Harbin Engineering University | DNase high-throughput sequencing detection signal processing method of DNA protein binding sites |
Non-Patent Citations (1)
Title |
---|
Peichao Sang et al., Identification method of transcription factor binding sites based on DNase-Seq signal, IEEE, 2015-09-03, pp. 1665-1669 * |
Also Published As
Publication number | Publication date |
---|---|
CN106650313A (en) | 2017-05-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||