CN106650313B

CN106650313B - Method for filtering out DNA base preference bias in DNase high-throughput sequencing data

Info

Publication number: CN106650313B
Application number: CN201610865814.0A
Authority: CN
Inventors: 冯伟兴; 贺波; 宋艳霞; 徐斯文; 赵森; 陈多娇; 刘欢
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2016-09-29
Filing date: 2016-09-29
Publication date: 2019-10-18
Anticipated expiration: 2036-09-29
Also published as: CN106650313A

Abstract

The invention belongs to the field of molecular bioinformatics detection and analysis, and in particular relates to a method for filtering out DNA base preference bias in DNase high-throughput sequencing data, which effectively improves the accuracy of the information detected from such data. The invention comprises: (1) obtaining the DNA bases in the cleavage-site regions of the DNase-Seq experimental data; (2) obtaining the DNA base preference of the DNase-Seq experimental data; (3) removing the DNA base preference. The method accurately filters out the DNA base preference bias contained in DNase high-throughput sequencing data, yielding more accurate DNase-Seq sequencing results and providing reliable data for subsequent higher-level analyses.

Description

Method for filtering out DNA base preference bias in DNase high-throughput sequencing data

Technical field

The invention belongs to the field of molecular bioinformatics detection and analysis, and in particular relates to a method for filtering out DNA base preference bias in DNase high-throughput sequencing data, which effectively improves the accuracy of the information detected from such data.

Background art

At present, DNA-protein binding sites are detected mainly by chromatin immunoprecipitation (ChIP). ChIP-Seq, which combines ChIP with high-throughput sequencing, can effectively detect the binding sites of a target functional protein on DNA across the whole genome. The principle of ChIP-Seq is as follows. First, ChIP uses an antibody that binds specifically to the target protein to enrich the DNA fragments bound by that protein, and the fragments are purified and built into a library. The enriched DNA fragments are then sequenced at high throughput, and the millions of resulting reads are mapped precisely to the genome, giving genome-wide information on the DNA segments bound by the target protein. The DNA binding sites of the target protein are then derived with various analysis algorithms.

However, ChIP-Seq has several shortcomings. First, the antibody used to enrich the target protein must be specific, so some proteins cannot be detected because no suitable specific antibody is available. Second, one experiment can detect only one protein, which is time-consuming, laborious, and costly, and prevents large-scale use. Third, and more importantly, the DNA fragments bound to the target protein are relatively long, so only their two ends can be sequenced; because the sequenced regions are not the binding sites themselves, ChIP-Seq cannot detect DNA-protein binding sites at single-base resolution.

To address these problems, a new technique for detecting DNA-protein binding sites has emerged in recent years: detection based on DNase high-throughput sequencing information, known as DNase-Seq. The principle of DNase-Seq is as follows. First, DNA is digested with the DNase nuclease. DNA in regions without bound protein is cut randomly by the nuclease, whereas DNA in protein-bound regions is specifically protected from cutting by the bound protein. The digested DNA fragments are then purified, built into a library, and sequenced, yielding genome-wide cleavage information for the DNase nuclease. In this information, cleavage is specifically weakened at protein binding sites, as if footprints had been left on the DNA, which allows the binding sites of DNA-binding proteins on the DNA molecule to be identified accurately.

Compared with ChIP-Seq, DNase-Seq has prominent advantages. First, because it is not specific to any one protein, DNase-Seq can detect the binding sites of many DNA-binding proteins across the whole genome in a single experiment. Second, because many binding proteins are covered at once, DNase-Seq greatly improves detection efficiency and lowers detection cost, making large-scale detection of DNA-protein binding sites feasible. Third, and more importantly, because sequencing starts exactly at the cleavage position, DNase-Seq detects DNA-protein binding sites at single-base resolution.

However, it has recently been found that the DNase nuclease shows a certain DNA base preference when cutting DNA, which adversely affects the identification of DNA-protein binding sites. Removing this preference has become a key problem in DNase-Seq-based identification of DNA-protein binding sites.

Summary of the invention

The object of the invention is to provide a method for filtering out DNA base preference bias in DNase high-throughput sequencing data.

The object of the invention is achieved as follows:

(1) Obtaining the DNA bases in the cleavage-site regions of the DNase-Seq experimental data

According to the position of each item of DNase-Seq experimental data in the genome, the DNA bases in the region around its cleavage site are extracted. The invention selects the bases at the 6 positions around the cleavage site, that is, 3 bases on each side of the cleavage site.
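The extraction in step (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cleavage_hexamer`, the 0-based coordinate convention, and the choice to skip windows containing non-ACGT characters are not specified in the source, and strand handling is left out.

```python
def cleavage_hexamer(genome_seq, cut_pos, flank=3):
    """Return the 2 * flank bases centered on a cleavage site.

    genome_seq: chromosome sequence as a string (hypothetical input).
    cut_pos:    0-based index of the first base to the right of the cut
                (an assumed convention; the source does not fix one).
    Returns None when the window runs off the sequence or contains
    anything other than A, C, G, T.
    """
    start, end = cut_pos - flank, cut_pos + flank
    if start < 0 or end > len(genome_seq):
        return None
    window = genome_seq[start:end].upper()
    if set(window) - set("ACGT"):
        return None
    return window
```

With `flank=3` the window covers 3 bases on each side of the cut, matching the 6 positions described above.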

(2) Obtaining the DNA base preference of the DNase-Seq experimental data

The invention selects the bases at the 6 positions around the cleavage site. Each base takes one of 4 values (A, C, G, T), so the 6 positions give 4^6 = 4096 base combinations. Counting how often each of these 4096 combinations occurs at the cleavage sites of the entire DNase-Seq data set gives the DNA base preference of the DNase-Seq experimental data.
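The counting in step (2) might look like the following sketch. The source speaks only of the frequency with which each combination occurs; dividing by the total to obtain relative frequencies is an assumption made here so that the values can later be read as probabilities. The function name `hexamer_preference` is hypothetical.

```python
from collections import Counter
from itertools import product


def hexamer_preference(hexamers):
    """Relative frequency of each of the 4096 hexamers over all cut sites.

    hexamers: iterable of 6-base strings, one per cleavage site;
              None entries (sites that could not be extracted) are skipped.
    """
    counts = Counter(h for h in hexamers if h is not None)
    total = sum(counts.values())
    # Enumerate all 4**6 combinations so that unseen hexamers get 0.0.
    all_hexamers = ("".join(p) for p in product("ACGT", repeat=6))
    return {h: (counts[h] / total if total else 0.0) for h in all_hexamers}
```

Plotting the resulting dictionary as a bar chart would give a histogram of the kind described for Fig. 1.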

(3) Removing the DNA base preference

Suppose there are m protein binding sites, each containing n bases. The DNase detection signal of the i-th binding site is then $[S_{i1}, S_{i2}, \ldots, S_{in}]$, and its sum is:

$$R_i = \sum_{j=1}^{n} S_{ij}$$

Taking the DNA base preference of DNase into account, the DNase detection signal at column j of the i-th binding site is $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$. Here, $P_{ij}$ is the intrinsic DNase cutting probability at column j of the i-th binding site that corresponds to the protein structure of the DNA-binding protein, and $B_{ij}$ is the DNase cutting probability at the same position that corresponds to the local DNA base preference. $P_{ij}$ is stable and can be used to identify DNA-protein binding sites, whereas $B_{ij}$ is unstable and should be filtered out.

The specific filtering method is as follows:

$$P_{ij} = \frac{S_{ij}/R_i - wB_{ij}}{1-w}$$

Here, $S_{ij}$ and $R_i$ are obtained directly from the experimental data, and $B_{ij}$ is obtained from the DNA base preference of the DNase-Seq experimental data determined in the previous step. $w$ is a weight in the range [0, 1] whose value remains to be determined.
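As a minimal sketch of the filtering step, the model $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$ can be solved for $P_{ij}$ at a single binding site as shown below. The function name `remove_base_preference` is hypothetical, and the bias values are assumed to be already expressed as probabilities that sum to 1 over the site; the source does not spell out that normalization.

```python
def remove_base_preference(signal, bias, w):
    """Solve S_ij = [(1 - w) * P_ij + w * B_ij] * R_i for P_ij.

    signal: DNase cut counts S_i1..S_in at one binding site.
    bias:   base-preference cutting probabilities B_i1..B_in
            (assumed here to sum to 1 over the site).
    w:      weight with 0 <= w < 1; w = 1 would divide by zero.
    """
    if not 0.0 <= w < 1.0:
        raise ValueError("w must satisfy 0 <= w < 1")
    total = sum(signal)  # R_i, the summed signal of the site
    if total == 0:
        return [0.0] * len(signal)
    return [(s / total - w * b) / (1.0 - w) for s, b in zip(signal, bias)]
```

For example, `remove_base_preference([2, 6, 2], [0.5, 0.25, 0.25], 0.2)` yields values close to `[0.125, 0.6875, 0.1875]`. Individual entries can become negative when the weighted bias exceeds the observed fraction; how such cases should be treated is not stated in the source.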

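For the m protein binding sites, different values of the weight $w$ yield different $[P_{i1}, P_{i2}, \ldots, P_{in}]$, $1 \le i \le m$. Let $[P_1, P_2, \ldots, P_n]$ be the reference signal defined over the m binding sites; then the value of $w$ at which the median of the m correlation values between the m vectors $[P_{i1}, P_{i2}, \ldots, P_{in}]$ and $[P_1, P_2, \ldots, P_n]$ is largest is the optimal value.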
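The weight selection can be illustrated with the sketch below. Two points are assumptions rather than statements from the source: the reference signal $[P_1, \ldots, P_n]$ is taken to be the column-wise mean of the m filtered signals, and the candidate weights are scanned over a user-supplied grid that must exclude 1. All function names are hypothetical; the Pearson correlation is written out by hand to keep the example dependency-free.

```python
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def filtered(signal, bias, w):
    """P_ij = (S_ij / R_i - w * B_ij) / (1 - w) for one binding site."""
    total = sum(signal)
    return [(s / total - w * b) / (1.0 - w) for s, b in zip(signal, bias)]


def evaluate_w(signals, biases, w):
    """Median correlation between each filtered signal and the reference."""
    profiles = [filtered(s, b, w) for s, b in zip(signals, biases)]
    m = len(profiles)
    # Assumption: the reference is the column-wise mean over the m sites.
    reference = [sum(col) / m for col in zip(*profiles)]
    return median(pearson(p, reference) for p in profiles)


def best_w(signals, biases, grid):
    """Return the grid value of w with the largest evaluation value."""
    return max(grid, key=lambda w: evaluate_w(signals, biases, w))
```

A call such as `best_w(signals, biases, [k / 100 for k in range(100)])` scans w from 0 to 0.99 in steps of 0.01. The median is used as the evaluation value, as in the text, so a few poorly correlated sites do not dominate the choice.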

The beneficial effects of the invention are as follows: the method accurately filters out the DNA base preference bias contained in DNase high-throughput sequencing data, yielding more accurate DNase-Seq sequencing results and providing reliable data for subsequent higher-level analyses.

Brief description of the drawings

Fig. 1 is a histogram of the DNA base preference of the DNase-Seq experimental data.

Fig. 2 is the curve of the evaluation value as a function of the weight w.

Fig. 3 is a flow chart of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings.

As a new technique for detecting DNA-protein binding sites, DNase-Seq has many prominent advantages. Because it is not specific to any one protein, DNase-Seq can detect the binding sites of many DNA-binding proteins across the whole genome in a single experiment. Because many binding proteins are covered at once, DNase-Seq greatly improves detection efficiency and lowers detection cost, making large-scale detection of DNA-protein binding sites feasible. Because sequencing starts exactly at the cleavage position, DNase-Seq detects DNA-protein binding sites at single-base resolution.

However, it has recently been found that the DNase nuclease shows a certain DNA base preference when cutting DNA, which adversely affects the identification of DNA-protein binding sites. The invention is a method, proposed to address this problem, for filtering out DNA base preference bias in DNase high-throughput sequencing data.

1. Obtaining the DNA bases in the cleavage-site regions of the DNase-Seq experimental data

2. Obtaining the DNA base preference of the DNase-Seq experimental data

3. Removing the DNA base preference

The specific filtering method is as follows:

$$P_{ij} = \frac{S_{ij}/R_i - wB_{ij}}{1-w}$$

Here, $S_{ij}$ and $R_i$ are obtained directly from the experimental data, and $B_{ij}$ is obtained from the DNA base preference of the DNase-Seq experimental data determined in the previous step. $w$ is a weight in the range [0, 1], whose value is determined by the following method:

4. Experimental verification

Human genome sequence data were downloaded from the UCSC bioinformatics website, together with DNase-Seq sequencing data for the human K562 cell line and ChIP-Seq sequencing data for the NFYA transcription factor, both measured by UW under the international ENCODE project.

According to the position of each DNase-Seq cleavage site in the human genome, the bases at the 6 surrounding positions were extracted, that is, 3 bases on each side of the cleavage site. The frequencies with which the 4096 base combinations occur at the cleavage sites were counted, giving the DNA base preference of the DNase-Seq experimental data. The histogram of this preference is shown in Fig. 1 (horizontal axis: base combination; vertical axis: frequency). Fig. 1 shows that the DNase-Seq experimental data exhibit a clear DNA base preference.

From the ChIP-Seq sequencing data of the NFYA transcription factor, 953 NFYA protein binding sites were identified, each containing 201 bases.

The method of the invention was used to filter the DNA base preference out of the DNase-Seq experimental data. For a given weight $w$, the DNase detection signal of each binding site after the DNA base preference has been filtered out is $[P_{i1}, P_{i2}, \ldots, P_{in}]$, $1 \le i \le 953$. The Pearson correlation between each binding site's $[P_{i1}, P_{i2}, \ldots, P_{in}]$ and $[P_1, P_2, \ldots, P_n]$ was computed, with n = 201. The median of the 953 correlations was taken as the evaluation value for that $w$. Varying $w$ from 0 to 1 gives the evaluation curve shown in Fig. 2 (horizontal axis: w; vertical axis: evaluation value). Fig. 2 shows that at w = 0.15 the evaluation value reaches its maximum and no longer increases; this value of $w$ is therefore taken as optimal, and the corresponding DNase-Seq detection information with the DNA base preference filtered out is obtained.

As a new technique for detecting DNA-protein binding sites, DNase-Seq has prominent advantages. Because it is not specific to any one protein, DNase-Seq can detect the binding sites of many DNA-binding proteins across the whole genome in a single experiment. Because many binding proteins are covered at once, DNase-Seq greatly improves detection efficiency and lowers detection cost, making large-scale detection of DNA-protein binding sites feasible. Because sequencing starts exactly at the cleavage position, DNase-Seq detects DNA-protein binding sites at single-base resolution. However, the DNase nuclease shows a certain DNA base preference when cutting DNA, which adversely affects the identification of DNA-protein binding sites. The invention is a method, proposed to address this problem, for filtering out DNA base preference bias in DNase high-throughput sequencing data.

Claims

1. A method for filtering out DNA base preference bias in DNase high-throughput sequencing data, characterized by comprising the following steps:

(1) Obtaining the DNA bases in the cleavage-site regions of the DNase-Seq experimental data

According to the position of each item of DNase-Seq experimental data in the genome, the DNA bases in the region around its cleavage site are extracted; the bases at the 6 positions around the cleavage site are selected, that is, 3 bases on each side of the cleavage site;

(2) Obtaining the DNA base preference of the DNase-Seq experimental data

The bases at the 6 positions around the cleavage site are selected; each base takes one of 4 values (A, C, G, T), so the 6 positions give 4096 base combinations; counting how often each of these 4096 combinations occurs at the cleavage sites of the entire DNase-Seq data set gives the DNA base preference of the DNase-Seq experimental data;

(3) Removing the DNA base preference

Suppose there are m protein binding sites, each containing n bases; the DNase detection signal of the i-th binding site is then $[S_{i1}, S_{i2}, \ldots, S_{in}]$, and its sum is:

$$R_i = \sum_{j=1}^{n} S_{ij}$$

Taking the DNA base preference of DNase into account, the DNase detection signal at column j of the i-th binding site is $S_{ij} = [(1-w)P_{ij} + wB_{ij}]R_i$; here, $P_{ij}$ is the intrinsic DNase cutting probability at column j of the i-th binding site that corresponds to the protein structure of the DNA-binding protein, and $B_{ij}$ is the DNase cutting probability at the same position that corresponds to the local DNA base preference; $P_{ij}$ is stable and can be used to identify DNA-protein binding sites, whereas $B_{ij}$ is unstable and should be filtered out;

The specific filtering method is as follows:

$$P_{ij} = \frac{S_{ij}/R_i - wB_{ij}}{1-w}$$

Here, $S_{ij}$ and $R_i$ are obtained directly from the experimental data; $B_{ij}$ is obtained from the DNA base preference of the DNase-Seq experimental data determined in the previous step; $w$ is a weight in the range [0, 1] whose value remains to be determined;

For the m protein binding sites, different values of the weight $w$ yield different $[P_{i1}, P_{i2}, \ldots, P_{in}]$, $1 \le i \le m$; let $[P_1, P_2, \ldots, P_n]$ be the reference signal defined over the m binding sites; then the value of $w$ at which the median of the m correlation values between the m vectors $[P_{i1}, P_{i2}, \ldots, P_{in}]$ and $[P_1, P_2, \ldots, P_n]$ is largest is the optimal value.