CN104131093B

CN104131093B - DNase high-throughput sequencing detection signal processing method of DNA protein binding site

Info

Publication number: CN104131093B
Application number: CN201410352942.6A
Authority: CN
Inventors: 冯伟兴; 廉德源; 刘晓龙; 宋锋飞; 贺波
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2014-07-23
Filing date: 2014-07-23
Publication date: 2015-12-09
Anticipated expiration: 2034-07-23
Also published as: CN104131093A

Abstract

The invention discloses a method for preprocessing DNase high-throughput sequencing detection signals of DNA protein binding sites. The method comprises the following steps: obtaining basic gene information together with DNase-Seq and ChIP-Seq high-throughput sequencing detection data of DNA protein binding sites; assessing the quality of the DNase-Seq data and screening out credible sequencing data; retaining, for each credible read, only the sequencing start position that directly reflects the protein binding site; obtaining the DNase-Seq detection sample data set; normalizing the sample data set; subdividing the sample data set; and summing the data of the two resulting subsets vertically, from the front and from the back respectively, to complete the operation. The invention greatly improves the recognition accuracy and recognition resolution of DNA protein binding sites.

Description

DNase high-throughput sequencing detection signal processing method of DNA protein binding site

Technical field

The invention belongs to the field of processing detection signals of DNA protein binding sites, and in particular relates to a method for processing DNase high-throughput sequencing detection signals of DNA protein binding sites.

Background art

At present, DNA protein binding sites are detected mainly by chromatin immunoprecipitation (ChIP). ChIP-Seq, which combines ChIP experiments with high-throughput sequencing, can efficiently detect, genome-wide, the DNA segments bound by a specific functional protein. The principle of ChIP-Seq is as follows. First, chromatin immunoprecipitation uses an enzyme that binds specifically to the target protein to enrich the DNA fragments bound by that protein, which are then purified and used for library construction. The enriched fragments are then sequenced at high throughput, and the millions of reads obtained are mapped precisely onto the genome, giving genome-wide information on the DNA segments bound by the target protein; various analysis algorithms then derive accurate DNA binding sites of the target protein. Although methods for analysing DNA protein binding sites from ChIP-Seq data are very mature, the technique has shortcomings. First, the binding enzyme used to enrich the target protein is specific, so some proteins cannot be detected because no suitable specific binding enzyme can be found. Second, one experiment detects only one protein, which is time-consuming, labor-intensive and costly, and prevents large-scale use. Third, and more importantly, the DNA fragments obtained experimentally are long and only part of each end can be sequenced; because the sequenced region is not the binding site itself, the positioning resolution of the binding site reaches at best a few tens of bases, even though the sequencing data themselves have single-base resolution.

In response to these problems, a new detection technique has emerged in recent years: DNA protein binding site detection based on DNase high-throughput sequencing, namely DNase-Seq. Also called DNase footprinting, it can precisely identify the binding sites of DNA-binding proteins on DNA molecules. Its principle is as follows. The DNA is first digested with DNase nuclease. DNA regions without bound protein are cut randomly and uniformly by the nuclease, whereas regions with bound protein are specifically protected from cutting because the bound protein blocks the enzyme. The digested fragments are then purified, used for library construction and sequenced, giving genome-wide cleavage information for the DNase nuclease. In this cleavage information the signal at protein binding sites is specifically weakened, as if footprints had been left on the DNA, so the binding sites can be identified precisely. Compared with ChIP-Seq, DNase-Seq has prominent advantages. First, because it is not protein-specific, DNase-Seq can detect the binding sites of all DNA proteins genome-wide in a single experiment. Second, for the same reason, it greatly improves detection efficiency and reduces cost, making large-scale detection of DNA protein binding sites possible. Third, and more importantly, because the sequencing start position of a DNase-Seq read is the cleavage position, the detection resolution for DNA protein binding sites can reach a single base, which is very helpful for subsequent research. It is therefore necessary to process DNase-Seq signals and to study and analyse them in depth.

Since the technique of using DNase to detect DNA protein binding sites precisely was proposed, it has mostly been used to provide accurate experimental verification for research results on DNA binding sites. The experimental data were mostly analysed by simple counting and presented as intuitive plots. In 2006, Crawford and Sabo simultaneously proposed the DNase-Chip technique in Nature Methods, using microarrays to measure DNase detection signals at high throughput, which opened the stage of large-scale, genome-scale detection and analysis of protein binding sites with DNase. In 2010, Crawford further proposed DNase-Seq, which detects protein binding sites genome-wide.

After DNase-Seq was proposed, many analysis methods followed. In 2010, Chen analysed DNA protein binding sites from DNase-Seq data with a dynamic Bayesian network. In 2011, Fletez analysed DNA protein binding sites from DNase-Seq data with a support vector machine. In 2012, Pique proposed the CENTIPEDE method, which identifies DNA protein binding sites by using a rigorous statistical model to analyse the statistical characteristics of the DNase-Seq data at those sites. In 2013, Jason proposed the Wellington method, which identifies DNA protein binding sites from the characteristics of DNase-Seq data when the chromosomal region of the binding site is open. In 2014, Sherwood identified DNA protein binding sites from the distinctive magnitude and shape features of the binding site region.

However, all of the analysis methods proposed so far preprocess DNase-Seq detection data in the same way as ChIP-Seq detection data. In fact, because the detection principles differ markedly, DNase-Seq data resolve DNA protein binding sites far more finely than ChIP-Seq data.

Summary of the invention

The purpose of the invention is to provide a method for processing DNase high-throughput sequencing detection signals of DNA protein binding sites that improves the recognition accuracy and recognition resolution of those sites.

The invention is achieved through the following technical solution:

A method for preprocessing DNase high-throughput sequencing detection signals of DNA protein binding sites, comprising the following steps:

Step 1: Obtain basic gene information, comprising the base sequence of the genome and the positions of genes on the DNA, and obtain DNase-Seq and ChIP-Seq high-throughput sequencing detection data of DNA protein binding sites;

Step 2: Assess the quality of the DNase-Seq high-throughput sequencing detection data, screen out credible sequencing data whose base quality scores are 20 or above, and find by mapping where each credible read originates in the genome;

Step 3: For each credible read, retain only the sequencing start position, which directly reflects the protein binding site, to obtain the updated DNase-Seq data;

Step 4: At each DNA base position, count the updated DNase-Seq data points and take the count as the DNase-Seq detection value of that position; use the ChIP-Seq data to find the regions that contain a binding site of the DNA protein of interest, and extract the complete DNase-Seq detection values within each region to obtain the DNase-Seq detection sample data set;

Step 5: Normalize the DNase-Seq detection sample data set, that is, divide the DNase-Seq detection value of each DNA base position by the sum of the DNase-Seq detection values of all DNA base positions in the DNase-Seq detection sample data set;

Step 6: Subdivide the DNase-Seq detection sample data set;

Divide the DNase-Seq detection sample data set into a positive-strand positive sequencing subset, a positive-strand negative sequencing subset, a negative-strand positive sequencing subset and a negative-strand negative sequencing subset; merge the positive-strand positive and negative-strand negative sequencing subsets, by correlation alignment, into the front detection data subset of the DNA protein binding site; and merge the positive-strand negative and negative-strand positive sequencing subsets, by correlation alignment, into the back detection data subset of the DNA protein binding site;

Step 7: Sum the data in the front detection data subset and in the back detection data subset vertically, from the front and back directions respectively, to complete the operation.

Beneficial effects of the invention:

Because the detection principles differ markedly, DNase-Seq data resolve DNA protein binding sites far more finely than ChIP-Seq data. The invention provides a processing method tailored to DNase-Seq detection data, which greatly improves the recognition accuracy and recognition resolution of DNA protein binding sites.

At the same time, processing the DNase-Seq detection data highlights the detection information of DNA protein binding sites, laying the foundation for subsequent high-precision identification of protein binding sites.

Description of the drawings

Fig. 1: Block diagram of the method for preprocessing DNase high-throughput sequencing detection signals of DNA protein binding sites;

Fig. 2: Operation flow of the conventional preprocessing;

Fig. 3: Operation flow of the special preprocessing;

Fig. 4: DNase-Seq positive-strand positive sequencing detection information of the DNA protein binding site, ATF1 protein in K562 cells;

Fig. 5: DNase-Seq negative-strand negative sequencing detection information of the DNA protein binding site, ATF1 protein in K562 cells;

Fig. 6: DNase-Seq positive-strand negative sequencing detection information of the DNA protein binding site, ATF1 protein in K562 cells;

Fig. 7: DNase-Seq negative-strand positive sequencing detection information of the DNA protein binding site, ATF1 protein in K562 cells;

Fig. 8: DNase-Seq front detection information of the DNA protein binding site, ATF1 protein in K562 cells;

Fig. 9: DNase-Seq back detection information of the DNA protein binding site, ATF1 protein in K562 cells.

Detailed description of the embodiments

The invention is described in further detail below with reference to the drawings.

DNase high-throughput sequencing detection data of DNA protein binding sites are obtained, analysed in depth and processed in a targeted way, with the final aim of highlighting the detection information of protein binding sites. As shown in Fig. 1, the method comprises the following steps:

1. Data acquisition

Basic information on the genome base sequence was obtained from the UCSC bioinformatics website, and basic information on DNA proteins from the TRANSFAC database of BIOBASE. The DNase-Seq and ChIP-Seq detection data of DNA protein binding sites both come from data generated by the ENCODE project and published on the UCSC website. The DNase-Seq data were downloaded from the ENCODE website, from the hg19 data provided by the DUKE and UW laboratories; the ChIP-Seq data were downloaded from the data provided by the SYDH laboratory.

2. Conventional preprocessing

As shown in Fig. 2, because DNase-Seq detection of DNA protein binding sites uses short-read sequencing, the data are generally produced on Illumina's high-throughput sequencing platform. The data generated by this platform follow the FASTQ format, in which each read is stored in four lines. The first line carries information from the sequencing platform and begins with "@"; the second line is the sequence; the third line begins with "+", optionally followed by the same content as the first line; the fourth line holds the base-call quality scores of the sequence, one score per base. Credible sequencing data are required to have a quality score of 20 or above at every base, which means the probability that any given base is miscalled is 1% or less.
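The per-base quality screen described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes Phred+33 quality encoding (the text does not state the encoding) and well-formed four-line records, and all function names are hypothetical. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 corresponds to 1%.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33       # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding
MIN_BASE_QUALITY = 20   # threshold stated in the text


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from four-line FASTQ records."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, seq, _plus, qual = rows[i:i + 4]
        yield header[1:], seq, qual  # drop the leading '@'


def all_bases_reliable(qual: str, min_q: int = MIN_BASE_QUALITY) -> bool:
    """True if every base in the read has a quality score of at least min_q."""
    return all(ord(c) - PHRED_OFFSET >= min_q for c in qual)


def filter_reads(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only reads in which every base passes the quality threshold."""
    return [(rid, seq) for rid, seq, qual in parse_fastq(lines)
            if all_bases_reliable(qual)]
```

Rejecting a read on a single low-quality base follows the wording of the text ("every base"); a real pipeline might instead trim low-quality ends, which the text does not discuss.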

Next comes mapping, which finds where each read originates in the genome. Because reads from the Illumina platform have uniform length, the BWA alignment software is suitable for mapping them to the genome.

After mapping, the mapping quality of each read must also be analysed. Credible reads are required to have a mapping quality (MAPQ) score of 20 or above, meaning the probability of a wrong mapping is 1% or less, and at most 2 mismatched bases.
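A minimal sketch of this mapping-quality screen is given below. The `AlignedRead` record and its field names are hypothetical stand-ins for whatever the aligner reports (for example the MAPQ column and a mismatch count in its output); only the two thresholds come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedRead:
    """Hypothetical record for one mapped read."""
    chrom: str
    pos: int         # leftmost mapped coordinate
    strand: str      # '+' or '-'
    length: int      # read length in bases
    mapq: int        # mapping quality reported by the aligner
    mismatches: int  # number of mismatched bases in the alignment


def is_credible(read: AlignedRead, min_mapq: int = 20,
                max_mismatches: int = 2) -> bool:
    """Apply the two thresholds from the text: MAPQ >= 20 and <= 2 mismatches."""
    return read.mapq >= min_mapq and read.mismatches <= max_mismatches


def keep_credible(reads: Iterable[AlignedRead]) -> List[AlignedRead]:
    """Return only the reads that pass both thresholds."""
    return [r for r in reads if is_credible(r)]
```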

3. Special preprocessing

As shown in Fig. 3, first, because the sequencing start position of a DNase-Seq read is the cleavage position and reflects the DNA protein binding site at single-base resolution, each DNase-Seq read can be reduced from a line to a point during preprocessing; that is, only the sequencing start position, which directly reflects the protein binding site, is retained, which effectively highlights the binding site information of the DNA protein.
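The line-to-point reduction can be sketched as below. The text does not say how the start position is computed for reads on the reverse strand; this sketch assumes that alignments are reported by their leftmost coordinate, so the sequencing start (5' end) of a reverse-strand read is its rightmost base. The function name is illustrative.

```python
def cut_site(leftmost: int, length: int, strand: str) -> int:
    """Return the sequencing start position (5' end) of a mapped read.

    Assumption: `leftmost` is the smallest genomic coordinate covered by the
    read, so a reverse-strand read starts at its rightmost base.
    """
    if strand == "+":
        return leftmost
    if strand == "-":
        return leftmost + length - 1
    raise ValueError(f"unknown strand: {strand!r}")
```

For example, a 36-base read whose leftmost coordinate is 100 reduces to position 100 on the positive strand and to position 135 on the negative strand.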

Second, after the DNase-Seq data have been reduced from lines to points, the number of DNase-Seq data points at each DNA base position is counted and taken as the DNase-Seq detection value of that position. The ChIP-Seq data are then used to find regions that contain a binding site of the DNA protein of interest. Where such a region exists (its resolution is only a few tens of bases), the complete DNase-Seq detection values of the region are extracted, and the DNA protein binding site is located precisely from the base preference of the protein, to single-base resolution. This yields a DNase-Seq detection sample data set in which every sample is known to contain a binding site of the DNA protein.
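Counting data points per base and extracting the values of one region can be sketched as follows. The sketch handles a single chromosome and takes the region boundaries as given; how a region is derived from ChIP-Seq data, and how the site is pinpointed from base preference, are not shown. Function names are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Mapping


def count_cut_sites(sites: Iterable[int]) -> Counter:
    """Detection value per base position: number of read start points there."""
    return Counter(sites)


def extract_window(counts: Mapping[int, int], start: int, end: int) -> List[int]:
    """Detection values for every position in [start, end], zeros included."""
    return [counts.get(pos, 0) for pos in range(start, end + 1)]
```

Positions with no read start are kept as explicit zeros so that every sample covers its region completely, which later steps rely on when samples are compared position by position.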

In addition, so that different samples do not contribute unequally to subsequent analysis, the sample data are normalized: the DNase-Seq detection value of each base position in a sample is divided by the sum of the DNase-Seq detection values of all base positions in that sample region.
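The normalization amounts to dividing each value by the sample total, as in this minimal sketch. Returning all zeros for a sample with no data points is an assumption made here to avoid division by zero; the text does not address that case.

```python
from typing import List, Sequence


def normalize(sample: Sequence[float]) -> List[float]:
    """Divide each detection value by the sum over the whole sample region."""
    total = sum(sample)
    if total == 0:
        # Assumption: an empty sample stays all zeros.
        return [0.0] * len(sample)
    return [value / total for value in sample]
```

After this step the values of every non-empty sample sum to 1, so a deeply sequenced region cannot dominate the later summation.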

The sample set is then subdivided. First, the samples are divided into two classes according to whether the DNA protein binding site lies on the positive or the negative DNA strand. Second, each class is divided into positive-strand and negative-strand parts according to the direction of its sequencing reads. All samples in the set thus fall into four parts: positive-strand positive sequencing, positive-strand negative sequencing, negative-strand positive sequencing and negative-strand negative sequencing. Positive-strand positive sequencing denotes the detection values obtained from the positive strand when the binding site is on the positive strand; positive-strand negative sequencing denotes the detection values obtained from the negative strand when the binding site is on the positive strand; negative-strand positive sequencing denotes the detection values obtained from the positive strand when the binding site is on the negative strand; and negative-strand negative sequencing denotes the detection values obtained from the negative strand when the binding site is on the negative strand.

According to the biological detection mechanism of DNase-Seq at DNA protein binding sites, positive-strand positive and negative-strand negative sequencing both detect the binding site from the front, whereas positive-strand negative and negative-strand positive sequencing both detect it from the back. Positive-strand positive and negative-strand negative sequencing can therefore be merged by correlation alignment into the front detection data subset of the DNA protein binding site, and positive-strand negative and negative-strand positive sequencing can be merged by correlation alignment into the back detection data subset.
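The four-way subdivision and the front/back grouping can be sketched as below. Only the labelling and grouping are shown; the correlation alignment step is left out because the text does not spell out how it is performed. The tuple layout and function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

_NAMES = {"+": "positive", "-": "negative"}


def subset_label(site_strand: str, read_strand: str) -> str:
    """Name of the subset, e.g. 'positive-strand negative sequencing'."""
    return f"{_NAMES[site_strand]}-strand {_NAMES[read_strand]} sequencing"


def view(site_strand: str, read_strand: str) -> str:
    """'front' when site and read strands agree, otherwise 'back'."""
    return "front" if site_strand == read_strand else "back"


def group_by_view(
    samples: Iterable[Tuple[str, str, Sequence[float]]]
) -> Dict[str, List[Sequence[float]]]:
    """Group (site_strand, read_strand, profile) samples into front and back."""
    groups: Dict[str, List[Sequence[float]]] = {"front": [], "back": []}
    for site_strand, read_strand, profile in samples:
        groups[view(site_strand, read_strand)].append(profile)
    return groups
```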

Finally, all samples are summed vertically, in the front and back directions separately, so that, with noise statistically averaged out, the overall DNase-Seq detection information at the DNA protein binding site is highlighted. This information can be used to extract the DNase-Seq signal pattern specific to the DNA protein, which can then be used in subsequent experiments for high-precision, high-resolution genome-wide identification and detection of that protein's binding sites.
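Vertical summation is a position-by-position sum across the samples of one subset, as in this minimal sketch. It assumes the samples have already been aligned and have equal length; the function name is illustrative.

```python
from typing import List, Sequence


def vertical_sum(profiles: Sequence[Sequence[float]]) -> List[float]:
    """Sum aligned sample profiles position by position."""
    if not profiles:
        return []
    width = len(profiles[0])
    if any(len(p) != width for p in profiles):
        raise ValueError("all profiles must have the same length")
    return [sum(column) for column in zip(*profiles)]
```

The function would be called once on the front subset and once on the back subset, giving one aggregate profile for each direction.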

4. Experimental verification

DNase-Seq and ChIP-Seq detection data for the DNA binding sites of the DNA-binding protein ATF1 in the K562 cell line were obtained from the public UCSC bioinformatics website. Figs. 4 to 9 show the DNase-Seq detection data of ATF1 after processing with the preprocessing method of the invention. The horizontal axis is the base position within the DNA region containing the binding site, with the 5' end of the region on the left and the 3' end on the right; the vertical axis is the DNase-Seq detection value. For clarity, as shown in Fig. 4 and Fig. 5, negative-strand negative sequencing and positive-strand negative sequencing are flipped horizontally relative to the conventional display. The region between the two vertical lines in each figure is the ATF1 binding site (DNA protein binding is directional, running from the upstream 5' end of the DNA to the downstream 3' end).

Figs. 4 to 7 show clearly that, viewed from the 5' end to the 3' end of the DNA, the DNase-Seq detection signals at the binding site are essentially the same between positive-strand positive and negative-strand negative sequencing, and likewise between positive-strand negative and negative-strand positive sequencing; but the signals of the first pair (positive-strand positive, negative-strand negative) differ from those of the second pair (positive-strand negative, negative-strand positive). This indicates that subdividing the DNase-Seq samples effectively highlights the detection information of the DNA protein binding site and achieves the purpose of preprocessing.

Finally, the positive-strand positive and negative-strand negative sequencing data are aligned by correlation and added, to reflect detection of the binding site from the front, as shown in Fig. 8; correspondingly, the positive-strand negative and negative-strand positive sequencing data are aligned by correlation and added, to reflect detection of the binding site from the back, as shown in Fig. 9. This processing further highlights the detection information of the DNA protein binding site.

The experimental results show that the DNase-Seq signal preprocessing method proposed by the invention effectively highlights the detection information of DNA protein binding sites, laying a good foundation for the subsequent precise extraction of binding site recognition patterns and for the identification of DNA protein binding sites.

Claims

1. A method for preprocessing DNase high-throughput sequencing detection signals of DNA protein binding sites, characterized in that it comprises the following steps:

Step 1: Obtain basic gene information, comprising the base sequence of the genome and the positions of genes on the DNA, and obtain DNase-Seq and ChIP-Seq high-throughput sequencing detection data of DNA protein binding sites;

Step 2: Assess the quality of the DNase-Seq high-throughput sequencing detection data, screen out credible sequencing data whose base quality scores are 20 or above, and find by mapping where each credible read originates in the genome;

Step 3: For each credible read, retain only the sequencing start position, which directly reflects the protein binding site, to obtain the updated DNase-Seq data;

Step 4: At each DNA base position, count the updated DNase-Seq data points and take the count as the DNase-Seq detection value of that position; use the ChIP-Seq data to find the regions that contain a binding site of the DNA protein of interest, and extract the complete DNase-Seq detection values within each region to obtain the DNase-Seq detection sample data set;

Step 5: Normalize the DNase-Seq detection sample data set, that is, divide the DNase-Seq detection value of each DNA base position by the sum of the DNase-Seq detection values of all DNA base positions in the DNase-Seq detection sample data set;

Step 6: Subdivide the DNase-Seq detection sample data set;

Divide the DNase-Seq detection sample data set into a positive-strand positive sequencing subset, a positive-strand negative sequencing subset, a negative-strand positive sequencing subset and a negative-strand negative sequencing subset; merge the positive-strand positive and negative-strand negative sequencing subsets, by correlation alignment, into the front detection data subset of the DNA protein binding site; and merge the positive-strand negative and negative-strand positive sequencing subsets, by correlation alignment, into the back detection data subset of the DNA protein binding site;

Step 7: Sum the data in the front detection data subset and in the back detection data subset vertically, from the front and back directions respectively, to complete the operation.