CN102622534B

CN102622534B - A DNA high-throughput sequencing data correction method for gene expression detection

Info

Publication number: CN102622534B
Application number: CN201210104293.9A
Authority: CN
Inventors: 冯伟兴; 宋艳霞; 贺波; 栾兴桃; 王科俊; 刘晓龙; 赵拓; 李双林
Original assignee: Harbin Engineering University
Current assignee: Guangdong Tengfei Gene Polytron Technologies Inc
Priority date: 2012-04-11
Filing date: 2012-04-11
Publication date: 2015-09-30
Anticipated expiration: 2032-04-11
Also published as: CN102622534A

Abstract

The invention belongs to the field of molecular biological information detection. Specifically, it is a correction method for improving the accuracy of gene expression detection data obtained by DNA high-throughput sequencing. The method comprises the following steps: (1) collecting gene expression DNA sequencing detection data and establishing a correction model for gene expression DNA high-throughput sequencing detection data; (2) collecting gene expression values measured by gene chips; (3) using correlation analysis to determine the model parameter of the correction model; (4) generating corrected gene expression values with the correction model once the model parameter value has been determined. The invention uses the correction model to estimate and compensate for the sequence alignment mapping error present in DNA sequencing values, which reduces the detection error and improves detection accuracy while retaining the high resolution and high precision of DNA high-throughput sequencing detection data.

Description

A DNA high-throughput sequencing data correction method for gene expression detection

Technical field

The invention belongs to the field of molecular biological information detection. Specifically, it is a correction method for improving the accuracy of gene expression detection data obtained by DNA high-throughput sequencing.

Background

With advances in the experimental techniques of information science, the experimental means of obtaining molecular biological information are changing rapidly. Among them, DNA high-throughput sequencing, an epoch-making technology for detecting molecular biological information, is capable of genuinely achieving high-resolution, high-precision detection of genome-wide gene expression information.

DNA high-throughput sequencing detects gene expression by directly sequencing the target nucleotide sequences that reflect gene expression, then locating each target sequence in the reference genome through sequence alignment mapping, and thereby obtaining the gene expression information associated with that position. Because the target nucleotide sequences are sequenced directly, high-throughput DNA sequencing greatly improves the resolution and precision of gene expression detection. However, because the sequencing results must pass through sequence alignment mapping before they become meaningful gene expression information, the measurement is indirect and carries an inherent error of principle: some sequencing results cannot be mapped back to the reference genome, so the detected information is in error. This error makes the detected value smaller than the actual value.

Summary of the invention

The purpose of the invention is to provide a DNA high-throughput sequencing data correction method that compensates for the inherent sequence alignment mapping error arising when DNA sequencing data are generated in gene expression detection, and that yields more accurate gene expression detection while retaining high resolution and high precision.

The purpose of the invention is achieved as follows:

The DNA high-throughput sequencing data correction method comprises the following steps:

(1) Collect gene expression DNA sequencing detection data and establish a correction model for gene expression DNA high-throughput sequencing detection data:

Z_i = (1 + β × 1/C_i) × Y_i

where Y_i is the measured high-throughput sequencing expression value of the i-th gene, Z_i is the corrected expression value of the i-th gene, C_i is the conservation value of the DNA region where the gene is located, and β is the model parameter;

(2) Collect the gene expression values measured by the gene chip;

(3) Determine the model parameter of the correction model by correlation analysis: compute the degree of correlation between the gene expression values produced by the correction model and the gene expression values measured by the gene chip, and take the value of the model parameter β at which the correlation is greatest;

(4) With the model parameter value determined, use the correction model to generate the corrected gene expression values.

The beneficial effects of the invention are as follows:

The invention uses the correction model to estimate and compensate for the sequence alignment mapping error present in DNA sequencing values, which reduces the detection error and improves detection accuracy while retaining the high resolution and high precision of DNA high-throughput sequencing detection data.

Description of the drawings

Fig. 1 is a schematic flow chart of the invention;

Fig. 2 is the distribution of the conservation values of the target genes;

Fig. 3 shows the parameter optimization curves of the correction model.

Detailed description

The method of the invention is implemented as follows:

First, the inherent error that indirect alignment mapping introduces into DNA high-throughput sequencing detection data is analyzed, and a correction model for gene expression DNA high-throughput sequencing detection data is built to target that error;

Then, using correlation analysis, the model parameter is determined from data generated by the gene chip method, another high-throughput gene expression detection technique that is complementary in principle to DNA high-throughput sequencing. This yields the final correction model for gene expression DNA high-throughput sequencing detection data. The corrected data produced by the model are more accurate than the detection data before correction.

1. Correction model for gene expression DNA high-throughput sequencing data

When high-throughput DNA sequencing is used to obtain gene expression information, the sequencing data must be mapped to the reference genome. Whenever sequencing data cannot be mapped to the reference genome, for whatever reason, a sequencing error arises. Consequently, expression values detected by high-throughput DNA sequencing tend to be smaller than the actual values. The main source of this error is that, when the corresponding gene region contains many repetitive sequences, the sequencing data fail to map because the mapping is not one-to-one. The more repetitive sequences the gene region contains, the more serious the error.

On this basis, the correction model for gene expression DNA high-throughput sequencing data established by this method is given by formula (1):

Z_i = (1 + β × 1/C_i) × Y_i    (1)

where Y_i is the measured high-throughput sequencing expression value of the i-th gene, Z_i is the corrected expression value of the i-th gene, C_i is the conservation value of the DNA region where the gene is located, and β is the model parameter. According to the theory of biological evolution, the higher the conservation value of a DNA region, the lower its base repetitiveness. The conservation value is therefore used here to reflect the degree of base repetition in the DNA region.

In the correction model, the generated value Z_i is always greater than the measured value Y_i. This reflects the fact that mapping failures in high-throughput sequencing make the measured value Y_i systematically smaller than the true value. In addition, the generated value Z_i varies inversely with the conservation value C_i: the larger C_i is, the closer Z_i is to Y_i. This is consistent with the requirement that the larger the conservation value, the smaller the adjustment should be.
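
As a minimal sketch of formula (1), the correction can be written in Python as below. The names `correct_expression`, `y`, `c` and `beta` are illustrative and do not appear in the patent, and rejecting non-positive conservation values is an added assumption, since the patent does not say how such values are handled.

```python
def correct_expression(y, c, beta):
    """Apply Z_i = (1 + beta / C_i) * Y_i element-wise.

    y    : measured sequencing expression values Y_i
    c    : conservation values C_i of the matching DNA regions
    beta : model parameter (beta >= 0 keeps Z_i >= Y_i for Y_i >= 0)
    """
    if len(y) != len(c):
        raise ValueError("y and c must have the same length")
    if any(ci <= 0 for ci in c):
        # Assumption: the patent does not specify this case.
        raise ValueError("conservation values must be positive")
    return [(1.0 + beta / ci) * yi for yi, ci in zip(y, c)]


# Two genes with the same measured value but different conservation values:
# the weakly conserved region (C = 2) receives the larger upward adjustment.
z = correct_expression([10.0, 10.0], [2.0, 20.0], beta=1.0)
```

With β = 0 the function returns the measured values unchanged, which matches the uncorrected case used as the baseline in the experiment.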

2. Determining the parameter of the correction model

The gene chip method is another high-throughput technique for detecting gene expression. Although its resolution and detection precision are lower than those of DNA high-throughput sequencing, it detects gene expression directly and involves no sequence alignment. Gene chip detection data are therefore used here to correct the gene expression detection data from DNA high-throughput sequencing. Specifically, correlation analysis is used to obtain the model parameter of the correction model: different values of the model parameter β yield different corrected sequencing values, and for each of them the degree of correlation between the corrected values and the expression values measured by the gene chip is computed. The value of β at which the correlation reaches its maximum is the optimal value, and the corresponding model generates more accurate gene expression sequencing data.
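
The parameter search described above can be sketched as a simple scan over candidate β values. This is an illustration under stated assumptions, not the patent's implementation: the patent says only "correlation analysis", so the use of the Pearson coefficient is an assumption, as are the names `pearson` and `fit_beta`, the candidate grid, and the synthetic data.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fit_beta(y, c, chip, betas):
    """Return (beta, correlation) for the candidate beta that maximizes the
    correlation between corrected sequencing values and chip values."""
    best_beta, best_r = None, -math.inf
    for beta in betas:
        z = [(1.0 + beta / ci) * yi for yi, ci in zip(y, c)]
        r = pearson(z, chip)
        if r > best_r:
            best_beta, best_r = beta, r
    return best_beta, best_r


# Synthetic demonstration data (not from the patent): the chip values act as
# the reference, and y is constructed so that beta = 2.0 undoes the bias.
chip = [5.0, 1.0, 4.0, 2.0, 3.0]
c = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [t / (1.0 + 2.0 / ci) for t, ci in zip(chip, c)]
betas = [0.5 * k for k in range(11)]  # candidates 0.0, 0.5, ..., 5.0
best_beta, best_r = fit_beta(y, c, chip, betas)
```

A grid scan is used here because it mirrors the β-versus-correlation curve that the patent plots; a finer grid or a one-dimensional optimizer could replace it.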

3. Experimental test

3.1 Data acquisition

1) Sequencing data

ChIP-seq, a DNA high-throughput sequencing technique, can measure and count the amount of Pol II protein within gene transcription regions and so directly reflects the level of gene transcription. The technique first uses ultrasound to break the DNA strands into fragments, then uses a specific antibody to capture the Pol II protein bound to the DNA fragments, then uses immunoprecipitation (IP) to isolate the antibody-bound DNA fragments, and then sequences (seq) all isolated fragments and maps them back to the DNA by sequence alignment. Finally, using the positions of the gene transcription regions on the DNA, the amount of Pol II protein within each transcription region can be measured and counted.

This experiment uses four sets of Pol II sequencing data for gene promoter regions, covering two cell types, ordinary and drug-resistant MCF7 breast cancer cells, before and after drug treatment. The data were obtained with ChIP-seq by measuring and counting the amount of Pol II protein within the gene promoter regions, and directly reflect the level of gene expression.

2) Gene chip data

This experiment uses gene expression data obtained with gene chip (ChIP-chip) technology for correlation analysis against the gene expression sequencing data. These data comprise four sets of gene expression detection data for the same two cell types, ordinary and drug-resistant MCF7 breast cancer cells, before and after drug treatment. The gene chip is the Affymetrix Human Genome U133 Plus 2.0 Array, which can detect expression information for 38,500 human genes in a single run.

3) Gene conservation value data

The DNA nucleotide sequence conservation value data used in this experiment were downloaded from UCSC, a large public bioinformatics database. The conservation values were generated by aligning the genome nucleotide sequences of 44 vertebrate species with the human genome nucleotide sequence.

4) Gene sequence data

The DNA nucleotide sequence data used in this experiment were also downloaded from UCSC.

3.2 Correction of DNA high-throughput sequencing data

First, the gene expression sequencing data, the gene chip data and the conservation value data were checked for completeness, which gave 9424 genes for which all three kinds of information were available.
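
The patent does not say how completeness was assessed. One plausible reading, sketched below as an assumption, is to keep only the genes that have a usable value in all three data sets. The function name, the dictionary layout and the gene identifiers are hypothetical.

```python
def genes_with_complete_data(seq, chip, conservation):
    """Return the sorted gene IDs that have a non-missing value in all three
    tables. Each table maps gene ID -> value, with None marking a gap."""

    def usable(table):
        return {gene for gene, value in table.items() if value is not None}

    return sorted(usable(seq) & usable(chip) & usable(conservation))


# Hypothetical toy tables, not data from the patent.
seq = {"GENE_A": 12.0, "GENE_B": 3.0, "GENE_C": None}
chip = {"GENE_A": 7.1, "GENE_B": 2.2, "GENE_C": 5.0}
conservation = {"GENE_A": 310.5, "GENE_C": 95.0}

complete = genes_with_complete_data(seq, chip, conservation)
```

In this toy case only `GENE_A` survives: `GENE_B` has no conservation value and `GENE_C` has no sequencing value.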

Next, the conservation values of these gene regions were calculated from the DNA nucleotide sequence conservation value data. Because the sequencing data used in this experiment measure and count the amount of Pol II protein within gene promoters, the sum of the conservation values over the same promoter region is used as the conservation value of that region. The resulting distribution of conservation values is shown in Fig. 2, where the horizontal axis is the conservation value and the vertical axis is the frequency.
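
The region-level conservation value described above is a sum of conservation values over the promoter region. A minimal sketch follows, assuming the per-base values are held in a list and the region is given as a half-open index range; both conventions and the name `region_conservation` are illustrative choices, not taken from the patent.

```python
def region_conservation(scores, start, end):
    """Sum per-base conservation values over the half-open range [start, end)."""
    if not 0 <= start <= end <= len(scores):
        raise ValueError("region lies outside the scored sequence")
    return sum(scores[start:end])


# Hypothetical per-base values, not data from the patent.
per_base = [0.25, 0.5, 0.75, 1.0, 0.0, 0.5]
promoter_value = region_conservation(per_base, 1, 4)  # 0.5 + 0.75 + 1.0
```

The resulting sum is the C_i that enters the correction model for the gene whose promoter occupies that region.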

Finally, the gene expression DNA high-throughput sequencing data were processed with the method of the invention. During processing, the model parameter β was optimized using the correlation between the gene expression sequencing data and the gene chip detection data. The optimization is shown in Fig. 3, which covers cells under four experimental conditions: A, ordinary breast cancer cells before drug treatment; B, ordinary breast cancer cells after drug treatment; C, drug-resistant breast cancer cells before drug treatment; D, drug-resistant breast cancer cells after drug treatment. In Fig. 3, the horizontal axis is the value of the model parameter β, and the vertical axis is the degree of correlation between the corrected sequencing data and the gene chip data. As β increases from 0, the correlation rises rapidly and reaches a maximum at a certain value of β; as β increases further, the correlation falls again because of over-correction. The figure shows that the correlation between the corrected sequencing data and the gene chip data is clearly higher than without correction, which indicates that correcting the sequencing data with the proposed method gives more reasonable results. The model corresponding to the optimal value of β is the final correction model for the sequencing data. Table 1 lists the optimal values of the model parameter β obtained with this method for the four sets of sequencing data from the two cell types, ordinary and drug-resistant MCF7 breast cancer cells, before and after drug treatment.

Table 1. Optimal parameter values of the correction model

Because the base sequencing results obtained by DNA sequencing must be aligned and mapped against the base sequence of the reference genome before they become meaningful gene expression information, errors arise in the detected information when some sequencing results cannot be mapped back to the reference genome because the mapping is not one-to-one. According to the theory of biological evolution, the higher the conservation value of a DNA region, the lower its base repetition rate and the higher the mapping success rate of the sequencing data in that region. The model therefore uses the conservation value to reflect the degree of base repetition in a DNA region and the alignment mapping error that results from it.

Gene chip technology, another genome-wide technique for detecting gene expression, has lower detection resolution than DNA sequencing but involves no alignment mapping step. The invention therefore performs correlation analysis on gene expression detection data produced by two different, independent channels, DNA sequencing and gene chips, to determine the parameter of the correction model and ultimately to correct the DNA sequencing gene expression detection data.

Claims

1. A DNA high-throughput sequencing data correction method for gene expression detection, characterized in that it comprises the following steps:

(1) Collect gene expression DNA sequencing detection data and establish a correction model for gene expression DNA high-throughput sequencing detection data:

Z_i = (1 + β × 1/C_i) × Y_i

where Y_i is the measured high-throughput sequencing expression value of the i-th gene, Z_i is the corrected expression value of the i-th gene, C_i is the conservation value of the DNA region where the gene is located, and β is the model parameter;

(2) Collect the gene expression values measured by the gene chip;

(3) Determine the model parameter of the correction model by correlation analysis: compute the degree of correlation between the gene expression values produced by the correction model and the gene expression values measured by the gene chip, and take the value of the model parameter β at which the correlation is greatest;

(4) With the model parameter value determined, use the correction model to generate the corrected gene expression values.