CN111640467B

CN111640467B - DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence

Info

Publication number: CN111640467B
Application number: CN202010446416.1A
Authority: CN
Inventors: 牛毅; 马明明; 李甫; 田英轩; 石光明
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-05-25
Filing date: 2020-05-25
Publication date: 2023-03-24
Anticipated expiration: 2040-05-25
Also published as: CN111640467A

Abstract

The invention provides a lossless compression method for DNA sequencing quality scores based on an adaptive coding order, which mainly addresses the low compression ratio caused by the insufficiently accurate prediction models of existing quality score compression methods. The scheme is as follows: 1) extract the quality score data and base data from the FASTQ file through two coded compression blocks P₁ and P₂; 2) compute the mean quality score of each row of the data extracted into the first coded compression block P₁ and quantize it to obtain an M×1 row mean matrix F; 3) collect the context information, base information and row mean information of the coded characters; 4) set two identifiers C and D, and uniformly quantize the information collected in step 3) to construct a coding model; 5) drive an adaptive arithmetic coder with the coding model, and traverse and compress the first coded compression block P₁ in a serpentine coding order along the direction of strongest correlation. The invention improves compression efficiency and can be used for storing and transmitting genetic data.

Description

A Lossless Compression Method for DNA Sequencing Quality Score Based on Adaptive Coding Order

Technical field

The invention belongs to the technical field of data compression, and in particular relates to a lossless compression method for DNA sequencing quality scores, which can be used to compress biological gene sequencing data.

Background art

Sequencing has gradually become a widely used technology in biological research; obtaining the genetic information of different organisms helps improve our understanding of the organic world. With the rapid development of next-generation high-throughput sequencing (NGS), sequencing companies represented by Illumina have continually introduced new sequencing technologies, causing sequencing costs to fall rapidly. The price of human whole-genome sequencing (WGS) has dropped to US$1,000 or even lower, and continues to fall faster than Moore's Law would predict. Under these circumstances, the volume of next-generation sequencing data will exceed that of astronomical data, and the cost of storing and transmitting these data grows accordingly. Reducing the size of sequencing data through compression, and thereby lowering storage and transmission costs, is therefore of great significance. Research on genomic compression tools has produced many results, but no existing scheme considers reducing the bitstream from the perspective of coding order, so compression efficiency still has room for improvement.

Next-generation sequencing produces vast numbers of short reads, which are usually stored in the widely accepted text-based FASTQ format and contain all the information produced by sequencing. Each short read contains three parts: first, metadata describing information such as the sequencing platform; second, the DNA base sequence, recording the DNA fragment obtained in the current read; third, the quality scores, indicating the confidence of each symbol called in the corresponding base sequence. Quality score data in FASTQ format are highly random and noisy, depend on factors such as the sequencing instrument and sequencing method, usually contain dozens of distinct characters, and are difficult to compress; they typically account for about 70% of the compressed file. The compression result for the quality score data therefore has a decisive effect on the compression of the whole FASTQ file.

At present, the typical methods for lossless compression of quality scores in gene sequencing data are mainly the following:

The first uses existing general-purpose text compression tools, such as Gzip and 7z, which are the most common way of compressing FASTQ files. These tools are designed for ordinary character sequences and do not take the distinctive characteristics of quality score data into account, so they perform poorly on gene sequencing data.

The second consists of improved run-length and dictionary methods developed for genomic data compression. In most cases these methods compress worse than entropy coding methods and cannot reduce the compression rate as far as possible.

The third consists of compression algorithms designed specifically for quality scores, such as Quip, which use a high-order Markov model for predictive coding. Although they achieve good compression, they occupy a large amount of storage, the computation of the prediction model is overly complex, and they do not consider the effect of coding order on compression, which leads to long compression times and poor robustness.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the above prior art by proposing a lossless compression method for DNA sequencing quality scores based on an adaptive coding order, so as to maximize the compression effect without increasing compression time.

The technical solution of the present invention is as follows: first, extract the base sequences and quality score sequences from the FASTQ file; then compute and quantize the mean of each row of quality score data, and build a prediction model from the context information, mean information and base information; finally, drive an arithmetic coder in a serpentine coding order to encode the sequences, thereby compressing the quality scores. The specific implementation is as follows:

(1) Extract the quality score data and base data from the FASTQ file:

(1a) Statistically analyze the characteristics of the DNA sequencing data and create two coded compression blocks P₁ and P₂ of size M×N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, and N is the number of columns of a compression block, i.e. the length of a quality score row, with N ≤ 150;

(1b) Extract the quality score data and the base data stored in the FASTQ file into the first coded compression block P₁ and the second coded compression block P₂, respectively;

(2) Compute the mean quality score of each row of the FASTQ data extracted into the first coded compression block P₁ and quantize it, obtaining an M×1 row mean matrix F;

(3) Collect the context information, base information and row mean information of the coded characters, quantize them uniformly, and compute the final coding model:

(3a) Build a model for the current coded character q: take the four preceding characters q₁, q₂, q₃ and q₄; denote the base information in the second coded compression block P₂ corresponding to the current character and to the previous character as j₁ and j₂; and denote the mean of the row containing q, taken from the row mean matrix F, as f, where f is the already quantized value. For edge characters that lack preceding context, q₁, q₂, q₃ and q₄ take the same symbol or are set to zero;

(3b) Reduce the model cost by quantizing the overall model: denote the larger of the first two characters q₁ and q₂ as A and the larger of the last two characters q₃ and q₄ as B, create two distinct identifiers C and D, and compute the final coding model for the current coded symbol:

P_now = A·B·C·D·j₁·j₂·f

where identifier C = 1 when q₁ = q₂ and C = 0 otherwise; D = 1 when q₃ = q₄ and D = 0 otherwise; and P_now is the probability estimate of the current coded symbol;

(4) Use the designed final coding model to drive an adaptive arithmetic coder, and traverse and compress the first coded compression block P₁ in a serpentine coding order along the direction of strongest correlation.

Compared with the prior art, the present invention has the following advantages:

1. Because the invention makes full use of the probability update mechanism of the arithmetic coder, its compression rate for the quality score data of FASTQ files with equal-length reads is better than that of all current algorithms.

2. Because the invention compresses the mean quality score of each row together with the quality score data, the means are easy to collect and access during downstream processing.

3. Because the designed coding model has a simple structure, it is highly portable and easy to optimize further and to integrate into the compression of a whole FASTQ file. It can be widely applied in compression schemes that use this module and has good scalability.

Description of the drawings

Fig. 1 is the implementation flowchart of the present invention;

Fig. 2 is a schematic diagram of the quantization of the quality score row means in the present invention;

Fig. 3 is a schematic diagram of the serpentine scanning order used in the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

Referring to Fig. 1, the implementation steps of the present invention are as follows:

Step 1: extract the quality score data and base data from the FASTQ file.

Gene sequencing produces vast numbers of short reads, which are usually stored in the widely accepted text-based FASTQ format and contain all the information produced by sequencing. In the FASTQ file format, each short read consists of four lines separated by newline characters, where:

The first line starts with the character '@', followed by a unique sequence ID and an optional sequence description; the ID and the description are separated by a space;

The second line is the nucleotide sequence, i.e. the base data, made up only of the five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base;

The third line starts with the character '+', optionally followed by the sequence ID and description again, and acts as a separator;

The last line is the quality score line, in which each character gives the quality of the base at the corresponding position in the second line. A quality score corresponds to the number Q = -10·log₁₀(P), where P is the probability that the corresponding nucleotide in the read was called incorrectly. Quality scores are usually represented by the ASCII characters [33:73] or [64:104] and are used both for quality control of the raw data and for downstream processing.
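The four-line record layout and the relation Q = -10·log₁₀(P) described above can be illustrated with a minimal Python sketch. The function names `read_fastq` and `phred_to_error_prob` are hypothetical and not part of the patent, and the default offset of 33 is an assumption corresponding to the [33:73] character range (use 64 for [64:104]).

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id_and_description, bases, qualities) per 4-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Each quality character corresponds to one base.
        if len(bases) != len(quals):
            raise ValueError(f"length mismatch in record at line {i + 1}")
        yield header[1:], bases, quals


def phred_to_error_prob(ch: str, offset: int = 33) -> float:
    """Invert Q = -10*log10(P): P = 10^(-Q/10), with Q = ord(ch) - offset.

    The offset is an assumption; 33 and 64 match the two ASCII ranges
    mentioned in the text.
    """
    q = ord(ch) - offset
    return 10.0 ** (-q / 10.0)
```

For instance, with offset 33 the character 'I' (ASCII 73) gives Q = 40, i.e. an error probability of 10⁻⁴.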

This step is implemented as follows:

1.1) Statistically analyze the characteristics of the DNA sequencing data and create two coded compression blocks P₁ and P₂ of size M×N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, and N is the number of columns of a compression block, i.e. the length of a quality score row, with N ≤ 150;

1.2) Extract the quality score data and the base data stored in the FASTQ file into the first coded compression block P₁ and the second coded compression block P₂, respectively;

Because the number of distinct quality score characters in most FASTQ files is below 40 and the values do not jump much, a good prediction model can be designed from the correlation among the data to improve compression. At the same time, using too many correlated characters not only raises time and computational complexity but can also incur model cost, so a suitably sized compression block is needed for gathering the correlation statistics among quality scores. Within the limits of available computing resources, the larger the compression block, the better the compression; however, so as not to exceed the maximum memory, this embodiment uses a compression block of 2000000×160. The total number of models is set to 40×40×40×16. During actual compression, one compression block of data is processed at a time until the end of the file.
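The block-wise processing described in Step 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper name `iter_blocks` is hypothetical, reads are assumed to arrive as (bases, qualities) pairs, and each block is held as two parallel lists of strings rather than as a preallocated M×N array.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_blocks(
    records: Iterable[Tuple[str, str]], m: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Group (bases, qualities) pairs into blocks of at most m rows.

    Yields (p1, p2): p1 holds the quality score rows and p2 the matching
    base rows, mirroring the roles of P1 and P2 in the text.
    """
    p1: List[str] = []
    p2: List[str] = []
    for bases, quals in records:
        p1.append(quals)
        p2.append(bases)
        if len(p1) == m:
            yield p1, p2
            p1, p2 = [], []
    # The last block may be shorter than m rows at the end of the file.
    if p1:
        yield p1, p2
```

Processing one `(p1, p2)` pair at a time bounds memory use by the block size, which is the trade-off the paragraph above describes.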

Step 2: compute the mean quality score of each row of the FASTQ data extracted into the first coded compression block P₁ and quantize it, obtaining an M×1 row mean matrix F.

2.1) Average each row of the first coded compression block P₁ of size M×N: add up the N quality score values of the row and divide by N to obtain the mean quality score of that row;

2.2) Quantize the resulting row means and store them:

Referring to Fig. 2, after the mean quality score of each row has been computed, the means are clustered according to their distribution: mean values that occur frequently are subdivided, and those that are both infrequent and low are merged, which benefits coding efficiency. For a specific file, a dedicated quantization scheme could be designed from its mean distribution to achieve the best effect, but this would add both computation and considerable extra running time. This example therefore uses a quantization scheme that is easy to extend and implement: two adjacent mean values are treated as the same case, and the portion with low quality values and low counts is treated as a single case. Summarizing this quantization experience gives the following quantization rules:

If f_i < (num-15), then f_i = (num-15);

If (num-15) ≤ f_i < (num-13), then f_i = (num-13);

If (num-13) ≤ f_i < (num-11), then f_i = (num-11);

If (num-11) ≤ f_i < (num-9), then f_i = (num-9);

If (num-9) ≤ f_i < (num-7), then f_i = (num-7);

If (num-7) ≤ f_i, then f_i = (num-6);

where num is the total number of coded symbols, 40; f_i is the mean of the current row; and i ranges over [1, M].

The quantized row means f_i are stacked as a column to obtain the M×1 row mean matrix F.
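The quantization rules of Step 2 can be written compactly as below. This is a sketch rather than the patented implementation: the function names are hypothetical, and it assumes that the row mean is computed on offset-removed score values (character code minus 33), since the thresholds num-15 … num-7 only separate values that lie in the range of the 40 coded symbols.

```python
from typing import List

NUM = 40  # total number of coded symbols, as stated in the text


def row_mean(quals: str, offset: int = 33) -> float:
    """Mean quality score of one row (offset of 33 is an assumption)."""
    return sum(ord(c) - offset for c in quals) / len(quals)


def quantize_mean(f: float, num: int = NUM) -> int:
    """Apply the six quantization rules to one row mean."""
    if f >= num - 7:
        return num - 6
    # Remaining rules: the first upper bound that f falls below is the output.
    for upper in (num - 15, num - 13, num - 11, num - 9, num - 7):
        if f < upper:
            return upper
    raise AssertionError("unreachable: f < num - 7 is handled by the loop")


def row_mean_matrix(p1: List[str]) -> List[int]:
    """Quantized mean of every row of P1, i.e. the M x 1 matrix F."""
    return [quantize_mean(row_mean(row)) for row in p1]
```

With num = 40, all means below 25 collapse into one level, means from 25 up to 33 are grouped in steps of two, and means of 33 or more map to 34.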

Step 3: collect the context information, base information and row mean information of the coded characters, quantize them uniformly, and compute the final coding model.

3.1) Build a model for the current coded character q: take the four preceding characters q₁, q₂, q₃ and q₄; denote the base information in P₂ corresponding to the current character and to the previous character as j₁ and j₂; and denote the mean of the row containing q as f, where f is the already quantized value. For edge characters that lack preceding context, q₁, q₂, q₃ and q₄ may take the same symbol or be set to zero.

For example, suppose the first coded compression block P₁ contains E, F, G, H, I and the second coded compression block P₂ contains A, T, C, G, G.

When the coding model is built for the third character "G" in the first coded compression block P₁, its four preceding characters take the values q₁ = F, q₂ = E, q₃ = 0, q₄ = 0; the base information corresponding to the current character and the previous character is j₁ = C, j₂ = T; and the mean of its row is f = mean(ASCII(E), ASCII(F), ASCII(G), ASCII(H), ASCII(I));

When the coding model is built for the fifth character "I" in P₁, its four preceding characters take the values q₁ = H, q₂ = G, q₃ = F, q₄ = E; the base information corresponding to the current character and the previous character is j₁ = G, j₂ = G; and the mean of its row is f = mean(ASCII(E), ASCII(F), ASCII(G), ASCII(H), ASCII(I));

This shows that edge characters lacking preceding context can be modeled in the same way.
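The worked example above can be reproduced with a short Python sketch. The function name `context` and the padding symbol "0" are assumptions for illustration; the text does not say how j₂ is chosen for the first character of a row, so padding it is also an assumption.

```python
from typing import Tuple


def context(p1_row: str, p2_row: str, k: int,
            pad: str = "0") -> Tuple[str, str, str, str, str, str]:
    """Return (q1, q2, q3, q4, j1, j2) for the k-th (0-based) character.

    q1..q4 are the four preceding quality characters, nearest first;
    positions before the start of the row are filled with `pad`.
    j1 and j2 are the bases aligned with the current and previous character.
    """
    prev = [p1_row[k - d] if k - d >= 0 else pad for d in (1, 2, 3, 4)]
    j1 = p2_row[k]
    j2 = p2_row[k - 1] if k >= 1 else pad  # assumption: pad at the row start
    return (prev[0], prev[1], prev[2], prev[3], j1, j2)
```

Calling `context("EFGHI", "ATCGG", 2)` yields the values given for the third character "G", and `context("EFGHI", "ATCGG", 4)` those for the fifth character "I".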

3.2) Given that the total number of models is limited in practice, reduce the model cost by quantizing the overall model: denote the larger of q₁ and q₂ as A and the larger of q₃ and q₄ as B, and create two distinct identifiers C and D, where C indicates whether q₁ and q₂ are equal and D indicates whether q₃ and q₄ are equal. The model finally selected for the current coded symbol is therefore: P_now = A·B·C·D·j₁·j₂·f.

Here P_now is the probability estimate of the current coded symbol.
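One way to read the expression A·B·C·D·j₁·j₂·f is as a combined context that selects which adaptive statistics are used for the current symbol. The sketch below follows that reading, which is an interpretation and not something the text states explicitly. The names `model_key`, `AdaptiveModel` and `ideal_bits` are hypothetical, and instead of a real arithmetic coder the code only accumulates the ideal code length -log₂(p) that such a coder would approach.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

ModelKey = Tuple[int, int, int, int, str, str, int]


def model_key(q1: int, q2: int, q3: int, q4: int,
              j1: str, j2: str, f: int) -> ModelKey:
    """Combine the quantized context into one key (A, B, C, D, j1, j2, f)."""
    a = max(q1, q2)            # A: larger of the two nearest scores
    b = max(q3, q4)            # B: larger of the two farther scores
    c = 1 if q1 == q2 else 0   # C: 1 when q1 equals q2
    d = 1 if q3 == q4 else 0   # D: 1 when q3 equals q4
    return (a, b, c, d, j1, j2, f)


class AdaptiveModel:
    """Per-context symbol counts, updated after every coded symbol."""

    def __init__(self, num_symbols: int = 40) -> None:
        # Every context starts from a uniform count of 1 per symbol.
        self.counts: Dict[Hashable, List[int]] = defaultdict(
            lambda: [1] * num_symbols)

    def probability(self, key: Hashable, symbol: int) -> float:
        freq = self.counts[key]
        return freq[symbol] / sum(freq)

    def update(self, key: Hashable, symbol: int) -> None:
        self.counts[key][symbol] += 1


def ideal_bits(model: AdaptiveModel,
               stream: Iterable[Tuple[Hashable, int]]) -> float:
    """Sum of -log2(p) over (key, symbol) pairs, updating as it goes."""
    bits = 0.0
    for key, symbol in stream:
        bits -= math.log2(model.probability(key, symbol))
        model.update(key, symbol)
    return bits
```

Because the counts are updated only after each symbol is coded, a decoder that applies the same updates in the same order would reconstruct identical probabilities, which is what keeps an adaptive scheme lossless.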

Step 4: use the designed final coding model to drive an adaptive arithmetic coder, and traverse and compress the first coded compression block P₁ in a serpentine coding order along the direction of strongest correlation.

4.1) Obtain a more accurate probability estimate P_now for the current coded character from the final coding model above, and feed it into the adaptive arithmetic coder as the optimal prediction result;

4.2) The coder performs traversal coding and compression:

Encoding requires scanning and coding the characters one by one. The conventional scan traverses row by row by default: after a whole row has been traversed, scanning continues from the start of the next row. This example instead scans by column: after one column has been coded, scanning starts at the bottom of the next column and proceeds upward in reverse, and so on back and forth, so that the overall path resembles a serpentine scan, as shown in Fig. 3. Traversing and coding all characters in this way achieves the final lossless compression.
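The serpentine column scan described above can be sketched as a small generator. The name `snake_order` is hypothetical, and the sketch assumes a full M×N block with 0-based (row, column) indices.

```python
from typing import Iterator, Tuple


def snake_order(m: int, n: int) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) positions of an m x n block in serpentine order.

    Even-numbered columns are scanned top to bottom, odd-numbered columns
    bottom to top, so consecutive positions are always adjacent.
    """
    for col in range(n):
        rows = range(m) if col % 2 == 0 else range(m - 1, -1, -1)
        for row in rows:
            yield row, col
```

For a 3×2 block the order is (0,0), (1,0), (2,0), (2,1), (1,1), (0,1). A decoder would need to visit positions in the same order to reproduce the coder's state.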

The above description is only one specific example of the present invention and does not limit the invention in any way. Clearly, after understanding the content and principles of the invention, those skilled in the art can make various modifications and changes in form and detail without departing from its principles and structure, but such modifications and changes based on the inventive concept remain within the scope of protection of the claims of the present invention.

Claims

1. A lossless compression method for DNA sequencing quality scores based on an adaptive coding order, characterized by comprising the following:

(1) Extract the quality score data and base data from the FASTQ file:

(1a) Statistically analyze the characteristics of the DNA sequencing data and create two coded compression blocks P₁ and P₂ of size M×N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, and N is the number of columns of a compression block, i.e. the length of a quality score row, with N ≤ 150;

(1b) Extract the quality score data and the base data stored in the FASTQ file into the first coded compression block P₁ and the second coded compression block P₂, respectively;

(2) Compute the mean quality score of each row of the FASTQ data extracted into the first coded compression block P₁ and quantize it, obtaining an M×1 row mean matrix F;

(3) Collect the context information, base information and row mean information of the coded characters, quantize them uniformly, and compute the final coding model:

(3a) Build a model for the current coded character q: take the four preceding characters q₁, q₂, q₃ and q₄; denote the base information in the second coded compression block P₂ corresponding to the current character and to the previous character as j₁ and j₂; and denote the mean of the row containing q, taken from the row mean matrix F, as f, where f is the already quantized value; for edge characters that lack preceding context, q₁, q₂, q₃ and q₄ take the same symbol or are set to zero;

(3b) Reduce the model cost by quantizing the overall model: denote the larger of the first two characters q₁ and q₂ as A and the larger of the last two characters q₃ and q₄ as B, create two distinct identifiers C and D, and compute the final coding model for the current coded symbol:

P_now = A·B·C·D·j₁·j₂·f

where identifier C = 1 when q₁ = q₂ and C = 0 otherwise; D = 1 when q₃ = q₄ and D = 0 otherwise; and P_now is the probability estimate of the current coded symbol;

(4) Use the designed final coding model to drive an adaptive arithmetic coder, and traverse and compress the first coded compression block P₁ in a serpentine coding order along the direction of strongest correlation.

2. The method according to claim 1, wherein the DNA sequencing data characteristics in (1a) mean that the DNA sequencing data contain thousands of reads, each read has four lines, the second line is quality score data and the fourth line is base data, the DNA sequencing quality score data as a whole are encoded as ASCII characters, and the number of coded symbol types of the quality score data = maximum value - minimum value + 1.

3. The method according to claim 1, characterized in that, in (2), the mean quality score of each row extracted into the first coded compression block P₁ is quantized as follows:

If f_i < (num-15), then f_i = (num-15);

If (num-15) ≤ f_i < (num-13), then f_i = (num-13);

If (num-13) ≤ f_i < (num-11), then f_i = (num-11);

If (num-11) ≤ f_i < (num-9), then f_i = (num-9);

If (num-9) ≤ f_i < (num-7), then f_i = (num-7);

If (num-7) ≤ f_i, then f_i = (num-6);

where f_i is the mean quality score of each row, num is the total number of coded symbols, equal to 40, and i ranges over [1, M]; quantizing f_i for every row yields an M×1 row mean matrix F.

4. The method according to claim 1, characterized in that, in (4), using the designed final coding model to drive the adaptive arithmetic coder means feeding the probability estimate P_now of the current symbol into the adaptive arithmetic coder as the optimal prediction result.

5. The method according to claim 1, characterized in that, in (4), traversing and compressing the first coded compression block P₁ in the serpentine coding order along the direction of strongest correlation means scanning the first coded compression block P₁ column by column from top to bottom, then traversing in reverse from bottom to top after a column has been scanned, and alternating in this way until the whole compression block has been traversed.