CN115114238A

CN115114238A - An error-correction-based genome sequencing data lossless compression method and related equipment

Info

Publication number: CN115114238A
Application number: CN202210744033.1A
Authority: CN
Inventors: 王荣杰; 刘贤明; 朱泽轩
Original assignee: Peng Cheng Laboratory
Current assignee: Peng Cheng Laboratory
Priority date: 2022-06-28
Filing date: 2022-06-28
Publication date: 2022-09-27

Abstract

The invention discloses an error-correction-based method for lossless compression of genome sequencing data, and related equipment. The method comprises the following steps: identifying and correcting sequencing base errors in the original short reads and recording base error information, where the base error information comprises the position of each erroneous base and the original base; classifying each original short read into the index region file of its corrected index, and adding the sequencing error correction information to that index region file; and sorting and compressing the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data. The invention corrects sequencing errors in short reads efficiently. Correcting the base errors causes more similar short reads to be assigned to the same bucket, which improves the compression efficiency of the reads within each bucket, and recording the corrections made to the bucket index sequences of the reads makes the compression of the genome sequencing data lossless.

Description

An error-correction-based genome sequencing data lossless compression method and related equipment

Technical Field

The present invention relates to the technical field of data compression, and in particular to an error-correction-based method, system, terminal and computer-readable storage medium for lossless compression of genome sequencing data.

Background

The emergence of next-generation high-throughput genome sequencing technology (technology that sequences a large number of nucleic acid molecules in parallel in a single run, typically producing no less than 100 Mb of sequencing data per sequencing reaction) has greatly reduced the cost of genome sequencing and greatly increased its speed. For example, the Human Genome Project, completed with first-generation Sanger (dideoxy) sequencing technology, cost 3 billion US dollars and took 13 years, whereas modern high-throughput genome sequencing technology can do the same for less than 1,000 US dollars within a few days.

However, the base sequences produced by second-generation sequencing technology are much shorter than those of the first generation: first-generation reads can reach hundreds or even tens of thousands of bases, whereas the short reads produced by high-throughput genome sequencing are only a few dozen bases long. Subsequent genomic data analysis requires that the short reads be assembled into long genome sequences, so the short reads must overlap sufficiently. This calls for sufficient sequencing depth, that is, the same genomic sequence must be sequenced dozens or even hundreds of times, which generates a large volume of short-read data. Because each base position is sequenced dozens to hundreds of times on average, short-read data contains a large amount of redundancy, which provides the basis for compressing genome sequencing data.

However, because of sequencing errors in second-generation sequencing technology, some short reads are assigned to the wrong bucket. Such misassignment affects not only the compression of the misassigned read itself but also the compression of the other reads in that bucket, and therefore degrades the compression of the sequencing data as a whole.

Therefore, the prior art still needs to be improved and developed.

Summary of the Invention

The main purpose of the present invention is to provide an error-correction-based method, system, terminal and computer-readable storage medium for lossless compression of genome sequencing data, aiming to solve the prior-art problem that, in genome sequencing, the misassignment of short reads degrades the compression of the sequencing data as a whole.

To achieve the above purpose, the present invention provides an error-correction-based method for lossless compression of genome sequencing data, comprising the following steps:

identifying and correcting sequencing base errors in the original short reads, and recording base error information, where the base error information includes the position of each erroneous base and the original base;

classifying each original short read into the index region file of its corrected index, and adding the sequencing error correction information to that index region file;

sorting and compressing the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data.

In the error-correction-based method for lossless compression of genome sequencing data, identifying and correcting sequencing base errors in the original short reads specifically includes:

counting k-mers, where the k-mers are all base subsequences of length k in the original short reads;

obtaining the high-frequency k-mers and inserting them into a Bloom filter, where a k-mer whose number of occurrences exceeds a preset threshold is a high-frequency k-mer;

correcting, based on the Bloom filter, the sequencing base errors in the index region or in the entire region of each original short read.

In the error-correction-based method for lossless compression of genome sequencing data, the high-frequency k-mers are produced by correctly sequenced bases.

In the error-correction-based method for lossless compression of genome sequencing data, the index region is the minimum of all k-mers of the forward sequence and the reverse complement sequence of the original short read.

In the error-correction-based method for lossless compression of genome sequencing data, the bases in the index region are ordered alphabetically as A, C, G, T.

In the error-correction-based method for lossless compression of genome sequencing data, sorting and compressing the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data specifically includes:

after all short reads have been clustered, splitting and sorting all original short reads within the same index;

after all original short reads within the same index have been split and sorted, compressing them to obtain the compressed result file of the genome sequencing data.

In the error-correction-based method for lossless compression of genome sequencing data, splitting and sorting all original short reads within the same index after all short reads have been clustered specifically includes:

splitting and transforming a short read r using the transformation tran(r);

the split transformation divides the short read r into three regions, Pre(r), Index(r) and Suf(r), such that r = Pre(r) ∘ Index(r) ∘ Suf(r), where ∘ denotes string concatenation.

In the error-correction-based method for lossless compression of genome sequencing data, after sorting and compressing the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data, the method further includes:

calculating the compression rate of the genome sequencing data, where the compression rate is the number of bits occupied by each base after compression; the lower the compression rate, the smaller the space occupied and the better the compression.

In addition, to achieve the above purpose, the present invention also provides an error-correction-based system for lossless compression of genome sequencing data, comprising:

an identification and correction module, used to identify and correct sequencing base errors in the original short reads and to record base error information, where the base error information includes the position of each erroneous base and the original base;

a classification and recording module, used to classify each original short read into the index region file of its corrected index and to add the sequencing error correction information to that index region file;

a sorting and compression module, used to sort and compress the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data.

In the error-correction-based system for lossless compression of genome sequencing data, the identification and correction module specifically includes:

a counting unit, used to count k-mers, where the k-mers are all base subsequences of length k in the original short reads;

an insertion unit, used to obtain the high-frequency k-mers and insert them into a Bloom filter, where a k-mer whose number of occurrences exceeds a preset threshold is a high-frequency k-mer;

a correction unit, used to correct, based on the Bloom filter, the sequencing base errors in the index region or in the entire region of each original short read.

In the error-correction-based system for lossless compression of genome sequencing data, the sorting and compression module specifically includes:

a splitting and sorting unit, used to split and sort all original short reads within the same index after all short reads have been clustered;

a data compression unit, used to compress the reads, after all original short reads within the same index have been split and sorted, to obtain the compressed result file of the genome sequencing data.

In addition, to achieve the above purpose, the present invention also provides a terminal, comprising: a memory, a processor, and an error-correction-based genome sequencing data lossless compression program stored in the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the error-correction-based method for lossless compression of genome sequencing data described above.

In addition, to achieve the above purpose, the present invention also provides a computer-readable storage medium storing an error-correction-based genome sequencing data lossless compression program, where the program, when executed by a processor, implements the steps of the error-correction-based method for lossless compression of genome sequencing data described above.

In the present invention, sequencing base errors in the original short reads are identified and corrected, and base error information is recorded, where the base error information includes the position of each erroneous base and the original base; each original short read is classified into the index region file of its corrected index, and the sequencing error correction information is added to that index region file; and the base sequences of the original short reads within each index region file are sorted and compressed to obtain a compressed result file of the genome sequencing data. The present invention corrects sequencing errors in short reads efficiently. Correcting the base errors causes more similar short reads to be assigned to the same bucket, which increases the density of redundant data within the bucket and thereby improves the compression efficiency of the reads within it. Because the corrected bucket index sequence of each read is recorded together with the corrected positions and the original bases, the corrected bases can be restored without loss during decompression, which makes the compression of the genome sequencing data lossless.

Description of Drawings

Fig. 1 is a flow chart of a preferred embodiment of the error-correction-based method for lossless compression of genome sequencing data of the present invention;

Fig. 2 is a schematic diagram of the overall flow of a preferred embodiment of the error-correction-based method for lossless compression of genome sequencing data of the present invention;

Fig. 3 is a schematic diagram of bucketing errors caused by sequencing errors in a preferred embodiment of the error-correction-based method for lossless compression of genome sequencing data of the present invention;

Fig. 4 is a schematic diagram of the Minimizer of a sequence in a preferred embodiment of the error-correction-based method for lossless compression of genome sequencing data of the present invention;

Fig. 5 is a schematic diagram of the rearrangement of sequences within a bucket in a preferred embodiment of the error-correction-based method for lossless compression of genome sequencing data of the present invention;

Fig. 6 is a schematic diagram comparing the compression rates of the compression method of the present invention with those of three other genome sequencing data compression methods in the field, in a preferred embodiment of the error-correction-based method for lossless compression of genome sequencing data of the present invention;

Fig. 7 is a schematic diagram of the principle of a preferred embodiment of the error-correction-based system for lossless compression of genome sequencing data of the present invention;

Fig. 8 is a schematic diagram of the operating environment of a preferred embodiment of the terminal of the present invention.

Detailed Description

To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it.

As shown in Fig. 1 and Fig. 2, the error-correction-based method for lossless compression of genome sequencing data according to a preferred embodiment of the present invention includes the following steps:

Step S10: identify and correct sequencing base errors in the original short reads, and record base error information, where the base error information includes the position of each erroneous base and the original base.

Specifically, step S10 includes: counting k-mers, where the k-mers are all base subsequences of length k in the original short reads; obtaining the high-frequency k-mers and inserting them into a Bloom filter, where a k-mer whose number of occurrences exceeds a preset threshold is a high-frequency k-mer; and correcting, based on the Bloom filter, the sequencing base errors in the index region or in the entire region of each original short read, where the base error information includes the position of each erroneous base and the original base.
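The counting part of step S10 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of an in-memory `Counter`, and the example threshold are assumptions made for illustration. Counting the reverse complement as well follows the embodiment, which counts k-mers on both strands.

```python
from collections import Counter

# Complement table for A<->T and C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k):
    """Count every length-k substring of each read and of its reverse complement."""
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, threshold):
    """Keep the k-mers that occur more often than the threshold (a hypothetical value)."""
    return {kmer for kmer, n in counts.items() if n > threshold}


# Toy input: three reads, k = 3, threshold = 2.
counts = count_kmers(["AAAC", "AAAC", "AAAG"], 3)
trusted = high_frequency_kmers(counts, 2)
```

On real data the counts would not fit comfortably in a Python dictionary; a dedicated k-mer counter would normally be used, which the patent does not name.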

As shown in Fig. 2, the original short sequencing fragments are referred to as reads. Sequencing errors in the original reads can be identified and corrected in two ways: one identifies and corrects only the bases in the index region of each read, and the other identifies and corrects all bases of each read. Both record the base error information, which includes the position of each erroneous base and the original base, so that the original sequence can be restored without loss during decompression and the compression is lossless.
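How recording the error position and the original base makes correction reversible can be sketched as follows. This is an illustrative sketch only; the function names and the `(position, base)` tuple layout are assumptions, not the record format used by the invention.

```python
def apply_corrections(read, corrections):
    """Apply (position, corrected_base) pairs to a read.

    Returns the corrected read and an error log of (position, original_base)
    pairs, which is the information needed to undo the corrections.
    """
    bases = list(read)
    error_log = []
    for pos, new_base in corrections:
        if bases[pos] != new_base:
            error_log.append((pos, bases[pos]))
            bases[pos] = new_base
    return "".join(bases), error_log


def restore_original(corrected_read, error_log):
    """Undo the logged corrections, newest first, to recover the original read."""
    bases = list(corrected_read)
    for pos, original_base in reversed(error_log):
        bases[pos] = original_base
    return "".join(bases)


original = "ACGTACGT"
corrected, log = apply_corrections(original, [(2, "T"), (5, "A")])
assert restore_original(corrected, log) == original
```

Undoing the log in reverse order keeps the round trip exact even if the same position happens to be corrected more than once.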

Here, a k-mer is a base subsequence of length k in a read. The k-mers are counted for both the read sequence and its reverse complement; in the reverse complement, A and T are exchanged, as are C and G. The high-frequency k-mers are then inserted into a Bloom filter. A k-mer whose number of occurrences exceeds a certain threshold is called a high-frequency k-mer; because the base sequencing error rate is relatively low, high-frequency k-mers are generally produced by correctly sequenced bases. A Bloom filter consists of a long binary vector and a series of random mapping functions and can be used to test whether an element is in a set; its advantage is that a membership query takes O(1) time. Finally, the base errors in the index region of each read (the Minimizer, as shown in Fig. 3) or in the entire read are corrected. The index region of a read (also called the Minimizer) is the minimum of all k-mers of the forward sequence and the reverse complement sequence, with bases ordered alphabetically as A, C, G, T (as shown in Fig. 4).
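A Bloom filter of the kind described, together with one possible way of using it for correction, can be sketched in Python as below. The patent does not specify the hash functions, the filter size, or the search strategy, so the SHA-256-based hashing, the default parameters, and the greedy single-substitution search in `correct_read` are all assumptions made for illustration. Note also that a Bloom filter can report false positives, so a k-mer may occasionally be accepted as trusted when it is not.

```python
import hashlib


class BloomFilter:
    """A bit vector plus several hash functions, used for set membership tests."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def correct_read(read, k, trusted):
    """Greedy sketch: for each untrusted k-mer, try single-base substitutions
    until the k-mer becomes trusted. Returns the corrected read and a log of
    (position, original_base) pairs."""
    bases = list(read)
    log = []
    for i in range(len(bases) - k + 1):
        window = "".join(bases[i:i + k])
        if window in trusted:
            continue
        fixed = False
        for j in range(k):
            for alt in "ACGT":
                if alt == window[j]:
                    continue
                if window[:j] + alt + window[j + 1:] in trusted:
                    log.append((i + j, bases[i + j]))
                    bases[i + j] = alt
                    fixed = True
                    break
            if fixed:
                break
    return "".join(bases), log


# Hypothetical trusted k-mers (k = 4) taken from the error-free sequence ACGTAC.
trusted = BloomFilter()
for kmer in ("ACGT", "CGTA", "GTAC"):
    trusted.add(kmer)
fixed, log = correct_read("ACGAAC", 4, trusted)
```

In this toy run the read ACGAAC contains one substitution at position 3; the sketch restores the T and logs the original base A so that the change can be undone later.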

Step S20: classify each original short read into the index region file of its corrected index, and add the sequencing error correction information to that index region file.

Specifically, each read is classified into the file of its corrected Minimizer, and the sequencing error correction information is added to the same index region file so that the original short read can be restored without loss during decompression. After the sequencing base errors have been corrected, the Minimizer value of the read must be recalculated.
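Recomputing the Minimizer after correction and grouping reads by it can be sketched as follows. This is a simplified illustration: an in-memory dictionary stands in for the index region files, and the function names and the `(read, error_log)` pairing are assumptions. The Minimizer is taken as the lexicographically smallest k-mer over the read and its reverse complement, which matches the A, C, G, T ordering because Python compares these letters in that order.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(read, k):
    """Smallest k-mer (A < C < G < T) over the read and its reverse complement.

    Raises ValueError if the read is shorter than k.
    """
    strands = (read, reverse_complement(read))
    return min(s[i:i + k] for s in strands for i in range(len(s) - k + 1))


def bucket_reads(corrected_reads, k):
    """Group (corrected_read, error_log) pairs by the Minimizer of the corrected read.

    The error log travels with the read so the original can be restored later.
    """
    buckets = defaultdict(list)
    for read, error_log in corrected_reads:
        buckets[minimizer(read, k)].append((read, error_log))
    return dict(buckets)


# The erroneous read CCACT would fall into bucket ACT; once its last base is
# corrected to G it joins TACGG in bucket ACG.
buckets = bucket_reads([("CCACG", [(4, "T")]), ("TACGG", [])], 3)
```

The toy data shows the effect the method relies on: a single base error changes the Minimizer and sends the read to a different bucket, and correcting it brings similar reads back together.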

Step S30: sort and compress the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data.

Specifically, step S30 includes: after all short reads have been clustered, splitting and sorting all original short reads within the same index; and, after all original short reads within the same index have been split and sorted, compressing them to obtain the compressed result file of the genome sequencing data.

After all short reads have been clustered, all reads within the same index are split and sorted. For example, a read r is split and transformed using the transformation tran(r). As shown in Fig. 5, the transformation divides the short read r into three regions, Pre(r), Index(r) and Suf(r), for example Pre(r) = GCGTTGGCCCT, Index(r) = AAGAGGCC and Suf(r) = GCGATGCTCCGTTATCTCTCGTTG, such that r = Pre(r) ∘ Index(r) ∘ Suf(r), where ∘ denotes string concatenation and Index(r) is the index of the bucket in which the read is placed. The transformation is defined as tran(r) = Suf(r) ∘ Pre(r) ∘ IndexPos, where IndexPos is the position offset of the bucket index Index(r) (the Minimizer) within the short read; in this example Suf(r) is GCGATGCTCCGTTATCTCTCGTTG, Pre(r) is GCGTTGGCCCT, and IndexPos is 10. Tran is called the short-read split transformation: it splits the read at a specific offset and swaps the two substrings produced by the split, placing the second before the first.

The transformation brings two benefits. First, it removes the explicit redundancy present in the short reads, namely the bucket index string Index(r), because all short reads in a bucket share that bucket index. Second, it moves the subsequence that follows the bucket index to the front of each short read; these adjusted prefixes are more likely to share similar sequences, which helps downstream compression tools discover and exploit the redundancy of the shared substrings.
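The split transformation and its inverse can be sketched in Python as below. The function names are hypothetical, and the offset convention is an assumption: here `index_pos` is the 0-based start of Index(r), which equals the length of Pre(r) and gives 11 for the example read, whereas the description states 10, so the source may count the offset differently.

```python
def tran(read, index, index_pos):
    """Split read = Pre + Index + Suf at index_pos and return (Suf + Pre, index_pos).

    The shared bucket index is dropped because every read in the bucket has it.
    """
    k = len(index)
    if read[index_pos:index_pos + k] != index:
        raise ValueError("bucket index not found at the given offset")
    pre, suf = read[:index_pos], read[index_pos + k:]
    return suf + pre, index_pos


def inverse_tran(body, index, index_pos):
    """Rebuild the read from Suf + Pre, the bucket index and the offset."""
    split = len(body) - index_pos  # Pre occupies the last index_pos characters
    suf, pre = body[:split], body[split:]
    return pre + index + suf


pre = "GCGTTGGCCCT"
index = "AAGAGGCC"
suf = "GCGATGCTCCGTTATCTCTCGTTG"
read = pre + index + suf

body, pos = tran(read, index, len(pre))
assert body == suf + pre
assert inverse_tran(body, index, pos) == read
```

Storing only the rearranged body and the offset is enough to rebuild the read, because the bucket itself supplies Index(r). Within a bucket, the rearranged bodies can then be sorted before being passed to a general-purpose compressor.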

Further, to verify the effectiveness of the two variants of the method of the present invention in compressing genome sequencing data, BICmin (which corrects only the Minimizer region) and BICall (which corrects the entire sequence) were used to compress real genome sequencing data from five different species, and the results were compared with those of three other genome sequencing data compression methods in the field: SCALCE, ORCOM and MINCE. As shown in Fig. 6, the dataset IDs of the five species are SRR554369_1, MH0001.081026_1, SRR327342_1, SRR870667_1 and ERR174310_1, and each of the five methods has its own compression rate (bits per base) on each dataset. For example, with BICmin the compression rates for SRR554369_1, MH0001.081026_1, SRR327342_1, SRR870667_1 and ERR174310_1 are 0.43, 0.7, 0.26, 0.88 and 0.67 respectively; with BICall the corresponding compression rates are 0.41, 0.68, 0.24, 0.88 and 0.66.

Here the compression rate is the number of bits occupied by each base after compression; the lower it is, the smaller the space occupied and the better the compression. Fig. 6 shows that BICall compresses slightly better than BICmin, but the improvement is not pronounced. The reason is that, once the gain from changed bucket assignment is set aside, the additional error correction in BICall does improve the consistency of the sequences within a bucket, but the position and original base of every corrected base must still be recorded, which takes up storage space. The BIC method therefore provides two compression modes: BICmin, which corrects only the base errors that can change the index value of a short read, and BICall, which corrects all base errors in a short read.
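The compression rate defined above can be computed as in the following minimal sketch. The function name and the example sizes are hypothetical and are not taken from the experiments; for comparison, a plain fixed-width encoding of the four bases needs 2 bits per base.

```python
def bits_per_base(compressed_size_bytes, total_bases):
    """Compression rate in bits per base: compressed bits divided by the number of bases."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return compressed_size_bytes * 8 / total_bases


# Hypothetical example: 16,000 bases compressed into 1,000 bytes.
rate = bits_per_base(1000, 16000)
```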

High-throughput genomic data contains a large number of base sequencing errors owing to the sequencing technology, and these errors reduce the sequence consistency between short reads and degrade the compression of genome sequencing data. The present invention uses the high-throughput information in the genome sequencing data to correct the bases at erroneous positions and establishes a character representation for the different base corrections, so that, while the compression rate of the genome sequencing data is improved, the original sequence can be restored without loss, achieving lossless compression. The method achieved good compression results in lossless compression experiments on publicly available genome sequencing datasets.

Beneficial effects:

(1) The method of the present invention provides an efficient sequence error correction method: by counting k-mers and inserting the high-frequency k-mers into a Bloom filter, it corrects the erroneous bases in the bucket index region and thereby corrects sequencing errors in short reads.

(2) By correcting sequencing errors in the bases, the method of the present invention causes more similar short reads to be assigned to the same bucket, which increases the density of redundant data within the bucket and thereby improves the compression efficiency of the reads within it.

(3) By recording the corrected bucket index sequence of each short read together with the corrected positions and the original bases, the method of the present invention can restore the corrected bases without loss during decompression, achieving lossless compression of the genome sequencing data.

(4) The method of the present invention was tested in compression experiments on real genome sequencing data from five different species, and the results were compared with those of three compression methods in the field (SCALCE, ORCOM and MINCE). The experimental results show that the method of the present invention compresses genome sequencing data effectively.

Further, as shown in Fig. 7, based on the error-correction-based method for lossless compression of genome sequencing data described above, the present invention also provides a corresponding error-correction-based system for lossless compression of genome sequencing data, comprising:

an identification and correction module 51, used to identify and correct sequencing base errors in the original short reads and to record base error information, where the base error information includes the position of each erroneous base and the original base;

a classification and recording module 52, used to classify each original short read into the index region file of its corrected index and to add the sequencing error correction information to that index region file;

a sorting and compression module 53, used to sort and compress the base sequences of the original short reads within each index region file to obtain a compressed result file of the genome sequencing data.

Wherein the identification and correction module 51 specifically includes:

Wherein the sorting and compression module 53 specifically includes:

Further, as shown in FIG. 8, based on the above error-correction-based lossless compression method and system for genome sequencing data, the present invention also provides a corresponding terminal, which includes a processor 10, a memory 20, and a display 30. FIG. 8 shows only some of the components of the terminal; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

In some embodiments, the memory 20 may be an internal storage unit of the terminal, such as a hard disk or internal memory of the terminal. In other embodiments, the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the terminal. The memory 20 is used to store application software installed on the terminal and various types of data, such as the program code installed on the terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores an error-correction-based genome sequencing data lossless compression program 40, which can be executed by the processor 10 to implement the error-correction-based lossless compression method for genome sequencing data of the present application.

In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code or process the data stored in the memory 20, for example to execute the error-correction-based lossless compression method for genome sequencing data.

In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 30 is used to display information on the terminal and to display a visual user interface. The components 10-30 of the terminal communicate with one another via a system bus.

In one embodiment, the following steps are implemented when the processor 10 executes the error-correction-based genome sequencing data lossless compression program 40 in the memory 20:

Wherein identifying and correcting sequencing base errors in the original short sequencing fragments specifically includes:

Wherein the index region is the minimum among all k-mers of the forward and reverse-complement sequences of the original short sequencing fragment.

Wherein the bases in the index region are ordered according to the letter order A, C, G, T.
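
The definition of the index region can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes fragments contain only the letters A, C, G, and T (no `N`); with that alphabet, ordinary string comparison already follows the order A < C < G < T.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def index_region(read, k):
    """Smallest k-mer over the forward and reverse-complement strands,
    compared in the order A < C < G < T."""
    candidates = []
    for strand in (read, reverse_complement(read)):
        candidates.extend(strand[i:i + k] for i in range(len(strand) - k + 1))
    return min(candidates)
```

Because both strands are considered, a fragment and its reverse complement obtain the same index region and would therefore fall into the same bucket. For example, `index_region("TTGCA", 3)` and `index_region("TGCAA", 3)` both return `"CAA"`.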

Wherein sorting and compressing the base sequences of the original short sequencing fragments in the different index region files to obtain the compression result file of the genome sequencing data specifically includes:

Wherein, after all short sequencing fragment sequences have been clustered, splitting and sorting all original short sequencing fragments within the same index specifically includes:

Wherein, after sorting and compressing the base sequences of the original short sequencing fragments in the different index region files to obtain the compression result file of the genome sequencing data, the method further includes:

In summary, the present invention provides an error-correction-based lossless compression method, system, terminal, and computer-readable storage medium for genome sequencing data. The method includes: identifying and correcting sequencing base errors in the original short sequencing fragments and recording base error information, the base error information including the base position of each sequencing error and the original base; classifying the original short sequencing fragments into the corrected index region files and adding the sequencing error correction information to the index region files; and sorting and compressing the base sequences of the original short sequencing fragments in the different index region files to obtain a compression result file of the genome sequencing data. The invention corrects sequencing errors in short sequencing fragments efficiently. Correcting these errors assigns more similar fragments to the same bucket, which increases the data redundancy within the bucket and in turn improves the compression efficiency of the fragments in the bucket. Because the corrected bucket index sequence is recorded together with the corrected positions and the original bases, the bases that were corrected can be restored without loss during decompression, achieving lossless compression of genome sequencing data.
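
Why recording the position and original base keeps the scheme lossless can be shown with a short Python round trip. The function names and the `(position, base)` tuple layout are assumptions made for illustration; the patent does not prescribe a record format.

```python
def apply_corrections(read, corrections):
    """Apply (position, corrected_base) substitutions to a fragment.
    Returns the corrected fragment and a log of (position, original_base)."""
    bases = list(read)
    log = []
    for pos, new_base in corrections:
        log.append((pos, bases[pos]))  # remember what was there before
        bases[pos] = new_base
    return "".join(bases), log


def restore_original(corrected, log):
    """Undo the substitutions using the recorded original bases."""
    bases = list(corrected)
    # Undo in reverse so repeated edits at one position unwind correctly.
    for pos, original_base in reversed(log):
        bases[pos] = original_base
    return "".join(bases)
```

A compressor would store the corrected fragment plus the log, and a decompressor would call `restore_original` to obtain the fragment exactly as sequenced, errors included.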

It should be noted that, herein, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "including a ..." does not preclude the presence of additional identical elements in the process, method, article, or terminal that includes the element.

It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art may make improvements or transformations based on the above description, and all such improvements and transformations shall fall within the protection scope of the appended claims of the present invention.

Claims

1. The error correction-based genome sequencing data lossless compression method is characterized by comprising the following steps of:

identifying and correcting sequencing base errors in the original sequencing short fragment, and recording base error information, wherein the base error information comprises base positions and original bases of the sequencing errors;

classifying the original sequencing short fragment into the corrected index area file, and adding sequencing error correction information into the index area file;

sorting and compressing the base sequences in the original sequencing short fragments in different index region files to obtain a compression result file of the genome sequencing data.

2. The method for lossless compression of genome sequencing data based on error correction according to claim 1, wherein the identifying and correcting sequencing base errors in the original sequencing short fragment specifically comprises:

counting the number of k-mers, wherein the k-mers are all base subsequences with the length of k in the original sequencing short fragment;

obtaining a high-frequency k-mer, inserting the high-frequency k-mer into a bloom filter, wherein the high-frequency k-mer is selected when the occurrence frequency of the k-mer is greater than a preset threshold;

and correcting sequencing base errors in the index region or the whole region in the original sequencing short fragment based on the bloom filter.

3. The method of claim 2, wherein the high frequency k-mers are generated from correctly sequenced bases.

4. The method of claim 2, wherein the index region is the minimum of all k-mers of the forward and reverse complement sequences in the original sequenced short segment.

5. The method of claim 4, wherein the bases in the index region are arranged in the order of A, C, G, T letters.

6. The method for lossless compression of genome sequencing data based on error correction according to claim 1, wherein the sorting and compressing of the base sequences in the original sequencing short fragments in different index region files to obtain the compression result file of the genome sequencing data specifically comprises:

after all sequencing short fragment sequences are clustered, splitting and sorting all original sequencing short fragments in the same index;

and after the splitting and sorting of all the original sequencing short fragments in the same index is completed, compressing to obtain the compression result file of the genome sequencing data.

7. The error-correction-based genome sequencing data lossless compression method according to claim 6, wherein the splitting and sorting of all original sequencing short fragments in the same index after all sequencing short fragment sequences are clustered specifically comprises:

using a transform tran(r) to perform a splitting transformation on the sequencing short fragment r;

the splitting transformation divides the sequencing short fragment r into three regions, Pre(r), Index(r) and Suf(r), such that $r = \mathrm{Pre}(r) \oplus \mathrm{Index}(r) \oplus \mathrm{Suf}(r)$, where $\oplus$ represents string concatenation.
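
A minimal Python sketch of the three-way split is given below for illustration. It assumes that the three regions concatenate back to the fragment, that the index sequence occurs on the forward strand of the fragment, and that the first occurrence is the one used; the function name `split_read` is hypothetical.

```python
def split_read(read, index_seq):
    """Split a fragment into (prefix, index, suffix) around the first
    occurrence of its index sequence."""
    start = read.find(index_seq)
    if start < 0:
        raise ValueError("index sequence not found in fragment")
    end = start + len(index_seq)
    return read[:start], read[start:end], read[end:]
```

For example, `split_read("TTACGGA", "ACG")` returns `("TT", "ACG", "GA")`, and joining the three parts reproduces the fragment. Fragments in one bucket share the same index part, so only the prefixes and suffixes differ and need to be ordered.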

8. The method for lossless compression of genome sequencing data based on error correction according to claim 1, wherein, after the sorting and compressing of the base sequences in the original sequencing short fragments in different index region files to obtain the compression result file of the genome sequencing data, the method further comprises:

and calculating the compression ratio of the genome sequencing data, wherein the compression ratio represents the number of bits occupied by each base after compression, and the lower the compression ratio, the smaller the occupied space and the better the compression effect.
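
As an illustration of this measure, the compression ratio in bits per base can be computed as in the following Python sketch. The function name and the assumption that the compressed size is given in bytes are illustrative choices.

```python
def bits_per_base(compressed_size_bytes, total_bases):
    """Compression ratio as the number of bits used per base after compression."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return compressed_size_bytes * 8 / total_bases
```

For instance, 1000 bases compressed into 250 bytes give `bits_per_base(250, 1000) == 2.0`, the same cost as a plain 2-bit encoding of A, C, G, and T; a useful compressor should come in below that figure.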

9. An error correction based genome sequencing data lossless compression system, wherein the error correction based genome sequencing data lossless compression system comprises:

the identification and correction module is used for identifying and correcting sequencing base errors in the original sequencing short fragment and recording base error information, wherein the base error information comprises the base position of the sequencing error and an original base;

the classification recording module is used for classifying the original sequencing short fragments into the corrected index area file and adding sequencing error correction information into the index area file;

and the sorting and compression module is used for sorting and compressing the base sequences in the original sequencing short fragments in different index region files to obtain a compression result file of the genome sequencing data.

10. The system for lossless compression of genome sequencing data based on error correction according to claim 9, wherein the identification and correction module specifically includes:

the statistical unit is used for counting the number of k-mers, wherein the k-mers are all base subsequences with the length of k in the original sequencing short fragment;

the inserting unit is used for obtaining a high-frequency k-mer and inserting the high-frequency k-mer into the bloom filter, wherein the high-frequency k-mer is selected when the occurrence frequency of the k-mer is greater than a preset threshold;

and the correcting unit is used for correcting sequencing base errors in the index region or the whole region in the original sequencing short fragment based on the bloom filter.

11. The error-correction-based genome sequencing data lossless compression system according to claim 9, wherein the sorting and compression module specifically includes:

the splitting and sorting unit is used for splitting and sorting all original sequencing short fragments in the same index after all sequencing short fragment sequences are clustered;

and the data compression unit is used for compressing to obtain a compression result file of the genome sequencing data after the splitting and sorting of all the original sequencing short fragments in the same index is completed.

12. A terminal, characterized in that the terminal comprises: a memory, a processor, and an error correction based genome sequencing data lossless compression program stored on the memory and executable on the processor, the error correction based genome sequencing data lossless compression program when executed by the processor implementing the steps of the error correction based genome sequencing data lossless compression method of any one of claims 1-8.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an error correction based genome sequencing data lossless compression program, which when executed by a processor implements the steps of the error correction based genome sequencing data lossless compression method according to any one of claims 1 to 8.