CN115114238A - An error-correction-based genome sequencing data lossless compression method and related equipment - Google Patents
- Publication number
- CN115114238A (application CN202210744033.1A)
- Authority
- CN
- China
- Prior art keywords
- sequencing
- original
- base
- compression
- short
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1744—Redundancy elimination performed by the file system using compression, e.g. sparse files
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Description
Technical Field
The present invention relates to the technical field of data compression, and in particular to an error-correction-based method, system, terminal and computer-readable storage medium for lossless compression of genome sequencing data.
Background
The emergence of next-generation high-throughput genome sequencing technology (technology that sequences a large number of nucleic acid molecules in parallel in a single run, where one sequencing reaction typically yields no less than 100 Mb of sequencing data) has greatly reduced the cost of genome sequencing and greatly increased sequencing speed. For example, the Human Genome Project, completed with first-generation Sanger (dideoxy) sequencing technology, cost 3 billion US dollars and took 13 years, whereas with modern high-throughput genome sequencing technology the same work costs less than 1,000 US dollars and can be completed within a few days.
In terms of read length, however, the base sequences produced by second-generation sequencing technology are much shorter than those of the first generation. First-generation reads can reach hundreds or even tens of thousands of bases, whereas the short reads produced by high-throughput genome sequencing contain only a few tens of bases. Because subsequent genome data analysis must assemble the short reads into long genome sequences, the short reads must overlap one another sufficiently. This requires sufficient sequencing depth, that is, the same genome sequence must be sequenced tens or even hundreds of times, which produces a large volume of short-read data. Since each base position is sequenced tens to hundreds of times on average, the short reads contain a large amount of redundant information, which provides the basis for compressing genome sequencing data.
However, because of sequencing errors in second-generation sequencing technology, some short reads are assigned to the wrong bucket. Such a misassignment affects not only the compression of the misassigned read itself but also the compression of the other reads in that bucket, and therefore degrades the compression of the sequencing data as a whole.
Therefore, the prior art still needs to be improved and developed.
Summary of the Invention
The main purpose of the present invention is to provide an error-correction-based method, system, terminal and computer-readable storage medium for lossless compression of genome sequencing data, aiming to solve the problem in the prior art that the misassignment of short reads during genome sequencing degrades the compression of the sequencing data as a whole.
To achieve the above purpose, the present invention provides an error-correction-based lossless compression method for genome sequencing data, which includes the following steps:
identifying and correcting sequencing base errors in the original short reads, and recording base error information, the base error information including the position of each erroneous base and the original base;
classifying the original short reads into the corrected index region files, and adding the sequencing error correction information to the index region files;
sorting and compressing the base sequences of the original short reads within the different index region files to obtain a compressed result file of the genome sequencing data.
The error-correction-based lossless compression method for genome sequencing data, wherein identifying and correcting sequencing base errors in the original short reads specifically includes:
counting k-mers, a k-mer being any base subsequence of length k in an original short read;
obtaining the high-frequency k-mers and inserting them into a Bloom filter, a high-frequency k-mer being a k-mer whose number of occurrences is greater than a preset threshold;
correcting, based on the Bloom filter, the sequencing base errors in the index region or in the entire region of the original short reads.
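The first two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `count_kmers`, `BloomFilter` and `build_trusted_filter`, the filter size, the number of hash functions and the use of BLAKE2b are all hypothetical choices, and only forward-strand k-mers are counted for brevity.

```python
import hashlib
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


class BloomFilter:
    """Small Bloom filter: one bit array probed by several salted hashes."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Each salt value yields an independent hash of the same item.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_trusted_filter(reads, k, threshold):
    """Insert k-mers occurring more than `threshold` times into the filter."""
    trusted = BloomFilter()
    for kmer, n in count_kmers(reads, k).items():
        if n > threshold:
            trusted.add(kmer)
    return trusted


reads = ["ACGTACGT", "ACGTACGA", "ACGTTTTT"]
trusted = build_trusted_filter(reads, k=4, threshold=2)
print("ACGT" in trusted)
```

A Bloom filter trades a small false-positive rate for a fixed memory footprint, which is why membership of a rare k-mer is not asserted here.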
The error-correction-based lossless compression method for genome sequencing data, wherein the high-frequency k-mers are generated by correctly sequenced bases.
The error-correction-based lossless compression method for genome sequencing data, wherein the index region is the minimum of all k-mers of the forward sequence and the reverse-complement sequence of the original short read.
The error-correction-based lossless compression method for genome sequencing data, wherein the bases in the index region are ordered alphabetically as A, C, G, T.
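The index region defined above can be illustrated with a short sketch. The function names `reverse_complement` and `minimizer` are hypothetical; the sketch relies on the fact that ordinary string comparison in Python already orders the letters A < C < G < T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base (A<->T, C<->G), then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def minimizer(read, k):
    """Smallest k-mer, in A < C < G < T order, over both strands of the read."""
    candidates = []
    for strand in (read, reverse_complement(read)):
        for i in range(len(strand) - k + 1):
            candidates.append(strand[i:i + k])
    return min(candidates)


print(reverse_complement("GATTACA"))  # TGTAATC
print(minimizer("GATTACA", 3))        # AAT
```

Taking the minimum over both strands means that a read and its reverse complement receive the same index, so they fall into the same bucket.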
The error-correction-based lossless compression method for genome sequencing data, wherein sorting and compressing the base sequences of the original short reads within the different index region files to obtain a compressed result file of the genome sequencing data specifically includes:
after all short reads have been clustered, splitting and sorting all original short reads within the same index;
after the splitting and sorting of all original short reads within the same index is complete, compressing them to obtain the compressed result file of the genome sequencing data.
The error-correction-based lossless compression method for genome sequencing data, wherein splitting and sorting all original short reads within the same index after all short reads have been clustered specifically includes:
applying the transform tran(r) to split and convert a short read r;
the split conversion divides the short read r into three regions, Pre(r), Index(r) and Suf(r), such that r = Pre(r) ∘ Index(r) ∘ Suf(r), where ∘ denotes string concatenation.
The error-correction-based lossless compression method for genome sequencing data, wherein, after sorting and compressing the base sequences of the original short reads within the different index region files to obtain a compressed result file of the genome sequencing data, the method further includes:
calculating the compression ratio of the genome sequencing data, the compression ratio being the number of bits occupied by each base after compression; the lower the compression ratio, the less space is occupied and the better the compression.
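As a worked illustration of this measure, the compression ratio in bits per base is the compressed size in bits divided by the number of bases. The function name `bits_per_base` and the sample figures are hypothetical and not taken from the experiments.

```python
def bits_per_base(compressed_size_bytes, total_bases):
    """Compression ratio as defined above: compressed bits per input base."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return compressed_size_bytes * 8 / total_bases


# A 1,000-byte archive holding 16,000 bases uses 0.5 bits per base.
print(bits_per_base(1000, 16000))  # 0.5
```

For reference, a fixed-width code over the four letters A, C, G, T uses 2 bits per base, so any value below 2 indicates that redundancy between reads has been exploited.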
In addition, to achieve the above purpose, the present invention further provides an error-correction-based lossless compression system for genome sequencing data, which includes:
an identification and correction module, configured to identify and correct sequencing base errors in the original short reads and to record base error information, the base error information including the position of each erroneous base and the original base;
a classification and recording module, configured to classify the original short reads into the corrected index region files and to add the sequencing error correction information to the index region files;
a sorting and compression module, configured to sort and compress the base sequences of the original short reads within the different index region files to obtain a compressed result file of the genome sequencing data.
The error-correction-based lossless compression system for genome sequencing data, wherein the identification and correction module specifically includes:
a counting unit, configured to count k-mers, a k-mer being any base subsequence of length k in an original short read;
an insertion unit, configured to obtain the high-frequency k-mers and insert them into a Bloom filter, a high-frequency k-mer being a k-mer whose number of occurrences is greater than a preset threshold;
a correction unit, configured to correct, based on the Bloom filter, the sequencing base errors in the index region or in the entire region of the original short reads.
The error-correction-based lossless compression system for genome sequencing data, wherein the sorting and compression module specifically includes:
a splitting and sorting unit, configured to split and sort all original short reads within the same index after all short reads have been clustered;
a data compression unit, configured to compress the reads, after the splitting and sorting of all original short reads within the same index is complete, to obtain the compressed result file of the genome sequencing data.
In addition, to achieve the above purpose, the present invention further provides a terminal, which includes a memory, a processor, and an error-correction-based lossless compression program for genome sequencing data that is stored in the memory and executable on the processor; when executed by the processor, the program implements the steps of the error-correction-based lossless compression method for genome sequencing data described above.
In addition, to achieve the above purpose, the present invention further provides a computer-readable storage medium storing an error-correction-based lossless compression program for genome sequencing data; when executed by a processor, the program implements the steps of the error-correction-based lossless compression method for genome sequencing data described above.
In the present invention, sequencing base errors in the original short reads are identified and corrected, and base error information is recorded, including the position of each erroneous base and the original base; the original short reads are classified into the corrected index region files, and the sequencing error correction information is added to the index region files; the base sequences of the original short reads within the different index region files are sorted and compressed to obtain a compressed result file of the genome sequencing data. The present invention corrects sequencing errors in short reads efficiently. Correcting the erroneous bases causes more similar short reads to be assigned to the same bucket, which increases the density of redundant data within the bucket and in turn improves the compression efficiency of the reads in that bucket. By recording the corrected bucket index sequence of each short read together with the corrected positions and the original bases, the corrected bases can be restored without loss during decompression, so that the genome sequencing data are compressed losslessly.
Brief Description of the Drawings
Fig. 1 is a flowchart of a preferred embodiment of the error-correction-based lossless compression method for genome sequencing data of the present invention;
Fig. 2 is a schematic diagram of the overall flow of a preferred embodiment of the error-correction-based lossless compression method for genome sequencing data of the present invention;
Fig. 3 is a schematic diagram of a bucketing error caused by a sequencing error in a preferred embodiment of the error-correction-based lossless compression method for genome sequencing data of the present invention;
Fig. 4 is a schematic diagram of the Minimizer of a sequence in a preferred embodiment of the error-correction-based lossless compression method for genome sequencing data of the present invention;
Fig. 5 is a schematic diagram of the rearrangement of sequences within a bucket in a preferred embodiment of the error-correction-based lossless compression method for genome sequencing data of the present invention;
Fig. 6 is a schematic diagram comparing the compression ratios of the compression method of the present invention with those of three other genome sequencing data compression methods in the field, in a preferred embodiment of the error-correction-based lossless compression method for genome sequencing data of the present invention;
Fig. 7 is a schematic diagram of the principle of a preferred embodiment of the error-correction-based lossless compression system for genome sequencing data of the present invention;
Fig. 8 is a schematic diagram of the operating environment of a preferred embodiment of the terminal of the present invention.
Detailed Description
To make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
In a preferred embodiment of the present invention, as shown in Fig. 1 and Fig. 2, the error-correction-based lossless compression method for genome sequencing data includes the following steps:
Step S10: identify and correct sequencing base errors in the original short reads, and record base error information, the base error information including the position of each erroneous base and the original base.
Specifically, step S10 includes: counting k-mers, a k-mer being any base subsequence of length k in an original short read; obtaining the high-frequency k-mers and inserting them into a Bloom filter, a high-frequency k-mer being a k-mer whose number of occurrences is greater than a preset threshold; and correcting, based on the Bloom filter, the sequencing base errors in the index region or in the entire region of the original short reads, where the base error information includes the position of each erroneous base and the original base.
As shown in Fig. 2, the original short reads are referred to as reads. Sequencing errors in the reads can be identified and corrected in two ways: one identifies and corrects only the bases in the index region of each read, and the other identifies and corrects all bases of each read. Both ways record the base error information (the position of each erroneous base and the original base), so that the original sequence can be restored without loss during decompression and the compression remains lossless.
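The role of the recorded error information can be illustrated with a small round trip. This is a sketch under assumptions: the record is modelled as a list of (position, original base) pairs with 0-based positions, and the names `correct_read` and `restore_read` are hypothetical.

```python
def correct_read(read, fixes):
    """Apply corrections and record what was overwritten.

    `fixes` maps a 0-based position to the corrected base. The returned
    record lists (position, original_base) pairs, which is the information
    needed to undo the correction.
    """
    bases = list(read)
    record = []
    for pos in sorted(fixes):
        record.append((pos, bases[pos]))
        bases[pos] = fixes[pos]
    return "".join(bases), record


def restore_read(corrected, record):
    """Undo the corrections using the recorded positions and original bases."""
    bases = list(corrected)
    for pos, original in record:
        bases[pos] = original
    return "".join(bases)


original = "ACGTTGCA"
corrected, record = correct_read(original, {3: "A"})
print(corrected, record)                  # ACGATGCA [(3, 'T')]
print(restore_read(corrected, record))    # ACGTTGCA
```

Because every overwritten base is kept in the record, correction never loses information, whichever of the two correction modes produced the fixes.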
Here, a k-mer is a base subsequence of length k in a read, and k-mers are counted for both the read sequence and its reverse-complement sequence (in the reverse complement, A pairs with T and C pairs with G). The high-frequency k-mers are then inserted into a Bloom filter. A k-mer whose number of occurrences exceeds a given threshold is called a high-frequency k-mer; because the base sequencing error rate is fairly low, high-frequency k-mers are generally produced by correctly sequenced bases. A Bloom filter consists of a long binary vector and a series of random mapping functions, and can be used to test whether an element belongs to a set; its advantage is that such a membership query takes O(1) time. Finally, the base errors in the index region of each read (the Minimizer, as shown in Fig. 3) or in the entire read are corrected. The index region of a read (also called the Minimizer) is the minimum of all k-mers of the forward sequence and the reverse-complement sequence, with the bases ordered alphabetically as A, C, G, T (as shown in Fig. 4).
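The description does not spell out how a candidate correction is searched for, so the following is only one plausible sketch of correcting a single k-mer against the set of trusted (high-frequency) k-mers. It assumes that errors are single-base substitutions and that a correction is accepted only when exactly one trusted neighbour exists; a plain Python `set` stands in for the Bloom filter, since both answer membership queries. The name `correct_kmer` is hypothetical.

```python
def correct_kmer(kmer, trusted):
    """Try to repair one substitution error in `kmer`.

    `trusted` is any container supporting `in` (a set here, a Bloom filter
    in the described method). Returns (corrected_kmer, record), where record
    is (position, original_base), or None if nothing was changed.
    """
    if kmer in trusted:
        return kmer, None  # already a high-frequency k-mer

    candidates = []
    for pos, original in enumerate(kmer):
        for base in "ACGT":
            if base == original:
                continue
            candidate = kmer[:pos] + base + kmer[pos + 1:]
            if candidate in trusted:
                candidates.append((candidate, pos, original))

    # Accept only an unambiguous repair; otherwise leave the k-mer untouched.
    if len(candidates) == 1:
        candidate, pos, original = candidates[0]
        return candidate, (pos, original)
    return kmer, None


trusted = {"ACGT", "TTTT"}
print(correct_kmer("ACGA", trusted))  # ('ACGT', (3, 'A'))
print(correct_kmer("GGGG", trusted))  # ('GGGG', None)
```

Rejecting ambiguous repairs is a conservative design choice: a wrong correction would still be reversible through the record, but it would cost storage without improving bucket consistency.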
Step S20: classify the original short reads into the corrected index region files, and add the sequencing error correction information to the index region files.
Specifically, each read is classified into the file of its corrected Minimizer, and the sequencing error correction information is added to the same index region file so that the original short read can be restored without loss during decompression. Once the sequencing base errors have been corrected, the Minimizer of the read must be recomputed.
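Step S20 can be sketched as grouping corrected reads by their recomputed Minimizer while keeping each read's error record alongside it. In this illustration an in-memory dictionary stands in for the index region files, and the names `minimizer` and `bucket_reads` as well as the sample reads are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def minimizer(read, k):
    """Smallest k-mer over the read and its reverse complement."""
    rc = read.translate(_COMPLEMENT)[::-1]
    return min(s[i:i + k] for s in (read, rc) for i in range(len(s) - k + 1))


def bucket_reads(corrected_reads, k):
    """Group (corrected_read, error_record) pairs by recomputed minimizer."""
    buckets = defaultdict(list)
    for read, record in corrected_reads:
        buckets[minimizer(read, k)].append((read, record))
    return dict(buckets)


# Before correction the erroneous read has a different minimizer.
print(minimizer("GGACAGG", 3))  # ACA
print(minimizer("GGAAAGG", 3))  # AAA

# After correcting position 3 (C -> A) both reads share bucket "AAA";
# the record (3, "C") travels with the read so it can be restored later.
buckets = bucket_reads([("GGAAAGG", []), ("GGAAAGG", [(3, "C")])], k=3)
print(sorted(buckets))          # ['AAA']
```

The example shows why the Minimizer has to be recomputed after correction: the single erroneous base changes the smallest k-mer and would otherwise send the read to a different bucket.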
Step S30: sort and compress the base sequences of the original short reads within the different index region files to obtain a compressed result file of the genome sequencing data.
Specifically, step S30 includes: after all short reads have been clustered, splitting and sorting all original short reads within the same index; and, after the splitting and sorting of all original short reads within the same index is complete, compressing them to obtain the compressed result file of the genome sequencing data.
After all short reads have been clustered, all reads within the same index are split and sorted. For example, a sequence r is split and converted with the transform tran(r), as shown in Fig. 5. The conversion divides the short read r into three regions, Pre(r), Index(r) and Suf(r), for example Pre(r) = GCGTTGGCCCT, Index(r) = AAGAGGCC and Suf(r) = GCGATGCTCCGTTATCTCTCGTTG, such that r = Pre(r) ∘ Index(r) ∘ Suf(r), where ∘ denotes string concatenation and Index(r) is the index of the bucket in which the read is placed. The conversion is defined as tran(r) = Suf(r) ∘ Pre(r) ∘ IndexPos, where IndexPos is the position offset of the bucket index Index(r), that is, of the Minimizer, within the short read; in this example Suf(r) is GCGATGCTCCGTTATCTCTCGTTG, Pre(r) is GCGTTGGCCCT and IndexPos is 10. tran is called the short-read split transform: it splits the read at a specific offset and swaps the two substrings produced by the split, placing the second before the first. The transform brings two benefits. First, it removes the explicit redundancy in the short reads, namely the bucket index substring Index(r), since all reads in a bucket share that bucket index. Second, it moves the subsequence that follows the bucket index to the front of each read; these adjusted prefixes are more likely to share similar sequences, which helps downstream compression tools discover and exploit the redundancy of the shared substrings.
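The split transform and its inverse can be sketched as follows, using the strings of the example above. The function names `tran` and `inverse_tran` and the return format are hypothetical. The sketch takes IndexPos to be the 0-based start of the index within the read, which is an assumption; the numbering convention behind the value given in the example may differ.

```python
def tran(read, index, index_pos):
    """Split `read` around the bucket index and swap the two halves.

    Returns (Suf + Pre, index_pos). The index itself is dropped because
    every read in the bucket shares it.
    """
    if read[index_pos:index_pos + len(index)] != index:
        raise ValueError("index not found at index_pos")
    pre = read[:index_pos]
    suf = read[index_pos + len(index):]
    return suf + pre, index_pos


def inverse_tran(body, index_pos, index):
    """Rebuild the read from Suf + Pre, the offset and the bucket index."""
    suf_len = len(body) - index_pos  # Pre has exactly index_pos bases
    suf, pre = body[:suf_len], body[suf_len:]
    return pre + index + suf


pre = "GCGTTGGCCCT"
index = "AAGAGGCC"
suf = "GCGATGCTCCGTTATCTCTCGTTG"
read = pre + index + suf

body, pos = tran(read, index, len(pre))
print(body == suf + pre)                        # True
print(inverse_tran(body, pos, index) == read)   # True
```

Within a bucket, the transformed strings can then be sorted so that reads with similar prefixes sit next to each other before being handed to a general-purpose compressor.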
进一步地,为了验证本发明方法的两种方法:BICmin和BICall在基因组测序数据压缩上的有效性,BICmin(只纠正Minimizer区域)和BICall(纠正整个序列)方法对真实的五种不同物种的基因组测序数据进行压缩,并同其他三个行业内的基因组测序数据压缩方法:SCALCE、ORCOM和MINCE的压缩结果进行比较,如图6所示,五种不同物种的数据集ID分别为SRR554369_1、MH0001.081026_1、SRR327342_1、SRR870667_1、ERR174310_1,五种不同方式对应各自的压缩率(Bits Per Base),例如使用BICmin时,SRR554369_1、MH0001.081026_1、SRR327342_1、SRR870667_1、ERR174310_1对应的压缩率分别为0.43、0.7、0.26、0.88、0.67;例如使用BICall时,SRR554369_1、MH0001.081026_1、SRR327342_1、SRR870667_1、ERR174310_1对应的压缩率分别为0.41、0.68、0.24、0.88、0.66。Further, in order to verify the effectiveness of the two methods of the method of the present invention: BICmin and BICall in the compression of genome sequencing data, BICmin (correcting only the Minimizer region) and BICall (correcting the entire sequence) methods are used for real genomes of five different species. The sequencing data is compressed and compared with the compression results of the genome sequencing data compression methods in other three industries: SCALCE, ORCOM and MINCE. As shown in Figure 6, the dataset IDs of five different species are SRR554369_1 and MH0001, respectively. 081026_1, SRR327342_1, SRR870667_1, ERR174310_1, five different methods correspond to their respective compression ratios (Bits Per Base). For example, when using BICmin, SRR554369_1, MH0001.081026_1, SRR327342_1, SRR870667_1, ERR172610_1 correspond to compression ratios of 0.7, 0.0.7, and ERR172610_1 respectively. , 0.88, 0.67; for example, when using BICall, the compression ratios corresponding to SRR554369_1, MH0001.081026_1, SRR327342_1, SRR870667_1, and ERR174310_1 are 0.41, 0.68, 0.24, 0.88, and 0.66, respectively.
此处压缩率代表每个碱基压缩后所占用的比特数,越低说明占用的空间越小,压缩效果越好。从图6可见,BICall方法相比于BICmin在压缩效果上略有提高,但提高效果并不十分明显,因为在BICall方法中,去除了由测序短片段的分桶不同带来的压缩效果后,测序错误纠正提高虽然提高了桶内序列的一致性,但同时仍然序列记录存储纠正碱基的位置和原碱基信息,仍占用部分存储空间,所以将BIC方法提供了两种压缩方式,一种是只纠正能够改变测序短片段索引值的碱基错误方法:BICmin,一种是纠正测序短片段中全部的碱基错误方法:BICall。The compression ratio here represents the number of bits occupied by each base after compression. The lower the compression rate, the smaller the space occupied and the better the compression effect. It can be seen from Figure 6 that the BICall method has a slight improvement in the compression effect compared with BICmin, but the improvement effect is not very obvious, because in the BICall method, after removing the compression effect caused by the different buckets of sequencing short fragments, Although the improvement of sequencing error correction improves the consistency of the sequences in the bucket, it still stores the position of the corrected base and the original base information in the sequence record, which still occupies part of the storage space. Therefore, the BIC method provides two compression methods, one is It is a method of correcting only the base errors that can change the index value of the sequenced short fragments: BICmin, and the other is a method of correcting all the base errors in the sequenced short fragments: BICall.
High-throughput genomic data contain a large number of base sequencing errors caused by the sequencing technology. These errors reduce the sequence consistency among short sequencing reads and thereby degrade the compression of genome sequencing data. The present invention uses the high-throughput information in genome sequencing data to correct the bases at erroneous sites and, at the same time, establishes a character representation for the different base corrections, so that compression of the genome sequencing data is improved while the original sequence can still be restored without loss. The method achieved good compression results in lossless compression experiments on publicly available genome sequencing datasets.
Beneficial effects:
(1) The method of the present invention provides an efficient sequence error-correction scheme: k-mers are counted, high-frequency k-mers are inserted into a Bloom filter, and erroneous bases in the bucket index region are corrected, thereby correcting sequencing errors in short sequencing reads.
(2) By correcting sequencing errors among the bases, the method causes more similar short reads to be assigned to the same bucket, which increases the data redundancy density in that bucket and in turn improves the compression efficiency of the short reads subsequently compressed within it.
(3) By recording the corrected bucket index sequence of each short read together with the corrected positions and the original bases, the method can restore the corrected bases without loss during decompression, achieving lossless compression of genome sequencing data.
(4) Compression experiments were carried out on real genome sequencing data from five different species, and the results were compared with those of three compression methods in the field (SCALCE, ORCOM and MINCE). The experimental results show that the method of the present invention can effectively compress genome sequencing data.
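The lossless property described in effect (3) rests on keeping, for every corrected base, its position and its original value. The sketch below illustrates that round trip under stated assumptions: positions are 0-based, corrections are single-base substitutions, and the function names `apply_corrections` and `restore` are hypothetical rather than taken from the patent.

```python
def apply_corrections(read, corrections):
    """Apply (position, corrected_base) pairs to a read.

    Returns the corrected read and a log of (position, original_base)
    pairs, which is the information needed to undo the corrections.
    """
    bases = list(read)
    log = []
    for pos, new_base in corrections:
        log.append((pos, bases[pos]))  # remember where and what was replaced
        bases[pos] = new_base
    return "".join(bases), log


def restore(corrected_read, log):
    """Undo the corrections in reverse order to recover the original read."""
    bases = list(corrected_read)
    for pos, original_base in reversed(log):
        bases[pos] = original_base
    return "".join(bases)


if __name__ == "__main__":
    original = "ACGTTGCA"
    corrected, log = apply_corrections(original, [(2, "T"), (5, "A")])
    assert restore(corrected, log) == original
```

Because the log is stored alongside the corrected read, decompression can reproduce the read exactly as it was sequenced, errors included.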
Further, as shown in FIG. 7, based on the above error-correction-based lossless compression method for genome sequencing data, the present invention also provides a corresponding error-correction-based lossless compression system for genome sequencing data, wherein the system includes:
an identification and correction module 51, configured to identify and correct sequencing base errors in the original short reads and to record base error information, the base error information including the position of each erroneous base and the original base;
a classification and recording module 52, configured to classify the original short reads into the corrected index region files and to add the sequencing error correction information to the index region files;
a sorting and compression module 53, configured to sort and compress the base sequences of the original short reads in the different index region files to obtain a compressed result file of the genome sequencing data.
The identification and correction module 51 specifically includes:
a counting unit, configured to count k-mers, where the k-mers are all base subsequences of length k in the original short reads;
an insertion unit, configured to obtain high-frequency k-mers and insert them into a Bloom filter, a high-frequency k-mer being a k-mer whose number of occurrences is greater than a preset threshold;
a correction unit, configured to correct, based on the Bloom filter, sequencing base errors in the index region or in the entire region of the original short reads.
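The three units above can be sketched as follows. This is an illustration under assumptions, not the patented procedure: the `BloomFilter` class, its size and hash count, and the substitution rule in `correct_read` (accept a replacement base only when exactly one alternative makes every k-mer covering that position pass the filter) are all hypothetical choices made for the example.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Small Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def count_kmers(reads, k):
    """Counting unit: count every length-k substring of every read."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_filter(reads, k, threshold):
    """Insertion unit: keep k-mers seen more than `threshold` times."""
    bloom = BloomFilter()
    for kmer, n in count_kmers(reads, k).items():
        if n > threshold:
            bloom.add(kmer)
    return bloom


def covering_kmers(read, pos, k):
    """All k-mers of `read` that include position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(read) - k)
    return [read[i:i + k] for i in range(start, end + 1)]


def correct_read(read, bloom, k):
    """Correction unit: return (corrected_read, [(position, original_base)])."""
    bases = list(read)
    log = []
    for pos in range(len(bases)):
        current = "".join(bases)
        if all(km in bloom for km in covering_kmers(current, pos, k)):
            continue  # every k-mer touching this base is high-frequency
        candidates = []
        for alt in "ACGT":
            if alt == bases[pos]:
                continue
            trial = current[:pos] + alt + current[pos + 1:]
            if all(km in bloom for km in covering_kmers(trial, pos, k)):
                candidates.append(alt)
        if len(candidates) == 1:  # correct only when the fix is unambiguous
            log.append((pos, bases[pos]))
            bases[pos] = candidates[0]
    return "".join(bases), log


if __name__ == "__main__":
    reads = ["ACGTACGGTCA"] * 5 + ["ACGTACTGTCA"]  # last read has one error
    bloom = build_filter(reads, k=4, threshold=2)
    print(correct_read(reads[-1], bloom, k=4))
```

A Bloom filter can report false positives, so a real implementation would size it for the expected number of high-frequency k-mers; the values used here are only large enough for a toy input. Restricting the loop in `correct_read` to the positions of the index region would correspond to the BICmin mode, while scanning the whole read corresponds to BICall.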
The sorting and compression module 53 specifically includes:
a splitting and sorting unit, configured to split and sort all original short reads within the same index after all short read sequences have been clustered;
a data compression unit, configured to perform compression, after all original short reads within the same index have been split and sorted, to obtain the compressed result file of the genome sequencing data.
Further, as shown in FIG. 8, based on the above error-correction-based lossless compression method and system for genome sequencing data, the present invention also provides a corresponding terminal, which includes a processor 10, a memory 20 and a display 30. FIG. 8 shows only some of the components of the terminal; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
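In some embodiments, the memory 20 may be an internal storage unit of the terminal, such as a hard disk or internal memory of the terminal. In other embodiments, the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the terminal. The memory 20 is used to store application software installed on the terminal and various kinds of data, such as the program code installed on the terminal, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores an error-correction-based genome sequencing data lossless compression program 40, which can be executed by the processor 10 to implement the error-correction-based lossless compression method for genome sequencing data of the present application.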
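In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor or another data processing chip, and is used to run the program code stored in the memory 20 or to process data, for example to execute the error-correction-based lossless compression method for genome sequencing data.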
In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch display, or the like. The display 30 is used to display information on the terminal and to display a visual user interface. The components 10 to 30 of the terminal communicate with one another via a system bus.
In one embodiment, when the processor 10 executes the error-correction-based genome sequencing data lossless compression program 40 stored in the memory 20, the following steps are implemented:
identifying and correcting sequencing base errors in the original short reads, and recording base error information, the base error information including the position of each erroneous base and the original base;
classifying the original short reads into the corrected index region files, and adding the sequencing error correction information to the index region files;
sorting and compressing the base sequences of the original short reads in the different index region files to obtain a compressed result file of the genome sequencing data.
Identifying and correcting the sequencing base errors in the original short reads specifically includes:
counting k-mers, where the k-mers are all base subsequences of length k in the original short reads;
obtaining high-frequency k-mers and inserting them into a Bloom filter, a high-frequency k-mer being a k-mer whose number of occurrences is greater than a preset threshold;
correcting, based on the Bloom filter, sequencing base errors in the index region or in the entire region of the original short reads.
In the error-correction-based lossless compression method for genome sequencing data, the high-frequency k-mers are generated by correctly sequenced bases.
The index region is the minimum of all k-mers of the forward sequence and the reverse-complement sequence of the original short read.
The bases in the index region are ordered alphabetically as A, C, G, T.
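The definition of the index region can be illustrated with a short sketch. It assumes uppercase reads over the alphabet A, C, G, T only, for which ordinary string comparison already gives the order A < C < G < T; the function names `index_region` and `bucket_reads` and the choice k = 3 are illustrative and not specified by the source.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def index_region(read, k):
    """Smallest k-mer over the read and its reverse complement."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    candidates = []
    for seq in (read, reverse_complement(read)):
        candidates.extend(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Plain string order equals A < C < G < T for uppercase ACGT input.
    return min(candidates)


def bucket_reads(reads, k):
    """Group reads by their index region, one bucket per index value."""
    buckets = {}
    for read in reads:
        buckets.setdefault(index_region(read, k), []).append(read)
    return buckets


if __name__ == "__main__":
    print(index_region("GGTCA", 3))              # ACC
    print(bucket_reads(["GGTCA", "TGACC"], 3))   # both reads share bucket ACC
```

Taking the minimum over both strands means a read and its reverse complement fall into the same bucket. It also shows why a single sequencing error inside the index region matters: it can change the minimum and send the read to a different bucket.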
Sorting and compressing the base sequences of the original short reads in the different index region files to obtain the compressed result file of the genome sequencing data specifically includes:
after all short read sequences have been clustered, splitting and sorting all original short reads within the same index;
after all original short reads within the same index have been split and sorted, performing compression to obtain the compressed result file of the genome sequencing data.
Splitting and sorting all original short reads within the same index after all short read sequences have been clustered specifically includes:
applying the transformation tran(r) to split and transform a short read r;
the splitting transformation divides the short read r into three regions, Pre(r), Index(r) and Suf(r), such that r = Pre(r) ∘ Index(r) ∘ Suf(r), where ∘ denotes string concatenation.
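The split can be sketched in a few lines. This example assumes the index region occurs in the read as given (a read whose index came from its reverse-complement strand would first need to be oriented accordingly) and splits at the first occurrence; both choices, like the Python signature of `tran`, are assumptions made for illustration.

```python
def tran(read, index):
    """Split a read into (Pre, Index, Suf) around the first occurrence of index."""
    start = read.find(index)
    if start < 0:
        raise ValueError("index region not found in read")
    end = start + len(index)
    return read[:start], read[start:end], read[end:]


if __name__ == "__main__":
    pre, idx, suf = tran("TTACGGA", "ACG")
    # Concatenating the three parts gives back the read.
    assert pre + idx + suf == "TTACGGA"
```

Since every read in a bucket shares the same Index part, only Pre and Suf differ between reads, which is what makes sorting on these parts useful before compression.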
After sorting and compressing the base sequences of the original short reads in the different index region files to obtain the compressed result file of the genome sequencing data, the method further includes:
calculating the compression rate of the genome sequencing data, where the compression rate is the number of bits occupied by each base after compression; a lower compression rate means less space is occupied and the compression effect is better.
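The compression rate as defined here is a simple ratio. The sketch below assumes the compressed size is measured in bytes and the base count covers all reads in the input; the function name `bits_per_base` and the example numbers are hypothetical.

```python
def bits_per_base(compressed_bytes, total_bases):
    """Number of bits used per base after compression (lower is better)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return compressed_bytes * 8 / total_bases


if __name__ == "__main__":
    # Hypothetical example: 1,000,000 bases stored in 50,000 bytes.
    print(bits_per_base(50_000, 1_000_000))  # 0.4
```

For comparison, a plain fixed-width encoding of the four bases A, C, G and T needs 2 bits per base, so any value below 2 indicates that redundancy between reads is being exploited.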
In summary, the present invention provides an error-correction-based lossless compression method, system, terminal and computer-readable storage medium for genome sequencing data. The method includes: identifying and correcting sequencing base errors in the original short reads, and recording base error information, the base error information including the position of each erroneous base and the original base; classifying the original short reads into the corrected index region files, and adding the sequencing error correction information to the index region files; and sorting and compressing the base sequences of the original short reads in the different index region files to obtain a compressed result file of the genome sequencing data. The present invention corrects sequencing errors in short reads efficiently. By correcting sequencing errors among the bases, it causes more similar short reads to be assigned to the same bucket, which increases the data redundancy density within the bucket and in turn improves the compression efficiency of the short reads subsequently compressed within it. By recording the corrected bucket index sequence of each short read together with the corrected positions and the original bases, it can restore the corrected bases without loss during decompression, achieving lossless compression of genome sequencing data.
It should be noted that, herein, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or terminal. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or terminal that includes the element.
It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or transformations based on the above description, and all such improvements and transformations shall fall within the protection scope of the appended claims of the present invention.
Claims (13)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210744033.1A CN115114238A (en) | 2022-06-28 | 2022-06-28 | An error-correction-based genome sequencing data lossless compression method and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115114238A true CN115114238A (en) | 2022-09-27 |
Family
ID=83330579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210744033.1A Pending CN115114238A (en) | 2022-06-28 | 2022-06-28 | An error-correction-based genome sequencing data lossless compression method and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115114238A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798605A (en) * | 2022-11-11 | 2023-03-14 | 深圳大学 | Nanopore sequencing original signal data compression method, device, equipment and medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108614954A (en) * | 2016-12-12 | 2018-10-02 | 深圳华大基因科技服务有限公司 | A kind of method and apparatus of the short sequencing error corrections of two generation sequences |
Non-Patent Citations (1)
Title |
---|
王荣杰: "高通量基因组数据的无损压缩方法研究", 中国博士学位论文全文数据库基础科技辑, no. 01, 15 January 2020 (2020-01-15), pages 006 - 28 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||