CN111640467B - Lossless compression method for DNA sequencing quality scores based on adaptive coding order
- Publication number
- CN111640467B (application CN202010446416.1A)
- Authority
- CN
- China
- Prior art keywords
- num
- quality score
- compression
- data
- row
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
Abstract
Description
Technical Field
The invention belongs to the technical field of data compression and in particular relates to a lossless compression method for DNA sequencing quality scores, which can be used to compress biological gene sequencing data.
Background
Sequencing has become a widely used technique in biological research; obtaining the genetic information of different organisms helps improve our understanding of the organic world. With the rapid development of next-generation high-throughput sequencing (NGS), sequencing companies led by Illumina keep introducing new sequencing technologies, so sequencing costs have fallen quickly: the price of human whole-genome sequencing (WGS) has dropped to US$1,000 or less and continues to fall faster than the rate implied by Moore's law. Under these conditions, the volume of next-generation sequencing data will exceed that of astronomical data, and the cost of storing and transmitting it grows accordingly. Reducing the size of gene sequencing data through compression, and thereby lowering storage and transmission costs, is therefore of great significance. Research on gene compression tools has produced many results, but no existing scheme considers reducing the bitstream from the perspective of coding order, so compression efficiency still has room for improvement.
Next-generation sequencing produces thousands upon thousands of short reads, which are usually stored in the widely accepted text-based FASTQ format, containing all the information produced by sequencing. Each short read contains three parts: first, metadata describing information such as the sequencing platform; second, the DNA base sequence, recording the DNA fragment obtained in the current read; third, the quality scores, indicating the confidence of each symbol call in the corresponding base sequence. Quality score data in FASTQ files are highly random and noisy, depend on factors such as the sequencing instrument and sequencing method, usually contain dozens of distinct characters, and are hard to compress; they typically account for about 70% of the compressed file. The compression result for the quality scores therefore has a decisive influence on the compression of the whole FASTQ file.
At present, the typical methods for lossless compression of quality scores in gene sequencing data are mainly the following:
The first uses existing general-purpose text compressors such as Gzip and 7z, which are the most common way to compress FASTQ files. These tools are designed mainly for ordinary character sequences and do not take into account the particular characteristics of quality score data, so they perform poorly on gene sequencing data.
The second comprises improved run-length and dictionary methods developed for genetic data compression. In most cases these compress worse than entropy coding methods and cannot reduce the compression ratio as far as possible.
The third comprises compression algorithms designed specifically for quality scores, such as Quip. These use a high-order Markov model for predictive coding. Although they achieve good compression, they have a large storage footprint, computing the prediction model is overly complex, and the influence of coding order on compression is not considered, which leads to long compression times and poor robustness.
Summary of the Invention
The object of the invention is to overcome the above shortcomings of the prior art by proposing a lossless compression method for DNA sequencing quality scores based on adaptive coding order, so as to maximize compression performance without increasing compression time.
The technical solution of the invention is as follows: first, extract the base sequences and quality score sequences from the FASTQ file; then compute and quantize the mean of each row of quality score data, and build a prediction model from the context information, mean information, and base information; finally, drive an arithmetic coder in a serpentine coding order to encode the sequence, thereby compressing the quality scores. The specific implementation is as follows:
(1) Extract the quality score data and base data from the FASTQ file:
(1a) Statistically analyze the characteristics of the DNA sequencing data and create two coding compression blocks $P_1$ and $P_2$ of size M×N, where M is the number of rows in a block, i.e. the number of rows of quality score data processed at a time, and N is the number of columns in a block, i.e. the length of a quality score line, with N≤150;
(1b) Extract the quality score data and base data stored in the FASTQ file into the first coding compression block $P_1$ and the second coding compression block $P_2$, respectively;
(2) Compute the mean of each row of quality scores extracted into the first coding compression block $P_1$ and quantize it, obtaining an M×1 row mean matrix F;
(3) Collect the context information, base information, and row mean information of the character being coded, quantize them jointly, and compute the final coding model:
(3a) Build a model for the current character q: take the four preceding characters $q_1$, $q_2$, $q_3$, and $q_4$; take from the second coding compression block $P_2$ the bases corresponding to the current character and the previous character, denoted $j_1$ and $j_2$; take from the row mean matrix F the mean of the row containing q, denoted f, which is the already-quantized value; for edge characters that lack preceding context, $q_1$, $q_2$, $q_3$, and $q_4$ take the same symbol or are set to zero;
(3b) Reduce the model cost by quantizing the overall model: let A be the larger of the first two characters $q_1$ and $q_2$, let B be the larger of the last two characters $q_3$ and $q_4$, create two distinct flags C and D, and compute the final coding model of the current symbol:
$$P_{now} = A \cdot B \cdot C \cdot D \cdot j_1 \cdot j_2 \cdot f$$
where C=1 when $q_1 = q_2$ and C=0 otherwise; D=1 when $q_3 = q_4$ and D=0 otherwise; $P_{now}$ is the probability estimate for the current coded symbol;
(4) Use the designed final coding model to drive an adaptive arithmetic coder, traversing and compressing the first coding compression block $P_1$ in serpentine coding order along the direction of strongest correlation.
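Step (4) relies on an adaptive model whose probabilities are updated as symbols are coded. The following is a minimal sketch of that idea only, not the patent's coder: it keeps one frequency table per context key and reports the ideal code length, in bits, that an arithmetic coder would approach. The class name `AdaptiveContextModel`, the initial count of 1 per symbol, and the use of an arbitrary hashable context key are assumptions made for illustration.

```python
import math
from collections import defaultdict


class AdaptiveContextModel:
    """Per-context adaptive frequency model (illustrative, hypothetical name)."""

    def __init__(self, alphabet_size: int):
        self.alphabet_size = alphabet_size
        # Every context starts with a count of 1 per symbol (an assumption),
        # so no symbol ever has zero probability.
        self.counts = defaultdict(lambda: [1] * alphabet_size)

    def cost_and_update(self, key, symbol: int) -> float:
        """Return the ideal code length in bits for `symbol`, then adapt."""
        freq = self.counts[key]
        bits = -math.log2(freq[symbol] / sum(freq))
        freq[symbol] += 1  # probability update after coding the symbol
        return bits


model = AdaptiveContextModel(alphabet_size=4)
first = model.cost_and_update("ctx", 2)   # 2.0 bits: probability 1/4
second = model.cost_and_update("ctx", 2)  # cheaper: probability is now 2/5
```

Repeated symbols in the same context become cheaper to code, which is why a context that predicts the next quality score well reduces the output size.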
Compared with the prior art, the invention has the following advantages:
1. Because the invention makes full use of the probability update mechanism of the arithmetic coder, its compression ratio for quality score data in FASTQ files with equal-length reads is better than that of all current algorithms.
2. Because the invention compresses the mean of each row of quality scores together with the quality score data, the means are easy to tally and access in downstream processing.
3. Because the designed coding model has a simple structure, it is highly portable and easy to optimize further and to integrate into compression of the whole FASTQ file; it can be widely applied in compression schemes that use this module and has good extensibility.
Brief Description of the Drawings
Fig. 1 is a flowchart of the implementation of the invention;
Fig. 2 is a schematic diagram of the quantization of the row means of the quality scores in the invention;
Fig. 3 is a schematic diagram of the serpentine scanning order used in the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and a specific embodiment.
Referring to Fig. 1, the implementation steps of the invention are as follows:
Step 1: extract the quality score data and base data from the FASTQ file.
Gene sequencing produces thousands upon thousands of short reads, which are usually stored in the widely accepted text-based FASTQ format, containing all the information produced by sequencing. In the FASTQ file format, each short read consists of four lines separated by newline characters, where:
The first line begins with the character '@', followed by a unique sequence identifier and an optional sequence description; the identifier and description are separated by a space;
The second line is the nucleotide sequence, representing the base data; it consists only of the five characters {'A', 'T', 'C', 'G', 'N'}, where 'N' denotes an ambiguous base;
The third line begins with the character '+', optionally followed by the sequence identifier and description again, and acts as a separator;
The last line is the quality score line; each character gives the quality of the base at the corresponding position in the second line. A quality score corresponds to the number $Q = -10\log_{10} P$, where $P$ is the probability that the corresponding nucleotide in the read was called incorrectly. Quality scores are usually represented by ASCII characters with codes in [33:73] or [64:104] and are used both for quality control of the raw data and in downstream processing.
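As a small worked illustration of the relation $Q = -10\log_{10} P$ and the ASCII encoding described above, the sketch below decodes a quality line and converts a score back to an error probability. The function names are hypothetical, and the default offset of 33 is an assumption that matches the [33:73] range; files using the [64:104] range would need an offset of 64.

```python
def decode_quality_line(line: str, offset: int = 33) -> list[int]:
    """Map each ASCII quality character to its integer score Q."""
    return [ord(ch) - offset for ch in line]


def phred_to_error_prob(q: int) -> float:
    """Invert Q = -10 * log10(P) to recover the error probability P."""
    return 10 ** (-q / 10)


scores = decode_quality_line("I5+!")   # [40, 20, 10, 0]
p_err = phred_to_error_prob(20)        # about 0.01, i.e. a 1% chance of a wrong call
```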
This step is implemented as follows:
1.1) Statistically analyze the characteristics of the DNA sequencing data and create two coding compression blocks $P_1$ and $P_2$ of size M×N, where M is the number of rows in a block, i.e. the number of rows of quality score data processed at a time, and N is the number of columns in a block, i.e. the length of a quality score line, with N≤150;
1.2) Extract the quality score data and base data stored in the FASTQ file into the first coding compression block $P_1$ and the second coding compression block $P_2$, respectively;
Because the number of distinct quality score characters in most FASTQ files is less than 40 and the values do not jump much, a good prediction model can be designed from the correlation within the data to improve compression. At the same time, using too many context characters not only raises time and computational complexity but can also incur model cost, so an appropriately sized compression block is needed to gather statistics on the correlation between quality scores. Within the limits of the available computing resources, a larger compression block gives better compression; so as not to exceed the maximum memory, this embodiment uses a compression block of 2000000×160. The total number of models is set to 40×40×40×16. During actual compression, one block's worth of data is processed at a time until the end of the file.
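The block extraction in steps 1.1) and 1.2) can be sketched as follows. This is an illustrative reading, not the patent's implementation: the generator name `read_blocks` is hypothetical, each block is held as a list of strings instead of a fixed M×N array, and the input is assumed to be well-formed four-line FASTQ records.

```python
from typing import Iterable, Iterator


def read_blocks(lines: Iterable[str], m: int) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (P1, P2) pairs: up to m quality lines and the matching base lines."""
    p1: list[str] = []   # quality score rows
    p2: list[str] = []   # base rows
    record: list[str] = []
    for raw in lines:
        record.append(raw.rstrip("\n"))
        if len(record) == 4:           # header, bases, '+', qualities
            _, bases, _, quals = record
            p1.append(quals)
            p2.append(bases)
            record = []
            if len(p1) == m:           # block is full: hand it to the coder
                yield p1, p2
                p1, p2 = [], []
    if p1:                             # final, possibly shorter block
        yield p1, p2


fastq = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
         "@r2\n", "TTGA\n", "+\n", "FFFF\n",
         "@r3\n", "NNNN\n", "+\n", "!!!!\n"]
blocks = list(read_blocks(fastq, m=2))
```

With `m=2` the three records above produce one full block and one final block containing a single row, mirroring the statement that data is processed one block at a time until the end of the file.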
Step 2: compute the mean of each row of quality scores extracted into the first coding compression block $P_1$ and quantize it, obtaining an M×1 row mean matrix F.
2.1) Average each row of the first coding compression block $P_1$ of size M×N: add the N quality score values of the row and divide by N to obtain the mean quality score of that row;
2.2) Quantize the resulting row means and store them:
Referring to Fig. 2, after the mean quality score of each row has been computed, the means are clustered according to their distribution: means that occur frequently are subdivided, and quality values that are infrequent and low are merged, which improves coding efficiency. For a specific file, a dedicated quantization scheme could be designed from the distribution of the means to achieve the best result, but this adds both computation and considerable extra computing time. This example therefore uses a quantization scheme that is more general and easy to implement: two adjacent mean values are treated as the same case, and the range of low, infrequent quality values is treated as a single case. Summarizing experience with quantization gives the following result:
If $f_i < (num-15)$, then $f_i = (num-15)$;
If $(num-15) \le f_i < (num-13)$, then $f_i = (num-13)$;
If $(num-13) \le f_i < (num-11)$, then $f_i = (num-11)$;
If $(num-11) \le f_i < (num-9)$, then $f_i = (num-9)$;
If $(num-9) \le f_i < (num-7)$, then $f_i = (num-7)$;
If $(num-7) \le f_i$, then $f_i = (num-6)$;
where num is the total number of coding symbols, 40; $f_i$ is the mean of the current row; and $i$ ranges over $[1, M]$;
The quantized row means $f_i$ are then stacked as a column to obtain the M×1 row mean matrix F.
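The quantization rules above can be written compactly as a short function. This is a sketch under stated assumptions: the function names are hypothetical, rows are assumed non-empty, and the mean is taken over offset-subtracted scores (offset 33) so that it falls in the same range as the thresholds derived from num = 40; the source does not state which scale the mean uses.

```python
def quantize_row_mean(f: float, num: int = 40) -> int:
    """Apply the six-level quantization rule to one row mean."""
    if f < num - 15:
        return num - 15
    # Each remaining interval [lower, upper) maps to its upper bound.
    for upper in (num - 13, num - 11, num - 9, num - 7):
        if f < upper:
            return upper
    return num - 6   # top level, as given in the rules above


def row_mean_matrix(block: list[str], offset: int = 33, num: int = 40) -> list[int]:
    """Return the quantized mean of every row, i.e. the M x 1 matrix F as a list."""
    result = []
    for row in block:
        mean = sum(ord(ch) - offset for ch in row) / len(row)
        result.append(quantize_row_mean(mean, num))
    return result


F = row_mean_matrix(["IIII", "5555", "!!!!", "@@@@"])   # [34, 25, 25, 33]
```

Because there are only six possible outputs, the row mean contributes a small factor to the number of contexts, which keeps the model cost low.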
Step 3: collect the context information, base information, and row mean information of the character being coded, quantize them jointly, and compute the final coding model.
3.1) Build a model for the current character q: take the four preceding characters $q_1$, $q_2$, $q_3$, and $q_4$; take from $P_2$ the bases corresponding to the current character and the previous character, denoted $j_1$ and $j_2$; denote the mean of the row containing q as f, where f is the already-quantized value; for edge characters that lack preceding context, $q_1$, $q_2$, $q_3$, and $q_4$ may take the same symbol or be set to zero.
For example, suppose the first coding compression block $P_1$ contains E, F, G, H, I and the second coding compression block $P_2$ contains A, T, C, G, G.
When the coding model is built for the third character "G" in $P_1$, the four preceding characters are $q_1$=F, $q_2$=E, $q_3$=0, $q_4$=0; the bases corresponding to the current character and the previous character are $j_1$=C, $j_2$=T; and the row mean is f = (ASCII(E)+ASCII(F)+ASCII(G)+ASCII(H)+ASCII(I))/5, quantized as in step 2;
When the coding model is built for the fifth character "I" in $P_1$, the four preceding characters are $q_1$=H, $q_2$=G, $q_3$=F, $q_4$=E; the bases corresponding to the current character and the previous character are $j_1$=G, $j_2$=G; and the row mean is f = (ASCII(E)+ASCII(F)+ASCII(G)+ASCII(H)+ASCII(I))/5, quantized as in step 2;
This shows that edge characters lacking preceding context can be modeled in the same way.
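The worked example can be reproduced with a few lines of code. The helper below is a hypothetical sketch: it uses 0-based positions, fills missing preceding characters with 0 (one of the two options the text allows), and also returns 0 for the previous base at the first position, a case the source does not specify.

```python
def context_for(quals: str, bases: str, pos: int) -> tuple:
    """Return (q1, q2, q3, q4, j1, j2) for the character at 0-based index pos."""
    # q1 is the nearest preceding character, q4 the farthest; 0 marks "missing".
    prev = [quals[pos - k] if pos - k >= 0 else 0 for k in range(1, 5)]
    j1 = bases[pos]                           # base under the current character
    j2 = bases[pos - 1] if pos >= 1 else 0    # base under the previous character
    return (*prev, j1, j2)


third = context_for("EFGHI", "ATCGG", 2)   # ('F', 'E', 0, 0, 'C', 'T')
fifth = context_for("EFGHI", "ATCGG", 4)   # ('H', 'G', 'F', 'E', 'G', 'G')
```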
3.2) Given that the total number of models is limited in practice, the model cost is reduced by quantizing the overall model: let A be the larger of $q_1$ and $q_2$ and B the larger of $q_3$ and $q_4$, and create two distinct flags C and D, where C indicates whether $q_1$ and $q_2$ are equal and D indicates whether $q_3$ and $q_4$ are equal. The model finally selected for the current coded symbol is therefore: $P_{now} = A \cdot B \cdot C \cdot D \cdot j_1 \cdot j_2 \cdot f$.
where $P_{now}$ is the probability estimate for the current coded symbol.
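One way to read the expression $A \cdot B \cdot C \cdot D \cdot j_1 \cdot j_2 \cdot f$ is as a composite context that selects which probability table the coder uses, with the dots joining components of the context. That reading is an assumption, as are the function name `model_key` and the use of integer quality scores; the sketch below simply builds the seven-component key.

```python
def model_key(q1: int, q2: int, q3: int, q4: int,
              j1: str, j2: str, f: int) -> tuple:
    """Collapse the raw context into the reduced key (A, B, C, D, j1, j2, f)."""
    a = max(q1, q2)              # A: larger of the two nearest scores
    b = max(q3, q4)              # B: larger of the two farther scores
    c = 1 if q1 == q2 else 0     # C: flag, are q1 and q2 equal?
    d = 1 if q3 == q4 else 0     # D: flag, are q3 and q4 equal?
    return (a, b, c, d, j1, j2, f)


key = model_key(37, 36, 0, 0, "C", "T", 34)   # (37, 0, 0, 1, 'C', 'T', 34)
```

Replacing four raw scores with two maxima and two one-bit flags is what shrinks the number of distinct contexts, at the price of merging contexts that differ only in the smaller score of each pair.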
Step 4: use the designed final coding model to drive an adaptive arithmetic coder, traversing and compressing the first coding compression block $P_1$ in serpentine coding order along the direction of strongest correlation.
4.1) The final coding model above yields a more accurate probability estimate $P_{now}$ for the current character, which is fed to the adaptive arithmetic coder as the best prediction;
4.2) The coder traverses the block and compresses it:
Encoding scans the characters one by one. The conventional scan defaults to row-by-row traversal: after a whole row has been traversed, scanning continues from the start of the next row. This example instead scans by column; after one column has been coded, it starts from the end of the next column and traverses upward in the reverse direction, and so on back and forth, so that the overall path looks like a serpentine scan, as shown in Fig. 3. Traversing and coding all characters in this way achieves the final lossless compression.
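The serpentine scan described above can be sketched as a generator of (row, column) positions. The function name is hypothetical and indices are 0-based; the direction alternates on every column, so the end of one column is adjacent to the start of the next.

```python
from typing import Iterator


def serpentine_order(m: int, n: int) -> Iterator[tuple[int, int]]:
    """Yield (row, col) column by column, reversing direction on each new column."""
    for col in range(n):
        # Even columns run top to bottom, odd columns bottom to top.
        rows = range(m) if col % 2 == 0 else range(m - 1, -1, -1)
        for row in rows:
            yield row, col


order = list(serpentine_order(3, 2))
# [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```

Compared with a plain column scan that jumps back to the top row, consecutive positions here always differ by one step, so the coder never crosses from the last read of a block straight to the first.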
The above description is only one specific example of the invention and does not limit the invention in any way. Clearly, those skilled in the art who understand the content and principles of the invention can make various modifications and changes in form and detail without departing from its principles and structure, but such modifications and changes based on the inventive concept remain within the scope of protection of the claims of the invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010446416.1A CN111640467B (en) | 2020-05-25 | 2020-05-25 | DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010446416.1A CN111640467B (en) | 2020-05-25 | 2020-05-25 | DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111640467A CN111640467A (en) | 2020-09-08 |
CN111640467B true CN111640467B (en) | 2023-03-24 |
Family
ID=72332834
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010446416.1A Active CN111640467B (en) | 2020-05-25 | 2020-05-25 | DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111640467B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||