CN109981110A

CN109981110A - Method for lossy compression with point-wise relative error bounds

Info

Publication number: CN109981110A
Application number: CN201910164475.7A
Authority: CN
Inventors: 夏文; 邹翔宇; 王轩; 张伟哲
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2019-07-05
Anticipated expiration: 2039-03-05
Also published as: CN109981110B

Abstract

The invention provides a method for lossy compression with point-wise relative error bounds, comprising the following steps: A. tabulation: build lookup tables according to the error requirement and the interval of the quantization factor; B. obtain the quantization factors; C. Huffman coding: compress the quantization factor sequence generated in step B using Huffman coding; D. lossless compression: compress the Huffman codes and the Huffman tree generated in step C using a lossless compression method. The beneficial effect of the invention is that the time-consuming logarithmic transformation in lossy compression with point-wise relative error bounds is avoided and the quantization factor is obtained by table lookup instead, which greatly accelerates lossy compression with point-wise relative error bounds.

Description

A method for lossy compression with point-wise relative error bounds

Technical Field

The present invention relates to methods of lossy compression, and in particular to a method for lossy compression with point-wise relative error bounds.

Background

Scientific simulations in high-performance computing (HPC) environments produce very large amounts of data, which can cause severe I/O bottlenecks at runtime and impose a heavy storage burden on post-analysis. Unlike traditional data reduction schemes such as deduplication or lossless compression, lossy compression can significantly reduce data size while meeting the user's error-control requirements. To adapt automatically to the accuracy requirements within a dataset, lossy compression with point-wise relative error bounds (i.e., the permitted compression error depends on the data value) is widely used in many scientific applications.

The original lossy compression with point-wise relative error bounds requires every data value to undergo a logarithmic transformation during compression. Logarithms are generally computed on computers using series expansions, which is computationally heavy and time-consuming. Because this step converts all of the data to logarithmic form, its cost grows with the size of the data, and it accounts for a relatively large share of the algorithm's total running time. As a result, lossy compression with point-wise relative error bounds is complex and time-consuming.

Therefore, how to speed up lossy compression with point-wise relative error bounds is a technical problem that those skilled in the art urgently need to solve.

Summary of the Invention

To solve the problems in the prior art, the present invention provides a method for lossy compression with point-wise relative error bounds.

The present invention provides a method for lossy compression with point-wise relative error bounds, comprising the following steps:

A. Tabulation: build tables according to the error requirement and the interval of the quantization factor;

B. Obtain the quantization factors;

C. Huffman coding: compress the quantization factor sequence generated in step B using Huffman coding;

D. Lossless compression: compress the Huffman codes and the Huffman tree generated in step C using a lossless compression method.

As a further improvement of the present invention, in step B the ratio $R = X_i / X'_i$ of the actual value X_i to the predicted value X'_i is calculated, and the table generated in step A is then used to look up the quantization factor from the obtained R.

As a further improvement of the present invention, step A includes the following sub-steps:

A1. Traverse the domain of the quantization factor, calculate the coverage range of each quantization factor, and generate table T1; table T1 maps a quantization factor to the range that quantization factor covers;

A2. Calculate the size of table T2 according to the error requirement, then compute the value of each entry of table T2 in turn from table T1 and fill in table T2; table T2 maps the ratio R to the quantization factor M.

As a further improvement of the present invention, in step A1 the value range P_k corresponding to each quantization factor M_k is calculated and table T1 is generated.

As a further improvement of the present invention, in step A2 the value ranges corresponding to adjacent quantization factors overlap, and the size of the overlap is smaller than an entry of table T2.

The beneficial effects of the present invention are as follows: the above scheme avoids the time-consuming logarithmic transformation in lossy compression with point-wise relative error bounds and obtains the quantization factor by table lookup, which greatly accelerates lossy compression with point-wise relative error bounds.

Description of the Drawings

FIG. 1 is a flow chart of the method for lossy compression with point-wise relative error bounds of the present invention.

FIG. 2 is a flow chart of step A of the method for lossy compression with point-wise relative error bounds of the present invention.

Detailed Description

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, a method for lossy compression with point-wise relative error bounds includes the following steps:

A. Tabulation: build tables according to the error requirement provided by the user and the interval of the quantization factor, for use in the subsequent steps;

B. Obtain the quantization factor: calculate the ratio $R = X_i / X'_i$ of the actual value X_i to the predicted value X'_i, then use the table generated in step A to look up the quantization factor from the obtained R;

C. Huffman coding: compress the quantization factor sequence generated in step B using Huffman coding;

D. Lossless compression: use a conventional lossless compression method such as gzip or zstd to compress the Huffman codes and the Huffman tree generated in step C.
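
The last two steps, Huffman coding of the quantization factor sequence followed by a general-purpose lossless compressor, can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names are hypothetical, the Huffman tree is stored in simplified form as a JSON code table, and gzip from the Python standard library stands in for the "gzip or zstd" stage.

```python
import gzip
import heapq
import json
from collections import Counter


def huffman_code(symbols):
    """Return a prefix-free bit string for each distinct symbol."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:                      # a lone symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap items: (weight, tie-breaker, partial code table of that subtree).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in left.items()}
        merged.update({s: "1" + b for s, b in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


def compress_factors(factors):
    """Huffman-code a list of integer quantization factors, then gzip the result."""
    code = huffman_code(factors)
    bits = "".join(code[m] for m in factors)
    bits += "0" * (-len(bits) % 8)          # pad to a whole number of bytes
    payload = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    # Simplification: the "tree" is serialized as a symbol -> bit string table.
    header = json.dumps({"n": len(factors),
                         "code": {str(s): b for s, b in code.items()}})
    return gzip.compress(header.encode() + b"\n" + payload)


def decompress_factors(blob):
    """Invert compress_factors."""
    header, payload = gzip.decompress(blob).split(b"\n", 1)
    meta = json.loads(header)
    decode = {b: int(s) for s, b in meta["code"].items()}
    bits = "".join(f"{byte:08b}" for byte in payload)
    out, cur = [], ""
    for bit in bits:
        if len(out) == meta["n"]:           # ignore the padding bits
            break
        cur += bit
        if cur in decode:
            out.append(decode[cur])
            cur = ""
    return out
```

Because well-predicted data yields quantization factors concentrated around a few values, the most frequent factor receives the shortest code; `decompress_factors(compress_factors(seq))` returns the original sequence.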

As shown in FIG. 2, step A includes the following sub-steps:

Conventional logarithmic processing transforms all data into the log domain, {X_i} → {log(X_i)}, and denotes {log(X_i)} as {Y_i}; the quantization factor is then computed from the actual value Y_i and the predicted value Y'_i and recorded.
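
For orientation, the conventional log-domain path can be sketched as follows. The exact quantization formula is not reproduced in this text, so the sketch assumes a common scheme: a relative bound `eps` becomes an absolute bound `log(1 + eps)` in the log domain, and the quantization factor is the difference Y_i − Y'_i divided by twice that bound and rounded to the nearest integer. The function names, the restriction to positive values, and the choice Y'_i = log(X'_i) are likewise assumptions.

```python
import math


def log_quantize(x, x_pred, eps):
    """Quantization factor via the log transform (assumed rounding scheme)."""
    e = math.log1p(eps)                    # absolute error bound in the log domain
    y, y_pred = math.log(x), math.log(x_pred)
    return round((y - y_pred) / (2 * e))   # one logarithm per data point


def log_dequantize(m, x_pred, eps):
    """Reconstruct the value from its quantization factor and the prediction."""
    e = math.log1p(eps)
    return math.exp(math.log(x_pred) + 2 * e * m)
```

With `x = 1.05`, `x_pred = 1.0` and `eps = 0.01`, the difference in the log domain is about 2.45 quantization steps, so the factor is 2 and the reconstruction is within 1% of `x`. The `math.log(x)` call on every data point is the cost the table-based method sets out to remove.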

From the way the quantization factor is computed, it can be seen that each quantization factor in fact corresponds to a range of values of Y_i − Y'_i: as long as Y_i − Y'_i falls within that range, the same quantization factor is produced. Step A1 computes the value range P_k corresponding to each quantization factor M_k and generates table T1.
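
A minimal sketch of a T1-style table, under the same assumed rounding scheme (factor = difference divided by `2 * log(1 + eps)`, rounded to nearest): factor k then covers differences from `(2k - 1) * e` to `(2k + 1) * e` in the log domain, and exponentiating the bounds expresses the same range as bounds on the ratio of actual to predicted value. The name `build_t1` and the parameter `m_max` (largest factor magnitude) are hypothetical.

```python
import math


def build_t1(eps, m_max):
    """Map each quantization factor k to the ratio range that produces it."""
    e = math.log1p(eps)                    # half-width of one range, log domain
    t1 = {}
    for k in range(-m_max, m_max + 1):
        lo, hi = (2 * k - 1) * e, (2 * k + 1) * e   # bounds on Y - Y'
        t1[k] = (math.exp(lo), math.exp(hi))        # same bounds on X / X'
    return t1
```

For `eps = 0.01`, factor 0 covers ratios from about 1/1.01 to 1.01, and each range starts exactly where the previous one ends.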

From the relation above one obtains a corresponding expression, where 0 < δ < 1. Table T2 is built according to the accuracy requirement (i.e., the error requirement) so that M can be obtained from the ratio R. To prevent a table entry from straddling two value ranges, the spacing between adjacent quantization factors is adjusted slightly so that the value ranges of adjacent quantization factors overlap to some extent, with the size of the overlap kept smaller than the size represented by one T2 entry. In this way each table entry belongs entirely to a single value range, which avoids the problem. Finally, table T2 is traversed and its entries are filled in one by one according to table T1.
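
The table construction can be sketched as follows, with several assumptions that are not from the patent: T2 is modeled as a uniform grid of cells over the ratio, the overlap is produced by a hypothetical `shrink` factor that moves the range centers closer together, and the cell width is set to half the smallest overlap. That last choice is simply the condition under which the containment check in this sketch is easy to verify; the relation between overlap and entry size in the actual method should be taken from the text above, not from this code.

```python
import math
from bisect import bisect_right


def build_tables(eps, m_max, shrink=0.95):
    """Build T1 (factor -> ratio range) and T2 (grid cell -> factor)."""
    e = math.log1p(eps)            # half-width of each value range, log domain
    step = 2 * e * shrink          # spacing of range centers; shrink < 1 gives overlap
    ks = list(range(-m_max, m_max + 1))
    # T1: quantization factor -> value range, expressed as bounds on the ratio.
    t1 = {k: (math.exp(k * step - e), math.exp(k * step + e)) for k in ks}
    overlap = min(t1[k][1] - t1[k + 1][0] for k in ks[:-1])
    width = overlap / 2            # cell width of T2 (assumption, see lead-in)
    r_min, r_max = t1[ks[0]][0], t1[ks[-1]][1]
    n = int((r_max - r_min) / width)
    lows = [t1[k][0] for k in ks]  # lower bounds, in ascending order
    t2 = []
    for j in range(n):
        a = r_min + j * width      # left edge of cell j
        # Pick the last range that starts at or before the cell's left edge.
        t2.append(ks[bisect_right(lows, a) - 1])
    return t1, t2, r_min, width
```

Since the next range begins more than one cell width before the chosen range ends, every cell lies entirely inside the range of the factor stored for it, so any ratio falling in that cell can safely use that factor.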

Step B then looks up table T2 using the computed ratio R = X_i / X'_i to obtain the required quantization factor.
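
A self-contained sketch of the lookup is given below. To keep it short, the table is filled directly from each cell's center rather than from T1; this shortcut, the uniform grid, the `shrink` parameter, the `None` return for ratios outside the table, and the restriction to positive values are all assumptions made for illustration.

```python
import math


def build_lookup(eps, m_max, shrink=0.95):
    """Hypothetical compact T2: a uniform grid over R with one factor per cell."""
    e = math.log1p(eps)
    step = 2 * e * shrink                   # log-domain spacing of reconstruction levels
    r_min, r_max = math.exp(-m_max * step), math.exp(m_max * step)
    width = r_min * e * (1 - shrink)        # narrow enough that a cell stays within bound
    n = int((r_max - r_min) / width)
    # Logarithms are evaluated once per cell here, not once per data point.
    t2 = [round(math.log(r_min + (j + 0.5) * width) / step) for j in range(n)]
    levels = {k: math.exp(k * step) for k in range(-m_max, m_max + 1)}
    return t2, levels, r_min, width


def quantize(x, x_pred, table):
    """Look up the factor: one division and one table read, no logarithm."""
    t2, _, r_min, width = table
    r = x / x_pred
    if r < r_min:
        return None                         # outside the table: treat as unpredictable
    j = int((r - r_min) / width)
    return t2[j] if j < len(t2) else None


def dequantize(m, x_pred, table):
    """Reconstruct the value by scaling the prediction with a precomputed level."""
    return x_pred * table[1][m]
```

A typical use builds the table once, for example `table = build_lookup(0.01, 8)`, and then calls `quantize(x, x_pred, table)` for every data point; ratios outside the table would need a separate fallback path.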

Under the original technical scheme, computing logarithms is time-consuming, the time grows with the size of the data, and it always accounts for a relatively large share of the total running time. The method for lossy compression with point-wise relative error bounds provided by the present invention bypasses the logarithm computation: it builds tables and looks them up with the ratio of the actual value to the predicted value, thereby obtaining the quantization factor directly. While keeping the underlying principle exactly the same as before, it eliminates the time-consuming logarithm step and accelerates the algorithm as a whole.

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the present invention is not to be regarded as limited to these descriptions. For those of ordinary skill in the technical field of the present invention, a number of simple deductions or substitutions can be made without departing from the concept of the present invention, and all of these should be regarded as falling within the protection scope of the present invention.

Claims

1. A method for lossy compression with point-wise relative error bounds, characterized by comprising the following steps:

A. tabulation: building tables according to the error requirement and the interval of the quantization factor;

B. obtaining the quantization factors;

C. Huffman coding: compressing the quantization factor sequence generated in step B using Huffman coding;

D. lossless compression: compressing the Huffman codes and the Huffman tree generated in step C using a lossless compression method.

2. The method for lossy compression with point-wise relative error bounds according to claim 1, characterized in that: in step B, the ratio $R = X_i / X'_i$ of the actual value X_i to the predicted value X'_i is calculated, and the table generated in step A is then used to look up the quantization factor from the obtained R.

3. The method for lossy compression with point-wise relative error bounds according to claim 2, characterized in that step A includes the following sub-steps:

A1. traversing the domain of the quantization factor, calculating the coverage range of each quantization factor, and generating table T1, table T1 being a table that maps a quantization factor to the range that quantization factor covers;

A2. calculating the size of table T2 according to the error requirement, computing the value of each entry of table T2 in turn from table T1 and filling in table T2, table T2 being a table that maps the ratio R to the quantization factor M.

4. The method for lossy compression with point-wise relative error bounds according to claim 3, characterized in that: in step A1, the value range P_k corresponding to each quantization factor M_k is calculated and table T1 is generated.

5. The method for lossy compression with point-wise relative error bounds according to claim 4, characterized in that: in step A2, the value ranges corresponding to adjacent quantization factors overlap, and the size of the overlap is smaller than an entry of table T2.