CN114598329B

CN114598329B - Lightweight lossless compression method for rapid decompression application

Info

Publication number: CN114598329B
Application number: CN202210269150.7A
Authority: CN
Inventors: 肖卓凌; 王天越; 彭卓霖; 陈智麒
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2023-04-25
Anticipated expiration: 2042-03-18
Also published as: CN114598329A

Abstract

The invention discloses a lightweight lossless compression method for rapid decompression applications. A dictionary is updated from the characters already processed, and the match length and distance of the current character string are looked up in that dictionary. The search strategy uses a double hash search. The compression format gives more weight to longer matched strings while simplifying the classification of coding cases. During decompression, addresses are checked so that data overflow cannot cause a decompression fault, which both raises decompression speed and resolves the overflow problem. Building on LZO, the new algorithm addresses the low decompression speed, high algorithm cost, and decompression overflow found in conventional lossless compression algorithms.

Description

Lightweight lossless compression method for rapid decompression application

Technical Field

The invention relates to the field of data compression, in particular to a lightweight lossless compression method for quick decompression application.

Background

With the development of network technology, data storage and data transmission have driven progress in the field of data compression. Unlike lossy compression, lossless data compression removes as much redundant information from the original data as possible without losing any information, and guarantees that the decompressed data is identical to the data before compression. The dictionary-based LZ family of algorithms plays a significant role in the lossless compression field.

Lossless compression algorithms of the LZ family can be divided into two classes according to where they are applicable. One class is lightweight compression algorithms such as LZ77 and LZO: their principle is simple and their implementation cost is low, so they are suitable for running on an embedded processor, but their compression ratio is lower. The other class is algorithms with higher compression ratios such as DEFLATE and LZMA: their principle is complex and their code size and resource cost are large, so they are not suitable for running on an embedded processor.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a lightweight lossless compression method for rapid decompression application.

In order to achieve the aim of the invention, the invention adopts the following technical scheme:

A lightweight lossless compression method for fast decompression application comprises the following steps:

S1, performing two matching searches using a dictionary twice the size of the LZO algorithm's dictionary, and updating the dictionary pages after a successful match;

S2, compressing and storing the distance and match length of each successful match, using a storage format chosen according to the distance;

S3, determining the coding type of the compressed file from the first byte of each code, and recovering the match length and distance according to the coding type;

S4, performing a security check by boundary detection on every write to the storage space, to determine whether data overflow occurs during decompression.

Further, the two matching searches in S1 are performed as follows:

S11, matching against the two pages of the matching dictionary one after the other, and evaluating each matching result;

S12, if the first match fails, no second hash operation is performed; the hash value computed for the first match is reused directly as the index to query whether a matching item exists in the second dictionary page;

S13, if both matches fail, the current data is output directly as a new character without compression;

S14, if both matches succeed, compression uses the match information with the longer match length, and the dictionary pages are updated.
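The two-page search of S11 to S14 can be sketched in Python as follows. This is only an illustration under stated assumptions: the hash function `hash3`, the table size `HASH_BITS`, the minimum match length `MIN_MATCH`, and the helper names are hypothetical and do not come from the patent, which does not specify them.

```python
MIN_MATCH = 3      # assumed minimum useful match length
HASH_BITS = 12     # assumed: 4096 slots per dictionary page
EMPTY = -1         # marks an unused dictionary slot


def hash3(data: bytes, pos: int) -> int:
    """Hypothetical multiplicative hash of the 3 bytes starting at pos."""
    key = data[pos] | (data[pos + 1] << 8) | (data[pos + 2] << 16)
    return ((key * 2654435761) & 0xFFFFFFFF) >> (32 - HASH_BITS)


def match_length(data: bytes, cand: int, pos: int, max_len: int = 255) -> int:
    """Count how many bytes at cand and pos agree, up to max_len."""
    n = 0
    while pos + n < len(data) and n < max_len and data[cand + n] == data[pos + n]:
        n += 1
    return n


def find_match(data: bytes, pos: int, pages: list):
    """Return (length, distance) of the better page hit, or None."""
    if pos + MIN_MATCH > len(data):
        return None
    index = hash3(data, pos)      # computed once and reused for both pages (S12)
    best = None
    for page in pages:            # S11: first page, then second page
        cand = page[index]
        if cand == EMPTY:
            continue
        length = match_length(data, cand, pos)
        if length >= MIN_MATCH and (best is None or length > best[0]):
            best = (length, pos - cand)   # S14: keep the longer match
    return best                   # None corresponds to S13 (emit a literal)
```

Because the index is computed a single time, the second page costs one extra table read and one extra comparison run rather than a second hash.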

Further, in step S2, match information with a distance smaller than 2K bytes is compressed and stored in a first type of storage format, and match information with a distance greater than or equal to 2K bytes is compressed and stored in a second type of storage format.

Further, the first type of storage format includes a first compressed storage format and a second compressed storage format. The first compressed storage format is used when the match length is less than 18 bytes; the match length is stored minus 2, the match distance is represented by 11 bits, and the match length is represented by 4 bits. The second compressed storage format is used when the match length is greater than or equal to 18 bytes and less than 256 bytes; the match distance is represented by 11 bits and the match length is represented by 8 bits.

Further, the second type of storage format includes a third compressed storage format and a fourth compressed storage format. The third compressed storage format is used when the match length is less than 19 bytes; the match length is stored minus 3 and represented by 4 bits. The fourth compressed storage format is used when the match length is greater than or equal to 19 bytes and less than 256 bytes.
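The choice among the four storage formats depends only on the distance and the match length, so it can be written as a small decision function. The sketch below is a minimal illustration: the function name `select_format` and the constant `NEAR_LIMIT` are hypothetical, and raising an error for lengths of 256 or more is an assumption, since the text does not say how such matches are handled.

```python
NEAR_LIMIT = 2048   # "2K bytes": an 11-bit distance field covers 0..2047


def select_format(distance: int, length: int) -> int:
    """Return 1..4, the number of the compressed storage format to use."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    if length >= 256:
        # Not covered by the four formats; the handling is left open here.
        raise ValueError("match length must be below 256")
    if distance < NEAR_LIMIT:
        return 1 if length < 18 else 2    # first type of storage format
    return 3 if length < 19 else 4        # second type of storage format
```

For example, a match of length 17 at distance 2047 selects the first format, while the same length at distance 2048 selects the third.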

Further, step S3 specifically includes:

S31: take the first byte W and AND it with 0xC0. If the result equals 0x80, output the original character and jump to S35. If the result is not 0x80, jump to S32.

S32: determine whether the result equals 0xC0. If it does, go to S33; otherwise go to S34.

S33: AND W with 0x3C. If the result is nonzero, decompress according to the first format; if it is zero, decompress according to the second format. Then jump to S35.

S34: AND W with 0x78. If the result is nonzero, decompress according to the third format; if it is zero, decompress according to the fourth format. Then jump to S35.

S35: determine whether all data has been processed. If not, jump back to S31 and continue the loop; if so, jump to S36.

S36: determine whether decompression of the whole file is complete, and if so, output the final result.
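The first-byte tests of S31 to S34 translate directly into masks and comparisons. The following Python sketch shows only the classification; the function name and the string labels are illustrative choices, and the layout of the remaining bits (shown in FIG. 3, which is not reproduced here) is deliberately left out.

```python
def classify_first_byte(w: int) -> str:
    """Map the first byte W of a code to one of the five cases in S31 to S34."""
    top = w & 0xC0
    if top == 0x80:                 # S31: uncompressed original character
        return "literal"
    if top == 0xC0:                 # S32 -> S33: first or second format
        return "format1" if (w & 0x3C) != 0 else "format2"
    # S32 -> S34: top bits are 00 or 01, third or fourth format
    return "format3" if (w & 0x78) != 0 else "format4"
```

Note that the mask 0x78 includes bit 6, so any byte whose top two bits are 01 is classified as the third format; the fourth format is reached only when bits 7 to 3 are all zero.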

The invention has the following beneficial effects:

1) Decompression speed is effectively improved: compared with the LZO algorithm, decompression is 16% faster with almost no effect on the compression ratio.

2) The data overflow problem is handled properly by the security check in the decompression module.

3) Software implementing the lossless compression algorithm for rapid decompression applications has been developed. Owing to the lightweight design of the algorithm, the software was developed on the Windows platform and ported to a DSP platform.

Drawings

FIG. 1 is a schematic flow chart of a lightweight lossless compression method for fast decompression application.

FIG. 2 is a dictionary coding diagram of the present invention.

FIG. 3 is a schematic diagram of the compressed formats of the present invention, where a is the first compressed storage format, b is the second compressed storage format, c is the third compressed storage format, and d is the fourth compressed storage format.

FIG. 4 is a flow chart of decompression according to the present invention.

Detailed Description

The following description of embodiments of the present invention is provided to help those skilled in the art understand the invention. It should be understood, however, that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, variations within the spirit and scope of the invention as defined by the appended claims are apparent, and all inventions that make use of the inventive concept are protected.

A lightweight lossless compression method for fast decompression application, as shown in FIG. 1, comprises the following steps:

As shown in FIG. 2, the newly designed compression algorithm searches for matches using a two-pass matching scheme. The matching dictionary has two pages in total. If the first match fails, no second hash operation is performed; the hash value from the first match is reused as the index to query directly whether a matching item exists in the second dictionary page. If both matches fail, the current data is output directly as a new character without compression. If both matches succeed, compression uses the match information with the longer match length. At the same time, the older page of the dictionary entry is updated.

The dictionary of the newly designed compression algorithm is twice the size of the LZO algorithm's dictionary. This design slows the compression process but raises the success rate of string matching. It also makes the dictionary flexible: when other compression requirements arise, the dictionary size can be adjusted simply by adding or removing dictionary pages, giving different compression performance.
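One way to picture a paged dictionary whose size is adjusted by the number of pages is the sketch below. It is an interpretation, not the patented procedure: "the older page" is read here as the page whose slot holds the earliest input position, empty slots are treated as oldest, and the names `make_pages` and `update_dictionary` are hypothetical.

```python
EMPTY = -1  # marks an unused slot; smaller than any real input position


def make_pages(num_pages: int, slots: int) -> list:
    """Create num_pages dictionary pages, each with `slots` empty entries."""
    return [[EMPTY] * slots for _ in range(num_pages)]


def update_dictionary(pages: list, index: int, pos: int) -> int:
    """Store pos at `index` in the page whose entry there is oldest.

    Returns the number of the page that was overwritten.
    """
    # EMPTY (-1) compares below every position, so unused slots fill first.
    victim = min(range(len(pages)), key=lambda p: pages[p][index])
    pages[victim][index] = pos
    return victim
```

With `make_pages(2, slots)` this gives the two-page dictionary described above; passing a different page count changes the dictionary size without touching the update logic.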

As shown in FIG. 3, to decompress data more quickly, the newly designed compression algorithm compresses and stores the distance and match length of each successful match in a format that depends on the distance. The first type of storage format stores match information with a distance less than 2K bytes, and the second type of storage format stores match information with a distance greater than or equal to 2K bytes.

When the distance is less than 2K bytes, the algorithm uses two compression formats to store the distance and match length of a successful match. The first format is used when the match length is less than 18 bytes; the match length is stored minus 2, the distance is represented by 11 bits, and the match length is represented by 4 bits. The second format is used when the match length is 18 bytes or more and less than 256 bytes; the distance is represented by 11 bits and the match length is represented by 8 bits.

The compression formats for distances greater than or equal to 2K bytes make it possible to compress matches at larger distances effectively. Here too the algorithm uses two compression formats to store the distance and match length. The third format is used when the match length is less than 19 bytes; the match length is stored minus 3 and represented by 4 bits. The fourth format is used when the match length is 19 bytes or more and less than 256 bytes.
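The biased length fields can be checked with a short round-trip sketch. Several points are assumptions rather than statements from the patent: the 8-bit fields are taken to hold the length without a bias, the fourth format is taken to use 8 bits like the second, and a stored value of zero is rejected because a zero 4-bit field is what the first-byte tests use to select the second or fourth format. The function names are hypothetical.

```python
def encode_length_field(fmt: int, length: int) -> int:
    """Return the stored length field for compressed storage formats 1 to 4."""
    if fmt == 1:
        stored, bits = length - 2, 4     # "minus 2" representation, 4 bits
    elif fmt == 3:
        stored, bits = length - 3, 4     # "minus 3" representation, 4 bits
    elif fmt in (2, 4):
        stored, bits = length, 8         # assumed: stored without a bias
    else:
        raise ValueError("unknown format")
    if not 0 < stored < (1 << bits):
        raise ValueError("length does not fit this format")
    return stored


def decode_length_field(fmt: int, stored: int) -> int:
    """Invert encode_length_field by adding the format's bias back."""
    bias = {1: 2, 2: 0, 3: 3, 4: 0}[fmt]
    return stored + bias
```

Under these assumptions the largest first-format length, 17, is stored as 15, and the largest third-format length, 18, is also stored as 15, which is why the two thresholds differ by one.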

As shown in FIG. 4, the compression format analysis module of the newly designed compression algorithm determines the coding type quickly from the first byte of the code, sorts it into one of the four formats of step S2, processes each type separately, and recovers the match length and distance according to the coding type. The design is simple, has few cases, and avoids excessive offsets, which reduces the amount of computation in decompression and raises the decompression speed. The specific method follows steps S31 to S36 described above.

After the format of the compressed data has been analyzed, the original characters must be output according to the match length and distance. To address the overflow problem in the LZO decompression code, a boundary detection method performs a security check on every write to the storage space; that is, the boundary must be checked on every memory operation. The decompression program must guarantee that data does not overflow and that the program does not crash when it parses a file with any kind of corruption.
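A boundary check on the match copy might look like the following Python sketch. Python's `bytearray` cannot overrun memory, so the `capacity` argument stands in for the fixed output buffer size a C or DSP implementation would have; that parameter, the exception class, and the function name are all hypothetical.

```python
class CorruptDataError(Exception):
    """Raised instead of reading or writing outside the buffers."""


def copy_match(out: bytearray, distance: int, length: int, capacity: int) -> None:
    """Append `length` bytes copied from `distance` bytes back in `out`."""
    if distance <= 0 or distance > len(out):
        raise CorruptDataError("match distance points before the output start")
    if len(out) + length > capacity:
        raise CorruptDataError("match would overflow the output buffer")
    start = len(out) - distance
    for i in range(length):
        # Byte by byte, so overlapping matches (distance < length) still work.
        out.append(out[start + i])
```

Both checks run before any byte is written, so a corrupted distance or length leaves the output untouched and is reported as an error rather than crashing the decompressor.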

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided to facilitate understanding of the method and core ideas of the invention. Those of ordinary skill in the art may, following the ideas of the invention, vary the specific embodiments and the scope of application. In view of the above, this description should not be construed as limiting the present invention.

Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims

1. A lightweight lossless compression method for fast decompression application, characterized by comprising the following steps:

S1, performing two matching searches using a dictionary twice the size of the LZO algorithm's dictionary, and updating the dictionary pages after a successful match, wherein the two matching searches are performed as follows:

S11, matching against the two pages of the matching dictionary one after the other, and evaluating each matching result;

S14, if both matches succeed, compressing using the match information with the longer match length, and updating the dictionary pages;

S2, compressing and storing the distance and match length of each successful match in a format chosen according to the distance, wherein match information with a distance smaller than 2K bytes is compressed and stored in a first type of storage format, and match information with a distance greater than or equal to 2K bytes is compressed and stored in a second type of storage format; the first type of storage format comprises a first compressed storage format and a second compressed storage format; the first compressed storage format is used when the match length is less than 18 bytes, in which case the match length is stored minus 2, the match distance is represented by 11 bits, and the match length is represented by 4 bits; the second compressed storage format is used when the match length is greater than or equal to 18 bytes and less than 256 bytes, in which case the match distance is represented by 11 bits and the match length is represented by 8 bits; the second type of storage format comprises a third compressed storage format and a fourth compressed storage format; the third compressed storage format is used when the match length is less than 19 bytes, in which case the match length is stored minus 3 and represented by 4 bits; the fourth compressed storage format is used when the match length is greater than or equal to 19 bytes and less than 256 bytes;

S3, determining the coding type of the compressed file from the first byte of each code, and recovering the match length and distance according to the coding type, specifically comprising the following steps:

S31: taking the first byte W and ANDing it with 0xC0; if the result equals 0x80, outputting the original character and jumping to S35; if the result is not 0x80, jumping to S32;

S32: determining whether the result equals 0xC0; if it does, going to S33; otherwise going to S34;

S33: ANDing W with 0x3C; if the result is nonzero, decompressing according to the first format; if the result is zero, decompressing according to the second format; then jumping to S35;

S34: ANDing W with 0x78; if the result is nonzero, decompressing according to the third format; if the result is zero, decompressing according to the fourth format; then jumping to S35;

S35: determining whether all data has been processed; if not, jumping back to S31 and continuing the loop; if so, jumping to S36;

S36: determining whether decompression of the whole file is complete, and if so, outputting the final result;