CN115913246A

CN115913246A - Lossless data compression algorithm based on self-adaptive instantaneous entropy

Info

Publication number: CN115913246A
Application number: CN202211089596.8A
Authority: CN
Inventors: 奚彩萍; 武彦霞
Original assignee: Jiangsu University of Science and Technology
Current assignee: Jiangsu University of Science and Technology
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2023-04-04

Abstract

The invention belongs to the technical field of data processing and in particular relates to a lossless data compression algorithm based on adaptive instantaneous entropy. The algorithm comprises a compression end and a decompression end. The compression end consists of five parts, namely an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer. The decompression end starts decoding from the first bit of the received compressed data stream. The algorithm effectively improves the compression rate, requires little computation and can be implemented with limited resources.

Description

Lossless data compression algorithm based on self-adaptive instantaneous entropy

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a lossless data compression algorithm based on self-adaptive instantaneous entropy.

Background

With the advent of the information age, enormous volumes of data are generated every day, which poses great challenges for the storage, transmission and processing of data. Data compression technology is needed to save storage space and increase data transmission speed. By removing redundant information, data compression reduces storage cost and improves transmission speed, so it has important research significance and practical value. The data compression algorithm is the premise and basis of data compression, and such algorithms can be divided into lossless compression and lossy compression. Lossy compression discards some irrelevant data on the premise that a certain loss of information is acceptable, whereas lossless compression preserves the original information by encoding the data, so that the original values can be recovered from the compressed data without any loss of data quality.

Lossless data compression algorithms are mainly classified into two types according to the compression model, namely statistical compression algorithms and dictionary-based compression algorithms.

Arithmetic coding is a well-known compression algorithm based on a statistical model. It relies on Shannon information entropy and converts the whole input character string into a single long value; the number of binary bits required to express this value is smaller than that of the source data, which achieves the purpose of data compression. Huffman coding is also a compression algorithm based on a statistical model. It sorts the data characters by frequency of occurrence and builds a binary tree from the bottom up until every character appears in a node of the tree; it then traverses the tree from the root node downward, assigning 0 to every left branch and 1 to every right branch, to determine the code word of each original symbol. Compression is achieved by assigning the shortest codes to the most common data patterns.

The other class of lossless compression algorithms, based on a dictionary model, replaces complex original character strings with simple codes. An example is the LZW algorithm, which stores each character string in a string table the first time it appears and represents the string by a unique number as its code. During compression only the code is stored instead of the original string, and the same string table is rebuilt and used during decompression. After compression or decompression is completed the string table is discarded, so it occupies no extra space.

Huffman coding is simple to operate and reduces the average length of the coded strings, but the frequency of the symbols occurring in the data stream has to be calculated first, so the compression time is long, and when the probabilities of the data symbols do not differ markedly the coding is not very effective. Arithmetic coding can theoretically achieve a good compression rate, but its practical implementation is very complex and requires large storage space and power consumption. The effect of dictionary-based compression is closely related to the repetitiveness of the data and to the size of the dictionary, so the compression effect is not stable. Compression algorithms based on deep learning require a large amount of computation, are designed for processing image data rather than data streams, and cannot be applied to the compression of text data streams.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a lossless data compression algorithm based on adaptive instantaneous entropy. At the compression end, the algorithm receives a continuous, fast input data stream, calculates the instantaneous entropy from the occupancy of a lookup table, adaptively performs entropy coding and outputs a compressed data stream. The decompression end starts decoding as soon as it receives the first bit of the compressed data stream and always uses the same lookup table as the compression end, so that the original data symbols are finally recovered.

The invention adopts the following specific technical scheme:

A lossless data compression algorithm based on adaptive instantaneous entropy, characterized in that, in operation, it takes into account both the matching result and the occupancy of a lookup table, calculates the instantaneous entropy from the received data stream, and adaptively performs entropy coding to complete compression. The compression end consists of five parts, namely an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer; the decompression end consists of five parts, namely a deserializer, a highest flag bit, instantaneous entropy decoding, a lookup table and an output data stream.

The total number of rows of the lookup table is K and the number of occupied rows is k. The lookup table holds data symbols that occur with high probability in the input data stream, and the data symbol in each row is N bits. Each row of the lookup table has a corresponding row index of M bits, and M is calculated as follows:

$$M = \lceil \log_2 K \rceil$$

The decompression end decodes the compressed data and outputs the original data symbols using the same lookup table as the compression end. Instantaneous entropy coding is the process of adaptively coding the data stream by calculating the instantaneous entropy of the original data symbols. The calculation of the instantaneous entropy depends on the number of occupied rows k of the lookup table. Where E denotes the instantaneous entropy, that is, the bit length of the entropy code, then

$$E = \lceil \log_2 k \rceil$$

Entropy coding reduces the successfully matched row identification string in the lookup table to E bits by deleting its upper bits. The serializer reconstructs the resulting data stream into a compressed data stream of D bits. The decompression end receives the compressed data stream, performs deserialization and extraction on the compressed data by using instantaneous entropy decoding, and outputs an original symbol.
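The bit-width arithmetic can be illustrated with a short Python sketch. It assumes that M = ⌈log2 K⌉ and E = ⌈log2 k⌉, which agrees with the Fig. 3 embodiment (k = 4 yields one flag bit plus two code bits), and that row indices count upward from 0 at the lowest position. The function names `bits_for` and `truncate_index` are hypothetical and are not part of the invention.

```python
def bits_for(rows: int) -> int:
    """Return ceil(log2(rows)) for rows >= 1 using integer arithmetic only."""
    if rows < 1:
        raise ValueError("rows must be at least 1")
    return (rows - 1).bit_length()


def truncate_index(row_index: int, occupied_rows: int, total_rows: int) -> str:
    """Drop the upper bits of the M-bit row index, keeping only the low E bits."""
    if not 0 <= row_index < occupied_rows <= total_rows:
        raise ValueError("row_index must address an occupied row")
    m = bits_for(total_rows)       # M: width of a full row index
    e = bits_for(occupied_rows)    # E: instantaneous entropy (assumed formula)
    full = format(row_index, "b").zfill(m)
    # With a single occupied row, E is 0 and no index bits are needed at all.
    return full[len(full) - e:] if e else ""


print(bits_for(8), bits_for(4))    # M and E for K = 8, k = 4: prints "3 2"
print(truncate_index(2, 4, 8))     # 3-bit index 010 shortened to "10"
```

Because every occupied row index is smaller than k, the deleted upper bits are always zero, so the decompression end can restore the M-bit index by zero padding.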

A lossless data compression algorithm based on adaptive instantaneous entropy specifically comprises a compression flow, a decompression flow and a lookup table updating flow.

Compression flow: when an input data stream reaches the compression end, the compression end receives the first N-bit original data symbol in the input data stream and looks it up in the lookup table. If no identical data symbol is found, the symbol is not held in the lookup table and the instantaneous entropy E = M; the symbol is output without compression and is at the same time stored in the lowest position of the lookup table, which updates the table. If the match succeeds, the symbol is held in the lookup table; the corresponding row identification string is taken as the data to be compressed, the instantaneous entropy is calculated to shorten its entropy code, and the compressed data is generated. A highest flag bit is defined to distinguish whether the data is compressed: if the data is not compressed, the flag bit is set to "0" and 0 + original data is output; if the data is compressed, the flag bit is set to "1" and 1 + compressed data is output. After all the input data has been processed, the serializer receives all the data generated by these operations and reconstructs it into a D-bit compressed data stream for output.

Decompression flow: the decompression end starts decoding as soon as it receives the first bit of the D-bit compressed data stream. If the highest flag bit is 1, compressed data follows: the deserializer extracts E bits from the D-bit compressed data stream and expands them to M bits according to the same instantaneous entropy E as obtained at the compression end, and once the row identification string has been restored, the data in the corresponding table entry is output as the original symbol. If the highest flag bit is 0, an original data symbol follows: N bits are extracted from the compressed data stream and output as the original data symbol.

Lookup table update flow: the lookup table works like a stack, with the highest to lowest positions of the table arranged from bottom to top. The table is filled starting from the lowest position; for example, once the lowest position is occupied, storing a new data symbol requires moving the data symbol in the lowest position to the next-lowest position and storing the new data symbol in the lowest position. At the compression end, when an input data symbol fails to match the lookup table, the data symbol is stored in the lowest position of the lookup table. When an input data symbol matches the lookup table, the matched data symbol is moved to the lowest position of the lookup table. If all rows of the lookup table are occupied after multiple updates, then when a new data symbol is stored the data symbol in the highest position is popped and discarded, the other entries are each moved logically one position higher, and the new data symbol is stored in the lowest position once it has been vacated. There is also a special case in the lookup table update flow: the matched data symbol lies in a high position among the occupied rows and, when the number of occupied rows is large, it has to be moved all the way to the lowest position.
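The basic update rule behaves like a move-to-front list with eviction from the far end. A minimal Python sketch follows, assuming that list index 0 stands for the lowest position and the last element for the highest position; the function name `update_table` is hypothetical.

```python
def update_table(table: list, symbol, capacity: int) -> bool:
    """Apply one lookup-table update in place and report whether it was a match."""
    hit = symbol in table
    if hit:
        # Matched: take the symbol out so it can be moved to the lowest position.
        table.remove(symbol)
    elif len(table) == capacity:
        # Unmatched and table full: pop and discard the highest position.
        table.pop()
    # In both cases the symbol ends up in the lowest position (index 0),
    # and every entry that was below it shifts one position higher.
    table.insert(0, symbol)
    return hit


table = []
for s in ["A", "B", "C"]:
    update_table(table, s, capacity=3)
print(table)                                  # ['C', 'B', 'A']
print(update_table(table, "B", capacity=3))   # True, table is now ['B', 'C', 'A']
print(update_table(table, "D", capacity=3))   # False, 'A' is evicted
print(table)                                  # ['D', 'B', 'C']
```

A plain list keeps the sketch short; a hardware or resource-constrained implementation would more likely use a shift register or index remapping, which the text does not specify.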

The beneficial effects of the invention are as follows: the compression process calculates the instantaneous entropy from the occupancy of the lookup table and adapts the entropy coding accordingly, which supports the processing of continuous, fast data streams; the decompression end starts decoding from the first bit of the received compressed data stream; the compression rate is effectively improved; moreover, the algorithm requires little computation and can be implemented with limited resources.

Drawings

Fig. 1 is a block diagram of a compression side and a decompression side of the present invention.

Fig. 2 is a compression flow diagram of the present invention.

Fig. 3 is a diagram illustrating a compression flow in the embodiment of the present invention.

Fig. 4 is a decompression flow diagram of the present invention.

Fig. 5 is a diagram illustrating an exemplary decompression process in an embodiment of the present invention.

FIG. 6 is a diagram illustrating a lookup table update process when a match is unsuccessful.

FIG. 7 is a diagram illustrating a successful matching process in the lookup table updating process according to the present invention.

FIG. 8 is a diagram illustrating a lookup table being full in the lookup table updating process according to the present invention.

Fig. 9 is a schematic diagram of the lookup table when C = 2 in the embodiment of the present invention.

FIG. 10 is a diagram illustrating the movement of the (K-1) row of the lookup table to the first row according to an embodiment of the present invention.

Fig. 11 is a schematic diagram illustrating the lookup table shifting the matched data two rows toward the lowest position when d = 2 according to the embodiment of the present invention.

Detailed Description

For the purpose of enhancing the understanding of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and examples, which are provided for the purpose of illustration only and are not intended to limit the scope of the present invention.

Example: a lossless data compression algorithm based on adaptive instantaneous entropy. The structure of the compression end and the decompression end is shown in Fig. 1: the compression end consists of five parts, namely an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer; the decompression end consists of five parts, namely a deserializer, a highest flag bit, instantaneous entropy decoding, a lookup table and an output data stream. The total number of rows of the lookup table is K and the number of occupied rows is k. The lookup table holds data symbols that occur with high probability in the input data stream, and the data symbol in each row is N bits. Each row of the lookup table has a corresponding row index of M bits, and M is calculated as follows:

$$M = \lceil \log_2 K \rceil$$

The decompression end decodes the compressed data and outputs the original data symbols using the same lookup table as the compression end. Instantaneous entropy coding is the process of adaptively coding the data stream by calculating the instantaneous entropy of the original data symbols. The calculation of the instantaneous entropy depends on the number of occupied rows k of the lookup table. Where E denotes the instantaneous entropy, that is, the bit length of the entropy code, then

$$E = \lceil \log_2 k \rceil$$

Compression flow

The compression flow chart is shown in Fig. 2. When an input data stream reaches the compression end, the compression end receives the first N-bit original data symbol in the input data stream and looks it up in the lookup table. If no identical data symbol is found, the symbol is not held in the lookup table and the instantaneous entropy E = M; the symbol is output without compression and is at the same time stored in the lowest position of the lookup table, which updates the table. If the match succeeds, the symbol is held in the lookup table; the corresponding row identification string is taken as the data to be compressed, the instantaneous entropy is calculated to shorten its entropy code, and the compressed data is generated. A highest flag bit is defined to distinguish whether the data is compressed: if the data is not compressed, the flag bit is set to "0" and 0 + original data is output; if the data is compressed, the flag bit is set to "1" and 1 + compressed data is output. After all the input data has been processed, the serializer receives all the data generated by these operations and reconstructs it into a D-bit compressed data stream for output. Fig. 3 is an example of the compression process. Each data symbol processed is 8 bits, the total number of rows K of the lookup table is 8, the number of occupied rows k is 4, and the serializer outputs a 3-bit compressed data stream, that is, the flag bit followed by a 2-bit code.
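The compression flow can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: it assumes E = ⌈log2 k⌉, takes list index 0 as the lowest position, and represents the serializer output as a string of "0"/"1" characters. The function name `compress` and the sample symbols are hypothetical.

```python
def compress(symbols, n_bits: int = 8, total_rows: int = 8) -> str:
    """Encode N-bit symbols as flag + payload and return the bit string."""
    table = []   # index 0 is the lowest position of the lookup table
    out = []
    for s in symbols:
        if s in table:
            idx = table.index(s)
            # E bits are enough to address any of the k occupied rows.
            e = (len(table) - 1).bit_length()
            code = format(idx, "b").zfill(e) if e else ""
            out.append("1" + code)               # flag 1 + compressed data
            table.pop(idx)
        else:
            out.append("0" + format(s, "b").zfill(n_bits))  # flag 0 + original
            if len(table) == total_rows:
                table.pop()                      # discard the highest position
        table.insert(0, s)                       # symbol moves to the lowest position
    return "".join(out)


# Four unmatched symbols fill k = 4 rows; the fifth symbol matches row index 2.
stream = compress([0x41, 0x42, 0x43, 0x44, 0x42])
print(len(stream))    # 4 * 9 literal bits + 3 compressed bits = 39
print(stream[-3:])    # "110": flag 1 followed by the 2-bit code 10
```

With N = 8, K = 8 and k = 4 at the moment of the match, the matched symbol costs 3 bits instead of 9, which mirrors the 3-bit output described for Fig. 3.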

Decompression flow

The decompression flow chart is shown in Fig. 4. The decompression end starts decoding as soon as it receives the first bit of the D-bit compressed data stream. If the highest flag bit is 1, compressed data follows: the deserializer extracts E bits from the D-bit compressed data stream and expands them to M bits according to the same instantaneous entropy E as obtained at the compression end, and once the row identification string has been restored, the data in the corresponding table entry is output as the original symbol. If the highest flag bit is 0, an original data symbol follows: N bits are extracted from the compressed data stream and output as the original data symbol. Fig. 5 is an example of the decompression process corresponding to the compression process described above.
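A matching decoder can be sketched in Python under the same assumptions (E = ⌈log2 k⌉, list index 0 as the lowest position, the stream held as a string of "0"/"1" characters). The decoder recomputes E from its own copy of the lookup table, which stays identical to the encoder's because both apply the same update after every symbol. The function name `decompress` is hypothetical.

```python
def decompress(bits: str, n_bits: int = 8, total_rows: int = 8) -> list:
    """Decode a flag + payload bit string back into the original symbols."""
    table = []   # mirrors the compression-end lookup table
    out = []
    pos = 0
    while pos < len(bits):
        flag = bits[pos]
        pos += 1
        if flag == "1":
            if not table:
                raise ValueError("compressed flag seen with an empty table")
            # Same E as the encoder, because k is the same on both sides.
            e = (len(table) - 1).bit_length()
            idx = int(bits[pos:pos + e], 2) if e else 0
            pos += e
            s = table.pop(idx)        # look up the row, then move it
        else:
            s = int(bits[pos:pos + n_bits], 2)
            pos += n_bits
            if len(table) == total_rows:
                table.pop()           # discard the highest position
        table.insert(0, s)            # symbol moves to the lowest position
        out.append(s)
    return out


# Four 9-bit literals followed by the 3-bit compressed code "110".
stream = "001000001" + "001000010" + "001000011" + "001000100" + "110"
print(decompress(stream))   # [65, 66, 67, 68, 66]
```

Decoding can begin with the very first received bit because each flag bit tells the decoder how many payload bits follow.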

Look-up table updating process

The lookup table works like a stack, with the highest to lowest positions of the table arranged from bottom to top. The table is filled starting from the lowest position; for example, once the lowest position is occupied, storing a new data symbol requires moving the data symbol in the lowest position to the next-lowest position and storing the new data symbol in the lowest position. At the compression end, when an input data symbol fails to match the lookup table, the data symbol is stored in the lowest position of the lookup table; this update process is shown in Fig. 6. When an input data symbol matches the lookup table, the matched data symbol is moved to the lowest position of the lookup table; this update process is shown in Fig. 7.

If all rows of the lookup table are occupied after multiple updates, then when a new data symbol is stored the data symbol in the highest position is popped and discarded, the other entries are each moved logically one position higher, and the new data symbol is stored in the lowest position once it has been vacated, as shown in Fig. 8. However, repeated verification shows that the instantaneous entropy reaches its maximum when the table is full (k = K).

In this situation both the compressed data and the uncompressed data are N + 1 bits long, that is, the highest flag bit plus either N bits of compressed data or N bits of original data. Instead of being compressed, the output grows by the flag bit. To reduce the number of output bits, a matching counter C is defined: when input data symbols have matched data symbols in the table C times, the entry in the highest position of the lookup table is popped and discarded. The value of C can be customized. For example, when C = 2, the entry in the highest position of the table is invalidated after a symbol is matched successfully for the second time, as shown in Fig. 9.
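The matching counter can be sketched in Python as follows. The text leaves several details open, so this sketch makes explicit assumptions: matches are counted globally rather than per symbol, the counter is reset after each eviction, the last remaining row is never evicted, and E = ⌈log2 k⌉. The class name `CountingTable` and its methods are hypothetical.

```python
class CountingTable:
    """Lookup table that evicts its highest position after every C matches."""

    def __init__(self, capacity: int, c: int):
        self.capacity = capacity
        self.c = c
        self.rows = []    # index 0 is the lowest position
        self.hits = 0     # assumed: one global counter of successful matches

    def access(self, symbol) -> bool:
        hit = symbol in self.rows
        if hit:
            self.rows.remove(symbol)
            self.rows.insert(0, symbol)
            self.hits += 1
            # After C matches, shrink the table so that k (and with it E) drops.
            if self.hits >= self.c and len(self.rows) > 1:
                self.rows.pop()
                self.hits = 0
        else:
            if len(self.rows) == self.capacity:
                self.rows.pop()
            self.rows.insert(0, symbol)
        return hit

    def code_bits(self) -> int:
        """E for the current occupancy (assumed formula)."""
        return (len(self.rows) - 1).bit_length() if self.rows else 0


t = CountingTable(capacity=5, c=2)
for s in (1, 2, 3, 4, 5):
    t.access(s)
print(t.rows, t.code_bits())   # [5, 4, 3, 2, 1] 3
t.access(2)                    # first match
t.access(4)                    # second match triggers the eviction
print(t.rows, t.code_bits())   # [4, 2, 5, 3] 2
```

Whatever counting rule is chosen, the decompression end has to apply exactly the same rule, otherwise the two lookup tables diverge.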

There is also a special case in the lookup table update process: the matched data symbol lies in a high position among the occupied rows and, when the number of occupied rows is large, it has to be moved all the way to the lowest position, for example from row (K-1) to the first row of the lookup table, which requires a long operation time, as shown in Fig. 10. For this case, the move is limited to a fixed number of rows. Taking a limit of d = 2 rows as an example, when an input data symbol is matched, the matched data is shifted two rows toward the lowest position and the other rules remain unchanged, as shown in Fig. 11. If the number of rows by which the matched data symbol would logically be shifted is less than the customized limit, the symbol is shifted by that logical number of rows.
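The limited shift can be expressed as a bounded promotion. The Python sketch below assumes that list index 0 is the lowest position, so a shift "toward the lowest position" lowers the index; the function name `promote` is hypothetical.

```python
def promote(table: list, idx: int, d: int) -> int:
    """Shift the matched row at idx at most d rows toward the lowest position.

    Returns the new index of the matched symbol.
    """
    # If fewer than d rows lie below the match, it simply reaches index 0.
    new_idx = max(0, idx - d)
    table.insert(new_idx, table.pop(idx))
    return new_idx


rows = [10, 20, 30, 40, 50]
print(promote(rows, 4, d=2), rows)   # 2 [10, 20, 50, 30, 40]
print(promote(rows, 1, d=2), rows)   # 0 [20, 10, 50, 30, 40]
```

Only the d rows between the old and the new position have to move, which bounds the update time regardless of how many rows are occupied.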

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A lossless data compression algorithm based on adaptive instantaneous entropy, characterized by comprising a compression end and a decompression end, wherein the compression end consists of five parts, namely an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer, and the decompression end consists of five parts, namely a deserializer, a highest flag bit, instantaneous entropy decoding, a lookup table and an output data stream;

the total number of rows of the lookup table is K, the number of occupied rows is k, the data symbol in each row is N bits, and each row of the lookup table has a corresponding row index of M bits, M being calculated as follows:

$$M = \lceil \log_2 K \rceil$$

the decompression end decodes the compressed data and outputs the original data symbols using the same lookup table as the compression end; the instantaneous entropy coding is the process of adaptively coding the data stream by calculating the instantaneous entropy of the original data symbols; the calculation of the instantaneous entropy E depends on the number of occupied rows k of the lookup table, then

$$E = \lceil \log_2 k \rceil$$

the entropy coding reduces the successfully matched row identification string in the lookup table to E bits by deleting its upper bits, the serializer reconstructs the resulting data stream into a D-bit compressed data stream, and the decompression end receives the compressed data stream, deserializes and extracts the compressed data using instantaneous entropy decoding, and outputs the original symbols.

2. The adaptive instantaneous entropy-based lossless data compression algorithm of claim 1, wherein the algorithm specifically includes a compression process, a decompression process, and a look-up table update process.

3. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 2, wherein the compression process comprises the following specific steps: when an input data stream reaches the compression end, the compression end receives the first N-bit original data symbol in the input data stream and looks it up in the lookup table; if no identical data symbol is found, the symbol is not held in the lookup table and the instantaneous entropy E = M, the symbol is output without compression, and the data symbol is stored in the lowest position of the lookup table, which updates the lookup table; if the match succeeds, the symbol is held in the lookup table, the corresponding row identification string is taken as the data to be compressed, the instantaneous entropy is calculated to shorten its entropy code, and the compressed data is generated.

4. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 3, wherein in the compression process a highest flag bit is set to distinguish whether the data is compressed; if the data is not compressed, the flag bit is set to "0" and 0 + original data is output; if the data is compressed, the flag bit is set to "1" and 1 + compressed data is output; after all the input data has been processed, the serializer receives all the data generated by these operations and reconstructs it into a D-bit compressed data stream for output.

5. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 4, wherein the decompression process comprises the following specific steps: the decompression end starts decoding when it receives the first bit of the D-bit compressed data stream; if the highest flag bit is 1, compressed data follows, the deserializer extracts E bits from the D-bit compressed data stream and expands them to M bits according to the same instantaneous entropy E as obtained at the compression end, and once the row identification string has been restored, the data in the corresponding table entry is output as the original symbol; if the highest flag bit is 0, an original data symbol follows, and N bits are extracted from the compressed data stream and output as the original data symbol.

6. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 5, wherein the lookup table update process comprises the following specific steps: at the compression end, when an input data symbol fails to match the lookup table, the data symbol is stored in the lowest position of the lookup table and the lookup table is updated; when an input data symbol matches the lookup table, the matched data symbol is moved to the lowest position of the lookup table and the lookup table is updated.

7. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 6, wherein in the lookup table update process, when the table is full and the instantaneous entropy is therefore at its maximum, a matching counter C is defined; when input data symbols have matched data symbols in the table C times, the entry in the highest position of the lookup table is popped and discarded, the value of C being customizable.

8. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 6, wherein in the lookup table update process, when the successfully matched data symbol lies in a high position among the occupied rows and the number of occupied rows is large, the successfully matched data symbol is moved to the lowest position.