CN115913246A - Lossless data compression algorithm based on self-adaptive instantaneous entropy - Google Patents

Lossless data compression algorithm based on self-adaptive instantaneous entropy

Info

Publication number
CN115913246A
Authority
CN
China
Prior art keywords
data
lookup table
compression
entropy
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211089596.8A
Other languages
Chinese (zh)
Inventor
奚彩萍
武彦霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology
Priority to CN202211089596.8A
Publication of CN115913246A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of data processing and in particular relates to a lossless data compression algorithm based on self-adaptive instantaneous entropy. The algorithm comprises a compression end and a decompression end. The compression end consists of five parts: an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer. The decompression end starts decoding from the first bit of the received compressed data stream, so that the compression rate is effectively improved. The algorithm requires few operations and can be implemented with limited resources.

Description

Lossless data compression algorithm based on self-adaptive instantaneous entropy
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a lossless data compression algorithm based on self-adaptive instantaneous entropy.
Background
With the advent of the information age, hundreds of millions of pieces of data are generated every day, which poses enormous challenges for storing, transmitting and processing data. Data compression technology is needed to save storage space and increase transmission speed. By removing redundant information, it reduces storage cost and speeds up transmission, so it has important research significance and practical value. The data compression algorithm is the premise and basis of data compression, and such algorithms can be divided into lossless compression and lossy compression. Lossy compression discards some irrelevant data on the premise that a certain loss of information is acceptable, whereas lossless compression preserves the original information by encoding the data, so the original values can be recovered from the compressed data without any loss of data quality.
Lossless data compression algorithms fall into two main classes according to the compression model: statistical compression algorithms and dictionary-based compression algorithms. Arithmetic coding is a well-known compression algorithm based on a statistical model. Drawing on Shannon information entropy, it converts the whole input string into a single numeric value, and the number of binary bits required to express that value is smaller than the source data, which achieves compression. Huffman coding is also based on a statistical model. It sorts the data characters by frequency of occurrence and builds a binary tree from the bottom up until every character appears in a node of the tree. Traversing from the root node downward, every left branch is assigned 0 and every right branch is assigned 1, which determines the code word of each original symbol; compression is achieved by assigning the shortest codes to the most common data patterns. Dictionary-based lossless compression replaces complex original strings with simple codes. The LZW algorithm, for example, stores each string in a string table the first time it appears and uses a unique number as its code, so that only the code is stored during compression instead of the original string. Decompression relies on the string table generated during compression, and the table is discarded once compression or decompression is complete, so it occupies no extra space.
Huffman coding is simple to operate and reduces the average length of the coded strings, but the frequency of the symbols in the data stream must be computed first, so compression takes a long time, and the coding is not very effective when the probabilities of the data symbols do not differ markedly. Arithmetic coding can in theory achieve a good compression rate, but its practical implementation is very complex and needs a large amount of storage space and power. The effectiveness of dictionary-based compression depends closely on the repetitiveness of the data and the size of the dictionary, so its compression effect is not fixed. Compression algorithms based on deep learning require a large amount of computation, are designed specifically for image data, do not process data streams, and cannot be applied to compressing text data streams.
Disclosure of Invention
To solve the above technical problem, the invention provides a lossless data compression algorithm based on adaptive instantaneous entropy. At the compression end the algorithm receives a continuous, fast input data stream, calculates the instantaneous entropy from the occupancy of a lookup table, adapts the entropy coding accordingly and outputs a compressed data stream. The decompression end starts decoding as soon as it receives the first bit of the compressed data stream and always uses the same lookup table as the compression end, so that the original data symbols are finally recovered.
The invention adopts the following specific technical scheme:
A lossless data compression algorithm based on self-adaptive instantaneous entropy, characterized in that the algorithm takes into account both the lookup table matching result and the table occupancy: the instantaneous entropy is calculated from the received data stream, and entropy coding is performed adaptively to complete the compression. The compression end consists of five parts: an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer. The decompression end consists of five parts: a deserializer, a highest flag bit, instantaneous entropy decoding, a lookup table and an output data stream.
The total number of rows of the lookup table is K and the number of occupied rows is k. The lookup table holds the data symbols that occur with high probability in the input data stream, and the data symbol in each row is N bits. Each row of the lookup table has a corresponding row index of M bits, where M is calculated as follows:
Figure BDA0003836640940000021
The decompression end decodes the compressed data and outputs the original data symbols using the same lookup table as the compression end. Instantaneous entropy coding is the process of adaptively coding the data stream by computing the instantaneous entropy of the original data symbols. The instantaneous entropy depends on the number of occupied rows k of the lookup table. Let E denote the instantaneous entropy, which is also the bit length of the entropy code; then
Figure BDA0003836640940000022
Entropy coding reduces the successfully matched row identification string in the lookup table to E bits by deleting its upper bits. The serializer reconstructs the resulting data stream into a compressed data stream of D bits. The decompression end receives the compressed data stream, performs deserialization and extraction on the compressed data by using instantaneous entropy decoding, and outputs an original symbol.
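The two equation images referenced above are not reproduced in this text. One reading that fits the surrounding definitions, offered here as an assumption and not as the patent's own formula, is that the row index width and the instantaneous entropy are ceilings of base-2 logarithms:

```latex
M = \lceil \log_2 K \rceil, \qquad E = \lceil \log_2 k \rceil, \qquad 1 \le k \le K
```

Under this assumption a table with K = 8 rows has M = 3, and with k = 4 occupied rows E = 2, so a matched symbol costs 1 + E = 3 bits while an unmatched 8-bit symbol costs 1 + N = 9 bits. This agrees with the 3-bit output quoted for the embodiment with N = 8, K = 8 and k = 4.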
A lossless data compression algorithm based on adaptive instantaneous entropy specifically comprises a compression flow, a decompression flow and a lookup table updating flow.
Compression flow: when an input data stream reaches the compression end, the compression end receives the first N-bit original data symbol in the input data stream and looks it up in the lookup table. If no identical data symbol is found, the instantaneous entropy is E = M, which indicates that the symbol is not held in the lookup table; the symbol is output without compression, stored in the lowest position of the lookup table, and the lookup table is updated. If the match succeeds, the symbol is held in the lookup table; the corresponding row identification string is taken as the data to be compressed, and the instantaneous entropy is calculated to shorten the entropy code and generate the compressed data. A highest flag bit distinguishes compressed from uncompressed data: if the data is not compressed, the flag bit is set to "0" and 0 + original data is output; if the data is compressed, the flag bit is set to "1" and 1 + compressed data is output. After all input data has been processed, the serializer receives all the data generated by these operations and reconstructs it into a D-bit compressed data stream for output.
Decompression flow: the decompression end starts decoding as soon as it receives the first bit of the D-bit compressed data stream. If the highest flag bit is 1, the data is compressed: the deserializer extracts E bits from the D-bit compressed data stream and expands them to M bits according to the instantaneous entropy E obtained at the compression end, and once the row identification string has been restored, the data in the corresponding table entry is output as the original symbol. If the highest flag bit is 0, the data is an original data symbol: N bits are extracted from the compressed data stream and output as the original data symbol.
Lookup table update flow: the lookup table works like a stack, with its rows running from the highest position at the bottom to the lowest position at the top. The table is filled from the lowest position: once the lowest position is occupied, storing a new data symbol requires moving the symbol in the lowest position to the next-lowest position and storing the new symbol in the lowest position. At the compression end, when an input data symbol fails to match the lookup table, the symbol is stored in the lowest position of the lookup table. When an input data symbol matches the lookup table, the matched symbol is moved to the lowest position of the lookup table. If all rows of the lookup table become occupied after repeated updates, storing a new data symbol requires popping and discarding the symbol in the highest position; the other symbols are each moved logically to the next-higher position, and the new symbol is stored in the lowest position once it has been freed. A special case also arises during the update: when the matched data symbol sits in a high position among the occupied rows and many rows are occupied, moving it to the lowest position takes a long operation time.
The beneficial effects of the invention are as follows: the compression process calculates the instantaneous entropy from the occupancy of the lookup table and adapts the entropy coding accordingly, which supports processing of continuous, fast data streams; the decompression end starts decoding from the first bit of the received compressed data stream, so that the compression rate is effectively improved; and the algorithm requires few operations and can be implemented with limited resources.
Drawings
Fig. 1 is a block diagram of a compression side and a decompression side of the present invention.
Fig. 2 is a compression flow diagram of the present invention.
Fig. 3 is a diagram illustrating a compression flow in the embodiment of the present invention.
Fig. 4 is a decompression flow diagram of the present invention.
Fig. 5 is a diagram illustrating an exemplary decompression process in an embodiment of the present invention.
FIG. 6 is a diagram illustrating a lookup table update process when a match is unsuccessful.
FIG. 7 is a diagram illustrating a successful matching process in the lookup table updating process according to the present invention.
FIG. 8 is a diagram illustrating a lookup table being full in the lookup table updating process according to the present invention.
Fig. 9 is a schematic diagram of the lookup table when C = 2 in the embodiment of the present invention.
FIG. 10 is a diagram illustrating the movement of the (K-1) row of the lookup table to the first row according to an embodiment of the present invention.
Fig. 11 is a schematic diagram illustrating that the lookup table shifts the matching data to the lowest bit direction by two rows when d =2 according to the embodiment of the present invention.
Detailed Description
For the purpose of enhancing the understanding of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and examples, which are provided for the purpose of illustration only and are not intended to limit the scope of the present invention.
Example: a lossless data compression algorithm based on adaptive instantaneous entropy. The structure of the compression end and the decompression end is shown in Fig. 1. The compression end consists of five parts: an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer. The decompression end consists of five parts: a deserializer, a highest flag bit, instantaneous entropy decoding, a lookup table and an output data stream. The total number of rows of the lookup table is K and the number of occupied rows is k. The lookup table holds the data symbols that occur with high probability in the input data stream, and the data symbol in each row is N bits. Each row of the lookup table has a corresponding row index of M bits, where M is calculated as follows:
Figure BDA0003836640940000041
The decompression end decodes the compressed data and outputs the original data symbols using the same lookup table as the compression end. Instantaneous entropy coding is the process of adaptively coding the data stream by computing the instantaneous entropy of the original data symbols. The instantaneous entropy depends on the number of occupied rows k of the lookup table. Let E denote the instantaneous entropy, which is also the bit length of the entropy code; then
Figure BDA0003836640940000042
Entropy coding reduces the successfully matched row identification string in the lookup table to E bits by deleting its upper bits. The serializer reconstructs the resulting data stream into a compressed data stream of D bits. The decompression end receives the compressed data stream, performs deserialization and extraction on the compressed data by using instantaneous entropy decoding, and outputs an original symbol.
Compression flow
The compression flow chart is shown in Fig. 2. When an input data stream reaches the compression end, the compression end receives the first N-bit original data symbol in the input data stream and looks it up in the lookup table. If no identical data symbol is found, the instantaneous entropy is E = M, which indicates that the symbol is not held in the lookup table; the symbol is output without compression, stored in the lowest position of the lookup table, and the lookup table is updated. If the match succeeds, the symbol is held in the lookup table; the corresponding row identification string is taken as the data to be compressed, and the instantaneous entropy is calculated to shorten the entropy code and generate the compressed data. A highest flag bit is defined to distinguish compressed from uncompressed data: if the data is not compressed, the flag bit is set to "0" and 0 + original data is output; if the data is compressed, the flag bit is set to "1" and 1 + compressed data is output. After all input data has been processed, the serializer receives all the data generated by these operations and reconstructs it into a D-bit compressed data stream for output. Fig. 3 is an example of the compression process. Each data symbol processed is 8 bits, the total number of rows K of the lookup table is 8, the number of occupied rows k is 4, and the serializer outputs a 3-bit compressed data stream.
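The compression flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes E = ceil(log2 k) for k occupied rows (the equation image is not reproduced in this text), treats index 0 of a list as the lowest position of the lookup table, models the serializer output as a string of '0' and '1' characters, and leaves out the matching counter C and the limited shift d. The names `index_bits`, `to_bits` and `compress` are hypothetical.

```python
def index_bits(rows: int) -> int:
    """Bits needed to address `rows` rows: ceil(log2(rows)), or 0 for 0 or 1 rows."""
    return (rows - 1).bit_length() if rows > 0 else 0


def to_bits(value: int, width: int) -> str:
    """Fixed-width binary string; empty when width is 0."""
    return format(value, "b").zfill(width) if width else ""


def compress(symbols, n_bits: int = 8, total_rows: int = 8) -> str:
    """Encode N-bit integer symbols as flag bit + payload tokens."""
    table = []  # table[0] is the lowest position (most recently seen symbol)
    out = []
    for sym in symbols:
        if sym in table:
            row = table.index(sym)
            e = index_bits(len(table))       # assumed E, from occupied rows k
            out.append("1" + to_bits(row, e))  # flag 1 + row index cut to E bits
            table.pop(row)                   # matched symbol goes to lowest position
        else:
            out.append("0" + to_bits(sym, n_bits))  # flag 0 + raw N-bit symbol
            if len(table) == total_rows:
                table.pop()                  # full table: discard highest position
        table.insert(0, sym)
    return "".join(out)


if __name__ == "__main__":
    print(compress([65, 66, 65, 65]))
```

With four symbols already in the table, a match costs 1 + 2 = 3 bits in this sketch, in line with the 3-bit output of the Fig. 3 example.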
Decompression flow
The decompression flow chart is shown in Fig. 4. The decompression end starts decoding as soon as it receives the first bit of the D-bit compressed data stream. If the highest flag bit is 1, the data is compressed: the deserializer extracts E bits from the D-bit compressed data stream and expands them to M bits according to the instantaneous entropy E obtained at the compression end, and once the row identification string has been restored, the data in the corresponding table entry is output as the original symbol. If the highest flag bit is 0, the data is an original data symbol: N bits are extracted from the compressed data stream and output as the original data symbol. Fig. 5 is an example of the decompression process corresponding to the compression process described above.
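A matching decoder can be sketched in the same way. The sketch assumes that the decompression end recomputes E = ceil(log2 k) from its own copy of the lookup table, which is one way to realise "the same lookup table as the compression end"; the text itself only says E is obtained at the compression end. Index 0 is again taken as the lowest position, and `index_bits` and `decompress` are hypothetical names.

```python
def index_bits(rows: int) -> int:
    """Bits needed to address `rows` rows: ceil(log2(rows)), or 0 for 0 or 1 rows."""
    return (rows - 1).bit_length() if rows > 0 else 0


def decompress(bits: str, n_bits: int = 8, total_rows: int = 8) -> list:
    """Decode a '0'/'1' string of flag bit + payload tokens into integer symbols."""
    table = []  # mirrors the encoder's table; table[0] is the lowest position
    out = []
    pos = 0
    while pos < len(bits):
        flag = bits[pos]
        pos += 1
        if flag == "1":
            e = index_bits(len(table))                  # same E the encoder used
            row = int(bits[pos:pos + e], 2) if e else 0  # E-bit row index
            pos += e
            sym = table.pop(row)                        # table entry is the symbol
        else:
            sym = int(bits[pos:pos + n_bits], 2)        # raw N-bit symbol
            pos += n_bits
            if len(table) == total_rows:
                table.pop()                             # discard highest position
        table.insert(0, sym)                            # same update as the encoder
        out.append(sym)
    return out


if __name__ == "__main__":
    print(decompress("0010000010010000101110"))
```

Because both ends apply identical table updates after every token, no table contents need to be transmitted; a stream that sets flag 1 before any symbol has been stored is invalid and raises an error in this sketch.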
Look-up table updating process
The lookup table works like a stack, with its rows running from the highest position at the bottom to the lowest position at the top. The table is filled from the lowest position: once the lowest position is occupied, storing a new data symbol requires moving the symbol in the lowest position to the next-lowest position and storing the new symbol in the lowest position. At the compression end, when an input data symbol fails to match the lookup table, the symbol is stored in the lowest position of the lookup table; this update is shown in Fig. 6. When an input data symbol matches the lookup table, the matched symbol is moved to the lowest position of the lookup table; this update is shown in Fig. 7.
If all rows of the lookup table become occupied after repeated updates, storing a new data symbol requires popping and discarding the symbol in the highest position; the other symbols are each moved logically to the next-higher position, and the new symbol is stored in the lowest position once it has been freed, as shown in Fig. 8. However, repeated verification shows that when the table is full, the instantaneous entropy is
Figure BDA0003836640940000061
This situation results in both the compressed data and the uncompressed data being N + 1 bits, i.e., the highest flag bit plus N bits of compressed data or N bits of original data. The flag bit then adds a bit instead of achieving compression. To reduce the number of compressed bits, a matching counter C is defined: when input data symbols have matched data symbols in the table C times, the highest position in the lookup table is popped and discarded. The value of C can be customized. For example, when C = 2, the highest position in the table is invalidated after the second successful match of a symbol, as shown in Fig. 9.
A special case also arises during the lookup table update: the matched data symbol sits in a high position among the occupied rows and, when many rows are occupied, has to be moved all the way to the lowest position, for example from row (K-1) to the first row of the lookup table, which takes a long operation time, as shown in Fig. 10. For this case the move is limited to a fixed number of rows. Taking a limit of d = 2 rows as an example, when an input data symbol matches, the matched data is moved two rows toward the lowest position and the other rules are unchanged, as shown in Fig. 11. If the number of rows the matched symbol would logically move is less than the custom limit, the move uses that logical number of rows.
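The limited-shift rule can be illustrated with a short Python sketch. It assumes the table is a list whose index 0 is the lowest position, and that "moving d rows toward the lowest position" means re-inserting the matched symbol d indices lower, stopping at index 0. The function name `promote` and its `d=None` convention for the unrestricted move are hypothetical.

```python
def promote(table: list, row: int, d=None) -> int:
    """Move table[row] toward the lowest position (index 0).

    With d=None the symbol goes all the way to index 0; otherwise it moves
    at most d rows. Returns the new row of the symbol.
    """
    target = 0 if d is None else max(row - d, 0)
    sym = table.pop(row)       # remove the matched symbol from its old row
    table.insert(target, sym)  # rows between target and row shift up by one
    return target


if __name__ == "__main__":
    rows = list("abcde")       # 'a' is in the lowest position
    promote(rows, 4, d=2)      # 'e' moves two rows toward the lowest position
    print(rows)
```

In this sketch a symbol found in row 1 with d = 2 moves only one row, matching the rule that a shorter logical move takes precedence over the limit. The decompression end would have to apply the same rule so that both tables stay identical.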
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A lossless data compression algorithm based on adaptive instantaneous entropy, characterized by comprising a compression end and a decompression end, wherein the compression end consists of five parts, namely an input data stream, a lookup table, instantaneous entropy coding, a highest flag bit and a serializer, and the decompression end consists of five parts, namely a deserializer, a highest flag bit, instantaneous entropy decoding, a lookup table and an output data stream;
the total number of rows of the lookup table is set as K, the number of occupied rows as k, and the data symbol of each row as N bits, each row of the lookup table having a corresponding row index of M bits, M being calculated as follows:
Figure FDA0003836640930000011
the decompression end decodes the compressed data and outputs original data symbols by using the same lookup table as the compression end, the instantaneous entropy coding is a process of coding the data stream by calculating the instantaneous entropy self-adaption of the original data symbols, the calculation of the instantaneous entropy E depends on the occupied row number k of the lookup table, and then
Figure FDA0003836640930000012
The entropy coding reduces the successfully matched row identification string in the lookup table to E bits by deleting its upper bits, the serializer reconstructs the obtained data stream into a compressed data stream of D bits, and the decompression end receives the compressed data stream, deserializes and extracts the compressed data by means of instantaneous entropy decoding, and outputs the original symbol.
2. The adaptive instantaneous entropy-based lossless data compression algorithm of claim 1, wherein the algorithm specifically includes a compression process, a decompression process, and a look-up table update process.
3. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 2, wherein the compression process comprises the following specific steps: when an input data stream reaches the compression end, the compression end receives the first N-bit original data symbol in the input data stream and matches the data symbol against the lookup table; if no identical data symbol is matched, the instantaneous entropy E = M indicates that the symbol is not held in the lookup table, the symbol is output without compression, the data symbol is stored in the lowest position of the lookup table, and the lookup table is updated; if the matching is successful, which indicates that the symbol is held in the lookup table, the corresponding row identification string is taken as the data to be compressed, and the instantaneous entropy is calculated to shorten the entropy code of the compressed data and generate the compressed data.
4. The adaptive instantaneous entropy-based lossless data compression algorithm as claimed in claim 3, wherein in the compression process a highest flag bit is set to distinguish whether the data is compressed; if the data is not compressed, the flag bit is set to "0" and 0 + original data is output; if the data is compressed, the flag bit is set to "1" and 1 + compressed data is output; after all the input data has been processed, the serializer receives all the data generated by these operations and reconstructs it into a D-bit compressed data stream for output.
5. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 4, wherein the decompression process comprises the following specific steps: when the decompression end receives the first bit of the D-bit compressed data stream, the decoding operation is started; if the highest flag bit is 1, indicating compressed data, the deserializer extracts E bits from the D-bit compressed data stream and expands them to M bits according to the instantaneous entropy E obtained at the compression end, and the data in the corresponding table entry is output as the original symbol after the row identification string is restored; if the highest flag bit is 0, indicating an original data symbol, N bits are extracted from the compressed data stream and output as the original data symbol.
6. The lossless data compression algorithm based on adaptive instantaneous entropy of claim 5, wherein the lookup table updating process comprises the following specific steps: at the compression end, when an input data symbol fails to match the lookup table, the data symbol is stored in the lowest position of the lookup table and the lookup table is updated; when an input data symbol is successfully matched with the lookup table, the successfully matched data symbol is moved to the lowest position of the lookup table and the lookup table is updated.
7. The adaptive instantaneous entropy-based lossless data compression algorithm of claim 6, wherein in the lookup table update process, when the table is full, the instantaneous entropy is
Figure FDA0003836640930000021
and a matching counter C is defined; when input data symbols have matched data symbols in the table C times, the highest position in the lookup table is popped and discarded, and the value of C can be customized.
8. The adaptive instantaneous entropy-based lossless data compression algorithm of claim 6, wherein in the lookup table update process, when the successfully matched data symbol is located in a high position among the occupied table rows and the number of occupied rows is large, the successfully matched data symbol is moved to the lowest position.
CN202211089596.8A 2022-09-07 2022-09-07 Lossless data compression algorithm based on self-adaptive instantaneous entropy Pending CN115913246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211089596.8A CN115913246A (en) 2022-09-07 2022-09-07 Lossless data compression algorithm based on self-adaptive instantaneous entropy

Publications (1)

Publication Number Publication Date
CN115913246A true CN115913246A (en) 2023-04-04

Family

ID=86475006

Country Status (1)

Country Link
CN (1) CN115913246A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116979972A (en) * 2023-09-21 2023-10-31 成都博宇利华科技有限公司 Compression and decompression method for acquired data of analog-to-digital converter
CN116979972B (en) * 2023-09-21 2023-12-12 成都博宇利华科技有限公司 Compression and decompression method for acquired data of analog-to-digital converter

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination