CN116015311A - Lz4 text compression method based on sliding dictionary implementation

Info

Publication number: CN116015311A
Application number: CN202310015819.4A
Authority: CN
Inventors: 李迪; 王炳耀
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2023-01-05
Filing date: 2023-01-05
Publication date: 2023-04-25

Abstract

The invention discloses an Lz4 text compression method based on a sliding dictionary, which mainly addresses the low compression speed and difficult hardware implementation of existing text compression methods. The method comprises: establishing a sliding dictionary, a hash table and a register for the number of unsuccessful comparisons; reading the text to be compressed into the sliding dictionary; calculating the hash address at which the position information of each character string is stored; searching for repeated character strings using the sliding dictionary; storing the information of the repeated character strings to compress the text; and finally outputting the code stream of the compressed text. Hash addresses are calculated in parallel, so the method is easy to implement in hardware as a pipeline. Predicting match positions in advance raises the probability of finding repeated character strings and reduces the number of hash address calculations. The compressibility of the text is judged from the number of unsuccessful comparisons and the input speed of the text to be compressed is adjusted accordingly, which greatly improves the compression speed.

Description

Lz4 text compression method based on sliding dictionary implementation

Technical Field

The invention belongs to the technical field of communication, and further relates to an Lz4 text compression method based on a sliding dictionary in the technical field of lossless data compression. The invention uses a sliding dictionary to search for repeated character strings in the text and predicts the positions of repeated character strings to complete text compression. It can be used to optimize hardware implementations of text compression coding and is particularly suitable for real-time text compression.

Background

With the demand for massive data exchange over computer networks, the requirements on data transmission and storage keep rising. Compressed data occupies much less storage space than the original data and less bandwidth during transmission, which reduces server traffic and speeds up transmission. Lz4, the most popular lossless compression method at present, is widely used in network data downloading, data backup and similar fields. Lz4 is a compression method that uses a sliding dictionary to look back over previously seen character strings and, when the current character string repeats one of them, saves the position information of the repeated character string.

Yann Collet presented the LZ4 text compression method in his published article "Real Time Data Compression: LZ4 explained" (http://fastcompression.blogspot.com/2011/05/lz4-explained.html, 2011). The method is a variant of Lz77; it is a dictionary compression method implemented in software. It scans from the beginning of the text, performs a hash calculation on each 4 input bytes to obtain the corresponding hash value, and uses the hash value to store and read data addresses in a hash table. When two calculated hash values are the same, the data input on the two occasions are regarded as the same, and compression is achieved in this way. The method has the following defects: to maintain its hash table, the Lz4 text compression method must frequently calculate hash values of the input text data during encoding, so the compression speed is low; meanwhile, the hash table stores data addresses whose maximum value is limited by the data bit width of the hash table, so the size of the input data is limited and continuous data compression cannot be achieved; and the output delay varies unpredictably with the type of input data, which makes hardware implementation difficult.

A text compression method is disclosed in the patent document "Gzip hardware-based text compression method" (patent application number: 201710255484.8, application publication number: CN 107135003A) filed by Xidian University. The method compresses text by dictionary search and Huffman coding and, by trimming overlapping characters, changes the original sequential processing of the text into simultaneous matching of the character strings within the same window, thereby optimizing the compression structure. The method has the following defect: the dictionary search actually uses multiple hash calculations to build a lookup table for finding repeated characters, which takes a large number of clock cycles, so the compression speed is low.

Disclosure of Invention

The purpose of the invention is to overcome the above defects of the prior art by providing an Lz4 text compression method based on a sliding dictionary, which solves the problems of low text compression speed and difficult hardware implementation.

To achieve the above purpose, the idea of the invention is to optimize the original text compression flow and process the input text in parallel. While the file to be compressed is read in, the hash table is updated and used for search and comparison, and the content of the file is monitored: the input speed is controlled according to the number of unsuccessful comparisons, and the offset length of the last match is recorded so that the position where a match is likely to occur can be predicted in advance. Finally, the compressed file is output to complete the compression.

The specific steps for achieving the purpose of the invention are as follows:

step 1, establishing a hash table in which the content of every storage unit is 0, a sliding dictionary, a register for the number of unsuccessful comparisons, and a compressed storage area;

step 2, reading 4 unread characters in sequence from the text to be compressed to form a character string, and storing the character string in the sliding dictionary; calculating the hash address of the character string;

step 3, checking whether the content of the storage unit corresponding to the hash address is 0; if so, storing the position information of the character string in the sliding dictionary into that storage unit, and then executing step 2; otherwise, executing step 4;

step 4, using the content of the storage unit in the hash table as an address to find the corresponding character string in the sliding dictionary; comparing the last character string in the sliding dictionary with the character string found by addressing; if they are equal, the comparison succeeds and step 6 is executed; otherwise, the comparison fails and step 5 is executed;

step 5, incrementing the number of unsuccessful comparisons in the register by one and judging whether it is greater than 1% of the total number of character strings in the text to be compressed; if so, reading 1 unread character from the text to be compressed and storing it in the sliding dictionary, thereby accelerating the input of the text to be compressed, and then executing step 2; otherwise, keeping the input speed of the text to be compressed unchanged and directly executing step 2;

step 6, clearing the number of unsuccessful comparisons in the register;

step 7, reading the next character from the text to be compressed and the next character following the end of the successfully compared character string in the sliding dictionary, and comparing whether the two characters are equal; if so, executing step 6; otherwise, executing step 8;

step 8, storing the length of the successfully compared character string, the offset length of the successfully compared character string, the length of the uncompared character string and the code of the uncompared character string in the compressed storage area;

step 9, predicting the match position in the text to be compressed according to the offset length of the last match:

step 9.1, reading 1 unread character from the text to be compressed, storing it in the sliding dictionary, and forming a new character string from this character and the 4 characters at the tail of the sliding dictionary;

step 9.2, predicting the match position from the offset length of the successfully compared character string stored in the compressed storage area, addressing the sliding dictionary by this offset length to find a character string, and comparing whether the two character strings are equal; if so, executing step 6; otherwise, executing step 10;

step 10, judging whether no characters remain in the text to be compressed; if so, executing step 11; otherwise, executing step 2;

step 11, outputting all data in the compressed storage area.
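
By way of illustration only, the core of the above flow (the hash lookup of steps 2 to 4, the match extension of step 7, the record storage of step 8 and the output of step 11) may be modelled in software as follows. This is a minimal sketch rather than the hardware implementation of the invention: the function names `hash4` and `compress`, the packing of 4 characters into an integer, the use of 1-based positions so that 0 can mean "empty", and the record layout (match length, offset, uncompared characters) are assumptions made for the example. The input-speed adjustment of step 5 and the prediction of step 9 are omitted.

```python
GOLDEN_PRIME = 2654435761  # multiplier c named in claim 2
HASH_TABLE_SIZE = 1 << 11  # 11-bit hash address


def hash4(chunk: bytes) -> int:
    """Hash a 4-character string to an 11-bit address (byte packing is an assumption)."""
    value = int.from_bytes(chunk, "big")
    return ((value * GOLDEN_PRIME) & 0xFFFFFFFF) >> 21


def compress(data: bytes) -> list:
    """Return records of (match_length, offset, uncompared_characters)."""
    table = [0] * HASH_TABLE_SIZE  # step 1: every storage unit starts at 0
    records = []
    literal_start = 0  # start of the characters not yet covered by a match
    pos = 0
    while pos + 4 <= len(data):
        chunk = data[pos:pos + 4]            # step 2: read 4 unread characters
        addr = hash4(chunk)
        if table[addr] == 0:                 # step 3: empty unit, store position
            table[addr] = pos + 1            # 1-based so that 0 stays "empty"
            pos += 4
            continue
        cand = table[addr] - 1               # step 4: address the dictionary
        if data[cand:cand + 4] != chunk:     # comparison fails (step 5 omitted)
            pos += 4
            continue
        length = 4                           # step 7: extend one character at a time
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        records.append((length, pos - cand, data[literal_start:pos]))  # step 8
        pos += length
        literal_start = pos
    records.append((0, 0, data[literal_start:]))  # trailing uncompared characters
    return records                                # step 11


print(compress(b"01234567012345678"))
```

For the text 01234567012345678, this sketch finds one match of length 8 at offset 8 and leaves 9 uncompared characters in total (01234567 before the match and 8 after it).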

Compared with the prior art, the invention has the following advantages:

First, the invention calculates hash addresses in parallel while compressing text data, which overcomes the slow processing caused by the sequentially executed structure of the prior art, so that text compression can readily be carried out in parallel by hardware in a pipeline structure.

Second, the invention reads the text to be compressed into a sliding dictionary and predicts the match position of matching character strings. When a match is sought for a character string of the text to be compressed, the match position is predicted in advance and matching character strings can be found continuously. This overcomes the low compression speed caused by repeatedly calculating hash addresses in the prior art, speeds up reading of the text to be compressed, raises the probability of successfully matching character strings, and greatly improves the text compression speed.

Drawings

Fig. 1 is a flow chart of the present invention.

Detailed Description

The invention is described in further detail below with reference to fig. 1 and the examples.

Step 1, establishing a hash table in which the content of every storage unit is 0, a sliding dictionary, a register for the number of unsuccessful comparisons, and a compressed storage area.

In the embodiment of the invention, the address bit width of the hash table is 11 bits and the size of the sliding dictionary is set to 4kb; the sliding function is realized by advancing the sliding dictionary address by one each time text data to be compressed is read in; and the size of the register for the number of unsuccessful comparisons is set to 32 bits.
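
The storage elements of this embodiment may be sketched in software as follows. The sketch is illustrative only: the class and attribute names are hypothetical, "4kb" is assumed to mean 4096 one-character cells, and the wrap-around of the write address is an assumption about how the dictionary behaves once it is full.

```python
HASH_ADDR_BITS = 11        # address bit width of the hash table
DICT_SIZE = 4 * 1024       # assumption: "4kb" read as 4096 one-character cells
REGISTER_BITS = 32         # width of the unsuccessful-comparison register


class CompressorState:
    """Hypothetical software model of the storage set up in step 1."""

    def __init__(self) -> None:
        self.hash_table = [0] * (1 << HASH_ADDR_BITS)  # all storage units start at 0
        self.dictionary = bytearray(DICT_SIZE)         # sliding dictionary cells
        self.write_addr = 0                            # advances by one per character read
        self.mismatch_count = 0                        # unsuccessful-comparison register
        self.compressed = []                           # compressed storage area

    def push(self, char: int) -> None:
        """Store one character and advance the dictionary address by one."""
        self.dictionary[self.write_addr % DICT_SIZE] = char  # wrap-around is assumed
        self.write_addr += 1

    def count_mismatch(self) -> None:
        """Increment the register, keeping it within 32 bits."""
        self.mismatch_count = (self.mismatch_count + 1) & ((1 << REGISTER_BITS) - 1)


state = CompressorState()
for char in b"0123":
    state.push(char)
```

An 11-bit address gives the hash table 2048 storage units, which is why the sketch sizes the table as `1 << HASH_ADDR_BITS`.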

Step 2, reading 4 unread characters in sequence from the text to be compressed to form a character string, and storing the character string in the sliding dictionary; calculating the hash address of the character string.

The size of the text to be compressed processed in the embodiment of the invention is set to 4kb.

Step 3, checking whether the content of the storage unit corresponding to the hash address is 0; if so, storing the position information of the character string in the sliding dictionary into that storage unit, and then executing step 2; otherwise, executing step 4.

Step 4, using the content of the storage unit in the hash table as an address to find the corresponding character string in the sliding dictionary; comparing the last character string in the sliding dictionary with the character string found by addressing; if they are equal, the comparison succeeds and step 6 is executed; otherwise, the comparison fails and step 5 is executed.

Step 5, incrementing the number of unsuccessful comparisons in the register by one and judging whether it is greater than 1% of the total number of character strings in the text to be compressed; if so, reading 1 unread character from the text to be compressed and storing it in the sliding dictionary, thereby accelerating the input of the text to be compressed, and then executing step 2; otherwise, keeping the input speed of the text to be compressed unchanged and directly executing step 2.
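
The decision of step 5 may be illustrated as follows. The function name and its return convention are hypothetical, and the sketch assumes that the register is incremented first and then compared with 1% of the total number of character strings, which is one reading of the step.

```python
def on_failed_comparison(mismatch_count: int, total_strings: int) -> tuple:
    """Return (updated register value, extra characters to read)."""
    mismatch_count += 1
    # Integer form of "greater than 1% of the total number of character strings".
    if mismatch_count * 100 > total_strings:
        return mismatch_count, 1   # read 1 more character to speed up input
    return mismatch_count, 0       # keep the input speed unchanged


print(on_failed_comparison(10, 1000))  # register becomes 11, above the threshold of 10
```

The comparison is written with integers only, which avoids a division and is closer to what a hardware comparator would do.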

Step 6, clearing the number of unsuccessful comparisons in the register.

Step 7, reading the next character from the text to be compressed and the next character following the end of the successfully compared character string in the sliding dictionary, and comparing whether the two characters are equal; if so, executing step 6; otherwise, executing step 8.

Step 8, storing the length of the successfully compared character string, the offset length of the successfully compared character string, the length of the uncompared character string and the code of the uncompared character string in the compressed storage area.

Step 9, predicting the match position in the text to be compressed according to the offset length of the last match.

1 unread character is read from the text to be compressed and stored in the sliding dictionary, and a new character string is formed from this character and the 4 characters at the tail of the sliding dictionary. The match position is predicted from the offset length of the successfully compared character string stored in the compressed storage area, the sliding dictionary is addressed by this offset length to find a character string, and the two character strings are compared; if they are equal, step 6 is executed; otherwise, step 10 is executed.
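
The prediction of step 9 may be illustrated as follows. This is a sketch under stated assumptions: the whole text is held in one buffer whose already-read prefix plays the role of the sliding dictionary, the compared string length is taken as 4 characters, and the function name `predicted_match` is hypothetical.

```python
def predicted_match(data: bytes, pos: int, last_offset: int, length: int = 4) -> bool:
    """Check whether the string at pos repeats the string last_offset characters earlier."""
    cand = pos - last_offset                 # predicted match position
    if last_offset <= 0 or cand < 0 or pos + length > len(data):
        return False                         # no valid prediction available
    # A direct comparison replaces the hash address calculation for this string.
    return data[cand:cand + length] == data[pos:pos + length]


text = b"abcdXabcdYabcd"
print(predicted_match(text, 10, 5))  # compares text[5:9] with text[10:14]
```

When the prediction succeeds, the match is found without calculating a hash address for the new character string.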

Step 10, judging whether no characters remain in the text to be compressed; if so, executing step 11; otherwise, executing step 2.

Step 11, outputting all data in the compressed storage area.

The invention will be further illustrated with reference to examples.

The text data to be compressed is 01234567012345678.

The compression flow is specifically as follows:

A hash table, a sliding dictionary, a register for the number of unsuccessful comparisons and a compressed storage area are established. All storage units in the hash table contain 0, the sliding dictionary data is 0, the value of the unsuccessful-comparison register is 0, and the compressed storage area is all 0.

The 4 characters 0123 are read from the text to be compressed and stored in the sliding dictionary, and the hash address of 0123 is calculated: 0x0123 × 2654435761 = 0xD90F5433 (low 32 bits), and 0xD90F5433 >> 21 = 0x6C8. The resulting hash address is 0x6C8. At this time, the sliding dictionary data is 0123, the text to be compressed is 4567012345678, the storage content of all addresses in the hash table is 0, and the value of the unsuccessful-comparison register is 0.
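
The hash calculation of this example can be checked with the short sketch below. It assumes, as the example does, that the string 0123 is treated as the number 0x0123 and that only the low 32 bits of the product are kept; the function name `hash_address` is hypothetical.

```python
GOLDEN_PRIME = 2654435761  # equals 0x9E3779B1


def hash_address(value: int) -> int:
    """Multiply by the constant, keep the low 32 bits, then shift right by 21 bits."""
    product = (value * GOLDEN_PRIME) & 0xFFFFFFFF
    return product >> 21  # the top 11 bits form the hash address


print(hex(hash_address(0x0123)))  # 0x6c8
print(hex(hash_address(0x4567)))  # 0x4e0
```

Shifting a 32-bit value right by 21 bits leaves 11 bits, which matches the 11-bit address width of the hash table in the embodiment.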

Looking up the hash address shows that the corresponding storage content in the hash table is 0. The current position information of the character string is 1, which is stored in the storage unit at hash address 0x6C8. At this time, the sliding dictionary data is 0123, the text to be compressed is 4567012345678, the storage content at 0x6C8 in the hash table is 1, the storage content at all other addresses is 0, and the value of the unsuccessful-comparison register is 1.

The 4 characters 4567 are read from the text to be compressed, and the hash address of 4567 is calculated to be 0x4E0. At this time, the sliding dictionary data is 01234567, the text to be compressed is 012345678, and the value of the unsuccessful-comparison register is 1.

Looking up hash address 0x4E0 shows that the corresponding storage content in the hash table is 0. The current position information of the character string is 2, which is stored in the storage unit at hash address 0x4E0. At this time, the sliding dictionary data is 01234567, the text to be compressed is 012345678, the storage content at 0x6C8 in the hash table is 1, the storage content at 0x4E0 is 2, the storage content at all other addresses is 0, and the value of the unsuccessful-comparison register is 2.

The 4 characters 0123 are read from the text to be compressed, and the hash address of 0123 is calculated to be 0x6C8. Looking up hash address 0x6C8 shows that the corresponding storage content in the hash table is 1; the character string at position 1 in the sliding dictionary is 0123, and its comparison with 0123 from the text to be compressed succeeds. At this time, the sliding dictionary data is 01234567, the text to be compressed is 012345678, the storage content at 0x6C8 in the hash table is 1, the storage content at 0x4E0 is 2, the storage content at all other addresses is 0, and the value of the unsuccessful-comparison register is 0.

The subsequent single-character comparison then continues: 4, 5, 6 and 7 in the sliding dictionary are successfully compared with 4, 5, 6 and 7 of the text to be compressed, until the comparison of 0 in the sliding dictionary with 8 of the text to be compressed fails.

The length of the successfully compared character string is 8, its offset length is 8, the length of the uncompared character string is 9, and the uncompared character string is 012345678. These are saved to the compressed storage area as (8,8,9) and 012345678.

Since no characters remain in the text to be compressed, (8,8,9) and 012345678 are output.

Claims

1. An Lz4 text compression method based on a sliding dictionary, characterized in that: character strings in the sliding dictionary are found by using the content of storage units in a hash table as addresses, and are compared with the character strings to be compressed; the input speed of the text to be compressed is adjusted according to the recorded number of unsuccessful comparisons; the match position in the text to be compressed is predicted according to the offset length of the last match; the text compression method comprises the following steps:

and step 11, outputting all the compressed storage area data.

2. The Lz4 text compression method based on a sliding dictionary as claimed in claim 1, wherein the hash address of each character string in step 2 is calculated as follows: the character string is multiplied by the golden-ratio prime c to obtain 32-bit data, and the 32-bit data is shifted right by 21 bits to obtain the 11-bit hash address of the character string, where c = 2654435761.