CN115955248A

CN115955248A - Data compression method and device, electronic equipment and storage medium

Info

Publication number: CN115955248A
Application number: CN202211699748.6A
Authority: CN
Inventors: 吕涛; 郭超; 陈祥; 黄运新
Original assignee: Shenzhen Dapu Microelectronics Co Ltd
Current assignee: Shenzhen Dapu Microelectronics Co Ltd
Priority date: 2022-12-28
Filing date: 2022-12-28
Publication date: 2023-04-11

Abstract

The application discloses a data compression method, a data compression device, an electronic device and a readable storage medium. The method comprises: acquiring data to be compressed; determining a current data unit in the data to be compressed, the current data unit comprising a first preset number of bytes; taking each byte in the current data unit as a first byte, and extracting a plurality of data units to be processed from the data to be compressed, each data unit to be processed comprising a second preset number of bytes; matching the plurality of data units to be processed against previous data in parallel by using a plurality of computing modules; and determining a target data unit to be processed for which a corresponding matched data unit has been found, taking the position in the data to be compressed of the first byte of the matched data unit as a matching position, taking the length of the matched data unit as a matching length, and replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length. Because the matching of repeated data is performed in parallel, the compression performance of the LZ77 algorithm is improved.

Description

Data compression method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data compression method and apparatus, an electronic device, and a computer-readable storage medium.

Background

Data compression can reduce the storage space that data occupy and increase the logical capacity of a storage device, thereby lowering the cost of storing and transmitting data, which makes it highly attractive. Data compression is, however, a computationally intensive operation that consumes considerable computing resources of the host CPU (central processing unit). A technical trend in recent years is therefore to implement a hardware circuit for data compression inside a Solid State Disk (SSD), so that data compression is supported by a computational storage architecture.

The LZ77 algorithm achieves compression by replacing a repeated occurrence of a data fragment with a reference to a single copy of a data fragment that exists earlier in the uncompressed data stream. A matching segment is represented by a pair of numbers called "length-distance". In the related art, the LZ77 algorithm performs matching on each position of input data sequentially and serially, and has a slow execution speed and low compression performance.

Therefore, how to improve the compression performance of the LZ77 algorithm is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a data compression method, a data compression device, an electronic device and a computer readable storage medium, and the compression performance of an LZ77 algorithm is improved.

To achieve the above object, the present application provides a data compression method, including:

acquiring data to be compressed;

determining a current data unit in the data to be compressed; wherein the current data unit comprises a first preset number of bytes;

taking each byte in the current data unit as a first byte, and extracting a plurality of data units to be processed from the data to be compressed; each to-be-processed data unit comprises a second preset number of bytes;

utilizing a plurality of computing modules to match a plurality of to-be-processed data units with previous data in parallel;

determining a target data unit to be processed matched with the corresponding matched data unit, taking the position of the first byte in the matched data unit in the data to be compressed as a matching position, taking the length of the matched data unit as a matching length, and replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length.

Wherein, determining the current data unit in the data to be compressed includes:

determining a current processing position;

taking a byte corresponding to the current processing position in the data to be compressed as a first byte, and extracting a current data unit containing the first preset number of bytes from the data to be compressed;

correspondingly, after the matching is performed on the multiple data units to be processed and the previous data in parallel by using the multiple computing modules, the method further includes:

if the target data unit to be processed which is matched with the corresponding matched data unit does not exist, the current processing position is increased by the first preset number, and the step of taking the byte corresponding to the current processing position in the data to be compressed as a first byte and extracting the current data unit containing the first preset number of bytes from the data to be compressed is re-entered;

correspondingly, after replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length, the method further includes:

incrementing the current processing position by the matching length, and re-entering the step of taking the byte corresponding to the current processing position in the data to be compressed as a first byte and extracting a current data unit containing the first preset number of bytes from the data to be compressed.

Wherein the matching of the plurality of to-be-processed data units and the previous data in parallel by using the plurality of computing modules comprises:

calculating a target hash value of a corresponding data unit to be processed by utilizing each calculation module, determining a corresponding target hash entry in a hash table by taking the target hash value as an index, determining a candidate matching position in the target hash entry, reading a first data content from the candidate matching position in the data to be compressed, reading a second data content from the data unit to be processed in the data to be compressed, and matching the first data content and the second data content; the hash table is used for storing the corresponding relation between the hash value of the data content and the position in the data to be compressed;

correspondingly, the determining the target to-be-processed data unit matched to the corresponding matching data unit includes:

and if the first data content and the second data content are successfully matched, determining the data unit to be processed contained in the second data content as a target data unit to be processed, and determining the first data content as a matched data unit.

Wherein, after determining the target data unit to be processed matched with the corresponding matching data unit, the method further comprises:

updating the hash entry corresponding to the hash value of the target data unit to be processed in the hash table based on the position of the first byte in the target data unit to be processed in the data to be compressed;

correspondingly, after the matching of the multiple to-be-processed data units and the previous data is performed in parallel by using the multiple computing modules, the method further includes:

and if the target data unit to be processed which is matched with the corresponding matched data unit does not exist, respectively updating hash entries corresponding to the hash values of the plurality of data units to be processed in the hash table based on the positions of first bytes in the data to be compressed in the plurality of data units to be processed.

Wherein, after determining the target data unit to be processed matched with the corresponding matched data unit, the method further includes: if the second data content containing a first target data unit to be processed and the second data content containing a second target data unit to be processed overlap, rejecting whichever of the first target data unit to be processed and the second target data unit to be processed is positioned later.

The hash table comprises a plurality of hash entries. Each hash entry takes, as its index, a first-level hash value calculated based on a first hash algorithm, and each hash entry comprises a second preset number of second-level hash values together with the corresponding second preset number of positions in the data to be compressed. Each second-level hash value is a hash value calculated from the data content based on a second hash algorithm, and each first-level hash value and each second-level hash value is one byte.

Wherein, the calculating, by each calculating module, a target hash value of a corresponding data unit to be processed, determining a corresponding target hash entry in a hash table by using the target hash value as an index, and determining a candidate matching position in the target hash entry, includes:

calculating a target primary hash value of the corresponding data unit to be processed by utilizing each calculation module based on the first hash algorithm, and determining a corresponding target hash entry in a hash table by taking the target primary hash value as an index;

and calculating a target secondary hash value of the corresponding data unit to be processed by utilizing each calculation module based on the second hash algorithm, and determining a candidate matching position corresponding to the target secondary hash value in the target hash entry.

To achieve the above object, the present application provides a data compression apparatus, comprising:

the acquisition module is used for acquiring data to be compressed;

a determining module, configured to determine a current data unit in the data to be compressed; wherein the current data unit comprises a first preset number of bytes;

the extraction module is used for extracting a plurality of data units to be processed from the data to be compressed by taking each byte in the current data unit as a first byte; each to-be-processed data unit comprises a second preset number of bytes;

the plurality of computing modules are used for matching the plurality of to-be-processed data units with previous data in parallel;

and the replacing module is used for determining a target data unit to be processed matched with the corresponding matched data unit, taking the position of the first byte in the matched data unit in the data to be compressed as a matching position, taking the length of the matched data unit as a matching length, and replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length.

To achieve the above object, the present application provides an electronic device including:

a memory for storing a computer program;

a processor for implementing the steps of the data compression method as described above when executing the computer program.

To achieve the above object, the present application provides a computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the data compression method as described above.

According to the scheme, the data compression method provided by the application comprises the following steps: acquiring data to be compressed; determining a current data unit in the data to be compressed; wherein the current data unit comprises a first preset number of bytes; taking each byte in the current data unit as a first byte, and extracting a plurality of data units to be processed from the data to be compressed; each to-be-processed data unit comprises a second preset number of bytes; utilizing a plurality of computing modules to match a plurality of to-be-processed data units with previous data in parallel; determining a target data unit to be processed matched with the corresponding matched data unit, taking the position of the first byte in the matched data unit in the data to be compressed as a matching position, taking the length of the matched data unit as a matching length, and replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length.

According to the data compression method provided by the application, a plurality of computing modules match repeated data for a plurality of data units to be processed in parallel, so that the compression performance of the LZ77 algorithm is multiplied. The application also discloses a data compression device, an electronic device and a computer readable storage medium, which achieve the same technical effects.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting it. In the drawings:

FIG. 1 is a flowchart of an algorithm of a duplicate data identification technique in an LZ77 algorithm in the related art;

FIG. 2 is a schematic diagram of a serial computing based duplicate data identification technique of the related art;

FIG. 3 is a structural diagram of a hash table in the LZ77 algorithm in the related art;

FIG. 4 is a flow chart illustrating a method of data compression in accordance with an exemplary embodiment;

FIG. 5 is a flow diagram illustrating another method of data compression in accordance with an exemplary embodiment;

FIG. 6 is a block diagram illustrating a hash table in accordance with an exemplary embodiment;

FIG. 7 is a detailed flowchart of step S26 in FIG. 5;

FIG. 8 is a flowchart illustrating a duplicate data identification technique based on parallel computing and memory optimization in accordance with an exemplary embodiment;

FIG. 9 is a schematic diagram illustrating a parallel computing based duplicate data identification technique in accordance with an exemplary embodiment;

FIG. 10 is a block diagram illustrating a data compression apparatus in accordance with an exemplary embodiment;

FIG. 11 is a block diagram illustrating an electronic device in accordance with an example embodiment.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application. In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.

In the related art, the algorithm flow of the repeated data identification technique in the LZ77 algorithm is shown in FIG. 1. If the last byte is more than 4 bytes away from the start position of the current data unit, at least one 4-byte data unit remains in the data stream and matching can continue; otherwise, matching ends. The matching logic is as follows:

Step 1: calculate the hash value of 2 bytes of the current data unit;

Step 2: look up the hash table using the hash value as an index;

Step 3: on a hash hit, obtain the matching position and the matching length; taking each matched byte as a start position, calculate the hash value of the corresponding 4-byte data unit and update the hash table; increment the start position of the current data unit by the matching length in bytes;

Step 4: on a hash miss, update the hash table with the current hash value and increment the start position of the current data unit by 1 byte;

Step 5: return to Step 1 for the next round of matching.

In the related art, the LZ77 algorithm is executed serially. As shown in FIG. 2, each round takes a 4-byte data unit from a start position i and performs a hash operation to find a potential match; if the hash hits, a matching position is obtained, and the subsequent data are compared to determine a matching length ML. If position i is matched successfully, the start position moves to i + ML for the next round of matching; if it is not matched, the next round starts from i + 1. One disadvantage of the LZ77 algorithm in the related art is therefore that serial execution is slow and cannot meet the data compression performance requirements of scenarios such as computational storage.
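
The serial flow described above can be sketched roughly as follows. This is a minimal illustration rather than the related-art implementation: the function name `serial_lz77_matches` is hypothetical, and a Python dictionary keyed directly by the 4-byte unit stands in for the hash table, so hash collisions are not modelled.

```python
MIN_MATCH = 4  # size of the data unit taken at each start position, per the text


def serial_lz77_matches(data: bytes):
    """Scan `data` serially; return (position, match_position, match_length) triples."""
    table = {}    # assumption: 4-byte unit -> most recent position (no collisions)
    matches = []
    i = 0
    while i + MIN_MATCH <= len(data):
        unit = data[i:i + MIN_MATCH]
        candidate = table.get(unit)
        if candidate is None:
            # Miss: record the current position and advance by one byte.
            table[unit] = i
            i += 1
            continue
        # Hit: compare the following data to determine the matching length ML.
        ml = 0
        while i + ml < len(data) and data[candidate + ml] == data[i + ml]:
            ml += 1
        matches.append((i, candidate, ml))
        # Update the table for each matched start position that still has 4 bytes.
        for p in range(i, i + ml):
            if p + MIN_MATCH <= len(data):
                table[data[p:p + MIN_MATCH]] = p
        i += ml  # the next round starts at i + ML
    return matches
```

Only one start position is examined per round, which is the serial bottleneck the paragraph above refers to.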

In the related art, the structure of the hash table in the LZ77 algorithm is shown in FIG. 3. The hash table generally includes 32768 entries, each of which holds two pieces of position information (pos 1 and pos 2) occupying 4 bytes in total, so the size of the entire hash table is 32768 × 4 = 131072 bytes (128 KB). A second disadvantage of the LZ77 algorithm in the related art is therefore that the memory overhead of the hash table is large, which is hard to accept in memory-constrained scenarios such as hardware implementations.

Based on this, the present application designs a tiny hash table containing only 256 entries to solve the problem of high memory overhead, and supports a multi-way parallel method to accelerate the identification of repeated data fragments. This saves memory cost in a hardware implementation, greatly reduces the circuit area of an ASIC (Application Specific Integrated Circuit) chip, and supports a plurality of hardware computing units, so that the speed of the repeated data matching algorithm is multiplied.

The embodiment of the application discloses a data compression method, which improves the compression performance of an LZ77 algorithm.

Referring to FIG. 4, a flow chart of a data compression method according to an exemplary embodiment is shown. As shown in FIG. 4, the method includes:

S11: acquiring data to be compressed;

S12: determining a current data unit in the data to be compressed; wherein the current data unit comprises a first preset number of bytes;

The present embodiment aims to compress the data to be compressed using a modified LZ77 algorithm. In a specific implementation, data to be compressed is obtained, and a current data unit containing a first preset number of bytes is determined therein. For example, when the start position of the current data unit is i and the first preset number is 4, the current data unit is [i, i+1, i+2, i+3]. If the first preset number is 8, the current data unit is [i, i+1, i+2, i+3, i+4, i+5, i+6, i+7].

S13: taking each byte in the current data unit as a first byte, and extracting a plurality of data units to be processed from the data to be compressed; each to-be-processed data unit comprises a second preset number of bytes;

In this step, each byte in the current data unit is taken as a first byte, and for each of them a data unit to be processed containing a second preset number of bytes is extracted from the data to be compressed. For example, when the current data unit is [i, i+1, i+2, i+3] and the second preset number is 4, the data units to be processed are [i, i+1, i+2, i+3], [i+1, i+2, i+3, i+4], [i+2, i+3, i+4, i+5] and [i+3, i+4, i+5, i+6]. If the current data unit is [i, i+1, i+2, i+3, i+4, i+5, i+6, i+7] and the second preset number is 4, the data units to be processed are [i, i+1, i+2, i+3], [i+1, i+2, i+3, i+4], [i+2, i+3, i+4, i+5], [i+3, i+4, i+5, i+6], [i+4, i+5, i+6, i+7], [i+5, i+6, i+7, i+8], [i+6, i+7, i+8, i+9] and [i+7, i+8, i+9, i+10].
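
The extraction in step S13 amounts to taking one fixed-size window per byte of the current data unit. A minimal sketch follows; the function name `extract_units` and the handling of windows that would run past the end of the data are assumptions, since the text does not address that case.

```python
def extract_units(data: bytes, start: int, first_n: int, second_n: int):
    """Return (position, unit) pairs, one per byte of the current data unit.

    `first_n` is the first preset number (bytes in the current data unit) and
    `second_n` is the second preset number (bytes in each unit to be processed).
    """
    units = []
    for offset in range(first_n):
        pos = start + offset
        unit = data[pos:pos + second_n]
        # Assumption: a window that would extend past the end of the data is skipped.
        if len(unit) == second_n:
            units.append((pos, unit))
    return units
```

With a first preset number of 8 and a second preset number of 4, this yields eight overlapping 4-byte units whose first bytes are at i through i+7, and the last unit ends at i+10.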

S14: utilizing a plurality of computing modules to match a plurality of to-be-processed data units with previous data in parallel;

S15: determining a target data unit to be processed matched with the corresponding matched data unit, taking the position of the first byte in the matched data unit in the data to be compressed as a matching position, taking the length of the matched data unit as a matching length, and replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length.

In a specific implementation, a plurality of computing modules match repeated data for a plurality of data units to be processed in parallel, so that the matching speed of the LZ77 algorithm is multiplied. If no target data unit to be processed matched with a corresponding matched data unit exists, step S12 is entered again. If such a target data unit to be processed exists, the matching position and the matching length are determined, and the target data unit to be processed in the data to be compressed is replaced with the matching position and the matching length, thereby compressing the data.

As a possible implementation, the matching, by using multiple computing modules, multiple to-be-processed data units with previous data in parallel includes: calculating a target hash value of a corresponding data unit to be processed by utilizing each calculation module, determining a corresponding target hash entry in a hash table by taking the target hash value as an index, determining a candidate matching position in the target hash entry, reading a first data content from the candidate matching position in the data to be compressed, reading a second data content from the data unit to be processed in the data to be compressed, and matching the first data content and the second data content; the hash table is used for storing the corresponding relation between the hash value of the data content and the position in the data to be compressed. Correspondingly, the determining the target to-be-processed data unit matched to the corresponding matching data unit includes: and if the first data content and the second data content are successfully matched, determining the data unit to be processed contained in the second data content as a target data unit to be processed, and determining the first data content as a matched data unit.

In a specific implementation, the hash table stores the correspondence between the hash value of a data content and the position of that data content in the data to be compressed. The hash table includes a plurality of hash entries; the index of each hash entry is the hash value of a data content, and the content stored in each hash entry is the position of that data content in the data to be compressed, specifically the position of its first byte. After a plurality of data units to be processed are extracted from the data to be compressed, they are input into the plurality of computing modules respectively, and each computing module performs the repeated data matching operation on its input data unit to be processed as follows. A target hash value of the data unit to be processed is calculated, the corresponding target hash entry in the hash table is determined using the target hash value as an index, and it is judged whether the target hash entry stores a valid value. If no valid value is stored, the matching fails, and the hash entry corresponding to the hash value of the data unit to be processed, namely the target hash entry, is updated based on the position in the data to be compressed of the first byte of the data unit to be processed. If a valid value is stored, it is read from the target hash entry as the candidate matching position; the first data content is read starting from the candidate matching position in the data to be compressed, the second data content is read starting from the first byte of the data unit to be processed in the data to be compressed, and the first data content and the second data content are matched. If they are matched successfully, the matching position and the matching length are determined. Further, the hash entry corresponding to the hash value of the target data unit to be processed is updated based on the position in the data to be compressed of the first byte of the target data unit to be processed.

For example, suppose the position in the data to be compressed of the first byte of the data unit to be processed is 1000, the data unit to be processed is ABCD, and its hash value is 8888. Then 8888 is used as an index to look up the 8888th entry in the hash table, and it is judged whether a valid value is stored there. If not, the matching fails and 1000 is stored in the 8888th entry. If so, the valid value, for example 600, is read from the 8888th entry, and data contents are read from positions 600 and 1000 in the data to be compressed for matching. For example, if the data content read from position 600 is ABCDEF … and the data content read from position 1000 is ABCDEG …, then the matching position is 600 and the matching length is 5, and the value 600 in the 8888th entry is replaced with 1000.
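
The lookup in this example can be sketched as below. The names `match_length` and `lookup_and_update` are hypothetical, the hash table is modelled as a plain dictionary from index to position, and two choices are assumptions not stated in the text: a match shorter than 4 bytes counts as a failure, and the entry is overwritten with the current position whether or not the comparison succeeds.

```python
MIN_MATCH = 4  # assumption: shortest comparison result that counts as a match


def match_length(data: bytes, candidate: int, current: int) -> int:
    """Count the consecutive bytes that agree, starting at the two positions."""
    n = 0
    while current + n < len(data) and data[candidate + n] == data[current + n]:
        n += 1
    return n


def lookup_and_update(data: bytes, table: dict, index: int, current: int):
    """Check hash entry `index`; return (match_position, match_length) or None."""
    candidate = table.get(index)  # None stands for "no valid value stored"
    table[index] = current        # assumption: the entry is refreshed in every case
    if candidate is None:
        return None
    length = match_length(data, candidate, current)
    return (candidate, length) if length >= MIN_MATCH else None
```

For data holding ABCDEF at position 600 and ABCDEG at position 1000, calling `lookup_and_update(data, {8888: 600}, 8888, 1000)` yields the matching position 600 and the matching length 5, and leaves 1000 stored under index 8888.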

As a preferred embodiment, after determining the target data unit to be processed matched with the corresponding matched data unit, the method further includes: if the second data content containing a first target data unit to be processed and the second data content containing a second target data unit to be processed overlap, rejecting whichever of the first target data unit to be processed and the second target data unit to be processed is positioned later.

It should be noted that, when the plurality of computing modules match repeated data for the plurality of data units to be processed in parallel, more than one computing module may find a match, that is, a plurality of target data units to be processed matched with corresponding matched data units may be determined. In this case, if the second data contents containing these target data units to be processed overlap, the later-positioned target data unit to be processed is rejected; the matching positions and matching lengths of the remaining target data units to be processed are determined for replacement, and the hash table is updated based on the positions of the remaining target data units to be processed in the data to be compressed.

For example, the 8 data units to be processed [i, i+1, i+2, i+3], [i+1, i+2, i+3, i+4], [i+2, i+3, i+4, i+5], [i+3, i+4, i+5, i+6], [i+4, i+5, i+6, i+7], [i+5, i+6, i+7, i+8], [i+6, i+7, i+8, i+9] and [i+7, i+8, i+9, i+10] are denoted A0, A1, A2, A3, A4, A5, A6 and A7 respectively. The matching result is as follows: A0 has a corresponding matched data unit with a matching length of 6, and the second data content M1 containing A0 is [i, i+1, i+2, i+3, i+4, i+5]; A4 has a corresponding matched data unit with a matching length of 4, and the second data content M2 containing A4 is [i+4, i+5, i+6, i+7]; A6 has a corresponding matched data unit with a matching length of 4, and the second data content M3 containing A6 is [i+6, i+7, i+8, i+9]. Since M1 and M2 overlap, A4 is rejected; after A4 is rejected, M3 and M1 do not overlap, so A0 and A6 are retained as target data units to be processed.
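
The rejection rule in this example can be sketched as a single pass over the matches in position order. The function name `reject_overlaps` and the (start, length) representation of a second data content are assumptions made for illustration.

```python
def reject_overlaps(matches):
    """Drop every match whose span overlaps an earlier match that was kept.

    `matches` is a list of (start, length) pairs, one per second data content.
    """
    kept = []
    end_of_kept = 0  # first position not covered by the most recently kept match
    for start, length in sorted(matches):
        if kept and start < end_of_kept:
            continue  # overlaps an earlier kept match, so the later one is rejected
        kept.append((start, length))
        end_of_kept = start + length
    return kept
```

With i = 100, the spans M1 = (100, 6), M2 = (104, 4) and M3 = (106, 4) reduce to M1 and M3: M2 starts inside M1 and is rejected, while M3 starts exactly where M1 ends.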

According to the data compression method provided by this embodiment of the application, a plurality of computing modules match repeated data for a plurality of data units to be processed in parallel, so that the compression performance of the LZ77 algorithm is multiplied.

The embodiment of the application discloses a data compression method, and compared with the previous embodiment, the embodiment further explains and optimizes the technical scheme. Specifically, the method comprises the following steps:

Referring to FIG. 5, a flow chart of another data compression method according to an exemplary embodiment is shown. As shown in FIG. 5, the method includes:

S21: acquiring data to be compressed;

S22: determining a current processing position;

In this embodiment, the initial current processing position is 0.

S23: judging whether the distance between the current processing position and the last byte in the data to be compressed is larger than or equal to a first preset number or not; if yes, entering S24; if not, ending the flow;

In a specific implementation, it is judged whether the distance between the current processing position and the last byte in the data to be compressed is greater than or equal to the first preset number; if so, the flow proceeds to S24, and if not, the flow ends.

S24: taking a byte corresponding to the current processing position in the data to be compressed as a first byte, and extracting a current data unit containing the first preset number of bytes from the data to be compressed;

s25: taking each byte in the current data unit as a first byte, and extracting a plurality of data units to be processed from the data to be compressed; each to-be-processed data unit comprises a second preset number of bytes;

s26: calculating a target hash value of a corresponding data unit to be processed by utilizing each calculation module, determining a corresponding target hash entry in a hash table by taking the target hash value as an index, determining a candidate matching position in the target hash entry, reading a first data content from the candidate matching position in the data to be compressed, reading a second data content from the data unit to be processed in the data to be compressed, and matching the first data content and the second data content; the hash table is used for storing the corresponding relation between the hash value of the data content and the position in the data to be compressed; if the first data content and the second data content are successfully matched, determining the data unit to be processed contained in the second data content as a target data unit to be processed, and determining the first data content as a matched data unit;

s27: judging whether a target data unit to be processed matched with the corresponding matched data unit exists or not; if yes, entering S28; if not, entering S29;

s28: taking the position of the first byte in the matched data unit in the data to be compressed as a matching position, taking the length of the matched data unit as a matching length, replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length, updating a hash entry corresponding to the hash value of the target data unit to be processed in the hash table based on the position of the first byte in the data to be compressed in the target data unit to be processed, incrementing the current processing position by the matching length, and re-entering S23;

S29: updating the hash entries corresponding to the hash values of the multiple data units to be processed in the hash table based on the positions, in the data to be compressed, of the first bytes of the respective data units to be processed, incrementing the current processing position by the first preset number, and re-entering S23.

In this embodiment, if no target data unit to be processed matches a corresponding matched data unit, the hash table is updated based on the positions, in the data to be compressed, of the first bytes of the multiple data units to be processed, the current processing position is incremented by the first preset number, and the process returns to S23. If a target data unit to be processed that matches a corresponding matched data unit exists, the matching position and the matching length are determined based on the matched data unit, the hash table is updated based on the position, in the data to be compressed, of the first byte of the target data unit to be processed, the target data unit to be processed in the data to be compressed is replaced with the matching position and the matching length, the current processing position is then incremented by the matching length, and the process re-enters S23. It should be noted that, if there are multiple target data units to be processed, then to improve efficiency the current processing position may be incremented by the matching length corresponding to the target data unit to be processed at the later position.
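The unit extraction of S24 and S25 and the position update of S28 and S29 can be pictured with the following minimal Python sketch. It is an illustration only: the function names are hypothetical, both preset numbers are assumed to be 4 (the value used in the later application embodiment), and units that would run past the end of the data are simply skipped, which the text does not specify.

```python
def extract_units(data: bytes, pos: int, n: int = 4, m: int = 4):
    """S24/S25: the current data unit is data[pos:pos+n]; one m-byte data unit
    to be processed starts at each of its bytes.

    Returns (start, unit) pairs. Skipping units that would run past the end
    of the data is an assumption of this sketch.
    """
    units = []
    for start in range(pos, pos + n):
        if start + m <= len(data):
            units.append((start, data[start:start + m]))
    return units


def next_position(pos: int, match_len: int, n: int = 4) -> int:
    """S28/S29: advance by the matching length on a match (match_len > 0),
    otherwise by the first preset number n."""
    return pos + match_len if match_len > 0 else pos + n


units = extract_units(b"abcdefgh", 0)
# Four overlapping units, one per byte of the current data unit "abcd".
print(units)
```

In hardware each returned unit would be handed to its own calculation module; here they are just collected in a list.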

On the basis of the foregoing embodiment, as a preferred implementation, the hash table includes a plurality of hash entries, each indexed by a first-level hash value calculated based on a first hash algorithm. Each hash entry includes a second preset number of second-level hash values and the corresponding second preset number of positions in the data to be compressed. The second-level hash values are hash values of the data content calculated based on a second hash algorithm, and the first-level hash value and the second-level hash values are each one byte.

In this embodiment, the result of the hash algorithm in the existing LZ77 hash table technique is reduced from 2 bytes to 1 byte, so only 2^8 = 256 hash table entries are needed, a 256-fold reduction from the number of hash table entries of the existing LZ77 (2^16 = 65536). One problem with significantly reducing the number of hash entries is that the hash collision rate increases, the data matching success rate of LZ77 decreases, and the data compression ratio decreases accordingly. To address the hash collision problem, this embodiment first expands the size of the hash bucket: in the existing LZ77 hash table the bucket size is 2, that is, one hash entry can hold 2 positions, whereas in this embodiment the bucket size is expanded to the second preset number, for example 4, so that one hash entry can hold 4 positions. In addition, this embodiment uses the first-level hash value calculated based on the first hash algorithm as the index of the hash entry, and introduces into the hash table a second-level hash value calculated based on the second hash algorithm to avoid the false matching caused by hash collisions; that is, each hash entry stores the second preset number of second-level hash values in addition to the second preset number of positions.

For example, when the second preset number is 4, the structure of the hash table is as shown in fig. 6: each position is still represented by 2 bytes, and each secondary hash value is represented by 1 byte. Thus, each hash entry occupies a total of 4 × (2 + 1) = 12 bytes, and the size of the entire hash table is 256 × 12 = 3072 bytes (3 KB). The secondary hash value of the data at the position stored in pos1 is stored in hash1, and likewise for pos2 and hash2, pos3 and hash3, and pos4 and hash4.
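To make the size arithmetic concrete, the following Python sketch packs the table into a flat 3072-byte buffer. The helper names, the little-endian byte order and the interleaved (pos, hash) ordering inside an entry are assumptions of this sketch; the actual layout is the one in fig. 6, which is not reproduced here, and the text does not say how an empty slot is marked.

```python
ENTRIES = 256        # one entry per 1-byte primary hash value
SLOTS = 4            # second preset number (bucket size)
POS_BYTES = 2        # each position is stored in 2 bytes
HASH_BYTES = 1       # each secondary hash value is stored in 1 byte

ENTRY_BYTES = SLOTS * (POS_BYTES + HASH_BYTES)   # 4 * (2 + 1) = 12
TABLE_BYTES = ENTRIES * ENTRY_BYTES              # 256 * 12 = 3072


def new_table() -> bytearray:
    """Allocate the whole table as one zero-filled buffer."""
    return bytearray(TABLE_BYTES)


def slot_offset(h1: int, slot: int) -> int:
    """Byte offset of slot `slot` (0..3) in the entry indexed by primary hash h1."""
    return h1 * ENTRY_BYTES + slot * (POS_BYTES + HASH_BYTES)


def write_slot(table: bytearray, h1: int, slot: int, pos: int, h2: int) -> None:
    """Store a 2-byte position followed by its 1-byte secondary hash."""
    off = slot_offset(h1, slot)
    table[off:off + POS_BYTES] = pos.to_bytes(POS_BYTES, "little")
    table[off + POS_BYTES] = h2


def read_slot(table: bytearray, h1: int, slot: int):
    """Return (position, secondary hash) stored in the given slot."""
    off = slot_offset(h1, slot)
    pos = int.from_bytes(table[off:off + POS_BYTES], "little")
    return pos, table[off + POS_BYTES]
```

A 2-byte position field addresses at most 65536 bytes, so a layout like this suits block-wise compression of inputs no larger than 64 KB.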

Further, on the basis of the hash table provided in this embodiment, refer to fig. 7, which is a detailed flowchart of step S26 in fig. 5. As shown in fig. 7, step S26 specifically includes:

S261: calculating a target primary hash value of the corresponding data unit to be processed by utilizing each calculation module based on the first hash algorithm, and determining a corresponding target hash entry in a hash table by taking the target primary hash value as an index;

S262: calculating a target secondary hash value of the corresponding data unit to be processed by utilizing each calculation module based on the second hash algorithm, and determining a candidate matching position corresponding to the target secondary hash value in the target hash entry;

S263: reading first data content from the corresponding candidate matching position in the data to be compressed by utilizing each computing module, reading second data content from the data unit to be processed in the data to be compressed, and matching the first data content and the second data content; and if the first data content and the second data content are successfully matched, determining the data unit to be processed contained in the second data content as a target data unit to be processed, and determining the first data content as a matched data unit.

In this embodiment, each computing module performs the duplicate-data matching operation on its input data unit to be processed. The specific process is as follows: a target first-level hash value of the data unit to be processed is calculated based on the first hash algorithm, and the corresponding target hash entry in the hash table is determined by taking the target first-level hash value as an index; a target second-level hash value of the data unit to be processed is calculated based on the second hash algorithm, and it is judged whether a valid value corresponding to the target second-level hash value is stored in the target hash entry.

If no such valid value exists, the matching fails, and the hash entry corresponding to the hash value of the data unit to be processed in the hash table, namely the target hash entry, is updated based on the position, in the data to be compressed, of the first byte of the data unit to be processed. Specifically, the target secondary hash value and the position, in the data to be compressed, of the first byte of the data unit to be processed are stored in the target hash entry.

If such a valid value exists, the position stored with it is taken as the candidate matching position, and the first data content is matched against the second data content; if they are successfully matched, the matching position and the matching length are determined. Further, the hash entry corresponding to the primary hash value of the target data unit to be processed in the hash table is updated based on the position, in the data to be compressed, of the first byte of the target data unit to be processed; specifically, that position is stored in the slot of the hash entry that corresponds to the secondary hash value of the target data unit to be processed.
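The lookup, verification and update flow of S261 to S263 can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the text does not name the first or second hash algorithm, so `hash1` and `hash2` are arbitrary stand-ins, the entry is modelled as a short list rather than packed bytes, and the oldest-first eviction used when an entry is full is not taken from the source.

```python
SLOTS = 4  # second preset number


def hash1(unit: bytes) -> int:
    """Hypothetical first hash algorithm: multiplicative hash, top 8 bits."""
    v = int.from_bytes(unit, "little")
    return ((v * 2654435761) & 0xFFFFFFFF) >> 24


def hash2(unit: bytes) -> int:
    """Hypothetical second hash algorithm: polynomial hash reduced to 1 byte."""
    h = 0
    for b in unit:
        h = (h * 31 + b) & 0xFF
    return h


def new_table():
    """256 entries, each holding up to SLOTS [secondary_hash, position] pairs."""
    return [[] for _ in range(256)]


def lookup(table, unit: bytes):
    """S261/S262: return the candidate matching position, or None on a miss."""
    h2 = hash2(unit)
    for stored_h2, pos in table[hash1(unit)]:
        if stored_h2 == h2:
            return pos
    return None


def verify(data: bytes, cand: int, cur: int, min_len: int = 4) -> int:
    """S263: compare content at cand and cur; return the match length,
    or 0 if fewer than min_len bytes agree. Assumes cand < cur."""
    n = 0
    while cur + n < len(data) and data[cand + n] == data[cur + n]:
        n += 1
    return n if n >= min_len else 0


def update(table, unit: bytes, pos: int) -> None:
    """Store pos under the unit's secondary hash in its target hash entry."""
    entry = table[hash1(unit)]
    h2 = hash2(unit)
    for slot in entry:
        if slot[0] == h2:
            slot[1] = pos          # same secondary hash: overwrite the position
            return
    if len(entry) == SLOTS:
        entry.pop(0)               # assumed policy: evict the oldest slot
    entry.append([h2, pos])
```

The content comparison in `verify` is what guards against a false match when two different units collide on both one-byte hashes.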

Therefore, this embodiment reduces the hash table memory overhead of the LZ77 algorithm by greatly reducing the length of the hash table while slightly expanding its width, and can be used in memory-constrained scenarios, such as ASIC or FPGA (Field Programmable Gate Array) hardware implementations of the LZ77 algorithm and embedded devices with limited memory resources.

An application embodiment provided by the present application is introduced below. Referring to fig. 8, which is a flowchart of a duplicate data identification technique based on parallel computing and memory optimization according to an exemplary embodiment, the method specifically includes the following steps:

Step 1: setting the start address of the current data unit to position 0, the first byte of the input data.

Step 2: judging whether the current input stream contains at least one data unit, that is, at least 4 bytes; if yes, entering step 3; otherwise, ending execution and outputting all matching information.

Step 3: taking each byte in the current data unit as a starting position, and performing four-way parallel hash value calculation. Refer to fig. 9, which is a schematic diagram of a duplicate data identification technique based on parallel computing according to an exemplary embodiment.

Step 4: taking the four calculated first-level hash values as indexes into the hash table, and performing a four-way parallel hash table lookup. During the lookup, the secondary hash value stored in the hash table is compared with the secondary hash value calculated from the current data unit, and a hash hit is counted only if the two are equal.

Step 5: if the hash hits, continuing to compare the data content, checking the 1st, 2nd, 3rd and 4th ways in order for the first true match to obtain the matching position and the matching length, calculating a hash value with each matched byte as a starting position, updating the hash table, and incrementing the starting position of the current data unit by the matching length in bytes; if the hash misses, updating the hash table with the current four hash values and their position information, and incrementing the starting position of the current data unit by 4 bytes.

Step 6: returning to step 2 for the next iteration.
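Steps 1 to 6 can be traced end to end with the following Python sketch, which is a software simulation rather than the hardware design. Several simplifications are assumed: a dictionary keyed by the 4-byte unit stands in for the two-level hash table, the four ways are evaluated one after another instead of in parallel, the token format (integer literals and (position, length) tuples) is hypothetical, and on a hit the sketch advances to the end of the matched span, which equals "start plus matching length" when the match begins at the current position.

```python
def compress(data: bytes, ways: int = 4, unit_len: int = 4):
    """Return a token list: int literals and (match_pos, match_len) tuples."""
    table = {}        # stand-in for the hash table: unit bytes -> latest position
    tokens = []
    pos = 0                                            # Step 1
    while len(data) - pos >= unit_len:                 # Step 2
        # Steps 3-4: one lookup per way (parallel in hardware, serial here).
        starts = [s for s in range(pos, pos + ways) if s + unit_len <= len(data)]
        cands = [table.get(data[s:s + unit_len]) for s in starts]

        # Step 5: take the first way, in order, whose candidate truly matches.
        hit = None
        for s, c in zip(starts, cands):
            if c is None:
                continue
            n = 0
            while s + n < len(data) and data[c + n] == data[s + n]:
                n += 1
            if n >= unit_len:
                hit = (s, c, n)
                break

        if hit is None:
            # Miss: record all units of this round and emit the bytes as literals.
            for s in starts:
                table[data[s:s + unit_len]] = s
            tokens.extend(data[pos:pos + ways])
            pos += ways
        else:
            s, c, n = hit
            tokens.extend(data[pos:s])                 # literals before the match
            tokens.append((c, n))                      # matching position and length
            for k in range(s, s + n):                  # re-index each matched byte
                if k + unit_len <= len(data):
                    table[data[k:k + unit_len]] = k
            pos = s + n
        # Step 6: loop back to Step 2.
    tokens.extend(data[pos:])                          # tail shorter than one unit
    return tokens


print(compress(b"abcdabcdabcdxyz"))
```

For the input above, the first round misses and emits four literals, and the second round finds an 8-byte match at position 0 that overlaps the bytes being encoded, which is the usual LZ77 way of expressing a repeating run.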

Therefore, this embodiment can greatly improve the performance of LZ77-based data compression algorithms and can be used to implement high-performance data compression in storage devices. For given data to be compressed, duplicate data identification can be executed in parallel, and the resulting sequence of unmatched symbols, matching positions and matching lengths can then be further compressed using techniques such as Huffman coding or FSE (Finite State Entropy) coding.

A data compression apparatus provided by an embodiment of the present application is introduced below; the data compression apparatus described below and the data compression method described above may be referred to in conjunction with each other.

Referring to fig. 10, which is a block diagram of a data compression apparatus according to an exemplary embodiment, the apparatus includes:

an obtaining module 10, configured to obtain data to be compressed;

a determining module 20, configured to determine a current data unit in the data to be compressed; wherein the current data unit comprises a first preset number of bytes;

an extracting module 30, configured to extract multiple data units to be processed from the data to be compressed, with each byte in the current data unit as a first byte; each to-be-processed data unit comprises a second preset number of bytes;

a plurality of calculation modules 40, configured to match a plurality of to-be-processed data units with previous data in parallel;

a replacing module 50, configured to determine a target data unit to be processed that is matched to a corresponding matching data unit, use a position of a first byte in the matching data unit in the data to be compressed as a matching position, use a length of the matching data unit as a matching length, and replace the target data unit to be processed in the data to be compressed with the matching position and the matching length.

According to the data compression apparatus provided by the embodiment of the present application, the plurality of computing modules perform duplicate-data matching on the plurality of data units to be processed in parallel, so that the compression performance of the LZ77 algorithm is improved severalfold.

On the basis of the foregoing embodiment, as a preferred implementation, the determining module 20 includes:

a determination submodule, configured to determine a current processing position;

an extraction submodule, configured to extract a current data unit containing the first preset number of bytes from the data to be compressed by taking the byte corresponding to the current processing position in the data to be compressed as a first byte;

correspondingly, the apparatus further includes:

a first increasing module, configured to increment the current processing position by the first preset number when no target data unit to be processed matches a corresponding matched data unit, and to restart the workflow of the extraction submodule;

a second increasing module, configured to increment the current processing position by the matching length after the target data unit to be processed in the data to be compressed is replaced with the matching position and the matching length, and to restart the workflow of the extraction submodule.

On the basis of the foregoing embodiment, as a preferred implementation, the calculation module is specifically configured to: calculate a target hash value of the corresponding data unit to be processed, determine a corresponding target hash entry in a hash table by taking the target hash value as an index, determine a candidate matching position in the target hash entry, read first data content from the candidate matching position in the data to be compressed, read second data content starting from the data unit to be processed in the data to be compressed, and match the first data content against the second data content; the hash table stores the correspondence between hash values of data content and positions in the data to be compressed;

correspondingly, the replacing module 50 is specifically configured to: when the first data content and the second data content are successfully matched, determine the data unit to be processed contained in the second data content as a target data unit to be processed, and determine the first data content as a matched data unit.

On the basis of the above embodiment, as a preferred implementation, the apparatus further includes:

a first updating module, configured to update, based on the position, in the data to be compressed, of the first byte of the target data unit to be processed, the hash entry corresponding to the hash value of the target data unit to be processed in the hash table;

a second updating module, configured to update, when no target data unit to be processed matches a corresponding matched data unit, the hash entries corresponding to the hash values of the multiple data units to be processed in the hash table based on the positions, in the data to be compressed, of the first bytes of the respective data units to be processed.

On the basis of the above embodiment, as a preferred implementation, the apparatus further includes:

a rejecting module, configured to reject, when the second data content containing a first target data unit to be processed overlaps the second data content containing a second target data unit to be processed, whichever of the first and second target data units to be processed is at the later position.
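The behaviour of the rejecting module can be illustrated with a short Python sketch. The representation is an assumption: each match is given as a hypothetical (start, length) span over the data to be compressed, and a span is dropped whenever it overlaps one that starts earlier and has already been kept.

```python
def reject_overlaps(matches):
    """Keep non-overlapping match spans, preferring the earlier one.

    matches: iterable of (start, length) spans in the data to be compressed.
    """
    kept = []
    end = 0                                  # first byte not yet covered
    for start, length in sorted(matches):
        if start >= end:                     # no overlap with the kept spans
            kept.append((start, length))
            end = start + length
    return kept


# The span starting at 6 overlaps the one covering bytes 4..11 and is dropped.
print(reject_overlaps([(4, 8), (6, 5), (12, 4)]))
```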

On the basis of the foregoing embodiment, as a preferred implementation, the calculation module is specifically configured to: calculate a target primary hash value of the corresponding data unit to be processed based on the first hash algorithm, and determine a corresponding target hash entry in a hash table by taking the target primary hash value as an index; calculate a target secondary hash value of the corresponding data unit to be processed based on the second hash algorithm, and determine a candidate matching position corresponding to the target secondary hash value in the target hash entry; and read first data content from the candidate matching position in the data to be compressed, read second data content starting from the data unit to be processed in the data to be compressed, and match the first data content against the second data content.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 11 is a structural diagram of an electronic device according to an exemplary embodiment, as shown in fig. 11, the electronic device includes:

a communication interface 1 capable of information interaction with other devices such as network devices and the like;

a processor 2, connected to the communication interface 1 to implement information interaction with other devices, and configured to execute, when running a computer program, the data compression method provided by one or more of the foregoing technical solutions. The computer program is stored in the memory 3.

In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 4 in fig. 11.

The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.

It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.

The method disclosed in the above embodiment of the present application may be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.

When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.

In an exemplary embodiment, the present application further provides a storage medium, namely a computer storage medium, specifically a computer-readable storage medium, for example a memory 3 storing a computer program that can be executed by the processor 2 to implement the steps of the foregoing method. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of data compression, comprising:

acquiring data to be compressed;

determining a target data unit to be processed matched with a corresponding matched data unit, taking the position of a first byte in the matched data unit in the data to be compressed as a matching position, taking the length of the matched data unit as a matching length, and replacing the target data unit to be processed in the data to be compressed with the matching position and the matching length.

2. The data compression method of claim 1, wherein determining the current data unit in the data to be compressed comprises:

determining a current processing position;

if the target data unit to be processed which is matched with the corresponding matched data unit does not exist, increasing the current processing position by the first preset number, and re-entering the step of extracting the current data unit containing the first preset number of bytes from the data to be compressed by taking the byte corresponding to the current processing position in the data to be compressed as a first byte;

3. The data compression method of claim 1, wherein the matching the plurality of to-be-processed data units with the previous data in parallel by using a plurality of computing modules comprises:

calculating a target hash value of a corresponding data unit to be processed by utilizing each calculation module, determining a corresponding target hash entry in a hash table by taking the target hash value as an index, determining a candidate matching position in the target hash entry, reading first data content from the candidate matching position in the data to be compressed, reading second data content from the data unit to be processed in the data to be compressed, and matching the first data content with the second data content; the hash table is used for storing the corresponding relation between the hash value of the data content and the position in the data to be compressed;

4. The data compression method of claim 3, wherein after determining the target data unit to be processed that matches the corresponding matching data unit, further comprising:

5. The data compression method as claimed in claim 3, wherein after determining the target data unit to be processed matching to the corresponding matching data unit, the method further comprises:

6. The data compression method according to claim 3, wherein the hash table includes a plurality of hash entries, each hash entry uses a first-level hash value calculated based on a first hash algorithm as an index, each hash entry includes a second preset number of second-level hash values and a corresponding second preset number of positions in the data to be compressed, the second-level hash values are hash values calculated based on a second hash algorithm for data contents, and the first-level hash values and the second-level hash values are both one byte.

7. The data compression method as claimed in claim 6, wherein the calculating, by each calculating module, a target hash value of the corresponding data unit to be processed, determining a corresponding target hash entry in a hash table using the target hash value as an index, and determining a candidate matching position in the target hash entry comprises:

8. A data compression apparatus, comprising:

the acquisition module is used for acquiring data to be compressed;

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the data compression method as claimed in any one of claims 1 to 7 when executing said computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of a data compression method as claimed in any one of the claims 1 to 7.