WO2018058604A1

WO2018058604A1 - Data compression method and device, and computation device

Info

Publication number: WO2018058604A1
Application number: PCT/CN2016/101259
Authority: WO
Inventors: 张希舟; 张剑; 牛进保; 全绍晖
Original assignee: 华为技术有限公司
Priority date: 2016-09-30
Filing date: 2016-09-30
Publication date: 2018-04-05
Also published as: CN110419036B; CN110419036A

Abstract

A data compression method for use in a storage array. The method comprises: setting the last n bits of the starting logical address of the storage space in which the data to be compressed is located to 0, so that, in the subsequent process of compressing the data, read and write operations on a hash table can be performed more easily, thereby increasing the compression speed.

Description

Data compression method, device and computing device

Technical field

The present application relates to the field of computer technologies, and in particular, to a data compression method, and a data compression device corresponding to the method, and a computing device for executing the data compression method.

Background technique

Compression technology is widely used in data storage, data transmission, and other fields. Traditional compression technology includes dictionary compression, also known as Lempel-Ziv (LZ) compression after Abraham Lempel and Jacob Ziv. LZ compression has many coding variants, such as LZ4, LZ5, LZO, and LZH. The common feature of these codings is that historical data is used as a dictionary when encoding the current data.

LZ compression performs data compression at the granularity of bytes/strings. For example, suppose the data block to be compressed is 4M Byte and the window size is 4 Byte. When the 4 Byte of data in the current window is compressed, it is matched against the historical data of the data block to be compressed. If the historical data contains data identical to the 4 Byte of data, the code corresponding to the 4 Byte of data only needs to record the position and length of that historical data, so that during decompression the 4 Byte of data can be recovered from its code. The compression speed of current LZ compression still needs to be improved.

Summary of the invention

The present application provides a data compression method to increase the speed of data compression.

A first aspect of the present application provides a data compression method performed by a storage controller or a data compression device, including: first, allocating a storage space, where the last N bits of the starting logical address of the storage space are 0, and N is an integer greater than 1. In practice, the value of N is related to the size of the data to be compressed that is subsequently processed.

Then, the data to be compressed is stored in the storage space. The size of the data to be compressed is 2^n Byte, and n is not greater than N, so the last N bits of the starting logical address of the data to be compressed are 0. Because the size of the data to be compressed is no more than 2^N Byte, the valid part of the starting logical address of the data to be compressed is 0; the part of the starting logical address above the last n bits is an invalid part, because that part is the same in the logical address of every Byte of the data to be compressed.

Then, a hash operation is performed on the a-th Byte to the (a+m)-th Byte of the data to be compressed to generate a hash value, where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed. As the starting point of the window shifts to the right from the first Byte of the data to be compressed, a can take values from 1 to (2^n - m).

Then, it is determined whether a key identical to the hash value exists in the hash table. Each key of the hash table is a hash value generated by hashing (m+1) Byte of historical data of the (a+m)-th Byte, and the corresponding value includes the last n bits of the starting logical address of that (m+1) Byte of historical data. Determining whether such a key exists means matching the hash value against the keys in the hash table one by one. If a matching key exists, this indicates that the a-th Byte to the (a+m)-th Byte have appeared in the historical data of the data to be compressed. If no matching key exists, this indicates that the a-th Byte to the (a+m)-th Byte appear in the data to be compressed for the first time.

According to the result of the previous step, if a key identical to the hash value exists in the hash table, the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte. If no key identical to the hash value exists in the hash table, the hash value and the last n bits of the logical address of the a-th Byte of the data to be compressed are added to the hash table.

If the a-th Byte to the (a+m)-th Byte have appeared in the historical data, the starting logical address of that historical data recorded in the hash table needs to be replaced with the starting logical address of the a-th Byte to the (a+m)-th Byte. If the a-th Byte to the (a+m)-th Byte appear in the data to be compressed for the first time, the hash value and the last n bits of the logical address of the a-th Byte of the data to be compressed are inserted into the hash table as a new row. After the window continues to move to the right, any subsequent (m+1) Byte of data that is identical to the a-th Byte to the (a+m)-th Byte can match the inserted row.
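The lookup, update, and insert described above can be sketched as follows. This is a minimal illustration, not the method as implemented by the applicant: the hash function (`zlib.crc32`), the use of a Python `dict` as the hash table, the base address, and the helper name `update_hash_table` are all assumptions made for the example.

```python
import zlib

def update_hash_table(table, data, a, m, base, n):
    """Process the window that starts at the 1-based byte index a.

    Returns the previously stored value (last n bits of an earlier
    address) if the key already existed, otherwise None.
    """
    window = data[a - 1 : a + m]              # bytes a .. a+m, i.e. m+1 bytes
    key = zlib.crc32(window)                  # stand-in hash function (assumption)
    logical_addr = base + (a - 1)             # logical address of the a-th byte
    low_bits = logical_addr & ((1 << n) - 1)  # last n bits of that address
    previous = table.get(key)                 # None when no matching key exists
    table[key] = low_bits                     # update the row, or add a new row
    return previous

base = 0xFFFF_FFFF_0000_0000   # hypothetical start address, last 32 bits are 0
data = b"abcdXabcd"            # fits in 2^4 bytes, so n = 4 is enough here
table = {}
first = update_hash_table(table, data, a=1, m=3, base=base, n=4)
second = update_hash_table(table, data, a=6, m=3, base=base, n=4)
```

In this sketch the first call finds no key and adds a row, while the second call finds the row created for the earlier "abcd" and replaces its value with the offset of the later occurrence.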

Since the minimum granularity for reading and writing the value of the hash table is the Byte, if n is not an integer multiple of 4, the content used to replace or add a value in the hash table may include, in addition to the last n bits of the logical address of the a-th Byte of the data to be compressed, one or more bits above those n bits. For example, in the case of n=14, since at least 4 Byte of content needs to be replaced or added to the hash table, the last 16 bits of the logical address of the a-th Byte of the data to be compressed need to be replaced or added.

The data compression method provided above simplifies the operation of the hash table in the data compression process by setting the end N bit of the starting logical address of the storage space for storing the data to be compressed to 0, thereby improving the data compression speed.

In conjunction with the first aspect, in a first implementation of the first aspect, the data to be compressed includes a plurality of data blocks.

Multiple data blocks are stored in the storage space and compressed together, which improves the compression ratio compared with compressing only a single data block at a time.

With reference to the first aspect or the first implementation of the first aspect, in a second implementation of the first aspect, before determining whether a key identical to the hash value exists in the hash table, the method further includes: determining whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0. If the size of the data to be compressed is greater than 2^K Byte, the length of the value of the hash table is not less than (K/8+1) Byte; that is, at least (K/8+1) Byte is needed to express the relative addresses of the data to be compressed. In contrast, if the size of the data to be compressed is less than or equal to 2^K Byte, the length of the value of the hash table is not less than K/8 Byte; that is, K/8 Byte suffices to express the relative addresses of the data to be compressed.
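The reasoning behind this rule can be checked with a short sketch. The function names are hypothetical, and the sketch assumes K is a multiple of 8, as in the values of K used as examples elsewhere in this specification.

```python
def offset_bytes_needed(size_in_bytes):
    """Bytes needed to express every relative address 0 .. size-1."""
    bits = max(1, (size_in_bytes - 1).bit_length())
    return (bits + 7) // 8          # round the bit count up to whole bytes

def min_value_length(size_in_bytes, K):
    """Rule from the text: K/8 Byte if size <= 2^K, otherwise K/8 + 1 Byte."""
    if size_in_bytes > (1 << K):
        return K // 8 + 1
    return K // 8

small = 1 << 16      # exactly 2^16 Byte of data
large = 1 << 20      # more than 2^16 Byte of data
```

With K=16, `small` needs 2 Byte per relative address and `large` needs 3 Byte, which agrees with the lower bounds K/8 and (K/8+1) stated above.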

By determining the size of the data to be compressed before matching the hash value against the keys in the hash table, the length of the logical address of the data to be compressed that needs to be written into the value of the hash table can be determined once. Compared with determining that length each time an update or write operation is performed on the hash table, this increases the compression speed.

With reference to the second implementation of the first aspect, in a third implementation of the first aspect, if the size of the data to be compressed is greater than 2^K Byte, after the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte, the method further includes: if the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is less than 2^K, matching the a-th Byte and the data after the a-th Byte in the data to be compressed against the historical data of the a-th Byte in the data to be compressed, and generating a compression code according to the matching result; if that difference is not less than 2^K, not matching the a-th Byte and the data after the a-th Byte in the data to be compressed against the historical data of the a-th Byte in the data to be compressed.

The data compression algorithm includes the following setting: when the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is not less than 2^K, matching for the current window is abandoned. With this setting, if the size of the data to be compressed is greater than 2^K Byte, that difference may be not less than 2^K. If the size of the data to be compressed is not greater than 2^K Byte, that difference is necessarily less than 2^K. Therefore, the difference needs to be compared with 2^K only when the size of the data to be compressed is greater than 2^K Byte. By first determining whether the size of the data to be compressed is greater than 2^K Byte, the comparison can be omitted whenever the size is not greater than 2^K Byte, which saves a judgment step in the data compression and improves the compression speed.
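A minimal sketch of this decision follows. The function name `should_match` and its parameters are hypothetical; `current_addr_bits` stands for the last (8 * value length) bits of the logical address of the a-th Byte, and `stored_value` for the value found in the hash table.

```python
def should_match(current_addr_bits, stored_value, size_in_bytes, K):
    """Decide whether to attempt a match for the current window."""
    if size_in_bytes <= (1 << K):
        # Every relative address is below 2^K, so the difference is too:
        # the comparison can be skipped entirely.
        return True
    # Larger data: abandon the window when the distance is not less than 2^K.
    return (current_addr_bits - stored_value) < (1 << K)

far = should_match(70000, 10, size_in_bytes=1 << 20, K=16)
near = should_match(70000, 60000, size_in_bytes=1 << 20, K=16)
small_data = should_match(65535, 0, size_in_bytes=1 << 16, K=16)
```

Here `far` is rejected because the distance 69990 is not less than 2^16, while `near` and `small_data` are accepted.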

With reference to the third implementation of the first aspect, in a fourth implementation of the first aspect, after generating the compression code according to the matching result, the method further includes: determining whether the (a+m)-th Byte is the last Byte of the data to be compressed; if so, ending the encoding of the data to be compressed, and if not, shifting the window of the hash operation to the right.

In a second aspect of the present application, a data compression device is provided, comprising a communication interface and a processing chip, the communication interface being connected to the processing chip. The communication interface is used to communicate with an external device to obtain data to be compressed. The processing chip is configured to allocate a storage space, where the last N bits of the starting logical address of the storage space are 0, and N is an integer greater than 1. The communication interface is configured to acquire the data to be compressed and store it in the storage space, where the size of the data to be compressed is 2^n Byte, and n is not greater than N. The processing chip is further configured to: perform a hash operation on the a-th Byte to the (a+m)-th Byte of the data to be compressed to generate a hash value, where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed; and determine whether a key identical to the hash value exists in a hash table, where each key of the hash table is a hash value generated by hashing (m+1) Byte of historical data of the (a+m)-th Byte, and the corresponding value includes the last n bits of the starting logical address of that (m+1) Byte of historical data. If a key identical to the hash value exists in the hash table, the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte; if no such key exists, the hash value and the last n bits of the logical address of the a-th Byte of the data to be compressed are added to the hash table.

The data compression device provided above simplifies the operation of the hash table in the data compression process by setting the end N bit of the starting logical address of the storage space for storing the data to be compressed to 0, thereby improving the data compression speed.

In conjunction with the second aspect, in a first implementation of the second aspect, the data to be compressed includes a plurality of data blocks.

The data compression device can store a plurality of data blocks in the storage space and compress them together, which improves the compression ratio compared with compressing only a single data block at a time.

With reference to the second aspect or the first implementation of the second aspect, in a second implementation of the second aspect, before determining whether a key identical to the hash value exists in the hash table, the processing chip is further configured to: determine whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0; if the size of the data to be compressed is greater than 2^K Byte, set the length of the value of the hash table to be no less than (K/8+1) Byte; and if the size of the data to be compressed is less than or equal to 2^K Byte, set the length of the value of the hash table to be no less than K/8 Byte.

By determining the size of the data to be compressed before matching the hash value against the keys in the hash table, the data compression device determines the length of the logical address that needs to be written into the value of the hash table, thereby improving the compression speed.

With reference to the second implementation of the second aspect, in a third implementation of the second aspect, if the size of the data to be compressed is greater than 2^K Byte, after updating the value corresponding to the hash value in the hash table according to the last n bits of the logical address of the a-th Byte, the processing chip is further configured to: if the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is less than 2^K, match the a-th Byte and the data after the a-th Byte in the data to be compressed against the historical data of the a-th Byte in the data to be compressed, and generate a compression code according to the matching result; and if that difference is not less than 2^K, not match the a-th Byte and the data after the a-th Byte in the data to be compressed against the historical data of the a-th Byte in the data to be compressed.

The data compression device determines in advance whether the size of the data to be compressed is greater than 2^K Byte. If the size is not greater than 2^K Byte, the device does not need to judge whether the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is greater than or equal to 2^K, which saves a judgment step and further improves the compression speed.

With reference to the third implementation of the second aspect, in a fourth implementation of the second aspect, after generating the compression code according to the matching result, the processing chip is further configured to determine whether the (a+m)-th Byte is the last Byte of the data to be compressed; if so, the encoding of the data to be compressed ends, and if not, the window for the hash operation is shifted to the right.

A third aspect of the present application provides a computing device including a processor and a memory. The processor and the memory establish a communication connection through a bus, the processor operating to read a program in the memory to perform the data compression method provided by the first aspect.

In a fourth aspect of the present application, a storage medium is provided. The storage medium stores program code, and when the program code is executed by a computing device, the data compression method provided by the first aspect is performed. The storage medium includes, but is not limited to, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

In a fifth aspect of the present application, a computer program product is provided. The computer program product can be a software installation package. When the software installation package is executed by the computing device, the data compression method provided by the first aspect is performed.

DRAWINGS

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them.

FIG. 1 is a schematic diagram of a system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of another system provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a data compression method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data compression device according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another data compression device according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a computing device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.

The terms first, second, etc. are used in this application to distinguish each object, but there is no logical or temporal dependency between each of the "first" and "second".

Throughout this specification, in a block storage scenario, a data block refers to a fixed-size data, and a data block size may be 4K Byte, 8K Byte, etc.; in a file storage scenario, a data block refers to a file, and its size is not fixed.

Throughout this specification, the data chunk includes a plurality of data blocks, and the size of a common data chunk can be 256K Byte, 4M Byte, and the like.

Throughout this specification, the data to be compressed may include one or more data blocks, which may belong to one or more data chunks.

Throughout this specification, cleaning the hash table refers to initializing the hash table, that is, resetting the data stored in the hash table to 0, to avoid false matches while the hash table is in use.

Throughout this specification, the historical data of the current data refers to the data located before the current data in the data to be compressed, that is, the data in the data to be compressed whose logical address is smaller than that of the current data. For example, for the a-th Byte in the data to be compressed, the first Byte to the (a-1)-th Byte in the data to be compressed are its historical data.

Throughout this specification, the window includes the (m+1) Byte of data used for a hash operation. The starting point of the window is the first Byte of the (m+1) Byte of data, the end point of the window is the last Byte of the (m+1) Byte of data, and the size of the window is (m+1). For example, if the data to be compressed includes the string "abcdefghijklmn" and m=3, the window first includes "abcd". If no historical data matches "abcd", the window is shifted to the right. The length of the shift is configurable. For example, if the window shifts right by 1 Byte each time, "bcde" is then used to generate a hash value.
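The window movement in this example can be sketched as follows. The generator name `windows` and the `step` parameter are hypothetical names chosen for illustration; `step` plays the role of the configurable shift length.

```python
def windows(data, m, step=1):
    """Yield (a, window) pairs; a is the 1-based index of the window's first byte."""
    a = 1
    while a + m <= len(data):          # the window must fit inside the data
        yield a, data[a - 1 : a + m]   # m+1 bytes starting at byte a
        a += step                      # shift the window to the right

all_windows = list(windows("abcdefghijklmn", m=3))
first_two = all_windows[:2]
```

For the 14-character string and m=3, the first two windows are "abcd" and "bcde", and a ranges from 1 to 11, that is, up to the data length minus m.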

Throughout this specification, a logical address refers to a virtual address assigned by an operating system. The starting logical address of any (m+1) Byte of data is the logical address of the first Byte of that (m+1) Byte of data.

Throughout this specification, the unit of the length of the value of the hash table is Byte.

Throughout this specification, the relative address of any 1 Byte of data refers to the offset of that Byte relative to the starting logical address of the data to be compressed in which it is located. The relative address of any (m+1) Byte of data is the relative address of the first Byte of that (m+1) Byte of data. If the last N bits of the starting logical address of the storage space where the (m+1) Byte of data is located are 0, and the size of the data to be compressed to which the (m+1) Byte of data belongs is 2^n Byte with n not greater than N, then the relative address of the first Byte of the (m+1) Byte of data is the last n bits of the logical address of that first Byte.
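The equivalence between the relative address and the last n bits of the logical address can be illustrated with a short sketch. The base address and the values of N and n are hypothetical and chosen only so that the last N bits of the base are 0.

```python
N = 32
n = 20
base = 0xFFFF_FFFF_0000_0000      # hypothetical start address; last N bits are 0
mask = (1 << n) - 1               # selects the last n bits of an address

def relative_address(logical_addr):
    """Relative address obtained by masking instead of subtracting the base."""
    return logical_addr & mask

# For every offset inside the 2^n Byte of data, masking and subtraction agree.
for offset in (0, 1, 12345, (1 << n) - 1):
    assert relative_address(base + offset) == (base + offset) - base == offset
```

The point of the sketch is that no subtraction of the base address is needed: a single mask (or simply truncating the address to the value width) yields the offset.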

Throughout this specification, an OR operation means that A OR B=1 as long as either A or B is not 0, and A OR B=0 if both A and B are 0.

Throughout this specification, the data from the a-th Byte to the (a+m)-th Byte includes the a-th Byte, the (a+m)-th Byte, and the data between them.

Throughout this specification, $\lceil Z \rceil$ refers to rounding Z up; for example, if Z=4, then $\lceil Z \rceil = 4$, and if Z=4.5, then $\lceil Z \rceil = 5$.

System applied in the embodiment of the present application

FIG. 1 is a schematic diagram of a system to which an embodiment of the present application is applied. The system includes a storage array comprising at least one storage controller and a plurality of storage devices, which are generally non-volatile storage devices, for example flash memory, HDDs, or SSDs. Each storage controller is connected to multiple storage devices. To save space on the storage devices in the storage array, the storage controller compresses the data to be stored and stores the resulting compression code in the storage devices.

FIG. 2 is a schematic diagram of another system to which an embodiment of the present application is applied. The system includes a first data processing device and a second data processing device. A data compression device is disposed in the first data processing device, and a data decompression device is disposed in the second data processing device. The data compression device compresses the data that needs to be transmitted to the second data processing device and then transmits the compression code to the second data processing device over the communication network. The data decompression device decompresses the compression code. Therefore, only the compression code needs to be transmitted over the communication network, which reduces communication traffic and speeds up data transmission.

The data compression method provided in FIG. 3 is performed when the storage controller or the data compression device is in operation.

The present application also provides a data compression method, a schematic flowchart of which is shown in FIG. 3. The following takes the storage controller as an example.

Step 202: Allocate a first storage space, where the first storage space is used to store data to be compressed, the last N bits of the starting logical address of the first storage space are 0, and N is an integer greater than 1.

Taking N as 32 and a storage controller running a 64-bit operating system as an example, the starting logical address of the first storage space may be 0xFFFF FFFF 0000 0000.
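The specification does not state how such an address is obtained. One common approach, shown here purely as an assumption, is to round a candidate address up to the next multiple of 2^N; the helper names `is_aligned` and `align_up` are hypothetical.

```python
def is_aligned(addr, N):
    """True if the last N bits of addr are all 0."""
    return addr & ((1 << N) - 1) == 0

def align_up(addr, N):
    """Smallest address >= addr whose last N bits are all 0."""
    mask = (1 << N) - 1
    return (addr + mask) & ~mask

example_ok = is_aligned(0xFFFF_FFFF_0000_0000, 32)   # the address from the text
rounded = align_up(0x12345, 16)                      # arbitrary unaligned address
```

In this sketch the example address from the text passes the check for N=32, and the unaligned address 0x12345 is rounded up to 0x20000 for N=16.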

Step 204: Allocate a second storage space, where the second storage space is used to store the compression code generated during compression, so that the compression code can subsequently be decompressed to restore the corresponding data.

Step 206: Allocate a third storage space, where the third storage space is used to store a hash table. The hash table may adopt a key-value structure. Each key is a hash value obtained by hashing the (m+1) Byte of data in a window, and the value corresponding to each key includes the last n bits of the starting logical address of the (m+1) Byte of data that generated the key. The value corresponding to each key needs to express the relative address of the (m+1) Byte of data that generated the key; since the last n bits of the starting logical address of the data to be compressed are 0, that relative address equals the last n bits of the starting logical address of the (m+1) Byte of data, so the value includes those n bits.

Since the minimum granularity for reading and writing the value of the hash table is the Byte, if n is not an integer multiple of 4, the value of the hash table may include, in addition to the last n bits of the logical address of the a-th Byte of the data to be compressed, one or more bits above those n bits. For example, in the case of n=14, since the value of the hash table has a length of at least 4 Byte, the value of the hash table needs to include the last 16 bits of the logical address of the a-th Byte of the data to be compressed.

A schematic structure of the hash table is shown in Table 1.

Key	Value
Hash value 1	Last n bits of logical address 1
Hash value 2	Last n bits of logical address 2
...	...
Hash value N	Last n bits of logical address N

Table 1

Assuming hash value 1 is the hash value corresponding to the a-th Byte to the (a+m)-th Byte of the data, logical address 1 is the logical address of the a-th Byte; the remaining rows of Table 1 are analogous.

Step 202, step 204, and step 206 may be performed in any order, or may be combined into the same step. The first storage space, the second storage space, and the third storage space may refer to a memory space.

Step 207: Acquire data to be compressed, and store the data to be compressed into the first storage space. The size of the data to be compressed is 2 ⁿ Byte, and n is not greater than N. Therefore, the end N bit of the starting logical address of the data to be compressed is 0.

Once the first storage space, the last N bits of whose starting logical address are 0, has been allocated in step 202, step 207 and its subsequent steps may be performed multiple times; the first storage space does not need to be allocated anew for each piece of data to be compressed. Since the sizes of the data to be compressed acquired in different executions of step 207 may differ, the 2^N Byte set in step 202 needs to be greater than or equal to the size of each piece of data to be compressed, to ensure that the last n bits of the starting logical address of the data to be compressed acquired in each subsequent step 207 are 0.

The data to be compressed may include a plurality of data blocks. Compared with data to be compressed that includes only one data block, storing a plurality of data blocks into the first storage space at a time avoids the performance loss caused by cleaning the hash table multiple times. At the same time, because the size of the data to be compressed in the first storage space is larger, the data in each window is more likely to find matching historical data, so the compression ratio can be improved.

The storage controller acquires data to be compressed from a client or other device, and the data to be compressed is data that needs to be stored in the storage device.

Step 208: Determine whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0. If it is, the branch starting at step 210 is executed; if not, the branch starting at step 222 is executed.

Common values for K include 16, 24, and 32. Expressing a logical address within data to be compressed that is larger than 2^16 Byte needs at least 3 Byte; for data larger than 2^24 Byte, at least 4 Byte; and for data larger than 2^32 Byte, at least 5 Byte. In the present application, K equal to 16 is used as an example. In actual use, the value of K can be chosen with reference to the cache size of the storage device. In this branch, a data size of 2^32 Byte and m=3 are taken as an example.

In step 210, the hash table is cleaned up.

In step 212, the length of the value of the hash table is set to be no less than (K/8+H) Byte.

H is an integer greater than 0; a common value is 2. In addition, (8 * the length of the value of the hash table) needs to be no less than n. For example, if n=32, the length of the value of the hash table is not less than 4, and if n=24, the length is not less than 3. In this branch, the length of the value of the hash table is exemplarily set to 4. In this branch, since the size of the data to be compressed is greater than 2^K Byte, the relative address of each Byte in the data to be compressed cannot be expressed in K/8 Byte, so the length of the value of the hash table needs to be increased.
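The two lower bounds on the value length in this branch can be combined in a short sketch. The function name `value_length_large` is hypothetical, and the sketch assumes K is a multiple of 8 and picks the smallest length that satisfies both bounds, whereas the text only requires the length to be no less than them.

```python
def value_length_large(K, H, n):
    """Smallest value length in Byte for the branch where size > 2^K."""
    by_k = K // 8 + H        # not less than (K/8 + H) Byte
    by_n = (n + 7) // 8      # 8 * length must be no less than n
    return max(by_k, by_n)

example = value_length_large(K=16, H=2, n=32)
```

With K=16, H=2, and n=32, both bounds give 4 Byte, which matches the length used as the example in this branch.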

The order of execution of steps 210 and 212 can be interchanged.

After step 208, the size of the data to be compressed has been determined. Therefore, in step 212 and the subsequent step 224, the length of the value of the hash table can be set according to the size of the data to be compressed. This avoids both the wasted storage space and the more difficult hash table operations caused by a value that is too long, and the insufficient value length caused by a value that is set too short.

Common values of m include: 2, 3, 4, 5, 6, or 7. The length of the key of the hash table is set according to the type of hash operation employed.

Step 210 may be performed at any time before step 216, as long as the hash table is cleaned before it is used in step 216.

Step 214: Generate a hash value according to the a-th Byte to the (a+m)-th Byte of the data to be compressed, where a is an integer greater than 0. When step 214 is executed for the first time, a is 1.

Step 216: Determine whether there is a key in the hash table that is the same as the hash value. If yes, perform steps 2161 to 2162. If not, go to step 2163.

Step 2161: Acquire the value of the row where the hash value is located, and update the value of that row according to the last n bits of the logical address of the a-th Byte of the data to be compressed.

If n is not an integer multiple of 4, the bits used to update the value of the row where the hash value is located may include, in addition to the last n bits of the logical address of the a-th Byte of the data to be compressed, one or more higher bits. That is, the value of the row of the hash table is updated by using the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed.

Step 2162: Determine whether the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte and the value of the row where the hash value is located is greater than 2^K.

That is, if the length of the value of the hash table is U Byte, where U is an integer not less than (K/8+H), it is determined whether the difference between the last 8U bits of the logical address of the a-th Byte and the value of the row where the hash value is located is greater than 2^K.

If the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte and the value of the row where the hash value is located is greater than or equal to 2^K, the window for generating the hash value is shifted to the right, that is, a=a+Q, where Q is an integer greater than 0, and the method returns to step 214. If the difference is not more than 2^K, step 218 is performed.

The value of the row where the hash value is located that is used in step 2162 is the value of that row before the update in step 2161 is performed.

Specifically, since the logical addresses of all Bytes of the data to be compressed are identical except for their last n bits, it is only necessary to determine whether the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte and the value of the row where the hash value is located is greater than 2^K.

Specifically, if it is determined in step 2162 that the difference is not greater than 2^K, then before step 218 is performed, the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed is obtained according to the value of the row where the hash value is located, for use in step 218.
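The judgment in step 2162 and the recovery of the starting logical address of the historical data can be sketched as follows. The function names are hypothetical. The text states the boundary case both as "greater than" and as "greater than or equal to" 2^K; this sketch follows the wording "not greater than 2^K" and should be read as one possible interpretation.

```python
def low_bits(addr: int, value_bytes: int) -> int:
    """Last (8 * value_bytes) bits of a logical address."""
    return addr & ((1 << (8 * value_bytes)) - 1)


def history_in_range(cur_addr: int, stored_value: int,
                     value_bytes: int, k: int) -> bool:
    """Step 2162: True when the distance to the historical data is not greater than 2^K."""
    distance = low_bits(cur_addr, value_bytes) - stored_value
    return distance <= (1 << k)


def history_address(cur_addr: int, stored_value: int, value_bytes: int) -> int:
    """Rebuild the full starting logical address of the historical data.

    The high bits are shared by every Byte of the data to be compressed,
    so they are taken from the current address and ORed with the stored value.
    """
    high = cur_addr & ~((1 << (8 * value_bytes)) - 1)
    return high | stored_value
```

For a current address of 0x FFFF FFFF 0000 07D0, a stored value of 0x0190, and a 4 Byte value, the distance is 0x640, which is within 2^16, and the recovered address is 0x FFFF FFFF 0000 0190.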

Step 2163: Add the hash value and the last n bits of the logical address of the a-th Byte of the data to be compressed to the hash table. The window for generating the hash value is shifted to the right, that is, a=a+W, where W is an integer greater than 0, and the method returns to step 214.

If n is not an integer multiple of 4, one or more bits higher than the last n bits may be included in addition to the last n bits of the logical address of the a-th Byte of the data to be compressed. That is, the hash value and the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed are added to a new row of the hash table.
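Steps 214, 216, 2161, and 2163 can be sketched together as a single probe of the hash table. This is an illustrative sketch only: the function name `probe_and_update` is hypothetical, a Python dict stands in for the hash table, and `zlib.crc32` is used as a stand-in hash because the application does not fix a particular hash operation.

```python
import zlib


def probe_and_update(table: dict, data: bytes, base_addr: int,
                     a: int, m: int, value_bytes: int):
    """Hash the a-th to (a+m)-th Byte (1-based) and update the table.

    Returns the value stored before the update (step 2161),
    or None if the key was absent and a new row was added (step 2163).
    """
    window = data[a - 1:a + m]           # m + 1 Bytes
    key = zlib.crc32(window)             # step 214, stand-in hash
    addr = base_addr + (a - 1)           # logical address of the a-th Byte
    low = addr & ((1 << (8 * value_bytes)) - 1)
    previous = table.get(key)            # step 216
    table[key] = low                     # step 2161 or step 2163
    return previous
```

Because the starting logical address has its last N bits set to 0, the value written is obtained by truncating the address and no subtraction of the starting address is needed.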

The improvement of the data compression method provided by the present application relative to the existing data compression method is analyzed in detail below.

Table 2

As shown in Table 2, the current logical address is the starting logical address of the (m+1) Byte data currently hashed.

If the hash value corresponding to the a-th Byte to the (a+m)-th Byte already exists as the key of a row in the hash table, the value of that row needs to be read and then updated with the relative address of the a-th Byte. That is, one read of the hash table and one write of the hash table need to be performed.

In the prior art, taking a value of 400 recorded in the row where the hash value is located as an example, the complete starting logical address of the matched historical data must be obtained in order to match the (m+1) Byte data currently hashed against that historical data. Therefore, 400 and 0x FFFF FFFF 0000 0001 need to be added to obtain 0x FFFF FFFF 0000 0191. 0x FFFF FFFF 0000 0191 is the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed.

In the prior art, in order to update the value of the row where the hash value is located, the relative address of the (m+1) Byte data currently hashed needs to be stored in the value of the matched row. Therefore, 0x FFFF FFFF 0000 0001 is subtracted from 0x FFFF FFFF 0000 07D1 to obtain 2000, and the value of the row where the hash value is located is updated with 2000.

It can be seen that, in the prior art, if the hash value corresponding to the (m+1) Byte data currently performing the hash operation already exists in the hash table, an addition operation and a subtraction operation are required.

Correspondingly, in the compression method provided by the present application, taking a value of 0x0190 recorded in the row where the hash value is located as an example, in order to obtain the complete starting logical address of the matched historical data, 0x0190 is ORed with 0x FFFF FFFF 0000 0000 to obtain 0x FFFF FFFF 0000 0190. 0x FFFF FFFF 0000 0190 is the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed.

In addition, in the compression method provided by the present application, in order to update the value of the row where the hash value is located, the value of that row needs to be updated with the relative address of the (m+1) Byte data currently hashed. Since the last N bits of the starting logical address of the data to be compressed are 0, the value of the matched row is directly updated with 0x07D0.

It can be seen that, in the compression method provided by the present application, if the hash value corresponding to the (m+1) Byte data currently hashed already exists in the hash table, only one OR operation is required. Compared with the prior art, which requires one addition operation and one subtraction operation, this reduces the time required for the operation and improves the compression speed.
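The arithmetic in this comparison can be checked with a short sketch that uses the addresses from the examples above. The function names are hypothetical. The 16-bit mask models keeping only the last bits of the address, which in hardware would typically be a plain truncation rather than a separate operation.

```python
BASE_PRIOR = 0xFFFFFFFF00000001      # starting address in the prior-art example
BASE_ALIGNED = 0xFFFFFFFF00000000    # starting address with the last bits set to 0
MASK16 = 0xFFFF                      # keeps the last 16 bits


def prior_art_read(value: int) -> int:
    """Prior art: one addition recovers the address of the historical data."""
    return BASE_PRIOR + value


def prior_art_update(cur_addr: int) -> int:
    """Prior art: one subtraction yields the relative address to store."""
    return cur_addr - BASE_PRIOR


def aligned_read(value: int) -> int:
    """Present application: one OR recovers the address of the historical data."""
    return BASE_ALIGNED | value


def aligned_update(cur_addr: int) -> int:
    """Present application: the last bits of the address are stored directly."""
    return cur_addr & MASK16
```

Here `prior_art_read(400)` gives 0x FFFF FFFF 0000 0191 and `prior_art_update(0xFFFFFFFF000007D1)` gives 2000, while `aligned_read(0x0190)` gives 0x FFFF FFFF 0000 0190 and `aligned_update(0xFFFFFFFF000007D0)` gives 0x07D0.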

If the hash value corresponding to the a-th Byte to the (a+m)-th Byte does not exist as the key of any row in the hash table, that is, the hash value cannot match the key of any row, the hash value and the relative address of the a-th Byte of the data to be compressed are added to the hash table. That is, one write of the hash table needs to be performed.

In the scenario of writing the hash table, in the prior art, 0x FFFF FFFF 0000 0001 needs to be subtracted from 0x FFFF FFFF 0000 07D1 to obtain 2000. The hash value corresponding to the a-th Byte to the (a+m)-th Byte and 2000 are then stored in the hash table.

Correspondingly, in the compression method provided by the present application, taking a starting logical address of 0x FFFF FFFF 0000 07D0 for the a-th Byte to the (a+m)-th Byte as an example, the hash value corresponding to the a-th Byte to the (a+m)-th Byte and 0x07D0 are stored in the hash table.

It can be seen that, in the compression method provided by the present application, if the hash value corresponding to the (m+1) Byte data currently hashed does not exist as the key of any row in the hash table, the last n bits of the starting logical address of the a-th Byte to the (a+m)-th Byte are written directly to the hash table. Compared with the prior art, which requires one subtraction operation and one write operation, this reduces the time required for operating the hash table and improves the compression speed.

Step 218: Match the historical data that is the same as the (m+1) Byte data currently hashed against the (m+1) Byte data currently hashed, Byte by Byte to the right, generate the compression code corresponding to the current match according to the matching result, and store the compression code in the third storage space.

Specifically, after the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed is obtained, the historical data of the data to be compressed is read according to that starting logical address and is matched Byte by Byte against the a-th Byte and the data after the a-th Byte until no further match is possible.

The compression code includes: the length of the match between the historical data and the a-th Byte together with the data after it, the relative address of the historical data, and the data between the last Byte recorded by the previous compression code and the first Byte of the current match.

For example, the data to be compressed is abcdefghabcdef. Assume that the relative address of the first a is 100, that the current window contains the ninth to the twelfth characters, and that E=1. The hash table is then as shown in Table 3.

Key	Value
Hash value corresponding to abcd	100
Hash value corresponding to bcde	101
Hash value corresponding to cdef	102
Hash value corresponding to defg	103
Hash value corresponding to efgh	104
Hash value corresponding to fgha	105
Hash value corresponding to ghab	106
Hash value corresponding to habc	107

Table 3

After the hash value corresponding to the ninth to the twelfth characters (that is, abcd) is obtained, the key of the first row in the hash table is matched, so the value of the first row is read and the first character is located. The ninth character is then compared with the first character, the tenth character with the second character, and so on to the right until no further match is possible. In this example, the 9th to 14th characters are the same as the 1st to 6th characters. The resulting compression code therefore includes: abcdefgh, 100, 6. Here, abcdefgh is the data between the last Byte recorded by the previous compression code and the first Byte of the current match, 100 is the relative address of the historical data matched by the first Byte after h, and 6 is the match length. To restore the data to be compressed from the compression code, abcdefgh is extracted first, then the first 6 characters of abcdefgh, that is, abcdef, are obtained according to 100 and 6, and abcdef is appended to abcdefgh, which restores the data to be compressed, abcdefghabcdef.
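The example above can be reproduced with a toy encoder and decoder. This is a sketch under stated assumptions rather than the method of the application: the names `toy_compress` and `toy_decompress` are hypothetical, the window bytes themselves are used as the key in place of a hash value, relative addresses start at 100 as in the example, and after a match the window skips past the matched Bytes.

```python
def toy_compress(data: bytes, m: int = 3, first_addr: int = 100):
    """Return (codes, tail); each code is (literals, history address, match length)."""
    table = {}
    codes = []
    lit_start = 0                          # first Byte not yet covered by a code
    i = 0
    while i + m < len(data):               # a full (m+1)-Byte window fits
        key = data[i:i + m + 1]            # stand-in for the hash value
        prev = table.get(key)
        table[key] = first_addr + i        # add or update the row
        if prev is None:
            i += 1                         # no match: shift the window right
            continue
        h = prev - first_addr              # index of the historical data
        length = 0
        while i + length < len(data) and data[h + length] == data[i + length]:
            length += 1                    # match to the right, Byte by Byte
        codes.append((data[lit_start:i], prev, length))
        i += length
        lit_start = i
    return codes, data[lit_start:]


def toy_decompress(codes, tail: bytes, first_addr: int = 100) -> bytes:
    """Rebuild the original data from the codes and the unmatched tail."""
    out = bytearray()
    for literals, addr, length in codes:
        out += literals
        start = addr - first_addr
        for j in range(length):            # Byte-wise copy handles overlaps
            out.append(out[start + j])
    out += tail
    return bytes(out)
```

For the input abcdefghabcdef, the encoder produces the single code (abcdefgh, 100, 6) with an empty tail, and the decoder restores the input from it.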

After step 2162 and step 2163, the window for generating the hash value may be shifted to the right. Therefore, there may be data that was not recorded in the compression code generated in the previous execution of step 218 and that is located before the start of the window in the current execution of step 218. This data needs to be recorded in the compression code generated in the current execution of step 218.

Step 220: Determine whether all of the data to be compressed has been compressed, that is, whether the (a+m)-th Byte is the last Byte of the data to be compressed. If so, the compression coding ends, and the compression code in the third storage space is stored in the storage device. If not, the window for generating the hash value is shifted to the right, that is, a=a+E, where E is an integer greater than 0, and the method returns to step 214.

Q, W, and E are the lengths of the right shift of the window, that is, how many Bytes the window slides to the right.

The cache of the storage control device is limited. If the distance between the (m+1) Byte data currently hashed in step 218 and the matched historical data is too large, the matched historical data and the (m+1) Byte data currently hashed cannot be held in the cache at the same time, which causes the cache to be refreshed and reduces the compression speed. To avoid this, step 2162 determines whether the difference between the logical address of the historical data and the logical address of the (m+1) Byte data currently hashed is greater than 2^K, where 2^K Byte can be the size of the storage controller's cache. If the matched historical data and the (m+1) Byte data currently hashed are separated by 2^K Bytes or more, step 218 is not performed for the current match. If they are separated by less than 2^K Bytes, both can be held in the cache at the same time, so step 218 is performed.

It should be noted that step 2162 is an optional step, that is, after step 2161, step 218 can be performed directly without performing step 2162.

In this branch, a size of 2^16 Bytes for the data to be compressed and m = 3 are taken as an example.

In step 222, the hash table is cleaned up.

In step 224, the length of the value of the hash table is set to be no less than K/8 Byte.

The order of execution of steps 222 and 224 can be interchanged.

Since the size of the data to be compressed is not more than 2^K Byte, the length of the value of the hash table is not less than K/8 Byte. If K is not a multiple of 8, then in step 224, the length of the value of the hash table is set to be no less than

Since the size of the data to be compressed is 2^16 Bytes, a value length of 2 bytes is required to represent the relative address of any Byte of the data to be compressed.

Step 222 may be performed at any time before step 228, as long as the hash table is cleaned before it is used in step 228.

Step 226: Generate a hash value according to the b-th Byte to the (b+m)-th Byte of the data to be compressed, where b is an integer greater than 0. When step 226 is executed for the first time, b is 1.

Step 228, determining whether the hash value can match any key of the hash table. If it can match, step 2301 is performed, and if it cannot be matched, step 2302 is performed.

In step 2301, the value of the row whose key matches the hash value is obtained, and the value of that row is updated according to the last n bits of the logical address of the b-th Byte of the data to be compressed.

If n is not an integer multiple of 4, the bits used to update the value of the matched row may include, in addition to the last n bits of the logical address of the b-th Byte of the data to be compressed, one or more higher bits. That is, the value of the row of the hash table is updated by using the last (8 * the length of the value of the hash table) bits of the logical address of the b-th Byte of the data to be compressed.

In step 2301, the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed also needs to be obtained according to the value of the matched row, for use in step 232.

Step 2302: Add the hash value and the last n bits of the logical address of the b-th Byte of the data to be compressed to the hash table. The window for generating the hash value is shifted to the right, that is, b=b+R, where R is an integer greater than 0, and the method returns to step 226.

If n is not an integer multiple of 4, one or more bits higher than the last n bits may be included in addition to the last n bits of the logical address of the b-th Byte of the data to be compressed. That is, the hash value and the last (8 * the length of the value of the hash table) bits of the logical address of the b-th Byte of the data to be compressed are added to a new row of the hash table.

Referring to the description corresponding to the foregoing Table 2, in the compression method provided by the present application, if the hash value corresponding to the (m+1) Byte data currently hashed matches the key of a row in the hash table, only one OR operation is required. Compared with the prior art, which requires one addition operation and one subtraction operation, this reduces the time required for the operation and improves the compression speed.

Meanwhile, in the compression method provided by the present application, if the hash value corresponding to the (m+1) Byte data currently hashed cannot match the key of any row in the hash table, the last (8 * the length of the value of the hash table) bits of the starting logical address of the b-th Byte to the (b+m)-th Byte are written directly to the hash table. Compared with the prior art, which requires one subtraction operation and one write operation, this reduces the time required for the operation and improves the compression speed.

Step 232: Match the historical data that is the same as the (m+1) Byte data currently hashed against the (m+1) Byte data currently hashed, Byte by Byte to the right, generate the compression code for the current match according to the matching result, and store the compression code in the third storage space.

For details of how the compression code is generated in step 232, refer to the description of step 218 above.

Step 234: Determine whether all of the data to be compressed has been compressed, that is, whether the (b+m)-th Byte is the last Byte of the data to be compressed. If so, the compression coding ends, and the compression code in the third storage space is stored in the storage device. If not, the window for generating the hash value is shifted to the right, that is, b=b+T, where T is an integer greater than 0, and the method returns to step 226.

R and T are the lengths of the right shift of the window, that is, how many Bytes of data the window slides to the right.

By determining in step 208 whether the size of the data to be compressed is greater than 2^K Byte, the branch from step 222 to step 234 can rely on the fact that the data to be compressed is not larger than 2^K Byte. The difference between the logical address of any Byte of historical data and the logical address of the (m+1) Byte data currently hashed is therefore never more than 2^K, so a judgment similar to step 2162 is unnecessary. This shortens the compression process and further improves the compression speed.

It should be noted that step 208 is an optional step.

If step 208 is not used, step 210, step 214, and the steps following step 214 are performed directly. In this case, since the size of the data to be compressed has not been determined in advance, before each operation on the hash table in step 2161 or step 2163 it is necessary to determine, according to the size of the data to be compressed, the length of the logical address that needs to be written into the hash table.

For example, the size of the data to be compressed is 2^16 Bytes, and the operating system used by the storage controller is a 64-bit system. Therefore, before step 2161 or step 2163, it must be determined according to the size of the data to be compressed that the hash table is to be updated with the last 16 bits of the logical address of the a-th Byte.

By using step 208, the size of the data to be compressed does not need to be determined for each operation on the hash table, which further improves the compression speed.
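The difference between the two cases can be sketched as follows. All names are hypothetical and a Python dict stands in for the hash table. The first function derives the number of address bits from the data size on every operation, as happens when step 208 is not used; the second derives it once, which is the effect attributed to step 208.

```python
def mask_for_size(data_size: int) -> int:
    """Mask that keeps just enough low address bits to address data_size Bytes."""
    n = max(data_size - 1, 0).bit_length()
    return (1 << n) - 1


def update_without_step_208(table: dict, key: int, addr: int, data_size: int) -> None:
    """The mask is derived from the data size on every hash-table operation."""
    table[key] = addr & mask_for_size(data_size)


def make_updater(data_size: int):
    """The mask is derived once, then reused for every hash-table operation."""
    mask = mask_for_size(data_size)

    def update(table: dict, key: int, addr: int) -> None:
        table[key] = addr & mask

    return update
```

For data of 2^16 Bytes on a 64-bit system, both variants store the last 16 bits of the address, for example 0x07D0 for the address 0x FFFF FFFF 0000 07D0.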

As shown in FIG. 4, the present application further provides a data compression device 400, which may be the storage controller in FIG. 1 or the data compression device in FIG. 2. The data compression device 400 includes a communication interface 402 and a processing chip 404, and the communication interface 402 and the processing chip 404 establish a communication connection. When the data compression device 400 is in operation, the data compression method corresponding to FIG. 3 is executed.

The communication interface 402 is for communicating with an external device, such as a client writing data to be compressed, a storage device in a storage array, a network device in a communication network, and the like. Communication interface 402 can be an input/output interface of data compression device 400.

The communication interface 402 is specifically configured to perform the step of acquiring data to be compressed in step 207, and the step of storing the compression code in the third storage space into the storage device after step 220 and step 234. If the data compression device 400 is the data compression device of FIG. 2, then after step 220 and step 234, the communication interface 402 is configured to send the compression code in the third storage space to the communication network.

The processing chip 404 is configured to perform step 202 to step 206, and perform the step of storing the data to be compressed into the first storage space in step 207, and is further configured to perform step 208 to step 220 and step 208 to step 234.

The processing chip 404 can be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

As shown in FIG. 5, the processing chip 404 can also be implemented by a processor, a storage device, and a logic chip, where the logic chip can be implemented by a PLD or an ASIC. When the processing chip 404 is in operation, the processor and the logic chip each perform a part of the functions, and the functions can be divided between the two in various ways. Exemplarily, the processor reads the code in the memory to perform steps 202 to 207. After the first storage space, the second storage space, and the third storage space have all been allocated in the memory, and the data to be compressed has been stored in the first storage space, the subsequent steps are completed by the logic chip.

By setting the last N bits of the address of the storage space that stores the data to be compressed to 0, the data compression device provided above makes read and write operations on the hash table simpler during compression of the data to be compressed, which increases the compression speed.
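One way to obtain a starting address whose last N bits are 0 is to round a candidate address up to the next multiple of 2^N. The sketch below illustrates only that arithmetic. The function name `align_up` is hypothetical, and how the device actually allocates the storage space is not specified here, so rounding up is an assumption.

```python
def align_up(addr: int, n_bits: int) -> int:
    """Smallest address >= addr whose last n_bits bits are all 0."""
    mask = (1 << n_bits) - 1
    return (addr + mask) & ~mask
```

For example, with N = 16, the candidate address 0x12345 is rounded up to 0x20000, and an address that is already aligned is returned unchanged.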

FIG. 6 shows a computing device 600 provided by the present application. The computing device 600 may be the storage controller in FIG. 1 or the data compression device in FIG. 2. Computing device 600 includes a processor 602 and a memory 604, and may also include a bus 606 and a communication interface 608.

Communication interface 608 is used to communicate with external devices, such as clients that write data to be compressed, storage devices in a storage array, network devices in a communication network, and the like. Communication interface 608 can be an input/output interface of computing device 600.

The processor 602, the memory 604, and the communication interface 608 can implement communication connections with each other through the bus 606, and can also implement communication by other means such as wireless transmission.

The processor 602 can be a central processing unit (CPU).

The memory 604 may include a volatile memory, such as a random-access memory (RAM).

Optionally, the memory 604 may further include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD, or an SSD. The memory 604 may also include a combination of the above types of memory.

When the computing device 600 is the storage controller of FIG. 1, since the storage controller is connected to a plurality of storage devices in the storage array, the memory 604 may also not include non-volatile memory, in which case the non-volatile memory of the computing device 600 is provided by a storage device of the storage array.

When the computing device 600 is the data compression device of FIG. 2, since it can send the compression code directly to the communication network, it is not necessary to store the compression code in non-volatile memory, so the memory 604 may not include non-volatile memory.

When the technical solution provided by the present application is implemented by software, the program code for implementing the data compression method provided in FIG. 3 of the present application is stored in the memory 604 and executed by the processor 602.

By setting the last N bits of the address of the storage space that stores the data to be compressed to 0, the computing device provided above makes read and write operations on the hash table simpler during compression of the data to be compressed, which improves the compression speed.

The descriptions of the above embodiments each have their own emphasis. For parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

The methods described in connection with the present disclosure can be implemented by a processor executing software instructions. The software instructions can be composed of corresponding software modules, which can be stored in RAM, flash memory, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a hard disk, an SSD, an optical disk, or any other form of storage medium known in the art.

Those skilled in the art will appreciate that in one or more of the above examples, the functions described herein may be implemented in hardware or software. When implemented in software, the functions may be stored in a computer readable medium or transmitted as one or more instructions or code on a computer readable medium. A storage medium may be any available media that can be accessed by a general purpose or special purpose computer.

The present application has been described in detail above with reference to specific embodiments. It is to be understood that the foregoing is only a specific embodiment of the present application and is not intended to limit the scope of the present application. Any modifications, improvements, and the like made on the basis of the technical solutions of the present application are included in the scope of protection of the present application.

Claims

A data compression method, comprising:

Allocating a storage space, where the last N bits of the starting logical address of the storage space are 0, and N is an integer greater than 1;

Storing data to be compressed in the storage space, where the size of the data to be compressed is 2^n Byte, and n is not greater than N;

Performing a hash operation on the a-th Byte to the (a+m)-th Byte of the data to be compressed to generate a hash value, where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed;

Determining whether a key that is the same as the hash value exists in a hash table, where a key of the hash table is a hash value generated by performing the hash operation on (m+1) Byte of historical data of the (a+m)-th Byte, and a value of the hash table includes the last n bits of the starting logical address of the (m+1) Byte of historical data of the (a+m)-th Byte;

If a key that is the same as the hash value exists in the hash table, updating the value corresponding to the hash value in the hash table according to the last n bits of the logical address of the a-th Byte;

If no key that is the same as the hash value exists in the hash table, adding the hash value and the last n bits of the logical address of the a-th Byte to the hash table.
The data compression method according to claim 1, wherein the data to be compressed comprises a plurality of data blocks.
The data compression method according to claim 1 or 2, wherein before the determining whether a key that is the same as the hash value exists in the hash table, the method further includes:

Determining whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0;

If the size of the data to be compressed is greater than 2^K Byte, setting the length of the value of the hash table to be no less than (K/8+1) Byte;

If the size of the data to be compressed is less than or equal to 2^K Byte, setting the length of the value of the hash table to be no less than K/8 Byte.
The data compression method according to claim 3, wherein if the size of the data to be compressed is greater than 2^K Byte, after the updating of the value corresponding to the hash value in the hash table according to the last n bits of the logical address of the a-th Byte, the method further includes:

If the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte and the value corresponding to the hash value in the hash table is less than 2^K, matching the a-th Byte and the data after the a-th Byte against the historical data indicated by the value corresponding to the hash value, and generating a compression code according to the matching result;

If the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is not less than 2^K, not matching the a-th Byte and the data after the a-th Byte against the historical data indicated by the value corresponding to the hash value.
The data compression method according to claim 4, wherein after the generating of the compression code according to the matching result, the method further comprises:

Determining whether the (a+m)-th Byte is the last Byte of the data to be compressed, and if so, ending the encoding of the data to be compressed, and if not, moving the window of the hash operation to the right.
A data compression device, comprising: a communication interface and a processing chip, wherein the communication interface is connected to the processing chip;

The processing chip is configured to allocate a storage space, where the last N bits of the starting logical address of the storage space are 0, and N is an integer greater than 1;

The communication interface is configured to acquire data to be compressed and store the data to be compressed into the storage space, where the size of the data to be compressed is 2^n Byte, and n is not greater than N;

The processing chip is further configured to: perform a hash operation on the a-th Byte to the (a+m)-th Byte of the data to be compressed to generate a hash value, where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed; determine whether a key that is the same as the hash value exists in a hash table, where a key of the hash table is a hash value generated by performing the hash operation on (m+1) Byte of historical data of the (a+m)-th Byte, and a value of the hash table includes the last n bits of the starting logical address of the (m+1) Byte of historical data of the (a+m)-th Byte; if a key that is the same as the hash value exists in the hash table, update the value corresponding to the hash value in the hash table according to the last n bits of the logical address of the a-th Byte; and if no key that is the same as the hash value exists in the hash table, add the hash value and the last n bits of the logical address of the a-th Byte to the hash table.
The device of claim 6, wherein the data to be compressed comprises a plurality of data blocks.
The device according to claim 6 or 7, wherein the processing chip is further configured to, before determining whether the hash table contains a key that is the same as the hash value, determine whether the size of the data to be compressed is greater than 2^K bytes, where K is an integer greater than 0; if the size of the data to be compressed is greater than 2^K bytes, set the length of the value of the hash table to be no less than (K/8+1) bytes; and if the size of the data to be compressed is less than or equal to 2^K bytes, set the length of the value of the hash table to be no less than K/8 bytes.
The device according to claim 8, wherein, if the size of the data to be compressed is greater than 2^K bytes, the processing chip is configured to update the value corresponding to the hash value in the hash table according to the last (8 × the length of the value of the hash table) bits of the logical address of the a-th byte, and is further configured to: if the difference between the last (8 × the length of the value of the hash table) bits of the logical address of the a-th byte of the data to be compressed and the value corresponding to the hash value in the hash table is less than 2^K, match the a-th byte and the data after the a-th byte against the historical data indicated by the value corresponding to the hash value, and generate the compression encoding according to the matching result; and if the difference between the last n bits of the logical address of the a-th byte of the data to be compressed and the value corresponding to the hash value in the hash table is not less than 2^K, not match the a-th byte and the data after the a-th byte against the historical data indicated by the value corresponding to the hash value.
The device according to claim 9, wherein the processing chip is further configured to, after generating the compression encoding:
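The rule for sizing the hash-table value can be written as a short sketch. The function name `value_length_bytes` is hypothetical, and the sketch assumes that K is a multiple of 8 so that K/8 is a whole number of bytes; the claim itself does not state how a fractional K/8 would be handled. It returns the minimum length that the claim permits.

```python
def value_length_bytes(data_size: int, k: int) -> int:
    """Minimum hash-table value length in bytes for a given data size.

    Assumption: k is a positive multiple of 8, so k // 8 is exact.
    """
    if k <= 0 or k % 8 != 0:
        raise ValueError("this sketch assumes K is a positive multiple of 8")
    if data_size > (1 << k):
        return k // 8 + 1  # larger data needs one extra byte of address
    return k // 8
```

With K = 16, data of exactly 2^16 bytes needs values of at least 2 bytes, while anything larger needs at least 3 bytes.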
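The distance check in this claim can be sketched as follows. The function name `should_match` is hypothetical. The claim speaks only of "the difference" between the truncated address and the stored value; the modular subtraction used here is an assumption added so that a wrapped truncated address still yields a non-negative distance.

```python
def should_match(cur_addr: int, stored_value: int,
                 value_len_bytes: int, k: int) -> bool:
    """Decide whether to attempt a match against the stored history position.

    cur_addr is the logical address of the a-th byte; stored_value is the
    truncated address previously saved in the hash table for the same key.
    """
    bits = 8 * value_len_bytes
    mask = (1 << bits) - 1
    cur = cur_addr & mask  # last (8 * value length) bits of the address
    distance = (cur - stored_value) & mask  # modular difference (assumption)
    return distance < (1 << k)
```

For instance, with K = 16 and 3-byte values, a history position 10000 bytes back is matched, while one 70000 bytes back is skipped because it lies outside the 2^16-byte range.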

determine whether the (a+m)-th byte is the last byte of the data to be compressed; if so, end the encoding of the data to be compressed; and if not, move the window of the hash operation to the right.
A computing device, comprising a processor and a memory, wherein the processor establishes a communication connection with the memory;

When running, the processor reads a program in the memory to perform the method of any one of claims 1 to 5.