WO2018058604A1 - Data compression method and device, and computation device - Google Patents

Data compression method and device, and computation device

Info

Publication number
WO2018058604A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
byte
value
hash
compressed
Prior art date
Application number
PCT/CN2016/101259
Other languages
French (fr)
Chinese (zh)
Inventor
张希舟
张剑
牛进保
全绍晖
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2016/101259 priority Critical patent/WO2018058604A1/en
Priority to CN201680089676.XA priority patent/CN110419036B/en
Publication of WO2018058604A1 publication Critical patent/WO2018058604A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a data compression method, and a data compression device corresponding to the method, and a computing device for executing the data compression method.
  • LZ compression has a large number of compression coding branches, such as LZ4, LZ5, LZO, LZH, etc.
  • the common feature of these compression codes is that historical data is used as a dictionary when encoding current data.
  • LZ compression performs data compression at the granularity of bytes/strings. For example, suppose the data block to be compressed is 4M Byte and the window size is 4 Byte. When the 4 Byte of data in the current window is compressed, it is matched against the historical data of the data block to be compressed.
  • if the historical data contains data identical to the 4 Byte data, the code corresponding to the 4 Byte data only needs to record the position information and the length of the historical data, so that during decompression the 4 Byte data can be recovered from that code.
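The matching idea described above can be illustrated with a minimal Python sketch. It is not the method claimed in this application: it uses a plain substring search instead of a hash table, and the function name `find_history_match` and the sample data are hypothetical.

```python
def find_history_match(data: bytes, pos: int, window: int = 4):
    """Return (history_position, match_length) for the window at pos, or None."""
    current = data[pos:pos + window]
    if len(current) < window:
        return None  # not enough bytes left to fill the window
    # Search only the historical data, i.e. bytes located before pos.
    start = data.find(current, 0, pos)
    if start < 0:
        return None
    length = window
    # Extend the match to the right, byte by byte, until it fails.
    while pos + length < len(data) and data[start + length] == data[pos + length]:
        length += 1
    return start, length


data = b"abcdefghabcdefxy"
print(find_history_match(data, 8))  # (0, 6)
```

The returned pair is the position information and length that the code for the current window would need to record.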
  • the compression speed of current LZ compression still needs to be improved.
  • the present application provides a data compression method to increase the speed of data compression.
  • a first aspect of the present application provides a data compression method performed by a storage controller or a data compression device, including: first allocating a storage space, where the last N bits of the starting logical address of the storage space are 0, and N is an integer greater than 1.
  • N is related to the size of the data to be compressed that is subsequently processed.
  • the data to be compressed is stored in the storage space. The size of the data to be compressed is 2^n Byte, and n is not greater than N, so the last N bits of the starting logical address of the data to be compressed are 0. Because the size of the data to be compressed is not more than 2^N Byte, the valid part of its starting logical address is 0; the part of the starting logical address above the last n bits is an invalid part, because that part of the logical address is the same for every Byte of the data to be compressed.
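One possible way to obtain a starting address whose last N bits are 0 is to round an arbitrary address up to the next multiple of 2^N. The sketch below only illustrates that arithmetic on integers; the helper names `align_up` and `is_aligned` are hypothetical, and the application does not say how the allocation is actually performed.

```python
def align_up(address: int, n_bits: int) -> int:
    """Round address up to the next value whose last n_bits bits are 0."""
    mask = (1 << n_bits) - 1
    return (address + mask) & ~mask


def is_aligned(address: int, n_bits: int) -> bool:
    """True if the last n_bits bits of address are all 0."""
    return address & ((1 << n_bits) - 1) == 0


start = align_up(0x12345, 16)
print(hex(start), is_aligned(start, 16))  # 0x20000 True
```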
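  • a hash operation is performed on the a-th Byte data to the (a+m)-th Byte data of the data to be compressed to generate a hash value, where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window in which the hash operation is performed.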
  • the key of the hash table is a hash value generated by hashing (m+1) Byte of historical data of the (a+m)-th Byte data.
  • the value of the hash table includes the last n bits of the starting logical address of that (m+1) Byte of historical data. It is determined whether the hash table stores a key identical to the hash value, that is, the hash value is matched against the keys in the hash table one by one. If there is a matching key, the a-th Byte data to the (a+m)-th Byte data has appeared in the historical data of the data to be compressed. If there is no matching key, the a-th Byte data to the (a+m)-th Byte data appears in the data to be compressed for the first time.
  • if a key identical to the hash value exists in the hash table, the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte data. If no such key exists, the hash value and the last n bits of the logical address of the a-th Byte data of the data to be compressed are added to the hash table.
  • in both cases, the starting logical address of the a-th Byte data to the (a+m)-th Byte data needs to be recorded in the hash table.
  • the content of the value used to replace or join the hash table includes, in addition to the data to be compressed.
  • the data compression method provided above simplifies the operation of the hash table in the data compression process by setting the end N bit of the starting logical address of the storage space for storing the data to be compressed to 0, thereby improving the data compression speed.
  • the data to be compressed includes a plurality of data blocks.
  • multiple data blocks are stored in the storage space at the same time and compressed together, which improves the compression ratio compared with compressing only a single data block at a time.
  • before determining whether a key identical to the hash value exists in the hash table, the method further includes: determining whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0. If the size of the data to be compressed is greater than 2^K Byte, the length of the value of the hash table is set to not less than (K/8+1) Byte; that is, if the size of the data to be compressed is greater than 2^K Byte, at least (K/8+1) Byte is needed to express the relative address of the data to be compressed.
  • if the size of the data to be compressed is not more than 2^K Byte, the length of the value of the hash table is set to not less than K/8 Byte; that is, the relative address of the data to be compressed can be expressed in K/8 Byte.
  • determining the length of the logical address that needs to be written to the value of the hash table once in advance, rather than each time the hash table is updated or written, increases the compression speed.
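The length rule above can be sketched as a small helper. This is only an illustration under the assumption that K is a positive multiple of 8; the function name `min_value_length` is hypothetical and it returns the lower bound stated in the text, not a value the application prescribes.

```python
def min_value_length(data_size: int, k: int) -> int:
    """Minimum hash-table value length in Byte for a given data size and K."""
    if k <= 0 or k % 8 != 0:
        raise ValueError("this sketch assumes K is a positive multiple of 8")
    base = k // 8
    # Data larger than 2^K Byte needs at least one extra Byte for the relative address.
    return base + 1 if data_size > (1 << k) else base


print(min_value_length(1 << 16, 16))        # 2
print(min_value_length((1 << 16) + 1, 16))  # 3
```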
  • the method further includes: if the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data of the data to be compressed and the value corresponding to the hash value in the hash table is less than 2^K, the a-th Byte data and the data after it in the data to be compressed are matched against the historical data of the a-th Byte data, and a compression code is generated according to the matching result.
  • the data compression algorithm includes the following setting: when the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data of the data to be compressed and the value corresponding to the hash value in the hash table is not less than 2^K, the match for the current window is abandoned.
  • with this setting, if the size of the data to be compressed is larger than 2^K Byte, the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data and the value corresponding to the hash value in the hash table may be not less than 2^K.
  • if the size of the data to be compressed is not more than 2^K Byte, that difference must be less than 2^K. Therefore, only when the size of the data to be compressed is greater than 2^K Byte is it necessary to judge whether the difference is not less than 2^K.
  • the method further includes: determining whether the (a+m)-th Byte data is the last 1 Byte of the data to be compressed; if yes, the encoding of the data to be compressed ends, and if not, the window of the hash operation is moved to the right.
  • a data compression device comprising: a communication interface and a processing chip, the communication interface being connected to the processing chip.
  • the communication interface is used for communication with an external device to obtain data to be compressed.
  • the processing chip is configured to allocate a storage space, where the last N bits of the starting logical address of the storage space are 0, and N is an integer greater than 1.
  • the communication interface is configured to acquire the data to be compressed and store it into the storage space; the size of the data to be compressed is 2^n Byte, and n is not greater than N. The processing chip is further configured to perform a hash operation on the a-th Byte data to the (a+m)-th Byte data of the data to be compressed to generate a hash value,
  • where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window in which the hash operation is performed; and to determine whether a key identical to the hash value exists in the hash table.
  • the key of the hash table is a hash value generated by hashing (m+1) Byte of historical data of the (a+m)-th Byte data, and the value of the hash table includes the last n bits of the starting logical address of that (m+1) Byte of historical data. If a key identical to the hash value exists in the hash table, the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte data. If no such key exists, the hash value and the last n bits of the logical address of the a-th Byte data of the data to be compressed are added to the hash table.
  • the data compression device simplifies the operation of the hash table in the data compression process by setting the end N bit of the starting logical address of the storage space for storing the data to be compressed to 0, thereby improving the data compression speed.
  • the data to be compressed includes a plurality of data blocks.
  • the data compression device is capable of storing a plurality of data blocks in the storage space at the same time and compressing them together, which improves the compression ratio compared with compressing only a single data block at a time.
  • before determining whether a key identical to the hash value exists in the hash table, the processing chip is further configured to determine whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0; if the size is greater than 2^K Byte, to set the length of the value of the hash table to no less than (K/8+1) Byte; and if the size is less than or equal to 2^K Byte, to set the length of the value of the hash table to no less than K/8 Byte.
  • the data compression device determines the length of the logical address of the value to be written into the hash table by determining the size of the data to be compressed before matching the key in the hash table with the hash value, thereby improving the compression speed.
  • after updating the value corresponding to the hash value in the hash table according to the last n bits of the logical address of the a-th Byte data, the processing chip is further configured to: if the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data of the data to be compressed and the value corresponding to the hash value in the hash table is less than 2^K, match the a-th Byte data and the data after it in the data to be compressed against the historical data of the a-th Byte data, and generate a compression code according to the matching result.
  • the data compression device determines in advance whether the size of the data to be compressed is greater than 2^K Byte. If it is not greater than 2^K Byte, there is no need to judge whether the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data and the value corresponding to the hash value in the hash table is greater than or equal to 2^K, which saves the judgment process and further improves the compression speed.
  • the processing chip is further configured to determine whether the (a+m)-th Byte data is the last 1 Byte of the data to be compressed; if so, the encoding of the data to be compressed ends, and if not, the window for the hash operation is shifted to the right.
  • a third aspect of the present application provides a computing device including a processor and a memory.
  • the processor and the memory establish a communication connection through a bus, the processor operating to read a program in the memory to perform the data compression method provided by the first aspect.
  • a storage medium storing program code, the program code being executed by the computing device, performing the data compression method provided by the first aspect.
  • the storage medium includes, but is not limited to, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • a computer program product is provided.
  • the computer program product can be a software installation package.
  • when the software installation package is executed by the computing device, the data compression method provided by the first aspect is performed.
  • FIG. 1 is a schematic diagram of a system according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of another system provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data compression method according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a data compression device according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another data compression device according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • first, second, etc. are used in this application to distinguish each object, but there is no logical or temporal dependency between each of the "first” and “second”.
  • a data block refers to a fixed-size data, and a data block size may be 4K Byte, 8K Byte, etc.; in a file storage scenario, a data block refers to a file, and its size is not fixed.
  • the data chunk includes a plurality of data blocks, and the size of a common data chunk can be 256K Byte, 4M Byte, and the like.
  • the data to be compressed may include one or more data blocks, which may belong to one or more data chunks.
  • clean refers to the initialization of the hash table, which resets the data stored in the hash table to 0 to avoid mismatches when the hash table is used.
  • the historical data of the current data refers to the data, within the data to be compressed, that is located before the current data, that is, the data whose logical address is smaller than that of the current data.
  • for example, the data before the a-th Byte in the data to be compressed is the historical data of the a-th Byte.
  • a logical address refers to a virtual address assigned by an operating system.
  • the starting logical address of any (m+1) Byte data that is, the logical address of the first Byte data of the (m+1) Byte data.
  • the unit of the length of the value of the hash table is Byte.
  • the relative address of any 1 byte of data refers to the offset of the Byte data relative to the starting logical address of the data to be compressed in which the Byte data is located.
  • the relative address of any (m+1) Byte data that is, the relative address of the first Byte data of the (m+1) Byte data.
  • the relative address of the first Byte data of the (m+1) Byte data that is, the end n bit of the logical address of the first Byte data of the (m+1) Byte data.
  • the data from the a-th Byte to the (a+m)-th Byte refers to the a-th Byte data, the (a+m)-th Byte data, and the data between them.
  • the system includes a storage array including at least one storage controller and a plurality of storage devices, which are generally non-volatile storage devices, for example flash memory, HDDs, or SSDs.
  • Each storage controller is connected to multiple storage devices. In order to save the space of the storage device in the storage array, the storage controller is configured to compress the data to be stored, and store the obtained compression code into the storage device.
  • FIG. 2 is a schematic diagram of another system to which an embodiment of the present application is applied, where the system includes a first data processing device and a second data processing device.
  • a data compression device is disposed in the first data processing device, and a data decompression device is disposed in the second data processing device.
  • the data compression device compresses the data that needs to be transmitted to the second data processing device and then transmits the compressed code to the second data processing device over the communication network.
  • the data decompression device decompresses the compression code. Therefore, only the compression code needs to be transmitted over the communication network, which reduces communication traffic and speeds up data transmission.
  • the data compression method provided in FIG. 3 is performed when the storage controller or the data compression device is in operation.
  • the present application also provides a data compression method, a schematic flowchart of which is shown in FIG. 3. The storage controller is taken as an example below.
  • Step 202: Allocate a first storage space, where the first storage space is used to store the data to be compressed, the last N bits of the starting logical address of the first storage space are 0, and N is an integer greater than 1.
  • for example, the starting logical address of the first storage space is 0x FFFF FFFF 0000 0000.
  • Step 204: Allocate a second storage space for storing the compression code generated during compression, so that the corresponding data can be restored by subsequently decompressing the compression code.
  • Step 206: Allocate a third storage space, where the third storage space is used to store a hash table.
  • the hash table may adopt a key-value structure. Each key is a hash value obtained by hashing the (m+1) Byte data in the window, and the value corresponding to each key includes the last n bits of the starting logical address of the (m+1) Byte data that generated the key.
  • the value corresponding to each key of the hash table needs to include the relative address of the (m+1) Byte data that generated the key. Since the last n bits of the starting logical address of the data to be compressed are 0, that relative address is exactly the last n bits of the starting logical address of the (m+1) Byte data that generated the key.
  • in Table 1, hash value 1 is the hash value corresponding to the a-th Byte to the (a+m)-th Byte data,
  • logical address 1 is the logical address of the a-th Byte data, and the rest of the rows in Table 1 are analogous.
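A single row of such a key-value hash table could be built as in the sketch below. The application does not name a hash function, so `zlib.crc32` is used purely as a stand-in; the helper name `make_row` and the example address are likewise assumptions.

```python
import zlib


def make_row(window: bytes, start_address: int, n: int):
    """Build one (key, value) row for the (m+1) Byte data in the window."""
    key = zlib.crc32(window)                    # stand-in hash of the window contents
    value = start_address & ((1 << n) - 1)      # keep only the last n bits of the address
    return key, value


key, value = make_row(b"abcd", 0xFFFFFFFF000007D0, 16)
print(hex(value))  # 0x7d0
```

Because the upper bits of the address are identical for every Byte of the data to be compressed, dropping them loses no information.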
  • Step 202, step 204, and step 206 may be performed in any order, or may be combined into the same step.
  • the first storage space, the second storage space, and the third storage space may refer to a memory space.
  • Step 207: Acquire the data to be compressed, and store the data to be compressed into the first storage space.
  • the size of the data to be compressed is 2^n Byte, and n is not greater than N. Therefore, the last N bits of the starting logical address of the data to be compressed are 0.
  • step 207 and the subsequent steps may be performed multiple times, and the first storage space need not be allocated again for each piece of data to be compressed. Since the size of the data to be compressed acquired in each step 207 may differ, the 2^N Byte set in step 202 needs to be greater than or equal to the size of each piece of data to be compressed, to ensure that the last n bits of the starting logical address of the data to be compressed acquired in each subsequent step 207 are 0.
  • the data to be compressed may include a plurality of data blocks. Compared with data to be compressed that includes only one data block, storing a plurality of data blocks into the first storage space at a time avoids the performance loss caused by cleaning the hash table multiple times. At the same time, because the size of the data to be compressed in the first storage space is increased, matching historical data is easier to find for the data in each window, so the compression ratio can be improved.
  • the storage controller acquires data to be compressed from a client or other device, and the data to be compressed is data that needs to be stored in the storage device.
  • Step 208: Determine whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0. If it is greater, the branch starting at step 210 is executed. If not, the branch starting at step 222 is executed.
  • common values for K include 16, 24, and 32.
  • K equal to 16 is used as an example.
  • the value of K can be chosen with reference to the size of the storage device's cache.
  • step 210: the hash table is cleaned up.
  • step 212: the length of the value of the hash table is set to be no less than (K/8+H) Byte.
  • H is a positive integer, and a common value is 2.
  • the length of the value of the hash table is exemplarily set to 4 Byte.
  • if the size of the data to be compressed is greater than 2^K Byte, the relative address of each Byte in the data to be compressed cannot be expressed in K/8 Byte, so the length of the value of the hash table needs to be increased.
  • steps 210 and 212 can be interchanged.
  • the length of the value of the hash table may be set according to the size of the data to be compressed. This avoids a value that is too long, which wastes storage space and makes the hash table harder to operate on, and it avoids a value that is too short to be sufficient.
  • common values of m include 2, 3, 4, 5, 6, and 7.
  • the length of the key of the hash table is set according to the type of hash operation employed.
  • step 210 may be performed at any time prior to step 216, ensuring that the hash table is cleaned prior to use of the hash table in step 216.
  • Step 214: Generate a hash value according to the a-th Byte to the (a+m)-th Byte of the data to be compressed, where a is an integer greater than 0.
  • Step 216: Determine whether there is a key in the hash table that is the same as the hash value. If yes, perform steps 2161 to 2162. If not, go to step 2163.
  • Step 2161: Acquire the value of the row where the hash value is located, and update the value of that row according to the last n bits of the logical address of the a-th Byte data of the data to be compressed.
  • the value used to update the row of the hash value may include, in addition to the last n bits of the logical address of the a-th Byte data of the data to be compressed, one or more higher bits. That is, the value of the row of the hash table is updated with the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data of the data to be compressed.
  • Step 2162: Determine whether the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data and the value of the row of the hash value is greater than 2^K.
  • for example, it is determined whether the difference between the last 8U bits of the logical address of the a-th Byte data and the value of the row where the hash value is located is greater than 2^K.
  • the value of the row of the hash value used in step 2162 is the value before the update in step 2161 is performed.
  • because the logical addresses of all Bytes of the data to be compressed are identical above the last n bits, it is only necessary to compare whether the difference between the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data and the value of the row of the hash value is greater than 2^K.
  • in step 2162, if the difference is determined not to be greater than 2^K, then before step 218 is performed, the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed is obtained according to the value of the row of the hash value, for use in step 218.
  • Step 2163: Add the hash value and the last n bits of the logical address of the a-th Byte data of the data to be compressed to the hash table.
  • if n is not an integer multiple of 4, one or more bits higher than the last n bits may be included in addition to the last n bits of the logical address of the a-th Byte data of the data to be compressed. That is, the hash value and the last (8 * the length of the value of the hash table) bits of the logical address of the a-th Byte data of the data to be compressed are added to a new row of the hash table.
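Steps 216, 2161, and 2163 can be sketched with a Python dictionary standing in for the hash table. This is only an illustration: the function name `lookup_and_update`, the dictionary representation, and the sample hash value 0xAB are assumptions, not part of the application.

```python
def lookup_and_update(table: dict, hash_value: int, address: int, value_len: int):
    """Return the previous value for hash_value (None if absent) and store the new one."""
    mask = (1 << (8 * value_len)) - 1
    new_value = address & mask          # last (8 * value_len) bits of the logical address
    old_value = table.get(hash_value)   # None means no matching key (step 2163 case)
    table[hash_value] = new_value       # update the row (2161) or add a new row (2163)
    return old_value


table = {}
print(lookup_and_update(table, 0xAB, 0xFFFFFFFF00000190, 4))  # None
print(lookup_and_update(table, 0xAB, 0xFFFFFFFF000007D0, 4))  # 400
```

Returning the value as it was before the update mirrors the requirement that step 2162 uses the row value from before step 2161.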
  • the current logical address is the starting logical address of the (m+1) Byte data currently hashed.
  • if the hash value matches a key, the value of the row of the hash value in the hash table needs to be read, and that value is then updated with the relative address of the a-th Byte data; that is, the hash table is read once and written once.
  • taking a row value of 400 as an example: to match the (m+1) Byte data currently hashed against the matched historical data, the complete starting logical address of the matched historical data must be obtained. Therefore, 400 and 0x FFFF FFFF 0000 0001 need to be added to obtain 0x FFFF FFFF 0000 0191.
  • 0x FFFF FFFF 0000 0191 is the starting logical address of the historical data that is the same as the (m+1) Byte data currently hashed.
  • the value of the row of the hash value also needs to be updated with the relative address of the (m+1) Byte data currently hashed. Since the last N bits of the starting logical address of the data to be compressed are 0, the value of the matched row is directly updated to 07D0.
  • if the hash value corresponding to the a-th Byte data to the (a+m)-th Byte data does not exist in the key of any row of the hash table, that is, it cannot match the key of any row, the hash value and the relative address of the a-th Byte data of the data to be compressed are added to the hash table; that is, the hash table is written once.
  • for example, if the starting logical address of the a-th Byte data to the (a+m)-th Byte data is 0x FFFF FFFF 0000 07D0, the hash value corresponding to that data and 0x 07D0 are stored in the hash table.
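The address arithmetic in the two examples above can be checked with a few lines of Python. The helper names are hypothetical, and the choice of 32 low bits for the aligned case is an assumption made only so that the example addresses fit.

```python
def absolute_address(base: int, relative: int) -> int:
    """Recover a full logical address from a base address and a stored relative value."""
    return base + relative


def relative_address(address: int, n_bits: int) -> int:
    """With a base whose last n_bits bits are 0, the relative address is a plain bit mask."""
    return address & ((1 << n_bits) - 1)


# Example from the text: 400 added to 0x FFFF FFFF 0000 0001.
print(hex(absolute_address(0xFFFFFFFF00000001, 400)))   # 0xffffffff00000191
# Aligned case: the stored value is just the low bits of the current address.
print(hex(relative_address(0xFFFFFFFF000007D0, 32)))    # 0x7d0
```

In the aligned case no subtraction of the base is needed before writing the row, which is the simplification the text attributes to setting the last N bits of the starting address to 0.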
  • Step 218: the historical data that is the same as the (m+1) Byte data currently hashed and the (m+1) Byte data currently hashed are matched to the right Byte by Byte, a compression code corresponding to the current match is generated according to the matching result, and the compression code is stored in the second storage space.
  • the historical data of the data to be compressed is obtained according to the starting logical address, and it is matched Byte by Byte against the a-th Byte data and the data after the a-th Byte data until no further match is possible.
  • the compression code includes: the matching length between the historical data and the data starting at the a-th Byte, the relative address of the historical data, and the data between the last Byte recorded by the previous compression code and the first Byte of the current match.
  • the value of the first row in the hash table is read to locate the first character. The ninth character is then compared with the first character, the tenth character with the second character, and so on, matching to the right until no further match is possible.
  • the 9th to 14th characters are the same as the 1st to 6th characters.
  • the resulting compression code thus includes: abcdefgh, 100, 6.
  • abcdefgh is the data between the last Byte recorded by the previous compression code and the first Byte of the current match, 100 is the relative address of the historical data matched by the first Byte after h, and 6 is the matching length.
  • the data to be compressed is restored as follows: first abcdefgh is extracted; then the first 6 characters of abcdefgh, that is, abcdef, are obtained according to 100 and 6; abcdef is appended to abcdefgh, which restores the data to be compressed, abcdefghabcdef.
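The restoration order in this example can be sketched as a tiny decoder. It is only an illustration: the function name `decode` is hypothetical, and the parameter `first_rel_addr=100` encodes the assumption that the first character of the example sits at relative address 100, which is what makes the triple (abcdefgh, 100, 6) point back at abcdef.

```python
def decode(literals: bytes, history_rel_addr: int, match_len: int,
           first_rel_addr: int = 100) -> bytes:
    """Rebuild the data from one compression code: literals, relative address, length."""
    out = bytearray(literals)                    # first emit the unmatched data
    start = history_rel_addr - first_rel_addr    # index of the matched historical data
    for i in range(match_len):
        out.append(out[start + i])               # copy the match Byte by Byte
    return bytes(out)


print(decode(b"abcdefgh", 100, 6))  # b'abcdefghabcdef'
```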
  • the window for generating the hash value may have been shifted to the right. Therefore, there may be some data that is not recorded in the compression code generated in the previous execution of step 218 and that is located before the start of the window in the current execution of step 218, so this data needs to be recorded in the compression code generated in the current execution of step 218.
  • Q, W, and E are the lengths of the right shift of the window, that is, how many Bytes the window slides to the right.
  • in step 2162, it is judged whether the difference between the logical address of the matched historical data and that of the (m+1) Byte data currently hashed is greater than 2^K.
  • 2^K Byte can be the size of the storage controller's cache.
  • if the matched historical data and the (m+1) Byte data currently hashed are 2^K Byte or more apart, step 220 is not performed for the current match. If they are less than 2^K Byte apart, the matched historical data and the (m+1) Byte data currently hashed can be held in the cache at the same time, so step 220 is performed.
  • step 2162 is an optional step; that is, after step 2161, the method may proceed without performing step 2162.
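The distance check of step 2162 amounts to a single comparison. The sketch below follows the "less than 2^K" formulation used in the summary of the method; the function name `match_allowed` and the sample values are assumptions, and how the boundary case is treated in a real implementation is not settled by the text.

```python
def match_allowed(current_low_bits: int, stored_value: int, k: int) -> bool:
    """True if the current data and the matched historical data are less than 2^K Byte apart."""
    return current_low_bits - stored_value < (1 << k)


print(match_allowed(0x07D0, 400, 16))    # True: 1600 Byte apart
print(match_allowed(0x20000, 400, 16))   # False: more than 2^16 Byte apart
```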
  • step 222 the hash table is cleaned up.
  • step 224 the length of the value of the hash table is set to be no less than K/8 Byte.
  • steps 222 and 224 can be interchanged.
  • the length of the value of the hash table is not less than K/8 Byte. If K is not a multiple of 8, then in step 224, the length of the value of the hash table is set to be no less than
  • step 222 may be performed at any time prior to step 228, ensuring that the hash table is cleaned prior to use of the hash table in step 228.
  • Step 226: generate a hash value from the b-th Byte to the (b+m)-th Byte of the data to be compressed, where b is an integer greater than 0.
  • b takes a value of 1.
  • Step 228: determine whether the hash value matches any key of the hash table. If it matches, step 2301 is performed; otherwise, step 2302 is performed.
  • in step 2301, the value of the row whose key matches the hash value is obtained, and the value of that row is updated according to the last n bits of the logical address of the b-th Byte of the data to be compressed.
  • the content used to update the value of the matched row may include, in addition to the last n bits of the logical address of the b-th Byte of the data to be compressed, one or more bits higher than those n bits. That is, the value of the row is updated with the last (8 × the length of the value of the hash table) bits of the logical address of the b-th Byte of the data to be compressed.
  • in step 2301, the starting logical address of the historical data identical to the (m+1) Byte data currently hashed also needs to be obtained from the value of the matched row, for use in step 232.
  • Step 2302: add the hash value and the last n bits of the logical address of the b-th Byte of the data to be compressed to the hash table.
  • if n is not an integer multiple of 4, one or more bits higher than the last n bits may be included in addition to the last n bits of the logical address of the b-th Byte of the data to be compressed.
  • that is, the hash value and the last (8 × the length of the value of the hash table) bits of the logical address of the b-th Byte of the data to be compressed are added to a new row of the hash table.
  • Step 232: match the (m+1) Byte data currently hashed against the identical historical data, extending the comparison to the right Byte by Byte, and generate the compression code for the current match according to the matching result.
  • the compression code is stored in the third storage space.
  • for details of how the compression code is generated in step 232, refer to the description of step 218 above.
  • R and T are the lengths of the right shift of the window, that is, how many Bytes of data the window slides to the right.
  • by determining in step 208 whether the data to be compressed is larger than 2^K Byte, in the branch from step 222 to step 234, since the data to be compressed is not larger than 2^K Byte, the difference between the logical address of any Byte of historical data and the logical address of the (m+1) Byte data currently hashed cannot exceed 2^K, so no judgment like that of step 2162 is needed, which shortens the compression process and further improves the compression speed.
  • step 208 is an optional step.
  • if step 208 is not used, step 210, step 214, and the steps following step 214 are performed directly. In this case, since the size of the data to be compressed is not known before the hash table is operated on in step 2161 or step 2163, the length of the logical address to be written into the hash table must be determined from the size of the data to be compressed.
  • the size of the data to be compressed is 2^16 Bytes, and the operating system used by the storage controller is a 64-bit system. Therefore, before step 2161 or step 2163, it must be determined from the size of the data to be compressed that the hash table is to be updated with the last 16 bits of the logical address of the a-th Byte.
  • by adopting step 208, the size of the data to be compressed no longer needs to be determined for each operation on the hash table, which further improves the compression speed.
  • the present application further provides a data compression device 400, which may be the storage controller in FIG. 1 or the data compression device in FIG. 2.
  • the data compression device 400 includes a communication interface 402 and a processing chip 404, and the communication interface 402 and the processing chip 404 establish a communication connection.
  • the data compression method corresponding to FIG. 3 is executed.
  • the communication interface 402 is for communicating with an external device, such as a client writing data to be compressed, a storage device in a storage array, a network device in a communication network, and the like.
  • Communication interface 402 can be an input/output interface of data compression device 400.
  • the communication interface 402 is specifically configured to perform the step of acquiring data to be compressed in step 207, and the step of storing the compression code in the third storage space into the storage device after step 220 and step 234. If the data compression device 400 is the data compression device of FIG. 2, then after step 220 and step 234, the communication interface 402 is configured to send the compression code in the third storage space to the communication network.
  • the processing chip 404 is configured to perform step 202 to step 206, and perform the step of storing the data to be compressed into the first storage space in step 207, and is further configured to perform step 208 to step 220, and is further configured to perform step 208 to step 234.
  • the processing chip 404 can be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above PLD may be a complex programmable logic device (CPLD), a field programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the processing chip 404 can also be implemented by a processor, a storage device, and a logic chip, which can be implemented by a PLD or an ASIC.
  • the processor and the logic chip each perform a part of functions, and the functions of the two can be allocated in various ways.
  • the code in the memory is read by the processor to perform steps 202 to 207. After the first storage space, the second storage space, and the third storage space have all been allocated in the memory, and the data to be compressed has been stored in the first storage space, the subsequent steps are completed by the logic chip.
  • the data compression device provided above simplifies the read and write operations on the hash table during compression of the data to be compressed by setting the last N bits of the starting logical address of the storage space used to store the data to be compressed to 0, which increases the compression speed.
  • FIG. 6 is a computing device provided by the present application.
  • the computing device 600 may be the storage controller in FIG. 1 or the data compression device in FIG. 2.
  • Computing device 600 includes a processor 602, a memory 604, and may also include a bus 606 and a communication interface 608.
  • Communication interface 608 is used to communicate with external devices, such as clients that write data to be compressed, storage devices in a storage array, network devices in a communication network, and the like.
  • Communication interface 608 can be an input/output interface of computing device 600.
  • the processor 602, the memory 604, and the communication interface 608 can implement communication connections with each other through the bus 606, and can also implement communication by other means such as wireless transmission.
  • the processor 602 can be a central processing unit (CPU).
  • the memory 604 may include a volatile memory, such as a random-access memory (RAM).
  • the memory 604 may further include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD or an SSD. The memory 604 may also include a combination of the above types of memory.
  • the memory 604 may also not include the non-volatile memory, in which case the non-volatile memory of the computing device 600 is provided by a storage device of the storage array.
  • if the computing device 600 is the data compression device of FIG. 2, it can send the compression code directly to the communication network and does not need to store the compression code in a non-volatile memory, so the memory 604 may not include a non-volatile memory.
  • the program code for implementing the data compression method provided in FIG. 3 of the present application is stored in the memory 604 and executed by the processor 602.
  • the computing device provided above simplifies the read and write operations on the hash table during compression of the data to be compressed by setting the last N bits of the starting logical address of the storage space used to store the data to be compressed to 0, which improves the compression speed.
  • the methods described in connection with the present disclosure can be implemented by a processor executing software instructions.
  • the software instructions can consist of corresponding software modules, which can be stored in RAM, flash memory, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a hard disk, an SSD, an optical disk, or any other form of storage medium known in the art.
  • the functions described herein may be implemented in hardware or software.
  • the functions may be stored in a computer readable medium or transmitted as one or more instructions or code on a computer readable medium.
  • a storage medium may be any available media that can be accessed by a general purpose or special purpose computer.

Abstract

A data compression method, commonly used in a storage array. The method comprises: setting the last n bits of the starting logical address of the storage space in which the data to be compressed is located to 0, so that in the subsequent process of compressing the data to be compressed, read and write operations on a hash table can be performed more easily, thereby increasing the compression speed.

Description

Data compression method, device and computing device
Technical field
The present application relates to the field of computer technologies, and in particular to a data compression method, a data compression device corresponding to the method, and a computing device for executing the data compression method.
Background
Compression technology is widely used in data storage, data transmission and other fields. Traditional compression techniques include dictionary compression, also known as LZ compression after Abraham Lempel and Jacob Ziv. LZ compression has many coding variants, such as LZ4, LZ5, LZO and LZH, whose common feature is that historical data is used as a dictionary when the current data is encoded.
LZ compression compresses data at byte/string granularity. Take a data block to be compressed of 4M Byte and a window size of 4 Byte as an example: when the 4 Byte of data in the current window is compressed, it is matched against the historical data of the data block to be compressed. If that historical data contains data identical to the 4 Byte data, the code corresponding to the 4 Byte data only needs to record the position information and the length of the historical data, so that during decompression the 4 Byte data can be recovered from its code. The compression speed of current LZ compression still leaves room for improvement.
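As a minimal sketch of the decompression idea described above, the following Python decoder rebuilds data from codes that record only position and length. The token layout (literal bytes, position of the historical data, match length) and the 0-based position are assumptions made for illustration; they are not the encoding defined by this application.

```python
def decode(tokens):
    """Rebuild the original bytes from (literals, history_pos, length) tokens.

    history_pos is assumed to be a 0-based index into the output restored so far.
    """
    out = bytearray()
    for literals, history_pos, length in tokens:
        out += literals                      # unmatched data is copied as-is
        for i in range(length):              # matched data is copied from the history
            out.append(out[history_pos + i])
    return bytes(out)


# Eight literal bytes followed by a 6-byte match that starts at the first byte.
restored = decode([(b"abcdefgh", 0, 6)])
print(restored)
```

Copying the match one byte at a time keeps the sketch correct even when the matched region overlaps the bytes being written.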
Summary of the invention
The present application provides a data compression method to increase the speed of data compression.
A first aspect of the present application provides a data compression method performed by a storage controller or a data compression device. First, a storage space is allocated; the last N bits of the starting logical address of the storage space are 0, where N is an integer greater than 1. In practice, the value of N is related to the size of the data to be compressed that is subsequently processed.
Then, the data to be compressed is stored in the storage space. The size of the data to be compressed is 2^n Byte, where n is not greater than N, so the last N bits of the starting logical address of the data to be compressed are all 0. Since the size of the data to be compressed is not greater than 2^N Byte, the valid part of its starting logical address is all 0, and the part of the starting logical address above the n-th bit is the invalid part, because the bits above the n-th bit are the same in the logical address of every Byte of the data to be compressed.
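The effect of the aligned starting address can be checked with a few lines of Python. This is only a sketch: the helper `low_bits`, the base address and the values of N and n are hypothetical and chosen for illustration.

```python
def low_bits(addr: int, n: int) -> int:
    """Return the last n bits of a logical address."""
    return addr & ((1 << n) - 1)


N = 16                 # alignment of the storage space (hypothetical value)
n = 14                 # the data occupies 2**n bytes, with n <= N
base = 0x7F3A0000      # hypothetical starting address whose last N bits are 0

assert low_bits(base, N) == 0

for offset in (0, 1, 4095, 2**n - 1):
    addr = base + offset
    # The last n bits of a byte's address equal its offset within the data.
    assert low_bits(addr, n) == offset
    # The bits above the n-th bit are identical for every byte of the data.
    assert addr >> n == base >> n
```

Because the upper bits never change, storing only the last n bits loses no information about where a byte sits inside the data to be compressed.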
Subsequently, a hash operation is performed on the a-th Byte to the (a+m)-th Byte of the data to be compressed to generate a hash value, where a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed. As the start of the window moves to the right from the first Byte of the data to be compressed, a can take values from 1 to (2^n - m).
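The sliding window can be sketched as follows. The hash function used here (a multiplicative hash reduced to 16 bits) is an assumption for illustration only; the application does not prescribe a particular hash.

```python
def window_hashes(data: bytes, m: int):
    """Yield (a, hash) for every window of m+1 bytes; a is 1-based as in the text."""
    width = m + 1
    for a in range(1, len(data) - m + 1):
        window = data[a - 1 : a - 1 + width]
        # Hypothetical hash: multiply the window value and keep 16 bits.
        h = (int.from_bytes(window, "little") * 2654435761) & 0xFFFFFFFF
        yield a, h >> 16


data = b"abcdefghabcdef"
hashes = dict(window_hashes(data, m=3))
print(min(hashes), max(hashes))   # first and last window start positions
```

For data of length 2^n the last window starts at a = 2^n - m, which matches the range given above, and identical windows always produce identical hash values.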
Subsequently, it is determined whether the hash table contains a key identical to the hash value. A key of the hash table is a hash value generated by hashing (m+1) Byte of historical data of the (a+m)-th Byte, and the corresponding value includes the last n bits of the starting logical address of that (m+1) Byte of historical data. Determining whether the hash table contains a key identical to the hash value means matching the hash value against the keys of the hash table one by one. If a key matches, the a-th Byte to the (a+m)-th Byte have appeared before in the historical data of the data to be compressed; if no key matches, the a-th Byte to the (a+m)-th Byte appear in the data to be compressed for the first time.
According to the result of the previous step, if the hash table contains a key identical to the hash value, the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte; if the hash table contains no such key, the hash value and the last n bits of the logical address of the a-th Byte of the data to be compressed are added to the hash table.
If the a-th Byte to the (a+m)-th Byte have appeared in the historical data, the starting logical address of that historical data recorded in the hash table needs to be replaced with the starting logical address of the a-th Byte to the (a+m)-th Byte. If they appear in the data to be compressed for the first time, the hash value and the last n bits of the logical address of the a-th Byte are inserted into a new row of the hash table, so that after the window moves further to the right, any subsequent (m+1) Byte of data identical to the a-th Byte to the (a+m)-th Byte can match the inserted row.
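The lookup-then-update behaviour described above can be sketched with a Python dictionary standing in for the hash table. The function name `update_table` and the use of `dict` are assumptions for illustration, not the data structure of this application.

```python
def update_table(table: dict, hash_value: int, addr: int, n: int):
    """Return the previously stored address bits for hash_value (or None),
    then store the last n bits of the current byte's logical address."""
    previous = table.get(hash_value)          # None means first occurrence
    table[hash_value] = addr & ((1 << n) - 1)
    return previous


table = {}
base = 0x10000                                # hypothetical aligned start address
first = update_table(table, 7, base + 0, n=14)
second = update_table(table, 7, base + 8, n=14)
print(first, second, table[7])
```

The first call inserts a new row, and the second call returns the position of the historical data and replaces it with the position of the current window, so later windows match against the most recent occurrence.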
Since the minimum granularity for reading and writing a value of the hash table is a Byte, if n is not an integer multiple of 4, the content that replaces or is added to the value of the hash table may include, in addition to the last n bits of the logical address of the a-th Byte of the data to be compressed, one or more bits above the n-th bit. For example, when n = 14, since at least 4 Byte of content needs to be replaced in or added to the value of the hash table, the last 16 bits of the logical address of the a-th Byte of the data to be compressed need to be replaced or added.
The data compression method provided above sets the last N bits of the starting logical address of the storage space used to store the data to be compressed to 0, which simplifies the operations on the hash table during data compression and increases the data compression speed.
With reference to the first aspect, in a first implementation of the first aspect, the data to be compressed includes a plurality of data blocks.
Storing a plurality of data blocks in the storage space together and compressing them improves the compression ratio compared with compressing only a single data block at a time.
With reference to the first aspect or the first implementation of the first aspect, in a second implementation of the first aspect, before determining whether the hash table contains a key identical to the hash value, the method further includes: determining whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0. If the size of the data to be compressed is greater than 2^K Byte, the length of the value of the hash table is set to no less than (K/8+1) Byte; that is, if the data to be compressed is longer than 2^K Byte, at least (K/8+1) Byte is needed to express a relative address within the data to be compressed. Conversely, if the size of the data to be compressed is less than or equal to 2^K Byte, the length of the value of the hash table is set to no less than K/8 Byte; that is, K/8 Byte is enough to express a relative address within the data to be compressed.
By determining the size of the data to be compressed before the hash value is matched against the keys of the hash table, the length of the logical address of the data to be compressed that needs to be written into the value of the hash table is determined in advance. Compared with determining this length each time the hash table is updated or written, this increases the compression speed.
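The one-time choice of the value length can be sketched as below. The function name is hypothetical, and the sketch assumes that K is a multiple of 8 so that K/8 is a whole number of Bytes.

```python
def value_length_bytes(data_size: int, K: int) -> int:
    """Minimum hash table value length in Bytes, decided once per data set.

    Assumes K is a multiple of 8 (an assumption of this sketch).
    """
    if K <= 0 or K % 8 != 0:
        raise ValueError("this sketch only handles positive multiples of 8")
    if data_size > 2**K:
        return K // 8 + 1      # larger data needs one more Byte per address
    return K // 8


print(value_length_bytes(2**16, 16))       # data of exactly 2^16 Byte
print(value_length_bytes(2**16 + 1, 16))   # data larger than 2^16 Byte
```

Calling this once before the window starts moving avoids repeating the size check on every hash table update.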
With reference to the second implementation of the first aspect, in a third implementation of the first aspect, if the size of the data to be compressed is greater than 2^K Byte, then after the value corresponding to the hash value in the hash table is updated according to the last n bits of the logical address of the a-th Byte, the method further includes: if the difference between the last (8 × the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is less than 2^K, matching the a-th Byte and the data following it against the historical data of the a-th Byte in the data to be compressed, and generating a compression code according to the matching result; if the difference is not less than 2^K, not performing this matching.
The data compression algorithm includes the following setting: when the difference between the last (8 × the length of the value of the hash table) bits of the logical address of the a-th Byte of the data to be compressed and the value corresponding to the hash value in the hash table is not less than 2^K, the match for the current window is abandoned. With this setting, if the size of the data to be compressed is greater than 2^K Byte, this difference may be not less than 2^K. If the size of the data to be compressed is not greater than 2^K Byte, the difference is necessarily less than 2^K. Therefore, the difference needs to be checked against 2^K only when the size of the data to be compressed is greater than 2^K Byte. By determining in advance whether the size of the data to be compressed is greater than 2^K Byte, the check of whether the difference is greater than or equal to 2^K can be skipped when the size is not greater than 2^K Byte, which removes a judgment from the data compression process and increases the compression speed.
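The distance check and the case in which it can be skipped can be sketched as follows. The function and parameter names are hypothetical, and the stored value is assumed to hold the low address bits of the historical data.

```python
def should_match(cur_addr: int, stored_value: int, value_len: int, K: int) -> bool:
    """Return True if the historical data is close enough to attempt a match."""
    mask = (1 << (8 * value_len)) - 1          # last (8 * value length) bits
    distance = (cur_addr & mask) - stored_value
    return distance < 2**K


def needs_distance_check(data_size: int, K: int) -> bool:
    """The check is only needed when the data is larger than 2^K Byte."""
    return data_size > 2**K


print(should_match(500, 100, value_len=3, K=16))     # distance 400
print(should_match(70100, 100, value_len=3, K=16))   # distance 70000
print(needs_distance_check(2**16, K=16))
```

When the data holds at most 2^K Byte, two addresses inside it can differ by at most 2^K - 1, which is why the per-window check becomes unnecessary in that branch.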
结合第一方面的第三种实现方式,在第一方面的第四种实现方式中,在根据匹配结果生成压缩编码后,还包括:判断该第a+m Byte数据是否为该待压缩数据的最后1Byte数据,若是,则结束对该待压缩数据的编码,若不是,则将进行该哈希运算的窗口右移。With reference to the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, after generating the compression coding according to the matching result, the method further includes: determining whether the a+m Byte data is the data to be compressed. The last 1 Byte data, if yes, ends the encoding of the data to be compressed, and if not, moves the window of the hash operation to the right.
本申请的第二方面,提供了一种数据压缩设备,包括:通信接口和处理芯片,该通信接口与该处理芯片相连。该通信接口用于与外部设备通信相连以获取待压缩的数据。该处理芯片,用于分配存储空间,该存储空间的起始逻辑地址的末尾N bit为0,N为大于1的整数;该通信接口,用于获取待压缩数据,并将该待压缩数据存入该存储空间,该待压缩数据的大小为2n Byte,n不大于N;该处理芯片,还用于对该待压缩数据的第a Byte数据到第a+m Byte数据进行哈希运算生成哈希值,a为大于0的整数,m为大于0的整数且(m+1)为进行该哈希运算的 窗口的大小;判断哈希表中是否存在与该哈希值相同的key,该哈希表的key为该第a+m Byte数据的(m+1)Byte历史数据进行哈希运算生成的哈希值,该哈希表的value包括该第a+m Byte数据的(m+1)Byte历史数据的起始逻辑地址的末尾n bit,若该哈希表中存在与该哈希值相同的key,根据该第a Byte数据的逻辑地址的末尾n bit更新该哈希表中该哈希值对应的value,若该哈希表中不存在与该哈希值相同的key,将该哈希值和该待压缩数据的第a Byte数据的逻辑地址的末尾n bit加入该哈希表。In a second aspect of the present application, a data compression device is provided, comprising: a communication interface and a processing chip, the communication interface being connected to the processing chip. The communication interface is used for communication with an external device to obtain data to be compressed. The processing chip is configured to allocate a storage space, where the N bit of the starting logical address of the storage space is 0, and N is an integer greater than 1. The communication interface is configured to acquire data to be compressed, and store the data to be compressed. Entering the storage space, the size of the data to be compressed is 2 n Byte, and n is not greater than N; the processing chip is further configured to perform hash operation on the a Byte data to the a+ m Byte data of the data to be compressed. a hash value, a is an integer greater than 0, m is an integer greater than 0 and (m+1) is the size of the window performing the hash operation; determining whether a hash has the same key as the hash value, The key of the hash table is a hash value generated by hashing the (m+1) Byte history data of the a+m Byte data, and the value of the hash table includes the (a+m Byte data) of the hash table. +1) the end n address of the start logical address of the Byte history data. 
If there is a key with the same hash value in the hash table, the hash table is updated according to the last n bit of the logical address of the a-th byte data. The value corresponding to the hash value, if the hash key does not have the same key as the hash value, the hash value and N bit logical address at the end of a Byte data to the compressed data to be added to the hash table.
以上提供的数据压缩设备,通过将用于存储待压缩数据的存储空间的起始逻辑地址的末尾N bit设置为0,简化了数据压缩过程中对哈希表的操作,提升了数据压缩速度。The data compression device provided above simplifies the operation of the hash table in the data compression process by setting the end N bit of the starting logical address of the storage space for storing the data to be compressed to 0, thereby improving the data compression speed.
结合第二方面,在第二方面的第一种实现方式中,该待压缩数据包括多个数据block。In conjunction with the second aspect, in a first implementation of the second aspect, the data to be compressed includes a plurality of data blocks.
该数据压缩设备能够将多个数据block同时存入存储空间并进行压缩,相对于每次仅对单个数据block进行压缩,提升了压缩率。The data compression device is capable of simultaneously storing a plurality of data blocks in a storage space and compressing them, and compresses only a single data block at a time, thereby improving the compression ratio.
结合第二方面或第二方面的第一种实现方式,在第二方面的第二种实现方式中,该处理芯片判断哈希表中是否存在与该哈希值相同的key前,还用于判断该待压缩数据的大小是否大于2K Byte,K为大于0的整数;若该待压缩数据的大小大于2K Byte,设置该哈希表的value的长度不少于(K/8+1)Byte;若该待压缩数据的大小小于或等于2K Byte,设置该哈希表的value的长度不少于K/8Byte。With reference to the second aspect or the first implementation manner of the second aspect, in the second implementation manner of the second aspect, the processing chip determines whether a hash key has the same key as the hash value, and is further used for Determining whether the size of the data to be compressed is greater than 2 K Byte, and K is an integer greater than 0; if the size of the data to be compressed is greater than 2 K Byte, setting the value of the hash table to be no less than (K/8 +1) Byte; if the size of the data to be compressed is less than or equal to 2 K Byte, the value of the value of the hash table is set to be no less than K/8 Byte.
该数据压缩设备通过在用该哈希值匹配该哈希表中的key前判断待压缩数据的大小,确定需要写入该哈希表的value的逻辑地址的长度,提升了压缩速度。The data compression device determines the length of the logical address of the value to be written into the hash table by determining the size of the data to be compressed before matching the key in the hash table with the hash value, thereby improving the compression speed.
结合第二方面的第二种实现方式,在第二方面的第三种实现方式中,若该待压缩数据的大小大于2K Byte,则该处理芯片在根据该第a Byte数据的逻辑地址的末尾n bit更新该哈希表中该哈希值对应的value后,还用于若该待压缩数据的第a Byte数据的逻辑地址的末尾(8*所述哈希表的value的长度)bit与该哈希表中该哈希值对应的value之差小于2K,则将该待压缩数据中该第a Byte数据及该第a Byte数据后的数据与该待压缩数据中该第a Byte数据的历史数据进行匹配,根据匹配结果生成压缩编码;以及若该待压缩数据的第a Byte数据的逻辑地址的末尾(8*所 述哈希表的value的长度)bit与该哈希表中该哈希值对应的value之差不小于2K,则不将该待压缩数据中该第a Byte数据及该第a Byte数据后的数据与该待压缩数据中该第a Byte数据的历史数据进行匹配。With the second implementation of the second aspect, in a third implementation manner of the second aspect, if the size of the data to be compressed is greater than 2 K Byte, the processing chip is in a logical address according to the data of the a Byte data. After the last n bit updates the value corresponding to the hash value in the hash table, it is also used for the end of the logical address of the a Byte data of the data to be compressed (8* the length of the value of the hash table) bit And the difference between the value corresponding to the hash value in the hash table is less than 2 K , and the data after the a Byte data and the a Byte data in the data to be compressed and the first Byte in the data to be compressed The historical data of the data is matched, and a compression code is generated according to the matching result; and if the logical address of the a-byte data of the data to be compressed is at the end (8* the length of the value of the hash table) bit and the hash table If the difference between the values corresponding to the hash value is not less than 2 K , the data of the a a Byte data and the a Byte data in the data to be compressed and the historical data of the a Byte data in the data to be compressed are not Make a match.
该数据压缩设备通过对该待压缩数据的大小是否大于2K Byte提前进行判断,在如果该待压缩数据的大小不大于2K Byte的情况下,无须对该待压缩数据的第a Byte数据的逻辑地址的末尾(8*所述哈希表的value的长度)bit与该哈希表中该哈希值对应的value之差是否大于或等于2K进行判断,节省了判断流程,进一步提升了压缩速度。The data compression device determines whether the size of the data to be compressed is greater than 2 K Byte in advance, and if the size of the data to be compressed is not greater than 2 K Byte, the first Byte data of the data to be compressed is not required. The judgment of the end of the logical address (the length of the value of the hash table of 8*) and the value corresponding to the hash value in the hash table is greater than or equal to 2 K, thereby saving the judgment process and further improving Compression speed.
With reference to the third implementation of the second aspect, in a fourth implementation of the second aspect, after generating the compression code according to the matching result, the processing chip is further configured to determine whether the (a+m)-th Byte of data is the last Byte of the data to be compressed. If so, encoding of the data to be compressed ends; if not, the window on which the hash operation is performed is shifted to the right.
A third aspect of the present application provides a computing device that includes a processor and a memory. The processor and the memory are communicatively connected through a bus, and at run time the processor reads a program in the memory to perform the data compression method provided by the first aspect.
A fourth aspect of the present application provides a storage medium that stores program code; when the program code is run by a computing device, the data compression method provided by the first aspect is performed. The storage medium includes, but is not limited to, a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
A fifth aspect of the present application provides a computer program product, which may be a software installation package; when the software installation package is run by a computing device, the data compression method provided by the first aspect is performed.
BRIEF DESCRIPTION OF DRAWINGS
To describe the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another system according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data compression method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data compression device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another data compression device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computing device according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
The terms "first", "second", and so on are used in this application to distinguish objects; they imply no logical or temporal dependency between the objects so labeled.
Throughout this specification, in a block storage scenario a data block refers to data of a fixed size, for example 4K Byte or 8K Byte; in a file storage scenario a data block refers to a file, whose size is not fixed.
Throughout this specification, a data chunk includes multiple data blocks; common data chunk sizes are 256K Byte, 4M Byte, and so on.
Throughout this specification, the data to be compressed may include one or more data blocks, which may belong to one or more data chunks.
Throughout this specification, cleaning the hash table means initializing it, that is, resetting the data stored in the hash table to 0, so that no false matches occur while the hash table is in use.
Throughout this specification, the historical data of the current data refers to the data, within the data to be compressed, whose logical address precedes (that is, is smaller than) that of the current data. For example, for the a-th Byte of the data to be compressed, the 1st to (a-1)-th Bytes of the data to be compressed are all its historical data.
Throughout this specification, a window contains the (m+1) Byte of data on which a hash operation is performed. The start of the window is the 1st Byte of that (m+1) Byte of data, the end of the window is its last Byte, and the window size is (m+1). Taking data to be compressed that contains the string "abcdefghijklmn" with m = 3 as an example, the window first contains "abcd"; if no historical data matches "abcd", the window is shifted to the right. The shift distance is configurable; with a shift of 1 Byte each time, "bcde" is used next to generate the hash value.
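The window definition above can be illustrated with a minimal sketch. The function name `windows` and its `shift` parameter are hypothetical names chosen for this illustration; the specification only fixes the window size (m+1) and leaves the shift distance configurable.

```python
def windows(data: bytes, m: int, shift: int = 1):
    """Yield (a, window) pairs, where a is the 1-based index of the
    window's first Byte and window holds (m+1) Byte of data."""
    size = m + 1
    for start in range(0, len(data) - size + 1, shift):
        yield start + 1, data[start:start + size]


# The example string from the text, with m = 3 and a shift of 1 Byte.
first_two = list(windows(b"abcdefghijklmn", m=3))[:2]
print(first_two)
```

With a shift of 1 Byte, the first two windows are "abcd" and "bcde", as in the example in the text.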
Throughout this specification, a logical address refers to a virtual address allocated by the operating system. The start logical address of any (m+1) Byte of data is the logical address of its first Byte.
Throughout this specification, the length of the value of the hash table is expressed in Byte.
Throughout this specification, the relative address of any 1 Byte of data refers to the offset of that Byte relative to the start logical address of the data to be compressed in which it is located. The relative address of any (m+1) Byte of data is the relative address of its first Byte. If the last N bit of the start logical address of the storage space holding the (m+1) Byte of data are all 0, and the size of the data to be compressed to which the (m+1) Byte of data belongs is 2^n Byte with n not greater than N, then the relative address of the first Byte of the (m+1) Byte of data equals the last n bit of the logical address of that first Byte.
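The equivalence between the relative address and the last n bit of the logical address can be checked with a short sketch. The helper name `relative_address` and the concrete values of `base`, `n`, and `addr` are assumptions chosen for illustration.

```python
def relative_address(logical_address: int, n: int) -> int:
    """Keep only the last n bit of a logical address."""
    return logical_address & ((1 << n) - 1)


base = 0xFFFF_FFFF_0000_0000   # start address whose last 32 bit are 0 (N = 32)
n = 16                         # data to be compressed is 2^16 Byte, n <= N
addr = base + 0x07D0           # logical address of some Byte inside the data

# With an aligned start address, masking gives the same result as subtracting.
print(hex(relative_address(addr, n)), hex(addr - base))
```

The mask and the subtraction agree only because the last N bit of the start address are 0; with an unaligned start address the subtraction would still be needed.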
Throughout this specification, the OR operation is defined for each pair of bits A and B as follows: if either A or B is not 0, A OR B = 1; if both A and B are 0, A OR B = 0.
Throughout this specification, the a-th Byte to the (a+m)-th Byte of data refers to the a-th Byte, the (a+m)-th Byte, and all data between them.
Throughout this specification, ⌈Z⌉ denotes Z rounded up to the nearest integer. For example, if Z = 4 then ⌈Z⌉ = 4, and if Z = 4.5 then ⌈Z⌉ = 5.
System to which the embodiments of the present application apply
FIG. 1 is a schematic diagram of a system to which an embodiment of the present application applies. The system includes a storage array, which includes at least one storage controller and multiple storage devices. The storage devices are generally non-volatile and may specifically be flash memory, HDDs, or SSDs. Each storage controller is connected to multiple storage devices. To save space on the storage devices in the storage array, the storage controller compresses the data to be stored and stores the resulting compression code in a storage device.
FIG. 2 is a schematic diagram of another system to which an embodiment of the present application applies. The system includes a first data processing device and a second data processing device. A data compression device is provided in the first data processing device, and a data decompression device is provided in the second data processing device. The data compression device compresses the data to be transmitted to the second data processing device and then transmits the compression code to the second data processing device over a communication network, where the data decompression device decompresses it. Only the compression code therefore needs to be transmitted over the communication network, which reduces communication traffic and speeds up data transmission.
At run time, the storage controller or the data compression device performs the data compression method shown in FIG. 3.
The present application also provides a data compression method, a schematic flowchart of which is shown in FIG. 3. The method is described below taking execution by the storage controller as an example.
Step 202: Allocate a first storage space for storing the data to be compressed. The last N bit of the start logical address of the first storage space are 0, where N is an integer greater than 1.
Taking N = 32 and a storage controller running a 64-bit operating system as an example, the start logical address of the first storage space is 0x FFFF FFFF 0000 0000.
Step 204: Allocate a second storage space for storing the compression code generated during compression, so that the corresponding data can later be restored by decompressing the compression code.
Step 206: Allocate a third storage space for storing a hash table. The hash table may use a key-value structure: each key is the hash value obtained by performing a hash operation on the (m+1) Byte of data in a window, and the value corresponding to each key includes the last n bit of the start logical address of the (m+1) Byte of data that generated the key. The value corresponding to each key needs to record the relative address of the (m+1) Byte of data that generated the key; because the last n bit of the start logical address of the data to be compressed are all 0, the last n bit of the start logical address of that (m+1) Byte of data are exactly this relative address.
Because the value of the hash table is read and written at a minimum granularity of 1 Byte, if n is not an integer multiple of 8, the value may include, in addition to the last n bit of the logical address of the a-th Byte of the data to be compressed, one or more bits above the last n bit. For example, when n = 14, the length of the value is at least 2 Byte, so the value needs to include the last 16 bit of the logical address of the a-th Byte of the data to be compressed.
An illustrative structure of the hash table is shown in Table 1.
Key | Value
hash value 1 | last n bit of logical address 1
hash value 2 | last n bit of logical address 2
... | ...
hash value N | last n bit of logical address N
Table 1
Assuming that hash value 1 is the hash value corresponding to the a-th to (a+m)-th Byte of data, logical address 1 is the logical address of the a-th Byte of data; the remaining rows of Table 1 follow the same pattern.
Steps 202, 204, and 206 may be performed in any order, or may be combined into a single step. The first storage space, the second storage space, and the third storage space may refer to memory space.
Step 207: Obtain the data to be compressed and store it in the first storage space. The size of the data to be compressed is 2^n Byte, where n is not greater than N. The last N bit of the start logical address of the data to be compressed are therefore 0.
Once the first storage space whose start logical address ends in N zero bits has been allocated in step 202, step 207 and the subsequent steps can be performed for multiple rounds, without allocating a first storage space for each piece of data to be compressed. Because the size of the data to be compressed obtained in each round of step 207 may differ, the 2^N Byte set in step 202 needs to be greater than or equal to the size of every piece of data to be compressed, so that the last n bit of the start logical address of the data to be compressed obtained in each subsequent round of step 207 are all 0.
The data to be compressed may include multiple data blocks. Compared with a scheme in which the data to be compressed includes only one data block, storing multiple data blocks in the first storage space at once avoids the performance loss of cleaning the hash table repeatedly. In addition, because the data to be compressed in the first storage space is larger, the data in each window is more likely to find matching historical data, which improves the compression ratio.
The storage controller obtains the data to be compressed from a client or another device; the data to be compressed is data that needs to be stored in a storage device.
Step 208: Determine whether the size of the data to be compressed is greater than 2^K Byte, where K is an integer greater than 0. If it is, perform the branch starting at step 210; if not, perform the branch starting at step 222.
Common values of K include 16, 24, and 32. Addressing data to be compressed that is larger than 2^16 Byte requires at least 3 Byte; larger than 2^24 Byte, at least 4 Byte; and larger than 2^32 Byte, at least 5 Byte. In this application, K is illustratively 16. In actual use, K can be chosen with reference to the cache size of the storage device. In this branch, data to be compressed of size 2^32 Byte and m = 3 are taken as an example.
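The Byte counts quoted above follow from the number of bits needed to express every offset inside the data. The sketch below is one way to compute them; the function name `min_address_bytes` is hypothetical and not taken from the specification.

```python
def min_address_bytes(size_in_bytes: int) -> int:
    """Smallest whole number of Bytes that can hold every relative
    address 0 .. size_in_bytes - 1."""
    bits = max(1, (size_in_bytes - 1).bit_length())
    return (bits + 7) // 8  # round up to whole Bytes


for k in (16, 24, 32):
    just_over = (1 << k) + 1  # a size slightly larger than 2^k Byte
    print(k, min_address_bytes(1 << k), min_address_bytes(just_over))
```

Data of exactly 2^16 Byte still fits 2-Byte offsets, while anything larger needs a third Byte, which is why the branch decision in step 208 compares the size against 2^K.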
Step 210: Clean the hash table.
Step 212: Set the length of the value of the hash table to no less than (K/8 + H) Byte.
H is a positive integer; a common value is 2. In addition, (8 * the length of the value of the hash table) must be no less than n: for example, if n = 32 the length of the value is no less than 4, and if n = 24 it is no less than 3. In this branch, the length of the value of the hash table is illustratively set to 4. Because the size of the data to be compressed has been determined to be greater than 2^K Byte in this branch, K/8 Byte cannot express the relative address of every Byte of the data to be compressed, so the length of the value of the hash table needs to be increased.
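The two lower bounds on the value length in step 212 can be combined as in the following sketch. The function name `min_value_length` is hypothetical, and the sketch assumes K is a multiple of 8, as it is for the common values 16, 24, and 32.

```python
def min_value_length(n: int, K: int, H: int) -> int:
    """Smallest value length in Byte that satisfies both conditions:
    length >= K/8 + H, and 8 * length >= n."""
    by_k = K // 8 + H          # assumes K is a multiple of 8
    by_n = (n + 7) // 8        # whole Bytes needed to hold n bit
    return max(by_k, by_n)


# K = 16 and H = 2 as in the text; n = 32 is the example of this branch.
print(min_value_length(n=32, K=16, H=2))
```

For K = 16, H = 2, and n = 32 both bounds give 4 Byte, which matches the illustrative value length chosen in this branch.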
Steps 210 and 212 may be performed in either order.
Once the size of the data to be compressed has been determined in step 208, the length of the value of the hash table can be set in step 212, and in the subsequent step 224, according to that size. This avoids wasting storage space and complicating operations on the hash table through an overly long value, and also avoids a value that is too short to hold the required address bits.
Common values of m include 2, 3, 4, 5, 6, and 7. The length of the key of the hash table is set according to the type of hash operation used.
Step 210 may be performed at any time before step 216, provided that the hash table is cleaned before it is used in step 216.
Step 214: Generate a hash value from the a-th Byte to the (a+m)-th Byte of the data to be compressed, where a is an integer greater than 0. The first time step 214 is performed, a = 1.
Step 216: Determine whether the hash table contains a key identical to the hash value. If so, perform steps 2161 to 2162; if not, perform step 2163.
Step 2161: Obtain the value of the row containing the hash value, and update that value according to the last n bit of the logical address of the a-th Byte of the data to be compressed.
If n is not an integer multiple of 8, the bits used to update the value of the row containing the hash value may include, in addition to the last n bit of the logical address of the a-th Byte of the data to be compressed, one or more bits above the last n bit. That is, the last (8 * the length of the value of the hash table) bit of the logical address of the a-th Byte of the data to be compressed are used to update the value of the row containing the hash value.
Step 2162: Determine whether the difference between the last (8 * the length of the value of the hash table) bit of the logical address of the a-th Byte of data and the value of the row containing the hash value is greater than or equal to 2^K.
If the length of the value of the hash table is U Byte, where U is an integer not less than (K/8 + H), this means determining whether the difference between the last 8U bit of the logical address of the a-th Byte of data and the value of the row containing the hash value is greater than or equal to 2^K.
If the difference is greater than or equal to 2^K, the window used to generate the hash value is shifted to the right, that is, a = a + Q where Q is an integer greater than 0, and the method returns to step 214. If the difference is less than 2^K, step 218 is performed.
The value of the row containing the hash value that is used in step 2162 is the value of that row before the update in step 2161 is performed.
Specifically, because the address bits above the last n bit are identical for every Byte of the data to be compressed, it suffices to compare the last (8 * the length of the value of the hash table) bit of the logical address of the a-th Byte of data with the value of the row containing the hash value and determine whether their difference is greater than or equal to 2^K.
Specifically, in step 2162, if the difference is determined to be less than 2^K, then before step 218 is performed, the start logical address of the historical data identical to the (m+1) Byte of data currently being hashed must also be obtained from the value of the row containing the hash value, for use in step 218.
Step 2163: Add the hash value and the last n bit of the logical address of the a-th Byte of the data to be compressed to the hash table. Then shift the window used to generate the hash value to the right, that is, a = a + W where W is an integer greater than 0, and return to step 214.
If n is not an integer multiple of 8, the bits added to the hash table may include, in addition to the last n bit of the logical address of the a-th Byte of the data to be compressed, one or more bits above the last n bit. That is, the hash value and the last (8 * the length of the value of the hash table) bit of the logical address of the a-th Byte of the data to be compressed are added to a new row of the hash table.
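Steps 216 through 2163 for a single window can be sketched as one lookup-and-update routine. This is a simplified illustration, not the claimed implementation: the name `probe` is hypothetical, a Python dict stands in for the hash table, and a missing key (rather than a value cleaned to 0) marks an empty row.

```python
def probe(table: dict, key: int, addr_bits: int, K: int):
    """Look up key, record addr_bits for it, and return the previous
    value if it lies within 2^K of the current address, else None.

    addr_bits is the last (8 * value length) bit of the logical address
    of the a-th Byte; key is the hash value of the current window.
    """
    previous = table.get(key)        # step 216: is the key present?
    table[key] = addr_bits           # step 2161 update, or step 2163 insert
    if previous is None:
        return None                  # no candidate; the caller shifts the window
    if addr_bits - previous >= (1 << K):
        return None                  # step 2162: candidate too far back
    return previous                  # candidate for matching in step 218


table = {}
first = probe(table, key=7, addr_bits=0x0190, K=16)
second = probe(table, key=7, addr_bits=0x07D0, K=16)
print(first, hex(second))
```

The distance check uses the value read before the update, consistent with the note on step 2162.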
The improvement of the data compression method provided by this application over existing data compression methods is analyzed in detail below.
Figure PCTCN2016101259-appb-000004
Table 2
In Table 2, the current logical address is the start logical address of the (m+1) Byte of data currently being hashed.
If the hash value corresponding to the a-th to (a+m)-th Byte of data already exists as the key of some row of the hash table, the value of the row containing the hash value must be read and then updated with the relative address of the a-th Byte of data; that is, one read of the hash table and one write to the hash table are required.
In the prior art, taking a recorded value of 400 in the row containing the hash value as an example: to obtain the complete start logical address of the matched historical data, so that the (m+1) Byte of data currently being hashed can subsequently be matched against it, 400 must be added to 0x FFFF FFFF 0000 0001, giving 0x FFFF FFFF 0000 0191. 0x FFFF FFFF 0000 0191 is the start logical address of the historical data identical to the (m+1) Byte of data currently being hashed.
Also in the prior art, to update the value of the row containing the hash value, the relative address of the (m+1) Byte of data currently being hashed must be stored in the value of the matched row. This requires subtracting 0x FFFF FFFF 0000 0001 from 0x FFFF FFFF 0000 07D1, giving 2000, and then updating the value of the row containing the hash value with 2000.
Thus, in the prior art, if the hash value corresponding to the (m+1) Byte of data currently being hashed already exists in the hash table, one addition and one subtraction are required.
By contrast, in the compression method provided by this application, taking a recorded value of 0x 0190 in the row containing the hash value as an example: to obtain the complete start logical address of the matched historical data, 0x 0190 is ORed with 0x FFFF FFFF 0000 0000, giving 0x FFFF FFFF 0000 0190. 0x FFFF FFFF 0000 0190 is the start logical address of the historical data identical to the (m+1) Byte of data currently being hashed.
Also in the compression method provided by this application, to update the value of the row containing the hash value, the relative address of the (m+1) Byte of data currently being hashed is written to that value. Because the last N bit of the start logical address of the data to be compressed are 0, the value of the matched row can be updated directly with 0x 07D0.
Thus, in the compression method provided by this application, if the hash value corresponding to the (m+1) Byte of data currently being hashed already exists in the hash table, only one OR operation is required. Compared with the one addition and one subtraction required in the prior art, this reduces the time needed for the operation and improves compression speed.
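The arithmetic in the comparison above can be reproduced with a short sketch. The variable names are hypothetical; the addresses and stored values are the ones used in the text, and a 16-bit mask stands in for the truncation that happens when the low address bits are written to the value.

```python
# Prior art: unaligned start address, value stored as a decimal offset.
BASE_UNALIGNED = 0xFFFF_FFFF_0000_0001
history_prior = BASE_UNALIGNED + 400                        # one addition
new_value_prior = 0xFFFF_FFFF_0000_07D1 - BASE_UNALIGNED    # one subtraction

# This application: start address whose last N bit are 0.
BASE_ALIGNED = 0xFFFF_FFFF_0000_0000
history_new = BASE_ALIGNED | 0x0190                         # one OR operation
new_value_new = 0xFFFF_FFFF_0000_07D0 & 0xFFFF              # low bits used directly

print(hex(history_prior), new_value_prior)
print(hex(history_new), hex(new_value_new))
```

Both schemes record the same offset (2000, or 0x 07D0); the aligned start address only changes how it is derived.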
If the hash value corresponding to the a-th to (a+m)-th Byte of data does not exist as the key of any row of the hash table, that is, it matches no key in the hash table, the hash value and the relative address of the a-th Byte of the data to be compressed are added to the hash table; one write to the hash table is required.
In the hash-table write scenario, the prior art needs to subtract 0x FFFF FFFF 0000 0001 from 0x FFFF FFFF 0000 07D1 to obtain 2000, and then stores the hash value corresponding to the a-th to (a+m)-th Byte of data together with 2000 in the hash table.
By contrast, in the compression method provided by this application, taking a start logical address of 0x FFFF FFFF 0000 07D0 for the a-th to (a+m)-th Byte of data as an example, the hash value corresponding to the a-th to (a+m)-th Byte of data and 0x 07D0 are stored in the hash table.
Thus, in the compression method provided by this application, if the hash value corresponding to the (m+1) Byte of data currently being hashed does not exist as the key of any row of the hash table, the last n bit of the start logical address of the a-th to (a+m)-th Byte of data are written directly to the hash table. Compared with the one subtraction and one write required in the prior art, this reduces the time needed to operate on the hash table and improves compression speed.
Step 218: Match the historical data identical to the (m+1) Byte of data currently being hashed against the (m+1) Byte of data currently being hashed, Byte by Byte to the right; generate the compression code for this match according to the matching result; and store the compression code in the second storage space.
Specifically, after the start logical address of the historical data identical to the (m+1) Byte of data currently being hashed is obtained, the historical data of the data to be compressed is read from that start logical address and matched Byte by Byte against the a-th Byte of data and the data following it, until no further match is possible.
The compression code includes: the length of the match between the a-th Byte and the data following it and the historical data; the relative address of the historical data; and the data between the last Byte recorded by the previous compression code and the first Byte matched this time.
例如待压缩数据包括abcdefghabcdef,假设第一个a为的相对地址为100,当前窗口包括第9个字符至第12个字符,且E=1为例,如表3。For example, the data to be compressed includes abcdefghabcdef, assuming that the relative address of the first a is 100, the current window includes the ninth character to the twelfth character, and E=1 is an example, as shown in Table 3.
Key                   Value
hash value of abcd    100
hash value of bcde    101
hash value of cdef    102
hash value of defg    103
hash value of efgh    104
hash value of fgha    105
hash value of ghab    106
hash value of habc    107
Table 3
After the hash value corresponding to the 9th to the 12th character (that is, abcd) is obtained, it matches the key of the first row of the hash table. The 1st character is therefore read according to the value of the first row, and the 9th character is compared with the 1st character, the 10th character with the 2nd character, and so on, matching to the right until no further match is possible. In this example, the 9th to the 14th characters are identical to the 1st to the 6th characters. The generated compression code therefore includes: abcdefgh, 100, 6. Here abcdefgh is the data between the last Byte recorded by the previous compression code and the first Byte matched this time, 100 is the relative address of the historical data matched by the first Byte after h, and 6 is the match length. The data to be compressed is restored from this compression code as follows: first extract abcdefgh; then, according to 100 and 6, take the first 6 characters of abcdefgh, that is, abcdef; finally append abcdef to abcdefgh, which restores the data to be compressed, abcdefghabcdef.
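The worked example above can be retraced with a short sketch. The function names `extend_match` and `decode`, the tuple layout of the compression code, and the use of 0-based positions are assumptions made for illustration; the application does not define a byte-level encoding.

```python
def extend_match(data: bytes, hist: int, cur: int) -> int:
    """Count how many bytes starting at cur equal the bytes starting at hist (0-based)."""
    length = 0
    while cur + length < len(data) and data[hist + length] == data[cur + length]:
        length += 1
    return length

def decode(literals: bytes, rel_addr: int, length: int, first_rel_addr: int) -> bytes:
    """Rebuild the data from one (literals, relative address, match length) code."""
    out = bytearray(literals)
    start = rel_addr - first_rel_addr   # position of the history data inside out
    for i in range(length):             # copy byte by byte from the already restored data
        out.append(out[start + i])
    return bytes(out)

data = b"abcdefghabcdef"
FIRST_REL_ADDR = 100                    # relative address of the first 'a', as in Table 3
hist, cur = 0, 8                        # 1st and 9th characters, 0-based
length = extend_match(data, hist, cur)
code = (data[:cur], FIRST_REL_ADDR + hist, length)
restored = decode(*code, FIRST_REL_ADDR)
```

In this sketch `code` holds the three parts named in the text (unmatched data, relative address, match length), and `restored` is obtained by appending the copied history bytes to the unmatched data.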
Because the window used to generate the hash value may be shifted to the right after either step 2162 or step 2163, some data may be neither recorded in the compression code generated by the previous execution of step 218 nor located inside the current window, lying instead before the start of the window of the current execution of step 218. Such data must be recorded in the compression code generated by the current execution of step 218.
Step 220: determine whether all of the data to be compressed has been compressed, that is, whether the (a+m)-th Byte data is the last Byte of the data to be compressed. If so, end the compression coding and store the compression code in the third storage space into the storage device. If not, shift the window used to generate the hash value to the right, that is, a=a+E, where E is an integer greater than 0, and return to step 214.
Q, W, and E are all window shift lengths, that is, the number of Bytes by which the window slides to the right.
Because the cache of the storage control device is limited, the distance between the (m+1) Byte of data currently being hashed in step 218 and the matched historical data should not be too large; otherwise the two cannot reside in the cache at the same time, the cache has to be refreshed, and the compression speed suffers. Step 2162 therefore determines whether the difference between the logical address of the historical data and that of the (m+1) Byte of data currently being hashed is greater than 2^K, where 2^K Byte may be the size of the cache of the storage controller. If the matched historical data and the (m+1) Byte of data currently being hashed are separated by 2^K Byte of data or more, step 220 is not performed for this match. If they are separated by less than 2^K Byte of data, the matched historical data and the (m+1) Byte of data currently being hashed can reside in the cache at the same time, so step 220 is performed.
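A minimal sketch of the distance check of step 2162 follows. The function name `may_match` and the example addresses are hypothetical; the threshold follows the text, which allows the match only when the two pieces of data are separated by less than 2^K Byte.

```python
def may_match(cur_addr: int, hist_addr: int, k: int) -> bool:
    """Return True when the history data lies less than 2**k bytes before the current data."""
    return cur_addr - hist_addr < (1 << k)

K = 16                                   # assumed: cache of 2**16 bytes
near = may_match(70000, 5000, K)         # distance 65000, below 2**16
far = may_match(70000, 4464, K)          # distance exactly 2**16, too far
```

With these example addresses `near` is True and `far` is False, so only the first candidate would be handed to the Byte-by-Byte matching.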
It should be noted that step 2162 is optional; that is, after step 2161, step 218 may be performed directly without performing step 2162.
This branch takes as an example data to be compressed of size 2^16 Byte, with m=3.
Step 222: clear the hash table.
Step 224: set the length of the value of the hash table to no less than K/8 Byte.
Steps 222 and 224 may be performed in either order.
Because the size of the data to be compressed is not greater than 2^K Byte, the length of the value of the hash table is no less than K/8 Byte. If K is not a multiple of 8, then in step 224 the length of the value of the hash table is set to no less than
Figure PCTCN2016101259-appb-000005
Because the size of the data to be compressed is 2^16 Byte, a value length of 2 Byte is sufficient to represent the relative address of any Byte of the data to be compressed.
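The arithmetic behind the 2 Byte value length can be checked with the sketch below. The helper `value_len_bytes` rounds K/8 up to a whole Byte; that rounding rule is an assumption of this sketch, since the application gives the expression for the case where K is not a multiple of 8 only as a figure.

```python
def value_len_bytes(k: int) -> int:
    """Smallest whole number of bytes that holds k address bits (assumed rounding rule)."""
    return (k + 7) // 8

K = 16
size = 1 << K                           # 2**16 bytes of data to be compressed
last_offset = size - 1                  # largest relative address that can occur
encoded = last_offset.to_bytes(value_len_bytes(K), "big")
```

Here `encoded` is 2 bytes long, which confirms that every relative address from 0 to 2^16 - 1 fits in a 2 Byte value.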
Step 222 may be performed at any time before step 228, provided that the hash table is cleared before it is used in step 228.
Step 226: generate a hash value from the b-th Byte to the (b+m)-th Byte of the data to be compressed, where b is an integer greater than 0. The first time step 226 is performed, b is 1.
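As a sketch of step 226, the following code hashes the (m+1) Byte window that starts at the b-th Byte, with b counted from 1 as in the text. The application does not specify a hash function; the multiplicative hash, its constant, and the `table_bits` parameter are assumptions chosen only to make the example concrete.

```python
def window_hash(data: bytes, b: int, m: int, table_bits: int = 12) -> int:
    """Hash bytes b..b+m (1-based, inclusive) of data into table_bits bits."""
    window = data[b - 1 : b + m]                 # the (m+1)-byte window
    word = int.from_bytes(window, "little")
    product = (word * 2654435761) & 0xFFFFFFFF   # assumed multiplicative hash, kept to 32 bits
    return product >> (32 - table_bits)

data = b"abcdefghabcdef"
h_first = window_hash(data, 1, 3)                # window "abcd" at b = 1
h_ninth = window_hash(data, 9, 3)                # window "abcd" again at b = 9
```

Identical windows always produce identical hash values, which is what lets the lookup in step 228 find the earlier occurrence of the same (m+1) Byte.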
Step 228: determine whether the hash value matches any key of the hash table. If it matches, perform step 2301; if it does not match, perform step 2302.
Step 2301: obtain the value of the row containing the matched hash value, and update the value of that row according to the last n bit of the logical address of the b-th Byte data of the data to be compressed.
If n is not an integer multiple of 4, the bits used to update the value of the row containing the hash value may include, in addition to the last n bit of the logical address of the b-th Byte data of the data to be compressed, one or more bits above those n bit. In other words, the last (8 * length of the value of the hash table) bit of the logical address of the b-th Byte data of the data to be compressed are used to update the value of that row of the hash table.
In step 2301, the start logical address of the historical data identical to the (m+1) Byte of data currently being hashed must also be obtained from the value of the row containing the matched hash value, for use in step 232.
Step 2302: add the hash value and the last n bit of the logical address of the b-th Byte data of the data to be compressed to the hash table. Then shift the window used to generate the hash value to the right, that is, b=b+R, where R is an integer greater than 0, and return to step 226.
If n is not an integer multiple of 4, what is added to the hash table may include, in addition to the last n bit of the logical address of the b-th Byte data of the data to be compressed, one or more bits above those n bit. In other words, the hash value and the last (8 * length of the value of the hash table) bit of the logical address of the b-th Byte data of the data to be compressed are added to a new row of the hash table.
With reference to the description of Table 2 above, in the compression method provided by the present application, if the hash value corresponding to the (m+1) Byte of data currently being hashed matches the key of a row in the hash table, only one OR operation is required. Compared with the prior art, which requires one addition operation and one subtraction operation, this reduces the time needed for the operation and increases the compression speed.
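One plausible reading of the single OR operation is that it rebuilds the full logical address of the historical data from the aligned start address and the stored value. The sketch below illustrates that reading; the address `base`, the value `stored_value`, and the interpretation itself are assumptions, not statements taken from the application.

```python
N = 16
base = 0x5A5A_0000                 # hypothetical start address, last N bits are 0
stored_value = 0x0064              # value read from the matching hash-table row

# The bit ranges of base and stored_value do not overlap, so a bitwise OR
# produces the same address as an addition would, without any carry.
hist_addr = base | stored_value
```

Under this reading, the prior-art addition (start address plus stored offset) is replaced by the OR, and the prior-art subtraction is avoided because the truncated address is stored as is.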
Likewise, in the compression method provided by the present application, if the hash value corresponding to the (m+1) Byte of data currently being hashed does not match the key of any row in the hash table, the last (8 * length of the value of the hash table) bit of the start logical address of the data from the b-th Byte to the (b+m)-th Byte are written directly into the hash table. Compared with the prior art, which requires one subtraction operation and one write operation, this reduces the time needed for the operation and increases the compression speed.
Step 232: match the historical data that is identical to the (m+1) Byte of data currently being hashed against the (m+1) Byte of data currently being hashed, Byte by Byte to the right; generate the compression code corresponding to this match according to the matching result; and store the compression code in the third storage space.
For details of generating the compression code in step 232, refer to the description of step 218 above.
Step 234: determine whether all of the data to be compressed has been compressed, that is, whether the (b+m)-th Byte data is the last Byte of the data to be compressed. If so, end the compression coding and store the compression code in the third storage space into the storage device. If not, shift the window used to generate the hash value to the right, that is, b=b+T, where T is an integer greater than 0, and return to step 226.
R and T are both window shift lengths, that is, the number of Bytes of data by which the window slides to the right.
Because step 208 determines whether the data to be compressed is larger than 2^K Byte, the data to be compressed in the branch of steps 222 to 234 is no larger than 2^K Byte. The difference between the logical address of any Byte of historical data and the logical address of the (m+1) Byte of data currently being hashed is therefore certainly no greater than 2^K, and no check like that of step 2162 is needed. This shortens the compression flow and further increases the compression speed.
It should be noted that step 208 is optional.
If step 208 is not used, step 210, step 214, and the steps following step 214 are performed directly after step 207. In this case, the size of the data to be compressed is not known before the hash table is operated on in step 2161 or step 2163, so the length of the logical address to be written into the hash table has to be determined from the size of the data to be compressed.
For example, the size of the data to be compressed is 2^16 Byte and the storage controller runs a 64-bit operating system. Before step 2161 or step 2163, it therefore has to be confirmed, from the size of the data to be compressed, that the last 16 bit of the logical address of the a-th Byte data are used to update the hash table.
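The example above can be sketched as follows: the number of address bits to keep is derived from the size of the data to be compressed, and a 64-bit logical address is then truncated to that many bits. The helper name `offset_bits` and the example address are hypothetical.

```python
def offset_bits(data_size: int) -> int:
    """Number of low address bits needed to tell apart every byte of the data."""
    return (data_size - 1).bit_length()

addr = 0x0000_7FFE_1234_0000 + 0xBEEF    # hypothetical 64-bit logical address of the a-th byte
n = offset_bits(1 << 16)                 # data to be compressed is 2**16 bytes
value = addr & ((1 << n) - 1)            # last n bits, used to update the hash table
```

For 2^16 Byte of data this gives n = 16, so only the last 16 bit of the 64-bit address are kept. Without step 208, a determination of this kind would have to precede each hash table operation.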
Using step 208 avoids having to determine the size of the data to be compressed for every operation on the hash table, which further increases the compression speed.
As shown in FIG. 4, the present application further provides a data compression device 400, which may be the storage controller in FIG. 1 or the data compression device in FIG. 2. The data compression device 400 includes a communication interface 402 and a processing chip 404, between which a communication connection is established. When running, the data compression device 400 performs the data compression method corresponding to FIG. 3.
The communication interface 402 is used to communicate with external devices, such as a client that writes the data to be compressed, a storage device in a storage array, or a network device in a communication network. The communication interface 402 may be an input/output interface of the data compression device 400.
The communication interface 402 is specifically used to perform the step of acquiring the data to be compressed in step 207, and the step of storing the compression code in the third storage space into the storage device after step 220 and step 234. If the data compression device 400 is the data compression device in FIG. 2, then after step 220 and step 234 the communication interface 402 is used to send the compression code in the third storage space to the communication network.
The processing chip 404 is used to perform steps 202 to 206 and the step of storing the data to be compressed into the first storage space in step 207; it is further used to perform steps 208 to 220, and to perform steps 208 to 234.
The processing chip 404 may be implemented by an application-specific integrated circuit (ASIC) or by a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
As shown in FIG. 5, the processing chip 404 may also be implemented by a processor, a storage device, and a logic chip, where the logic chip may be implemented by a PLD or an ASIC. When the processing chip 404 runs, the processor and the logic chip each perform part of the functions, and the functions may be divided between them in various ways. For example, the processor reads the code in the memory to perform steps 202 to 207. After the first storage space, the second storage space, and the third storage space have all been allocated in the memory and the data to be stored has been stored in the first storage space, the logic chip completes the subsequent steps.
The data compression device provided above sets the last N bit of the address of the storage space that stores the data to be compressed to 0, so that both reads and writes of the hash table during compression of the data to be compressed become simpler, which increases the compression speed.
FIG. 6 shows a computing device provided by the present application. The computing device 600 may be the storage controller in FIG. 1 or the data compression device in FIG. 2. The computing device 600 includes a processor 602 and a memory 604, and may further include a bus 606 and a communication interface 608.
The communication interface 608 is used to communicate with external devices, such as a client that writes the data to be compressed, a storage device in a storage array, or a network device in a communication network. The communication interface 608 may be an input/output interface of the computing device 600.
The processor 602, the memory 604, and the communication interface 608 may be communicatively connected to one another through the bus 606, or may communicate by other means such as wireless transmission.
The processor 602 may be a central processing unit (CPU).
The memory 604 may include a volatile memory, such as a random-access memory (RAM).
Optionally, the memory 604 may further include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD, or an SSD; the memory 604 may also include a combination of the above types of memory.
When the computing device 600 is the storage controller in FIG. 1, the storage controller is connected to multiple storage devices in the storage array, so the memory 604 need not include a non-volatile memory; the non-volatile memory of the computing device 600 is then provided by the storage devices of the storage array.
When the computing device 600 is the data compression device in FIG. 2, it can send the compression code directly to the communication network and does not need to store the compression code in a non-volatile memory, so the memory 604 likewise need not include a non-volatile memory.
When the technical solution provided by the present application is implemented in software, the program code for implementing the data compression method provided in FIG. 3 of the present application is stored in the memory 604 and executed by the processor 602.
The computing device provided above sets the last N bit of the address of the storage space that stores the data to be compressed to 0, so that both reads and writes of the hash table during compression of the data to be compressed become simpler, which increases the compression speed.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The methods described in connection with the disclosure of the present application may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a RAM, a flash memory, a ROM, an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a hard disk, an SSD, an optical disc, or any other form of storage medium well known in the art.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present application may be implemented in hardware or software. When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The specific embodiments above further describe the objectives, technical solutions, and beneficial effects of the present application in detail. It should be understood that the above are merely specific embodiments of the present application and are not intended to limit its scope of protection. Any modification, improvement, or the like made on the basis of the technical solutions of the present application shall fall within the scope of protection of the present application.

Claims (11)

  1. A data compression method, characterized by comprising:
    allocating a storage space, wherein the last N bit of the start logical address of the storage space are 0, and N is an integer greater than 1;
    storing data to be compressed into the storage space, wherein the size of the data to be compressed is 2^n Byte, and n is not greater than N;
    performing a hash operation on the a-th Byte data to the (a+m)-th Byte data of the data to be compressed to generate a hash value, wherein a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed;
    determining whether a key identical to the hash value exists in a hash table, wherein a key of the hash table is a hash value generated by performing the hash operation on (m+1) Byte of historical data of the (a+m)-th Byte data, and a value of the hash table includes the last n bit of the start logical address of the (m+1) Byte of historical data of the (a+m)-th Byte data;
    if a key identical to the hash value exists in the hash table, updating the value corresponding to the hash value in the hash table according to the last n bit of the logical address of the a-th Byte data;
    if no key identical to the hash value exists in the hash table, adding the hash value and the last n bit of the logical address of the a-th Byte data to the hash table.
  2. The data compression method according to claim 1, characterized in that the data to be compressed includes a plurality of data blocks.
  3. The data compression method according to claim 1 or 2, characterized in that, before the determining whether a key identical to the hash value exists in the hash table, the method further comprises:
    determining whether the size of the data to be compressed is greater than 2^K Byte, wherein K is an integer greater than 0;
    if the size of the data to be compressed is greater than 2^K Byte, setting the length of the value of the hash table to no less than (K/8+1) Byte;
    if the size of the data to be compressed is less than or equal to 2^K Byte, setting the length of the value of the hash table to no less than K/8 Byte.
  4. The data compression method according to claim 3, characterized in that, if the size of the data to be compressed is greater than 2^K Byte, then after the value corresponding to the hash value in the hash table is updated according to the last n bit of the logical address of the a-th Byte data, the method further comprises:
    if the difference between the last (8 * length of the value of the hash table) bit of the logical address of the a-th Byte data and the value corresponding to the hash value in the hash table is less than 2^K, matching the a-th Byte data and the data following the a-th Byte data against the historical data indicated by the value corresponding to the hash value, and generating a compression code according to the matching result;
    if the difference between the last (8 * length of the value of the hash table) bit of the logical address of the a-th Byte data of the data to be compressed and the value corresponding to the hash value in the hash table is not less than 2^K, not matching the a-th Byte data and the data following the a-th Byte data against the historical data indicated by the value corresponding to the hash value.
  5. The data matching method according to claim 4, characterized in that, after the generating a compression code according to the matching result, the method further comprises:
    determining whether the (a+m)-th Byte data is the last Byte of the data to be compressed; if so, ending the encoding of the data to be compressed; if not, shifting the window on which the hash operation is performed to the right.
  6. A data compression device, characterized by comprising a communication interface and a processing chip, wherein the communication interface is connected to the processing chip;
    the processing chip is configured to allocate a storage space, wherein the last N bit of the start logical address of the storage space are 0, and N is an integer greater than 1;
    the communication interface is configured to acquire data to be compressed and store the data to be compressed into the storage space, wherein the size of the data to be compressed is 2^n Byte, and n is not greater than N;
    the processing chip is further configured to: perform a hash operation on the a-th Byte data to the (a+m)-th Byte data of the data to be compressed to generate a hash value, wherein a is an integer greater than 0, m is an integer greater than 0, and (m+1) is the size of the window on which the hash operation is performed; determine whether a key identical to the hash value exists in a hash table, wherein a key of the hash table is a hash value generated by performing the hash operation on (m+1) Byte of historical data of the (a+m)-th Byte data, and a value of the hash table includes the last n bit of the start logical address of the (m+1) Byte of historical data of the (a+m)-th Byte data; if a key identical to the hash value exists in the hash table, update the value corresponding to the hash value in the hash table according to the last n bit of the logical address of the a-th Byte data; and if no key identical to the hash value exists in the hash table, add the hash value and the last n bit of the logical address of the a-th Byte data to the hash table.
  7. 如权利要求6所述的设备,其特征在于,所述待压缩数据包括多个数据block。The device of claim 6, wherein the data to be compressed comprises a plurality of data blocks.
  8. 如权利要求6或7所述的设备,其特征在于,所述处理芯片判断哈希表中是否存在与所述哈希值相同的key前,还用于判断所述待压缩数据的大小是否大于2K Byte,K为大于0的整数;若所述待压缩数据的大小大于2K Byte,设置所述哈希表的value的长度不少于(K/8+1)Byte;若所述待压缩数据的大小小于或等于2K Byte,设置所述哈希表的value的长度不少于K/8Byte。The device according to claim 6 or 7, wherein the processing chip determines whether the size of the data to be compressed is greater than or equal to whether the key has the same value as the hash value in the hash table. 2 K Byte, K is an integer greater than 0; if the size of the data to be compressed is greater than 2 K Byte, the value of the value of the hash table is set to be no less than (K/8+1) Byte; The size of the compressed data is less than or equal to 2 K Byte, and the length of the hash table is set to be no less than K/8 Byte.
  9. 如权利要求8所述的设备,其特征在于,若所述待压缩数据的大小大于2K Byte,则所述处理芯片在根据所述第a Byte数据的逻辑地址的末尾(8*所述哈希表的value的长度)bit更新所述哈希表中所述哈希值对应的value后,还用于若所述待压缩数据的第a Byte数据的逻辑地址的末尾(8*所述哈希表的value的长度)bit与所述哈希表中所述哈希值对应的value之差小于2K,则将所述第a Byte数据及所述第a Byte数据后的数据与所述哈希值对应的value指示的历史数据进行匹配,根据匹配结果生成压缩编码;以及若所述待压缩数据的第a Byte数据的逻辑地址的末尾n bit与所述哈希表中所述哈希值对应的value之差不小于2K,则不将所述第aByte数据及所述第a Byte数据后的数据与所述哈希值对应的value指示的历史数据进行匹配。The device according to claim 8, wherein if the size of the data to be compressed is greater than 2 K Byte, the processing chip is at the end of the logical address according to the a-th byte data (8* The length of the value of the hash table) bit is used to update the value corresponding to the hash value in the hash table, and is also used to end the logical address of the a Byte data of the data to be compressed (8* And the difference between the value of the value of the hash table and the value corresponding to the hash value in the hash table is less than 2 K , and the data after the a Byte data and the a Byte data are The historical data indicated by the value corresponding to the hash value is matched, and the compression encoding is generated according to the matching result; and the end n bit of the logical address of the a Byte data of the data to be compressed and the hash in the hash table If the difference between the values corresponding to the values is not less than 2 K , the data after the a-th byte data and the a-th Byte data are not matched with the history data indicated by the value corresponding to the hash value.
  10. The device according to claim 9, wherein, after generating the compression encoding, the processing chip is further configured to:
    determine whether the (a+m)-th Byte of data is the last Byte of the data to be compressed; if so, end the encoding of the data to be compressed; if not, shift the window on which the hash operation is performed to the right.
  11. A computing device, comprising a processor and a memory, wherein the processor and the memory establish a communication connection;
    when running, the processor reads the program in the memory and performs the method according to any one of claims 1 to 5.
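
The value-length rule of claim 8 and the distance check of claim 9 can be illustrated with a minimal Python sketch. This is not the applicant's implementation. The function names are hypothetical, K/8 is rounded up when K is not a multiple of 8 (the claims do not say how to handle that case), the "difference" is taken modulo 2^(8 * value length) so that truncated addresses may wrap around, and the comparison is read as being made against the value stored before the update. All four points are assumptions.

```python
def value_length_bytes(data_size: int, k: int) -> int:
    """Minimum hash-table value length in bytes (claim 8).

    Assumption: K/8 is rounded up when K is not a multiple of 8.
    """
    base = (k + 7) // 8
    # Data larger than 2^K bytes gets one extra byte per value.
    return base + 1 if data_size > (1 << k) else base


def truncate_address(addr: int, value_len: int) -> int:
    """Keep only the last 8*value_len bits of a logical address."""
    return addr & ((1 << (8 * value_len)) - 1)


def should_match(addr: int, stored: int, k: int, value_len: int) -> bool:
    """Distance check of claim 9.

    `stored` is the truncated address previously held in the hash table.
    Assumption: the difference is computed modulo 2^(8*value_len).
    """
    modulus = 1 << (8 * value_len)
    distance = (truncate_address(addr, value_len) - stored) % modulus
    return distance < (1 << k)


if __name__ == "__main__":
    K = 16
    length = value_length_bytes(100_000, K)  # data larger than 2^16 bytes
    old = truncate_address(10_000, length)
    print(length, should_match(70_000, old, K, length))
```

Under these assumptions, the extra byte lets the check tell a candidate inside the 2^K-byte window apart from an older entry whose low K address bits happen to be close, which a K-bit value alone could not do.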
PCT/CN2016/101259 2016-09-30 2016-09-30 Data compression method and device, and computation device WO2018058604A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2016/101259 WO2018058604A1 (en) 2016-09-30 2016-09-30 Data compression method and device, and computation device
CN201680089676.XA CN110419036B (en) 2016-09-30 2016-09-30 Data compression method and device and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/101259 WO2018058604A1 (en) 2016-09-30 2016-09-30 Data compression method and device, and computation device

Publications (1)

Publication Number Publication Date
WO2018058604A1 (en) 2018-04-05

Family

ID=61763588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101259 WO2018058604A1 (en) 2016-09-30 2016-09-30 Data compression method and device, and computation device

Country Status (2)

Country Link
CN (1) CN110419036B (en)
WO (1) WO2018058604A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259203B (en) * 2020-01-08 2023-08-25 上海兆芯集成电路股份有限公司 Data compressor and data compression method
CN113326001B (en) * 2021-05-20 2023-08-01 锐掣(杭州)科技有限公司 Data processing method, device, apparatus, system, medium, and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020317A (en) * 2013-01-10 2013-04-03 曙光信息产业(北京)有限公司 Device and method for data compression based on data deduplication
CN105207678A (en) * 2015-09-29 2015-12-30 东南大学 Hardware realizing system for improved LZ4 compression algorithm
CN105631013A (en) * 2015-12-29 2016-06-01 华为技术有限公司 Device and method for generating Hash value
CN105718385A (en) * 2014-12-23 2016-06-29 三星电子株式会社 Data Storage Device, Method Of Operating The Same, And Data Processing System
US20160283398A1 (en) * 2015-03-27 2016-09-29 International Business Machines Corporation Data Compression Accelerator Methods, Apparatus and Design Structure with Improved Resource Utilization

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566295B2 (en) * 2011-05-31 2013-10-22 John E. G. Matze System and method for electronically storing essential data
CN104077272B (en) * 2014-06-23 2017-01-04 华为技术有限公司 A kind of method and apparatus of dictionary compression
CN105022593B (en) * 2015-08-18 2017-09-26 南京大学 A kind of storage optimization method cooperateed with based on data compression and data de-redundant

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508334A (en) * 2018-11-23 2019-03-22 中科驭数(北京)科技有限公司 For the data compression method of block chain database, access method and system
CN111835359A (en) * 2019-04-22 2020-10-27 深圳捷誊技术有限公司 Compression device, storage medium, and method and device for repeating information query and update
CN111835359B (en) * 2019-04-22 2022-03-22 深圳捷誊技术有限公司 Compression device, storage medium, and method and device for repeating information query and update
CN113765854A (en) * 2020-06-04 2021-12-07 华为技术有限公司 Data compression method and server
CN113765854B (en) * 2020-06-04 2023-06-30 华为技术有限公司 Data compression method and server

Also Published As

Publication number Publication date
CN110419036B (en) 2022-04-12
CN110419036A (en) 2019-11-05

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16917354; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 16917354; Country of ref document: EP; Kind code of ref document: A1)