WO2021082926A1 - Method and device for data compression - Google Patents

Method and device for data compression

Info

Publication number
WO2021082926A1
Authority
WO
WIPO (PCT)
Prior art keywords
data block
data
queue
fingerprint
blocks
Prior art date
Application number
PCT/CN2020/120980
Other languages
English (en)
French (fr)
Inventor
王力玉
关坤
刘帮
张真波
沈建强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021082926A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • This application relates to the field of storage technology, and in particular to a method and device for data compression.
  • data reduction technology is an indispensable key technology in the storage system.
  • data compression, deduplication, and similar data compression can be implemented to reduce the storage capacity used in the storage system, thereby increasing the utilization of storage space.
  • an important indicator to measure the performance of a storage system can be reduction efficiency, reduction rate, or compression ratio.
  • the present application provides a data compression method and device, which solves the problems of low utilization rate of storage space for similar data compression in the prior art and low reduction rate of data compression.
  • a data compression method comprising: acquiring a first fingerprint and a second fingerprint of a first data block; adding the first data block to a first data block queue according to the first fingerprint, wherein the data blocks of the first data block queue all include the first fingerprint; adding the first data block to a second data block queue according to the second fingerprint, wherein the data blocks of the second data block queue all include the second fingerprint; when the data size after the data blocks in the first data block queue are compressed with reference to a first reference data block does not reach a preset threshold, obtaining a second data block from the second data block queue, wherein the second data block queue includes a data block with the same fingerprint as the first reference data block; and compressing the second data block with reference to the first reference data block.
  • multiple data block queues are obtained based on multiple fingerprints extracted from multiple data blocks, and the data blocks in a data block queue, which share a fingerprint, are compressed by reference. When the data size after the data blocks in the first data block queue are compressed with reference to the first reference data block does not reach the preset threshold, a data block in the second data block queue that includes the same fingerprint as the first reference data block is also compressed with reference to the first reference data block, thereby improving the utilization of storage space and avoiding waste of storage resources.
  • the method further includes: determining a second reference data block among uncompressed data blocks in the second data block queue; and compressing the uncompressed data blocks in the second data block queue with reference to the second reference data block.
  • the method further includes: determining the first reference data block, where the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
  • a data compression method includes: obtaining a plurality of data block queues according to a plurality of fingerprints corresponding to each data block in a plurality of data blocks, wherein the data blocks in each data block queue share the same fingerprint; if the data size after compressing the data blocks in the first data block queue with reference to the first reference data block does not reach the preset threshold, obtaining at least one data block from the second data block queue, wherein the second data block queue includes data blocks with the same fingerprint as the first reference data block; and compressing the at least one data block with reference to the first reference data block.
  • the method further includes: determining a second reference data block among the uncompressed data blocks in the second data block queue; and compressing the uncompressed data blocks in the second data block queue with reference to the second reference data block.
  • obtaining the multiple data block queues according to the multiple fingerprints corresponding to each data block in the multiple data blocks includes: determining the similarity between data blocks according to the multiple fingerprints corresponding to each data block, where the more identical fingerprints at least two data blocks have, the higher the similarity of the at least two data blocks; and determining the multiple data block queues in descending order of similarity between the data blocks, wherein multiple data blocks that have the same fingerprint and the highest similarity are in the same data block queue, and the multiple data block queues include the first data block queue.
  • the data block queues of similar data are grouped with reference to the similarity between the data blocks, and the different data block queue groupings are determined in descending order of similarity between the data blocks. The data blocks with the highest similarity are compressed with reference to the same reference block, so that data blocks with high similarity are placed in the same compression group for reference compression, which improves the reduction rate of similar data compression and thereby increases the utilization of storage space.
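  • The similarity rule above (more shared fingerprints means higher similarity, with pairs handled from most to least similar) can be sketched as follows. This is a minimal illustration, not the application's implementation; the block names X, Y, Z and the helper names `similarity` and `rank_pairs` are hypothetical.

```python
from itertools import combinations


def similarity(fingerprints_a, fingerprints_b) -> int:
    """Number of fingerprints two data blocks have in common."""
    return len(set(fingerprints_a) & set(fingerprints_b))


def rank_pairs(blocks):
    """Return (similarity, block, block) tuples, most similar pair first."""
    pairs = [
        (similarity(blocks[a], blocks[b]), a, b)
        for a, b in combinations(sorted(blocks), 2)
    ]
    # Sort by descending similarity, then by name to keep the order stable.
    return sorted(pairs, key=lambda p: (-p[0], p[1], p[2]))


# Hypothetical blocks with three fingerprints each.
example_blocks = {
    "X": ["SFP1", "SFP2", "SFP3"],
    "Y": ["SFP1", "SFP2", "SFP4"],
    "Z": ["SFP1", "SFP5", "SFP6"],
}
ranked = rank_pairs(example_blocks)
```

  • In this sketch X and Y share two fingerprints and would be grouped first, while Z shares only one fingerprint with either of them.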
  • determining the similarity between the data blocks according to the multiple fingerprints corresponding to each of the multiple data blocks includes: obtaining at least one global fingerprint corresponding to each data block according to the multiple fingerprints corresponding to each of the multiple data blocks, and determining the similarity between the data blocks according to the global fingerprint.
  • the similarity between data blocks is determined according to the global fingerprints, which reduces the storage space consumed by a large number of data block fingerprints, and also reduces the read, write, and computing overhead of counting identical fingerprints between data blocks, thereby effectively improving query efficiency and storage space utilization.
  • before the data blocks in the first data block queue are compressed with reference to the first reference data block, the method further includes: determining the first reference data block, where the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
  • the data blocks with the highest similarity are placed in the same compression group, and one of them is used as the reference for similar data compression, thereby effectively increasing the reduction rate of similar data blocks and improving the utilization of storage space.
  • before the data blocks in the first data block queue are compressed with reference to the first reference data block, the method further includes: determining the first reference data block, where the first reference data block is the data block for which the other data blocks in the first data block queue occupy the smallest storage space after being compressed with reference to it.
  • the data block with the highest compression gain in the data block queue is used as the reference data block, and other data blocks are compressed with reference to the reference data block, which can effectively improve the reduction rate of the data block.
  • in a third aspect, a data compression device is provided. The device includes: an acquisition module, configured to acquire a first fingerprint and a second fingerprint of a first data block; add the first data block to a first data block queue according to the first fingerprint, wherein the data blocks of the first data block queue all include the first fingerprint; and add the first data block to a second data block queue according to the second fingerprint, wherein the data blocks of the second data block queue all include the second fingerprint; and a compression module, configured to: when the data size after the data blocks in the first data block queue are compressed with reference to a first reference data block does not reach a preset threshold, acquire a second data block from the second data block queue, wherein the second data block queue includes a data block with the same fingerprint as the first reference data block; and compress the second data block with reference to the first reference data block.
  • the compression module is further configured to determine a second reference data block among uncompressed data blocks in the second data block queue, and to compress the uncompressed data blocks in the second data block queue with reference to the second reference data block.
  • the compression module is further configured to determine the first reference data block, and the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
  • in a fourth aspect, a data compression device is provided. The device includes: an acquisition module, configured to obtain multiple data block queues according to multiple fingerprints corresponding to each data block in multiple data blocks, wherein the data blocks in each data block queue share the same fingerprint; and a compression module, configured to: if the data size after the data blocks in the first data block queue are compressed with reference to the first reference data block does not reach the preset threshold, obtain at least one data block from the second data block queue and compress the at least one data block with reference to the first reference data block, wherein the second data block queue includes a data block with the same fingerprint as the first reference data block.
  • the compression module is further configured to determine a second reference data block among the uncompressed data blocks in the second data block queue, and to compress the uncompressed data blocks in the second data block queue with reference to the second reference data block.
  • the acquisition module is specifically configured to determine the similarity between the data blocks according to the multiple fingerprints corresponding to each data block in the multiple data blocks, wherein the more identical fingerprints at least two data blocks have, the higher the similarity of the at least two data blocks; and to determine the multiple data block queues in descending order of similarity between the data blocks, wherein multiple data blocks that have the same fingerprint and the highest similarity are in the same data block queue, and the multiple data block queues include the first data block queue.
  • the acquisition module is specifically configured to obtain at least one global fingerprint corresponding to each data block according to the multiple fingerprints corresponding to each data block in the multiple data blocks, and to determine the similarity between the data blocks according to the global fingerprint.
  • the compression module is further configured to determine the first reference data block, and the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
  • the compression module is further configured to determine the first reference data block, and the first reference data block is the data block for which the other data blocks in the first data block queue occupy the smallest storage space after being compressed with reference to it.
  • a chip system is provided, which is applied to an electronic device; the chip system includes one or more interface circuits and one or more processors; the interface circuit and the processor are interconnected by wires; the interface circuit is used to receive a signal from the memory of the electronic device and send the signal to the processor, the signal including a computer instruction stored in the memory; when the processor executes the computer instruction, the electronic device executes the data compression method of any one of the first aspect or the second aspect described above.
  • a readable storage medium is provided, and instructions are stored in the readable storage medium. When the instructions run on an electronic device, the electronic device is caused to execute the data compression method of any one of the first aspect or the second aspect described above.
  • a computer program product which when the computer program product runs on a computer, causes the computer to execute the data compression method according to any one of the first aspect or the second aspect.
  • any data compression device, chip system, readable storage medium, and computer program product provided above can all be implemented according to the corresponding data compression method provided above, and therefore, it can achieve
  • FIG. 1 is a schematic diagram of a similar data compression method provided by an embodiment of this application.
  • FIG. 2 is a schematic flowchart of a data compression method provided by an embodiment of this application.
  • FIG. 3 is a first example of a data compression method provided by an embodiment of this application.
  • FIG. 4 is a second example of a data compression method provided by an embodiment of this application.
  • FIG. 5 is a third example of a data compression method provided by an embodiment of this application.
  • FIG. 6 is a fourth example of a data compression method provided by an embodiment of this application.
  • FIG. 7 is a schematic diagram of the flow and modules of a data compression method provided by an embodiment of this application.
  • FIG. 8 is a schematic diagram of a data compression device provided by an embodiment of the application.
  • data reduction technology can effectively improve the utilization of storage space.
  • data reduction means reducing the amount of data, without losing useful information, to reduce storage space and improve transmission, storage, and processing efficiency; or a technical approach that reorganizes data according to a certain algorithm to reduce data redundancy and storage space.
  • the reduction rate of data reduction technology can be the value calculated by dividing the amount of data before compression by the amount of data after compression, which can be used to indicate the compression efficiency of data compression.
  • the larger the reduction rate, the higher the efficiency of data compression, and it is an important performance indicator for the storage system.
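  • As a small worked example of the reduction rate defined above (data amount before compression divided by data amount after compression), the following sketch uses a hypothetical helper named `reduction_rate`; the byte counts are made up for illustration.

```python
def reduction_rate(bytes_before: int, bytes_after: int) -> float:
    """Data amount before compression divided by data amount after compression."""
    if bytes_after <= 0:
        raise ValueError("compressed size must be positive")
    return bytes_before / bytes_after


# Hypothetical figures: 32 KiB of data blocks stored in 8 KiB after compression.
example_rate = reduction_rate(32 * 1024, 8 * 1024)
```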
  • the data reduction technology mainly has the following three implementation methods:
  • Data compression: redundant information in the data is removed through processing, thereby saving storage space.
  • Data compression can be divided into lossless compression and lossy compression.
  • Lossless compression means that decompressing the compressed data yields data exactly the same as the original data.
  • Lossy compression means that decompressing the compressed data yields data that differs from the original data. It is mainly used in the field of image or video compression.
  • Data deduplication: data compression technology can only eliminate redundant information inside files, while data deduplication technology effectively reduces the physical storage space occupied by data by eliminating identical files or data blocks in the distributed storage system. It is widely used in storage backup and archiving systems. For example, a file is divided into data blocks of a corresponding size, and the fingerprint of each data block is calculated. Data blocks with the same fingerprint have the same content, so the original data needs to be stored only once for data blocks with the same fingerprint.
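  • The deduplication example above (split a file into blocks, fingerprint each block, store each distinct block once) can be sketched as follows. This is an illustration under assumptions: SHA-256 as the fingerprint and a 4096-byte block size are choices made here, not taken from the application, and `deduplicate` and `restore` are hypothetical names.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and store each distinct block once."""
    store = {}   # fingerprint -> block content, stored only once
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the stored blocks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


# Three blocks, two of which are identical, so only two blocks are stored.
example_data = b"a" * 4096 + b"b" * 4096 + b"a" * 4096
example_store, example_recipe = deduplicate(example_data)
```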
  • Similar data compression: based on the features extracted from a data block, the data block is compressed with reference to a data block that contains similar or identical data features, that is, a reference data block, and only the increment of the data block relative to the reference is stored. Whereas data deduplication can eliminate redundant data only when data blocks are exactly the same, similar data compression achieves a significant reduction for data blocks that are not exact duplicates but have a certain degree of similarity.
  • data block 1 and data block 2 cannot be deduplicated due to partial differences in content, but the similarity between data block 1 and data block 2 is extremely high.
  • the extracted features of data block 1 and data block 2 are compared, and some of the features are the same, and some of the features are different.
  • data block 1 can be stored as a reference block.
  • with a delta compression algorithm applied to data block 2 relative to data block 1, only the parts of data block 2 that differ from data block 1 need to be stored, that is, the reference increment of data block 2 relative to data block 1, which greatly reduces the storage space occupied by data block 2.
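  • The idea of storing only the increment of data block 2 relative to data block 1 can be illustrated with a stand-in technique: zlib compression with the reference block supplied as a preset dictionary. This is an assumption made for illustration, not the delta algorithm of the application; `delta_compress` and `delta_decompress` are hypothetical names, and the sample blocks are synthetic.

```python
import hashlib
import zlib


def delta_compress(block: bytes, reference: bytes) -> bytes:
    """Compress block so that content shared with reference is not stored again."""
    compressor = zlib.compressobj(level=9, zdict=reference)
    return compressor.compress(block) + compressor.flush()


def delta_decompress(delta: bytes, reference: bytes) -> bytes:
    """Restore a block from its increment; the same reference block is required."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(delta) + decompressor.flush()


# Synthetic 4096-byte block that does not compress well on its own.
block_1 = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
# Block 2 differs from block 1 in only a few bytes.
block_2 = block_1[:2000] + b"CHANGED" + block_1[2007:]

delta_2 = delta_compress(block_2, block_1)
standalone_2 = zlib.compress(block_2, 9)
```

  • In this sketch the increment `delta_2` is much smaller than `standalone_2`, and decompression needs both the increment and the reference block, which matches the description that the reference block and the incremental data are read together.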
  • This application proposes a technical solution based on similar data compression.
  • the similarity between data blocks is analyzed through similar fingerprints so that the data blocks can be grouped according to their similarity, and the similar data blocks are jointly compressed with respect to the reference data block, thereby eliminating more repeated fields between similar data blocks, increasing the reduction rate of data compression, and increasing the utilization of storage space.
  • the method includes the following steps:
  • the data blocks are data blocks to be compressed, and each data block includes at least one fingerprint for identifying different data blocks.
  • a fingerprint, also called a similar fingerprint (SFP), refers to the fingerprint feature of a data block, and is a feature that can be used to characterize a certain degree of similarity between different data blocks.
  • Each data block may include a similar fingerprint, or it may include multiple similar fingerprints.
  • the data size of the same batch of data blocks to be compressed can be the same.
  • for example, file data is divided into data blocks of the same size for data compression. Data blocks of different sizes can also be compressed: if two data blocks share a fingerprint, one can be compressed with reference to the other.
  • the similar fingerprints of the obtained data blocks can be extracted by the feature extraction algorithm.
  • Hash algorithm or other fingerprint calculation methods can be used.
  • a hash algorithm transforms input data of any length into fixed-length output data, and the output data is the hash value.
  • a window of a certain number of bytes can be used, and the feature of the data block within the window can be extracted by a hash algorithm as a fingerprint feature, that is, a similar fingerprint.
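  • One way to picture window-based fingerprint extraction is the sketch below. The source only states that a byte window and a hash algorithm are used; keeping the smallest window hashes as the fingerprints (a min-hash style selection), the BLAKE2b hash, the 16-byte window, and the name `similar_fingerprints` are all assumptions made here for illustration.

```python
import hashlib
from typing import List


def similar_fingerprints(block: bytes, window: int = 16, count: int = 3) -> List[int]:
    """Hash every window of the block and keep the smallest hashes as fingerprints."""
    if window <= 0 or count <= 0:
        raise ValueError("window and count must be positive")
    last_start = max(len(block) - window, 0)
    hashes = set()
    for start in range(last_start + 1):
        digest = hashlib.blake2b(block[start:start + window], digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    # Assumption: the smallest hash values serve as the block's similar fingerprints.
    return sorted(hashes)[:count]


example_block = bytes(range(256)) * 4
example_fps = similar_fingerprints(example_block)
```

  • With this kind of selection, two blocks that share most of their windows are likely, though not guaranteed, to share at least one fingerprint.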
  • the data blocks to be compressed are data block A and data block B.
  • the similar fingerprints of data block A are extracted through a hash algorithm using a window of a certain number of bytes, and three similar fingerprints can be extracted. For example, the similar fingerprints of data block A can be SFP1, SFP2, and SFP3, and the similar fingerprints of data block B are SFP1, SFP4, and SFP5.
  • data block A and data block B both include the same similar fingerprint, SFP1. Therefore, data block A and data block B are similar data blocks and can be placed in the same data compression group for similar data compression.
  • the data blocks with the same SFP are grouped into one queue. Therefore, each data block appears in the queue of each SFP that the data block includes.
  • the data blocks in the data block queue have at least one common SFP fingerprint, which means that there is a certain degree of similarity between the data blocks.
  • A (SFP1, SFP2, SFP3); B (SFP1, SFP4, SFP5); C (SFP3, SFP5, SFP1); D (SFP1, SFP7, SFP2); E (SFP4, SFP5, SFP6); F (SFP1, SFP6, SFP4);
  • a similar fingerprint table is established based on similar fingerprints, and each data block will be recorded in the corresponding three SFP queues, and the data block queue corresponding to each similar fingerprint is obtained as shown below:
  • SFP1: A, B, C, D, F
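  • Building the similar fingerprint table from the six example blocks can be sketched as follows. The block-to-fingerprint mapping is the one listed above, with the "SPF" spellings read as "SFP"; `build_queues` is a hypothetical helper name.

```python
from collections import defaultdict

# Fingerprints of the six example data blocks.
example_fingerprints = {
    "A": ["SFP1", "SFP2", "SFP3"],
    "B": ["SFP1", "SFP4", "SFP5"],
    "C": ["SFP3", "SFP5", "SFP1"],
    "D": ["SFP1", "SFP7", "SFP2"],
    "E": ["SFP4", "SFP5", "SFP6"],
    "F": ["SFP1", "SFP6", "SFP4"],
}


def build_queues(block_fingerprints):
    """Record every data block in the queue of each fingerprint it includes."""
    queues = defaultdict(list)
    for block, fingerprints in block_fingerprints.items():
        for fingerprint in fingerprints:
            queues[fingerprint].append(block)
    return dict(queues)


example_queues = build_queues(example_fingerprints)
for name in sorted(example_queues):
    print(f"{name}: {', '.join(example_queues[name])}")
```

  • Each block appears in exactly three queues here, one per fingerprint, which is the property the joint compression grouping relies on.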
  • a joint compression group is also called a data compression group, which is a grouping of different data blocks that performs similar data compression.
  • the compressed data of a joint compression group is stored in a storage unit of the memory, for example, stored in a physical Page.
  • the overall idea of dividing data blocks into joint compression groups is to form the groups from the data block queues of the different similar fingerprints. That is, according to the data block queues of similar fingerprints obtained in the above steps, data blocks that share at least one similar fingerprint are divided into a joint compression group.
  • Similar data blocks in the same joint compression group refer to the reference data blocks in the joint compression group for similar data compression, and the compressed incremental data of the similar data blocks are written into the storage unit together with the reference data blocks.
  • the specific method for dividing the data block into a joint compression group for data compression includes the following steps:
  • the joint compression group can be divided according to the data block queues corresponding to similar fingerprints, and all data blocks in a data block queue are divided into a joint compression group.
  • the first data block queue corresponding to the first similar fingerprint SFP1 is divided into the first joint compression group.
  • the data blocks A, B, C, D, and F are divided into the first joint compression group.
  • the data blocks other than the reference data block in the joint compression group may be referred to as similar data blocks of the reference data block.
  • the similar data blocks in the joint compression group can be compressed with reference to the reference data block.
  • data blocks A, B, C, D, and F can be divided into a joint compression group, and A is set as the reference data block in the joint compression group; the other data blocks B, C, D, and F are similar data blocks, which are compressed relative to data block A.
  • for similar data blocks in a joint compression group, in order to reduce the amount of data that must be read and written for decompression, similar data blocks may be compressed with reference only to the reference data block, instead of performing mutual reference compression between similar data blocks. When the data is decompressed, the reference data block and the incremental data of the similar data block are read and decompressed, and the original similar data block is restored. It should be noted that this application is not limited to this method. If decompression performance permits, similar data blocks in the joint compression group can also be compressed with reference to one another. For example, a similar data block can be compressed with reference to both the reference data block and another similar data block.
  • a data block in the joint compression group is selected as the reference data block.
  • the reference data block may be the first data block divided into the joint compression group, for example, data block A in the data block queue SFP1: A, B, C, D, F.
  • the compression gain is used to indicate the size of the saved storage space occupied by the storage data block after multiple similar data blocks are compressed with reference to the reference data block, that is, the size of the improvement in storage capacity utilization.
  • reference data block A is self-compressed into A' according to the compression algorithm, and the reference increment of data block B relative to data block A is ΔB; likewise, ΔC, ΔD, and ΔF are obtained.
  • the compression benefit of reference data block A is the ratio of the amount of data stored in data blocks A, ΔB, ΔC, ΔD, and ΔF to the storage resources saved by storing data blocks A, B, C, D, and F.
  • the data block with the highest compression benefit is the data block for which the storage space occupied by the compressed data is the smallest.
  • the data block with the highest compression benefit can be determined by performing pre-compression processing on the similar data blocks, or according to other compression algorithms.
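  • Choosing the reference data block by pre-compression can be sketched as follows: each candidate is tried as the reference, the total stored size of the group is measured, and the candidate with the smallest total is kept. This is an illustration under assumptions: zlib with a preset dictionary stands in for the delta compression, and the helper names and synthetic blocks are hypothetical.

```python
import hashlib
import zlib


def compressed_size(block: bytes, reference: bytes = None) -> int:
    """Size of block after self-compression, or after compression against reference."""
    if reference is None:
        compressor = zlib.compressobj(level=9)
    else:
        compressor = zlib.compressobj(level=9, zdict=reference)
    return len(compressor.compress(block) + compressor.flush())


def group_size(blocks, reference_name) -> int:
    """Total stored size when reference_name is the reference of the whole group."""
    reference = blocks[reference_name]
    total = compressed_size(reference)
    for name, data in blocks.items():
        if name != reference_name:
            total += compressed_size(data, reference)
    return total


def pick_reference(blocks) -> str:
    """Pre-compress with every candidate and keep the one with the smallest total."""
    return min(blocks, key=lambda name: group_size(blocks, name))


def segment(tag: bytes) -> bytes:
    """Synthetic 1024-byte segment that does not compress well on its own."""
    return b"".join(hashlib.sha256(tag + bytes([i])).digest() for i in range(32))


s1, s2, s3, s4 = (segment(t) for t in (b"1", b"2", b"3", b"4"))
# Block A shares one segment with B and one with C; B and C share nothing.
example_group = {"A": s1 + s2, "B": s1 + s3, "C": s2 + s4}
best_reference = pick_reference(example_group)
```

  • Pre-compressing every candidate costs one pass over the group per candidate, which is why the description also allows other, cheaper ways of estimating the benefit.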
  • the data blocks in a joint compression group can be limited by a certain number of data blocks, by the total size of the compressed data, or by both the number of data blocks and the total size of the compressed data at the same time for joint compression.
  • the joint compression group can be divided and compressed according to a limit on the number of data blocks.
  • for example, if a joint compression group can compress up to 8 data blocks at the same time, the preset threshold for the number of data blocks in the joint compression group is 8.
  • if a joint compression group is limited to 16KB by the total size of the compressed data, that is, the amount of stored data is limited, then the preset threshold of the storage data size of the joint compression group is 16KB.
  • two conditions can limit the joint compression group at the same time.
  • a joint compression group can compress up to 8 data blocks at the same time, and the total size of the compressed data is limited to 16KB.
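  • The two limits (at most 8 data blocks and at most 16KB of compressed data per joint compression group) can be sketched as a simple admission check. The class name `JointCompressionGroup` and its methods are hypothetical, and the sizes passed in are assumed to be already-measured compressed sizes.

```python
from dataclasses import dataclass, field
from typing import List

MAX_BLOCKS = 8          # limit on the number of data blocks in one group
MAX_BYTES = 16 * 1024   # limit on the total size of compressed data (16KB)


@dataclass
class JointCompressionGroup:
    members: List[str] = field(default_factory=list)
    stored_bytes: int = 0

    def can_accept(self, compressed_len: int) -> bool:
        """Both limits must hold after the new block is added."""
        return (len(self.members) < MAX_BLOCKS
                and self.stored_bytes + compressed_len <= MAX_BYTES)

    def add(self, name: str, compressed_len: int) -> bool:
        """Add a block if the limits allow it; report whether it was added."""
        if not self.can_accept(compressed_len):
            return False
        self.members.append(name)
        self.stored_bytes += compressed_len
        return True


# Eight small blocks fit; a ninth is rejected by the block-count limit.
count_limited = JointCompressionGroup()
count_results = [count_limited.add(f"block{i}", 1000) for i in range(9)]

# A large block is rejected by the size limit even though few blocks are stored.
size_limited = JointCompressionGroup()
size_results = [size_limited.add("ref", 10000), size_limited.add("big", 7000)]
```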
  • the first data block queue is SFP1: A, B, C, D, E
  • the first reference data block in the first data block queue is A
  • the first data block is compressed with reference to the first reference data block A and stored in the first joint compression group. This results in the first joint compression group as shown in Figure 3.
  • the reference data block in the joint compression group can be directly self-compressed through a compression algorithm, or other compression strategies can be used to compress the reference data block separately.
  • only the reference data block is directly self-compressed through the compression algorithm as an example for description.
  • the reference data block A is self-compressed and becomes A'.
  • the second data block queue includes data blocks with the same fingerprint as the first reference data block, and the at least one data block is a data block in the second data block queue, other than the first reference data block, that has not yet undergone data compression processing.
  • the similar data blocks in the second data block queue can be divided into the first joint compression group for compression.
  • the similar data block must be a data block that shares a similar fingerprint with the first reference data block in the first joint compression group.
  • the second data block queue is SFP2: A, F, H, I, where data block A has already been self-compressed as the reference data block in the first joint compression group and stored as A', so there is no need to compress data block A again. Data blocks F, H, and I have the same similar fingerprint SFP2 as data block A. Therefore, if it is determined that data blocks F, H, and I can be compressed with reference to data block A, data blocks F, H, and I are also put into the first joint compression group for similar data compression.
  • the data blocks in the second data block queue that can be compressed with reference to the first reference data block need to be selected according to the remaining space of the first storage space, the limit on the number of compressed data blocks preset for the first joint compression group, or both conditions at the same time.
  • if the preset number of data blocks for data compression in the first joint compression group is limited to 8, data blocks F, H, and I can all be put into the first joint compression group for compression.
  • if the preset number of data blocks for data compression in the first joint compression group is limited to 6, only one more data block can be stored; data block F can then be put into the first joint compression group for compression and storage, and the other data blocks in the second data block queue form the second joint compression group. Data block H can be determined as the reference data block of the second joint compression group, and data block I is compressed with reference to data block H, as shown in FIG. 4.
  • the uncompressed data blocks in the second data block queue are divided into a second joint compression group, the second reference data block is determined from among them, and the other data blocks in the second joint compression group are compressed with reference to the second reference data block.
  • At least one data block is obtained from the third data block queue; at least one data block is compressed with reference to the second reference data block and then stored in the second joint compression group.
  • In other words, the first data block queue is divided into a joint compression group first and the first reference data block of the first joint compression group is determined. Data blocks that include the same data fingerprint as the first reference data block are compressed with reference to it and stored in the first joint compression group, until the preset number of data blocks or the preset storage data size of the first joint compression group is not enough to store the next data block, or until no data block that includes the same data fingerprint as the first reference block remains.
  • Next, the second reference block of the second joint compression group is determined. Data blocks that include the same data fingerprint as the second reference block are compressed with reference to it and stored in the second joint compression group, until the preset number of data blocks or the preset storage data size of the second joint compression group is not enough to store the next data block, or until no data block that includes the same data fingerprint as the second reference block remains.
  • In one implementation, joint compression groups are divided according to the limit on the compressed data size and/or the limit on the number of data blocks compressed.
  • For the data block queue of each SFP, the corresponding joint compression group may not be full.
  • When a joint compression group still has remaining space but adding all the data blocks of the queue corresponding to another fingerprint, SFP2, would exceed the limit of the joint compression group, another joint compression group has to be built. Multiple joint compression groups are then left partly empty, which wastes space.
  • In addition, each joint compression group has a reference data block, such as reference data block A in the above example, so every joint compression group that is created increases the number of reference data blocks, such as data block F in the above embodiment.
  • A reference data block is compressed by a direct compression algorithm, which can only reduce the redundant data inside that data block; its compression rate is lower than that of a similar data block compressed with reference to a reference data block.
  • Therefore, similar data blocks in the data block queues of different similar fingerprints are compressed with reference to one common reference data block, that is, joint compression is performed with a common reference data block across fingerprints.
  • This improves the storage space utilization of the joint compression group and reduces the number of reference data blocks, turning would-be reference data blocks into additional similar data blocks wherever possible and thereby improving the overall reduction rate.
  • the storage location of the reference data block in the joint compression group may be the first location or the middle location of the storage unit, or the reference data block may be stored separately from the similar data block.
  • When the reference data block of the joint compression group is stored in the first position of the storage unit, its offset in the storage unit is 0, so that during decompression the system can read the reference data block directly by its storage position index.
  • Alternatively, the reference data block can be stored separately from the similar data blocks, that is, the reference data block is stored in storage unit 1 and the similar data blocks are stored in storage unit 2. This makes frequent reads of reference data blocks convenient and allows all reference data blocks to be managed centrally.
  • For example, all or part of the reference data blocks can be stored in internal memory, and the similar data blocks can be stored in external storage.
  • In another possible implementation, the system first divides similar data blocks in descending order of the degree of similarity between the data blocks to obtain multiple data block queues; the data blocks in the same data block queue are then compressed with reference to the same reference data block, and the data deltas are stored in the joint compression group.
  • That is, the multiple data blocks can first be divided into joint compression groups in descending order of the degree of similarity between them.
  • obtaining multiple data block queues according to multiple fingerprints corresponding to each of the multiple data blocks includes:
  • SFP1: A, B, D, H, I, J, K, L, C, M, E
  • SFP2: A, Y, H, E, M
  • SFP3: A, C, E, O, P, Q
  • SFP4: H, M, J
  • The data blocks with the highest degree of similarity, that is, the data blocks that share the largest number of similar fingerprints, are data blocks A and E, and data blocks H and M: data blocks A and E share the three similar fingerprints SFP1, SFP2, and SFP3, and data blocks H and M share the three similar fingerprints SFP1, SFP2, and SFP4.
  • The data blocks with the second highest degree of similarity are A and C, and H and J, each pair sharing two similar fingerprints. The remaining data blocks share at most one similar fingerprint with one another.
  • Following the strategy of dividing data block queues in descending order of similarity, data blocks A and E are first divided into the first data block queue Group1, and data block A is determined to be the reference data block of Group1.
  • Data blocks H and M are divided into the second data block queue Group2, and data block H is determined to be the reference data block of Group2.
  • Next, among the second most similar data blocks A and C, data block A is the reference data block of Group1, so data block C is also put into Group1 and compressed with reference to data block A.
  • Likewise, among data blocks H and J, data block H is the reference data block of Group2, so data block J is also put into Group2 and compressed with reference to data block H.
  • The remaining data blocks are then divided by shared similar fingerprint. Data blocks B, D, I, K, and L of the data blocks corresponding to SFP1 can be divided into Group1 and compressed with reference to data block A, as shown in FIG. 5. Data block Y, among the data blocks corresponding to SFP2, shares the similar fingerprint SFP2 with the reference data block H of Group2, so data block Y is also put into Group2 and compressed with reference to data block H.
  • The remaining data blocks O, P, and Q in the data block queue corresponding to SFP3 could be compressed with reference to data block A, but because Group1 has no spare storage space, a new joint compression group has to be created: data blocks O, P, and Q are put into Group3, data block O is determined to be the reference data block of Group3, and data blocks P and Q are compressed with reference to data block O.
  • The data blocks in the data block queue corresponding to SFP4 have all been compressed, so there is no need to create a new joint compression group or determine a new reference data block. See FIG. 5 for a schematic diagram of this joint compression.
  • By contrast, suppose the joint compression groups for the data blocks in the above example are divided only by similar-fingerprint queue, without considering similarity. If the number of data blocks in a joint compression group is limited to 8, then for the data blocks corresponding to SFP1, data block A is determined as the reference data block of Group11, and data blocks B, D, H, I, J, K, and L are all put into the joint compression group Group11 and compressed with reference to data block A; the remaining data blocks C, M, N, and E form the joint compression group Group22, for which data block C can be determined as the reference data block, and data blocks M, N, and E are compressed with reference to data block C.
  • For the data block queue of SFP2, data blocks A, H, and E have already been compressed, so data block Y is divided into a joint compression group Group33; for SFP3, data blocks A, C, and E have already been compressed, so data blocks O, P, and Q are divided into a joint compression group Group44, as shown in FIG. 6.
  • The data blocks in the first joint compression group Group1 are all compressed with reference to data block A. Data block E has three similar fingerprints in common with data block A, and data block C has two, which means that data blocks E and C are strongly similar to data block A, and compressing them with reference to data block A yields a higher reduction rate.
  • Similarly, data blocks M and J in the second joint compression group Group2 are strongly similar to the reference data block H, so compressing them with reference to data block H yields a higher reduction rate.
  • The data blocks in the joint compression group Group11, in contrast, are not divided according to the degree of similarity between the data blocks but only according to the similar-fingerprint queue. Data blocks B, D, I, J, and K each share only one similar fingerprint with data block A, so the reduction rate of reference compression will not be very high.
  • Therefore, dividing strongly similar data blocks into the same joint compression group first improves the reduction rate of data compression and the utilization efficiency of storage space.
  • For several strongly similar data blocks, choosing one of them as the reference data block and compressing the others with reference to it gives relatively high compression efficiency. If the degree of similarity between similar data blocks is not considered, similar-reference compression cannot reach its full benefit. By identifying similarity and distinguishing its degree, data blocks with higher similarity are jointly compressed first, one of them serving as the reference block and the others as similar blocks, which yields a higher similar-compression ratio and saves space.
  • The main modules and processing flow of the similar-compression scheme, which identifies and groups data blocks based on their similar fingerprints, can be as shown in FIG. 7.
  • For newly written or updated data blocks, the similar-fingerprint acquisition module calculates the multiple similar fingerprints (SFPs) of each data block; data block queues corresponding to the similar fingerprints are built from the SFPs and the similar-fingerprint table is saved; the fingerprint analysis module analyzes the similar fingerprints of the data block queues and, based on the number of similar fingerprints that data blocks have in common, determines the degree of similarity between the data blocks and existing data blocks and groups them accordingly; the similar data block compression module preferentially divides data blocks with higher similarity into one joint compression group for similar-data compression, in which the similar data blocks are compressed with reference to the reference data block.
  • Finally, the storage module at the bottom layer writes the similar-compressed data to disk.
  • the multiple similar fingerprints in the foregoing embodiment may also be global fingerprints generated by feature extraction or other operations between at least two similar fingerprints, which may also be referred to as super fingerprints.
  • the global fingerprint can be used to represent the characteristics of the data block, and can also be used to characterize the characteristics between similar fingerprints of the data block.
  • The degree of similarity between data blocks can therefore also be judged by comparing their global fingerprints. Finding the number of similar fingerprints shared between data blocks from the many similar fingerprints of many data blocks requires a large number of comparison operations, whereas comparing similarity through global fingerprints lets the system identify similar blocks quickly and reduces the amount of calculation.
  • Global fingerprints can be generated by extracting features of data blocks with algorithms such as Locality-Sensitive Hashing (LSH) or Hamming distance. A global fingerprint can also be generated by extracting data features across multiple similar fingerprints, or obtained from multiple similar fingerprints through simple operations such as exclusive OR (XOR). For example, if six similar fingerprints are extracted from a data block, then to reduce the space occupied by storing a large number of similar fingerprints, the six similar fingerprints can be XORed in pairs to generate three global fingerprints, which are stored and used to compare the degree of similarity between data blocks, reducing the computational cost of search and comparison.
  • In the above embodiments, joint compression groups are divided in descending order of the similarity between data blocks. Because joint compression distinguishes the degree of similarity between data blocks, highly similar data blocks are compressed with reference to the same data block, which improves the reduction rate of data compression and saves storage space.
  • the device includes an acquisition module 801 and a compression module 802.
  • the obtaining module 801 is configured to obtain multiple data block queues according to multiple fingerprints corresponding to each data block in the multiple data blocks, and the data blocks in each data block queue have the same fingerprint.
  • The compression module 802 is configured to: if the size of the data obtained by compressing the data blocks in the first data block queue with reference to the first reference data block does not reach the preset threshold, obtain at least one data block from the second data block queue and compress the at least one data block with reference to the first reference data block, where the second data block queue includes data blocks that have the same fingerprint as the first reference data block.
  • the embodiments of the present application also provide a chip system, which can be applied to the electronic device in the above-mentioned embodiments.
  • The chip system includes one or more interface circuits and one or more processors, and the interface circuit and the processor are interconnected by lines. The interface circuit is configured to receive a signal from the memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs each function or step of any possible data compression method in the above embodiments.
  • The embodiments of the present application also provide a computer-readable storage medium that includes computer instructions. When the computer instructions run on the above-mentioned electronic device, the electronic device executes each function or step performed by the electronic device in the above method embodiments.
  • the embodiments of the present application also provide a computer program product, which when the computer program product runs on a computer, causes the computer to execute each function or step performed by the electronic device in the above method embodiment.
  • the disclosed device and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • For example, the division of the modules or units is only a logical function division. In actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • The units described as separate parts may or may not be physically separate.
  • The parts displayed as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed to multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the method described in each embodiment of the present application.
  • the foregoing storage media include: U disk, mobile hard disk, read only memory (read only memory, ROM), random access memory (random access memory, RAM), magnetic disk or optical disk and other media that can store program codes.

Abstract

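A data compression method and apparatus, relating to the field of storage technologies, that address the low storage-space utilization and low reduction rate of similar-data compression. The method includes: obtaining a first fingerprint and a second fingerprint of a first data block; adding the first data block to a first data block queue according to the first fingerprint, where every data block in the first data block queue contains the first fingerprint; adding the first data block to a second data block queue according to the second fingerprint, where every data block in the second data block queue contains the second fingerprint; when the size of the data obtained by compressing the data blocks in the first data block queue with reference to a first reference data block does not reach a preset threshold, obtaining a second data block from the second data block queue, where the second data block queue includes data blocks that have the same fingerprint as the first reference data block; and compressing the second data block with reference to the first reference data block.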

Description

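Method and apparatus for data compression
Technical Field
This application relates to the field of storage technologies, and in particular to a data compression method and apparatus.
Background
With the rapid development of big data, the explosive growth of massive data poses a major challenge to data storage management. To improve storage space utilization, data reduction is an essential technology in storage systems. It reduces the storage capacity used in a storage system through data compression, deduplication, similar-data compression, and the like. Important indicators of storage system performance include the reduction efficiency, the reduction rate, and the compression ratio.
In current solutions that compress and store different versions of the same data, after each data update the new version is compressed with reference to the original data to obtain a delta, and the delta of every update is stored in the storage space. Because the physical pages that store data have a fixed length, data that is updated only slightly wastes most of its physical page, so storage space utilization is low.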
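Summary
This application provides a data compression method and apparatus that solve the prior-art problems of low storage-space utilization and low reduction rate in similar-data compression.
To achieve the foregoing objective, this application adopts the following technical solutions:
According to a first aspect, a data compression method is provided. The method includes: obtaining a first fingerprint and a second fingerprint of a first data block; adding the first data block to a first data block queue according to the first fingerprint, where every data block in the first data block queue contains the first fingerprint; adding the first data block to a second data block queue according to the second fingerprint, where every data block in the second data block queue contains the second fingerprint; when the size of the data obtained by compressing the data blocks in the first data block queue with reference to a first reference data block does not reach a preset threshold, obtaining a second data block from the second data block queue, where the second data block queue includes data blocks that have the same fingerprint as the first reference data block; and compressing the second data block with reference to the first reference data block.
In the foregoing technical solution, multiple data block queues are obtained from the multiple fingerprints extracted from multiple data blocks, and the data blocks in a queue that share a fingerprint are compressed by reference. If storage space remains after the data blocks in the first data block queue have been compressed with reference to the first reference data block, the data blocks in the second data block queue that share a fingerprint with the first reference data block are also compressed with reference to the first reference data block. This improves storage space utilization and avoids wasting storage resources.
In a possible design, the method further includes: determining a second reference data block among the uncompressed data blocks in the second data block queue; and compressing the uncompressed data blocks in the second data block queue with reference to the second reference data block.
In a possible design, the method further includes: determining the first reference data block, where the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
According to a second aspect, a data compression method is provided. The method includes: obtaining multiple data block queues according to multiple fingerprints corresponding to each of multiple data blocks, where the data blocks in each data block queue have a fingerprint in common; if the size of the data obtained by compressing the data blocks in a first data block queue with reference to a first reference data block does not reach a preset threshold, obtaining at least one data block from a second data block queue, where the second data block queue includes data blocks that have the same fingerprint as the first reference data block; and compressing the at least one data block with reference to the first reference data block.
In a possible design, the method further includes: determining a second reference data block among the uncompressed data blocks in the second data block queue; and compressing the uncompressed data blocks in the second data block queue with reference to the second reference data block.
In a possible design, obtaining multiple data block queues according to the multiple fingerprints corresponding to each of the multiple data blocks includes: determining the similarity between data blocks according to the multiple fingerprints corresponding to each data block, where the more fingerprints at least two data blocks have in common, the higher the similarity of the at least two data blocks; and determining the multiple data block queues in descending order of the similarity between data blocks, where the data blocks that share multiple fingerprints and have the highest similarity are in the same data block queue, and the multiple data block queues include the first data block queue. In this possible implementation, when data blocks are compressed by reference, similar data is first grouped into data block queues by the similarity between data blocks. Different data block queue groups are determined in descending order of similarity, and the most similar data blocks are compressed with reference to the same reference block. Highly similar data blocks are thus placed in the same compression group, which raises the reduction rate of similar-data compression and improves storage space utilization.
In a possible design, determining the similarity between data blocks according to the multiple fingerprints corresponding to each of the multiple data blocks includes: obtaining at least one global fingerprint for each data block according to the multiple fingerprints corresponding to that data block, and determining the similarity between data blocks according to the global fingerprints. In this possible implementation, determining similarity from global fingerprints reduces the storage space consumed by a large number of data block fingerprints, and also reduces the read/write and computation overhead of counting the fingerprints that data blocks have in common, which improves query efficiency and storage space utilization.
In a possible design, before the data blocks in the first data block queue are compressed with reference to the first reference data block, the method further includes: determining the first reference data block, where the first reference data block is one of the data blocks with the highest similarity in the first data block queue. In this possible implementation, the most similar data blocks are placed in the same compression group and compressed with reference to one of them, which improves the reduction rate of similar data blocks and the utilization of storage space.
In a possible design, before the data blocks in the first data block queue are compressed with reference to the first reference data block, the method further includes: determining the first reference data block, where the first reference data block is the data block for which the other data blocks in the first data block queue occupy the least storage space after being compressed with reference to it. In this possible implementation, the data block with the highest compression gain in the data block queue is used as the reference data block and the other data blocks are compressed with reference to it, which improves the reduction rate of the data blocks.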
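According to a third aspect, a data compression apparatus is provided. The apparatus includes: an obtaining module, configured to obtain a first fingerprint and a second fingerprint of a first data block, add the first data block to a first data block queue according to the first fingerprint, where every data block in the first data block queue contains the first fingerprint, and add the first data block to a second data block queue according to the second fingerprint, where every data block in the second data block queue contains the second fingerprint; and a compression module, configured to: when the size of the data obtained by compressing the data blocks in the first data block queue with reference to a first reference data block does not reach a preset threshold, obtain a second data block from the second data block queue, where the second data block queue includes data blocks that have the same fingerprint as the first reference data block, and compress the second data block with reference to the first reference data block.
In a possible design, the compression module is further configured to determine a second reference data block among the uncompressed data blocks in the second data block queue, and compress the uncompressed data blocks in the second data block queue with reference to the second reference data block.
In a possible design, the compression module is further configured to determine the first reference data block, where the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
According to a fourth aspect, a data compression apparatus is provided. The apparatus includes: an obtaining module, configured to obtain multiple data block queues according to multiple fingerprints corresponding to each of multiple data blocks, where the data blocks in each data block queue have a fingerprint in common; and a compression module, configured to: if the size of the data obtained by compressing the data blocks in a first data block queue with reference to a first reference data block does not reach a preset threshold, obtain at least one data block from a second data block queue and compress the at least one data block with reference to the first reference data block, where the second data block queue includes data blocks that have the same fingerprint as the first reference data block.
In a possible design, the compression module is further configured to determine a second reference data block among the uncompressed data blocks in the second data block queue, and compress the uncompressed data blocks in the second data block queue with reference to the second reference data block.
In a possible design, the obtaining module is specifically configured to: determine the similarity between data blocks according to the multiple fingerprints corresponding to each of the multiple data blocks, where the more fingerprints at least two data blocks have in common, the higher the similarity of the at least two data blocks; and determine the multiple data block queues in descending order of the similarity between data blocks, where the data blocks that share multiple fingerprints and have the highest similarity are in the same data block queue, and the multiple data block queues include the first data block queue.
In a possible design, the obtaining module is specifically configured to obtain at least one global fingerprint for each data block according to the multiple fingerprints corresponding to that data block, and determine the similarity between data blocks according to the global fingerprints.
In a possible design, the compression module is further configured to determine the first reference data block, where the first reference data block is one of the data blocks with the highest similarity in the first data block queue.
In a possible design, the compression module is further configured to determine the first reference data block, where the first reference data block is the data block for which the other data blocks in the first data block queue occupy the least storage space after being compressed with reference to it.
According to a fifth aspect, a chip system is provided. The chip system is applied to an electronic device and includes one or more interface circuits and one or more processors. The interface circuit and the processor are interconnected by lines. The interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the data compression method according to any one of the first aspect or the second aspect.
According to a sixth aspect, a readable storage medium is provided. The readable storage medium stores instructions which, when run on an electronic device, cause the electronic device to perform the data compression method according to any one of the first aspect or the second aspect.
According to a seventh aspect, a computer program product is provided. When the computer program product runs on a computer, the computer performs the data compression method according to any one of the first aspect or the second aspect.
It can be understood that any data compression apparatus, chip system, readable storage medium, or computer program product provided above can be implemented according to the corresponding data compression method provided above. For the beneficial effects they achieve, refer to the beneficial effects of the data compression method provided above; details are not repeated here.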
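Brief Description of Drawings
FIG. 1 is a schematic diagram of a similar-data compression method according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a data compression method according to an embodiment of this application;
FIG. 3 is example 1 of a data compression method according to an embodiment of this application;
FIG. 4 is example 2 of a data compression method according to an embodiment of this application;
FIG. 5 is example 3 of a data compression method according to an embodiment of this application;
FIG. 6 is example 4 of a data compression method according to an embodiment of this application;
FIG. 7 is a schematic diagram of the flow and modules of a data compression method according to an embodiment of this application;
FIG. 8 is a schematic diagram of a data compression apparatus according to an embodiment of this application.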
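Detailed Description
The terms "first", "second", "third", and the like in the specification, claims, and drawings of this application are used to distinguish different objects, not to define a particular order. In the embodiments of this application, words such as "exemplary" or "for example" are used to present an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as preferred over or more advantageous than other embodiments or designs. Rather, such words are intended to present the related concepts in a concrete manner.
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
In data storage management, data reduction can effectively improve storage space utilization. Data reduction means reducing the amount of data, without losing useful information, to save storage space and improve the efficiency of transmission, storage, and processing; or reorganizing data according to an algorithm to reduce data redundancy and storage space.
The reduction rate is therefore an important indicator of the storage performance of a storage system. The reduction rate of a data reduction technique can be calculated as the amount of data before compression divided by the amount of data after compression, and expresses the efficiency of data compression. The larger the reduction rate, the more efficient the compression, the more storage space is saved, and the better the storage system performs.
At present, data reduction is mainly implemented in the following three ways:
(1) Data compression: redundant information in the data is removed through processing, saving storage space. Data compression is either lossless or lossy. With lossless compression, decompressing the compressed data yields data identical to the original. With lossy compression, decompressing the compressed data yields data that differs from the original; it is mainly used for image or video compression.
(2) Deduplication: data compression can only eliminate redundant information within a file, whereas deduplication eliminates identical files or data blocks in a distributed storage system and can effectively reduce the physical storage space occupied by data. It is widely used in backup and archiving systems. For example, a file is split into data blocks of a given size and a fingerprint is calculated for each data block. Data blocks with the same fingerprint have the same content, so the original data needs to be stored only once for data blocks with the same fingerprint.
(3) Similar-data compression (Delta compression): the features extracted from a data block are compared, and the data block is compressed with reference to a data block that has similar or identical data features, that is, a reference data block; only the delta relative to the reference data block is stored. Deduplication requires data blocks to be completely identical before redundant data can be eliminated, whereas similar-data compression achieves a significant reduction for data blocks that are not exact duplicates but have a certain degree of similarity.
For example, data block 1 and data block 2 cannot be deduplicated because part of their content differs, but the similarity between them is very high. As shown in FIG. 1, when the features extracted from data block 1 and data block 2 are compared, some features are the same and some are different. Data block 1 can then be stored as the reference block. After data block 2 is compressed with reference to data block 1 by a Delta compression algorithm, only the part of data block 2 that differs from data block 1, that is, the delta of data block 2 relative to data block 1, needs to be stored, which greatly reduces the storage capacity needed for data block 2.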
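As a rough illustration of the example above, the following Python sketch compresses data block 2 with reference to data block 1. It uses zlib's preset-dictionary feature as a stand-in for a Delta compression algorithm. The source does not name a specific algorithm, so this choice, the helper names `delta_compress` and `delta_decompress`, and the test data are all assumptions.

```python
import random
import zlib


def delta_compress(block: bytes, reference: bytes) -> bytes:
    """Compress `block` using `reference` as a zlib preset dictionary."""
    comp = zlib.compressobj(zdict=reference)
    return comp.compress(block) + comp.flush()


def delta_decompress(delta: bytes, reference: bytes) -> bytes:
    """Restore a block from its delta and the same reference block."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()


# Hypothetical data: block 2 differs from block 1 in only a few bytes.
rng = random.Random(0)
block1 = bytes(rng.getrandbits(8) for _ in range(4096))
modified = bytearray(block1)
modified[100:108] = b"CHANGED!"
block2 = bytes(modified)

delta2 = delta_compress(block2, block1)   # delta of block 2 relative to block 1
self2 = zlib.compress(block2)             # block 2 compressed on its own
restored = delta_decompress(delta2, block1)
```

Because block 2 repeats almost all of block 1, the delta is far smaller than block 2 compressed on its own, and the original block is recovered only when the same reference block is available at decompression time.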
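This application proposes a technical solution based on similar-data compression. The similarity between data blocks is analyzed through similar fingerprints, data is grouped by similarity, and similar data blocks are jointly compressed relative to a reference data block. This eliminates more of the fields repeated between similar data blocks, raising the reduction rate of data compression and improving storage space utilization. As shown in FIG. 2, the method includes the following steps:
201: Obtain multiple fingerprints corresponding to each of multiple data blocks.
The data blocks are the data blocks to be compressed. Each data block includes at least one fingerprint, which is used to identify different data blocks.
A fingerprint, also called a similar fingerprint (Similar Fingerprint, SFP), is a fingerprint feature of a data block, that is, a feature that can indicate that different data blocks have a certain degree of similarity.
Each data block may include one similar fingerprint or multiple similar fingerprints.
The data blocks in one batch to be compressed may all have the same size, for example when file data is split into equally sized data blocks for compression, or they may have different sizes. Data blocks that have a fingerprint in common can be compressed by reference.
The similar fingerprints of a data block can be extracted with a feature extraction algorithm, for example a hash algorithm or another fingerprint calculation method. A hash algorithm transforms input data of arbitrary length into output data of fixed length, which is the hash value. In the embodiments of this application, features of a data block can be extracted by applying a hash algorithm over windows of a certain number of bytes and used as fingerprint features, that is, similar fingerprints.
For example, the data blocks to be compressed in an embodiment of this application are data block A and data block B. Three similar fingerprints can be extracted from data block A by applying a hash algorithm over windows of a certain number of bytes; for example, the similar fingerprints of data block A may be SFP1, SFP2, and SFP3, and the similar fingerprints of data block B may be SFP1, SFP4, and SFP5. Data block A and data block B both include the similar fingerprint SFP1, so they are similar data blocks and can be placed in the same data compression group for similar-data compression.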
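The source only says that features are hashed over byte windows; it does not specify how a fingerprint is chosen from the window hashes. The sketch below fills that gap with one common approach, stated here as an assumption: for each fingerprint, every window is hashed with a different salt and the maximum hash value is kept. The function name `similar_fingerprints`, the window size, and the use of BLAKE2b are illustrative choices, not the patent's method.

```python
import hashlib


def similar_fingerprints(block: bytes, count: int = 3, window: int = 32) -> tuple:
    """Return `count` similar fingerprints for a data block.

    Assumed scheme: for each fingerprint index, hash every `window`-byte
    window with an index-specific salt and keep the maximum hash value.
    """
    fingerprints = []
    last_start = max(1, len(block) - window + 1)
    for index in range(count):
        salt = index.to_bytes(2, "big")
        best = 0
        for start in range(last_start):
            digest = hashlib.blake2b(
                block[start:start + window], digest_size=8, salt=salt
            ).digest()
            best = max(best, int.from_bytes(digest, "big"))
        fingerprints.append(best)
    return tuple(fingerprints)


# Hypothetical blocks: block_b keeps most of block_a and changes the tail.
block_a = bytes(range(256)) * 2
block_b = block_a[:500] + b"different tail"
fps_a = similar_fingerprints(block_a)
fps_b = similar_fingerprints(block_b)
# Positions at which both blocks have the same fingerprint value.
shared_positions = [i for i, (x, y) in enumerate(zip(fps_a, fps_b)) if x == y]
```

Two blocks would be treated as similar when `shared_positions` is not empty; with a max-of-window-hash scheme, a fingerprint survives a local edit whenever the window that produced the maximum lies outside the edited region.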
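202: Obtain multiple data block queues according to the multiple fingerprints corresponding to each of the multiple data blocks.
The data blocks in each data block queue have a fingerprint in common.
Specifically, for the multiple similar fingerprints (SFPs) extracted from the data blocks, data blocks that have the same SFP are placed in one queue, so each data block appears in the queue of every SFP it includes. For each SFP, the data blocks in its queue have at least that SFP in common, which indicates a certain degree of similarity between them.
For example, consider data blocks A, B, C, D, E, and F, each with three similar fingerprints as follows:
A (SFP1, SFP2, SFP3); B (SFP1, SFP4, SFP5); C (SFP3, SFP5, SFP1); D (SFP1, SFP7, SFP2); E (SFP4, SFP5, SFP6); F (SFP1, SFP6, SFP4)
A similar-fingerprint table is built from the similar fingerprints, and each data block is recorded in its three SFP queues. The data block queue corresponding to each similar fingerprint is then:
SFP1: A, B, C, D, F
SFP2: A, D
SFP3: A, C
SFP4: B, E, F
SFP5: B, C, E
SFP6: E, F
SFP7: D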
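The similar-fingerprint table above can be reproduced with a short Python sketch. The dictionary layout and the function name `build_queues` are illustrative assumptions; the block names and fingerprints are taken from the example.

```python
from collections import defaultdict

# Fingerprints of each data block, as listed in the example.
block_fingerprints = {
    "A": ["SFP1", "SFP2", "SFP3"],
    "B": ["SFP1", "SFP4", "SFP5"],
    "C": ["SFP3", "SFP5", "SFP1"],
    "D": ["SFP1", "SFP7", "SFP2"],
    "E": ["SFP4", "SFP5", "SFP6"],
    "F": ["SFP1", "SFP6", "SFP4"],
}


def build_queues(fingerprints_by_block):
    """Map each similar fingerprint to the ordered queue of blocks that include it."""
    queues = defaultdict(list)
    for block, fingerprints in fingerprints_by_block.items():
        for fingerprint in fingerprints:
            queues[fingerprint].append(block)
    return dict(queues)


queues = build_queues(block_fingerprints)
for fingerprint in sorted(queues):
    print(fingerprint, queues[fingerprint])
```

Each data block is recorded once per fingerprint, so a block with three fingerprints appears in three queues.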
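Next, based on the similar-fingerprint data block queues, the data blocks are divided into joint compression groups for similar-data compression. A joint compression group, also called a data compression group, is a grouping of different data blocks for similar-data compression. The data produced by compressing one joint compression group is stored in one storage unit of the memory, for example in one physical page.
The overall idea of dividing data blocks into joint compression groups is to group them by the data block queues of the different similar fingerprints. That is, based on the similar-fingerprint data block queues obtained in the preceding step, data blocks that have at least one similar fingerprint in common are divided into one joint compression group.
The similar data blocks in a joint compression group are compressed with reference to the reference data block of that group, and the deltas of the compressed similar data blocks are written to the storage unit together with the reference data block.
In one implementation, the specific method of dividing data blocks into joint compression groups for data compression includes the following steps:
203: Determine a first reference data block in the first data block queue, and compress the data blocks in the first data block queue with reference to the first reference data block.
First, joint compression groups can be divided by the data block queue corresponding to each similar fingerprint, with all data blocks in one data block queue divided into one joint compression group.
For example, the first data block queue, corresponding to the first similar fingerprint SFP1, is divided into the first joint compression group. In the example above, this means that data blocks A, B, C, D, and F are divided into the first joint compression group.
Next, the reference data block of the joint compression group is determined. The data blocks in the joint compression group other than the reference data block can be called the similar data blocks of that reference data block, and they can be compressed with reference to the reference data block.
For example, according to the similar-fingerprint queues above, data blocks A, B, C, D, and F can be divided into one joint compression group with A set as the reference data block of the group. The other data blocks B, C, D, and F are similar data blocks and are compressed relative to data block A.
In the embodiments of this application, to reduce the amount of data read and written during decompression, the similar data blocks within a joint compression group may be compressed only with reference to the reference data block and not with reference to one another. During decompression, the reference data block and the delta of the similar data block are read and processed to restore the original similar data block. This application is not limited to this approach: where decompression performance allows, similar data blocks in a joint compression group can also be compressed with reference to one another; for example, a similar data block can be compressed with reference to both the reference data block and another similar data block.
Specifically, one data block in the joint compression group is selected as the reference data block. It may be the first data block divided into the joint compression group, for example data block A in the data block queue SFP1: A, B, C, D, F. Alternatively, the data block with the highest similar-compression gain may be selected, or another selection method may be used; this application does not specifically limit this.
The compression gain indicates how much storage space is saved for storing the data blocks after multiple similar data blocks are compressed with reference to the reference data block, that is, how much the storage capacity utilization improves.
For example, for data blocks A, B, C, D, and F in the first joint compression group, if data block A is used as the reference data block and data blocks B, C, D, and F are compressed with reference to it, the reference data block A is self-compressed into A' by a compression algorithm, and the deltas ΔB, ΔC, ΔD, and ΔF of the other data blocks relative to data block A are obtained. The compression gain of reference data block A is the proportion of storage saved by storing A', ΔB, ΔC, ΔD, and ΔF instead of data blocks A, B, C, D, and F.
Therefore, selecting the data block with the highest similar-compression gain as the reference data block means selecting the data block in the joint compression group for which the compressed data occupies the least storage space when all other data blocks are compressed with reference to it. This data block can be determined by pre-compressing the similar data blocks, or by another compression algorithm.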
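The pre-compression approach to choosing a reference data block can be sketched as follows. This is a minimal illustration under stated assumptions: zlib with a preset dictionary stands in for the reference compression algorithm, the names `delta_size`, `stored_size`, and `pick_reference` are hypothetical, and the test blocks are synthetic random data built so that one block overlaps with all the others.

```python
import random
import zlib


def delta_size(block: bytes, reference: bytes) -> int:
    """Size of `block` after compression with `reference` as preset dictionary."""
    comp = zlib.compressobj(zdict=reference)
    return len(comp.compress(block) + comp.flush())


def stored_size(blocks: dict, ref_name: str) -> int:
    """Total size of the self-compressed reference plus the deltas of all other blocks."""
    total = len(zlib.compress(blocks[ref_name]))
    for name, data in blocks.items():
        if name != ref_name:
            total += delta_size(data, blocks[ref_name])
    return total


def pick_reference(blocks: dict) -> str:
    """Trial-compress with every candidate and keep the one with the smallest total."""
    return min(blocks, key=lambda name: stored_size(blocks, name))


rng = random.Random(1)


def random_bytes(length: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(length))


# Hypothetical group: C shares 1 KB with each of X, Y, and Z.
part1, part2, part3 = random_bytes(1024), random_bytes(1024), random_bytes(1024)
blocks = {
    "X": part1 + random_bytes(1024),
    "Y": part2 + random_bytes(1024),
    "C": part1 + part2 + part3,
    "Z": part3 + random_bytes(1024),
}
best = pick_reference(blocks)
```

Trial compression costs one compression pass per candidate and per block, so a real system would likely restrict the candidates, for example to the blocks that share the most fingerprints with the rest of the group.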
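In a possible implementation, the data blocks in a joint compression group are jointly compressed subject to a limit on the number of data blocks, a limit on the total size of the compressed data, or both limits at once.
That is, joint compression groups can be divided and compressed according to a limit on the number of data blocks. For example, if the system is set so that one joint compression group can compress at most 8 data blocks at a time, the preset threshold on the number of data blocks of the joint compression group is 8. Alternatively, the total size of the compressed data stored by one joint compression group can be limited to 16 KB, that is, the amount of stored data is limited, in which case the preset threshold on the stored data size of the joint compression group is 16 KB. Or both conditions can apply at once: one joint compression group can compress at most 8 data blocks at a time, and the total size of the compressed data is limited to 16 KB.
For example, as shown in FIG. 3, the first data block queue is SFP1: A, B, C, D, E. The first reference data block in the first data block queue can be determined to be A, and data blocks B, C, D, and E in the first data block queue are compressed with reference to the first reference data block A and stored in the first joint compression group. This gives the first joint compression group shown in FIG. 3.
The reference data block in a joint compression group can be self-compressed directly by a compression algorithm, or compressed separately with another compression strategy. The embodiments of this application use direct self-compression of the reference data block only as an example. For example, reference data block A becomes A' after self-compression.
204: If the number of data blocks or the stored data size of the first joint compression group has not reached the preset threshold, obtain at least one data block from the second data block queue, and compress the at least one data block with reference to the first reference data block.
The second data block queue includes data blocks that have the same fingerprint as the first reference data block. The at least one data block is a data block in the second data block queue other than the first reference data block that has not yet been compressed.
If the size of the data obtained by compressing the data blocks in the first data block queue with reference to the first reference data block does not reach the preset threshold, that is, the first joint compression group has remaining space, or the number of data blocks in the first joint compression group has not reached its upper limit, similar data blocks in the second data block queue can be divided into the first joint compression group for compression. These similar data blocks must have a similar fingerprint in common with the first reference data block of the first joint compression group.
For example, the second data block queue is SFP2: A, F, H, I. Data block A has already been self-compressed as the reference data block of the first joint compression group and stored as A', so it does not need to be compressed again. Data blocks F, H, and I share the similar fingerprint SFP2 with data block A, so they can be compressed with reference to data block A and can also be put into the first joint compression group for similar-data compression.
In one implementation, the data blocks in the second data block queue that can be compressed with reference to the first reference data block are selected according to the remaining space of the first storage space, the preset limit on the number of data blocks compressed in the first joint compression group, or both conditions together.
For example, if the preset limit on the number of data blocks compressed in the first joint compression group is 8, data blocks F, H, and I can all be put into the first joint compression group for compression. If the preset limit is 6, only one more data block can be added: data block F is put into the first joint compression group for compression and storage, and the other data blocks in the second data block queue form the second joint compression group. Data block H can be determined as the reference data block of the second joint compression group, and data block I is compressed with reference to data block H, as shown in FIG. 4.
205: Determine a second reference data block among the uncompressed data blocks in the second data block queue, and compress the data blocks in the second data block queue with reference to the second reference data block.
Some data blocks in the second data block queue have already been divided into the first joint compression group, compressed, and stored there, so they do not need to be compressed again. The uncompressed data blocks in the second data block queue are divided into the second joint compression group, the second reference data block is determined from among them, and the other data blocks in the second joint compression group are compressed with reference to the second reference data block.
Further, if the second joint compression group has remaining space, at least one data block is obtained from a third data block queue, compressed with reference to the second reference data block, and stored in the second joint compression group.
206: Repeat steps 203 to 205 until all data blocks to be compressed in the data block queues have been compressed and stored.
In other words, in the above embodiment the first data block queue is divided into a joint compression group first and the first reference data block of the first joint compression group is determined. Data blocks that include the same data fingerprint as the first reference data block are compressed with reference to it and stored in the first joint compression group, until the preset number of data blocks or the preset stored data size of the first joint compression group is not enough to store the next data block, or until no data block that includes the same data fingerprint as the first reference block remains.
Next, the second reference block of the second joint compression group is determined. Data blocks that include the same data fingerprint as the second reference block are compressed with reference to it and stored in the second joint compression group, until the preset number of data blocks or the preset stored data size of the second joint compression group is not enough to store the next data block, or until no data block that includes the same data fingerprint as the second reference block remains.
The above steps are repeated until no data blocks remain to be compressed.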
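Steps 203 to 206 can be sketched as a greedy packing loop. The sketch below is a simplification under stated assumptions: it enforces only the block-count limit (the size limit is omitted), always takes the first uncompressed block of a queue as the reference data block, and performs no actual compression. The function name `build_groups` and its return format are hypothetical. The input reproduces the example of FIG. 3 and FIG. 4.

```python
def build_groups(queues, max_blocks):
    """Greedily pack data blocks into joint compression groups.

    `queues` maps each similar fingerprint to its ordered data block queue.
    Returns a list of (reference_block, [similar_blocks]) tuples.
    """
    compressed = set()
    groups = []
    for fingerprint, queue in queues.items():
        while True:
            pending = [block for block in queue if block not in compressed]
            if not pending:
                break
            reference = pending[0]  # assumption: first uncompressed block is the reference
            compressed.add(reference)
            members = []
            # Fill from the reference's own queue first, then from every other
            # queue that also contains the reference (cross-fingerprint filling).
            sources = [queue] + [
                other for fp, other in queues.items()
                if fp != fingerprint and reference in other
            ]
            for source in sources:
                for block in source:
                    if len(members) + 1 >= max_blocks:
                        break  # group is full (reference plus similar blocks)
                    if block not in compressed:
                        members.append(block)
                        compressed.add(block)
            groups.append((reference, members))
    return groups


queues = {
    "SFP1": ["A", "B", "C", "D", "E"],
    "SFP2": ["A", "F", "H", "I"],
}
groups_limit_8 = build_groups(queues, max_blocks=8)
groups_limit_6 = build_groups(queues, max_blocks=6)
```

With a limit of 8, all blocks fit into one group referencing A. With a limit of 6, only F is taken from the SFP2 queue, and H and I form a second group with H as its reference, matching FIG. 4.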
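In one implementation, joint compression groups are divided according to the limit on the compressed data size and/or the limit on the number of data blocks compressed. For the data block queue of each SFP, the corresponding joint compression group may not be full. When a joint compression group still has remaining space but adding all the data blocks of the queue corresponding to another fingerprint, SFP2, would exceed the limit of the joint compression group, another joint compression group has to be built. Multiple joint compression groups are then left partly empty, which wastes space.
In addition, each joint compression group has a reference data block, such as reference data block A in the example above, so every joint compression group that is created increases the number of reference data blocks, such as data block F in the embodiment above. A reference data block is compressed by a direct compression algorithm, which can only reduce the redundant data inside that data block; its compression rate is lower than that of a similar data block compressed with reference to a reference data block.
Therefore, in the above embodiments of this application, similar data blocks in the data block queues of different similar fingerprints are jointly compressed with reference to one common reference data block, that is, with a common reference data block across fingerprints. This improves the storage space utilization of the joint compression group and reduces the number of reference data blocks, turning would-be reference data blocks into additional similar data blocks wherever possible and thereby improving the overall reduction rate.
In a possible implementation, the reference data block of a joint compression group can be stored in the first position of the storage unit or in a middle position, or the reference data block can be stored separately from the similar data blocks.
When the reference data block of the joint compression group is stored in the first position of the storage unit, its offset in the storage unit is 0, so that during decompression the system can conveniently read the reference data block by its storage position index.
When the reference data block of the joint compression group is stored in the middle of the storage unit, decompressing a given similar data block requires reading only the first half or the second half of the storage unit rather than the whole unit. Because data is read sequentially with a certain stride, when the system needs to decompress data block B it can start reading from the offset of reference data block A in the storage unit and read A' and ΔB, and then restore data block B with the decompression algorithm.
Alternatively, the reference data block can be stored separately from the similar data blocks, that is, the reference data block is stored in storage unit 1 and the similar data blocks are stored in storage unit 2. This makes frequent reads of reference data blocks convenient and allows all reference data blocks to be managed centrally. For example, all or some of the reference data blocks can be stored together in internal memory and the similar data blocks in external storage.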
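To make the effect of a middle placement concrete, the following sketch computes which byte range of a storage unit must be read to decompress one similar data block. The layout representation, the entry names, the sizes, and the function name `read_range` are all hypothetical; the source describes the idea only in prose.

```python
def read_range(layout, block, reference):
    """Return the (start, end) byte range that covers `block` and `reference`.

    `layout` is the ordered list of (name, size_in_bytes) entries stored in
    one storage unit.
    """
    offsets = {}
    position = 0
    for name, size in layout:
        offsets[name] = (position, position + size)
        position += size
    start = min(offsets[block][0], offsets[reference][0])
    end = max(offsets[block][1], offsets[reference][1])
    return start, end


# Hypothetical sizes; the self-compressed reference A' sits in the middle,
# with the deltas of B and C before it and the deltas of D and F after it.
layout = [("dB", 100), ("dC", 120), ("A'", 2000), ("dD", 80), ("dF", 90)]
range_for_b = read_range(layout, "dB", "A'")
range_for_f = read_range(layout, "dF", "A'")
```

In this layout, decompressing B reads bytes 0 to 2220 and decompressing F reads bytes 220 to 2390, in both cases less than the full 2390-byte unit.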
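In another possible implementation, when jointly compressing similar data blocks, the system can first divide the similar data blocks in descending order of the degree of similarity between the data blocks to obtain multiple data block queues. The data blocks in the same data block queue are then compressed with reference to the same reference data block, and the deltas are stored in the joint compression group.
The more fingerprints multiple data blocks have in common, the higher their similarity. The similarity between data blocks can be determined from the multiple fingerprints corresponding to each data block. It can also be judged in other ways, for example with global fingerprints; this application does not specifically limit this. Other methods of judging the similarity of data blocks are briefly introduced later and are not detailed here.
That is, the multiple data blocks can first be divided into joint compression groups in descending order of the degree of similarity between them. The data blocks with the highest similarity, that is, those with the largest number of similar fingerprints in common, are first determined to be one data block queue, and the data blocks in that queue are jointly compressed and stored in the same joint compression group. One of them is determined to be the first reference data block of the first joint compression group, and the deltas of the similar data blocks are stored in the first joint compression group.
Then the data blocks with the second largest number of similar fingerprints in common are put into the second joint compression group, and one of them is determined to be the second reference data block of the second joint compression group. The similar data blocks in the second joint compression group are compressed with reference to the second reference data block.
Further, in step 202 above, obtaining multiple data block queues according to the multiple fingerprints corresponding to each of the multiple data blocks includes:
determining the similarity between data blocks according to the multiple fingerprints corresponding to each of the multiple data blocks, where the more fingerprints at least two data blocks have in common, the higher the similarity of the at least two data blocks; and
determining the multiple data block queues in descending order of the similarity between data blocks. In the data block queue associated with each fingerprint, one of the data blocks with the highest similarity is determined to be the first reference data block. For example, the more similar fingerprints at least two data blocks have in common, the higher the similarity of the at least two data blocks.
Note that a data block that has already been divided into a joint compression group is not divided into a joint compression group or compressed a second time.
For example, based on the similar fingerprints extracted from the data blocks to be compressed, the data blocks corresponding to each similar fingerprint are as follows:
SFP1: A, B, D, H, I, J, K, L, C, M, E
SFP2: A, Y, H, E, M
SFP3: A, C, E, O, P, Q
SFP4: H, M, J
Analyzing the similar fingerprints of these data blocks shows that the data blocks with the highest similarity, that is, those with the largest number of similar fingerprints in common, are data blocks A and E, and data blocks H and M: data blocks A and E share the three similar fingerprints SFP1, SFP2, and SFP3, and data blocks H and M share the three similar fingerprints SFP1, SFP2, and SFP4. The data blocks with the second highest similarity are A and C, and H and J, each pair sharing two similar fingerprints. The remaining data blocks share at most one similar fingerprint with one another.
Following the strategy of dividing data block queues in descending order of similarity, data blocks A and E are first divided into the first data block queue Group1, and data block A is determined to be the reference data block of Group1. Data blocks H and M are divided into the second data block queue Group2, and data block H is determined to be the reference data block of Group2.
Next, among the second most similar data blocks A and C, data block A is the reference data block of Group1, so data block C is also put into Group1 and compressed with reference to data block A. Among data blocks H and J, data block H is the reference data block of Group2, so data block J is also put into Group2 and compressed with reference to data block H.
Next, the remaining data blocks are divided by shared similar fingerprint. Data blocks B, D, I, K, and L of the data blocks corresponding to SFP1 can be divided into Group1 and compressed with reference to data block A, as shown in FIG. 5. Data block Y, among the data blocks corresponding to SFP2, shares the similar fingerprint SFP2 with the reference data block H of Group2, so data block Y is also put into Group2 and compressed with reference to data block H.
The remaining data blocks O, P, and Q in the data block queue corresponding to SFP3 could be compressed with reference to data block A, but because Group1 has no spare storage space, a new joint compression group has to be created: data blocks O, P, and Q are put into Group3, data block O is determined to be the reference data block of Group3, and data blocks P and Q are compressed with reference to data block O. The data blocks in the data block queue corresponding to SFP4 have all been compressed, so there is no need to create a new joint compression group or determine a new reference data block. See FIG. 5 for a schematic diagram of this joint compression.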
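The similarity analysis that opens this example, counting how many similar fingerprints each pair of data blocks has in common, can be sketched as follows. The queue contents are taken from the example; the function name `shared_fingerprint_counts` and the pair representation are illustrative assumptions.

```python
from itertools import combinations

queues = {
    "SFP1": ["A", "B", "D", "H", "I", "J", "K", "L", "C", "M", "E"],
    "SFP2": ["A", "Y", "H", "E", "M"],
    "SFP3": ["A", "C", "E", "O", "P", "Q"],
    "SFP4": ["H", "M", "J"],
}


def shared_fingerprint_counts(queues):
    """Count, for every pair of blocks, how many fingerprint queues contain both."""
    counts = {}
    for queue in queues.values():
        # Sorting makes each pair appear with its names in alphabetical order.
        for pair in combinations(sorted(queue), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


counts = shared_fingerprint_counts(queues)
highest = max(counts.values())
most_similar = sorted(pair for pair, n in counts.items() if n == highest)
print(most_similar)
```

The pairs with the highest count are the ones the example places first into Group1 and Group2. Counting every pair in every queue grows quadratically with queue length, which is the comparison cost that global fingerprints are meant to reduce.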
在一种可能的实施方式中,如果不考虑数据块之间的相似程度、而只是按照相似指纹对应的数据块队列进行联合压缩组的划分,例如,对上述示例中的数据块进行联合压缩组的划分,若联合压缩组的数据块个数限制为8个,对于SFP1对应的多个数据块,将数据块A确定为Group11的参考数据块,将数据块B、D、H、I、J、K、L都放入联合压缩组Group11中,参考数据块A进行压缩;将其余的数据块C、M、N、E组成联合压缩组Group22,可以确定数据块C为Group22的参考数据块,将数据块M、N、E参考数据块C进行压缩。对于SFP2的数据块队列,数据块A、H、E已经压缩过了,则将数据块Y划分为一个联合压缩组Group33;对于SFP3,A、C、E已经压缩过了,则将数据块O、P、Q划分为一个联合压缩组Group44,如图6所示。
对比在上述的两种不同的划分联合压缩组的方式下,第一联合压缩组和第二联合压缩组的压缩情况对比。第一联合压缩组Group1中的数据块都是参考数据块A进行的相似数据压缩,且数据块E有三个相似指纹都与数据块A是相同的,而数据块C也有两个相似指纹与数据块A是相同的,即表示数据块E、C都与数据块A是强相似的,参考数据块A进行压缩,会得到较高的缩减率。类似的,第二联合压缩组Group2中的数据块H、J与参考数据块H是强相似的,因此参考数据块H做压缩会得到较高的缩减率。
而第一联合压缩组Group11中的数据块并未按照数据块之间的相似程度进行划分,只按照相似指纹队列对数据块进行划分,则数据块B、D、I、J、K都是与数据块A只有一个相同的相似指纹,参考压缩的缩减率不会很高。
因此,在对数据块队列中的数据块进行划分联合压缩组的划分时,先将强相似的数据块划分到同一个联合压缩组,可以提高数据压缩的缩减率,提供存储空间的利用效率。
并且,对于相似程度比较强的多个数据块,如果选择其一作为参考数据块,将相似程度比较强的其他数据块参照该参考数据块压缩,其压缩效率是比较高的。如果不考虑相似的数据块之间的相似程度,就不能很好的发挥相似参考压缩的意义。因此,通过识别相似度,区分数据块之间的相似程度,将相似度更高的数据块优先联合压缩,并选用其一作为参考块,其他作为相似块,可得到更高的相似压缩比,进而节省空间。
本申请的上述实施例,基于数据块的相似指纹进行识别和分组的相似压缩方案,主要涉及的模块和处理流程可以如图7所示,新写入的或者更新的数据块通过获取相似指纹模块,计算数据块包括的多个相似指纹SFP;根据相似指纹SFP建立相似指纹对应的数据块队列,并保存相似指纹表;指纹分析模块对数据块队列进行相似指纹分析,根据数据块之间的相似指纹相同的个数,判断多个数据块与已有数据块的相似程度,从而进行分组;相似数据块压缩模块优先将相似程度更高的数据块划分到一个联合压缩组,进行相似数据压缩,将相似数据块参考参考数据块联合压缩;最后底层的存储模块将经过相似压缩后的数据写入磁盘中。
在一种可能的实施方式中,上述实施例中的多个相似指纹,还可以是对至少两个相似指纹之间进行特征提取或者其他运算生成的全局指纹,也可以称为超级指纹。其中,全局指纹可以用于表示数据块的特征,也可以用于表征数据块的相似指纹之间的特征。
因此,上述实施例中的判断数据块之间的相似程度,还可以通过对数据块的全局指纹的对比来判断。因为,对多个数据块对应的的多个相似指纹,查找数据块之间相同的相似指纹的数量,系统需要进行大量的比较运算。而通过全局指纹比较多个数据块之间的相似程度,系统可以进行快速识别,有效的减少了计算量。
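A global fingerprint can be generated by extracting the features of a data block with techniques such as locality-sensitive hashing (LSH) or Hamming distance. It can also be generated by extracting data features shared among multiple similar fingerprints, or by applying simple operations, such as XOR, to multiple similar fingerprints. For example, if six similar fingerprints are extracted from a data block, then to reduce the space needed to store a large number of similar fingerprints, the six similar fingerprints can be XORed in pairs to produce three global fingerprints, which are stored and used to compare the similarity between data blocks, thereby reducing the system's lookup and comparison workload.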
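The XOR example can be written out in a few lines. This is a sketch only: the text says the six fingerprints are XORed in pairs but does not say which fingerprints are paired, so pairing adjacent entries is an assumption, and `global_fingerprints` is a hypothetical name.

```python
def global_fingerprints(similar_fps):
    """Fold similar fingerprints into global fingerprints by XORing adjacent pairs."""
    if len(similar_fps) % 2:
        raise ValueError("an even number of similar fingerprints is required")
    return [similar_fps[i] ^ similar_fps[i + 1] for i in range(0, len(similar_fps), 2)]


# Six similar fingerprints fold into three global fingerprints.
block_a = [0x1A, 0x2B, 0x3C, 0x4D, 0x5E, 0x6F]
block_b = [0x1A, 0x2B, 0x3C, 0x4D, 0x77, 0x88]  # differs from block_a in the last pair

gfp_a = global_fingerprints(block_a)
gfp_b = global_fingerprints(block_b)

# Count matching global fingerprints position by position.
matches = sum(1 for x, y in zip(gfp_a, gfp_b) if x == y)
print(gfp_a, gfp_b, matches)
```

Storing three values instead of six halves the fingerprint table, at the cost that a global fingerprint normally matches only when both fingerprints in its pair match.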
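In the above embodiments of this application, joint compression groups are divided in descending order of similarity between data blocks. Because joint compression distinguishes the degree of similarity between data blocks, highly similar data blocks are compressed with reference to the same data block, which improves the data reduction ratio and effectively saves the storage space used by the compressed data.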
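Other embodiments of this application provide a data compression apparatus that can be applied to an electronic device with data storage capability. As shown in FIG. 8, the apparatus includes an acquisition module 801 and a compression module 802.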
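The acquisition module 801 is configured to obtain multiple data block queues according to the multiple fingerprints corresponding to each of multiple data blocks, where the data blocks in each data block queue share a common fingerprint.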
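The compression module 802 is configured to determine whether the size of the data obtained by compressing the data blocks in the first data block queue with reference to the first reference data block has reached a preset threshold and, if it has not, to obtain at least one data block from the second data block queue and compress the at least one data block with reference to the first reference data block, where the second data block queue includes a data block having the same fingerprint as the first reference data block.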
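The behaviour of the compression module can be sketched as follows. This is an illustrative sketch under assumptions, not the apparatus itself: reference compression is approximated with zlib's preset dictionary (`zdict`), the threshold is interpreted as a byte budget for the compressed group, and `fill_group` and the two helper functions are hypothetical names.

```python
import zlib


def compress_with_reference(block: bytes, reference: bytes) -> bytes:
    """Compress a block using the reference block as a preset dictionary."""
    compressor = zlib.compressobj(zdict=reference)
    return compressor.compress(block) + compressor.flush()


def decompress_with_reference(payload: bytes, reference: bytes) -> bytes:
    """Inverse of compress_with_reference; needs the same reference block."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(payload) + decompressor.flush()


def fill_group(reference, first_queue, second_queue, threshold):
    """Compress the first queue, then top up from the second queue.

    Blocks from the second queue are added only while the total compressed
    size is still below the threshold.
    """
    compressed = [compress_with_reference(block, reference) for block in first_queue]
    total = sum(len(payload) for payload in compressed)
    for block in second_queue:
        if total >= threshold:
            break
        payload = compress_with_reference(block, reference)
        compressed.append(payload)
        total += len(payload)
    return compressed


reference = b"storage block payload " * 50
first_queue = [reference + b"one", reference + b"two"]
second_queue = [reference + b"three"]

group = fill_group(reference, first_queue, second_queue, threshold=10**6)
restored = [decompress_with_reference(payload, reference) for payload in group]
print(len(group), [len(payload) for payload in group])
```

Decompression needs the same reference block, so in practice the reference would have to be stored alongside the group; how that is done is outside the scope of this sketch.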
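The specific manner in which each module of the apparatus in the above embodiment performs its operations has been described in detail in the method embodiments and is not repeated here.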
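An embodiment of this application further provides a chip system that can be applied to the electronic device in the above embodiments. The chip system includes one or more interface circuits and one or more processors, and the interface circuits and the processors are interconnected by lines. The interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the functions or steps of any possible data compression method in the above embodiments.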
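An embodiment of this application further provides a computer-readable storage medium that includes computer instructions. When the computer instructions are run on the above electronic device, the electronic device is caused to perform the functions or steps performed by the electronic device in the above method embodiments.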
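An embodiment of this application further provides a computer program product. When the computer program product is run on a computer, the computer is caused to perform the functions or steps performed by the electronic device in the above method embodiments.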
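From the description of the above implementations, a person skilled in the art can clearly understand that, for convenience and brevity of description, only the above division into functional modules is used as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to perform all or part of the functions described above.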
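In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.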
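The units described as separate components may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units; that is, it may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.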
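In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.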
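If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.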
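Finally, it should be noted that the above are merely specific implementations of this application, and the protection scope of this application is not limited thereto. Any change or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.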

Claims (9)

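  1. A data compression method, characterized in that the method comprises:
    obtaining a first fingerprint and a second fingerprint of a first data block;
    adding the first data block to a first data block queue according to the first fingerprint, wherein every data block in the first data block queue contains the first fingerprint;
    adding the first data block to a second data block queue according to the second fingerprint, wherein every data block in the second data block queue contains the second fingerprint;
    when the size of the data obtained by compressing the data blocks in the first data block queue with reference to a first reference data block has not reached a preset threshold, obtaining a second data block from the second data block queue, wherein the second data block queue includes a data block having the same fingerprint as the first reference data block; and
    compressing the second data block with reference to the first reference data block.
  2. The method according to claim 1, characterized in that the method further comprises:
    determining a second reference data block among the data blocks in the second data block queue that have not been compressed; and
    compressing the data blocks in the second data block queue that have not been compressed with reference to the second reference data block.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    determining the first reference data block, wherein the first reference data block is one of the data blocks with the highest similarity in the first data block queue.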
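  4. A data compression apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to obtain a first fingerprint and a second fingerprint of a first data block; add the first data block to a first data block queue according to the first fingerprint, wherein every data block in the first data block queue contains the first fingerprint; and add the first data block to a second data block queue according to the second fingerprint, wherein every data block in the second data block queue contains the second fingerprint; and
    a compression module, configured to, upon determining that the size of the data obtained by compressing the data blocks in the first data block queue with reference to a first reference data block has not reached a preset threshold, obtain a second data block from the second data block queue, wherein the second data block queue includes a data block having the same fingerprint as the first reference data block; and compress the second data block with reference to the first reference data block.
  5. The apparatus according to claim 4, characterized in that the compression module is further configured to determine a second reference data block among the data blocks in the second data block queue that have not been compressed, and compress the data blocks in the second data block queue that have not been compressed with reference to the second reference data block.
  6. The apparatus according to claim 4 or 5, characterized in that the compression module is further configured to determine the first reference data block, wherein the first reference data block is one of the data blocks with the highest similarity in the first data block queue.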
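  7. A chip system, characterized in that the chip system is applied to an electronic device; the chip system comprises one or more interface circuits and one or more processors; the interface circuits and the processors are interconnected by lines; the interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, the signal comprising computer instructions stored in the memory; and when the processor executes the computer instructions, the electronic device performs the data compression method according to any one of claims 1 to 3.
  8. A computer-readable storage medium, characterized in that the readable storage medium stores instructions, and when the instructions are run on an electronic device, the electronic device is caused to perform the data compression method according to any one of claims 1 to 3.
  9. A computer program product, characterized in that when the computer program product is run on a computer, the computer is caused to perform the data compression method according to any one of claims 1 to 3.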
PCT/CN2020/120980 2019-10-31 2020-10-14 Data compression method and apparatus WO2021082926A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911054906.0A CN111061428B (zh) 2019-10-31 2019-10-31 Data compression method and apparatus
CN201911054906.0 2019-10-31

Publications (1)

Publication Number Publication Date
WO2021082926A1 (zh) 2021-05-06

Family

ID=70297596

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120980 WO2021082926A1 (zh) 2019-10-31 2020-10-14 Data compression method and apparatus

Country Status (2)

Country Link
CN (1) CN111061428B (zh)
WO (1) WO2021082926A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220197527A1 (en) * 2020-12-23 2022-06-23 Hitachi, Ltd. Storage system and method of data amount reduction in storage system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061428B (zh) * 2019-10-31 2021-05-18 华为技术有限公司 Data compression method and apparatus

Also Published As

Publication number Publication date
CN111061428B (zh) 2021-05-18
CN111061428A (zh) 2020-04-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20880405; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20880405; Country of ref document: EP; Kind code of ref document: A1)