CN111796969A - Data difference compression detection method, computer equipment and storage medium - Google Patents

Data difference compression detection method, computer equipment and storage medium

Info

Publication number
CN111796969A
CN111796969A
Authority
CN
China
Prior art keywords
data
chunk
container
cache
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010473699.9A
Other languages
Chinese (zh)
Other versions
CN111796969B (en)
Inventor
张宇成
马泽宇
王春枝
严灵毓
苏军
杨宇
李德畅
薛天赐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202010473699.9A priority Critical patent/CN111796969B/en
Publication of CN111796969A publication Critical patent/CN111796969A/en
Application granted granted Critical
Publication of CN111796969B publication Critical patent/CN111796969B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data differential compression method. By detecting repeatedly rewritten data blocks, the method determines which containers contain a large number of potential reference data blocks, and overlaps the reading of those reference data blocks with the reading of fingerprints required for data deduplication. In this way, redundant data among partially similar data blocks is eliminated by differential compression without significantly reducing system throughput, improving both the storage performance and the data-read performance of the system.

Description

Data difference compression detection method, computer equipment and storage medium
Technical Field
The present invention relates to the field of data backup technologies, and in particular, to a data delta compression method, a computer device, and a storage medium.
Background
With the rapid development of the internet and information technology, the global volume of information is growing rapidly and was expected to reach 44 ZB by 2020. However, recent research shows that this information contains a large amount of redundant data; in backup storage systems in particular, roughly 90% of the data is redundant. Techniques for eliminating redundant data have therefore attracted wide attention.
In general, data deduplication comprises five steps: data chunking, fingerprint calculation, fingerprint duplicate checking, data rewriting, and data writing. Data chunking divides the file to be backed up into data blocks using a variable-length chunking algorithm (the average block length is typically 8 KB); fingerprint calculation computes a fixed-length fingerprint for each data block; fingerprint duplicate checking decides whether data blocks are duplicates by matching their fingerprints. In the duplicate-checking step, the storage system maintains a fingerprint cache; the fingerprint of the block being processed is first matched against this cache, and if it is found there, the block is a duplicate. If it is not found, the on-disk fingerprint index is queried. If the fingerprint exists in the index, the block is a duplicate, and all data-block fingerprints stored in that block's container are read into the fingerprint cache; because of locality among data blocks, the fingerprints of subsequent blocks are then found in the cache with high probability, reducing the number of queries against the on-disk index. If the fingerprint is not found in the index either, the block is a non-duplicate data block.
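The duplicate-check flow above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the names (`FP_cache`, `D_index`, `containers`) and the choice of SHA-1 are assumptions.

```python
import hashlib

# Illustrative state; a real system keeps D_index on disk and FP_cache in memory.
D_index = {}      # on-disk fingerprint index: fingerprint -> container number
containers = {}   # container number -> list of (fingerprint, data block)
FP_cache = {}     # in-memory fingerprint cache: fingerprint -> container number

def fingerprint(block: bytes) -> str:
    # A fixed-length fingerprint per data block (SHA-1 is a common choice).
    return hashlib.sha1(block).hexdigest()

def is_duplicate(fp: str) -> bool:
    if fp in FP_cache:                # 1) hit in the fingerprint cache
        return True
    cid = D_index.get(fp)             # 2) fall back to the on-disk index
    if cid is None:
        return False                  # non-duplicate data block
    # 3) exploit locality: prefetch every fingerprint of that container into
    #    the cache so later lookups avoid querying the on-disk index
    for f, _ in containers[cid]:
        FP_cache[f] = cid
    return True
```

The prefetch in step 3 is what makes subsequent lookups cheap when the backup stream has locality.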
To better understand the data rewriting step, the data recovery process is described first. After deduplication, non-duplicate data blocks are packed in order into fixed-length storage units called containers. A container is much larger than a data block, typically 4 MB, and is written to disk once it reaches its maximum capacity (e.g., 4 MB).
During data recovery, the system maintains a recovery cache in memory. The system's unit of reading is the container: the container holding a required data block is read into the recovery cache, and the system obtains the block from the cache. Thanks to data locality, subsequently required blocks can often be found directly in the recovery cache, reducing the number of disk reads during recovery. But if a container read into the recovery cache holds only a few valid data blocks, the read is not cost-effective. This happens because deduplication lets multiple backups share duplicate blocks, so the blocks of the most recent backup end up scattered across different containers, a phenomenon known as fragmentation. The main cost of data recovery is the disk overhead of reading containers; fragmentation reduces the number of valid blocks per container read into the recovery cache and increases the number of containers that must be read, degrading recovery performance.
When a container holds only a small number of valid data blocks, those blocks are called fragmented data blocks. A rewrite algorithm can detect fragmented blocks and write them to disk again together with the non-duplicate blocks, so that the blocks of a backup are stored on disk as sequentially as possible; this increases both the locality of the data blocks and the number of valid blocks per container, improving recovery performance. In the rewriting step, the system runs the rewrite algorithm to compute the proportion of the relevant duplicate data blocks within their container: if the proportion is large, the duplicate blocks have strong locality and need not be rewritten; if it is small, the duplicate blocks in that container are fragments and must be rewritten. However, the rewrite algorithm by itself cannot fundamentally eliminate fragmented data blocks.
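The rewrite decision described above amounts to a utilization test on the container. The sketch below is hypothetical; the 50% threshold is an assumed parameter, not taken from the patent.

```python
def is_fragmented(valid_blocks: int, container_blocks: int,
                  utilization_threshold: float = 0.5) -> bool:
    """Treat the duplicate blocks in a container as fragments (to be rewritten)
    when the fraction of the container referenced by this backup is low."""
    return (valid_blocks / container_blocks) < utilization_threshold
```

Real rewrite algorithms refine this with sliding windows or capped container counts, but the core signal is the same utilization ratio.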
Therefore, to improve data recovery performance, one can start by enhancing the locality of the backup data set and pack the files to be backed up into larger backup units. As shown in fig. 1, each file in a backup unit consists of two areas: a metadata area and a data area. The metadata area stores information such as the file name, file size, file path, ownership rights, and packing time; the data area stores the contents of the file.
However, because part of the metadata area (for example, the packing time) differs each time the data is packed, data blocks whose file content has not changed can fail to be detected as duplicates by deduplication merely because a small amount of metadata content differs. When a data set contains a large number of small files, most data blocks are not deduplicated across versions because of their metadata areas; and because they account for only a small fraction of their container, the duplicate blocks that contain only file content are identified by the rewrite algorithm as fragments even after being rewritten. Such repeatedly rewritten data blocks are called Persistent Fragmented Chunks (PFCs).
Fig. 2 illustrates how PFCs arise. In fig. 2, each backup consists of three files. In backup 1, file 1 is mostly contained in data blocks C1 and C2: C1 contains the metadata area, while C2 contains only the file content of file 1; file 2 is mainly divided into blocks C3 and C4; file 3 is contained in block C5. Assume that none of the file contents change across the three backups. Because the contents of the metadata area change every time the files are packed, only blocks C2 and C4, which contain no file metadata area, are duplicated in each backup. After the third backup, the data blocks are stored in containers I, II, and III respectively; since the proportion of the duplicate blocks C2 and C4 in containers II and III is too small, they are judged to be fragments in both backup 2 and backup 3 and are rewritten each time, i.e., rewritten repeatedly. As a result, the storage containers hold a large amount of redundant data, such as C1, C1′, and C1″, which deduplication cannot eliminate because a small amount of metadata differs.
Combining the above descriptions of data deduplication and data recovery, it can be seen that after deduplication there remains considerable redundant data among similar data blocks containing a file metadata area. This wastes a large amount of storage space, and the uneliminated redundancy written to disk also means that more containers must be read during recovery, reducing data recovery performance. Yet if delta compression is used naively to eliminate this redundancy, reading the reference data blocks it requires can greatly reduce system performance. A more advanced data backup scheme is therefore urgently needed.
Disclosure of Invention
Embodiments of the present invention provide a data backup method, a computer device, and a storage medium based on data delta compression, so as to achieve higher data backup efficiency and better data recovery performance.
In order to solve the above technical problems, the technical solutions provided by the embodiments of the present invention are as follows:
in a first aspect, an embodiment of the present invention provides a delta data compression method, where the method includes:
S1, organize the containers of all data blocks rewritten in the last backup into a lookup table RID_last, and initialize the fingerprint cache FP_cache, the reference data block cache B_cache, and the list RID_current; data in FP_cache and B_cache is updated in units of containers;
S2, pack and chunk the files to be backed up to obtain newly divided data blocks;
S3, take the newly divided data blocks one at a time in division order, and judge whether a block can be taken;
S3.1, if so, denote the obtained block as chunk, compute its fingerprint fp, and execute S4;
S3.2, if not, all backup files have been processed and the backup is finished;
S4, look up fp in FP_cache and judge whether fp exists there;
S4.1, if it exists, chunk is a duplicate data block; execute S5;
S4.2, if it does not exist, query the on-disk fingerprint index D_index and judge whether fp exists there;
S4.2.1, if so, chunk is a duplicate data block; execute S5;
S4.2.2, if not, chunk is a non-duplicate data block; execute S7;
S5, use the rewrite algorithm to judge whether chunk is fragmented;
S5.1, if so, mark chunk as a fragment block, insert the number of the container holding chunk into RID_current, and execute S6;
S5.2, if not, chunk is a duplicate data block but not a fragment block and need not be stored; execute S3;
S6, query RID_last and judge whether the number of the container holding chunk exists in RID_last;
S6.1, if it exists, the container holding chunk contains potential reference data blocks, which serve data deduplication and differential compression at the same time: organize the fingerprints of all data blocks in the container into a first hash table, whose keys are fingerprints and whose values are container numbers, and insert it into FP_cache; organize all data blocks in the container into a second hash table, whose keys are feature values and whose values are data blocks, and insert it into B_cache;
S6.2, if it does not exist, the container holding chunk serves data deduplication only: organize the fingerprints of all data blocks in the container into a third hash table, whose keys are fingerprints and whose values are container numbers, and insert it into FP_cache;
S7, compute a feature value for chunk, search B_cache for a reference data block according to that feature value, and judge whether such a reference data block exists;
S7.1, if it exists, read the reference data block and execute S8;
S7.2, if not, execute S9;
S8, delta-compress chunk against the reference data block to generate a delta block delta;
S9, write chunk or delta into a container, update its index into the on-disk D_index, and execute S3.
Preferably, the method further comprises:
after the backup is finished, writing all container numbers in RID_current to disk, so that the next backup can determine which containers contain potential reference data blocks.
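Steps S5.1 and S6 above are the core of the method and can be sketched as follows. This is a hypothetical sketch: all names are illustrative, and `read_container` stands in for the single disk read that serves both caches.

```python
# A fragmented duplicate block records its container in RID_current; if that
# container was also rewritten in the LAST backup (RID_last), one container
# read fills BOTH the fingerprint cache and the reference-block cache,
# overlapping the reference-data read with the fingerprint read.
def handle_fragment(cid, RID_last, RID_current, FP_cache, B_cache, read_container):
    """read_container(cid) -> iterable of (fingerprint, feature_value, block)."""
    RID_current.add(cid)                 # S5.1: remember the rewritten container
    for fp, feat, block in read_container(cid):
        FP_cache[fp] = cid               # always cached, for deduplication (S6.2)
        if cid in RID_last:              # S6.1: container also holds potential
            B_cache[feat] = block        # reference blocks for delta compression
```

Because the container contents arrive in one read, the extra cost of filling B_cache is only the in-memory insertion, not a second disk access.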
In a second aspect, an embodiment of the present invention provides a computer device for implementing delta data compression, where the computer device includes:
one or more processors;
one or more memories;
one or more modules stored in the memory and executable by at least one of the one or more processors to perform the steps of the delta data compression method according to the first aspect.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium for implementing delta data compression, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the delta data compression method according to the first aspect.
By detecting repeatedly rewritten data blocks, the embodiment of the invention determines which containers contain a large number of potential reference data blocks, and overlaps the reading of those reference blocks with the reading of fingerprints required for data deduplication. This reduces disk overhead, avoids degrading system throughput, and improves both the storage performance and the data-read performance of the system.
Drawings
The above features, technical features, advantages, and implementations of the data difference compression detection method, the computer device, and the storage medium are further explained below in a clear and understandable manner with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a packed file format for packing files to be backed up into larger units;
FIG. 2 is a diagram illustrating a cause of data being overwritten;
FIG. 3 is a schematic diagram of a process in which data is accessed in a data backup system;
fig. 4 is a schematic structural diagram of a computer device for implementing delta data compression according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
From the above analysis, it can be seen that prior-art deduplication and recovery methods leave considerable redundant data and fragmented data blocks after deduplication. During data recovery, this large amount of redundancy and fragmentation both wastes storage space and causes a large number of disk reads, greatly reducing recovery performance.
The embodiment of the invention first aims to reduce redundant data, adopting data delta compression to eliminate part of the redundancy. Delta compression is defined as follows: given two similar data blocks A and B, where B is the target block to be processed and A is a block sharing much content with B, called B's reference data block, delta compression computes what is present in B but not in A and stores it as a difference file called a delta block, denoted delta. The delta block is much smaller than data block B, so storing delta instead of B reduces the required storage space.
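The encode/decode round trip just described can be illustrated with a toy encoder. This sketch uses Python's difflib as a stand-in for a real delta encoder such as Xdelta; the `("copy"/"insert")` opcode format is an assumption for illustration only.

```python
import difflib

def delta_encode(ref: bytes, target: bytes):
    """Encode target against ref: copies point into ref, inserts carry the
    bytes present only in target."""
    ops = []
    sm = difflib.SequenceMatcher(a=ref, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # region shared with ref
        else:
            ops.append(("insert", target[j1:j2]))  # data unique to target
    return ops

def delta_decode(ref: bytes, ops) -> bytes:
    """Reconstruct the target block B from the reference block A and delta."""
    out = bytearray()
    for op in ops:
        out += ref[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)
```

For two blocks that differ only in a small header, the insert payload is tiny compared to the block itself, which is exactly the savings the method exploits.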
At data recovery time, when block B is needed, delta is fetched and delta-decoded against reference block A to reconstruct B. In a data backup system, the reference block A has already been processed and stored on disk; when the target block B is to be delta-compressed, A must be read from the storage medium into memory. Because reference blocks are scattered across the disk, a large number of read operations are still required. The storage medium of a data backup system is generally a disk with poor random-read performance, so reading reference blocks greatly reduces system throughput.
However, existing data backup systems use only data deduplication for redundancy elimination and do not employ delta compression.
Meanwhile, the inventors found through extensive study that containers holding the same PFCs contain large numbers of similar data blocks that differ only in their small metadata areas. As shown in fig. 2, container II and container III contain the same PFCs, namely C2 and C4, and the data blocks adjacent to them are pairwise similar, e.g., C1′ and C1″, C3′ and C3″, and C5′ and C5″; applying differential compression to them therefore yields a good compression effect.
Therefore, the embodiment of the invention introduces differential compression into a deduplication-based data backup system to eliminate the redundant data between data blocks containing a file metadata area, further reducing storage cost.
As shown in fig. 3, the technical solution of the method for detecting similar data provided in the embodiment of the present invention is as follows:
S1, organize the containers of all data blocks rewritten in the last backup into a lookup table RID_last, and initialize the fingerprint cache FP_cache, the reference data block cache B_cache, and the list RID_current; data in FP_cache and B_cache is updated in units of containers;
when a cache is full, an LRU replacement policy evicts the contents of the least recently used container;
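A container-granular cache with LRU replacement, as described above, might be sketched like this. The sketch is illustrative; capacity is counted in containers, and `OrderedDict` supplies the recency bookkeeping.

```python
from collections import OrderedDict

class ContainerLRUCache:
    """Cache whose unit of insertion and eviction is a whole container."""
    def __init__(self, max_containers: int):
        self.max_containers = max_containers
        self._containers = OrderedDict()   # container id -> container payload

    def insert(self, cid, payload):
        if cid in self._containers:
            self._containers.move_to_end(cid)
        self._containers[cid] = payload
        while len(self._containers) > self.max_containers:
            self._containers.popitem(last=False)   # evict least recently used

    def get(self, cid):
        if cid not in self._containers:
            return None
        self._containers.move_to_end(cid)          # mark as recently used
        return self._containers[cid]
```

Both FP_cache and B_cache in the method could be backed by a structure like this, since their updates are container-granular.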
S2, pack and chunk the files to be backed up to obtain newly divided data blocks;
preferably, the tar tool is used for packing and the Rabin chunking algorithm for chunking;
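Content-defined chunking of the kind the Rabin algorithm performs can be sketched as follows. This toy version uses a simple base-257 rolling hash rather than true Rabin fingerprints over GF(2), and the window, mask, and length limits are assumed parameters chosen to give roughly 8 KB average chunks.

```python
def chunk_boundaries(data: bytes, window: int = 48,
                     mask: int = (1 << 13) - 1,
                     min_len: int = 2048, max_len: int = 65536):
    MOD = 1 << 32
    POW = pow(257, window, MOD)   # coefficient of the byte leaving the window
    boundaries = []
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = (h * 257 + byte) % MOD                   # slide the new byte in
        if i - start >= window:
            h = (h - data[i - window] * POW) % MOD   # slide the old byte out
        length = i - start + 1
        # declare a boundary when the window hash matches the mask
        # (content-defined), subject to minimum and maximum chunk lengths
        if (length >= min_len and (h & mask) == mask) or length >= max_len:
            boundaries.append(i + 1)
            start, h = i + 1, 0
    if start < len(data):
        boundaries.append(len(data))                 # flush the final chunk
    return boundaries
```

Because boundaries depend on content rather than offsets, inserting bytes into a file shifts only the chunks near the edit, which is what makes variable-length chunking effective for deduplication.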
For each newly divided data block, the following operations are performed:
S3, take a data block in division order and judge whether a block can be taken;
S3.1, if so, denote the obtained block as chunk, compute its fingerprint fp, and execute S4;
S3.2, if not, all backup files have been processed and the backup is finished;
S4, look up fp in FP_cache and judge whether fp exists there;
S4.1, if it exists, chunk is a duplicate data block; execute S5;
S4.2, if it does not exist, query the on-disk fingerprint index D_index and judge whether fp exists there;
S4.2.1, if so, chunk is a duplicate data block; execute S5;
S4.2.2, if not, chunk is a non-duplicate data block; execute S7;
S5, use the rewrite algorithm to judge whether chunk is fragmented;
S5.1, if so, mark chunk as a fragment block, insert the number of the container holding chunk into RID_current, and execute S6;
S5.2, if not, chunk is a duplicate data block but not a fragment block and need not be stored; execute S3;
S6, query RID_last and judge whether the number of the container holding chunk exists in RID_last;
S6.1, if it exists, the container holding chunk contains potential reference data blocks, which serve data deduplication and differential compression at the same time: organize the fingerprints of all data blocks in the container into a first hash table, whose keys are fingerprints and whose values are container numbers, and insert it into FP_cache; organize all data blocks in the container into a second hash table, whose keys are feature values and whose values are data blocks, and insert it into B_cache;
S6.2, if it does not exist, the container holding chunk serves data deduplication only: organize the fingerprints of all data blocks in the container into a third hash table, whose keys are fingerprints and whose values are container numbers, and insert it into FP_cache;
S7, compute a feature value for chunk, search B_cache for a reference data block according to that feature value, and judge whether such a reference data block exists;
S7.1, if it exists, read the reference data block and execute S8;
S7.2, if not, execute S9;
S8, delta-compress chunk against the reference data block to generate a delta block delta;
S9, write chunk or delta into a container, update its index into the on-disk D_index, and execute S3.
Preferably, the method further comprises:
after the backup is finished, writing all container numbers in RID_current to disk, so that the next backup can determine which containers contain potential reference data blocks.
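Step S7 relies on a per-block feature value for the similarity lookup in B_cache. The description does not specify how the feature is computed; a common choice in the deduplication literature is min-hash-based features (Broder-style resemblance detection), sketched below with assumed shingle size and hash-family parameters.

```python
import hashlib

def feature_value(block: bytes, shingle: int = 32, num_hashes: int = 4) -> tuple:
    """Illustrative min-hash feature: one minimum per salted hash family.
    Blocks with equal features are likely to be similar."""
    feats = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")       # one salted hash family per seed
        best = None
        for i in range(max(1, len(block) - shingle + 1)):
            h = hashlib.blake2b(block[i:i + shingle], digest_size=8, salt=salt)
            v = int.from_bytes(h.digest(), "big")
            if best is None or v < best:
                best = v                     # keep the minimum hash per family
        feats.append(best)
    return tuple(feats)
```

Since a small edit changes only a few shingles, the minimum per family usually survives, so similar blocks tend to collide on the feature and can be paired for delta compression.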
That is, the embodiment of the present invention determines which containers contain a large number of potential reference data blocks by detecting repeatedly rewritten data blocks, and overlaps reading of the reference data blocks with reading of fingerprints required for data deduplication, thereby reducing disk overhead and avoiding affecting system throughput.
Through the technical scheme of the embodiment of the invention, differential compression can eliminate the redundant data among partially similar data blocks without significantly reducing system throughput. Conventional methods, by contrast, generally use only deduplication for redundancy elimination and therefore cannot eliminate the redundancy between data blocks containing a file metadata area; or, when they do use differential compression, they attempt to detect all similar data blocks, so that reading the reference data blocks causes a large number of disk reads and reduces system throughput.
Embodiments of the present invention, however, detect only similar data blocks that contain metadata areas and avoid the disk overhead of reading them separately: repeatedly rewritten data blocks are used to determine which containers contain a large number of potential reference blocks, and the reference-block reads are overlapped with the fingerprint reads. This is based on two observations the inventors made while studying large-scale data storage and recovery: (1) there is a great deal of redundancy between data blocks containing a file metadata area; for example, C1′ and C1″ in fig. 2 differ only in their metadata areas and are otherwise identical; (2) containers holding a large number of such similar blocks contain the same PFCs; for example, containers II and III in fig. 2 both contain data blocks C2 and C4.
Therefore, by recording the containers that held rewritten data blocks in the last backup and combining this with the rewrite information of the current backup, the embodiment of the invention determines which containers contain a large number of data blocks similar to the backup data. The reading of reference data blocks can then be overlapped with the disk operations already required for data rewriting and fingerprint reading, avoiding the large disk overhead of reading the reference data blocks separately.
As shown in fig. 2, after backup 2 finishes, the numbers of the containers holding the repeatedly rewritten data blocks, II and III, are collected. Backup 3 begins by loading these container numbers into the lookup table. When C2 is processed, it is found to be fragmented, and the container holding it, container II, is detected to have been rewritten in backup 2. Ordinarily, only the fingerprints in container II would be read, serving deduplication alone; in the embodiment of the invention, however, all data blocks in the container are read out together with the fingerprints, the fingerprints are inserted into the fingerprint cache, and the data blocks are inserted into the reference data block cache. When C3″ is then processed, it is detected to be similar to C3′ in the reference block cache, so C3″ is delta-compressed, which greatly reduces disk overhead. The cost of a disk read consists mainly of three parts, seek time, rotational latency, and data transfer time, of which seek time and rotational latency dominate. With the above scheme, the embodiment of the invention avoids the seek time and rotational latency that reading the reference data blocks would otherwise require, so the impact of reference-block reads on system throughput is greatly reduced. Tests on a large number of data sets show that with the embodiment of the invention, the disk overhead of reading reference blocks reduces system throughput by no more than 3%.
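The seek/rotation/transfer breakdown above can be made concrete with a back-of-the-envelope estimate. All figures here (9 ms seek, 4.2 ms rotational latency, 150 MB/s transfer) are assumed typical values for a 7200 rpm disk, not numbers from the patent.

```python
# Assumed disk parameters (typical 7200 rpm drive; illustrative only)
SEEK_MS = 9.0                     # average seek time
ROTATION_MS = 4.2                 # average rotational latency
TRANSFER_MS_PER_MB = 1000 / 150   # ~150 MB/s sequential transfer
CONTAINER_MB = 4                  # container size used in the description

def read_cost_ms(positionings: int, mb_transferred: float) -> float:
    """Each positioning pays one seek plus one rotational latency."""
    return positionings * (SEEK_MS + ROTATION_MS) + mb_transferred * TRANSFER_MS_PER_MB

# Reading fingerprints and reference blocks separately vs. one overlapped read
two_passes = read_cost_ms(2, CONTAINER_MB)
one_pass = read_cost_ms(1, CONTAINER_MB)
saved_ms = two_passes - one_pass  # exactly one seek + rotation avoided
```

Under these assumptions, each overlapped container read saves one full positioning (about 13 ms), which is why eliminating separate reference-block reads matters more than the transfer volume itself.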
Fig. 4 is a schematic diagram of the physical structure of a computer device according to an embodiment of the present invention. The computer device is installed in a third-party device, such as a mobile terminal, a portable computer, an iPad, a big-data server, or another computing device. As shown in fig. 4, the computer device may include: a processor 610, a communication interface 620, a memory 630, and a communication bus 640, where the processor 610, the communication interface 620, and the memory 630 communicate with one another via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform the following method: S1, organize the containers of all data blocks rewritten in the last backup into a lookup table RID_last, and initialize the fingerprint cache FP_cache, the reference data block cache B_cache, and the list RID_current; data in FP_cache and B_cache is updated in units of containers;
S2, packaging and chunking the files to be backed up to obtain newly divided data blocks;
S3, taking one data block at a time from all the newly divided data blocks in the order of division, and judging whether a data block can be taken;
S3.1, if a data block is obtained, recording the obtained data block as chunk, calculating its fingerprint fp, and executing S4;
S3.2, if not, indicating that all backup files have been processed and the backup is finished;
S4, searching for fp in FPcache, and judging whether fp exists in FPcache;
S4.1, if it exists, chunk is a duplicate data block; execute S5;
S4.2, if it does not exist, querying the fingerprint index Dindex on the disk, and judging whether fp exists in that table;
S4.2.1, if yes, chunk is a duplicate data block; execute S5;
S4.2.2, if not, chunk is a non-duplicate data block; execute S7;
S5, judging with a rewriting algorithm whether chunk is fragmented;
S5.1, if yes, marking chunk as a fragment block, inserting the number of the container where chunk is located into RIDcurrent, and executing S6;
S5.2, if not, chunk is a duplicate data block but not a fragment block and need not be stored; execute S3;
S6, querying RIDlast, and judging whether the number of the container where chunk is stored exists in RIDlast;
S6.1, if it exists, the container where chunk is located contains potential reference data blocks, i.e. the container serves both data deduplication and delta compression; combining the fingerprints of all data blocks in the container into a first hash table, whose keys are fingerprints and whose values are the container number, and inserting it into FPcache; combining all data blocks in the container into a second hash table, whose keys are characteristic values and whose values are data blocks, and inserting it into BCache;
S6.2, if it does not exist, the container where chunk is located serves only data deduplication; combining the fingerprints of all data blocks in the container into a third hash table, whose keys are fingerprints and whose values are the container number, and inserting it into FPcache;
S7, calculating a characteristic value for chunk, searching BCache for a reference data block according to the characteristic value, and judging whether a reference data block exists;
S7.1, if one exists, reading the reference data block and executing S8;
S7.2, if none exists, executing S9;
S8, performing delta compression coding on chunk with the reference data block to generate a delta block delta;
S9, writing chunk or delta into a container, updating its index into Dindex on the disk, and executing S3.
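The S1-S9 flow above combines fingerprint-based deduplication with similarity-based delta compression. The following is a minimal Python sketch of the dedup-or-delta decision (roughly S3-S9 only), using toy stand-ins that are not part of the patent: fixed-size chunking instead of content-defined chunking, SHA-1 fingerprints, a hash of the first 8 bytes as the "characteristic value", and a naive positional diff as the delta coding. The rewriting/fragment logic (S5-S6) and the container caches FPcache/BCache are omitted.

```python
import hashlib

def chunks(data, size=128):
    # Toy stand-in for content-defined chunking (S2): fixed-size split.
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprint(chunk):
    # Fingerprint fp of a data block (S3.1).
    return hashlib.sha1(chunk).hexdigest()

def feature(chunk):
    # Hypothetical "characteristic value" (S7): hash of the first 8 bytes,
    # so blocks sharing a prefix are treated as similar.
    return hashlib.sha1(chunk[:8]).hexdigest()

def delta_encode(chunk, ref):
    # Naive delta coding (S8): record only the positions where chunk
    # differs from the reference block.
    diffs = [(i, chunk[i:i + 1]) for i in range(len(chunk))
             if i >= len(ref) or chunk[i] != ref[i]]
    return ("delta", len(chunk), diffs)

def delta_decode(entry, ref):
    # Rebuild the original block from the reference and the diff list.
    _, length, diffs = entry
    out = bytearray(ref[:length].ljust(length, b"\0"))
    for i, byte in diffs:
        out[i:i + 1] = byte
    return bytes(out)

def backup(data, fp_index, feat_index, containers):
    """One pass over `data` in the spirit of S3-S9: dedup by fingerprint,
    else delta-compress against a similar reference, else store raw."""
    stored = []
    for chunk in chunks(data):
        fp = fingerprint(chunk)
        if fp in fp_index:                  # S4: duplicate block
            stored.append(("dup", fp))
            continue
        feat = feature(chunk)               # S7: look for a reference
        if feat in feat_index:
            stored.append(delta_encode(chunk, feat_index[feat]))  # S8
        else:
            stored.append(("raw", chunk))   # S9: store the raw block
            feat_index[feat] = chunk
        fp_index[fp] = len(containers)      # update the on-disk index Dindex
        containers.append(chunk)
    return stored
```

A restore would replay the `stored` records, copying raw blocks, resolving `dup` entries through the fingerprint index, and rebuilding `delta` entries with `delta_decode` against their reference block.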
A communication bus 640 is a circuit that connects the above elements and enables transmission between them. For example, the processor 610 receives commands from the other elements through the communication bus 640, decrypts the received commands, and performs calculation or data processing according to the decrypted commands. The memory 630 may include program modules such as a kernel, middleware, an application programming interface (API), and applications. The program modules may be implemented in software, firmware, or hardware, or a combination of at least two of them. The communication interface 620 connects the computer device with other network devices, clients, mobile devices, and networks. For example, the communication interface 620 may be connected to a network by wire or wirelessly so as to connect to other external network devices or user devices. The wireless communication may include at least one of: wireless fidelity (WiFi), Bluetooth (BT), near-field communication (NFC), global positioning satellite (GPS), cellular communication, and the like. The wired communication may include at least one of: universal serial bus (USB), high-definition multimedia interface (HDMI), recommended standard 232 interface (RS-232), and the like. The network may be a telecommunications network or a communication network. The communication network may be a computer network, the Internet of Things, or a telephone network. The computer device may connect to the network through the communication interface 620, and the protocol by which the computer device communicates with other network devices may be supported by at least one of an application, an application programming interface (API), middleware, a kernel, and the communication interface 620.
Further, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions which cause a computer to perform the method provided by the above method embodiments, i.e. steps S1 to S9 as described above: organizing the containers where all the data blocks rewritten in the last backup are located into the lookup table RIDlast, and initializing the fingerprint cache FPcache, the reference data block cache BCache, and the list RIDcurrent, with the data updates of FPcache and BCache performed in units of containers (S1), followed by the chunking, deduplication, fragment detection, and delta compression steps S2 to S9.
Those of ordinary skill in the art will understand that the logic instructions in the memory may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it, and may be freely combined as required. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention. Several improvements and modifications may be made without departing from the principle of the invention, and these should also be construed as falling within the scope of the invention.

Claims (4)

1. A method of delta data compression, the method comprising:
S1, organizing the containers where all the data blocks rewritten in the last backup are located into a lookup table RIDlast, and initializing the fingerprint cache FPcache, the reference data block cache BCache, and the list RIDcurrent; wherein the data updates of FPcache and BCache are in units of containers;
S2, packaging and chunking the files to be backed up to obtain newly divided data blocks;
S3, taking one data block at a time from all the newly divided data blocks in the order of division, and judging whether a data block can be taken;
S3.1, if a data block is obtained, recording the obtained data block as chunk, calculating its fingerprint fp, and executing S4;
S3.2, if not, indicating that all backup files have been processed and the backup is finished;
S4, searching for fp in FPcache, and judging whether fp exists in FPcache;
S4.1, if it exists, chunk is a duplicate data block; execute S5;
S4.2, if it does not exist, querying the fingerprint index Dindex on the disk, and judging whether fp exists in that table;
S4.2.1, if yes, chunk is a duplicate data block; execute S5;
S4.2.2, if not, chunk is a non-duplicate data block; execute S7;
S5, judging with a rewriting algorithm whether chunk is fragmented;
S5.1, if yes, marking chunk as a fragment block, inserting the number of the container where chunk is located into RIDcurrent, and executing S6;
S5.2, if not, chunk is a duplicate data block but not a fragment block and need not be stored; execute S3;
S6, querying RIDlast, and judging whether the number of the container where chunk is stored exists in RIDlast;
S6.1, if it exists, the container where chunk is located contains potential reference data blocks, i.e. the container serves both data deduplication and delta compression; combining the fingerprints of all data blocks in the container into a first hash table, whose keys are fingerprints and whose values are the container number, and inserting it into FPcache; combining all data blocks in the container into a second hash table, whose keys are characteristic values and whose values are data blocks, and inserting it into BCache;
S6.2, if it does not exist, the container where chunk is located serves only data deduplication; combining the fingerprints of all data blocks in the container into a third hash table, whose keys are fingerprints and whose values are the container number, and inserting it into FPcache;
S7, calculating a characteristic value for chunk, searching BCache for a reference data block according to the characteristic value, and judging whether a reference data block exists;
S7.1, if one exists, reading the reference data block and executing S8;
S7.2, if none exists, executing S9;
S8, performing delta compression coding on chunk with the reference data block to generate a delta block delta;
S9, writing chunk or delta into a container, updating its index into Dindex on the disk, and executing S3.
2. The delta data compression method of claim 1, wherein said method further comprises:
after the backup is finished, writing all the container numbers in RIDcurrent to the disk, so that a later backup can determine which containers contain potential reference data blocks.
3. A computer device for implementing delta data compression, the computer device comprising:
one or more processors;
one or more memories;
one or more modules stored in a memory and capable of being executed by at least one of the one or more processors to perform the steps of the delta data compression method as recited in any of claims 1-2.
4. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the delta data compression method as claimed in any one of claims 1 to 2.
CN202010473699.9A 2020-05-29 2020-05-29 Data differential compression detection method, computer equipment and storage medium Active CN111796969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010473699.9A CN111796969B (en) 2020-05-29 2020-05-29 Data differential compression detection method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010473699.9A CN111796969B (en) 2020-05-29 2020-05-29 Data differential compression detection method, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111796969A true CN111796969A (en) 2020-10-20
CN111796969B CN111796969B (en) 2024-06-25

Family

ID=72806597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010473699.9A Active CN111796969B (en) 2020-05-29 2020-05-29 Data differential compression detection method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111796969B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415955A (en) * 2022-01-05 2022-04-29 上海交通大学 Block granularity data deduplication system and method based on fingerprints
CN116382974A (en) * 2023-03-21 2023-07-04 安芯网盾(北京)科技有限公司 Customized data protection processing method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004094617A (en) * 2002-08-30 2004-03-25 Fujitsu Ltd Backup method by difference compression, system and difference compression method
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality
CN102831222A (en) * 2012-08-24 2012-12-19 华中科技大学 Differential compression method based on data de-duplication
CN103473150A (en) * 2013-08-28 2013-12-25 华中科技大学 Fragment rewriting method for data repetition removing system
US8972672B1 (en) * 2012-06-13 2015-03-03 Emc Corporation Method for cleaning a delta storage system
CN105515586A (en) * 2015-12-14 2016-04-20 华中科技大学 Rapid delta compression method
CN109408288A (en) * 2018-09-29 2019-03-01 华中科技大学 Data deduplication fragment removing method in a kind of packaging file backup procedure
CN110083487A (en) * 2019-04-08 2019-08-02 湖北工业大学 A kind of reference data block fragment removing method and system based on data locality


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RYAN N.S WIDODO: "A new content-defined chunking algorithm for data deduplication in cloud storage", FUTURE GENERATION COMPUTER SYSTEMS, 30 June 2017 (2017-06-30), pages 145 - 156 *
张宇成: "Research on performance optimization of backup systems based on redundant data elimination", PhD dissertation, Huazhong University of Science and Technology, 18 December 2018 (2018-12-18), pages 68 - 81 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114415955A (en) * 2022-01-05 2022-04-29 上海交通大学 Block granularity data deduplication system and method based on fingerprints
CN114415955B (en) * 2022-01-05 2024-04-09 上海交通大学 Fingerprint-based block granularity data deduplication system and method
CN116382974A (en) * 2023-03-21 2023-07-04 安芯网盾(北京)科技有限公司 Customized data protection processing method

Also Published As

Publication number Publication date
CN111796969B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
US8055618B2 (en) Data deduplication by separating data from meta data
US8924366B2 (en) Data storage deduplication systems and methods
US8838936B1 (en) System and method for efficient flash translation layer
US9268711B1 (en) System and method for improving cache performance
WO2016107272A1 (en) Solid state disk storage device, and data accessing method for solid state disk storage device
KR20170054299A (en) Reference block aggregating into a reference set for deduplication in memory management
CN110941514B (en) Data backup method, data recovery method, computer equipment and storage medium
US9292520B1 (en) Advanced virtual synthetic full backup synthesized from previous full-backups
US11487706B2 (en) System and method for lazy snapshots for storage cluster with delta log based architecture
EP2718825A2 (en) Storage architecture for backup application
CN103098035A (en) Storage system
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
US9268693B1 (en) System and method for improving cache performance
CN110795272B (en) Method and system for atomic and latency guarantees facilitated on variable-size I/O
US9268696B1 (en) System and method for improving cache performance
US11409766B2 (en) Container reclamation using probabilistic data structures
CN111796969B (en) Data differential compression detection method, computer equipment and storage medium
US11327929B2 (en) Method and system for reduced data movement compression using in-storage computing and a customized file system
CN110888918A (en) Similar data detection method and device, computer equipment and storage medium
US10210067B1 (en) Space accounting in presence of data storage pre-mapper
US10268543B2 (en) Online volume repair
US20230076729A2 (en) Systems, methods and devices for eliminating duplicates and value redundancy in computer memories
US10997040B1 (en) System and method for weight based data protection
US10209909B1 (en) Storage element cloning in presence of data storage pre-mapper
US11635897B2 (en) System and method for efficient implementation of XCopy commands

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant