CN111274212B

CN111274212B - Cold and hot index identification and classification management method in data deduplication system

Info

Publication number: CN111274212B
Application number: CN202010064610.3A
Authority: CN
Inventors: 邓玉辉; 张大统
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2020-01-20
Filing date: 2020-01-20
Publication date: 2023-05-05
Anticipated expiration: 2040-01-20
Also published as: CN111274212A

Abstract

The invention discloses a cold and hot index identification and classification management method in a data deduplication system. It targets the biggest bottleneck that deduplication technology faces in the data storage field: once the volume of backup data reaches the PB or EB level or beyond, memory can no longer hold the indexes of all data blocks, so the lookup-intensive deduplication system must frequently look up indexes on disk, which seriously degrades its performance. The method is the first to propose identifying and separating hot indexes, which are accessed frequently, from cold indexes, which are rarely accessed. By removing cold indexes from memory or from the global index, it raises memory utilization, improves data backup and data recovery performance, and ultimately improves the overall performance of the deduplication system. The invention can be applied to deduplication systems whose backup data streams exhibit strong locality.

Description

Cold and hot index identification and classification management method in data deduplication system

Technical Field

The invention relates to the technical field of data storage and data deduplication, and in particular to a cold and hot index identification and classification management method in a data deduplication system.

Background

The explosive growth of data poses a serious challenge to storage capacity. Researchers have found that data contains a large amount of duplication, and storing duplicate data wastes storage space and raises storage cost. Data deduplication technology identifies duplicate and unique data blocks and stores only unique blocks and a single copy of each duplicate block, which greatly reduces storage overhead and saves enterprises considerable cost.

Data deduplication can be divided into five stages: 1) reading, 2) chunking, 3) hash computation, 4) deduplication, and 5) filtering. Specifically: 1) the data to be backed up is read as a data stream; 2) the data stream is split into blocks by a chunking algorithm (e.g., fixed-size chunking, content-based variable-length chunking, or file-size-based chunking); 3) a hash value, also called a fingerprint, is computed for each data block and serves as its unique identifier; 4) for each new data block, the system checks whether a fingerprint identical to the block's fingerprint already exists on disk, marking the block as a duplicate block if so and as a unique block otherwise; 5) finally, a unique block is stored, its index entry is updated (the index is a map whose key is the fingerprint of a data block and whose value is metadata such as the block's address), and the file recipe (the information used to recover the data) is updated, whereas for a duplicate block only the index and the file recipe need to be updated.
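The five stages above can be illustrated with a minimal sketch. It assumes fixed-size chunking, SHA-1 fingerprints, and in-memory Python containers standing in for disk storage; the names `backup`, `restore`, `store`, and `recipe` are hypothetical and chosen only for this illustration.

```python
import hashlib

CHUNK_SIZE = 4  # illustrative only; real systems use chunks of several KB


def backup(stream: bytes, index: dict, store: list) -> list:
    """Deduplicate one stream and return its list of fingerprints for recovery."""
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):    # stage 2: fixed-size chunking
        chunk = stream[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()            # stage 3: fingerprint
        if fp not in index:                             # stage 4: unique or duplicate?
            store.append(chunk)                         # stage 5: store the unique block
            index[fp] = len(store) - 1                  # key: fingerprint, value: address
        recipe.append(fp)                               # recorded for every block
    return recipe


def restore(recipe: list, index: dict, store: list) -> bytes:
    """Rebuild the original stream from the fingerprint list."""
    return b"".join(store[index[fp]] for fp in recipe)


index, store = {}, []
data = b"AAAABBBBAAAACCCC"   # the block "AAAA" occurs twice
recipe = backup(data, index, store)
```

In this sketch the repeated block is stored once, and `restore(recipe, index, store)` returns the original bytes. Index metadata updates for duplicate blocks are omitted for brevity.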

Bottleneck of data deduplication systems: a deduplication system represents each data block by its fingerprint and compares fingerprints instead of comparing blocks byte by byte, which greatly reduces computation cost. However, as backup data grows, the index (which stores each fingerprint together with the address of the corresponding data block) grows with it. Once the data volume is large enough that memory cannot hold the whole index, backing up each new data block may require accessing the index on disk. Accessing disk is about 1000 times slower than accessing memory, so frequent disk access is intolerable. For example, backing up an 800 TB data set produces at least 2 TB of data block fingerprints (and the index occupies more space than the fingerprints alone). Memory cannot hold 2 TB of data, so a large portion of the fingerprints must be placed on disk, which inevitably causes the deduplication system to access the on-disk index (hereinafter, the disk index) frequently. This phenomenon is known as the disk bottleneck of data deduplication systems.
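The 800 TB to 2 TB figure can be reproduced with simple arithmetic. The text does not state the chunk size or fingerprint length, so the 8 KB average chunk size and 20-byte (SHA-1) fingerprint below are assumptions, using decimal units.

```python
DATASET_BYTES = 800 * 10**12   # 800 TB, from the text
AVG_CHUNK_BYTES = 8 * 10**3    # assumed average chunk size (8 KB)
FINGERPRINT_BYTES = 20         # assumed SHA-1 fingerprint length

chunks = DATASET_BYTES // AVG_CHUNK_BYTES
fingerprint_bytes = chunks * FINGERPRINT_BYTES

print(f"{chunks:.0e} chunks, {fingerprint_bytes / 10**12} TB of fingerprints")
```

Under these assumptions the data set yields 10^11 chunks and 2 TB of fingerprints, before counting the address and other metadata stored in each index entry.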

To address this bottleneck, current techniques fall into exact deduplication and approximate deduplication. Exact deduplication identifies and deletes every duplicate block, so the deduplication rate (the higher the better) reaches its maximum; approximate deduplication sacrifices some deduplication rate in order to relieve the disk bottleneck. The prior art proposes using a bloom filter to identify unique blocks and a container prefetching strategy (a container is a unit that preserves the locality of the backup stream, and one container stores thousands of consecutive data blocks) to raise the hit rate of index lookups, thereby reducing the number of disk index accesses and greatly improving backup throughput. However, a bloom filter has a false positive rate (misjudgment rate), which rises sharply as the space utilization of the filter increases, and the locality of the backup stream preserved by containers weakens as backup versions accumulate.

To improve backup throughput further, approximate deduplication methods were subsequently proposed. The most important difference between exact and approximate deduplication is the index strategy: exact deduplication uses a dense index (flat index), in which an index entry is created for every data block, whereas approximate deduplication uses a sparse index, which samples a proportion of the data blocks (e.g., by random, uniform, or minimum sampling) and creates index entries only for those.

Approximate deduplication improves backup performance greatly, mainly because a sparse index occupies far less space than a dense index, which raises memory utilization and avoids disk index lookups as far as possible. Approximate deduplication has therefore become the main way of relieving the bottleneck. However, it extracts index entries coarsely by sampling and does not consider whether an entry is cold or hot. A cold index is rarely accessed, so keeping it in memory only wastes space; a hot index is accessed frequently, so placing it on disk tends to cause disk I/O.
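To make the contrast between a dense and a sparse index concrete, the sketch below builds both from the same fingerprints. The modulo-based sampling rule and the `SAMPLE_RATE` value are assumptions made for illustration; the text only names random, uniform, and minimum sampling as examples.

```python
import hashlib

SAMPLE_RATE = 4  # hypothetical: keep roughly one fingerprint in four


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha1(chunk).hexdigest()


def is_sampled(fp: str, rate: int = SAMPLE_RATE) -> bool:
    """Deterministic sampling: keep fingerprints whose value is divisible by rate."""
    return int(fp, 16) % rate == 0


chunks = [f"chunk-{i}".encode() for i in range(1000)]

# Dense index: one entry per data block.
dense_index = {fingerprint(c): addr for addr, c in enumerate(chunks)}

# Sparse index: entries only for the sampled fingerprints.
sparse_index = {fp: addr for fp, addr in dense_index.items() if is_sampled(fp)}
```

Note that the sampling rule looks only at the fingerprint value, not at how often the entry is accessed, which is the limitation the paragraph above points out.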

Because a deduplication system deletes a large number of duplicate data blocks, the stored data blocks become fragmented. Fragmentation means that consecutive data blocks of a data stream are stored at scattered locations on disk; during recovery, randomly reading blocks scattered across the disk severely degrades recovery performance. Fragmentation is currently addressed mainly by rewriting fragmented blocks (a rewriting algorithm). The rewriting algorithm determines whether a given duplicate block is a fragmented block and, if so, rewrites it and then updates the index corresponding to that block (the fragment index). Updating the fragment index takes two steps: 1) looking up the fragment index, which is very likely to access the index on disk and trigger disk I/O; 2) updating the value of the index. Because updating the fragment index is very likely to trigger disk I/O, conventional rewriting algorithms reduce backup performance to some extent.

Disclosure of Invention

The invention aims to overcome the above defects of the prior art by providing a cold and hot index identification and classification management method in a data deduplication system, which improves the memory utilization of the deduplication system, its backup performance, and its recovery performance, and reduces the false positive rate of the bloom filter.

The aim of the invention can be achieved by adopting the following technical scheme:

A cold and hot index identification and classification management method in a data deduplication system identifies and separates cold and hot indexes according to the frequency or probability with which an index is accessed, and comprises the following steps:

T1, setting a threshold, and classifying an index whose access frequency or probability is below the threshold as a cold index and any other index as a hot index;

T2, prefetching and retaining only hot indexes in memory, which improves memory efficiency and the hit rate of in-memory index lookups, while cold indexes are always stored on disk;

T3, as the number of indexes keeps growing, storing part of the hot indexes on disk once memory is insufficient to hold all of them;

T4, when the lookup-intensive deduplication backup system does not find an index in memory, further looking up the index on disk according to a performance priority scheme or a deduplication rate priority scheme, wherein the performance priority scheme looks up only the hot indexes on disk and ignores the cold indexes on disk, and the deduplication rate priority scheme first looks up the hot indexes on disk and, if the index is not found there, then looks up the cold indexes on disk.

Further, the frequency with which an index is accessed is reflected and predicted by container utilization (the frequency or probability with which a container is accessed during a backup): the higher the utilization of a container, the more frequently the indexes pointing to that container are accessed, i.e., the hotter they are, and vice versa.
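The text does not give a formula for container utilization. One plausible reading, assumed here, is the share of index hits during a backup that landed in each container. The function name and the access trace are hypothetical.

```python
from collections import Counter


def container_utilization(access_trace: list) -> dict:
    """Fraction of index hits that landed in each container during one backup."""
    counts = Counter(access_trace)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cid: hits / total for cid, hits in counts.items()}


# Hypothetical trace: the container hit by each duplicate block, in order.
trace = ["c1", "c1", "c3", "c1", "c2", "c3", "c1", "c3"]
utilization = container_utilization(trace)
```

Here `c1` receives half of all hits and `c2` one in eight, so indexes pointing to `c2` would be the coldest under this definition.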

Further, in step T1, the containers (each storing thousands of data blocks) are sorted in descending order of container utilization, and the n containers with the lowest utilization are selected as sparse containers, where n is a positive integer. An index that points to a sparse container is a cold index; otherwise it is a hot index.
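A minimal sketch of this selection and classification follows, assuming the index maps each fingerprint to a container ID and that utilization values are already known. The function names and sample values are hypothetical.

```python
def select_sparse_containers(utilization: dict, n: int) -> set:
    """Return the IDs of the n containers with the lowest utilization."""
    ranked = sorted(utilization, key=utilization.get, reverse=True)  # descending order
    return set(ranked[-n:]) if n > 0 else set()


def classify(index: dict, sparse: set) -> tuple:
    """Split a fingerprint -> container-ID index into (hot, cold) parts."""
    hot = {fp: cid for fp, cid in index.items() if cid not in sparse}
    cold = {fp: cid for fp, cid in index.items() if cid in sparse}
    return hot, cold


utilization = {"c1": 0.90, "c2": 0.05, "c3": 0.60, "c4": 0.10}
index = {"fp_a": "c1", "fp_b": "c2", "fp_c": "c3", "fp_d": "c4", "fp_e": "c1"}

sparse = select_sparse_containers(utilization, n=2)
hot, cold = classify(index, sparse)
```

With `n=2`, containers `c2` and `c4` are sparse, so `fp_b` and `fp_d` are classified as cold and the remaining three fingerprints as hot.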

Further, the cold and hot index identification and classification management method also comprises the following step: a bloom filter is configured in the data deduplication system and only hot indexes are mapped into it. The deduplication system first uses the bloom filter to determine the attribute of a newly arrived data block (whether it is a unique block); if the attribute cannot be determined, the subsequent index lookup is carried out.
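The following sketch shows a small bloom filter populated with hot fingerprints only. The filter size, the number of hash functions, and the SHA-256-based hashing are assumptions for illustration, not parameters given in the text.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


hot_index = {"fp_a": "c1", "fp_c": "c3"}   # hypothetical hot index
bloom = BloomFilter()
for fp in hot_index:                        # only hot fingerprints are mapped
    bloom.add(fp)


def ruled_out_by_filter(fp: str) -> bool:
    """True when the filter excludes fp; otherwise the index lookup continues."""
    return not bloom.might_contain(fp)
```

Because fewer fingerprints are inserted, fewer bits are set, which is why mapping only hot indexes lowers the chance that an unseen fingerprint is wrongly reported as possibly present.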

Further, in the cold and hot index identification and classification management method, the cold indexes are divided into useless indexes and fragment indexes, where a useless index is an index accessed with very low probability and a fragment index is the index of a fragmented block; the cold indexes are separated into these two kinds and managed by class.

Further, in the deduplication rate priority scheme, the deduplication system accesses the three kinds of index in the following order of priority: hot indexes first, fragment indexes second, and useless indexes last.
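The priority order can be expressed as an ordered walk over three tiers. This is a sketch under the assumption that each tier is a separate fingerprint-to-container map; the tier names and sample entries are hypothetical.

```python
from typing import Optional

# Lookup order for the deduplication rate priority scheme.
PRIORITY = ("hot", "fragment", "useless")


def lookup_dedup_priority(fp: str, tiers: dict) -> Optional[str]:
    """Return the name of the first tier that contains fp, or None if absent."""
    for name in PRIORITY:
        if fp in tiers[name]:
            return name
    return None


tiers = {
    "hot": {"fp_a": "c1"},
    "fragment": {"fp_b": "c2"},
    "useless": {"fp_d": "c4"},
}
```

A fingerprint found in the hot tier never touches the two cold tiers, so the most frequently accessed entries are also the cheapest to find.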

Compared with the prior art, the invention has the following advantages and effects:

(1) The invention can accurately identify the cold and hot indexes, and then extract the hot indexes from the global indexes (all indexes used for index searching operation in a certain backup process) for index searching operation. Whereas prior art techniques coarsely extract (e.g., randomly extract) a proportion of the index for index lookup operations, the extracted index still contains a large number of cold indices. The invention fully makes up the defects of the prior art, further improves the use efficiency of the memory space, improves the hit rate of searching the index in the memory, and avoids the frequent access of the data deduplication system to the disk.

(2) Compared with existing rewriting algorithms, the invention removes the fragment indexes from the global index and can therefore identify and rewrite fragmented blocks accurately and quickly, without extra computation cost. This improves data recovery performance while avoiding the overhead that a conventional rewriting algorithm incurs in identifying fragmented blocks and updating fragment indexes, which indirectly improves the backup performance of the deduplication system further.

(3) The prior art maps all indexes into the bloom filter, whereas the invention maps only the hot indexes into it. This greatly reduces the filter's false positive rate, lets the filter identify more unique blocks, and further keeps the deduplication system from looking up indexes on disk.

(4) The invention separates the cold index from the global index in each backup, so that the quantity of the hot index is kept at a lower level continuously, and the system still keeps high-efficiency work even if the backup version number and the backup data quantity are continuously increased, thus well overcoming the disk bottleneck of the data deduplication system.

(5) The invention provides a flexible choice of strategy: under the deduplication rate priority scheme, index lookups use all indexes (hot, fragment, and useless); under the performance priority scheme, index lookups use only the hot indexes. The strengths of each strategy can thus be exploited as needed.

Drawings

FIG. 1 is a schematic diagram of the present invention for identifying and separating cold and hot indexes by sparse containers;

FIG. 2 is a flow chart of the present invention for identifying and separating cold and hot indexes;

FIG. 3 is a diagram showing a design structure of the cold and hot index identification and classification management method of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The embodiment discloses a cold and hot index identification and classification management method in a data deduplication system, wherein the indexes are divided into cold indexes and hot indexes according to the frequency or probability of being accessed, the cold indexes can be further divided into fragment indexes and useless indexes, and the aim of improving the overall performance of the data deduplication system is achieved by classifying and managing the indexes.

The conventional method does not identify and separate cold and hot indexes; the deduplication system manages its indexes as follows:

1) The memory stores the cold index and the hot index in a mixed way, and maps all indexes (cold index and hot index) into the bloom filter;

2) As backup versions and backup data volumes increase, the number of indexes increases, and when the memory is insufficient to store all indexes, a part of the cold indexes and the hot indexes are stored in the same position of the disk in a mixed manner.

If the indexes are not distinguished and managed, i.e., under the conventional method, the deduplication system backs up data in the following steps:

1) Use the bloom filter to determine whether the data block currently being backed up is a unique block. If so, the attribute of the data block (unique block or duplicate block) is confirmed and the next data block is processed; otherwise, go to step 2). Note that the prior art maps all indexes into the bloom filter; mapping so many indexes raises the filter's false positive rate and lowers its efficiency;

2) Searching an index in the memory, if the index is found, the currently backed up data block is a repeated block, the attribute of the data block is confirmed, then the next data block is processed, and otherwise, the step 3) is skipped. It should be noted that in the prior art, the cold index is also put into the memory, which not only wastes memory space, but also reduces the hit rate of searching the index in the memory;

3) And searching the index from the disk, if the index is found, the current backup data block is a repeated block, and if the index is not found, the current backup data block is a unique block. The attribute of the current data block is confirmed and the next data block is processed. It should be noted that because the prior art mixes the cold and hot indexes in the same location on the disk, looking up the cold index on the disk greatly increases the number of accesses to the disk.

In the above steps, if a conventional overwriting algorithm is used, after confirming the attribute of the data block, an additional operation is required to overwrite the fragmented block, which is as follows:

1) Judging whether the current repeated block is a fragment block, if not, then judging the next repeated block, otherwise, jumping to the step 2);

2) Rewrite the fragmented block, which triggers an index lookup to find the index corresponding to the block (the fragment index), and update the value of the fragment index. Notably, these index lookups cause the deduplication system to access indexes on disk frequently, degrading backup performance.

If indexes are divided into hot indexes and cold indexes (the latter comprising fragment indexes and useless indexes) according to the frequency or probability of access, identifying and separating cold and hot indexes by means of sparse containers requires the following steps:

1) During the i-th backup, record the sparse containers and perform hot replacement. Recording the sparse containers means sorting the containers by utilization, selecting the n containers with the lowest utilization as sparse containers, and saving the sparse container information to disk for use by the (i+1)-th backup. Hot replacement means that, when an index lookup causes cache replacement in the in-memory index (e.g., by a least-recently-used or not-recently-used replacement algorithm), indexes pointing to sparse containers (those recorded during the (i-1)-th backup), i.e., cold indexes, are evicted from memory first, and hot indexes on disk are prefetched into memory first;

2) When the memory index is subjected to cache replacement, always storing the hot index removed from the memory in one area (hot area) of the magnetic disk, and always storing the cold index removed from the memory in the other area (cold area) of the magnetic disk, so that the separation of the cold index and the hot index is realized;

3) When the i-th backup finishes, traverse the in-memory index and check whether each index points to a sparse container (those recorded during the (i-1)-th backup); if so, store it in the cold area on disk, otherwise in the hot area;

4) When the (i+1)-th backup starts, i.e., in the initialization phase, hot indexes are prefetched and cached in memory first, and the bloom filter is initialized with hot indexes only; operation then continues as described in step 1), and the cycle repeats.

With the cold indexes evicted from memory to disk and only the hot indexes mapped into the bloom filter, the deduplication system backs up data in the following steps:

1) Use the bloom filter to determine whether the data block currently being backed up is a unique block. If so, the attribute of the data block (unique block or duplicate block) is confirmed and the next data block is processed; otherwise, go to step 2). Note that the invention maps only the hot indexes into the bloom filter; because the hot indexes are far fewer than the entries of a dense index, the filter's false positive rate is reduced;

2) Look up the index in memory. If it is found, the data block currently being backed up is a duplicate block, its attribute is confirmed, and the next data block is processed; otherwise, go to step 3). Note that the invention stores only hot indexes in memory, which greatly improves the hit rate of in-memory index lookups and keeps the deduplication system from accessing indexes on disk as far as possible;

3) If the performance priority scheme is selected, look up the index only in the disk area that stores hot indexes, which avoids, as far as possible, disk I/O caused by looking up cold indexes on disk. If the deduplication rate priority scheme is selected, look up the hot indexes on disk first and, only on a miss, then look up the cold indexes on disk, which reduces the number of disk I/O operations caused by index lookups to some extent.

It is noted that when selecting the performance priority scheme, the data deduplication system directly recognizes the fragmented block as a unique block due to directly ignoring the cold index on disk, and directly rewrites the fragmented block (treating the fragmented block as a unique block), which avoids the additional two steps described above required by the rewrite algorithm. Therefore, the performance priority scheme improves the performance of the recovery data while improving the memory utilization rate and the performance of the backup data.
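The difference between the two schemes for a fragmented block can be sketched as follows. The bloom filter step is omitted, the three index locations are modelled as plain sets, and the scheme labels `"performance"` and `"dedup"` are hypothetical names for the performance priority and deduplication rate priority schemes.

```python
def classify_chunk(fp, mem_hot, disk_hot, disk_cold, scheme):
    """Return (attribute, disk_regions_searched) for one fingerprint."""
    if fp in mem_hot:
        return "duplicate", 0            # hit in memory, no disk access
    if fp in disk_hot:
        return "duplicate", 1            # hit in the hot area on disk
    if scheme == "dedup":
        if fp in disk_cold:
            return "duplicate", 2        # hit in the cold area on disk
        return "unique", 2
    # Performance priority: the cold area is never consulted.
    return "unique", 1


mem_hot = {"fp_a"}
disk_hot = {"fp_c"}
disk_cold = {"fp_frag"}   # index of a fragmented block, kept in the cold area

perf = classify_chunk("fp_frag", mem_hot, disk_hot, disk_cold, "performance")
dedup = classify_chunk("fp_frag", mem_hot, disk_hot, disk_cold, "dedup")
```

Under the performance priority scheme the fragmented block is reported as unique after searching one disk area, so it is simply written again; under the deduplication rate priority scheme it is recognized as a duplicate at the cost of searching a second area.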

It should be noted that the operation of searching the index on the disk when the index is not hit in the memory in the data deduplication system essentially corresponds to the operation of performing memory index cache replacement in the data deduplication system.
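The cache replacement with cold-first eviction described in the identification steps can be sketched as below. It assumes the in-memory index is an ordered map from fingerprint to container ID with the oldest entry first, and that the hot and cold disk areas are modelled as two dictionaries; all names are hypothetical.

```python
from collections import OrderedDict


def evict_one(mem_index: OrderedDict, sparse: set, hot_area: dict, cold_area: dict) -> str:
    """Evict one entry, preferring the oldest entry that points to a sparse container."""
    victim = next((fp for fp, cid in mem_index.items() if cid in sparse), None)
    if victim is None:
        victim = next(iter(mem_index))   # no cold entry left: evict the oldest entry
    cid = mem_index.pop(victim)
    # Evicted cold and hot entries always go to separate areas on disk.
    (cold_area if cid in sparse else hot_area)[victim] = cid
    return victim


mem_index = OrderedDict([("fp_a", "c1"), ("fp_b", "c2"), ("fp_c", "c3")])
sparse = {"c2"}            # sparse containers recorded in the previous backup
hot_area, cold_area = {}, {}

first = evict_one(mem_index, sparse, hot_area, cold_area)
second = evict_one(mem_index, sparse, hot_area, cold_area)
```

The first eviction removes `fp_b` even though `fp_a` is older, because `fp_b` points to a sparse container; the second falls back to the oldest remaining entry. The sketch assumes a non-empty in-memory index.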

Example 2

As shown in FIG. 1, FIG. 2 and FIG. 3, the cold and hot index identification and classification management method disclosed in this embodiment keeps the deduplication system from frequently accessing indexes on disk (the on-disk indexes in the figures) during index lookup by classifying and managing indexes according to container utilization (the frequency or probability with which a container is accessed during a given backup). Cold indexes are removed from the global index and from memory and placed on disk, which frees more memory for prefetching hot indexes, so that index lookups hit in memory as often as possible, the deduplication system avoids accessing indexes on disk, and backup performance improves. Only hot indexes are mapped into the bloom filter, which lowers the filter's false positive rate and thus avoids, as far as possible, disk I/O caused by index lookups. When the performance priority scheme is selected, index lookups ignore the cold indexes (useless indexes and fragment indexes) entirely, so the deduplication system identifies and rewrites fragmented blocks quickly; this improves recovery performance while avoiding the overhead of a conventional rewriting algorithm, which further improves backup performance.

A cold and hot index identification and classification management method in a data deduplication system comprises the following steps:

S1, identifying and separating cold and hot indexes:

S11, during the i-th backup, count the container utilization, sort the containers by utilization, and select the n containers with the lowest utilization as sparse containers;

S12, when the i-th backup finishes, traverse the global index (the set of indexes used during a given backup, including those in memory and those on disk) and check whether each index points to a sparse container; if so, it is a cold index, otherwise a hot index. Store the hot indexes in one area on disk and the cold indexes in another;

S13, when the (i+1)-th backup starts, i.e., in the initialization phase, map only the hot indexes into the bloom filter and prefetch only the hot indexes into memory, where prefetching means reading hot indexes from disk into memory in advance. The deduplication system then continues as described in S11, and the cycle repeats.

S2, backing up data by a data deduplication system:

S21, use the bloom filter to determine whether the data block currently being backed up is a unique block. If so, the attribute of the data block (unique block or duplicate block) is confirmed and the next data block is processed; otherwise, go to step S22. Note that the invention maps only the hot indexes into the bloom filter; because the hot indexes are far fewer than the global indexes, the filter's false positive rate is reduced;

S22, look up the index in memory. If it is found, the data block currently being backed up is a duplicate block, its attribute is confirmed, and the next data block is processed; otherwise, go to step S23. Note that the invention stores only hot indexes in memory, which improves memory utilization, greatly improves the hit rate of in-memory index lookups, and avoids accessing indexes on disk as far as possible;

S23, if the performance priority scheme is selected, look up the index only in the disk location that stores hot indexes, which avoids, as far as possible, disk I/O caused by looking up cold indexes on disk. If the deduplication rate priority scheme is selected, look up the hot indexes on disk first and, only on a miss, then look up the cold indexes on disk, which reduces the number of disk I/O operations caused by index lookups to some extent.

In summary, the cold and hot index identification and separation method provided in this embodiment identifies cold and hot indexes according to container utilization, stores the cold indexes at one location on disk and the hot indexes at another, and prefetches and caches only hot indexes into memory during initialization and operation of the deduplication system, so as to reduce the number of disk accesses as far as possible. At the same time, only hot indexes are mapped into the bloom filter, which lowers the filter's false positive rate. When the performance priority scheme is selected, an index lookup proceeds as follows: bloom filter -> hot index in memory -> hot index on disk (the hot index file in the figure). When the deduplication rate priority scheme is selected, an index lookup proceeds as follows: bloom filter -> hot index in memory -> hot index on disk -> cold index on disk (the cold index file in the figure). By managing indexes by class and making full use of the characteristics of each class, the method improves the memory utilization, backup performance, and recovery performance of the deduplication system.

The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims

1. A cold and hot index identification and classification management method in a data deduplication system, characterized in that cold and hot indexes are identified and separated according to the frequency and probability with which an index is accessed, the method comprising the following steps:

T2, prefetching and retaining only hot indexes in memory, and storing cold indexes in one area on disk;

T3, as the number of indexes keeps growing, storing part of the hot indexes in another area on disk when memory is insufficient to hold all the hot indexes;

T4, when the lookup-intensive data deduplication system does not find an index in memory, looking up the index on disk according to a performance priority scheme or a deduplication rate priority scheme, wherein the performance priority scheme looks up only the hot indexes on disk and ignores the cold indexes on disk, and the deduplication rate priority scheme first looks up the hot indexes on disk and, if the index is not found there, then looks up the cold indexes on disk.

2. The method of claim 1, wherein the frequency at which the index is accessed is reflected and predicted using a container utilization, wherein the container utilization is the frequency or probability at which a container is accessed during a backup.

3. The method for identifying and classifying cold and hot indexes in a data deduplication system according to claim 2, wherein in the step T1, containers are sorted in descending order according to container utilization, n containers with the lowest utilization are selected as sparse containers, whether an index points to the sparse containers is judged, if the index points to the sparse containers, the index is a cold index, otherwise, the index is a hot index, and n is a positive integer.

4. The method for identifying and classifying cold and hot indexes in a data deduplication system according to claim 1, wherein the method further comprises the step of: configuring a bloom filter in the deduplication backup system and mapping only hot indexes into the bloom filter.

5. The method for identifying and classifying cold and hot indexes in a data deduplication system according to claim 1, wherein the cold indexes are divided into useless indexes and fragment indexes, wherein a useless index is an index accessed with very low probability and a fragment index is the index of a fragmented block, and the cold indexes are separated into useless indexes and fragment indexes and managed by class.

6. The method for identifying and classifying cold and hot indexes in a data deduplication system according to claim 5, wherein in the deduplication rate priority scheme, the data deduplication system accesses the three kinds of index in the following order of priority: hot indexes first, fragment indexes second, and useless indexes last.