CN107515931B - Repeated data detection method based on clustering

Info

Publication number: CN107515931B
Application number: CN201710747552.2A
Authority: CN
Inventors: 周可; 王桦; 张攀峰
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-08-28
Filing date: 2017-08-28
Publication date: 2023-04-25
Anticipated expiration: 2037-08-28
Also published as: CN107515931A

Abstract

The invention discloses a repeated data detection method based on clustering. It mainly targets data sets with strong data similarity and exploits that similarity to improve the performance of repeated data detection and, in turn, of data deduplication. Specifically, for possibly repeated data in the data set, the invention uses a similarity merging strategy: it segments the fingerprint list to be detected, selects a representative fingerprint for each segment, classifies the segments according to their representative fingerprints, and merges them into different fingerprint containers. Each fingerprint container collects the fingerprints of similar segments of the data set, which improves both the efficiency and the effectiveness of deduplication. The fingerprint containers are stored on disk and are written to and read from the disk as a whole, which improves fingerprint retrieval efficiency and avoids the problem of similar segments being stored in scattered locations.

Description

Repeated data detection method based on clustering

Technical Field

The invention belongs to the technical field of computer storage, and particularly relates to a repeated data detection method and system based on clustering.

Background

With the rapid development of information technology, information has become a precious resource and a major driver of productivity growth. The widespread application of information technology generates vast amounts of data, and more and more valuable data needs to be stored. How to improve the storage efficiency of existing storage media and meet ever-increasing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, research reports from IDC show that about 75% of existing data is redundant, i.e., only 25% of the data is unique. In this context, data deduplication, a technique for detecting and eliminating redundant information over a wide spatial range, has become a research hotspot in academia and industry in recent years and is increasingly being applied to various information storage systems.

In existing data deduplication technology, duplicate data is mainly detected by fingerprint detection: the fingerprint (hash value) of each data block is extracted, and the block is identified as a duplicate if its fingerprint has been seen before. Current repeated fingerprint detection methods generally use a single hash table, a B-tree, or a similar data structure to identify repeated fingerprints.
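
For orientation, the conventional single-index approach described above can be sketched as follows. This is only an illustration of the prior-art idea, not code from the patent: the choice of SHA-1 and the names `fingerprint` and `detect_duplicates` are assumptions.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return the SHA-1 digest of a data block as a hex string (assumed hash)."""
    return hashlib.sha1(block).hexdigest()


def detect_duplicates(blocks):
    """Flag each block as duplicate (True) or unique (False) using one global index."""
    seen = set()  # stands in for the single hash table holding every fingerprint
    flags = []
    for block in blocks:
        fp = fingerprint(block)
        flags.append(fp in seen)
        seen.add(fp)
    return flags


flags = detect_duplicates([b"alpha", b"beta", b"alpha", b"gamma", b"beta"])
print(flags)  # [False, False, True, False, True]
```

Every fingerprint ever seen lives in the one index, so the index grows with the data set; this is the scaling problem the following paragraphs address.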

However, the above repeated fingerprint detection methods have low detection performance and cannot detect repeated data effectively in large data sets, which degrades the overall performance of data deduplication.

Disclosure of Invention

Aiming at the defects or improvement demands of the prior art, the invention provides a repeated data detection method and system based on clustering, which aim to solve the technical problems that the detection performance is lower and effective repeated data detection cannot be realized for a large data set in the existing repeated data detection method based on fingerprint detection.

To achieve the above object, according to one aspect of the present invention, there is provided a cluster-based duplicate data detection method, comprising the steps of:

(1) Acquiring a fingerprint list file from a magnetic disk, and judging whether partial fingerprints can be acquired from the fingerprint list file; if not, ending the process; otherwise, storing the acquired partial fingerprints in a fingerprint input cache space and segmenting all N fingerprints in the fingerprint input cache space so that every M fingerprints form one fingerprint segment, wherein N is the number of all fingerprints and M is any natural number;

(2) Setting a counter i=1;

(3) Judging whether i is larger than N/M, if so, returning to the step (1), otherwise, entering the step (4);

(4) Taking out an ith fingerprint segment from the plurality of fingerprint segments obtained in the step (1), acquiring a fingerprint with the smallest fingerprint value in the ith fingerprint segment as a representative fingerprint, judging whether the representative fingerprint is positioned in a representative fingerprint index table in a memory, if so, entering the step (5), otherwise, entering the step (8);

(5) Taking out the fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, judging whether the fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache or not by searching the memory hit table, if so, entering the step (6), otherwise, reading the fingerprint container corresponding to the fingerprint container ID into the fingerprint container cache from the disk, and then entering the step (6);

(6) Removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, matching each fingerprint in the removed fingerprint segment with all fingerprints in a fingerprint container corresponding to the fingerprint container ID one by one, marking the fingerprint as repeated fingerprints if the matching result is repeated, and inserting the fingerprint into the fingerprint container if the matching result is not repeated;

(7) Setting a counter i=i+1, and returning to step (3);

(8) Constructing a new fingerprint container in the fingerprint container cache, removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, inserting all fingerprints in the resulting fingerprint segment into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and inserting the new fingerprint container ID into the memory hit table;

(9) Setting a counter i=i+1, and returning to step (3).

Preferably, the method further comprises, before step (1), the step of setting an empty fingerprint input cache space, an empty fingerprint container cache, an empty memory hit table and a representative fingerprint index table in the memory, wherein the fingerprint input cache space is used for storing partial fingerprints in the memory, the fingerprint container cache is used for caching some of the fingerprint containers in the memory, the memory hit table is used for judging whether a given fingerprint container is already cached in the memory, and the representative fingerprint index table is used for storing the representative fingerprints in the memory and providing an index function for them.

Preferably, when N is not divisible by M, fewer than M fingerprints remain to be grouped into a fingerprint segment.

Preferably, the size of the partial fingerprints is equal to the size of the fingerprint input cache space, which is greater than 0% and less than 80% of the memory size.

Preferably, the size M of the fingerprint segment is 64 to 128.

Preferably, the fingerprint value of the fingerprint is obtained by converting a character string type fingerprint into a numeric type fingerprint.

Preferably, in step (6), when the number of fingerprints reaches the upper limit of the capacity of the fingerprint container, the fingerprint container no longer receives a new fingerprint, and the fingerprint container is written back to the disk.

According to another aspect of the present invention, there is provided a cluster-based duplicate data detection system, comprising:

a first module, configured to acquire a fingerprint list file from a magnetic disk and judge whether partial fingerprints can be acquired from the fingerprint list file; if not, end the process; otherwise, store the acquired partial fingerprints in a fingerprint input cache space and segment all N fingerprints in the fingerprint input cache space so that every M fingerprints form one fingerprint segment, wherein N is the number of all fingerprints and M is any natural number;

a second module for setting a counter i=1;

the third module is used for judging whether i is larger than N/M, if so, returning to the first module, otherwise, entering the fourth module;

a fourth module, configured to take out an ith fingerprint segment from the plurality of fingerprint segments obtained in the first module, obtain a fingerprint with a smallest fingerprint value in the ith fingerprint segment as a representative fingerprint, and determine whether the representative fingerprint is located in a representative fingerprint index table in the memory, if so, enter the fifth module, otherwise, enter the eighth module;

a fifth module, configured to take out a fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, and determine whether a fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache by looking up the memory hit table, if yes, enter the sixth module, otherwise read the fingerprint container corresponding to the fingerprint container ID from the disk into the fingerprint container cache, and then go to the sixth module;

a sixth module, configured to reject duplicate fingerprints in a fingerprint segment where the representative fingerprint is located, and match each fingerprint in the rejected fingerprint segment with all fingerprints in a fingerprint container corresponding to the fingerprint container ID one by one, if the matching result is duplicate, mark the fingerprint as a duplicate fingerprint, and if the matching result is non-duplicate, insert the fingerprint into the fingerprint container;

a seventh module, configured to set a counter i=i+1, and return to the third module;

an eighth module, configured to construct a new fingerprint container in the fingerprint container cache, remove the repeated fingerprints in the fingerprint segment where the representative fingerprint is located, insert all the fingerprints in the resulting fingerprint segment into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;

a ninth module, configured to set a counter i=i+1, and return to the third module.

In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:

(1) The invention adopts the mode of carrying out sectional processing on the fingerprints, and the mode can effectively reduce the range of searching the repeated fingerprints, thereby improving the performance of searching the repeated fingerprints;

(2) The invention can effectively reduce the range of searching the repeated fingerprints, thereby being particularly suitable for searching the repeated fingerprints in a large data set;

(3) The invention can provide an effect close to accurate detection for highly redundant data sets;

(4) Because the fingerprint segmentation process adopts the idea of clustering, the fingerprints of similar segments are gathered in one fingerprint container that can be read into the memory in a single operation, which avoids the drawback of existing methods in which similar fingerprints are stored at multiple positions on the disk and must be read multiple times.

Drawings

Fig. 1 is a logical block diagram of the present invention.

Fig. 2 is a basic schematic of the similarity algorithm of the present invention.

Fig. 3 is a schematic diagram of similarity merging.

Fig. 4 shows a representative fingerprint index table.

FIG. 5 is a flow chart of a cluster-based duplicate data detection method of the invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.

The invention provides an efficient repeated data detection method based on clustering. It mainly targets data sets with strong similarity: using the similarity principle and the idea of clustering, similar data in the data set are gathered and stored together, which addresses the low detection efficiency of existing repeated data detection methods and helps meet continuously growing storage requirements.

The basic idea of the invention is to segment the list of fingerprints to be detected and extract a representative fingerprint from each segment, to identify the similar fingerprint container by looking up the representative fingerprint, and then to search that fingerprint container to identify duplicate data. The method effectively narrows the range of fingerprint lookup and greatly improves the performance of repeated fingerprint detection.

The basic logical structure of the invention is shown in Fig. 1. It consists of four main parts: the representative fingerprint index table, the memory hit table, the fingerprint container cache, and the fingerprint containers on disk.

For purposes of clarity of explanation of the invention, terms presented in this application are explained and illustrated:

Similarity algorithm: data similarity theory is used to search for identical text blocks between different documents; see Broder, A. Z., "On the resemblance and containment of documents", in Compression and Complexity of Sequences 1997, IEEE, pp. 21-29. The following theorem follows from the Broder similarity algorithm:

Theorem 1: Consider two sets S1 and S2, and let H(S1) and H(S2) be the sets of hash fingerprints of the elements of S1 and S2, respectively. Let min(H(S)) denote the minimum value in H(S). Then:

$$\Pr[\min(H(S_1)) = \min(H(S_2))] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$

The above theorem states that when the minimum hash value of the elements in one set equals the minimum hash value of the elements in another set, the two sets share a certain number of elements with high probability. In a data deduplication system, this means that if two different sets of data blocks have the same minimum fingerprint, the two sets are likely to share a large number of data blocks. For ease of discussion, the smallest fingerprint in a set is called the representative fingerprint.
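
Broder's result is usually stated as: the probability that two sets have the same minimum hash equals their Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2|. The sketch below checks this empirically. It is an illustration only; the salted SHA-1 hash family, the set contents, and the function names are assumptions, not part of the patent.

```python
import hashlib


def salted_hash(element: str, salt: int) -> int:
    """Hash an element under a given salt; each salt acts as a different hash function."""
    data = f"{salt}:{element}".encode()
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def min_hash(items, salt: int) -> int:
    """Return the minimum hash value over a set, i.e. min(H(S))."""
    return min(salted_hash(x, salt) for x in items)


def jaccard(s1, s2) -> float:
    """Return |S1 ∩ S2| / |S1 ∪ S2|."""
    return len(s1 & s2) / len(s1 | s2)


# Two hypothetical block sets that share 40 of their 80 distinct elements.
s1 = {f"block{i}" for i in range(0, 60)}
s2 = {f"block{i}" for i in range(20, 80)}

trials = 1000
matches = sum(min_hash(s1, t) == min_hash(s2, t) for t in range(trials))
estimate = matches / trials

print(jaccard(s1, s2))  # 0.5
print(estimate)         # fraction of hash functions whose minima coincide
```

The empirical fraction should land near the Jaccard similarity, which is why a matching representative fingerprint is a reasonable signal that two segments overlap heavily.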

In a data deduplication system, a directory in a dataset is first recursively scanned and a list of files is formed. Each file in the list is partitioned into data blocks using a data partitioning algorithm. Each data block is hashed and forms a fingerprint list. Fig. 2 illustrates an example of how a list of fingerprints is obtained and how data similarity theory is applied.
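
A minimal sketch of how such a fingerprint list might be produced is shown below. The patent does not name the partitioning algorithm or the hash function here, so fixed-size chunking, the 4096-byte block size, SHA-1, and all function names are assumptions chosen for brevity.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed block size


def chunk(data: bytes, size: int = CHUNK_SIZE):
    """Split file contents into fixed-size data blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint_list(files):
    """Hash every block of every file, in processing order."""
    fingerprints = []
    for data in files:
        for block in chunk(data):
            fingerprints.append(hashlib.sha1(block).hexdigest())
    return fingerprints


# Two in-memory "files" standing in for the scanned file list.
files = [b"A" * 8192 + b"B" * 100, b"A" * 4096]
fps = fingerprint_list(files)

print(len(fps))       # 4 blocks in total
print(len(set(fps)))  # 2 distinct fingerprints
```

Three of the four blocks are identical runs of `A`, so they share one fingerprint; that repetition is what the detection method later exploits.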

Following data similarity theory, the fingerprint list is divided into a number of fingerprint subsets. Each fingerprint subset is called a segment here, and each segment is stored in the similar fingerprint container that the representative fingerprint index table associates with it.

Similarity merging: one feature of the present invention is that it employs a segment merging strategy, i.e., similar segments are merged into one fingerprint container, which greatly speeds up the search for similar segments.

In the present invention, the detection system classifies the segments in the fingerprint list by their representative fingerprints and merges similar segments into the same fingerprint container. A representative fingerprint index table is constructed to map representative fingerprints to the addresses of fingerprint containers, with each representative fingerprint corresponding to one fingerprint container. With this design, the detection system can quickly locate the fingerprint container and use it for repeated data comparison, without having to search a large area for scattered similar segments, which greatly reduces the search range and speeds up the data deduplication process.

Fig. 3 illustrates how similar segment merging works. In this figure, it is assumed that there are three segments in the fingerprint list: segment (a) { e, f, g, n, c, w }, segment (b) { f, n, w, t, m, e }, and segment (c) { t, m, e, w, c, j, h }, where each character represents a data block fingerprint. In the example, the three segments are similar segments because they have the same representative fingerprint 'e'. RMD merges them and stores them in one location, i.e., a fingerprint container. When a new segment with the same representative fingerprint 'e' arrives, the detection system compares all of the fingerprints in that segment with the fingerprints in the similar container; because of the similarity of the data, most duplicate data blocks are likely to be identified. In general, merging similar segments reduces the amount of disk I/O caused by fingerprint detection while improving the accuracy of locating duplicate data objects.
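
The merging of the three segments can be sketched as below. The letters are placeholders whose numeric fingerprint values are not given, so the representative fingerprint 'e' is taken as stated for the figure rather than computed; the function name and the incoming segment are hypothetical.

```python
def merge_segments(segments):
    """Merge similar segments into one container, keeping each fingerprint once."""
    container = []
    seen = set()
    for segment in segments:
        for fp in segment:
            if fp not in seen:
                seen.add(fp)
                container.append(fp)
    return container


seg_a = ["e", "f", "g", "n", "c", "w"]
seg_b = ["f", "n", "w", "t", "m", "e"]
seg_c = ["t", "m", "e", "w", "c", "j", "h"]

# All three segments are stated to share representative fingerprint 'e'.
containers = {"e": merge_segments([seg_a, seg_b, seg_c])}

# A hypothetical new segment that also has representative fingerprint 'e'.
new_segment = ["e", "w", "h", "x"]
known = set(containers["e"])
duplicates = [fp for fp in new_segment if fp in known]
fresh = [fp for fp in new_segment if fp not in known]

print(containers["e"])  # ['e', 'f', 'g', 'n', 'c', 'w', 't', 'm', 'j', 'h']
print(duplicates)       # ['e', 'w', 'h']
print(fresh)            # ['x']
```

Only one container has to be consulted for the new segment, which is the source of the reduced disk I/O described above.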

Fingerprint list: the collection of fingerprints obtained from the data set by data partitioning and fingerprint extraction, arranged in processing order.

Fingerprint input cache: the fingerprint input cache is used for caching fingerprints from the fingerprint list. If the fingerprint list comes from a file, a number of fingerprints are read in at one time and stored in the fingerprint input cache. The fingerprints in the fingerprint input cache are then segmented, and the representative fingerprint selection step is performed.

Fingerprint container: the fingerprint container is the data structure the system uses to store fingerprints, and it is also the basic unit of fingerprint disk storage and cache scheduling. A fingerprint container stores a variable number of unique fingerprints (a unique fingerprint is one whose value differs from that of all other fingerprints).

Representative fingerprint index table: the representative fingerprint index table is a key-value hash table residing in memory that stores the mapping from each representative fingerprint (RF) to the ID of the fingerprint container in which that representative fingerprint resides. Its function is to locate quickly where a fingerprint container is stored in the file when the fingerprint container is looked up on disk. The specific structure of the representative fingerprint index table is shown in Fig. 4. The fingerprint is 20 bytes long, the fingerprint container ID is 4 bytes long, and the pointer occupies 8 bytes. Hash table storage is subject to hash collisions; when a collision occurs, it is resolved by chaining (chained addressing).
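
The entry layout and collision handling described above can be sketched as follows. This is an assumption-laden illustration: the class name, the bucket count, the use of the first 8 fingerprint bytes as the bucket key, and the interpretation of the 8-byte pointer as the chain link are not taken from the patent. A Python list per bucket stands in for a linked chain.

```python
import struct

# 20-byte fingerprint, 4-byte container ID, 8-byte pointer, packed without padding.
ENTRY = struct.Struct("<20sIQ")


class RepresentativeIndex:
    """Hash table from representative fingerprint to container ID, with chaining."""

    def __init__(self, buckets: int = 1024):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, rf: bytes):
        # Hypothetical bucket choice: first 8 bytes of the fingerprint modulo table size.
        return self.buckets[int.from_bytes(rf[:8], "big") % len(self.buckets)]

    def get(self, rf: bytes):
        """Return the container ID for rf, or None if rf is not indexed."""
        for key, container_id in self._bucket(rf):
            if key == rf:
                return container_id
        return None

    def put(self, rf: bytes, container_id: int) -> None:
        """Insert or update the mapping rf -> container_id."""
        bucket = self._bucket(rf)
        for i, (key, _) in enumerate(bucket):
            if key == rf:
                bucket[i] = (rf, container_id)
                return
        bucket.append((rf, container_id))  # colliding keys extend the same chain


index = RepresentativeIndex(buckets=1)  # one bucket forces every key to collide
rf_a, rf_b = bytes(20), bytes([1]) * 20
index.put(rf_a, 7)
index.put(rf_b, 9)

print(ENTRY.size)                   # 32 bytes per entry
print(index.get(rf_a), index.get(rf_b))  # 7 9
```

With 32 bytes per entry and one entry per representative fingerprint rather than per fingerprint, the table stays small enough to reside in memory.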

Fingerprint container cache: the fingerprint container cache is a cache area allocated in the memory. It holds new fingerprint containers before they are written to the disk and fingerprint containers that have been read into the memory from the disk.

Memory hit table: the memory hit table is used for judging whether the fingerprint container being looked up is in the cache; if it is not, the required fingerprint container is read from the disk into the fingerprint container cache.
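
How the memory hit table, the fingerprint container cache, and the on-disk containers might interact is sketched below. The JSON file format, the file naming, the explicit `evict` call, and the class and method names are all assumptions; the patent does not specify a storage format or an eviction policy.

```python
import json
import os
import tempfile


class ContainerStore:
    """Fingerprint container cache backed by one file per container."""

    def __init__(self, directory: str):
        self.directory = directory
        self.cache = {}         # fingerprint container cache: ID -> fingerprints
        self.hit_table = set()  # memory hit table: IDs currently cached
        self.disk_reads = 0

    def _path(self, container_id: int) -> str:
        return os.path.join(self.directory, f"container_{container_id}.json")

    def create(self, container_id: int, fingerprints) -> None:
        """Build a new container in the cache and record it in the hit table."""
        self.cache[container_id] = list(fingerprints)
        self.hit_table.add(container_id)

    def evict(self, container_id: int) -> None:
        """Write the whole container to disk, then drop it from memory."""
        with open(self._path(container_id), "w") as f:
            json.dump(self.cache[container_id], f)
        del self.cache[container_id]
        self.hit_table.discard(container_id)

    def get(self, container_id: int):
        """Return a container, reading it from disk as a whole on a miss."""
        if container_id not in self.hit_table:
            with open(self._path(container_id)) as f:
                self.cache[container_id] = json.load(f)
            self.hit_table.add(container_id)
            self.disk_reads += 1
        return self.cache[container_id]


with tempfile.TemporaryDirectory() as tmp:
    store = ContainerStore(tmp)
    store.create(1, ["aa", "bb"])
    store.evict(1)
    first = store.get(1)   # miss: read from disk
    second = store.get(1)  # hit: served from the cache
    reads = store.disk_reads

print(first, reads)  # ['aa', 'bb'] 1
```

The second lookup is answered from memory, so only one disk read is counted for the two accesses.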

In the implementation process of the invention, the optimal parameters of each module in the algorithm are required to be set, and the range of the optimal parameters of each module is given below:

Fingerprint segment size: 32 to 8192 fingerprints, with an optimal range of 64 to 128;

Number of fingerprints per fingerprint container: 512 to 4096 fingerprints, with an optimal range of 1024 to 2048.
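
To give a feel for these ranges, the arithmetic below estimates container sizes and an upper bound on the index size, using the 20-byte fingerprint and the 20 + 4 + 8 = 32-byte index entry stated for the representative fingerprint index table. The data set of 10^9 fingerprints is a hypothetical figure, and the bound assumes every segment yields a distinct representative fingerprint.

```python
FINGERPRINT_BYTES = 20
INDEX_ENTRY_BYTES = 20 + 4 + 8  # fingerprint + container ID + pointer


def container_bytes(fingerprints_per_container: int) -> int:
    """Raw fingerprint payload of one full container."""
    return fingerprints_per_container * FINGERPRINT_BYTES


def max_index_bytes(total_fingerprints: int, segment_size: int) -> int:
    """Upper bound on index size: one entry per segment (ceiling division)."""
    segments = -(-total_fingerprints // segment_size)
    return segments * INDEX_ENTRY_BYTES


print(container_bytes(1024))          # 20480 bytes (20 KiB)
print(container_bytes(2048))          # 40960 bytes (40 KiB)
print(max_index_bytes(10**9, 128))    # 250000000 bytes
print(10**9 * FINGERPRINT_BYTES)      # 20000000000 bytes for a full fingerprint index
```

Under these assumptions the representative index is at most about 250 MB, against 20 GB of raw fingerprints for a full index, which is why it can be kept in memory.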

The invention is further described below with reference to Fig. 1, Fig. 5 and the embodiments, assuming that the duplicate data block index capacity is designed as C records.

As shown in fig. 5, the method for detecting repeated data based on clustering of the present invention comprises the following steps:

(1) Acquiring a fingerprint list file from a disk, and judging whether partial fingerprints can be acquired from the fingerprint list file (the size of the partial fingerprints equals the size of the fingerprint input cache space allocated in the memory in advance, which is greater than 0% and less than 80% of the memory size); if not, ending the process; otherwise, storing the acquired partial fingerprints in the fingerprint input cache space and segmenting all N fingerprints in the fingerprint input cache space so that every M fingerprints form one fingerprint segment, wherein, when N is not divisible by M, the remaining fewer than M fingerprints form one fingerprint segment;

The size M of the fingerprint segment may be any natural number, with a preferred value of 64 to 128.
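
The segmentation in step (1) can be sketched in a few lines. The function name is hypothetical; the handling of the remainder follows the rule that the leftover fewer than M fingerprints form their own segment, so N fingerprints yield ceil(N/M) segments.

```python
def split_into_segments(fingerprints, m: int):
    """Group every m fingerprints into a segment; a shorter final segment holds the rest."""
    if m < 1:
        raise ValueError("segment size M must be a natural number")
    return [fingerprints[i:i + m] for i in range(0, len(fingerprints), m)]


segments = split_into_segments(list(range(10)), 4)
print([len(s) for s in segments])  # [4, 4, 2]
```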

Before the method is executed, the initialization step is also needed to be executed, namely, an empty fingerprint input buffer space, an empty fingerprint container buffer, an empty memory hit table and a representative fingerprint index table are set in the memory.

The fingerprint input buffer space is used for storing part of the fingerprint in the memory.

The fingerprint container cache is used for caching a portion of the fingerprint container in the memory.

The memory hit table is used to determine whether a fingerprint container is cached in the memory.

The representative fingerprint index table is used for storing the representative fingerprint in the memory and providing an index function for the representative fingerprint.

The step (1) has the advantage that the overall retrieval performance of the repeated fingerprints can be optimized by setting the size of the fingerprint segment, thereby overcoming the defect of low performance in the prior art that the repeated fingerprint retrieval is performed by taking the file size as a logic segment.

(2) Setting a counter i=1;

(3) Judging whether i is larger than N/M, if so, returning to the step (1), otherwise, entering the step (4);

(4) Taking out an ith fingerprint segment from the plurality of fingerprint segments obtained in the step (1), acquiring a fingerprint with the smallest fingerprint value in the ith fingerprint segment as a representative fingerprint (Representative fingerprint, RF for short), judging whether the representative fingerprint is positioned in a representative fingerprint index table in a memory, if so, entering the step (5), otherwise, entering the step (8);

Specifically, the fingerprint value of a fingerprint is obtained by converting the character-string fingerprint into a numeric fingerprint.
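
Selecting the representative fingerprint in step (4) might look like the sketch below. It assumes the string fingerprints are hexadecimal digests, which the patent does not state; the function names and the short sample values are illustrative only.

```python
def fingerprint_value(fp_hex: str) -> int:
    """Convert a hexadecimal string fingerprint into its numeric value."""
    return int(fp_hex, 16)


def representative(segment):
    """Return the fingerprint with the smallest numeric value in the segment."""
    return min(segment, key=fingerprint_value)


segment = ["ff01", "0a9c", "7b20", "0a10"]
rf = representative(segment)
print(rf)  # 0a10
```

Comparing numeric values rather than raw strings keeps the ordering well defined even if the strings differ in case or length.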

This has the advantage that the fingerprint container in which a repeated fingerprint is located can be found simply and efficiently, which effectively narrows the search range for repeated fingerprints.

(5) Taking out the fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, judging whether the fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache or not by searching the memory hit table, if so, entering the step (6), otherwise, reading the fingerprint container corresponding to the fingerprint container ID into the fingerprint container cache from the disk, and then entering the step (6);

(6) Removing repeated fingerprints from the fingerprint segment where the representative fingerprint is located, and matching each fingerprint in the resulting fingerprint segment one by one against all fingerprints in the fingerprint container corresponding to the fingerprint container ID; if the match result is a repeat, marking the fingerprint as a repeated fingerprint, and if not, inserting the fingerprint into the fingerprint container, wherein, when the number of fingerprints reaches the upper limit of the capacity of the fingerprint container, the fingerprint container no longer receives new fingerprints and is written back to the disk;
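
Step (6) can be sketched as follows. A set lookup replaces the one-by-one comparison for brevity, and the function name, the returned tuple, and the `overflow` list (new fingerprints that arrive after the container is full) are assumptions; writing the full container back to disk is left to the caller.

```python
def process_segment(segment, container, capacity: int):
    """Match a segment against a container.

    Returns (duplicates, inserted, overflow, full).
    """
    unique = list(dict.fromkeys(segment))  # drop repeats inside the segment, keep order
    known = set(container)
    duplicates, inserted, overflow = [], [], []
    for fp in unique:
        if fp in known:
            duplicates.append(fp)          # already stored: mark as repeated
        elif len(container) < capacity:
            container.append(fp)           # new fingerprint: add to the container
            known.add(fp)
            inserted.append(fp)
        else:
            overflow.append(fp)            # container is full and accepts nothing more
    full = len(container) >= capacity
    return duplicates, inserted, overflow, full


container = ["a", "b"]
result = process_segment(["b", "c", "c", "d", "e"], container, capacity=4)
print(result)     # (['b'], ['c', 'd'], ['e'], True)
print(container)  # ['a', 'b', 'c', 'd']
```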

(7) Setting a counter i=i+1, and returning to step (3);

(8) A new fingerprint container is built in the fingerprint container cache, repeated fingerprints in the fingerprint section where the representative fingerprint is located are removed, all fingerprints in the removed fingerprint section are inserted into the new fingerprint container, the representative fingerprint RF and the new fingerprint container ID are inserted into the representative fingerprint index table in a key value mode, and the new fingerprint container ID is inserted into the memory hit table.

(9) Setting a counter i=i+1, and returning to step (3).

This has the advantage that the size of the fingerprint container can be effectively controlled, which avoids degradation of repeated fingerprint retrieval performance.
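
Putting the steps together, the overall flow can be sketched in memory as below. This is a simplified illustration under several assumptions: each batch stands for one fill of the fingerprint input cache, a single dictionary stands in for both the container cache and the disk, fingerprints are short hexadecimal strings, and the function name and return values are hypothetical.

```python
def detect(batches, m: int = 4, capacity: int = 1024):
    """Return (report, index, containers); report lists the repeats found per segment."""
    index = {}       # representative fingerprint -> container ID
    containers = {}  # container ID -> fingerprints (cache and disk collapsed)
    report = []
    for batch in batches:                                  # step (1)
        for start in range(0, len(batch), m):              # steps (2), (3), (7), (9)
            segment = list(dict.fromkeys(batch[start:start + m]))
            rf = min(segment, key=lambda fp: int(fp, 16))  # step (4)
            if rf in index:                                # steps (5) and (6)
                container = containers[index[rf]]
                known = set(container)
                report.append([fp for fp in segment if fp in known])
                for fp in segment:
                    if fp not in known and len(container) < capacity:
                        container.append(fp)
            else:                                          # step (8)
                container_id = len(containers)
                containers[container_id] = segment[:capacity]
                index[rf] = container_id
                report.append([])
    return report, index, containers


batches = [
    ["01", "05", "09", "0c", "02", "06", "0a", "0d"],
    ["01", "05", "0b", "0c", "02", "0e", "0f", "10"],
    ["05", "09"],
]
report, index, containers = detect(batches)
print(report)         # [[], [], ['01', '05', '0c'], ['02'], []]
print(containers[0])  # ['01', '05', '09', '0c', '0b']
```

The last segment contains fingerprints that already exist in container 0, but its representative fingerprint "05" is not indexed, so the repeats go unnoticed. This illustrates why the detection is close to, rather than exactly, accurate.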

The technical effects of the invention are as follows: the invention mainly targets data sets with strong similarity, narrows the detection range for repeated data by exploiting the similarity within the data set, and improves the throughput of data deduplication. Specifically, for possibly repeated data in the data set, the invention segments the fingerprints in the fingerprint list, selects a representative fingerprint in each segment according to similarity theory, looks up the selected representative fingerprint in the representative fingerprint index table, uses the resulting mapping to locate the similar container quickly, and finally detects the repeated data within that similar container, thereby improving the performance of repeated data detection.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. The repeated data detection method based on the clustering is characterized by comprising the following steps of:

(2) Setting a counter i=1;

(7) Setting a counter i=i+1, and returning to step (3);

(8) Constructing a new fingerprint container in the fingerprint container cache, removing repeated fingerprints in the fingerprint section where the representative fingerprint is located, inserting all fingerprints in the removed fingerprint section into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table in a key value mode, and inserting the new fingerprint container ID into the memory hit table;

(9) Setting a counter i=i+1, and returning to step (3).

2. The repetitive data detecting method according to claim 1, further comprising, before the step (1), the step of setting an empty fingerprint input cache space, an empty fingerprint container cache, an empty memory hit table, and a representative fingerprint index table in the memory, wherein the fingerprint input cache space is used for storing partial fingerprints in the memory, the fingerprint container cache is used for caching some of the fingerprint containers in the memory, the memory hit table is used for judging whether a given fingerprint container is already cached in the memory, and the representative fingerprint index table is used for storing the representative fingerprints in the memory and providing an index function for the representative fingerprints.

3. The method of claim 1, wherein when N is not divisible by M, the remaining fewer than M fingerprints are grouped into a fingerprint segment.

4. The method of claim 2, wherein the size of the partial fingerprints is equal to the size of the fingerprint input cache space, which is greater than 0% and less than 80% of the memory size.

5. The method for detecting repeated data according to any one of claims 1 to 4, wherein the size M of the fingerprint segment is 32 to 8192 fingerprints.

6. The repetitive data detecting method according to claim 5, wherein the fingerprint value of the fingerprint is obtained by converting a string type fingerprint into a numeric type fingerprint.

7. The repetitive data detection method according to claim 1, wherein in the step (6), when the number of fingerprints reaches an upper limit of the capacity of the fingerprint container, the fingerprint container no longer receives a new fingerprint, and the fingerprint container is written back to the disk.

8. A cluster-based duplicate data detection system, comprising:

a second module for setting a counter i=1;

an eighth module, configured to construct a new fingerprint container in the fingerprint container cache, reject the repeated fingerprints in the fingerprint segment where the representative fingerprint is located, insert all the fingerprints in the fingerprint segment after rejection into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table in a key value mode, and insert the new fingerprint container ID into the memory hit table;

a ninth module, configured to set a counter i=i+1, and return to the third module.