CN107515931B - Repeated data detection method based on clustering - Google Patents

Info

Publication number
CN107515931B
Authority
CN
China
Prior art keywords
fingerprint
container
fingerprints
representative
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710747552.2A
Other languages
Chinese (zh)
Other versions
CN107515931A (en)
Inventor
周可
王桦
张攀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201710747552.2A
Publication of CN107515931A
Application granted
Publication of CN107515931B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Abstract

The invention discloses a repeated data detection method based on clustering, aimed mainly at data sets with strong internal similarity. It exploits that similarity to improve the performance of repeated data detection and, in turn, of data deduplication. Specifically, for possible repeated data in the data set, the invention uses a similarity-based merging strategy: it segments the fingerprint list to be checked, selects a representative fingerprint for each segment, classifies the segments according to their representative fingerprints, and merges them into different fingerprint containers. Each fingerprint container collects the fingerprints of similar segments of the data set, which improves both the efficiency and the effectiveness of deduplication. Fingerprint containers are stored on disk and are written and read as a whole, which improves fingerprint retrieval efficiency and avoids scattering similar segments across the disk.

Description

Repeated data detection method based on clustering
Technical Field
The invention belongs to the technical field of computer storage, and particularly relates to a repeated data detection method and system based on clustering.
Background
With the rapid development of information technology, information has become a precious resource and a major force driving productivity. The widespread application of information technology generates vast amounts of data, and more and more valuable data needs to be stored. How to improve the storage efficiency of existing storage media and meet ever-growing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, IDC research reports indicate that about 75% of existing data is redundant, i.e., only 25% of the data is unique. In this context, data deduplication, a technique for detecting and eliminating redundant information across a large storage space, has become a research hotspot in academia and industry in recent years and is increasingly applied in information storage systems.
In existing data deduplication technology, duplicate data is mainly detected by fingerprinting: the fingerprint (hash value) of a data block is extracted and checked for repetition to determine whether the block is a duplicate. Current repeated fingerprint detection methods generally use a single hash table, a B-tree or a similar data structure to identify repeated fingerprints.
However, these methods have low detection performance and cannot detect repeated data effectively in large data sets, which degrades the overall performance of data deduplication.
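For orientation, the conventional approach described above can be sketched as follows. This is a minimal illustration only: it assumes SHA-1 as the fingerprint function (consistent with the 20-byte fingerprint length stated later in the description) and uses a Python set in place of the single hash table; the function names are hypothetical.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return the fingerprint of a data block (SHA-1 is an assumption here)."""
    return hashlib.sha1(block).hexdigest()


def detect_duplicates(blocks):
    """Flag each block as duplicate (True) or new (False) with one global table."""
    seen = set()  # stands in for the single hash table of the prior art
    flags = []
    for block in blocks:
        fp = fingerprint(block)
        flags.append(fp in seen)
        seen.add(fp)
    return flags


flags = detect_duplicates([b"alpha", b"beta", b"alpha"])
```

Every fingerprint ever seen must stay in the one table, which is the scaling problem the invention sets out to avoid.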
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides a repeated data detection method and system based on clustering, which aim to solve the technical problems that the detection performance is lower and effective repeated data detection cannot be realized for a large data set in the existing repeated data detection method based on fingerprint detection.
To achieve the above object, according to one aspect of the present invention, there is provided a cluster-based duplicate data detection method, comprising the steps of:
(1) Acquiring a fingerprint list file from a magnetic disk, judging whether partial fingerprints can be acquired from the fingerprint list file, if not, ending the process, otherwise, storing the acquired partial fingerprints in a fingerprint input cache space, segmenting all fingerprints N in the fingerprint input cache space, and forming a fingerprint segment by every M fingerprints, wherein N is the number of all fingerprints, and M is any natural number;
(2) Setting a counter i=1;
(3) Judging whether i is larger than N/M, if so, returning to the step (1), otherwise, entering the step (4);
(4) Taking out an ith fingerprint segment from the plurality of fingerprint segments obtained in the step (1), acquiring a fingerprint with the smallest fingerprint value in the ith fingerprint segment as a representative fingerprint, judging whether the representative fingerprint is positioned in a representative fingerprint index table in a memory, if so, entering the step (5), otherwise, entering the step (8);
(5) Taking out the fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, judging whether the fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache or not by searching the memory hit table, if so, entering the step (6), otherwise, reading the fingerprint container corresponding to the fingerprint container ID into the fingerprint container cache from the disk, and then entering the step (6);
(6) Removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, matching each fingerprint in the removed fingerprint segment with all fingerprints in a fingerprint container corresponding to the fingerprint container ID one by one, marking the fingerprint as repeated fingerprints if the matching result is repeated, and inserting the fingerprint into the fingerprint container if the matching result is not repeated;
(7) Setting a counter i=i+1, and returning to step (3);
(8) Constructing a new fingerprint container in the fingerprint container cache, removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, inserting all fingerprints in the resulting fingerprint segment into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and inserting the new fingerprint container ID into the memory hit table;
(9) Setting a counter i=i+1, and returning to step (3).
Preferably, the method further comprises, before step (1), the step of setting in the memory an empty fingerprint input buffer space, an empty fingerprint container buffer, an empty memory hit table and a representative fingerprint index table, wherein the fingerprint input buffer space is used for storing partial fingerprints in the memory, the fingerprint container buffer is used for buffering some of the fingerprint containers in the memory, the memory hit table is used for judging whether a given fingerprint container is already buffered in the memory, and the representative fingerprint index table is used for storing representative fingerprints in the memory and providing an index function for the representative fingerprints.
Preferably, when N is not divisible by M, fewer than M fingerprints remain to be grouped into a fingerprint segment.
Preferably, the size of the partial fingerprints is equal to the size of the fingerprint input buffer space, which is greater than 0% and less than 80% of the memory size.
Preferably, the size M of the fingerprint segment is 64 to 128.
Preferably, the fingerprint value of the fingerprint is obtained by converting a character string type fingerprint into a numeric type fingerprint.
Preferably, in step (6), when the number of fingerprints reaches the upper limit of the capacity of the fingerprint container, the fingerprint container no longer receives a new fingerprint, and the fingerprint container is written back to the disk.
According to another aspect of the present invention, there is provided a cluster-based duplicate data detection system, comprising:
the first module is used for acquiring a fingerprint list file from a magnetic disk, judging whether partial fingerprints can be acquired from the fingerprint list file, ending the process if the partial fingerprints cannot be acquired, otherwise, storing the acquired partial fingerprints in a fingerprint input cache space, segmenting all fingerprints N in the fingerprint input cache space, and forming a fingerprint segment by every M fingerprints, wherein N is the number of all fingerprints, and M is any natural number;
a second module for setting a counter i=1;
the third module is used for judging whether i is larger than N/M, if so, returning to the first module, otherwise, entering the fourth module;
a fourth module, configured to take out an ith fingerprint segment from the plurality of fingerprint segments obtained in the first module, obtain a fingerprint with a smallest fingerprint value in the ith fingerprint segment as a representative fingerprint, and determine whether the representative fingerprint is located in a representative fingerprint index table in the memory, if so, enter the fifth module, otherwise, enter the eighth module;
a fifth module, configured to take out a fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, and determine whether a fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache by looking up the memory hit table, if yes, enter the sixth module, otherwise read the fingerprint container corresponding to the fingerprint container ID from the disk into the fingerprint container cache, and then go to the sixth module;
a sixth module, configured to reject duplicate fingerprints in a fingerprint segment where the representative fingerprint is located, and match each fingerprint in the rejected fingerprint segment with all fingerprints in a fingerprint container corresponding to the fingerprint container ID one by one, if the matching result is duplicate, mark the fingerprint as a duplicate fingerprint, and if the matching result is non-duplicate, insert the fingerprint into the fingerprint container;
a seventh module, configured to set a counter i=i+1, and return to the third module;
an eighth module, configured to construct a new fingerprint container in the fingerprint container cache, remove the repeated fingerprints in the fingerprint segment where the representative fingerprint is located, insert all the fingerprints in the resulting fingerprint segment into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
a ninth module, configured to set a counter i=i+1, and return to the third module.
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
(1) The invention adopts the mode of carrying out sectional processing on the fingerprints, and the mode can effectively reduce the range of searching the repeated fingerprints, thereby improving the performance of searching the repeated fingerprints;
(2) The invention can effectively reduce the range of searching the repeated fingerprints, thereby being particularly suitable for searching the repeated fingerprints in a large data set;
(3) The invention can provide an effect close to exact detection for highly redundant data sets;
(4) Because fingerprint segmentation adopts the clustering idea, a container of similar fingerprints can be read into memory in a single operation, which avoids the drawback of existing methods, where similar fingerprints are stored at multiple disk locations and must be read multiple times.
Drawings
Fig. 1 is a logical block diagram of the present invention.
Fig. 2 is a basic schematic of the similarity algorithm of the present invention.
Fig. 3 is a schematic diagram of similar-segment merging.
Fig. 4 shows the representative fingerprint index table.
Fig. 5 is a flow chart of the cluster-based duplicate data detection method of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The invention provides an efficient repeated data detection method based on clustering, aimed mainly at data sets with strong similarity. Using the similarity principle and the clustering idea, it stores similar data in the data set together, which addresses the low detection efficiency of existing repeated data detection methods and helps meet continuously growing storage demand.
The basic idea of the invention is to segment the list of fingerprints to be detected and extract a representative fingerprint from each segment, locate the similar fingerprint container by looking up the representative fingerprint, and then search that container to identify duplicate data. The method effectively narrows the range of fingerprint checking and greatly improves the performance of repeated fingerprint checking.
The basic logic structure of the invention is shown in fig. 1. It consists mainly of four parts: the representative fingerprint index table, the memory hit table, the fingerprint container cache, and the fingerprint containers on disk.
For purposes of clarity of explanation of the invention, terms presented in this application are explained and illustrated:
Similarity algorithm: data similarity theory is used to find identical text blocks shared between different documents; see Broder, A.Z., "On the resemblance and containment of documents", in Compression and Complexity of Sequences 1997, IEEE, pp. 21-29. The Broder similarity algorithm gives the following theorem:
Theorem 1: consider two sets S1 and S2, and let H(S1) and H(S2) be the sets of hash fingerprints of the elements of S1 and S2, respectively. Let min(H(S)) denote the minimum value in H(S). Then:
$$\Pr\left[\min(H(S_1)) = \min(H(S_2))\right] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$
The above theorem states that when the minimum hash value of the elements in one set equals the minimum hash value of the elements in another set, there is a high probability that the two sets share a certain number of elements. In a data deduplication system, this means that if two different sets of data blocks have the same minimum fingerprint, there is a high probability that the two sets share a large number of data blocks. For ease of discussion, the smallest fingerprint in a set is called the representative fingerprint.
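The relationship in Theorem 1 can be checked empirically with a short sketch. It assumes SHA-1 with a numeric salt as a family of hash functions, so that each salt gives an independent trial; the sets, the salt scheme and the function names are illustrative and not taken from the patent.

```python
import hashlib


def h(element: str, salt: int) -> int:
    """Hash one element; the salt selects one hash function from a family."""
    data = f"{salt}:{element}".encode()
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def min_hash(s, salt: int) -> int:
    """min(H(S)) for the hash function selected by salt."""
    return min(h(e, salt) for e in s)


def jaccard(s1, s2) -> float:
    """|S1 ∩ S2| / |S1 ∪ S2|, the right-hand side of Theorem 1."""
    return len(s1 & s2) / len(s1 | s2)


def match_rate(s1, s2, trials: int = 2000) -> float:
    """Fraction of hash functions for which the two minimum hashes agree."""
    hits = sum(min_hash(s1, t) == min_hash(s2, t) for t in range(trials))
    return hits / trials


s1 = set("abcdefgh")
s2 = set("efghijkl")
expected = jaccard(s1, s2)   # 4 shared elements out of 12 distinct ones
observed = match_rate(s1, s2)
```

The observed match rate should land close to the Jaccard ratio, which is why a shared minimum fingerprint is a useful hint that two segments overlap.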
In a data deduplication system, a directory in a dataset is first recursively scanned and a list of files is formed. Each file in the list is partitioned into data blocks using a data partitioning algorithm. Each data block is hashed and forms a fingerprint list. Fig. 2 illustrates an example of how a list of fingerprints is obtained and how data similarity theory is applied.
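The pipeline in the preceding paragraph can be sketched as below. The sketch makes two assumptions not stated in the source: blocks are cut at a fixed size (the patent only says "a data partitioning algorithm"), and SHA-1 is the hash. `BLOCK_SIZE` and all function names are hypothetical.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # hypothetical fixed block size


def list_files(root):
    """Recursively scan a directory and return the list of file paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            paths.append(os.path.join(dirpath, name))
    return paths


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Partition file content into fixed-size blocks (last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint_list(root):
    """Hash every block of every file, in processing order."""
    fingerprints = []
    for path in list_files(root):
        with open(path, "rb") as f:
            data = f.read()
        for block in split_blocks(data):
            fingerprints.append(hashlib.sha1(block).digest())
    return fingerprints
```

A production system would more likely use content-defined chunking so that insertions do not shift every later block boundary; fixed-size blocks are used here only to keep the sketch short.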
According to the theory of data similarity, a fingerprint list is divided into a plurality of fingerprint subsets. A subset of fingerprints is defined herein as a segment, and in a fingerprint index table, the subset is stored in a similar fingerprint container.
Similar-segment merging: one feature of the present invention is its segment merging strategy, i.e., similar segments are merged into one data container, which greatly speeds up the search for similar segments.
In the present invention, the detection system sorts and merges similar segments in the fingerprint list into different data containers based on their representative fingerprints. A representative fingerprint index table is constructed to map representative fingerprints to addresses of fingerprint containers, and each representative fingerprint corresponds to a fingerprint container. With such a design, the detection system can quickly determine the location of the fingerprint container and use it for repeated data comparisons without having to find a large area of space to search for scattered similar segments, which greatly reduces the search area and speeds up the data deduplication process.
Fig. 3 illustrates how similar-segment merging works. In the figure, the fingerprint list is assumed to contain three segments: segment (a) {e, f, g, n, c, w}, segment (b) {f, n, w, t, m, e} and segment (c) {t, m, e, w, c, j, h}, where each character represents a data block fingerprint. In this example the three segments are similar because they have the same representative fingerprint 'e'. RMD merges them and stores them in one location, i.e., a fingerprint container. When a new segment with the same representative fingerprint 'e' arrives, the detection system compares all fingerprints in that segment with the fingerprints in the similar container; because the data is similar, most duplicate data blocks are likely to be identified. In general, merging similar segments reduces the disk I/O caused by fingerprint detection while improving the accuracy of locating duplicate data.
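The Fig. 3 example can be replayed in a few lines. Because the letters stand for fingerprints whose numeric values are not given, the sketch assumes a hypothetical value table in which 'e' is the smallest, matching the figure; the table, the extra fingerprint 'x' and the function names are illustrative only.

```python
# Hypothetical numeric values for the letter fingerprints; 'e' is made the
# smallest so that it is the representative fingerprint, as in Fig. 3.
VALUE = {ch: i for i, ch in enumerate("efgncwtmjhxyz")}


def representative(segment):
    """The fingerprint with the smallest value in the segment."""
    return min(segment, key=VALUE.__getitem__)


def merge_segments(segments):
    """Merge segments that share a representative fingerprint into one container."""
    containers = {}  # representative fingerprint -> set of fingerprints
    for seg in segments:
        containers.setdefault(representative(seg), set()).update(seg)
    return containers


segments = [list("efgncw"), list("fnwtme"), list("tmewcjh")]
containers = merge_segments(segments)

# A new segment with the same representative is compared against that container.
new_segment = list("ewtx")
container = containers[representative(new_segment)]
duplicates = [fp for fp in new_segment if fp in container]
```

All three segments land in a single container keyed by 'e', and the new segment is checked against that one container rather than against every fingerprint seen so far.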
Fingerprint list: the collection of fingerprints obtained from the data set by data partitioning and fingerprint extraction, kept in processing order.
Fingerprint input buffer: the fingerprint input cache is used for caching fingerprints in the fingerprint list. If the list of fingerprints is from a file, a number of fingerprints are read in one time and stored in the fingerprint input buffer. Then, the fingerprints in the fingerprint input buffer are segmented and representative fingerprint selection steps are performed.
Fingerprint container: the data structure in which the system stores fingerprints, and the basic unit of on-disk fingerprint storage and cache scheduling. A fingerprint container stores a variable number of distinct fingerprints (each fingerprint differs in value from every other fingerprint in the container).
Representative fingerprint index table: a key-value hash table residing in memory that stores the mapping from a representative fingerprint (RF) to the ID of the fingerprint container in which that representative fingerprint resides. Its function is to quickly locate where a fingerprint container is stored in the file when the container is looked up on disk. The specific structure of the representative fingerprint index table is shown in fig. 4. The fingerprint length is 20 bytes, the fingerprint container ID length is 4 bytes, and the pointer occupies 8 bytes. Hash table storage is subject to hash collisions, which are resolved by separate chaining.
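The field sizes above imply a 32-byte index entry. The sketch below packs one such entry with Python's `struct` module to make the arithmetic concrete; the field order, the little-endian layout without padding, and the helper names are assumptions, since Fig. 4 is not reproduced here.

```python
import struct

# 20-byte fingerprint, 4-byte container ID, 8-byte chain pointer.
# "<" selects little-endian with no alignment padding (an assumed layout).
ENTRY = struct.Struct("<20sIQ")


def pack_entry(fp: bytes, container_id: int, next_ptr: int = 0) -> bytes:
    """Serialize one index entry; next_ptr links colliding entries in a chain."""
    return ENTRY.pack(fp, container_id, next_ptr)


def index_memory_bytes(num_representatives: int) -> int:
    """Memory needed for the entries alone, ignoring bucket-array overhead."""
    return num_representatives * ENTRY.size


entry = pack_entry(b"\x01" * 20, 7)
fields = ENTRY.unpack(entry)
```

Since only one fingerprint per segment is indexed, the table grows with the number of distinct representative fingerprints rather than with the number of data blocks.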
Fingerprint container cache: an area of memory set aside to hold a new fingerprint container before it is written to disk, or a fingerprint container read from disk into memory.
Memory hit table: used to determine whether the fingerprint container being looked up is in the cache; if it is not, the required fingerprint container is read from disk into the fingerprint container cache.
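How the fingerprint container cache, the memory hit table and the on-disk containers might interact is sketched below. The one-file-per-container layout, the file naming, the explicit `evict` call and the class name are all assumptions made for illustration; the source does not specify a file format or an eviction policy.

```python
import os
import tempfile

FP_LEN = 20  # fingerprint length in bytes, as stated in the description


class ContainerCache:
    """In-memory container cache with a hit table, backed by one file per container."""

    def __init__(self, directory):
        self.directory = directory
        self.cache = {}         # container ID -> set of fingerprints
        self.hit_table = set()  # IDs of containers currently cached

    def _path(self, cid):
        return os.path.join(self.directory, f"container_{cid}.bin")

    def create(self, cid, fingerprints):
        """Build a new container directly in the cache."""
        self.cache[cid] = set(fingerprints)
        self.hit_table.add(cid)

    def evict(self, cid):
        """Write the whole container to disk in one write, then drop it from memory."""
        with open(self._path(cid), "wb") as f:
            f.write(b"".join(sorted(self.cache[cid])))
        del self.cache[cid]
        self.hit_table.discard(cid)

    def get(self, cid):
        """Return the container, reading it from disk as a whole on a miss."""
        if cid not in self.hit_table:
            with open(self._path(cid), "rb") as f:
                data = f.read()
            self.cache[cid] = {data[i:i + FP_LEN] for i in range(0, len(data), FP_LEN)}
            self.hit_table.add(cid)
        return self.cache[cid]


with tempfile.TemporaryDirectory() as tmp:
    cache = ContainerCache(tmp)
    cache.create(1, [b"a" * FP_LEN, b"b" * FP_LEN])
    cache.evict(1)
    was_cached = 1 in cache.hit_table
    reloaded = cache.get(1)
    now_cached = 1 in cache.hit_table
```

Reading and writing a container as a single unit is what keeps a lookup to at most one disk read per segment.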
In implementing the invention, the parameters of each module of the algorithm need to be set. The permitted and optimal ranges are given below:
fingerprint segment size: 32 to 8192 fingerprints, optimally 64 to 128;
number of fingerprints per fingerprint container: 512 to 4096 fingerprints, optimally 1024 to 2048.
The invention is further described below with reference to figs. 1 and 5 and the embodiments, assuming that the duplicate data block index capacity is designed as C records.
As shown in fig. 5, the method for detecting repeated data based on clustering of the present invention comprises the following steps:
(1) Acquiring a fingerprint list file from a disk, and judging whether partial fingerprints can be acquired from the fingerprint list file (the size of the partial fingerprints is equal to the size of the fingerprint input cache space opened up in the memory in advance, which is greater than 0% and less than 80% of the memory size); if not, ending the process; otherwise, storing the acquired partial fingerprints in the fingerprint input cache space and segmenting all N fingerprints in the fingerprint input cache space so that every M fingerprints form a fingerprint segment, and, when N is not divisible by M, grouping the remaining fewer than M fingerprints into one fingerprint segment;
The size M of the fingerprint segment may be any natural number, with a preferred value of 64 to 128.
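The segmentation rule in step (1), including the short final segment, can be sketched as follows. The function name is hypothetical, and the fingerprints are stood in for by integers.

```python
import math


def segment(fingerprints, m):
    """Split a fingerprint list into segments of M; the last may hold fewer than M."""
    if m < 1:
        raise ValueError("segment size M must be a natural number")
    return [fingerprints[i:i + m] for i in range(0, len(fingerprints), m)]


segments = segment(list(range(10)), 4)  # N = 10, M = 4
```

With a remainder segment there are ceil(N/M) segments in total, so one reading of the loop test in step (3) is that the counter i runs over all ceil(N/M) segments before control returns to step (1).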
Before the method is executed, the initialization step is also needed to be executed, namely, an empty fingerprint input buffer space, an empty fingerprint container buffer, an empty memory hit table and a representative fingerprint index table are set in the memory.
The fingerprint input buffer space is used for storing part of the fingerprint in the memory.
The fingerprint container cache is used for caching a portion of the fingerprint container in the memory.
The memory hit table is used to determine whether a fingerprint container is cached in the memory.
The representative fingerprint index table is used for storing the representative fingerprint in the memory and providing an index function for the representative fingerprint.
An advantage of step (1) is that the overall retrieval performance for repeated fingerprints can be tuned by setting the fingerprint segment size, which overcomes the low performance of prior-art methods that use the file as the logical segment for repeated fingerprint retrieval.
(2) Setting a counter i=1;
(3) Judging whether i is larger than N/M, if so, returning to the step (1), otherwise, entering the step (4);
(4) Taking out an ith fingerprint segment from the plurality of fingerprint segments obtained in the step (1), acquiring a fingerprint with the smallest fingerprint value in the ith fingerprint segment as a representative fingerprint (Representative fingerprint, RF for short), judging whether the representative fingerprint is positioned in a representative fingerprint index table in a memory, if so, entering the step (5), otherwise, entering the step (8);
Specifically, the fingerprint value of a fingerprint is obtained by converting the character-string fingerprint into a numeric fingerprint.
An advantage of this step is that the fingerprint container holding the repeated fingerprints can be located simply and efficiently, which effectively narrows the search range for repeated fingerprints.
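A minimal sketch of the conversion and selection in step (4), assuming the string form of a fingerprint is hexadecimal (the source does not name the encoding); both function names are hypothetical.

```python
def fingerprint_value(hex_fp: str) -> int:
    """Convert a hexadecimal string fingerprint into its numeric value."""
    return int(hex_fp, 16)


def representative_fingerprint(segment):
    """Return the fingerprint with the smallest numeric value in the segment."""
    return min(segment, key=fingerprint_value)


rf = representative_fingerprint(["0a", "9", "00ff"])
```

Comparing numerically rather than as strings matters when the strings differ in length or padding: as plain strings, "00ff" would sort before "9" even though its value is larger.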
(5) Taking out the fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, judging whether the fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache or not by searching the memory hit table, if so, entering the step (6), otherwise, reading the fingerprint container corresponding to the fingerprint container ID into the fingerprint container cache from the disk, and then entering the step (6);
(6) Removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, matching each fingerprint in the removed fingerprint segment with all fingerprints in a fingerprint container corresponding to the fingerprint container ID one by one, marking the fingerprint as repeated fingerprints if the matching result is repeated, and inserting the fingerprint into the fingerprint container if the matching result is not repeated, wherein when the number of the fingerprints reaches the upper limit of the capacity of the fingerprint container, the fingerprint container does not receive new fingerprints any more, and writing the fingerprint container back to a disk;
(7) Setting a counter i=i+1, and returning to step (3);
(8) Constructing a new fingerprint container in the fingerprint container cache, removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, inserting all fingerprints in the resulting fingerprint segment into the new fingerprint container, inserting the representative fingerprint RF and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and inserting the new fingerprint container ID into the memory hit table;
(9) Setting a counter i=i+1, and returning to step (3).
An advantage of the above steps is that the size of the fingerprint container can be effectively controlled, which avoids degradation of repeated fingerprint retrieval performance.
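Steps (1) to (9) can be condensed into the following in-memory sketch. It is a simplification under stated assumptions, not the patented implementation: all containers stay in memory (so the disk reads of step (5) and the write-back of step (6) are omitted), fingerprints are hexadecimal strings, repeats removed inside a segment are reported as duplicates, and the default values of `m` and `capacity` are picked from the optimal ranges given earlier.

```python
def detect(fingerprints, m=64, capacity=2048):
    """Return one flag per input fingerprint: True if detected as repeated."""
    rep_index = {}   # representative fingerprint -> container ID
    containers = {}  # container ID -> set of fingerprints (kept in memory here)
    flags = [False] * len(fingerprints)

    for start in range(0, len(fingerprints), m):       # steps (1)-(3), (7), (9)
        # Remove repeats inside the segment, remembering first positions.
        first_pos = {}
        for pos in range(start, min(start + m, len(fingerprints))):
            fp = fingerprints[pos]
            if fp in first_pos:
                flags[pos] = True
            else:
                first_pos[fp] = pos

        rep = min(first_pos, key=lambda fp: int(fp, 16))  # step (4)

        if rep in rep_index:                              # steps (5) and (6)
            container = containers[rep_index[rep]]
            for fp, pos in first_pos.items():
                if fp in container:
                    flags[pos] = True
                elif len(container) < capacity:           # a full container takes no more
                    container.add(fp)
        else:                                             # step (8)
            cid = len(rep_index)
            containers[cid] = set(first_pos)
            rep_index[rep] = cid
    return flags


flags = detect(["05", "02", "09", "02", "02", "05", "0a", "0b"], m=4)
missed = detect(["05", "02", "05", "01"], m=2)
```

In the second call the fingerprint "05" recurs in a segment whose representative is "01", so it is looked up in a different container and is not flagged. This illustrates why the description speaks of an effect close to exact detection on highly redundant data rather than exact detection.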
The technical effects of the invention are as follows: the invention is aimed mainly at data sets with strong similarity; it uses the similarity within the data set to narrow the detection range for repeated data and improve the throughput of data deduplication. Specifically, for possible repeated data in the data set, the invention segments the fingerprints in the fingerprint list, selects a representative fingerprint in each segment according to similarity theory, looks up the selected representative fingerprint in the representative fingerprint index table to map quickly to the similar container, and finally finds the repeated data by checking for it within that container, thereby improving repeated data detection performance.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. The repeated data detection method based on the clustering is characterized by comprising the following steps of:
(1) Acquiring a fingerprint list file from a magnetic disk, judging whether partial fingerprints can be acquired from the fingerprint list file, if not, ending the process, otherwise, storing the acquired partial fingerprints in a fingerprint input cache space, segmenting all fingerprints N in the fingerprint input cache space, and forming a fingerprint segment by every M fingerprints, wherein N is the number of all fingerprints, and M is any natural number;
(2) Setting a counter i=1;
(3) Judging whether i is larger than N/M, if so, returning to the step (1), otherwise, entering the step (4);
(4) Taking out an ith fingerprint segment from the plurality of fingerprint segments obtained in the step (1), acquiring a fingerprint with the smallest fingerprint value in the ith fingerprint segment as a representative fingerprint, judging whether the representative fingerprint is positioned in a representative fingerprint index table in a memory, if so, entering the step (5), otherwise, entering the step (8);
(5) Taking out the fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, judging whether the fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache or not by searching the memory hit table, if so, entering the step (6), otherwise, reading the fingerprint container corresponding to the fingerprint container ID into the fingerprint container cache from the disk, and then entering the step (6);
(6) Removing repeated fingerprints in the fingerprint segment where the representative fingerprint is located, matching each fingerprint in the removed fingerprint segment with all fingerprints in a fingerprint container corresponding to the fingerprint container ID one by one, marking the fingerprint as repeated fingerprints if the matching result is repeated, and inserting the fingerprint into the fingerprint container if the matching result is not repeated;
(7) Setting a counter i=i+1, and returning to step (3);
(8) Constructing a new fingerprint container in the fingerprint container cache, removing repeated fingerprints from the fingerprint segment containing the representative fingerprint, inserting all fingerprints remaining in the segment into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and inserting the new fingerprint container ID into the memory hit table;
(9) Setting a counter i=i+1, and returning to step (3).
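Steps (1) to (9) can be illustrated with a minimal Python sketch. It is not the patented implementation: fingerprints are assumed to be already numeric, each list in `batches` stands in for one fill of the fingerprint input cache space, the `disk` dictionary stands in for on-disk container storage, and the FIFO eviction in `cache_put` (with its `cache_size` parameter) is an assumption, since the claim does not state how containers leave the cache. Repeats removed inside a segment are dropped without being reported, which is also an assumption.

```python
from typing import Dict, Iterable, List, Set


def detect_duplicates(batches: Iterable[List[int]], m: int,
                      cache_size: int = 2) -> List[int]:
    """Return the fingerprints marked as repeated, in detection order."""
    rep_index: Dict[int, int] = {}    # representative fingerprint -> container ID
    cache: Dict[int, Set[int]] = {}   # fingerprint container cache (in memory)
    disk: Dict[int, Set[int]] = {}    # stand-in for containers stored on disk
    hit_table: Set[int] = set()       # IDs of containers currently cached
    repeated: List[int] = []
    next_id = 0

    def cache_put(cid: int, container: Set[int]) -> None:
        # Assumed FIFO eviction: the oldest cached container goes back to "disk".
        if cache and len(cache) >= cache_size:
            old = next(iter(cache))
            disk[old] = cache.pop(old)
            hit_table.discard(old)
        cache[cid] = container
        hit_table.add(cid)

    for batch in batches:                                  # step (1)
        for start in range(0, len(batch), m):              # steps (2), (3), (7), (9)
            segment = batch[start:start + m]
            rep = min(segment)                             # step (4)
            unique = list(dict.fromkeys(segment))          # drop repeats in segment
            if rep in rep_index:
                cid = rep_index[rep]
                if cid not in hit_table:                   # step (5)
                    cache_put(cid, disk.pop(cid))
                container = cache[cid]
                for fp in unique:                          # step (6)
                    if fp in container:
                        repeated.append(fp)
                    else:
                        container.add(fp)
            else:                                          # step (8)
                cache_put(next_id, set(unique))
                rep_index[rep] = next_id
                next_id += 1
    return repeated


print(detect_duplicates([[5, 3, 9, 3], [3, 5, 7, 9]], m=4))  # prints [3, 5, 9]
```

In this sketch a segment is compared only against the single container selected by its representative fingerprint, so a repeated fingerprint whose segment has a previously unseen representative is not detected.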
2. The repeated data detection method according to claim 1, further comprising, before the step (1), the step of setting in the memory an empty fingerprint input cache space for storing a part of the fingerprints, an empty fingerprint container cache for caching a part of the fingerprint containers, an empty memory hit table for judging whether a given fingerprint container is already cached in the memory, and a representative fingerprint index table for providing an index function for the representative fingerprints.
3. The method of claim 1, wherein when N is not divisible by M, the remaining fewer than M fingerprints are grouped into a fingerprint segment.
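The remainder rule of claim 3 can be sketched with plain slicing. The helper name `split_into_segments` is hypothetical. Under this rule the number of segments is the ceiling of N/M, which suggests the comparison in step (3) should be read against that rounded-up value.

```python
from typing import List, Sequence


def split_into_segments(fingerprints: Sequence[int], m: int) -> List[List[int]]:
    """Split fingerprints into segments of m; the last segment may be shorter."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return [list(fingerprints[k:k + m]) for k in range(0, len(fingerprints), m)]


# N = 10 and M = 4 give segments of sizes 4, 4 and 2.
segments = split_into_segments(list(range(10)), 4)
```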
4. The method of claim 2, wherein the size of the partial fingerprints equals the size of the fingerprint input cache space, which is greater than 0% and at most 80% of the memory size.
5. The method for detecting repeated data according to any one of claims 1 to 4, wherein the size M of the fingerprint segment is 32 to 8192 fingerprints.
6. The repetitive data detecting method according to claim 5, wherein the fingerprint value of the fingerprint is obtained by converting a string type fingerprint into a numeric type fingerprint.
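As an illustration of claim 6, the sketch below assumes that string-type fingerprints are hexadecimal digests (SHA-1 is used here only as an example; the claim does not name a hash function) and converts them with Python's built-in `int(..., 16)`. The helper names `fingerprint_value` and `representative` are hypothetical.

```python
import hashlib
from typing import List


def fingerprint_value(fp_hex: str) -> int:
    """Convert a hexadecimal string fingerprint to its numeric value."""
    return int(fp_hex, 16)


def representative(segment: List[str]) -> str:
    """Return the fingerprint with the smallest numeric value in the segment."""
    return min(segment, key=fingerprint_value)


chunks = [b"alpha", b"beta", b"gamma"]
segment = [hashlib.sha1(c).hexdigest() for c in chunks]
rep = representative(segment)
```

Comparing numeric values rather than raw strings keeps the choice of representative independent of letter case and of leading zeros in the string form.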
7. The repetitive data detection method according to claim 1, wherein in the step (6), when the number of fingerprints reaches an upper limit of the capacity of the fingerprint container, the fingerprint container no longer receives a new fingerprint, and the fingerprint container is written back to the disk.
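The capacity rule of claim 7 can be sketched as a small container class. The class name, the `sealed` flag, and the `disk` dictionary standing in for on-disk storage are assumptions made for illustration. The claim does not say what happens to a fingerprint that arrives after the container is full, so the sketch only reports the rejection and leaves the decision to the caller.

```python
from typing import Dict, Set


class FingerprintContainer:
    """A fingerprint container with a fixed capacity (illustrative only)."""

    def __init__(self, container_id: int, capacity: int) -> None:
        self.container_id = container_id
        self.capacity = capacity
        self.fingerprints: Set[int] = set()
        self.sealed = False  # True once the capacity limit has been reached

    def insert(self, fp: int, disk: Dict[int, Set[int]]) -> bool:
        """Insert fp; return False if the container no longer accepts fingerprints."""
        if self.sealed:
            return False
        self.fingerprints.add(fp)
        if len(self.fingerprints) >= self.capacity:
            self.sealed = True
            disk[self.container_id] = set(self.fingerprints)  # write back to "disk"
        return True


disk: Dict[int, Set[int]] = {}
container = FingerprintContainer(container_id=7, capacity=2)
results = [container.insert(fp, disk) for fp in (1, 2, 3)]
```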
8. A cluster-based duplicate data detection system, comprising:
a first module, configured to acquire a fingerprint list file from a magnetic disk and judge whether partial fingerprints can be acquired from the fingerprint list file; if not, end the process; otherwise, store the acquired partial fingerprints in a fingerprint input cache space and divide the N fingerprints in the fingerprint input cache space into fingerprint segments of M fingerprints each, wherein N is the total number of fingerprints in the fingerprint input cache space and M is any natural number;
a second module, configured to set a counter i=1;
a third module, configured to judge whether i is larger than N/M, if so, return to the first module, otherwise, enter the fourth module;
a fourth module, configured to take out an ith fingerprint segment from the plurality of fingerprint segments obtained in the first module, obtain a fingerprint with a smallest fingerprint value in the ith fingerprint segment as a representative fingerprint, and determine whether the representative fingerprint is located in a representative fingerprint index table in the memory, if so, enter the fifth module, otherwise, enter the eighth module;
a fifth module, configured to take out a fingerprint container ID corresponding to the representative fingerprint from the representative fingerprint index table, and determine whether a fingerprint container corresponding to the fingerprint container ID exists in the fingerprint container cache by looking up the memory hit table, if yes, enter the sixth module, otherwise read the fingerprint container corresponding to the fingerprint container ID from the disk into the fingerprint container cache, and then go to the sixth module;
a sixth module, configured to remove repeated fingerprints from the fingerprint segment containing the representative fingerprint, then match each fingerprint remaining in the segment, one by one, against all fingerprints in the fingerprint container corresponding to the fingerprint container ID; if a match is found, mark the fingerprint as a repeated fingerprint, and otherwise insert the fingerprint into the fingerprint container;
a seventh module, configured to set a counter i=i+1, and return to the third module;
an eighth module, configured to construct a new fingerprint container in the fingerprint container cache, remove repeated fingerprints from the fingerprint segment containing the representative fingerprint, insert all fingerprints remaining in the segment into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
a ninth module, configured to set a counter i=i+1, and return to the third module.
CN201710747552.2A 2017-08-28 2017-08-28 Repeated data detection method based on clustering Active CN107515931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710747552.2A CN107515931B (en) 2017-08-28 2017-08-28 Repeated data detection method based on clustering

Publications (2)

Publication Number Publication Date
CN107515931A (en) 2017-12-26
CN107515931B (en) 2023-04-25

Family

ID=60724325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710747552.2A Active CN107515931B (en) 2017-08-28 2017-08-28 Repeated data detection method based on clustering

Country Status (1)

Country Link
CN (1) CN107515931B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445702B (en) * 2018-10-26 2019-12-06 黄淮学院 block-level data deduplication storage system
CN109783523B (en) * 2019-01-24 2022-02-25 广州虎牙信息科技有限公司 Data processing method, device, equipment and storage medium
CN112100318B (en) * 2020-11-12 2021-02-26 北京智慧星光信息技术有限公司 Multi-dimensional information merging method, device, equipment and storage medium
CN112329717B (en) * 2020-11-25 2023-08-01 中国人民解放军国防科技大学 Fingerprint cache method for mass data similarity detection
CN115827619B (en) * 2023-01-06 2023-05-09 山东捷瑞数字科技股份有限公司 Method, device and equipment for detecting repeated data based on three-dimensional engine

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1940995A (en) * 2005-09-29 2007-04-04 中国科学院自动化研究所 Method for compressing fingerprint direction quantized diagram to embedded system
CN101681381A (en) * 2007-06-06 2010-03-24 杜比实验室特许公司 Improving audio/video fingerprint search accuracy using multiple search combining
CN102663086A (en) * 2012-04-09 2012-09-12 华中科技大学 Method for retrieving data block indexes
CN103699567A (en) * 2013-11-04 2014-04-02 北京中搜网络技术股份有限公司 Method for realizing same news clustering based on title fingerprint and text fingerprint
US8930648B1 (en) * 2012-05-23 2015-01-06 Netapp, Inc. Distributed deduplication using global chunk data structure and epochs
CN105493080A (en) * 2013-12-23 2016-04-13 华为技术有限公司 Method and apparatus for context aware based data de-duplication
CN105989033A (en) * 2015-02-03 2016-10-05 北京中搜网络技术股份有限公司 Information duplication eliminating method based on information fingerprints

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
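Xinye Li et al. Clustering Web Retrieval Results Accompanied by Removing Duplicate Documents.《2010 International Conference on Web Information Systems and Mining》.2011, 259-261. *
Zhang Panfeng. Research on Duplicate Data Detection Technology in Data Deduplication.《China Doctoral Dissertations Full-text Database, Information Science and Technology》.2018, I138-22. *
Yin Bo et al. An Improved STC Algorithm Based on Repeated Strings.《Microcomputer Information》.2009, Vol. 25, 206-208. *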

Also Published As

Publication number Publication date
CN107515931A (en) 2017-12-26

Similar Documents

Publication Publication Date Title
CN107515931B (en) Repeated data detection method based on clustering
Tuarob et al. Automatic detection of pseudocodes in scholarly documents using machine learning
Riba et al. Handwritten word spotting by inexact matching of grapheme graphs
JPWO2008026414A1 (en) Image recognition method, image recognition apparatus, and image recognition program
CN111177432B (en) Large-scale image retrieval method based on hierarchical depth hash
CN103399896A (en) Method and system for recognizing association relationships among users
US9280551B2 (en) De-duplication deployment planning
CN107844493B (en) File association method and system
CN103207889A (en) Method for retrieving massive face images based on Hadoop
CN102799614A (en) Image search method based on space symbiosis of visual words
Zhang et al. Hashfile: An efficient index structure for multimedia data
Chen et al. A novel algorithm for mining closed temporal patterns from interval-based data
CN114610708A (en) Vector data processing method and device, electronic equipment and storage medium
CN106648991A (en) Duplicated data deletion method in data recovery system
US20190050298A1 (en) Method and apparatus for improving database recovery speed using log data analysis
CN107301203B (en) Mass data comparison method and system
Hassan et al. Word shape descriptor-based document image indexing: a new DBH-based approach
Le et al. Improving logo spotting and matching for document categorization by a post-filter based on homography
Yin et al. Content-based image retrial based on Hadoop
Nie et al. Efficient storage support for real-time near-duplicate video retrieval
KR20130130330A (en) Hash-based skyline query processing method and apparatus thereof
CN110704643A (en) Method and device for automatically identifying same author of different documents and storage medium terminal
Zhou et al. Adaptive subspace symbolization for content-based video detection
CN110781160B (en) Data recovery method based on VMware virtualization file system damage
Munarko et al. HII: Histogram Inverted Index for Fast Images Retrieval.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant