CN107515931A - Clustering-based duplicate data detection method - Google Patents
Clustering-based duplicate data detection method
- Publication number
- CN107515931A (application CN201710747552.2A)
- Authority
- CN
- China
- Prior art keywords
- fingerprint
- container
- section
- module
- internal memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Collating Specific Patterns (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a clustering-based duplicate data detection method. Aimed mainly at data sets with strong data similarity, it exploits the data similarity principle within the data set to improve the performance of duplicate data detection and, with it, the performance of data deduplication. Specifically, for possible duplicate data in the data set, the invention applies a similarity merging strategy: the fingerprint list to be detected is first divided into segments, a representative fingerprint is selected for each segment, and the segments are classified and merged into different fingerprint containers according to their representative fingerprints. A fingerprint container collects the repeated fingerprints from similar segments of the data set, which increases both the efficiency and the performance of deduplication. Fingerprint containers are stored on disk and can be written to and read from disk as a whole, which improves fingerprint retrieval efficiency and overcomes the fragmented storage of similar segments.
Description
Technical field
The invention belongs to the technical field of computer storage and, more particularly, relates to a clustering-based duplicate data detection method and system.
Background
With the rapid development of information technology, information has become a precious resource on which we depend and the greatest driving force behind the rapid growth of productivity. The wide application of information technology is accompanied by the generation of massive amounts of data, and more and more valuable data needs to be stored. How to effectively improve the storage efficiency of existing storage media and meet ever-growing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, an IDC survey report shows that about 75% of existing data is redundant, that is, only 25% of data is unique. Against this background, data deduplication, a technique for detecting and eliminating redundancy over a large spatial range, has become a research hotspot in academia and industry in recent years and is being applied ever more widely in information storage systems.
Detecting duplicate fingerprints is an important technical means of implementing data deduplication. In existing deduplication techniques, duplicate data is detected mainly by fingerprint detection: the fingerprint (hash value) of each data block is extracted, and the data block is identified as a duplicate or not by checking whether its fingerprint is repeated. Current duplicate fingerprint detection methods typically identify duplicate fingerprint segments with a single data structure such as a hash table or a B-tree.
However, a problem with these duplicate fingerprint detection methods that cannot be ignored is their low detection performance: they cannot detect duplicate data effectively on large data sets, which degrades the overall efficiency of data deduplication.
Summary of the invention
In view of the above defects of, or improvement needs for, the prior art, the invention provides a clustering-based duplicate data detection method and system. Its purpose is to solve the technical problem that existing duplicate data detection methods based on fingerprint detection have low detection performance and cannot detect duplicate data effectively on large data sets.
To achieve the above object, according to one aspect of the present invention, a clustering-based duplicate data detection method is provided, comprising the following steps:
(1) Obtain the fingerprint list file from disk and determine whether a batch of fingerprints can be read from it. If none can be read, the process ends; otherwise store the batch in the fingerprint input buffer and divide all N fingerprints in the buffer into segments, every M fingerprints forming one fingerprint segment, where N is the number of fingerprints in the buffer and M is an arbitrary natural number;
(2) Set the counter i=1;
(3) Determine whether i is greater than N/M. If so, return to step (1); otherwise go to step (4);
(4) Take the i-th fingerprint segment from the segments obtained in step (1), select the fingerprint with the smallest fingerprint value in that segment as the representative fingerprint, and determine whether this representative fingerprint is present in the representative fingerprint index table in memory. If so, go to step (5); otherwise go to step (8);
(5) Look up the fingerprint container ID corresponding to the representative fingerprint in the representative fingerprint index table, and consult the memory hit table to determine whether the fingerprint container with that ID is present in the fingerprint container cache. If so, go to step (6); otherwise read that fingerprint container from disk into the fingerprint container cache and then go to step (6);
(6) Remove the repeated fingerprints within the fingerprint segment that contains the representative fingerprint, then match each remaining fingerprint of the segment, one by one, against all fingerprints in the fingerprint container with that ID. If the match shows a repeat, mark the fingerprint as a duplicate fingerprint; otherwise insert the fingerprint into the fingerprint container;
(7) Set the counter i=i+1 and return to step (3);
(8) Create a new fingerprint container in the fingerprint container cache, remove the repeated fingerprints within the fingerprint segment that contains the representative fingerprint, insert all remaining fingerprints of the segment into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
(9) Set the counter i=i+1 and return to step (3).
Preferably, the method further comprises, before step (1), a step of setting up in memory an empty fingerprint input buffer, an empty fingerprint container cache, an empty memory hit table and a representative fingerprint index table. The fingerprint input buffer stores a batch of fingerprints in memory; the fingerprint container cache caches some of the fingerprint containers in memory; the memory hit table is used to determine whether a given fingerprint container is already cached in memory; and the representative fingerprint index table keeps the representative fingerprints in memory and provides an index over them.
Preferably, when N is not divisible by M, the remaining fingerprints, fewer than M, form one fingerprint segment.
Preferably, the size of a batch of fingerprints equals the size of the fingerprint input buffer, which is greater than 0% and less than 80% of the memory size.
Preferably, the fingerprint segment size M is 64 to 128.
Preferably, the fingerprint value of a fingerprint is obtained by converting the fingerprint from a string type to a numeric type.
Preferably, in step (6), when the number of fingerprints reaches the capacity limit of the fingerprint container, the container accepts no further fingerprints and is written back to disk.
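The segmentation rule of step (1), including the case where N is not divisible by M, can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `segment_fingerprints` is hypothetical.

```python
def segment_fingerprints(fingerprints, m):
    """Split a fingerprint list into consecutive segments of m fingerprints.

    When len(fingerprints) is not divisible by m, the remaining fingerprints
    (fewer than m) form one final, shorter segment.
    """
    if m <= 0:
        raise ValueError("segment size m must be a positive integer")
    return [fingerprints[i:i + m] for i in range(0, len(fingerprints), m)]


segments = segment_fingerprints(list(range(10)), 4)
print([len(s) for s in segments])  # [4, 4, 2]
```

With a remainder segment, the number of segments is the ceiling of N/M, so a loop over the segments themselves (rather than a counter compared with N/M) is one simple way to make sure the last, shorter segment is also processed.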
According to another aspect of the present invention, a clustering-based duplicate data detection system is provided, comprising:
A first module, for obtaining the fingerprint list file from disk and determining whether a batch of fingerprints can be read from it; if none can be read, the process ends; otherwise the batch is stored in the fingerprint input buffer and all N fingerprints in the buffer are divided into segments, every M fingerprints forming one fingerprint segment, where N is the number of fingerprints in the buffer and M is an arbitrary natural number;
A second module, for setting the counter i=1;
A third module, for determining whether i is greater than N/M; if so, control returns to the first module, otherwise it passes to the fourth module;
A fourth module, for taking the i-th fingerprint segment from the segments obtained by the first module, selecting the fingerprint with the smallest fingerprint value in that segment as the representative fingerprint, and determining whether this representative fingerprint is present in the representative fingerprint index table in memory; if so, control passes to the fifth module, otherwise to the eighth module;
A fifth module, for looking up the fingerprint container ID corresponding to the representative fingerprint in the representative fingerprint index table, and consulting the memory hit table to determine whether the fingerprint container with that ID is present in the fingerprint container cache; if so, control passes to the sixth module; otherwise that fingerprint container is read from disk into the fingerprint container cache and control then passes to the sixth module;
A sixth module, for removing the repeated fingerprints within the fingerprint segment that contains the representative fingerprint and matching each remaining fingerprint of the segment, one by one, against all fingerprints in the fingerprint container with that ID; if the match shows a repeat, the fingerprint is marked as a duplicate fingerprint, otherwise it is inserted into the fingerprint container;
A seventh module, for setting the counter i=i+1 and returning to the third module;
An eighth module, for creating a new fingerprint container in the fingerprint container cache, removing the repeated fingerprints within the fingerprint segment that contains the representative fingerprint, inserting all remaining fingerprints of the segment into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and inserting the new fingerprint container ID into the memory hit table;
A ninth module, for setting the counter i=i+1 and returning to the third module.
In general, compared with the prior art, the above technical solution conceived by the present invention can achieve the following beneficial effects:
(1) By processing fingerprints in segments, the invention effectively narrows the range searched for duplicate fingerprints and thereby improves duplicate fingerprint retrieval performance;
(2) Because it effectively narrows the range searched for duplicate fingerprints, the invention is especially suitable for duplicate fingerprint retrieval in large data sets;
(3) For highly redundant data sets, the invention achieves results close to those of exact detection;
(4) Because the fingerprint segmentation process of the invention uses clustering, a similar fingerprint container can be read into memory and processed in a single operation, which avoids the drawback of existing methods, where the fingerprints are stored at multiple disk locations and multiple reads are needed.
Brief description of the drawings
Fig. 1 is a logical structure diagram of the present invention.
Fig. 2 is a diagram of the basic principle of the similarity algorithm of the present invention.
Fig. 3 is a schematic diagram of similarity merging.
Fig. 4 shows the representative fingerprint index table.
Fig. 5 is a flow chart of the clustering-based duplicate data detection method of the present invention.
Detailed description
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
The invention provides an efficient clustering-based duplicate data detection method aimed mainly at data sets with strong similarity. Using the similarity principle and clustering, it gathers similar data in the data set and stores it together, which solves the low detection efficiency of existing duplicate data detection methods and accommodates ever-expanding storage demand.
The basic idea of the invention is to divide the fingerprint list to be detected into segments, extract a representative fingerprint from each, identify the similar fingerprint container by looking up the representative fingerprint, and then identify the duplicate data by searching that fingerprint container. This effectively narrows the range of fingerprint detection and greatly improves duplicate fingerprint detection performance.
The basic logical structure of the invention is shown in Fig. 1. It consists of four parts: the representative fingerprint index table, the memory hit table, the fingerprint container cache, and the fingerprint containers on disk.
To describe the invention clearly, the terms used in this specification are explained below:
Similarity algorithm: data similarity theory is used to search for identical text blocks across different documents; see Broder, A.Z., On the resemblance and containment of documents, in Compression and Complexity of Sequences. 1997, IEEE. p.21~29. The Broder similarity algorithm gives the following theorem:
Theorem 1: Consider two sets S1 and S2, and let H(S1) and H(S2) be the sets of hash fingerprints of the elements of S1 and S2 respectively. Let min(H(S)) denote the minimum value of H(S). Then:
Pr[min(H(S1)) = min(H(S2))] = |S1 ∩ S2| / |S1 ∪ S2|
The theorem says that when the minimum hash value of the elements of one set equals the minimum hash value of the elements of another set, the two sets share a certain number of elements with high probability. For a data deduplication system this means that if the minimum data block fingerprints of two different sets of data blocks are identical, the two sets share a large number of data blocks with high probability. For ease of discussion, the minimum fingerprint of a set is called its representative fingerprint.
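Broder's result states that, for a randomly chosen hash function, the probability that two sets have the same minimum hash value equals their Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2|. The sketch below checks this empirically. It is an illustration only: salted SHA-1 is used as a stand-in for a family of random hash functions, and the names `min_hash` and `estimate_min_match` are hypothetical.

```python
import hashlib
import random


def min_hash(elements, salt):
    """Smallest salted SHA-1 value (as an integer) over a set of elements."""
    return min(
        int.from_bytes(hashlib.sha1(salt + str(e).encode()).digest(), "big")
        for e in elements
    )


def estimate_min_match(s1, s2, trials=2000, seed=0):
    """Fraction of random hash functions under which both minima coincide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        salt = rng.getrandbits(64).to_bytes(8, "big")  # one salt = one hash function
        if min_hash(s1, salt) == min_hash(s2, salt):
            hits += 1
    return hits / trials


s1 = set(range(0, 60))
s2 = set(range(30, 90))
jaccard = len(s1 & s2) / len(s1 | s2)  # 30 shared elements out of 90
estimate = estimate_min_match(s1, s2)
print(f"Jaccard = {jaccard:.3f}, estimated match probability = {estimate:.3f}")
```

The estimate should land close to 1/3 for these two sets, which is why a matching representative fingerprint is a useful, if probabilistic, signal that two segments overlap heavily.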
In a data deduplication system, the directories of the data set are first scanned recursively to form a file list. Each file in the list is cut into data blocks by a chunking algorithm, and a hash is computed for each data block to form the fingerprint list. Fig. 2 shows an example of how the fingerprint list is obtained and how data similarity theory is applied.
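How a fingerprint list might be produced, and how a string fingerprint can be turned into a numeric value for the minimum comparison, can be sketched as follows. The text does not name the chunking algorithm or the hash function, so fixed-size chunking and SHA-1 (whose 20-byte digest matches the fingerprint length given later for the index table) are assumptions, as are the function names.

```python
import hashlib


def fingerprint_list(data, chunk_size=4096):
    """Cut data into fixed-size blocks and return one SHA-1 hex digest per block."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def fingerprint_value(fp_hex):
    """Convert a string (hex) fingerprint into a numeric value for comparison."""
    return int(fp_hex, 16)


# Three blocks; the first and third have identical content.
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
fps = fingerprint_list(data)
print(len(fps), fps[0] == fps[2], fps[0] == fps[1])  # 3 True False

# The representative fingerprint is the one with the smallest numeric value.
representative = min(fps, key=fingerprint_value)
```

Identical blocks yield identical fingerprints, which is what lets duplicate detection work on fingerprints alone without comparing block contents.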
Following data similarity theory, the fingerprint list is divided into multiple fingerprint subsets. Here a fingerprint subset is defined as a segment; through the fingerprint index table, each subset is stored in its similar fingerprint container.
Similarity merging: a feature of the invention is its segment merging strategy, which merges similar segments into one data container and thereby greatly speeds up the search for similar segments.
In the invention, the detection system classifies the similar segments of the fingerprint list by their representative fingerprints and merges them into different data containers. A representative fingerprint index table is built that maps each representative fingerprint to the address of a fingerprint container, and each representative fingerprint corresponds to exactly one fingerprint container. With this design the detection system can quickly locate the fingerprint container and use it to compare duplicate data, without searching a large space for scattered similar segments. This greatly narrows the search range and speeds up deduplication.
Fig. 3 illustrates how merging similar segments works. The figure assumes that the fingerprint list contains three segments: segment (a) {e, f, g, n, c, w}, segment (b) {f, n, w, t, m, e} and segment (c) {t, m, e, w, c, j, h}, where each character denotes one data block fingerprint. The three segments in the example are similar segments because they have the same representative fingerprint 'e'. RMD merges them and stores them in one location, the fingerprint container. When a new data segment with the same representative fingerprint 'e' arrives, the detection system compares all fingerprints of that segment with the fingerprints in the similar container and, because of the similarity of the data, is very likely to identify most of the duplicate data blocks. Overall, merging similar segments reduces the number of disk I/O operations caused by fingerprint detection while improving the accuracy of locating duplicate data objects.
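The Fig. 3 example can be traced with a few lines of code. Because the letters are only labels for fingerprints, the numeric values in `VALUE` are hypothetical and chosen so that 'e' is the smallest, as the example assumes; the grouping logic is a sketch, not the system's actual implementation.

```python
# Hypothetical numeric values for the lettered fingerprints of Fig. 3,
# chosen so that 'e' is the smallest, as the text assumes.
VALUE = {"e": 1, "c": 2, "f": 3, "g": 4, "h": 5,
         "j": 6, "m": 7, "n": 8, "t": 9, "w": 10}

segments = {
    "a": ["e", "f", "g", "n", "c", "w"],
    "b": ["f", "n", "w", "t", "m", "e"],
    "c": ["t", "m", "e", "w", "c", "j", "h"],
}

containers = {}  # representative fingerprint -> set of merged fingerprints
duplicates = {}  # segment name -> fingerprints already present in the container

for name, segment in segments.items():
    rep = min(segment, key=VALUE.__getitem__)      # representative fingerprint
    container = containers.setdefault(rep, set())  # one container per representative
    duplicates[name] = [fp for fp in segment if fp in container]
    container.update(segment)                      # merge the segment into the container

print(sorted(containers))    # ['e']
print(duplicates["b"])       # ['f', 'n', 'w', 'e']
print(duplicates["c"])       # ['t', 'm', 'e', 'w', 'c']
print(len(containers["e"]))  # 10
```

All three segments land in the single container keyed by 'e', and most of the fingerprints in segments (b) and (c) are found as duplicates with one container lookup each.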
Fingerprint list: the set of fingerprints, in processing order, obtained by chunking the data set and extracting a fingerprint from each block.
Fingerprint input buffer: the fingerprint input buffer caches fingerprints from the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints is read in at once and stored in the fingerprint input buffer. The fingerprints in the buffer then undergo segmentation and representative fingerprint selection.
Fingerprint container: the fingerprint container is the data structure in which the system stores fingerprints, and it is also the basic unit of fingerprint disk storage and cache scheduling. A fingerprint container stores a variable number of independent fingerprints (an independent fingerprint is one whose value differs from that of every other fingerprint).
Representative fingerprint index table: the representative fingerprint index table is a key-value hash table that resides in memory and holds the mapping from each representative fingerprint (RF) to the ID of the fingerprint container in which it is located. Its purpose is to quickly locate, when a fingerprint container on disk is looked up, the position in the file where that container is stored. The structure of the table is shown in Fig. 4. A fingerprint is 20 bytes long, a fingerprint container ID is 4 bytes long, and a pointer occupies 8 bytes. Hash table storage is subject to hash collisions; when a collision occurs, it is resolved by chaining.
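The entry layout and the chaining strategy described above can be sketched as follows. The class name `RepresentativeIndex`, the bucket count and the bucket-selection rule are assumptions for illustration; in Python the chain pointer is implicit in the list, so the `struct` layout is shown only to make the 32-byte entry size concrete.

```python
import struct

# One entry: 20-byte fingerprint, 4-byte container ID, 8-byte chain pointer.
ENTRY = struct.Struct("<20sIQ")  # "<" disables padding, so the size is 32 bytes


class RepresentativeIndex:
    """Hash table from representative fingerprint to container ID (chaining)."""

    def __init__(self, buckets=1024):
        self.buckets = [[] for _ in range(buckets)]

    def _chain(self, fingerprint):
        slot = int.from_bytes(fingerprint[:8], "big") % len(self.buckets)
        return self.buckets[slot]

    def insert(self, fingerprint, container_id):
        chain = self._chain(fingerprint)
        for i, (fp, _) in enumerate(chain):
            if fp == fingerprint:              # key already present: overwrite
                chain[i] = (fingerprint, container_id)
                return
        chain.append((fingerprint, container_id))  # collision: extend the chain

    def lookup(self, fingerprint):
        for fp, container_id in self._chain(fingerprint):
            if fp == fingerprint:
                return container_id
        return None


index = RepresentativeIndex(buckets=1)  # a single bucket forces every key to collide
index.insert(b"\x01" * 20, 7)
index.insert(b"\x02" * 20, 9)
print(ENTRY.size)                   # 32
print(index.lookup(b"\x01" * 20))   # 7
print(index.lookup(b"\x02" * 20))   # 9
print(index.lookup(b"\x03" * 20))   # None
```

At 32 bytes per entry, one million representative fingerprints would need roughly 32 MB for entries alone, which is why only one fingerprint per segment, rather than every fingerprint, is kept in this in-memory table.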
Fingerprint container cache: the fingerprint container cache is a buffer area allocated in memory. It holds new fingerprint containers before they are written to disk and fingerprint containers that have been read from disk into memory.
Memory hit table: used to determine whether the fingerprint container being looked up is in the cache. If it is not, the required fingerprint container is read from disk into the fingerprint container cache.
In implementing the invention, the parameters of each module of the algorithm need to be configured. The ranges of these parameters are given below:
Fingerprint segment size: 32-8192 fingerprints, optimal range 64-128;
Number of fingerprints merged per fingerprint container: 512-4096 fingerprints, optimal range 1024-2048;
Assuming that the duplicate data block index is designed with a capacity of C records, the invention is further described below with reference to Fig. 1, Fig. 5 and the embodiment.
As shown in Fig. 5, the clustering-based duplicate data detection method of the invention comprises the following steps:
(1) Obtain the fingerprint list file from disk and determine whether a batch of fingerprints can be read from it (the size of the batch equals the size of the fingerprint input buffer allocated in memory in advance, which is greater than 0% and less than 80% of the memory size). If none can be read, the process ends; otherwise store the batch in the fingerprint input buffer and divide all N fingerprints in the buffer into segments, every M fingerprints forming one fingerprint segment. When N is not divisible by M, the remaining fingerprints, fewer than M, form one fingerprint segment;
The fingerprint segment size M can be an arbitrary natural number; its preferred value is 64 to 128.
Before the method is executed, an initialization step is also required: an empty fingerprint input buffer, an empty fingerprint container cache, an empty memory hit table and a representative fingerprint index table are set up in memory.
The fingerprint input buffer stores a batch of fingerprints in memory.
The fingerprint container cache caches some of the fingerprint containers in memory.
The memory hit table is used to determine whether a given fingerprint container is already cached in memory.
The representative fingerprint index table keeps the representative fingerprints in memory and provides an index over them.
The advantage of step (1) is that setting the fingerprint segment size allows the overall retrieval performance for duplicate fingerprints to be optimized, overcoming the poor performance of prior-art duplicate fingerprint retrieval that uses the file size as the logical segment.
(2) Set the counter i=1;
(3) Determine whether i is greater than N/M. If so, return to step (1); otherwise go to step (4);
(4) Take the i-th fingerprint segment from the segments obtained in step (1), select the fingerprint with the smallest fingerprint value in that segment as the representative fingerprint (RF), and determine whether this representative fingerprint is present in the representative fingerprint index table in memory. If so, go to step (5); otherwise go to step (8);
Specifically, the fingerprint value of a fingerprint is obtained by converting the fingerprint from a string type to a numeric type.
The advantage of this step is that the fingerprint container holding the duplicate fingerprints can be found simply and efficiently, which effectively narrows the range searched for duplicate fingerprints.
(5) Look up the fingerprint container ID corresponding to the representative fingerprint in the representative fingerprint index table, and consult the memory hit table to determine whether the fingerprint container with that ID is present in the fingerprint container cache. If so, go to step (6); otherwise read that fingerprint container from disk into the fingerprint container cache and then go to step (6);
(6) Remove the repeated fingerprints within the fingerprint segment that contains the representative fingerprint, then match each remaining fingerprint of the segment, one by one, against all fingerprints in the fingerprint container with that ID. If the match shows a repeat, mark the fingerprint as a duplicate fingerprint; otherwise insert the fingerprint into the fingerprint container. When the number of fingerprints reaches the capacity limit of the fingerprint container, the container accepts no further fingerprints and is written back to disk;
(7) Set the counter i=i+1 and return to step (3);
(8) Create a new fingerprint container in the fingerprint container cache, remove the repeated fingerprints within the fingerprint segment that contains the representative fingerprint, insert all remaining fingerprints of the segment into the new fingerprint container, insert the representative fingerprint RF and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
(9) Set the counter i=i+1 and return to step (3).
The advantage of this step is that the size of the fingerprint containers is effectively controlled, which prevents duplicate fingerprint retrieval performance from degrading.
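Steps (2) to (9) for a single batch of fingerprints can be condensed into the following sketch. It rests on several assumptions that are not in the text: fingerprints are already numeric, the disk is modeled as a dictionary, container IDs are assigned sequentially, set membership replaces the one-by-one matching, and the container capacity limit with its write-back is left out. The function name and parameters are hypothetical.

```python
def detect_duplicates(fingerprints, m, index, disk, cache, hit_table):
    """Process one batch of numeric fingerprints in segments of m.

    index:     representative fingerprint -> container ID (kept in memory)
    disk:      container ID -> set of fingerprints (stands in for the disk)
    cache:     container ID -> set of fingerprints (fingerprint container cache)
    hit_table: set of container IDs currently held in the cache
    Returns the fingerprints matched in a container, in processing order;
    repeats inside a segment are dropped before matching and not reported.
    """
    duplicates = []
    for start in range(0, len(fingerprints), m):
        segment = fingerprints[start:start + m]
        rep = min(segment)                      # step (4): representative fingerprint
        unique = list(dict.fromkeys(segment))   # remove repeats inside the segment
        if rep in index:                        # step (5): known representative
            cid = index[rep]
            if cid not in hit_table:            # cache miss: read container from disk
                cache[cid] = set(disk[cid])
                hit_table.add(cid)
            container = cache[cid]
            for fp in unique:                   # step (6): match against the container
                if fp in container:
                    duplicates.append(fp)
                else:
                    container.add(fp)
        else:                                   # step (8): open a new container
            cid = max(index.values(), default=-1) + 1
            cache[cid] = set(unique)
            hit_table.add(cid)
            index[rep] = cid
    return duplicates


index, disk, cache, hit_table = {}, {}, {}, set()
first = detect_duplicates([5, 3, 9, 3, 20, 21, 22, 23], 4, index, disk, cache, hit_table)

# Hypothetical write-back: flush every cached container to "disk", empty the cache.
disk.update(cache)
cache.clear()
hit_table.clear()

second = detect_duplicates([3, 9, 7, 5, 30, 31, 20, 21], 4, index, disk, cache, hit_table)
print(first)   # []
print(second)  # [3, 9, 5, 20, 21]
```

In the second batch both segments find their containers through the representative fingerprints 3 and 20, so each segment costs one index lookup and at most one container read, regardless of how many containers exist.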
The technical effect of the invention is as follows. The invention is aimed mainly at data sets with strong similarity; by exploiting the similarity within the data set, it narrows the detection range for duplicate data and raises deduplication throughput. Specifically, for possible duplicate data in the data set, the invention first divides the fingerprints of the fingerprint list into segments and, following similarity theory, selects a representative fingerprint within each segment. The selected representative fingerprint is then looked up in the representative fingerprint index table, which quickly maps it to the similar container to be searched. Finally, duplicate data is found quickly by checking for duplicates within that similar container, which improves duplicate data detection performance.
Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and do not limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (8)
1. a kind of duplicate data detection method based on cluster, it is characterised in that comprise the following steps:
(1) fingerprint list file is obtained from disk, judges whether that partial fingerprints can be got from the fingerprint list file,
Terminate if the process less than if that obtains, otherwise by the partial fingerprints got, it is stored in fingerprint input-buffer space, will be referred to
All fingerprint N in line input-buffer space are segmented, and a fingerprint section is formed per M fingerprint, and wherein N is all fingerprints
Quantity, M are random natural number;
(2) counter i=1 is set;
(3) judge whether i is more than N/M, if greater than then return to step (1), otherwise into step (4);
(4) take out i-th of fingerprint section in the multiple fingerprint sections obtained from step (1), and obtain in i-th of fingerprint section fingerprint value most
Small fingerprint is used as and represents fingerprint, and judges that this represents whether fingerprint is located in the representative fingerprint index table in internal memory, if it is,
Then enter step (5), otherwise into step (8);
(5) this is taken out in fingerprint index table represent fingerprint Container ID corresponding to fingerprint from representing, and by searching internal memory hit table
To judge that the fingerprint container corresponding to the fingerprint Container ID whether there is in fingerprint container caching, if yes then enter step
(6), otherwise fingerprint container corresponding to the fingerprint Container ID is read into fingerprint container caching from disk, is then transferred to step
(6);
(6) fingerprint repeated in fingerprint section where fingerprint will be represented to reject, and by each fingerprint in the fingerprint section after rejecting by
One is matched with all fingerprints in the fingerprint container corresponding to fingerprint Container ID, will if the result of matching is repetition
The Finger-print labelling method is repeats fingerprint, if matching result inserts the fingerprint in the fingerprint container not repeat;
(7) counter i=i+1, and return to step (3) are set.
(8) creating a new fingerprint container in the fingerprint container cache, removing the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, inserting all fingerprints of the resulting section into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID as a key-value pair into the representative fingerprint index table, and inserting the new fingerprint container ID into the memory hit table;
(9) setting the counter i = i + 1 and returning to step (3).
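As an illustration only, steps (1) to (9) can be sketched in Python. The names `detect_duplicates`, `rep_index`, `hit_table`, `container_cache`, and `disk` are hypothetical and do not come from the patent. The sketch assumes that fingerprints are already integers, that a single batch is held in memory (so the return to step (1) is omitted), that disk is modeled by a dictionary, and that fingerprints repeated inside a section are also reported as duplicates.

```python
from typing import Dict, List, Set


def detect_duplicates(fingerprints: List[int], m: int) -> List[int]:
    """Return the fingerprints marked as duplicates, in processing order."""
    rep_index: Dict[int, int] = {}             # representative fingerprint -> container ID
    hit_table: Set[int] = set()                # IDs of containers currently cached
    container_cache: Dict[int, Set[int]] = {}  # container ID -> fingerprints in memory
    disk: Dict[int, Set[int]] = {}             # stand-in for containers stored on disk
    duplicates: List[int] = []

    # Step (1): divide the N buffered fingerprints into sections of M.
    sections = [fingerprints[k:k + m] for k in range(0, len(fingerprints), m)]

    # Steps (2), (3), (7), (9): visit the sections in order.
    for section in sections:
        # Steps (6)/(8): remove fingerprints repeated inside the section.
        # Assumption: the removed repeats are reported as duplicates.
        seen: Set[int] = set()
        unique: List[int] = []
        for fp in section:
            if fp in seen:
                duplicates.append(fp)
            else:
                seen.add(fp)
                unique.append(fp)

        # Step (4): the smallest fingerprint value represents the section.
        rep = min(unique)
        if rep in rep_index:
            # Step (5): locate the container; load it from disk if not cached.
            cid = rep_index[rep]
            if cid not in hit_table:
                # Not reached here because this sketch never evicts containers.
                container_cache[cid] = disk[cid]
                hit_table.add(cid)
            container = container_cache[cid]
            # Step (6): match each fingerprint against the container.
            for fp in unique:
                if fp in container:
                    duplicates.append(fp)
                else:
                    container.add(fp)
        else:
            # Step (8): open a new container and index it by the representative.
            cid = len(rep_index)
            container_cache[cid] = set(unique)
            rep_index[rep] = cid
            hit_table.add(cid)
    return duplicates
```

For example, `detect_duplicates([5, 3, 9, 3, 7, 8], 3)` splits the input into `[5, 3, 9]` and `[3, 7, 8]`. Both sections share the representative fingerprint 3, so the second section is matched against the first container: 3 is reported as a duplicate, and 7 and 8 are added to that container.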
2. The duplicate data detection method according to claim 1, characterized in that it further comprises, before step (1), a step of setting up in memory an empty fingerprint input buffer, an empty fingerprint container cache, an empty memory hit table, and a representative fingerprint index table, wherein the fingerprint input buffer is used to store a portion of the fingerprints in memory, the fingerprint container cache is used to cache some of the fingerprint containers in memory, the memory hit table is used to determine whether a given fingerprint container is already cached in memory, and the representative fingerprint index table is used to store the representative fingerprints in memory and to provide an index for them.
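The four in-memory structures named in claim 2 could, for instance, be grouped as follows. This is a minimal sketch; the class name `DedupState`, its field names, and the choice of Python lists, dictionaries, and sets are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DedupState:
    """Hypothetical grouping of the structures set up before step (1)."""
    # Fingerprint input buffer: the portion of fingerprints read from disk.
    input_buffer: List[int] = field(default_factory=list)
    # Fingerprint container cache: container ID -> fingerprints held in memory.
    container_cache: Dict[int, Set[int]] = field(default_factory=dict)
    # Memory hit table: IDs of the containers currently cached.
    hit_table: Set[int] = field(default_factory=set)
    # Representative fingerprint index table: representative -> container ID.
    rep_index: Dict[int, int] = field(default_factory=dict)

    def is_cached(self, container_id: int) -> bool:
        """Answer the hit-table question without touching the cache itself."""
        return container_id in self.hit_table
```

Calling `DedupState()` yields all four structures empty, which corresponds to the initial state the claim describes.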
3. The duplicate data detection method according to claim 1, characterized in that, when N is not evenly divisible by M, the remaining fewer than M fingerprints form one fingerprint section.
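The remainder rule of claim 3 falls out naturally from slicing, as the following sketch suggests. The function name `split_into_sections` is hypothetical.

```python
from typing import List, Sequence


def split_into_sections(fingerprints: Sequence[int], m: int) -> List[List[int]]:
    """Split N fingerprints into sections of M; a shorter last section holds the remainder."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    return [list(fingerprints[k:k + m]) for k in range(0, len(fingerprints), m)]
```

With N = 7 and M = 3, `split_into_sections(list(range(7)), 3)` returns `[[0, 1, 2], [3, 4, 5], [6]]`, so the section count in this sketch is the ceiling of N/M rather than N/M itself.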
4. The duplicate data detection method according to claim 2, characterized in that the size of the portion of fingerprints read in step (1) equals the size of the fingerprint input buffer, which is greater than 0% and less than 80% of the memory size.
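One way to turn the bound in claim 4 into a fingerprint count is sketched below. The function name `input_buffer_capacity` and its parameters are hypothetical, and the fixed per-fingerprint size is an assumption; the patent text here does not state how large a fingerprint is.

```python
def input_buffer_capacity(memory_bytes: int, fraction: float, fingerprint_bytes: int) -> int:
    """Number of fingerprints that fit in a buffer sized as a fraction of memory."""
    # Claim 4 bounds the buffer strictly between 0% and 80% of memory.
    if not 0.0 < fraction < 0.8:
        raise ValueError("fraction must be greater than 0 and less than 0.8")
    if fingerprint_bytes <= 0:
        raise ValueError("fingerprint_bytes must be positive")
    return int(memory_bytes * fraction) // fingerprint_bytes
```

For instance, with 1000 bytes of memory, a fraction of 0.5, and an assumed 20-byte fingerprint, the buffer holds 25 fingerprints per read.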
5. The duplicate data detection method according to any one of claims 1 to 4, characterized in that the fingerprint section size M is between 32 and 8192.
6. The duplicate data detection method according to any one of claims 1 to 5, characterized in that the fingerprint value of a fingerprint is obtained by converting the fingerprint from a string type to a numeric type.
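A possible reading of the conversion in claim 6 is sketched below. It assumes that fingerprints arrive as hexadecimal digest strings, which the claim does not specify; the function names `fingerprint_value` and `representative` are hypothetical.

```python
from typing import Sequence


def fingerprint_value(fingerprint: str) -> int:
    """Convert a string fingerprint to a number (assumes hexadecimal text)."""
    return int(fingerprint, 16)


def representative(section: Sequence[str]) -> str:
    """Pick the fingerprint with the smallest numeric value in a section."""
    return min(section, key=fingerprint_value)
```

Comparing numeric values rather than raw strings matters when the strings differ in length: `representative(["9", "10"])` returns `"9"`, whereas a plain string comparison would pick `"10"`.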
7. The duplicate data detection method according to claim 1, characterized in that, in step (6), when the number of fingerprints reaches the upper limit of the fingerprint container's capacity, the fingerprint container accepts no further fingerprints and is written back to disk.
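The capacity rule of claim 7 might look like the following sketch. The class `FingerprintContainer`, the `sealed` flag, and the JSON file layout are all hypothetical choices for illustration; the patent does not describe the on-disk format.

```python
import json
from pathlib import Path
from typing import Set


class FingerprintContainer:
    """Hypothetical container that seals itself and is written to disk when full."""

    def __init__(self, container_id: int, capacity: int, directory: Path) -> None:
        self.container_id = container_id
        self.capacity = capacity
        self.directory = directory
        self.fingerprints: Set[int] = set()
        self.sealed = False

    def insert(self, fingerprint: int) -> bool:
        """Add a fingerprint; return False if the container no longer accepts any."""
        if self.sealed:
            return False
        self.fingerprints.add(fingerprint)
        if len(self.fingerprints) >= self.capacity:
            self.flush()
        return True

    def flush(self) -> None:
        """Write the container back to disk and stop accepting fingerprints."""
        path = self.directory / f"container_{self.container_id}.json"
        path.write_text(json.dumps(sorted(self.fingerprints)))
        self.sealed = True
```

In this sketch a caller that receives `False` from `insert` would have to place the fingerprint elsewhere, for example in a newly created container; how the patented method handles that case is not stated in the claim.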
8. A clustering-based duplicate data detection system, characterized in that it comprises:
a first module, configured to obtain a fingerprint list file from disk and determine whether a portion of the fingerprints can be read from the fingerprint list file; if none can be read, the process ends; otherwise, the fingerprints read are stored in the fingerprint input buffer, and all N fingerprints in the fingerprint input buffer are divided into sections, every M fingerprints forming one fingerprint section, where N is the total number of fingerprints and M is an arbitrary natural number;
a second module, configured to set a counter i = 1;
a third module, configured to determine whether i is greater than N/M; if so, control returns to the first module; otherwise it passes to the fourth module;
a fourth module, configured to take the i-th fingerprint section from the fingerprint sections obtained by the first module, select the fingerprint with the smallest fingerprint value in the i-th fingerprint section as the representative fingerprint, and determine whether the representative fingerprint is present in the representative fingerprint index table in memory; if so, control passes to the fifth module; otherwise it passes to the eighth module;
a fifth module, configured to retrieve from the representative fingerprint index table the fingerprint container ID corresponding to the representative fingerprint, and to look up the memory hit table to determine whether the fingerprint container corresponding to that fingerprint container ID is present in the fingerprint container cache; if so, control passes to the sixth module; otherwise the fingerprint container corresponding to that fingerprint container ID is read from disk into the fingerprint container cache, and control then passes to the sixth module;
a sixth module, configured to remove the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, and to match each fingerprint in the resulting section, one by one, against all fingerprints in the fingerprint container corresponding to the fingerprint container ID; if a match is found, the fingerprint is marked as a duplicate fingerprint; if no match is found, the fingerprint is inserted into the fingerprint container;
a seventh module, configured to set the counter i = i + 1 and return to the third module;
an eighth module, configured to create a new fingerprint container in the fingerprint container cache, remove the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, insert all fingerprints of the resulting section into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID as a key-value pair into the representative fingerprint index table, and insert the new fingerprint container ID into the memory hit table;
a ninth module, configured to set the counter i = i + 1 and return to the third module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710747552.2A CN107515931B (en) | 2017-08-28 | 2017-08-28 | Repeated data detection method based on clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710747552.2A CN107515931B (en) | 2017-08-28 | 2017-08-28 | Repeated data detection method based on clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107515931A (en) | 2017-12-26 |
CN107515931B (en) | 2023-04-25 |
Family
ID=60724325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710747552.2A Active CN107515931B (en) | 2017-08-28 | 2017-08-28 | Repeated data detection method based on clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107515931B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109445702A (en) * | 2018-10-26 | 2019-03-08 | 黄淮学院 | A block-level data deduplication storage system |
CN109783523A (en) * | 2019-01-24 | 2019-05-21 | 广州虎牙信息科技有限公司 | A kind of data processing method, device, equipment and storage medium |
CN112100318A (en) * | 2020-11-12 | 2020-12-18 | 北京智慧星光信息技术有限公司 | Multi-dimensional information merging method, device, equipment and storage medium |
CN112329717A (en) * | 2020-11-25 | 2021-02-05 | 中国人民解放军国防科技大学 | Fingerprint cache method for similarity detection of mass data |
CN115827619A (en) * | 2023-01-06 | 2023-03-21 | 山东捷瑞数字科技股份有限公司 | Repeated data detection method, device and equipment based on three-dimensional engine |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1940995A (en) * | 2005-09-29 | 2007-04-04 | 中国科学院自动化研究所 | Method for compressing fingerprint direction quantized diagram to embedded system |
CN101681381A (en) * | 2007-06-06 | 2010-03-24 | 杜比实验室特许公司 | Improving audio/video fingerprint search accuracy using multiple search combining |
CN102663086A (en) * | 2012-04-09 | 2012-09-12 | 华中科技大学 | Method for retrieving data block indexes |
CN103699567A (en) * | 2013-11-04 | 2014-04-02 | 北京中搜网络技术股份有限公司 | Method for realizing same news clustering based on title fingerprint and text fingerprint |
US8930648B1 (en) * | 2012-05-23 | 2015-01-06 | Netapp, Inc. | Distributed deduplication using global chunk data structure and epochs |
CN105493080A (en) * | 2013-12-23 | 2016-04-13 | 华为技术有限公司 | Method and apparatus for context aware based data de-duplication |
CN105989033A (en) * | 2015-02-03 | 2016-10-05 | 北京中搜网络技术股份有限公司 | Information duplication eliminating method based on information fingerprints |
Non-Patent Citations (3)
Title |
---|
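XINYE LI et al.: "Clustering Web Retrieval Results Accompanied by Removing Duplicate Documents" * |
张攀峰: "Research on Duplicate Data Detection Techniques in Data Deduplication" * |
殷波 et al.: "An Improved STC Algorithm Based on Repeated Strings" * |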
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109445702A (en) * | 2018-10-26 | 2019-03-08 | 黄淮学院 | A block-level data deduplication storage system |
CN109445702B (en) * | 2018-10-26 | 2019-12-06 | 黄淮学院 | block-level data deduplication storage system |
CN109783523A (en) * | 2019-01-24 | 2019-05-21 | 广州虎牙信息科技有限公司 | A kind of data processing method, device, equipment and storage medium |
CN112100318A (en) * | 2020-11-12 | 2020-12-18 | 北京智慧星光信息技术有限公司 | Multi-dimensional information merging method, device, equipment and storage medium |
CN112329717A (en) * | 2020-11-25 | 2021-02-05 | 中国人民解放军国防科技大学 | Fingerprint cache method for similarity detection of mass data |
CN115827619A (en) * | 2023-01-06 | 2023-03-21 | 山东捷瑞数字科技股份有限公司 | Repeated data detection method, device and equipment based on three-dimensional engine |
CN115827619B (en) * | 2023-01-06 | 2023-05-09 | 山东捷瑞数字科技股份有限公司 | Method, device and equipment for detecting repeated data based on three-dimensional engine |
Also Published As
Publication number | Publication date |
---|---|
CN107515931B (en) | 2023-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107515931A (en) | A kind of duplicate data detection method based on cluster | |
CN107391034B (en) | A kind of repeated data detection method based on local optimization | |
CN102831222B (en) | Differential compression method based on data de-duplication | |
US9529912B2 (en) | Metadata querying method and apparatus | |
US10346257B2 (en) | Method and device for deduplicating web page | |
CN103488709B (en) | A kind of index establishing method and system, search method and system | |
CN100452055C (en) | Large-scale and multi-key word matching method for text or network content analysis | |
CN103597450B (en) | Memory with the metadata being stored in a part for storage page | |
CN108959563B (en) | Capacity expandable block chain query method and system | |
Xie et al. | Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb | |
CN107491487A (en) | A kind of full-text database framework and bitmap index establishment, data query method, server and medium | |
CN111382327A (en) | Character string matching device and method | |
CN103207889A (en) | Method for retrieving massive face images based on Hadoop | |
Tang et al. | Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce. | |
Wang et al. | Fast and adaptive indexing of multi-dimensional observational data | |
CN104021179B (en) | The Fast Recognition Algorithm of similarity data under a kind of large data sets | |
Romberg et al. | Bundle min-Hashing: Speeded-up object retrieval | |
Wang et al. | PLSM: a highly efficient LSM-tree index supporting real-time big data analysis | |
CN113760190A (en) | Small file merging system and method based on Ceph storage | |
Nie et al. | Efficient storage support for real-time near-duplicate video retrieval | |
CN116361796A (en) | Industrial control malicious code detection method based on content partitioning | |
Chen et al. | Efficient similarity search in nonmetric spaces with local constant embedding | |
CN106599326B (en) | Recorded data duplication eliminating processing method and system under cloud architecture | |
CN109213760A (en) | The storage of high load business and search method of non-relation data storage | |
Karim et al. | An efficient approach to mining maximal contiguous frequent patterns from large DNA sequence databases |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||