CN107515931A - Cluster-based duplicate data detection method - Google Patents


Publication number
CN107515931A
Authority
CN
China
Prior art keywords
fingerprint
container
section
module
internal memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710747552.2A
Other languages
Chinese (zh)
Other versions
CN107515931B (en
Inventor
周可
王桦
张攀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201710747552.2A
Publication of CN107515931A
Application granted
Publication of CN107515931B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Collating Specific Patterns (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cluster-based duplicate data detection method. It targets data sets with strong data similarity and exploits that similarity to improve the performance of duplicate data detection and, with it, the performance of data deduplication. Specifically, to find likely duplicates in a data set, the invention applies a similarity-based merging strategy: the fingerprint list to be detected is first divided into segments, a representative fingerprint is selected for each segment, and segments are classified and merged into different fingerprint containers according to their representative fingerprints. A fingerprint container collects the duplicate fingerprints from similar segments of the data set, which increases the efficiency of deduplication and improves its performance. Fingerprint containers are stored on disk, and each one is written to and read from disk as a whole, which improves fingerprint retrieval efficiency and overcomes the fragmented storage of similar segments.

Description

Cluster-based duplicate data detection method
Technical field
The invention belongs to the technical field of computer storage and, more particularly, relates to a cluster-based duplicate data detection method and system.
Background
With the rapid development of information technology, information has become a precious resource on which we depend and the greatest driving force behind the rapid growth of productivity. The wide application of information technology is accompanied by the generation of massive amounts of data, and more and more valuable data needs to be stored. How to effectively improve the storage efficiency of existing storage media and meet the ever-growing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, an IDC survey report shows that about 75% of existing data is redundant, i.e. only 25% of the data is unique. Against this background, data deduplication, a technique for detecting and eliminating redundancy over a large space, has become a research hotspot in academia and industry in recent years and is increasingly widely applied in all kinds of information storage systems.
Duplicate fingerprint detection is an important means of implementing data deduplication. In existing deduplication techniques, duplicate data is detected mainly by fingerprint detection: the fingerprint (hash value) of each data block is extracted, and whether a data block is a duplicate is determined by checking whether its fingerprint has been seen before. Current duplicate fingerprint detection methods typically use a single data structure, such as a hash table or a B-tree, to identify duplicate fingerprint segments.
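The conventional approach described above can be sketched as follows. This is a minimal illustration, not the patent's method: the function names are hypothetical, and SHA-1 is assumed as the hash function only because it produces 20-byte fingerprints.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    """Return the fingerprint (hash value) of a data block as a hex string."""
    # SHA-1 is an assumption; the text only says "hash value".
    return hashlib.sha1(block).hexdigest()

def detect_duplicates(blocks):
    """Flag each block as duplicate (True) or unique (False) using one hash table."""
    seen = set()   # the single in-memory fingerprint index
    flags = []
    for block in blocks:
        fp = fingerprint(block)
        flags.append(fp in seen)
        seen.add(fp)
    return flags

flags = detect_duplicates([b"alpha", b"beta", b"alpha"])
```

Every fingerprint ever seen lives in the one table, which is why this design struggles once the data set outgrows memory.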
However, a problem with the above duplicate fingerprint detection methods that cannot be ignored is their relatively low detection performance: they cannot detect duplicate data effectively on large data sets, which degrades the overall efficiency of data deduplication.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a cluster-based duplicate data detection method and system. Its purpose is to solve the technical problem that existing fingerprint-based duplicate data detection methods have low detection performance and cannot detect duplicate data effectively on large data sets.
To achieve the above object, according to one aspect of the invention, a cluster-based duplicate data detection method is provided, comprising the following steps:
(1) obtain the fingerprint list file from disk and determine whether a batch of fingerprints can be read from it; if none can be read, the process ends; otherwise store the batch in the fingerprint input buffer and divide all N fingerprints in the buffer into segments, every M fingerprints forming one fingerprint segment, where N is the number of fingerprints in the buffer and M is an arbitrary natural number;
(2) set the counter i=1;
(3) determine whether i is greater than N/M; if so, return to step (1), otherwise go to step (4);
(4) take the i-th fingerprint segment from the fingerprint segments obtained in step (1), take the fingerprint with the smallest fingerprint value in that segment as the representative fingerprint, and determine whether the representative fingerprint is present in the representative fingerprint index table in memory; if so, go to step (5), otherwise go to step (8);
(5) retrieve from the representative fingerprint index table the fingerprint container ID corresponding to the representative fingerprint, and look up the memory hit table to determine whether the fingerprint container with that ID is present in the fingerprint container cache; if so, go to step (6); otherwise read the fingerprint container with that ID from disk into the fingerprint container cache and then go to step (6);
(6) remove the duplicate fingerprints within the fingerprint segment that contains the representative fingerprint, then match each fingerprint of the resulting segment, one by one, against all fingerprints in the fingerprint container with that ID; if a match is found, mark the fingerprint as a duplicate fingerprint; otherwise insert the fingerprint into the fingerprint container;
(7) set the counter i=i+1 and return to step (3);
(8) create a new fingerprint container in the fingerprint container cache, remove the duplicate fingerprints within the fingerprint segment that contains the representative fingerprint, insert all fingerprints of the resulting segment into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
(9) set the counter i=i+1 and return to step (3).
Preferably, before step (1) the method further comprises the step of setting up in memory an empty fingerprint input buffer, an empty fingerprint container cache, an empty memory hit table and a representative fingerprint index table, wherein the fingerprint input buffer stores a batch of fingerprints in memory, the fingerprint container cache caches some of the fingerprint containers in memory, the memory hit table is used to determine whether a given fingerprint container is already cached in memory, and the representative fingerprint index table stores the representative fingerprints in memory and provides an index over them.
Preferably, when N is not divisible by M, the remaining fewer-than-M fingerprints form one fingerprint segment.
Preferably, the size of a batch of fingerprints equals the size of the fingerprint input buffer, which is greater than 0% and less than 80% of the memory size.
Preferably, the fingerprint segment size M is 64 to 128.
Preferably, the fingerprint value of a fingerprint is obtained by converting the fingerprint from a string type to a numeric type.
Preferably, in step (6), when the number of fingerprints reaches the capacity limit of the fingerprint container, the container no longer accepts new fingerprints and is written back to disk.
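The segmentation and representative-fingerprint selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and fingerprints are assumed to be hexadecimal strings so that the string-to-number conversion is `int(fp, 16)`.

```python
def split_into_segments(fingerprints, m):
    """Split a fingerprint list into segments of m; the last one may be shorter."""
    if m < 1:
        raise ValueError("segment size m must be a natural number")
    return [fingerprints[i:i + m] for i in range(0, len(fingerprints), m)]

def fingerprint_value(fp: str) -> int:
    """Convert a string fingerprint to its numeric value (hex encoding assumed)."""
    return int(fp, 16)

def representative(segment):
    """Return the fingerprint with the smallest numeric value in the segment."""
    return min(segment, key=fingerprint_value)

segments = split_into_segments(["0a", "ff", "03", "10", "02"], 2)
reps = [representative(s) for s in segments]
```

Comparing numeric values rather than raw strings matters when encodings differ in length: as strings, "10" sorts before "9", but numerically 0x9 is smaller than 0x10.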
According to another aspect of the invention, a cluster-based duplicate data detection system is provided, comprising:
a first module, for obtaining the fingerprint list file from disk and determining whether a batch of fingerprints can be read from it; if none can be read, the process ends; otherwise the batch is stored in the fingerprint input buffer and all N fingerprints in the buffer are divided into segments, every M fingerprints forming one fingerprint segment, where N is the number of fingerprints in the buffer and M is an arbitrary natural number;
a second module, for setting the counter i=1;
a third module, for determining whether i is greater than N/M; if so, control returns to the first module, otherwise it passes to the fourth module;
a fourth module, for taking the i-th fingerprint segment from the fingerprint segments obtained by the first module, taking the fingerprint with the smallest fingerprint value in that segment as the representative fingerprint, and determining whether the representative fingerprint is present in the representative fingerprint index table in memory; if so, control passes to the fifth module, otherwise to the eighth module;
a fifth module, for retrieving from the representative fingerprint index table the fingerprint container ID corresponding to the representative fingerprint, and looking up the memory hit table to determine whether the fingerprint container with that ID is present in the fingerprint container cache; if so, control passes to the sixth module; otherwise the fingerprint container with that ID is read from disk into the fingerprint container cache and control then passes to the sixth module;
a sixth module, for removing the duplicate fingerprints within the fingerprint segment that contains the representative fingerprint and matching each fingerprint of the resulting segment, one by one, against all fingerprints in the fingerprint container with that ID; if a match is found, the fingerprint is marked as a duplicate fingerprint; otherwise the fingerprint is inserted into the fingerprint container;
a seventh module, for setting the counter i=i+1 and returning to the third module;
an eighth module, for creating a new fingerprint container in the fingerprint container cache, removing the duplicate fingerprints within the fingerprint segment that contains the representative fingerprint, inserting all fingerprints of the resulting segment into the new fingerprint container, inserting the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and inserting the new fingerprint container ID into the memory hit table;
a ninth module, for setting the counter i=i+1 and returning to the third module.
In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:
(1) by processing fingerprints in segments, the invention effectively narrows the range searched for duplicate fingerprints, which improves the performance of duplicate fingerprint retrieval;
(2) because it effectively narrows the range searched for duplicate fingerprints, the invention is especially suitable for duplicate fingerprint retrieval in large data sets;
(3) for highly redundant data sets, the invention provides results close to those of exact detection;
(4) because the fingerprint segmentation process of the invention uses clustering, a container of similar fingerprints can be read into memory and processed in one pass, which avoids the drawback of existing methods in which fingerprints are stored at multiple disk locations and must be read in several operations.
Brief description of the drawings
Fig. 1 is the logical structure diagram of the invention.
Fig. 2 is a schematic diagram of the basic principle of the similarity algorithm of the invention.
Fig. 3 is a schematic diagram of similar-segment merging.
Fig. 4 shows the representative fingerprint index table.
Fig. 5 is a flowchart of the cluster-based duplicate data detection method of the invention.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
The invention provides an efficient cluster-based duplicate data detection method. The method mainly targets data sets with strong similarity: using the similarity principle and clustering, similar data in the data set is gathered and stored together, which solves the low detection efficiency of existing duplicate data detection methods and accommodates ever-growing storage demand.
The basic idea of the invention is to divide the fingerprint list to be detected into segments and extract a representative fingerprint from each, to identify the similar fingerprint container by looking up the representative fingerprint, and then to identify duplicate data by searching that fingerprint container. This effectively narrows the range of fingerprint detection and greatly improves the performance of duplicate fingerprint detection.
The basic logical structure of the invention is shown in Fig. 1. It consists of four main parts: the representative fingerprint index table, the memory hit table, the fingerprint container cache, and the fingerprint containers on disk.
To describe the invention clearly, the terms used in this specification are explained below:
Similarity algorithm: the theory of data similarity was used to search for identical text blocks across different documents; see Broder, A.Z., On the resemblance and containment of documents, in Compression and Complexity of Sequences. 1997, IEEE. p. 21-29. Broder's similarity algorithm yields the following theorem:
Theorem 1: Consider two sets S1 and S2, and let H(S1) and H(S2) be the sets of hash fingerprints of the elements of S1 and S2 respectively. Let min(H(S)) denote the minimum value in H(S). Then:
$$\Pr[\min(H(S_1)) = \min(H(S_2))] = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$$
The theorem states that when the minimum hash value of the elements of one set equals the minimum hash value of the elements of another set, the two sets share a certain number of elements with high probability. In a data deduplication system, this means that if the minimum data block fingerprints of two different sets of data blocks are identical, the two sets share a large number of data blocks with high probability. For convenience, the minimum fingerprint in a set is called the representative fingerprint.
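Broder's result can be checked exactly on small sets. The sketch below is an illustration, not part of the patent: it treats every ordering of the union as one equally likely hash function and counts how often the two minima coincide. The function names are hypothetical, and both sets are assumed to be non-empty.

```python
from fractions import Fraction
from itertools import permutations

def min_match_probability(s1, s2):
    """Exact probability that min(H(S1)) == min(H(S2)) over all orderings of the union."""
    universe = sorted(s1 | s2)
    matches = total = 0
    for perm in permutations(universe):
        # Treat each permutation as a hash function: element -> rank.
        h = {elem: rank for rank, elem in enumerate(perm)}
        if min(h[x] for x in s1) == min(h[x] for x in s2):
            matches += 1
        total += 1
    return Fraction(matches, total)

def jaccard(s1, s2):
    """Resemblance |S1 ∩ S2| / |S1 ∪ S2| as an exact fraction."""
    return Fraction(len(s1 & s2), len(s1 | s2))

s1, s2 = {"a", "b", "c"}, {"b", "c", "d"}
probability = min_match_probability(s1, s2)   # Fraction(1, 2)
```

The minima coincide exactly when the lowest-ranked element of the union lies in the intersection, which is why the probability equals the resemblance of the two sets.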
In a data deduplication system, the directories of the data set to be deduplicated are first scanned recursively to form a file list. Each file in the list is cut into data blocks by a chunking algorithm. A hash is computed for each data block, and the hashes form the fingerprint list. Fig. 2 gives an example of how the fingerprint list is obtained and how the data similarity theory is applied.
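The pipeline above (recursive scan, chunking, hashing) can be sketched as follows. The text does not name a chunking algorithm or hash function, so fixed-size chunking and SHA-1 are assumptions here, as are the function names and the 4096-byte default block size.

```python
import hashlib
import os

def chunk_file(path, block_size=4096):
    """Yield fixed-size data blocks of a file (fixed-size chunking is an assumption)."""
    with open(path, "rb") as f:
        while block := f.read(block_size):
            yield block

def build_fingerprint_list(root, block_size=4096):
    """Recursively scan root and return block fingerprints in processing order."""
    fingerprints = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()   # deterministic traversal order
        for name in sorted(filenames):
            for block in chunk_file(os.path.join(dirpath, name), block_size):
                fingerprints.append(hashlib.sha1(block).hexdigest())
    return fingerprints
```

Calling `build_fingerprint_list("/path/to/dataset")` on a directory of your choosing returns one hex fingerprint per block, in the order the blocks were processed.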
According to the data similarity theory, the fingerprint list is divided into multiple fingerprint subsets. Here a fingerprint subset is defined as a segment, and, through the fingerprint index table, each subset is stored in its similar fingerprint container.
Similar merging: a feature of the invention is its segment merging strategy, i.e. similar segments are merged into one data container, which greatly speeds up the search for similar segments.
In the invention, the detection system classifies the similar segments in the fingerprint list by their representative fingerprints and merges them into different data containers. A representative fingerprint index table is built to map each representative fingerprint to the address of a fingerprint container, and each representative fingerprint corresponds to one fingerprint container. With this design, the detection system can quickly determine the location of the fingerprint container used for duplicate comparison, without searching a large space for scattered similar segments. This greatly narrows the search range and speeds up data deduplication.
Fig. 3 illustrates how similar-segment merging works. In the figure, assume the fingerprint list contains three segments: segment (a) {e, f, g, n, c, w}, segment (b) {f, n, w, t, m, e} and segment (c) {t, m, e, w, c, j, h}, where each character denotes one data block fingerprint. The three segments in the example are similar segments because they have the same representative fingerprint 'e'. RMD merges them and stores them in one place, i.e. one fingerprint container. When a new data segment with the same representative fingerprint 'e' arrives, the detection system compares all fingerprints of that segment with the fingerprints in the similar container and, because of the similarity of the data, is very likely to identify most duplicate data blocks. In short, merging similar segments reduces the number of disk I/Os caused by fingerprint detection while improving the accuracy of locating duplicate data objects.
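The Fig. 3 example can be replayed in a few lines. This is a sketch under a loud assumption: the text gives only symbolic fingerprints, so the numeric values in `VALUE` are hypothetical and chosen so that 'e' is the smallest, matching the statement that all three segments share the representative fingerprint 'e'. The function names are also illustrative.

```python
# Hypothetical numeric fingerprint values; 'e' is deliberately the smallest.
VALUE = {ch: rank for rank, ch in enumerate("efgncwtmjh")}

def representative(segment):
    """Return the fingerprint with the smallest (hypothetical) value."""
    return min(segment, key=VALUE.__getitem__)

def merge_similar(segments):
    """Group segments by representative fingerprint; one container per representative."""
    containers = {}
    for seg in segments:
        container = containers.setdefault(representative(seg), [])
        for fp in seg:
            if fp not in container:   # containers keep distinct fingerprints only
                container.append(fp)
    return containers

seg_a, seg_b, seg_c = list("efgncw"), list("fnwtme"), list("tmewcjh")
containers = merge_similar([seg_a, seg_b, seg_c])

# A new segment with representative 'e' is compared against that one container.
new_segment = list("hewc")
duplicates = [fp for fp in new_segment if fp in containers[representative(new_segment)]]
```

All three segments land in a single container of ten distinct fingerprints, so the new segment is checked with one container lookup rather than three scattered ones.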
Fingerprint list: the set of fingerprints obtained by chunking the data set and extracting a fingerprint from each chunk, arranged in processing order.
Fingerprint input buffer: the fingerprint input buffer caches fingerprints from the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints are read in at once and stored in the fingerprint input buffer. The fingerprints in the buffer are then segmented and a representative fingerprint is selected for each segment.
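Reading the fingerprint list in buffer-sized batches might look like the sketch below. The on-disk layout is an assumption: the file is taken to be a flat sequence of fixed 20-byte binary fingerprints, and `read_batches` and `buffer_capacity` are hypothetical names.

```python
import io

FP_SIZE = 20  # bytes per fingerprint; the flat binary file layout is an assumption

def read_batches(stream, buffer_capacity):
    """Yield lists of fingerprints, at most buffer_capacity fingerprints per batch."""
    while True:
        chunk = stream.read(FP_SIZE * buffer_capacity)
        usable = len(chunk) - len(chunk) % FP_SIZE   # ignore a trailing partial record
        if usable == 0:
            return                                    # nothing left to read: stop
        yield [chunk[i:i + FP_SIZE] for i in range(0, usable, FP_SIZE)]

# Five fingerprints read through a buffer that holds two at a time.
stream = io.BytesIO(bytes(range(100)))
batch_sizes = [len(batch) for batch in read_batches(stream, 2)]
```

Each yielded batch plays the role of the buffer contents that are subsequently segmented; an empty read corresponds to the end of the process.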
Fingerprint container: the fingerprint container is the data structure in which the system stores fingerprints, and it is also the basic unit of fingerprint disk storage and cache scheduling. One fingerprint container stores a variable number of distinct fingerprints (a distinct fingerprint is one whose value differs from that of every other fingerprint).
Representative fingerprint index table: the representative fingerprint index table is a key-value hash table resident in memory. It holds the mapping from each representative fingerprint (RF) to the ID of the fingerprint container in which that representative fingerprint resides. Its purpose is that, when a fingerprint container on disk is looked up, the position in the file where the container is stored can be located quickly. The structure of the representative fingerprint index table is shown in Fig. 4. A fingerprint is 20 bytes long, a fingerprint container ID is 4 bytes long, and a pointer takes 8 bytes. Hash table storage may suffer hash collisions; when a collision occurs, it is resolved by separate chaining.
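A chained hash table of this shape can be sketched as follows. It is an illustration only: the class and method names are hypothetical, the bucket count is arbitrary, and deriving the bucket from the leading 8 bytes of the fingerprint is an assumption, since the text does not specify the bucket hash.

```python
import hashlib

class RFIndexTable:
    """In-memory hash table mapping representative fingerprint -> container ID."""

    ENTRY_BYTES = 20 + 4 + 8  # fingerprint + container ID + chain pointer

    def __init__(self, num_buckets=1024):
        self.buckets = [[] for _ in range(num_buckets)]  # each bucket is a chain

    def _bucket(self, rf: bytes):
        # Bucket choice from the leading 8 bytes of the fingerprint (assumption).
        return self.buckets[int.from_bytes(rf[:8], "big") % len(self.buckets)]

    def insert(self, rf: bytes, container_id: int):
        chain = self._bucket(rf)
        for i, (key, _) in enumerate(chain):
            if key == rf:
                chain[i] = (rf, container_id)   # overwrite an existing mapping
                return
        chain.append((rf, container_id))        # new key or collision: extend the chain

    def lookup(self, rf: bytes):
        for key, container_id in self._bucket(rf):
            if key == rf:
                return container_id
        return None

table = RFIndexTable(num_buckets=1)             # one bucket forces every key to collide
rf_a, rf_b = hashlib.sha1(b"a").digest(), hashlib.sha1(b"b").digest()
table.insert(rf_a, 1)
table.insert(rf_b, 2)
```

With the sizes given in the text, one entry takes 32 bytes, so an index over one million representative fingerprints needs roughly 32 MB before any per-bucket overhead.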
Fingerprint container cache: the fingerprint container cache is a buffer area set aside in memory. It holds new fingerprint containers before they are written to disk and fingerprint containers that have been read from disk into memory.
Memory hit table: used to determine whether the fingerprint container being looked up is in the cache. If it is not, the required fingerprint container is read from disk into the fingerprint container cache.
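The interplay of the memory hit table and the fingerprint container cache can be sketched as follows. The class is hypothetical; a plain dictionary stands in for the disk, and eviction is omitted because the text does not specify a replacement policy.

```python
class ContainerCache:
    """Fingerprint container cache guarded by a memory hit table."""

    def __init__(self, disk):
        self.disk = disk          # container ID -> list of fingerprints (stands in for disk)
        self.cache = {}           # container ID -> set of fingerprints held in memory
        self.hit_table = set()    # IDs of the containers currently cached
        self.disk_reads = 0       # counts simulated disk reads

    def get(self, container_id):
        if container_id not in self.hit_table:   # miss: load the whole container
            self.cache[container_id] = set(self.disk[container_id])
            self.hit_table.add(container_id)
            self.disk_reads += 1
        return self.cache[container_id]

cache = ContainerCache({7: ["aa", "bb"]})
first = cache.get(7)    # miss: one simulated disk read
second = cache.get(7)   # hit: served from memory
```

Because a container is always loaded as a whole, a second segment with the same representative fingerprint costs no further disk access while the container stays cached.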
Implementing the invention requires the parameters of each module of the algorithm to be configured. The ranges of the optimal parameters are given below:
Fingerprint segment size: 32-8192 fingerprints; the optimal range is 64-128;
Fingerprint container size (number of merged fingerprints): 512-4096 fingerprints; the optimal range is 1024-2048.
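These ranges can be captured in a small validated configuration object. The class, field names and default values below are hypothetical; only the numeric ranges come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectorConfig:
    segment_size: int = 128          # fingerprints per segment
    container_capacity: int = 2048   # fingerprints per container

    def __post_init__(self):
        # Reject values outside the permitted ranges.
        if not 32 <= self.segment_size <= 8192:
            raise ValueError("segment_size must be within 32-8192")
        if not 512 <= self.container_capacity <= 4096:
            raise ValueError("container_capacity must be within 512-4096")

    def in_optimal_range(self) -> bool:
        """True when both parameters fall inside the stated optimal ranges."""
        return (64 <= self.segment_size <= 128
                and 1024 <= self.container_capacity <= 2048)

default_config = DetectorConfig()
wide_config = DetectorConfig(segment_size=4096, container_capacity=512)
```

Values outside the permitted ranges raise `ValueError` at construction time, while values that are permitted but outside the optimal range are accepted and can be flagged with `in_optimal_range()`.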
Assuming the duplicate data block index is designed with a capacity of C records, the invention is further described below with reference to Fig. 1, Fig. 5 and an embodiment.
As shown in Fig. 5, the cluster-based duplicate data detection method of the invention comprises the following steps:
(1) obtain the fingerprint list file from disk and determine whether a batch of fingerprints can be read from it (the size of the batch equals the size of the fingerprint input buffer allocated in memory in advance, which is greater than 0% and less than 80% of the memory size); if none can be read, the process ends; otherwise store the batch in the fingerprint input buffer and divide all N fingerprints in the buffer into segments, every M fingerprints forming one fingerprint segment; when N is not divisible by M, the remaining fewer-than-M fingerprints form one fingerprint segment;
The fingerprint segment size M can be any natural number; its preferred value is 64 to 128.
Before the method is executed, an initialization step is also required: an empty fingerprint input buffer, an empty fingerprint container cache, an empty memory hit table and a representative fingerprint index table are set up in memory.
The fingerprint input buffer stores a batch of fingerprints in memory.
The fingerprint container cache caches some of the fingerprint containers in memory.
The memory hit table is used to determine whether a given fingerprint container is already cached in memory.
The representative fingerprint index table stores the representative fingerprints in memory and provides an index over them.
The advantage of step (1) is that setting the fingerprint segment size optimizes the overall retrieval performance for duplicate fingerprints, which overcomes the poor duplicate fingerprint retrieval performance of the prior art, where the file size is used as the logical segment.
(2) set the counter i=1;
(3) determine whether i is greater than N/M; if so, return to step (1), otherwise go to step (4);
(4) take the i-th fingerprint segment from the fingerprint segments obtained in step (1), take the fingerprint with the smallest fingerprint value in that segment as the representative fingerprint (RF for short), and determine whether the representative fingerprint is present in the representative fingerprint index table in memory; if so, go to step (5), otherwise go to step (8);
Specifically, the fingerprint value of a fingerprint is obtained by converting the fingerprint from a string type to a numeric type.
The advantage of this step is that the fingerprint container holding the duplicate fingerprints can be found simply and quickly, which effectively narrows the search range for duplicate fingerprints.
(5) retrieve from the representative fingerprint index table the fingerprint container ID corresponding to the representative fingerprint, and look up the memory hit table to determine whether the fingerprint container with that ID is present in the fingerprint container cache; if so, go to step (6); otherwise read the fingerprint container with that ID from disk into the fingerprint container cache and then go to step (6);
(6) remove the duplicate fingerprints within the fingerprint segment that contains the representative fingerprint, then match each fingerprint of the resulting segment, one by one, against all fingerprints in the fingerprint container with that ID; if a match is found, mark the fingerprint as a duplicate fingerprint; otherwise insert the fingerprint into the fingerprint container; when the number of fingerprints reaches the capacity limit of the fingerprint container, the container no longer accepts new fingerprints and is written back to disk;
(7) set the counter i=i+1 and return to step (3);
(8) create a new fingerprint container in the fingerprint container cache, remove the duplicate fingerprints within the fingerprint segment that contains the representative fingerprint, insert all fingerprints of the resulting segment into the new fingerprint container, insert the representative fingerprint RF and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
(9) set the counter i=i+1 and return to step (3).
The advantage of this step is that the size of the fingerprint containers is controlled effectively, which avoids a decline in duplicate fingerprint retrieval performance.
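Steps (2) to (9) can be condensed into the sketch below for a single batch of fingerprints. It is an illustration under stated assumptions, not the patented implementation: fingerprints are hex strings, plain dictionaries stand in for the index table and for the containers (cache and disk combined), container IDs are simply sequential, and the write-back of a full container is reduced to refusing further inserts. Fingerprints repeated inside one segment are dropped without being reported.

```python
def detect(batch, m, index, containers, capacity=2048):
    """Run steps (2)-(9) on one batch of hex fingerprints; return the duplicates found.

    index:      representative fingerprint -> container ID
    containers: container ID -> list of distinct fingerprints
    """
    duplicates = []
    segments = [batch[i:i + m] for i in range(0, len(batch), m)]
    for segment in segments:                          # loop of steps (2), (3), (7), (9)
        unique = list(dict.fromkeys(segment))         # drop repeats inside the segment
        rf = min(unique, key=lambda fp: int(fp, 16))  # step (4): representative fingerprint
        if rf in index:                               # steps (5)-(6): similar container exists
            container = containers[index[rf]]
            known = set(container)
            for fp in unique:
                if fp in known:
                    duplicates.append(fp)
                elif len(container) < capacity:       # a full container accepts no more
                    container.append(fp)
                    known.add(fp)
        else:                                         # step (8): create a new container
            container_id = len(containers)
            containers[container_id] = unique
            index[rf] = container_id
    return duplicates

index, containers = {}, {}
first = detect(["05", "0a", "01", "0b"], 4, index, containers)
second = detect(["01", "0a", "ff", "0c"], 4, index, containers)
third = detect(["02", "05"], 4, index, containers)
```

The third call shows why detection is near-exact rather than exact: "05" is already stored in container 0, but its segment has a different representative fingerprint ("02"), so it is placed in a new container instead of being reported as a duplicate.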
The technical effect of the invention is as follows. The invention mainly targets data sets with strong similarity and, by exploiting the similarity within the data set, narrows the detection range for duplicate data and raises the throughput of data deduplication. Specifically, to find likely duplicates in a data set, the invention first segments the fingerprints in the fingerprint list and, following the similarity theory, selects a representative fingerprint in each segment. The selected representative fingerprint is then looked up in the representative fingerprint index table, which quickly maps it to the similar container to be searched. Finally, duplicate data is found quickly by checking for duplicates within the similar container, which improves duplicate data detection performance.
Those skilled in the art will readily understand that the foregoing describes only preferred embodiments of the invention and does not limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (8)

1. A clustering-based duplicate data detection method, characterized in that it comprises the following steps:
(1) obtain a fingerprint list file from disk and determine whether a portion of the fingerprints can be read from the fingerprint list file; if none can be read, the process ends; otherwise, store the fingerprints that were read in the fingerprint input buffer space and divide all N fingerprints in the fingerprint input buffer space into sections, every M fingerprints forming one fingerprint section, where N is the total number of fingerprints and M is an arbitrary natural number;
(2) set a counter i = 1;
(3) determine whether i is greater than N/M; if so, return to step (1); otherwise proceed to step (4);
(4) take the i-th fingerprint section from the fingerprint sections obtained in step (1), take the fingerprint with the smallest fingerprint value in the i-th fingerprint section as the representative fingerprint, and determine whether this representative fingerprint is in the representative fingerprint index table in memory; if so, proceed to step (5); otherwise proceed to step (8);
(5) retrieve from the representative fingerprint index table the fingerprint container ID corresponding to the representative fingerprint, and look up the memory hit table to determine whether the fingerprint container corresponding to that fingerprint container ID is present in the fingerprint container cache; if so, proceed to step (6); otherwise read the fingerprint container corresponding to that fingerprint container ID from disk into the fingerprint container cache and then proceed to step (6);
(6) remove the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, and match each fingerprint of the resulting fingerprint section, one by one, against all fingerprints in the fingerprint container corresponding to the fingerprint container ID; if the match result is a repeat, mark the fingerprint as a duplicate fingerprint; if the match result is not a repeat, insert the fingerprint into the fingerprint container;
(7) set the counter i = i + 1 and return to step (3);
(8) build a new fingerprint container in the fingerprint container cache, remove the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, insert all fingerprints of the resulting fingerprint section into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
(9) set the counter i = i + 1 and return to step (3).
2. The duplicate data detection method according to claim 1, characterized in that, before step (1), it further comprises a step of setting up in memory an empty fingerprint input buffer space, an empty fingerprint container cache, an empty memory hit table, and a representative fingerprint index table, wherein the fingerprint input buffer space is used to store a portion of the fingerprints in memory, the fingerprint container cache is used to cache some of the fingerprint containers in memory, the memory hit table is used to determine whether a given fingerprint container has been cached in memory, and the representative fingerprint index table is used to store the representative fingerprints in memory and to provide an index over them.
3. The duplicate data detection method according to claim 1, characterized in that, when N is not divisible by M, the remaining fewer than M fingerprints are grouped into one fingerprint section.
4. The duplicate data detection method according to claim 2, characterized in that the size of the portion of fingerprints equals the size of the fingerprint input buffer space, which is greater than 0% and less than 80% of the memory size.
5. The duplicate data detection method according to any one of claims 1 to 4, characterized in that the fingerprint section size M is 32 to 8192.
6. The duplicate data detection method according to any one of claims 1 to 5, characterized in that the fingerprint value of a fingerprint is obtained by converting the fingerprint from a character string type to a numeric type.
7. The duplicate data detection method according to claim 1, characterized in that, in step (6), when the number of fingerprints reaches the upper limit of the fingerprint container's capacity, the fingerprint container no longer accepts new fingerprints and is written back to disk.
8. A clustering-based duplicate data detection system, characterized in that it comprises:
a first module, configured to obtain a fingerprint list file from disk and determine whether a portion of the fingerprints can be read from the fingerprint list file; if none can be read, the process ends; otherwise, to store the fingerprints that were read in the fingerprint input buffer space and divide all N fingerprints in the fingerprint input buffer space into sections, every M fingerprints forming one fingerprint section, where N is the total number of fingerprints and M is an arbitrary natural number;
a second module, configured to set a counter i = 1;
a third module, configured to determine whether i is greater than N/M; if so, to return to the first module; otherwise to proceed to the fourth module;
a fourth module, configured to take the i-th fingerprint section from the fingerprint sections obtained by the first module, take the fingerprint with the smallest fingerprint value in the i-th fingerprint section as the representative fingerprint, and determine whether this representative fingerprint is in the representative fingerprint index table in memory; if so, to proceed to the fifth module; otherwise to proceed to the eighth module;
a fifth module, configured to retrieve from the representative fingerprint index table the fingerprint container ID corresponding to the representative fingerprint, and to look up the memory hit table to determine whether the fingerprint container corresponding to that fingerprint container ID is present in the fingerprint container cache; if so, to proceed to the sixth module; otherwise to read the fingerprint container corresponding to that fingerprint container ID from disk into the fingerprint container cache and then proceed to the sixth module;
a sixth module, configured to remove the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, and to match each fingerprint of the resulting fingerprint section, one by one, against all fingerprints in the fingerprint container corresponding to the fingerprint container ID; if the match result is a repeat, to mark the fingerprint as a duplicate fingerprint; if the match result is not a repeat, to insert the fingerprint into the fingerprint container;
a seventh module, configured to set the counter i = i + 1 and return to the third module;
an eighth module, configured to build a new fingerprint container in the fingerprint container cache, remove the fingerprints that are repeated within the fingerprint section containing the representative fingerprint, insert all fingerprints of the resulting fingerprint section into the new fingerprint container, insert the representative fingerprint and the new fingerprint container ID into the representative fingerprint index table as a key-value pair, and insert the new fingerprint container ID into the memory hit table;
a ninth module, configured to set the counter i = i + 1 and return to the third module.
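Two details from the dependent claims, converting string fingerprints to numeric values (claim 6) and closing a full container (claim 7), can be illustrated with a short Python sketch. It is not taken from the patent. It assumes that fingerprints arrive as hexadecimal digest strings and that a dictionary stands in for the disk. The names `fingerprint_value` and `FingerprintContainer` are hypothetical.

```python
from typing import Dict, FrozenSet, Set


def fingerprint_value(hex_digest: str) -> int:
    """Convert a string fingerprint (assumed hexadecimal) to a numeric value for comparison."""
    return int(hex_digest, 16)


class FingerprintContainer:
    """Container that stops accepting fingerprints once it reaches its capacity."""

    def __init__(self, container_id: int, capacity: int,
                 disk: Dict[int, FrozenSet[int]]) -> None:
        self.container_id = container_id
        self.capacity = capacity
        self.disk = disk                      # stand-in for on-disk container storage
        self.fingerprints: Set[int] = set()
        self.full = False

    def insert(self, fp: int) -> bool:
        """Insert a fingerprint; return False if the container no longer accepts new ones."""
        if self.full:
            return False
        self.fingerprints.add(fp)
        if len(self.fingerprints) >= self.capacity:
            # Capacity reached: close the container and write it back to the simulated disk.
            self.full = True
            self.disk[self.container_id] = frozenset(self.fingerprints)
        return True


# The representative of a section is the fingerprint with the smallest numeric value.
section = ["ff", "0a", "1b"]
representative = min(section, key=fingerprint_value)

disk: Dict[int, FrozenSet[int]] = {}
container = FingerprintContainer(container_id=0, capacity=2, disk=disk)
accepted = [container.insert(fingerprint_value(s)) for s in section]
```

Here the third insertion is refused because the container reached its capacity of two fingerprints and was written back after the second insertion. The claims do not say what happens to a refused fingerprint, so the sketch only reports the refusal.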
CN201710747552.2A 2017-08-28 2017-08-28 Repeated data detection method based on clustering Active CN107515931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710747552.2A CN107515931B (en) 2017-08-28 2017-08-28 Repeated data detection method based on clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710747552.2A CN107515931B (en) 2017-08-28 2017-08-28 Repeated data detection method based on clustering

Publications (2)

Publication Number Publication Date
CN107515931A true CN107515931A (en) 2017-12-26
CN107515931B CN107515931B (en) 2023-04-25

Family

ID=60724325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710747552.2A Active CN107515931B (en) 2017-08-28 2017-08-28 Repeated data detection method based on clustering

Country Status (1)

Country Link
CN (1) CN107515931B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445702A (en) * 2018-10-26 2019-03-08 黄淮学院 Block-level data deduplication storage system
CN109783523A (en) * 2019-01-24 2019-05-21 广州虎牙信息科技有限公司 A kind of data processing method, device, equipment and storage medium
CN112100318A (en) * 2020-11-12 2020-12-18 北京智慧星光信息技术有限公司 Multi-dimensional information merging method, device, equipment and storage medium
CN112329717A (en) * 2020-11-25 2021-02-05 中国人民解放军国防科技大学 Fingerprint cache method for similarity detection of mass data
CN115827619A (en) * 2023-01-06 2023-03-21 山东捷瑞数字科技股份有限公司 Repeated data detection method, device and equipment based on three-dimensional engine

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1940995A (en) * 2005-09-29 2007-04-04 中国科学院自动化研究所 Method for compressing fingerprint direction quantized diagram to embedded system
CN101681381A (en) * 2007-06-06 2010-03-24 杜比实验室特许公司 Improving audio/video fingerprint search accuracy using multiple search combining
CN102663086A (en) * 2012-04-09 2012-09-12 华中科技大学 Method for retrieving data block indexes
CN103699567A (en) * 2013-11-04 2014-04-02 北京中搜网络技术股份有限公司 Method for realizing same news clustering based on title fingerprint and text fingerprint
US8930648B1 (en) * 2012-05-23 2015-01-06 Netapp, Inc. Distributed deduplication using global chunk data structure and epochs
CN105493080A (en) * 2013-12-23 2016-04-13 华为技术有限公司 Method and apparatus for context aware based data de-duplication
CN105989033A (en) * 2015-02-03 2016-10-05 北京中搜网络技术股份有限公司 Information duplication eliminating method based on information fingerprints

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1940995A (en) * 2005-09-29 2007-04-04 中国科学院自动化研究所 Method for compressing fingerprint direction quantized diagram to embedded system
CN101681381A (en) * 2007-06-06 2010-03-24 杜比实验室特许公司 Improving audio/video fingerprint search accuracy using multiple search combining
CN102663086A (en) * 2012-04-09 2012-09-12 华中科技大学 Method for retrieving data block indexes
US8930648B1 (en) * 2012-05-23 2015-01-06 Netapp, Inc. Distributed deduplication using global chunk data structure and epochs
CN103699567A (en) * 2013-11-04 2014-04-02 北京中搜网络技术股份有限公司 Method for realizing same news clustering based on title fingerprint and text fingerprint
CN105493080A (en) * 2013-12-23 2016-04-13 华为技术有限公司 Method and apparatus for context aware based data de-duplication
CN105989033A (en) * 2015-02-03 2016-10-05 北京中搜网络技术股份有限公司 Information duplication eliminating method based on information fingerprints

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINYE LI 等: "Clustering Web Retrieval Results Accompanied by Removing Duplicate Documents" *
张攀峰: "数据去重中重复数据检测技术研究" *
殷波 等: "一种基于重复串的STC改进算法" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445702A (en) * 2018-10-26 2019-03-08 黄淮学院 Block-level data deduplication storage system
CN109445702B (en) * 2018-10-26 2019-12-06 黄淮学院 block-level data deduplication storage system
CN109783523A (en) * 2019-01-24 2019-05-21 广州虎牙信息科技有限公司 A kind of data processing method, device, equipment and storage medium
CN112100318A (en) * 2020-11-12 2020-12-18 北京智慧星光信息技术有限公司 Multi-dimensional information merging method, device, equipment and storage medium
CN112329717A (en) * 2020-11-25 2021-02-05 中国人民解放军国防科技大学 Fingerprint cache method for similarity detection of mass data
CN115827619A (en) * 2023-01-06 2023-03-21 山东捷瑞数字科技股份有限公司 Repeated data detection method, device and equipment based on three-dimensional engine
CN115827619B (en) * 2023-01-06 2023-05-09 山东捷瑞数字科技股份有限公司 Method, device and equipment for detecting repeated data based on three-dimensional engine

Also Published As

Publication number Publication date
CN107515931B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN107515931A (en) A kind of duplicate data detection method based on cluster
CN107391034B (en) A kind of repeated data detection method based on local optimization
CN102831222B (en) Differential compression method based on data de-duplication
US9529912B2 (en) Metadata querying method and apparatus
US10346257B2 (en) Method and device for deduplicating web page
CN103488709B (en) A kind of index establishing method and system, search method and system
CN100452055C (en) Large-scale and multi-key word matching method for text or network content analysis
CN103597450B (en) Memory with the metadata being stored in a part for storage page
CN108959563B (en) Capacity expandable block chain query method and system
Xie et al. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb
CN107491487A (en) A kind of full-text database framework and bitmap index establishment, data query method, server and medium
CN111382327A (en) Character string matching device and method
CN103207889A (en) Method for retrieving massive face images based on Hadoop
Tang et al. Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce.
Wang et al. Fast and adaptive indexing of multi-dimensional observational data
CN104021179B (en) The Fast Recognition Algorithm of similarity data under a kind of large data sets
Romberg et al. Bundle min-Hashing: Speeded-up object retrieval
Wang et al. PLSM: a highly efficient LSM-tree index supporting real-time big data analysis
CN113760190A (en) Small file merging system and method based on Ceph storage
Nie et al. Efficient storage support for real-time near-duplicate video retrieval
CN116361796A (en) Industrial control malicious code detection method based on content partitioning
Chen et al. Efficient similarity search in nonmetric spaces with local constant embedding
CN106599326B (en) Recorded data duplication eliminating processing method and system under cloud architecture
CN109213760A (en) The storage of high load business and search method of non-relation data storage
Karim et al. An efficient approach to mining maximal contiguous frequent patterns from large DNA sequence databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant