CN107391034A - Duplicate data detection method based on locality optimization - Google Patents

Duplicate data detection method based on locality optimization

Info

Publication number
CN107391034A
CN107391034A, CN201710555589A, CN107391034B
Authority
CN
China
Prior art keywords
fingerprint
hash
hash bucket
bucket
caching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710555589.5A
Other languages
Chinese (zh)
Other versions
CN107391034B (en)
Inventor
王桦
周可
张攀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710555589.5A priority Critical patent/CN107391034B/en
Publication of CN107391034A publication Critical patent/CN107391034A/en
Application granted granted Critical
Publication of CN107391034B publication Critical patent/CN107391034B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements

Abstract

The invention discloses a duplicate data detection method based on locality optimization, belonging to the field of computer storage technology. It solves the problem of low detection efficiency in existing duplicate data detection methods, which worsens as stored data volumes grow. The method comprises four detection steps: Bloom filter detection, hash bucket write buffer detection, hash bucket read buffer detection, and hash bucket address table detection. It mainly targets data sets with strong locality: by mining the locality in the data set, it improves the efficiency of data prefetching, reduces disk access overhead, and raises the throughput of data deduplication. For potentially duplicate data in a data set, the method first uses a Bloom filter to pre-judge whether a data block is a repeat; then, depending on the outcome, it performs three-level duplicate detection against the hot zone and cold zone of the buffer and against the disk, making full use of the locality of duplicate data and improving the efficiency of duplicate data detection.

Description

Duplicate data detection method based on locality optimization
Technical field
The invention belongs to the field of computer storage technology, and more particularly relates to a duplicate data detection method based on locality optimization.
Background technology
With the rapid development of information technology, information has become a precious resource on which we depend and a major driving force behind productivity growth. The wide application of information technology is accompanied by the generation of massive data, and ever more valuable data needs to be stored. How to effectively improve the storage efficiency of existing storage media and meet ever-growing storage demand has therefore become one of the urgent problems in the storage research field. Meanwhile, an IDC survey reports that about 75% of existing data is redundant, i.e., only 25% of the data is unique. In this context, data deduplication, a new technique for detecting and eliminating redundancy at large scale, has become a research hotspot in academia and industry in recent years, and is increasingly widely applied in all kinds of information storage systems.
Detecting repeated fingerprints is the key technical means for realizing data deduplication. Existing deduplication techniques mainly detect duplicate data by fingerprint detection: the fingerprint (hash value) of a data block is extracted, and whether the block is a duplicate is identified by checking whether its fingerprint repeats. Basic fingerprint detection methods usually use data structures such as a single hash table or a B-tree to identify repeated fingerprints.
However, a problem of the above fingerprint detection methods that cannot be ignored is their relatively low detection performance: they cannot perform efficient duplicate data detection on large data sets, which affects the overall efficiency of data deduplication.
The content of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a duplicate data detection method based on locality optimization, whose purpose is to solve the technical problem that existing fingerprint-based duplicate data detection methods have relatively low detection performance and cannot perform effective duplicate data detection on large data sets.
To achieve the above object, according to one aspect of the invention, a duplicate data detection method based on locality optimization is provided, comprising the following steps:
(1) Obtain a fingerprint list file, extract part of the fingerprints from the fingerprint list file and store them in a cache, and take one fingerprint from the cache;
(2) Query whether the fingerprint may already be recorded in the Bloom filter; if it may be recorded, go to step (4), otherwise go to step (3);
(3) Insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), take the next fingerprint from the cache, and return to step (2);
(4) Judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (5);
(5) Judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) Look up the hash bucket address table by the fingerprint to try to obtain the corresponding hash bucket ID; if no ID is obtained, conclude that the fingerprint is a new fingerprint, take the next fingerprint from the cache and return to step (2); if an ID is obtained, go to step (7);
(7) Traverse all hash buckets in the cold zone of the hash bucket read buffer by the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID. If there is, search for the fingerprint in that hash bucket, take the next fingerprint from the cache, and return to step (2). Otherwise, load the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket of the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint. Then take the next fingerprint from the cache and return to step (2).
Preferably, the Bloom filter is created in the initialization phase, where
the optimal bit vector size m of the Bloom filter is:
m = log2e × log2(1/ε) × C
and the number of hash functions is:
k = ⌈ln2 × (m/C)⌉
where C denotes the duplicate data block index capacity and ε denotes the false positive rate of the Bloom filter.
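As an illustration (not part of the patent text), the sizing formulas above can be evaluated directly. The function name `bloom_sizing` and the example values of C and ε are assumptions for this sketch:

```python
import math

def bloom_sizing(capacity: int, error_rate: float):
    """Compute bit-vector size m and hash count k for a Bloom filter.

    m = log2(e) * log2(1/eps) * C   (the patent's sizing formula)
    k = ceil(ln(2) * m / C)         (the optimal hash-function count)
    """
    m = math.ceil(math.log2(math.e) * math.log2(1 / error_rate) * capacity)
    k = math.ceil(math.log(2) * m / capacity)
    return m, k

m, k = bloom_sizing(capacity=1_000_000, error_rate=1e-5)
```

With C = 10^6 and ε = 10^-5 this yields roughly 24 million bits (about 3 MB of bit vector) and 17 hash functions.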
Preferably, in step (2), querying whether the Bloom filter may have recorded the extracted fingerprint X proceeds as follows: compute hashi(X) for each hash function (1 ≤ i ≤ k, where k is the number of hash functions); if the bitwise AND of the bit vector positions indexed by hash1(X), hash2(X), …, hashk(X) equals 0, i.e. at least one of those bits is 0, then the Bloom filter has not recorded fingerprint X and X is a new fingerprint; otherwise fingerprint X may have been recorded.
Preferably, a hash bucket is a container holding fingerprints, with a preferred capacity of 192 to 512 fingerprints per bucket. The hash bucket write buffer is realized in the initialization phase by allocating free memory space, and its preferred size is 4 to 128 hash buckets. A first list and a second list are set up in the hash bucket write buffer, and each node in the lists consists of a hash bucket.
Preferably, the fingerprint is inserted as follows: (a) compute the values hashi(X) of the k hash functions, and set to 1 the bit at offset hashi(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash bucket lists in the hash bucket write buffer and judge whether some list has a bucket that is not full; when a non-full bucket is found, store the fingerprint in the first non-full bucket, and write the fingerprint and the hash bucket's ID into the hash bucket address table. If all hash buckets in a list are found to be full, lock that list and write all its hash buckets to disk; after the write completes, empty and unlock the list, where emptying a list means emptying all hash buckets in the list and allocating a new hash bucket ID for each hash bucket. If all buckets in both lists are already full, the fingerprint insertion operation has to wait until some list finishes writing and is emptied before the insertion can be performed.
Preferably, in step (4), the hash bucket read buffer is a cache space set up in memory in the initialization phase. It is organized as a linked list of hash buckets, and the preferred size of the hash bucket read buffer is 1024 to 2048 hash buckets.
According to another aspect of the invention, a duplicate data detection system based on locality optimization is provided, comprising:
a first module for obtaining a fingerprint list file, extracting part of the fingerprints from the fingerprint list file and storing them in a cache, and taking one fingerprint from the cache;
a second module for querying whether the fingerprint may already be recorded in the Bloom filter; if it may be recorded, control passes to the fourth module, otherwise to the third module;
a third module for inserting the fingerprint into the Bloom filter and the hash bucket write buffer, taking the next fingerprint from the cache, and returning to the second module;
a fourth module for judging whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, taking the next fingerprint from the cache and returning to the second module, otherwise passing to the fifth module;
a fifth module for judging whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, taking the next fingerprint from the cache and returning to the second module, otherwise passing to the sixth module;
a sixth module for looking up the hash bucket address table by the fingerprint to try to obtain the corresponding hash bucket ID; if no ID is obtained, concluding that the fingerprint is a new fingerprint, taking the next fingerprint from the cache and returning to the second module; if an ID is obtained, passing to the seventh module;
a seventh module for traversing all hash buckets in the cold zone of the hash bucket read buffer by the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID; if there is, searching for the fingerprint in that hash bucket, taking the next fingerprint from the cache, and returning to the second module; otherwise loading the hash bucket corresponding to the hash bucket ID from disk, inserting it as the first hash bucket of the hot zone of the hash bucket read buffer, and searching for the fingerprint in the inserted hash bucket, where finding it means the fingerprint is an existing fingerprint and not finding it means the fingerprint is a new fingerprint; then taking the next fingerprint from the cache and returning to the second module.
In general, compared with the prior art, the above technical scheme of the invention achieves the following beneficial effects:
1. The invention solves the technical problem that existing fingerprint detection methods have relatively low detection performance and cannot perform effective duplicate data detection on large data sets: because the invention adopts step (2), the pre-judgment of the Bloom filter effectively reduces the number of fingerprint detections and effectively improves the retrieval performance for repeated fingerprints.
2. The invention adopts steps (3) to (7), makes full use of the characteristics of the data set itself, and uses data prefetching and caching techniques to perform three-level duplicate data detection on the hot zone and cold zone of the buffer and on the disk according to different conditions. It fully mines the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves the detection efficiency of duplicate data.
Brief description of the drawings
Fig. 1 is the logical structure diagram of the invention;
Fig. 2 is the data structure of Bloom filter;
Fig. 3 is Hash bucket address table structural representation;
Fig. 4 is Hash bucket read buffer structural representation;
Fig. 5 is the schematic diagram of the duplicate data detection method of the invention based on local optimization.
Embodiment
In order to make the objects, technical schemes and advantages of the invention clearer, the invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with each other as long as they do not conflict.
The invention proposes an efficient fast detection technique for repeated fingerprints. It is mainly aimed at duplicate data detection on data set types with strong locality, and improves the performance of duplicate data detection through an optimization strategy of one-level pre-judgment plus three-level detection, realized with a Bloom filter and caching techniques.
The basic idea of the invention is that, for potentially duplicate data in a data set, the repeatability of a data block is first pre-judged with a Bloom filter; then, according to different conditions, three-level duplicate data detection is performed on the hot zone and cold zone of the buffer and on the disk, making full use of the locality in duplicate data and improving the detection efficiency of duplicate data.
The basic logical structure of the invention is shown in Fig. 1. It consists of six parts: the fingerprint cache table, the Bloom filter, the hash bucket address table, the hash bucket write buffer, the hash bucket read buffer, and the hash buckets on disk.
To explain the invention clearly, the terms appearing in this specification are explained and illustrated as follows:
Fingerprint list: the set of fingerprints formed, in processing order, by chunking a data set and extracting a fingerprint from each chunk.
Fingerprint cache table: the fingerprint cache table is used to cache the fingerprints in the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints are read in at once and stored in the fingerprint cache area. The system takes fingerprints out of the fingerprint cache table one by one and performs the repeated-fingerprint lookup.
Bloom filter: as shown in Fig. 2, a Bloom filter consists of a bit vector of length m bits and k independent hash functions hi(x) (1 ≤ i ≤ k, k < m). It is a random data structure with very high space efficiency; it represents a set with the bit vector and can judge whether an element belongs to that set. To represent a set S = {x1, x2, x3, …, xn}, all positions in the bit vector are first initialized to 0. Then, for each element xj (1 ≤ j ≤ n) in S, the k independent hash functions hi(x) are applied to obtain k hash values hi(xj) (1 ≤ i ≤ k, xj ∈ S). Taking the first bit of the bit vector as the origin and these k hash values as offsets, xj is mapped to k positions of the bit vector drawn from {1, 2, …, m}, and these positions are set to 1, so that xj is marked. After all elements of S have been marked, the set S is represented by the Bloom filter; if a position is set to 1 repeatedly, only the first setting takes effect.
To determine whether a data element y belongs to the set S, the k independent hash functions hi(x) are first applied to y to obtain k hash values hi(y). Taking the first bit of the bit vector as the origin and these k hash values as offsets, check whether all the corresponding positions in the Bloom filter's bit vector are 1; if so, y may belong to S, otherwise y is definitely not an element of S.
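The mark-and-query procedure just described can be sketched as a minimal Bloom filter. Deriving the k hash values from a single SHA-1 digest (double hashing) is an implementation choice made here for brevity; the patent only requires k independent hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit vector and k derived hashes."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initialized to 0

    def _positions(self, item: bytes):
        # Double hashing: k positions from one SHA-1 digest (an assumption
        # of this sketch, not something the patent specifies).
        d = hashlib.sha1(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)  # mark the element

    def may_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))
```

A queried fingerprint whose k positions are not all 1 is guaranteed to be new, which is exactly the pre-judgment used in step (2).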
Because the hash functions hi(x) (1 ≤ i ≤ k) may collide for two different elements, an element y not in S may be mapped to positions in the bit vector that were all set by elements of S, so the Bloom filter may err when making a positive judgment. The probability that a Bloom filter mistakes an element outside S for an element of S is called the false positive probability (False Positive Probability), abbreviated as the false positive rate (Error Rate). The false positive probability can be controlled by mathematical means.
Given the cardinality n of the set S, the length m of the Bloom filter's bit vector, and the number k of its hash functions, the probability that a given bit of the bit vector is still 0 after n elements have been inserted is (1 − 1/m)^(k×n). On the other hand, the Bloom filter makes a false positive judgment for a new element y only when all the positions corresponding to y have already been set to 1, from which the false positive probability fBF can be deduced:
fBF = (1 − (1 − 1/m)^(k×n))^k ≈ (1 − e^(−k×n/m))^k.
It can be derived that when k = ⌈ln2 × (m/n)⌉, the Bloom filter has the minimum false positive probability, called the ideal false positive rate and denoted FBF; at this point about 50% of the positions in the Bloom filter's bit vector are "1". The symbol ⌈ln2 × (m/n)⌉ denotes the smallest positive integer not less than ln2 × (m/n).
Further, if n is known and one wishes to design a Bloom filter whose ideal false positive rate does not exceed a given upper bound ε, it can be derived that m must satisfy:
m ≥ log2e × log2(1/ε) × n.
If m = log2e × log2(1/ε) × n and k = ⌈ln2 × (m/n)⌉, then the false positive probability grows to ε exactly when all n elements have been inserted into the Bloom filter; n is therefore also called the design capacity of the Bloom filter.
From the above analysis, the bit vector length m and the number of hash functions k of a Bloom filter can be computed from the design capacity n and the false positive rate upper bound ε. The design capacity n is the number of elements the Bloom filter is expected to mark. When a Bloom filter has marked fewer than n elements, it is a non-full Bloom filter; a non-full Bloom filter can both continue to mark new elements and be queried about whether some element has been marked in it. When the number of marked elements in a Bloom filter reaches n, the Bloom filter is full: it cannot continue to mark new elements, but it can still serve queries (n ≤ m).
Hash bucket: the hash bucket is the basic unit of fingerprint storage and of cache swap-in and swap-out. One hash bucket stores a fixed number of independent fingerprints (an independent fingerprint is a fingerprint whose value differs from all other fingerprints).
Hash bucket write buffer: the hash bucket write buffer is a buffer area opened up in memory that caches new fingerprints before they are written to disk. Because storing a new fingerprint and writing a hash bucket back to disk cannot operate on the same hash bucket at the same time, the hash bucket write buffer is designed as two hash bucket lists to avoid conflicts on the critical resource. When all hash buckets in one list are full, that list is locked and all its hash buckets are written to disk. Because these hash buckets are written once, they are generally written to the same track on disk, and the fingerprints in them preserve data locality within a certain spatial range; this makes the subsequent read-prefetch operation possible. After the write completes, the list's buckets are all cleared, while new fingerprints continue to be stored in the hash buckets of the other list. If the disk write of one list has not completed and the hash buckets of the other list are already full, the insertion must wait for the disk write to complete. When a fingerprint is stored into the hash bucket write buffer, a hash bucket ID is assigned and the hash bucket address table is updated at the same time.
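A minimal sketch of the double-list write buffer just described, with the disk write reduced to a callback. The names, the toy `BUCKET_CAPACITY` of 4 (the patent prefers 192 to 512 fingerprints per bucket), and the omission of real locking are all assumptions of this sketch:

```python
BUCKET_CAPACITY = 4  # toy value; the patent prefers 192-512 fingerprints/bucket

class WriteBuffer:
    """Two lists of hash buckets. When every bucket in one list is full,
    that list is flushed to disk as a whole and then cleared, while new
    fingerprints keep landing in the other list."""

    def __init__(self, buckets_per_list: int, flush_to_disk):
        self.lists = [[[] for _ in range(buckets_per_list)] for _ in range(2)]
        self.flush_to_disk = flush_to_disk  # callback receiving a full list

    def insert(self, fingerprint: str):
        # Poll both lists for the first non-full bucket.
        for lst in self.lists:
            for bucket in lst:
                if len(bucket) < BUCKET_CAPACITY:
                    bucket.append(fingerprint)
                    self._maybe_flush(lst)
                    return
        # In the real system, insertion blocks here until a flush completes.

    def _maybe_flush(self, lst):
        if all(len(b) == BUCKET_CAPACITY for b in lst):
            self.flush_to_disk([b[:] for b in lst])  # write whole list to disk
            for b in lst:
                b.clear()  # empty the list; new bucket IDs would be assigned
```

Flushing a whole list at once is what lets all its buckets land on the same disk track, preserving the spatial locality that the read prefetch later exploits.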
Hash bucket address table: the hash bucket address table is a key-value hash table residing in memory that holds the mapping from a fingerprint key to the ID of the hash bucket containing the fingerprint. Its role is to quickly locate the hash bucket storing a fingerprint when searching for that fingerprint on disk. The concrete structure of the hash bucket address table is shown in Fig. 3: a fingerprint is 20 bytes long, a bucket ID (Bucket ID) is 4 bytes, and a pointer (Pointer) takes 8 bytes. Hash table storage can suffer hash collisions; when a collision occurs, it is handled by chaining.
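A sketch of the address table's lookup contract. A plain Python dict already resolves collisions internally, standing in for the chained hash table of Fig. 3; the class and method names are hypothetical:

```python
class BucketAddressTable:
    """Fingerprint key -> hash bucket ID mapping (sketch).

    The patent stores 20-byte fingerprints and 4-byte bucket IDs and
    resolves collisions by chaining; a dict gives the same lookup contract.
    """

    def __init__(self):
        self._table = {}

    def record(self, fingerprint: bytes, bucket_id: int):
        self._table[fingerprint] = bucket_id

    def lookup(self, fingerprint: bytes):
        # None means the fingerprint is on no disk bucket -> it is new.
        return self._table.get(fingerprint)
```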
Hash bucket read buffer: the hash bucket read buffer is a block of memory space opened up to cache hash buckets read from disk. To improve the efficiency of the disk index, part of the fingerprint index table (hash buckets) on disk is read into the hash bucket read buffer in memory. The hash bucket read buffer is organized as a doubly linked list; each node of the list stores one hash bucket, with the structure shown in Fig. 4. Each list node records the ID of the hash bucket and a flag bit indicating whether the hash bucket is a "dirty bucket". Logically the hash bucket read buffer consists of two parts: the first several nodes form the hot zone, and the nodes from there to the tail form the cold zone. The hot/cold division is used to optimize retrieval performance.
As shown in Fig. 5, the duplicate data detection method based on locality optimization of the invention comprises the following steps:
(1) Obtain a fingerprint list file, extract part of the fingerprints from the fingerprint list file (the size of this part equals the size of the space designated for storing fingerprints in the cache) and store them in the cache, and take one fingerprint from the cache; after all fingerprints in the cache have been taken, a new part of the fingerprints is read from the fingerprint list file and stored in the cache;
(2) Query whether the fingerprint may already be recorded in the Bloom filter; if it may be recorded, go to step (4), otherwise go to step (3).
Specifically, the Bloom filter is created in the initialization phase according to the following procedure:
the optimal bit vector size m of the Bloom filter is
m = log2e × log2(1/ε) × C
and the number of hash functions is
k = ⌈ln2 × (m/C)⌉
where C denotes the duplicate data block index capacity and ε denotes the false positive rate of the Bloom filter, whose value is not higher than the threshold 0.00001.
In this step, judging whether the Bloom filter may have recorded the extracted fingerprint X proceeds as follows: compute hashi(X) for each hash function (1 ≤ i ≤ k, where k is the number of hash functions); if the bitwise AND of the bit vector positions indexed by hash1(X), hash2(X), …, hashk(X) equals 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded.
(3) Insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), take the next fingerprint from the cache, and return to step (2).
The hash bucket in this step is a container holding fingerprints; its size can be any value, preferably 192 to 512 fingerprints per bucket. The hash bucket write buffer is realized in the initialization phase by allocating free memory space; its size can be any value, preferably 4 to 128 hash buckets.
In the initialization phase, two lists (a first list and a second list) are set up in the hash bucket write buffer; every node in the lists consists of a hash bucket.
Specifically, inserting the fingerprint into the Bloom filter and the hash bucket write buffer in this step proceeds as follows: (a) compute the values hashi(X) of the k hash functions and set to 1 the bit at offset hashi(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash bucket lists in the hash bucket write buffer and judge whether some list has a bucket that is not full; when a non-full bucket is found, store the fingerprint in the first non-full bucket (that hash bucket is then called the hot bucket), and write the fingerprint and the hash bucket's ID into the hash bucket address table. If all hash buckets in a list are found to be full, lock that list and write all its hash buckets to disk; after the write completes, empty and unlock the list, where emptying a list means emptying all hash buckets in the list and allocating a new hash bucket ID for each hash bucket. If all buckets in both lists are already full, the fingerprint insertion has to wait until some list finishes writing and is emptied before it can be performed.
The above hash bucket address table is an address list created in memory in the initialization phase; it describes, in key-value form, the mapping between a fingerprint and the hash bucket ID of the hash bucket storing that fingerprint.
(4) Judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (5).
Specifically, the hash bucket read buffer (Cache) is a cache space set up in memory in the initialization phase. It is organized as a linked list of hash buckets; its size can be any value, preferably 1024 to 2048 hash buckets. The one or more hash buckets at the front of the list are called the hot zone, and the remaining hash buckets are called the cold zone.
(5) Judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (6).
(6) Look up the hash bucket address table by the fingerprint to try to obtain the corresponding hash bucket ID; if no ID is obtained, conclude that the fingerprint is a new fingerprint, take the next fingerprint from the cache and return to step (2); if an ID is obtained, go to step (7).
(7) Traverse all hash buckets in the cold zone of the hash bucket read buffer by the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID. If there is, search for the fingerprint in that hash bucket, take the next fingerprint from the cache, and return to step (2). Otherwise, load the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket of the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint; if not, the fingerprint is a new fingerprint. Then take the next fingerprint from the cache and return to step (2).
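Steps (2) through (7) above can be tied together in one detection routine per fingerprint. This is a sketch under assumed component interfaces (`may_contain`, `in_hot_zone`, `cold_bucket`, and the other names are hypothetical), not the patent's implementation:

```python
def detect(fingerprint, bloom, read_cache, write_buffer_hot, address_table,
           load_bucket_from_disk):
    """Three-level duplicate detection for one fingerprint, following
    steps (2)-(7). Returns True for a duplicate, False for a new fingerprint."""
    if not bloom.may_contain(fingerprint):          # step (2): pre-judgment
        bloom.add(fingerprint)                      # step (3): record new fp
        write_buffer_hot.add(fingerprint)
        return False
    if read_cache.in_hot_zone(fingerprint):         # step (4): read-cache hot zone
        return True
    if fingerprint in write_buffer_hot:             # step (5): write-buffer hot bucket
        return True
    bucket_id = address_table.lookup(fingerprint)   # step (6): address table
    if bucket_id is None:
        return False                                # Bloom false positive -> new
    bucket = read_cache.cold_bucket(bucket_id)      # step (7): read-cache cold zone
    if bucket is not None:
        return fingerprint in bucket
    bucket = load_bucket_from_disk(bucket_id)       # step (7): last resort, disk
    read_cache.insert_hot(bucket_id, bucket)        # prefetch into the hot zone
    return fingerprint in bucket
```

Each level is cheaper than the next: the Bloom filter screens out most new fingerprints, the two memory hot zones absorb locality hits, and only the remainder pays for a disk read, whose bucket is then promoted so that neighboring fingerprints hit in memory.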
The invention has the following beneficial effects. First, because the invention adopts step (2), the pre-judgment of the Bloom filter effectively reduces the number of fingerprint detections and effectively improves the retrieval performance for repeated fingerprints. Second, because the invention adopts steps (3) to (7), it makes full use of the characteristics of the data set itself and uses data prefetching and caching techniques to perform three-level duplicate data detection on the hot zone and cold zone of the buffer and on the disk according to different conditions; it fully mines the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves the detection efficiency of duplicate data.
It will be readily appreciated by those skilled in the art that the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (7)

1. a kind of duplicate data detection method based on local optimization, it is characterised in that comprise the following steps:
(1) obtaining a fingerprint list file, extracting a portion of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;
(2) querying a Bloom filter to determine whether the extracted fingerprint may have been recorded therein; if it may have been recorded, proceeding to step (4); otherwise proceeding to step (3);
(3) inserting the fingerprint into the Bloom filter and into a hash-bucket write buffer (Buffer), extracting the next fingerprint from the cache, and returning to step (2);
(4) determining whether the fingerprint has been recorded in a hot zone of a hash-bucket read cache; if so, extracting the next fingerprint from the cache and returning to step (2); otherwise proceeding to step (5);
(5) determining whether the fingerprint has been recorded in hot buckets of the hash-bucket write buffer; if so, extracting the next fingerprint from the cache and returning to step (2); otherwise proceeding to step (6);
(6) looking up a hash-bucket address table according to the fingerprint to determine whether a corresponding hash-bucket ID can be obtained; if it cannot be obtained, determining that the fingerprint is a new fingerprint, extracting the next fingerprint from the cache, and returning to step (2); if it can be obtained, proceeding to step (7);
(7) traversing all hash buckets in a cold zone of the hash-bucket read cache according to the obtained hash-bucket ID to determine whether a hash bucket corresponding to the ID exists; if so, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to step (2); otherwise, reading the hash bucket corresponding to the ID from disk and inserting it as the first hash bucket in the hot zone of the hash-bucket read cache, then searching for the fingerprint in the inserted bucket, wherein if the fingerprint is found it is an existing fingerprint and if it is not found it is a new fingerprint; then extracting the next fingerprint from the cache and returning to step (2).
2. The duplicate-data detection method according to claim 1, characterized in that the Bloom filter is created in an initialization phase, and:
the optimal bit-vector size m of the Bloom filter is:
m = log2(e) × log2(1/ε) × C
the number of hash functions is:
k = log2(1/ε)
wherein C denotes the duplicate-data block index capacity and ε denotes the false positive rate of the Bloom filter.
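Under those two formulas the filter can be sized directly. A small sketch, rounding up to whole bits and hash functions; the capacity and false-positive rate used below are illustrative values only:

```python
import math

def bloom_parameters(capacity, false_positive_rate):
    """Optimal Bloom filter sizing per the claim:
    m = log2(e) * log2(1/eps) * C bits, k = log2(1/eps) hash functions."""
    bits_per_entry = math.log2(math.e) * math.log2(1.0 / false_positive_rate)
    m = math.ceil(bits_per_entry * capacity)              # bit-vector size
    k = math.ceil(math.log2(1.0 / false_positive_rate))   # hash-function count
    return m, k

# Example: index capacity of one million blocks at a 1% false positive rate.
m, k = bloom_parameters(capacity=1_000_000, false_positive_rate=0.01)
```

At ε = 0.01 this yields roughly 9.6 bits per indexed block and 7 hash functions, which matches the standard Bloom-filter optimum.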
3. The duplicate-data detection method according to claim 1, characterized in that, in step (2), whether the extracted fingerprint X may have been recorded in the Bloom filter is determined as follows: for the hash functions hashi(X) with 1 ≤ i ≤ k, where k denotes the number of hash functions, if hash1(X) & hash2(X) & … & hashk(X) = 0, that is, if at least one of the k bit positions addressed by the hash values holds a 0, then the fingerprint X is not recorded in the Bloom filter and is a new fingerprint; otherwise, the fingerprint X may have been recorded.
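The membership test of claim 3 amounts to ANDing the k probed bits of the bit vector. A toy sketch; the hash family derived from Python's built-in `hash` is a placeholder, not the patent's hash functions:

```python
class TinyBloom:
    """Bit-vector Bloom filter: X may be present only if ALL k probed bits are 1."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, x):
        # Placeholder hash family: k positions derived from Python's hash.
        return [hash((x, i)) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.bits[p] = 1

    def may_contain(self, x):
        # hash1(X) & hash2(X) & ... & hashk(X): if any probed bit is 0,
        # X was definitely never inserted, i.e. it is a new fingerprint.
        return all(self.bits[p] for p in self._positions(x))
```

A `False` answer is definitive (new fingerprint); a `True` answer only means the fingerprint *may* be present, which is why the method proceeds to the cache and disk checks.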
4. The duplicate-data detection method according to claim 1, characterized in that:
a hash bucket is a container for holding fingerprints, with a preferred capacity of 192 to 512 fingerprints per bucket;
the hash-bucket write buffer is created in an initialization phase by allocating free memory space, with a preferred size of 4 to 128 hash buckets;
a first list and a second list are provided in the hash-bucket write buffer, and each node in the lists consists of a hash bucket.
5. The duplicate-data detection method according to claim 1, characterized in that inserting the fingerprint into the Bloom filter and the hash-bucket write buffer specifically comprises: (a) computing the values hashi(X) of the k hash functions, and setting to 1 the bit at offset hashi(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash-bucket lists in the write buffer to determine whether either list contains a non-full bucket; when a non-full bucket is found, storing the fingerprint in the first non-full bucket and writing the fingerprint together with the bucket's hash-bucket ID into the hash-bucket address table; if all hash buckets in one list are found to be full, locking that list, writing all of its hash buckets to disk, and, after the write completes, emptying and unlocking the list, where emptying a list means clearing all hash buckets in the list and assigning each hash bucket a new hash-bucket ID; if all buckets in both lists are full, the fingerprint insertion must wait until one list has been written to disk and emptied before the insertion is performed.
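The double-buffered insertion of claim 5 can be sketched as two bucket lists, one filling while a full one is flushed and re-identified. The list length and bucket capacity below are illustrative, not the preferred values of claim 4, and the eager synchronous flush stands in for the locked asynchronous write:

```python
class DoubleBufferedWriteCache:
    """Sketch of the two-list write buffer: fingerprints go into the first
    non-full bucket; a list whose buckets are all full is written to 'disk',
    emptied, and its buckets are assigned fresh bucket IDs."""

    def __init__(self, buckets_per_list=1, bucket_capacity=2):
        self.bucket_capacity = bucket_capacity
        self._next_id = 0
        self.lists = [self._fresh_list(buckets_per_list) for _ in range(2)]
        self.addr_table = {}   # fingerprint -> bucket ID
        self.disk = {}         # bucket ID -> flushed fingerprints

    def _fresh_list(self, n):
        # Emptying a list: clear every bucket and assign each a new bucket ID.
        lst = []
        for _ in range(n):
            lst.append({"id": self._next_id, "fps": []})
            self._next_id += 1
        return lst

    def insert(self, fp):
        for idx, lst in enumerate(self.lists):        # poll the two lists
            for bucket in lst:
                if len(bucket["fps"]) < self.bucket_capacity:
                    bucket["fps"].append(fp)          # first non-full bucket
                    self.addr_table[fp] = bucket["id"]
                    if all(len(b["fps"]) == self.bucket_capacity for b in lst):
                        self._flush(idx)              # list became full: flush it
                    return

    def _flush(self, idx):
        # Write all buckets of the list to "disk", then empty and re-ID it.
        for bucket in self.lists[idx]:
            self.disk[bucket["id"]] = list(bucket["fps"])
        self.lists[idx] = self._fresh_list(len(self.lists[idx]))
```

Because this sketch flushes a list eagerly as soon as it fills, the "both lists full" wait case of claim 5 never arises here; in the real design the flush is asynchronous, so insertion may have to block on it.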
6. The duplicate-data detection method according to claim 1, characterized in that, in step (4), the hash-bucket read cache is a cache space allocated in memory in an initialization phase and is organized as a linked list of multiple hash buckets, with a preferred size of 1024 to 2048 hash buckets.
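The read cache of claim 6 can be sketched as a bounded deque of buckets whose front acts as the hot zone, with a bucket fetched from disk inserted at the head as in step (7). The capacity, hot-zone split point, and tail eviction here are illustrative assumptions, not part of the claim:

```python
from collections import deque

class BucketReadCache:
    """Bounded linked list of hash buckets; the front portion acts as the
    hot zone and the remainder as the cold zone."""

    def __init__(self, capacity=8, hot_size=2):
        self.capacity, self.hot_size = capacity, hot_size
        self.buckets = deque()   # each entry: (bucket_id, set_of_fingerprints)

    def hot_zone(self):
        return list(self.buckets)[: self.hot_size]

    def cold_zone(self):
        return list(self.buckets)[self.hot_size :]

    def insert_hot(self, bucket_id, fps):
        # A bucket fetched from disk becomes the first bucket of the hot zone.
        self.buckets.appendleft((bucket_id, set(fps)))
        if len(self.buckets) > self.capacity:
            self.buckets.pop()   # evict from the cold tail
```

In the real design the cache holds on the order of 1024 to 2048 buckets; the tiny sizes used in testing are for illustration only.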
7. A duplicate-data detection system based on locality optimization, characterized by comprising:
a first module for obtaining a fingerprint list file, extracting a portion of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;
a second module for querying a Bloom filter to determine whether the extracted fingerprint may have been recorded therein; if it may have been recorded, passing control to the fourth module; otherwise passing control to the third module;
a third module for inserting the fingerprint into the Bloom filter and a hash-bucket write buffer, extracting the next fingerprint from the cache, and returning to the second module;
a fourth module for determining whether the fingerprint has been recorded in a hot zone of a hash-bucket read cache; if so, extracting the next fingerprint from the cache and returning to the second module; otherwise passing control to the fifth module;
a fifth module for determining whether the fingerprint has been recorded in hot buckets of the hash-bucket write buffer; if so, extracting the next fingerprint from the cache and returning to the second module; otherwise passing control to the sixth module;
a sixth module for looking up a hash-bucket address table according to the fingerprint to determine whether a corresponding hash-bucket ID can be obtained; if it cannot be obtained, determining that the fingerprint is a new fingerprint, extracting the next fingerprint from the cache, and returning to the second module; if it can be obtained, passing control to the seventh module;
a seventh module for traversing all hash buckets in a cold zone of the hash-bucket read cache according to the obtained hash-bucket ID to determine whether a hash bucket corresponding to the ID exists; if so, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to the second module; otherwise, reading the hash bucket corresponding to the ID from disk and inserting it as the first hash bucket in the hot zone of the hash-bucket read cache, then searching for the fingerprint in the inserted bucket, wherein if the fingerprint is found it is an existing fingerprint and if it is not found it is a new fingerprint; then extracting the next fingerprint from the cache and returning to the second module.
CN201710555589.5A 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization Active CN107391034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710555589.5A CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710555589.5A CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Publications (2)

Publication Number Publication Date
CN107391034A true CN107391034A (en) 2017-11-24
CN107391034B CN107391034B (en) 2019-05-10

Family

ID=60335524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710555589.5A Active CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Country Status (1)

Country Link
CN (1) CN107391034B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944038A (en) * 2017-12-14 2018-04-20 上海达梦数据库有限公司 A kind of generation method and device of duplicate removal data
CN108459826A (en) * 2018-02-01 2018-08-28 杭州宏杉科技股份有限公司 A kind of method and device of processing I/O Request
CN109101365A (en) * 2018-08-01 2018-12-28 南京壹进制信息技术股份有限公司 A kind of data backup and resume method deleted again based on source data
CN109240605A (en) * 2018-08-17 2019-01-18 华中科技大学 A kind of quick repeated data block identifying method stacking memory based on 3D
CN109471635A (en) * 2018-09-03 2019-03-15 中新网络信息安全股份有限公司 A kind of algorithm optimization method realized based on Java Set set
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109783523A (en) * 2019-01-24 2019-05-21 广州虎牙信息科技有限公司 A kind of data processing method, device, equipment and storage medium
CN110046164A (en) * 2019-04-16 2019-07-23 中国人民解放军国防科技大学 Index independent grain distribution filter, consistency grain distribution filter and operation method
CN110489405A (en) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 The method, apparatus and server of data processing
CN111338581A (en) * 2020-03-27 2020-06-26 尹兵 Data storage method and device based on cloud computing, cloud server and system
CN112800430A (en) * 2021-02-01 2021-05-14 苏州棱镜七彩信息科技有限公司 Safety and compliance management method suitable for open source assembly
CN113721862A (en) * 2021-11-02 2021-11-30 腾讯科技(深圳)有限公司 Data processing method and device
US20230221864A1 (en) * 2022-01-10 2023-07-13 Vmware, Inc. Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253820A (en) * 2011-06-16 2011-11-23 华中科技大学 Stream type repetitive data detection method
CN102591946A (en) * 2010-12-28 2012-07-18 微软公司 Using index partitioning and reconciliation for data deduplication
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN103870514A (en) * 2012-12-18 2014-06-18 华为技术有限公司 Repeating data deleting method and device
US20140188912A1 (en) * 2012-12-28 2014-07-03 Fujitsu Limited Storage apparatus, control method, and computer product
CN103970744A (en) * 2013-01-25 2014-08-06 华中科技大学 Extendible repeated data detection method
CN103970875A (en) * 2014-05-15 2014-08-06 华中科技大学 Parallel repeated data deleting method
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
US20160335024A1 (en) * 2015-05-15 2016-11-17 ScaleFlux Assisting data deduplication through in-memory computation
CN106293525A (en) * 2016-08-05 2017-01-04 上海交通大学 A kind of method and system improving caching service efficiency
CN106610790A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Repeated data deleting method and device
CN106649346A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data repeatability check method and apparatus
US20170177266A1 (en) * 2015-12-21 2017-06-22 Quantum Corporation Data aware deduplication object storage (dados)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591946A (en) * 2010-12-28 2012-07-18 微软公司 Using index partitioning and reconciliation for data deduplication
CN102253820A (en) * 2011-06-16 2011-11-23 华中科技大学 Stream type repetitive data detection method
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN103870514A (en) * 2012-12-18 2014-06-18 华为技术有限公司 Repeating data deleting method and device
US20140188912A1 (en) * 2012-12-28 2014-07-03 Fujitsu Limited Storage apparatus, control method, and computer product
CN103970744A (en) * 2013-01-25 2014-08-06 华中科技大学 Extendible repeated data detection method
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN103970875A (en) * 2014-05-15 2014-08-06 华中科技大学 Parallel repeated data deleting method
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
US20160335024A1 (en) * 2015-05-15 2016-11-17 ScaleFlux Assisting data deduplication through in-memory computation
CN106610790A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Repeated data deleting method and device
CN106649346A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data repeatability check method and apparatus
US20170177266A1 (en) * 2015-12-21 2017-06-22 Quantum Corporation Data aware deduplication object storage (dados)
CN106293525A (en) * 2016-08-05 2017-01-04 上海交通大学 A kind of method and system improving caching service efficiency

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PANFENG ZHANG: "Resemblance and mergence based indexing for high performance data deduplication", 《JOURNAL OF SYSTEMS AND SOFTWARE》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944038A (en) * 2017-12-14 2018-04-20 上海达梦数据库有限公司 A kind of generation method and device of duplicate removal data
CN107944038B (en) * 2017-12-14 2020-11-10 上海达梦数据库有限公司 Method and device for generating deduplication data
CN108459826A (en) * 2018-02-01 2018-08-28 杭州宏杉科技股份有限公司 A kind of method and device of processing I/O Request
CN108459826B (en) * 2018-02-01 2020-12-29 杭州宏杉科技股份有限公司 Method and device for processing IO (input/output) request
CN109101365A (en) * 2018-08-01 2018-12-28 南京壹进制信息技术股份有限公司 A kind of data backup and resume method deleted again based on source data
CN109240605A (en) * 2018-08-17 2019-01-18 华中科技大学 A kind of quick repeated data block identifying method stacking memory based on 3D
CN109471635A (en) * 2018-09-03 2019-03-15 中新网络信息安全股份有限公司 A kind of algorithm optimization method realized based on Java Set set
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109740037B (en) * 2019-01-02 2023-11-24 山东省科学院情报研究所 Multi-source heterogeneous flow state big data distributed online real-time processing method and system
CN109783523A (en) * 2019-01-24 2019-05-21 广州虎牙信息科技有限公司 A kind of data processing method, device, equipment and storage medium
CN110046164B (en) * 2019-04-16 2021-07-02 中国人民解放军国防科技大学 Operation method of consistent valley filter
CN110046164A (en) * 2019-04-16 2019-07-23 中国人民解放军国防科技大学 Index independent grain distribution filter, consistency grain distribution filter and operation method
CN110489405A (en) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 The method, apparatus and server of data processing
WO2021008024A1 (en) * 2019-07-12 2021-01-21 平安科技(深圳)有限公司 Data processing method and apparatus, and server
CN110489405B (en) * 2019-07-12 2024-01-12 平安科技(深圳)有限公司 Data processing method, device and server
CN111338581A (en) * 2020-03-27 2020-06-26 尹兵 Data storage method and device based on cloud computing, cloud server and system
CN112800430A (en) * 2021-02-01 2021-05-14 苏州棱镜七彩信息科技有限公司 Safety and compliance management method suitable for open source assembly
CN113721862A (en) * 2021-11-02 2021-11-30 腾讯科技(深圳)有限公司 Data processing method and device
US20230221864A1 (en) * 2022-01-10 2023-07-13 Vmware, Inc. Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table

Also Published As

Publication number Publication date
CN107391034B (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN107391034B (en) A kind of repeated data detection method based on local optimization
CN103377137B (en) The frequent block strengthened detection is used to carry out the method and system of storage duplicate removal
Bertino et al. Indexing techniques for advanced database systems
US7418544B2 (en) Method and system for log structured relational database objects
CN102831222B (en) Differential compression method based on data de-duplication
US20090240655A1 (en) Bit String Seacrching Apparatus, Searching Method, and Program
Leung Mining uncertain data
CN103597450B (en) Memory with the metadata being stored in a part for storage page
CN110377747B (en) Knowledge base fusion method for encyclopedic website
CN107515931A (en) A kind of duplicate data detection method based on cluster
CN104462582A (en) Web data similarity detection method based on two-stage filtration of structure and content
US8086641B1 (en) Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same
CN103229164B (en) Data access method and device
CN107291858B (en) Data indexing method based on character string suffix
CN107451233A (en) Storage method of the preferential space-time trajectory data file of time attribute in auxiliary storage device
CN103500183A (en) Storage structure based on multiple-relevant-field combined index and building, inquiring and maintaining method
CN113901279B (en) Graph database retrieval method and device
Liu et al. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search
CN113961754B (en) Graph database system based on persistent memory
Su-Cheng et al. Node labeling schemes in XML query optimization: a survey and trends
US7987205B1 (en) Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations
CN106371765A (en) Method for removing memory thrashing through efficient LTL ((Linear Temporal Logic) model detection of large-scale system
Tao et al. Validity information retrieval for spatio-temporal queries: Theoretical performance bounds
CN110489448A (en) The method for digging of big data correlation rule based on Hadoop
Suei et al. A signature-based Grid index design for main-memory RFID database applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant