CN107391034A - Duplicate data detection method based on locality optimization - Google Patents

Duplicate data detection method based on locality optimization

Info

Publication number
CN107391034A
CN107391034A, CN201710555589A, CN107391034B
Authority
CN
China
Prior art keywords
fingerprint
hash
hash bucket
bucket
caching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710555589.5A
Other languages
Chinese (zh)
Other versions
CN107391034B (en)
Inventor
王桦
周可
张攀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710555589.5A priority Critical patent/CN107391034B/en
Publication of CN107391034A publication Critical patent/CN107391034A/en
Application granted granted Critical
Publication of CN107391034B publication Critical patent/CN107391034B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements

Abstract

The invention discloses a duplicate data detection method based on locality optimization, belonging to the field of computer storage technology. It solves the problem of low detection efficiency in existing duplicate data detection methods, which worsens as stored data volumes grow. The method comprises four detection steps: Bloom filter detection, hash bucket write buffer detection, hash bucket read buffer detection, and hash bucket address table detection. It mainly targets data sets with strong locality: by mining the locality in the data set, it improves the efficiency of data prefetching, reduces disk access overhead, and raises the throughput of data deduplication. For potentially duplicate data in a data set, the method first uses a Bloom filter to pre-judge whether a data block is a repeat; then, depending on the outcome, it performs three-level duplicate detection against the hot zone and cold zone of the buffer and against the disk, making full use of the locality of duplicate data and improving the efficiency of duplicate data detection.

Description

Duplicate data detection method based on locality optimization
Technical field
The invention belongs to the field of computer storage technology, and more particularly relates to a duplicate data detection method based on locality optimization.
Background technology
With the rapid development of information technology, information has become a precious resource on which we depend and a major driving force behind productivity growth. The wide application of information technology is accompanied by the generation of massive data, and ever more valuable data needs to be stored. How to effectively improve the storage efficiency of existing storage media and meet ever-growing storage demand has therefore become one of the urgent problems in the storage research field. Meanwhile, an IDC survey reports that about 75% of existing data is redundant, i.e., only 25% of the data is unique. In this context, data deduplication, a new technique for detecting and eliminating redundancy at large scale, has become a research hotspot in academia and industry in recent years, and is increasingly widely applied in all kinds of information storage systems.
Detecting repeated fingerprints is the key technical means for realizing data deduplication. Existing deduplication techniques mainly detect duplicate data by fingerprint detection: the fingerprint (hash value) of a data block is extracted, and whether the block is a duplicate is identified by checking whether its fingerprint repeats. Basic fingerprint detection methods usually use data structures such as a single hash table or a B-tree to identify repeated fingerprints.
However, a problem of the above fingerprint detection methods that cannot be ignored is their relatively low detection performance: they cannot perform efficient duplicate data detection on large data sets, which affects the overall efficiency of data deduplication.
The content of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a duplicate data detection method based on locality optimization, whose purpose is to solve the technical problem that existing fingerprint-based duplicate data detection methods have relatively low detection performance and cannot perform effective duplicate data detection on large data sets.
To achieve the above object, according to one aspect of the invention, a duplicate data detection method based on locality optimization is provided, comprising the following steps:
(1) Obtain a fingerprint list file, extract part of the fingerprints from the fingerprint list file and store them in a cache, and take one fingerprint from the cache;
(2) Query whether the fingerprint may already be recorded in the Bloom filter; if it may be recorded, go to step (4), otherwise go to step (3);
(3) Insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), take the next fingerprint from the cache, and return to step (2);
(4) Judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (5);
(5) Judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) Look up the hash bucket address table by the fingerprint to try to obtain the corresponding hash bucket ID; if no ID is obtained, conclude that the fingerprint is a new fingerprint, take the next fingerprint from the cache and return to step (2); if an ID is obtained, go to step (7);
(7) Traverse all hash buckets in the cold zone of the hash bucket read buffer by the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID. If there is, search for the fingerprint in that hash bucket, take the next fingerprint from the cache, and return to step (2). Otherwise, load the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket of the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint. Then take the next fingerprint from the cache and return to step (2).
Preferably, the Bloom filter is created in the initialization phase, where
the optimal bit vector size m of the Bloom filter is:
m = log2e × log2(1/ε) × C
and the number of hash functions is:
k = ⌈ln2 × (m/C)⌉
where C denotes the duplicate data block index capacity and ε denotes the false positive rate of the Bloom filter.
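As an illustration (not part of the patent text), the sizing formulas above can be evaluated directly. The function name `bloom_sizing` and the example values of C and ε are assumptions for this sketch:

```python
import math

def bloom_sizing(capacity: int, error_rate: float):
    """Compute bit-vector size m and hash count k for a Bloom filter.

    m = log2(e) * log2(1/eps) * C   (the patent's sizing formula)
    k = ceil(ln(2) * m / C)         (the optimal hash-function count)
    """
    m = math.ceil(math.log2(math.e) * math.log2(1 / error_rate) * capacity)
    k = math.ceil(math.log(2) * m / capacity)
    return m, k

m, k = bloom_sizing(capacity=1_000_000, error_rate=1e-5)
```

With C = 10^6 and ε = 10^-5 this yields roughly 24 million bits (about 3 MB of bit vector) and 17 hash functions.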
Preferably, in step (2), querying whether the Bloom filter may have recorded the extracted fingerprint X proceeds as follows: compute hashi(X) for each hash function (1 ≤ i ≤ k, where k is the number of hash functions); if the bitwise AND of the bit vector positions indexed by hash1(X), hash2(X), …, hashk(X) equals 0, i.e. at least one of those bits is 0, then the Bloom filter has not recorded fingerprint X and X is a new fingerprint; otherwise fingerprint X may have been recorded.
Preferably, a hash bucket is a container holding fingerprints, with a preferred capacity of 192 to 512 fingerprints per bucket. The hash bucket write buffer is realized in the initialization phase by allocating free memory space, and its preferred size is 4 to 128 hash buckets. A first list and a second list are set up in the hash bucket write buffer, and each node in the lists consists of a hash bucket.
Preferably, the fingerprint is inserted as follows: (a) compute the values hashi(X) of the k hash functions, and set to 1 the bit at offset hashi(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash bucket lists in the hash bucket write buffer and judge whether some list has a bucket that is not full; when a non-full bucket is found, store the fingerprint in the first non-full bucket, and write the fingerprint and the hash bucket's ID into the hash bucket address table. If all hash buckets in a list are found to be full, lock that list and write all its hash buckets to disk; after the write completes, empty and unlock the list, where emptying a list means emptying all hash buckets in the list and allocating a new hash bucket ID for each hash bucket. If all buckets in both lists are already full, the fingerprint insertion operation has to wait until some list finishes writing and is emptied before the insertion can be performed.
Preferably, in step (4), the hash bucket read buffer is a cache space set up in memory in the initialization phase. It is organized as a linked list of hash buckets, and the preferred size of the hash bucket read buffer is 1024 to 2048 hash buckets.
According to another aspect of the invention, a duplicate data detection system based on locality optimization is provided, comprising:
a first module for obtaining a fingerprint list file, extracting part of the fingerprints from the fingerprint list file and storing them in a cache, and taking one fingerprint from the cache;
a second module for querying whether the fingerprint may already be recorded in the Bloom filter; if it may be recorded, control passes to the fourth module, otherwise to the third module;
a third module for inserting the fingerprint into the Bloom filter and the hash bucket write buffer, taking the next fingerprint from the cache, and returning to the second module;
a fourth module for judging whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, taking the next fingerprint from the cache and returning to the second module, otherwise passing to the fifth module;
a fifth module for judging whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, taking the next fingerprint from the cache and returning to the second module, otherwise passing to the sixth module;
a sixth module for looking up the hash bucket address table by the fingerprint to try to obtain the corresponding hash bucket ID; if no ID is obtained, concluding that the fingerprint is a new fingerprint, taking the next fingerprint from the cache and returning to the second module; if an ID is obtained, passing to the seventh module;
a seventh module for traversing all hash buckets in the cold zone of the hash bucket read buffer by the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID; if there is, searching for the fingerprint in that hash bucket, taking the next fingerprint from the cache, and returning to the second module; otherwise loading the hash bucket corresponding to the hash bucket ID from disk, inserting it as the first hash bucket of the hot zone of the hash bucket read buffer, and searching for the fingerprint in the inserted hash bucket, where finding it means the fingerprint is an existing fingerprint and not finding it means the fingerprint is a new fingerprint; then taking the next fingerprint from the cache and returning to the second module.
In general, compared with the prior art, the above technical scheme of the invention achieves the following beneficial effects:
1. The invention solves the technical problem that existing fingerprint detection methods have relatively low detection performance and cannot perform effective duplicate data detection on large data sets: because the invention adopts step (2), the pre-judgment of the Bloom filter effectively reduces the number of fingerprint detections and effectively improves the retrieval performance for repeated fingerprints.
2. The invention adopts steps (3) to (7), makes full use of the characteristics of the data set itself, and uses data prefetching and caching techniques to perform three-level duplicate data detection on the hot zone and cold zone of the buffer and on the disk according to different conditions. It fully mines the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves the detection efficiency of duplicate data.
Brief description of the drawings
Fig. 1 is the logical structure diagram of the invention;
Fig. 2 is the data structure of Bloom filter;
Fig. 3 is Hash bucket address table structural representation;
Fig. 4 is Hash bucket read buffer structural representation;
Fig. 5 is the schematic diagram of the duplicate data detection method of the invention based on local optimization.
Embodiment
In order to make the objects, technical schemes and advantages of the invention clearer, the invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with each other as long as they do not conflict.
The invention proposes an efficient fast detection technique for repeated fingerprints. It is mainly aimed at duplicate data detection on data set types with strong locality, and improves the performance of duplicate data detection through an optimization strategy of one-level pre-judgment plus three-level detection, realized with a Bloom filter and caching techniques.
The basic idea of the invention is that, for potentially duplicate data in a data set, the repeatability of a data block is first pre-judged with a Bloom filter; then, according to different conditions, three-level duplicate data detection is performed on the hot zone and cold zone of the buffer and on the disk, making full use of the locality in duplicate data and improving the detection efficiency of duplicate data.
The basic logical structure of the invention is shown in Fig. 1. It consists of six parts: the fingerprint cache table, the Bloom filter, the hash bucket address table, the hash bucket write buffer, the hash bucket read buffer, and the hash buckets on disk.
To explain the invention clearly, the terms appearing in this specification are explained and illustrated as follows:
Fingerprint list: the set of fingerprints formed, in processing order, by chunking a data set and extracting a fingerprint from each chunk.
Fingerprint cache table: the fingerprint cache table is used to cache the fingerprints in the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints are read in at once and stored in the fingerprint cache area. The system takes fingerprints out of the fingerprint cache table one by one and performs the repeated-fingerprint lookup.
Bloom filter: as shown in Fig. 2, a Bloom filter consists of a bit vector of length m bits and k independent hash functions hi(x) (1 ≤ i ≤ k, k < m). It is a random data structure with very high space efficiency; it represents a set with the bit vector and can judge whether an element belongs to that set. To represent a set S = {x1, x2, x3, …, xn}, all positions in the bit vector are first initialized to 0. Then, for each element xj (1 ≤ j ≤ n) in S, the k independent hash functions hi(x) are applied to obtain k hash values hi(xj) (1 ≤ i ≤ k, xj ∈ S). Taking the first bit of the bit vector as the origin and these k hash values as offsets, xj is mapped to k positions of the bit vector drawn from {1, 2, …, m}, and these positions are set to 1, so that xj is marked. After all elements of S have been marked, the set S is represented by the Bloom filter; if a position is set to 1 repeatedly, only the first setting takes effect.
To determine whether a data element y belongs to the set S, the k independent hash functions hi(x) are first applied to y to obtain k hash values hi(y). Taking the first bit of the bit vector as the origin and these k hash values as offsets, check whether all the corresponding positions in the Bloom filter's bit vector are 1; if so, y may belong to S, otherwise y is definitely not an element of S.
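The mark-and-query procedure just described can be sketched as a minimal Bloom filter. Deriving the k hash values from a single SHA-1 digest (double hashing) is an implementation choice made here for brevity; the patent only requires k independent hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit vector and k derived hashes."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initialized to 0

    def _positions(self, item: bytes):
        # Double hashing: k positions from one SHA-1 digest (an assumption
        # of this sketch, not something the patent specifies).
        d = hashlib.sha1(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)  # mark the element

    def may_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))
```

A queried fingerprint whose k positions are not all 1 is guaranteed to be new, which is exactly the pre-judgment used in step (2).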
Because the hash functions hi(x) (1 ≤ i ≤ k) may collide for two different elements, an element y not in S may be mapped to positions in the bit vector that were all set by elements of S, so the Bloom filter may err when making a positive judgment. The probability that a Bloom filter mistakes an element outside S for an element of S is called the false positive probability (False Positive Probability), abbreviated as the false positive rate (Error Rate). The false positive probability can be controlled by mathematical means.
Given the cardinality n of the set S, the length m of the Bloom filter's bit vector, and the number k of its hash functions, the probability that a given bit of the bit vector is still 0 after n elements have been inserted is (1 − 1/m)^(k×n). On the other hand, the Bloom filter makes a false positive judgment for a new element y only when all the positions corresponding to y have already been set to 1, from which the false positive probability fBF can be deduced:
fBF = (1 − (1 − 1/m)^(k×n))^k ≈ (1 − e^(−k×n/m))^k.
It can be derived that when k = ⌈ln2 × (m/n)⌉, the Bloom filter has the minimum false positive probability, called the ideal false positive rate and denoted FBF; at this point about 50% of the positions in the Bloom filter's bit vector are "1". The symbol ⌈ln2 × (m/n)⌉ denotes the smallest positive integer not less than ln2 × (m/n).
Further, if n is known and one wishes to design a Bloom filter whose ideal false positive rate does not exceed a given upper bound ε, it can be derived that m must satisfy:
m ≥ log2e × log2(1/ε) × n.
If m = log2e × log2(1/ε) × n and k = ⌈ln2 × (m/n)⌉, then the false positive probability grows to ε exactly when all n elements have been inserted into the Bloom filter; n is therefore also called the design capacity of the Bloom filter.
From the above analysis, the bit vector length m and the number of hash functions k of a Bloom filter can be computed from the design capacity n and the false positive rate upper bound ε. The design capacity n is the number of elements the Bloom filter is expected to mark. When a Bloom filter has marked fewer than n elements, it is a non-full Bloom filter; a non-full Bloom filter can both continue to mark new elements and be queried about whether some element has been marked in it. When the number of marked elements in a Bloom filter reaches n, the Bloom filter is full: it cannot continue to mark new elements, but it can still serve queries (n ≤ m).
Hash bucket: the hash bucket is the basic unit of fingerprint storage and of cache swap-in and swap-out. One hash bucket stores a fixed number of independent fingerprints (an independent fingerprint is a fingerprint whose value differs from all other fingerprints).
Hash bucket write buffer: the hash bucket write buffer is a buffer area opened up in memory that caches new fingerprints before they are written to disk. Because storing a new fingerprint and writing a hash bucket back to disk cannot operate on the same hash bucket at the same time, the hash bucket write buffer is designed as two hash bucket lists to avoid conflicts on the critical resource. When all hash buckets in one list are full, that list is locked and all its hash buckets are written to disk. Because these hash buckets are written once, they are generally written to the same track on disk, and the fingerprints in them preserve data locality within a certain spatial range; this makes the subsequent read-prefetch operation possible. After the write completes, the list's buckets are all cleared, while new fingerprints continue to be stored in the hash buckets of the other list. If the disk write of one list has not completed and the hash buckets of the other list are already full, the insertion must wait for the disk write to complete. When a fingerprint is stored into the hash bucket write buffer, a hash bucket ID is assigned and the hash bucket address table is updated at the same time.
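A minimal sketch of the double-list write buffer just described, with the disk write reduced to a callback. The names, the toy `BUCKET_CAPACITY` of 4 (the patent prefers 192 to 512 fingerprints per bucket), and the omission of real locking are all assumptions of this sketch:

```python
BUCKET_CAPACITY = 4  # toy value; the patent prefers 192-512 fingerprints/bucket

class WriteBuffer:
    """Two lists of hash buckets. When every bucket in one list is full,
    that list is flushed to disk as a whole and then cleared, while new
    fingerprints keep landing in the other list."""

    def __init__(self, buckets_per_list: int, flush_to_disk):
        self.lists = [[[] for _ in range(buckets_per_list)] for _ in range(2)]
        self.flush_to_disk = flush_to_disk  # callback receiving a full list

    def insert(self, fingerprint: str):
        # Poll both lists for the first non-full bucket.
        for lst in self.lists:
            for bucket in lst:
                if len(bucket) < BUCKET_CAPACITY:
                    bucket.append(fingerprint)
                    self._maybe_flush(lst)
                    return
        # In the real system, insertion blocks here until a flush completes.

    def _maybe_flush(self, lst):
        if all(len(b) == BUCKET_CAPACITY for b in lst):
            self.flush_to_disk([b[:] for b in lst])  # write whole list to disk
            for b in lst:
                b.clear()  # empty the list; new bucket IDs would be assigned
```

Flushing a whole list at once is what lets all its buckets land on the same disk track, preserving the spatial locality that the read prefetch later exploits.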
Hash bucket address table: the hash bucket address table is a key-value hash table residing in memory that holds the mapping from a fingerprint key to the ID of the hash bucket containing the fingerprint. Its role is to quickly locate the hash bucket storing a fingerprint when searching for that fingerprint on disk. The concrete structure of the hash bucket address table is shown in Fig. 3: a fingerprint is 20 bytes long, a bucket ID (Bucket ID) is 4 bytes, and a pointer (Pointer) takes 8 bytes. Hash table storage can suffer hash collisions; when a collision occurs, it is handled by chaining.
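A sketch of the address table's lookup contract. A plain Python dict already resolves collisions internally, standing in for the chained hash table of Fig. 3; the class and method names are hypothetical:

```python
class BucketAddressTable:
    """Fingerprint key -> hash bucket ID mapping (sketch).

    The patent stores 20-byte fingerprints and 4-byte bucket IDs and
    resolves collisions by chaining; a dict gives the same lookup contract.
    """

    def __init__(self):
        self._table = {}

    def record(self, fingerprint: bytes, bucket_id: int):
        self._table[fingerprint] = bucket_id

    def lookup(self, fingerprint: bytes):
        # None means the fingerprint is on no disk bucket -> it is new.
        return self._table.get(fingerprint)
```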
Hash bucket read buffer: the hash bucket read buffer is a block of memory space opened up to cache hash buckets read from disk. To improve the efficiency of the disk index, part of the fingerprint index table (hash buckets) on disk is read into the hash bucket read buffer in memory. The hash bucket read buffer is organized as a doubly linked list; each node of the list stores one hash bucket, with the structure shown in Fig. 4. Each list node records the ID of the hash bucket and a flag bit indicating whether the hash bucket is a "dirty bucket". Logically the hash bucket read buffer consists of two parts: the first several nodes form the hot zone, and the nodes from there to the tail form the cold zone. The hot/cold division is used to optimize retrieval performance.
As shown in Fig. 5, the duplicate data detection method based on locality optimization of the invention comprises the following steps:
(1) Obtain a fingerprint list file, extract part of the fingerprints from the fingerprint list file (the size of this part equals the size of the space designated for storing fingerprints in the cache) and store them in the cache, and take one fingerprint from the cache; after all fingerprints in the cache have been taken, a new part of the fingerprints is read from the fingerprint list file and stored in the cache;
(2) Query whether the fingerprint may already be recorded in the Bloom filter; if it may be recorded, go to step (4), otherwise go to step (3).
Specifically, the Bloom filter is created in the initialization phase according to the following procedure:
the optimal bit vector size m of the Bloom filter is
m = log2e × log2(1/ε) × C
and the number of hash functions is
k = ⌈ln2 × (m/C)⌉
where C denotes the duplicate data block index capacity and ε denotes the false positive rate of the Bloom filter, whose value is not higher than the threshold 0.00001.
In this step, judging whether the Bloom filter may have recorded the extracted fingerprint X proceeds as follows: compute hashi(X) for each hash function (1 ≤ i ≤ k, where k is the number of hash functions); if the bitwise AND of the bit vector positions indexed by hash1(X), hash2(X), …, hashk(X) equals 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded.
(3) Insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), take the next fingerprint from the cache, and return to step (2).
The hash bucket in this step is a container holding fingerprints; its size can be any value, preferably 192 to 512 fingerprints per bucket. The hash bucket write buffer is realized in the initialization phase by allocating free memory space; its size can be any value, preferably 4 to 128 hash buckets.
In the initialization phase, two lists (a first list and a second list) are set up in the hash bucket write buffer; every node in the lists consists of a hash bucket.
Specifically, inserting the fingerprint into the Bloom filter and the hash bucket write buffer in this step proceeds as follows: (a) compute the values hashi(X) of the k hash functions and set to 1 the bit at offset hashi(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash bucket lists in the hash bucket write buffer and judge whether some list has a bucket that is not full; when a non-full bucket is found, store the fingerprint in the first non-full bucket (that hash bucket is then called the hot bucket), and write the fingerprint and the hash bucket's ID into the hash bucket address table. If all hash buckets in a list are found to be full, lock that list and write all its hash buckets to disk; after the write completes, empty and unlock the list, where emptying a list means emptying all hash buckets in the list and allocating a new hash bucket ID for each hash bucket. If all buckets in both lists are already full, the fingerprint insertion has to wait until some list finishes writing and is emptied before it can be performed.
The above hash bucket address table is an address list created in memory in the initialization phase; it describes, in key-value form, the mapping between a fingerprint and the hash bucket ID of the hash bucket storing that fingerprint.
(4) Judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (5).
Specifically, the hash bucket read buffer (Cache) is a cache space set up in memory in the initialization phase. It is organized as a linked list of hash buckets; its size can be any value, preferably 1024 to 2048 hash buckets. The one or more hash buckets at the front of the list are called the hot zone, and the remaining hash buckets are called the cold zone.
(5) Judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (6).
(6) Look up the hash bucket address table by the fingerprint to try to obtain the corresponding hash bucket ID; if no ID is obtained, conclude that the fingerprint is a new fingerprint, take the next fingerprint from the cache and return to step (2); if an ID is obtained, go to step (7).
(7) Traverse all hash buckets in the cold zone of the hash bucket read buffer by the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID. If there is, search for the fingerprint in that hash bucket, take the next fingerprint from the cache, and return to step (2). Otherwise, load the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket of the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint; if not, the fingerprint is a new fingerprint. Then take the next fingerprint from the cache and return to step (2).
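Steps (2) through (7) above can be tied together in one detection routine per fingerprint. This is a sketch under assumed component interfaces (`may_contain`, `in_hot_zone`, `cold_bucket`, and the other names are hypothetical), not the patent's implementation:

```python
def detect(fingerprint, bloom, read_cache, write_buffer_hot, address_table,
           load_bucket_from_disk):
    """Three-level duplicate detection for one fingerprint, following
    steps (2)-(7). Returns True for a duplicate, False for a new fingerprint."""
    if not bloom.may_contain(fingerprint):          # step (2): pre-judgment
        bloom.add(fingerprint)                      # step (3): record new fp
        write_buffer_hot.add(fingerprint)
        return False
    if read_cache.in_hot_zone(fingerprint):         # step (4): read-cache hot zone
        return True
    if fingerprint in write_buffer_hot:             # step (5): write-buffer hot bucket
        return True
    bucket_id = address_table.lookup(fingerprint)   # step (6): address table
    if bucket_id is None:
        return False                                # Bloom false positive -> new
    bucket = read_cache.cold_bucket(bucket_id)      # step (7): read-cache cold zone
    if bucket is not None:
        return fingerprint in bucket
    bucket = load_bucket_from_disk(bucket_id)       # step (7): last resort, disk
    read_cache.insert_hot(bucket_id, bucket)        # prefetch into the hot zone
    return fingerprint in bucket
```

Each level is cheaper than the next: the Bloom filter screens out most new fingerprints, the two memory hot zones absorb locality hits, and only the remainder pays for a disk read, whose bucket is then promoted so that neighboring fingerprints hit in memory.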
The invention has the following beneficial effects. First, because the invention adopts step (2), the pre-judgment of the Bloom filter effectively reduces the number of fingerprint detections and effectively improves the retrieval performance for repeated fingerprints. Second, because the invention adopts steps (3) to (7), it makes full use of the characteristics of the data set itself and uses data prefetching and caching techniques to perform three-level duplicate data detection on the hot zone and cold zone of the buffer and on the disk according to different conditions; it fully mines the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves the detection efficiency of duplicate data.
It will be readily appreciated by those skilled in the art that the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (7)

1. a kind of duplicate data detection method based on local optimization, it is characterised in that comprise the following steps:
(1) obtaining a fingerprint list file, extracting a portion of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;
(2) querying a Bloom filter to determine whether the extracted fingerprint may have been recorded therein; if it may have been recorded, proceeding to step (4); otherwise proceeding to step (3);
(3) inserting the fingerprint into the Bloom filter and into a hash-bucket write buffer (Buffer), extracting the next fingerprint from the cache, and returning to step (2);
(4) determining whether the fingerprint has been recorded in a hot zone of a hash-bucket read cache; if so, extracting the next fingerprint from the cache and returning to step (2); otherwise proceeding to step (5);
(5) determining whether the fingerprint has been recorded in hot buckets of the hash-bucket write buffer; if so, extracting the next fingerprint from the cache and returning to step (2); otherwise proceeding to step (6);
(6) looking up a hash-bucket address table according to the fingerprint to determine whether a corresponding hash-bucket ID can be obtained; if it cannot be obtained, determining that the fingerprint is a new fingerprint, extracting the next fingerprint from the cache, and returning to step (2); if it can be obtained, proceeding to step (7);
(7) traversing all hash buckets in a cold zone of the hash-bucket read cache according to the obtained hash-bucket ID to determine whether a hash bucket corresponding to the ID exists; if so, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to step (2); otherwise, reading the hash bucket corresponding to the ID from disk and inserting it as the first hash bucket in the hot zone of the hash-bucket read cache, then searching for the fingerprint in the inserted bucket, wherein if the fingerprint is found it is an existing fingerprint and if it is not found it is a new fingerprint; then extracting the next fingerprint from the cache and returning to step (2).
2. The duplicate-data detection method according to claim 1, characterized in that the Bloom filter is created in an initialization phase, and:
the optimal bit-vector size m of the Bloom filter is:
m = log2(e) × log2(1/ε) × C
the number of hash functions is:
k = log2(1/ε)
wherein C denotes the duplicate-data block index capacity and ε denotes the false positive rate of the Bloom filter.
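Under those two formulas the filter can be sized directly. A small sketch, rounding up to whole bits and hash functions; the capacity and false-positive rate used below are illustrative values only:

```python
import math

def bloom_parameters(capacity, false_positive_rate):
    """Optimal Bloom filter sizing per the claim:
    m = log2(e) * log2(1/eps) * C bits, k = log2(1/eps) hash functions."""
    bits_per_entry = math.log2(math.e) * math.log2(1.0 / false_positive_rate)
    m = math.ceil(bits_per_entry * capacity)              # bit-vector size
    k = math.ceil(math.log2(1.0 / false_positive_rate))   # hash-function count
    return m, k

# Example: index capacity of one million blocks at a 1% false positive rate.
m, k = bloom_parameters(capacity=1_000_000, false_positive_rate=0.01)
```

At ε = 0.01 this yields roughly 9.6 bits per indexed block and 7 hash functions, which matches the standard Bloom-filter optimum.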
3. The duplicate-data detection method according to claim 1, characterized in that, in step (2), whether the extracted fingerprint X may have been recorded in the Bloom filter is determined as follows: for the hash functions hashi(X) with 1 ≤ i ≤ k, where k denotes the number of hash functions, if hash1(X) & hash2(X) & … & hashk(X) = 0, that is, if at least one of the k bit positions addressed by the hash values holds a 0, then the fingerprint X is not recorded in the Bloom filter and is a new fingerprint; otherwise, the fingerprint X may have been recorded.
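The membership test of claim 3 amounts to ANDing the k probed bits of the bit vector. A toy sketch; the hash family derived from Python's built-in `hash` is a placeholder, not the patent's hash functions:

```python
class TinyBloom:
    """Bit-vector Bloom filter: X may be present only if ALL k probed bits are 1."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity

    def _positions(self, x):
        # Placeholder hash family: k positions derived from Python's hash.
        return [hash((x, i)) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.bits[p] = 1

    def may_contain(self, x):
        # hash1(X) & hash2(X) & ... & hashk(X): if any probed bit is 0,
        # X was definitely never inserted, i.e. it is a new fingerprint.
        return all(self.bits[p] for p in self._positions(x))
```

A `False` answer is definitive (new fingerprint); a `True` answer only means the fingerprint *may* be present, which is why the method proceeds to the cache and disk checks.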
4. The duplicate-data detection method according to claim 1, characterized in that:
a hash bucket is a container for holding fingerprints, with a preferred capacity of 192 to 512 fingerprints per bucket;
the hash-bucket write buffer is created in an initialization phase by allocating free memory space, with a preferred size of 4 to 128 hash buckets;
a first list and a second list are provided in the hash-bucket write buffer, and each node in the lists consists of a hash bucket.
5. The duplicate-data detection method according to claim 1, characterized in that inserting the fingerprint into the Bloom filter and the hash-bucket write buffer specifically comprises: (a) computing the values hashi(X) of the k hash functions, and setting to 1 the bit at offset hashi(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash-bucket lists in the write buffer to determine whether either list contains a non-full bucket; when a non-full bucket is found, storing the fingerprint in the first non-full bucket and writing the fingerprint together with the bucket's hash-bucket ID into the hash-bucket address table; if all hash buckets in one list are found to be full, locking that list, writing all of its hash buckets to disk, and, after the write completes, emptying and unlocking the list, where emptying a list means clearing all hash buckets in the list and assigning each hash bucket a new hash-bucket ID; if all buckets in both lists are full, the fingerprint insertion must wait until one list has been written to disk and emptied before the insertion is performed.
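The double-buffered insertion of claim 5 can be sketched as two bucket lists, one filling while a full one is flushed and re-identified. The list length and bucket capacity below are illustrative, not the preferred values of claim 4, and the eager synchronous flush stands in for the locked asynchronous write:

```python
class DoubleBufferedWriteCache:
    """Sketch of the two-list write buffer: fingerprints go into the first
    non-full bucket; a list whose buckets are all full is written to 'disk',
    emptied, and its buckets are assigned fresh bucket IDs."""

    def __init__(self, buckets_per_list=1, bucket_capacity=2):
        self.bucket_capacity = bucket_capacity
        self._next_id = 0
        self.lists = [self._fresh_list(buckets_per_list) for _ in range(2)]
        self.addr_table = {}   # fingerprint -> bucket ID
        self.disk = {}         # bucket ID -> flushed fingerprints

    def _fresh_list(self, n):
        # Emptying a list: clear every bucket and assign each a new bucket ID.
        lst = []
        for _ in range(n):
            lst.append({"id": self._next_id, "fps": []})
            self._next_id += 1
        return lst

    def insert(self, fp):
        for idx, lst in enumerate(self.lists):        # poll the two lists
            for bucket in lst:
                if len(bucket["fps"]) < self.bucket_capacity:
                    bucket["fps"].append(fp)          # first non-full bucket
                    self.addr_table[fp] = bucket["id"]
                    if all(len(b["fps"]) == self.bucket_capacity for b in lst):
                        self._flush(idx)              # list became full: flush it
                    return

    def _flush(self, idx):
        # Write all buckets of the list to "disk", then empty and re-ID it.
        for bucket in self.lists[idx]:
            self.disk[bucket["id"]] = list(bucket["fps"])
        self.lists[idx] = self._fresh_list(len(self.lists[idx]))
```

Because this sketch flushes a list eagerly as soon as it fills, the "both lists full" wait case of claim 5 never arises here; in the real design the flush is asynchronous, so insertion may have to block on it.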
6. The duplicate-data detection method according to claim 1, characterized in that, in step (4), the hash-bucket read cache is a cache space allocated in memory in an initialization phase and is organized as a linked list of multiple hash buckets, with a preferred size of 1024 to 2048 hash buckets.
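The read cache of claim 6 can be sketched as a bounded deque of buckets whose front acts as the hot zone, with a bucket fetched from disk inserted at the head as in step (7). The capacity, hot-zone split point, and tail eviction here are illustrative assumptions, not part of the claim:

```python
from collections import deque

class BucketReadCache:
    """Bounded linked list of hash buckets; the front portion acts as the
    hot zone and the remainder as the cold zone."""

    def __init__(self, capacity=8, hot_size=2):
        self.capacity, self.hot_size = capacity, hot_size
        self.buckets = deque()   # each entry: (bucket_id, set_of_fingerprints)

    def hot_zone(self):
        return list(self.buckets)[: self.hot_size]

    def cold_zone(self):
        return list(self.buckets)[self.hot_size :]

    def insert_hot(self, bucket_id, fps):
        # A bucket fetched from disk becomes the first bucket of the hot zone.
        self.buckets.appendleft((bucket_id, set(fps)))
        if len(self.buckets) > self.capacity:
            self.buckets.pop()   # evict from the cold tail
```

In the real design the cache holds on the order of 1024 to 2048 buckets; the tiny sizes used in testing are for illustration only.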
7. A duplicate-data detection system based on locality optimization, characterized by comprising:
a first module for obtaining a fingerprint list file, extracting a portion of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;
a second module for querying a Bloom filter to determine whether the extracted fingerprint may have been recorded therein; if it may have been recorded, passing control to the fourth module; otherwise passing control to the third module;
a third module for inserting the fingerprint into the Bloom filter and a hash-bucket write buffer, extracting the next fingerprint from the cache, and returning to the second module;
a fourth module for determining whether the fingerprint has been recorded in a hot zone of a hash-bucket read cache; if so, extracting the next fingerprint from the cache and returning to the second module; otherwise passing control to the fifth module;
a fifth module for determining whether the fingerprint has been recorded in hot buckets of the hash-bucket write buffer; if so, extracting the next fingerprint from the cache and returning to the second module; otherwise passing control to the sixth module;
a sixth module for looking up a hash-bucket address table according to the fingerprint to determine whether a corresponding hash-bucket ID can be obtained; if it cannot be obtained, determining that the fingerprint is a new fingerprint, extracting the next fingerprint from the cache, and returning to the second module; if it can be obtained, passing control to the seventh module;
a seventh module for traversing all hash buckets in a cold zone of the hash-bucket read cache according to the obtained hash-bucket ID to determine whether a hash bucket corresponding to the ID exists; if so, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to the second module; otherwise, reading the hash bucket corresponding to the ID from disk and inserting it as the first hash bucket in the hot zone of the hash-bucket read cache, then searching for the fingerprint in the inserted bucket, wherein if the fingerprint is found it is an existing fingerprint and if it is not found it is a new fingerprint; then extracting the next fingerprint from the cache and returning to the second module.
CN201710555589.5A 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization Active CN107391034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710555589.5A CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710555589.5A CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Publications (2)

Publication Number Publication Date
CN107391034A true CN107391034A (en) 2017-11-24
CN107391034B CN107391034B (en) 2019-05-10

Family

ID=60335524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710555589.5A Active CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Country Status (1)

Country Link
CN (1) CN107391034B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944038A (en) * 2017-12-14 2018-04-20 上海达梦数据库有限公司 A kind of generation method and device of duplicate removal data
CN108459826A (en) * 2018-02-01 2018-08-28 杭州宏杉科技股份有限公司 A kind of method and device of processing I/O Request
CN109101365A (en) * 2018-08-01 2018-12-28 南京壹进制信息技术股份有限公司 A kind of data backup and resume method deleted again based on source data
CN109240605A (en) * 2018-08-17 2019-01-18 华中科技大学 A kind of quick repeated data block identifying method stacking memory based on 3D
CN109471635A (en) * 2018-09-03 2019-03-15 中新网络信息安全股份有限公司 A kind of algorithm optimization method realized based on Java Set set
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109783523A (en) * 2019-01-24 2019-05-21 广州虎牙信息科技有限公司 A kind of data processing method, device, equipment and storage medium
CN110046164A (en) * 2019-04-16 2019-07-23 中国人民解放军国防科技大学 Index independent grain distribution filter, consistency grain distribution filter and operation method
CN110489405A (en) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 The method, apparatus and server of data processing
CN111338581A (en) * 2020-03-27 2020-06-26 尹兵 Data storage method and device based on cloud computing, cloud server and system
CN112800430A (en) * 2021-02-01 2021-05-14 苏州棱镜七彩信息科技有限公司 Safety and compliance management method suitable for open source assembly
CN113721862A (en) * 2021-11-02 2021-11-30 腾讯科技(深圳)有限公司 Data processing method and device
US20230221864A1 (en) * 2022-01-10 2023-07-13 Vmware, Inc. Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253820A (en) * 2011-06-16 2011-11-23 华中科技大学 Stream type repetitive data detection method
CN102591946A (en) * 2010-12-28 2012-07-18 微软公司 Using index partitioning and reconciliation for data deduplication
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN103870514A (en) * 2012-12-18 2014-06-18 华为技术有限公司 Repeating data deleting method and device
US20140188912A1 (en) * 2012-12-28 2014-07-03 Fujitsu Limited Storage apparatus, control method, and computer product
CN103970744A (en) * 2013-01-25 2014-08-06 华中科技大学 Extendible repeated data detection method
CN103970875A (en) * 2014-05-15 2014-08-06 华中科技大学 Parallel repeated data deleting method
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
US20160335024A1 (en) * 2015-05-15 2016-11-17 ScaleFlux Assisting data deduplication through in-memory computation
CN106293525A (en) * 2016-08-05 2017-01-04 上海交通大学 A kind of method and system improving caching service efficiency
CN106610790A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Repeated data deleting method and device
CN106649346A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data repeatability check method and apparatus
US20170177266A1 (en) * 2015-12-21 2017-06-22 Quantum Corporation Data aware deduplication object storage (dados)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591946A (en) * 2010-12-28 2012-07-18 微软公司 Using index partitioning and reconciliation for data deduplication
CN102253820A (en) * 2011-06-16 2011-11-23 华中科技大学 Stream type repetitive data detection method
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN103870514A (en) * 2012-12-18 2014-06-18 华为技术有限公司 Repeating data deleting method and device
US20140188912A1 (en) * 2012-12-28 2014-07-03 Fujitsu Limited Storage apparatus, control method, and computer product
CN103970744A (en) * 2013-01-25 2014-08-06 华中科技大学 Extendible repeated data detection method
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN103970875A (en) * 2014-05-15 2014-08-06 华中科技大学 Parallel repeated data deleting method
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
US20160335024A1 (en) * 2015-05-15 2016-11-17 ScaleFlux Assisting data deduplication through in-memory computation
CN106610790A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Repeated data deleting method and device
CN106649346A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data repeatability check method and apparatus
US20170177266A1 (en) * 2015-12-21 2017-06-22 Quantum Corporation Data aware deduplication object storage (dados)
CN106293525A (en) * 2016-08-05 2017-01-04 上海交通大学 A kind of method and system improving caching service efficiency

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PANFENG ZHANG: "Resemblance and mergence based indexing for high performance data deduplication", 《JOURNAL OF SYSTEMS AND SOFTWARE》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944038A (en) * 2017-12-14 2018-04-20 上海达梦数据库有限公司 A kind of generation method and device of duplicate removal data
CN107944038B (en) * 2017-12-14 2020-11-10 上海达梦数据库有限公司 Method and device for generating deduplication data
CN108459826A (en) * 2018-02-01 2018-08-28 杭州宏杉科技股份有限公司 A kind of method and device of processing I/O Request
CN108459826B (en) * 2018-02-01 2020-12-29 杭州宏杉科技股份有限公司 Method and device for processing IO (input/output) request
CN109101365A (en) * 2018-08-01 2018-12-28 南京壹进制信息技术股份有限公司 A kind of data backup and resume method deleted again based on source data
CN109240605A (en) * 2018-08-17 2019-01-18 华中科技大学 A kind of quick repeated data block identifying method stacking memory based on 3D
CN109471635A (en) * 2018-09-03 2019-03-15 中新网络信息安全股份有限公司 A kind of algorithm optimization method realized based on Java Set set
CN109740037A (en) * 2019-01-02 2019-05-10 山东省科学院情报研究所 The distributed online real-time processing method of multi-source, isomery fluidised form big data and system
CN109740037B (en) * 2019-01-02 2023-11-24 山东省科学院情报研究所 Multi-source heterogeneous flow state big data distributed online real-time processing method and system
CN109783523A (en) * 2019-01-24 2019-05-21 广州虎牙信息科技有限公司 A kind of data processing method, device, equipment and storage medium
CN110046164B (en) * 2019-04-16 2021-07-02 中国人民解放军国防科技大学 Operation method of consistent valley filter
CN110046164A (en) * 2019-04-16 2019-07-23 中国人民解放军国防科技大学 Index independent grain distribution filter, consistency grain distribution filter and operation method
CN110489405A (en) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 The method, apparatus and server of data processing
WO2021008024A1 (en) * 2019-07-12 2021-01-21 平安科技(深圳)有限公司 Data processing method and apparatus, and server
CN110489405B (en) * 2019-07-12 2024-01-12 平安科技(深圳)有限公司 Data processing method, device and server
CN111338581A (en) * 2020-03-27 2020-06-26 尹兵 Data storage method and device based on cloud computing, cloud server and system
CN112800430A (en) * 2021-02-01 2021-05-14 苏州棱镜七彩信息科技有限公司 Safety and compliance management method suitable for open source assembly
CN113721862A (en) * 2021-11-02 2021-11-30 腾讯科技(深圳)有限公司 Data processing method and device
US20230221864A1 (en) * 2022-01-10 2023-07-13 Vmware, Inc. Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table

Also Published As

Publication number Publication date
CN107391034B (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN107391034B (en) A kind of repeated data detection method based on local optimization
CN103377137B (en) The frequent block strengthened detection is used to carry out the method and system of storage duplicate removal
Bertino et al. Indexing techniques for advanced database systems
US7418544B2 (en) Method and system for log structured relational database objects
CN102831222B (en) Differential compression method based on data de-duplication
US20090240655A1 (en) Bit String Seacrching Apparatus, Searching Method, and Program
Leung Mining uncertain data
CN103597450B (en) Memory with the metadata being stored in a part for storage page
CN110377747B (en) Knowledge base fusion method for encyclopedic website
CN107515931A (en) A kind of duplicate data detection method based on cluster
CN104462582A (en) Web data similarity detection method based on two-stage filtration of structure and content
US8086641B1 (en) Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same
CN103229164B (en) Data access method and device
CN107291858B (en) Data indexing method based on character string suffix
CN107451233A (en) Storage method of the preferential space-time trajectory data file of time attribute in auxiliary storage device
CN103500183A (en) Storage structure based on multiple-relevant-field combined index and building, inquiring and maintaining method
CN113901279B (en) Graph database retrieval method and device
Liu et al. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search
CN113961754B (en) Graph database system based on persistent memory
Su-Cheng et al. Node labeling schemes in XML query optimization: a survey and trends
US7987205B1 (en) Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations
CN106371765A (en) Method for removing memory thrashing through efficient LTL ((Linear Temporal Logic) model detection of large-scale system
Tao et al. Validity information retrieval for spatio-temporal queries: Theoretical performance bounds
CN110489448A (en) The method for digging of big data correlation rule based on Hadoop
Suei et al. A signature-based Grid index design for main-memory RFID database applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant