CN107391034B

CN107391034B - A duplicate data detection method based on locality optimization

Info

Publication number: CN107391034B
Application number: CN201710555589.5A
Authority: CN
Inventors: 王桦; 周可; 张攀峰
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-07-07
Filing date: 2017-07-07
Publication date: 2019-05-10
Anticipated expiration: 2037-07-07
Also published as: CN107391034A

Abstract

The invention discloses a duplicate data detection method based on locality optimization, belonging to the field of computer storage technology. It addresses the low detection efficiency of existing duplicate data detection methods, a problem that worsens as the volume of stored data grows. The method comprises Bloom filter detection, hash bucket write buffer detection, hash bucket read buffer detection, and hash bucket address table detection steps. The invention mainly targets data sets with strong locality: by exploiting the locality in the data set, it improves the efficiency of data prefetching, reduces disk access overhead, and raises the throughput of data deduplication. For possible duplicate data in the data set, the invention first uses a Bloom filter to prejudge whether a data block is a duplicate, and then, depending on the outcome, performs three-level duplicate data detection against the hot zone of the cache, the cold zone of the cache, and the disk, making full use of the locality in duplicate data to improve detection efficiency.

Description

A duplicate data detection method based on locality optimization

Technical field

The invention belongs to the field of computer storage technology and, more particularly, relates to a duplicate data detection method based on locality optimization.

Background

With the rapid development of information technology, information has become a precious resource on which we depend and a major driver of productivity growth. The widespread application of information technology is accompanied by the generation of massive amounts of data, and more and more valuable data is being stored. How to improve the storage efficiency of existing storage media to meet ever-growing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, an IDC survey report shows that about 75% of existing data is redundant, i.e., only 25% of data is unique. Against this background, data deduplication, a technique for detecting and eliminating redundancy over a large space, has become a research hotspot in academia and industry in recent years and is being widely applied in various information storage systems.

Detecting duplicate fingerprints is an important means of implementing data deduplication. Existing deduplication techniques mainly detect duplicate data by fingerprint detection: the fingerprint (hash value) of each data block is extracted, and whether a data block is a duplicate is determined by checking whether its fingerprint has been seen before. Basic duplicate fingerprint detection methods usually identify duplicate fingerprints using a data structure such as a single hash table or a B-tree.

However, a non-negligible problem of the above fingerprint detection methods is their low detection performance: they cannot detect duplicates effectively on large data sets, which degrades the overall efficiency of data deduplication.

Summary of the invention

In view of the above defects of, or needs for improvement in, the prior art, the present invention provides a duplicate data detection method based on locality optimization. Its purpose is to solve the technical problem that existing fingerprint-based duplicate data detection methods have low detection performance and cannot detect duplicates effectively on large data sets.

To achieve the above object, according to one aspect of the present invention, a duplicate data detection method based on locality optimization is provided, comprising the following steps:

(1) obtain a fingerprint list file, read part of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;

(2) query whether the extracted fingerprint may have been recorded in the Bloom filter; if it may have been recorded, go to step (4), otherwise go to step (3);

(3) insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), extract the next fingerprint from the cache, and return to step (2);

(4) judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (5);

(5) judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (6);

(6) look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is judged to be a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if it can be obtained, go to step (7);

(7) traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether a hash bucket corresponding to that ID is present; if so, search for the fingerprint in that hash bucket, then extract the next fingerprint from the cache and return to step (2); otherwise, read the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint; then extract the next fingerprint from the cache and return to step (2).
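The seven steps above can be sketched end to end as follows. This is a minimal single-threaded illustration, not the patented implementation: the salted SHA-1 hash functions, the tiny sizes, the per-bucket flush to an in-memory `disk` dictionary, and all names (`detect`, `store`, `offsets`) are assumptions made for the example. In step (6) the text only says the fingerprint is judged new; storing it at that point is a further assumption of this sketch, so that a Bloom filter false positive does not lose the fingerprint.

```python
import hashlib

M_BITS, K_HASHES = 1 << 16, 4      # illustrative Bloom filter size
BUCKET_CAPACITY = 2                # illustrative; the text prefers 192 to 512
CACHE_CAPACITY, HOT_SIZE = 4, 1    # illustrative read-buffer sizes


def offsets(fp: bytes) -> list:
    """k bit offsets for a fingerprint (salted SHA-1 is an assumption)."""
    return [int.from_bytes(hashlib.sha1(bytes([i]) + fp).digest()[:8], "big") % M_BITS
            for i in range(K_HASHES)]


def detect(fingerprints) -> list:
    """Return one bool per fingerprint: True if it is judged a duplicate."""
    bits = bytearray(M_BITS // 8)
    hot = {"id": 0, "fps": []}       # hot bucket of the write buffer
    address_table, disk = {}, {}     # fingerprint -> bucket ID; bucket ID -> set
    read_buffer = []                 # [(bucket_id, set)], head at index 0
    results = []

    def store(fp: bytes) -> None:
        """Step (3): mark in the Bloom filter and append to the hot bucket."""
        for o in offsets(fp):
            bits[o // 8] |= 1 << (o % 8)
        hot["fps"].append(fp)
        address_table[fp] = hot["id"]
        if len(hot["fps"]) == BUCKET_CAPACITY:   # simplification: flush per bucket
            disk[hot["id"]] = set(hot["fps"])
            hot["id"], hot["fps"] = hot["id"] + 1, []

    for fp in fingerprints:
        # Step (2): Bloom filter prejudgment.
        if not all(bits[o // 8] & (1 << (o % 8)) for o in offsets(fp)):
            store(fp)
            results.append(False)
            continue
        # Step (4): hot zone of the read buffer.
        if any(fp in fps for _, fps in read_buffer[:HOT_SIZE]):
            results.append(True)
            continue
        # Step (5): hot bucket of the write buffer.
        if fp in hot["fps"]:
            results.append(True)
            continue
        # Step (6): bucket address table lookup.
        bucket_id = address_table.get(fp)
        if bucket_id is None:
            store(fp)                # assumption: keep the new fingerprint
            results.append(False)
            continue
        # Step (7): cold zone of the read buffer, else load from disk to the head.
        cached = next((fps for bid, fps in read_buffer[HOT_SIZE:] if bid == bucket_id), None)
        if cached is None:
            cached = disk[bucket_id]
            read_buffer.insert(0, (bucket_id, cached))
            del read_buffer[CACHE_CAPACITY:]     # assumption: evict from the tail
        results.append(fp in cached)
    return results
```

Flushing each bucket as soon as it fills keeps the sketch short; the two-list write buffer described later in the text batches these writes instead.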

Preferably, the Bloom filter is created in the initialization phase, where the optimal bit-vector length m of the Bloom filter is

$m = \log_2 e \times \log_2(1/\varepsilon) \times C$

and the number of hash functions is

$k = \lceil \ln 2 \times (m/C) \rceil$

where C denotes the capacity of the duplicate data block index and ε denotes the false positive rate of the Bloom filter.

Preferably, judging in step (2) whether the extracted fingerprint X may have been recorded in the Bloom filter is specifically as follows: for the hash functions hash_i(X), where 1≤i≤k and k denotes the number of hash functions, if the bits of the bit vector at offsets hash_i(X) satisfy hash₁(X) & hash₂(X) & ... & hash_k(X) = 0, i.e., at least one of these bits is 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded.

Preferably, a hash bucket is a container for storing fingerprints, with a size of 192 to 512 fingerprints per bucket. The hash bucket write buffer is created in the initialization phase by allocating free memory, and its size is 4 to 128 hash buckets. The hash bucket write buffer contains a first list and a second list, and each node in either list is a hash bucket.

Preferably, inserting the fingerprint into the Bloom filter and the hash bucket write buffer is specifically as follows: (a) compute the values hash_i(X) of the k independent hash functions and set to 1 the bits of the bit vector at offsets hash_i(X), where k is the number of hash functions and 1≤i≤k; (b) poll the two hash bucket lists in the hash bucket write buffer to judge whether a list contains a bucket that is not full; when non-full buckets are found, store the fingerprint in the first non-full bucket and write the fingerprint together with the hash bucket ID of that bucket into the hash bucket address table. If all hash buckets in a list are found to be full, lock the list, write all hash buckets in the list to disk, and after the write completes, empty and unlock the list, where emptying a list means emptying all hash buckets in the list and assigning a new hash bucket ID to each hash bucket. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before it is executed.

Preferably, in step (4), the hash bucket read buffer is a cache space set up in memory in the initialization phase. It consists of a linked list of hash buckets, and its size is 1024 to 2048 hash buckets.

According to another aspect of the present invention, a duplicate data detection system based on locality optimization is provided, comprising:

a first module, configured to obtain a fingerprint list file, read part of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;

a second module, configured to query whether the extracted fingerprint may have been recorded in the Bloom filter; if it may have been recorded, control passes to the fourth module, otherwise to the third module;

a third module, configured to insert the fingerprint into the Bloom filter and the hash bucket write buffer, extract the next fingerprint from the cache, and return to the second module;

a fourth module, configured to judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, the next fingerprint is extracted from the cache and control returns to the second module, otherwise control passes to the fifth module;

a fifth module, configured to judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, the next fingerprint is extracted from the cache and control returns to the second module, otherwise control passes to the sixth module;

a sixth module, configured to look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is judged to be a new fingerprint, the next fingerprint is extracted from the cache, and control returns to the second module; if it can be obtained, control passes to the seventh module;

a seventh module, configured to traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether a hash bucket corresponding to that ID is present; if so, the fingerprint is searched for in that hash bucket, the next fingerprint is extracted from the cache, and control returns to the second module; otherwise, the hash bucket corresponding to the hash bucket ID is read from disk and inserted as the first hash bucket in the hot zone of the hash bucket read buffer, and the fingerprint is searched for in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint; the next fingerprint is then extracted from the cache and control returns to the second module.

In general, compared with the prior art, the above technical solution conceived by the present invention achieves the following beneficial effects:

1. The invention solves the technical problem that existing fingerprint detection methods have low detection performance and cannot detect duplicates effectively on large data sets: because the invention adopts step (2), the prejudgment by the Bloom filter effectively reduces the number of fingerprint lookups and improves the retrieval performance for duplicate fingerprints.

2. Because the invention adopts steps (3) to (7), it makes full use of the characteristics of the data set itself. Using data prefetching and caching, it performs three-level duplicate data detection against the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions. This fully exploits the locality in duplicate data, improves the accuracy of data prefetching, reduces the number of disk accesses, and further improves the efficiency of duplicate data detection.

Brief description of the drawings

Fig. 1 is the logical structure diagram of the invention;

Fig. 2 is the data structure of the Bloom filter;

Fig. 3 is a schematic diagram of the structure of the hash bucket address table;

Fig. 4 is a schematic diagram of the structure of the hash bucket read buffer;

Fig. 5 is a schematic diagram of the duplicate data detection method based on locality optimization of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

The invention proposes an efficient technique for fast detection of duplicate fingerprints. It mainly targets duplicate data detection on data sets with strong locality and uses a Bloom filter and caching to implement an optimization strategy of one level of prejudgment followed by three levels of detection, thereby improving the performance of duplicate data detection.

The basic idea of the invention is as follows: for possible duplicate data in the data set, a Bloom filter first prejudges whether a data block is a duplicate; then, according to different conditions, three-level duplicate data detection is performed against the hot zone of the cache, the cold zone of the cache, and the disk, making full use of the locality in duplicate data to improve detection efficiency.

The basic logical structure of the invention is shown in Fig. 1. It consists of six parts: the fingerprint cache table, the Bloom filter, the hash bucket address table, the hash bucket write buffer, the hash bucket read buffer, and the hash buckets on disk.

To describe the invention clearly, the terms used in this specification are explained below:

Fingerprint list: the set of fingerprints obtained by chunking a data set and extracting the fingerprint of each chunk, arranged in processing order.

Fingerprint cache table: the fingerprint cache table caches fingerprints from the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints is read in at once and stored in the fingerprint cache. The system takes fingerprints out of the fingerprint cache table one by one and looks each one up for duplicates.

Bloom filter: as shown in Fig. 2, a Bloom filter consists of a bit vector of length m bits and k independent hash functions $h_i(x)$ ($1 \le i \le k$, $k < m$). It is a highly space-efficient randomized data structure that represents a set with a bit vector and can judge whether an element belongs to that set. To represent a set $S = \{x_1, x_2, x_3, \ldots, x_n\}$, all bits of the bit vector are first initialized to 0. Then, for each element $x_j$ ($1 \le j \le n$) of S, the k mutually independent hash functions $h_i(x)$ are applied to obtain k hash values $h_i(x_j)$ ($1 \le i \le k$, $x_j \in S$). Taking the first bit of the bit vector as the starting point and these k hash values as offsets, $x_j$ is mapped to k positions in the bit vector {1, 2, ..., m}; these positions are set to 1, and $x_j$ is then said to be marked. Once all elements of S have been marked, the set S is represented by the Bloom filter. If a position is set to 1 more than once, only the first setting has any effect.

To determine whether a data element y belongs to the set S, the k mutually independent hash functions $h_i(x)$ are first applied to y to obtain k hash values $h_i(y)$. Taking the first bit of the bit vector as the starting point and these k hash values as offsets, the corresponding positions in the bit vector of the Bloom filter are checked: if all of them are 1, y may belong to S; otherwise y is definitely not an element of S.
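The marking and membership test described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify the hash functions, so the k functions here are derived by salting SHA-1 with the index i, and the class and method names (`BloomFilter`, `add`, `might_contain`) are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit vector of m bits with k hash functions (salted SHA-1 is an assumption)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)   # all bits start at 0

    def _offsets(self, element: bytes):
        # One offset per hash function h_i, 0 <= i < k (k must stay below 256 here).
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + element).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, element: bytes) -> None:
        """Mark an element by setting its k positions to 1."""
        for off in self._offsets(element):
            self.bits[off // 8] |= 1 << (off % 8)

    def might_contain(self, element: bytes) -> bool:
        """True if all k positions are 1 (possibly a false positive); False is definite."""
        return all(self.bits[off // 8] & (1 << (off % 8))
                   for off in self._offsets(element))
```

A `False` result is always reliable, which is why the method can skip every later lookup for fingerprints the filter rejects; a `True` result only means the fingerprint may have been seen.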

Because the hash functions $h_i(x)$ ($1 \le i \le k$) may collide for any two different elements, for example when the positions to which y maps in the bit vector have already been set by elements of S other than y, a Bloom filter can err when it gives a positive answer. The probability that the Bloom filter misjudges an element outside S as an element of S is called the false positive probability, or the false positive rate for short. The false positive probability can be controlled mathematically.

Given the cardinality n of the set S, the length m of the Bloom filter bit vector, and the number k of hash functions, the probability that a given bit of the bit vector is still 0 after n elements have been inserted is $(1 - 1/m)^{kn}$. On the other hand, the Bloom filter makes a false positive judgment only when all positions corresponding to some new element y have already been set to 1, from which the false positive probability $f_{BF}$ follows:

$f_{BF} = \left(1 - (1 - 1/m)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$

It can be derived that when $k = \lceil \ln 2 \times (m/n) \rceil$, the Bloom filter has its smallest false positive probability, called the ideal false positive rate and denoted $F_{BF}$; at this point about 50% of the bits in the bit vector of the Bloom filter are 1. The symbol $\lceil \ln 2 \times (m/n) \rceil$ denotes the smallest integer not less than $\ln 2 \times (m/n)$.

Further, if n is known and a Bloom filter is to be designed whose ideal false positive rate does not exceed a given upper limit ε, it can be derived that m must satisfy:

$m \ge \log_2 e \times \log_2(1/\varepsilon) \times n$
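The bound can be recovered in a few lines. The following is a sketch of the standard derivation, starting from the approximation $f_{BF} \approx (1 - e^{-kn/m})^k$ and treating k as a continuous variable, which is an assumption made only to simplify the minimization:

```latex
\begin{align*}
\ln f_{BF} &\approx k \,\ln\!\bigl(1 - e^{-kn/m}\bigr), \\
\frac{\partial \ln f_{BF}}{\partial k} = 0
  &\;\Longrightarrow\; e^{-kn/m} = \tfrac{1}{2}
  \;\Longrightarrow\; k = \ln 2 \times \frac{m}{n}, \\
F_{BF} &= \left(\tfrac{1}{2}\right)^{k} = 2^{-\ln 2 \times (m/n)}, \\
F_{BF} \le \varepsilon
  &\;\Longleftrightarrow\; \ln 2 \times \frac{m}{n} \ge \log_2(1/\varepsilon)
  \;\Longleftrightarrow\; m \ge \log_2 e \times \log_2(1/\varepsilon) \times n .
\end{align*}
```

The condition $e^{-kn/m} = 1/2$ is also the reason roughly half of the bits are 1 at the optimum, and the last step uses $1/\ln 2 = \log_2 e$.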

If $m = \log_2 e \times \log_2(1/\varepsilon) \times n$ and $k = \lceil \ln 2 \times (m/n) \rceil$, the false positive probability rises to ε if and only if all n elements have been inserted into the Bloom filter; n is therefore also called the design capacity of the Bloom filter. Here $\lceil x \rceil$ denotes the smallest integer not less than x.

From the above analysis, the bit-vector length m and the number of hash functions k of a Bloom filter can be calculated from the design capacity n and the false positive rate upper limit ε. The design capacity n is the number of elements the Bloom filter is expected to mark. When a Bloom filter has marked fewer than n elements, it is a non-full Bloom filter; a non-full Bloom filter can continue to mark new elements and can also be queried for whether an element has been marked. When the number of marked elements reaches n, the Bloom filter is full: it cannot mark further elements but can still answer queries. Here n ≤ m.

Hash bucket: the hash bucket is the basic unit of fingerprint storage and of swapping into and out of the cache. Each hash bucket stores a fixed number of independent fingerprints (an independent fingerprint is one whose value differs from that of every other fingerprint).

Hash bucket write buffer: the hash bucket write buffer is a buffer region opened in memory that caches new fingerprints before they are written to disk. Storing a new fingerprint into a hash bucket and writing that bucket back to disk cannot operate on the same hash bucket at the same time, so to avoid contention for this critical resource the hash bucket write buffer is designed as two hash bucket lists. When all hash buckets in one list are full, the list is locked and all of its hash buckets are written to disk. Because these hash buckets are written in one pass, they usually land on the same disk track, so their fingerprints preserve data locality within a certain spatial range, which makes subsequent read prefetching possible. After the write completes, all of these hash buckets are emptied, while new fingerprints continue to be stored in the hash buckets of the other list. If the disk write for one list has not completed when the hash buckets of the other list also become full, insertion must wait for the disk write to complete. When a fingerprint is stored in the hash bucket write buffer, it is assigned a hash bucket ID, and the hash bucket address table is updated at the same time.
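The two-list design can be sketched as below. This is a simplified, single-threaded illustration: the flush is synchronous, so the locking and waiting described above reduce to switching the active list, and an in-memory dictionary stands in for the disk. The names `Bucket`, `WriteBuffer`, `insert`, and the tiny sizes are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    bucket_id: int
    capacity: int
    fingerprints: list = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.fingerprints) >= self.capacity


class WriteBuffer:
    """Two lists of hash buckets; a full list is flushed while the other takes inserts."""

    def __init__(self, buckets_per_list: int = 2, bucket_capacity: int = 2) -> None:
        self.bucket_capacity = bucket_capacity
        self.next_id = 0
        self.address_table = {}   # fingerprint -> hash bucket ID
        self.disk = {}            # hash bucket ID -> fingerprints (stand-in for disk)
        self.lists = [[self._new_bucket() for _ in range(buckets_per_list)]
                      for _ in range(2)]
        self.active = 0           # index of the list currently receiving fingerprints

    def _new_bucket(self) -> Bucket:
        bucket = Bucket(self.next_id, self.bucket_capacity)
        self.next_id += 1
        return bucket

    def _flush(self, index: int) -> None:
        # Write every bucket of the list out, then empty it with fresh bucket IDs.
        for bucket in self.lists[index]:
            self.disk[bucket.bucket_id] = list(bucket.fingerprints)
        self.lists[index] = [self._new_bucket() for _ in self.lists[index]]

    def insert(self, fingerprint: bytes) -> int:
        """Store a fingerprint in the first non-full (hot) bucket; return its bucket ID."""
        bucket_list = self.lists[self.active]
        hot = next(b for b in bucket_list if not b.is_full())
        hot.fingerprints.append(fingerprint)
        self.address_table[fingerprint] = hot.bucket_id
        if all(b.is_full() for b in bucket_list):
            self._flush(self.active)       # a real system would lock and write in the background
            self.active = 1 - self.active  # new fingerprints go to the other list
        return hot.bucket_id
```

In a concurrent implementation the flush would run in the background while the other list absorbs inserts, which is the point of keeping two lists.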

Hash bucket address table: the hash bucket address table is a key-value hash table resident in memory that stores the mapping from each fingerprint key to the ID of the hash bucket holding that fingerprint. Its role is to locate quickly the hash bucket that stores a fingerprint when that fingerprint must be looked up on disk. The structure of the hash bucket address table is shown in Fig. 3. A fingerprint is 20 bytes long, a bucket ID is 4 bytes long, and a pointer (Pointer) occupies 8 bytes. Hash collisions may occur in the hash table; when a collision occurs, it is resolved by separate chaining.
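A chained table of this kind might look like the sketch below. It is an illustration only: the slot count, the use of the first 8 fingerprint bytes as the slot hash, and the names `BucketAddressTable`, `put`, and `get` are assumptions. The per-entry size follows from the field sizes given in the text (20 + 4 + 8 bytes).

```python
ENTRY_BYTES = 20 + 4 + 8   # fingerprint + bucket ID + chain pointer, per the text


class BucketAddressTable:
    """In-memory map from fingerprint to hash bucket ID, with separate chaining."""

    def __init__(self, slots: int = 1024) -> None:
        self.slots = [[] for _ in range(slots)]   # each slot holds a collision chain

    def _chain(self, fingerprint: bytes) -> list:
        # Assumption: fingerprints are already uniform, so their leading bytes pick the slot.
        return self.slots[int.from_bytes(fingerprint[:8], "big") % len(self.slots)]

    def put(self, fingerprint: bytes, bucket_id: int) -> None:
        chain = self._chain(fingerprint)
        for i, (key, _) in enumerate(chain):
            if key == fingerprint:
                chain[i] = (fingerprint, bucket_id)   # update an existing entry
                return
        chain.append((fingerprint, bucket_id))        # collision or empty slot: extend chain

    def get(self, fingerprint: bytes):
        """Return the bucket ID, or None if the fingerprint has no entry."""
        for key, bucket_id in self._chain(fingerprint):
            if key == fingerprint:
                return bucket_id
        return None
```

At 32 bytes per entry, an index of one million fingerprints needs about 32 MB for the entries alone, before slot-array overhead.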

Hash bucket read buffer: the hash bucket read buffer is a region opened in memory for caching hash buckets read from disk. To improve the efficiency of the on-disk index, part of the fingerprint index table (hash buckets) on disk is read into the hash bucket read buffer in memory. The hash bucket read buffer consists of a doubly linked list in which each node stores one hash bucket; its structure is shown in Fig. 4. Each node of the linked list records the ID of its hash bucket and a flag bit indicating whether the hash bucket is a "dirty bucket". Logically, the hash bucket read buffer consists of two parts: the first several nodes form the hot zone, and the part from there to the tail node forms the cold zone. The division into hot and cold zones serves to optimize retrieval performance.
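The hot and cold zones can be sketched with a double-ended queue standing in for the doubly linked list. The names (`ReadBuffer`, `in_hot_zone`, `find_in_cold_zone`, `load_from_disk`), the tiny sizes, and the policy of evicting from the tail when the buffer is full are assumptions for illustration; the text does not state an eviction policy, and the dirty-bucket flag is omitted here.

```python
from collections import deque


class ReadBuffer:
    """Linked list of cached hash buckets; the first hot_size nodes form the hot zone."""

    def __init__(self, capacity: int = 4, hot_size: int = 2) -> None:
        self.capacity = capacity
        self.hot_size = hot_size
        self.nodes = deque()   # each node: (bucket_id, frozenset of fingerprints)

    def in_hot_zone(self, fingerprint: bytes) -> bool:
        """Search for a fingerprint in the buckets at the head of the list."""
        hot = list(self.nodes)[:self.hot_size]
        return any(fingerprint in fps for _, fps in hot)

    def find_in_cold_zone(self, bucket_id: int):
        """Return the cached bucket with this ID from the cold zone, or None."""
        for bid, fps in list(self.nodes)[self.hot_size:]:
            if bid == bucket_id:
                return fps
        return None

    def load_from_disk(self, bucket_id: int, disk: dict) -> frozenset:
        """Insert a bucket read from disk as the first bucket of the hot zone."""
        fps = frozenset(disk[bucket_id])
        self.nodes.appendleft((bucket_id, fps))
        if len(self.nodes) > self.capacity:
            self.nodes.pop()   # assumption: the tail node is evicted
        return fps
```

Inserting at the head means a freshly loaded bucket is checked first for the fingerprints that follow, which suits data sets where duplicates arrive in the same order in which they were first stored.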

As shown in Fig. 5, the duplicate data detection method based on locality optimization of the present invention comprises the following steps:

(1) obtain a fingerprint list file, read part of the fingerprints from the fingerprint list file (the size of this part equals the size of the space designated in the cache for storing fingerprints) and store them in the cache, and extract one fingerprint from the cache; once all fingerprints in the cache have been extracted, read a new part of the fingerprints from the fingerprint list file and store it in the cache;

Specifically, the Bloom filter is created in the initialization phase as follows:

The optimal bit-vector length m of the Bloom filter is

$m = \log_2 e \times \log_2(1/\varepsilon) \times C$

and the number of hash functions is

$k = \lceil \ln 2 \times (m/C) \rceil$

where C denotes the capacity of the duplicate data block index and ε denotes the false positive rate of the Bloom filter, whose value is no higher than the threshold 0.00001.

Judging in this step whether the extracted fingerprint X may have been recorded in the Bloom filter is specifically as follows:

For the hash functions hash_i(X) (where 1≤i≤k and k denotes the number of hash functions), if the bits of the bit vector at offsets hash_i(X) satisfy hash₁(X) & hash₂(X) & ... & hash_k(X) = 0, i.e., at least one of these bits is 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded;

The hash bucket in this step is a container for storing fingerprints. Its size can be any value; the preferred value is 192 to 512 fingerprints per bucket. The hash bucket write buffer is created in the initialization phase by allocating free memory. Its size can be any value; the preferred value is 4 to 128 hash buckets.

In the initialization phase, two lists (a first list and a second list) are set up in the hash bucket write buffer, and every node of each list is a hash bucket.

Specifically, inserting the fingerprint into the Bloom filter and the hash bucket write buffer in this step is as follows: (a) compute the values hash_i(X) of the k independent hash functions and set to 1 the bits of the bit vector at offsets hash_i(X), where k is the number of hash functions and 1≤i≤k; (b) poll the two hash bucket lists in the hash bucket write buffer to judge whether a list contains a bucket that is not full; when non-full buckets are found, store the fingerprint in the first non-full bucket (this hash bucket is then called the hot bucket) and write the fingerprint together with the hash bucket ID of that bucket into the hash bucket address table. If all hash buckets in a list are found to be full, lock the list, write all hash buckets in the list to disk, and after the write completes, empty and unlock the list, where emptying a list means emptying all hash buckets in the list and assigning a new hash bucket ID to each hash bucket. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before it is executed.

The above hash bucket address table is an address list created in memory in the initialization phase. It describes, as key-value pairs, the mapping between each fingerprint and the hash bucket ID of the hash bucket that stores it.

Specifically, the hash bucket read buffer (Cache) is a cache space set up in memory in the initialization phase. It consists of a linked list of hash buckets. The size of the hash bucket read buffer can be any value; the preferred value is 1024 to 2048 hash buckets. The one or more hash buckets at the front of the linked list are called the hot zone, and the remaining hash buckets are called the cold zone.

The present invention has the following beneficial effects. First, because the invention adopts step (2), the prejudgment by the Bloom filter effectively reduces the number of fingerprint lookups and improves the retrieval performance for duplicate fingerprints. Second, because the invention adopts steps (3) to (7), it makes full use of the characteristics of the data set itself: using data prefetching and caching, it performs three-level duplicate data detection against the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions, which fully exploits the locality in duplicate data, improves the accuracy of data prefetching, reduces the number of disk accesses, and further improves the efficiency of duplicate data detection.

Those skilled in the art will readily understand that the foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A duplicate data detection method based on locality optimization, characterized by comprising the following steps:

(1) obtaining a fingerprint list file, reading part of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;

(2) querying whether the extracted fingerprint may have been recorded in the Bloom filter; if it may have been recorded, going to step (4), otherwise going to step (3);

(3) inserting the fingerprint into the Bloom filter and the hash bucket write buffer, extracting the next fingerprint from the cache, and returning to step (2); wherein inserting the fingerprint into the Bloom filter and the hash bucket write buffer is specifically: (a) computing the values hash_i(X) of the k independent hash functions and setting to 1 the bits of the bit vector at offsets hash_i(X), where k is the number of hash functions and 1≤i≤k; (b) polling the two hash bucket lists in the hash bucket write buffer to judge whether a list contains a bucket that is not full; when non-full buckets are found, storing the fingerprint in the first non-full bucket, this hash bucket then being called the hot bucket, and writing the fingerprint together with the hash bucket ID of that bucket into the hash bucket address table; if all hash buckets in a list are found to be full, locking the list, writing all hash buckets in the list to disk, and after the write completes, emptying and unlocking the list, wherein emptying a list means emptying all hash buckets in the list and assigning a new hash bucket ID to each hash bucket; if all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before it is executed;

(4) judging whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, extracting the next fingerprint from the cache and returning to step (2), otherwise going to step (5);

(5) judging whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, extracting the next fingerprint from the cache and returning to step (2), otherwise going to step (6);

(6) looking up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, judging the fingerprint to be a new fingerprint, extracting the next fingerprint from the cache, and returning to step (2); if it can be obtained, going to step (7);

(7) traversing all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether a hash bucket corresponding to that ID is present; if so, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to step (2); otherwise, reading the hash bucket corresponding to the hash bucket ID from disk, inserting it as the first hash bucket in the hot zone of the hash bucket read buffer, and searching for the fingerprint in the inserted hash bucket, wherein if it is found, the fingerprint is an existing fingerprint, and if it is not found, the fingerprint is a new fingerprint; then extracting the next fingerprint from the cache and returning to step (2); wherein the hash bucket read buffer logically consists of two parts: the first several nodes form the hot zone, and the part from there to the tail node forms the cold zone.

2. The duplicate data detection method according to claim 1, characterized in that the Bloom filter is created in the initialization phase, wherein

the optimal bit-vector length m of the Bloom filter is

$m = \log_2 e \times \log_2(1/\varepsilon) \times C$

and the number of hash functions is

$k = \lceil \ln 2 \times (m/C) \rceil$

3. The duplicate data detection method according to claim 1, characterized in that judging in step (2) whether the extracted fingerprint X may have been recorded in the Bloom filter is specifically: for the hash functions hash_i(X), where 1≤i≤k and k denotes the number of hash functions, if the bits of the bit vector at offsets hash_i(X) satisfy hash₁(X) & hash₂(X) & ... & hash_k(X) = 0, then fingerprint X has not been recorded in the Bloom filter and fingerprint X is a new fingerprint; otherwise fingerprint X may have been recorded.

4. The repeated data detection method according to claim 1, characterized in that

a Hash bucket is a container for holding fingerprints, with a capacity of 192 to 512 fingerprints per bucket;

the Hash bucket write buffer is created in the initialization phase by allocating free memory space, and its size is 4 to 128 Hash buckets;

a first list and a second list are provided in the Hash bucket write buffer, and each node in each list is a Hash bucket.
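The two-list write buffer of claim 4 can be sketched in Python as follows. The insertion policy follows the description given for the third module of the system claim: fill the first non-full bucket, record the fingerprint and bucket ID in the address table, and write out and empty a list once all of its buckets are full. The names HashBucket and WriteBuffer, the dictionary standing in for the disk, and the synchronous flush are assumptions made here; the text describes locking a list while it is written out, which this single-threaded sketch does not model.

```python
import itertools


class HashBucket:
    def __init__(self, bucket_id, capacity):
        self.bucket_id = bucket_id
        self.capacity = capacity
        self.fingerprints = []

    def is_full(self):
        return len(self.fingerprints) >= self.capacity


class WriteBuffer:
    def __init__(self, buckets_per_list=2, bucket_capacity=192):
        self._ids = itertools.count()
        self.bucket_capacity = bucket_capacity
        # First list and second list; every node is a Hash bucket.
        self.lists = [
            [HashBucket(next(self._ids), bucket_capacity) for _ in range(buckets_per_list)]
            for _ in range(2)
        ]
        self.address_table = {}   # fingerprint -> Hash bucket ID
        self.disk = {}            # Hash bucket ID -> fingerprints (stand-in for the disk)
        self.hot_bucket = None    # bucket that received the latest fingerprint

    def _flush(self, bucket_list):
        # Write every bucket of the list out, then empty it under a new bucket ID.
        for i, bucket in enumerate(bucket_list):
            self.disk[bucket.bucket_id] = list(bucket.fingerprints)
            bucket_list[i] = HashBucket(next(self._ids), self.bucket_capacity)

    def insert(self, fingerprint):
        for bucket_list in self.lists:
            for bucket in bucket_list:
                if not bucket.is_full():
                    bucket.fingerprints.append(fingerprint)
                    self.hot_bucket = bucket
                    self.address_table[fingerprint] = bucket.bucket_id
                    return bucket.bucket_id
            # All buckets in this list are full: flush it, then try the other list.
            self._flush(bucket_list)
        # Both lists were full and have just been flushed: retry the insertion.
        return self.insert(fingerprint)
```

With two lists, one list can keep accepting fingerprints while the other is being written out, which is presumably why the buffer is split; in a concurrent implementation the flush would run in the background under the list lock.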

5. The repeated data detection method according to claim 1, characterized in that, in step (4), the Hash bucket read buffer is a cache space set up in memory during the initialization phase and consists of a linked list of multiple Hash buckets, and the size of the Hash bucket read buffer is 1024 to 2048 Hash buckets.

6. A repeated data detection system based on local optimization, characterized by comprising:

a first module, configured to obtain a fingerprint list file, fetch a portion of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;

a second module, configured to query whether the Bloom filter may have recorded the extracted fingerprint; if it may have, control passes to the fourth module, otherwise control passes to the third module;

a third module, configured to insert the fingerprint into the Bloom filter and the Hash bucket write buffer, extract the next fingerprint from the cache, and return to the second module; wherein inserting the fingerprint into the Bloom filter and the Hash bucket write buffer is specifically: (a) computing the values hash_i(X) of the k independent hash functions and setting to 1 the bit at offset hash_i(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two Hash bucket lists in the Hash bucket write buffer to determine whether a list contains a bucket that is not full; when a non-full bucket is found, storing the fingerprint in the first non-full bucket, that Hash bucket then being referred to as the hot bucket, and writing the fingerprint and the Hash bucket ID of that Hash bucket into the Hash bucket address table; if all Hash buckets in a list are found to be full, locking the list and writing all Hash buckets in the list to disk, and after the write completes, emptying and unlocking the list, where emptying a list means clearing all Hash buckets in the list and assigning a new Hash bucket ID to each Hash bucket; if all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before it is performed;

a fourth module, configured to determine whether the fingerprint is recorded in the hot-zone of the Hash bucket read buffer; if so, the next fingerprint is extracted from the cache and control returns to the second module, otherwise control passes to the fifth module;

a fifth module, configured to determine whether the fingerprint is recorded in the hot bucket of the Hash bucket write buffer; if so, the next fingerprint is extracted from the cache and control returns to the second module, otherwise control passes to the sixth module;

a sixth module, configured to look up the Hash bucket address table according to the fingerprint to determine whether a corresponding Hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and control returns to the second module; if so, control passes to the seventh module;

a seventh module, configured to traverse all Hash buckets in the cold-zone of the Hash bucket read buffer according to the obtained Hash bucket ID, to determine whether a Hash bucket corresponding to the Hash bucket ID exists; if a corresponding Hash bucket exists, the fingerprint is searched for in that Hash bucket, the next fingerprint is extracted from the cache, and control returns to the second module; otherwise the Hash bucket corresponding to the Hash bucket ID is read from disk and inserted as the first Hash bucket in the hot-zone of the Hash bucket read buffer, and the fingerprint is searched for in the inserted Hash bucket, where finding it indicates that the fingerprint is an existing fingerprint and failing to find it indicates that the fingerprint is a new fingerprint, after which the next fingerprint is extracted from the cache and control returns to the second module; wherein the Hash bucket read buffer logically consists of two parts: the leading nodes form the hot-zone, and the nodes from there to the tail node form the cold-zone.
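The order in which the second to seventh modules hand a fingerprint to one another can be sketched in Python as follows. This is a deliberately simplified illustration: the function name detect, the stage labels it returns, and the plain sets, lists and dictionaries standing in for the Bloom filter, the buffers and the disk are all assumptions. The ExactFilter stand-in never yields false positives, and the write buffer is reduced to a single hot bucket with no address table update, so the sketch shows only the decision order, not the buffer management.

```python
class ExactFilter:
    """Stand-in for the Bloom filter (exact set, so no false positives)."""

    def __init__(self):
        self._seen = set()

    def may_contain(self, fingerprint):
        return fingerprint in self._seen

    def add(self, fingerprint):
        self._seen.add(fingerprint)


def detect(fingerprint, bloom, read_hot, write_hot_bucket, address_table, read_cold, disk):
    """Return (is_duplicate, stage) for one fingerprint.

    read_hot         -- list of (bucket_id, set): hot-zone of the read buffer
    write_hot_bucket -- set: fingerprints in the hot bucket of the write buffer
    address_table    -- dict: fingerprint -> Hash bucket ID
    read_cold        -- dict: bucket_id -> set, cold-zone of the read buffer
    disk             -- dict: bucket_id -> set, stand-in for on-disk buckets
    """
    if not bloom.may_contain(fingerprint):               # second module
        bloom.add(fingerprint)                           # third module
        write_hot_bucket.add(fingerprint)
        return False, "bloom"
    if any(fingerprint in fps for _, fps in read_hot):   # fourth module
        return True, "read-hot"
    if fingerprint in write_hot_bucket:                  # fifth module
        return True, "write-hot"
    bucket_id = address_table.get(fingerprint)           # sixth module
    if bucket_id is None:
        return False, "address-table"
    if bucket_id in read_cold:                           # seventh module, cold-zone hit
        return fingerprint in read_cold[bucket_id], "read-cold"
    fps = disk[bucket_id]                                # seventh module, disk read
    read_hot.insert(0, (bucket_id, fps))                 # becomes first hot-zone bucket
    return fingerprint in fps, "disk"
```

The cheapest checks come first, so a fingerprint reaches the disk only after the Bloom filter, both in-memory hot areas and the cold-zone have all failed to resolve it.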