CN107391034B - Duplicate data detection method based on locality optimization

Publication number
CN107391034B
Authority
CN
China
Prior art keywords
fingerprint
hash
hash bucket
bucket
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710555589.5A
Other languages
Chinese (zh)
Other versions
CN107391034A (en)
Inventor
王桦
周可
张攀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN201710555589.5A
Publication of CN107391034A
Application granted
Publication of CN107391034B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656 Data buffering arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a duplicate data detection method based on locality optimization, belonging to the field of computer storage technology. It addresses the low detection efficiency of existing duplicate data detection methods, which worsens as the volume of stored data grows. The method comprises four detection steps: Bloom filter detection, hash bucket write buffer detection, hash bucket read buffer detection, and hash bucket address table detection. The invention mainly targets data sets with strong locality: by mining the locality in the data set, it improves the efficiency of data prefetching, reduces disk access overhead, and raises the throughput of data deduplication. For potentially duplicate data in a data set, the invention first uses a Bloom filter to prejudge whether a data block is a duplicate, and then, depending on the condition met, performs three-level duplicate detection on the hot zone of the cache, the cold zone of the cache, and the disk, making full use of the locality in duplicate data to improve detection efficiency.

Description

Duplicate data detection method based on locality optimization
Technical field
The invention belongs to the field of computer storage technology and, more particularly, relates to a duplicate data detection method based on locality optimization.
Background art
With the rapid development of information technology, information has become a precious resource on which we depend and the greatest driver of rapid productivity growth. The widespread application of information technology is accompanied by the generation of massive amounts of data, and more and more valuable data is being stored. How to effectively improve the storage efficiency of existing storage media to meet ever-growing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, an investigation report by IDC shows that about 75% of existing data is redundant, i.e. only 25% of data is unique. Against this background, data deduplication, a technique for detecting and eliminating redundancy over a large spatial range, has become a research hotspot in academia and industry in recent years and is being applied ever more widely in information storage systems.
Detecting duplicate fingerprints is an important technical means of implementing data deduplication. In existing deduplication techniques, duplicate data is detected mainly by fingerprint detection: the fingerprint (hash value) of each data block is extracted, and the fingerprint is then checked for duplication to identify whether the data block is a duplicate. Basic duplicate fingerprint detection methods usually rely on a data structure such as a single hash table or a B-tree to identify duplicate fingerprints.
However, a non-negligible problem with the above fingerprint detection methods is their low detection performance: they cannot detect duplicate data effectively on large data sets, which degrades the overall efficiency of data deduplication.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a duplicate data detection method based on locality optimization. Its purpose is to solve the technical problem that existing fingerprint-based duplicate data detection methods have low detection performance and cannot detect duplicate data effectively on large data sets.
To achieve the above object, according to one aspect of the present invention, a duplicate data detection method based on locality optimization is provided, comprising the following steps:
(1) obtain the fingerprint list file, read a portion of the fingerprints from the fingerprint list file and store them in the cache, and extract one fingerprint from the cache;
(2) query whether the extracted fingerprint has been recorded in the Bloom filter; if it may have been recorded, go to step (4), otherwise go to step (3);
(3) insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), extract the next fingerprint from the cache, and return to step (2);
(4) judge whether the fingerprint has been recorded in the hot zone of the hash bucket read buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (5);
(5) judge whether the fingerprint has been recorded in the hot bucket of the hash bucket write buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if so, go to step (7);
(7) traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID; if there is, look up the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to step (2); otherwise, read the hash bucket corresponding to that ID from disk and insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and look up the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint, and if it is not found, the fingerprint is a new fingerprint; then extract the next fingerprint from the cache and return to step (2).
Preferably, the Bloom filter is created in the initialization phase, where
the optimal bit vector size m of the Bloom filter is:
m = log₂e × log₂(1/ε) × C
and the number of hash functions k is:
k = ⌈ln 2 × (m/C)⌉ = ⌈log₂(1/ε)⌉
where C denotes the duplicate data block index capacity and ε denotes the misjudgment rate of the Bloom filter.
Preferably, in step (2), judging whether the extracted fingerprint X has been recorded in the Bloom filter is specifically: if, for the hash functions hash_i(X), hash_1(X) & hash_2(X) & ... & hash_k(X) = 0, i.e. at least one of the bits at the k offsets given by the hash functions is 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded, where 1 ≤ i ≤ k and k denotes the number of hash functions.
Preferably, a hash bucket is a container for fingerprints whose size is 192 to 512 fingerprints per bucket. The hash bucket write buffer is implemented in the initialization phase by requesting free memory space, and its size is 4 to 128 hash buckets. A first list and a second list are set up in the hash bucket write buffer, and the nodes of each list consist of hash buckets.
Preferably, in step (3), inserting the fingerprint into the Bloom filter and the hash bucket write buffer is specifically: (a) compute the values hash_i(X) of the k independent hash functions and set to 1 the bits of the bit vector at offsets hash_i(X), where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash bucket lists in the hash bucket write buffer in turn to judge whether a list has a non-full bucket; when a non-full bucket is found, store the fingerprint in the first non-full bucket and write the fingerprint and the hash bucket ID of that bucket into the hash bucket address table; if all hash buckets in a list are found to be full, lock the list, write all hash buckets in the list to disk, and after the write completes clear and unlock the list, where clearing a list means emptying all hash buckets in it and assigning a new hash bucket ID to each hash bucket. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and cleared before it is performed.
Preferably, in step (4), the hash bucket read buffer is a cache space set up in memory in the initialization phase. It consists of a linked list of multiple hash buckets, and its size is 1024 to 2048 hash buckets.
According to another aspect of the present invention, a duplicate data detection system based on locality optimization is provided, comprising:
a first module for obtaining the fingerprint list file, reading a portion of the fingerprints from the fingerprint list file and storing them in the cache, and extracting one fingerprint from the cache;
a second module for querying whether the extracted fingerprint has been recorded in the Bloom filter, passing to the fourth module if it may have been recorded and otherwise passing to the third module;
a third module for inserting the fingerprint into the Bloom filter and the hash bucket write buffer, extracting the next fingerprint from the cache, and returning to the second module;
a fourth module for judging whether the fingerprint has been recorded in the hot zone of the hash bucket read buffer, and if so extracting the next fingerprint from the cache and returning to the second module, otherwise passing to the fifth module;
a fifth module for judging whether the fingerprint has been recorded in the hot bucket of the hash bucket write buffer, and if so extracting the next fingerprint from the cache and returning to the second module, otherwise passing to the sixth module;
a sixth module for looking up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and control returns to the second module; if so, control passes to the seventh module;
a seventh module for traversing all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID; if there is, looking up the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to the second module; otherwise reading the hash bucket corresponding to that ID from disk and inserting it as the first hash bucket in the hot zone of the hash bucket read buffer, and looking up the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint, and if it is not found, the fingerprint is a new fingerprint; then extracting the next fingerprint from the cache and returning to the second module.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
1. The present invention solves the technical problem that existing fingerprint detection methods have low detection performance and cannot detect duplicate data effectively on large data sets: because the invention adopts step (2), prejudgment by the Bloom filter effectively reduces the number of fingerprint lookups and improves the retrieval performance for duplicate fingerprints.
2. The present invention adopts steps (3) to (7), which make full use of the characteristics of the data set itself. Using data prefetching and caching, three-level duplicate detection is performed on the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions. This fully exploits the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves the efficiency of duplicate data detection.
Brief description of the drawings
Fig. 1 is the logical structure diagram of the present invention;
Fig. 2 is the data structure of the Bloom filter;
Fig. 3 is a schematic diagram of the hash bucket address table structure;
Fig. 4 is a schematic diagram of the hash bucket read buffer structure;
Fig. 5 is a schematic diagram of the duplicate data detection method based on locality optimization of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it. Moreover, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
The invention proposes an efficient technique for fast detection of duplicate fingerprints. It mainly targets duplicate data detection on data sets with strong locality, and uses a Bloom filter and caching to implement an optimization strategy of one level of prejudgment followed by three levels of detection, improving the performance of duplicate data detection.
The basic idea of the invention is as follows: for potentially duplicate data in a data set, a Bloom filter is first used to prejudge whether a data block is a duplicate; then, depending on the condition met, three-level duplicate detection is performed on the hot zone of the cache, the cold zone of the cache, and the disk, making full use of the locality in duplicate data to improve detection efficiency.
The basic logical structure of the invention is shown in Fig. 1. It consists mainly of six parts: the fingerprint cache table, the Bloom filter, the hash bucket address table, the hash bucket write buffer, the hash bucket read buffer, and the hash buckets on disk.
To explain the invention clearly, the terms used in this specification are explained below:
Fingerprint list: the set of fingerprints obtained by chunking a data set and extracting a fingerprint from each chunk, arranged in processing order.
Fingerprint cache table: the fingerprint cache table caches fingerprints from the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints are read in at once and stored in the fingerprint cache. The system takes fingerprints out of the fingerprint cache table one by one and looks each one up for duplication.
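The batched reading described for the fingerprint cache table can be sketched roughly as follows. This is only an illustration under stated assumptions: the fingerprint list is taken to be a binary stream of fixed-size 20-byte records (the 20-byte length is the fingerprint length given for the address table), and the function name `iter_fingerprints` and the `batch` parameter are hypothetical.

```python
import io


def iter_fingerprints(stream, fp_size=20, batch=1024):
    """Yield fingerprints one by one, refilling the cache one batch at a time.

    Assumption: `stream` is a binary file-like object holding fixed-size
    fingerprint records back to back.
    """
    while True:
        # Read up to `batch` fingerprints in one go (the "fingerprint cache").
        cache = stream.read(fp_size * batch)
        if not cache:
            return  # fingerprint list exhausted
        # Hand out the cached fingerprints one at a time; a trailing partial
        # record, if any, is ignored.
        for start in range(0, len(cache) - fp_size + 1, fp_size):
            yield cache[start:start + fp_size]


# Usage: three 20-byte fingerprints, read with a cache of two fingerprints.
data = bytes(range(60))
fingerprints = list(iter_fingerprints(io.BytesIO(data), batch=2))
```

Reading in batches keeps file I/O sequential while the detection loop still consumes one fingerprint at a time.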
Bloom filter: as shown in Fig. 2, a Bloom filter consists of a bit vector of length m bits and k independent hash functions h_i(x) (1 ≤ i ≤ k, k < m). It is a highly space-efficient randomized data structure that represents a set with a bit vector and can judge whether an element belongs to that set. To represent a set S = {x_1, x_2, x_3, ..., x_n}, all bits of the bit vector are first initialized to 0. Then, for each element x_j in S (1 ≤ j ≤ n), the k mutually independent hash functions h_i(x) are applied to obtain k hash values h_i(x_j) (1 ≤ i ≤ k, x_j ∈ S). Taking the first bit of the bit vector as the origin and the k hash values as offsets, x_j is mapped to k positions in the bit vector {1, 2, ..., m}; these positions are set to 1, and x_j is thereby marked. Once all elements of S have been marked, the set S is represented by the Bloom filter. If a position is set to 1 more than once, only the first time has any effect.
To determine whether a data element y belongs to the set S, the k mutually independent hash functions h_i(x) are first applied to y to obtain k hash values h_i(y). Taking the first bit of the bit vector as the origin and the k hash values as offsets, the corresponding positions in the bit vector of the Bloom filter are checked: if they are all 1, y may belong to S; otherwise y is determined not to be an element of S.
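The marking and membership test just described can be sketched as below. This is a minimal illustration, not the patent's implementation: the k hash functions are derived from salted SHA-1 digests, which is an assumption made here for convenience, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit vector of m bits plus k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits initialized to 0

    def _offsets(self, item):
        # Assumption: the i-th hash function is SHA-1 over a 4-byte salt i
        # followed by the item, reduced modulo m to give a bit offset.
        for i in range(self.k):
            digest = hashlib.sha1(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Mark the element: set the k mapped positions to 1.
        for off in self._offsets(item):
            self.bits[off // 8] |= 1 << (off % 8)

    def might_contain(self, item):
        # All k positions are 1 -> the element may be in the set;
        # any position is 0 -> the element is definitely not in the set.
        return all(self.bits[off // 8] & (1 << (off % 8))
                   for off in self._offsets(item))


bf = BloomFilter(m=1024, k=4)
before = bf.might_contain(b"fingerprint-1")
bf.add(b"fingerprint-1")
after = bf.might_contain(b"fingerprint-1")
```

A negative answer is always exact, which is why the detection method can treat it as proof that a fingerprint is new; only positive answers need further checking.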
Because the hash functions h_i(x) (1 ≤ i ≤ k) may produce hash collisions for any two different elements, for example the positions to which y is mapped in the bit vector may already have been set by elements of S other than y, the Bloom filter may err when it makes a positive judgment. The likelihood that the Bloom filter misjudges an element outside S as an element of S is called the false positive probability (False Positive Probability), also called the misjudgment rate (Error Rate) for short. The false positive probability can be controlled mathematically.
Given the cardinality n of the set S, the bit vector length m of the Bloom filter and the number k of its hash functions, the probability that a given bit of the bit vector is still 0 after n elements have been inserted is (1 - 1/m)^(k×n). On the other hand, the Bloom filter makes a false positive judgment exactly when all bits corresponding to some new element y have already been set to 1, so the false positive probability f_BF can be inferred as:
f_BF = (1 - (1 - 1/m)^(k×n))^k ≈ (1 - e^(-k×n/m))^k,
It can be derived that when k = ⌈ln 2 × (m/n)⌉, the Bloom filter has the smallest false positive probability, called the ideal misjudgment rate and denoted F_BF; at this point, about 50% of the bits in the bit vector of the Bloom filter are 1. The symbol ⌈ln 2 × (m/n)⌉ denotes the smallest integer not less than ln 2 × (m/n);
Further, if n is known and a Bloom filter is to be designed whose ideal misjudgment rate does not exceed a given upper bound ε, it can be derived that m must satisfy:
m ≥ log₂e × log₂(1/ε) × n,
If m = log₂e × log₂(1/ε) × n and k = ⌈log₂(1/ε)⌉, then the false positive probability rises to ε if and only if all n elements have been inserted into the Bloom filter; n is therefore also called the design capacity of the Bloom filter. Here ⌈log₂(1/ε)⌉ denotes the smallest integer not less than log₂(1/ε).
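For readers who want the intermediate steps, the optimum above can be recovered with the standard Bloom filter argument sketched below. It works with the exponential approximation of f_BF and ignores the rounding of k to an integer; the auxiliary symbol p is introduced here only for the derivation.

```latex
\begin{aligned}
f_{BF} &\approx \left(1 - e^{-kn/m}\right)^{k}, \qquad
  p := e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\ln p, \\
\ln f_{BF} &= k\,\ln(1-p) = -\frac{m}{n}\,\ln p\,\ln(1-p), \\
&\text{which is smallest at } p = \tfrac{1}{2}
  \;\Rightarrow\; k = \frac{m}{n}\ln 2, \qquad
  F_{BF} = 2^{-k} = 2^{-(m/n)\ln 2}, \\
F_{BF} \le \varepsilon
  &\iff \frac{m}{n}\ln 2 \ge \log_2\frac{1}{\varepsilon}
  \iff m \ge \log_2 e \cdot \log_2\frac{1}{\varepsilon} \cdot n .
\end{aligned}
```

The value p = 1/2 is the probability that a bit is still 0, which matches the statement that about half of the bits are 1 at the optimum.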
From the above analysis, the bit vector length m and the number of hash functions k of a Bloom filter can be calculated from the design capacity n and the misjudgment rate upper bound ε. The design capacity n is the number of elements the Bloom filter is expected to mark. When a Bloom filter has marked fewer than n elements, it is a non-full Bloom filter; a non-full Bloom filter can continue to mark new elements and can also be queried to check whether an element has been marked. When the number of marked elements reaches n, the Bloom filter is full: it can no longer mark new elements but can still answer queries. Here n ≤ m.
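As a rough numeric check of this sizing rule, the sketch below computes m and k from a design capacity and a misjudgment rate bound, then evaluates the false positive formula. The function names are hypothetical, and rounding both m and k up to integers is an assumption of this sketch.

```python
import math


def bloom_parameters(n, eps):
    """Return (m, k) for design capacity n and misjudgment rate bound eps."""
    # m = log2(e) * log2(1/eps) * n, rounded up to a whole number of bits.
    m = math.ceil(math.log2(math.e) * math.log2(1 / eps) * n)
    # k = ln 2 * (m / n), rounded up to a whole number of hash functions.
    k = math.ceil(math.log(2) * m / n)
    return m, k


def false_positive_probability(m, k, n):
    """f_BF = (1 - (1 - 1/m)^(k*n))^k after n insertions."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


# Example: one million fingerprints with the 0.00001 bound used in this text.
m, k = bloom_parameters(1_000_000, 1e-5)
p = false_positive_probability(m, k, 1_000_000)
bits_per_fingerprint = m / 1_000_000
```

With these inputs the filter needs roughly 24 bits per fingerprint, far less than storing the 20-byte fingerprints themselves; because k is rounded to an integer, the resulting probability lands very close to the bound rather than exactly on it.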
Hash bucket: the hash bucket is the basic unit of fingerprint storage and of swapping into and out of the cache. A hash bucket stores a fixed number of independent fingerprints (an independent fingerprint is one whose value differs from that of every other fingerprint).
Hash bucket write buffer: the hash bucket write buffer is a buffer area opened in memory that holds new fingerprints before they are written to disk. Storing a new fingerprint and writing a hash bucket back to disk cannot operate on the same hash bucket at the same time. To avoid this critical resource conflict, the hash bucket write buffer is designed as two hash bucket lists. When all hash buckets in one list are full, the list is locked and all of its hash buckets are written to disk. Because these hash buckets are written in one pass, they are usually written to the same track on disk, and their fingerprints retain data locality within a certain spatial range. This makes subsequent read prefetching possible. After the write completes, all of these hash buckets are emptied. Meanwhile, new fingerprints continue to be stored in the hash buckets of the other list. If the disk write of one list has not completed when the hash buckets of the other list are also full, insertion must wait for the disk write to finish. When a fingerprint is stored in the hash bucket write buffer, a hash bucket ID is assigned and the hash bucket address table is updated at the same time.
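The two-list design can be illustrated with the single-threaded sketch below. It is a simplification under stated assumptions: the flush to "disk" happens synchronously (so the locking and waiting described above are not modeled), disk is a plain dictionary, and all names (`WriteBuffer`, `insert`, `_flush`) are hypothetical.

```python
class WriteBuffer:
    """Two lists of hash buckets; one fills while the other is written out."""

    def __init__(self, buckets_per_list=4, bucket_capacity=192):
        self.capacity = bucket_capacity
        self.next_id = 0
        self.lists = [[self._new_bucket() for _ in range(buckets_per_list)]
                      for _ in range(2)]
        self.active = 0          # index of the list receiving new fingerprints
        self.disk = {}           # stand-in for disk: bucket ID -> fingerprints
        self.address_table = {}  # fingerprint -> bucket ID

    def _new_bucket(self):
        bucket = {"id": self.next_id, "fps": []}
        self.next_id += 1
        return bucket

    def insert(self, fp):
        buckets = self.lists[self.active]
        # Store the fingerprint in the first non-full bucket (the hot bucket).
        bucket = next(b for b in buckets if len(b["fps"]) < self.capacity)
        bucket["fps"].append(fp)
        self.address_table[fp] = bucket["id"]
        if all(len(b["fps"]) == self.capacity for b in buckets):
            full = self.active
            self.active ^= 1   # new fingerprints now go to the other list
            self._flush(full)  # a real system would do this under a lock
        return bucket["id"]

    def _flush(self, index):
        # Write every bucket of the full list out in one pass, then replace
        # the buckets with empty ones that carry new IDs.
        for bucket in self.lists[index]:
            self.disk[bucket["id"]] = bucket["fps"]
        self.lists[index] = [self._new_bucket() for _ in self.lists[index]]


wb = WriteBuffer(buckets_per_list=2, bucket_capacity=2)
ids = [wb.insert(fp) for fp in ["a", "b", "c", "d", "e"]]
```

In the usage example the first list fills after four fingerprints and is written out as a unit, so fingerprints that arrived together end up stored together, which is what later read prefetching relies on.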
Hash bucket address table: the hash bucket address table is a key-value hash table resident in memory that holds the mapping from a fingerprint key to the ID of the hash bucket in which the fingerprint is stored. Its purpose is to locate quickly the hash bucket that stores a fingerprint when that fingerprint has to be looked up on disk. The structure of the hash bucket address table is shown in Fig. 3. The fingerprint length is 20 bytes, the bucket ID length is 4 bytes, and the pointer (Pointer) occupies 8 bytes. Hash collisions can occur in the hash table; when a collision occurs, it is handled by chaining.
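A chained hash table of this kind might look like the sketch below. The slot function (first 8 bytes of the fingerprint modulo the slot count) and the names are assumptions made for illustration; only the entry layout of 20 + 4 + 8 bytes comes from the text above.

```python
import hashlib

# Per-entry footprint implied by the layout above: fingerprint + bucket ID + pointer.
ENTRY_BYTES = 20 + 4 + 8


class AddressTable:
    """Fingerprint -> hash bucket ID, with collisions resolved by chaining."""

    def __init__(self, slots=1024):
        self.slots = [[] for _ in range(slots)]  # each slot holds one chain

    def _chain(self, fp):
        # Assumption: the slot index is taken from the first 8 fingerprint bytes.
        return self.slots[int.from_bytes(fp[:8], "big") % len(self.slots)]

    def put(self, fp, bucket_id):
        chain = self._chain(fp)
        for i, (key, _) in enumerate(chain):
            if key == fp:
                chain[i] = (fp, bucket_id)  # fingerprint moved to a new bucket
                return
        chain.append((fp, bucket_id))

    def get(self, fp):
        for key, bucket_id in self._chain(fp):
            if key == fp:
                return bucket_id
        return None  # no bucket ID: the fingerprint is treated as new


# Usage with a single slot, which forces both entries into the same chain.
table = AddressTable(slots=1)
fp_a = hashlib.sha1(b"block A").digest()
fp_b = hashlib.sha1(b"block B").digest()
table.put(fp_a, 7)
table.put(fp_b, 9)
```

At 32 bytes per entry, an index of one million fingerprints would take on the order of 32 MB of memory, not counting the slot array itself.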
Hash bucket read buffer: the hash bucket read buffer is a storage area opened in memory to cache hash buckets read from disk. To improve the efficiency of disk index lookups, part of the fingerprint index table on disk (hash buckets) is read into the hash bucket read buffer in memory. The hash bucket read buffer consists of a doubly linked list, each node of which stores one hash bucket; its structure is shown in Fig. 4. Each node of the linked list records the ID of its hash bucket and a flag bit indicating whether the hash bucket is a "dirty bucket". Logically, the hash bucket read buffer consists of two parts: the first several nodes form the hot zone, and the part from there to the tail node forms the cold zone. The division into hot and cold zones serves to optimize retrieval performance.
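The hot and cold zones can be pictured with the sketch below, which uses a `collections.deque` in place of the doubly linked list. The tail eviction on overflow and the omission of the dirty flag are assumptions of this sketch, since the text does not spell out a replacement policy here; the class and method names are hypothetical.

```python
from collections import deque
from itertools import islice


class ReadBuffer:
    """Cached hash buckets, head first; the first `hot_size` nodes are hot."""

    def __init__(self, capacity=1024, hot_size=8):
        self.capacity = capacity
        self.hot_size = hot_size
        self.nodes = deque()  # each node: (bucket ID, set of fingerprints)

    def in_hot_zone(self, fp):
        # Scan only the buckets at the front of the list.
        return any(fp in fps for _, fps in islice(self.nodes, self.hot_size))

    def find_in_cold_zone(self, bucket_id):
        # Walk the rest of the list looking for a bucket with this ID.
        for bid, fps in islice(self.nodes, self.hot_size, None):
            if bid == bucket_id:
                return fps
        return None

    def load_from_disk(self, bucket_id, fingerprints):
        # A bucket fetched from disk becomes the first node of the hot zone.
        self.nodes.appendleft((bucket_id, set(fingerprints)))
        if len(self.nodes) > self.capacity:
            self.nodes.pop()  # assumption: drop the tail node when full


rb = ReadBuffer(capacity=3, hot_size=1)
rb.load_from_disk(1, ["a"])
rb.load_from_disk(2, ["b"])
rb.load_from_disk(3, ["c"])
```

Because recently loaded buckets sit at the head, fingerprints that were written together and are read again together tend to be answered from the hot zone without touching the address table or the disk.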
As shown in Fig. 5, the duplicate data detection method based on locality optimization of the present invention comprises the following steps:
(1) obtain the fingerprint list file, read a portion of the fingerprints from the fingerprint list file (the size of this portion equals the size of the space designated in the cache for storing fingerprints) and store them in the cache, and extract one fingerprint from the cache; once all fingerprints in the cache have been extracted, read a new portion of fingerprints from the fingerprint list file and store it in the cache;
(2) query whether the extracted fingerprint has been recorded in the Bloom filter; if it may have been recorded, go to step (4), otherwise go to step (3);
Specifically, the Bloom filter is created in the initialization phase as follows:
the optimal bit vector size m of the Bloom filter is:
m = log₂e × log₂(1/ε) × C
and the number of hash functions k is:
k = ⌈ln 2 × (m/C)⌉ = ⌈log₂(1/ε)⌉
where C denotes the duplicate data block index capacity and ε denotes the misjudgment rate of the Bloom filter, whose value is not higher than the threshold 0.00001.
In this step, judging whether the extracted fingerprint X has been recorded in the Bloom filter is specifically:
if, for hash_i(X) (where 1 ≤ i ≤ k and k denotes the number of hash functions), h_1(X) & h_2(X) & ... & h_k(X) = 0, i.e. at least one of the bits at the k offsets given by the hash functions is 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded;
(3) insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), extract the next fingerprint from the cache, and return to step (2);
The hash bucket in this step is a container for fingerprints; its size may be any value, preferably 192 to 512 fingerprints per bucket. The hash bucket write buffer is implemented in the initialization phase by requesting free memory space; its size may be any value, preferably 4 to 128 hash buckets.
In the initialization phase, two lists (a first list and a second list) are set up in the hash bucket write buffer, and every node of each list consists of a hash bucket.
Specifically, inserting the fingerprint into the Bloom filter and the hash bucket write buffer in this step comprises: (a) compute the values hash_i(X) of the k independent hash functions and set to 1 the bits of the bit vector at offsets hash_i(X), where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash bucket lists in the hash bucket write buffer in turn to judge whether a list has a non-full bucket; when a non-full bucket is found, store the fingerprint in the first non-full bucket (this hash bucket is then called the hot bucket) and write the fingerprint and the hash bucket ID of that bucket into the hash bucket address table; if all hash buckets in a list are found to be full, lock the list, write all hash buckets in the list to disk, and after the write completes clear and unlock the list, where clearing a list means emptying all hash buckets in it and assigning a new hash bucket ID to each hash bucket. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and cleared before it is performed.
The hash bucket address table mentioned above is an address list created in memory in the initialization phase; it records, as key-value pairs, the mapping between a fingerprint and the ID of the hash bucket that stores it.
(4) judge whether the fingerprint has been recorded in the hot zone of the hash bucket read buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (5);
Specifically, the hash bucket read buffer (Cache) is a cache space set up in memory in the initialization phase. It consists of a linked list of multiple hash buckets; its size may be any value, preferably 1024 to 2048 hash buckets. The one or more hash buckets at the front of the linked list are called the hot zone, and the remaining hash buckets are called the cold zone.
(5) judge whether the fingerprint has been recorded in the hot bucket of the hash bucket write buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if so, go to step (7);
(7) traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether there is a hash bucket corresponding to that ID; if there is, look up the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to step (2); otherwise, read the hash bucket corresponding to that ID from disk and insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and look up the fingerprint in the inserted hash bucket: if it is found, the fingerprint is an existing fingerprint, and if it is not found, the fingerprint is a new fingerprint; then extract the next fingerprint from the cache and return to step (2).
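The control flow of steps (2) to (7) can be traced with the compact sketch below. It is a deliberately simplified model under several assumptions: a Python set stands in for the Bloom filter (so it never gives a false positive), the write buffer is reduced to a single hot bucket that is written to a dictionary acting as disk as soon as it fills, and all names and return strings are hypothetical.

```python
class Detector:
    """Simplified walk through the one-level prejudgment, three-level detection."""

    def __init__(self, bucket_capacity=2, hot_size=1):
        self.bloom = set()           # stand-in for the Bloom filter
        self.capacity = bucket_capacity
        self.hot_size = hot_size
        self.hot_bucket = {"id": 0, "fps": []}  # bucket currently being filled
        self.address_table = {}      # fingerprint -> bucket ID
        self.disk = {}               # bucket ID -> fingerprints
        self.read_buffer = []        # head first: (bucket ID, fingerprint set)

    def _insert_new(self, fp):
        self.bloom.add(fp)
        self.hot_bucket["fps"].append(fp)
        self.address_table[fp] = self.hot_bucket["id"]
        if len(self.hot_bucket["fps"]) == self.capacity:
            self.disk[self.hot_bucket["id"]] = self.hot_bucket["fps"]
            self.hot_bucket = {"id": self.hot_bucket["id"] + 1, "fps": []}

    def check(self, fp):
        if fp not in self.bloom:                       # step (2)
            self._insert_new(fp)                       # step (3)
            return "new"
        hot_zone = self.read_buffer[:self.hot_size]
        if any(fp in fps for _, fps in hot_zone):      # step (4)
            return "duplicate: read buffer hot zone"
        if fp in self.hot_bucket["fps"]:               # step (5)
            return "duplicate: write buffer hot bucket"
        bucket_id = self.address_table.get(fp)         # step (6)
        if bucket_id is None:
            return "new"
        for bid, fps in self.read_buffer[self.hot_size:]:   # step (7)
            if bid == bucket_id:
                return "duplicate: read buffer cold zone" if fp in fps else "new"
        fps = set(self.disk[bucket_id])                # fetch the bucket from disk
        self.read_buffer.insert(0, (bucket_id, fps))   # becomes first hot bucket
        return "duplicate: loaded from disk" if fp in fps else "new"


detector = Detector(bucket_capacity=2, hot_size=1)
results = [detector.check(fp) for fp in ["a", "b", "c", "a", "c", "a"]]
```

In this trace the first repeat of "a" costs one disk read, while the second repeat is answered from the hot zone of the read buffer, which is the locality effect the method is designed to exploit.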
The present invention has the following beneficial effects. First, because step (2) pre-checks each fingerprint with a Bloom filter, the number of fingerprint lookups is effectively reduced and the retrieval performance for repeated fingerprints is improved. Further, because steps (3) to (7) make full use of the characteristics of the data set itself and apply data prefetching and caching, three-level repeated data detection is performed on the hot zone of the buffer, the cold zone, and the disk in turn according to different conditions. This fully exploits the locality in the repeated data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves the efficiency of repeated data detection.
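The Bloom filter pre-check of step (2) and the bit-setting of step (3)(a) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. The bit vector size follows the formula given in claim 2, m = log2(e) × log2(1/ε) × C. The hash function count is not reproduced in the text, so the sketch assumes the textbook optimum k ≈ log2(1/ε). Deriving the k hash values by salting SHA-256 with the index i is likewise an illustrative choice, and the names `bloom_parameters`, `BloomFilter`, `add`, and `might_contain` are hypothetical.

```python
import hashlib
import math


def bloom_parameters(capacity, false_positive_rate):
    """Return (m, k): m from the formula in the text, k as an assumed optimum."""
    m = math.ceil(math.log2(math.e) * math.log2(1 / false_positive_rate) * capacity)
    k = max(1, round(math.log2(1 / false_positive_rate)))  # assumption, see lead-in
    return m, k


class BloomFilter:
    def __init__(self, capacity, false_positive_rate):
        self.m, self.k = bloom_parameters(capacity, false_positive_rate)
        self.bits = bytearray((self.m + 7) // 8)  # bit vector, all zeros

    def _offsets(self, fingerprint: bytes):
        # hash_i(X) for 1 <= i <= k, built by prefixing the index to the input.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, fingerprint: bytes):
        # Step (3)(a): set the bit at every offset hash_i(X) to 1.
        for off in self._offsets(fingerprint):
            self.bits[off // 8] |= 1 << (off % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # Step (2): if any probed bit is 0, X is definitely new;
        # if all are 1, X may have been recorded (false positives are possible).
        return all(self.bits[off // 8] & (1 << (off % 8))
                   for off in self._offsets(fingerprint))
```

A fingerprint for which `might_contain` returns `False` can skip the cache and disk lookups entirely, which is where the reduction in lookups comes from. A `True` result still has to be confirmed by steps (4) to (7).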
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A repeated data detection method based on local optimization, characterized by comprising the following steps:
(1) obtain a fingerprint list file, read part of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;
(2) query whether the extracted fingerprint is recorded in the Bloom filter; if it may have been recorded, go to step (4); otherwise go to step (3);
(3) insert the fingerprint into the Bloom filter and the hash bucket write buffer, extract the next fingerprint from the cache, and return to step (2); wherein inserting the fingerprint into the Bloom filter and the hash bucket write buffer specifically comprises: (a) calculating the values hash_i(X) of k independent hash functions, and setting the bit at offset hash_i(X) in the bit vector to 1, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash bucket lists in the hash bucket write buffer to judge whether either list contains a non-full bucket; when a non-full bucket is found, storing the fingerprint in the first non-full bucket, which is then called the hot bucket, and writing the fingerprint and the hash bucket ID of that hash bucket into the hash bucket address table; if all hash buckets in a list are found to be full, locking the list, writing all hash buckets in the list to disk, and, once the write completes, emptying and unlocking the list, wherein emptying a list means clearing all hash buckets in the list and assigning a new hash bucket ID to each hash bucket; if all buckets in both lists are full, the fingerprint insertion operation must wait until one list has been written out and emptied before performing the insertion;
(4) judge whether the fingerprint is already recorded in the hot zone of the hash bucket read buffer; if it is, extract the next fingerprint from the cache and return to step (2); otherwise go to step (5);
(5) judge whether the fingerprint is already recorded in the hot bucket of the hash bucket write buffer; if it is, extract the next fingerprint from the cache and return to step (2); otherwise go to step (6);
(6) look up the hash bucket address table by the fingerprint to judge whether a corresponding hash bucket ID can be obtained; if it cannot, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if it can, go to step (7);
(7) traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether there is a hash bucket corresponding to the hash bucket ID; if there is, search for the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to step (2); otherwise read the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket, where finding it indicates that the fingerprint is an existing fingerprint and not finding it indicates that the fingerprint is a new fingerprint; then extract the next fingerprint from the cache and return to step (2); wherein the hash bucket read buffer logically consists of two parts: the first several nodes form the hot zone, and the part from there to the tail node is the cold zone.
2. The repeated data detection method according to claim 1, characterized in that the Bloom filter is created in the initialization phase, and
the optimal bit vector size m of the Bloom filter is equal to:
m = log2(e) × log2(1/ε) × C
the number of hash functions is:
where C denotes the repeated data block index capacity and ε denotes the false positive rate of the Bloom filter.
3. The repeated data detection method according to claim 1, characterized in that judging in step (2) whether the extracted fingerprint X may have been recorded in the Bloom filter specifically comprises: if, for the hash functions hash_i(X), hash_1(X) & hash_2(X) & ... & hash_k(X) = 0, this indicates that fingerprint X is not recorded in the Bloom filter and fingerprint X is a new fingerprint; otherwise it indicates that fingerprint X may have been recorded, where 1 ≤ i ≤ k and k denotes the number of hash functions.
4. The repeated data detection method according to claim 1, characterized in that
a hash bucket is a container for holding fingerprints, with a capacity of 192 to 512 fingerprints per bucket;
the hash bucket write buffer is realized in the initialization phase by allocating free memory space, and its size is 4 to 128 hash buckets;
a first list and a second list are provided in the hash bucket write buffer, and the nodes in each list consist of hash buckets.
5. The repeated data detection method according to claim 1, characterized in that, in step (4), the hash bucket read buffer is a cache space set up in memory during the initialization phase and consists of a linked list of hash buckets, and the size of the hash bucket read buffer is 1024 to 2048 hash buckets.
6. A repeated data detection system based on local optimization, characterized by comprising:
a first module, configured to obtain a fingerprint list file, read part of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;
a second module, configured to query whether the extracted fingerprint is recorded in the Bloom filter, and to go to the fourth module if it may have been recorded, or otherwise to the third module;
a third module, configured to insert the fingerprint into the Bloom filter and the hash bucket write buffer, extract the next fingerprint from the cache, and return to the second module; wherein inserting the fingerprint into the Bloom filter and the hash bucket write buffer specifically comprises: (a) calculating the values hash_i(X) of k independent hash functions, and setting the bit at offset hash_i(X) in the bit vector to 1, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash bucket lists in the hash bucket write buffer to judge whether either list contains a non-full bucket; when a non-full bucket is found, storing the fingerprint in the first non-full bucket, which is then called the hot bucket, and writing the fingerprint and the hash bucket ID of that hash bucket into the hash bucket address table; if all hash buckets in a list are found to be full, locking the list, writing all hash buckets in the list to disk, and, once the write completes, emptying and unlocking the list, wherein emptying a list means clearing all hash buckets in the list and assigning a new hash bucket ID to each hash bucket; if all buckets in both lists are full, the fingerprint insertion operation must wait until one list has been written out and emptied before performing the insertion;
a fourth module, configured to judge whether the fingerprint is already recorded in the hot zone of the hash bucket read buffer, and, if it is, to extract the next fingerprint from the cache and return to the second module, or otherwise to go to the fifth module;
a fifth module, configured to judge whether the fingerprint is already recorded in the hot bucket of the hash bucket write buffer, and, if it is, to extract the next fingerprint from the cache and return to the second module, or otherwise to go to the sixth module;
a sixth module, configured to look up the hash bucket address table by the fingerprint to judge whether a corresponding hash bucket ID can be obtained; if it cannot, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and control returns to the second module; if it can, control goes to the seventh module;
a seventh module, configured to traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether there is a hash bucket corresponding to the hash bucket ID; if there is, to search for the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to the second module; otherwise to read the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and search for the fingerprint in the inserted hash bucket, where finding it indicates that the fingerprint is an existing fingerprint and not finding it indicates that the fingerprint is a new fingerprint; then to extract the next fingerprint from the cache and return to the second module; wherein the hash bucket read buffer logically consists of two parts: the first several nodes form the hot zone, and the part from there to the tail node is the cold zone.
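The two-list write buffer described for step (3)(b) and the third module can be sketched as below. This is a simplified, single-threaded Python illustration under stated assumptions. The claims describe locking a full list and writing it to disk while insertion continues in the other list; here the flush is a synchronous stand-in with no locking, and a `dict` plays the role of the disk. The names `HashBucket`, `WriteBuffer`, `insert`, and `_flush`, and the tiny default sizes, are hypothetical (the claims give 192 to 512 fingerprints per bucket and 4 to 128 buckets).

```python
class HashBucket:
    """Container for fingerprints with a fixed capacity (illustrative type)."""

    def __init__(self, bucket_id, capacity):
        self.bucket_id = bucket_id
        self.capacity = capacity
        self.fingerprints = []

    def is_full(self):
        return len(self.fingerprints) >= self.capacity


class WriteBuffer:
    def __init__(self, buckets_per_list=2, bucket_capacity=4):
        self._next_id = 0
        self.bucket_capacity = bucket_capacity
        # First list and second list, each made of hash buckets.
        self.lists = [[self._new_bucket() for _ in range(buckets_per_list)]
                      for _ in range(2)]
        self.address_table = {}  # fingerprint -> hash bucket ID
        self.disk = {}           # stand-in for on-disk storage (assumption)
        self.hot_bucket = None

    def _new_bucket(self):
        bucket = HashBucket(self._next_id, self.bucket_capacity)
        self._next_id += 1
        return bucket

    def _flush(self, index):
        # Write every bucket of the full list to "disk", then empty the list
        # by replacing each bucket with a fresh one that has a new bucket ID.
        for bucket in self.lists[index]:
            self.disk[bucket.bucket_id] = list(bucket.fingerprints)
        self.lists[index] = [self._new_bucket() for _ in self.lists[index]]

    def insert(self, fingerprint):
        """Store a fingerprint and return the ID of the bucket that received it."""
        for index, bucket_list in enumerate(self.lists):
            target = next((b for b in bucket_list if not b.is_full()), None)
            if target is None:
                # Whole list is full: flush it and try the other list.
                self._flush(index)
                continue
            target.fingerprints.append(fingerprint)
            self.hot_bucket = target  # the bucket currently being filled
            self.address_table[fingerprint] = target.bucket_id
            return target.bucket_id
        # Unreachable with the synchronous flush above; a concurrent version
        # would wait here until one list has been written out and emptied.
        raise RuntimeError("both lists are full")
```

With `WriteBuffer(buckets_per_list=2, bucket_capacity=2)`, the first four inserts fill the first list; the fifth insert flushes that list to `disk` and lands in the second list. Keeping two lists is what allows insertion to continue while the other list is being written out in a concurrent implementation.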
CN201710555589.5A 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization Active CN107391034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710555589.5A CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710555589.5A CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Publications (2)

Publication Number Publication Date
CN107391034A CN107391034A (en) 2017-11-24
CN107391034B true CN107391034B (en) 2019-05-10

Family

ID=60335524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710555589.5A Active CN107391034B (en) 2017-07-07 2017-07-07 A kind of repeated data detection method based on local optimization

Country Status (1)

Country Link
CN (1) CN107391034B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944038B (en) * 2017-12-14 2020-11-10 上海达梦数据库有限公司 Method and device for generating deduplication data
CN108459826B (en) * 2018-02-01 2020-12-29 杭州宏杉科技股份有限公司 Method and device for processing IO (input/output) request
CN109101365A (en) * 2018-08-01 2018-12-28 南京壹进制信息技术股份有限公司 A kind of data backup and resume method deleted again based on source data
CN109240605B (en) * 2018-08-17 2020-05-19 华中科技大学 Rapid repeated data block identification method based on 3D stacked memory
CN109471635B (en) * 2018-09-03 2021-09-17 中新网络信息安全股份有限公司 Algorithm optimization method based on Java Set implementation
CN109740037B (en) * 2019-01-02 2023-11-24 山东省科学院情报研究所 Multi-source heterogeneous flow state big data distributed online real-time processing method and system
CN109783523B (en) * 2019-01-24 2022-02-25 广州虎牙信息科技有限公司 Data processing method, device, equipment and storage medium
CN110046164B (en) * 2019-04-16 2021-07-02 中国人民解放军国防科技大学 Operation method of consistent valley filter
CN110489405B (en) * 2019-07-12 2024-01-12 平安科技(深圳)有限公司 Data processing method, device and server
CN111338581B (en) * 2020-03-27 2020-11-17 上海天天基金销售有限公司 Data storage method and device based on cloud computing, cloud server and system
CN112800430A (en) * 2021-02-01 2021-05-14 苏州棱镜七彩信息科技有限公司 Safety and compliance management method suitable for open source assembly
CN113721862B (en) * 2021-11-02 2022-02-08 腾讯科技(深圳)有限公司 Data processing method and device
US20230221864A1 (en) * 2022-01-10 2023-07-13 Vmware, Inc. Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253820A (en) * 2011-06-16 2011-11-23 华中科技大学 Stream type repetitive data detection method
CN102591946A (en) * 2010-12-28 2012-07-18 微软公司 Using index partitioning and reconciliation for data deduplication
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN103870514A (en) * 2012-12-18 2014-06-18 华为技术有限公司 Repeating data deleting method and device
CN103970875A (en) * 2014-05-15 2014-08-06 华中科技大学 Parallel repeated data deleting method
CN103970744A (en) * 2013-01-25 2014-08-06 华中科技大学 Extendible repeated data detection method
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
CN106293525A (en) * 2016-08-05 2017-01-04 上海交通大学 A kind of method and system improving caching service efficiency
CN106610790A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Repeated data deleting method and device
CN106649346A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data repeatability check method and apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014130549A (en) * 2012-12-28 2014-07-10 Fujitsu Ltd Storage device, control method, and control program
US10416915B2 (en) * 2015-05-15 2019-09-17 ScaleFlux Assisting data deduplication through in-memory computation
US10761758B2 (en) * 2015-12-21 2020-09-01 Quantum Corporation Data aware deduplication object storage (DADOS)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Resemblance and mergence based indexing for high performance data deduplication;Panfeng Zhang;《Journal of Systems and Software》;20170630;第11-24页

Also Published As

Publication number Publication date
CN107391034A (en) 2017-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant