CN107391034B - A kind of repeated data detection method based on local optimization - Google Patents
- Publication number
- CN107391034B (application CN201710555589.5A)
- Authority
- CN
- China
- Prior art keywords
- fingerprint
- hash
- hash bucket
- bucket
- list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
- G06F3/0641—De-duplication techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Collating Specific Patterns (AREA)
Abstract
The invention discloses a duplicate data detection method based on locality optimization, belonging to the technical field of computer storage. It addresses the low detection efficiency of existing duplicate data detection methods, which degrades further as the volume of stored data grows. The method comprises four detection steps: Bloom filter detection, hash bucket write buffer detection, hash bucket read cache detection, and hash bucket address table detection. The invention mainly targets data sets with strong locality: by exploiting the locality in the data set, it improves the efficiency of data prefetching, reduces disk access overhead, and raises deduplication throughput. For possible duplicate data in a data set, the invention first uses a Bloom filter to prejudge whether a data block is a duplicate, then performs three-level duplicate detection on the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions, making full use of the locality in duplicate data to improve detection efficiency.
Description
Technical field
The invention belongs to the technical field of computer storage and, more particularly, relates to a duplicate data detection method based on locality optimization.
Background art
With the rapid development of information technology, information has become a precious resource on which we depend and the greatest driver of rapid productivity growth. The widespread application of information technology is accompanied by the generation of massive amounts of data, and more and more valuable data is being stored. How to effectively improve the storage efficiency of existing storage media to meet ever-growing storage demand has therefore become one of the urgent problems in storage research. Meanwhile, an investigation report by IDC shows that about 75% of existing data is redundant, i.e. only 25% of data is unique. Against this background, data deduplication, a new technique that detects and eliminates redundancy over a large space, has become a research hotspot in academia and industry in recent years and is being applied ever more widely in information storage systems.
Detecting duplicate fingerprints is the key technical means of data deduplication. In existing deduplication techniques, duplicate data is mainly detected by fingerprint detection: the fingerprint (hash value) of each data block is extracted, and the fingerprint is then checked for duplication to decide whether the block is a duplicate data block. In basic duplicate fingerprint detection methods, duplicate fingerprints are usually identified with data structures such as a single hash table or a B-tree.
However, a problem of the above fingerprint detection methods that cannot be ignored is their low detection performance: they cannot detect duplicate data effectively for large data sets, which degrades the overall efficiency of data deduplication.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a duplicate data detection method based on locality optimization. Its purpose is to solve the technical problem that existing fingerprint-based duplicate data detection methods have low detection performance and cannot detect duplicate data effectively for large data sets.
To achieve the above object, according to one aspect of the invention, a duplicate data detection method based on locality optimization is provided, comprising the following steps:
(1) Obtain a fingerprint list file, read a portion of the fingerprints from it and store them in a cache, and extract one fingerprint from the cache;
(2) Query whether the extracted fingerprint has been recorded in the Bloom filter; if it may have been recorded, go to step (4), otherwise go to step (3);
(3) Insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), extract the next fingerprint from the cache, and return to step (2);
(4) Determine whether the fingerprint is recorded in the hot zone of the hash bucket read cache; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (5);
(5) Determine whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) Look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if it can be obtained, go to step (7);
(7) Traverse all hash buckets in the cold zone of the hash bucket read cache, using the obtained hash bucket ID to determine whether a hash bucket with that ID is present. If so, look up the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to step (2). Otherwise, load the hash bucket with that ID from disk into the first hash bucket position in the hot zone of the hash bucket read cache and look up the fingerprint in the loaded bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint. Then extract the next fingerprint from the cache and return to step (2).
Preferably, the Bloom filter is created in the initialization phase with the following parameters.
The optimal bit vector size m of the Bloom filter is:
$m = \log_2 e \times \log_2(1/\varepsilon) \times C$
The number of hash functions k is:
$k = \lceil \ln 2 \times (m/C) \rceil$
where C denotes the duplicate data block index capacity and ε denotes the false positive rate of the Bloom filter.
Preferably, determining in step (2) whether the extracted fingerprint X has been recorded in the Bloom filter is specifically: for the hash functions $hash_i(X)$, where 1 ≤ i ≤ k and k is the number of hash functions, if $hash_1(X) \,\&\, hash_2(X) \,\&\, \dots \,\&\, hash_k(X) = 0$, i.e. the AND of the bits of the bit vector at these k offsets is 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded.
Preferably, a hash bucket is a container for fingerprints with a capacity of 192 to 512 fingerprints per bucket. The hash bucket write buffer is created in the initialization phase by allocating free memory, and its size is 4 to 128 hash buckets. A first list and a second list are set up in the hash bucket write buffer, and the nodes of each list are hash buckets.
Preferably, inserting the fingerprint into the Bloom filter and the hash bucket write buffer in step (3) comprises:
(a) Compute the values $hash_i(X)$ of the k independent hash functions and set to 1 the bits of the bit vector at offsets $hash_i(X)$, where k is the number of hash functions and 1 ≤ i ≤ k;
(b) Poll the two hash bucket lists in the hash bucket write buffer to determine whether either list has a non-full bucket. When a non-full bucket is found, store the fingerprint in the first non-full bucket and write the fingerprint and the ID of that hash bucket into the hash bucket address table. If all hash buckets in a list are found to be full, lock the list, write all hash buckets in it to disk, and after the write completes empty and unlock the list; emptying a list means emptying all hash buckets in it and assigning each hash bucket a new hash bucket ID. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before it proceeds.
Preferably, in step (4), the hash bucket read cache is a cache space set up in memory in the initialization phase. It consists of a linked list of hash buckets, and its size is 1024 to 2048 hash buckets.
According to another aspect of the invention, a duplicate data detection system based on locality optimization is provided, comprising:
a first module, configured to obtain a fingerprint list file, read a portion of the fingerprints from it and store them in a cache, and extract one fingerprint from the cache;
a second module, configured to query whether the extracted fingerprint has been recorded in the Bloom filter, passing control to the fourth module if it may have been recorded and to the third module otherwise;
a third module, configured to insert the fingerprint into the Bloom filter and the hash bucket write buffer, extract the next fingerprint from the cache, and return to the second module;
a fourth module, configured to determine whether the fingerprint is recorded in the hot zone of the hash bucket read cache, extracting the next fingerprint from the cache and returning to the second module if so, and passing control to the fifth module otherwise;
a fifth module, configured to determine whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer, extracting the next fingerprint from the cache and returning to the second module if so, and passing control to the sixth module otherwise;
a sixth module, configured to look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and control returns to the second module; if it can be obtained, control passes to the seventh module;
a seventh module, configured to traverse all hash buckets in the cold zone of the hash bucket read cache, using the obtained hash bucket ID to determine whether a hash bucket with that ID is present. If so, the fingerprint is looked up in that hash bucket, the next fingerprint is extracted from the cache, and control returns to the second module. Otherwise, the hash bucket with that ID is loaded from disk into the first hash bucket position in the hot zone of the hash bucket read cache and the fingerprint is looked up in the loaded bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint. The next fingerprint is then extracted from the cache, and control returns to the second module.
In general, compared with the prior art, the above technical solutions conceived by the invention achieve the following beneficial effects:
1. The invention solves the technical problem that existing fingerprint detection methods have low detection performance and cannot detect duplicate data effectively for large data sets: because step (2) is adopted, prejudging with the Bloom filter effectively reduces the number of fingerprint lookups and improves the retrieval performance for duplicate fingerprints.
2. The invention adopts steps (3) to (7), which exploit the characteristics of the data set itself. Using data prefetching and caching, three-level duplicate detection is performed on the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions. This fully exploits the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves duplicate detection efficiency.
Brief description of the drawings
Fig. 1 is the logical structure diagram of the invention;
Fig. 2 is the data structure of the Bloom filter;
Fig. 3 is a schematic diagram of the hash bucket address table structure;
Fig. 4 is a schematic diagram of the hash bucket read cache structure;
Fig. 5 is a schematic diagram of the duplicate data detection method based on locality optimization of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
The invention proposes an efficient technique for fast detection of duplicate fingerprints. It mainly targets duplicate data detection for data sets with strong locality and, using a Bloom filter and caching, implements an optimization strategy of one level of prejudgment followed by three levels of detection to improve duplicate detection performance.
The basic idea of the invention is as follows: for possible duplicate data in a data set, a Bloom filter is first used to prejudge whether a data block is a duplicate; three-level duplicate detection is then performed on the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions, making full use of the locality in duplicate data to improve detection efficiency.
The basic logical structure of the invention is shown in Fig. 1. It consists of six parts: the fingerprint cache table, the Bloom filter, the hash bucket address table, the hash bucket write buffer, the hash bucket read cache, and the hash buckets on disk.
To describe the invention clearly, the terms used in this specification are explained below:
Fingerprint list: the set of fingerprints obtained by chunking a data set and extracting a fingerprint from each chunk, arranged in processing order.
Fingerprint cache table: the fingerprint cache table caches fingerprints from the fingerprint list. If the fingerprint list comes from a file, a certain number of fingerprints are read in at once and stored in the fingerprint cache area. The system takes fingerprints out of the fingerprint cache table one by one and performs the duplicate fingerprint lookup.
Bloom filter: as shown in Fig. 2, a Bloom filter consists of a bit vector of length m bits and k independent hash functions $h_i(x)$ (1 ≤ i ≤ k, k < m). It is a highly space-efficient randomized data structure that represents a set with a bit vector and can determine whether an element belongs to the set. To represent a set $S = \{x_1, x_2, x_3, \dots, x_n\}$, all bits of the bit vector are first initialized to 0. Then, for each element $x_j$ (1 ≤ j ≤ n) of S, the k mutually independent hash functions $h_i(x)$ are applied to obtain k hash values $h_i(x_j)$ (1 ≤ i ≤ k, $x_j \in S$). Taking the first bit of the bit vector as the starting point and the k hash values as offsets, $x_j$ is mapped to k positions in the bit vector {1, 2, ..., m}; these positions are set to 1, and $x_j$ is thereby marked. Once all elements of S have been marked, the set S is represented by the Bloom filter. If a position is set to 1 more than once, only the first time has any effect.
To determine whether a data element y belongs to S, the k mutually independent hash functions $h_i(x)$ are first applied to y to obtain k hash values $h_i(y)$. Taking the first bit of the bit vector as the starting point and the k hash values as offsets, the corresponding positions in the Bloom filter's bit vector are checked: if they are all 1, y may belong to S; otherwise y is determined not to be an element of S.
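The marking and query procedures above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `BloomFilter`, its method names, and the way the k offsets are derived (double hashing over one SHA-256 digest) are assumptions, since the text only requires k independent hash functions.

```python
import hashlib


class BloomFilter:
    """Bit vector of m bits plus k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # every bit starts at 0

    def _offsets(self, item: bytes) -> list:
        # Assumption: derive k offsets from one digest by double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # Mark the element: set the bit at each of its k offsets to 1.
        for off in self._offsets(item):
            self.bits[off // 8] |= 1 << (off % 8)

    def might_contain(self, item: bytes) -> bool:
        # True only if all k bits are 1; False means "definitely not marked".
        return all(self.bits[off // 8] & (1 << (off % 8))
                   for off in self._offsets(item))
```

A `False` result is definitive, while a `True` result only means the element may have been marked, which is why the method follows a positive answer with further lookups.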
Because the hash functions $h_i(x)$ (1 ≤ i ≤ k) may collide for any two different elements (for example, the positions in the bit vector to which y maps may already have been set by elements of S other than y), a Bloom filter can err when it makes a positive determination. The probability that a Bloom filter misjudges an element not in S as an element of S is called the false positive probability, also referred to as the false positive rate (error rate). The false positive probability can be controlled mathematically.
Given the cardinality n of the set S, the length m of the Bloom filter's bit vector, and the number k of hash functions, the probability that a given bit of the bit vector is still 0 after n elements have been inserted is $(1 - 1/m)^{k \times n}$. On the other hand, the Bloom filter makes a false positive determination only when all positions corresponding to some new element y have already been set to 1, so the false positive probability $f_{BF}$ is:
$f_{BF} = (1 - (1 - 1/m)^{k \times n})^k \approx (1 - e^{-k \times n/m})^k$
It can be derived that the Bloom filter has the smallest false positive probability when $k = \lceil \ln 2 \times (m/n) \rceil$; this is called the ideal false positive rate and is denoted $F_{BF}$. At this point about 50% of the bits in the Bloom filter's bit vector are 1. The symbol $\lceil \ln 2 \times (m/n) \rceil$ denotes the smallest integer not less than $\ln 2 \times (m/n)$.
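The optimum quoted above follows from a short calculation. The steps below are the standard textbook argument, given as a sketch rather than as text from the patent; the auxiliary symbol p (the probability that a bit is still 0) is introduced here for convenience.

```latex
\begin{aligned}
f_{BF} &\approx \left(1 - e^{-kn/m}\right)^{k},
  \qquad p = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n}\,\ln p, \\
\ln f_{BF} &\approx k \ln(1-p) = -\frac{m}{n}\,\ln p \,\ln(1-p), \\
\frac{d}{dp}\bigl[\ln p \,\ln(1-p)\bigr]
  &= \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
  \;\Rightarrow\; p = \tfrac{1}{2}, \\
k &= \ln 2 \times \frac{m}{n},
  \qquad F_{BF} = \left(\tfrac{1}{2}\right)^{k} \approx 0.6185^{\,m/n}.
\end{aligned}
```

The condition p = 1/2 is the statement that about half of the bits are 1 at the optimum.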
Further, if n is known and a Bloom filter is to be designed whose ideal false positive rate does not exceed a given upper bound ε, it can be derived that m must satisfy:
$m \ge \log_2 e \times \log_2(1/\varepsilon) \times n$
If $m = \log_2 e \times \log_2(1/\varepsilon) \times n$ and $k = \lceil \ln 2 \times (m/n) \rceil = \lceil \log_2(1/\varepsilon) \rceil$, the false positive probability rises to ε only when all n elements have been inserted into the Bloom filter; n is therefore also called the design capacity of the Bloom filter. Here $\lceil x \rceil$ denotes the smallest integer not less than x.
From the above analysis, the bit vector length m and the number of hash functions k of a Bloom filter can be computed from the design capacity n and the false positive rate upper bound ε. The design capacity n is the number of elements the Bloom filter is expected to mark. While a Bloom filter has marked fewer than n elements, it is a non-full Bloom filter: it can continue to mark new elements and can also be queried as to whether an element has been marked. Once the number of marked elements reaches n, the Bloom filter is full: it can no longer mark new elements but can still serve queries. Note that n ≤ m.
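As a worked example of this sizing, the sketch below computes m and k from a design capacity and a false positive bound using the two formulas above. The function names `bloom_parameters` and `false_positive_rate` are illustrative, not from the source.

```python
import math


def bloom_parameters(n: int, eps: float) -> tuple:
    """Return (m, k) for design capacity n and false positive upper bound eps."""
    # m = log2(e) * log2(1/eps) * n, rounded up to a whole number of bits.
    m = math.ceil(math.log2(math.e) * math.log2(1 / eps) * n)
    # k = ceil(ln 2 * m / n)
    k = math.ceil(math.log(2) * m / n)
    return m, k


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate rate (1 - e^(-k*n/m))^k after n insertions."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(n=1_000_000, eps=0.00001)
rate = false_positive_rate(m, k, n=1_000_000)
```

For one million fingerprints and ε = 0.00001 this gives a bit vector of roughly 24 million bits (about 3 MB) and k = 17. Because k is rounded up to an integer, the rate at full capacity lands very close to ε rather than exactly on it.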
Hash bucket: the hash bucket is the basic unit of fingerprint storage and of swapping into and out of the cache. A hash bucket stores a fixed number of distinct fingerprints (a distinct fingerprint is one whose value differs from that of every other fingerprint).
Hash bucket write buffer: the hash bucket write buffer is a buffer region set aside in memory that holds new fingerprints before they are written to disk. Storing a new fingerprint and writing a hash bucket back to disk cannot operate on the same hash bucket at the same time; to avoid this critical-resource conflict, the write buffer is designed as two hash bucket lists. When all hash buckets in one list are full, the list is locked and all of its hash buckets are written to disk. Because these hash buckets are written in one pass, they usually land on the same track on disk, so their fingerprints retain data locality within a certain spatial range, which makes later read prefetching possible. After the write completes, the hash buckets are all emptied. Meanwhile, new fingerprints continue to be stored in the hash buckets of the other list. If the disk write for one list has not finished when the hash buckets of the other list become full, insertion must wait for the disk write to complete. When a fingerprint is stored in the hash bucket write buffer, it is assigned a hash bucket ID and the hash bucket address table is updated at the same time.
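The two-list design can be illustrated with a single-threaded sketch. Several simplifications are assumptions made for brevity: the flush runs synchronously (so the locking and the "wait for the disk write" case never arise), an `active` index replaces polling both lists, and plain dictionaries stand in for the disk and the address table. All class and attribute names are hypothetical.

```python
from itertools import count


class HashBucket:
    def __init__(self, bucket_id: int, capacity: int) -> None:
        self.bucket_id = bucket_id
        self.capacity = capacity
        self.fingerprints = []

    def is_full(self) -> bool:
        return len(self.fingerprints) >= self.capacity


class WriteBuffer:
    """Two bucket lists: one receives fingerprints while the other is flushed."""

    def __init__(self, buckets_per_list: int, bucket_capacity: int) -> None:
        self._ids = count(1)
        self.bucket_capacity = bucket_capacity
        self.lists = [[self._new_bucket() for _ in range(buckets_per_list)]
                      for _ in range(2)]
        self.active = 0          # index of the list currently being filled
        self.disk = {}           # bucket ID -> fingerprints (stand-in for disk)
        self.address_table = {}  # fingerprint -> bucket ID

    def _new_bucket(self) -> HashBucket:
        return HashBucket(next(self._ids), self.bucket_capacity)

    def _flush(self, index: int) -> None:
        # Write every bucket of the full list out, then empty the list by
        # replacing its buckets with fresh ones that carry new IDs.
        for bucket in self.lists[index]:
            self.disk[bucket.bucket_id] = bucket.fingerprints
        self.lists[index] = [self._new_bucket() for _ in self.lists[index]]

    def insert(self, fingerprint: bytes) -> int:
        bucket_list = self.lists[self.active]
        # Store the fingerprint in the first non-full bucket of the list.
        bucket = next(b for b in bucket_list if not b.is_full())
        bucket.fingerprints.append(fingerprint)
        self.address_table[fingerprint] = bucket.bucket_id
        if all(b.is_full() for b in bucket_list):
            # Switch to the other list, then write the full one to disk.
            full, self.active = self.active, 1 - self.active
            self._flush(full)
        return bucket.bucket_id
```

A concurrent implementation would flush in the background and block insertion only when both lists are full, as the text describes.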
Hash bucket address table: the hash bucket address table is a key-value hash table resident in memory that maps each fingerprint (the key) to the ID of the hash bucket that holds it. Its role is to locate quickly the hash bucket that stores a fingerprint when the fingerprint has to be looked up on disk. The structure of the hash bucket address table is shown in Fig. 3. A fingerprint is 20 bytes long, a bucket ID is 4 bytes long, and a pointer occupies 8 bytes. Hash collisions can occur in the hash table; when they do, they are resolved by separate chaining.
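A chained table of this shape might look like the sketch below. The slot count and the slot-selection rule (the leading 8 bytes of the fingerprint) are assumptions; the three fields of `Entry` mirror the 20-byte fingerprint, 4-byte bucket ID, and 8-byte pointer named above, which would make an entry 32 bytes in a packed layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    fingerprint: bytes              # key: 20-byte fingerprint
    bucket_id: int                  # value: 4-byte hash bucket ID
    next: Optional["Entry"] = None  # pointer to the next entry in the chain


class BucketAddressTable:
    def __init__(self, slots: int = 1 << 16) -> None:
        self.slots = [None] * slots

    def _slot(self, fingerprint: bytes) -> int:
        # Assumption: the leading 8 bytes of the fingerprint pick the slot.
        return int.from_bytes(fingerprint[:8], "big") % len(self.slots)

    def put(self, fingerprint: bytes, bucket_id: int) -> None:
        i = self._slot(fingerprint)
        node = self.slots[i]
        while node is not None:
            if node.fingerprint == fingerprint:
                node.bucket_id = bucket_id  # update an existing mapping
                return
            node = node.next
        # Collision or empty slot: link the new entry at the head of the chain.
        self.slots[i] = Entry(fingerprint, bucket_id, self.slots[i])

    def get(self, fingerprint: bytes) -> Optional[int]:
        node = self.slots[self._slot(fingerprint)]
        while node is not None:
            if node.fingerprint == fingerprint:
                return node.bucket_id
            node = node.next
        return None  # no bucket ID: the fingerprint is not in the table
```

A `None` result corresponds to the case in which no hash bucket ID can be obtained and the fingerprint is treated as new.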
Hash bucket read cache: the hash bucket read cache is a region of memory used to cache hash buckets read from disk. To make disk index lookups more efficient, part of the on-disk fingerprint index table (hash buckets) is read into the hash bucket read cache in memory. The read cache is a doubly linked list in which each node stores one hash bucket; its structure is shown in Fig. 4. Each node records the ID of its hash bucket and a flag bit indicating whether the hash bucket is a "dirty bucket". Logically, the read cache has two parts: the first several nodes form the hot zone, and the nodes from there to the tail node form the cold zone. The hot/cold division is used to optimize retrieval performance.
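The hot/cold organisation can be sketched with Python's `OrderedDict`, which keeps its entries in a doubly linked order. The class `ReadCache`, the `hot_size` parameter, and eviction from the tail when the cache is over capacity are assumptions made for illustration; the text does not specify an eviction policy, and the dirty-bucket flag is omitted.

```python
from collections import OrderedDict
from itertools import islice


class ReadCache:
    """Buckets ordered from head (hot zone) to tail (cold zone)."""

    def __init__(self, capacity: int, hot_size: int) -> None:
        self.capacity = capacity
        self.hot_size = hot_size
        self.buckets = OrderedDict()  # bucket ID -> set of fingerprints

    def in_hot_zone(self, fingerprint: bytes) -> bool:
        # Search the fingerprints of the first hot_size buckets only.
        hot = islice(self.buckets.values(), self.hot_size)
        return any(fingerprint in bucket for bucket in hot)

    def find_in_cold_zone(self, bucket_id: int):
        # Traverse the bucket IDs after the hot zone.
        cold_ids = islice(self.buckets.keys(), self.hot_size, None)
        if bucket_id in cold_ids:
            return self.buckets[bucket_id]
        return None

    def load(self, bucket_id: int, fingerprints) -> None:
        # Place a bucket read from disk at the head of the hot zone.
        self.buckets[bucket_id] = set(fingerprints)
        self.buckets.move_to_end(bucket_id, last=False)
        while len(self.buckets) > self.capacity:
            self.buckets.popitem(last=True)  # assumed policy: drop the tail
```

Loading at the head means the most recently fetched bucket, which locality suggests will be hit again soon, is the first place the next lookup searches.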
As shown in Fig. 5, the duplicate data detection method based on locality optimization of the invention comprises the following steps:
(1) Obtain a fingerprint list file, read a portion of the fingerprints from it (the size of this portion equals the size of the space allotted in the cache for storing fingerprints) and store them in the cache, then extract one fingerprint from the cache. Once all fingerprints in the cache have been extracted, read a new portion of fingerprints from the fingerprint list file and store it in the cache;
(2) Query whether the extracted fingerprint has been recorded in the Bloom filter; if it may have been recorded, go to step (4), otherwise go to step (3);
Specifically, the Bloom filter is created in the initialization phase as follows:
The optimal bit vector size m of the Bloom filter is:
$m = \log_2 e \times \log_2(1/\varepsilon) \times C$
The number of hash functions k is:
$k = \lceil \ln 2 \times (m/C) \rceil$
where C denotes the duplicate data block index capacity and ε denotes the false positive rate of the Bloom filter, whose value is no higher than the threshold 0.00001.
In this step, determining whether the extracted fingerprint X has been recorded in the Bloom filter is specifically:
For $hash_i(X)$ (where 1 ≤ i ≤ k and k is the number of hash functions), if $hash_1(X) \,\&\, hash_2(X) \,\&\, \dots \,\&\, hash_k(X) = 0$, i.e. the AND of the bits of the bit vector at these k offsets is 0, then fingerprint X has not been recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded;
(3) Insert the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), extract the next fingerprint from the cache, and return to step (2);
The hash bucket in this step is a container for fingerprints. Its size can be any value; the preferred value is 192 to 512 fingerprints per bucket. The hash bucket write buffer is created in the initialization phase by allocating free memory. Its size can be any value; the preferred value is 4 to 128 hash buckets.
In the initialization phase, two lists (a first list and a second list) are set up in the hash bucket write buffer, and the nodes of each list are hash buckets.
Specifically, inserting the fingerprint into the Bloom filter and the hash bucket write buffer in this step comprises:
(a) Compute the values $hash_i(X)$ of the k independent hash functions and set to 1 the bits of the bit vector at offsets $hash_i(X)$, where k is the number of hash functions and 1 ≤ i ≤ k;
(b) Poll the two hash bucket lists in the hash bucket write buffer to determine whether either list has a non-full bucket. When a non-full bucket is found, store the fingerprint in the first non-full bucket (this hash bucket is then called the hot bucket) and write the fingerprint and the ID of that hash bucket into the hash bucket address table. If all hash buckets in a list are found to be full, lock the list, write all hash buckets in it to disk, and after the write completes empty and unlock the list; emptying a list means emptying all hash buckets in it and assigning each hash bucket a new hash bucket ID. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before it proceeds.
The hash bucket address table mentioned above is an address list created in memory in the initialization phase. It records, as key-value pairs, the mapping between each fingerprint and the hash bucket ID of the hash bucket that stores it.
(4) Determine whether the fingerprint is recorded in the hot zone of the hash bucket read cache; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (5);
Specifically, the hash bucket read cache (Cache) is a cache space set up in memory in the initialization phase. It consists of a linked list of hash buckets. Its size can be any value; the preferred value is 1024 to 2048 hash buckets. The one or more hash buckets at the front of the linked list are called the hot zone, and the remaining hash buckets are called the cold zone.
(5) Determine whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) Look up the hash bucket address table by the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if not, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if it can be obtained, go to step (7);
(7) Traverse all hash buckets in the cold zone of the hash bucket read cache, using the obtained hash bucket ID to determine whether a hash bucket with that ID is present. If so, look up the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to step (2). Otherwise, load the hash bucket with that ID from disk into the first hash bucket position in the hot zone of the hash bucket read cache and look up the fingerprint in the loaded bucket: if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint. Then extract the next fingerprint from the cache and return to step (2).
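Steps (2) to (7) can be put together in one compact, single-threaded sketch. It is heavily simplified and every name in it is hypothetical: the write buffer holds a single hot bucket that is flushed as soon as it fills, dictionaries stand in for the disk and the address table, the read cache never evicts, and the Bloom filter offsets come from double hashing over SHA-256. The text does not say whether a fingerprint judged new in step (6) or (7) is stored afterwards, so the sketch only reports it.

```python
import hashlib
from collections import OrderedDict
from itertools import islice

BUCKET_CAPACITY = 2  # illustrative; the text prefers 192 to 512
HOT_SIZE = 1         # number of buckets in the read cache hot zone


class Detector:
    def __init__(self, m_bits: int = 1 << 16, k: int = 7) -> None:
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)
        self.hot_bucket = []             # bucket being filled (write buffer)
        self.hot_bucket_id = 1
        self.address_table = {}          # fingerprint -> bucket ID
        self.disk = {}                   # bucket ID -> fingerprints
        self.read_cache = OrderedDict()  # bucket ID -> fingerprints, head = hot

    def _offsets(self, fp: bytes) -> list:
        d = hashlib.sha256(fp).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def _bloom_hit(self, fp: bytes) -> bool:
        return all(self.bits[o // 8] >> (o % 8) & 1 for o in self._offsets(fp))

    def _insert_new(self, fp: bytes) -> None:  # step (3)
        for o in self._offsets(fp):
            self.bits[o // 8] |= 1 << (o % 8)
        self.hot_bucket.append(fp)
        self.address_table[fp] = self.hot_bucket_id
        if len(self.hot_bucket) == BUCKET_CAPACITY:  # flush a full bucket
            self.disk[self.hot_bucket_id] = self.hot_bucket
            self.hot_bucket, self.hot_bucket_id = [], self.hot_bucket_id + 1

    def is_duplicate(self, fp: bytes) -> bool:
        if not self._bloom_hit(fp):                    # step (2): surely new
            self._insert_new(fp)
            return False
        hot = islice(self.read_cache.values(), HOT_SIZE)
        if any(fp in bucket for bucket in hot):        # step (4)
            return True
        if fp in self.hot_bucket:                      # step (5)
            return True
        bucket_id = self.address_table.get(fp)         # step (6)
        if bucket_id is None:
            return False                               # Bloom false positive
        cold_ids = islice(self.read_cache.keys(), HOT_SIZE, None)
        if bucket_id in cold_ids:                      # step (7): cold zone hit
            return fp in self.read_cache[bucket_id]
        self.read_cache[bucket_id] = self.disk[bucket_id]  # step (7): from disk
        self.read_cache.move_to_end(bucket_id, last=False)
        return fp in self.read_cache[bucket_id]
```

Feeding the fingerprints a, b, c, a, c, b through `is_duplicate` exercises three different paths: the second a is resolved by loading its bucket from disk, the second c is found in the hot bucket of the write buffer, and the second b is found in the hot zone of the read cache without any further disk access.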
The invention has the following beneficial effects. First, because step (2) is adopted, prejudging with the Bloom filter effectively reduces the number of fingerprint lookups and improves the retrieval performance for duplicate fingerprints. Second, because steps (3) to (7) are adopted, the characteristics of the data set itself are fully exploited: using data prefetching and caching, three-level duplicate detection is performed on the hot zone of the cache, the cold zone of the cache, and the disk according to different conditions. This fully exploits the locality in duplicate data, improves the accuracy of data prefetching, effectively reduces the number of disk accesses, and further improves duplicate detection efficiency.
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (6)
1. A repeated data detection method based on local optimization, characterized by comprising the following steps:
(1) obtain a fingerprint list file, read some of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;
(2) query whether the extracted fingerprint is recorded in the Bloom filter; if it may be recorded, go to step (4), otherwise go to step (3);
(3) insert the fingerprint into the Bloom filter and the hash bucket write buffer, extract the next fingerprint from the cache, and return to step (2); wherein inserting the fingerprint into the Bloom filter and the hash bucket write buffer specifically comprises: (a) calculating the values hash_i(X) of k independent hash functions and setting to 1 the bit at offset hash_i(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash bucket lists in the hash bucket write buffer to judge whether a list contains a non-full bucket; when a non-full bucket is found, storing the fingerprint in the first non-full bucket, which is then called the hot bucket, and writing the fingerprint together with the hash bucket ID of that hash bucket into the hash bucket address table; if all hash buckets in a list are found to be full, locking the list and writing all hash buckets in the list to disk, and after the write completes, clearing and unlocking the list, where clearing a list means emptying all hash buckets in the list and assigning a new hash bucket ID to each hash bucket; if all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and cleared before it is performed;
(4) judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (5);
(5) judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer; if so, extract the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) look up the hash bucket address table by the fingerprint to judge whether a corresponding hash bucket ID can be obtained; if not, determine that the fingerprint is a new fingerprint, extract the next fingerprint from the cache, and return to step (2); if so, go to step (7);
(7) traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether a hash bucket corresponding to that ID is present; if so, search for the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to step (2); otherwise, read the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and search for the fingerprint in the newly inserted hash bucket, where a hit means the fingerprint is an existing fingerprint and a miss means it is a new fingerprint, then extract the next fingerprint from the cache and return to step (2); wherein the hash bucket read buffer logically consists of two parts: the first several nodes form the hot zone, and the nodes from there to the tail node form the cold zone.
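Sub-step (b) of step (3), the two-list write buffer, might look roughly like the sketch below. It is a simplified, single-threaded illustration under stated assumptions: `WriteBuffer`, `buckets_per_list`, `bucket_capacity`, and the `flush` callback are hypothetical names, and because the flush runs inline there is no locking and no waiting, unlike the lock, write, clear, unlock sequence the claim describes.

```python
import itertools


class WriteBuffer:
    """Two lists of hash buckets; a full list is written out and cleared with fresh bucket IDs."""

    def __init__(self, buckets_per_list, bucket_capacity, flush):
        self._ids = itertools.count()       # source of new hash bucket IDs
        self.bucket_capacity = bucket_capacity
        self.flush = flush                  # hypothetical callback: receives {bucket_id: [fingerprints]}
        self.address_table = {}             # fingerprint -> hash bucket ID
        self.lists = [self._new_list(buckets_per_list) for _ in range(2)]
        self.hot_bucket = None              # (bucket_id, fingerprints) last written to

    def _new_list(self, n):
        return [(next(self._ids), []) for _ in range(n)]

    def insert(self, fingerprint):
        for index, bucket_list in enumerate(self.lists):
            for bucket_id, fingerprints in bucket_list:
                if len(fingerprints) < self.bucket_capacity:
                    # First non-full bucket found: it becomes the hot bucket.
                    fingerprints.append(fingerprint)
                    self.hot_bucket = (bucket_id, fingerprints)
                    self.address_table[fingerprint] = bucket_id
                    return bucket_id
            # Every bucket in this list is full: write the list out, then clear it
            # by replacing its buckets with empty ones that carry new IDs.
            self.flush({bid: list(fps) for bid, fps in bucket_list})
            self.lists[index] = self._new_list(len(bucket_list))
        # Both lists were full and have just been cleared; retry on the empty lists.
        return self.insert(fingerprint)
```

In a concurrent version the flush would run in the background while insertion continues on the other list, which is the reason for keeping two lists in the first place.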
2. The repeated data detection method according to claim 1, characterized in that the Bloom filter is created in the initialization phase, wherein
the optimal bit vector size m of the Bloom filter is:
$m = \log_2 e \times \log_2(1/\varepsilon) \times C$
and the number of hash functions is:
(formula missing from the extracted text)
where C denotes the capacity of the repeated data block index and ε denotes the false positive rate of the Bloom filter.
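As a worked example of the sizing formula in claim 2, the sketch below evaluates m for a given index capacity C and false positive rate ε. The function names are hypothetical. Because the claim's own expression for the number of hash functions is not present in this text, `textbook_hash_count` uses the standard textbook relation k = (m / C) · ln 2, which is an assumption here and not necessarily the formula claimed.

```python
import math


def bloom_bit_vector_size(capacity, false_positive_rate):
    """m = log2(e) * log2(1/eps) * C, rounded up to a whole number of bits."""
    return math.ceil(math.log2(math.e) * math.log2(1.0 / false_positive_rate) * capacity)


def textbook_hash_count(m, capacity):
    """Assumption: the textbook optimum k = (m / C) * ln 2, rounded to the nearest integer."""
    return max(1, round(m / capacity * math.log(2)))


m = bloom_bit_vector_size(1_000_000, 0.01)   # roughly 9.6 bits per indexed block
k = textbook_hash_count(m, 1_000_000)
```

For one million indexed blocks and a 1% false positive rate this gives a bit vector of a little over one megabyte.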
3. The repeated data detection method according to claim 1, characterized in that judging in step (2) whether the extracted fingerprint X may be recorded in the Bloom filter specifically comprises: if, for the hash functions hash_i(X), hash_1(X) & hash_2(X) & ... & hash_k(X) = 0 (that is, at least one of the bits at these offsets in the bit vector is 0), then fingerprint X is not recorded in the Bloom filter and is a new fingerprint; otherwise fingerprint X may have been recorded, where 1 ≤ i ≤ k and k denotes the number of hash functions.
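The membership test of claim 3, together with the bit-setting of step (3)(a), can be illustrated with a small Bloom filter. This is a sketch under stated assumptions: the class name `BloomFilter` is hypothetical, and deriving the k hash values by salting SHA-256 with the index i is an illustrative choice, since the text does not say which hash functions are used.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                              # bit vector size in bits
        self.k = k                              # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _offsets(self, fingerprint):
        # Assumption: hash_i(X) is SHA-256 over (i || X), reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, fingerprint):
        """Step (3)(a): set the bit at each offset hash_i(X) to 1."""
        for off in self._offsets(fingerprint):
            self.bits[off // 8] |= 1 << (off % 8)

    def may_contain(self, fingerprint):
        """Claim 3: if the AND of the k bits is 0 the fingerprint is new; otherwise it may be recorded."""
        return all(self.bits[off // 8] & (1 << (off % 8)) for off in self._offsets(fingerprint))
```

A `False` result is definitive, while a `True` result only means the fingerprint may be present, which is why the method continues with steps (4) to (7) in that case.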
4. The repeated data detection method according to claim 1, characterized in that
a hash bucket is a container for holding fingerprints, with a capacity of 192 to 512 fingerprints per bucket;
the hash bucket write buffer is implemented in the initialization phase by allocating free memory space, and its size is 4 to 128 hash buckets;
the hash bucket write buffer is provided with a first list and a second list, and the nodes in each list consist of hash buckets.
5. The repeated data detection method according to claim 1, characterized in that, in step (4), the hash bucket read buffer is a buffer space set up in memory in the initialization phase and consists of a linked list formed of multiple hash buckets, and the size of the hash bucket read buffer is 1024 to 2048 hash buckets.
6. A repeated data detection system based on local optimization, characterized by comprising:
a first module, configured to obtain a fingerprint list file, read some of the fingerprints from the fingerprint list file and store them in a cache, and extract one fingerprint from the cache;
a second module, configured to query whether the extracted fingerprint is recorded in the Bloom filter, and if it may be recorded, to go to the fourth module, otherwise to go to the third module;
a third module, configured to insert the fingerprint into the Bloom filter and the hash bucket write buffer, extract the next fingerprint from the cache, and return to the second module; wherein inserting the fingerprint into the Bloom filter and the hash bucket write buffer specifically comprises: (a) calculating the values hash_i(X) of k independent hash functions and setting to 1 the bit at offset hash_i(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash bucket lists in the hash bucket write buffer to judge whether a list contains a non-full bucket; when a non-full bucket is found, storing the fingerprint in the first non-full bucket, which is then called the hot bucket, and writing the fingerprint together with the hash bucket ID of that hash bucket into the hash bucket address table; if all hash buckets in a list are found to be full, locking the list and writing all hash buckets in the list to disk, and after the write completes, clearing and unlocking the list, where clearing a list means emptying all hash buckets in the list and assigning a new hash bucket ID to each hash bucket; if all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and cleared before it is performed;
a fourth module, configured to judge whether the fingerprint is recorded in the hot zone of the hash bucket read buffer, and if so, to extract the next fingerprint from the cache and return to the second module, otherwise to go to the fifth module;
a fifth module, configured to judge whether the fingerprint is recorded in the hot bucket of the hash bucket write buffer, and if so, to extract the next fingerprint from the cache and return to the second module, otherwise to go to the sixth module;
a sixth module, configured to look up the hash bucket address table by the fingerprint to judge whether a corresponding hash bucket ID can be obtained, and if not, to determine that the fingerprint is a new fingerprint, extract the next fingerprint from the cache, and return to the second module, and if so, to go to the seventh module;
a seventh module, configured to traverse all hash buckets in the cold zone of the hash bucket read buffer according to the obtained hash bucket ID to judge whether a hash bucket corresponding to that ID is present; if so, to search for the fingerprint in that hash bucket, extract the next fingerprint from the cache, and return to the second module; otherwise, to read the hash bucket corresponding to the hash bucket ID from disk, insert it as the first hash bucket in the hot zone of the hash bucket read buffer, and search for the fingerprint in the newly inserted hash bucket, where a hit means the fingerprint is an existing fingerprint and a miss means it is a new fingerprint, then to extract the next fingerprint from the cache and return to the second module; wherein the hash bucket read buffer logically consists of two parts: the first several nodes form the hot zone, and the nodes from there to the tail node form the cold zone.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710555589.5A CN107391034B (en) | 2017-07-07 | 2017-07-07 | A kind of repeated data detection method based on local optimization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710555589.5A CN107391034B (en) | 2017-07-07 | 2017-07-07 | A kind of repeated data detection method based on local optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107391034A (en) | 2017-11-24 |
CN107391034B (en) | 2019-05-10 |
Family
ID=60335524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710555589.5A Active CN107391034B (en) | 2017-07-07 | 2017-07-07 | A kind of repeated data detection method based on local optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107391034B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107944038B (en) * | 2017-12-14 | 2020-11-10 | 上海达梦数据库有限公司 | Method and device for generating deduplication data |
CN108459826B (en) * | 2018-02-01 | 2020-12-29 | 杭州宏杉科技股份有限公司 | Method and device for processing IO (input/output) request |
CN109101365A (en) * | 2018-08-01 | 2018-12-28 | 南京壹进制信息技术股份有限公司 | A kind of data backup and resume method deleted again based on source data |
CN109240605B (en) * | 2018-08-17 | 2020-05-19 | 华中科技大学 | Rapid repeated data block identification method based on 3D stacked memory |
CN109471635B (en) * | 2018-09-03 | 2021-09-17 | 中新网络信息安全股份有限公司 | Algorithm optimization method based on Java Set implementation |
CN109740037B (en) * | 2019-01-02 | 2023-11-24 | 山东省科学院情报研究所 | Multi-source heterogeneous flow state big data distributed online real-time processing method and system |
CN109783523B (en) * | 2019-01-24 | 2022-02-25 | 广州虎牙信息科技有限公司 | Data processing method, device, equipment and storage medium |
CN110046164B (en) * | 2019-04-16 | 2021-07-02 | 中国人民解放军国防科技大学 | Operation method of consistent valley filter |
CN110489405B (en) * | 2019-07-12 | 2024-01-12 | 平安科技(深圳)有限公司 | Data processing method, device and server |
CN111338581B (en) * | 2020-03-27 | 2020-11-17 | 上海天天基金销售有限公司 | Data storage method and device based on cloud computing, cloud server and system |
CN112800430A (en) * | 2021-02-01 | 2021-05-14 | 苏州棱镜七彩信息科技有限公司 | Safety and compliance management method suitable for open source assembly |
CN113721862B (en) * | 2021-11-02 | 2022-02-08 | 腾讯科技(深圳)有限公司 | Data processing method and device |
US20230221864A1 (en) * | 2022-01-10 | 2023-07-13 | Vmware, Inc. | Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102253820A (en) * | 2011-06-16 | 2011-11-23 | 华中科技大学 | Stream type repetitive data detection method |
CN102591946A (en) * | 2010-12-28 | 2012-07-18 | 微软公司 | Using index partitioning and reconciliation for data deduplication |
CN102663058A (en) * | 2012-03-30 | 2012-09-12 | 华中科技大学 | URL duplication removing method in distributed network crawler system |
CN103345472A (en) * | 2013-06-04 | 2013-10-09 | 北京航空航天大学 | Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system |
CN103870514A (en) * | 2012-12-18 | 2014-06-18 | 华为技术有限公司 | Repeating data deleting method and device |
CN103970875A (en) * | 2014-05-15 | 2014-08-06 | 华中科技大学 | Parallel repeated data deleting method |
CN103970744A (en) * | 2013-01-25 | 2014-08-06 | 华中科技大学 | Extendible repeated data detection method |
CN105740266A (en) * | 2014-12-10 | 2016-07-06 | 国际商业机器公司 | Data deduplication method and device |
CN106293525A (en) * | 2016-08-05 | 2017-01-04 | 上海交通大学 | A kind of method and system improving caching service efficiency |
CN106610790A (en) * | 2015-10-26 | 2017-05-03 | 华为技术有限公司 | Repeated data deleting method and device |
CN106649346A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Data repeatability check method and apparatus |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014130549A (en) * | 2012-12-28 | 2014-07-10 | Fujitsu Ltd | Storage device, control method, and control program |
US10416915B2 (en) * | 2015-05-15 | 2019-09-17 | ScaleFlux | Assisting data deduplication through in-memory computation |
US10761758B2 (en) * | 2015-12-21 | 2020-09-01 | Quantum Corporation | Data aware deduplication object storage (DADOS) |
-
2017
- 2017-07-07 CN CN201710555589.5A patent/CN107391034B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591946A (en) * | 2010-12-28 | 2012-07-18 | 微软公司 | Using index partitioning and reconciliation for data deduplication |
CN102253820A (en) * | 2011-06-16 | 2011-11-23 | 华中科技大学 | Stream type repetitive data detection method |
CN102663058A (en) * | 2012-03-30 | 2012-09-12 | 华中科技大学 | URL duplication removing method in distributed network crawler system |
CN103870514A (en) * | 2012-12-18 | 2014-06-18 | 华为技术有限公司 | Repeating data deleting method and device |
CN103970744A (en) * | 2013-01-25 | 2014-08-06 | 华中科技大学 | Extendible repeated data detection method |
CN103345472A (en) * | 2013-06-04 | 2013-10-09 | 北京航空航天大学 | Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system |
CN103970875A (en) * | 2014-05-15 | 2014-08-06 | 华中科技大学 | Parallel repeated data deleting method |
CN105740266A (en) * | 2014-12-10 | 2016-07-06 | 国际商业机器公司 | Data deduplication method and device |
CN106610790A (en) * | 2015-10-26 | 2017-05-03 | 华为技术有限公司 | Repeated data deleting method and device |
CN106649346A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Data repeatability check method and apparatus |
CN106293525A (en) * | 2016-08-05 | 2017-01-04 | 上海交通大学 | A kind of method and system improving caching service efficiency |
Non-Patent Citations (1)
Title |
---|
Resemblance and mergence based indexing for high performance data deduplication; Panfeng Zhang; Journal of Systems and Software; 2017-06-30; pp. 11-24 |
Also Published As
Publication number | Publication date |
---|---|
CN107391034A (en) | 2017-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107391034B (en) | A kind of repeated data detection method based on local optimization | |
CN103377137B (en) | The frequent block strengthened detection is used to carry out the method and system of storage duplicate removal | |
US7418544B2 (en) | Method and system for log structured relational database objects | |
CN102831222B (en) | Differential compression method based on data de-duplication | |
CN104156380B (en) | A kind of distributed memory hash indexing method and system | |
US20090240655A1 (en) | Bit String Seacrching Apparatus, Searching Method, and Program | |
US20090287660A1 (en) | Bit string searching apparatus, searching method, and program | |
CN103597450B (en) | Memory with the metadata being stored in a part for storage page | |
US11176110B2 (en) | Data updating method and device for a distributed database system | |
CN107291858B (en) | Data indexing method based on character string suffix | |
US8086641B1 (en) | Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same | |
CN107515931A (en) | A kind of duplicate data detection method based on cluster | |
CN111316255B (en) | Data storage system and method for providing a data storage system | |
CN107944041A (en) | A kind of storage organization optimization method of HDFS | |
Zhang et al. | Hashfile: An efficient index structure for multimedia data | |
CN103500183A (en) | Storage structure based on multiple-relevant-field combined index and building, inquiring and maintaining method | |
US7987205B1 (en) | Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations | |
CN113961754B (en) | Graph database system based on persistent memory | |
CN113901279B (en) | Graph database retrieval method and device | |
CN106547484B (en) | A kind of reliability method of realization internal storage data and system based on RAID5 | |
Su-Cheng et al. | Node labeling schemes in XML query optimization: a survey and trends | |
US7953721B1 (en) | Integrated search engine devices that support database key dumping and methods of operating same | |
CN115935020A (en) | Graph data storage method and device | |
CN109213760A (en) | The storage of high load business and search method of non-relation data storage | |
CN112527804B (en) | File storage method, file reading method and data storage system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||