CN107391034A - A duplicate-data detection method based on locality optimization - Google Patents
A duplicate-data detection method based on locality optimization
- Publication number: CN107391034A (application CN201710555589.5A)
- Authority: CN (China)
- Prior art keywords: fingerprint, hash, hash bucket, bucket, caching
- Legal status: Granted
Classifications
- G06F3/061: Improving I/O performance (under G06F3/06, digital input from or output to record carriers, and G06F3/0601, interfaces specially adapted for storage systems)
- G06F3/0641: De-duplication techniques (under G06F3/0628, interfaces making use of a particular technique, and G06F3/064, management of blocks)
- G06F3/0656: Data buffering arrangements (under G06F3/0655, vertical data movement between hosts and storage devices)
Abstract
The invention discloses a duplicate-data detection method based on locality optimization, belonging to the field of computer storage technology. It addresses the low detection efficiency of existing duplicate-data detection methods, which worsens as stored data volumes grow. The method comprises four detection steps: Bloom-filter detection, hash-bucket write-buffer detection, hash-bucket read-buffer detection, and hash-bucket address-table detection. It mainly targets data-set types with strong locality: by mining the locality within a data set, it improves the efficiency of data prefetching, reduces disk-access overhead, and raises the throughput of data deduplication. For potentially duplicate data in a data set, the method first uses a Bloom filter to pre-judge whether a data block is a repeat; then, according to the different conditions, it performs a three-level duplicate-data check against the hot zone and cold zone of the buffer area and against the disk, making full use of the locality of duplicate data and improving the efficiency of duplicate-data detection.
Description
Technical field
The invention belongs to the field of computer storage technology, and more particularly relates to a duplicate-data detection method based on locality optimization.
Background technology
With the rapid development of information technology, information has become a precious resource on which we depend and the greatest driving force behind productivity growth. The wide application of information technology is accompanied by the generation of massive data, and ever more valuable data needs to be stored. How to effectively improve the storage efficiency of existing storage media and meet ever-growing storage demand has therefore become one of the pressing problems in the storage research field. Meanwhile, an IDC survey reports that about 75% of existing data is redundant; that is, only 25% of data is unique. In this context, data deduplication, a technique for detecting and eliminating redundancy at large scale, has become a research hotspot in academia and industry in recent years and is being applied to a widening range of information storage systems.
Detecting repeated fingerprints is the key technical means of realizing data deduplication. Existing deduplication techniques mainly detect duplicate data via fingerprint detection: the fingerprint (hash value) of each data block is extracted, and the repetition of fingerprints identifies whether a data block is a duplicate. Basic fingerprint-detection methods typically use a single hash table, a B-tree, or a similar data structure to identify repeated fingerprints.
However, a problem of the above fingerprint-detection methods that cannot be ignored is their low detection performance: they cannot perform efficient duplicate-data detection on large data sets, which limits the overall efficiency of data deduplication.
Content of the invention
In view of the above shortcomings of the prior art, the invention provides a duplicate-data detection method based on locality optimization, aiming to solve the technical problem that existing fingerprint-based duplicate-data detection methods have low detection performance and cannot perform efficient duplicate-data detection on large data sets.
To achieve the above object, according to one aspect of the invention, a duplicate-data detection method based on locality optimization is provided, comprising the following steps:
(1) Obtain a fingerprint list file, extract part of the fingerprints from the fingerprint list file and store them in a cache, then take one fingerprint from the cache;
(2) Query whether the Bloom filter may have recorded the extracted fingerprint; if it may have, go to step (4), otherwise go to step (3);
(3) Insert the fingerprint into the Bloom filter and the hash-bucket write buffer (Buffer), take the next fingerprint from the cache, and return to step (2);
(4) Judge whether the fingerprint is recorded in the hot zone of the hash-bucket read buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (5);
(5) Judge whether the fingerprint is recorded in the hot bucket of the hash-bucket write buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) Look up the hash-bucket address table by the fingerprint to obtain the corresponding hash-bucket ID; if no ID is obtained, conclude that the fingerprint is new, take the next fingerprint from the cache and return to step (2); if an ID is obtained, go to step (7);
(7) Traverse all hash buckets in the cold zone of the hash-bucket read buffer using the obtained hash-bucket ID to judge whether a bucket with that ID is present. If a matching bucket is found, search it for the fingerprint, take the next fingerprint from the cache, and return to step (2). Otherwise, read the bucket with that ID from disk, insert it as the first bucket in the hot zone of the hash-bucket read buffer, and search the inserted bucket for the fingerprint: if the fingerprint is found, it is an existing fingerprint, otherwise it is a new fingerprint; then take the next fingerprint from the cache and return to step (2).
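The seven steps above can be sketched as a single lookup routine. The following Python sketch is illustrative only and not the patented implementation: plain sets and dicts stand in for the Bloom filter, the read-buffer hot zone, the write-buffer hot bucket, the address table, and the on-disk buckets, and all names are hypothetical.

```python
class DuplicateDetector:
    """Simplified three-level duplicate-fingerprint check (illustrative only)."""

    def __init__(self):
        self.bloom = set()       # stands in for the Bloom filter (here: no false positives)
        self.hot_zone = set()    # hot zone of the hash-bucket read buffer
        self.hot_bucket = set()  # hot bucket of the hash-bucket write buffer
        self.addr_table = {}     # fingerprint -> hash-bucket ID
        self.disk = {}           # hash-bucket ID -> set of fingerprints

    def detect(self, fp):
        """Return True if fp is a duplicate, False if it is new."""
        # Step (2): Bloom pre-judgement -- "definitely new" vs "possibly recorded".
        if fp not in self.bloom:
            # Step (3): record the new fingerprint in the filter and write buffer.
            self.bloom.add(fp)
            self.hot_bucket.add(fp)
            bucket_id = 0                      # single bucket, for illustration
            self.addr_table[fp] = bucket_id
            self.disk.setdefault(bucket_id, set()).add(fp)
            return False
        # Steps (4)-(5): hot zone of the read buffer, then hot bucket of the write buffer.
        if fp in self.hot_zone or fp in self.hot_bucket:
            return True
        # Step (6): hash-bucket address-table lookup.
        bucket_id = self.addr_table.get(fp)
        if bucket_id is None:
            return False                       # would be a Bloom false positive
        # Step (7): read the bucket (as if from disk) and promote it to the hot zone.
        bucket = self.disk[bucket_id]
        self.hot_zone |= bucket
        return fp in bucket
```

A fingerprint's first appearance takes the cheap "definitely new" path; later appearances are answered from the hot structures without touching the slower levels.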
Preferably, the Bloom filter is created in the initialization phase, with
the optimal bit-vector length m of the Bloom filter equal to:
m = log2(e) × log2(1/ε) × C
and the number of hash functions equal to:
k = ⌈ln2 × (m/C)⌉
where C denotes the duplicate-data-block index capacity and ε denotes the false-positive rate of the Bloom filter.
Preferably, in step (2), whether the Bloom filter has recorded the extracted fingerprint X is judged as follows: if for the hash values hash_i(X) there holds hash_1(X) & hash_2(X) & … & hash_k(X) = 0, that is, at least one of the k mapped bit positions is 0, then the Bloom filter has not recorded fingerprint X and X is a new fingerprint; otherwise fingerprint X may have been recorded. Here 1 ≤ i ≤ k and k denotes the number of hash functions.
Preferably, a hash bucket is a container for holding fingerprints, with a size of 192 to 512 fingerprints per bucket. The hash-bucket write buffer is created in the initialization phase by allocating free memory space, and its size equals 4 to 128 hash buckets. A first list and a second list are set up in the hash-bucket write buffer, and each node in a list is a hash bucket.
Preferably, (a) compute the values hash_i(X) of the k hash functions, and set to 1 the bit at offset hash_i(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash-bucket lists in the write buffer to find a non-full bucket; when one is found, store the fingerprint in the first non-full bucket and write the fingerprint and the bucket's hash-bucket ID into the hash-bucket address table. If all hash buckets in one list are full, lock that list, write all its hash buckets to disk, then empty and unlock the list; emptying a list means clearing all its hash buckets and assigning each a new hash-bucket ID. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before the insertion is performed.
Preferably, in step (4), the hash-bucket read buffer is a cache space set up in memory in the initialization phase. It is organized as a linked list of hash buckets, and its size is 1024 to 2048 hash buckets.
According to another aspect of the invention, a duplicate-data detection system based on locality optimization is provided, comprising:
a first module for obtaining a fingerprint list file, extracting part of the fingerprints from the fingerprint list file, storing them in a cache, and taking one fingerprint from the cache;
a second module for querying whether the Bloom filter may have recorded the extracted fingerprint; if it may have, control passes to the fourth module, otherwise to the third module;
a third module for inserting the fingerprint into the Bloom filter and the hash-bucket write buffer, taking the next fingerprint from the cache, and returning to the second module;
a fourth module for judging whether the fingerprint is recorded in the hot zone of the hash-bucket read buffer; if so, the next fingerprint is taken from the cache and control returns to the second module, otherwise control passes to the fifth module;
a fifth module for judging whether the fingerprint is recorded in the hot bucket of the hash-bucket write buffer; if so, the next fingerprint is taken from the cache and control returns to the second module, otherwise control passes to the sixth module;
a sixth module for looking up the hash-bucket address table by the fingerprint to obtain the corresponding hash-bucket ID; if no ID is obtained, the fingerprint is deemed new, the next fingerprint is taken from the cache and control returns to the second module; if an ID is obtained, control passes to the seventh module;
a seventh module for traversing all hash buckets in the cold zone of the hash-bucket read buffer using the obtained hash-bucket ID to determine whether a bucket with that ID is present; if so, that bucket is searched for the fingerprint, the next fingerprint is taken from the cache, and control returns to the second module; otherwise the bucket with that ID is read from disk and inserted as the first bucket in the hot zone of the hash-bucket read buffer, and the inserted bucket is searched for the fingerprint, the fingerprint being an existing fingerprint if found and a new fingerprint otherwise; the next fingerprint is then taken from the cache and control returns to the second module.
In general, compared with the prior art, the above technical scheme of the invention achieves the following beneficial effects:
1. The invention solves the technical problem that existing fingerprint-detection methods have low detection performance and cannot perform efficient duplicate-data detection on large data sets: because step (2) is adopted, the Bloom filter's pre-judgement effectively reduces the number of fingerprint checks and improves the retrieval performance for repeated fingerprints.
2. The invention adopts steps (3) to (7), making full use of the characteristics of the data set itself. Using data prefetching and caching, it performs a three-level duplicate-data check against the hot zone and cold zone of the buffer area and against the disk according to the different conditions, fully mining the locality of duplicate data, improving the accuracy of data prefetching, effectively reducing the number of disk accesses, and further improving the efficiency of duplicate-data detection.
Brief description of the drawings
Fig. 1 is the logical structure diagram of the invention;
Fig. 2 is the data structure of the Bloom filter;
Fig. 3 is a structural schematic of the hash-bucket address table;
Fig. 4 is a structural schematic of the hash-bucket read buffer;
Fig. 5 is a schematic diagram of the duplicate-data detection method based on locality optimization of the invention.
Detailed description of the embodiments
To make the objects, technical schemes, and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with each other as long as they do not conflict.
The invention proposes an efficient technique for fast detection of repeated fingerprints. It mainly targets duplicate-data detection on data-set types with strong locality, and improves the performance of duplicate-data detection through an optimization strategy of one-level pre-judgement plus three-level detection, realized with a Bloom filter and caching.
The basic idea of the invention is as follows: for potentially duplicate data in a data set, first use a Bloom filter to pre-judge whether a data block is a repeat; then, according to the different conditions, perform a three-level duplicate-data check against the hot zone and cold zone of the buffer area and against the disk, making full use of the locality of duplicate data and improving the efficiency of duplicate-data detection.
The basic logical structure of the invention is shown in Fig. 1. It consists mainly of six parts: the fingerprint cache table, the Bloom filter, the hash-bucket address table, the hash-bucket write buffer, the hash-bucket read buffer, and the hash buckets on disk.
To explain the invention clearly, the terms appearing in this specification are defined and illustrated as follows:
Fingerprint list: the set of fingerprints formed, in processing order, by splitting the data set into blocks and extracting each block's fingerprint.
Fingerprint cache table: the fingerprint cache table caches fingerprints from the fingerprint list. If the fingerprint list comes from a file, a fixed number of fingerprints is read in at once and stored in the fingerprint cache area. The system takes fingerprints out of the fingerprint cache table one by one and performs the repeated-fingerprint lookup.
Bloom filter: as shown in Fig. 2, a Bloom filter consists of a bit vector of length m bits and k independent hash functions h_i(x) (1 ≤ i ≤ k, k < m). It is a space-efficient randomized data structure that represents a set with a bit vector and can judge whether an element belongs to that set. To represent a set S = {x_1, x_2, x_3, …, x_n}, all positions in the bit vector are first initialized to 0. Then, for each element x_j (1 ≤ j ≤ n) in S, the k independent hash functions h_i(x) are applied to obtain k hash values h_i(x_j) (1 ≤ i ≤ k, x_j ∈ S). Taking the first bit of the vector as the origin and these k hash values as offsets, x_j is mapped to k positions in {1, 2, …, m}; those positions are set to 1 and x_j is thereby marked. Once every element of S has been marked, the set S is represented by the Bloom filter. If a position is set to 1 more than once, only the first setting has any effect.
To determine whether a data element y belongs to S, the k independent hash functions h_i(x) are first applied to y to obtain k hash values h_i(y). Taking the first bit of the vector as the origin and these k hash values as offsets, the corresponding positions in the Bloom filter's bit vector are checked: if they are all 1, y may belong to S; otherwise y is definitely not an element of S.
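The marking and membership test just described can be sketched as a generic Bloom filter. This is not code from the patent; deriving the k hash positions from salted SHA-1 digests is an assumption made so the sketch is self-contained.

```python
import hashlib

class BloomFilter:
    """Bit vector of m bits with k hash positions per element."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)   # all positions start at 0

    def _positions(self, item):
        # k "independent" hash functions, simulated by salting one digest.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        # Mark the element: set its k mapped positions to 1.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item):
        # All k positions 1 -> possibly in S; any position 0 -> definitely not in S.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))
```

A query answering "not in S" is definitive; a positive answer may be a false positive, which is why the method treats it only as a pre-judgement and continues with the three-level check.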
Because the hash functions h_i(x) (1 ≤ i ≤ k) may collide for two different elements, an element y not in S may be mapped to positions in the bit vector that were all set by elements of S, so that the Bloom filter makes a positive judgement in error. The probability that an element not in S is mistaken for an element of S is called the false positive probability (False Positive Probability), also abbreviated as the error rate (Error Rate). The false positive probability can be controlled mathematically.
Given the cardinality n of S, the length m of the Bloom filter's bit vector, and the number k of its hash functions, the probability that a given position of the bit vector is still 0 after n elements have been inserted is (1 − 1/m)^(k×n). On the other hand, the Bloom filter makes a false positive judgement for a new element y only when all positions corresponding to y have been set to 1, from which the false positive probability f_BF can be deduced:
f_BF = (1 − (1 − 1/m)^(k×n))^k ≈ (1 − e^(−k×n/m))^k.
It can be derived that when k = ⌈ln2 × (m/n)⌉, the Bloom filter has its minimum false positive probability, called the ideal error rate and denoted F_BF; at that point about 50% of the positions in the bit vector are 1. The symbol ⌈·⌉ denotes the smallest integer not less than its argument.
Further, if n is known and a Bloom filter is to be designed whose ideal error rate does not exceed a given upper bound ε, it can be derived that m must satisfy:
m ≥ log2(e) × log2(1/ε) × n.
If m = ⌈log2(e) × log2(1/ε) × n⌉ and k = ⌈ln2 × (m/n)⌉, then the false positive probability grows to ε exactly when all n elements have been inserted into the Bloom filter; n is therefore also called the Bloom filter's design capacity.
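From the design capacity and the error bound, the two parameters can be computed numerically. The sketch below only evaluates the relations m = ⌈log2(e) × log2(1/ε) × n⌉ and k = ⌈ln2 × (m/n)⌉ and is not from the patent.

```python
import math

def bloom_parameters(n, eps):
    """Smallest bit-vector length m and hash count k for capacity n, error bound eps."""
    # m >= log2(e) * log2(1/eps) * n, rounded up to an integer number of bits.
    m = math.ceil(math.log2(math.e) * math.log2(1 / eps) * n)
    # k = ceil(ln 2 * m / n), the hash count minimizing the false positive rate.
    k = math.ceil(math.log(2) * m / n)
    return m, k
```

For example, n = 1000 and ε = 0.01 gives roughly 9.6 bits per element and k = 7, matching the rule of thumb k ≈ log2(1/ε).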
From the above analysis, the length m of the bit vector and the number k of hash functions can be computed from the design capacity n and the error-rate upper bound ε. The design capacity n is the number of elements the Bloom filter is expected to mark. When a Bloom filter has marked fewer than n elements, it is non-full: a non-full Bloom filter can both continue to mark new elements and answer queries about whether some element has been marked. When the number of marked elements reaches n, the Bloom filter is full: it can no longer mark new elements, but it can still answer queries. Note that n ≤ m.
Hash bucket: the hash bucket is the basic unit of fingerprint storage and of swapping into and out of the caches. A hash bucket stores a fixed number of unique fingerprints (a unique fingerprint is one whose value differs from every other fingerprint).
Hash-bucket write buffer: the hash-bucket write buffer is a buffer zone opened up in memory to cache new fingerprints before they are written to disk. Because inserting new fingerprints and writing hash buckets back to disk cannot operate on the same hash bucket simultaneously, and to avoid contention on this critical resource, the write buffer is designed as two hash-bucket lists. When all hash buckets in one list are full, that list is locked and all its hash buckets are written to disk. Because these hash buckets are written in one pass, they are generally written to the same track on disk, and their fingerprints preserve data locality within a certain range, which makes subsequent read prefetching possible. After the write completes, the list is emptied entirely, while new fingerprints continue to be stored in the hash buckets of the other list. If the disk write of one list has not completed and the hash buckets of the other list are already full, the insertion must wait for the disk write to finish. When a fingerprint is stored into the write buffer, a hash-bucket ID is assigned and the hash-bucket address table is updated.
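The double-list scheme can be sketched as follows. In this simplified sketch the flush of a full list is synchronous, so the lock/write/clear/unlock sequence described above collapses into an ordinary function call; class and field names are hypothetical.

```python
class WriteBuffer:
    """Two lists of fixed-capacity hash buckets; a full list is flushed whole."""

    def __init__(self, buckets_per_list=4, bucket_capacity=192):
        self.next_id = 0
        self.capacity = bucket_capacity
        self.lists = [self._new_list(buckets_per_list) for _ in range(2)]
        self.addr_table = {}   # fingerprint -> hash-bucket ID
        self.disk = {}         # hash-bucket ID -> fingerprints written back

    def _new_list(self, n):
        buckets = []
        for _ in range(n):     # each bucket gets a fresh ID
            buckets.append((self.next_id, []))
            self.next_id += 1
        return buckets

    def insert(self, fp):
        """Store fp in the first non-full bucket; flush a list when it fills up."""
        for i, buckets in enumerate(self.lists):
            for bucket_id, fps in buckets:
                if len(fps) < self.capacity:
                    fps.append(fp)
                    self.addr_table[fp] = bucket_id
                    return bucket_id
            # Every bucket in this list is full: write the whole list to "disk",
            # then empty it and assign new bucket IDs (synchronous stand-in for
            # the lock / write / clear / unlock sequence).
            for bucket_id, fps in buckets:
                self.disk[bucket_id] = fps
            self.lists[i] = self._new_list(len(buckets))
        return self.insert(fp)   # both lists were full; retry after flushing
```

Writing a full list in one pass is what keeps the flushed buckets adjacent on disk and preserves the locality that later prefetching relies on.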
Hash-bucket address table: the hash-bucket address table is a key-value hash table resident in memory that stores the mapping from a fingerprint key to the ID of the hash bucket holding that fingerprint. Its role is to locate quickly, when searching for a fingerprint on disk, the position of the hash bucket storing that fingerprint. Its concrete structure is shown in Fig. 3: a fingerprint is 20 bytes long, a bucket ID is 4 bytes, and a pointer (Pointer) occupies 8 bytes. Hash-table storage can suffer hash collisions; when a collision occurs, it is resolved by chaining.
Hash-bucket read buffer: the hash-bucket read buffer is a block of memory opened up to cache hash buckets read from disk. To improve the efficiency of disk indexing, part of the fingerprint index table (hash buckets) on disk is read into the hash-bucket read buffer in memory. The read buffer is organized as a doubly linked list in which each node stores one hash bucket; its structure is shown in Fig. 4. Each node records the bucket's ID number and a flag bit indicating whether the bucket is a "dirty bucket". Logically, the read buffer consists of two parts: the first few nodes form the hot zone, and the nodes from there to the tail form the cold zone. This hot/cold division is used to optimize retrieval performance.
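A minimal sketch of the hot/cold split, assuming an OrderedDict in place of the doubly linked list (the front of the dict plays the role of the hot zone). The names, the eviction-from-the-tail policy, and the omission of dirty-bucket flags are all simplifying assumptions.

```python
from collections import OrderedDict

class ReadCache:
    """Cached buckets in list order: the first hot_size nodes form the hot zone,
    the remaining nodes the cold zone; an OrderedDict stands in for the list."""

    def __init__(self, size=1024, hot_size=2):
        self.buckets = OrderedDict()   # bucket_id -> set of fingerprints (front first)
        self.size, self.hot_size = size, hot_size

    def in_hot_zone(self, fp):
        # Step (4): only the first hot_size buckets are scanned.
        for _, fps in list(self.buckets.items())[:self.hot_size]:
            if fp in fps:
                return True
        return False

    def cold_zone_bucket(self, bucket_id):
        # Step (7): search the cold zone for a bucket with this ID.
        cold = list(self.buckets.items())[self.hot_size:]
        return dict(cold).get(bucket_id)

    def insert_hot(self, bucket_id, fps):
        # A bucket read from disk becomes the first bucket of the hot zone;
        # the tail bucket is evicted when the cache overflows.
        self.buckets[bucket_id] = fps
        self.buckets.move_to_end(bucket_id, last=False)
        if len(self.buckets) > self.size:
            self.buckets.popitem(last=True)
```

Recently fetched buckets sit at the head where lookups are cheapest, buckets age toward the tail as new ones arrive, and the least recently fetched bucket is the one dropped.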
As shown in Fig. 5, the duplicate-data detection method based on locality optimization of the invention comprises the following steps:
(1) Obtain a fingerprint list file, extract part of the fingerprints from the fingerprint list file (the size of this part equals the size of the space designated for storing fingerprints in the cache) and store them in the cache, then take one fingerprint from the cache; once all fingerprints in the cache have been taken, a new batch of fingerprints is read from the fingerprint list file and stored in the cache;
(2) Query whether the Bloom filter may have recorded the extracted fingerprint; if it may have, go to step (4), otherwise go to step (3);
Specifically, the Bloom filter is created in the initialization phase according to the following procedure:
the optimal bit-vector length m of the Bloom filter equals:
m = log2(e) × log2(1/ε) × C
and the number of hash functions equals:
k = ⌈ln2 × (m/C)⌉
where C denotes the duplicate-data-block index capacity and ε denotes the false-positive rate of the Bloom filter, whose value is no higher than the threshold 0.00001.
In this step, whether the Bloom filter may have recorded the extracted fingerprint X is judged as follows: if for the hash values hash_i(X) (where 1 ≤ i ≤ k and k denotes the number of hash functions) there holds h_1(X) & h_2(X) & … & h_k(X) = 0, that is, at least one of the bits at those positions is 0, then fingerprint X was not recorded in the Bloom filter and X is a new fingerprint; otherwise fingerprint X may have been recorded;
(3) Insert the fingerprint into the Bloom filter and the hash-bucket write buffer (Buffer), take the next fingerprint from the cache, and return to step (2);
In this step a hash bucket is a container for holding fingerprints; its size can be any value, preferably 192 to 512 fingerprints per bucket. The hash-bucket write buffer is created in the initialization phase by allocating free memory space; its size can be any value, preferably 4 to 128 hash buckets.
In the initialization phase two lists (a first list and a second list) are set up in the hash-bucket write buffer, and every node on a list is a hash bucket.
Specifically, inserting the fingerprint into the Bloom filter and the hash-bucket write buffer in this step proceeds as follows: (a) compute the values hash_i(X) of the k hash functions, and set to 1 the bit at offset hash_i(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) poll the two hash-bucket lists in the write buffer to find a non-full bucket; when one is found, store the fingerprint in the first non-full bucket (that hash bucket is then called the hot bucket), and write the fingerprint and the bucket's hash-bucket ID into the hash-bucket address table. If all hash buckets in one list are full, lock that list, write all its hash buckets to disk, then empty and unlock the list; emptying a list means clearing all its hash buckets and assigning each a new hash-bucket ID. If all buckets in both lists are full, the fingerprint insertion must wait until one list has been written out and emptied before the insertion is performed.
The above hash-bucket address table is an address list created in memory in the initialization phase; it describes, as key-value pairs, the mapping between a fingerprint and the hash-bucket ID of the bucket storing that fingerprint.
(4) Judge whether the fingerprint is recorded in the hot zone of the hash-bucket read buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (5);
Specifically, the hash-bucket read buffer (Cache) is a cache space set up in memory in the initialization phase. It is organized as a linked list of hash buckets; its size can be any value, preferably 1024 to 2048 hash buckets. The one or more hash buckets at the front of the list form the hot zone, and the remaining hash buckets form the cold zone.
(5) Judge whether the fingerprint is recorded in the hot bucket of the hash-bucket write buffer; if so, take the next fingerprint from the cache and return to step (2), otherwise go to step (6);
(6) Look up the hash-bucket address table by the fingerprint to obtain the corresponding hash-bucket ID; if no ID is obtained, conclude that the fingerprint is new, take the next fingerprint from the cache and return to step (2); if an ID is obtained, go to step (7);
(7) Traverse all hash buckets in the cold zone of the hash-bucket read buffer using the obtained hash-bucket ID to judge whether a bucket with that ID is present. If a matching bucket is found, search it for the fingerprint, take the next fingerprint from the cache, and return to step (2). Otherwise, read the bucket with that ID from disk, insert it as the first bucket in the hot zone of the hash-bucket read buffer, and search the inserted bucket for the fingerprint: if the fingerprint is found, it is an existing fingerprint, otherwise it is a new fingerprint; then take the next fingerprint from the cache and return to step (2).
The invention has the following beneficial effects. First, because step (2) is adopted, the Bloom filter's pre-judgement effectively reduces the number of fingerprint checks and improves the retrieval performance for repeated fingerprints. Second, because steps (3) to (7) are adopted, the invention makes full use of the characteristics of the data set itself; using data prefetching and caching, it performs a three-level duplicate-data check against the hot zone and cold zone of the buffer area and against the disk according to the different conditions, fully mining the locality of duplicate data, improving the accuracy of data prefetching, effectively reducing the number of disk accesses, and further improving the efficiency of duplicate-data detection.
Those skilled in the art will readily understand that the above description covers only preferred embodiments of the invention and does not limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (7)
1. A duplicate data detection method based on locality optimization, characterized by comprising the following steps:
(1) obtaining a fingerprint list file, extracting a portion of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;
(2) querying the Bloom filter to determine whether the extracted fingerprint may have been recorded; if it may have been recorded, proceeding to step (4), otherwise proceeding to step (3);
(3) inserting the fingerprint into the Bloom filter and the hash bucket write buffer (Buffer), extracting the next fingerprint from the cache, and returning to step (2);
(4) judging whether the fingerprint has been recorded in the hot zone of the hash bucket read cache; if so, extracting the next fingerprint from the cache and returning to step (2), otherwise proceeding to step (5);
(5) judging whether the fingerprint has been recorded in the hot buckets of the hash bucket write buffer; if so, extracting the next fingerprint from the cache and returning to step (2), otherwise proceeding to step (6);
(6) searching the hash bucket address table according to the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if it cannot be obtained, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and the method returns to step (2); if it can be obtained, proceeding to step (7);
(7) traversing all hash buckets in the cold zone of the hash bucket read cache according to the obtained hash bucket ID to determine whether there is a hash bucket corresponding to the hash bucket ID; if there is a corresponding hash bucket, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to step (2); otherwise, reading the hash bucket corresponding to the hash bucket ID from disk, inserting it at the first hash bucket position in the hot zone of the hash bucket read cache, and searching for the fingerprint in the inserted hash bucket; if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint; then extracting the next fingerprint from the cache and returning to step (2).
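To make the tiered lookup of claim 1 concrete, here is a minimal Python sketch (not from the patent; all class and attribute names are illustrative, and locking, prefetching, and cache eviction are omitted). A Bloom-filter miss means the fingerprint is definitely new; on a possible hit, the hot zone, the write buffer, the address table, and finally the cold zone or disk are consulted in order:

```python
import hashlib

class DedupDetector:
    """Illustrative sketch of the tiered lookup in claim 1: Bloom filter ->
    read-cache hot zone -> write buffer -> address table -> cold zone/disk."""

    def __init__(self, bits=1 << 16, k=4):
        self.bits, self.k = bits, k
        self.bitvec = bytearray(bits // 8)  # Bloom filter bit vector
        self.hot = set()            # hot zone of the hash bucket read cache
        self.cold = {}              # cold zone: bucket_id -> fingerprint set
        self.write_buffer = set()   # fingerprints awaiting flush to disk
        self.address_table = {}     # fingerprint -> hash bucket ID
        self.disk = {}              # simulated disk: bucket_id -> fingerprints

    def _positions(self, fp):
        # derive k bit positions from one SHA-256 per (fp, i) pair
        for i in range(self.k):
            h = hashlib.sha256(fp + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def _bloom_maybe(self, fp):
        return all(self.bitvec[p >> 3] & (1 << (p & 7)) for p in self._positions(fp))

    def _bloom_add(self, fp):
        for p in self._positions(fp):
            self.bitvec[p >> 3] |= 1 << (p & 7)

    def is_duplicate(self, fp):
        if not self._bloom_maybe(fp):           # steps (2)/(3): definitely new
            self._bloom_add(fp)
            self.write_buffer.add(fp)
            return False
        if fp in self.hot:                      # step (4): read-cache hot zone
            return True
        if fp in self.write_buffer:             # step (5): write buffer
            return True
        bucket_id = self.address_table.get(fp)  # step (6): address table
        if bucket_id is None:
            self.write_buffer.add(fp)           # deemed new (sketch: buffer it)
            return False
        bucket = self.cold.get(bucket_id)       # step (7): cold zone, then disk
        if bucket is None:
            bucket = self.disk[bucket_id]       # fetch bucket into the hot zone
            self.hot.update(bucket)
        return fp in bucket
```

The first query of a fingerprint misses the Bloom filter and is recorded as new; a repeat query is then caught in the write buffer without touching disk.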
2. The duplicate data detection method according to claim 1, characterized in that the Bloom filter is created during an initialization phase, and:
The optimal bit-vector size m of the Bloom filter is:
m = log2(e) × log2(1/ε) × C
The number of hash functions is:
k = log2(1/ε)
where C denotes the capacity of the duplicate data block index and ε denotes the false positive rate of the Bloom filter.
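The sizing rule in claim 2 can be checked with a small helper (a sketch; the function name is ours, not the patent's). For example, a capacity of C = 1000 blocks at a 1% false-positive rate needs roughly 9,586 bits and 7 hash functions:

```python
import math

def bloom_parameters(capacity, false_positive_rate):
    """Claim 2's sizing: m = log2(e) * log2(1/eps) * C bits, and the
    matching hash-function count k = log2(1/eps), both rounded up/off."""
    m = math.log2(math.e) * math.log2(1 / false_positive_rate) * capacity
    k = math.log2(1 / false_positive_rate)
    return math.ceil(m), round(k)
```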
3. The duplicate data detection method according to claim 1, characterized in that judging in step (2) whether the extracted fingerprint X may have been recorded in the Bloom filter is specifically: for the hash functions hash_i(X), if hash_1(X) & hash_2(X) & … & hash_k(X) = 0, the Bloom filter has not recorded fingerprint X and fingerprint X is a new fingerprint; otherwise fingerprint X may have been recorded, where 1 ≤ i ≤ k and k denotes the number of hash functions.
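Claim 3's test reads as: X is possibly recorded only if every bit addressed by hash_1(X) … hash_k(X) is set; if the AND of those bits is 0, at least one bit is unset and X is certainly new. A minimal sketch (the function name is ours; `positions` stands in for the k hash values):

```python
def maybe_recorded(bitvec, positions):
    """Return True iff every bit of bitvec addressed by positions is set,
    i.e. the AND of the k addressed bits is nonzero (claim 3's condition)."""
    return all((bitvec[p >> 3] >> (p & 7)) & 1 for p in positions)
```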
4. The duplicate data detection method according to claim 1, characterized in that:
a hash bucket is a container for holding fingerprints, with a preferred capacity of 192 to 512 fingerprints per bucket;
the hash bucket write buffer is created during the initialization phase by requesting free memory space, with a preferred size of 4 to 128 hash buckets;
the hash bucket write buffer contains a first list and a second list, in which each node consists of a hash bucket.
5. The duplicate data detection method according to claim 1, characterized in that inserting the fingerprint into the Bloom filter and the hash bucket write buffer is specifically: (a) computing the values hash_i(X) of the k hash functions and setting to 1 the bit at offset hash_i(X) in the bit vector, where k is the number of hash functions and 1 ≤ i ≤ k; (b) polling the two hash bucket lists in the hash bucket write buffer to determine whether either list contains a bucket that is not full; when a non-full bucket is found, storing the fingerprint in the first non-full bucket and writing the fingerprint and the hash bucket ID of that bucket into the hash bucket address table; if all hash buckets in a list are found to be full, locking that list, writing all hash buckets in the list to disk, and after the write completes, emptying and unlocking the list, wherein emptying a list means emptying all hash buckets in the list and assigning a new hash bucket ID to each hash bucket; if all buckets in both lists are full, the fingerprint insertion operation must wait until one list has been written and emptied before performing the insertion.
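The two-list write-buffer discipline of claim 5(b) might be sketched as follows (illustrative names; the Bloom-filter bit-setting of step (a) and the locking are omitted, the bucket capacity is shrunk far below the preferred 192-512 for demonstration, and the flush is deferred until both lists are full, a simplification of the claim's per-list flush):

```python
BUCKET_CAPACITY = 4  # patent prefers 192 to 512 fingerprints per bucket

class WriteBuffer:
    """Sketch of claim 5(b): two lists of fixed-capacity hash buckets; when
    no bucket has room, a full list is flushed to (simulated) disk, emptied,
    and its buckets are assigned fresh hash bucket IDs."""

    def __init__(self, buckets_per_list=2):
        self.next_id = 0
        self.disk = {}           # simulated disk: bucket_id -> fingerprints
        self.address_table = {}  # fingerprint -> hash bucket ID
        self.lists = [self._new_list(buckets_per_list) for _ in range(2)]

    def _new_bucket(self):
        bucket = {"id": self.next_id, "fps": []}
        self.next_id += 1        # every fresh bucket gets a new ID
        return bucket

    def _new_list(self, n):
        return [self._new_bucket() for _ in range(n)]

    def insert(self, fp):
        for lst in self.lists:   # poll both lists for a non-full bucket
            for bucket in lst:
                if len(bucket["fps"]) < BUCKET_CAPACITY:
                    bucket["fps"].append(fp)
                    self.address_table[fp] = bucket["id"]
                    return bucket["id"]
        full = self.lists[0]     # everything full: flush one list to disk
        for bucket in full:
            self.disk[bucket["id"]] = bucket["fps"]
        self.lists[0] = self._new_list(len(full))  # empty + re-ID the list
        return self.insert(fp)
```

With one bucket per list, the ninth insertion triggers a flush of the first list before the fingerprint is stored in a freshly re-identified bucket.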
6. The duplicate data detection method according to claim 1, characterized in that in step (4) the hash bucket read cache is a cache space allocated in memory during the initialization phase, organized as a linked list composed of multiple hash buckets, and the preferred size of the hash bucket read cache is 1024 to 2048 hash buckets.
7. A duplicate data detection system based on locality optimization, characterized by comprising:
a first module for obtaining a fingerprint list file, extracting a portion of the fingerprints from the fingerprint list file and storing them in a cache, and extracting one fingerprint from the cache;
a second module for querying the Bloom filter to determine whether the extracted fingerprint may have been recorded; if it may have been recorded, control passes to the fourth module, otherwise to the third module;
a third module for inserting the fingerprint into the Bloom filter and the hash bucket write buffer, extracting the next fingerprint from the cache, and returning to the second module;
a fourth module for judging whether the fingerprint has been recorded in the hot zone of the hash bucket read cache; if so, extracting the next fingerprint from the cache and returning to the second module, otherwise passing control to the fifth module;
a fifth module for judging whether the fingerprint has been recorded in the hot buckets of the hash bucket write buffer; if so, extracting the next fingerprint from the cache and returning to the second module, otherwise passing control to the sixth module;
a sixth module for searching the hash bucket address table according to the fingerprint to determine whether a corresponding hash bucket ID can be obtained; if it cannot be obtained, the fingerprint is deemed a new fingerprint, the next fingerprint is extracted from the cache, and control returns to the second module; if it can be obtained, control passes to the seventh module;
a seventh module for traversing all hash buckets in the cold zone of the hash bucket read cache according to the obtained hash bucket ID to determine whether there is a hash bucket corresponding to the hash bucket ID; if there is a corresponding hash bucket, searching for the fingerprint in that hash bucket, extracting the next fingerprint from the cache, and returning to the second module; otherwise, reading the hash bucket corresponding to the hash bucket ID from disk, inserting it at the first hash bucket position in the hot zone of the hash bucket read cache, and searching for the fingerprint in the inserted hash bucket; if it is found, the fingerprint is an existing fingerprint; if it is not found, the fingerprint is a new fingerprint; then extracting the next fingerprint from the cache and returning to the second module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710555589.5A CN107391034B (en) | 2017-07-07 | 2017-07-07 | A kind of repeated data detection method based on local optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107391034A true CN107391034A (en) | 2017-11-24 |
CN107391034B CN107391034B (en) | 2019-05-10 |
Family
ID=60335524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710555589.5A Active CN107391034B (en) | 2017-07-07 | 2017-07-07 | A kind of repeated data detection method based on local optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107391034B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102253820A (en) * | 2011-06-16 | 2011-11-23 | 华中科技大学 | Stream type repetitive data detection method |
CN102591946A (en) * | 2010-12-28 | 2012-07-18 | 微软公司 | Using index partitioning and reconciliation for data deduplication |
CN102663058A (en) * | 2012-03-30 | 2012-09-12 | 华中科技大学 | URL duplication removing method in distributed network crawler system |
CN103345472A (en) * | 2013-06-04 | 2013-10-09 | 北京航空航天大学 | Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system |
CN103870514A (en) * | 2012-12-18 | 2014-06-18 | 华为技术有限公司 | Repeating data deleting method and device |
US20140188912A1 (en) * | 2012-12-28 | 2014-07-03 | Fujitsu Limited | Storage apparatus, control method, and computer product |
CN103970744A (en) * | 2013-01-25 | 2014-08-06 | 华中科技大学 | Extendible repeated data detection method |
CN103970875A (en) * | 2014-05-15 | 2014-08-06 | 华中科技大学 | Parallel repeated data deleting method |
CN105740266A (en) * | 2014-12-10 | 2016-07-06 | 国际商业机器公司 | Data deduplication method and device |
US20160335024A1 (en) * | 2015-05-15 | 2016-11-17 | ScaleFlux | Assisting data deduplication through in-memory computation |
CN106293525A (en) * | 2016-08-05 | 2017-01-04 | 上海交通大学 | A kind of method and system improving caching service efficiency |
CN106610790A (en) * | 2015-10-26 | 2017-05-03 | 华为技术有限公司 | Repeated data deleting method and device |
CN106649346A (en) * | 2015-10-30 | 2017-05-10 | 北京国双科技有限公司 | Data repeatability check method and apparatus |
US20170177266A1 (en) * | 2015-12-21 | 2017-06-22 | Quantum Corporation | Data aware deduplication object storage (dados) |
Non-Patent Citations (1)
Title |
---|
PANFENG ZHANG: "Resemblance and mergence based indexing for high performance data deduplication", 《JOURNAL OF SYSTEMS AND SOFTWARE》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107944038A (en) * | 2017-12-14 | 2018-04-20 | 上海达梦数据库有限公司 | A kind of generation method and device of duplicate removal data |
CN107944038B (en) * | 2017-12-14 | 2020-11-10 | 上海达梦数据库有限公司 | Method and device for generating deduplication data |
CN108459826A (en) * | 2018-02-01 | 2018-08-28 | 杭州宏杉科技股份有限公司 | A kind of method and device of processing I/O Request |
CN108459826B (en) * | 2018-02-01 | 2020-12-29 | 杭州宏杉科技股份有限公司 | Method and device for processing IO (input/output) request |
CN109101365A (en) * | 2018-08-01 | 2018-12-28 | 南京壹进制信息技术股份有限公司 | A kind of data backup and resume method deleted again based on source data |
CN109240605A (en) * | 2018-08-17 | 2019-01-18 | 华中科技大学 | A kind of quick repeated data block identifying method stacking memory based on 3D |
CN109471635A (en) * | 2018-09-03 | 2019-03-15 | 中新网络信息安全股份有限公司 | A kind of algorithm optimization method realized based on Java Set set |
CN109740037A (en) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | The distributed online real-time processing method of multi-source, isomery fluidised form big data and system |
CN109740037B (en) * | 2019-01-02 | 2023-11-24 | 山东省科学院情报研究所 | Multi-source heterogeneous flow state big data distributed online real-time processing method and system |
CN109783523A (en) * | 2019-01-24 | 2019-05-21 | 广州虎牙信息科技有限公司 | A kind of data processing method, device, equipment and storage medium |
CN110046164B (en) * | 2019-04-16 | 2021-07-02 | 中国人民解放军国防科技大学 | Operation method of consistent valley filter |
CN110046164A (en) * | 2019-04-16 | 2019-07-23 | 中国人民解放军国防科技大学 | Index independent grain distribution filter, consistency grain distribution filter and operation method |
CN110489405A (en) * | 2019-07-12 | 2019-11-22 | 平安科技(深圳)有限公司 | The method, apparatus and server of data processing |
WO2021008024A1 (en) * | 2019-07-12 | 2021-01-21 | 平安科技(深圳)有限公司 | Data processing method and apparatus, and server |
CN110489405B (en) * | 2019-07-12 | 2024-01-12 | 平安科技(深圳)有限公司 | Data processing method, device and server |
CN111338581A (en) * | 2020-03-27 | 2020-06-26 | 尹兵 | Data storage method and device based on cloud computing, cloud server and system |
CN112800430A (en) * | 2021-02-01 | 2021-05-14 | 苏州棱镜七彩信息科技有限公司 | Safety and compliance management method suitable for open source assembly |
CN113721862A (en) * | 2021-11-02 | 2021-11-30 | 腾讯科技(深圳)有限公司 | Data processing method and device |
US20230221864A1 (en) * | 2022-01-10 | 2023-07-13 | Vmware, Inc. | Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table |
Also Published As
Publication number | Publication date |
---|---|
CN107391034B (en) | 2019-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107391034B (en) | A kind of repeated data detection method based on local optimization | |
CN103377137B (en) | The frequent block strengthened detection is used to carry out the method and system of storage duplicate removal | |
Bertino et al. | Indexing techniques for advanced database systems | |
US7418544B2 (en) | Method and system for log structured relational database objects | |
CN102831222B (en) | Differential compression method based on data de-duplication | |
US20090240655A1 (en) | Bit String Seacrching Apparatus, Searching Method, and Program | |
Leung | Mining uncertain data | |
CN103597450B (en) | Memory with the metadata being stored in a part for storage page | |
CN110377747B (en) | Knowledge base fusion method for encyclopedic website | |
CN107515931A (en) | A kind of duplicate data detection method based on cluster | |
CN104462582A (en) | Web data similarity detection method based on two-stage filtration of structure and content | |
US8086641B1 (en) | Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same | |
CN103229164B (en) | Data access method and device | |
CN107291858B (en) | Data indexing method based on character string suffix | |
CN107451233A (en) | Storage method of the preferential space-time trajectory data file of time attribute in auxiliary storage device | |
CN103500183A (en) | Storage structure based on multiple-relevant-field combined index and building, inquiring and maintaining method | |
CN113901279B (en) | Graph database retrieval method and device | |
Liu et al. | EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search | |
CN113961754B (en) | Graph database system based on persistent memory | |
Su-Cheng et al. | Node labeling schemes in XML query optimization: a survey and trends | |
US7987205B1 (en) | Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations | |
CN106371765A (en) | Method for removing memory thrashing through efficient LTL ((Linear Temporal Logic) model detection of large-scale system | |
Tao et al. | Validity information retrieval for spatio-temporal queries: Theoretical performance bounds | |
CN110489448A (en) | The method for digging of big data correlation rule based on Hadoop | |
Suei et al. | A signature-based Grid index design for main-memory RFID database applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||