CN102722557A

CN102722557A - Self-adaption identification method for identical data blocks

Info

Publication number: CN102722557A
Application number: CN2012101718585A
Authority: CN
Inventors: 夏耐
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2012-05-29
Filing date: 2012-05-29
Publication date: 2012-10-10
Anticipated expiration: 2032-05-29
Also published as: CN102722557B

Abstract

The invention provides a self-adaptive identification method for identical data blocks, comprising the following steps: initializing a sampling ratio; sampling the content of each data block according to that ratio and mixing the sampled content to obtain a hash value; querying a hash table or search tree with the hash value to find a data block with the same hash value; and then performing a full-content comparison to confirm whether the two blocks are identical. If the two data blocks are finally determined to be different, they constitute one hash collision. The hash collision rate within each period is calculated at set intervals, and the sampling value HS is adaptively adjusted according to the collision rate. The lower the sampling ratio, the faster the hash calculation. The method can adaptively reach an optimal sampling value for a given batch of data and thereby achieve the fastest identification speed for identical data. It can greatly improve the efficiency with which a deduplication system searches for redundant data.

Description

A self-adaptive identification method for identical data blocks

Technical field

The present invention relates to a method for comparing and identifying identical data blocks in a computer, and in particular to a self-adaptive identification method for identical data blocks based on a hash function.

Background technology

Identifying identical data blocks with a hash function is a common technique that is widely used across information systems. For example, a deduplication system for server disk storage typically hashes the content of disk blocks to find redundant blocks with identical content and eliminates them, thereby improving disk storage efficiency. In general, a hash function maps the content of a data block to an integer value; comparing these integer values then speeds up the search for identical data blocks. Existing hash-based methods for this purpose generally take either the full content of a data block or a fixed part of it as the input to the algorithm. They do not automatically choose how much content to feed to the algorithm according to the characteristics of the data set being processed, so in many cases the performance of the algorithm cannot be optimized.

For example, suppose a data set contains several data blocks whose contents differ greatly from one another: for any two data blocks, no pair of bytes at the same position is identical. In that case it suffices to hash the bytes at any single position of the two data blocks, and hashing their full content wastes time.

Summary of the invention

Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing a self-adaptive identification method for identical data blocks that improves the speed and efficiency of a system when identifying identical data blocks.

To solve the above technical problem, the invention discloses a self-adaptive identification method for identical data blocks, comprising the following steps:

Step 1: initialize the data structure HStruct used for hash value lookup (which can be any one of an array, a hash table, or a search tree); initialize the sampling rate value HS, 0 ≤ HS ≤ 100%; initialize the scan counter I, the success counter S, and the collision counter F to 0 (this defines the start of the first scan cycle); and select a fixed-size data block DATA, whose size is SIZE bytes;

Step 2: sample a certain number of bytes from data block DATA; the number of sampled bytes is HS × SIZE;

Step 3: perform a mixing operation on the sampled data, where the mixing operation is any combination of binary integer operations, to obtain a hash value H of integer size;

Step 4: look up hash value H in the lookup data structure HStruct. If the hash value of another data block DTMP is found to equal H, return data block DTMP and proceed to step 5; otherwise insert data block DATA into HStruct with hash value H as its key, and go to step 12. Data block DATA and data block DTMP reside in the same storage space and stand for any two data blocks; DATA and DTMP are merely names that distinguish the two blocks and do not refer to any particular block content.

Step 5: compare the contents of data block DATA and data block DTMP. If the contents are identical, proceed to step 6; otherwise proceed to step 7;

Step 6: record one successful identification of identical data blocks (i.e. success counter S = S + 1), output the pair <data block DATA, data block DTMP> as a result, and skip to step 8;

Step 7: record one hash collision (i.e. collision counter F = F + 1) and proceed to step 8;

Step 8: increment scan counter I by 1. If I is less than the set threshold N, go to step 12; otherwise proceed to step 9 (N is generally greater than 3 and less than half the number of data blocks in the whole data set);

Step 9: calculate the hash collision rate C = F/(F+S) over the whole process in which the scan counter rose from 0 to N, and evaluate the discriminant function J(C, HS). If the result is greater than 0, increase the sampling rate value; if the result is less than 0, decrease the sampling rate value; otherwise leave the sampling rate value unchanged. Define B as the time saved in this scan cycle, relative to sampling the full content of every data block, because the sampling ratio HS is less than 100%: B = T × (1 - C) × (1 - HS) × N, where T is the time required by step 3 when HS = 100%. Define W as the time wasted in this scan cycle because of hash collisions: W = M × C × N, where M is the time needed to compare one pair of data blocks. The discriminant function J(C, HS) is then a decision function that makes B - W gradually tend toward an approximate maximum. A typical form is:

where B' and W' are the time saved and the time wasted in the previous scan cycle; both are initially 0.

Step 10: if the sampling rate value changed in the previous step, let the new sampling rate value be HS_NEW. Perform the mixing operation, according to HS_NEW, on the data blocks already in the lookup data structure HStruct to obtain their updated hash values; the number of bytes sampled for this mixing operation is |HS - HS_NEW| × SIZE;

Step 11: reset the scan counter I, the success counter S, and the collision counter F to 0; this defines the end of the current scan cycle and the start of the next;

Step 12: select the next data block as data block DATA and return to step 2, until all data blocks have been traversed.
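The twelve steps above can be sketched end to end as follows. This is a minimal illustration rather than the patented implementation: the names `sample`, `mix`, `find_identical`, and the `adjust` callback are hypothetical, sampling simply takes the leading bytes, the mixing function is an arbitrary add/shift/XOR combination, and step 10 recomputes hashes from scratch instead of incrementally. The `adjust` callback stands in for the discriminant function J(C, HS), whose concrete form is not assumed here.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sample(block: bytes, hs: float) -> bytes:
    """Step 2: take HS * SIZE bytes (leading bytes here; an assumption)."""
    return block[: max(1, int(hs * len(block)))]


def mix(sampled: bytes) -> int:
    """Step 3: fold sampled bytes into a 32-bit integer with shift and XOR."""
    h = 0
    for b in sampled:
        h = ((h << 5) ^ (h >> 27) ^ b) & 0xFFFFFFFF
    return h


def find_identical(
    blocks: Sequence[bytes],
    hs: float = 0.5,
    n: int = 4,
    adjust: Callable[[float, float], float] = lambda c, hs: hs,
) -> List[Tuple[int, int]]:
    """Return (DATA index, DTMP index) pairs of identical blocks."""
    hstruct: Dict[int, int] = {}          # step 1: hash value -> block index
    pairs: List[Tuple[int, int]] = []
    i = s = f = 0                         # scan, success, collision counters
    for idx, data in enumerate(blocks):   # step 12: next block
        h = mix(sample(data, hs))         # steps 2 and 3
        if h not in hstruct:              # step 4: miss, insert and move on
            hstruct[h] = idx
            continue
        tmp = hstruct[h]
        if blocks[tmp] == data:           # step 5: full-content comparison
            s += 1                        # step 6
            pairs.append((idx, tmp))
        else:
            f += 1                        # step 7: hash collision
        i += 1                            # step 8
        if i < n:
            continue
        c = f / (f + s)                   # step 9: collision rate
        new_hs = adjust(c, hs)
        if new_hs != hs:                  # step 10: rehash stored blocks
            hs = new_hs
            hstruct = {mix(sample(blocks[j], hs)): j for j in hstruct.values()}
        i = s = f = 0                     # step 11: start next scan cycle
    return pairs
```

Because only one block is stored per hash value, a rehash that maps two stored blocks to the same new hash keeps just one of them in this sketch; a fuller implementation would need to decide how to handle that case.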

Beneficial effects: the present invention is suited to quickly identifying data blocks with identical content. Its main beneficial effects are:

1) It adapts automatically to the characteristics of different data sets. If the data blocks in a data set differ greatly from one another, the method quickly adjusts the sampling rate to a very small value; if the data blocks in a data set are highly similar to one another, the method automatically raises the sampling rate again. This avoids the weakness of the background methods, in which a single fixed hash method adapts poorly to different scenarios.

2) Average performance improves substantially over the existing background methods. On many naturally occurring data sets, the method of the invention can settle on a sampling rate below 100%, so a general performance improvement can be expected on most data sets.

Description of drawings

The present invention is described in further detail below with reference to the accompanying drawing and the embodiments; the above and other advantages of the invention will become more apparent.

Fig. 1 is a flow chart of the method of the invention.

Embodiment

The invention discloses a self-adaptive identification method for identical data blocks, comprising the following steps:

1. Initialize the data structure HStruct used for hash value lookup; initialize the sampling rate value HS (0 ≤ HS ≤ 100%); initialize the scan counter I, the success counter S, and the collision counter F to 0 (this defines the start of the first scan cycle); and select a data block DATA of fixed size SIZE bytes.

The data structure HStruct used for hash value lookup can be a hash table, a search tree, or any other data structure that supports fast lookup. If a search tree is chosen, a balanced tree such as a red-black tree or an AVL tree is the usual choice. The initial sampling rate value can be set from a prior estimate of the data set to be processed: if the data blocks in the data set differ considerably from one another, HS can take a relatively high value (≥ 50%); otherwise a relatively low value (≤ 50%) should be chosen. Because the task is to identify identical data blocks, all data blocks in the data set necessarily have the same fixed size, taken to be SIZE bytes.

2. Sample HS × SIZE bytes of data from data block DATA;

In a concrete implementation, sampling a data block means reading HS × SIZE bytes of its content. The sampling scheme should likewise be chosen according to the characteristics of the data blocks. If the data within a block is estimated to be randomly distributed, the block can be sampled at linear intervals determined by the sampling rate (for example, reading bytes 1, 3, 5, 7, and so on). If the data within a block is estimated to follow a fixed pattern, the sampling scheme can be chosen to avoid that pattern so that the algorithm remains stable.
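Linear-interval sampling can be illustrated with a short sketch. The function name `sample_linear` and the choice of evenly spaced positions `floor(k * SIZE / count)` are assumptions made for illustration; the text only requires that HS × SIZE bytes be read at regular intervals.

```python
def sample_linear(block: bytes, hs: float) -> bytes:
    """Read about HS * SIZE bytes at evenly spaced positions of the block."""
    size = len(block)
    count = int(hs * size)           # number of bytes to sample
    if count <= 0:
        return b""
    if count >= size:
        return block                 # HS = 100%: the whole block
    # Positions 0, size/count, 2*size/count, ... (rounded down).
    return bytes(block[(k * size) // count] for k in range(count))
```

With an 8-byte block and HS = 50%, this reads the bytes at offsets 0, 2, 4, and 6, that is, the 1st, 3rd, 5th, and 7th bytes.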

3. Perform a mixing operation on the sampled data to obtain a hash value H of integer size;

The mixing operation can be any combination of binary integer operators, such as addition, subtraction, multiplication, division, remainder, shift, and XOR. The unit of operation can be a single byte, or an integer of a certain length assembled from sampled bytes, such as a 32-bit or 64-bit integer. The mixing operation can be applied to the sampled bytes several times, and the final result is taken as the hash value H.
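As an illustration of using a multi-byte unit of operation, the sketch below groups the sampled bytes into 32-bit little-endian integers and mixes them with multiplication, addition, shift, and XOR. The function name `mix_words`, the zero padding of a trailing partial word, and the particular constants are all assumptions; any other combination of binary integer operations would fit the description equally well.

```python
import struct


def mix_words(sampled: bytes) -> int:
    """Mix sampled bytes in 32-bit units and return a 32-bit hash value."""
    # Pad with zero bytes so the length is a multiple of 4 (an assumption).
    padded = sampled + b"\x00" * (-len(sampled) % 4)
    h = 0
    for (word,) in struct.iter_unpack("<I", padded):
        h = (h * 31 + word) & 0xFFFFFFFF   # multiply and add, kept to 32 bits
        h ^= h >> 15                       # shift and XOR to spread high bits
    return h
```

Processing four bytes per iteration reduces the number of mixing rounds compared with byte-by-byte mixing, which is one reason to prefer a wider unit.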

4. Look up hash value H in the lookup data structure HStruct. If the hash value of another data block DTMP is found to equal the hash value of data block DATA, return data block DTMP and continue with the next step; otherwise insert data block DATA into HStruct with hash value H as its key, and go to step 12.

Hash value H is looked up in HStruct in whatever way suits the concrete kind of HStruct. If the hash value of another data block DTMP equals H, DTMP is returned in the form of some index through which the block can be read, such as a pointer or an array subscript. If no data block with an equal hash value is found, data block DATA is inserted into HStruct with hash value H as its key (including any corresponding adjustment of HStruct), and the method goes to step 12.

5. Compare the contents of data block DATA and data block DTMP. If the contents are identical, proceed to step 6; otherwise proceed to step 7;

The contents of data block DATA and data block DTMP can be compared with a C library function such as memcmp, or with a hand-written loop.

6. Record one successful identification of identical data blocks (i.e. success counter S = S + 1), output the pair <data block DATA, data block DTMP> as the result of the operation, and skip to step 8;

If the contents are equal, one successful identification of identical data blocks is recorded; this can be done by incrementing a counter, defined as S = S + 1. When identification succeeds, the pair <data block DATA, data block DTMP> is output as the result of the operation; other upper-layer systems or algorithms, such as a deduplication system, can act further on this output.

7. Record one hash collision (i.e. collision counter F = F + 1) and proceed to step 8;

If the contents of data block DATA and data block DTMP differ, one hash collision is recorded. This is done by increasing a counter by a fixed amount; typically the collision counter is updated as F = F + 1.

8. Increment scan counter I by 1. If I is less than a threshold N, go to step 12; otherwise proceed to step 9;

In a concrete implementation, the threshold N is determined by the characteristics of the data set. If the data blocks in the data set are fairly evenly distributed, N can take a smaller value; otherwise it should be increased appropriately. The overall range of N is 3 to M/2, where M is the number of data blocks in the whole data set.

9. Calculate the hash collision rate C = F/(F+S) for the period in which the scan counter ran from 0 to N, and evaluate the discriminant function J(C, HS). If the result is greater than 0, increase the sampling rate value; if the result is less than 0, decrease the sampling rate value; otherwise leave the sampling rate value unchanged;

The higher the hash collision rate, the more the sampling ratio should be raised; conversely, the sampling ratio should be lowered. Because every hash collision may incur an extra performance loss, the larger the collision rate C, the more readily the result of J(C, HS) should lead to an increase in the sampling rate. J(C, HS) must also take the current sampling rate into account, that is, the unnecessary data reading time saved relative to hashing the full content (HS = 100%). A common way to evaluate, within a scan cycle, the speed-up obtained by using a sampling rate below 100% instead of 100% is to compute the difference between the time saved and the time wasted, and then to seek to maximize that difference. Define B as the time saved in this scan cycle, relative to sampling the full content of every data block, because the sampling ratio HS is less than 100%: B = T × (1 - C) × (1 - HS) × N, where T is the time required by step 3 when HS = 100%. Define W as the time wasted in this scan cycle because of hash collisions: W = M × C × N, where M is the time needed to compare one pair of data blocks (not the block count M used for the range of N in step 8). The discriminant function J(C, HS) is then a decision function that makes B - W gradually tend toward an approximate maximum. A typical form of J(C, HS) is:

where B' and W' are the time saved and the time wasted in the previous scan cycle. If the discriminant function J(C, HS) yields a result greater than 0, the sampling rate value is increased; if the result is less than 0, the sampling rate value is decreased; otherwise the sampling rate value is unchanged.
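The quantities that feed the discriminant function can be computed directly from the counters. The sketch below evaluates C, B, and W from the definitions in the text; the function name `cycle_gain` and the sample timings are hypothetical, and the sketch deliberately stops at B - W rather than guessing a concrete form for J(C, HS).

```python
from typing import Tuple


def cycle_gain(f: int, s: int, hs: float, n: int, t: float, m: float) -> Tuple[float, float, float]:
    """Return (C, B, W) for one scan cycle.

    f, s: collision and success counts for the cycle (f + s > 0)
    hs:   current sampling rate, between 0 and 1
    n:    scan threshold N
    t:    time of step 3 at HS = 100%; m: time of one block comparison
    """
    c = f / (f + s)                   # C = F / (F + S)
    b = t * (1 - c) * (1 - hs) * n    # B = T * (1 - C) * (1 - HS) * N
    w = m * c * n                     # W = M * C * N
    return c, b, w
```

For example, with made-up timings T = 10 and M = 4 (in arbitrary time units), N = 10, HS = 20%, one collision, and nine successes, the cycle gives C = 0.1, B = 72, and W = 4, so B - W = 68. With five collisions and five successes under the same settings, B - W falls to 20, which is the kind of drop that should push the sampling rate upward.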

10. If the sampling rate value was updated to HS_NEW in step 9, perform the mixing operation, according to HS_NEW, on the data blocks already in the lookup data structure HStruct to obtain their updated hash values; the number of bytes sampled for this mixing operation is |HS - HS_NEW| × SIZE.

If the sampling rate value was updated to HS_NEW in step 9, the hash value of the next selected data block will be based on the sampling ratio HS_NEW, whereas the hash values that key the data blocks already in HStruct are based on the old sampling ratio. The hash values of the data blocks already in HStruct must therefore be recomputed, using the same mixing operation as in step 3. To reduce computation time, the new hash value of each existing data block is computed on the basis of its old hash value, so that only a further |HS - HS_NEW| × SIZE bytes of content need to be sampled.
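One way to reuse the old hash value is sketched below. It rests on two assumptions that the text does not state: samples are taken as a prefix of the block, so that a larger sampling rate only appends bytes to the old sample, and the mixing operation processes bytes sequentially, so that its state can be continued. The names `mix_from` and `rehash` are hypothetical. Under these assumptions only the increasing case is incremental; when the sampling rate decreases, this sketch falls back to recomputing from the start of the block.

```python
def mix_from(h: int, extra: bytes) -> int:
    """Continue a sequential shift/XOR mix from state h over extra bytes."""
    for b in extra:
        h = ((h << 5) ^ (h >> 27) ^ b) & 0xFFFFFFFF
    return h


def rehash(block: bytes, old_hash: int, hs_old: float, hs_new: float) -> int:
    """Return the hash of block under hs_new, reusing old_hash when possible."""
    size = len(block)
    k_old, k_new = int(hs_old * size), int(hs_new * size)
    if k_new >= k_old:
        # Only the additional (HS_NEW - HS) * SIZE bytes are read and mixed.
        return mix_from(old_hash, block[k_old:k_new])
    # Sampling rate decreased: recompute over the shorter prefix (assumption).
    return mix_from(0, block[:k_new])
```

With interval-based sampling the old and new sample positions generally do not nest, so an incremental update of this kind would need a different sampling layout.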

11. Reset the scan counter I, the success counter S, and the collision counter F to 0; this defines the end of the current scan cycle and the start of the next;

12. Select the next data block as data block DATA and return to step 2, until all data blocks have been traversed.

In a concrete implementation, how the next data block DATA is chosen depends on the data set supplied to the algorithm.

As one application of the invention, if data block DATA and data block DTMP have identical content, data block DATA is deleted and the space it occupies is released. This requirement arises widely in storage systems, memory management, data backup, and many other scenarios; the following embodiments illustrate it in detail.
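How an upper-layer system might consume the output pairs can be sketched as follows. The function name `deduplicate` and the representation of the result as a store of unique blocks plus an index table are assumptions for illustration; the text only says that the duplicate block is deleted and its space released.

```python
from typing import Dict, List, Sequence, Tuple


def deduplicate(
    blocks: Sequence[bytes], pairs: Sequence[Tuple[int, int]]
) -> Tuple[Dict[int, bytes], List[int]]:
    """Keep one copy of each identical block and redirect duplicates to it.

    pairs holds (DATA index, DTMP index): DATA duplicates the stored DTMP.
    Returns the store of kept blocks and, per original block, the index
    of the kept block that now represents it.
    """
    canonical = {dup: kept for dup, kept in pairs}
    store: Dict[int, bytes] = {}
    index: List[int] = []
    for i in range(len(blocks)):
        target = canonical.get(i, i)              # duplicates point at DTMP
        store.setdefault(target, blocks[target])  # keep a single copy
        index.append(target)
    return store, index
```

For three blocks where the third duplicates the first, the store keeps two blocks and the index table reads `[0, 1, 0]`.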

Embodiment 1

This embodiment discloses an application of the self-adaptive identification method for identical data blocks to storage optimization in a still-image monitoring system. The scenario is as follows:

At short intervals, the still-image monitoring system takes a high-resolution photograph of its target and stores the result in PNG format. Because the target may remain unchanged for long periods, and may also change greatly within a short time, many of the stored pictures are identical and a small number differ. Because PNG is itself a compressed format, conventional lossless compression can hardly reduce the storage space further without loss of image quality. The system optimization strategy is therefore to quickly identify the large number of identical images by some method and so reduce the space actually stored.

In this scenario, the concrete steps of the method in this embodiment are:

Step 1: initialize the data structure HStruct used for hash value lookup; this embodiment uses a red-black tree as an example. Initialize the sampling rate value HS: given the characteristics of the data set, in which most pictures are identical and a few differ greatly, the initial value of HS is set to a small value, 20%. Initialize the scan counter I, the success counter S, and the collision counter F to 0 (this defines the start of the first scan cycle), and select one of the pictures, P (because all pictures are taken with the same lens at the same resolution, they all have the same size, assumed in this embodiment to be SIZE bytes);

Step 2: sample a certain number of bytes from picture P; the number of sampled bytes is HS × SIZE;

Step 3: perform a mixing operation on the sampled data. The binary operation used for mixing in this embodiment is

where H is the hash value, with initial value 0, and I is each integer value of the sampled data, taken in order during mixing. The result of the calculation is the final hash value H of the sampled picture;

Step 4: look up hash value H in the lookup data structure HStruct. If the sampled hash value of another picture PNEW is found to equal H, return picture PNEW and proceed to step 5; otherwise insert picture P into HStruct with hash value H as its key, and go to step 12;

Step 5: compare the contents of picture P and picture PNEW. If the contents are identical, proceed to step 6; otherwise proceed to step 7;

Step 6: record one successful identification of identical data blocks (i.e. success counter S = S + 1) and output the pair <picture P, picture PNEW> as a result. On this basis the system's storage management program points the file index of picture P to picture PNEW and deletes the redundant picture P; skip to step 8;

Step 8: increment scan counter I by 1. If I is less than the set threshold N, go to step 12; otherwise proceed to step 9. In this embodiment N takes the small value 10;

Step 9: calculate the hash collision rate C = F/(F+S) for the period in which the scan counter ran from 0 to N, and evaluate the discriminant function J(C, HS). If the result is greater than 0, increase the sampling rate value; if the result is less than 0, decrease the sampling rate value; otherwise leave the sampling rate value unchanged. Define B as the time saved in this scan cycle, relative to sampling the full content of every data block, because the sampling ratio HS is less than 100%: B = T × (1 - C) × (1 - HS) × N, where T is the time required by step 3 when HS = 100%. Define W as the time wasted in this scan cycle because of hash collisions: W = M × C × N, where M is the time needed to compare one pair of data blocks. The discriminant function J(C, HS) is then a decision function that makes B - W gradually tend toward an approximate maximum. In this embodiment the discriminant function J(C, HS) is specifically:

where B' and W' are the time saved and the time wasted in the previous scan cycle.

Step 10: if the sampling rate value changed in the previous step, let the new sampling rate value be HS_NEW. Perform the mixing operation, according to HS_NEW, on the data blocks already in the lookup data structure HStruct to obtain their updated hash values; the number of bytes sampled for this mixing operation is |HS - HS_NEW| × SIZE;

Step 11: reset the scan counter I, the success counter S, and the collision counter F to 0; this marks the end of the current scan cycle and the start of the next;

Step 12: select the next data block as picture P and return to step 2, until all data blocks have been traversed.

Compared with a prior-art approach (such as hashing the whole picture with SuperFastHash), in more than 100 tests on the same data, this embodiment improved overall processing speed fivefold and reduced average CPU usage by 60%, with identical accuracy.

Embodiment 2

This embodiment discloses an application of the self-adaptive identification method for identical data blocks to a multi-user file backup system. The scenario is as follows:

A system that stores backup files for multiple users usually holds many identical or similar files. A typical case is a working group that shares a batch of files, with each member making their own modifications to different files and backing up their own file directory directly to a shared network storage device. The optimization strategy of such a backup system aims to merge identical file data blocks together.

In this scenario, this embodiment differs from Embodiment 1 as follows:

In step 1, regarding the choice of data blocks: to increase the amount of data that can be merged, all input files are cut into small data blocks; a typical size is 4 KB.

In step 1, because the degree of similarity among the file data blocks cannot be anticipated in advance, the initial value of HS is set to a compromise value, namely 50%, and the subsequent steps are left to adjust it automatically;

In step 3 and step 10, the mixing operation uses the binary operation H = ((H >> 16) ^ H ^ (H << 17)) + I;

In step 4, the lookup data structure HStruct is a hash table;

In step 8, because the number of file data blocks may be very large, the value of N adopted is the total number of files.

In step 9, because loading file data into memory takes extra time, the formula adopted for the discriminant function J(C, HS) is

where B_i and W_i denote the corresponding measured values in the preceding i-th scan cycle.

The remaining steps not described here are the same as in Embodiment 1.
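The two concrete differences that can be expressed in code, cutting files into 4 KB blocks and the mixing operation H = ((H >> 16) ^ H ^ (H << 17)) + I, can be sketched as follows. The function names are hypothetical. Two details are assumptions not stated in the text: the last block of a file is padded with zero bytes so that every block has the same fixed size, and H is kept to 32 bits, since the integer width is not specified and Python integers would otherwise grow without bound.

```python
from typing import Iterable, List


def split_blocks(content: bytes, block_size: int = 4096) -> List[bytes]:
    """Cut file content into fixed-size blocks, zero-padding the last one."""
    blocks = []
    for off in range(0, len(content), block_size):
        chunk = content[off : off + block_size]
        blocks.append(chunk.ljust(block_size, b"\x00"))  # padding: assumption
    return blocks


def mix_embodiment2(values: Iterable[int], bits: int = 32) -> int:
    """Apply H = ((H >> 16) ^ H ^ (H << 17)) + I over the sampled integers."""
    mask = (1 << bits) - 1            # fixed integer width: assumption
    h = 0                             # initial value of H
    for i in values:
        h = (((h >> 16) ^ h ^ (h << 17)) + i) & mask
    return h
```

Here `values` can be the sampled bytes themselves or wider integers assembled from them, matching the choice of operation unit described for step 3.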

Compared with a prior-art approach (such as hashing blocks with SHA-1), in more than 100 tests on the same data, this embodiment improved overall system speed more than tenfold and reduced average CPU utilization by 85%.

The invention provides a self-adaptive identification method for identical data blocks. There are many ways and means of implementing this technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention. Any component not made explicit in this embodiment can be implemented with existing technology.

Claims

1. A self-adaptive identification method for identical data blocks, characterized in that it comprises the following steps:

Step 1: initialize the data structure HStruct used for hash value lookup; initialize the sampling rate value HS, 0 ≤ HS ≤ 100%; initialize the scan counter I, the success counter S, and the collision counter F to 0; and select a fixed-size data block DATA, whose size is SIZE bytes;

Step 3: perform a mixing operation on the sampled data to obtain a hash value H of integer size;

Step 4: look up hash value H in the lookup data structure HStruct; if the hash value of another data block DTMP is found to equal H, return data block DTMP and proceed to step 5; otherwise insert data block DATA into HStruct with hash value H as its key, and go to step 12;

Step 5: compare the contents of data block DATA and data block DTMP; if the contents are identical, proceed to step 6, otherwise proceed to step 7;

Step 6: record one successful identification of identical data blocks, i.e. success counter S = S + 1, output the pair <data block DATA, data block DTMP> as a result, and skip to step 8;

Step 7: record one hash collision, i.e. collision counter F = F + 1, and proceed to step 8;

Step 8: increment scan counter I by 1; if I is less than the set threshold N, go to step 12, otherwise proceed to step 9;

Step 9: calculate the hash collision rate C = F/(F+S) for the period in which the scan counter ran from 0 to N, and evaluate the discriminant function J(C, HS), where HS is the current sampling ratio; if the result is greater than 0, increase the sampling rate value; if the result is less than 0, decrease the sampling rate value; otherwise leave the sampling rate value unchanged;

Step 11: reset the scan counter I, the success counter S, and the collision counter F to 0; this marks the end of the current scan cycle and the start of the next;

2. The self-adaptive identification method for identical data blocks according to claim 1, characterized in that the lookup data structure HStruct is any one of an array, a hash table, or a search tree.

3. The self-adaptive identification method for identical data blocks according to claim 1, characterized in that the mixing operation is any combination of binary integer operations.

4. The self-adaptive identification method for identical data blocks according to claim 1, characterized in that the discriminant function J(C, HS) is a decision function that makes B - W gradually tend toward an approximate maximum, wherein W is the time wasted in one scan cycle because of hash collisions, W = M × C × N, where M is the time needed to compare one pair of data blocks;

B is the time saved in one scan cycle by using the current sampling rate value HS rather than a sampling ratio of 100%: B = T × (1 - C) × (1 - HS) × N, where T is the time required by step 3 under the assumption HS = 100%.