Background Art
In existing computer storage and file systems, two or more similar files whose contents are mostly identical must each be stored separately, which consumes a large amount of storage space. The common way to address this problem is data deduplication.
Data deduplication is implemented by dividing a file into data blocks of essentially equal length and storing only one copy of any data blocks whose contents are identical. Whether two data blocks have identical contents can be judged by comparing their MD5 values or their SHA-1 values. Values computed with the MD5 or SHA-1 algorithm are highly dispersed. The hash value computed by the MD5 algorithm is 128 bits long. The probability that data blocks with different contents yield the same hash value is on the order of $1/2^{B/2}$, where $B$ is the number of bits in the hash value. Taking 128-bit MD5 as an example, the probability that data blocks with different contents have the same MD5 hash value is on the order of $1/2^{64}$ (approximately $5.4 \times 10^{-20}$); so small a probability is usually regarded as impossible. SHA-1 follows design principles similar to those of MD5 and produces a longer hash value of 160 bits. It is generally considered that an MD5 or SHA-1 value uniquely represents the characteristics of the original information, and these values are commonly used for protected storage of passwords, digital signatures, file integrity checking, authentication, and so on. In terms of computational efficiency, MD5 is faster than SHA-1.
A second problem arises when a file is modified and must then be synchronized: often very little content has changed, yet the entire file must be synchronized, which generates a large amount of network traffic. The technique currently used to address this is incremental synchronization.
Incremental synchronization means that, when a file is synchronized over a network, the whole file need not be transmitted; only the content that does not already exist in the storage and file system of the destination end is sent. When synchronizing between different versions of the same file, this can be understood as transmitting only the changes to the file. The file is divided into logical data blocks, and the contents of the logical data blocks are compared to find the parts that are identical and the parts that differ between the destination-end and source-end files. Identical parts are obtained at the destination end and need not be transmitted over the network; only the differing parts are transmitted, which reduces the amount of data transferred. Again, whether two logical data blocks are identical can be judged by comparing MD5 or SHA-1 values.
The commonly used remote data synchronization tool rsync is also an incremental synchronization technique. It uses the so-called "rsync algorithm" to synchronize files between a local computer and a remote computer. Suppose file A' must be synchronized between two computers and the destination end already holds a previous version A of the file. The rsync algorithm then proceeds as follows:
1. The destination end divides file A into a set of non-overlapping logical data blocks of fixed length S bytes (the last block may be shorter than S). For each block it computes a 32-bit checksum and a 128-bit MD4 value, and sends these checksums and MD4 values to the source end. MD4 is the predecessor of MD5 and is somewhat weaker in terms of security.
2. The source end searches file A' for all logical data blocks of length S (at any offset, not necessarily a multiple of S) to find blocks whose checksum and MD4 value are the same as those of some logical data block of file A.
3. The source end sends the destination end a sequence of instructions for constructing a copy of file A' at the destination end. Each instruction is either a reference to a logical data block that file A already contains and that need not be retransmitted, or a piece of data that matches no logical data block of file A.
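The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not rsync itself: MD5 stands in for MD4 (which is often unavailable in `hashlib`), `zlib.adler32` is recomputed for every window instead of being rolled incrementally, and the function names `signatures`, `delta`, and `patch` are hypothetical.

```python
import hashlib
import zlib

def signatures(old: bytes, s: int) -> dict:
    """Destination end: map weak checksum -> list of (strong hash, block index)."""
    table = {}
    for index, offset in enumerate(range(0, len(old), s)):
        block = old[offset:offset + s]
        weak = zlib.adler32(block)
        table.setdefault(weak, []).append((hashlib.md5(block).digest(), index))
    return table

def delta(new: bytes, s: int, table: dict) -> list:
    """Source end: emit ("copy", block index) or ("data", literal bytes)."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + s]
        match = None
        for strong, index in table.get(zlib.adler32(window), []):
            if hashlib.md5(window).digest() == strong:
                match = index
                break
        if match is None:
            literal.append(new[pos])          # no match at this offset, send the byte
            pos += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))       # block already present at the destination
            pos += len(window)
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def patch(old: bytes, s: int, ops: list) -> bytes:
    """Destination end: rebuild the new file from the old file and the instructions."""
    out = bytearray()
    for kind, value in ops:
        out += old[value * s:(value + 1) * s] if kind == "copy" else value
    return bytes(out)

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick red fox jumps over the lazy dog!!"
ops = delta(new, 8, signatures(old, 8))
rebuilt = patch(old, 8, ops)
```

Only the `("data", ...)` instructions carry file content over the network; the `("copy", ...)` instructions carry a block index.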
Because the rsync algorithm transmits only the parts in which the two files differ, rather than the entire file each time, it is quite fast. However, rsync can only synchronize between different versions of a file with the same file name. If, in the example above, file A' has content similar to file A but a different file name, rsync still transmits the entire content of A'.
Because the deduplication technique described above divides a file into data blocks of essentially equal length, its deduplication efficiency is relatively low and it cannot effectively reduce the amount of data transferred.
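One commonly cited reason for this inefficiency, offered here as an illustration rather than as a statement from the original text, is that inserting even a single byte shifts every later fixed-length block boundary. The small sketch below, with the hypothetical helper `fixed_block_hashes`, shows the effect on sample data.

```python
import hashlib

def fixed_block_hashes(data: bytes, block_size: int) -> set:
    """Return the set of MD5 digests of the fixed-length blocks of data."""
    return {hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}

original = bytes(range(256)) * 4      # 1024 bytes of sample data
edited = b"\x00" + original           # one byte inserted at the front

before = fixed_block_hashes(original, 64)
after = fixed_block_hashes(edited, 64)
shared = before & after               # blocks that could be reused; empty here
```

Although the two inputs differ by one byte, no fixed-length block of the edited data matches a block of the original.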
Summary of the Invention
To remedy the deficiencies of the prior art, the object of the present invention is to provide a content-based file splitting method, a file storage method, and a file synchronization method.
To achieve the above object, a content-based file splitting method of the present invention comprises the following steps:
1) selecting the window length and the expected length of a logical data block, and setting the length range of the logical data block according to the expected length;
2) using the Rabin fingerprint algorithm to calculate the Rabin fingerprint value of each sliding window, and determining the breakpoints of the logical data blocks according to the Rabin fingerprint values of the sliding windows;
3) dividing the file into logical data blocks;
4) selecting the expected length of a data storage block and limiting the length range of the data storage block;
5) searching for and determining the breakpoints of the data storage blocks;
6) dividing the file into data storage blocks.
To achieve the above object, the present invention also provides a file storage method comprising the following steps:
dividing the file data into logical data blocks and data storage blocks;
storing each data storage block of the file, and recording in the metadata the information on the data storage blocks that the file comprises.
To achieve the above object, the present invention also provides a file synchronization method comprising the following steps:
1) the source end of the synchronization uses the file splitting method described above to divide the new file into data storage blocks and logical data blocks, and sends the data storage block information and logical data block information to the destination end;
2) the destination end of the synchronization looks up the data storage blocks and logical data blocks that do not exist locally, assembles data storage blocks, and notifies the source end which logical data blocks it needs;
3) the source end of the synchronization sends the logical data blocks that the destination end needs.
The present invention has clear advantages and beneficial effects. With the content-based file splitting method of the present invention, the content that differs between different files, or between different versions of the same file, can be found accurately and efficiently. In a storage system, because only one copy of data storage blocks with identical contents is stored, a large amount of storage space is saved when files with similar or identical contents need to be stored. When files are uploaded, backed up, or archived in a file system, the source end needs to transmit only the data storage blocks and logical data block information that do not exist at the destination end, rather than the data storage blocks and logical data block information of the entire new file, so less content is transmitted. In a network file system, the mapping from a file to its physical storage blocks in the file system metadata can be replaced by a mapping to the file's logical data blocks or data storage blocks as defined by this method, which reduces the dependence of network file system performance on bandwidth.
Detailed Description of the Embodiments
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Fig. 1 is a flowchart of the content-based file splitting method according to the present invention. With reference to Fig. 1, the implementation process is described in detail as follows:
First, in step 101, the window length and the expected length of a logical data block (block) are selected, and the length range of the logical data block is limited.
A window is a contiguous region of the file; the suggested length is 48 bytes. A sliding window is obtained from the previous window in the file by sliding one byte toward the end of the file; the window length remains unchanged after sliding.
A logical data block is a relatively small piece of data. When incremental synchronization is performed, the logical data block is the smallest unit of synchronization. In the storage system, data is handled in units of logical data blocks. The expected length of a logical data block may be 2K, 4K, or 8K bytes, or some other value.
Two problems must be avoided when searching for logical data block breakpoints. If a file contains very many breakpoints, it is divided into a great many logical data blocks, all of them very short, so that the amount of logical data block information to be stored and transmitted becomes very large and may even exceed the file content itself. If, on the other hand, the logical data blocks in a file are very long, the probability that a logical data block can be reused becomes very small, and a change inside such a block causes a large amount of data to be transmitted. In this step, therefore, the length range of the logical data block is limited. The minimum length is Tmin, which is generally set to half the expected logical data block length but may be set to another value according to actual conditions. The maximum length is Tmax, which may be chosen as 16K, 32K, or 64K bytes, for example, according to actual conditions.
In step 102, the Rabin fingerprint algorithm is used to calculate the Rabin fingerprint value of each sliding window, and the breakpoints of the logical data blocks are determined according to the Rabin fingerprint values of the sliding windows.
In this step, the breakpoints of the logical data blocks are searched for and determined on the basis of the file content (content based). The benefit of this approach is that inserting or deleting content in a file affects only the logical data blocks whose content has changed and does not affect the other logical data blocks. Specifically, starting from the beginning of the file, the Rabin fingerprint value of each sliding window is calculated. When the low n bits of a sliding window's Rabin fingerprint value equal a certain given value, that sliding window forms the breakpoint of the first logical data block. Calculation of the Rabin fingerprint value of each sliding window then continues from the first breakpoint; when the low n bits of a sliding window's Rabin fingerprint value again equal the given value, that sliding window forms the breakpoint of the second logical data block. Proceeding in this way, the Rabin fingerprint values of all sliding windows are calculated and the breakpoints of all logical data blocks in the file are found, up to the end of the file. The end of the file is necessarily also the breakpoint of a logical data block.
The Rabin fingerprint (Rabin fingerprinting) algorithm is a fingerprint algorithm proposed by Rabin of Harvard University. It computes the hash value of a sliding window efficiently, and the values it produces are highly dispersed.
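The breakpoint search of step 102 can be sketched as follows. As a loudly labeled substitution, the sketch uses a simple polynomial rolling hash modulo a prime in place of a true Rabin fingerprint; the constants `BASE` and `MOD` and the name `find_breakpoints` are assumptions made for illustration only, and the Tmin and Tmax limits are left out.

```python
import random

WINDOW = 48              # window length in bytes, as suggested in the text
BASE = 257               # hypothetical multiplier of the stand-in rolling hash
MOD = (1 << 31) - 1      # hypothetical prime modulus

def find_breakpoints(data: bytes, n: int, target: int = 0) -> list:
    """Return the end offsets of the logical data blocks of data."""
    mask = (1 << n) - 1
    top = pow(BASE, WINDOW - 1, MOD)       # weight of the byte leaving the window
    breakpoints, h = [], 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # take in the newest byte
        if i + 1 >= WINDOW and (h & mask) == target:
            breakpoints.append(i + 1)                # the block ends after this window
    if not breakpoints or breakpoints[-1] != len(data):
        breakpoints.append(len(data))                # the end of the file is a breakpoint
    return breakpoints

rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(20000))
new = old[:10000] + b"0123456789" + old[10000:]      # insert 10 bytes mid-file

bp_old = find_breakpoints(old, 6)                    # expected block length about 2**6
bp_new = find_breakpoints(new, 6)
```

Because each decision depends only on the bytes inside the window, breakpoints before the insertion are unchanged and breakpoints far enough after it reappear shifted by 10 bytes, so only the blocks around the edit differ.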
Taking the low n bits of a sliding window's Rabin fingerprint value means taking the remainder obtained by dividing the Rabin fingerprint value by $2^{n}$. The value of n is related to the expected length of a logical data block. Because the values computed by the Rabin fingerprint algorithm are very uniformly distributed, if the file content is also quite random, the logical data blocks produced by splitting will be about $2^{n}$ bytes long, that is, $2^{n}$ equals the expected length of a logical data block. The content covered by the breakpoint window must of course also be counted as part of the logical data block. Thus, if the expected logical data block length is 4K bytes, n should be 12 ($2^{12} = 4096$ = 4K).
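The relation between n and the expected length can be made explicit with a short calculation. As a simplifying assumption that is not stated in the original text, treat each window position as an independent trial that is a breakpoint with probability $p = 2^{-n}$. The distance $L$ between consecutive breakpoints is then geometrically distributed:

$$
P(L = \ell) = (1 - p)^{\ell - 1}\, p, \qquad E[L] = \frac{1}{p} = 2^{n}.
$$

For an expected length of 4K bytes this gives $n = \log_2 4096 = 12$, in agreement with the value stated above.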
The low n bits of the sliding window's Rabin fingerprint value are compared with a certain given value; what that value actually is does not matter, as long as it is fixed. We tested this as follows: for files of different lengths and different types, breakpoints were searched for using different given values. Whatever value was used, the number of logical data blocks obtained was nearly the same, and the lengths of the logical data blocks differed very little. This test confirms that the given value can be chosen arbitrarily.
In this step, the window Rabin fingerprint values may also be left uncalculated for the Tmin bytes following the previous breakpoint (or the beginning of the file), which avoids producing logical data blocks that are too short.
If no new breakpoint is found within the Tmax bytes following the previous breakpoint (or the beginning of the file), the last backup breakpoint within that range is used. A backup breakpoint is determined by taking the low n-1 bits of the sliding window's Rabin fingerprint value and comparing them with another given value (one that differs from the value used to identify logical data block breakpoints); if they are equal, the window can serve as a backup breakpoint. When there is no breakpoint, the last backup breakpoint becomes the breakpoint of the logical data block. If there is neither a breakpoint nor a backup breakpoint, the range is forcibly made into one logical data block, which avoids producing logical data blocks that are too long.
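The Tmin, Tmax, and backup breakpoint rules can be sketched as below. The rolling hash is again a stand-in for the Rabin fingerprint, and the names `window_hashes`, `split_blocks`, and `backup_target` are hypothetical. For brevity the sketch computes the hash inside the first Tmin bytes and simply ignores it, whereas the text suggests skipping the computation there.

```python
import random

WINDOW, BASE, MOD = 48, 257, (1 << 31) - 1   # stand-in rolling hash parameters

def window_hashes(data: bytes):
    """Yield (end offset, hash) for every full window of data."""
    top = pow(BASE, WINDOW - 1, MOD)
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW:
            yield i + 1, h

def split_blocks(data, n, t_min, t_max, target=0, backup_target=1):
    """Return block end offsets; assumes 1 < t_min <= t_max and t_max >= WINDOW."""
    mask, backup_mask = (1 << n) - 1, (1 << (n - 1)) - 1
    ends, start, backup = [], 0, None
    for end, h in window_hashes(data):
        length = end - start
        if length < t_min:
            continue                      # ignore windows too close to the last cut
        if (h & mask) == target:
            cut = end                     # regular breakpoint
        elif length >= t_max:
            cut = end if backup is None else backup   # forced cut or backup breakpoint
        else:
            if (h & backup_mask) == backup_target:
                backup = end              # remember the latest backup breakpoint
            continue
        ends.append(cut)
        start, backup = cut, None
    if not ends or ends[-1] != len(data):
        ends.append(len(data))            # the end of the file closes the last block
    return ends

data = bytes(random.Random(1).getrandbits(8) for _ in range(20000))
ends = split_blocks(data, n=6, t_min=32, t_max=256)
lengths = [b - a for a, b in zip([0] + ends, ends)]
```

Every block except possibly the last one has a length between `t_min` and `t_max`; the last block is bounded only from above.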
In step 103, the file is divided into logical data blocks. Using all the breakpoints of the file found in step 102, the content between every two adjacent breakpoints forms one logical data block. The content between the beginning of the file and the first breakpoint, and the content between the second-to-last breakpoint and the end of the file, likewise each form a logical data block.
In step 104, the expected length of a data storage block (chunk) is selected and the length range of the data storage block is limited. A data storage block is a relatively large piece of data. In the file system, the data storage block is the smallest storage unit used by the application layer, and only one copy of data storage blocks with identical contents is stored. The expected length of a data storage block may be 1M, 2M, or 4M bytes, or some other value.
Let Ec denote the expected length of a data storage block and Eb the expected length of a logical data block. The length range of a data storage block is limited to [Ec-m*Eb, Ec+k*Eb] (no minimum length is imposed on the last data storage block of a file), where m and k are values chosen as required.
In step 105, the breakpoints of the data storage blocks are searched for and determined. The present invention again searches for and determines data storage block breakpoints on the basis of the file content (content based). The benefit of this approach is that inserting or deleting content in a file affects only the data storage blocks whose content has changed and does not affect the other data storage blocks. Specifically, starting from the beginning of the file, the total length of a number of consecutive logical data blocks is calculated. As soon as this total length is close to the expected data storage block length and bits n+1 to n+x of the Rabin fingerprint value at the breakpoint of the last logical data block equal another given value (a value distinct from the one used to identify logical data block breakpoints), the breakpoint of the last logical data block is the breakpoint of a data storage block. If the breakpoint of the last logical data block does not satisfy the condition, and the total length after adding the next logical data block would not exceed the limit of the data storage block length range, the next logical data block is added and its breakpoint is tested against the condition. This continues until a breakpoint satisfying the condition is found or the total length approaches the upper limit of the data storage block length range. A breakpoint that satisfies the condition is a logical data block breakpoint and at the same time a data storage block breakpoint. In other words, the breakpoint of a data storage block coincides with the breakpoint of the last of the consecutive logical data blocks that make up the data storage block. The breakpoint of the next data storage block is then found in the same way, starting from the previous data storage block breakpoint. All logical data block breakpoints are traversed, and the breakpoints of all data storage blocks are found, up to the end of the file. The end of the file is necessarily also a data storage block breakpoint.
The value x above is related to the data storage block length range. With Ec denoting the expected data storage block length and Eb the expected logical data block length, and with the data storage block length range limited to [Ec-m*Eb, Ec+k*Eb] (no minimum length is imposed on the last data storage block of a file), about m+k logical data block breakpoints may fall within the data storage block length range. The value of bits n+1 to n+x of a breakpoint's Rabin fingerprint value should therefore range over [0, m+k-1], that is, the condition $m+k = 2^{x}$ should hold; this maximizes the probability that exactly one data storage block breakpoint exists within the data storage block length range. For example, if the expected logical data block length is 4K and the expected data storage block length is 4M, with a data storage block length range of [4M-32*4K, 4M+32*4K], then 32+32=64 logical data block breakpoints may become data storage block breakpoints, and x should equal 6 ($2^{6} = 64$); that is, bits 13 to 18 of the Rabin fingerprint value at the logical data block breakpoint (with n equal to 12) are tested against the given value. As with logical data block breakpoints, what the given value actually is does not matter, as long as it is fixed.
When the total length of the consecutive logical data blocks exceeds Ec-m*Eb, the breakpoint of the last logical data block is checked to see whether bits n+1 to n+x of its Rabin fingerprint value equal the given value. If the condition is satisfied, the breakpoint becomes a data storage block breakpoint; otherwise the breakpoint of the next logical data block is examined. Once the total length of the consecutive logical data blocks would exceed Ec+k*Eb, the last breakpoint within the maximum length range of the data storage block is made the breakpoint of this data storage block, which guarantees that the length of the data storage block stays within the limited range.
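The selection of data storage block breakpoints in step 105 can be sketched as follows. The function names `bits_for_range` and `chunk_breakpoints` and the toy numbers are assumptions made for illustration; the sketch takes the logical data block breakpoints and their fingerprints as given and treats a total length equal to Ec-m*Eb as already inside the range.

```python
def bits_for_range(m: int, k: int) -> int:
    """Return x such that m + k == 2**x (m + k must be a power of two)."""
    total = m + k
    assert total > 0 and total & (total - 1) == 0, "m + k must be a power of two"
    return total.bit_length() - 1

def chunk_breakpoints(block_ends, fingerprints, ec, eb, m, k, n, target=0):
    """Pick data storage block breakpoints among logical data block breakpoints.

    block_ends[i] is the end offset of logical block i and fingerprints[i]
    is the fingerprint of the window that produced that breakpoint.
    """
    low, high = ec - m * eb, ec + k * eb
    x_mask = (1 << bits_for_range(m, k)) - 1
    chunk_ends, start = [], 0
    last = len(block_ends) - 1
    for i, end in enumerate(block_ends):
        if i == last:
            chunk_ends.append(end)                   # the end of the file closes the chunk
            break
        total = end - start
        bits = (fingerprints[i] >> n) & x_mask       # bits n+1 .. n+x of the fingerprint
        overflow = block_ends[i + 1] - start > high  # next block would exceed the range
        if (total >= low and bits == target) or overflow:
            chunk_ends.append(end)
            start = end
    return chunk_ends

# Toy numbers: Eb = 4, Ec = 16, m = k = 2, so the range is [8, 24] and x = 2.
block_ends = [4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
fingerprints = [b << 2 for b in (1, 2, 0, 3, 1, 2, 3, 1, 2, 0)]   # low n = 2 bits are 0
chunks = chunk_breakpoints(block_ends, fingerprints, ec=16, eb=4, m=2, k=2, n=2)
# chunks == [12, 36, 40]
```

In the toy run the first chunk ends at a content-defined breakpoint (offset 12), the second is cut at offset 36 because one more block would exceed the upper limit of 24, and the third ends at the end of the file.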
In step 106, the file is divided into data storage blocks. Using all the data storage block breakpoints of the file found in step 105, the content between every two adjacent data storage block breakpoints forms one data storage block, and the logical data block and data storage block information is recorded. The data storage block information comprises the length, the offset, and the MD5 or SHA-1 value of the data storage block.
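A minimal sketch of the recorded information follows, assuming the breakpoints are given as a list of end offsets. The `BlockInfo` record and the `describe` helper are hypothetical names; MD5 is used here, and SHA-1 would work the same way.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockInfo:
    offset: int     # start of the block within the file
    length: int     # number of bytes in the block
    digest: str     # MD5 value of the block content, as a hex string

def describe(data: bytes, ends: list) -> list:
    """Cut data at the given end offsets and record one BlockInfo per block."""
    infos, start = [], 0
    for end in ends:
        piece = data[start:end]
        infos.append(BlockInfo(start, len(piece), hashlib.md5(piece).hexdigest()))
        start = end
    return infos

infos = describe(b"hello world", [5, 11])
```

The same helper applies to logical data blocks and to data storage blocks, since both are defined by a list of breakpoints.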
Fig. 2 is a schematic diagram of file splitting according to the content-based file splitting method of the present invention. As shown in Fig. 2, the whole file is divided into several data storage blocks, and each data storage block comprises several logical data blocks.
Fig. 3 is a schematic diagram of logical data block division in the content-based file splitting method of the present invention. As shown in Fig. 3, the zigzag lines represent breakpoints.
a represents the first file, or the original version of a file. The file is divided into many logical data blocks according to the logical data block breakpoints found from its content; only the first 7 logical data blocks are shown in the figure.
b, compared with a, has a small modification in B2, but the modified content does not produce a new breakpoint and the breakpoint positions remain unchanged. Because the content of this logical data block has changed, a new logical data block B8 is generated.
c, compared with b, has some content added in B3, but the added content does not produce a new breakpoint, so the breakpoint positions remain unchanged. Because new content has been added to this logical data block, a new logical data block B9 is generated.
d, compared with c, has some content deleted from B5, but the deletion neither produces a new breakpoint nor invalidates an existing one, so the breakpoint positions remain unchanged. Because part of the content of this logical data block has been deleted, a new logical data block B10 is generated.
e, compared with d, is modified at the breakpoint of B6. The content at the breakpoint changes so that it is no longer a breakpoint; B6 and B7 are therefore merged into a new logical data block B11.
f, compared with e, has new content added in B4, and the new content produces a new breakpoint, so B4 is split into B12 and B13.
Files a, b, c, d, e, and f in Fig. 3 may be different versions of a file with the same file name or different files with similar contents. The content of each file differs from that of the previous file, yet most logical data blocks can be reused. Whether the contents of two logical data blocks are identical can be judged by comparing their MD5 values or their SHA-1 values, or values computed by any other algorithm that is highly dispersed and uniquely represents the characteristics of the original information. In this way, when a file is synchronized, logical data blocks from files that have already been synchronized can be reused, which reduces the amount of data transferred.
Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method of the present invention. As shown in Fig. 4, the short zigzag lines represent logical data block breakpoints, and the long zigzag lines represent data storage block breakpoints, which are at the same time the breakpoints of the last logical data block of each data storage block. The shaded parts represent the places modified relative to the previous file (or the previous version of the file with the same file name), and B denotes a logical data block.
a represents the first file, or the original version of a file. The file is divided into many logical data blocks and data storage blocks according to the logical data block and data storage block breakpoints found from its content.
b, compared with a, has a small modification in chunk1, but the modification does not change the data storage block breakpoints; chunk1 is updated and becomes chunk1'. chunk2, chunk3, and the subsequent chunks are unchanged.
c, compared with b, has a small modification at the breakpoint of chunk1', which invalidates that breakpoint; a new breakpoint is searched for, and chunk1'' and chunk2' are generated. chunk3 and the subsequent chunks are unchanged.
d, compared with c, has a small modification in chunk1'', and the modification produces a new data storage block breakpoint, so chunk1''' and chunk2'' are generated. chunk3 and the subsequent chunks are unchanged.
Files a, b, c, and d in Fig. 4 may be different versions of a file with the same file name or different files with similar contents. The content of each file differs from that of the previous file, yet the content of most data storage blocks is the same. Whether the contents of two blocks are identical can be judged by comparing their MD5 values or their SHA-1 values, or values computed by any other algorithm that is highly dispersed and uniquely represents the characteristics of the original information. Data storage blocks with identical contents can be reused; when a file is stored, data storage blocks that already exist need not be stored again, which avoids storing data blocks repeatedly.
In a storage system, after the file data has been divided into logical data blocks and data storage blocks by the content-based file splitting method described above, storing a file does not mean storing the file itself. Instead, each data storage block of the file is stored, and the information on the data storage blocks that the file comprises is recorded in the metadata, such as the list of data storage blocks that the file comprises and the length and MD5 value of each data storage block. Because only one copy of data storage blocks with identical contents is stored, a large amount of storage space is saved when files with similar or identical contents need to be stored.
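The storage method can be sketched with an in-memory store. The `ChunkStore` class and its method names are hypothetical, and the data storage block breakpoints are passed in directly rather than computed, so that the example stays short.

```python
import hashlib

class ChunkStore:
    """Toy store: each distinct data storage block is kept once, files are recipes."""

    def __init__(self):
        self.chunks = {}      # MD5 digest -> chunk bytes
        self.metadata = {}    # file name -> list of (digest, length)

    def put(self, name: str, data: bytes, chunk_ends: list) -> None:
        recipe, start = [], 0
        for end in chunk_ends:
            chunk = data[start:end]
            digest = hashlib.md5(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)    # skip chunks already stored
            recipe.append((digest, len(chunk)))
            start = end
        self.metadata[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d, _ in self.metadata[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

store = ChunkStore()
store.put("a.txt", b"headerBODYfooter", [6, 10, 16])
store.put("b.txt", b"headerBODYFOOTER", [6, 10, 16])
# The two files total 32 bytes, but store.stored_bytes() is 22.
```

The two files share the chunks `header` and `BODY`, so only the differing final chunk of the second file adds to the stored data.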
When files are uploaded, backed up, or archived in a file system, the source end of the synchronization divides the new file into data storage blocks and logical data blocks and sends this information to the destination end. The destination end can use various methods to look up the data storage blocks and logical data blocks that do not exist locally, assemble data storage blocks, and notify the source end which logical data blocks it needs. The source end then sends over the logical data blocks that the destination end needs. The new file here may be a file modified from a previous version or a newly added file. To look up the data storage blocks and logical data blocks that do not exist locally, the destination end may either calculate the data storage block and logical data block information of local files in real time or keep this information in the metadata in advance for querying; the latter is recommended here. The logical data block information comprises the position of the logical data block within its data storage block, the length of the logical data block, its MD5 or SHA-1 value, and so on. If the source end has saved the information on the data storage blocks and logical data blocks that already exist at the destination end, the source end needs to transmit only the data storage block and logical data block information that does not exist at the destination end, rather than that of the entire new file, so even less content is transmitted. Compared with transmitting the entire file content, synchronizing files in this way transmits very little content; this is the incremental synchronization application referred to in the present invention.
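The exchange between the source end and the destination end can be sketched as follows. The message layout and the names `make_manifest`, `missing_blocks`, and `assemble` are assumptions for illustration; the blocks are given directly instead of being produced by the splitting method, and the network is replaced by plain function calls.

```python
import hashlib

def md5(piece: bytes) -> str:
    return hashlib.md5(piece).hexdigest()

def make_manifest(chunks):
    """Source end: describe each chunk by its digest and its block digests."""
    return [(md5(b"".join(blocks)), [md5(b) for b in blocks]) for blocks in chunks]

def missing_blocks(manifest, have_chunks, have_blocks):
    """Destination end: digests of the logical blocks that must be requested."""
    missing = set()
    for chunk_digest, block_digests in manifest:
        if chunk_digest not in have_chunks:
            missing.update(d for d in block_digests if d not in have_blocks)
    return missing

def assemble(manifest, have_chunks, have_blocks, received):
    """Destination end: rebuild the new file from local and received data."""
    out = bytearray()
    for chunk_digest, block_digests in manifest:
        if chunk_digest in have_chunks:
            out += have_chunks[chunk_digest]         # whole chunk already stored
        else:
            for d in block_digests:
                out += have_blocks[d] if d in have_blocks else received[d]
    return bytes(out)

old_chunks = [[b"aa", b"bb"], [b"cc", b"dd"]]        # what the destination holds
new_chunks = [[b"aa", b"bb"], [b"cc", b"XX"]]        # the new file at the source
have_chunks = {md5(b"".join(c)): b"".join(c) for c in old_chunks}
have_blocks = {md5(b): b for c in old_chunks for b in c}

manifest = make_manifest(new_chunks)
wanted = missing_blocks(manifest, have_chunks, have_blocks)
payload = {md5(b): b for c in new_chunks for b in c if md5(b) in wanted}
rebuilt = assemble(manifest, have_chunks, have_blocks, payload)
```

Only the one changed logical data block travels from the source end to the destination end; the unchanged chunk and the unchanged block of the modified chunk are taken from local data.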
In a network file system, efficient operation requires good network connection performance, such as bandwidth. On low-bandwidth network connections, for example low-bandwidth wired or wireless connections, the data deduplication and incremental synchronization features of the algorithm of the present invention can be exploited. In a network file system implementation, the mapping from a file to its physical storage blocks in the file system metadata can be replaced by a mapping to the file's logical data blocks or data storage blocks as defined by this algorithm, which reduces the dependence of network file system performance on bandwidth.
Those of ordinary skill in the art will appreciate that the above are merely preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of their technical features. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.