CN101788976A

CN101788976A - File splitting method based on contents

Info

Publication number: CN101788976A
Application number: CN201010110841A
Authority: CN
Inventors: 张卫平; 刘为怀; 杨立辉; 张元丰; 李骞
Original assignee: Beijing Borqs Software Technology Co Ltd
Current assignee: Borqs Beijing Ltd.; Wuhan Borqs Technology Co., Ltd.; Beijing Borqs Software Technology Co Ltd
Priority date: 2010-02-10
Filing date: 2010-02-10
Publication date: 2010-07-28
Anticipated expiration: 2030-02-10
Also published as: CN101788976B; WO2011097887A1

Abstract

The invention relates to a file splitting method based on contents, comprising the following steps: selecting the length of windows, the expected length of data logic blocks and the length range of the data logic blocks; computing the Rabin fingerprint value of each sliding window by adopting Rabin fingerprint algorithm, and determining the breaking point of each data logic block according to the Rabin fingerprint value of each sliding window; splitting files according to the data logic blocks; selecting the expected length of data storage blocks, and limiting the length range of the data storage blocks; searching and confirming the breaking points of the data storage blocks; and splitting the files according to the data storage blocks. In the invention, the file splitting method based on the contents can accurately and efficiently search different contents among different files or different versions of the same file, thereby saving a large amount of storage space in storage and file systems, reducing the transmission contents of file information, and decreasing the dependence of the performance of a network file system on bandwidth.

Description

A content-based file splitting method

Technical field

The present invention relates to a method for splitting files, and in particular to a content-based file splitting method.

Background art

In existing computer storage and file systems, two or more similar files whose contents are mostly identical must each be stored separately, which consumes a large amount of storage space. The common solution to this problem is data deduplication.

Data deduplication is implemented by splitting a file into data blocks of roughly equal length and storing only one copy of any data block whose content is identical within the file system. Whether two data blocks have identical content can be decided by comparing their MD5 values or their SHA-1 values. Values computed with MD5 or SHA-1 are highly dispersed. For a hash algorithm whose hash value is B bits long, the probability that data blocks with different content produce the same hash value is on the order of 1/2^(B/2). Taking MD5 as an example, the hash value is 128 bits long, so this probability is on the order of 1/2^64 (approximately 5.4 × 10^-20), which is so small that a collision is usually regarded as impossible. SHA-1 follows a design similar to that of MD5, and its hash value is longer still, at 160 bits. It is generally accepted that an MD5 or SHA-1 value uniquely represents the characteristics of the original information, and both are commonly used for the encrypted storage of passwords, digital signatures, file integrity checks, identity authentication and the like. In terms of computational efficiency, MD5 is better than SHA-1.
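The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below simply evaluates 1/2^(B/2) for a B-bit hash; the function name `collision_order` is illustrative and does not come from the source.

```python
def collision_order(bits: int) -> float:
    """Order of magnitude 1 / 2^(B/2) quoted in the text for a B-bit hash."""
    return 2.0 ** -(bits / 2)

md5_order = collision_order(128)    # 1 / 2^64
sha1_order = collision_order(160)   # 1 / 2^80
print(f"MD5:   {md5_order:.2e}")
print(f"SHA-1: {sha1_order:.2e}")
```

The longer SHA-1 value gives a smaller figure than MD5, which matches the statement that its hash value is longer.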

A further problem arises when a file must be synchronized after it has been modified: the modified content is often very small, yet the entire file has to be synchronized, which causes a large amount of network traffic. Incremental synchronization is currently used to solve this problem.

Incremental synchronization means that, when a file is synchronized over a network, the entire file need not be transmitted; only the content that does not exist in the storage or file system of the destination is transmitted. When synchronizing between different versions of the same file, this can be understood as transmitting the changes to the file. It is implemented by splitting the file into data logic blocks and comparing the contents of those blocks to find the identical and the differing parts between the destination file and the source file. The identical parts are available at the destination and need not be transmitted over the network; only the differing parts need to be transmitted, which reduces the volume of transmitted data. Whether two data logic blocks are identical can again be decided by comparing MD5 or SHA-1 values.

The commonly used remote data synchronization tool rsync is also an incremental synchronization technique. It uses the so-called "rsync algorithm" to bring files on a local and a remote computer into sync. Suppose file A' must be synchronized between two computers and the destination already holds an earlier version A of the file. The rsync algorithm then proceeds as follows:

1. The destination splits file A into a set of non-overlapping data logic blocks with a fixed length of S bytes (the last block may be shorter than S), computes a 32-bit checksum and a 128-bit MD4 value for each block, and sends these checksums and MD4 values to the source. MD4 is the predecessor of MD5 and is somewhat weaker than MD5 in terms of security.

2. The source searches file A' for all data logic blocks of size S (at any offset, not necessarily a multiple of S), looking for blocks whose checksum and MD4 value match those of a block of file A.

3. The source sends the destination a sequence of instructions for generating a copy of file A' at the destination. Each instruction is either a reference to a data logic block that file A already contains, which need not be retransmitted, or a piece of data that matches no block of file A.
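The three steps above can be sketched as follows. This is a simplified illustration, not the rsync implementation: `zlib.adler32` stands in for the 32-bit checksum and is recomputed at each offset rather than rolled, MD5 stands in for MD4, and all function names are hypothetical.

```python
import hashlib
import zlib

def block_signatures(old: bytes, size: int) -> dict:
    """Step 1 (destination): map (checksum, strong hash) -> block index of file A."""
    sigs = {}
    for index, start in enumerate(range(0, len(old), size)):
        block = old[start:start + size]
        sigs[(zlib.adler32(block), hashlib.md5(block).digest())] = index
    return sigs

def make_delta(new: bytes, sigs: dict, size: int) -> list:
    """Steps 2 and 3 (source): ('copy', index) for known blocks, ('data', bytes) otherwise."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        block = new[pos:pos + size]
        key = (zlib.adler32(block), hashlib.md5(block).digest())
        if key in sigs:
            if literal:                       # flush unmatched bytes first
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", sigs[key]))
            pos += len(block)
        else:
            literal.append(new[pos])          # no match at this offset, try the next one
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta: list, size: int) -> bytes:
    """Destination: rebuild A' from file A and the instruction stream."""
    out = bytearray()
    for kind, value in delta:
        out += old[value * size:(value + 1) * size] if kind == "copy" else value
    return bytes(out)

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick red fox jumps over the lazy dog!"
delta = make_delta(new, block_signatures(old, 8), 8)
rebuilt = apply_delta(old, delta, 8)
```

Only the `data` instructions carry file content over the network; the `copy` instructions carry a block index.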

The rsync algorithm transmits only the parts in which the two files differ rather than the whole file each time, so it is quite fast. However, rsync can only synchronize between different versions of a file with the same name. If, in the example above, file A' is similar in content to file A but has a different file name, rsync still transmits the entire content of A'.

Because the deduplication technique described above splits a file into data blocks of roughly equal length, its deduplication efficiency is low and it cannot effectively reduce the volume of transmitted data.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the present invention is to provide a content-based file splitting method, a file storage method and a file synchronization method.

To achieve the above object, the content-based file splitting method of the present invention comprises the following steps:

1) selecting the window length and the expected length of the data logic blocks, and setting the length range of the data logic blocks according to the expected length;

2) computing the Rabin fingerprint value of each sliding window using the Rabin fingerprint algorithm, and determining the breakpoints of the data logic blocks according to the Rabin fingerprint values of the sliding windows;

3) dividing the file into data logic blocks;

4) selecting the expected length of the data storage blocks, and limiting the length range of the data storage blocks;

5) searching for and determining the breakpoints of the data storage blocks;

6) dividing the file into data storage blocks.

To achieve the above object, the present invention also provides a file storage method comprising the following steps:

splitting the file data into data logic blocks and data storage blocks;

storing each data storage block of the file, and recording in metadata the information on the data storage blocks that the file contains.

To achieve the above object, the present invention also provides a file synchronization method comprising the following steps:

1) the synchronization source splits the new file into data storage blocks and data logic blocks using the file splitting method described above, and sends the data storage block information and the data logic block information to the destination;

2) the synchronization destination looks up the data storage blocks and data logic blocks that do not exist locally, and notifies the source which data logic blocks it needs in order to build the data storage blocks;

3) the synchronization source sends the data logic blocks that the destination needs.

The present invention has clear advantages and beneficial effects. The content-based file splitting method of the present invention can accurately and efficiently find the content that differs between different files or between different versions of the same file. In a storage system, because data storage blocks with identical content are stored only once, a large amount of storage space is saved whenever files with similar or identical content have to be stored. When files are uploaded, backed up or archived in a file system, the source only needs to transmit the data storage blocks and data logic block information that do not exist at the destination rather than those of the entire new file, so less content is transmitted. In a network file system, the mapping in the file system metadata from a file to its physical storage blocks can be replaced by a mapping from the file to the data logic blocks or data storage blocks of this method, which reduces the dependence of network file system performance on bandwidth.

Brief description of the drawings

The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a flow chart of the content-based file splitting method according to the present invention;

Fig. 2 is a schematic diagram of file splitting in the content-based file splitting method according to the present invention;

Fig. 3 is a schematic diagram of data logic block division in the content-based file splitting method according to the present invention;

Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method according to the present invention.

Detailed description of the embodiments

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to describe and explain the present invention and do not limit it.

Fig. 1 is a flow chart of the content-based file splitting method according to the present invention. With reference to Fig. 1, the implementation of the invention is described in detail as follows:

First, in step 101, the window length and the expected length of the data logic block (block) are selected, and the length range of the data logic block is limited.

A window is a contiguous region of the file; the suggested length is 48 bytes. A sliding window is obtained from the previous window in the file by sliding it one byte toward the end of the file; the window length is unchanged by the slide.

A data logic block is a relatively small piece of data. In incremental synchronization the data logic block is the smallest unit of synchronization. In a storage system, data is stored in units of data logic blocks. The expected length of a data logic block can be 2K, 4K or 8K, or some other value.

Two problems must be avoided when searching for data logic block breakpoints. If the file contains very many breakpoints, it is split into a large number of very short data logic blocks, and the amount of block information to be stored and transmitted becomes so large that storage and transmission may exceed the size of the file content itself. If, conversely, the data logic blocks in the file are very long, the probability that a block can be reused becomes very small, and a change within such a block causes a large amount of data to be transmitted. In this step the length range of the data logic block is therefore limited. The minimum length is Tmin, which is generally set to half the expected data logic block length, although it can be set to another value according to actual conditions. The maximum length is Tmax, which can be 16K, 32K or 64K bytes or the like, according to actual conditions.

In step 102, the Rabin fingerprint value of each sliding window is computed using the Rabin fingerprint algorithm, and the breakpoints of the data logic blocks are determined according to the Rabin fingerprint values of the sliding windows.

In this step the breakpoints of the data logic blocks are searched for and determined on the basis of file content (content based). The benefit is that inserting or deleting content in a file affects only the data logic blocks whose content has changed and leaves the other data logic blocks untouched. Specifically, starting at the beginning of the file, the Rabin fingerprint value of each sliding window is computed. When the low n bits of a sliding window's Rabin fingerprint value equal a given value, that sliding window constitutes the breakpoint of the first data logic block. The Rabin fingerprint value of each sliding window is then computed starting from the first breakpoint, and when the low n bits of a sliding window's Rabin fingerprint value equal the given value, that sliding window constitutes the breakpoint of the second data logic block. Proceeding in this way, the Rabin fingerprint values of all sliding windows are computed and the breakpoints of all data logic blocks in the file are found, up to the end of the file. The end of the file is necessarily also the breakpoint of a data logic block.
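The breakpoint search described above can be sketched as follows. Note the substitution: instead of a true Rabin fingerprint (a polynomial over GF(2)), the sketch uses a simple polynomial rolling hash modulo a prime as a stand-in, and the constants `BASE` and `MOD`, the default given value 0 and the function name are all assumptions made for illustration.

```python
import random

WINDOW = 48            # window length suggested in the text
BASE = 257             # assumption: base of the stand-in polynomial hash
MOD = (1 << 61) - 1    # assumption: prime modulus of the stand-in hash

def find_breakpoints(data: bytes, n: int, target: int = 0) -> list:
    """Return end offsets of data logic blocks; the end of file is always a breakpoint."""
    mask = (1 << n) - 1
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that leaves the window
    breakpoints, fp, start = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:        # slide: drop the oldest byte of the window
            fp = (fp - data[i - WINDOW] * top) % MOD
        fp = (fp * BASE + byte) % MOD
        if i - start + 1 >= WINDOW and (fp & mask) == target:
            breakpoints.append(i + 1)  # the block includes the breakpoint window
            fp, start = 0, i + 1       # restart the search from this breakpoint
    if not breakpoints or breakpoints[-1] != len(data):
        breakpoints.append(len(data))
    return breakpoints

rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
bps = find_breakpoints(data, n=8)
lengths = [b - a for a, b in zip([0] + bps, bps)]
print(len(bps), "blocks, mean length", sum(lengths) / len(lengths))
```

Because a breakpoint depends only on the bytes inside the window, content inserted early in a file shifts later breakpoints without changing which windows qualify.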

The Rabin fingerprint (Rabin fingerprinting) algorithm is a fingerprint algorithm proposed by Rabin of Harvard University. It computes the hash value of a sliding window efficiently, and the values it produces are highly dispersed.

Taking the low n bits of a sliding window's Rabin fingerprint value means taking the remainder of the Rabin fingerprint value divided by 2^n. The value of n is related to the expected length of the data logic block. Because the values computed by the Rabin fingerprint algorithm are very uniform, if the file content is also quite random, the data logic blocks produced by splitting will be about 2^n bytes long, that is, 2^n = the expected length of the data logic block. The content of the breakpoint window must of course be included in the data logic block. So if the expected data logic block length is 4K bytes, n should be 12 (2^12 = 4096 = 4K).
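The relation between n and the expected block length can be illustrated with a short sketch. The helper names and the sample fingerprint value are hypothetical; the only facts used are that the low n bits equal the remainder modulo 2^n and that 4K corresponds to n = 12.

```python
def low_bits(fingerprint: int, n: int) -> int:
    """Low n bits of a fingerprint, which equals fingerprint mod 2^n."""
    return fingerprint & ((1 << n) - 1)

def bits_for_expected_length(expected_length: int) -> int:
    """n such that 2^n equals the expected block length (must be a power of two)."""
    n = expected_length.bit_length() - 1
    if 1 << n != expected_length:
        raise ValueError("expected length must be a power of two")
    return n

n = bits_for_expected_length(4 * 1024)   # 4K expected length
sample = 0x12345                         # hypothetical fingerprint value
print(n, hex(low_bits(sample, n)))
```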

It does not matter what the given value is that the low n bits of the sliding window's Rabin fingerprint value are compared against, as long as it is fixed. We ran the following test: for files of different lengths and types, breakpoints were searched for with different given values. Whatever value was used, the resulting number of data logic blocks was much the same, and the lengths of the data logic blocks also differed very little. This test confirms that the given value can be chosen arbitrarily.

In this step, the window Rabin fingerprint value may also be left uncomputed within the Tmin bytes that follow the previous breakpoint (or the beginning of the file), which avoids producing data logic blocks that are too short.

If no new breakpoint is found within the Tmax bytes that follow the previous breakpoint (or the beginning of the file), the last backup breakpoint within that range is used. A backup breakpoint is determined by taking the low n-1 bits of the sliding window's Rabin fingerprint value and comparing them with another given value (one that differs from the value used to identify data logic block breakpoints); if they are equal, the window can serve as a backup breakpoint. When there is no breakpoint, the last backup breakpoint becomes the breakpoint of the data logic block. If there is neither a breakpoint nor a backup breakpoint, the range is forcibly split off as one data logic block, which avoids producing data logic blocks that are too long.
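The Tmin, Tmax and backup breakpoint rules can be sketched as below. As an assumption for illustration, the fingerprint is a plain polynomial hash recomputed for every window rather than a rolling Rabin fingerprint, and the given values 0 (breakpoint) and 1 (backup breakpoint), the parameter defaults and the function names are hypothetical.

```python
import random

WINDOW, BASE, MOD = 48, 257, (1 << 61) - 1   # hash constants are assumptions

def window_fp(data: bytes, end: int) -> int:
    """Stand-in fingerprint of the WINDOW bytes that end at offset `end` (exclusive)."""
    fp = 0
    for byte in data[max(0, end - WINDOW):end]:
        fp = (fp * BASE + byte) % MOD
    return fp

def next_breakpoint(data, start, n, t_min, t_max, main_value=0, backup_value=1):
    """End offset of the data logic block that begins at `start`."""
    limit = min(start + t_max, len(data))
    backup = None
    for end in range(start + t_min, limit + 1):   # nothing is computed inside Tmin
        fp = window_fp(data, end)
        if fp & ((1 << n) - 1) == main_value:
            return end                            # regular breakpoint
        if fp & ((1 << (n - 1)) - 1) == backup_value:
            backup = end                          # remember the latest backup breakpoint
    if limit == len(data):
        return limit                              # end of file always closes a block
    return backup if backup is not None else limit   # backup, else forced cut at Tmax

def split_points(data, n=8, t_min=128, t_max=1024):
    points, start = [], 0
    while start < len(data):
        start = next_breakpoint(data, start, n, t_min, t_max)
        points.append(start)
    return points

rng = random.Random(1)
data = bytes(rng.getrandbits(8) for _ in range(20000))
points = split_points(data)
lengths = [b - a for a, b in zip([0] + points, points)]
```

With these rules every block except possibly the last is at least Tmin bytes long, and no block exceeds Tmax bytes.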

In step 103, the file is divided into data logic blocks. Given all the breakpoints of the file found in step 102, the content between every two adjacent breakpoints constitutes one data logic block. The content between the beginning of the file and the first breakpoint, and the content between the second-to-last breakpoint and the end of the file, likewise each constitute one data logic block.

In step 104, the expected length of the data storage block (chunk) is selected and the length range of the data storage block is limited. A data storage block is a relatively large piece of data. In a file system the data storage block is the smallest storage unit used by the application layer, and data storage blocks with identical content are stored only once. The expected length of a data storage block can be 1M, 2M or 4M, or some other value.

The expected length of the data storage block is denoted Ec and the expected length of the data logic block is denoted Eb. The length of a data storage block is restricted to the range [Ec-m*Eb, Ec+k*Eb] (no minimum length applies to the last data storage block of a file), where m and k are values chosen as required.

In step 105, the breakpoints of the data storage blocks are searched for and determined. The present invention again does this on the basis of file content (content based). The benefit is that inserting or deleting content in a file affects only the data storage blocks whose content has changed and leaves the other data storage blocks untouched. Specifically, starting at the beginning of the file, the total length of a number of consecutive data logic blocks is computed. Once this total length is close to the expected data storage block length, and bits n+1 to n+x of the Rabin fingerprint value of the last data logic block's breakpoint equal another given value (one that differs from the value used to identify data logic block breakpoints), the breakpoint of the last data logic block is the breakpoint of the data storage block. If the breakpoint of the last data logic block does not satisfy the condition, and adding the next data logic block would not take the total length beyond the data storage block length range, the next data logic block is added and its breakpoint is tested against the condition. This continues until a breakpoint that satisfies the condition is found or the total length approaches the upper limit of the data storage block length range. A breakpoint that satisfies the condition is both the breakpoint of a data logic block and the breakpoint of a data storage block. In other words, the breakpoint of a data storage block coincides with the breakpoint of the last of the consecutive data logic blocks that make up the data storage block. The breakpoint of the next data storage block is then found in the same way, starting from the previous data storage block breakpoint. All data logic block breakpoints are traversed in this way, up to the end of the file, to find the breakpoints of all data storage blocks. The end of the file is necessarily also the breakpoint of a data storage block.
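One way to read the search described above is sketched below. It assumes that each data logic block breakpoint is available as an (end offset, fingerprint) pair and that no single data logic block is longer than Ec+k*Eb; the function name, the given value 0 and the toy numbers are hypothetical.

```python
def storage_breakpoints(logic_bps, ec, eb, m, k, n, x, value=0):
    """logic_bps: (end_offset, fingerprint) of each data logic block, in file order."""
    low, high = ec - m * eb, ec + k * eb
    chunk_ends, chunk_start = [], 0
    for i, (end, fp) in enumerate(logic_bps):
        length = end - chunk_start
        is_last = i == len(logic_bps) - 1
        bits = (fp >> n) & ((1 << x) - 1)      # bits n+1 .. n+x of the fingerprint
        matches = length >= low and bits == value
        # Cut here if the next data logic block would push the chunk past the upper limit.
        next_overflows = not is_last and logic_bps[i + 1][0] - chunk_start > high
        if is_last or matches or next_overflows:
            chunk_ends.append(end)
            chunk_start = end
    return chunk_ends

# Toy numbers: Eb = 4, Ec = 16, m = k = 2, so the range is [8, 24]; n = 2, x = 2.
logic_bps = [(4, 0b0011), (8, 0b0111), (12, 0b0000), (16, 0b1111), (20, 0b0100),
             (24, 0b0100), (28, 0b0100), (32, 0b0100), (36, 0b0100), (40, 0b0101)]
chunk_ends = storage_breakpoints(logic_bps, ec=16, eb=4, m=2, k=2, n=2, x=2)
print(chunk_ends)
```

In the toy input the first cut comes from a matching fingerprint, the second from the upper length limit, and the third from the end of the file.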

The value x above is related to the data storage block length range. With the expected data storage block length denoted Ec, the expected data logic block length denoted Eb, and the data storage block length restricted to [Ec-m*Eb, Ec+k*Eb] (no minimum length applies to the last data storage block of a file), there may be m+k data logic block breakpoints within the data storage block length range. Bits n+1 to n+x of the breakpoint Rabin fingerprint value must then take values in the range [0, m+k-1], that is, the condition m+k = 2^x must hold. This maximizes the probability that there is one and only one data storage block breakpoint within the data storage block length range. For example, if the expected data logic block length is 4K, the expected data storage block length is 4M and the data storage block length range is [4M-32*4K, 4M+32*4K], then 32+32 = 64 data logic block breakpoints may become the data storage block breakpoint, and x should equal 6 (2^6 = 64). With n equal to 12, this means testing whether bits 13 to 18 of the data logic block breakpoint's Rabin fingerprint value equal the given value. As with the data logic block breakpoints, it does not matter what this given value is, as long as it is fixed.
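The worked example above can be reproduced with a few lines of arithmetic. The function name `storage_bit_range` is hypothetical; it returns the length range [Ec-m*Eb, Ec+k*Eb] and the bit positions n+1 to n+x, with x chosen so that m+k = 2^x.

```python
def storage_bit_range(ec: int, eb: int, m: int, k: int, n: int):
    """Length range of a data storage block and the fingerprint bits to test."""
    candidates = m + k                    # breakpoints that may fall inside the range
    x = candidates.bit_length() - 1
    if 1 << x != candidates:
        raise ValueError("m + k must be a power of two")
    return (ec - m * eb, ec + k * eb), (n + 1, n + x)

KB, MB = 1024, 1024 * 1024
length_range, bit_positions = storage_bit_range(4 * MB, 4 * KB, 32, 32, 12)
print(length_range, bit_positions)
```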

When the total length of the consecutive data logic blocks exceeds Ec-m*Eb, the breakpoint of the last data logic block must be tested to see whether bits n+1 to n+x of its Rabin fingerprint value equal the given value. If the condition is satisfied, it becomes a data storage block breakpoint; otherwise the breakpoint of the next data logic block is checked. As soon as the total length of the consecutive data logic blocks would exceed Ec+k*Eb, the last breakpoint within the maximum length of the data storage block is made the breakpoint of this data storage block, which guarantees that the length of the data storage block stays within the limits.

In step 106, the file is divided into data storage blocks. Given all the data storage block breakpoints of the file found in step 105, the content between every two adjacent data storage block breakpoints constitutes one data storage block, and the data logic block and data storage block information is recorded. The data storage block information comprises the length, the offset and the MD5 or SHA-1 value of the data storage block.
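Recording the data storage block information listed above might look like the following sketch. The dictionary layout and the function name are assumptions; the source only states that length, offset and an MD5 or SHA-1 value are recorded.

```python
import hashlib

def storage_block_info(data: bytes, chunk_ends: list) -> list:
    """Offset, length and MD5 value of each data storage block, in file order."""
    info, start = [], 0
    for end in chunk_ends:
        block = data[start:end]
        info.append({
            "offset": start,
            "length": len(block),
            "md5": hashlib.md5(block).hexdigest(),
        })
        start = end
    return info

data = b"abcdefghij"
info = storage_block_info(data, [4, 10])   # breakpoints after bytes 4 and 10
```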

Fig. 2 is a schematic diagram of file splitting in the content-based file splitting method according to the present invention. As shown in Fig. 2, the whole file is divided into several data storage blocks, and each data storage block contains several data logic blocks.

Fig. 3 is a schematic diagram of data logic block division in the content-based file splitting method according to the present invention. As shown in Fig. 3, the zigzag lines represent breakpoints.

a represents the first file, or the original version of a file. The breakpoints of the data logic blocks are searched for according to content, and the file is divided into many data logic blocks; only the first 7 data logic blocks are shown in the figure.

b, compared with a, has a small modification in B2, but the modified content neither produces a new breakpoint nor moves an existing one. Because the content of this data logic block has changed, a new data logic block B8 is generated.

c, compared with b, has some content added in B3, but the added content does not produce a new breakpoint, so the breakpoint positions remain unchanged. Because new content has been added to this data logic block, a new data logic block B9 is generated.

d, compared with c, has some content deleted from B5, but the deletion neither produces a new breakpoint nor invalidates an existing one, so the breakpoint positions remain unchanged. Because part of the content of this data logic block has been deleted, a new data logic block B10 is generated.

e, compared with d, is modified at the breakpoint of B6. The content at the breakpoint changes and it is no longer a breakpoint, so B6 and B7 are merged into a new data logic block B11.

f, compared with e, has new content added in B4, and the added content produces a new breakpoint, so B4 is split into B12 and B13.

The files a, b, c, d, e and f in Fig. 3 may be different versions of a file with the same name, or they may be different files with similar content. Each file differs in content from the previous one, but most of the data logic blocks can be reused. Whether two data logic blocks have identical content can be decided by comparing their MD5 values, their SHA-1 values, or values computed by any other algorithm that is highly dispersed and uniquely represents the characteristics of the original information. In this way, when a file is synchronized, the data logic blocks of files that have already been synchronized can be reused, reducing the volume of transmitted data.

Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method according to the present invention. As shown in Fig. 4, the short zigzag lines represent data logic block breakpoints, and the long zigzag lines represent data storage block breakpoints, which are at the same time the breakpoints of the last data logic block of each data storage block. The shaded areas mark where the previous file (or the previous version of the file with the same name) has been modified, and B denotes a data logic block.

a represents the first file, or the original version of a file. The breakpoints of the data logic blocks and data storage blocks are searched for according to content, and the file is divided into many data logic blocks and data storage blocks.

b, compared with a, has a small modification in chunk1, but this modification does not change the data storage block breakpoint. chunk1 is updated and becomes chunk1'. chunk2, chunk3 and the following chunks are unchanged.

c, compared with b, has a small modification at the breakpoint of chunk1', which invalidates that breakpoint. A new breakpoint is searched for, and chunk1'' and chunk2' are generated. chunk3 and the following chunks are unchanged.

d, compared with c, has a small modification in chunk1''. This modification produces a new data storage block breakpoint, so chunk1''' and chunk2'' are generated. chunk3 and the following chunks are unchanged.

The files a, b, c and d in Fig. 4 may be different versions of a file with the same name, or they may be different files with similar content. Each file differs in content from the previous one, but the content of most data storage blocks is identical. Whether two blocks have identical content can be decided by comparing their MD5 values, their SHA-1 values, or values computed by any other algorithm that is highly dispersed and uniquely represents the characteristics of the original information. Data storage blocks with identical content can be reused: when a file is stored, data storage blocks that already exist need not be stored again, which avoids storing data blocks repeatedly.

In a storage system, after the file data has been split into data logic blocks and data storage blocks using the content-based file splitting method described above, what is stored is not the file itself but each data storage block of the file, and the information on the data storage blocks that the file contains is recorded in metadata, for example the list of data storage blocks contained in the file and the length and MD5 value of each data storage block. Because data storage blocks with identical content are stored only once, a large amount of storage space is saved whenever files with similar or identical content have to be stored.
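The storage scheme described above can be sketched with an in-memory store keyed by MD5 value. The class name `BlockStore`, its methods and the sample blocks are hypothetical, and the input blocks are assumed to have been produced already by the splitting method.

```python
import hashlib

class BlockStore:
    """Keeps each distinct data storage block once and per-file metadata lists."""

    def __init__(self):
        self.blocks = {}   # MD5 hex digest -> block content, stored only once
        self.files = {}    # file name -> ordered list of (digest, length)

    def put_file(self, name: str, blocks: list) -> None:
        entries = []
        for block in blocks:
            digest = hashlib.md5(block).hexdigest()
            self.blocks.setdefault(digest, block)   # skip blocks that already exist
            entries.append((digest, len(block)))
        self.files[name] = entries

    def get_file(self, name: str) -> bytes:
        return b"".join(self.blocks[digest] for digest, _ in self.files[name])

store = BlockStore()
store.put_file("a.txt", [b"header", b"body v1"])
store.put_file("b.txt", [b"header", b"body v2"])
print(len(store.blocks), "blocks stored for 2 files")
```

The two files share the block `b"header"`, so three blocks are kept instead of four.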

When files are uploaded, backed up or archived in a file system, the synchronization source splits the new file into data storage blocks and data logic blocks and sends this information to the destination. The destination can use various methods to look up the data storage blocks and data logic blocks that do not exist locally, and notifies the source which data logic blocks it needs in order to build the data storage blocks. The source then sends the data logic blocks that the destination needs. The new file here may be a file modified from a previous version or a newly added file. To find the data storage blocks and data logic blocks that are missing locally, the destination can either compute the data storage block and data logic block information of its local files in real time or keep this information in metadata in advance for querying; the latter is recommended. Data logic block information comprises the position of the data logic block within its data storage block, the length of the data logic block, its MD5 or SHA-1 value, and so on. If the source has saved the information on the data storage blocks and data logic blocks that already exist at the destination, it only needs to transmit the data storage block and data logic block information that does not exist at the destination rather than that of the entire new file, so even less content is transmitted. Compared with transmitting the entire file content, synchronizing files in this way transmits very little; this is the incremental synchronization application referred to in the present invention.
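The exchange between source and destination can be sketched as below. For brevity the sketch works on data logic blocks only, runs both ends in one process, and uses MD5 digests as block identifiers; the function names and sample blocks are hypothetical.

```python
import hashlib

def digest(block: bytes) -> str:
    return hashlib.md5(block).hexdigest()

def missing_digests(manifest: list, local: dict) -> list:
    """Destination: digests named in the manifest that are not held locally."""
    return [d for d in dict.fromkeys(manifest) if d not in local]

def synchronize(new_blocks: list, local: dict):
    """Returns the file rebuilt at the destination and the number of bytes sent."""
    manifest = [digest(b) for b in new_blocks]       # source sends block information
    wanted = missing_digests(manifest, local)        # destination reports what it lacks
    by_digest = {digest(b): b for b in new_blocks}
    transfer = {d: by_digest[d] for d in wanted}     # source sends only those blocks
    local.update(transfer)
    rebuilt = b"".join(local[d] for d in manifest)
    return rebuilt, sum(len(b) for b in transfer.values())

old_blocks = [b"alpha", b"beta", b"gamma"]
local = {digest(b): b for b in old_blocks}           # blocks already at the destination
new_blocks = [b"alpha", b"BETA", b"gamma", b"delta"]
rebuilt, sent = synchronize(new_blocks, local)
print(sent, "bytes of block content sent")
```

Only the two blocks that the destination lacks are transferred; the unchanged blocks are reused from local storage.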

In a network file system, efficient operation requires good network connection performance, such as bandwidth. Over a low-bandwidth network connection, whether wired or wireless, the data deduplication and incremental synchronization properties of the algorithm of the present invention can be exploited. In the implementation of a network file system, the mapping in the file system metadata from a file to its physical storage blocks can be replaced by a mapping from the file to the data logic blocks or data storage blocks of this algorithm, which reduces the dependence of network file system performance on bandwidth.

Those of ordinary skill in the art will appreciate that the above are merely preferred embodiments of the present invention and do not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A content-based file splitting method, comprising the following steps:

3) dividing the file into data logic blocks;

5) searching for and confirming the breakpoints of the data storage blocks;

6) dividing the file into data storage blocks.

2. The content-based file splitting method according to claim 1, characterized in that step 2 further comprises the following steps:

1) starting from the beginning of the file, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, determining the breakpoint of the first data logic block according to the Rabin fingerprint values of the sliding windows, and taking this breakpoint as the previous data logic block breakpoint;

2) starting from the previous data logic block breakpoint, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, and determining the breakpoint of the next data logic block according to the Rabin fingerprint values of the sliding windows;

3) repeating step 2 above to find the breakpoints of all data logic blocks in the file.
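By way of non-limiting illustration, the scan recited in claim 2 can be sketched as follows. The sketch substitutes a polynomial rolling hash modulo a prime for a true Rabin fingerprint over GF(2), and it lets the window roll continuously across breakpoints rather than restarting at each one; both are simplifying assumptions. The constants `WINDOW`, `BASE`, and `MOD`, the function name `find_breakpoints`, and the default parameter values are hypothetical.

```python
WINDOW = 16          # sliding-window length in bytes (assumed value)
BASE = 257           # base of the polynomial hash (assumed value)
MOD = (1 << 61) - 1  # prime modulus (assumed value)

def find_breakpoints(data, n_bits=6, target=0):
    """Return the offsets just past each window whose hash has low n_bits == target."""
    mask = (1 << n_bits) - 1
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    fp = 0
    breakpoints = []
    for i, byte in enumerate(data):
        if i >= WINDOW:
            fp = (fp - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        fp = (fp * BASE + byte) % MOD                 # append the newest byte
        if i + 1 >= WINDOW and (fp & mask) == target:
            breakpoints.append(i + 1)
    return breakpoints
```

Because each decision depends only on the bytes inside the window, inserting bytes at the front of a file shifts the existing breakpoints by the insertion length instead of moving them to unrelated positions.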

3. The content-based file splitting method according to claim 1, characterized in that step 2 further comprises the following steps:

1) starting at the data logic block minimum length after the beginning of the file, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, determining the breakpoint of the first data logic block according to the Rabin fingerprint values of the sliding windows, and taking this breakpoint as the previous data logic block breakpoint;

2) starting at the data logic block minimum length after the previous data logic block breakpoint, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, and determining the breakpoint of the next data logic block according to the Rabin fingerprint values of the sliding windows;

4. The content-based file splitting method according to claim 1, characterized in that, in step 3, the file is divided into data logic blocks by taking the content between two adjacent data logic block breakpoints as one data logic block.

5. The content-based file splitting method according to claim 1, characterized in that the data logic block breakpoint is determined from the Rabin fingerprint value of the sliding window as follows: when the low n bits of the Rabin fingerprint value of a sliding window equal a specified value, that sliding window constitutes a breakpoint.

6. The content-based file splitting method according to claim 5, characterized in that the value of n for the low n bits of the sliding window Rabin fingerprint value is calculated from 2ⁿ = the expected length of the data logic block.
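By way of non-limiting illustration, the relation in claim 6 follows from the fact that, if the low n bits of the fingerprint are roughly uniformly distributed, one window in 2ⁿ matches the specified value, so breakpoints occur on average every 2ⁿ bytes. The sketch below assumes the expected length is an exact power of two; the function names `low_bits_for_expected_length` and `is_breakpoint` are hypothetical.

```python
def low_bits_for_expected_length(expected_length):
    """Return n such that 2**n == expected_length (assumed to be a power of two)."""
    if expected_length <= 0 or expected_length & (expected_length - 1):
        raise ValueError("expected length must be a power of two in this sketch")
    return expected_length.bit_length() - 1

def is_breakpoint(fingerprint, n, target=0):
    """True when the low n bits of the fingerprint equal the specified value."""
    return (fingerprint & ((1 << n) - 1)) == target
```

For example, an expected data logic block length of 8192 bytes corresponds to testing the low 13 bits of each window fingerprint.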

7. The content-based file splitting method according to claim 2 or 3, characterized in that, if no new breakpoint is found within the data logic block maximum length, backup breakpoints are identified according to the window Rabin fingerprint values, and the last backup breakpoint is taken as the data logic block breakpoint.

8. The content-based file splitting method according to claim 7, characterized in that the backup breakpoints are identified according to the window Rabin fingerprint values as follows: when the low n-1 bits of the Rabin fingerprint value of a sliding window equal a specified value, the window corresponding to that sliding window constitutes a backup breakpoint.

9. The content-based file splitting method according to claim 8, characterized in that, if there is neither a breakpoint nor a backup breakpoint within the data logic block length range, the window at the data logic block maximum length position is taken as the breakpoint.
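By way of non-limiting illustration, the minimum length of claim 3 and the fallback rules of claims 7 to 9 can be combined as sketched below. The window hash is a polynomial hash standing in for a Rabin fingerprint and is recomputed per window for clarity rather than rolled; the backup test reuses the low n-1 bits of the same specified value. These choices, the constants, and the names `window_fp`, `next_breakpoint`, and `split_logic_blocks` are assumptions made for the example.

```python
WINDOW, BASE, MOD = 16, 257, (1 << 61) - 1  # assumed parameters

def window_fp(data, end):
    """Polynomial hash of the (up to) WINDOW bytes ending at offset end."""
    fp = 0
    for byte in data[max(0, end - WINDOW):end]:
        fp = (fp * BASE + byte) % MOD
    return fp

def next_breakpoint(data, start, n_bits, min_len, max_len, target=0):
    """Return the end offset of the data logic block that begins at start."""
    mask = (1 << n_bits) - 1
    backup_mask = (1 << (n_bits - 1)) - 1
    limit = min(start + max_len, len(data))
    backup = None
    for end in range(start + min_len, limit + 1):  # skip the minimum length
        fp = window_fp(data, end)
        if (fp & mask) == target:
            return end                             # regular breakpoint
        if (fp & backup_mask) == (target & backup_mask):
            backup = end                           # latest backup breakpoint
    if limit < start + max_len:
        return len(data)                           # file ended: final block
    return backup if backup is not None else limit # backup, else forced cut

def split_logic_blocks(data, n_bits=6, min_len=32, max_len=256, target=0):
    """Return the end offsets of all data logic blocks in data."""
    ends, start = [], 0
    while start < len(data):
        start = next_breakpoint(data, start, n_bits, min_len, max_len, target)
        ends.append(start)
    return ends
```

Every block except possibly the last then has a length between `min_len` and `max_len`, and a forced cut at the maximum length is used only when neither kind of breakpoint is available.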

10. The content-based file splitting method according to claim 1, characterized in that step 5 further comprises the following steps:

1) starting from the beginning of the file, calculating the lengths of a plurality of consecutive data logic blocks, calculating the Rabin fingerprint value of each data logic block breakpoint within the length range of the data storage block, and setting the first breakpoint of the data storage block according to the Rabin fingerprint values of those data logic block breakpoints, the first breakpoint of the data storage block serving as the previous data storage block breakpoint;

2) starting from the previous data storage block breakpoint, calculating the lengths of a plurality of consecutive data logic blocks, calculating the Rabin fingerprint value of each data logic block breakpoint within the length range of the data storage block, and setting the next breakpoint of the data storage block according to the Rabin fingerprint values of those data logic block breakpoints;

3) repeating step 2 above until the end of the file, to find the breakpoints of all data storage blocks in the file.

11. The content-based file splitting method according to claim 1, characterized in that, in step 5, if no data storage block breakpoint can be found within the length range of the data storage block according to the Rabin fingerprint values of the data logic block breakpoints, the last breakpoint within the maximum length range of the data storage block is set as the breakpoint of that data storage block.

12. The content-based file splitting method according to claim 1, characterized in that, in step 6, the content between two adjacent data storage block breakpoints is taken as one data storage block.
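By way of non-limiting illustration, the selection of data storage block breakpoints from among the data logic block breakpoints (claims 10 to 12) can be sketched as follows. The claims state only that the choice is made according to the Rabin fingerprint values of the data logic block breakpoints; testing the low `m_bits` bits against a specified value is an assumption made for the example, as are the function name `storage_breakpoints` and the input format of `(offset, fingerprint)` pairs.

```python
def storage_breakpoints(logic_bps, min_len, max_len, m_bits, target=0):
    """Choose storage-block breakpoints from sorted (offset, fingerprint) pairs.

    logic_bps lists every data logic block breakpoint, ending with the
    end-of-file offset. Returns the offsets chosen as storage breakpoints.
    """
    mask = (1 << m_bits) - 1
    chosen, start, i = [], 0, 0
    while i < len(logic_bps):
        cut = i  # guarantees progress even if no breakpoint lies within max_len
        j = i
        while j < len(logic_bps) and logic_bps[j][0] - start <= max_len:
            offset, fp = logic_bps[j]
            cut = j  # latest logic breakpoint still inside the maximum length
            if offset - start >= min_len and (fp & mask) == target:
                break  # fingerprint-selected storage breakpoint
            j += 1
        start = logic_bps[cut][0]
        chosen.append(start)
        i = cut + 1
    return chosen
```

When no fingerprint in range matches, the loop leaves `cut` at the last data logic block breakpoint inside the maximum length, which corresponds to the fallback of claim 11. Because every storage breakpoint is also a logic breakpoint, each data storage block consists of whole data logic blocks.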

13. the method for a file storage is characterized in that, at first, file data is divided into logical blocks of data and DSB data store block, then, and each DSB data store block of storage file, and the data storage block message that log file comprised in metadata.

14. the method for file storage according to claim 13 is characterized in that, described DSB data store block information comprises: the length of DSB data store block, side-play amount and MD5 value.

15. the method for a synchronous documents is characterized in that, may further comprise the steps:

16. the method for synchronous documents according to claim 15 is characterized in that, described logical blocks of data information comprises: the position of logical blocks of data in affiliated DSB data store block, logical blocks of data length, logical blocks of data MD5 value.