CN101788976A - File splitting method based on contents - Google Patents

Publication number
CN101788976A
Authority
CN
China
Prior art keywords
data
breakpoint
logical blocks
file
rabin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201010110841A
Other languages
Chinese (zh)
Other versions
CN101788976B (en)
Inventor
张卫平
刘为怀
杨立辉
张元丰
李骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Borqs Beijing Ltd.
Wuhan Borqs Technology Co., Ltd.
Beijing Borqs Software Technology Co Ltd
Original Assignee
Beijing Borqs Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Borqs Software Technology Co Ltd
Priority to CN201010110841XA (patent CN101788976B)
Publication of CN101788976A
Priority to PCT/CN2010/077556 (patent WO2011097887A1)
Application granted
Publication of CN101788976B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a file splitting method based on contents, comprising the following steps: selecting the length of windows, the expected length of data logic blocks and the length range of the data logic blocks; computing the Rabin fingerprint value of each sliding window by adopting Rabin fingerprint algorithm, and determining the breaking point of each data logic block according to the Rabin fingerprint value of each sliding window; splitting files according to the data logic blocks; selecting the expected length of data storage blocks, and limiting the length range of the data storage blocks; searching and confirming the breaking points of the data storage blocks; and splitting the files according to the data storage blocks. In the invention, the file splitting method based on the contents can accurately and efficiently search different contents among different files or different versions of the same file, thereby saving a large amount of storage space in storage and file systems, reducing the transmission contents of file information, and decreasing the dependence of the performance of a network file system on bandwidth.

Description

A content-based file splitting method
Technical field
The present invention relates to a file splitting method, and in particular to a content-based file splitting method.
Background art
In existing computer storage and file systems, two or more similar files whose contents are mostly identical must each be stored separately, which consumes a large amount of storage space. The common solution to this problem is data de-duplication.
Data de-duplication is implemented by splitting a file into data blocks of roughly equal length; within the file system, data blocks with identical content are stored only once. Whether two data blocks have identical content can be judged by comparing their MD5 values or their SHA-1 values. Values computed with the MD5 or SHA-1 algorithm are highly dispersed. The hash value computed by MD5 is 128 bits long. The probability that data blocks with different content yield the same hash value is on the order of $1/2^{B/2}$, where B is the number of bits in the hash value. Taking 128-bit MD5 as an example, the probability that data blocks with different content have the same MD5 hash value is on the order of $1/2^{64}$ (approximately $5.4 \times 10^{-20}$); a probability this small is usually regarded as impossible. The SHA-1 algorithm follows design principles similar to those of MD5, and its hash value is longer, at 160 bits. An MD5 or SHA-1 value is generally considered to represent the features of the original information uniquely, and both are commonly used for protected storage of passwords, digital signatures, file integrity checks, authentication, and so on. In terms of computational efficiency, MD5 is better than SHA-1.
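The order-of-magnitude figures quoted above can be checked with a few lines of arithmetic. The following is only a sketch of that calculation; the function name `collision_probability` is a hypothetical helper, not something defined in the patent.

```python
from fractions import Fraction

def collision_probability(hash_bits: int) -> Fraction:
    """Order of magnitude quoted in the text: 1 / 2^(B/2) for a B-bit hash."""
    return Fraction(1, 2 ** (hash_bits // 2))

md5_p = collision_probability(128)   # MD5: B = 128, so 1 / 2^64
sha1_p = collision_probability(160)  # SHA-1: B = 160, so 1 / 2^80

print(f"MD5:   {float(md5_p):.2e}")
print(f"SHA-1: {float(sha1_p):.2e}")
```

Exact fractions are used so that the comparison between the two hash lengths does not depend on floating-point rounding.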
When a file must be synchronized after modification, the modified content is often very small, yet the whole file has to be synchronized, which causes a large amount of network traffic. The technique currently used to solve this problem is incremental synchronization.
Incremental synchronization means that when a file is synchronized over a network, the whole file content does not need to be transmitted; only the content that does not exist in the storage and file system of the destination is transmitted. Between different versions of the same file, this can be understood as transmitting only the changes. The method is to split the file into logical data blocks and, by comparing their content, to find the identical and differing parts between the destination file and the source file. Identical parts can be obtained at the destination without network transmission; only the differing parts need to be transmitted, which reduces the volume of transmitted data. Whether logical data blocks are identical can likewise be judged by comparing MD5 or SHA-1 values.
The widely used remote data synchronization tool rsync is also an incremental synchronization technique; it uses the so-called "rsync algorithm" to synchronize files between a local and a remote computer. Suppose file A' must be synchronized between two computers and the destination already holds a previous version A of the file. The rsync algorithm then proceeds in the following steps:
1. The destination splits file A into a set of non-overlapping logical data blocks of fixed length S bytes (the last block may be shorter than S), computes a 32-bit checksum and a 128-bit MD4 value for each block, and sends these checksums and MD4 values to the source. MD4 is the predecessor of MD5 and is somewhat weaker than MD5 in terms of security.
2. The source searches file A' for all logical data blocks of size S (at any offset, not necessarily a multiple of S), looking for blocks that have the same checksum and MD4 value as a block of file A.
3. The source sends the destination a sequence of instructions for generating a copy of file A' at the destination. Each instruction is either a statement that file A already contains a certain logical data block, which need not be resent, or a piece of data that matches no block of file A.
Because the rsync algorithm transmits only the differing parts of the two files rather than the whole file every time, it is quite fast. However, rsync can only synchronize between different versions of a file with the same name. If, in the example above, file A' has content similar to file A but a different file name, rsync still transmits the entire content of A'.
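The three rsync steps described above can be illustrated with a small sketch. This is not the rsync implementation: it uses `zlib.adler32` as a stand-in for the 32-bit checksum and SHA-1 as a stand-in for the MD4 value, recomputes the checksum at every offset instead of rolling it, and all function names are hypothetical.

```python
import hashlib
import zlib

def block_signatures(old: bytes, size: int) -> dict:
    """Step 1 (destination): weak checksum -> [(strong hash, block index)] for fixed-size blocks."""
    sigs = {}
    for idx, off in enumerate(range(0, len(old), size)):
        block = old[off:off + size]
        sigs.setdefault(zlib.adler32(block), []).append((hashlib.sha1(block).digest(), idx))
    return sigs

def make_delta(new: bytes, sigs: dict, size: int) -> list:
    """Steps 2 and 3 (source): emit ("copy", block_index) or ("data", literal bytes)."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + size]
        match = None
        for strong, idx in sigs.get(zlib.adler32(window), []):
            if hashlib.sha1(window).digest() == strong:
                match = idx
                break
        if match is None:
            literal.append(new[pos])       # no block of the old file starts here
            pos += 1
        else:
            if literal:
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", match))  # the destination already holds this block
            pos += len(window)
    if literal:
        delta.append(("data", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta: list, size: int) -> bytes:
    """Destination: rebuild the new file from its old copy plus the instructions."""
    out = bytearray()
    for op, arg in delta:
        out += old[arg * size:(arg + 1) * size] if op == "copy" else arg
    return bytes(out)
```

Because the old file is cut at fixed offsets while the new file is searched at every offset, an insertion in the new file does not prevent later blocks from matching; only the literal bytes around the change are sent.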
Because the de-duplication technique described above splits a file into data blocks of roughly equal length, its de-duplication efficiency is low and it cannot effectively reduce the volume of transmitted data.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide a content-based file splitting method, a file storage method, and a file synchronization method.
To achieve the above object, the content-based file splitting method of the present invention comprises the following steps:
1) select the window length and the expected length of the logical data block, and set the length range of the logical data block according to that expected length;
2) using the Rabin fingerprint algorithm, compute the Rabin fingerprint value of each sliding window, and determine the breakpoints of the logical data blocks according to the Rabin fingerprint values of the sliding windows;
3) divide the file into logical data blocks;
4) select the expected length of the data storage block, and limit the length range of the data storage block;
5) search for and determine the breakpoints of the data storage blocks;
6) divide the file into data storage blocks.
To achieve the above object, the present invention also provides a file storage method comprising the following steps:
divide the file data into logical data blocks and data storage blocks;
store each data storage block of the file, and record in metadata the data storage block information of the file.
To achieve the above object, the present invention also provides a file synchronization method comprising the following steps:
1) the synchronization source uses the file splitting method described in the claims to divide the new file into data storage blocks and logical data blocks, and sends the data storage block information and logical data block information to the destination;
2) the synchronization destination looks up the data storage blocks and logical data blocks that do not exist locally, assembles data storage blocks, and notifies the source which logical data blocks it needs;
3) the synchronization source sends the logical data blocks that the destination needs.
The present invention has clear advantages and positive effects. With the content-based file splitting method of the present invention, the differing content between different files, or between different versions of the same file, can be found accurately and efficiently. In a storage system, because data storage blocks with identical content are stored only once, a large amount of storage space is saved when files with similar or identical content must be stored. When files are uploaded, backed up, or archived in a file system, the source only needs to transmit the data storage blocks and logical data block information that do not exist at the destination, rather than those of the whole new file, so less content is transmitted. In a network file system, the mapping in the file system metadata from a file to its physical storage blocks can be replaced by a mapping from the file to the logical data blocks or data storage blocks of this method, which reduces the dependence of network file system performance on bandwidth.
Description of the drawings
The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a flowchart of the content-based file splitting method according to the present invention;
Fig. 2 is a schematic diagram of file division in the content-based file splitting method according to the present invention;
Fig. 3 is a schematic diagram of logical data block division in the content-based file splitting method according to the present invention;
Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method according to the present invention.
Detailed description of the embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to describe and explain the present invention and do not limit it.
Fig. 1 is a flowchart of the content-based file splitting method according to the present invention. With reference to Fig. 1, the implementation is described in detail as follows:
First, in step 101, the window length and the expected length of the logical data block (block) are selected, and the length range of the logical data block is limited.
A window is a contiguous region of the file; the suggested length is 48 bytes. A sliding window is obtained from the previous window in the file by sliding it forward one byte; the window length is unchanged after sliding.
A logical data block is a relatively small piece of data. In incremental synchronization, the logical data block is the smallest unit of synchronization. In a storage system, data is stored in units of logical data blocks. The expected length of a logical data block may be 2K, 4K, or 8K, or another value.
If the search for logical data block breakpoints finds very many breakpoints, the file is split into many very short logical data blocks, and the amount of block information to store and transmit becomes very large, possibly exceeding the file content itself. Conversely, if the logical data blocks are very long, the probability that a block can be reused becomes very small, and a change inside such a block causes a large amount of data to be transmitted. To avoid both problems, this step limits the length range of the logical data block. The minimum length is Tmin, generally set to half the expected logical data block length, although it may be set to another value according to actual conditions. The maximum length is Tmax, which may be chosen as 16K, 32K, or 64K bytes, etc., according to actual conditions.
In step 102, the Rabin fingerprint algorithm is used to compute the Rabin fingerprint value of each sliding window, and the breakpoints of the logical data blocks are determined according to the Rabin fingerprint values of the sliding windows.
In this step, the breakpoints of the logical data blocks are searched for and determined according to the file content (content based). The benefit is that inserting or deleting content in a file affects only the logical data blocks whose content changes, not the other logical data blocks. Specifically, starting from the beginning of the file, the Rabin fingerprint value of each sliding window is computed. When the low n bits of a sliding window's Rabin fingerprint value equal a given value, that sliding window forms the breakpoint of the first logical data block. Computation then continues from the first breakpoint; when the low n bits of a sliding window's Rabin fingerprint value again equal the given value, that sliding window forms the breakpoint of the second logical data block. Proceeding in this way, the Rabin fingerprint values of all sliding windows are computed and the breakpoints of all logical data blocks in the file are found, up to the end of the file. The end of the file is necessarily also a logical data block breakpoint.
The Rabin fingerprinting algorithm is a fingerprint algorithm proposed by Rabin of Harvard University. It computes sliding-window hash values efficiently, and the values it produces are highly dispersed.
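The efficiency of a sliding-window hash comes from updating the value in constant time as the window slides by one byte. The sketch below illustrates that idea with a simple polynomial rolling hash over a prime modulus. This is a stand-in chosen for brevity, not a true Rabin fingerprint (which uses polynomial arithmetic over GF(2)); `BASE`, `MOD`, and the function names are assumptions of this example.

```python
BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime; an arbitrary choice for this sketch

def window_hash(window: bytes) -> int:
    """Hash of one window computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h

def rolling_hashes(data: bytes, w: int):
    """Yield (end_offset, hash) for every w-byte window, with an O(1) update per byte."""
    if len(data) < w:
        return
    top = pow(BASE, w - 1, MOD)  # weight of the byte that leaves the window
    h = window_hash(data[:w])
    yield w, h
    for i in range(w, len(data)):
        # Drop the oldest byte, shift the remaining bytes, append the new byte.
        h = ((h - data[i - w] * top) * BASE + data[i]) % MOD
        yield i + 1, h
```

Each yielded value equals the hash that `window_hash` would compute from scratch for the same 48-byte (or w-byte) window, so the rolling update changes only the cost, not the result.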
Taking the low n bits of a sliding window's Rabin fingerprint value means taking the remainder of the fingerprint value divided by $2^n$. The value of n depends on the expected length of the logical data block. Because the values computed by the Rabin fingerprint algorithm are very uniformly distributed, if the file content is also quite random, the resulting logical data blocks will be about $2^n$ bytes long, that is, $2^n$ equals the expected logical data block length. The content covered by the breakpoint window is, of course, included in the logical data block. So if the expected logical data block length is 4K bytes, n should be 12 ($2^{12} = 4096 = 4\text{K}$).
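The relation between n and the expected block length can be written down directly. In this sketch the helper names `mask_bits_for` and `low_bits` are hypothetical, and the expected length is assumed to be a power of two, as in the 4K example.

```python
def mask_bits_for(expected_len: int) -> int:
    """Return n such that 2**n == expected_len (expected_len must be a power of two)."""
    if expected_len <= 0 or expected_len & (expected_len - 1):
        raise ValueError("expected length must be a power of two")
    return expected_len.bit_length() - 1

def low_bits(fingerprint: int, n: int) -> int:
    """Low n bits of the fingerprint, i.e. the remainder of fingerprint / 2**n."""
    return fingerprint & ((1 << n) - 1)

n = mask_bits_for(4096)  # expected logical data block length of 4K bytes
```

With uniformly distributed fingerprints, each window has a 1 in 2**n chance of matching the given value, which is why the blocks average about 2**n bytes.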
The given value that the low n bits of the sliding window's Rabin fingerprint value must equal can be any value, as long as it is fixed. We ran the following test: for files of different lengths and types, breakpoints were searched for with different given values. Whatever value was used, the resulting number of logical data blocks differed little, and the differences in block length were also small. This test further confirms that the given value can be chosen arbitrarily.
In this step, the window Rabin fingerprint value may also be left uncomputed for the Tmin bytes following the previous breakpoint (or the start of the file), which avoids producing logical data blocks that are too short.
If no new breakpoint is found within the Tmax bytes following the previous breakpoint (or the start of the file), the last backup breakpoint in that range is used. A backup breakpoint is determined by taking the low n-1 bits of the sliding window's Rabin fingerprint value and comparing them with another given value (different from the value used to judge logical data block breakpoints); if they are equal, the window can serve as a backup breakpoint. When no breakpoint exists, the last backup breakpoint becomes the logical data block breakpoint. If neither a breakpoint nor a backup breakpoint exists, the range is forcibly cut off as one logical data block, which avoids producing logical data blocks that are too long.
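The breakpoint search of step 102, including the Tmin skip, the Tmax limit, and the backup breakpoint, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a simple polynomial rolling hash as a stand-in for the Rabin fingerprint, and the function name, parameter names, and the default values `value=0` and `backup_value=1` are hypothetical choices for this example.

```python
BASE = 257
MOD = (1 << 61) - 1  # stand-in hash parameters, not a true Rabin fingerprint

def logical_breakpoints(data: bytes, n: int = 12, t_min: int = 2048,
                        t_max: int = 16384, w: int = 48,
                        value: int = 0, backup_value: int = 1) -> list:
    """Return the end offsets of the logical data blocks; the last one is len(data)."""
    top = pow(BASE, w - 1, MOD)
    mask, backup_mask = (1 << n) - 1, (1 << (n - 1)) - 1
    cuts, start = [], 0
    while start < len(data):
        end_limit = min(start + t_max, len(data))
        lo = start + max(t_min - w, 0)  # skip hashing inside the first Tmin bytes
        cut, backup, h = None, None, 0
        for i in range(lo, end_limit):
            if i - lo >= w:
                h = (h - data[i - w] * top) % MOD  # drop the byte leaving the window
            h = (h * BASE + data[i]) % MOD
            if i - lo + 1 < w:
                continue                           # window not full yet
            if h & mask == value:
                cut = i + 1                        # regular breakpoint
                break
            if h & backup_mask == backup_value:
                backup = i + 1                     # remember the last backup breakpoint
        if cut is None:
            if end_limit == len(data):
                cut = end_limit                    # end of file is always a breakpoint
            else:
                cut = backup or end_limit          # backup breakpoint, else forced cut at Tmax
        cuts.append(cut)
        start = cut
    return cuts
```

With these defaults every block except possibly the last is at least `t_min` bytes long, and no block exceeds `t_max` bytes.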
In step 103, the file is divided into logical data blocks. Using all the breakpoints of the file found in step 102, the content between every two adjacent breakpoints forms one logical data block. The content between the start of the file and the first breakpoint, and the content between the second-to-last breakpoint and the end of the file, likewise each form a logical data block.
In step 104, the expected length of the data storage block (chunk) is selected and the length range of the data storage block is limited. A data storage block is a relatively large piece of data. In a file system, the data storage block is the smallest storage unit used by the application layer, and data storage blocks with identical content are stored only once. The expected length of a data storage block may be 1M, 2M, or 4M, or another value.
The expected data storage block length is denoted Ec and the expected logical data block length is denoted Eb. The data storage block length range is limited to [Ec-m*Eb, Ec+k*Eb] (no minimum length applies to the last data storage block of a file), where m and k are values chosen as required.
In step 105, the breakpoints of the data storage blocks are searched for and determined. The present invention again does this according to the file content (content based). The benefit is that inserting or deleting content in a file affects only the data storage blocks whose content changes, not the other data storage blocks. Specifically, starting from the beginning of the file, the total length of several consecutive logical data blocks is computed. Once this total length is close to the expected data storage block length and bits n+1 to n+x of the Rabin fingerprint value of the last logical data block's breakpoint equal another given value (different from the value used to judge logical data block breakpoints), the breakpoint of that last logical data block is the breakpoint of the data storage block. If the breakpoint of the last logical data block does not satisfy the condition, and adding the next logical data block would not push the total length beyond the data storage block length range, the next logical data block is added and the breakpoint of the new last logical data block is tested. This continues until a breakpoint satisfying the condition is found or the total length approaches the upper limit of the data storage block length range. A breakpoint that satisfies the condition is both a logical data block breakpoint and a data storage block breakpoint. In other words, the breakpoint of a data storage block coincides with the breakpoint of the last of the consecutive logical data blocks that make up the data storage block. Starting from the previous data storage block breakpoint, the same method is then used to find the breakpoint of the next data storage block. All logical data block breakpoints are traversed in this way until the end of the file, yielding the breakpoints of all data storage blocks. The end of the file is necessarily also a data storage block breakpoint.
The x above depends on the data storage block length range. With the expected data storage block length denoted Ec, the expected logical data block length denoted Eb, and the data storage block length range limited to [Ec-m*Eb, Ec+k*Eb] (no minimum length applies to the last data storage block of a file), there may be m+k logical data block breakpoints within the data storage block length range. The value of bits n+1 to n+x of a breakpoint's Rabin fingerprint value must therefore range over [0, m+k-1], that is, the condition $m+k = 2^x$ must hold; in this way the probability that exactly one data storage block breakpoint falls within the length range is greatest. For example, if the expected logical data block length is 4K, the expected data storage block length is 4M, and the data storage block length range is [4M-32*4K, 4M+32*4K], then 32+32=64 logical data block breakpoints may become the data storage block breakpoint, and x should equal 6 ($2^6 = 64$). That is, the test is whether bits 13 to 18 of the logical data block breakpoint's Rabin fingerprint value (for n equal to 12) equal the given value. As with logical data block breakpoints, the given value can be any value, as long as it is fixed.
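The worked example above (Eb = 4K, Ec = 4M, m = k = 32) can be reproduced numerically. In this sketch the helper names `storage_selector_bits` and `selector_value` are hypothetical; bits are counted from 1 at the least significant end, as in the text.

```python
def storage_selector_bits(m: int, k: int) -> int:
    """Return x with 2**x == m + k (m + k must be a power of two)."""
    total = m + k
    if total <= 0 or total & (total - 1):
        raise ValueError("m + k must be a power of two")
    return total.bit_length() - 1

def selector_value(fingerprint: int, n: int, x: int) -> int:
    """Bits n+1 .. n+x of the fingerprint (1-based from the least significant bit)."""
    return (fingerprint >> n) & ((1 << x) - 1)

Eb, Ec, m, k, n = 4 * 1024, 4 * 1024 * 1024, 32, 32, 12
x = storage_selector_bits(m, k)            # 6 for the example in the text
length_range = (Ec - m * Eb, Ec + k * Eb)  # allowed data storage block lengths
```

Shifting right by n discards the bits already used for the logical data block breakpoint test, so the two tests use disjoint bits of the same fingerprint.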
Once the total length of the consecutive logical data blocks exceeds Ec-m*Eb, the breakpoint of the last logical data block must be checked to see whether bits n+1 to n+x of its Rabin fingerprint value equal the given value. If so, it becomes a data storage block breakpoint; otherwise the breakpoint of the next logical data block is checked. Once the total length of the consecutive logical data blocks exceeds Ec+k*Eb, the last breakpoint within the maximum length range of the data storage block is set as the breakpoint of this data storage block, which guarantees that the data storage block length stays within the limited range.
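The grouping of logical data blocks into data storage blocks described in step 105 can be sketched as a single pass over the logical breakpoints. This is an illustrative reading of the text, not the patented implementation: the function name, the input format (a list of `(end_offset, fingerprint)` pairs), and the handling of a single logical block longer than the upper limit are assumptions of this example.

```python
def storage_breakpoints(logical: list, ec: int, eb: int, m: int, k: int,
                        n: int, x: int, value: int) -> list:
    """logical: (end_offset, fingerprint) of each logical data block, in file order."""
    if not logical:
        return []
    lo, hi = ec - m * eb, ec + k * eb
    mask = (1 << x) - 1
    cuts, start, prev, i = [], 0, None, 0
    while i < len(logical):
        end, fp = logical[i]
        length = end - start
        if length > hi and prev is not None:
            cuts.append(prev)        # forced cut at the last breakpoint within range
            start, prev = prev, None
            continue                 # re-examine the same block from the new start
        if lo <= length <= hi and ((fp >> n) & mask) == value:
            cuts.append(end)         # bits n+1 .. n+x match: storage block breakpoint
            start, prev = end, None
        else:
            prev = end
        i += 1
    if cuts[-1:] != [logical[-1][0]]:
        cuts.append(logical[-1][0])  # end of file always closes the last storage block
    return cuts
```

Every returned offset is also a logical data block breakpoint, which matches the statement that a data storage block always ends where its last logical data block ends.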
In step 106, the file is divided into data storage blocks. Using all the data storage block breakpoints of the file found in step 105, the content between every two adjacent data storage block breakpoints forms one data storage block, and the logical data block and data storage block information are recorded. Data storage block information includes the length, offset, and MD5 or SHA-1 value of the data storage block.
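Recording the block information named in step 106 (offset, length, and hash value) might look like the following sketch. The `BlockInfo` type and the `describe_blocks` function are hypothetical names, and SHA-1 is used here although the text allows MD5 equally.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockInfo:
    offset: int   # position of the block within the file
    length: int   # block length in bytes
    digest: str   # SHA-1 of the block content, as a hex string

def describe_blocks(data: bytes, cuts: list) -> list:
    """Turn a list of breakpoint end offsets into per-block records."""
    infos, start = [], 0
    for end in cuts:
        block = data[start:end]
        infos.append(BlockInfo(start, len(block), hashlib.sha1(block).hexdigest()))
        start = end
    return infos
```

The same record shape works for logical data blocks and data storage blocks, since both are defined by a list of breakpoints.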
Fig. 2 is a schematic diagram of file division in the content-based file splitting method according to the present invention. As shown in Fig. 2, the whole file is divided into several data storage blocks, and each data storage block contains several logical data blocks.
Fig. 3 is a schematic diagram of logical data block division in the content-based file splitting method according to the present invention. As shown in Fig. 3, the zigzag lines represent breakpoints.
a represents the first file, or the original version of a file. The file is divided into many logical data blocks according to breakpoints found from its content; only the first 7 logical data blocks are shown in the figure.
b, compared with a, modifies some content in B2, but the modification does not produce a new breakpoint, and the breakpoint positions remain unchanged. Because the content of this logical data block has changed, a new logical data block B8 is generated.
c, compared with b, adds some content in B3, but the added content does not produce a new breakpoint, so the breakpoint positions remain unchanged. Because new content has been added to this logical data block, a new logical data block B9 is generated.
d, compared with c, deletes some content in B5, but the deletion neither produces a new breakpoint nor invalidates an existing one, so the breakpoint positions remain unchanged. Because part of the content of this logical data block has been deleted, a new logical data block B10 is generated.
e, compared with d, modifies the content at the breakpoint of B6, so that position is no longer a breakpoint; B6 and B7 are therefore merged into a new logical data block B11.
f, compared with e, adds new content in B4, and the added content produces a new breakpoint, so B4 is split into B12 and B13.
Files a, b, c, d, e, and f in Fig. 3 may be different versions of a file with the same name, or different files with similar content. Each file differs in content from the previous one, but most logical data blocks can be reused. Whether the content of two logical data blocks is identical can be judged by comparing their MD5 values, their SHA-1 values, or values computed by any other algorithm that is highly dispersed and can uniquely represent the features of the original information. In this way, when synchronizing a file, the logical data blocks of previously synchronized files can be reused, which reduces the volume of transmitted data.
Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method according to the present invention. As shown in Fig. 4, the short zigzag lines represent logical data block breakpoints, and the long zigzag lines represent data storage block breakpoints, which are also the breakpoints of the last logical data block of each data storage block. The shaded areas mark the places modified relative to the previous file (or the previous version of the file with the same name), and B denotes a logical data block.
a represents the first file, or the original version of a file. The file is divided into many logical data blocks and data storage blocks according to breakpoints found from its content.
b, compared with a, modifies some content in chunk1, but the modification does not change the data storage block breakpoints; chunk1 is updated and becomes chunk1'. chunk2, chunk3, and the subsequent chunks are unchanged.
c, compared with b, modifies some content at the breakpoint of chunk1', which invalidates that breakpoint. New breakpoints are searched for again, generating chunk1'' and chunk2'. chunk3 and the subsequent chunks are unchanged.
d, compared with c, modifies some content in chunk1'', and the modification produces a new data storage block breakpoint, so chunk1''' and chunk2'' are generated. chunk3 and the subsequent chunks are unchanged.
Files a, b, c, and d in Fig. 4 may be different versions of a file with the same name, or different files with similar content. Each file differs in content from the previous one, but the content of most data storage blocks is identical. Whether the content of two blocks is identical can be judged by comparing their MD5 values, their SHA-1 values, or values computed by any other algorithm that is highly dispersed and can uniquely represent the features of the original information. Data storage blocks with identical content can be reused: when a file is stored, data storage blocks that already exist need not be stored again, which avoids storing the same data block repeatedly.
In a storage system that uses the content-based file splitting method described above, once the file data has been divided into logical data blocks and data storage blocks, storing a file does not mean storing the file itself. Instead, each data storage block of the file is stored, and the data storage block information of the file is recorded in metadata, such as the list of data storage blocks the file contains and the length and MD5 value of each data storage block. Because data storage blocks with identical content are stored only once, a large amount of storage space is saved when files with similar or identical content must be stored.
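The storage scheme described above, in which blocks are stored once and each file is recorded as a list of block references, can be sketched with an in-memory store. The class name `ChunkStore`, its methods, and the use of SHA-1 as the block identifier are assumptions of this example; a real system would persist both the blocks and the metadata.

```python
import hashlib

class ChunkStore:
    """Toy de-duplicating store: blocks keyed by digest, files as block lists."""

    def __init__(self):
        self.chunks = {}  # digest -> block bytes, each stored once
        self.files = {}   # file name -> ordered list of (digest, length)

    def put(self, name: str, data: bytes, cuts: list) -> None:
        """Store a file given the end offsets of its data storage blocks."""
        entries, start = [], 0
        for end in cuts:
            chunk = data[start:end]
            digest = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # identical content is kept once
            entries.append((digest, len(chunk)))
            start = end
        self.files[name] = entries

    def get(self, name: str) -> bytes:
        """Rebuild a file from its metadata and the shared blocks."""
        return b"".join(self.chunks[d] for d, _ in self.files[name])
```

Two files that share most of their blocks therefore cost little more than one file plus the blocks that differ.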
When files are uploaded, backed up, or archived in a file system, the synchronization source divides the new file into data storage blocks and logical data blocks and sends this information to the destination. The destination can use various methods to look up the data storage blocks and logical data blocks that do not exist locally, assembles data storage blocks, and notifies the source which logical data blocks it needs. The source then sends the logical data blocks that the destination needs. The new file here may be a file modified from a previous version or a newly added file. To find the data storage blocks and logical data blocks it lacks, the destination may either compute the data storage block and logical data block information of its local files in real time or keep this information in metadata in advance for lookup; we recommend the latter. Logical data block information includes the position of the logical data block within its data storage block, the logical data block length, and the logical data block MD5 or SHA-1 value, etc. If the source has saved the information on the data storage blocks and logical data blocks that already exist at the destination, the source only needs to transmit the data storage blocks and logical data block information that do not exist at the destination, rather than those of the whole new file, so even less content is transmitted. Compared with transmitting the whole file content, synchronizing a file in this way transmits very little; this is the incremental synchronization application referred to in the present invention.
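The three-message exchange described above can be sketched as plain functions, one per step. This is a simplified illustration: it treats all blocks uniformly rather than distinguishing data storage blocks from logical data blocks, identifies blocks by SHA-1 only, and every function name is hypothetical.

```python
import hashlib

def digest(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

def source_manifest(blocks: list) -> list:
    """Step 1: the source sends block information (here only digests) in file order."""
    return [digest(b) for b in blocks]

def destination_missing(manifest: list, local: dict) -> list:
    """Step 2: the destination reports the digests it does not hold locally."""
    return [d for d in dict.fromkeys(manifest) if d not in local]

def source_payload(blocks: list, wanted: list) -> dict:
    """Step 3: the source sends only the blocks that were requested."""
    by_digest = {digest(b): b for b in blocks}
    return {d: by_digest[d] for d in wanted}

def destination_rebuild(manifest: list, local: dict, payload: dict) -> bytes:
    """The destination merges the received blocks and reassembles the file."""
    local.update(payload)
    return b"".join(local[d] for d in manifest)
```

Only the manifest and the missing blocks cross the network; blocks the destination already holds, from any earlier file, are reused.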
In a network file system, efficient operation requires good network connection performance, such as bandwidth. Over low-bandwidth connections, whether wired or wireless, the data deduplication and incremental synchronization features of the algorithm of the present invention can be used. In implementing a network file system, the mapping from a file to its physical storage blocks in the file system metadata can be replaced with a mapping from the file to the logical data blocks or data storage blocks of this algorithm, which reduces the dependence of network file system performance on bandwidth.
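One way such a file-to-block mapping could reduce traffic is sketched below. This is an illustrative assumption, not the implementation of the present invention: the client keeps a local cache keyed by block digest and fetches over the network only the blocks it does not already hold. The names `blocks_for_range`, `read_range`, and `fetch`, and the placeholder digests `d1` to `d3`, are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def blocks_for_range(block_lengths, start, length):
    """Return indices of the blocks that overlap bytes [start, start + length)."""
    if length <= 0:
        return []
    ends = list(accumulate(block_lengths))       # exclusive end offset of each block
    first = bisect_right(ends, start)            # first block ending after `start`
    last = bisect_right(ends, start + length - 1)
    return list(range(first, min(last, len(ends) - 1) + 1))


def read_range(file_map, cache, fetch, start, length):
    """Read bytes through the file-to-block mapping, fetching only uncached blocks."""
    lengths = [n for n, _ in file_map]
    starts = [0] + list(accumulate(lengths))
    out = bytearray()
    for i in blocks_for_range(lengths, start, length):
        n, digest = file_map[i]
        if digest not in cache:
            cache[digest] = fetch(digest)        # the only network transfer here
        lo = max(start - starts[i], 0)
        hi = min(start + length - starts[i], n)
        out += cache[digest][lo:hi]
    return bytes(out)


# Placeholder digests stand in for the MD5 values kept in metadata.
server = {"d1": b"0123456789", "d2": b"abcdefghijklmnopqrst", "d3": b"VWXYZ"}
file_map = [(10, "d1"), (20, "d2"), (5, "d3")]   # (length, digest) per block
fetched, cache = [], {}


def fetch(digest):
    fetched.append(digest)
    return server[digest]


first_read = read_range(file_map, cache, fetch, 8, 4)
second_read = read_range(file_map, cache, fetch, 0, 5)
```

The second read is served entirely from the cache, so no further block is fetched.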
Those of ordinary skill in the art will appreciate that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art may still modify the technical solutions recorded in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (16)

1. A content-based file splitting method, comprising the following steps:
1) selecting a window length and an expected logical data block length, and setting the length range of logical data blocks according to the expected logical data block length;
2) using the Rabin fingerprint algorithm to calculate the Rabin fingerprint value of each sliding window, and determining the breakpoints of logical data blocks according to the Rabin fingerprint values of the sliding windows;
3) dividing the file into logical data blocks;
4) selecting an expected data storage block length, and limiting the length range of data storage blocks;
5) searching for and confirming the breakpoints of data storage blocks;
6) dividing the file into data storage blocks.
2. The content-based file splitting method according to claim 1, characterized in that said step 2 further comprises the following steps:
1) starting from the beginning of the file, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, determining the breakpoint of the first logical data block according to the Rabin fingerprint values of the sliding windows, and taking this breakpoint as the breakpoint of the previous logical data block;
2) starting from the breakpoint of the previous logical data block, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, and determining the breakpoint of the next logical data block according to the Rabin fingerprint values of the sliding windows;
3) repeating step 2 above to find the breakpoints of all logical data blocks in the file.
3. The content-based file splitting method according to claim 1, characterized in that said step 2 further comprises the following steps:
1) starting one minimum logical data block length after the beginning of the file, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, determining the breakpoint of the first logical data block according to the Rabin fingerprint values of the sliding windows, and taking this breakpoint as the breakpoint of the previous logical data block;
2) starting one minimum logical data block length after the breakpoint of the previous logical data block, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, and determining the breakpoint of the next logical data block according to the Rabin fingerprint values of the sliding windows;
3) repeating step 2 above to find the breakpoints of all logical data blocks in the file.
4. The content-based file splitting method according to claim 1, characterized in that, in said step 3, dividing the file into logical data blocks takes the content between two adjacent logical data block breakpoints as one logical data block.
5. The content-based file splitting method according to claim 1, characterized in that the method of determining a logical data block breakpoint according to the Rabin fingerprint value of a sliding window is: when the low n bits of the Rabin fingerprint value of the sliding window equal a specified value, that sliding window constitutes a breakpoint.
6. The content-based file splitting method according to claim 5, characterized in that the value of n for the low n bits of the Rabin fingerprint value of the sliding window is calculated from 2^n = expected logical data block length.
7. The content-based file splitting method according to claim 2 or 3, characterized in that, if no new breakpoint is found within the maximum logical data block length, backup breakpoints are found according to the window Rabin fingerprint values, and the last backup breakpoint is used as the logical data block breakpoint.
8. The content-based file splitting method according to claim 7, characterized in that the method of finding a backup breakpoint according to the window Rabin fingerprint value is: when the low n-1 bits of the Rabin fingerprint value of the sliding window equal a specified value, the window corresponding to that sliding window constitutes a backup breakpoint.
9. The content-based file splitting method according to claim 8, characterized in that, if there is neither a breakpoint nor a backup breakpoint within the logical data block length range, the window at the maximum logical data block length position is used as the breakpoint.
10. The content-based file splitting method according to claim 1, characterized in that said step 5 further comprises the following steps:
1) starting from the beginning of the file, calculating the length of a plurality of consecutive logical data blocks, calculating the Rabin fingerprint value of each logical data block breakpoint within the length range of the data storage block, and setting the first breakpoint of a data storage block according to the Rabin fingerprint values of those logical data block breakpoints, this first breakpoint serving as the breakpoint of the previous data storage block;
2) starting from the breakpoint of the previous data storage block, calculating the length of a plurality of consecutive logical data blocks, calculating the Rabin fingerprint value of each logical data block breakpoint within the length range of the data storage block, and setting the next breakpoint of a data storage block according to the Rabin fingerprint values of those logical data block breakpoints;
3) repeating step 2 above until the end of the file, finding the breakpoints of all data storage blocks in the file.
11. The content-based file splitting method according to claim 1, characterized in that, in said step 5, if no data storage block breakpoint can be found within the length range of the data storage block according to the Rabin fingerprint values of the logical data block breakpoints, the last breakpoint within the maximum length range of the data storage block is set as the breakpoint of this data storage block.
12. The content-based file splitting method according to claim 1, characterized in that said step 6 takes the content between two adjacent data storage block breakpoints as one data storage block.
13. the method for a file storage is characterized in that, at first, file data is divided into logical blocks of data and DSB data store block, then, and each DSB data store block of storage file, and the data storage block message that log file comprised in metadata.
14. the method for file storage according to claim 13 is characterized in that, described DSB data store block information comprises: the length of DSB data store block, side-play amount and MD5 value.
15. the method for a synchronous documents is characterized in that, may further comprise the steps:
1) synchronous source end adopts the described file splitting method of claim to be divided into DSB data store block and logical blocks of data new file, and data storage block message and logical blocks of data information are sent to destination;
2) synchronous destination is searched local non-existent DSB data store block and logical blocks of data, and which logical blocks of data makes up DSB data store block and notification source end needs;
3) synchronous source end sends the logical blocks of data that destination needs.
16. the method for synchronous documents according to claim 15 is characterized in that, described logical blocks of data information comprises: the position of logical blocks of data in affiliated DSB data store block, logical blocks of data length, logical blocks of data MD5 value.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201010110841XA CN101788976B (en) 2010-02-10 2010-02-10 File splitting method based on contents
PCT/CN2010/077556 WO2011097887A1 (en) 2010-02-10 2010-10-01 Content-based file splitting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010110841XA CN101788976B (en) 2010-02-10 2010-02-10 File splitting method based on contents

Publications (2)

Publication Number Publication Date
CN101788976A true CN101788976A (en) 2010-07-28
CN101788976B CN101788976B (en) 2012-05-09

Family

ID=42532194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010110841XA Active CN101788976B (en) 2010-02-10 2010-02-10 File splitting method based on contents

Country Status (2)

Country Link
CN (1) CN101788976B (en)
WO (1) WO2011097887A1 (en)

Also Published As

Publication number Publication date
CN101788976B (en) 2012-05-09
WO2011097887A1 (en) 2011-08-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: BORQS COMMUNICATION TECHNOLOGY (BEIJING) CO., LTD.

Free format text: FORMER OWNER: BEIJING BORQS SOFTWARE TECHNOLOGY CO., LTD.

Effective date: 20121115

Owner name: BEIJING BORQS SOFTWARE TECHNOLOGY CO., LTD. WUHAN

Effective date: 20121115

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100102 CHAOYANG, BEIJING TO: 100015 CHAOYANG, BEIJING

TR01 Transfer of patent right

Effective date of registration: 20121115

Address after: 100015, B23 building, A, Hengtong business garden, No. 10 Jiuxianqiao Road, Beijing, Chaoyang District

Patentee after: Borqs Beijing Ltd.

Patentee after: Beijing Borqs Software Technology Co., Ltd.

Patentee after: Wuhan Borqs Technology Co., Ltd.

Address before: 100102 D building, building 9, South Central Road, Chaoyang District, Wangjing, Beijing, Wangjing

Patentee before: Beijing Borqs Software Technology Co., Ltd.