CN101788976B - File splitting method based on contents - Google Patents

Publication number
CN101788976B
Authority
CN
China
Prior art keywords
data
breakpoint
logical blocks
file
rabin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010110841XA
Other languages
Chinese (zh)
Other versions
CN101788976A (en
Inventor
张卫平
刘为怀
杨立辉
张元丰
李骞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Borqs Beijing Ltd.
Wuhan Borqs Technology Co., Ltd.
Beijing Borqs Software Technology Co Ltd
Original Assignee
Beijing Borqs Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Borqs Software Technology Co Ltd
Priority to CN201010110841XA
Publication of CN101788976A
Priority to PCT/CN2010/077556 (WO2011097887A1)
Application granted
Publication of CN101788976B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a content-based file splitting method comprising the following steps: selecting the window length, the expected length of the data logical blocks, and the length range of the data logical blocks; computing the Rabin fingerprint value of each sliding window with the Rabin fingerprint algorithm, and determining the breakpoint of each data logical block according to the Rabin fingerprint values of the sliding windows; splitting the file into data logical blocks; selecting the expected length of the data storage blocks and limiting their length range; searching for and determining the breakpoints of the data storage blocks; and splitting the file into data storage blocks. The method can accurately and efficiently find the differing content between different files or different versions of the same file, thereby saving a large amount of storage space in storage and file systems, reducing the amount of file data transmitted, and decreasing the dependence of network file system performance on bandwidth.

Description

A content-based file splitting method
Technical field
The present invention relates to a method for splitting files, and in particular to a content-based file splitting method.
Background art
In existing computer storage and file systems, two or more similar files whose contents are mostly identical must each be stored separately, which consumes a large amount of storage space. The common solution to this problem is data deduplication.
Data deduplication is implemented by dividing a file into data blocks of roughly equal length and storing only one copy of data blocks with identical content in the file system. Whether two data blocks have identical content can be judged by comparing their MD5 values or their SHA-1 values. Values computed with MD5 or SHA-1 are highly dispersed. The hash value computed by MD5 is 128 bits long. The probability that data blocks with different content yield the same hash value is on the order of 1/2^(B/2), where B is the number of bits in the hash value. Taking 128-bit MD5 as an example, the probability that data blocks with different content have the same MD5 hash value is on the order of 1/2^64 (approximately 5.4 × 10^-20), and an event with so small a probability is usually regarded as impossible. SHA-1 follows design principles similar to those of MD5 and produces a 160-bit hash value. It is generally accepted that an MD5 or SHA-1 value uniquely represents the characteristics of the original information, so these values are commonly used for password storage, digital signatures, file integrity checking, authentication, and so on. In terms of computational efficiency, MD5 is faster than SHA-1.
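The digest sizes and the 1/2^(B/2) order of magnitude quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name collision_order is an assumption introduced here and is not part of the patent.

```python
import hashlib

block = b"example data block"

# Digest sizes reported by the standard library, in bits.
md5_bits = hashlib.md5(block).digest_size * 8     # 128
sha1_bits = hashlib.sha1(block).digest_size * 8   # 160


def collision_order(bits: int) -> float:
    """Order of magnitude 1 / 2^(B/2) used in the text for a B-bit hash.

    Hypothetical helper for illustration only.
    """
    return 2.0 ** -(bits / 2)


print(md5_bits, collision_order(md5_bits))
print(sha1_bits, collision_order(sha1_bits))
```

For B = 128 the helper evaluates 2^-64, which matches the figure given for MD5.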
After a file is modified it has to be synchronized. Usually very little content has changed, yet the whole file must be synchronized, which causes a large amount of network transmission. The technique currently used to solve this problem is incremental synchronization.
Incremental synchronization means that, when a file is synchronized over a network, the content of the whole file need not be transmitted; only the content that does not exist in the storage or file system at the destination is transmitted. For synchronization between different versions of the same file, this can be understood as transmitting the changes to the file. It is implemented by dividing the file into data logical blocks and comparing the content of those blocks to find the identical and differing parts of the destination and source files. Identical parts can be obtained at the destination and need not be transmitted over the network; only differing parts have to be transmitted, which reduces the amount of data transmitted. Whether two data logical blocks are identical can likewise be judged by comparing MD5 or SHA-1 values.
The widely used remote data synchronization tool rsync is also an incremental synchronization technique. It uses the so-called "rsync algorithm" to synchronize files between a local and a remote computer. Suppose file A' has to be synchronized between two computers and the destination already holds an earlier version A of the file. The rsync algorithm then proceeds as follows:
1. The destination divides file A into a set of non-overlapping data logical blocks with a fixed length of S bytes (the last block may be shorter than S). For each block it computes a 32-bit checksum and a 128-bit MD4 value, and it sends these checksums and MD4 values to the source. MD4 is the predecessor of MD5 and is somewhat weaker than MD5 in terms of security.
2. The source searches all data logical blocks of size S in file A' (at any offset, not necessarily a multiple of S) for blocks that have the same checksum and MD4 value as a block of file A.
3. The source sends the destination a sequence of instructions for building a copy of file A' at the destination. Each instruction is either a reference to a data logical block that file A already contains and that need not be retransmitted, or a piece of data that matches no block of file A.
Because the rsync algorithm transmits only the parts in which the two files differ, rather than the whole file each time, it is quite fast. However, rsync can only synchronize between different versions of a file with the same name. If, in the example above, file A' has content similar to file A but a different file name, rsync still transmits the entire content of A'.
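The three rsync steps can be illustrated with a simplified Python sketch. It is not rsync's actual implementation: zlib.adler32 stands in for the 32-bit rolling checksum, hashlib.md5 stands in for MD4, the weak checksum is recomputed at every offset instead of being rolled, and the function names signatures, delta, and patch are assumptions made for this example.

```python
import hashlib
import zlib


def signatures(old: bytes, s: int) -> dict:
    """Step 1 (destination): weak checksum -> {strong hash -> block index}."""
    sigs = {}
    for i in range(0, len(old), s):
        blk = old[i:i + s]
        sigs.setdefault(zlib.adler32(blk), {})[hashlib.md5(blk).digest()] = i // s
    return sigs


def delta(new: bytes, sigs: dict, s: int) -> list:
    """Steps 2 and 3 (source): build ('copy', index) / ('data', bytes) instructions."""
    out, literal, pos = [], bytearray(), 0
    while pos < len(new):
        blk = new[pos:pos + s]
        strong = sigs.get(zlib.adler32(blk))          # cheap check first
        idx = strong.get(hashlib.md5(blk).digest()) if strong else None
        if idx is not None:
            if literal:                               # flush unmatched bytes
                out.append(("data", bytes(literal)))
                literal.clear()
            out.append(("copy", idx))                 # block already at destination
            pos += len(blk)
        else:
            literal.append(new[pos])                  # no match at this offset
            pos += 1
    if literal:
        out.append(("data", bytes(literal)))
    return out


def patch(old: bytes, instructions: list, s: int) -> bytes:
    """Destination: rebuild the new file from the old file and the instructions."""
    result = bytearray()
    for kind, arg in instructions:
        result += old[arg * s:(arg + 1) * s] if kind == "copy" else arg
    return bytes(result)
```

Because the blocks of the old file have a fixed length S, the sketch also shows why the source has to test every byte offset of A': an insertion shifts all later content relative to the block grid.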
Because the deduplication technique described above divides files into data blocks of roughly equal length, its deduplication efficiency is low and it cannot effectively reduce the amount of data transmitted.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide a content-based file splitting method, a file storage method, and a file synchronization method.
To achieve the above object, the content-based file splitting method of the present invention comprises the following steps:
1) selecting the window length and the expected length of the data logical blocks, and setting the length range of the data logical blocks according to the expected length;
2) using the Rabin fingerprint algorithm to compute the Rabin fingerprint value of each sliding window, and determining the breakpoints of the data logical blocks according to the Rabin fingerprint values of the sliding windows;
3) dividing the file into data logical blocks;
4) selecting the expected length of the data storage blocks, and limiting the length range of the data storage blocks;
5) searching for and determining the breakpoints of the data storage blocks;
6) dividing the file into data storage blocks.
To achieve the above object, the present invention also provides a file storage method comprising the following steps:
dividing the file data into data logical blocks and data storage blocks;
storing each data storage block of the file, and recording in metadata the information on the data storage blocks that the file comprises.
To achieve the above object, the present invention also provides a file synchronization method comprising the following steps:
1) the synchronization source divides the new file into data storage blocks and data logical blocks using the file splitting method described above, and sends the data storage block information and data logical block information to the destination;
2) the synchronization destination looks up the data storage blocks and data logical blocks that do not exist locally, and notifies the source which data logical blocks it needs in order to assemble the data storage blocks;
3) the synchronization source sends the data logical blocks that the destination needs.
The present invention has clear advantages and beneficial effects. With the content-based file splitting method of the present invention, the differing content between different files, or between different versions of the same file, can be found accurately and efficiently. In a storage system, because only one copy of data storage blocks with identical content is stored, a large amount of storage space is saved when files with similar or identical content have to be stored. When files are uploaded, backed up, or archived in a file system, the source only needs to transmit the data storage block and data logical block information that does not exist at the destination, rather than that of the whole new file, so less content is transmitted. In a network file system, the mapping from files to physical storage blocks in the file system metadata can be replaced by a mapping from files to the data logical blocks or data storage blocks of this method, which reduces the dependence of network file system performance on bandwidth.
Description of drawings
The accompanying drawings provide a further understanding of the present invention and form part of the specification. Together with the embodiments of the invention they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a flowchart of the content-based file splitting method according to the present invention;
Fig. 2 is a schematic diagram of file division in the content-based file splitting method according to the present invention;
Fig. 3 is a schematic diagram of data logical block division in the content-based file splitting method according to the present invention;
Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method according to the present invention.
Embodiments
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention and do not limit it.
Fig. 1 is a flowchart of the content-based file splitting method according to the present invention. With reference to Fig. 1, the specific implementation is described in detail as follows:
First, in step 101, the window length and the expected length of the data logical blocks (block) are selected, and the length range of the data logical blocks is limited.
A window is a contiguous region of the file; the suggested length is 48 bytes. A sliding window is obtained from the previous window in the file by sliding one byte toward the end of the file; the window length remains unchanged after sliding.
A data logical block is a relatively small piece of data. In incremental synchronization, the data logical block is the smallest unit of synchronization. In a storage system, data is stored in units of data logical blocks. The expected length of a data logical block may be 2K, 4K, or 8K, or another value.
The search for data logical block breakpoints may find a very large number of breakpoints in a file, dividing it into many very short data logical blocks; the amount of information needed to store and transmit the blocks then becomes very large and can exceed the storage and transmission volume of the file content itself. Conversely, the data logical blocks in a file may be very long, so that the probability of a block being reused becomes very small and a small change within such a block causes a large amount of data to be transmitted. To avoid both problems, the length range of the data logical blocks is limited in this step. The minimum length Tmin is generally set to half of the expected data logical block length, or to another value according to actual conditions. The maximum length Tmax may be 16K, 32K, or 64K bytes, chosen according to actual conditions.
In step 102, the Rabin fingerprint algorithm is used to compute the Rabin fingerprint value of each sliding window, and the breakpoints of the data logical blocks are determined according to the Rabin fingerprint values of the sliding windows.
In this step, the breakpoints of the data logical blocks are searched for and determined on the basis of the file content (content based). The benefit is that inserting or deleting content in the file affects only the data logical blocks whose content changes and does not affect the others. Concretely, the Rabin fingerprint value of each sliding window is computed starting from the beginning of the file. When the low n bits of a sliding window's Rabin fingerprint value equal a specified value, that sliding window forms the breakpoint of the first data logical block. Computation of the Rabin fingerprint value of each sliding window then continues from the first breakpoint; when the low n bits again equal the specified value, that sliding window forms the breakpoint of the second data logical block. Proceeding in this way, the Rabin fingerprint values of all sliding windows are computed and all data logical block breakpoints in the file are found, up to the end of the file. The end of the file is necessarily also a data logical block breakpoint.
The Rabin fingerprint (Rabin fingerprinting) algorithm is a fingerprint algorithm proposed by Rabin of Harvard University. It computes the hash values of sliding windows efficiently, and the values it produces are highly dispersed.
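The reason a sliding-window hash is cheap to compute is that each new value can be derived from the previous one by removing the byte that leaves the window and adding the byte that enters it. The Python sketch below illustrates this with a polynomial rolling hash over the integers modulo a prime, which is used here as a stand-in: a true Rabin fingerprint works with polynomials over GF(2). The constants BASE and MOD and the function names are assumptions made for this example.

```python
WINDOW = 48             # window length suggested in the text
BASE = 257              # assumed multiplier (illustrative)
MOD = (1 << 61) - 1     # assumed modulus, a Mersenne prime (illustrative)


def window_hash(window: bytes) -> int:
    """Hash of one window computed from scratch."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h


def rolling_hashes(data: bytes, window: int = WINDOW):
    """Yield (end_offset, hash) for every full window, sliding one byte at a time."""
    if len(data) < window:
        return
    top = pow(BASE, window - 1, MOD)        # weight of the oldest byte
    h = window_hash(data[:window])
    yield window, h
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % MOD   # drop the byte leaving the window
        h = (h * BASE + data[i]) % MOD           # add the byte entering the window
        yield i + 1, h
```

Each step costs a constant number of operations regardless of the window length, which is the property the text relies on.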
Taking the low n bits of a sliding window's Rabin fingerprint value means taking the remainder of the fingerprint value divided by 2^n. The value of n depends on the expected data logical block length. Because the values computed by the Rabin fingerprint algorithm are very uniformly distributed, if the file content is also fairly random, the resulting data logical blocks will be about 2^n bytes long; that is, 2^n = expected data logical block length. The content of the breakpoint window is, of course, included in the data logical block. So if the expected data logical block length is 4K bytes, n should be 12 (2^12 = 4096 = 4K).
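The relationship between n, the expected block length, and the low n bits can be written out as a short Python sketch. The function names mask_bits and low_bits are hypothetical and chosen only for this illustration.

```python
def mask_bits(expected_block_len: int) -> int:
    """Return n such that 2**n equals the expected block length.

    Assumes the expected length is a power of two, as in the 2K/4K/8K examples.
    """
    n = expected_block_len.bit_length() - 1
    if 1 << n != expected_block_len:
        raise ValueError("expected length must be a power of two")
    return n


def low_bits(fingerprint: int, n: int) -> int:
    """Low n bits of the fingerprint, i.e. the remainder modulo 2**n."""
    return fingerprint & ((1 << n) - 1)


n = mask_bits(4096)                  # 4K expected length
value = low_bits(0x12345, n)         # same as 0x12345 % 4096
```

A bit mask and a remainder modulo 2^n give the same result; the mask is simply the cheaper way to compute it.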
The specified value that the low n bits of a sliding window's Rabin fingerprint value must equal can be any value, as long as it is fixed. We ran the following test: for files of different lengths and different types, breakpoints were searched for using different specified values. Whatever value was used, the resulting number of data logical blocks differed little, and the lengths of the individual data logical blocks also differed little. This test confirms that the choice of the specified value is arbitrary.
In this step, the window Rabin fingerprint value may also be left uncomputed within the Tmin bytes following the previous breakpoint (or the beginning of the file), to avoid producing data logical blocks that are too short.
If no new breakpoint is found within the Tmax bytes following the previous breakpoint (or the beginning of the file), the last backup breakpoint in that range is used. A backup breakpoint is determined by taking the low n-1 bits of the sliding window's Rabin fingerprint value and comparing them with another specified value (different from the value used to determine data logical block breakpoints); if they are equal, the window can serve as a backup breakpoint. When no breakpoint exists, the last backup breakpoint becomes the data logical block breakpoint. If neither a breakpoint nor a backup breakpoint exists, the range is forcibly made into one data logical block, to avoid producing data logical blocks that are too long.
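The breakpoint rules of step 102 (low n bits equal to a fixed value, a minimum length Tmin, a maximum length Tmax, and backup breakpoints on the low n-1 bits) can be combined into one loop. The following Python sketch is an illustration under stated assumptions, not the patented implementation: it uses a polynomial rolling hash in place of a true Rabin fingerprint, it keeps rolling the hash inside the first Tmin bytes and merely ignores candidates there, and the function name, parameter names, and default values are hypothetical.

```python
def find_breakpoints(data: bytes, n: int = 12, t_min: int = 2048,
                     t_max: int = 16384, window: int = 48,
                     target: int = 0, backup_target: int = 1) -> list:
    """Return the end offsets of the data logical blocks (last one is len(data))."""
    base, mod = 257, (1 << 61) - 1       # assumed rolling-hash parameters
    top = pow(base, window - 1, mod)
    main_mask = (1 << n) - 1             # selects the low n bits
    backup_mask = (1 << (n - 1)) - 1     # selects the low n-1 bits
    cuts, start, backup, h = [], 0, None, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod   # drop the oldest byte
        h = (h * base + byte) % mod                  # add the newest byte
        end = i + 1
        length = end - start
        if end < window or length < t_min:
            continue                     # too early for any breakpoint
        if h & main_mask == target:
            cut = end                    # regular breakpoint
        else:
            if h & backup_mask == backup_target:
                backup = end             # remember the latest backup breakpoint
            if length < t_max:
                continue
            # Tmax reached: use the last backup breakpoint, else force a split.
            cut = backup if backup is not None else end
        cuts.append(cut)
        start, backup = cut, None
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))           # end of file is always a breakpoint
    return cuts
```

With these rules every block except possibly the last is at least t_min bytes long and no block exceeds t_max bytes, which is the purpose of the length range set in step 101.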
In step 103, the file is divided into data logical blocks. Using all breakpoints of the file found in step 102, the content between every two adjacent breakpoints forms one data logical block; the content from the beginning of the file to the first breakpoint and the content from the second-to-last breakpoint to the end of the file likewise each form one data logical block.
In step 104, the expected length of the data storage blocks (chunk) is selected and the length range of the data storage blocks is limited. A data storage block is a relatively large piece of data. In a file system, the data storage block is the smallest storage unit used by the application layer, and only one copy of data storage blocks with identical content is stored. The expected length of a data storage block may be 1M, 2M, or 4M, or another value.
The expected data storage block length is denoted Ec and the expected data logical block length is denoted Eb. The data storage block length range is limited to [Ec-m*Eb, Ec+k*Eb] (no minimum length is imposed on the last data storage block of a file), where m and k are values chosen as required.
In step 105, the breakpoints of the data storage blocks are searched for and determined. The present invention again does this on the basis of the file content (content based). The benefit is that inserting or deleting content in the file affects only the data storage blocks whose content changes and does not affect the others. Concretely, the total length of a number of consecutive data logical blocks is computed starting from the beginning of the file. Once this total length is close to the expected data storage block length and bits n+1 to n+x of the Rabin fingerprint value at the breakpoint of the last data logical block equal another specified value (different from the value used to determine data logical block breakpoints), the breakpoint of the last data logical block is the breakpoint of the data storage block. If the breakpoint of the last data logical block does not satisfy the condition, and the total length after adding the next data logical block does not exceed the data storage block length range, the next data logical block is added and the breakpoint of the new last data logical block is checked against the condition. This continues until a breakpoint satisfying the condition is found or the total length approaches the upper limit of the data storage block length range. A breakpoint that satisfies the condition is a data logical block breakpoint and also a data storage block breakpoint; in other words, the breakpoint of a data storage block coincides with the breakpoint of the last of the consecutive data logical blocks that make it up. The breakpoint of the next data storage block is then found in the same way, starting from the previous data storage block breakpoint. All data logical block breakpoints are traversed up to the end of the file, and all data storage block breakpoints are found. The end of the file is necessarily also a data storage block breakpoint.
The value x above depends on the data storage block length range. With the expected data storage block length denoted Ec and the expected data logical block length denoted Eb, the data storage block length range is limited to [Ec-m*Eb, Ec+k*Eb] (no minimum length is imposed on the last data storage block of a file). About m+k data logical block breakpoints may then lie within the data storage block length range. The value of bits n+1 to n+x of the breakpoint's Rabin fingerprint value must range over [0, m+k-1], i.e. the condition m+k = 2^x must hold, so that the probability of there being exactly one data storage block breakpoint within the length range is maximized. For example, if the expected data logical block length is 4K, the expected data storage block length is 4M, and the data storage block length range is [4M-32*4K, 4M+32*4K], then 32+32 = 64 data logical block breakpoints may become data storage block breakpoints, and x should be 6 (2^6 = 64); that is, it is checked whether bits 13 to 18 of the Rabin fingerprint value at the data logical block breakpoint (for n = 12) equal the specified value. As with data logical block breakpoints, the specified value can be any value as long as it is fixed.
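The worked example above (Eb = 4K, Ec = 4M, m = k = 32, n = 12) can be reproduced in Python. The helper names storage_bits and storage_field are assumptions introduced for this sketch.

```python
def storage_bits(m: int, k: int) -> int:
    """Return x with 2**x == m + k (m + k is assumed to be a power of two)."""
    total = m + k
    x = total.bit_length() - 1
    if 1 << x != total:
        raise ValueError("m + k must be a power of two")
    return x


def storage_field(fingerprint: int, n: int, x: int) -> int:
    """Extract bits n+1 .. n+x, counted from 1 at the least significant bit."""
    return (fingerprint >> n) & ((1 << x) - 1)


KB, MB = 1024, 1024 * 1024
eb, ec, m, k, n = 4 * KB, 4 * MB, 32, 32, 12

x = storage_bits(m, k)                        # 6, because 2**6 == 64
length_range = (ec - m * eb, ec + k * eb)     # [4M - 128K, 4M + 128K]
first_bit, last_bit = n + 1, n + x            # bits 13 to 18
```

Shifting the fingerprint right by n bits discards the bits already used for the data logical block breakpoint, so the two decisions draw on disjoint parts of the same fingerprint value.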
When the total length of the consecutive data logical blocks exceeds Ec-m*Eb, it is checked whether the breakpoint of the last data logical block satisfies the condition that bits n+1 to n+x of its Rabin fingerprint value equal the specified value. If it does, it becomes a data storage block breakpoint; otherwise the breakpoint of the next data logical block is checked. Once the total length of the consecutive data logical blocks exceeds Ec+k*Eb, the last breakpoint within the maximum length of the data storage block is made the breakpoint of this data storage block, which guarantees that the data storage block length stays within the limited range.
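The grouping of data logical blocks into data storage blocks described in step 105 can be sketched as follows. The sketch assumes the data logical blocks have already been determined and that the fingerprint value at each block's breakpoint is available; the function name, the input format (a list of (length, fingerprint) pairs), and the default target value are assumptions made for illustration.

```python
def group_storage_blocks(blocks, ec, eb, m, k, n, x, target=0):
    """Group data logical blocks into data storage blocks.

    blocks: list of (length, fingerprint at the block's breakpoint).
    Returns a list of lists of block indexes, one list per data storage block.
    """
    lo, hi = ec - m * eb, ec + k * eb
    field_mask = (1 << x) - 1
    groups, current, total = [], [], 0
    for idx, (length, fp) in enumerate(blocks):
        if current and total + length > hi:
            # Adding this block would exceed the upper limit: the previous
            # breakpoint becomes the data storage block breakpoint.
            groups.append(current)
            current, total = [], 0
        current.append(idx)
        total += length
        # Within the allowed range, test bits n+1 .. n+x of the fingerprint.
        if total >= lo and (fp >> n) & field_mask == target:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)   # end of file closes the last storage block
    return groups
```

Every data storage block boundary produced this way coincides with a data logical block boundary, as the text requires.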
In step 106, the file is divided into data storage blocks. Using all data storage block breakpoints of the file found in step 105, the content between every two adjacent data storage block breakpoints forms one data storage block, and the data logical block and data storage block information is recorded. Data storage block information comprises the length, offset, and MD5 or SHA-1 value of the data storage block.
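The block information recorded in step 106 (length, offset, and a hash value) maps naturally onto a small record type. The Python sketch below is illustrative; the class name BlockInfo, the function name block_infos, and the choice of MD5 over SHA-1 are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockInfo:
    offset: int   # position of the block within the file
    length: int   # block length in bytes
    md5: str      # hex digest of the block content


def block_infos(data: bytes, cuts: list) -> list:
    """Build one BlockInfo per block from a sorted list of end offsets."""
    infos, start = [], 0
    for end in cuts:
        digest = hashlib.md5(data[start:end]).hexdigest()
        infos.append(BlockInfo(start, end - start, digest))
        start = end
    return infos


infos = block_infos(b"abcdefghij", [4, 7, 10])
```

The same record shape works for data logical blocks and data storage blocks; only the list of end offsets differs.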
Fig. 2 is a schematic diagram of file division in the content-based file splitting method according to the present invention. As shown in Fig. 2, the whole file is divided into several data storage blocks, each comprising several data logical blocks.
Fig. 3 is a schematic diagram of data logical block division in the content-based file splitting method according to the present invention. As shown in Fig. 3, the zigzag lines represent breakpoints.
a represents the first file, or the original version of a file. Searching for data logical block breakpoints by content divides the file into many data logical blocks; only the first 7 are shown in the figure.
b, compared with a, is modified slightly in B2, but the modification does not create a new breakpoint and the breakpoint positions remain unchanged. Because the content of this data logical block has changed, a new data logical block B8 is generated.
c, compared with b, has some content added in B3, but the added content does not create a new breakpoint, so the breakpoint positions remain unchanged. Because new content has been added to this data logical block, a new data logical block B9 is generated.
d, compared with c, has some content deleted from B5, but the deletion neither creates a new breakpoint nor invalidates an existing one, so the breakpoint positions remain unchanged. Because part of the content of this data logical block has been deleted, a new data logical block B10 is generated.
e, compared with d, is modified at the breakpoint of B6, so the content at the breakpoint changes and it is no longer a breakpoint; B6 and B7 are therefore merged into a new data logical block B11.
f, compared with e, has new content added in B4, and the added content creates a new breakpoint, so B4 is split into B12 and B13.
Files a, b, c, d, e, and f in Fig. 3 may be different versions of a file with the same name, or different files with similar content. Each file differs in content from the previous one, but most data logical blocks can be reused. Whether two data logical blocks have identical content can be judged by comparing their MD5 values, their SHA-1 values, or values computed by any other algorithm that is highly dispersed and uniquely represents the characteristics of the original information. In this way, when a file is synchronized, data logical blocks from previously synchronized files can be reused, which reduces the amount of data transmitted.
Fig. 4 is a schematic diagram of data storage block division in the content-based file splitting method according to the present invention. As shown in Fig. 4, short zigzag lines represent data logical block breakpoints and long zigzag lines represent data storage block breakpoints, which are also the breakpoints of the last data logical block of each data storage block. Shaded areas mark where the previous file (or the previous version of the file with the same name) was modified, and B denotes a data logical block.
a represents the first file, or the original version of a file. Searching for data logical block and data storage block breakpoints by content divides the file into many data logical blocks and data storage blocks.
b, compared with a, is modified slightly in chunk1, but the modification does not change the data storage block breakpoints; chunk1 is updated and becomes chunk1'. chunk2, chunk3, and the following chunks are unchanged.
c, compared with b, is modified slightly at the breakpoint of chunk1', which invalidates this breakpoint; a new breakpoint is searched for, generating chunk1" and chunk2'. chunk3 and the following chunks are unchanged.
d, compared with c, is modified slightly in chunk1", and the modification creates a new data storage block breakpoint, generating chunk1"' and chunk2". chunk3 and the following chunks are unchanged.
Files a, b, c, and d in Fig. 4 may be different versions of a file with the same name, or different files with similar content. Each file differs in content from the previous one, but the content of most data storage blocks is identical. Whether two blocks have identical content can be judged by comparing their MD5 values, their SHA-1 values, or values computed by any other algorithm that is highly dispersed and uniquely represents the characteristics of the original information. Data storage blocks with identical content can be reused: when a file is stored, data storage blocks that already exist need not be stored again, which avoids storing data blocks repeatedly.
In a storage system that uses the above content-based file splitting method, once the file data has been divided into data logical blocks and data storage blocks, storing a file does not mean storing the file itself. Instead, each data storage block of the file is stored, and the information about the data storage blocks that the file comprises is recorded in metadata, such as the list of data storage blocks in the file and the length and MD5 value of each data storage block. Because only one copy of data storage blocks with identical content is stored, a large amount of storage space is saved when files with similar or identical content have to be stored.
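The storage scheme described above, in which blocks are stored once and each file is recorded as a list of block references in metadata, can be sketched as a small in-memory store. The class name BlockStore and its methods are hypothetical, and MD5 is used as the block identifier only because the text mentions it.

```python
import hashlib


class BlockStore:
    """Minimal in-memory sketch of a deduplicating block store."""

    def __init__(self):
        self.blocks = {}      # digest -> block content, each stored once
        self.metadata = {}    # file name -> list of (digest, length)

    def put_file(self, name, storage_blocks):
        entries = []
        for blk in storage_blocks:
            digest = hashlib.md5(blk).hexdigest()
            self.blocks.setdefault(digest, blk)   # keep the existing copy if present
            entries.append((digest, len(blk)))
        self.metadata[name] = entries

    def get_file(self, name):
        return b"".join(self.blocks[d] for d, _ in self.metadata[name])


store = BlockStore()
store.put_file("a", [b"AAAA", b"BBBB", b"CCCC"])
store.put_file("b", [b"AAAA", b"XXXX", b"CCCC"])   # shares two blocks with "a"
```

After both files are stored, the store holds four distinct blocks rather than six, while each file can still be reassembled from its metadata.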
When files are uploaded, backed up, or archived in a file system, the synchronization source divides the new file into data storage blocks and data logical blocks and sends this information to the destination. The destination can use various methods to look up the data storage blocks and data logical blocks that do not exist locally, and it notifies the source which data logical blocks it needs in order to assemble the data storage blocks. The source then sends the data logical blocks that the destination needs. The new file may be a modified version of an earlier file or a newly added file. To find the data storage blocks and data logical blocks that it lacks, the destination can either compute the data storage block and data logical block information of its local files in real time, or store this information in metadata in advance for lookup; we recommend the latter. Data logical block information comprises the position of the data logical block within its data storage block, its length, and its MD5 or SHA-1 value. If the source has kept the data storage block and data logical block information that already exists at the destination, the source only needs to transmit the data storage block and data logical block information that does not exist at the destination, rather than that of the whole new file, so even less is transmitted. Compared with transmitting the whole file content, synchronizing files in this way transmits very little; this is the incremental synchronization application referred to in the present invention.
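The three-message exchange described above can be sketched in Python with the network replaced by plain function calls. This is an assumption-laden illustration: the function names describe, missing_digests, and sync are hypothetical, the destination's block index is modeled as a dictionary keyed by SHA-1 digest, and only one level of blocks is shown.

```python
import hashlib


def describe(blocks):
    """Source: block information (digest and length) sent to the destination."""
    return [(hashlib.sha1(b).hexdigest(), len(b)) for b in blocks]


def missing_digests(descriptions, local_store):
    """Destination: digests of blocks that are not available locally."""
    return [d for d, _ in descriptions if d not in local_store]


def sync(source_blocks, local_store):
    """Run the exchange; return the rebuilt file and the number of bytes sent."""
    descriptions = describe(source_blocks)                      # step 1
    wanted = set(missing_digests(descriptions, local_store))    # step 2
    sent = {}
    for blk in source_blocks:                                   # step 3
        digest = hashlib.sha1(blk).hexdigest()
        if digest in wanted:
            sent[digest] = blk
    local_store.update(sent)
    rebuilt = b"".join(local_store[d] for d, _ in descriptions)
    return rebuilt, sum(len(b) for b in sent.values())


old_blocks = [b"alpha", b"beta", b"gamma"]
new_blocks = [b"alpha", b"BETA2", b"gamma", b"delta"]
destination = {d: b for (d, _), b in zip(describe(old_blocks), old_blocks)}
rebuilt, bytes_sent = sync(new_blocks, destination)
```

In this example only the two blocks unknown to the destination are transferred; the other two are taken from the destination's existing store.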
In a network file system (NFS), efficient operation requires good network connection performance, such as bandwidth. Over low-bandwidth connections, such as low-bandwidth wired or wireless network connections, the data deduplication and incremental synchronization features of the algorithm of the present invention can be used. In an NFS implementation, the mapping in the file system metadata from a file to its physical storage blocks can be changed to a mapping to the file's logical data blocks or data storage blocks under this algorithm, reducing the dependence of network file system performance on bandwidth.
Those of ordinary skill in the art will appreciate that the above is merely the preferred embodiments of the present invention and does not limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A content-based file splitting method, comprising the following steps:
1) selecting a window length and an expected logical data block length, and setting the length range of logical data blocks according to said expected logical data block length;
2) using the Rabin fingerprint algorithm, calculating the Rabin fingerprint value of each sliding window, and determining the breakpoints of logical data blocks according to the Rabin fingerprint values of the sliding windows;
3) dividing the file into logical data blocks;
4) selecting an expected data storage block length, and limiting the length range of data storage blocks;
5) searching for and determining the breakpoints of data storage blocks;
6) dividing the file into data storage blocks;
wherein said step 5) further comprises the following steps:
a) starting from the beginning of the file, calculating the lengths of a plurality of consecutive logical data blocks, calculating the Rabin fingerprint value of each logical data block breakpoint within the length range of the data storage block, and setting the first breakpoint of a data storage block according to the Rabin fingerprint values of said logical data block breakpoints, this first breakpoint serving as the breakpoint of the previous data storage block;
b) starting from the breakpoint of said previous data storage block, calculating the lengths of a plurality of consecutive logical data blocks, calculating the Rabin fingerprint value of each logical data block breakpoint within the length range of the data storage block, and setting the next breakpoint of a data storage block according to the Rabin fingerprint values of said logical data block breakpoints;
c) repeating step b) until the end of the file, thereby finding the breakpoints of all data storage blocks in the file.
2. The content-based file splitting method according to claim 1, characterized in that said step 2 further comprises the following steps:
1) starting from the beginning of the file, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, determining the breakpoint of the first logical data block according to the Rabin fingerprint values of said sliding windows, and taking this breakpoint as the breakpoint of the previous logical data block;
2) starting from the breakpoint of said previous logical data block, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, and determining the breakpoint of the next logical data block according to the Rabin fingerprint values of said sliding windows;
3) repeating step 2 above to find the breakpoints of all logical data blocks in the file.
3. The content-based file splitting method according to claim 1, characterized in that said step 2 further comprises the following steps:
1) starting at the minimum logical data block length after the beginning of the file, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, determining the breakpoint of the first logical data block according to the Rabin fingerprint values of said sliding windows, and taking this breakpoint as the breakpoint of the previous logical data block;
2) starting at the minimum logical data block length after the breakpoint of said previous logical data block, calculating the Rabin fingerprint value of each sliding window according to the Rabin fingerprint algorithm, and determining the breakpoint of the next logical data block according to the Rabin fingerprint values of said sliding windows;
3) repeating step 2 above to find the breakpoints of all logical data blocks in the file.
4. The content-based file splitting method according to claim 1, characterized in that, in dividing the file into logical data blocks in said step 3, the content between two adjacent logical data block breakpoints is taken as one logical data block.
5. The content-based file splitting method according to claim 1, characterized in that the method of determining a logical data block breakpoint according to the Rabin fingerprint value of a sliding window is: when the low n bits of the Rabin fingerprint value of said sliding window equal a specified value, this sliding window constitutes a breakpoint.
6. The content-based file splitting method according to claim 5, characterized in that the value n of the low n bits of the Rabin fingerprint value of said sliding window is calculated from 2^n = expected logical data block length.
7. The content-based file splitting method according to claim 2 or 3, characterized in that, if no new breakpoint is found within said maximum logical data block length, backup breakpoints are found according to the window Rabin fingerprint values, and the last backup breakpoint is taken as the logical data block breakpoint.
8. The content-based file splitting method according to claim 7, characterized in that said method of finding a backup breakpoint according to the window Rabin fingerprint value is: when the low n-1 bits of the Rabin fingerprint value of said sliding window equal a specified value, the window corresponding to said sliding window constitutes a backup breakpoint.
9. The content-based file splitting method according to claim 8, characterized in that, if within the logical data block length range there is neither a breakpoint nor a backup breakpoint, the window at the maximum-length position of this logical data block is taken as the breakpoint.
10. The content-based file splitting method according to claim 1, characterized in that, in said steps 5-7, if no data storage block breakpoint can be found within the length range of the data storage block according to the Rabin fingerprint values of said logical data block breakpoints, the last breakpoint within the maximum length range of the data storage block is set as the breakpoint of this data storage block.
11. The content-based file splitting method according to claim 1, characterized in that, in said step 6, the content between two adjacent data storage block breakpoints is taken as one data storage block.
12. the method for a file storage; It is characterized in that; At first, adopt each described method of claim 1-11 that file data is divided into logical blocks of data and DSB data store block, then; Each DSB data store block of storage file, and the data storage block message that log file comprised in metadata.
13. the method for file storage according to claim 12 is characterized in that, said DSB data store block information comprises: the length of DSB data store block, side-play amount and MD5 value.
14. the method for a synchronous documents is characterized in that, may further comprise the steps:
1) synchronous source end adopts each said file splitting method of claim 1-11 to be divided into DSB data store block and logical blocks of data new file, and data storage block message and logical blocks of data information are sent to destination;
2) synchronous destination is searched local non-existent DSB data store block and logical blocks of data, and which logical blocks of data makes up DSB data store block and notification source end needs;
3) synchronous source end sends the logical blocks of data that destination needs.
15. the method for synchronous documents according to claim 14 is characterized in that, said logical blocks of data information comprises: the position of logical blocks of data in affiliated DSB data store block, logical blocks of data length, logical blocks of data MD5 value.
CN201010110841XA 2010-02-10 2010-02-10 File splitting method based on contents Active CN101788976B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201010110841XA CN101788976B (en) 2010-02-10 2010-02-10 File splitting method based on contents
PCT/CN2010/077556 WO2011097887A1 (en) 2010-02-10 2010-10-01 Content-based file splitting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010110841XA CN101788976B (en) 2010-02-10 2010-02-10 File splitting method based on contents

Publications (2)

Publication Number Publication Date
CN101788976A CN101788976A (en) 2010-07-28
CN101788976B true CN101788976B (en) 2012-05-09

Family

ID=42532194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010110841XA Active CN101788976B (en) 2010-02-10 2010-02-10 File splitting method based on contents

Country Status (2)

Country Link
CN (1) CN101788976B (en)
WO (1) WO2011097887A1 (en)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1160730C (en) * 1996-12-02 2004-08-04 汤姆森消费电子有限公司 Method for identifying information stored in medium
GB2450025A (en) * 2004-06-17 2008-12-10 Hewlett Packard Development Co Algorithm for dividing a sequence of values into chunks using breakpoints
US8015162B2 (en) * 2006-08-04 2011-09-06 Google Inc. Detecting duplicate and near-duplicate files
US8214517B2 (en) * 2006-12-01 2012-07-03 Nec Laboratories America, Inc. Methods and systems for quick and efficient data management and/or processing
US7836107B2 (en) * 2007-12-20 2010-11-16 Microsoft Corporation Disk seek optimized file system
US8300823B2 (en) * 2008-01-28 2012-10-30 Netapp, Inc. Encryption and compression of data for storage
CN101788976B (en) * 2010-02-10 2012-05-09 北京播思软件技术有限公司 File splitting method based on contents

Also Published As

Publication number Publication date
WO2011097887A1 (en) 2011-08-18
CN101788976A (en) 2010-07-28


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: BORQS COMMUNICATION TECHNOLOGY (BEIJING) CO., LTD.

Free format text: FORMER OWNER: BEIJING BORQS SOFTWARE TECHNOLOGY CO., LTD.

Effective date: 20121115

Owner name: BEIJING BORQS SOFTWARE TECHNOLOGY CO., LTD. WUHAN

Effective date: 20121115

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100102 CHAOYANG, BEIJING TO: 100015 CHAOYANG, BEIJING

TR01 Transfer of patent right

Effective date of registration: 20121115

Address after: 100015, B23 building, A, Hengtong business garden, No. 10 Jiuxianqiao Road, Beijing, Chaoyang District

Patentee after: Borqs Beijing Ltd.

Patentee after: Beijing Borqs Software Technology Co., Ltd.

Patentee after: Wuhan Borqs Technology Co., Ltd.

Address before: 100102 D building, building 9, South Central Road, Chaoyang District, Wangjing, Beijing, Wangjing

Patentee before: Beijing Borqs Software Technology Co., Ltd.