WO2011097887A1 - A content-based file segmentation method - Google Patents

A content-based file segmentation method

Info

Publication number
WO2011097887A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
breakpoint
block
file
data storage
Prior art date
Application number
PCT/CN2010/077556
Other languages
English (en)
French (fr)
Inventor
张卫平
刘为怀
杨立辉
张元丰
李骞
Original Assignee
北京播思软件技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京播思软件技术有限公司
Publication of WO2011097887A1 publication Critical patent/WO2011097887A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Definitions

  • the present invention relates to a file segmentation method, and more particularly to a content-based file segmentation method.
  • the data deduplication technique is implemented by dividing a file into data blocks of substantially equal length. In the file system, only one copy of the same data block is stored.
  • the criterion for judging whether two data blocks have the same content may be comparing the blocks' MD5 values or their SHA-1 values.
  • the values calculated using the MD5 or SHA-1 algorithm are highly discrete.
  • the hash value calculated by the MD5 algorithm is 128 bits long, and the probability that data blocks with different contents yield the same MD5 hash value is on the order of 1/2^(B/2), where B is the bit length of the hash value. Taking the 128-bit MD5 algorithm as an example, the probability that the MD5 hash values of two different data blocks coincide is on the order of 1/2^64 (about 5.5×10^-20), which is usually considered negligible.
  • the SHA-1 algorithm is based on MD5, and its hash value is as long as 160 bits. The MD5 or SHA-1 value is generally considered to uniquely represent the characteristics of the original data, and is commonly used for encrypted password storage, digital signatures, file integrity verification, and identity authentication. In terms of computational efficiency, MD5 is better than SHA-1.
  • Incremental synchronization technology refers to the synchronization of files through the network, without the need to transfer the contents of the entire file, but only the content that does not exist in the destination storage and file system. If it is a synchronization between different versions of the same file, it can be understood as the change information of the transfer file.
  • the implementation method is to divide the file into data logic blocks, and compare the contents of the data logic block to find the same and different between the destination end and the source file. The same part does not need to be transmitted over the network, and is available at the destination; different parts need to be transmitted over the network, thus reducing the amount of network transmission.
  • the criteria for determining whether the data logic blocks are the same can also be compared by comparing MD5 values or SHA-1 values.
  • the popular remote data synchronization tool rsync is also an incremental synchronization technique; it uses the so-called 'rsync algorithm' to synchronize files between a local and a remote computer. Suppose file A' needs to be synchronized between two computers, and the previous version A of the file already exists at the destination; the rsync algorithm then proceeds through the following steps:
  • the destination splits file A into a set of non-overlapping data logic blocks of fixed length S bytes (the last block may be smaller than S), calculates a 32-bit checksum and a 128-bit MD4 value for each data logic block, and sends these checksums and MD4 values to the source.
  • the MD4 algorithm is the predecessor of the MD5 algorithm and is slightly less secure than MD5.
  • the source searches all data logic blocks of size S in file A' (the offset is arbitrary, not necessarily a multiple of S) for blocks that have the same checksum and MD4 value as some block of file A.
  • the source sends the destination a sequence of instructions for generating a copy of file A' at the destination, where each instruction is either a note that file A already contains a given data logic block, which therefore need not be retransmitted, or a literal data logic block that matches no block of file A.
  • the rsync algorithm transfers only the differing parts of the two files rather than the entire file each time, so it is quite fast. But rsync can only be used for synchronization between different versions of the same file name. In the example above, if file A' is similar in content to file A but has a different file name, rsync will still transfer the entire contents of A'.
  • because the above deduplication technique divides the file into data blocks of essentially equal length, its deduplication efficiency is low, and it cannot effectively reduce the amount of network transmission.
  • an object of the present invention is to provide a content-based file segmentation method, a file storage method, and a method for synchronizing files.
  • a content-based file segmentation method of the present invention includes the following steps:
  • the present invention also provides a file storage method, the method comprising the following steps:
  • Each data storage block of the file is stored, and the data storage block information contained in the file is recorded in the metadata.
  • the present invention also provides a synchronous file storage method, the method comprising the following steps:
  • the source side of the synchronization divides the new file into a data storage block and a data logic block by using the file segmentation method described in the claims, and sends the data storage block information and the data logic block information to the destination end;
  • the destination of the synchronization finds the data storage block and the data logic block that do not exist locally, constructs the data storage block and notifies the source end which data logic blocks are needed;
  • the synchronous source sends the data logic block needed by the destination.
  • the invention has obvious advantages and positive effects.
  • with the content-based file segmentation method of the invention, the differing content between different files, or between different versions of the same file, can be found accurately and efficiently.
  • in the storage system, since only one copy of each data storage block with the same content is stored, a large amount of storage space can be saved when files with similar or identical content need to be stored; in file uploading, backup and archiving in the file system, the source only needs to transmit the data storage block and data logic block information that does not exist at the destination, rather than that of the entire new file, so less content is transmitted.
  • in a network file system, the mapping in the metadata from files to physical storage blocks can be changed to a mapping from files to the data logic blocks or data storage blocks of this method, reducing the dependence of network file system performance on bandwidth.
  • FIG. 1 is a flow chart of a content-based file segmentation method in accordance with the present invention.
  • FIG. 2 is a schematic diagram of file division of a content-based file segmentation method according to the present invention.
  • FIG. 3 is a schematic diagram of data logical block division in a content-based file segmentation method according to the present invention.
  • FIG. 4 is a schematic diagram of data storage block partitioning in a content-based file segmentation method in accordance with the present invention.
  • FIG. 1 is a flow chart of a content-based file segmentation method according to the present invention. Referring to FIG. 1, a specific implementation process according to the present invention is described in detail as follows:
  • in step 101, the length of the window and the expected length of the data logic block are selected, and the length range of the data logic block is defined.
  • the window is a contiguous region in the file, and the recommended length is 48 bytes.
  • a sliding window is obtained from the previous window in the file by sliding one byte backwards; the length of the window does not change after sliding.
  • the data logic block is a relatively small block of data. When incremental synchronization is implemented, the data logic block is the smallest synchronization unit. The storage system does not store data in units of data logic blocks.
  • the expected length of the data logic block can be 2K, 4K or 8K, or it can be other values.
  • if the file is divided into very many data logic blocks whose lengths are all very short, the amount of information needed to store and transmit the data logic blocks becomes very large, possibly exceeding the storage and transmission volume of the file content itself; conversely, if the data logic blocks are very long, the probability of reusing a block becomes small, and a change inside a block causes a large amount of data transmission.
  • the length range of the data logic block is therefore limited: the minimum length Tmin is generally set to half the expected data logic block length, or according to the actual situation, and may also be set to other values; the maximum length Tmax may be chosen as 16K, 32K or 64K bytes according to the actual situation.
  • step 102 the Rabin fingerprint algorithm is used to calculate the Rabin fingerprint value of each sliding window, and the breakpoint of the data logic block is determined according to the Rabin fingerprint value of the sliding window.
  • the advantage of this approach is that inserting and deleting content in the file only affects the data logic blocks whose content changed, without affecting other data logic blocks.
  • the specific steps are: starting from the beginning of the file, calculate the Rabin fingerprint value of each sliding window; when the low n bits of a sliding window's fingerprint value are equal to a given value, that sliding window constitutes the breakpoint of the first data logic block. Then calculate the Rabin fingerprint value of each sliding window starting from the first breakpoint; when the low n bits again equal the given value, that sliding window constitutes the breakpoint of the second data logic block.
  • following this algorithm, the Rabin fingerprint values of all sliding windows are calculated and the breakpoints of all data logic blocks in the file are found, up to the end of the file.
  • the end of the file is necessarily also the breakpoint of a data logic block.
  • the Rabin fingerprinting algorithm is a fingerprint algorithm proposed by Rabin of Harvard University, USA. It is a highly efficient algorithm for computing the hash value of a sliding window, and the values calculated according to the Rabin fingerprint algorithm are highly discrete.
  • taking the low n bits of a sliding window's Rabin fingerprint value means taking the remainder of dividing the fingerprint value by 2^n; the value of n depends on the expected length of the data logic block.
  • the low n bits of the sliding window's fingerprint value are compared with a given value; as long as this value is fixed, its particular choice does not matter. We have run the following test: for files of different lengths and types, different values were used to find breakpoints. The result was that no matter what value is used, the number of data logic blocks finally produced differs little, and the lengths of the individual data logic blocks also differ very little. This test confirms the randomness of this given value.
  • the method of determining a backup breakpoint is to take the low n-1 bits of the sliding window's fingerprint value and compare them with another given value (not the same as the value used for data logic block breakpoints); if they are equal, the window can serve as a backup breakpoint. When no breakpoint exists within the maximum block length, the last backup breakpoint becomes the breakpoint of the data logic block; if there is neither a breakpoint nor a backup breakpoint, the range is forcibly divided off as one data logic block, to avoid producing overly long data logic blocks.
  • the file is then divided into data logic blocks: the content between every two adjacent breakpoints constitutes one data logic block, and the content between the beginning of the file and the first breakpoint, and between the penultimate breakpoint and the end of the file, likewise each constitute a data logic block.
  • the expected length of the data storage block (chunk) is selected, and the length range of the data storage block is limited.
  • a data storage block is a relatively large block of data. In the file system, the data storage block is the smallest storage unit used by the application layer, and only one copy of each data storage block with the same content is stored.
  • the desired length of the data storage block may be 1M, 2M or 4M, and may be other values.
  • the expected length of the data storage block is represented by Ec, and the expected length of the data logic block is represented by Eb.
  • the length of the data storage block is limited to [Ec-m*Eb, Ec+k*Eb] (the minimum length of the last data storage block of each file is not limited), where m and k are values given as needed.
  • the present invention finds and determines the data storage block breakpoints in a content-based way as well; the advantage of this approach is that inserting or deleting content in the file only affects the data storage blocks whose content changed, without affecting other data storage blocks.
  • the specific steps are: starting from the beginning of the file, calculate the total length of several consecutive data logic blocks. Once this total length approaches the expected data storage block length, and bits n+1 to n+x of the fingerprint value of the last data logic block's breakpoint are equal to another given value (not the same as the value used for data logic block breakpoints), that last data logic block breakpoint is the breakpoint of the data storage block.
  • if the breakpoint of the last data logic block does not satisfy the condition, and the total length after adding the next data logic block would not exceed the data storage block length range, then after adding the next data logic block the breakpoint of the new last data logic block is tested, until a breakpoint satisfying the condition is found or the total length approaches the upper limit of the data storage block length range.
  • a breakpoint satisfying the condition is both the breakpoint of a data logic block and the breakpoint of a data storage block. That is, the breakpoint of a data storage block is identical to the breakpoint of the last of the consecutive data logic blocks constituting that storage block.
  • the x above is related to the data storage block length range. With the expected data storage block length denoted Ec, the expected data logic block length denoted Eb, and the data storage block length range limited to [Ec-m*Eb, Ec+k*Eb] (the minimum length of the last data storage block of each file is not limited), there may be m+k data logic block breakpoints within the storage block length range; the value of bits n+1 to n+x of the breakpoint fingerprint must therefore range over [0, m+k-1], i.e. m+k = 2^x, which maximizes the probability that exactly one data storage block breakpoint lies within the range.
  • for example, if the expected data logic block length is 4K and the expected data storage block length is 4M, the data storage block length range is [4M-32*4K, 4M+32*4K]; there are then 32+32 = 64 data logic block breakpoints that could become the storage block breakpoint, so x should be 6 (2^6 = 64), i.e. bits 13 to 18 of the data logic block breakpoint's fingerprint value (for n = 12) are compared with the given value.
  • in step 106, the file is divided into data storage blocks: according to all the data storage block breakpoints of the file found in step 105, the content between each pair of adjacent data storage block breakpoints constitutes one data storage block, and the data logic block and data storage block information is recorded.
  • the data storage block information includes: the length of the data storage block, the offset, and the MD5 value or the SHA-1 value.
  • FIG. 2 is a schematic diagram of file segmentation of a content-based file segmentation method according to the present invention. As shown in FIG. 2, the entire file is divided into a plurality of data storage blocks, each of which includes a plurality of data logic blocks.
  • FIG. 3 is a schematic diagram of data logical block division in a content-based file segmentation method according to the present invention. As shown in FIG. 3, a zigzag line represents a breakpoint.
  • a indicates the first file, or the original version of the file. We look up the breakpoint of the data logic block according to the content, and divide the a file into many data logic blocks. Only the first 7 data logic blocks are displayed in the figure.
  • d compared with c: some of the contents of B5 are deleted, but the deletion neither produces a new breakpoint nor invalidates the original one, so the breakpoint positions remain unchanged. Since part of the block's content was deleted, a new data logic block B10 is generated.
  • f compared with e: new content is added in B4, and the added content produces new breakpoints, so B4 is split into B12 and B13.
  • the a, b, c, d, e, f files in Figure 3 may be different versions of the same file name or different files with similar contents. Each file is compared to the previous file, and the content changes, but most of the data logic blocks are reusable.
  • the algorithm for judging whether two data logic blocks have the same content may compare the blocks' MD5 values, or their SHA-1 values, or values computed by another highly discrete algorithm that can uniquely represent the characteristics of the original content. In this way, when files are synchronized, the data logic blocks of the previously synchronized file can be reused, reducing network transmission.
  • FIG. 4 is a schematic diagram of data storage block division in a content-based file segmentation method according to the present invention.
  • short jagged lines indicate data logic block breakpoints, and long jagged lines indicate data storage block breakpoints.
  • the breakpoint of a data storage block is also the breakpoint of the last data logic block of that storage block.
  • the shaded portion indicates where the previous file (or the previous version of the same file name) was modified, and B represents the data logic block.
  • a indicates the first file, or the original version of the file.
  • c compared with b: some modifications are made at the chunk1' breakpoint, invalidating that breakpoint; new breakpoints are found again, generating chunk1'' and chunk2'. chunk3 and the subsequent chunks are unchanged.
  • d compared with c: some modifications are made within chunk1'', producing a new data storage block breakpoint and thus generating chunk1''' and chunk2''.
  • chunk3 and the subsequent chunks are unchanged.
  • the a, b, c, and d files in Figure 4 may be different versions of the same file name or different files with similar contents. Each file's content changes compared with the previous file, but the content of most data storage blocks is the same.
  • the algorithm for judging whether two data logic blocks have the same content may compare the blocks' MD5 values, or their SHA-1 values, or values computed by another highly discrete algorithm that can uniquely represent the characteristics of the original content.
  • Data storage blocks with the same content can be reused. When storing files, there is no need to store existing data storage blocks, so that repeated storage of data blocks can be avoided.
  • in the storage system, after the above content-based file segmentation method has been used to divide the file data into data logic blocks and data storage blocks, it is not the file itself that is stored, but each data storage block of the file, with the data storage block information contained in the file recorded in the metadata, such as the list of data storage blocks the file contains and the length and MD5 value of each data storage block. Since only one copy of each data storage block with the same content is stored, a large amount of storage space can be saved when files with similar or identical content need to be stored.
  • the source of the synchronization divides the new file into data storage blocks and data logic blocks, and sends the information to the destination.
  • the destination can find data storage blocks and data logic blocks that do not exist locally by various methods, construct data storage blocks and notify the source side which data logic blocks are needed.
  • the source then sends the data logic block needed by the destination.
  • the new file here may be either a modified file for the previous version or a newly added file.
  • the destination may find the data storage blocks and data logic blocks that do not exist locally either by computing the local files' data storage block and data logic block information in real time, or by saving this information in the metadata beforehand for querying.
  • the data logic block information includes the position of the data logic block within its data storage block, the data logic block length, and the data logic block's MD5 or SHA-1 value, etc. If the source has already saved the destination's existing data storage block and data logic block information, the source only needs to transmit the data storage block and data logic block information that does not exist at the destination, rather than that of the entire new file, so even less content is transmitted. Compared with transferring the entire file content, synchronizing files in this way transfers very little, which is the incremental synchronization application referred to in the present invention.
  • the efficient operation of a network file system requires good network connection performance, i.e. sufficient bandwidth.
  • over a class of low-bandwidth network connections, such as low-bandwidth wired or wireless links, the data deduplication and incremental synchronization features of the algorithm of this invention can be exploited.
  • the mapping in the file system metadata from files to physical storage blocks can be changed to a mapping from files to the data logic blocks or data storage blocks of this algorithm, so as to reduce the dependence of network file system performance on bandwidth.


Description

A Content-Based File Segmentation Method

Technical Field

The present invention relates to a file segmentation method, and more particularly to a content-based file segmentation method.

Background Art
In existing computer storage and file systems, when two or more similar files share most of their content, each file must be stored separately, which occupies a large amount of storage space. The common way of solving this problem is data deduplication.
Data deduplication is implemented by dividing a file into data blocks of essentially equal length; within the file system, only one copy of each data block with the same content is stored. The criterion for judging whether two data blocks have the same content may be comparing the blocks' MD5 values or their SHA-1 values. Values calculated with the MD5 or SHA-1 algorithm are highly discrete. The hash value calculated by the MD5 algorithm is 128 bits long, and the probability that data blocks with different contents yield the same MD5 hash value is on the order of 1/2^(B/2), where B is the bit length of the hash value. Taking the 128-bit MD5 algorithm as an example, the probability that the MD5 hash values of two different data blocks coincide is on the order of 1/2^64 (about 5.5×10^-20), which is usually considered negligible. The SHA-1 algorithm is based on MD5, and its hash value is as long as 160 bits. The MD5 or SHA-1 value is generally considered to uniquely represent the characteristics of the original data, and is commonly used for encrypted password storage, digital signatures, file integrity verification, and identity authentication. In terms of computational efficiency, MD5 is better than SHA-1.
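The fixed-length deduplication described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's method; `dedup_store`, the 4K block size, and the two sample files are assumptions made for the example.

```python
import hashlib

def dedup_store(data: bytes, block_size: int = 4096):
    """Split data into fixed-length blocks; store one copy per distinct MD5."""
    store: dict[str, bytes] = {}   # MD5 hex digest -> block content
    recipe: list[str] = []         # per-file list of block digests (metadata)
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        digest = hashlib.md5(block).hexdigest()
        store.setdefault(digest, block)  # identical blocks are stored only once
        recipe.append(digest)
    return recipe, store

# Two "files" sharing one 4K block of content: the shared block is stored once.
recipe_a, store = dedup_store(b"x" * 8192)                    # two identical blocks
recipe_b, _ = dedup_store(b"x" * 4096 + b"y" * 4096)          # first block shared
```

After running this, `store` holds a single block for the first file, and the first entry of both recipes is the same digest, which is exactly the property the equal-length scheme exploits.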
After a file has been modified it needs to be synchronized; often very little of the content has changed, yet the entire file must still be synchronized, causing a large amount of network transmission. The technique currently used to solve this problem is incremental synchronization.
Incremental synchronization means synchronizing files over the network without transferring the entire file content; only the content that does not already exist in the destination storage and file system is transferred. For synchronization between different versions of the same file, this can be understood as transferring the file's change information. The implementation divides the file into data logic blocks and compares the contents of the data logic blocks to find what is the same and what differs between the destination and the source file. Identical parts need not be transferred over the network, since they are already available at the destination; only the differing parts need to be transferred, which reduces the amount of network transmission. Whether two data logic blocks are identical can likewise be judged by comparing MD5 or SHA-1 values.
The popular remote data synchronization tool rsync is also an incremental synchronization technique; it uses the so-called 'rsync algorithm' to synchronize files between a local and a remote computer. Suppose file A' needs to be synchronized between two computers, and the previous version A of the file already exists at the destination; the rsync algorithm then proceeds through the following steps:
1. The destination splits file A into a set of non-overlapping data logic blocks of fixed length S bytes (the last block may be smaller than S), calculates a 32-bit checksum and a 128-bit MD4 value for each data logic block, and sends these checksums and MD4 values to the source. The MD4 algorithm is the predecessor of the MD5 algorithm and is slightly less secure than MD5.
2. The source searches all data logic blocks of size S in file A' (the offset is arbitrary, not necessarily a multiple of S) for blocks that have the same checksum and MD4 value as some block of file A.
3. The source sends the destination a sequence of instructions for generating a copy of file A' at the destination, where each instruction is either a note that file A already contains a given data logic block, which therefore need not be retransmitted, or a literal data logic block that matches no block of file A.
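The three steps above can be sketched as follows. This is a simplified illustration, not rsync's actual implementation: the weak checksum is a plain byte sum rather than rsync's rolling 32-bit checksum (so it is recomputed at every offset instead of rolled), MD5 stands in for MD4, and the tiny block size S = 8 is chosen only to keep the example readable.

```python
import hashlib

S = 8  # block size in bytes (tiny, for illustration)

def destination_signatures(old: bytes) -> dict:
    """Step 1: the destination hashes its fixed-length blocks of the old file."""
    sigs = {}
    for i in range(0, len(old), S):
        block = old[i:i + S]
        weak = sum(block) & 0xFFFF                # stand-in for the 32-bit rolling checksum
        strong = hashlib.md5(block).hexdigest()   # MD5 here; real rsync used 128-bit MD4
        sigs[(weak, strong)] = i
    return sigs

def source_delta(new: bytes, sigs: dict) -> list:
    """Steps 2-3: scan the new file at every offset; emit block references or literals."""
    out, pos, lit = [], 0, bytearray()
    while pos < len(new):
        block = new[pos:pos + S]
        key = (sum(block) & 0xFFFF, hashlib.md5(block).hexdigest())
        if len(block) == S and key in sigs:
            if lit:
                out.append(("literal", bytes(lit))); lit = bytearray()
            out.append(("match", sigs[key]))      # destination already has this block
            pos += S
        else:
            lit.append(new[pos]); pos += 1        # no match: emit this byte literally
    if lit:
        out.append(("literal", bytes(lit)))
    return out

old = b"ABCDEFGHijklmnop"
new = b"XXABCDEFGHijklmnop"                       # two bytes inserted at the front
delta = source_delta(new, destination_signatures(old))
```

Because the source may match blocks at any offset, the two inserted bytes become a short literal and both original blocks are referenced rather than retransmitted.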
Technical Problem

The rsync algorithm transfers only the differing parts of the two files rather than the entire file each time, so it is quite fast. However, rsync can only be used for synchronization between different versions of the same file name. In the example above, if file A' is similar in content to file A but has a different file name, rsync will still transfer the entire contents of A'.
Because the above deduplication technique divides the file into data blocks of essentially equal length, its deduplication efficiency is low, and it cannot effectively reduce the amount of network transmission.
Technical Solution

To overcome these shortcomings of the prior art, the object of the present invention is to provide a content-based file segmentation method, a file storage method and a method for synchronizing files.
To achieve the above object, the content-based file segmentation method of the present invention comprises the following steps:
1) selecting the length of the window and the expected length of the data logic block, and setting the length range of the data logic block according to the expected length;
2) using the Rabin fingerprint algorithm to calculate the Rabin fingerprint value of each sliding window, and determining the breakpoints of the data logic blocks according to the sliding windows' Rabin fingerprint values;
3) dividing the file into data logic blocks;
4) selecting the expected length of the data storage block, and limiting the length range of the data storage block;
5) finding and confirming the breakpoints of the data storage blocks;
6) dividing the file into data storage blocks.
To achieve the above object, the present invention also provides a file storage method comprising the following steps:
splitting the file data into data logic blocks and data storage blocks;
storing each data storage block of the file, and recording in the metadata the data storage block information contained in the file.
To achieve the above object, the present invention also provides a method for synchronizing files, comprising the following steps:
1) the synchronization source divides the new file into data storage blocks and data logic blocks using the file segmentation method described above, and sends the data storage block information and data logic block information to the destination;
2) the destination finds the data storage blocks and data logic blocks that do not exist locally, constructs the data storage blocks, and notifies the source which data logic blocks it needs;
3) the source sends the data logic blocks needed by the destination.
Beneficial Effects

The present invention has clear advantages and positive effects. With the content-based file segmentation method of the invention, the differing content between different files, or between different versions of the same file, can be found accurately and efficiently. In the storage system, since only one copy of each data storage block with the same content is stored, a large amount of storage space can be saved when files with similar or identical content need to be stored. In file uploading, backup and archiving in the file system, the source only needs to transmit the data storage block and data logic block information that does not exist at the destination, rather than that of the entire new file, so less content is transmitted. In a network file system, the mapping in the file system metadata from files to physical storage blocks can be changed to a mapping from files to the data logic blocks or data storage blocks of this method, so as to reduce the dependence of network file system performance on bandwidth.
Brief Description of the Drawings

The drawings are provided for a further understanding of the invention and form part of the specification; together with the embodiments of the invention they serve to explain the invention, and do not limit it. In the drawings:
FIG. 1 is a flow chart of the content-based file segmentation method according to the present invention;
FIG. 2 is a schematic diagram of file division in the content-based file segmentation method according to the present invention;
FIG. 3 is a schematic diagram of data logic block division in the content-based file segmentation method according to the present invention;
FIG. 4 is a schematic diagram of data storage block division in the content-based file segmentation method according to the present invention.

Embodiments of the Invention

Preferred embodiments of the invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the invention, and do not limit it.
FIG. 1 is a flow chart of the content-based file segmentation method according to the present invention. With reference to FIG. 1, a specific implementation process according to the invention is described in detail as follows.
First, in step 101, the length of the window and the expected length of the data logic block are selected, and the length range of the data logic block is limited.
A window is a contiguous region of the file; the recommended length is 48 bytes. A sliding window is obtained from the previous window in the file by sliding one byte backwards; the length of the window does not change after sliding.
A data logic block is a relatively small block of data; when incremental synchronization is implemented, the data logic block is the smallest synchronization unit. The storage system does not store data in units of data logic blocks. The expected length of a data logic block may be 2K, 4K or 8K bytes, or some other value.
Two problems must be avoided when finding data logic block breakpoints. If the file contains very many breakpoints, it is divided into very many data logic blocks whose lengths are all very short, and the amount of information needed to store and transmit the data logic blocks becomes very large, possibly exceeding the storage and transmission volume of the file content itself. Conversely, if the data logic blocks are very long, the probability of reusing a block becomes very small, and a change inside a block causes a large amount of data transmission. In this step, the length range of the data logic block is therefore restricted: the minimum length Tmin is generally set to half the expected data logic block length, or according to the actual situation, and may also be set to other values; the maximum length Tmax may be chosen as 16K, 32K or 64K bytes, according to the actual situation.
In step 102, the Rabin fingerprint algorithm is used to calculate the Rabin fingerprint value of each sliding window, and the breakpoints of the data logic blocks are determined according to the sliding windows' Rabin fingerprint values.
In this step, the breakpoints of the data logic blocks are found and determined in a content-based way. The benefit of this approach is that inserting or deleting content in the file only affects the data logic blocks whose content changed, without affecting other data logic blocks. The specific steps are: starting from the beginning of the file, calculate the Rabin fingerprint value of each sliding window; when the low n bits of a sliding window's fingerprint value are equal to a given value, that sliding window constitutes the breakpoint of the first data logic block. Then calculate the Rabin fingerprint value of each sliding window starting from the first breakpoint; when the low n bits of a sliding window's fingerprint value are equal to the given value, that sliding window constitutes the breakpoint of the second data logic block. Following this algorithm, the Rabin fingerprint values of all sliding windows are calculated and the breakpoints of all data logic blocks in the file are found, up to the end of the file. The end of the file is necessarily also the breakpoint of a data logic block.
The Rabin fingerprinting algorithm is a fingerprint algorithm proposed by Rabin of Harvard University, USA. It is a highly efficient algorithm for computing the hash value of a sliding window, and the values calculated according to the Rabin fingerprint algorithm are highly discrete.
Taking the low n bits of a sliding window's Rabin fingerprint value means taking the remainder of dividing the fingerprint value by 2^n. The value of n depends on the expected length of the data logic block. Since values calculated by the Rabin fingerprint algorithm are very uniform, if the file content is also sufficiently random, the divided data logic blocks will be about 2^n bytes long, i.e. 2^n = the expected data logic block length (the data logic block additionally includes the content covered by the breakpoint window). So if we expect data logic blocks of 4K bytes, n should be 12 (2^12 = 4096 = 4K).
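The relationship between n, the bitmask and the expected block length can be checked directly. The fingerprint value below is an arbitrary stand-in, not a real Rabin fingerprint:

```python
n = 12                     # 2**12 = 4096 = 4K, the expected logic block length
mask = (1 << n) - 1        # 0xFFF: selects the low n bits
fp = 0x9A3F52C7            # an arbitrary stand-in for a window's fingerprint
low_bits = fp & mask       # identical to fp % 2**n, the remainder mod 2**n
```

Since a uniform fingerprint hits any fixed n-bit value with probability 1/2^n, a breakpoint is expected roughly every 2^n bytes.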
The low n bits of the sliding window's fingerprint value are compared with a given value; as long as this value is fixed, its particular choice does not matter. We have run the following test: for files of different lengths and types, different values were used to find breakpoints. The result was that no matter what value is used, the number of data logic blocks finally produced differs little, and the lengths of the individual data logic blocks also differ very little. This test confirms the randomness of this given value.
In this step, window fingerprint values may also be left uncomputed within the Tmin bytes following the previous breakpoint (or the start of the file), to avoid producing data logic blocks that are too short.
If no new breakpoint is found within Tmax bytes after the previous breakpoint (or the start of the file), the last backup breakpoint within that range is used. The method of determining a backup breakpoint is to take the low n-1 bits of the sliding window's fingerprint value and compare them with another given value (not the same as the value used for judging data logic block breakpoints); if they are equal, the window can serve as a backup breakpoint. When no breakpoint exists, the last backup breakpoint becomes the breakpoint of the data logic block; if there is neither a breakpoint nor a backup breakpoint, the range is forcibly divided off as one data logic block, to avoid producing overly long data logic blocks.
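Steps 101 to 103 can be sketched as follows. This is a toy illustration, not the patented algorithm: a plain polynomial hash stands in for the Rabin fingerprint, and the window length, n, the two given values, Tmin and Tmax are all scaled-down assumptions.

```python
import random

# Toy parameters; the patent recommends a 48-byte window, n = 12 for 4K
# blocks, and Tmin/Tmax around 2K and 16K-64K.  Scaled down for illustration.
W = 8                 # sliding window length
N = 6                 # test the low n bits of the fingerprint
MAGIC = 0             # given value for logic-block breakpoints
BACKUP_MAGIC = 1      # given value (low n-1 bits) for backup breakpoints
TMIN, TMAX = 16, 128  # minimum / maximum logic block length

def window_fp(data: bytes, end: int) -> int:
    """Polynomial hash of the W-byte window ending at `end` (a stand-in for
    a real Rabin fingerprint; recomputed here rather than rolled)."""
    h = 0
    for byte in data[end - W:end]:
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h

def split_logic_blocks(data: bytes) -> list[int]:
    """Return breakpoint positions; content between neighbours is one block."""
    points, start, backup = [], 0, -1
    pos = TMIN                                    # skip Tmin bytes after each cut
    while pos < len(data):
        fp = window_fp(data, pos)
        if fp & ((1 << N) - 1) == MAGIC:          # low n bits match: breakpoint
            points.append(pos); start, backup, pos = pos, -1, pos + TMIN
            continue
        if fp & ((1 << (N - 1)) - 1) == BACKUP_MAGIC:
            backup = pos                          # remember the last backup breakpoint
        if pos - start >= TMAX:                   # nothing found within Tmax
            cut = backup if backup != -1 else pos # use backup, else force a cut
            points.append(cut); start, backup, pos = cut, -1, cut + TMIN
            continue
        pos += 1
    points.append(len(data))                      # end of file is always a breakpoint
    return points

random.seed(7)
data = bytes(random.randrange(256) for _ in range(2000))
points = split_logic_blocks(data)

# A local edit typically disturbs only nearby breakpoints; blocks far from
# the insertion keep their content and can be reused.
edited = data[:100] + b"INSERTED" + data[100:]
points2 = split_logic_blocks(edited)
```

Every produced block respects the Tmin/Tmax bounds (except possibly the final block ending at EOF), which mirrors the length-range restriction of step 101.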
In step 103, the file is divided into data logic blocks. According to all the breakpoints of the file found in step 102, the content between each pair of adjacent breakpoints constitutes one data logic block; the content between the beginning of the file and the first breakpoint, and between the penultimate breakpoint and the end of the file, likewise each constitute a data logic block.
In step 104, the expected length of the data storage block (chunk) is selected and its length range is limited. A data storage block is a relatively large block of data. In the file system, the data storage block is the smallest storage unit used by the application layer, and only one copy of each data storage block with the same content is stored. The expected length of a data storage block may be 1M, 2M or 4M bytes, or some other value.
With the expected data storage block length denoted Ec and the expected data logic block length denoted Eb, the data storage block length range is restricted to [Ec-m*Eb, Ec+k*Eb] (the minimum length of the last data storage block of each file is not restricted), where m and k are values given as needed.
In step 105, the breakpoints of the data storage blocks are found and determined. The invention again finds and determines data storage block breakpoints in a content-based way, with the benefit that inserting or deleting content in the file only affects the data storage blocks whose content changed, without affecting other data storage blocks. The specific steps are: starting from the beginning of the file, accumulate the total length of several consecutive data logic blocks. Once this total length approaches the expected data storage block length, and bits n+1 to n+x of the fingerprint value of the last data logic block's breakpoint are equal to another given value (not the same as the value used for judging data logic block breakpoints), that last data logic block breakpoint is the breakpoint of the data storage block. If the breakpoint of the last data logic block does not satisfy the condition, and adding the next data logic block would not exceed the data storage block length range, then after adding the next data logic block, the breakpoint of the new last data logic block is tested again, until a breakpoint satisfying the condition is found or the total length approaches the upper limit of the data storage block length range. A breakpoint satisfying the condition is both a data logic block breakpoint and a data storage block breakpoint; that is, the breakpoint of a data storage block is identical to the breakpoint of the last of the consecutive data logic blocks constituting that storage block. Then, starting from the previous data storage block breakpoint, the same method is used to find the next data storage block breakpoint. All data logic block breakpoints are traversed in this way until the end of the file, finding the breakpoints of all data storage blocks. The end of the file is necessarily also a data storage block breakpoint.
The x above is related to the data storage block length range. With the expected data storage block length denoted Ec, the expected data logic block length denoted Eb, and the data storage block length range limited to [Ec-m*Eb, Ec+k*Eb] (the minimum length of the last data storage block of each file is not limited), there may be m+k data logic block breakpoints within the storage block length range; the value of bits n+1 to n+x of the breakpoint fingerprint must therefore range over [0, m+k-1], i.e. m+k = 2^x, which maximizes the probability that exactly one data storage block breakpoint lies within the range. For example, if the expected data logic block length is 4K and the expected data storage block length is 4M, the data storage block length range is [4M-32*4K, 4M+32*4K]; there are then 32+32 = 64 data logic block breakpoints that could become the storage block breakpoint, so x should be 6 (2^6 = 64), i.e. bits 13 to 18 of the data logic block breakpoint's fingerprint value (for n = 12) are compared with the given value. As with data logic block breakpoints, this given value only needs to be fixed; its particular choice does not matter.
Once the total length of the consecutive data logic blocks exceeds Ec-m*Eb, we check whether the breakpoint of the last data logic block satisfies the condition that bits n+1 to n+x of its fingerprint value equal the given value. If so, it becomes a data storage block breakpoint; otherwise the next data logic block breakpoint is checked. Once the total length of the consecutive data logic blocks would exceed Ec+k*Eb, the last breakpoint within the maximum data storage block length is set as the breakpoint of this data storage block, which guarantees that the storage block length stays within the restricted range.
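The selection of storage block breakpoints in step 105 can be sketched over a synthetic list of logic block breakpoints. All parameters here are toy assumptions (n = 6, x = 2, m = k = 2, Eb = 64, Ec = 512) chosen to keep the example small; `storage_breakpoints` and the fabricated breakpoint list are illustrative only.

```python
N = 6                      # low bits used for logic-block breakpoints
X = 2                      # m + k = 2**x candidate breakpoints in range
M = K = 2                  # so m + k = 4 = 2**X
EB, EC = 64, 512           # expected logic / storage block lengths
MAGIC2 = 3                 # given value compared with bits n+1 .. n+x

def storage_breakpoints(bps: list, file_len: int) -> list[int]:
    """`bps` is a list of (position, fingerprint) logic-block breakpoints.
    Cut where bits n+1..n+x match MAGIC2 inside the length range, forcing a
    cut at the last in-range breakpoint if the next one would overshoot."""
    out, start = [], 0
    for j, (pos, fp) in enumerate(bps):
        in_range = pos - start >= EC - M * EB
        bits = (fp >> N) & ((1 << X) - 1)             # bits n+1 .. n+x of the fp
        next_pos = bps[j + 1][0] if j + 1 < len(bps) else file_len
        if in_range and bits == MAGIC2:
            out.append(pos); start = pos              # condition met: cut here
        elif in_range and next_pos - start > EC + K * EB:
            out.append(pos); start = pos              # forced: last point in range
    if not out or out[-1] != file_len:
        out.append(file_len)                          # EOF is a storage breakpoint
    return out

# 16 synthetic logic blocks of 64 bytes; only the breakpoint at 448 carries a
# fingerprint whose bits n+1..n+2 equal MAGIC2 (192 >> 6 == 3).
bps = [(64 * i, 192 if 64 * i == 448 else 0) for i in range(1, 17)]
cuts = storage_breakpoints(bps, 1024)
```

With these inputs the content-chosen cut lands at 448, and the file end closes the final storage block.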
In step 106, the file is divided into data storage blocks. According to all the data storage block breakpoints of the file found in step 105, the content between each pair of adjacent data storage block breakpoints constitutes one data storage block, and the data logic block and data storage block information is recorded. The data storage block information includes: the length, offset, and MD5 or SHA-1 value of the data storage block.
FIG. 2 is a schematic diagram of file division in the content-based file segmentation method according to the present invention. As shown in FIG. 2, the entire file is divided into several data storage blocks, each of which comprises several data logic blocks.
FIG. 3 is a schematic diagram of data logic block division in the content-based file segmentation method according to the present invention. As shown in FIG. 3, jagged lines represent breakpoints.
a denotes the first file, or the original version of the file. The breakpoints of the data logic blocks are found according to the content, and file a is divided into many data logic blocks; only the first 7 are shown in the figure.
b compared with a: some modifications are made in B2, but the modified content produces no new breakpoint, and the breakpoint positions remain unchanged. Since the content of that data logic block changed, a new data logic block B8 is generated.
c compared with b: some content is added in B3, but the added content produces no new breakpoint, so the breakpoint positions remain unchanged. Since new content was added to that block, a new data logic block B9 is generated.
d compared with c: some content in B5 is deleted, but the deletion neither produces a new breakpoint nor invalidates the original one, so the breakpoint positions remain unchanged. Since part of the block's content was deleted, a new data logic block B10 is generated.
e compared with d: a modification is made at the breakpoint of B6; the content at the breakpoint changes, so it ceases to be a breakpoint, B6 and B7 are merged, and a new data logic block B11 is generated.
f compared with e: new content is added in B4, and the added content produces a new breakpoint, so B4 is split into B12 and B13.
The files a, b, c, d, e, f in FIG. 3 may be different versions of the same file name, or different files with similar content. Compared with its predecessor, each file's content changes, but most of the data logic blocks can be reused. The algorithm for judging whether two data logic blocks have the same content may compare the blocks' MD5 values, or their SHA-1 values, or values computed by another highly discrete algorithm that can uniquely represent the characteristics of the original content. In this way, when synchronizing files, the data logic blocks of the previously synchronized file can be reused, reducing network transmission.
FIG. 4 is a schematic diagram of data storage block division in the content-based file segmentation method according to the present invention. As shown in FIG. 4, short jagged lines represent data logic block breakpoints, and long jagged lines represent data storage block breakpoints, which are at the same time the breakpoints of the last data logic block of the respective storage block. Shaded parts indicate where the previous file (or the previous version of the same file name) was modified; B denotes a data logic block.
a denotes the first file, or the original version of the file. The breakpoints of the data logic blocks and data storage blocks are found according to the content, and file a is divided into many data logic blocks and data storage blocks.
b compared with a: some modifications are made in chunk1, but the modification does not change the data storage block breakpoint; the content of chunk1 is updated, becoming chunk1'. chunk2, chunk3 and the subsequent chunks are unchanged.
c compared with b: some modifications are made at the chunk1' breakpoint, invalidating that breakpoint; new breakpoints are found again, generating chunk1'' and chunk2'. chunk3 and the subsequent chunks are unchanged.
d compared with c: some modifications are made within chunk1'', producing a new data storage block breakpoint and thus generating chunk1''' and chunk2''. chunk3 and the subsequent chunks are unchanged.
The files a, b, c, d in FIG. 4 may be different versions of the same file name, or different files with similar content. Compared with its predecessor, each file's content changes, but the content of most data storage blocks is the same. The algorithm for judging whether two data logic blocks have the same content may compare the blocks' MD5 values, or their SHA-1 values, or values computed by another highly discrete algorithm that can uniquely represent the characteristics of the original content. Data storage blocks with identical content can be reused: when storing files, existing data storage blocks need not be stored again, which avoids duplicate storage of data blocks.
In a storage system, after the file data has been split into data logic blocks and data storage blocks with the content-based file segmentation method above, it is not the file itself that is stored, but each data storage block of the file, with the data storage block information contained in the file recorded in the metadata, such as the list of data storage blocks the file contains and the length and MD5 value of each data storage block. Since only one copy of each data storage block with the same content is stored, a large amount of storage space can be saved when files with similar or identical content need to be stored.
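A minimal sketch of such a storage layer, assuming a `ChunkStore` class of our own invention that keys blocks by MD5 and keeps per-file metadata records (digest, offset, length):

```python
import hashlib

class ChunkStore:
    """Stores each distinct data storage block once, keyed by its MD5."""
    def __init__(self):
        self.blocks: dict[str, bytes] = {}      # MD5 digest -> block content
        self.meta: dict[str, list] = {}         # filename -> block records

    def put_file(self, name: str, chunks: list) -> None:
        records, offset = [], 0
        for c in chunks:
            d = hashlib.md5(c).hexdigest()
            self.blocks.setdefault(d, c)        # duplicate content stored once
            records.append({"md5": d, "offset": offset, "length": len(c)})
            offset += len(c)
        self.meta[name] = records               # the file is just its metadata

    def get_file(self, name: str) -> bytes:
        return b"".join(self.blocks[r["md5"]] for r in self.meta[name])

store = ChunkStore()
shared = b"A" * 1024
store.put_file("v1", [shared, b"old tail"])
store.put_file("v2", [shared, b"new tail"])     # the shared chunk is reused
```

Storing the second version adds only one new block; the shared 1K chunk exists once in `blocks` while both metadata entries reference it.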
For file uploading, backup and archiving in the file system, the synchronization source divides the new file into data storage blocks and data logic blocks and sends this information to the destination. The destination can find the data storage blocks and data logic blocks that do not exist locally by various methods, construct the data storage blocks, and notify the source which data logic blocks it needs. The source then sends the destination the data logic blocks it needs. The new file here may be either a modified version of the previous file or a newly added file. The destination may find its locally missing data storage blocks and data logic blocks either by computing the local files' data storage block and data logic block information in real time, or by saving this information in the metadata beforehand for querying; we recommend the latter. The data logic block information includes the position of the data logic block within its data storage block, the data logic block length, and the data logic block's MD5 or SHA-1 value, etc. If the source has already saved the destination's existing data storage block and data logic block information, the source only needs to transmit the data storage block and data logic block information that does not exist at the destination, rather than that of the entire new file, so even less content is transmitted. Compared with transferring the entire file content, synchronizing files in this way transfers very little content; this is the incremental synchronization application referred to in the present invention.
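The three-step exchange can be sketched as follows, assuming for simplicity that blocks are exchanged at storage block granularity only and that the destination's store is a plain digest-to-content dictionary:

```python
import hashlib

def digest(b: bytes) -> str:
    return hashlib.md5(b).hexdigest()

def sync(source_chunks: list, dest_store: dict) -> int:
    """One round of the three-step exchange; returns bytes actually sent."""
    # 1) source -> destination: block information only (digests, not data)
    manifest = [digest(c) for c in source_chunks]
    # 2) destination -> source: which blocks it is missing
    needed = [d for d in manifest if d not in dest_store]
    # 3) source -> destination: only the missing blocks
    sent = 0
    for c in source_chunks:
        d = digest(c)
        if d in needed:
            dest_store[d] = c
            sent += len(c)
    # the destination can now rebuild the file from its store
    assert all(d in dest_store for d in manifest)
    return sent

dest = {digest(b"A" * 1000): b"A" * 1000}       # destination already has one chunk
transferred = sync([b"A" * 1000, b"B" * 1000], dest)
```

Only the chunk the destination lacks crosses the network; the shared chunk costs a digest, not its content.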
In a network file system, efficient operation requires good network connection performance, i.e. sufficient bandwidth. Over a class of low-bandwidth network connections, such as low-bandwidth wired or wireless links, the data deduplication and incremental synchronization features of the algorithm of this invention can be exploited. In the implementation of the network file system, the mapping in the file system metadata from files to physical storage blocks can be changed to a mapping from files to the data logic blocks or data storage blocks of this algorithm, so as to reduce the dependence of network file system performance on bandwidth.
A person of ordinary skill in the art will understand that the above are merely preferred embodiments of the present invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments, or substitute equivalents for some of their technical features. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (16)

  1. A content-based file segmentation method, comprising the steps of:
    1) selecting a window length and an expected logical data block length, and setting a logical data block length range according to the expected logical data block length;
    2) computing the Rabin fingerprint of each sliding window using the Rabin fingerprinting algorithm, and determining logical data block breakpoints according to the sliding windows' Rabin fingerprints;
    3) dividing the file into logical data blocks;
    4) selecting an expected data storage block length and limiting the data storage block length range;
    5) locating and confirming data storage block breakpoints;
    6) dividing the file into data storage blocks.
  2. The content-based file segmentation method according to claim 1, wherein step 2 further comprises the steps of:
    1) starting from the beginning of the file, computing the Rabin fingerprint of each sliding window using the Rabin fingerprinting algorithm, determining the breakpoint of the first logical data block according to those fingerprints, and taking this breakpoint as the previous logical data block breakpoint;
    2) starting from the previous logical data block breakpoint, computing the Rabin fingerprint of each sliding window and determining the breakpoint of the next logical data block according to those fingerprints;
    3) repeating step 2 above until all logical data block breakpoints in the file are found.
  3. The content-based file segmentation method according to claim 1, wherein step 2 further comprises the steps of:
    1) starting from one minimum logical data block length after the beginning of the file, computing the Rabin fingerprint of each sliding window using the Rabin fingerprinting algorithm, determining the breakpoint of the first logical data block according to those fingerprints, and taking this breakpoint as the previous logical data block breakpoint;
    2) starting from one minimum logical data block length after the previous logical data block breakpoint, computing the Rabin fingerprint of each sliding window and determining the breakpoint of the next logical data block according to those fingerprints;
    3) repeating step 2 above until all logical data block breakpoints in the file are found.
  4. The content-based file segmentation method according to claim 1, wherein in step 3 the file is divided into logical data blocks by taking the content between two adjacent logical data block breakpoints as one logical data block.
  5. The content-based file segmentation method according to claim 1, wherein a sliding window constitutes a breakpoint when the low n bits of its Rabin fingerprint equal a given value.
  6. The content-based file segmentation method according to claim 5, wherein the value n is obtained from 2^n = the expected logical data block length.
  7. The content-based file segmentation method according to claim 2 or 3, wherein if no new breakpoint is found within the maximum logical data block length, backup breakpoints are located according to the window Rabin fingerprints, and the last backup breakpoint is taken as the logical data block breakpoint.
  8. The content-based file segmentation method according to claim 7, wherein a window constitutes a backup breakpoint when the low n-1 bits of its sliding-window Rabin fingerprint equal a given value.
  9. The content-based file segmentation method according to claim 8, wherein if neither a breakpoint nor a backup breakpoint exists within the logical data block length range, the window at the maximum logical data block length is taken as the breakpoint.
  10. The content-based file segmentation method according to claim 1, wherein step 5 further comprises the steps of:
    1) starting from the beginning of the file, accumulating the lengths of consecutive logical data blocks, computing the Rabin fingerprint of each logical data block breakpoint that falls within the data storage block length range, and setting the first data storage block breakpoint according to those fingerprints, this breakpoint serving as the previous data storage block breakpoint;
    2) starting from the previous data storage block breakpoint, accumulating the lengths of consecutive logical data blocks, computing the Rabin fingerprint of each logical data block breakpoint within the data storage block length range, and setting the next data storage block breakpoint according to those fingerprints;
    3) repeating step 2 above until the end of the file, so that all data storage block breakpoints in the file are found.
  11. The content-based file segmentation method according to claim 1, wherein in step 5, if no data storage block breakpoint can be found from the logical data block breakpoints' Rabin fingerprints within the data storage block length range, the last breakpoint within the maximum data storage block length is set as the breakpoint of that data storage block.
  12. The content-based file segmentation method according to claim 1, wherein step 6 takes the content between two adjacent data storage block breakpoints as one data storage block.
  13. A file storage method, wherein the file data is first divided into logical data blocks and data storage blocks; each data storage block of the file is then stored, and the storage block information of the file is recorded in metadata.
  14. The file storage method according to claim 13, wherein the data storage block information comprises: the length, offset and MD5 value of each data storage block.
  15. A file synchronization method, comprising the steps of:
    1) the synchronization source dividing the new file into data storage blocks and logical data blocks using the file segmentation method described in the claims, and sending the data storage block information and logical data block information to the destination;
    2) the destination locating the data storage blocks and logical data blocks it does not hold locally, constructing the data storage blocks, and notifying the source which logical data blocks it needs;
    3) the source sending the logical data blocks the destination needs.
  16. The file synchronization method according to claim 15, wherein the logical data block information comprises: the position of the logical data block within its data storage block, the length of the logical data block, and the MD5 value of the logical data block.
PCT/CN2010/077556 2010-02-10 2010-10-01 A content-based file segmentation method WO2011097887A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010110841XA CN101788976B (zh) 2010-02-10 2010-02-10 A content-based file segmentation method
CN201010110841.X 2010-02-10

Publications (1)

Publication Number Publication Date
WO2011097887A1 true WO2011097887A1 (zh) 2011-08-18

Family

ID=42532194

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/077556 WO2011097887A1 (zh) 2010-02-10 2010-10-01 A content-based file segmentation method

Country Status (2)

Country Link
CN (1) CN101788976B (zh)
WO (1) WO2011097887A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968575A (zh) * 2018-09-30 2020-04-07 南京工程学院 A deduplication method for a big data processing system
US10831708B2 (en) 2017-12-20 2020-11-10 Mastercard International Incorporated Systems and methods for improved processing of a data file
WO2023108360A1 (zh) * 2021-12-13 2023-06-22 华为技术有限公司 Data management method and apparatus in a storage system

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101788976B (zh) 2010-02-10 2012-05-09 北京播思软件技术有限公司 A content-based file segmentation method
EP2544084A4 (en) * 2010-03-04 2014-03-19 Nec Corp STORAGE DEVICE
CN101963982B (zh) * 2010-09-27 2012-07-25 清华大学 Metadata management method for a deduplication storage system based on locality-sensitive hashing
CN102467571A (zh) * 2010-11-17 2012-05-23 英业达股份有限公司 Data block segmentation and appending methods for data deduplication
CN102567285A (zh) * 2010-12-13 2012-07-11 汉王科技股份有限公司 Method and apparatus for loading a document
CN102571709A (zh) * 2010-12-16 2012-07-11 腾讯科技(北京)有限公司 File upload method, client, server and system
CN102065098A (zh) * 2010-12-31 2011-05-18 网宿科技股份有限公司 Method and system for data synchronization between network nodes
CN102323958A (zh) * 2011-10-27 2012-01-18 上海文广互动电视有限公司 Data deduplication method
CN102682086B (zh) * 2012-04-23 2014-11-05 华为技术有限公司 Data chunking method and device
CN103873522B (zh) * 2012-12-14 2018-07-06 联想(北京)有限公司 Electronic device and file chunking method applied to an electronic device
CN103078709B (zh) * 2013-01-05 2016-04-13 中国科学院深圳先进技术研究院 Data redundancy identification method
CN103973723A (zh) * 2013-01-25 2014-08-06 中国科学院寒区旱区环境与工程研究所 Method and system for centralized synchronization of scientific data
CN104063377B (zh) * 2013-03-18 2017-06-27 联想(北京)有限公司 Information processing method and electronic device using the same
CN103279531B (zh) * 2013-05-31 2016-06-08 北京瑞翔恒宇科技有限公司 Content-based file chunking method in a distributed file system
CN103514250B (zh) * 2013-06-20 2017-04-26 易乐天 Method and system for global data deduplication, and storage device
CN103491452B (zh) * 2013-09-25 2017-01-25 北京奇虎科技有限公司 Method and device for playing video in a web page
CN104239575A (zh) * 2014-10-08 2014-12-24 清华大学 Method and device for storing and distributing virtual machine image files
CN105912268B (zh) * 2016-04-12 2020-08-28 韶关学院 Distributed data deduplication method based on self-matching features, and device therefor
CN106572090A (zh) * 2016-10-21 2017-04-19 网宿科技股份有限公司 Data transmission method and system
CN109445702B (zh) * 2018-10-26 2019-12-06 黄淮学院 A block-level data deduplication storage system
CN111722787B (zh) 2019-03-22 2021-12-03 华为技术有限公司 Chunking method and device
CN111711671B (zh) * 2020-06-01 2023-07-25 深圳华中科技大学研究院 Cloud storage method for efficient ciphertext file updating based on blind storage
CN112181312A (zh) * 2020-10-23 2021-01-05 北京安石科技有限公司 Method and system for fast reading of hard disk data
WO2023004528A1 (zh) * 2021-07-26 2023-02-02 深圳市检验检疫科学研究院 Parallelized named entity recognition method and device based on a distributed system
CN113627132B (zh) * 2021-08-27 2024-04-02 智慧星光(安徽)科技有限公司 Method, system, electronic device and storage medium for generating data deduplication marker codes

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2450025A (en) * 2004-06-17 2008-12-10 Hewlett Packard Development Co Algorithm for dividing a sequence of values into chunks using breakpoints
US20090164535A1 (en) * 2007-12-20 2009-06-25 Microsoft Corporation Disk seek optimized file system
US20090190760A1 (en) * 2008-01-28 2009-07-30 Network Appliance, Inc. Encryption and compression of data for storage
CN101788976A (zh) * 2010-02-10 2010-07-28 北京播思软件技术有限公司 A content-based file segmentation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0941537B1 (en) * 1996-12-02 2002-10-23 Thomson Consumer Electronics, Inc. Apparatus and method for identifying the information stored on a medium
US8015162B2 (en) * 2006-08-04 2011-09-06 Google Inc. Detecting duplicate and near-duplicate files
US8214517B2 (en) * 2006-12-01 2012-07-03 Nec Laboratories America, Inc. Methods and systems for quick and efficient data management and/or processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2450025A (en) * 2004-06-17 2008-12-10 Hewlett Packard Development Co Algorithm for dividing a sequence of values into chunks using breakpoints
US20090164535A1 (en) * 2007-12-20 2009-06-25 Microsoft Corporation Disk seek optimized file system
US20090190760A1 (en) * 2008-01-28 2009-07-30 Network Appliance, Inc. Encryption and compression of data for storage
CN101788976A (zh) * 2010-02-10 2010-07-28 北京播思软件技术有限公司 A content-based file segmentation method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10831708B2 (en) 2017-12-20 2020-11-10 Mastercard International Incorporated Systems and methods for improved processing of a data file
CN110968575A (zh) * 2018-09-30 2020-04-07 南京工程学院 A deduplication method for a big data processing system
CN110968575B (zh) * 2018-09-30 2023-06-06 南京工程学院 A deduplication method for a big data processing system
WO2023108360A1 (zh) * 2021-12-13 2023-06-22 华为技术有限公司 Data management method and apparatus in a storage system

Also Published As

Publication number Publication date
CN101788976B (zh) 2012-05-09
CN101788976A (zh) 2010-07-28

Similar Documents

Publication Publication Date Title
WO2011097887A1 (zh) A content-based file segmentation method
US10256978B2 (en) Content-based encryption keys
US8831030B2 (en) Transmission apparatus operation for VPN optimization by defragmentation and deduplication method
US9253277B2 (en) Pre-fetching stored data from a memory
US8115660B2 (en) Compression of stream data using a hierarchically-indexed database
US8019882B2 (en) Content identification for peer-to-peer content retrieval
US8595188B2 (en) Operating system and file system independent incremental data backup
US8577850B1 (en) Techniques for global data deduplication
US7373520B1 (en) Method for computing data signatures
US5850565A (en) Data compression method and apparatus
US7636767B2 (en) Method and apparatus for reducing network traffic over low bandwidth links
US9613046B1 (en) Parallel optimized remote synchronization of active block storage
US20150074361A1 (en) Identification of non-sequential data stored in memory
US10339124B2 (en) Data fingerprint strengthening
US20180357217A1 (en) Chunk compression in a deduplication aware client environment
US20140279953A1 (en) Reducing digest storage consumption in a data deduplication system
WO2011159517A2 (en) Optimization of storage and transmission of data
US20070055834A1 (en) Performance improvement for block span replication
US11436088B2 (en) Methods for managing snapshots in a distributed de-duplication system and devices thereof
Lou et al. Asymptotic analysis of data deduplication with a constant number of substitutions
Henson et al. Guidelines for using compare-by-hash
Kim et al. File similarity evaluation scheme for multimedia data using partial hash information
CN112416878A (zh) A cloud-platform-based file synchronization management method
US10949109B2 (en) Expansion cartridge for deduplication of data chunks in client devices interspersed in networked environments
Mishra et al. A Study of Data De-duplication Methods

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10845557

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23/11/2012)

122 Ep: pct application non-entry in european phase

Ref document number: 10845557

Country of ref document: EP

Kind code of ref document: A1