TWI441034B - Processing method for duplicate data - Google Patents

Info

Publication number
TWI441034B
Authority
TW
Taiwan
Prior art keywords
fingerprint value
metadata
metadata block
block
read
Prior art date
Application number
TW100128071A
Other languages
Chinese (zh)
Other versions
TW201308113A (en)
Inventor
Ming-Sheng Zhu
Chih Feng Chen
Original Assignee
Inventec Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inventec Corp
Priority to TW100128071A
Publication of TW201308113A
Application granted
Publication of TWI441034B

Description

Processing method for duplicate data

The present invention relates to a method for processing duplicate data, and more particularly to a fast method for processing duplicate data.

Deduplication is a data reduction technique commonly used in disk-based backup systems; its main purpose is to reduce the storage capacity consumed in a storage system. It works by searching, within a given time period, for duplicate variable-size data blocks at different locations in different files, and replacing the duplicate blocks with indicators. Storage systems are always full of redundant data, so deduplication has naturally become a focus of attention as a way to save space. Deduplication can reduce stored data to 1/20 of its original size, freeing more backup space. This not only allows backup data to be kept on the storage system for longer, but also saves much of the bandwidth needed for offline storage.

Because the data to be stored is kept on a server, the client must continually transmit data to the server, which then deduplicates it. However, to determine whether the data the client wants to back up is already stored on the server's disk, the entire relevant data block must be loaded into memory, and then only the metadata (meta data) portion of that block is read. Whether or not the block turns out to be a duplicate, the raw data (raw data) in the loaded block is never used.

The present invention provides a method for processing duplicate data. The method first divides a storage file into a plurality of raw data blocks (raw tanks) and a plurality of metadata blocks (meta tanks), wherein the raw data blocks and the metadata blocks correspond one to one, and each metadata block stores a stored fingerprint value (fingerprint) of the corresponding raw data block. A duplicate data determination request that includes a request fingerprint value is then received. The method reads at least one metadata block and compares the request fingerprint value with the stored fingerprint value of the read metadata block. When the two values are the same, a reference count value (referred counter) of the read metadata block is modified, and the modified metadata block is written back.

Each raw data block may include a plurality of raw data units (raw chunks), each metadata block may include a plurality of metadata units (meta chunks), and the raw data units and the metadata units correspond one to one.

According to one embodiment, the storage file may be stored on a disk of a server, and the modified metadata block may be written back to the disk.

In the step of reading at least one metadata block and comparing the request fingerprint value with the stored fingerprint value of the read metadata block, it may first be determined whether the metadata block corresponding to the request fingerprint value exists in a memory. When it does, the metadata block corresponding to the request fingerprint value is read from the memory, and the request fingerprint value is compared with the stored fingerprint value of the read metadata block.

In the same step, when the metadata block corresponding to the request fingerprint value does not exist in the memory, that metadata block may be read from the disk into the memory, and the request fingerprint value is then compared with the stored fingerprint value of the read metadata block.

According to another embodiment, a plurality of metadata blocks may be treated as an allocation group. In the step of reading at least one metadata block and comparing the request fingerprint value with the stored fingerprint value of the read metadata block, at least one allocation group may be read first, and the request fingerprint value is then compared with the stored fingerprint values of the metadata blocks in the read allocation group. The data size of an allocation group may be a positive integer multiple of 16 kilobytes (KB).

In addition, in the step of modifying the reference count value of the read metadata block and writing back the modified metadata block, the reference count value of the read metadata block may be modified and the allocation group corresponding to the modified metadata block may be written back.

The detailed features and advantages of the present invention are described in the embodiments below. The description is sufficient for anyone skilled in the relevant art to understand the technical content of the invention and implement it accordingly, and from the content disclosed in this specification, the claims, and the drawings, anyone skilled in the relevant art can readily understand the objects and advantages of the invention.

The present invention relates to a method for processing duplicate data that is applicable to a server. A server implementing the method can be used to back up the data of at least one client, and the server may provide a deduplication function.

Please refer to FIG. 1, which is a schematic diagram of a server according to an embodiment of the present invention. The server 20 can connect to a plurality of clients 10 through various networks, such as the Internet or an intranet, and back up the data transmitted by the clients 10. The server 20 may have hardware such as a memory 30, a disk 40, and a processor. The disk 40 stores a plurality of storage files 50 that have been completely received from the clients 10; when a storage file 50 is processed, the data to be processed can be loaded from the disk 40 into the memory 30 and then processed.

The storage file 50 may be, for example, a plain text file, any kind of multimedia file, or a snapshot generated when a client 10 performs a system backup. When a client 10 wants to perform a backup, the method can quickly determine whether the file to be backed up is one of the storage files 50 on the server 20.

Please refer to FIG. 2, which is a flowchart of the method for processing duplicate data according to an embodiment of the present invention.

First, the server 20 may divide the storage file 50 in advance into a plurality of metadata blocks (meta tanks) 60A, 60B, and 60C and a plurality of raw data blocks (raw tanks) 70A, 70B, and 70C (step S100). These are referred to collectively below as metadata blocks 60 and raw data blocks 70. The raw data blocks 70 and the metadata blocks 60 correspond one to one, and each metadata block 60 stores a stored fingerprint value (fingerprint) of the corresponding raw data block 70. For example, the metadata blocks 60A, 60B, and 60C may correspond to the raw data blocks 70A, 70B, and 70C respectively, and the metadata block 60A stores the stored fingerprint value of the raw data block 70A.

The server 20 may first divide the entire storage file 50 into multiple groups of data blocks using, for example, fixed-size partitioning or content-defined chunking (CDC), where each group includes one metadata block 60 and its corresponding raw data block 70. The paired metadata blocks 60 and raw data blocks 70 are then stored separately to facilitate subsequent reads.

Alternatively, the server 20 may first divide the content portion of the storage file 50 into raw data blocks 70 and then derive the corresponding metadata blocks 60 from them. The server 20 may compute the stored fingerprint value of each raw data block 70 using an algorithm such as MD5, SHA-1, SHA-256, SHA-512, or another one-way hash, and store these values in the corresponding metadata blocks 60.
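The fingerprinting step can be illustrated with a minimal sketch using Python's standard `hashlib` module. It assumes fixed-size partitioning and SHA-256; the function names, the record layout, and the initial reference count of 1 are hypothetical choices made for illustration and are not specified by the description.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed length of one raw data piece (64 KB)


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(piece: bytes, algorithm: str = "sha256") -> bytes:
    """Return the digest of one raw piece using the named hashlib algorithm."""
    return hashlib.new(algorithm, piece).digest()


def build_metadata(data: bytes) -> list:
    """Pair every raw piece with a metadata record holding its fingerprint."""
    return [
        # Hypothetical record layout: fingerprint plus a reference counter.
        {"fingerprint": fingerprint(piece), "ref_count": 1}
        for piece in split_fixed(data)
    ]
```

Only the small metadata records need to be consulted later when a duplicate check arrives; the raw pieces themselves can stay on disk.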

In one embodiment, each raw data block 70 may include a plurality of raw data units (raw chunks) 72, as shown in FIG. 3. Correspondingly, each metadata block 60 may include a plurality of metadata units (meta chunks) 62, and the raw data units 72 and the metadata units 62 also correspond one to one. For example, the 6 metadata units 62 in the metadata block 60A correspond one to one to the 6 raw data units 72 in the raw data block 70A.

Taking fixed-size partitioning as an example, assume that each raw data unit 72 is 64 kilobytes (KB) long, that each corresponding metadata unit 62 is 28 bytes long, and that each raw data block 70 is fixed at 2 megabytes (MB). Each data block then contains 32 units, and each metadata block 60 is 896 bytes long.
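The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative only; the sizes are the ones assumed in the example (64 KB raw data units, 28-byte metadata units, 2 MB raw data blocks), which give 32 units per block and 32 × 28 = 896 bytes of metadata.

```python
KB = 1024
MB = 1024 * KB

raw_chunk_size = 64 * KB   # length of one raw data unit
meta_chunk_size = 28       # length of one metadata unit, in bytes
raw_tank_size = 2 * MB     # length of one raw data block

# Number of units per block, and the resulting metadata block length.
chunks_per_tank = raw_tank_size // raw_chunk_size
meta_tank_size = chunks_per_tank * meta_chunk_size

print(chunks_per_tank)  # 32
print(meta_tank_size)   # 896 (bytes)
```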

The server 20 may then receive a duplicate data determination request from any client 10, wherein the request includes a request fingerprint value (step S120). To minimize data transmission between the client 10 and the server 20, a client 10 that wants to back up a request data block may send the server 20 only the request fingerprint value that represents that block. However, the algorithm used to compute the request fingerprint value from the request data block must be the same as the algorithm used to compute the stored fingerprint value from the raw data block 70.
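A client-side sketch of this idea is shown below. The function name `make_dedup_request` and the dictionary layout of the request are hypothetical; the description only requires that the request carry the fingerprint and that client and server use the same algorithm (SHA-256 is assumed here).

```python
import hashlib


def make_dedup_request(block: bytes, algorithm: str = "sha256") -> dict:
    """Build a duplicate-check request that carries only the fingerprint.

    The algorithm name is included so the server can confirm it matches
    the one used for its stored fingerprints (an assumed convention).
    """
    digest = hashlib.new(algorithm, block).hexdigest()
    return {"algorithm": algorithm, "fingerprint": digest}


block = b"x" * (2 * 1024 * 1024)   # a 2 MB request data block
request = make_dedup_request(block)
print(len(block), len(request["fingerprint"]))
```

With SHA-256 the hexadecimal fingerprint is 64 characters long, so the request is several orders of magnitude smaller than the 2 MB block it represents.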

After receiving the duplicate data determination request, the server 20 reads at least one metadata block 60 of the storage file 50 corresponding to the request, and compares the request fingerprint value with the stored fingerprint value of the read metadata block 60 (step S130).

Please also refer to FIG. 4, which is a flowchart of step S130 according to an embodiment of the present invention.

The server 20 first determines whether the metadata block 60 corresponding to the request fingerprint value exists in the memory 30 (step S132). When it does, the server 20 can directly read that metadata block 60 from the memory 30 (step S134) and compare the request fingerprint value with the stored fingerprint value of the read metadata block 60 (step S136). Conversely, if a search of the memory 30 does not find the metadata block 60 corresponding to the request fingerprint value, that metadata block 60 is first read from the disk 40 into the memory 30 (step S138), and steps S134 and S136 are then performed.

For example, assume that the metadata block 60C corresponds to the request fingerprint value. Because the metadata block 60C cannot be found in the memory 30, it is first loaded into the memory 30. Note that step S138 only needs to load the metadata block 60C, whose data length is very small, and does not load the corresponding raw data block 70C. As long as the metadata block 60C has not been swapped out of the memory 30 for lack of space, the server 20 can directly perform steps S134 and S136 whenever another request fingerprint value corresponding to the metadata block 60C is received.
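The lookup flow of steps S132 to S138 can be sketched as a small in-memory cache in front of the disk. Several things here are assumptions made for illustration: the class name `MetaTankCache`, the use of a dictionary to stand in for the disk, the least-recently-used eviction policy, and the idea that the caller already knows which metadata block (identified by `tank_id`) corresponds to the request fingerprint value. The description does not specify any of these.

```python
from collections import OrderedDict


class MetaTankCache:
    """Tiny LRU cache of metadata blocks; `disk` stands in for the disk."""

    def __init__(self, disk: dict, capacity: int = 2):
        self.disk = disk                # tank_id -> list of stored fingerprints
        self.capacity = capacity        # max metadata blocks held in memory
        self.memory = OrderedDict()     # tank_id -> list of stored fingerprints
        self.disk_reads = 0             # counts simulated disk loads

    def get(self, tank_id):
        if tank_id in self.memory:              # S132: already in memory?
            self.memory.move_to_end(tank_id)    # mark as recently used
        else:                                   # S138: load metadata only
            self.memory[tank_id] = self.disk[tank_id]
            self.disk_reads += 1
            if len(self.memory) > self.capacity:
                self.memory.popitem(last=False)  # evict least recently used
        return self.memory[tank_id]             # S134: read from memory

    def is_duplicate(self, tank_id, request_fp) -> bool:
        # S136: compare the request fingerprint with the stored ones.
        return request_fp in self.get(tank_id)
```

A second request that maps to the same metadata block is served from memory without another disk read, which is the behaviour described for the metadata block 60C above.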

The server 20 then determines whether the request fingerprint value is the same as the stored fingerprint value of the read metadata block 60 (step S140). When they are the same, the request fingerprint value corresponds to an existing raw data block 70, so the request data block represented by the request fingerprint value is duplicate data. For duplicate data, the server 20 only needs to modify a reference count value (referred counter) of the read metadata block 60 and write back the modified metadata block 60 (step S150). The reference count value represents the number of times the corresponding raw data block 70 is referenced; the same raw data block 70 may be referenced multiple times by the same client 10, or referenced simultaneously by different clients 10. The modified metadata block 60 is written back from the memory 30 to the disk 40.

If no storage file 50 or metadata block 60 corresponding to the request fingerprint value is found in step S130, the request data block corresponding to the request fingerprint value is a new data block rather than duplicate data. The server 20 can then perform an add procedure: it asks the client 10 to transmit the request data block and adds the received block to the disk 40.
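The two outcomes described above (duplicate versus new data) can be sketched as follows. This is a simplified illustration under stated assumptions: metadata records are looked up directly by fingerprint, `meta_index` and `disk_meta` are hypothetical dictionaries standing in for the memory and the disk, and the reference count is incremented by one per duplicate request.

```python
def handle_request(request_fp: bytes, meta_index: dict, disk_meta: dict) -> str:
    """Return "duplicate" or "new" for one request fingerprint value.

    meta_index: in-memory metadata records keyed by stored fingerprint
    disk_meta:  stand-in for the on-disk copy of the same records
    """
    record = meta_index.get(request_fp)
    if record is not None:                     # S140: fingerprints match
        record["ref_count"] += 1               # S150: modify reference count
        disk_meta[request_fp] = dict(record)   # S150: write the record back
        return "duplicate"
    # No matching metadata: the caller would ask the client to transmit
    # the request data block and add it to the disk.
    return "new"
```

For a duplicate, only the small metadata record is updated and written back; the raw data is neither transmitted nor read.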

When the request fingerprint value differs from the stored fingerprint value of the read metadata block 60, a hash collision has occurred. The server 20 may, for example by looking up a hash collision table, search the memory 30 or the disk 40 again for a storage file 50 or metadata block 60 corresponding to the request fingerprint value, and determine whether to perform steps S130 to S150 described above.

Note that although FIG. 2 uses the example of searching for the request fingerprint value in the metadata blocks 60 of a single storage file 50, the method can also search for the metadata block 60 corresponding to the request fingerprint value across multiple storage files 50.

In addition, the server 20 may treat a plurality of consecutive metadata blocks 60 as an allocation group, and use the allocation group as the unit for accessing metadata blocks 60. Please refer to FIG. 5, which is a schematic diagram of an allocation group according to an embodiment of the present invention.

The server 20 may treat a plurality of consecutive metadata blocks 60 as one allocation group 64, load the entire allocation group 64 into the memory 30 at once, and then read at least one of the metadata blocks 60 in it. In step S130, the server 20 may then first read at least one allocation group 64 and compare the request fingerprint value with the stored fingerprint values of the metadata blocks 60 in the read allocation group 64. When writing back in step S150, the entire allocation group 64 corresponding to the modified metadata block 60 may likewise be written back at once.

The data size of the allocation group 64 may be a positive integer multiple of 16 KB, for example 16 KB, 64 KB, or 128 KB, and may be chosen to match the disk sector size. Taking a server 20 with 64 KB disk sectors as an example, the unit of a single data input/output (I/O) operation is 64 KB. Combining multiple metadata blocks 60 into one allocation group 64 according to the disk sector setting therefore allows the maximum number of metadata blocks 60 to be read or written in a single disk I/O. This avoids reading or writing 64 KB of data each time for a single metadata block 60 that is only about 3 KB in size.
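The sizing argument can be made concrete with a short calculation. It uses the figures quoted in the example (a 64 KB I/O unit and a metadata block of roughly 3 KB); treating the metadata block as exactly 3 KB is an assumption made only to keep the arithmetic simple.

```python
KB = 1024

io_unit = 64 * KB         # size of one disk I/O in the example
meta_tank_size = 3 * KB   # approximate metadata block size quoted in the text

# How many whole metadata blocks fit into one allocation group of this size.
tanks_per_group = io_unit // meta_tank_size

# The allocation group size should be a positive integer multiple of 16 KB.
is_valid_group_size = io_unit > 0 and io_unit % (16 * KB) == 0

print(tanks_per_group)       # 21
print(is_valid_group_size)   # True
```

Under these assumptions a single 64 KB read brings about 21 metadata blocks into memory, instead of spending the same I/O on one block.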

In summary, when determining whether the request data block corresponding to a request fingerprint value already exists, the method only needs to load the corresponding metadata block into memory. The server therefore does not need to load both the metadata block and the entire associated raw data block into memory, which saves a large amount of disk I/O time. With the values of the foregoing embodiment, the method only needs to load a metadata block of about 3 KB to make the determination, and does not need to load the unused 2 MB raw data block into memory as well.

Furthermore, the method can configure allocation groups to match the disk sector setting and access multiple consecutive metadata blocks at once. For services such as full backup that involve a very large number of files or data blocks, the requested data blocks are highly sequential, so loading multiple consecutive metadata blocks at once improves the search hit rate and further reduces the number of disk I/O operations required. Moreover, because the metadata blocks are small, the server may also provide the stored fingerprint values needed for comparison to the clients, so that each client performs the comparison and removes the duplicate data itself.

The above detailed description of the preferred embodiments is intended to describe the features and spirit of the present invention more clearly, and is not intended to limit the scope of the invention to the preferred embodiments disclosed. On the contrary, the intention is to cover various modifications and equivalent arrangements within the scope of the claims of the present invention.

10: client
20: server
30: memory
40: disk
50: storage file
60, 60A, 60B, 60C: metadata block
62: metadata unit
64: allocation group
70, 70A, 70B, 70C: raw data block
72: raw data unit

FIG. 1 is a schematic diagram of a server according to an embodiment of the present invention.

FIG. 2 is a flowchart of a method for processing duplicate data according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of metadata units and raw data units according to an embodiment of the present invention.

FIG. 4 is a flowchart of step S130 according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of an allocation group according to an embodiment of the present invention.

Claims (8)

1. A method for processing duplicate data, comprising: dividing a storage file into a plurality of raw data blocks and a plurality of metadata blocks, wherein the raw data blocks and the metadata blocks correspond one to one, and each metadata block stores a stored fingerprint value of the corresponding raw data block; receiving a duplicate data determination request, wherein the duplicate data determination request includes a request fingerprint value; reading at least one of the metadata blocks, and comparing the request fingerprint value with the stored fingerprint value of the read metadata block; and when the request fingerprint value is the same as the stored fingerprint value of the read metadata block, modifying a reference count value of the read metadata block and writing back the modified metadata block.

2. The method for processing duplicate data according to claim 1, wherein each raw data block includes a plurality of raw data units, each metadata block includes a plurality of metadata units, and the raw data units and the metadata units correspond one to one.

3. The method for processing duplicate data according to claim 1, wherein the storage file is stored on a disk of a server, and the modified metadata block is written back to the disk.

4. The method for processing duplicate data according to claim 3, wherein the step of reading at least one of the metadata blocks and comparing the request fingerprint value with the stored fingerprint value of the read metadata block comprises: determining whether the metadata block corresponding to the request fingerprint value exists in a memory; and when the metadata block corresponding to the request fingerprint value exists in the memory, reading the metadata block corresponding to the request fingerprint value from the memory, and comparing the request fingerprint value with the stored fingerprint value of the read metadata block.

5. The method for processing duplicate data according to claim 4, wherein the step of reading at least one of the metadata blocks and comparing the request fingerprint value with the stored fingerprint value of the read metadata block comprises: when the metadata block corresponding to the request fingerprint value does not exist in the memory, reading the metadata block corresponding to the request fingerprint value from the disk into the memory, and comparing the request fingerprint value with the stored fingerprint value of the read metadata block.

6. The method for processing duplicate data according to claim 1, wherein a plurality of the metadata blocks serve as an allocation group, and the step of reading at least one of the metadata blocks and comparing the request fingerprint value with the stored fingerprint value of the read metadata block comprises: reading at least one allocation group; and comparing the request fingerprint value with the stored fingerprint values of the metadata blocks in the read allocation group.

7. The method for processing duplicate data according to claim 6, wherein the data size of the allocation groups is a positive integer multiple of 16 kilobytes.

8. The method for processing duplicate data according to claim 6, wherein the step of modifying the reference count value of the read metadata block and writing back the modified metadata block comprises: modifying the reference count value of the read metadata block, and writing back the allocation group corresponding to the modified metadata block.
TW100128071A 2011-08-05 2011-08-05 Processing method for duplicate data TWI441034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW100128071A TWI441034B (en) 2011-08-05 2011-08-05 Processing method for duplicate data

Publications (2)

Publication Number Publication Date
TW201308113A TW201308113A (en) 2013-02-16
TWI441034B true TWI441034B (en) 2014-06-11

Family

ID=48169820

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100128071A TWI441034B (en) 2011-08-05 2011-08-05 Processing method for duplicate data

Country Status (1)

Country Link
TW (1) TWI441034B (en)

Similar Documents

Publication Publication Date Title
US10169365B2 (en) Multiple deduplication domains in network storage system
US9201800B2 (en) Restoring temporal locality in global and local deduplication storage systems
US9268783B1 (en) Preferential selection of candidates for delta compression
US9405764B1 (en) Method for cleaning a delta storage system
US11886704B2 (en) System and method for granular deduplication
US8972672B1 (en) Method for cleaning a delta storage system
US8918390B1 (en) Preferential selection of candidates for delta compression
US9305004B2 (en) Replica identification and collision avoidance in file system replication
US8639669B1 (en) Method and apparatus for determining optimal chunk sizes of a deduplicated storage system
US9851917B2 (en) Method for de-duplicating data and apparatus therefor
US10339112B1 (en) Restoring data in deduplicated storage
US9569357B1 (en) Managing compressed data in a storage system
US11068405B2 (en) Compression of host I/O data in a storage processor of a data storage system with selection of data compression components based on a current fullness level of a persistent cache
US20170115883A1 (en) Processing of Incoming Blocks in Deduplicating Storage System
US20120150824A1 (en) Processing System of Data De-Duplication
US9400610B1 (en) Method for cleaning a delta storage system
US10877680B2 (en) Data processing method and apparatus
US10366072B2 (en) De-duplication data bank
US20120310936A1 (en) Method for processing duplicated data
US9026740B1 (en) Prefetch data needed in the near future for delta compression
US9383936B1 (en) Percent quotas for deduplication storage appliance
CN110908589B (en) Data file processing method, device, system and storage medium
US11836053B2 (en) Resource allocation for synthetic backups
US10776028B2 (en) Method for maximum data reduction combining compression with deduplication in storage arrays
US10255288B2 (en) Distributed data deduplication in a grid of processors

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees