TW201224787A

TW201224787A - A deduplication system

Info

Publication number: TW201224787A
Application number: TW99143291A
Authority: TW
Inventors: Ming-Sheng Zhu; Chih-Feng Chen
Original assignee: Inventec Corp
Priority date: 2010-12-10
Filing date: 2010-12-10
Publication date: 2012-06-16

Abstract

A deduplication system comprises a client and a server. The client compares the feature value of each data block with the feature values stored in the client. If the client already holds the same feature value, it deletes the corresponding data block. The server's data management module is connected to the client's data management module. If the server does not hold the feature value, it fetches the corresponding data block from the client. The file management module records the storage addresses of the data blocks on the server in an index file.

Description

[Technical Field]

The present invention relates to a file storage system, and in particular to a deduplication processing system.

[Prior Art]

Deduplication is a data reduction technique, commonly used in disk-based backup systems, whose main purpose is to reduce the storage capacity used in a storage system. It works by finding duplicate, variable-size data blocks at different locations in different files within a given period of time. Duplicate data blocks are replaced with pointers. Because storage systems are always filled with large amounts of redundant data, deduplication has naturally become a focus of attention as a way to solve this problem and save space. Deduplication can reduce stored data to a fraction of its original size and thereby free more backup space. This not only allows backup data to be kept on the storage system for longer, but also saves much of the bandwidth needed for offline storage.

Refer to Fig. 1, a schematic diagram of deduplication access in the prior art. Because all data to be stored is kept on the server, the client must transmit the data to the server in real time. The server then performs deduplication on the data. A server with many clients therefore inevitably faces a heavy load.

[Summary of the Invention]

In view of the above problems, the present invention provides a deduplication processing system that performs deduplication on an input file through a server and a client.
To achieve the above objective, the deduplication processing system disclosed in the present invention comprises a client data management module and a server data management module. A client data management module is provided in each client and receives the input file. It further comprises a data chunking module, a fingerprinting module and a feature value lookup module. The data chunking module performs a data segmentation procedure on the input file and generates at least one data block. The fingerprinting module performs a feature processing procedure on the data block and generates a corresponding feature value. The feature value of each data block is compared with the feature values stored by the client: if the same feature value already exists in the client, the data block corresponding to the compared feature value is deleted; if it does not, the client sends a query request to the server.

The server data management module is connected to the client data management module through a network and further comprises a feature storage module, a file management module and a data storage module. The feature storage module determines, according to the query request, whether the feature value is already recorded on the server. If the feature value does not exist on the server, the server obtains the corresponding data block from the client and stores the new data block and its feature value. The file management module records the storage address on the server of the data blocks of each input file in an index file. The data storage module stores the data blocks and the metadata of the input file.
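As an illustration of how the three server-side modules described above could fit together, here is a minimal in-memory sketch. It is not taken from the patent: the class `DedupServer`, its method names, the use of list positions as storage addresses, and the choice of SHA-256 as the feature function are all assumptions made for the example.

```python
import hashlib


class DedupServer:
    """Hypothetical in-memory stand-in for the server data management module."""

    def __init__(self):
        self.blocks = []      # data storage: address -> block bytes
        self.features = {}    # feature storage: feature value -> [address, reference count]
        self.index_file = {}  # file management: file name -> ordered block addresses

    def has_feature(self, feature):
        return feature in self.features

    def store_block(self, feature, block):
        """Store a block the server has not seen before and return its address."""
        address = len(self.blocks)
        self.blocks.append(block)
        self.features[feature] = [address, 0]
        return address

    def reference(self, feature):
        """Count one more use of an existing block and return its address."""
        entry = self.features[feature]
        entry[1] += 1
        return entry[0]

    def record_file(self, name, addresses):
        self.index_file[name] = list(addresses)

    def restore(self, name):
        """Rebuild a file from the block addresses recorded in the index file."""
        return b"".join(self.blocks[a] for a in self.index_file[name])


def feature_of(block):
    # Assumption: SHA-256 is used here only as one possible feature function.
    return hashlib.sha256(block).hexdigest()


def upload(server, name, blocks):
    """Store each distinct block once and record the file's block addresses."""
    addresses = []
    for block in blocks:
        feature = feature_of(block)
        if not server.has_feature(feature):
            server.store_block(feature, block)
        addresses.append(server.reference(feature))
    server.record_file(name, addresses)
```

Calling `upload(server, "a.txt", [b"aa", b"bb", b"aa"])` stores two blocks rather than three, and `server.restore("a.txt")` rebuilds the original byte sequence from the index entry.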
In the present invention, the storage of all data blocks, the descriptive metadata, and the storage and management of feature values are all implemented on the server, while segmenting the input file and computing feature values are performed by the client. This information is then exchanged between the server and the client over the network. When processing data, the client first sends the computed feature value to the server. If the data already exists, only the data block location reference information needs to be updated and the data block itself is not sent over the network; if it does not exist, the data is then sent to the server. This saves storage space on the server and also reduces the network bandwidth requirement.

The features and implementation of the present invention are described in detail below with reference to the drawings and the preferred embodiments.

[Embodiments]

The present invention applies to computers capable of running a deduplication program, such as personal computers and notebook computers, or to a client-server architecture. The deduplication processing system comprises at least one client and a server. Refer to Fig. 2 and Fig. 3, which are respectively the architecture diagram and the operation flowchart of the present invention. The client 210 may connect to the server through the Internet or an intranet. To further explain the operation of each module, refer to the operation illustrated in Fig. 3. The deduplication processing of the present invention includes the following steps:

Step S310: the client sends a query request to the server;
Step S320: the Bloom filter on the server determines whether the data block of the query request exists on the server;
Step S330: if the queried data block exists on the server, the server stores the feature value of the data block;
Step S331: the server commands the client to transmit the new data block to the server;
Step S340: if the queried data block does not exist on the server, the server determines, according to the query request, whether the feature value is already recorded on the server;
Step S341: if the feature value does not exist on the server, the server obtains the corresponding data block from the client and stores the new data block and its feature value;
Step S342: if the feature value already exists on the server, the server updates the metadata of the corresponding data block; and
Step S343: the server notifies the client that the data block already exists on the server and commands the client to query the feature value lookup module again.

Each client 210 has a client data management module 211, which receives the input file and runs part of the deduplication procedure (described in detail below). The client data management module 211 further comprises a data chunking module 212, a fingerprinting module 213 and a feature value lookup module 214. The server 220 includes a server data management module 221, which is connected to the client data management module 211 through the network. The server data management module 221 further comprises a feature storage module 222, a file management module 223, a data storage module 224 and a Bloom filter 225.

When the client 210 receives a new input file, the data chunking module 212 performs data segmentation on the input file.
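Step S320 and the Bloom filter 225 listed above depend on a probabilistic membership structure. The patent gives no parameters for it, so the following is only a generic sketch of a Bloom filter: the class name, the bit-array size, the number of hash functions and the use of salted SHA-256 are assumptions. A Bloom filter never reports a stored item as absent but can report an absent item as present, which is why a positive answer is normally confirmed against an exact store.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

In a deduplication server, the items added would be the feature values of stored data blocks, so that most lookups for unseen blocks can be answered without consulting the full feature storage.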
The data chunking module 212 may segment the input file into data blocks using fixed-size partitioning or content-defined chunking (CDC).

The fixed-size algorithm segments the input file using a predefined data block size. Its advantages are simplicity and high performance. The content-defined chunking algorithm is a variable-length chunking algorithm that uses fingerprint data (for example, the Rabin fingerprint algorithm, which converts file content into preset hash values) to split the file into chunks of varying length. Unlike the fixed-size algorithm, content-defined chunking segments data blocks based on specific fingerprint data, so the data block size is variable. The advantage of content-defined chunking is that it supports strategies for querying or inserting data blocks, so that newly added data blocks can be quickly inserted at their destination.
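The two chunking approaches described above can be sketched as follows. This is a simplified illustration rather than the patent's method: the content-defined variant uses a plain polynomial rolling hash as a stand-in for a true Rabin fingerprint, and the window size, boundary mask, and minimum and maximum chunk sizes are hypothetical values chosen for the example.

```python
def fixed_size_chunks(data, size):
    """Split data into blocks of a predefined size (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3F, min_size=32, max_size=1024):
    """Cut a block wherever the rolling hash of the last `window` bytes matches `mask`.

    Assumption: a polynomial rolling hash stands in for a Rabin fingerprint.
    """
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            # Remove the contribution of the byte that slides out of the window.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        length = i - start + 1
        # Cut on a content-defined boundary, or force a cut at max_size.
        if (length >= min_size and (h & mask) == mask) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial block
    return chunks
```

Because the boundary test depends only on the bytes inside the sliding window, a local edit tends to shift only nearby block boundaries, whereas with fixed-size partitioning every block after the edit changes.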
After the data chunking module 212 completes the segmentation, it outputs the generated data blocks to the fingerprinting module 213. The fingerprinting module 213 performs a feature processing procedure on each data block and generates the feature value corresponding to that data block. The fingerprinting module 213 may be implemented by, but is not limited to, a one-way hash or similar algorithm.

The feature value lookup module 214 compares the feature value of each data block with the feature values stored by the client 210 to determine whether an identical feature value exists. If the same feature value already exists in the client 210, the corresponding data block is deleted. In that case the feature value lookup module 214 also sends a data block index request to the server 220. The server 220 updates the reference count of the data block and returns a data block result to the client 210. If the same feature value does not exist in the client 210, the client 210 sends a query request to the server 220.

When the server data management module 221 receives the query request from the client data management module 211, the feature storage module 222 determines, according to the query request, whether the feature value is already recorded on the server 220. First, the Bloom filter 225 receives the feature value of the data block from the client 210. The Bloom filter 225 determines whether the received data block is a data block that has been modified and outputs the result to the feature storage module 222. If the feature value does not exist on the server 220, the server obtains the corresponding data block from the client 210 and stores the new data block and its feature value on the server 220.
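The client-side comparison and the exchange with the server described above can be sketched as follows. The names `ServerStub`, `Client` and `feature_value` are hypothetical, SHA-256 is used as an example feature function, and for brevity the client pushes a missing block to the server instead of the server fetching it; none of these details come from the patent.

```python
import hashlib


def feature_value(block):
    # Assumption: SHA-256 serves as the feature function in this sketch.
    return hashlib.sha256(block).hexdigest()


class ServerStub:
    """Hypothetical stand-in for the server 220."""

    def __init__(self):
        self.blocks = {}      # feature value -> block bytes
        self.ref_counts = {}  # feature value -> reference count

    def query(self, feature):
        """Query request: report whether the feature value is already recorded."""
        return feature in self.blocks

    def put_block(self, feature, block):
        """Store a new block together with its feature value."""
        self.blocks[feature] = block
        self.ref_counts[feature] = 1

    def index_request(self, feature):
        """Data block index request: increase the block's reference count."""
        self.ref_counts[feature] += 1


class Client:
    """Hypothetical stand-in for the client 210."""

    def __init__(self, server):
        self.server = server
        self.known_features = set()  # feature values stored by the client
        self.bytes_sent = 0          # block payload bytes sent over the network

    def process(self, blocks):
        for block in blocks:
            feature = feature_value(block)
            if feature in self.known_features:
                # Local duplicate: drop the block, only update the server's count.
                self.server.index_request(feature)
                continue
            if self.server.query(feature):
                # Another client already uploaded this block.
                self.server.index_request(feature)
            else:
                self.server.put_block(feature, block)
                self.bytes_sent += len(block)
            self.known_features.add(feature)
```

With two clients sharing one `ServerStub`, a block uploaded by the first client costs the second client only a feature value exchange, which is the bandwidth saving the description attributes to sending the feature value first.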
If the feature value already exists on the server 220, the feature storage module 222 updates the reference count of the data block and returns a data block result. The file management module 223 records the storage address on the server 220 of the data blocks of each input file in the index file, so that the location index information of all data blocks of a target file is managed in the index information and the target file can be restored. The data storage module 224 stores the data blocks and the metadata of the input file.

The present invention implements the storage of all data blocks, the descriptive metadata, and the storage and management of feature values on the server 220, while segmenting the input file and computing feature values are performed by the client 210. This information is then exchanged between the server 220 and the client 210 over the network. When processing data, the client 210 first sends the computed feature value to the server 220. If the data already exists, only the data block location reference information needs to be updated and the data block itself is not sent over the network; if it does not exist, the data is then sent to the server 220.

Although the present invention has been disclosed above by way of the preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of patent protection of the invention is therefore defined by the claims appended to this specification.

[Brief Description of the Drawings]

Fig. 1 is a schematic diagram of deduplication access in the prior art.
Fig. 2 is an architecture diagram of the present invention.
Fig. 3 is an operation flowchart of the present invention.
[Description of Main Reference Numerals]

210 Client
211 Client data management module
212 Data chunking module
213 Fingerprinting module
214 Feature value lookup module
220 Server
221 Server data management module
222 Feature storage module
223 File management module
224 Data storage module
225 Bloom filter

Claims

1. A deduplication processing system that performs deduplication on an input file through a server and a client, the deduplication processing system comprising:
a client data management module, provided in each client, the client data management module receiving the input file and further comprising:
a data chunking module, configured to perform a data segmentation procedure on the input file and generate at least one data block;
a fingerprinting module, which performs a feature processing procedure on the data block and generates a corresponding feature value; and
a feature value lookup module, which compares the feature value of each data block with the feature values stored by the client, wherein if the same feature value already exists in the client, the data block corresponding to the compared feature value is deleted, and if the same feature value does not exist in the client, the client sends a query request to the server; and
a server data management module, connected to the client data management module through a network, the server data management module further comprising:
a feature storage module, which determines according to the query request whether the feature value is already recorded on the server and, if the feature value does not exist on the server, obtains the corresponding data block from the client and stores the new data block and the feature value on the server;
a file management module, for recording the storage address on the server of the data blocks of each input file in an index file; and
a data storage module, for storing the data blocks and the metadata of the input file.

2. The deduplication processing system as described in claim 1, wherein the data segmentation procedure comprises fixed-size partitioning, content-defined chunking or sliding-block chunking.

3. The deduplication processing system as described in claim 1, wherein the feature processing procedure includes MD5, SHA1, SHA256 or CRC32.

4. The deduplication processing system as described in claim [number illegible], wherein if the same feature value already exists in the client, the feature value lookup module also sends a data block index request to the server, and the server updates a reference count of the data block and returns a data block result, the data block result including a plurality of consecutive feature values following the data block.

5. The deduplication processing system as described in claim 1, wherein the feature values of the client are stored in a memory or a cache.

6. The deduplication processing system as described in claim 1, wherein if the feature value already exists on the server, the feature storage module updates the reference count of the data block and returns a data block result, the data block result including a plurality of consecutive feature values following the data block.

7. The deduplication processing system as described in claim 1, further comprising a Bloom filter that receives the feature value from the client, wherein the server determines, by means of the Bloom filter, whether the received data block is a data block that has been modified, and outputs the determination result to the feature storage module.
