TW201308115A

TW201308115A - A distributed de-duplication system and the method thereof

Info

Publication number: TW201308115A
Application number: TW100128574A
Authority: TW
Inventors: Ming-Sheng Zhu; Hui Wang; Chih-Feng Chen
Original assignee: Inventec Corp
Priority date: 2011-08-10
Filing date: 2011-08-10
Publication date: 2013-02-16
Also published as: TWI420333B

Abstract

A distributed de-duplication system and a method thereof. A client performs a de-duplication process on an input file and generates a plurality of data blocks and the corresponding fingerprint values. The client sends a query request containing a fingerprint value to a dispatch server. The dispatch server records the storage locations of the data blocks. The dispatch server forwards the query request to a de-duplication device according to the fingerprint value. If the de-duplication device does not already hold the fingerprint value, the de-duplication device stores the new data block in a storage server according to the new fingerprint value.

Description

Distributed deduplication system and processing method thereof

The invention relates to a deduplication system and a method thereof, and in particular to a distributed deduplication system and a processing method thereof.

With the rise of the Internet, many network providers offer large amounts of online storage space so that users' files can be preserved effectively. In the past, online storage services were provided by a single server. However, a single server has limited computing power, so the approach evolved into multiple servers that provide storage services through parallel processing. This kind of storage is called a distributed storage system.

Please refer to "Figure 1", which is a schematic diagram of data storage in the prior art. In general, a distributed storage system stores the same data on different servers 121 so that users' file data is fully backed up. For example, suppose a distributed storage system has three storage servers 121. When a client 111 wants to store 100 Mbytes of data in the network space, the distributed storage system stores those 100 Mbytes on each of the three storage servers 121. As a result, the storage servers 121 together consume 300 Mbytes of space. If every file of every client 111 must be backed up on every storage server 121, this is a heavy burden for the network provider.

In view of the above problems, the present invention provides a distributed deduplication system for storing at least one segmented data block generated by a client.

The distributed deduplication system disclosed in the present invention comprises a client, a dispatch server, a duplicate data processing device, and a storage server. The client performs a deduplication (de-duplication) procedure on an input file and generates segmented data blocks and the corresponding fingerprint feature values (fingerprints).

The dispatch server (Dispatch Server) records the storage locations of the segmented data blocks of the input file, and forwards the query request to the corresponding duplicate data processing device according to the fingerprint feature value. The duplicate data processing device (De-dup Engine) searches the fingerprint feature lookup table to check whether the fingerprint feature value already exists. If the fingerprint feature value is not stored in the fingerprint feature lookup table, the duplicate data processing device assigns the corresponding segmented data block to a storage server according to the fingerprint feature value, and sends the client a storage node message that identifies the assigned storage server.

The fingerprint feature value is generated by SHA-1, a hash procedure, or a one-way algorithm, so that each segmented data block corresponds to only one fingerprint feature value. After the storage server stores a new segmented data block, the duplicate data processing device runs a synchronization process on the fingerprint feature lookup table to update the fingerprint feature lookup tables of the other duplicate data processing devices.

The invention further provides a distributed processing method for deduplication, comprising the following steps: after receiving an input file, the client generates segmented data blocks and sends a query request carrying the fingerprint feature value to the dispatch server; the dispatch server forwards the query request to the corresponding duplicate data processing device according to the fingerprint feature value; the duplicate data processing device determines whether the fingerprint feature value already exists in the fingerprint feature lookup table; if the fingerprint feature value is not stored in the fingerprint feature lookup table, the duplicate data processing device assigns the corresponding segmented data block to a storage server according to the fingerprint feature value and sends the client a storage node message that identifies the assigned storage server; and the client transmits the segmented data block to the storage server according to the storage node message.

The distributed deduplication system and method proposed by the present invention use hierarchical assignment and duplicate-data comparison to effectively reduce the amount of data held on each data storage server, thereby increasing the storage space available for the overall data volume.

The features and implementation of the present invention are described in detail below through preferred embodiments, with reference to the drawings.

Please refer to "Figure 2", which is a schematic diagram of the architecture of the present invention. The distributed deduplication system of the present invention can be applied in a local area network or on the Internet, and comprises a client 211, a dispatch server 212 (Dispatch Server), a duplicate data processing device 213 (De-dup Engine), and a storage server 214. The client 211 receives an input file and performs segmentation on it so that duplicate data can be identified.

Deduplication is a data reduction technique, commonly used in disk-based backup systems, whose main purpose is to reduce the storage capacity consumed in a storage system. It works by searching, within a given time period, for duplicate variable-sized data blocks at different positions in different files (this document calls them segmented data blocks). Duplicate data blocks are replaced with a token. Deduplication frees up more backup space, which not only allows backup data on the storage server 214 to be retained longer but also saves much of the bandwidth needed for offline storage.
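The token replacement described above can be illustrated with a minimal sketch. It makes several assumptions that are not in the source: blocks are cut at a fixed size rather than a variable size, the token is the SHA-1 hex digest of the block, and the names `dedup_store` and `restore` are hypothetical.

```python
import hashlib


def dedup_store(files, block_size=4):
    """Store each distinct block once and describe every file as a token list.

    Assumption: fixed-size blocks and SHA-1 hex digests as tokens.
    """
    blocks = {}   # token -> block bytes, each distinct block kept once
    recipes = {}  # file name -> ordered list of tokens
    for name, data in files.items():
        tokens = []
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            token = hashlib.sha1(block).hexdigest()
            blocks.setdefault(token, block)  # a duplicate block adds nothing
            tokens.append(token)
        recipes[name] = tokens
    return blocks, recipes


def restore(blocks, tokens):
    """Rebuild a file by resolving each token back to its block."""
    return b"".join(blocks[token] for token in tokens)


files = {"a.txt": b"AAAABBBBAAAA", "b.txt": b"BBBBCCCC"}
blocks, recipes = dedup_store(files)
print(len(blocks))  # 3 distinct blocks are stored for 5 blocks of input
```

Here the five input blocks collapse to three stored blocks, and each file can still be rebuilt from its token list.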

During deduplication, the client 211 segments the input file, which produces multiple segmented data blocks. The client 211 then hashes the data blocks and produces a hash value for each one. The client 211 compares each resulting hash value with the hash values stored in the storage server 214 and determines whether an identical hash value exists. If an identical hash value exists, that data block has already been stored in the storage server 214.

After the client 211 of the present invention completes data segmentation, it holds multiple segmented data blocks for the input file together with their fingerprint feature values (fingerprints). The fingerprint feature value is generated by an SHA-1 procedure, a hash procedure, or a one-way function, so that each segmented data block corresponds to only one fingerprint feature value. The client 211 sends a query request carrying the fingerprint feature value to the dispatch server 212.
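As an illustration of the client-side work described above, the following sketch segments an input and computes one SHA-1 fingerprint per block using Python's standard `hashlib`. The fixed block size, the dictionary layout of the query request, and the function names are assumptions made for the example; the source does not specify them.

```python
import hashlib


def segment(data, block_size):
    """Cut the input into consecutive blocks (assumption: fixed block size)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """SHA-1 digest of a block as a 40-character hex string."""
    return hashlib.sha1(block).hexdigest()


def build_query_requests(data, block_size=4096):
    """One query request per block; the request layout is hypothetical."""
    return [
        {"fingerprint": fingerprint(block), "length": len(block)}
        for block in segment(data, block_size)
    ]


requests = build_query_requests(b"x" * 10000)
for request in requests:
    print(request["fingerprint"], request["length"])
```

Identical blocks always yield the same fingerprint, which is what lets the rest of the system detect duplicates without transferring the block itself.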

In addition to forwarding the query request to the corresponding duplicate data processing device according to the fingerprint feature value, the dispatch server 212 also records the storage locations of the segmented data blocks of the input file. The number of duplicate data processing devices is determined by the number of clients 211. Each duplicate data processing device 213 further includes a fingerprint feature lookup table, which records the fingerprint feature value corresponding to each segmented data block. After receiving a fingerprint feature value, the duplicate data processing device 213 determines whether that fingerprint feature value already exists. When the queried fingerprint feature value is not in the fingerprint feature lookup table, the duplicate data processing device selects any storage server 214 to store the corresponding segmented data block.

To explain the operation of the invention clearly, please also refer to "Figure 3", which is a schematic diagram of the operational flow of the present invention. The present invention includes the following steps:

Step S310: After receiving the input file, the client generates segmented data blocks and sends a query request carrying the fingerprint feature value to the dispatch server;

Step S320: The dispatch server forwards the query request to the corresponding duplicate data processing device according to the fingerprint feature value;

Step S330: The duplicate data processing device determines whether the fingerprint feature value already exists in the fingerprint feature lookup table;

Step S340: If the fingerprint feature value is already stored in the fingerprint feature lookup table, the duplicate data processing device replies to the client, through the dispatch server, that the segmented data block already exists;

Step S350: If the fingerprint feature value is not stored in the fingerprint feature lookup table, the duplicate data processing device assigns the corresponding segmented data block to a storage server according to the fingerprint feature value and sends the client a storage node message that identifies the assigned storage server; and

Step S360: The client transmits the segmented data block to the storage server according to the storage node message.
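Steps S310 to S360 can be traced in a small in-process sketch. It is only an illustration under stated assumptions: plain Python objects stand in for networked components, the class and method names are hypothetical, both the device and the storage server are chosen by taking the fingerprint modulo a count, and the lookup table is updated at assignment time rather than after the block is stored.

```python
import hashlib


class StorageServer:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # fingerprint -> block bytes

    def store(self, fp, block):
        self.blocks[fp] = block


class DedupEngine:
    """Stands in for the duplicate data processing device."""

    def __init__(self, storage_servers):
        self.lookup_table = {}  # fingerprint -> storage server name
        self.storage_servers = storage_servers

    def handle_query(self, fp):
        # S330: check the fingerprint feature lookup table.
        if fp in self.lookup_table:
            return {"exists": True}  # S340: block already stored
        # S350: assign a storage server (assumption: modulo selection).
        server = self.storage_servers[int(fp, 16) % len(self.storage_servers)]
        self.lookup_table[fp] = server.name
        return {"exists": False, "storage_node": server}


class DispatchServer:
    def __init__(self, engines):
        self.engines = engines

    def forward(self, fp):
        # S320: forward by fingerprint and relay the reply to the client.
        engine = self.engines[int(fp, 16) % len(self.engines)]
        return engine.handle_query(fp)


def client_upload(data, dispatch, block_size=4):
    """Return how many blocks actually had to be transmitted."""
    sent = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        fp = hashlib.sha1(block).hexdigest()  # S310: fingerprint and query
        reply = dispatch.forward(fp)
        if not reply["exists"]:
            reply["storage_node"].store(fp, block)  # S360: send the block
            sent += 1
    return sent


servers = [StorageServer("s0"), StorageServer("s1")]
dispatch = DispatchServer([DedupEngine(servers) for _ in range(3)])
print(client_upload(b"AAAABBBBAAAA", dispatch))  # 2: the repeated block is skipped
print(client_upload(b"BBBBCCCC", dispatch))      # 1: only CCCC is new
```

Because the same fingerprint is always forwarded to the same device in this sketch, a block uploaded by one file is recognized when a later file contains it.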

The client 211 receives the input file and performs segmentation to generate segmented data blocks. The client 211 sends a query request carrying the fingerprint feature value to the dispatch server 212. The dispatch server 212 forwards the query request to the corresponding duplicate data processing device 213 according to the fingerprint feature value. The duplicate data processing device 213 may perform a remainder (modulo) operation on the fingerprint feature value and forward the query request to the dispatch server 212 according to the result of the remainder operation.

For example, the client 211 divides the input file into 1024 segmented data blocks and uses SHA-1 to generate the corresponding fingerprint feature values (also 1024 of them). Assuming further that there are three dispatch servers 212, the remainder of each of these 1024 fingerprint feature values is taken (that is, the remainder modulo 3). In actual operation, the modulus can be chosen according to the number of dispatch servers 212. The query request is then forwarded to the corresponding duplicate data processing device 213 according to the remainder. For example, query requests for fingerprint feature values with remainder "0" are forwarded to the first duplicate data processing device 213, those with remainder "1" to the second duplicate data processing device 213, and those with remainder "2" to the third duplicate data processing device 213.
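The remainder-based routing in this example can be sketched as follows. Treating the SHA-1 hex digest as one large integer before taking the remainder is an assumption, since the source does not say how the fingerprint is turned into a number; the `route` function and the synthetic block contents are likewise hypothetical.

```python
import hashlib
from collections import Counter


def route(fingerprint_hex, device_count):
    """Index of the target device: the fingerprint, read as an integer, modulo the count."""
    return int(fingerprint_hex, 16) % device_count


# 1024 synthetic blocks stand in for the segmented input file.
blocks = [f"block-{i}".encode() for i in range(1024)]
fingerprints = [hashlib.sha1(block).hexdigest() for block in blocks]

# Count how many query requests each of the three devices would receive.
load = Counter(route(fp, 3) for fp in fingerprints)
for remainder in sorted(load):
    print(f"remainder {remainder}: {load[remainder]} query requests")
```

The same fingerprint always produces the same remainder, so queries about a given block consistently reach the same device.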

Next, after receiving the query request, the duplicate data processing device 213 checks whether the fingerprint feature value exists in the fingerprint feature lookup table. If the fingerprint feature value is already stored there, the duplicate data processing device 213 replies to the client 211, through the dispatch server 212, that the segmented data block already exists. Otherwise, the duplicate data processing device 213 assigns the corresponding segmented data block to a storage server 214 according to the fingerprint feature value and sends the client 211 a storage node message that identifies the assigned storage server 214. The client 211 can be notified in either of two ways: after forwarding the query request to the corresponding duplicate data processing device 213, the dispatch server 212 sends the storage node message to the client 211; or, after the dispatch server 212 forwards the query request to the corresponding duplicate data processing device 213, the storage node message is sent to the client 211 through the duplicate data processing device 213.

In addition, the duplicate data processing device 213 records metadata for each segmented data block. The metadata tracks the storage server on which the segmented data block is stored, together with its storage location and length on that storage server. When the client 211 needs to read a segmented data block, the duplicate data processing device 213 uses the metadata to locate and read the corresponding segmented data block, and the fingerprint feature value can also be used to verify that the segmented data block is correct.
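One possible shape for this metadata is sketched below. The record fields follow the three items named above (storage server, location, length); everything else is an assumption for illustration, including the append-only byte buffer that stands in for a storage server and the names `BlockMetadata`, `write_block`, and `read_block`.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMetadata:
    storage_server: str  # which storage server holds the block
    offset: int          # storage location on that server
    length: int          # block length in bytes


class StorageServer:
    """Assumption: a storage server modeled as one append-only byte buffer."""

    def __init__(self):
        self.data = bytearray()

    def append(self, block):
        offset = len(self.data)
        self.data += block
        return offset

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])


def write_block(servers, metadata, server_name, block):
    """Store a block and record its metadata under its fingerprint."""
    fp = hashlib.sha1(block).hexdigest()
    offset = servers[server_name].append(block)
    metadata[fp] = BlockMetadata(server_name, offset, len(block))
    return fp


def read_block(servers, metadata, fp):
    """Locate a block through its metadata and verify it against the fingerprint."""
    meta = metadata[fp]
    block = servers[meta.storage_server].read(meta.offset, meta.length)
    if hashlib.sha1(block).hexdigest() != fp:
        raise ValueError("fingerprint mismatch: block is corrupted")
    return block


servers = {"s0": StorageServer()}
metadata = {}
fp = write_block(servers, metadata, "s0", b"hello")
print(read_block(servers, metadata, fp))  # b'hello'
```

Recomputing the fingerprint on read is what allows the correctness check mentioned above: a block whose bytes have changed no longer matches its recorded fingerprint.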

Finally, when the client 211 receives the storage node message specifying the storage location, it transmits the segmented data block to the storage server 214 according to that message. At the same time, the duplicate data processing device 213 synchronizes its fingerprint feature lookup table (hash table), updating the fingerprint feature values, and the storage locations of the corresponding segmented data blocks, recorded in the fingerprint feature lookup tables of the other duplicate data processing devices 213. When another duplicate data processing device 213 later receives a query request for a segmented data block that has already been stored, it can immediately determine whether that segmented data block already exists.
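The synchronization step can be pictured with a simple push model, sketched below. The source does not describe the synchronization protocol, so direct method calls between peers, the `connect` helper, and the tuple used as a storage location are all assumptions made for the example.

```python
class DedupEngine:
    """Stands in for a duplicate data processing device with a lookup table."""

    def __init__(self, name):
        self.name = name
        self.lookup_table = {}  # fingerprint -> storage location
        self.peers = []         # the other devices to keep in sync

    def record_new_block(self, fp, location):
        """Record a newly stored block locally, then push it to every peer."""
        self.lookup_table[fp] = location
        for peer in self.peers:
            peer.apply_update(fp, location)

    def apply_update(self, fp, location):
        """Accept an entry pushed by a peer, keeping any entry already present."""
        self.lookup_table.setdefault(fp, location)

    def has_block(self, fp):
        return fp in self.lookup_table


def connect(engines):
    """Make every device a peer of every other device."""
    for engine in engines:
        engine.peers = [peer for peer in engines if peer is not engine]


engines = [DedupEngine(f"engine-{i}") for i in range(3)]
connect(engines)
engines[0].record_new_block("abc123", ("storage-1", 0))
print(all(engine.has_block("abc123") for engine in engines))  # True
```

After the push, any device can answer a query for that fingerprint from its own table, which matches the behavior described above for requests arriving at other devices.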

Although the present invention is disclosed above by way of the foregoing preferred embodiments, they are not intended to limit the invention. Anyone skilled in the relevant art may make minor changes and refinements without departing from the spirit and scope of the invention. The scope of patent protection of the invention is therefore defined by the claims appended to this specification.

111 . . . Client

121 . . . Server

211 . . . Client

212 . . . Dispatch server

213 . . . Duplicate data processing device

214 . . . Storage server

Figure 1 is a schematic diagram of data storage in the prior art.

Figure 2 is a schematic diagram of the architecture of the present invention.

Figure 3 is a schematic diagram of the operational flow of the present invention.

Claims

A distributed deduplication system for storing at least one segmented data block generated by a client, the deduplication system comprising: at least one storage server for storing the segmented data blocks; a client that runs a deduplication procedure on an input file and generates the segmented data blocks and a corresponding fingerprint feature value (fingerprint), the client sending a query request having the fingerprint feature value and sending the segmented data block to the storage server according to a storage node message; a duplicate data processing device (De-dup Engine) for determining whether the fingerprint feature value already exists and assigning the new segmented data block to the storage server according to the new fingerprint feature value; and a dispatch server (Dispatch Server) that records the storage locations of the segmented data blocks of the input file and forwards the query request to the corresponding duplicate data processing device according to the fingerprint feature value.

The distributed deduplication system of claim 1, wherein the duplicate data processing device performs a remainder operation on the fingerprint feature value and forwards the query request to the dispatch server according to the result of the remainder operation.

The distributed deduplication system of claim 1, wherein the dispatch server forwards the query request to the corresponding duplicate data processing device and sends the storage node message to the client.

The distributed deduplication system of claim 1, wherein the dispatch server forwards the query request to the corresponding duplicate data processing device, and the storage node message is sent to the client through the duplicate data processing device.

The distributed deduplication system of claim 1, wherein the duplicate data processing device further records the metadata information of the segmented data block.

The distributed deduplication system of claim 1, wherein after the storage server stores the segmented data blocks, the duplicate data processing devices run a synchronization process on a fingerprint feature lookup table to update the fingerprint feature lookup tables of the other duplicate data processing devices.

A distributed processing method for deduplication, used to store at least one segmented data block generated by a client, the processing method comprising: the client receiving an input file, generating the segmented data blocks, and sending a query request having a fingerprint feature value to a dispatch server; the dispatch server forwarding the query request to a corresponding duplicate data processing device according to the fingerprint feature value; the duplicate data processing device determining whether the fingerprint feature value already exists in a fingerprint feature lookup table; if the fingerprint feature value is not stored in the fingerprint feature lookup table, the duplicate data processing device assigning the corresponding segmented data block to a storage server according to the fingerprint feature value and sending the client a storage node message that identifies the assigned storage server; and the client transmitting the segmented data block to the storage server according to the storage node message.

The distributed processing method for deduplication of claim 7, wherein the duplicate data processing device performs a remainder operation on the fingerprint feature value and forwards the query request to the dispatch server according to the result of the remainder operation.

The distributed processing method for deduplication of claim 7, wherein the duplicate data processing device further records the metadata information of the segmented data block.