TW201224805A

TW201224805A - A method of building the index of the data blocks

Info

Publication number: TW201224805A
Application number: TW99144092A
Authority: TW
Inventors: Yun-Song Wang; Ming-Sheng Zhu; Chih-Feng Chen
Original assignee: Inventec Corp
Priority date: 2010-12-15
Filing date: 2010-12-15
Publication date: 2012-06-16

Abstract

A method of building an index of data blocks for a data deduplication process. The method comprises the following steps. An index file is loaded. The index file includes a plurality of location blocks, each location block includes a plurality of storage items, and each storage item stores the main hash value of a corresponding data block. A first hash process is performed on each main hash value to output a block number. A second hash process is performed on the main hash value to output an item number. A location list is loaded to check whether the item number is already in the location list. If the item number is not in the location list, the main hash value is written into the location list.

Description

Description of the invention:

[Technical Field]

The invention relates to a method of building an index of data blocks, and in particular to a method that, within a data deduplication process, builds an index for the data blocks produced by segmentation.

[Prior Art]

Data deduplication is a data reduction technique, typically used in disk-based backup systems, whose main purpose is to reduce the storage capacity consumed by a storage system. It works by searching, over a given period of time, for duplicate variable-size data blocks at different locations in different files, and replacing each duplicate block with a pointer. Storage systems are always full of redundant data, so deduplication has naturally become a focus of attention as a way to save space. Deduplication can shrink the stored data to a fraction of its original size, freeing up more backup space. Backup data can then be retained on the storage system for longer, and the considerable bandwidth needed for offline storage is reduced.

Figure 1 is a schematic diagram of deduplicated access in the prior art. To keep track of the stored file data, the server records the data blocks of each input file in a hash list, which holds the hash value of each data block. Because a hash algorithm is a one-way transform, each data block has exactly one hash value. The deduplication process relies on this property and treats data blocks with the same hash value as identical. The storage device therefore needs to keep only one copy of each data block, together with a record of which files share it.
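
The prior-art hash list described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `deduplicate`, the use of SHA-256, and the in-memory dictionary standing in for the storage device are all hypothetical and do not come from the patent.

```python
import hashlib


def deduplicate(chunks):
    """Store each distinct data block once and describe the input by its hashes.

    `chunks` is an iterable of bytes objects (data blocks that have already
    been segmented). Returns (store, recipe): `store` maps hash -> block and
    `recipe` lists the hash of every input block in order, so the original
    sequence can be rebuilt from the single stored copies.
    """
    store = {}
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        # Blocks with the same hash are treated as identical, so only the
        # first copy is kept (hypothetical in-memory stand-in for storage).
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


store, recipe = deduplicate([b"alpha", b"beta", b"alpha"])
print(len(store), len(recipe))  # 2 stored blocks describe 3 input blocks
```

The flat dictionary plays the role of the hash list; the lookup cost the description goes on to discuss arises when such a list no longer fits comfortably in memory.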

As the amount of data grows day by day, however, the hash list grows with it, and searching the hash list takes longer and longer.

[Summary of the Invention]

In view of the above problem, the invention provides a method of building an index of data blocks. The method is applied in a data deduplication process and builds a corresponding index file for the data blocks produced by the segmentation step of that process.

To achieve this, the disclosed method includes the following steps: loading an index file, where the index file includes a plurality of location blocks, each location block includes a plurality of storage fields, and each storage field records the main hash value of a corresponding data block; performing a first hash process on the main hash value of a data block to compute a block number; performing a second hash process on the main hash value of the same data block to compute a field number; loading a location conflict list; comparing the field number with the field numbers in the location conflict list to determine whether the same field number is already stored there; and, if the field number is not present in the location conflict list, writing the main hash value into the position given by the block number and the field number.

The hierarchical index file proposed by the invention records where each data block is located, which improves the efficiency with which the deduplication process accesses the index file in memory (or on disk).

The features and implementation of the invention are described in detail below, with reference to the drawings, as a preferred embodiment.

[Embodiment]

Figure 2 is a schematic diagram of the architecture of the invention. The invention includes a client 210 and a server 220.

The client 210 can connect to the server 220 over the Internet or an enterprise intranet, or the client 210 and the server 220 can run on the same computer. The client 210 performs the deduplication process on the input files and, through the server 220, generates the index file 221 for the data blocks of each input file according to the invention.

The server 220 stores the index file 221 and the location conflict list 222. The index file 221 records the hash values of multiple data blocks. To improve the lookup efficiency of the index file 221 and to reduce its access time between memory and cache, a method of building the index file 221 is proposed. Figures 3A and 3B are, respectively, a flow diagram of building the index file and a schematic diagram of the index file structure.

Step S310: load the index file; the index file includes a plurality of location blocks, each location block includes a plurality of storage fields, and each storage field records the main hash value of a corresponding data block;
Step S320: perform a first hash process on the main hash value of a data block to compute a block number;
Step S330: perform a second hash process on the main hash value of the same data block to compute a field number;
Step S340: build a location conflict list, used to record identical field numbers;
Step S350: compare the field number with the field numbers in the location conflict list to determine whether the same field number is already stored there; and
Step S360: if the field number is not present in the location conflict list, write the main hash value into the position given by the block number and the field number.

As shown in Figure 3B, the index file 221 includes a plurality of location blocks, each location block includes a plurality of storage fields, and each storage field records the main hash value of a corresponding data block. All storage fields in the index file 221 have a fixed length. The number of storage fields per location block is given by Equation 1:

N = capacity of a location block / capacity of a storage field    (Equation 1)

where N is the number of storage fields. The number of location blocks is given by Equation 2, of which only the numerator survives in the source text; the divisor shown here, N, is inferred from context:

M = number of data blocks / N    (Equation 2)

where M is the number of location blocks.

The index file 221 is divided into fixed-size location blocks (M location blocks in the description that follows). The main hash value of a data block (which can be obtained with the SHA-1 or SHA-256 algorithm) is passed through the first hash process so that the block numbers are spread across the range of the M location blocks. To achieve this, a modulo calculation (mod) can be applied to the main hash value, so that the remainder is guaranteed to fall within the range of the M block numbers (as shown in Figure 3B, where it selects the corresponding location block). The value produced by the first hash process is used only to decide where the main hash value is stored, so the result (the block number) occupies no memory or disk space of its own.

The second hash process is then applied to the main hash value, and the resulting second hash value serves as the field number of the corresponding data block. The field number identifies a particular field within the location block. Likewise, to spread the field numbers across the range of the N storage fields (as shown in Figure 3B, where it selects the corresponding storage field), a modulo calculation (mod) can be applied to the main hash value.
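
The sizing rule and the two modulo reductions described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the 4096-byte block capacity, the names `fields_per_block` and `locate`, and the choice to read the SHA-256 digest as a big-endian integer are all assumptions made for the example.

```python
import hashlib

BLOCK_CAPACITY = 4096  # bytes per location block (assumed value)
FIELD_CAPACITY = 32    # bytes per storage field: one SHA-256 digest (assumed)


def fields_per_block(block_capacity=BLOCK_CAPACITY, field_capacity=FIELD_CAPACITY):
    """Equation 1: N = capacity of a location block / capacity of a storage field."""
    return block_capacity // field_capacity


def locate(main_hash, num_blocks, num_fields):
    """Return (block_number, field_number) for a main hash value given as bytes.

    First hash process:  main hash mod M selects the location block.
    Second hash process: main hash mod N selects the storage field.
    """
    value = int.from_bytes(main_hash, "big")
    block_number = value % num_blocks
    field_number = value % num_fields
    return block_number, field_number


N = fields_per_block()
main_hash = hashlib.sha256(b"some data block").digest()
block_number, field_number = locate(main_hash, num_blocks=10, num_fields=N)
print(N, 0 <= block_number < 10, 0 <= field_number < N)
```

One property worth noting when both reductions are taken from the same integer: if M and N share a common factor, the block number and field number are correlated (with M equal to N they are always identical), so an implementation might prefer to choose M and N without common factors. The description does not address this point.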

Once the main hash value has been reduced modulo N, the remainder can only fall within the range of the N storage fields. This completes the building of the index file 221.

Figure 4 is a flow diagram of querying the index file 221 according to the invention. Querying the index file 221 includes the following steps:

Step S410: the client receives a block query request, used to query whether the corresponding data block exists in the index file;
Step S420: if the data block named in the block query request does not exist in the index file, create a temporary index file in memory and record in it the number of times the data block has been queried; and
Step S430: when the number of times the data block has been queried meets a threshold, establish the block number and field number of the data block in the index file.

First, when the client 210 sends the server 220 a query for an input file, the server 220 compares the input file against the index file 221 to determine whether the same data block already exists on the server 220.

If the second hash value being queried already appears in the index file 221 (that is, the main hash value after the second hash process), the field numbers of the main hash value are kept in the location conflict list 222. The field number is recorded in the location conflict list 222, and an address pointer is used to record the data block corresponding to the field number. In other words, the records form a linked list: each record has a field that holds the record number of the next record whose main hash value yields the same second hash value. If no conflicting record follows a given record, this field can be set to an invalid value.

When a second hash value collides with that of an earlier main hash value, the main hash value is hashed once more and placed in the location conflict list 222.
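
Steps S410 to S430 above amount to counting queries for unknown blocks and promoting a block into the index once the count reaches a threshold. The sketch below illustrates that idea only. The class name `QueryCounter`, the use of a set and a dictionary in place of the index file 221 and the temporary index file, and the reading of "meets the threshold" as "greater than or equal to" are assumptions, not details given in the patent.

```python
class QueryCounter:
    """Promote a data block into the index once it has been queried often enough."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.index = set()   # stands in for the index file 221 (assumption)
        self.pending = {}    # stands in for the temporary index file in memory

    def query(self, main_hash):
        """Return True if the block was already indexed when queried (S410)."""
        if main_hash in self.index:
            return True
        # S420: not indexed yet, so count this query in the temporary index.
        count = self.pending.get(main_hash, 0) + 1
        if count >= self.threshold:
            # S430: the threshold is met, so move the entry into the index.
            self.pending.pop(main_hash, None)
            self.index.add(main_hash)
        else:
            self.pending[main_hash] = count
        return False


counter = QueryCounter(threshold=2)
print(counter.query(b"h1"), counter.query(b"h1"), counter.query(b"h1"))
```

With a threshold of 2, the first two queries report the block as not yet indexed and the second one promotes it, so the third query finds it in the index.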

In the invention, the linked list can be handled as follows. Assume the main hash value is reduced modulo N (mod); the location conflict list 222 then has N entries and is laid out as in Table 1.

Record number    Main hash value         Next record number
1                Main hash value 1       N+1
2                Main hash value 2       Invalid value 0
3                Main hash value 3       Invalid value 0
...              ...                     ...
N                Main hash value N       Invalid value 0
N+1              Main hash value N+1     N+3
N+2              Main hash value N+2     Invalid value 0
N+3              Main hash value N+3     Invalid value 0

Table 1. Location conflict list

First, "main hash value 1" is reduced modulo N and its second hash value is stored in the first record of the location conflict list 222. The second hash value of "main hash value N+1", obtained by reducing it modulo N, also maps to the first record, so a collision occurs. At that point the first record already has content ("main hash value 1"), and the two main hash values differ ("main hash value 1" and "main hash value N+1"). The entry for "main hash value N+1" is therefore appended to the tail of the location conflict list 222, and its record number, N+1, is entered in the first record to link the two.

Similarly, suppose "main hash value N+3", reduced modulo N, also collides with "main hash value 1". Following the recorded conflict record number N+1 leads to "main hash value N+1"; the main hash values again differ, so the new entry is also appended to the location conflict list 222, and its record number, N+3, is stored in record N+1 to link them. Record N+2 is added in the same way.
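
The chained layout of Table 1 can be reproduced with a small sketch. Everything concrete here is an assumption chosen for illustration: the class name `ConflictList`, integer main hash values, the mapping of remainder r to head record r + 1, and N = 4. Only the overall scheme (N head records, colliding entries appended at the tail and linked by record number, 0 as the invalid value) follows the description.

```python
INVALID = 0  # "no conflicting record follows", as in Table 1


class ConflictList:
    """Location conflict list: N head records plus a chained overflow tail.

    Records are numbered from 1, as in Table 1. Each record is a two-item
    list [main_hash, next_record_number].
    """

    def __init__(self, n):
        self.n = n
        # Index 0 is unused so that list positions equal record numbers.
        self.records = [None] * (n + 1)

    def insert(self, main_hash):
        """Insert an integer main hash and return the record number holding it."""
        number = main_hash % self.n + 1  # head record for this remainder (assumed mapping)
        if self.records[number] is None:
            self.records[number] = [main_hash, INVALID]
            return number
        # Walk the chain, comparing main hash values at each record.
        while True:
            stored, next_number = self.records[number]
            if stored == main_hash:
                return number  # already present, nothing to add
            if next_number == INVALID:
                break
            number = next_number
        # Every record on the chain holds a different hash: append at the
        # tail and link the new record from the last one on the chain.
        self.records.append([main_hash, INVALID])
        new_number = len(self.records) - 1
        self.records[number][1] = new_number
        return new_number


table = ConflictList(4)
for h in (5, 9, 6, 10, 13):
    table.insert(h)
print(table.records[2], table.records[5], table.records[7])
```

In this run, 5, 9 and 13 share a remainder, so the head record links to record N+1 = 5, which in turn links to record N+3 = 7, while record N+2 = 6 belongs to a different chain. This mirrors the N+1 to N+3 link shown in Table 1.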

In the location conflict list 222, a next record number equal to the invalid value 0 indicates that no conflicting record follows that record.

When a queried main hash value turns out to be new data, it is not written to disk immediately; the main hash value is first kept in the cache. The server 220 keeps a count and writes to disk only when the amount of new data exceeds a threshold or the cache grows beyond a certain size. This avoids frequent disk writes.

The hierarchical index file 221 proposed by the invention records where each data block is located, which improves the efficiency with which the deduplication process looks up the index file 221 in memory (or on disk).

Although the invention has been disclosed above by way of the preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of patent protection is defined by the claims appended to this specification.

[Brief Description of the Drawings]

Figure 1 is a schematic diagram of deduplicated access in the prior art.
Figure 2 is a schematic diagram of the architecture of the invention.
Figure 3A is a flow diagram of building the index file according to the invention.
Figure 3B is a schematic diagram of the index file structure of the invention.
Figure 4 is a flow diagram of querying the index file according to the invention.

[Description of Main Component Symbols]

Client 210
Server 220
Index file 221
Location conflict list 222

Claims

VII. Scope of the patent application:

1. A method of building an index of data blocks, applied in a data deduplication process to build a corresponding index file for the data blocks produced by the segmentation step of the deduplication process, the method comprising the following steps: loading an index file, wherein the index file includes a plurality of location blocks, each location block further includes a plurality of storage fields, and each storage field records the main hash value of a corresponding data block; performing a first hash process on the main hash value of a data block to compute a block number; performing a second hash process on the main hash value of the same data block to compute a field number; building a location conflict list for recording identical field numbers; comparing the field number with the field numbers in the location conflict list to determine whether the same field number is already stored in the location conflict list; and, if the field number is not present in the location conflict list, writing the main hash value into the position given by the block number and the field number.

2. The method of building an index of data blocks according to claim 1, wherein the step in which the field number is present in the location conflict list further comprises: recording the field number in the location conflict list and using an address pointer to record the data block corresponding to the field number.

3. The method of building an index of data blocks according to claim 1, further comprising, after the index file has been built: receiving a block query request, used to query whether the corresponding data block exists in the index file; if the data block named in the block query request does not exist in the index file, creating a temporary index file in memory and recording in the temporary index file the number of times the data block has been queried; and when the number of times the data block has been queried meets a threshold, establishing the corresponding block number and field number of the data block in the index file.