JP2012164130A

JP2012164130A - Data division program

Info

Publication number: JP2012164130A
Application number: JP2011023960A
Authority: JP
Inventors: Yoshiki Samejima; 吉喜鮫島
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2011-02-07
Filing date: 2011-02-07
Publication date: 2012-08-30

Abstract

PROBLEM TO BE SOLVED: To provide a data division method that can improve data compression efficiency while performing data deduplication. SOLUTION: When determining the position at which to divide data, a data division program preferentially saves a position nearer to the tail end of the data and divides the data at that position.

Description

The present invention relates to a program for dividing data.

When data is stored on a computer, duplicate portions are sometimes eliminated to reduce the data size. Specifically, the data is divided into blocks, and redundant copies of blocks with identical byte strings are deleted.
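
The basic idea can be illustrated with a minimal Python sketch. It uses fixed-size blocks purely for simplicity (the division method of this document is different), and the names dedup_fixed, restore, store, and recipe are hypothetical.

```python
import hashlib

def dedup_fixed(data: bytes, block_size: int):
    """Split data into fixed-size blocks and keep one copy of each distinct block."""
    store = {}    # digest -> block bytes (unique blocks only)
    recipe = []   # ordered digests needed to rebuild the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # a repeated block is not stored again
        recipe.append(digest)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original data from the unique blocks and the ordered digests."""
    return b"".join(store[d] for d in recipe)
```

For the input b"ABCD" * 3 + b"WXYZ" with a block size of 4, four blocks are produced but only two distinct blocks need to be stored.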

One example of applying deduplication is data backup. Because backup data must be accumulated continuously, it tends to grow large, which creates a need to delete duplicated data. Deduplication may also be applied to storage devices themselves in the future.

When deduplication is applied to backup data, the file to be backed up is first divided into blocks, and each block, compressed by a suitable method, is saved to the backup medium with the block's hash value as the key. To make the backup data easier to retrieve, the hash value of each block may also be kept in a storage device separate from the backup data.

To increase the effect of deduplication, the blocks generated by dividing the data should be identical to one another as often as possible. Deduplication therefore raises the question of where to divide the data.

In Patent Document 1 below, a byte string is scanned sequentially from head to tail, and the byte string is divided at the point where the hash value of a byte string of predetermined length matches a predefined numerical pattern. As a fallback division condition, the byte string is also divided when the hash value partially matches the predefined pattern.

In Patent Document 2 below, a division frame is set within the byte string, a feature value is computed for each candidate division position inside the frame, and the division position is determined by comparing the feature values.

In Patent Document 3 below, the division position is determined by evaluating the offset, hash value, and other feature values of candidate division positions.

In Patent Document 4 below, a byte string is divided by some method, and a hash value is used to determine whether each divided byte string contains a byte string already written to the storage device. Byte strings already written to the storage device are thus not written a second time.

Non-Patent Document 1 below describes a technique for replicating data to a remote computer. Because it is difficult to replicate all the data remotely at once, the target data must be divided into pieces of suitable size. Duplicate blocks need not be replicated, so replication efficiency can be improved by choosing the division positions appropriately. That document divides the data at variable length rather than fixed length and searches for duplicate blocks.

Patent Document 1: US 6,810,398 B2, 2004, System and method for unorchestrated determination of data sequence using sticky byte factoring to determine breakpoints in digital sequences
Patent Document 2: US 7,504,969 B2, 2009, Location-based stream segmentation for data deduplication
Patent Document 3: US 2006/0047855 A1, 2006, Efficient Chunking Algorithm
Patent Document 4: US 6,704,730 B1, 2004, Hash file system and method for use in a commonality factoring system

Non-Patent Document 1: Andrew Tridgell, Efficient Algorithms for Sorting and Synchronization, Chapter 3: The rsync algorithm, 1999, dissertation, URL: http://samba.org/~tridge/phd_thesis.pdf (retrieved January 11, 2011)

To increase the effect of deduplication, the block size used when dividing data into blocks should be as small as possible, because a smaller block is more likely to match another block.

When data is compressed, on the other hand, compression efficiency is generally better the larger the data is before compression. Therefore, when data is compressed and stored block by block, as in the backup example above, each block should be as large as possible. This conflicts with the requirement above for efficient deduplication.
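
The effect of block size on compression can be checked with a small experiment. The sketch below uses Python's zlib module as one example of a compressor; the function names and the sample data are hypothetical, and the exact sizes depend on the compressor and the data.

```python
import zlib

def compressed_size_whole(data: bytes) -> int:
    """Size of the data when compressed as a single unit."""
    return len(zlib.compress(data))

def compressed_size_per_block(data: bytes, block_size: int) -> int:
    """Total size when each fixed-size block is compressed independently."""
    return sum(len(zlib.compress(data[i:i + block_size]))
               for i in range(0, len(data), block_size))

sample = b"backup data tends to repeat itself. " * 200
print(compressed_size_whole(sample))
print(compressed_size_per_block(sample, 1024))
print(compressed_size_per_block(sample, 64))
```

For repetitive data like this sample, the total grows as the blocks shrink, because each block is compressed without knowledge of the others and carries its own format overhead.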

Deduplication and data compression both aim to remove redundant data, so priority should go to whichever approach reduces the data size overall. In particular, when both deduplication and data compression are performed, as in the backup example above, it is essential to balance these conflicting requirements and reduce the data size as a combined effect.

The efficiency of the backup process must also be considered when backing up a large volume of data. Reducing the size of the divided blocks improves the effect of deduplication, but it also increases the number of write operations that store blocks on the backup medium, and the backup therefore takes longer.

For these reasons, a data division method that can balance data deduplication and data compression is desired. Patent Documents 1 to 4 and Non-Patent Document 1 focus only on deduplication and therefore do not describe increasing the size of the divided blocks. Each is discussed below.

Non-Patent Document 1 describes a method for quickly finding a block that duplicates a given block, but it does not clearly disclose the underlying step of dividing the data so that many blocks are duplicates.

Patent Document 1 provides multiple division conditions: if no division candidate satisfying the first division condition is found, the data is divided at a position satisfying the fallback second division condition. However, neither the first nor the second division condition is explicitly designed to make the divided blocks larger. The method therefore appears, in principle, to choose division positions so as to raise deduplication efficiency.

In Patent Documents 2 and 3, as in Patent Document 1, the divided blocks do not necessarily become large. In addition, because multiple candidate division positions are compared, the processing load is high, and these methods are considered unsuitable for applications that demand processing efficiency.

In Patent Document 4, written data must be read from the storage device to check whether a byte string has already been written. The time spent accessing the storage device therefore increases, and processing efficiency is considered to be low.

The present invention has been made to solve the problems described above, and its object is to provide a data division method that can improve data compression efficiency while performing data deduplication.

When determining the position at which to divide data, the data division program according to the present invention preferentially saves a position closer to the end of the data and divides the data at that position.

In the data division program according to the present invention, positions closer to the end of the data are used preferentially when dividing data for deduplication, so deduplication can be performed while steering the processing toward a stronger data compression effect. This balances data deduplication against data compression and optimizes the overall reduction in data size. It also shortens the overall backup processing time.

FIG. 1 is a functional block diagram of a data dividing device 100 according to Embodiment 1.
FIG. 2 is a program describing the process by which the data dividing device 100 divides data.
FIG. 3 is a diagram showing the detailed processing flow of the function bdb executed by the data storage unit 140.
FIG. 4 is a program describing the process by which the data dividing device 100 according to Embodiment 2 divides data.
FIG. 5 is a program describing the process by which the data dividing device 100 according to Embodiment 3 divides data.
FIG. 6 is a diagram showing the detailed processing flow of the function bdb executed by the data dividing unit 130 and the data storage unit 140 in Embodiment 4.

<Embodiment 1>
FIG. 1 is a functional block diagram of a data dividing device 100 according to Embodiment 1 of the present invention. The data dividing device 100 is a device that divides data, and includes a data reading unit 110, a division condition determination unit 120, a data dividing unit 130, and a data storage unit 140.

The data reading unit 110 reads the data to be divided from a data source such as the file server 200. The division condition determination unit 120 sequentially acquires the data read by the data reading unit 110 and determines whether a condition for dividing the data is met. The data dividing unit 130 divides the data into blocks at the positions that meet the condition determined by the division condition determination unit 120. The data storage unit 140 calculates the hash value of each data block generated by the data dividing unit 130 and stores the hash value in the backup device 400. It also associates the data block with the hash value and stores them in the block DB 300.

The block DB 300 is a database that stores the data blocks generated when the data dividing unit 130 divides the original data. The block DB 300 can be built on a storage device such as a hard disk. The backup device 400 is a storage device that stores the hash values generated by the data storage unit 140.

The data reading unit 110, the division condition determination unit 120, the data dividing unit 130, and the data storage unit 140 can be configured as separate functional units or as a single integrated unit.

These units can be implemented as hardware, such as circuit devices that realize their functions, or by an arithmetic device such as a CPU (Central Processing Unit) together with a program that defines its operation. The following description mainly covers an example in which these functional units are implemented by a program.

FIG. 2 is a program describing the process by which the data dividing device 100 divides data. Step numbers are shown alongside for convenience of explanation. Each step of FIG. 2 is described below.

(FIG. 2: Step 200)
The data dividing device 100 performs steps 210 to 223 below on all the data read by the data reading unit 110. If not all the data has been processed, the process returns to this step, using the "start" marker as the target, and repeats the same processing. The data read by the data reading unit 110 is assumed to be stored as a byte string from the memory address indicated by the variable blockStart up to the address immediately before the memory address indicated by the variable blockEnd.

(FIG. 2: Step 210)
The division condition determination unit 120 ends the process if blockEnd <= blockStart holds. The variable blockStart is updated successively in the following steps. This step checks whether the length of the byte string remaining to be divided has reached 0, and thus whether the process should end.

(FIG. 2: Step 211)
The division condition determination unit 120 determines whether blockEnd <= blockStart + blockMin holds. If it does, the data storage unit 140 writes the block between blockStart and blockEnd to the block DB 300. The data storage unit 140 also calculates the hash value of that block and stores it in the backup device 400. This processing by the data storage unit 140 is written in the function bdb, whose details are described later with reference to FIG. 3.

(FIG. 2: Step 211: Supplement)
blockMin is the minimum size of a block after division. This step therefore determines whether the byte string to be divided is at or below the minimum size. If it is, processing has advanced nearly to the end of the data and only a few bytes remain, so there is little point in performing the following steps. The remaining byte string is therefore stored in the block DB 300 as is.

(FIG. 2: Step 212)
The division condition determination unit 120 initializes the variables cP1 and cP2 to 0. The variable cP1 holds the data address that satisfies the first division condition, which is one of the conditions for dividing data. The variable cP2 holds the data address that satisfies the second division condition, which is another condition for dividing data.

(FIG. 2: Step 213)
The division condition determination unit 120 initializes the address variable cP used to search for a division position. The initial position is the address one byte beyond the position corresponding to the minimum block size. From here on, the functional units perform the following processing while reading the data acquired from the file server 200 sequentially, one byte at a time.

(FIG. 2: Step 214)
The division condition determination unit 120 repeats steps 215 to 223 while cP < blockEnd is satisfied, that is, until the address variable cP reaches the end of the data to be divided.

(FIG. 2: Step 215)
The division condition determination unit 120 computes the feature value hash used to determine whether a division condition is met. The computation of the feature value hash is written in the function crc. Any known technique can be used for the function crc.

(FIG. 2: Step 215: Supplement)
For example, the function crc could compute a cyclic redundancy check value or a hash value over the byte string of a predetermined length immediately preceding the specified address, and use that value as the feature value hash.
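
One possible reading of this supplement is sketched below in Python. It assumes CRC-32 (via zlib.crc32) as the check value and a window of 48 bytes. Both choices, and the signature of crc, are illustrative assumptions, since the description leaves the technique open.

```python
import zlib

WINDOW = 48  # hypothetical length of the byte string examined before each position

def crc(data: bytes, pos: int, window: int = WINDOW) -> int:
    """Feature value for candidate position `pos`:
    CRC-32 of the `window` bytes that end just before `pos`."""
    start = max(0, pos - window)   # near the head of the data the window is shorter
    return zlib.crc32(data[start:pos])
```

Because the value depends only on the bytes immediately before the position, identical content yields identical feature values even if it is shifted within the data. This sketch recomputes the checksum at every position; a rolling hash would avoid that recomputation.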

(FIG. 2: Step 216)
The division condition determination unit 120 checks whether the feature value hash satisfies the first division condition. The first division condition is written in the function pattern1. If the feature value hash satisfies the first division condition, the current data address cP is set in the variable cP1 and the process proceeds to step 219. If it does not, the process proceeds to step 217.

(FIG. 2: Step 216: Supplement)
A possible first division condition is, for example, whether a partial bit pattern of the feature value hash matches a specific bit pattern. Another is whether the remainder of dividing the feature value hash by a specific value equals a specified value. The same applies to the second and third division conditions described below.
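
The two kinds of condition mentioned in this supplement could be written as follows. This is a sketch only: the function names, the choice of the low-order bits, and the parameter values are assumptions, not the actual pattern1 to pattern3 of FIG. 2.

```python
def pattern_low_bits(hash_value: int, bits: int, target: int = 0) -> bool:
    """True when the low-order `bits` bits of the feature value equal `target`."""
    mask = (1 << bits) - 1
    return (hash_value & mask) == target

def pattern_remainder(hash_value: int, divisor: int, target: int) -> bool:
    """True when the feature value leaves remainder `target` when divided by `divisor`."""
    return hash_value % divisor == target
```

If the feature values are roughly uniformly distributed, a condition on k bits is met at about one position in 2^k, so the number of bits (or the divisor) controls how often candidate positions appear.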

(FIG. 2: Step 217)
The division condition determination unit 120 checks whether the feature value hash satisfies the second division condition. The second division condition is written in the function pattern2 and differs from the first division condition of step 216. If the feature value hash satisfies the second division condition, the current data address cP is set in the variable cP2 and the process proceeds to step 219. If it does not, the process proceeds to step 218.

(FIG. 2: Step 218)
The division condition determination unit 120 checks whether the feature value hash satisfies the third division condition. The third division condition is written in the function pattern3 and differs from both the first division condition of step 216 and the second division condition of step 217. If the feature value hash satisfies the third division condition, the current data address cP is set in the variable cP2.

(FIG. 2: Step 219)
The data dividing unit 130 checks whether the variable cP1 has been set (is not 0). If cP1 has been set, a position satisfying the first division condition was found in step 216. The data dividing unit 130 splits off the data between the addresses blockStart and cP1 as a data block. The data storage unit 140 writes that data block to the block DB 300, calculates the hash value of the same data block, and writes the hash value to the backup device 400. The data dividing unit 130 then sets blockStart = cP1 and returns to step 210. If cP1 has not been set, the process proceeds to step 220.

(FIG. 2: Step 220)
The data dividing unit 130 checks whether the address cP (the current address at which the division conditions are being evaluated) has reached the position corresponding to the maximum block size (blockStart + blockMax). If it has, the process proceeds to step 221; if not, the process skips to step 223.

(FIG. 2: Step 221)
The data dividing unit 130 checks whether the variable cP2 has been set (is not 0). If cP2 has been set, a position satisfying the second division condition of step 217 or the third division condition of step 218 has been found. The data dividing unit 130 splits off the data between the addresses blockStart and cP2 as a data block. The data storage unit 140 writes that data block to the block DB 300, calculates the hash value of the same data block, and writes the hash value to the backup device 400. The data dividing unit 130 then sets blockStart = cP2 and returns to step 210. If cP2 has not been set, the process proceeds to step 222.

(FIG. 2: Step 222)
The data dividing unit 130 splits off the data between the addresses blockStart and cP as a data block (which, by the condition of step 220, has the maximum block size). The data storage unit 140 writes that data block to the block DB 300, calculates the hash value of the same data block, and writes the hash value to the backup device 400. The data dividing unit 130 then sets blockStart = cP and returns to step 210.

(FIG. 2: Step 223)
The data dividing unit 130 advances the address cP by one byte (toward the end of the data) and returns to step 215.

(FIG. 2: Step 225)
The data dividing unit 130 extracts the data remaining between the addresses blockStart and blockEnd as a data block. The data storage unit 140 writes that data block to the block DB 300, calculates the hash value of the same data block, and writes the hash value to the backup device 400. This step handles the case where no position satisfying the first division condition was found up to the position corresponding to the maximum block size.
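
The steps of FIG. 2 can be sketched in Python as follows. This is an illustrative reading of the description, not the program of FIG. 2 itself. Byte offsets replace memory addresses, None replaces 0 as the "not set" marker, zlib.crc32 over a fixed window stands in for the function crc, and the three predicates are supplied by the caller. The names divide and window are assumptions, and blocks are simply collected in a list instead of being passed to bdb.

```python
import zlib
from typing import Callable, List, Optional

Pred = Callable[[int], bool]

def crc(data: bytes, pos: int, window: int) -> int:
    """Assumed feature value: CRC-32 of the `window` bytes ending just before `pos`."""
    return zlib.crc32(data[max(0, pos - window):pos])

def divide(data: bytes, block_min: int, block_max: int, window: int,
           pattern1: Pred, pattern2: Pred, pattern3: Pred) -> List[bytes]:
    blocks: List[bytes] = []
    block_start, block_end = 0, len(data)
    while block_start < block_end:                    # step 210
        if block_end <= block_start + block_min:      # step 211: short tail, store as is
            blocks.append(data[block_start:block_end])
            break
        cp1: Optional[int] = None                     # step 212
        cp2: Optional[int] = None
        cp = block_start + block_min + 1              # step 213
        cut: Optional[int] = None
        while cp < block_end:                         # step 214
            h = crc(data, cp, window)                 # step 215
            if pattern1(h):                           # step 216
                cp1 = cp
            elif pattern2(h) or pattern3(h):          # steps 217 and 218
                cp2 = cp   # later matches overwrite earlier ones: the last one wins
            if cp1 is not None:                       # step 219
                cut = cp1
                break
            if cp >= block_start + block_max:         # step 220
                cut = cp2 if cp2 is not None else cp  # steps 221 and 222
                break
            cp += 1                                   # step 223
        if cut is None:                               # step 225: scan reached the end
            cut = block_end
        blocks.append(data[block_start:cut])
        block_start = cut
    return blocks
```

With predicates that never match, every block is cut at block_max (step 222). With a first condition that always matches, every block is cut at the first candidate position, block_min + 1 bytes after the block start.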

FIG. 3 is a diagram showing the detailed processing flow of the function bdb executed by the data storage unit 140. Each step of FIG. 3 is described below.

(FIG. 3: Step S301)
The data storage unit 140 calculates the hash value of the data block to be stored in the block DB 300. A hash function with a low probability of collision is desirable; for example, SHA-1 or SHA-256 can be used.

(FIG. 3: Step S302)
Using the hash value obtained in step S301 as the key, the data storage unit 140 searches whether a data block with the same content is already registered in the block DB 300.

(FIG. 3: Step S303)
If the key is found in step S302, the data storage unit 140 outputs the hash value to the backup device 400 and ends this processing flow.

(FIG. 3: Step S304)
If the key is not found in step S302, the data storage unit 140 outputs the hash value to the backup device 400 and stores the data block in the block DB 300 with the hash value as the key. The data block stored in the block DB 300 is preferably compressed with a suitable algorithm.

(FIG. 3: Steps S303 to S304: Supplement)
The hash value is used as the key in these steps because the data block may be compressed when it is stored in the block DB 300 in step S304. If data blocks are compressed before being stored in the block DB 300, whether a new data block is identical to one already stored cannot be known without decompressing the stored block, which makes the processing load heavy. Comparing hash values therefore provides a simple test of identity.
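
The flow of FIG. 3 can be sketched in Python as follows. A dictionary stands in for the block DB 300 and a list for the backup device 400. The class name BlockStore and the method restore are hypothetical, and SHA-256 and zlib are just one possible choice of hash function and compressor.

```python
import hashlib
import zlib

class BlockStore:
    def __init__(self):
        self.block_db = {}        # stand-in for block DB 300: digest -> compressed block
        self.backup_device = []   # stand-in for backup device 400: ordered digests

    def bdb(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()         # step S301
        if digest not in self.block_db:                    # step S302
            self.block_db[digest] = zlib.compress(block)   # step S304: store compressed
        self.backup_device.append(digest)                  # steps S303 and S304: output hash
        return digest

    def restore(self) -> bytes:
        """Rebuild the original data from the recorded hash values."""
        return b"".join(zlib.decompress(self.block_db[d]) for d in self.backup_device)
```

The lookup in step S302 compares only digests, so no stored block has to be decompressed to decide whether a new block is a duplicate.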

<Embodiment 1: Summary>
As described above, the data dividing device 100 according to Embodiment 1 is configured to divide data by giving priority, among the data positions that satisfy the second division condition, to the division position as close as possible to the end of the data. Data division for deduplication is thus performed while the processing is steered toward making the divided data blocks as large as possible, and as a result data deduplication and data compression can both be achieved in good balance.

Specifically, when no division position satisfying the first division condition is found, the division position that satisfies the second division condition and is as close to the end as possible is saved in the variable cP2, and in step 221 the data block from blockStart to cP2 is written to the block DB 300. With this procedure, the value that finally remains in cP2 is the position as close to the end of the data as possible.

When the data block satisfies neither the first nor the second division condition, a data block of the maximum size blockMax is stored in the block DB 300 in step 222. A data block that is unsuited to deduplication is thus stored at the largest possible size, with priority on data compression. Even if deduplication is ineffective, the data compression effect compensates, and the data size can be reduced overall.

<Embodiment 2>
Embodiment 2 of the present invention describes another method of dividing data at a position as close as possible to the end of the data, as in Embodiment 1. In Embodiment 2, the data to be divided is read sequentially from the end toward the beginning, and the data is divided as soon as a division condition is met. Since the configuration of the data dividing apparatus 100 is the same as in Embodiment 1, the following description focuses on the differences.

FIG. 4 is a program describing the process by which the data dividing apparatus 100 according to Embodiment 2 divides data. For convenience of explanation, step numbers are shown alongside. Each step of FIG. 4 is described below.

(FIG. 4: Steps 400 to 412)
These steps are the same as steps 200 to 212 in FIG. 2.

(FIG. 4: Step 413)
When blockStart + blockMax <= blockEnd holds, the division condition determination unit 120 initializes the address variable cP, which is used to search for the division position, to the position corresponding to the maximum post-division block size.

(FIG. 4: Step 414)
When the conditional expression of step 413 does not hold, the division condition determination unit 120 initializes the variable cP to the end position of the data block.

(FIG. 4: Steps 413 to 414: Supplement)
In these steps, the position cP at which the conditions are evaluated is initialized to a position as far toward the rear of the data block as possible.

(FIG. 4: Step 415)
While cP >= blockStart + blockMin holds, that is, until the address cP, moving back from the end of the data block, reaches the position corresponding to the minimum block size, the division condition determination unit 120 repeats the processing of steps 416 to 421.

(FIG. 4: Steps 416 to 417)
These steps are the same as steps 215 to 216 in FIG. 2.

(FIG. 4: Steps 418 to 419)
These steps are the same as steps 217 to 218 in FIG. 2. However, before evaluating the second division condition pattern2 and the third division condition pattern3, the division condition determination unit 120 checks whether the address cP2 has already been set (that is, whether its value is still 0). Because Embodiment 2 reads the data from the end of the data block toward the beginning, the first value of cP2 that is found is the one closest to the end of the data block. In other words, from the viewpoint of dividing the data as close to the end of the data block as possible, cP2 need not be updated once it has been found, and a value is set only while cP2 is still empty.

(FIG. 4: Step 420)
This step is the same as step 219 in FIG. 2.

(FIG. 4: Step 421)
The data dividing unit 130 moves the address cP back by 1 byte (toward the beginning of the data) and returns to step 416.

(FIG. 4: Step 423)
If the variable cP2 is set when this step is reached, no position satisfying the first division condition was found in the loop of steps 415 to 421, and only a position satisfying the second division condition was found. The data dividing unit 130 cuts out the data between address blockStart and address cP2 as a data block. The data storage unit 140 writes that data block to the block DB 300. The data storage unit 140 also calculates the hash value of the same data block and writes it to the backup device 400. The data dividing unit 130 sets blockStart = cP2 and returns to step 410.

(FIG. 4: Step 424)
If the variable cP2 is not set at step 423, no position satisfying either the first division condition or the second division condition was found in the loop of steps 415 to 421, so the data block is cut at the minimum post-division block size. The data dividing unit 130 cuts out the data between address blockStart and address blockStart + blockMin as a data block. The data storage unit 140 writes that data block to the block DB 300. The data storage unit 140 also calculates the hash value of the same data block and writes it to the backup device 400. The data dividing unit 130 increments the address blockStart by blockMin and returns to step 410.
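Steps 413 to 424 can be sketched as a backward scan. This is an illustrative sketch only, not the program of FIG. 4: the `feature`, `first`, and `second` callables are hypothetical, stopping at the first hit of the first division condition is an assumption (steps 416 to 417 and 420 are only described by reference to FIG. 2), and clamping the final short block to the data end is likewise an assumption about how the tail is handled.

```python
from typing import Callable, List


def split_backward(
    data: bytes,
    block_min: int,
    block_max: int,
    first: Callable[[int], bool],
    second: Callable[[int], bool],
    feature: Callable[[bytes, int], int],
) -> List[int]:
    """Return the end offset of every block, scanning each range back to front."""
    if not 1 <= block_min <= block_max:
        raise ValueError("need 1 <= block_min <= block_max")
    cuts: List[int] = []
    block_start = 0
    block_end = len(data)
    while block_start < block_end:
        # Steps 413-414: start as far toward the rear as allowed.
        if block_start + block_max <= block_end:
            cp = block_start + block_max
        else:
            cp = block_end
        cp1 = cp2 = 0  # 0 means "not set"
        # Step 415: walk back until the minimum block size is reached.
        while cp >= block_start + block_min:
            h = feature(data, cp)
            if first(h):
                cp1 = cp  # assumption: stop at the first hit of condition 1
                break
            if cp2 == 0 and second(h):
                cp2 = cp  # steps 418-419: keep only the first (rearmost) hit
            cp -= 1  # step 421: one byte toward the beginning
        if cp1:
            cut = cp1
        elif cp2:
            cut = cp2  # step 423
        else:
            # Step 424: minimum size; clamped to the data end (assumption).
            cut = min(block_start + block_min, block_end)
        cuts.append(cut)
        block_start = cut
    return cuts
```

Compared with a front-to-back scan, cP2 is written at most once per block here, because the first hit encountered while walking backward is already the one nearest the end.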

<Embodiment 2: Summary>
As described above, the data dividing apparatus 100 according to Embodiment 2 reads the data sequentially from the end toward the beginning and keeps, in the variable cP2, the division position that satisfies the second division condition and lies as close as possible to the end. In step 423, the data block from blockStart to cP2 is written to the block DB 300. With this procedure, the value that finally remains in cP2 is the position as close as possible to the end of the data.

When the data block satisfies neither the first division condition nor the second division condition, a data block of the minimum post-division size blockMin is stored in the block DB 300 in step 424. Even for data that is not suited to deduplication, a data block of at least the minimum size blockMin is stored, so a minimum level of data compression can still be obtained.

Alternatively, a data block of the maximum size blockMax may be stored in the block DB 300 in step 424. This amounts to adopting the same processing as step 222 in FIG. 2.

<Embodiment 3>
Embodiments 1 and 2 use only the first division condition and the second division condition, but a larger number of division conditions may also be evaluated. In Embodiment 3 of the present invention, the data is divided at a position that satisfies any of the first, second, or third division condition. Since the configuration of the data dividing apparatus 100 is the same as in Embodiment 1, the following description focuses on the differences.

FIG. 5 is a program describing the process by which the data dividing apparatus 100 according to Embodiment 3 divides data. The example shown here adds, to the program described with FIG. 2 of Embodiment 1, new processing that divides the data at a position satisfying the third division condition. A third division condition can equally be added to the program described with FIG. 4 of Embodiment 2. The following description focuses on the parts that differ from the program described with FIG. 2.

(FIG. 5: Step 512)
The division condition determination unit 120 initializes the variables cP1, cP2, and cP3 to 0. The variable cP3 holds the data address that satisfies the third division condition, which is a condition for dividing the data.

(FIG. 5: Step 518)
When the feature value hash satisfies the third division condition, the division condition determination unit 120 sets the variable cP3 to the current data address cP. In Embodiments 1 and 2, the data address cP was stored in the variable cP2 when the third division condition was satisfied, so the division position was in effect chosen from only the first and second division conditions. Embodiment 3 differs in that a third division condition is provided as an additional, separate option.

(FIG. 5: Step 522)
The data dividing unit 130 checks whether the variable cP3 is set (not 0). If cP3 is set, a position satisfying the third division condition was found in step 518. The data dividing unit 130 cuts out the data between address blockStart and address cP3 as a data block. The data storage unit 140 writes the data block to the block DB 300. The data storage unit 140 also calculates the hash value of the same data block and writes it to the backup device 400. The data dividing unit 130 sets blockStart = cP3 and returns to step 510. If cP3 is not set, the process proceeds to step 523.

<Embodiment 3: Summary>
As described above, the data dividing apparatus 100 according to Embodiment 3 evaluates the division conditions in the order of the first, second, and third division condition, and divides the data at a position that satisfies any of them. This allows finer-grained division conditions to be set than in Embodiments 1 and 2.
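The order of precedence among the three saved positions can be sketched as a small selection function. The name `choose_cut` is hypothetical, and the final fallback to the maximum block size is an assumption about step 523, which is not described in this section (it mirrors step 222 of FIG. 2).

```python
def choose_cut(block_start: int, block_end: int, block_max: int,
               cp1: int, cp2: int, cp3: int) -> int:
    """Pick the split address; 0 means a candidate was never set."""
    # First, second, then third division condition, in that order.
    for candidate in (cp1, cp2, cp3):
        if candidate != 0:
            return candidate
    # Assumed fallback: maximum block size, limited by the end of the data.
    return min(block_start + block_max, block_end)
```

For example, `choose_cut(0, 100, 50, 0, 0, 30)` selects the third-condition position 30, whereas with all three candidates unset the block is cut at the maximum size.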

<Embodiment 4>
Embodiment 4 of the present invention describes an operation example in which a unique identification number of the data block is stored in the block DB 300 and the backup device 400 in addition to the hash value of the data block. The aim is to avoid destroying data stored in the block DB 300 or the backup device 400 even when hash values collide. Since the rest of the configuration is the same as in Embodiments 1 to 3, the following description focuses on the differences.

FIG. 6 is a diagram showing the detailed processing flow of the function bdb executed by the data dividing unit 130 and the data storage unit 140 in Embodiment 4. Each step of FIG. 6 is described below.

(FIG. 6: Step S601)
This step is the same as step S301 in FIG. 3.

(FIG. 6: Step S602)
The data storage unit 140 initializes to 0 the unique identification number that uniquely identifies the data block.

(FIG. 6: Step S603)
Using the hash value obtained in step S601 and the unique identification number of the data block as the key, the data storage unit 140 searches the block DB 300 to check whether a data block with the same content has already been registered.

(FIG. 6: Step S604)
If the key is found in step S603, the data storage unit 140 retrieves the corresponding data block from the block DB 300 and compares it with the data block being processed to check whether the contents are identical. If they are identical, it outputs the hash value and the unique identification number to the backup device 400 and ends this processing flow. If they are not identical, it increments the unique identification number by 1 and returns to step S603.

(FIG. 6: Step S605)
If the key is not found in step S603, the data storage unit 140 outputs the hash value and the unique identification number to the backup device 400, and stores the data block in the block DB 300 using the hash value and the unique identification number as the key. The data block stored in the block DB 300 is preferably compressed with a suitable algorithm.
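Steps S601 to S605 can be sketched as follows. This is a minimal illustration under stated assumptions, not the function bdb of FIG. 6: an in-memory dict stands in for the block DB 300, a list stands in for the output to the backup device 400, and SHA-1 and zlib are placeholders for the unspecified hash and compression algorithms.

```python
import hashlib
import zlib
from typing import Callable, Dict, List, Tuple

Key = Tuple[bytes, int]  # (hash value, unique identification number)


def sha1_digest(block: bytes) -> bytes:
    """Assumed hash function; the description does not name one here."""
    return hashlib.sha1(block).digest()


def store_block(
    block: bytes,
    block_db: Dict[Key, bytes],
    backup_log: List[Key],
    hash_fn: Callable[[bytes], bytes] = sha1_digest,
) -> Key:
    """Store `block` once and record its key, tolerating hash collisions."""
    digest = hash_fn(block)  # S601
    uid = 0                  # S602
    while True:
        key = (digest, uid)  # S603: look up by hash value and number
        stored = block_db.get(key)
        if stored is None:
            # S605: new content, store it compressed and record the key.
            block_db[key] = zlib.compress(block)
            backup_log.append(key)
            return key
        if zlib.decompress(stored) == block:
            # S604: identical content already stored, record the key only.
            backup_log.append(key)
            return key
        uid += 1             # S604: same hash, different content; try next number
```

Passing a deliberately weak `hash_fn` (for example one that returns only the first byte) is a simple way to exercise the collision path: two different blocks then receive the same hash value but different identification numbers, and neither overwrites the other.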

<Embodiment 4: Summary>
As described above, the data dividing apparatus 100 according to Embodiment 4 can uniquely identify a data block by its unique identification number even when hash values collide. Data stored in the block DB 300 or the backup device 400 is therefore not destroyed even when a hash collision occurs.

<Embodiment 5>
In Embodiments 1 to 4 above, the data dividing apparatus 100 may determine, at the point when the data reading unit 110 reads data from the file server 200, whether that data is compressed or encrypted and, if it is, omit the processing described with reference to FIGS. 2 to 6 and store the data as-is in the block DB 300. This is because applying further deduplication or compression to data that is already compressed or encrypted is not expected to yield much benefit.

When the processing described with reference to FIGS. 2 to 6 is omitted, the entire data is handled as a single data block, and the hash value is calculated over the entire data.
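The bypass can be sketched as below. The description does not say how compressed or encrypted data is recognized, so the magic-number check here is purely an assumed heuristic for a few common compressed formats (gzip, ZIP, bzip2, xz); it does not detect encryption. The names `looks_compressed` and `plan_blocks`, and the `chunker` callback, are hypothetical.

```python
from typing import Callable, List

# Leading bytes of a few common compressed formats (assumed heuristic).
KNOWN_COMPRESSED_MAGIC = (
    b"\x1f\x8b",       # gzip
    b"PK\x03\x04",     # ZIP local file header
    b"BZh",            # bzip2
    b"\xfd7zXZ\x00",   # xz
)


def looks_compressed(data: bytes) -> bool:
    """Return True if the data starts with one of the known magic numbers."""
    return data.startswith(KNOWN_COMPRESSED_MAGIC)


def plan_blocks(data: bytes, chunker: Callable[[bytes], List[bytes]]) -> List[bytes]:
    """Treat already-compressed data as one block; otherwise divide it."""
    if looks_compressed(data):
        return [data]  # whole data is one block, hashed as a whole
    return chunker(data)
```

A practical detector would likely combine several signals (file type, headers, byte statistics); the point of the sketch is only the control flow, in which the division steps are skipped and the whole data becomes a single block.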

Description of symbols: 100: data dividing apparatus; 110: data reading unit; 120: division condition determination unit; 130: data dividing unit; 140: data storage unit; 200: file server; 300: block DB; 400: backup device.

Claims

A data division program for causing a computer to execute a process of dividing data, the program causing the computer to execute:
a reading step of sequentially reading the data in units of a predetermined reading unit;
a determination step of determining whether the data read in the reading step satisfies a condition for dividing the data, and acquiring the corresponding data position; and
a dividing step of dividing the data at a position where the data satisfies the condition,
wherein, in the determination step, the computer executes:
a first determination step of determining whether the data read in the reading step satisfies a first division condition, which is a condition for dividing the data; and
a second determination step of, when the data read in the reading step does not satisfy the first division condition, further determining whether the data read in the reading step satisfies a second division condition, which is a condition for dividing the data, and
wherein, in the second determination step, a position closer to the end of the data is preferentially adopted as the result of the determination step.

The data division program according to claim 1, further causing the computer to execute a step of predefining an upper limit size of the data after division in the dividing step,
wherein, in the reading step, the computer reads the data sequentially from the beginning to the end, and
wherein, in the dividing step, when the data read in the reading step satisfies neither the first division condition nor the second division condition, the computer executes the dividing step using the upper limit size.

The data division program according to claim 1, further causing the computer to execute a step of predefining a lower limit size of the data after division in the dividing step,
wherein, in the reading step, the computer reads the data sequentially from the end to the beginning, and
wherein, in the dividing step, when the data read in the reading step satisfies neither the first division condition nor the second division condition, the computer executes the dividing step using the lower limit size.

The data division program according to any one of claims 1 to 3,
wherein, in the determination step, the computer executes a third determination step of, when the data read in the reading step satisfies neither the first division condition nor the second division condition, further determining whether the data read in the reading step satisfies a third division condition, which is a condition for dividing the data,
wherein, in the dividing step, the computer divides the data at a position satisfying any of the first division condition, the second division condition, or the third division condition, and
wherein, in the third determination step, a position closer to the end of the data is preferentially adopted as the result of the determination step.

The data division program according to any one of claims 1 to 4, further causing the computer to execute a storing step of storing, in a storage device, the data divided in the dividing step in association with a hash value of that data.

The data division program according to claim 5, wherein, in the storing step, the computer stores, in the storage device, a number that uniquely identifies the data divided in the dividing step, in association with the hash value.

The data division program according to claim 5 or 6, further causing the computer to execute:
a step of determining whether the data is compressed or encrypted; and
a step of, when the data is compressed or encrypted, omitting the reading step, the determination step, the dividing step, and the storing step, and performing only storage of the data in the storage device in association with the hash value.

The data division program according to any one of claims 5 to 7, wherein, in the storing step, the computer compresses the data, or the data divided in the dividing step, and stores it in the storage device.

A data dividing apparatus for dividing data, comprising:
a reading unit that sequentially reads the data in units of a predetermined reading unit;
a division condition determination unit that determines whether the data read by the reading unit satisfies a condition for dividing the data; and
a dividing unit that divides the data at a position where the data satisfies the condition,
wherein the division condition determination unit determines whether the data read by the reading unit satisfies a first division condition, which is a condition for dividing the data,
and, when the data read by the reading unit does not satisfy the first division condition, further determines whether the data read by the reading unit satisfies a second division condition, which is a condition for dividing the data, and
wherein, when determining whether the data satisfies the second division condition, the division condition determination unit preferentially adopts a position closer to the end of the data.