JP4852135B2

JP4852135B2 - Data division method and apparatus

Info

Publication number: JP4852135B2
Application number: JP2009227887A
Authority: JP
Inventors: 誠小原
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2009-09-30
Filing date: 2009-09-30
Publication date: 2012-01-11
Anticipated expiration: 2029-09-30
Also published as: JP2011076421A

Description

The present invention relates to a data division method and apparatus suitable for dividing arbitrary data into variable-length data fragments while detecting data fragments that are duplicated between data.

In recent years, the infrastructure for managing the data of government offices, companies, and individuals has rapidly grown larger and more complex, and the amount of data stored in storage devices, which are the main components of that infrastructure, keeps increasing. Deduplication has attracted attention as one technique for reducing the cost of storing and managing such data.

Deduplication is a technique that, when arbitrary data (hereinafter referred to as target data) is stored in a storage device, detects whether data with the same content as the target data is already stored in that storage device, that is, detects data duplication. If such data is already stored, the target data is replaced with, for example, a link, so that the duplicate data is consolidated into one copy (eliminated). Deduplication reduces the storage capacity required to store data.

To detect at high speed whether the same data is stored in the storage device, data identifiers are often used. That is, deduplication generally does not detect duplication by comparing the target data itself with all the data already stored in the storage device. Instead, it obtains an identifier of the target data and compares it with the group of identifiers of the data already stored.

Data duplication is detected in predetermined units. A first method, known for a long time, uses a whole block of data (content) such as a file as this unit. More recently, a second method has been proposed that uses as the unit a data fragment (hereinafter referred to as a chunk) obtained by dividing data such as a file. In the first method, data that differs only in part is still processed as entirely different data. The second method has the advantage that only the differing part needs to be processed.

In deduplication that applies the second method, deduplication is generally performed by repeating the following procedure.
Procedure 1) Cut out a chunk from the target data.
Procedure 2) Obtain the identifier of the cut-out chunk.
Procedure 3) Compare the identifier of the cut-out chunk with the identifier of each chunk already stored in the storage device. If a stored chunk has the same identifier, the cut-out chunk is regarded as having the same content as that chunk and is stored in the storage device in a form that eliminates the duplication, for example by replacing it with a link.
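Procedures 2 and 3 can be sketched for a single chunk as follows. This is a minimal sketch under stated assumptions, not the method of any cited document: SHA-256 stands in for the unspecified identifier function, and the storage device is modeled as a dictionary named `stored`. Both names and choices are illustrative.

```python
import hashlib


def store_chunk(chunk, stored):
    """Apply Procedures 2 and 3 to one chunk.

    Returns (identifier, was_duplicate). `stored` maps identifier -> chunk
    and stands in for the storage device (an assumption of this sketch).
    """
    # Procedure 2: obtain the identifier (SHA-256 is an assumed choice).
    chunk_id = hashlib.sha256(chunk).hexdigest()
    # Procedure 3: an equal identifier is treated as equal content.
    duplicate = chunk_id in stored
    if not duplicate:
        stored[chunk_id] = chunk
    return chunk_id, duplicate


stored = {}
results = [store_chunk(c, stored) for c in [b"abc", b"def", b"abc"]]
flags = [dup for _, dup in results]
```

In this sketch the third chunk is flagged as a duplicate and only its identifier, acting as the link, would need to be recorded.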

In deduplication that applies the second method, the way chunks are cut out in Procedure 1, that is, the length at which each chunk is cut out, is important. Such deduplication techniques are broadly classified into the following two types according to how the cut-out point of a chunk is obtained when the chunk is cut out from the target data.

A) Fixed-length deduplication method
The fixed-length deduplication method sets chunk cut-out points at a fixed length and performs duplicate detection and elimination for each chunk.
B) Variable-length deduplication method
The variable-length deduplication method sets cut-out points by dynamically adjusting the data division length according to the content of the target data, and performs duplicate detection and elimination for each chunk.

The differences between the fixed-length deduplication method and the variable-length deduplication method are described below with reference to FIG. 23.
FIG. 23 shows the chunk cut-out points obtained by the fixed-length deduplication method and by the variable-length deduplication method for two documents: document 111, whose document name is “document #1”, and document 112, whose document name is “document #2”. Document 112 is generated by editing part of document 111, for example by inserting the character string “ABCD” between the character strings “name” and “specified” in document 111.

In the fixed-length deduplication method, chunk cut-out points are set for document 111 and document 112 in units of a fixed length, for example 10 characters, as indicated by arrow 113 in FIG. 23. In the variable-length deduplication method, chunk cut-out points are set for document 111 and document 112 according to the content of the data, as indicated by arrow 114 in FIG. 23. The details of this technique are described later.

The following points should be noted.
In the fixed-length deduplication method, all chunks from the position where the character string was inserted to the end of the document differ between document 111 and document 112.
In the variable-length deduplication method, by contrast, only the chunks around the position where the character string was inserted differ between document 111 and document 112, and all chunks after that position match.

Thus, compared with the fixed-length deduplication method, the variable-length deduplication method can achieve deduplication while minimizing the impact of a partial insertion, deletion, or change of data between two pieces of data.
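The behavior of fixed-length chunks under an insertion can be reproduced with a short sketch. The sample sentence is hypothetical, since the text of FIG. 23 is not reproduced here. Only the 10-character unit and the inserted string “ABCD” come from the description.

```python
def fixed_chunks(text, size=10):
    """Cut text into consecutive chunks of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical stand-ins for document #1 and document #2.
doc1 = "The file name specified is not valid"
doc2 = doc1.replace("name ", "name ABCD", 1)  # insert "ABCD" before "specified"

chunks1 = fixed_chunks(doc1)
chunks2 = fixed_chunks(doc2)

# Chunks that both documents have in common.
shared = set(chunks1) & set(chunks2)
```

With this sample, only the chunk before the insertion point is shared. Every later boundary is shifted by four characters, so none of the later chunks match.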

Various methods are known for obtaining variable-length chunk cut-out points as described above and for performing deduplication using them. Here, a method such as that described in Patent Document 1 is explained with an example.

In the method described in Patent Document 1, a chunk cut-out point is obtained by the following procedure.
1) A data fragment (byte string) in a contiguous fixed-length section of the data (hereinafter referred to as a window) is extracted, and an identifier of that data fragment is obtained. Here, the window length is assumed to be 2 bytes. Note that this data fragment is distinct from a data fragment that serves as a chunk.

2) When part of the obtained identifier (for example, the lower 2 bits) matches a predetermined value (for example, 0x01), that position is set as a chunk cut-out point.

FIG. 24 shows an example of the operation in which an identifier is obtained for the character string (data fragment) in a 2-byte window W and, when the lower 2 bits of that identifier match the predetermined value 0x01, that position is determined to be a chunk cut-out point. In the example of FIG. 24, a section 2 bytes (2 characters) long from the beginning of the document data “The fil…” is initially set as the window W. The window W is then shifted 1 byte at a time, and the identifier of the character string in the window W is obtained each time, for example by calculating the hash value of that character string. The hash function used to calculate this hash value is denoted h_α( ). If the character string in the window W is “Th”, its identifier is expressed as h_α(“Th”).

Suppose the identifier h_α(“Th”) of the character string “Th” in the window W is 0x1A. The lower 2 bits of the identifier 0x1A are then 0x02, obtained by the logical AND operation 0x1A & 0x03 of the identifier 0x1A and the mask data 0x03. Because the lower 2 bits 0x02 differ from the predetermined value 0x01, the end of the window W at this time is not set as a chunk cut-out point.

Suppose the identifier h_α(“ f”) of the character string “ f” in the window W is 0x99. The lower 2 bits of the identifier 0x99 are then 0x01. Because this equals the predetermined value 0x01, the end of the window W at this time is set as a chunk cut-out point. As a result, the character string “The f” is cut out.
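The window procedure just described can be sketched as follows. The real hash function h_α is not specified, so `example_hash` is a stand-in that returns the two identifier values given in the text (0x1A for “Th”, 0x99 for “ f”) and, as an assumption, 0x00 for every other window. The sketch also assumes that the window keeps sliding across a cut-out point and that any trailing data becomes a final chunk.

```python
# Identifier values taken from the worked example; all other windows
# are ASSUMED to hash to 0x00 so that they never trigger a cut.
EXAMPLE_IDS = {"Th": 0x1A, " f": 0x99}


def example_hash(fragment):
    """Stand-in for the unspecified window hash h_alpha."""
    return EXAMPLE_IDS.get(fragment, 0x00)


def cut_chunks(data, window_hash, window=2, mask=0x03, magic=0x01):
    """Cut `data` wherever the masked window identifier equals `magic`."""
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        ident = window_hash(data[end - window:end])
        if (ident & mask) == magic:
            # The end of the window becomes the chunk cut-out point.
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder (assumption)
    return chunks


chunks = cut_chunks("The fil", example_hash)
```

With the stand-in hash, the first chunk is “The f”, as in the description, and the remaining “il” is emitted as a trailing chunk.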

In this example, the condition for determining a chunk cut-out point is that the lower 2 bits of the identifier match a predetermined value. Note that this number of bits determines the average chunk size. With 2 bits, for example, the average chunk size is 2² = 4 bytes.
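The relationship between the number of compared bits and the average chunk size can be made explicit. The following derivation assumes that the compared identifier bits are uniformly distributed and independent from one window position to the next, which the description does not state. Here k is the number of compared bits, p is the probability that a given position is a cut-out point, and L is the chunk length in bytes.

```latex
p = 2^{-k}, \qquad
\mathbb{E}[L] = \sum_{n \ge 1} n \, p \, (1 - p)^{n-1} = \frac{1}{p} = 2^{k}
```

For k = 2 this gives an expected length of 4 bytes, matching the figure in the text.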

After a chunk has been cut out as described above, the identifier of the chunk itself is obtained, for example by calculating the hash value of the chunk, as shown in FIG. 25. The cut-out chunk is denoted C_A, and the hash function used to calculate the hash value is denoted h_β( ). In the example of FIG. 25, where the character string constituting chunk C_A is “The f”, the identifier of chunk C_A is expressed as h_β(“The f”) = h_β(C_A). In the following description, the identifier of chunk C_A is denoted H_A.

Next, as in Procedure 3 above, the identifier of the obtained chunk is compared with the identifier of each chunk already stored in the storage device. If a stored chunk has the same identifier as the obtained chunk, processing proceeds on the assumption that a chunk with the same content is already stored in the storage device. If no stored chunk has the same identifier, the obtained chunk is processed as a new chunk not yet stored in the storage device.
Duplicate detection and elimination are performed by repeating the above processing each time a chunk is obtained.

With the chunk cut-out method described in Patent Document 1, document 111 (data), whose document name is “document #1” as shown in FIG. 26, is divided into a group of chunks C_x that includes nine chunks whose identifiers H_x are H_A to H_I. As shown in FIG. 26, the identifier H_x of each chunk C_x cut out from document 111 is registered in the document configuration table 251 in association with the document name “document #1”. In addition, each identifier H_x, including H_A to H_I, is registered in the chunk list table 252 together with the corresponding chunk C_x, as shown in FIG. 26.
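The two tables can be modeled with plain dictionaries. This is a sketch under stated assumptions: the class name `ChunkStore`, its method names, and the use of SHA-256 for h_β are illustrative and do not come from the description, and the chunk contents are hypothetical.

```python
import hashlib


class ChunkStore:
    """Toy model of the document configuration table and chunk list table."""

    def __init__(self):
        self.document_table = {}  # document name -> ordered chunk identifiers
        self.chunk_table = {}     # chunk identifier -> chunk bytes

    def store(self, name, chunks):
        ids = []
        for chunk in chunks:
            chunk_id = hashlib.sha256(chunk).hexdigest()  # assumed h_beta
            # Register the chunk only if its identifier is not yet present.
            self.chunk_table.setdefault(chunk_id, chunk)
            ids.append(chunk_id)
        self.document_table[name] = ids

    def restore(self, name):
        """Rebuild a document by following its identifiers in order."""
        return b"".join(self.chunk_table[i] for i in self.document_table[name])


store = ChunkStore()
store.store("document #1", [b"The f", b"ile ", b"name"])
store.store("document #2", [b"The f", b"ile ", b"ABCD", b"name"])
```

The second document adds only one entry to the chunk table, because three of its four chunks are already registered.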

In general, minimum and maximum limits are often placed on the chunk length when chunks are cut out. In that case, the position at which the maximum length is reached is forcibly set as a chunk cut-out point.
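A minimal sketch of the length limits follows. The parameter names and default values are illustrative assumptions, as is the choice to ignore hash matches until the minimum length is reached. The sketch also assumes `window <= max_len` and allows the final chunk to be shorter than `min_len`.

```python
def cut_chunks_bounded(data, window_hash, window=2, mask=0x03, magic=0x01,
                       min_len=2, max_len=8):
    """Content-defined cutting with minimum and maximum chunk lengths."""
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        length = end - start
        if length < min_len:
            continue  # too short: ignore any match (assumed policy)
        hit = (window_hash(data[end - window:end]) & mask) == magic
        if hit or length >= max_len:
            # Either a content-defined cut or a forced cut at max_len.
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # remainder may be shorter than min_len
    return chunks


# A hash that never matches forces a cut at every max_len bytes.
never = cut_chunks_bounded(b"x" * 20, lambda w: 0x00)
# A hash that always matches is held back by min_len.
always = cut_chunks_bounded(b"y" * 10, lambda w: 0x01, min_len=3)
```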

Patent Document 2 describes applying the Rolling Hashing technique to the method of obtaining identifiers while shifting the window in order to find chunk cut-out points.
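The description names the technique only. As an illustration of the general idea, the sketch below uses a polynomial (Rabin-Karp style) rolling hash, which is a swapped-in example and not necessarily the function used in Patent Document 2. The constants `BASE` and `MOD` are arbitrary assumptions. Each shift updates the previous hash in constant time instead of rehashing the whole window.

```python
BASE = 257            # assumed polynomial base
MOD = (1 << 31) - 1   # assumed modulus


def direct_hash(window_bytes):
    """Hash a window from scratch (reference implementation)."""
    h = 0
    for b in window_bytes:
        h = (h * BASE + b) % MOD
    return h


def rolling_hashes(data, window):
    """Yield the hash of every window of `data`, updating incrementally."""
    if len(data) < window:
        return
    top = pow(BASE, window - 1, MOD)  # weight of the outgoing byte
    h = direct_hash(data[:window])
    yield h
    for i in range(window, len(data)):
        # Remove the outgoing byte, shift, and append the incoming byte.
        h = ((h - data[i - window] * top) * BASE + data[i]) % MOD
        yield h


data = b"The file name"
rolled = list(rolling_hashes(data, 4))
direct = [direct_hash(data[i:i + 4]) for i in range(len(data) - 3)]
```

The rolled values equal the values computed from scratch for every window position.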

Variable-length deduplication techniques such as those described in Patent Documents 1 and 2 have the following two problems.
<First problem>
To improve the deduplication rate, the average chunk length must be shortened. However, shortening the average chunk length increases the number of chunks, and therefore the number of identifiers (more precisely, pairs of an identifier and a chunk) registered in the chunk list table. The chunk list table is commonly implemented as a hash table, and the hash table grows as the number of identifiers increases. It then becomes difficult to hold the entire chunk list table in fast-access memory, and the table has to be placed on, for example, a disk device that has a larger capacity than memory but is slower. This significantly degrades performance.

<Second problem>
Conversely, to hold the entire chunk list table in memory, the table must be made small enough to fit in that memory. This requires keeping the number of identifiers registered in the chunk list table small, which in turn means lengthening the average chunk length and thus lowers the deduplication rate.

These problems are illustrated below with examples. FIG. 27 shows an example of the first problem, and FIG. 28 shows an example of the second problem. In the example of FIG. 27, the average chunk length is shorter than in FIG. 28, so the deduplication rate is higher than in FIG. 28. The higher deduplication rate can be confirmed by summing the sizes of the chunks registered in the chunk list table 252 in association with identifiers. However, the number of identifiers is larger than in FIG. 28.

Thus, the deduplication rate and the number of chunks (the average chunk length) are in a trade-off relationship.
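The table-size side of this trade-off can be put into rough numbers. Every figure below is a hypothetical assumption chosen for illustration: 1 TiB of unique data, 40 bytes of table space per identifier entry, and average chunk lengths of 4 KiB and 64 KiB. None of these values come from the description.

```python
def estimate_chunk_table(unique_bytes, avg_chunk_len, bytes_per_entry=40):
    """Return (entry count, table size in bytes) for a chunk list table.

    All inputs are illustrative assumptions, not measured values.
    """
    entries = unique_bytes // avg_chunk_len
    return entries, entries * bytes_per_entry


ONE_TIB = 2 ** 40
small_chunks = estimate_chunk_table(ONE_TIB, 4 * 1024)    # short average length
large_chunks = estimate_chunk_table(ONE_TIB, 64 * 1024)   # long average length
```

Under these assumptions, the shorter average length needs a table sixteen times larger (about 10 GiB against 640 MiB), which is the pressure that pushes the table out of memory and onto disk.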

Recently, various methods have been studied and adopted to solve the problems described above. For example, Patent Document 3 discloses a method that concatenates two or more cut-out chunks (for convenience, a group of concatenated chunks is referred to here as a “concatenated chunk”) and performs duplicate detection at least in units of concatenated chunks, thereby maintaining a high deduplication rate while reducing the number of chunks.

The method disclosed in Patent Document 3 is characterized in that it dynamically switches between concatenating and not concatenating chunks, and provides a “buffer area” consisting of a group of non-concatenated chunks between one concatenated chunk and the next. In this method, concatenation is switched dynamically on the basis of the following first and second conditions.

The first condition is “whether the group of chunks provisionally selected for concatenation duplicates a concatenated chunk already registered in the system”.
The second condition is “whether the groups of chunks immediately before and after the group of chunks selected for concatenation are concatenated chunks” and “whether those adjacent groups of chunks duplicate concatenated chunks already registered in the system”.

Patent Document 1: US Pat. No. 5,990,810
Patent Document 2: US Pat. No. 6,810,398
Patent Document 3: US Patent Application Publication No. 2008/0133561

However, in a method such as that disclosed in Patent Document 3, the conditions for dynamically switching between concatenation and non-concatenation are complicated, which is undesirable for implementation. In addition, many small chunks are generated in the “buffer area” provided between concatenated chunks. This increases the number of entries in the document configuration table and the chunk list table and, as a result, degrades performance.

Furthermore, in a method such as that disclosed in Patent Document 3, groups of multiple non-concatenated chunks are always concatenated, and duplicate detection and elimination are performed in units of the concatenated result. Consequently, when the chunks produced by variable-length chunk cut-out are long, duplicate detection and elimination are performed in units of even larger concatenated chunks, which lowers the deduplication rate.

The present invention has been made in view of the above circumstances, and its object is to provide a data division method and apparatus that can achieve a high deduplication rate with a simple mechanism while keeping the number of divided data fragments small.

According to one aspect of the present invention, there is provided a data division method for dividing arbitrary data into a plurality of first data fragments of arbitrary length while performing duplicate detection, in an apparatus including input means, first data fragment determination means, second data fragment determination means, third data fragment determination means, duplicate detection means, and control means. The data division method comprises:
an input step in which the input means inputs the arbitrary data;
a first determination step in which the second data fragment determination means sequentially determines second data fragments of arbitrary length or of a predetermined length from the remaining portion of the input data that has not yet been determined as a first data fragment;
a second determination step in which the third data fragment determination means determines, as one third data fragment, a single second data fragment itself or a combination of a plurality of second data fragments determined in the first determination step before a state satisfying a predetermined first condition is reached;
a duplicate detection step in which the duplicate detection means detects whether the determined third data fragment is a duplicate, according to whether a first data fragment whose bit string matches the determined third data fragment has already been determined;
a third determination step in which, when the duplication is detected, the first data fragment determination means determines the determined third data fragment as a first data fragment;
a first control step in which, when the duplication is not detected, the control means repeats the following control until the duplication is detected in a state satisfying a predetermined second condition: the first and second determination steps are re-executed so that one new second data fragment or a plurality of new second data fragments are determined before a state satisfying the first condition is reached, and the third data fragment determination means determines, as one new third data fragment, the new second data fragment itself, a combination of the new second data fragments, a combination of part of the third data fragment for which no duplication was detected and the new second data fragment, or a combination of part of that third data fragment and the new second data fragments;
a fourth determination step in which, when the duplication is not detected even though the first control step has been repeated in a state satisfying the second condition, the first data fragment determination means determines, as a new first data fragment, a single second data fragment itself or a combination of a plurality of second data fragments that satisfies the first condition, from among the second data fragments determined in the meantime; and
a second control step in which the control means repeats the first control step until all of the input data has been divided into first data fragments.
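One possible reading of this flow is sketched below. It is not the embodiment itself and makes several loud assumptions: the second data fragments are supplied as a ready-made list, the first condition is modeled as “a candidate consists of `group` second data fragments”, the second condition is modeled as “at most `max_shift` offsets are tried”, and second data fragments skipped before a detected duplicate are emitted as a new first data fragment. None of these choices is specified in the text above.

```python
def split_into_parents(children, known, group=3, max_shift=2):
    """Return (fragment, is_duplicate) pairs covering all of `children`.

    children  : list of bytes, the second data fragments in order
    known     : set of bytes, first data fragments already determined
    group     : fragments per candidate (ASSUMED first condition)
    max_shift : offsets tried before giving up (ASSUMED second condition)
    """
    result = []
    pos = 0
    while pos < len(children):
        hit = None
        # Slide the candidate (third data fragment) one second fragment at a time.
        for shift in range(max_shift + 1):
            candidate = b"".join(children[pos + shift:pos + shift + group])
            if candidate and candidate in known:
                hit = shift
                break
        if hit != 0:
            # No duplicate at offset 0. ASSUMPTION: emit either the skipped
            # fragments (duplicate found later) or the first candidate (none found).
            count = group if hit is None else hit
            fresh = b"".join(children[pos:pos + count])
            already = fresh in known
            known.add(fresh)
            result.append((fresh, already))
            pos += count
        if hit is not None:
            # The matching candidate becomes a first data fragment as is.
            result.append((b"".join(children[pos:pos + group]), True))
            pos += group
    return result


known = set()
first = split_into_parents(
    [b"The ", b"file ", b"name ", b"is ", b"not ", b"valid"], known)
second = split_into_parents(
    [b"ABCD ", b"The ", b"file ", b"name ", b"is ", b"not ", b"valid"], known)
```

In this sketch the second input, which differs only by a leading “ABCD ”, resynchronizes after one offset. It yields one new small fragment followed by two duplicates of the long fragments produced for the first input.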

According to the present invention, in a data division method and apparatus for dividing arbitrary data into a plurality of first data fragments of arbitrary length while performing duplicate detection, the length of a second data fragment is used as the offset interval for duplicate detection, while duplicate detection itself is performed on a third data fragment, which is likely to be longer than a second data fragment and likely to be used as a first data fragment. Compared with the prior art, this configuration allows duplicate detection that is simpler and faster, and that keeps the deduplication rate high while reducing the number of first data fragments (that is, the number of chunks or divisions).

FIG. 1 is a block diagram showing the configuration of a storage system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the document storage device shown in FIG. 1.
FIG. 3 is a block diagram mainly showing the functional configuration of the document storage device shown in FIG. 1.
FIG. 4 is a flowchart showing the procedure of the document storage processing applied in the embodiment.
FIG. 5 is a flowchart showing the procedure of the document storage processing applied in the embodiment.
FIG. 6 is a diagram for explaining the relationship between a parent chunk and a group of child chunks applied in the embodiment.
FIG. 7 is a diagram for explaining a method of determining the cut-out point of a child chunk.
FIG. 8 is a diagram showing first and second documents and the states of the document configuration table and the chunk list table before the first and second documents are stored.
FIG. 9 is a diagram showing the storage operation (part 1) for storing the first document, together with the state of the document configuration table.
FIG. 10 is a diagram showing the storage operation (part 2) for storing the first document, together with the state of the document configuration table.
FIG. 11 is a diagram showing the storage operation (part 3) for storing the first document, together with the state of the document configuration table.
FIG. 12 is a diagram showing the storage operation (part 4) for storing the first document, together with the state of the document configuration table.
FIG. 13 is a diagram showing the storage operation (part 5) for storing the first document, together with the state of the document configuration table.
FIG. 14 is a diagram showing the states of the document configuration table and the chunk list table after the first document is stored, together with the first document and the sequence of parent chunks cut out from it.
FIG. 15 is a diagram showing the storage operation (part 1) for storing the second document, performed after the first document is stored, together with the state of the document configuration table.
FIG. 16 is a diagram showing the storage operation (part 2) for storing the second document, together with the state of the document configuration table.
FIG. 17 is a diagram showing the storage operation (part 3) for storing the second document, together with the state of the document configuration table.
FIG. 18 is a diagram showing the storage operation (part 4) for storing the second document, together with the state of the document configuration table.
FIG. 19 is a diagram showing the storage operation (part 5) for storing the second document, together with the state of the document configuration table.
FIG. 20 is a diagram showing the states of the document configuration table and the chunk list table after the second document is stored, together with the second document and the sequence of parent chunks cut out from it.
FIG. 21 is a diagram showing the states of the document configuration table and the chunk list table after the first and second documents are stored, together with the first and second documents and the sequences of parent chunks cut out from them.
FIG. 22 is a flowchart showing the procedure of the document acquisition processing applied in the embodiment.
FIG. 23 is a diagram for explaining the difference between the fixed-length deduplication method and the variable-length deduplication method in the prior art.
FIG. 24 is a diagram showing an example of the process of setting chunk cut-out points in the variable-length deduplication method of the prior art.
FIG. 25 is a diagram for explaining the chunk cut-out method of the prior art.
FIG. 26 is a diagram showing an example in which a document is cut into chunks by the chunk cut-out method of the prior art and the document configuration table and the chunk list table are constructed.
FIG. 27 is a diagram showing an example of the first problem in the prior art.
FIG. 28 is a diagram showing an example of the second problem in the prior art.

Embodiments of the present invention are described below with reference to the drawings.
<System configuration>
FIG. 1 is a block diagram showing the configuration of a storage system according to an embodiment of the present invention. This storage system consists of a document storage device 10 and a client device 20, which are connected by, for example, a network 30. The document storage device 10 is a data storage device that stores documents by dividing them into chunks. The client device 20 uses the document storage device 10 as its own storage device. That is, in accordance with, for example, an application program running on the client device 20, the client device 20 has the document storage device 10 store a document by issuing a document storage instruction to it, and acquires a document from the document storage device 10 by issuing a document acquisition instruction to it. The document storage device 10 and the client device 20 may instead be connected directly, or the functions of the client device 20 may be built into the document storage device 10.

When the client device 20 gives the document storage device 10 a document storage instruction specifying a document by document name, the document storage device 10 divides that document into chunks while detecting and eliminating duplicates, following the procedure described later, and stores the document in the document storage unit 31 (see FIG. 3) described later. When the client device 20 gives the document storage device 10 a document acquisition instruction specifying a document by document name, the document storage device 10 retrieves that document from the document storage unit 31 and outputs it to the client device 20.

Here, a document refers to, for example, a file or the data in the file, and a document name refers to a file name. To distinguish a file from the data in it, the data in the file may also be called the data of the document, or document data. A chunk is a fragment of data (a data fragment). In this embodiment, three kinds of data fragment are defined: a "first data fragment", a "second data fragment", and a "third data fragment". In the following description, the "second data fragment" is called a "child chunk" and the "third data fragment" a "parent chunk". The "first data fragment" is called a "registered data fragment", a "registered parent chunk", or simply a "parent chunk".

<Hardware configuration of document storage device 10>
In this embodiment, the document storage device 10 is implemented using a computer. FIG. 2 is a block diagram showing the hardware configuration of such a document storage device 10. As shown in FIG. 2, the document storage device 10 has a well-known hardware configuration comprising at least one processing unit 21, a main storage device 22, an auxiliary storage device 23, a communication mechanism 24, and an input/output device 25. The auxiliary storage device 23 is configured using, for example, a hard disk drive, and includes a storage medium 231 storing a program 230 to be executed by the processing unit 21. In this embodiment, the storage medium 231 is a disk medium.

<Functional configuration of document storage device 10>
FIG. 3 is a block diagram mainly showing the functional configuration of the document storage device 10. The document storage device 10 includes a document storage unit 31, an instruction reception module 32, a variable-length deduplication module 33, and a work memory 34. In this embodiment, when the document storage device 10 is built from a computer with the hardware configuration shown in FIG. 2, the instruction reception module 32 and the variable-length deduplication module 33 are realized by the processing unit 21 of that computer loading the program 230 stored in the auxiliary storage device 23 into the main storage device 22 and executing it. However, at least one of the instruction reception module 32 and the variable-length deduplication module 33 may be implemented in hardware.

The document storage unit 31 stores a group of documents using a document configuration table 311 and a chunk list table 312. The document storage unit 31 is implemented using part of the storage area of the auxiliary storage device 23 shown in FIG. 2. The document configuration table 311 and the chunk list table 312 correspond, respectively, to the document configuration table 251 and the chunk list table 252 used in the prior art (see FIGS. 26 to 28).

For each document stored in the document storage unit 31, the document configuration table 311 holds the document name in association with an array (that is, a list) of the identifiers (hash values) of the chunks that make up the document. For each chunk making up a document stored in the document storage unit 31, the chunk list table 312 holds the chunk's data fragment in association with the chunk's identifier (hash value). In other words, a document is stored in the document storage unit 31 divided into the group of chunks that make it up.
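The relationship between the two tables can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class name `DocumentStore`, the method `read_document`, and the placeholder identifiers used below are hypothetical and do not appear in the description, which specifies only what each table associates.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentStore:
    """Hypothetical in-memory stand-in for document storage unit 31."""

    # Document configuration table: document name -> ordered chunk identifiers.
    document_table: Dict[str, List[str]] = field(default_factory=dict)
    # Chunk list table: chunk identifier (hash value) -> chunk data fragment.
    chunk_table: Dict[str, bytes] = field(default_factory=dict)

    def read_document(self, name: str) -> bytes:
        # Concatenate the fragments in the order the document lists them.
        return b"".join(self.chunk_table[h] for h in self.document_table[name])
```

Because a document holds only identifiers, two documents that share a fragment both point at a single entry in the chunk list table.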

In this embodiment, the document configuration table 311 and the chunk list table 312 are loaded into the work memory 34 when the document storage device 10 starts up (for example, when it is powered on), to speed up access, and are used from there. The tables loaded in the work memory 34 are written back to the document storage unit 31 when, for example, the document storage device 10 has performed no processing for a certain period, or when the document storage device 10 stops operating (for example, when it is powered off). For convenience, however, the following description assumes that the document configuration table 311 and the chunk list table 312 in the document storage unit 31 are used even after the document storage device 10 has started.

The instruction reception module 32 accepts an instruction from the client device 20 and operates according to its content. When the instruction is a document storage instruction, the instruction reception module 32 passes it to the variable-length deduplication module 33, causing that module to perform the document storage process. The instruction reception module 32 includes a document acquisition unit 320, which operates when the instruction from the client device 20 is a document acquisition instruction. In accordance with the document acquisition instruction, the document acquisition unit 320 performs a document acquisition process to obtain the data of the document with the specified document name from the document storage unit 31. The document data obtained by the document acquisition unit 320 is output to the client device 20 by the instruction reception module 32.

In accordance with the document storage instruction passed from the instruction reception module 32, the variable-length deduplication module 33 stores the specified document in the document storage unit 31 while performing a chunk cut-out process, which cuts variable-length chunks out of the document's data, and a duplicate detection and elimination process, which detects and eliminates duplication for each chunk cut out. The variable-length deduplication module 33 includes a child chunk determination unit 331, a parent chunk determination unit 332, an identifier generation unit 333, a duplicate detection unit 334, a parent chunk registration unit 335, and a control unit 336.

The child chunk determination unit 331 determines variable-length chunks as child chunks. The parent chunk determination unit 332 determines, as a parent chunk, either a sequence of consecutive child chunks determined by the child chunk determination unit 331 or a single such child chunk.

The identifier generation unit 333 generates the identifier of a chunk (here, a parent chunk), which is used in cutting out the chunk and in duplicate detection. In this embodiment, the hash value of the chunk is used as the identifier. This hash value is generated using a hash function such as SHA1.
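As a minimal sketch of such an identifier, the following uses SHA-1 from Python's standard `hashlib` module. The function name `chunk_identifier` and the choice of a hexadecimal string as the stored form are assumptions made for illustration; the description says only that a hash function such as SHA1 is used.

```python
import hashlib


def chunk_identifier(data: bytes) -> str:
    """Return the SHA-1 digest of a chunk's data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()
```

The same bytes always yield the same identifier, which is what lets a stored fragment be found again by hash value alone.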

Based on the identifier of the parent chunk determined by the parent chunk determination unit 332, the duplicate detection unit 334 detects duplication, that is, whether the data fragment with that identifier is already registered in the chunk list table 312.

Based on the duplicate detection result of the duplicate detection unit 334, the parent chunk registration unit 335 performs parent chunk registration processing, which registers the parent chunk in the document configuration table 311 and the chunk list table 312 in the document storage unit 31.
The control unit 336 controls the operations of the child chunk determination unit 331, the parent chunk determination unit 332, the identifier generation unit 333, and the duplicate detection unit 334.

The work memory 34 provides a working storage area for the chunk cut-out process and the duplicate detection and elimination process performed by the variable-length deduplication module 33. It is implemented using part of the storage area of the main storage device 22 shown in FIG. 2. Part of the storage area of the work memory 34 is used as a document buffer 341 for temporarily holding the document data to be processed. Another part is used as a register unit 342 for temporarily holding the variables used in the processing. The register unit 342 includes an i register, a j register, and a k register, which hold the child chunk numbers i, j, and k respectively; a child chunk start offset register, which holds the start offset c_k.offset (described later) of the child chunk with child chunk number k; and a child chunk length register, which holds the length (child chunk length) c_k.len of that child chunk.

<Document storage process>
Next, the document storage process in the document storage device 10 will be described with reference to FIGS. 4 to 6. FIGS. 4 and 5 are flowcharts showing the procedure of the document storage process, and FIG. 6 is a diagram for explaining the relationship between parent chunks and groups of child chunks when the document to be stored in the document storage device 10 is the document 111 shown in FIG. 23.

First, suppose that a document storage instruction has been sent from the client device 20 to the document storage device 10 via the network 30. This document storage instruction contains a document name specifying the document to be stored in the document storage device 10.

The document storage instruction sent from the client device 20 to the document storage device 10 is accepted by the instruction reception module 32 of the document storage device 10. On accepting the instruction, the instruction reception module 32 functions as input means: it receives from the client device 20 the data of the document with the document name specified in the instruction, and stores that data in the document buffer 341 in the work memory 34. The instruction reception module 32 then passes the document storage instruction to the variable-length deduplication module 33, which executes the document storage process following the procedure shown in the flowcharts of FIGS. 4 and 5. That is, the variable-length deduplication module 33 repeats the following processing over the document data stored in the document buffer 341, for example from its beginning to its end.

First, to prepare for the cutting out of child chunks (the determination of cut-out points) by the child chunk determination unit 331, the control unit 336 of the variable-length deduplication module 33 initializes the child chunk number k, which designates the child chunk c_k, to 0, and initializes the offset (start offset) c_k.offset of the child chunk c_k to 0, which indicates the head position of the document data (here, the position of the first byte) (step 401). That is, the control unit 336 sets 0 as the child chunk number k (k = 0) in the k register of the register unit 342, and sets 0 as the start offset c_k.offset (c_k.offset = 0) in the child chunk start offset register of the register unit 342. The start offset c_k.offset of the child chunk c_k indicates the start cut-out point of that child chunk, expressed as an offset (relative position) from the head position of the document data. Note that at this point the end offset, which indicates the end cut-out point of the child chunk c_k, has not yet been determined.

Next, the control unit 336 sets both of the child chunk numbers j and i, of the child chunks c_j and c_i, to k (step 402). That is, it sets k (here k = 0) as the child chunk numbers j and i in the j register and the i register of the register unit 342. The child chunk c_i is the first child chunk of the parent chunk obtained first in the series of processing (registered parent chunk determination processing) that determines one parent chunk to be registered in the chunk list table 312. The child chunk c_j is the first child chunk of the latest parent chunk obtained in the registered parent chunk determination processing.
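Steps 401 and 402 amount to initializing a handful of variables. The sketch below models the register unit 342 as a plain data class; the class and function names (`Registers`, `step_401`, `step_402`) and the field names are hypothetical, chosen here only to mirror the registers named in the description.

```python
from dataclasses import dataclass


@dataclass
class Registers:
    """Hypothetical stand-in for register unit 342."""

    i: int = 0             # first child chunk of the initial parent chunk
    j: int = 0             # first child chunk of the latest parent chunk
    k: int = 0             # child chunk currently being cut out
    child_offset: int = 0  # c_k.offset, start offset of child chunk c_k
    child_len: int = 0     # c_k.len, filled in later by step 403


def step_401(reg: Registers) -> None:
    # Start at the first child chunk, at the head of the document data.
    reg.k = 0
    reg.child_offset = 0


def step_402(reg: Registers) -> None:
    # Both parent-chunk markers begin at the current child chunk.
    reg.j = reg.k
    reg.i = reg.k
```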

Then the child chunk determination unit 331 in the variable-length deduplication module 33 determines the end offset, which indicates the end cut-out point of the child chunk c_k, and thereby obtains the length of the child chunk c_k. It sets that length in the child chunk length register of the register unit 342 as the child chunk length c_k.len (step 403). To determine the end offset of the child chunk c_k, that is, to fix the cut-out point of the child chunk c_k (a variable-length chunk), the methods described in Patent Documents 1 and 2 can be applied, as can the method described later with reference to FIG. 7.

Next, the control unit 336 of the variable-length deduplication module 33 determines whether the value obtained by adding the child chunk length c_k.len to the start offset c_k.offset of the child chunk c_k (c_k.offset + c_k.len) is less than the size of the document data (step 404). This determination is made to confirm that the end of the child chunk c_k has not reached the end (end position) of the document data.

If "c_k.offset + c_k.len" is less than the size of the document data (No in step 404), the control unit 336 determines whether the value obtained by subtracting the offset c_j.offset of the child chunk c_j from "c_k.offset + c_k.len" (c_k.offset + c_k.len - c_j.offset) is equal to or greater than a predetermined connection window size W (step 405). In this embodiment, the connection window size W is assumed to be 10 bytes (W = 10).

"c_k.offset + c_k.len - c_j.offset" represents the length of the sequence of child chunks obtained by concatenating the child chunks c_j to c_k, whose child chunk numbers run from j to k. "c_j.offset", the offset of the first child chunk c_j in the concatenated sequence, is the start offset of that sequence. As described later, this becomes the start offset of the parent chunk p that is determined. To distinguish it from the start offset of a child chunk, it is called the parent chunk start offset.
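Why this expression equals the total length of the concatenated child chunks can be written out as a short derivation. It relies only on the update rule of step 406, under which each child chunk starts where the previous one ends; the summation index m is introduced here for illustration.

```latex
\begin{aligned}
c_{m+1}.\mathrm{offset} &= c_m.\mathrm{offset} + c_m.\mathrm{len} \qquad (j \le m < k) \\
c_k.\mathrm{offset} &= c_j.\mathrm{offset} + \sum_{m=j}^{k-1} c_m.\mathrm{len} \\
c_k.\mathrm{offset} + c_k.\mathrm{len} - c_j.\mathrm{offset} &= \sum_{m=j}^{k} c_m.\mathrm{len}
\end{aligned}
```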

When step 405 is executed for the first time after steps 401 and 402, the child chunk c_k is the first child chunk, with child chunk number k equal to 0; in the example of FIG. 6, this is the data fragment "The f". At this point the child chunk c_k coincides with the child chunk c_j, so "c_k.offset" equals "c_j.offset". In this case, "c_k.offset + c_k.len - c_j.offset" equals the length "c_k.len" of the child chunk c_k.

If "c_k.offset + c_k.len - c_j.offset" is not equal to or greater than the connection window size W (No in step 405), the control unit 336 updates the child chunk start offset register in the register unit 342 so that it indicates the start offset c_{k+1}.offset of the next child chunk c_{k+1}. Specifically, it adds the length of the child chunk c_k currently set in the child chunk length register to the offset c_k.offset currently held in the child chunk start offset register, and sets the result in that register as c_{k+1}.offset (step 406). In step 406 the control unit 336 also increments the child chunk number k held in the k register by 1. After the increment, the child chunk number k designates the child chunk c_{k+1}, which follows the child chunk c_k as it was before step 406, as the new child chunk c_k. The child chunk start offset register then indicates the start offset of the new child chunk c_k.

When the control unit 336 has executed step 406, the child chunk determination unit 331 executes step 403 again to obtain the length of the new child chunk c_k, and sets it as the child chunk length c_k.len. In the example of FIG. 6, if the child chunk c_k before step 406 is the first child chunk "The f", the data fragment "ile" following "The f" is determined as the new child chunk c_k. The processing described above is then repeated based on the child chunk length c_k.len of this new child chunk c_k.

Then, in the example of FIG. 6, the data fragment " na" following "ile" is determined as the next new child chunk c_k. Let L1 be the value of "c_k.offset + c_k.len - c_j.offset" at this point, that is, the length of the sequence obtained by concatenating the child chunks c_j ("The f") to c_k (" na"), whose child chunk numbers run from j to k. As shown in FIG. 6, this length L1 is equal to or greater than the connection window size W.

When "c_k.offset + c_k.len - c_j.offset" thus becomes equal to or greater than the connection window size W (Yes in step 405), the parent chunk determination unit 332 concatenates the child chunks c_j to c_k, whose child chunk numbers run from j to k, into one and defines the result as a parent chunk p (step 407). In the example of FIG. 6, the first three child chunks of the document 111 (the document whose document name is "document #1") are concatenated and determined as the parent chunk p1 (p = p1). To distinguish this parent chunk p1 from the subsequent parent chunks p2 and p3, which are defined by resetting the parent chunk start offset as described later, it is sometimes called the initial parent chunk. As the determination condition of step 405, any of the following may be applied instead of the condition described above: (1) "c_k.offset + c_k.len - c_j.offset" exceeds the connection window size W; (2) "c_k.offset + c_k.len - c_j.offset" equals the connection window size W; (3) "c_k.offset + c_k.len - c_j.offset" comes closest to the connection window size W; or (4) "c_k.offset + c_k.len - c_j.offset" takes the largest value that does not exceed the connection window size W.
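The loop formed by steps 403 to 406 can be sketched as follows, under two simplifying assumptions: the child chunk lengths are given in advance rather than cut out one at a time, and only the default condition of step 405 (length equal to or greater than W) is modeled. The function name `find_initial_parent_end` is hypothetical, and returning `None` when the data runs out is a placeholder, since the handling of the end of the document is not covered in this passage.

```python
from typing import List, Optional

W = 10  # connection window size in bytes, the value used in the embodiment


def find_initial_parent_end(child_lens: List[int], j: int, w: int = W) -> Optional[int]:
    """Return the smallest k >= j for which child chunks j..k span at least w bytes."""
    span = 0
    for k in range(j, len(child_lens)):
        span += child_lens[k]  # equals c_k.offset + c_k.len - c_j.offset
        if span >= w:          # Yes in step 405
            return k
    return None                # window not filled before the data ran out
```

With child chunk lengths 5, 3, and 3 (the fragments "The f", "ile", and " na"), the span reaches 11 bytes at k = 2, so the first three child chunks form the initial parent chunk.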

In step 407, the parent chunk determination unit 332 obtains the data c_{j...k}.data, formed by concatenating the child chunks c_j to c_k, as the data (data fragment) p_data of the parent chunk p. Also in step 407, the parent chunk determination unit 332 has the identifier generation unit 333 generate the hash value p_hash of the data p_data, which is used as the identifier of that data. If the hash function used to obtain this hash value is written hash( ), the hash value p_hash is obtained by computing hash(p_data), that is, hash(c_{j...k}.data). If the child chunk number j equals k, that is, if a single child chunk alone is equal to or greater than the connection window size W, that single child chunk itself is determined to be the parent chunk p.
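A minimal sketch of step 407 follows, assuming the child chunks are available as a list of byte strings and that SHA-1 serves as hash( ). The function name `make_parent` is hypothetical.

```python
import hashlib
from typing import List, Tuple


def make_parent(children: List[bytes], j: int, k: int) -> Tuple[bytes, str]:
    """Concatenate child chunks j..k and hash the result."""
    p_data = b"".join(children[j:k + 1])       # c_{j...k}.data
    p_hash = hashlib.sha1(p_data).hexdigest()  # hash(p_data)
    return p_data, p_hash
```

When j equals k the slice contains a single child chunk, so that chunk becomes the parent chunk unchanged.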

Next, the duplicate detection unit 334 in the variable-length deduplication module 33 determines whether the data fragment of the parent chunk p (p = p1) obtained in step 407 is already registered in the chunk list table 312 (step 408). Step 408 is executed to detect duplication, that is, whether a parent chunk with the same content as the parent chunk p is already stored in the document storage unit 31. The determination in step 408 could be made simply by checking whether an identifier (hash value) matching the identifier (hash value) p_hash of the data p_data of the parent chunk p is already registered in the chunk list table 312. However, depending on the hash function used to compute the identifier, the bit string of the data (data fragment) p_data of the parent chunk p does not necessarily match the bit string of the chunk data registered in the chunk list table 312 in a pair with the matching identifier. In step 408, therefore, the duplicate detection unit 334 determines whether the pair of the identifier (hash value) p_hash and the data p_data is registered in the chunk list table 312. More specifically, the duplicate detection unit 334 determines not only whether an identifier matching the identifier p_hash of the parent chunk p is registered in the chunk list table 312, but also whether chunk data whose bit string matches that of the data p_data of the parent chunk p is registered in the chunk list table 312 in association with the matching identifier. This makes duplicate detection more accurate and prevents false matches caused by so-called hash collisions.
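The two-part check of step 408 can be sketched as below. The chunk list table is modeled as a dictionary from identifier to data fragment, which is an assumption of this sketch (it allows only one fragment per identifier); the function name `is_duplicate` is likewise hypothetical.

```python
from typing import Dict


def is_duplicate(chunk_table: Dict[str, bytes], p_hash: str, p_data: bytes) -> bool:
    """Report a duplicate only if both the identifier and the bytes match."""
    stored = chunk_table.get(p_hash)
    # An identifier hit alone is not enough: compare the stored bytes as well.
    return stored is not None and stored == p_data
```

A fragment whose identifier happens to match a registered one but whose bytes differ is therefore not treated as a duplicate.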

If the data (data fragment) p_data of the parent chunk p obtained in step 407 is not registered in the chunk list table 312 (No in step 408), the control unit 336 increments the child chunk number j by 1 (step 409). After the increment, the child chunk number j designates the child chunk that follows the first child chunk in the sequence of child chunks making up the parent chunk p obtained in step 407; this is the first child chunk (new child chunk) c_j of the parent chunk p to be determined next.

The start offset c_j.offset of this new child chunk c_j gives the new parent chunk start offset, shifted toward the end of the document data by the length of the first child chunk in the sequence of child chunks making up the parent chunk p obtained in step 407. In other words, step 409 resets the parent chunk start offset. In the example of FIG. 6, the new (reset) parent chunk start offset coincides with the start offset of the second child chunk from the head of the document 111 (the child chunk whose data fragment is "ile"). As described later, step 409 is equivalent to removing the data fragment of the first child chunk from the parent chunk p obtained in step 407.
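As a small sketch of step 409, assume the start offsets of the child chunks cut out so far are kept in a list. The function name `reset_parent_start` and the list itself are hypothetical; the description keeps only the registers listed earlier.

```python
from typing import List, Tuple


def reset_parent_start(child_offsets: List[int], j: int) -> Tuple[int, int]:
    """Increment j and return it with the new parent chunk start offset."""
    j += 1
    return j, child_offsets[j]
```

For child chunks "The f", "ile", and " na" starting at offsets 0, 5, and 8, moving j from 0 to 1 shifts the parent chunk start offset from 0 to 5, the length of "The f".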

Next, the control unit 336 determines whether “c_j.offset + c_j.len - c_i.offset” is equal to or larger than the concatenation window size W (step 410). Here, c_j.offset indicates the parent chunk start offset reset as described above, while c_i.offset indicates the start offset of the initial parent chunk p. Accordingly, “c_j.offset + c_j.len - c_i.offset” represents the length of the sequence obtained by concatenating the child chunks c_i to c_j, whose child chunk numbers run from i to j. In other words, it represents the “shift” of the parent chunk start offset reset in step 409 from the start offset of the initial parent chunk p.

If “c_j.offset + c_j.len - c_i.offset” is not equal to or larger than the concatenation window size W (No in step 410), the control unit 336 proceeds to step 406. There, it updates the child chunk start offset register in the register unit 342 to “c_k.offset + c_k.len” so that the register indicates the start offset c_{k+1}.offset of the next child chunk c_{k+1}, and increments the child chunk number k by 1. As a result of this increment, the next child chunk c_{k+1} is treated as the new child chunk c_k. When step 406 is executed because of a No determination in step 410, as in this example, the new child chunk c_k is the child chunk that follows the sequence of chunks constituting the previously determined parent chunk p. In the example of FIG. 6, the data fragment “me spe” following “ na” is determined as the data fragment of the new child chunk c_k.

When the control unit 336 executes step 406, the child chunk determination unit 331 executes step 403 again to obtain the length of the new child chunk c_k and sets that length as the child chunk length c_k.len. In this way, in the present embodiment, child chunks are determined until the length of the sequence of child chunks c_j to c_k, starting from the parent chunk start offset reset in step 409, becomes equal to or larger than the concatenation window size W. As is clear from the processing of step 406, the child chunks c_j to c_{k-1} that appear at or after the reset parent chunk start offset and were already determined in earlier processing need not be determined again.

In the example of FIG. 6, the data fragments of the child chunks c_j to c_k at this point are “ile” to “me spe”. Let L2 be the length “c_k.offset + c_k.len - c_j.offset” of the sequence obtained by concatenating the child chunks c_j (“ile”) to c_k (“me spe”). As shown in FIG. 6, this length L2 is equal to or larger than the concatenation window size W.

When “c_k.offset + c_k.len - c_j.offset” becomes equal to or larger than the concatenation window size W, as in this example (Yes in step 405), the parent chunk determination unit 332 concatenates the child chunks c_j to c_k, whose child chunk numbers run from j to k, into one and defines the result as the parent chunk p, as described above (step 407). In the example of FIG. 6, the second to fourth child chunks from the beginning of the document 111 are concatenated and determined as the parent chunk p2 (p = p2). The parent chunk p2 is equivalent to a new parent chunk formed by removing the first child chunk (that is, the first child chunk of the document 111) from the previously determined parent chunk p1 in step 409, and then incorporating the fourth child chunk from the beginning of the document 111 into the remainder of p1 through steps 406, 403, and 407.
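The way p1 and p2 arise from a shared run of child chunks can be sketched as below. The sketch assumes that the first child chunk of the document 111 is “The f” (the fragment cut out in the FIG. 7 walkthrough) and that W is 10 bytes. Neither value is stated for FIG. 6, and the function name `parent_span` is hypothetical.

```python
def parent_span(children, j, W):
    """Return the index k of the last child chunk such that
    children[j..k], concatenated, is at least W bytes long.
    If the document ends first, the last index is returned."""
    length = 0
    for k in range(j, len(children)):
        length += len(children[k])
        if length >= W:
            return k
    return len(children) - 1

# Child chunks named in the text; the leading "The f" is an assumption.
children = [b"The f", b"ile", b" na", b"me spe"]
W = 10

k1 = parent_span(children, 0, W)         # p1 starts at the first child chunk
k2 = parent_span(children, 1, W)         # p2 starts one child chunk later
p1 = b"".join(children[0:k1 + 1])
p2 = b"".join(children[1:k2 + 1])
```

Under these assumptions p1 covers the first to third child chunks and p2 covers the second to fourth, so p2 reuses “ile” and “ na” without cutting them again and only adds “me spe”.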

Next, the duplication detection unit 334 determines whether the data (data fragment) p_data of the parent chunk p (p = p2) obtained in step 407 is already registered in the chunk list table 312 (step 408). If the data p_data of the parent chunk p is not registered in the chunk list table 312 (No in step 408), the control unit 336 proceeds to step 409 as described above and increments the child chunk number j by 1. The parent chunk start offset is thereby reset. In the example of FIG. 6, the reset parent chunk start offset coincides with the start offset of the third child chunk from the beginning of the document 111 (the child chunk whose data fragment is “ na”).

In the example of FIG. 6, “c_j.offset + c_j.len - c_i.offset” at this point, that is, the “shift” of the parent chunk start offset, is not equal to or larger than the concatenation window size W (No in step 410). In this case, the processing including steps 406 and 403 is repeated. As a result, in the example of FIG. 6, the third to fifth child chunks from the beginning of the document 111 are concatenated and determined as the parent chunk p3 (p = p3) of length L3 (L3 ≥ W) (step 407).

If the data p_data of the determined parent chunk p (p = p3) is not registered in the chunk list table 312 (No in step 408), the control unit 336 proceeds to step 409 as described above and increments the child chunk number j by 1. The parent chunk start offset is thereby reset. In the example of FIG. 6, the reset parent chunk start offset coincides with the start offset of the fourth child chunk from the beginning of the document 111 (the child chunk whose data fragment is “me spe”).

In the example of FIG. 6, “c_j.offset + c_j.len - c_i.offset” at this point, that is, the “shift” of the parent chunk start offset, is equal to or larger than the concatenation window size W (Yes in step 410). In this case, the control unit 336 resets the child chunk number k to the value obtained by subtracting 1 from the child chunk number j, that is, to the value of j before it was incremented in step 409 (step 411). The reset child chunk number k thus designates the child chunk c_k at the end of the initial parent chunk. After step 411 is executed, the process proceeds to step 501.

On the other hand, if the parent chunk p determined in step 407 is a registered parent chunk that already exists in the chunk list table 312 (Yes in step 408), the parent chunk registration unit 335 registers the parent chunk p only in the document configuration table 311, and not in the chunk list table 312 (step 412). More specifically, the parent chunk registration unit 335 registers the identifier (hash value) p_hash of the parent chunk p in the document configuration table 311 in association with the document name of the document containing the parent chunk p. When the document configuration table 311 and the chunk list table 312 are empty, that is, when no parent chunk has yet been registered in either table, the parent chunk p first determined in step 407 may be registered in both the document configuration table 311 and the chunk list table 312.

If the document name of the document containing the parent chunk p is already registered in the document configuration table 311, the parent chunk registration unit 335 appends the identifier p_hash of the parent chunk p to the end of the array of identifiers already registered in the document configuration table 311 in association with that document name. As a result, the order of the identifiers registered in the document configuration table 311 for a given document name matches the order in which the corresponding parent chunks are cut out of the document, that is, the order in which those parent chunks appear in the document.

When the parent chunk p has been registered in the document configuration table 311 (step 412), the control unit 336 of the variable-length deduplication module 33 determines whether the child chunk number i is equal to the child chunk number j (step 413). In other words, the control unit 336 determines whether the parent chunk p registered in the document configuration table 311 was determined without the parent chunk start offset being reset (step 409). If the child chunk number i is not equal to the child chunk number j (No in step 413), the process proceeds from step 413 to step 501.

Thus, step 501 is executed when, as a result of repeating the processing including steps 406 and 403, a parent chunk p is determined (step 407), the data p_data of that parent chunk p is found to be already registered in the chunk list table 312 (Yes in step 408), and i = j does not hold (No in step 413). Step 501 is also executed, via step 411, when the “shift” of the parent chunk start offset is found to be equal to or larger than the concatenation window size W (Yes in step 410).

In step 501, the parent chunk determination unit 332 concatenates the child chunks c_i to c_{j-1}, whose child chunk numbers run from i to j-1, into one and defines the result as the parent chunk p. Also in step 501, the parent chunk determination unit 332 obtains the data c_{i...j-1}.data formed by concatenating the child chunks c_i to c_{j-1} as the data p_data of the parent chunk p, and causes the identifier generation unit 333 to generate the hash value (identifier) p_hash (= hash(c_{i...j-1}.data)) of the data p_data.

Next, the parent chunk registration unit 335 registers the parent chunk p in the chunk list table 312 (step 502). More specifically, the parent chunk registration unit 335 registers the identifier (hash value) p_hash of the parent chunk p and the data (data fragment) p_data of the parent chunk p in the chunk list table 312. The parent chunk registration unit 335 also registers the parent chunk p in the document configuration table 311 (step 503). That is, it registers the identifier p_hash of the parent chunk p in the document configuration table 311 in association with the document name of the document containing the parent chunk p.
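The roles of the two tables can be illustrated with a small sketch. It assumes that the chunk list table 312 is a mapping from identifier to data fragment and that the document configuration table 311 is a mapping from document name to an ordered list of identifiers. SHA-1 and the names `register_parent` and `restore` are illustrative assumptions, not taken from the embodiment.

```python
import hashlib

chunk_list = {}   # identifier -> data fragment (stands in for chunk list table 312)
doc_table = {}    # document name -> ordered identifiers (stands in for table 311)

def register_parent(doc_name: str, p_data: bytes) -> str:
    """Register one parent chunk for a document and return its identifier."""
    p_hash = hashlib.sha1(p_data).hexdigest()           # assumed identifier function
    # Store the data only if it is new (step 502 is skipped for duplicates).
    chunk_list.setdefault(p_hash, p_data)
    # Append the identifier in cut-out order (steps 412 and 503).
    doc_table.setdefault(doc_name, []).append(p_hash)
    return p_hash

def restore(doc_name: str) -> bytes:
    """Rebuild a document by concatenating its parent chunks in table order."""
    return b"".join(chunk_list[h] for h in doc_table[doc_name])
```

Because identifiers are appended in the order in which the parent chunks are cut out, a document can be rebuilt by a single pass over its identifier array, while a fragment shared by several documents is stored only once.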

When the process reaches step 501 from step 410 via step 411, the parent chunk p determined in step 501, that is, the parent chunk p composed of the sequence of child chunks c_i to c_{j-1}, coincides with the initial parent chunk p (the parent chunk p1 in the example of FIG. 6).

On the other hand, when the process reaches step 501 from step 413, the sequence of child chunks c_i to c_{j-1} constituting the parent chunk p determined in step 501 is the sequence of child chunks lying between the parent chunk most recently registered in the chunk list table 312 and the parent chunk used for the determination in step 408. When i is equal to j-1, the child chunks c_i to c_{j-1} denote a single child chunk.

When the parent chunk registration unit 335 has executed step 503, one round of the registered parent chunk determination process ends. The control unit 336 then determines, as in step 404, whether “c_k.offset + c_k.len” is less than the size of the document data (step 504).

If “c_k.offset + c_k.len” is less than the size of the document data (Yes in step 504), the control unit 336 determines that processing has not yet reached the end of the document data. In this case, in preparation for the next registered parent chunk determination process, the control unit 336 updates the child chunk start offset register in the register unit 342 to “c_k.offset + c_k.len” so that it indicates the start offset c_{k+1}.offset of the next child chunk c_{k+1}, and increments the child chunk number k by 1 (step 505). After executing step 505, the control unit 336 returns to step 402 and sets both child chunk numbers j and i to k. The parent chunk start offset is thereby reset to the end offset (that is, the end position) of the initial parent chunk in the preceding registered parent chunk determination process. Thereafter, the registered parent chunk determination process, which includes step 403 and follows the same procedure as above, is repeated up to the end of the document data.

Suppose that, after processing has reached the end of the document data, “c_k.offset + c_k.len” is no longer less than the size of the document data (No in step 404). In this case, as in step 407, the parent chunk determination unit 332 concatenates the child chunks c_j to c_k, whose child chunk numbers run from j to k, into one and defines the result as the parent chunk p (step 414). In step 414, the variable-length deduplication module 33 obtains the data c_{j...k}.data formed by concatenating the child chunks c_j to c_k as the data p_data of the parent chunk p, and causes the identifier generation unit 333 to generate the hash value (identifier) p_hash (= hash(c_{j...k}.data)) of the data p_data.

The duplication detection unit 334 then determines whether the data fragment of the parent chunk p obtained in step 414 is already registered in the chunk list table 312 (step 415). If it is not registered (No in step 415), the parent chunk registration unit 335 registers the identifier (hash value) p_hash of the parent chunk p and the data fragment p_data of the parent chunk p in the chunk list table 312 (step 416). The parent chunk registration unit 335 then proceeds to step 412 and registers the identifier p_hash of the parent chunk p in the document configuration table 311 in association with the document name of the document containing the parent chunk p. If, on the other hand, the data fragment of the parent chunk p is already registered in the chunk list table 312 (Yes in step 415), the parent chunk registration unit 335 skips step 416 and executes step 412.

When the parent chunk registration unit 335 has executed step 412, the control unit 336 determines whether the child chunk number i is equal to the child chunk number j (step 413). If they are equal (Yes in step 413), the control unit 336 proceeds to step 504. In step 504, as in step 404, the control unit 336 determines whether “c_k.offset + c_k.len” is less than the size of the document data. In this example, in which the determination in step 404 is No, the determination in step 504 is also No. In this case, the variable-length deduplication module 33 regards processing as having reached the end of the document data and ends the document storage processing.

<Extracting child chunks>
Next, a method for determining the cut-out point of a child chunk (that is, a variable-length data fragment) will be described. As noted earlier, the methods described in Patent Documents 1 and 2 can be applied for this purpose. In addition to those methods, however, the novel method described below can also be applied. The feature of this novel method is that, when the lower m bits of the identifier (hash value) of a data fragment match a predetermined value A, the end position of that data fragment is taken as the cut-out point of a child chunk.

This novel method is described below with reference to FIG. 7. FIG. 7 shows an example in which, when the lower 2 bits (m = 2) of the identifier (hash value) of a data fragment match the predetermined value 1 (A = 1), the end position of that data fragment is taken as the cut-out point of a child chunk. The hash function used to calculate the identifier (hash value) of a data fragment is denoted h_β( ).

In the example of FIG. 7, the identifier h_β(“Th”) of the data fragment “Th” in the document data “The fil…” is 0x5A. The lower 2 bits of this identifier are obtained by the logical AND of the identifier 0x5A and the mask data 0x03, that is, 0x5A & 0x03, which gives 0x02. The lower 2 bits 0x02 do not match the specified value 0x01. The end position of the data fragment “Th” is therefore not a cut-out point of a child chunk.

The child chunk determination unit 331 therefore extends the size (range) of the data fragment used to determine the cut-out point by 1 byte toward the end of the document data. Assume that the identifier h_β(“The”) of the extended data fragment “The” is 0xF2. The lower 2 bits of this identifier 0xF2 are 0x02, which does not match the specified value 0x01. The child chunk determination unit 331 therefore extends the data fragment by a further 1 byte.

Assume that the identifier h_β(“The ”) of the extended data fragment “The ” is 0x7C. The lower 2 bits of this identifier 0x7C are 0x00, which does not match the specified value 0x01. The child chunk determination unit 331 therefore extends the data fragment by a further 1 byte.

Assume that the identifier h_β(“The f”) of the extended data fragment “The f” is 0x99. The lower 2 bits of this identifier 0x99 are 0x01, which matches the specified value 0x01. The child chunk determination unit 331 therefore determines the end position of the data fragment “The f” as the cut-out point (end offset) and cuts out the data fragment “The f” as a child chunk.
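The cut-out rule walked through above can be sketched as follows. The hash function h_β is not specified here, so the sketch uses the last byte of a SHA-1 digest as an assumed stand-in and lets the caller substitute another function. The starting fragment size of 2 bytes follows the “Th” example, and the names `h_beta`, `cut_child_chunk`, and `FIG7_IDS` are hypothetical.

```python
import hashlib

def h_beta(fragment: bytes) -> int:
    # Assumed stand-in for the identifier function h_beta.
    return hashlib.sha1(fragment).digest()[-1]

def cut_child_chunk(data, start, m=2, a=0x01, min_len=2, hash_fn=h_beta):
    """Return the end offset of the child chunk beginning at `start`.

    The fragment grows one byte at a time until the lower m bits of its
    identifier equal `a`. If no such point exists, the end of the data
    is returned as the cut-out point."""
    mask = (1 << m) - 1                    # m = 2 gives the mask 0x03
    end = start + min_len
    while end < len(data):
        if hash_fn(data[start:end]) & mask == a:
            return end
        end += 1
    return len(data)

# Identifier values quoted in the FIG. 7 walkthrough.
FIG7_IDS = {b"Th": 0x5A, b"The": 0xF2, b"The ": 0x7C, b"The f": 0x99}
end = cut_child_chunk(b"The fil", 0, hash_fn=FIG7_IDS.__getitem__)
first_child = b"The fil"[:end]
```

With the quoted identifiers, the fragments “Th”, “The”, and “The ” are rejected in turn and the cut-out point falls after “The f”. Because the decision depends only on the fragment contents, identical byte runs in different documents tend to be cut at the same points.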

The main steps of the document storage processing described above are summarized below.

The variable-length deduplication module 33 repeats the following processing from the beginning to the end of the document data.

a) The beginning of the document is set as the parent chunk start offset (steps 401 and 402).

b) Starting from the parent chunk start offset, a sequence of child chunks (or a single child chunk) is determined up to the point where the concatenated length becomes equal to or larger than the concatenation window size W (steps 403 to 406).

c) When a sequence of child chunks is determined in process b, the sequence is concatenated into one and defined as a parent chunk (the initial parent chunk) (step 407). When a single child chunk is determined in process b, it is likewise defined as a parent chunk (the initial parent chunk) (step 407). The length of the parent chunk (the initial parent chunk) is equal to or larger than the concatenation window size W.

d) The identifier (hash value) of the parent chunk is obtained (step 407).

e) It is determined whether the identifier and data fragment of the parent chunk are already registered in the chunk list table 312 (step 408).

e.1) If they are registered, the identifier of the parent chunk is registered in the document configuration table 311 in association with the document name (step 412).

e.2) If they are not registered, the following processing is performed.

e.2-1) The parent chunk start offset is reset to a position shifted backward by the size of at least one child chunk at the head of the sequence of child chunks constituting the parent chunk, for example the first child chunk (that is, the child chunk closest to the parent chunk start offset) (step 409).

e.2-2) Starting from the parent chunk start offset, a sequence of child chunks (or a single child chunk) is determined up to the point where the concatenated length becomes equal to or larger than the concatenation window size W (steps 403 to 406). Child chunks already determined in earlier processing need not be determined again (step 406).

e.2-3) When a sequence of child chunks is determined in process e.2-2, the sequence is concatenated into one and defined as a parent chunk (step 407). When a single child chunk is determined in process e.2-2, it is likewise defined as a parent chunk (step 407).

e.2-4) The identifier (hash value) of the parent chunk is obtained (step 407).

e.2-5) It is determined whether the identifier and data fragment of the parent chunk are already registered in the chunk list table 312 (step 408).

e.2-6) The above processing (e.2-1 to e.2-5) is repeated until the identifier and data fragment of the parent chunk are found to be already registered in the chunk list table 312 (Yes in step 408), or until the “shift” of the parent chunk start offset exceeds the size of the initial parent chunk determined in process c (Yes in step 410) (up to the parent chunk p3 in the example of FIG. 6).

e.2-7) If no registered data fragment has been found even then, the identifier and data fragment of the initial parent chunk determined in process c (the parent chunk p1 in the example of FIG. 6) are registered in the chunk list table 312, and the identifier is registered in the document configuration table 311 in association with the document name (steps 502 and 503). The start offset of the child chunk that becomes the next parent chunk start offset is then reset to the end offset of the initial parent chunk determined in process c (that is, the end offset of the child chunk at the end of the initial parent chunk) (steps 411 and 505), and the processing returns to process b.

When the identifier and data fragment of the parent chunk are already registered in the chunk list table 312 (Yes in step 408), the identifier of that parent chunk is registered in the document configuration table 311 in association with the document name (step 412). If an unregistered data fragment exists between the parent chunk just registered in the document configuration table 311 and the parent chunk most recently registered in the chunk list table 312 (No in step 413), that data fragment is treated as a parent chunk: the data fragment and its identifier are registered in the chunk list table 312, and the identifier is registered in the document configuration table 311 in association with the document name (steps 501 to 503). At this time, the chunks (data fragments) in the document configuration table 311 must be rewritten so that the order of the chunks constituting the document is correct. The start offset of the child chunk that becomes the next parent chunk start offset is then reset to the end offset of the parent chunk registered this time (that is, the end offset of the child chunk at the end of that parent chunk) (steps 411 and 505), and the processing returns to process b. If, when the identifier of a parent chunk is registered in the document configuration table 311 in association with the document name, information indicating the position and length of that parent chunk in the corresponding document data is attached to the identifier, the rewriting described above is not necessarily required.

By repeating the above processing up to the end of the data, the data can be stored while variable-length deduplication is performed.
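The summarized procedure can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the embodiment itself. The document is assumed to be already cut into child chunks. SHA-1 stands in for the identifier function. Sliding stops once the start offset has moved past the last child chunk of the initial parent chunk, which is how the FIG. 6 narrative proceeds from p1 through p3. When a gap precedes a duplicate, the gap's identifier is appended first, so the document configuration table needs no rewriting. The names `ident`, `extend`, `store_document`, and `emit` are hypothetical.

```python
import hashlib

def ident(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()        # assumed identifier function

def extend(children, j, k, W):
    """Advance k until children[j..k] is at least W bytes or the document ends."""
    length = sum(len(c) for c in children[j:k + 1])
    while length < W and k + 1 < len(children):
        k += 1
        length += len(children[k])
    return k

def store_document(name, children, W, chunk_list, doc_table):
    ids = doc_table.setdefault(name, [])

    def emit(lo, hi):
        """Record children[lo..hi] as one parent chunk; store data only if new."""
        data = b"".join(children[lo:hi + 1])
        chunk_list.setdefault(ident(data), data)
        ids.append(ident(data))

    i, n = 0, len(children)
    while i < n:
        j = i
        k = k0 = extend(children, i, i, W)       # processes b, c: initial parent chunk
        while True:
            data = b"".join(children[j:k + 1])
            hit = chunk_list.get(ident(data)) == data    # process e
            if hit or k == n - 1:                # duplicate found, or document tail
                if j > i:
                    emit(i, j - 1)               # unregistered gap before this chunk
                emit(j, k)
                i = k + 1
                break
            j += 1                               # e.2-1: drop the leading child chunk
            if j > k0:                           # slid past the initial parent chunk
                emit(i, k0)                      # e.2-7: register the initial parent
                i = k0 + 1
                break
            k = extend(children, j, k, W)        # e.2-2: earlier child chunks are reused
    return ids
```

Storing a second document whose child chunks partly coincide with those of the first adds only the non-matching parent chunks to `chunk_list`, while `doc_table` keeps one ordered identifier list per document.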

<Specific example of document storage processing>
Next, a specific example of the document storage processing in the document storage device 10 will be described with reference to FIGS. 8 to 21, in addition to the flowcharts of FIGS. 5 and 6.
Here, an example is described in which two documents, document 111 with the document name "document #1" and document 112 with the document name "document #2", are stored in sequence in the document storage unit 31 while duplicates are eliminated. In the following description, the storage processes for storing documents 111 and 112 are referred to as storage processes SX and SY, respectively. In this example, the concatenation window size W is set to 10 (10 bytes).

(1) Before the start of storage processes SX and SY
Before documents 111 and 112 are stored in the document storage unit 31, the document configuration table 311 and the chunk list table 312 in the document storage unit 31 are empty. FIG. 8 shows documents 111 and 112 (the first and second documents) together with the state of the document configuration table 311 and the chunk list table 312 before those documents are stored.

(2) Storage process SX
The storage process (storage operation) SX for storing document 111 will be described with reference to FIGS. 9 to 14. FIGS. 9 to 13 show the storage operation for document 111 together with the state of the document configuration table 311. FIG. 14 shows the state of the document configuration table 311 and the chunk list table 312 after document 111 has been stored, together with document 111 and the sequence of parent chunks cut out from it.

(A) Storage process SX, part 1
First, part 1 of the storage process SX for storing document 111 (hereinafter, storage process SX1) will be described with reference to FIG. 9. The document configuration table 311 is omitted from FIG. 9.

A1)
A1-1) The variable-length deduplication module 33 determines child chunks one after another from the beginning of document 111 until their total length reaches or exceeds the concatenation window size W (W = 10 in this example) (steps 403 to 406). Each child chunk is determined by the variable-length chunk cutting technique described above. In the example of FIG. 9, it is assumed that the variable-length deduplication module 33 sets cut points at the 5th, 8th, and 11th characters from the beginning of document 111 and thereby determines child chunks c₀ (“The f”), c₁ (“ile”), and c₂ (“ na”).

A1-2) Having determined child chunks c₀ (“The f”), c₁ (“ile”), and c₂ (“ na”) up to the point where the concatenation window size W is reached or exceeded, the variable-length deduplication module 33 concatenates them to define parent chunk 901 and generates an identifier (hash value) for parent chunk 901 (step 407). Let the identifiers (hash values) of the concatenated child chunks be c₀.hash = H_A, c₁.hash = H_B, and c₂.hash = H_C. In this example, the hash value H_ABC generated from the identifiers (hash values) H_A, H_B, and H_C of the child chunks c₀, c₁, and c₂ constituting parent chunk 901 is used as the hash value of the data of parent chunk 901 and serves as the identifier of parent chunk 901.

A1-3) Based on the identifier H_ABC of parent chunk 901, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_ABC is registered in the chunk list table 312 (step 408). In this example, no data fragment corresponding to H_ABC is registered. The result of step 408 for parent chunk 901 is therefore "unregistered" (No), as indicated by arrow 911 in FIG. 9, and processing proceeds to the next process A2.
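Process A1 can be sketched as follows. This is a minimal illustration rather than the implementation described in the specification: SHA-1 is assumed as the hash function, the parent identifier is assumed to be a hash over the concatenated child hash strings, and the names `chunk_hash`, `parent_id`, and `build_parent` are hypothetical.

```python
import hashlib

W = 10  # concatenation window size used in the example (bytes)

def chunk_hash(data: bytes) -> str:
    # Assumption: SHA-1 stands in for the unspecified hash function.
    return hashlib.sha1(data).hexdigest()

def parent_id(child_hashes) -> str:
    # Assumption: the parent identifier is a hash of the concatenated child hashes.
    return chunk_hash("".join(child_hashes).encode("ascii"))

def build_parent(children, start, w=W):
    """Concatenate child chunks from index `start` until the total length is
    at least `w` or the children run out.
    Returns (parent_data, parent_identifier, index_of_next_child)."""
    taken, total, i = [], 0, start
    while i < len(children) and total < w:
        taken.append(children[i])
        total += len(children[i])
        i += 1
    return b"".join(taken), parent_id([chunk_hash(c) for c in taken]), i

children = [b"The f", b"ile", b" na", b"me spe", b"cif"]   # c0 .. c4
data, ident, nxt = build_parent(children, 0)
print(data)                      # b'The file na'
print(ident in {})               # False: nothing registered yet (No in step 408)
```

Calling `build_parent(children, 1)` and `build_parent(children, 2)` gives the data of parent chunks 902 and 903 used in processes A2 and A3.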

A2)
A2-1) The variable-length deduplication module 33 determines child chunks one after another, starting from the position shifted toward the end of document 111 by the length of the first child chunk c₀ in the sequence of child chunks c₀, c₁, c₂ constituting parent chunk 901 determined in process A1 (the reset parent chunk start position) (step 409), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). The portion for which child chunks c₁ and c₂ were determined in process A1 need not be processed again. In the example of FIG. 9, it is assumed that the variable-length deduplication module 33 sets a cut point at the 17th character from the beginning of document 111 and determines a new child chunk c₃.

A2-2) Having determined child chunks c₁, c₂, and c₃ up to the point where the concatenation window size W is reached or exceeded, the variable-length deduplication module 33 concatenates c₁, c₂, and c₃ to define parent chunk 902 and generates an identifier (hash value) for parent chunk 902 (step 407). In this example, the hash value H_BCD generated from the identifiers (hash values) c₁.hash = H_B, c₂.hash = H_C, and c₃.hash = H_D of the concatenated child chunks c₁ (“ile”), c₂ (“ na”), and c₃ (“me spe”) is used as the identifier (hash value) of parent chunk 902.

A2-3) Based on the identifier H_BCD of parent chunk 902, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_BCD is registered in the chunk list table 312 (step 408). In this example, no data fragment corresponding to H_BCD is registered. The result of step 408 for parent chunk 902 is therefore "unregistered" (No), as indicated by arrow 912 in FIG. 9, and processing proceeds to the next process A3.

A3)
A3-1) The variable-length deduplication module 33 determines child chunks one after another, starting from the position shifted toward the end of document 111 by the length of the first child chunk c₁ in the sequence of child chunks c₁, c₂, c₃ constituting parent chunk 902 determined in process A2 (step 409), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). The portion for which child chunks c₂ and c₃ were determined in process A2 need not be processed again. In the example of FIG. 9, it is assumed that the variable-length deduplication module 33 sets a cut point at the 20th character from the beginning of document 111 and determines a new child chunk c₄.

A3-2) Having determined child chunks c₂, c₃, and c₄ up to the point where the concatenation window size W is reached or exceeded, the variable-length deduplication module 33 concatenates c₂, c₃, and c₄ to define parent chunk 903 and generates an identifier (hash value) for parent chunk 903 (step 407). In this example, the hash value H_CDE generated from the identifiers (hash values) c₂.hash = H_C, c₃.hash = H_D, and c₄.hash = H_E of the concatenated child chunks c₂ (“ na”), c₃ (“me spe”), and c₄ (“cif”) is used as the identifier (hash value) of parent chunk 903.

A3-3) Based on the identifier H_CDE of parent chunk 903, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_CDE is registered in the chunk list table 312 (step 408). In this example, no data fragment corresponding to H_CDE is registered. The result of step 408 for parent chunk 903 is therefore "unregistered" (No), as indicated by arrow 913 in FIG. 9, and processing proceeds to the next process A4.

A4)
In the example of FIG. 9, the position obtained by shifting toward the end of document 111 by the length of the first child chunk c₂ in the sequence of child chunks c₂, c₃, c₄ constituting parent chunk 903 determined in process A3 (that is, the reset parent chunk start position) is past the end position of parent chunk 901 determined in process A1 (that is, the initial parent chunk). Accordingly, the shift of the parent chunk start offset reset in step 409 from the start offset of parent chunk 901 is equal to or greater than the concatenation window size W (Yes in step 410). In this case, the variable-length deduplication module 33 registers the identifier H_ABC and the data (data fragment) “The file na” of parent chunk 901 determined in process A1 in the chunk list table 312, as indicated by arrow 904 in FIG. 9 (step 502). Although omitted from FIG. 9, the variable-length deduplication module 33 also registers the document name "document #1" of document 111 and the identifier H_ABC of parent chunk 901 in the document configuration table 311 (step 503). In this embodiment, the identifier H_ABC of parent chunk 901 registered in step 502 is computed anew in step 501.
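The sliding of the parent chunk start position in processes A1 to A4 can be sketched as below. This is a simplified illustration, not the flowchart itself: cut points are supplied as a precomputed list instead of being found by the chunk cutting technique, the last cut point is assumed to coincide with the end of the data, and the names `candidates`, `doc`, and `cuts` are hypothetical.

```python
W = 10  # concatenation window size used in the example (bytes)

def candidates(doc: bytes, cuts, start: int, w: int = W):
    """Yield (offset, parent_data) for the initial parent chunk at `start` and
    for each parent obtained by shifting the start past one child chunk,
    stopping once the shift from `start` is at least `w` (Yes in step 410)."""
    bounds = [0] + list(cuts)            # child chunk boundaries
    pos = start
    while pos - start < w and pos < len(doc):
        # Parent end: first boundary at least w bytes past pos, else end of data.
        end = next((b for b in bounds if b - pos >= w), bounds[-1])
        yield pos, doc[pos:end]
        # Shift by the length of the first child chunk of this parent (step 409).
        pos = next(b for b in bounds if b > pos)

doc = b"The file name specified by path is ope"
cuts = [5, 8, 11, 17, 20, 24, 38]        # cut points given in the example

for offset, data in candidates(doc, cuts, 0):
    print(offset, data)
# 0 b'The file na'
# 5 b'ile name spe'
# 8 b' name specif'
```

When none of the yielded parents is found in the chunk list table 312, the first one (the initial parent chunk) is the one that gets registered, and the next round starts at its end offset. Calling `candidates(doc, cuts, 11)` yields the three parents examined in storage process SX2.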

(B) Storage process SX, part 2
Part 2 of the storage process SX for storing document 111 (hereinafter, storage process SX2), which is executed following storage process SX1, will be described with reference to FIG. 10. The document configuration table 311 is omitted from FIG. 10.

B1)
B1-1) The variable-length deduplication module 33 determines child chunks one after another, starting from the end position of parent chunk 901 determined in storage process SX1 (step 505), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). The portion for which child chunks c₃ and c₄, located after the end position of parent chunk 901, were determined in storage process SX1 need not be processed again. In the example of FIG. 10, it is assumed that the variable-length deduplication module 33 sets a cut point at the 24th character from the beginning of document 111 and determines a new child chunk c₅ (“ied ”).

The subsequent processing is the same as processes A2-2 to A2-3 in storage process SX1: parent chunk 1001 is defined by concatenating child chunks c₃, c₄, and c₅. In the example of FIG. 10, no data fragment corresponding to the identifier H_DEF of parent chunk 1001 is registered in the chunk list table 312. The result of step 408 for parent chunk 1001 is therefore "unregistered" (No), as indicated by arrow 1011 in FIG. 10, and processing proceeds to the next process B2.

B2)
Process B2 is the same as process A2 in storage process SX1. In process B2, a new child chunk c₆ (“by path is ope”) is determined, and parent chunk 1002 is defined by concatenating child chunks c₄, c₅, and c₆. In the example of FIG. 10, no data fragment corresponding to the identifier H_EFG of parent chunk 1002 is registered in the chunk list table 312. The result of step 408 for parent chunk 1002 is therefore "unregistered" (No), as indicated by arrow 1012 in FIG. 10, and processing proceeds to the next process B3.

B3)
Process B3 is the same as process A3 in storage process SX1. The variable-length deduplication module 33 determines child chunks one after another, starting from the position shifted toward the end of document 111 by the length of the first child chunk c₄ in the sequence of child chunks c₄, c₅, c₆ constituting parent chunk 1002 determined in process B2 (step 409), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). The portion for which child chunks c₅ and c₆ were determined in process B2 need not be processed again. In the example of FIG. 10, no new child chunk has to be determined between the position shifted by the length of child chunk c₄ and the point where the concatenation window size W is reached or exceeded. The variable-length deduplication module 33 therefore defines parent chunk 1003 by concatenating the child chunks c₅ and c₆ that remain after removing the first child chunk c₄ from parent chunk 1002 determined in process B2. In the example of FIG. 10, no data fragment corresponding to the identifier H_FG of parent chunk 1003 is registered in the chunk list table 312. The result of step 408 for parent chunk 1003 is therefore "unregistered" (No), as indicated by arrow 1013 in FIG. 10, and processing proceeds to the next process B4.

B4)
Process B4 is the same as process A4 in storage process SX1. That is, in the example of FIG. 10, the position obtained by shifting toward the end of document 111 by the length of the first child chunk c₅ in the sequence of child chunks c₅, c₆ constituting parent chunk 1003 determined in process B3 (that is, the reset parent chunk start position) is past the end position of parent chunk 1001 determined in process B1 (that is, the initial parent chunk) (Yes in step 410). In this case, the variable-length deduplication module 33 registers the identifier H_DEF and the data (data fragment) “me specified ” of parent chunk 1001 determined in process B1 in the chunk list table 312, as indicated by arrow 1004 in FIG. 10 (step 502). Although omitted from FIG. 10, the variable-length deduplication module 33 also registers the document name "document #1" of document 111 and the identifier H_DEF of parent chunk 1001 in the document configuration table 311 (step 503).

(C) Storage process SX, part 3
Part 3 of the storage process SX for storing document 111 (hereinafter, storage process SX3), which is executed following storage process SX2, will be described with reference to FIG. 11. The document configuration table 311 is omitted from FIG. 11.

C1)
The variable-length deduplication module 33 determines child chunks one after another, starting from the end position of parent chunk 1001 determined in storage process SX2 (step 505), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). In the example of FIG. 11, it is assumed that the variable-length deduplication module 33 sets a cut point at the 38th character from the beginning of document 111 and determines child chunk c₆.

The subsequent processing is the same as processes A2-2 to A2-3 in storage process SX1. In the example of FIG. 11, however, child chunk c₆ (“by path is ope”) alone reaches or exceeds the concatenation window size W (W = 10), so child chunk c₆ by itself is defined as parent chunk 1101. The variable-length deduplication module 33 uses the hash value H_G′, generated from the identifier (hash value) c₆.hash = H_G of child chunk c₆ (“by path is ope”), as the identifier (hash value) of parent chunk 1101. In the example of FIG. 11, no data fragment corresponding to the identifier H_G′ of parent chunk 1101 is registered in the chunk list table 312. The result of step 408 for parent chunk 1101 is therefore "unregistered" (No), as indicated by arrow 1111 in FIG. 11, and processing proceeds to the next process C2.

C2)
Process C2 is the same as process A4 in storage process SX1. In the example of FIG. 11, the position obtained by shifting from the start position of parent chunk 1101 determined in process C1 toward the end of document 111 by the length of the child chunk c₆ constituting parent chunk 1101 (that is, the reset parent chunk start position) is past the end position of that parent chunk 1101 (that is, the initial parent chunk) (Yes in step 410). In this case, the variable-length deduplication module 33 registers the identifier H_G′ and the data (data fragment) “by path is ope” of parent chunk 1101 determined in process C1 in the chunk list table 312, as indicated by arrow 1102 in FIG. 11 (step 502). Although omitted from FIG. 11, the variable-length deduplication module 33 also registers the document name "document #1" of document 111 and the identifier H_G′ of parent chunk 1101 in the document configuration table 311 (step 503).

(D) Storage process SX, part 4
Part 4 of the storage process SX for storing document 111 (hereinafter, storage process SX4), which is executed following storage process SX3, will be described with reference to FIG. 12. The document configuration table 311 is omitted from FIG. 12.

D1)
D1-1) The variable-length deduplication module 33 determines child chunks one after another, starting from the end position of parent chunk 1101 determined in storage process SX3 (step 505), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). In the example of FIG. 12, it is assumed that the variable-length deduplication module 33 sets cut points at the 41st, 47th, and 50th characters from the beginning of document 111 and determines child chunks c₇, c₈, and c₉.

The subsequent processing is the same as processes A2-2 to A2-3 in storage process SX1: parent chunk 1201 is defined by concatenating child chunks c₇, c₈, and c₉. In the example of FIG. 12, no data fragment corresponding to the identifier H_HIJ of parent chunk 1201 is registered in the chunk list table 312. The result of step 408 for parent chunk 1201 is therefore "unregistered" (No), as indicated by arrow 1211 in FIG. 12, and processing proceeds to the next process D2.

D2)
Process D2 is the same as process A2 in storage process SX1; a new child chunk c₁₀ is determined, and parent chunk 1202 is defined by concatenating child chunks c₈, c₉, and c₁₀. In the example of FIG. 12, no data fragment corresponding to the identifier H_IJK of parent chunk 1202 is registered in the chunk list table 312. The result of step 408 for parent chunk 1202 is therefore "unregistered" (No), as indicated by arrow 1212 in FIG. 12, and processing proceeds to the next process D3.

D3)
Process D3 is the same as process A3 in storage process SX1. The variable-length deduplication module 33 determines child chunks one after another, starting from the position shifted toward the end of document 111 by the length of the first child chunk c₈ in the sequence of child chunks c₈, c₉, c₁₀ constituting parent chunk 1202 determined in process D2 (step 409), until the total length reaches or exceeds the concatenation window size W (W = 10) (steps 403 to 406). The portion for which child chunks c₉ and c₁₀ were determined in process D2 need not be processed again. In the example of FIG. 12, it is assumed that the variable-length deduplication module 33 sets a cut point at the 57th character from the beginning of document 111 and determines child chunk c₁₁. The variable-length deduplication module 33 concatenates child chunks c₉, c₁₀, and c₁₁ to define parent chunk 1203. In the example of FIG. 12, no data fragment corresponding to the identifier H_JKL of parent chunk 1203 is registered in the chunk list table 312. The result of step 408 for parent chunk 1203 is therefore "unregistered" (No), as indicated by arrow 1213 in FIG. 12, and processing proceeds to the next process D4.

D4)
Process D4 is the same as process A4 in storage process SX1. That is, in the example of FIG. 12, the position obtained by shifting toward the end of document 111 by the length of the first child chunk c₉ in the sequence of child chunks c₉, c₁₀, c₁₁ constituting parent chunk 1203 determined in process D3 is past the end position of parent chunk 1201 determined in process D1 (Yes in step 410). In this case, the variable-length deduplication module 33 registers the identifier H_HIJ and the data (data fragment) “ned for read” of parent chunk 1201 determined in process D1 in the chunk list table 312, as indicated by arrow 1204 in FIG. 12 (step 502). Although omitted from FIG. 12, the variable-length deduplication module 33 also registers the document name "document #1" of document 111 and the identifier H_HIJ of parent chunk 1201 in the document configuration table 311 (step 503).

(E) Storage process SX, part 5
Part 5 of the storage process SX for storing document 111 (hereinafter, storage process SX5), which is executed following storage process SX4, will be described with reference to FIG. 13. To simplify the description, the character string “and” is treated here as the end of document 111. The document configuration table 311 is omitted from FIG. 13.

E1)
The variable-length deduplication module 33 determines child chunks one after another, starting from the end position of parent chunk 1201 determined in storage process SX4 (step 505) (steps 403 to 406). In the example of FIG. 13, it is assumed that child chunks c₁₀ and c₁₁ are determined and that, as a result, the cut point reaches the end of document 111 (step 404). In this case, regardless of whether the cut point lies at least the concatenation window size W (W = 10) beyond the end position of parent chunk 1201 determined in storage process SX4, the variable-length deduplication module 33 defines parent chunk 1301 by concatenating child chunks c₁₀ and c₁₁ (step 414).

E2)
The variable-length deduplication module 33 registers the identifier H_KL and the data (data fragment) “ing and” of parent chunk 1301 determined in process E1 in the chunk list table 312, as indicated by arrow 1302 in FIG. 13 (step 412). Although omitted from FIG. 13, the variable-length deduplication module 33 also registers the document name "document #1" of document 111 and the identifier H_KL of parent chunk 1301 in the document configuration table 311 (step 416). This completes the storage process SX for storing document 111.

FIG. 14 shows the state of the document configuration table 311 and the chunk list table 312 after the storage process SX described above (that is, storage processes SX1 to SX5) has been completed (that is, the registration state of document 111), together with document 111 and the sequence of parent chunks cut out from it.
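The table state reached after storage process SX can be modeled with two dictionaries, as in the sketch below. The five data fragments are those registered in storage processes SX1 to SX5; everything else is an assumption made for illustration: SHA-1 over the chunk data stands in for identifiers such as H_ABC (the text derives them from child chunk hashes), and `register`, `restore`, `chunk_table`, and `doc_table` are hypothetical names.

```python
import hashlib

def ident(data: bytes) -> str:
    # Assumption: SHA-1 of the chunk data stands in for identifiers such as H_ABC.
    return hashlib.sha1(data).hexdigest()

chunk_table = {}   # models chunk list table 312: identifier -> data fragment
doc_table = {}     # models document configuration table 311: name -> identifiers

def register(name: str, fragment: bytes) -> None:
    cid = ident(fragment)
    chunk_table.setdefault(cid, fragment)          # data is stored only once
    doc_table.setdefault(name, []).append(cid)     # order of chunks in the document

# Parent chunks registered by storage processes SX1 to SX5.
for fragment in [b"The file na", b"me specified ", b"by path is ope",
                 b"ned for read", b"ing and"]:
    register("document #1", fragment)

def restore(name: str) -> bytes:
    """Rebuild a document by following its identifier list."""
    return b"".join(chunk_table[cid] for cid in doc_table[name])

print(restore("document #1"))
# b'The file name specified by path is opened for reading and'
```

Keeping the identifier list separate from the chunk data is what lets a second document refer to a fragment that is already stored instead of storing it again.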

(3) Storage process SY
Next, the storage process SY for storing document 112 will be described with reference to FIGS. 15 to 21. FIGS. 15 to 19 show the storage operation for document 112, performed after document 111 has been stored, together with the state of the document configuration table 311. FIG. 20 shows the state of the document configuration table 311 and the chunk list table 312 after document 112 has been stored (that is, after documents 111 and 112 have been stored), together with document 112 and the sequence of parent chunks cut out from it. FIG. 21 shows the state of the document configuration table 311 and the chunk list table 312 after documents 111 and 112 have been stored, together with the sequences of parent chunks cut out from documents 111 and 112.

F) Storage process SY, part 1
Storage process SY part 1 (hereinafter, storage process SY1), which stores the document 112 after the document 111 has been stored, will be described with reference to FIG. 15. The document configuration table 311 is omitted from FIG. 15.

F-1) Starting from the beginning of the document 112, the variable-length deduplication module 33 determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). In the example of FIG. 15, the module sets cut points at the 5th, 8th, and 11th characters from the beginning of the document 112 and thereby determines child chunks c₀ (“The f”), c₁ (“ile”), and c₂ (“ na”).
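The stopping rule in F-1 can be illustrated with a minimal sketch. It assumes the child chunk lengths are already known (here 5, 3, and 3, derived from the cut points 5, 8, and 11 in the FIG. 15 example); the function name `take_until_window` is hypothetical and does not appear in the embodiment.

```python
def take_until_window(child_lengths, window):
    """Count the leading child chunks needed for their total length to reach `window`.

    Returns (number_of_chunks, total_length). If the data runs out first,
    all chunks are returned, as happens at the end of a document.
    """
    total = 0
    for count, length in enumerate(child_lengths, start=1):
        total += length
        if total >= window:
            return count, total
    return len(child_lengths), total


# Child chunks c0, c1, c2 of document 112 have lengths 5, 3, 3 (cut points 5, 8, 11).
count, total = take_until_window([5, 3, 3], window=10)
print(count, total)
```

With W = 10 the first two child chunks (8 characters) are not enough, so a third is taken and the candidate parent chunk spans 11 characters.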

F-2) Once the child chunks c₀ (“The f”), c₁ (“ile”), and c₂ (“ na”) have been determined up to the point where the concatenation window size W is reached, the variable-length deduplication module 33 concatenates them to define a parent chunk 1501 and generates an identifier (hash value) for the parent chunk 1501 (step 407). In this example, the module uses the hash value H_ABC, generated from the identifiers (hash values) c₀.hash = H_A, c₁.hash = H_B, and c₂.hash = H_C of the child chunks being concatenated, as the identifier (hash value) of the parent chunk 1501.
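As a rough illustration of deriving a parent identifier from child identifiers, the sketch below hashes each child chunk and then hashes the concatenated child digests. Both choices are assumptions: this passage does not say which hash function is used or how H_ABC is computed from H_A, H_B, and H_C, and the function names are hypothetical.

```python
import hashlib


def child_hash(data):
    # Assumption: SHA-1 stands in for the unspecified child-chunk hash function.
    return hashlib.sha1(data).digest()


def parent_hash(child_hashes):
    # Assumption: the parent identifier is a hash over the concatenated child digests,
    # so it can be computed without rereading the chunk data itself.
    return hashlib.sha1(b"".join(child_hashes)).digest()


children = [b"The f", b"ile", b" na"]                     # c0, c1, c2 from FIG. 15
h_abc = parent_hash([child_hash(c) for c in children])    # plays the role of H_ABC
print(h_abc.hex())
```

Under this assumption, sliding the window by one child chunk only requires recombining digests that have already been computed.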

F-3) Based on the identifier H_ABC of the parent chunk 1501, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_ABC is registered in the chunk list table 312 (step 408). As shown in FIG. 15, such a data fragment is registered. The result of step 408 for the parent chunk 1501 is therefore Yes (registered), as indicated by arrow 1511 in FIG. 15, and processing proceeds to the next process G. In this case the chunk list table 312 is unchanged before and after process F, as indicated by arrow 1502 in FIG. 15. Although omitted from FIG. 15, before proceeding to process G the variable-length deduplication module 33 registers the document name “document #2” of the document 112 and the identifier H_ABC of the parent chunk 1501 in the document configuration table 311 (step 503).

G) Storage process SY, part 2
Storage process SY part 2 (hereinafter, storage process SY2), which is executed after storage process SY1 to store the document 112, will be described with reference to FIG. 16. The document configuration table 311 is omitted from FIG. 16.

G1)
G1-1) From the end position of the parent chunk 1501 determined in storage process SY1 (step 505), the variable-length deduplication module 33 determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). In the example of FIG. 16, the module sets cut points at the 13th and 22nd characters from the beginning of the document 112 and thereby determines child chunks c₃ and c₄.

G2-2) Once the child chunks c₃ and c₄ have been determined up to the point where the concatenation window size W is reached, the variable-length deduplication module 33 concatenates c₃ and c₄ to define a parent chunk 1601 and generates an identifier (hash value) for the parent chunk 1601 (step 407). In this example, the module uses the hash value H_XY, generated from the identifiers (hash values) c₃.hash = H_X and c₄.hash = H_Y of the child chunks c₃ (“me”) and c₄ (“ABCD spe”) being concatenated, as the identifier (hash value) of the parent chunk 1601.

G2-3) Based on the identifier H_XY of the parent chunk 1601, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_XY is registered in the chunk list table 312 (step 408). In this example no data fragment corresponding to H_XY is registered, so processing proceeds to the next process G3.

G3)
G3-1) The variable-length deduplication module 33 shifts the start position toward the end of the document 112 by the length of c₃, the first child chunk in the sequence c₃, c₄ that makes up the parent chunk 1601 defined in process G2 (step 409). From that position it determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). The portion already delimited as child chunk c₄ in process G2 does not need to be processed again. In the example of FIG. 16, the module sets a cut point at the 25th character from the beginning of the document 112 and thereby determines a new child chunk c₅.

G3-2) Once the child chunks c₄ and c₅ have been determined up to the point where the concatenation window size W is reached, the variable-length deduplication module 33 concatenates c₄ and c₅ to define a parent chunk 1602 and generates an identifier (hash value) for the parent chunk 1602 (step 407). In this example, the module uses the hash value H_YE, generated from the identifiers (hash values) c₄.hash = H_Y and c₅.hash = H_E of the child chunks c₄ (“ABCD spe”) and c₅ (“cif”) being concatenated, as the identifier (hash value) of the parent chunk 1602.

G3-3) Based on the identifier H_YE of the parent chunk 1602, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_YE is registered in the chunk list table 312 (step 408). In this example no data fragment corresponding to H_YE is registered, so processing proceeds to the next process G4.

G4)
G4-1) In the example of FIG. 16, the position obtained by shifting toward the end of the document 112 by the length of c₄, the first child chunk in the sequence c₄, c₅ that makes up the parent chunk 1602 defined in process G3, lies beyond the end position of the parent chunk 1601 determined in process G1 (that is, the initial parent chunk). Consequently, the displacement of the parent chunk start offset reset in step 409 from the start offset of the parent chunk 1601 is equal to or greater than the concatenation window size W (Yes in step 410). In this case, the variable-length deduplication module 33 registers the identifier H_XY and the data (data fragment) “me ABCD spe” of the parent chunk 1601 determined in process G1 in the chunk list table 312, as indicated by arrow 1603 in FIG. 16 (step 502). Although omitted from FIG. 16, the module also registers the document name “document #2” of the document 112 and the identifier H_XY of the parent chunk 1601 in the document configuration table 311 (step 503).

H) Storage process SY, part 3
Storage process SY part 3 (hereinafter, storage process SY3), which is executed after storage process SY2 to store the document 112, will be described with reference to FIG. 17. The document configuration table 311 is omitted from FIG. 17.

H1)
H1-1) From the end position of the parent chunk 1601 determined in storage process SY2 (step 505), the variable-length deduplication module 33 determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). The portion following the end of the parent chunk 1601 that was already delimited as child chunk c₅ in storage process SY2 does not need to be processed again. In the example of FIG. 17, the module sets cut points at the 29th and 43rd characters from the beginning of the document 112 and thereby determines new child chunks c₆ (“ied”) and c₇ (“by path is ope”).

H1-2) Once the child chunks c₅, c₆, and c₇ have been determined up to the point where the concatenation window size W is reached, the variable-length deduplication module 33 concatenates c₅, c₆, and c₇ to define a parent chunk 1701 and generates an identifier (hash value) for the parent chunk 1701 (step 407). In this example, the module uses the hash value H_EFG, generated from the identifiers (hash values) c₅.hash = H_E, c₆.hash = H_F, and c₇.hash = H_G of the child chunks c₅ (“cif”), c₆ (“ied”), and c₇ (“by path is ope”) being concatenated, as the identifier (hash value) of the parent chunk 1701.

H1-3) Based on the identifier H_EFG of the parent chunk 1701, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_EFG is registered in the chunk list table 312 (step 408). In this example no such data fragment is registered. The result of step 408 for the parent chunk 1701 is therefore No (unregistered), as indicated by arrow 1711 in FIG. 17, and processing proceeds to the next process H2.

H2)
H2-1) The variable-length deduplication module 33 shifts the start position toward the end of the document 112 by the length of c₅, the first child chunk in the sequence c₅, c₆, c₇ that makes up the parent chunk 1701 defined in process H1 (step 409). From that position it determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). The portion already delimited as child chunks c₆ and c₇ in process H1 does not need to be processed again. In the example of FIG. 17, no new child chunk needs to be determined between the position shifted by the length of c₅ and the point where the concatenation window size W is reached.

H2-2) The variable-length deduplication module 33 therefore concatenates the child chunks c₆ and c₇ that remain after removing the first child chunk c₅ from the parent chunk 1701 defined in process H1, defines the result as a parent chunk 1702, and generates an identifier (hash value) for the parent chunk 1702 (step 407). In this example, the module uses the hash value H_FG, generated from the identifiers (hash values) c₆.hash = H_F and c₇.hash = H_G of the child chunks c₆ (“ied”) and c₇ (“by path is ope”) being concatenated, as the identifier (hash value) of the parent chunk 1702.

H2-3) Based on the identifier H_FG of the parent chunk 1702, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_FG is registered in the chunk list table 312 (step 408). In this example no such data fragment is registered. The result of step 408 for the parent chunk 1702 is therefore No (unregistered), as indicated by arrow 1712 in FIG. 17, and processing proceeds to the next process H3.

H3)
H3-1) The variable-length deduplication module 33 shifts the start position toward the end of the document 112 by the length of c₆, the first child chunk in the sequence c₆, c₇ that makes up the parent chunk 1702 defined in process H2 (step 409). From that position it determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). The portion already delimited as child chunk c₇ in process H2 does not need to be processed again. In the example of FIG. 17, no new child chunk needs to be determined between the position shifted by the length of c₆ and the point where the concatenation window size W is reached. Moreover, the child chunk c₇ alone already reaches the concatenation window size W (W = 10) or more.

H3-2) The variable-length deduplication module 33 therefore defines the single child chunk c₇ that remains after removing the first child chunk c₆ from the parent chunk 1702 defined in process H2 as a parent chunk 1703, and generates an identifier (hash value) for the parent chunk 1703 (step 407). In this example, the module uses the hash value H_G’, generated from the identifier (hash value) c₇.hash = H_G of the child chunk c₇ (“by path is ope”), as the identifier (hash value) of the parent chunk 1703.

H3-3) Based on the identifier H_G’ of the parent chunk 1703, the variable-length deduplication module 33 determines whether a data fragment corresponding to H_G’ is registered in the chunk list table 312 (step 408). As shown in FIG. 17, such a data fragment is registered. The result of step 408 for the parent chunk 1703 is therefore Yes (registered), as indicated by arrow 1713 in FIG. 17, and processing proceeds to the next process H4. Although omitted from FIG. 17, before proceeding to process H4 the variable-length deduplication module 33 registers the document name “document #2” of the document 112 and the identifier H_G’ of the parent chunk 1703 in the document configuration table 311 (step 412).

H4)
H4-1) If a child chunk or a sequence of child chunks lies between the parent chunk 1601 defined in storage process SY2 (that is, the data fragment identified by H_XY) and the parent chunk 1703 defined in process H3 (that is, the data fragment identified by H_G’) (No in step 413), the variable-length deduplication module 33 defines that child chunk or sequence of child chunks as a parent chunk and generates an identifier (hash value) for it (step 501). In the example of FIG. 17, the child chunks c₅ and c₆ lie between the parent chunk 1601 and the parent chunk 1703. The module therefore concatenates c₅ and c₆ to define a parent chunk 1704 and generates an identifier (hash value) for the parent chunk 1704 (step 501). Specifically, the module uses the hash value H_EF, generated from the identifiers (hash values) c₅.hash = H_E and c₆.hash = H_F of the child chunks c₅ (“cif”) and c₆ (“ied ”) being concatenated, as the identifier (hash value) of the parent chunk 1704.

H4-2) The variable-length deduplication module 33 registers the identifier H_EF and the data (data fragment) “cified ” of the parent chunk 1704 in the chunk list table 312, as indicated by arrow 1705 in FIG. 17 (step 502). Although omitted from FIG. 17, the module also registers the document name “document #2” of the document 112 and the identifier H_EF of the parent chunk 1704 in the document configuration table 311 (step 503).
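The three outcomes walked through in processes F to H (immediate match, no match within W, and a match found after sliding) can be put together in one simplified sketch. It is not the flowchart of the embodiment: child chunk boundaries are given as input instead of being computed, a parent chunk is identified by SHA-1 of its text instead of by a hash of child hashes, and all names are hypothetical. The child chunk list for document 112 is reconstructed from the walk-through, with whitespace placement assumed.

```python
import hashlib

W = 10  # concatenation window size used in the example


def ident(children):
    # Assumption: identify a parent chunk by SHA-1 of its concatenated text.
    return hashlib.sha1("".join(children).encode()).hexdigest()


def window_end(children, start):
    """Index just past the children that first reach W characters (or the end)."""
    end, total = start, 0
    while end < len(children) and total < W:
        total += len(children[end])
        end += 1
    return end


def store(children, table):
    """Return the ordered parent chunk identifiers of one document; update `table`."""
    ids, base = [], 0
    while base < len(children):
        start, shift = base, 0
        base_end = window_end(children, base)       # initial parent chunk
        while True:
            end = window_end(children, start)
            key = ident(children[start:end])
            if key in table:                        # duplicate found
                if start > base:                    # children skipped before the match
                    gap = children[base:start]
                    table[ident(gap)] = "".join(gap)
                    ids.append(ident(gap))
                ids.append(key)
                base = end
                break
            shift += len(children[start])           # slide by one child chunk
            start += 1
            if shift >= W or start >= len(children):
                first = children[base:base_end]     # give up: register initial parent
                table[ident(first)] = "".join(first)
                ids.append(ident(first))
                base = base_end
                break
    return ids


# Fragments assumed to be registered already when document 111 was stored.
table = {ident([s]): s for s in
         ["The file na", "by path is ope", "ned for read", "ing and"]}
children = ["The f", "ile", " na", "me", " ABCD spe", "cif", "ied ",
            "by path is ope", "ned", " for r", "ead", "in", "g and"]
ids = store(children, table)
print([table[i] for i in ids])
```

In this run the only fragments added to the table are “me ABCD spe” and “cified ”, matching the registrations described for parent chunks 1601 and 1704.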

I) Storage process SY, part 4
Storage process SY part 4 (hereinafter, storage process SY4), which is executed after storage process SY3 to store the document 112, will be described with reference to FIG. 18. The document configuration table 311 is omitted from FIG. 18.

I-1) From the end position of the parent chunk 1703 that was determined to be already registered in storage process SY3 (step 505), the variable-length deduplication module 33 determines child chunks one after another until their combined length reaches the concatenation window size W (W = 10) or more (steps 403 to 406). In the example of FIG. 18, the module sets cut points at the 46th, 52nd, and 56th characters from the beginning of the document 112 and thereby determines new child chunks c₈ (“ned”), c₉ (“ for r”), and c₁₀ (“ead”).

The subsequent processing is the same as processes F-2 and F-3 in storage process SY1: the parent chunk 1801 is defined by concatenating the child chunks c₈, c₉, and c₁₀. In the example of FIG. 18, a data fragment corresponding to the identifier H_HIJ of the parent chunk 1801 is registered in the chunk list table 312 (Yes in step 408). The result of step 408 for the parent chunk 1801 is therefore Yes (registered), as indicated by arrow 1811 in FIG. 18, and processing proceeds to the next process J. In this case the chunk list table 312 is unchanged before and after process I, as indicated by arrow 1802 in FIG. 18. Although omitted from FIG. 18, before proceeding to process J the variable-length deduplication module 33 registers the document name “document #2” of the document 112 and the identifier H_HIJ of the parent chunk 1801 in the document configuration table 311 (step 503).

J) Storage process SY, part 5
Storage process SY part 5 (hereinafter, storage process SY5), which is executed after storage process SY4 to store the document 112, will be described with reference to FIG. 19. The document configuration table 311 is omitted from FIG. 19.

J1)
J1-1) From the end position of the parent chunk 1801 determined in storage process SY4 (step 505), the variable-length deduplication module 33 determines child chunks one after another (steps 403 to 406). In the example of FIG. 19, the child chunks c₁₁ and c₁₂ are determined, and as a result the cut point reaches the end of the document 112 (step 404). In this case, regardless of whether the cut point is at least the concatenation window size W (W = 10) away from the end position of the parent chunk 1801 determined in storage process SY4, the module defines a parent chunk 1901 by concatenating the child chunks c₁₁ (“in”) and c₁₂ (“g and”) (step 414).

J1-2) A data fragment corresponding to the identifier (hash value) H_KL of the parent chunk 1901 is registered in the chunk list table 312 (Yes in step 415). The result of step 415 for the parent chunk 1901 is therefore Yes (registered), as indicated by arrow 1911 in FIG. 19. In this case, the process of registering the identifier H_KL and the data (data fragment) “ing and” of the parent chunk 1901 defined in step 414 in the chunk list table 312 (step 416) is not performed. The chunk list table 312 is therefore unchanged before and after process J, as indicated by arrow 1902 in FIG. 19. Although omitted from FIG. 19, the variable-length deduplication module 33 registers the document name “document #2” of the document 112 and the identifier H_KL of the parent chunk 1901 in the document configuration table 311 (step 503). This completes the storage process SY for storing the document 112.

FIG. 20 shows the state of the document configuration table 311 and the chunk list table 312 after the storage process SY described above (that is, storage processes SY1 to SY5) has completed following the storage process SX, in other words after the documents 111 and 112 have been stored, together with the document 112 and the sequence of parent chunks cut out from it.

Similarly, FIG. 21 shows the state of the document configuration table 311 and the chunk list table 312 after the documents 111 and 112 have been stored, together with the sequences of parent chunks cut out from the documents 111 and 112. FIG. 21 shows that, in this embodiment where the document 112 is stored after the document 111, the storage process SY registers the document 112 while eliminating duplicate data.

<Document acquisition processing>
Next, document acquisition processing in the document storage device 10 will be described with reference to the flowchart of FIG. 22.
First, assume that a document acquisition instruction has been sent from the client device 20 to the document storage device 10 via the network 30. This instruction includes a document name that specifies the document to be acquired from the document storage device 10.

The document acquisition instruction sent from the client device 20 is received by the command reception module 32 of the document storage device 10. When the command reception module 32 receives the instruction, the document acquisition unit 320 in the command reception module 32 acquires the identifiers of all chunks (parent chunks) registered in the document configuration table 311 in association with the document name specified by the instruction (step 2201). As described above, the order of the acquired identifiers matches the order of the chunks in the corresponding document.

Having acquired the group of identifiers from the document configuration table 311, the document acquisition unit 320 acquires the group of chunks (data fragments) registered in the chunk list table 312 in association with each of those identifiers (step 2202).

Using the group of chunks acquired from the chunk list table 312, the document acquisition unit 320 reconstructs the data of the document whose name was specified in the document acquisition instruction from the client device 20, arranging the chunks in the same order as the identifiers acquired earlier (step 2203).

The command reception module 32 returns the document data reconstructed by the document acquisition unit 320 to the client device 20 as the response to the document acquisition instruction (step 2204).
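Steps 2201 to 2203 amount to two table lookups followed by a concatenation. The sketch below models both tables as plain Python dictionaries, which is an assumption made for illustration; the identifier labels and fragments are the ones named in the storage walk-through for the document 112, and the function name is hypothetical.

```python
def get_document(name, doc_table, chunk_table):
    """Rebuild a document from its ordered list of parent chunk identifiers."""
    ids = doc_table[name]                        # step 2201: identifiers in document order
    chunks = [chunk_table[i] for i in ids]       # step 2202: fetch each data fragment
    return "".join(chunks)                       # step 2203: concatenate in the same order


# Hypothetical in-memory stand-ins for tables 311 and 312.
chunk_table = {
    "H_ABC": "The file na",
    "H_XY": "me ABCD spe",
    "H_EF": "cified ",
    "H_G'": "by path is ope",
    "H_HIJ": "ned for read",
    "H_KL": "ing and",
}
doc_table = {"document #2": ["H_ABC", "H_XY", "H_EF", "H_G'", "H_HIJ", "H_KL"]}

print(get_document("document #2", doc_table, chunk_table))
```

Because the document configuration table preserves chunk order, no offsets are needed to rebuild the whole document.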

In some cases the client device 20, in response to a user request, needs to acquire only a data fragment of a document (document data) from the document storage device 10. One known way for the client device 20 to do this is to specify, in addition to the document name, the position of the data fragment within the document and the length of the data fragment. For the document storage device 10 to support this method, it must first reconstruct the document data of the named document as described above and then extract the data fragment at the specified position and length from the reconstructed data.

To avoid this, information indicating the position and length of a chunk within the corresponding document data may, for example, be attached to the identifier of that chunk (parent chunk) when the identifier is registered in the document configuration table 311 in association with the document name. The data fragment at a specified position and length in a specified document can then be acquired simply by consulting this information and identifying the chunks held in the chunk list table 312 in association with the identifiers to which the information is attached.
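A minimal sketch of such a position-and-length lookup follows. It assumes each document configuration entry is a tuple of (identifier, offset, length), which is one possible way to attach the information; the passage does not fix a layout. The function name and the slicing of partially covered chunks are also assumptions.

```python
def get_range(entries, chunk_table, pos, length):
    """Return `length` characters starting at `pos` without rebuilding the document.

    entries: list of (identifier, offset, chunk_length) in document order.
    """
    end = pos + length
    parts = []
    for ident, off, size in entries:
        if off + size <= pos or off >= end:
            continue                               # chunk lies entirely outside the range
        data = chunk_table[ident]
        # Keep only the part of this chunk that overlaps [pos, end).
        parts.append(data[max(pos, off) - off:min(end, off + size) - off])
    return "".join(parts)


chunk_table = {"H_ABC": "The file na", "H_XY": "me ABCD spe", "H_EF": "cified "}
entries = [("H_ABC", 0, 11), ("H_XY", 11, 11), ("H_EF", 22, 7)]

print(get_range(entries, chunk_table, pos=9, length=7))
```

Only the chunks that overlap the requested range are read from the chunk table; the third chunk in this example is never fetched.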

<Summary of this embodiment>
As described above, in this embodiment the document storage device 10 functions as a data division apparatus that divides arbitrary document data into variable-length chunks (parent chunks or registered parent chunks, or first data fragments) while performing duplication detection.

According to the present embodiment, the length of a child chunk (second data fragment) serves as the offset interval for duplication detection, while the detection itself is performed at the length of a parent chunk (third data fragment), which is likely to be longer than a child chunk and likely to be used as the unit of registration. With this configuration, duplication detection can be performed by a simpler and faster technique than the prior art, keeping the deduplication rate high while reducing the number of registered parent chunks (first data fragments), that is, the number of divisions.

Further, according to the present embodiment, child chunks (second data fragments) are determined in order from one end of the data to be divided. When they satisfy a certain condition, the child chunks determined so far are concatenated (or, if only one child chunk has been determined, it is used without concatenation) to determine a parent chunk (third data fragment), and the presence or absence of duplication of that parent chunk is detected. Duplication detection can therefore be performed at high speed when the data to be divided is input as a stream.
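The division flow summarized above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the embodiment itself: child chunks have a fixed length `CHILD_LEN`, the first condition is taken to be "the candidate consists of `PARENT_CHILDREN` child chunks", the second condition is approximated by a shift limit `MAX_SHIFTS`, and SHA-256 stands in for the identifier generation. All names are hypothetical. The function works on an in-memory buffer for brevity, but it scans strictly from the head of the data toward the tail.

```python
import hashlib

CHILD_LEN = 4        # assumed fixed child-chunk length (the offset interval)
PARENT_CHILDREN = 2  # assumed first condition: a parent candidate is 2 children
MAX_SHIFTS = 3       # assumed second condition: give up after this many shifts


def digest(data):
    """Assumed identifier: SHA-256 hex digest of the chunk contents."""
    return hashlib.sha256(data).hexdigest()


def divide(data, store):
    """Split `data` into parent chunks, reusing chunks already in `store`.

    `store` maps identifier -> chunk bytes and is updated in place.
    Returns the identifiers that reconstruct `data` in order.
    """
    children = [data[i:i + CHILD_LEN] for i in range(0, len(data), CHILD_LEN)]
    ids = []
    start = 0  # index of the first child not yet covered by a registered chunk

    def register(lo, hi):
        chunk = b"".join(children[lo:hi])
        ident = digest(chunk)
        store.setdefault(ident, chunk)  # keep the existing copy on a duplicate
        ids.append(ident)

    while start < len(children):
        hit = None
        # Slide the candidate one child chunk at a time until a duplicate is
        # found or the shift limit (second condition) is reached.
        for shift in range(MAX_SHIFTS + 1):
            lo = start + shift
            hi = lo + PARENT_CHILDREN
            if hi > len(children):
                break
            if digest(b"".join(children[lo:hi])) in store:
                hit = lo
                break
        if hit is None:
            # No duplicate: register the first candidate after the last chunk.
            end = min(start + PARENT_CHILDREN, len(children))
            register(start, end)
            start = end
        else:
            if hit > start:
                register(start, hit)  # data sandwiched before the duplicate
            register(hit, hit + PARENT_CHILDREN)
            start = hit + PARENT_CHILDREN
    return ids
```

As a usage example, storing `b"AAAABBBBCCCCDDDD"` and then `b"XXXXAAAABBBBCCCCDDDD"` with the same `store` registers only the four inserted bytes as a new chunk; the two existing parent chunks are found again after a shift of one child chunk. With fixed-length child chunks, as assumed here, only insertions aligned to `CHILD_LEN` are recovered this way.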

The present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements may be modified and embodied without departing from the gist of the invention. For example, in the above embodiment the document data is divided from its beginning toward its end, but it may instead be divided from its end toward its beginning. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from the full set of constituent elements shown in the embodiment.

DESCRIPTION OF SYMBOLS: 10 ... document storage device, 20 ... client device, 30 ... network, 31 ... document storage unit, 32 ... command reception module (input means), 33 ... variable-length deduplication module, 34 ... working memory, 311 ... document configuration table, 312 ... chunk list table, 320 ... document acquisition unit, 331 ... child chunk determination unit (second data fragment determination means), 332 ... parent chunk determination unit (third data fragment determination means), 333 ... identifier generation unit, 334 ... duplication detection unit, 335 ... parent chunk registration unit (first data fragment determination means), 336 ... control unit, 341 ... document buffer, 342 ... register unit.

Claims

A data division method, applied to an apparatus comprising input means, first data fragment determination means, second data fragment determination means, third data fragment determination means, duplication detection means, and control means, for dividing arbitrary data into a plurality of first data fragments of arbitrary length while performing duplication detection, the method comprising:
an input step in which the input means inputs the arbitrary data;
a first determination step in which the second data fragment determination means sequentially determines second data fragments of arbitrary length or of a predetermined length from the remaining data portion of the input arbitrary data that has not yet been determined as a first data fragment;
a second determination step in which the third data fragment determination means determines, as one third data fragment, one second data fragment itself or a combination of a plurality of second data fragments determined in the first determination step until a state satisfying a predetermined first condition is reached;
a detection step in which the duplication detection means detects whether the determined third data fragment is a duplicate according to whether a first data fragment whose bit string matches the determined third data fragment has already been determined;
a third determination step in which the first data fragment determination means determines the determined third data fragment as a first data fragment when the duplication is detected;
a first control step in which, when the duplication is not detected, the control means repeats, until the duplication is detected while a predetermined second condition is satisfied, control that re-executes the first and second determination steps so that one new second data fragment or a plurality of new second data fragments is determined until the state satisfying the first condition is reached, and that causes the third data fragment determination means to determine, as one new third data fragment, the one new second data fragment itself, a combination of the plurality of new second data fragments, a combination of a part of the third data fragment for which the duplication was not detected and the one new second data fragment, or a combination of a part of the third data fragment for which the duplication was not detected and the plurality of new second data fragments;
a fourth determination step in which, when the duplication is not detected even though the first control step is repeated while the second condition is satisfied, the first data fragment determination means determines, as a new first data fragment, one second data fragment itself or a combination of a plurality of second data fragments that satisfies the first condition from among the second data fragments determined in the meantime; and
a second control step in which the control means repeats the first control step until all of the input arbitrary data has been divided into first data fragments.

The data division method according to claim 1, wherein:
the second data fragments are determined in order from a first end of the remaining data portion toward a second end of the remaining data portion;
the first condition is a condition relating to the length of the one second data fragment determined in the first determination step, or to the length, after concatenation, of the plurality of second data fragments sequentially determined in the first determination step; and
in the second determination step, when a plurality of second data fragments are determined in the first determination step before the state satisfying the first condition is reached, the plurality of second data fragments are concatenated in the order of determination, and the third data fragment is thereby determined as the combination of the plurality of second data fragments.

The data division method according to claim 2, wherein the second determination step includes:
a step of removing, from the third data fragment for which the duplication was not detected, at least one second data fragment that is included in that third data fragment and lies closest to the first end of the remaining data portion; and
a step of determining a new third data fragment by incorporating, into the third data fragment from which the at least one second data fragment has been removed, the one new second data fragment or the plurality of new second data fragments determined in the first determination step until the state satisfying the first condition is reached.

The data division method according to claim 1, wherein:
the one second data fragment itself or the combination of the plurality of second data fragments that is determined as the new first data fragment when the duplication is not detected even though the first control step is repeated within the range of the second condition is the third data fragment determined first after the most recent determination of a first data fragment; and
the third determination step includes a step in which, when a first data fragment is determined as a result of the duplication being detected through repetition of the first control step within the range of the second condition, the first data fragment determination means determines, as a new first data fragment, the data fragment that lies between the previously determined first data fragment and the first data fragment determined this time and that has not yet been determined as a first data fragment.

The data division method according to claim 4, wherein the first condition is any one of the following conditions on the length of the one second data fragment determined in the first determination step, or on the length, after concatenation, of the plurality of second data fragments sequentially determined in the first determination step:
the length exceeds a predetermined reference length;
the length is equal to a predetermined reference length;
the length is closest to a predetermined reference length; or
the length is the maximum value not exceeding a predetermined reference length.

The data division method according to claim 4, wherein the second condition is a condition relating to a data fragment length, the data fragment length being the length from the first end of the remaining data portion to the end, farthest from that first end, of the third data fragment for which the duplication was most recently not detected.

The data division method according to claim 6, wherein the second condition is any one of the following conditions on the data fragment length:
the data fragment length exceeds a predetermined reference length;
the data fragment length is equal to a predetermined reference length;
the data fragment length is closest to a predetermined reference length; or
the data fragment length is the maximum value not exceeding a predetermined reference length.

A data dividing device for dividing arbitrary data into a plurality of first data fragments of arbitrary length while performing duplication detection, the device comprising:
input means for inputting the arbitrary data;
second data fragment determination means for sequentially determining second data fragments of arbitrary length or of a predetermined length from the remaining data portion of the input arbitrary data that has not yet been determined as a first data fragment;
third data fragment determination means for determining, as one third data fragment, one second data fragment itself or a combination of a plurality of second data fragments determined by the second data fragment determination means until a state satisfying a predetermined first condition is reached;
duplication detection means for detecting whether the determined third data fragment is a duplicate according to whether a first data fragment whose bit string matches the determined third data fragment has already been determined;
first data fragment determination means for determining the determined third data fragment as a first data fragment when the duplication is detected; and
control means for repeating, when the duplication is not detected and until the duplication is detected while a predetermined second condition is satisfied, control that causes the second data fragment determination means to determine one new second data fragment or a plurality of new second data fragments until the state satisfying the first condition is reached, and that causes the third data fragment determination means to determine, as one new third data fragment, the one new second data fragment itself, a combination of the plurality of new second data fragments, a combination of a part of the third data fragment for which the duplication was not detected and the one new second data fragment, or a combination of a part of the third data fragment for which the duplication was not detected and the plurality of new second data fragments, and for further repeating this control until all of the input arbitrary data has been divided into first data fragments,
wherein, when the duplication is not detected even though the control is repeated while the second condition is satisfied, the first data fragment determination means determines, as a new first data fragment, one second data fragment itself or a combination of a plurality of second data fragments that satisfies the first condition from among the second data fragments determined in the meantime.

The data dividing device according to claim 8, wherein:
the second data fragment determination means determines the second data fragments in order from a first end of the remaining data portion toward a second end of the remaining data portion;
the first condition is a condition relating to the length of the one second data fragment determined by the second data fragment determination means, or to the length, after concatenation, of the plurality of second data fragments sequentially determined by the second data fragment determination means; and
when a plurality of second data fragments are determined by the second data fragment determination means before the state satisfying the first condition is reached, the third data fragment determination means concatenates the plurality of second data fragments in the order of determination and thereby determines the third data fragment as the combination of the plurality of second data fragments.

The data dividing device according to claim 9, wherein the third data fragment determination means removes, from the third data fragment for which the duplication was not detected, at least one second data fragment that is included in that third data fragment and lies closest to the first end of the remaining data portion, and determines a new third data fragment by incorporating, into the third data fragment from which the at least one second data fragment has been removed, the one new second data fragment or the plurality of new second data fragments determined by the second data fragment determination means until the state satisfying the first condition is reached.

The data dividing device according to any one of claims 8 to 10, wherein:
the one second data fragment itself or the combination of the plurality of second data fragments that is determined as the new first data fragment when the duplication is not detected even though the control is repeated within the range of the second condition is the third data fragment determined first after the most recent determination of a first data fragment; and
when a first data fragment is determined as a result of the duplication being detected through repetition of the control within the range of the second condition, the first data fragment determination means determines, as a new first data fragment, the data fragment that lies between the previously determined first data fragment and the first data fragment determined this time and that has not yet been determined as a first data fragment.