JP2013003883A - Storage device, storage method, and program

Info

Publication number: JP2013003883A
Application number: JP2011134994A
Authority: JP
Inventors: Satoshi Yamakawa (山川 聡)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-06-17
Filing date: 2011-06-17
Publication date: 2013-01-07
Anticipated expiration: 2031-06-17
Also published as: JP5751041B2

Abstract

PROBLEM TO BE SOLVED: To allow a new data format to be added to a deduplication storage apparatus without major modification, and to prevent a single point of failure from arising.

SOLUTION: The storage apparatus includes a deduplication control unit and a duplicate data management unit. The deduplication control unit extracts at least part of the file data as real data according to the data format of the file data, generates a stub file by replacing the real data portion of the file data with the storage destination address of the real data in the storage device and adding an identifier indicating the data format to the file data, and stores the stub file in the storage device. The duplicate data management unit holds the digest value of data stored in the storage device together with the storage destination address of that data in the storage device, and determines whether it holds a digest value matching the digest value of the real data.

Description

The present invention relates to a storage apparatus, a storage method, and a program, and more particularly to a deduplication storage apparatus, storage method, and program that avoid storing files redundantly.

For storage apparatuses that centrally store data generated by a plurality of computing terminals, a technique called deduplication is known. When data is about to be stored on a physical storage medium such as a hard disk drive, the technique determines whether the data duplicates data that is already stored. Duplicate data is not written to the storage medium; only pointer information to the already stored copy is recorded, which reduces the physical storage capacity used.

Deduplication determines duplication against already stored data either per file or per physical data block, the fixed-size unit allocated when a file system stores data on a storage medium. For this determination, small digests of several tens to several hundreds of bits, generated by hash functions such as SHA-1 (Secure Hash Algorithm 1) and MD5 (Message Digest 5) that are also used in digital authentication, are compared with each other to decide whether two files or data blocks consist of the same byte sequence.
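The digest comparison described above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the function names `digest` and `probably_same` and the 4096-byte block size are assumptions introduced here, and SHA-1 is used simply because the text names it.

```python
import hashlib


def digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of a byte string (hypothetical helper)."""
    return hashlib.new(algorithm, data).hexdigest()


def probably_same(a: bytes, b: bytes) -> bool:
    """Compare two blocks by their short digests instead of byte by byte."""
    return digest(a) == digest(b)


# Assumed 4096-byte blocks, purely for illustration.
block_a = b"x" * 4096
block_b = b"x" * 4096
block_c = b"y" * 4096
```

A SHA-1 digest is 160 bits regardless of the block size, so comparing digests costs the same for a small block as for a large file.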

Adopting such a digest-based duplication determination method reduces the processing cost of duplication determination on the storage apparatus. In particular, in storage processing where high-speed I/O is expected, performing duplication determination concurrently with I/O processing limits the degradation of I/O performance.

For this reason, deduplication storage apparatuses that use digest data for duplication determination are widely adopted to reduce data storage costs, particularly in computing environments where many files or data blocks consisting of the same byte sequence are expected. For example, they are used in storage apparatuses for backup data and in storage apparatuses that hold image data of the system portions of multiple virtual OSs.

As an example of a deduplication storage apparatus, Patent Document 1 describes a storage system that eliminates duplicate storage to reduce data capacity.

Patent Document 1: JP 2010-204970 A

The following analysis was made by the present inventor.

In the deduplication storage apparatus described above, a single unit of duplication determination is used when determining whether data to be stored is a duplicate. That is, duplication determination is performed on one fixed data unit, such as a file or a fixed-size block.

On the other hand, a method is also known in which, instead of performing duplication determination in a fixed unit such as a file or a fixed-size block, the way data is divided for duplication determination is changed according to the type of the specific file or data format, so that more of the potentially duplicated data is found.

As described above, by using various units of duplication determination, deduplication storage can detect data that is likely to be duplicated without omission.

However, in a system that changes the data division method used for duplication determination according to the data format type, a mechanism for managing the information used to restore data after deduplication (the data division method, the data configuration information, and the storage destination information of the divided data) must be prepared for each data format and managed centrally. Consequently, when a new data format is added as a deduplication target, it cannot be incorporated into the existing data management mechanism, and multiple management mechanisms must be provided. Moreover, when data is managed centrally in this way, loss of the management data makes all deduplicated data unrecoverable, so the data management portion becomes a single point of failure (SPOF). For example, in the storage system described in Patent Document 1, the divided data is managed centrally by a metadata storage unit, and the metadata storage unit becomes a single point of failure.

Accordingly, the challenge is to make it possible to add a new data format to a deduplication storage apparatus without major modification of the apparatus, and to prevent a single point of failure from arising. An object of the present invention is to provide a storage apparatus, a storage method, and a program that solve this problem.

A storage apparatus according to a first aspect of the present invention includes:
a deduplication control unit that receives file data to be stored in a storage device, extracts at least part of the file data as real data according to the data format of the file data, generates a stub file in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format is added to the file data, and stores the generated stub file in the storage device; and
a duplicate data management unit that holds the digest value of data stored in the storage device and the storage destination address of that data in the storage device, and determines whether it holds a digest value matching the digest value of the real data.

A storage method according to a second aspect of the present invention includes:
receiving file data to be stored in a storage device and extracting at least part of the file data as real data according to the data format of the file data;
determining whether the digest value of the real data matches the digest value of data stored in the storage device;
generating a stub file in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format is added to the file data; and
storing the stub file in the storage device.

A program according to a third aspect of the present invention causes a computer to execute:
a process of receiving file data to be stored in a storage device and extracting at least part of the file data as real data according to the data format of the file data;
a process of determining whether the digest value of the real data matches the digest value of data stored in the storage device;
a process of generating a stub file in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format is added to the file data; and
a process of storing the stub file in the storage device.
The program can be recorded on a computer-readable storage medium and provided as a program product. The storage medium may be non-transitory.

According to the storage apparatus, storage method, and program of the present invention, a new data format can be added to a deduplication storage apparatus without major modification of the apparatus, and a single point of failure can be prevented from arising.

FIG. 1 is a block diagram showing, as an example, the configuration of a storage system including a storage apparatus according to an embodiment.
FIG. 2 is a block diagram showing, as an example, the configuration of the deduplication management unit in the storage apparatus according to the embodiment.
FIG. 3 is a diagram showing an example of the internal data structure of file data subject to deduplication processing.
FIG. 4 is a diagram showing an example of the internal data structure of a stub file.
FIG. 5 is a diagram showing an example of dividing file data subject to deduplication processing.
FIG. 6 is a diagram showing an example of the internal data structure of a stub file.

First, an overview of the present invention is given. The drawing reference numerals attached to this overview are solely examples to aid understanding and are not intended to limit the present invention to the illustrated modes.

The storage apparatus according to the present invention is a deduplication-capable storage apparatus provided with means for managing the information used for data restoration after deduplication (the data division method, the data configuration information, and the storage destination information of the divided data) in a distributed manner on a per-file basis, rather than centrally in the system.

The storage apparatus of the present invention can also manage the above information within a single framework even when the data division method is changed according to the data format.

In addition, the storage apparatus of the present invention includes:
data division means that, when duplication determination is performed on file data to be deduplicated and stored, changes the data division method for each file format;
data management means that stores, for each file and in place of the file data body, a stub file (described later) containing the file configuration management information needed to restore the divided data as file data, the storage destination address information of the divided data, and an identifier indicating how the file was divided; and
data restoration means that, when deduplicated file data is read, reads the file configuration management information, the storage destination address information of the divided data, and the identifier indicating how the file was divided from the stub file, reads the data constituting the requested file data from storage, and restores it as file data.

The storage apparatus of the present invention also includes duplication determination means for checking, when data is stored, whether it duplicates data that is already stored, and deduplication storage means for storing the data only when no duplication has occurred. A stub file is a file that stores not the file data body but a group of metadata for restoring the file data body.

Referring to FIGS. 1 to 4, the storage apparatus (20) of the present invention includes a deduplication control unit (103) and a duplicate data management unit (104). The deduplication control unit (103) receives file data to be stored in the storage device (25), extracts at least part of the file data as real data (see FIG. 3) according to the data format of the file data, generates a stub file (see FIG. 4) in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format (a deduplication method identifier) is added to the file data, and stores the generated stub file in the storage device (25). The duplicate data management unit (104) holds the digest value of data stored in the storage device (25) and the storage destination address of that data in the storage device (25), and determines whether it holds a digest value matching the digest value of the real data.
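As a rough sketch of the stub generation just described, the following assumes a toy format in which the first `header_len` bytes are kept as metadata and the remainder is the real data. Everything here is hypothetical illustration: the function `make_stub`, the dictionary-shaped stub, the in-memory `device`, and the address allocation are assumptions, not structures defined by the patent.

```python
import hashlib


def make_stub(file_data: bytes, header_len: int, format_id: str,
              table: dict, device: dict) -> dict:
    """Build a stub that holds an address in place of the real data.

    table  : digest value -> storage address (assumed shape of the
             duplicate data management unit's records)
    device : storage address -> bytes (toy stand-in for the storage device)
    """
    metadata, real_data = file_data[:header_len], file_data[header_len:]
    d = hashlib.sha1(real_data).hexdigest()
    if d not in table:
        # No matching digest is held: store the real data and record it.
        address = len(device)  # toy address allocation (assumption)
        device[address] = real_data
        table[d] = address
    # The stub keeps the metadata, the address, and the format identifier.
    return {"format_id": format_id, "metadata": metadata, "address": table[d]}
```

Two files that differ only in their headers then share one stored copy of the payload, while each keeps its own stub.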

When the duplicate data management unit (104) holds a digest value matching the digest value of the real data, it may notify the deduplication control unit (103) of the storage destination address, in the storage device (25), of the data corresponding to the matched digest value, and the deduplication control unit (103) may generate the stub file by replacing the real data portion of the file data with the notified storage destination address.

When the duplicate data management unit (104) does not hold a digest value matching the digest value of the real data, it may notify the deduplication control unit (103) to that effect. The notified deduplication control unit (103) may then store the real data in the storage device (25), generate the stub file by replacing the real data portion of the file data with the resulting storage destination address, and notify the duplicate data management unit (104) of that address, which the duplicate data management unit (104) holds as the storage destination address of the real data in the storage device (25).

Upon receiving a read request for a stub file, the deduplication control unit (103) may refer to the storage destination address contained in the stub file, read the real data from the storage device (25), replace the storage destination address portion of the stub file with the real data, delete the identifier contained in the stub file to restore the stub file to file data in the original data format, and send the restored file data to the requester.
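The read path can be sketched in the same spirit. The stub layout (a dictionary with `format_id`, `metadata`, and `address` keys) and the in-memory `device` are assumptions made for this illustration; the patent's actual stub layout is the one shown in its figures.

```python
def restore(stub: dict, device: dict) -> bytes:
    """Rebuild the original file data from a stub (hypothetical layout)."""
    # Follow the stored address to fetch the real data.
    real_data = device[stub["address"]]
    # Put the real data back where the address stood; the format
    # identifier is simply not carried into the restored file.
    return stub["metadata"] + real_data


device = {0: b"payload"}
stub = {"format_id": "fmt-a", "metadata": b"HDR1", "address": 0}
restored = restore(stub, device)
```

Because every stub carries its own address and identifier, restoring one file needs no lookup in a central metadata store.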

According to the present invention, even when data division methods for multiple data formats are adopted, a single data management technique based on stub files stored in place of the file data body is used, so the basic technique for managing data divided by deduplication need not change. Therefore, support for additional data formats can be added within the single stub-file-based data management framework, and large-scale modification of the system becomes unnecessary.

Also, according to the present invention, the various pieces of information needed at restoration time can be managed in a distributed manner, as a stub file for each file. Unlike the case where such information is managed centrally, this avoids the situation in which the centrally managed portion becomes a single point of failure and all data becomes unrecoverable. Distributed management also makes it possible to reduce the load when reading data.

In the present invention, the following modes are possible.
[Mode 1]
As in the storage apparatus according to the first aspect above.
[Mode 2]
When the duplicate data management unit holds a digest value matching the digest value of the real data, it may notify the deduplication control unit of the storage destination address, in the storage device, of the data corresponding to the matched digest value, and the deduplication control unit may generate the stub file by replacing the real data portion of the file data with the notified storage destination address.
[Mode 3]
When the duplicate data management unit does not hold a digest value matching the digest value of the real data, it may notify the deduplication control unit to that effect; the deduplication control unit may store the real data in the storage device, generate the stub file by replacing the real data portion of the file data with the resulting storage destination address, and notify the duplicate data management unit of that address; and the duplicate data management unit may hold that address as the storage destination address of the real data in the storage device.
[Mode 4]
Upon receiving a read request for a stub file, the deduplication control unit may refer to the storage destination address contained in the stub file, read the real data from the storage device, replace the storage destination address portion of the stub file with the real data, delete the identifier contained in the stub file to restore the stub file to file data in the original data format, and send the restored file data to the requester.
[Mode 5]
The above storage apparatus may further include the storage device.
[Mode 6]
As in the storage method according to the second aspect above.
[Mode 7]
In the above storage method, when the digest value of the real data matches the digest value of data stored in the storage device, the stub file generation step may replace the real data portion of the file data with the storage destination address, in the storage device, of the data corresponding to the matched digest value.
[Mode 8]
The above storage method may include, when the digest value of the real data does not match the digest value of data stored in the storage device:
storing the real data in the storage device; and
replacing the real data portion of the file data with the storage destination address of the real data in the storage device.
[Mode 9]
The above storage method may include:
upon receiving a read request for a stub file, referring to the storage destination address contained in the stub file and reading the real data from the storage device;
replacing the storage destination address portion of the stub file with the real data and deleting the identifier contained in the stub file, thereby restoring the stub file to file data in the original data format; and
sending the restored file data to the requester.
[Mode 10]
As in the program according to the third aspect above.
[Mode 11]
In the above program, when the digest value of the real data matches the digest value of data stored in the storage device, the stub file generation process may replace the real data portion of the file data with the storage destination address, in the storage device, of the data corresponding to the matched digest value.
[Mode 12]
The above program may cause the computer to execute, when the digest value of the real data does not match the digest value of data stored in the storage device:
a process of storing the real data in the storage device; and
a process of replacing the real data portion of the file data with the storage destination address of the real data in the storage device.
[Mode 13]
The above program may cause the computer to execute:
a process of, upon receiving a read request for a stub file, referring to the storage destination address contained in the stub file and reading the real data from the storage device;
a process of replacing the storage destination address portion of the stub file with the real data and deleting the identifier contained in the stub file, thereby restoring the stub file to file data in the original data format; and
a process of sending the restored file data to the requester.

(Embodiment)
A storage apparatus according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing, as an example, the configuration of a storage system including the storage apparatus of the present embodiment. Referring to FIG. 1, the storage system includes a client 1, a storage apparatus 20, and a management client 30. These apparatuses are connected to one another via a network 10.

The client 1 includes a network file system client 2. Data access from the client 1 to the storage apparatus 20 via the network 10 is performed through the network file system client 2, using an industry-standard protocol such as NFS (Network File System) or CIFS (Common Internet File System).

Referring to FIG. 1, the storage apparatus 20 includes a network file system server 21, a filter unit 22, a deduplication management unit 23, a file system 24, a storage device 25, and a management console 26.

The network file system server 21 mediates data processing in the storage apparatus 20, such as writing and reading data, based on data access requests from the network file system client 2 of the client 1.

Of the data read requests contained in data access requests, the filter unit 22 forwards those targeting deduplicated file data to the deduplication management unit 23. The filter unit 22 holds a list of deduplicated file data. When the list contains an entry for the file data that a request designates for reading, the filter unit 22 forwards the request to the deduplication management unit 23. For a request other than a read, or a read request for file data with no entry in the list, the filter unit 22 forwards the request to the file system 24.
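The routing rule of the filter unit can be expressed compactly. The request shape (a dictionary with `op` and `path` keys), the function name `route`, and the returned labels are assumptions chosen for this sketch; the patent does not specify how requests or the list are represented.

```python
def route(request: dict, deduplicated_paths: set) -> str:
    """Return the component a request is forwarded to (illustrative labels)."""
    is_read = request["op"] == "read"
    if is_read and request["path"] in deduplicated_paths:
        # Read of a deduplicated file: goes to the deduplication management unit.
        return "deduplication_management_unit_23"
    # Any non-read request, or a read of a file not in the list.
    return "file_system_24"


# Hypothetical list of deduplicated files held by the filter unit.
dedup_list = {"/data/a.img"}
```

Only reads of listed files take the deduplication path; everything else goes directly to the file system.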

The deduplication management unit 23 manages and controls deduplication storage.

The file system 24 manages the storage of data in the storage device 25 and reads and writes the data.

The storage device 25 includes a plurality of physical storage media such as HDDs (hard disk drives) and stores data persistently.

The management console 26 receives commands from outside the storage apparatus 20 and forwards them to the deduplication management unit 23.

The management client 30 accesses the management console 26 of the storage apparatus 20 via the network 10 and sends various command execution instructions to the deduplication management unit 23.

FIG. 2 is a block diagram showing, as an example, the configuration of the deduplication management unit 23 in the storage apparatus 20 according to the present embodiment. Referring to FIG. 2, the deduplication management unit 23 includes a file access unit 101, a command execution unit 102, a deduplication control unit 103, and a duplicate data management unit 104. The deduplication control unit 103 includes a plurality of (for example, N) deduplication units.

In response to requests from the deduplication control unit 103, the file access unit 101 reads file data from the file system 24 and writes file data to the file system 24.

重複排除制御部１０３は、複数の重複排除制御部１〜Ｎを備え、各重複排除制御部は担当するファイルフォーマットに対してデータの重複排除処理とデータの復元処理を実行する。 The deduplication control unit 103 includes a plurality of deduplication control units 1 to N, and each deduplication control unit executes data deduplication processing and data restoration processing for the file format in charge.

重複データ管理部１０４は、ファイルシステム２４を介してストレージデバイス２５に格納されているデータのダイジェスト値と、データの格納先のアドレス情報と、データの重複回数をテーブル化して保存する。また、重複データ管理部１０４は、テーブル化されたデータを元に、重複排除格納処理対象のデータが、ファイルシステム２４を介してストレージデバイス２５に格納されているデータと重複しているか否かを判定する。 The duplicate data management unit 104 tabulates and stores the digest value of data stored in the storage device 25 via the file system 24, the address information of the data storage destination, and the number of times of data duplication. Further, the duplicate data management unit 104 determines whether or not the data subject to deduplication storage processing is duplicated with the data stored in the storage device 25 via the file system 24 based on the tabulated data. judge.

<Deduplication storage>
First, the procedure by which the storage apparatus 20 stores data while eliminating duplicate data will be described with reference to FIGS. 1 to 6. Here, as an example, a case is described in which data already stored normally in the storage device 25 without deduplication is designated, by an instruction from the management client 30 via the management console 26, as a group of file data to be stored in deduplicated form, and the designated files are then stored in deduplicated form.

Via the management console 26, the management client 30 registers a list of the data storage paths of the file data to be deduplicated with the command execution unit 102 of the deduplication management unit 23, and issues a deduplication storage command. On receiving the command, the command execution unit 102 scans the registered list and stores the file data of each entry in deduplicated form, in order.

The command execution unit 102 extracts the file name extension from the storage path registered in the file data entry, selects the deduplication unit corresponding to that extension from among the N deduplication units provided in the deduplication control unit 103, and issues the path information and a deduplication storage command to it.
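The extension-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the extension table, the unit numbers, the case-insensitive comparison, and the use of 0 as a fallback identifier are all hypothetical and do not appear in the embodiment.

```python
import os

# Hypothetical mapping from file name extension to deduplication unit number (1..N).
UNITS_BY_EXTENSION = {".docx": 1, ".pptx": 2, ".pdf": 3}

# Assumed identifier for the case where no deduplication unit matches the extension.
FALLBACK_UNIT = 0


def select_unit(path: str) -> int:
    """Pick a deduplication unit from the extension of the storage path."""
    # os.path.splitext returns ("", "") style pairs; the second item is the extension.
    extension = os.path.splitext(path)[1].lower()
    return UNITS_BY_EXTENSION.get(extension, FALLBACK_UNIT)
```

A path without an extension yields an empty string from `os.path.splitext`, so it falls through to the fallback identifier in this sketch.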

Based on the path information and the deduplication command, the deduplication control unit 103 reads, via the file access unit 101, the corresponding file data stored in the storage device 25 managed by the file system 24.

FIG. 3 is a diagram showing an example of the internal data structure of file data subject to deduplication processing. As shown in FIG. 3, file data includes data that can be copied on its own into other file data or extracted as standalone file data, such as text data, image data, and video data (hereinafter referred to as "real data"), and metadata that is added by the application when the file is generated and that controls, for example, how the file is displayed. To analyze this internal structure for each extension, the deduplication control unit 103 includes N deduplication units, each corresponding to an extension.

The deduplication control unit 103 analyzes the structure of the file data and extracts all of the real data, and only the real data, from the file data. Whether data is duplicated is determined on the real data.

The deduplication control unit 103 calculates the digest value of the real data using a common hash function for computing data digests (SHA-1, MD5, etc.) and sends the digest value to the duplicate data management unit 104.
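For illustration, the digest calculation could be written with Python's standard `hashlib` module as below. The function name `digest` and the choice of a hexadecimal string as the digest representation are assumptions; the embodiment only states that a common hash function is used.

```python
import hashlib


def digest(real_data: bytes, algorithm: str = "sha1") -> str:
    """Return the hexadecimal digest of one piece of real data."""
    # hashlib.new accepts an algorithm name such as "sha1" or "md5".
    return hashlib.new(algorithm, real_data).hexdigest()
```

Identical real data always yields the same digest, which is what allows the duplicate data management unit 104 to detect duplicates by comparing digest values alone.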

The duplicate data management unit 104 scans the entries of the table it manages and checks whether an entry matching the digest value exists.

"When no entry matches the digest value"
When no matching entry exists, the duplicate data management unit 104 determines that the data is not duplicated, registers the digest value in the table as a new entry, and responds to the deduplication control unit 103 that there is no duplication.

After receiving the no-duplication response, the deduplication control unit 103 newly creates the real data corresponding to the digest value as file data in the file system 24 via the file access unit 101, and stores the real data in the storage device 25.

After storing the real data, the deduplication control unit 103 sends the storage destination address information in the file system 24, together with the digest value, to the duplicate data management unit 104.

Based on this information, the duplicate data management unit 104 registers the storage destination address information in the entry for the corresponding digest value, sets the duplication count to 1, and notifies the deduplication control unit 103 that registration of the entry is complete.

"When an entry matches the digest value"
On the other hand, when an entry matching the digest value sent from the deduplication control unit 103 exists, the duplicate data management unit 104 determines that the data is duplicated, sends the storage destination address information registered in the entry to the deduplication control unit 103, and increments the duplication count of the entry by 1.
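The two cases above can be sketched together as follows. This is a simplified single-threaded model, not the disclosed implementation: the class `DuplicateDataManager`, its methods, the function `store_real_data`, and the `write` callback standing in for the file access unit 101 are all hypothetical names.

```python
from typing import Callable, Dict, List, Optional


class DuplicateDataManager:
    """Hypothetical model of the duplicate data management unit 104."""

    def __init__(self) -> None:
        # digest -> [storage destination address, duplication count]
        self.table: Dict[str, List] = {}

    def lookup(self, digest: str) -> Optional[str]:
        """Return the stored address for a known digest, or None if it is new."""
        entry = self.table.get(digest)
        if entry is None:
            # No match: register the digest as a new entry without an address yet.
            self.table[digest] = [None, 0]
            return None
        # Match: increment the duplication count and report the address.
        entry[1] += 1
        return entry[0]

    def register_address(self, digest: str, address: str) -> None:
        """Record the address of newly stored data and set the count to 1."""
        self.table[digest] = [address, 1]


def store_real_data(manager: DuplicateDataManager, digest: str, data: bytes,
                    write: Callable[[bytes], str]) -> str:
    """Store data only if its digest is new; return its storage address."""
    address = manager.lookup(digest)
    if address is None:
        address = write(data)  # assumed to return the storage destination address
        manager.register_address(digest, address)
    return address
```

Storing the same real data twice through `store_real_data` writes it once and leaves the entry with a duplication count of 2.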

FIG. 4 is a diagram showing an example of the internal data structure of a stub file. Referring to FIG. 4, after the digest matching check in the duplicate data management unit 104 has finished for all the real data extracted from the file data, the deduplication control unit 103 replaces each real data portion of the file data structured as shown in FIG. 3 with the storage destination address information in the file system 24, and embeds the identifier (for example, a number) of the corresponding deduplication unit of the deduplication control unit 103 at the head of the file data as a deduplication method identifier. The deduplication control unit 103 then overwrites the file data targeted for deduplication in the file system 24 with this processed file data (that is, the stub file).
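A stub file of this kind could be modeled, purely for illustration, as shown below. The in-memory representation (a dictionary with a `method_id` and a list of tagged parts) is an assumption made for readability; the embodiment does not specify a byte-level layout, and the names `make_stub` and `address_of` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def make_stub(method_id: int,
              parts: List[Tuple[str, bytes]],
              address_of: Callable[[bytes], str]) -> Dict[str, object]:
    """Build a stub from file parts tagged "meta" or "real".

    address_of is an assumed callback that returns the storage destination
    address of a piece of real data.
    """
    stub_parts = []
    for kind, payload in parts:
        if kind == "real":
            # Replace the real data with its storage destination address.
            stub_parts.append(("addr", address_of(payload)))
        else:
            # Metadata is kept in place unchanged.
            stub_parts.append((kind, payload))
    # The deduplication method identifier is carried at the head of the stub.
    return {"method_id": method_id, "parts": stub_parts}
```

Keeping the metadata in place and swapping only the real data for addresses preserves the order of the parts, which the restoration step relies on.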

The deduplication control unit 103 also registers the file identifier of the stub file with the filter unit 22. Here, the file identifier is an identifier that uniquely identifies a file in the file system 24 of the storage apparatus 20.

The deduplication storage processing is completed by performing the above processing on all the file data registered as a list in the command execution unit 102.

"When deduplicating file data whose structure cannot be analyzed"
When none of the N deduplication units of the deduplication control unit 103 can analyze the structure of the file data, for example, when its extension matches none of the supported extensions or when it has no extension, the data is divided from the head into fixed-length or variable-length blocks and stored in deduplicated form, as shown in FIG. 5. In this case, as shown in FIG. 6, the stub data consists of a deduplication method identifier and a set of data storage destination address information.
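The fixed-length variant of this block-based handling can be sketched as follows. Several details are assumptions made for the sake of a runnable example: the 4096-byte default block size, the use of the SHA-1 digest itself as the storage destination address, and a plain dictionary standing in for the storage device 25. Variable-length splitting is not shown.

```python
import hashlib
from typing import Dict, List


def split_fixed(data: bytes, block_size: int = 4096) -> List[bytes]:
    """Divide data from the head into fixed-length blocks (last one may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def make_block_stub(method_id: int, data: bytes, block_size: int,
                    store: Dict[str, bytes]) -> Dict[str, object]:
    """Store each distinct block once and return a stub listing block addresses."""
    addresses = []
    for block in split_fixed(data, block_size):
        # Assumption: the digest doubles as the address in this toy store.
        address = hashlib.sha1(block).hexdigest()
        store.setdefault(address, block)  # identical blocks are stored only once
        addresses.append(address)
    return {"method_id": method_id, "addresses": addresses}
```

Concatenating the stored blocks in the order of the recorded addresses reproduces the original data, so the stub needs nothing beyond the method identifier and the address list.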

<Data reading>
Next, the procedure for reading file data stored in deduplicated form in response to a read request from the client 1 will be described.

First, the client 1 sends the identifier of the file to be read and a read request to the storage apparatus 20 via the network file system client 2.

Based on the file identifier and the read request, the network file system server 21 of the storage apparatus 20 sends the file identifier and a data read request toward the file system 24 via the filter unit 22.

The filter unit 22 checks whether the file identifier is registered in the list of file data stored in deduplicated form.

When the file identifier is not registered in the list, the filter unit 22 sends the data read request and the file identifier to the file system 24. The file data is then returned to the client 1 through the ordinary procedure for reading data from a file system.

On the other hand, when the file identifier is registered in the list, the filter unit 22 sends the data read request and the file identifier to the deduplication management unit 23.

Based on the data read request and the file identifier, the deduplication control unit 103 of the deduplication management unit 23 reads the corresponding file data from the file system 24 via the file access unit 101 (in this case, the data of a stub file, since the file is a deduplication storage target).

The deduplication control unit 103 selects the corresponding deduplication unit from among the N deduplication units according to the deduplication method identifier in the stub file data.

The deduplication control unit 103 analyzes the data structure of the stub file, which corresponds to FIG. 4 or FIG. 6, and, based on the data storage destination addresses, reads all the file data whose addresses are recorded in the stub file from the file system 24 via the file access unit 101.

The deduplication control unit 103 treats the read file data as real data, replaces the data storage destination address portions in the stub file with it, deletes the deduplication method identifier, and thereby restores the file data as it was before the deduplication storage processing. The deduplication control unit 103 then sends the restored file data to the filter unit 22.
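The restoration step can be sketched as below, again only as an illustration. The stub is assumed to be a dictionary holding a `method_id` and a list of parts tagged "meta" or "addr"; this representation, the function name `restore`, and the `read` callback standing in for the file access unit 101 are hypothetical.

```python
from typing import Callable, Dict, List


def restore(stub: Dict[str, object], read: Callable[[str], bytes]) -> bytes:
    """Rebuild the original file bytes from a stub.

    read is an assumed callback that returns the real data stored at an address.
    """
    restored: List[bytes] = []
    for kind, payload in stub["parts"]:
        if kind == "addr":
            # Replace each storage destination address with the real data it points to.
            restored.append(read(payload))
        else:
            # Metadata is copied through unchanged.
            restored.append(payload)
    # The deduplication method identifier is simply not written to the output.
    return b"".join(restored)
```

Because the method identifier is dropped and every address is expanded in place, the result in this model is byte-for-byte the file as it was before deduplication.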

The filter unit 22 sends the file data to the network file system server 21. The file data is sent to the client 1 via the network 10. This completes the file data read processing.

An embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration for carrying out the present invention is not limited to the configuration described above, and various design changes and the like can be made without departing from the gist of the present invention.

For example, when the storage apparatus 20 contains a computer system and the operations of the processing units described above are stored in the form of a program on a computer-readable recording medium, the above processing may be performed by the computer reading and executing the program. Examples of the computer-readable recording medium include a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory. The computer program may also be distributed to a computer over a communication line, and the computer that receives it may execute the program.

The program may implement only part of the functions described above. Furthermore, the program may be a so-called difference file (difference program) that implements the above functions in combination with a program already recorded in the computer system.

The disclosures of the prior art documents, such as the patent documents cited above, are incorporated herein by reference. Within the scope of the entire disclosure of the present invention (including the claims), and based on its basic technical concept, the embodiment can be changed and adjusted. Various combinations and selections of the disclosed elements are also possible within the scope of the claims of the present invention. That is, the present invention naturally includes the various variations and modifications that those skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical concept.

DESCRIPTION OF SYMBOLS
1 Client
2 Network file system client
10 Network
20 Storage apparatus
21 Network file system server
22 Filter unit
23 Deduplication management unit
24 File system
25 Storage device
26 Management console
30 Management client
101 File access unit
102 Command execution unit
103 Deduplication control unit
104 Duplicate data management unit

Claims

A storage apparatus comprising: a deduplication control unit that receives file data to be stored in a storage device, extracts at least part of the data from the file data as real data according to the data format of the file data, generates a stub file in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format is added to the file data, and stores the generated stub file in the storage device; and
a duplicate data management unit that holds digest values of data stored in the storage device and the storage destination addresses of that data in the storage device, and determines whether it holds a digest value that matches the digest value of the real data.

The storage apparatus according to claim 1, wherein, when the duplicate data management unit holds a digest value that matches the digest value of the real data, the duplicate data management unit notifies the deduplication control unit of the storage destination address, in the storage device, of the data corresponding to the matching digest value, and the deduplication control unit generates the stub file by replacing the real data portion of the file data with the notified storage destination address.

The storage apparatus according to claim 1 or 2, wherein, when the duplicate data management unit does not hold a digest value that matches the digest value of the real data, the duplicate data management unit notifies the deduplication control unit to that effect; the deduplication control unit stores the real data in the storage device, generates the stub file by replacing the real data portion of the file data with the storage destination address, and notifies the duplicate data management unit of the storage destination address; and the duplicate data management unit holds the notified storage destination address as the storage destination address of the real data in the storage device.

The storage apparatus according to any one of claims 1 to 3, wherein, when the deduplication control unit receives a read request for a stub file, the deduplication control unit reads the real data from the storage device by referring to the storage destination address included in the stub file, replaces the storage destination address portion in the stub file with the real data, deletes the identifier included in the stub file to restore the stub file to file data in the original data format, and sends the restored file data to the request source.

The storage apparatus according to any one of claims 1 to 4, further comprising the storage device.

A storage method comprising: a step of receiving file data to be stored in a storage device and extracting at least part of the data from the file data as real data according to the data format of the file data;
a step of determining whether the digest value of the real data matches the digest value of data stored in the storage device;
a step of generating a stub file in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format is added to the file data; and
a step of storing the stub file in the storage device.

The storage method according to claim 6, wherein, when the digest value of the real data matches the digest value of data stored in the storage device, the stub file generation step replaces the real data portion of the file data with the storage destination address, in the storage device, of the data whose digest value matched.

The storage method according to claim 6, further comprising: a step of storing the real data in the storage device when the digest value of the real data does not match the digest value of data stored in the storage device; and
a step of replacing the real data portion of the file data with the storage destination address of the real data in the storage device.

The storage method according to claim 6, further comprising: a step of, on receiving a read request for a stub file, reading the real data from the storage device by referring to the storage destination address included in the stub file;
a step of replacing the storage destination address portion in the stub file with the real data and deleting the identifier included in the stub file, thereby restoring the stub file to file data in the original data format; and
a step of sending the restored file data to the request source.

A program for causing a computer to execute: a process of receiving file data to be stored in a storage device and extracting at least part of the data from the file data as real data according to the data format of the file data;
a process of determining whether the digest value of the real data matches the digest value of data stored in the storage device;
a process of generating a stub file in which the real data portion of the file data is replaced with the storage destination address of the real data in the storage device and an identifier indicating the data format is added to the file data; and
a process of storing the stub file in the storage device.