JP5348300B2

JP5348300B2 - Data management program and multi-node storage system

Info

Publication number: JP5348300B2
Application number: JP2012197110A
Authority: JP
Inventors: 泰生野口; 一隆荻原; 雅寿田村; 芳浩土屋; 哲太郎丸山; 高志渡辺; 達夫熊野; 和一大江
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-09-07
Filing date: 2012-09-07
Publication date: 2013-11-20
Anticipated expiration: 2028-10-16
Also published as: JP2013008387A

Abstract

PROBLEM TO BE SOLVED: To shorten the period during which data access stops when a storage device fails. SOLUTION: When a storage device 1 malfunctions or fails and the elapsed time (T) measured by response time measuring means 4a exceeds a malfunction detection time (T1), malfunction detection means 4b outputs, to a control node 7, malfunction information indicating that the storage device 1 may have failed. The control node 7 stores the malfunction information in malfunction information storage means 7b. When an access node 8 subsequently accesses the first slice 1a in the storage device 1 and the access results in an error, access-related information indicating that the access failed is transmitted to the control node 7. The control node 7 then instructs disk nodes 5 and 6 to recover the slice 1a indicated by the access-related information. COPYRIGHT: (C)2013,JPO&INPIT

Description

The present invention relates to a data management program, a storage device diagnosis program, and a multi-node storage system for managing duplicated data, and in particular to a data management program, a storage device diagnosis program, and a multi-node storage system that perform data recovery processing according to the diagnosis result of a storage device.

One type of system for managing data on a network is the multi-node storage system. A multi-node storage system is composed of a plurality of disk nodes, at least one access node, and a control node. The disk nodes, access nodes, and control node are connected via a network.

A virtual disk is defined within the multi-node storage system. The virtual disk is composed of a plurality of unit storage areas called segments. Each disk node divides the storage area of its connected storage device into units called slices and manages them. The control node manages the correspondence between the segments that make up the virtual disk and the slices managed by each disk node. Metadata indicating this correspondence is sent from the control node to the access node.

When the access node receives a data access request that designates data in the virtual disk, it uses the metadata to determine the slice that corresponds to the segment in which the data is stored. The access node then transmits an access request to the disk node that manages the slice in which the data is actually stored.
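
The lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the fixed segment size, the `SliceRef` type, the node names, and the `resolve` function are all assumptions introduced for the example.

```python
from dataclasses import dataclass

# Assumed segment size (1 GiB); the text does not state a size.
SEGMENT_SIZE = 1024 ** 3


@dataclass(frozen=True)
class SliceRef:
    """Hypothetical reference to a slice managed by a disk node."""
    disk_node: str
    slice_no: int


def resolve(metadata, offset):
    """Map a virtual-disk byte offset to (slice reference, offset in slice).

    metadata maps a segment number to the slice assigned to that segment,
    mirroring the metadata the control node sends to the access node.
    """
    segment = offset // SEGMENT_SIZE
    return metadata[segment], offset % SEGMENT_SIZE


# Hypothetical metadata for a two-segment virtual disk.
metadata = {0: SliceRef("disk-node-A", 4), 1: SliceRef("disk-node-B", 1)}
target, offset_in_slice = resolve(metadata, SEGMENT_SIZE + 10)
```

The access node would then send its request to `target.disk_node`, naming `target.slice_no` and `offset_in_slice`.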

With such a multi-node storage system, a user of an access node can use the many storage devices connected to the disk nodes in the same way as a local disk device. Moreover, a multi-node storage system can duplicate data by assigning two slices to one segment. With duplicated data, no data is lost even if one storage device fails.

Even when data is duplicated, if a storage device becomes inaccessible due to a failure, the duplicated state is lost for some segments. In that case, recovery processing is performed. In recovery processing, the control node assigns a new slice to each segment whose duplicated state has been lost (a recovery target segment). A disk node then copies the data in the existing slice assigned to the recovery target segment to the newly assigned slice. This restores the duplicated state of the data.
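
As a rough sketch of that recovery step, the function below assigns a free slice and copies the surviving data into it. The tuple representation of slices, the `store` dictionary, and the rule of choosing a free slice on a different device are assumptions made for illustration; the text only says that a new slice is assigned and the data is copied.

```python
def recover(surviving, free_slices, store):
    """Restore duplication for one recovery target segment.

    surviving   -- (device, slice_no) of the slice that still holds the data
    free_slices -- list of unassigned (device, slice_no) slices
    store       -- dict mapping a slice to its contents (stands in for disks)
    """
    # Assumption: place the new copy on a different device than the survivor,
    # so that a single device failure cannot destroy both copies.
    target = next(s for s in free_slices if s[0] != surviving[0])
    free_slices.remove(target)
    # Copy the existing slice's data to the newly assigned slice.
    store[target] = store[surviving]
    return target


store = {("dev2", 0): b"data[A]"}
free = [("dev2", 1), ("dev3", 0)]
new_slice = recover(("dev2", 0), free, store)
```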

The two slices assigned to a virtual disk segment for duplication are given the attributes of primary slice and secondary slice, respectively. The access node accesses the primary slice. If a failure occurs in the storage device that holds the primary slice, the control node changes the attribute of the secondary slice to primary. As a result, the access node can access the data that was stored in the failed storage device without waiting for recovery processing to complete.
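
The attribute change can be illustrated with a small sketch. The dictionary layout and the function name `promote_secondaries` are hypothetical; the sketch only shows the idea that a secondary slice is promoted when the primary's device fails, and that the affected segments then need recovery.

```python
def promote_secondaries(segments, failed_device):
    """Promote secondary slices whose primary sits on the failed device.

    segments maps a segment number to {"primary": slice, "secondary": slice},
    where a slice is a (device, slice_no) tuple or None.
    Returns the segment numbers that lost duplication and need recovery.
    """
    degraded = []
    for seg_no, pair in segments.items():
        primary, secondary = pair["primary"], pair["secondary"]
        if primary is not None and primary[0] == failed_device:
            # The former secondary becomes the primary; no secondary remains.
            pair["primary"], pair["secondary"] = secondary, None
            degraded.append(seg_no)
        elif secondary is not None and secondary[0] == failed_device:
            # The primary is unaffected, but the redundant copy is gone.
            pair["secondary"] = None
            degraded.append(seg_no)
    return degraded
```

Access can continue against the promoted slice immediately, while the returned segments are queued for recovery in the background.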

International Publication No. WO 2004/104845 pamphlet

One of the failures that prevent an access node from accessing data is a malfunction of a storage device. When a failure occurs in a storage device, the disk node detects it first. The disk node that detected the failure notifies the control node, so that the control node becomes aware of the failure. Recovery processing is then executed under the management of the control node.

However, to prevent false detection of failures, a certain grace period is needed between the time the disk node detects a sign of failure (a malfunction) in the storage device and the time it determines that a failure has occurred. During that grace period, the processing that changes the access destination of the access node to another storage device is not executed, and data access from the access node stops in the meantime.

When a failure occurs in a storage device that holds a primary slice, accesses from the access node to that primary slice result in errors from immediately after the failure. The disk node, on the other hand, cannot tell whether an unrecoverable failure has occurred or whether the problem is temporary and the device will become accessible again after some grace period. The disk node therefore does not conclude that a failure has occurred until a predetermined grace period has elapsed. As a result, switching the current secondary slice to primary is delayed, and the period during which the access node cannot access the data is prolonged.

The present invention has been made in view of these points, and its object is to provide a data management program, a storage device diagnosis program, and a multi-node storage system that can shorten the period during which data access stops when a storage device fails.

To solve the above problem, a data management program that causes a computer to implement the following functions is provided. This data management program causes a computer to execute management processing for data stored in duplicate across a plurality of storage devices whose storage areas are each divided into a plurality of slices and managed.

The computer that executes the data management program functions as malfunction information management means and recovery instruction means. On receiving malfunction information indicating that one of the plurality of storage devices may have failed, the malfunction information management means stores the malfunction information in malfunction information storage means. On receiving access-related information indicating that an access request has been issued with a slice of one of the storage devices as the access target slice, the recovery instruction means refers to the malfunction information in the malfunction information storage means and determines whether the storage device to which the access target slice belongs may have failed. If it may have failed, the recovery instruction means instructs slice management means, which has a data input/output function for the storage device that stores redundant data identical to the data in the access target slice, to perform recovery processing for the data that was stored in the access target slice.

Also to solve the above problem, a storage device diagnosis program that causes a computer to implement the following functions is provided. The storage device diagnosis program causes a computer, to which a storage device is locally connected and which is connected via a network to a control node that manages the data stored in the storage device, to execute operation diagnosis processing for the storage device.

The computer that executes the storage device diagnosis program functions as response time measuring means, malfunction detection means, and return detection means. The response time measuring means issues an inspection command to the storage device and measures the elapsed time from issuance of the inspection command until a response is received. If no response is received by the time the elapsed time reaches a preset malfunction detection time, the malfunction detection means transmits to the control node malfunction information indicating that the storage device may have failed. When a response to the inspection command is returned from the storage device after the malfunction information has been transmitted, the return detection means transmits to the control node return information indicating that the storage device has returned to operation.

The period during which data access stops when a storage device fails can be shortened.

A diagram showing an outline of the embodiment.
A diagram showing a configuration example of the multi-node storage system of the first embodiment.
A diagram showing a hardware configuration example of the control node used in the first embodiment.
A diagram showing the data structure of a virtual disk.
A block diagram showing the functions of each device of the multi-node storage system.
A diagram showing a data structure example of a storage device.
A diagram showing a data structure example of the metadata storage unit.
A diagram showing a data structure example of the virtual disk metadata storage unit.
A diagram showing a data structure example of the storage state storage unit.
A sequence diagram showing the procedure of slice switching processing when a storage device fails.
A diagram showing an example of the storage state storage unit after a state change.
A diagram showing the contents of the virtual disk metadata storage unit after an update.
A sequence diagram showing the procedure of slice switching processing when the load on a storage device becomes excessive.
A sequence diagram showing inconsistency resolution processing using time stamps.
A diagram showing an example of a reconstructed virtual disk metadata table.
A sequence diagram showing the processing procedure when the management node instructs disconnection of a disk node.
A block diagram showing the functions of each device of the multi-node storage system in the third embodiment.
A diagram showing a data structure example of the metadata storage unit.
A diagram showing a data structure example of the virtual disk metadata storage unit.
A diagram showing a data structure example of the allocation availability storage unit.
A sequence diagram showing the procedure of slice switching processing when a storage device fails.
A diagram showing an example of the allocation availability storage unit after the allocation availability is updated.
A diagram showing the contents of the virtual disk metadata storage unit after an update.
A sequence diagram showing the procedure of slice switching processing when the load on a storage device becomes excessive.
A sequence diagram showing slice allocation processing when the elapse of T1 is detected for a plurality of disks.
A flowchart showing the procedure of disk diagnosis processing.
A flowchart showing the procedure of T2/return detection processing.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing an outline of the embodiment. FIG. 1 shows a multi-node storage system. The multi-node storage system has a plurality of disk nodes 4 to 6, a control node 7, and an access node 8 in order to manage the data stored in storage devices 1 to 3 in duplicate.

The storage area of each of the storage devices 1 to 3 is divided into a plurality of slices and managed. The storage devices 1 to 3 are locally connected to the disk nodes 4 to 6, respectively. Here, "locally connected" means connected without going through a network.

The disk node 4 has response time measuring means 4a, malfunction detection means 4b, failure detection means 4c, and return detection means 4d.
The response time measuring means 4a issues an inspection command to the locally connected storage device 1. It then measures the elapsed time “T” from issuance of the inspection command until the response to the command is returned from the storage device 1.

A malfunction detection time “T1” is set in advance in the malfunction detection means 4b. For example, the malfunction detection time “T1” is stored in memory managed by the malfunction detection means 4b. The malfunction detection time “T1” is set, for example, to the time within which the storage device 1 can respond to an inspection command when operating normally (for example, 1 second). If no response is received from the storage device 1 by the time the elapsed time “T” reaches the malfunction detection time “T1”, the malfunction detection means 4b transmits to the control node 7 malfunction information indicating that the storage device 1 may have failed.

A value larger than the malfunction detection time “T1” is set in the failure detection means 4c as a failure detection time “T2”. The failure detection time “T2” is set, for example, to the maximum time (for example, 1 minute) within which the storage device 1 can respond to an inspection command, even when overloaded, as long as it has not failed. If no response is received from the storage device 1 by the time the elapsed time “T” reaches the failure detection time “T2”, the failure detection means 4c transmits failure detection information about the storage device 1 to the control node 7.

When a response to the inspection command is returned from the storage device 1 after the malfunction information has been transmitted, the return detection means 4d transmits to the control node 7 return information indicating that the storage device 1 has returned to operation.
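
The behavior of the four means in the disk node 4 can be condensed into a small decision function. This is a sketch under stated assumptions: the function name, its arguments, and the message strings are hypothetical, and the threshold values are only the examples given in the text (1 second and 1 minute).

```python
T1 = 1.0    # malfunction detection time (example value from the text)
T2 = 60.0   # failure detection time (example value from the text), T1 < T2


def diagnose(elapsed, responded, malfunction_reported):
    """Decide which message, if any, to send to the control node.

    elapsed              -- seconds since the inspection command was issued
    responded            -- True if the storage device has answered
    malfunction_reported -- True if malfunction information was already sent
    Returns "malfunction", "failure", "return", or None.
    """
    if responded:
        # A late response after a malfunction report means the device is back.
        return "return" if malfunction_reported else None
    if elapsed >= T2:
        return "failure"
    if elapsed >= T1 and not malfunction_reported:
        return "malfunction"
    return None
```

For example, `diagnose(1.0, False, False)` yields `"malfunction"`, and a response arriving at 30 seconds after that report yields `"return"`.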

The disk node 5 has slice management means 5a that inputs and outputs data to and from the storage device 2. Similarly, the disk node 6 has slice management means 6a that inputs and outputs data to and from the storage device 3.

Although omitted from FIG. 1, the disk nodes 5 and 6 also include functions equivalent to those of the disk node 4. Likewise, although omitted from FIG. 1, the disk node 4 also includes a function equivalent to the slice management means 5a and 6a of the disk nodes 5 and 6.

The control node 7 has malfunction information management means 7a, malfunction information storage means 7b, and recovery instruction means 7c.
On receiving malfunction information from one of the disk nodes 4 to 6, the malfunction information management means 7a stores the malfunction information in the malfunction information storage means 7b.

The malfunction information storage means 7b stores malfunction information. For example, part of a memory storage area is used as the malfunction information storage means 7b.
On receiving access-related information indicating that an access request has been issued with a slice of one of the storage devices 1 to 3 as the access target slice, the recovery instruction means 7c refers to the malfunction information in the malfunction information storage means 7b and determines whether the storage device to which the access target slice belongs may have failed. That is, if malfunction information indicating that this storage device may have failed is stored in the malfunction information storage means 7b, the recovery instruction means 7c determines that a failure is possible.

If a failure is possible, the recovery instruction means 7c instructs the disk node connected to the storage device that stores redundant data identical to the data in the access target slice to perform recovery processing for the data that was stored in the access target slice. Recovery processing here means copying the redundant data for the access target slice to another storage device to restore the duplicated state of the data.

In the example of FIG. 1, the access-related information is transmitted from the access node 8, which accesses the storage devices 1 to 3, when its access to a slice in the storage device 1 fails. In this case, on receiving the access-related information from the access node 8, the recovery instruction means 7c notifies the access node 8 of the storage location of the redundant data.

In such a multi-node storage system, when the storage device 1 malfunctions due to overload or the like, the time it takes to respond to the inspection command issued by the response time measuring means 4a becomes longer than the malfunction detection time “T1” but shorter than the failure detection time “T2”. If the storage device 1 has failed, no response to the inspection command is returned even after the failure detection time “T2” has passed.

Suppose that the storage device 1 malfunctions or fails, and the elapsed time “T” measured by the response time measuring means 4a exceeds the malfunction detection time “T1”. The malfunction detection means 4b then outputs to the control node 7 malfunction information indicating that the storage device 1 may have failed. In the control node 7, the malfunction information management means 7a stores the malfunction information in the malfunction information storage means 7b.

If the access node 8 subsequently accesses the first slice 1a in the storage device 1, the access results in an error. The access node 8 therefore transmits to the control node 7 access-related information indicating that access to the first slice 1a of the storage device 1 has failed. In the control node 7, the recovery instruction means 7c refers to the malfunction information storage means 7b and recognizes that the storage device 1 may have failed. The recovery instruction means 7c then issues a recovery instruction for the slice 1a indicated by the access-related information to the disk nodes 5 and 6.

In the example of FIG. 1, assume that the redundant data for the data in the slice 1a (data[A]) is stored in the first slice 2a of the storage device 2. Assume also that the first slice of the storage device 3 (slice 3a) is free, that is, holds no valid data. In this case, the recovery instruction means 7c instructs the slice management means 5a of the disk node 5 to copy the data of the slice 2a to the slice 3a. The slice management means 5a then reads the data of the slice 2a and transfers it to the disk node 6. In the disk node 6, the slice management means 6a receives the data and writes it to the slice 3a.

The recovery instruction means 7c notifies the access node 8 that the redundant data of the slice 1a is stored in the slice 2a of the storage device 2. This allows the access node 8 to quickly change its access destination to the slice 2a of the storage device 2.

If no response from the storage device 1 has reached the disk node 4 by the time the failure detection time “T2” has elapsed, the failure detection means 4c detects the elapse of the failure detection time “T2”. The failure detection means 4c then transmits to the control node 7 failure detection information indicating that the storage device 1 has failed. In the control node 7, the recovery instruction means 7c instructs the disk node 5 to perform recovery for all slices in the storage device 2.

If, on the other hand, a response from the storage device 1 reaches the disk node 4 before the failure detection time “T2” has elapsed, the return detection means 4d detects the return. The return detection means 4d then transmits to the control node 7 return information indicating that the storage device 1 has returned to operation. In the control node 7, the malfunction information management means 7a deletes from the malfunction information storage means 7b the malfunction information indicating that the storage device 1 may have failed.
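
The control node's side of this exchange can be modeled in a few lines. The class below is a hypothetical sketch: the class and method names, the device and slice labels, and the returned dictionary are illustrative, and the behavior when an access failure arrives without stored malfunction information (returning `None`) is an assumption, since the passage above does not describe that case.

```python
class ControlNode:
    """Illustrative model of control node 7; all names are hypothetical."""

    def __init__(self, redundancy):
        # Maps a slice to the slice that holds its redundant copy.
        self.redundancy = redundancy
        # Plays the role of the malfunction information storage means 7b.
        self.suspect_devices = set()

    def on_malfunction_info(self, device):
        """Store malfunction information (role of means 7a)."""
        self.suspect_devices.add(device)

    def on_return_info(self, device):
        """Delete malfunction information when the device returns."""
        self.suspect_devices.discard(device)

    def on_access_failure(self, device, slice_id):
        """Handle access-related information (role of means 7c)."""
        if device not in self.suspect_devices:
            return None  # assumption: no early recovery without a T1 report
        copy = self.redundancy[(device, slice_id)]
        # Recover from the redundant copy and tell the access node where it is.
        return {"recover_from": copy, "notify_access_node": copy}


node = ControlNode({("storage1", "1a"): ("storage2", "2a")})
node.on_malfunction_info("storage1")
decision = node.on_access_failure("storage1", "1a")
```

Here `decision` names slice 2a both as the recovery source and as the new access destination, matching the FIG. 1 example.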

As described above, in the multi-node storage system shown in FIG. 1, failure detection for the storage devices 1 to 3 uses two time thresholds: the malfunction detection time “T1” and the failure detection time “T2” (T1 < T2). Conventionally, only the failure detection time “T2” existed, and recovery processing started when the disk node detected that “T2” had elapsed. Consequently, if “T2” was set too long, recovery processing was delayed and the period of inaccessibility after a failure was prolonged. In the present embodiment, as shown in FIG. 1, the disk node 4 detects that the malfunction detection time “T1” has elapsed and notifies the control node 7. The control node 7 then instructs recovery processing only for the data of the slice 1a to which access failed. This prevents the period of inaccessibility from being prolonged.
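
A back-of-the-envelope comparison of the two schemes is sketched below. It is a simplification under assumptions not stated in the text: times are measured from the moment the storage device stops responding, message latency is ignored, and the access node is assumed to retry failed accesses at the listed times. The function names are hypothetical.

```python
def switch_time_conventional(t2):
    """Single-threshold scheme: the switch happens only once T2 has elapsed."""
    return t2


def switch_time_two_stage(t1, failed_access_times):
    """Two-stage scheme: the switch happens at the first failed access that
    is reported after the malfunction information (T1) has been stored.
    Returns None if no failed access occurs at or after T1."""
    for t in sorted(failed_access_times):
        if t >= t1:
            return t
    return None


# Example thresholds from the text: T1 = 1 second, T2 = 1 minute.
conventional = switch_time_conventional(60.0)
two_stage = switch_time_two_stage(1.0, [0.5, 1.5, 2.5])
```

With these assumed retry times, the access destination changes after 1.5 seconds instead of 60 seconds.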

In FIG. 1, the storage devices 1 to 3 are connected to separate disk nodes 4 to 6, but there are also systems in which a plurality of storage devices are locally connected to a single node. In such a system, the functions of the control node 7 and the disk nodes 4 to 6 shown in FIG. 1 are built into the single node to which the storage devices are connected.

In a multi-node storage system, data is accessed through a virtual disk. The allocation relationship between storage areas in the virtual disk and storage areas in the storage devices can be managed using metadata. The details of the present embodiment are therefore described below using an example in which the allocation relationship is managed with metadata.

[First Embodiment]
FIG. 2 is a diagram showing a configuration example of the multi-node storage system of the first embodiment. In the present embodiment, a plurality of disk nodes 100, 200, and 300, a control node 500, access nodes 600 and 700, and a management node 30 are connected via a network 10. Storage devices 110, 210, and 310 are connected to the disk nodes 100, 200, and 300, respectively.

A plurality of hard disk drives (HDDs) 111, 112, 113, and 114 are mounted in the storage device 110. A plurality of HDDs 211, 212, 213, and 214 are mounted in the storage device 210. A plurality of HDDs 311, 312, 313, and 314 are mounted in the storage device 310. Each of the storage devices 110, 210, and 310 is a RAID system that uses its built-in HDDs. In the present embodiment, each of the storage devices 110, 210, and 310 provides a RAID 5 disk management service.

The disk nodes 100, 200, and 300 manage the data stored in the connected storage devices 110, 210, and 310, and provide the managed data to the terminal devices 21, 22, and 23 via the network 10. The disk nodes 100, 200, and 300 also manage the data redundantly. That is, the same data is managed by at least two disk nodes.

The control node 500 manages the disk nodes 100, 200, and 300. For example, when the control node 500 receives a notification from the disk nodes 100, 200, and 300 that a new storage device has been connected, it defines a new virtual disk so that the data stored in the connected storage device can be accessed via that virtual disk.

A plurality of terminal devices 21, 22, and 23 are connected to the access nodes 600 and 700 via a network 20. Virtual disks are defined in the access nodes 600 and 700. In response to requests from the terminal devices 21, 22, and 23 to access data on a virtual disk, the access nodes 600 and 700 access the corresponding data in the disk nodes 100, 200, and 300.

The management node 30 is a computer used by an administrator to manage the operation of the multi-node storage system. For example, the management node 30 collects information such as the operating status of the storage devices and displays the collected information on a screen. When the administrator using the management node 30 refers to the displayed information and finds a storage device that requires recovery processing, the administrator inputs a recovery instruction for that storage device to the management node 30. The management node 30 then transmits a recovery request designating the storage device to the control node 500.

FIG. 3 is a diagram illustrating a hardware configuration example of the control node used in the first embodiment. The entire control node 500 is controlled by a CPU (Central Processing Unit) 501. A RAM (Random Access Memory) 502, a hard disk drive (HDD) 503, a graphics processing device 504, an input interface 505, and a communication interface 506 are connected to the CPU 501 via a bus 507.

The RAM 502 is used as the main storage device of the control node 500. The RAM 502 temporarily stores at least part of the OS (Operating System) program and the application programs to be executed by the CPU 501. The RAM 502 also stores various data necessary for processing by the CPU 501. The HDD 503 is used as the secondary storage device of the control node 500. The HDD 503 stores the OS program, application programs, and various data. A semiconductor storage device such as a flash memory can also be used as the secondary storage device.

A monitor 11 is connected to the graphics processing device 504. The graphics processing device 504 displays an image on the screen of the monitor 11 in accordance with commands from the CPU 501. Examples of the monitor 11 include a display device using a CRT (Cathode Ray Tube) and a liquid crystal display device.

A keyboard 12 and a mouse 13 are connected to the input interface 505. The input interface 505 transmits signals sent from the keyboard 12 and the mouse 13 to the CPU 501 via the bus 507. The mouse 13 is one example of a pointing device, and other pointing devices can also be used. Examples of other pointing devices include a touch panel, a tablet, a touch pad, and a trackball.

The communication interface 506 is connected to the network 10. The communication interface 506 transmits and receives data to and from other computers via the network 10.

The processing functions of the present embodiment can be realized with the hardware configuration described above. Although FIG. 3 shows the hardware configuration of the control node 500, the disk nodes 100, 200, and 300, the access nodes 600 and 700, and the management node 30 can be realized with the same hardware configuration. The disk nodes 100, 200, and 300, however, have an interface for externally connecting the storage devices 110, 210, and 310 in addition to the functions shown in FIG. 3.

Next, the data structure of the virtual disk defined in the multi-node storage system will be described.
FIG. 4 is a diagram showing the data structure of the virtual disk. In the present embodiment, the virtual disk identifier “LVOL-X” is assigned to the virtual disk 60. The three disk nodes 100, 200, and 300 connected via the network are assigned the node identifiers “SN-A”, “SN-B”, and “SN-C”, respectively, to identify the individual nodes. Each of the storage devices 110, 210, and 310 connected to the disk nodes 100, 200, and 300 is uniquely identified in the network 10 by the combination of the node identifier of its disk node and a disk ID within that disk node.

A RAID 5 storage system is configured in each of the storage devices 110, 210, and 310 of the disk nodes 100, 200, and 300. The storage function provided by each of the storage devices 110, 210, and 310 is managed by dividing it into a plurality of slices 115a to 115c, 215a to 215c, and 315a to 315c.

The virtual disk 60 is composed of units called segments 61 to 63. The storage capacity of each of the segments 61 to 63 is the same as the storage capacity of a slice, which is the management unit in the storage devices 110 and 210. For example, if the storage capacity of a slice is 1 gigabyte, the storage capacity of a segment is also 1 gigabyte. The storage capacity of the virtual disk 60 is an integer multiple of the storage capacity of one segment. The segments 61 to 63 each consist of a pair (slice pair) of a primary slice 61a, 62a, or 63a and a secondary slice 61b, 62b, or 63b.

The two slices belonging to the same segment belong to different disk nodes. The area for managing an individual slice holds, in addition to the virtual disk identifier, the segment information, and information on the other slice making up the same segment, a flag that stores a value indicating primary, secondary, or the like.

In the example of FIG. 4, the identifier of each slice in the virtual disk 60 is shown as a combination of the letter “P” or “S” and a number. “P” indicates a primary slice. “S” indicates a secondary slice. The number following the letter indicates the ordinal number of the segment to which the slice belongs. For example, the primary slice of the first segment 61 is shown as “P1”, and its secondary slice is shown as “S1”.

FIG. 5 is a block diagram illustrating the functions of each device of the multi-node storage system. The access node 600 includes a metadata inquiry unit 610, an access metadata storage unit 620, and a slice access request unit 630.

The metadata inquiry unit 610 acquires the metadata defining the virtual disk 60 from the control node 500. Specifically, when the access node 600 starts up, the metadata inquiry unit 610 transmits an inquiry request for all metadata to the control node 500. The control node 500 then sends all metadata relating to the virtual disk 60. In addition, when data access to any slice by the slice access request unit 630 results in an error, the metadata inquiry unit 610 transmits to the control node 500 a metadata inquiry request for the segment to which the accessed slice is allocated. The control node 500 then sends the latest metadata of that segment. When the metadata inquiry unit 610 acquires metadata from the control node 500, it stores the metadata in the access metadata storage unit 620.

The access metadata storage unit 620 is a storage function for the metadata defining the virtual disk 60. For example, part of the RAM of the access node 600 is used as the access metadata storage unit 620. In the present embodiment, the access node 600 always accesses the primary slice. The access metadata storage unit 620 therefore only needs to store, out of the metadata of the virtual disk 60, at least the metadata relating to the primary slices.

In response to a request from the terminal devices 21, 22, and 23 to access data on the virtual disk, the slice access request unit 630 transmits a data access request (read request or write request) for the storage devices 110, 210, and 310 to the disk nodes 100, 200, and 300. Specifically, when the slice access request unit 630 receives an access request specifying an address in the virtual disk, it first refers to the access metadata storage unit 620 and determines the segment to which the data to be accessed belongs. Next, the slice access request unit 630 determines the slice allocated to that segment as the primary slice. The slice access request unit 630 then transmits an access request for the data in that slice to the disk node that manages the slice. When the disk node returns the access result, the slice access request unit 630 transmits the access result to the terminal devices 21, 22, and 23.
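The address lookup described above can be illustrated with a minimal Python sketch. The names `PrimarySlice`, `locate`, and `SEGMENT_SIZE` are hypothetical, and the sketch assumes fixed 1-gigabyte segments numbered from 1 and laid out contiguously in the virtual disk address space; the description itself does not fix these details.

```python
from dataclasses import dataclass

SEGMENT_SIZE = 1 << 30  # 1 GiB per segment, the example size given in the text


@dataclass(frozen=True)
class PrimarySlice:
    """Location of the primary slice allocated to one segment (hypothetical type)."""
    disk_node_id: str  # e.g. "SN-A"
    disk_id: int
    slice_id: int


def locate(address: int, primaries: dict) -> tuple:
    """Map a virtual disk address to (primary slice, byte offset inside that slice)."""
    segment_id = address // SEGMENT_SIZE + 1  # assumption: segments are numbered from 1
    offset = address % SEGMENT_SIZE
    return primaries[segment_id], offset
```

The access request would then be sent to the disk node named by `disk_node_id`, which corresponds to the last step performed by the slice access request unit 630.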

When the destination disk node returns an error, the slice access request unit 630 notifies the metadata inquiry unit 610 of the segment in which the error occurred. The slice access request unit 630 then retries the data access. The retry starts again from the step of referring to the access metadata storage unit 620 and determining the slice allocated as the primary slice. That is, if the metadata in the access metadata storage unit 620 has been updated since the previous access request, the disk node to be accessed in the retry is determined from the updated metadata.
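The error handling and retry behavior can be sketched roughly as follows. This is an illustration only: `access_with_retry`, `SliceAccessError`, and the callables `send_request` and `query_control_node` are hypothetical stand-ins for the slice access request unit 630, the disk node error response, and the metadata inquiry unit 610, and the retry limit is an assumption that the description does not specify.

```python
class SliceAccessError(Exception):
    """Raised when a disk node answers an access request with an error."""


def access_with_retry(segment_id, cache, send_request, query_control_node, max_retries=3):
    """Access the primary slice of a segment, refreshing metadata after each error.

    cache:              {segment_id: primary slice location}, the access metadata
    send_request:       callable(location) -> result, raises SliceAccessError on error
    query_control_node: callable(segment_id) -> latest primary slice location
    max_retries:        assumed limit; not taken from the description
    """
    for _attempt in range(max_retries + 1):
        target = cache[segment_id]  # always re-read the (possibly updated) metadata
        try:
            return send_request(target)
        except SliceAccessError:
            # Ask the control node for the segment's latest metadata and store it.
            cache[segment_id] = query_control_node(segment_id)
    raise SliceAccessError(f"segment {segment_id}: still failing after {max_retries} retries")
```

Because the target is looked up inside the loop, a retry automatically goes to the newly allocated primary slice once the control node has reassigned the segment.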

The disk node 100 includes a slice access processing unit 120, a T1/return detection unit 130, a T2 detection unit 140, a metadata storage unit 150, and a slice management unit 160.

The slice access processing unit 120 performs data access to the storage device 110 in response to an access request from the access node 600. Specifically, when the slice access processing unit 120 receives an access request from the access node 600, it refers to the metadata storage unit 150 and determines the slice in the storage device 110 that is allocated to the segment to be accessed.

Next, the slice access processing unit 120 accesses the data in that slice specified by the access request. For example, for a data read request, the slice access processing unit 120 reads the corresponding data from the storage device 110. For a data write request, the slice access processing unit 120 writes the data included in the access request to the corresponding storage area in the storage device 110. The slice access processing unit 120 then transmits the access result to the access node 600. For a data read request, the access result includes the data read from the storage device 110. If the storage device 110 is unable to return a response to the access (as when it cannot respond to an inspection command), the slice access processing unit 120 returns an error to the access node 600 (that is, sends an error message).

The T1/return detection unit 130 periodically transmits an inspection command to the storage device 110 and determines whether a failure has occurred based on whether a response is received. Specifically, the T1/return detection unit 130 periodically issues an inspection command such as “test unit ready” to the storage device 110. The T1/return detection unit 130 also holds internally the value of the malfunction detection time “T1”. For example, the value of “T1” is set in the RAM used by the T1/return detection unit 130. T1 is set to, for example, a time from a few seconds to a little over ten seconds.

If no response is returned from the storage device 110 even after the malfunction detection time has elapsed since the inspection command was issued, the T1/return detection unit 130 determines that there is a possibility of failure. In that case, the T1/return detection unit 130 transmits a T1 progress notification to the control node 500. If a response is returned from the storage device 110 after T1 has elapsed since the inspection command was issued, the T1/return detection unit 130 transmits a return notification to the control node 500.

After a possibility of failure has been determined, the T2 detection unit 140 determines whether the failure is confirmed. Specifically, the T2 detection unit 140 holds internally, in advance, the value of the failure detection time “T2” (T1 < T2). For example, the value of “T2” is set in the RAM used by the T2 detection unit 140. T2 is set to, for example, a time of about 30 seconds to 1 minute.

If no response is returned from the storage device 110 by the time T2 has elapsed since the T1/return detection unit 130 issued the inspection command, the T2 detection unit 140 determines that the failure is confirmed. In that case, the T2 detection unit 140 transmits a T2 progress notification to the control node 500.
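The two-stage determination using T1 and T2 can be summarized in a short Python sketch. The names `Diagnosis` and `diagnose` are hypothetical, the default values of 10 and 60 seconds are merely picked from the example ranges above, and treating the thresholds as strict inequalities is an assumption.

```python
from enum import Enum


class Diagnosis(Enum):
    NORMAL = "normal"
    POSSIBLE_FAILURE = "T1 elapsed"   # triggers the T1 progress notification
    FAILURE_CONFIRMED = "T2 elapsed"  # triggers the T2 progress notification


def diagnose(no_response_seconds: float, t1: float = 10.0, t2: float = 60.0) -> Diagnosis:
    """Classify a storage device by the time elapsed since an unanswered inspection command."""
    if not 0 < t1 < t2:
        raise ValueError("T1 must be positive and smaller than T2")
    if no_response_seconds > t2:
        return Diagnosis.FAILURE_CONFIRMED
    if no_response_seconds > t1:
        return Diagnosis.POSSIBLE_FAILURE
    return Diagnosis.NORMAL
```

In the embodiment the two checks are made by separate units (130 and 140); a single function is used here only to make the relationship T1 < T2 explicit.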

The metadata storage unit 150 is a storage function for the metadata of the slices managed by the disk node 100. For example, part of the storage area in the RAM of the disk node 100 is used as the metadata storage unit 150.

The slice management unit 160 manages the metadata of each slice in the storage device 110. Specifically, when the disk node 100 starts up, the slice management unit 160 reads the metadata of each slice from the storage device 110 and stores it in the metadata storage unit 150. When the control node 500 issues a metadata collection request, the slice management unit 160 transmits the metadata stored in the metadata storage unit 150 to the control node 500. Furthermore, when the slice management unit 160 receives a metadata change request from the control node 500, it changes the content of the metadata specified in the change request. In doing so, the slice management unit 160 changes both the metadata in the metadata storage unit 150 and the metadata in the storage device 110.

The control node 500 includes a virtual disk metadata storage unit 510, a metadata search unit 520, a storage state storage unit 530, a storage state management unit 540, and a slice allocation management unit 550.

The virtual disk metadata storage unit 510 is a storage function that stores metadata indicating the allocation of slices to the segments constituting the virtual disk 60. For example, part of the storage area of the RAM 502 is used as the virtual disk metadata storage unit 510.

The metadata search unit 520 searches the virtual disk metadata storage unit 510 for the metadata of the slice allocated to the segment inquired about by the access node 600, and returns the search result to the access node 600. In addition, if the storage device holding the slice inquired about by the access node 600 may have failed, the metadata search unit 520 requests the slice allocation management unit 550 to reallocate a slice to the segment inquired about by the access node 600. The metadata search unit 520 then acquires the metadata reflecting the reallocation result from the virtual disk metadata storage unit 510 and returns the acquired metadata to the access node 600. The metadata search unit 520 determines the state of each storage device by referring to the storage state storage unit 530.

The storage state storage unit 530 is a storage function that stores the states of the storage devices 110, 210, and 310 connected to the disk nodes 100, 200, and 300. For example, part of the storage area of the RAM 502 is used as the storage state storage unit 530. The states of the storage devices 110, 210, and 310 set in the storage state storage unit 530 are a normal state and a state after the elapse of T1. The state after the elapse of T1 indicates that the time without a response to the inspection command has exceeded the malfunction detection time. If a storage device is in the state after the elapse of T1, that storage device may have failed.

The storage state management unit 540 changes the state of the storage device 110 recorded in the storage state storage unit 530 in accordance with notifications from the disk node 100. Specifically, when a T1 progress notification is received from the disk node 100, the storage state management unit 540 sets the state of the storage device 110 recorded in the storage state storage unit 530 to the state after the elapse of T1. When a return notification is received from the disk node 100, the storage state management unit 540 sets the state of the storage device 110 recorded in the storage state storage unit 530 to the normal state.
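The two state transitions handled by the storage state management unit 540 can be sketched as a small Python class. `StorageStateStore` and its method names are hypothetical, and keying the state by a (disk node ID, disk ID) pair is an assumption based on how storage devices are identified in FIG. 4.

```python
NORMAL = "normal"
T1_ELAPSED = "T1 elapsed"


class StorageStateStore:
    """Hypothetical stand-in for storage state storage unit 530 plus management unit 540."""

    def __init__(self):
        self._states = {}  # {(disk_node_id, disk_id): state}

    def on_t1_notification(self, disk_node_id, disk_id):
        """A T1 progress notification arrived: the device may have failed."""
        self._states[(disk_node_id, disk_id)] = T1_ELAPSED

    def on_return_notification(self, disk_node_id, disk_id):
        """A return notification arrived: the device is responding again."""
        self._states[(disk_node_id, disk_id)] = NORMAL

    def may_have_failed(self, disk_node_id, disk_id):
        """Devices never reported on are treated as normal (assumption)."""
        return self._states.get((disk_node_id, disk_id), NORMAL) == T1_ELAPSED
```

A lookup such as `may_have_failed("SN-A", 1)` corresponds to the check the metadata search unit 520 makes before requesting a slice reallocation.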

The slice allocation management unit 550 manages the allocation of slices to the segments of the virtual disk 60. For example, when the metadata search unit 520 notifies the slice allocation management unit 550 of an accessed slice located in a storage device that may have failed, the slice allocation management unit 550 allocates another slice to the segment to which that slice is allocated.

When the slice allocation management unit 550 receives a T2 progress notification from the disk node 100, it starts recovery processing for the segments to which slices in the storage device 110 are allocated. In the recovery processing, the slice allocation management unit 550 first makes every slice that is allocated to a recovery target segment and resides in the storage devices 210 and 310, that is, outside the storage device 110, a primary slice. Next, the slice allocation management unit 550 allocates slices of the storage devices 210 and 310 as the secondary slices of the recovery target segments. The slice allocation management unit 550 then transmits metadata update requests reflecting the allocation result to the disk nodes 200 and 300. When the metadata update is complete in the disk nodes 200 and 300, the slice allocation management unit 550 updates the metadata in the virtual disk metadata storage unit 510.

When the metadata has been updated by the recovery processing, the disk node that manages the primary slice of a recovery target segment copies the data in the primary slice to the secondary slice via the network 10. This copying is carried out through the cooperation of the functions corresponding to the slice management unit 160 in the disk node managing the primary slice and in the disk node managing the secondary slice.
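The recovery steps can be sketched as a planning function in Python. This is a simplified illustration under stated assumptions: `plan_recovery` is a hypothetical name, slices are represented as (disk node ID, slice ID) tuples, the first suitable free slice is chosen, and the new secondary slice is placed on a node different from the surviving slice because two slices of one segment belong to different disk nodes.

```python
def plan_recovery(segments, free_slices, failed_node):
    """Return (updated segments, copy jobs) after the storage of `failed_node` fails.

    segments:    {segment_id: {"P": (node, slice_id), "S": (node, slice_id)}}
    free_slices: list of unallocated (node, slice_id) tuples; consumed in place
    """
    updated, copy_jobs = {}, []
    for seg_id, pair in segments.items():
        if failed_node not in (pair["P"][0], pair["S"][0]):
            updated[seg_id] = dict(pair)  # segment is not affected
            continue
        # The surviving slice becomes (or stays) the primary slice.
        survivor = pair["S"] if pair["P"][0] == failed_node else pair["P"]
        # The new secondary must avoid both the failed node and the survivor's node.
        candidates = [s for s in free_slices if s[0] not in (failed_node, survivor[0])]
        if not candidates:
            raise RuntimeError(f"no free slice available for segment {seg_id}")
        new_secondary = candidates[0]
        free_slices.remove(new_secondary)
        updated[seg_id] = {"P": survivor, "S": new_secondary}
        copy_jobs.append((survivor, new_secondary))  # primary -> secondary data copy
    return updated, copy_jobs
```

Each entry in the returned copy-job list corresponds to one primary-to-secondary data copy performed over the network 10 after the metadata update.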

FIG. 5 shows the functions of the access node 600 as representative of the two access nodes 600 and 700; the access node 700 has the same functions. Similarly, the functions of the disk node 100 are shown as representative of the three disk nodes 100, 200, and 300; the other disk nodes 200 and 300 have the same functions.

Next, the metadata managed by each node of the multi-node storage system will be described in detail. In the present embodiment, the metadata is stored in the storage devices 110, 210, and 310 while the system is stopped. When the multi-node storage system is started, the metadata in the storage devices 110, 210, and 310 is read out and held in each node.

Next, the structure of the data managed by each node will be described.
FIG. 6 is a diagram illustrating an example of the data structure of a storage device. In addition to the slices 115a, 115b, 115c, ..., the storage device 110 stores a plurality of pieces of metadata 117a, 117b, 117c, ....

The metadata 117a, 117b, 117c, ... stored in the storage device 110 is read by the slice management unit 160 when the disk node 100 starts up, and is stored in the metadata storage unit 150.

FIG. 7 is a diagram illustrating an example of the data structure of the metadata storage unit. The metadata storage unit 150 stores a metadata table 151. The metadata table 151 has columns for disk node ID, disk ID, slice ID, status, virtual disk ID, segment ID, virtual disk address, paired disk node ID, paired disk ID, paired slice ID, and time stamp. The items of information arranged in a row of the metadata table 151 are associated with one another and form one record representing a piece of metadata.

In the disk node ID column, the identification information (disk node ID) of the disk node 100 that manages the storage device 110 is set.
In the disk ID column, the identification information (disk ID) of the storage device connected to the disk node 100 is set. In the present embodiment, only one storage device 110 is connected to the disk node 100, but when a plurality of storage devices are connected, a different disk ID is set for each storage device.

In the slice ID column, the identification information (slice ID), within the storage device 110, of the slice corresponding to the metadata is set.
In the status column, a status flag indicating the state of the slice is set. If the slice is not allocated to any segment of a virtual disk, the status flag “F” is set. If the slice is allocated to a virtual disk segment as its primary slice, the status flag “P” is set. If the slice is allocated to a virtual disk segment as its secondary slice, the status flag “S” is set. If it has been decided to allocate the slice to a virtual disk segment but the data has not yet been copied, the status flag “R”, indicating reserved, is set. If the segment is determined to be abnormal, the status flag “B”, indicating an abnormality, is set.

In the virtual disk ID column, identification information (virtual disk ID) for identifying the virtual disk to which the segment corresponding to the slice belongs is set.
In the segment ID column, the identification information (segment ID) of the segment to which the slice is allocated is set.

In the virtual disk address column, the address within the virtual disk that indicates the start of the segment to which the slice is allocated is set.
In the paired disk node ID column, the identification information (disk node ID) of the disk node that manages the storage device holding the paired slice (the other slice belonging to the same segment) is set.

In the paired disk ID column, identification information (disk ID) for identifying, within the disk node indicated by the paired disk node ID, the storage device holding the paired slice is set.

ペアのスライスＩＤの欄には、ペアのスライスを、そのスライスが属するストレージ装置内で識別するための識別情報（スライスＩＤ）が設定される。
タイムスタンプの欄には、セグメントへのスライスの割り当てを行った時刻（タイムスタンプ）が設定される。図中のタイムスタンプの値は、「ｔ」に続く数が大きいほど、新しい時刻を示している。ｔｎの「ｎ」は自然数を示している。 In the pair slice ID column, identification information (slice ID) for identifying the pair slice within the storage apparatus to which the slice belongs is set.
In the time stamp column, a time (time stamp) at which a slice is assigned to a segment is set. The time stamp value in the figure indicates a new time as the number following “t” increases. “n” in tn represents a natural number.

FIG. 7 shows the contents of the metadata storage unit 150 of the disk node 100; the other disk nodes 200 and 300 have similar metadata storage units. The metadata stored in the metadata storage unit of each of the disk nodes 100, 200, and 300 is transmitted to the control node 500 in response to a request from the control node 500. In the control node 500, the slice allocation management unit 550 stores the metadata collected from the disk nodes 100, 200, and 300 in the virtual disk metadata storage unit 510.

FIG. 8 is a diagram illustrating an example of the data structure of the virtual disk metadata storage unit. The virtual disk metadata storage unit 510 stores a virtual disk metadata table 511. The virtual disk metadata table 511 has columns for disk node ID, disk ID, slice ID, status, virtual disk ID, segment ID, virtual disk address, paired disk node ID, paired disk ID, paired slice ID, and time stamp. The items of information arranged in each row of the virtual disk metadata table 511 are associated with one another and constitute one record of metadata. The information set in each column of the virtual disk metadata table 511 is of the same kind as that in the column of the same name in the metadata table 151.
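As a rough illustration, one record of the virtual disk metadata table 511 could be modelled as follows. This is only a sketch: the class and field names (`SliceStatus`, `SliceMetadata`, `is_free`) and the sample values are hypothetical and do not appear in the specification; only the column list and the status flags are taken from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SliceStatus(Enum):
    FREE = "F"        # not assigned to any segment
    PRIMARY = "P"     # primary storage of a segment
    SECONDARY = "S"   # secondary storage of a segment
    RESERVED = "R"    # assignment decided, data not yet copied
    BAD = "B"         # determined to be abnormal


@dataclass
class SliceMetadata:
    # Identity of the slice itself.
    disk_node_id: str
    disk_id: int
    slice_id: int
    status: SliceStatus
    # Segment the slice is assigned to (empty for a free slice).
    virtual_disk_id: Optional[str] = None
    segment_id: Optional[int] = None
    virtual_disk_address: Optional[int] = None
    # Location of the paired slice of the same segment.
    pair_disk_node_id: Optional[str] = None
    pair_disk_id: Optional[int] = None
    pair_slice_id: Optional[int] = None
    # Time at which the slice was assigned to the segment.
    timestamp: Optional[int] = None

    def is_free(self) -> bool:
        return self.status is SliceStatus.FREE


# Hypothetical sample record: a primary slice of segment 1.
record = SliceMetadata("SN-A", 1, 1, SliceStatus.PRIMARY, segment_id=1, timestamp=1)
```

An access node that only ever reads primary slices could keep the same record type and simply leave the pair fields and the time stamp empty.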

The metadata stored in the virtual disk metadata table 511 is transmitted to the access nodes 600 and 700 in response to inquiry requests from them. The access nodes 600 and 700 store the acquired metadata. The access node 600 stores the metadata in the access metadata storage unit 620. The access node 700 likewise stores the metadata in a storage function corresponding to the access metadata storage unit 620.

The data structure of the access metadata storage unit 620 is the same as that of the virtual disk metadata storage unit 510. In the present embodiment, the access node 600 always accesses the primary slice. It is therefore sufficient for the access metadata storage unit 620 to hold at least the metadata on primary slices. The paired disk node ID, paired disk ID, paired slice ID, and time stamp fields of each metadata record may also be omitted.

Next, the data stored in the storage state storage unit 530 in the control node 500 will be described.
FIG. 9 is a diagram illustrating an example of the data structure of the storage state storage unit. The storage state storage unit 530 stores a disk management table 531. The disk management table 531 has columns for disk node ID, disk ID, and status.

In the disk node ID column, identification information (disk node ID) for uniquely identifying a disk node is set. In the disk ID column, the identification information (disk ID) of a storage device connected to the disk node is set.

In the status column, the state of each storage device is set. The possible states are “normal” and “T1”. The “normal” state means that the storage device is operating normally. Specifically, when the storage device returns responses to the inspection commands that the disk node issues periodically, its state is set to “normal”. The “T1” state means that the storage device may have failed. Specifically, when the storage device does not return a response to a periodically issued inspection command before T1 elapses, its state is set to “T1”.

In a multi-node storage system configured in this way, when the response from a storage device to an inspection command is not returned even after T1 has elapsed, the following slice switching process is executed.

FIG. 10 is a sequence diagram illustrating the procedure of the slice switching process when a storage device fails. In this example, it is assumed that the storage device 110 connected to the disk node 100 has failed. The process illustrated in FIG. 10 is described below in order of step number.

[Step S11] The T1/recovery detection unit 130 of the disk node 100 periodically performs a disk diagnosis (operation check) of the storage device 110. Specifically, the T1/recovery detection unit 130 periodically issues a “test unit ready” inspection command to the storage device 110. If the storage device 110 is operating normally, it returns a response within T1.

When the storage device 110 has failed, or when a data regeneration process is in progress inside the storage device 110, for example, no response is returned from the storage device 110 within T1.

The data regeneration process is executed when a disk fails in a RAID 5 configuration. As shown in FIG. 2, the storage device 110 contains a plurality of HDDs 111 to 114 that constitute a RAID 5 system. In RAID 5, a striping process divides data and distributes it across a plurality of HDDs. At that time, parity data for restoring the data is generated and stored on an HDD different from the one holding the data. When one HDD in the storage device 110 fails, the data that was stored on that HDD is regenerated using the parity data.

Such a data regeneration process is executed automatically inside the storage device 110. For example, when one of the four HDDs 111 to 114 in the storage device 110 fails, the RAID controller in the storage device 110 regenerates the data that was stored on the failed HDD. In addition, because only three HDDs remain in operation, the RAID controller rearranges the data by striping. While data is being regenerated and rearranged in this way, the load on the RAID controller in the storage device 110 is higher than usual. Consequently, if the disk node 100 sends an inspection command to the storage device 110 during the data regeneration process, the response may take longer than usual. However, processing such as data regeneration is part of the normal operation of the storage device 110. A delayed response to an inspection command during the regeneration process therefore does not mean that the storage device 110 as a whole has failed.
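The regeneration of a lost block from parity can be illustrated with a small sketch. The specification only states that parity data is used; the sketch assumes the usual RAID 5 arrangement in which the parity block is the bytewise XOR of the data blocks of a stripe. The helper name `xor_blocks` and the sample bytes are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized byte blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe: three data blocks on three HDDs, parity on a fourth HDD.
data_blocks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data_blocks)

# Suppose the HDD holding data_blocks[1] fails. XOR-ing the surviving
# data blocks with the parity block reproduces the lost block.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
```

Because every surviving block must be read to rebuild each lost block, a regeneration of this kind plausibly raises the controller load in the way the paragraph above describes.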

In the example of FIG. 10, the storage device 110 has failed, so no response is returned even after T1 has elapsed since the inspection command was issued. In that case, the T1/recovery detection unit 130 first detects that T1 has elapsed.

On detecting that T1 has elapsed, the T1/recovery detection unit 130 transmits a T1 elapse notification to the control node 500. The T1 elapse notification contains the disk node ID of the disk node 100 and the disk ID of the storage device 110. The T1/recovery detection unit 130 then continues to wait for a response from the storage device 110.

On receiving the T1 elapse notification, the storage state management unit 540 of the control node 500 switches the state of the storage device 110. Specifically, the storage state management unit 540 searches the storage state storage unit 530 for the entry corresponding to the combination of disk node ID and disk ID indicated in the T1 elapse notification. The storage state management unit 540 then changes the state in the entry for that storage device to “T1”.

FIG. 11 is a diagram illustrating an example of the storage state storage unit after the state change. As shown in FIG. 11, the state corresponding to the combination of the disk node ID “SN-A” and the disk ID “1” has been changed to “T1”. The control node 500 can thereby recognize that the storage device 110 connected to the disk node 100 may have failed.
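The state change on the control node side can be sketched as follows. The class `DiskManagementTable` and its method names are hypothetical; the sketch only assumes that the table is keyed by the combination of disk node ID and disk ID, as described above, and that every entry starts in the “normal” state.

```python
class DiskManagementTable:
    """Maps (disk node ID, disk ID) to a state string: "normal" or "T1"."""

    def __init__(self, disks):
        self._state = {key: "normal" for key in disks}

    def on_t1_elapsed(self, disk_node_id, disk_id):
        # The notification carries exactly these two identifiers.
        key = (disk_node_id, disk_id)
        if key not in self._state:
            raise KeyError(f"unknown storage device: {key}")
        self._state[key] = "T1"

    def state(self, disk_node_id, disk_id):
        return self._state[(disk_node_id, disk_id)]


table = DiskManagementTable([("SN-A", 1), ("SN-B", 1), ("SN-C", 1)])
table.on_t1_elapsed("SN-A", 1)   # T1 elapse notification from disk node SN-A
```

After the call, only the entry for disk node ID “SN-A” and disk ID “1” is in the “T1” state, which corresponds to the situation of FIG. 11.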

The description now returns to FIG. 10.
[Step S12] When the state switching is complete, the storage state management unit 540 transmits a switching completion response to the disk node 100.

[Step S13] Meanwhile, when an access to data in the virtual disk 60 arises, for example in response to a user's operation input on the terminal device 21, 22, or 23, the slice access request unit 630 of the access node 600 refers to the access metadata storage unit 620 and determines which disk node manages the data to be accessed. In the example of FIG. 10, it is assumed that a read access to a slice managed by the disk node 100 has occurred. The slice access request unit 630 then transmits a read request for the target data to the disk node 100. The slice access processing unit 120 of the disk node 100 refers to the metadata storage unit 150, confirms that the target slice is a slice in the storage device 110 that it manages, and performs a read access specifying the data in that slice of the storage device 110.

In the example of FIG. 10, the read request from the access node 600 is issued after the T1/recovery detection unit 130 has detected that T1 elapsed. In this example, the storage device 110 has failed, so the access to the storage device 110 results in an error. Even when the storage device 110 is merely overloaded, a data access made before a response to the inspection command is detected (before recovery) also results in an error.

Note that a read request results in an error only when the storage device holding the primary slice has a problem (failure or overload). A write request, on the other hand, results in an error when at least one of the storage device holding the primary slice and the storage device holding the secondary slice has a problem (failure or overload). That is, in the disk node 100 that has received a write access request, the slice access processing unit 120 first updates the data. The slice management unit 160 then refers to the metadata storage unit 150 and identifies the slice (secondary slice) paired with the slice whose data was updated (primary slice). The slice management unit 160 transmits the data to be written to the disk node that manages the secondary slice and requests that the secondary slice be updated. After confirming that the data update has completed for both the primary slice and the secondary slice, the slice access processing unit 120 returns a completion response for the write request to the access node 600. If the data update fails for at least one of the primary slice and the secondary slice, the slice access processing unit 120 returns an error to the access node 600.
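The write path described above can be sketched as follows. The function `handle_write` and the two callables passed to it are hypothetical stand-ins for the update of the primary slice and the forwarded update of the secondary slice; a failing storage device is modelled, as an assumption, by raising `OSError`.

```python
def handle_write(update_primary, update_secondary, data):
    """Return "ok" only if both slices were updated, otherwise "error"."""
    try:
        update_primary(data)     # local update of the primary slice
        update_secondary(data)   # forwarded to the disk node holding the pair
    except OSError:
        return "error"
    return "ok"


def failing_update(data):
    # Stand-in for a storage device that has failed or is overloaded.
    raise OSError("storage device not responding")


primary, secondary = [], []
result_ok = handle_write(primary.append, secondary.append, b"x")
result_err = handle_write(primary.append, failing_update, b"y")
```

The sketch does not undo the primary update when the secondary update fails; the description does not state how that case is handled, so no rollback is assumed here.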

In the example of FIG. 10, because the storage device 110 (disk ID “1”) of the disk node 100 (disk node ID “SN-A”) has failed, the read request to the first slice (slice ID “1”) of the storage device 110 results in an error.

[Step S14] The slice access processing unit 120 of the disk node 100 returns an error in response to the read request from the access node 600. The slice access request unit 630 of the access node 600 then notifies the metadata inquiry unit 610 that an error has occurred. At this time, the slice access request unit 630 also passes the metadata inquiry unit 610 the virtual disk ID and segment ID of the segment for which the access failed.

[Step S15] The metadata inquiry unit 610 transmits to the control node 500 a metadata inquiry request that designates a segment. The segment designated in the inquiry request (the inquiry target segment) is the segment for which the access failed with an error. A metadata inquiry request designating an inquiry target segment thus signifies that an access to a slice assigned to that segment has failed.

On receiving the metadata inquiry request, the metadata search unit 520 of the control node 500 requests the slice allocation management unit 550 to reassign slices to the inquiry target segment.

Specifically, on receiving the inquiry request, the metadata search unit 520 searches the virtual disk metadata storage unit 510 for the metadata of the slices (primary slice and secondary slice) assigned to the inquiry target segment. Next, the metadata search unit 520 refers to the storage state storage unit 530 and checks the state of each storage device that holds a slice assigned to the inquiry target segment.

If the state of every storage device holding these slices is “normal”, the metadata search unit 520 transmits to the access node 600 the metadata of the primary slice from among the metadata obtained by the search.

If the state of a storage device holding one of the slices is “T1”, recovery processing (duplication recovery processing) for the inquiry target segment is started. In this recovery processing, the metadata search unit 520 first reassigns slices to the inquiry target segment. For example, if the state of the storage device holding the primary slice of the inquiry target segment is “T1”, the primary slice is reassigned. Likewise, if the state of the storage device holding the secondary slice of the inquiry target segment is “T1”, the secondary slice is reassigned.
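The decision made on receipt of an inquiry request can be sketched as follows. The function `roles_to_reassign` and the dictionary layout are hypothetical; the states are the “normal” and “T1” values of the disk management table, keyed by the combination of disk node ID and disk ID.

```python
def roles_to_reassign(primary_disk, secondary_disk, disk_state):
    """Return the roles whose storage device is in the "T1" state."""
    roles = []
    if disk_state[primary_disk] == "T1":
        roles.append("primary")
    if disk_state[secondary_disk] == "T1":
        roles.append("secondary")
    return roles


disk_state = {("SN-A", 1): "T1", ("SN-B", 1): "normal", ("SN-C", 1): "normal"}

# Segment whose primary slice is on SN-A and whose secondary slice is on SN-C.
needs = roles_to_reassign(("SN-A", 1), ("SN-C", 1), disk_state)
```

An empty result corresponds to the case in which the metadata of the primary slice is returned to the access node as it is; a non-empty result corresponds to the start of recovery processing for the segment.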

In the example of FIG. 10, the state of the storage device 110 is “T1”, so the primary slice is reassigned. Specifically, the metadata search unit 520 outputs to the slice allocation management unit 550 a primary slice reassignment request that designates the segment of the virtual disk. The slice allocation management unit 550 then searches the virtual disk metadata storage unit 510 for a free slice (status “F”) among the slices managed by disk nodes other than the disk node that manages the secondary slice of the inquiry target segment.

Next, the slice allocation management unit 550 decides to make the free slice it has found the secondary slice of the inquiry target segment. The slice allocation management unit 550 also decides to change the slice currently assigned to the inquiry target segment as the secondary slice into the primary slice.

In the example of FIG. 10, it is decided that a slice managed by the disk node 200 is assigned to the inquiry target segment as the secondary slice, and that the slice managed by the disk node 300 is changed from the secondary slice to the primary slice. The slice allocation management unit 550 updates the metadata in the virtual disk metadata storage unit 510 in accordance with the reassignment thus decided.
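The reassignment of the primary slice can be sketched as follows, under stated assumptions. The function `reassign_primary`, the record keys, and the virtual disk ID `"vd-x"` are hypothetical. The description excludes only the disk node that manages the current secondary slice from the search for a free slice; the additional `suspect_nodes` argument, which also skips disk nodes whose storage device is in the “T1” state, is an assumption of this sketch. The pair columns are omitted for brevity.

```python
def reassign_primary(table, vdisk, segment, suspect_nodes, now):
    """Promote the secondary slice and assign a free slice as new secondary."""
    members = [r for r in table
               if r["vdisk"] == vdisk and r["segment"] == segment]
    old_primary = next(r for r in members if r["status"] == "P")
    old_secondary = next(r for r in members if r["status"] == "S")

    # Skip the node of the current secondary slice (per the description)
    # and any suspect node (assumption of this sketch).
    skip = set(suspect_nodes) | {old_secondary["node"]}
    new_secondary = next(
        (r for r in table if r["status"] == "F" and r["node"] not in skip),
        None)
    if new_secondary is None:
        raise LookupError("no free slice available")

    old_primary.update(status="F", vdisk=None, segment=None, ts=now)
    old_secondary.update(status="P", ts=now)
    new_secondary.update(status="S", vdisk=vdisk, segment=segment, ts=now)
    return new_secondary


table = [
    {"node": "SN-A", "disk": 1, "slice": 1, "status": "P",
     "vdisk": "vd-x", "segment": 1, "ts": 1},
    {"node": "SN-B", "disk": 1, "slice": 2, "status": "F",
     "vdisk": None, "segment": None, "ts": 1},
    {"node": "SN-C", "disk": 1, "slice": 1, "status": "S",
     "vdisk": "vd-x", "segment": 1, "ts": 1},
]
chosen = reassign_primary(table, "vd-x", 1, suspect_nodes={"SN-A"}, now=2)
```

With this sample table, the slice on “SN-C” becomes the primary slice and the free slice on “SN-B” becomes the new secondary slice, which matches the roles decided for the disk nodes 300 and 200 in the example of FIG. 10.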

[Step S16] The slice allocation management unit 550 transmits a metadata change request to the disk node 200. Specifically, the slice allocation management unit 550 transmits the metadata for the secondary slice after reassignment to the disk node 200. Based on the received information, the disk node 200 changes both the metadata it holds internally and the metadata in the storage device 210. The free slice in the storage device 210 is thereby changed into the secondary slice of the inquiry target segment.

[Step S17] The slice allocation management unit 550 transmits a metadata change request to the disk node 300. Specifically, the slice allocation management unit 550 transmits the metadata for the primary slice after reassignment to the disk node 300. Based on the received information, the disk node 300 changes both the metadata it holds internally and the metadata in the storage device 310. The slice in the storage device 310 that was assigned to the inquiry target segment is thereby changed from the secondary slice to the primary slice.

[Step S18] A metadata change completion response is transmitted from the disk node 200 to the control node 500.
[Step S19] A metadata change completion response is transmitted from the disk node 300 to the control node 500.

Although not shown in FIG. 10, once the metadata change processing is complete in the disk nodes 200 and 300, a data copy for restoring the duplicated state of the inquiry target segment is started. Specifically, the data of the slice in the disk node 300 whose metadata was changed to the primary slice of the inquiry target segment is transferred from the disk node 300 to the disk node 200. The data sent from the disk node 300 is then stored in the slice in the disk node 200 that was newly assigned as the secondary slice of the inquiry target segment. When the data copy is complete, the recovery processing for the inquiry target segment is complete.

In this way, slices are reassigned as a result of a metadata inquiry request. At this time, the slice allocation management unit 550 stores in the RAM 502 the segment that was reassigned because the state of the storage device was “T1” (the reassigned segment). Specifically, the combination of the virtual disk ID and the segment ID of the reassigned segment is stored in the RAM. If the storage device recovers after the reassignment, those slices of the recovered storage device that had been assigned to reassigned segments are changed to free slices.

FIG. 12 is a diagram showing the contents of the virtual disk metadata storage unit after the update. As shown in FIG. 12, the status of the metadata of the slice indicated by the disk node ID “SN-A”, the disk ID “1”, and the slice ID “1” has been changed to “F”. The assignment of that slice in the storage device 110 to the inquiry target segment (segment ID “1”) is thereby released.

In addition, the status of the metadata of the slice indicated by the disk node ID “SN-C”, the disk ID “1”, and the slice ID “1” has been changed to “P”. The slice in the storage device 310 that was assigned to the inquiry target segment is thereby changed from the secondary slice to the primary slice.

Furthermore, in the metadata of the slice indicated by the disk node ID “SN-B”, the disk ID “1”, and the slice ID “2”, the status has been changed to “S”, the virtual disk ID has been set to “VLOX-X”, and the segment ID has been set to “1”. A slice in the storage device 210 is thereby assigned as the secondary slice of the inquiry target segment.

In addition, the time stamp of each metadata record whose status or other contents were changed has been updated to “t(n+1)”, where “t(n+1)” is the time of the metadata update.
Returning to the description of FIG. 10, when the metadata update is complete, the slice allocation management unit 550 notifies the metadata search unit 520 that the slice reassignment is complete.

[Step S20] The metadata search unit 520 notifies the access node 600 of the metadata of the primary slice of the inquiry target segment. The metadata inquiry unit 610 of the access node 600 then updates the metadata in the access metadata storage unit 620 based on the acquired metadata. After that, the metadata inquiry unit 610 notifies the slice access request unit 630 that the metadata inquiry is complete.

[Step S21] When the metadata inquiry is complete, the slice access request unit 630 refers to the access metadata storage unit 620, determines which disk node manages the slice to be accessed, and transmits a read request (read retry) to that disk node. At the time of the retry, the primary slice to be accessed is a slice managed by the disk node 300. The read retry is therefore directed to the disk node 300.

[Step S22] On receiving the read request, the disk node 300 reads the data from the slice in the storage device 310 and transmits the read data to the access node 600. The slice access request unit 630 then transmits the acquired data to the terminal device that issued the access instruction.
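Seen from the access node, steps S13 to S22 amount to a read, a metadata inquiry on error, and a single retry. The following is a minimal sketch of that flow; the function `read_with_retry`, the stand-in functions `read_slice` and `inquire`, the location tuples, and the virtual disk ID `"vd-x"` are all hypothetical, and a failed access is modelled, as an assumption, by raising `OSError`.

```python
def read_with_retry(segment, cache, read_slice, inquire):
    """cache maps a segment to the location of its primary slice."""
    try:
        return read_slice(cache[segment])
    except OSError:
        # Ask the control node, refresh the cached location, retry once.
        cache[segment] = inquire(segment)
        return read_slice(cache[segment])


# Hypothetical stand-in for the disk nodes: SN-A always fails.
def read_slice(location):
    if location[0] == "SN-A":
        raise OSError("read failed")
    return b"data from " + location[0].encode()


# Hypothetical stand-in for the control node after the reassignment.
def inquire(segment):
    return ("SN-C", 1, 1)


cache = {("vd-x", 1): ("SN-A", 1, 1)}
payload = read_with_retry(("vd-x", 1), cache, read_slice, inquire)
```

The cached location is overwritten before the retry, so later reads of the same segment go directly to the new primary slice.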

[Step S23] In the example of FIG. 10, the storage device 110 has failed, so no response is returned from the storage device 110 even after T2 has elapsed since the inspection command was issued. The T2 detection unit 140 of the disk node 100 therefore detects that T2 has elapsed since the inspection command was issued. The T2 detection unit 140 then transmits a T2 elapse notification to the control node 500.

[Step S24] On receiving the T2 elapse notification, the slice allocation management unit 550 of the control node 500 recognizes that the storage device 110 has become unusable due to a failure, and starts recovery processing for the entire storage device 110.
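The two thresholds applied to an outstanding inspection command can be sketched as follows. The function `probe_event` and the numeric values are hypothetical, and the sketch assumes that T1 is shorter than T2 and that a threshold counts as elapsed once the waiting time exceeds it.

```python
def probe_event(elapsed, responded, t1, t2):
    """Classify the state of one outstanding inspection command."""
    if responded:
        return "response"      # normal operation, or recovery if after T1
    if elapsed > t2:
        return "T2 elapsed"    # device treated as failed
    if elapsed > t1:
        return "T1 elapsed"    # device possibly failed
    return "waiting"


T1, T2 = 2.0, 30.0             # hypothetical values, in seconds

events = [probe_event(e, False, T1, T2) for e in (1.0, 5.0, 31.0)]
```

The function is stateless, so a caller would have to remember which notifications it has already sent in order to send each of them to the control node only once.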

In this way, when the storage device 110 fails, the data in the segments to which slices in the storage device 110 were assigned as primary slices becomes accessible without waiting for the failure detection time to elapse. As a result, the period during which accesses from the access node 600 result in errors is shortened.

Next, the slice switching process performed when the response to an inspection command is delayed because the load on the normally operating storage device 110 is temporarily excessive will be described.
FIG. 13 is a sequence diagram illustrating the procedure of the slice switching process when the load on the storage device becomes excessive. In this example, it is assumed that the load on the storage device 110 connected to the disk node 100 has temporarily become excessive. Steps S31 to S42 of FIG. 13 are the same as steps S11 to S22 of FIG. 10, respectively. The processing from step S43 onward is therefore described below in order of step number.

[Step S43] The T1/recovery detection unit 130 of the disk node 100 receives a response from the storage device 110 to the inspection command. The T1/recovery detection unit 130 thereby detects that the storage device 110 has returned to an accessible state. The T1/recovery detection unit 130 then transmits a recovery notification for the storage device 110 to the control node 500.

[Step S44] On receiving the recovery notification, the slice allocation management unit 550 of the control node 500 transmits a metadata change request to the disk node 100. Specifically, the slice allocation management unit 550 refers to the virtual disk metadata storage unit 510 and extracts, from among the slices of the storage device 110, the slices that had been assigned to segments for which slices were reassigned on the basis of a metadata inquiry (reassigned segments). The slice allocation management unit 550 then transmits to the disk node 100 a metadata change request for releasing the assignment of those slices to their segments (changing their status to “F” so that they become free slices).

[Step S45] The slice management unit 160 of the disk node 100 updates the metadata of the designated slices in the metadata storage unit 150 in accordance with the metadata change request. After updating the metadata, the slice management unit 160 transmits a change completion response to the control node 500.

In this way, when the overload of the storage device 110 is resolved and the storage device 110 recovers, updating the metadata on the disk node 100 side prevents inconsistencies between metadata. That is, because the reallocation has been carried out, each reallocated segment now has, as both its primary slice and its secondary slice, slices of storage devices other than the storage device 110. Consequently, once the storage device 110 recovers, the slice allocated to a reallocated segment (the primary slice in the example of FIG. 13) would exist in duplicate. Changing the slices in the storage device 110 to the free state prevents this contradiction in the allocation relationships.

The metadata of the storage device 110 is changed only after the storage device 110 recovers because, until recovery is confirmed, the storage device 110 may in fact have failed and the metadata might not be updated correctly. For this reason, whenever it reallocates slices, the control node 500 records the reallocated segments in its RAM. Keeping this record lets it determine, when the storage device 110 recovers, which slices should be set to the free state.
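The release step of S44 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SliceMeta` record, the `build_release_requests` helper, and the request dictionary layout are all hypothetical names introduced here, and the sketch assumes that the control node can still see the stale slice entries of the recovered device in its table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SliceMeta:
    disk_node_id: str           # e.g. "SN-A"
    disk_id: int
    slice_id: int
    state: str                  # "P" (primary), "S" (secondary), or "F" (free)
    segment_id: Optional[int]   # None while the slice is free


def build_release_requests(table, reallocated_segments, disk_node_id, disk_id):
    """Return change requests that free the recovered device's stale slices.

    table: iterable of SliceMeta known to the control node (assumed layout).
    reallocated_segments: set of segment IDs remembered in RAM.
    """
    requests = []
    for meta in table:
        on_recovered = (meta.disk_node_id, meta.disk_id) == (disk_node_id, disk_id)
        if on_recovered and meta.segment_id in reallocated_segments:
            # State "F" releases the slice from its segment.
            requests.append(
                {"slice_id": meta.slice_id, "state": "F", "segment_id": None}
            )
    return requests


# Segment 1 was reallocated away from SN-A while its disk was unresponsive.
table = [
    SliceMeta("SN-A", 1, 1, "P", 1),   # stale copy on the recovered device
    SliceMeta("SN-A", 1, 2, "S", 2),   # segment 2 was never reallocated
    SliceMeta("SN-C", 1, 1, "P", 1),   # replacement primary
]
requests = build_release_requests(table, {1}, "SN-A", 1)
```

Only the slice that belonged to a reallocated segment is released; the slice for segment 2 keeps its allocation because that segment was never switched to another device.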

A failure may occur in the control node 500 before the storage device 110 recovers, so that the data in the RAM 502 of the control node 500 is lost. Examples are a failure of the control node 500 that causes its functions to be taken over by an alternative node (failover), and a restart of the control node 500. In such cases, the control node 500 collects metadata from the disk nodes 100, 200, and 300 and reconstructs the virtual disk metadata table 511. If the disk in question recovers while the control node 500 is failing over or restarting, three pieces of metadata indicating slices allocated to a reallocated segment are collected, which is a contradiction. To resolve it, the control node 500 that has reconstructed the virtual disk metadata table 511 refers to the time stamps attached to the metadata to determine which slices should be set to the free state.

FIG. 14 is a sequence diagram illustrating inconsistency resolution processing that uses time stamps. The processing shown in FIG. 14 is described below in order of step number. This processing is executed when the control node 500 has been restarted or has failed over.

[Step S51] The slice allocation management unit 550 of the control node 500 transmits a metadata request to each of the disk nodes 100, 200, and 300.
[Step S52] The slice management unit 160 of the disk node 100, having received the metadata request, acquires the metadata from the metadata storage unit 150 or the storage device 110 and transmits it to the control node 500. The other disk nodes 200 and 300 transmit their metadata to the control node 500 in the same manner.

In the control node 500, which has collected metadata from the disk nodes 100, 200, and 300, the slice allocation management unit 550 reconstructs the virtual disk metadata table 511 from the collected metadata. The slice allocation management unit 550 then performs a metadata consistency check, which determines whether any segment has three or more slices allocated to it. If there is such a segment, the slice allocation management unit 550 compares the metadata of the slices allocated to it. Among slices with the same state (primary slice or secondary slice), it judges every slice other than the one with the latest time stamp to be a slice whose allocation should be released.

[Step S53] The slice allocation management unit 550 transmits, to the disk node that manages the slice whose allocation should be released (the disk node 100 in the example of FIG. 14), a metadata change request that changes the state of that slice to “F”.

[Step S54] The slice management unit 160 of the disk node 100 updates the metadata in both the metadata storage unit 150 and the storage device 110 in response to the metadata change request. The slice management unit 160 then returns a change completion response to the control node 500.

FIG. 15 is a diagram illustrating an example of a reconstructed virtual disk metadata table. This example assumes that the control node 500 is restarted after the virtual disk metadata table 511 has been updated following steps S38 and S39 and before the recovery notification of step S43.

In the example of FIG. 15, the slice identified by disk node ID “SN-A”, disk ID “1”, and slice ID “1” and the slice identified by disk node ID “SN-C”, disk ID “1”, and slice ID “1” are both allocated to the segment with segment ID “1” as the primary slice (state “P”). The slice allocation management unit 550 therefore compares the time stamps in the metadata of the two slices that are allocated in duplicate.

The time stamp of the slice with disk node ID “SN-A” is “t1”, and the time stamp of the slice with disk node ID “SN-C” is “t(n+1)”. Because “t(n+1)” is later than “t1”, the slice with disk node ID “SN-C” is the correct one. Accordingly, a metadata change request is transmitted to the disk node 100, which is indicated by disk node ID “SN-A”, to change the state of the slice with slice ID “1” in the storage device 110 with disk ID “1” to “F” (the free state).
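The consistency check and time stamp comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `SliceRecord` and `find_stale_slices` are hypothetical names, time stamps are modeled as plain integers (with 1 standing in for “t1” and 11 for “t(n+1)”), and a single virtual disk is assumed so that the segment ID alone identifies a segment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceRecord:
    disk_node_id: str
    disk_id: int
    slice_id: int
    state: str        # "P" primary, "S" secondary, "F" free
    segment_id: int
    timestamp: int    # larger means newer (assumed representation)


def find_stale_slices(records):
    """Return the slices whose allocation should be released."""
    by_segment = defaultdict(list)
    for rec in records:
        if rec.state in ("P", "S"):
            by_segment[rec.segment_id].append(rec)

    stale = []
    for recs in by_segment.values():
        if len(recs) < 3:
            continue  # a primary/secondary pair is consistent
        by_state = defaultdict(list)
        for rec in recs:
            by_state[rec.state].append(rec)
        for same_state in by_state.values():
            newest = max(same_state, key=lambda r: r.timestamp)
            stale.extend(r for r in same_state if r is not newest)
    return stale


# Mirrors the FIG. 15 situation: two primaries claim segment 1.
records = [
    SliceRecord("SN-A", 1, 1, "P", 1, timestamp=1),    # "t1"
    SliceRecord("SN-C", 1, 1, "P", 1, timestamp=11),   # "t(n+1)"
    SliceRecord("SN-B", 1, 1, "S", 1, timestamp=11),
]
stale = find_stale_slices(records)
```

Here `stale` contains only the “SN-A” record, which is the slice that receives the change to state “F”.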

In this way, even if the control node 500 is restarted or fails over between the slice reallocation triggered by the T1 elapsed notification from the disk node 100 and the recovery notification from the disk node 100, the time stamps make it possible to keep the metadata consistent. That is, a metadata update that could not be carried out because of the failover or restart can be carried out afterward.

As described above, according to the first embodiment, separating the failure detection time “T2” from the malfunction detection time “T1” has the following effects.
1. Access does not stop for a long time even if the failure detection time “T2” is long.

Some disk devices, because of internal switching, stop responding for a while (on the order of minutes) even though they have not failed. Even with such devices, access can be restored within the malfunction detection time “T1” (for example, one second).

2. The failure detection time does not have to be tuned for each disk device.
In conventional multi-node storage, the failure detection time had to be tuned for each disk device in order to restore access as quickly as possible without erroneous detection. In the first embodiment, the failure detection time “T2” is set to a time within which most devices respond (for example, one minute), and the malfunction detection time “T1” is set to, for example, one second. Because the failure detection time “T2” is set long, a failure is not erroneously detected regardless of the type of storage device.
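The separation of the two thresholds can be illustrated with a small sketch. The values of one second and one minute are the examples given in the text; the `classify` function and its state names are hypothetical and only show how a single elapsed time is compared against T1 and T2.

```python
T1_SECONDS = 1.0    # malfunction detection time (example value from the text)
T2_SECONDS = 60.0   # failure detection time (example value from the text)


def classify(elapsed_seconds, responded):
    """Classify a storage device from the time since the inspection command.

    elapsed_seconds: time since the inspection command was issued.
    responded: True once the device has answered the inspection command.
    """
    if responded:
        return "normal"
    if elapsed_seconds > T2_SECONDS:
        return "failed"        # start recovery of the whole device
    if elapsed_seconds > T1_SECONDS:
        return "malfunction"   # reallocate slices on access, keep waiting
    return "waiting"


states = [classify(t, False) for t in (0.5, 5.0, 90.0)]
```

A device that stays silent for five seconds is treated as malfunctioning, so access can be switched to other slices, while it is declared failed only after the much longer T2 has passed.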

In this way, reliable failure detection and rapid restoration of the access environment during a malfunction can both be achieved.
[Second Embodiment]
In the second embodiment, recovery processing is started based on a disk node disconnection instruction from the management node 30, without using T2.

In the first embodiment, if the storage device 110 does not respond before T2 has elapsed since the inspection command was issued, the disk node 100 outputs a T2 elapsed notification. However, an appropriate value for T2 can be difficult to determine. For example, if the connected storage devices 110, 210, and 310 differ in manufacturer or performance, the appropriate T2 differs as well.

Moreover, as shown in the first embodiment, slices are reallocated if there is no response by the time T1 has elapsed since the inspection command was issued, so data access is possible without waiting for the storage device to recover. Omitting the detection of T2 therefore does not affect data access from the access node.

On the other hand, if a storage device has actually failed, the data in that storage device must be recovered. In the second embodiment, therefore, an administrator who has confirmed the failure of a storage device can instruct its disconnection via the management node 30.

FIG. 16 is a sequence diagram illustrating the processing procedure when disconnection of a disk node is instructed from the management node. Steps S61 to S72 and S74 shown in FIG. 16 are the same as steps S11 to S22 and S24 shown in FIG. 10, respectively. Only step S73, which differs from FIG. 10, is described below.

[Step S73] The management node 30 can acquire various kinds of information indicating the operating status from the disk nodes 100, 200, and 300 and from the control node 500. The acquired information is displayed on the monitor of the management node 30. The fact that the storage device 110 has not responded to the inspection command is therefore also displayed on the screen of the management node 30.

By looking at the screen of the management node 30, the administrator of the multi-node storage system recognizes that the storage device 110 may have failed. The administrator then carries out tasks such as issuing various control commands (for example, a restart) to the storage device 110. If this confirms that the storage device 110 has failed, the administrator performs an operation input on the management node 30 that instructs disconnection of the disk node 100. The management node 30 then transmits a disk node disconnection request to the control node 500.

In response to this disk node disconnection request, recovery processing (step S74) is executed under the control of the control node 500, and the disk node 100 is disconnected from the multi-node storage system.

In this way, the start of recovery processing for an entire storage device can be deferred until the administrator gives an instruction.
[Third Embodiment]
In the third embodiment, based on a slice abnormality notification from a disk node that designates a slice, a slice is reallocated to the segment to which the designated slice had been allocated.

In the first and second embodiments, when a metadata inquiry request arrives from the access node, the control node reallocates slices to the segment that is the target of the inquiry. In this case, the metadata of the disk nodes is changed as an extension of the metadata inquiry processing. If the disk nodes whose metadata is changed by the slice reallocation are operating normally, the time from the transmission of the metadata inquiry request (step S15) to the metadata notification (step S20) shown in FIG. 10 is short.

However, if a disk node whose metadata is to be changed is overloaded, the metadata change completion response may not come back from the disk node to the control node. The metadata notification from the control node to the access node is then delayed as well.

In addition, the control node may execute the metadata search processing for metadata inquiry requests in a single process. This is because, depending on the system design, repeatedly executing short operations in a single process can be more efficient overall than executing them in parallel in multiple processes. For example, if multiple processes handle the metadata search processing, extra work such as distributing the received metadata inquiry requests among them is needed, which can actually lower processing efficiency. Furthermore, if many metadata search processes are started, they consume large amounts of resources such as memory, which can lower the processing efficiency of the control node as a whole.

When a single process handles a large number of metadata inquiry requests, one slow operation makes the processing of the other metadata inquiry requests wait. The access node then takes longer to obtain the results of its metadata inquiries, and the service efficiency of the multi-node storage system drops.

In the third embodiment, therefore, when a disk node receives an access request for a slice in a storage device for which the elapse of the malfunction detection time “T1” has been detected, it sends a slice abnormality notification to the control node. The control node performs slice reallocation in response to the slice abnormality notification from the disk node rather than as an extension of the processing of a metadata inquiry request. This allows the segment to which the accessed slice was allocated to be reallocated without delaying the response to the metadata inquiry request.

The system configuration of the third embodiment is the same as that of the first embodiment shown in FIG. 2. The functions of the disk nodes and the control node differ, however, and are described below.

FIG. 17 is a block diagram illustrating the functions of the devices of the multi-node storage system according to the third embodiment. The multi-node storage system according to the third embodiment can be constructed by replacing the internal functions of the disk nodes 100, 200, and 300 in FIG. 2 with functions equivalent to those of the disk node 400 shown in FIG. 17, and replacing the internal functions of the control node 500 in FIG. 2 with functions equivalent to those of the control node 800 shown in FIG. 17.

The disk node 400 includes a slice access processing unit 420, a T1/recovery detection unit 430, a T2 detection unit 440, a metadata storage unit 450, a slice management unit 460, and an accessed slice detection unit 470.

The slice access processing unit 420 has the functions of the slice access processing unit 120 shown in FIG. 5. In addition, when it returns an error for an access request from the access node 600, the slice access processing unit 420 notifies the accessed slice detection unit 470 of information on the slice of the storage device for which the error occurred.

In addition to the functions of the T1/recovery detection unit 130 shown in FIG. 5, the T1/recovery detection unit 430 has a function of notifying the accessed slice detection unit 470 when it detects T1. It also has a function of notifying the accessed slice detection unit 470 when the storage device 410 recovers after T1 has been detected.

In addition to the functions of the T2 detection unit 140 shown in FIG. 5, the T2 detection unit 440 has a function of notifying the accessed slice detection unit 470 when it detects T2.
The metadata storage unit 450 stores, in addition to the information stored in the metadata storage unit 150 shown in FIG. 5, information indicating the duplex state of each slice. The duplex state is either normal or copying. Normal is the state in which duplication with the paired slice is maintained (the stored data of the two slices is identical). Copying is the state in which data is being copied to establish duplication with the paired slice.

In addition to the functions of the slice management unit 160 shown in FIG. 5, the slice management unit 460 has a function of managing the duplex state of slices and setting the current duplex state of each slice in the metadata storage unit 450.

The accessed slice detection unit 470 transmits to the control node 800 a slice abnormality notification concerning a slice for which an access request from the access node 600 resulted in an error. Specifically, the accessed slice detection unit 470 recognizes, from notifications from the T1/recovery detection unit 430, that T1 has been detected for the storage device 410 and that the device has subsequently recovered. It recognizes, from a notification from the T2 detection unit 440, that T2 has been detected for the storage device 410. It also recognizes, from a notification from the slice access processing unit 420, that an access to a slice in the storage device 410 has resulted in an error.

When an access to a slice in the storage device 410 results in an error, the accessed slice detection unit 470 checks the state of the storage device 410. That is, it determines whether the storage device 410 has not responded to the inspection command and the current time is after the detection of T1 and before the detection of T2. If this condition is satisfied, the accessed slice detection unit 470 transmits to the control node 800 a slice abnormality notification that designates the slice that was the access target.
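The notification condition can be sketched as a small state holder. This is a hypothetical illustration rather than the unit's actual design: the class name, method names, and the notification dictionary are assumptions, and the `send` callback stands in for transmission to the control node 800.

```python
class AccessedSliceDetector:
    """Tracks T1/T2 notifications and decides when to report a slice."""

    def __init__(self, send):
        self._send = send            # callback standing in for the network send
        self._t1_detected = False
        self._t2_detected = False

    def on_t1_detected(self):        # from the T1/recovery detection unit
        self._t1_detected = True

    def on_recovered(self):          # device answered again after T1
        self._t1_detected = False

    def on_t2_detected(self):        # from the T2 detection unit
        self._t2_detected = True

    def on_access_error(self, slice_id):
        """Called by the slice access processing unit when an access fails."""
        if self._t1_detected and not self._t2_detected:
            self._send({"type": "slice_abnormality", "slice_id": slice_id})
            return True
        return False


sent = []
detector = AccessedSliceDetector(sent.append)
before_t1 = detector.on_access_error(3)   # no T1 yet: nothing is sent
detector.on_t1_detected()
during_t1 = detector.on_access_error(3)   # between T1 and T2: notify
detector.on_t2_detected()
after_t2 = detector.on_access_error(4)    # after T2: device-wide recovery applies
```

Only the access error that occurs after T1 and before T2 produces a slice abnormality notification, which keeps per-slice reallocation separate from recovery of the whole device.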

The control node 800 includes a virtual disk metadata storage unit 810, a metadata search unit 820, an allocation availability storage unit 830, a storage state management unit 840, and a slice allocation management unit 850.

The function of the virtual disk metadata storage unit 810 is the same as that of the virtual disk metadata storage unit 510 shown in FIG. 5.
The function of the metadata search unit 820 is the same as that of the metadata search unit 520 shown in FIG. 5, except that the function of requesting slice reallocation from the slice allocation management unit 550 is unnecessary.

The allocation availability storage unit 830 stores, for each storage device, information indicating whether slices in that storage device can be allocated to segments (allocation availability information). For example, part of the storage area in the RAM of the control node 800 is used as the allocation availability storage unit 830.

The storage state management unit 840 changes the state of the storage device 410 recorded in the allocation availability storage unit 830 in response to notifications from the disk node 400. Specifically, on receiving a T1 elapsed notification from the disk node 400, the storage state management unit 840 sets the state of the storage device 410 in the allocation availability storage unit 830 to not allocatable. On receiving a recovery notification from the disk node 400, the storage state management unit 840 sets the state of the storage device 410 in the allocation availability storage unit 830 to allocatable.

The slice allocation management unit 850 manages the allocation of slices to the segments of the virtual disk 60. For example, on receiving a slice abnormality notification from the disk node 400, the slice allocation management unit 850 allocates a different slice to the segment to which the slice indicated in the notification is allocated.

On receiving a T2 elapsed notification from the disk node 400, the slice allocation management unit 850 starts recovery processing for the segments to which slices in the storage device 410 are allocated. In the recovery processing, the slice allocation management unit 850 first makes every slice that is allocated to a recovery target segment and belongs to a storage device other than the storage device 410 (a normally operating storage device) a primary slice. Next, it allocates slices of the storage devices 210 and 310 as the secondary slices of the recovery target segments. It then transmits metadata update requests reflecting the allocation result to the disk nodes that manage the normally operating storage devices. When the disk nodes that received the update requests have finished updating their metadata, the slice allocation management unit 850 updates the metadata in the virtual disk metadata storage unit 810.
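The allocation part of this recovery processing can be sketched as a planning function. Everything here is an assumption made for illustration: slices are modeled as `(disk_node_id, disk_id, slice_id)` tuples, `plan_recovery` is a hypothetical helper, and the rule that the new secondary must sit on a different device from the surviving slice is a simplification that the text does not spell out.

```python
def plan_recovery(segments, failed, free_slices):
    """Plan new primary/secondary pairs for segments hit by a failed device.

    segments: {segment_id: {"P": slice, "S": slice}}
    failed: (disk_node_id, disk_id) of the failed storage device
    free_slices: free slices on normally operating devices
    """
    free = list(free_slices)
    plan = {}
    for seg_id, pair in sorted(segments.items()):
        survivors = [s for s in pair.values() if s[:2] != failed]
        if len(survivors) == len(pair) or not survivors:
            continue  # unaffected segment, or no surviving copy (out of scope)
        primary = survivors[0]  # the surviving slice becomes the primary
        candidate = next(
            (s for s in free if s[:2] != failed and s[:2] != primary[:2]),
            None,
        )
        if candidate is None:
            raise RuntimeError(f"no free slice available for segment {seg_id}")
        free.remove(candidate)
        plan[seg_id] = {"P": primary, "S": candidate}
    return plan


segments = {
    1: {"P": ("SN-A", 1, 1), "S": ("SN-B", 1, 1)},
    2: {"P": ("SN-B", 1, 2), "S": ("SN-C", 1, 2)},
}
free_slices = [("SN-B", 1, 5), ("SN-C", 1, 5)]
plan = plan_recovery(segments, ("SN-A", 1), free_slices)
```

Segment 1 loses its primary on “SN-A”, so its former secondary on “SN-B” is promoted and a free slice on “SN-C” becomes the new secondary. Segment 2 has no slice on the failed device and is left out of the plan.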

Next, the structure of the data managed by each node is described.
FIG. 18 is a diagram illustrating an example of the data structure of the metadata storage unit. The metadata storage unit 450 stores a metadata table 451. The metadata table 451 has columns for disk node ID, disk ID, slice ID, state, virtual disk ID, segment ID, virtual disk address, paired disk node ID, paired disk ID, paired slice ID, time stamp, and duplex state. The items of information arranged in a row of the metadata table 451 are associated with one another and form one record of metadata. The information set in each column of the metadata table 451 other than the duplex state is of the same kind as that in the column of the same name in the metadata table 151 shown in FIG. 7.

The duplex state column holds the duplex state of the corresponding slice. It is set to “normal” if duplication with the paired slice is maintained, and to “copying” if data is being copied to or from the paired slice.
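One row of the metadata table 451 could be modeled as the record below. The field names follow the column names in the text, but the Python types, the `DuplexState` enumeration, and the sample values (including the virtual disk ID “LVOL-X” and the address 0) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DuplexState(Enum):
    NORMAL = "normal"     # data is identical to the paired slice
    COPYING = "copying"   # data is being copied to establish duplication


@dataclass
class MetadataRecord:
    disk_node_id: str
    disk_id: int
    slice_id: int
    state: str                 # "P", "S", or "F"
    virtual_disk_id: str
    segment_id: int
    virtual_disk_address: int
    pair_disk_node_id: str
    pair_disk_id: int
    pair_slice_id: int
    timestamp: int
    duplex_state: DuplexState


def is_redundant(record):
    """True when the slice and its pair currently hold identical data."""
    return record.duplex_state is DuplexState.NORMAL


record = MetadataRecord(
    "SN-A", 1, 1, "P", "LVOL-X", 1, 0,
    "SN-B", 1, 1, timestamp=1, duplex_state=DuplexState.COPYING,
)
```

With the duplex state held per slice, a node can tell from the record alone whether the pair is still being synchronized.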

FIG. 19 is a diagram illustrating an example of the data structure of the virtual disk metadata storage unit. The virtual disk metadata storage unit 810 stores a virtual disk metadata table 811. The virtual disk metadata table 811 has columns for disk node ID, disk ID, slice ID, state, virtual disk ID, segment ID, virtual disk address, paired disk node ID, paired disk ID, paired slice ID, time stamp, and duplex state. The items of information arranged in a row of the virtual disk metadata table 811 are associated with one another and form one record of metadata. The information set in each column of the virtual disk metadata table 811 is of the same kind as that in the column of the same name in the metadata table 451.

FIG. 20 is a diagram illustrating an example of the data structure of the allocation availability storage unit. The allocation availability storage unit 830 stores an allocation availability management table 831. The allocation availability management table 831 has columns for disk node ID, disk ID, and allowed/not allowed.

The possible/impossible column holds a flag indicating whether slices of each storage device can be allocated to segments. The value "possible" is set when allocation is possible, and the value "impossible" is set when it is not.

In a multi-node storage system configured in this way, if the storage device does not return a response to the inspection command even after T1 has elapsed, the following slice switching process is executed.

FIG. 21 is a sequence diagram illustrating the procedure of the slice switching process when a storage device fails. In the example of FIG. 21, the disk node ID of the disk node 400 is "SN-A", the disk node ID of the disk node 400a is "SN-B", and the disk node ID of the disk node 400b is "SN-C". The disk nodes 400a and 400b have the same functions as the disk node 400 (see FIG. 17).

Note that the processing in steps S88 to S95 shown in FIG. 21 is the same as the processing in steps S16 to S19 and S21 to S24 shown in FIG. 10, respectively. The following description therefore covers the processing that differs from FIG. 10.

[Step S81] The T1/return detection unit 430 of the disk node 400 periodically checks the operation of the storage device 410. The details are the same as described for step S11. When the operation check detects that T1 has elapsed, the T1/return detection unit 430 transmits a T1 progress notification to the control node 800. The T1 progress notification includes the disk node ID of the disk node 400 and the disk ID of the storage device 410. The T1/return detection unit 430 then continues to wait for a response from the storage device 410.

On receiving the T1 progress notification, the storage state management unit 840 of the control node 800 switches the state of the storage device 410. Specifically, the storage state management unit 840 searches the allocation availability storage unit 830 for the entry corresponding to the combination of disk node ID and disk ID indicated in the T1 progress notification. It then changes the possible/impossible flag of that storage device to "impossible".

FIG. 22 is a diagram illustrating an example of the allocation availability storage unit after the allocation availability has been updated. As shown in FIG. 22, the flag for the combination of disk node ID "SN-A" and disk ID "1" has been changed to "impossible". From this point on, the control node 800 recognizes that slices of the storage device 410 connected to the disk node 400 cannot be allocated to segments.
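
The state switch described above amounts to a keyed lookup followed by a flag update. The following is a minimal sketch, assuming the allocation availability management table is held as a dictionary keyed by (disk node ID, disk ID). The function names, the dictionary representation, and the entries for nodes other than "SN-A" are illustrative assumptions.

```python
# Allocation availability management table (assumed in-memory form):
# (disk node ID, disk ID) -> "possible" / "impossible"
allocation_table = {
    ("SN-A", 1): "possible",
    ("SN-B", 1): "possible",  # sample entry (assumed)
    ("SN-C", 1): "possible",  # sample entry (assumed)
}


def on_t1_progress_notification(table, disk_node_id, disk_id):
    """Mark the storage device named in a T1 progress notification as not allocatable."""
    key = (disk_node_id, disk_id)
    if key not in table:
        raise KeyError(f"unknown storage device: {key}")
    table[key] = "impossible"


def allocatable_disks(table):
    """Return the storage devices whose slices may still be allocated to segments."""
    return [key for key, flag in table.items() if flag == "possible"]


# The disk node "SN-A" reports that T1 elapsed for its disk "1".
on_t1_progress_notification(allocation_table, "SN-A", 1)
```

After the call, `allocatable_disks(allocation_table)` no longer contains `("SN-A", 1)`, which mirrors the updated table described for FIG. 22.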

The description now returns to FIG. 21.
[Step S82] When the switching of the allocation availability is completed, the storage state management unit 840 transmits a switching completion response to the disk node 400.

[Step S83] Meanwhile, when access to data in the virtual disk 60 occurs, for example in response to a user's operation input to the terminal devices 21, 22, and 23, the slice access request unit 630 of the access node 600 refers to the access metadata storage unit 620 and determines which disk node manages the data to be accessed. In the example of FIG. 21, it is assumed that a read access to a slice managed by the disk node 400 has occurred. The slice access request unit 630 then transmits a read request for the target data to the disk node 400. The slice access processing unit 420 of the disk node 400 refers to the metadata storage unit 450, confirms that the slice to be accessed is a slice in the storage device 410 that the disk node itself manages, and performs a read access specifying the data in that slice of the storage device 410.

In the example of FIG. 21, the read request from the access node 600 is issued after the T1/return detection unit 430 has detected that T1 has elapsed. In this case, the storage device 410 has either failed or is so overloaded that it cannot respond promptly to the inspection command. If the storage device 410 has failed, access to it results in an error. If the storage device 410 is overloaded, data access also fails with high frequency. In the example of FIG. 21, a failure of the storage device 410 (disk node ID "SN-A", disk ID "1") causes the read request to its first slice (slice ID "1") to end in an error.

[Step S84] The slice access processing unit 420 of the disk node 400 returns an error in response to the read request from the access node 600. The slice access request unit 630 of the access node 600 then notifies the metadata inquiry unit 610 that an error has occurred. At this time, the slice access request unit 630 also passes to the metadata inquiry unit 610 the virtual disk ID and the segment ID of the segment for which the access failed.

[Step S85] Meanwhile, when the data read from the storage device 410 in response to the read request fails, the slice access processing unit 420 of the disk node 400 notifies the accessed slice detection unit 470 that an error has occurred. At that time, the combination of disk ID and slice ID identifying the accessed slice of the storage device is passed to the accessed slice detection unit 470.

The accessed slice detection unit 470 determines the state of the storage device in which the error occurred. Specifically, it determines whether both of the following conditions are satisfied: a first condition that it has been notified by the T1/return detection unit 430 that T1 has elapsed for the storage device 410 but has not yet been notified of its return, and a second condition that it has not been notified by the T2 detection unit 440 that T2 has elapsed for the storage device 410. If both conditions are satisfied, the accessed slice detection unit 470 judges that the storage device 410 may have failed but that the failure has not yet been confirmed (recovery processing for the entire storage device 410 has not started). In that case, the accessed slice detection unit 470 notifies the slice management unit 460 that the state of the accessed slice should be set to abnormal (Bad). The slice management unit 460 then searches the metadata storage unit 450 for the metadata of the accessed slice and changes its status to "B" (Bad), which indicates that the slice is abnormal. In addition, the slice management unit 460 waits for the storage device 410 to return and then changes to "B" the status of the metadata held in the storage device 410 for the accessed slice.

Furthermore, when the first and second conditions are satisfied, the accessed slice detection unit 470 transmits a slice abnormality notification to the control node 800. The slice abnormality notification includes the combination of disk ID and slice ID identifying the accessed slice of the storage device.
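
The two conditions and the resulting actions can be sketched as follows. This is a simplified illustration, not the disclosed implementation: the `DiskState` class, the dictionary of slice statuses, the notification payload, and the `notify` callback standing in for transmission to the control node 800 are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DiskState:
    """Notifications received so far for one storage device (assumed representation)."""
    t1_detected: bool = False   # elapse of T1 was notified
    returned: bool = False      # return was notified after T1
    t2_detected: bool = False   # elapse of T2 was notified


def handle_access_error(state, slice_status, disk_id, slice_id, notify):
    """Mark the accessed slice as Bad and report it if both conditions hold."""
    first = state.t1_detected and not state.returned   # first condition
    second = not state.t2_detected                     # second condition
    if not (first and second):
        return False
    slice_status[(disk_id, slice_id)] = "B"            # status "B" (Bad)
    notify({"disk_id": disk_id, "slice_id": slice_id}) # slice abnormality notification
    return True


# Example: T1 has elapsed, no return yet, T2 not yet elapsed.
sent = []
status = {(1, 1): "P"}
reported = handle_access_error(DiskState(t1_detected=True), status, 1, 1, sent.append)
```

When T2 has already elapsed, the function does nothing, since recovery of the entire storage device is then handled separately.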

In the control node 800 that has received the slice abnormality notification, the slice allocation management unit 850 performs slice reallocation for the segment to which the accessed slice is allocated (the recovery target segment). Specifically, the slice allocation management unit 850 refers to the virtual disk metadata storage unit 810 and determines whether the slice indicated by the slice abnormality notification (the abnormal slice) is the primary slice or the secondary slice of the recovery target segment.

If the abnormal slice is the primary slice, the slice allocation management unit 850 changes the state of the current primary slice of the recovery target segment to free and changes the secondary slice to the primary slice. The slice allocation management unit 850 then allocates, as the secondary slice of the recovery target segment, a free slice managed by a disk node different from the disk node that manages the new primary slice.

If the abnormal slice is the secondary slice, the slice allocation management unit 850 changes the state of the current secondary slice of the recovery target segment to free. The slice allocation management unit 850 then allocates, as the secondary slice of the recovery target segment, a free slice managed by a disk node different from the disk node that manages the current primary slice.
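
The two reallocation cases can be sketched together. The sketch tracks only which slice plays which role in the recovery target segment; how the released slice's status is recorded is left out. Representing a slice as a (disk node ID, disk ID, slice ID) tuple, the function name, and the `blocked_disks` parameter (storage devices whose allocation flag is "impossible") are assumptions made for illustration.

```python
def reassign_slices(primary, secondary, abnormal, free_slices, blocked_disks=frozenset()):
    """Return the new (primary, secondary) pair for a recovery target segment.

    Slices are (disk_node_id, disk_id, slice_id) tuples (assumed representation).
    """
    if abnormal == primary:
        new_primary = secondary        # promote the secondary slice
    elif abnormal == secondary:
        new_primary = primary          # the primary slice stays as it is
    else:
        raise ValueError("abnormal slice does not belong to this segment")

    for candidate in free_slices:
        node_id, disk_id, _ = candidate
        if node_id == new_primary[0]:
            continue                   # must be on a different disk node
        if (node_id, disk_id) in blocked_disks:
            continue                   # device is flagged "impossible"
        return new_primary, candidate
    raise RuntimeError("no free slice available on another disk node")


# Example: the primary slice on SN-A is abnormal.
new_primary, new_secondary = reassign_slices(
    primary=("SN-A", 1, 1),
    secondary=("SN-C", 1, 1),
    abnormal=("SN-A", 1, 1),
    free_slices=[("SN-A", 1, 2), ("SN-C", 1, 2), ("SN-B", 1, 2)],
    blocked_disks={("SN-A", 1)},
)
```

In this example the slice on "SN-C" becomes the primary slice and the free slice on "SN-B" becomes the new secondary slice.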

In the example of FIG. 21, the slice abnormality notification is issued as a result of a read request from the access node 600. Because read requests are made only to the primary slice, the abnormal slice detected in this case is a primary slice. A secondary slice is detected as an abnormal slice when a write request is issued. For example, if a write request is issued from the access node 600 to the disk node 400 while the storage device 410 of the disk node 400 is operating normally (it responds to the inspection command within T1), the data is written to the slice in the storage device 410. At this time, to maintain duplication, the slice management unit 460 also writes the same data to the secondary slice paired with the accessed slice. If writing the data to the secondary slice results in an error, the disk node that manages the secondary slice issues a slice abnormality notification to the control node 800. The slice allocation management unit 850 then recognizes that the abnormal slice is the secondary slice and reallocates a secondary slice to the recovery target segment.

When the reallocation of slices for the recovery target segment has been decided, the slice allocation management unit 850 updates the contents of the virtual disk metadata storage unit 810.
FIG. 23 is a diagram showing the contents of the virtual disk metadata storage unit after the update. As shown in FIG. 23, the status of the metadata of the slice identified by disk node ID "SN-A", disk ID "1", and slice ID "1" has been changed to "B". The status "B" indicates that the corresponding slice is abnormal. As a result, the allocation of the slice in the storage device 410 to the inquiry target segment (segment ID "1") is released.

The status of the metadata of the slice identified by disk node ID "SN-C", disk ID "1", and slice ID "1" has been changed to "P". As a result, this slice, which was already allocated to the recovery target segment, is changed from the secondary slice to the primary slice.

Furthermore, in the metadata of the slice identified by disk node ID "SN-B", disk ID "1", and slice ID "2", the status has been changed to "S", the virtual disk ID has been set to "VLOX-X", and the segment ID has been set to "1". As a result, a secondary slice is allocated to the recovery target segment.

In addition, the time stamp of each metadata record whose status or other contents were changed has been updated to "t(n+1)", which is the update time of the metadata.
Now that the metadata update is complete, if a metadata inquiry request for the recovery target segment arrives, the metadata search unit 820 can provide the access node 600 with metadata indicating the post-recovery state of the recovery target segment.

The description now returns to FIG. 21.
[Step S86] The metadata inquiry unit 610 transmits to the control node 800 a metadata inquiry request that designates a segment. The segment designated in the inquiry request (the inquiry target segment) is the segment for which access failed with an error.

[Step S87] The metadata search unit 820 of the control node 800 refers to the virtual disk metadata storage unit 810 and notifies the access node 600 of the metadata of the primary slice of the inquiry target segment.

Meanwhile, the slice allocation management unit 850 continues the recovery processing of the recovery target segment in a process separate from the metadata search for the metadata inquiry. That is, following the slice allocation, the processing of steps S88 to S91 is performed. When the access node 600 then issues a read retry to the disk node 400b in step S92, the requested data is returned in step S93. Further, when the elapse of T2 is notified in step S94, recovery processing is performed in step S95.

In this way, before the elapse of T2 is detected, recovery processing can be performed only for the segment to which the accessed slice is allocated. Moreover, the segment recovery processing is triggered by the slice abnormality notification from the disk node 400. It therefore does not interfere with the metadata search processing performed in response to metadata inquiry requests, which prevents a drop in the processing efficiency of the system as a whole.

Next, the slice switching process is described for the case where the storage device 410 is operating normally but its response to the inspection command is delayed because its load is temporarily excessive.
FIG. 24 is a sequence diagram illustrating the procedure of the slice switching process when the load on the storage device becomes excessive. In this example, it is assumed that the load on the storage device 410 connected to the disk node 400 has temporarily become excessive. The processing in steps S101 to S113 in FIG. 24 is the same as the processing in steps S81 to S93 in FIG. 21, respectively. The processing from step S114 onward is therefore described in order of step number.

[Step S114] The T1/return detection unit 430 of the disk node 400 receives the response from the storage device 410 to the inspection command. From this, the T1/return detection unit 430 detects that the storage device 410 has returned to an accessible state. The T1/return detection unit 430 then transmits a return notification for the storage device 410 to the control node 800.

On receiving the return notification, the storage state management unit 840 of the control node 800 changes the allocation availability flag of the returned storage device in the allocation availability storage unit 830 to "possible".

[Step S115] After updating the allocation availability storage unit 830, the storage state management unit 840 transmits an acknowledgment to the disk node 400.
In the first embodiment, when the storage device returns, the slice that had been allocated to the segment for which recovery processing was performed (the "inquiry target segment" in the first embodiment) is changed to the free state. In the third embodiment, by contrast, the status of a slice accessed after T1 has elapsed is changed to "B" (Bad) at the time of the access. Therefore, no metadata change is needed when the storage device returns.

Failure diagnosis of the storage devices by the inspection command is performed in all of the disk nodes 400, 400a, and 400b. The example of FIG. 21 is a case where the elapse of T1 is detected in only one storage device. However, the elapse of T1 may also be detected in two storage devices at the same time. If there is a segment whose slice pair consists of slices in the two storage devices for which the elapse of T1 has been detected, setting both slices to the abnormal state (status "B") would cause the data of that segment to be lost. In such a case, therefore, the disk node in which the elapse of T1 was detected later does not issue a slice abnormality notification.

FIG. 25 is a sequence diagram showing the slice allocation process when the elapse of T1 is detected for a plurality of disks. In this example, it is assumed that the storage device 410 connected to the disk node 400 has failed and that the load on the storage device connected to the disk node 400b has temporarily become excessive. The processing in steps S121 to S131 in FIG. 25 is the same as the processing in steps S81 to S91 in FIG. 21, respectively. The processing from step S132 onward is therefore described in order of step number.

[Step S132] The disk node 400b periodically performs disk diagnosis of its storage device. If the storage device is operating normally, it returns a response within T1. In the example of FIG. 25, the storage device of the disk node 400b operated normally until the processing of step S131 was executed, but became overloaded before the copy processing triggered by the metadata change was completed. Because the storage device is overloaded, no normal response has been returned to the disk node 400b even though T1 has elapsed since the inspection command was issued. The disk node 400b therefore detects that T1 has elapsed and transmits a T1 progress notification to the control node 800.

On receiving the T1 progress notification, the storage state management unit 840 of the control node 800 switches the allocation availability of the storage device connected to the disk node 400b to "impossible".

[Step S133] When the switching of the state is completed, the storage state management unit 840 transmits a switching completion response to the disk node 400b.
[Step S134] After that, when the metadata inquiry is completed, the slice access request unit 630 of the access node 600 refers to the access metadata storage unit 620, determines which disk node manages the slice to be accessed, and transmits a read request (read retry) to that disk node.

At this point, it is assumed that the disk node 400b has detected neither the elapse of T2 nor the return of its storage device. In this case, the disk node 400b checks the state of the slice targeted by the read request on the basis of its metadata. The metadata referred to here is the same as the metadata whose disk node ID is "SN-C" in the virtual disk metadata table 811 shown in FIG. 23. The slice in question (in this example, the slice with disk node ID "SN-C", disk ID "1", and slice ID "1" shown in FIG. 23) has been changed from a secondary slice to a primary slice by the metadata update request in step S129, and its data is in the middle of being copied to the paired secondary slice in the disk node 400a.

Because the slice to be accessed is being copied, the disk node 400b does not issue a slice abnormality notification and instead waits until the data can be read.
[Step S135] After that, the T2 detection unit 440 of the disk node 400 detects that T2 has elapsed since the inspection command was issued. The T2 detection unit 440 then transmits a T2 progress notification to the control node 800.

[Step S136] On receiving the T2 progress notification, the slice allocation management unit 850 of the control node 800 recognizes that the storage device 410 has become unusable because of a failure and starts recovery processing for the entire storage device 410.

[Step S137] Meanwhile, the disk node 400b receives the response from its storage device to the inspection command. The disk node 400b then transmits a return notification for the storage device to the control node 800.

[Step S138] After updating the allocation availability storage unit 830, the storage state management unit 840 transmits an acknowledgment to the disk node 400b.
As described above, while data in a slice of a storage device is being copied (that is, while the duplexed state has not yet been restored), the disk node continues the read processing even if it has detected the elapse of T1 for that storage device and the slice is accessed. In other words, the disk node neither changes the status of the accessed slice to "B" (Bad) nor transmits a slice abnormality notification to the control node 800. This prevents the primary slice from being treated as abnormal before the copy processing for restoring duplication is completed. As a result, data loss is prevented.
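
The rule summarized above can be written as a single predicate. This is a minimal sketch under stated assumptions: the function name and its boolean parameters are hypothetical, and the duplex status strings follow the "normal" and "copying" values of the duplex status column.

```python
def should_report_slice_abnormal(t1_pending, t2_detected, duplex_status):
    """Decide whether a failed access should lead to a slice abnormality notification.

    t1_pending:    T1 has elapsed and no return has been detected yet
    t2_detected:   T2 has elapsed (device-wide recovery takes over)
    duplex_status: "normal" or "copying" for the accessed slice
    """
    if duplex_status == "copying":
        # The pair is not yet redundant, so keep waiting instead of marking it Bad.
        return False
    return t1_pending and not t2_detected


# A slice still being copied is never reported, even after T1 has elapsed.
copying_case = should_report_slice_abnormal(True, False, "copying")
normal_case = should_report_slice_abnormal(True, False, "normal")
```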

Next, the detailed procedure of the disk diagnosis process in a disk node is described. The following processing is described as being performed by the disk node 400, but the other disk nodes 400a and 400b also execute the same processing periodically.

FIG. 26 is a flowchart showing the procedure of the disk diagnosis process. The process shown in FIG. 26 is described below in order of step number. This process is executed periodically at a preset interval.

[Step S151] The T1/return detection unit 430 outputs an inspection command to the storage device 410. At this time, the T1/return detection unit 430 stores the output time of the inspection command in its internal memory.

[Step S152] The T1/return detection unit 430 determines whether T1 has elapsed since the inspection command was issued. Specifically, the T1/return detection unit 430 subtracts the output time of the inspection command from the current time and, if the result is equal to or greater than T1, determines that T1 has elapsed. If T1 has elapsed, the process proceeds to step S155. If T1 has not elapsed, the process proceeds to step S153.

[Step S153] The T1/return detection unit 430 performs a diagnosis completion check. Specifically, the T1/return detection unit 430 checks whether a normal response has been returned from the storage device 410.

[Step S154] If a normal response has been returned, the T1/return detection unit 430 ends the process. If a normal response has not been returned, the process returns to step S152.
[Step S155] When T1 has elapsed since the inspection command was sent, the T1/return detection unit 430 transmits a T1 progress notification to the control node 800.

[Step S156] The T2/return detection process is executed through the cooperation of a plurality of elements in the disk node 400. The details of this process are described later. When this process ends, the disk diagnosis process ends.
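
Steps S151 to S156 form a polling loop around one inspection command. The following sketch shows that control flow only. The callback parameters, the return values, the injectable clock, and the short sleep between polls are assumptions made for illustration; the flowchart itself does not specify a polling interval.

```python
import time


def disk_diagnosis(send_inspection, response_arrived, notify_t1_elapsed,
                   t2_return_detection, t1, now=time.monotonic, poll=0.01):
    """Run one disk diagnosis cycle (sketch of steps S151 to S156)."""
    send_inspection()                    # S151: output the inspection command
    issued_at = now()                    # S151: remember the output time
    while True:
        if now() - issued_at >= t1:      # S152: has T1 elapsed?
            notify_t1_elapsed()          # S155: T1 progress notification
            t2_return_detection(issued_at)  # S156: T2/return detection process
            return "t1_elapsed"
        if response_arrived():           # S153, S154: normal response received?
            return "normal"
        time.sleep(poll)                 # assumed pause before checking again


# Example with a fake clock: the device never answers and T1 is 1.0.
events = []
ticks = iter([0.0, 0.5, 1.0])
result = disk_diagnosis(
    send_inspection=lambda: events.append("inspection"),
    response_arrived=lambda: False,
    notify_t1_elapsed=lambda: events.append("t1"),
    t2_return_detection=lambda issued_at: events.append(("t2_return", issued_at)),
    t1=1.0,
    now=lambda: next(ticks),
    poll=0,
)
```

Passing the output time to the T2/return detection step reflects that the T2 check is also measured from the issue of the inspection command.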

FIG. 27 is a flowchart illustrating the procedure of the T2/return detection process. The process shown in FIG. 27 is described below in order of step number.
[Step S161] The T2 detection unit 440 determines whether T2 has elapsed since the inspection command was issued. Specifically, the T2 detection unit 440 acquires the output time of the inspection command from the T1/return detection unit 430. The T2 detection unit 440 then subtracts the output time of the inspection command from the current time and, if the result is equal to or greater than T2, determines that T2 has elapsed. If T2 has elapsed, the process proceeds to step S162. If T2 has not elapsed, the process proceeds to step S163.

[Step S162] The T2 detection unit 440 transmits a T2 elapsed notification to the control node 800. The T2/return detection process then ends.

[Step S163] The slice access processing unit 420 determines whether an access request for the storage device 410 has been input. If an access request has been input, the process proceeds to step S165. If no access request has been input, the process proceeds to step S164.

[Step S164] The T1/return detection unit 430 performs a diagnosis completion check. Specifically, the T1/return detection unit 430 checks whether a normal response has been returned from the storage device 410. The process then proceeds to step S170.

[Step S165] The slice access processing unit 420 determines whether the slice to be accessed is being copied. Specifically, the slice access processing unit 420 refers to the metadata storage unit 450 and checks the duplication state of the slice to be accessed. If “copying” is set as the duplication state, the slice is being copied. If the slice is being copied, the process proceeds to step S166. Otherwise, the process proceeds to step S167.

[Step S166] The slice access processing unit 420 performs access processing on the storage device 410 in response to the access request. This access processing does not succeed until the storage device 410 returns. The slice access processing unit 420 therefore waits for the storage device 410 to return and then executes the data read or data write specified by the access request. The process then proceeds to step S170.

[Step S167] If the slice to be accessed is not being copied, the slice access processing unit 420 notifies the accessed slice detection unit 470 that an access request has been made for a slice in the storage device 410. The accessed slice detection unit 470 confirms that, in the disk diagnosis process for the storage device 410, T1 has elapsed since the inspection command was issued and that neither the T2 elapse nor a return has yet occurred, and then transmits a metadata change request to the slice management unit 460. In response to the metadata change request, the slice management unit 460 updates the metadata of the slice to be accessed in the metadata storage unit 450, setting its state to “B (Bad)”. When the storage device 410 returns, the slice management unit 460 updates the metadata in the storage device 410 in the same way.

[Step S168] The accessed slice detection unit 470 sends a slice abnormality notification to the control node 800.

[Step S169] The slice access processing unit 420 transmits an access error to the access node 600. The process then proceeds to step S170.

[Step S170] If a normal response has been returned, the T1/return detection unit 430 advances the process to step S171. If no normal response has been returned, the process returns to step S161.

[Step S171] The T1/return detection unit 430 transmits a return notification to the control node 800. The T2/return detection process then ends.

In this way, even when the T1 elapse is detected in a storage device that has a slice assigned to a segment being recovered, and an access request is made for that slice, no slice abnormality notification is issued. As a result, data loss can be prevented even when the T1 elapse is detected simultaneously in a plurality of storage devices.
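
The loop of steps S161 to S171 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `Tick`, `Outcome`, and `t2_return_detection`, the notification strings, and the idea of feeding the loop a scripted list of passes are all hypothetical and do not come from the embodiment. An access to a slice that is being copied is simply left waiting here, which stands in for step S166.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Tick:
    """One pass through the loop of FIG. 27 (hypothetical simulation input)."""
    now: float                       # current time at this pass
    slice_id: Optional[int] = None   # slice targeted by an access request, if any
    responded: bool = False          # True once the inspection command is answered


@dataclass
class Outcome:
    result: str = "PENDING"                                  # final state of the loop
    bad_slices: List[int] = field(default_factory=list)      # slices set to "B (Bad)"
    notifications: List[str] = field(default_factory=list)   # sent to the control node
    access_errors: List[int] = field(default_factory=list)   # returned to the access node


def t2_return_detection(ticks: List[Tick], issued_at: float, t2: float,
                        copying: Set[int]) -> Outcome:
    out = Outcome()
    for tick in ticks:
        # S161/S162: stop once T2 has elapsed since the inspection command.
        if tick.now - issued_at >= t2:
            out.notifications.append("T2_ELAPSED")
            out.result = "T2_ELAPSED"
            return out
        # S163/S165: an access to a slice that is NOT being copied is reported.
        if tick.slice_id is not None and tick.slice_id not in copying:
            out.bad_slices.append(tick.slice_id)                         # S167
            out.notifications.append(f"SLICE_ABNORMAL:{tick.slice_id}")  # S168
            out.access_errors.append(tick.slice_id)                      # S169
        # S164/S170/S171: a normal response ends the process with a return notice.
        if tick.responded:
            out.notifications.append("RETURNED")
            out.result = "RETURNED"
            return out
    return out
```

In this sketch, an access to slice 7 (not being copied) produces a slice abnormality notification, while an access to slice 3 (being copied) produces none, which mirrors the data-loss safeguard described above.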

As described above, in the third embodiment, slices are reassigned on the basis of abnormality notifications from the disk nodes, so the metadata search processing performed in response to a metadata inquiry does not need to be delayed. The access node can therefore receive a response to a metadata inquiry promptly, even if the metadata change accompanying slice reassignment takes time. As a result, even when a plurality of storage devices malfunction at the same time, delays in data access from the access node are kept to a minimum.

In the third embodiment, as in the first and second embodiments, the failure detection time “T2” can be set long while preventing access from being stopped for a long time. This has the advantage that the failure detection time “T2” does not need to be tuned for each storage device.

Note that the storage state storage unit 530 of the first embodiment may be used in place of the allocation permission/inhibition storage unit 830 of the third embodiment. In that case, a storage device whose state in the storage state storage unit 530 is “normal” is judged to be “allocatable”, and a storage device whose state is “T1” is judged to be “not allocatable”.
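
The mapping from storage state to allocatability can be written as a small lookup. This is a minimal sketch: the names `ALLOCATABLE_BY_STATE`, `is_allocatable`, and `allocatable_devices` are hypothetical, and treating any state other than “normal” or “T1” as not allocatable is an assumption made here, since the text specifies only those two cases.

```python
# State -> allocatable, covering only the two cases given in the text.
ALLOCATABLE_BY_STATE = {"normal": True, "T1": False}


def is_allocatable(state: str) -> bool:
    # Assumption: any state not listed above is conservatively not allocatable.
    return ALLOCATABLE_BY_STATE.get(state, False)


def allocatable_devices(states: dict) -> list:
    """Return the IDs of devices whose recorded state permits allocation."""
    return [device for device, state in states.items() if is_allocatable(state)]
```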

In the first embodiment as well, slices in a storage device for which the T1 elapse has been detected can be made unavailable for allocation in the recovery processing, as in the third embodiment.

The above processing functions can be realized by a computer. In that case, a program is provided that describes the processing contents of the functions that the control node and the disk nodes should have. Executing the program on a computer realizes the above processing functions on that computer. The program describing the processing contents can be recorded on a computer-readable recording medium. Computer-readable recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories. Magnetic recording devices include hard disk drives (HDD), flexible disks (FD), and magnetic tapes. Optical discs include DVD (Digital Versatile Disc), DVD-RAM, CD-ROM (Compact Disc Read Only Memory), and CD-R (Recordable)/RW (ReWritable). Magneto-optical recording media include MO (Magneto-Optical disc).

To distribute the program, portable recording media such as DVDs and CD-ROMs on which the program is recorded are sold, for example. The program can also be stored in a storage device of a server computer and transferred from the server computer to other computers via a network.

A computer that executes the program stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to it. In addition, each time a program is transferred from the server computer, the computer can execute processing according to the received program.

The present invention is not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present invention.

The main technical features of the embodiments described above are set out in the following supplementary notes.

(Supplementary Note 1) A data management program for causing a computer to execute management processing of data stored in duplicate in a plurality of storage devices whose storage areas are managed by being divided into a plurality of slices, the program causing the computer to function as:
operation malfunction information management means for, upon receiving operation malfunction information indicating that one of the plurality of storage devices may be failing, storing the operation malfunction information in operation malfunction information storage means; and
recovery instruction means for, upon receiving access-related information indicating that an access request has been issued with a slice of the plurality of storage devices as an access target slice, referring to the operation malfunction information in the operation malfunction information storage means to determine whether the storage device to which the access target slice belongs may be failing and, if so, instructing slice management means, which has a data input/output function for the storage device storing redundant data having the same content as the data in the access target slice, to perform recovery processing of the data stored in the access target slice.
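
The two means of Supplementary Note 1 can be sketched as a small class. This is an illustrative sketch only: the class `ControlNode`, its method names, and the dictionary-based lookups `slice_to_device` and `mirror_of` are hypothetical, and the recovery instruction is represented simply as a tuple naming the device and slice that hold the redundant data.

```python
class ControlNode:
    def __init__(self, slice_to_device: dict, mirror_of: dict):
        self.slice_to_device = slice_to_device  # slice ID -> storage device ID
        self.mirror_of = mirror_of              # slice ID -> slice holding the redundant data
        self.suspect_devices = set()            # plays the role of the malfunction information store
        self.recovery_instructions = []         # instructions issued so far

    def on_malfunction_info(self, device: str) -> None:
        """Operation malfunction information management: remember the suspect device."""
        self.suspect_devices.add(device)

    def on_access_info(self, slice_id: str):
        """Recovery instruction: act only if the slice's device is suspect."""
        device = self.slice_to_device[slice_id]
        if device not in self.suspect_devices:
            return None
        mirror = self.mirror_of[slice_id]
        # Address the instruction to the side that can read the redundant data.
        instruction = (self.slice_to_device[mirror], mirror)
        self.recovery_instructions.append(instruction)
        return instruction
```

With slices `1a` and `2a` mirroring each other on devices `dev1` and `dev2`, access-related information for `1a` yields no instruction until malfunction information for `dev1` has been stored, after which it yields an instruction directed at `dev2`.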

(Supplementary Note 2) The data management program according to Supplementary Note 1, wherein the operation malfunction information is transmitted via a network from the disk node to which the possibly failing storage device is connected.

(Supplementary Note 3) The data management program according to Supplementary Note 1, wherein the access-related information is transmitted from an access node, which accesses the plurality of storage devices via a network, when access by the access node to a slice in the storage device has failed.

(Supplementary Note 4) The data management program according to Supplementary Note 3, wherein the recovery instruction means, upon receiving the access-related information from the access node, notifies the access node of the storage location of the redundant data.

(Supplementary Note 5) The data management program according to Supplementary Note 1, wherein the slice management means is a function of a disk node that is connected to the computer via a network and to which the storage device storing the redundant data is connected.

(Supplementary Note 6) The data management program according to Supplementary Note 1, wherein
a virtual disk composed of a plurality of segments is defined, and two slices belonging to different storage devices are assigned to each of the segments,
the program further causes the computer to function as virtual disk metadata storage means for storing metadata that is provided for each slice of each of the plurality of storage devices and that indicates the assignment relationship between the slice and the segment,
the access-related information is an assignment relationship inquiry request that is transmitted from an access node, which accesses the plurality of storage devices via a network, when access by the access node to a slice in the storage device has failed, and that designates, as an inquiry target segment, the segment to which the slice whose access failed is assigned, and
the recovery instruction means includes:
metadata search means for retrieving, from the virtual disk metadata storage means, the metadata of the two assigned slices assigned to the inquiry target segment designated by the access-related information, and for referring to the operation malfunction information storage means to determine, based on the retrieved metadata, whether the storage device to which each assigned slice belongs may be failing; and
slice assignment means for, when the metadata search means determines that a failure is possible, releasing the assignment to the inquiry target segment of the assigned slice belonging to the possibly failing storage device and newly assigning another slice to the inquiry target segment.

(Supplementary Note 7) The data management program according to Supplementary Note 6, wherein
the slice assignment means, when the assignment of a new slice to the inquiry target segment is complete, updates the metadata in the virtual disk metadata storage means according to the assignment result, and
the metadata search means, when it determines that the storage device to which the assigned slice belongs may be failing, waits for a new slice to be assigned to the inquiry target segment in place of the assigned slice and, after the new slice has been assigned, acquires the metadata of the slices assigned to the inquiry target segment from the virtual disk metadata storage means and transmits it to the access node.

(Supplementary Note 8) The data management program according to Supplementary Note 6, wherein the slice assignment means refers to the operation malfunction information storage means and selects the slice to be newly assigned to the inquiry target segment from among the slices of storage devices not indicated by the operation malfunction information.
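
The inquiry handling of Supplementary Notes 6 to 8 can be sketched as a single function. This is a sketch under stated assumptions: the function `query_segment`, the representation of a slice as a `(device, slice)` tuple, and the `free_slices` list are hypothetical. The copy of redundant data into the newly assigned slice is omitted, and the sketch additionally keeps the two slices of a segment on different devices, following the assignment rule of Supplementary Note 6.

```python
def query_segment(segment, assignments, free_slices, suspect_devices):
    """Return the (device, slice) pairs for a segment, replacing suspect slices.

    assignments: segment -> list of two (device, slice) pairs (updated in place)
    free_slices: unassigned (device, slice) pairs (consumed in place)
    suspect_devices: devices named in the stored operation malfunction information
    """
    current = assignments[segment]
    # Devices already backing this segment that are not suspect.
    used_devices = {dev for dev, _ in current if dev not in suspect_devices}
    updated = []
    for dev, slice_no in current:
        if dev not in suspect_devices:
            updated.append((dev, slice_no))
            continue
        # Pick a free slice from a device that is neither suspect nor already used.
        replacement = next(
            (c for c in free_slices
             if c[0] not in suspect_devices and c[0] not in used_devices),
            None,
        )
        if replacement is None:
            raise RuntimeError("no allocatable slice available")
        free_slices.remove(replacement)
        used_devices.add(replacement[0])
        updated.append(replacement)
    assignments[segment] = updated  # metadata update after reassignment
    return updated
```

Returning the updated pairs only after the reassignment corresponds to the behaviour of Supplementary Note 7, where the access node receives metadata that already reflects the new slice.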

(Supplementary Note 9) The data management program according to Supplementary Note 1, wherein the recovery instruction means, upon receiving failure detection information indicating that any one of the plurality of storage devices has failed, instructs slice management means, which has a data input/output function for the storage device storing redundant data having the same content as each piece of data in the failed storage device, to perform recovery processing of all data in the failed storage device.

(Supplementary Note 10) The data management program according to Supplementary Note 1, wherein the operation malfunction information management means, upon receiving return information indicating that normal operation has been confirmed for a storage device that had been regarded as possibly failing, deletes the operation malfunction information of that storage device from the operation malfunction information storage means.

(Supplementary Note 11) A storage device diagnosis program for causing a computer, to which a storage device is connected and which is connected via a network to a control node that manages data stored in the storage device, to execute operation diagnosis processing of the storage device, the program causing the computer to function as:
response time measurement means for issuing an inspection command to the storage device and measuring the elapsed time from issuance of the inspection command until a response is received;
operation malfunction detection means for, when there is no response even after the elapsed time has reached a preset operation malfunction detection time, transmitting to the control node operation malfunction information indicating that the storage device may be failing; and
return detection means for, when a response to the inspection command is returned from the storage device after the operation malfunction information has been transmitted, transmitting to the control node return information indicating that the storage device has returned.

(Supplementary Note 12) The storage device diagnosis program according to Supplementary Note 11, further causing the computer to function as:
failure detection means for, when there is no response even after the elapsed time has reached a preset failure detection time that is longer than the operation malfunction detection time, transmitting failure detection information concerning the storage device to the control node.
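
The relationship between the two thresholds in Supplementary Notes 11 and 12 can be sketched as a function that lists the notifications a disk node would send for one inspection command. The function name `diagnose`, the notification labels, and the use of `None` for a device that never answers are hypothetical. The notes do not say how a response arriving exactly at a threshold is treated, so handling `response_time == t1` as a malfunction is an assumption made here.

```python
from typing import List, Optional


def diagnose(response_time: Optional[float], t1: float, t2: float) -> List[str]:
    """List the notifications sent to the control node for one inspection command.

    response_time: time until the device answered, or None if it never answered
    t1: operation malfunction detection time
    t2: failure detection time (must be longer than t1)
    """
    if not t1 < t2:
        raise ValueError("failure detection time must be longer than malfunction detection time")
    # Answered before T1: the device is healthy and nothing is reported.
    if response_time is not None and response_time < t1:
        return []
    # No answer by T1: report a possible failure.
    notifications = ["malfunction"]
    if response_time is not None and response_time < t2:
        notifications.append("return")   # answered between T1 and T2
    else:
        notifications.append("failure")  # still silent when T2 is reached
    return notifications
```

For example, with `t1=1.0` and `t2=30.0`, a response after 5 seconds leads to a malfunction report followed by a return report, whereas a device that never answers leads to a malfunction report followed by a failure report.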

(Supplementary Note 13) The storage device diagnosis program according to Supplementary Note 12, further causing the computer to function as:
slice abnormality notification means for, when a slice in the storage device is accessed after the operation malfunction information has been transmitted and before the return information or the failure detection information is transmitted, transmitting to the control node a slice abnormality notification designating the access target slice.

(Supplementary Note 14) A storage device diagnosis program for causing a computer, to which a storage device is connected and which is connected via a network to a control node that manages data stored in the storage device, to execute operation diagnosis processing of the storage device, the program causing the computer to function as:
response time measurement means for issuing an inspection command to the storage device and measuring the elapsed time from issuance of the inspection command until a response is received; and
slice abnormality notification means for, when a slice in the storage device is accessed after the elapsed time has exceeded a preset operation malfunction detection time and before a response to the inspection command is obtained, transmitting to the control node a slice abnormality notification designating the access target slice.

(Supplementary Note 15) A multi-node storage system that manages data in duplicate, comprising:
a plurality of disk nodes, each including:
response time measurement means for issuing an inspection command to a storage device whose storage area is managed by being divided into a plurality of slices, and measuring the elapsed time from issuance of the inspection command until a response is received;
operation malfunction detection means for, when there is no response even after the elapsed time has reached a preset operation malfunction detection time, transmitting to a control node operation malfunction information indicating that the storage device may be failing; and
return detection means for, when a response to the inspection command is returned from the storage device after the operation malfunction information has been transmitted, transmitting to the control node return information indicating that the storage device has returned; and
the control node, including:
operation malfunction information management means for, upon receiving the operation malfunction information from one of the disk nodes, storing the operation malfunction information in operation malfunction information storage means; and
recovery instruction means for, upon receiving access-related information indicating that an access request has been issued with a slice of the plurality of storage devices as an access target slice, referring to the operation malfunction information in the operation malfunction information storage means to determine whether the storage device to which the access target slice belongs may be failing and, if so, instructing the disk node to which the storage device storing redundant data having the same content as the data in the access target slice is connected to perform recovery processing of the data stored in the access target slice.

1-3 Storage devices
1a, 2a, 3a Slices
4-6 Disk nodes
4a Response time measurement means
4b Operation malfunction detection means
4c Failure detection means
4d Return detection means
5a, 6a Slice management means
7 Control node
7a Operation malfunction information management means
7b Operation malfunction information storage means
7c Recovery instruction means
8 Access node

Claims

A data management program for causing a computer to execute management processing of data stored in duplicate in a plurality of storage devices whose storage areas are managed by being divided into a plurality of slices,
the program causing the computer to execute a process comprising:
receiving operation malfunction information, which is issued when one of the plurality of storage devices gives no response even after a preset operation malfunction detection time has been reached and which indicates that the storage device may be failing, and storing the operation malfunction information in storage means;
upon receiving access-related information indicating that an access request has been issued with a slice in the plurality of storage devices as an access target slice, referring to the operation malfunction information in the storage means to determine whether the storage device to which the access target slice belongs may be failing and, if so, transmitting an instruction to copy redundant data having the same content as the data in the access target slice, with a storage device different from both the storage device storing the redundant data and the storage device to which the access target slice belongs as the copy destination, to a device having a data input/output function for the storage device storing the redundant data; and
upon receiving failure detection information, which is issued when there is no response even after the elapsed time has reached a preset failure detection time longer than the operation malfunction detection time and which indicates that one of the plurality of storage devices has failed, transmitting, for each piece of redundant data having the same content as data in the failed storage device, an instruction to copy the redundant data, with a storage device different from both the storage device storing the redundant data and the failed storage device as the copy destination, to a device having a data input/output function for the storage device storing the redundant data.

The data management program according to claim 1, further causing the computer to execute a process of,
when the storage device to which the access target slice belongs may be failing, notifying the transmission source of the access-related information that the access destination for the data in the access target slice is changed to the storage device storing the redundant data having the same content as that data.

A multi-node storage system that manages data in duplicate, comprising:
a plurality of disk nodes, each connected to one of a plurality of storage devices whose storage areas are managed by being divided into a plurality of slices; and
a control node,
wherein each of the plurality of disk nodes comprises:
response time measurement means for issuing an inspection command to the connected storage device and measuring the elapsed time from issuance of the inspection command until a response is received;
operation malfunction detection means for, when there is no response even after the elapsed time has reached a preset operation malfunction detection time, transmitting to the control node operation malfunction information indicating that the storage device may be failing;
failure detection means for, when there is no response even after the elapsed time has reached a preset failure detection time longer than the operation malfunction detection time, transmitting failure detection information indicating that the storage device has failed; and
slice management means for copying data in the storage device to another storage device in accordance with a data copy instruction, and
the control node comprises:
operation malfunction information management means for, upon receiving the operation malfunction information from one of the plurality of disk nodes, storing the operation malfunction information in storage means;
first recovery instruction means for, upon receiving access-related information indicating that an access request has been issued with a slice of the plurality of storage devices as an access target slice, referring to the operation malfunction information in the storage means to determine whether the storage device to which the access target slice belongs may be failing and, if so, transmitting an instruction to copy redundant data having the same content as the data in the access target slice, with a storage device different from both the storage device storing the redundant data and the storage device to which the access target slice belongs as the copy destination, to the disk node to which the storage device storing the redundant data is connected; and
second recovery instruction means for, upon receiving the failure detection information from one of the plurality of disk nodes, transmitting, for each piece of redundant data having the same content as data in the failed storage device, an instruction to copy the redundant data, with a storage device different from both the storage device storing the redundant data and the failed storage device as the copy destination, to the disk node to which the storage device storing the redundant data is connected.

The multi-node storage system according to claim 3, wherein the control node further comprises
means for, when the storage device to which the access target slice belongs may be failing, notifying the transmission source of the access-related information that the access destination for the data in the access target slice is changed to the storage device storing the redundant data having the same content as that data.