JP2010134696A

JP2010134696A - RAID controller device, processing method, RAID controller circuit and program

Info

Publication number: JP2010134696A
Application number: JP2008309945A
Authority: JP
Inventors: Madoka Komatsubara; 円小松原
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-12-04
Filing date: 2008-12-04
Publication date: 2010-06-17

Abstract

PROBLEM TO BE SOLVED: To recover without replacing a hard disk when a hard disk failure is falsely detected even though the hard disk has not failed.

SOLUTION: A failure detection unit 11 detects failures occurring in hard disk drives 20-1 to 20-N. A restart unit 14 restarts the RAID controller device 10 when the failure detection unit 11 detects a plurality of failures of the hard disk drives 20-1 to 20-N within a predetermined time, and a reconfiguration unit 16 reconfigures, when the RAID controller device 10 is restarted, a logical drive based on logical drive configuration information stored by the hard disk drives 20-1 to 20-N.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a RAID controller device, a processing method, a RAID controller circuit, and a program that operate an array of redundant hard disks as a single virtual storage means.

Conventionally, a technique called RAID (Redundant Array of Inexpensive Disks) has been used to prevent data loss due to the failure of a storage medium such as a hard disk and to improve input/output performance (see, for example, Patent Document 1). RAID operates a plurality of redundant hard disks virtually as a single hard disk. By making the data redundant, it ensures that no data is lost as a whole even if one of the hard disks fails.

Normally, when RAID is applied to a computer, a device called a RAID controller configures and manages the data on the hard disks. When a hard disk fails, the RAID controller detects the failure, logically disconnects the failed hard disk, and operates the computer using the other hard disks. When the user replaces the disconnected hard disk with one that has not failed, the RAID controller regenerates the information of the original hard disk from the information on the other hard disks and writes it to the replacement hard disk, thereby reconstructing the data.
Patent Document 1: JP 2002-373059 A

However, with a conventional RAID controller, when a hard disk temporarily becomes inaccessible because of, for example, a temporary abnormality in the communication path between the RAID controller and the hard disk, the controller may wrongly determine that the hard disk has failed even though it has not. In this case, the user has to replace every hard disk for which a failure was detected and reconstruct the data, even though the hard disks have not failed, which requires lengthy recovery work.
The present invention has been made in view of the above points, and its object is to provide a RAID controller device, a processing method, a RAID controller circuit, and a program that recover without replacing a hard disk when a hard disk failure is erroneously detected even though the hard disk has not failed.

The present invention has been made to solve the above problem, and is a RAID controller device that combines a plurality of hard disks storing configuration information of a virtual storage means in advance and operates them as one logical virtual storage means, the device comprising: failure detection means for detecting a failure of each of the plurality of hard disks; restarting means for restarting the device itself when the failure detection means detects failures of a plurality of hard disks within a predetermined time; and rebuilding means for rebuilding the virtual storage means based on the configuration information at the time of restart.

The present invention is also a processing method using a RAID controller device that combines a plurality of hard disks storing configuration information of a virtual storage means in advance and operates them as one logical virtual storage means, wherein failure detection means detects a failure of each of the plurality of hard disks, restarting means restarts the device itself when the failure detection means detects failures of a plurality of hard disks within a predetermined time, and rebuilding means rebuilds the virtual storage means based on the configuration information at the time of restart.

The present invention is also a RAID controller circuit that combines a plurality of hard disks storing configuration information of a virtual storage means in advance and operates them as one logical virtual storage means, the circuit comprising: a failure detection circuit for detecting a failure of each of the plurality of hard disks; a restart circuit for restarting the device itself when the failure detection means detects failures of a plurality of hard disks within a predetermined time; and a rebuilding circuit for rebuilding the virtual storage means based on the configuration information at the time of restart.

The present invention is also a program for causing a RAID controller device, which combines a plurality of hard disks storing configuration information of a virtual storage means in advance and operates them as one logical virtual storage means, to operate as: failure detection means for detecting a failure of each of the plurality of hard disks; restarting means for restarting the device itself when the failure detection means detects failures of a plurality of hard disks within a predetermined time; and rebuilding means for rebuilding the virtual storage means based on the configuration information at the time of restart.

According to the present invention, the restarting means restarts the RAID controller device when a plurality of hard disk failures are detected within a predetermined time. Because multiple hard disks are unlikely to fail at nearly the same time, such detections are likely to have a cause other than hard disk failure, such as a temporary abnormality in the communication path between the failure detection means and the hard disks due to noise or the like. Restarting the RAID controller device therefore lets the rebuilding means rebuild the virtual storage means based on the configuration information, so that recovery of the virtual storage means can be attempted.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a schematic block diagram showing the configuration of a RAID controller device according to an embodiment of the present invention.
The RAID controller device 10 is connected to a plurality of hard disk devices 20-1 to 20-N via a connection path 30 (bus). The RAID controller device 10 constructs a logical drive (virtual storage means) by combining the hard disk devices 20-1 to 20-N, and operates the logical drive by having the computer 1 recognize it.
The RAID controller device 10 includes a failure detection unit 11, a failure detection pattern monitoring unit 12, a hard disk information storage unit 13, a restart unit 14, a restart count storage unit 15, and a reconstruction unit 16.
The failure detection unit 11 detects, via the connection path 30, failures that have occurred in the hard disk devices 20-1 to 20-N.
The failure detection pattern monitoring unit 12 determines whether a plurality of failures of the hard disk devices 20-1 to 20-N have been detected within a predetermined time.
The hard disk information storage unit 13 stores the identification number of each hard disk device 20-1 to 20-N for which a failure was detected in association with the time at which the failure was detected.
The restart unit 14 restarts the RAID controller device 10.
The restart count storage unit 15 stores a restart count indicating the number of times the restart unit 14 has restarted the RAID controller device 10.
The reconstruction unit 16 reconstructs the logical drive based on the logical drive configuration information stored in the hard disk devices 20-1 to 20-N.
Each of the hard disk devices 20-1 to 20-N includes a hard disk 21 and a control unit 22.
The hard disk 21 stores in advance the configuration information of the logical drive, such as the RAID specification and the data stored by the logical drive.
The control unit 22 detects failures of the hard disk 21.

In the RAID controller device 10 of the present embodiment, the failure detection unit 11 detects failures that have occurred in the hard disk devices 20-1 to 20-N. When the failure detection unit 11 detects a plurality of failures of the hard disk devices 20-1 to 20-N within a predetermined time, the restart unit 14 restarts the device itself, and at the time of restart the reconstruction unit 16 reconstructs the logical drive based on the logical drive configuration information stored in the hard disks 21 of the hard disk devices 20-1 to 20-N.
As a result, when a hard disk failure is erroneously detected even though the hard disk devices 20-1 to 20-N have not failed, the RAID controller device 10 restores the logical drive without the hard disks being replaced.

Next, the operation of the RAID controller device 10 will be described.
FIG. 2 is a flowchart showing the operation of the RAID controller device.
First, the failure detection unit 11 communicates with the hard disk devices 20-1 to 20-N, and determines whether there is a failure in the hard disk devices 20-1 to 20-N (step S1).
The determination of the presence or absence of a failure is performed as follows, for example.
The failure detection unit 11 transmits a failure detection signal to the control unit 22 of each of the hard disk devices 20-1 to 20-N via the connection path 30. On receiving the failure detection signal, the control unit 22 of each of the hard disk devices 20-1 to 20-N performs a failure determination on the corresponding hard disk 21.
If the control unit 22 determines that no failure has occurred in the hard disk 21, it transmits a response signal for the failure detection signal to the failure detection unit 11. If, on the other hand, it determines that a failure has occurred in the hard disk 21, it does not transmit a response signal to the failure detection unit 11.
When the failure detection unit 11 receives a response signal from the control unit 22 of a hard disk device 20-1 to 20-N, it determines that no failure has occurred. When it cannot receive a response signal from the control unit 22 of a hard disk device 20-1 to 20-N, it determines that a failure has occurred.
Note that when a temporary abnormality occurs in the connection path 30 due to noise or the like, the failure detection unit 11 likewise cannot receive the response signal, so it determines that a failure has occurred in each hard disk device 20-1 to 20-N from which no response signal was received.
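The probe-and-response check of step S1 can be sketched as follows. This is a minimal Python illustration, not the implementation described in the publication: the `detect_failures` function, the `probe` callable, and the boolean return convention are all assumptions introduced here. The point it shows is that a missing response is recorded as a failure whatever its cause, so a transient fault on the connection path cannot be told apart from a real disk fault at this step.

```python
from typing import Callable, Iterable, List


def detect_failures(disk_ids: Iterable[str],
                    probe: Callable[[str], bool]) -> List[str]:
    """Return the IDs of disks that did not answer the failure detection signal.

    `probe` is a hypothetical stand-in for sending the signal over the
    connection path; it returns True if a response signal came back.
    """
    failed = []
    for disk_id in disk_ids:
        # No response means "failed", whether the disk or the path is at fault.
        if not probe(disk_id):
            failed.append(disk_id)
    return failed


# Example: disk 20-2 does not respond (the controller cannot know why).
responding = {"20-1": True, "20-2": False, "20-3": True}
print(detect_failures(responding, lambda d: responding[d]))
```

Passing the probe in as a callable keeps the sketch independent of any particular bus or protocol, none of which the text specifies.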

If, in step S1, the failure detection unit 11 determines that no failure has occurred in any of the hard disk devices 20-1 to 20-N (step S1: NO), the RAID controller device 10 returns to step S1 and continues detecting failures of the hard disk devices 20-1 to 20-N. Failure detection is performed at predetermined intervals.
If, on the other hand, the failure detection unit 11 determines that a failure has occurred in at least one of the hard disk devices 20-1 to 20-N (step S1: YES), it logically disconnects each hard disk device 20-1 to 20-N determined to have failed from the RAID configuration (step S2).

When the failure detection unit 11 disconnects a hard disk device 20-1 to 20-N, the failure detection pattern monitoring unit 12 acquires from the failure detection unit 11 the identification number of the hard disk device for which the failure was detected and the time at which the failure was detected (step S3).
Having acquired the identification number and the time, the failure detection pattern monitoring unit 12 compares the acquired time with the failure detection times of the other hard disk devices 20-1 to 20-N stored in the hard disk information storage unit 13, and determines whether there is a hard disk for which a failure has already been detected within a predetermined time (step S4). Here, the predetermined time is a short time, for example 10 milliseconds or 1 second, within which failures rarely occur in a plurality of the hard disk devices 20-1 to 20-N. It is also assumed to be longer than the time difference between the erroneous failure detections of the individual hard disk devices 20-1 to 20-N that result from an abnormality in the connection path 30 between the failure detection unit 11 and the hard disk devices 20-1 to 20-N.
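The time-window comparison of step S4 can be sketched as below. This is an illustrative Python sketch under stated assumptions: the `failure_within_window` name, the dictionary used in place of the hard disk information storage unit 13, and times expressed as seconds in floating point are all hypothetical choices, not details from the publication.

```python
from typing import Dict


def failure_within_window(failure_log: Dict[str, float],
                          disk_id: str,
                          detected_at: float,
                          window_s: float) -> bool:
    """Step S4: was another disk's failure logged within window_s seconds?

    `failure_log` maps disk ID -> detection time in seconds (a hypothetical
    stand-in for the hard disk information storage unit 13).
    """
    return any(
        other_id != disk_id and abs(detected_at - logged_at) <= window_s
        for other_id, logged_at in failure_log.items()
    )


log = {"20-1": 100.0}
print(failure_within_window(log, "20-2", 100.5, window_s=1.0))  # True
print(failure_within_window(log, "20-2", 250.0, window_s=1.0))  # False
```

The window of 1.0 second matches one of the example values given in the text; choosing a suitable value for a real system is outside what this sketch can show.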

If the failure detection pattern monitoring unit 12 determines that there is no hard disk device 20-1 to 20-N for which a failure was detected within the predetermined time (step S4: NO), it registers, in the hard disk information storage unit 13, the identification number of the hard disk device for which the failure was detected in step S3 in association with the time at which the failure was detected (step S5). The identification number and time are registered so that, the next time the failure detection unit 11 detects a failure of a hard disk device 20-1 to 20-N in step S1, the failure detection pattern monitoring unit 12 can compare, in step S4, the previously recorded detection time with the detection time of the newly detected failure.
When the failure detection pattern monitoring unit 12 has registered the identification number of the hard disk device 20-1 to 20-N and the time at which the failure was detected, the RAID controller device 10 returns to step S1 and continues detecting failures of the hard disk devices 20-1 to 20-N.
If the failure detection pattern monitoring unit 12 determines that there is a hard disk device 20-1 to 20-N for which a failure was detected within the predetermined time (step S4: YES), the RAID controller device 10 is restarted by the processing described later. If it determines that there is none (step S4: NO), no restart is executed. The reason is as follows. If no failure of another hard disk device 20-1 to 20-N is detected within the predetermined time, the failure of the hard disk device detected in step S1 was detected in isolation. It is therefore unlikely that the failure was erroneously detected because of a temporary abnormality in the connection path 30 due to noise or the like, and likely that the hard disk device in question actually has a failure. The restart is executed in order to reconnect hard disk devices 20-1 to 20-N that were disconnected because of a temporary abnormality in the connection path 30, so when a hard disk device actually has a failure there is no need to restart.

If the failure detection pattern monitoring unit 12 determines that there is a hard disk device 20-1 to 20-N for which a failure was detected within the predetermined time (step S4: YES), the restart unit 14 determines whether the restart count stored in the restart count storage unit 15 is less than a predetermined maximum repetition count (the predetermined number of times) (step S6). As described above, the restart count indicates the number of times the restart unit 14 has restarted the RAID controller device 10.
If the restart unit 14 determines that the restart count is equal to or greater than the maximum repetition count (step S6: NO), the RAID controller device 10 returns to step S1 and continues detecting failures of the hard disk devices 20-1 to 20-N. The reason no restart is executed in this case is described later.
If the restart unit 14 determines that the restart count is less than the maximum repetition count (step S6: YES), it restarts the RAID controller device 10 (step S7). If the failure detection unit 11 erroneously detected a failure because of a temporary abnormality in the connection path 30 due to noise or the like, and that temporary abnormality has cleared by the time the RAID controller device 10 restarts, communication with the hard disk devices 20-1 to 20-N that had been determined to have failed becomes possible again and the failure is no longer detected. This is why the restart is performed in step S7.
After executing the restart, the restart unit 14 rewrites the restart count stored in the restart count storage unit 15 to that count plus 1 (step S8). However, if the restart executed by the restart unit 14 is the first one and the restart count storage unit 15 does not yet store a restart count, the restart unit 14 registers "1" in the restart count storage unit 15 as the restart count.
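The bounded-restart logic of steps S6 to S8 can be sketched as follows. This is a hedged Python illustration: the `RestartLimiter` class, its method names, and the `restart` callable are hypothetical. The text does not say how the restart count survives a restart of the controller; here it is simply held in memory, which a real device could not rely on.

```python
from typing import Callable


class RestartLimiter:
    """Steps S6 to S8: restart only while the count is below a maximum."""

    def __init__(self, max_restarts: int) -> None:
        self.max_restarts = max_restarts
        # In-memory stand-in for the restart count storage unit 15 (assumption).
        self.restart_count = 0

    def try_restart(self, restart: Callable[[], None]) -> bool:
        # S6: refuse once the count has reached the maximum repetition count.
        if self.restart_count >= self.max_restarts:
            return False
        restart()                # S7: restart the controller
        self.restart_count += 1  # S8: record the restart
        return True


limiter = RestartLimiter(max_restarts=2)
results = [limiter.try_restart(lambda: None) for _ in range(3)]
print(results)  # [True, True, False]
```

Starting the counter at zero has the same effect as the special case in step S8, where the first restart registers a count of 1.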

When the restart unit 14 has rewritten the restart count, the failure detection unit 11 communicates with the hard disk devices 20-1 to 20-N for which failures were determined in step S4 to have been detected within the predetermined time, and determines whether each has a failure (step S9). If those hard disk devices have no failure and the determination in step S1 was an erroneous determination caused by a temporary abnormality in the connection path 30 due to noise or the like, communication is likely to be restored by restarting the RAID controller device 10. In step S9, therefore, the failure detection unit 11 communicates with those hard disk devices and checks for failures in order to determine whether communication has returned to normal.
For example, if the failure detection pattern monitoring unit 12 determines in step S4 that failures of the hard disk device 20-1 and the hard disk device 20-2 were detected within the predetermined time, the failure detection unit 11 determines in step S9 whether the hard disk device 20-1 and the hard disk device 20-2 have a failure. This determination is made by the same processing as the determination in step S1.

If, in step S9, the failure detection unit 11 determines that all of the hard disk devices 20-1 to 20-N for which failures were determined in step S4 to have been detected within the predetermined time have a failure (step S9: YES), the process returns to step S6 and the restart is executed again. The reason for returning to step S6 is as follows.
If the hard disk devices 20-1 to 20-N in question actually have a failure, or if a failure such as disconnection of the connection path 30 has occurred, steps S7 and S8 would be repeated and restarts would continue to be executed indefinitely. Therefore, when the restart count has reached the maximum repetition count in step S6, it is highly likely that the failure determination by the failure detection unit 11 is not an erroneous determination caused by a temporary abnormality in the connection path 30, but that the hard disk devices 20-1 to 20-N actually have a failure or that a failure such as disconnection of the connection path 30 has occurred. In that case no further restart is executed, and the process returns to step S1 to continue failure detection of the hard disk devices 20-1 to 20-N.

If, in step S9, the failure detection unit 11 determines that at least one of the hard disk devices 20-1 to 20-N for which failures were determined in step S4 to have been detected within the predetermined time has no failure (step S9: NO), the reconstruction unit 16 acquires the logical drive configuration information from the hard disk devices 20-1 to 20-N (step S10). Having acquired the configuration information, the reconstruction unit 16 reconstructs the logical drive by combining the hard disk devices 20-1 to 20-N based on that information (step S11).
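The branch taken after the restart (steps S9 to S11) can be sketched as below. This Python sketch makes several assumptions that are not in the publication: the `recheck_and_rebuild` function, the `probe` and `read_config` callables, reading the configuration from the first responding disk, and the dictionary used to represent the rebuilt logical drive are all hypothetical, as is the `raid_level` key in the example.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def recheck_and_rebuild(
    all_ids: Sequence[str],
    suspect_ids: Sequence[str],
    probe: Callable[[str], bool],
    read_config: Callable[[str], Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Steps S9 to S11 as a sketch (all names here are hypothetical).

    Returns None when every suspect disk still fails (S9: YES), so the caller
    can go back to the restart-count check. Otherwise returns a dictionary
    standing in for the rebuilt logical drive.
    """
    # S9: re-probe only the disks flagged inside the time window.
    if not any(probe(d) for d in suspect_ids):
        return None
    # S10: read the configuration information from a responding disk.
    members: List[str] = [d for d in all_ids if probe(d)]
    if not members:
        return None  # nothing reachable to rebuild from
    config = read_config(members[0])
    # S11: combine the responding disks according to that configuration.
    return {"config": config, "members": members}


status = {"20-1": True, "20-2": True, "20-3": False}
drive = recheck_and_rebuild(
    all_ids=["20-1", "20-2", "20-3"],
    suspect_ids=["20-2", "20-3"],
    probe=lambda d: status[d],
    read_config=lambda d: {"raid_level": 5},  # hypothetical configuration
)
print(drive)
```

In the example, disk 20-2 answers again after the restart, so the drive is rebuilt from the responding disks while 20-3 stays out of the array.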

Thus, according to the present embodiment, the failure detection unit 11 detects failures that have occurred in the hard disk devices 20-1 to 20-N, the restart unit 14 restarts the device itself when a plurality of failures of the hard disk devices 20-1 to 20-N are detected within a predetermined time, and at the time of restart the reconstruction unit 16 reconstructs the logical drive based on the logical drive configuration information stored in the hard disk devices 20-1 to 20-N.
As a result, when the failure detection unit 11 determines that a plurality of the hard disk devices 20-1 to 20-N have failed at nearly the same time, recovery of the logical drive can be attempted. If the hard disk devices 20-1 to 20-N were determined to have failed because of a temporary abnormality in the connection path 30 or the like, the logical drive can be restored.

An embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to the one described, however, and various design changes and the like can be made without departing from the gist of the present invention.

The RAID controller device 10 described above contains a computer system. The operation of each processing unit described above is stored in a computer-readable recording medium in the form of a program, and the above processing is performed when the computer reads and executes this program. Here, the computer-readable recording medium means a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. Alternatively, the computer program may be distributed to a computer via a communication line, and the computer that receives it may execute the program.

The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

FIG. 1 is a schematic block diagram showing the configuration of a RAID controller device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the RAID controller device.

Explanation of symbols

1 ... Computer
10 ... RAID controller device
11 ... Failure detection unit
12 ... Failure detection pattern monitoring unit
13 ... Hard disk information storage unit
14 ... Restart unit
15 ... Restart count storage unit
16 ... Reconstruction unit
20-1 to 20-N ... Hard disk device
21 ... Hard disk
22 ... Control unit
30 ... Connection path

Claims

A RAID controller device that combines a plurality of hard disks that store configuration information of virtual storage means in advance and operates as one logical virtual storage means,
A failure detection means for detecting a failure of each of the plurality of hard disks;
Restarting means for restarting the apparatus when the failure detecting means detects a failure of a plurality of hard disks within a predetermined time;
Rebuilding means for rebuilding the virtual storage means based on the configuration information upon restart;
A RAID controller device comprising:

The RAID controller device according to claim 1, wherein
the failure detection means, after the restart by the restarting means, detects failures of the plurality of hard disks for which failures were detected within the predetermined time,
the restarting means executes the restart again when the failure detection means detects failures of the plurality of hard disks for which failures were detected within the predetermined time, and
the number of times the restarting means executes the restart is less than a predetermined number of times.

A processing method using a RAID controller device that combines a plurality of hard disks that store configuration information of virtual storage means in advance and operates as one logical virtual storage means,
The failure detection means detects a failure of each of the plurality of hard disks,
The restarting means restarts the apparatus when the failure detecting means detects a failure of a plurality of hard disks within a predetermined time,
The rebuilding means rebuilds the virtual storage means based on the configuration information at the time of restarting.
A processing method characterized by the above.

A RAID controller circuit that combines a plurality of hard disks that store configuration information of virtual storage means in advance and operates as one logical virtual storage means,
A failure detection circuit for detecting a failure of each of the plurality of hard disks;
A restart circuit for restarting the apparatus when the failure detection means detects a failure of a plurality of hard disks within a predetermined time;
A reconfiguration circuit that reconstructs the virtual storage means based on the configuration information upon restart;
A RAID controller circuit comprising:

A program for causing a RAID controller device that combines a plurality of hard disks that store configuration information of virtual storage means in advance and operates as one logical virtual storage means to operate as:
failure detection means for detecting a failure of each of the plurality of hard disks;
restarting means for restarting the apparatus when the failure detecting means detects failures of a plurality of hard disks within a predetermined time; and
rebuilding means for rebuilding the virtual storage means based on the configuration information at the time of restart.