JP4411197B2 - Loop fault detection apparatus and method

Info

Publication number: JP4411197B2
Application number: JP2004380599A
Authority: JP
Inventors: 昌彦門田
Original assignee: NEC System Technologies Ltd
Current assignee: NEC System Technologies Ltd
Priority date: 2004-12-28
Filing date: 2004-12-28
Publication date: 2010-02-10
Anticipated expiration: 2024-12-28
Also published as: JP2006185344A

Description

The present invention relates to a loop failure detection device and the like that is provided in a loop formed by connecting a plurality of storage devices in series from upstream to downstream and that detects a failure occurring in the loop. Hereinafter, a "magnetic disk device" is referred to simply as a "disk", and "Fibre Channel" is abbreviated as "FC".

To meet recent demands for higher speed and flexible connectivity, FC interfaces are widely used as the internal and external interfaces of disks and similar devices. In addition, to meet the demand for larger capacity, many disks are now connected in a loop on a single FC interface.

In its early stages, a failure in such an FC loop is an intermittent failure of extremely low frequency, so it is recovered by retries. Because the failure does not surface, it is left unattended, and in some cases this has led to a serious failure. Such a failure is also accompanied by the phenomenon that the failed disk does not detect its own failure while several other, normal disks sporadically report anomalies. Identifying the failure location has therefore been extremely difficult.

Patent Document 1 describes a method of identifying a suspect disk by disconnecting each disk from the loop and reconnecting it while varying margins during an inspection process, and checking whether a failure occurs. However, this method has difficulty dealing with failures that occur during operation.

Patent Document 2 concludes that, when a particular electronic device in an FC loop detects a CRC (cyclic redundancy check) error and the adjacent upstream electronic device does not, the suspected failure lies between the two. However, when failures occur infrequently, for example in the situation shown in FIG. 4, this approach wrongly concludes that the failure lies between disks 47 and 48.

CRC is one method of detecting bit errors that occur during data transmission. The data transmission block to be checked is treated as binary data, and this binary data is processed with a formula called a generator polynomial to produce check data of a fixed number of bits. The generated check data is appended to the actual data and sent. The receiving side processes the actual data and the received check data with the same generator polynomial as the sending side to detect whether an error has occurred.
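The append-and-recheck procedure described above can be sketched as follows. This is a minimal illustration, not the FC frame format: Python's `zlib.crc32` is used as a stand-in CRC, and the helper names `attach_crc` and `check_crc` as well as the 4-byte big-endian trailer are assumptions made for this example.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Sender side: compute check data and append it to the payload."""
    crc = zlib.crc32(payload)
    # Assumed layout for this sketch: 4-byte big-endian CRC trailer.
    return payload + crc.to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Receiver side: recompute the CRC over the payload and compare
    it with the check data that arrived with the frame."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


frame = attach_crc(b"read data from a disk")
# Simulate a single bit error introduced in transit.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(check_crc(frame))      # intact frame passes the check
print(check_crc(corrupted))  # corrupted frame fails the check
```

The receiver needs no knowledge of what the payload should contain; a mismatch between the recomputed and received check data is enough to flag the frame as damaged.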

Patent Document 1: Japanese Patent Laid-Open No. 2003-167796
Patent Document 2: Japanese Patent Laid-Open No. 2002-368768

In a system in which multiple disks are loop-connected via FC, an interface failure in a single disk can trigger a loop blockage failure that makes all disks unusable. Even when such a serious failure does not occur, there are many cases in which minor loop failures occur as a precursor. In both situations, the failed disk causing the FC failure cannot be identified, or the suspected range cannot be narrowed down. As a result, preventive maintenance replacement and recovery work becomes prolonged and large in scale, which seriously hinders maintenance and operation.

Accordingly, an object of the present invention is to provide a loop failure detection device and the like that can reliably detect a failure occurring in a loop formed by connecting a plurality of storage devices in series.

The loop failure detection device according to the present invention is provided in a loop formed by connecting a plurality of storage devices in series from upstream to downstream, and detects a failure occurring in the loop. The device comprises upstream failure detection means, downstream failure detection means, and failure identification means. The upstream failure detection means transmits a read command from the upstream to each storage device, receives downstream the data frames read by each storage device, detects errors in those data frames, and determines that the upstream side including the storage device that transmitted the data frame in error is the upstream range affected by the failure. Each storage device has a function of reporting an error in a received data frame. The downstream failure detection means transmits a write command and data frames from the upstream to each storage device, receives downstream the data frame errors reported by each storage device, and determines that the downstream side including the storage device that reported the data frame error is the downstream range affected by the failure. The failure identification means determines that the failure is at the boundary portion defined by both the upstream range determined by the upstream failure detection means and the downstream range determined by the downstream failure detection means.

Suppose that a plurality of storage devices are distributed around a loop and that there is a failure somewhere in the loop. Data transmitted further downstream from a storage device located downstream of the failure location does not pass through the failure location, so it is received without error. In contrast, data transmitted from a storage device located upstream of the failure location passes through the failure location and is received with an error. It can therefore be determined that the failure lies in the upstream side including the storage device whose data was in error. Likewise, data transmitted from the upstream and received by a storage device located upstream of the failure location does not pass through the failure location, so it is received without error, whereas data received by a storage device located downstream of the failure location passes through the failure location and is received with an error. The downstream side including the storage device whose data was in error can therefore be determined to be the downstream range affected by the failure. Consequently, it can be determined that the failure is at the boundary between the downstream range affected by the failure and the upstream range affected by the failure. Even if the failure is an intermittent one that occurs infrequently, superimposing the two ranges allows it to be detected more reliably.
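The reasoning above, in which two affected ranges are combined to find their boundary, can be sketched as a small function. This is an illustrative sketch only: the function name `locate_boundary`, the boolean-list representation, and the zero-based indexing (index 0 being the most upstream device) are assumptions for this example and do not come from the patent text.

```python
from typing import Optional, Sequence, Tuple


def locate_boundary(
    read_error: Sequence[bool], write_error: Sequence[bool]
) -> Tuple[Optional[int], Optional[int]]:
    """Return (a, b): the failure is suspected between the output of
    device a and the input of device b. None means that side gave no
    evidence, so the range stays open toward that end of the loop."""
    # Upstream range: everything up to the most downstream device
    # whose read data arrived with an error.
    a = max((i for i, e in enumerate(read_error) if e), default=None)
    # Downstream range: everything from the most upstream device
    # that reported an error in the write data it received.
    b = min((i for i, e in enumerate(write_error) if e), default=None)
    if a is not None and b is not None and a >= b:
        raise ValueError("observations are inconsistent with a single failure")
    return a, b


# Eight devices with a failure between device 3 and device 4.
read_error = [True, True, True, True, False, False, False, False]
write_error = [False, False, False, False, True, True, True, True]
print(locate_boundary(read_error, write_error))  # (3, 4)
```

When only some devices happen to show errors, `a` can only be too far upstream and `b` too far downstream, so the returned pair still brackets the failure.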

The loop failure detection method according to the present invention treats the operation of each means of the loop failure detection device according to the present invention as a procedure. Although the claims recite only the method corresponding to the broadest concept of the loop failure detection device, methods corresponding to each subordinate concept of the device may also be claimed. Furthermore, the present invention can also be regarded as a loop failure detection program that causes a computer to function as each means of the loop failure detection device according to the present invention.

In other words, the present invention may be provided with the following three functions: (1) a function that detects CRC errors in received frames and identifies the failure range; (2) a function that analyzes error reports received from devices and identifies the failure range; and (3) a function that uses the results of (1) and (2) to narrow down the failure range further. As a result, in identifying the failed disk that is causing an FC loop failure, the present invention can dramatically improve the resolution with which the failed disk is identified.

According to the present invention, errors that arise in data transmitted from each storage device before it is received downstream are detected, and the upstream side including the storage device whose data was in error is determined to be the upstream range affected by the failure. Errors that arise in data transmitted from the upstream before it is received by each storage device are also detected, and the downstream side including the storage device whose data was in error is determined to be the downstream range affected by the failure. The failure is then determined to be at the boundary between the upstream range and the downstream range affected by the failure. Because the two ranges are superimposed in this way, even an infrequent intermittent failure can be detected reliably.

FIG. 1 is a block diagram showing an embodiment of the loop failure detection device according to the present invention. FIG. 2 is a block diagram showing an example of the operation of the upstream failure detection means in this embodiment. FIG. 3 is a block diagram showing an example of the operation of the downstream failure detection means in this embodiment. FIG. 4 is a chart showing an example of the operation of the failure identification means in this embodiment. The following description is based on these drawings.

The loop failure detection device 11 of this embodiment is implemented in a disk control device 20. The disk control device 20 has a function of transferring data between a host computer 10 and disks 41 to 4n (n is an integer of 2 or more). The operations of the disk control device 20 other than those of the loop failure detection device 11 are well known, so their description is omitted.

The loop failure detection device 11 is provided in an FC loop 30 formed by connecting the disks 41 to 4n in series from the upstream 50 to the downstream 51, and detects a failure occurring in the FC loop 30. The loop failure detection device 11 comprises upstream failure detection means, downstream failure detection means, and failure identification means, each described in detail later. The upstream failure detection means detects frame errors that arise before a frame transmitted from the disks 41 to 4n is received at the downstream 51, and determines that the upstream 50 side including the disk whose frame was in error is range A, the range on the upstream 50 side affected by failure X. The downstream failure detection means detects frame errors that arise before a frame transmitted from the upstream 50 is received by the disks 41 to 4n, and determines that the downstream 51 side including the disk whose frame was in error is range B, the range on the downstream 51 side affected by failure X. The failure identification means determines that failure X is at the boundary between range A on the upstream 50 side, as determined by the upstream failure detection means, and range B on the downstream 51 side, as determined by the downstream failure detection means.

Suppose that the disks 41 to 4n are distributed around the FC loop 30 and that there is a failure X somewhere in the FC loop 30. A frame transmitted further toward the downstream 51 from a disk located on the downstream 51 side of the failure location does not pass through the failure location, so it is received without error. In contrast, a frame transmitted from a disk located on the upstream 50 side of the failure location passes through the failure location and is received with an error. It can therefore be determined that failure X lies in the upstream 50 side including the disk whose frame was in error. In the case of an infrequent intermittent failure, however, not every frame transmitted from a disk on the upstream 50 side of the failure location will be in error.

Likewise, a frame transmitted from the upstream 50 and received by a disk located on the upstream 50 side of the failure location does not pass through the failure location, so it is received without error. In contrast, a frame received by a disk located on the downstream 51 side of the failure location passes through the failure location and is received with an error. It can therefore be determined that failure X lies in the downstream 51 side including the disk whose frame was in error. In the case of an infrequent intermittent failure, however, not every frame received by a disk on the downstream 51 side of the failure location will be in error.

Consequently, it can be determined that failure X is at the boundary between the downstream range B affected by failure X and the upstream range A affected by failure X. Even if failure X is an infrequent intermittent failure, superimposing the two ranges A and B allows failure X to be detected more reliably.

More specifically, the upstream failure detection means transmits a read command from the upstream 50 to the disks 41 to 4n, receives at the downstream 51 the frames read by the disks 41 to 4n, detects errors in those frames, and determines that the upstream 50 side including the disk whose frame was in error is range A on the upstream 50 side affected by failure X. The downstream failure detection means transmits a write command and frames from the upstream 50 to the disks 41 to 4n, receives at the downstream 51 the frame errors detected by the disks 41 to 4n, and determines that the downstream 51 side including the disk whose frame was in error is range B on the downstream 51 side affected by failure X.

In this case, the upstream failure detection means consists of a command issuing unit 23, a frame transmission unit 21, a frame reception unit 22, and a CRC check unit 25; the downstream failure detection means consists of the command issuing unit 23, the frame transmission unit 21, the frame reception unit 22, and a command response analysis unit 24; and the failure identification means is built into the command response analysis unit 24. The command issuing unit 23, the frame transmission unit 21, and the frame reception unit 22 are shared by the upstream failure detection means and the downstream failure detection means. The command issuing unit 23 issues read commands, write commands, and frames. The frame transmission unit 21 transmits the read commands, write commands, and frames issued by the command issuing unit 23 to the disks 41 to 4n. The frame reception unit 22 receives the frames read by the disks 41 to 4n and the frame errors detected by the disks 41 to 4n. The CRC check unit 25 detects errors in the frames that were read by the disks 41 to 4n and received by the frame reception unit 22, and determines that the upstream 50 side including the disk whose frame was in error is range A on the upstream side affected by failure X. The command response analysis unit 24 extracts the frame errors that were detected by the disks 41 to 4n and received by the frame reception unit 22 from the responses of the disks 41 to 4n to the write command, and determines that the downstream 51 side including the disk whose frame was in error is range B on the downstream 51 side affected by failure X.

Next, the loop failure detection device 11 of this embodiment is described in more detail with reference to FIGS. 1 to 4.

A plurality of disks 41 to 4n are connected to the disk control device 20, which is connected to the host computer 10, and together they form a single FC loop 30. The disks 41 to 4n have a function of reporting that they have received an error frame. Within the disk control device 20, on the upstream 50 side of the FC loop 30, there are provided the command issuing unit 23, which issues read and write commands, and the frame transmission unit 21, which transmits those commands and data to the disks 41 to 4n attached to the FC loop 30. Within the disk control device 20, on the downstream 51 side of the FC loop 30, there are provided the frame reception unit 22, which receives command responses and data transmitted from the disks 41 to 4n; the CRC check unit 25, which checks the CRC of each received frame; and the command response analysis unit 24, which extracts the command responses of the disks 41 to 4n from the received frames and analyzes whether an error occurred for each command.

FIGS. 2 and 3 show an example of the operation of the loop failure detection device 11, using the concrete case n = 8. The operation of the loop failure detection device 11 is described below with reference to FIGS. 2 and 3.

FIG. 2 is an explanatory diagram showing the failure detection operation when a read command is issued from the command issuing unit 23. The read command issued from the command issuing unit 23 is transmitted through the frame transmission unit 21 from the upstream 50 side to the disks 41 to 48. The disks 41 to 48 send the read data specified by the read command toward the downstream 51 side. The frame reception unit 22 receives the data frames sent from the disks 41 to 48. The CRC check unit 25 performs a CRC check on all of those data frames and determines which of the disks 41 to 48 sent a data frame that is corrupted.

FIG. 3 is an explanatory diagram showing the failure detection operation when a write command is issued from the command issuing unit 23. Along with the write command issued from the command issuing unit 23, data frames are transmitted through the frame transmission unit 21 from the upstream 50 side to the disks 41 to 48. The disks 41 to 48 carry out the write processing for those data frames and send a completion report for the write command toward the downstream 51 side as a data frame. The frame reception unit 22 receives the data frames sent from the disks 41 to 48. The command response analysis unit 24 analyzes the command completion reports and determines which of the disks 41 to 48 responded that the data frame sent from the frame transmission unit 21 was already corrupted when it arrived at the disk.

Take as an example the case shown in FIGS. 2 and 3, in which a failure X that disturbs frames exists on the output side of disk 44. In the check by the CRC check unit 25, errors are detected in the data frames from the disks 41 to 44, which are on the upstream 50 side of the failure location, and no errors are detected in the data frames from the disks 45 to 48, which are on the downstream 51 side of the failure location.

Conversely, in the check by the command response analysis unit 24, no errors are detected in the command responses from the disks 41 to 44, which are on the upstream 50 side of the failure location, and errors are detected in the command responses from the disks 45 to 48, which are on the downstream 51 side of the failure location.
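The two complementary symptom patterns in this example can be reproduced with a small forward model of the loop. This is a simplified sketch that assumes a persistent failure and models only the data frames discussed in the text; the names `DISKS`, `read_frame_crosses_fault`, `write_frame_crosses_fault`, and `fault_after` are hypothetical and introduced only for illustration.

```python
# Disks 41..48 in loop order, most upstream first (n = 8 as in the example).
DISKS = list(range(41, 49))


def read_frame_crosses_fault(disk: int, fault_after: int) -> bool:
    """Read data travels from `disk` to the downstream end of the loop.
    It crosses the failure if the failure is at or after the disk's output."""
    return disk <= fault_after


def write_frame_crosses_fault(disk: int, fault_after: int) -> bool:
    """Write data travels from the upstream end of the loop to `disk`.
    It crosses the failure if the disk sits after the failure."""
    return disk > fault_after


fault_after = 44  # failure X on the output side of disk 44

# What the CRC check on read data would flag.
crc_errors = [d for d in DISKS if read_frame_crosses_fault(d, fault_after)]
# Which disks would report a corrupted write data frame.
reported_errors = [d for d in DISKS if write_frame_crosses_fault(d, fault_after)]

print(crc_errors)       # [41, 42, 43, 44]
print(reported_errors)  # [45, 46, 47, 48]
```

The two lists are complementary, and the point where one ends and the other begins marks the link between disk 44 and disk 45.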

FIG. 4 shows the failure detection status described above. In the case of an infrequent intermittent failure, finding the change point in each of the two detection methods, as illustrated, makes it possible to raise the resolution of suspect identification at an early stage.
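For an intermittent failure, only a few disks show errors at any given time, so the two change points have to be accumulated over successive observations. The following is a minimal sketch of that accumulation. The class `SuspectRange`, its method names, and the sequence of observations are all hypothetical; the data is invented for illustration and does not reproduce the contents of FIG. 4.

```python
from typing import List, Optional, Tuple


class SuspectRange:
    """Accumulates sporadic error observations and narrows the suspect segment."""

    def __init__(self, disks: List[int]) -> None:
        self.disks = disks            # loop order, most upstream first
        self.a: Optional[int] = None  # most downstream index with a read CRC error
        self.b: Optional[int] = None  # most upstream index that reported a write error

    def read_crc_error(self, disk: int) -> None:
        i = self.disks.index(disk)
        self.a = i if self.a is None else max(self.a, i)

    def write_error_reported(self, disk: int) -> None:
        i = self.disks.index(disk)
        self.b = i if self.b is None else min(self.b, i)

    def segment(self) -> Tuple[Optional[int], Optional[int]]:
        """The failure is suspected between the output of the first disk
        and the input of the second; None leaves that side open."""
        up = None if self.a is None else self.disks[self.a]
        down = None if self.b is None else self.disks[self.b]
        return up, down


r = SuspectRange(list(range(41, 49)))
r.read_crc_error(42)        # first sporadic read error
r.write_error_reported(47)  # first sporadic write error report
early = r.segment()
r.read_crc_error(44)
r.write_error_reported(45)
r.read_crc_error(41)        # an error further upstream does not widen the range
late = r.segment()

print(early)  # (42, 47)
print(late)   # (44, 45)
```

Each new observation can only move the two change points toward each other, which is why combining both checks narrows the suspect segment faster than either check alone.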

As described above, based on the frame transfer direction of the FC loop 30, that is, the notion of upstream 50 and downstream 51, the device has, at the most downstream point of the frame path, the CRC check unit 25, which checks the CRC of each frame, and the command response analysis unit 24, which checks the command responses of the disks 41 to 48. These two check mechanisms narrow down the suspects from both the upstream 50 side and the downstream 51 side and identify the failed disk.

Needless to say, the present invention is not limited to the above embodiment. For example, it is applicable not only to disks but to any electronic device that has a function of reporting that it has received an error frame. Each of the above means can also be implemented as a computer program. Furthermore, by automatically disconnecting the identified failed disk from the FC loop, the process can be fully automated all the way to failure recovery.

FIG. 1 is a block diagram showing an embodiment of the loop failure detection device according to the present invention.
FIG. 2 is a block diagram showing an example of the operation of the upstream failure detection means in this embodiment.
FIG. 3 is a block diagram showing an example of the operation of the downstream failure detection means in this embodiment.
FIG. 4 is a chart showing an example of the operation of the failure identification means in this embodiment.

Description of symbols

10 Host computer
11 Loop failure detection device
20 Disk control device
21 Frame transmission unit
22 Frame reception unit
23 Command issuing unit
24 Command response analysis unit
25 CRC check unit
30 FC loop
41 to 4n Disks
50 Upstream of FC loop
51 Downstream of FC loop

Claims

A loop failure detection device that is provided in a loop formed by connecting a plurality of storage devices in series from upstream to downstream and that detects a failure occurring in the loop, the device comprising:
upstream failure detection means, downstream failure detection means, and failure identification means, wherein
the upstream failure detection means transmits a read command from the upstream to each of the storage devices, receives downstream the data frames read by each of the storage devices, detects errors in those data frames, and determines that the upstream side including the storage device that transmitted the data frame in error is the upstream range affected by the failure,
each of the storage devices has a function of reporting an error in a received data frame,
the downstream failure detection means transmits a write command and data frames from the upstream to each of the storage devices, receives downstream the data frame errors reported by each of the storage devices, and determines that the downstream side including the storage device that reported the data frame error is the downstream range affected by the failure, and
the failure identification means determines that the failure is at a boundary portion defined by both the upstream range determined by the upstream failure detection means and the downstream range determined by the downstream failure detection means.

A loop failure detection method that is used in a loop formed by connecting a plurality of storage devices in series from upstream to downstream and that detects a failure occurring in the loop, wherein
each of the storage devices has a function of reporting an error in a received data frame, the method comprising:
transmitting a read command from the upstream to each of the storage devices, receiving downstream the data frames read by each of the storage devices, detecting errors in those data frames, and determining that the upstream side including the storage device that transmitted the data frame in error is the upstream range affected by the failure;
transmitting a write command and data frames from the upstream to each of the storage devices, receiving downstream the data frame errors reported by each of the storage devices, and determining that the downstream side including the storage device that reported the data frame error is the downstream range affected by the failure; and
determining that the failure is at a boundary portion defined by both the upstream range and the downstream range.