JP2004227449A

JP2004227449A - Diagnostic device for trouble in disk array device

Info

Publication number: JP2004227449A
Application number: JP2003017092A
Authority: JP
Inventors: Hiroyasu Shirasaki; 裕康白崎; Koichi Tanaka; 幸一田中
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-01-27
Filing date: 2003-01-27
Publication date: 2004-08-12

Abstract

PROBLEM TO BE SOLVED: To grasp the error status of each single disk constituting a RAID and across a plurality of disks, and to prevent data loss by detecting in advance the risk of data loss from errors that are likely to occur on a plurality of disks.

SOLUTION: The diagnostic device for a disk array device comprises: first diagnostic processing that analyzes the operation history information of each single disk constituting the disk array device and diagnoses each single disk; second diagnostic processing that identifies, from configuration information showing the configuration of the disk array device, the single-disk group belonging to a data group, and analyzes the failure status of that single-disk group from the result of the first diagnostic processing; and a display device that displays the result of the first or the second diagnostic processing. The device further comprises third diagnostic processing that collates the results of the first diagnostic processing against combination conditions, stored in advance in a condition file, under which disk errors are likely to occur. When the third diagnostic processing determines an error due to a combination condition, the diagnosis result is displayed on the display device.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明はディスクアレイ装置（ＲＡＩＤ：ＲｅｄｕｎｄａｎｔＡｒｒａｙｏｆＩｎｅｘｐｅｎｓｉｖｅＤｉｓｋｓ）における障害の診断装置に係り、特にディスクアレイ装置のディスク障害状況を管理することにより、ディスクの故障の兆候を事前に検知すると共に、データの消失を事前に防止するディスクアレイ装置の障害の診断処理に関する。
【０００２】
【従来の技術】
ディスクアレイ装置は複数台の単体ハードディスクで構成され、データグループは複数のハードディスクに亘って書き込み、読み出しが行われる。データの書き込み読出しは、消失データ修復用冗長データを付加して行われるので、１台の単体ディスクでエラーが発生しても修復用冗長データを用いて同じデータグループに属する他のディスクからデータを生成、回復することによりデータ読み出しが可能である。また２台のディスクでエラー発生している場合においても修復用冗長データを持つ最小アドレス単位内で、２台のディスクエラーが発生していない場合には、同様に他のディスクからデータを生成、回復してデータ読み出しが可能であり、高い信頼性を持つ。
【０００３】
一方、単体ディスクにエラーが在る場合であってもディスクアレイ装置は修復用冗長データを使用し、表面的には障害状況が見えることなく動作するため、内部的なエラー発生情報は装置の外から確認することができない。内部的なエラー発生状況を把握するためにはディスクアレイ装置の保守員がディスクアレイ装置のログ情報を分析することが必要となって来る。
またディスクの初期不良、および長年の使用による部品劣化により複数のディスクで故障が発生する場合でも潜在的にエラーが進行していることがあり、将来的にデータが消失する危険性が生じる場合もある。
【０００４】
ディスクアレイ装置に対する一般的な予防保守としては、保守員が定期的に保守する時のディスクアレイ装置のログ情報を参照し、ディスクのエラー発生状況から予防交換を必要するディスクの有無を判断しているのが実情である。しかし、次の定期保守を行うまでの間にディスクの故障が進行している場合もあり、保守員はこのような状況を知ることができない。最悪の場合には同じデータグループの属する２台のディスクでエラーが発生する可能性もあり、このような場合には修復用冗長データによるエラーの回復は不可能であるため、定期保守を実施するタイミングによっては、データ消失の危険性が生じる。
【０００５】
ディスク装置のエラーの予防保守に関する技術としては、例えば特開平１１−３５３８１９号公報（特許文献１）に開示されたものがある。この技術はディスクエラーの発生回数をカウンタに蓄え、そのエラー発生回数が閾値を超えた場合に警報を出すことでディスク故障を事前に検知し、データ消失を防ぐとしている。
【０００６】
【特許文献１】
特開平１１−３５３８１９号公報（第２−４頁、図２）
【０００７】
【発明が解決しようとする課題】
しかしながら特開平１１−３５３８１９号公報に記載の技術は、ディスクエラーの閾値による管理にすぎず、例えばＲＡＩＤにどのように適用されるのかについての示唆が無い。仮にＲＡＩＤに適用したとしても単体ディスクの故障を事前に検知する程度に示唆にすぎず、ＲＡＩＤを構成する全ディスクのエラー状態を横断的にいかに把握して、いかに診断しようとするのかについては何ら言及されていない。
【０００８】
本発明の目的は、ディスクアレイ装置におけるエラーの状況を把握し、かつ複数のディスクに亘るエラー発生によるデータ消失の危険性を事前に検知して、データ消失を防ぐことができるディスクアレイ装置における障害の診断装置を提供することにある。
本発明の他の目的は、ディスクアレイ装置を構成する単体ディスクのエラー診断、複数のディスク群のエラー診断、及びこれらを組み合わせた条件で診断を自動的に行うことにある。
【０００９】
【課題を解決するための手段】
本発明は、複数の単体ディスクで構成されるある単体ディスク群に対して、消失データ修復用冗長データを付加してデータグループのデータを書き込み読出しするディスクアレイ装置を含むシステムにおいて、ディスクアレイ装置の各単体ディスクの動作履歴情報およびディスクアレイ装置の構成を示す構成情報を収集して記憶する記憶装置と、この記憶装置から読み出された各単体ディスクの動作履歴情報を解析して単体ディスク毎の診断処理を行う第一の診断処理手段と、記憶装置から読み出された構成情報からディスクアレイ装置を構成するデータグループに属する単体ディスク群を特定し、第一の診断処理手段の診断結果から単体ディスク群における障害の状況を分析して処理する第二の診断処理手段と、第一の診断処理手段による診断結果に関連する情報又は第二の診断処理手段による診断結果に関連する情報を表示する表示手段とを有して診断装置を構成するものである。
ここで上記動作履歴情報として、例えば単体ディスクのエラー内容、エラーアドレス、発生時間に関する情報を含み、構成情報として例えばデータグループに対応する単体ディスク群を構成する情報を含み、これら動作履歴情報及び構成情報はディスクアレイ装置から定期的に収集されて前記記憶装置に記憶される。上記第一の診断処理手段は、定期的に該動作履歴情報から各ディスクのエラー発生回数、エラー増加率、エラーアドレス分布を算出する。第二の診断処理手段は、第一の診断処理手段の処理結果に従い、複数のディスクにエラーが認められた場合に、同じ修復用冗長データを持つデータグループ内における障害の状態を算出することが好ましい。
また、ディスクのエラーの発生しやすい組み合わせ条件を予め格納しておく組み合わせ条件ファイルを有し、この組み合わせ条件に基づいて上記第一の診断処理手段の処理結果を照合する第三の診断処理手段を有する。第三の診断処理手段による診断の結果、組み合わせ条件によるエラーが判定された場合、診断結果に関連する情報を表示手段に可視的に表示することにより障害の予告を行う。
また、上記第二の診断処理手段において障害の分析結果から同じデータグループの属する複数のディスクに障害が認められた場合、複数ディスクの診断をバッチ処理で行うためのバッチ処理手段を有する。
本発明は、また上記第一、第二、及び第三の診断処理手段における診断処理を夫々第一乃至第三の診断処理モードと捉え、これらの診断処理モードを適宜含む診断方法であると言うことができる。
【００１０】
【発明の実施の形態】
以下、本発明の実施例について図面により詳細に説明する。
図１はディスクアレイ装置（ＲＡＩＤ）及びその診断装置の構成を示す図である。ＲＡＩＤ１０は通信回線３０を介して診断装置２０に接続されている。通信回線３０は例えばＬＡＮの規格に準拠したインターフェースである。図示の例では説明の都合上、１台のＲＡＩＤ１０が接続された例を示しているが、実際には複数台のＲＡＩＤが通信回線３０に接続されていると考えてよい。従って診断装置２０は複数台のＲＡＩＤの診断を行う。
【００１１】
ＲＡＩＤ１０は、複数の単体ディスク群１２及びコントローラ１１を備えて構成される。複数の単体ディスク群１２は、横方向がｍ台の単体ディスク群で、縦方向がｎ行からなる複数のディスク群（合計ｍ×ｎの単体ディスク群）から構成される。データは修復用冗長データを含んでグループ単位で横方向ｍ台の単体ディスク群に記録される。
【００１２】
コントローラ１１はメモリ１３及び図示しない処理装置を有し、単体ディスク群１２に対してデータの書き込み読出しの制御を行うと共に、単体ディスク群１２をアクセスしてその状態情報を採取する。状態情報には、各ディスクに関する動作履歴情報１３及びＲＡＵＤの構成情報１４があり、これら採取された情報はメモリ１３に一時的に格納される。
動作履歴情報１３は、各ディスクへのアクセス時のエラー履歴情報としてエラー毎のエラー種別、エラー発生アドレス、発生時間に関する情報が含まれる。エラー種別とはハードウェアエラー、ディスク媒体エラー、インタフェースエラー等のエラーの種別を示す。エラーアドレスはエラーが発生した個所を特定するためのディスク上のアドレスを示す。発生時間はエラーが発生した時刻を示す。各ディスクにエラーが発生する度にこれらの情報を含む動作履歴情報がログ情報としてメモリ１３に格納される。
動作履歴情報１３は各ＲＡＩＤの各単体ディスクから採取されるので、単体ディスクがｍ×ｎ個存在する場合には、動作履歴情報１３も各々単体ディスクに対応してメモリ１３に記憶されることになる。
構成情報１４はＲＡＩＤの装置構成に関する情報であり、ＲＡＩＤレベル（データ格納のフォーマットを規定）、各ＲＡＩＤを構成する単体ディスクの総数、データグループのサイズ（グループの記憶容量）即ち、データグループの先頭になっている単体ディスクのアドレス及びその幅（横方向のディスクの台数）、データグループの記憶容量（縦方向の深さ：Ｇバイト容量の如き）を示す情報が含まれる。ＲＡＩＤレベルはグループデータ格納のフォーマットを規定する。
これらの情報１４，１５はＲＡＩＤの状態情報３１として例えば２４時間毎のような定期的に、各ＲＡＩＤ１０から通信回線３０を介して診断装置２０に送信される。
【００１３】
次に診断装置２０について説明する。
診断装置２０は各ＲＡＩＤ１０から送信される動作履歴情報１４や構成情報１５に基づいてエラーの状況を診断すると共に、診断結果に従って、対象とするＲＡＩＤにエラー部位の診断、回復を行うための処置を発する機能を有する。
この診断装置２０は例えばパーソナルコンピュータ或いは計算機であり、図示していないが、内部バスにプログラムを実行して診断データを処理する処理装置、プログラムやデータを格納するメモリ、ＲＡＩＤの状態はその診断結果を表示する表示装置（図２の符号４００）、保守員が診断のための指令や種々の情報を入力する入力装置、及びＲＡＩＤ１０からの状態情報や診断結果の情報を格納するハードディスク（記憶装置）を具備して構成される。
【００１４】
処理装置でプログラムが実行されて診断処理が行われる。診断処理には、単体ディスク毎のエラー診断処理２１、複数ディスクのエラー診断処理２２、組み合わせ条件チェック処理２３が含まれる。単体ディスク毎のエラー診断処理２１、複数ディスクのエラー診断処理２２、組み合わせ条件チェック処理２３の結果、診断装置２０からは、対象とするＲＡＩＤ１０の単体ディスク群１２に関する故障の前兆のあるディスクの診断及び回復を行うためのバッチ処理の指令３２が発せられる。指令３２は回線３０を通して対象とするＲＡＩＤ１０に送信される。
【００１５】
上記したように、通信回線３０を介して定期的に受信された動作履歴情報１４及び構成情報１５は、各ＲＡＩＤの各単体ディスクに対応してメモリ（図示せず）に格納される。そして処理装置はメモリに格納された動作履歴情報１４を読出し、各ディスクに関して単位時間（例えば２４時間）におけるエラー発生回数、エラーの増加率、エラーのアドレス分布を算出する。算出された結果は表示装置に表示される。
【００１６】
また、診断装置２０の処理装置は、単体ディスク毎に算出したこれらエラー発生回数、エラーの増加率、エラーのアドレス分布から、ＲＡＩＤ内の同じ修復用冗長データを持つ同一データグループに関するエラー発生回数、エラーの増加率、エラーのアドレス分布を算出する。その算出結果から、同一データグループ内での障害発生のデータ消失の危険性がある単体ディスク群に関連した診断を行う。診断結果の状況は表示装置に表示される。これにより保守員は診断状況を知り、ＲＡＩＤのデータグループ毎にデータ消失の危険性を了知できる。
【００１７】
診断装置２０はまた複数の単体ディスクに対する障害を検知した場合、診断装置内で当該ディスクの診断を行うためのバッチ処理を作成する機能を持ち、そのための指令３２を対象とするＲＡＩＤ１０に発する。
【００１８】
図２は単体ディスク毎のエラー診断処理２１の概略を示すフローチャートである。
通信回線３０を経由して、定期的に自動的にＲＡＩＤ１０から送信されるＲＡＩＤの状態情報（動作履歴情報１４及び構成情報１５）３１は、診断装置２０内のハードディスク装置（図示せず）の情報退避領域２００に格納される（ステップ１０１）。状態情報３１は２４時間毎にＲＡＩＤから定期的に送信されてきて領域２００に格納、蓄積されるので、数回分の状態情報が世代管理されることになる。この状態情報から各ディスクの動作履歴情報を抽出し、更にディスクエラーに関する情報を抽出する（ステップ１０２）。抽出した履歴情報１４にはエラー種別、エラー発生アドレス、発生時間に関する情報が含まれているので、いずれのディスクのどのアドレスでどのようなエラーが何時に発生したかが判る。
【００１９】
処理装置は情報退避領域から読み出した動作履歴情報１４から、単位時間（例えば２４時間）毎に、各ディスクにおけるエラー発生回数、エラーの増加率、エラーのアドレス分布を算出する。そしてその算出結果は結果ファイル３００としてハードディスク装置に格納される。結果ファイル３００には、これまでに分析した過去のエラー情報の分析結果が蓄積されている。結果ファイル３００から従前の分析結果の情報を読み出し（ステップ１０３）、今回の算出したエラーの分析情報の結果と比較する（ステップ１０４）。突発的なエラー発生状況のみでなく、長期的に進行していくエラー状況についての確認を行い、その結果は結果ファイル３００に記録され、併せて表示装置４００に表示される。前回までの結果ファイルと、今回のディスクエラー情報に基づいて故障の前兆があるディスクの有無を診断する（ステップ１０５）。診断処理は各エラー内容の重大度により、ディスクのハード仕様に基づく重み付けを行っており、エラー種別により将来的に故障の可能性があるエラー増加率の上限値を個別に設定している。各ディスクについて前述のエラー増加率の上限値を超えたものがないかチェックする（ステップ１０６）。上限値を超えたものが無ければ診断処理を終了する（ステップ１０８）。
【００２０】
一方、上限値を超えたディスクがあった場合は、表示装置４００に上限値を超えたディスクの報告と警告表示をすることで、保守員に対して障害部位を事前に検知、通知することができる（ステップ１０７）。
【００２１】
図３は複数ディスクのエラー診断処理２２の概略を示すフローチャートである。
この処理は、単体ディスク毎のエラー診断処理２１において複数ディスクのエラー発生が認められた場合に実行される（ステップ１１０）。これは図２に示した単体ディスク毎のエラーの診断処理２１の結果から判り、複数ディスクに亘るエラーが認識されると、診断処理２２が起動される。
情報退避領域２００に格納されているＲＡＩＤの状態情報から装置の構成情報１５が抽出される（ステップ１１１）。構成情報は同じ修復用冗長データを持つ論理的に同じデータグループに属する単体ディスク群を特定する情報を有しているので、既にエラーとなっている各ディスクに割り当てられているデータグループを分別できる。
【００２２】
また、各ディスクに対する動作履歴情報１４の示す各ディスクのエラーアドレス情報はディスク内の物理位置を示すアドレスと同一であるため、論理的ファイルイメージで考えた場合、エラーアドレスと構成情報に基づいてどこのファイルに割り当てられているデータグループのアドレスにエラーが発生しているかの算出が行なわれる（ステップ１１２）。即ち、障害が発生したディスク位置は、“シリンダアドレス”、“トラックアドレス”、“セクタアドレス”によって物理的に特定され、それは動作履歴情報に含まれて情報退避領域２００に格納されている。一方、構成情報には、データグループのサイズ（グループの記憶容量）、使用ディスク数、ＲＡＩＤレベル（データ格納のフォーマットを規定）が含まれる。構成情報に基づいて各データグループが単体ディスク群のどの範囲を使用しているかを判断することができるので、単体ディスク上で障害の発生した物理的位置がどこのデータグループに含まれるのかを知ることができる。
そして複数ディスクでのエラー発生状況の通知が行われる（ステップ１１３）。この場合、エラー情報としてエラー発生ディスクの位置と、エラー発生状況、及び上記ステップ１１２で算出した論理ファイルのエラーアドレス情報が報告される。
【００２３】
次にバッチファイル作成処理が行われる（ステップ１１４）が、この処理は、データ消失の危険性を通知した時（ステップ１１３）に、診断装置２０が作成する当該複数ディスク用の診断用のバッチ処理を作成する処理であり、作成されたバッチ処理は、ハードディスク装置内にバッチ処理ファイル５００として格納される。
バッチ処理は、ＲＡＩＤ１０に接続される通信回線３０のＩ／Ｆ規格に準拠したコマンドＩ／Ｆを使用して作成し、当該ディスクに対する全媒体面のリード検索用コマンドシーケンスを自動作成する。
【００２４】
このバッチ処理３２をそのＲＡＩＤ１０対して実行することで、ＲＡＩＤでは当該ディスクの全媒体面のリード検索が行われ、リード検索で検出したエラー部位については単体ディスクが一般的に持つリアサインブロック機能（当該エラーセクタ部を未使用とし、交替セクタにデータを貼りかえる機能）により複数ディスクに対するエラーアドレス部位の回復を行い、データ消失の可能性を排除し、データの保全性を保つ。
【００２５】
バッチ処理の起動タイミングは、診断装置２０の管理者が任意に設定可能であり、診断装置２０は設定状況を判断し（ステップ１１５）、自動起動の設定が入力されていた場合は、通知直後にバッチ処理を自動的に起動する（ステップ１１６）。一方、手動起動の設定が入力されていた場合は通知時のバッチ処理は起動されず、バッチ処理の起動は、診断装置の管理者が手動で行う（ステップ１１８）。バッチ処理の実行結果は診断装置２０がＲＡＩＤ１０から状態情報３１を読み出すことで内容確認を行う（ステップ１１７、１１９）。
【００２６】
図４は組み合わせ条件チェック処理２３の概略を示すフローチャートである。組み合わせ条件チェック処理２３とは、ディスクのエラー条件の組み合せに基づくディスクのエラーを診断して事前に検知するための診断処理である。組み合わせ条件とは、エラーの内容とエラー発生の時間の組み合わせ、或いはエラーの内容とアドレスの組み合わせ、特定の条件とエラー発生のタイミングの組み合わせ、等を言う。これらの組み合わせによってＲＡＩＤを構成するハードディスク装置に特有のエラーが発生することが経験的にわかっている。このような組み合わせ条件を予め組み合わせ条件ファイル６００として診断装置２０のハードディスク装置内に格納しておく。
【００２７】
組み合わせ条件チェック処理が起動されると、組み合わせ条件ファイル６００から、特定の条件を伴うエラーのタイミングや、複数の条件を伴うエラーのタイミング等の組み合わせ条件が読み出される（ステップ１２１）。そして、結果ファイル３００に格納されている単体ディスクの診断処理２１（図２）と、組み合わせファイル６００から読み出された組み合わせ条件との対比処理が行なわれる（ステップ１２２）。対比処理の結果、組み合わせ条件に基づく単体ディスクにエラーが在る旨が判定された場合（ステップ１２３）、診断装置２０は、組み合わせ条件に基づく該当の単体ディスクにエラーが発生する旨の予告通知を行う（１２４）。予告通知は表示装置４００に表示されることにより保守員に知らされる。一方、対比処理の結果、組み合わせ条件に基づく単体ディスクにエラーが無ければ、組み合わせ条件による診断チェック処理は終了する。
組み合わせ条件チェック処理２３は、組み合わせファイル６００に条件が登録されていればいつでも実行できるが、この処理も定期的に行う方が好ましい。そのため前述した単体ディスクのエラー診断処理２１の実行時に定期的に行うか、又は複数ディスクエラー診断処理２２の実行時に行うのが好ましい。
このようにＲＡＩＤを構成するハードディスクに関するエラー状況の組み合わせ条件に基づいて診断することにより、単体ディスクに対する例えばエラーの閾値のみによる診断の管理からは得られないようなエラーの状態が把握できる。これにより複数の単体ディスクから成るＲＡＩＤのエラー発生によるデータの消失の危険性を事前に察知できる。
【００２８】
以上説明したように、上記実施例によれば、一定期間毎に収集した単体ディスクの動作履歴情報を、順次複数回にわたって診断装置の記憶装置に記憶しておき、それらの履歴情報及びＲＡＩＤの構成情報に従って診断することにより、単体ディスク毎の故障のみではなく、複数のディスクに亘って徐々にディスクの壊れが進行していく状況において長期的な障害の変動を障害増加率から、事前に察知できる。
【００２９】
【発明の効果】
以上、説明したように、本発明によれば、ディスクアレイ装置を構成する単体ディスクのエラー診断だけでなく、複数のディスクに亘るエラー診断、及びこれらを組み合わせた条件での診断を行うことができる。これによりディスクのエラー状況を把握でき、複数のディスクに亘るエラーに起因するデータ消失の危険性を事前に察知でき、データ消失の予防が図れる。
【図面の簡単な説明】
【図１】本発明の一実施例によるＲＡＩＤ及びその障害の診断装置の概略構成を示すブロック図。
【図２】単体ディスク毎のエラー診断処理の概略を示すフローチャート。
【図３】複数ディスクのエラー診断処理の概略を示すフローチャート。
【図４】組み合わせ条件チェック処理の概略を示すフローチャート。
【符号の説明】
１０…ＲＡＩＤ１１…単体ディスク群
１２…コントローラ１３…メモリ
１４…各ディスクに対する動作履歴情報
１５…ＲＡＩＤの構成情報２０…診断装置
２１…単体ディスク毎のエラー診断処理
２２…複数ディスクのエラー診断処理
２３…組み合わせ条件チェック処理３０…通信回線
３１…ＲＡＩＤ状態情報３２…バッチ処理指令。

[0001]
[Technical field of the invention]
The present invention relates to a device for diagnosing failures in a disk array device (RAID: Redundant Array of Inexpensive Disks), and in particular to failure diagnosis processing that manages the disk failure status of a disk array device so as to detect signs of disk failure in advance and to prevent data loss before it occurs.
[0002]
[Prior art]
A disk array device is composed of a plurality of single hard disks, and each data group is written to and read from across a plurality of hard disks. Because data is written and read with redundant data for recovering lost data added, even if an error occurs on one single disk, the data can still be read by regenerating it from the other disks belonging to the same data group using the redundant data. Even when errors have occurred on two disks, as long as the two disk errors do not fall within the same minimum address unit that carries redundant data, the data can likewise be regenerated from the other disks and read, so the device is highly reliable.
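The recoverability rule described above can be sketched as follows. This is only an illustration under stated assumptions: a "stripe" index stands in for the minimum address unit that carries redundant data, and `tolerance=1` assumes redundancy that survives one failed disk per stripe. The function name and data layout are hypothetical and not taken from the patent.

```python
def unrecoverable_stripes(errors, tolerance=1):
    """Return stripe indices hit by errors on more disks than the redundancy tolerates.

    errors: iterable of (disk_id, stripe_index) pairs for ONE data group.
    tolerance: assumed number of failed disks per stripe that can still be rebuilt.
    """
    disks_per_stripe = {}
    for disk_id, stripe in errors:
        # Track the distinct disks that reported an error in each stripe.
        disks_per_stripe.setdefault(stripe, set()).add(disk_id)
    return sorted(s for s, disks in disks_per_stripe.items() if len(disks) > tolerance)


# Two disks have errors, but in different stripes: everything is still rebuildable.
print(unrecoverable_stripes([(0, 5), (1, 9)]))
# Two disks have errors in the same stripe: that stripe cannot be rebuilt.
print(unrecoverable_stripes([(0, 5), (1, 5), (2, 7)]))
```

Repeated errors from the same disk in one stripe count once, because only the number of distinct failed disks per stripe matters for recovery.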
[0003]
On the other hand, even when a single disk has an error, the disk array device uses the redundant data for recovery and operates without the failure being visible on the surface, so internal error information cannot be confirmed from outside the device. To grasp the internal error status, maintenance personnel must analyze the log information of the disk array device.
Moreover, where several disks come to fail through initial defects or through deterioration of parts over years of use, errors may be progressing latently, creating a risk that data will be lost in the future.
[0004]
In typical preventive maintenance of a disk array device, maintenance personnel consult the device's log information during periodic maintenance and judge from the disk error status whether any disk needs preventive replacement. However, a disk failure may progress before the next periodic maintenance, and maintenance personnel have no way of knowing this. In the worst case, errors may occur on two disks belonging to the same data group. Recovery using the redundant data is then impossible, so depending on when periodic maintenance is carried out there is a risk of data loss.
[0005]
As a technique relating to preventive maintenance of a disk device error, for example, there is a technique disclosed in Japanese Patent Application Laid-Open No. H11-353819 (Patent Document 1). According to this technology, the number of times of occurrence of a disk error is stored in a counter, and when the number of times of occurrence of the error exceeds a threshold value, an alarm is issued to detect a disk failure in advance and prevent data loss.
[0006]
[Patent Document 1]
JP-A-11-353819 (pages 2-4, FIG. 2)
[0007]
[Problems to be solved by the invention]
However, the technique described in Japanese Patent Application Laid-Open No. H11-353819 merely manages disk errors by a threshold, and gives no suggestion as to how it would be applied to, for example, RAID. Even if it were applied to RAID, it suggests no more than detecting the failure of a single disk in advance. It says nothing about how to grasp the error state of all the disks constituting the RAID across the board, or how to diagnose them.
[0008]
An object of the present invention is to provide a failure diagnostic device for a disk array device that grasps the error status in the disk array device, detects in advance the risk of data loss due to errors occurring across a plurality of disks, and thereby prevents data loss.
Another object of the present invention is to automatically perform error diagnosis of the single disks constituting a disk array device, error diagnosis of groups of multiple disks, and diagnosis under conditions that combine these.
[0009]
[Means for Solving the Problems]
The present invention provides a diagnostic device for a system including a disk array device that writes and reads the data of a data group to and from a single-disk group composed of a plurality of single disks while adding redundant data for recovering lost data. The diagnostic device comprises: a storage device that collects and stores operation history information for each single disk of the disk array device and configuration information indicating the configuration of the disk array device; first diagnostic processing means that analyzes the operation history information of each single disk read from the storage device and performs diagnostic processing for each single disk; second diagnostic processing means that identifies, from the configuration information read from the storage device, the single-disk group belonging to a data group of the disk array device, and analyzes the failure status in that single-disk group from the diagnosis results of the first diagnostic processing means; and display means that displays information related to the diagnosis results of the first diagnostic processing means or of the second diagnostic processing means.
Here, the operation history information includes, for example, the error content, error address, and occurrence time for each single disk, and the configuration information includes, for example, information on which single disks make up the single-disk group corresponding to each data group. The operation history information and configuration information are periodically collected from the disk array device and stored in the storage device. The first diagnostic processing means periodically calculates, from the operation history information, the number of error occurrences, the error increase rate, and the error address distribution of each disk. Preferably, when the results of the first diagnostic processing means show errors on a plurality of disks, the second diagnostic processing means calculates the failure status within the data group sharing the same redundant data for recovery.
The device also has a combination condition file in which combination conditions under which disk errors are likely to occur are stored in advance, and third diagnostic processing means that collates the results of the first diagnostic processing means against those combination conditions. When the third diagnostic processing means determines an error due to a combination condition, information related to the diagnosis result is displayed visibly on the display means, giving advance notice of the failure.
Further, the device has batch processing means for diagnosing the plurality of disks by batch processing when the failure analysis by the second diagnostic processing means finds failures on a plurality of disks belonging to the same data group.
The present invention can also be described as a diagnostic method that treats the diagnostic processing of the first, second, and third diagnostic processing means as first to third diagnostic processing modes, respectively, and includes these modes as appropriate.
[0010]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing a configuration of a disk array device (RAID) and its diagnostic device. The RAID 10 is connected to the diagnostic device 20 via a communication line 30. The communication line 30 is, for example, an interface conforming to the LAN standard. The illustrated example shows an example in which one RAID 10 is connected for the sake of explanation, but it may be considered that a plurality of RAIDs are actually connected to the communication line 30. Therefore, the diagnostic device 20 diagnoses a plurality of RAIDs.
[0011]
The RAID 10 includes a plurality of single disk groups 12 and a controller 11. The single disk groups 12 are arranged as m single disks in the horizontal direction and n rows in the vertical direction (m × n single disks in total). Data, including the redundant data for recovery, is recorded in group units across the m single disks in the horizontal direction.
[0012]
The controller 11 has a memory 13 and a processing unit (not shown), controls the writing and reading of data to and from the single disk group 12, and accesses the single disk group 12 to collect its status information. The status information consists of operation history information 14 for each disk and RAID configuration information 15, and the collected information is temporarily stored in the memory 13.
The operation history information 14 is the error history of accesses to each disk and records, for each error, the error type, the error address, and the occurrence time. The error type distinguishes, for example, hardware errors, disk medium errors, and interface errors. The error address is the address on the disk that identifies where the error occurred. The occurrence time is the time at which the error occurred. Each time an error occurs on a disk, operation history information containing these items is stored in the memory 13 as log information.
Since the operation history information 14 is collected from each single disk of each RAID, when there are m × n single disks, the operation history information 14 is stored in the memory 13 separately for each single disk.
The configuration information 15 describes the device configuration of the RAID. It includes the RAID level, which defines the format in which group data is stored; the total number of single disks constituting each RAID; and the size of each data group, that is, the address of the single disk at the head of the data group and the group's width (the number of disks in the horizontal direction), together with the storage capacity of the data group (its depth in the vertical direction, for example in gigabytes).
These pieces of information 14 and 15 are periodically transmitted as RAID state information 31, for example, every 24 hours from each RAID 10 to the diagnostic apparatus 20 via the communication line 30.
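As an illustration of what one periodic transmission of state information 31 might carry, the sketch below models the operation history information 14 and configuration information 15 as plain records. All class and field names are assumptions made for this example; the patent lists the kinds of information but does not define a data format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ErrorRecord:
    """One entry of operation history information (hypothetical layout)."""
    disk: int              # index of the single disk that reported the error
    error_type: str        # e.g. "hardware", "media", "interface"
    address: int           # address on the disk where the error occurred
    occurred_at: datetime  # time at which the error occurred


@dataclass(frozen=True)
class Configuration:
    """Configuration information for one data group (hypothetical layout)."""
    raid_level: int
    total_disks: int
    group_first_disk: int   # disk at the head of the data group
    group_width: int        # number of disks the group spans horizontally
    group_capacity_gb: int  # depth of the group in the vertical direction


@dataclass
class StateInfo:
    """One periodic transmission from a RAID to the diagnostic device."""
    raid_id: str
    collected_at: datetime
    config: Configuration
    history: List[ErrorRecord] = field(default_factory=list)

    def errors_for(self, disk: int) -> List[ErrorRecord]:
        """Return the error records that belong to one single disk."""
        return [r for r in self.history if r.disk == disk]
```

Keeping each transmission as its own `StateInfo` object makes it straightforward to retain several generations of state information side by side.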
[0013]
Next, the diagnostic device 20 will be described.
The diagnostic device 20 diagnoses the error status based on the operation history information 14 and the configuration information 15 transmitted from each RAID 10, and has a function of issuing to the target RAID, according to the diagnosis result, instructions for diagnosing and recovering the error locations.
The diagnostic device 20 is, for example, a personal computer or other computer. Although not shown, it comprises, on an internal bus: a processing device that executes programs to process diagnostic data; a memory that stores programs and data; a display device (reference numeral 400 in FIG. 2) that displays the RAID status or its diagnosis results; an input device through which maintenance personnel enter diagnostic commands and other information; and a hard disk (storage device) that stores status information from the RAID 10 and diagnosis result information.
[0014]
The processing device executes a program to perform the diagnostic processing. The diagnostic processing includes an error diagnosis process 21 for each single disk, an error diagnosis process 22 for a plurality of disks, and a combination condition check process 23. As a result of these three processes, the diagnostic device 20 issues a batch processing command 32 for diagnosing and recovering disks in the single disk group 12 of the target RAID 10 that show signs of failure. The command 32 is transmitted to the target RAID 10 over the communication line 30.
[0015]
As described above, the operation history information 14 and the configuration information 15 periodically received via the communication line 30 are stored in a memory (not shown) corresponding to each single disk of each RAID. Then, the processing device reads the operation history information 14 stored in the memory, and calculates the number of error occurrences per unit time (for example, 24 hours), the error increase rate, and the error address distribution for each disk. The calculated result is displayed on the display device.
[0016]
Further, from the number of error occurrences, error increase rate, and error address distribution calculated for each single disk, the processing device of the diagnostic device 20 calculates the number of error occurrences, error increase rate, and error address distribution for each data group in the RAID that shares the same redundant data for recovery. Based on this result, it diagnoses single-disk groups in which failures within the same data group create a risk of data loss. The diagnosis status is displayed on the display device, so maintenance personnel can learn the diagnosis status and recognize the risk of data loss for each RAID data group.
[0017]
When it detects failures on a plurality of single disks, the diagnostic device 20 also has a function of creating, within the diagnostic device, a batch process for diagnosing those disks, and issues the corresponding command 32 to the target RAID 10.
[0018]
FIG. 2 is a flowchart showing an outline of the error diagnosis processing 21 for each single disk.
The RAID status information 31 (the operation history information 14 and the configuration information 15), transmitted automatically and periodically from the RAID 10 via the communication line 30, is stored in an information save area 200 of a hard disk device (not shown) in the diagnostic device 20 (step 101). Because the status information 31 is sent from the RAID every 24 hours and accumulated in the area 200, several generations of status information are retained. The operation history information of each disk is extracted from this status information, and from it the information relating to disk errors is extracted (step 102). Since the extracted history information 14 includes the error type, the error address, and the occurrence time, it shows which kind of error occurred at which address on which disk, and when.
[0019]
From the operation history information 14 read from the information save area, the processing device calculates, for each unit time (for example, 24 hours), the number of error occurrences, the error increase rate, and the error address distribution for each disk. The calculation results are stored in the hard disk device as a result file 300, which accumulates the analysis results of past error information. The previous analysis results are read from the result file 300 (step 103) and compared with the error analysis results calculated this time (step 104). This checks not only for sudden error occurrences but also for error conditions that progress over the long term; the outcome is recorded in the result file 300 and displayed on the display device 400. Based on the result file up to the previous run and the current disk error information, the device diagnoses whether any disk shows signs of failure (step 105). The diagnosis weights each error by its severity on the basis of the disk's hardware specifications, and an upper limit on the error increase rate, above which a failure may occur in the future, is set individually for each error type. Each disk is checked against these upper limits (step 106). If no disk exceeds a limit, the diagnostic processing ends (step 108).
[0020]
On the other hand, if a disk exceeds the upper limit, the display device 400 reports that disk and shows a warning, so that the failing part is detected in advance and maintenance personnel are notified (step 107).
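The per-disk check of steps 103 to 107 might be sketched as below. This is a simplified illustration, not the patented implementation: the "error increase rate" is assumed here to be the difference in error count between two consecutive collection periods, and the limits in `RATE_LIMITS` are invented placeholder values, since the patent gives neither a formula nor concrete thresholds.

```python
from collections import Counter

# Hypothetical upper limits on the increase in errors per period, by error type.
# More severe error types are given lower limits.
RATE_LIMITS = {"hardware": 1, "media": 5, "interface": 10}


def diagnose_disk(current_errors, previous_counts, limits=RATE_LIMITS):
    """Diagnose one single disk for one collection period.

    current_errors: list of (error_type, address) seen in this period.
    previous_counts: dict of error_type -> count from the previous period.
    Returns (counts, increase, flagged) where flagged lists the error types
    whose increase exceeds the configured limit.
    """
    counts = Counter(error_type for error_type, _ in current_errors)
    # Assumed definition: increase = this period's count minus last period's.
    increase = {t: counts[t] - previous_counts.get(t, 0) for t in counts}
    flagged = sorted(
        t for t, inc in increase.items() if inc > limits.get(t, float("inf"))
    )
    return dict(counts), increase, flagged


errors = [("media", 100)] * 8 + [("interface", 7)]
counts, increase, flagged = diagnose_disk(errors, {"media": 2})
print(counts, increase, flagged)
```

A disk with a non-empty `flagged` list corresponds to the warning case of step 107; an empty list corresponds to ending the diagnosis at step 108.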
[0021]
FIG. 3 is a flowchart showing an outline of the error diagnosis processing 22 for a plurality of disks.
This processing is executed when an error has occurred in a plurality of disks in the error diagnosis processing 21 for each single disk (step 110). This can be seen from the result of the error diagnosis processing 21 for each single disk shown in FIG. 2, and when an error over a plurality of disks is recognized, the diagnosis processing 22 is started.
The device configuration information 15 is extracted from the RAID status information stored in the information save area 200 (step 111). Since the configuration information identifies the single-disk group belonging to the same logical data group sharing the same redundant data for recovery, the data groups assigned to each disk on which an error has already occurred can be distinguished.
[0022]
In addition, the error address recorded for each disk in the operation history information 14 is the same as the address indicating the physical position on the disk. Viewed as a logical file image, therefore, the error address and the configuration information are used to calculate which file's data group address the error falls in (step 112). That is, the position of a failure on a disk is physically specified by its cylinder address, track address, and sector address, which are included in the operation history information and stored in the information save area 200. The configuration information, for its part, includes the size of the data group (the storage capacity of the group), the number of disks used, and the RAID level (which defines the data storage format). Because the configuration information shows which range of the single-disk group each data group uses, it is possible to determine which data group contains the physical location of a failure on a single disk.
Then, a notification of an error occurrence status on a plurality of disks is made (step 113). In this case, as the error information, the position of the error disk, the error occurrence status, and the error address information of the logical file calculated in step 112 are reported.
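Steps 111 to 113 can be illustrated with the following sketch, which maps a physical error position to a data group and reports groups with errors on two or more disks. The `DataGroup` layout, the disk geometry parameters, and the function names are assumptions for this example; the cylinder/head/sector conversion is the standard formula for a linear block number, with the patent's "track address" treated as the head number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataGroup:
    """Range of the single-disk group used by one data group (hypothetical)."""
    name: str
    first_disk: int   # disk at the head of the group
    width: int        # number of disks the group spans
    start_block: int  # first block used on each member disk
    blocks: int       # depth of the group on each member disk


def chs_to_block(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    """Convert a cylinder/head/sector position to a linear block number."""
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def find_group(groups, disk, block):
    """Return the data group that contains the given disk and block, if any."""
    for g in groups:
        on_disk = g.first_disk <= disk < g.first_disk + g.width
        in_range = g.start_block <= block < g.start_block + g.blocks
        if on_disk and in_range:
            return g
    return None


def groups_at_risk(groups, errors):
    """Return names of data groups with errors on two or more distinct disks.

    errors: iterable of (disk, block) pairs.
    """
    disks_hit = {}
    for disk, block in errors:
        g = find_group(groups, disk, block)
        if g is not None:
            disks_hit.setdefault(g.name, set()).add(disk)
    return sorted(name for name, disks in disks_hit.items() if len(disks) >= 2)


groups = [DataGroup("G0", 0, 4, 0, 1000), DataGroup("G1", 0, 4, 1000, 1000)]
block = chs_to_block(1, 2, 3, heads_per_cylinder=16, sectors_per_track=63)
print(block, find_group(groups, 1, block).name)
print(groups_at_risk(groups, [(0, 10), (1, block), (2, 1500)]))
```

The threshold of two disks reflects the worst case described earlier, in which errors on two disks of the same data group can defeat recovery from the redundant data.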
[0023]
Next, a batch file creation process is performed (step 114). When the risk of data loss has been notified (step 113), the diagnostic device 20 creates a diagnostic batch process for the disks concerned, and the created batch process is stored in the hard disk device as a batch process file 500.
The batch process is written using a command interface conforming to the interface standard of the communication line 30 connected to the RAID 10, and a command sequence for a read scan of the entire media surface of each disk concerned is generated automatically.
[0024]
By executing this batch process 32 on the RAID 10, the RAID performs a read scan of the entire media surface of each disk concerned. For error locations detected by the read scan, the reassign-block function that single disks generally have (a function that retires the error sector and moves its data to a replacement sector) recovers the error addresses on the plurality of disks, eliminating the possibility of data loss and preserving data integrity.
[0025]
The start timing of the batch process can be set freely by the administrator of the diagnostic device 20. The diagnostic device 20 checks the setting (step 115) and, if automatic start has been selected, starts the batch process automatically immediately after the notification (step 116). If manual start has been selected, the batch process is not started at the time of notification, and the administrator of the diagnostic device starts it manually (step 118). The diagnostic device 20 confirms the result of the batch process by reading the state information 31 from the RAID 10 (steps 117 and 119).
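A minimal sketch of steps 114 to 116 follows, assuming the batch process is simply a list of read-scan commands covering every block of each affected disk. The command name `READ_SCAN`, the chunk size, and the `send` callback are hypothetical placeholders; the patent says only that the commands conform to the interface standard of the communication line.

```python
def build_read_scan(disks, total_blocks, chunk=1024):
    """Create a command sequence that reads the whole surface of each disk.

    Returns a list of (command, disk, start_block, block_count) tuples.
    """
    commands = []
    for disk in disks:
        for start in range(0, total_blocks, chunk):
            count = min(chunk, total_blocks - start)
            commands.append(("READ_SCAN", disk, start, count))
    return commands


def dispatch(commands, auto_start, send):
    """Start the batch immediately if automatic start is configured.

    send: callable that transmits one command to the target RAID.
    Returns "started" or "pending" (waiting for a manual start).
    """
    if not auto_start:
        return "pending"
    for command in commands:
        send(command)
    return "started"


batch = build_read_scan([3, 5], total_blocks=2500, chunk=1000)
sent = []
print(len(batch), dispatch(batch, auto_start=True, send=sent.append))
```

Separating creation from dispatch mirrors the text: the batch file is always created and stored, while whether it runs right away depends on the administrator's setting.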
[0026]
FIG. 4 is a flowchart showing an outline of the combination condition check processing 23. The combination condition checking process 23 is a diagnosis process for diagnosing a disk error based on a combination of disk error conditions and detecting the error in advance. The combination condition refers to a combination of an error content and an error occurrence time, a combination of an error content and an address, a combination of a specific condition and an error occurrence timing, and the like. It has been empirically known that an error peculiar to the hard disk device constituting the RAID occurs due to the combination of these. Such a combination condition is stored in the hard disk device of the diagnostic device 20 as a combination condition file 600 in advance.
[0027]
When the combination condition check process is started, combination conditions, such as the error timing under a specific condition or the error timing under a plurality of conditions, are read from the combination condition file 600 (step 121). The results of the diagnosis process 21 for each single disk (FIG. 2), which are stored in the result file 300, are then compared with the combination conditions read from the combination condition file 600 (step 122). If the comparison determines that a single disk has an error based on a combination condition (step 123), the diagnostic device 20 issues an advance notice that an error based on the combination condition is expected on that single disk (step 124). The advance notice is displayed on the display device 400 to inform maintenance personnel. If, on the other hand, the comparison finds no error based on a combination condition on any single disk, the combination condition check process ends.
The combination condition check process 23 can be executed at any time once conditions have been registered in the combination condition file 600, but it is preferably also executed periodically. It is therefore preferable to run it whenever the error diagnosis process 21 for each single disk or the error diagnosis process 22 for multiple disks is performed.
As described above, diagnosing on the basis of combination conditions of the error status of the hard disks constituting the RAID makes it possible to grasp error states that cannot be obtained from single-disk diagnosis alone, for example diagnosis that relies only on an error threshold. This makes it possible to detect in advance the danger of data loss caused by an error in a RAID composed of a plurality of single disks.
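The comparison in steps 121 to 124 can be illustrated with the following minimal sketch. It models only one kind of combination condition, namely a given error content occurring a minimum number of times within a time window on one disk. The record layout, the condition fields, and the names `ErrorRecord`, `CombinationCondition`, and `check_combinations` are all assumptions made for illustration; the specification does not define the format of the combination condition file 600 or of the result file 300.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class ErrorRecord:
    """One per-disk diagnosis entry (hypothetical stand-in for result file 300)."""
    disk_id: str
    content: str           # error content, e.g. "read_error" (hypothetical label)
    address: int           # error address on the disk
    occurred_at: datetime  # error occurrence time


@dataclass(frozen=True)
class CombinationCondition:
    """One entry of the combination condition file 600 (hypothetical layout)."""
    name: str
    content: str           # error content that must match
    min_count: int         # required number of matching errors (assumed >= 1)
    window: timedelta      # time span within which they must occur


def check_combinations(
    records: List[ErrorRecord],
    conditions: List[CombinationCondition],
) -> List[Tuple[str, str]]:
    """Return (disk_id, condition name) pairs that warrant an advance notice."""
    notices = []
    disks = sorted({r.disk_id for r in records})
    for cond in conditions:                      # step 121: conditions read in
        for disk in disks:                       # step 122: compare per disk
            times = sorted(
                r.occurred_at
                for r in records
                if r.disk_id == disk and r.content == cond.content
            )
            # Look for min_count matching errors that fall inside the window.
            for i in range(len(times) - cond.min_count + 1):
                if times[i + cond.min_count - 1] - times[i] <= cond.window:
                    notices.append((disk, cond.name))   # steps 123 and 124
                    break
    return notices
```

An empty return value corresponds to the branch in which no error based on a combination condition is found and the check simply ends.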
[0028]
As described above, according to this embodiment, the operation history information of each single disk, collected at regular intervals, is stored sequentially over several collections in the storage device of the diagnostic apparatus, and diagnosis is performed according to this history information and the RAID configuration information. This makes it possible to detect in advance not only the failure of each single disk but also, from the failure increase rate, the long-term trend of failures in situations where disk failures gradually spread across a plurality of disks.
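As a rough illustration of how a failure increase rate derived from sequentially stored history could be combined with the RAID configuration information, consider the sketch below. It assumes that each collection yields a cumulative error count per disk and that the configuration information maps each data group to its member disks. The function names, the `min_rate` threshold, and the rule that two or more rising disks put a group at risk are hypothetical choices, not values taken from the specification.

```python
from typing import Dict, List, Sequence


def increase_rates(counts: Sequence[int]) -> List[int]:
    """Error increase between consecutive periodic collections."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


def disks_with_rising_errors(
    history: Dict[str, Sequence[int]], min_rate: int
) -> List[str]:
    """Disks whose error count grew by at least min_rate in every interval."""
    flagged = []
    for disk, counts in sorted(history.items()):
        rates = increase_rates(counts)
        if rates and all(rate >= min_rate for rate in rates):
            flagged.append(disk)
    return flagged


def groups_at_risk(config: Dict[str, List[str]], flagged: List[str]) -> List[str]:
    """Data groups that contain two or more disks with rising error counts."""
    return [
        group
        for group, disks in sorted(config.items())
        if sum(disk in flagged for disk in disks) >= 2
    ]
```

A group is flagged here only when more than one member disk is degrading, which mirrors the concern that redundant data can no longer recover lost data once errors spread across several disks of the same data group.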
[0029]
[Effects of the invention]
As described above, according to the present invention, it is possible to perform not only error diagnosis of each single disk constituting the disk array device but also error diagnosis across a plurality of disks, as well as diagnosis under conditions that combine these. As a result, the error status of the disks can be grasped, the danger of data loss due to errors spanning a plurality of disks can be detected in advance, and data loss can be prevented.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a RAID and a failure diagnosis apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing an outline of error diagnosis processing for each single disk.
FIG. 3 is a flowchart showing an outline of error diagnosis processing for a plurality of disks.
FIG. 4 is a flowchart showing an outline of a combination condition check process.
[Explanation of symbols]
10 ... RAID
11 ... Single disk group
12 ... Controller
13 ... Memory
14 ... Operation history information for each disk
15 ... RAID configuration information
20 ... Diagnostic device
21 ... Error diagnosis processing for each single disk
22 ... Error diagnosis processing for multiple disks
23 ... Combination condition check processing
30 ... Communication line
31 ... RAID status information
32 ... Batch processing command

Claims

1. In a system including a disk array device that writes and reads data of a data group by adding redundant data for recovering lost data to a single disk group composed of a plurality of single disks,
A storage device for collecting and storing operation history information of each single disk of the disk array device and configuration information indicating the configuration of the disk array device;
First diagnostic processing means for analyzing the operation history information of each single disk read from the storage device and performing diagnostic processing for each single disk,
Second diagnostic processing means for specifying, from the configuration information read from the storage device, a single disk group belonging to a data group constituting the disk array device, and for analyzing a failure state in the single disk group from a diagnosis result of the first diagnostic processing means,
Display means for displaying information related to the diagnosis result by the first diagnosis processing means, or information related to the diagnosis result by the second diagnosis processing means,
A diagnostic device comprising:

2. The diagnostic device according to claim 1, wherein the operation history information includes information on an error content, an error address, and an occurrence time for each single disk, the configuration information includes information on the single disk group corresponding to each data group, and the operation history information and the configuration information are periodically collected from the disk array device and stored in the storage device,
The first diagnostic processing means periodically calculates the number of error occurrences, the error increase rate, and the error address distribution of each disk from the operation history information,
The second diagnostic processing means calculates, when an error is detected on a plurality of disks according to a processing result of the first diagnostic processing means, a failure state in a data group that shares the same redundant data for recovery.

3. The diagnostic device according to claim 1, further comprising batch processing means for performing a diagnosis of a plurality of disks by batch processing when the analysis of the failure state by the second diagnostic processing means finds a plurality of disks with errors that belong to the same data group.

4. The diagnostic device according to any one of claims 1 to 3, further comprising: a combination condition file storing combination conditions under which a disk error is likely to occur; and third diagnostic processing means for collating the processing result of the first diagnostic processing means against the combination conditions read from the combination condition file, wherein, when an error due to the combination conditions is determined as a result of the diagnosis by the third diagnostic processing means, information related to the diagnostic result is displayed on the display means.

5. In a diagnostic processing method for a disk array device that writes and reads data of a data group by adding redundant data for recovering lost data to a single disk group composed of a plurality of single disks,
Periodically collecting operation history information of each single disk of the disk array device and configuration information indicating the configuration of the disk array device;
A first diagnostic processing mode for analyzing the collected operation history information of each single disk and performing diagnostic processing for each single disk,
A second diagnostic processing mode for identifying, based on the configuration information, a single disk group belonging to a data group configuring the disk array device, and for analyzing a failure situation in the single disk group from the result of the first diagnostic processing,
Visually displaying and notifying information related to the result of the first diagnostic processing or information related to the result of the second diagnostic processing,
A diagnostic method comprising:

6. The diagnostic method according to claim 5, wherein the second diagnostic processing mode is performed when an error is found in a plurality of disks as a result of the diagnostic processing in the first diagnostic processing mode.

7. The diagnostic method according to claim 5, further comprising a third diagnostic processing mode for collating the processing results of the first diagnostic processing mode against combination conditions under which a disk error is likely to occur, wherein information related to a result of diagnosis in the third diagnostic processing mode is displayed on a display device.