JP4968078B2

JP4968078B2 - Failure diagnosis apparatus and failure diagnosis method

Info

Publication number: JP4968078B2
Application number: JP2008007156A
Authority: JP
Inventors: 健志牧田; 隆一浅野
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2008-01-16
Filing date: 2008-01-16
Publication date: 2012-07-04
Anticipated expiration: 2028-01-16
Also published as: JP2009170034A

Description

The present invention relates to a failure diagnosis apparatus and a failure diagnosis method suitable for application to a data recording and reproducing apparatus that uses a large number of hard disk drives, such as a RAID (Redundant Arrays of Inexpensive Disks) device.

RAID devices, which use RAID technology to record AV data (video data and audio data) on a plurality of hard disk drives, are widely used as data recording devices for business and professional applications. Such a RAID device normally writes and reads AV data in response to commands transmitted from a host control device.

A hard disk drive uses a head to write data to and read data from a disk. Writing and reading can become unstable owing to various factors, such as the state of the head during writing or reading and the seek state of the head, so a certain number of errors can occur during recording and reproduction.

As described above, when a plurality of hard disk drives are configured as a RAID device, data is distributed across the drives. Even if an error occurs in one hard disk drive, the missing data can be recovered as long as the other drives can be read correctly, so the reliability of the data recording device is maintained. Errors are not limited to cases where data cannot be read and is therefore lost; they also include, for example, cases where the time from sending a read command until the data is read from the disk becomes long.

A hard disk drive has a finite service life and must be replaced with a new one after a certain amount of use. Reliability of data recording is particularly important in the business and professional data recording devices mentioned above, so maintenance in which the hard disk drives are replaced after a certain amount of use is required.

The simplest way to manage such replacement is to replace all drives at fixed intervals, for example once a year. However, failures develop differently in each hard disk drive: some drives can be used for a long time with almost no trouble, while others produce many errors relatively early in their service life. It is therefore preferable to manage each drive individually according to its error history.

In a conventional technique for diagnosing defects in hard disk drives of this kind, the frequency of errors occurring during access to a hard disk drive is measured, and when that frequency exceeds a preset threshold, the drive is judged defective and is replaced.
Patent Document 1 describes diagnosing, by this kind of judgment, whether a hard disk drive is likely to fail.

Patent Document 1: JP-A-10-11701 (Japanese Unexamined Patent Application Publication No. H10-11701)

However, looking only at the frequency of errors during access does not always give an accurate picture of the condition of an individual hard disk drive.
For example, as shown in FIG. 8(a), with the cumulative number of accesses on the horizontal axis and the error frequency per fixed number of accesses (for example, 100 accesses) on the vertical axis, the drive is judged to need replacement when the error frequency ER0 exceeds the failure determination threshold TH0. The error frequency ER0 is generally assumed to rise gradually as the hard disk drive is used.

In an actual hard disk drive, however, the error frequency ER0 may hardly change even after long use, as in the example of FIG. 8(a). When the same example is plotted as in FIG. 8(b), with the cumulative number of errors on the vertical axis and the number of accesses on the horizontal axis, the cumulative error count ET0 increases almost linearly.

If the error frequency ER0 shown in FIG. 8(a) is constant at a very low level, there is no problem. If, however, ER0 stays constant at a level only slightly below the failure determination threshold TH0, the hard disk drive continues to be used even though it produces a fairly large number of errors, which is undesirable.

In view of the above, an object of the present invention is to enable more accurate diagnosis of defects in hard disk drives.

To solve this problem, the present invention, when diagnosing failure of a hard disk drive, determines as an error any case in which processing based on a write and/or read command to the hard disk drive does not complete normally within a specified time from issue of the command, and accumulates the number of times an error is so determined. The hard disk drive is judged defective when the frequency at which errors are determined exceeds a threshold, and the threshold is variably set according to the cumulative number of errors.

According to the present invention, the threshold against which the error frequency is compared is variable. By setting the threshold appropriately on the basis of the cumulative number of errors, a drive can be judged defective even when its error frequency changes very little.

According to the present invention, setting the threshold appropriately on the basis of the cumulative number of errors makes it possible to judge a drive defective even when its error frequency changes very little, enabling appropriate defect management based on the number of errors. For example, a hard disk drive whose error rate is stable at a relatively high level can be detected as defective after a relatively short period of use, which contributes to the reliability of data recording devices that use hard disk drives.

Embodiments of the present invention are described below with reference to the drawings. This embodiment describes an example in which the present invention is applied to a digital cinema screening system; as a premise, an outline of the digital cinema screening system is given first.

FIG. 1 is an overall configuration diagram of a digital cinema screening system to which the present invention is applied. The system comprises a data recording device 1 using RAID technology (a so-called RAID device), a digital cinema server 2, a projector (projection display device) 3, and a personal computer 4.

The data recording device 1, the digital cinema server 2, and the projector 3 are housed in the same casing. The digital cinema server 2 and the personal computer 4 are connected by a high-speed network 5 such as 1000BASE-T.

In this digital cinema screening system, the AV data of a movie to be screened (the video data being compressed by an image compression method such as JPEG 2000) is recorded in advance from the personal computer 4 to the data recording device 1 via the digital cinema server 2. The AV data reproduced from the data recording device 1 is then sent from the digital cinema server 2 to the projector 3 and projected onto a screen.

The personal computer 4 stores a program for recording AV data, screening movies, and managing data other than AV data (log information on the operation of the data recording device 1, file system information on the AV data in the data recording device 1, and movie subtitle data) from a GUI screen. Based on operations on this GUI screen, the personal computer 4 transmits AV data write/read requests and write/read requests for log information, file system information, and subtitle data to the digital cinema server 2 via the high-speed network 5.

FIG. 2 shows the main circuits provided in the digital cinema server 2. The digital cinema server 2 contains a CPU 11 and a decoder 20 that decompresses compressed AV data. The CPU 11 is connected to the high-speed network 5 of FIG. 1 and, via a PCI-X bus, to the data recording device 1 of FIG. 1. The decoder 20 is connected to the data recording device 1 and to the projector 3 of FIG. 1.

The CPU 11 runs a general-purpose OS (for example, Linux) and manages the hard disks in the data recording device 1 through a general-purpose file system (for example, XFS) on that OS. The CPU 11 also stores the following three programs (a) to (c), which run on the general-purpose OS.
(a) The "internal control application", a program that performs internal control of the data recording device 1
(b) The "digital cinema application", a program that transmits AV data write/read commands to the "internal control application" in response to AV data write/read requests from the personal computer 4
(c) The "management application", a program that transmits write/read commands for log information, file system information, and subtitle data to the "internal control application" in response to the corresponding write/read requests from the personal computer 4

By executing the "digital cinema application" and the "management application", the CPU 11 functions as the host control device of the data recording device 1 (the device that transmits data write/read commands to the data recording device 1). By executing the "internal control application", it also performs internal control of the data recording device 1.

For clarity, the CPU 11 is therefore described below as two functional parts: the part that functions as the host control device is referred to as the upper control unit 11-1, and the part that performs internal control of the data recording device 1 is referred to as the internal control unit 11-2.

FIG. 3 shows the configuration of the data recording device 1 together with the upper control unit 11-1, the internal control unit 11-2, and the decoder 20. In view of its function, the internal control unit 11-2 is shown here as part of the data recording device 1. The flow of data (AV data, log information, file system information, and subtitle data) is indicated by double-line arrows, and the flow of commands and responses exchanged by the upper control unit 11-1 and the internal control unit 11-2 is indicated by single-line arrows.

The data recording device 1 includes an ECC/DMA unit 12 implemented in an FPGA, a cache memory 13, and HDD controllers (for example, SAS controllers) 14 (14-1 to 14-7) for controlling a total of seven HDDs (hard disk drives). Of the seven HDDs 15 (15-1 to 15-7) connected to the HDD controllers 14, the four HDDs 15-1 to 15-4 are for data, the two HDDs 15-5 and 15-6 are for error correction, and the remaining HDD 15-7 is a spare. Each of the HDDs 15-1 to 15-7 can be replaced individually during maintenance of the data recording device.

During recording, the ECC/DMA unit 12 generates an error correction code from the data sent from the upper control unit 11-1 in the digital cinema server 2, stripes the data with the error correction code added using the cache memory 13, and supplies it to the HDDs 15-1 to 15-6 via the HDD controllers 14-1 to 14-6.

During reproduction, the ECC/DMA unit 12 destripes the data supplied from the HDDs 15-1 to 15-6 via the HDD controllers 14-1 to 14-6 using the cache memory 13, and performs error correction on the destriped data to restore the original data.
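The striping and destriping steps can be illustrated with a minimal sketch. It covers only round-robin distribution of fixed-size chunks across drives; the error correction code is omitted because the description does not specify the coding scheme. The function names, the chunk size, and the default of four lanes are assumptions made for illustration, not details taken from the embodiment.

```python
def stripe(data: bytes, drives: int = 4, chunk: int = 4) -> list[bytes]:
    """Distribute data round-robin over `drives` lanes in `chunk`-byte units.

    Hypothetical helper: chunk size and lane count are illustrative only.
    """
    lanes = [bytearray() for _ in range(drives)]
    for start in range(0, len(data), chunk):
        lane_index = (start // chunk) % drives
        lanes[lane_index] += data[start:start + chunk]
    return [bytes(lane) for lane in lanes]


def destripe(lanes: list[bytes], total: int, chunk: int = 4) -> bytes:
    """Reassemble `total` bytes from lanes written by stripe()."""
    out = bytearray()
    offsets = [0] * len(lanes)   # read position within each lane
    lane = 0
    while len(out) < total:
        take = min(chunk, total - len(out))
        out += lanes[lane][offsets[lane]:offsets[lane] + take]
        offsets[lane] += take
        lane = (lane + 1) % len(lanes)
    return bytes(out)
```

Destriping needs the original length because the final chunk may be shorter than the chunk size.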

When video data is reproduced, the video data restored by the ECC/DMA unit 12 is sent to the decoder 20. The baseband video data decompressed by the decoder 20 is sent to the projector 3 of FIG. 1 and projected from the projector 3 onto the screen.

When audio data is reproduced, the audio data restored by the ECC/DMA unit 12 is output to the outside from an audio output terminal (not shown) provided on the casing that houses the data recording device 1, the digital cinema server 2, and the projector 3.

When log information or file system information is reproduced, the information restored by the ECC/DMA unit 12 is sent to the upper control unit 11-1, which forwards it to the personal computer 4 of FIG. 1, where it is displayed on the GUI screen on the display of the personal computer 4.

When subtitle data is reproduced, the subtitle data restored by the ECC/DMA unit 12 is sent to a mixing circuit (not shown) provided in the digital cinema server 2 and combined with the baseband video data decompressed by the decoder 20.
The internal control unit 11-2 executes the "internal control application" described above to control the ECC/DMA unit 12 and the HDD controllers 14.

The following outlines the parts of the configuration of the data recording device 1 shown in FIG. 3 that relate to writing data to and reading data from the hard disks.
For data to be recorded that is supplied from outside via the high-speed network 5, such as FC (Fibre Channel), the ECC/DMA unit 12 generates an error correction code, stripes the data with the error correction code added using the cache memory 13, and supplies it to the HDDs 15-1 to 15-6 via the HDD controllers 14-1 to 14-6. Write and read commands are also supplied via the high-speed network 5.

During reproduction, the data supplied from the HDDs 15-1 to 15-6 via the HDD controllers 14-1 to 14-6 is destriped using the cache memory 13, and the destriped data is error-corrected to restore the original data.

These recording and reproduction processes are executed under the control of the internal control unit 11-2, and data on the control state is stored in the SDRAM 11-3, a memory provided in the internal control unit 11-2. Examples of the stored control-state data are described later. The control-state data may be stored in storage means other than SDRAM: a non-volatile memory may be used, for example, or the data may be stored in part of a hard disk.

Next, the process structure by which data is written and read when a command is supplied to the data recording device 1 of this embodiment is described with reference to FIG. 5. FIG. 5 shows the process structure as executed under the control of the internal control unit 11-2 shown in FIGS. 3 and 4; the internal control unit 11-2 provides the control functions.

When a command is supplied from the digital cinema server to the FC driver, which is the driver for the high-speed network, command standby processing is performed and the processing specified by the command is passed to command processing.
Command processing is the part that actually carries out the command. It performs table lookups, secures resources, and so on, and communicates with the high-speed network side.
It also issues rebuild commands according to the rebuild settings and executes a rebuild according to the settings and the current situation.
The LBA manager is notified when, for example, data cannot be written to an HDD, and manages that condition.
The mode page manager tallies HDD errors and judges whether each HDD is defective. The defect determination process described later is executed here. Specifically, it tallies an error frequency value and an integrated value. The frequency value is the number of errors per fixed number of commands issued to each HDD, and the integrated value is the cumulative number of errors since the HDD was put into use. In the defect determination, the current frequency value is compared with the set threshold, and the HDD is judged defective when the current frequency value exceeds the threshold. As described later, however, this threshold is variable in this example. When an HDD is judged defective, processing is executed to notify the administrator of the defective HDD, for example by indicating it on a display.

The resource manager allocates command tables and cache.
The HDD manager monitors the command table, issues instructions to free HDD controllers, and manages HDD status.
The HDD controller issues commands to the HDD execution unit. It also performs timeout monitoring, which checks whether processing based on a command completes within the specified time. When processing completes normally, it returns a completion response and waits for the next command. When processing does not complete within the specified time, it returns a timeout, returns the HDD status to the RAID controller, and waits for the timed-out driver to finish. If a further timeout occurs, it enters reset processing, returns the status when the reset finishes, and waits for the next command. If that also fails, it returns a status indicating a defect and waits for the next command.
The execution unit calls the HDD driver.
The status daemon monitors the state of the HDDs, mainly the state of the physical layer. Attachment and removal of HDDs and disconnections caused by physical-layer errors are also reflected.

Next, the HDD failure diagnosis process according to this embodiment is described with reference to the flowchart of FIG. 6.
This failure diagnosis process is performed individually for each of the HDDs in the data recording device 1.
First, when a data write or read command is supplied to an HDD, it is determined whether the command was processed normally; if not, an error is recorded. Abnormal processing includes not only cases where data cannot be read or written but also timeouts, in which processing does not complete within the specified time.
When an error occurs (step S11), the frequency value and the integrated value of error occurrences are incremented (step S12). Here, the frequency value is the number of errors per 100 accesses. It is then determined whether the integrated value of error occurrences has reached a multiple of 10000 (step S13).

When the integrated value has reached a multiple of 10000, the threshold with which the frequency value is compared is set to about half its current value (step S14). A lower limit for the threshold is determined in advance. Here, the initial threshold is 10, and it changes as 10 → 5 → 2. The value 2 is the lower limit, and the threshold remains 2 thereafter.
Whether the integrated value has not reached a multiple of 10000 in step S13 or the threshold has been changed in step S14, the currently set threshold is compared with the error frequency value to make the failure determination (step S15). If the threshold is exceeded, the HDD is judged to have failed, and processing is performed to announce that the HDD should be replaced.

With this processing, when an HDD is first put into use, the threshold compared with the error frequency value is 10, and the HDD is judged defective if the number of errors per 100 accesses exceeds 10. When the cumulative number of errors reaches 10000, the threshold changes to 5, and the HDD is judged defective if the number of errors per 100 accesses exceeds 5. When the cumulative number of errors reaches 20000, the threshold changes to 2, and the HDD is judged defective if the number of errors per 100 accesses exceeds 2. Thereafter the threshold remains 2, even when the cumulative number of errors reaches 30000.
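The threshold schedule described above (10, then 5, then 2 as the cumulative error count passes 10000 and 20000) can be written as a small function. This is a sketch under assumptions: "about half" is modeled with integer division, and the function and parameter names are illustrative rather than taken from the embodiment.

```python
def threshold_for(cumulative_errors: int, initial: int = 10,
                  floor: int = 2, step: int = 10_000) -> int:
    """Threshold in effect once `cumulative_errors` errors have accumulated.

    The threshold is halved (integer division, an assumed reading of
    "about half") each time the cumulative count passes another multiple
    of `step`, and never drops below `floor`.
    """
    threshold = initial
    for _ in range(cumulative_errors // step):
        if threshold <= floor:
            break
        threshold = max(floor, threshold // 2)
    return threshold


def is_defective(frequency: int, cumulative_errors: int) -> bool:
    """Judge a drive defective when errors per 100 accesses exceed the threshold."""
    return frequency > threshold_for(cumulative_errors)
```

For example, a drive producing 6 errors per 100 accesses passes the check while its cumulative count is below 10000 and fails it once the threshold has dropped to 5.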

This pattern of threshold changes is only one example, and other conditions may be set.
For example, with n = (integrated value) ÷ 1000, the threshold may be divided by 2^n (where 2^n is 2 raised to the power n).
Alternatively, a threshold setting table may be provided, and each time the integrated value exceeds a fixed value such as 10000, the table may be consulted to set a new threshold (or to set the value of a condition used to determine the new threshold).

FIG. 7 shows an example of the failure determination process according to this embodiment.
FIG. 7(a) plots the cumulative number of accesses on the horizontal axis and the error frequency per 100 accesses on the vertical axis. In this example, the error frequency ER1 is assumed to remain almost constant.
In this situation, when the cumulative number of errors is plotted on the vertical axis against the number of accesses on the horizontal axis, as shown in FIG. 7(b), the cumulative error count ET1 increases almost linearly.

In this state, the failure determination threshold changes in order from the initial threshold TH1 to the thresholds TH2 and TH3 as use progresses. If the error frequency ER1 is reasonably high, it eventually exceeds the threshold even though the frequency itself does not change, and the drive is diagnosed as faulty. In the example of FIG. 7, the drive is judged defective when the threshold reaches its lowest value; if the frequency were higher, it would exceed the threshold at an earlier point and the drive would be judged defective sooner.
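The behavior in FIG. 7 can be reproduced with a small simulation of a drive whose error frequency never changes. The sketch assumes the 10, 5, 2 schedule of the embodiment with integer halving, evaluates the check once per window of 100 accesses, and uses hypothetical function names; the figures it produces are properties of this model, not measurements.

```python
from typing import Optional


def threshold_for(cumulative_errors: int, initial: int = 10,
                  floor: int = 2, step: int = 10_000) -> int:
    """Stepped threshold: halved at each multiple of `step`, limited by `floor`."""
    threshold = initial
    for _ in range(cumulative_errors // step):
        if threshold <= floor:
            break
        threshold = max(floor, threshold // 2)
    return threshold


def first_flagged_window(errors_per_window: int,
                         max_windows: int = 50_000) -> Optional[int]:
    """Return the 1-based window in which a constant-rate drive is first judged
    defective, or None if that never happens within `max_windows` windows."""
    cumulative = 0
    for window in range(1, max_windows + 1):
        cumulative += errors_per_window
        if errors_per_window > threshold_for(cumulative):
            return window
    return None
```

In this model a drive with a constant 6 errors per window is flagged in window 1667 (after 10002 cumulative errors), a drive with 3 errors per window in window 6667, and a drive with 2 errors per window is never flagged, because 2 does not exceed the lower limit.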

According to this embodiment, therefore, an HDD can be judged defective even when its error frequency changes very little, enabling appropriate defect management based on the number of errors. For example, a hard disk drive whose error rate is stable at a relatively high level can be detected as defective after a relatively short period of use, which contributes to the reliability of data recording devices that use hard disk drives.

As shown in FIG. 7A, the threshold is changed stepwise in this example, but it may instead be changed more gradually (along a curve).
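One way to picture a gradually changing threshold is an exponential decay toward a floor. The specification does not name a particular curve, so the function shape, the function name, and all constants below are assumptions for illustration only.

```python
import math


def smooth_threshold(cumulative_errors, initial=16.0, floor=4.0, scale=50.0):
    """Threshold that decays continuously from `initial` toward `floor`.

    The exponential form and the constants are hypothetical; the source only
    states that the threshold may change along a curve instead of in steps.
    """
    return floor + (initial - floor) * math.exp(-cumulative_errors / scale)
```

Compared with a stepwise schedule, a curve like this avoids sudden jumps in the judgment criterion at the step boundaries, at the cost of one more parameter (`scale`) to tune.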

The embodiment above describes an example in which the present invention is applied to a digital cinema screening system. However, the present invention is also applicable to failure diagnosis of hard disk drives in other RAID devices. It is further applicable to failure diagnosis (failure prediction) of hard disk drives used for recording various kinds of data outside RAID devices.

In the embodiment described above, the data recording device itself includes the units that perform failure diagnosis. However, a program (software) that executes each process of the present invention (such as the processing shown in the flowchart of FIG. 6) may instead be installed on, for example, a general-purpose information processing device equipped with a hard disk drive (such as a computer) so that it performs the same failure diagnosis.
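A software version of the diagnosis might look like the following minimal sketch, which combines the three roles named in the claims: command error determination, error count accumulation, and failure determination with a threshold that is halved as errors accumulate. The class name, the `run_command` helper, the callable interface, and every default value are hypothetical. For simplicity the sketch measures elapsed time after the command returns and does not abort a command at the time limit.

```python
import time
from dataclasses import dataclass


@dataclass
class DriveDiagnosis:
    """Per-drive bookkeeping; all default values are hypothetical."""
    time_limit_s: float = 0.5        # allowed time from command issue
    window: int = 100                # accesses per frequency sample
    initial_threshold: float = 16.0  # errors per window tolerated at first
    halve_every: int = 50            # accumulated errors per threshold halving
    cumulative_errors: int = 0
    window_accesses: int = 0
    window_errors: int = 0
    defective: bool = False

    def threshold(self) -> float:
        # Halve the threshold once per `halve_every` accumulated errors.
        return self.initial_threshold / 2 ** (self.cumulative_errors // self.halve_every)

    def record(self, completed_ok: bool, elapsed_s: float) -> None:
        # Command error determination: abnormal completion (including bad
        # read data) or no completion within the time limit.
        if not completed_ok or elapsed_s > self.time_limit_s:
            self.cumulative_errors += 1  # error count accumulation
            self.window_errors += 1
        self.window_accesses += 1
        if self.window_accesses == self.window:
            # Failure determination: compare this window's error frequency
            # with the current, variably set threshold.
            if self.window_errors > self.threshold():
                self.defective = True
            self.window_accesses = 0
            self.window_errors = 0


def run_command(diag: DriveDiagnosis, command) -> None:
    """Time a read/write callable (assumed to return True on success)."""
    start = time.monotonic()
    try:
        ok = bool(command())
    except OSError:
        ok = False
    diag.record(ok, time.monotonic() - start)
```

Keeping one `DriveDiagnosis` instance per drive would let a RAID controller process judge each HDD independently, which is a design choice of this sketch and not something the specification prescribes.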

In this case, the program recording medium that stores the program in a form executable by a computer or the like is provided as packaged media in the form of removable media. Removable media that can be used include magnetic disks (including flexible disks), optical disks (including CD-ROM (Compact Disc - Read Only Memory), DVD (Digital Versatile Disc), and magneto-optical disks), and semiconductor memory. Alternatively, the program recording medium may be a ROM, a hard disk, or the like in which the program is stored (recorded) temporarily or permanently.

The program is stored in the program recording medium, as necessary, over a wired or wireless communication medium such as a local area network (LAN) or the Internet, via a communication unit that is an interface such as a router or a modem.

In this specification, the processing steps that describe the program stored in the program recording medium include not only processing performed in time series in the order described, but also processing that is not necessarily performed in time series and is instead executed in parallel or individually (for example, parallel processing or object-based processing).

The program may be processed by a single computer or processed in a distributed manner by a plurality of computers. The program may also be transferred to a remote computer and executed there.

FIG. 1 is an overall configuration diagram of a digital cinema screening system to which an embodiment of the present invention is applied.
FIG. 2 is a block diagram showing the circuits provided in the digital cinema server of the example of FIG. 1.
FIG. 3 is a block diagram showing the configuration of the data recording device of the example of FIG. 1.
FIG. 4 is a block diagram showing a configuration example of the main part of FIG. 3.
FIG. 5 is an explanatory diagram showing an example of the process structure in the device according to an embodiment of the present invention.
FIG. 6 is a flowchart showing an example of failure diagnosis processing according to an embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of failure determination according to an embodiment of the present invention.
FIG. 8 is an explanatory diagram showing an example of conventional failure determination.

Explanation of symbols

1: data recording device, 2: digital cinema server, 3: projector, 4: personal computer, 5: high-speed network, 11: CPU, 11-1: host control unit, 11-2: internal control unit, 12: ECC/DMA unit, 13: cache memory, 14-1 to 14-7: HDD controllers, 15-1 to 15-7: HDDs, 20: decoder

Claims

A failure diagnosis apparatus that diagnoses a failure of a hard disk drive, comprising:
a command error determination unit that determines, as an error, that processing based on a write or read command to the hard disk drive has not completed normally within a specified time from issuance of the command;
an error count accumulation unit that accumulates the number of times the command error determination unit has determined an error; and
a failure determination unit that determines that the hard disk drive is defective when the frequency at which the command error determination unit determines an error exceeds a threshold, and that variably sets the threshold according to the count accumulated in the error count accumulation unit.

The failure diagnosis apparatus according to claim 1,
wherein the errors determined by the command error determination unit include a case where the data read by the command is defective.

The failure diagnosis apparatus according to claim 2,
further comprising a threshold setting unit that lowers the frequency threshold variably set in the failure determination unit each time the number of errors accumulated in the error count accumulation unit exceeds a predetermined count.

The failure diagnosis apparatus according to claim 3,
wherein the threshold setting unit halves the frequency threshold each time the predetermined count is exceeded.

A failure diagnosis method for diagnosing a failure of a hard disk drive, comprising:
a command error determination process of determining, as an error, that processing based on a write or read command to the hard disk drive has not completed normally within a specified time from issuance of the command;
an error count accumulation process of accumulating the number of times an error has been determined in the command error determination process; and
a failure determination process of determining that the hard disk drive is defective when the frequency at which an error is determined in the command error determination process exceeds a threshold, the threshold being variably set according to the count accumulated in the error count accumulation process.

The failure diagnosis method according to claim 5,
wherein the errors determined in the command error determination process include a case where the data read by the command is defective.