JP2017037626A

JP2017037626A - Device, method, and program for failure prediction

Info

Publication number: JP2017037626A
Application number: JP2016049802A
Authority: JP
Inventors: Yoshinobu Nagase; Ichiro Shishido
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2015-08-07
Filing date: 2016-03-14
Publication date: 2017-02-16

Abstract

PROBLEM TO BE SOLVED: To provide a failure prediction technique that can predict failure of a recording medium and prevent loss of the data stored in the recording medium. SOLUTION: An HDD failure prediction device 100 has an HDD controller 10 configured to receive commands for writing to or reading from an HDD 300 from a host 200, and to write data to or read data from the HDD 300. An abnormal value DB recording unit 40 stores state information of the recording medium obtained when access target points in the recording medium are accessed in increments of a specific unit volume according to a predetermined access pattern for the recording medium. A controller 30 acquires the state information of the recording medium when the access target points are accessed according to the access pattern in increments of the specific unit volume, and registers the state information in association with the access target points in the abnormal value DB recording unit 40. SELECTED DRAWING: Figure 1

Description

The present invention relates to a failure prediction technique for a recording medium.

In a hard disk, minute defects on the platter surface can produce bad sectors that can no longer be read or written, and a head fault can cause repeated retry operations that markedly reduce the data transfer rate. The defective area may also spread until the hard disk itself fails to start.

Patent Document 1 discloses a technique that measures the transfer time of a hard disk and detects signs of hard disk failure from that transfer time. The transfer time is measured at predetermined intervals after product shipment (for example, every week) and compared with the transfer time measured at factory shipment. If the difference between the two satisfies a predetermined condition, a warning is issued to the effect that the hard disk may develop a fault in the future.

Patent Document 2 describes a technique that focuses on the Reallocated Sector Count (hereinafter, the "number of alternative sectors") in the SMART (Self-Monitoring, Analysis and Reporting Technology) information of a hard disk drive (HDD) and determines an HDD abnormality from changes in the number of alternative sectors.

JP 2011-68109 A
JP 2001-6273 A

In the prior art disclosed in Patent Document 1, the measured transfer time multiplied by a predetermined coefficient is compared with the time measured at factory shipment, and an area is judged to be a delay area when the former is longer. A warning is issued when the delay areas reach or exceed a predetermined threshold. However, the variation in access time that depends on the head position of the hard disk drive (HDD) is not considered at all. For example, even when the same area (sector) of the HDD is accessed, the access time is longer by one full disk rotation when the rotational latency is at its maximum, and is shortest when the rotational latency is at its minimum because the access completes without waiting for the disk to rotate. Thus, even for the same area of the HDD, the access time until a data read completes varies with the head position before the access and with the access timing. Because the prior art does not take this variation into account, it cannot measure access time with sufficient accuracy, and failure prediction based on such access times cannot predict HDD failure with high accuracy.
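To give a sense of the size of this variation, the following minimal sketch computes the rotational latency bounds for a given spindle speed. The function names and the 7200 rpm example speed are assumptions chosen for illustration; they do not appear in the source.

```python
def rotation_period_ms(rpm):
    """Return the time for one full platter rotation in milliseconds."""
    return 60_000.0 / rpm


def rotational_latency_bounds_ms(rpm):
    """Return (minimum, maximum) rotational latency in milliseconds.

    The minimum is 0 (the target sector is already under the head) and the
    maximum is one full rotation, as described in the text.
    """
    return 0.0, rotation_period_ms(rpm)


# 7200 rpm is an assumed example speed, not a value from the source.
low, high = rotational_latency_bounds_ms(7200)
print(f"rotational latency: {low:.2f} ms to {high:.2f} ms")
```

At the assumed 7200 rpm, two reads of the same sector can differ by up to about 8.33 ms from rotational position alone, which is the kind of spread the text says the prior art ignores.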

The technique described in Patent Document 2 predicts hard disk drive (HDD) failure by focusing on the number of alternative sectors reported in the SMART information. However, it cannot identify when and where an alternative sector occurred, so it cannot distinguish alternative sectors caused by a temporary (accidental) factor from those caused by a more serious factor (such as a growing scratch on the disk surface).

When alternative sectors are scattered across the disk and their occurrence is accidental, the reallocation returns the HDD to normal operation, and as long as the number of alternative sectors does not increase afterwards the failure cannot be said to be growing. In contrast, when alternative sectors occur in consecutive sectors spanning, for example, one full track of the disk, it is highly likely that the head has touched the disk and scratched it. In that case the head has probably also touched neighboring sectors beyond the area in which alternative sectors were recognized, and the failure is likely to grow over time, so measures to rescue the data must be taken as soon as the possibility of such a growing failure is recognized. With the prior art, however, the positions at which alternative sectors occur cannot be identified. Alternative sectors scattered across the HDD's disk and alternative sectors in consecutive sectors are therefore given the same weight, and the severity of the failure can only be judged from their count.

Thus, although the failure prediction for an HDD should differ greatly depending on where the alternative sectors occurred, the prior art judges failure solely from the number of alternative sectors and cannot provide such a detailed diagnosis of HDD abnormalities.

The present invention has been made in view of these circumstances, and its object is to provide a failure prediction technique that can accurately predict the occurrence of a failure in a recording medium and prevent loss of the data in the recording medium.

To solve the above problem, a failure prediction apparatus according to one aspect of the present invention includes: a state information recording unit that stores state information of a recording medium obtained when access target locations are accessed in units of a specific capacity according to a predetermined access pattern for the recording medium; and a control unit that acquires the state information of the recording medium when an access target location is accessed in units of the specific capacity according to the access pattern, and registers the state information in the state information recording unit in association with the access target location.

Another aspect of the present invention is a failure prediction method. The method includes: a step of storing, in a state information recording unit, state information of a recording medium obtained when access target locations are accessed in units of a specific capacity according to a predetermined access pattern for the recording medium; and a step of acquiring the state information of the recording medium when an access target location is accessed in units of the specific capacity according to the access pattern, and registering the state information in the state information recording unit in association with the access target location.

Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a system, a recording medium, a computer program, and the like, are also valid as aspects of the present invention.

According to the present invention, the occurrence of a failure in a recording medium can be predicted accurately and loss of the data in the recording medium can be prevented.

FIG. 1 is a configuration diagram of the HDD failure prediction apparatus according to Embodiment 1.
FIG. 2 is a flowchart showing the failure prediction procedure performed by the HDD failure prediction apparatus of FIG. 1.
FIG. 3 is a flowchart showing the detailed procedure of the SMART information determination process of FIG. 2.
FIG. 4 is a diagram explaining the access pattern with respect to command issue numbers.
FIG. 5 is a diagram showing the abnormal value database in which command issue numbers and SMART information are recorded in association with each other.
FIG. 6 is a diagram showing the error counter counted for each command issue number.
FIG. 7 is a flowchart showing the failure prediction procedure performed by the HDD failure prediction apparatus of Embodiment 2.
FIG. 8 is a flowchart showing the detailed procedure of the failure prediction process of FIG. 7.
FIG. 9 is a flowchart showing the detailed procedure of the warning determination process of FIG. 8.
FIG. 10 is a flowchart showing the detailed procedure of the error determination process of FIG. 8.
FIG. 11 is a diagram explaining the access pattern used to measure command execution time.
FIG. 12A is a diagram explaining the command execution time measured with the access pattern of FIG. 11.
FIG. 12B is a graph showing the schematic diagram of FIG. 12A with measured values.
FIG. 13 is a diagram explaining the evaluation data and threshold data for segments into which command issue numbers are grouped.
FIG. 14(a) shows the access time when the HDD 300 is normal, and FIG. 14(b) shows the access time when an abnormality has occurred in the HDD 300.
FIG. 15(a) shows an abnormal value database in which command issue numbers and thresholds are recorded in association with each other, and FIG. 15(b) shows an abnormal value database in which segment numbers and thresholds are recorded in association with each other.
FIG. 16(a) shows the warning counter counted for each command issue number, and FIG. 16(b) shows the error counter counted for each command issue number.
FIG. 17 is a flowchart showing the detailed procedure of the warning determination process of Embodiment 3.
FIG. 18 is a flowchart showing the detailed procedure of the recovery process of Embodiment 3.
FIG. 19 is a diagram explaining how SMART information is stored.

(Embodiment 1)
The occurrence of an alternative sector is a function recovery mechanism provided by the HDD manufacturer: a sector with an unrecoverable fault is discarded and replaced by a new spare sector so that the HDD returns to normal operation. The occurrence of an alternative sector is therefore not bad in itself. In fact, accidental sector destruction is seen even in the early period of operation, and with such isolated sector destruction the user can continue to use the HDD normally without being aware that alternative sectors have occurred.

However, if as many as 10 alternative sectors, one tenth of the total, are found to have occurred while writing 100 sectors, for example, a physical fault such as head contact may be the cause. If so, faults may also occur in the area neighboring the accessed sectors, and the failure may grow.

The SMART information gives only a count of alternative sectors; it does not reveal where or when they occurred. Simply consulting the SMART information therefore cannot identify in which area and for what reason alternative sectors are increasing, and cannot predict the growth of a failure. As a result, even when alternative sectors have already arisen from a serious fault such as physical damage, the condition may progress until the data can no longer be rescued from the HDD. A method that looks only at the number of alternative sectors thus cannot detect the growth of an HDD failure with high accuracy, and cannot rescue the data before the HDD fails.

To solve this problem, this embodiment does not look at the number of alternative sectors only after a problem such as an access failure has arisen. Instead, it constantly monitors the occurrence of alternative sectors for each access, which makes it possible to locate where alternative sectors occur and to build a distribution of their occurrence across the disk. The embodiment proposes a method of predicting failure from that distribution.

FIG. 1 is a configuration diagram of an HDD failure prediction apparatus 100 according to the embodiment. The HDD failure prediction apparatus 100 has a function of driving the HDD 300 used by the host 200 and a function of predicting failure of the HDD 300. The HDD failure prediction apparatus 100 includes an HDD controller 10, a temporary storage unit 20, a control unit 30, and an abnormal value DB recording unit 40. These components can be implemented in hardware, software, or a combination of the two. The HDD failure prediction apparatus 100 and the host 200 may be configured as a single unit, as may the HDD failure prediction apparatus 100, the host 200, and the HDD 300.

The HDD controller 10 receives read and write commands for the HDD 300 from the host 200 in accordance with the ATA standard for hard disk drives, and writes data to and reads data from the HDD 300. The HDD controller 10 also issues SMART information read commands. The temporary storage unit 20 temporarily stores the transfer data in a FIFO structure so that it can be grouped into units of the specific capacity.

The control unit 30 extracts the number of alternative sectors from the SMART information read by the HDD controller 10, records it in the abnormal value DB recording unit 40, and judges the lifetime of the HDD 300 from the changes in the number of alternative sectors recorded so far in the abnormal value DB recording unit 40.

The abnormal value DB recording unit 40 registers the number of alternative sectors extracted from the SMART information for each command issue number.

The host 200 includes a display unit and an input unit (neither shown). Input to the input unit of the host 200 is passed to the control unit 30 of the HDD failure prediction apparatus 100 and processed there. The control unit 30 of the HDD failure prediction apparatus 100 also functions as a display control unit that controls the display unit of the host 200.

When the host 200 issues a data write command to the HDD 300, the data to be written is stored temporarily in the temporary storage unit 20 via the HDD controller 10. When the temporarily stored write data reaches the specific capacity, the HDD controller 10 writes it to the HDD 300 in fixed-capacity units, starting with the oldest data, according to the access pattern shown in FIG. 4.
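The buffering behavior described here can be sketched as a FIFO that releases data only in complete fixed-size units, oldest first. This is an illustrative sketch, not the apparatus's implementation: the class name `TemporaryStore`, its methods, and the representation of a sector as an arbitrary Python object are all assumptions.

```python
from collections import deque

SPECIFIC_CAPACITY = 256  # sectors per write unit, the value used in this embodiment


class TemporaryStore:
    """FIFO buffer that releases data in fixed-size units, oldest first."""

    def __init__(self, unit_sectors=SPECIFIC_CAPACITY):
        self.unit_sectors = unit_sectors
        self._sectors = deque()

    def push(self, sectors):
        """Append host write data (an iterable of sector payloads)."""
        self._sectors.extend(sectors)

    def pending(self):
        """Return the number of sectors still waiting in the buffer."""
        return len(self._sectors)

    def pop_units(self):
        """Yield every complete unit currently buffered, oldest data first."""
        while len(self._sectors) >= self.unit_sectors:
            yield [self._sectors.popleft() for _ in range(self.unit_sectors)]
```

Data that does not yet fill a whole unit simply stays in the buffer; how a final partial unit is flushed is covered by the remainder handling in steps S210 to S225.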

Immediately after the data for the current command issue number is written, a SMART information read command is issued, the SMART information is read from the HDD 300, and the number of alternative sectors is extracted from it.

The previous number of alternative sectors for that command issue number, stored earlier in the abnormal value DB recording unit 40, is then read out and compared with the number read from the HDD 300 to determine how the number of alternative sectors has changed at that command issue number and whether the change exceeds the threshold for an abnormal value. The SMART information read from the HDD 300 then overwrites the previous SMART information in the abnormal value DB recording unit 40.
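The compare-then-overwrite step can be sketched as follows. The dictionary standing in for the abnormal value DB, the function name `check_and_update`, and the treatment of a first access (no stored value, so no change is reported) are assumptions made for illustration; the source does not specify them.

```python
def check_and_update(db, command_number, current_count, threshold):
    """Compare the stored count with the current one, then overwrite it.

    db maps command issue number -> last recorded alternative sector count.
    Returns True when the increase since the previous access to this
    command issue number exceeds the threshold.
    """
    # Assumption: on the first access there is no previous value,
    # so the change is treated as zero.
    previous = db.get(command_number, current_count)
    increase = current_count - previous
    db[command_number] = current_count  # overwrite the previous value
    return increase > threshold
```

For example, with a threshold of 2, a count that moves from 5 to 9 at the same command issue number is flagged, while a move from 9 to 10 is not.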

Repeating this over the entire HDD builds a distribution of alternative sector occurrence for the whole HDD, and the error counter is checked against a specific threshold above which the HDD is judged abnormal. Failure is predicted from changes in this error counter, and an HDD stop warning is issued when the HDD is predicted to have reached the end of its life.

The purpose of the HDD failure prediction apparatus 100 of this embodiment is to keep constant track of the distribution of alternative sectors on the disk and to predict failure from it. Rather than noting that an alternative sector has occurred in an individual accessed sector, the apparatus accumulates information on the rate at which alternative sectors occur relative to the amount accessed within each area accessed at the fixed specific capacity, and on how the areas containing alternative sectors are distributed over the disk.

If this were done with ordinary file system accesses, the occurrence of alternative sectors would have to be monitored sector by sector and information on the accessed sectors recorded per sector for every access, which would require a large table whose size is proportional to the HDD capacity.

Moreover, the access length varies from access to access. An access with a long data length yields a lot of information, which makes the cause of an abnormality easier to estimate, whereas an access with a short data length yields little information, which makes it hard to predict the growth of a failure even when an abnormality occurs. It is therefore difficult to apply a uniform failure prediction criterion to every access.

In particular, when an overwrite places a short access in an area that was previously accessed with a long data length, changes in abnormal sectors are monitored only within the short area. Although the same area is being accessed, changes in abnormal sectors in the remainder of the earlier long access cannot be monitored. Consequently, even if the earlier long access revealed alternative sectors in the area and a growing failure, the growth may be overlooked if the alternative sectors within the subsequent short access are judged to be within the normal range.

To avoid these problems and accumulate the change in alternative sectors accurately, the capacity (number of sectors) accessed at one time is fixed so that the capacity over which alternative sectors are monitored does not change from access to access. This fixed capacity is called the "specific capacity". In this example the specific capacity is 256 sectors, but it is not limited to this value and another number of sectors may be used. The serial numbers 1, 2, 3, and so on, assigned in correspondence with the head LBA (Logical Block Address) used for each access, are called command issue numbers. In the following description the LBA may be used in place of the command issue number, although using the command issue number reduces the amount of data needed for storage and computation.

FIG. 4 is a diagram explaining the access pattern with respect to command issue numbers. As shown in FIG. 4, the disk is accessed in units of the specific capacity (256 sectors), and the information on alternative sectors contained in the SMART information is read immediately after each access. In this way the rate at which alternative sectors occur is always measured over an access area of constant capacity. The disk area (sectors) of the specific capacity accessed for each command issue number is called a "command issue number area".

The specific capacity may be the number of sectors in one full track of the disk. It may also be chosen to match the number of sectors in the innermost track of the disk.

Accessing the disk in units of the specific capacity greatly reduces the size of the alternative sector monitoring table. In addition, because the same area is always accessed with the same amount of data, changes in the number of alternative sectors are always monitored over an access area with the same number of sectors, so changes within a given area can be checked efficiently and accurately.
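The reduction in table size can be made concrete with a short calculation. The 1 TB drive capacity and the 512-byte sector size below are assumed example values, and the function name is hypothetical; only the 256-sector specific capacity comes from the text.

```python
def table_entries(capacity_bytes, sector_bytes=512, sectors_per_entry=1):
    """Return how many monitoring-table entries cover the whole drive."""
    total_sectors = capacity_bytes // sector_bytes
    # Ceiling division so that a partial final unit still gets an entry.
    return -(-total_sectors // sectors_per_entry)


ONE_TB = 10**12  # assumed example drive capacity in bytes

per_sector = table_entries(ONE_TB)                       # one entry per sector
per_unit = table_entries(ONE_TB, sectors_per_entry=256)  # one entry per specific capacity

print(per_sector, per_unit)
```

Under these assumptions the per-sector table needs 1,953,125,000 entries, while the per-unit table needs 7,629,395, a reduction by a factor of roughly 256.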

Because the disk is accessed according to the access pattern of FIG. 4, the disk area corresponding to the command issue number is known for every access. The command issue number at which an alternative sector occurred is therefore also known, so plotting the change in the number of alternative sectors against the command issue number shows how the occurrence of alternative sectors is distributed over the disk. HDD failure is predicted by monitoring changes in this distribution.
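One way to inspect such a distribution is to group the command issue numbers that showed an increase into runs of consecutive numbers, which separates scattered occurrences from contiguous ones of the kind the background section associates with head contact. This is only an illustrative sketch: the function name and input format are assumptions, and it is not the determination procedure of FIG. 3.

```python
def contiguous_runs(increase_by_command):
    """Group command issue numbers with a positive increase into runs.

    increase_by_command maps command issue number -> observed increase in
    the alternative sector count. Returns a list of runs, each a list of
    consecutive command issue numbers that all showed an increase.
    """
    hit = sorted(n for n, inc in increase_by_command.items() if inc > 0)
    runs = []
    for n in hit:
        if runs and n == runs[-1][-1] + 1:
            runs[-1].append(n)   # extends the current run
        else:
            runs.append([n])     # starts a new run
    return runs
```

A long run suggests alternative sectors spreading over adjacent areas, whereas many runs of length one suggest isolated occurrences; what run length should count as serious is left open here.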

SMART information records, in memory inside the HDD or the like, the progress of HDD abnormalities (typified by the number of alternative sectors) and cumulative figures for actual operation such as the power off/on count, power-on time, and seek time, together with thresholds prepared by the HDD manufacturer. It is used as an indicator for HDD failure prediction.

However, it is not clear how the various SMART parameter values relate to the operation of the HDD or over what range they vary, and HDD failures very often occur before a SMART parameter reaches the threshold prepared by the HDD manufacturer. Because operational failure prediction is difficult when relying on manufacturer thresholds alone, users have in practice prepared their own thresholds, derived from changes in the parameter values, to predict failure. In this embodiment, failure is predicted by evaluating the number of alternative sectors in the SMART information with an original method.

FIG. 2 is a flowchart showing the failure prediction procedure performed by the HDD failure prediction apparatus 100.

In step S201 of FIG. 2, the command issue number to which the first data to be transferred from the temporary storage unit 20 to the HDD 300 corresponds is identified. For example, the temporary storage unit 20 may manage the data to be transferred using the same LBAs that the host 200 would normally use to write to the HDD 300, and at transfer time the head LBA may be divided by the specific capacity to obtain the command issue number. Let i denote the command issue number identified in this way (the first command issue number used for the data transfer).
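The mapping in step S201 can be sketched as an integer division. The function names are hypothetical. Note that dividing the head LBA by the specific capacity yields a zero-based number, so an offset of 1 would be needed if the numbering is to start at 1 as in the earlier description of command issue numbers; that offset is left out here as an assumption.

```python
SPECIFIC_CAPACITY = 256  # sectors per access, the value used in this embodiment


def command_issue_number(head_lba, specific_capacity=SPECIFIC_CAPACITY):
    """Map a head LBA to its command issue number by integer division.

    The result is zero-based; add 1 if the numbering should start at 1.
    """
    return head_lba // specific_capacity


def first_lba(command_number, specific_capacity=SPECIFIC_CAPACITY):
    """Return the first LBA of the area covered by a command issue number."""
    return command_number * specific_capacity
```

Storing this small integer instead of the full LBA is what allows the abnormal value DB to hold one entry per 256-sector area.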

In step S205, the number of command issue numbers (units of the specific capacity) to which the amount of data to be transferred from the temporary storage unit 20 to the HDD 300 (the write capacity) corresponds is calculated. Specifically, the data amount is divided by the specific capacity to obtain a quotient and a remainder, and the quotient is taken as the write count M.

Writes to the HDD 300 can be controlled in units of one sector, but the specific capacity is larger (256 sectors here). If the last portion of the write data is smaller than the specific capacity, existing data may therefore be present in the specific-capacity area written last.

Therefore, in step S210, it is determined whether the remainder calculated in step S205 is 0, that is, whether the write capacity is divisible by the specific capacity. If it is not divisible (a remainder exists) (NO in S210), the process proceeds to step S215. In step S215, it is checked whether other data exists in the (M+1)-th write area, which corresponds to the fractional data. If other data exists (YES in S215), then in step S220 that data is read into the temporary storage unit 20, combined with the (M+1)-th write data, and prepared as the data for the (M+1)-th write, and the process proceeds to step S225. If no other data exists in step S215 (NO in S215), the write data need not be combined and the process proceeds directly to step S225. Because the fraction adds one write, M is incremented by 1 in step S225 and the process proceeds to step S230.

ステップＳ２１０において書き込み容量が特定容量で割り切れる場合（Ｓ２１０のＹＥＳ）、ステップＳ２３０に進む。 When the write capacity is divisible by the specific capacity in step S210 (YES in S210), the process proceeds to step S230.
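The quotient-and-remainder calculation of steps S205 to S225 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `plan_writes` and its parameters are hypothetical, and capacities are assumed to be expressed in sectors with a specific capacity of 256 sectors.

```python
def plan_writes(data_sectors: int, unit_sectors: int = 256) -> tuple[int, int]:
    """Return (write_count_M, remainder) for a transfer of data_sectors.

    Hypothetical helper: names and the sector-based units are assumptions.
    """
    if data_sectors < 0 or unit_sectors <= 0:
        raise ValueError("sizes must be non-negative and the unit positive")
    # S205: divide the write capacity by the specific capacity.
    m, remainder = divmod(data_sectors, unit_sectors)
    # S210/S225: a non-zero remainder needs one extra (partial) write.
    if remainder != 0:
        m += 1
    return m, remainder


print(plan_writes(1000))  # 3 full units plus a 232-sector fraction
print(plan_writes(512))   # exactly 2 units, no fraction
```

The read-and-merge of existing data in step S220 is omitted here, since it depends on how the temporary storage unit 20 is accessed.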

ステップＳ２３０において、図４に示すアクセスパターンにおける、コマンド発行番号ｉに対応する特定容量の書き込みを行う。 In step S230, the specific capacity corresponding to the command issue number i in the access pattern shown in FIG. 4 is written.

ステップＳ２３２において、コマンド発行番号ｉに対する書き込みを行った後のＳＭＡＲＴ情報をＨＤＤ３００から読み取り、異常値ＤＢ記録部４０に現在のＳＭＡＲＴ情報を登録する。図５に示すように異常値ＤＢ記録部には、それぞれのコマンド発行番号について、前にアクセスしたときに取得したＳＭＡＲＴ情報と現在のアクセスで取得したＳＭＡＲＴ情報が格納される。 In step S232, the SMART information after writing to the command issue number i is read from the HDD 300, and the current SMART information is registered in the abnormal value DB recording unit 40. As shown in FIG. 5, the abnormal value DB recording unit stores the SMART information acquired at the previous access and the SMART information acquired at the current access for each command issue number.

ステップＳ２３５では、コマンド発行番号領域内のＳＭＡＲＴ情報の解析を行う。ＳＭＡＲＴ情報判定処理については、図３のフローチャートを参照して後ほど詳しく説明する。 In step S235, the SMART information in the command issue number area is analyzed. The SMART information determination process will be described in detail later with reference to the flowchart of FIG. 3.

ステップＳ２３５で解析されたコマンド発行番号領域の代替セクタ発生の累計分布が閾値を超えるようなら、障害が進行していることを示す。ステップＳ２３５でこれまでの各セクタのＳＭＡＲＴ情報の代替セクタ発生の分布から盤面の傷等の物理的エラーを解析し、総合的に寿命の到来を検出し、寿命到来検出時は最終的にＨＤＤ３００の停止警告を出すことにより、運用稼働中のＨＤＤ３００からデータを退避させることを促す。 If the cumulative distribution of alternative sector occurrences in the command issue number areas analyzed in step S235 exceeds a threshold, this indicates that a failure is progressing. In step S235, physical errors such as scratches on the platter surface are analyzed from the distribution of alternative sector occurrences in the SMART information recorded so far for each sector, and the end of the HDD's life is detected comprehensively. When the end of life is detected, a stop warning for the HDD 300 is finally issued, prompting the user to evacuate data from the HDD 300 while it is still in operation.

ステップＳ２４０において、ステップＳ２３５の処理で更新したＨＤＤのエラーカウンタおよびＳＭＡＲＴ情報を表示する。ユーザ（操作者）は、これらのカウンタ値によってＨＤＤ３００の状態を監視することができ、必要なときには操作者がこの数値から判断して、独自にＨＤＤ３００を停止させることもできる。 In step S240, the HDD error counter and SMART information updated in the process of step S235 are displayed. The user (operator) can monitor the status of the HDD 300 by using these counter values, and the operator can judge from this numerical value and stop the HDD 300 independently when necessary.

ステップＳ２４５において、ステップＳ２３５の結果を受けて処理されたエラーカウンタが動作停止パラメータ値を超えたことが確認された場合（Ｓ２４５のＹＥＳ）、故障予測処理を終了する。なお、ステップＳ２４５のＹＥＳの直後に、さらにユーザの注意を喚起するような警告メッセージを表示したり、故障予測処理を終了することを通知するメッセージを表示してもよい。 In step S245, when it is confirmed that the error counter processed in response to the result of step S235 has exceeded the operation stop parameter value (YES in S245), the failure prediction process ends. Note that immediately after YES in step S245, a warning message that further alerts the user may be displayed, or a message notifying that the failure prediction process is to be terminated may be displayed.

ステップＳ２４５がＮＯの場合は、ステップＳ２５０において、書き込み回数Ｍを１減算し、コマンド発行領域ｉはアクセスが次の領域に移るため、１加算する。 If NO in step S245, 1 is subtracted from the write count M in step S250, and 1 is added to the command issue area i because the access moves to the next area.

最後にステップＳ２５５において、書き込み回数Ｍが０より大きい場合（Ｓ２５５のＹＥＳ）、所定回数の書き込みに達するまでステップＳ２３０〜ステップＳ２５０までの一連の処理を繰り返す。書き込み回数Ｍが０になった場合（Ｓ２５５のＮＯ）、故障予測処理を終了する。 Finally, in step S255, when the number M of writing is larger than 0 (YES in S255), a series of processing from step S230 to step S250 is repeated until the predetermined number of writings is reached. When the number M of times of writing becomes 0 (NO in S255), the failure prediction process is terminated.
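The loop formed by steps S230 to S255 can be sketched as below. This is only an outline under stated assumptions: `write_unit` and `error_count_after` are hypothetical callbacks standing in for the write (S230) and for the SMART read plus analysis (S232 to S240), and `stop_limit` stands in for the operation stop parameter value.

```python
from typing import Callable


def run_write_loop(
    m: int,
    start_number: int,
    write_unit: Callable[[int], None],
    error_count_after: Callable[[int], int],
    stop_limit: int,
) -> tuple[int, bool]:
    """Return (last command issue number reached, stopped_early)."""
    i = start_number
    while m > 0:                              # S255: writes remaining?
        write_unit(i)                         # S230: write one specific capacity
        if error_count_after(i) > stop_limit:  # S232-S245: analyse, compare
            return i, True                    # S245 YES: end the process
        m -= 1                                # S250: one write done
        i += 1                                # S250: move to the next area
    return i, False


written: list[int] = []
print(run_write_loop(3, 10, written.append, lambda i: 0, stop_limit=10))
print(written)
```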

図３は、ステップＳ２３５のＳＭＡＲＴ情報判定処理の詳細な手順を示すフローチャートである。ＳＭＡＲＴ情報判定処理では、コマンド発行番号におけるセクタの状態を解析する。セクタの状態を解析するためにＳＭＡＲＴ情報の特に代替セクタ数の変化に注目し、代替セクタ数の変化から、最終的にＨＤＤ３００の故障予測を行う。 FIG. 3 is a flowchart showing a detailed procedure of the SMART information determination process in step S235. In the SMART information determination process, the state of the sector at the command issue number is analyzed. In order to analyze the state of the sector, paying attention to the change in the number of alternative sectors in the SMART information, the failure prediction of the HDD 300 is finally performed from the change in the number of alternative sectors.

ＳＭＡＲＴ情報判定処理は図６で示すコマンド発行番号ごとのエラーカウンタの値を加算することで行う。 The SMART information determination process is performed by adding up the error counter values for each command issue number shown in FIG. 6.

所定数Ｎ個のコマンド発行番号ごとに、コマンド発行番号をグループ化する。以下では、このグループを「コマンド発行セグメント」あるいは単に「セグメント」と称する。また、所定数Ｎを「セグメント長」と称する。典型的には、Ｎ＝３０〜５０とするのがよい。例えば、Ｎ＝３０とする場合、コマンド発行番号＝１〜３０をセグメント１、コマンド発行番号＝３１〜６０をセグメント２、コマンド発行番号＝６１〜９０をセグメント３とし、以下同様に、コマンド発行番号とセグメントを対応させる。 Command issue numbers are grouped in sets of a predetermined number N. Hereinafter, such a group is referred to as a “command issue segment” or simply a “segment”, and the predetermined number N is referred to as the “segment length”. Typically, N = 30 to 50 is suitable. For example, when N = 30, command issue numbers 1 to 30 form segment 1, command issue numbers 31 to 60 form segment 2, command issue numbers 61 to 90 form segment 3, and command issue numbers are mapped to segments in the same way thereafter.
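The mapping from command issue number to segment described above reduces to an integer division. A minimal sketch, assuming command issue numbers start at 1 as in the N = 30 example (the function name `segment_of` is hypothetical):

```python
def segment_of(command_number: int, segment_length: int = 30) -> int:
    """Map a 1-based command issue number to its 1-based segment number."""
    if command_number < 1 or segment_length < 1:
        raise ValueError("command_number and segment_length must be >= 1")
    return (command_number - 1) // segment_length + 1


for n in (1, 30, 31, 60, 61, 90):
    print(n, segment_of(n))
```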

ステップＳ３００において、前回コマンド発行番号ｉを実行したときのＳＭＡＲＴ情報（特に代替セクタ数）を異常値ＤＢ記録部４０より読み出す。 In step S300, the SMART information (particularly the number of alternative sectors) when the previous command issue number i was executed is read from the abnormal value DB recording unit 40.

ステップＳ３０５において、ステップＳ３００で読み込んだ前回のＳＭＡＲＴ情報の代替セクタ数と図２のステップＳ２３２で読み込んだ現在のＳＭＡＲＴ情報の代替セクタ数を比較する。このとき、発生する可能性のある代替セクタ数の最大値はコマンド発行番号内のセクタ数である。 In step S305, the number of alternative sectors in the previous SMART information read in step S300 is compared with the number of alternative sectors in the current SMART information read in step S232 of FIG. 2. At this time, the maximum number of alternative sectors that can occur is the number of sectors within the command issue number.

ステップＳ３０５で現在の代替セクタ数が前回の代替セクタ数よりも増えている、すなわち、新たな代替セクタの発生が確認された場合（Ｓ３０５のＹＥＳ）、ステップＳ３１０において、現在のコマンド発行番号領域における代替セクタ数が所定数（例えば５個）以上増加したか（新たに所定数以上の代替セクタが発生したか）否かの確認を行い、現在のコマンド発行番号によるアクセスだけで急激に代替セクタが増加していないか、調べる。 If, in step S305, the current number of alternative sectors is greater than the previous number, that is, if the occurrence of new alternative sectors is confirmed (YES in S305), then in step S310 it is checked whether the number of alternative sectors in the current command issue number area has increased by a predetermined number (for example, 5) or more (whether the predetermined number or more of alternative sectors have newly occurred), in order to determine whether alternative sectors have increased sharply through the access for the current command issue number alone.

衝撃によるヘッドのスクラッチ傷が発生している場合、それを要因として、ＨＤＤ３００の盤面の同一円周上における連続するセクタについて代替セクタ数が急激に増える。そのため、ステップＳ３１０では、現在のコマンド発行番号によるアクセスだけで急激に代替セクタが発生していないかを確認する。傷による障害の場合、一時的には代替セクタの発生によりＨＤＤ３００としての機能は回復するが、障害は今のコマンド発行番号以外にも及んでいる可能性が高いので、至急にＨＤＤを停止させ、データを保護する策を取る必要がある。ステップＳ３１０で予め設定した閾値（例えば５個）以上の代替セクタの発生が確認された場合（Ｓ３１０のＹＥＳ）、ステップＳ３３０でＨＤＤ停止勧告を発し、ＨＤＤ３００の使用を停止させる必要がある。 When a head scratch caused by an impact has occurred, the number of alternative sectors increases sharply for consecutive sectors on the same circumference of the platter surface of the HDD 300. For this reason, step S310 checks whether alternative sectors have occurred abruptly through the access for the current command issue number alone. In the case of a failure caused by a scratch, the function of the HDD 300 is temporarily restored by the alternative sectors, but the failure is likely to extend beyond the current command issue number, so the HDD must be stopped promptly and measures taken to protect the data. If the occurrence of alternative sectors at or above the preset threshold (for example, 5) is confirmed in step S310 (YES in S310), an HDD stop recommendation is issued in step S330, and use of the HDD 300 must be stopped.

代替セクタ発生数の閾値は、連続するセクタが最も少ないディスク最内周で最小の傷が早期に発見できるようにＨＤＤ３００ごとに変更することが望ましい。 It is desirable to change the threshold value for the number of alternative sectors generated for each HDD 300 so that the smallest scratch can be detected early on the innermost circumference of the disk with the fewest consecutive sectors.

現在のコマンド発行番号領域内で、閾値以上の急激な代替セクタの発生が確認できない場合（Ｓ３１０のＮＯ）、ステップＳ３１５で現在のコマンド発行番号領域での代替セクタの発生数を、図６に示すエラーカウンタに登録する。 If no sharp increase in alternative sectors at or above the threshold is confirmed within the current command issue number area (NO in S310), then in step S315 the number of alternative sectors that occurred in the current command issue number area is registered in the error counter shown in FIG. 6.
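The branch structure of steps S305 to S315 can be illustrated with a small classifier. This is a sketch under assumptions: the function name, the returned labels, and the use of a plain integer for the alternative sector count are all hypothetical; the default of 5 is the example threshold given in the text.

```python
def classify_reallocation_change(
    previous: int, current: int, burst_threshold: int = 5
) -> tuple[str, int]:
    """Return (action, newly_reallocated) for one command issue number.

    action is one of:
      "no_change" - S305 NO: no new alternative sectors
      "stop_hdd"  - S310 YES: sharp increase, go to S330
      "record"    - S310 NO: register the increase in the error counter (S315)
    """
    delta = current - previous
    if delta <= 0:
        return "no_change", 0
    if delta >= burst_threshold:
        return "stop_hdd", delta
    return "record", delta


print(classify_reallocation_change(12, 12))
print(classify_reallocation_change(12, 14))
print(classify_reallocation_change(12, 17))
```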

ステップＳ３２０において、アクセス対象のセグメントの前に１０個のセグメント、後に１０個のセグメントという近傍セグメントエリアにおいてゼロでないエラーカウンタが合計５個以上確認された場合（Ｓ３２０のＹＥＳ）、近傍セグメントエリアにおいて遅延が拡大するものと判断し、ステップＳ３３０に進み、ＨＤＤ停止の警告を発する。これは、近傍セグメントエリアにおいて複数個の代替セクタが含まれていれば、コマンド発行番号領域の障害が現在のコマンド発行番号領域の周辺に高密度に広がっていることを示すからである。 In step S320, if a total of five or more non-zero error counters are found in the neighboring segment area, consisting of the 10 segments before and the 10 segments after the segment being accessed (YES in S320), it is judged that the delay will spread in the neighboring segment area, and the process proceeds to step S330, where an HDD stop warning is issued. This is because, if the neighboring segment area contains multiple alternative sectors, the failure is spreading at high density around the current command issue number area.
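The neighborhood check of step S320 can be sketched as a window count. Several points here are assumptions rather than statements from the text: error counters are held in a dictionary keyed by 1-based command issue number, the window includes the segment being accessed itself, and the window is clipped at segment 1. The name `neighborhood_alert` is hypothetical.

```python
def neighborhood_alert(
    error_counters: dict[int, int],
    current_number: int,
    segment_length: int = 30,
    reach: int = 10,
    min_nonzero: int = 5,
) -> bool:
    """True if enough non-zero error counters lie near the current segment."""
    segment = (current_number - 1) // segment_length + 1
    first_segment = max(1, segment - reach)
    last_segment = segment + reach
    # Convert the segment window back to a range of command issue numbers.
    first_number = (first_segment - 1) * segment_length + 1
    last_number = last_segment * segment_length
    nonzero = sum(
        1
        for number, count in error_counters.items()
        if first_number <= number <= last_number and count > 0
    )
    return nonzero >= min_nonzero


counters = {100: 1, 130: 2, 200: 1, 250: 3, 400: 1}
print(neighborhood_alert(counters, current_number=310))
```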

このように近傍セグメントエリアでＨＤＤ３００のＳＭＡＲＴ情報を参照して障害の発生位置を評価することにより、障害の進行を捉え、故障を予測する。 As described above, the failure occurrence position is evaluated by referring to the SMART information of the HDD 300 in the neighboring segment area, thereby grasping the progress of the failure and predicting the failure.

アクセス対象のセグメントの前に１０個、後に１０個という隣接する近傍セグメントエリアのセグメント数と、近傍セグメントエリア内でのエラーカウンタが５個以上という閾値は、障害が予測できる代替セクタ発生の分布が明確に観察される値として特定されたものであるが、典型例であり、これ以外の値を用いてもよい。これらのパラメータの値はＨＤＤ３００の容量で変化する。ＨＤＤ３００の最大容量が小さい場合、ＨＤＤ３００全体をアクセスするコマンド発行番号領域が小さいので、少しの変化で遅延アクセスエリアは拡散しやすいことから、これらのパラメータ値は小さくする必要がある。ＨＤＤ３００の最大容量が大きい場合、ＨＤＤ３００全体をアクセスするコマンド発行番号領域が大きいので、少しの変化では遅延アクセスエリアは拡散しにくいので、これらのパラメータ値は大きくする必要がある。 The size of the neighboring segment area (10 segments before and 10 after the segment being accessed) and the threshold of five or more error counters within that area were identified as values at which a distribution of alternative sector occurrences that allows failure prediction is clearly observed. They are typical examples, and other values may be used. The values of these parameters vary with the capacity of the HDD 300. When the maximum capacity of the HDD 300 is small, the command issue number area covering the entire HDD 300 is small, so the delayed access area spreads easily with a slight change, and these parameter values need to be made smaller. When the maximum capacity of the HDD 300 is large, the command issue number area covering the entire HDD 300 is large, so the delayed access area does not spread easily with a slight change, and these parameter values need to be made larger.

障害の広がりが検出された場合、物理的障害が現在評価しているコマンド発行番号領域にとどまらず、まだ評価していないコマンド発行番号領域にも広まっている可能性が高く、評価の進行過程で読み込みエラーまで発展し、データを読み出せなくなる恐れがあるため、早急にＨＤＤ停止勧告を発し、ＨＤＤの使用を停止させる必要がある。 If a failure spread is detected, it is likely that the physical failure is not limited to the command issue number area currently being evaluated, but is also spread to the command issue number region that has not yet been evaluated. Since there is a possibility that data may not be read due to development up to a read error, it is necessary to promptly issue an HDD stop recommendation and stop using the HDD.

このように、障害解析結果から障害の進行が解析でき、直ぐにでもＨＤＤからデータを取り出す必要がある場合を除き、通常はエラーカウンタの更新を行い、ステップＳ３２５で全セグメントのエラーカウンタが所定数（例えば１０個）を超えた場合（Ｓ３２５のＹＥＳ）、ステップＳ３３０でＨＤＤの停止警告を発する。 Thus, except when the failure analysis shows that the failure is progressing and the data must be retrieved from the HDD immediately, the error counter is normally just updated. If, in step S325, the error counters across all segments exceed a predetermined number (for example, 10) (YES in S325), an HDD stop warning is issued in step S330.

全セグメントを通して、エラーカウンタが１０個以上という閾値は、障害に特定の広がりが見えない場合でも、ＨＤＤ３００上に代替セクタやアクセス遅延領域が点在して増加していく場合、障害が進行していることを示す値である。１０個という閾値は、ＨＤＤ３００の容量当たりでの発生個数でホストのアクセスに障害を与え始める値の総数を示し、この値はＨＤＤ３００の容量で変化する。所定数としてこれ以外の個数を用いてもよい。 The threshold of 10 or more error counters across all segments is a value indicating that a failure is progressing when alternative sectors and access-delay areas increase in scattered locations on the HDD 300, even if the failure shows no particular spatial spread. The threshold of 10 represents the total number of occurrences, relative to the capacity of the HDD 300, at which host access begins to be impaired, and this value varies with the capacity of the HDD 300. A number other than this may be used as the predetermined number.
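The whole-disk check of step S325 can be sketched as a count over all error counters. The text leaves some room for interpretation, so this sketch makes two assumptions explicit: it counts the number of non-zero error counters (rather than summing their values), and it uses a strict "exceeds" comparison as in the description of S325. The name `total_spread_alert` is hypothetical.

```python
def total_spread_alert(error_counters: dict[int, int], limit: int = 10) -> bool:
    """True if the number of non-zero error counters exceeds limit."""
    nonzero = sum(1 for count in error_counters.values() if count > 0)
    return nonzero > limit


scattered = {n * 500: 1 for n in range(1, 12)}  # 11 isolated occurrences
print(total_spread_alert(scattered))
```

If the comparison is meant to be "10 or more", replacing `>` with `>=` is the only change needed.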

以上述べたように、本実施の形態のＨＤＤ故障予測装置１００による故障予測手順によれば、代替セクタの発生位置を明確にすることにより、代替セクタの発生位置とその発生個数から起きている障害に重み付けを行って深刻度を評価することができ、障害の進行を的確に捉えることができる。その結果、ＨＤＤ３００の故障を正確に予測することができ、ＨＤＤ３００内のデータの損失を防ぐことができる。 As described above, the failure prediction procedure of the HDD failure prediction apparatus 100 according to the present embodiment identifies where alternative sectors occur, so that the ongoing failure can be weighted by the positions and number of alternative sector occurrences to evaluate its severity, and the progress of the failure can be accurately grasped. As a result, failure of the HDD 300 can be accurately predicted, and loss of data in the HDD 300 can be prevented.

代替セクタの発生により機能回復した場合は、単発的であればＨＤＤ３００の機能を正常化させた物としてそれ以降エラーカウントの累積は急激に進行しないが、近傍セグメントの連続した領域においてエラーカウントが増加するようなら、代替セクタの発生による傷やヘッド不良による書き込みミスが発生していると考えられる。このように、各セグメントにおけるＳＭＡＲＴ情報の代替セクタの発生数をエラーカウントとして捉えることによって、ディスク全体において、問題が発生しているセグメントの進行の分布を明確にすることが出来るので、その分布の進行状態から、ＨＤＤ３００の故障が近いことを判断することができる。 If the function is restored due to the occurrence of a substitute sector, if it is a single occurrence, the error count will not increase rapidly as the HDD 300 function has been normalized, but the error count will increase in a contiguous area of neighboring segments. If this is the case, it is considered that a flaw due to the occurrence of an alternative sector or a write error due to a defective head has occurred. In this way, by grasping the number of occurrences of the alternative sector of SMART information in each segment as an error count, the distribution of the progress of the segment in which the problem has occurred can be clarified in the entire disk. From the progress state, it can be determined that the failure of the HDD 300 is near.

最終的にエラーカウンタは、代替セクタ数の変化で累計されたカウンタの値から、故障予測とする閾値を超えたことを判断し、ＨＤＤ３００に対し停止警告を表示灯などにより知らせるために用いられる。これは、ブザー等による警報であってもよく、本システムの停止機能と連動させてもよい。また、外部のシステムと連携し、エラーカウンタの値に応じて、ＨＤＤ等の記憶装置の購入に係る情報（広告情報など）を表示したり、クラウド等を用いたバックアップサービスの利用を促したり、記憶装置の購入やバックアップサービスの利用を促進するための優待サービス（クーポン券の提示など）を実施してもよい。このように、必要度の高いユーザにピンポイントで適切な情報を提供することにより、ユーザの利便性が向上するとともに、関連商品やサービスの売上増加が期待できる。 Finally, the error counter is used to determine from the counter value accumulated due to the change in the number of alternative sectors that the threshold for failure prediction has been exceeded and to notify the HDD 300 of a stop warning by an indicator lamp or the like. This may be a warning by a buzzer or the like, and may be linked with a stop function of the present system. In addition, in cooperation with an external system, according to the value of the error counter, information related to the purchase of a storage device such as an HDD (advertising information, etc.) is displayed, the use of a backup service using the cloud or the like is promoted, A preferential service (such as presentation of a coupon) for promoting the purchase of a storage device or the use of a backup service may be implemented. In this way, by providing pinpoint appropriate information to highly necessary users, it is possible to improve user convenience and increase sales of related products and services.

（実施の形態２） (Embodiment 2)
ＨＤＤは動作不具合を起こした場合、書き込まれたデータの保証がない。ＨＤＤの動作不具合は一旦起きてしまうと、基本的に内部データを読み出すことができないため、大きな損失が発生する。すなわち、故障であることを気がついた段階では、ＨＤＤ内のどこかのデータを失うことを避けることができない。そこで、本発明の実施の形態に係るＨＤＤ故障予測装置１００では、運用上のデータの読み出しが正常にできる限界としてのＨＤＤの寿命を予測し、ＨＤＤの故障により、データを失うことを回避することを目的とする。 When an HDD malfunctions, there is no guarantee for the written data. Once an HDD malfunction occurs, the internal data basically cannot be read, resulting in a large loss. That is, by the time a failure is noticed, losing some of the data in the HDD cannot be avoided. Therefore, the HDD failure prediction apparatus 100 according to the embodiment of the present invention predicts the life of the HDD as the limit up to which operational data can be read normally, and aims to avoid losing data due to an HDD failure.

転送時間の比較によるＨＤＤの寿命予測について、特開２０１１−６８１０９号公報にはアクセス時間を評価する方法が記載されている。しかしながら、ＨＤＤの障害について、特に経年劣化に伴う内部パーツの摩耗から来る障害は、現在障害が発生している位置にとどまらず、時間とともに拡大する傾向があるので、その予兆を高い精度でとらえないと、時間の経過とともにＨＤＤの内部データを失う可能性が高くなる。 Regarding HDD life prediction by comparing transfer times, Japanese Patent Laid-Open No. 2011-68109 describes a method for evaluating access time. However, HDD failures, particularly those arising from wear of internal parts due to aging, do not stay at the location where the failure currently occurs but tend to spread over time. Unless the signs are captured with high accuracy, the likelihood of losing the HDD's internal data increases as time passes.

特に、障害へと進行しつつあるエリアを再アクセスすることは重大な障害へ進展する可能性が高いので、障害の進行を予測できることは、その後のＨＤＤの内部データの救済策を講じる手法を決める上で非常に重要な目安となってくる。 In particular, re-accessing an area that is progressing to a failure is likely to progress to a serious failure, so the ability to predict the progression of the failure will determine how to take subsequent HDD internal data remedies. This is a very important guideline.

そこで、実施の形態２のＨＤＤ故障予測装置１００では、特開２０１１−６８１０９号公報では使用していなかったＨＤＤのＳＭＡＲＴ情報を使用するとともに、特定のアクセスパターンによるアクセス時間の散布図にもとづいて、アクセス時間によって故障を予測する際に使用する閾値を決定することにより、故障予測の精度を上げる方法を採用する。 Therefore, in the HDD failure prediction apparatus 100 according to the second embodiment, the HDD SMART information that is not used in Japanese Patent Laid-Open No. 2011-68109 is used, and based on the scatter diagram of the access time according to a specific access pattern, A method of increasing the accuracy of failure prediction by determining a threshold value used when predicting failure according to access time is adopted.

実施の形態２に係るＨＤＤ故障予測装置１００の構成図は図１に示した実施の形態１に係るＨＤＤ故障予測装置１００の構成図と同じである。ここでは、実施の形態１と共通する構成と動作の説明は適宜省略し、実施の形態１と異なる構成と動作について説明する。 The configuration diagram of the HDD failure prediction apparatus 100 according to the second embodiment is the same as that of the HDD failure prediction apparatus 100 according to the first embodiment shown in FIG. 1. Here, description of the configuration and operation common to the first embodiment is omitted as appropriate, and the configuration and operation that differ from the first embodiment are described.

制御部３０は、ＨＤＤコントローラ１０の書き込み時のコマンド実行時間（アクセス時間）を測定するとともに、ＨＤＤコントローラ１０により読み出されたＳＭＡＲＴ情報から代替セクタ数を抽出し、アクセス時間と代替セクタ数の変化からＨＤＤ３００の寿命を判断する。 The control unit 30 measures the command execution time (access time) at the time of writing by the HDD controller 10, extracts the number of alternative sectors from the SMART information read by the HDD controller 10, and changes the access time and the number of alternative sectors. From the above, the life of the HDD 300 is determined.

異常値ＤＢ記録部４０は、アクセス時間の異常を判定するための閾値を記憶するとともに、遅延アクセス発生時に遅延アクセスが起きたコマンド発行番号と遅延アクセス時間と遅延アクセス発生時のＳＭＡＲＴ情報をデータベースとして登録する。本実施例では後述するように、所定数のセクタ単位（特定容量単位）でＨＤＤ３００にアクセスする。 The abnormal value DB recording unit 40 stores a threshold for determining an abnormality in access time and, when a delayed access occurs, registers as a database the command issue number at which the delayed access occurred, the delayed access time, and the SMART information at the time of the delayed access. In this embodiment, as described later, the HDD 300 is accessed in units of a predetermined number of sectors (specific capacity units).

ホスト２００がＨＤＤ３００に対してデータの書き込みコマンドを発行すると、書き込まれるデータはＨＤＤコントローラ１０を経由して一時記憶部２０に一時的に記憶される。一時的に記憶された書き込みデータが所定の容量に達すると、ＨＤＤコントローラ１０は時間軸上で古いデータから図１１に示すアクセスパターンに従い、ＨＤＤ３００に書き込む。この処理の詳細については後述する。 When the host 200 issues a data write command to the HDD 300, the data to be written is temporarily stored in the temporary storage unit 20 via the HDD controller 10. When the temporarily stored write data reaches a predetermined capacity, the HDD controller 10 writes it to the HDD 300, oldest data first, according to the access pattern shown in FIG. 11. Details of this processing will be described later.

このとき、制御部３０は、ＨＤＤコントローラ１０からデータがＨＤＤ３００に書き込まれたときのコマンド実行時間（アクセス時間）を測定し、あらかじめ異常値ＤＢ記録部４０に記憶された閾値を読み出し、現在のアクセス時間がこの閾値を超えているか否かを判定する。 At this time, the control unit 30 measures the command execution time (access time) when data is written from the HDD controller 10 to the HDD 300, reads the threshold stored in advance in the abnormal value DB recording unit 40, and determines whether the current access time exceeds this threshold.

アクセス時間が閾値を超えたことが確認された場合、制御部３０は、アクセス時間が閾値を超えたＬＢＡとアクセス時間とそのときのＳＭＡＲＴ情報を記憶し、後述のワーニング（警告）カウンタおよびエラーカウンタを計数して、ワーニングカウンタおよびエラーカウンタの推移によってＨＤＤ３００の故障予測を行い、寿命に達したことを予測したときはＨＤＤの停止警告を発する。 When it is confirmed that the access time exceeds the threshold, the control unit 30 stores the LBA, the access time and the SMART information at that time when the access time exceeds the threshold, and a warning (warning) counter and an error counter described later. The failure of the HDD 300 is predicted based on the transition of the warning counter and the error counter. When it is predicted that the life has been reached, an HDD stop warning is issued.

ＳＭＡＲＴ情報における代替セクタ数は、ＨＤＤの障害を知る上で重要な値であるが、ＨＤＤのＳＭＡＲＴ情報において代替セクタ数は単に発生数を数値で示しているだけであり、代替セクタがどのセクタに発生したかをＨＤＤの動作中にリアルタイムに知る方法はなかった。そのため、既存の技術ではＳＭＡＲＴ情報は、ＨＤＤの内部障害が発生した後の原因の解析に使われることがほとんどである。 The number of alternative sectors in the SMART information is an important value for understanding HDD failures. However, the SMART information only reports the number of alternative sectors as a count, and there has been no way to know in real time, while the HDD is operating, in which sectors the alternative sectors occurred. For this reason, with existing techniques, SMART information is mostly used to analyze causes after an internal HDD failure has occurred.

実際の動作では、代替セクタはいきなり発生するわけではなく、ＨＤＤメーカー所定のリトライ回数を経て、本来書き込もうとしていたセクタに書き込めなかった場合、代替セクタが発生する。障害が進行する過程において、代替セクタが発生したセクタをアクセスすると、障害の予兆として、リトライが発生し、アクセス遅延が生じている。このリトライの過程で、データが読めて代替セクタが発生すればデータは守られるが、リトライを繰り返す過程でデータが読めなくなり、代替セクタへ移行できない場合もある。ここまで障害が進行すると、障害が発生したセクタに書かれていたデータが失われてしまう。 In the actual operation, the alternative sector does not occur suddenly, and an alternative sector is generated when writing to the sector originally intended to be written after a predetermined number of retries by the HDD manufacturer is not possible. When a sector in which an alternative sector has occurred is accessed in the process of failure progression, a retry occurs as a sign of failure, resulting in an access delay. In this retry process, if the data is read and an alternative sector is generated, the data is protected. However, in the process of repeating the retry, the data cannot be read and may not be transferred to the alternative sector. If the failure progresses so far, the data written in the sector where the failure has occurred will be lost.

動作中のＨＤＤ内部において、その障害が発生しているセクタを特定して集計することにより、障害がどのセクタにおいて時間経過とともに進行しているかを知ることができ、障害の分布の集計から故障の到来を予測することが可能になる。 By identifying and tallying the sectors in which failures occur inside the operating HDD, it is possible to know in which sectors the failure is progressing over time, and the approach of a failure can be predicted from the tallied failure distribution.

そこで、実施の形態２のＨＤＤ故障予測装置１００では、あらかじめ特定の方法で決定されたＨＤＤの正常動作時のアクセス時間の閾値を異常値ＤＢ記録部４０に記録しておき、ＨＤＤのアクセス時間が正常動作時の閾値を越えたことをトリガーとして代替セクタ数に代表されるＳＭＡＲＴ情報を読み取り、前回、アクセス時間が閾値を超えたときのＳＭＡＲＴ情報と比較する。基本的に、閾値を超えてアクセス時間を要するセクタはアクセス異常を起こすことによって処理時間が余計にかかっていることから、ＨＤＤ内部の制御システムはその異常内容をＳＭＡＲＴ情報として残している可能性が極めて高い。ただし、このように閾値を超えたアクセス時間がかかっても、そのセクタに何とかデータが書き込めた或いは読み込めた場合は、ＳＭＡＲＴ情報に反映されない場合もある。 Therefore, in the HDD failure prediction apparatus 100 of the second embodiment, a threshold for the access time during normal operation of the HDD, determined in advance by a specific method, is recorded in the abnormal value DB recording unit 40. When the HDD access time exceeds this normal-operation threshold, that event triggers a read of the SMART information, typified by the number of alternative sectors, which is then compared with the SMART information from the previous time the access time exceeded the threshold. Basically, a sector whose access time exceeds the threshold is taking extra processing time because of an access abnormality, so it is extremely likely that the HDD's internal control system has recorded the abnormality in the SMART information. However, even when the access time exceeds the threshold in this way, if the data was somehow successfully written to or read from the sector, the event may not be reflected in the SMART information.

このように、アクセス時間が閾値を越えることは、ＨＤＤ内部で正常時よりも何らかのアクセス遅延を起こす要因が発生していることを示している。しかしＳＭＡＲＴ情報の更新につながるアクセス時間の閾値などの情報は、ＨＤＤメーカー毎に異なり、必ずしも明確ではない。しかしながら、事前にＨＤＤの正常動作時のアクセス時間の閾値を基準としてＳＭＡＲＴ情報の読み込みを行うならば、ＨＤＤメーカーから障害判定の閾値データを入手したり、事前の障害解析のような複雑なデータ解析をすることなく、ＨＤＤの故障予測ができるメリットがある。 Thus, an access time exceeding the threshold indicates that some factor causing a longer access delay than in normal operation has arisen inside the HDD. However, information such as the access time threshold that leads to an update of the SMART information differs among HDD manufacturers and is not necessarily clear. Nevertheless, if the SMART information is read using a previously determined threshold for the access time during normal operation of the HDD as the reference, there is the advantage that HDD failure can be predicted without obtaining failure-determination threshold data from the HDD manufacturer and without performing complex data analysis such as a prior failure analysis.

このように、実施の形態２のＨＤＤ故障予測装置１００は、ＨＤＤの正常動作時のアクセス時間の閾値を基準として、アクセス時間が閾値を超えた場合に、どのセクタでＳＭＡＲＴ情報が変化するかを把握して集計することにより、ＨＤＤ内部の障害の進行をより正確に捉える。リアルタイムにＳＭＡＲＴ情報とセクタの状態を関連づけて障害発生の予兆をとらえるため、高い精度で故障予測することができる。 As described above, the HDD failure prediction apparatus 100 according to the second embodiment determines in which sector the SMART information changes when the access time exceeds the threshold, based on the access time threshold during normal operation of the HDD. By grasping and tabulating, the progress of failures inside the HDD can be grasped more accurately. Since the SMART information and the state of the sector are correlated in real time to detect a sign of a failure occurrence, the failure can be predicted with high accuracy.

ここで、ＨＤＤ３００のアクセス時間の概要を説明する。図１４（ａ）は、ＨＤＤ３００の正常時のアクセス時間を示し、図１４（ｂ）は、ＨＤＤ３００の異常発生時のアクセス時間を示す。 Here, an outline of the access time of the HDD 300 will be described. FIG. 14A shows the access time when the HDD 300 is normal, and FIG. 14B shows the access time when an abnormality occurs in the HDD 300.

正常時のアクセス時間は、図１４（ａ）に示すように、シーク時間、回転待ち時間、集束時間などのヘッド動作に依存する時間と、データ転送時間との合計で表わすことができる。 As shown in FIG. 14A, the normal access time can be represented by the sum of the data transfer time and the time depending on the head operation such as seek time, rotation waiting time, and focusing time.

異常発生時のアクセス時間は、図１４（ｂ）に示すように、正常時のアクセス時間に加えて、代替セクタ発生時にはリトライ時間、代替セクタ処理時間等が加わるため、正常アクセス時の数十倍の処理時間を要する。よって、アクセス時間と代替セクタの変化には関連性があり、アクセス時間の伸びているセクタは、この直後に読み取るＳＭＡＲＴ情報で代替セクタが発生しているか、或いは、発生の可能性が高い。 As shown in FIG. 14 (b), the access time when an abnormality occurs is several tens of times that during normal access because a retry time, alternative sector processing time, etc. are added in addition to the normal access time. Processing time is required. Therefore, there is a relation between the access time and the change of the alternative sector, and in the sector where the access time is extended, the alternative sector is generated or is highly likely to occur in the SMART information read immediately after this.

ＨＤＤ３００は、一定速度でディスクが回転しているため、ヘッドが目標セクタにアクセスを行う際、アクセスタイミングによっては最大でディスク１周分の回転待ち時間が発生する。また、前回のアクセスが終了した時のヘッド位置によってこれからアクセスする位置までのシーク時間が変動するため、総合的なアクセス時間も変動する。その結果、アクセスに至るまでの集束時間も異なるため、工場出荷時と同じアクセス時間で対象セクタにアクセスできることはなく、アクセスごとにばらつく。そこで、本実施の形態では、図１１に示すアクセスパターンを用いることにより、アクセス時間のばらつきに対処している。 Because the disk in the HDD 300 rotates at a constant speed, when the head accesses the target sector, a rotational wait of up to one full revolution may occur depending on the access timing. In addition, the seek time to the position to be accessed next varies with the head position at the end of the previous access, so the total access time also varies. As a result, the focusing time before the access begins also differs, so the target sector is never accessed in exactly the same access time as at factory shipment, and the access time varies from access to access. Therefore, in the present embodiment, the access pattern shown in FIG. 11 is used to deal with this variation in access time.

図７は、ＨＤＤ故障予測装置１００による故障予測手順を示すフローチャートである。 FIG. 7 is a flowchart showing a failure prediction procedure performed by the HDD failure prediction apparatus 100.

ステップＳ４０１において、ＨＤＤコントローラ１０は、図１１のアクセスパターンに従いＨＤＤ３００にアクセスし、制御部３０は、そのときのコマンド実行時間を測定する。 In step S401, the HDD controller 10 accesses the HDD 300 according to the access pattern of FIG. 11, and the control unit 30 measures the command execution time at that time.

図１１に示すアクセスパターンでは、１つのコマンド発行番号に対応して、特定容量のデータが書き込まれるようになっている。本実施例では、特定容量を２５６セクタにしているが、それ以外のセクタ数を用いてもよく、これに限定される訳ではない。ＨＤＤ３００の場合、以前のアクセスが終了した時のヘッド位置が不特定であると、特にシーク時間にばらつきが生じ、相対的に正確なアクセス時間が測定できない。そこで図１１に示すようにヘッドの位置をアクセス終了後、常に初期位置（ここではセクタ０の位置）にリセットしてから特定容量（２５６セクタ）の書き込みを順次行うことにより、より正確なアクセス時間の測定を可能としている。 In the access pattern shown in FIG. 11, data of a specific capacity is written in correspondence with one command issue number. In this embodiment, the specific capacity is 256 sectors, but the number of other sectors may be used and is not limited to this. In the case of the HDD 300, if the head position at the time when the previous access is ended is unspecified, the seek time varies particularly, and the relatively accurate access time cannot be measured. Therefore, as shown in FIG. 11, the head position is always reset to the initial position (in this case, the position of sector 0) after the end of access, and then the specific capacity (256 sectors) is sequentially written, thereby enabling a more accurate access time. Measurement is possible.
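The access pattern described above (return the head to the initial position, then write one specific capacity, repeated for successive command issue numbers) can be sketched as an operation sequence. FIG. 11 is not reproduced here, so the mapping of command issue number n to the starting LBA (n - 1) x 256 is an assumption, as are the function name and the tuple layout.

```python
from typing import Iterator


def access_pattern(
    count: int, unit_sectors: int = 256, home_lba: int = 0
) -> Iterator[tuple[str, int, int]]:
    """Yield (operation, start_lba, sector_count) tuples.

    Assumes 1-based command issue numbers laid out contiguously from LBA 0.
    """
    for number in range(1, count + 1):
        # Reset the head so every measurement starts from the same position.
        yield ("seek", home_lba, 0)
        # Write one specific capacity for this command issue number.
        yield ("write", (number - 1) * unit_sectors, unit_sectors)


for op in access_pattern(2):
    print(op)
```

Starting every write from the same head position is what makes the seek component of the measured access time depend only on the command issue number.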

図１２Ａは、図１１のアクセスパターンにしたがってＨＤＤ３００にアクセスしたときのアクセス時間の模式図である。横軸はコマンド発行番号、縦軸はコマンド発行番号ごとのアクセス時間である。図１２Ｂは、図１２Ａの模式図を実測値によって示したグラフである。測定データをプロットすると、図１２Ｂに示すような散布図が得られ、アクセス時間が右肩上がりの帯状に分布する。 FIG. 12A is a schematic diagram of access time when the HDD 300 is accessed according to the access pattern of FIG. The horizontal axis represents the command issue number, and the vertical axis represents the access time for each command issue number. FIG. 12B is a graph showing the schematic diagram of FIG. 12A with actually measured values. When the measurement data is plotted, a scatter diagram as shown in FIG. 12B is obtained, and the access time is distributed in the shape of a band rising upward.

ＨＤＤ３００の回転待ちと集束時間が無い理想的な状態であれば、アクセスターゲットとなるセクタに対するアクセス時間をプロットした散布図は、ほぼ１本の線になるはずである。しかしながら、これまで述べたようにディスクのアクセスについては常に回転待ちと集束時間についてばらつきが存在するので、図１１のアクセスパターンで示す特定容量ごとのアクセス時間をプロットすると、実際は図１２Ｂのように特定のばらつきを持った帯のような形をなし、コマンド発行番号とアクセス時間の間には強い相関がある。図１２Ｂの帯全体の傾きは、図１１で示すところのコマンド発行番号の増加に伴い、アクセス対象のセクタがヘッドのリセット位置から遠くなることによる主にシーク時間の増大が要因である。 In an ideal state with no rotational wait or focusing time in the HDD 300, a scatter diagram plotting the access time for each target sector would be almost a single line. However, as described above, disk access always involves variation in rotational wait and focusing time. When the access time for each specific capacity in the access pattern of FIG. 11 is plotted, the points actually form a band with a certain spread, as shown in FIG. 12B, and there is a strong correlation between the command issue number and the access time. The slope of the band as a whole in FIG. 12B is mainly due to the increase in seek time as the command issue number in FIG. 11 increases and the sector being accessed moves farther from the head reset position.

なお、コマンド実行時間（アクセス時間）の測定は、ＨＤＤ故障予測装置１００で行ってもよいし、同じアクセスパターンを発生する外部機器で行ってもよい。 Note that the command execution time (access time) may be measured by the HDD failure prediction apparatus 100 or by an external device that generates the same access pattern.

次にステップＳ４０２では、ステップＳ４０１で測定された散布図における帯の上端に相当する値を検出する。そして、この値をコマンド発行番号毎の閾値（故障予測閾値）として用いる。この値は、シーク時間、回転待ち時間、集束時間などのヘッド動作に依存する待ち時間が極大となる場合のアクセス時間であり、ＨＤＤ３００が正常である時は、アクセス時間がこの値以下に収まるという特徴がある。従って、この値を故障予測の閾値として用いることにより、アクセス時間がこの閾値を超えたら、ＨＤＤ３００のアクセスが正常でないことが把握できる。この閾値の具体的な検出方法については後述する。 Next, in step S402, a value corresponding to the upper edge of the band in the scatter diagram measured in step S401 is detected, and this value is used as the threshold (failure prediction threshold) for each command issue number. This value is the access time when the waiting times that depend on head operation, such as seek time, rotational latency, and focusing time, are at a local maximum; when the HDD 300 is normal, the access time stays at or below this value. Therefore, by using this value as the failure prediction threshold, it can be recognized that access to the HDD 300 is not normal once the access time exceeds the threshold. A specific method for detecting this threshold is described later.

次にステップＳ４０３では、ステップＳ４０２で算出した閾値を異常値ＤＢ記録部４０に登録する。具体的には、図１５（ａ）に示すように、コマンド発行番号と閾値とを対応させて記録する。 In step S403, the threshold value calculated in step S402 is registered in the abnormal value DB recording unit 40. Specifically, as shown in FIG. 15A, a command issue number and a threshold value are recorded in association with each other.

ステップＳ４０１〜ステップＳ４０３は、故障予測を行う事前処理あるいは初期設定処理である。 Steps S401 to S403 are pre-processing or initial setting processing for performing failure prediction.

次にステップＳ４０４では、ＨＤＤ３００の使用時において故障予測動作を行う。制御部３０がコマンド発行領域のアクセス時間が異常値ＤＢ記録部４０に記録した閾値を超えていないかを監視し、閾値を超えたコマンド発行番号領域については異常値ＤＢ記録部４０にコマンド発行番号とアクセス時間とＳＭＡＲＴ情報を記録する。現在のＳＭＡＲＴ情報は、前回、アクセス時間が閾値を超えたときのＳＭＡＲＴ情報と比較される。そのため、異常値ＤＢ記録部４０には、前回、アクセス時間が閾値を超えたときのコマンド発行番号のＳＭＡＲＴ情報が一時的に記憶され、現在のコマンド発行番号のＳＭＡＲＴ情報との比較に用いられる。 Next, in step S404, the failure prediction operation is performed while the HDD 300 is in use. The control unit 30 monitors whether the access time of each command issue area exceeds the threshold recorded in the abnormal value DB recording unit 40, and for any command issue number area that exceeds the threshold, it records the command issue number, the access time, and the SMART information in the abnormal value DB recording unit 40. The current SMART information is compared with the SMART information from the previous time the access time exceeded the threshold. For this purpose, the abnormal value DB recording unit 40 temporarily stores the SMART information for the command issue number at which the access time previously exceeded the threshold, and this is used for comparison with the SMART information for the current command issue number.

制御部３０は、異常値ＤＢ記録部４０に記録されたワーニングカウンタとエラーカウンタを計数し、その結果、ＨＤＤにおけるアクセス時間が閾値を超えたセクタとそのセクタにおけるＳＭＡＲＴ情報の変化の分布が故障予測と判定されるレベルにまで達したとき、ＨＤＤ３００の停止警告を発し、処理を終了する。 The control unit 30 counts the warning counters and error counters recorded in the abnormal value DB recording unit 40. When, as a result, the distribution of the sectors of the HDD whose access time exceeded the threshold, together with the changes in SMART information in those sectors, reaches a level at which a failure is predicted, the control unit 30 issues a stop warning for the HDD 300 and ends the process.

ステップＳ４０１〜Ｓ４０３の処理において、制御部３０は閾値算出部として動作する。閾値算出部は制御部３０とは別の回路としてもよい。 In the processing of steps S401 to S403, the control unit 30 operates as a threshold value calculation unit. The threshold calculation unit may be a circuit different from the control unit 30.

なお、ＨＤＤの型番とファームウェアが同じであれば、図１２Ｂの散布図の帯から得られる閾値は同じであるから、新たに閾値を作成する必要はないため、ステップＳ４０１〜ステップＳ４０３を省略し、他のＨＤＤで測定した閾値を用いて、ステップＳ４０４の故障予測を開始することができる。他のＨＤＤで測定した閾値を用いる場合、ＨＤＤ故障予測装置１００に閾値算出部を備える必要はない。 Note that if the HDD model number and firmware are the same, the threshold value obtained from the band of the scatter diagram in FIG. 12B is the same, so there is no need to create a new threshold value, so steps S401 to S403 are omitted. The failure prediction in step S404 can be started using the threshold value measured by another HDD. When using a threshold value measured with another HDD, the HDD failure prediction apparatus 100 does not need to include a threshold value calculation unit.

ここで、ステップＳ４０２の閾値を検出する方法を詳細に説明する。ステップＳ４０２の第１の方法を説明する。 Here, the method for detecting the threshold value in step S402 will be described in detail. The first method in step S402 will be described.

図１３に示すように、所定数Ｎ個のコマンド発行番号ごとに、コマンド発行番号をグループ化する。以下では、このグループを「コマンド発行セグメント」あるいは単に「セグメント」と称する。また、所定数Ｎを「セグメント長」と称する。ここで、所定数Ｎ（セグメント長）は、１つのセグメントにシーク時間、回転待ち時間、集束時間などのヘッド動作に依存する待ち時間が極大となる点が、おおよそ１つ以上含まれるように設定する。典型的には、Ｎ＝３０〜５０とするのがよい。例えば、Ｎ＝３０とする場合、コマンド発行番号＝１〜３０をセグメント１、コマンド発行番号＝３１〜６０をセグメント２、コマンド発行番号＝６１〜９０をセグメント３とし、以下同様に、コマンド発行番号とセグメントを対応させる。 As shown in FIG. 13, the command issue numbers are grouped into sets of a predetermined number N. Hereinafter, such a group is referred to as a "command issue segment" or simply a "segment", and the predetermined number N is referred to as the "segment length". The predetermined number N (segment length) is set so that one segment contains roughly one or more points at which the waiting time that depends on head operation, such as seek time, rotational latency, and focusing time, is at a local maximum. Typically, N = 30 to 50 is suitable. For example, when N = 30, command issue numbers 1 to 30 form segment 1, command issue numbers 31 to 60 form segment 2, command issue numbers 61 to 90 form segment 3, and command issue numbers are associated with segments in the same way thereafter.

次に、セグメントごとにアクセス時間の最大値を検出する。そして、その最大値をそのセグメントにおける閾値とする。例えば、Ｎ＝３０であり、セグメント１の中で、コマンド発行番号＝１２において、アクセス時間が最大となり、最大値が３０ｍｓｅｃとなる場合、コマンド発行番号１〜３０に対応する閾値を全て３０ｍｓｅｃとする。あるいは、各セグメントにおけるアクセス時間の最大値に所定倍率を乗じた値をそのセグメントの閾値としてもよい。例えば、所定倍率＝１．２とし、最大値３０ｍｓｅｃ×１．２＝３６ｍｓｅｃを当該セグメントの閾値としてもよい。あるいは、各セグメントにおけるアクセス時間の最大値に所定値を加算した値を閾値としてもよい。例えば、所定値＝５ｍｓｅｃとし、最大値３０ｍｓｅｃ＋５ｍｓｅｃ＝３５ｍｓｅｃを閾値としてもよい。 Next, the maximum access time is detected for each segment, and that maximum is used as the threshold for the segment. For example, when N = 30 and, within segment 1, the access time is largest at command issue number 12 with a maximum of 30 msec, the thresholds for command issue numbers 1 to 30 are all set to 30 msec. Alternatively, the maximum access time in each segment multiplied by a predetermined factor may be used as the threshold of that segment; for example, with a factor of 1.2, 30 msec × 1.2 = 36 msec may be used as the threshold of the segment. Alternatively, the maximum access time in each segment plus a predetermined value may be used as the threshold; for example, with a predetermined value of 5 msec, 30 msec + 5 msec = 35 msec may be used as the threshold.

この第１の方法で算出した閾値は、１つのセグメントに対応するコマンド発行番号においては、全て同じ値となる。従って、ステップＳ４０３において、コマンド発行番号ごとに閾値を記録せずに、図１５（ｂ）に示すように、セグメント番号と閾値を対応させて記録してもよい。 The threshold values calculated by the first method are all the same for the command issue numbers corresponding to one segment. Therefore, in step S403, the segment number and the threshold value may be recorded in correspondence with each other as shown in FIG. 15B without recording the threshold value for each command issue number.
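The first method can be illustrated with a minimal Python sketch. The function names, the 0-based list layout (index 0 holds command issue number 1), and the idea of exposing the factor and the added value as optional parameters of one function are assumptions made for illustration; the description presents the factor and the added value as separate alternatives.

```python
def segment_thresholds(access_ms, seg_len=30, scale=1.0, offset_ms=0.0):
    """Return one threshold per segment (hypothetical helper).

    access_ms[0] is the access time measured for command issue number 1.
    The threshold is the largest access time in the segment, optionally
    multiplied by `scale` or increased by `offset_ms`.
    """
    thresholds = []
    for start in range(0, len(access_ms), seg_len):
        peak = max(access_ms[start:start + seg_len])
        thresholds.append(peak * scale + offset_ms)
    return thresholds


def threshold_for(command_no, thresholds, seg_len=30):
    """Look up the threshold for a 1-based command issue number."""
    return thresholds[(command_no - 1) // seg_len]
```

Storing one value per segment, as in this sketch, corresponds to recording segment numbers against thresholds rather than recording a threshold for every command issue number.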

ステップＳ４０２の第２の方法を説明する。まず、第１の方法と同様に、所定数のコマンド発行番号ごとにセグメントを形成する。このセグメントは、後続の処理ステップで使用するためのもので、ステップＳ４０２においては、セグメントを使用しない。 The second method of step S402 will be described. First, as in the first method, a segment is formed for each predetermined number of command issue numbers. This segment is for use in subsequent processing steps, and in step S402, no segment is used.

次に、あるコマンド発行番号（コマンド発行番号ｉ）に対して、その前後の所定範囲のコマンド発行番号（ｉ−ｗ〜ｉ＋ｗ）を対象にアクセス時間の最大値を検出する。すなわち、数式（１）に従って、コマンド発行番号ｉに対応する閾値θ［ｉ］を算出する。ここで、ａ［ｉ］はコマンド発行番号ｉに対応するコマンド実行時間（アクセス時間）であり、ｗは正の整数であり、ｍａｘは引数に指定された値の中から最大値を返す関数である。数式（１）によれば、（２ｗ＋１）個のコマンド発行番号を対象にして最大値を検出することになる。正の整数ｗは、（２ｗ＋１）個のコマンド発行番号の中に、アクセス時間の極大値が１つ以上含まれるように設定するとよい。典型的には、ｗ＝１５〜２５を用いるとよい。（２ｗ＋１）がセグメント長Ｎと同じであってもよいし、異なっていてもよい。ステップＳ４０１で測定に用いた最大のコマンド発行番号をＰとすると、数式に従って、ｉ＝（ｗ＋１）〜（Ｐ−ｗ）に対応する閾値θ［ｗ＋１］〜θ［Ｐ−ｗ］を各々算出する。ｉ＝１〜ｗについては、θ［ｗ＋１］を流用し、ｉ＝（Ｐ−ｗ＋１）〜Ｐについては、θ［Ｐ−ｗ］を流用すればよい。 Next, for a given command issue number (command issue number i), the maximum access time is detected over the command issue numbers in a predetermined range before and after it (i−w to i+w). That is, the threshold θ[i] corresponding to command issue number i is calculated according to Equation (1). Here, a[i] is the command execution time (access time) corresponding to command issue number i, w is a positive integer, and max is a function that returns the largest of its arguments. According to Equation (1), the maximum is detected over (2w+1) command issue numbers. The positive integer w should be set so that the (2w+1) command issue numbers contain at least one local maximum of the access time. Typically, w = 15 to 25 is suitable. (2w+1) may be equal to or different from the segment length N. If P is the largest command issue number used in the measurement in step S401, the thresholds θ[w+1] to θ[P−w] corresponding to i = (w+1) to (P−w) are each calculated according to the equation. For i = 1 to w, θ[w+1] is reused, and for i = (P−w+1) to P, θ[P−w] is reused.
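Read from the prose alone, Equation (1) amounts to θ[i] = max(a[i−w], ..., a[i+w]). The following Python sketch implements that reading, including the reuse of θ[w+1] and θ[P−w] at the two ends. The function name and the 0-based list layout (index 0 holds command issue number 1) are assumptions for illustration, not part of the description.

```python
def sliding_max_thresholds(a, w):
    """Threshold per command issue number using a window of 2w+1 values.

    a[0] is the access time of command issue number 1, so list index k
    corresponds to command issue number k + 1 (hypothetical layout).
    """
    p = len(a)
    if w < 1 or p < 2 * w + 1:
        raise ValueError("need at least 2w+1 measurements and w >= 1")
    theta = [0.0] * p
    # Command issue numbers w+1 .. P-w: maximum over i-w .. i+w.
    for k in range(w, p - w):
        theta[k] = max(a[k - w:k + w + 1])
    # Command issue numbers 1 .. w reuse theta[w+1].
    for k in range(w):
        theta[k] = theta[w]
    # Command issue numbers P-w+1 .. P reuse theta[P-w].
    for k in range(p - w, p):
        theta[k] = theta[p - w - 1]
    return theta
```

The straightforward loop costs on the order of P × (2w+1) comparisons, which is small for the typical w = 15 to 25 mentioned in the text.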

また、数式（１）に従って算出した値に、更に移動平均処理を行って、閾値を算出してもよい。例えば、数式（１）の左辺を一時変数μ［ｉ］に代入し、数式（２）に従って、μ［ｉ］の移動平均を算出して閾値θ［ｉ］とする。ここで、ε［ｊ］は数式（３）を満たす重み係数である。またＬは正の整数であり、典型的には５〜１０に設定するとよい。数式（２）に従って閾値を算出することにより、閾値の変化が滑らかになり、精度よく故障予測できる場合がある。 In addition, the threshold may be calculated by further performing a moving average process on the value calculated according to Equation (1). For example, the left side of Expression (1) is substituted into the temporary variable μ [i], and the moving average of μ [i] is calculated according to Expression (2) as the threshold θ [i]. Here, ε [j] is a weighting coefficient that satisfies Equation (3). L is a positive integer and is typically set to 5-10. By calculating the threshold value according to the mathematical formula (2), the change in the threshold value becomes smooth and failure prediction may be possible with high accuracy.
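Equations (2) and (3) are not reproduced in this text, so the following Python sketch is only one plausible reading: a weighted moving average centered on i, with weights ε[j] that sum to 1. The symmetric window, the interpretation of L as a half-width, the default uniform weights, and the clamping of indices at both ends are all assumptions made for illustration.

```python
def smooth_thresholds(mu, half_width, weights=None):
    """Weighted moving average of mu over 2*half_width+1 neighbours.

    `weights` plays the role of the coefficients epsilon[j]; they are
    assumed to sum to 1. Indices outside the list are clamped to the
    nearest valid index (an assumption, not stated in the text).
    """
    n = 2 * half_width + 1
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights by default
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must have 2*half_width+1 entries summing to 1")
    last = len(mu) - 1
    theta = []
    for i in range(len(mu)):
        acc = 0.0
        for k, weight in enumerate(weights):
            j = min(max(i + k - half_width, 0), last)
            acc += weight * mu[j]
        theta.append(acc)
    return theta
```

Because averaging can pull the threshold below a genuine peak of μ[i], a margin such as a multiplicative factor may be worth considering when this smoothing is used; that trade-off is not discussed in the description.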

図８は、ステップＳ４０４で示した故障予想処理の詳細な手順を示すフローチャートである。 FIG. 8 is a flowchart showing a detailed procedure of the failure prediction process shown in step S404.

ＨＤＤ３００の故障は、特定容量単位で区切られたエリアのアクセス時間がどのくらいの遅延をもって図１２Ｂで示す散布図の帯の上端である閾値を超えているか、そのアクセス遅延の発生がどのようにエリアをまたいで広がっているかによって予測する。 A failure of the HDD 300 is predicted from how far the access time of an area delimited in units of the specific capacity exceeds the threshold, which is the upper edge of the band in the scatter diagram of FIG. 12B, and from how the occurrence of such access delays spreads across areas.

ＨＤＤ３００の閾値を超えたアクセス遅延の原因は、アクセス時に異常が発生したため、通常アクセス時の処理に加え、リトライや代替セクタ発生のような異常発生時の処理時間が加わることによる。 The cause of the access delay exceeding the threshold value of the HDD 300 is that an abnormality has occurred at the time of access, and in addition to the processing at the time of normal access, the processing time at the time of occurrence of an abnormality such as retry or occurrence of alternative sector is added.

しかしながら、ある特定セクタだけの損傷による代替セクタの発生は、代替セクタが発生した時点だけ大きなアクセス遅延が発生するが、以後、同じセクタをアクセスしてもＨＤＤ３００としては、正常動作に戻ったとして扱われ、再度同じ領域をアクセスしてもアクセス遅延が発生しなくなるという特徴がある。この場合、発生も単発で異常セクタの拡大は確認できず、そのとき発生した代替セクタ以上の拡大は見られない。これに対し、経年劣化によるアクセス遅延は、劣化が進行するとともにセグメント長で区切られたエリアの閾値を超える数がアクセスごとに徐々に増加していくが、そのパターンは特定できないため、セグメントで区切られたエリアの閾値の超え方の推移を異常値ＤＢ記録部４０に登録することにより故障の予測を行う。 However, when an alternative sector is generated because only one particular sector is damaged, a large access delay occurs only at the moment the alternative sector is generated; thereafter the HDD 300 is treated as having returned to normal operation, and no access delay occurs even when the same area is accessed again. In this case the event is isolated, no expansion of abnormal sectors is observed, and nothing spreads beyond the alternative sector generated at that time. In contrast, with access delays caused by aging, the number of threshold exceedances in the areas delimited by the segment length gradually increases with each access as the deterioration progresses, but the pattern cannot be specified in advance. Failure is therefore predicted by registering in the abnormal value DB recording unit 40 how the threshold exceedances of the segment-delimited areas change over time.

図８のステップＳ５０１では、一時記憶部２０からＨＤＤ３００に転送する先頭のデータが、どのコマンド発行番号に該当するか特定する。例えば、一時記憶部２０において、ホスト２００からＨＤＤ３００に通常書き込むのと同じＬＢＡを用いて、転送すべきデータを管理し、転送時に先頭ＬＢＡを特定容量で割った値を算出してコマンド発行番号とすればよい。この特定したコマンド発行番号（データ転送に用いる先頭のコマンド発行番号）をｉとする。 In step S501 of FIG. 8, it is determined which command issue number the first data to be transferred from the temporary storage unit 20 to the HDD 300 corresponds to. For example, the temporary storage unit 20 manages the data to be transferred using the same LBAs that the host 200 would normally use to write to the HDD 300, and at transfer time the head LBA is divided by the specific capacity to obtain the command issue number. The command issue number identified in this way (the first command issue number used for the data transfer) is denoted i.

ステップＳ５０５では、一時記憶部２０からＨＤＤ３００に転送するデータ容量（書き込み容量）が、何個分のコマンド発行番号（特定容量）に相当するかを算出する。具体的には、データ容量を特定容量で除算し、その商と余りを算出する。そして、その商を書き込み回数Ｍとする。 In step S505, it is calculated how many command issue numbers (specific capacities) the data capacity (write capacity) transferred from the temporary storage unit 20 to the HDD 300 corresponds to. Specifically, the data capacity is divided by the specific capacity, and the quotient and remainder are calculated. Then, the quotient is set as the write count M.

そこでステップＳ５１０では、ステップＳ５０５で算出された余りが「０」であるか否かを判定する。すなわち、書き込み容量が特定容量で割り切れるか否かを判定する。その結果、割り切れない場合（余りが存在する場合）（Ｓ５１０のＮＯ）、ステップＳ５１５に進む。ステップＳ５１５において、端数のデータに相当するＭ＋１番目の書き込み領域に他のデータがあるかどうかを確認し、他のデータが存在する場合（Ｓ５１５のＹＥＳ）、ステップＳ５２０において、その存在するデータを一時記憶部２０に読み込み、Ｍ＋１番目の書き込みデータに結合した後、Ｍ＋１番目に書き込むデータとして用意してステップＳ５２５に進む。ステップＳ５１５において他のデータが存在しない場合（Ｓ５１５のＮＯ）、書き込みデータを結合する必要はないので、そのままステップＳ５２５に進む。この結果、端数分につき書き込み回数が１つ増えるので、ステップＳ５２５においてＭを１だけ加算し、ステップＳ５３０に進む。 Therefore, in step S510, it is determined whether or not the remainder calculated in step S505 is “0”. That is, it is determined whether the write capacity is divisible by the specific capacity. As a result, when it is not divisible (when there is a remainder) (NO in S510), the process proceeds to step S515. In step S515, it is checked whether there is other data in the M + 1th writing area corresponding to the fractional data. If there is other data (YES in S515), the existing data is temporarily stored in step S520. After reading into the storage unit 20 and combining with the (M + 1) th write data, the data is prepared as the (M + 1) th write data, and the process proceeds to step S525. If no other data exists in step S515 (NO in S515), it is not necessary to combine the write data, and the process directly proceeds to step S525. As a result, the number of times of writing is increased by one per fraction, so M is added by 1 in step S525, and the process proceeds to step S530.

ステップＳ５１０において書き込み容量が特定容量で割り切れる場合（Ｓ５１０のＹＥＳ）、ステップＳ５３０に進む。 If the write capacity is divisible by the specific capacity in step S510 (YES in S510), the process proceeds to step S530.
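The arithmetic of steps S501 to S525 can be sketched in Python as follows. The function name, the choice of sectors as the unit for the data capacity, and the assumption that the transfer starts on a boundary of the specific capacity are illustrative only; the description does not state whether the resulting number is offset to match the 1-based numbering used in the threshold examples, so the sketch returns the plain quotient.

```python
UNIT_SECTORS = 256  # specific capacity used in this embodiment


def plan_writes(head_lba, length_sectors, unit=UNIT_SECTORS):
    """Return (first command issue number, write count M, has_fraction).

    has_fraction is True when the data does not fill the last unit, in
    which case the caller would check the (M+1)-th area for existing
    data to merge (steps S515 and S520) before writing.
    """
    first_command = head_lba // unit                  # step S501
    count, remainder = divmod(length_sectors, unit)   # step S505
    has_fraction = remainder != 0                     # step S510
    if has_fraction:
        count += 1                                    # step S525
    return first_command, count, has_fraction
```

For instance, a 600-sector transfer starting at LBA 512 would map to first command issue number 2 and three writes, the last of which is a partial unit.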

ステップＳ５３０において、図１１に示すアクセスパターンにおける、コマンド発行番号ｉに対応する特定容量の書き込みを行う。 In step S530, the specific capacity corresponding to the command issue number i in the access pattern shown in FIG. 11 is written.

ステップＳ５３５では、コマンド発行番号ｉに対応するコマンド実行時間ａ［ｉ］が閾値θ［ｉ］を超えたか否かを判定する。 In step S535, it is determined whether or not the command execution time a [i] corresponding to the command issue number i has exceeded the threshold value θ [i].

ＨＤＤ３００のアクセス時間に対する特徴として、障害、劣化が進んでいないＨＤＤ３００は、正常時の処理時間内にアクセスが終了するので、セグメントのアクセス時間は閾値内に収まる。しかしながら、経年劣化が進んだＨＤＤ３００や障害発生したＨＤＤ３００は、ヘッドの汚れ、内部での蓄積したほこりの影響、盤面上に発生した傷等により内部の障害が拡大し、その結果、ＨＤＤ３００の内部処理時間が障害に対応する処理を必要とし、正常処理時に比べて内部処理に時間を要するので、アクセス時間が決められた閾値を超え、書き込みアドレスに対するアクセス時間の遅延は拡大する。ステップＳ５３５ではこのような症状が起きていないかどうかを確認する。 As a characteristic of the access time of the HDD 300, an HDD 300 in which faults and deterioration have not progressed completes each access within the normal processing time, so the access time of a segment stays within the threshold. However, in an HDD 300 that has deteriorated with age or has developed a fault, internal faults expand because of head contamination, dust accumulated inside the drive, scratches on the platter surface, and the like. As a result, the internal processing of the HDD 300 must handle these faults and takes longer than in normal operation, so the access time exceeds the set threshold and the access delay for the write address grows. Step S535 checks whether such a symptom is occurring.

コマンド実行時間が閾値を超えていない場合（Ｓ５３５のＮＯ）、ステップＳ５４５に進む。ステップＳ５４５において、当該コマンド発行番号に対応する過去のワーニングがカウントされている場合（Ｓ５４５のＹＥＳ）、アクセス時の一時的な要因があったとみなし、ステップＳ５５０においてコマンド発行番号領域のワーニングカウントを０に戻し、ステップＳ５６５に進む。 If the command execution time does not exceed the threshold (NO in S535), the process proceeds to step S545. In step S545, if past warnings have been counted for this command issue number (YES in S545), they are regarded as having been caused by a temporary factor at the time of access, the warning count for that command issue number area is reset to 0 in step S550, and the process proceeds to step S565.

コマンド実行時間が閾値を超えた場合（Ｓ５３５のＹＥＳ）、そのアクセス遅延時間がＳＭＡＲＴ情報に変化を与えるものかを確認するため、ステップＳ５４０でコマンド発行番号ｉに対するＳＭＡＲＴ情報を読み取り、異常値ＤＢ記録部４０にＳＭＡＲＴ情報を記録する。 If the command execution time exceeds the threshold (YES in S535), the SMART information for command issue number i is read in step S540 and recorded in the abnormal value DB recording unit 40, in order to check whether the access delay has changed the SMART information.

異常値ＤＢ記録部４０におけるＳＭＡＲＴ情報を格納するメモリの構造を図１９に示す。格納メモリＡは常に、今処理を行っているＳＭＡＲＴ情報を格納し、格納メモリＢには前回処理を行ったＳＭＡＲＴ情報を格納する。 FIG. 19 shows a memory structure for storing SMART information in the abnormal value DB recording unit 40. The storage memory A always stores the SMART information that is being processed, and the storage memory B stores the SMART information that was previously processed.

具体的には、ステップＳ５４０でＳＭＡＲＴ情報を読み込む際に、まず格納メモリＡ内のデータ（前回読み込んだＳＭＡＲＴ情報）を、格納メモリＢにコピーする。元々格納メモリＢに格納されていたデータ（前々回読み込んだＳＭＡＲＴ情報）は、上書きされる。その後、今回読み込んだＳＭＡＲＴ情報を格納メモリＡに格納する。 Specifically, when the SMART information is read in step S540, the data in storage memory A (the SMART information read the previous time) is first copied to storage memory B. The data originally stored in storage memory B (the SMART information read the time before last) is overwritten. The SMART information read this time is then stored in storage memory A.

この後、ステップＳ５４３では、格納メモリＡ内の現在のＳＭＡＲＴ情報と、格納メモリＢ内の前回処理を行ったＳＭＡＲＴ情報とを比較し、ＳＭＡＲＴ情報（ここでは代替セクタ数）に更新があったかどうかを確認し、更新がなければ（Ｓ５４３のＮＯ）、ステップＳ５５５のワーニング判定処理を行い、更新があれば（Ｓ５４３のＹＥＳ）、ステップＳ５６０のエラー判定処理を行う。 Thereafter, in step S543, the current SMART information in the storage memory A is compared with the SMART information that has been subjected to the previous process in the storage memory B, and it is determined whether or not the SMART information (here, the number of alternative sectors) has been updated. If there is no update (NO in S543), a warning determination process in step S555 is performed. If there is an update (YES in S543), an error determination process in step S560 is performed.
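The two storage memories and the branch in step S543 can be sketched in Python as follows. The class and function names are hypothetical, and the SMART information is reduced to a single integer (the number of alternative sectors) for illustration; a real implementation would hold the full attribute set and would obtain it from the drive.

```python
class SmartHistory:
    """Two-slot history mirroring storage memory A (current) and B (previous)."""

    def __init__(self, initial_alternative_sectors=0):
        self.memory_a = initial_alternative_sectors
        self.memory_b = initial_alternative_sectors

    def record(self, alternative_sectors_now):
        # Step S540: copy A to B (overwriting B), then store the new reading in A.
        self.memory_b = self.memory_a
        self.memory_a = alternative_sectors_now

    def alternative_sectors_changed(self):
        # Step S543: compare the current reading with the previous one.
        return self.memory_a != self.memory_b


def classify(access_ms, threshold_ms, history, alternative_sectors_now):
    """Return "ok", "warning" (go to S555) or "error" (go to S560)."""
    if access_ms <= threshold_ms:          # step S535: NO
        return "ok"
    history.record(alternative_sectors_now)
    if history.alternative_sectors_changed():
        return "error"
    return "warning"
```

In this sketch the SMART reading is recorded only when the threshold is exceeded, which matches the statement that the comparison is made against the reading taken the previous time the access time exceeded the threshold.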

ここで、ＳＭＡＲＴ情報に更新があった場合、今アクセスしたコマンド発行番号内のセクタにおいてＳＭＡＲＴ情報の更新であったことを示している。コマンド発行番号毎にアクセス時間が閾値を超えた場合に、ＳＭＡＲＴ情報の変化を把握し、特定セグメント毎にワーニングカウンタとエラーカウンタを集計することで、異常と判断されるセクタがどのようにＨＤＤ上に分布しているかを解析する。 Here, an update to the SMART information indicates that the update took place in a sector within the command issue number just accessed. By tracking the change in SMART information whenever the access time of a command issue number exceeds the threshold, and by totaling the warning counters and error counters for each segment, it is analyzed how the sectors judged abnormal are distributed on the HDD.

ステップＳ５５５において、致命的ではない障害を検出するワーニング判定処理を実行する。アクセス時間が閾値を超えながらＳＭＡＲＴ情報が更新されないような障害の場合、或いは、ステップＳ５４３においてＳＭＡＲＴ情報の中でも代替セクタ数以外の更新の場合は、致命的ではない障害と判断されるが、何らかの異常の発生を検出しているものはあるので、ワーニング処理を行ってワーニングカウンタを集計する。ワーニング判定処理については、図９のフローチャートを参照して後ほど詳しく説明する。 In step S555, a warning determination process for detecting non-fatal faults is executed. A fault in which the access time exceeds the threshold but the SMART information is not updated, or a case in step S543 in which an item of the SMART information other than the number of alternative sectors is updated, is judged to be non-fatal; however, because some abnormality has been detected, the warning process is performed and the warning counter is totaled. The warning determination process is described in detail later with reference to the flowchart of FIG. 9.

ステップＳ５６０において、致命的な障害を検出するエラー判定処理を実行する。ステップＳ５６０では、コマンド発行番号領域内に含まれる各セクタに注目してアクセス時間の遅延の実態を把握し、主にリアルタイムに取得できたＳＭＡＲＴ情報により各セクタの状態を解析する。特に各セクタにおける代替セクタ発生の分布から盤面の傷等の物理的エラーを解析し、ステップＳ５５５のワーニング判定処理の解析結果を含め、総合的に寿命の到来を検出し、寿命到来検出時は最終的にＨＤＤの停止警告を出すことにより、運用稼働中のＨＤＤからデータを退避させることを促す。エラー判定処理については、図１０のフローチャートを参照して後ほど詳しく説明する。 In step S560, an error determination process for detecting a fatal failure is executed. In step S560, attention is paid to each sector included in the command issue number area, the actual state of the access time delay is grasped, and the state of each sector is analyzed mainly based on the SMART information acquired in real time. In particular, a physical error such as a scratch on the board surface is analyzed from the distribution of the occurrence of alternative sectors in each sector, and the arrival of the life is comprehensively detected including the analysis result of the warning determination process in step S555. By issuing a warning to stop the HDD, it is urged to save data from the HDD in operation. The error determination process will be described in detail later with reference to the flowchart of FIG.

ステップＳ５６５では、ステップＳ５５５およびステップＳ５６０の処理で更新したＨＤＤ３００のワーニングカウンタ値およびエラーカウンタ値をＳＭＡＲＴ情報とともに表示する。ユーザ（操作者）は、これらのカウンタ値によってＨＤＤ３００の状態を監視することができ、必要なときには操作者がこの数値から判断して、独自にＨＤＤ３００を停止させることもできる。 In step S565, the warning counter value and error counter value of the HDD 300 updated in the processes in steps S555 and S560 are displayed together with the SMART information. The user (operator) can monitor the status of the HDD 300 by using these counter values, and the operator can judge from this numerical value and stop the HDD 300 independently when necessary.

ステップＳ５７０において、ステップＳ５６０の結果を受けて処理されたエラーカウンタが動作停止パラメータ値を超えたことが確認された場合（Ｓ５７０のＹＥＳ）、故障予測処理を終了する。なお、ステップＳ５７０のＹＥＳの直後に、さらにユーザの注意を喚起するような警告メッセージを表示したり故障予測処理を終了することを通知するメッセージを表示してもよい。 If it is confirmed in step S570 that the error counter processed in response to the result of step S560 has exceeded the operation stop parameter value (YES in S570), the failure prediction process is terminated. Note that immediately after YES in step S570, a warning message that further alerts the user may be displayed, or a message notifying that the failure prediction process is to be terminated may be displayed.

ＨＤＤ３００へのデータ書き込み中はステップＳ５５５およびステップＳ５６０の処理をコマンド発行番号に対して行い、使用しているＨＤＤ３００の故障予測を行う。このため、ステップＳ５７５で書き込み回数Ｍを１減算し、コマンド発行領域ｉはアクセスが次の領域に移るため、１加算する。 While data is being written to the HDD 300, the processes in steps S555 and S560 are performed on the command issue number to predict failure of the HDD 300 being used. For this reason, 1 is subtracted from the write count M in step S575, and 1 is added to the command issue area i because the access moves to the next area.

最後にステップＳ５８０において、書き込み回数Ｍが０より大きい場合（Ｓ５８０のＹＥＳ）、所定回数の書き込みに達するまでステップＳ５３０〜ステップＳ５７５までの一連の処理を繰り返す。書き込み回数Ｍが０になった場合（Ｓ５８０のＮＯ）、故障予測処理を終了する。 Finally, in step S580, if the number of times of writing M is greater than 0 (YES in S580), a series of processing from step S530 to step S575 is repeated until the predetermined number of times of writing is reached. When the number M of times of writing becomes 0 (NO in S580), the failure prediction process is terminated.

図９は、ステップＳ５５５のワーニング判定処理の詳細な手順を示すフローチャートである。ワーニング判定処理では、ＨＤＤ内部においてＳＭＡＲＴ情報の代替セクタ数の変化では検出できない異常が発生しているエリアを特定し、異常発生エリアがどのように分布しているか（広がっているか）を判定する。 FIG. 9 is a flowchart showing a detailed procedure of the warning determination process in step S555. In the warning determination process, an area in which an abnormality that cannot be detected by a change in the number of alternative sectors in the SMART information is identified in the HDD, and it is determined how the abnormality occurrence area is distributed (spread).

ワーニング判定は図１６（ａ）で示すコマンド発行番号ごとのワーニングカウンタの値を加算することで行う。 The warning determination is performed by adding to the warning counter value for each command issue number shown in FIG. 16A.

ワーニング判定処理はアクセスしたコマンド発行番号のアクセス時間が閾値を超えた場合に行われるので、ステップＳ６０１において、ワーニングカウンタを１つ増やすとともに、コマンド発行番号とそのときのコマンド実行時間（アクセス時間）を異常値ＤＢ記録部４０に登録する。 Because the warning determination process is performed when the access time of the accessed command issue number exceeds the threshold, in step S601 the warning counter is incremented by one, and the command issue number and the command execution time (access time) at that time are registered in the abnormal value DB recording unit 40.

ステップＳ６０５において、障害の広がりを確実に検出するため、アクセス対象のセグメントの前に５個のセグメント、後に５個のセグメントを取った狭い近傍セグメントエリア（第１近傍セグメントエリア）において、図１２Ｂの散布図の帯が閾値を超えて拡散しつつあるかどうかを確認する。具体的には、第１近傍セグメントエリアにおいて、ゼロでないワーニングカウンタが１０個以上発生しているかどうかを判定する。ワーニングカウンタが１０個以上確認された場合（Ｓ６０５のＹＥＳ）、この近傍セグメントエリアにおいて遅延が拡大していると判定し、次のエラー判定処理でＨＤＤ３００の停止の警告を発することができるよう、ステップＳ６１０でエラーカウンタに故障予測の閾値（ここでは１０）を加算する。 In step S605, in order to reliably detect the spread of a fault, it is checked whether the band of the scatter diagram of FIG. 12B is spreading beyond the threshold within a narrow neighboring segment area (first neighboring segment area) consisting of the five segments before and the five segments after the access target segment. Specifically, it is determined whether 10 or more non-zero warning counters exist in the first neighboring segment area. If 10 or more are found (YES in S605), it is determined that the delay is spreading in this neighboring segment area, and in step S610 the failure prediction threshold (here, 10) is added to the error counter so that a stop warning for the HDD 300 can be issued in the subsequent error determination process.

ワーニング判定処理は、基本的に図８のステップＳ５３５で示すようにアクセス時間が閾値を超え、何らかのアクセス障害が発生しているが、ステップＳ５４３でＳＭＡＲＴ情報において代替セクタの発生が確認できない場合に行われる。 The warning determination process is basically performed when, as shown in step S535 of FIG. 8, the access time exceeds the threshold and some access fault has occurred, but the generation of an alternative sector cannot be confirmed from the SMART information in step S543.

ＳＭＡＲＴ情報の特に代替セクタの更新は、ＨＤＤの内部システムでも障害が発生していることを認識でき、代替セクタ数の集計により自らの異常を判断できる。ヘッドの接触などによる傷等の物理的に判断できる障害は、代替セクタ数の増加によりその発生を判断できる。しかし、寿命予測においては、一度アクセス遅延が発生し、以後アクセス時間の回復することがない幾つかのセクタを中心に、その近傍セクタにおいて、代替セクタの発生に至らないが、閾値を超えたアクセス時間の遅延が徐々に拡大し、あるときから急に代替セクタが拡大する特徴がある。そのため、セクタごとのアクセス遅延の分布を記録し、それがどのように図１２Ｂの散布図の帯において閾値を超え、アクセス遅延が発生しているセクタが増えているかを判断する必要がある。 An update of the SMART information, in particular of the alternative sectors, means that the internal system of the HDD itself can recognize that a fault has occurred, and the HDD can judge its own abnormality by totaling the number of alternative sectors. Faults that can be identified physically, such as scratches caused by head contact, can be detected from an increase in the number of alternative sectors. In life prediction, however, there is a characteristic pattern: around a few sectors in which an access delay has occurred once and the access time never recovers, access delays exceeding the threshold gradually spread to neighboring sectors without leading to the generation of alternative sectors, and then at some point the alternative sectors suddenly increase. It is therefore necessary to record the distribution of access delays per sector and to determine how the number of sectors that exceed the threshold of the band in the scatter diagram of FIG. 12B, that is, sectors with access delay, is increasing.

そこで、コマンド発行番号ｉのアクセス時間が閾値を超え代替セクタが発生していないことに加え、コマンド発行番号ｉの近傍セクタにおいて同様の挙動が見られるかを判定し、寿命に達する障害の進行を予測する。これは、個々のセクタのアクセス遅延はそれほどたいしたものではないが、コマンド発行番号ｉの近傍エリアで、多くのアクセス遅延が発生することは、このエリアのアクセスが完了する挙動が正常ではなく、少なくともアクセス時間を延長させる何らかの障害が進行していることを意味している。その近傍領域のアクセス遅延が発生しているセクタ（コマンド発行番号）の個数を用いて、精度良い判定ができるように、ステップＳ６０５では、近傍セグメントエリアとして同一セクタ数で区切った領域を使用する。ステップＳ６０５における、前に５個のセグメント、後に５個のセグメント、ワーニングカウンタが１０個以上、といった所定数は一例であり、上述の値以外の所定数を用いてもよい。 Therefore, in addition to checking that the access time of command issue number i exceeds the threshold and no alternative sector has been generated, it is determined whether similar behavior is seen in the sectors near command issue number i, and the progress of a fault toward end of life is predicted. Although the access delay of an individual sector is not very significant, the occurrence of many access delays in the area near command issue number i means that accesses in this area are not completing normally and that some fault that at least lengthens the access time is progressing. So that an accurate determination can be made using the number of sectors (command issue numbers) with access delay in the neighboring area, step S605 uses areas divided into equal numbers of sectors as the neighboring segment area. The numbers used in step S605, namely five segments before, five segments after, and 10 or more warning counters, are examples, and other predetermined numbers may be used.

ステップＳ６１５において、アクセス対象のセグメント内のＮ個のコマンド発行番号を見た場合に、ゼロでないワーニングカウンタが所定数（例えば２個）以上あれば、当該セグメントにおいて回復できないような障害が発生していると予想されるため、ステップＳ６２０においてエラーカウントを１加算し、エラー判定処理にエラーカウンタ値を渡す。この時点で、エラーが発生しているエリアの特定が可能となっていることから、問題エリアの使用を回避する等の処置によりＨＤＤ全体として延命へ導くことも可能である。 In step S615, when N command issuance numbers in the segment to be accessed are viewed, if there is a predetermined number (for example, 2) of non-zero warning counters, a failure that cannot be recovered in the segment has occurred. In step S620, the error count is incremented by 1, and the error counter value is passed to the error determination process. At this time, since the area where the error has occurred can be specified, it is possible to extend the life of the entire HDD by taking measures such as avoiding the use of the problem area.

このようにワーニング判定処理では、ワーニングカウンタを加算することにより、エラーにカウントされない障害がＨＤＤ３００に蓄積していることを判断することができる。また、一定のセグメントエリアでワーニングカウンタが所定数以上発生している場合は、ただちにエラーカウンタを加算することでエラー判定処理につなげることができる。 As described above, in the warning determination process, it is possible to determine that a failure that is not counted as an error is accumulated in the HDD 300 by adding the warning counter. Further, when a predetermined number or more of warning counters are generated in a certain segment area, it is possible to immediately connect to an error determination process by adding an error counter.
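The warning determination of steps S601 to S620 can be sketched in Python as follows. The function name and the dictionary that maps 1-based command issue numbers to warning counts are hypothetical. The sketch also assumes that the neighboring segment area includes the access target segment itself and that the check of step S615 is evaluated regardless of the outcome of step S605; the exact branching is defined by FIG. 9, which is not reproduced here.

```python
def warning_determination(warn, command_no, seg_len=30, radius=5,
                          spread_limit=10, in_segment_limit=2,
                          stop_threshold=10):
    """Update the warning counters and return the error counter increment.

    warn maps 1-based command issue numbers to warning counts.
    """
    warn[command_no] = warn.get(command_no, 0) + 1          # step S601
    seg = (command_no - 1) // seg_len                       # 0-based segment index

    def nonzero_counters(first_seg, last_seg):
        low = max(first_seg, 0) * seg_len + 1
        high = (last_seg + 1) * seg_len
        return sum(1 for c, n in warn.items() if low <= c <= high and n > 0)

    increment = 0
    # Step S605: are delays spreading over the neighboring segment area?
    if nonzero_counters(seg - radius, seg + radius) >= spread_limit:
        increment += stop_threshold                         # step S610
    # Step S615: several delayed command issue numbers in this segment?
    if nonzero_counters(seg, seg) >= in_segment_limit:
        increment += 1                                      # step S620
    return increment
```

The caller would add the returned value to the error counter that the error determination process later compares with the operation stop parameter.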

図１０は、ステップＳ５６０のエラー判定処理の詳細な手順を示すフローチャートである。エラー判定処理では、ＳＭＡＲＴ情報の代替セクタの発生を集計し、代替セクタの発生しているコマンド発行番号領域内のセクタの状態を主に解析する。 FIG. 10 is a flowchart showing a detailed procedure of the error determination process in step S560. In the error determination process, occurrences of alternative sectors in the SMART information are totaled, and the state of the sector in the command issue number area where the alternative sectors are generated is mainly analyzed.

特にセクタの状態を解析するためにＳＭＡＲＴ情報の特に代替セクタ数の変化に注目し、この代替セクタの発生位置とアクセス時間が閾値を超えたコマンド発行番号領域の関係から、既に多くのアクセス遅延が発生しているコマンド発行番号のセグメントに代替セクタが発生しているのか、或いは、これまでアクセス遅延が発生していないような領域に代替セクタが発生しているのかを解析する。前者のように、既にアクセス時間が閾値を超えているセグメントに含まれるコマンド発行番号内に代替セクタが発生している場合は、故障に達しつつあることを判断し、最終的にＨＤＤの故障予測を行う。後者のように、単独での代替セクタの発生は、セクタ自体の初期的な不良に見られるように、偶発的なセクタ破壊が起こったと予想されるので、ＨＤＤが正常に戻ったとして、ＨＤＤを正常に使用し続けることができる。 To analyze the state of the sectors, attention is paid in particular to changes in the number of alternative sectors in the SMART information. From the relationship between the position where an alternative sector is generated and the command issue number areas whose access time exceeded the threshold, it is analyzed whether the alternative sector was generated in a segment of command issue numbers where many access delays have already occurred, or in an area where no access delay has occurred so far. In the former case, where an alternative sector is generated within a command issue number belonging to a segment whose access time has already exceeded the threshold, it is judged that the HDD is approaching failure, and the HDD failure prediction is finally made. In the latter case, the isolated generation of an alternative sector is presumed to be an accidental sector failure, as seen with initial defects of the sector itself, so the HDD is regarded as having returned to normal and can continue to be used normally.

Generating an alternative sector is a function recovery method in which a sector with an unrecoverable fault is discarded and replaced by a new alternative sector so that the HDD returns to normal operation. Because the occurrence of an alternative sector is not bad in itself, whether it will lead to failure is judged by analyzing the relationship between the position of the alternative sector and the command issue numbers with delayed access.

In the error determination process, the command issue numbers whose access time exceeded the threshold in the warning determination process are tallied, and when the warning counter in an area exceeds a certain value, it is added as an error count. In step S543 of FIG. 8, if comparison with the previous SMART information shows that the number of alternative sectors has been updated, the warning determination process is skipped and the error determination process is performed. This is because the occurrence of an alternative sector means that an abnormality inside the HDD has already reached a level that the HDD itself can recognize, so a decision such as stopping the HDD is needed as soon as possible. It is also because, if an alternative sector has occurred, a tally of command issue numbers exceeding the threshold, which can serve as the basis for failure prediction, should already be available.

The error determination process is performed by incrementing the error counter value for each command issue number shown in FIG. 16(b).

In step S710, as the first error determination, attention is paid only to alternative sectors, and it is checked whether alternative sectors are occurring abruptly even though few access delays have occurred for the command issue number. For example, when an impact causes the head to scratch the disk, the number of alternative sectors increases sharply for consecutive sectors on the same circumference of the HDD platter. The step thus checks whether alternative sectors have occurred abruptly within the access for the current command issue number alone.

In the case of a fault caused by a scratch, the HDD function is temporarily restored by the generation of alternative sectors, but the fault is likely to extend beyond the current command issue number, so the HDD must be stopped promptly and measures taken to protect the data. Therefore, when five or more newly generated alternative sectors are confirmed in step S710 (YES in S710), the process proceeds to step S730, where an HDD stop advisory is issued and use of the HDD is stopped.

Here, the threshold for the number of alternative sectors is preferably adjusted for each HDD so that the smallest scratch can be detected early on the innermost track of the disk, where the number of consecutive sectors is smallest.

If no abrupt occurrence of five or more alternative sectors is confirmed within the current command issue number area in step S710 (NO in S710), the process proceeds to step S715, where the number of alternative sectors that occurred in the current command issue number area is added to the error counter. The error counter may already have been incremented in the warning determination process of FIG. 9; the number of alternative sectors found in the error determination process is added on top of that and accumulates.

Next, in step S720, if five or more non-zero error counters are found in the second neighboring segment area, which consists of the 10 segments before and the 10 segments after the segment being accessed (YES in S720), it is determined that the access delay is spreading within the second neighboring segment area, and the process proceeds to step S730 to issue an HDD stop warning. This means that alternative sectors have occurred in segments whose access time exceeded the threshold, as tallied in the warning determination process. In such a case, the physical fault is likely not confined to the command issue number area currently being evaluated but has spread to areas not yet evaluated, and it may develop into a read error as the evaluation proceeds, making the data unreadable. The HDD stop warning must therefore be issued promptly and use of the HDD stopped.

In this way, the first neighboring segment area is used to confirm that a fault spreading over a wide range has pushed the access delay above the threshold, that is, the upper edge of the band in FIG. 12B, and that the fault is continuing to spread. The second neighboring segment area is then used to locate the fault within that spread area by means of the HDD's SMART information. Together these capture the progress of the fault accurately and allow failure to be predicted.

The number of segments in the second neighboring segment area (10 before and 10 after the segment being accessed) and the count of five or more error counters within that area are examples, and other values may be used. These numbers vary with the capacity of the HDD. When the maximum capacity of the HDD is small, there are few command issue number areas covering the entire HDD, so a slight change causes the band in the scatter diagram of FIG. 12B to spread beyond the threshold; these values therefore need to be small. When the maximum capacity is large, there are many command issue number areas, so a slight change does not spread the band; these values therefore need to be large.

If the number of non-zero error counters in the second neighboring segment area is less than five in step S720 (NO in S720), the process proceeds to step S725. If the sum of the error counter values over all segments is equal to or greater than a predetermined value (for example, 10) in step S725 (YES in S725), an HDD stop warning is issued in step S730.

A sum of error counter values over all segments of 10 (the predetermined value) or more indicates that, even when no relationship with the tally of access times exceeding the threshold is seen, alternative sectors and access delay areas are scattered across the HDD and increasing, and the fault is progressing. The number 10 is only an example; it represents the total count, per HDD capacity, at which host access begins to be impaired, and it varies with the capacity of the HDD. Instead of the condition that the sum of the error counter values over all segments is equal to or greater than a predetermined value, a condition that the number of error counters with a value greater than 0 over all segments is equal to or greater than a predetermined number may be used.

Thus, except where the fault analysis shows the fault progressing in a way that requires data to be retrieved from the HDD immediately, the HDD stop warning is issued in step S730 when the error counter, incremented both in the warning determination process and by the number of alternative sectors generated, reaches the predetermined value (10 in this example) in step S725.
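
The decision sequence of steps S710 to S730 can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `judge_error`, the use of one error counter per segment index, and the inclusion of the accessed segment itself in the neighborhood window are all assumptions made for the example. The numeric limits are the example values given in the text.

```python
from typing import Dict

# Example values from the description; the text notes they depend on HDD capacity.
SUDDEN_SPARE_LIMIT = 5      # S710: new alternative sectors in one command area
NEIGHBOR_RADIUS = 10        # S720: segments before and after the accessed segment
NEIGHBOR_NONZERO_LIMIT = 5  # S720: non-zero error counters inside that window
TOTAL_ERROR_LIMIT = 10      # S725: sum of all error counters


def judge_error(error_counters: Dict[int, int], segment: int, new_spares: int) -> bool:
    """Return True when an HDD stop warning (S730) should be issued.

    error_counters maps a segment index to its error counter and is updated
    in place (hypothetical layout, one counter per segment).
    """
    # S710: an abrupt burst of alternative sectors suggests a scratch.
    if new_spares >= SUDDEN_SPARE_LIMIT:
        return True

    # S715: accumulate the newly generated alternative sectors.
    error_counters[segment] = error_counters.get(segment, 0) + new_spares

    # S720: count non-zero error counters in the second neighboring area.
    # Assumption: the window also includes the accessed segment itself.
    window = range(segment - NEIGHBOR_RADIUS, segment + NEIGHBOR_RADIUS + 1)
    nonzero = sum(1 for s in window if error_counters.get(s, 0) > 0)
    if nonzero >= NEIGHBOR_NONZERO_LIMIT:
        return True

    # S725: faults scattered over the whole disk.
    return sum(error_counters.values()) >= TOTAL_ERROR_LIMIT
```

For instance, with counters already non-zero in segments 45, 47, 52 and 55, one new alternative sector in segment 50 makes five non-zero counters within the window and triggers the warning through the S720 branch.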

As described above, in the failure prediction procedure of the HDD failure prediction apparatus 100 according to the second embodiment, the SMART information is read only when the access time exceeds the threshold, so no unnecessary SMART read commands need to be issued. In addition, because abnormalities are detected by combining the threshold comparison of access time with changes in the SMART information (in particular the number of alternative sectors), the accuracy of failure prediction is improved.

Because the warning determination process and the error determination process are performed even when the SMART information has not been updated, minor abnormalities that do not update the SMART information can also be handled, which improves the accuracy of failure prediction. When the SMART information has been updated, the warning determination process is skipped and the error determination process is performed, so severe abnormalities can be detected efficiently.

In addition, failure of the HDD 300 can be predicted with high accuracy, and loss of data in the HDD 300 can be prevented, as follows.

When an abnormal write occurs because a particle is caught under the head, or a write is spoiled because the head happened to fail to write data completely to a sector, the access time temporarily exceeds the threshold. Overwriting the same command issue number area again restores the access time, and operation continues with normal access times thereafter. In such a case the warning counter is reset, and it can be determined that the HDD 300 has returned to normal operation.

When function is restored by the generation of an alternative sector, an isolated occurrence means the HDD 300 has returned to normal, and the error count does not accumulate further. If alternative sectors occur successively in neighboring segments, however, write errors due to a scratch or a defective head are likely. In these cases the approach of an HDD 300 failure can be determined because error counters build up in the segments around the problem sector, or because an access time exceeding the alternative-sector prediction threshold is treated as an alternative sector occurrence and added to the error count.

Furthermore, a failure due to the end of life of the HDD 300 can be determined by checking whether the range in which warning counters are registered, that is, where the access time exceeds the threshold, is spreading around a specific segment. Visually, this appears as the scatter diagram of FIG. 12B spreading out compared with the normal state.

In particular, if the warning counter reaches a predetermined number (for example, 2) or more on re-access to a command issue number area in which a warning counter is registered, recovery of the kind seen after a spoiled write cannot be expected; a data write abnormality in that command issue number area is likely and may develop into an early failure. Moving from the warning counter to the error counter in this case allows the approach of a failure to be determined more accurately.
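
The interplay of warning counter, reset, and escalation described above might look like the following sketch. The function name `update_warning`, the dictionary layout, the returned labels, and the choice to add exactly 1 to the error counter on escalation are illustrative assumptions; only the example limit of 2 and the reset-on-normal-access rule come from the text.

```python
from typing import Dict

WARNING_ESCALATION_LIMIT = 2  # example value from the description


def update_warning(warning: Dict[int, int], error: Dict[int, int],
                   cmd_no: int, access_time: float, threshold: float) -> str:
    """Update the counters for one timed access and return a state label.

    warning and error map a command issue number to its counter
    (hypothetical layout); both are updated in place.
    """
    if access_time <= threshold:
        # Access time is back to normal: treat the earlier delay as transient.
        warning[cmd_no] = 0
        return "normal"

    warning[cmd_no] = warning.get(cmd_no, 0) + 1
    if warning[cmd_no] >= WARNING_ESCALATION_LIMIT:
        # Repeated delay on re-access: hand over to the error counter.
        # Assumption: the error counter is incremented by 1 per escalation.
        error[cmd_no] = error.get(cmd_no, 0) + 1
        return "escalated"
    return "warning"
```

A single slow access followed by a normal one leaves both counters at zero, while two consecutive slow accesses to the same command issue number produce an error count.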

By accumulating warning counters and moving them to error counters in this way, and by analyzing the cause of the HDD's own alternative sector generation, which can be detected from the SMART information at the position where the warning counter occurred, faults that progress over time can be determined more accurately, in addition to general failure determination for the HDD 300.

Ultimately, the error counter is used to determine, from the accumulated warning counter values, that the threshold for predicting failure has been exceeded, and to signal a stop warning for the HDD 300 by an indicator lamp or the like. The warning may instead be an alarm such as a buzzer, and may be linked to the stop function of the system.

In this way, deriving the threshold from the maximum access time during normal operation of the HDD 300 establishes the range of normal access times, so abnormal operation can be captured accurately in command issue number areas whose access time exceeds the threshold, and failure of the HDD 300 can be predicted with high accuracy. Although this embodiment uses the number of alternative sectors in the SMART information, that number can be regarded as an index of the degree of defect of the recording medium, or as an index of the amount of resources consumed in handling defects of the recording medium. The same processing can be performed with any such index, not only the number of alternative sectors.

The HDD failure prediction apparatus 100 according to the second embodiment has the following features.

During operation, the HDD is accessed in units of command issue numbers (specific capacity units) so that the access start position and transfer size match those of the initial measurement, and failure is predicted based on the access time and the change in SMART information (in particular an increase in alternative sectors) for each command issue number.

When the access time exceeds the threshold, it is determined whether the number of alternative sectors has increased. If it has, the warning is deferred compared with the case of no increase: when an alternative sector has been newly allocated and the total number of alternative sectors in the HDD is small, no warning is issued.

When alternative sectors occur concentrated in the narrow range of the neighboring segments, a more severe warning is issued than when they occur dispersed over a wide range.

When the access time for a command issue number exceeds the threshold, the warning counter for that command issue number is incremented; if the access time on a subsequent access is at or below the threshold, the warning counter is reset. This distinguishes transient access delays from persistent ones.

So that access conditions are the same during initial measurement and during operation, access time is measured using an access pattern that resets the head position before each measurement of an HDD area (command issue number).

From the measured access times of the HDD areas (command issue numbers), a local maximum of the access time is calculated over a plurality of neighboring areas, and the abnormality detection threshold is set based on it.
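
The last feature, setting the threshold from a local maximum of access time, could be approximated as below. This is only one possible reading: the sliding window of `radius` neighbors on each side and the multiplicative `margin` are assumptions introduced for the example, and this section does not state how the threshold is derived from the local maximum.

```python
from typing import List, Sequence


def local_max_thresholds(access_times: Sequence[float],
                         radius: int = 2, margin: float = 1.2) -> List[float]:
    """Return one threshold per command issue number.

    access_times[i] is the access time measured for command issue number i
    during the initial measurement. For each i, the maximum over the
    neighboring areas [i - radius, i + radius] is taken and scaled by margin
    (both parameters are hypothetical).
    """
    n = len(access_times)
    thresholds = []
    for i in range(n):
        lo = max(0, i - radius)          # clamp the window at the disk edges
        hi = min(n, i + radius + 1)
        thresholds.append(max(access_times[lo:hi]) * margin)
    return thresholds
```

Using the neighborhood maximum rather than each area's own value keeps a single fast area from receiving a threshold that ordinary variation among its neighbors would already exceed.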

(Embodiment 3)
The HDD failure prediction apparatus 100 according to the third embodiment, like that of the second embodiment, uses the SMART information of the HDD and improves the accuracy of failure prediction by determining the threshold used for access-time-based prediction from a scatter diagram of access times obtained with a specific access pattern. In addition, the third embodiment adopts a method in which, for a minor fault, the generation of an alternative sector (a fault recovery function built into the HDD) is deliberately triggered to restore the access function of the HDD.

The configuration and operation of the HDD failure prediction apparatus 100 according to the third embodiment are the same as those of the second embodiment, except that the warning determination process of step S555 within the failure prediction operation of step S404 performed by the control unit 30 differs: a recovery flag is set and recovery processing is performed. Descriptions of configuration and operation shared with the second embodiment are omitted as appropriate, and the differences are described below.

The control unit 30 measures the command execution time (access time) when the HDD controller 10 writes data to the HDD 300, reads the threshold stored in advance in the abnormal value DB recording unit 40, and determines whether the current access time exceeds the threshold.

Furthermore, for an LBA whose access time has exceeded the threshold, if recovery is possible given the cause of the fault, the control unit 30 notifies the host 200 of that LBA. Based on this information, the host 200 copies the data in the faulty LBA to a normal area and then, in the intervals between normal file access operations, writes arbitrary data to the faulty LBA according to the access pattern of FIG. 11, forcing the transition to an alternative sector. From the SMART information and access time at that point, the control unit 30 can identify the position of the alternative processing pending sector and confirm that it has moved to an alternative sector. Forcing the transition from a pending sector to an alternative sector returns the HDD to normal operation.

The Current Pending Sector Count in the SMART information (hereinafter, "number of alternative processing pending sectors") is, like the number of alternative sectors, an important value for detecting HDD faults. In actual HDD operation an alternative sector does not appear suddenly; in most cases an alternative processing pending sector appears first so that the fault state can be monitored. If a fault of the same level is confirmed when the sector is accessed again, the HDD performs alternative processing and generates an alternative sector. However, as with alternative sectors, the SMART information gives the number of alternative processing pending sectors only as a count, and there has been no way to know in real time, while the HDD is operating, in which sector a pending sector occurred. With existing techniques the relationship between pending sectors and alternative sectors is unknown, and it cannot be determined whether a pending sector and an alternative sector arose in the same sector. For this reason, SMART information has mostly been used to analyze causes after an internal HDD fault has occurred.

When a write fault is actually progressing, an alternative sector is not generated as soon as the fault occurs. If the sector originally targeted still cannot be written after the number of retries specified by the HDD manufacturer, it is first held as an alternative processing pending sector, and if it still cannot be written on the next access, an alternative sector is generated. When a sector becomes pending, the HDD system has room to judge that not all of the data in the sector is faulty and that writing may still be possible, and it may treat the condition as an early symptom of a fault. In fact, reading or writing a pending sector often takes longer than a normal sector, but correct data can sometimes be read and written. In most cases, however, a fault eventually occurs and the sector moves to an alternative sector, so capturing the occurrence of pending sectors accurately allows HDD failure to be predicted more accurately.

When the HDD system cannot read any information from the sector to be written, the sector moves to an alternative sector without first becoming an alternative processing pending sector. The level of the fault can therefore be judged by determining whether a monitored sector moved from pending to alternative, or an alternative sector occurred directly.
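
One way to picture this distinction is to compare the two SMART counts before and after accessing a single command issue number area. The sketch below is illustrative only: the `SmartSnapshot` type, the function name, and the returned labels are hypothetical, and it assumes that any change between the two snapshots is attributable to the area just accessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmartSnapshot:
    """Two SMART counts sampled at one point in time (hypothetical container)."""
    reallocated: int  # number of alternative sectors
    pending: int      # number of alternative processing pending sectors


def classify_transition(prev: SmartSnapshot, curr: SmartSnapshot) -> str:
    """Label what happened in the area accessed between the two snapshots."""
    d_realloc = curr.reallocated - prev.reallocated
    d_pending = curr.pending - prev.pending

    if d_realloc > 0 and d_pending < 0:
        # A sector that was pending has now been replaced.
        return "pending_to_alternative"
    if d_realloc > 0:
        # Replaced without a pending phase: the more severe case in the text.
        return "direct_alternative"
    if d_pending > 0:
        # Early symptom: the sector is being held for monitoring.
        return "new_pending"
    return "no_change"
```

Recording the label together with the command issue number gives the per-area fault level that the count-only SMART values do not provide by themselves.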

If it is possible to know in which sector of the HDD such SMART updates occur, the faulty sectors can be identified and tallied inside the operating HDD. This reveals in which sectors, and at what fault level, faults are progressing over time, and the arrival of a failure can be predicted from the tallied fault distribution.

While an alternative processing pending sector exists, access is still possible, but the HDD access time exceeds the normal threshold, so the state is very unstable. Once the area moves to an alternative sector, the access time returns to normal; if the pending sector can be moved to an alternative sector early, the area can be used normally again.

Therefore, after the data in the area containing the alternative processing pending sector is saved elsewhere, a write is forced to the pending sector so that it moves to an alternative sector, resolving the unstable access and returning the HDD to normal operation. This is called recovery processing.

Recovery processing is effective when an alternative processing pending sector occurs in isolation. When multiple pending sectors are found together with alternative sectors within a narrow area, however, more faults exist than the number of alternative sectors indicates, so the HDD is judged to have reached the end of its life.
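
The choice between recovery processing and an end-of-life judgment can be summarized in a small decision function. The function name, its inputs (counts already tallied for one narrow area), and the intermediate "watch" outcome for cases the text does not address are assumptions for illustration.

```python
def decide_recovery(pending_in_area: int, alternative_in_area: int) -> str:
    """Decide how to treat one narrow area of the disk.

    pending_in_area: alternative processing pending sectors tallied in the area.
    alternative_in_area: alternative sectors tallied in the same area.
    """
    if pending_in_area == 0:
        return "none"
    if pending_in_area == 1 and alternative_in_area == 0:
        # Isolated pending sector: set the recovery flag and force reallocation.
        return "recover"
    if pending_in_area >= 2 and alternative_in_area > 0:
        # Several pending sectors alongside alternative sectors: end of life.
        return "end_of_life"
    # Not covered by the description; keep monitoring (assumption).
    return "watch"
```

Only the "recover" outcome would lead to the forced write described above; "end_of_life" would feed the stop warning instead.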

Thus, the HDD failure prediction apparatus 100 according to the third embodiment uses the access time threshold for normal HDD operation as a reference and, when the access time exceeds the threshold, identifies and tallies the sectors in which the SMART information changes, capturing the progress of faults inside the HDD more accurately. Because signs of a fault are captured by associating SMART information with sector state in real time, failure can be predicted with high accuracy.

Furthermore, because SMART information becomes usable in real time, the HDD failure prediction apparatus 100 according to the third embodiment can tell which sectors are alternative processing pending sectors and which pending sectors have moved to alternative sectors. The number of alternative processing pending sectors in the SMART information can therefore be put to use to force pending sectors with unstable access over to alternative sectors and return the HDD to normal operation.

The access time of the HDD 300 is as outlined with reference to FIGS. 14(a) and 14(b), but when an abnormality occurs, the processing time for generating an alternative processing pending sector is added to the access time. Access time is related to changes in pending sectors and alternative sectors: for a sector whose access time has lengthened, the SMART information read immediately afterward either shows that a pending sector or alternative sector has occurred, or one is likely to occur.

The flowchart of the failure prediction procedure of the HDD failure prediction apparatus 100 is the same as FIG. 7 described in the second embodiment, except that the failure prediction operation of step S404 includes a recovery operation, and a recovery flag is set during the failure prediction operation.

In step S404, the failure prediction operation is performed while the HDD 300 is in use. The control unit 30 monitors whether the access time of each command issue number area exceeds the threshold recorded in the abnormal value DB recording unit 40, and for areas that exceed it, records the command issue number, access time, and SMART information in the abnormal value DB recording unit 40. By the method shown in FIG. 19, the current SMART information is compared with the SMART information from the previous time the access time exceeded the threshold. For that purpose, the abnormal value DB recording unit 40 temporarily stores the SMART information for the command issue number at the previous threshold exceedance and uses it for comparison with the SMART information for the current command issue number.

また、ＳＭＡＲＴ情報の変化の分布から障害が単体の代替処理保留中セクタの発生レベルにとどまり軽微である場合は、制御部３０は、一度、その代替処理保留中セクタを利用しているデータを待避させた後、代替処理保留中セクタに任意のデータの書き込み動作を行い、代替処理保留中セクタを代替セクタへ強制的に移行させ、以後のＨＤＤの動作を安定させる。 In addition, when the distribution of changes in the SMART information shows that the failure is minor, limited to the occurrence of a single alternative processing pending sector, the control unit 30 first saves the data that uses that sector, then writes arbitrary data to the alternative processing pending sector to forcibly shift it to an alternative sector, thereby stabilizing subsequent operation of the HDD.

ステップＳ４０４の故障予想処理の詳細な手順を示すフローチャートは、実施の形態２で説明した図８と同じであるが、ステップＳ５５５のワーニング判定処理の詳細な手順が異なる。 The flowchart showing the detailed procedure of the failure prediction process in step S404 is the same as FIG. 8 described in the second embodiment, but the detailed procedure of the warning determination process in step S555 is different.

ステップＳ５５５において、致命的ではない障害を検出するワーニング判定処理を実行する。アクセス時間が閾値を超えながらＳＭＡＲＴ情報が更新されないような障害の場合、或いは、ステップＳ５４３においてＳＭＡＲＴ情報の中でも代替セクタ数以外のデータが更新された場合は、致命的ではない障害と判断されるが、何らかの異常の発生を検出しているものはあるので、ワーニング処理を行ってワーニングカウンタを集計する。ワーニング判定処理で、代替処理保留中セクタの発生箇所において、ＨＤＤの機能を回復する可能性のあるセクタについては、強制的に代替セクタへの移行させる回復処理を行うために回復フラグを立てる。実施の形態３のワーニング判定処理については、図１７のフローチャートを参照して詳しく説明する。 In step S555, a warning determination process for detecting non-fatal failures is executed. A failure in which the access time exceeds the threshold but the SMART information is not updated, or in which data other than the number of alternative sectors is found in step S543 to have been updated in the SMART information, is judged to be non-fatal. Because some abnormality has nevertheless been detected, warning processing is performed and the warning counters are tallied. In the warning determination process, for a sector at a location where an alternative processing pending sector has occurred and where the HDD function may be recoverable, a recovery flag is set so that a recovery process forcibly shifting it to an alternative sector is performed. The warning determination process of the third embodiment is described in detail with reference to the flowchart of FIG. 17.

図１７は、実施の形態３のワーニング判定処理の詳細な手順を示すフローチャートである。 FIG. 17 is a flowchart illustrating a detailed procedure of the warning determination process according to the third embodiment.

ワーニング判定処理では、ＨＤＤ内部において、ＳＭＡＲＴ情報の代替処理保留中セクタ数の発生位置とアクセス時間が閾値を超えたセクタの発生位置の関連性を調べ、両者の発生位置の分布が重なるときには、エラー判定処理における故障予測の判断が通常より加速するようにエラーカウンタを加算する。 In the warning determination process, the relationship within the HDD between the locations where alternative processing pending sectors reported in the SMART information have occurred and the locations of sectors whose access time exceeded the threshold is examined. When the two distributions overlap, the error counter is incremented so that the failure prediction judgment in the error determination process is reached sooner than usual.

また、代替セクタ数や代替処理保留中セクタ数の変化では検出できない異常が発生しているエリアを特定し、異常が発生したコマンド発行番号のエリアを集計することにより、異常エリアがどのようにＨＤＤ上に分布しているかを判定する。 In addition, areas in which an abnormality has occurred that cannot be detected from changes in the number of alternative sectors or the number of alternative processing pending sectors are identified, and the areas of the command issue numbers at which abnormalities occurred are tallied to determine how the abnormal areas are distributed on the HDD.

これらの結果から、代替処理保留中セクタが特定エリアに連続して発生しておらず、分布的にも集中しておらず、偶発的な障害と判断される場合、回復フラグを立てることで、代替処理保留中セクタの回復処理を促す。回復処理は後述するが、回復処理は、メインの故障予測処理とは独立して、ＨＤＤアクセスの空き時間を利用して実行される。 If these results show that alternative processing pending sectors are neither occurring consecutively in a specific area nor concentrated in their distribution, so that the failure is judged to be incidental, a recovery flag is set to trigger recovery processing of the alternative processing pending sector. The recovery process, described later, is executed independently of the main failure prediction process, using idle time between HDD accesses.

ワーニング判定処理は、基本的に図８のステップＳ５３５で示すようにアクセス時間が閾値を超え、何らかのアクセス障害が発生しているが、ステップＳ５４３でＳＭＡＲＴ情報において代替セクタの発生が確認できない場合に行われ、障害発生時の緊急度が高い代替セクタの発生以外のイベントを処理する。 The warning determination process is basically performed when, as shown in step S535 of FIG. 8, the access time exceeds the threshold and some access failure has occurred, but the occurrence of an alternative sector cannot be confirmed in the SMART information in step S543. It handles events other than the occurrence of an alternative sector, which is the high-urgency case.

ＳＭＡＲＴ情報の特に代替セクタの更新は、代替処理保留中セクタが代替セクタに移行したことによる場合と、いきなり代替セクタが発生するレベルの障害が発生したことによる場合とが考えられるが、そのどちらの場合でもＨＤＤに極めて重大な障害を与えると認識できることから、ワーニング判定処理を行わず、直接エラー判定処理を行い、できるだけ、早急にＨＤＤを停止させる等の処理を行う。 An update of the SMART information, in particular of the alternative sector count, may result either from an alternative processing pending sector having shifted to an alternative sector or from a failure severe enough to produce an alternative sector outright. In either case the HDD can be regarded as seriously impaired, so the warning determination process is skipped and the error determination process is performed directly, with measures such as stopping the HDD taken as soon as possible.

ワーニング判定処理では、代替処理保留中セクタの分布を記録し、どの代替処理保留中セクタが代替セクタへ移行しているかを集計することにより、ＨＤＤ内部の障害の進行を判断し、集計の結果、分布の広がりが見えない、単独の代替処理保留中のセクタの発生については、代替処理保留中セクタに代替セクタへ強制的に移行させるために回復フラグをたて、ＨＤＤの性能を回復させる処理を促す。 In the warning determination process, the distribution of alternative processing pending sectors is recorded, and the progress of failures inside the HDD is judged by tallying which alternative processing pending sectors have shifted to alternative sectors. When the tally shows an isolated alternative processing pending sector with no spread in the distribution, a recovery flag is set to forcibly shift that sector to an alternative sector, triggering processing that restores HDD performance.

寿命予測においては、近傍セクタにおいて代替セクタの発生に至らないが、閾値を超えたアクセス時間の遅延を起こしている代替処理保留中セクタが徐々に拡大し、あるときから、急速に代替処理保留中セクタが代替セクタへ移行し、急激に代替セクタが拡大する特徴がある。そのため、セクタごとのアクセス遅延の分布を記録し、それがどのように図１２Ｂの散布図の帯において閾値を超え、代替処理保留中セクタが発生しているセクタが増えているかを判断する必要がある。 In life prediction, a characteristic pattern is that alternative processing pending sectors, which cause access delays exceeding the threshold without yet producing alternative sectors, gradually spread among neighboring sectors; then, from some point on, the pending sectors rapidly shift to alternative sectors and the number of alternative sectors grows sharply. It is therefore necessary to record the distribution of access delay for each sector and to determine how it exceeds the threshold in the band of the scatter diagram of FIG. 12B and how the number of sectors with alternative processing pending sectors is increasing.

そこで、アクセス時間が閾値を越え代替セクタが発生していないことに加え、近傍セクタにおいて代替処理保留中セクタから代替セクタへの移行が見られないか、発生個数を集計して寿命に達する障害の進行を予測する。 Therefore, in addition to checking that the access time exceeds the threshold while no alternative sector has occurred, the number of occurrences is tallied to see whether any shift from alternative processing pending sectors to alternative sectors is observed in neighboring sectors, and the progress of failures leading to end of life is predicted.

これは、代替処理保留中セクタが発生した場合、アクセス時間が閾値を超えているが、ＨＤＤがまだ、致命的なエラーと判断しておらず、代替セクタへ移行するかどうかのＨＤＤ内部の判断の閾値まで達していない状態である。このようなセクタが近傍エリアに特定個数発生することは、エリア近傍のセクタが正常ではなく、代替処理保留中セクタが多く発生することは、それだけ多くの代替セクタが発生する可能性があることから、何らかの障害が進行していることを意味する。 When an alternative processing pending sector has occurred, the access time exceeds the threshold, but the HDD has not yet judged the error to be fatal, and the HDD's internal criterion for shifting to an alternative sector has not been reached. The occurrence of a certain number of such sectors in a neighboring area indicates that the sectors around that area are not normal; and because many alternative processing pending sectors mean that correspondingly many alternative sectors may arise, it indicates that some failure is progressing.

しかしながら、１回の代替処理保留中セクタが次のアクセス時に必ず代替セクタに移行するとは限らないことから、１回目の閾値を超えたアクセスについては様子を見るために、ステップＳ８１０では、ワーニングカウンタが０であるか否か（既にワーニングカウンタが存在するかどうか）を判定する。 However, a sector that has become pending once does not necessarily shift to an alternative sector at the next access. To wait and see after the first access that exceeds the threshold, step S810 determines whether the warning counter is 0 (that is, whether a warning counter already exists).

ワーニングカウンタが０でない場合（Ｓ８１０のＮＯ）、２回目以降の同一セクタの処理であるから、ステップＳ８１５へ進む。ワーニングカウンタが０である場合（Ｓ８１０のＹＥＳ）、ステップＳ８２０へ進む。 If the warning counter is not 0 (NO in S810), this is the second or later pass for the same sector, and the process proceeds to step S815. If the warning counter is 0 (YES in S810), the process proceeds to step S820.

この後、ステップＳ８１５およびＳ８２０では、代替処理保留中セクタが前回より増加したかどうかを調べる。 Thereafter, in steps S815 and S820, it is checked whether or not the number of alternative processing pending sectors has increased from the previous time.

ステップＳ８１５において代替処理保留中セクタが前回より増加していない場合（Ｓ８１５のＮＯ）、ワーニングカウンタが存在し、新たに代替処理保留中セクタの発生がないので、この時点で確認できるセクタに関する情報だけでは、ＨＤＤの障害がどのように進んでいるかが判断できないので、ステップＳ８２５に進む。また、ステップＳ８２０において代替処理保留中セクタが前回より増加している場合（Ｓ８２０のＹＥＳ）、ワーニングカウンタが存在せず、新たに代替処理保留中セクタが発生しているので、この時点で確認できるセクタに関する情報だけでは、ＨＤＤの障害がどのように進んでいるかが判断できないので、ステップＳ８２５に進む。 If the number of pending alternative processing sectors has not increased from the previous time in step S815 (NO in S815), there is a warning counter, and there is no new alternative pending sector, so only information on sectors that can be confirmed at this point in time. Since it is not possible to determine how the HDD failure has progressed, the process proceeds to step S825. Further, if the number of alternative processing pending sectors has increased from the previous time in step S820 (YES in S820), a warning counter does not exist and a new alternative processing pending sector has occurred, so this can be confirmed at this point. Since it is not possible to determine how the failure of the HDD has progressed only with the information regarding the sector, the process proceeds to step S825.

ステップＳ８２５では、第１近傍セグメント（アクセス対象のセグメントの前に５個のセグメント、後に５のセグメントの範囲）においてワーニングカウンタが３個以上、発生しているか否かを判定する。ステップＳ８２５がＹＥＳの場合は、図１２Ｂの散布図において、アクセス時間が閾値を超えて帯の集束が拡散しつつある可能性が高い。前に５個のセグメント、後に５個のセグメントといった所定数は、異常を判断できる近傍セグメントエリアに含まれる、アクセス遅延が発生しているセクタ数の一例であり、上述の値以外の所定数を用いてもよい。 In step S825, it is determined whether three or more warning counters have occurred in the first neighboring segments (the range of five segments before and five segments after the access target segment). If step S825 is YES, it is highly likely that, in the scatter diagram of FIG. 12B, access times are exceeding the threshold and the tightly clustered band is beginning to spread. The predetermined numbers, such as five segments before and five segments after, are one example of values defining a neighboring segment area in which an abnormality can be judged from the number of sectors with access delays, and other predetermined numbers may be used.

第１近傍セグメントにおいてワーニングカウンタが３個（値が０より大きいワーニングカウンタが３個）以上確認された場合（Ｓ８２５のＹＥＳ）、第１近傍セグメントにおいて代替セクタに移行するような障害が拡大しているとして、ステップＳ８３５でエラーカウンタを１加算する。 When three or more warning counters (three warning counters with a value greater than 0) are confirmed in the first neighboring segment (YES in S825), the failure to shift to an alternative sector in the first neighboring segment is expanded. In step S835, the error counter is incremented by one.

ステップＳ８４５は、ワーニングカウンタが０であり（Ｓ８１０のＹＥＳ）、代替処理保留中セクタが発生していない（Ｓ８２０のＮＯ）場合と、ステップＳ８２５において、代替処理保留中セクタの発生位置とこれまでのワーニングカウンタの発生位置の関連性が見られない場合（Ｓ８２５のＮＯ）に実行され、このときは、今調べているセクタのワーニングカウンタを１加算し、ワーニング判定処理を終了する。 In step S845, the warning counter is 0 (YES in S810), and no alternative process pending sector has occurred (NO in S820). In step S825, the occurrence position of the alternative process pending sector and the previous This process is executed when the relevance of the occurrence position of the warning counter is not found (NO in S825). At this time, 1 is added to the warning counter of the sector currently being examined, and the warning determination process is terminated.

ステップＳ８１５において、代替処理保留中セクタが前回より増加している場合（Ｓ８１５のＹＥＳ）、ワーニングカウンタが０でないことから２回目以降のアクセス遅延の発生であり、新たな代替処理保留中セクタも発生していることから、障害が進行していると考えられ、ステップＳ８３０に進む。 In step S815, if the alternative process pending sector has increased from the previous time (YES in S815), the warning counter is not 0, so the second and subsequent access delays have occurred, and a new alternative process pending sector has also occurred. Therefore, it is considered that the failure has progressed, and the process proceeds to step S830.

ステップＳ８３０において、第１近傍セグメントにおいてワーニングカウンタが１０個（所定数）以上発生しているかどうかを確認する。ステップＳ８２５においては、代替処理保留中セクタの発生とアクセス遅延の関連性が明確ではないため、障害の進行が緩やかであると判断した。しかし、ステップＳ８３０においては、代替処理保留中セクタの発生とアクセス遅延の関連性が明確であることから、アクセス遅延の発生が間違いなく代替処理保留中セクタによるものと判断される。そこで、第１近傍セグメントにおいて所定数以上の代替処理保留中セクタの発生が確認された場合（Ｓ８３０のＹＥＳ）、その代替処理保留中セクタが代替セクタへ移行する可能性があるほどの障害が発生していると予想されるから、次のエラー判定処理でＨＤＤ停止の警告を発するよう、ステップＳ８４０でエラーカウンタを１０加算する。 In step S830, it is checked whether 10 (a predetermined number) or more warning counters have occurred in the first neighboring segments. In step S825, the relationship between the occurrence of alternative processing pending sectors and the access delay is not clear, so the failure is judged to be progressing slowly. In step S830, by contrast, that relationship is clear, so the access delay is judged to be caused by alternative processing pending sectors. Accordingly, when the predetermined number or more of alternative processing pending sectors is confirmed in the first neighboring segments (YES in S830), a failure serious enough that those pending sectors may shift to alternative sectors is presumed to have occurred, and 10 is added to the error counter in step S840 so that an HDD stop warning is issued in the next error determination process.

一方、このような特定エリアに集中するような代替処理保留中セクタの発生が見られない場合（Ｓ８３０のＮＯ）、単独での代替処理保留中セクタの発生と考えられる。しかし、代替処理保留中セクタが発生した状態でのＨＤＤの内部処理は、アクセスごとに代替セクタへ移行するべき状態かどうかの判断処理が増える分、アクセスに要する時間が増え、アクセス時間が閾値を越える可能性がある。また、実際に非常に不安定なデータ記憶状況にあるが、ＨＤＤの内部処理が代替セクタへ移行すべきと判断しない場合、アクセスごとリトライが発生し、ＨＤＤのアクセス時間が閾値を超え不安定な状態になる。そこで、このような単体での代替処理保留中セクタの発生が確認された場合、強制的に代替処理保留中セクタへの書き込み処理を行い、代替処理保留中セクタを代替セクタに移行させ、ＨＤＤの不安定な状態を解消する。ステップＳ８５０では、その処理を行うために回復フラグをセットする。回復処理については後に詳細を述べる。 On the other hand, if there is no occurrence of the pending sector for substitution processing that concentrates in such a specific area (NO in S830), it is considered that the sector is pending for substitution processing alone. However, in the HDD internal processing in the state where the alternative processing pending sector is generated, the time required for access increases as the processing for determining whether or not to shift to the alternative sector for each access increases. There is a possibility of exceeding. In addition, although the data storage situation is actually very unstable, if it is not determined that the internal processing of the HDD should move to the alternative sector, a retry occurs for each access, and the access time of the HDD exceeds the threshold and is unstable. It becomes a state. Therefore, when it is confirmed that such a stand-alone alternative process pending sector is generated, the write process to the alternative process pending sector is forcibly performed, and the alternative process pending sector is shifted to the alternative sector. Eliminate unstable conditions. In step S850, a recovery flag is set to perform the process. Details of the recovery process will be described later.
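The branching of FIG. 17 described above (steps S810 through S850) can be summarized in a short Python sketch. It rests on several assumptions that the text leaves open: counters are kept per segment index, the first neighboring range includes the target segment itself, and the process simply ends after step S835 or S840. `WarningState`, `active_neighbors`, and `warning_determination` are hypothetical names.

```python
from collections import defaultdict

NEIGHBOR_RADIUS = 5     # first neighboring segments: 5 before and 5 after
SPREAD_MIN = 3          # S825: three or more nonzero warning counters
CONCENTRATION_MIN = 10  # S830: ten or more nonzero warning counters


class WarningState:
    def __init__(self):
        self.warning = defaultdict(int)  # segment index -> warning counter
        self.error = defaultdict(int)    # segment index -> error counter
        self.recovery_flag = False


def active_neighbors(state, seg, radius=NEIGHBOR_RADIUS):
    """Count segments within +/- radius of seg whose warning counter is > 0."""
    return sum(1 for s in range(seg - radius, seg + radius + 1)
               if state.warning.get(s, 0) > 0)


def warning_determination(state, seg, pending_increased):
    """Run one pass of the warning determination; return the final step label."""
    first_time = state.warning.get(seg, 0) == 0             # S810
    if first_time and not pending_increased:                # S820: NO
        state.warning[seg] += 1                             # S845
        return "S845"
    if first_time or not pending_increased:                 # S820: YES or S815: NO
        if active_neighbors(state, seg) >= SPREAD_MIN:      # S825
            state.error[seg] += 1                           # S835
            return "S835"
        state.warning[seg] += 1                             # S845
        return "S845"
    # S815: YES, a repeated delay together with a new pending sector
    if active_neighbors(state, seg) >= CONCENTRATION_MIN:   # S830
        state.error[seg] += 10                              # S840
        return "S840"
    state.recovery_flag = True                              # S850
    return "S850"
```

Returning the step label is only a convenience for inspection; the flowchart itself has no return value.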

ステップＳ５６０のエラー判定処理の詳細な手順を示すフローチャートは、実施の形態２で説明した図１０と同じであるが、いくつか補足する。 A flowchart showing the detailed procedure of the error determination process in step S560 is the same as that in FIG. 10 described in the second embodiment, but some supplementary explanations will be made.

前述のように、ワーニング判定処理のステップＳ８３０では、代替処理保留中セクタの発生とアクセス遅延の関連性が明確であることから、第１近傍セグメントおいて代替処理保留中のセクタから代替セクタへの移行が急速に進んでいる場合、エラーカウンタが１０加算される。そのため、エラー判定処理では、ステップＳ７３０に進み、ＨＤＤ停止勧告が発せられる。 As described above, in step S830 of the warning determination process the relationship between the occurrence of alternative processing pending sectors and the access delay is clear, so when the shift from alternative processing pending sectors to alternative sectors is progressing rapidly in the first neighboring segments, 10 is added to the error counter. As a result, the error determination process proceeds to step S730 and an HDD stop recommendation is issued.

ステップＳ７２０で用いられる、アクセス対象のセグメントの前に１０個、後に１０個という隣接する第２近傍セグメントエリアのセグメント数と、第２近傍セグメントエリア内でのエラーカウンタが５個以上という数は、極めて狭い範囲において、代替処理保留中セクタが代替セクタに移行していることを明確にするための値であり、ＨＤＤの代替処理保留中セクタが代替セクタへ移行する処理能力の違いにより変化する。代替処理保留中セクタから代替セクタへ移行する判断の閾値が低いＨＤＤにおいては、この値を大きく取る必要があり、代替処理保留中セクタから代替セクタへ移行する判断の閾値が高いＨＤＤにおいてはこの値を小さくすることができる。 The number of segments in the adjacent second neighboring segment area used in step S720 (10 before and 10 after the access target segment) and the count of five or more error counters within that area are values chosen to establish clearly that alternative processing pending sectors are shifting to alternative sectors within a very narrow range. They vary with differences in how readily an HDD shifts alternative processing pending sectors to alternative sectors. For an HDD whose threshold for shifting from a pending sector to an alternative sector is low, these values need to be set larger; for an HDD whose threshold is high, they can be made smaller.

ステップＳ７１０において、５個以上の代替セクタの発生が確認された場合、ステップＳ７３０に進み、ＨＤＤ停止勧告を発し、ＨＤＤの使用を停止させるが、代替セクタの発生数が５個以上という閾値は、代替処理保留中セクタ数に応じて可変にしてもよい。たとえば、代替処理保留中セクタ数が１０個未満である場合、代替セクタの発生数の閾値を１０とし、代替処理保留中セクタ数が１０個以上である場合、代替セクタの発生数の閾値を５としてもよい。代替処理保留中セクタ数が多くなるほど、代替セクタの発生数の閾値を下げて、ＨＤＤ停止勧告が出やすくするためである。あるいは、代替処理保留中セクタ数と代替セクタ数を組み合わせた総合的な指標を算出し、その総合指標に応じてＨＤＤ停止勧告を発するようにしてもよい。たとえば、代替セクタ数をｘ、代替処理保留中セクタ数をｙとして、総合指標ｚ＝αｘ＋βｙをステップＳ７１０の判定で用いてもよい。ここでα、βは０より大きい所定の値であり、典型的にはα＞βを満たす。 If the occurrence of five or more alternative sectors is confirmed in step S710, the process proceeds to step S730, where an HDD stop recommendation is issued and use of the HDD is stopped. The threshold of five or more alternative sectors may be made variable according to the number of alternative processing pending sectors. For example, the threshold for the number of alternative sectors may be set to 10 when the number of alternative processing pending sectors is less than 10, and to 5 when it is 10 or more. The purpose is to lower the threshold as the number of pending sectors grows, so that the HDD stop recommendation is issued more readily. Alternatively, a composite index combining the number of alternative processing pending sectors and the number of alternative sectors may be calculated, and the HDD stop recommendation issued according to that index. For example, with x the number of alternative sectors and y the number of alternative processing pending sectors, the composite index z = αx + βy may be used for the determination in step S710. Here α and β are predetermined values greater than 0, and typically α > β.
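The two variations of the step S710 test can be written out as a small Python sketch. The function names are hypothetical, and the default weights `alpha=2.0` and `beta=1.0` are illustrative assumptions only; the description requires just that both be greater than 0, typically with α > β, and gives no cut-off value for z.

```python
def alternative_sector_threshold(pending_count: int) -> int:
    """Variable S710 threshold: 10 while fewer than 10 sectors are pending, else 5."""
    return 10 if pending_count < 10 else 5


def should_recommend_stop(alternative_count: int, pending_count: int) -> bool:
    """True when the alternative sector count reaches the variable threshold."""
    return alternative_count >= alternative_sector_threshold(pending_count)


def composite_index(alternative_count: int, pending_count: int,
                    alpha: float = 2.0, beta: float = 1.0) -> float:
    """Composite index z = alpha * x + beta * y (weights are assumed values)."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be greater than 0")
    return alpha * alternative_count + beta * pending_count
```

With five alternative sectors, for example, `should_recommend_stop(5, 3)` is false while `should_recommend_stop(5, 12)` is true, which reflects the intent that more pending sectors make the stop recommendation easier to trigger.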

図１８は、回復処理の詳細な手順を示すフローチャートである。セクタ回復処理は故障予測処理とは別のタスク等の処理で行う。基本的に、故障予測処理は、ホスト２００の読み書きのメイン処理の一環として行われるが、セクタ回復処理は、図１７のステップＳ８５０において回復フラグが設定された場合に、回復フラグの監視を行っているタスクによって行われる。セクタ回復処理は、回復フラグが立っている間、故障予測処理とは非同期に行われる。 FIG. 18 is a flowchart showing the detailed procedure of the recovery process. The sector recovery process is performed in a separate task or similar from the failure prediction process. Basically, the failure prediction process is performed as part of the main read/write processing of the host 200, whereas the sector recovery process is performed by a task that monitors the recovery flag, once the flag has been set in step S850 of FIG. 17. While the recovery flag is set, the sector recovery process runs asynchronously with the failure prediction process.

セクタ回復処理は、ホスト２００からの読み書きのメイン処理の空き時間で行われ、本来の読み書き処理を妨害しない。基本的に、回復処理は、ホスト２００の読み書きの処理と同様に特定容量単位で指定したセクタに書き込むだけであり、大きな処理時間を必要としない。 The sector recovery process is performed in the free time of the main process of reading / writing from the host 200 and does not interfere with the original reading / writing process. Basically, the recovery process only writes to a sector designated by a specific capacity unit, similar to the read / write process of the host 200, and does not require a large processing time.

ステップＳ９０５では、ＨＤＤから回復処理を行うＳＭＡＲＴ情報を再度読み込み、これから強制書き込みを行うセクタ領域の代替処理保留中セクタ数と代替セクタ数を読み込む。 In step S905, the SMART information for performing the recovery process is read again from the HDD, and the number of alternative processing pending sectors and the number of alternative sectors for the sector area to be forcibly written are read.

これから回復処理を行う領域にデータが存在する場合、データの待避処理が必要になる。ステップＳ９１０において、これから回復処理を行う領域にデータがあるかどうか確認し、回復すべき領域にデータが存在する場合（Ｓ９１０のＹＥＳ）、ステップＳ９１５においてホスト２００はデータの回避処理を行う。これは、同一ＨＤＤ上の他の領域へのコピーでもいいし、他のメディアに対する待避でもよい。 If data exists in the area about to undergo recovery processing, the data must first be saved elsewhere. In step S910, it is checked whether the area to be recovered contains data; if it does (YES in S910), the host 200 saves the data in step S915. The data may be copied to another area on the same HDD or saved to other media.

この後、ステップＳ９２０において、これから行う強制書き込みの回数のカウンタをリセットし、ステップＳ９２５において、問題セクタに特定容量単位で書き込みを行う。 Thereafter, in step S920, the counter of the number of times of forced writing to be performed is reset, and in step S925, the problem sector is written in a specific capacity unit.

ステップＳ９３０において、書き込んだ後のＳＭＡＲＴ情報を読み込み、ステップＳ９４０において、読み込んだＳＭＡＲＴ情報の代替処理保留中セクタ数が０であるかどうかを調べる。代替処理保留中セクタが代替セクタへ移行するか、代替処理保留中セクタが一時的な異常に過ぎず正常セクタに復帰した場合、代替処理保留中セクタ数は０になる。 In step S930, the SMART information after writing is read. In step S940, it is checked whether the number of sectors pending substitution processing of the read SMART information is zero. When the alternative processing pending sector shifts to the alternative sector, or when the alternative processing pending sector is merely a temporary abnormality and returns to the normal sector, the number of alternative processing pending sectors becomes zero.

例えば、この領域中の代替処理保留中セクタ数が１であれば、書き込みにより代替セクタへ移行すれば、書き込み処理後のＳＭＡＲＴ情報の代替処理保留中セクタ数は１減るとともに、代替セクタが１増加する。代替処理保留中セクタの再アクセス時に当該セクタが正常セクタに復帰した場合、代替処理保留中のセクタ数は１減るが、代替セクタ数には変化がない。あるいは、ＨＤＤの内部処理上、代替処理へ移行するほどでもない軽微のエラー状態と判断された場合は、代替処理保留中セクタ数は変化せず、代替セクタ数にも変化はない。 For example, if the number of alternative processing pending sectors in this area is 1, if the shift to the alternative sector is performed by writing, the number of alternative processing pending sectors in the SMART information after the write processing is reduced by 1 and the alternative sector is increased by 1. To do. When the sector returns to a normal sector when the alternative processing pending sector is re-accessed, the number of sectors pending alternative processing is reduced by 1, but the number of alternative sectors remains unchanged. Alternatively, if it is determined in the HDD internal processing that the error state is minor enough not to shift to the replacement processing, the number of sectors pending for replacement processing does not change and the number of replacement sectors does not change.
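The three outcomes of a forced write described above can be distinguished from the change in the two SMART counts. The following Python sketch is illustrative only: `WriteOutcome` and `classify_write` are hypothetical names, each SMART snapshot is assumed to be a `(pending, alternative)` pair, and any combination the text does not mention is reported as `OTHER`.

```python
from enum import Enum


class WriteOutcome(Enum):
    SHIFTED_TO_ALTERNATIVE = "shifted to alternative sector"
    RETURNED_TO_NORMAL = "returned to normal sector"
    UNCHANGED = "still pending"
    OTHER = "not covered by the description"


def classify_write(before, after):
    """Classify a forced write from (pending, alternative) counts before and after."""
    d_pending = after[0] - before[0]
    d_alternative = after[1] - before[1]
    if d_pending < 0 and d_alternative > 0:
        return WriteOutcome.SHIFTED_TO_ALTERNATIVE  # pending down, alternative up
    if d_pending < 0 and d_alternative == 0:
        return WriteOutcome.RETURNED_TO_NORMAL      # pending down, alternative same
    if d_pending == 0 and d_alternative == 0:
        return WriteOutcome.UNCHANGED               # minor error, nothing moved
    return WriteOutcome.OTHER
```

For the single-sector example in the text, `classify_write((1, 0), (0, 1))` corresponds to the shift to an alternative sector.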

代替処理保留中セクタが代替セクタへ移行するか、代替処理保留中セクタが正常復帰した場合（Ｓ９４０のＹＥＳ）、ステップＳ９５０で回復フラグをクリアし、回復処理を終了する。なお、このとき発生した代替セクタは、エラー判定処理での判断に使用され、代替処理保留中セクタから代替セクタへ移行し正常アクセス時間に戻ったとしても、全体として、代替セクタが増加するようであれば、エラー判定処理のステップＳ７３０でＨＤＤの停止警告を表示し、ＨＤＤの停止を促す。従って、代替処理保留中セクタの代替セクタへの移行も故障予測処理の一つとして動作する。 When the alternative process pending sector shifts to the alternative sector or the alternative process pending sector returns to normal (YES in S940), the recovery flag is cleared in step S950, and the recovery process ends. The replacement sector generated at this time is used for determination in the error determination processing, and even if the replacement processing pending sector shifts to the replacement sector and returns to the normal access time, the replacement sector seems to increase as a whole. If there is, an HDD stop warning is displayed in step S730 of the error determination process to prompt the HDD to stop. Therefore, the transition to the alternative sector from the alternative processing pending sector also operates as one of the failure prediction processes.

今回の書き込みで代替処理保留中セクタから代替セクタへの移行しなかった場合（Ｓ９４０のＮＯ）、書き込みカウンタがまだ５に達していないなら（Ｓ９４５のＮＯ）、ステップＳ９３５で書き込みカウンタに１加算し、ステップＳ９２５に戻り、再度、書き込み処理を行う。 If this write did not shift the alternative processing pending sector to an alternative sector (NO in S940) and the write counter has not yet reached 5 (NO in S945), the write counter is incremented by 1 in step S935, and the process returns to step S925 to perform the write again.

ステップＳ９４５で書き込み回数が５に達した場合（Ｓ９４５のＹＥＳ）、書き込み処理が５回行われたにも関わらず、代替処理保留中セクタから代替セクタへの移行が見られず、代替処理保留中セクタの移行処理が行われない致命的な障害が発生している可能性があるので、ステップＳ９５５に進み、エラー判定処理のステップＳ７３０で直ぐにＨＤＤの停止警告表示処理がなされるように、エラーカウンタを１０に設定し、回復処理を終了する。 If the write count reaches 5 in step S945 (YES in S945), the shift from the alternative process pending sector to the alternative sector is not observed even though the write process is performed 5 times, and the alternative process is pending. Since there is a possibility that a fatal failure has occurred in which sector transfer processing is not performed, the process proceeds to step S955, and an error counter is displayed so that HDD stop warning display processing is immediately performed in step S730 of error determination processing. Is set to 10 and the recovery process is terminated.

強制書き込み回数の最大値はＨＤＤの異常を認識するレベルによって変化する。代替処理保留中セクタは、基本的に、発生後の次のアクセスで発生時と同じレベル以上の障害が発生したときに代替セクタへ移行する。回復フラグが立つのは、当該領域に図１７のワーニング判定処理でアクセス遅延が少なくとも２回発生し、２回目で代替処理保留中セクタの発生が確認できた場合であるから、その後の複数回の書き込みで正常に移行するとは考えにくい。このように判断されながら、代替セクタへ移行しないのは、例えば、既に代替セクタを使い切り、移行する代替セクタが既にない場合等が考えられ、代替セクタに移行ができないセクタを持つＨＤＤは、非常に危険で、直ぐにでも停止警告表示処理を行う必要がある。従って、この危険度の判断を厳しくしたい場合、強制書き込み回数を少なくしてもよく、危険度の判断を緩和する場合は、強制書き込み回数を増やしてもよい。 The maximum value of the number of forced writes varies depending on the level for recognizing abnormality of the HDD. The alternate processing pending sector basically shifts to the alternate sector when a failure of the same level or more occurs at the next access after the occurrence. The recovery flag is set when an access delay occurs at least twice in the warning determination process of FIG. 17 in the area, and the occurrence of the alternate processing pending sector can be confirmed at the second time. It is unlikely that it will migrate normally by writing. For example, there may be cases where the alternative sector is not migrated while being judged in this way, for example, when the alternative sector has already been used up and there is no alternative sector to be migrated. It is dangerous and it is necessary to perform stop warning display processing immediately. Therefore, the number of forced writings may be reduced if it is desired to make the judgment of the degree of risk strict, and the number of forced writings may be increased if the judgment of the degree of risk is eased.
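The recovery loop of FIG. 18 (steps S905 through S955) can be sketched in Python as below. This is a sketch under stated assumptions rather than the actual procedure: the device interface (`read_smart`, `has_data`, `save_data`, `force_write`) and the `FakeDevice` stand-in are hypothetical, the counter is advanced right after each write so that at most `max_writes` writes are issued, and the recovery flag is left untouched on failure because the text does not say what happens to it.

```python
MAX_FORCED_WRITES = 5  # maximum number of forced writes; tunable per the text


class FakeDevice:
    """Hypothetical stand-in for the problem sector area, for illustration only."""

    def __init__(self, clears_after=None):
        self.pending = 1                  # one alternative processing pending sector
        self.writes = 0
        self.saved = False
        self.clears_after = clears_after  # write count at which pending clears

    def read_smart(self):
        return (self.pending, 0)          # (pending count, alternative count)

    def has_data(self):
        return True

    def save_data(self):
        self.saved = True

    def force_write(self):
        self.writes += 1
        if self.clears_after is not None and self.writes >= self.clears_after:
            self.pending = 0


def recover_sector(dev, state, max_writes=MAX_FORCED_WRITES):
    """Return True if the pending count reaches 0 within max_writes writes."""
    dev.read_smart()                        # S905: re-read SMART information
    if dev.has_data():                      # S910: data in the area?
        dev.save_data()                     # S915: save it elsewhere first
    writes = 0                              # S920: reset the write counter
    while writes < max_writes:              # S945: limit on forced writes
        dev.force_write()                   # S925: write to the problem sector
        writes += 1                         # S935: count the write
        pending, _ = dev.read_smart()       # S930: read SMART after writing
        if pending == 0:                    # S940: pending count is 0?
            state["recovery_flag"] = False  # S950: clear the recovery flag
            return True
    state["error_counter"] = 10             # S955: force an HDD stop warning
    return False
```

Lowering `max_writes` makes the judgment stricter and raising it makes the judgment more lenient, in line with the adjustment described above.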

以上述べたように、実施の形態３のＨＤＤ故障予測装置１００による故障予測手順によれば、代替処理保留中セクタを強制的に代替セクタに移行させることでアクセス時間を正常化してＨＤＤを安定化させることができる。さらに、高い精度でＨＤＤ３００の故障予測を行い、ＨＤＤ３００内のデータの損失を防ぐことができる。 As described above, according to the failure prediction procedure performed by the HDD failure prediction apparatus 100 according to the third embodiment, the access time is normalized by forcibly shifting the alternative processing pending sector to the alternative sector, thereby stabilizing the HDD. Can be made. Further, failure prediction of the HDD 300 can be performed with high accuracy, and loss of data in the HDD 300 can be prevented.

代替処理保留中セクタが発生しても代替セクタに移行して機能を回復した場合は、単発的であればＨＤＤ３００の機能が正常化したとしてそれ以降エラーカウントの累積は進行しないが、近傍セグメントにおいて代替処理保留中セクタと代替セクタが連続して発生すれば傷やヘッド不良による書き込みミスが発生していると考えられる。これらは、問題発生セクタと前後するセグメントにおいてエラーカウンタの発生が進行することによって、あるいは代替処理保留中セクタおよび代替セクタの発生によりエラーカウントが加算されることによって、ＨＤＤ３００の故障が近いことを判断することができる。 Even if an alternative processing pending sector occurs, if the function is restored by moving to the alternative sector, if it is a single occurrence, the function of the HDD 300 will be normalized and the error count accumulation will not proceed thereafter. If the alternative processing pending sector and the alternative sector occur consecutively, it is considered that a write error due to a scratch or a head defect has occurred. It is determined that the failure of the HDD 300 is close by the occurrence of an error counter in the segment before and after the problem-occurring sector, or by adding an error count due to the occurrence of a substitution processing pending sector and a substitution sector. can do.

また、代替処理保留中セクタの発生位置が分かることから、書き込みにより単発的な代替処理保留中セクタを直ちに代替セクタへ強制的に移行させ、以後のＨＤＤの動作を安定させることができる。 Further, since the generation position of the alternate processing pending sector is known, the single alternate processing pending sector can be immediately forced to the alternate sector by writing, and the subsequent operation of the HDD can be stabilized.

さらに、ＨＤＤ３００の寿命による故障については、ワーニングカウンタが登録された代替処理保留中セクタとアクセス時間が閾値を超えるコマンド発行番号が特定セグメントに前後して広がって発生しているかどうかを確認することで判断することができる。この場合、視覚的には図１２Ｂの散布図が正常時と比較して広がりつつあることから判断することができる。 Furthermore, failures due to the end of life of the HDD 300 can be judged by checking whether alternative processing pending sectors with registered warning counters, and command issue numbers whose access time exceeds the threshold, are spreading before and after a specific segment. Visually, this appears as the scatter diagram of FIG. 12B spreading out compared with the normal state.

特に、ワーニングカウンタが登録されたコマンド発行番号領域の複数個の代替処理保留中セクタとアクセス時間が閾値を越えたコマンド発行番号の分布の関連性が一致する場合、時間とともに急速に多くの代替セクタへ移行する障害の進行が考えられることから、ワーニングカウンタからエラーカウンタへの進行を早めることにより故障発生が近いことをより正確に判断することができる。 In particular, when the relationship between the plurality of alternative processing pending sectors in the command issue number area where the warning counter is registered matches the distribution of command issue numbers whose access times exceed the threshold, many alternative sectors rapidly increase over time. Therefore, it is possible to more accurately determine that the failure is near by accelerating the progress from the warning counter to the error counter.

このようにワーニングカウンタを蓄積しエラーカウンタへ移行するとともに、ワーニングカウンタ発生位置におけるＳＭＡＲＴ情報から検出できるＨＤＤ自体の代替処理保留中のセクタや代替セクタ発生要因を解析することにより、一般的なＨＤＤ３００の障害判断に加えて、時間をかけて進行する障害をより正確に判断することができるようになる。 In this way, the warning counter is accumulated and the process proceeds to the error counter, and by analyzing the sector pending the substitution process of the HDD itself that can be detected from the SMART information at the warning counter occurrence position and the cause of the substitution sector generation, In addition to the failure determination, it is possible to more accurately determine a failure that progresses over time.

以上述べたように、実施の形態３のＨＤＤ故障予測装置１００による故障予測手順によれば、代替処理保留中セクタの障害発生レベルが軽微なものであれば、強制的に代替セクタへの移行を促し、ＨＤＤの安定した動作を継続させることができる。さらに、障害が発生しているセクタを特定することにより、高い精度でＨＤＤの寿命を予測し、ＨＤＤの故障によってデータを失うことを回避することができる。 As described above, according to the failure prediction procedure performed by the HDD failure prediction apparatus 100 of the third embodiment, if the failure level of an alternative processing pending sector is minor, the sector can be forcibly shifted to an alternative sector so that the HDD continues to operate stably. Furthermore, by identifying the sectors in which failures have occurred, the life of the HDD can be predicted with high accuracy and data loss due to HDD failure can be avoided.

The present invention has been described above based on the embodiments. The embodiments are illustrative, and those skilled in the art will understand that various modifications are possible in the combinations of their constituent elements and processing processes, and that such modifications are also within the scope of the present invention. In this example, processing was performed using the number of alternative processing pending sectors and the number of alternative sectors in the SMART information; these can also be regarded as indices indicating the degree of a failure, or of a sign of failure, of the recording medium. That is, a first index for which the degree of failure is relatively light (the number of alternative processing pending sectors) and a second index for which the degree of failure is relatively severe (the number of alternative sectors) are used. In addition, the first index (the number of alternative processing pending sectors) may return to a normal value when data is written to the recording medium. Any index with these characteristics can be used, so similar processing can also be performed using data other than the number of alternative processing pending sectors.
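
As a rough illustration of the two indices, the following sketch compares two snapshots of the status information. It is an assumption-laden example rather than the logic of the embodiment: `SmartSnapshot`, `classify`, and the four result labels are hypothetical names, and the mapping from index changes to labels is one possible reading of the description above.

```python
from typing import NamedTuple


class SmartSnapshot(NamedTuple):
    pending: int      # first index: number of alternative processing pending sectors
    reallocated: int  # second index: number of alternative sectors


def classify(prev, curr):
    """Classify the change between two snapshots (hypothetical labels)."""
    if curr.reallocated > prev.reallocated:
        return "error"      # the relatively severe second index increased
    if curr.pending > prev.pending:
        return "warning"    # the relatively light first index increased
    if curr.pending < prev.pending:
        return "recovered"  # the first index moved back toward normal, e.g. after a write
    return "normal"
```

The second index is checked first because, under this reading, an increase in the number of alternative sectors outweighs any simultaneous change in the first index.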

In the above description, the failure prediction technique was explained using a hard disk as an example of a disk, but the failure prediction technique of the present embodiment can be applied to any magnetic disk or optical disk. The failure prediction technique of the present embodiment is also not limited to disks and can be applied to recording media such as memory cards.

In the above description, a change in the number of alternative sectors was detected from the SMART information, taking a hard disk as an example. When the failure prediction technique of the present embodiment is applied to a recording medium other than a hard disk, any status information for monitoring and analyzing the reliability of the recording medium may be used in place of the SMART information to detect a change in some index indicating a failure, or a sign of failure, of the access area.

In the above description, failure prediction was performed using a warning counter, which may be reset, and an error counter, which is never reset. Alternatively, a failure may be determined using only the warning counter, based on whether the warning counter has reached or exceeded a predetermined number.
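
The warning-counter-only variant can be sketched as below. This assumes, purely for illustration, that the comparison is made per access target location; the description does not state whether the predetermined number applies per location or to a total. The names `locations_over_limit` and `reset_warning` and the parameter `limit` are hypothetical.

```python
def locations_over_limit(warning_counters, limit):
    """Return the access target locations whose warning counter is at or above `limit`.

    `warning_counters` maps an access target location to its warning count.
    """
    return sorted(loc for loc, count in warning_counters.items() if count >= limit)


def reset_warning(warning_counters, location):
    """Reset the warning counter of one location (warning counters may be reset)."""
    warning_counters.pop(location, None)
```

A failure would be predicted in this sketch whenever `locations_over_limit` returns a non-empty list.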

10 HDD controller, 20 temporary storage unit, 30 control unit, 40 abnormal value DB recording unit, 100 HDD failure prediction apparatus, 200 host, 300 HDD.

Claims

1. A failure prediction apparatus comprising:
a status information recording unit that stores status information of a recording medium obtained when an access target location is accessed in units of a specific capacity according to a predetermined access pattern for the recording medium; and
a control unit that acquires the status information of the recording medium when the access target location is accessed in units of the specific capacity according to the access pattern, and registers the status information of the recording medium in the status information recording unit in association with the access target location.

2. The failure prediction apparatus according to claim 1, wherein the predetermined access pattern is a pattern in which the access target location is accessed after a head for recording data on the recording medium is returned to an initial position.

3. The failure prediction apparatus according to claim 2, wherein the status information of the recording medium is an index indicating a degree of failure of the recording medium, or an index indicating an amount of resources used in processing that handles a failure of the recording medium.

4. The failure prediction apparatus according to claim 3, wherein the status information of the recording medium is the number of alternative sectors.

5. The failure prediction apparatus according to claim 1, wherein the specific capacity is determined according to the number of sectors in one track of the recording medium.

6. The failure prediction apparatus according to any one of claims 1 to 5, wherein the control unit acquires the status information of the recording medium when an access time taken to access the access target location in units of the specific capacity according to the access pattern satisfies a first predetermined condition.

7. The failure prediction apparatus according to claim 6, wherein, prior to the measurement of the access time, the control unit measures in advance access times corresponding to the access target location and to access target locations in its vicinity, calculates a threshold corresponding to the access target location using the plurality of access times measured in advance, and determines that the first predetermined condition is satisfied when the access time taken to access the access target location in units of the specific capacity according to the access pattern exceeds the threshold.

8. The failure prediction apparatus according to claim 1, wherein the control unit executes an error determination process of incrementing an error counter associated with the access target location when the newly acquired status information of the recording medium has changed from the previously acquired status information of the recording medium, and detecting an abnormality based on the error counter.

9. The failure prediction apparatus according to claim 8, wherein the control unit detects an abnormality when a change in the status information of the recording medium satisfies a second predetermined condition at the access target location.

10. The failure prediction apparatus according to claim 8 or 9, wherein the control unit executes a warning determination process of incrementing a warning counter associated with the access target location, the warning counter having less influence on abnormality detection than the error counter, even when the newly acquired status information of the recording medium has not been updated from the previously acquired status information of the recording medium, and incrementing the error counter when the warning counters in a neighboring area including the access target location satisfy a predetermined condition.

11. A failure prediction method comprising:
a step of storing, in a status information recording unit, status information of a recording medium obtained when an access target location is accessed in units of a specific capacity according to a predetermined access pattern for the recording medium; and
a step of acquiring the status information of the recording medium when the access target location is accessed in units of the specific capacity according to the access pattern, and registering the status information of the recording medium in the status information recording unit in association with the access target location.

12. A failure prediction program that causes a computer to execute:
a step of storing, in a status information recording unit, status information of a recording medium obtained when an access target location is accessed in units of a specific capacity according to a predetermined access pattern for the recording medium; and
a step of acquiring the status information of the recording medium when the access target location is accessed in units of the specific capacity according to the access pattern, and registering the status information of the recording medium in the status information recording unit in association with the access target location.