JP2002215336A

JP2002215336A - Memory control method, and memory sub-system

Info

Publication number: JP2002215336A
Application number: JP2001006862A
Authority: JP
Inventors: Masaki Aizawa; 正樹相澤; Mikio Fukuoka; 幹夫福岡; Takao Sato; 孝夫佐藤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2001-01-15
Filing date: 2001-01-15
Publication date: 2002-08-02

Abstract

PROBLEM TO BE SOLVED: To prevent the occurrence of uncorrectable errors by means of medium surface inspection of a disk device. SOLUTION: Each disk device 200 constituting a disk array under the control of a disk control device 110 is provided with an error correction unit 202 whose error correction capability can be changed. The error correction capability is changed between reads for normal operation and reads for medium surface inspection of a disk 203, so that a correctable error whose error correction length is equal to or greater than a specified value is detected as a target for recovery by rewriting. The results of the medium surface inspection, which each disk device 200 executes while it has no input/output, are periodically collected by the disk control device 110, and a recovery process is performed on each position where a correctable error occurred. Uncorrectable errors are thus prevented, and the medium surface inspection can be conducted efficiently without affecting the input/output processing of a central processing unit 100.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the invention] The present invention relates to a storage device control technique and a storage subsystem, and more particularly to a technique that is effective when applied to data assurance and the like through detection of medium surface failures in magnetic disk devices and the like constituting a redundant storage device such as a disk array device.

[0002]

[Prior art] In a disk control device that is interposed between a host computer and a group of large-capacity disk devices and controls data transfer between the two, there is a problem that recorded data becomes unreadable because of a medium failure in a disk device.

[0003] Medium failures fall into two cases: one in which reading becomes temporarily impossible, for example because the head contacts the medium surface and generates frictional heat, and one in which reading becomes permanently impossible because dust adheres to the medium surface or the surface is scratched. In the former case, reading becomes possible again after some time has passed. However, if another disk device belonging to the same data group becomes degraded while the data is unreadable, or while a permanent failure of the latter kind exists, data is lost because the redundancy is exceeded.

[0004] A medium surface failure is detected when the host device accesses the affected location. If a medium failure occurs in a sector that holds infrequently accessed data, the situation described above becomes more likely, so a means of detecting medium surface failures early and recovering from them is effective.

[0005] One technique that addresses this problem is described, for example, in Japanese Patent Application Laid-Open No. 2000-10736. In that technique, a data test is run on each magnetic disk device, and when the test is interrupted by an I/O (input/output) request, the address for the next data test is recorded. The data test is thus executed in the gaps between I/O requests from the host computer. When data that cannot be read normally is detected, the data at that location is restored using the redundant data on the other magnetic disk devices.

[0006]

[Problems to be solved by the invention] In the above prior art, the disk control device determines whether there is I/O from the host computer and performs the medium surface inspection of the magnetic disk devices only when there is none, so that host I/O is not affected. However, in a disk control device having a cache memory, writes to the magnetic disk devices are performed asynchronously with host I/O. Moreover, in a RAID in which multiple devices form a data group, some disk devices are not processing any I/O even while host I/O is in progress. It is therefore inefficient for the disk control device to decide whether to perform the medium surface inspection based on the presence or absence of host I/O.

[0007] A magnetic disk device normally has an error correction function. Even if unreadable bits occur when data is read, the device reports the error-corrected result to the host side as long as the error is within the correctable range. In the above prior art, the disk control device therefore has an opportunity to perform recovery processing only when an uncorrectable error occurs. For this reason, it is considered highly effective to attempt recovery by rewriting even for correctable errors, and to perform write position reassignment early when rewriting fails.

[0008] An object of the present invention is to provide a technique that can prevent the occurrence of uncorrectable errors by performing recovery processing on the storage medium early, at the stage of a correctable error, while error correction is still possible.

[0009] Another object of the present invention is to provide a technique that, in a redundant storage device comprising storage devices in a redundant configuration, can prevent data loss caused by multiple errors exceeding the allowable redundancy.

[0010] Another object of the present invention is to provide a technique that makes it possible to perform a medium surface inspection of the storage medium in a storage device without affecting input/output performance toward the host device.

[0011]

[Means for solving the problems] The present invention is a storage device control method including a first step of executing a read of data from a storage medium and detecting a correctable error that can be recovered by error correction processing, and a second step of attempting to overwrite the data in which the correctable error was detected onto its original storage location on the storage medium.

[0012] The present invention also provides a storage subsystem including a storage device and a storage control device interposed between the storage device and a host device. The storage device has an inspection function that reads data from the storage medium within the device, detects correctable errors that can be recovered by error correction processing, and records them as inspection results. The storage control device has a function that refers to the inspection results of each storage device, reads the data in which a correctable error was detected from the storage device, and attempts to overwrite it onto its original storage location in that storage device.

[0013] More specifically, as an example, a magnetic disk device that writes and reads data requested by a host computer under the control of a disk control device is provided with means for periodically performing a medium surface inspection on a magnetic disk device that has no I/O, and means for detecting as failures, in addition to uncorrectable errors, correctable errors whose error correction length is equal to or greater than a certain value. The disk control device is provided with means for periodically collecting the results of the medium surface inspection from the magnetic disk devices and performing recovery processing.

[0014]

[Embodiments of the invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0015] (Embodiment 1) FIG. 1 is a block diagram showing the system configuration of a disk subsystem, which is an example of a storage subsystem that implements a storage device control method according to one embodiment of the present invention.

[0016] The disk subsystem of the present embodiment consists of a disk control device 110 and a plurality of disk devices 200 under its control. The disk devices 200, which are magnetic disk devices or the like, are connected via the disk control device 110 to a central processing unit 100, which is the host device.

[0017] In the present embodiment, as an example, the plurality of disk devices 200 constitute a disk array with a redundant configuration, such as a RAID. Data exchanged with a host device such as the central processing unit 100, and redundant data generated from that data, are distributed and stored across the plurality of disk devices 200. Depending on the redundancy, data that has become unreadable because of a failure can be restored using the redundant data and the like stored on the other disk devices 200.
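The restoration from redundant data described above can be illustrated with a minimal sketch. The source does not specify the redundancy scheme, so the single XOR parity block used here is purely an assumption, and the function names `xor_blocks` and `rebuild_missing` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (assumed parity scheme, not from the source)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Rebuild the one unreadable block from the surviving blocks and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data blocks spread over three disk devices, parity on a fourth.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# The block on the second device becomes unreadable; restore it from the others.
restored = rebuild_missing([data[0], data[2]], parity)
```

With a single parity block only one unreadable block per stripe can be rebuilt, which is why a second failure in the same data group leads to data loss.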

[0018] The disk control device 110 has a channel IF unit 111, a cache memory 112, a disk device control unit 113, and a diagnosis result registration unit 114. All data transfer between the central processing unit 100 and the disk devices 200 goes through the cache memory 112. Transfers between the central processing unit 100 and the cache memory 112 are controlled by the channel IF unit 111, and transfers between the cache memory 112 and the disk devices 200 are controlled by the disk device control unit 113. The disk control device 110 also has a function of reading the diagnosis results of the individual disk devices 200 and storing them in the diagnosis result registration unit 114.

[0019] The disk device 200 includes a buffer 201, an error correction unit 202, a disk 203, an I/O detection unit 204, a current address registration unit 205, and a diagnosis result registration unit 206. Data transfer between the cache memory 112 of the disk control device 110 and the disk 203 of the disk device 200 goes through the buffer 201. When data is read from the disk 203, the error correction unit 202 corrects any missing portion of the data that is within the correctable range and places the result in the buffer 201. If the data being read contains an unreadable portion that the error correction unit 202 was able to correct, the result is a correctable error and the data in the buffer 201 is complete. If, however, the unreadable portion exceeds the correctable range and cannot be corrected, the result is an uncorrectable error and the data cannot be read.

[0020] In the disk device 200 of the present embodiment, error correction is performed at maximum capability for normal reads, but the correction capability is reduced for medium surface inspection reads. As illustrated in FIG. 6, cases that would ordinarily be correctable errors are thereby also detected as errors and recorded in the diagnosis result registration unit 206.
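The effect of reading with two different correction capabilities can be sketched as follows. This is a minimal illustration only: the capability values, the unit of "damaged length", and the names `read_succeeds` and `classify` are assumptions and do not appear in the source.

```python
# Hypothetical capability values, in arbitrary units of correctable damage per sector.
NORMAL_CAPABILITY = 40       # assumed maximum correction capability (normal reads)
INSPECTION_CAPABILITY = 16   # assumed reduced capability (inspection reads)


def read_succeeds(damaged_len: int, capability: int) -> bool:
    """A read succeeds when the damaged span fits within the correction capability."""
    return damaged_len <= capability


def classify(damaged_len: int) -> str:
    """Classify a sector by how it behaves under the two capability settings."""
    if not read_succeeds(damaged_len, NORMAL_CAPABILITY):
        return "uncorrectable"           # fails even at maximum capability
    if not read_succeeds(damaged_len, INSPECTION_CAPABILITY):
        return "correctable, flagged"    # readable normally, but an error in inspection
    return "not flagged"                 # passes even the reduced-capability read
```

In this model both "uncorrectable" and "correctable, flagged" sectors fail the inspection read and would be recorded, while host reads of the flagged sectors still return complete data.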

[0021] This error correction capability can be set arbitrarily. The I/O detection unit 204 can detect whether there is any I/O activity on the disk device 200. The current address registration unit 205 holds the address at which the medium surface inspection is to be performed. The diagnosis result registration unit 206 can hold up to n entries, each consisting of an address at which the diagnosis found an error (a data storage position on the disk 203) and the detailed information on that error. The current address registration unit 205 and the diagnosis result registration unit 206 are non-volatile, so the diagnosis results survive even if the disk device 200 loses power, and the inspection can resume where it left off after recovery.

[0022] FIG. 2 is a flowchart showing an example of the flow of processing for detecting medium failures in the disk device 200 in the present embodiment. When a fixed time has elapsed since the disk device 200 last performed the medium surface inspection (Yes in step 301), the device checks whether there is I/O to the disk device 200 (step 302). If there is I/O, the medium surface inspection is not performed (No in step 302). If there is no I/O (Yes in step 302), the data at the address registered in the current address registration unit 205 is read into the buffer 201 in the disk device 200 (step 303). In the present embodiment, the error correction capability of the error correction unit 202 is set lower at this time than during normal operation.

[0023] If the data is read normally (No in step 304), the address following the one just inspected is registered in the current address registration unit 205 (step 306) and the processing ends. If an error occurs during the read (Yes in step 304), the inspected address and its detailed error information are registered in the diagnosis result registration unit 206 (step 305). The address following the one just inspected is then registered in the current address registration unit 205, and the processing ends.

[0024] In the present embodiment, an error in step 304 means, as illustrated in FIG. 6, an error that cannot be recovered by error correction processing whose capability has been deliberately set lower than that used in normal operation. It covers both uncorrectable errors and a portion of the errors that would be correctable in normal operation, and it is the target of the recovery processing in the flowchart of FIG. 3 described below.
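One pass of the inspection flow of FIG. 2 (steps 302 to 306) can be sketched as below. This is a simplified model under stated assumptions: the `DiskDevice` class, the `damage` map standing in for the disk 203, the wrap-around to address 0 at the end of the disk, and the choice to drop new entries once n results are held are all hypothetical and not taken from the source. The timer of step 301 is assumed to be handled by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class DiskDevice:
    damage: dict                      # address -> damaged length (hypothetical model of disk 203)
    num_addresses: int
    inspection_capability: int = 16   # assumed reduced correction capability
    max_results: int = 8              # "n" entries of the diagnosis result registration unit
    current_address: int = 0          # current address registration unit 205
    diagnosis_results: list = field(default_factory=list)  # diagnosis result registration unit 206
    io_pending: bool = False          # state reported by the I/O detection unit 204

    def inspection_step(self) -> bool:
        """Run steps 302-306 once. Returns True if an address was inspected."""
        if self.io_pending:                                   # step 302: I/O present, skip
            return False
        addr = self.current_address
        damaged = self.damage.get(addr, 0)                    # step 303: read at reduced capability
        if damaged > self.inspection_capability:              # step 304: error?
            if len(self.diagnosis_results) < self.max_results:
                self.diagnosis_results.append((addr, damaged))  # step 305: record address and detail
        # step 306: register the next address (wrap-around is an assumption)
        self.current_address = (addr + 1) % self.num_addresses
        return True
```

Because the device itself checks for pending I/O, the inspection proceeds on idle devices even while other devices in the same array are serving host requests.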

[0025] FIG. 3 is a flowchart showing the flow of processing in which the disk control device 110 collects the results of the medium surface inspection performed by each disk device 200 and performs recovery processing on the data at the positions where errors occurred. The disk control device 110 periodically collects the results of the medium surface inspection performed by each disk device 200. When a fixed time has elapsed since the diagnosis results were last collected (Yes in step 400), the disk control device 110 collects the diagnosis results from the diagnosis result registration unit 206 of the disk device 200 and registers them in the diagnosis result registration unit 114 in the disk control device 110 (step 401). The disk control device 110 then reads the data at each error address registered in the diagnosis result registration unit 114 into the cache memory 112 in turn (step 402). If the read succeeds (no error, or a correctable error) (Yes in step 403), the data in the cache memory 112 is written back to the disk device 200 (step 405). If the read of the data at the error address fails (an uncorrectable error) (No in step 403), the unreadable data is restored in the cache memory 112 using the data on the redundant disk devices (step 404) and written back to the disk device 200 (step 405). If the write fails (No in step 406), the write position for the data is reassigned (step 407), and the data is written to the newly assigned write position (alternate area) (step 405). When the write completes normally (Yes in step 406), the recovery processing ends.

[0026] The series of processing from step 403 to step 407 is repeated for each error address included in the diagnosis results collected from the diagnosis result registration unit 206.
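The recovery loop of FIG. 3 (steps 402 to 407) can be sketched as follows. The four callables passed to `recover`, the two exception classes, and the `max_reassignments` bound are hypothetical stand-ins introduced for illustration; the source describes the flow but no programming interface, and it places no explicit limit on reassignment.

```python
class UncorrectableError(Exception):
    """Raised by the hypothetical read() when the drive cannot correct the data."""


class WriteError(Exception):
    """Raised by the hypothetical write() when the rewrite fails."""


def recover(error_addresses, read, rebuild_from_redundancy, write, reassign,
            max_reassignments=3):
    """Rewrite the data at each error address; return {original address: final location}."""
    final_location = {}
    for addr in error_addresses:
        try:
            data = read(addr)                      # steps 402/403: read into the cache
        except UncorrectableError:
            data = rebuild_from_redundancy(addr)   # step 404: restore from redundant data
        target, attempts = addr, 0
        while True:
            try:
                write(target, data)                # step 405: write back
                break                              # step 406: write succeeded
            except WriteError:
                if attempts == max_reassignments:  # assumed safety bound, not in the source
                    raise
                target = reassign(target)          # step 407: allocate an alternate area
                attempts += 1
        final_location[addr] = target
    return final_location
```

For a correctable error the `read` call succeeds, so the rewrite uses the data just read and the redundant devices are never touched.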

[0027] As described above, according to Embodiment 1 of the present invention, a disk device 200 that has received no I/O request from the central processing unit 100 periodically and independently performs the medium surface inspection of the disk 203 within that disk device 200, and the disk control device 110 periodically collects the results and performs recovery processing. The medium surface inspection of the disk 203 in each disk device 200 can therefore be performed efficiently without affecting I/O.

[0028] In addition, the disk device 200 includes the error correction unit 202, whose error correction capability can be set arbitrarily. By changing the error correction rate between normal data reads and reads for medium surface inspection (diagnosis), correctable errors above a certain level, among the cases that would be correctable errors under the normal error correction capability, can be made targets of recovery by rewriting. Performing preventive recovery processing at the correctable error stage thus makes it possible to reduce the probability that an uncorrectable error occurs.

[0029] Furthermore, in the case of a correctable error, the data itself can ultimately be read normally, so recovery can be performed by overwriting with the read data itself, without any data restoration using redundant data. Including correctable errors in the recovery processing therefore does not reduce the efficiency of diagnosis or recovery, and there is no concern about data loss due to a multiple error in which the redundant data also turns out to be in error during recovery.

[0030] In other words, in a plurality of redundantly configured disk devices 200 constituting a disk array or the like, failures such as data loss caused by multiple uncorrectable errors can be reliably prevented.

[0031] (Embodiment 2) Embodiment 1 above illustrated the case in which the error correction capability during diagnosis is made lower than during normal operation so that some correctable errors are included among the errors targeted for recovery. Embodiment 2 describes the case in which the correctable errors to be targeted for recovery are selected according to the size of their error correction length.

[0032] That is, in Embodiment 2, the interface between the disk control device 110 and the disk device 200 has resolution for the error correction length of a correctable error (the ability to distinguish long from short), and the disk control device 110 detects as recovery targets those correctable errors, found in the medium surface inspection performed by the disk device 200, whose error correction length is equal to or greater than a certain value (threshold).

[0033] FIG. 5 illustrates an example of the contents of the diagnosis result registration unit 206 in the present embodiment for this purpose. For each error, the diagnosis result registration unit 206 records at least an address 206a indicating the storage position on the disk 203 of the data in which the diagnosis detected the error, and, for a correctable error, an error correction length 206b.

[0034] Then, in step 402A of the recovery processing by the disk control device 110 illustrated in FIG. 4, the disk control device 110 refers to the error correction length 206b corresponding to each address 206a in the diagnosis results read from the diagnosis result registration unit 206 of the disk device 200, selects the entries whose error correction length 206b is greater than a predetermined threshold, that is, the more severe correctable errors, and performs recovery processing on them.

[0035] Alternatively, when error information is recorded in the disk device 200, only correctable errors whose error correction length is equal to or greater than a predetermined value may be recorded. This reduces the amount of recorded data and improves efficiency by eliminating the error correction length check in the disk control device 110.
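The selection by error correction length can be sketched as a simple filter over the recorded entries. The `DiagnosisEntry` type and `select_recovery_targets` function are hypothetical names, and the comparison uses "equal to or greater than" as an assumption, since the text describes the threshold both as "equal to or greater than" and as "greater than".

```python
from typing import NamedTuple


class DiagnosisEntry(NamedTuple):
    address: int             # corresponds to address 206a
    correction_length: int   # corresponds to error correction length 206b


def select_recovery_targets(entries, threshold):
    """Return the addresses whose error correction length meets the threshold."""
    return [e.address for e in entries if e.correction_length >= threshold]


entries = [DiagnosisEntry(10, 3), DiagnosisEntry(11, 8), DiagnosisEntry(12, 12)]
targets = select_recovery_targets(entries, threshold=8)
```

The same predicate could equally be applied inside the disk device at recording time, in which case only the qualifying entries would ever be stored.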

[0036] The processing in the flowchart of FIG. 4 other than step 402A is the same as in Embodiment 1 of FIG. 3 described above, so its description is omitted.

[0037] Thus, Embodiment 2 provides the same effects as Embodiment 1 described above. In addition, each disk device 200 records the address at which a correctable error occurred in association with the error correction length of that correctable error, and correctable errors whose error correction length is equal to or greater than a predetermined threshold are made targets of recovery processing. Performing preventive recovery processing at the correctable error stage thus makes it possible to reduce the probability that an uncorrectable error occurs.

[0038] The invention described in the claims of the present application can be expressed from a different viewpoint as follows.

[0039] <1> A disk subsystem in which a disk control device is connected to one or more magnetic disk devices, comprising: a magnetic disk device capable of detecting correctable errors in a medium surface inspection by changing its error correction capability between normal reads and medium surface inspection reads; means for writing the data for which a correctable error was detected in the medium surface inspection to the magnetic disk device; and means for confirming that the data was written correctly.

[0040] <2> The disk subsystem according to item <1>, comprising, within the magnetic disk device, a medium surface inspection function and means for holding the address and detailed information when an error is detected, and comprising means for periodically performing the medium surface inspection when there is no I/O to that magnetic disk device, so that even when there is host I/O, the medium surface inspection can be performed on magnetic disk devices that have no I/O.

[0041] <3> The disk subsystem according to item <1>, comprising means for recording the error address and detailed error information in non-volatile memory or storing them on the magnetic disk, so that even when the magnetic disk device stops because of a power loss, the inspection results are retained and, after restoration, the medium surface inspection can be resumed from the position at which it was interrupted.

[0042] <4> The disk subsystem according to item <1>, comprising: means for periodically reading the results of the medium surface inspection performed by the magnetic disk device; means for reading the data at each address registered as an error into a cache, writing the read data to the magnetic disk device, and confirming that it was written correctly; means for reassigning the write position on the magnetic disk and writing there when the write fails; and means for restoring the data in the cache using data recorded on the redundant magnetic disk devices when the read into the cache fails.

[0043] <5> A disk subsystem in which a disk control device is connected to one or more magnetic disk devices, comprising means for detecting correctable errors in a medium surface inspection and performing recovery processing by writing, by virtue of an interface between the magnetic disk device and the disk control device that has error correction length resolution for correctable errors.

[0044] The invention made by the present inventors has been described specifically above on the basis of embodiments, but it goes without saying that the present invention is not limited to those embodiments and can be modified in various ways without departing from its gist.

[0045]

[Effects of the invention] According to the present invention, performing recovery processing on the storage medium early, at the stage of a correctable error, while error correction is still possible, makes it possible to prevent the occurrence of uncorrectable errors.

[0046] According to the present invention, in a redundant storage device comprising storage devices in a redundant configuration, data loss caused by multiple errors exceeding the allowable redundancy can be prevented.

[0047] According to the present invention, a medium surface inspection of the storage medium in a storage device can be performed without affecting input/output performance toward the host device.

[Brief description of the drawings]

[FIG. 1] A block diagram showing the system configuration of a disk subsystem, which is an example of a storage subsystem that implements a storage device control method according to one embodiment of the present invention.

[FIG. 2] A flowchart showing an example of the operation of a disk device constituting a storage subsystem that implements a storage device control method according to one embodiment of the present invention.

[FIG. 3] A flowchart showing an example of the operation of a disk control device constituting a storage subsystem that implements a storage device control method according to one embodiment of the present invention.

FIG. 4 is a flowchart showing a modified example of the operation of a disk control device constituting a storage subsystem that implements a storage device control method according to an embodiment of the present invention.

FIG. 5 is an explanatory diagram showing an example of the information used in a disk subsystem, which is an example of a storage subsystem that implements a storage device control method according to an embodiment of the present invention.

FIG. 6 is a conceptual diagram showing an example of the operation of a disk subsystem, which is an example of a storage subsystem that implements a storage device control method according to an embodiment of the present invention.

[Explanation of symbols]

100: central processing unit; 110: disk control device (storage control device); 111: channel IF unit; 112: cache memory; 113: disk device control unit; 114: diagnosis result registration unit; 200: disk device (storage device); 201: buffer; 202: error correction unit; 203: disk; 204: I/O detection unit; 205: current address registration unit; 206: diagnosis result registration unit; 206a: address; 206b: error correction length.

Continuation of the front page. (72) Inventor: Takao Sato, Storage Systems Division, Hitachi, Ltd., 2880 Kozu, Odawara-shi, Kanagawa. F-terms (reference): 5B018 HA14 KA21 MA12; 5B065 BA01 CA11 EA03 EA15 EA24

Claims

[Claims]

1. A storage device control method comprising: a first step of executing a process of reading data from a storage medium and detecting a correctable error that can be recovered by an error correction process; and a second step of reading the data in which the correctable error was detected from the storage medium and attempting to overwrite it at its original storage location.

2. The storage device control method according to claim 1, wherein the storage device is each of a plurality of magnetic disk devices in a redundant configuration constituting a disk array; in the first step, an uncorrectable error that cannot be recovered by the error correction process is detected together with the correctable error, and when the uncorrectable error is detected, the correct data is obtained by an error recovery process using redundant data stored in another of the storage devices; and in the second step, whether the overwriting of the data has succeeded is checked, and if the overwriting has failed, the data is written to an alternate area of the storage medium.

3. A storage subsystem including a storage device and a storage control device interposed between the storage device and a host device, wherein the storage device has an inspection function of executing a process of reading data from a storage medium in the storage device, detecting a correctable error that can be recovered by an error correction process, and recording it as an inspection result; and the storage control device has a function of referring to the inspection result of each storage device, reading the data in which the correctable error was detected from the storage device, and attempting to overwrite it at its original storage location in the storage device.

4. The storage subsystem according to claim 3, wherein the storage device is each of a plurality of storage devices in a redundant configuration, and the storage control device has a function of: when reading the data from a specific storage device based on the inspection result fails, performing a data recovery process using redundant data stored in the other storage devices and attempting the overwriting with the recovered data; and, when the overwriting fails, storing the data in an alternate area of the storage medium of that storage device.

5. The storage subsystem according to claim 3, wherein the inspection function of the storage device detects the correctable error with the capability of the error correction process lowered relative to normal operation; the storage control device selects, from the data in which the correctable error occurred, the data whose error correction length is equal to or greater than a predetermined value and attempts the overwriting.
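
As a non-limiting illustration, the flow recited in claims 1 to 5 can be sketched in Python as follows. The sketch models the inspection result as a list of (address, error correction length) pairs, selects entries by error correction length, rewrites the data in place, rebuilds it from redundant data when the read fails, and falls back to an alternate area when the rewrite fails. The names `Sector`, `Disk`, `xor_rebuild` and `recover`, the XOR parity scheme, and the numeric limits are all assumptions made for illustration; they do not appear in the specification, and lowering the error correction capability during inspection is not modeled.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Sector:
    data: bytes
    error_len: int = 0     # bytes the ECC must correct on read; 0 means clean
    writable: bool = True  # False models a location where rewriting fails


@dataclass
class Disk:
    sectors: dict          # address -> Sector
    ecc_limit: int = 8     # longest error the ECC can correct (assumed value)
    alternate: dict = field(default_factory=dict)  # address -> relocated data
    diagnosis: list = field(default_factory=list)  # (address, error length)

    def read(self, addr):
        """Return the data, or None for an uncorrectable error."""
        if addr in self.alternate:
            return self.alternate[addr]
        s = self.sectors[addr]
        return s.data if s.error_len <= self.ecc_limit else None

    def overwrite(self, addr, data):
        """Rewrite in place; a successful rewrite clears the error."""
        s = self.sectors[addr]
        if not s.writable:
            return False
        s.data, s.error_len = data, 0
        return True

    def inspect(self):
        """Idle-time surface scan: record every location that needed correction."""
        self.diagnosis = [(a, s.error_len) for a, s in self.sectors.items()
                          if s.error_len > 0 and a not in self.alternate]


def xor_rebuild(disks, skip, addr):
    """Rebuild one block as the XOR of the same address on all other disks."""
    blocks = [d.read(addr) for i, d in enumerate(disks) if i != skip]
    if not blocks or any(b is None for b in blocks):
        return None  # redundancy exceeded; the data cannot be rebuilt
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


def recover(disks, min_error_len):
    """Controller side: act on the inspection result of each disk."""
    log = []
    for i, disk in enumerate(disks):
        for addr, err_len in disk.diagnosis:
            if err_len < min_error_len:
                continue  # short error: not selected for rewriting
            data = disk.read(addr)
            if data is None:  # uncorrectable: use the redundant data
                data = xor_rebuild(disks, i, addr)
            if data is None:
                log.append((i, addr, "lost"))
            elif disk.overwrite(addr, data):
                log.append((i, addr, "rewritten"))
            else:  # rewrite failed: store in the alternate area
                disk.alternate[addr] = data
                log.append((i, addr, "relocated"))
        disk.diagnosis = []
    return log
```

In this sketch, each disk would run `inspect()` while it has no input or output pending, and the controller would call `recover(disks, min_error_len)` periodically. The returned log states, for each selected location, whether it was rewritten in place, relocated to the alternate area, or could not be rebuilt.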