JPH11167498A

JPH11167498A - Data processor

Info

Publication number: JPH11167498A
Application number: JP9335302A
Authority: JP
Inventors: Toshio Matsumoto; 利夫松本
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1997-12-05
Filing date: 1997-12-05
Publication date: 1999-06-22

Abstract

PROBLEM TO BE SOLVED: To provide a highly reliable data processor with excellent fault tolerance that achieves fail-soft operation: even when a failure occurs because of a bug in a program executed by the data processor, the processor keeps operating reliably while losing as few functions as possible. SOLUTION: Normal operation firmware 40A and failure-time degraded operation firmware 40B, expressed in mutually different code, are stored in ROMs 30A and 30B, respectively. Firmware 40B provides only the essential functions, which are a subset of the functions of firmware 40A. Firmware 40A is executed first. When a failure occurs, switching means 32 switches to executing firmware 40B, read from ROM 30B. Because firmware 40A and 40B do not share bugs, the failure is resolved.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a data processing device that performs information processing and device control. More particularly, it relates to improving the fault tolerance of such a device.

[0002]

2. Description of the Related Art

In a data processing apparatus, tolerance to faults caused by hardware failures is improved so that processing can continue. Methods include hardware duplication and degenerate operation, in which the faulty hardware is isolated. However, such hardware-level measures may be unable to cope when the cause of a failure lies in a program such as firmware. For example, suppose the hardware is duplicated but both units run the same firmware. If that firmware contains a bug, both units fail and the system may hang. Even if the system is stopped and restarted, it may fall into the same failure again.

[0003] Failures caused by programs such as software or firmware can be handled by providing the system with several types of program and switching to a different program when a failure occurs. This allows operation to continue and improves fault tolerance and reliability.

[0004] FIG. 7 is a block diagram showing the configuration of a conventional data processing apparatus disclosed in Japanese Patent Application Laid-Open No. 64-7147. In FIG. 7, read-only memories (ROMs) 2A and 2B store identical firmware. A load control unit 4 loads the firmware stored in, for example, ROM 2A into RAM (Random Access Memory) 6, a memory that can be written and read at any time. If the load control unit 4 detects a failure of ROM 2A during loading, it aborts the load, switches the load source from ROM 2A to ROM 2B, and restarts the load. Loading from ROM 2A may complete normally, but the CPU 8 may then detect an abnormality while executing the loaded firmware. In that case the CPU requests the load control unit to reload. In response, the load control unit loads the firmware from the other ROM, 2B, and the CPU 8 starts executing it.
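The load-source fallback described above can be illustrated with a minimal sketch. It assumes a ROM read failure is modeled as `None` and that the load control unit detects corrupted contents with a CRC-32 check. The CRC check is an illustrative choice; the cited publication only says a "failure" is detected. Both ROMs hold the same bytes, so a logic bug in the image is loaded whichever ROM is used.

```python
import zlib
from typing import Optional


def load_firmware(roms: list[Optional[bytes]], expected_crc: int) -> bytes:
    """Try each ROM in order and return the first intact image (copied to RAM).

    None models an unreadable ROM; a CRC mismatch models corrupted contents.
    Both cases make the loader switch to the next ROM.
    """
    for image in roms:
        if image is None:
            continue  # read failure: switch the load source
        if zlib.crc32(image) == expected_crc:
            return image  # intact image: load it
    raise RuntimeError("no intact firmware image in any ROM")


good = b"firmware v1"
crc = zlib.crc32(good)
corrupted = b"firmware v!"  # same length, one byte damaged

loaded_after_corruption = load_firmware([corrupted, good], crc)
loaded_after_read_error = load_firmware([None, good], crc)
```

Both calls return `good`. That is exactly why this scheme recovers from ROM faults but not from a bug inside `good` itself.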

[0005] Thus, one ROM may fail and become unreadable, or its contents may be corrupted. In either case, changing the firmware read source loads intact firmware, and the data processing apparatus can carry out normal processing. Note, however, that the two ROMs hold firmware with identical contents. Any bug (a defect in the firmware) is therefore present in both.

[0006] FIG. 8 shows another prior art: a principle diagram of a semiconductor memory device disclosed in Japanese Patent Laid-Open No. 7-248909. ROMs 12A and 12B store different versions of a program, for example the firmware before a revision and the firmware after it. Operating a changeover switch 14 selects which ROM the firmware is read from into RAM 16. Even after a version upgrade, the pre-revision firmware can easily be run by operating the changeover switch 14. Its results can then be compared with those of the revised firmware. Alternatively, when a failure occurs, the device can switch back to the pre-revision firmware to recover.

[0007] In this case, the firmware in the two ROMs differs, since one version predates the upgrade and the other follows it. In general, however, an upgrade starts from the pre-upgrade firmware and adds code to it or changes or deletes parts of its code. Much of the code is therefore shared between the versions before and after the upgrade. A bug in the shared code is contained in both ROM 12A and ROM 12B.

For example, suppose pre-revision firmware A consists of code C1, code C2 and code C3. Firmware B is an upgraded version of firmware A and consists of code C1, code C2 and code C3', which was created by modifying code C3. Code C3 and code C3' relate to the revision. All other code, namely C1 and C2, is shared by firmware A and firmware B. Consequently, any bug in code C1 or C2 is present in both firmware A and firmware B.
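The C1/C2/C3 example can be checked with a small sketch that models each firmware image as the set of code modules it is built from. The set model and the function names are illustrative assumptions, not part of the cited publications.

```python
# Hypothetical model: each firmware image is the set of modules it contains.
FIRMWARE_A = {"C1", "C2", "C3"}   # pre-revision firmware
FIRMWARE_B = {"C1", "C2", "C3'"}  # upgrade: C3 modified into C3'


def shared_modules(a: set[str], b: set[str]) -> set[str]:
    """Modules present in both images; a bug here is carried by both."""
    return a & b


def images_affected(bug_module: str, *images: set[str]) -> int:
    """Count how many firmware images contain the buggy module."""
    return sum(bug_module in image for image in images)


print(sorted(shared_modules(FIRMWARE_A, FIRMWARE_B)))  # ['C1', 'C2']
print(images_affected("C2", FIRMWARE_A, FIRMWARE_B))   # 2: switching ROMs does not help
print(images_affected("C3", FIRMWARE_A, FIRMWARE_B))   # 1: switching to B avoids the bug
```

Switching versions only escapes a bug that lies in the revised code (C3). This is the limitation the invention addresses.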

[0008] Another conventional approach to improving fault tolerance is to improve availability. Improving availability means raising the ratio of the time the system functions correctly (uptime) to the time it does not (downtime). To this end, downtime is reduced, for example by shortening failure recovery time, and uptime is lengthened. One way to lengthen uptime is fail-soft operation (also called a degradation-tolerant system). Here, a system that performs multiple functions in normal operation continues to run after losing the functions that have failed. Specifically, failure isolation identifies a program unit as the range in which the fault may lie, and the functions of that unit are lost.

【０００９】[0009]

Problems to be Solved by the Invention

In the conventional examples described above, switching the ROM in use makes it possible to recover from a ROM failure or from a failure caused by a bug in code added to the firmware. However, the multiple programs (firmware) stored in these conventional devices share common code. A program failure is therefore quite likely to be caused by a bug in this shared portion. Such a failure is not cleared even when the program is switched as described above, so processing cannot continue.

[0010] Fail-soft systems have a different problem. The range that may contain the fault sometimes cannot be narrowed sufficiently, and that range may include a function essential to system operation. In that case, every function in the identified range is lost, including the essential one, and the system can no longer operate.

[0011] The present invention solves the above problems. Its object is to provide a data processing device with excellent fault tolerance and reliability that achieves fail-soft operation. Even when a failure occurs because of a bug in a program executed by the device, the device keeps running reliably while losing as few functions as possible.

【００１２】[0012]

Means for Solving the Problems

A data processing apparatus according to a first aspect of the present invention comprises:

- first program storage means for storing a normal operation program;
- second program storage means for storing a failure-time degraded operation program, whose functional specification provides fewer functions than that of the normal operation program;
- program selection means for selecting either the normal operation program or the failure-time degraded operation program as the program to execute; and
- processing failure determination means for detecting a failure during execution of the normal operation program.

The functions of the failure-time degraded operation program and the corresponding functions of the normal operation program are created from mutually different functional specifications. In their entirety, they are expressed in mutually different code. When the processing failure determination means detects a failure in the normal operation program, the program selection means switches the program to execute from the normal operation program to the failure-time degraded operation program.
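The selection logic of the first aspect can be sketched in a few lines. In this simplified, in-process sketch a "program" is a Python callable handling one request, and any exception stands in for the processing failure determination means. The embodiment below instead switches by rewriting a ROM identifier and restarting. `ProgramSelector`, `normal_program` and `degraded_program` are hypothetical names.

```python
from typing import Callable

# Assumption for illustration: a program is a callable that handles one request.
Program = Callable[[int], int]


class ProgramSelector:
    """Sketch of the program selection means: use the normal operation program
    until a failure is detected, then switch permanently to the degraded one."""

    def __init__(self, normal: Program, degraded: Program) -> None:
        self.normal = normal
        self.degraded = degraded
        self.current = normal

    def run(self, request: int) -> int:
        if self.current is self.normal:
            try:
                return self.normal(request)
            except Exception:
                # Stand-in for the processing failure determination means:
                # any exception counts as a detected failure.
                self.current = self.degraded
        return self.degraded(request)


def normal_program(x: int) -> int:
    """Full-featured program with a latent bug on one input."""
    if x == 3:
        raise RuntimeError("latent bug in normal code path")
    return x * 10


def degraded_program(x: int) -> int:
    """Independently written, reduced-function program."""
    return x


selector = ProgramSelector(normal_program, degraded_program)
results = [selector.run(x) for x in [1, 2, 3, 4]]
print(results)  # [10, 20, 3, 4]
```

Re-running the failed request on the degraded program is a design choice of this sketch. Once switched, the selector never returns to the normal program, which mirrors the one-way switch described above.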

[0013] In a data processing apparatus according to a second aspect of the present invention, the normal operation program and the failure-time degraded operation program are each either firmware stored in read-only memory or software stored on a magnetic recording medium.

[0014] In a data processing apparatus according to a third aspect of the present invention, the normal operation program increments a count as its main processing progresses. The processing failure determination means judges that a failure has occurred when the count value is not updated normally.

[0015] In a preferred aspect of the present invention, the processing failure determination means consists of timer interrupt processing that checks whether the count value has been updated normally.
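A minimal sketch of this count-based check follows. It assumes the main processing calls `kick()` at least once per timer period. It also assumes "not updated normally" means "unchanged since the previous timer tick". The class and method names are hypothetical, and the timer interrupt is simulated by calling the handler directly.

```python
class ProgressWatchdog:
    """Sketch of count-based failure determination: the main processing
    increments a counter; a periodic timer interrupt handler checks it."""

    def __init__(self) -> None:
        self.count = 0        # updated by the normal operation program
        self._last_seen = 0   # value observed at the previous timer tick
        self.failed = False

    def kick(self) -> None:
        """Called by the main processing each time it makes progress."""
        self.count += 1

    def on_timer_interrupt(self) -> None:
        """Called once per timer period; no update since last tick => failure."""
        if self.count == self._last_seen:
            self.failed = True
        self._last_seen = self.count


wd = ProgressWatchdog()
wd.kick()                  # main processing made progress
wd.on_timer_interrupt()    # count changed since last tick: OK
healthy = not wd.failed
wd.on_timer_interrupt()    # no kick since last tick (e.g. stuck in a loop)
print(healthy, wd.failed)  # True True
```

The timer period must be longer than one pass of the main processing. Otherwise a healthy but slow pass would be flagged as a failure.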

[0016] In a data processing apparatus according to a fourth aspect of the present invention, the failure-time degraded operation program may be started after the program selection means switches to it. At that start, certain predetermined initialization steps that are unnecessary at switchover are omitted from its initialization processing.

[0017] In a preferred aspect of the present invention, the initialization steps that are unnecessary at switchover include at least one of the following:

- memory initialization for generating memory check bits or for operation checks;
- initialization of the processing failure determination means;
- initialization of a conversion data table used for processing, when that table is shared between the normal operation program and the failure-time degraded operation program.
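The omission of switchover-unnecessary initialization can be sketched as filtering a fixed list of steps. The three skippable steps follow the list above. `DEVICE_RESET` is a hypothetical step added only to show that other initialization still runs. The sketch omits all three listed steps, although the text requires only at least one.

```python
from enum import Enum, auto


class InitStep(Enum):
    MEMORY_CHECK_BITS = auto()   # generate check bits / memory operation check
    WATCHDOG_SETUP = auto()      # initialize the failure determination means
    SHARED_TABLE_SETUP = auto()  # build the conversion table shared by both programs
    DEVICE_RESET = auto()        # hypothetical step that always runs


FULL_INIT = [
    InitStep.MEMORY_CHECK_BITS,
    InitStep.WATCHDOG_SETUP,
    InitStep.SHARED_TABLE_SETUP,
    InitStep.DEVICE_RESET,
]

# Steps already done by the normal operation program before the switch.
SKIPPABLE_ON_SWITCHOVER = frozenset({
    InitStep.MEMORY_CHECK_BITS,
    InitStep.WATCHDOG_SETUP,
    InitStep.SHARED_TABLE_SETUP,
})


def init_steps(switched_from_normal: bool) -> list[InitStep]:
    """Return the initialization steps to run at startup."""
    if not switched_from_normal:
        return list(FULL_INIT)
    return [s for s in FULL_INIT if s not in SKIPPABLE_ON_SWITCHOVER]


print([s.name for s in init_steps(switched_from_normal=True)])  # ['DEVICE_RESET']
```

Skipping these steps shortens the switchover, which helps reduce downtime.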

[0018] A data processing apparatus according to a fifth aspect of the present invention has hardware failure removal means. This means detects a hardware failure of the apparatus when the failure recurs after the program selection means has switched to the failure-time degraded operation program. It then isolates the part of the hardware related to that failure.

【００１９】[0019]

Embodiments of the Invention

[First Embodiment] FIG. 1 is a block diagram showing the overall schematic configuration of a data processing apparatus according to an embodiment of the present invention. The apparatus is used, for example, in communication control processors, computer system management controllers and database machines. Components such as semiconductor devices used in a data processing apparatus may be destroyed at excessively high temperatures. They may also malfunction because their characteristics change outside a certain temperature range that includes room temperature. The apparatus avoids such temperature-related operating problems and keeps its processing results highly reliable. To do so, it monitors the temperature of the circuit board inside the main unit and takes predetermined actions.

[0020] In FIG. 1, the general components of the data processing apparatus are connected to a system bus 28. These are a central processing unit (CPU) 20 serving as the processor, a RAM 22 that can be written and read at any time, a nonvolatile memory 24 and a monitoring device 26. In this apparatus, two read-only memories (ROMs) 30A and 30B are also connected to the system bus 28 via switching means 32. The system bus 28 further connects a temperature sensor 34 for the board temperature monitoring described above. It also connects a liquid crystal display (LCD) 36, a display unit that shows the processing results of the main unit.

[0021] The firmware is held in the ROMs described above. A typical data processing apparatus has only one firmware image, but this apparatus has two:

- normal operation firmware 40A, used during normal operation;
- failure-time degraded operation firmware 40B, which implements a subset of the functions of firmware 40A, namely the minimum basic functions needed to keep operating after a failure.

The two ROMs store them. ROM 30A is program storage means holding the normal operation firmware 40A (firmware A). ROM 30B is program storage means holding the failure-time degraded operation firmware 40B (firmware B).

[0022] The failure-time degraded operation firmware shares certain basic functions (degraded operation functions) with the normal operation firmware. In a broad sense, these match the corresponding functions of the normal operation firmware. However, they are created from different specifications, so their functions or processing may differ at a finer level.

[0023] As an example, consider the shutdown function, which stops the system. In the normal operation firmware and in the failure-time degraded operation firmware, the shutdown functions have one thing in common: both ultimately stop the system. However, the normal operation firmware can be written to a specification that performs various save operations before the power is turned off. The failure-time degraded operation firmware can instead be written to a specification that stops access operations to the hard disk and similar devices but skips saving the data of running applications.

Take keeping the system running as an example goal. A function "in the broad sense" is one at a level that cannot be replaced by another function or omitted without defeating that goal. A function "in the finer sense" is one that can be replaced or omitted without hindering it. Put another way, the functions share a common purpose but differ in the means of achieving it.
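The shutdown example can be sketched as two independently written functions that reach the same end state by different means. Each function returns the list of actions it would take. The action strings and function names are hypothetical.

```python
def normal_shutdown(open_apps: list[str]) -> list[str]:
    """Normal firmware: save every application's data before stopping."""
    actions = [f"save data of {app}" for app in open_apps]
    actions += ["stop disk access", "halt system"]
    return actions


def degraded_shutdown(open_apps: list[str]) -> list[str]:
    """Degraded firmware: same goal (halt), different means: skip the saves."""
    del open_apps  # deliberately unused: application data is not saved
    return ["stop disk access", "halt system"]


for shutdown in (normal_shutdown, degraded_shutdown):
    print(shutdown(["database"]))
```

Both action lists end by halting the system, which is the function in the broad sense. They differ in the intermediate save steps, which are functions in the finer sense. Because the two bodies share no code, a bug in one is not automatically present in the other.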

[0024] The common functions of the normal operation firmware 40A and the failure-time degraded operation firmware 40B are created from different specifications. They are therefore generally expressed in different code.

[0025] The switching means 32 selects one of the ROMs storing this firmware as the source to read firmware from, and connects that ROM to the system bus 28. That is, the switching means 32 functions as program selection means. ROMs 30A and 30B may also be erasable read-only memories such as EPROM or flash ROM.

[0026] The nonvolatile memory 24 holds information identifying the ROM to be read at the next startup. For example, suppose ROMs 30A and 30B are assigned the identification numbers "1" and "2". Then "1" is written when firmware is to be read from ROM 30A, and "2" when it is to be read from ROM 30B. This value survives power-off of the data processing apparatus and is consulted at the next startup. The nonvolatile memory 24 need not be a dedicated semiconductor memory. A predetermined area of a magnetic disk drive (not shown) can serve as the storage location instead. Other arrangements are also possible, such as providing volatile memory in the switching means 32 and holding the value there only while power is on.
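Persistent ROM selection can be sketched with a small file standing in for the nonvolatile memory 24, in the spirit of the disk-area alternative mentioned above. Falling back to ROM "1" when the record is missing or unreadable is an assumption of this sketch, not something the text specifies. `BootSelector` is a hypothetical name.

```python
import tempfile
from pathlib import Path

ROM_NORMAL = 1    # ROM 30A: normal operation firmware 40A
ROM_DEGRADED = 2  # ROM 30B: failure-time degraded operation firmware 40B


class BootSelector:
    """Stores the ID of the ROM to load at the next startup in a file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def read(self) -> int:
        try:
            value = int(self.path.read_text().strip())
        except (FileNotFoundError, ValueError):
            return ROM_NORMAL  # assumed default when no valid record exists
        return value if value in (ROM_NORMAL, ROM_DEGRADED) else ROM_NORMAL

    def write(self, rom_id: int) -> None:
        if rom_id not in (ROM_NORMAL, ROM_DEGRADED):
            raise ValueError(f"unknown ROM id: {rom_id}")
        self.path.write_text(f"{rom_id}\n")


with tempfile.TemporaryDirectory() as d:
    record = Path(d) / "rom_id"
    first_boot = BootSelector(record).read()    # nothing written yet
    BootSelector(record).write(ROM_DEGRADED)    # failure detected (S75)
    after_restart = BootSelector(record).read() # a fresh instance sees "2"
```

Creating a fresh `BootSelector` for each access mimics a restart: only what was written to the file survives.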

[0027] The monitoring device 26 is the processing failure determination means that detects failures during firmware execution. Its operation is explained below.

[0028] Next, the operation of the data processing apparatus of this embodiment is described. FIG. 2 is a flowchart illustrating the operation of the apparatus. When the data processing apparatus starts, the CPU 20 issues a firmware load command onto the system bus 28 (S50). On receiving it, the switching means 32 reads the ROM identification information held in the nonvolatile memory 24 (S55). At a normal startup, the nonvolatile memory 24 holds the value "1", which designates ROM 30A, where the normal operation firmware 40A resides. The switching means 32 therefore selects ROM 30A, reads the normal operation firmware 40A from it, and writes the firmware into the RAM 22 (S60).

[0029] Once all of the normal operation firmware 40A has been read from the selected ROM and written into the RAM 22, the CPU 20 starts executing the copy in the RAM 22 (S65). By executing this firmware, the apparatus exchanges data with external devices via input/output ports (not shown) and performs the prescribed processing and control.

[0030] The monitoring device 26 monitors the execution of the normal operation firmware. Suppose it detects a failure in that execution (S70). It then rewrites the startup ROM identification information in the nonvolatile memory 24 from "1", designating ROM 30A, to "2", designating ROM 30B, which holds the failure-time degraded operation firmware 40B (S75). It then restarts the system (S80).

[0031] For example, a bug may make firmware mishandle a branch instruction or similar control flow. Processing that should proceed in a prescribed sequence then falls into an infinite loop. The monitoring device 26 detects such a state in the normal operation firmware and judges it to be a failure.

[0032] On restart, the CPU 20 again issues a firmware load command onto the system bus 28 (S85). The switching means 32 receives the command and obtains the ROM identification value "2" from the nonvolatile memory 24 (S90). It selects the corresponding ROM 30B, reads the failure-time degraded operation firmware 40B from it, and writes that firmware into the RAM 22 (S95).

[0033] As a result, the firmware in the RAM 22 is replaced by the failure-time degraded operation firmware, and the CPU 20 starts executing it (S100). As described above, this firmware is created from a different specification than the normal operation firmware that ran before. It is therefore generally expressed in different code, and it generally does not contain the bug that caused the failure in the normal operation firmware. It can therefore process and control correctly even if the situation that caused the previous failure arises again.
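The FIG. 2 sequence (S50 to S100) can be simulated with a minimal sketch. Each ROM image is modeled as a callable, the nonvolatile memory as a dictionary, and a detected failure as the hypothetical `FirmwareFailure` exception. Re-raising on a second failure is an assumption of the sketch; that case is handled by the hardware diagnosis described next.

```python
from dataclasses import dataclass, field
from typing import Callable


class FirmwareFailure(Exception):
    """Raised when the monitoring device detects a failure (e.g. a hung loop)."""


Firmware = Callable[[], str]


@dataclass
class Machine:
    roms: dict[int, Firmware]  # ROM id -> firmware image
    nvram: dict[str, int] = field(default_factory=lambda: {"rom_id": 1})
    trace: list[str] = field(default_factory=list)

    def boot(self) -> str:
        rom_id = self.nvram["rom_id"]   # S55 / S90: read ROM id
        ram_image = self.roms[rom_id]   # S60 / S95: copy firmware into RAM
        self.trace.append(f"loaded ROM {rom_id}")
        try:
            return ram_image()          # S65 / S100: execute from RAM
        except FirmwareFailure:         # S70: failure detected
            if rom_id == 1:
                self.nvram["rom_id"] = 2  # S75: select degraded firmware
                self.trace.append("failure: switch to ROM 2, restart")
                return self.boot()        # S80: restart
            raise  # failure under degraded firmware: left to diagnosis


def normal_fw() -> str:
    raise FirmwareFailure("stuck in an infinite loop")


def degraded_fw() -> str:
    return "running in degraded mode"


m = Machine(roms={1: normal_fw, 2: degraded_fw})
status = m.boot()
```

After the call, `status` is the degraded firmware's result, the stored ROM id is 2, and the trace records both loads and the switch in between.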

[0034] If the failure recurs even after the firmware has been switched, a hardware failure becomes more likely. The monitoring device 26 may be configured to treat a failure that recurs after such a switch as a hardware failure and to notify the user. The data processing apparatus may also be given hardware failure removal means. These means detect a hardware failure from such a situation and isolate the part of the hardware related to it. For example, a hardware diagnostic program can be stored in a ROM (not shown). If the failure recurs after the firmware switch, the diagnostic program is loaded into RAM and executed to diagnose the hardware in detail.

[0035] FIG. 3 is a flowchart of the processing performed in this embodiment when a failure occurs. First, when the apparatus starts, the normal operation firmware is loaded and executed (S110). The monitoring device 26 continuously monitors for failures (S112). When a failure occurs, the device determines whether it is a recurrence after a firmware switch (S114). If it is the first failure under the current firmware, the firmware is switched (S116) and monitoring continues.

[0036] If, on the other hand, the failure is a recurrence under the current firmware, the hardware diagnostic program is loaded into RAM and executed for a detailed hardware diagnosis (S118). If this diagnosis detects new faulty hardware (S120), that hardware is written into the faulty hardware list (S122). Take RAM as an example. The diagnosis may find data errors in part of the RAM, or repeated data compare checks may give inconsistent results. In that case, the RAM module containing that area is judged faulty and is recorded in the faulty hardware list, held for example in nonvolatile memory. Similarly, in a multi-CPU apparatus, the diagnosis may detect a fault such as an incorrect calculation result in one of the CPUs. That CPU is then written into the list in the same way.

[0037] The firmware to run next is then designated, and the system is restarted. At this point the faulty hardware list is consulted, and the hardware recorded in it is isolated at startup (S124). In the RAM example above, the RAM modules on the list are not used, and only the remaining modules are. In the multi-CPU example, the apparatus is reconfigured so that the CPUs on the list are disabled and only the remaining CPUs are used.

[0038] If no new faulty hardware is detected in step S120, the cause is judged to be unidentifiable. This is reported externally and the system is stopped (S126).
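The FIG. 3 decision flow (S114 to S126) can be sketched as a small state machine around a faulty hardware list. The diagnostic program is modeled as a hypothetical callable that receives the active hardware units and returns the ones it finds faulty. Keeping the "already switched" flag set after an S124 restart is an assumption, because the text does not say how that state is reset.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FaultHandler:
    diagnose: Callable[[set[str]], set[str]]  # hypothetical diagnostic program
    all_units: set[str]                       # e.g. CPUs and RAM modules
    faulty_list: set[str] = field(default_factory=set)  # kept in nonvolatile memory
    switched: bool = False
    stopped: bool = False

    def active_units(self) -> set[str]:
        """S124: hardware on the faulty list is isolated at startup."""
        return self.all_units - self.faulty_list

    def on_failure(self) -> str:
        if not self.switched:           # S114: first failure under this firmware
            self.switched = True        # S116: switch firmware
            return "switch firmware"
        found = self.diagnose(self.active_units()) - self.faulty_list  # S118
        if found:                       # S120: new faulty hardware detected
            self.faulty_list |= found   # S122: record it
            return "restart without " + ", ".join(sorted(found))  # S124
        self.stopped = True             # S126: cause unidentifiable
        return "cause unknown: notify and stop"


def fake_diagnose(units: set[str]) -> set[str]:
    """Hypothetical diagnosis that always flags RAM module 2 if it is active."""
    return {u for u in units if u == "RAM module 2"}


h = FaultHandler(fake_diagnose, {"CPU 0", "CPU 1", "RAM module 1", "RAM module 2"})
r1 = h.on_failure()  # first failure: switch firmware
r2 = h.on_failure()  # recurrence: diagnosis isolates RAM module 2
r3 = h.on_failure()  # recurrence, nothing new found: stop
```

In the third call, the diagnosis runs only on the remaining units and finds nothing new, so the handler takes the S126 branch.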

[0039] In the example above, every function that the failure-time degraded operation firmware shares with the normal operation firmware is coded differently from its counterpart. However, the effect of the present invention can be obtained without this configuration. Part of the failure-time degraded operation firmware may be created from a specification shared with the normal operation firmware and therefore share code with it. As long as other parts are coded differently, the probability of a failure caused by a bug is still lower than when all of the code is shared.

[0040] The data processing apparatus of the above embodiment holds its firmware in ROM. The invention is not limited to firmware, however. It can also be applied to software recorded on a fixed magnetic disk drive, a magneto-optical disk drive, magnetic tape, or the like. Fault tolerance is improved in the same way: software versions with the same function but different coding are stored on a disk drive or similar device, and the apparatus switches between them when a failure occurs.

[0041] The features of the apparatus have been described above in general terms. They are now described more concretely, using the apparatus's board temperature monitoring function as an example.

[0042] FIG. 4 is an explanatory diagram showing example functional specifications for a board temperature monitoring function, which monitors the temperature of a board and performs predetermined actions. FIG. 4(a) lists the functional specification of the normal-operation firmware item by item. FIG. 4(b) does the same for the failure-time degraded-operation firmware. Comparing the two shows that the degraded-operation firmware implements only part of the functions of the normal-operation firmware.

[0043] Under this example specification, the apparatus operates as follows.

[0044] First, the normal-operation firmware reads the measured temperature of the monitored board from the temperature sensor 34. It reserves a data area in RAM 22 for 100 temperature readings and keeps the latest 100 readings there.

From the temperature data stored in RAM 22, it judges the board temperature abnormal if the temperature has stayed continuously below 0 °C, or continuously above 80 °C, for the most recent 10 seconds. Otherwise it judges the temperature normal. Each verdict triggers a different action:

- Over-temperature (above 80 °C): the firmware displays an error message on the LCD 36 and, for example, stops the system.
- Under-temperature (below 0 °C): the firmware only displays an error message on the LCD 36.
- Normal: the firmware takes no particular action.

For this example, this is the processing performed in step S65 of FIG. 2.
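The normal-operation judgment rule of paragraph [0044] can be sketched as follows. The source gives the buffer size (100 readings) and the window (10 seconds) but not the sampling period. This sketch makes three assumptions:

- One reading is taken every 100 ms, so the buffer spans exactly the window.
- A buffer that does not yet cover 10 seconds is treated as normal.
- The action names are illustrative.

```python
from collections import deque

BUFFER_SIZE = 100          # latest 100 readings kept in RAM 22 (from the text)
SAMPLE_PERIOD_MS = 100     # assumption: the sampling period is not stated in the source
WINDOW_MS = 10_000         # "the most recent 10 seconds"
WINDOW_SAMPLES = WINDOW_MS // SAMPLE_PERIOD_MS  # 100 readings under this assumption


class NormalTempMonitor:
    """Sketch of the normal-operation rule of FIG. 4(a) / step S65."""

    def __init__(self):
        self.readings = deque(maxlen=BUFFER_SIZE)  # oldest reading drops out first

    def judge(self):
        if len(self.readings) < WINDOW_SAMPLES:
            return "normal"  # assumption: under 10 s of history is never judged abnormal
        window = list(self.readings)[-WINDOW_SAMPLES:]
        if all(t > 80.0 for t in window):
            return "over"    # above 80 °C for the whole window
        if all(t < 0.0 for t in window):
            return "under"   # below 0 °C for the whole window
        return "normal"

    def step(self, celsius):
        """Store one reading from temperature sensor 34; return (verdict, actions)."""
        self.readings.append(celsius)
        verdict = self.judge()
        if verdict == "over":
            return verdict, ["lcd_message", "stop_system"]  # message on LCD 36, then stop
        if verdict == "under":
            return verdict, ["lcd_message"]                 # message only
        return verdict, []                                  # normal: no action
```

A single reading back inside the 0 to 80 °C range breaks the "continuously" condition and returns the verdict to normal. This windowed logic is exactly the state that paragraph [0048] identifies as a possible bug site.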

[0045] When the monitoring device 26 detects a failure in the execution of the normal-operation firmware (S70), it rewrites the startup ROM identifier stored in the nonvolatile memory 24 (S75). The value changes from "1", which designates ROM 30A, to "2", which designates ROM 30B holding the degraded-operation firmware. The monitoring device then restarts the system (S80). After steps S85 to S95 described above, execution of the degraded-operation firmware begins (S100).
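The boot-ROM selection of paragraph [0045] can be modeled as follows. The key point is that the selection lives in nonvolatile memory, so it survives the restart. This is a minimal sketch: the dictionary standing in for nonvolatile memory 24 and the `boot_rom_id` key are hypothetical, and the boot sequence S85 to S95 is reduced to a lookup.

```python
ROM_30A = 1  # value "1": ROM holding normal-operation firmware 40A
ROM_30B = 2  # value "2": ROM holding degraded-operation firmware 40B


def boot(nvram, roms):
    """Steps S85-S95, simplified: start whichever ROM nonvolatile memory selects."""
    return roms[nvram.get("boot_rom_id", ROM_30A)]  # hypothetical key name


def on_failure_detected(nvram, roms):
    """S70 -> S75 -> S80: rewrite the selector, then restart."""
    nvram["boot_rom_id"] = ROM_30B   # S75: "1" -> "2"
    return boot(nvram, roms)         # S80: restart into the selected firmware
```

Because `nvram` outlives each `boot` call, any later restart also comes up in the degraded firmware until the selector is written back.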

[0046] The degraded-operation firmware likewise begins by reading the measured temperature of the monitored board from the temperature sensor 34. Unlike the normal-operation firmware, it does not store the measured temperature itself in RAM 22. Instead it judges the board temperature from a single reading:

- If that reading exceeds 80 °C, it judges the board temperature abnormal.
- Otherwise it judges the temperature normal.

Only this normal/abnormal verdict is stored in RAM 22. If the firmware judges an over-temperature abnormality above 80 °C, it stops the system. For this example, this is the processing performed in step S100 of FIG. 2.

[0047] In this example, the degraded-operation firmware reduces the temperature monitoring function to the minimum processing needed to keep operating. Unlike the normal-operation firmware, it displays no error message, and it neither detects nor displays under-temperature abnormalities. When the operator requests it, the degraded-operation firmware reads the verdict stored in RAM 22 and displays it on the LCD 36.
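For contrast with the normal-operation rule, the degraded-operation behavior of paragraphs [0046] and [0047] fits in a few lines. This is an illustrative sketch, and the action and method names are hypothetical. The reading history, the 10-second window, the low-temperature branch and the automatic LCD message are all absent.

```python
class DegradedTempMonitor:
    """Sketch of the failure-time degraded rule of FIG. 4(b) / step S100."""

    def __init__(self):
        self.last_verdict = None  # only the verdict is kept in RAM 22, never the reading

    def step(self, celsius):
        """Judge one reading from temperature sensor 34; return the actions to take."""
        abnormal = celsius > 80.0  # no low-temperature check at all
        self.last_verdict = "abnormal" if abnormal else "normal"
        return ["stop_system"] if abnormal else []  # no LCD message is raised

    def operator_request(self):
        """Verdict shown on LCD 36 only when the operator asks for it."""
        return self.last_verdict
```

Because there is no window and no stored history, a bug in the windowed processing of the normal-operation firmware has no counterpart here.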

[0048] Suppose, for example, that the normal-operation firmware has a bug in its handling of the 10 seconds of temperature values it uses for the normal/abnormal judgment. That bug could stop the system. The degraded-operation firmware does not base its judgment on 10 seconds of temperature values in the first place, so it cannot contain the same bug. Switching from the normal-operation firmware to the degraded-operation firmware therefore lets the apparatus keep operating.

[0049] The normal-operation firmware also includes processing that notifies the outside when an over-temperature abnormality above 80 °C occurs, and this processing may contain a bug. If it does, the system may hang when the normal-operation firmware executes the notification. The degraded-operation firmware has no external notification processing, so it cannot contain such a bug. Switching from the normal-operation firmware to the degraded-operation firmware therefore lets the apparatus keep operating.

[0050] As shown above, the normal-operation firmware and the degraded-operation firmware have entirely different functional specifications, so the program code written from them also differs. The two therefore contain different bugs: a bug that manifests in the normal-operation firmware is, as a rule, not present in the degraded-operation firmware. In addition, the degraded-operation firmware is limited to the functions essential for continued operation. Its program is therefore smaller and contains fewer bugs, which makes it correspondingly less prone to failure.

[0051] When a bug has manifested in the normal-operation firmware, the degraded-operation firmware does not reproduce that failure, and the probability that it exhibits a new bug is low. Switching to the degraded-operation firmware therefore allows operation to continue, albeit limited to minimum functions. While the apparatus runs in this degraded mode, the cause of the bug in the normal-operation firmware can be investigated and removed. In short, this data processing apparatus keeps the whole system from going down, which makes it particularly effective in applications that demand high reliability and fault tolerance.

[0052] The apparatus also provides an improved fail-soft configuration. The component responsible for the range identified as containing a failure is switched to another component that implements only the essential functions within that range. The non-essential functions in that range are lost, but unlike conventional fail-soft schemes, the essential functions are maintained and the system keeps operating.

Moreover, the degraded-operation firmware expresses these essential functions in code different from the corresponding code in the normal-operation firmware. So even if the normal-operation firmware's coding of those functions contains a bug, no failure from the same cause occurs after the switch.

[0053] The example above shows the apparatus with two firmware images: the high-function normal-operation firmware 40A and the low-function degraded-operation firmware 40B. The essence of the invention is not limited to this. It also covers configurations with three or more firmware images having different functional specifications.

[0054] For example, consider adding firmware C to firmware A and B of the above example. Firmware A, B and C are created, in that order, from functional specifications whose function level steps down from high to low. As in the two-firmware case, the three contain different bugs, so the apparatus can use the following operating mode:

- It normally runs the high-function firmware A.
- If A fails, it switches to firmware B, whose functions are somewhat reduced.
- If a new bug manifests while B is running, it switches to the low-function firmware C and continues operating.

This further improves reliability and fault tolerance. From the three-firmware case it is evident that the same applies to four or more firmware images.
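The stepwise fallback of paragraph [0054] amounts to walking down an ordered list of firmware images. This is a minimal sketch; the class name is hypothetical. The source does not say what happens when the lowest level also fails, so the sketch simply reports that no fallback remains.

```python
class FirmwareChain:
    """Firmware images ordered from highest to lowest function level."""

    def __init__(self, levels):
        if not levels:
            raise ValueError("at least one firmware image is required")
        self.levels = list(levels)
        self.index = 0  # start with the highest-function firmware

    @property
    def current(self):
        return self.levels[self.index]

    def on_failure(self):
        """Switch one level down; return False if no lower level remains.

        Assumption: with no lower level left, the caller must stop the
        system (the source does not specify this case).
        """
        if self.index + 1 >= len(self.levels):
            return False
        self.index += 1
        return True
```

With `FirmwareChain(["A", "B", "C"])`, two failures move execution from A to B to C, and a third failure returns `False`. The same code handles four or more levels unchanged.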

[0055] [Second Embodiment] FIG. 5 is a flowchart of the processing that checks the firmware execution state in this embodiment. In FIG. 5, the firmware consists of a main processing routine (S150 to S175) and a timer interrupt handler (S200 to S215).

After initialization, the main processing routine repeats the series of processes S155 to S175 in an endless loop. The loop includes a process that increments an execution check counter 250 in RAM 22 (count process S165). During normal operation, that is, while the loop keeps repeating, the count value is therefore updated continuously.

The timer interrupt handler is started at a fixed period by a timer interrupt. Its processing includes a check of the execution check counter 250 (counter check process S205). If S205 finds that the count value has not been updated for a certain time or longer, the handler judges the count value abnormal (S210). It attributes this to a firmware abnormality and starts the firmware switching process (S215).
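The counter check of paragraph [0055] is essentially a software watchdog. The sketch below simulates it with discrete calls instead of real interrupts so that it is deterministic. Three details are assumptions:

- The stall threshold is counted in timer periods (`stall_ticks`).
- The counter is 16 bits wide.
- The switch fires only once.

```python
class ExecutionChecker:
    """Simulation of FIG. 5: the main loop bumps a counter, a timer handler checks it."""

    def __init__(self, stall_ticks, on_switch):
        self.counter = 0               # execution check counter 250 in RAM 22
        self.last_seen = 0             # value seen at the previous timer interrupt
        self.unchanged_ticks = 0
        self.stall_ticks = stall_ticks  # assumption: "a certain time", in timer periods
        self.on_switch = on_switch      # starts the firmware switching process
        self.tripped = False

    def main_loop_iteration(self):
        # ... main processing S155-S175 would run here ...
        # Count process S165. Assumed 16-bit counter: only "changed or not"
        # matters, so wraparound is harmless.
        self.counter = (self.counter + 1) & 0xFFFF

    def timer_interrupt(self):
        if self.tripped:
            return                      # switching already started
        # Counter check process S205.
        if self.counter != self.last_seen:
            self.last_seen = self.counter
            self.unchanged_ticks = 0
            return
        self.unchanged_ticks += 1
        if self.unchanged_ticks >= self.stall_ticks:  # S210: count value abnormal
            self.tripped = True
            self.on_switch()                           # S215
```

If the main loop hangs, as in the runaway scenario of paragraph [0056], `main_loop_iteration` stops being called. After `stall_ticks` timer interrupts without a change, the handler starts the switch exactly once.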

[0056] Suppose, for example, that a bug in the normal-operation firmware makes it run away and hang while its main processing routine is checking an input data value. The main loop then stops cycling, and the execution check counter 250 in memory is no longer updated.

The timer interrupt handler, however, is started periodically by the timer interrupt regardless of the state of the main processing. After being started several times over a certain period, it detects that the count value of counter 250 has stopped, and it starts the firmware switching process. The degraded-operation firmware is then started and executed in place of the normal-operation firmware.

As described in the preceding embodiment, the degraded-operation firmware does not share the bug of the normal-operation firmware that was running. The system stops momentarily for the switchover, but normal operation is restored immediately and processing continues.

[0057] In this embodiment, then, a failure in the main processing of the firmware is detected by a function of the firmware itself.

[0058] [Third Embodiment] Next, a third embodiment of the invention is described: an example of the startup processing after a firmware switch. The data processing apparatus has the same configuration as in the first embodiment, so FIG. 1 is referred to as needed.

[0059] Normally, at power-on the data processing apparatus executes the initialization process S150 shown in FIG. 5 at the start of the normal-operation firmware, which initializes and checks the system. By contrast, when the normal-operation firmware fails and the apparatus is restarted to switch to the degraded-operation firmware, the power stays on.

[0060] As described above, a firmware switch loads the degraded-operation firmware from ROM 30B into RAM 22, and CPU 20 starts executing it. This firmware is coded differently from the normal-operation firmware in which the failure was detected. Starting the degraded-operation firmware differs from starting the normal-operation firmware at power-on: the running normal-operation firmware has already performed initialization process S150. Part of the initialization can therefore be omitted.

[0061] This embodiment achieves fast firmware switching by omitting part of the initialization. For example, initialization of the monitoring device, of the memory, and of the conversion tables can be omitted.

[0062] Memory initialization is one example. Generating the check bits for parity or ECC (Error Checking and Correcting) once after power-on is enough to make the check bits consistent, so it need not be repeated. Similarly, a WRITE/READ/COMPARE operation check of the memory is often sufficient if run once after power-on. The initial-value settings and the processing-start command for the monitoring device 26 can also be omitted if the monitoring processing is already running.

[0063] Likewise, if the normal-operation firmware and the degraded-operation firmware share their conversion data tables, initialization of those tables can be omitted. Suppose, for example, that the character codes in use are encrypted for processing according to an external specification. If the table mapping character codes to encrypted codes is stored in a common format at addresses 10000 through 20000, there is no need to exchange data with the outside again to rebuild it.

[0064] These initialization processes can therefore be omitted as needed to speed up switching. Processes to be omitted are registered in an omission list, and initialization proceeds according to that list.

[0065] FIG. 6 is a flowchart of the initialization performed at firmware startup in this embodiment. First, the various CPU registers are initialized (S300), followed by the registers of the programmable devices (S305). The initialization steps for the monitoring device, the memory, and the conversion tables follow.

[0066] For monitoring-device initialization, the omission list is consulted first (S310). If the process is not registered in the list, the initialization is executed (S315). If it is registered, S315 is skipped and the flow proceeds to memory initialization.

[0067] Memory initialization likewise begins by consulting the omission list (S320). If the process is not registered in the list, memory initialization is executed (S325). If it is registered, S325 is skipped.

[0068] Next, the input/output ports are initialized (S330), after which the flow moves on to conversion table initialization.

[0069] For the conversion table as well, the omission list decides whether the process is omitted (S335). If the process is not registered in the list, the conversion table is initialized (S340). If it is registered, S340 is skipped.

[0070] As the figure shows, some initialization is needed even when the firmware is switched, and is never omitted: CPU register initialization (S300), register initialization of programmable devices such as the memory controller (S305), and input/output port initialization (S330).
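The FIG. 6 flow described in paragraphs [0065] to [0070] can be sketched as a table of steps, each marked as mandatory or skippable. This is a minimal sketch; the step names are hypothetical. In a real system each entry would call an initialization routine. Here the code only records which steps would run. The omission list is empty at power-on and populated for a firmware switch.

```python
# FIG. 6 steps in order: (step id, hypothetical name, may be omitted on a switch?)
INIT_STEPS = [
    ("S300", "cpu_registers", False),
    ("S305", "programmable_device_registers", False),
    ("S315", "monitoring_device", True),   # guarded by the list check S310
    ("S325", "memory", True),              # guarded by the list check S320
    ("S330", "io_ports", False),
    ("S340", "conversion_table", True),    # guarded by the list check S335
]


def run_initialization(omission_list):
    """Return the ids of the steps actually executed, in order."""
    executed = []
    for step_id, name, skippable in INIT_STEPS:
        if skippable and name in omission_list:
            continue                  # registered in the omission list: skip
        executed.append(step_id)      # a real system would run the step here
    return executed
```

Mandatory steps ignore the omission list entirely, so mistakenly listing, say, `"cpu_registers"` cannot suppress S300.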

【００７１】[0071]

[Effects of the Invention] The data processing apparatus of the present invention holds two programs: a normal-operation program, and a failure-time degraded-operation program whose functional specification is lower in function than that of the normal-operation program. The corresponding functions of the two programs are expressed entirely in mutually different coding, so the programs can be kept from containing the same bug.

When a failure occurs while the normal-operation program is running, the apparatus switches to the lower-function degraded-operation program, so no failure arises from the same bug. Processing also continues normally, without bringing the system down, while the normal-operation program is being corrected. This improves the fault tolerance and reliability of the data processing apparatus.

[0072] In the data processing apparatus of the present invention, the normal-operation program also counts as its main processing progresses. The processing failure determination means detects a failure when the count value is not updated properly. This makes failures easy to check for.

[0073] In the data processing apparatus of the present invention, when the degraded-operation firmware starts after a switch, predetermined initialization processes that are unnecessary at switching time are omitted from its initialization. This allows fast firmware switching and further reduces system downtime.

[0074] In the data processing apparatus of the present invention, a hardware failure is detected when a failure recurs after the switch to the degraded-operation program, and the hardware related to that failure is disconnected. This widens the range of failures that can be handled and further improves fault tolerance and reliability.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the overall schematic configuration of a data processing apparatus according to the first embodiment of the present invention.

FIG. 2 is a flowchart illustrating the operation of the apparatus.

FIG. 3 is a flowchart of the processing performed when a failure occurs in the apparatus.

FIG. 4 is an explanatory diagram showing example functional specifications for a board temperature monitoring function, which monitors the temperature of a board and performs predetermined actions.

FIG. 5 is a flowchart of the processing that checks the firmware execution state according to the second embodiment of the present invention.

FIG. 6 is a flowchart of the initialization performed at firmware startup according to the third embodiment of the present invention.

FIG. 7 is a block diagram showing the configuration of a conventional data processing apparatus.

FIG. 8 is a diagram showing the principle of a semiconductor memory device according to another prior art.

[Explanation of symbols]

20 CPU, 22 RAM, 24 nonvolatile memory, 26 monitoring device, 28 system bus, 30A, 30B ROM, 32 switching means, 34 temperature sensor, 36 LCD, 40A normal-operation firmware, 40B failure-time degraded-operation firmware.

Claims

[Claims]

1. A data processing apparatus comprising: a processor that executes a program; first program storage means for storing a normal-operation program; second program storage means for storing a failure-time degraded-operation program whose functional specification is lower in function than that of the normal-operation program; program selection means for selecting either the normal-operation program or the failure-time degraded-operation program as the execution program; and processing failure determination means for detecting a failure during execution of the normal-operation program; wherein the functions of the failure-time degraded-operation program and the corresponding functions of the normal-operation program are created from mutually different functional specifications and are expressed, in their entirety, in mutually different coding; and wherein, when the processing failure determination means detects a failure in the normal-operation program, the program selection means switches the execution program from the normal-operation program to the failure-time degraded-operation program.

2. The data processing apparatus according to claim 1, wherein the normal-operation program and the failure-time degraded-operation program are each firmware stored in a read-only memory or software stored on a magnetic recording medium.

3. The data processing apparatus according to claim 1 or 2, wherein the normal-operation program counts in accordance with the progress of its main processing, and the processing failure determination means detects a failure on the basis of the count value not being updated properly.

4. The data processing apparatus according to claim 3, wherein the processing failure determination means is implemented by timer interrupt processing that determines whether the count value is being updated properly.

5. The data processing apparatus according to claim 1 or 2, wherein, when the program selection means switches to the failure-time degraded-operation program, predetermined initialization processes that are unnecessary at switching time are omitted from the initialization processes of the failure-time degraded-operation program.

6. The data processing apparatus according to claim 5, wherein the initialization processes unnecessary at switching time include at least one of: a memory initialization process that generates memory check bits or performs a memory operation check; an initialization process for the processing failure determination means; and, when a conversion data table for processing is shared between the normal-operation program and the failure-time degraded-operation program, an initialization process for that conversion data table.

7. The data processing apparatus according to claim 1, further comprising hardware failure removal means that detects a hardware failure of the data processing apparatus on the basis of a recurrence of the failure after the program selection means has switched to the failure-time degraded-operation program, and disconnects the part of the hardware related to that failure.