JPS60171544A

JPS60171544A - Self-diagnosis device for abnormality of computer system

Info

Publication number: JPS60171544A
Application number: JP2716384A
Authority: JP
Inventors: Haruki Inoue; 春樹井上
Original assignee: Hitachi Engineering Co Ltd; Hitachi Ltd
Current assignee: Hitachi Engineering Co Ltd; Hitachi Ltd
Priority date: 1984-02-17
Filing date: 1984-02-17
Publication date: 1985-09-05

Abstract

PURPOSE: To keep loss of function in a computer system to a minimum by having the computer system itself detect the cause of a software fault and remove it. CONSTITUTION: A level operation diagnosis module 2 and an operation monitoring timer 3 are provided under the operating system (OS) 1 for each operation level of all tasks. Module 2 is started periodically by the OS 1 and sets a fixed time value in timer 3. When timer 3 times out, it sets 1 in the level operation monitoring register 4 to record the abnormality. The OS 1 monitors the running time of each task and stores this value in the task operation time memory 5. The system diagnosis device 6 runs periodically, analyzes the fault, and identifies the direct cause of the abnormality. Device 6 then issues a request to the OS 1 to stop the corresponding task and displays the details of the abnormality on the external abnormality display device 7.

Description

[Detailed description of the invention]

[Field of application of the invention]

The present invention relates to an abnormality diagnosis device for a computer system, and in particular to a self-diagnosis device that limits a total system outage caused by a software defect to a minimal local outage.

[Background of the invention]

In a computer system in which a plurality of independent programs run simultaneously under an operating system (hereinafter, OS) that controls them according to a fixed procedure, the most important requirement is that every processing function always executes without delay.

Conventionally, the OS itself has detected many program defects and has been reasonably successful in minimizing their spread to the entire system.

For example, when a defect is found by a protection check on a data write area or by an invalid-instruction execution check, the OS immediately prohibits the offending task from running (hereinafter, the smallest unit of an independently operating program is called a task).

However, two kinds of defect have been difficult to contain until now: an endless loop in a program, and deadlock between tasks caused by the occupy and release instructions that protect shared resources (data files) from corruption. A computer system assigns each task one of several operation levels and grants priority according to the importance and urgency of the processing. If a task at some level falls into an endless loop, every task at a lower level waits from then on, which amounts to a loss of function. Likewise, if an occupy or release instruction is issued incorrectly, every function that performs input or output on that resource stops. Such defects are therefore likely to halt the entire system.

Various techniques have been devised and put into practical use to deal with this. For example, several timers are used to detect endless loops. In this method the maximum processing time is calculated and set in advance for each task, and a task that exceeds this time is prohibited from running. This is a complete countermeasure for the task in question, but if that task is not the direct cause, the wrong action is taken and the situation is made worse. Moreover, because setting the maximum processing time is left to the designer of each task, in many cases the value has not been considered carefully or has not been set at all, so the method cannot be fully relied upon. As a system-wide method, a task placed at the lowest level periodically sets a fixed value in a hardware timer (called a watchdog timer), and an external contact is turned ON when the task can no longer run. This method can only detect the abnormality, however, and does nothing to prevent the system from stopping.
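The weakness of the per-task time limit described above can be illustrated with a minimal Python sketch. The names `TaskLimit` and `tasks_to_prohibit`, the task names, and the time values are all hypothetical and do not come from the patent; the sketch only assumes that each task has an optional, designer-supplied maximum processing time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskLimit:
    name: str
    max_seconds: Optional[float]  # None models a limit the designer never set


def tasks_to_prohibit(limits: List[TaskLimit],
                      elapsed: Dict[str, float]) -> List[str]:
    """Return the names of tasks whose elapsed time exceeds their own limit."""
    over = []
    for limit in limits:
        if limit.max_seconds is None:
            continue  # no limit configured: an overrun here goes unnoticed
        if elapsed.get(limit.name, 0.0) > limit.max_seconds:
            over.append(limit.name)
    return over


limits = [TaskLimit("logger", 2.0), TaskLimit("control", None)]
elapsed = {"logger": 2.5, "control": 900.0}
print(tasks_to_prohibit(limits, elapsed))  # ['logger']
```

In this made-up case the task that has run for 900 seconds is never flagged because no limit was set for it, while a merely delayed task is stopped, which is the kind of misdiagnosis the text warns about.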

Thus, conventional methods could not achieve a sufficient effect.

[Purpose of the invention]

An object of the present invention is to provide a device that keeps loss of system function to a minimum by identifying the cause of a software defect systematically, quickly, and accurately, and by having the system itself remove the direct cause.

[Summary of the invention]

The present invention is characterized in that setting up the abnormality detection function is no longer left to the designer of each program; instead, the computer system itself performs both the detection and the removal of the cause. That is, the OS periodically diagnoses whether each level can run, continuously measures the processing time of every task, determines the cause of the abnormality from these two pieces of information, and removes that cause.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to FIGS. 1 to 4.

FIG. 1 shows how the effect spreads through the system when one task falls into an endless loop.

The vertical axis shows the task operation level (seven levels in this example) and the horizontal axis shows elapsed time. From time t0 to t10 the system operates normally, and tasks at higher levels run with priority (lower-numbered levels are called higher: L1, L2, ..., L7). Between t8 and t9, task ■ occupies a certain resource (this is called a reserve). From t■0, task ■ starts running but, for some reason, enters an endless loop. After that, the higher-level task ■ still runs normally, but the lower-level tasks ■ to ■ enter a wait state and stop functioning. Even a high-level task that tries to occupy the same resource must wait for its release and also stops functioning. In this way, the function of the entire system has stopped by t■2.

FIG. 2 shows an example configuration of the computer system abnormality self-diagnosis device.

In this device, a level operation diagnosis module 2 and an operation monitoring timer 3 are provided under the OS 1 for each task operation level. Module 2 is started periodically by the OS 1 and sets a fixed time value in timer 3. When timer 3 times out, it sets 1 in the level operation monitoring register 4, recording the abnormality. The OS 1 monitors the running time of each task and stores this value in the task operation time memory 5. The system diagnosis device 6 runs periodically, analyzes the abnormality, finds its direct cause, issues a request to the OS 1 to stop the corresponding task, and displays the details of the abnormality on the external abnormality display device 7.

Here, module 2 is configured to behave in the same way as a task and is linked into the run queue. Whether each level can run is therefore detected by means of timer 3.
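The interplay of module 2, timer 3, and register 4 can be sketched in Python as follows. This is only an illustration under assumptions: time is modeled as discrete ticks, the class name `LevelMonitor` and its methods are hypothetical, and index 0 of the register is taken to be the highest level. The patent itself describes hardware or OS-level components, not this code.

```python
class LevelMonitor:
    """Per-level timers (3) and the level operation monitoring register (4)."""

    def __init__(self, levels: int, timeout: int):
        self.timeout = timeout
        self.remaining = [timeout] * levels  # one countdown per level
        self.register = [0] * levels         # index 0 = highest level (assumed)

    def rearm(self, level: int) -> None:
        """Called by a level's diagnosis module (2) whenever it gets to run."""
        self.remaining[level] = self.timeout

    def tick(self) -> None:
        """Advance time by one unit and latch 1 in the register on timeout."""
        for level in range(len(self.remaining)):
            if self.remaining[level] > 0:
                self.remaining[level] -= 1
                if self.remaining[level] == 0:
                    self.register[level] = 1


monitor = LevelMonitor(levels=7, timeout=3)
for _ in range(5):
    for level in range(3):  # only the three highest levels get CPU time
        monitor.rearm(level)
    monitor.tick()
print(monitor.register)  # [0, 0, 0, 1, 1, 1, 1]
```

Because each diagnosis module is scheduled like an ordinary task at its own level, a level that is starved of CPU time simply fails to re-arm its timer, and the timeout itself becomes the evidence that the level cannot run.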

FIG. 3 shows the detailed configuration of device 6. Taking register 4 and memory 5 as inputs, device 6 first determines whether register 4 is all zeros, that is, whether or not the system has an abnormality (67). Next, the device 68 that determines the position of the leftmost ON flag in register 4 identifies the abnormal level, and the abnormality cause task determination device 69 then analyzes the direct cause.

Device 69 reads from memory 5 the running times of the tasks belonging to the abnormal level determined by device 68 and takes the task with the longest running time as the direct cause. The task operation stop request device 610 then issues a stop request 611 to the OS 1, and the abnormality display device 612 outputs the display to the outside.

FIG. 4 shows an example in which this device contains the situation of FIG. 1 with minimal spread.

At t■1 the whole system comes to a temporary halt. When C1 to C7 are then started, C1 to C3 run because their levels are high and set the timer value in timer 3, but C4 to C7 cannot set a value in timer 3 because of the endless loop of task ■. When timer 3 times out at t■2, register 4 turns ON, device 6 identifies task ■, and the OS 1 prohibits task ■ from running. The task ■ that occupies the resource can then run, and it releases the resource.

As a result, the waiting tasks are released from the wait state one after another, and at t24 every function of the system other than task ■ is back to normal, so the spread of the abnormality has been kept to a minimum.
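The recovery described for FIG. 4 can be mimicked with a toy strict-priority scheduler in Python. Everything here is hypothetical: the function `run_ready_tasks`, the task names, the levels, and the work amounts are invented for illustration, and resource occupation is not modeled. The sketch only shows that lower-level tasks starve while a looping task stays runnable and proceed once it is prohibited.

```python
def run_ready_tasks(tasks, stopped, steps):
    """Strict-priority scheduling: each step runs the highest-level ready task.

    tasks:   list of (name, level, work), where work is the number of steps
             the task needs, or None for a task stuck in an endless loop.
    stopped: set of task names the OS has prohibited from running.
    Returns the names of the tasks that completed, in completion order.
    """
    remaining = {name: work for name, _, work in tasks}
    levels = {name: level for name, level, _ in tasks}
    done = []
    for _ in range(steps):
        ready = [n for n in remaining if n not in stopped and n not in done]
        if not ready:
            break
        current = min(ready, key=lambda n: levels[n])  # lower number = higher level
        if remaining[current] is None:
            continue  # endless loop: consumes the step and never finishes
        remaining[current] -= 1
        if remaining[current] == 0:
            done.append(current)
    return done


tasks = [("high", 1, 2), ("looper", 3, None), ("holder", 5, 1), ("waiter", 6, 2)]
print(run_ready_tasks(tasks, stopped=set(), steps=10))       # ['high']
print(run_ready_tasks(tasks, stopped={"looper"}, steps=10))  # ['high', 'holder', 'waiter']
```

Only the looping task is lost in the second run, which corresponds to the outcome the text describes: every function other than the faulty task returns to normal.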

[Effect of the invention]

According to the present invention, the computer itself can detect a software abnormality and remove its cause, so the reliability and availability of the computer system can be improved.

[Brief explanation of the drawings]

FIG. 1 is a diagram explaining how a task's endless loop spreads through the system; FIG. 2 is a block diagram of the computer system abnormality self-diagnosis device of the present invention; FIG. 3 is a detailed configuration diagram of the present invention; and FIG. 4 is a diagram of an example of the abnormality suppression operation of the present invention.

Claims

[Claims]

1. In a computer system having an operating system under which a plurality of independent programs operate simultaneously and which controls them according to a fixed procedure, a computer system abnormality self-diagnosis device comprising: a level operation diagnosis device that is provided for each operation level of the programs and monitors operation at that level; an operation monitoring timer; a level operation monitoring register in which data is set when the operation monitoring timer times out; and a system diagnosis device that takes the level operation monitoring register and a program operation time memory as inputs, analyzes the cause of a system abnormality, removes the applicable cause, and outputs a display to the outside.