JPH0696040A

JPH0696040A - Fault detection system

Info

Publication number: JPH0696040A
Application number: JP4246439A
Authority: JP
Inventors: Masahiro Yoshizawa (吉沢 正浩); Yasushi Wada (和田 康); Masato Mitomi (三富 正人)
Original assignee: N T T FUANETSUTO SYST KK; Nippon Telegraph and Telephone Corp
Current assignee: N T T FUANETSUTO SYST KK; Nippon Telegraph and Telephone Corp
Priority date: 1992-09-16
Filing date: 1992-09-16
Publication date: 1994-04-08

Abstract

PURPOSE: To detect faults quickly and easily and prevent them from being overlooked by providing each computer with a fault detection program and inspection programs that perform various inspections. CONSTITUTION: Computers 1 to 3 each have a fault detection program 11, a fault detection program registration file 13, and a talker 15 (transmission program) and a listener 17 (reception program), which constitute a communication control program for communication among the fault detection programs 11 on computers 1 to 3. The contents of the inspections performed on each computer (program names) and the file names of the inspection results are registered in the registration file 13 on that computer. The fault detection program 11 refers to the registration file 13 to start each inspection program and to transfer the result files. This makes it easy to add and modify inspections and to customize them for a particular system or computer.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a fault detection system for detecting faults in a distributed processing system in which a plurality of computers form a network and a plurality of programs running on the computers perform predetermined processing while communicating with one another.

[0002]

[Prior Art] Methods that use a plurality of computers communicating over a network to control, for example, various manufacturing lines are widely used. A manufacturing management system for ASIC LSIs is taken as an example here. In such a system, a plurality of programs that run separately on each computer (and are always running) exchange messages with one another across the computers while performing processing such as lot scheduling, progress management, and data collection on the LSI manufacturing line. In a system that forms a network among many computers and performs distributed processing in this way, the stoppage of a single computer or program can bring the entire system to a halt. A highly reliable system that can detect such faults early is therefore needed.

[0003] Such an automatic fault detection system can be used in production management systems, monitoring systems, and the like for various manufacturing lines in which a plurality of devices are controlled over a computer network.

[0004] The computers covered by the present invention include not only computers with CRT terminals but also control terminals (sequencers and the like) and boards equipped with programmable ROM, and the invention is applicable to these as well. Likewise, the term network covers not only means of communication between computers such as various LANs (local area networks) and WANs (wide area networks), but also the forms of connection to such sequencers and the like.

[0005]

[Problem to Be Solved by the Invention] As described above, detecting a fault in a processing system that performs distributed processing over network communication among a plurality of computers has conventionally required, for example, entering system functions as commands on each computer individually to check the running programs, and then checking one by one whether the computers can communicate with each other. Checking the entire system therefore takes time and effort, and checks are liable to be omitted.

[0006] The present invention has been made in view of the above, and its object is to provide a fault detection system that can inspect the operating status and faults of a network-based distributed processing system easily and automatically.

[0007]

[Means for Solving the Problem] To achieve the above object, the fault detection system of the present invention is a fault detection system for detecting faults in a distributed processing system in which a plurality of computers form a network and a plurality of programs running on the computers perform predetermined processing while communicating with one another. In gist, each computer is provided with a fault detection program and with inspection programs that execute various inspections, and the fault detection programs on the computers are connected to one another via a communication control program. The fault detection program on each computer starts the inspection programs on its own computer and, via the communication control program and the fault detection programs on the other computers, starts the inspection programs on those computers. The inspection programs check that the programs that should always be running on each computer are operating normally and that communication between the computers is operating normally, and the contents of the inspection results are received from each computer.

[0008]

[Operation] In the fault detection system of the present invention, each computer is provided with a fault detection program and with inspection programs that execute various inspections. Each fault detection program starts the inspection programs, has them check that the programs that should always be running on each computer are operating normally and that communication between the computers is operating normally, and receives the contents of the inspection results from each computer.

[0009]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0010] FIG. 1 is a block diagram showing the configuration of a fault detection system according to one embodiment of the present invention. The fault detection system shown in the figure detects the operating status and faults of a distributed processing system in which a plurality of computers, namely a primary computer 1, a secondary computer 2, and a tertiary computer 3, form a network and a plurality of programs running on the computers perform processing while communicating with one another. Each of the computers 1 to 3 has a fault detection program 11, a fault detection program registration file 13, and a talker 15 (transmission program) and a listener 17 (reception program), which together constitute a communication control program that carries communication among the fault detection programs on the computers.

[0011] The fault detection program 11 has a function for inspecting faults within its own computer, and a function for exchanging inspection contents and inspection results with the fault detection programs on other computers and for transferring the common files used by the system and collating their contents. A flag specified at startup, for example, selects whether all computers are checked or only particular computers.

[0012] When started, the fault detection program refers to the registration file, starts each inspection program to run the inspections, and transfers the inspection results to the computer that initiated the check. The initiating computer can thus manage the inspection results of all computers centrally.

[0013] The functions of the inspection programs include: a function that checks whether the programs that should always be running on each computer are operating normally; a function that checks whether communication between computers works normally; and a function that checks whether the various definition files shared among the computers (hereinafter called common files) are identical across computers. In addition, where each computer has its own environment setting information, such as which devices are connected to it, a further function can check those definitions for inconsistencies such as duplicate or missing registrations.

[0014] Fault detection is carried out in three broad stages. In the first stage, faults are inspected within the computer on which the fault detection program is started (the primary computer 1). In the second stage, faults within the other computers connected to the system (the secondary computers 2) and the consistency of the common files used by the system are checked. In the third stage, mutual communication between the computers connected to the system is checked. The third-stage check can be omitted when only particular computers are being checked. In some systems, moreover, a plurality of terminal computers are connected under a single host computer and the terminals do not communicate with one another; the third-stage check is unnecessary in that case.

[0015] Communication by the fault detection program 11 uses the talker (transmission program) 15 and the listener (reception program) 17, which form the communication control program that carries communication between the programs running in the system. When the communication control program encounters certain specific messages (identified by message number or content), it recognizes them as messages from the fault detection program 11 and handles them specially. If the message is a "request message", such as a startup request or inspection request addressed to a fault detection program, the listener 17 that receives it starts the corresponding detection program and passes the message to it. If the message is a "reply message", that is, a response from one fault detection program to another, it is forwarded to the destination fault detection program. If the listener on the remote computer does not respond, the talker writes error information to the fault detection result file.
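
The message handling described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the `Message` fields, the `REQUEST`/`REPLY` kinds, and the in-memory `listeners` table standing in for the network are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical message kinds; the source only says that messages are
# identified by their number or content.
REQUEST = "request"
REPLY = "reply"


@dataclass
class Message:
    kind: str
    sender: str
    destination: str
    body: str


class Listener:
    """Toy stand-in for listener 17 on one computer."""

    def __init__(self, host, start_detector):
        self.host = host
        # Callable that "starts" the local fault detection program.
        self.start_detector = start_detector
        # Replies handed to the local fault detection program.
        self.inbox = []

    def receive(self, msg):
        if msg.kind == REQUEST:
            # Request message: start the detection program and pass it the message.
            return self.start_detector(msg)
        if msg.kind == REPLY:
            # Reply message: deliver to the destination fault detection program.
            self.inbox.append(msg)
            return None
        raise ValueError(f"unknown message kind: {msg.kind}")


def talk(listeners, msg, result_log):
    """Toy talker 15: deliver msg, or log an error if the remote listener is silent."""
    listener = listeners.get(msg.destination)
    if listener is None:
        result_log.append(f"ERROR: no response from listener on {msg.destination}")
        return None
    return listener.receive(msg)
```

In this sketch a missing entry in `listeners` plays the role of a remote listener that does not answer; a real talker would more likely rely on a timeout.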

[0016] The fault detection results are not only written to a file on each computer but also transferred to the primary computer and output there as files. The inspection results for the operating state of every computer connected to the system can therefore be reviewed at the terminal from which the fault detection program was started.

[0017] The contents to be inspected on each computer (program names) and the file names of the inspection results are registered in the fault detection program registration file 13 (hereinafter simply the registration file) on each computer. The fault detection program 11 refers to this file 13 to start each inspection program and to transfer the result files. This makes it easy to add or modify inspections and to customize them for a particular system or computer.
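
As a rough illustration of how a registration file can drive the inspections, the following Python sketch parses a hypothetical three-column format (program name, result file name, transfer flag) and runs the registered checks. The file layout and the names `parse_registration` and `run_inspections` are assumptions; the source states only that program names and result file names are registered. Inspection programs are modeled as plain callables so that the sketch stays self-contained.

```python
def parse_registration(text):
    """Parse lines of the form '<program> <result_file> <yes|no>' (hypothetical format)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        program, result_file, transfer = line.split()
        entries.append(
            {"program": program, "result_file": result_file, "transfer": transfer == "yes"}
        )
    return entries


def run_inspections(entries, programs):
    """Run each registered inspection and collect results keyed by result file name.

    `programs` maps a program name to a callable returning the inspection output.
    Returns (results, names of result files marked for transfer).
    """
    results = {}
    to_transfer = []
    for entry in entries:
        check = programs.get(entry["program"])
        if check is None:
            results[entry["result_file"]] = (
                f"ERROR: inspection program {entry['program']} not found"
            )
        else:
            results[entry["result_file"]] = check()
        if entry["transfer"]:
            to_transfer.append(entry["result_file"])
    return results, to_transfer
```

Adding or removing an inspection then amounts to editing one line of the registration text, which is the customization benefit the paragraph above describes.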

[0018] Next, the operating procedure of the fault detection program is described with reference to FIG. 2, taking as an example the case where only particular computers (computer 1 and computer 2) are checked.

[0019] The fault detection program started on computer 1 refers to the registration file on its own computer and detects faults within that computer. The inspection items can be changed by editing the entries in the registration file.

[0020] For example, in production management of an LSI manufacturing line, the following items are inspected.

[0021]
(1) Check of the information on the computers connected to the system
(2) Check of the common files (file format check)
(3) Check of the startup state of the programs that should always be running on the own computer
(4) Check of communication between programs within the own computer (response check)
(5) Check of the database setting environment
(6) Check of free disk space
(7) Check of the environment setting information for each computer (computer system settings, device connection information, etc.)

The contents and methods of the individual inspections are described later.

[0022] The results of the inspection on computer 1 described above are output as a result file.

[0023] A request to start the fault detection program is sent to computer 2. This request may be issued before all of the inspections on computer 1 have finished, provided it has been confirmed that the communication control program (talker/listener) is running. Doing so lets the inspections of computers 1 and 2 proceed in parallel and shortens the inspection time.

[0024] On computer 2, the listener that receives the startup request starts the fault detection program. The fault detection program refers to the registration file on its own computer and detects faults. The processing is the same as items (1) to (7) described for computer 1. For the check of the environment setting information file in (7), the file contents differ from computer to computer, but the inspection procedure and other processing are the same. The check of the common files in (2), however, differs somewhat from the one on computer 1. Because these are files that are meant to be shared across the system (for example, various name registration files), the check covers not only the existence of each file and the consistency of its format but also whether the data contents match.

[0025] The method is described later.

[0026] The results of the inspection on computer 2 are output as a result file on that computer. The output file name is registered in the registration file.

[0027] If the result file is designated as a transfer file in the registration file, the inspection result file is transferred from computer 2 to computer 1.

[0028] On receiving the inspection result file, computer 1 outputs it as a result file after, for example, adding to the file name an identifier showing that the file contains inspection results from computer 2, so that it can be distinguished from computer 1's own results.

[0029] This completes the series of processing steps. The processing may be started by an operator entering the fault detection program as a command when needed, or the computer system may be set up to start it periodically. On a computer whose OS (operating system) is UNIX, for example, the program can be started periodically by registering it in the crontab file. In that case the inspection programs run unattended and the inspection results are output to a file. If a separate program is provided that reads this file and issues an alarm or message when it contains error information, the occurrence of a fault can be reported immediately. For example, if a field is provided at the top of the fault detection result file in which a flag (1: error, 0: no error) records whether the inspection programs on each computer found an error, the presence of errors can be determined just by looking at this flag. The details of an error can then be found in the text output that follows.
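
A separate notification program of the kind mentioned above might look like the following Python sketch. It assumes, purely for illustration, that the flag occupies the first line of the result file on its own and that the error details follow as free text; `read_error_flag` and `notify_if_error` are hypothetical names.

```python
def read_error_flag(result_text):
    """Return True if the leading flag line is '1' (error), False if '0' (no error)."""
    lines = result_text.splitlines()
    if not lines:
        raise ValueError("empty result file")
    flag = lines[0].strip()
    if flag not in ("0", "1"):
        raise ValueError(f"unexpected flag value: {flag!r}")
    return flag == "1"


def notify_if_error(result_text, notify):
    """Call notify(details) when the flag reports an error; return whether it did."""
    if not read_error_flag(result_text):
        return False
    # Everything after the flag line is treated as the error details.
    details = "\n".join(result_text.splitlines()[1:])
    notify(details)
    return True
```

Because only the first line has to be read to decide whether anything is wrong, a periodic job can poll many result files cheaply and open the details only for those flagged with 1.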

[0030] When all computers in the system are checked, whether the computers connected to the system can communicate with one another is checked by sending a message and seeing whether a reply comes back. For the mutual communication check, the primary computer starts the check on a secondary computer, specifying the remote computers to be checked, and receives the check results. The secondary computer performs the communication check with each specified computer as a tertiary computer, and the results are returned to the primary computer via the secondary computer. The computers connected to the system are identified from the "computer information file". This file consists of the names of the computers used in the system and the types (or number) of programs that should always be running on each computer; examples are given later in FIGS. 4 and 5.

[0031] In the mutual communication check, the communication check paths are set so that none is duplicated. For example, when there are five computers, No. 1 to No. 5, and No. 1 is the primary computer, the paths can be set as shown in FIG. 3. When computer 5 is inspected as a secondary computer, the path between 1 and 5 can be omitted because it has already been inspected. Furthermore, a computer on which a communication error is detected partway through is excluded from the mutual communication check from that point on, which prevents the inspection time from growing because of retries and the like.
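
One way to obtain non-duplicated check paths is sketched below in Python. FIG. 3 is not reproduced in this text, so the ordering is an assumption: the primary computer is checked against every other computer first, then each secondary checks the remaining computers, and any pair already checked is skipped. The rule that the computer which fails to reply is the one excluded is likewise an assumption, as are the names `run_mutual_check` and `can_communicate`.

```python
def run_mutual_check(computers, primary, can_communicate):
    """Check each pair of computers at most once.

    `can_communicate(a, b)` is a caller-supplied probe returning True when b
    replies to a message from a. Returns a list of (a, b, ok) tuples.
    """
    checked = set()   # unordered pairs already inspected
    excluded = set()  # computers dropped after a communication error
    results = []

    def check(a, b):
        pair = frozenset((a, b))
        if pair in checked or a in excluded or b in excluded:
            return
        checked.add(pair)
        ok = can_communicate(a, b)
        results.append((a, b, ok))
        if not ok:
            # Assumption: the computer that failed to reply is excluded.
            excluded.add(b)

    secondaries = [c for c in computers if c != primary]
    for s in secondaries:        # primary to each secondary
        check(primary, s)
    for s in secondaries:        # each secondary to its tertiaries
        for t in secondaries:
            if t != s:
                check(s, t)
    return results
```

With five computers and no errors this visits each of the ten possible pairs exactly once; once a computer stops replying, no further paths through it are tried.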

[0032] The general configuration, functions, and use of the fault detection program have been described above. The inspections performed on each computer need to be varied according to the configuration and purpose of the system. The following shows an embodiment in which the fault detection program of the present invention is applied to a distributed LSI production management system.

[0033] An LSI production management system schedules the planned processing of each wafer put into the LSI manufacturing line (creating a processing schedule chart), tracks progress, accumulates processing data, and so on. In this system, a plurality of programs operate while communicating via the communication control program: a program (called the line management program) that creates the schedule and manages various information, such as the names and processing conditions of each wafer on the line (or each lot, a group of several wafers), the conditions of each process step, and the names, processing conditions, and processing times of the connected devices; a user interface that issues work instructions to operators; device control programs that control each device and collect data from it; and a database management program that manages the database structure, the data, and its search conditions.

[0034] In such a system, a plurality of computers are connected and each performs different device control, data management, and so on, so the types of programs that should run differ from computer to computer. One computer, for example, is the host computer for line management, on which the line management program runs. This line management computer also serves as a terminal for registering information such as the devices used in the system, the lots put into the line, processing condition names, and process names, and the information registered there is distributed to each computer over the network. If a computer happens to be stopped at that moment, the information does not reach it. Even after that computer is brought back up, its registration information (common file) no longer agrees with the rest of the system, so an error occurs on that computer alone whenever the registration information is used.

[0035] Another computer controls a plurality of devices; on it, a device control program that differs for each device and a database management program are always running. Yet another computer may be a terminal dedicated to the user interface used by operators. The programs that should always be running thus differ from computer to computer. FIG. 4 shows an example of such a situation. In FIG. 4, the shaded programs are the ones that are always running.

[0036] To identify the types of programs that should always be running on each computer as described above, a "computer information file" is provided on each computer. This computer information is shared by all computers and consists of the name of each computer and a list of the programs that should run steadily on it. The information may list the names of the programs to be started; a process number at startup may be assigned to each program and registered; or the programs to be started may be specified by flags or the like. FIG. 5 shows an example of the "computer information file" for the system of FIG. 4. In this example, the types of programs to be started are specified by flags: 1 (always running) or 0 (not started).
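
A flag-based "computer information file" could be read as in the Python sketch below. The actual layout of FIG. 5 is not reproduced in this text, so the whitespace-separated table (a header row of program names followed by one row of 0/1 flags per computer) and all computer and program names in the usage note are hypothetical.

```python
def parse_computer_info(text):
    """Map each computer name to the list of programs flagged 1 (always running).

    Expected (hypothetical) layout:
        computer  prog_a  prog_b  ...
        host1     1       0       ...
    """
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not rows:
        raise ValueError("empty computer information file")
    programs = rows[0][1:]  # header row: first column is the computer name
    info = {}
    for row in rows[1:]:
        name, flags = row[0], row[1:]
        if len(flags) != len(programs):
            raise ValueError(f"wrong number of flags for {name}")
        info[name] = [prog for prog, flag in zip(programs, flags) if flag == "1"]
    return info
```

For instance, a row such as `line_host 1 1 0 0` under the header `computer line_mgr ui dev_ctrl db_mgr` would yield `["line_mgr", "ui"]` for `line_host`.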

[0037] The fault detection program examines the programs running on its own computer at that moment and, referring to the "computer information file", detects whether every program that should always be running is in fact running and whether any program has been started twice. Any errors are output to a file. On a secondary computer, the result file is transferred to the primary computer. For programs that have a parent-child relationship, specifying that relationship as well makes it possible to detect errors such as a child process that is still running even though its parent process has stopped. For example, when startup process numbers are registered in the "computer information file", the parent-child relationship can be identified by providing an identification field and recording the parent's startup process number there. When flags are used as in FIG. 5, parent and child can be distinguished by deciding in advance that, say, a parent is 1 and a child is 2.
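
The startup-state check described above can be sketched in Python as follows. The list of running program names is passed in directly rather than read from the operating system, and the `parent_of` mapping is a hypothetical way of recording the parent-child relationship; neither detail comes from the source.

```python
from collections import Counter


def check_running(expected, running, parent_of=None):
    """Compare the programs that should be running with those actually running.

    expected:  program names that should always be running on this computer
    running:   names of the programs currently running (one entry per instance)
    parent_of: optional mapping child program -> parent program
    Returns a list of error strings (empty when everything is consistent).
    """
    parent_of = parent_of or {}
    counts = Counter(running)
    errors = []
    for prog in expected:
        if counts[prog] == 0:
            errors.append(f"{prog}: not running")
        elif counts[prog] > 1:
            errors.append(f"{prog}: started {counts[prog]} times")
    for child, parent in parent_of.items():
        # A child that survives its parent is reported as a separate error.
        if counts[child] > 0 and counts[parent] == 0:
            errors.append(f"{child}: running although parent {parent} has stopped")
    return errors
```

The returned list is what would be written to the result file; an empty list corresponds to the "no error" case.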

[0038] Furthermore, before inspecting program startup, the secondary computer checks whether the "computer information file" sent from the primary computer has the same contents as its own "computer information file". This prevents errors that would otherwise arise when, for example, the "computer information file" is consulted to identify a message destination.

[0039] One way to check the consistency of such common files (tables listing the various names used in the system, the computer information file, the information file recording which computer holds each device's database, and so on) is to verify directly that the file contents match, but for files holding a large amount of information this check takes time. Instead of checking whether all the data match, the check is simpler if done by the method shown in FIG. 6. Whenever a common file is created or updated, file information such as its creation (update) date and time and the number of data records is written at the head of the file or a similar location. This information is written when the common file is created on the line management computer of FIG. 4, and the whole file, including this information, is distributed over the network. Alternatively, when the file is large and only the newly registered or deleted portions are communicated rather than all the data every time, the receiving computer may generate this file information itself on receiving the changes.

[0040] To check file consistency, the primary computer first reads this file information and attaches it to the inspection request sent to the secondary computer. The secondary computer reads the file information of its own copy and checks whether its contents (update date and time, number of data records) match those sent from the primary computer. It outputs the inspection result or transfers it to the primary computer. If the inspection shows that the file contents do not match, the file with the newer update date and time can be transferred to the other computer (the one with the older date and time) to bring the information into agreement, thereby avoiding operating errors in the system. Where data are only ever added and never deleted, the copy with more data records can instead be taken as correct and transferred to the computer with fewer. Where registration is always done on a single computer (a host computer whose information is taken to be correct), the check can also be made by comparing the host computer's file information with that of each computer. This kind of consistency check is not limited to the "computer information file"; it can be used to check whether the registered contents of various information files, such as lot name tables, recipe (processing condition) information files, and database structure definition files, are identical across the system.
【００４１】Some startup programs and definition files differ from computer to computer. For example, for device control programs, a different variant of the same device control program is started depending on the connected device. (It is also possible to start a differently named program for each device, but then setting the communication partner when other programs communicate with it becomes cumbersome, so it is usually better to have the program recognize the difference between devices by referring to an environment setting table.) In this case, the system environment settings must be checked: whether the correct control program is running for each device, whether the communication parameters at startup are correct, whether the database for each device has been set up, and so on. Merely checking for the existence of files and the names of the startup programs is not sufficient. Examples of such inspections are given below.

【００４２】For device control, the system checks that there are no contradictions in the contents of the device-control setting files, which specify what devices are connected to each computer, their communication parameters, and so on.

【００４３】The setting files to be checked, and what is checked in each, are as follows.

【００４４】(1) Device registration file: registration of the devices connected to the own computer (a correspondence table between computers and connected devices).
This file is the basis for the checks below. For each device (number) registered in this file, the system checks whether the following settings have been made.

【００４５】(2) SECS communication parameter setting file: communication parameters for each device of the LSI manufacturing equipment.
Communication between a device and its control program normally uses the SECS standard. The system checks whether the communication parameters for this communication (device ID, connection port number, timeout time, and so on) are specified, whether there are any duplicate registrations, whether the connection port name is correct, whether any registrations are missing, and so on.

【００４６】(3) Device control program startup setting file: specification of the process ID, options, and so on at startup.
The system checks whether the process ID, options, and so on used when starting a different program for each device are specified, whether the corresponding program is installed, and so on.
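Checks (1) to (3) above can be sketched as a single pass over the device registration file. This is a hedged illustration only: the patent does not give file formats, so the in-memory structures, the field names (`device_id`, `port`, `timeout`, `process_id`, `options`, `program`), and the function `check_device_settings` are all assumptions.

```python
from collections import Counter


def check_device_settings(registered, secs_params, startup, installed,
                          valid_ports):
    """Return a list of problem descriptions (empty if none are found).

    registered  : device numbers from the device registration file (1)
    secs_params : list of dicts from the SECS parameter file (2)
    startup     : dict mapping device -> startup setting dict (3)
    installed   : set of program names installed on this computer
    valid_ports : set of port names that exist on this computer
    """
    problems = []
    counts = Counter(entry["device"] for entry in secs_params)
    by_device = {entry["device"]: entry for entry in secs_params}
    for device in registered:
        # (2) SECS parameters: missing, duplicated, or incomplete entries.
        if counts[device] == 0:
            problems.append(f"{device}: SECS parameters missing")
        elif counts[device] > 1:
            problems.append(f"{device}: SECS parameters registered more than once")
        else:
            entry = by_device[device]
            for name in ("device_id", "port", "timeout"):
                if entry.get(name) is None:
                    problems.append(f"{device}: SECS {name} not specified")
            port = entry.get("port")
            if port is not None and port not in valid_ports:
                problems.append(f"{device}: unknown port {port}")
        # (3) Startup settings: required fields and installed program.
        setting = startup.get(device)
        if setting is None:
            problems.append(f"{device}: startup setting missing")
            continue
        for name in ("process_id", "options"):
            if setting.get(name) is None:
                problems.append(f"{device}: startup {name} not specified")
        if setting.get("program") not in installed:
            problems.append(f"{device}: program {setting.get('program')} not installed")
    return problems


registered = ["EQ1", "EQ2", "EQ3"]
secs_params = [
    {"device": "EQ1", "device_id": 1, "port": "tty0", "timeout": 45},
    {"device": "EQ2", "device_id": 2, "port": "tty9", "timeout": None},
]
startup = {
    "EQ1": {"program": "etch_ctl", "process_id": 101, "options": "-v"},
    "EQ2": {"program": "cvd_ctl", "process_id": None, "options": ""},
}
problems = check_device_settings(registered, secs_params, startup,
                                 installed={"etch_ctl"},
                                 valid_ports={"tty0", "tty1"})
for line in problems:
    print(line)
```

Driving the loop from the device registration file mirrors the text: that file is the reference, and every other setting file is checked against the devices it lists.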

【００４７】(Check for the presence of required programs / directories / registration files) Among the things that differ from device to device, the database setup environment, such as the structure of the database in which data is stored, also needs to be checked. This check verifies whether the database that should exist for each device actually exists on the designated computer. It is also necessary to check whether the table (file) showing the correspondence of the databases on each computer is identical on every computer. This can be checked using the method shown in FIG. 6.

【００４８】In addition, by examining the free disk space (disk usage rate) of databases and the like, the disks can be used effectively: before a disk becomes full, part of its information can be saved elsewhere, connections can be changed, and so on. This not only prevents data storage errors caused by a full disk, but also prevents increases in search time and the like caused by data growth (for example, data accumulating in a concentrated manner on one computer).
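A disk usage check of this kind might look like the sketch below, which uses Python's standard `shutil.disk_usage`. The 80% warning threshold and the function names are assumptions chosen for illustration; the patent does not specify a threshold.

```python
import shutil


def usage_ratio(total, used):
    """Fraction of the disk in use (0.0 for an empty or unknown total)."""
    return used / total if total else 0.0


def check_disk(path, warn_at=0.8):
    """Return (ratio, warn) for the disk holding `path`.

    `warn` is True once usage reaches `warn_at`, so that data can be
    saved elsewhere before the disk actually fills up.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    ratio = usage_ratio(usage.total, usage.used)
    return ratio, ratio >= warn_at


ratio, warn = check_disk(".")
print(f"usage {ratio:.0%}, warning: {warn}")
```

Running such a check as one of the registered inspection programs would let the result be collected from every computer along with the other inspection results.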

【００４９】The above describes an example in which the fault detection system of the present invention is applied to an LSI production management system using a computer network. The fault detection system of the present invention is applicable not only to the control and monitoring of various manufacturing lines using a computer network, but also to a variety of other applications such as the following.

【００５０】(1) Entry/exit management systems for buildings and the like
(2) Reservation systems for various vehicles, lodging facilities, leisure facilities, and the like
(3) Various manufacturing and sales management systems

【００５１】[0051]

[Effects of the invention]
As described above, according to the present invention, each computer is provided with a fault detection program and inspection programs that execute various inspections. Each fault detection program starts the inspection programs to check the normal operation of the programs that should always be running in each computer and the normal operation of communication between the computers, and receives the contents of the inspection results from each computer. The operating state of the programs on all the other computers can therefore be detected at once from a given computer serving as a single terminal. This removes the usual need to log in remotely to each computer and check it individually, so faults can be detected quickly and easily and are less likely to be overlooked. In addition, because the consistency of the various definition files used in common among the computers is also checked, faults in which the system appears to operate normally but malfunctions during a specific process (for example, because some data is missing from a definition file) can be prevented in advance, greatly improving the reliability of system operation.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a fault detection system according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing the operating procedure of the fault detection system shown in FIG. 1.

FIG. 3 is a diagram showing how communication paths are set when performing the inter-computer communication check.

FIG. 4 is a diagram showing the system configuration when the fault detection system of the present invention is applied to an LSI production management system.

FIG. 5 is a diagram showing an example of the computer information file in the embodiment of FIG. 4.

FIG. 6 is a flowchart showing the procedure for checking file consistency between computers.

[Explanation of symbols]

1, 2, 3: computers
11: fault detection program
13: inspection program registration file
15: talker (transmission program)
17: listener (reception program)

Continuation of the front page
(72) Inventor: Yasushi Wada, c/o Nippon Telegraph and Telephone Corporation, 1-1-6 Uchisaiwaicho, Chiyoda-ku, Tokyo
(72) Inventor: Masato Mitomi, c/o NTT Fanet Systems Co., Ltd., 1-5-7 Nihonbashi-Horidomecho, Chuo-ku, Tokyo

Claims

[Claims]

1. A fault detection system for detecting faults in a distributed processing system in which a plurality of computers form a network and a plurality of programs operating on each computer perform predetermined processing while communicating with one another, characterized in that: each computer is provided with a fault detection program and inspection programs for executing various inspections; the fault detection programs provided on the respective computers are connected to one another via a communication control program; and the fault detection program provided on each computer starts the inspection programs in its own computer and, via the communication control program and the fault detection programs in the other computers, starts the inspection programs in the other computers, causes them to check the normal operation of the programs that should always be running in each computer and the normal operation of communication between the computers, and receives the contents of the inspection results from each computer.

2. The fault detection system according to claim 1, wherein the fault detection program has a function of inspecting, either directly or through the inspection program, whether the contents of a definition file used in common within the distributed processing system match between the computers.