JP4476190B2

JP4476190B2 - Multi-computer system

Info

Publication number: JP4476190B2
Application number: JP2005214198A
Authority: JP
Inventors: 諭橋本
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-07-25
Filing date: 2005-07-25
Publication date: 2010-06-09
Anticipated expiration: 2025-07-25
Also published as: JP2007034476A

Description

The present invention relates to a multi-computer system that synchronizes the internal data of its computers during parallel recovery from a system failure.

Control computer systems that require high reliability, such as railway operation management systems, are often deployed as multi-system (redundant) configurations. In addition to the active computer that performs processing, such a configuration includes a standby computer that takes over processing if the active computer fails.
As in Patent Document 1, in a conventional multi-system, when one computer stops because of a failure or the like and is then restarted, it receives time information and various data from the other, healthy active computer after startup. This parallel recovery synchronizes the active and standby computers.

Patent Document 1: Japanese Patent Laid-Open No. 6-321112 (pages 3 to 4, FIG. 1)

In the conventional multi-system of Patent Document 1, when a computer fails and recovers by restarting, it synchronizes by receiving a large amount of data from the healthy active computer over the network. Because the data volume is large, this places a heavy load on the active computer.
Furthermore, to ensure reliable synchronization, the data is transmitted only after the functions corresponding to that data have been suppressed. Keeping functions of the active computer suppressed for a long time adversely affects the system.

The present invention has been made to solve the problems described above. Its object is to obtain a highly reliable multi-computer system that can recover quickly to a multi-system configuration after a failure without transmitting or receiving data over the network.

In the multi-computer system according to the present invention, active and standby computers constitute a multi-system via a network. Each computer comprises an operating system that executes an application; a microkernel that operates independently of the operating system and manages a communication management program that communicates with other computers via the network; and a memory having a shared memory area that is managed by the microkernel and accessed by the operating system. The microkernel stores, in the shared memory area, the data that the operating system of its own computer needs when recovering from a failure. If that data is updated while the operating system is stopped, the microkernel receives the updated data from another computer and updates the stored data.

As described above, in the present invention, active and standby computers constitute a multi-system via a network, and each computer comprises an operating system that executes an application, a microkernel that operates independently of the operating system and manages a communication management program that communicates with other computers via the network, and a memory having a shared memory area that is managed by the microkernel and accessed by the operating system. The microkernel stores, in the shared memory area, the data that the operating system of its own computer needs when recovering from a failure, and if that data is updated while the operating system is stopped, it receives the updated data from another computer and updates the stored data. Consequently, when recovering from a failure, the computer can use the data in the shared memory area to recover quickly to a multi-system configuration without transmitting or receiving data over the network.

Embodiment 1.
FIG. 1 is a block diagram showing a multi-computer system according to Embodiment 1 of the present invention.
In FIG. 1, the multi-computer system is a duplex system consisting of two control computers. However, it may consist of three or more control computers.
In FIG. 1, control computers 10 and 11 operate as the active computer and the standby computer, respectively. Each of the control computers 10 and 11 includes a network card 50, a main memory 20, a central processing unit (hereinafter, CPU) 60, and a DIO (Digital I/O) card 110, which are connected by a bus. A hard disk device, input/output devices, and the like may also be connected.
The network card 50 is connected to an Ethernet (registered trademark) network 90, which is also connected to other computers. Via this network 90, the control computers 10 and 11 communicate with each other and with the other computers. The two computers are also connected to each other through DIO contacts 120.

When both control computers 10 and 11 are operating normally, the main memory 20 of the control computer 10, the active system, is loaded with an OS (operating system) 80, a microkernel 40, an application 70, and applications 100 that run under the microkernel 40, such as the communication management program.
Likewise, the main memory 20 of the control computer 11, the standby system, is loaded with the OS 80, the microkernel 40, the application 70, and the microkernel applications 100. These programs run on both control computers 10 and 11, although the application 70 may not be running on the control computer 11.
The application 70 is the program that performs the processing for which the multi-system is used. In addition, the microkernel 40 stores the data 30 required for recovery in the main memory 20.

Next, the operation will be described.
The microkernel 40 is a program that is independent of the OS kernel and positioned below the OS 80. It monitors the operating status of the OS 80 and manages programs that require real-time performance, such as the communication management program. The processing time of the CPU 60 is assigned preferentially to the microkernel 40, and the remaining time is assigned to the OS 80.
The main memory 20 is divided into an area managed and used by the microkernel 40 and an area managed and used by the OS 80. If there are other devices, such as PCI (Peripheral Component Interconnect) devices, the microkernel 40 decides whether each one is managed by the microkernel 40 or by the OS 80. The DIO card 110 is managed by the microkernel 40 and is used for mutual communication with the microkernel 40 of the other computer. The microkernel 40 also manages the network card 50.
The communication management program forwards to the OS 80 those data, among the data received from other computers via the network 90, that the application 70 needs. Bus communication or virtual Ethernet (registered trademark) communication is used for this forwarding.

The area of the main memory 20 that the microkernel 40 manages can be accessed from the OS 80. In other words, it is a shared memory area shared by the microkernel 40 and the OS 80. The OS 80 recognizes this shared memory area in the same way as an ordinary RAM disk and can access it directly. Because the shared memory area is managed by the microkernel 40, its contents are not lost even if the OS 80 stops, as long as the microkernel 40 is running.
This shared memory area stores the various data 30 required for recovery, that is, the kind of data that a conventional system received from the other computer at recovery time. Because the microkernel 40 manages the communication management program as described above, it can receive messages from the other computer and from other devices even while the OS 80 is stopped. If the state changes while the OS 80 is stopped and the data required for recovery is updated, the microkernel 40 receives the update through the network 90 and updates the data 30.
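The behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the patented implementation: the class names `SharedRecoveryArea` and `Microkernel`, the `version` counter, and the dictionary layout of the recovery data are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SharedRecoveryArea:
    """Hypothetical stand-in for the shared memory area (holds data 30)."""
    data: dict = field(default_factory=dict)
    version: int = 0  # assumed counter, bumped on every update


class Microkernel:
    """Hypothetical model: owns the shared area independently of the OS."""

    def __init__(self) -> None:
        self.area = SharedRecoveryArea()
        self.os_running = True

    def on_network_update(self, key: str, value: object) -> None:
        # Updates are applied whether or not the OS is running, so the
        # area is already current when the OS comes back.
        self.area.data[key] = value
        self.area.version += 1

    def stop_os(self) -> None:
        # Stopping the OS leaves the shared area untouched.
        self.os_running = False

    def restart_os(self) -> dict:
        # The restarted OS reads its recovery data locally instead of
        # requesting a bulk transfer from the active computer.
        self.os_running = True
        return dict(self.area.data)
```

In this model, an update that arrives between `stop_os()` and `restart_os()` is visible to the restarted OS without any further network traffic at recovery time, which is the effect the description attributes to the shared memory area.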

Next, the recovery operation from a failure in the multi-computer system of Embodiment 1 will be described.
The control computers 10 and 11 are connected through their DIO cards 110, and each microkernel 40 monitors the other computer by communicating with the microkernel 40 of the other computer via the DIO card 110. In other words, the microkernels 40 constantly monitor each other's operating state via the DIO cards 110. When the control computer 11 determines from this mutual communication that an abnormality has occurred in the control computer 10, it makes itself the active computer.
In addition, the microkernel 40 of each of the control computers 10 and 11 requests the OS 80 to send an alive message at regular intervals. If an abnormality occurs in the OS 80 and the alive message can no longer be received, the microkernel 40 detects this and restarts the OS 80.
The microkernel 40 holds the data 30 required for recovery in the shared memory area, and this data 30 is used to recover to a multi-system configuration.
In some cases, while the OS 80 is stopped, the microkernel 40 uses the network card 50 that it manages to receive various data from the microkernel 40 of the control computer 11, which is now the active system, and from other devices. It updates the data 30 accordingly, and the updated data is used to recover to a multi-system configuration.
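The alive-message check can be illustrated with a small watchdog. This is a minimal sketch under stated assumptions: the class name `OsWatchdog`, the explicit `now` timestamps, the timeout value, and the `restart_os` callback are hypothetical and do not appear in the patent.

```python
class OsWatchdog:
    """Hypothetical watchdog: restarts the OS when alive messages stop."""

    def __init__(self, timeout: float, restart_os) -> None:
        self.timeout = timeout        # assumed maximum silence, in seconds
        self.restart_os = restart_os  # callback supplied by the caller
        self.last_alive = 0.0
        self.restarts = 0

    def alive(self, now: float) -> None:
        # Called when the OS answers the microkernel's request.
        self.last_alive = now

    def poll(self, now: float) -> bool:
        # Returns True when a restart was triggered on this poll.
        if now - self.last_alive <= self.timeout:
            return False
        self.restart_os()
        self.restarts += 1
        self.last_alive = now  # give the restarted OS a fresh window
        return True
```

Passing the current time in explicitly keeps the sketch deterministic. A real microkernel would presumably drive `poll` from its own timer, but the patent does not specify that mechanism.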

According to Embodiment 1, a memory area managed by the microkernel is provided separately from the OS, the OS is allowed to access this area, and the data required for recovery from a failure is stored there. Using this data at recovery time shortens the time needed to recover from an abnormal stop.

Embodiment 2.
In Embodiment 1, the recovery time is shortened by storing the data 30 required for recovery in the shared memory area managed by the microkernel 40. In Embodiment 2, the operation status of the own computer is additionally stored in this shared memory area.
FIG. 2 is a block diagram showing a multi-computer system according to Embodiment 2 of the present invention.
In FIG. 2, reference numerals 10, 11, 20, 30, 40, 50, 60, 70, 80, 90, and 100 denote the same elements as in FIG. 1. In FIG. 2, the control computers 10 and 11 do not have a DIO card 110. That is, unlike Embodiment 1, mutual monitoring of the computers is not performed through DIO contacts.

Next, the operation will be described.
The microkernel 40 of the control computer 10 in FIG. 2 sends an operation status confirmation message at regular intervals to the application 70 managed by the OS 80. On receiving the message, the application 70 stores operation status information in the shared memory area that the microkernel 40 manages and shares with the OS 80. This information indicates whether its own control computer is operating as the active system, operating as the standby system, or performing recovery.
The microkernel 40 of the control computer 11 likewise sends an operation status confirmation message to the application 70 managed by the OS 80, obtains the operation status of its own control computer, and has it stored in the shared memory area.
If this information is not accessed for a certain time or longer, the microkernel 40 determines that the OS 80 of its own control computer has stopped and changes the information to the stopped state.
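The status record and its staleness rule can be sketched as follows. The names `Status` and `StatusSlot`, the `stale_after` threshold, and the explicit timestamps are assumptions made for illustration. The sketch also interprets "access" as a write by the application, which the description does not state explicitly.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    RECOVERING = "recovering"
    STOPPED = "stopped"


class StatusSlot:
    """Hypothetical operation-status record kept in the shared memory area."""

    def __init__(self, stale_after: float) -> None:
        self.stale_after = stale_after  # assumed threshold, in seconds
        self.status = Status.STOPPED
        self.last_write = float("-inf")

    def write(self, status: Status, now: float) -> None:
        # Done by the application in reply to a status confirmation message.
        self.status = status
        self.last_write = now

    def check(self, now: float) -> Status:
        # Done by the microkernel: no write for too long means the OS is down.
        if now - self.last_write > self.stale_after:
            self.status = Status.STOPPED
        return self.status
```

Because the microkernel overwrites the record itself, the shared area reports the stopped state even when the OS is no longer able to write anything.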

The microkernels 40 of the control computers 10 and 11 exchange this operation status information with each other through the network 90. By monitoring each other in this way, each can monitor the other system.
Embodiment 1 performs this monitoring with DIO contact information. Embodiment 2 makes it possible to monitor the other system at lower cost than Embodiment 1, because no DIO card is needed.
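How a computer might act on the status it receives from the other system is not spelled out for Embodiment 2. The sketch below assumes the takeover rule from Embodiment 1, in which the standby computer makes itself active when the other computer is abnormal. The function name `next_role`, the string status values, and the choice to treat a recovering peer as unavailable are all hypothetical.

```python
def next_role(own_role: str, peer_status: str) -> str:
    """Role this computer should take after seeing the peer's status.

    Statuses are plain strings here: "active", "standby",
    "recovering", "stopped".
    """
    # Assumption: a peer that is stopped or still recovering cannot serve.
    peer_down = peer_status in ("stopped", "recovering")
    if own_role == "standby" and peer_down:
        return "active"  # take over, as the standby does in Embodiment 1
    return own_role      # otherwise keep the current role
```

For example, `next_role("standby", "stopped")` returns `"active"`, while `next_role("standby", "active")` leaves the role unchanged.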

According to Embodiment 2, the microkernel of each control computer asks the application managed by the OS for its operation status, and information indicating whether the own control computer is active, standby, or recovering is stored in the shared memory area. By exchanging this information with each other, the computers can monitor the other system.

Brief description of the drawings

FIG. 1 is a block diagram showing a multi-computer system according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a multi-computer system according to Embodiment 2 of the present invention.

Explanation of symbols

10: control computer (system A); 11: control computer (system B); 20: main memory;
30: data required for recovery; 40: microkernel;
50: network card; 60: CPU; 70: application on the OS;
80: OS; 90: network;
100: application on the microkernel; 110: DIO card;
120: DIO contact.

Claims

1. A multi-computer system in which active and standby computers constitute a multi-system via a network, wherein each of the computers comprises: an operating system that executes an application; a microkernel that operates independently of the operating system and manages a communication management program that communicates with the other computer via the network; and a memory having a shared memory area that is managed by the microkernel and accessed by the operating system, and wherein the microkernel stores, in the shared memory area, data required when the operating system of its own computer recovers from a failure and, when the data required for recovery is updated while the operating system is stopped, receives the updated data from the other computer and updates the stored data.

2. The multi-computer system according to claim 1, wherein each of the computers has a DIO card, and the microkernel monitors the other computer by communicating with the microkernel of the other computer via the DIO card.

3. The multi-computer system according to claim 1, wherein the operation status of the own computer is stored in the shared memory area by the application in response to operation status confirmations issued by the microkernel at fixed intervals, and the microkernel monitors the other computer by exchanging the operation status of its own computer stored in the shared memory area with the microkernel of the other computer.