JPH05257675A - Program parts duplex system

Info

Publication number: JPH05257675A
Application number: JP5587392A
Authority: JP
Inventors: Michihiro Onari; 道廣大成
Original assignee: NEC Software Shikoku Ltd
Current assignee: NEC Software Shikoku Ltd
Priority date: 1992-03-16
Filing date: 1992-03-16
Publication date: 1993-10-08

Abstract

PURPOSE: To improve system reliability by allowing operation to continue, when an online system becomes abnormal because of software, by switching to another program component that has the same function. CONSTITUTION: For failure recovery processing in an online system, the system is provided with a means 14 for managing a group of program components in a storage area, a means 12 for collecting failure statistics for each program component in which a failure has occurred, a means 13 for selecting, from the group of program components having the same function as the failed component, a program with a lower failure frequency and switching it to active, and a means 16 for scheduling the active program component and transferring control to it. When a software failure occurs, the relevant program component is switched and system operation continues.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to improving software reliability in online systems, and more particularly to a program component duplication system for online systems.

[0002]

[Prior Art] Conventionally, when a failure occurs in an online system during service and the failure makes it impossible to continue operation, the system goes down. In this case, even when the cause of the failure lay in software, the system was restarted, the same software was loaded into memory, and operation was resumed.

[0003]

[Problem to Be Solved by the Invention] When the conventional online system described above is restarted after going down, the same software is loaded into memory and operation resumes. However, because the same software runs, there is a drawback: if the software contains the cause of the failure, the system goes down again as soon as data that triggers the failure is input.

[0004]

[Means for Solving the Problem] The program component duplication system of the present invention comprises, for failure recovery processing in an online system: means for managing a group of program components in a storage area; means for collecting failure statistics for each program component in which a failure has occurred; means for selecting, from the group of program components having the same function as the failed program component, a program with a low failure frequency and switching it to active; and means for scheduling the active program component and transferring control to it. When a software failure occurs, the relevant program component is switched and system operation continues.

[0005]

[Embodiment] Next, the present invention will be described with reference to the drawings.

[0006] FIG. 1 is an explanatory diagram showing an embodiment of the present invention. In the figure, the program component duplication system according to the present invention includes an OS (operating system) failure processing unit 11, a failure statistics collection unit 12, a program component switching control unit 13, a program component information management table unit 14, various program groups 15, and a program component scheduling unit 16. It is assumed that the program groups 15 and the program component information management table 14 are loaded into memory by the OS at system startup.

[0007] While a program is running, an exception interrupt may occur, for example when the program attempts to write data to an area to which no real memory is allocated. When such a software-caused failure occurs, control is transferred to the OS failure processing unit 11.

[0008] In response, the OS failure processing unit 11 performs failure analysis and recovery processing. If the failure is attributable to software, it notifies the failure statistics collection unit 12 that a failure has occurred.

[0009] Before collecting failure statistics, the failure statistics collection unit 12 determines whether the failure occurred inside a duplicated program component. When execution is inside a program component, the corresponding component group ID and component ID are set in the running-component ID save area in memory when the component starts, and the save area is cleared when the component finishes. Therefore, if the running-component ID save area has not been cleared, it is determined that the failure occurred inside a program component.

[0010] If the running-component ID save area has been cleared, the failure did not occur in a duplicated program component, so no program component switching is needed and control is returned to the OS failure processing unit 11.
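
The save-area check in paragraphs [0009] and [0010] can be illustrated with a minimal Python sketch. The names `SaveArea`, `run_component`, and `failed_component` are hypothetical and do not appear in the source, and a Python exception stands in for the exception interrupt; the sketch only shows that an uncleared save area identifies the component that was running when the failure occurred.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SaveArea:
    """Hypothetical model of the running-component ID save area.

    None means "cleared"; otherwise it holds (group_id, component_id).
    """
    running: Optional[Tuple[int, int]] = None


def run_component(area: SaveArea, group_id: int, component_id: int,
                  body: Callable[[], object]) -> object:
    """Set the save area when the component starts, clear it when it ends."""
    area.running = (group_id, component_id)
    result = body()        # if body raises, the save area stays set
    area.running = None    # reached only on normal completion
    return result


def failed_component(area: SaveArea) -> Optional[Tuple[int, int]]:
    """Return the IDs of the component that was running at failure time.

    None means the failure was outside any duplicated component,
    so no switching is needed.
    """
    return area.running
```

If `failed_component` returns `None`, the handler would simply return to ordinary OS failure processing, as paragraph [0010] describes.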

[0011] When it is determined that the failure occurred inside a duplicated program component, the failure statistics collection unit 12 refers to the program component information management table of the program component information management table unit 14, using the component group ID and component ID set in the running-component ID save area, and obtains the table entry corresponding to that component group ID and component ID.

[0012] FIG. 2 is an explanatory diagram showing an example of the program component information management table. In the figure, the hatched portion represents one entry of the table.

[0013] Next, statistics are collected in the corresponding entry of the program component information management table. In this embodiment, collection consists of incrementing the failure count and the consecutive failure counter. A consecutive failure is an event in which the same component is selected twice in a row and fails twice in a row. When failure statistics collection is complete, control is transferred to the program component switching control unit 13 to switch program components.
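
As a rough illustration of paragraphs [0011] to [0013], the table entry and the statistics update might look like the following Python sketch. The field names and the dictionary keyed by (group ID, component ID) are assumptions made for the example; the source names only the kinds of information held (IDs, active/standby indicator, program entry address, failure count, consecutive failure counter) and does not say when the consecutive counter is reset other than in step 24.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ComponentEntry:
    """One entry of the program component information management table.

    Field names are illustrative, not taken from the source.
    """
    group_id: int
    component_id: int
    entry_address: int             # program entry address
    active: bool                   # active/standby indicator
    failure_count: int = 0         # total failures recorded
    consecutive_failures: int = 0  # consecutive failure counter


Table = Dict[Tuple[int, int], ComponentEntry]


def record_failure(table: Table, group_id: int, component_id: int) -> ComponentEntry:
    """Look up the failed component's entry and update its statistics."""
    entry = table[(group_id, component_id)]
    entry.failure_count += 1
    entry.consecutive_failures += 1
    return entry
```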

[0014] FIG. 3 is a flowchart showing the operation of the program component switching control unit 13. The inputs to the program component switching control unit 13 are the component group ID and the component ID.

[0015] (Step 21): From the program component information management table 14 in memory, obtain all per-component entries that match the component group ID. In this embodiment each component group ID has two components, so two entries are obtained.

[0016] (Step 22): Next, compare the failure counts in the failure statistics area across all per-component entries and select the entry with the smallest value. If several entries have the same failure count, the entry detected first is given priority.

[0017] (Step 23): Refer to the consecutive failure counter of the selected entry. If it is 2, the case is judged to be a consecutive failure and processing proceeds to step 24. If it is less than 2, processing proceeds to step 26.

[0018] (Step 24): Clear the consecutive failure counter.

[0019] (Step 25): Obtain the entry corresponding to the other component of the duplicated pair.

[0020] (Step 26): In the program component information management table 14, set the active/standby indicator of the entry for the failed component to standby, and set the active/standby indicator of the entry for the newly selected component to active.

[0021] This completes the switching of the failed program component, and control returns to the OS failure processing unit 11.
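
Steps 21 to 26 can be sketched in Python as follows. This is a literal reading of the flow under stated assumptions, not the patented implementation: the `Entry` type and `switch_component` name are hypothetical, the table is modeled as a list, the counter test uses `>= 2` where the source tests for exactly 2, and "the other component" in step 25 is taken to be the first entry in the group that is not the selected one (the source describes only the two-component case).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """Illustrative table entry; field names are assumptions."""
    group_id: int
    component_id: int
    active: bool
    failure_count: int = 0
    consecutive_failures: int = 0


def switch_component(table: List[Entry], group_id: int, failed_id: int) -> Entry:
    """Choose the next active component after a failure (steps 21 to 26)."""
    # Step 21: collect every entry belonging to the component group.
    group = [e for e in table if e.group_id == group_id]
    failed = next(e for e in group if e.component_id == failed_id)

    # Step 22: pick the entry with the fewest failures. min() returns the
    # first minimal item, which gives ties to the entry found first.
    selected = min(group, key=lambda e: e.failure_count)

    # Step 23: check the selected entry's consecutive failure counter.
    if selected.consecutive_failures >= 2:
        # Step 24: clear the counter.
        selected.consecutive_failures = 0
        # Step 25: take the other component instead (assumes the group
        # holds at least two entries).
        selected = next(e for e in group if e is not selected)

    # Step 26: failed component becomes standby, selected becomes active.
    failed.active = False
    selected.active = True
    return selected
```

Setting the failed entry to standby before setting the selected entry to active means that, if both refer to the same entry, it ends up active.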

[0022] Next, how the switched program component actually runs is described. When a program in the program groups 15 is running and needs to use a function that has been made into a program component, it specifies the component group ID of the target program and asks the program component scheduling unit 16 to start the program component.

[0023] In response, the program component scheduling unit 16 refers to the program component information management table held by the program component information management table unit 14, selects, from all per-component entries for the specified component group ID, the entry whose active/standby indicator is active, obtains the program entry address in that entry, and transfers control to it. In this way the program component can be run.
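
The dispatch performed by the scheduling unit in paragraphs [0022] and [0023] might be sketched as below. The names are hypothetical, and a Python callable stands in for the program entry address; the caller supplies only the component group ID, so it never needs to know which implementation is currently active.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Entry:
    """Illustrative table entry; field names are assumptions."""
    group_id: int
    component_id: int
    active: bool
    entry_point: Callable[..., Any]  # stands in for the program entry address


def start_component(table: List[Entry], group_id: int, *args: Any) -> Any:
    """Transfer control to the active component of the requested group."""
    for entry in table:
        if entry.group_id == group_id and entry.active:
            return entry.entry_point(*args)
    raise LookupError(f"no active component in group {group_id}")
```

Because the lookup happens on every start request, a change to the active/standby indicators takes effect the next time the component is started.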

[0024] The present invention is not limited to duplicating program components having the same function; three or more program components may also be provided.

[0025]

[Effect of the Invention] As described above, when an online system becomes abnormal because of software, the present invention can continue operation by switching to another program component having the same function, which has the effect of greatly improving system reliability.

[Brief description of drawings]

FIG. 1 is an explanatory diagram showing an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an example of the program component information management table.

FIG. 3 is a flowchart showing the operation of the program component switching control unit.

[Explanation of symbols]

11 OS failure processing unit
12 Failure statistics collection unit
13 Program component switching control unit
14 Program component information management table unit
15 Various program groups
16 Program component scheduling unit

Claims

[Claims]

1. A program component duplication system comprising, for failure recovery processing in an online system: means for managing a group of program components in a storage area; means for collecting failure statistics for each program component in which a failure has occurred; means for selecting, from the group of program components having the same function as the failed program component, a program with a low failure frequency and switching it to active; and means for scheduling the active program component and transferring control to it, wherein, when a software failure occurs, the relevant program component is switched and operation of the system is continued.