JP4494263B2

JP4494263B2 - Service system redundancy method

Info

Publication number: JP4494263B2
Application number: JP2005087523A
Authority: JP
Inventors: 英一郎森; 征雄川口; 弘泰板東; 智幸浜田; 芳則町田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2005-03-25
Filing date: 2005-03-25
Publication date: 2010-06-30
Anticipated expiration: 2025-03-25
Also published as: JP2006268596A

Description

The present invention relates to a redundancy scheme for a service system, and more particularly to a redundancy scheme for a service system that provides a predetermined service function through program execution on a computer. Today, many service functions, such as those of communication control devices, are realized by program execution on a computer, and improvements in the performance and reliability of such service systems are desired.

Conventionally, one method of shortening the interruption time of this type of service system is to physically duplicate (make redundant) not only the CPU of the computer system but also the memory and all other hardware, and to keep both systems synchronized at all times. Even if the active computer system fails, the standby computer system promptly takes over the operational information held in memory, becomes the new active system, and resumes service while preserving the service state that existed up to the failure. The service can thus be restored quickly, with a short interruption, while failure information is collected.

However, hardware duplication has the drawback of high introduction cost and power consumption. Moreover, most failures that actually occur are caused by software errors (bugs) or overload; a system switchover caused by an actual hardware failure is rare. Duplicating the hardware was therefore, in many cases, an excessive investment.

Conventional computer systems are also provided with a function that, when a fatal error occurs that prevents processing from continuing, forcibly terminates the program and outputs the contents of memory and registers immediately before the error to a failure information file. This function is indispensable for investigating the cause of a failure.

However, the time required to output the failure file depends on the load module (LM) size of the program concerned and on the size of the memory areas acquired during program processing. For a large program, dumping the failure file can take more than ten minutes, which has a considerable effect on the service interruption time.

In particular, in a server system that requires real-time behavior, such as one handling voice, the program must be restarted as soon as possible after a program failure. However, if the program is restarted before collection of the failure file information is complete, registers, the stack, memory and so on are overwritten, and useful information may be lost.

In this regard, a conventional arrangement is known in which a computer system is provided with a system bus 6 used for normal access and a diagnostic bus 7 dedicated to maintenance and diagnosis, which transfers data serially. When local memory information from a faulty unit 3 cannot be dumped to the main memory 1 via the system bus 6, it is collected via the diagnostic bus 7, so that local memory information can be collected even for a failure related to the system bus 6 (Patent Document 1).
Patent Document 1: JP-A-6-342387 (abstract, figure)

However, although the above configuration is relatively simple in that it only adds a separate serial diagnostic bus, when the system bus 6 cannot be used and the serial diagnostic bus 7 is used instead, sufficient failure information cannot necessarily be collected in a short time.

The present invention has been made in view of the above problems of the prior art, and its object is to provide a redundancy scheme for a service system that, with a simple configuration, keeps service interruption due to a failure to a minimum and reliably collects the failure information at that time.

The above problem is solved, for example, by the configuration of FIG. 1. That is, the redundancy scheme of the present invention (1) is a redundancy scheme for a service system that provides a predetermined service function through program execution on a computer, comprising: first and second service processing units 21A, 21B that realize the same service function through program execution; first and second memory areas 41A, 41B provided in correspondence with the first and second service processing units, respectively, which store processing information relating to service operation; a save memory 60 for saving the information stored in the first or second memory area; and a service management unit 30 that operates the pairs, each consisting of a service processing unit and its memory area, as an active system and a standby system.
The active service processing unit 21A writes its own processing information to both the active and standby memory areas 41A, 41B. When a failure occurs in the active system, the service management unit 30 promptly switches systems and saves the log information stored in the old active memory area 41A to the save memory 60. During the period in which the old active service processing unit 21A is faulty or the data of the old active memory area 41A is being saved, the service management unit 30 holds area information identifying the areas of the new active memory 41B to which data has been written. When the old active service processing unit 21A is restored, the stored data corresponding to that area information is copied from the new active memory to the old active memory area 41A.

According to the present invention (1), a simple configuration is used that comprises the service processing units 21A, 21B realized by program execution and the duplicated memory areas 41A, 41B, and that operates them as an active system and a standby system while keeping the stored information of both areas synchronized (identical). When a failure occurs in the active system, the systems can be switched promptly, and the log information relating to the preceding service processing can be reliably collected from the old active memory area 41A. Furthermore, because it is sufficient to hold only the area information of the new active memory 41B to which data has been written, when the old active service processing unit 21A is restored, the contents of the old active (now standby) memory 41A can easily be synchronized with the data in those areas of the new active memory 41B.

In the present invention (2), the configuration of the present invention (1) further comprises first and second adapter units that are bound to the first and second service processing units, respectively, and perform memory access to the first and second memory areas on their behalf. The active adapter unit normally writes the same data to both the active and standby memory areas in accordance with a data write command from the active service processing unit. While the old active (standby) service processing unit is faulty, or while the data of the old active (standby) memory area is being saved, it writes data only to the new active memory area. Operational data can therefore continue to be updated, and the memory information of the old active system after the switchover is reliably protected.

In the present invention (3), in the configuration of the present invention (1), the first and second service processing units 21A, 21B are configured to run on their own CPUs 11A, 11B, respectively, as shown for example in FIG. 9. The required service function can therefore be provided with high performance and safety using a relatively simple duplicated configuration.

As described above, according to the present invention, service interruption due to a failure is kept to a minimum and the failure information at that time can be reliably collected, so that this type of computer-based service system can be provided inexpensively and safely.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The same reference numerals denote the same or corresponding parts throughout the drawings. FIG. 1 is a block diagram of a computer system according to an embodiment, showing a configuration example based on multiprogramming (a single CPU executes multiple service processes concurrently).

The reference numerals in the figure are as follows.
100: Computer system using multiprogramming
11: CPU
10A, 10B: Service processors A and B, which can realize the same service function (a voice communication function or the like) by executing a service program
20A, 20B: Main memory (RAM, cache memory, etc.) storing the service programs
21A, 21B: Service processing units A and B, realized by execution of the service programs
22A, 22B: Adapter units A and B, which interface the access of service processing units A and B, respectively, to virtual memories A and B
30: Service management unit, which monitors service processors 10A, 10B for failures, manages the service operation state of the active and standby systems, and controls switching between the active and standby systems when a failure occurs
40: Memory storing the various processing information relating to operation of the service system (may be part of the main memory)
41A, 41B: Virtual memories A and B, duplicated for the active and standby systems
50: Operating system unit storing the programs of the operating system (OS)
60: Disk device (DISK) to which the information of virtual memory A or B is saved

In the following description, the service processing units 21A, 21B and so on are referred to simply as service processing units A, B and so on for brevity. The functions of service processing units A and B are duplicated (redundant), and the figure shows the case where A is the active system and B is the standby (spare) system.

Service processing units A and B and the service management unit 30 are realized by program execution on the CPU 11, with the OS 50 as the platform.

Virtual memory A is the operational memory that the active service processing unit A uses to read and write processing information relating to service operation. Virtual memory B is the standby memory into which the active service processing unit A copies the same information as virtual memory A in preparation for a switchover on failure. If the systems are switched when a failure occurs, virtual memory B becomes the operational memory that service processing unit B, now the new active system, continues to use for reading and writing processing information relating to service operation. Virtual memory A, once collection of the failure log information has been completed, in turn becomes the standby memory into which the active service processing unit B copies the same information as virtual memory B in preparation for a switchover on failure.

Thus, for service processing unit A the active memory is virtual memory A and the standby memory is virtual memory B, whereas for service processing unit B the active memory is virtual memory B and the standby memory is virtual memory A. Because service processing units A and B are generated (compiled) from the same source files, each of them must translate (convert) the destination (address) of its accesses to the active and standby memories.

Adapter unit A therefore translates accesses from service processing unit A to its active and standby memories into accesses to virtual memories A and B, respectively, while adapter unit B translates accesses from service processing unit B to its active and standby memories into accesses to virtual memories B and A, respectively.

With this configuration, the information stored in virtual memories A and B is identical from the outset. Assuming that service processing unit A is the active system, adapter unit A performs a data read R from virtual memory A only, in accordance with a data read command from service processing unit A. In accordance with a data write command from service processing unit A, adapter unit A performs data writes W1 and W2 to the same (corresponding) areas of virtual memories A and B. Because this processing is carried out entirely by adapter unit A, the programmer who writes the service program does not need to be aware of this memory access mechanism. During this period, service processing unit B does not access virtual memories A and B, so their contents remain identical after the system starts operating. The service management unit 30 can therefore switch systems promptly at any time.
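The read and dual-write behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `VirtualMemory` and `Adapter` classes, the `own`/`peer` naming, and the area key `"call-17"` are all assumptions introduced here, and each virtual memory is modelled as a plain dictionary keyed by area.

```python
class VirtualMemory:
    """A memory image, modelled as a mapping from area id to stored data."""

    def __init__(self, name):
        self.name = name
        self.areas = {}


class Adapter:
    """Translates a service unit's 'active'/'standby' accesses to real memories.

    Both service units run the same program; only the adapter binding differs.
    """

    def __init__(self, own, peer):
        self.own = own    # memory this unit treats as its active memory
        self.peer = peer  # memory this unit treats as its standby memory

    def read(self, area):
        # Read R: only the unit's own (active) memory is consulted.
        return self.own.areas[area]

    def write(self, area, data):
        # Writes W1 and W2: the same area of both memories gets the same data.
        self.own.areas[area] = data
        self.peer.areas[area] = data


mem_a, mem_b = VirtualMemory("A"), VirtualMemory("B")
adapter_a = Adapter(own=mem_a, peer=mem_b)  # bound to service processing unit A
adapter_b = Adapter(own=mem_b, peer=mem_a)  # bound to service processing unit B

adapter_a.write("call-17", b"ringing")  # unit A is active and updates its state
```

Because the mirroring lives in the adapter, the service code calls only `read` and `write`, and unit B can take over at any time by reading the same area through `adapter_b`.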

FIG. 2 illustrates the procedure for building the service processor according to the embodiment, showing how an object program that accesses virtual memories A and B via the adapter unit is generated from the source program of the service processor. For service processing units A and B to access virtual memories A and B correctly regardless of whether they are active or standby, the memory access portions of service processing units A and B must be changed to go through adapter units A and B.

To achieve this, the source files of the service program are first input to a precompiler. Precompilation produces intermediate files in which the memory access portions have been converted to go through the adapter. These intermediate files are then linked with the adapter components and the OS-dependent components and compiled by the compiler. The final output is an adapter-aware service program (object file) that accesses virtual memories A and B via the adapter unit.

FIG. 3 shows the operation of the precompiler according to the embodiment in concrete terms. Here, instructions related to memory access are converted into instructions that go through the adapter unit. For example, the variable definition "static int X" in source file 1 is an internal reference and is therefore unrelated to the memory access in question. In contrast, "shmget(aa, bb)" is an instruction that secures an area in the virtual memory, and it is precompiled into "adp_shmget(aa, bb)" in the intermediate file so that it goes through the adapter unit. The other instructions can be understood in the same way. The variable X in source file 2 is the external reference "extern int X" and is therefore converted to its adapter-aware form.
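The kind of source rewriting described for FIG. 3 can be illustrated with a toy text transformation. This is a sketch under stated assumptions: the source names only the `shmget` to `adp_shmget` conversion, so the `MEMORY_CALLS` rule table and the `precompile` function are hypothetical, the regular-expression approach is a stand-in for whatever the actual precompiler does, and the handling of `extern` variables is left out because the source does not say what form it takes.

```python
import re

# Hypothetical rule table: only shmget is named in the text; a real
# precompiler would presumably cover every memory-access call.
MEMORY_CALLS = ("shmget",)

_CALL_PATTERN = re.compile(r"\b(" + "|".join(MEMORY_CALLS) + r")\s*\(")


def precompile(source: str) -> str:
    """Rewrite memory-access calls so that they go through the adapter."""
    # \b does not match between '_' and 's', so an existing adp_shmget(
    # is left untouched and the rewrite is safe to apply twice.
    return _CALL_PATTERN.sub(lambda m: "adp_" + m.group(1) + "(", source)


source_file_1 = "static int X;\nid = shmget(aa, bb);\n"
intermediate_file = precompile(source_file_1)
```

In this sketch the `static int X;` line passes through unchanged, matching the statement that internal references are not affected.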

Next, the recovery operation with this configuration when a failure occurs in the active system is described. FIGS. 4 and 5 are diagrams (1) and (2) illustrating the operation of the computer system according to the embodiment. FIG. 4 shows the state when a failure has occurred in the active service processing unit A and the active system has been switched.

The service management unit 30 and the OS 50 monitor the operating state of service processing unit A. If some failure occurs in service processing unit A, it is detected by either the OS 50 or the service management unit 30. When the OS 50 detects a failure in system operation, it notifies the service management unit 30 that a failure has occurred. When the service management unit 30 detects a failure in service (application) operation, it notifies the OS 50 that service processing unit A has failed.

The service management unit 30 collects the execution information (register information, stack information, etc.) of service processing unit A, which runs under OS control, as failure information and saves it to a log file on the DISK 60. It also sends a service takeover request to service processing unit B and, at the same time, saves the information in virtual memory area A, which the failed service processing unit A had been using, to the DISK 60. Because this saving is performed by the old active system (service processing unit A, virtual memory A, and the service management unit 30), from the viewpoint of the new active service processing unit B it takes place in the background. Service is therefore resumed with the system interruption caused by log collection kept to a minimum and without any degradation of service performance due to the load of log collection.
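The point that the switchover does not wait for the dump can be sketched with a background thread. This is only an illustration under assumptions: the `fail_over` function, the `state` and `disk` dictionaries, and the sample register values are hypothetical, a Python thread stands in for the old active system doing the save, and dictionary copies stand in for writing the log file and memory image to the DISK 60.

```python
import threading


def fail_over(state, registers, old_memory, disk):
    """Switch to unit B at once, then save unit A's failure data in the background."""
    state["active"] = "B"  # takeover request: service resumes on B immediately

    def save():
        # Snapshot copies stand in for writing the log file and the
        # image of virtual memory A to disk.
        disk["log"] = dict(registers)
        disk["memory_a"] = dict(old_memory)

    worker = threading.Thread(target=save)
    worker.start()
    return worker  # the caller can join() to learn when the save is complete


state = {"active": "A"}
disk = {}
worker = fail_over(state, {"pc": 0x4010, "sp": 0x7FF0}, {"call-1": "talking"}, disk)
active_right_after_switch = state["active"]
worker.join()  # corresponds to completion of log collection
```

The completion of `worker` plays the role of the recovery notification that later triggers resynchronization of virtual memory A.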

Meanwhile, service processing unit (adapter unit) B, now the new active system, reads data from virtual memory B and attempts to write the same data to both virtual memories B and A. However, data cannot be written to virtual memory A, because its failure information is being protected or collected. In this case, adapter unit B writes only to virtual memory B and notifies the service management unit 30 of at least the area information identifying the area of virtual memory B to which the data was written, and the service management unit 30 holds this information.
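A minimal sketch of this protected-write behavior follows. The `ServiceManager` and `Adapter` classes, the `peer_protected` flag, and the `pending_areas` set are names assumed for illustration; the source says only that the adapter reports the written area and that the service management unit holds it.

```python
class ServiceManager:
    """Holds the switchover state and the areas awaiting resynchronization."""

    def __init__(self):
        self.peer_protected = False  # True while the old active memory is being saved
        self.pending_areas = set()   # areas written only to the new active memory


class Adapter:
    def __init__(self, own, peer, manager):
        self.own, self.peer, self.manager = own, peer, manager

    def write(self, area, data):
        self.own[area] = data
        if self.manager.peer_protected:
            # The peer memory must not be touched: record the area instead.
            self.manager.pending_areas.add(area)
        else:
            self.peer[area] = data


mem_a = {"call-1": "talking"}  # old active memory, frozen for log collection
mem_b = dict(mem_a)            # new active memory, identical at switchover
manager = ServiceManager()
adapter_b = Adapter(own=mem_b, peer=mem_a, manager=manager)

manager.peer_protected = True
adapter_b.write("call-1", "released")
adapter_b.write("call-2", "ringing")
```

Recording only area identifiers keeps the bookkeeping small; the data itself stays in the new active memory until it is copied back.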

FIG. 5 shows the operation when the work of collecting and saving the log of virtual memory A has been completed. When service processing unit A outputs a recovery notification on completion of log collection, the service management unit 30 issues a copy request to the active adapter unit 22B and passes it the area information held as described above. On receiving this, adapter unit B reads the information in that area of virtual memory B and writes it to the corresponding (same) area of virtual memory A. If the service management unit 30 holds more than one area, the same copy processing is repeated. When the data of all held areas has been copied, the data of virtual memories B and A is synchronized (identical). This method is efficient because it does not require all data to be copied between virtual memories A and B.

After all copying has finished, the service management unit 30 restarts service processing unit A as the standby system and notifies adapter unit B accordingly, whereupon adapter unit B begins data writes W2 to virtual memory A. The active system and the standby system have thus been completely exchanged.

FIGS. 6 to 8 are operation sequence diagrams (1) to (3) of the adapter unit according to the embodiment, showing several typical sequences that realize the adapter operation described above. The upper half of FIG. 6 shows the processing by which the active service processing unit A acquires a memory area through adapter unit A. This is performed as preprocessing to secure an area when service processing unit A wants to access some area of virtual memories A and B.

When service processing unit A issues a memory area acquisition request to adapter unit A, adapter unit A checks the operating state of service processing unit B in step S11. If service processing unit B is operating as the standby system (that is, in the normal operating state), adapter unit A acquires the memory areas of virtual memories A and B in steps S12 and S13 and returns an area acquisition response to service processing unit A. If the check in step S11 finds that service processing unit B has failed or that its log information is being saved, adapter unit A likewise acquires the memory areas of virtual memories A and B in steps S14 and S15 and returns an area acquisition response to service processing unit A. In either case, adapter unit A notifies the service management unit 30 of the memory area acquisition information, and the service management unit 30 holds it in step S16. The service management unit 30 can thereby manage which memory areas of virtual memory A and/or B are accessed. In this example there is no need to branch at step S11, but the branch is shown for clarity of explanation.

The lower half of FIG. 6 shows the processing by which the active service processing unit A reads data through adapter unit A. When service processing unit A issues a memory read request to adapter unit A, adapter unit A reads the data from the acquired area of virtual memory A in step S17 and returns a memory read response to service processing unit A.

The upper half of FIG. 7 shows the processing by which the active service processing unit A writes the same data to virtual memory A and, where possible, to virtual memory B. When service processing unit A issues a memory write request to adapter unit A, adapter unit A checks the operating state of service processing unit B in step S21. If service processing unit B is operating as the standby system, adapter unit A writes the data to virtual memories A and B in steps S22 and S23 and returns a memory write response to service processing unit A. If the check in step S21 finds that service processing unit B is faulty or that saving is in progress, adapter unit A writes the data to virtual memory A only in step S24 and returns a memory write response to service processing unit A. Corruption of the contents of virtual memory B is thereby effectively prevented.

The lower half of FIG. 7 shows the case where the active service processing unit A releases memory areas of virtual memories A and B. When service processing unit A issues a memory release request to adapter unit A, adapter unit A checks the operating state of service processing unit B in step S25. If service processing unit B is operating as the standby system, adapter unit A releases the memory areas of virtual memories A and B in steps S26 and S27 and returns a memory release response to service processing unit A. If the check in step S25 finds that service processing unit B is faulty or that saving is in progress, adapter unit A releases the memory areas of virtual memories A and B in steps S28 and S29 and returns a memory release response to service processing unit A.

In either case, adapter unit A notifies the service management unit 30 of the memory release information, and the service management unit 30 holds this information in step S30. The service management unit 30 can thereby manage the fact that memory access to virtual memories A and B has been released.

FIG. 8 shows the processing that makes the data of virtual memories A and B identical when service processing unit A recovers from the failure (that is, when collection of the log information of virtual memory A is complete). On receiving the recovery notification from service processing unit A, the service management unit 30 determines in step S41 whether it holds any memory area recorded for accesses to virtual memory B. If it does, it issues a request to the active adapter unit B to copy from virtual memory B to virtual memory A and passes it the area information.

On receiving this, adapter unit B copies the data of that area of virtual memory B to the corresponding area of virtual memory A in step S50 and returns a copy completion response to the service management unit 30.

On receiving this, the service management unit 30 resets the information for that area in step S42 and returns to step S41. When the copy processing for all of the one or more held areas has eventually been completed, the service management unit 30 notifies service processing unit A that it is to resume as the standby system.
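The S41, S50, S42 loop can be sketched as below. The function name `resync`, its parameters, and the sample data are assumptions made for illustration, and the sketch further assumes that every recorded area still exists in the new active memory when the copy runs.

```python
def resync(pending_areas, new_active, old_active):
    """Copy only the recorded areas from the new active memory to the old one."""
    copied = []
    while pending_areas:                      # S41: is any recorded area left?
        area = next(iter(pending_areas))
        old_active[area] = new_active[area]   # S50: copy B -> A for this area
        pending_areas.discard(area)           # S42: reset the record, back to S41
        copied.append(area)
    return copied                             # caller then restarts A as standby


new_active = {"call-1": "released", "call-2": "ringing", "call-3": "idle"}
old_active = {"call-1": "talking", "call-3": "idle"}
pending = {"call-1", "call-2"}

copied = resync(pending, new_active, old_active)
```

Here two of the three areas are copied; `call-3` was never written during the protected period and is not transferred, which is the efficiency the text attributes to copying only the recorded areas.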

FIG. 9 is a block diagram of a computer system according to another embodiment, showing a configuration example based on a multi-CPU scheme in which two CPUs can execute two service processes in parallel. In this example, service processing unit A is realized by the CPU 11A executing the service program, and service processing unit B is realized by the CPU 11B executing the service program. CPU-A runs under OS-A and CPU-B runs under OS-B. The processing performance of service processing units A and B is thereby markedly improved.

The service management unit 30 in this example, on the other hand, is basically realized by program execution on the standby CPU 11B, so the active CPU 11A can be devoted entirely to providing the original service function.

The above embodiments describe an access method in which the memory area is acquired in advance, but the invention is not limited to this. Various other memory access methods can be employed.

Although several preferred embodiments of the present invention have been described, it goes without saying that the configuration, control, and processing of each part, and combinations of these, can be modified in various ways without departing from the spirit of the invention.

FIG. 1 is a block diagram of a computer system according to the embodiment.
FIG. 2 is a diagram explaining the construction procedure of the service processor according to the embodiment.
FIG. 3 is a diagram explaining the operation of the precompiler according to the embodiment.
FIG. 4 is a diagram (1) explaining the operation of the computer system according to the embodiment.
FIG. 5 is a diagram (2) explaining the operation of the computer system according to the embodiment.
FIG. 6 is an operation sequence diagram (1) of the adapter unit according to the embodiment.
FIG. 7 is an operation sequence diagram (2) of the adapter unit according to the embodiment.
FIG. 8 is an operation sequence diagram (3) of the adapter unit according to the embodiment.
FIG. 9 is a block diagram of a computer system according to another embodiment.

Explanation of symbols

11 CPU
10A, 10B Service processors A, B
20A, 20B Main memories A, B
21A, 21B Service processing units A, B
22A, 22B Adapter units A, B
30 Service management unit
41A, 41B Virtual memories A, B
60 Disk device (DISK)
100 Computer system

Claims

A service system redundancy system that provides a predetermined service function by executing a program on a computer, comprising:
first and second service processing units that realize the same service function by program execution;
first and second memory areas, provided for the first and second service processing units respectively, that store processing information relating to service operation;
a save memory for saving the information stored in the first or second memory area; and
a service management unit that operates the pairs formed by the first and second service processing units and the first and second memory areas as an active system and a standby system,
wherein the active service processing unit writes its own processing information to the memory areas of both the active system and the standby system, and the service management unit, upon a failure of the active system, promptly switches the systems and saves the information stored in the memory area of the old active system to the save memory as log information; holds area information identifying the areas of the new active memory area to which data is written while the service processing unit of the old active system is in failure or while the data in the memory area of the old active system is being saved; and, when the service processing unit of the old active system recovers, copies the stored data corresponding to that area information from the new active memory area to the memory area of the old active system.

The service system redundancy system according to claim 1, further comprising first and second adapter units, bound to the first and second service processing units respectively, that perform memory access to the first and second memory areas on their behalf,
wherein the active adapter unit normally writes the same data to the memory areas of both the active system and the standby system in accordance with a data write command from the active service processing unit, and, while the service processing unit of the old active system is in failure or while the data in the memory area of the old active system is being saved, writes the data only to the memory area of the new active system.

The service system redundancy system according to claim 1, wherein the first and second service processing units are each configured to operate on their own CPU.
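As an illustration only, the write behaviour recited in the claims above might be sketched as follows. The names `Adapter`, `fail_over`, `peer_available` and `pending_areas` are assumptions introduced here, memory areas are modelled as dictionaries, and for brevity the sketch keeps the area information inside the adapter, whereas the claims have the service management unit hold it.

```python
class Adapter:
    """Hypothetical adapter that proxies memory writes for one service processing unit."""

    def __init__(self, own_mem: dict, peer_mem: dict) -> None:
        self.own_mem = own_mem        # memory area of this unit's system
        self.peer_mem = peer_mem      # memory area of the other system
        self.peer_available = True    # False while the peer is failed or being saved
        self.pending_areas = set()    # areas written only to own_mem

    def write(self, area_id: int, data: bytes) -> None:
        self.own_mem[area_id] = data
        if self.peer_available:
            # Normal operation: the same data goes to both memory areas.
            self.peer_mem[area_id] = data
        else:
            # Peer failed or being saved: write one side only and record
            # the area so it can be copied back after recovery.
            self.pending_areas.add(area_id)


def fail_over(old_mem: dict, save_mem: dict, new_active: Adapter) -> None:
    """Switch systems: preserve the old active memory, then go single-sided."""
    save_mem.update(old_mem)           # keep old contents as log information
    new_active.peer_available = False  # stop mirroring to the old active side


if __name__ == "__main__":
    mem_a, mem_b, save = {}, {}, {}
    adapter_a = Adapter(mem_a, mem_b)
    adapter_b = Adapter(mem_b, mem_a)
    adapter_a.write(1, b"session-1")    # mirrored to both sides
    fail_over(mem_a, save, adapter_b)   # A fails, B becomes active
    adapter_b.write(2, b"session-2")    # written to B only, area 2 recorded
    print(sorted(adapter_b.pending_areas))
```

Recording only the identifiers of areas written during the outage is what lets the later copy-back touch just those areas instead of the whole memory.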