JP2008003691A

JP2008003691A - Process recovery method for computer and check point restart system

Info

Publication number: JP2008003691A
Application number: JP2006170146A
Authority: JP
Inventors: Seiichi Domyo; 誠一道明; Tatsutoshi Sakuraba; 健年櫻庭
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-06-20
Filing date: 2006-06-20
Publication date: 2008-01-10

Abstract

PROBLEM TO BE SOLVED: To provide a checkpoint/restart control technique that recovers appropriately when a failure occurs, regardless of the cause of the failure.

SOLUTION: The process recovery method for a computer holds CP information acquired at a plurality of checkpoints (CPs) and recovers the system using the CP information appropriate to the timing at which a failure recurs. A process running on the computer is monitored, the occurrence of a failure in the process is detected, and the failed process is restarted using the memory state of the process recorded just before a predetermined event that affects the integrity of the process. When the failure recurs within a prescribed period, the process is restarted using a memory state recorded earlier still.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a checkpoint/restart technique for recovering a computer system from a software failure, and in particular to a technique that remains applicable when the cause of the failure is an attack or intrusion.

When a software or hardware failure occurs in a computer, a running program (hereinafter, a process) interrupts its data processing, returns to an earlier execution state, and continues operation. The computer technique that achieves this is generally called a checkpoint/restart control method. Restarting a process differs from rebooting a device, which involves stopping the system or the process. A restart restores state by reloading previously recorded memory and input/output state without stopping the process. The point in time at which this state is recorded is called a checkpoint (CP).

In this specification, the information duplicated at a checkpoint (CP) is called checkpoint information (CP information), and duplicating the CP information and recording it on a recording medium is called acquiring CP information. Reloading CP information to restore state is called recovery processing.

Depending on the cause of the failure, however, there is a considerable possibility that the restarted process will fail again.

In checkpoint/restart control, restarts are attempted using the CP information until the software failure is resolved. The number of attempts, however, does not exceed a preset maximum; alternatively, attempts are repeated within a preset time (see, for example, Patent Document 1).

In the usual checkpoint/restart recovery procedure, even when CP information has been acquired at several CPs, the restart is attempted using the most recently acquired CP information (the latest CP information) (see, for example, Patent Document 2).

Another failure recovery technique skips the input data on which the failure occurred and continues processing from the next data (see, for example, Patent Document 3).

Patent Document 1: Japanese Patent No. 3072048
Patent Document 2: Japanese Patent No. 3135714
Patent Document 3: JP-A-5-233341

In general, maintenance work must be carried out as a countermeasure against failures of a server device. Maintenance work is work that concerns the integrity of the system, such as applying a security patch or updating a configuration definition file. Concretely, it consists of replacing programs and files on disk. For a server operated without stopping in particular, it may include an operation that temporarily suspends the server's data processing, changes the processes or data in memory, and then resumes service.

The purpose of maintenance work is to anticipate threats in the server device, such as a process losing data or data being tampered with, to provide countermeasures in advance, and to avoid failures. However, when a system is large and multiple programs are closely and intricately intertwined, the results and effects of an operation that changes the computer configuration or program contents cannot at present be fully grasped. As a result, the state of a running process may become unstable and a failure may occur.

Types of failure include, for example, a process that does not accept new work, a process that stops abnormally, and a process that executes unspecified processing. The assumed causes of these failures are the following threats: (X) chance (for example, resource exhaustion due to concentrated load), (Y) accident (for example, a defect in the maintenance work itself), and (Z) attack (for example, virus infection).

In general, a computer system has several operation policies and recovery procedures, depending on the goals of the system and the cause of the failure. For example, the recovery procedure differs between the goal of resuming processing and the goal of failure analysis. As an example of an operation policy, when the cause of the failure can be presumed to be threat X or threat Y, resuming processing should take priority. When the cause is threat Z, eliminating the cause should take priority, and if it cannot be eliminated, service may be suspended.

In every case the ultimate goal is to resume service and restore the system, and if the cause can be identified, an appropriate recovery procedure can be selected. The same applies to checkpoint/restart control: when the failure is caused by threat X or threat Y, recovery processing should use CP information that prioritizes resuming service; when it is caused by threat Z, recovery processing should use CP information from which the cause of the failure can be eliminated.

However, if the cause is guessed to be threat X but is actually threat Z, resuming prematurely undermines the safety of the system and the reliability of the service. Conversely, if the cause is guessed to be threat Z but is actually threat X, extending the outage indefinitely increases the economic loss.

In general, a non-stop server is expected to recover from a failure in as short a time as possible (on the order of seconds). Ideally, the recovery procedure would be selected after the cause of the failure has been analyzed. In most cases, however, that analysis is manual work by experts and takes a long time (hours or days), so it cannot meet the requirements of a non-stop server. In practice, therefore, the type and time of the failure are recorded and recovery is then performed using CP information. The cause of the failure is inferred from the outcome of the recovery processing (for example, threat X because the failure recurred, or threat Y because it did not), and the content and procedure of the recovery processing are adjusted accordingly.

A phenomenon peculiar to threat Z is a time lag: the point at which the cause of the failure enters the system and the point at which the failure becomes apparent are not contiguous in time but separated by an interval. Taking a virus as an example, there is a time lag between the point at which the virus enters the system and the point at which it activates and service stops. There is therefore a considerable possibility that CP information is acquired after the virus has entered. If conventional checkpoint control is applied in that case, recovery processing uses the CP information acquired immediately before the failure, which already contains the virus. The cause of the failure (the virus) has not been eliminated, so the failure is likely to recur and service is likely to stop again.

The present invention has been made in view of these circumstances. Its object is to provide a fit-for-purpose checkpoint/restart control technique that recovers the system safely and resumes service quickly even when the cause of a failure is unclear.

CP information acquired at a plurality of CPs is held, and the system is recovered using the CP information appropriate to the timing at which the failure recurs.

Specifically, there is provided a process recovery method for a computer, comprising: a failure detection step of monitoring a process running on the computer and detecting the occurrence of a failure in the process; and a restart step of, when the failure detection step detects a failure, restarting the failed process using the memory state of the process recorded at a timing immediately before the occurrence of a predetermined event in the process that affects the integrity of the process.
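
The selection rule stated in the abstract (restart from the most recent held CP information, and step back to an earlier one when the failure recurs within a prescribed period) might be sketched as follows. This is only an illustration under assumptions: the class name `RecoveryPolicy`, the parameter `recurrence_window`, and the use of plain numbers as failure times are hypothetical and do not come from the specification, and the list of checkpoints is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryPolicy:
    """Chooses which held checkpoint to restart from (illustrative only)."""
    checkpoints: list                 # CP information, oldest first, non-empty
    recurrence_window: float          # assumed "prescribed period", in seconds
    _depth: int = 0                   # how many checkpoints back was last used
    _last_failure: Optional[float] = None

    def select(self, failure_time: float):
        """Return the CP information to use for a failure at failure_time."""
        recurred = (
            self._last_failure is not None
            and failure_time - self._last_failure <= self.recurrence_window
        )
        if recurred:
            # The failure came back quickly: step back one checkpoint,
            # stopping at the oldest one that is held.
            self._depth = min(self._depth + 1, len(self.checkpoints) - 1)
        else:
            # First failure, or a recurrence outside the window: use the newest.
            self._depth = 0
        self._last_failure = failure_time
        return self.checkpoints[-1 - self._depth]
```

With three held checkpoints and a 60-second window, failures at times 100, 130, and 150 would select the newest, the middle, and the oldest checkpoint in turn.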

According to the present invention, a fit-for-purpose checkpoint/restart control technique can be provided that recovers the system safely and resumes service quickly even when the cause of a failure is unclear.

Embodiments to which the present invention is applied are described below. Depending on the system configuration, checkpoint/restart can be implemented in two ways: by updating the memory of the failed process at restart, or by activating a standby process separate from the failed process. Because the following embodiments also cover failures caused by attacks, as described later, the latter approach of activating a standby process is assumed unless otherwise noted.

<< First Embodiment >>
FIG. 1 shows the configuration of the information processing apparatus 500 of this embodiment. As shown in the figure, the information processing apparatus 500 includes a multiprocessor 501, a memory 509, a storage 506, a network interface 507, and an input/output interface 508. The memory 509 includes a volatile memory (DRAM) 504, a phase change memory (PRAM) 505, and the storage (HDD) 506. Here the multiprocessor 501 is assumed to include two processors, 502 and 503; however, any number of CPUs, one or more, may be used.

The HDD 506 is secondary storage that stores software such as the operating system and application programs as files.

The DRAM 504 is primary storage that stores software, such as the operating system and application programs, and data.

The multiprocessor 501 implements the functions described later by loading the software stored in the HDD 506 into the DRAM 504 and executing it. The multiprocessor 501 also temporarily stores data generated during processing in the DRAM 504.

The PRAM 505 holds the data used for checkpoint/restart. This data is the memory state of the process, held at each checkpoint as the information needed for a restart. Hereinafter, the memory-state data of the process held at each checkpoint is called checkpoint information (CP information). When the checkpoint acquisition time is written CPt, the CP information acquired at time CPt is called CPt information. The PRAM 505 need not be a phase change memory; any nonvolatile memory that is difficult to tamper with may be used.

The input/output interface 508 connects to input/output devices such as a keyboard, a mouse, and a display, which convey requests entered by an administrator or an observer to the multiprocessor 501 and output the results processed by the multiprocessor 501.

The network interface 507 is a network card that connects to external computers and systems.

FIG. 2 shows the functional configuration realized when the multiprocessor 501 of this embodiment executes programs. As shown in the figure, the computer system 500 of this embodiment includes an operating system (OS) 1010, an application program (AP) 1000 that runs on the OS 1010, and a library 1020.

In this embodiment, by executing programs, the multiprocessor 501 realizes, in addition to a system call handler 1011 and a system call service routine 1012, a system call monitoring processing unit 1014, a checkpoint/restart processing unit 1013, a library checkpoint/restart processing unit (L checkpoint/restart processing unit) 1023, and a library function monitoring processing unit 1024. The storage HDD 506 records an issue history 1015, which records issued system calls and functions in order of issue; a call list 1016, which records the specific system calls, functions, and system call sequences at which checkpoint information is acquired; a library issue history (L issue history) 1025, which records issued library functions in order of issue; and a library function list (L call list) 1026, which records the specific library functions at which checkpoint information is acquired.

In the example shown in FIG. 2, the OS 1010 and the library 1020 contain the system call monitoring unit 1014 and the library function monitoring unit 1024, respectively, and each operates in cooperation with its own dedicated checkpoint/restart function. Thus, in the example of FIG. 2, the monitoring unit and the checkpoint/restart function are split between the library 1020 and the OS 1010, giving two pairs in the system; however, the system may be configured with only one of the pairs.

The system calls and functions issued by the monitored process 1001, which the multiprocessor 501 executes, are recorded in the issue history 1015 in order of issue, as described above. Library calls are likewise recorded in the L issue history 1025. The following description therefore takes system calls as the example.

The call list 1016 stores predetermined system call sequences, system calls, and functions.

The system call monitoring processing unit 1014 monitors the system calls and functions issued by the process 1001 and, on detecting that a predetermined system call sequence, system call, or function (hereinafter collectively called a system call sequence) has been issued, notifies the checkpoint/restart processing unit 1013. Specifically, it compares the system call sequence issued by the process 1001 and recorded in the issue history 1015 with the system call sequences recorded in the call list 1016, and notifies the checkpoint/restart processing unit 1013 when they match.

On receiving the match notification from the system call monitoring processing unit 1014, the checkpoint/restart processing unit 1013 records the time and the memory state at that moment in the PRAM 505 as the time CP and the CP information, respectively.
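
As a rough illustration of recording a time CP together with its CP information, the sketch below keeps timestamped deep copies of a state object. It is a simplified stand-in under assumptions: `CheckpointStore` and its methods are hypothetical names, an ordinary Python list takes the place of the tamper-resistant PRAM 505, and a dictionary takes the place of a real process memory image.

```python
import copy
import time


class CheckpointStore:
    """In-memory stand-in for the PRAM 505 (an assumption for illustration)."""

    def __init__(self):
        self._records = []  # (cp_time, cp_info) pairs, oldest first

    def record(self, memory_state, now=None):
        """Store a deep copy of memory_state together with its time CP."""
        cp_time = time.time() if now is None else now
        # Deep copy so later changes to the live state do not alter the record.
        self._records.append((cp_time, copy.deepcopy(memory_state)))
        return cp_time

    def latest(self):
        """Return the most recent (cp_time, cp_info) pair."""
        return self._records[-1]

    def before(self, cp_time):
        """Return the newest record taken strictly before cp_time, or None."""
        older = [r for r in self._records if r[0] < cp_time]
        return older[-1] if older else None
```

Keeping every record rather than overwriting the previous one is what makes it possible to fall back to an earlier checkpoint later.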

When a failure occurs, the checkpoint/restart processing unit 1013 performs recovery processing using the CP information recorded in the PRAM 505.

The processing of the system call monitoring processing unit 1014 and the checkpoint/restart processing unit 1013 is described in detail below.

First, the system call sequences monitored by the system call monitoring processing unit 1014 are described separately for the case where the information processing apparatus 500 is online and the case where it is offline. The information processing apparatus 500 is taken to be a UDP server as an example. In the online state, the information processing apparatus 500, as a UDP server, exchanges information with multiple UDP clients and performs processing.

FIG. 3 is a flow illustrating the typical sequence of system calls that the server process 1001 of the information processing apparatus 500 (UDP server) issues for data communication when it performs initialization. In the following, the information processing apparatus 500 is hardware, and the process 1001 is the running server software.

(a) UDP server process (information processing apparatus 500) side
The UDP server process uses the socket system call to create an endpoint (socket) for each client, so that the communication paths to the clients can be told apart (s701).

Next, the UDP server uses the bind system call to assign a name to each created socket (s702).

The UDP server then uses the select system call to monitor the sockets and wait for data from a client (s703).

When a client sends data, the UDP server uses the recvfrom system call to receive the request data through the socket (s704).

The UDP server then uses the sendto system call to send the response data through the socket (s705), and returns to waiting for input from a client (s703).

(b) UDP client side
The UDP client uses the socket system call to create an endpoint (socket) for communicating with a specific UDP server (s711).
The UDP client uses the sendto system call to send request data to the UDP server through the socket (s712).
The UDP client uses the recvfrom system call to receive response data from the UDP server through the socket (s713).

The UDP client repeats s712 and s713 until its processing with the UDP server is complete, then issues the close system call and ends the process (s714).
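
The server and client call sequences above can be tried out with Python's standard `socket` and `select` modules, whose functions wrap the system calls of the same names. The sketch below is only an illustration: the loopback address, the OS-chosen port (port 0), the two-socket default, and the upper-casing "response" are assumptions made for the demonstration, and a single thread plays both roles.

```python
import select
import socket


def make_server_sockets(count=2):
    """Create all sockets (s701), then bind each one (s702)."""
    socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
             for _ in range(count)]
    for s in socks:
        s.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    return socks


def serve_one(socks, timeout=1.0):
    """Wait with select (s703), then recvfrom (s704) and sendto (s705)."""
    ready, _, _ = select.select(socks, [], [], timeout)
    for s in ready:
        data, addr = s.recvfrom(4096)
        s.sendto(data.upper(), addr)  # stand-in for real request handling
    return len(ready)


def demo():
    """Run one request/response exchange over loopback."""
    socks = make_server_sockets()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # s711
    client.settimeout(1.0)
    try:
        client.sendto(b"ping", socks[0].getsockname())          # s712
        serve_one(socks)                                        # server side
        reply, _ = client.recvfrom(4096)                        # s713
    finally:
        client.close()                                          # s714
        for s in socks:
            s.close()
    return reply
```

Creating both sockets before binding either one makes the server side issue its calls in the order socket, socket, bind, bind, select.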

In this embodiment, the memory state immediately before the state of the process 1001 changes significantly is recorded as CP information. When the information processing apparatus 500 is a UDP server, the timing for acquiring CP information is determined by focusing on the typical system call sequence of a UDP server described above. Hereinafter, a system call sequence from which the moment immediately before a significant change in system state can be determined is called a system call sequence that affects the integrity of the system.

First, one system call sequence that affects the integrity of the system is a sequence from which the state immediately before the transition from offline to online can be detected. When the process 1001 starts, the UDP server issues system calls in the order ..., socket, socket, bind, bind, select, recvfrom, .... As described above, when the recvfrom system call is issued, request data has been received from a UDP client through a socket; that is, online use has begun. The moment at which socket, socket, bind, bind have been issued in that order and the select system call is then issued can therefore be regarded as the memory state immediately before going online. Acquiring CP information at this timing preserves the memory state immediately before online use. The CP information acquired as the memory state immediately before going online is called CP0 information, and its time is CP0.

In this embodiment, the system call sequence S0 (a whitelist), socket, socket, bind, bind, select, is recorded in the call list 1016 in advance as the timing for acquiring CP0 information. When socket, socket, bind, bind are issued in that order and select is then issued, the system call monitoring processing unit 1014 notifies the checkpoint/restart processing unit 1013 to acquire CP information. On receiving the notification, the checkpoint/restart processing unit 1013 records the memory state at that moment in the PRAM 505 as CP0 information, together with the time CP0 associated with it.

If, on the other hand, a different system call sequence is issued, for example ..., socket, socket, bind, bind, recvfrom, ..., it differs from the sequence recorded in advance in the HDD 506 as system call sequence S0, so the memory state is not acquired.
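
The comparison of issued calls against the sequence S0 could look roughly like the sketch below. The specification does not say how the comparison is performed, so treating S0 as a contiguous run of the most recent calls is an assumption, and `CallMonitor` and `count_checkpoints` are hypothetical names.

```python
from collections import deque

# Sequence S0 from the text; matching it as a contiguous run is an assumption.
S0 = ("socket", "socket", "bind", "bind", "select")


class CallMonitor:
    """Watches issued call names and notifies when the pattern appears."""

    def __init__(self, pattern, on_match):
        self._pattern = tuple(pattern)
        # Keep only as many recent calls as the pattern is long.
        self._recent = deque(maxlen=len(self._pattern))
        self._on_match = on_match

    def observe(self, call_name):
        """Record one issued call; notify if the recent calls equal the pattern."""
        self._recent.append(call_name)
        if tuple(self._recent) == self._pattern:
            self._on_match(call_name)


def count_checkpoints(trace, pattern=S0):
    """Feed a whole trace through a monitor and count the notifications."""
    hits = []
    monitor = CallMonitor(pattern, hits.append)
    for name in trace:
        monitor.observe(name)
    return len(hits)
```

A trace of socket, socket, bind, bind, select, recvfrom produces one notification (at the select), while socket, socket, bind, bind, recvfrom produces none, mirroring the two cases in the text.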

Next, system call sequences that affect the integrity of the system in the online state are described. Work that affects the integrity of the system while online includes file updates and security patch processing.

FIG. 4 illustrates the case where the maintenance operation of a file update is performed while the information processing apparatus 500 is online.

In normal operation, as shown in FIG. 4, the information processing apparatus 500 (UDP server) proceeds by repeating recvfrom (s801) and sendto (s802). That is, the system call sequence issued is
..., recvfrom, sendto, recvfrom, sendto, recvfrom, ...

When a file update is performed for maintenance, remotely supplied data changes the behavior of the program, and binary files and definition files are updated. A write system call is issued at this point to write to the file. In this case the system calls are issued in the order recvfrom (s803), write (s804), sendto (s805). That is, the system call sequence issued becomes
..., recvfrom, sendto, recvfrom, write, sendto, recvfrom, ...

Because the write system call rewrites a file, it is a system call that changes integrity. To preserve the memory state immediately before processing that affects the integrity of the system, CP information is therefore acquired after the write system call has been issued but before it is executed in the system call service routine 1012.

That is, on detecting that a write system call has been issued, the system call monitoring processing unit 1014 notifies the checkpoint/restart processing unit 1013. On receiving the notification, the checkpoint/restart processing unit 1013 records the CP information at that moment in the PRAM 505, associated with the CP indicating the time.
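
The timing described here, after the call is issued but before its service routine runs, can be pictured as a wrapper placed around the routine. The sketch below is a user-level analogy only, not kernel code: `checkpoint_before`, the `files` dictionary standing in for process and file state, and `do_write` are hypothetical names introduced for illustration.

```python
import copy


def checkpoint_before(call_names, snapshots, get_state):
    """Build a wrapper that snapshots state before a listed call executes."""
    def wrap(name, service_routine):
        def handler(*args, **kwargs):
            if name in call_names:
                # The call has been issued but has not run yet: save state now.
                snapshots.append((name, copy.deepcopy(get_state())))
            return service_routine(*args, **kwargs)
        return handler
    return wrap


# Usage: "files" stands in for the state that a write would change.
files = {"conf": "v1"}
snapshots = []


def do_write(path, data):
    """Stand-in for the service routine behind the write call."""
    files[path] = data


wrap = checkpoint_before({"write"}, snapshots, lambda: files)
write = wrap("write", do_write)
write("conf", "v2")
```

After the wrapped call, `files` holds the new contents while `snapshots` holds the state as it was just before the write took effect.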

The description above took as its example the implementation by the system call monitoring unit 1014 of the OS 1010. The process 1001 may be any software called from the AP 1000. The same applies when library functions are monitored by the library function monitoring processing unit 1024 of the library 1020 (software that collects functions shared by multiple applications), which sits between the AP 1000 and the OS 1010. The implementation by the library function monitoring unit 1024 is described below.

FIG. 5 illustrates the case where security patch processing is performed while the information processing apparatus 500 is online.

There are many kinds of security patch processing. For example, one prescribed maintenance procedure for the process 1001 uses a library containing a software module that improves integrity to modify the contents of the process 1001 or add a function to it, and then unloads the library.

Here, the library is assumed to be delivered through the network interface 507 and stored temporarily as a file in the storage HDD 506.

セキュリティパッチ処理を行う場合の手順は以下のとおりである。 The procedure for performing security patch processing is as follows.

セキュリティパッチ処理を行う指示を受け取ると、情報処理装置５００は、ｄｌｏｐｅｎ関数を用いて、保守用のソフトウェアを内蔵するライブラリを動的にロードする（ｓ９０１）。 Upon receiving an instruction to perform security patch processing, the information processing apparatus 500 uses the dlopen function to dynamically load a library containing the maintenance software (s901).

プロセス１００１は、ｄｌｓｙｍ関数を用いて、ｓ９０１でロードしたライブラリから保守用の関数を取得する（ｓ９０２）。 The process 1001 acquires a maintenance function from the library loaded in s901 using the dlsym function (s902).

保守用の関数を呼び出し、メモリ上のプロセスに対してパッチをあてる。ここでは、パッチプログラムの書き込みにｍｍａｐシステムコールを利用する（ｓ９０３）。 The maintenance function is called to apply the patch to the process in memory. Here, the mmap system call is used to write the patch program (s903).

ｄｌｃｌｏｓｅ関数を用いて、ライブラリをアンロードする。 Unload the library using the dlclose function.

従って、ここでは、関数およびシステムコールは以下の順に発行される。
…、ｄｌｏｐｅｎ、ｄｌｓｙｍ、…、ｍｍａｐ、…、ｄｌｃｌｏｓｅ、…
ライブラリを動的にロードすることにより、メモリの状態が変化する。従って、プロセス１００１がｄｌｏｐｅｎ関数を発行し、ライブラリ内部で処理を実行する前のメモリ状態を、ＣＰ情報として保存する。 Therefore, the functions and system calls are issued here in the following order:
..., dlopen, dlsym, ..., mmap, ..., dlclose, ...
Dynamically loading the library changes the memory state. Therefore, the memory state after the process 1001 issues the dlopen function and before any processing inside the library is executed is saved as CP information.

すなわち、ライブラリ関数監視処理部１０２４はｄｌｏｐｅｎ関数が発行されたことを検出すると、その旨を、Ｌチェックポイント／リスタート処理部１０２３に通知する。Ｌチェックポイント／リスタート処理部１０２３は、通知を受けるとその時点のＣＰ情報および時刻を示すＣＰを対応づけてＰＲＡＭ５０５に記録する。 That is, when the library function monitoring processing unit 1024 detects that the dlopen function has been issued, it notifies the L checkpoint / restart processing unit 1023 to that effect. Upon receiving the notification, the L checkpoint / restart processing unit 1023 records the CP information at that time in the PRAM 505 in association with the CP indicating the time.
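
The notification path described above (a monitor detects a watched call such as write or dlopen and tells the checkpoint unit to record the CP information together with its time) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names `CheckpointRecorder` and `CallMonitor`, the dictionary used as the process state, and the in-memory list standing in for the PRAM 505 are all assumptions made for the example.

```python
import copy
import time


class CheckpointRecorder:
    """Hypothetical stand-in for the checkpoint / restart processing unit."""

    def __init__(self):
        # List of (cp_time, cp_info) pairs, oldest first; stands in for the PRAM.
        self.records = []

    def record(self, process_state, cp_time):
        # Deep-copy so later changes to the live state do not alter the checkpoint.
        self.records.append((cp_time, copy.deepcopy(process_state)))


class CallMonitor:
    """Hypothetical stand-in for the system call / library function monitor."""

    def __init__(self, watched_calls, recorder):
        self.watched_calls = frozenset(watched_calls)
        self.recorder = recorder

    def on_call(self, name, process_state, now=None):
        """Notify the recorder if a watched call is issued; return True if it was."""
        if name not in self.watched_calls:
            return False
        cp_time = time.time() if now is None else now
        self.recorder.record(process_state, cp_time)
        return True
```

Because the state is copied at the moment the call is detected, the stored CP information reflects the memory state before the library (or the write) has had any effect on it.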

以上、保全性を変更するシステムコールあるいは関数が発行された場合に、ＯＳもしくはライブラリにおいて，ＣＰ情報を取得する具体的な手順を説明した。このように、本実施形態では、プロセスの保全性の維持に着目してＣＰ情報を自動的に取得する。 The above has described the specific procedure by which the OS or the library acquires CP information when a system call or function that changes integrity is issued. In this way, the present embodiment acquires CP information automatically, focusing on maintaining the integrity of the process.

ここで、監視する対象のシステムコール列について、オフライン状態とオンライン状態とを分けている理由について説明する。 Here, the reason why the offline state and the online state of the system call sequence to be monitored are separated will be described.

図３で説明したオフライン状態では、初期化処理、例えば、定義ファイルの読み込みなど、定常状態に至るまでの処理手順は厳密に定まっている。そのために、アプリケーション毎にプロセス１００１が発行するシステムコール列は一意に定まる可能性が高い。このため、特定のシステムコール列Ｓ０を予め記録しておき、実際に発行されているシステムコール列と照合し、ＣＰ０情報取得タイミングを決定することが可能である。 In the offline state described with reference to FIG. 3, the processing procedure up to the steady state, such as initialization processing, for example, reading of a definition file, is strictly determined. Therefore, there is a high possibility that the system call sequence issued by the process 1001 for each application is uniquely determined. For this reason, it is possible to record a specific system call sequence S0 in advance and collate it with the actually issued system call sequence to determine the CP0 information acquisition timing.

しかし、図４および図５で説明したオンライン状態では、サーバプログラムは、クライアントプログラムからの要求に応じて、異なる処理を実行する。それに応じて、システムコール列は一意に定まらず、変化する傾向がある。従って、複雑な業務を実現するアプリケーションプログラムでは、多種類のシステムコール列が存在し，たとえば、保守作業に限定した特定のシステムコール列を検出することは難しい。特定のシステムコール列を検出するためには、たとえば、アプリケーションプログラムの動作やプロセスの状態を管理し、システムコールの引数について、業務にあわせて内容を確認する、などの確認処理を必要とする。 However, in the online state described with reference to FIGS. 4 and 5, the server program executes different processing depending on the requests from client programs. Accordingly, the system call sequence is not uniquely determined and tends to vary. An application program that implements complex business tasks therefore produces many kinds of system call sequences, and it is difficult to detect a specific system call sequence, for example one limited to maintenance work. Detecting a specific system call sequence requires confirmation processing, such as managing the operation of the application program and the state of the process and checking the contents of the system call arguments against the business task.

しかし、システムコール列の引数について全てのパターンを事前に抽出できるとは限らず、確認処理は煩雑となる。たとえば、数十から数百程度のつながるシステムコールの発行を履歴として記録し、その一方で，対応するシステムコール列について、正常状態、異常状態、監視要の状態、例外処理，保守作業などの区別をつけたリスト、たとえば、数十から数千を準備する必要がある。また、システムコールが発行される度に、上述の確認処理を実行することは、業務本来のデータ処理を遅延することになる。 However, it is not always possible to extract every pattern of system call sequence arguments in advance, and the confirmation processing becomes cumbersome. For example, the issuance of chains of several tens to several hundreds of system calls must be recorded as a history, and a list of the corresponding system call sequences, for example several tens to several thousands of entries, must be prepared that distinguishes normal states, abnormal states, states requiring monitoring, exception handling, maintenance work, and so on. Moreover, executing this confirmation processing every time a system call is issued delays the data processing that is the actual purpose of the business task.

オンライン状態であっても、システムコール列の監視が望ましい。しかし、上述のようにアプリケーションプログラムの特性によっては、システムコールの正当性について確認するのに、処理時間を要する。このため、本来のデータ処理に遅延が発生する恐れが有る。したがって、システムコール単独の監視に留めておいた方が望ましいといえる。 Even in the online state, monitoring system call sequences would be desirable. However, as described above, depending on the characteristics of the application program, confirming the validity of the system calls takes processing time, which may delay the actual data processing. It is therefore preferable to limit monitoring to individual system calls.

そこで、オンライン状態において、ある程度検出の精度を保ちつつ、実時間で確認するために、システムコール列ではなく、特定のシステムコールまたは関数が発行された場合にＣＰ情報を取得する。 Therefore, in order to confirm in real time while maintaining detection accuracy to some extent in the online state, CP information is acquired when a specific system call or function is issued instead of a system call sequence.

すなわち、オフラインではシステムコール列（システムコールの発行順序）を厳密に監視し、オンラインでは、特定のシステムコールまたは関数が発行されるかどうかを監視する。 That is, in the offline state the system call sequence (the order in which system calls are issued) is monitored strictly, whereas in the online state only whether a specific system call or function is issued is monitored.
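
The contrast between the two monitoring modes can be illustrated with a short sketch. It assumes that the offline check is an exact match of the most recent calls against a pre-recorded sequence S0 and that the online check is a simple membership test; the function names, the sliding-window matching, and the sample call names are illustrative assumptions, not details taken from the patent.

```python
from collections import deque


def make_sequence_detector(expected_sequence):
    """Offline mode: return a feed(call) function that reports True when the
    most recent calls exactly match the pre-recorded sequence S0."""
    expected = tuple(expected_sequence)
    window = deque(maxlen=len(expected))  # keeps only the last len(S0) calls

    def feed(call):
        window.append(call)
        return tuple(window) == expected

    return feed


def is_online_trigger(call, triggers=frozenset({"write", "dlopen"})):
    """Online mode: a single membership test, cheap enough to run per call."""
    return call in triggers
```

The online check costs one set lookup per call and needs no history, which is the trade-off the text describes: less context than sequence matching, but no noticeable delay to the actual data processing.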

次に、本実施形態のチェックポイント／リスタート処理部１０１３が再始動を行う場合、故障要因を区分し、その後の処理を決定するために導入する再故障寿命Ｔと再発故障間隔Ｔ'とについて説明する。 Next, the re-failure life T and the recurrent failure interval T' are described. These are introduced so that, when the checkpoint / restart processing unit 1013 of the present embodiment performs a restart, it can classify the cause of the failure and decide the subsequent processing.

故障が発生した場合、チェックポイント／リスタート処理部１０１３は、故障が発生した時刻を記録するとともに、所定のＣＰｉ情報を用いて再始動し、回復処理を行う。 When a failure occurs, the checkpoint / restart processing unit 1013 records the time when the failure occurs and restarts using predetermined CPi information to perform a recovery process.

第１回目の故障が発生した時刻をｔ１とする。回復処理に用いたＣＰｉ情報を取得した時刻ＣＰｉから、故障が発生した時刻ｔ１までの時間を、再度故障が発生する予測時刻としてＣＰｉ情報の再故障寿命Ｔと呼ぶ。また、回復処理に用いたＣＰｉ情報を取得した時刻ＣＰｉから、回復処理後、再度故障が発生した時刻ｔ２までの時間を、実際に故障が再発した時刻としてＣＰｉ情報の再発故障間隔Ｔ'と呼ぶ。再故障寿命Ｔと再発故障間隔Ｔ' は、それぞれ、チェックポイント／リスタート処理部１０１３が故障発生毎に算出する。 Let t1 be the time at which the first failure occurs. The time from the time CPi, at which the CPi information used for the recovery processing was acquired, to the failure time t1 is called the re-failure life T of the CPi information and serves as the predicted time until the failure occurs again. The time from the time CPi to the time t2 at which the failure occurs again after the recovery processing is called the recurrent failure interval T' of the CPi information and represents the time at which the failure actually recurred. The checkpoint / restart processing unit 1013 calculates both the re-failure life T and the recurrent failure interval T' each time a failure occurs.

チェックポイント／リスタート処理部１０１３は、１回目の故障が発生すると、最近のＣＰ情報を用いて回復処理を行う。次のＣＰ情報を取得する前に２回目の同じ故障が発生した場合、チェックポイント／リスタート処理部１０１３は、回復処理に用いたＣＰ情報の再故障寿命Ｔと再発故障間隔Ｔ'とを比較し、再故障寿命Ｔが再発故障間隔Ｔ'より短い場合は、３回目の同じ故障が発生した場合であっても、同じＣＰ情報を用いて回復処理を続ける。 When the first failure occurs, the checkpoint / restart processing unit 1013 performs recovery processing using the most recent CP information. If the same failure occurs a second time before the next CP information is acquired, the unit compares the re-failure life T of the CP information used for the recovery with the recurrent failure interval T'. If T is shorter than T', it continues to use the same CP information for recovery, even if the same failure occurs a third time.

一方、２回目の同じ故障が発生した際、再発故障間隔Ｔ'が再故障寿命Ｔ以下の場合、チェックポイント／リスタート処理部１０１３は、先に回復処理に用いた最新のＣＰ情報の１回前に保存したＣＰ情報を用いて回復処理を行う。 On the other hand, if the recurrent failure interval T' is less than or equal to the re-failure life T when the same failure occurs a second time, the checkpoint / restart processing unit 1013 performs recovery processing using the CP information saved one generation before the latest CP information that was used for the previous recovery.

次のＣＰ情報を取得する前に３回目の同じ故障が発生した場合、チェックポイント／リスタート処理部１０１３は、同様に回復処理に用いたＣＰ情報の再故障寿命Ｔと再発故障時間間隔Ｔ'とを比較し、Ｔ＜Ｔ'の場合、同じＣＰ情報を用いて回復処理を行う。一方Ｔ≧Ｔ'の場合は、さらに１回前に記録したＣＰ情報を用いて回復処理を行う。 If the same failure occurs a third time before the next CP information is acquired, the checkpoint / restart processing unit 1013 likewise compares the re-failure life T of the CP information used for the recovery with the recurrent failure interval T'. If T < T', it performs recovery using the same CP information. If T ≧ T', it performs recovery using the CP information recorded one generation earlier still.

なお、ＣＰ情報を呼び出して状態を回復することを再始動と呼ぶと定義している。本実施形態の場合、複数のＣＰ情報を保持しているため、再度故障が発生した場合、先の再始動時と同じＣＰ情報を用いる場合と、別のＣＰ情報を用いる場合とがある。ここでは、同じＣＰ情報を用いて回復処理を行うことを再試行と呼び、新たなＣＰ情報を用いて回復処理を行うことを再始動と呼び、区別する。 Note that restoring the state by recalling CP information has been defined as a restart. Because the present embodiment holds multiple pieces of CP information, when a failure occurs again the recovery may use either the same CP information as the previous restart or different CP information. Here the two are distinguished: recovery using the same CP information is called a retry, and recovery using different CP information is called a restart.

以上のように、本実施形態では、チェックポイント／リスタート処理部１０１３は、新たなＣＰ情報を取得する前に同じ故障が再発した場合、再故障寿命Ｔと再発故障間隔Ｔ' とを比較し、Ｔ＜Ｔ'の場合再試行を行い、Ｔ≧Ｔ'の場合は、先に再始動を行ったＣＰ情報の１回前に記録したＣＰ情報を用いて再始動を行う。 As described above, in this embodiment, the checkpoint / restart processing unit 1013 compares the re-failure life T with the recurrent failure interval T ′ when the same failure recurs before acquiring new CP information. When T <T ′, retry is performed. When T ≧ T ′, restart is performed using the CP information recorded one time before the CP information previously restarted.
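
The comparison rule summarized above can be written as a small decision function. This is a sketch under stated assumptions: checkpoints are identified by an integer index with 0 standing for the CP0 information, every earlier checkpoint is assumed to still be stored, and the behaviour once CP0 is reached (staying on CP0) is an assumption, since this passage does not specify it. The function name `choose_recovery` is hypothetical.

```python
def choose_recovery(cp_index, refailure_life, recurrence_interval):
    """Decide how to recover when the same failure recurs.

    cp_index            -- index i of the CPi information used last time
    refailure_life      -- T: time from CPi to the first failure
    recurrence_interval -- T': time from CPi (after recovery) to the recurrence
    Returns (action, index of the CP information to use).
    """
    if refailure_life < recurrence_interval:
        # The failure came back later than predicted: reuse the same checkpoint.
        return ("retry", cp_index)
    if cp_index == 0:
        # Assumption: nothing older than CP0 exists, so CP0 is used again.
        return ("retry", 0)
    # The failure came back at least as fast as predicted: step back one checkpoint.
    return ("restart", cp_index - 1)
```

As a worked example with made-up numbers: if CP5 was taken at time 100 and the first failure occurred at time 130, then T = 30. If the failure recurs 12 time units after resuming from CP5, then T' = 12, T ≧ T' holds, and the function returns a restart from CP4. If it instead recurs after 45 time units, T < T' and the result is a retry from CP5.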

なお、上記おいては、最も近い時点で記録したＣＰ情報が故障発生時の状態に最も近いため、故障発生時に１回前のＣＰ情報を、再故障時には、さらに１回前のＣＰ情報を用いる場合を例にあげて説明したが、用いるＣＰ情報はこれに限られない。予め定めた分だけ遡ったＣＰ情報を用いるよう構成してもよい。 In the above description, because the most recently recorded CP information is closest to the state at the time of the failure, the immediately preceding CP information is used when a failure occurs and the CP information one generation earlier still is used when the failure recurs. However, the CP information used is not limited to this. The system may instead be configured to go back by a predetermined number of generations.

なお、本実施形態では、故障が発生する毎に、チェックポイント／リスタート処理部１０１３が情報処理装置５００のＯＳ１０１０或いはプロセス１００１が発行するエラーコードをＰＲＡＭ５０５等のメモリに時刻とともに記録しておく。この記録を用いて、チェックポイント／リスタート処理部１０１３は、２回目以降の故障について、同じ故障の再発であるか、異なる故障の発生かを判断する。 In this embodiment, every time a failure occurs, the checkpoint / restart processing unit 1013 records an error code issued by the OS 1010 or the process 1001 of the information processing apparatus 500 in a memory such as the PRAM 505 along with the time. Using this record, the checkpoint / restart processing unit 1013 determines whether the same failure or a different failure has occurred for the second and subsequent failures.
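
One way to picture the failure record described above is a log of (time, error code) pairs that is consulted on each new failure. The sketch below assumes that "the same failure" means the error code equals that of the immediately preceding failure; the patent text only says the record is used for the judgment, so this matching rule and the name `FailureLog` are assumptions.

```python
class FailureLog:
    """Hypothetical record of failures, standing in for the log kept in the PRAM."""

    def __init__(self):
        self.entries = []  # list of (timestamp, error_code), oldest first

    def record(self, timestamp, error_code):
        """Store the failure; return True if it repeats the previous error code."""
        is_recurrence = bool(self.entries) and self.entries[-1][1] == error_code
        self.entries.append((timestamp, error_code))
        return is_recurrence
```

Only when `record` returns True would the comparison between the re-failure life T and the recurrent failure interval T' apply; a different error code would be treated as a new first failure.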

次に、図６を用いて、チェックポイント／リスタート処理部１０１３が再試行または再始動を行う処理について、具体例をあげて説明する。 Next, the process in which the checkpoint / restart processing unit 1013 performs a retry or restart will be described using a specific example with reference to FIG.

図６は、ＣＰ情報の再故障寿命ＴとＣＰ情報の再発故障間隔Ｔ'との関係によって、チェックポイント／リスタート処理部１０１３が再試行または再始動を行う様子を説明するための図である。 FIG. 6 is a diagram for explaining how the checkpoint / restart processing unit 1013 performs a retry or restart according to the relationship between the re-failure life T of the CP information and the recurrent failure interval T ′ of the CP information. .

なお、本図において、オフライン状態においてオンライン状態の直前に取得するＣＰ情報をＣＰ０情報、その時刻をＣＰ０と呼び、オンライン状態でＣＰ情報を取得するタイミングを、時刻順にＣＰ１、ＣＰ２、…ＣＰｎ−１、ＣＰｎ、ＣＰｎ＋１と呼び、それぞれの時刻に対応づけてＰＲＡＭ５０５に記録されるＣＰ情報を、それぞれ、ＣＰ１情報、ＣＰ２情報、…、ＣＰｎ−１情報、ＣＰｎ情報、ＣＰｎ＋１情報と呼ぶ。また、以下、同じ故障は同じ文字で表し、何回目の故障であるかをダッシュで示す。図において、Ｐ'、Ｐ''は、再始動、再試行開始点を示す。 In this figure, the CP information acquired immediately before the online state in the offline state is referred to as CP0 information, the time is referred to as CP0, and the timing at which the CP information is acquired in the online state is defined as CP1, CP2,. , CPn, CPn + 1, and the CP information recorded in the PRAM 505 in association with each time is called CP1 information, CP2 information,..., CPn-1 information, CPn information, CPn + 1 information, respectively. Hereinafter, the same failure is represented by the same letter, and the number of failures is indicated by a dash. In the figure, P ′ and P ″ indicate restart and retry start points.

図６（ａ）は、ＣＰｎ情報の再故障寿命ＴｎとＣＰｎ情報の再発故障間隔Ｔｎ'との関係がＴｎ＜Ｔｎ'であり、ＣＰｎ情報を用いて再試行する場合の例である。偶発による故障は、このような発生パターンとなる可能性が高い。これは従来方式での手順に相当する。 FIG. 6A shows an example in which the re-failure life Tn and the recurrent failure interval Tn' of the CPn information satisfy Tn < Tn', so a retry is performed using the CPn information. Random (chance) failures are likely to follow this pattern. This corresponds to the procedure of the conventional method.

ＣＰｎ時点でＣＰｎ情報を記録した後、次のＣＰｎ＋１情報を記録する前の時刻ｔｘに１回目の故障Ｘが発生すると、直前に記録したＣＰｎ情報を用いて再始動を行う。詳しくは、故障Ｘが発生すると、チェックポイント／リスタート処理部１０１３が、直前に記録したＣＰｎ情報を再始動のために使用するＣＰｎ情報と決定し、ＣＰｎ情報を用いて再始動を行う。 After the CPn information is recorded at the time CPn, when the first failure X occurs at the time tx before the next CPn + 1 information is recorded, restart is performed using the CPn information recorded immediately before. Specifically, when a failure X occurs, the checkpoint / restart processing unit 1013 determines the CPn information recorded immediately before as CPn information to be used for restarting, and restarts using the CPn information.

ＣＰｎの再故障寿命Ｔｎは直前のＣＰｎ情報を取得した時刻ＣＰｎから故障Ｘ発生時ｔｘまでの時間である。一方、回復処理を行った後、次のＣＰ情報であるＣＰｎ＋１情報を取得する時刻ＣＰｎ＋１前に再度故障Ｘ'（２回目の故障Ｘ'）が発生した場合、ＣＰｎの再故障間隔Ｔｎ'は、ＣＰｎ情報を用いて再始動後、実際に再度故障Ｘ'が発生するまでの時間ｔｘ'である。 The re-failure life Tn of CPn is the time from the time CPn at which the previous CPn information is acquired to the time tx when the failure X occurs. On the other hand, when the failure X ′ (second failure X ′) occurs again after the recovery process and before the time CPn + 1 for obtaining the next CP information CPn + 1 information, the re-failure interval Tn ′ of CPn is: This is the time tx ′ until the failure X ′ actually occurs again after restarting using the CPn information.

ここでは、ＣＰｎの再故障間隔Ｔｎ'がＣＰｎの再故障寿命Ｔｎより長く、Ｔｎ＜Ｔｎ' であるため、チェックポイント／リスタート処理部１０１３は、ＣＰｎ情報を用いて再試行を行う。 Here, since the re-failure interval Tn ′ of CPn is longer than the re-failure life Tn of CPn and Tn <Tn ′, the checkpoint / restart processing unit 1013 performs retry using the CPn information.

なお、このとき、故障Ｘ''発生時前にＣＰｎ＋１情報を取得していた場合は、故障Ｘ''発生時はＣＰｎ＋１情報を用いて回復処理を行う。 At this time, if the CPn + 1 information is acquired before the failure X ″ occurs, the recovery process is performed using the CPn + 1 information when the failure X ″ occurs.

図６（ｂ）は、ＣＰｎ情報の再故障寿命ＴｎとＣＰｎ情報の再発故障間隔Ｔｎ'との関係がＴｎ≧Ｔｎ'であり、ＣＰｎ情報の１回前に記録したＣＰｎ−１情報を用いて再始動を行うと、ＣＰｎ−１情報の再故障寿命Ｔｎ−１とＣＰｎ−１情報の再発故障間隔Ｔｎ−１'との関係はＴｎ−１＜Ｔｎ−１'となり、その後は、ＣＰｎ−１を用いて再試行を繰り返す場合の例である。事故による故障は、このような発生パターンとなる可能性が高い。 FIG. 6B shows that the relationship between the re-failure life Tn of the CPn information and the recurrent failure interval Tn ′ of the CPn information is Tn ≧ Tn ′, and the CPn-1 information recorded one time before the CPn information is used. When restarting, the relationship between the re-failure life Tn-1 of the CPn-1 information and the recurrent failure interval Tn-1 'of the CPn-1 information becomes Tn-1 <Tn-1', and thereafter, CPn-1 It is an example in the case of repeating the retry using. A failure due to an accident is likely to have such an occurrence pattern.

ＣＰｎ時点でＣＰｎ情報を取得した後、次のＣＰｎ＋１情報を取得する前に１回目の故障Ｙが発生すると、チェックポイント／リスタート処理部１０１３は、図６（ａ）で説明した手順で直前に記録したＣＰｎ情報を用いて再始動を行う。ＣＰｎの再故障寿命Ｔｎは、直前のＣＰ情報を取得したＣＰｎから故障Ｙ発生時ｔｙまでの時間であり、再故障間隔Ｔｎ'は、ＣＰｎ情報を用いて再始動後、２回目の故障Ｙ'が発生するまでの時間ｔｙ'である。ここでは、Ｔｎ≧Ｔｎ'である。 If the first failure Y occurs after the CPn information is acquired at the time CPn and before the next CPn + 1 information is acquired, the checkpoint / restart processing unit 1013 immediately before the procedure described with reference to FIG. Restart using the recorded CPn information. The re-failure life Tn of CPn is the time from CPn when the previous CP information was acquired to the time of failure Y occurrence ty, and the re-failure interval Tn ′ is the second failure Y ′ after restart using the CPn information. This is a time ty ′ until occurrence of. Here, Tn ≧ Tn ′.

ＣＰｎ−１の再故障寿命Ｔｎ−１はＣＰｎ−１情報を取得した時刻ＣＰｎ−１から故障Ｙ発生時ｔｙまでの時間である。一方、回復処理を行った後、次のＣＰ情報であるＣＰｎ＋１情報を取得する時刻ＣＰｎ＋１前に再度故障Ｙ''（３回目の故障Ｙ''）が発生した場合、ＣＰｎ−１の再故障間隔Ｔｎ−１'は、ＣＰｎ−１情報を用いて再始動後、実際に再度故障Ｙ''が発生するまでの時間ｔｙ''である。ここでは、Ｔｎ−１＜Ｔｎ−１'であるため、チェックポイント／リスタート処理部１０１３は、ＣＰｎ−１情報を用いて再試行を行う。 The re-failure life Tn-1 of CPn-1 is the time from the time CPn-1, at which the CPn-1 information was acquired, to the time ty at which the failure Y occurred. If, after the recovery processing, the failure occurs again (the third failure, Y'') before the time CPn+1 at which the next CP information, the CPn+1 information, is acquired, the recurrent failure interval Tn-1' of CPn-1 is the time ty'' from the restart using the CPn-1 information until the failure Y'' actually occurs. Here Tn-1 < Tn-1', so the checkpoint / restart processing unit 1013 performs a retry using the CPn-1 information.

図６（ｃ）は、ＣＰ情報の再故障寿命ＴとＣＰ情報の再発故障間隔Ｔ'との関係がＴ≧Ｔ'であることが繰り返される場合の例である。同一の故障が発生する毎に更に１回前に記録したＣＰ情報を用いて再始動を行う。最終的にはネットワークに接続する直前のメモリ状態を記録したＣＰ０情報を用いて再始動を行う。例えば、ウィルスによる攻撃により発生した故障の場合、ウィルスが取り込まれた時点と故障が発症する時点との間にタイムラグがあることが多い。このような場合は、図６（ｃ）に示すような故障発生パターンとなる可能性が高い。 FIG. 6C shows an example in which the relationship T ≧ T' between the re-failure life T and the recurrent failure interval T' of the CP information holds repeatedly. Each time the same failure occurs, a restart is performed using the CP information recorded one generation earlier. Ultimately, a restart is performed using the CP0 information, which records the memory state immediately before connection to the network. For example, in the case of a failure caused by a virus attack, there is often a time lag between the moment the virus is taken in and the moment the failure manifests. Such cases are likely to follow the failure pattern shown in FIG. 6C.

ＣＰｎ時点でＣＰｎ情報を取得した後、次のＣＰｎ＋１情報を取得する前に１回目の故障Ｚが発生すると、チェックポイント／リスタート処理部１０１３は、図６（ａ）で説明した手順で直前に記録したＣＰｎ情報を用いて再始動を行う。ＣＰｎの再故障寿命Ｔｎは、直前のＣＰ情報を取得したＣＰｎから故障Ｚ発生時ｔｚまでの時間であり、再故障間隔Ｔｎ'は、ＣＰｎ情報を用いて再始動後、２回目の故障Ｚ'が発生するまでの時間ｔｚ'である。ここでは、Ｔｎ≧Ｔｎ'である。 If the first failure Z occurs after the CPn information is acquired at the time CPn and before the next CPn + 1 information is acquired, the checkpoint / restart processing unit 1013 immediately follows the procedure described in FIG. Restart using the recorded CPn information. The re-failure life Tn of CPn is the time from CPn when the previous CP information was acquired to the time tz when the failure Z occurred, and the re-failure interval Tn ′ is the second failure Z ′ after restart using the CPn information. Is the time tz ′ until occurrence of. Here, Tn ≧ Tn ′.

ＣＰｎ−１の再故障寿命Ｔｎ−１はＣＰｎ−１情報を取得した時刻ＣＰｎ−１から故障Ｚ発生時ｔｚまでの時間である。一方、回復処理を行った後、次のＣＰ情報であるＣＰｎ＋１情報を取得する時刻ＣＰｎ＋１前に再度故障Ｚ''（３回目の故障Ｚ''）が発生した場合、ＣＰｎ−１の再発故障間隔Ｔｎ−１'は、ＣＰｎ−１情報を用いて再始動後、実際に再度故障Ｚ''が発生するまでの時間ｔｚ''である。ここでも、Ｔｎ−１≧Ｔｎ−１'であるため、チェックポイント／リスタート処理部１０１３は、更に１回前に記録したＣＰｎー２情報に戻る。 The re-failure life Tn-1 of CPn-1 is the time from the time CPn-1 when the CPn-1 information is acquired to the time tz when the failure Z occurs. On the other hand, if the failure Z ″ (third failure Z ″) occurs again after the recovery process and before the time CPn + 1 for obtaining the next CP information, CPn + 1 information, the recurrent failure interval of CPn−1 Tn-1 ′ is a time tz ″ until the failure Z ″ actually occurs again after restarting using the CPn-1 information. Again, since Tn−1 ≧ Tn−1 ′, the checkpoint / restart processing unit 1013 returns to the CPn−2 information recorded one more time before.

チェックポイント／リスタート処理部１０１３は、故障発生毎に再故障寿命Ｔと再発故障間隔Ｔ'を計算し、比較し、Ｔ≧Ｔ'である限り、更に１回前のＣＰ情報を用いて再始動を行うことを、用いるＣＰ情報がＣＰ０情報となるまで繰り返す。 The checkpoint / restart processing unit 1013 calculates and compares the re-failure life T and the recurrence failure interval T ′ every time a failure occurs, and re-uses the CP information from the previous one as long as T ≧ T ′. The starting is repeated until the CP information to be used becomes the CP0 information.
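
The repeated step-back of FIG. 6C can be traced with a short loop. The sketch assumes one (T, T') pair is observed per recurrence, that all checkpoints back to CP0 remain available, and that the process stays on CP0 once it is reached; the function name `rollback_chain` and the sample values are invented for illustration.

```python
def rollback_chain(start_index, observations):
    """Trace which checkpoint index is used after each recurrence.

    start_index  -- index n of the CPn information used after the first failure
    observations -- iterable of (T, T') pairs, one per recurrence of the failure
    Returns the list of checkpoint indices used, starting with start_index.
    """
    index = start_index
    used = [index]
    for refailure_life, recurrence_interval in observations:
        # T >= T' means the failure recurred too quickly: go back one checkpoint,
        # but never past CP0 (index 0).
        if refailure_life >= recurrence_interval and index > 0:
            index -= 1
        used.append(index)
    return used
```

With a latent cause such as the virus example in the text, every recurrence satisfies T ≧ T', so the chain walks back 3, 2, 1, 0 and then stays at CP0, the state recorded before the machine was connected to the network.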

なお、図６（ｃ）のケースにおいて、故障が再発する度に１つずつ遡ってＣＰ情報を用いて再始動しているが、本構成に限られない。例えば、最初の故障時にＣＰｎ情報を用いて再始動した後、故障が再発した場合、ＣＰ０情報を用いて再始動するよう構成するなど、ＣＰ０情報を用いる前に順次遡って用いるＣＰ情報の回数を予め定めておいてもよい。 In the case of FIG. 6C, the restart goes back one piece of CP information each time the failure recurs, but the configuration is not limited to this. The number of times to step back through CP information before using the CP0 information may be determined in advance; for example, after restarting with the CPn information at the first failure, the system may restart with the CP0 information if the failure recurs.

次に、上記チェックポイント／リスタート処理部１０１３が故障発生毎に記録する時刻について説明する。情報処理装置５００が単一のサービス、単一のプロセスで実行されている場合は、問題とならないが、複数のサービス、複数のプロセスが並行して実行されている場合、処理時間が計測中のプロセス１００１以外のプロセス等に影響を受けることがある。 Next, the time recorded by the checkpoint / restart processing unit 1013 every time a failure occurs will be described. When the information processing apparatus 500 is executed with a single service and a single process, there is no problem, but when multiple services and multiple processes are executed in parallel, the processing time is being measured. It may be affected by processes other than the process 1001.

一般にシステムの時間として用いられるものには、経過時間４０１、ユーザ時間４０２、システム時間４０６がある。図７は、これらの経過時間４０１、ユーザ時間４０２、システム時間４０６の３種類の時間についてモデルを用いて説明するための図である。各ＣＰｎ情報取得時に、これらの時間がＣＰｎとしてＰＲＡＭ５０５に記録される。 Generally used as system time include elapsed time 401, user time 402, and system time 406. FIG. 7 is a diagram for describing these three types of time 401, user time 402, and system time 406 using a model. When each CPn information is acquired, these times are recorded in the PRAM 505 as CPn.

なお、経過時間４０１は、情報処理装置５００を外部から見た場合の時間に相当する。ユーザ時間４０２は、プロセス１００１がユーザモード（ＯＳ１０１０の一部として機能分類されている保護されたサブシステムとアプリケーションプログラム１０００とが動作する状態。）で動作した時間であり、システム時間４０６は、プロセス１００１がカーネルモード（スケジューラ，メモリ管理，プロセス管理などのプリミティブなＯＳ機能のみが動作する状態。）で動作した時間である。 The elapsed time 401 corresponds to the time when the information processing apparatus 500 is viewed from the outside. The user time 402 is a time when the process 1001 is operated in the user mode (a state where the protected subsystem and the application program 1000 that are classified as a part of the OS 1010 operate), and the system time 406 is a process time. Reference numeral 1001 denotes a time during which the system operates in a kernel mode (a state in which only a primitive OS function such as a scheduler, memory management, and process management operates).

また、本図において累積回数４０５、４０７は、システムコールの発行回数の累積値のことである。マルチプロセッサ５０１での時刻計測は、必ずしも正確とはいえない場合があるため、システムコールの発行回数の累積値（システムコールの累積回数４０５、４０７）を利用することがより簡便に正確な処理を行うことができる場合がある。これらは後述する第２の実施形態で用いる。第２の実施形態では、システムコールの累積回数４０５、４０７を用いて故障時の回復処理手順を決定する。 In this figure, the cumulative counts 405 and 407 are cumulative values of the number of system calls issued. Because time measurement on the multiprocessor 501 is not always accurate, using the cumulative number of system calls issued (cumulative counts 405 and 407) can in some cases allow accurate processing more simply. These counts are used in the second embodiment described later, which determines the recovery procedure at failure time using the cumulative system call counts 405 and 407.

再故障寿命Ｔと再発故障間隔Ｔ'との算出にいずれの時間を用いるかは、攻撃の型に応じて使い分けることが望ましい。例えば、無限ループを実行し、ＣＰＵの実行時間を浪費する型の攻撃であれば、ユーザ時間４０２を用いる。タイムアウトとなる入出力を行うシステムコールの発行を繰り返すことで、ＣＰＵの実行時間を浪費する型の攻撃であれば、システム時間を、それぞれ選択することで攻撃にあわせた精度の高い時間測定が可能となる。 Which of these times to use for calculating the re-failure life T and the recurrent failure interval T' is preferably chosen according to the type of attack. For example, for an attack that wastes CPU execution time by running an infinite loop, the user time 402 is used. For an attack that wastes CPU execution time by repeatedly issuing system calls whose input/output times out, the system time 406 is used. Selecting the time in this way enables accurate time measurement suited to the attack.
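
As a rough illustration of the three kinds of time, Python's standard `os.times()` reports user time, system time, and elapsed real time for the current process on Unix-like systems. The sketch below samples all three at a checkpoint and picks one per attack type as the text suggests. The attack-type labels, the helper names, and the fallback to elapsed time for unknown types are assumptions for the example only.

```python
import os

# Mapping from attack type to the clock the text recommends (labels are hypothetical).
CLOCK_FOR_ATTACK = {
    "infinite_loop": "user",   # CPU burned in user mode
    "io_timeout": "system",    # CPU burned issuing system calls
}


def sample_times():
    """Sample the three clocks for the current process."""
    t = os.times()
    return {"elapsed": t.elapsed, "user": t.user, "system": t.system}


def clock_for_attack(attack_type):
    """Pick the clock to use; fall back to elapsed time for unknown types."""
    return CLOCK_FOR_ATTACK.get(attack_type, "elapsed")


def interval(start_sample, end_sample, clock):
    """Time between two samples on the chosen clock, e.g. T or T'."""
    return end_sample[clock] - start_sample[clock]
```

A caller would take one sample when the CP information is acquired and another when the failure is detected, then compute T or T' with `interval` on the clock returned by `clock_for_attack`.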

次に、本実施形態のチェックポイント／リスタート処理部１０１３が、所定のプロセス１００１を監視しながらＣＰ情報を取得し、故障が発生した場合に取得したＣＰ情報を用いて回復処理を行う処理フローについて説明する。 Next, the processing flow is described in which the checkpoint / restart processing unit 1013 of the present embodiment acquires CP information while monitoring a predetermined process 1001 and, when a failure occurs, performs recovery processing using the acquired CP information.

図８は、チェックポイント／リスタート処理部１０１３がシステムコール監視処理部１０１４からの通知に基づいてＣＰ情報を取得するＣＰ情報取得処理の処理フローである。 FIG. 8 is a processing flow of CP information acquisition processing in which the checkpoint / restart processing unit 1013 acquires CP information based on a notification from the system call monitoring processing unit 1014.

本実施形態では、オンライン状態でＰＲＡＭ５０５に保存するＣＰ情報の最大保存数を世代数ｍと呼ぶ。ＣＰ０情報以外のＣＰ情報は、新規に世代数ｍを越えて新規に取得した場合、古いものから順に削除される。例えば、世代数ｍを２と設定した場合、最新のものとその１回前の計２回分のＣＰ情報を保存する。なお、保存するＣＰ情報の数の決め方は本方法に限られない。例えば、保存先のＰＲＡＭ５０５の最大容量まで記録し、その後は古いものから順に上書きすることにより消去するよう構成してもよい。 In the present embodiment, the maximum number of pieces of CP information stored in the PRAM 505 in the online state is called the generation number m. When newly acquired CP information would exceed the generation number m, CP information other than the CP0 information is deleted starting from the oldest. For example, if the generation number m is set to 2, two pieces of CP information are kept: the latest one and the one before it. The way of deciding how many pieces of CP information to keep is not limited to this method. For example, CP information may be recorded up to the maximum capacity of the destination PRAM 505 and thereafter erased by overwriting from the oldest.

また、オンライン状態でＣＰ情報を取得した回数を累計数ｎと呼ぶ。以下、累計数ｎを用いて、ＣＰ情報を取得する時刻をＣＰｎ、ＣＰｎにおいて取得したＣＰ情報をＣＰｎ情報と表す。 The number of times CP information is acquired in the online state is referred to as a cumulative number n. Hereinafter, using the cumulative number n, the time at which CP information is acquired is represented as CPn, and the CP information acquired at CPn is represented as CPn information.

チェックポイント／リスタート処理部１０１３は、プロセス１００１が処理を開始すると、予め管理者等により設定されたＣＰを保存する世代数ｍを参照する（ｓ６０１）。 When the process 1001 starts processing, the checkpoint / restart processing unit 1013 refers to the generation number m for storing the CP set in advance by the administrator or the like (s601).

チェックポイント／リスタート処理部１０１３は、保存するＣＰ情報数をカウントするカウンタをｎ（累計数と呼ぶ）とし、累計数ｎに初期値０を設定する（ｓ６０２）。 The checkpoint / restart processing unit 1013 sets a counter for counting the number of CP information to be stored as n (referred to as a cumulative number), and sets an initial value 0 to the cumulative number n (s602).

システムコール監視処理部１０１４は、プロセス１００１が発行するシステムコール列Ｓを監視する（Ｓ６０３）。システム監視処理部１０１４は、情報処理装置５００がオフライン状態の場合、システムコール列Ｓ０の発行の有無を監視する。監視中にコールリスト１０１６に保存されているシステムコール列Ｓ０が発生したことを検出すると（ｓ６０４）、システムコール監視処理部１０１４はチェックポイント／リスタート処理部１０１３に検出を通知する。通知を受けたチェックポイント／リスタート処理部１０１３は、通知を受けた時刻をＣＰ０とし、ＣＰ０時点のプロセス１００１のメモリ状態をＤＲＡＭ５０４上で複製することによりＣＰ情報を取得し、ＣＰ０情報とする（ｓ６０５）。 The system call monitoring processing unit 1014 monitors the system call sequence S issued by the process 1001 (s603). While the information processing apparatus 500 is in the offline state, the system call monitoring processing unit 1014 monitors whether the system call sequence S0 is issued. When it detects during monitoring that the system call sequence S0 stored in the call list 1016 has occurred (s604), it notifies the checkpoint / restart processing unit 1013 of the detection. Upon receiving the notification, the checkpoint / restart processing unit 1013 takes the time of the notification as CP0, acquires CP information by duplicating on the DRAM 504 the memory state of the process 1001 at the time CP0, and treats it as the CP0 information (s605).

チェックポイント／リスタート処理部１０１３は、時刻ＣＰ０を時刻ｔ０とし、Ｓ６０４で取得したＣＰ０情報をＤＲＡＭ５０４からＰＲＡＭ５０５に転送し、時刻ＣＰ０と対応づけてＣＰ０情報をＰＲＡＭ５０５に記録する（ｓ６０６）。 The checkpoint / restart processing unit 1013 sets the time CP0 as the time t0, transfers the CP0 information acquired in S604 from the DRAM 504 to the PRAM 505, and records the CP0 information in the PRAM 505 in association with the time CP0 (s606).

情報処理装置５００がネットワークインタフェース５０７を介して他の装置やシステムと接続されたことを検出すると（ｓ６０７）、システムコール監視処理部１０１４は、プロセス１００１のシステムコール列Ｓの監視をオンライン状態の監視に切り替える。 When it is detected that the information processing apparatus 500 is connected to another apparatus or system via the network interface 507 (s607), the system call monitoring processing unit 1014 monitors the system call sequence S of the process 1001 in the online state. Switch to.

オンライン状態になると、システムコール監視処理部１０１４は、引き続きプロセス１００１が発行するシステムコール列Ｓを監視する（ｓ６０８）。ここでは、システムコール監視処理部１０１４は、上述のｗｒｉｔｅシステムコールまたは関数ｄｌｏｐｅｎの発行の有無を監視する。なお、オンライン状態になると、チェックポイント／リスタート処理部１０１３は故障の発生を監視する。そして、故障を検出した場合（ｓ６０９）は、後述する図９のＣＰ回復処理を行う。 Once online, the system call monitoring processing unit 1014 continues to monitor the system call sequence S issued by the process 1001 (s608). Here, it monitors whether the write system call or the dlopen function described above is issued. In the online state, the checkpoint / restart processing unit 1013 also monitors for the occurrence of a failure. When a failure is detected (s609), the CP recovery processing of FIG. 9, described later, is performed.

システムコール監視処理部１０１４が、システムコール列Ｓ監視中にオンライン状態における予め定めたシステムコールまたは関数を検出するとチェックポイント／リスタート処理部１０１３に通知する（ｓ６１０）。 When the system call monitoring processing unit 1014 detects a predetermined system call or function in the online state during monitoring of the system call sequence S, it notifies the checkpoint / restart processing unit 1013 (s610).

チェックポイント／リスタート処理部１０１３は、累計数ｎを１インクリメントし（ｓ６１１）、そして、検出時刻ＣＰｎと、ＣＰｎ時点のメモリの状態をＣＰｎ情報としてｓ６０５と同様に取得する（ｓ６１２）。 The checkpoint / restart processing unit 1013 increments the cumulative number n by 1 (s611), and acquires the detection time CPn and the memory state at the time of CPn as CPn information in the same manner as s605 (s612).

次に、チェックポイント／リスタート処理部１０１３は、累計数ｎと世代数ｍとを比較する（ｓ６１３）。累計数ｎが世代数より大きい場合、既に世代数ｍ分ＰＲＡＭ５０５に記録されている。この場合、最も古いＣＰ情報（ここでは、ＣＰｎ−ｍ情報）をＰＲＡＭ５０５から削除してから新たに取得したＣＰｎ情報をＰＲＡＭ５０５に記録する（ｓ６１４、ｓ６１５）。 Next, the checkpoint / restart processing unit 1013 compares the cumulative number n with the generation number m (s613). If the cumulative number n is larger than the generation number m, m generations of CP information are already recorded in the PRAM 505. In this case, the oldest CP information (here, the CPn-m information) is deleted from the PRAM 505 before the newly acquired CPn information is recorded in the PRAM 505 (s614, s615).

具体的には、Ｓ６１３における比較の結果、累計数ｎが世代数ｍより大きい場合、ＰＲＡＭ５０５における最も古いＣＰ情報であるＣＰｎ−ｍ情報をＰＲＡＭ５０５から削除する。その後、新たに取得したＣＰｎ情報をＰＲＡＭ５０５に記録する。一方、ｓ６１３における比較の結果、累計数ｎが世代数ｍ以下の場合、Ｓ６１４の処理は行わず、新たに取得したＣＰｎ情報をＰＲＡＭ５０５に記録する（ｓ６１４、ｓ６１５）。 Specifically, when the cumulative number n is larger than the generation number m as a result of the comparison in S613, the CPn-m information that is the oldest CP information in the PRAM 505 is deleted from the PRAM 505. Thereafter, the newly acquired CPn information is recorded in the PRAM 505. On the other hand, as a result of the comparison in s613, when the cumulative number n is less than or equal to the generation number m, the process of S614 is not performed, and the newly acquired CPn information is recorded in the PRAM 505 (s614, s615).
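
The generation management in s611 to s615 can be sketched as follows. This is a minimal illustration under stated assumptions: the class `CheckpointStore`, its method names, and the use of a dictionary in place of the PRAM 505 are hypothetical, and `record_initial` stands in for the earlier acquisition of the CP0 information, which is not described in this passage.

```python
from typing import Dict, Tuple


class CheckpointStore:
    """Holds CP information keyed by its index, with m generations retained."""

    def __init__(self, generations: int) -> None:
        self.m = generations                      # generation number m
        self.n = 0                                # cumulative number n
        self.records: Dict[int, Tuple[float, bytes]] = {}

    def record_initial(self, time: float, memory: bytes) -> None:
        # Assumed stand-in for acquiring CP0 information before going online.
        self.records[0] = (time, memory)

    def record_next(self, time: float, memory: bytes) -> int:
        self.n += 1                               # s611: increment n
        if self.n > self.m:                       # s613: compare n with m
            # s614: delete the CP(n-m) information
            self.records.pop(self.n - self.m, None)
        self.records[self.n] = (time, memory)     # s615: record CPn
        return self.n


if __name__ == "__main__":
    store = CheckpointStore(generations=2)
    store.record_initial(0.0, b"cp0")
    for t in (1.0, 2.0, 3.0, 4.0):
        store.record_next(t, b"state")
    print(sorted(store.records))
```

Applied literally, the rule "delete CPn-m when n exceeds m" never removes index 0, so in this sketch the CP0 information survives alongside the m most recent generations. Whether the original design intends that is not stated in this passage; it is simply what the stated rule produces here.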

Thereafter, the system call monitoring processing unit 1014 returns to the state of monitoring the system call sequence (s608).

次に、図８に示すＣＰ情報取得処理において、チェックポイント／リスタート処理部１０１３が故障を検出した際のＣＰ回復処理について説明する。図９は、ＣＰ回復処理の処理フローである。 Next, CP recovery processing when the checkpoint / restart processing unit 1013 detects a failure in the CP information acquisition processing shown in FIG. 8 will be described. FIG. 9 is a processing flow of CP recovery processing.

チェックポイント／リスタート処理部１０１３は、故障を検出すると、故障発生時刻（故障を検出した時刻）ｔｘをＰＲＡＭ５０５に記録する（ｓ１０１）。そして、情報処理装置５００をオフラインにする（回線切断処理）（ｓ１０２）。 When detecting the failure, the checkpoint / restart processing unit 1013 records the failure occurrence time (time when the failure is detected) tx in the PRAM 505 (s101). Then, the information processing apparatus 500 is set offline (line disconnection process) (s102).

また、チェックポイント／リスタート処理部１０１３は、その時点の累計数ｎをカウンタｉに代入する（ｓ１０３）。そして、ＣＰｉ情報がＰＲＡＭ５０５に記録されているか否かを判別する（ｓ１０４）。 Further, the checkpoint / restart processing unit 1013 substitutes the cumulative number n at that time for the counter i (s103). Then, it is determined whether or not CPi information is recorded in the PRAM 505 (s104).

ｓ１０４で記録されていない場合は、後述するｓ１１３に処理を進める。 If not recorded in s104, the process proceeds to s113 described later.

On the other hand, if it is recorded in s104, the checkpoint / restart processing unit 1013 calculates and assigns the re-failure life Ti (Ti = tx − CPi), using the time CPi at which the CPi information was acquired, which is recorded in association with the CPi information, and the previously recorded failure time tx (s105).

そして、チェックポイント／リスタート処理部１０１３は、ＣＰｉ情報を用いて回復処理を行う（ｓ１０６）。 Then, the checkpoint / restart processing unit 1013 performs recovery processing using the CPi information (s106).

回復処理を行い、オンライン状態になると（ｓ１０７）、チェックポイント／リスタート処理部１０１３は、故障の発生の監視を続ける（ｓ１０８）。なお、ここで、チェックポイント／リスタート処理部１０１３が故障を検出する前にシステムコール監視処理部１０１４がシステムコール列Ｓを検出すると、図８のステップｓ６１１に戻る。 When recovery processing is performed and the online state is entered (s107), the checkpoint / restart processing unit 1013 continues to monitor the occurrence of a failure (s108). If the system call monitoring processing unit 1014 detects the system call sequence S before the checkpoint / restart processing unit 1013 detects a failure, the process returns to step s611 in FIG.

On the other hand, when the checkpoint / restart processing unit 1013 detects the same failure as the previously detected failure before the system call monitoring processing unit 1014 detects the system call sequence S (s108), it records the time tx′ at which the failure was detected (s109). Then, it performs offline (line disconnection) processing (s110).

そして、チェックポイント／リスタート処理部１０１３は、時刻ｔｘ'および、時刻ＣＰｉを用いて、再発故障間隔Ｔｉ'を算出する（Ｔｉ'＝ｔｘ'−ＣＰｉ）（ｓ１１１）。 Then, the checkpoint / restart processing unit 1013 calculates the recurrent failure interval Ti ′ using the time tx ′ and the time CPi (Ti ′ = tx′−CPi) (s111).

そして、チェックポイント／リスタート処理部１０１３は、再故障寿命Ｔｉと再発故障間隔Ｔｉ' とを比較する（ｓ１１２）。 Then, the checkpoint / restart processing unit 1013 compares the re-failure life Ti with the recurrent failure interval Ti ′ (s112).

If the failure recurs after the period of the re-failure life Ti has passed, that is, if the recurrent failure interval Ti′ is equal to or greater than the re-failure life Ti (Ti′ ≥ Ti), the process returns to step S106, and recovery processing is performed using the same CPi information (retry).

On the other hand, if the failure recurs before the re-failure life Ti has elapsed, that is, if the recurrent failure interval Ti′ is shorter than the re-failure life Ti (Ti′ < Ti), the counter i is decremented by 1 (s113), and it is determined whether the counter i is negative (s114). If it is negative, the checkpoint / restart processing unit 1013 stops the task (process 1001) (s115).

一方、カウンタｉが負の値でなければ（ｓ１１４）、ステップＳ１０５に戻り、１回前に記録したＣＰｉ情報を用いて回復処理を行う（再始動：ＣＰ情報の後退）。 On the other hand, if the counter i is not a negative value (s114), the process returns to step S105, and recovery processing is performed using the CPi information recorded once before (restart: retreat of CP information).

以上、本実施形態のチェックポイント／リスタート処理部１０１３によるＣＰ回復処理について説明した。 The CP recovery process by the checkpoint / restart processing unit 1013 of this embodiment has been described above.

本実施形態では、上述のように、プロセスを始動させたＣＰ情報を取得した時刻（ＣＰ）から故障するまでの時間を計測し、再故障寿命として設定する。故障発生後、所定のＣＰ情報を用いて、プロセスを再始動させる。そして、その再始動時に用いたＣＰ情報を取得した時刻から再び故障するまでの時間を計測し、再発故障間隔とする。再故障寿命と再発故障間隔とを比較し、再発故障間隔より再故障寿命が長いプロセスの場合、１回前に取得したＣＰ情報を用いてプロセスを再始動させる。このとき、再始動に用いたＣＰ情報を取得した時刻から最初の故障までの経過時間をもとに再故障寿命を再設定する。なお、プロセスの実時間の代わりに、プロセスのユーザ時間を利用する。あるいは、プロセスのシステム時間を利用することもできる。 In the present embodiment, as described above, the time from the time (CP) when the CP information for starting the process is acquired to the time of failure is measured and set as the re-failure life. After the failure occurs, the process is restarted using predetermined CP information. Then, the time from the time when the CP information used at the time of restarting is acquired until it fails again is measured and set as the recurrent failure interval. The re-failure life is compared with the recurrent failure interval, and if the process has a re-failure life longer than the recurrent failure interval, the process is restarted using the CP information acquired once before. At this time, the re-failure life is reset based on the elapsed time from the time when the CP information used for the restart is acquired to the first failure. Note that the user time of the process is used instead of the real time of the process. Alternatively, the system time of the process can be used.
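
The choice between retry and rollback summarized above can be sketched as a single decision function. This is an illustration under assumptions, not the patented implementation: the function name `choose_checkpoint` and its parameters are hypothetical, the two durations are assumed to be measured on the same clock (for example process user time), and the comparison follows the summary above, in which a failure that recurs sooner than the re-failure life causes a fallback to earlier CP information.

```python
from typing import Optional, Set


def choose_checkpoint(i: int,
                      re_failure_life: float,
                      recurrence_interval: float,
                      recorded: Set[int]) -> Optional[int]:
    """Return the index of the CP information to use after a recurring failure.

    Returns None when no earlier CP information is left, which corresponds
    to stopping the task (s115).
    """
    if recurrence_interval >= re_failure_life:
        return i                          # retry with the same CPi information
    i -= 1                                # s113: step back one generation
    while i >= 0 and i not in recorded:   # s104: skip indices not recorded
        i -= 1
    return i if i >= 0 else None          # s114: negative means give up


if __name__ == "__main__":
    held = {0, 2, 3}
    print(choose_checkpoint(3, 10.0, 12.0, held))   # retry
    print(choose_checkpoint(3, 10.0, 4.0, held))    # roll back one generation
    print(choose_checkpoint(0, 10.0, 4.0, {0}))     # nothing earlier remains
```

The loop that skips unrecorded indices reflects the check in s104: when older generations have been deleted, the counter keeps stepping back until it reaches CP information that is still held.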

本実施形態では、回復処理（再始動、再試行）に用いるＣＰ情報を、予め定めたシステムコール列検出時に自動的に取得する。すなわち、ＡＰの埋め込みやオペレータの操作ではなく、ＯＳ自体が、ＣＰ情報を取得する。 In this embodiment, CP information used for recovery processing (restart, retry) is automatically acquired when a predetermined system call sequence is detected. In other words, the OS itself acquires the CP information, not the AP embedding or the operator's operation.

Specifically, the information processing apparatus of the present embodiment includes means for monitoring system calls, and means, called from the monitoring means, for referring to a list of system calls that change the integrity of the monitored process. It also includes means for acquiring CP information of an online process and means for restarting the process using arbitrary CP information. It further includes means for duplicating files, called from the means for acquiring CP information, and means for restoring files, called from the means for restarting the process.

Specifically, the system calls that change integrity include a system call that switches from online to offline, a system call that updates a file related to the monitored process, and a system call that applies a security patch to the monitored process. Therefore, according to this embodiment, necessary and sufficient CP information can be acquired in an unattended (and non-stop) operation system.

本実施形態によれば、オンライン途中だけでなくオンライン直前のＣＰ情報を取得する。このため、安全な状態に回復できるＣＰ情報を少なくとも一つ確保することができる。 According to the present embodiment, the CP information just before online as well as during online is acquired. Therefore, at least one piece of CP information that can be recovered to a safe state can be secured.

According to the present embodiment, CP information is acquired upon detection of a predetermined system call sequence, so necessary and sufficient CP information can be acquired and held automatically. This avoids the waste of recording capacity that would result from holding CP information without limit, and allows the CP information that is effective for recovery processing to be held efficiently.

また、本実施形態によれば、再故障寿命という概念を導入したため、故障が再発した時間と再故障寿命を比較することにより、適切なＣＰ情報を選択して回復処理を行うことができる。すなわち、最初の故障発生直後に故障の原因（脅威の種類）を明確に判別を行うことなく、故障の再発という現象に基づいて、順次最適なＣＰ情報を選択し、回復処理を行うことができる。 Further, according to the present embodiment, since the concept of re-failure life is introduced, it is possible to perform recovery processing by selecting appropriate CP information by comparing the time when the failure recurs with the re-failure life. That is, optimal CP information can be sequentially selected and recovered based on the phenomenon of failure recurrence without clearly determining the cause of the failure (type of threat) immediately after the first failure. .

すなわち、複数の取得した必要十分なＣＰ情報を用いて、故障解析の有無によらずに、故障発生時に適切な回復手順を選択可能なチェックポイント／リスタート制御技術を提供することができる。 In other words, it is possible to provide a checkpoint / restart control technique that can select an appropriate recovery procedure when a failure occurs, using a plurality of necessary and sufficient CP information, regardless of whether or not failure analysis has occurred.

Further, according to the present embodiment, the system call monitoring means, which gives the checkpoint / restart processing unit 1013 the instructions that trigger CP information acquisition and restart, is provided inside the OS in the same manner as the checkpoint / restart processing unit 1013. For this reason, compared with the case where instructions such as CP information acquisition are received from the process 1001, an application executed outside the OS, the configuration is less susceptible to the intrusion of malicious code. That is, if instructions were received from the process 1001 and malicious code were inserted into the process 1001, the process could easily be manipulated so as to avoid acquiring CP information or to perform the acquisition without limit. The configuration of the present embodiment has no such concern.

以上により、本実施形態によれば、メモリを効率的に利用し、故障発生の状況に応じて適切な回復処理を行うことが可能なチェックポイント／リスタート技術を提供できる。 As described above, according to the present embodiment, it is possible to provide a checkpoint / restart technique that can efficiently use a memory and perform an appropriate recovery process according to a failure occurrence state.

<< Second Embodiment >>
Next, a second embodiment to which the present invention is applied will be described. In the first embodiment, the concept of the re-failure life T, calculated from the failure time and the CP information acquisition time, is introduced in order to determine which CP information is used for the recovery process. In this embodiment, however, the number of system calls issued is measured instead of time.

本実施形態の情報処理装置５００の機能構成、ハードウェア構成は、基本的に第一の実施形態と同様である。以下、本実施形態の第一の実施形態と異なる構成を説明する。 The functional configuration and hardware configuration of the information processing apparatus 500 of this embodiment are basically the same as those of the first embodiment. Hereinafter, a configuration different from the first embodiment of the present embodiment will be described.

In this embodiment, a checkpoint / restart processing unit 1013-2 is provided instead of the checkpoint / restart processing unit 1013. After acquiring the CP0 information, the checkpoint / restart processing unit 1013-2 counts the number of system calls issued thereafter. Each time CP information is acquired, it records the cumulative number of system calls issued, SCi, in the PRAM 505 in association with the CPi information, instead of the time CPi at which the CPi information was acquired. Similarly, when a failure occurs, the checkpoint / restart processing unit 1013-2 records the cumulative number of system calls issued up to that point, SCx, instead of the occurrence time tx.

すなわち、ＣＰ情報取得処理において、本実施形態では、ＣＰ情報を取得した時点の時刻の記録は行わない。一方、システムコール監視時において、システムコールの発行回数をカウントする。さらに、故障が発生した場合は、故障時時刻ｔｘの記録ではなく、故障時のシステムコールの累積発行数ＳＣｘを記録する。 That is, in the CP information acquisition process, the time at which the CP information is acquired is not recorded in the present embodiment. On the other hand, when the system call is monitored, the number of system calls issued is counted. Further, when a failure occurs, the cumulative number of system calls issued SCx at the time of failure is recorded, not the failure time tx.

In the present embodiment, the re-failure life T is defined as the cumulative number of system calls issued at the time of the failure, SCx, minus the cumulative number of system calls issued when the most recent CP information (CPi information) was acquired, SCi (Ti = SCx − SCi).

図１０は、本実施形態のＣＰ回復処理の処理フローである。 FIG. 10 is a processing flow of CP recovery processing according to this embodiment.

チェックポイント／リスタート処理部１０１３−２は、故障を検出すると、その時点の累積システムコール（ＳＣ）数ＳＣｘをＰＲＡＭ５０５に記録する（ｓ３０１）。そして、情報処理装置５００をオフラインにする（回線切断処理）（ｓ３０２）。 When the checkpoint / restart processing unit 1013-2 detects a failure, the checkpoint / restart processing unit 1013-2 records the cumulative system call (SC) number SCx at that time in the PRAM 505 (s301). Then, the information processing apparatus 500 is set offline (line disconnection process) (s302).

チェックポイント／リスタート処理部１０１３ー２は、その時点の累計数ｎをカウンタｉに代入する（ｓ３０３）。そして、ＣＰｉ情報がＰＲＡＭ５０５に記録されているか否かを判別する（ｓ３０４）。 The checkpoint / restart processing unit 1013-2 substitutes the cumulative number n at that time for the counter i (s303). Then, it is determined whether or not CPi information is recorded in the PRAM 505 (s304).

If it is not recorded in s304, the process proceeds to s314, described later.
On the other hand, if it is recorded in s304, the checkpoint / restart processing unit 1013-2 calculates the re-failure life Ti using the cumulative system call count SCi at the time the CPi information was acquired, which is recorded in association with the CPi information, and the previously recorded SCx at the time the failure was detected, and assigns it to the counter C (C = Ti = SCx − SCi) (s305).

そして、チェックポイント／リスタート処理部１０１３−２は、ＣＰｉ情報を用いて回復処理を行う（ｓ３０６）。 Then, the checkpoint / restart processing unit 1013-2 performs recovery processing using the CPi information (s306).

回復処理を行い、オンライン状態になると（ｓ３０７）、まず、チェックポイント／リスタート処理部１０１３−２は、ＣＰ情報取得フラグをたてる。ここで、ＣＰ情報取得フラグは本実施形態独特の構成であり、チェックポイント／リスタート処理部１０１３−２は、このフラグが立っている間は、システムコール列Ｓが検出されたとしても、ＣＰ情報を取得しないよう構成されている。 When the recovery process is performed and the online state is entered (s307), the checkpoint / restart processing unit 1013-2 first sets a CP information acquisition flag. Here, the CP information acquisition flag is a configuration unique to the present embodiment, and the checkpoint / restart processing unit 1013-2 is configured so that the CP is not detected even if the system call sequence S is detected while this flag is set. It is configured not to obtain information.

そして、チェックポイント／リスタート処理部１０１３−２は故障の発生の監視を続ける（ｓ３０８）。ここでは、システムコールが発行される度に故障の発生の有無を判別する。 Then, the checkpoint / restart processing unit 1013-2 continues to monitor the occurrence of a failure (s308). Here, it is determined whether or not a failure has occurred each time a system call is issued.

故障を検出しなかった場合、チェックポイント／リスタート処理部１０１３−２は、カウンタＣを１デクリメントする（ｓ３０９）。このとき、システムコールの累積発行数ＳＣは通常どおり１インクリメントする。そして、カウンタＣが０か否かを判別し（ｓ３１０）、０でなければ、ｓ３０８に戻り、システムコール発行毎に故障の発生の有無の検出を続ける。 If no failure is detected, the checkpoint / restart processing unit 1013-2 decrements the counter C by 1 (s309). At this time, the cumulative number of system calls issued SC is incremented by 1 as usual. Then, it is determined whether or not the counter C is 0 (s310). If the counter C is not 0, the process returns to s308 and continues to detect whether or not a failure has occurred every time a system call is issued.

本実施形態では、システムコールの発行数で再発故障寿命を定義しているため、カウンタＣが０になった時点で、再発故障寿命Ｔｉになったものといえる。ｓ３１０でカウンタが０の場合、チェックポイント／リスタート処理部１０１３−２は、ＣＰ情報取得フラグをおろす（ｓ３１１）。これ以降、チェックポイント／リスタート処理部１０１３−２が故障を検出する前にシステムコール監視処理部１０１４がシステムコール列Ｓを検出すると、図８のステップｓ６１１に戻る。本構成により、本実施形態では、再発故障寿命Ｔｉの間は、ＣＰ情報を取得しないよう構成する。 In this embodiment, since the recurrent failure life is defined by the number of system calls issued, it can be said that the recurrent failure life Ti is reached when the counter C becomes zero. When the counter is 0 in s310, the checkpoint / restart processing unit 1013-2 lowers the CP information acquisition flag (s311). Thereafter, if the system call monitoring processing unit 1014 detects the system call sequence S before the checkpoint / restart processing unit 1013-2 detects a failure, the process returns to step s611 in FIG. With this configuration, in this embodiment, the CP information is not acquired during the recurrent failure life Ti.

Then, failure monitoring continues (s311). If the same failure recurs, the failure has recurred after the recurrent failure life Ti was exceeded, so the process returns to S306 to retry. While no failure is detected, monitoring continues until the system call sequence S is detected.

一方、ｓ３０８において、チェックポイント／リスタート処理部１０１３−２が、先に検出した故障と同じ故障を検出すると、再発故障寿命Ｔｉの間に故障が再発したことになるため、情報処理装置５００をオフラインにし（ｓ３１３）、カウンタを１デクリメント（ｉ＝ｉ−１）し（ｓ３１４）、ｉが０以上の間は、ｓ３０５に戻り、処理を続ける（再始動；使用するＣＰ情報の後退）。一方、ｉをデクリメントした結果マイナスになった場合は、チェックポイント／リスタート処理部１０１３−２は、業務（プロセス１００１）を停止する（ｓ３１６）。 On the other hand, when the checkpoint / restart processing unit 1013-2 detects the same failure as the previously detected failure in s308, the failure has recurred during the recurrent failure life Ti. Go offline (s313), decrement the counter by 1 (i = i-1) (s314), and return to s305 while i is greater than or equal to 0 and continue processing (restart; retreat of CP information to be used). On the other hand, if the result of decrementing i is negative, the checkpoint / restart processing unit 1013-2 stops the task (process 1001) (s316).
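
The countdown described in s305 to s314 can be illustrated with a short simulation. It is a sketch under assumptions rather than the patented implementation: the function name `run_after_recovery`, the outcome strings, and the modelling of system calls as a sequence of booleans (True meaning the same failure is detected at that call) are all hypothetical.

```python
from typing import Iterable, Tuple


def run_after_recovery(sc_x: int, sc_i: int,
                       failure_at_call: Iterable[bool]) -> Tuple[str, bool]:
    """Simulate the system calls issued after recovery with CPi information.

    Returns (outcome, flag_still_set), where outcome is "rollback", "retry",
    or "running", and the flag is the one that suppresses CP acquisition.
    """
    counter = sc_x - sc_i        # s305: C = Ti = SCx - SCi
    suppress_cp = True           # flag raised once the system is online again
    for failed in failure_at_call:
        if suppress_cp:
            if failed:
                # Failure recurred within the life Ti: fall back to earlier
                # CP information (s313, s314).
                return "rollback", suppress_cp
            counter -= 1         # s309: one more system call without failure
            if counter <= 0:     # s310: life Ti has been used up
                suppress_cp = False   # s311: lower the flag
        elif failed:
            # Failure recurred after the life Ti: retry with the same CPi.
            return "retry", suppress_cp
    return "running", suppress_cp


if __name__ == "__main__":
    print(run_after_recovery(105, 100, [False, False, True]))
    print(run_after_recovery(105, 100, [False] * 5 + [True]))
    print(run_after_recovery(105, 100, [False] * 3))
```

Because the life is expressed as a call count, the whole decision reduces to a down-counter and one flag, with no timestamps to compare.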

ｓ３０４で記録されていない場合は、後述するｓ３１４に処理を進める。一方、ｓ３０４で記録されている場合は、チェックポイント／リスタート処理部１０１３−２は、システムコール数をカウントするために導入したシステムコールカウンタＣに、故障発生時のシステムコールの累積発行数ＳＣｘと最も新しいＣＰ情報（ＣＰｉ情報）を取得した時点でのシステムコールの累積発行数ＳＣｉとから再故障寿命Ｔｉを算出して代入する（Ｓ３０５）。 If not recorded in s304, the process proceeds to s314 described later. On the other hand, if it is recorded in s304, the checkpoint / restart processing unit 1013-2 adds to the system call counter C introduced for counting the number of system calls the cumulative number of system calls issued at the time of failure SCx. Then, the re-failure life Ti is calculated and substituted from the cumulative issuance number SCi of system calls at the time when the latest CP information (CPi information) is acquired (S305).

Also, the checkpoint / restart processing unit 1013-2 performs recovery processing using the CPi information (S306). In the present embodiment, a CP information acquisition stop flag is introduced. While the CP information acquisition stop flag is set, CP information is not acquired even if a predetermined system call sequence is detected. The CP information acquisition stop flag is set at this point.

As described above, according to the present embodiment, the number of system calls issued, rather than the time, is used to determine the processing after a failure occurs. Therefore, in addition to the effects obtained in the first embodiment, the processing can be realized with a simple subtraction counter, and the system configuration can be simplified.

Further, according to the present embodiment, the counter C can be used to configure the system, for example a system in which consecutive failures are unlikely, so that CP information is not acquired until the re-failure life is exceeded (until the counter C reaches 0). Once the re-failure life period has been exceeded, the system returns to the normal processing of acquiring CP information when a predetermined system call is issued. That is, CPk in FIG. 6(a) (CPn+1 in FIG. 10) is acquired. In this way, the re-failure life can also be taken into account when deciding whether to acquire CP information.

In each of the above embodiments, the functional configuration shown in FIG. 2 has been described as an example, but the configuration is not limited to this. For example, the configuration shown in FIG. 11 may be used.

図１１に示す構成は、情報処理装置５００に計算機装置の仮想化技術（ｈｙｐｅｒｖｉｓｏｒ）を適用したものである。 The configuration shown in FIG. 11 is obtained by applying a virtualization technology (hypervisor) of a computer apparatus to the information processing apparatus 500.

図１１において、情報処理装置５００は、業務系システムと監視系システムとを備える。業務系システムは、アプリケーションプログラム１０００とその代替１１００、オペレーティングシステム１０１０とその代替１１１０、を備える。監視系システムは、監視プロセス１２００と、オペレーティングシステム１２１０とを備える。アプリケーションプログラム１０００とその代替１１００と監視プロセス１２００、および、オペレーティングシステム１０１０とその代替１１１０とオペレーティングシステム１２１０とは、仮想計算機１１５０で結合する。チェックポイント／リスタート処理部１０１３は仮想計算機１１５０が備える。この場合、チェックポイント／リスタート処理部１０１３は、オペレーティングシステムからも隠蔽されるので、より安全性が高まる。 In FIG. 11, the information processing apparatus 500 includes a business system and a monitoring system. The business system includes an application program 1000 and its alternative 1100, and an operating system 1010 and its alternative 1110. The monitoring system includes a monitoring process 1200 and an operating system 1210. The application program 1000, its alternative 1100, the monitoring process 1200, and the operating system 1010, its alternative 1110, and the operating system 1210 are combined by a virtual machine 1150. The checkpoint / restart processing unit 1013 is provided in the virtual machine 1150. In this case, since the checkpoint / restart processing unit 1013 is also hidden from the operating system, the safety is further improved.

In this figure, the application program 1000 and its alternative 1100 include a binary program 1002 and a setting file 1003 related to the monitored process 1001. Using virtualization technology makes it easy to store the individual binary programs 1002 and setting files 1003 as well. Therefore, with this configuration, not only the memory but also the programs and setting files are duplicated. With this configuration, the maintenance work itself is inferred, and the files related to the process are backed up. This makes it possible to counter attacks disguised as maintenance work and to protect the files required for restart.

The monitoring process 1200 registers, in each operating system 1010, the system call monitoring processing unit 1014 and the call list 1016, which records the system call sequences that trigger CP information acquisition. When the operating system 1010 detects a system call sequence recorded in the call list 1016, it issues a request to the checkpoint / restart processing unit 1013 of the virtual machine 1150, and the entire computer environment, including the application program 1000 and the operating system 1010, is duplicated as CP information. When a failure occurs in the application program 1000 or the operating system 1010, the restart instruction is issued from the alternative application program 1100 and operating system 1110.

The monitoring process 1200 monitors failures of the monitored process 1001 of the application program 1000 indirectly. This configuration keeps the systems independent of each other, making it harder for viruses and the like to intrude from the business system into the monitoring system. Conversely, the monitoring system is also less likely to affect the business system. In this figure, the case where there is one monitored process 1001 (application program 1000) is described as an example. However, the number of monitored processes 1001 is not limited to this. When there are many monitored processes 1001, from several tens to several hundreds, but only one monitoring process, having the monitoring process 1200 monitor failures of the monitored processes 1001 indirectly, as shown in this figure, makes the monitoring system less likely to become a bottleneck for the processing of the entire system.

以上説明したように、本実施形態によれば、所定のシステムコール発行時にＣＰ情報を取得する。従って、定期的にＣＰ情報を取得する場合に比べて、安全性が高く、必要十分なＣＰ情報を取得することができる。このため、保全業務において、意味のある時点から回復できる可能性が高い。 As described above, according to the present embodiment, CP information is acquired when a predetermined system call is issued. Therefore, compared with the case of regularly acquiring CP information, the safety is high and necessary and sufficient CP information can be acquired. For this reason, it is highly possible to recover from a meaningful point in maintenance work.

It is a block diagram of the information processing apparatus of the first embodiment.
It is a functional block diagram of the information processing apparatus of the first embodiment.
It is a figure for explaining the system call sequence that the information processing apparatus of the first embodiment issues in an offline state.
It is a figure for explaining the system call sequence that the information processing apparatus of the first embodiment issues in an online state.
It is a figure for explaining the system call sequence that the information processing apparatus of the first embodiment issues in an offline state.
It is a figure for explaining the retry and restart of the first embodiment.
It is a figure for explaining the kinds of time of the first embodiment.
It is a processing flow of the CP information acquisition processing of the first embodiment.
It is a processing flow of the CP recovery processing of the first embodiment.
It is a processing flow of the CP recovery processing of the second embodiment.
It is another functional block diagram of the information processing apparatus.

Explanation of symbols

500: Information processing apparatus, 501: Multiprocessor, 502: Processor, 503: Processor, 504: Volatile memory (DRAM), 505: Phase-change memory (PRAM), 506: Storage (HDD), 507: Network interface, 508: Input/output interface, 1000: Application program, 1001: Monitored process, 1010: Operating system, 1011: System call handler, 1012: System call service routine, 1013: Checkpoint / restart processing unit, 1014: System call monitoring processing unit, 1015: Issue history, 1016: Call list

Claims

A process recovery method for a computer, comprising:
a failure detection step of monitoring a process running on the computer and detecting the occurrence of a failure in the process; and
a restart step of, when the occurrence of a failure is detected in the failure detection step, restarting the process in which the failure occurred, using a memory state of the process recorded at a predetermined timing before an event that affects a predetermined integrity of the process occurred in the process.

A process recovery method according to claim 1, comprising:
The failure detection step records the type and time of the failure detected each time the occurrence of the failure is detected,
In the restart step, if the time from when the memory state used for restart at the previous detection of the same failure was recorded until the newly detected failure is longer than the time from when that memory state was recorded until the previous failure, restart is performed using the memory state used at the time of the previous failure; if it is shorter, restart is performed using a memory state recorded before the memory state used at the time of the previous failure.

The process recovery method according to claim 1, wherein:
the failure detection step records, when the occurrence of a failure is detected, the type of the detected failure and the number of system calls issued from the start of operation of the process until the failure was detected; and
the restart step compares (a) the number of system calls issued from the recording of the memory state that was used for the restart when the same failure was last detected until the newly detected failure with (b) the number of system calls issued from the recording of that memory state until the previous failure, restarts the process using the memory state used at the previous failure if (a) is greater than (b), and restarts the process using a memory state recorded earlier than that memory state if (a) is smaller than (b).
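
The bookkeeping needed for the system call count comparison above might look like the following sketch. The class `SyscallFailureLog` and its methods are hypothetical. The sketch also assumes that the counter is rewound to the checkpoint's value when the process is restarted from that checkpoint, so that both counts are measured from the same memory state; the claim does not state how the count is maintained across a restart.

```python
class SyscallFailureLog:
    """Counts system calls and records the count at each detected failure."""

    def __init__(self):
        self.calls = 0     # system calls issued since the process started
        self.history = {}  # failure type -> list of counts at detection

    def on_syscall(self):
        self.calls += 1

    def on_failure(self, failure_type):
        self.history.setdefault(failure_type, []).append(self.calls)

    def on_restart(self, checkpoint_calls):
        # Assumption: restoring a memory state also restores the counter.
        self.calls = checkpoint_calls

    def should_reuse_checkpoint(self, failure_type, checkpoint_calls):
        """True if the checkpoint used last time should be used again."""
        counts = self.history.get(failure_type, [])
        if len(counts) < 2:
            return True  # assumption: no earlier failure to compare against
        prev_count, new_count = counts[-2], counts[-1]
        return (new_count - checkpoint_calls) > (prev_count - checkpoint_calls)


log = SyscallFailureLog()
for _ in range(5):
    log.on_syscall()
checkpoint_calls = log.calls         # memory state recorded after 5 calls
for _ in range(10):
    log.on_syscall()
log.on_failure("segv")               # first failure, 10 calls after the checkpoint
log.on_restart(checkpoint_calls)
for _ in range(3):
    log.on_syscall()
log.on_failure("segv")               # recurs only 3 calls after the checkpoint
reuse = log.should_reuse_checkpoint("segv", checkpoint_calls)
print(reuse)
```

In this run the failure recurs after fewer system calls than before, so the sketch reports that an earlier memory state should be used.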

A memory state recording method used to perform recovery processing when a failure occurs in a process running on a computer, comprising:
a monitoring step of monitoring the sequence of system calls issued by the process; and
a memory state recording step of, when the monitoring step detects a system call sequence indicating that processing that affects the integrity of the process is performed, recording the memory state of the process at that time in association with the time of detection.
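
The monitoring and recording steps above can be sketched as a sliding-window matcher over the stream of system call names. This is an illustration under stated assumptions only: the class `SequenceMonitor`, the `snapshot` callback that returns the memory state, and the example pattern `("open", "write")` are all hypothetical, and the pattern is matched as a contiguous run of calls.

```python
import time
from collections import deque


class SequenceMonitor:
    """Records a memory state whenever a given system call sequence appears."""

    def __init__(self, pattern, snapshot, clock=time.time):
        if not pattern:
            raise ValueError("pattern must contain at least one system call")
        self.pattern = tuple(pattern)
        self.window = deque(maxlen=len(self.pattern))  # most recent calls
        self.snapshot = snapshot  # callable returning the current memory state
        self.clock = clock        # callable returning the current time
        self.records = []         # list of (detected_at, memory_state)

    def observe(self, syscall_name):
        """Feed one system call; return True if a memory state was recorded."""
        self.window.append(syscall_name)
        if tuple(self.window) == self.pattern:
            self.records.append((self.clock(), self.snapshot()))
            return True
        return False


memory = {"counter": 0}
monitor = SequenceMonitor(("open", "write"), snapshot=lambda: dict(memory), clock=lambda: 100.0)
for name in ["read", "open", "write", "close"]:
    memory["counter"] += 1
    monitor.observe(name)
print(monitor.records)
```

Passing the clock and the snapshot function as parameters keeps the matcher independent of how the memory state is actually captured.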

The memory state recording method according to claim 4, wherein:
when the computer is in an offline state, the memory state is recorded upon detection of a system call sequence indicating that a transition from the offline state to an online state is about to occur; and
when the computer is in an online state, the memory state is recorded upon detection of a system call sequence indicating that either a file update or a program addition for correcting a security vulnerability is about to be performed.
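
One way to picture the state-dependent triggers above is a lookup from the computer's state to the system call sequences that should cause a recording. The sketch below is hypothetical throughout: the table `TRIGGERS`, the function `matches_trigger`, and the concrete sequences (a `socket`/`connect` pair standing in for going online, and `open`/`write`/`rename` or `open`/`write`/`chmod` standing in for a file update or program addition) are placeholders chosen for illustration, not sequences taken from this document.

```python
# Hypothetical trigger table: state -> system call sequences to watch for.
TRIGGERS = {
    "offline": [("socket", "connect")],
    "online": [("open", "write", "rename"), ("open", "write", "chmod")],
}


def matches_trigger(state, recent_calls):
    """Return True if the most recent calls end with a trigger for this state."""
    return any(
        tuple(recent_calls[-len(pattern):]) == pattern
        for pattern in TRIGGERS[state]
    )


print(matches_trigger("offline", ["read", "socket", "connect"]))  # offline trigger seen
print(matches_trigger("online", ["socket", "connect"]))           # not an online trigger
```

Keeping the triggers in a table makes it straightforward to swap in a different call list per state without changing the matching logic.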

A checkpoint restart system comprising:
system call monitoring means for monitoring system calls issued by a process running on a computer and detecting the issuance of a predetermined system call sequence; and
recording restart means for acquiring the memory state of the process at the timing when the system call monitoring means detects the predetermined system call sequence, recording the memory state in association with the time of acquisition, monitoring the state of the process, and, when a failure is detected, restarting the process using the recorded memory state,
wherein the predetermined system call sequence is a system call sequence indicating that an event that affects the integrity of the process is about to occur.

The checkpoint restart system according to claim 6, wherein the recording restart means:
records, each time the occurrence of a failure is detected, the type and time of the detected failure; and
compares (a) the time from the recording of the memory state that was used for the restart when the same failure was last detected until the newly detected failure with (b) the time from the recording of that memory state until the previous failure, restarts the process using the memory state used at the previous failure if (a) is longer than (b), and restarts the process using a memory state recorded earlier than the memory state used at the previous failure if (a) is shorter than (b).

The checkpoint restart system according to claim 6, wherein the recording restart means:
records, when the occurrence of a failure is detected, the type of the detected failure and the number of system calls issued from the start of the process until the occurrence of the failure; and
compares (a) the number of system calls issued from the recording of the memory state that was used for the restart when the same failure was last detected until the newly detected failure with (b) the number of system calls issued from the recording of that memory state until the previous failure, restarts the process using the memory state used at the previous failure if (a) is greater than (b), and restarts the process using a memory state recorded earlier than the memory state used at the previous failure if (a) is smaller than (b).

A computer system including a virtual machine on which a plurality of application programs (AP) and a plurality of operating systems (OS) operate, wherein:
each OS includes system call monitoring means for monitoring system calls issued by an AP (process) being executed on that OS and detecting the issuance of a predetermined system call sequence;
the virtual machine includes recording restart means for acquiring the entire memory state of the computer system at the timing when any one of the system call monitoring means detects the predetermined system call sequence, recording the memory state in association with the time of acquisition, monitoring the state of the process, and, when a failure is detected, restarting the process using the recorded memory state from an OS other than the OS on which the process in which the failure was detected operates; and
the predetermined system call sequence is a system call sequence indicating that an event that affects the integrity of the process is about to occur.

A program for causing a computer to function as:
system call monitoring means for monitoring system calls issued by a process running on the computer and detecting a predetermined system call sequence indicating that an event that affects the integrity of the process is about to occur; and
recording restart means for acquiring the memory state of the process at the timing when the system call monitoring means detects the predetermined system call sequence, recording the memory state in association with the time of acquisition, monitoring the state of the process, and, when a failure is detected, restarting the process using the recorded memory state.