JP2009176146A

JP2009176146A - Multi-processor system, failure detecting method and failure detecting program

Info

Publication number: JP2009176146A
Application number: JP2008015330A
Authority: JP
Inventors: Shinichi Hayashi; 伸一林; Atsuyuki Uchihira; 敬幸内平; 学 ▲塚▼田; Manabu Tsukada; Yoshiaki Horinouchi; 義章堀之内
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2008-01-25
Filing date: 2008-01-25
Publication date: 2009-08-06
Anticipated expiration: 2028-01-25
Also published as: JP4992740B2

Abstract

PROBLEM TO BE SOLVED: To provide a multi-processor system capable of detecting a failure occurring in any of the CPUs of the system.

SOLUTION: The multi-processor system includes a plurality of CPUs, and a WD daemon 11 detects failures using a watchdog timer. The system includes a primary storage device 2 that stores CPU information, which holds, for each task, the identifier of the CPU on which the task runs when it acquires the execution right from the standby state, and CPU moving rules, which are rules for moving the watchdog daemon from CPU to CPU in turn so that it circulates among them. The system also includes a WD management part 12 that updates the CPU information for the watchdog daemon's task based on the CPU moving rules and writes the updated CPU information to the primary storage device 2.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a multiprocessor system, a failure detection method, and a failure detection program that detect failures using a watchdog timer (WDT), which monitors whether the system is operating normally.

Conventionally, in computer systems such as servers, the following method has generally been used to identify the cause of a failure due to a program bug or the like. The OS (Operating System) provides a dump function that, when a failure occurs, writes the memory contents the program was using at the time of the failure to a disk or the like as a "dump file". The cause of the failure is then identified by analyzing the contents of the dump file with a dedicated tool or the like. The dump file also contains register information (including the address of the program that was running when the failure was detected), task information, and stack information.

General computer systems are also equipped with a watchdog timer as one means of confirming that a running program is operating normally. In normal operation, a task on the OS called the watchdog daemon (hereinafter, WD daemon) clears the WDT counter each time it acquires the execution right of a CPU (Central Processing Unit). FIG. 21 shows the behavior of the WDT and the WD daemon during normal operation in the prior art.

When a failure such as a runaway occurs in a program, the WD daemon can no longer acquire the CPU execution right, so the WDT is no longer cleared. FIG. 22 shows the behavior of the WDT and the WD daemon when a failure occurs in the prior art. If the counter is not cleared for a certain period or longer, the WDT detects the failure as a watchdog timeout (hereinafter, WD timeout) and notifies the CPU by an interrupt or the like. This notification triggers the dump file output processing described above.
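The clear-or-time-out behavior described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class `WatchdogTimer`, the function `first_timeout`, and the tick-based time model are assumptions introduced here.

```python
class WatchdogTimer:
    """Counts ticks and reports a timeout once `limit` ticks pass without a clear."""

    def __init__(self, limit):
        self.limit = limit
        self.counter = 0

    def clear(self):
        # Called by the WD daemon each time it acquires the CPU execution right.
        self.counter = 0

    def tick(self):
        # Advance time by one tick; True means a WD timeout has been detected.
        self.counter += 1
        return self.counter >= self.limit


def first_timeout(daemon_ran, limit=3):
    """daemon_ran[i] tells whether the WD daemon got the CPU during tick i.

    Returns the index of the tick at which the WD timeout fires, or None.
    """
    wdt = WatchdogTimer(limit)
    for i, ran in enumerate(daemon_ran):
        if ran:
            wdt.clear()
        if wdt.tick():
            return i
    return None


healthy = first_timeout([True] * 10)
runaway = first_timeout([True, True, False, False, False, False])
```

In the healthy run the counter never reaches the limit, whereas in the runaway run the daemon stops being scheduled after the second tick and the timeout fires two ticks later.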

Meanwhile, in recent years, multiprocessor systems that execute a plurality of tasks in parallel on a plurality of CPUs have come into wide use as a technique for improving the processing capability of a computer system and executing various tasks. In a multiprocessor system as well, failure detection by a WDT and a WD daemon on the OS is generally used, as in the example above, to confirm that running programs are operating normally.

As a technique for handling failures in a computer system that connects a plurality of computers to enable load distribution, there is, for example, the technique described in Patent Document 1 below. Patent Document 1 discloses a technique for a distributed system composed of a plurality of computers in which a failure of each computer is detected and, when a failure occurs, a backup computer can be determined without querying a host computer.

Patent Document 1: Japanese Unexamined Patent Application Publication No. S62-072052 (JP 62-072052 A)

However, on each CPU of a multiprocessor system, tasks run in parallel while repeatedly acquiring and releasing the CPU execution right. When a task releases the execution right, the scheduler, which is a function of the OS, decides which task receives the execution right next and switches to that task. Normally a task stays on the same CPU as far as possible, so after it releases the CPU execution right it is scheduled to run on the same CPU the next time the execution right comes around. The WD daemon described above is also a kind of task, so it normally runs on the same CPU. Consequently, each time the WD daemon acquires the execution right it runs on the same CPU and clears the WDT.

For this reason, when the conventional failure detection technique using the WDT and the WD daemon is applied to a multiprocessor system, there is the following problem. Even if a runaway failure such as an infinite loop occurs on a CPU other than the one on which the WD daemon is running, the CPU on which the WD daemon runs is still normal, so the WD daemon keeps clearing the WDT and the failure cannot be detected for the system as a whole.

In the case of an OS with a load balancing function, when many tasks become concentrated on one CPU, for example, the OS may forcibly move tasks to another CPU to distribute the load. In this case, the WD daemon becomes more likely to be assigned to a CPU other than the one on which the runaway failure occurred, and, as in the example above, the failure cannot be detected for the system as a whole. FIG. 23 shows a case in which the WDT in the prior art cannot detect a failure. As shown there, even if a task on CPU #2 is running away, the WDT counter is cleared as long as the WD daemon is running on CPU #0 or CPU #1, and the failure cannot be detected.

If as many WDTs as CPUs are installed, failures occurring on any of the CPUs can be monitored, but this imposes a heavy cost burden and, because it requires dedicated hardware, impairs versatility. In recent years, server systems that run a general-purpose OS on a board equipped with general-purpose CPUs and support a variety of services on the same hardware have been spreading, mainly among carriers. Adding proprietary dedicated hardware to such systems for failure monitoring and failure information collection would impair their versatility, so it is difficult to introduce.

The present invention has been made to solve the problems of the prior art described above, and its object is to provide a multiprocessor system, a failure detection method, and a failure detection program that can detect a failure occurring on any CPU of the system without installing dedicated hardware.

To solve the problems described above and achieve the object, the present invention is a multiprocessor system that includes a plurality of processors, in which a watchdog daemon detects failures using a watchdog timer and a notification of failure occurrence is issued when a failure is detected. The system comprises storage means for storing processor information, in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each corresponding task, and a processor movement rule, which is a rule for moving the processor on which the watchdog daemon runs from one processor to the next so that it circulates among them; and watchdog management means for updating the processor information corresponding to the watchdog daemon's task based on the processor movement rule and writing the updated processor information to the storage means.

The present invention further comprises watchdog history recording means for storing in the storage means, as a watchdog activation history, the identifier of the processor on which the watchdog daemon ran in association with the time at which the watchdog daemon cleared the watchdog timer; and watchdog activation monitoring means for issuing, when it determines from the watchdog activation history that there is a processor on which the watchdog daemon has not run for longer than a predetermined time, a failure occurrence notification indicating that a failure has occurred on that processor.

The present invention further comprises watchdog daemon behavior specifying means for accepting a change request from a user for the processor movement rule, and the watchdog management means rewrites the processor movement rule based on the change request.

The present invention further comprises program running history recording means for collecting, for a predetermined period when a failure occurrence notification is issued, program running information that includes register information (including the instruction being executed), task information, and stack information, and for writing the collected program running information to the storage means.

The present invention is further characterized in that the device to which the program running history is output can be changed.

According to the present invention, the processor on which the WD daemon runs is moved according to a predetermined processor movement rule, so the WD daemon runs evenly on all processors. As a result, a failure can be detected by a WD timeout whichever processor in the multiprocessor system it occurs on, without installing dedicated hardware.

According to the present invention, the number of the processor that performed the WDT clear and the time of the clear are stored as a WD activation history, and if the WD activation history shows a processor on which no WDT clear has been performed for a predetermined reference time, a failure is judged to have occurred on that processor and a failure notification is issued. As a result, even when the OS has a load balancing function, failures can be detected on all processors by finding any processor on which the WD daemon has not run for a long time.

According to the present invention, a user's change instruction for the processor movement rule is accepted and the stored processor movement rule is rewritten based on that instruction. As a result, the processor movement rule can be changed freely at any time, and the user's requirements can be reflected easily.

According to the present invention, when a failure occurrence notification is issued, register information (including the instruction the program is executing), task information, stack information, time information, and the like are collected and stored for an arbitrary predetermined period. Unlike an ordinary dump file, which stores information only from the moment of the failure, this makes information covering a period of time available, so that even for a failure involving an infinite loop the failure location can be identified easily by analyzing the stored program running history.

According to the present invention, the output destination of the dump file can be specified, so output to a more convenient medium can be selected according to how the system is being used. In addition, even if the device that outputs the dump file has a fault, the dump file can still be recorded by changing the output destination.

Preferred embodiments of the multiprocessor system, failure detection method, and failure detection program according to the present invention are described in detail below with reference to the accompanying drawings. In the description of the embodiments a CPU is used as the example of a processor, but the processor may take another form, such as an MPU (Micro Processing Unit). Likewise, the watchdog daemon and the watchdog timer may be anything that has the function of monitoring the operating state of the system at predetermined timings and detecting the occurrence of a failure.

FIG. 1 shows an example of the functional configuration of a first embodiment of the multiprocessor system according to the present invention. The multiprocessor system of this embodiment includes a control unit 1, which is made up of a plurality of CPUs (not shown), and a primary storage device 2. An OS runs on the control unit 1. As shown in FIG. 1, the multiprocessor system of this embodiment comprises a scheduler 10 that assigns a CPU to each task to be executed, a WD daemon 11 that clears the WDT counter each time it acquires the CPU execution right, a WD management unit 12, and a dump file output unit 13 that generates a dump file when the WD daemon 11 detects a failure. The output destination device 3 consists of a magnetic disk, a memory, a display device, or the like, and is the device to which the dump file generated by the dump file output unit 13 is output.

The scheduler 10, the WD daemon 11, and the dump file output unit 13 are treated here as part of the general functions of the OS, but this is not a limitation; means other than the OS that implement the same functions may be used.

The primary storage device 2 stores task information, which includes CPU information indicating which CPU each task uses, and the CPU movement rule, which holds the rule for moving the CPU assigned to the WD daemon 11.

Next, the operation of this embodiment is described, assuming that the multiprocessor system of this embodiment has a total of four CPUs (CPU #0 to CPU #3). As a precondition, FIG. 2 shows the state transitions of tasks running on these CPUs. The "execution state" is the state in which the scheduler 10 has handed the task the execution right of the CPU and the task is executing its processing. The "standby state" is the state in which the task has finished its processing, has yielded the CPU execution right to another task, and is sleeping. The "execution-waiting state" is the state in which the task has been woken from the standby state and has entered the execution-waiting queue provided for each CPU. The WD daemon 11 is also one of these tasks: it follows the state transitions of FIG. 2 and, each time it enters the execution state and acquires the CPU execution right, clears the WDT counter as its task processing.

FIG. 3 illustrates the task execution-waiting queue provided for each CPU. The execution-waiting queue holds data in list form as task information for each task. A task that has been woken enters the tail (Tail) of the execution-waiting queue, that is, it is stored as the last item of task information. When a scheduling trigger arrives, the scheduler 10 hands the execution right of the CPU to tasks in priority order starting from the task whose task information is stored at the head (Head) of the queue. In FIG. 3, for example, the execution right is given first to the task corresponding to the task information at Head, then to the tasks corresponding to each subsequent item of task information in order, and finally to the task corresponding to the task information at Tail. A task that has finished its processing on the CPU moves to the standby state, is removed from the queue, and waits until its next execution trigger.
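The queue behavior described for FIG. 3 can be sketched with a double-ended queue. This is a simplified model under stated assumptions: the class `RunQueue` and its method names are hypothetical, and a task is removed from the queue at dispatch time rather than when its processing ends.

```python
from collections import deque


class RunQueue:
    """FIFO execution-waiting queue for one CPU (hypothetical model)."""

    def __init__(self):
        self._tasks = deque()

    def wake(self, task):
        # A woken task enters at the Tail of the queue.
        self._tasks.append(task)

    def dispatch(self):
        # The scheduler hands the CPU to the task at the Head, if any.
        return self._tasks.popleft() if self._tasks else None


queue = RunQueue()
for name in ("task_a", "wd_daemon", "task_b"):
    queue.wake(name)

# Tasks receive the execution right in the order in which they were woken.
order = [queue.dispatch() for _ in range(3)]
```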

With the above as a premise, the processing procedure of this embodiment is described. FIG. 4 is a flowchart showing an example of the processing procedure of this embodiment. FIGS. 5-1, 5-2, 5-3, and 5-4 are first, second, third, and fourth conceptual diagrams, respectively, for explaining the operation of this embodiment. First, assume that the CPU movement rule has been decided in advance and stored in the primary storage device 2. As an example, assume that the rule "move in round-robin fashion in descending order of CPU number (CPU #3 → CPU #2 → CPU #1 → CPU #0)" has been set.

Suppose that at some point task A, which was running on CPU #3, finishes its processing and the scheduler 10 assigns the execution right of CPU #3 to the WD daemon 11 (step S11). Specifically, as shown in the first conceptual diagram of FIG. 5-1, after task A finishes its task processing it is removed from the execution-waiting queue of CPU #3 and moves to the standby state, and the scheduler 10 gives the execution right to the WD daemon 11 task, which was at the Head of the execution-waiting queue of CPU #3. The WD daemon 11 then requests that the WDT counter be cleared (step S12).

When the counter has been cleared in response to the WDT counter clear request, the WD daemon 11 notifies the WD management unit 12 that it has performed the WDT clear processing on CPU #3 (step S13). On receiving the notification, the WD management unit 12 reads the CPU movement rule from the primary storage device 2 and, based on that rule, determines the CPU on which the WD daemon 11 task will run next (step S14). In this case, because the CPU movement rule is "round-robin in descending order of CPU number", CPU #2 is chosen as the CPU on which the WD daemon 11 will run the next time it is started.
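The determination made in step S14 can be sketched as a small function. The name `next_cpu` and its parameters are assumptions for illustration; the patent specifies only the rule itself, not how it is encoded.

```python
def next_cpu(current, num_cpus, descending=True):
    """Return the CPU number the WD daemon should run on next.

    Round-robin over CPU #0 .. #(num_cpus - 1), wrapping around at the ends.
    """
    step = -1 if descending else 1
    return (current + step) % num_cpus


# Starting from CPU #3 on a four-CPU system with the descending rule.
sequence = [3]
for _ in range(4):
    sequence.append(next_cpu(sequence[-1], 4))
```

With the descending rule the daemon visits CPU #3, #2, #1, #0 and then wraps back to CPU #3; passing `descending=False` gives the ascending variant.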

Next, the WD management unit 12 searches the group of task information corresponding to each task for the task information of the WD daemon 11, reads it, and refers to the CPU information area contained in it. At this point the WD daemon 11 task has just run on CPU #3, so the CPU information area holds the value indicating CPU #3 (here, "3"). The WD management unit 12 rewrites this CPU information area to the value indicating CPU #2 ("2"), the CPU determined in step S14 (step S15). When the rewrite of step S15 is finished, the WD management unit 12 notifies the WD daemon 11 that the CPU information has been changed (step S16), and the WD daemon 11 moves to the standby state. Then, as shown in the second conceptual diagram of FIG. 5-2, the WD daemon 11 is removed from the execution-waiting queue of CPU #3, and the next task, task B, acquires the CPU execution right from the scheduler 10 and starts its task processing.

When the waiting WD daemon 11 is woken after a certain time has elapsed, its task enters an execution-waiting queue again, as described above. At this time, because the CPU information in the task information of the WD daemon 11 has been rewritten to "2", the WD daemon 11 automatically enters the execution-waiting queue of CPU #2 (it is stored as task information of CPU #2), as shown in the third conceptual diagram of FIG. 5-3. As a result, as shown in the fourth conceptual diagram of FIG. 5-4, when the WD daemon 11 task reaches the head of the queue after further time has passed, the scheduler hands it the execution right of CPU #2 and it requests a clear of the WDT counter in the same way as in step S12 above. The processing from step S13 onward then follows, and the CPU determined in step S14 this time is CPU #1, in accordance with the CPU movement rule.
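The cycle of steps S11 to S16 followed by the next wake-up can be sketched as below. This is a minimal model, assuming one queue per CPU and a WD daemon that is the only task present; `Task`, `wake`, and `run_wd_daemon_once` are hypothetical names, and the `cpu` attribute stands in for the CPU information area.

```python
from collections import deque

NUM_CPUS = 4


class Task:
    def __init__(self, name, cpu):
        self.name = name
        self.cpu = cpu  # CPU information: which queue the task joins on wake-up


def wake(task, queues):
    # The woken task joins the queue of the CPU recorded in its CPU information.
    queues[task.cpu].append(task)


def run_wd_daemon_once(task, queues, cleared_on):
    queues[task.cpu].popleft()             # S11: daemon receives the execution right
    cleared_on.append(task.cpu)            # S12: WDT counter cleared on this CPU
    task.cpu = (task.cpu - 1) % NUM_CPUS   # S14-S15: rewrite CPU info (descending rule)


queues = [deque() for _ in range(NUM_CPUS)]
wd_daemon = Task("wd_daemon", cpu=3)
cleared_on = []
for _ in range(5):
    wake(wd_daemon, queues)
    run_wd_daemon_once(wd_daemon, queues, cleared_on)
```

Because only the CPU information is rewritten, the ordinary wake-up path places the daemon on the new CPU's queue without any change to the scheduler itself.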

By repeating this operation, the CPU on which the WD daemon 11 runs (that is, on which the WDT counter is cleared) moves from CPU #3 to CPU #2, from CPU #2 to CPU #1, from CPU #1 to CPU #0, and from CPU #0 back to CPU #3. While the system is operating normally, the daemon thus circulates evenly among all CPUs in round-robin fashion.

Next, the operation when a failure occurs is described. FIG. 6 illustrates the operation of this embodiment when a failure occurs. Suppose that, through the processing procedure of this embodiment described above (hereinafter, the CPU movement processing), the WD daemon 11 has cleared the WDT counter on CPU #3 at a past time T1, on CPU #2 at time T2, on CPU #1 at time T3, and on CPU #0 at time T4. Assume that after time T4 a failure such as an infinite loop occurs in task D, which is running on CPU #3. Because the WD daemon 11 ran on CPU #0 at time T4, the CPU information area of its task information has already been rewritten to "3" by the time the failure occurs.

However, even when the WD daemon 11 task reaches the head of the execution-waiting queue of CPU #3, task D keeps running away while holding the execution right of CPU #3, so the daemon has no choice but to keep waiting without ever receiving the CPU execution right. If the WDT is not cleared for a certain period after time T4, the WDT detects a WD timeout and notifies the OS with an interrupt whose cause is the WD timeout. When the interrupt is delivered, the OS treats it as the trigger for failure handling, and the dump file output unit 13 writes the memory information for all CPUs to the dump file. The example here is a failure on CPU #3, but when a failure occurs on any other CPU, the CPU movement processing described above eventually places the WD daemon 11 task in the execution-waiting queue of the failed CPU, so the failure can still be detected.
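The failure scenario above can be sketched as a tick-based simulation. It is an illustrative model only: `simulate`, its parameters, and the assumption that one tick equals one WD daemon wake-up period are introduced here and do not come from the patent.

```python
def simulate(num_cpus, stuck_cpu, start_cpu, wdt_limit, max_ticks=100):
    """Return (tick, cpu) when the WD timeout fires, or None if it never does.

    stuck_cpu is the CPU held by a runaway task (None for a healthy system).
    The WD daemon follows the descending round-robin rule.
    """
    cpu = start_cpu   # CPU whose queue the daemon is waiting in
    counter = 0       # WDT counter
    for tick in range(max_ticks):
        if cpu != stuck_cpu:
            counter = 0                 # daemon runs and clears the WDT
            cpu = (cpu - 1) % num_cpus  # CPU information rewritten for next time
        counter += 1
        if counter >= wdt_limit:
            return tick, cpu            # WD timeout while waiting on this CPU
    return None


healthy = simulate(num_cpus=4, stuck_cpu=None, start_cpu=2, wdt_limit=3)
failed = simulate(num_cpus=4, stuck_cpu=3, start_cpu=2, wdt_limit=3)
```

In the failing run the daemon clears the WDT on CPU #2, #1, and #0, then waits on CPU #3 without ever running, and the timeout fires while it is queued there.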

In this embodiment the CPU movement rule moves the daemon in round-robin fashion in descending order of CPU number, but this is not a limitation. Any rule may be used as long as it visits every CPU once within a certain period, for example ascending order of CPU number, or a circulation scheme other than round-robin.

As described above, in this embodiment the CPU on which the WD daemon runs is moved according to a predetermined CPU movement rule. As a result, a failure can be detected by a WD timeout whichever CPU in the multiprocessor system it occurs on, without installing dedicated hardware.

FIG. 7 shows an example of the functional configuration of a second embodiment of the multiprocessor system according to the present invention. As shown in FIG. 7, the multiprocessor system of this embodiment adds a WD history recording unit 14 and a WD activation monitoring unit 15 to the multiprocessor system of the first embodiment and is otherwise the same as the first embodiment. Components with the same functions as in the first embodiment are given the same reference numerals, and their description is omitted.

The OS of this embodiment is assumed to have a load balancing function that, when the number of tasks in the execution-waiting queues of the CPUs becomes uneven, forcibly moves tasks from a CPU with many tasks in its execution-waiting queue to a CPU with few. The CPU movement processing of this embodiment is the same as in the first embodiment, and only the differences from the first embodiment are described below.

In normal operation, as in the first embodiment, the WD daemon 11 clears the WDT counter while moving from CPU to CPU according to the CPU movement rule. Here, as in the first embodiment, the CPU movement rule is assumed to be round-robin movement in descending order of CPU number, with the WDT cleared on each CPU in turn.

FIGS. 8-1 and 8-2 are first and second conceptual diagrams for explaining the operation of this embodiment when a failure occurs. First, as shown in FIG. 8-1, suppose that under the CPU movement rule described above the WD daemon 11 has cleared the WDT counter on CPU #3 at a past time T1, on CPU #2 at time T2, and on CPU #1 at time T3. Assume that at this point (after time T3) a failure due to an infinite loop occurs in task D, which is running on CPU #3. When the WD daemon 11 wakes from the standby state, the CPU information area of its task information has been changed to "0", so the WD daemon 11 task enters the execution-waiting queue of CPU #0.

The WD daemon 11 should therefore next clear the WDT counter on CPU #0. In this embodiment, however, the OS has a load balance function, so if the OS judges that too many tasks are in the execution waiting queue of CPU #0, it forcibly moves some of the tasks in that queue to the execution waiting queue of another CPU, as shown in FIG. 8-2. The WD daemon 11 is also subject to load balancing. Here, assume that the task of the WD daemon 11 has been moved from the execution waiting queue of CPU #0 to that of CPU #2.

As a result of this processing by the load balance function, the WD daemon 11 next clears the WDT counter on CPU #2, and the movement in descending order of CPU number then resumes. If such forced moves between execution waiting queues are repeated many times, the WD daemon 11 may in some cases be unable to enter the execution waiting queue of CPU #3. In that case no WD timeout occurs, so the runaway of task D on CPU #3 cannot be detected.

To handle such cases, this embodiment stores the number of each CPU on which the WDT counter clear process (hereinafter, WDT clear) was performed as a WD activation history, so that a CPU on which no WDT clear has been performed for a predetermined time can be detected. FIG. 9 is a flowchart showing an example of the processing procedure of this embodiment. First, after the WD management unit 12 rewrites the CPU information area of the task information of the WD daemon 11 (step S16 of the first embodiment), the WD management unit 12 notifies the WD history recording unit 14 of the number of the CPU on which the WDT clear was performed (the CPU on which the WD daemon 11 ran) (step S21).

On receiving the notification, the WD history recording unit 14 obtains the current time from the system clock and writes the WD activation history to the primary storage device 2 (step S22). FIG. 10 is a diagram showing an example of the table information written as the WD activation history. The table contains at least the current time and the number of the CPU on which the WDT was cleared.

When writing to the WD activation history is complete, the WD history recording unit 14 starts the WD activation monitoring unit 15 (step S23). The WD activation monitoring unit 15 looks back over the WD activation history for a fixed past period and determines whether, among CPU #0 to CPU #3, there is a CPU on which the WD daemon 11 has not been activated for longer than a predetermined reference time (for example, 120 seconds) (step S24). If it determines that such a CPU exists (Yes in step S24), the WD activation monitoring unit 15 judges that a failure has occurred on that CPU and issues a notification of failure occurrence (step S25). On receiving this notification, the dump file output unit 13 generates a dump file and outputs it to the output destination device 3, in the same way as when it receives a notification of a WD timeout (step S26).
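The check in step S24 can be sketched as follows. This is only an illustration under stated assumptions: the names `WdHistoryEntry` and `find_stale_cpus` are hypothetical, the history is modeled as a list of (time, CPU number) records as in FIG. 10, and a CPU with no record at all is treated as last cleared at `start_time`, a detail the specification does not state.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WdHistoryEntry:
    """One row of the WD activation history: when and where WDT was cleared."""
    time: float  # seconds, taken from the system clock
    cpu: int     # number of the CPU on which the WD daemon ran


def find_stale_cpus(history: Iterable[WdHistoryEntry],
                    cpu_ids: Iterable[int],
                    now: float,
                    reference_time: float,
                    start_time: float = 0.0) -> List[int]:
    """Return CPUs with no WDT clear for longer than reference_time.

    Assumption: a CPU that never appears in the history is treated
    as if it had last been cleared at start_time.
    """
    last_clear = {cpu: start_time for cpu in cpu_ids}
    for entry in history:
        if entry.cpu in last_clear and entry.time > last_clear[entry.cpu]:
            last_clear[entry.cpu] = entry.time
    return sorted(cpu for cpu, t in last_clear.items()
                  if now - t > reference_time)


# Example: CPU #3 is cleared once at t=0 and then skipped by load balancing.
example_history = [WdHistoryEntry(0.0, 3)] + [
    WdHistoryEntry(10.0 * i, cpu)
    for i, cpu in enumerate([2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 0], start=1)
]
```

With a reference time of 120 seconds, `find_stale_cpus(example_history, range(4), now=130.0, reference_time=120.0)` reports only CPU #3, which is the condition under which the monitoring unit would issue the failure notification of step S25.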

FIG. 11 is a third conceptual diagram for explaining the operation of this embodiment. As in FIG. 8-1, task D is running out of control on CPU #3, for example in an infinite loop, while the WD daemon 11 clears the WDT as it moves from CPU to CPU. Because processing by the load balance function occurs frequently, assume that the WD daemon 11 has for some time been activated only within the range of CPU #0 to CPU #2. In this example, the WD daemon 11 has not run on CPU #3 between T3, when it ran on CPU #1, and T11, when the reference time of m seconds has elapsed. FIG. 12 is a diagram showing an example of the WD activation history of this embodiment, namely the history at time T11 in the example described with FIG. 11. From the WD activation history it can thus be determined that the WD daemon was never activated on CPU #3 during the m seconds from T3 to T11 (the reference time, for example 120 seconds). In this case the WD activation monitoring unit 15 regards a failure as having occurred on CPU #3 and issues a notification of failure occurrence.

As described above, in this embodiment the WD history recording unit 14 stores the number of the CPU on which the WDT clear was performed, together with the time, in the primary storage device 2 as the WD activation history, and the WD activation monitoring unit 15, when it finds from the WD activation history a CPU on which no WDT clear has been performed for the predetermined reference time, judges that a failure has occurred on that CPU and issues a failure notification. Consequently, even in a system whose OS has a load balance function, failure detection by WD timeout can be performed no matter which CPU in the multiprocessor system the failure occurs on.

FIG. 13 is a diagram showing an example of the functional configuration of a multiprocessor system according to a third embodiment of the present invention. As shown in FIG. 13, the multiprocessor system of this embodiment replaces the WD management unit 12 of the multiprocessor system of the first embodiment with a WD management unit 12a and adds a WD behavior specifying unit 16, and is otherwise the same as the first embodiment. Components having the same functions as in the first embodiment are denoted by the same reference numerals, and their description is omitted.

In this embodiment, the WD behavior specifying unit 16 is added so that the user can specify the CPU movement rule, and the WD management unit 12a rewrites the CPU movement rule in the primary storage device 2 according to the user's specification.

FIG. 14 is a flowchart showing an example of the processing procedure of this embodiment, and FIG. 15 is a diagram showing the sequence of CPUs on which the WD daemon 11 of this embodiment runs. First, as in the first embodiment, assume that a CPU movement rule has been determined in advance and stored in the primary storage device 2. Here, assume that the predetermined CPU movement rule is random circulation (repeatedly visiting all CPUs in a random order), and that processing based on this rule is being performed by the CPU movement process described in the first embodiment. In this state the WD daemon 11 circulates randomly, for example as in period (a) of FIG. 15.

Suppose that the user now wants to change the CPU movement rule to "move in round-robin fashion in descending order of CPU number". In this case, the WD behavior specifying unit 16 accepts the user's instruction (in this example, an instruction to change to the rule "move in round-robin fashion in descending order of CPU number") and notifies the WD management unit 12a of its content (step S31). The user's instruction is given, for example, through an input device such as a keyboard or mouse (not shown). Next, on receiving the notification, the WD management unit 12a rewrites the CPU movement rule in the primary storage device 2 according to the content of the instruction (step S32).

From this point on, the WD management unit 12a assigns CPUs to the WD daemon 11 according to the rule changed by the user, "move in round-robin fashion in descending order of CPU number" (period (b) in FIG. 15). There is no particular restriction on when a change to the CPU movement rule can be accepted from the user.
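The switch from random circulation to descending round robin (periods (a) and (b) of FIG. 15) might look like the following sketch. The class `WdManager`, its method names, and the rule names `"random"` and `"descending"` are all hypothetical; the stored rule is modeled as a plain attribute rather than a region of the primary storage device 2.

```python
import random


class WdManager:
    """Illustrative stand-in for the WD management unit 12a."""

    RULES = ("random", "descending")

    def __init__(self, num_cpus: int, rule: str = "random", seed=None):
        if rule not in self.RULES:
            raise ValueError("unknown CPU movement rule")
        self.num_cpus = num_cpus
        self.rule = rule                  # stands in for the stored rule
        self._rng = random.Random(seed)
        self._pending = []                # CPUs left in the current random pass

    def set_rule(self, rule: str) -> None:
        """Rewrite the stored CPU movement rule (steps S31 and S32)."""
        if rule not in self.RULES:
            raise ValueError("unknown CPU movement rule")
        self.rule = rule
        self._pending = []

    def next_cpu(self, current_cpu: int) -> int:
        """Pick the CPU the WD daemon is assigned to next."""
        if self.rule == "descending":
            return (current_cpu - 1) % self.num_cpus
        # Random circulation: visit every CPU once per pass, in random order.
        if not self._pending:
            self._pending = list(range(self.num_cpus))
            self._rng.shuffle(self._pending)
        return self._pending.pop()
```

Because the rule is consulted on every call to `next_cpu`, a call to `set_rule` takes effect at the next assignment, which is consistent with the statement that a change can be accepted at any time.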

In this embodiment, the WD management unit 12 of the multiprocessor system of the first embodiment is replaced with the WD management unit 12a and the WD behavior specifying unit 16 is added. Alternatively, the WD management unit 12 of the multiprocessor system of the second embodiment may be replaced with the WD management unit 12a and the WD behavior specifying unit 16 added, so that the processing described above for reflecting the user's change instruction in the CPU movement rule is performed.

As described above, in this embodiment the WD behavior specifying unit 16 accepts the user's instruction to change the CPU movement rule, and the WD management unit 12a rewrites the CPU movement rule stored in the primary storage device 2 according to the change instruction. The CPU movement rule can therefore be changed freely at any time.

FIG. 16 is a diagram showing an example of the functional configuration of a multiprocessor system according to a fourth embodiment of the present invention. As shown in FIG. 16, the multiprocessor system of this embodiment adds a program running history recording unit 17 to the multiprocessor system of the first embodiment and additionally stores a program running history in the primary storage device 2, and is otherwise the same as the first embodiment. Components having the same functions as in the first embodiment are denoted by the same reference numerals, and their description is omitted.

There are cases in which it is difficult to investigate the cause of a failure from the dump file output upon conventional WD timeout detection. If the failure is caused by an illegal memory access or a logical inconsistency within a program, the address of the program that was running when the failure occurred is the failure location itself, so identifying the cause from the information in the dump file is relatively easy. In contrast, when an infinite loop occurs in a program and a WD timeout is detected, the CPU execution address obtained from the dump file is merely the address, within the looping address range, that happened to be executing at the moment the WD timeout occurred. Consequently, neither the range of the program in which the loop occurred nor what led to the loop can be determined from the information left in the dump file, which makes it difficult to find the root cause of the failure. For such failures, recording the running information of the program for every executed instruction can provide useful information for analysis. However, constantly recording such running information during normal operation would place a heavy load on the system and is not practical.

To solve this problem, this embodiment adds a function that collects the running history of programs for a fixed time starting from the point at which a failure is detected. The program running history contains information similar to what is generally output as a dump file (for example, register information including the executed instructions of the program, task information, stack information, and time information), together with the execution time of each instruction obtained from the system.

Next, the operation of this embodiment is described. FIG. 17 is a flowchart showing an example of the processing procedure of this embodiment, and FIG. 18 is a diagram showing the concept of the processing before and after a failure occurs. In the normal state, the multiprocessor system of this embodiment performs the CPU movement process according to the CPU movement rule, as in the first embodiment ((a) normal operation period in FIG. 18).

Suppose that a program runaway or the like then occurs and that, as in the operation of the first embodiment when a failure occurs, the WDT detects a WD timeout and issues a notification of failure occurrence (step S41). On receiving the notification, the OS normally collects the information needed for the dump file and the dump file output unit 13 generates the dump file. In other words, the notification of failure occurrence serves as the trigger for collecting failure information, and it is therefore referred to below as the failure information collection trigger.

In general, an OS that receives a failure occurrence trigger immediately starts collecting the memory contents at that moment as the information to be output as a dump file. In this embodiment, by contrast, when a failure occurrence trigger arises, the OS first notifies the program running history recording unit 17 of the failure. On receiving the notification, the program running history recording unit 17 collects register information including the executed instructions of the program running on each CPU, task information, and stack information (the same items that are normally output to a dump file), together with the execution time of each instruction, and stores them in the primary storage device 2 as the program running history (step S42 in FIG. 17; (c) collection of running information for each CPU in FIG. 18). This collection and storage continues for an arbitrary predetermined time (5 seconds in the example of FIG. 18).

When the predetermined time has elapsed, the normal dump file output processing is performed (step S43). The other operations of this embodiment are the same as in the first embodiment.
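The sequence of steps S42 and S43, collecting running information for a fixed window and only then producing the dump, can be sketched as below. The function `collect_then_dump` and its parameters are hypothetical; `sample` stands in for whatever mechanism reads per-CPU register, task, and stack information, which the specification does not detail, and the sampling interval is an assumption of this sketch.

```python
import time
from typing import Any, Callable, List, Tuple


def collect_then_dump(sample: Callable[[], Any],
                      dump: Callable[[List[Tuple[float, Any]]], None],
                      duration_s: float,
                      interval_s: float,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> List[Tuple[float, Any]]:
    """Collect timestamped samples for duration_s, then run the dump.

    Each history entry pairs a timestamp with one sample, mirroring the
    "running information plus execution time" described in the text.
    """
    history: List[Tuple[float, Any]] = []
    deadline = clock() + duration_s
    while clock() < deadline:
        history.append((clock(), sample()))   # step S42: collect and store
        sleep(interval_s)
    dump(history)                             # step S43: normal dump output
    return history
```

Passing `clock` and `sleep` as parameters is a design choice made here so that the collection window (5 seconds in the example of FIG. 18) can be exercised with a simulated clock instead of real waiting.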

In this embodiment, the program running history recording unit 17 is added to the multiprocessor system of the first embodiment. Alternatively, the program running history recording unit 17 may be added to the multiprocessor system of the second embodiment or of the third embodiment so that, as in this embodiment, the program running history is collected and stored in the primary storage device 2 when a failure collection trigger arises.

As described above, in this embodiment, when a notification of failure occurrence is issued, the program running history recording unit 17 collects register information including the executed instructions of programs, task information, stack information, time information, and the like for an arbitrary predetermined time and stores them in the primary storage device 2 as the program running history. Consequently, even for a failure such as an infinite loop, the failure location can be identified easily by analyzing the stored program running history.

FIG. 19 is a diagram showing an example of the functional configuration of a multiprocessor system according to a fifth embodiment of the present invention. As shown in FIG. 19, the multiprocessor system of this embodiment adds a dump file output destination designating unit 18 to the multiprocessor system of the fourth embodiment, and is otherwise the same as the fourth embodiment. Components having the same functions as in the fourth embodiment are denoted by the same reference numerals, and their description is omitted.

In this embodiment, the output destination device 3 is assumed to include a disk 31 such as a magnetic disk, a packet processing device 32 that outputs data as packets and transfers them over a network, a standard output device 33 such as a monitor that displays the data, and a memory 34 such as a semiconductor memory.

FIG. 20 is a flowchart showing an example of the processing procedure of this embodiment. The operation of this embodiment is the same as that of the fourth embodiment, except that the output destination of the dump file and the program running information output in the fourth embodiment can be selected. First, assume that a failure occurrence is notified as in the fourth embodiment (step S51). The program running history recording unit 17 then executes step S42 of the fourth embodiment and afterwards instructs the dump file output unit 13 to output the dump file (step S52). The dump file output unit 13 outputs the program running information and the dump information (the information collected for the dump file) to the dump file output destination designating unit 18 (step S53). The dump file output destination designating unit 18 then outputs the program running information and the dump file according to the output destination set in advance by the user (step S54). For example, it outputs them to one of the disk 31, the packet processing device 32, the standard output device 33, and the memory 34 of the output destination device 3. The dump file output destination designating unit 18 holds a preset output destination and rewrites that setting when the user specifies a different one.
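The role of the dump file output destination designating unit 18 in steps S53 and S54 can be illustrated with a small dispatch sketch. The class `DumpOutputSelector`, the sink names, and the callable-based sink interface are all assumptions made for illustration; the specification only states that a preset destination is held and can be rewritten by the user.

```python
from typing import Any, Callable, Dict

Sink = Callable[[Any, Any], None]  # receives (running_history, dump)


class DumpOutputSelector:
    """Holds a preset output destination and routes output to it."""

    def __init__(self, sinks: Dict[str, Sink], default: str):
        if default not in sinks:
            raise ValueError("default destination is not a known sink")
        self._sinks = dict(sinks)
        self._selected = default

    def select(self, name: str) -> None:
        """Rewrite the stored destination when the user designates one."""
        if name not in self._sinks:
            raise ValueError("unknown output destination")
        self._selected = name

    def output(self, running_history: Any, dump: Any) -> None:
        """Send both pieces of information to the selected destination."""
        self._sinks[self._selected](running_history, dump)
```

In such a sketch, sinks named for example `"disk"`, `"packet"`, `"stdout"`, and `"memory"` would correspond to the disk 31, packet processing device 32, standard output device 33, and memory 34, and switching the selection is what allows output to continue if one destination has failed.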

In this embodiment, both the program running history information and the normal dump file are output to the output destination device 3, but only the program running history information may be output instead.

As described above, in this embodiment the output destination of the dump file can be specified, so the dump file can be output in a form that is convenient for the user. In addition, if the device to which the dump file is output fails, the dump file normally could not be recorded, but in this embodiment it can still be recorded by changing the output destination.

(Appendix 1) A multiprocessor system comprising a plurality of processors, in which a watchdog daemon detects a failure using a watchdog timer and issues a notification of failure occurrence when a failure is detected, the multiprocessor system comprising:
storage means for storing processor information, in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each corresponding task, and a processor movement rule, which is a rule for sequentially moving the processor on which the watchdog daemon runs so that it circulates among the processors; and
watchdog management means for updating the processor information corresponding to the task of the watchdog daemon based on the processor movement rule and writing the updated processor information to the storage means.

(Appendix 2) The multiprocessor system according to appendix 1, further comprising:
watchdog history recording means for storing in the storage means, as a watchdog activation history, the identifier of the processor on which the watchdog daemon ran in association with the time at which the watchdog daemon cleared the watchdog timer; and
watchdog activation monitoring means for issuing, when it determines based on the watchdog activation history that there is a processor on which the watchdog daemon has not run for longer than a predetermined time, a notification of failure occurrence indicating that a failure has occurred on that processor.

(Appendix 3) The multiprocessor system according to appendix 1 or 2, further comprising watchdog daemon behavior designation means for accepting a request from a user to change the processor movement rule,
wherein the watchdog management means rewrites the processor movement rule based on the change request.

(Appendix 4) The multiprocessor system according to appendix 1, 2, or 3, further comprising program running history recording means for collecting, for a predetermined period when a notification of failure occurrence is issued, program running information including register information containing executed instructions, task information, and stack information, and writing the collected program running information to the storage means.

(Appendix 5) The multiprocessor system according to appendix 4, wherein the device to which the program running history is output can be changed.

(Appendix 6) A failure detection method in a multiprocessor system comprising a plurality of processors, in which a watchdog daemon detects a failure using a watchdog timer and issues a notification of failure occurrence when a failure is detected, the method comprising:
a processor movement rule storing step of storing a processor movement rule, which is a rule for sequentially moving the processor on which the watchdog daemon runs so that it circulates among the processors; and
a watchdog management step of updating, based on the processor movement rule, the processor information corresponding to the task of the watchdog daemon, out of processor information in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each corresponding task.

(Appendix 7) The failure detection method according to appendix 6, further comprising:
a watchdog history recording step of storing, as a watchdog activation history, the identifier of the processor on which the watchdog daemon ran in association with the time at which the watchdog daemon cleared the watchdog timer; and
a watchdog activation monitoring step of issuing, when it is determined based on the watchdog activation history that there is a processor on which the watchdog daemon has not run for longer than a predetermined time, a notification of failure occurrence indicating that a failure has occurred on that processor.

(Appendix 8) The failure detection method according to appendix 6 or 7, further comprising:
a watchdog daemon behavior designation step of accepting a request from a user to change the processor movement rule; and
a step of rewriting the processor movement rule based on the change request.

(Appendix 9) The failure detection method according to appendix 6, 7, or 8, further comprising a program running history recording step of collecting, for a predetermined period, program running information including register information containing executed instructions, task information, and stack information, and recording the collected program running information, when a notification of failure occurrence is issued as a result of failure detection by the watchdog daemon or when a notification of failure occurrence is issued by the watchdog activation monitoring step.

(Appendix 10) The failure detection method according to appendix 9, wherein the device to which the program running history is output can be changed.

(Appendix 11) A failure detection program for detecting a failure in a multiprocessor system comprising a plurality of processors, in which a watchdog daemon detects a failure using a watchdog timer, the program causing a computer to execute:
a processor movement rule storing procedure of storing in a storage unit a processor movement rule, which is a rule for sequentially moving the processor on which the watchdog daemon runs so that it circulates among the processors; and
a watchdog management procedure of reading the processor movement rule from the storage unit, reading from the storage unit the processor information corresponding to the task of the watchdog daemon, out of processor information in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each corresponding task, updating the read processor information based on the processor movement rule, and writing the updated processor information to the storage unit.

(Supplementary Note 12) The failure detection program according to Supplementary Note 11, further comprising:
a watchdog history recording procedure of storing in the storage unit, as a watchdog activation history, the identifier of the processor on which the watchdog daemon operated in association with the time at which the watchdog daemon cleared the watchdog timer; and
a watchdog activation monitoring procedure of reading the watchdog activation history from the storage unit and, when it is determined from the read history that there is a processor on which the watchdog daemon has not operated for longer than a predetermined time, issuing a failure notification indicating that a failure has occurred in that processor.
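
As a rough illustration of Supplementary Note 12, the sketch below keeps the watchdog activation history as a mapping from processor identifier to the last time the daemon cleared the timer on that processor. The dictionary layout, the function names, and the use of seconds as the time unit are assumptions for the example, not details from the specification. The history is assumed to be seeded with the start time for every processor so that a processor is not reported before the daemon has had a chance to reach it.

```python
def record_activation(history, cpu_id, cleared_at):
    """Watchdog history recording: note when the daemon cleared the
    watchdog timer while running on processor cpu_id."""
    history[cpu_id] = cleared_at


def find_stalled_processors(history, now, limit):
    """Watchdog activation monitoring: return the processors on which the
    daemon has not run for more than `limit` seconds as of `now`."""
    return sorted(cpu for cpu, last in history.items() if now - last > limit)
```

For example, if the daemon keeps running on processors 0 and 1 but can never be dispatched on a hung processor 2, only processor 2 exceeds the limit and is reported, even though the watchdog timer itself is still being cleared from the healthy processors.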

(Supplementary Note 13) The failure detection program according to Supplementary Note 11 or 12, further comprising:
a watchdog daemon behavior designation procedure of accepting a change request from a user for the processor movement rule; and
a procedure of rewriting the processor movement rule based on the change request.

(Supplementary Note 14) The failure detection program according to Supplementary Note 11, 12 or 13, further comprising:
a program running history recording procedure of collecting, for a predetermined period, program running information including register information containing the executed instruction, task information, and stack information, and writing the collected program running information to the storage unit, when the occurrence of a failure is notified as a result of failure detection by the watchdog daemon or when the occurrence of a failure is notified by the watchdog activation monitoring procedure.

(Supplementary Note 15) The failure detection program according to Supplementary Note 14, wherein the output destination device of the program running history is changeable.
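
The program running history recording of Supplementary Notes 14 and 15 might be sketched as follows. This is a hypothetical illustration: the sample layout (a dictionary with `time`, `registers`, `task` and `stack` keys), the JSON-lines output format, and the choice to collect samples from the failure notification until `period` seconds later are assumptions made for the example. The output destination is any object with a `write` method, which is one simple way to make the destination changeable.

```python
import io
import json


def record_running_history(samples, failure_time, period, sink):
    """Write every sample taken within `period` seconds after
    `failure_time` to `sink`, one JSON object per line.

    `sink` is any writable text object (a file on disk, standard output,
    or an in-memory buffer), so the destination can be swapped freely.
    Returns the number of samples written.
    """
    written = 0
    for sample in samples:
        if failure_time <= sample["time"] <= failure_time + period:
            sink.write(json.dumps(sample) + "\n")
            written += 1
    return written
```

Passing an `io.StringIO()` buffer keeps the history in memory, while passing an open file or `sys.stdout` sends the same records to disk or to standard output without changing the recording code.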

As described above, the multiprocessor system, failure detection method, and failure detection program according to the present invention are suited to computer systems that have a plurality of processors and a failure detection function using a WDT.

A diagram showing an example functional configuration of Embodiment 1 of the multiprocessor system according to the present invention.
A diagram showing the state transitions of a task.
A diagram for explaining the execution wait queue of tasks.
A flowchart showing an example of the processing procedure of Embodiment 1.
A first conceptual diagram for explaining the operation of Embodiment 1.
A second conceptual diagram for explaining the operation of Embodiment 1.
A third conceptual diagram for explaining the operation of Embodiment 1.
A fourth conceptual diagram for explaining the operation of Embodiment 1.
A diagram for explaining the operation of Embodiment 1 when a failure occurs.
A diagram showing an example functional configuration of Embodiment 2 of the multiprocessor system according to the present invention.
A first conceptual diagram for explaining the operation of Embodiment 2 when a failure occurs.
A second conceptual diagram for explaining the operation of Embodiment 2 when a failure occurs.
A flowchart showing an example of the processing procedure of Embodiment 2.
A diagram showing an example of the table information written as the WD activation history.
A third conceptual diagram for explaining the operation of Embodiment 2.
A diagram showing an example of the WD activation history of Embodiment 2.
A diagram showing an example functional configuration of Embodiment 3 of the multiprocessor system according to the present invention.
A flowchart showing an example of the processing procedure of Embodiment 3.
A diagram showing the sequence of CPUs on which the WD daemon operates in Embodiment 3.
A diagram showing an example functional configuration of Embodiment 4 of the multiprocessor system according to the present invention.
A flowchart showing an example of the processing procedure of Embodiment 4.
A diagram showing the processing concept before and after a failure occurs in Embodiment 4.
A diagram showing an example functional configuration of Embodiment 5 of the multiprocessor system according to the present invention.
A flowchart showing an example of the processing procedure of Embodiment 5.
A diagram showing the behavior of the WDT and the WD daemon during normal operation in the prior art.
A diagram showing the behavior of the WDT and the WD daemon when a failure occurs in the prior art.
A diagram showing a case in which the WDT cannot detect a failure in the prior art.

Explanation of symbols

1 Control unit
2 Primary storage device
3 Output destination device
10 Scheduler
11 WD daemon
12, 12a WD management unit
13 Dump file output unit
14 WD history recording unit
15 WD activation monitoring unit
16 WD behavior designation unit
17 Program running history recording unit
18 Dump file output destination designation unit
31 Disk
32 Packet processing device
33 Standard output device
34 Memory

Claims

1. A multiprocessor system comprising a plurality of processors, in which a watchdog daemon detects a failure using a watchdog timer and notifies the occurrence of a failure when a failure is detected, the multiprocessor system comprising:
storage means for storing processor information, in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each task, and a processor movement rule, which is a rule for sequentially moving the processor on which the watchdog daemon operates so that the daemon circulates among the processors; and
watchdog management means for updating the processor information corresponding to the task of the watchdog daemon based on the processor movement rule, and writing the updated processor information to the storage means.

2. The multiprocessor system according to claim 1, further comprising:
watchdog history recording means for storing in the storage means, as a watchdog activation history, the identifier of the processor on which the watchdog daemon operated in association with the time at which the watchdog daemon cleared the watchdog timer; and
watchdog activation monitoring means for issuing, when it is determined from the watchdog activation history that there is a processor on which the watchdog daemon has not operated for longer than a predetermined time, a failure notification indicating that a failure has occurred in that processor.

3. The multiprocessor system according to claim 1, further comprising:
watchdog daemon behavior designation means for accepting a change request from a user for the processor movement rule,
wherein the watchdog management means rewrites the processor movement rule based on the change request.

4. The multiprocessor system according to claim 1, further comprising:
program running history recording means for collecting, when the occurrence of a failure is notified, program running information including register information containing the executed instruction, task information, and stack information for a predetermined period after the occurrence of the failure, and writing the collected program running information to the storage means.

5. The multiprocessor system according to claim 4, wherein an output destination device of the program running history can be changed.

6. A failure detection method in a multiprocessor system comprising a plurality of processors, in which a watchdog daemon detects a failure using a watchdog timer and notifies the occurrence of a failure when a failure is detected, the method comprising:
a processor movement rule storage step of storing a processor movement rule, which is a rule for sequentially moving the processor on which the watchdog daemon operates so that the daemon circulates among the processors; and
a watchdog management step of updating, based on the processor movement rule, the processor information corresponding to the task of the watchdog daemon, out of processor information in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each task.

7. A failure detection program for detecting a failure in a multiprocessor system comprising a plurality of processors, in which a watchdog daemon detects a failure using a watchdog timer, the program causing a computer to execute:
a processor movement rule storage procedure of storing, in a storage unit, a processor movement rule, which is a rule for sequentially moving the processor on which the watchdog daemon operates so that the daemon circulates among the processors; and
a watchdog management procedure of reading the processor movement rule from the storage unit, reading from the storage unit the processor information corresponding to the task of the watchdog daemon, out of processor information in which the identifier of the processor on which a waiting task runs when it acquires the execution right is stored for each task, updating the read processor information based on the processor movement rule, and writing the updated processor information to the storage unit.