JP2007266776A

JP2007266776A - System for monitoring normal operation of service among plurality of servers, and its method

Info

Publication number: JP2007266776A
Application number: JP2006086559A
Authority: JP
Inventors: Eiichiro Mori; 英一郎森
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2006-03-27
Filing date: 2006-03-27
Publication date: 2007-10-11

Abstract

PROBLEM TO BE SOLVED: To provide a system and method for monitoring normal service operation among a plurality of servers that, in a service system premised on redundant-configuration operation, detects service failure states earlier and more completely by monitoring not only service failures but also the normal operating state of the service.

SOLUTION: In a service system built from a plurality of servers, each server periodically communicates with its opposite server over UDP or TCP, and each server in the service system has a service normal-operation monitoring function that judges whether the service operating state is normal or abnormal from the contents of the communication packets, so that the redundantly configured servers monitor each other's normal service operation.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a system and method for monitoring normal service operation among a plurality of servers, and in particular to such a system and method in which each server in a service system is given a service normal-operation monitoring function so that redundantly configured servers can monitor each other's normal service operation.

In recent years, the number of server systems providing various services has kept growing, and demand has increased for systems that minimize service interruption time through early failure detection. Conventionally, in switching systems and the like, a failure of the active server is determined by detecting a fault within the server itself or by detecting a loss of communication using a dedicated monitoring server, and service interruption time is minimized by bringing a physically redundant standby system into operation (system switchover).

As described above, in a service system premised on redundant-configuration operation, one way to detect a service failure is provided by middleware, which checks that the service-providing processes running on it are alive and restarts a process when it disappears.

Middleware on the redundant servers can also monitor communication between the servers, detect a failure of the opposite server from a loss of communication, and switch the active server.

Furthermore, for a server that is providing a service, an external server dedicated to service monitoring periodically communicates with the service-providing process to monitor it.

For example, a technique is known in which a communication unit, a service processing unit, and a process management unit are provided in both the active server and the standby server, and, when some failure occurs in the active server, the standby server takes over request reception control by the communication unit, processing execution control by the service processing unit, and request distribution control by the process management unit, so that the standby server performs the service processing in place of the active server (see, for example, Patent Document 1).
On the other hand, a technique is known in which communication means is provided between duplexed monitoring/control processors, events that occur are mutually notified between the processors, and the system state is monitored and controlled while each processor keeps track of the other's state. A technique is also known in which duplexed monitoring/control processors exchange a predetermined notification event at fixed intervals and monitor each other's operation (see, for example, Patent Document 2).
However, these conventional techniques are all limited to promptly transferring processing to the standby system when the active system fails in a duplexed server or monitoring/control processor. No invention exists that aims to eliminate the need for a dedicated monitoring server by having redundant servers communicate periodically and judge each other's service operating state from the contents of that communication.

There is also a method of preparing an external server dedicated to service monitoring in order to monitor the normal operating state of a service. However, because a dedicated monitoring server must be set up in addition to the service-providing servers, this has the drawback of higher introduction cost and power consumption.
Japanese Patent Application Laid-Open No. 08-212095
Japanese Patent Application Laid-Open No. 10-154085

An object of the present invention is to provide a system and method for monitoring normal service operation among a plurality of servers that, in a service system premised on redundant-configuration operation, detects service failure states earlier and more completely by monitoring not only service failures but also the normal operating state of the service.

A first invention for solving the above problem is a server system that provides a service built from a plurality of servers, comprising: communication means for periodically communicating, from the normal-operation monitoring server to the opposite server, using an upper-layer protocol of the Internet Protocol; check means for checking, at the opposite monitored server that accepted the communication, the normal operating state of the service based on monitoring contents registered in advance; and service normal-operation monitoring means for judging whether the service operating state is normal or abnormal according to the check result.

According to the first invention, a service normal-operation monitoring system among a plurality of servers can be provided that detects service failure states earlier and more completely by monitoring not only server failures but also the normal operation of the service.

A second invention is the service normal-operation monitoring system among a plurality of servers according to the first invention, in which each server is provided with the communication means, the check means, and the service normal-operation monitoring means.

According to the second invention, because the redundantly configured servers monitor each other's normal service operation, a service normal-operation monitoring system among a plurality of servers can be provided that requires no dedicated monitoring server.

A third invention is the service normal-operation monitoring system among a plurality of servers according to the first invention, in which the communication means communicates by using the plurality of communication routes configured on each server.

According to the third invention, a service normal-operation monitoring system among a plurality of servers can be provided that minimizes abnormality detections caused by factors other than service operation, such as hardware failures on a communication route or packet loss.

A fourth invention is a method for monitoring normal service operation in a server system that provides a service built from a plurality of servers, comprising the steps of: periodically communicating, from the normal-operation monitoring server to the opposite server, using an upper-layer protocol of the Internet Protocol; checking, at the opposite monitored server that accepted the communication, the normal operating state of the service based on monitoring contents registered in advance; and judging whether the service operating state is normal or abnormal according to the check result.

According to the fourth invention, a service normal-operation monitoring method among a plurality of servers can be provided that detects service failure states earlier and more completely by monitoring not only server failures but also the normal operation of the service.

A fifth invention is a program for causing a computer to execute the steps of: periodically communicating, from the normal-operation monitoring server to the opposite server, using an upper-layer protocol of the Internet Protocol; checking, at the opposite monitored server that accepted the communication, the normal operating state of the service based on monitoring contents registered in advance; and judging whether the service operating state is normal or abnormal according to the check result.

According to the fifth invention, a program can be provided that detects service failure states earlier and more completely by monitoring not only server failures but also the normal operation of the service.

As described above, the system and method of the present invention for monitoring normal service operation among a plurality of servers provide the following effects.
(1) Because normal service operation is monitored among the servers within the service system without a dedicated monitoring server, capital investment can be kept to a minimum.
(2) Because normal service operation is checked using all of the communication routes configured on a server, service failures can be distinguished from hardware failures more accurately.
(3) Because the trigger for normal-operation monitoring comes from an external server other than the monitored one, periodic monitoring is not affected by failure or delay of the monitored server's software clock.
(4) Because the monitoring server periodically receives the software clock time from the monitored server and compares the time difference, abnormal service states caused by clock failures or software slowdowns can be detected.
(5) Because each server has an interface for registering and deleting the service normal-operation monitoring program (starting and stopping monitoring), the monitoring trigger can be changed dynamically without a server restart, which allows hardware and software to be added or removed online and failures to be handled.
(6) Because the monitored server has a program that confirms normal operation of application services, with an interface for registering the services to be confirmed, the contents of the monitored services can be changed dynamically.
(7) Because monitoring information is exchanged periodically, triggered by server disconnection or incorporation when servers are added or removed, the monitoring method can be changed dynamically to match the server configuration (simplex, duplex, and so on).

Hereinafter, embodiments of the present invention will be described with reference to the drawings. To make the description easier to follow, the same reference numerals denote the same parts throughout the drawings.

FIG. 1 is a configuration diagram of a service normal-operation monitoring system among a plurality of servers according to an embodiment of the present invention. In the figure, 1 and 2 denote the AAA network and the BBB network, respectively, each forming a LAN to which a plurality of hubs are connected. 10 is a monitored server that provides an arbitrary service, and 11 is a monitoring server that provides an arbitrary service. 20 is a program that controls service normal-operation confirmation; it has a mechanism for registering the start and end of operation confirmation for application 30 and operating system 40. On receiving a monitoring notification from program 21, it confirms normal service operation according to the previously registered confirmation contents and returns the result to program 21. If necessary, it also attempts to recover the system judged abnormal. 21 is a program that controls service normal-operation monitoring; it registers the start and end of monitoring of program 20 on server 10, carries out the monitoring, and reacts according to the monitoring result. 30 is an application providing an arbitrary service; it registers the start and end of normal-operation confirmation of its own service with program 20 and is periodically checked for normal operation. 40 is the operating system, the platform on which applications providing arbitrary services run. Depending on what is registered with program 20, it may also be checked for normal operation. 50 and 51 are hubs (HUB) belonging to the AAA network. 52 and 53 are hubs (HUB) belonging to the BBB network. 60 is a magnetic disk device used by program 20, application 30, and others to read and write data; it stores, among other things, the config file in which the service normal-operation confirmation contents are registered. 61 is a magnetic disk device used by program 21, application 31, and others to read and write data; it stores, among other things, the file to which program 21 writes syslog entries when it detects a service operation abnormality in program 20. syslog is the system log output facility provided by various UNIX (registered trademark) systems. 70 and 71 are system monitoring mechanisms (SCF: System Control Facility), and 72 is the inter-SCF connection interface (RCI: Remote Cabinet Interface).

FIG. 2 is an explanatory diagram (part 1) of normal service operation according to an embodiment of the present invention, and FIG. 3 is an explanatory diagram (part 2) of normal service operation according to the same embodiment.

FIG. 4 is a time chart of normal service operation according to an embodiment of the present invention. It shows the case in which, in the duplexed server configuration of the service normal-operation monitoring system of FIG. 1, normal-operation monitoring is triggered from server 11. The operation of FIG. 4 is described below with reference to FIGS. 2 and 3.

Operation (1)
First, application 30 on server 10, the monitored server, registers a monitoring notification with program 20, which controls service normal-operation confirmation. Registration is accepted at any time, whether during initial setup or during operation, and may be made either by an application programming interface (API) function call or by launching a command that reads configured settings. The registered contents include the protocol used for the notification (UDP, TCP, or msg communication), the information needed to use that protocol (such as a port number or message queue ID), the notification timing (period), the content to be notified, the expected response when normal, and the reaction when abnormal (such as a process restart or server restart). It is also possible to register a command to issue or a process to execute against the OS. Program 20 holds the registered contents in memory and then waits for communication from program 21.
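The registration held in memory by program 20 can be pictured as a small record plus a registry that accepts additions and removals at run time. The following is only a sketch under assumptions: the class names, field names, and reaction strings are hypothetical and do not come from the patent, which specifies the kinds of information registered but not their layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Transport(Enum):
    UDP = "udp"
    TCP = "tcp"
    MSG = "msg"  # OS message-queue communication


@dataclass(frozen=True)
class MonitorRegistration:
    """One monitoring-notification entry (field names are illustrative)."""
    transport: Transport
    endpoint: int                     # port number (UDP/TCP) or message queue ID (msg)
    period_s: float                   # notification cycle in seconds
    request: bytes                    # content sent to the application
    expected_reply: bytes             # reply that counts as normal
    reaction: str                     # hypothetical label, e.g. "restart_process"
    os_command: Optional[str] = None  # optional command to run against the OS


class Registry:
    """In-memory store; entries can be added or removed while running."""

    def __init__(self):
        self._entries = {}

    def register(self, name, entry):
        self._entries[name] = entry

    def unregister(self, name):
        self._entries.pop(name, None)

    def entries(self):
        # Return a copy so callers cannot mutate the registry by accident.
        return dict(self._entries)
```

Keeping the record immutable means a change of monitoring contents is expressed as unregister followed by register, which matches the start and end registrations described in the text.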

Operation (2)
When monitoring start is registered with program 21, which monitors normal service operation on server 11 (the monitoring server), program 21 periodically communicates with program 20 on server 10 over the AAA network route using UDP (User Datagram Protocol) or TCP (Transmission Control Protocol), which are upper-layer protocols of IP. When monitoring stop is registered, it disconnects communication with program 20. Registration of monitoring start or stop with program 21 is accepted at any time, whether during initial setup or during operation, but after monitoring start is registered, program 21 begins communicating (monitoring becomes effective) only once server 11 has entered the operational state. This is so that server 11 is ready to become the active server through system switchover when an abnormality of server 10 is detected. Registration may be made either by an API function call or by launching a command.
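The periodic communication of operation (2) can be sketched with the standard `socket` module. This is a minimal illustration for the UDP case only, not the patented implementation: the function names, the `CHECK` payload, and the timeout value are assumptions made for the example.

```python
import socket
import time


def probe_udp(host, port, payload=b"CHECK", timeout_s=1.0):
    """Send one datagram and return the reply, or None if none arrives in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(payload, (host, port))
            reply, _addr = sock.recvfrom(4096)
        except OSError:
            # Covers socket.timeout and other socket errors: treat as no response.
            return None
        return reply


def monitor(probe, period_s, cycles, sleep=time.sleep):
    """Call probe() once per monitoring period and collect the results."""
    results = []
    for _ in range(cycles):
        results.append(probe())
        sleep(period_s)
    return results
```

A caller would bind the target with something like `monitor(lambda: probe_udp("192.0.2.10", 5000), period_s=10, cycles=3)`; the address and port here are placeholders.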

Operation (3)
On accepting communication from program 21, program 20 notifies application 30 according to the monitoring contents registered so far and checks the normal operating state of the service from application 30's response. If an interface check against operating system 40 was registered, program 20 executes the specified command or program and confirms that it runs through to normal termination. If the check result is normal, program 20 reads the software clock time of its own server and returns it to program 21 together with the service normal-operation check result.
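The monitored-side check of operation (3) can be sketched as a single handler. The code below is an illustration under assumptions: `handle_check`, its parameters, the 5-second command timeout, and the shape of the returned dictionary are all hypothetical, and `time.time` stands in for the server's software clock.

```python
import subprocess
import time


def handle_check(ask_application, expected_reply, os_command=None, clock=time.time):
    """Run the registered checks; return a response dict, or None if any check fails.

    ask_application: callable that notifies the application and returns its reply.
    os_command: optional argument list for a command that must exit with status 0.
    """
    if ask_application() != expected_reply:
        return None
    if os_command is not None:
        try:
            done = subprocess.run(os_command, capture_output=True, timeout=5)
        except (OSError, subprocess.TimeoutExpired):
            return None
        if done.returncode != 0:
            return None
    # All checks passed: attach the software clock time to the result.
    return {"result": "OK", "soft_clock": clock()}
```

Returning `None` on failure leaves the decision about reactions and notifications to the caller, so this handler covers only the normal path described above.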

Operation (4)
On confirming a normal check result, program 21 saves the software clock time of server 10 collected by program 20, then communicates with program 20 again to have the software clock time collected a second time. It compares the first software clock time with the time collected on the second communication and checks that the elapsed time does not reach the threshold. If there is no problem, the service is regarded as operating normally, and communication (monitoring) is performed again after the monitoring period has elapsed.
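The software clock comparison can be written as a small pure function. This is a sketch, not the patent's logic verbatim: the function name and return labels are hypothetical, and treating a clock that runs backwards as a clock fault is an added assumption (the text only describes the threshold test).

```python
def judge_soft_clock(first_s, second_s, threshold_s):
    """Classify the gap between two software clock readings from the monitored server."""
    elapsed = second_s - first_s
    if elapsed < 0:
        # Assumption for this sketch: a reading earlier than the previous one is a clock fault.
        return "clock_fault"
    if elapsed >= threshold_s:
        # Elapsed time reached the threshold: the monitored side is too slow.
        return "slowdown"
    return "normal"
```

For example, readings of 100.0 s and 100.2 s against a 1.0 s threshold are classified as normal, while readings of 100.0 s and 103.0 s are classified as a slowdown.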

FIG. 5 is an explanatory diagram of abnormal service operation according to an embodiment of the present invention. It shows the flow of operations for an abnormality of application 30, an abnormality of operating system 40, slowdown detection, and a communication timeout.

FIG. 6 is a time chart (part 1) of abnormal service operation according to an embodiment of the present invention, showing the case of an abnormality of application 30. The operation of FIG. 6 is described below with reference to FIG. 5.

Operation (5)
If the check response from application 30 is abnormal (NG response) or absent, program 20 attempts to restore the service of application 30 using the specified reaction. It notifies maintenance personnel of the abnormality by SNMP trap or syslog, but returns no response to program 21. If an interface check against operating system 40 was registered, program 20 executes the specified command or program and confirms that it runs through to normal termination. If the check result is normal, program 20 reads the software clock time of its own server and returns it to program 21 together with the service normal-operation check result.
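The abnormal path of operation (5) can be sketched as follows. The names are hypothetical, and Python's `logging` module is used here only as a stand-in for the SNMP trap or syslog notification mentioned in the text.

```python
import logging

log = logging.getLogger("service_monitor")


def handle_abnormal(reply, expected_reply, reaction, notify=log.error):
    """Return True if the reply is normal; otherwise notify, react, and return False.

    reply is None when the application did not answer at all.
    reaction is a callable such as a process or server restart (supplied by the caller).
    """
    if reply == expected_reply:
        return True
    # Notify first so the abnormality is recorded even if the reaction itself fails.
    notify("service check failed (reply=%r); running reaction" % (reply,))
    reaction()
    # False tells the caller to send nothing back, so the monitor sees a timeout.
    return False
```

Withholding the response rather than sending an explicit failure means the monitoring side needs only one detection mechanism, the response timeout, for both a hung server and a failed check.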

FIG. 7 is a time chart (part 2) of abnormal service operation according to an embodiment of the present invention, showing the case of an abnormality of operating system 40. The operation of FIG. 7 is described below with reference to FIG. 5.

Operation (5')
If the result of executing the command or program against operating system 40 is abnormal (NG response) or absent, program 20 attempts to restore the service of operating system 40 using the specified reaction. It notifies maintenance personnel of the abnormality by SNMP trap or syslog, but returns no response to program 21.

FIG. 8 is a time chart (part 3) of abnormal service operation according to an embodiment of the present invention, showing the case of slowdown detection. The operation of FIG. 8 is described below with reference to FIG. 5.

Operation (6)
Program 21 compares the first software clock time with the time collected on the second communication. If the difference is at or above a fixed value, it judges that program 20 is abnormal due to slowdown and notifies maintenance personnel of the abnormality by SNMP trap or syslog. If a hardware power control device is available (SCF 70, 71 and RCI 72 in the example), it stops the abnormal server and automatically restores the service by making its own server the active server through system switchover.

FIG. 9 is a time chart (part 4) of abnormal service operation according to an embodiment of the present invention, showing the case of a communication timeout. The operation of FIG. 9 is described below with reference to FIG. 5.

Operation (7)
If no response comes back from program 20 and the response over the AAA network times out, program 21 tries again to communicate with (monitor) program 20 on server 10 using the BBB network, the other communication route configured on server 11. That is, operations (4), (5), and (6) above are carried out over the BBB network. If both networks result in NG (response timeout), server 10 is judged to be abnormal.
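The route fallback of operation (7) amounts to trying each configured route in order and declaring the server abnormal only when every route times out. The sketch below assumes a caller-supplied `probe(address)` that returns the reply or `None` on timeout; the function name and return tuples are hypothetical.

```python
def check_over_routes(routes, probe):
    """Try each (name, address) route in order; stop at the first one that answers.

    Returns ("responded", route_name, reply) on success, or
    ("server_abnormal", None, None) if every route timed out.
    """
    for name, address in routes:
        reply = probe(address)
        if reply is not None:
            return ("responded", name, reply)
    return ("server_abnormal", None, None)
```

With routes listed as AAA first and BBB second, a timeout on AAA followed by a reply on BBB yields `("responded", "BBB", reply)`, which points to a route problem rather than a service failure.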

FIG. 10 is a data configuration diagram showing the monitoring notification registration contents according to an embodiment of the present invention: (a) shows the registration contents for application monitoring notification, and (b) shows the registration contents for operating system monitoring notification.

In the figure, the APL/OS monitoring type (a selection) indicates whether the monitoring type is application (APL) communication or the operating system (OS). APL communication is communication between the normal-operation monitoring program and the application, carried out by the specified method (msg, UDP, TCP, and so on). msg communication is message communication, a communication method that uses the message queues implemented in operating systems such as Solaris and Linux. The msg queue ID is a common ID number recognized by both communicating users. The sending user specifies this ID number to put information on the message queue, and the receiving user specifies the same ID number to take information off the queue, which enables communication between the users. The unit "ms" for the notification response timeout means thousandths of a second.

FIG. 11 is a software configuration diagram (normal-operation monitored server side) according to an embodiment of the present invention. In the figure, the monitored server side is server 10 in the duplexed server configuration. In this case the monitoring trigger comes from server 11, but the trigger may instead come from server 10.

Application process 101 and library unit 103 reside in application 30. A monitoring notification is registered either by an API function call through monitoring notification registration (user) 102 and monitoring start registration acceptance unit (API) 104, or by reading configured settings on command launch through monitoring notification registration unit (command) 111. Normal-operation monitoring program process 105 resides in program 20; service normal-operation check execution unit (normal/abnormal judgment) 107 checks the normal operation of the service, and the judgment result and related data are recorded in registered contents (memory) 114. The history of abnormality-detection reaction unit 113 is also recorded in registered contents (memory) 114. In addition, 106 denotes the service normal-operation result reception unit, 108 the OS command execution unit, 109 the application communication unit, 110 the monitoring notification registration unit, and 112 the service normal-operation check notification acceptance and response unit.

FIG. 12 is a software configuration diagram (normal operation monitoring server side) according to an embodiment of the present invention. In the figure, the normal operation monitoring server side refers to the server 11 in the duplexed server configuration. In this case, service normal operation monitoring is triggered from the server 11, but as noted above it may equally be triggered from the server 10.

The application process 101' and the library unit 103' reside in the application 31. Monitoring start is registered in one of two ways: by an application interface function call through the monitoring start registration (user) 115 and the monitoring start registration accepting unit (API) 116, or by command start of the monitoring start registration unit (command) 118. The service normal operation monitoring unit (periodic communication, network selection) 119 resides in the program 21, and the determination result of the service normal operation result determination unit 120 and related information are registered in the registered contents (memory) 114'. The history of the abnormality detection reaction unit 121 is also registered in the registered contents (memory) 114'. Reference numeral 117 denotes a monitoring start registration unit.
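The monitoring-side behaviour (periodic communication, network selection, result determination, reaction on abnormality) might look like the following sketch. The function names, the `b"CHECK"` payload, the reply strings, and the rule that failure on every route counts as abnormal are all assumptions made for illustration; the description does not fix a wire format. UDP is used here only because it is one of the two transports the abstract names.

```python
import socket
import time


def udp_request(route, payload=b"CHECK", timeout=1.0):
    """Send one check notification over UDP and return the reply text.

    `route` is a (host, port) pair. Raises OSError (including timeouts)
    when the route does not answer.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, route)
        data, _ = s.recvfrom(1024)
        return data.decode("ascii", "replace")


def poll_once(routes, request=udp_request):
    """Try each configured route in order and judge the first reply received."""
    for route in routes:
        try:
            reply = request(route)
        except OSError:
            continue  # this network is unreachable, select the next one
        verdict = "normal" if reply == "normal" else "abnormal"
        return {"verdict": verdict, "route": route}
    # Assumption: no reply on any route is judged abnormal.
    return {"verdict": "abnormal", "route": None}


def monitor(routes, cycles, interval, on_abnormal, request=udp_request):
    """Poll periodically, keep a history, and call the reaction on abnormality."""
    history = []
    for _ in range(cycles):
        result = poll_once(routes, request)
        history.append(result)
        if result["verdict"] == "abnormal":
            on_abnormal(result)
        time.sleep(interval)
    return history
```

Passing the transport in as `request` keeps the route-selection logic testable without a network, and would let a TCP transport be swapped in without touching the determination logic.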

The following supplementary notes are further disclosed regarding the embodiments, including the examples above.

(Supplementary note 1) A service normal operation monitoring system among a plurality of servers, in a server system that provides a service built from a plurality of servers, comprising: communication means for periodically communicating, from a normal operation monitoring server side to an opposing server, using an upper-layer protocol of the Internet Protocol; checking means for checking, at the opposing normality operation monitored server that has accepted the communication, a service normality operation state based on monitoring contents registered in advance; and service normal operation monitoring means for determining whether the service operating state is normal or abnormal according to the check result.

(Supplementary note 2) The service normal operation monitoring system among a plurality of servers according to supplementary note 1, wherein each server is provided with the communication means, the checking means, and the service normal operation monitoring means.

(Supplementary note 3) The service normal operation monitoring system among a plurality of servers according to supplementary note 1, wherein the communication means performs communication using a plurality of communication routes configured in each server.

(Supplementary note 4) A service normal operation monitoring method among a plurality of servers, for a server system that provides a service built from a plurality of servers, comprising: a step of periodically communicating, from a normal operation monitoring server side to an opposing server, using an upper-layer protocol of the Internet Protocol; a step of checking, at the opposing normality operation monitored server that has accepted the communication, a service normality operation state based on monitoring contents registered in advance; and a service normal operation monitoring step of determining whether the service operating state is normal or abnormal according to the check result.

(Supplementary note 5) A program for causing a computer to execute: a step of periodically communicating, from a normal operation monitoring server side to an opposing server, using an upper-layer protocol of the Internet Protocol; a step of checking, at the opposing normality operation monitored server that has accepted the communication, a service normality operation state based on monitoring contents registered in advance; and a service normal operation monitoring step of determining whether the service operating state is normal or abnormal according to the check result.

(Supplementary note 6) The service normal operation monitoring system among a plurality of servers according to supplementary note 1, further comprising: collecting means for periodically collecting, at the monitoring source server, the software clock time from the monitored server; and detecting means for detecting a failure by comparing time differences obtained by the collecting means.
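One way to read the time-difference comparison in supplementary note 6 is sketched below. It is an interpretation, not the disclosed method: the function name `clock_fault`, the sample layout, and the `tolerance` value are hypothetical. The sketch assumes that between two successive collections the monitored server's software clock should advance by roughly the same amount as the monitoring server's own clock, and flags a failure when the two elapsed times disagree by more than the tolerance (for example, when the remote software clock has stalled).

```python
def clock_fault(samples, tolerance=0.5):
    """Detect a software-clock anomaly from periodically collected samples.

    `samples` is a list of (local_time, remote_soft_clock) pairs in seconds,
    in collection order. Returns True when any pair of successive samples
    shows the remote clock advancing by an amount that differs from the
    local elapsed time by more than `tolerance` seconds.
    """
    for (l0, r0), (l1, r1) in zip(samples, samples[1:]):
        local_elapsed = l1 - l0
        remote_elapsed = r1 - r0
        if abs(remote_elapsed - local_elapsed) > tolerance:
            return True
    return False
```

Comparing elapsed intervals rather than absolute times means a constant offset between the two servers' clocks does not raise a false alarm.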

(Supplementary note 7) The service normal operation monitoring system among a plurality of servers according to supplementary note 1, further comprising an interface for registering or deleting a service normal operation monitoring program in each server.

(Supplementary note 8) The service normal operation monitoring system among a plurality of servers according to supplementary note 1, further comprising an interface for registering, in the monitored server, the service contents for which the normal service operation of an application is to be confirmed.

(Supplementary note 9) The service normal operation monitoring system among a plurality of servers according to supplementary note 1, further comprising means for periodically exchanging monitoring information after a standby server of the redundant configuration has been incorporated.

The present invention can be used to monitor the normal operation of services in a server system that provides services built from a plurality of servers.

FIG. 1 is a configuration diagram of a service normal operation monitoring system among a plurality of servers according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram (part 1) of normal service operation according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram (part 2) of normal service operation according to an embodiment of the present invention.
FIG. 4 is a time chart of normal service operation according to an embodiment of the present invention.
FIG. 5 is an explanatory diagram of abnormal service operation according to an embodiment of the present invention.
FIG. 6 is a time chart (part 1) of abnormal service operation according to an embodiment of the present invention.
FIG. 7 is a time chart (part 2) of abnormal service operation according to an embodiment of the present invention.
FIG. 8 is a time chart (part 3) of abnormal service operation according to an embodiment of the present invention.
FIG. 9 is a time chart (part 4) of abnormal service operation according to an embodiment of the present invention.
FIG. 10 is a data configuration diagram showing the monitoring notification registration contents according to an embodiment of the present invention.
FIG. 11 is a software configuration diagram (normal-operation monitored service side) according to an embodiment of the present invention.
FIG. 12 is a software configuration diagram (normal operation monitoring service side) according to an embodiment of the present invention.

Explanation of symbols

1 AAA network
2 BBB network
10, 11 Server
20, 21 Program
30, 31 Application
40 Operating system
51, 52, 53, 54 Hub
60, 61 Magnetic disk device
70, 71 SCF (System Control Facility)
72 RCI (Remote Cabinet Interface)
101 Application process
102 Monitoring notification registration (user)
103 Library unit
104 Monitoring notification registration accepting unit (API)
105 Normality operation monitoring program process
106 Service normal operation result receiving unit
107 Service normal operation check execution unit (normal/abnormal determination)
108 OS command execution unit
109 Application communication unit
110 Monitoring notification registration unit
111 Monitoring notification registration unit (command)
112 Service normal operation check notification accepting and responding unit
113 Abnormality detection reaction unit
114 Registered contents (memory)
115 Monitoring start registration (user)
116 Monitoring start registration accepting unit (API)
117 Monitoring start registration unit
118 Monitoring start registration unit (command)
119 Service normal operation monitoring unit (periodic communication, network selection)
120 Service normal operation result determination unit
121 Abnormality detection reaction unit

Claims

1. A service normal operation monitoring system among a plurality of servers, in a server system that provides a service built from a plurality of servers, comprising:
communication means for periodically communicating, from a normal operation monitoring server side to an opposing server, using an upper-layer protocol of the Internet Protocol;
checking means for checking, at the opposing normality operation monitored server that has accepted the communication, a service normality operation state based on monitoring contents registered in advance; and
service normal operation monitoring means for determining whether the service operating state is normal or abnormal according to the check result.

2. The service normal operation monitoring system among a plurality of servers according to claim 1,
wherein each server is provided with the communication means, the checking means, and the service normal operation monitoring means.

3. The service normal operation monitoring system among a plurality of servers according to claim 1,
wherein the communication means performs communication using a plurality of communication routes configured in each server.

4. A service normal operation monitoring method among a plurality of servers, for a server system that provides a service built from a plurality of servers, comprising:
a step of periodically communicating, from a normal operation monitoring server side to an opposing server, using an upper-layer protocol of the Internet Protocol;
a step of checking, at the opposing normality operation monitored server that has accepted the communication, a service normality operation state based on monitoring contents registered in advance; and
a service normal operation monitoring step of determining whether the service operating state is normal or abnormal according to the check result.

5. A program for causing a computer to execute:
a step of periodically communicating, from a normal operation monitoring server side to an opposing server, using an upper-layer protocol of the Internet Protocol;
a step of checking, at the opposing normality operation monitored server that has accepted the communication, a service normality operation state based on monitoring contents registered in advance; and
a service normal operation monitoring step of determining whether the service operating state is normal or abnormal according to the check result.