JP6290810B2

JP6290810B2 - Management apparatus and management method

Info

Publication number: JP6290810B2
Application number: JP2015057929A
Authority: JP
Inventors: 洋一岩倉
Original assignee: Fujitsu Frontech Ltd
Current assignee: Fujitsu Frontech Ltd
Priority date: 2015-03-20
Filing date: 2015-03-20
Publication date: 2018-03-07
Anticipated expiration: 2035-03-20
Also published as: JP2016177598A

Description

The present invention relates to a management apparatus and a management method for managing a network system configured by interconnecting various servers, terminals, and devices.

Conventionally, large-scale network infrastructures have been built. For example, an IDC (Internet Data Center) provides network equipment, servers, and storage for holding various data, together with access infrastructure to communication networks such as the Internet. An IDC usually also takes on operation and monitoring work, and supports system operation by, for example, issuing notifications and responding when an error occurs. Network infrastructure at financial institutions includes the Web-type financial system (Web-ATM), a centralized server system.

Processing that is fixed as part of banking operations is executed in units called jobs, using operation management middleware. The mechanism that starts a job at a predetermined time is called a job scheduler. Each operation is implemented by a business application called from the job scheduler. An interface is defined between the job scheduler and the business application so that the execution result, such as normal or abnormal, can be determined. For example, when a business application detects an abnormality and returns that information to the calling job scheduler, the status of the operation (job) becomes clear.

Because such a system does not consist of a single device, the cause of a job error may involve multiple servers and components (business applications, the OS, middleware, and so on). A management apparatus that manages the servers therefore monitors the error messages of each server and determines the execution result of the banking operation from those messages.

Operators and maintenance staff then check the error messages of each server, determine what the problem is and what to do next (for example, acquire investigation logs or retry), and take action.

As a technique for supporting error investigation, one disclosed technique accumulates information that associates expected operations with the error phenomena those operations cause, searches the accumulated information based on an error message extracted from a log file when an error occurs, and outputs the retrieved series of operations as a procedure for reproducing the phenomenon that caused the error (see, for example, Patent Document 1).

Patent Document 1: JP 2012-212283 A

However, business applications such as those described above are usually implemented as simple shell scripts. Because of the constraints of shell scripts, such a business application cannot pass detailed data and can convey little more than whether the result was "normal" or "abnormal".
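The limitation described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the scheduler side sees nothing but the exit status of the called program; the function name `run_job` and the two stand-in commands are hypothetical and do not come from the patent.

```python
import subprocess
import sys

def run_job(command):
    """Run a business application and report only what an exit status conveys."""
    completed = subprocess.run(command, capture_output=True, text=True)
    # Only the integer exit status reaches the caller; 0 is treated as normal.
    return "normal" if completed.returncode == 0 else "abnormal"

# Stand-ins for shell scripts: child processes that simply exit with a status.
ok_job = [sys.executable, "-c", "raise SystemExit(0)"]
bad_job = [sys.executable, "-c", "raise SystemExit(3)"]

print(run_job(ok_job))
print(run_job(bad_job))
```

Whatever went wrong inside the second command, the caller learns only that the result was abnormal, which is the gap the management apparatus is meant to fill.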

In addition, when the cause of an error lies not only in the business application but spans servers, various devices, middleware, and so on, there is a mechanism that aggregates messages on the management server that manages the servers. Even so, each message must be checked against product manuals and the business application's error list to identify the cause. This places a burden on the operations administrator and makes identifying the cause time-consuming.

Furthermore, the above works only while the business application itself is operating normally. If the business application's own processing is defective and it becomes, for example, unresponsive, it cannot output an error cause or message, and there is no way to tell what is happening to it.

In addition, the information needed for investigation (system logs and the like) sometimes cannot be acquired immediately when an error occurs. Some information disappears with the passage of time or as data accumulates, so depending on how long error analysis takes, the error information may no longer be collectable.

Moreover, the Web-type financial system (Web-ATM) has been in operation for many years, and major changes to the business applications or to the system itself are not feasible in terms of either cost or quality. Any solution must therefore leave the existing server configuration and business applications (logic) changed as little as possible.

The present invention has been made in view of the circumstances described above. Its object is to provide a management apparatus and a management method that, when a business application called from a job scheduler cannot return detailed information about an error, can identify the cause from the error messages that related servers generated in the meantime and immediately acquire the necessary investigation information.

To solve the above problems, the present invention employs the following configuration.
That is, according to one aspect of the present invention, the management apparatus manages a data center that has a plurality of servers and relays transaction processing executed between a host computer of a financial institution and an automatic transaction apparatus. The management apparatus comprises: a job information acquisition unit that acquires job information on jobs executed on the servers; a target job information search unit that searches the job information acquired by the job information acquisition unit for target job information containing a predetermined character string; an error verification unit that verifies an error by executing a verification command associated with the target job information on the server that output the target job information found by the target job information search unit; and an investigation information acquisition unit that, based on the verification result from the error verification unit, executes an investigation information acquisition command for investigating the error status of the server.

In the management apparatus of the present invention, it is desirable that the job information acquisition unit store the acquired job information in memory in table form and that the target job information search unit search the job information stored in the memory.

In the management apparatus of the present invention, it is also desirable that the job information acquisition unit acquire job information on jobs that satisfy a predetermined condition based on the job start time, the job end time, and the average job execution time.

According to another aspect of the present invention, the management method is executed in a management apparatus that manages a data center having a plurality of servers and relaying transaction processing executed between a host computer of a financial institution and an automatic transaction apparatus. The method comprises: acquiring job information on jobs executed on the servers; searching the acquired job information for target job information containing a predetermined character string; verifying an error by executing a verification command associated with the target job information on the server that output the target job information found by the search; and, based on the verification result, executing an investigation information acquisition command for investigating the error status of the server.

According to the present invention, when a business application called from a job scheduler cannot return detailed information about an error, the cause of the error can be identified from the error messages that related servers generated in the meantime, and the necessary investigation information can be acquired immediately.

FIG. 1 is a diagram showing the application environment of the management system according to the present embodiment.
FIG. 2 is a diagram showing the functional blocks of the management apparatus according to the present embodiment.
FIG. 3 is a flowchart showing the flow of the management process according to the present embodiment.
FIG. 4 is a diagram showing an example of the job execution result management table.
FIG. 5 is a diagram showing an example of the system common definition table.
FIG. 6 is a diagram showing an example of the job operation management table.
FIG. 7 is a diagram showing an example of the message management table.
FIG. 8 is a diagram showing an example of the confirmation information management table.
FIG. 9 is a diagram showing an example of the server-related information management table.
FIG. 10 is a diagram showing an example of the common server information management table.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing the application environment of the management system according to the present embodiment.

In FIG. 1, a data center 1 built as a large-scale network infrastructure is connected, via a network 2 such as a public line, to bank ATMs 3, such as automated teller machines (ATMs) installed in convenience stores and the like. The data center 1 controls the operation of the bank ATMs 3 and is connected via the network 2 to a bank host 4, which is the bank's host computer. The data center 1 relays transaction data between the bank ATMs 3 and the bank host 4.

Installed in the data center 1 are a plurality of WWW servers 11 connected to the network 2; a plurality of business servers 12 that exchange transaction data with the bank ATMs 3 via the WWW servers 11 and perform operation control, operation monitoring, and the like; a plurality of devices such as database servers 13 that store the various data used by the business servers 12; and a management apparatus 14.

The management apparatus 14 includes a job scheduler, starts jobs according to a job execution schedule stored in advance, and calls the business applications on the business servers 12. The management apparatus 14 acquires the start time and end time of each job it starts, as well as message information output by the OS and middleware of each server (WWW servers 11, business servers 12, and database servers 13). The message information includes error messages. The management apparatus 14 also executes the "management process" described later.

The "management process" is repeated at regular intervals (for example, every 60 seconds) while the management apparatus is running.
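The periodic repetition described above could be sketched as follows. This is an illustrative assumption, not the patented implementation: the function `run_periodically`, the `max_cycles` limit, and the injectable `sleep` parameter are hypothetical names added so the loop can be exercised without actually waiting.

```python
import time

def run_periodically(management_process, interval_seconds=60.0,
                     max_cycles=None, sleep=time.sleep):
    """Call management_process repeatedly, pausing interval_seconds between calls."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        management_process()
        cycles += 1
        sleep(interval_seconds)
    return cycles

# Demonstration with a recording stand-in for sleep, so nothing really waits.
calls = []
waits = []
run_periodically(lambda: calls.append("cycle"), 60, max_cycles=3, sleep=waits.append)
```

With `max_cycles=None` the loop runs for as long as the process is alive, which corresponds to "while the management apparatus is running".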

FIG. 2 is a diagram showing the functional blocks of the management apparatus according to the present embodiment.
The management apparatus 14 manages the data center 1, which is built from a plurality of WWW servers 11, a plurality of business servers 12, a plurality of database servers 13, and the like. The management apparatus 14 includes a job information acquisition unit 21, a target job information search unit 22, an error verification unit 23, and an investigation information acquisition unit 24. The processing performed by these units is collectively referred to as the "management process".

The job information acquisition unit 21 acquires job information on jobs executed on the business servers 12 and the like, and stores the acquired job information in memory in table form. The job information acquisition unit 21 acquires job information on jobs that satisfy a predetermined condition based on the job start time, the job end time, and the average job execution time.

The target job information search unit 22 searches the job information acquired by the job information acquisition unit 21, as stored in the memory, for target job information containing a predetermined character string.

The error verification unit 23 verifies an error by executing the verification command associated with the target job information on the server that output the target job information found by the target job information search unit 22, for example a business server 12.

The investigation information acquisition unit 24 executes, based on the verification result from the error verification unit 23, an investigation information acquisition command for investigating the error status of the server.
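How the four units hand data to one another can be sketched as a small pipeline. This is only an illustration under assumptions: the function `management_process` and its four callable parameters are hypothetical stand-ins for units 21 to 24, and the sample data is invented.

```python
def management_process(acquire, search, verify, collect):
    """Chain the four units: acquire -> search -> verify -> collect."""
    job_infos = acquire()                                # job information acquisition unit 21
    targets = search(job_infos)                          # target job information search unit 22
    results = [verify(target) for target in targets]     # error verification unit 23
    return collect(results)                              # investigation information acquisition unit 24

# Invented stand-ins: message strings play the role of job information.
collected = management_process(
    acquire=lambda: ["INFO: started", "ERROR: disk full", "ERROR: timeout"],
    search=lambda infos: [info for info in infos if "ERROR" in info],
    verify=lambda target: (target, 1),   # 1 marks a confirmed abnormality
    collect=lambda results: [target for target, flag in results if flag == 1],
)
```

Passing the units in as callables keeps the control flow visible without committing to any particular storage or command execution mechanism.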

While the business servers 12 and the like are in operation, if a business application called from the job scheduler cannot return detailed information when an abnormality occurs, the management apparatus 14 checks the error messages that related servers generated in the meantime and executes verification commands for isolating the error. From the results, it identifies the cause of the error and immediately acquires the necessary investigation information. Thus, even when a business application called from the job scheduler cannot report an abnormality, the problem can be identified from related errors. An error can also be detected even when the processing of the business application itself is defective, for example when it becomes unresponsive.

Next, the "management process" executed by the management apparatus 14 will be described.
FIG. 3 is a flowchart showing the flow of the management process according to the present embodiment. FIG. 4 is a diagram showing an example of the job execution result management table; its entries are stored each time the job scheduler starts or ends a job.

FIG. 5 is a diagram showing an example of the system common definition table. FIG. 6 is a diagram showing an example of the job operation management table. FIG. 7 is a diagram showing an example of the message management table; when an error event or the like occurs, each server notifies the management apparatus, and the message is stored by a message monitoring process (a known technique) in the management apparatus.
FIG. 8 is a diagram showing an example of the confirmation information management table. FIG. 9 is a diagram showing an example of the server-related information management table. FIG. 10 is a diagram showing an example of the common server information management table.

A specific example of the "management process" executed by the management apparatus 14 will be described with reference to FIGS. 3 to 10. The tables illustrated in FIGS. 4 to 10 use an example in which "Job A", which starts at 21:00 (21:00:00) every day and normally completes in 30 minutes (1800 seconds), was executed on October 26, 2014 (2014/10/26).

In steps S3001 to S3005, the job execution result management table illustrated in FIG. 4 and the system common definition table illustrated in FIG. 5 are referenced to determine whether the execution of a job was abnormal.

First, in step S3001, it is determined for each registered job whether its execution has completed. For example, if values (time data) are stored in both "job start time" and "job end time" in the job execution result management table of FIG. 4, the job is judged to have completed. In FIG. 4, "job name" holds the name of the job, "job average time" holds the average time the job takes to execute, "job start time" holds the time at which the most recent execution started, and "job end time" holds the time at which the most recent execution ended.

If the job is judged to have completed (step S3001: YES), step S3002 determines whether the completed job took too long compared with its normal execution time. For example, it is determined whether the value obtained by subtracting "job start time" from "job end time" in FIG. 4 is greater than the value obtained by adding the "margin time" of FIG. 5 to the "job average time" of FIG. 4 (job end time - job start time > job average time + margin time). If it is greater, processing took longer than usual, and trouble such as high system load is possible. In FIG. 5, "job name" holds the name of the job, and "margin time" holds a predetermined value that allows for variation in the job's execution time.

In the present embodiment, the interval from the job start time to the job end time does not span 24:00:00.
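The comparison in step S3002, together with the no-midnight assumption, can be sketched as below. The helper names `elapsed_seconds` and `took_too_long` are hypothetical, and the 300-second margin is an invented value; only the 21:00:00 start and 1800-second average come from the "Job A" example.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"

def elapsed_seconds(start, end):
    """Seconds between two same-day clock times given as HH:MM:SS strings."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    seconds = int(delta.total_seconds())
    if seconds < 0:
        # The embodiment assumes the interval never spans 24:00:00.
        raise ValueError("interval spans midnight, which this sketch does not handle")
    return seconds

def took_too_long(start, end, average_seconds, margin_seconds):
    """Step S3002: end - start > average + margin."""
    return elapsed_seconds(start, end) > average_seconds + margin_seconds

# "Job A": starts at 21:00:00, average 1800 s; the 300 s margin is an assumed value.
slow = took_too_long("21:00:00", "21:40:00", 1800, 300)    # 2400 s against a 2100 s limit
usual = took_too_long("21:00:00", "21:32:00", 1800, 300)   # 1920 s against a 2100 s limit
```

Working with clock times alone is what makes the midnight restriction necessary; storing full timestamps would remove it, at the cost of departing from the table layout described here.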

If it is determined in step S3002 that the job took too long (step S3002: YES), the process proceeds to step S3006.

On the other hand, if it is determined in step S3002 that the job did not take too long (step S3002: NO), step S3003 determines whether the completed job took too little time compared with its normal execution time. For example, it is determined whether the value obtained by subtracting "job start time" from "job end time" in FIG. 4 is less than the value obtained by subtracting the "margin time" of FIG. 5 from the "job average time" of FIG. 4 (job end time - job start time < job average time - margin time). If it is less, processing took less time than usual, and the processing may have been interrupted.

If it is determined in step S3003 that the job took too little time (step S3003: YES), the process proceeds to step S3006.

On the other hand, if it is determined in step S3003 that the job did not take too little time (step S3003: NO), the execution of that job is judged to have been normal, and the process proceeds to step S3016 in order to examine whether the next job was abnormal, repeating for all jobs. Step S3016 is described later.

If it is determined in step S3001 that the job has not completed (step S3001: NO), step S3004 determines whether the job is currently running. For example, if a value (time data) is stored in "job start time" in the job execution result management table of FIG. 4 and no value is stored in "job end time" (initial state), the job is judged to be running.

If the job is judged not to be running (step S3004: NO), the job is judged not to have been started yet, and the process proceeds to step S3016, repeating for all jobs. Step S3016 is described later.

On the other hand, if the job is judged to be running (step S3004: YES), step S3005 determines whether the job has been running too long compared with its normal execution time. For example, it is determined whether the value obtained by subtracting the "job start time" of FIG. 4 from the current time is greater than the value obtained by adding the "margin time" of FIG. 5 to the "job average time" of FIG. 4 (current time - job start time > job average time + margin time). If it is greater, processing is taking longer than usual, and trouble such as high system load or an unresponsive state is possible.

In the present embodiment, the interval from the job start time to the current time does not span 24:00:00.

If it is determined in step S3005 that the job has been running too long (step S3005: YES), the process proceeds to step S3006.

On the other hand, if it is determined in step S3005 that the job has not been running too long (step S3005: NO), the job is judged to be running normally, and the process proceeds to step S3016, repeating for all jobs. Step S3016 is described later.
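Steps S3001 to S3005 together form a small decision tree, which can be sketched as a single function. The function name `classify_job`, the returned labels, and the representation of times as seconds since midnight (with `None` for an empty table cell) are assumptions made for illustration; the 300-second margin is an invented value.

```python
def classify_job(start, end, now, average, margin):
    """Classify one job; times are seconds since midnight, None means not recorded."""
    if start is not None and end is not None:       # S3001: execution completed
        elapsed = end - start
        if elapsed > average + margin:              # S3002: took too long
            return "abnormal: completed too slowly"
        if elapsed < average - margin:              # S3003: took too little time
            return "abnormal: completed too quickly"
        return "normal"
    if start is not None:                           # S3004: started but not ended
        if now - start > average + margin:          # S3005: running too long
            return "abnormal: running too long"
        return "running normally"
    return "not started"

START = 21 * 3600   # 21:00:00 as seconds since midnight
labels = [
    classify_job(START, START + 2400, None, 1800, 300),
    classify_job(START, START + 600, None, 1800, 300),
    classify_job(START, START + 1800, None, 1800, 300),
    classify_job(START, None, START + 3000, 1800, 300),
    classify_job(START, None, START + 100, 1800, 300),
    classify_job(None, None, START - 60, 1800, 300),
]
```

Only the three "abnormal" outcomes lead on to step S3006; the other three return to the loop over jobs.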

Next, if the execution of a job is judged abnormal in step S3002, S3003, or S3005 above, step S3006 refers to the job operation management table illustrated in FIG. 6 and acquires information on the target job. That is, when the job has completed but took too long (step S3002: YES), when the job has completed but took too little time (step S3003: YES), or when the job is running but has been running too long (step S3005: YES), the name of the server that started the job is acquired from the "job start server name" corresponding to the "job name" in FIG. 6, and the names of the servers related to the job are acquired from the "job-related server name" corresponding to the same "job name". In FIG. 6, "job name" holds the name of the job, "job start time" holds the time at which the job is scheduled to start, "job start server name" holds the name of the server that executes the business application started by the job scheduler, and "job-related server name" holds the names of servers the job uses, for example to reference data.
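The lookup in step S3006 amounts to reading one row of the job operation management table. A minimal sketch follows, assuming the table is held as a dictionary keyed by job name; the key names and the server names `business-01` and `db-01` are invented for illustration.

```python
# Hypothetical in-memory form of the job operation management table (FIG. 6).
JOB_OPERATION_TABLE = {
    "Job A": {
        "job_start_time": "21:00:00",
        "job_start_server": "business-01",
        "job_related_servers": ["db-01"],
    },
}

def servers_for_job(job_name, table=JOB_OPERATION_TABLE):
    """Return the start server followed by the related servers of a job (step S3006)."""
    row = table[job_name]
    return [row["job_start_server"], *row["job_related_servers"]]

targets = servers_for_job("Job A")
```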

Then, in step S3007, the message management table illustrated in FIG. 7 is referenced, and the character strings of the messages output by the business servers 12 and the like are searched based on the server names acquired in step S3006. In FIG. 7, "server name" holds the name of a business server 12 or the like, "message information time" holds the output time of message information output by that server within a unit of time, for example one day, and "message information" holds the message information output by that server within the unit of time.
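Step S3007 can be pictured as filtering the message management table by server name. The sketch below assumes the table is a list of dictionaries; the key names, server names, and message texts are all invented.

```python
# Hypothetical in-memory form of the message management table (FIG. 7).
MESSAGE_TABLE = [
    {"server": "business-01", "time": "21:05:12", "text": "ERROR: connection timeout"},
    {"server": "www-01", "time": "21:06:40", "text": "INFO: request served"},
    {"server": "db-01", "time": "21:07:03", "text": "ERROR: lock wait exceeded"},
]

def messages_for_servers(message_table, server_names):
    """Return the message texts output by the given servers (step S3007)."""
    wanted = set(server_names)
    return [row["text"] for row in message_table if row["server"] in wanted]

found = messages_for_servers(MESSAGE_TABLE, ["business-01", "db-01"])
```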

In step S3008, it is determined whether all of the "verification commands" to be executed in the next step, S3009, have been executed.

If there is a "verification command" that has not yet been executed (step S3008: NO), step S3009 refers to the confirmation information management table illustrated in FIG. 8 and, based on the "server name" of the business server 12 or the like acquired in step S3006 and the character strings in the "message information" acquired in step S3007, executes the "verification command" for the corresponding server name and stores the result in "confirmation result". Executing a "verification command" and storing its "confirmation result" is repeated until all commands have been executed. In FIG. 8, "server name" holds the name of a business server 12 or the like; "search string" holds a keyword string that acts as a trigger for deciding, according to the output message information, whether to perform verification; "verification command" holds the name of the command that performs the verification; "confirmation result" holds the result of the verification, for example "1" if abnormal and "0" if normal; "investigation information acquisition command" holds the command or program information for acquiring investigation information; and "investigation information" holds the investigation information acquired by executing the "investigation information acquisition command".

The stored investigation information accumulates until it is manually cleared separately, when a detailed investigation is carried out.
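The loop over steps S3008 and S3009 might look like the following sketch. It rests on several assumptions that are not in the source: the table is a list of dictionaries, a command is triggered when its search string appears in a message from the same server, and the injected `run_command` callable returns a status where non-zero means an abnormality was found. All names and sample values are hypothetical.

```python
def run_verifications(confirmation_table, messages, run_command):
    """Run each triggered verification command and record 1 (abnormal) or 0 (normal)."""
    for row in confirmation_table:
        texts = [m["text"] for m in messages if m["server"] == row["server"]]
        if any(row["search_string"] in text for text in texts):
            status = run_command(row["server"], row["verification_command"])
            # Assumption: a non-zero status means the check found an abnormality.
            row["confirmation_result"] = 1 if status != 0 else 0
    return confirmation_table

table = [
    {"server": "business-01", "search_string": "disk", "verification_command": "check_disk"},
    {"server": "business-01", "search_string": "timeout", "verification_command": "check_net"},
]
messages = [{"server": "business-01", "text": "ERROR: disk full"}]

def fake_runner(server, command):
    """Stand-in for remote execution; pretends every check reports an abnormality."""
    return 1

checked = run_verifications(table, messages, fake_runner)
```

Injecting the runner keeps the sketch independent of how commands actually reach the servers, which the description leaves open.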

On the other hand, if it is determined in step S3008 that all "verification commands" have been executed (step S3008: YES), step S3010 refers to the confirmation information management table illustrated in FIG. 8 and determines, for example, whether half or more of the "confirmation results" for each "server name" are "1" (abnormal).
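The per-server "half or more" test of step S3010 can be sketched as a grouping followed by a count. The function name `servers_mostly_abnormal`, the dictionary keys, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def servers_mostly_abnormal(confirmation_table):
    """Return the servers for which half or more confirmation results are 1 (step S3010)."""
    by_server = defaultdict(list)
    for row in confirmation_table:
        by_server[row["server"]].append(row["confirmation_result"])
    # "Half or more" is checked as 2 * (number of 1s) >= (number of results).
    return sorted(server for server, results in by_server.items()
                  if 2 * sum(results) >= len(results))

rows = [
    {"server": "business-01", "confirmation_result": 1},
    {"server": "business-01", "confirmation_result": 1},
    {"server": "business-01", "confirmation_result": 0},
    {"server": "business-02", "confirmation_result": 1},
    {"server": "business-02", "confirmation_result": 0},
    {"server": "business-02", "confirmation_result": 0},
]
flagged = servers_mostly_abnormal(rows)
```

Comparing doubled counts avoids fractional arithmetic, so an exact half (for example two abnormal results out of four) is counted as meeting the threshold.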

If it is determined that fewer than half of the "confirmation results" are "1" (step S3010: NO), the process proceeds to step S3012. On the other hand, if half or more are "1" (step S3010: YES), the related servers are verified in step S3011. For example, the server-related information management table illustrated in FIG. 9 is referenced to acquire the "related server name" for each business server 12 or the like for which half or more of the "confirmation results" are "1". Then the confirmation information management table illustrated in FIG. 8 is referenced, the "verification command" is executed on the server with that "related server name", and the result is stored in "confirmation result". In FIG. 9, "server name" holds the name of a business server 12 or the like, "related server name" holds the name of a business server 12 or the like that is related to it, and "common server name" holds the name of a database server 13 or the like that is used in common.

Next, in step S3012, it is determined whether the problem lies in a common server. For example, if half or more of the “confirmation results” of the server that detected the abnormality are “1” (abnormal), and half or more of the “execution results” of the related servers are “1” (abnormal), the problem is determined to lie in the common server.

If it is determined that the problem lies in the common server (step S3012: YES), then in step S3013 the common server information management table illustrated in FIG. 10 is referenced, the “investigation information acquisition command” is executed on the server with the corresponding “common server name”, and the “investigation information” is stored. In FIG. 10, “server name” stores the name of the database server 13 or the like, “investigation information acquisition command” stores the name of the command for acquiring investigation information, and “investigation information” stores the investigation information acquired by executing the “investigation information acquisition command”.

On the other hand, if it is determined in step S3012 that the problem does not lie in the common server (step S3012: NO), the problem is specific to the server whose “confirmation result” was “1” (abnormal). Therefore, in step S3014, the confirmation information management table illustrated in FIG. 8 is referenced, the “investigation information acquisition command” is executed on that server, and the investigation information is stored. For example, if fewer than half of the “confirmation results” of the server that detected the abnormality are “1” (abnormal), or if half or more of those “confirmation results” are “1” (abnormal) but fewer than half of the “execution results” of the related servers are “1” (abnormal), the problem is determined to be specific to the server whose “confirmation result” was “1”, not a problem with the common server.
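
The decision made in step S3012 reduces to two threshold tests, which the following minimal Python sketch illustrates. The function names and return labels are hypothetical, and pooling the related servers' results into a single list is an assumption, since the text does not say whether the ratio is taken per related server or across all of them.

```python
from typing import List

ABNORMAL = 1  # result value that marks an abnormality


def half_or_more_abnormal(results: List[int]) -> bool:
    """True when at least half of the given results are abnormal."""
    if not results:
        return False  # assumption: an empty result list is treated as not abnormal
    return 2 * results.count(ABNORMAL) >= len(results)


def classify_problem(detected_results: List[int], related_results: List[int]) -> str:
    """Step S3012: decide where the investigation command should be run.

    detected_results: confirmation results of the server that detected the abnormality.
    related_results: results of the related servers (pooled, by assumption).
    """
    if half_or_more_abnormal(detected_results) and half_or_more_abnormal(related_results):
        return "common_server"    # step S3013: investigate the common server
    return "server_specific"      # step S3014: investigate the server itself


# Both sides reach the threshold, so the common server is suspected.
print(classify_problem([1, 1, 0], [1, 0]))
# Only the detecting server reaches the threshold, so the problem is server-specific.
print(classify_problem([1, 1], [0, 0, 1]))
```

Under these assumptions the two "server-specific" cases described in the text (fewer than half abnormal on the detecting server, or half or more there but fewer than half on the related servers) both fall through to the same branch.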

Then, in step S3015, only the “confirmation result” in the job execution result management table illustrated in FIG. 4 and the confirmation information management table illustrated in FIG. 8 is initialized, and the process proceeds to step S3016. In step S3016, it is determined whether all the jobs listed under “job name” illustrated in FIG. 4 have been processed. If it is determined that not all the jobs have been processed (step S3016: NO), the process returns to step S3001, and the above processing is repeated for each job.

On the other hand, if it is determined that all the jobs have been processed (step S3016: YES), the “management process” ends.

The embodiment of the present invention has been described above with reference to the drawings; however, the management apparatus to which the present invention is applied is not limited to the embodiment described above.

According to the present embodiment described above, even when a business application called from the job scheduler cannot return an error (abnormality), the underlying cause of the error can be identified from the related error messages. This reduces the workload of the operations administrator and shortens the time needed to identify the cause of the error. In addition, even when the processing of the business application itself called from the job scheduler has a defect such as a failure to respond, the error information can be detected. Server resources are not wasted, and only the necessary information can be acquired immediately. Furthermore, the underlying cause of an error can be identified without modifying programs (business applications) that have been in operation for many years.

The embodiment of the present invention described above can be realized, as one function of the management apparatus, by hardware, or by firmware or software on a DSP (Digital Signal Processor) board or a CPU board.

Needless to say, as long as its functions are executed, the management apparatus to which the present invention is applied is not limited to the embodiment described above, and may be a single apparatus, a system or integrated apparatus composed of a plurality of apparatuses, or a system in which processing is performed via a network such as a LAN or a WAN.

The present invention can also be realized by a system composed of a CPU, memory such as ROM and RAM, an input device, an output device, an external recording device, a medium driving device, and a network connection device, all connected to a bus. That is, needless to say, the invention is also achieved by supplying the management apparatus with a hard disk, a memory such as ROM or RAM, an external recording device, or a portable recording medium on which a software program that realizes the system of the embodiment described above is recorded, and having the computer of the management apparatus read and execute the program.

In this case, the program itself read from the portable recording medium or the like realizes the novel functions of the present invention, and the portable recording medium or the like on which the program is recorded constitutes the present invention.

Examples of portable recording media that can be used to supply the program include a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a DVD-ROM, a DVD-RAM, a magnetic tape, a nonvolatile memory card, a ROM card, and various recording media recorded via a network connection device (in other words, a communication line) such as electronic mail or personal computer communication.

The functions of the embodiment described above are realized when the computer (information processing apparatus) executes the program read into memory. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiment described above are also realized by that processing.

Furthermore, after a program read from a portable recording medium, or a program (data) provided by a program (data) provider, is written into a memory provided in a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided in the function expansion board or function expansion unit may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiment described above can also be realized by that processing.

That is, the present invention is not limited to the embodiment described above, and can take various configurations or shapes without departing from the gist of the present invention.

1 Data center
2 Network
3 Bank ATM
4 Bank host
11 WWW server
12 Business server
13 Database server
14 Management apparatus
21 Job information acquisition unit
22 Target job information search unit
23 Error verification unit
24 Investigation information acquisition unit

Claims

A management apparatus for managing a data center that has a plurality of servers and relays transaction processing executed between a host computer of a financial institution and an automatic transaction device, the management apparatus comprising:
a job information acquisition unit that acquires job information related to a job executed on the servers;
a target job information search unit that searches the job information acquired by the job information acquisition unit for target job information including a predetermined character string;
an error verification unit that verifies an error by executing, on the server that output the target job information found by the target job information search unit, a verification command associated with the target job information; and
an investigation information acquisition unit that, based on the verification result obtained by the error verification unit, executes an investigation information acquisition command for investigating the error status of the server.

The management apparatus according to claim 1, wherein
the job information acquisition unit stores the acquired job information in a memory in a table format, and
the target job information search unit searches the job information stored in the memory.

The management apparatus according to claim 1, wherein
the job information acquisition unit acquires job information related to a job that satisfies a predetermined condition based on a job start time, a job end time, and an average execution time of the job.

A management method executed by a management apparatus that manages a data center having a plurality of servers and relaying transaction processing executed between a host computer of a financial institution and an automatic transaction device, the management method comprising:
acquiring job information related to a job executed on the servers;
searching the acquired job information for target job information including a predetermined character string;
verifying an error by executing, on the server that output the retrieved target job information, a verification command associated with the target job information; and
executing, based on the verification result, an investigation information acquisition command for investigating the error status of the server.