JP5007247B2

JP5007247B2 - Job processing system and job management method

Info

Publication number: JP5007247B2
Application number: JP2008021965A
Authority: JP
Inventors: 豪士穴吹; 潤大方
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 2008-01-31
Filing date: 2008-01-31
Publication date: 2012-08-22
Anticipated expiration: 2028-01-31
Also published as: JP2009181496A

Description

The present invention relates to information processing technology, and more particularly to a job management method for managing abnormalities that occur during batch processing of jobs, and to a job processing system to which the method is applied.

With the recent development of information processing technology and the expansion of network environments, a wide variety of information travels across networks. Organizations that centrally manage the individual data entered at terminals, such as companies and their internal departments, therefore need technology for strictly managing both the enormous volume of input data and the systems that handle it. Data management tasks such as backups and the calculation of various figures, as well as system maintenance, are generally routine processes performed periodically, for example daily or monthly. For that reason, a set of jobs specified in advance is often configured for batch processing so that it runs automatically, typically at night.

When jobs are batch processed, their processing order is determined in advance based on the processing capability of the system, efficiency, dependencies between jobs, priorities, and the like. By registering the processing content of each job, that is, the job flow, together with the execution order in the system, the desired processing will in principle have finished automatically by the desired time. This makes it possible to streamline various kinds of processing while reducing labor costs (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open No. 5-12037

On the other hand, when some failure occurs in the middle of batch processing, recovery is often difficult. For example, when a job stops partway through, the cause may lie not only in that job itself but also in a job executed before it, or in one executed even earlier. When multiple jobs are executed in parallel, the number of jobs that may be the cause increases further. Batch processing of jobs presupposes that few personnel are needed, yet once a failure occurs, much of the investigation and recovery has to rely on manual work. This dilemma makes recovery all the more difficult.

For this reason, organizations generally prepare for failures by securing highly skilled personnel who can analyze failure causes, staffing with a margin to spare, and preparing multiple layers of alternative job flows for emergencies. This in turn increases labor, system development, and maintenance costs. The problem becomes more pronounced the more constraints are placed on processing, such as a requirement that batch processing finish before business hours begin, and the larger the system becomes. In addition, a mistaken countermeasure at the time of a failure risks secondary damage, such as the failure spreading to jobs that had been processed normally. Meanwhile, the burden on system developers is also growing, because they must design systems and configure job flows so that such situations do not arise.

The present invention has been made in view of these circumstances, and its object is to provide a job management technique that can deal with failures safely without increasing cost.

One aspect of the present invention relates to a job processing system. This job processing system batch processes jobs and comprises: an execution status storage unit that accumulates and stores information on the execution status of jobs each time a job is processed; a failure cause detection unit that, when an abnormality occurs in any of the jobs being processed, refers to the execution status information and detects, as a cause candidate, an event in which the relative relationship between the processing times of the job in which the abnormality occurred and of the other jobs processed in the same batch process is in an abnormal state, that is, a state deviating from the normal state that, in past results, has occurred with a probability exceeding a predetermined threshold; and an output unit that outputs information on the cause candidate detected by the failure cause detection unit.

Here, the “execution status information” may be any information recorded in ordinary logs, such as the processing start and end times of a job, the start and end times of access to a resource, and error information.

Another aspect of the present invention relates to a job management method. This job management method includes the steps of: acquiring used resource information that associates the jobs batch processed in a job processing system with information on the resources each job uses; accumulating and storing information on the execution status of jobs each time a job is processed; when an abnormality occurs in any of the jobs being processed, detecting failure cause candidates by referring to the used resource information and the execution status information and checking at least one of the following: whether there is an error in a resource used by that job, the execution status of other jobs that use the same resource as that job, whether the relative relationship between the processing times of the failed job and of other jobs that use the same resource has changed from the normal state, and how frequently a candidate has been the cause of failures of the failed job in the past; and outputting information on the failure cause candidates.

Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a system, and the like, are also valid as aspects of the present invention.

According to the present invention, jobs can be processed without increasing the cost of preparing for failures.

FIG. 1 shows a configuration example of a system to which this embodiment can be applied. In the figure, the job processing system 10 includes four servers: a first server 12, a second server 14, a third server 16, and a fourth server 18. The first server 12 is connected to a database 20. A user operates a terminal or the like of each server to make settings and registrations so that a desired job is processed at a desired time. The number of servers and databases and the connection destination of the database are not limited to those shown in FIG. 1; this embodiment can be applied to any configuration so long as the system can process jobs. Client terminals and the like may also be connected to each server.

Each of the first server 12, the second server 14, the third server 16, and the fourth server 18 may be a general information processing apparatus that includes one or more CPUs, memory, a storage device, an input/output device, a display device, and the like, or any combination of these. Its scale is not limited, and it may be a personal computer, a general-purpose mainframe, or the like. As an example, the figure shows a configuration in which the first server 12 has a hard disk 13 and the second server 14 has a hard disk 15. The first server 12 through the fourth server 18 are connected to a network 22 and can exchange data with one another.

The user causes the job processing system 10 to process jobs by configuring, on any of the first server 12, the second server 14, the third server 16, and the fourth server 18, the job flow, the processing order for batch processing, the processing start time, and so on. A single job may be processed by any one of the four servers or by multiple servers. The user decides which server processes each job and in what order, and whether multiple jobs are processed in parallel, in view of available resources such as CPU processing capacity and network bandwidth and of the processing content, such as the order of access to the database. General techniques used in job processing can be applied to these procedures.

In such a configuration, processing progresses through complex interaction among hardware, such as multiple servers and databases, and multiple jobs. If a job being processed on a certain server, for example the first server 12, stops because of some failure, the cause may lie in the stopped job itself or somewhere entirely different. Various causes are conceivable: the job may have read erroneous data output by a preceding job; job processing on the second server 14 may have been unable to write because drive E of the hard disk 15 ran out of free space; a network connection may have timed out because of contention with a job processed in parallel; or a hardware fault may have occurred. In general, these factors must be verified one by one by hand, the cause identified and the problem resolved, and the job processing then run again.

If identifying the cause takes time, not all of the scheduled jobs can be finished by the scheduled time, and in some cases the next day's business and operations may be disrupted. The risk grows with the scale of the system. For example, if the first server 12 and the second server 14 are managed by different departments or installed at different locations, it is not easy to discover that the cause of an abnormal termination of a job processed by the first server 12 lies inside the second server 14. If the department managing the second server 14 resolves the problem there while the cause is still being investigated, it ultimately remains unclear on the first server 12 side what caused the job to terminate abnormally.

As more and more business operations move online and become automated at an accelerating pace, the volume of data to be processed becomes enormous and systems grow in scale. The problems described above become correspondingly more serious, and the burden on system developers and those responsible for handling failures increases. In this embodiment, therefore, jobs related to the failed job are detected automatically and the narrowing down of failure causes is automated, thereby efficiently supporting recovery work. Relatedness is judged from at least one of the following four viewpoints: (1) error information for the resources (hereinafter “used resources”) that the job in which the failure occurred (hereinafter the “failed job”) was using; (2) the status of jobs that use the same resources as the failed job (hereinafter “same-resource jobs”); (3) the relative situation of the failed job and the same-resource jobs; and (4) the frequency with which similar failures have occurred.

Even when jobs have no direct connection in terms of processing content, they often become related incidentally from the standpoint of failures. Such failure-related connections between jobs can arise from many factors, including the job processing order and the amount of data processed, and are therefore difficult to predict in advance. Even after a failure has occurred, the connection is hard to find from the logs of the affected server or job alone. In this embodiment, the four viewpoints above are evaluated at the point a failure occurs, and jobs are linked to one another through them to find failure-related connections. The jobs and resources that caused the failure are then detected from across the entire job processing system.

FIG. 2 schematically shows a case in which a failure-related connection arises through a change in viewpoint (3) above, the relative situation of the failed job and a same-resource job. In the figure, the horizontal axis represents the passage of time, and rectangles represent the periods during which three jobs, job A, job B, and job X, are processed. In this example, job A and job B are processed in that order, and job X is processed in parallel with them. These jobs may be processed on the same server or on different servers. Consider the case where, in an environment that batch processes jobs in this order, job X terminates abnormally, that is, job X becomes the failed job. Ordinarily, it is necessary to go through, one by one and by hand, the items directly related to job X, such as the error logs of the resources it accessed and the logs of jobs processed before job X on the same server.

However, many events other than such readily recognizable factors can also be the cause, and one of them is the relative situation with respect to a same-resource job. In the example of FIG. 2, in the “normal state” (upper part of the figure), job X ends while job A is being processed, that is, at time t1, which is earlier than the start time t2 of job B. In the “abnormal state” (lower part of the figure), in which a failure occurred in job X, the processing end time t3 of job X is later than the start time t2 of job B, either because the start of job X was delayed for some reason or because the processing time of job X itself became longer. Here, the “normal state” is the state that, for a given point of interest, has occurred with the highest probability in past operation, and the “abnormal state” is any other state. In practice, a threshold may be set on the occurrence probability, with states at or above the threshold treated as the “normal state” and states below it as the “abnormal state”. The “normal state” may also include two or more states.
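The threshold rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function name `classify_states`, the state labels, and the threshold value 0.2 are all hypothetical.

```python
from collections import Counter


def classify_states(history, threshold):
    """Split observed states into normal and abnormal by relative frequency.

    history:   list of hashable state labels, one per past batch run.
    threshold: minimum occurrence probability for a state to count as normal.
    """
    counts = Counter(history)
    total = len(history)
    # A state is "normal" when its past occurrence probability is at or
    # above the threshold; more than one state may qualify.
    normal = {state for state, n in counts.items() if n / total >= threshold}
    abnormal = set(counts) - normal
    return normal, abnormal


# Hypothetical history: how job X ended relative to job B's start time t2.
history = ["X_ends_before_B"] * 18 + ["X_overlaps_B"] * 2
normal, abnormal = classify_states(history, threshold=0.2)
```

With this history, only the state observed in 18 of 20 runs reaches the threshold, so an overlap on the day of a failure would be flagged as a departure from the normal state.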

In this situation, if job B and job X both access the same database 20, as shown in the figure, that is, if job B is a same-resource job, job X may terminate abnormally. In the normal state, job B and job X access the database 20 at entirely different times. When their processing times overlap between times t2 and t3, however, job X may be unable to access the database 20, or job B may perform an unscheduled overwrite so that job X cannot refer to the correct data. In such a case a failure-related connection arises between job X and job B. Because the two jobs are not directly related, and because it is hard even to recognize that the processing time of job X had shifted from its usual timing, finding that connection by hand is not easy.
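The overlap between t2 and t3 in this example amounts to an interval intersection test on recorded start and end times. The sketch below assumes such times are available as `datetime` values; the function name `overlap_seconds` and the concrete clock times are hypothetical and chosen only to mirror the figure.

```python
from datetime import datetime


def overlap_seconds(start_a, end_a, start_b, end_b):
    """Return the overlap of two time intervals in seconds (0.0 if none)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(0.0, (earliest_end - latest_start).total_seconds())


# Hypothetical times: job B starts at t2 = 02:00.
job_b = (datetime(2008, 1, 31, 2, 0), datetime(2008, 1, 31, 2, 30))
# Normal state: job X ends at t1 = 01:50, before job B starts.
job_x_normal = (datetime(2008, 1, 31, 1, 0), datetime(2008, 1, 31, 1, 50))
# Abnormal state: job X ends at t3 = 02:10, after job B has started.
job_x_today = (datetime(2008, 1, 31, 1, 20), datetime(2008, 1, 31, 2, 10))

normal_overlap = overlap_seconds(*job_x_normal, *job_b)
today_overlap = overlap_seconds(*job_x_today, *job_b)
```

A nonzero overlap between two jobs that share a resource, where past runs show none, is the kind of event that would be reported as a cause candidate.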

Furthermore, if job A and job B are processed on a different server from job X, or are managed by a different department, it becomes even harder to conclude that the abnormal termination of job X was caused by the overlap of its processing time with that of job B. The more jobs are processed in parallel, the harder it is to grasp how far a change in execution status reaches, and identifying the cause becomes close to impossible. In this embodiment, therefore, when a failure occurs, the emergence of such incidental connections is detected from the four viewpoints above. Combinations of jobs and resources that may be affecting the failure (hereinafter also called “cause candidates”) are extracted, and the probability that each could be the cause of the failure is presented to the user, which makes the investigation more efficient and supports recovery work. Depending on the situation detected, a “cause candidate” may consist of a job only or a resource only.

FIG. 3 shows the configuration of the first server 12 in more detail. The second server 14, the third server 16, and the fourth server 18 may have the same configuration. In addition to the hard disk 13, the first server 12 includes: a job registration unit 32 with which the user registers job flows and the like; a used resource information acquisition unit 34 that acquires information on the resources each job uses; a job information storage unit 42 that stores job flows, used resource information, and the like; a job processing unit 36 that processes registered jobs; a failure cause detection unit 38 that detects the cause when a failure occurs; an execution status storage unit 44 that stores the various logs of the batch process in which a failure occurred and past execution status; a cause probability storage unit 46 that stores, for each situation, a preset probability that the situation is the cause of a failure; and an output unit 40 that outputs information on the detected failure cause.

Each element shown in FIG. 3 as a functional block that performs various processing can be implemented in hardware by a CPU, memory, and other LSIs, and in software by programs that perform computation, file operations, network communication, and the like. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, software alone, or a combination of the two, and they are not limited to any one of these.

The job registration unit 32 is an interface with which the user registers the information needed to process jobs, such as the job flow, that is, the processing performed in each job, and the order in which multiple jobs are processed in a batch. The job registration unit 32 may be, for example, a combination of a display device showing a registration screen and an input device, such as a keyboard or pointing device, used to make entries on that screen; devices used in general job processing systems can be applied. The registered information is stored in the job information storage unit 42.

The used resource information acquisition unit 34 extracts the resources and other items used by each job from the job flows accepted by the job registration unit 32 and creates a used resource information table. This table associates the names of the jobs batch processed in the job processing system 10 with the resources they use, the servers, the resource usage rates, and the like. The job registration unit 32 stores the job flow data in the job information storage unit 42 in a predetermined format with fields for items such as the hard disk used for input and output and the server accessed, so the used resource information acquisition unit 34 can extract the server name, the used resources, and so on from the relevant fields of that data. The resulting used resource information table is also stored in the job information storage unit 42. The used resource information reveals the resources used by the failed job, the same-resource jobs, the resource usage rates of those jobs, and so on, and is used for cause detection.
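One way to picture the used resource information table and the lookup of jobs that share a resource with the failed job is the sketch below. The record layout `ResourceUse`, the function `same_resource_jobs`, and the sample rows are assumptions for illustration; the patent does not specify a concrete schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUse:
    """One row of a hypothetical used resource information table."""
    job: str
    server: str
    resource: str
    usage_rate: float  # fraction of the resource the job is expected to use


def same_resource_jobs(table, failed_job):
    """List (job, resource, usage_rate) for other jobs sharing a resource."""
    used = {row.resource for row in table if row.job == failed_job}
    return sorted(
        (row.job, row.resource, row.usage_rate)
        for row in table
        if row.job != failed_job and row.resource in used
    )


# Hypothetical table contents.
table = [
    ResourceUse("JobA", "server1", "hard disk 13", 0.30),
    ResourceUse("JobB", "server1", "database 20", 0.60),
    ResourceUse("JobX", "server2", "database 20", 0.25),
]
peers = same_resource_jobs(table, "JobX")
```

Because the lookup is keyed on the resource rather than the server, job B is found even though it runs on a different server from job X.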

The job processing unit 36 reads information such as the job flows and job processing order registered by the user from the job information storage unit 42 and executes the jobs. Techniques used in general job processing systems can be applied here. The job processing unit 36 writes data to and reads data from the hard disk 13 according to the processing content of each job. The hard disk 13 is assumed to be accessible via the network 22 from the other servers as well, namely the second server 14, the third server 16, and the fourth server 18.

While processing jobs, the job processing unit 36 causes the execution status storage unit 44 to store execution status information, such as job logs and error logs that represent the execution status for that day. The execution status storage unit 44 accumulates and stores, as execution status information, the logs recorded in the batch process in which a failure occurred and the information contained in the logs of past batch processes. The accumulated record of past batch processes is used to determine statistically whether a given state is the “normal state” or an “abnormal state”.

In FIG. 2, the judgment that job X does not overlap the processing time of job B in the “normal state” becomes harder to set in advance the weaker the direct relationship between jobs A and B and job X. As a result, it generally cannot be determined that an overlap between the processing times of job X and job B is an “abnormal state”, which makes it difficult to attribute the abnormal termination of job X in FIG. 2 to that overlap. The “normal state” is therefore defined from past execution status data, and when a failure occurs, the failure cause is detected by extracting the events in the execution status of that batch process that differed from the “normal state”.

The past execution status information may take the form of the logs that the job processing unit 36 records each time it performs batch processing, or may consist only of the information extracted for use in cause detection. For example, storing the start time and end time of each job makes it possible to define what the “normal state” is in an environment such as that shown in FIG. 2. The start and end times of file transfers during each job, the start and end times of access to resources such as hard disks, and similar items may also be stored as execution status information so that processing times can be evaluated at a finer granularity. Other specific examples are described later.

The execution status storage unit 44 further stores each job that became a failed job in the past in association with the cause candidates whose probability of being the cause reached or exceeded a predetermined threshold when that failure occurred. This information is used to evaluate, from how often a candidate has caused failures in the past, the probability that a cause candidate for the current failure is in fact the cause. If the true cause is identified after a failure, the user may register it in the execution status storage unit 44 to improve the accuracy of the occurrence frequency data.
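As a rough sketch of how such stored associations could yield an occurrence frequency, the example below counts, for one failed job, how often each cause candidate was recorded in the past. The function `cause_frequencies`, the record format, and the sample records are hypothetical; how the frequency is then converted into a probability is not fixed by this passage.

```python
from collections import Counter


def cause_frequencies(past_records, failed_job):
    """Return the relative frequency of each recorded cause for failed_job.

    past_records: list of (failed job name, cause candidate) pairs.
    """
    causes = Counter(c for job, c in past_records if job == failed_job)
    total = sum(causes.values())
    if total == 0:
        return {}  # no past failures recorded for this job
    return {cause: n / total for cause, n in causes.items()}


# Hypothetical past records of failed jobs and their cause candidates.
records = [
    ("JobX", "JobB / database 20"),
    ("JobX", "JobB / database 20"),
    ("JobX", "drive E of hard disk 15"),
    ("JobA", "hard disk 13"),
]
freq = cause_frequencies(records, "JobX")
```

Registering a confirmed true cause, as the paragraph suggests, would simply append another record and shift these frequencies.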

The cause probability storage unit 46 stores cause probability information that associates each situation of a cause candidate, as evaluated from the four viewpoints above, with a numerical probability expressing how likely that situation is to be the cause of a failure. The cause probability information is set in advance based on experience to date, results in the production environment, test results, and the like. Alternatively, it may be set theoretically, taking into account factors such as the range each situation affects. Specific examples are described later.

When a failure occurs, the failure cause detection unit 38 refers to the used resource information and the execution status information and detects, from the four viewpoints above, cause candidates considered to be possible causes of the failure. Then, based on the cause probability information stored in the cause probability storage unit 46, it obtains the probability that each candidate is the cause of the failure. The specific method is described later.

In this embodiment, failure causes are detected across servers. For that purpose, the used resource information and the execution status information for all jobs batch processed in the job processing system 10 are shared among the first server 12, the second server 14, the third server 16, and the fourth server 18, regardless of which job is processed on which server. To achieve this, each time the information is updated on one server, the update is sent to the other servers, and each server updates the information it holds. Alternatively, the job information storage unit 42 and the execution status storage unit 44 of one server may be made accessible from the other servers so that all servers refer to the same information.

The output unit 40 may be a general output device such as a display device or a printer. It outputs the cause candidates detected by the failure cause detection unit 38 together with the probability that each is the cause of the failure.

Next, the operation of the job processing system 10 configured as described above will be explained. FIG. 4 is a flowchart showing the processing procedure when a failure occurs. The figure shows the case where the probability of each failure cause is obtained from multiple angles by performing all of the evaluations from the four viewpoints above, but the evaluation may instead use any one of the viewpoints or any combination of them.

First, a failure occurs while a job is being processed (S32). The failure cause detection unit 38 then collects error information about the resources used and obtains, for each error content, the probability that it is the cause, thereby performing the evaluation based on errors in the used resources (S34). Specifically, it refers to the used resource information in the job information storage unit 42 and extracts the resources used by the failed job. It then traces the error logs backward in time from the moment the failure occurred and collects the error log entries for those resources. Next, based on the error content recorded in each entry, it refers to the cause probability information stored in the cause probability storage unit 46 and obtains, for each error, the probability that the error is the cause.

Next, the failure cause detection unit 38 performs the evaluation based on the status of same-resource jobs, that is, jobs that use the same resource as the failed job (S36). Specifically, it refers to the used resource information and obtains the same-resource jobs and the rate at which each of them utilizes the shared resource. It then obtains from the cause probability information, according to that utilization rate, the probability that each job is the cause of the failure. The resource utilization rate is given here as a representative example of the status of a same-resource job, but the status is not limited to it: any parameter that describes the processing status of the same-resource job and is related to the failure may be used. For example, the ratio of the job's processing time on the day in question to its average processing time may be used. In that case, the average processing time is stored for each job in the execution status storage unit 44.

Next, the failure cause detection unit 38 performs the evaluation based on the relative status of the failed job and the same-resource jobs (S38). Specifically, it determines whether the relative relationship between the processing times of the failed job and a same-resource job is in an "abnormal state" and, if so, obtains from the cause probability information, according to the situation, the probability that the same-resource job is the cause of the failure. The detailed procedure is described later. The failure cause detection unit 38 further refers to the execution status storage unit 44 and performs the evaluation based on past occurrence frequency (S40): it obtains the probability that each cause candidate is the cause of the failure from the frequency with which that candidate was the cause, or was estimated to be the cause, when the failed job failed in the past.

Once the probability that each cause candidate is the cause has been obtained from each of the four viewpoints above, these probabilities are multiplied together for each cause candidate to calculate the final probability that the candidate is the cause of the failure (S42). The "probability" used in the description so far, and the multiplication of probabilities, are only examples; the calculation method is not limited as long as it weights the candidates through evaluation from multiple angles and aggregates the evaluation results. For example, "probability" may be replaced with a "degree of influence" expressed as an index distributed around 1. In addition, when the probability obtained from a particular viewpoint exceeds a predetermined threshold, a correction such as weighting may be applied so that the final probability of that cause candidate becomes higher.
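The multiplication in S42 can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, and the probability values are hypothetical and do not come from the specification.

```python
from math import prod

def combine(per_viewpoint):
    """Multiply the per-viewpoint probabilities of each cause candidate (S42)."""
    return {name: prod(probs) for name, probs in per_viewpoint.items()}

def ranked(per_viewpoint):
    """Return (candidate, final probability) pairs, highest probability first."""
    return sorted(combine(per_viewpoint).items(),
                  key=lambda item: item[1], reverse=True)

# Hypothetical values (assumed for illustration), listed in the order:
# resource errors, same-resource job status, relative status, past frequency.
example = {
    "job X": [0.9, 0.95, 0.9, 0.8],
    "job Y": [0.5, 0.95, 0.5, 0.4],
}

for name, probability in ranked(example):
    print(f"{name}: {probability:.4f}")
```

Sorting by the final value matches the way the output is meant to be used, namely verifying candidates in descending order of probability.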

The output unit 40 then outputs each cause candidate in association with the probability that it is the cause (S44). The user checks this output, investigates the cause of the failure, for example by verifying the candidates in descending order of probability, and responds to the failure by resolving the problem (S46).

Next, the evaluation method of S38, based on the relative status of the failed job and the same-resource jobs, will be described in more detail. FIG. 5 shows the processing procedure by which the failure cause detection unit 38 detects and evaluates failure causes from the relative processing status of the failed job and the same-resource jobs. First, the jobs for which a "normal state" or "abnormal state" is to be determined are narrowed down (S50). Specifically, the same-resource jobs obtained by referring to the used resource information in S36 of FIG. 4 are taken as the target jobs.

Next, past execution status information is read from the execution status storage unit 44, and a "normal state" is defined for the relative relationship between the failed job and each target job (S52). The items for which the "normal state" is determined (hereinafter called "judgment items") are set in advance and stored in the execution status storage unit 44; they are chosen from events that may cause a failure and that can be detected through a change in the relative relationship. For example, to detect the cause of a failure such as the one shown in FIG. 2, "overlap of job processing times" is included in the judgment items. The "normal state" is then defined as the processing times of job X and job B not overlapping.

The overlap of job processing times can be calculated from the start time and end time of each job in the execution status information. Similarly, the overlap of file transfer times, the overlap of resource access times, and the like may be used as judgment items. For these overlaps, normal or abnormal may be determined as a binary event, that is, whether there is any overlap at all, or it may be determined from more detailed information such as the degree of overlap.
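As a rough sketch of how such an overlap might be computed from start and end times, the following helper returns the overlapping duration of two processing intervals. The function name and the example times are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def overlap(start_a, end_a, start_b, end_b):
    """Return how long two processing intervals overlap (zero if they do not)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(earliest_end - latest_start, timedelta(0))

# Hypothetical times for job X and job B on one night.
x_start, x_end = datetime(2008, 1, 31, 1, 0), datetime(2008, 1, 31, 2, 30)
b_start, b_end = datetime(2008, 1, 31, 2, 0), datetime(2008, 1, 31, 3, 0)

duration = overlap(x_start, x_end, b_start, b_end)
has_overlap = duration > timedelta(0)  # the binary form of the judgment item
```

The boolean covers the two-event judgment, while the duration itself can serve as the more detailed "degree of overlap".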

In either case, an event whose actual occurrence probability is equal to or higher than a predetermined threshold is defined as the "normal state". For example, in the case of FIG. 2, if the probability that the processing times of job B and job X do not overlap at all is 80% or more, that state is defined as the "normal state". Alternatively, the normal state may be defined using the standard deviation of the distribution of the degree of overlap. The event with the highest occurrence probability may also be taken as the "normal state". Any other statistical means may be used for the definition. Failure cases other than the example shown in FIG. 2, and specific examples of the judgment items set in order to detect their causes, are described later.
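One possible way to derive a "normal state" from past runs is sketched below: the most frequent observed event is accepted as normal only if its relative frequency reaches the threshold. The function name, the event labels, and the history are hypothetical; the 80% threshold is the example value from the text.

```python
from collections import Counter

def define_normal_state(observations, threshold=0.8):
    """Return the most frequent event if its frequency reaches the threshold, else None."""
    if not observations:
        return None
    event, count = Counter(observations).most_common(1)[0]
    return event if count / len(observations) >= threshold else None

# Hypothetical history of one judgment item over ten past nightly runs.
history = ["no_overlap"] * 9 + ["overlap"]
normal = define_normal_state(history)
```

Returning `None` when no event is frequent enough is a design choice of this sketch; the text equally allows taking the most frequent event unconditionally.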

Next, the log of the current batch run stored in the execution status storage unit 44 is read, and among the judgment items for which a "normal state" has been defined, events that deviate from the "normal state" are extracted as "abnormal states" (S54). In the example of FIG. 2, the event that the processing times of job X and job B overlap is detected as an "abnormal state". Then, by referring to the per-event probabilities of being a failure cause stored in advance in the cause probability storage unit 46, the probability that the cause candidate that produced the event extracted in S54 is the cause of the failure is obtained (S56).

Next, failure cases other than the one shown in FIG. 2 are given to illustrate the judgment items used for cause detection. FIGS. 6 and 7 illustrate failure cases whose causes can be detected by evaluating the relative status of jobs. The figures are drawn in the same way as FIG. 2: the upper row shows the "normal state" and the lower row shows the "abnormal state". In the example of FIG. 6, job X is processed in the "normal state" while job A is being processed, and job B starts after job A finishes. As shown in the figure, job X includes processing that deletes a file on a hard disk, and job B includes processing that creates a file and stores it on the same hard disk.

In such a configuration, if the file that job X deletes is large, the capacity of the hard disk may be exceeded unless job X deletes the file before job B stores its file. In that case the file cannot be stored and job B may terminate abnormally. The "abnormal state" in FIG. 6 shows this situation: for some reason the processing start time of job X is delayed until after the processing start time of job B, and as a result job B terminates abnormally. The same failure can occur when the start time of job B is brought forward or the processing time of job X becomes longer. To detect the cause of the abnormal termination of job B in such a case, the order of the jobs relative to each other is included in the judgment items.

For the order of jobs, more precise criteria may be used as judgment items, such as the order of start times, the order of one job's start time and another's end time, or the order of the times at which the hard disk is accessed. With the order of jobs as a judgment item, the "normal state" is defined from past execution status information as job X finishing before the processing start time of job B. From the execution status information for the day in question, the fact that the processing start time of job X is later than that of job B is then detected as an "abnormal state". In a failure like the one in the figure, job X itself has merely started late and is not in a situation where it raises an error, so it is difficult to identify the cause of the failure by checking error logs alone. In this embodiment, as described above, the possibility that the delayed start of job X is the cause of the failure can easily be found.

In the example of FIG. 7, job X and job Y are processed in the "normal state" while job A is being processed, and job B starts after job A finishes. As shown in the figure, job B, job X, and job Y all include processing that transfers a file over the same network or bus. In such a configuration, if the file transfers of job X and job Y consume most of the usable network bandwidth, a file transfer attempted by job B at the same time may fail with a transfer error due to network congestion, and job B may terminate abnormally.

The "abnormal state" in FIG. 7 shows this situation: for some reason the processing start times of job X and job Y are delayed, the file transfers of job X, job Y, and job B coincide, and as a result job B terminates abnormally. To detect the delayed start of job X and job Y as the cause of the abnormal termination of job B in such a case, the overlap of job processing times or the overlap of file transfer times may be included in the judgment items as described above, or the number of jobs involving file transfer that are processed in parallel may be included.

Jobs that involve file transfer may be distinguished from other jobs by having the user mark them when registering the job flow, so that identification information is attached to the job, or the used resource information acquisition unit 34 may analyze the job flow and include this fact in the used resource information described above. Alternatively, whether a job performs file transfer may be obtained from the job's log when it is actually processed. Then, for the "normal state", the overlap of the processing times of jobs involving file transfer is determined from their start times and end times, and the maximum number of such jobs processed simultaneously is defined. In the example of FIG. 7, it can be defined from the execution status information up to the previous day that at most two jobs involving file transfer are processed simultaneously. From the execution status information for the day in question, the fact that the processing start time of job B precedes the processing end times of job X and job Y, so that three jobs involving file transfer are processed at the same time, is then detected as an "abnormal state".
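The maximum number of simultaneously processed file-transfer jobs can be obtained from start and end times with a simple sweep over sorted events, as in the sketch below. The function name and the interval values (minutes since the start of the batch window) are assumptions chosen to mirror the FIG. 7 scenario.

```python
def max_concurrent(intervals):
    """Return the largest number of intervals that are active at the same time."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a job starts
        events.append((end, -1))    # a job ends
    # At equal times, process ends (-1) before starts (+1) so that
    # back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# Hypothetical (start, end) pairs for job X, job Y, and job B.
previous_day = [(0, 10), (2, 12), (15, 20)]
current_day = [(5, 16), (7, 17), (15, 20)]

normal_limit = max_concurrent(previous_day)
is_abnormal = max_concurrent(current_day) > normal_limit
```

Here the limit learned from the previous day is two, and the current day reaches three, which corresponds to the "abnormal state" described above.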

In this kind of failure as well, job X and job Y themselves are not in a situation where they raise errors, so they are hard to recognize as the cause of the abnormal termination of job B, and identifying the cause takes time with general methods. In the end, measures such as rerunning the jobs for verification are taken, but if the failure is poorly reproducible, the cause may never be identified. By defining the "normal state" as described above and determining the "abnormal state" by comparison with the situation when the failure occurred, the cause can be identified and recovery carried out efficiently. FIG. 7 shows jobs that perform transfers over a network, but the same detection is possible when the network is replaced with a memory, an output device, or the like.

Next, the used resource information will be described. FIG. 8 shows an example of the data structure of the used resource information created by the used resource information acquisition unit 34. The used resource information table 100 includes a job name column 102, a server column 104, a resource column 106, a resource utilization rate column 108, and a remarks column 110. The used resource information acquisition unit 34 adds entries to the used resource information table 100 each time a new job flow is registered.

The job name column 102 holds the name of the job registered by the user. The server column 104 holds the name of the server to which the resource used by each job belongs. The resource column 106 holds identification information for the resource used. The resource utilization rate column 108 holds the rate, relative to the capacity of the resource, at which each job utilizes the resource listed in the resource column 106. This need not be set for processing that permits simultaneous access, such as referencing a database. The remarks column 110 holds an outline of the processing performed when the resource is used. As described above, the data in each column can basically be obtained by the used resource information acquisition unit 34 analyzing the job flow data. The resource utilization rate in column 108, however, may be calculated automatically by the used resource information acquisition unit 34, for example from the difference between snapshots taken before and after the job is actually processed, or the user may register values measured on a development machine or the like.

In FIG. 8, for example, the job named "job B" accesses the database 20 of the first server 12 in order to reference data (third row). It also creates data and stores it on drive D of the hard disk 13 of the first server 12 (fourth row), writing to 10% of drive D. It further transfers data using a LAN card (fifth row), with a LAN card utilization rate of 30%. The same applies to "job A", "job X", and "job Y".
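A table of this shape could be represented as in the following sketch, which also shows how same-resource jobs might be looked up for a failed job. The class, field, and function names are hypothetical. The job B rows follow the values stated in the text; the server shown for the LAN card and the rows for job X and job Y are assumptions, with only their LAN card rates of 60% and 30% taken from the description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ResourceUsage:
    job: str
    server: str
    resource: str
    utilization: Optional[float]  # None when no rate is set (e.g. concurrent reads)
    remarks: str

TABLE = [
    ResourceUsage("job B", "first server", "database", None, "reference data"),
    ResourceUsage("job B", "first server", "drive D", 0.10, "create and store data"),
    # The server of the LAN card is assumed here for illustration.
    ResourceUsage("job B", "first server", "LAN card", 0.30, "data transfer"),
    ResourceUsage("job X", "first server", "LAN card", 0.60, "data transfer"),
    ResourceUsage("job Y", "first server", "LAN card", 0.30, "data transfer"),
]

def same_resource_jobs(table, failed_job):
    """Return rows of other jobs that use a resource also used by the failed job."""
    used = {(row.server, row.resource) for row in table if row.job == failed_job}
    return [row for row in table
            if row.job != failed_job and (row.server, row.resource) in used]

candidates = same_resource_jobs(TABLE, "job B")
```

Keying the match on the (server, resource) pair reflects that a resource belongs to a particular server in the table.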

The data structure of the used resource information is not limited to that shown in FIG. 8. Parameters that describe properties of a job known in advance and that can be used to detect failure causes are selected as appropriate, taking into account the processing contents, the configuration of the job processing system, past failure cases, and so on. Including such detailed records in the used resource information enables more accurate failure cause detection.

Next, the cause probability information that the failure cause detection unit 38 refers to in S56 of FIG. 5, when obtaining the probability that the cause candidate that produced an event detected as an "abnormal state" is the cause of the failure, will be described. FIG. 9 shows an example of an execution-status cause probability table, which is set for events that are "abnormal states" and stored in the cause probability storage unit 46. The execution-status cause probability table 120 includes a related resource column 122, an event column 124, and a probability column 126.

Even when the order of jobs is in a similar "abnormal state", the probability of its being the cause differs depending on which resource links those jobs to the failure. It is therefore desirable to consider the degree of influence of an "abnormal state" for each resource used and to set the probabilities more precisely. For this reason, the related resource column 122 holds the resource used in common by the failed job and the target job, so that the probability can be set depending on it. The event column 124 holds the events expected to be extracted as "abnormal states", and the probability column 126 holds the probability that each event is the cause.

For example, suppose that the order of the processing times of the failed job and a target job using the same "database" is in an "abnormal state": in the "normal state" the target job is defined as being processed after the failed job, whereas on the day in question the two were processed at the same time. In this case, the probability that the event of simultaneous processing is the cause of the failure is set to "90%" in the example of FIG. 9, as shown in the second row of the execution-status cause probability table 120. This is the situation of the failure shown in FIG. 2; because contention for access to the database occurs, the probability of being the cause is set high.

On the other hand, even if the fact that the target job was not processed is extracted as an "abnormal state", no access contention occurs, so it is unlikely to be the cause of the failure. The probability is therefore set to "0%", as shown in the third row of the execution-status cause probability table 120. When the resource used in common is a hard disk or a LAN card, a probability is likewise set for each event extracted for that resource.

However, when the probabilities evaluated from the four viewpoints are multiplied together for each cause candidate, setting the cause probability for the "normal state" to 0% would make the final probability 0% no matter how high the probabilities from the other viewpoints are. To guard against this, the probability of being the cause in the "normal state" may be set to, for example, "50%", as in the sixth row of the execution-status cause probability table 120 in FIG. 9. The probability for the "normal state" may also differ for each resource used in common. In this case, the failure cause detection unit 38 refers to the execution-status cause probability table 120 to obtain the probability even in the "normal state".
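A lookup against such a table might look like the sketch below. The 90%, 0%, and 50% values are the ones mentioned in the text; the key layout, the event labels, and the use of a single resource-independent value for the "normal state" are assumptions of this illustration.

```python
# (related resource, event) -> probability of being the failure cause.
EXECUTION_STATUS_PROBABILITY = {
    ("database", "processed simultaneously"): 0.90,
    ("database", "target job not processed"): 0.00,
}

# Non-zero value for the "normal state" so that it does not force the
# multiplied result to zero. One shared value is assumed here.
NORMAL_STATE_PROBABILITY = 0.50

def event_probability(resource, event):
    """Look up the cause probability for an event on a shared resource."""
    if event == "normal":
        return NORMAL_STATE_PROBABILITY
    return EXECUTION_STATUS_PROBABILITY[(resource, event)]

# With a hypothetical 0.95 from another viewpoint, a normal relative status
# halves the product instead of eliminating the candidate.
combined = 0.95 * event_probability("database", "normal")
```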

For the cause probability information, including the execution-status cause probability table 120, basic entries may be set in advance and stored in the cause probability storage unit 46, and the user may be allowed to update them so that the judgment items and probability values match actual operation.

As described above, the state defined as the "normal state" is an event whose actual occurrence probability is equal to or higher than a predetermined threshold. This threshold may be set individually as appropriate according to the resource used in common, the server, the processing contents of the job, and so on. For example, when the jobs processed are all of a kind for which a slight change in access order is unlikely to cause a failure, extracting every small change as an "abnormal state" makes it harder to narrow down the cause. In such a case, the threshold is set low in order to widen the range of events judged to be the "normal state". Adjusting the proportion of events extracted as "abnormal states" in this way, according to the processing contents of the jobs, the resources, and so on, prevents meaningless events from being extracted and makes cause detection more efficient.

For the same reason, after extracting an event that is in an "abnormal state" for a judgment item and obtaining from the execution-status cause probability table 120 the probability that the event is the cause, the failure cause detection unit 38 may correct the obtained cause probability according to the actual occurrence probability of the "normal state" for that judgment item. For example, suppose that the threshold for the "normal state" is set to X%, that the actual occurrence probability of the "normal state" for a judgment item is Y%, and that an "abnormal state" is extracted for that item with a cause probability of Z%. The corrected cause probability Z′% is then derived as follows, where X, Y, and Z are expressed as fractions between 0 and 1, as the worked example in the description implies:

Z′ = Z + (1 − Z) × |Y − X| / (1 − X)

With this correction, for a judgment item whose actual occurrence probability Y% exceeds the threshold X% and which is therefore almost always in the "normal state", an event extracted as an "abnormal state" is highly anomalous, so its probability of being the failure cause is increased. For example, when the threshold is 80%, the occurrence probability of the "normal state" is 84%, and the cause probability is 90%, the corrected cause probability is 92%. Such a correction yields a more accurate cause probability. The above formula is only an example; it can be used, for instance, when tests on a development machine or actual operation have shown that the relationship between the occurrence probability Y% of the "normal state" for a judgment item and the probability that the "abnormal state" of that item caused a failure is approximately linear. Other functions derived from various probability distributions may be selected as appropriate for the correction.
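The correction can be checked numerically with a short sketch. The function and parameter names are hypothetical; X, Y, and Z are passed as fractions, and the sketch assumes Y is at least X, which holds when the "normal state" is defined by its frequency reaching the threshold.

```python
def corrected_probability(z, x, y):
    """Apply Z' = Z + (1 - Z) * |Y - X| / (1 - X), with all values as fractions."""
    if not 0.0 <= x < 1.0:
        raise ValueError("threshold x must satisfy 0 <= x < 1")
    return z + (1.0 - z) * abs(y - x) / (1.0 - x)

# Values from the worked example: threshold 80%, actual frequency 84%,
# uncorrected cause probability 90%.
z_prime = corrected_probability(z=0.90, x=0.80, y=0.84)
print(f"{z_prime:.2f}")
```

Under these inputs the remaining 10% of headroom is scaled by 0.04 / 0.20, which adds two percentage points and gives the 92% stated in the description.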

Next, of the four viewpoints above, a specific example of detecting the failure cause from error information about the resources used by the failed job will be described. FIG. 10 shows an example of a resource-specific probability table, which is the cause probability information referred to by the failure cause detection unit 38 in S34 of FIG. 4. The resource-specific probability table 130 includes a resource column 132, an error content column 134, and a probability column 136. If the error content listed in the error content column 134 is recorded in the error log for the resource listed in the resource column 132, the probability listed in the probability column 136 is the probability that the error is the cause. In practice, multiple error contents may be set in the error content column 134 for one resource in the resource column 132, with a probability set in the probability column 136 for each error content.

In the case of the failure shown in FIG. 2, the used resource information table 100 shown in FIG. 8 indicates that job X, in which the failure occurred, uses the database. The error log of that database is then examined, and if an error with the content "snapshot too old" is recorded, the probability that the database is the cause of the failure is taken to be "95%".
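The lookup in S34 could be sketched as follows. The single table entry reproduces the "snapshot too old" example from the text; the dictionary layout, the function name, and the log entries are assumptions for illustration only.

```python
# (resource, error content) -> probability that the error is the failure cause.
RESOURCE_ERROR_PROBABILITY = {
    ("database", "snapshot too old"): 0.95,
}

def error_probabilities(used_resources, error_log):
    """Return probabilities for logged errors on resources the failed job uses.

    error_log is a list of (resource, error content) pairs, newest first.
    """
    result = {}
    for resource, content in error_log:
        key = (resource, content)
        if resource in used_resources and key in RESOURCE_ERROR_PROBABILITY:
            result[key] = RESOURCE_ERROR_PROBABILITY[key]
    return result

# Hypothetical log entries collected backward from the time of the failure.
log = [("database", "snapshot too old"), ("printer", "paper jam")]
found = error_probabilities({"database"}, log)
```

Entries for resources the failed job does not use, such as the hypothetical printer error here, are ignored.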

Next, of the four viewpoints above, a specific example of detecting the failure cause from the status of a job that uses the same resource as the failed job (a same-resource job) will be described. FIG. 11 shows an example of a probability table by resource utilization rate, which is the cause probability information that the failure cause detection unit 38 refers to in S36 of FIG. 4. The resource utilization rate probability table 140 includes a column 142 for the utilization rate relative to the free space of the resource and a probability column 144. The utilization rate relative to free space is the resource utilization rate of each job, as set in the resource utilization rate column 108 of the use resource information table 100, converted into a utilization rate relative to the free space of that resource immediately before the job is processed.

If the resource is a hard disk or memory, the free space is the capacity of the storage area in which no data is stored; if it is a network, it is the unused bandwidth. In other words, it is the portion not occupied by other jobs or data and available to the job itself. The free space of a resource immediately before job processing can be obtained with the snapshot function provided by a typical operating system.

As the utilization rate relative to the free space increases, other jobs become unable to use the resource and failures become more likely. The resource utilization rate probability table 140 is therefore set accordingly. A resource utilization rate probability table 140 may also be prepared for each resource. In the case of the failure shown in FIG. 7, the use resource information table 100 shown in FIG. 8 indicates that the jobs using the same LAN card as job B, the failed job, are job X and job Y, and that the utilization rates of the LAN card by job X and job Y are "60%" and "30%", respectively. Therefore, even if each job had been using the LAN card alone, the probability that job X and the LAN card are the cause and the probability that job Y and the LAN card are the cause are both "95%".
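The conversion to a utilization rate relative to free space, followed by a table lookup, might look like the following sketch. The bucket boundaries and probabilities are hypothetical, since the values of FIG. 11 are not reproduced in this passage; the function names and the capping of the ratio at 100% are also assumptions.

```python
from bisect import bisect_right

# Hypothetical stand-in for the resource utilization rate probability table 140.
# Probabilities increase with utilization, as the text requires.
UTILIZATION_BOUNDS = [0.25, 0.50, 0.75]       # upper edges of the lower buckets
UTILIZATION_PROBS = [0.50, 0.70, 0.85, 0.95]  # one probability per bucket


def utilization_of_free_space(job_usage, free_space):
    """Convert a job's share of the whole resource into a share of the free space.

    Both arguments are fractions of the whole resource (0.0 to 1.0).
    """
    if free_space <= 0:
        return 1.0  # assumption: no free space counts as full utilization
    return min(job_usage / free_space, 1.0)


def utilization_probability(job_usage, free_space):
    """Look up the cause probability for a same-resource job."""
    ratio = utilization_of_free_space(job_usage, free_space)
    return UTILIZATION_PROBS[bisect_right(UTILIZATION_BOUNDS, ratio)]


# A job that needs 30% of a resource of which only 50% was free
# occupies 60% of the free space.
print(utilization_probability(0.30, 0.50))
```

Because the free space is sampled immediately before the job runs, the same job can fall into a different bucket from one batch run to the next.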

The above is the case where the resource utilization rate is adopted as the status of the same-resource job itself, but, as noted earlier, the statuses usable in this embodiment are not limited to it. For example, the processing time of the same-resource job may be examined, or the two may be evaluated in combination. FIG. 12 shows an example of a processing-time probability table in which a probability is set for each ratio of the current day's processing time to the average processing time of the same-resource job. The processing-time probability table 150 includes a column 152 for the current day's processing time / average processing time and a probability column 154.

Using this table, the probability that the same-resource job can be the cause of the failure is obtained from the ratio of the current day's processing time to the past average processing time. The longer the processing time of the same-resource job is compared with usual, the more likely it is that some problem has occurred in the resource or that the job has affected the failed job, so the processing-time probability table 150 is set accordingly. The average processing time of each job is calculated and stored in the execution status storage unit 44 each time the job processing unit 36 processes the job. The current day's processing time is derived by the failure cause detection unit 38 from the current day's job log, which the job processing unit 36 has stored in the execution status storage unit 44.
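As a rough sketch of this bookkeeping, the average can be maintained incrementally after each run and the current day's ratio looked up in a table. The table rows below are hypothetical (the values of FIG. 12 are not given in this passage), as are the function names and the choice of an incremental mean.

```python
def update_average(previous_average, runs_so_far, new_duration):
    """Incremental mean: fold one more run into the stored average."""
    return previous_average + (new_duration - previous_average) / (runs_so_far + 1)


# Hypothetical stand-in for the processing-time probability table 150:
# (minimum ratio of today's time to the average, probability), highest ratio first.
PROCESSING_TIME_TABLE = [(2.0, 0.95), (1.5, 0.85), (1.2, 0.70), (0.0, 0.50)]


def processing_time_probability(today_duration, average_duration):
    """Return the cause probability for a same-resource job from its slowdown."""
    if average_duration <= 0:
        raise ValueError("average duration must be positive")
    ratio = today_duration / average_duration
    for min_ratio, probability in PROCESSING_TIME_TABLE:
        if ratio >= min_ratio:
            return probability
    return PROCESSING_TIME_TABLE[-1][1]


average = update_average(10.0, 4, 20.0)  # four runs averaging 10 min, then a 20 min run
print(average, processing_time_probability(30.0, average))
```

Storing only the running average and the run count keeps the execution status record small while still allowing the ratio to be computed at failure time.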

Next, of the four viewpoints above, a specific example of detecting the failure cause from the occurrence frequency of similar failures will be described. FIG. 13 shows an example of an occurrence-frequency probability table, which is the cause probability information that the failure cause detection unit 38 refers to in S40 of FIG. 4. The occurrence-frequency probability table 160 includes an occurrence frequency column 162 and a probability column 164. As described above, when a failure occurs, the cause candidates estimated with high probability to be the cause, or the jobs and resources confirmed as the true cause, are accumulated and stored. Then, when a failure occurs, the probability that a cause candidate can be the cause of the failure is obtained based on how often it has been stored in the past as a cause for the failed job. As shown in FIG. 13, the higher the occurrence frequency, the higher the probability is set. When a failure occurs, the failure cause detection unit 38 refers to the execution status storage unit 44 to obtain the occurrence frequency of each cause candidate, and then refers to the occurrence-frequency probability table 160 to obtain the probability for each cause candidate.
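The two-step lookup described above (frequency from the stored history, then probability from the table) can be illustrated as follows. The record layout, the interpretation of "occurrence frequency" as a share of the failed job's past failures, and the table values are all assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical stand-in for the occurrence-frequency probability table 160:
# (minimum frequency, probability), highest frequency first.
FREQUENCY_TABLE = [(0.5, 0.98), (0.2, 0.90), (0.0, 0.80)]


def cause_frequencies(history, failed_job):
    """Share of the failed job's past failures attributed to each cause candidate.

    history: list of (failed_job, cause_job, cause_resource) records.
    """
    relevant = [(cause_job, resource) for job, cause_job, resource in history
                if job == failed_job]
    if not relevant:
        return {}
    counts = Counter(relevant)
    return {candidate: n / len(relevant) for candidate, n in counts.items()}


def frequency_probability(frequency):
    """Map an occurrence frequency to a cause probability."""
    for min_frequency, probability in FREQUENCY_TABLE:
        if frequency >= min_frequency:
            return probability
    return FREQUENCY_TABLE[-1][1]


history = [
    ("X", "B", "database"),
    ("X", "B", "database"),
    ("X", "Y", "LAN card"),
    ("A", "B", "database"),
]
for candidate, freq in cause_frequencies(history, "X").items():
    print(candidate, frequency_probability(freq))
```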

FIG. 14 shows an example in which failure causes are evaluated from the four viewpoints described above and the probability that each cause candidate can be the cause is finally calculated. The failure probability calculation table 170 includes a cause job column 172, a probability column 174, a cause resource column 176, a column 178 for the probability based on resource errors, a column 180 for the probability based on the status of the same-resource job, a column 182 for the probability based on the relative status of the failed job and the same-resource job, and a column 184 for the probability based on occurrence frequency. The failure cause detection unit 38 performs the evaluation from each viewpoint and obtains the cause probability for each cause candidate so as to fill in columns 178, 180, 182, and 184 of the failure probability calculation table 170. It then multiplies these probabilities together to calculate the final probability, as shown in the probability column 174. As noted earlier, a method other than multiplication may be used here.

The data output by the output unit 40 may be only the cause job column 172, the probability column 174, and the cause resource column 176, or may be the entire failure probability calculation table 170. In the former case, the failure probability calculation table 170 need not actually be created; the probabilities from each viewpoint may be held temporarily in a register or the like and used to calculate the final probability. In the example shown in FIG. 14, for the case where "job B" is the cause and the failure occurred via the "database", the columns show that the probability based on resource errors is "95%", the probability based on the status of the same-resource job is "95%", the probability based on the relative status of the failed job and the same-resource job is "90%", and the probability based on occurrence frequency is "98%". The final probability is "79.60%", which is higher than that of the other cause candidates. The user can therefore focus on job B and the database and investigate the failure cause intensively.
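The final figure in this example follows directly from multiplying the four viewpoint probabilities, which the short sketch below reproduces. The function name is hypothetical, and multiplication is only one of the combination methods the text allows.

```python
from math import prod


def final_probability(viewpoint_probabilities):
    """Combine per-viewpoint probabilities by simple multiplication."""
    return prod(viewpoint_probabilities)


# Job B / database row of FIG. 14: resource error, same-resource job status,
# relative status, occurrence frequency.
job_b_database = [0.95, 0.95, 0.90, 0.98]
print(f"{final_probability(job_b_database):.2%}")  # prints 79.60%
```

The unrounded product is 0.796005, so the "79.60%" shown in the probability column is the value rounded to two decimal places.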

The description so far has assumed that no failure has occurred in the same-resource jobs of the failed job. However, once the same-resource jobs have been obtained by referring to the use resource information in S36 of FIG. 4, the current day's logs of those jobs may be checked for failures. If failures have occurred in two jobs that use a common resource, it is highly probable that some problem has occurred in that resource and is causing the failures. Therefore, when a failure has occurred in a same-resource job, weighting is applied to increase the probability that the shared resource is the cause of the failure.

When there are multiple same-resource jobs for the same used resource, all of those jobs are checked for failures. The more same-resource jobs have failed, the higher the cause probability assigned to the resource they share. Alternatively, when failures are found in a predetermined number or more of same-resource jobs, the resource they share may be regarded as clearly being the cause of the failure. For example, if failures have occurred in three jobs that use drive D, namely job B, job X, and job Z, some problem in drive D is considered to be causing the failures.

In this case, without performing the subsequent failure cause evaluation, the fact that the resource is the cause may be output to the output unit 40 to notify the user, and jobs that use the resource and that are scheduled for later processing or are currently being processed may be stopped automatically. This operation is performed by the failure cause detection unit 38 controlling the job processing unit 36. In this way, the system not only detects failure causes and outputs their probabilities, but can also determine that a single resource is the cause and stop other jobs sooner than the user could respond, preventing the secondary damage of a propagating failure.
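The shared-resource check and the selection of jobs to stop can be sketched as below. The data layout and function names are hypothetical, and the threshold of three failed jobs is an assumption taken from the drive D example; the text only speaks of "a predetermined number".

```python
def resources_to_isolate(resource_usage, failed_jobs, min_failed=3):
    """Return {resource: failed jobs} for resources shared by enough failed jobs.

    resource_usage: {job: set of resources it uses} (the use resource information).
    min_failed: the "predetermined number"; 3 is an assumed value.
    """
    failed_by_resource = {}
    for job in failed_jobs:
        for resource in resource_usage.get(job, ()):
            failed_by_resource.setdefault(resource, set()).add(job)
    return {resource: jobs for resource, jobs in failed_by_resource.items()
            if len(jobs) >= min_failed}


def jobs_to_stop(resource_usage, resource, finished_jobs):
    """Running or pending jobs that use the suspect resource."""
    return sorted(job for job, resources in resource_usage.items()
                  if resource in resources and job not in finished_jobs)


usage = {
    "B": {"drive D"},
    "X": {"drive D", "database"},
    "Z": {"drive D"},
    "W": {"drive D"},
    "Y": {"LAN card"},
}
failed = {"B", "X", "Z"}
for resource in resources_to_isolate(usage, failed):
    print(resource, jobs_to_stop(usage, resource, finished_jobs=failed))
```

In this example job W has not yet failed but uses drive D, so it is the job that would be stopped to keep the failure from propagating.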

According to the embodiment described above, use resource information that associates the jobs to be processed with the resources they use is prepared in advance. When a failure occurs, candidate failure causes are detected based on that information, and the probability that each candidate can be the cause is output. The probability is obtained by evaluating at least one of four viewpoints: errors in the used resources, the status of the same-resource jobs, the relative status of the failed job and the same-resource jobs, and the occurrence frequency of similar failures. This makes it easy to uncover failure-related links between jobs that appear unrelated at first glance, and to present the user with information useful for handling the failure.

Furthermore, by sharing the use resource information among servers, causes can also be sought in jobs processed on servers that are managed by a different department or installed at a different location. When evaluations from all four viewpoints are performed, the multifaceted evaluation improves the accuracy of the probability. Displaying the probabilities lets the user easily prioritize verification work, which improves the efficiency of recovery work.

When evaluating the relative status of the failed job and the same-resource jobs, execution statuses are accumulated and stored, and the "normal state" is defined based on them. By determining an "abnormal state" relative to the "normal state", even events that are not recognized as "abnormal" from logs or the like can be detected as failure causes. Limiting the jobs checked against the "normal state" to the same-resource jobs obtained by referring to the use resource information allows cause candidates to be narrowed down efficiently.

In addition, when failures occur in multiple jobs that use a common resource, the probability that the resource is the failure cause is judged to be extremely high, and unprocessed jobs and the like that use that resource are stopped. As a result, urgent countermeasures against an obvious failure can be taken quickly without human intervention, preventing the damage from spreading.

Since this embodiment can be implemented without introducing new hardware, it can be applied to existing systems easily and at low cost. In addition, because factors that can cause failures can be identified in detail, introducing it not only on production machines but also in test operation on development machines can support system development, for example by identifying points for improvement.

The present invention has been described above based on an embodiment. The embodiment is illustrative, and those skilled in the art will understand that various modifications to the combinations of its constituent elements and processing steps are possible and that such modifications also fall within the scope of the present invention.

FIG. 1 is a diagram showing a configuration example of a system to which the present embodiment can be applied.
FIG. 2 is a diagram schematically showing a case where a failure-related link arises from the relative status of a failed job and a same-resource job.
FIG. 3 is a diagram showing the configuration of the first server in the present embodiment in more detail.
FIG. 4 is a flowchart showing the processing procedure when a failure occurs in the present embodiment.
FIG. 5 is a flowchart showing the processing procedure by which the failure cause detection unit detects and evaluates failure causes based on the relative status of the failed job and the same-resource jobs in the present embodiment.
FIG. 6 is a diagram explaining a failure case whose cause can be detected by evaluating the relative relationship between jobs in the present embodiment.
FIG. 7 is a diagram explaining a failure case whose cause can be detected by evaluating the relative relationship between jobs in the present embodiment.
FIG. 8 is a diagram showing an example of the data structure of the use resource information created by the use resource information acquisition unit in the present embodiment.
FIG. 9 is a diagram showing an example of the execution-status cause probability table set for events in an "abnormal state" in the present embodiment.
FIG. 10 is a diagram showing an example of the resource-specific probability table, which is the cause probability information that the failure cause detection unit refers to in S34 of FIG. 4.
FIG. 11 is a diagram showing an example of the resource utilization rate probability table, which is the cause probability information that the failure cause detection unit refers to in S36 of FIG. 4.
FIG. 12 is a diagram showing an example of the processing-time probability table, in which a probability is set for each ratio of the current day's processing time to the average processing time of the same-resource job in the present embodiment.
FIG. 13 is a diagram showing an example of the occurrence-frequency probability table, which is the cause probability information that the failure cause detection unit 38 refers to in S40 of FIG. 4.
FIG. 14 is a diagram showing an example in which failure causes are evaluated from the four viewpoints and the probability that each job and resource can be the cause is finally calculated in the present embodiment.

Explanation of symbols

10 job processing system, 12 first server, 13 hard disk, 14 second server, 20 database, 32 job registration unit, 34 use resource information acquisition unit, 36 job processing unit, 38 failure cause detection unit, 40 output unit, 42 job information storage unit, 44 execution status storage unit, 46 cause probability storage unit.

Claims

1. A job processing system for batch processing jobs, comprising:
an execution status storage unit that accumulates and stores job execution status information each time a job is processed;
a failure cause detection unit that, when a failure occurs in one of the jobs being processed, refers to the execution status information and detects, as a cause candidate, an event in which the degree of overlap in processing time between the job in which the failure occurred and another job processed in the same batch processing is in an abnormal state deviating from a normal state that has occurred in the past record with a probability exceeding a predetermined threshold; and
an output unit that outputs information on the cause candidate detected by the failure cause detection unit.

2. The job processing system according to claim 1, further comprising a cause probability storage unit that stores an execution-status cause probability table in which each event expected to be extracted as an abnormal state is associated with a set value of the probability that the event can be a cause of failure, wherein
the failure cause detection unit refers to the execution-status cause probability table to obtain the probability that the event detected as the cause candidate can be a cause of failure, and
the output unit outputs the event detected as the cause candidate in association with the probability that the event can be a cause of failure.

3. The job processing system according to claim 2, further comprising a job information storage unit that stores use resource information in which each job processed in the job processing system is associated with the resources used by the job, wherein
when an error is recorded in the log of a resource used by the job in which the failure occurred, the resource being identified by referring to the use resource information, the failure cause detection unit performs weighting so as to increase the probability that, among the events detected as cause candidates, an event related to the resource for which the error is recorded can be a cause of failure.

4. The job processing system according to claim 2, further comprising a job information storage unit that stores use resource information in which each job processed in the job processing system is associated with the resources used by the job and with the utilization rate at which the job uses each resource, wherein
the failure cause detection unit refers to the use resource information to identify the resource utilization rate of another job that uses the same resource as the job in which the failure occurred, and weights the probability that, among the events detected as cause candidates, an event related to the other job can be a cause of failure, according to the other job's utilization rate relative to the free space of the resource in the batch processing in which the failure occurred.

5. The job processing system according to claim 2, wherein
the execution status storage unit further stores each job in which a failure occurred in the past in association with the cause of that failure, and
the failure cause detection unit weights the probability that the event detected as the cause candidate can be a cause of failure according to the frequency with which it has been a cause of past failures of the job in which the failure occurred.

6. The job processing system according to claim 2, wherein the failure cause detection unit corrects the probability that the event detected as the cause candidate can be a cause of failure according to the difference between the occurrence probability of the normal state in the past record and the threshold.

7. The job processing system according to claim 4, wherein, when a failure has also occurred in another job that uses the same resource as the job in which the failure occurred, the failure cause detection unit performs weighting so as to increase the probability that, among the events detected as cause candidates, an event related to that resource can be a cause of failure.

8. The job processing system according to claim 4, wherein, when failures have occurred in a predetermined number or more of jobs that use the same resource as the job in which the failure occurred, the failure cause detection unit estimates that the resource is the failure cause and performs control so as to stop jobs being processed and unprocessed jobs.

9. A job management method comprising:
a step of acquiring use resource information that associates jobs to be batch processed in a job processing system with information on the resources used by each job;
a step of accumulating and storing job execution status information each time a job is processed;
a step of, when a failure occurs in one of the jobs being processed, referring to the use resource information and the execution status information and detecting failure cause candidates by evaluating at least one of: whether an error exists in a resource used by the job; the execution status of another job that uses the same resource as the job; whether the degree of overlap in processing time with another job that uses the same resource as the job in which the failure occurred has changed from the normal state; and the frequency of causes of past failures of the job in which the failure occurred; and
a step of outputting information on the failure cause candidates.