JP2004258964A - Automatic operation method of computer system, and computer system

Info

Publication number: JP2004258964A
Application number: JP2003048790A
Authority: JP
Inventors: Tatsuya Kawashita; 達也川下
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-02-26
Filing date: 2003-02-26
Publication date: 2004-09-16

Abstract

PROBLEM TO BE SOLVED: To provide a computer system that reduces the overhead of re-running jobs when the computer system is stopped because of a system failure.
SOLUTION: The computer system comprises a computing server 101, a control unit 103, and an environment monitoring unit 104 that monitors the operating environment of the computing server 101 and the control unit 103. When the control unit 103 receives a report of abnormality information, it acquires information about the jobs executing on the computing server 101, decides the priority of the jobs based on the acquired information, saves continuation execution information for the jobs to a storage unit 102 in order from the job with the highest decided priority, and begins stop processing of the computer system 11 when the preset grace time before the start of system stop processing for the abnormal condition of the operating environment expires.
COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、計算機システムの自動運転方法及び計算機システムに関し、特に計算機システムにて発生する障害や、計算機システムの設置環境の異常などにより計算機システムを自動的に停止させる際に、計算機上で行われていた処理の継続性を向上させるための技術に関する。
【０００２】
【従来の技術】
計算機システムの運用において、設置環境が推奨される条件を満たさない状態となった場合、運用の継続により計算機システムに重大な損害を引き起こす恐れがある。
【０００３】
例えば、計算機システムを設置している部屋の室内温が異常に高くなった場合は、計算機システムを構成する部品が熱により故障する可能性があり、また、停電などの電源障害により計算機システムに電力を供給できなくなった場合は、ディスク装置などのデータ記憶装置が誤動作を起こし、データの破損を引き起こす可能性がある。
【０００４】
上記のような環境上の異常状態に加え、オペレーティングシステムのパニックなどシステムの障害が発生した場合、システムの損害を最小限に押さえるためには、早急にシステムを停止し、異常状態を取り除く必要がある。
【０００５】
このような、計算機システムにおけるさまざまな障害の発生に対しては、従来から、図１１に示すような、計算機システムを安全に停止させる動作を自動化する方法がとられている。
【０００６】
これは、計算サーバ１０１とこれを管理する制御装置１０３に加え、環境モニタ装置１０４と呼ばれる、計算機システムが設置されている環境の異常を検知するシステムを用いることにより、自動的な計算機システムの停止処理を実現している。
【０００７】
例えば、停電に対しては、無停電電源装置（ＵＰＳ）により、停電を自動的に検知し、これが環境モニタ装置１０４を通して制御装置１０３に通知され、蓄電池などの補助電源を用いて自動的にシステムの停止処理を行っている。
【０００８】
また、室内温など、設置環境の異常についても同様で、室内に温度センサなどを設けて常に室内温を監視し、異常な室温に達したことを環境モニタ装置１０４が検知するとシステムを自動的に停止させる、などの方法がとられている。
【０００９】
このような、計算機システムや計算機の設置環境にて発生した障害により自動的に計算機システムを停止する技術の公知例としては、特許文献１に記載の計算機システムや特許文献２に記載の電子計算機システムなどがあげられる。
【００１０】
一方で、計算機システムを停止させた場合、通常は、計算機システム上で実行されていたジョブは中止され、ジョブの再実行が必要となってしまう。
【００１１】
これに対しては、現在実行中のジョブやプロセスに対して、計算機システム上にて、これらが使用しているメモリ領域の内容をはじめとする計算機システム内部の状態をディスク装置などの不揮発記憶装置に記録することにより、ジョブなどを退避させるジョブの退避処理（以下、ジョブフリーズという）を行い、システム停止などによりジョブが中止されてしまっても、ジョブフリーズにより退避されている計算機システム内部の状態を記憶装置から読み出すことによりジョブを継続実行させ、システム停止によるオーバーヘッドを削減させる機能、いわゆるジョブフリーズ・リスタート機能を有する計算機システムが提案されている。
【００１２】
また、上記のような計算機システムの内部状態の記録を定期的に行い、障害発生時には最後に記録された内部状態をもとにジョブの継続実行を行うチェックポイント記録機能を有する計算機システムが提案されている。
【００１３】
従来では、これらの機能により、計算機システムの停止があり、計算機システムの再起動の後に、計算機システムの内部状態を復帰させ、計算機システム停止前の状態からジョブを継続実行させるようになっている。これにより、定期的な計算機システムの停止に対して、ジョブの再実行によるオーバヘッドを減らし、効率的に計算機システムを運用していた。
【００１４】
【特許文献１】
特開平１０−２４０３９０号公報
【００１５】
【特許文献２】
特開平５−１５２９４２号公報
【００１６】
【発明が解決しようとする課題】
計算機システムの内外で発生した障害による突発的なシステムの停止に対しては、その障害の発生を予見できないため、その障害発生後のシステム復旧を迅速に行うことが重要となる。
【００１７】
しかしながら、従来、ジョブフリーズ機能を用いる方法では、障害が発生した時に、動作中のジョブを全てジョブフリーズさせるので、ジョブフリーズに時間がかかってしまうという問題点があった。
【００１８】
例えば、停電時に全てのジョブに対してジョブフリーズを行うためには、その時間に見合った大容量の蓄電池を備えた無停電電源装置の準備が必要など、非常にコストが大きくなってしまい、このため、全てのジョブに対して予めジョブフリーズを行うことは難しく、システム復旧後はジョブの再実行が必須となってしまう。
【００１９】
また、チェックポイント記録機能を用いる方法では、定期的に内部状態を記録しておくため、ジョブの実行に際して定期的に内部状態を記録する処理が加わるので、ジョブの実行時間が長大化してしまうという問題点があった。
【００２０】
本発明は、このような問題点を解決するためになされたものであり、計算機システムのハードウエアコストの増大やジョブの実行時間の長大化を被ることなく、突発的なシステム内外の障害・警報に対して、システム復旧時のジョブの再実行によるオーバヘッドをできる限り削減することができる計算機システムを提供することを目的とする。
【００２１】
本発明の前記並びにその他の目的と新規な特徴は、本明細書の記述及び添付図面から明らかになるであろう。
【００２２】
【課題を解決するための手段】
本願において開示される発明のうち、代表的なものの概要を簡単に説明すれば、次のとおりである。
【００２３】
本発明による計算機システムの自動運転方法は、計算機システムの停止処理及び起動処理を行う自動運転方法において、計算機システムの動作環境を監視し、動作環境に異常が発生すると、計算機システム上で実行されているジョブに関する情報を取得し、その取得した情報に基づいて、そのジョブの優先度を決定し、決定した優先度の高いジョブから順に、ジョブに対する継続実行情報を不揮発記憶装置に退避させ、予め設定された動作環境の異常に対する計算機システムのシステム停止処理開始猶予時間を経過すると計算機システムを停止させるものである。
【００２４】
また、本発明による計算機システムは、ジョブの実行を行う少なくとも１つの計算サーバと、計算サーバの運用を管理する制御装置と、計算サーバ及び制御装置の動作環境を監視し、動作環境に異常が発生すると、その異常情報を制御装置に報告する環境モニタ装置とを備えた計算機システムにおいて、制御装置は、環境モニタ装置から異常情報の報告があると、計算サーバ上で実行されているジョブに関する情報を取得し、その取得した情報に基づいて、そのジョブの優先度を決定し、決定した優先度の高いジョブから順に、ジョブに対する継続実行情報を不揮発記憶装置に退避させ、予め設定された動作環境の異常に対する計算機システムのシステム停止処理開始猶予時間を経過すると計算機システムの停止処理を開始するものである。
【００２５】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて詳細に説明する。なお、実施の形態を説明するための全図において、同一部材には同一の符号を付し、その繰り返しの説明は省略する。
【００２６】
（実施の形態１）
図１は本発明の実施の形態１による計算機システムの構成を示す構成図である。
【００２７】
図において、計算機システム１１は、１つまたは複数の計算サーバ１０１、ストレージ装置１０２、およびこれらの計算サーバの管理を行う制御装置１０３と、システムが設置される部屋の室内温や湿度を監視する気温／湿度センサ１０５、停電などの電源異常を監視する無停電電源装置（ＵＰＳ）１０６といった、外部環境に対するセンサ装置と、これらの外部環境センサ装置から室内温、電源の状態などの外部環境状況を受け、異常時にはその異常の内容を警報として制御装置に通知する環境モニタ装置１０４からなる。
【００２８】
制御装置１０３には、制御装置が動作するために必要なオペレーティングシステム、各種ソフトウエア、設定情報などを保持する記憶装置１１４が接続されている。
【００２９】
また、制御装置１０３では、計算機システム全体の運用管理を行うためのソフトウエアであるサーバ管理ソフトウエア１１１が稼動しており、現在計算サーバ１０１にて実行中されているジョブを管理するジョブ管理テーブル１１２に加え、障害・警報が発生した際にシステム停止までに許される計算機システムの停止処理開始猶予時間を保持する障害・警報時動作設定テーブル１１３を保持している。
【００３０】
ジョブ管理テーブル１１２では、現在実行中のジョブに対し、ジョブの実行元のユーザ、実行開始時間など、ジョブにかかわる情報が管理されている。また、障害・警報時動作設定テーブル１１３には、想定される障害・警報に対し、システム停止処理開始までの猶予時間、すなわち、障害・警報発生からどれくらい後にシステム停止処理を開始するかのシステム停止処理開始猶予時間が示されている。
【００３１】
ここで、ジョブ管理テーブル１１２、障害・警報時動作設定テーブル１１３の内容の一例について説明する。
【００３２】
図２はジョブ管理テーブル１１２の内容の一例を示す図、図３は、障害・警報時動作設定テーブル１１３の内容の一例を示す図である。
【００３３】
図２に示すジョブ管理テーブル１１２は、現在実行中のジョブに関する情報が記録されており、ジョブが実行開始となるときに、そのジョブ名１１２１、実行した計算機ユーザ名１１２２、プライオリティ１１２３、実行開始時間１１２４などが記録され、ジョブが終了したときに当該ジョブの情報が削除される。プライオリティ１１２３は、ジョブ実行開始時に、計算機ユーザによって設定される、などにより付加されるもので、優先度情報となり、ジョブフリーズが行われる時の優先度の決定要因に用いられるものである。
【００３４】
また、図３に示す障害・警報時動作設定テーブル１１３は、想定する障害・警報の内容となる障害要因１１３１、及びそれぞれの障害・警報に対して、ジョブフリーズ処理のために許される猶予時間となるシステム停止処理開始猶予時間１１３２が定義されている。これらの情報は、システム導入時に、システムに用いるサーバ、ストレージ、ＵＰＳなどの仕様から決定される。
【００３５】
例えば、停電時のシステム停止処理開始猶予時間は、ＵＰＳが内蔵している蓄電池の容量とシステムの規模から、蓄電池を停電時のバックアップ電源として利用可能な時間を設定すればよい。
【００３６】
次に、この実施の形態の動作について説明する。
【００３７】
図４はこの実施の形態の計算機システムにおける障害・警報発生時の処理フローであり、障害・警報が発生してからシステムが停止するまでの計算機システム、特にサーバ管理ソフトウエアの処理フローを示している。
【００３８】
図４では、計算機システム１１が設置されている部屋において、室内温異常が発生した場合を例にとって説明しているが、停電などその他の障害・警報が発せられた場合でも同様の処理フローによってシステムの停止処理が行われる。
【００３９】
まず、環境モニタ装置１０４にて室内温異常の検知がなされると（Ｓ２０１）、この情報は警報として制御装置１０３上で稼動しているサーバ管理ソフトウエア１１１に通知される。
【００４０】
次に、サーバ管理ソフトウエア１１１は室内温異常の警報を環境モニタ装置１０４より受けると、まず、障害・警報時動作設定テーブル１１３より、当該警報に対してシステム停止処理を開始するまでの時間（システム停止処理開始猶予時間）の取得を行う（Ｓ２０２）。
【００４１】
次に、現在計算サーバにて実行中のジョブに対して、ジョブフリーズにより退避するジョブの優先度を決定するために必要な情報の取得を行う（Ｓ２０３）。
【００４２】
次に、Ｓ２０３にて得られた情報から、ジョブフリーズを行うジョブの優先度を決定し、計算サーバ１０１に決定された優先度に従いジョブフリーズの開始の指示を行う（Ｓ２０４）。
【００４３】
図４では、ｊｏｂ１、ｊｏｂ２、ｊｏｂ３の順に優先度が決定した場合を示しているが、この場合は、計算サーバ１０１は、サーバ管理ソフトウエア１１１からのジョブフリーズ開始の指示に従い、ｊｏｂ１から順にストレージ装置１０２などに、そのジョブが計算機システムの再起動後に再実行可能な情報（継続実行情報）を退避させるジョブフリーズを開始する。
【００４４】
はじめにｊｏｂ１に対してジョブフリーズが終了すると、計算サーバ１０１はサーバ管理ソフトウエア１１１に対してｊｏｂ１のフリーズの処理が終わった旨を通知し、サーバ管理ソフトウエア１１１がｊｏｂ１のフリーズ処理終了の通知を受けると、ｊｏｂ１がフリーズされていることを制御装置１０３に接続されている記憶装置１１４にフリーズ処理済ジョブテーブル１１５として記録しておく（Ｓ２０５）。
【００４５】
以下、ｊｏｂ２、ｊｏｂ３の順にジョブフリーズ処理が計算サーバ１０１上で進んでいくが、先にＳ２０２にて取得したシステム停止処理開始猶予時間が経過すると、現在実行中のフリーズ処理の中止指示を計算サーバ１０１に指示し、計算機システム１１の停止処理を開始する（Ｓ２０６）。
【００４６】
図４の例では、ｊｏｂ３のフリーズ処理実行中にシステム停止処理開始の時間に達してしまったため、ｊｏｂ３のフリーズ処理を中止している。
【００４７】
このように、この実施の形態では、優先度の高いジョブに対して計算機システム１１の停止時に自動的にジョブフリーズがなされ、計算機システム１１の復旧時にはこれらのジョブが計算機システム１１の停止時の状態から継続して実行可能である。
【００４８】
実際に計算機システム１１が復旧したときは、制御装置１０３上のサーバ管理ソフトウエア１１１が、記憶装置１１４に記録されているフリーズされているジョブのリストを参照し、これらに記載されているジョブに対して計算サーバ１０１にてリスタート処理により、ストレージ装置１０２に退避された情報を用いて継続実行を開始させればよい。
【００４９】
次に、この実施の形態のジョブフリーズを実行するときの、ジョブに対する優先度決定方法の一例について説明する。
【００５０】
図５は、ジョブフリーズを実行するときの、ジョブに対する優先度決定方法の一例を示したものである。図５は、図４の処理フローにおける、Ｓ２０３の実行中のジョブの情報取得処理と、Ｓ２０４の優先度の決定、フリーズ処理開始処理に相当するものである。
【００５１】
まず、ジョブの情報取得処理Ｓ２０３では、現在作業中のジョブに対し、実行時間、使用メモリ量を計算サーバ１０１に問い合わせ（Ｓ２０３１）、計算サーバ１０１からジョブ、実行時間、メモリ量の情報を取得する（Ｓ２０３２）。
【００５２】
次に、優先度の決定、フリーズ処理開始処理Ｓ２０４では、まず、Ｓ２０３２で取得された各ジョブのメモリ量から、そのジョブをフリーズするのに必要な処理時間の算出を行う（Ｓ２０４１）。
【００５３】
ジョブフリーズは、そのジョブが使用しているメモリの情報をそのままストレージ装置１０２に記録する処理が主であり、計算サーバ１０１からストレージ１０２への転送速度、及びストレージ１０２の書きこみ速度等と、使用メモリ量から処理時間を計算することができる。
【００５４】
このジョブフリーズ時間が、システム停止処理開始までの時間より大きければ、当該ジョブのジョブフリーズを行っても終了する見込みが無いので、ジョブフリーズ可能な他のジョブを優先させるために、当該ジョブのジョブフリーズの対象からの削除を行う（Ｓ２０４２）。
【００５５】
次に、残ったジョブに対して、ジョブの実行時間が長いジョブから順にジョブフリーズの開始指示を計算サーバ１０１に対して指示する（Ｓ２０４３）。
【００５６】
この優先度決定方法により、既に長時間実行を続けていて、かつジョブフリーズ可能なジョブが優先的にジョブフリーズされるので、全体としてシステム停止によるジョブ実行時間のオーバーヘッドを削減することができる。
【００５７】
次に、この実施の形態のジョブフリーズを実行するときの、ジョブに対する優先度決定方法の他の例について説明する。
【００５８】
図６は、ジョブフリーズを実行するときの、ジョブに対する優先度決定方法の他の例を示したものである。図６は、図４の処理フローにおける、Ｓ２０３の実行中のジョブの情報取得処理と、Ｓ２０４の優先度の決定、フリーズ処理開始処理に相当するものである。
【００５９】
図６の例では、ジョブにはプライオリティの情報が設定されており、ジョブの優先度を決定するための情報として、このジョブに付加されたプライオリティの情報を使用する。
【００６０】
まず、ジョブの情報取得処理Ｓ２０３では、プライオリティの情報は、例えば図２に示すようにジョブ管理テーブル１１２にジョブ毎に管理されているので、サーバ管理ソフトウエア１１１は、ジョブ管理テーブル１１２から実行中のジョブと共にプライオリティの取得を行う（Ｓ２０３３）。
【００６１】
次に、優先度の決定、フリーズ処理開始処理Ｓ２０４では、Ｓ２０３３で取得されたプライオリティの高いジョブから順に、ジョブフリーズの開始指示を計算サーバ１０１に対して指示する（Ｓ２０４４）。
【００６２】
この優先度決定方法により、例えば、企業の基幹業務処理プログラムの実行に対して、その処理の重要度にしたがってプライオリティを適切に設定することにより、障害が発生してもこれによる処理のオーバヘッドを減らすことができる。
【００６３】
以上、図５及び図６を用いてジョブフリーズの優先度決定の方法の一例について説明したが、本発明はこの優先度決定の方法に限らず、この他にもジョブの優先度を決定する方法として、計算機の運用形態によって、制御装置１０３内の各種テーブルの情報や、計算サーバ１０１からの情報など、さまざまな情報を用いて優先度を決定するようにしてもよい。
【００６４】
また、この実施の形態では、ジョブフリーズで実行中のジョブを退避させる際、計算サーバ１０１に接続されたストレージ装置１０２にジョブが計算機システムの再起動後に再実行可能な情報を退避させているが、ストレージ装置１０２に限らず、他の不揮発記憶装置にジョブに関する情報を退避させるようにしてもよい。この場合は、計算機システムが復旧したときは、ジョブを退避させた不揮発記憶装置の情報に基づいて計算機サーバ１０１によりリスタート処理を行わせジョブを継続実行させればよい。
【００６５】
（実施の形態２）
この実施の形態は、実施の形態１での計算機システムにおける障害・警報発生時の自動運転の動作を、特にアプリケーションサービスプロバイダ（ＡＳＰ）のような、企業などを顧客として計算機上でのジョブの実行を代行するような事業に対して、適用し、新規のサービスとして提供するものである。
【００６６】
以下、本発明の計算機システムの自動運転方法を用いた新規事業について説明する。
【００６７】
図７は、本発明の実施の形態２におけるアプリケーションサービスプロバイダ（ＡＳＰ）事業の概要を示した図である。
【００６８】
図において、ＡＳＰ事業者１は、実施の形態１と同様な計算機システム１１を保有しており、利用顧客２に対してジョブの実行を代行、あるいはインターネットなどによるリモートアクセスによって計算機の利用を提供している。
【００６９】
ＡＳＰ事業者１は、利用顧客２が計算機システム１１にてジョブを実行する際にジョブに付加可能なプライオリティの使用許可を利用顧客２に与え、プライオリティの高さによって設定されているプライオリティ使用料を利用顧客２から徴収する。このプライオリティは、計算機システム１１において障害・警報が発生した際に、ジョブフリーズを行うジョブの優先度を決定するパラメータとなる。
【００７０】
このため、ジョブに高いプライオリティを付加する権利を利用顧客２が得ることは、計算機システム１１に障害が発生した場合に、その復旧時に障害の発生によるオーバヘッドを少なくする権利を得ることにつながる。
【００７１】
次に、この実施の形態におけるＡＳＰ事業者１と利用顧客２との間の関係について説明する。
【００７２】
図８は、実施の形態２におけるＡＳＰ事業者と利用顧客との間の関係を示す図である。
【００７３】
まず、ＡＳＰ事業者１と利用顧客２との間で、計算機システムの利用に関する契約を締結する契約フェーズ３の段階について説明する。
【００７４】
利用顧客２は、ＡＳＰ事業者１に対して、計算機システムの利用申込みを行う（Ｓ２１）。この際、契約条件として、利用方法などのほかに、ジョブフリーズの優先度につながる、利用可能なプライオリティの使用申込みも行う。
【００７５】
ＡＳＰ事業者１が、Ｓ２１での利用申込みを受けると、利用顧客２にプライオリティの使用許可も含めて計算機の利用許可を通知し、利用料・契約料の請求を行う（Ｓ２２）。これに対し利用顧客２は、利用料の支払いをＡＳＰ事業者１に対して行うことにより、契約が成立する（Ｓ２３）。
【００７６】
次に、実際の運用となる正常運用４の段階について説明する。
【００７７】
まず、利用顧客２はＡＳＰ事業者１にジョブ実行の依頼、あるいは、直接計算機システム１１にリモートアクセスしてジョブの実行を行う（Ｓ２４）。このとき、契約時に使用許可されたプライオリティを付加しての実行が可能である。
【００７８】
利用顧客２からのＳ２４でのジョブ実行依頼に対して、ＡＳＰ事業者１ではジョブの実行が行われ、利用顧客２にジョブの実行結果の返送を行う（Ｓ２５）。
【００７９】
この実行結果の内容としては、正常終了すれば結果を、実行が失敗した場合はその旨を通知する。実行が失敗するケースには、事前に利用許可を得ていないプライオリティを利用顧客２が使用しようとした場合も含まれる。
【００８０】
次に、運用中に計算機システムに障害が発生した場合となる障害発生時運用５の段階におけるＡＳＰと利用顧客との関係について説明する。
【００８１】
まず、利用顧客２がＡＳＰ事業者１にジョブ実行の依頼があり（Ｓ２６）、ＡＳＰ事業者１にてジョブの実行を開始した後に、計算機システム１１に障害が発生した場合（Ｓ２７）、ジョブに付加されたプライオリティと、計算機システム１１の使用状況、障害の内容によって当該ジョブが継続実行を行うか、継続実行ができず実行中止とするかが計算機システム内部で自動的に決定される。
【００８２】
継続実行を行うと決定された場合は、ジョブフリーズにより、そのジョブがストレージ装置１０２などに退避され、計算機システムの復旧後にリスタート処理により継続実行が行われ、実行中止と決定された場合は、そのジョブの実行は中止され、計算機システムの復旧後に改めてジョブの再実行が行われる（Ｓ２８）。その後、ジョブが終了したら、利用顧客に対する実行結果の報告が行われる（Ｓ２９）。
【００８３】
次に、この実施の形態の事業形態を実現するためのシステム構成について説明する。
【００８４】
図９は、実施の形態２の事業形態を実現するためのシステム構成図である。
【００８５】
図において、ＡＳＰ事業者１が管理する計算機システム１１に対して、利用顧客２が使用する顧客端末１４が、ＡＳＰのゲートウェイサーバ１２及びインターネット１３を通じて接続されている。
【００８６】
また、制御装置１０３にはジョブ管理テーブル１１２、障害・警報時動作設定テーブル１１３に加え、当該計算機システムを利用する利用顧客２に対して許可しているプライオリティを管理するユーザ設定テーブル１１６を具備しており、さらにＡＳＰ事業者１がユーザ設定テーブル１１６の設定を行うために管理端末１５が接続されている。
【００８７】
利用顧客２からのプライオリティ使用要求に対しては、ＡＳＰ事業者１が管理端末１５を通じてユーザ設定テーブル１１６を編集することによりプライオリティ許可の設定を行う。
【００８８】
また、ジョブの実行依頼は、顧客端末１４から、インターネット１３、ゲートウェイサーバ１２を通じて計算機システム１１にリモートで行われる。実行結果通知についても、計算機システム１１からゲートウェイサーバ１２、インターネット１３を通じて顧客端末１４に通知される。
【００８９】
次に、この実施の形態のプライオリティ付きのジョブの処理動作について説明する。
【００９０】
図１０は実施の形態２の利用顧客によるジョブ実行依頼時の計算機システムの動作フローであり、利用顧客がプライオリティ付きのジョブを計算機システムに投入したときの計算機システム内の、特にサーバ管理ソフトウエアの動作を示している。
【００９１】
まず、利用顧客２が計算機システム１１に対してプライオリティが付加されたジョブの実行依頼を出す（Ｓ３１）。
【００９２】
このＳ３１での実行依頼は、サーバ管理ソフトウエア１１１で受け付けられると、まずユーザ設定テーブル１１６の参照を行い（Ｓ３２）、利用顧客２にジョブに付加されたプライオリティの利用許可があるかどうかの判断を行う（Ｓ３３）。
【００９３】
Ｓ３３で利用許可があると判断された場合は計算サーバ１０１に実行開始指示を行う（Ｓ３４）。Ｓ３３で利用許可がないと判断された場合は、利用顧客２に対して、当該ジョブが実行不可能である旨の通知を行う（Ｓ３５）。
【００９４】
以上のように、この実施の形態では、計算機システム１１の利用顧客２に対して、プライオリティの利用許可を与えることにより、利用顧客２の要求に応じて障害発生時のジョブ実行のオーバヘッド削減というサービスを提供することができる。
【００９５】
以上、本発明者によってなされた発明をその実施の形態に基づき具体的に説明したが、本発明は前記実施の形態に限定されるものではなく、その要旨を逸脱しない範囲で種々変更可能であることはいうまでもない。
【００９６】
【発明の効果】
本願において開示される発明のうち、代表的なものによって得られる効果を簡単に説明すれば、以下のとおりである。
【００９７】
（１）制御装置により、環境モニタ装置から異常情報の報告があると、計算サーバ上で実行されているジョブに関する情報を取得し、その取得した情報に基づいて、そのジョブの優先度を決定し、決定した優先度の高いジョブから順に、ジョブに対する継続実行情報を不揮発記憶装置に退避させ、予め設定された動作環境の異常に対する計算機システムのシステム停止処理開始猶予時間を経過すると計算機システムの停止処理を開始するようにしたので、計算機システムに障害が発生して、システムを停止しなければならない場合に、ジョブの退避を効率よく行うことができ、ジョブの再実行のオーバヘッドを削減することができるという効果を有する。
【００９８】
（２）チェックポイント記録を定期的に行うことなく、突発的な障害発生に対するジョブのオーバヘッドを削減するので、通常運用時におけるジョブの実行速度も損なうことは無いという効果を有する。
【００９９】
（３）ＡＳＰ事業への応用を行うことにより、障害発生時のジョブの退避の優先度を高める権利を対価とする新規事業形態を実現できる。
【図面の簡単な説明】
【図１】本発明の実施の形態１による計算機システムの構成を示す構成図である。
【図２】本発明の実施の形態１におけるジョブ管理テーブルの内容の一例を示す図である。
【図３】本発明の実施の形態１における障害・警報時動作定義テーブルの内容の一例を示す図である。
【図４】本発明の実施の形態１の計算機システムにおける障害・警報発生時の処理フローである。
【図５】本発明の実施の形態１におけるジョブフリーズを実行するときの、ジョブに対する優先度決定方法の一例を示したものである。
【図６】本発明の実施の形態１におけるジョブフリーズを実行するときの、ジョブに対する優先度決定方法の他の例を示したものである。
【図７】本発明の実施の形態２におけるアプリケーションサービスプロバイダ（ＡＳＰ）事業の概要を示した図である。
【図８】本発明の実施の形態２におけるＡＳＰ事業者と利用顧客との間の関係を示す図である。
【図９】本発明の実施の形態２の事業形態を実現するためのシステム構成図である。
【図１０】本発明の実施の形態２の利用顧客によるジョブ実行依頼時の計算機システムの動作フローである。
【図１１】従来の計算機システムにおける障害・警報発生時の処理フローである。
【符号の説明】
１アプリケーションサービスプロバイダ（ＡＳＰ）事業者
２利用顧客
１１計算機システム
１２ゲートウェイサーバ
１３インターネット
１４顧客端末
１５管理端末
１０１計算サーバ
１０２ストレージ装置
１０３制御装置
１０４環境モニタ装置
１０５気温／湿度センサ
１０６無停電電源装置（ＵＰＳ）
１１１サーバ管理ソフトウエア
１１２ジョブ管理テーブル
１１３障害・警報時動作設定テーブル
１１４記憶装置
１１５フリーズ処理済ジョブテーブル
１１６ユーザ設定テーブル
１１２１ジョブ名
１１２２ユーザ名
１１２３プライオリティ
１１２４ジョブの開始時間
１１３１障害要因
１１３２システム停止処理開始猶予時間
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method for automatically operating a computer system and to a computer system, and more particularly to a technique for improving the continuity of the processing that was running on the computers when the computer system is automatically stopped because of a failure occurring in the computer system or an abnormality in its installation environment.
[0002]
[Prior art]
In the operation of a computer system, if the installation environment no longer satisfies the recommended conditions, continuing operation may cause serious damage to the computer system.
[0003]
For example, if the temperature of the room where the computer system is installed becomes abnormally high, the components that make up the computer system may fail because of the heat. Likewise, if power can no longer be supplied to the computer system because of a power fault such as a blackout, a data storage device such as a disk device may malfunction and corrupt data.
[0004]
When a system failure such as an operating system panic occurs, in addition to the environmental abnormalities described above, the system must be stopped promptly and the abnormal state removed in order to minimize damage to the system.
[0005]
To cope with such failures in a computer system, a method that automates the operation of safely stopping the computer system, as shown in FIG. 11, has conventionally been adopted.
[0006]
This method realizes automatic stop processing of the computer system by using, in addition to the calculation server 101 and the control device 103 that manages it, a system called the environment monitor device 104, which detects abnormalities in the environment where the computer system is installed.
[0007]
For example, in the event of a power failure, an uninterruptible power supply (UPS) automatically detects the power failure, the control device 103 is notified through the environment monitor device 104, and the system stop processing is performed automatically using an auxiliary power supply such as a storage battery.
[0008]
The same applies to abnormalities in the installation environment such as room temperature: for example, a temperature sensor or the like is provided in the room to monitor the room temperature constantly, and when the environment monitor device 104 detects that the room has reached an abnormal temperature, the system is stopped automatically.
[0009]
Known examples of techniques for automatically stopping a computer system because of a failure occurring in the computer system or its installation environment include the computer system described in Patent Document 1 and the electronic computer system described in Patent Document 2.
[0010]
On the other hand, when the computer system is stopped, the job that has been executed on the computer system is usually stopped, and the job needs to be re-executed.
[0011]
To address this, computer systems having a so-called job freeze / restart function have been proposed. For currently running jobs and processes, such a system performs job save processing (hereinafter referred to as job freeze), which saves the jobs by recording the internal state of the computer system, including the contents of the memory areas they use, in a non-volatile storage device such as a disk device. Even if a job is stopped because of a system stop, the job can continue executing once the internal state saved by the job freeze is read back from the storage device, which reduces the overhead caused by the system stop.
[0012]
A computer system having a checkpoint recording function has also been proposed, which periodically records the internal state of the computer system as described above and, when a failure occurs, continues job execution from the last recorded internal state.
[0013]
Conventionally, with these functions, when the computer system is stopped, its internal state is restored after the restart and jobs continue executing from the state before the stop. This reduces the overhead of job re-execution for periodic stops of the computer system and allows the system to be operated efficiently.
[0014]
[Patent Document 1]
JP-A-10-240390
[0015]
[Patent Document 2]
JP-A-5-152942
[0016]
[Problems to be solved by the invention]
For a sudden system shutdown due to a failure occurring inside or outside the computer system, the occurrence of the failure cannot be foreseen, so it is important to quickly restore the system after the failure has occurred.
[0017]
However, the conventional method using the job freeze function freezes all running jobs when a failure occurs, so the job freeze takes a long time.
[0018]
For example, freezing all jobs at the time of a power failure requires an uninterruptible power supply with a storage battery large enough to cover that time, which makes the cost very high. It is therefore difficult to freeze all jobs, and jobs inevitably have to be re-executed after the system is restored.
[0019]
Further, the method using the checkpoint recording function adds processing that periodically records the internal state during job execution, so the execution time of each job becomes longer.
[0020]
The present invention has been made to solve these problems, and its object is to provide a computer system that can reduce, as much as possible, the overhead of re-executing jobs at system recovery after a sudden failure or alarm inside or outside the system, without increasing the hardware cost of the computer system or lengthening job execution time.
[0021]
The above and other objects and novel features of the present invention will become apparent from the description of the present specification and the accompanying drawings.
[0022]
[Means for Solving the Problems]
The following is a brief description of an outline of typical inventions disclosed in the present application.
[0023]
An automatic operation method of a computer system according to the present invention is a method that performs stop processing and start processing of a computer system, in which: the operating environment of the computer system is monitored; when an abnormality occurs in the operating environment, information about the jobs being executed on the computer system is acquired; the priority of the jobs is determined based on the acquired information; continuous execution information for the jobs is saved to a non-volatile storage device in order from the job with the highest determined priority; and the computer system is stopped when the preset system stop processing start grace time for the abnormality of the operating environment has elapsed.
[0024]
Further, a computer system according to the present invention comprises at least one calculation server that executes jobs, a control device that manages the operation of the calculation server, and an environment monitor device that monitors the operating environment of the calculation server and the control device and reports abnormality information to the control device when an abnormality occurs in the operating environment. When the environment monitor device reports abnormality information, the control device acquires information about the jobs being executed on the calculation server, determines the priority of the jobs based on the acquired information, saves continuous execution information for the jobs to a non-volatile storage device in order from the job with the highest determined priority, and starts stop processing of the computer system when the preset system stop processing start grace time for the abnormality of the operating environment has elapsed.
[0025]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings for describing the embodiments, the same members are denoted by the same reference numerals, and a repeated description thereof will be omitted.
[0026]
(Embodiment 1)
FIG. 1 is a configuration diagram showing a configuration of a computer system according to Embodiment 1 of the present invention.
[0027]
In the figure, the computer system 11 comprises one or more calculation servers 101, a storage device 102, a control device 103 that manages these calculation servers, sensor devices for the external environment, namely a temperature / humidity sensor 105 that monitors the temperature and humidity of the room where the system is installed and an uninterruptible power supply (UPS) 106 that monitors power abnormalities such as a power failure, and an environment monitor device 104 that receives external environment conditions such as the room temperature and the power supply state from these sensor devices and, when an abnormality occurs, notifies the control device of its content as an alarm.
[0028]
The control device 103 is connected to a storage device 114 that holds an operating system, various software, setting information, and the like necessary for the operation of the control device.
[0029]
On the control device 103, server management software 111, which is software for managing the operation of the entire computer system, is running. The control device holds a job management table 112 that manages the jobs currently being executed on the calculation server 101, and a failure / alarm operation setting table 113 that holds the system stop processing start grace time allowed before the system is stopped when a failure or alarm occurs.
[0030]
The job management table 112 manages information about each currently running job, such as the user who started the job and the execution start time. The failure / alarm operation setting table 113 gives, for each assumed failure or alarm, the system stop processing start grace time, that is, how long after the failure or alarm occurs the system stop processing is to start.
[0031]
Here, an example of the contents of the job management table 112 and the failure / alarm operation setting table 113 will be described.
[0032]
FIG. 2 is a diagram showing an example of the contents of the job management table 112, and FIG. 3 is a diagram showing an example of the contents of the failure / alarm operation setting table 113.
[0033]
The job management table 112 shown in FIG. 2 records information on currently running jobs. When a job starts executing, its job name 1121, the name 1122 of the computer user who executed it, its priority 1123, its execution start time 1124, and the like are recorded; when the job ends, the information for that job is deleted. The priority 1123 is attached to the job, for example by being set by the computer user when job execution starts; it serves as priority information and is used as a factor in determining the priority when a job freeze is performed.
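The lifecycle of a table row described above can be sketched as follows. This is a minimal illustration only: the class and method names are hypothetical, and the convention that a larger number means a higher priority is an assumption, since the text does not state how priority values are encoded.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JobRecord:
    job_name: str         # corresponds to job name 1121
    user_name: str        # corresponds to computer user name 1122
    priority: int         # corresponds to priority 1123 (assumed: larger = higher)
    start_time: datetime  # corresponds to execution start time 1124


class JobManagementTable:
    """Holds one record per currently running job (hypothetical model of table 112)."""

    def __init__(self):
        self._records = {}

    def job_started(self, record):
        # A row is recorded when the job starts executing.
        self._records[record.job_name] = record

    def job_finished(self, job_name):
        # The row is deleted when the job ends.
        self._records.pop(job_name, None)

    def running_jobs(self):
        return list(self._records.values())


table = JobManagementTable()
table.job_started(JobRecord("job1", "alice", 3, datetime(2003, 2, 26, 9, 0)))
table.job_started(JobRecord("job2", "bob", 1, datetime(2003, 2, 26, 10, 30)))
table.job_finished("job1")
print([r.job_name for r in table.running_jobs()])  # ['job2']
```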
[0034]
The failure / alarm operation setting table 113 shown in FIG. 3 defines the failure factor 1131, which is the content of each assumed failure or alarm, and, for each failure or alarm, the system stop processing start grace time 1132, which is the time allowed for job freeze processing. This information is determined when the system is introduced, from the specifications of the servers, storage, UPS, and other equipment used in the system.
[0035]
For example, as the grace period for starting the system stop processing at the time of power failure, the time during which the storage battery can be used as a backup power supply at the time of power failure may be set based on the capacity of the storage battery built into the UPS and the scale of the system.
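As a rough illustration of how such a table and the power-failure grace time might be represented, the sketch below uses a plain dictionary and a simple energy-over-load estimate. The table entries, the function names, and the `usable_fraction` derating factor are all assumptions made for this example; the text only says that the value is derived from the battery capacity and the scale of the system.

```python
# Hypothetical contents of the failure / alarm operation setting table 113:
# failure factor (1131) -> system stop processing start grace time (1132), seconds.
GRACE_TIME_SECONDS = {
    "power_failure": 300,
    "room_temperature_abnormal": 600,
}


def grace_time_for(failure_factor):
    """Look up the grace time for a reported failure or alarm."""
    return GRACE_TIME_SECONDS[failure_factor]


def ups_backup_seconds(battery_wh, system_load_w, usable_fraction=0.8):
    """Rough backup time: usable battery energy divided by system load.

    usable_fraction is an assumed derating factor, not a value from the text.
    """
    if system_load_w <= 0:
        raise ValueError("system_load_w must be positive")
    return battery_wh * usable_fraction / system_load_w * 3600


print(grace_time_for("power_failure"))
print(ups_backup_seconds(battery_wh=2000, system_load_w=8000))
```

In practice the configured grace time would be set somewhat below the estimated backup time so that the stop processing itself can still complete on battery power.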
[0036]
Next, the operation of this embodiment will be described.
[0037]
FIG. 4 is the processing flow at the time of a failure or alarm in the computer system of this embodiment, and shows the processing of the computer system, in particular of the server management software, from the occurrence of the failure or alarm until the system stops.
[0038]
FIG. 4 uses as an example the case where a room temperature abnormality occurs in the room where the computer system 11 is installed, but the system stop processing follows the same flow when another failure or alarm, such as a power failure, is raised.
[0039]
First, when the environment monitor device 104 detects a room temperature abnormality (S201), this information is sent as an alarm to the server management software 111 running on the control device 103.
[0040]
Next, when the server management software 111 receives the room temperature abnormality alarm from the environment monitor device 104, it first acquires, from the failure / alarm operation setting table 113, the time until system stop processing is to start for that alarm (the system stop processing start grace time) (S202).
[0041]
Next, with respect to the job currently being executed by the calculation server, information necessary for determining the priority of the job saved by the job freeze is acquired (S203).
[0042]
Next, the priority of the jobs to be frozen is determined from the information obtained in S203, and the calculation server 101 is instructed to start job freezes in accordance with the determined priority (S204).
[0043]
FIG. 4 shows a case where the priorities are determined in the order job1, job2, job3. In this case, following the job freeze start instruction from the server management software 111, the calculation server 101 starts job freezes in order from job1, saving to the storage device 102 or the like the information that allows each job to be re-executed after the computer system is restarted (continuous execution information).
[0044]
When the job freeze of job1 is completed, the calculation server 101 notifies the server management software 111 that the freeze processing of job1 has finished. On receiving this notification, the server management software 111 records the fact that job1 is frozen in the frozen job table 115 in the storage device 114 connected to the control device 103 (S205).
[0045]
The job freeze processing then proceeds on the calculation server 101 in the order job2, job3. When the system stop processing start grace time acquired in S202 elapses, the server management software 111 instructs the calculation server 101 to abort the freeze processing currently in progress and starts the stop processing of the computer system 11 (S206).
[0046]
In the example of FIG. 4, during the execution of the freeze processing of job3, the time for starting the system stop processing has been reached, so the freeze processing of job3 is stopped.
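The sequence from S204 to S206 can be sketched as a loop that freezes jobs in priority order until the grace time runs out. This is a simplified simulation under stated assumptions: the function name is hypothetical, time is modeled by adding up assumed per-job freeze durations instead of using a real clock, and the durations themselves are made-up numbers.

```python
def run_freeze_sequence(ordered_jobs, freeze_seconds, grace_seconds):
    """Freeze jobs in priority order until the grace time expires.

    ordered_jobs: job names, highest priority first (the result of S204).
    freeze_seconds: assumed mapping of job name -> time its freeze takes.
    Returns (frozen, aborted): the jobs whose freeze completed (S205) and
    the job whose freeze was cut off when stop processing began (S206),
    or None if every freeze finished in time.
    """
    elapsed = 0.0
    frozen = []
    for job in ordered_jobs:
        if elapsed + freeze_seconds[job] > grace_seconds:
            # The grace time expires while this job is being frozen.
            return frozen, job
        elapsed += freeze_seconds[job]
        frozen.append(job)  # would be recorded in the frozen job table 115
    return frozen, None


frozen, aborted = run_freeze_sequence(
    ["job1", "job2", "job3"],
    {"job1": 100, "job2": 150, "job3": 200},
    grace_seconds=300,
)
print(frozen, aborted)  # ['job1', 'job2'] job3
```

With these example durations the outcome matches the FIG. 4 scenario: job1 and job2 are frozen, and the freeze of job3 is aborted.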
[0047]
As described above, in this embodiment, jobs with high priority are automatically frozen when the computer system 11 is stopped, and when the computer system 11 is restored these jobs can continue executing from their state at the time the computer system 11 was stopped.
[0048]
When the computer system 11 is actually restored, the server management software 111 on the control device 103 refers to the list of frozen jobs recorded in the storage device 114 and, for the jobs listed there, has the calculation server 101 perform restart processing to resume execution using the information saved in the storage device 102.
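The recording of frozen jobs and their restart on recovery can be sketched as below. The JSON file format, the function names, and the `restart` callback are assumptions for illustration; the text only says that a frozen job table is kept in the storage device 114 and consulted on recovery. Clearing the table after the restart is also an assumption of this sketch.

```python
import json
import os
import tempfile


def load_frozen_jobs(table_path):
    """Return the list of frozen job names, or an empty list if none recorded."""
    if not os.path.exists(table_path):
        return []
    with open(table_path, encoding="utf-8") as f:
        return json.load(f)


def record_frozen_job(table_path, job_name):
    """Add job_name to the frozen job table once its freeze has completed."""
    jobs = load_frozen_jobs(table_path)
    if job_name not in jobs:
        jobs.append(job_name)
    with open(table_path, "w", encoding="utf-8") as f:
        json.dump(jobs, f)


def restart_frozen_jobs(table_path, restart):
    """Call restart(job) for each frozen job, then clear the table (assumption)."""
    jobs = load_frozen_jobs(table_path)
    for job in jobs:
        restart(job)
    with open(table_path, "w", encoding="utf-8") as f:
        json.dump([], f)
    return jobs


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frozen_jobs.json")
    record_frozen_job(path, "job1")
    record_frozen_job(path, "job2")
    restarted = restart_frozen_jobs(path, restart=lambda job: print("restart", job))
    print(restarted)
```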
[0049]
Next, an example of a method for determining the priority of a job when executing the job freeze according to this embodiment will be described.
[0050]
FIG. 5 shows an example of a method for determining the priority of a job when executing a job freeze. FIG. 5 corresponds to the information acquisition processing of the running job in S203, the determination of the priority in S204, and the freeze processing start processing in the processing flow of FIG.
[0051]
First, in the job information acquisition process S203, the calculation server 101 is queried for the execution time and memory usage of the currently running jobs (S2031), and the job, execution time, and memory usage information is acquired from the calculation server 101 (S2032).
[0052]
Next, in the priority determination and freeze start process S204, the processing time required to freeze each job is first calculated from the memory usage of each job acquired in S2032 (S2041).
[0053]
A job freeze consists mainly of recording the contents of the memory used by the job, as they are, in the storage device 102, so the processing time can be calculated from the transfer speed from the calculation server 101 to the storage device 102, the write speed of the storage device 102, and the like, together with the amount of memory used.
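One simple way to turn these quantities into a time estimate is sketched below. The text does not give a formula; this example assumes that the slower of the transfer speed and the write speed limits the overall throughput, and the function name and the figures used are hypothetical.

```python
def estimate_freeze_seconds(memory_bytes, transfer_bytes_per_s, write_bytes_per_s):
    """Estimate the freeze time from memory usage and the bottleneck throughput.

    Assumption: the slower of the server-to-storage transfer speed and the
    storage write speed determines how fast memory contents can be saved.
    """
    effective = min(transfer_bytes_per_s, write_bytes_per_s)
    if effective <= 0:
        raise ValueError("speeds must be positive")
    return memory_bytes / effective


# 8 GB of job memory, 200 MB/s link, 100 MB/s storage write speed.
print(estimate_freeze_seconds(8_000_000_000, 200_000_000, 100_000_000))  # 80.0
```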
[0054]
If this job freeze time is longer than the time remaining until the start of system stop processing, the freeze has no prospect of finishing even if it is attempted, so the job is removed from the freeze targets in order to give priority to other jobs that can be frozen (S2042).
[0055]
Next, for the remaining jobs, a job freeze start instruction is given to the calculation server 101 in order from the job with the longest execution time (S2043).
[0056]
According to this priority determination method, jobs that have already been executed for a long time and that can be job-frozen are preferentially job-frozen, so that the overhead of job execution time due to system stoppage can be reduced as a whole.
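Steps S2042 and S2043 amount to a filter followed by a sort, which can be sketched as follows. The dictionary keys, the function name, and the sample values are hypothetical; the estimated freeze time per job is assumed to have been computed already in S2041.

```python
def freeze_order_by_runtime(jobs, grace_seconds):
    """Return job names in freeze order for the FIG. 5 style method.

    jobs: list of dicts with hypothetical keys 'name', 'run_seconds'
    (time the job has been running) and 'freeze_seconds' (estimated
    freeze time from S2041).
    """
    # S2042: drop jobs whose freeze cannot finish within the grace time.
    candidates = [j for j in jobs if j["freeze_seconds"] <= grace_seconds]
    # S2043: freeze the longest-running jobs first.
    candidates.sort(key=lambda j: j["run_seconds"], reverse=True)
    return [j["name"] for j in candidates]


jobs = [
    {"name": "job1", "run_seconds": 7200, "freeze_seconds": 100},
    {"name": "job2", "run_seconds": 36000, "freeze_seconds": 900},
    {"name": "job3", "run_seconds": 600, "freeze_seconds": 50},
]
print(freeze_order_by_runtime(jobs, grace_seconds=300))  # ['job1', 'job3']
```

Here job2 has run the longest but is excluded because its estimated freeze time exceeds the grace time, so the remaining jobs are ordered by how long they have been running.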
[0057]
Next, another example of a method for determining the priority of a job when executing the job freeze according to this embodiment will be described.
[0058]
FIG. 6 shows another example of a method for determining the priority of a job when executing a job freeze. FIG. 6 corresponds to the information acquisition processing of the running job in S203, the determination of the priority in S204, and the freeze processing start processing in the processing flow of FIG.
[0059]
In the example of FIG. 6, priority information is set for the job, and the priority information added to the job is used as information for determining the priority of the job.
[0060]
First, in the job information acquisition process S203, because priority information is managed for each job in the job management table 112 as shown in FIG. 2, for example, the server management software 111 acquires the priority together with the job (S2033).
[0061]
Next, in the priority determination and freeze processing start processing S204, a job freeze start instruction is issued to the calculation server 101 in order from the job with the highest priority acquired in S2033 (S2044).
[0062]
With this priority determination method, for example for a company's core business processing programs, setting the priority appropriately according to the importance of the processing makes it possible to reduce the processing overhead caused by a failure even if one occurs.
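The ordering of S2033 and S2044 can be sketched as a sort over the entries of the job management table. The field names below mirror the table columns (job name, user name, priority, job start time), but the `JobEntry` type and the function name are hypothetical. The sketch also assumes that a larger number means a higher priority and that ties are broken in favor of the job that started earlier; neither is stated for this example in the text.

```python
from dataclasses import dataclass


@dataclass
class JobEntry:
    job_name: str
    user_name: str
    priority: int      # assumption: larger value means higher priority
    start_time: float  # job start time, e.g. seconds since some epoch


def freeze_order_by_priority(table):
    """S2044: order jobs for freezing, highest priority first.

    Assumption: among equal priorities, the earlier-started job goes first.
    """
    return sorted(table, key=lambda e: (-e.priority, e.start_time))


table = [
    JobEntry("batch", "u1", priority=1, start_time=100.0),
    JobEntry("billing", "u2", priority=3, start_time=200.0),
    JobEntry("report", "u3", priority=3, start_time=50.0),
]
print([e.job_name for e in freeze_order_by_priority(table)])
```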
[0063]
The example of the method of determining the priority of the job freeze has been described with reference to FIGS. 5 and 6. However, the present invention is not limited to this method of determining the priority, and other methods of determining the priority of the job may be used. Alternatively, the priority may be determined using various information such as information of various tables in the control device 103 and information from the calculation server 101 depending on the operation mode of the computer.
[0064]
Further, in this embodiment, when a running job is saved by a job freeze, the information needed to resume it after the computer system restarts is saved in the storage device 102 connected to the calculation server 101. Alternatively, the information on the job may be saved in a nonvolatile storage device other than the storage device 102. In this case, when the computer system is restored, the calculation server 101 may perform restart processing based on the information in the nonvolatile storage device to which the job was saved, and continue executing the job.
[0065]
(Embodiment 2)
In this embodiment, the automatic operation performed when a failure or alarm occurs in the computer system of the first embodiment is applied to a business, such as an application service provider (ASP), that has companies and the like as customers and executes jobs on a computer on their behalf, and is provided as a new service.
[0066]
Hereinafter, a new business using the automatic operation method of the computer system of the present invention will be described.
[0067]
FIG. 7 is a diagram showing an outline of an application service provider (ASP) business according to the second embodiment of the present invention.
[0068]
In the figure, an ASP company 1 has a computer system 11 similar to that of the first embodiment, and provides a customer 2 with the use of the computer, either by executing jobs on the customer's behalf or through remote access via the Internet or the like.
[0069]
The ASP company 1 gives the customer 2 permission to use priorities that can be added to a job when the customer 2 executes the job on the computer system 11, and collects from the customer 2 a priority usage fee set according to the priority level. The priority is a parameter that determines the order in which jobs are subjected to a job freeze when a failure or alarm occurs in the computer system 11.
[0070]
Therefore, by obtaining the right to add a high priority to a job, the customer 2 obtains the right to reduce the overhead caused by a failure when one occurs in the computer system 11.
[0071]
Next, the relationship between the ASP 1 and the customer 2 in this embodiment will be described.
[0072]
FIG. 8 is a diagram illustrating a relationship between an ASP operator and a user according to the second embodiment.
[0073]
First, the contract phase 3, in which a contract regarding the use of the computer system is made between the ASP operator 1 and the customer 2, will be described.
[0074]
The customer 2 applies to the ASP operator 1 for use of the computer system (S21). At this time, as a contract condition, in addition to the method of use, the customer 2 also applies for the priorities it wishes to be able to use, which determine the priority of job freeze.
[0075]
When the ASP company 1 receives the application for use in S21, it notifies the user 2 of the use permission of the computer including the use permission of the priority, and charges the use fee and the contract fee (S22). On the other hand, the customer 2 makes a contract by paying the usage fee to the ASP company 1 (S23).
[0076]
Next, the stage of normal operation 4, which is the actual operation, will be described.
[0077]
First, the customer 2 requests the ASP 1 to execute a job, or directly executes remote access to the computer system 11 to execute a job (S24). At this time, it is possible to execute the program by adding the priority permitted at the time of contract.
[0078]
In response to the job execution request from the customer 2 in S24, the ASP company 1 executes the job, and returns the job execution result to the customer 2 (S25).
[0079]
As the contents of the execution result, the result is notified if the processing ends normally, and if the execution fails, the fact is notified. The case where the execution is failed includes the case where the customer 2 tries to use a priority for which use permission has not been obtained in advance.
[0080]
Next, a description will be given of the relationship between the ASP and the customer at the stage of the failure operation 5 in which a failure occurs in the computer system during operation.
[0081]
First, the customer 2 requests the ASP company 1 to execute a job (S26), and after the ASP company 1 starts executing the job, a failure occurs in the computer system 11 (S27). Depending on the priority added to the job, the usage status of the computer system 11, and the content of the failure, the computer system automatically determines whether execution of the job is to be continued, or whether it cannot be continued and is to be stopped.
[0082]
If it is determined that execution is to be continued, the job is saved to the storage device 102 or the like by a job freeze, and execution is continued by restart processing after the computer system is restored. If it is determined that execution is to be stopped, execution of the job is stopped and the job is re-executed after the computer system is restored (S28). Thereafter, when the job is completed, the execution result is reported to the customer 2 (S29).
[0083]
Next, a system configuration for realizing the business mode of this embodiment will be described.
[0084]
FIG. 9 is a system configuration diagram for realizing the business mode of the second embodiment.
[0085]
In the figure, a customer terminal 14 used by a customer 2 is connected to a computer system 11 managed by an ASP operator 1 through an ASP gateway server 12 and the Internet 13.
[0086]
Further, the control device 103 includes a job management table 112, a failure / alarm operation setting table 113, and a user setting table 116 for managing the priority permitted to the customer 2 who uses the computer system. In addition, a management terminal 15 is connected for the ASP operator 1 to set the user setting table 116.
[0087]
In response to a priority use request from the customer 2, the ASP operator 1 sets the priority permission by editing the user setting table 116 through the management terminal 15.
[0088]
A job execution request is remotely sent from the customer terminal 14 to the computer system 11 through the Internet 13 and the gateway server 12. The execution result is also notified from the computer system 11 to the customer terminal 14 through the gateway server 12 and the Internet 13.
[0089]
Next, the processing operation of a job with a priority according to this embodiment will be described.
[0090]
FIG. 10 is an operation flow of the computer system at the time of a job execution request by a user according to the second embodiment. It shows the operation within the computer system, in particular that of the server management software 111, when the user submits a job with an added priority to the computer system.
[0091]
First, the customer 2 issues a request to the computer system 11 to execute a job to which a priority has been added (S31).
[0092]
When the server management software 111 accepts the execution request of S31, it first refers to the user setting table 116 (S32) and determines whether the customer 2 has permission to use the priority added to the job (S33).
[0093]
If it is determined in S33 that use is permitted, an execution start instruction is issued to the calculation server 101 (S34). If it is determined in S33 that use is not permitted, the customer 2 is notified that the job cannot be executed (S35).
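The check in S32 to S35 can be sketched as a lookup in the user setting table. This is only an illustration: the text does not describe the layout of the user setting table 116, so the sketch assumes it maps each customer to the set of priorities that customer is permitted to use, and the function name and return values are hypothetical.

```python
def handle_job_request(user_setting_table, customer, priority):
    """Decide whether a job submitted with a given priority may start.

    Assumption: user_setting_table maps a customer identifier to the set
    of priorities that customer has been permitted to use.
    """
    # S32: refer to the user setting table.
    permitted = user_setting_table.get(customer, set())
    # S33: determine whether the requested priority is permitted.
    if priority in permitted:
        return "start_execution"   # S34: instruct the calculation server
    return "not_executable"        # S35: notify the customer


user_setting_table = {"customer2": {1, 2}}
print(handle_job_request(user_setting_table, "customer2", 2))
print(handle_job_request(user_setting_table, "customer2", 5))
```

A customer with no entry in the table is treated here as having no permitted priorities, which is a design choice of the sketch rather than something the text specifies.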
[0094]
As described above, in this embodiment, permission to use priorities is given to the customer 2 of the computer system 11, so that a service that reduces the overhead of job execution when a failure occurs can be provided in response to the request of the customer 2.
[0095]
The invention made by the inventor has been specifically described above based on the embodiments. Needless to say, however, the present invention is not limited to the embodiments and can be variously modified without departing from the gist thereof.
[0096]
[Effects of the Invention]
The effects obtained by typical aspects of the invention disclosed in the present application will be briefly described as follows.
[0097]
(1) When abnormality information is reported from the environment monitoring device, the control device acquires information on the jobs being executed on the calculation server, determines the priority of each job based on the acquired information, saves the continuation execution information of the jobs in the nonvolatile storage device in descending order of the determined priority, and starts the stop processing of the computer system when the preset system stop processing start grace time has elapsed. As a result, when a failure occurs in the computer system and the system must be stopped, jobs can be saved efficiently and the overhead of re-executing jobs can be reduced.
[0098]
(2) Because the overhead of jobs on a sudden failure is reduced without performing periodic checkpoint recording, the job execution speed during normal operation is not impaired.
[0099]
(3) By applying the present invention to the ASP business, a new business form can be realized in which the right to raise the priority of job saving when a failure occurs is provided for a fee.
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing a configuration of a computer system according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the contents of a job management table according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of the contents of a failure / alarm operation definition table according to Embodiment 1 of the present invention.
FIG. 4 is a processing flow when a failure / alarm occurs in the computer system according to the first embodiment of the present invention.
FIG. 5 illustrates an example of a method of determining a priority for a job when executing a job freeze according to the first embodiment of the present invention.
FIG. 6 shows another example of a method for determining a priority for a job when executing a job freeze according to the first embodiment of the present invention.
FIG. 7 is a diagram showing an outline of an application service provider (ASP) business according to the second embodiment of the present invention.
FIG. 8 is a diagram showing a relationship between an ASP operator and a customer in Embodiment 2 of the present invention.
FIG. 9 is a system configuration diagram for realizing a business mode according to a second embodiment of the present invention.
FIG. 10 is an operation flow of the computer system when a user executes a job execution request according to the second embodiment of the present invention.
FIG. 11 is a processing flow when a failure / alarm occurs in a conventional computer system.
[Explanation of symbols]
1 Application service provider (ASP) operator
2 Customer
11 Computer system
12 Gateway server
13 Internet
14 Customer terminal
15 Management terminal
101 Calculation server
102 Storage device
103 Control device
104 Environmental monitoring device
105 Temperature / humidity sensor
106 Uninterruptible Power Supply (UPS)
111 Server management software
112 Job management table
113 Fault / alarm operation setting table
114 Storage
115 Freeze processed job table
116 User setting table
1121 Job name
1122 User name
1123 Priority
1124 Job start time
1131 Failure factor
1132 System stop processing start grace time

Claims

In an automatic operation method for performing a stop process and a start process of a computer system,
The operating environment of the computer system is monitored; when an abnormality occurs in the operating environment, information on the jobs being executed on the computer system is acquired; the priority of each job is determined based on the acquired information; continuation execution information for the jobs is saved in a nonvolatile storage device in descending order of the determined priority; and the computer system is stopped when a preset system stop processing start grace time of the computer system for the abnormality of the operating environment has elapsed.

The method for automatically operating a computer system according to claim 1,
The priority of each job is determined based on the elapsed execution time of the job and the amount of memory it uses in the computer system, and a high priority is given to a job whose elapsed execution time is long and whose saving can be completed before the system stop processing start grace time elapses.

The automatic operation method for a computer system according to claim 1,
The priority of each job is determined based on priority information added in advance when the job is executed, and a higher priority is given to a job for which the priority information specifies a higher priority.

In a computer system comprising: at least one calculation server that executes jobs;
a control device that manages the operation of the calculation server; and
an environment monitoring device that monitors the operating environment of the calculation server and the control device and, when an abnormality occurs in the operating environment, reports abnormality information to the control device,
the control device, upon receiving a report of abnormality information from the environment monitoring device, acquires information on the jobs being executed on the calculation server, determines the priority of each job based on the acquired information, saves continuation execution information for the jobs in a nonvolatile storage device in descending order of the determined priority, and starts stop processing of the computer system when a preset system stop processing start grace time of the computer system for the abnormality of the operating environment has elapsed.

In a computer system comprising: at least one calculation server that executes jobs;
a control device that manages the operation of the calculation server; and
an environment monitoring device that monitors the operating environment of the calculation server and the control device and, when an abnormality occurs in the operating environment, reports abnormality information to the control device,
the control device has a first table for managing the jobs being executed on the calculation server and a second table storing, for each item of abnormality information reported from the environment monitoring device, the system stop processing start grace time allowed before system stop processing of the computer system is started; and, upon receiving a report of abnormality information from the environment monitoring device, the control device acquires information on the jobs being executed on the calculation server from the first table, determines the priority of each job based on the acquired information, saves continuation execution information for the jobs in a nonvolatile storage device in descending order of the determined priority, acquires from the second table the system stop processing start grace time of the computer system for the abnormality of the operating environment, and starts stop processing of the computer system when the acquired system stop processing start grace time has elapsed.

The computer system according to claim 4 or 5,
The control device determines the priority of each job based on the elapsed execution time of the job and the amount of memory it uses in the computer system, and gives a high priority to a job whose elapsed execution time is long and for which saving of the information in the memory will be completed before the system stop processing start grace time of the computer system elapses.

The computer system according to claim 4 or 5,
The control device determines the priority of each job based on priority information added in advance when the job is executed, and gives a higher priority to a job for which the priority information specifies a higher priority.

The computer system according to claim 7,
For jobs for which the priority information specifies the same priority, the control device determines the priority based on the elapsed execution time of each job and the amount of memory it uses in the computer system, and gives a high priority to a job whose elapsed execution time is long and for which saving of the information in the memory will be completed before the system stop processing start grace time of the computer system elapses.

9. The computer system according to claim 7, wherein
the computer system further comprises a customer terminal connected to the computer system via a network, and
the control device, when a customer uses the computer system via the network, accepts execution of a job to which priority information is added, the priority information having been issued by the administrator of the computer system, in accordance with the customer's priority use request, to the customer who uses the computer system via the customer terminal, and causes the job to be executed on the calculation server.

The computer system according to claim 9,
The control device has a user setting table storing permission information for the priority information issued by the administrator of the computer system, in accordance with the customer's priority use request, to the customer who uses the computer system via the customer terminal; and, upon receiving from the customer an execution request for a job to which priority information is added, the control device compares the priority information with the information stored in the user setting table and, when the job has priority information that the customer is not permitted to use, stops execution of the job and notifies the customer that the job cannot be executed.