JP2013222313A

JP2013222313A - Failure contact efficiency system

Info

Publication number: JP2013222313A
Application number: JP2012093486A
Authority: JP
Inventors: Yasuyuki Tamai; 康之玉井; Masatomo Ukeda; 賢知受田; Tomonori Sekiguchi; 知紀関口
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-04-17
Filing date: 2012-04-17
Publication date: 2013-10-28

Abstract

PROBLEM TO BE SOLVED: In a cloud computing service, the content of a failure must be analyzed before the customers to be contacted can be determined, which increases the time required to contact them. SOLUTION: When a failure occurs, failures that are likely to spread from it are predicted, and the content of the customer contact is prepared in advance. When the prediction proves correct, the prepared content is used to contact the customers quickly.

Description

The present invention relates to a failure communication efficiency system for contacting customers during failure handling in the operation and management of a computing system.

In cloud computing services, virtualization concentrates many workloads on each physical server, so a single failure affects multiple users. Cloud computing service providers therefore need to notify their users of failures efficiently.
As one efficient method of contacting users, Patent Document 1 identifies the contact destination from problem events registered in advance and issues a problem notification when a problem occurs.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2007-141007 (JP2007-141007)

Patent Document 1 targets the case where each customer has an independent system, so when a failure occurred, the customer to be contacted was uniquely determined. In addition, dedicated operations staff were assigned to each customer, so the customer could be contacted as soon as the content of the notification was settled.
A cloud computing service, however, operates the systems of many customers on a large-scale, virtualized cloud computing system without dedicated staff. To determine which customers to contact when a failure occurs, the content of the failure must first be analyzed, and this work increases the time required before the customers are contacted.
Furthermore, when a single failure requires contacting multiple customers, the operations staff handling it are fewer than the customers to be contacted, which also increases the time required before the customers are contacted.

In the present invention, when a failure occurs, failures that may spread from it in the future are predicted, and the content to be communicated to users is prepared. If the prediction proves correct, the users are contacted promptly using the content prepared in advance.
In addition, when failures occur simultaneously in multiple places, the system determines contact priority based on parameters describing user and system characteristics.

Because the notifications to users all stem from the same failure, they can be prepared efficiently, shortening the time needed to finish contacting the users.
In addition, contacting users in descending order of priority allows service levels to be satisfied more broadly across the entire cloud environment.

FIG. 1 is a block diagram of the failure communication efficiency system according to the first embodiment.
FIG. 2 shows the flow of processing over time when a failure occurs, according to the first embodiment.
FIG. 3 is a detailed diagram of the data stored in the storage 120 and the storage 121 according to the first embodiment.
FIG. 4 is a detailed diagram of the data stored in the storage 122, the storage 123, and the storage 124 according to the first embodiment.
FIG. 5 is a detailed diagram of the data exchanged between the systems of the first embodiment.
FIG. 6 is a flowchart showing details of the processing procedures of the server monitoring system 110, the root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113 when a failure occurs, according to the first embodiment.
FIG. 7 is a flowchart showing details of the processing procedure of the spread destination prediction system 112 when a failure occurs, according to the first embodiment.
FIG. 8 is a flowchart showing details of the procedure by which the customer contact / priority determination system 113 determines the customer contact text when a failure occurs, according to the first embodiment.
FIG. 9 is a flowchart showing details of the procedure by which the customer contact / priority determination system 113 determines the customer contact priority when a failure occurs, according to the first embodiment.
FIG. 10 is a flowchart showing details of the procedure by which the customer contact / priority determination system 113 decides between telephone and email contact when determining the customer contact text, according to the first embodiment.
FIG. 11 is a diagram showing a concrete example of how the customer contact / priority determination system 113 determines the customer contact priority when a failure occurs, according to the first embodiment.

FIG. 1 shows a first embodiment of the present invention: a system for improving the efficiency of failure communication to users that takes a multi-tenant environment into account. The data center 100 contains a plurality of physical servers, a management router, a router, and storage. Virtual machines run on the physical servers.

In this embodiment, the server monitoring system 110, the failure root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113 are each described as implemented on a virtual server. Each virtual server and the management router 103 are connected by Ethernet (registered trademark), and the servers exchange Ethernet packets via the virtual NIC attached to each virtual server. Each virtual server is assigned an IP address, by which it identifies the sender and receiver. All data exchanged between the systems on the virtual servers is converted into Ethernet packets and sent to or received from the target system. The IP address of the target system may be defined in advance in a .hosts file, or may be obtained by querying a DNS system that resolves each system's name to an IP address. This Ethernet-based transmission mechanism is common to all data exchange between the systems described below, and its description is omitted hereafter unless otherwise noted.
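As a rough illustration of the two lookup options mentioned above, the following sketch resolves a system name from a statically defined table in hosts-file format and otherwise asks a DNS resolver. The system names, the addresses, and the choice to try the static table before DNS are all assumptions made for the example; the embodiment only states that either mechanism may be used.

```python
import socket


def parse_hosts(text):
    """Parse hosts-file style text ("IP name [alias ...]") into {name: ip}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table[name] = ip
    return table


def resolve(name, hosts_table, dns_lookup=socket.gethostbyname):
    """Return the IP for a system name: static table first, then DNS.

    Trying the table before DNS is an assumption of this sketch.
    """
    if name in hosts_table:
        return hosts_table[name]
    return dns_lookup(name)


# Hypothetical names and documentation-range addresses for the four systems.
HOSTS_TEXT = """
192.0.2.10  server-monitoring
192.0.2.11  root-cause-analysis
192.0.2.12  spread-prediction
192.0.2.13  customer-contact
"""
hosts = parse_hosts(HOSTS_TEXT)
```

Passing a stub as `dns_lookup` keeps the function testable without network access; in a real deployment the default `socket.gethostbyname` would perform the DNS query.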

The server monitoring system 110 collects logs and other data from the devices in the data center 100 (physical servers 130 and 135, hypervisors 131 and 136, and virtual servers 132, 133, 134, 137, 138, and 139) and monitors the state of those devices.

The server monitoring system 110 is installed in the data center 100 and performs monitoring there, but it may instead monitor the devices in the data center 100 remotely from outside. The server monitoring system 110 is connected over the network, via the management router 103 in the data center 100, to the physical servers 130 and 135, the hypervisors 131 and 136, and the virtual machines 132, 133, 134, 137, 138, and 139. It is also connected over the network to the server operation information 120, and to the failure root cause analysis system 111, the spread destination prediction system 112, the customer contact / priority determination system 113, and the Web portal 114.

The server monitoring system 110 may collect logs from these devices either by receiving data that the devices send, as with SNMP traps, or by actively reading the data through APIs that the devices provide.
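The two collection styles can be contrasted with a small sketch. Everything here is hypothetical: `Device`, `read_logs`, `LogCollector`, and the device names stand in for whatever trap receiver and device APIs an actual deployment would use, and no real SNMP library is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical monitored device that exposes a read-style API."""
    name: str
    pending: list = field(default_factory=list)

    def read_logs(self):
        # Hand over the buffered messages and clear the buffer.
        logs, self.pending = self.pending, []
        return logs


class LogCollector:
    """Hypothetical collector supporting both push and pull collection."""

    def __init__(self):
        self.logs = []

    def on_trap(self, device_name, message):
        """Push path: the device sends data and the collector receives it."""
        self.logs.append((device_name, message))

    def poll(self, devices):
        """Pull path: the collector actively reads each device's API."""
        for dev in devices:
            for message in dev.read_logs():
                self.logs.append((dev.name, message))


collector = LogCollector()
collector.on_trap("physical-server-130", "fan failure")
hypervisor = Device("hypervisor-131", ["vNIC down"])
collector.poll([hypervisor])
```

After these calls `collector.logs` holds one entry from each path, and the polled device's buffer is empty.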

The data collected from these devices is information used to judge the device state, such as failures. From the physical servers 130 and 135, information such as IP addresses, CPU, memory, and network utilization, the number of storage I/O operations per second, and device failures may be collected. From the hypervisors 131 and 136, information on the virtual networks and virtual CPUs assigned to virtual machines, and on failures that occurred on those virtual devices, may be collected. From the virtual machines 132, 133, 134, 137, 138, and 139, information such as virtual CPU, virtual memory, and virtual network utilization, the number of I/O operations per second to virtual storage, and virtual device failures may be collected.

The number of physical servers in the data center 100, the number and type of hypervisors running on them (or whether any hypervisor is present), and the number of virtual servers running on the hypervisors are arbitrary. The server monitoring system 110 may use multiple subsystems to collect logs from these devices in a distributed manner.

The server monitoring system 110 continuously checks the physical servers 130 and 135, the hypervisors 131 and 136, the virtual machines 132, 133, 134, 137, and 139, and the storage 105 in the cloud computing system 101 for abnormalities. The server monitoring system 110 may be implemented in software, in hardware, or in a combination of hardware and software.

The failure root cause analysis system 111 takes the error messages detected by the server monitoring system, refers to the server operation information 120 and the configuration management information 121, and analyzes which component was the root cause of the failure. The components that the root cause analysis system 111 can identify are devices such as the physical servers 130 and 135, the hypervisors 131 and 136, the virtual machines 132, 133, 134, 137, 138, and 139, the router 104, the management router 103, and the storage 105, as well as the CPU, memory, storage, I/O modules, and other parts inside those devices.

The root cause analysis system 111 is installed in the data center 100, but it may instead connect to the devices in the data center 100 remotely from outside. The root cause analysis system 111 is connected over the network, via the management router 103 in the data center 100, to the physical servers 130 and 135, the hypervisors 131 and 136, and the virtual machines 132, 133, 134, 137, 138, and 139. It is also connected over the network to the server operation information 120 and the configuration management information 121, and to the server monitoring system 110, the spread destination prediction system 112, the customer contact / priority determination system 113, and the Web portal 114.

The root cause analysis system 111 may receive alerts from the server monitoring system 110 either by having the server monitoring system 110 send data through the API of the root cause analysis system 111, or by having the root cause analysis system 111 periodically check for alerts through the API of the server monitoring system 110 and read them. When the server monitoring system 110 detects a failure, the failure root cause analysis system 111 determines what the root cause of the detected failure is. The root cause analysis system 111 may be implemented in software, in hardware, or in a combination of hardware and software.

The spread destination prediction system 112 takes the failure location identified by the failure root cause analysis system 111, refers to the configuration management information 121 and the contract information 122, and identifies the customer systems to which the failure is expected to spread.

The spread destination prediction system 112 is installed in the data center 100, but it may instead connect to the devices in the data center 100 remotely from outside.

The spread destination prediction system 112 is connected over the network, via the management router 103 in the data center 100, to the physical servers 130 and 135, the hypervisors 131 and 136, and the virtual machines 132, 133, 134, 137, 138, and 139. It is also connected over the network to the configuration management information 121 and the contract information 122, and to the server monitoring system 110, the root cause analysis system 111, the customer contact / priority determination system 113, and the Web portal 114.

The spread destination prediction system 112 may receive results from the failure root cause analysis system 111 either by having the failure root cause analysis system 111 send data through the API of the spread destination prediction system 112, or by having the spread destination prediction system 112 periodically check for analysis results through the API of the failure root cause analysis system 111 and read them.

When the failure root cause analysis system 111 has estimated the component causing the failure, the spread destination prediction system 112 uses the configuration management information 121, the contract information 122, and related data to identify the users affected by the failure and the users likely to be affected in the future. The spread destination prediction system 112 receives the failure ID and failure location information 126 (431) and checks whether an unprocessed failure ID exists (432).

If an unprocessed failure ID exists, the system checks whether the received failure location is a physical server (435). If it is a physical server failure, the configuration management information 121 is searched for the customer IDs and system names that share the failed physical server (440). If it is not a physical server failure, the system checks whether the received failure location is software (436).

If it is a software failure, the configuration management information 121 is searched for the customer IDs and system names that use the same software (441). If it is not a software failure, the system checks whether the received failure location is the network (437). If it is a network failure, the configuration management information 121 is searched for the customer IDs and system names that use the same network device or network line (442). If it is not a network failure, the system checks whether the received failure location is storage (438).

If it is a storage failure, the configuration management information 121 is searched for the customer IDs and system names that share the same storage device or disk (443). If it is not a storage failure, the system checks whether the received failure location is a virtual server (439). If it is a virtual server failure, the configuration management information 121 is searched for the customer IDs and system names that use that virtual server.

If it is not a virtual server failure, the failure is recorded with a message marking it as another kind of failure whose spread destination is difficult to estimate (447). The spread destination prediction system 112 may be implemented in software, in hardware, or in a combination of hardware and software.
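The branching in steps 435 to 447 amounts to choosing, from the type of the failed component, which attribute of the configuration management information to match on. The following is a minimal sketch of that idea. The record layout (`ConfigRecord` and its field names), the component type strings, and the sample data are assumptions for illustration; the actual schema of the configuration management information 121 is not given in this passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigRecord:
    """Hypothetical row of the configuration management information."""
    customer_id: str
    system_name: str
    physical_server: str
    software: str
    network: str
    storage: str
    virtual_server: str


# Component types handled by the checks 435, 436, 437, 438 and 439.
# Each type is matched on the record attribute of the same name.
KNOWN_PART_TYPES = ("physical_server", "software", "network",
                    "storage", "virtual_server")


def predict_spread(part_type, part_id, config):
    """Return ([(customer_id, system_name), ...], message).

    The message is None for a known component type; otherwise it marks the
    failure as one whose spread destination is difficult to estimate (447).
    """
    if part_type not in KNOWN_PART_TYPES:
        return [], "other failure: spread destination difficult to estimate"
    hits = [(r.customer_id, r.system_name)
            for r in config if getattr(r, part_type) == part_id]
    return hits, None


# Hypothetical sample data: three customer systems sharing some resources.
CONFIG = [
    ConfigRecord("A", "web", "ps130", "db-x", "router104", "storage105", "vm132"),
    ConfigRecord("B", "batch", "ps130", "db-y", "router104", "storage105", "vm133"),
    ConfigRecord("C", "crm", "ps135", "db-x", "router104", "storage105", "vm137"),
]
```

With this data, a failure of physical server `ps130` yields customers A and B, while a failure of software `db-x` yields A and C, which mirrors how a shared resource determines the set of customers to prepare contact for.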

For the customers identified by the spread destination prediction system, the customer contact / priority determination system 113 refers to the configuration management information 121, the contract information 122, and the contact text templates 123, determines the customer contact priority, and creates the contact text for each customer.

The customer contact / priority determination system 113 is installed in the data center 100, but it may instead connect to the devices in the data center 100 remotely from outside.

The customer contact / priority determination system 113 is connected over the network, via the management router 103 in the data center 100, to the physical servers 130 and 135, the hypervisors 131 and 136, and the virtual machines 132, 133, 134, 137, 138, and 139. It is also connected over the network to the configuration management information 121, the contract information 122, and the contact text templates 123, and to the server monitoring system 110, the root cause analysis system 111, the spread destination prediction system 112, and the Web portal 114.

The customer contact / priority determination system 113 may receive results from the spread destination prediction system 112 either by having the spread destination prediction system 112 send data through the API of the customer contact / priority determination system 113, or by having the customer contact / priority determination system 113 periodically check for prediction results through the API of the spread destination prediction system 112 and read them.

The customer contact / priority determination system 113 uses the users predicted by the spread destination prediction system 112 to be affected by the failure, together with the configuration management information 121 and the contract information 122, to determine the order in which the customers should be contacted.
The customer contact text creation unit of the customer contact / priority determination system 113 receives the output of the spread destination prediction system, namely the failure ID, cause ID, customer ID, and system name information 127. It searches the customer management information 136 of the contract information 122 for a record whose customer ID 280 matches the received customer ID, checks the report pattern 295 of the matching record, and determines whether the contact is to be by telephone or by email (452). For email contact, it checks whether the contact text templates 123 contain an email template corresponding to the cause ID (453). If one exists, the template is used to prepare the email text (455).

If no email template exists, the operator creates a new one (456). The email address 294 from the customer information management 136 of the contract information 122 is inserted into the email template, and the log file 124 and the like are attached (459).
For telephone contact, the system checks whether the contact text templates 123 contain a telephone template corresponding to the cause ID (454). If one exists, the template is used to prepare the telephone contact text (457). If not, the operator creates a new telephone template (458). The telephone number 293 from the customer information management of the contract information 122 is inserted into the telephone template (460).

For 460 and 461, the failure occurrence time 332 from the log 124 entry matching the failure ID and the recovery time are inserted (461, 462). The recovery time may be calculated by the operator according to the nature of the failure, or a recovery time may be registered in advance for each type of failure.
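The template lookup and fill-in described in steps 452 to 462 can be sketched as follows. The template wording, the dictionary keys (`report_pattern`, `mail_address`, `phone_number`), and the cause ID `C001` are hypothetical placeholders; returning `None` stands in for the case where the operator has to write a new template (456, 458).

```python
from string import Template

# Hypothetical stand-in for the contact text templates, keyed by
# (contact channel, cause ID).
TEMPLATES = {
    ("mail", "C001"): Template(
        "To: $address\nA failure occurred at $occurred. "
        "Estimated recovery: $recovery."
    ),
    ("phone", "C001"): Template(
        "Call $address: failure occurred at $occurred; "
        "estimated recovery $recovery."
    ),
}


def prepare_contact(customer, cause_id, occurred, recovery, templates=TEMPLATES):
    """Return the filled-in contact text, or None if no template exists."""
    channel = customer["report_pattern"]  # "mail" or "phone" (step 452)
    template = templates.get((channel, cause_id))  # steps 453 / 454
    if template is None:
        return None  # the operator would create a template here (456 / 458)
    if channel == "mail":
        address = customer["mail_address"]
    else:
        address = customer["phone_number"]
    # Insert the contact address, occurrence time and recovery time.
    return template.substitute(address=address, occurred=occurred,
                               recovery=recovery)


# Hypothetical customer record.
customer_b = {
    "report_pattern": "phone",
    "mail_address": "it-admin@example.com",
    "phone_number": "000-0000-0000",
}
text = prepare_contact(customer_b, "C001", "10:15", "12:00")
```

Because every affected customer's text is generated from the same cause ID and failure times, the texts can be prepared in bulk before the predicted failure actually reaches those customers.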

The customer contact priority determination unit of the customer contact / priority determination system 113 receives the outputs 128 and 129 of the customer contact text creation unit (471) and stores the received data in a queue (472). It checks whether priority evaluation has been completed for every customer in the queue (473). If it has not, points are added for priority based on the CPU utilization and access count in the virtual server operation information, and for priority based on the system type 286, SLA 284, and number of complaints 285 in the contract information 122 and on the usage period calculated from the operation start date 282 and operation end date 283 (474, 475, 476).
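A point-based evaluation of this kind might look like the sketch below. The inputs follow the parameters named above, but every weight, threshold, and field name is an assumption chosen only to make the example concrete; this passage does not specify how many points each parameter contributes.

```python
from datetime import date


def contract_years(start, end):
    """Whole years between the operation start and end dates."""
    return (end - start).days // 365


def priority_score(c):
    """Add up points per parameter. All weights are hypothetical."""
    score = 0
    score += c["cpu_usage_percent"] // 10              # busier systems first
    score += min(c["accesses_per_minute"] // 100, 10)  # capped at 10 points
    score += 10 if c["system_type"] == "production" else 0
    score += 5 if c["sla_percent"] >= 99.9 else 0
    score += 2 * c["complaint_count"]
    score += min(contract_years(c["start"], c["end"]), 5)
    return score


# Hypothetical customers.
customer_a = {
    "cpu_usage_percent": 85, "accesses_per_minute": 450,
    "system_type": "production", "sla_percent": 99.9, "complaint_count": 1,
    "start": date(2010, 4, 1), "end": date(2013, 3, 31),
}
customer_b = {
    "cpu_usage_percent": 20, "accesses_per_minute": 50,
    "system_type": "development", "sla_percent": 99.0, "complaint_count": 0,
    "start": date(2012, 1, 1), "end": date(2012, 12, 31),
}
```

Under these made-up weights `customer_a` scores 32 and `customer_b` scores 3, so customer A would be contacted first.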

Once priority evaluation is complete, the queue is sorted in priority order according to the accumulated points (477), and the system checks whether any data requires immediate contact (478). If so, the contact texts requiring immediate contact are displayed on the operator's screen, starting with the highest-priority customer (481). When all the data has been displayed, the system waits for a new failure (482) and checks whether one has occurred (479). If a new failure has occurred, then after a fixed time has elapsed the contact texts are displayed on the screen in descending order of priority (480).
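The sort-then-filter behavior of steps 477, 478, and 481 can be sketched in a few lines. The queue item layout (`score`, `immediate`, `text`) is a hypothetical structure for illustration, and the waiting loop of steps 479 to 482 is left out.

```python
def texts_to_display(queue):
    """Sort queued contacts by score, highest first (477), then keep the
    texts of those flagged for immediate contact (478, 481)."""
    ranked = sorted(queue, key=lambda item: item["score"], reverse=True)
    return [item["text"] for item in ranked if item["immediate"]]


# Hypothetical queue contents.
QUEUE = [
    {"customer_id": "B", "score": 3, "immediate": True, "text": "notice for B"},
    {"customer_id": "A", "score": 32, "immediate": True, "text": "notice for A"},
    {"customer_id": "C", "score": 12, "immediate": False, "text": "notice for C"},
]
```

Python's `sorted` is stable, so customers with equal scores keep the order in which they entered the queue.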

The Web portal 114 displays the output of the customer contact / priority determination system, namely the priority of the customers to be contacted and the contact texts, so that the operator 140 can review them and contact the IT administrator 141 or 142. As needed, it may also refer to the failure details detected by the server monitoring system 110, the failure location identified by the failure root cause analysis system 111, the customer systems to which the spread destination prediction system 112 predicts the failure will spread, the server operation information 120, the configuration management information 121, the contract information 122, the contact text templates, and other such information.

The Web portal 114 is installed in the data center 100, but it may instead connect to the devices in the data center 100 remotely from outside.
The Web portal 114 is connected over the network, via the management router 103 in the data center 100, to the physical servers 130 and 135, the hypervisors 131 and 136, and the virtual machines 132, 133, 134, 137, 138, and 139. The Web portal 114 is also connected over the network to the configuration management information 121, the contract information 122, and the contact text templates 123, and to the server monitoring system 110, the root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113.
The operator 140 uses the Web portal 114 to check the output of the server monitoring system 110, the root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113. The Web portal may be implemented in software, in hardware, or in a combination of hardware and software.

The data center 100 has a router 104 that passes communication arriving from the network 102, which connects the virtual servers with the customer system (Company A) 115 and the customer system (Company B) 116, to the appropriate virtual server. The router 104 may itself provide the following functions, or may be connected to devices that provide them: a firewall that blocks unauthorized connections from outside; an Intrusion Detection System or Intrusion Prevention System that monitors and filters unauthorized communication from outside; Network Address Translation, which converts between the IP addresses used for communication outside the data center 100 and those used inside it; Virtual Routing and Forwarding, which makes the router 104 operate logically as multiple routers; and VLANs, which logically divide the network.

The router 104 selects the destination network according to the source IP address, and selects the device to which data is sent according to the destination IP address. The router 104 is connected to the virtual servers 132, 133, 134, 137, 138, and 139, the physical servers 130 and 135, and the hypervisors 131 and 136. Its basic role is to forward requests from the customer system (Company A) 115 and the customer system (Company B) 116 to these virtual servers.

The virtual servers 132, 133, and 134 are logical partitions of the physical server 130 and run on the hypervisor 131. The virtual servers 137, 138, and 139 are logical partitions of the physical server 135 and run on the hypervisor 136.

The hypervisors 131 and 136 logically partition the CPU and memory of the physical servers, the bandwidth of the network connected to the physical servers 130 and 135, the data area of the storage 105 connected to the physical servers 130 and 135, and similar resources, so that multiple OSes can be installed and used on a single physical server.

The management router 103 is connected to the server monitoring system 110, the failure root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113.

FIG. 2 shows the flow of processing when a failure occurs, which is the characteristic feature of this embodiment. First, the conventional flow in a cloud computing system from the occurrence of a failure to customer contact is described. When a failure occurs (910), the operator 140 runs a failure cause analysis (911) with the failure root cause analysis system 111 in order to identify the cause from the information raised by the server monitoring system 110. The operator 140 then uses the spread destination prediction system 112 to identify the affected customers from the identified cause (912).

The operator 140 then uses the customer contact / priority determination system 113 to create the contact text for each customer (913). Once the text is ready, the operator 140 contacts the IT administrator (Company A) 141 by telephone or e-mail (914). Normally, steps (910) through (914) are carried out for each failure, and any failure that occurs in the meantime waits until the procedure finishes before it is handled in the same way (921, 922, 923, 924). As a result, a customer's failure notice is delayed by the time needed to finish the notices for the failures still ahead of it.

This embodiment exploits a characteristic of cloud computing systems to make the processing leading up to the second and subsequent failure notices more efficient. After a failure occurs (940), the operator 140 uses the failure root cause analysis system 111 to analyze the cause from the information raised by the server monitoring system 110 (941). The operator 140 then uses the spread destination prediction system 112 to identify the affected customers from the identified cause (942). In parallel, the spread destination prediction system 112 prepares failure contact text (951) for customers who are not directly affected but for whom a failure may occur later.

This relies on the fact that the customers of a cloud computing system share hardware. For example, when a virtual server fails, the root cause often lies in the hardware or in the hardware configuration. In that case, a similar failure is likely to occur in systems used by other customers that share the same hardware.

For this reason, preparing failure contact text at the same time, not only for the customers who directly use a given virtual server but also for the customers who share hardware with it, reduces the effort of contacting customers about failures that occur later. If the prediction is correct and a failure does occur (950), the operator 140, after completing the previous failure notice, runs an analysis with the failure root cause analysis system 111 to identify the cause from the information raised by the server monitoring system 110 (952).

If the cause is the same as that of the earlier failure, what the customer needs to be told about that cause is also the same, so the operator 140 uses the contact text prepared in advance to contact the IT administrator (Company B) 142 by telephone or e-mail (953). This shortens the time needed to complete contact with the second and subsequent companies.
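
The pre-drafting idea described above can be sketched as follows. This is only an illustrative sketch, not the implementation in the embodiment: the class `NoticePreparer`, its method names, the `tenants_by_host` mapping, and the wording of the notice are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Notice:
    customer: str
    cause_id: str
    text: str


@dataclass
class NoticePreparer:
    # Hypothetical layout: physical server name -> customers with virtual servers on it.
    tenants_by_host: dict[str, list[str]]
    # (customer, cause_id) -> notice drafted ahead of time.
    drafts: dict[tuple[str, str], Notice] = field(default_factory=dict)

    def on_first_failure(self, host: str, affected_customer: str, cause_id: str) -> Notice:
        """Draft the notice for the affected customer and pre-draft for co-tenants."""
        notice = self._draft(affected_customer, cause_id)
        for customer in self.tenants_by_host.get(host, []):
            if customer != affected_customer:
                self.drafts[(customer, cause_id)] = self._draft(customer, cause_id)
        return notice

    def on_later_failure(self, customer: str, cause_id: str) -> tuple[Notice, bool]:
        """Return (notice, reused); reused is True when a pre-drafted notice was available."""
        draft = self.drafts.pop((customer, cause_id), None)
        if draft is not None:
            return draft, True
        return self._draft(customer, cause_id), False

    @staticmethod
    def _draft(customer: str, cause_id: str) -> Notice:
        # Placeholder wording; a real notice would be built from the contact text template.
        return Notice(customer, cause_id, f"To {customer}: a failure with cause {cause_id} has occurred.")
```

When the second failure has the same cause ID as the first, `on_later_failure` returns the stored draft without rebuilding it, which corresponds to the shortened path (950) to (953); a different cause ID falls back to drafting from scratch.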

FIG. 3 shows details of the server operation information 120 and the configuration management information 121.

The server operation information 120 is defined once per virtual server and includes a virtual server name 210, a CPU usage log 211, a disk I/O log 212, a network log 213, an access log 214, a memory usage log 215, a disk usage log 216, and similar information.

The configuration management information 121 includes virtual server management information 130, physical server management information 131, network management information 132, storage network management information 133, storage management information 134, software management information 135, and similar information.

The virtual server management information 130 is held per virtual server and is used to manage that virtual server. It includes a virtual server name 220, an IP address 221, connection information 222, a core count 223, a clock frequency 224, a memory capacity 225, a disk capacity 226, a physical server name 227, storage connection information 228, software information 229, and similar information.

The physical server management information 131 is held per physical server and is used to manage that physical server. It includes a physical server name 230, an IP address 231, a core count 232, a clock frequency 233, a memory capacity 234, a storage name 235, a firmware version 236, a hypervisor type 237, connection information 238, setting information 239, and similar information.

The network management information 132 is held per network device and is used to manage that device. It includes a device name 240, an IP address 251, a device type 242, a firmware version 252, setting information 244, connection information 245, and similar information.

The storage network management information 133 (Fibre Channel switch management information) is held per Fibre Channel switch and is used to manage that switch. It includes a device name 260, an IP address 261, setting information 262, a firmware version 263, connection information 264, and similar information.

The storage management information 134 is held per storage device and is used to manage that storage. It includes a device name 260, an IP address 261, setting information 262, a firmware version 263, connection information 264, and similar information.

The software management information 135 is used to manage the software installed on the virtual servers, physical servers, and so on. It includes a software name 270, setting information 271, version information 272, license information 273, and similar information.

FIG. 4 shows details of the contract information 122, the contact text template 123, and the log 124.

The contract information 122 includes customer management information 136, customer system management information 137, and similar information.

The customer management information 136 includes a customer ID 280, a contracted service 281, an operation start date 282, an operation end date 283, an SLA 284, a number of complaints 285, a system type 286, a number of end users 287, a system name 288, a company name 289, an address 290, a contact person name 291, a job title 292, a telephone number 293, an e-mail address 294, a report pattern 295, and similar information.

The customer system management information 137 includes a system name 310, routing information 311, used virtual server information 312, used software information 313, and similar information.

The contact text template 123 serves as the model for creating contact text for customers and is prepared according to past failures, their causes, the types of countermeasure, and so on. It includes a cause ID 320, a countermeasure 321, past text 322, requests to the customer 323, and similar information.

The log 124 holds the data needed for processing by the server monitoring system 110, the failure root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113, together with their processing results. It includes a failure ID 330, a failed component 331, a failure occurrence time 332, a related ID 333, a cause ID 334, a log ID 335, a cause accuracy rate 336, and similar information.

FIG. 5 shows details of the alert 125, the failure root cause 126, the failure spread destination customer information 127, the contact text (e-mail) 128, and the contact text (telephone) 129.

The alert 125 includes a failure ID 340, a failure occurrence time 341, and similar information.

The failure root cause 126 includes a failure ID 350, a system name 351, a failed component name 352, a failure cause 353, a message 354, and similar information.

The failure spread destination customer information 127 includes a failure ID 360, a customer ID 361, a system name 362, and similar information.

The contact text (e-mail) 128 includes a mail address 370, a subject 371, a failure occurrence time 372, a failure type 373, an estimated failure cause 374, an expected recovery time 375, requests to the customer 376, a contact point 377, attached information 378, and similar information.

The contact text (telephone) 129 includes a telephone number 380, a failure occurrence time 381, a failure type 382, an estimated failure cause 383, an expected recovery time 384, requests to the customer 385, and similar information.

FIG. 6 shows details of the processing performed by the failure root cause analysis system 111, the spread destination prediction system 112, and the customer contact / priority determination system 113 when the server monitoring system 110 detects (400) a failure that has occurred in the virtual machines 132, 133, 134, 137, 138, 139, and so on in the data center 100.

When a failure occurs (400), the server monitoring system 110 detects it, raises an alert (401), and passes the failure ID 340 of the alert 125 to the failure root cause analysis system 111. On receiving the failure ID 340 of the alert 125 (403), the failure root cause analysis system 111 performs failure root cause analysis (404), determines the root cause of the failure, identifies a cause ID, and retains that cause ID in the log 124 for a fixed period. It then checks whether the cause ID identified for the newly occurring failure matches a cause ID retained in the log 124 (405).

If there is no match, it sends the data of the failure root cause 126 to the spread destination prediction system 112 (406).

Here, failure root cause analysis is processing in which a computer estimates, from monitoring results, the events that may have caused a failure. FTA (Fault Tree Analysis) may be used as the method. For the known failures that occur in the cloud computing system, the administrator prepares in advance a Fault Tree that describes, as logical expressions, the events that can cause each incident and the probability, based on the monitoring data, that each event is the cause.

When a failure occurs, the system uses the monitoring results from the server monitoring system identified by the failure ID to derive the magnitude of the incident that those results may imply and the failed component that may be the root cause behind them. There may be more than one failed component, and the failure may be treated as a multiple failure of several components. The magnitude of the incident is derived by comparing the monitoring result with a predetermined threshold. For example, in network monitoring where the absence of a Ping reply for one minute is regarded as a failure, the state is binarized: the value is set to 1 if no Ping reply arrives for one minute or more, and to 0 if a reply arrives within one minute. Because the Fault Tree is written as logical expressions, the types of incident that can occur for given monitoring results, and their probabilities, can be derived. The root cause corresponding to the monitoring results is derived from the structure of the Fault Tree.
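
The binarization and logical evaluation described above can be illustrated with a minimal sketch. It covers only the thresholding step and AND/OR gate evaluation, not the probability calculation; the tree shape and the observation names (`ping_timeout`, `link_down`, `nic_error`) are hypothetical and do not come from the embodiment.

```python
def binarize_ping(seconds_without_reply: float, threshold_s: float = 60.0) -> int:
    """Return 1 if no Ping reply has arrived for the threshold or longer, else 0."""
    return 1 if seconds_without_reply >= threshold_s else 0


def evaluate(node, observations: dict) -> int:
    """Evaluate a fault tree node against binarized observations.

    A node is either the name of an observation (leaf) or a tuple
    ("AND" | "OR", [child nodes]).
    """
    if isinstance(node, str):
        return observations[node]
    gate, children = node
    values = [evaluate(child, observations) for child in children]
    if gate == "AND":
        return int(all(values))
    if gate == "OR":
        return int(any(values))
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical tree: the incident occurs if (ping times out AND the link is down)
# OR the NIC reports an error.
NETWORK_OUTAGE = ("OR", [("AND", ["ping_timeout", "link_down"]), "nic_error"])
```

For instance, `evaluate(NETWORK_OUTAGE, {"ping_timeout": binarize_ping(75), "link_down": 1, "nic_error": 0})` evaluates the AND branch to 1, so the top event is 1.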

For example, when a network I/O error is found on a virtual server, the candidate causes, such as a hypervisor failure, a failure of the NIC (Network Interface Card) of the physical server, or a break in the cable connected to the switch, are defined in advance together with their occurrence probabilities, and the log data is checked for anomalies in descending order of probability. Checking in this order identifies the failed component most likely to have led to the original network I/O error. The failure root cause analysis system 111 assigns a predefined cause ID to each derived failed component and identified root cause, and stores it in the log data indicated by the failure ID.
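
A minimal sketch of checking candidate causes in descending order of probability is shown below. The cause IDs, probabilities, and log keywords are invented for illustration; the embodiment does not specify how an anomaly is detected in the log data.

```python
from typing import Callable, Optional

# (cause ID, assumed occurrence probability, anomaly check over log lines)
Candidate = tuple[str, float, Callable[[list[str]], bool]]


def contains(keyword: str) -> Callable[[list[str]], bool]:
    """Build a check that reports an anomaly when any log line contains the keyword."""
    return lambda lines: any(keyword in line for line in lines)


def find_likely_cause(candidates: list[Candidate], log_lines: list[str]) -> Optional[str]:
    """Return the cause ID of the most probable candidate whose anomaly check succeeds."""
    for cause_id, _probability, has_anomaly in sorted(
        candidates, key=lambda c: c[1], reverse=True
    ):
        if has_anomaly(log_lines):
            return cause_id
    return None


# Hypothetical candidates for a virtual server network I/O error.
NETWORK_IO_CANDIDATES: list[Candidate] = [
    ("CABLE_BREAK", 0.2, contains("carrier lost")),
    ("HYPERVISOR_FAILURE", 0.5, contains("hypervisor panic")),
    ("NIC_FAILURE", 0.3, contains("NIC link down")),
]
```

With log lines that mention both a NIC link-down and a lost carrier, the hypervisor check runs first and fails, the NIC check then succeeds, and the cable check is never reached.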

On the other hand, if a cause ID 334 retained in the log 124 matches the cause ID identified by the failure root cause analysis system 111 for the newly occurring failure (405), the failure root cause 126 is sent to the customer contact / priority determination system 113 (407).

When the spread destination prediction system 112 receives the failure root cause 126 (410), it identifies the customers to whom the failure may spread (411) and outputs the failure spread destination customer information 127. After identifying the customers, it sends the data group of the failure spread destination customer information 127, containing the failure ID 360, the customer ID 361, and the system name 362, to the customer contact / priority determination system 113 (412). The spread destination prediction system 112 is described in detail with reference to FIG. 7.

When the customer contact / priority determination system 113 receives this data group (414), it determines the contact priority for each customer and creates the contact text for each customer (415). The customer contact / priority determination system 113 is described in detail with reference to FIG. 7 and FIG. 8. The determined contact priority and contact text are stored in the DB, and the corresponding failure ID is stored in a queue (416). The system checks whether the failure ID of a newly occurring failure is already stored in the queue (417). If it is, the contact text for the high-priority customer in whose system the failure occurred is displayed on the screen viewed by the operator 140 (418). If the failure ID does not match any failure ID stored in the queue, processing returns to step 414. The operator 140 checks the display screen (419), checks the contact text (420), and then contacts the customer concerned (421). The customer contact procedure is described in detail later.

FIG. 7 shows details of the processing of the spread destination prediction system 112.

The spread destination prediction system 112 has a buffer that holds the data received from the failure root cause analysis system 111 and can hold multiple received items. As in a queue, the items are arranged in order of reception, so that the oldest item can be identified. For each received item, a flag in memory indicates whether that item has been processed. The spread destination prediction system 112 receives the failure root cause 126 from the failure root cause analysis system 111 (431).
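
The receive buffer with per-item processed flags might look like the following sketch. The class and field names are assumptions made for the example; the embodiment states only that items are kept in reception order and that each has a processed flag.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Received:
    failure_id: str
    root_cause: dict
    processed: bool = False  # flag showing whether this item has been handled


class ReceiveBuffer:
    """Holds received failure root causes in arrival order."""

    def __init__(self) -> None:
        self._entries: deque = deque()

    def receive(self, failure_id: str, root_cause: dict) -> None:
        # New items go to the right end, so the left end is always the oldest.
        self._entries.append(Received(failure_id, root_cause))

    def oldest_unprocessed(self) -> Optional[Received]:
        """Return the oldest item whose processed flag is not set, or None."""
        for entry in self._entries:
            if not entry.processed:
                return entry
        return None
```

A caller would take `oldest_unprocessed()`, handle it, and set `processed = True`; when the method returns `None`, no unprocessed failure ID remains.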

The spread destination prediction system 112 checks whether the reception buffer contains an unprocessed failure ID, that is, the oldest received item whose processed flag is not set (432). If there is no unprocessed failure ID, it checks whether the transmission queue contains data (434). If the transmission queue is empty, processing ends (434). If the transmission queue contains data, the failure spread destination customer information 127 created by the spread destination prediction system 112 is sent to the customer contact / priority determination system 113, the failure ID and cause ID of the failure spread destination customer information 127 are stored in the log 124, and processing ends (438). If there is an unprocessed failure ID, the system checks whether the failure is a physical server failure (435).

In the case of a physical server failure, the physical server name 230 in the physical server management information 131 of the configuration management information 121 is used as a search key to find the entries whose physical server name 227 in the virtual server management information 130 matches, which identifies the virtual servers. Next, the virtual server name 220 is used as a search key to obtain the used virtual server information 312 from the customer system management information 137 of the contract information 122, and the system name 310 is then used as a search key against the system name 288 of the customer management information 136 to find the customer ID (440).
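
The chain of lookups in step 440 can be sketched as three successive filters. The records are modeled here as plain dictionaries whose keys are English renderings of the field names above; this layout, and the sample assignment of virtual servers to customers in the usage note, are assumptions made for illustration.

```python
def customers_on_physical_server(
    physical_server_name: str,
    virtual_servers: list[dict],
    customer_systems: list[dict],
    customers: list[dict],
) -> list[str]:
    """Return the IDs of customers whose systems use a virtual server on the given host."""
    # Step 1: virtual servers whose physical server name (227) matches the failed host (230).
    vm_names = {
        vm["virtual_server_name"]
        for vm in virtual_servers
        if vm["physical_server_name"] == physical_server_name
    }
    # Step 2: customer systems whose used virtual server information (312) includes one of them.
    system_names = {
        system["system_name"]
        for system in customer_systems
        if vm_names & set(system["used_virtual_servers"])
    }
    # Step 3: customers whose system name (288) matches one of those system names (310).
    return sorted(c["customer_id"] for c in customers if c["system_name"] in system_names)
```

If, say, virtual servers "vm132" and "vm133" run on host "ps130" and belong to two different customers' systems, a failure of "ps130" yields both customer IDs, while a customer whose only virtual server runs on "ps135" is not included.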

If the failure is not a physical server failure, the system checks whether it is a software failure (436). In the case of a software failure, the software name in the software management information 135 of the configuration management information 121 is used as a search key to find the entries whose software information 229 in the virtual server management information 130 matches, which identifies the virtual servers. Next, the virtual server name 220 is used as a search key to obtain the used virtual server information 312 from the customer system management information 137 of the contract information 122, and the system name 310 is then used as a search key against the system name 288 of the customer management information 136 to find the customer ID (441).

If the failure is not a software failure, the system checks whether it is a network failure (437). In the case of a network failure, the connection information 245 in the network management information 132 of the configuration management information 121 is used as a search key to find the entries whose connection information 222 in the virtual server management information 130 matches, which identifies the virtual servers. Next, the virtual server name 220 is used as a search key to obtain the used virtual server information 312 from the customer system management information 137 of the contract information 122, and the system name 310 is then used as a search key against the system name 288 of the customer management information 136 to find the customer ID (442).

If the failure is not a network failure, the system checks whether it is a storage failure (438). In the case of a storage failure, the connection information 253 in the storage management information 134 of the configuration management information 121 is used as a search key to find the entries whose storage connection information 228 in the virtual server management information 130 matches, which identifies the virtual servers. Next, the virtual server name 220 is used as a search key to obtain the used virtual server information 312 from the customer system management information 137 of the contract information 122, and the system name 310 is then used as a search key against the system name 288 of the customer management information 136 to find the customer ID (443).

If the failure is not a storage failure, the system checks whether it is a virtual server failure (439).

In the case of a virtual server failure, the virtual server name 220 is used as a search key to obtain the used virtual server information 312 from the customer system management information 137 of the contract information 122, and the system name 310 is then used as a search key against the system name 288 of the customer management information 136 to find the customer ID (444).

If the failure is not a virtual server failure either, a message is recorded stating that it is another kind of failure whose spread destination is difficult to estimate, and processing returns to step 432 (447).
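
The order of checks in steps 435 to 439 and the search key used by each can be summarized in a small table. This is only a restatement of the text above as data; the string labels for failure types and fields are hypothetical names chosen for the sketch.

```python
from typing import Optional

# (failure type, field matched in the virtual server management information 130, lookup step)
# Listed in the order in which the checks are made.
CHECK_ORDER = [
    ("physical_server", "physical_server_name", 440),
    ("software", "software_info", 441),
    ("network", "connection_info", 442),
    ("storage", "storage_connection_info", 443),
    ("virtual_server", "virtual_server_name", 444),
]


def select_lookup(failure_type: str) -> Optional[tuple[str, int]]:
    """Return (matching field, step number) for a failure type.

    None means the failure is of another kind whose spread destination
    is difficult to estimate (step 447).
    """
    for known_type, match_field, step in CHECK_ORDER:
        if failure_type == known_type:
            return match_field, step
    return None
```

Every branch that returns a field then continues through the same chain (virtual server name, used virtual server information, system name, customer ID); only the first matching field differs.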

Next, the spread destination prediction system 112 stores in the transmission queue the customer ID and system name found by steps 440, 441, 442, 443, and 444, the failure ID received from the failure root cause analysis system 111, the cause ID assigned as a result of the processing of the spread destination prediction system 112, and the failure spread destination customer information 127.

FIG. 8 shows details of the contact text creation processing of the customer contact / priority determination system 113. When the customer contact / priority determination system 113 receives the failure spread destination customer information 127 sent from the spread destination prediction system (451), it uses the customer ID 361 and the system name 362 of that information as search keys and looks up the report pattern 295 of the matching record in the customer management information 136 of the contract information 122. Based on the report pattern 295, it determines whether contact is to be made by telephone, by e-mail, or by both (452). How the contact method is determined is described in detail later. If contact is to be made by e-mail, it checks whether contact text (e-mail) 128 corresponding to the cause ID exists (453).

If contact text (e-mail) 128 corresponding to the cause ID exists, it is copied, the cause ID is inserted into the subject 371, and the contact text (e-mail) 128 is prepared (455). The subject may carry other information instead of the cause ID. If no contact text (e-mail) 128 corresponding to the cause ID exists, the operator 140 creates the contact text (e-mail) 128 (456). The customer's e-mail address is looked up, based on the customer ID, in the customer management information 136 of the contract information 122 and inserted into the e-mail as the mail address 370, and any files to be attached, such as logs, are attached (459). The failure occurrence time 372 is then looked up from the failure ID of the alert 125 and inserted into the e-mail (461).
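
Steps 453 to 461 can be sketched as a function that copies a stored text and fills in the customer-specific fields. The dictionary layout, the key names, and the subject format are assumptions for this example; returning `None` stands in for the case where the operator has to write the text by hand.

```python
import copy
from typing import Optional


def prepare_email(
    cause_id: str,
    texts_by_cause: dict,
    customer: dict,
    occurred_at: str,
    attachments: tuple = (),
) -> Optional[dict]:
    """Build the e-mail contact text for one customer from the stored text for a cause ID."""
    stored = texts_by_cause.get(cause_id)
    if stored is None:
        return None  # no stored text: the operator creates it manually (step 456)
    mail = copy.deepcopy(stored)  # copy so the stored text itself is left unchanged
    mail["subject"] = f"[{cause_id}] {mail.get('subject', 'Failure notice')}"  # step 455
    mail["mail_address"] = customer["mail_address"]  # step 459
    mail["attachments"] = list(attachments)  # step 459
    mail["failure_occurrence_time"] = occurred_at  # step 461
    return mail
```

Copying before editing matters because the same stored text is reused for every customer affected by the same cause.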

If the customer contact / priority determination system 113 determines in process 452 that the contact is to be made by telephone, it checks whether a contact text (telephone) 129 corresponding to the cause ID exists (454). If one exists, the system copies it and prepares for the contact (457). If none exists, the operator 140 creates a contact text (telephone) 129 (458). Using the customer ID, the system then looks up the telephone number 293 in the customer management information 136 of the contract information 122 and inserts it into the contact text (telephone) 129.

The customer telephone number is inserted into the contact text (telephone) 129 (460). For both telephone contact and e-mail contact, the recovery time is calculated from the failure occurrence time and the content of the failure and is inserted into the contact text (mail) 128 or the contact text (telephone) 129 (462). The failure ID, customer ID, cause ID, system name, and the contact text (mail) 128 or contact text (telephone) 129 are then passed to the priority determination unit (463).
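Processes 462 and 463 can be sketched as below. The text does not say how the recovery time is derived from the failure content, so this sketch assumes a lookup table of typical repair durations per cause ID. The table, its values, and the names `estimate_recovery_time` and `build_handoff` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed typical repair durations per cause ID (illustrative values only;
# the source does not specify how the recovery time is calculated).
ASSUMED_REPAIR_MINUTES = {"C001": 30, "C002": 120}
DEFAULT_REPAIR_MINUTES = 60


def estimate_recovery_time(occurred_at: datetime, cause_id: str) -> datetime:
    """Process 462 (assumed rule): occurrence time plus a per-cause duration."""
    minutes = ASSUMED_REPAIR_MINUTES.get(cause_id, DEFAULT_REPAIR_MINUTES)
    return occurred_at + timedelta(minutes=minutes)


def build_handoff(failure_id: str, customer_id: str, cause_id: str,
                  system_name: str, contact_text: str,
                  occurred_at: datetime) -> dict:
    """Process 463: bundle what is passed to the priority determination unit."""
    recovery = estimate_recovery_time(occurred_at, cause_id)
    text = f"{contact_text}\nEstimated recovery: {recovery:%Y-%m-%d %H:%M}"
    return {
        "failure_id": failure_id,
        "customer_id": customer_id,
        "cause_id": cause_id,
        "system_name": system_name,
        "contact_text": text,
    }
```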

FIG. 9 shows the details of the customer contact priority determination process of the customer contact / priority determination system 113. When the customer contact / priority determination system 113 receives the contact text (mail) 128 or the contact text (telephone) 129 sent from the contact text creation processing unit (471), it stores the text in a queue (472). The queue is structured so that a priority can be stored alongside each received contact text (mail) 128 or contact text (telephone) 129. The customer contact / priority determination system 113 processes the information stored in the queue at regular intervals. First, it checks whether a contact priority has been assigned to every entry in the queue (473). If any entry has no contact priority yet, the system calculates a priority from the virtual server operation information, using thresholds defined in advance for the CPU usage rate and for the number of accesses (474). For example, the CPU usage rate scores 1 point if it is 50% or more and 0 points if it is below 50%, and the points obtained for each parameter are added together. Next, a priority based on the system type is calculated from the configuration management information and added to the result of process 474 (475).
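The threshold scoring of processes 474 and 475 might look like the sketch below. Only the CPU rule (50% or more scores 1 point) comes from the text. The access-count threshold, the system-type point table, and the function names are assumptions made for illustration.

```python
CPU_THRESHOLD_PERCENT = 50.0   # from the example in the text
ACCESS_THRESHOLD = 1000        # assumed value, not given in the source
SYSTEM_TYPE_POINTS = {"production": 1, "development": 0}  # assumed table


def operation_score(cpu_percent: float, access_count: int) -> int:
    """Process 474: one point per parameter that meets its threshold."""
    score = 0
    score += 1 if cpu_percent >= CPU_THRESHOLD_PERCENT else 0
    score += 1 if access_count >= ACCESS_THRESHOLD else 0
    return score


def add_system_type_score(score: int, system_type: str) -> int:
    """Process 475: add points for the system type to the running total."""
    return score + SYSTEM_TYPE_POINTS.get(system_type, 0)
```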

Finally, priorities based on the SLA (Service Level Agreement), the number of complaints, and the usage record are calculated from the contract information and added to the result of process 475. The total is associated with the corresponding contact text (mail) 128 or contact text (telephone) 129 and stored in the queue as its priority (476). This method of calculating from these parameters is only an example, and other parameters may be added. After process 476, the customer contact / priority determination system 113 executes process 473 again. Once priorities have been assigned to all data in the queue, the customer contact / priority determination system 113 sorts the queue in order of contact priority (477).
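The loop over processes 473 to 477 can be sketched as follows, assuming each queue entry carries an optional priority and that the scoring of processes 474 to 476 is supplied as a function. `QueueEntry`, `assign_and_sort`, and `score_fn` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueueEntry:
    """Hypothetical queue record: a contact text plus its stored priority."""
    customer_id: str
    cause_id: str
    contact_text: str
    priority: Optional[int] = None


def assign_and_sort(queue: List[QueueEntry],
                    score_fn: Callable[[QueueEntry], int]) -> List[QueueEntry]:
    for entry in queue:
        if entry.priority is None:            # process 473: not yet scored?
            entry.priority = score_fn(entry)  # processes 474 to 476
    # Process 477: highest contact priority first. The sort is stable, so
    # entries with equal priority keep their arrival order.
    queue.sort(key=lambda e: e.priority, reverse=True)
    return queue
```

Entries that already have a priority are left untouched, which mirrors the check in process 473 before any recalculation.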

At this point, the customer contact / priority determination system 113 checks whether the queue contains contacts for customers directly affected by the failure, or contacts for related failures that the SLA nevertheless requires to be reported (478). If there is anything to be contacted, the system displays the contact texts on the screen, starting with the customer with the highest priority, and shifts to a state of waiting for a new alert (481). If process 478 finds no data that must be contacted immediately, the system shifts to a state of waiting for a new failure to occur (482). If a new failure occurs in this state, that is, if the failure root cause 126 is sent to the customer contact / priority determination system 113 as a result of process 407 in FIG. 6, the system waits for a fixed time to see whether other failures have occurred at the same time. For the failures received during that time, it uses the cause ID as a key to check, in order of contact priority, whether the queue contains matching entries, and displays the matching entries on the screen in descending order of priority.
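The matching step at the end of this paragraph can be sketched as follows, assuming the root causes received during the waiting period have already been collected into a list of cause IDs. The function name `entries_to_display` and the dictionary layout of the queue entries are hypothetical.

```python
from typing import Dict, Iterable, List


def entries_to_display(queue: List[Dict],
                       received_cause_ids: Iterable[str]) -> List[Dict]:
    """Return prepared queue entries whose cause ID matches a received
    root cause, ordered from highest to lowest contact priority."""
    causes = set(received_cause_ids)
    matches = [e for e in queue if e["cause_id"] in causes]
    return sorted(matches, key=lambda e: e["priority"], reverse=True)
```

Entries whose predicted cause never actually occurs simply stay in the queue and are not shown.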

FIG. 10 shows the details of the contact method determination process of the customer contact / priority determination system 113. Whether a customer is contacted by telephone or by e-mail depends on the time of day at which the failure occurred, the content of the failure, and the contract with the customer, so exceptional cases must also be handled. After the customer contact / priority determination system 113 receives the failure ripple destination customer name 127 sent from the ripple destination prediction system 112 (491), it uses the customer ID as a search key to find the report pattern 295 whose customer ID 280 in the customer management information 136 of the contract information 122 matches, refers to the standard contact method, and confirms the contact method (492). If the report pattern 295 contains no exception, the customer is contacted according to the standard contact method (497, 498, 499).

If there is an exception, the system checks whether a contact method specific to the failure is defined (495); if so, the customer is contacted according to that failure-specific contact method (497, 498, 499). If no failure-specific contact method is defined, the system checks whether the contact method differs by time of day (496) and contacts the customer according to the time-based contact method (497, 498, 499).
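The decision order of FIG. 10 (standard method, then failure-specific exception, then time-of-day exception) can be sketched as below. The `ReportPattern` layout is an assumption about how a report pattern 295 might be stored. Falling back to the standard method when no time window matches is also an assumption, and the time windows are assumed not to cross midnight.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Dict, List, Tuple


@dataclass
class ReportPattern:
    """Hypothetical layout for a report pattern 295."""
    standard: str                                          # "phone", "mail", or "both"
    by_cause: Dict[str, str] = field(default_factory=dict)  # failure-specific rules
    by_time: List[Tuple[time, time, str]] = field(default_factory=list)


def decide_contact_method(pattern: ReportPattern, cause_id: str, now: time) -> str:
    # No exception registered: use the standard contact method.
    if not pattern.by_cause and not pattern.by_time:
        return pattern.standard
    # Process 495: a contact method defined for this particular failure?
    if cause_id in pattern.by_cause:
        return pattern.by_cause[cause_id]
    # Process 496: a contact method that depends on the time of day?
    for start, end, method in pattern.by_time:
        if start <= now < end:
            return method
    return pattern.standard  # assumed fallback, not stated in the source
```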

FIG. 11 shows an example of the priority order determination unit used when contacting customers. Reference numeral 508 denotes examples of parameters for determining the contact priority order: for instance, customer company name 500, CPU usage rate 501, number of accesses 502, system type 503, SLA (support hours) 504, SLA (remaining time for the contracted operation rate) 505, number of complaints 506, and usage record 507. Parameters other than those in 508 may also be used. Reference numeral 519 denotes the method of determining the contact priority order. Thresholds, as in threshold example 513, and points (weights), as in score example 514, are set for the parameters in 519, and the parameters of 508 are applied to them to obtain a total score. The higher the total score, the higher the customer ranks in the contact priority order. The parameter thresholds 513 and the score example 514 are only examples and may be changed.
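A table-driven version of this scoring, in which each parameter has a threshold and a weight, might look like the following sketch. Apart from the CPU threshold of 50% mentioned earlier in the description, every threshold, weight, and parameter key below is an illustrative assumption. `SCORE_RULES`, `total_score`, and `rank_customers` are hypothetical names.

```python
from typing import Dict, List, Tuple

# parameter name -> (threshold, points awarded when the value meets it)
# All values are illustrative; the text says thresholds and points may change.
SCORE_RULES: Dict[str, Tuple[float, int]] = {
    "cpu_usage": (50.0, 1),
    "access_count": (1000, 1),
    "complaints": (3, 2),
}


def total_score(params: Dict[str, float]) -> int:
    """Sum the points of every rule whose threshold the parameter meets."""
    return sum(points
               for name, (threshold, points) in SCORE_RULES.items()
               if params.get(name, 0) >= threshold)


def rank_customers(customers: Dict[str, Dict[str, float]]) -> List[str]:
    """Order customer names from highest to lowest total score."""
    return sorted(customers,
                  key=lambda name: total_score(customers[name]),
                  reverse=True)
```

Holding the thresholds and weights in one table means they can be changed without touching the scoring logic.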

DESCRIPTION OF SYMBOLS
100 ... Data center
101 ... Cloud computing system
102 ... Network
103 ... Management router
104 ... Router
105 ... Storage
110 ... Server monitoring system
111 ... Failure root cause analysis system
112 ... Ripple destination prediction system
113 ... Customer contact / priority determination system
114 ... Web portal
115 ... Customer system (Company A)
116 ... Customer system (Company B)
130, 135 ... Physical server
131, 136 ... Hypervisor
132, 133, 134, 137, 138, 139 ... Virtual machine
140 ... Operator
141 ... IT manager (Company A)
142 ... IT manager (Company B)

Claims

1. A failure communication efficiency system comprising:
a server monitoring system having a function of detecting a failure of a cloud computing system and issuing a notification of the failure;
a root cause analysis system having a function of receiving the notification from the server monitoring system and estimating and outputting the cause of the failure;
a ripple destination prediction system having a function of identifying, from the cause of the failure estimated by the root cause analysis system, the users directly affected by the failure and the users who may be affected by further failures that may arise from the failure; and
a customer contact / priority determination system having a function of taking as input the cause of the failure estimated by the root cause analysis system and the user information derived by the ripple destination prediction system, generating information to be presented to an operator for contacting both the users directly affected by the failure and the users affected by failures arising from the failure, and determining the order of contact based on the contact priority set for each user and the state of the system,
wherein the customer contact / priority determination system creates in advance failure contact content for the users who would be affected by a future failure estimated by the ripple destination prediction system, and, when a failure actually occurs and the root cause analysis system determines that the failure has the same cause, presents the previously created failure contact content to the operator.

2. The failure communication efficiency system according to claim 1, wherein the customer contact / priority determination system further comprises a data holding unit that temporarily holds notifications that occur within a predetermined time, and, after the predetermined time has elapsed, determines the order of contact for each notification stored in the data holding unit based on the contact priority set for each user and the state of the system, reorders the notifications to the operator based on that order, and then notifies the operator.