JP7401764B2

JP7401764B2 - Control program, control method and control device

Info

Publication number: JP7401764B2
Application number: JP2020041812A
Authority: JP
Inventors: 侑生梅澤; 寿志辻出; 雅広福田; 敏之内海; 明子松本; 康夫瀬崎; 真幸高原; 雄太下田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2023-12-20
Anticipated expiration: 2040-03-11
Also published as: JP2021144401A

Description

The present invention relates to a control program, a control method, and a control device.

Computer virtualization technology forms, on a physical machine, virtual processing units that look like physical machines to applications. Each virtual processing unit is allocated a portion of the hardware resources of the physical machine, such as processor and memory, and can execute applications using the allocated resources. Virtualization makes it easy to add, move, and delete virtual processing units, so a system can respond flexibly to changes in application demand and in the operational status of the physical machines.

Virtual processing units include virtual machines in the narrow sense, which run a guest operating system (OS), and lightweight containers, which do not run a guest OS. A virtual machine can be formed on a physical machine by having the physical machine run a hypervisor or the like. A container can be formed on a physical machine by having the physical machine run a container engine, and a container can also be formed on a virtual machine by having the virtual machine run a container engine. Processing units such as physical machines, virtual machines, and containers can therefore be stacked hierarchically.

A failure diagnosis method has been proposed for detecting failures in a distributed application implemented using two or more cloud systems. The proposed method collects and monitors information on connections between different cloud systems and detects connection-related failures.

International Publication No. 2012/162171

One system design approach is to deploy a large number of virtual processing units, distribute finely divided applications across them, and have those applications cooperate to provide a series of services. However, when various applications run on a virtual environment containing many virtual processing units, it is difficult to analyze the cause of a failure efficiently.

When a failure occurs in an application, the root cause may lie not in the application itself but in a failure of the virtual environment that serves as the infrastructure supporting the application. For example, a hardware failure in a physical machine or a failure in the guest OS of a virtual machine may be the root cause of a failure in an application running in a container. However, in a system that includes many processing units and many applications, identifying the other failure that is the root cause of a given application failure is not easy and may require long hours of work by the system administrator.

Accordingly, in one aspect, an object of the present invention is to provide a control program, a control method, and a control device that make analysis of the cause of a failure in a virtualized system more efficient.

In one aspect, a control program is provided that causes a computer to execute the following processing. The processing targets an information processing system that includes a plurality of processing nodes, each of which can execute an application using resources allocated to it and which can be arranged hierarchically using virtualization software, and in which each of a plurality of applications is executed on one of the processing nodes. For this system, the computer acquires first failure information indicating a failure of a first application, second failure information indicating a failure of a second processing node, and a teacher label indicating whether the first failure information and the second failure information are related. Based on the first failure information and the second failure information, the computer calculates a first evaluation value indicating the hierarchical placement relationship between a first processing node that executes the first application and the second processing node, a second evaluation value indicating the dependency between the first application and a second application executed on a processing node placed above the second processing node, and a third evaluation value indicating the similarity between a first error message included in the first failure information and a second error message included in the second failure information. Using training data that associates feature information, including the first, second, and third evaluation values, with the teacher label, the computer generates a model that estimates, from input data corresponding to the feature information for two pieces of failure information, whether the two pieces of failure information are related.

In another aspect, a control method executed by a computer is provided. In yet another aspect, a control device having a storage unit and a processing unit is provided.

In one aspect, analysis of the cause of a failure in a virtualized system can be made more efficient.

A diagram illustrating an example of a control device according to the first embodiment.
A diagram illustrating an example of an information processing system according to the second embodiment.
A block diagram showing an example of the hardware of the management server.
A diagram illustrating an example hierarchy of the virtualization infrastructure.
A diagram showing an example of a system configuration graph.
A diagram showing an example of a service mesh graph.
A diagram showing an example of generating a cause determination model.
A diagram showing an example of a system management screen.
A block diagram illustrating an example of the functions of the management server.
A diagram showing an example of a failure table.
A diagram showing an example of a configuration table.
A diagram showing an example of a service distance table and a service placement table.
A diagram showing an example of a training data table.
A flowchart illustrating an example of a model generation procedure.
A flowchart illustrating an example of a failure cause determination procedure.
A flowchart illustrating an example of a model update procedure.

The embodiments are described below with reference to the drawings.
[First embodiment]
A first embodiment will be described.

FIG. 1 is a diagram illustrating an example of a control device according to the first embodiment.
The control device 10 of the first embodiment is used to analyze failures that occur in the information processing system 20. The control device 10 may be a client device or a server device. The control device 10 can also be called a computer, an information processing device, an analysis device, a machine learning device, or the like. The control device 10 and the information processing system 20 may be connected via a network.

The information processing system 20 is a monitored system that executes applications using computer virtualization technology. The information processing system 20 has a plurality of processing nodes, including processing nodes 21, 22, and 23 (first, second, and third processing nodes).

Each processing node can execute applications using the resources allocated to it. The resources may include hardware resources such as processor computing power, memory storage area, and communication bandwidth. A processing node may be a physical machine or a virtual processing node such as a virtual machine or a container. Two or more processing nodes can be arranged hierarchically using virtualization software. For example, when a processing node runs a hypervisor, a virtual machine can be placed on that processing node. Likewise, when a processing node runs a container engine, a container can be placed on that processing node. One or more virtual machines may be placed on a physical machine, and one or more containers may be placed on each virtual machine.

In the information processing system 20, each of a plurality of applications, including applications 24 and 25 (first and second applications), is executed on one of the processing nodes. An application can also be called application software. Each application may be implemented by a single program or by a set of two or more programs, and runs as a single process or as a set of two or more processes. Such a program can also be called an application program or a user program, and such a process can also be called a user process. Examples of applications include web servers, business logic servers, and database servers.

The plurality of applications may be implemented based on a so-called microservice architecture. In a microservice architecture, multiple applications with finely divided functions are implemented and distributed across multiple processing nodes. These applications communicate and cooperate with one another to provide users with a series of services such as web services.

As an example, processing node 21 executes application 24, and processing node 23 executes application 25. Processing node 23 is located above processing node 22. Here, saying that processing node 23 is above processing node 22 only requires a hierarchical parent-child relationship between the two. The relationship may be direct, with processing node 22 directly controlling processing node 23, or indirect, with one or more additional layers of processing nodes between processing node 22 and processing node 23. The layer closer to the physical machine is the lower layer, and the layer closer to the application is the upper layer.

Processing node 21 may or may not be located above processing node 22. Processing node 21 and processing node 23 may therefore be placed on the same physical machine or on different physical machines. For example, processing nodes 21 and 23 are containers located one layer above a physical machine or a virtual machine, and processing node 22 is a physical machine or a virtual machine.

The control device 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or non-volatile storage such as an HDD (Hard Disk Drive) or flash memory. The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). However, the processing unit 12 may include special-purpose electronic circuits such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The processor executes a program stored in a memory such as a RAM (which may be the storage unit 11). A set of multiple processors is sometimes called a "multiprocessor" or simply a "processor."

The storage unit 11 stores failure information 13 and 14 (first and second failure information) and a teacher label 15. The failure information 13 indicates a failure that occurred in the application 24 in the past. The failure information 14 indicates a failure that occurred in the processing node 22 in the past. The teacher label 15 indicates whether the failure information 13 and the failure information 14 are related. For example, the teacher label 15 is a flag indicating whether the failure of the application 24 was caused by the failure of the processing node 22.

The failure information 13 includes, for example, information identifying the application 24 in which the failure occurred and an error message (first error message) describing the failure. The failure information 13 may be generated by management software (for example, a container library) of the processing node 21 that executes the application 24, or may be generated by the control device 10 from a log output by that management software. Similarly, the failure information 14 includes, for example, information identifying the processing node 22 in which the failure occurred and an error message (second error message) describing the failure. The failure information 14 may be generated by management software of the processing node 22 (for example, the host OS of a physical machine or the guest OS of a virtual machine), or may be generated by the control device 10 from a log output by that management software.
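
The two kinds of failure information described above can be pictured as small records. The following Python sketch is illustrative only: the field names, the `parse_log_line` helper, and the `kind|id|message` log format are assumptions made for this example and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureInfo:
    """One failure record: where the failure occurred and what was reported."""
    source_kind: str  # "application" or "node" (hypothetical field)
    source_id: str    # identifier of the failed application or processing node
    message: str      # error message describing the failure


def parse_log_line(line: str) -> FailureInfo:
    """Parse a hypothetical 'kind|id|message' log line into a FailureInfo."""
    # maxsplit=2 keeps any further '|' characters inside the message.
    kind, source_id, message = line.split("|", 2)
    return FailureInfo(kind.strip(), source_id.strip(), message.strip())


# Example records standing in for failure information 13 and 14.
failure_13 = parse_log_line("application|app24|timeout while calling app25")
failure_14 = parse_log_line("node|node22|network interface access failed")
```

Keeping the identifier separate from the free-text message lets the identifier drive structural lookups while the message is used only for text comparison.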

Examples of failures of the application 24 include failed data communication with another application and data processing timeouts. Examples of failures of the processing node 22 include failed access to hardware such as memory, an HDD, or a communication interface, and abnormal termination of a management process such as a communication process.

The teacher label 15 reflects the result of the system administrator's cause analysis for the failure of the application 24. The teacher label 15 may be created by the system administrator, or may be generated by the control device 10 from a failure response record created by the system administrator. For example, if the failure response concluded that the failure of the application 24 was caused by the failure of the processing node 22, the teacher label 15 indicates "related." If the failure response concluded that the failure of the application 24 was caused by something other than the failure of the processing node 22, the teacher label 15 indicates "unrelated."

The processing unit 12 generates training data from the failure information 13 and 14 and the teacher label 15, and generates a model 17 by machine learning using the training data. To generate the training data, the processing unit 12 generates feature information 16 that includes evaluation values 16a, 16b, and 16c (first, second, and third evaluation values) based on the failure information 13 and 14. The processing unit 12 then generates training data that associates the feature information 16 with the teacher label 15. The feature information 16 corresponds to the explanatory variables, and the teacher label 15 corresponds to the objective variable.

The evaluation value 16a is an index of the parent-child relationship between processing nodes. It indicates the hierarchical placement relationship between the processing node 21, which executes the failed application 24, and the failed processing node 22. For example, the evaluation value 16a indicates the hierarchical relationship between a container and a physical machine or virtual machine. The evaluation value 16a may indicate whether the processing node 21 is located above the processing node 22. Alternatively, it may indicate the distance (difference in layers) between the layer of the processing node 21 and the layer of the processing node 22.
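
One way to compute a value of this kind is to walk a child-to-parent map from the node running the failed application down toward the physical machine. The sketch below is a minimal illustration under assumptions: the `parent` dictionary, the node names, and the choice to return `None` when there is no ancestor relationship are hypothetical and are not specified by the patent.

```python
from typing import Dict, Optional


def hierarchy_distance(parent: Dict[str, str], upper: str, lower: str) -> Optional[int]:
    """Return how many layers separate `upper` from `lower`.

    `parent` maps each node to the node directly beneath it (its host).
    Returns None when `lower` is not found beneath `upper`.
    The map is assumed to be acyclic.
    """
    steps = 0
    node: Optional[str] = upper
    while node is not None:
        if node == lower:
            return steps
        node = parent.get(node)  # move one layer toward the physical machine
        steps += 1
    return None


# Hypothetical placement: node21 sits directly on node22,
# node23 sits on an intermediate VM that sits on node22.
parent = {"node21": "node22", "node23": "vm_x", "vm_x": "node22"}

direct = hierarchy_distance(parent, "node21", "node22")
indirect = hierarchy_distance(parent, "node23", "node22")
unrelated = hierarchy_distance(parent, "node21", "vm_x")
```

A binary "is above" flag can be derived from the same function by testing whether the result is not `None`.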

The evaluation value 16b is an index of the dependency between applications. It indicates the dependency between the application 25, which is executed on the processing node 23 located above the failed processing node 22, and the failed application 24. The communication relationship between the application 24 and the application 25 can be used as the dependency, for example. The evaluation value 16b may indicate whether the application 24 and the application 25 communicate data directly or indirectly. Alternatively, it may indicate the distance between the application 24 and the application 25 in a service mesh graph that represents the communication relationships among the applications.
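
A graph distance of this kind can be obtained with a breadth-first search. The following sketch assumes the service mesh graph is given as an adjacency list and treats communication edges as undirected; both the representation and the application names are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Optional


def service_distance(graph: Dict[str, List[str]], src: str, dst: str) -> Optional[int]:
    """Return the number of hops between two applications, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        app, dist = queue.popleft()
        for neighbor in graph.get(app, []):
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical service mesh: web <-> logic <-> db, plus an isolated batch job.
mesh = {
    "web": ["logic"],
    "logic": ["web", "db"],
    "db": ["logic"],
    "batch": [],
}

web_to_db = service_distance(mesh, "web", "db")
web_to_batch = service_distance(mesh, "web", "batch")
```

A distance of 1 corresponds to direct communication, larger values to indirect communication, and `None` to no communication path at all.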

The evaluation value 16c indicates the similarity between the error message included in the failure information 13 and the error message included in the failure information 14. The cosine similarity of bag-of-words vectors is used as the similarity index, for example, although other indices such as edit distance (Levenshtein distance) can also be used. Before calculating the similarity, preprocessing may be applied to each error message, such as filtering out predetermined kinds of character strings that can act as noise (for example, numerical values) or extracting only characteristic keywords.
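
The bag-of-words cosine similarity with a simple noise filter can be sketched as follows. The tokenizer and the rule of dropping any token that contains a digit are assumptions chosen for this example; the patent leaves the exact preprocessing open.

```python
import math
import re
from collections import Counter


def bag_of_words(message: str) -> Counter:
    """Lowercase the message, drop tokens containing digits, and count the rest."""
    tokens = re.findall(r"[a-z0-9_]+", message.lower())
    return Counter(t for t in tokens if not any(c.isdigit() for c in t))


def cosine_similarity(msg_a: str, msg_b: str) -> float:
    """Cosine similarity of the two bag-of-words vectors (0.0 if either is empty)."""
    a, b = bag_of_words(msg_a), bag_of_words(msg_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Numbers differ between the two messages but are filtered out as noise.
same_kind = cosine_similarity("timeout after 30 s on port 8080",
                              "timeout on port 443 after 5 s")
different = cosine_similarity("disk read error", "connection refused")
```

Because word order is ignored and numeric tokens are removed, two messages that report the same kind of failure with different ports or durations score close to 1.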

One record of training data is generated for each pair of pieces of failure information. The processing unit 12 generates training data records from other pairs of failure information in the same way as for the pair of failure information 13 and 14. Once the training data is generated, the processing unit 12 uses it to generate the model 17 by machine learning. The model 17 estimates, from input data corresponding to the feature information for two pieces of failure information, whether those two pieces of failure information are related. The model 17 may determine whether two failures are related, or whether one failure is the cause of the other. The model 17 may also output a confidence level for its determination.

The model 17 includes parameters whose values are determined by machine learning. The model 17 is, for example, a regression model generated by logistic regression analysis. However, the model 17 may be another type of model, such as a support vector machine (SVM), a random forest, or a neural network.
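
As a rough illustration of the logistic regression case, the sketch below fits weights to three-element feature vectors by batch gradient descent and returns a confidence for a new pair of failures. The training rows, the learning rate, and the epoch count are made-up values for this example, and plain gradient descent is only one of several ways such a model could be fitted.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows: Sequence[Sequence[float]], labels: Sequence[int],
                   lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for i, v in enumerate(x):
                grad_w[i] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def confidence(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Estimated probability that the two failures are related."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Made-up records: (hierarchy value, dependency value, message similarity) -> label.
rows = [(1, 1.0, 0.8), (1, 0.5, 0.6), (1, 1.0, 0.9),
        (0, 0.0, 0.1), (0, 0.5, 0.2), (0, 0.0, 0.3)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = "related", 0 = "unrelated"

weights, bias = train_logistic(rows, labels)
related_score = confidence(weights, bias, (1, 1.0, 0.9))
unrelated_score = confidence(weights, bias, (0, 0.0, 0.1))
```

The sigmoid output can be read directly as a confidence level, which matches the option of having the model report how certain it is.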

According to the control device 10 of the first embodiment, the model 17 is generated from the failure information 13 and 14, which indicate past failures. The model 17 determines whether an application failure is related to a failure of a processing node included in the infrastructure that serves as the virtual environment. Using the generated model 17, it is possible to evaluate the likelihood that an application failure was caused by a failure of a specific processing node. This supports the system administrator's failure response and can shorten the administrator's working time.

The feature information 16 that is input to the model 17 includes the evaluation value 16a, which indicates the hierarchical relationship between the processing node that executes the failed application and a processing node that failed around the same time. This captures how a failure propagates from a lower layer to an upper layer. The feature information 16 also includes the evaluation value 16b, which indicates the dependency between the failed application and an application executed in a layer above the failed processing node. This captures how a failure propagates between applications through communication errors and the like. The feature information 16 further includes the evaluation value 16c, which indicates the similarity of the error messages. This captures how failure propagation causes the same kind of failure to appear in both the application and the processing node. Together, these make it possible to determine the relationship between two failures accurately.

[Second embodiment]
Next, a second embodiment will be described.
FIG. 2 is a diagram illustrating an example of an information processing system according to the second embodiment.

The information processing system of the second embodiment includes a client terminal 31, a management server 100, and a monitored system 200, all of which are connected to a network 30. The network 30 may include a LAN (Local Area Network) and may include a wide area network such as the Internet. The management server 100 corresponds to the control device 10 of the first embodiment. The monitored system 200 corresponds to the information processing system 20 of the first embodiment.

The client terminal 31 is a client computer used by the system administrator who manages the monitored system 200. When a failure occurs in the monitored system 200, the client terminal 31 receives failure information from the management server 100 and displays it. When two or more failures occur around the same time, the failure information received from the management server 100 may include information indicating the likelihood that one failure is the root cause of another. The system administrator refers to the failure information displayed on the client terminal 31 and carries out failure response work, including identifying the cause of the failure. The client terminal 31 accepts failure response information entered by the system administrator and sends it to the management server 100.

The management server 100 is a server computer that monitors the monitored system 200. The management server 100 collects configuration information indicating the placement of the components included in the monitored system 200, service information indicating the logical relationships between the applications executed on the monitored system 200, and failure information indicating failures that have occurred in the monitored system 200. When the management server 100 detects a failure, it sends failure information to the client terminal 31. The management server 100 also receives failure response information from the client terminal 31 and stores it.

When two or more failures occur around the same time, the management server 100 evaluates the likelihood that one failure is the root cause of another, includes the evaluation result in the failure information, and sends it to the client terminal 31. A cause determination model generated by machine learning is used to evaluate the relationship between two failures. The cause determination model is a regression model that calculates, from information on two failures, a confidence level indicating the probability that the two failures are related. The management server 100 generates training data from previously collected configuration information, service information, and failure information, and uses this training data to generate the cause determination model. The management server 100 also updates the cause determination model based on feedback from the client terminal 31 so that the determination accuracy improves.
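
The evaluation step can be pictured as scoring each concurrent node failure against an application failure and sorting the candidates by confidence. In the sketch below, the weights, the bias, the feature values, and the candidate names are hypothetical placeholders for a trained cause determination model and its inputs.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical parameters standing in for a trained cause determination model.
WEIGHTS = (2.0, 1.5, 1.0)
BIAS = -2.5


def confidence(features: Tuple[float, float, float]) -> float:
    """Logistic confidence that a candidate failure is related to the application failure."""
    z = BIAS + sum(w * v for w, v in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidate_causes(
    candidates: Dict[str, Tuple[float, float, float]]
) -> List[Tuple[str, float]]:
    """Return candidates sorted from most to least likely root cause."""
    scored = [(name, confidence(features)) for name, features in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Feature tuples: (hierarchy value, dependency value, message similarity).
candidates = {
    "pm1 disk error": (1.0, 1.0, 0.7),
    "vm_b nic error": (0.0, 0.5, 0.2),
}
ranked = rank_candidate_causes(candidates)
```

A ranked list of this form is one way the evaluation result could be attached to the failure information shown to the system administrator.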

監視対象システム２００は、コンピュータ仮想化技術を利用してアプリケーションを実行する情報処理システムである。監視対象システム２００は、サービス事業者が所有するオンプレミスシステム（自社システム）であってもよいし、クラウド事業者が所有してサービス事業者に有料で使用させるクラウドシステムであってもよい。 The monitored system 200 is an information processing system that executes applications using computer virtualization technology. The monitored system 200 may be an on-premises system (in-house system) owned by a service provider, or a cloud system owned by a cloud provider and used by the service provider for a fee.

監視対象システム２００は、物理マシン２０１，２０２を含む複数の物理マシンと、スイッチ２０３を含む１以上の通信装置とを含む。物理マシン２０１，２０２は、スイッチ２０３に接続されている。物理マシン２０１，２０２は、仮想化ソフトウェアを用いて複数の仮想処理ノードを形成するサーバコンピュータである。仮想処理ノードには、ハイパーバイザを用いて形成される狭義の仮想マシンと、コンテナエンジンを用いて形成されるコンテナとが含まれる。仮想マシンは、ゲストＯＳを実行する独立性の高い仮想処理ノードであるのに対し、コンテナは、ゲストＯＳを実行しない軽量の仮想処理ノードである。 The monitored system 200 includes a plurality of physical machines including physical machines 201 and 202, and one or more communication devices including a switch 203. Physical machines 201 and 202 are connected to a switch 203. Physical machines 201 and 202 are server computers that form a plurality of virtual processing nodes using virtualization software. The virtual processing node includes a narrowly defined virtual machine formed using a hypervisor and a container formed using a container engine. A virtual machine is a highly independent virtual processing node that runs a guest OS, whereas a container is a lightweight virtual processing node that does not run a guest OS.

監視対象システム２００では、マイクロサービスアーキテクチャに基づいて実装されたアプリケーションが実行される。Ｗｅｂサービスなどのサービスを実現するための機能が細分化されて複数のアプリケーションとして実装され、異なるアプリケーションが異なる仮想処理ノードで実行される。アプリケーションには、例えば、Ｗｅｂサーバ、業務ロジックサーバ、データベースサーバなどが含まれる。複数のアプリケーションが相互に通信して連携し、一連のサービスを実現する。第２の実施の形態では、後述するように、アプリケーションがコンテナで実行されるようにする。これにより、負荷の変動に応じたスケールアウト（サーバ台数の増加）やスケールイン（サーバ台数の減少）が容易となる。また、新しい機能をもつアプリケーションの追加も容易となる。 The monitored system 200 executes an application implemented based on a microservice architecture. Functions for realizing services such as web services are subdivided and implemented as multiple applications, and different applications are executed on different virtual processing nodes. Applications include, for example, web servers, business logic servers, database servers, and the like. Multiple applications communicate and cooperate with each other to realize a series of services. In the second embodiment, an application is executed in a container, as will be described later. This facilitates scale-out (increase in the number of servers) and scale-in (decrease in the number of servers) in response to load fluctuations. Additionally, it becomes easy to add applications with new functions.

図３は、管理サーバのハードウェア例を示すブロック図である。 FIG. 3 is a block diagram showing an example of hardware of the management server.
管理サーバ１００は、ＣＰＵ１０１、ＲＡＭ１０２、ＨＤＤ１０３、画像インタフェース１０４、入力インタフェース１０５、媒体リーダ１０６および通信インタフェース１０７を有する。管理サーバ１００が有するこれらのユニットは、バスに接続されている。ＣＰＵ１０１は、第１の実施の形態の処理部１２に対応する。ＲＡＭ１０２またはＨＤＤ１０３は、第１の実施の形態の記憶部１１に対応する。クライアント端末３１や物理マシン２０１，２０２も、管理サーバ１００と同様のハードウェアを用いて実現できる。 The management server 100 includes a CPU 101, a RAM 102, an HDD 103, an image interface 104, an input interface 105, a media reader 106, and a communication interface 107. These units included in the management server 100 are connected to a bus. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment. The client terminal 31 and the physical machines 201 and 202 can also be realized using the same hardware as the management server 100.

ＣＰＵ１０１は、プログラムの命令を実行するプロセッサである。ＣＰＵ１０１は、ＨＤＤ１０３に記憶されたプログラムやデータの少なくとも一部をＲＡＭ１０２にロードし、プログラムを実行する。ＣＰＵ１０１は複数のプロセッサコアを備えてもよく、管理サーバ１００は複数のプロセッサを備えてもよい。複数のプロセッサの集合を「マルチプロセッサ」または単に「プロセッサ」と言うことがある。 The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least part of the program and data stored in the HDD 103 into the RAM 102, and executes the program. The CPU 101 may include multiple processor cores, and the management server 100 may include multiple processors. A collection of multiple processors is sometimes referred to as a "multiprocessor" or simply "processor."

ＲＡＭ１０２は、ＣＰＵ１０１が実行するプログラムやＣＰＵ１０１が演算に使用するデータを一時的に記憶する揮発性半導体メモリである。管理サーバ１００は、ＲＡＭ以外の種類のメモリを備えてもよく、複数のメモリを備えてもよい。 The RAM 102 is a volatile semiconductor memory that temporarily stores programs executed by the CPU 101 and data used by the CPU 101 for calculations. The management server 100 may include a type of memory other than RAM, or may include a plurality of memories.

ＨＤＤ１０３は、ＯＳやミドルウェアやアプリケーションソフトウェアなどのソフトウェアのプログラム、および、データを記憶する不揮発性ストレージである。管理サーバ１００は、フラッシュメモリやＳＳＤ（Solid State Drive）など他の種類のストレージを備えてもよく、複数のストレージを備えてもよい。 The HDD 103 is a nonvolatile storage that stores software programs such as an OS, middleware, and application software, and data. The management server 100 may include other types of storage such as flash memory and SSD (Solid State Drive), or may include multiple storages.

画像インタフェース１０４は、ＣＰＵ１０１からの命令に従って、管理サーバ１００に接続された表示装置１１１に画像を出力する。表示装置１１１として、ＣＲＴ（Cathode Ray Tube）ディスプレイ、液晶ディスプレイ（ＬＣＤ：Liquid Crystal Display）、有機ＥＬ（ＯＥＬ：Organic Electro-Luminescence）ディスプレイ、プロジェクタなど、任意の種類の表示装置を使用することができる。管理サーバ１００に、プリンタなど表示装置１１１以外の出力デバイスが接続されてもよい。 The image interface 104 outputs an image to the display device 111 connected to the management server 100 according to instructions from the CPU 101. As the display device 111, any type of display device can be used, such as a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD), an organic electro-luminescence (OEL) display, or a projector. An output device other than the display device 111, such as a printer, may be connected to the management server 100.

入力インタフェース１０５は、管理サーバ１００に接続された入力デバイス１１２から入力信号を受け付ける。入力デバイス１１２として、マウス、タッチパネル、タッチパッド、キーボードなど、任意の種類の入力デバイスを使用することができる。管理サーバ１００に複数種類の入力デバイスが接続されてもよい。 The input interface 105 receives input signals from the input device 112 connected to the management server 100. Any type of input device can be used as the input device 112, such as a mouse, touch panel, touch pad, keyboard, etc. Multiple types of input devices may be connected to the management server 100.

媒体リーダ１０６は、記録媒体１１３に記録されたプログラムやデータを読み取る読み取り装置である。記録媒体１１３として、フレキシブルディスク（ＦＤ：Flexible Disk）やＨＤＤなどの磁気ディスク、ＣＤ（Compact Disc）やＤＶＤ（Digital Versatile Disc）などの光ディスク、半導体メモリなど、任意の種類の記録媒体を使用することができる。媒体リーダ１０６は、例えば、記録媒体１１３から読み取ったプログラムやデータを、ＲＡＭ１０２やＨＤＤ１０３などの他の記録媒体にコピーする。読み取られたプログラムは、例えば、ＣＰＵ１０１によって実行される。なお、記録媒体１１３は可搬型記録媒体であってもよく、プログラムやデータの配布に用いられることがある。また、記録媒体１１３やＨＤＤ１０３を、コンピュータ読み取り可能な記録媒体と言うことがある。 The media reader 106 is a reading device that reads programs and data recorded on the recording medium 113. As the recording medium 113, any type of recording medium can be used, such as a magnetic disk such as a flexible disk (FD) or an HDD, an optical disc such as a compact disc (CD) or a digital versatile disc (DVD), or a semiconductor memory. For example, the media reader 106 copies programs and data read from the recording medium 113 to other recording media such as the RAM 102 and the HDD 103. The read program is executed by the CPU 101, for example. Note that the recording medium 113 may be a portable recording medium, and may be used for distributing programs and data. Further, the recording medium 113 and the HDD 103 are sometimes referred to as computer-readable recording media.

通信インタフェース１０７は、ネットワーク３０に接続され、ネットワーク３０を介してクライアント端末３１や物理マシン２０１，２０２と通信する。通信インタフェース１０７は、スイッチやルータなどの有線通信装置に接続される有線通信インタフェースである。ただし、通信インタフェース１０７が、基地局やアクセスポイントなどの無線通信装置に接続される無線通信インタフェースであってもよい。 The communication interface 107 is connected to the network 30 and communicates with the client terminal 31 and the physical machines 201 and 202 via the network 30. The communication interface 107 is a wired communication interface connected to a wired communication device such as a switch or a router. However, the communication interface 107 may be a wireless communication interface connected to a wireless communication device such as a base station or an access point.

次に、監視対象システム２００の仮想環境について説明する。 Next, the virtual environment of the monitored system 200 will be explained.
図４は、仮想化インフラストラクチャの階層例を示す図である。 FIG. 4 is a diagram illustrating an example hierarchy of virtualization infrastructure.
監視対象システム２００は、コンピュータ仮想化技術として、ハイパーバイザ型仮想化とコンテナ型仮想化を併用する。ハイパーバイザ型仮想化では、仮想化ソフトウェアであるハイパーバイザを用いて、ゲストＯＳを含む仮想マシンを形成する。コンテナ型仮想化では、仮想化ソフトウェアであるコンテナエンジンを用いて、ゲストＯＳを含まないコンテナを形成する。物理マシン、仮想マシンおよびコンテナは何れも、アプリケーションからコンピュータとして認識され得る処理ノードである。ただし、仮想マシンおよびコンテナは、物理マシンそのものではない仮想処理ノードである。 The monitored system 200 uses both hypervisor virtualization and container virtualization as computer virtualization technologies. In hypervisor-type virtualization, a virtual machine including a guest OS is formed using a hypervisor, which is virtualization software. In container-type virtualization, a container engine, which is virtualization software, is used to form a container that does not include a guest OS. Physical machines, virtual machines, and containers are all processing nodes that can be seen as computers by applications. However, virtual machines and containers are virtual processing nodes that are not physical machines themselves.

第２の実施の形態では、コンテナがアプリケーションを実行する。監視対象システム２００は、物理マシンの上にコンテナを配置する２階層の仮想環境と、物理マシンの上に仮想マシンを配置し仮想マシンの上にコンテナを配置する３階層の仮想環境とを併用する。２階層の仮想環境では、物理マシンの上で１以上のコンテナが動作し、各コンテナの上で１以上のアプリケーションが動作する。３階層の仮想環境では、物理マシンの上で１以上の仮想マシンが動作し、各仮想マシンの上で１以上のコンテナが動作し、各コンテナの上で１以上のアプリケーションが動作する。１台の物理マシンの中に、２階層の仮想環境と３階層の仮想環境とを混在させることも可能である。 In the second embodiment, a container executes an application. The monitored system 200 uses both a two-tier virtual environment, in which containers are placed on top of physical machines, and a three-tier virtual environment, in which virtual machines are placed on top of physical machines and containers are placed on top of virtual machines. In a two-tier virtual environment, one or more containers run on a physical machine, and one or more applications run on each container. In a three-tier virtual environment, one or more virtual machines run on a physical machine, one or more containers run on each virtual machine, and one or more applications run on each container. It is also possible for a two-tier virtual environment and a three-tier virtual environment to coexist on one physical machine.

２階層の仮想環境では、物理マシンがホストＯＳ２４１およびコンテナエンジン２４４ａを実行する。ホストＯＳ２４１は、物理マシンのハードウェアリソースを管理し、物理マシン上でのプロセスの実行を制御する。ハードウェアリソースには、ＣＰＵの演算能力、ＲＡＭの記憶領域、ＨＤＤの記憶領域、通信インタフェースの通信帯域などが含まれる。メモリ空間がカーネル空間とユーザ空間とに分けて管理される。ホストＯＳ２４１は、コンテナから見てベースＯＳと言われることもある。 In a two-tier virtual environment, a physical machine executes a host OS 241 and a container engine 244a. The host OS 241 manages the hardware resources of the physical machine and controls the execution of processes on the physical machine. The hardware resources include the computing power of the CPU, the storage area of the RAM, the storage area of the HDD, the communication band of the communication interface, and the like. Memory space is managed separately into kernel space and user space. The host OS 241 is sometimes referred to as a base OS from the perspective of the container.

ホストＯＳ２４１は、物理マシンの稼働状況を示すログを生成する。ホストＯＳ２４１が生成するログは、物理マシンの障害を示すことがある。物理マシンの障害には、ハードウェアリソースのアクセス異常や、重要な管理用プロセスの異常停止が含まれる。物理マシンは、ホスト名、ＩＰ（Internet Protocol）アドレス、ＭＡＣ（Medium Access Control）アドレスなどの識別子をもつ。ホストＯＳ２４１は、物理マシンの識別子を知っている。 The host OS 241 generates a log indicating the operating status of the physical machine. A log generated by the host OS 241 may indicate a failure in the physical machine. Physical machine failures include abnormal access to hardware resources and abnormal termination of important management processes. A physical machine has an identifier such as a host name, an IP (Internet Protocol) address, and a MAC (Medium Access Control) address. The host OS 241 knows the identifier of the physical machine.

コンテナエンジン２４４ａは、物理マシンの直上に１以上のコンテナを形成する仮想化ソフトウェアである。コンテナエンジン２４４ａは、物理マシンが有するハードウェアリソースの一部をコンテナに割り当てる。コンテナエンジン２４４ａは、ホストＯＳ２４１が管理するユーザ空間の一部を仮想ユーザ空間としてコンテナに提供する。コンテナエンジン２４４ａは、物理マシン上で動作するコンテナを把握している。 The container engine 244a is virtualization software that forms one or more containers directly above a physical machine. The container engine 244a allocates a part of the hardware resources of the physical machine to the container. The container engine 244a provides a part of the user space managed by the host OS 241 to the container as a virtual user space. The container engine 244a knows the containers running on the physical machine.

また、２階層の仮想環境では、コンテナがコンテナライブラリ２４５ａおよびサイドカープロキシ２４６ａを実行する。コンテナライブラリ２４５ａは、コンテナに割り当てられたハードウェアリソースを用いて、コンテナ上でのプロセスの実行を制御する。コンテナライブラリ２４５ａは、コンテナエンジン２４４ａから割り当てられた仮想ユーザ空間を使用する。コンテナライブラリ２４５ａは、ＯＳそのものではないものの、アプリケーションに対して限定的なＯＳ機能をＡＰＩ（Application Programming Interface）として提供する。アプリケーションからＯＳのように見えることがあるため、コンテナライブラリ２４５ａをコンテナＯＳと言うことがある。 Further, in a two-tier virtual environment, a container executes a container library 245a and a sidecar proxy 246a. The container library 245a controls execution of processes on containers using hardware resources allocated to the containers. Container library 245a uses virtual user space allocated from container engine 244a. Although the container library 245a is not an OS itself, it provides limited OS functions to applications as an API (Application Programming Interface). The container library 245a is sometimes referred to as a container OS because it may look like an OS from an application.

コンテナライブラリ２４５ａは、コンテナの稼働状況を示すログを生成する。コンテナライブラリ２４５ａが生成するログには、コンテナが実行するアプリケーションの障害を示すことがある。アプリケーションの障害には、他のアプリケーションとの通信失敗や、アプリケーションのプロセスの異常停止が含まれる。コンテナは、ホスト名やアドレスなどの識別子をもつ。コンテナライブラリ２４５ａは、コンテナの識別子を知っている。 The container library 245a generates a log indicating the operating status of the container. The logs generated by the container library 245a may indicate failures in applications executed by the container. Application failures include communication failures with other applications and abnormal termination of application processes. A container has an identifier such as a host name or address. The container library 245a knows the container identifier.

サイドカープロキシ２４６ａは、プロキシサーバのソフトウェアである。サイドカープロキシ２４６ａは、コンテナで実行されるアプリケーションが他のアプリケーション（特に、他のコンテナ上のアプリケーション）と通信するとき、アプリケーション層において通信を中継する。これにより、サイドカープロキシ２４６ａは、アプリケーション間で行われるメッセージ通信を把握することができる。コンテナライブラリ２４５ａまたはサイドカープロキシ２４６ａは、コンテナ上で動作するアプリケーションを把握している。 Sidecar proxy 246a is proxy server software. Sidecar proxy 246a relays communications at the application layer when applications running in containers communicate with other applications (particularly applications on other containers). This allows the sidecar proxy 246a to understand message communication performed between applications. The container library 245a or sidecar proxy 246a knows the applications running on the container.

また、２階層の仮想環境では、上記のコンテナの上にサービスノード２４７ａが配置される。サービスノード２４７ａは、あるサービスを実現するために細分化されて実装された１つのアプリケーションである。サービスノード２４７ａは、コンテナライブラリ２４５ａの制御のもとで実行される。サービスノード２４７ａは、例えば、Ｗｅｂサーバ、業務ロジックサーバ、データベースサーバなどのサーバアプリケーションである。サービスノード２４７ａは、単一のユーザプログラムまたは２以上のユーザプログラムの集合によって実装される。また、サービスノード２４７ａは、単一のプロセスまたは２以上のプロセスの集合として実行される。サービスノード２４７ａは、サーバ名やＵＲＬ（Uniform Resource Locator）などの識別子をもつ。 Furthermore, in a two-tier virtual environment, a service node 247a is placed above the container. The service node 247a is one application that is segmented and implemented to realize a certain service. Service node 247a is executed under the control of container library 245a. The service node 247a is, for example, a server application such as a web server, business logic server, or database server. Service node 247a is implemented by a single user program or a collection of two or more user programs. Further, the service node 247a is executed as a single process or a collection of two or more processes. The service node 247a has an identifier such as a server name or a URL (Uniform Resource Locator).

３階層の仮想環境では、物理マシンがハイパーバイザ２４２を実行する。ハイパーバイザ２４２は、物理マシンの直上に１以上の仮想マシンを形成する仮想化ソフトウェアである。ハイパーバイザ２４２は、物理マシンが有するハードウェアリソースの一部を仮想マシンに割り当てる。ハイパーバイザ２４２は、物理マシン上で動作する仮想マシンを把握している。なお、３階層の仮想環境でも、物理マシンは管理用ＯＳを実行している。 In a three-tier virtual environment, a physical machine runs the hypervisor 242. Hypervisor 242 is virtualization software that forms one or more virtual machines directly above a physical machine. The hypervisor 242 allocates some of the hardware resources of the physical machine to the virtual machine. The hypervisor 242 knows the virtual machines running on the physical machines. Note that even in a three-layer virtual environment, the physical machine runs a management OS.

また、３階層の仮想環境では、仮想マシンがゲストＯＳ２４３およびコンテナエンジン２４４ｂを実行する。ゲストＯＳ２４３は、仮想マシンに割り当てられたハードウェアリソースを管理し、仮想マシン上でのプロセスの実行を制御する。ゲストＯＳ２４３は、コンテナから見てベースＯＳと言われることもある。ゲストＯＳ２４３は、仮想マシンの稼働状況を示すログを生成する。ゲストＯＳ２４３が生成するログは、仮想マシンの障害を示すことがある。仮想マシンの障害には、割り当てられたハードウェアリソースのアクセス異常や、重要な管理用プロセスの異常停止が含まれる。仮想マシンは、ホスト名やアドレスなどの識別子をもつ。ゲストＯＳ２４３は、仮想マシンの識別子を知っている。 Further, in a three-tier virtual environment, a virtual machine executes a guest OS 243 and a container engine 244b. The guest OS 243 manages hardware resources allocated to virtual machines and controls execution of processes on the virtual machines. The guest OS 243 is sometimes called a base OS from the perspective of the container. The guest OS 243 generates a log indicating the operating status of the virtual machine. A log generated by the guest OS 243 may indicate a failure of the virtual machine. Virtual machine failures include abnormal access to allocated hardware resources and abnormal termination of important management processes. A virtual machine has an identifier such as a host name or address. The guest OS 243 knows the virtual machine identifier.

コンテナエンジン２４４ｂは、仮想マシンの直上に１以上のコンテナを形成する仮想化ソフトウェアである。コンテナエンジン２４４ｂは、仮想マシンに割り当てられたハードウェアリソースの一部を更にコンテナに割り当てる。コンテナエンジン２４４ｂは、ゲストＯＳ２４３が管理するユーザ空間の一部を仮想ユーザ空間としてコンテナに提供する。コンテナエンジン２４４ｂは、仮想マシン上で動作するコンテナを把握している。 The container engine 244b is virtualization software that forms one or more containers directly above a virtual machine. The container engine 244b further allocates a portion of the hardware resources allocated to the virtual machine to the container. The container engine 244b provides a part of the user space managed by the guest OS 243 to the container as a virtual user space. The container engine 244b knows the containers running on the virtual machines.

また、３階層の仮想環境では、コンテナがコンテナライブラリ２４５ｂおよびサイドカープロキシ２４６ｂを実行する。コンテナライブラリ２４５ｂの機能はコンテナライブラリ２４５ａと同様である。サイドカープロキシ２４６ｂの機能は、サイドカープロキシ２４６ａと同様である。また、３階層の仮想環境では、上記のコンテナの上にサービスノード２４７ｂが配置される。サービスノード２４７ｂは、１つのアプリケーションである。サービスノード２４７ｂは、サービスノード２４７ａと通信して連携することがある。 Further, in a three-tier virtual environment, a container executes a container library 245b and a sidecar proxy 246b. The functionality of the container library 245b is similar to that of the container library 245a. The functionality of sidecar proxy 246b is similar to sidecar proxy 246a. Furthermore, in a three-layer virtual environment, the service node 247b is placed above the container. Service node 247b is one application. Service node 247b may communicate and cooperate with service node 247a.

ここで、サービスノード２４７ａ，２４７ｂなどのアプリケーションの機能に生じた障害を、アプリケーション障害または略して「アプリ障害」と言うことがある。また、物理マシン、仮想マシン、コンテナなどの処理ノードの機能に生じた障害を、インフラストラクチャ障害または略して「インフラ障害」と言うことがある。第２の実施の形態では、インフラ障害として主に、物理マシンの障害と仮想マシンの障害を想定する。 Here, a failure that occurs in the function of an application such as the service nodes 247a and 247b may be referred to as an application failure, or "app failure" for short. Further, a failure that occurs in the function of a processing node such as a physical machine, virtual machine, or container may be referred to as an infrastructure failure, or "infra failure" for short. In the second embodiment, the infrastructure failures considered are mainly physical machine failures and virtual machine failures.

図５は、システム構成グラフの例を示す図である。 FIG. 5 is a diagram showing an example of a system configuration graph.
第２の実施の形態の説明では、一例として、図５に示すような処理ノードおよびサービスノードの配置を使用する。監視対象システム２００は、処理ノードとして、物理マシン２０１，２０２、仮想マシン２１１，２１２およびコンテナ２２１，２２２，２２３，２２４，２２５，２２６を含む。また、監視対象システム２００は、サービスノード２３１，２３２，２３３，２３４，２３５，２３６を含む。これらの処理ノードおよびサービスノードのトポロジは、スイッチ２０３をルートとする木構造になっている。 In the description of the second embodiment, the arrangement of processing nodes and service nodes as shown in FIG. 5 will be used as an example. The monitored system 200 includes physical machines 201, 202, virtual machines 211, 212, and containers 221, 222, 223, 224, 225, 226 as processing nodes. Additionally, the monitored system 200 includes service nodes 231, 232, 233, 234, 235, and 236. The topology of these processing nodes and service nodes is a tree structure with the switch 203 as the root.

物理マシン２０１（Ｍ１）の直上には、仮想マシン２１１（ＶＭ１）およびコンテナ２２１，２２２（Ｃ１，Ｃ２）が配置されている。仮想マシン２１１の直上には、コンテナ２２３，２２４（Ｃ３，Ｃ４）が配置されている。物理マシン２０２（Ｍ２）の直上には、仮想マシン２１２（ＶＭ２）およびコンテナ２２６（Ｃ６）が配置されている。仮想マシン２１２の直上には、コンテナ２２５（Ｃ５）が配置されている。 A virtual machine 211 (VM1) and containers 221, 222 (C1, C2) are placed directly above the physical machine 201 (M1). Containers 223 and 224 (C3, C4) are placed directly above the virtual machine 211. A virtual machine 212 (VM2) and a container 226 (C6) are placed directly above the physical machine 202 (M2). A container 225 (C5) is placed directly above the virtual machine 212.

コンテナ２２１は、サービスノード２３１（ＡＰ１）を実行する。コンテナ２２２は、サービスノード２３２（ＡＰ２）を実行する。コンテナ２２３は、サービスノード２３３（ＡＰ３）を実行する。コンテナ２２４は、サービスノード２３４（ＡＰ４）を実行する。コンテナ２２５は、サービスノード２３５（ＡＰ５）を実行する。コンテナ２２６は、サービスノード２３６（ＡＰ６）を実行する。 Container 221 executes service node 231 (AP1). Container 222 runs service node 232 (AP2). Container 223 executes service node 233 (AP3). Container 224 runs service node 234 (AP4). Container 225 executes service node 235 (AP5). Container 226 runs service node 236 (AP6).
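The arrangement of FIG. 5 described above can be written down as a small tree, which may help when following the later distance calculations. The following is a minimal sketch, assuming a parent map keyed by the labels used in the text (M1, VM1, C1, AP1, and so on); the label SW for the switch 203, the dictionary layout, and the helper names `ancestors` and `services_above` are hypothetical and do not appear in the source.

```python
# Parent map for the FIG. 5 example: each node maps to the node it runs on.
# Labels follow the text; "SW" (switch 203) and this layout are assumptions.
PARENT = {
    "M1": "SW", "M2": "SW",
    "VM1": "M1", "C1": "M1", "C2": "M1",
    "C3": "VM1", "C4": "VM1",
    "VM2": "M2", "C6": "M2",
    "C5": "VM2",
    "AP1": "C1", "AP2": "C2", "AP3": "C3",
    "AP4": "C4", "AP5": "C5", "AP6": "C6",
}


def ancestors(node):
    """Return the nodes that `node` runs on, nearest first, ending at the root."""
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def services_above(machine):
    """Return the service nodes (AP*) executed somewhere above `machine`."""
    return sorted(
        n for n in PARENT if n.startswith("AP") and machine in ancestors(n)
    )
```

For example, `services_above("VM1")` lists AP3 and AP4, which matches the statement that containers 223 and 224 sit directly above the virtual machine 211.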

図６は、サービスメッシュグラフの例を示す図である。 FIG. 6 is a diagram showing an example of a service mesh graph.
サービスノード２３１，２３２，２３３，２３４，２３５，２３６は、相互に通信することで、連携して１つのサービスを実現する。サービスノード間の通信には、例えば、ＲＥＳＴ（Representational State Transfer）などの軽量な通信ＡＰＩが使用される。前述のサイドカープロキシ２４６ａ，２４６ｂに相当するサイドカープロキシを用いることで、サービスノード間のメッセージ通信の状況を把握することが可能である。 The service nodes 231, 232, 233, 234, 235, and 236 cooperate to realize one service by communicating with each other. For example, a lightweight communication API such as REST (Representational State Transfer) is used for communication between service nodes. By using sidecar proxies corresponding to the sidecar proxies 246a and 246b described above, it is possible to grasp the status of message communication between service nodes.

サービスノードのペアの中には、直接通信することがあるペアもあるし、直接通信することがないペアもある。直接通信することがある２つのサービスノードの間にエッジを記述すると、図６に示すようなサービスメッシュグラフを生成することができる。サービスメッシュグラフは、サービスノードを示す節点とサービスノード間の通信を示す節点間のエッジとを含む無向グラフである。第２の実施の形態の説明では、一例として、図６に示すサービスメッシュグラフを使用する。 Some pairs of service nodes may communicate directly, while others may not. By describing an edge between two service nodes that may communicate directly, a service mesh graph as shown in FIG. 6 can be generated. A service mesh graph is an undirected graph that includes nodes indicating service nodes and edges between nodes indicating communication between service nodes. In the description of the second embodiment, the service mesh graph shown in FIG. 6 is used as an example.

サービスノード２３１（ＡＰ１）は、サービスノード２３３と通信する。サービスノード２３２（ＡＰ２）は、サービスノード２３３，２３４，２３６と通信する。サービスノード２３３（ＡＰ３）は、サービスノード２３１，２３２，２３４と通信する。サービスノード２３４（ＡＰ４）は、サービスノード２３２，２３３と通信する。サービスノード２３５（ＡＰ５）は、サービスノード２３６と通信する。サービスノード２３６（ＡＰ６）は、サービスノード２３２，２３５と通信する。上記以外のサービスノードのペアは、直接には通信しない。ただし、このサービスメッシュグラフは連結グラフであるため、任意の２つのサービスノードを結ぶパスが存在する。よって、全てのサービスノードが連携している。 Service node 231 (AP1) communicates with service node 233. Service node 232 (AP2) communicates with service nodes 233, 234, and 236. Service node 233 (AP3) communicates with service nodes 231, 232, and 234. Service node 234 (AP4) communicates with service nodes 232 and 233. Service node 235 (AP5) communicates with service node 236. Service node 236 (AP6) communicates with service nodes 232 and 235. Pairs of service nodes other than those listed above do not communicate directly. However, since this service mesh graph is a connected graph, there is a path connecting any two service nodes. Therefore, all service nodes are cooperating.
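The communication pairs listed above amount to six undirected edges. The following is a minimal sketch, assuming an edge-list representation, that builds the adjacency sets and checks the connectivity claim with a breadth-first traversal; the names `EDGES`, `build_adjacency`, and `is_connected` are hypothetical.

```python
from collections import deque

# Undirected edges of the FIG. 6 service mesh graph, taken from the text.
EDGES = [
    ("AP1", "AP3"), ("AP2", "AP3"), ("AP2", "AP4"),
    ("AP2", "AP6"), ("AP3", "AP4"), ("AP5", "AP6"),
]


def build_adjacency(edges):
    """Build symmetric adjacency sets from an undirected edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_connected(adj):
    """Return True if every node is reachable from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(adj)


ADJ = build_adjacency(EDGES)
```

Removing the AP2-AP6 edge would split the graph into {AP1, AP2, AP3, AP4} and {AP5, AP6}, which shows that this single edge is what links the two physical machines' services in the example.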

次に、原因判定モデルの生成について説明する。 Next, generation of a cause determination model will be explained.
図７は、原因判定モデルの生成例を示す図である。 FIG. 7 is a diagram showing an example of generation of a cause determination model.
管理サーバ１００は、原因判定モデル１５１を生成する。原因判定モデル１５１は、ロジスティック回帰分析によって生成される回帰モデルである。原因判定モデル１５１は、１つのアプリ障害と１つのインフラ障害との間の関係を示す特徴ベクトルを、説明変数として使用する。また、原因判定モデル１５１は、当該１つのアプリ障害の原因が当該１つのインフラ障害である確率を示す確信度を、目的変数として使用する。 The management server 100 generates a cause determination model 151. The cause determination model 151 is a regression model generated by logistic regression analysis. The cause determination model 151 uses a feature vector indicating the relationship between one application failure and one infrastructure failure as an explanatory variable. Further, the cause determination model 151 uses a confidence level indicating the probability that the cause of the one application failure is the one infrastructure failure as an objective variable.

よって、原因判定モデル１５１は、監視対象システム２００で同時期に発生したアプリ障害とインフラ障害について、アプリ障害の根本原因がインフラ障害であるか否か判定するものである。ある処理ノードにおける通信プロセスの異常終了が、あるサービスノードのメッセージ通信の失敗を引き起こすなど、インフラ障害がサービスノードに伝播することがある。その場合、システム管理者は、アプリ障害への対応として、その原因となっているインフラ障害を解消すればよい。このため、原因判定モデル１５１を使用することで、システム管理者に対して有用な情報を提供することができる。 Therefore, the cause determination model 151 determines whether or not the root cause of an application failure is an infrastructure failure regarding an application failure and an infrastructure failure that occur at the same time in the monitored system 200. An infrastructure failure may propagate to a service node, such as when an abnormal termination of a communication process in a certain processing node causes a message communication failure in a certain service node. In that case, the system administrator can respond to the application failure by eliminating the infrastructure failure that is causing it. Therefore, by using the cause determination model 151, useful information can be provided to the system administrator.

原因判定モデル１５１を生成するにあたり、管理サーバ１００は、過去に発生したアプリ障害とインフラ障害のペア毎に、特徴ベクトル１４１と教師ラベル１４２とを対応付けたレコードを生成して訓練データに追加する。特徴ベクトル１４１は、評価値ｖ_１，ｖ_２，ｖ_３，ｖ_４を含む４次元ベクトルである。評価値ｖ_１は、親子距離１４３を表す。評価値ｖ_２は、サービス距離１４４を表す。評価値ｖ_３は、テキスト類似度１４５を表す。評価値ｖ_４は、時刻差１４６を表す。教師ラベル１４２は、着目するアプリ障害の原因が、着目するインフラ障害であるか否かを示すフラグである。障害原因であるか否かは、過去にシステム管理者が行った障害対応作業の結果として把握される。インフラ障害がアプリ障害の原因でない場合、教師ラベル１４２が「０」となる。インフラ障害がアプリ障害の原因である場合、教師ラベル１４２が「１」となる。 In generating the cause determination model 151, the management server 100 generates, for each pair of an application failure and an infrastructure failure that occurred in the past, a record in which a feature vector 141 is associated with a teacher label 142, and adds the record to the training data. The feature vector 141 is a four-dimensional vector containing evaluation values v_1, v_2, v_3, and v_4. The evaluation value v_1 represents the parent-child distance 143. The evaluation value v_2 represents the service distance 144. The evaluation value v_3 represents the text similarity 145. The evaluation value v_4 represents the time difference 146. The teacher label 142 is a flag indicating whether or not the cause of the application failure of interest is the infrastructure failure of interest. Whether or not it is the cause is known from the results of failure handling work performed by the system administrator in the past. If the infrastructure failure is not the cause of the application failure, the teacher label 142 is "0". If the infrastructure failure is the cause of the application failure, the teacher label 142 is "1".
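To illustrate how records of this form could feed a logistic regression, the following is a minimal sketch in plain Python. It assumes the standard logistic form (a sigmoid applied to a weighted sum of v_1 to v_4 plus a bias) fitted by simple stochastic gradient descent; the fitting procedure, the hyperparameters, and every number in `RECORDS` are hypothetical and are not taken from the source, which does not state coefficients, units for the time difference, or an optimization method.

```python
import math

# Hypothetical training records: ((v1, v2, v3, v4), teacher_label).
# v1 parent-child distance, v2 service distance, v3 text similarity,
# v4 time difference (unit assumed). All values are made up for illustration.
RECORDS = [
    ((1.0, 0.0, 0.8, 0.5), 1),
    ((2.0, 0.0, 0.6, 1.0), 1),
    ((0.0, 2.0, 0.1, 9.0), 0),
    ((0.0, 3.0, 0.0, 7.0), 0),
]


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def confidence(weights, bias, features):
    """Confidence that the infrastructure failure caused the application failure."""
    z = bias + sum(w * v for w, v in zip(weights, features))
    return sigmoid(z)


def fit(records, epochs=1000, lr=0.05):
    """Fit weights by per-record gradient descent on the log loss."""
    weights = [0.0] * len(records[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in records:
            error = confidence(weights, bias, features) - label
            weights = [w - lr * error * v for w, v in zip(weights, features)]
            bias -= lr * error
    return weights, bias


WEIGHTS, BIAS = fit(RECORDS)
```

With such a model, `confidence(WEIGHTS, BIAS, features)` yields a value between 0 and 1 for a new failure pair. A production system would more likely rely on an established statistics or machine learning library than on a hand-written loop.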

親子距離１４３は、アプリ障害とインフラ障害との間の処理ノードの階層関係を示す。具体的には、親子距離１４３は、アプリ障害が発生したサービスノードを実行するコンテナ、すなわち、アプリ障害を検出したコンテナと、インフラ障害を検出した物理マシンまたは仮想マシンとの間の階層関係を示す。上記のコンテナが上記の物理マシンまたは仮想マシンの上方に配置されている場合、すなわち、親子関係が存在する場合、当該コンテナと当該物理マシンまたは仮想マシンとの間の階層差が、親子距離１４３となる。一方、上記のコンテナが上記の物理マシンまたは仮想マシンの上方に配置されていない場合、すなわち、親子関係が存在しない場合、親子距離１４３は「０」とみなされる。 The parent-child distance 143 indicates the hierarchical relationship of processing nodes between an application failure and an infrastructure failure. Specifically, the parent-child distance 143 indicates the hierarchical relationship between the container that executes the service node where the application failure occurred, that is, the container that detected the application failure, and the physical machine or virtual machine that detected the infrastructure failure. When the container is placed above the physical machine or virtual machine, that is, when a parent-child relationship exists, the difference in hierarchy level between the container and the physical machine or virtual machine is the parent-child distance 143. On the other hand, when the container is not placed above the physical machine or virtual machine, that is, when no parent-child relationship exists, the parent-child distance 143 is regarded as "0".

例えば、図５において、サービスノード２３３でアプリ障害が発生し、仮想マシン２１１でインフラ障害が発生した場合、親子距離１４３は「１」となる。サービスノード２３３でアプリ障害が発生し、物理マシン２０１でインフラ障害が発生した場合、親子距離１４３は「２」となる。サービスノード２３３でアプリ障害が発生し、物理マシン２０２でインフラ障害が発生した場合、親子距離１４３は「０」となる。親子距離１４３が小さいほど（ただし「０」を除く）、インフラ障害がアプリ障害の原因である可能性が高い。なお、親子距離１４３は、例えば、スイッチ２０３をルートとする木構造において各処理ノードの深さを算出し、２つの処理ノードの深さの差を求めることで算出できる。 For example, in FIG. 5, if an application failure occurs in the service node 233 and an infrastructure failure occurs in the virtual machine 211, the parent-child distance 143 is "1". When an application failure occurs in the service node 233 and an infrastructure failure occurs in the physical machine 201, the parent-child distance 143 becomes "2". When an application failure occurs in the service node 233 and an infrastructure failure occurs in the physical machine 202, the parent-child distance 143 becomes "0". The smaller the parent-child distance 143 (excluding "0"), the higher the possibility that an infrastructure failure is the cause of the application failure. Note that the parent-child distance 143 can be calculated, for example, by calculating the depth of each processing node in a tree structure with the switch 203 as the root, and finding the difference in depth between two processing nodes.
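The depth-difference calculation described above can be sketched as follows. This is a minimal sketch, assuming the FIG. 5 processing nodes are held in a parent map; the label SW for the switch 203 and the function names `depth`, `runs_on`, and `parent_child_distance` are hypothetical.

```python
# Processing nodes of FIG. 5: each node maps to the node directly below it.
# "SW" stands for the switch 203 at the root (label assumed for illustration).
PARENT = {
    "M1": "SW", "M2": "SW",
    "VM1": "M1", "C1": "M1", "C2": "M1",
    "C3": "VM1", "C4": "VM1",
    "VM2": "M2", "C6": "M2",
    "C5": "VM2",
}


def depth(node):
    """Number of edges from the root switch down to `node`."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d


def runs_on(container, machine):
    """Return True if `machine` lies on the path from `container` to the root."""
    node = container
    while node in PARENT:
        node = PARENT[node]
        if node == machine:
            return True
    return False


def parent_child_distance(container, machine):
    """Depth difference when a parent-child relationship exists, otherwise 0."""
    if not runs_on(container, machine):
        return 0
    return depth(container) - depth(machine)
```

Service node 233 (AP3) runs in container C3, so the three cases in the text correspond to `parent_child_distance("C3", "VM1")`, `parent_child_distance("C3", "M1")`, and `parent_child_distance("C3", "M2")`.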

サービス距離１４４は、アプリ障害とインフラ障害に関係するサービスノード間の通信関係を示す。具体的には、サービス距離１４４は、アプリ障害が発生したサービスノードと、インフラ障害を検出した物理マシンまたは仮想マシンを基盤として実行されている他のサービスノードとの間の通信関係を示す。アプリ障害が発生したサービスノードと上記の他のサービスノードとの間に、サービスメッシュグラフ上のパスが存在する場合、当該パスの長さ（距離）がサービス距離１４４となる。上記の他のサービスノードが２以上ある場合、２以上の他のサービスノードそれぞれに対して算出される距離のうちの最小値がサービス距離１４４となる。一方、何れの他のサービスノードとの間にもパスが存在しない場合、サービス距離１４４は「０」とみなされる。 The service distance 144 indicates the communication relationship between service nodes related to application failures and infrastructure failures. Specifically, the service distance 144 indicates the communication relationship between the service node where the application failure has occurred and another service node running on the basis of the physical machine or virtual machine that detected the infrastructure failure. If a path exists on the service mesh graph between the service node where the application failure has occurred and the other service node mentioned above, the length (distance) of the path is the service distance 144. If there are two or more of the above other service nodes, the minimum value of the distances calculated for each of the two or more other service nodes is the service distance 144. On the other hand, if there is no path to any other service node, the service distance 144 is considered to be "0".

例えば、図５において、サービスノード２３３でアプリ障害が発生し、仮想マシン２１１でインフラ障害が発生したとする。この場合、インフラ障害がある仮想マシン２１１の上方で実行されているサービスノードは、サービスノード２３３，２３４である。すると、図６のサービスメッシュグラフにおいて、｛ＡＰ３｝と｛ＡＰ３，ＡＰ４｝の間の最小距離は「０」である。このため、サービス距離１４４は「０」となる。 For example, in FIG. 5, assume that an application failure occurs in the service node 233 and an infrastructure failure occurs in the virtual machine 211. In this case, service nodes 233 and 234 are running above the virtual machine 211 that has an infrastructure failure. Then, in the service mesh graph of FIG. 6, the minimum distance between {AP3} and {AP3, AP4} is "0". Therefore, the service distance 144 is "0".

また、例えば、図５において、サービスノード２３３でアプリ障害が発生し、物理マシン２０２でインフラ障害が発生したとする。この場合、インフラ障害がある物理マシン２０２の上方で実行されているサービスノードは、サービスノード２３５，２３６である。すると、図６のサービスメッシュグラフにおいて、｛ＡＰ３｝と｛ＡＰ５，ＡＰ６｝の間の最小距離は「２」である。このため、サービス距離１４４は「２」となる。サービス距離１４４が小さいほど（ただし「０」を除く）、インフラ障害がアプリ障害の原因である可能性が高い。なお、サービス距離１４４は、例えば、サービスメッシュグラフに対して、ダイクストラ法などの最短経路探索アルゴリズムを実行することで算出できる。 Further, for example, in FIG. 5, assume that an application failure occurs in the service node 233 and an infrastructure failure occurs in the physical machine 202. In this case, service nodes 235 and 236 are running above the physical machine 202 that has an infrastructure failure. Then, in the service mesh graph of FIG. 6, the minimum distance between {AP3} and {AP5, AP6} is "2". Therefore, the service distance 144 is "2". The smaller the service distance 144 (excluding "0"), the higher the possibility that an infrastructure failure is the cause of the application failure. Note that the service distance 144 can be calculated, for example, by executing a shortest path search algorithm such as Dijkstra's algorithm on the service mesh graph.
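The two worked examples can be reproduced with a shortest-path search over the FIG. 6 graph. The sketch below uses breadth-first search instead of Dijkstra's algorithm, on the assumption that every edge has unit length, in which case the two give the same distances. The dictionaries `ADJ` and `SERVICES_ON` are transcribed from the text on FIG. 5 and FIG. 6; their layout and the function name `service_distance` are hypothetical.

```python
from collections import deque

# FIG. 6 service mesh graph as adjacency sets.
ADJ = {
    "AP1": {"AP3"},
    "AP2": {"AP3", "AP4", "AP6"},
    "AP3": {"AP1", "AP2", "AP4"},
    "AP4": {"AP2", "AP3"},
    "AP5": {"AP6"},
    "AP6": {"AP2", "AP5"},
}

# Service nodes executed above each physical or virtual machine in FIG. 5.
SERVICES_ON = {
    "M1": {"AP1", "AP2", "AP3", "AP4"},
    "VM1": {"AP3", "AP4"},
    "M2": {"AP5", "AP6"},
    "VM2": {"AP5"},
}


def service_distance(failed_service, failed_machine):
    """Minimum hop count from the failed service node to any service node
    running above the failed machine; 0 if no such node is reachable."""
    targets = SERVICES_ON.get(failed_machine, set())
    dist = {failed_service: 0}
    queue = deque([failed_service])
    while queue:
        node = queue.popleft()  # nodes are dequeued in order of distance
        if node in targets:
            return dist[node]
        for nxt in ADJ.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return 0
```

Note that, as in the text, the value 0 covers two different situations: the failed service node itself runs above the failed machine, or no path exists at all.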

テキスト類似度１４５は、アプリ障害とインフラ障害との間のエラーメッセージの類似度を示す。アプリ障害のエラーメッセージとインフラ障害のエラーメッセージとの間に共通単語が多いほど、テキスト類似度１４５が大きくなる。テキスト類似度１４５が大きいほど、インフラ障害がアプリ障害の原因である可能性が高い。 The text similarity 145 indicates the similarity of error messages between an application failure and an infrastructure failure. The more common words there are between the error message of the application failure and the error message of the infrastructure failure, the greater the text similarity 145 becomes. The larger the text similarity 145, the higher the possibility that an infrastructure failure is the cause of the application failure.

テキスト類似度１４５として、第２の実施の形態では、Bag of Wordsのコサイン類似度を使用する。Bag of Wordsの生成では、管理サーバ１００は、エラーメッセージ毎に文字列を単語に分割し、同一単語の出現数をカウントする。管理サーバ１００は、エラーメッセージ毎に、各単語の出現数を列挙したベクトルをBag of Wordsとして生成する。管理サーバ１００は、２つのエラーメッセージに対応する２つのベクトルの間でコサイン類似度を算出する。コサイン類似度は、０以上１以下の実数である。コサイン類似度が１に近いほど２つのエラーメッセージの類似度が高いことを意味し、コサイン類似度が０に近いほど２つのエラーメッセージの類似度が低いことを意味する。 In the second embodiment, the cosine similarity of Bag of Words vectors is used as the text similarity 145. To generate a Bag of Words, the management server 100 splits the character string of each error message into words and counts the number of occurrences of each word. For each error message, the management server 100 generates a vector listing the number of occurrences of each word as its Bag of Words. The management server 100 then calculates the cosine similarity between the two vectors corresponding to the two error messages. The cosine similarity is a real number from 0 to 1. The closer the cosine similarity is to 1, the more similar the two error messages are, and the closer it is to 0, the less similar they are.
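
The Bag of Words cosine similarity described above can be sketched as follows. The tokenizer is an assumption: the text only says that the message is split into words, so this sketch simply lower-cases the message and splits on whitespace. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(message):
    """Count occurrences of each whitespace-separated, lower-cased word."""
    return Counter(message.lower().split())


def cosine_similarity(message_a, message_b):
    """Cosine similarity of two word-count vectors; 0.0 if either message is empty."""
    a, b = bag_of_words(message_a), bag_of_words(message_b)
    # Words missing from b contribute 0 because Counter returns 0 for unknown keys.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because word counts are never negative, the result always lies between 0 and 1, which matches the range stated above.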

ただし、コサイン類似度は、テキスト類似度１４５の一例である。テキスト類似度１４５の指標として、分散表現を使用するものや、編集距離（レーベンシュタイン距離）を使用するものなど、他の指標も考えられる。また、エラーメッセージからホスト名やアドレスなどの識別子を除去するなど、前処理を行うようにしてもよい。 However, the cosine similarity is an example of the text similarity 145. As an index of text similarity 145, other indexes such as one using distributed representation and one using edit distance (Levenshtein distance) are also considered. Preprocessing may also be performed, such as removing identifiers such as host names and addresses from error messages.

時刻差１４６は、アプリ障害の発生とインフラ障害の発生の間の遅延を示す。アプリ障害の発生時刻がインフラ障害の発生時刻以後である場合、時刻差１４６は、アプリ障害の発生時刻からインフラ障害の発生時刻を引いた差となる。第２の実施の形態では、時刻差１４６の単位として時間（hour）を使用する。一方、アプリ障害の発生時刻がインフラ障害の発生時刻より前である場合、時刻差１４６は「０」とみなされる。例えば、アプリ障害の発生時刻が２０２０年４月１日１６時０分であり、インフラ障害の発生時刻が２０２０年４月１日１５時０分である場合、時刻差１４６は「１」となる。時刻差１４６が小さいほど（ただし「０」を除く）、インフラ障害がアプリ障害の原因である可能性が高い。 The time difference 146 indicates the delay between the occurrence of the infrastructure failure and the occurrence of the application failure. If the application failure occurs at or after the time of the infrastructure failure, the time difference 146 is the application failure time minus the infrastructure failure time. In the second embodiment, the unit of the time difference 146 is hours. If, on the other hand, the application failure occurs before the infrastructure failure, the time difference 146 is treated as "0". For example, if the application failure occurs at 16:00 on April 1, 2020 and the infrastructure failure occurs at 15:00 on April 1, 2020, the time difference 146 is "1". The smaller the time difference 146 (excluding "0"), the more likely it is that the infrastructure failure caused the application failure.
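
The time difference rule can be sketched as follows, using the example times given above. The function name is hypothetical; the only behavior assumed beyond the text is that fractional hours are kept as a floating-point value.

```python
from datetime import datetime


def time_difference_hours(app_failure_time, infra_failure_time):
    """Hours from the infrastructure failure to the application failure.

    Returns 0.0 when the application failure did not occur after the
    infrastructure failure.
    """
    delta = app_failure_time - infra_failure_time
    hours = delta.total_seconds() / 3600.0
    return hours if hours > 0 else 0.0


app_time = datetime(2020, 4, 1, 16, 0)
infra_time = datetime(2020, 4, 1, 15, 0)
v4 = time_difference_hours(app_time, infra_time)  # one hour apart
```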

特徴ベクトル１４１が、親子距離１４３を示す評価値ｖ_１を含むことで、下位の階層の処理ノードの障害が上位の階層の処理ノードに影響を与えるという垂直方向の障害伝播を評価することができる。また、特徴ベクトル１４１が、サービス距離１４４を示す評価値ｖ_２を含むことで、あるサービスノードの障害が別のサービスノードに影響を与えるという水平方向の障害伝播を評価することができる。また、特徴ベクトル１４１が、テキスト類似度１４５を示す評価値ｖ_３を含むことで、障害内容の類似性を評価することができる。また、特徴ベクトル１４１が、時刻差１４６を示す評価値ｖ_４を含むことで、アプリ障害とインフラ障害の発生の同時性を評価することができる。 Because the feature vector 141 includes the evaluation value v₁ indicating the parent-child distance 143, it is possible to evaluate vertical failure propagation, in which a failure of a processing node in a lower layer affects a processing node in an upper layer. Because the feature vector 141 includes the evaluation value v₂ indicating the service distance 144, it is possible to evaluate horizontal failure propagation, in which a failure of one service node affects another service node. Because the feature vector 141 includes the evaluation value v₃ indicating the text similarity 145, it is possible to evaluate the similarity of the failure contents. Because the feature vector 141 includes the evaluation value v₄ indicating the time difference 146, it is possible to evaluate how close in time the application failure and the infrastructure failure occurred.

一方で、特徴ベクトル１４１は、ホスト名やアドレスなどの識別子に関する評価値を含まない。これは、処理ノードやサービスノードの識別子を使用すると、処理ノードの構成が変化した場合に原因判定モデル１５１の判定精度が低下し、原因判定モデル１５１を再生成せざるを得なくなる可能性があるためである。 On the other hand, the feature vector 141 does not include evaluation values related to identifiers such as host names and addresses. This is because, if identifiers of processing nodes or service nodes were used, the determination accuracy of the cause determination model 151 could drop whenever the configuration of the processing nodes changes, and the cause determination model 151 might have to be regenerated.

監視対象システム２００では、仮想マシンの追加、移動、削除により、物理マシンと仮想マシンとの対応関係が変化し得る。また、監視対象システム２００では、コンテナの追加、移動、削除により、物理マシンまたは仮想マシンとコンテナとの対応関係が変化し得る。この点、原因判定モデル１５１の入力に識別子を使用すると、構成変更によって原因判定モデル１５１の有用性が低下するおそれがある。これに対して、原因判定モデル１５１の入力に識別子を使用しないことで、仮想マシンやコンテナの配置変更があっても、原因判定モデル１５１を引き続き使用することができる。 In the monitored system 200, the correspondence between physical machines and virtual machines may change due to addition, movement, or deletion of virtual machines. Furthermore, in the monitored system 200, the correspondence between a physical machine or a virtual machine and a container may change due to addition, movement, or deletion of a container. In this regard, if an identifier is used as an input to the cause determination model 151, there is a possibility that the usefulness of the cause determination model 151 may be reduced due to configuration changes. On the other hand, by not using an identifier as an input to the cause determination model 151, the cause determination model 151 can continue to be used even if the arrangement of virtual machines or containers is changed.

管理サーバ１００は、上記のような特徴ベクトル１４１および教師ラベル１４２を含む訓練データを用いて、原因判定モデル１５１を生成する。原因判定モデル１５１は、例えば、数式（１）に示すようなロジスティック関数として表現される。数式（１）のロジスティック関数は、評価値ｖ_１’，ｖ_２’，ｖ_３，ｖ_４’から確信度Ｐを算出する。ここで算出される確信度Ｐは、０より大きく１より小さい実数である。確信度Ｐが１に近いほど、インフラ障害がアプリ障害の原因である可能性が高く、確信度Ｐが０に近いほど、インフラ障害がアプリ障害の原因である可能性が低い。 The management server 100 generates the cause determination model 151 using training data including the feature vector 141 and teacher label 142 as described above. The cause determination model 151 is expressed, for example, as a logistic function as shown in equation (1). The logistic function of Equation (1) calculates the confidence level P from the evaluation values v ₁ ′, v ₂ ′, v ₃ , and v ₄ ′. The confidence level P calculated here is a real number greater than 0 and less than 1. The closer the confidence level P is to 1, the higher the probability that an infrastructure failure is the cause of the application failure, and the closer the confidence level P is to 0, the lower the possibility that the infrastructure failure is the cause of the application failure.

評価値ｖ_１’，ｖ_２’，ｖ_４’は、後述するように、評価値ｖ_１，ｖ_２，ｖ_４から変換されるものである。また、このロジスティック関数は、パラメータα，β_１，β_２，β_３，β_４を含む。パラメータα，β_１，β_２，β_３，β_４の値は、訓練データを用いて機械学習を通じて決定される。パラメータαは定数である。パラメータβ_１は評価値ｖ_１’の重み係数である。パラメータβ_２は評価値ｖ_２’の重み係数である。パラメータβ_３は評価値ｖ_３の重み係数である。パラメータβ_４は評価値ｖ_４’の重み係数である。 The evaluation values v ₁ ′, v ₂ ′, and v ₄ ′ are converted from the evaluation values v ₁ , v ₂ , and v ₄ as described later. Moreover, this logistic function includes parameters α, β ₁ , β ₂ , β ₃ , and β ₄ . The values of parameters α, β ₁ , β ₂ , β ₃ , and β ₄ are determined through machine learning using training data. Parameter α is a constant. The parameter β ₁ is a weighting coefficient of the evaluation value v ₁ ′. The parameter β ₂ is a weighting coefficient of the evaluation value v ₂ ′. Parameter β ₃ is a weighting coefficient of evaluation value v ₃ . Parameter β ₄ is a weighting coefficient of evaluation value v ₄ ′.

評価値ｖ_１’は、評価値ｖ_１から数式（２）のように変換される。ｖ_１＝０の場合はｖ_１’＝０となり、それ以外の場合はｖ_１が大きいほどｖ_１’が小さくなる。評価値ｖ_１’は、０以上１以下の実数である。評価値ｖ_２’は、評価値ｖ_２から数式（３）のように変換される。ｖ_２＝０の場合はｖ_２’＝０となり、それ以外の場合はｖ_２が大きいほどｖ_２’が小さくなる。評価値ｖ_２’は、０以上１以下の実数である。評価値ｖ_４’は、評価値ｖ_４から数式（４）のように変換される。ｖ_４＝０の場合はｖ_４’＝０となり、それ以外の場合はｖ_４が大きいほどｖ_４’が小さくなる。評価値ｖ_４’は、０以上の実数である。 The evaluation value v ₁ ' is converted from the evaluation value v ₁ as shown in Equation (2). When v ₁ =0, v ₁ '=0; otherwise, the larger v ₁ is, the smaller v ₁ ' is. The evaluation value v ₁ ′ is a real number from 0 to 1. The evaluation value v ₂ ′ is converted from the evaluation value v ₂ as shown in Equation (3). When v ₂ =0, v ₂ ′=0, and in other cases, the larger v ₂ is, the smaller v ₂ ′ is. The evaluation value v ₂ ′ is a real number from 0 to 1. The evaluation value v ₄ ' is converted from the evaluation value v ₄ as shown in equation (4). When v ₄ =0, v ₄ ′=0, and in other cases, the larger v ₄ is, the smaller v ₄ ′ is. The evaluation value v ₄ ′ is a real number greater than or equal to 0.
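
Equations (1) to (4) are not reproduced in this text. A form consistent with the description above is sketched below. The logistic form follows from the stated constant α and weights β₁ to β₄. The reciprocal transform is an assumption: it maps 0 to 0, decreases as the input grows, stays within (0, 1] for the integer distances v₁ and v₂, and has no upper bound for a fractional time difference v₄, which agrees with the ranges stated above. The published formulas may differ.

```latex
P = \frac{1}{1 + \exp\bigl(-(\alpha + \beta_1 v_1' + \beta_2 v_2' + \beta_3 v_3 + \beta_4 v_4')\bigr)}
\qquad
v_i' =
\begin{cases}
0 & (v_i = 0) \\
1 / v_i & (v_i \neq 0)
\end{cases}
\quad (i = 1, 2, 4)
```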

管理サーバ１００は、このようにして生成した原因判定モデル１５１を使用して、同時期に発生したアプリ障害とインフラ障害との間の関連性を評価する。あるアプリ障害が発生したときに、その直近に２以上のインフラ障害が発生していることがある。その場合、管理サーバ１００は、当該アプリ障害と当該２以上のインフラ障害それぞれとの間で、原因判定モデル１５１を用いて確信度を算出する。管理サーバ１００は、２以上のインフラ障害を確信度の高い順にソートしてシステム管理者に提示する。 The management server 100 uses the cause determination model 151 generated in this way to evaluate the relationship between an application failure and infrastructure failures that occurred around the same time. When an application failure occurs, two or more infrastructure failures may have occurred close to it in time. In that case, the management server 100 uses the cause determination model 151 to calculate a confidence level between the application failure and each of the two or more infrastructure failures. The management server 100 sorts the two or more infrastructure failures in descending order of confidence and presents them to the system administrator.
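
The ranking step can be sketched as follows. Several things here are assumptions made only for illustration: the reciprocal transform applied to v₁, v₂ and v₄ (the text only states that 0 maps to 0 and that larger values map to smaller ones), the parameter values `ALPHA` and `BETAS`, the feature values, and all function names.

```python
import math


def reciprocal_or_zero(value):
    """Assumed transform: 0 stays 0, otherwise larger inputs give smaller outputs."""
    return 0.0 if value == 0 else 1.0 / value


def confidence(v1, v2, v3, v4, alpha, betas):
    """Logistic confidence that an infrastructure failure caused an application failure."""
    features = [reciprocal_or_zero(v1), reciprocal_or_zero(v2), v3, reciprocal_or_zero(v4)]
    z = alpha + sum(b * f for b, f in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))


def rank_infrastructure_failures(candidates, alpha, betas):
    """Return (failure_id, confidence) pairs sorted by descending confidence."""
    scored = [(fid, confidence(*values, alpha, betas)) for fid, values in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical parameters; in the text they are learned from training data.
ALPHA = -3.0
BETAS = [2.0, 1.0, 2.0, 1.0]
# Hypothetical (v1, v2, v3, v4) values for one application failure.
candidates = {
    "VM1": (1, 0, 0.6, 0.5),
    "M2": (0, 2, 0.3, 1.0),
}
ranking = rank_infrastructure_failures(candidates, ALPHA, BETAS)
```

With these made-up numbers, the failure on the machine directly beneath the failed application ranks first, which is the ordering the administrator would see on the management screen.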

図８は、システム管理画面の例を示す図である。
システム管理画面１５２は、管理サーバ１００からクライアント端末３１に送信され、クライアント端末３１のディスプレイに表示される。システム管理画面１５２は、あるアプリケーションに障害が発生したことを示すメッセージを含む。また、システム管理画面１５２は、アプリ障害と同時期に、物理マシンや仮想マシンなどの処理ノードに障害が発生していることを示すメッセージを含む。また、システム管理画面１５２は、インフラ障害毎に、アプリ障害の原因である可能性を示す確信度の数値を含む。インフラ障害のメッセージは、確信度の降順にソートされている。 FIG. 8 is a diagram showing an example of a system management screen.
The system management screen 152 is transmitted from the management server 100 to the client terminal 31 and displayed on the display of the client terminal 31. System management screen 152 includes a message indicating that a failure has occurred in a certain application. The system management screen 152 also includes a message indicating that a failure has occurred in a processing node such as a physical machine or a virtual machine at the same time as the application failure. Furthermore, the system management screen 152 includes, for each infrastructure failure, a confidence value indicating the possibility that the failure is the cause of the application failure. Infrastructure failure messages are sorted in descending order of confidence.

例えば、システム管理画面１５２は、サービスノード２３３（ＡＰ３）のアプリ障害を報告する。また、システム管理画面１５２は、仮想マシン２１１（ＶＭ１）のインフラ障害と、そのインフラ障害がアプリ障害の原因である可能性が９０％である旨を報告する。また、システム管理画面１５２は、物理マシン２０２（Ｍ２）のインフラ障害と、そのインフラ障害がアプリ障害の原因である可能性が４０％である旨を報告する。これにより、システム管理者は、アプリ障害の原因分析と障害解消作業を効率的に行うことができる。 For example, the system management screen 152 reports an application failure of the service node 233 (AP3). The system management screen 152 also reports that there is an infrastructure failure in the virtual machine 211 (VM1) and that there is a 90% possibility that the infrastructure failure is the cause of the application failure. The system management screen 152 also reports that there is an infrastructure failure in the physical machine 202 (M2) and that there is a 40% probability that the infrastructure failure is the cause of the application failure. This allows system administrators to efficiently analyze the causes of application failures and resolve them.

次に、管理サーバ１００の機能について説明する。
図９は、管理サーバの機能例を示すブロック図である。
管理サーバ１００は、障害情報記憶部１２１、構成情報記憶部１２２、サービス情報記憶部１２３およびモデル記憶部１２４を有する。これらの記憶部は、例えば、ＲＡＭ１０２またはＨＤＤ１０３の記憶領域を用いて実現される。また、管理サーバ１００は、障害監視部１２５、構成監視部１２６、サービス監視部１２７、学習部１２８および原因判定部１２９を有する。これらの処理部は、例えば、プログラムを用いて実現される。 Next, the functions of the management server 100 will be explained.
FIG. 9 is a block diagram showing an example of the functions of the management server.
The management server 100 includes a failure information storage section 121, a configuration information storage section 122, a service information storage section 123, and a model storage section 124. These storage units are realized using, for example, a storage area of the RAM 102 or the HDD 103. The management server 100 also includes a fault monitoring section 125, a configuration monitoring section 126, a service monitoring section 127, a learning section 128, and a cause determining section 129. These processing units are realized using, for example, a program.

障害情報記憶部１２１は、監視対象システム２００で発生した障害を示す障害情報を記憶する。また、障害情報記憶部１２１は、障害に対するシステム管理者の障害対応を示す障害対応情報を記憶する。構成情報記憶部１２２は、物理マシン、仮想マシンおよびコンテナの配置を示す構成情報を記憶する。サービス情報記憶部１２３は、複数のサービスノードの間の通信関係を示すサービスメッシュグラフの情報を記憶する。また、サービス情報記憶部１２３は、コンテナへのサービスノードの配置を示す情報を記憶する。モデル記憶部１２４は、管理サーバ１００が生成した原因判定モデル１５１を記憶する。 The fault information storage unit 121 stores fault information indicating a fault that has occurred in the monitored system 200. Further, the failure information storage unit 121 stores failure response information indicating the system administrator's response to the failure. The configuration information storage unit 122 stores configuration information indicating the arrangement of physical machines, virtual machines, and containers. The service information storage unit 123 stores information on a service mesh graph indicating communication relationships between a plurality of service nodes. The service information storage unit 123 also stores information indicating the arrangement of service nodes in containers. The model storage unit 124 stores the cause determination model 151 generated by the management server 100.

障害監視部１２５は、監視対象システム２００の障害を監視する。障害監視部１２５は、監視対象システム２００から障害に関する情報を収集し、障害情報を障害情報記憶部１２１に保存する。障害に関する情報は、物理マシンのホストＯＳ、仮想マシンのゲストＯＳ、コンテナのコンテナライブラリなどから収集することができる。監視方法として、物理マシン、仮想マシンおよびコンテナなどの各処理ノードが、障害を検出したときに、管理サーバ１００に対して障害を通知するようにしてもよい。また、障害監視部１２５が定期的に各処理ノードからエラーログを収集するようにしてもよい。また、障害監視部１２５が定期的に各処理ノードから各種ログを収集し、ログを分析して障害の有無を判定し、障害を検出したときにログから障害情報を抽出するようにしてもよい。 The fault monitoring unit 125 monitors the monitored system 200 for faults. The fault monitoring unit 125 collects information regarding faults from the monitored system 200 and stores the fault information in the fault information storage unit 121. Information regarding the failure can be collected from the host OS of the physical machine, the guest OS of the virtual machine, the container library of the container, and the like. As a monitoring method, each processing node such as a physical machine, a virtual machine, and a container may notify the management server 100 of the failure when it detects a failure. Further, the failure monitoring unit 125 may periodically collect error logs from each processing node. Alternatively, the failure monitoring unit 125 may periodically collect various logs from each processing node, analyze the logs to determine the presence or absence of a failure, and extract failure information from the logs when a failure is detected. .

構成監視部１２６は、監視対象システム２００の仮想環境の構成を監視する。構成監視部１２６は、物理マシン、仮想マシンおよびコンテナの配置が変更されたことを検出すると、変更後の配置を示す構成情報を構成情報記憶部１２２に保存する。物理マシンと仮想マシンとの間の関係は、物理マシンで実行されているハイパーバイザから収集することができる。物理マシンまたは仮想マシンとコンテナとの間の関係は、物理マシンまたは仮想マシンで実行されているコンテナエンジンから収集することができる。 The configuration monitoring unit 126 monitors the configuration of the virtual environment of the monitored system 200. When the configuration monitoring unit 126 detects that the arrangement of physical machines, virtual machines, and containers has changed, it stores configuration information indicating the new arrangement in the configuration information storage unit 122. The relationship between physical machines and virtual machines can be collected from the hypervisor running on each physical machine. The relationship between a physical or virtual machine and its containers can be collected from the container engine running on that physical or virtual machine.

サービス監視部１２７は、監視対象システム２００に配置された複数のサービスノード（アプリケーション）を監視する。サービス監視部１２７は、複数のサービスノードの間の通信関係が変化したことを検出すると、変更後の通信関係を示す情報をサービス情報記憶部１２３に保存する。サービスノード間の通信関係は、コンテナで実行されているサイドカープロキシから収集することができる。また、サービス監視部１２７は、コンテナへのサービスノードの配置が変更されたことを検出すると、変更後の配置を示す情報をサービス情報記憶部１２３に保存する。サービスノードの配置は、コンテナで実行されているコンテナライブラリまたはサイドカープロキシから収集することができる。 The service monitoring unit 127 monitors a plurality of service nodes (applications) placed in the monitored system 200. When the service monitoring unit 127 detects that the communication relationship between the plurality of service nodes has changed, it stores information indicating the changed communication relationship in the service information storage unit 123. Communication relationships between service nodes can be gleaned from sidecar proxies running in containers. Further, when the service monitoring unit 127 detects that the arrangement of service nodes in the container has been changed, it stores information indicating the changed arrangement in the service information storage unit 123. Service node placement can be gleaned from container libraries or sidecar proxies running in containers.

学習部１２８は、障害情報記憶部１２１、構成情報記憶部１２２およびサービス情報記憶部１２３に記憶された情報から訓練データを生成し、ロジスティック回帰分析により原因判定モデル１５１を生成する。学習部１２８は、原因判定モデル１５１をモデル記憶部１２４に保存する。また、学習部１２８は、原因判定部１２９による障害原因の予測に対応する障害対応情報がシステム管理者から提供されると、予測と正解との間の誤差に基づいて、判定精度が上がるように原因判定モデル１５１を更新する。 The learning unit 128 generates training data from the information stored in the failure information storage unit 121, configuration information storage unit 122, and service information storage unit 123, and generates a cause determination model 151 through logistic regression analysis. The learning unit 128 stores the cause determination model 151 in the model storage unit 124. Furthermore, when the system administrator provides failure response information corresponding to the prediction of the cause of the failure by the cause determination unit 129, the learning unit 128 improves the determination accuracy based on the error between the prediction and the correct answer. The cause determination model 151 is updated.
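
The logistic regression step can be sketched as follows. The text does not state which fitting procedure is used, so plain batch gradient descent on the log loss is assumed here. The training rows, learning rate, and epoch count are hypothetical values chosen only so that the example runs.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit the constant alpha and the weights beta by batch gradient descent."""
    n_features = len(samples[0])
    alpha, betas = 0.0, [0.0] * n_features
    for _ in range(epochs):
        grad_alpha, grad_betas = 0.0, [0.0] * n_features
        for x, y in zip(samples, labels):
            # Error between the predicted confidence and the teacher label.
            error = sigmoid(alpha + sum(b * xi for b, xi in zip(betas, x))) - y
            grad_alpha += error
            for i, xi in enumerate(x):
                grad_betas[i] += error * xi
        alpha -= learning_rate * grad_alpha / len(samples)
        betas = [b - learning_rate * g / len(samples) for b, g in zip(betas, grad_betas)]
    return alpha, betas


# Hypothetical rows of (v1', v2', v3, v4'); label 1 means the infrastructure
# failure was the cause of the application failure.
X = [(1.0, 0.0, 0.8, 1.0), (0.5, 0.0, 0.6, 2.0), (0.0, 0.5, 0.1, 0.2), (0.0, 0.0, 0.0, 0.1)]
y = [1, 1, 0, 0]
alpha, betas = fit_logistic(X, y)
```

The same gradient step could be reused for the later update, by running further epochs once the administrator's correct labels have been added to the training rows.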

原因判定部１２９は、新たなアプリ障害が検出されると、障害情報記憶部１２１に記憶された障害情報に基づいて、システム管理画面１５２を生成してクライアント端末３１に送信する。このとき、原因判定部１２９は、障害情報記憶部１２１、構成情報記憶部１２２およびサービス情報記憶部１２３に記憶された情報から特徴ベクトルを生成し、モデル記憶部１２４に記憶された原因判定モデル１５１に入力する。これにより、原因判定部１２９は、アプリ障害と同時期に発生しているインフラ障害それぞれの確信度を算出し、算出した確信度をシステム管理画面１５２に含めて送信する。 When a new application failure is detected, the cause determination unit 129 generates the system management screen 152 based on the failure information stored in the failure information storage unit 121 and sends it to the client terminal 31. At this time, the cause determination unit 129 generates a feature vector from the information stored in the failure information storage unit 121, the configuration information storage unit 122, and the service information storage unit 123, and inputs it to the cause determination model 151 stored in the model storage unit 124. In this way, the cause determination unit 129 calculates a confidence level for each infrastructure failure occurring around the same time as the application failure and includes the calculated confidence levels in the system management screen 152 that it sends.

その後、原因判定部１２９は、クライアント端末３１から障害対応情報を受信して障害情報記憶部１２１に保存する。障害対応情報には、確信度を算出したアプリ障害とインフラ障害のペアに対して、障害原因であったか否かを示す正解の教師ラベルが含まれる。原因判定部１２９は、学習部１２８に原因判定モデル１５１を更新させる。 Thereafter, the cause determination unit 129 receives the failure handling information from the client terminal 31 and stores it in the failure information storage unit 121. The failure handling information includes a correct teacher label indicating whether or not the pair of application failure and infrastructure failure was the cause of the failure for which the confidence was calculated. The cause determination unit 129 causes the learning unit 128 to update the cause determination model 151.

図１０は、障害テーブルの例を示す図である。
障害テーブル１３１は、障害情報記憶部１２１に記憶される。障害テーブル１３１は、障害ＩＤ、時刻、検出ノード、障害種別、メッセージ、対応フラグおよび原因ＩＤの項目をそれぞれ含む複数のレコードを記憶する。障害ＩＤとして、障害を識別する識別子が登録される。時刻として、障害が発生した時刻または障害が認識された時刻が登録される。検出ノードとして、障害を検出した処理ノードの識別子が登録される。障害を検出する処理ノードは、物理マシン、仮想マシンまたはコンテナである。 FIG. 10 is a diagram showing an example of a failure table.
The failure table 131 is stored in the failure information storage unit 121. The failure table 131 stores a plurality of records each including items of failure ID, time, detected node, failure type, message, response flag, and cause ID. An identifier for identifying a failure is registered as the failure ID. As the time, the time when the failure occurred or the time when the failure was recognized is registered. The identifier of the processing node that detected the failure is registered as the detection node. The processing node that detects the failure can be a physical machine, a virtual machine, or a container.

障害種別として、アプリ障害であるかインフラ障害であるかの区分が登録される。コンテナで検出されたサービスノード（アプリケーション）の障害は、アプリ障害である。物理マシンまたは仮想マシンで検出された障害は、インフラ障害である。メッセージとして、ログに含まれるエラーメッセージのテキストが登録される。対応フラグとして、システム管理者の障害対応作業によって障害が既に解消しているか未解消であるかを示すフラグが登録される。障害対応作業によって、障害原因が別の障害であるとシステム管理者が結論付けた場合、原因ＩＤとして原因の障害の障害ＩＤが登録される。対応フラグおよび原因ＩＤは、障害対応情報に基づいて追記される情報である。 As the failure type, the category of application failure or infrastructure failure is registered. A service node (application) failure detected in a container is an application failure. A failure detected in a physical or virtual machine is an infrastructure failure. The text of the error message included in the log is registered as the message. As the response flag, a flag indicating whether the fault has already been resolved or not resolved by the system administrator's troubleshooting work is registered. When the system administrator concludes that the cause of the failure is another failure through failure handling work, the failure ID of the cause failure is registered as the cause ID. The response flag and cause ID are information that is added based on the failure response information.

図１１は、構成テーブルの例を示す図である。
構成テーブル１３２は、構成情報記憶部１２２に記憶される。構成テーブル１３２は、時刻、親ノード、子ノードおよび距離の項目をそれぞれ含む複数のレコードを記憶する。時刻として、構成変更が検出された時刻が登録される。親ノードとして、下位（物理マシンに近い方）にある処理ノードの識別子が登録される。子ノードとして、親ノードの上位（アプリケーションに近い方）にある処理ノードの識別子が登録される。 FIG. 11 is a diagram showing an example of a configuration table.
The configuration table 132 is stored in the configuration information storage unit 122. The configuration table 132 stores a plurality of records each including items of time, parent node, child node, and distance. The time at which the configuration change was detected is registered as the time. The identifier of the lower processing node (closer to the physical machine) is registered as the parent node. As a child node, the identifier of a processing node located above the parent node (closer to the application) is registered.

親ノードおよび子ノードはそれぞれ、物理マシン、仮想マシンまたはコンテナである。子ノードは、親ノードで実行されていることもある。また、親ノードで別の処理ノードが実行され、その処理ノードで子ノードが実行されていることもある。距離として、親ノードと子ノードとの間の親子距離が登録される。距離は、垂直方向の階層の差である。物理マシンで仮想マシンが実行され、仮想マシンでコンテナが実行されている場合、物理マシンと仮想マシンの間の距離は「１」であり、仮想マシンとコンテナの間の距離は「１」である。また、物理マシンとコンテナの間の距離は「２」である。一方の処理ノードの上に他方の処理ノードがあるという親子関係が存在しないペア、すなわち、距離が「０」のペアは、構成テーブル１３２に登録しなくてもよい。 The parent node and child node are each a physical machine, virtual machine, or container. A child node may also be running on a parent node. Further, another processing node may be executed on a parent node, and a child node may be executed on that processing node. A parent-child distance between a parent node and a child node is registered as the distance. Distance is the vertical hierarchy difference. If a virtual machine is running on a physical machine and a container is running on a virtual machine, the distance between the physical machine and the virtual machine is "1" and the distance between the virtual machine and the container is "1". . Further, the distance between the physical machine and the container is "2". Pairs in which there is no parent-child relationship in which one processing node is located above the other processing node, that is, pairs with a distance of "0", do not need to be registered in the configuration table 132.
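
The parent-child distance can be sketched as a walk down the hosting chain. The `hosted_on` mapping, which records for each node the node it runs on, is a hypothetical data structure chosen for this illustration; the configuration table 132 itself stores precomputed distances. The sketch assumes the hosting chain contains no cycles.

```python
def parent_child_distance(hosted_on, lower_node, upper_node):
    """Number of layers from upper_node down to lower_node.

    Returns 0 when lower_node is not beneath upper_node, matching the
    convention that pairs without a parent-child relationship have distance 0.
    """
    distance = 0
    current = upper_node
    while current in hosted_on:
        current = hosted_on[current]
        distance += 1
        if current == lower_node:
            return distance
    return 0


# Hypothetical placement: container C1 on virtual machine VM1 on physical
# machine M1, and container C2 directly on physical machine M2.
hosted_on = {"VM1": "M1", "C1": "VM1", "C2": "M2"}
```

Here the distance between M1 and VM1 is 1, between M1 and C1 is 2, and between M2 and C1 is 0 because C1 does not run on M2.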

図１２は、サービス距離テーブルとサービス配置テーブルの例を示す図である。
サービス距離テーブル１３３は、サービス情報記憶部１２３に記憶される。サービス距離テーブル１３３は、時刻、始点ノード、終点ノードおよび距離の項目をそれぞれ含む複数のレコードを記憶する。時刻として、サービスメッシュグラフの変更が検出された時刻が登録される。サービスメッシュグラフは、コンテナの追加や削除、アプリケーションプログラムの更新などによって変化することがある。 FIG. 12 is a diagram showing an example of a service distance table and a service arrangement table.
The service distance table 133 is stored in the service information storage unit 123. The service distance table 133 stores a plurality of records each including items of time, start point node, end point node, and distance. The time at which a change in the service mesh graph is detected is registered as the time. The service mesh graph may change due to the addition or deletion of containers, updates to application programs, etc.

始点ノードとして、直接通信する２つのサービスノードのうちの一方の識別子が登録される。終点ノードとして、直接通信する２つのサービスノードのうちの他方の識別子が登録される。サービスメッシュグラフは無向グラフであるため、始点ノードと終点ノードを入れ替えたものを別レコードとして登録しなくてもよい。距離として、サービスメッシュグラフにおける始点ノードと終点ノードとの間のパスのホップ数が登録される。 As the starting point node, the identifier of one of the two service nodes that directly communicate is registered. As the end node, the identifier of the other of the two service nodes that directly communicate is registered. Since the service mesh graph is an undirected graph, it is not necessary to register a record in which the start point node and end point node are swapped. The number of hops of the path between the starting point node and the ending point node in the service mesh graph is registered as the distance.

サービス配置テーブル１３４は、サービス情報記憶部１２３に記憶される。サービス配置テーブル１３４は、時刻、サービスノードおよびコンテナの項目をそれぞれ含む複数のレコードを記憶する。時刻として、サービスノードの配置変更が検出された時刻が登録される。サービスノードとして、配置されるサービスノードの識別子が登録される。コンテナとして、サービスノードを配置したコンテナの識別子が登録される。サービスノードの配置は、コンテナの追加や削除などによって変化することがある。 The service placement table 134 is stored in the service information storage unit 123. The service placement table 134 stores a plurality of records each including items of time, service node, and container. The time at which the change in the arrangement of the service node is detected is registered as the time. The identifier of the service node to be placed is registered as the service node. The identifier of the container in which the service node is placed is registered as the container. The arrangement of service nodes may change depending on the addition or deletion of containers.

図１３は、訓練データテーブルの例を示す図である。
訓練データテーブル１３５は、学習部１２８によって生成される。訓練データテーブル１３５が、モデル記憶部１２４に保存されてもよい。訓練データテーブル１３５は、アプリ障害、インフラ障害、評価値ｖ_１，ｖ_２，ｖ_３，ｖ_４、評価値ｖ_１’，ｖ_２’，ｖ_４’および教師ラベルの項目をそれぞれ含む複数のレコードを記憶する。 FIG. 13 is a diagram showing an example of a training data table.
The training data table 135 is generated by the learning unit 128. The training data table 135 may be stored in the model storage unit 124. The training data table 135 stores a plurality of records, each including the items application failure, infrastructure failure, evaluation values v₁, v₂, v₃, v₄, evaluation values v₁′, v₂′, v₄′, and teacher label.

アプリ障害として、障害種別がアプリ障害である障害の障害ＩＤが登録される。インフラ障害として、障害種別がインフラ障害である障害の障害ＩＤが登録される。評価値ｖ_１として、アプリ障害とインフラ障害のペアに対して算出された親子距離が登録される。評価値ｖ_２として、上記のペアに対して算出されたサービス距離が登録される。評価値ｖ_３として、上記のペアに対して算出されたテキスト類似度が登録される。評価値ｖ_４として、上記のペアに対して算出された時刻差が登録される。評価値ｖ_１’，ｖ_２’，ｖ_４’として、評価値ｖ_１，ｖ_２，ｖ_４から変換された補正値が登録される。教師ラベルとして、インフラ障害がアプリ障害の原因か否かを示すフラグが登録される。 As the application failure, the failure ID of a failure whose failure type is application failure is registered. As the infrastructure failure, the failure ID of a failure whose failure type is infrastructure failure is registered. As the evaluation value v₁, the parent-child distance calculated for the pair of application failure and infrastructure failure is registered. As the evaluation value v₂, the service distance calculated for the pair is registered. As the evaluation value v₃, the text similarity calculated for the pair is registered. As the evaluation value v₄, the time difference calculated for the pair is registered. As the evaluation values v₁′, v₂′, v₄′, the corrected values converted from the evaluation values v₁, v₂, v₄ are registered. As the teacher label, a flag indicating whether the infrastructure failure is the cause of the application failure is registered.

次に、管理サーバ１００の処理手順について説明する。
図１４は、モデル生成の手順例を示すフローチャートである。
（Ｓ１０）学習部１２８は、障害テーブル１３１から、複数のアプリ障害の障害情報と複数のインフラ障害の障害情報とを分けて抽出する。 Next, the processing procedure of the management server 100 will be explained.
FIG. 14 is a flowchart showing an example of a model generation procedure.
(S10) The learning unit 128 separately extracts failure information about a plurality of application failures and failure information about a plurality of infrastructure failures from the failure table 131.

（Ｓ１１）学習部１２８は、アプリ障害とインフラ障害の組を１つ選択する。なお、ステップＳ１０において、ｍ個のアプリ障害の障害情報とｎ個のインフラ障害の障害情報とが抽出された場合、アプリ障害とインフラ障害の組の候補はｍ×ｎ個存在する。 (S11) The learning unit 128 selects one set of an application failure and an infrastructure failure. Note that in step S10, when failure information of m application failures and failure information of n infrastructure failures are extracted, there are m×n candidates for pairs of application failures and infrastructure failures.
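
Steps S10 and S11 can be sketched as follows: the failure records are split by failure type, and every application failure is paired with every infrastructure failure. The record layout (dictionaries with `id` and `type` keys) and the type strings are hypothetical.

```python
from itertools import product


def candidate_pairs(failures):
    """Return all (application failure, infrastructure failure) combinations."""
    app_failures = [f for f in failures if f["type"] == "app"]
    infra_failures = [f for f in failures if f["type"] == "infra"]
    # m application failures and n infrastructure failures give m * n pairs.
    return list(product(app_failures, infra_failures))


# Hypothetical failure records: 2 application failures, 3 infrastructure failures.
failures = [
    {"id": "F1", "type": "app"},
    {"id": "F2", "type": "infra"},
    {"id": "F3", "type": "infra"},
    {"id": "F4", "type": "app"},
    {"id": "F5", "type": "infra"},
]
pairs = candidate_pairs(failures)
```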

（Ｓ１２）学習部１２８は、アプリ障害を検出した検出ノードとインフラ障害を検出した検出ノードとを特定する。学習部１２８は、構成テーブル１３２から、２つの検出ノードの間の親子距離を検索して評価値ｖ_１とする。なお、２つの検出ノードの間の親子距離が構成テーブル１３２に登録されていない場合はｖ_１＝０とする。また、構成テーブル１３２を参照するにあたり、障害時刻の直前の情報を使用する。 (S12) The learning unit 128 identifies the detection node that detected the application failure and the detection node that detected the infrastructure failure. The learning unit 128 searches the configuration table 132 for the parent-child distance between the two detection nodes and uses it as the evaluation value v₁. If the parent-child distance between the two detection nodes is not registered in the configuration table 132, v₁ = 0. When referring to the configuration table 132, the information immediately before the failure time is used.
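
The lookup of the information immediately before the failure time can be sketched as follows. The record layout and the function name are hypothetical, and the sketch assumes that "immediately before" means the newest matching record registered strictly before the failure time. How a removed parent-child relationship is recorded is not stated in the text and is not modeled here.

```python
from datetime import datetime


def latest_distance_before(records, parent, child, failure_time):
    """Distance from the newest matching record before failure_time; 0 if none."""
    matching = [
        r for r in records
        if r["parent"] == parent and r["child"] == child and r["time"] < failure_time
    ]
    if not matching:
        return 0
    return max(matching, key=lambda r: r["time"])["distance"]


# Hypothetical configuration table rows.
records = [
    {"time": datetime(2020, 3, 1), "parent": "M1", "child": "VM1", "distance": 1},
    {"time": datetime(2020, 3, 20), "parent": "M1", "child": "C1", "distance": 2},
    {"time": datetime(2020, 4, 2), "parent": "M1", "child": "C1", "distance": 1},
]
v1 = latest_distance_before(records, "M1", "C1", datetime(2020, 4, 1, 16, 0))
```

In this example the row dated April 2 is ignored because it was registered after the failure, so the March 20 row supplies the distance.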

（Ｓ１３）学習部１２８は、構成テーブル１３２を参照して、インフラ障害を検出した検出ノードと親子関係にあるコンテナを検索する。学習部１２８は、サービス配置テーブル１３４を参照して、検索したコンテナで実行されるサービスノードを検索する。なお、サービス配置テーブル１３４を参照するにあたり、障害時刻の直前の情報を使用する。 (S13) The learning unit 128 refers to the configuration table 132 and searches for a container that has a parent-child relationship with the detection node that detected the infrastructure failure. The learning unit 128 refers to the service placement table 134 and searches for a service node to be executed in the searched container. Note that when referring to the service placement table 134, information immediately before the failure time is used.

（Ｓ１４）学習部１２８は、サービス距離テーブル１３３から、アプリ障害が発生しているサービスノードとステップＳ１３で検索されたサービスノードそれぞれとの間のサービス距離を検索する。学習部１２８は、検索されたサービス距離のうちの最小のサービス距離を評価値ｖ_２とする。なお、異なるサービスノードの間のサービス距離がサービス距離テーブル１３３に登録されていない場合、当該異なるサービスノードは非連結である。ステップＳ１３で検索されたサービスノードの全てが、アプリ障害が発生しているサービスノードと非連結である場合、ｖ_２＝０とする。また、サービス距離テーブル１３３を参照するにあたり、障害時刻の直前の情報を使用する。 (S14) The learning unit 128 searches the service distance table 133 for the service distance between the service node in which the application failure has occurred and each of the service nodes found in step S13. The learning unit 128 uses the smallest of the retrieved service distances as the evaluation value v₂. If the service distance between two different service nodes is not registered in the service distance table 133, those service nodes are not connected. If all of the service nodes found in step S13 are disconnected from the service node in which the application failure has occurred, v₂ = 0. When referring to the service distance table 133, the information immediately before the failure time is used.

(S15) The learning unit 128 extracts an error message from the failure information of the application failure and from the failure information of the infrastructure failure. The learning unit 128 calculates the text similarity between the two extracted error messages and uses it as the evaluation value v_3. For example, the learning unit 128 splits the text of each error message into words, calculates a Bag of Words vector for each message, and calculates the cosine similarity between the two vectors as the evaluation value v_3.
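
As a non-limiting illustration of step S15, the Bag of Words cosine similarity can be sketched as follows. The tokenization rule (lowercasing and splitting on non-word characters) and the function name `bow_cosine_similarity` are assumptions; the embodiment states only that the text is split into words.

```python
import math
import re
from collections import Counter


def bow_cosine_similarity(message_a, message_b):
    """Return v_3: cosine similarity of the word-count vectors of two messages."""
    # Bag of Words: count how often each word occurs in each message.
    counts_a = Counter(re.findall(r"\w+", message_a.lower()))
    counts_b = Counter(re.findall(r"\w+", message_b.lower()))
    # Only words present in both messages contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty message is similar to nothing
    return dot / (norm_a * norm_b)
```

For instance, "connection timeout" and "connection refused" share one of two words each, so the sketch yields a similarity of 0.5.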

(S16) The learning unit 128 extracts the failure time from the failure information of the application failure and from the failure information of the infrastructure failure. The learning unit 128 calculates the time difference between the two extracted failure times and uses it as the evaluation value v_4. The time difference is expressed, for example, in hours. If the infrastructure failure occurred later than the application failure, v_4 = 0.
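
As a non-limiting illustration of step S16, the time difference can be sketched as follows. The function name `time_difference_hours` and the use of `datetime` objects for the failure times are assumptions made for illustration.

```python
from datetime import datetime


def time_difference_hours(app_failure_time, infra_failure_time):
    """Return v_4: hours from the infrastructure failure to the application failure.

    If the infrastructure failure occurred later than the application failure,
    the value is 0, as in step S16.
    """
    delta = app_failure_time - infra_failure_time
    hours = delta.total_seconds() / 3600.0
    return hours if hours > 0 else 0.0
```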

(S17) The learning unit 128 converts the evaluation value v_1 calculated in step S12 into an evaluation value v_1' according to equation (2). The learning unit 128 also converts the evaluation value v_2 calculated in step S14 into an evaluation value v_2' according to equation (3), and converts the evaluation value v_4 calculated in step S16 into an evaluation value v_4' according to equation (4). The learning unit 128 generates a feature vector containing the evaluation values v_1', v_2', v_3, and v_4'.

(S18) The learning unit 128 determines, based on the cause ID included in the failure information of the application failure, whether the infrastructure failure of interest is the cause of the application failure. If the infrastructure failure is not the cause of the application failure, the teacher label is set to "0". If the infrastructure failure is the cause of the application failure, the teacher label is set to "1".

(S19) The learning unit 128 associates the feature vector generated in step S17 with the teacher label determined in step S18 and adds the pair to the training data.
(S20) The learning unit 128 determines whether all pairs of an application failure and an infrastructure failure have been selected in step S11. If all pairs have been selected, the process proceeds to step S21. If any pair remains unselected, the process returns to step S11.

(S21) The learning unit 128 generates the cause determination model 151 by logistic regression analysis using the training data generated through steps S10 to S20. In this step, the parameters α, β_1, β_2, β_3, and β_4 included in equation (1) are determined. The learning unit 128 stores the generated cause determination model 151 in the model storage unit 124.
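
As a non-limiting illustration of step S21, the parameter fitting can be sketched as follows. Equation (1) is not reproduced in this passage, so the sketch assumes the standard logistic form, in which the confidence is 1 / (1 + exp(-(α + β_1 x_1 + ... + β_4 x_4))). Batch gradient descent on the log loss is used here as one common way to fit such a model; the embodiment does not specify the fitting algorithm, and all function names and hyperparameter values are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(alpha, betas, features):
    """Confidence under the assumed logistic form of equation (1)."""
    return sigmoid(alpha + sum(b * x for b, x in zip(betas, features)))


def fit_logistic_regression(training_data, learning_rate=0.5, epochs=2000):
    """Fit alpha and betas from a list of (feature_vector, teacher_label)."""
    n_features = len(training_data[0][0])
    alpha, betas = 0.0, [0.0] * n_features
    n = len(training_data)
    for _ in range(epochs):
        grad_alpha, grad_betas = 0.0, [0.0] * n_features
        for features, label in training_data:
            # Gradient of the log loss with respect to the linear term.
            error = predict(alpha, betas, features) - label
            grad_alpha += error
            for i, x in enumerate(features):
                grad_betas[i] += error * x
        alpha -= learning_rate * grad_alpha / n
        betas = [b - learning_rate * g / n for b, g in zip(betas, grad_betas)]
    return alpha, betas
```

Each training example is a four-element feature vector (v_1', v_2', v_3, v_4') paired with a teacher label of 0 or 1, matching the training data assembled in step S19.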

FIG. 15 is a flowchart illustrating an example of a procedure for determining the cause of a failure.
(S30) The cause determination unit 129 detects that a new application failure has occurred. The cause determination unit 129 then extracts, from the failure table 131, the failure information of the new application failure and the failure information of the unresolved infrastructure failures.

(S31) The cause determination unit 129 selects one infrastructure failure.
(S32) In the same manner as in step S12 described above, the cause determination unit 129 calculates the evaluation value v_1 indicating the parent-child distance between the detection node of the application failure and the detection node of the infrastructure failure.

(S33) In the same manner as in step S13 described above, the cause determination unit 129 searches for the service nodes being executed in the layers above the processing node that detected the infrastructure failure.
(S34) In the same manner as in step S14 described above, the cause determination unit 129 calculates the evaluation value v_2 indicating the service distance between service nodes.

(S35) In the same manner as in step S15 described above, the cause determination unit 129 calculates the evaluation value v_3 indicating the similarity between the error messages of the application failure and the infrastructure failure.
(S36) In the same manner as in step S16 described above, the cause determination unit 129 calculates the evaluation value v_4 indicating the time difference between the application failure and the infrastructure failure.

(S37) In the same manner as in step S17 described above, the cause determination unit 129 converts the evaluation values v_1, v_2, and v_4 into evaluation values v_1', v_2', and v_4', and generates a feature vector containing the evaluation values v_1', v_2', v_3, and v_4'.

(S38) The cause determination unit 129 inputs the feature vector generated in step S37 into the cause determination model 151 and calculates the confidence level.
(S39) The cause determination unit 129 determines whether all infrastructure failures have been selected in step S31. If all infrastructure failures have been selected, the process proceeds to step S40. If any infrastructure failure remains unselected, the process returns to step S31.

(S40) The cause determination unit 129 sorts the unresolved infrastructure failures in descending order of confidence level.
(S41) The cause determination unit 129 generates a system management screen 152 that includes information on the application failure, information on the unresolved infrastructure failures arranged in descending order of confidence level, and the confidence level of each infrastructure failure, and transmits the system management screen 152 to the client terminal 31.
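
As a non-limiting illustration of steps S31 to S40, the scoring and ranking loop can be sketched as follows. The function name `rank_infrastructure_failures` and the `score` callback, which stands in for the feature generation of steps S32 to S37 followed by the model evaluation of step S38, are hypothetical.

```python
def rank_infrastructure_failures(infra_failures, score):
    """Return (failure, confidence) pairs sorted by confidence, highest first.

    infra_failures: iterable of unresolved infrastructure failures.
    score: callable returning the confidence level for one failure
           (placeholder for steps S32 to S38).
    """
    scored = [(failure, score(failure)) for failure in infra_failures]
    # Descending order of confidence, as in step S40.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because Python's sort is stable, infrastructure failures with equal confidence keep their original relative order in this sketch.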

FIG. 16 is a flowchart illustrating an example of a model update procedure.
This model update is executed after the failure cause determination of FIG. 15.
(S50) The cause determination unit 129 receives failure handling information from the client terminal 31. The failure handling information includes the failure ID of an application failure, the failure ID of an infrastructure failure, and a teacher label indicating whether the infrastructure failure was the cause of the application failure. The teacher label reflects the judgment reached by the system administrator through the failure handling work. If the infrastructure failure is not the cause of the application failure, the teacher label is "0". If the infrastructure failure is the cause of the application failure, the teacher label is "1". The cause determination unit 129 updates the failure table 131 based on the received failure handling information.

(S51) If the feature vector generated in step S37 described above has been saved, the learning unit 128 acquires that feature vector. If the feature vector has not been saved, the learning unit 128 regenerates it in the same manner as in steps S32 to S37.

(S52) If the confidence level calculated in step S38 described above has been saved, the learning unit 128 acquires that confidence level. If the confidence level has not been saved, the learning unit 128 recalculates it in the same manner as in step S38.

(S53) The learning unit 128 updates the cause determination model 151 by online learning, using the cause determination model 151, the feature vector, the confidence level, and the teacher label. A gradient method such as stochastic gradient descent can be used for the online learning. For example, the learning unit 128 calculates the error between the confidence level and the teacher label, and calculates the gradient of the error with respect to the parameters α, β_1, β_2, β_3, and β_4 from the change in the error when the value of each parameter is varied by a small amount. The learning unit 128 then changes the values of the parameters α, β_1, β_2, β_3, and β_4 by the gradient of the error multiplied by a predetermined learning rate. Instead of performing online learning, however, the current pair of feature vector and teacher label may be added to the training data, and the machine learning of step S21 described above may be executed again.
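
As a non-limiting illustration of the online update in step S53, a single gradient step with a numerically estimated gradient can be sketched as follows. The sketch assumes the standard logistic form for the model and a squared error between the confidence level and the teacher label; the embodiment refers only to "the error", so the error function, the perturbation size `eps`, the learning rate, and all function names are hypothetical.

```python
import math


def confidence(params, features):
    """params = [alpha, beta_1, ..., beta_n]; assumed logistic form."""
    z = params[0] + sum(b * x for b, x in zip(params[1:], features))
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def squared_error(params, features, label):
    # Assumed error measure between the confidence level and the teacher label.
    return (confidence(params, features) - label) ** 2


def online_update(params, features, label, learning_rate=0.1, eps=1e-6):
    """Return updated parameters after one finite-difference gradient step."""
    base = squared_error(params, features, label)
    gradient = []
    for i in range(len(params)):
        # Vary one parameter by a small amount and observe the error change.
        shifted = list(params)
        shifted[i] += eps
        gradient.append((squared_error(shifted, features, label) - base) / eps)
    # Move each parameter against its gradient, scaled by the learning rate.
    return [p - learning_rate * g for p, g in zip(params, gradient)]
```

A single call processes one pair of feature vector and teacher label, so the model can be adjusted each time failure handling information arrives, without refitting on the full training data.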

According to the information processing system of the second embodiment, when an application failure and an infrastructure failure occur in the same period, the likelihood that the infrastructure failure is the root cause of the application failure is evaluated, and the resulting confidence level is presented to the system administrator. When multiple infrastructure failures have occurred, they are presented in descending order of confidence level. This reduces the system administrator's failure handling workload and shortens the time needed to resolve the failure. In particular, in a complex virtual environment in which physical machines, virtual machines, and containers are arranged hierarchically, the cause analysis of application failures triggered by infrastructure failures can be made more efficient even when many fine-grained applications are distributed across and executed by many containers.

In addition, a cause determination model generated by machine learning from past failure information is used to evaluate the relationship between an application failure and an infrastructure failure. The explanatory variables of the cause determination model include an evaluation value indicating the hierarchical relationship between processing nodes. This makes it possible to take into account vertical failure propagation, in which a failure occurring in a processing node in a lower layer affects a processing node in a higher layer. The feature vector also includes an evaluation value indicating the communication relationship between applications. This makes it possible to take into account horizontal failure propagation, in which a failure in one application affects another application.

The feature vector also includes an evaluation value indicating the similarity between error messages, so that the similarity of the failure contents can be taken into account. The feature vector further includes an evaluation value indicating the time difference between the application failure and the infrastructure failure, so that the likelihood of failure propagation can be evaluated in terms of delay time. Using these four viewpoints together improves the determination accuracy of the cause determination model. Furthermore, the identifiers of physical machines, virtual machines, containers, and applications are not used as inputs to the cause determination model. Therefore, even if the configuration of the virtual environment is changed, the determination accuracy of the generated cause determination model is unlikely to decrease, and the usefulness of the cause determination model can be maintained.

10 Control device
11 Storage unit
12 Processing unit
13, 14 Failure information
15 Teacher label
16 Feature information
16a, 16b, 16c Evaluation value
17 Model
20 Information processing system
21, 22, 23 Processing node
24, 25 Application

Claims

A control program that causes a computer to execute a process comprising:
acquiring, for an information processing system that includes a plurality of processing nodes, each being a processing node capable of executing an application using allocated resources and the plurality of processing nodes being arrangeable hierarchically using virtualization software, and in which each of a plurality of applications is executed by one of the plurality of processing nodes, first failure information indicating a failure of a first application, second failure information indicating a failure of a second processing node, and a teacher label indicating the presence or absence of a relationship between the first failure information and the second failure information;
calculating, based on the first failure information and the second failure information, a first evaluation value indicating a hierarchical relationship of arrangement between a first processing node that executes the first application and the second processing node, a second evaluation value indicating a dependency relationship between the first application and a second application executed on a processing node arranged on the second processing node, and a third evaluation value indicating the degree of similarity between a first error message included in the first failure information and a second error message included in the second failure information; and
generating, using training data in which feature information including the first evaluation value, the second evaluation value, and the third evaluation value is associated with the teacher label, a model that estimates, from input data corresponding to the feature information for two pieces of failure information, the presence or absence of a relationship between the two pieces of failure information.

The control program according to claim 1, wherein the output of the model includes a confidence level indicating the strength of the relationship between the two pieces of failure information, and the process further comprises:
acquiring, after the model is generated, third failure information indicating a failure of one application and a plurality of pieces of fourth failure information indicating failures of different processing nodes;
for each pair of the third failure information and one piece of fourth failure information among the plurality of pieces of fourth failure information, generating the input data based on the third failure information and the one piece of fourth failure information, and calculating the confidence level by inputting the input data into the model; and
prioritizing and outputting the plurality of pieces of fourth failure information based on the confidence level.

The control program according to claim 1, wherein the process further comprises:
acquiring failure handling information indicating an analysis result of the failure of the first application;
generating the teacher label indicating that a relationship exists when the cause of the failure indicated by the failure handling information includes the failure of the second processing node; and
generating the teacher label indicating that no relationship exists when the cause of the failure indicated by the failure handling information does not include the failure of the second processing node.

The control program according to claim 1, wherein the first evaluation value indicates the distance between the layer of the first processing node and the layer of the second processing node, the second evaluation value indicates the distance between the first application and the second application in a mesh graph representing the communication relationships among the plurality of applications, and the third evaluation value indicates the degree of similarity between the words included in the first error message and the words included in the second error message.

The control program according to claim 1, wherein the feature information further includes a fourth evaluation value indicating the time difference between the time at which the failure occurred in the first application and the time at which the failure occurred in the second processing node.

The control program according to claim 1, wherein the plurality of processing nodes include two or more physical machines, two or more virtual machines each arranged on one of the two or more physical machines, and two or more containers each arranged on one of the two or more virtual machines, and
the first processing node is one of the two or more containers, and the second processing node is one of the two or more physical machines and the two or more virtual machines.

A control method executed by a computer, the method comprising:
acquiring, for an information processing system that includes a plurality of processing nodes, each being a processing node capable of executing an application using allocated resources and the plurality of processing nodes being arrangeable hierarchically using virtualization software, and in which each of a plurality of applications is executed by one of the plurality of processing nodes, first failure information indicating a failure of a first application, second failure information indicating a failure of a second processing node, and a teacher label indicating the presence or absence of a relationship between the first failure information and the second failure information;
calculating, based on the first failure information and the second failure information, a first evaluation value indicating a hierarchical relationship of arrangement between a first processing node that executes the first application and the second processing node, a second evaluation value indicating a dependency relationship between the first application and a second application executed on a processing node arranged on the second processing node, and a third evaluation value indicating the degree of similarity between a first error message included in the first failure information and a second error message included in the second failure information; and
generating, using training data in which feature information including the first evaluation value, the second evaluation value, and the third evaluation value is associated with the teacher label, a model that estimates, from input data corresponding to the feature information for two pieces of failure information, the presence or absence of a relationship between the two pieces of failure information.

A control device comprising:
a storage unit that stores, for an information processing system that includes a plurality of processing nodes, each being a processing node capable of executing an application using allocated resources and the plurality of processing nodes being arrangeable hierarchically using virtualization software, and in which each of a plurality of applications is executed by one of the plurality of processing nodes, first failure information indicating a failure of a first application, second failure information indicating a failure of a second processing node, and a teacher label indicating the presence or absence of a relationship between the first failure information and the second failure information; and
a processing unit that calculates, based on the first failure information and the second failure information, a first evaluation value indicating a hierarchical relationship of arrangement between a first processing node that executes the first application and the second processing node, a second evaluation value indicating a dependency relationship between the first application and a second application executed on a processing node arranged on the second processing node, and a third evaluation value indicating the degree of similarity between a first error message included in the first failure information and a second error message included in the second failure information, and that generates, using training data in which feature information including the first evaluation value, the second evaluation value, and the third evaluation value is associated with the teacher label, a model that estimates, from input data corresponding to the feature information for two pieces of failure information, the presence or absence of a relationship between the two pieces of failure information.