JP7472628B2

JP7472628B2 - Fault recovery device, fault recovery method and program

Info

Publication number: JP7472628B2
Application number: JP2020079042A
Authority: JP
Inventors: 光希池内; 洋一松尾; 敬志郎渡辺; 嘉文葛
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2024-04-23
Anticipated expiration: 2040-04-28
Also published as: JP2021174348A

Description

The present invention relates to a fault recovery device, a fault recovery method, and a program.

As ICT systems grow larger and more complex, the types and number of faults that occur are increasing. Responding to faults requires monitoring and analyzing observation data such as logs and metrics, but the sheer volume of data and the complex relationships among the data make fault recovery work extremely difficult. For this reason, recent efforts have applied machine learning to automate fault recovery work and improve its accuracy.

Non-Patent Document 1 describes a technique for generating a sequence of recovery commands based on logs obtained at the time of a fault. A neural network learns in advance the relationship between the logs obtained from the target system when a fault occurred and the recovery commands executed to recover from that fault, so that when a new fault occurs, plausible recovery commands can be generated from the observed logs.

Another approach trains a classifier for identifying fault factors by intentionally inserting fault factors and collecting fault data. In Non-Patent Document 2, a fault factor insertion tool developed for chaos engineering is used to insert various fault factors into a verification environment, and the observation data at that time is collected. By training a classifier that links the observation data to the fault factor (a combination of a fault type and the location where it occurs), the fault factor can be identified from observation data when a new fault occurs. Whereas Non-Patent Document 1 assumes data obtained during actual operation, Non-Patent Document 2 uses a framework that intentionally inserts fault factors, so it may be able to handle faults that have not yet actually occurred or that occur only rarely.

There are also approaches that learn a recovery policy, that is, a rule indicating which commands should be entered to recover the system, based on observation data obtained from the system.

In Non-Patent Document 3, a probabilistic model is created in advance that represents which logs are likely to appear when a given command is executed during a fault. Using this probabilistic model, the recovery policy that identifies the fault factor most efficiently is learned by reinforcement learning or similar methods.

In Non-Patent Document 4, fault isolation commands such as ping are executed in a verification environment into which a fault factor has been inserted. A recovery policy function that computes the next command to execute from the binary feedback obtained (for example, the success or failure of the ping) is learned by reinforcement learning.

[Non-Patent Document 1] 池内光希, 渡邉暁, 松尾洋一, 川田丈浩, "Automatic Generation of Fault Recovery Command Sequences Using Seq2Seq," IEICE General Conference, B-7-25, 2019.
[Non-Patent Document 2] 池内光希, 葛嘉文, 松尾洋一, 渡辺敬志郎, "A Study on a Method for Identifying the Cause of Faults Based on Fault Data Generation," IEICE General Conference, B-7-32, 2020.
[Non-Patent Document 3] H. Ikeuchi, A. Watanabe, T. Kawata, and R. Kawahara, "Root-Cause Diagnosis Using Logs Generated by User Actions," in Proc. of IEEE Global Communication Conference (Globecom), pp. 1-7, 2018.
[Non-Patent Document 4] M. L. Littman, N. Ravi, E. Fenson, and R. Howard, "An Instance-based State Representation for Network Repair," in Proc. of the 19th National Conference on American Association for Artificial Intelligence (AAAI), pp. 287-292, 2004.

The method of Non-Patent Document 1 requires that a sufficient number of fault-time logs and recovery commands be accumulated as training data, but in practice this requirement is not always met.

The fault data generation approach of Non-Patent Document 2 partially solves the problem of Non-Patent Document 1, but it focuses on identifying the fault factor and does not yield recovery commands, so it cannot achieve automated recovery.

The technique of Non-Patent Document 3 requires the creation of a probabilistic model of log occurrence, which is extremely difficult for a complex system.

The technique of Non-Patent Document 4 requires neither training data nor a probabilistic model created in advance, but it assumes binary results as the input to the recovery policy. It is therefore difficult to apply to faults for which the recovery command must be decided from observation data, such as logs and metrics, that is high-dimensional, fluctuates stochastically, and has complex correlations.

The present invention has been made in view of the above points, and its object is to make it possible to automatically estimate commands for recovering from a fault.

To solve the above problem, a fault recovery device includes: a first acquisition unit that acquires first data observed in a first system after an artificial fault factor has been inserted; a first execution unit that executes, on the first system, a command estimated by a neural network when the first data is input to the neural network; a second acquisition unit that acquires second data observed in the first system after execution of the command; a third acquisition unit that acquires a reward obtained by executing the command; and a learning unit that trains the neural network by deep reinforcement learning based on the first data, the command, the reward, and the second data. The number of times each artificial fault factor is inserted into the first system is changed according to the degree of learning of the neural network.

Commands for recovering from a fault can be estimated automatically.

FIG. 1 is a diagram showing an example of the hardware configuration of the fault recovery device 10 according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the functional configuration of the fault recovery device 10 according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an example of the processing procedure executed by the fault recovery device 10 in one episode of the learning phase.
FIG. 4 is a flowchart illustrating an example of the processing procedure executed by the fault recovery device 10 in the automatic recovery phase.
FIG. 5 is a diagram showing learning results in an experiment.

This embodiment discloses a technique in which various faults are artificially generated by inserting fault factors into a target ICT system (hereinafter, the "target system") with a fault factor insertion tool, and deep reinforcement learning is executed in that environment to acquire a recovery policy. Specifically, a deep reinforcement learning agent repeatedly selects and executes a suitable recovery command based on observation data such as logs and metrics obtained from the target system, and receives a reward according to the resulting change in the state of the target system. By repeating this process, the agent acquires an optimal recovery policy that maximizes the reward.

In this embodiment, actual observation data is obtained from the target system, and a recovery policy is acquired, through the insertion of artificially generated fault factors and the execution of recovery commands by the agent. Such data therefore does not need to be accumulated as training data in advance, and no probabilistic model needs to be created in advance. Furthermore, because the neural network in deep reinforcement learning can analyze high-dimensional inputs, the approach can handle observation data such as logs and metrics that is high-dimensional, fluctuates stochastically, and has complex correlations. It is therefore possible to establish a framework that automatically estimates recovery commands for recovering from a fault based on such observation data, without requiring training data or a probabilistic model created in advance.

First, we briefly describe reinforcement learning, which underlies this embodiment; Q-learning, which is one reinforcement learning algorithm (R. Sutton and A. Barto (Japanese translation by Sadayoshi Mikami and Masaaki Minagawa), "Reinforcement Learning," Morikita Shuppan, pp. 159-161, 2000); and the deep Q-network, which is one deep reinforcement learning algorithm (V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015). However, this embodiment is not limited to these algorithms, and other algorithms may be used.

Reinforcement learning is a method by which an agent acquires a policy for taking optimal actions in a given environment so as to maximize the sum of the rewards it obtains. Reinforcement learning involves an agent, which is the actor, and an environment in which the agent executes actions. An overview of reinforcement learning follows.

The agent observes the "state s" of the environment and executes an "action a" according to its current "policy π." Here, a policy is a function that returns an action for a given state, as in π(s) = a. When action a is executed, the environment transitions to state s', and the agent obtains a reward r according to that transition. As this interaction between the agent and the environment is repeated, the agent improves its policy so as to maximize the sum of rewards obtainable over a series of actions.

Q-learning is one algorithm for realizing reinforcement learning. In Q-learning, the agent learns a Q-function Q^π(s, a). Q^π(s, a) is the expected sum of future rewards obtained when action a is executed in state s and policy π is followed thereafter. The agent repeatedly interacts with the environment and updates the policy π and the Q-function Q^π(s, a).
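The Q-function update can be illustrated with a minimal tabular sketch. The update rule below is the standard Q-learning rule, not something specific to this document; the learning rate `alpha`, the state labels, and the command names are hypothetical values chosen only for illustration.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(s, a).

    Q(s, a) is moved toward the target r + gamma * max_a' Q(s', a').
    `alpha` (learning rate) is an assumption; it is not given in the text.
    """
    best_next = max(q[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Hypothetical states and recovery commands, for illustration only.
q_table = defaultdict(float)
commands = ["restart_service", "reboot_host"]
q_learning_update(q_table, "link_down", "restart_service", 1.0, "normal", commands)
```

A table like this is only workable when states can be enumerated, which is why the text moves on to function approximation for high-dimensional observation data.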

When the state space or action space is large or continuous, the Q-function is often approximated by a parameterized function. In particular, methods that use a neural network as the approximating function are called deep reinforcement learning, and the deep Q-network (DQN) algorithm is known as its most basic form. In DQN, the weight parameters θ of the neural network are updated so as to minimize the following loss function L(θ):

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)\right)^2\right]$$

Here, γ is a parameter called the discount rate, and Q_θ(s', a') is the Q-function.
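As a rough illustration of how such a loss could be evaluated, the sketch below computes the mean squared temporal-difference error over a mini-batch, where each sample compares r + γ max Q_θ(s', a') with Q_θ(s, a). It is a simplified form with no separate target network, and `q_fn` is a hypothetical stand-in for the neural network: any callable that maps a state to a list of per-action values.

```python
def dqn_loss(q_fn, batch, gamma=0.99):
    """Mean squared TD error over a mini-batch of (s, a, r, s_next) tuples.

    q_fn(state) must return a list with one Q-value per action.
    This is a sketch: real DQN implementations usually add a target
    network and handle terminal states separately.
    """
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_fn(s_next))  # r + gamma * max_a' Q(s', a')
        total += (target - q_fn(s)[a]) ** 2
    return total / len(batch)


# Toy Q-function over scalar states with two actions (illustration only).
toy_q = lambda state: [state, 2.0 * state]
loss = dqn_loss(toy_q, [(1.0, 0, 1.0, 2.0)], gamma=0.5)
```

In practice the gradient of this quantity with respect to θ would be taken by a deep learning framework; the pure-Python version only makes the arithmetic explicit.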

When the parameters are updated, a technique called experience replay is used: experiences (s, a, r, s') are stored in memory, and stored experiences are drawn at random as a mini-batch to update θ. The policy during learning is determined from the Q-function Q_θ by, for example, the ε-greedy method, and the policy after learning is given by

$$\pi(s) = \operatorname*{argmax}_{a} Q_\theta(s, a)$$
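The two mechanisms described here, experience replay and ε-greedy action selection, can be sketched as follows. The class and function names, the buffer capacity, and the use of a bounded `deque` are illustrative assumptions; the text only states that experiences are stored and sampled at random as mini-batches.

```python
import random
from collections import deque


class ReplayMemory:
    """Bounded store of experiences (s, a, r, s_next); oldest entries drop first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def greedy(q_values):
    """Policy after learning: index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Policy during learning: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)
```

Sampling at random from the stored history breaks the correlation between consecutive experiences, which is the usual motivation for experience replay.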

This embodiment consists of two phases. In the learning phase, deep reinforcement learning is executed on a target system in a verification environment or on a target system in a production environment before it enters operation (hereinafter, both are called the "learning system") to learn a recovery policy, that is, a neural network that estimates recovery commands. In the automatic recovery phase, automatic recovery is performed using the learned recovery policy (neural network).

The learning phase is described first. In the learning system, a fault is generated artificially using a fault factor insertion tool. The fault factor insertion tool may be a script written by an engineer to reproduce conceivable faults, or it may be an existing load testing tool, fault factor insertion tool, or the like.

Next, various kinds of observation data are acquired from the learning system in the state where the artificial fault has occurred: log data obtained from each device constituting the learning system, metrics such as CPU usage and memory usage, traffic flowing between devices, and so on. The observation data must be converted into a numerical vector by some method, and the numerical vector obtained in this way is hereinafter called a "feature vector." A feature vector may be generated, for example, by simply concatenating the obtained numerical data, log counts, and similar values, and applying preprocessing such as suitable normalization.
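One simple way to build such a feature vector is sketched below. The metric names, the per-metric scale factors, and the choice of `log1p` to compress log counts are all assumptions made for illustration; the text only calls for concatenating numerical data and log counts and applying suitable normalization.

```python
import math


def to_feature_vector(metrics, log_counts, metric_scale):
    """Concatenate scaled metrics and compressed log counts into one list.

    metrics:      {name: value}, e.g. CPU usage in percent
    log_counts:   {log template or level: number of occurrences}
    metric_scale: {name: divisor} used to bring each metric to roughly [0, 1]
    Keys are sorted so that every call yields the same element order.
    """
    scaled_metrics = [metrics[name] / metric_scale[name] for name in sorted(metrics)]
    compressed_logs = [math.log1p(log_counts[key]) for key in sorted(log_counts)]
    return scaled_metrics + compressed_logs


# Hypothetical observation from one device.
state = to_feature_vector(
    metrics={"cpu_percent": 50.0, "mem_gb": 2.0},
    log_counts={"ERROR": 0, "WARN": 3},
    metric_scale={"cpu_percent": 100.0, "mem_gb": 4.0},
)
```

A fixed element order matters because the neural network assigns meaning to each input position.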

In this embodiment, the feature vector obtained in this way is regarded as the "state s" in reinforcement learning. Because state s is continuous-valued and high-dimensional, has complex correlations among its elements, and includes stochastic fluctuations, a deep reinforcement learning method such as DQN is well suited to this embodiment.

The agent inputs the feature vector into the DQN neural network (the neural network that estimates recovery commands) and decides on action a based on the current policy. The agent executes action a on the learning system and obtains a feature vector s' based on new observation data. The agent also uses a separately deployed system status confirmation tool to check how action a changed the state of the learning system. For example, a check based on whether pings to each device in the learning system succeed, or an anomaly detection system that computes the system's degree of anomaly, may be used.

The agent obtains a reward r according to the result of this change in system state. For example, if only a binary distinction between normal and abnormal can be confirmed for the learning system, a positive reward is given when the system returns to the normal state, and a negative or zero reward is given otherwise. If an anomaly detection tool that computes a continuous degree of anomaly is used, the amount by which the degree of anomaly decreases (how much closer the action brought the system to the normal state) may be defined as the reward. If the target system consists of several components and the normal state of each component can be confirmed, the number of components that returned to the normal state may be given as the reward. The reward design also depends on the kind of recovery policy to be acquired. To acquire a policy that recovers in as short a time as possible, the reward should take the execution time of each command into account; to acquire a policy that recovers using only recovery commands that are as safe as possible, the reward should take each command's impact on the system (for example, the load it places on the system's CPU) into account.
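The reward variants described above can be written down as small functions. This is a sketch under stated assumptions: the function names, the default reward magnitudes, and the linear time penalty with weight `weight` are hypothetical choices, since the text leaves the exact reward design open.

```python
def binary_reward(recovered, success=1.0, otherwise=0.0):
    """Positive reward on return to normal, zero (or negative) otherwise."""
    return success if recovered else otherwise


def anomaly_drop_reward(score_before, score_after):
    """Reward equals the decrease in a continuous anomaly score."""
    return score_before - score_after


def component_reward(normal_before, normal_after):
    """Number of components that were abnormal before and are normal after.

    Both arguments map component name -> True if the component is normal.
    """
    return sum(
        1
        for name, ok in normal_after.items()
        if ok and not normal_before.get(name, False)
    )


def time_penalized(reward, exec_seconds, weight=0.01):
    """Subtract a cost proportional to command execution time (assumed linear)."""
    return reward - weight * exec_seconds
```

A penalty for system impact could take the same shape as `time_penalized`, with a per-command impact score in place of the execution time.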

Having obtained the experience (s, a, r, s'), the agent stores it in memory and updates the weight parameters of the neural network, also using past experiences drawn as a mini-batch.

The above process is repeated until the learning system returns to the normal state. The cycle from the insertion of one fault factor into the learning system to recovery to the normal state is called an episode. If, within one episode, the system has not recovered after the above cycle has been repeated a certain number of times, the state of the verification environment may be returned to normal, for example by using a backup of the learning system, and the episode may be forcibly terminated. In this embodiment, episodes are repeated many times while varying the fault factor to be inserted (the combination of the type of fault and the location where the fault factor is inserted). The learning phase ends when a predefined convergence condition is satisfied, for example, when the average of the per-episode reward sums over the most recent several episodes exceeds a certain threshold. From the trained neural network (hereinafter, the "trained neural network"), a recovery policy can be obtained as

$$\pi(s) = \operatorname*{argmax}_{a} Q_\theta(s, a)$$
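The example convergence condition, in which the average reward sum over the most recent episodes exceeds a threshold, could be tracked as in the sketch below. The class name, the window size, and the decision to wait until the window is full before declaring convergence are assumptions for illustration.

```python
from collections import deque


class ConvergenceMonitor:
    """Tracks per-episode reward sums and tests a moving-average threshold."""

    def __init__(self, window, threshold):
        self.returns = deque(maxlen=window)
        self.threshold = threshold

    def record(self, episode_return):
        """Add one episode's reward sum; return True once learning has converged."""
        self.returns.append(episode_return)
        if len(self.returns) < self.returns.maxlen:
            return False  # assumption: require a full window before judging
        average = sum(self.returns) / len(self.returns)
        return average > self.threshold


monitor = ConvergenceMonitor(window=3, threshold=5.0)
flags = [monitor.record(total) for total in [6, 6, 3, 7, 8]]
```

The learning loop would stop inserting fault factors the first time `record` returns True.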

Next, the automatic recovery phase is described. In the automatic recovery phase, the target system in operation in the real environment (hereinafter, the "real system") is used. When a fault occurs and observation data is obtained, the data is converted into a feature vector, and the feature vector is input to the trained neural network as state s. The trained neural network estimates the action a to be executed according to the learned recovery policy, so the estimated action a is executed on the real system and new observation data is obtained.

This cycle is repeated until the real system returns to the normal state, that is, until the system status confirmation tool or the anomaly detection tool confirms that the system has recovered.
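The automatic recovery cycle can be sketched as a short loop. Every callable passed in (`observe`, `to_features`, `q_fn`, `execute`, `is_normal`) is a hypothetical stand-in for the real system, the preprocessing, the trained neural network, and the status confirmation tool; the `max_steps` safety limit is also an assumption, since the text only says the cycle repeats until the system is normal.

```python
def auto_recover(observe, to_features, q_fn, commands, execute, is_normal, max_steps=20):
    """Run the greedy recovery loop and return the list of executed commands.

    observe()        -> raw observation data from the real system
    to_features(obs) -> feature vector (state s)
    q_fn(state)      -> list of Q-values, one per entry in `commands`
    execute(command) -> applies the recovery command to the real system
    is_normal()      -> True once the status checker reports recovery
    """
    executed = []
    for _ in range(max_steps):
        if is_normal():
            break
        q_values = q_fn(to_features(observe()))
        best = max(range(len(commands)), key=q_values.__getitem__)
        execute(commands[best])
        executed.append(commands[best])
    return executed
```

No exploration and no parameter update take place here; the trained network is used purely for inference.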

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a diagram showing an example of the hardware configuration of the fault recovery device 10 according to the embodiment of the present invention. The fault recovery device 10 of FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are interconnected by a bus B.

The program that implements the processing in the fault recovery device 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101 and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files, data, and the like.

When an instruction to start the program is given, the memory device 103 reads the program from the auxiliary storage device 102 and stores it. The processor 104 is a CPU, a GPU (Graphics Processing Unit), or the like, and executes the functions of the fault recovery device 10 in accordance with the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.

FIG. 2 is a diagram showing an example of the functional configuration of the fault recovery device 10 according to the embodiment of the present invention. FIG. 2 shows the target system, a fault factor insertion device 20, a system status confirmation device 30, and the fault recovery device 10.

The target system is the learning system in the learning phase and the real system in the automatic recovery phase.

The fault factor insertion device 20 is a computer that has the fault factor insertion tool described above. In the learning phase, it inserts artificially generated fault factors into the target system.

The system status confirmation device 30 is a computer that checks the state of the target system and gives (inputs) a reward r based on the change in that state to the agent of the fault recovery device 10.

The target system, the fault factor insertion device 20, and the system status confirmation device 30 together constitute the environment in reinforcement learning.

The fault recovery device 10 includes a preprocessing unit 11, a recovery policy learning unit 12, an action execution unit 13, and the like. Each of these units is realized by processing that one or more programs installed in the fault recovery device 10 cause the processor 104 to execute. The fault recovery device 10 also uses an experience history storage unit 14, which can be realized using, for example, the memory device 103.

The preprocessing unit 11 converts observation data such as logs and metrics observed (acquired) in the target system into a feature vector.

The recovery policy learning unit 12 updates the parameters of the neural network based on the history of experiences (s, a, r, s') stored in the experience history storage unit 14, and uses the neural network to decide which action to execute on the target system. Here, an action is the input of a recovery command.

The action execution unit 13 executes the action (recovery command) decided by the recovery policy learning unit 12 on the target system.

The fault factor insertion device 20 and the system status confirmation device 30 may be included in the fault recovery device 10.

In the learning phase, a process called an episode is repeated many times. For example, if there are five fault factors (combinations of fault type and fault location) and 100 episodes are to be executed, then the step of randomly selecting one of the five fault factors and executing an episode is repeated 100 times. One episode consists of the following processing procedure. FIG. 3 is a flowchart illustrating an example of the processing procedure executed by the fault recovery device 10 in one episode of the learning phase. Hereinafter, the fault factor selected for the episode is called the "target factor."
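Selecting the target factor for each episode can be sketched as below. The fault factor tuples are made up for illustration. The optional `weights` argument is an assumption: it shows one way the insertion frequency of each fault factor could be biased, for instance toward factors the network has not yet learned to recover from, in line with the summary's statement that insertion counts vary with the degree of learning. How such weights would be derived is not specified here.

```python
import random


def choose_fault_factor(factors, weights=None, rng=random):
    """Pick one (fault type, location) pair for the next episode.

    With weights=None the choice is uniform, matching the random selection
    described in the text. Non-uniform weights are a hypothetical extension.
    """
    return rng.choices(factors, weights=weights, k=1)[0]


# Hypothetical fault factors: (fault type, insertion location).
fault_factors = [
    ("link_down", "router1"),
    ("process_kill", "server1"),
    ("cpu_stress", "server2"),
]
rng = random.Random(0)
schedule = [choose_fault_factor(fault_factors, rng=rng) for _ in range(100)]
```

With a seeded `random.Random` instance the episode schedule is reproducible, which helps when comparing learning runs.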

When the fault factor insertion device 20 inserts the target factor into the learning system and an artificial fault thereby occurs in the learning system (Yes in S101), the preprocessing unit 11 acquires observation data from the learning system in the state where the fault is occurring (S102). The preprocessing unit 11 then converts the observation data into a feature vector and inputs the feature vector to the recovery policy learning unit 12 (S103).

Next, the recovery policy learning unit 12 inputs the feature vector into the neural network to decide (select) the recovery command to be input to the learning system (S104). For example, the neural network outputs a probability for each recovery command, and the recovery policy learning unit 12 selects the recovery command with the highest probability. The action execution unit 13 then executes (inputs) the recovery command decided by the recovery policy learning unit 12 on the learning system (S105).

Next, the recovery policy learning unit 12 acquires a reward from the system status confirmation device 30 (S106). For example, the system status confirmation device 30 checks the change in the state of the learning system and gives the recovery policy learning unit 12 a reward according to that change. The preprocessing unit 11 then acquires observation data from the learning system after the state change, that is, after execution of the recovery command (S107), converts the observation data into a feature vector, and inputs the feature vector to the recovery policy learning unit 12 (S108).

Next, the recovery policy learning unit 12 stores an experience (s, a, r, s') in the experience history storage unit 14 (S109), where the state s is the feature vector generated in step S103, the action a is the recovery command determined in step S104, the reward r is the reward acquired in step S106, and the state s' is the feature vector generated in step S108. The experience history storage unit 14 thus accumulates a history of experiences.

Next, the recovery policy learning unit 12 randomly draws experiences stored in the experience history storage unit 14 as a mini-batch and updates the parameters of the neural network (that is, trains the neural network) (S110). The recovery policy learning unit 12 then determines whether the learning system is in a normal state (whether it has recovered from the fault), for example based on the state acquired from the system state confirmation device 30 (S111).
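The storing of experiences (S109) and the random mini-batch extraction (S110) correspond to what is commonly called an experience replay buffer. The following is a minimal sketch under assumptions: the class name `ExperienceHistory`, the fixed capacity, and the uniform sampling are illustrative choices, not details given in the embodiment.

```python
import random
from collections import deque, namedtuple

# One experience (s, a, r, s'): state, action, reward, next state.
Experience = namedtuple("Experience", ["s", "a", "r", "s_next"])


class ExperienceHistory:
    """Bounded history of experiences with uniform random mini-batch sampling."""

    def __init__(self, capacity=10000):
        # The oldest experiences are discarded once the capacity is reached (assumed behavior).
        self._buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self._buffer.append(Experience(s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Draw without replacement; never ask for more than is stored.
        k = min(batch_size, len(self._buffer))
        return rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)
```

Sampling at random rather than replaying experiences in order reduces the correlation between consecutive training samples, which is the usual motivation for this structure in deep reinforcement learning.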

If the learning system is not in a normal state (No in S111), the recovery policy learning unit 12 determines whether the number of actions (the number of recovery commands executed) is equal to or greater than a threshold (S112). The number of actions is equal to the number of repetitions of step S104 and the subsequent steps. If the number of actions is less than the threshold (No in S112), step S104 and the subsequent steps are repeated.

On the other hand, if the learning system is in a normal state (Yes in S111), or if the number of actions is equal to or greater than the threshold (Yes in S112), the processing procedure of FIG. 3 ends for one episode. When the procedure of FIG. 3 ends because the number of actions has reached the threshold, the learning system may be returned to the normal state, for example by using a backup of its initial state. Doing so reduces the influence of the fault that occurred in the current episode on the next episode.
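The two termination conditions of an episode (recovery in S111, action-count threshold in S112) can be sketched as a loop. Everything below is a toy stand-in under assumptions: `ToySystem`, its reward values, and `run_episode` are hypothetical, and the neural network update of S110 is omitted so that only the control flow is shown.

```python
class ToySystem:
    """Hypothetical learning system: a set of containers that currently have a fault."""

    def __init__(self, faulty):
        self.faulty = set(faulty)

    def observe(self):
        return tuple(sorted(self.faulty))

    def execute(self, container):
        # Assumed reward: +1.0 when the command repairs a fault, -0.25 otherwise.
        if container in self.faulty:
            self.faulty.discard(container)
            return 1.0
        return -0.25

    def is_normal(self):
        return not self.faulty

    def restore_backup(self):
        # Return to the fault-free initial state.
        self.faulty.clear()


def run_episode(system, choose_action, max_actions):
    """Repeat observe, act, check until recovery or until max_actions is reached."""
    state = system.observe()
    total_reward = 0.0
    for n_actions in range(1, max_actions + 1):
        total_reward += system.execute(choose_action(state))
        state = system.observe()
        if system.is_normal():          # Yes in S111
            return total_reward, n_actions, True
    system.restore_backup()             # Yes in S112: reset before the next episode
    return total_reward, max_actions, False
```

For example, `run_episode(ToySystem({"a"}), lambda s: s[0], max_actions=5)` recovers after a single action, whereas a policy that never targets the faulty container exhausts the action budget and triggers the restore.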

When one episode ends, the next episode begins. For example, each time an episode ends, the recovery policy learning unit 12 determines whether a predefined learning convergence condition is satisfied (for example, the average of the reward sums obtained over the past several episodes exceeding a certain threshold), and ends the learning phase when the condition is satisfied.
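The example convergence condition can be sketched as a sliding-window average over per-episode reward sums. The class name `ConvergenceMonitor` and the default window and threshold values are assumptions chosen for illustration; the embodiment leaves both to the implementer.

```python
from collections import deque


class ConvergenceMonitor:
    """Signals convergence when the mean reward sum of the last `window` episodes exceeds `threshold`."""

    def __init__(self, window=10, threshold=0.95):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def record(self, episode_reward_sum):
        """Add one finished episode and report whether learning may stop."""
        self.recent.append(episode_reward_sum)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough episodes yet to fill the window
        return sum(self.recent) / len(self.recent) > self.threshold
```

Requiring a full window before declaring convergence avoids ending the learning phase on the strength of one or two lucky early episodes.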

FIG. 4 is a flowchart illustrating an example of the processing procedure executed by the failure recovery device 10 in the automatic recovery phase.

When a fault occurs in the real system (Yes in S201), the preprocessing unit 11 acquires observation data from the real system while the fault is present (S202). The preprocessing unit 11 then converts the observation data into a feature vector and inputs the feature vector to the recovery policy learning unit 12 (S203).

Next, the recovery policy learning unit 12 inputs the feature vector into the trained neural network (the trained recovery policy) to determine (select) the recovery command to be input to the real system (S204). For example, the trained neural network outputs a probability for each recovery command, and the recovery policy learning unit 12 selects the recovery command with the highest probability. The action execution unit 13 then executes (inputs) the recovery command determined by the recovery policy learning unit 12 on the real system (S205).

Steps S202 to S205 are repeated until the real system is determined to have returned to a normal state (that is, until it has recovered from the fault), for example based on the state acquired from the system state confirmation device 30 (S206).

Next, several problems that may arise in implementing this embodiment, and their solutions, are described.

[Problem 1]

In the learning phase of this embodiment, the step of collecting observation data and executing an action is repeated many times within each episode after a fault factor is inserted. Depending on the target system, a single such step may take a long time or be technically difficult. For example, after some fault factor is inserted into a server, recovering it with a command such as a restart or a restore from backup may take on the order of several minutes. Inserting the fault factor may itself be difficult. Because this embodiment needs enough data to perform deep reinforcement learning, the time and difficulty of each step must be kept as small as possible.

[Solution to Problem 1]

If such a problem exists, the learning system may be emulated in a virtual environment such as a container platform, and the learning phase may be carried out in that virtual environment. A container restarts in a matter of seconds, and an orchestrator such as Kubernetes can be used, so operations on multiple containers are also easy. Because containers are managed as image files, exactly the same environment can be reproduced. Preparing multiple identical environments and applying the method of this embodiment to them simultaneously also enables distributed learning that accumulates experiences (s, a, r, s') efficiently. Furthermore, various fault factor insertion tools have been developed in recent years as part of chaos engineering efforts (see C. Rosenthal, L. Hochstein, A. Blohowiak, N. Jones, and A. Basiri, "Chaos Engineering," O'Reilly Media, Incorporated, 2017), so a wide variety of faults can be inserted easily.

[Problem 2]

In this embodiment, deep reinforcement learning is performed while various fault factors are inserted, but choosing the fault factors to insert at random or uniformly is not necessarily the best approach. When the same fault is inserted multiple times, some faults yield similar observation data every time, so a recovery policy for them is acquired easily, whereas other faults yield observation data that fluctuates widely, so acquiring a recovery policy takes longer. It is desirable to concentrate fault factor insertion, and hence learning, on faults of the latter kind. The way the fault factors to insert are selected thus has a large effect on the accuracy of the recovery policy and on the time needed to acquire it.

[Solution to Problem 2]

To address this, observation data can be acquired effectively by selecting the fault factors to insert as follows. First, fault factors are inserted more or less at random, and episodes are repeated while deep reinforcement learning is performed. Then, partway through the learning phase (for example, after 50 episodes if at most 100 episodes are planned), the recovery policy learning unit 12 calculates, for each fault factor, the average of the reward sums of the most recent N episodes (for example, N = 5). If the average for a group of fault factors is equal to or greater than a threshold set as the criterion for a high reward sum, a recovery policy can be regarded as having been acquired for that group. For a group of fault factors whose average is below the threshold, on the other hand, the recovery policy does not yet cope with them, so deep reinforcement learning is continued with insertion focused on that group. In this case, for example, the recovery policy learning unit 12 may instruct the fault factor insertion device 20 to insert fault factors chosen at random from that group. In other words, the number of insertions may be changed for each fault factor (for each fault type and each fault factor insertion location) according to how far learning of the recovery policy has progressed. This makes it possible to efficiently learn a recovery policy that can cope with all of the fault factors.
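The selection method described above can be sketched as follows. The class name `FaultFactorSelector`, the treatment of fault factors with no recorded episodes as not yet learned, and the fallback to the full set once every factor is learned are assumptions made for this illustration.

```python
import random
from collections import deque


class FaultFactorSelector:
    """Tracks recent reward sums per fault factor and focuses insertion on unlearned ones."""

    def __init__(self, factors, n_recent=5, threshold=0.9):
        self.factors = list(factors)
        self.threshold = threshold
        # Keep only the most recent N reward sums for each fault factor.
        self.history = {f: deque(maxlen=n_recent) for f in self.factors}

    def record(self, factor, reward_sum):
        self.history[factor].append(reward_sum)

    def unlearned(self):
        """Fault factors whose recent average reward sum is below the threshold."""
        result = []
        for factor in self.factors:
            sums = self.history[factor]
            if not sums or sum(sums) / len(sums) < self.threshold:
                result.append(factor)
        return result

    def choose(self, focus, rng=random):
        """Pick the next fault factor: uniformly at first, then only among unlearned ones."""
        pool = self.unlearned() if focus else self.factors
        return rng.choice(pool or self.factors)
```

A caller would invoke `choose(focus=False)` during the initial random stage and switch to `choose(focus=True)` from the midpoint of the learning phase onward.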

[Problem 3]

Observation data can vary greatly not only with the inserted fault factor (fault type and fault factor insertion location) but also with conditions such as background traffic. Consequently, a recovery policy learned under fixed conditions may become useless as soon as the environment changes, for example when the traffic trend shifts.

[Solution to Problem 3]

If such a problem exists, a recovery policy that can cope with various conditions can be learned by, for example, using a traffic generation device to generate background traffic in the learning system and performing deep reinforcement learning while varying the environment to which the learning system belongs, such as the volume of that traffic. That is, the recovery policy learning unit 12 may train the neural network in a situation in which the environment to which the learning system belongs changes over time. In this case, the environment, such as the volume of background traffic, may be changed from one episode to the next, or may be changed while step S104 and the subsequent steps are being repeated within an episode.

Next, the results of an experiment actually conducted on this embodiment are described.

A Kubernetes cluster was created using Kubernetes, an orchestrator for container-based virtual environments, and a three-tier web environment was built in it. This container-based three-tier web environment is regarded as both the verification environment and the real environment of this embodiment. The environment consists of two sets of three containers (Nginx, Rails, and MySQL), for a total of six containers.

As background traffic, HTTP requests were generated at random using a load testing tool. As fault factors, either 80% packet loss or a delay of 10,000 ± 1,000 ms was inserted into at most two of the six containers.

The inbound and outbound traffic of each container (measured in bytes and in packets) was collected as observation data. The feature vector is a 24-dimensional vector obtained by scaling the observation data.
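The 24 dimensions are consistent with four traffic counters (inbound and outbound, each in bytes and in packets) for each of the six containers. The following sketch builds such a vector under assumptions: the container names, the metric names, and the fixed scaling constants are all hypothetical, since the text states only that the observation data is scaled.

```python
# Hypothetical container and metric names; 6 containers x 4 metrics = 24 dimensions.
CONTAINERS = ["nginx-1", "rails-1", "mysql-1", "nginx-2", "rails-2", "mysql-2"]
METRICS = ["rx_bytes", "tx_bytes", "rx_packets", "tx_packets"]

# Assumed scaling constants that bring each metric to roughly the 0-1 range.
SCALE = {
    "rx_bytes": 1_000_000.0,
    "tx_bytes": 1_000_000.0,
    "rx_packets": 1_000.0,
    "tx_packets": 1_000.0,
}


def to_feature_vector(observation):
    """Flatten per-container traffic counters into one scaled feature vector."""
    return [
        observation[container][metric] / SCALE[metric]
        for container in CONTAINERS
        for metric in METRICS
    ]


# Example observation: all counters zero except inbound bytes on the first container.
observation = {c: {m: 0.0 for m in METRICS} for c in CONTAINERS}
observation["nginx-1"]["rx_bytes"] = 500_000.0
vector = to_feature_vector(observation)
```

Keeping the container and metric order fixed matters, because the neural network associates each input position with a specific container and counter.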

The actions available to the agent are the execution of a command that regenerates a container. Because there are six containers, there are six actions in total. A container into which a fault factor has been inserted recovers when the regeneration command is issued for it.

The reward was defined as follows. When a fault was inserted at one location, r = 1.0 (recovery completed) or 0.25 (otherwise). When faults were inserted at two locations, r = 0.5 (one of the faults recovered) or -0.25 (otherwise). In other words, when the agent executes the recovery commands that are necessary and sufficient for recovery, it obtains a reward of +1.0, and when it executes a wrong command, it receives a penalty (a negative reward). Double DQN, an improved version of DQN, was adopted as the deep reinforcement learning algorithm (H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. of the 30th National Conference on American Association for Artificial Intelligence (AAAI), pp. 2094-2100, 2016), and learning was regarded as complete when the average reward obtained over the past 10 episodes exceeded 0.95.
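For reference, the defining feature of Double DQN is that the next action is chosen by the online network but evaluated by a separate target network. The sketch below computes the resulting learning target for a single experience. The function name and the discount factor are assumptions, and plain Python lists stand in for the network outputs; the experiment's actual hyperparameters are not given in the text.

```python
def double_dqn_target(reward, q_online_next, q_target_next, done, gamma=0.99):
    """Learning target for one experience (s, a, r, s') under Double DQN.

    q_online_next: action values for s' from the online network.
    q_target_next: action values for s' from the target network.
    done: True when s' ends the episode, in which case nothing is bootstrapped.
    """
    if done:
        return reward
    # Select the next action with the online network ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... but evaluate that action with the target network.
    return reward + gamma * q_target_next[best_action]


# Illustrative values: the online network prefers action 1, so the target
# network's value for action 1 (0.2) is used, not its own maximum (2.0).
target = double_dqn_target(0.5, [0.1, 0.9, 0.3], [1.0, 0.2, 2.0], done=False, gamma=0.9)
```

Decoupling selection from evaluation in this way counteracts the overestimation of action values that plain DQN tends to exhibit.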

The learning results are shown in FIG. 5. The horizontal axis is the number of episodes, and the vertical axis is the average reward obtained over the past 10 episodes. The blue solid line is the learning curve, and the horizontal dashed line is the learning-completion threshold (reward sum = 0.95). Learning completed in 732 episodes and took approximately 58.5 hours. These results show that, with appropriate learning, a recovery policy can be acquired automatically and the system can be recovered automatically, which demonstrates the effectiveness of this embodiment.

As described above, this embodiment requires neither training data nor a probabilistic model prepared in advance, yet makes it possible to automatically estimate the commands needed to recover from a fault on the basis of observation data, such as logs and metrics, that is high-dimensional, fluctuates stochastically, and has complex correlations.

Because neither training data nor a probabilistic model prepared in advance is needed, a recovery policy that outputs appropriate recovery commands from high-dimensional observation data can be acquired even when a sufficient amount of training data has not been accumulated or when a probabilistic model of the system behavior is difficult to create. This contributes to the automation and improved accuracy of fault recovery.

In this embodiment, the learning system is an example of a first system. The real system is an example of a second system. The preprocessing unit 11 is an example of a first acquisition unit and a second acquisition unit. The observation data and its feature vector are examples of first data and second data. The action execution unit 13 is an example of a first execution unit and a second execution unit. The recovery policy learning unit 12 is an example of a third acquisition unit and a learning unit.

Although an embodiment of the present invention has been described in detail above, the present invention is not limited to that specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

REFERENCE SIGNS LIST
10 Failure recovery device
11 Preprocessing unit
12 Recovery policy learning unit
13 Action execution unit
14 Experience history storage unit
20 Fault factor insertion device
30 System state confirmation device
100 Drive device
101 Recording medium
102 Auxiliary storage device
103 Memory device
104 Processor
105 Interface device
B Bus

Claims

1. A failure recovery device comprising:
a first acquisition unit that acquires first data observed in a first system after an artificial fault factor is inserted;
a first execution unit that inputs the first data into a neural network and executes a command estimated by the neural network on the first system;
a second acquisition unit that acquires second data observed in the first system after execution of the command;
a third acquisition unit that acquires a reward obtained by executing the command; and
a learning unit that trains the neural network by deep reinforcement learning based on the first data, the command, the reward, and the second data,
wherein the number of times each artificial fault factor is inserted into the first system is changed according to the degree of learning of the neural network.

2. The failure recovery device according to claim 1, further comprising a second execution unit that inputs data observed in a second system in which a fault has occurred into the trained neural network, and executes a command estimated by the neural network on the second system.

3. The failure recovery device according to claim 1 or 2, wherein the first system is a virtual environment.

4. The failure recovery device according to claim 1, wherein the learning unit trains the neural network in a situation in which the environment to which the first system belongs changes over time, and wherein the failure recovery device is a data recovery device.

5. A failure recovery method in which a computer executes:
a first acquisition procedure of acquiring first data observed in a first system after an artificial fault factor is inserted;
a first execution procedure of inputting the first data into a neural network and executing a command estimated by the neural network on the first system;
a second acquisition procedure of acquiring second data observed in the first system after execution of the command;
a third acquisition procedure of acquiring a reward obtained by executing the command; and
a learning procedure of training the neural network by deep reinforcement learning based on the first data, the command, the reward, and the second data,
wherein the number of times each artificial fault factor is inserted into the first system is changed according to the degree of learning of the neural network.

6. The failure recovery method according to claim 5, wherein the computer further executes a second execution procedure of inputting data observed in a second system in which a fault has occurred into the trained neural network, and executing a command estimated by the neural network on the second system.

7. A program that causes a computer to execute:
a first acquisition procedure of acquiring first data observed in a first system after an artificial fault factor is inserted;
a first execution procedure of inputting the first data into a neural network and executing a command estimated by the neural network on the first system;
a second acquisition procedure of acquiring second data observed in the first system after execution of the command;
a third acquisition procedure of acquiring a reward obtained by executing the command; and
a learning procedure of training the neural network by deep reinforcement learning based on the first data, the command, the reward, and the second data,
wherein the number of times each artificial fault factor is inserted into the first system is changed according to the degree of learning of the neural network.

8. The program according to claim 7, wherein the program further causes the computer to execute a second execution procedure of inputting data observed in a second system in which a fault has occurred into the trained neural network, and executing a command estimated by the neural network on the second system.