JP7001422B2

JP7001422B2 - Information processing apparatus, information processing method, and program

Info

Publication number: JP7001422B2
Application number: JP2017203645A
Authority: JP
Inventors: 魁相原
Original assignee: Lifull Co Ltd
Current assignee: Lifull Co Ltd
Priority date: 2017-10-20
Filing date: 2017-10-20
Publication date: 2022-01-19
Anticipated expiration: 2037-10-20
Also published as: JP2019079120A

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

In systems that operate equipment such as servers, the load is often predicted and the system is optimized based on the predicted load (see, for example, Patent Document 1). Patent Document 1 describes performing load prediction based on actual load data that is collected periodically. The load prediction described in Patent Document 1 takes into account, among other things, fluctuations in daily patterns, in which the load changes in a similar way at each time of day.

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-021497

Although Patent Document 1 discloses improving prediction accuracy, it does not sufficiently describe how the predicted load is to be used.

Some aspects of the present invention have been made in view of the above problem, and one object thereof is to provide an information processing apparatus, an information processing method, and a program that make it possible to operate a system suitably based on load prediction.

An information processing apparatus according to one aspect of the present invention includes: a first prediction unit that uses a prediction model, generated by deep learning on training data in which each time is associated with a measured system load (the system load measured at that time), to generate a predicted system load for a second time later than a first time, based on the measured system loads detected up to the first time; a detection unit that detects the measured system load at the second time; and an output unit that, when the difference between the predicted system load and the measured system load at the second time exceeds a threshold, outputs an indication to that effect.

In an information processing method according to one aspect of the present invention, an information processing apparatus performs: a step of using a prediction model, generated by deep learning on training data in which each time is associated with a measured system load (the system load measured at that time), to generate a predicted system load for a second time later than a first time, based on the measured system loads detected up to the first time; a step of detecting the measured system load at the second time; and a step of outputting, when the difference between the predicted system load and the measured system load at the second time exceeds a threshold, an indication to that effect.

A program according to one aspect of the present invention causes a computer to execute: a process of using a prediction model, generated by deep learning on training data in which each time is associated with a measured system load (the system load measured at that time), to generate a predicted system load for a second time later than a first time, based on the measured system loads detected up to the first time; a process of detecting the measured system load at the second time; and a process of outputting, when the difference between the predicted system load and the measured system load at the second time exceeds a threshold, an indication to that effect.

In the present invention, the terms "unit", "means", "device", and "system" do not merely denote physical means; they also cover cases in which the functions of the "unit", "means", "device", or "system" are implemented in software. Further, the functions of one "unit", "means", "device", or "system" may be implemented by two or more physical means or devices, and the functions of two or more "units", "means", "devices", or "systems" may be implemented by a single physical means or device.

FIG. 1 is a diagram showing the functional configuration of a system including a system monitoring device, which is an embodiment of the information processing apparatus.
FIG. 2 is a flowchart showing the processing flow of the system monitoring device shown in FIG. 1.
FIG. 3 is a flowchart showing the processing flow of the system monitoring device shown in FIG. 1.
FIG. 4 is a diagram showing a specific example of the hardware configuration of the system monitoring device shown in FIG. 1.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below are merely examples, and there is no intention to exclude modifications or applications of techniques that are not explicitly stated below. That is, the present invention can be implemented with various modifications without departing from its spirit. In the following description of the drawings, identical or similar parts are denoted by identical or similar reference numerals. The drawings are schematic and do not necessarily match actual dimensions, ratios, and the like. The drawings may also include parts whose dimensional relationships or ratios differ from one drawing to another.

[Embodiment]
[1 Overview]
When various information processing services, such as web services, are provided using an information processing system made up of multiple servers, the load on the information processing system (processor processing capacity, network communication capacity, memory usage, storage usage, and so on) changes from moment to moment, for example because requests from users are constantly changing. If the system load exceeds the processing capacity of the information processing system, serious consequences such as a service outage may follow. Measuring the load on the information processing system and detecting anomalies is therefore extremely important.

Conventionally, anomalies in an information processing system have been detected by having an administrator manually set static monitoring thresholds. In this case, a system anomaly is detected when the system load exceeds or falls below a monitoring threshold. In recent years, however, system requirements have become more complex, and manually setting reference monitoring thresholds for the system load is no longer sufficient to detect every system anomaly. For example, monitoring thresholds are customarily set to detect only high-load conditions, so low-load anomalies, such as a drop in the cache hit rate accompanying a decrease in the load on cache storage, are often detected late or missed altogether. Setting monitoring thresholds that can detect such low-load anomalies requires a thorough investigation of the characteristics of each information processing system, so it is difficult to set appropriate monitoring thresholds for every system.

In addition, the load trends of many services, such as web services, include seasonal factors. When the system load rises or falls, it is usually left to the administrator to judge whether the change is due to a system anomaly or to a seasonal factor, and this drives up operating costs. Furthermore, when the system load becomes high, measures such as increasing the number of servers are taken, but these are after-the-fact responses made only once the high load has been detected, so performance tends to degrade temporarily after the anomaly is detected. It is therefore desirable to detect an increase in load in advance and take measures such as increasing the number of servers.

The system monitoring device according to the present embodiment therefore performs load prediction using a prediction model based on deep learning, and performs anomaly detection based on that prediction. Because the system monitoring device automatically detects anomalies and notifies the administrator, it helps prevent anomalies from being detected late or missed, and it reduces the administrator's investigation workload. In addition, by predicting the future system load, the number of servers can be increased ahead of a high load, preventing degradation of service performance.

Load prediction in the system monitoring device of the present embodiment uses a prediction model based on LSTM (Long Short-Term Memory). Because LSTM is a recurrent neural network capable of long-term memory, it enables fast load prediction that takes long-term seasonal factors into account. By comparing this predicted load with the measured load to detect anomalies, the system monitoring device according to the present embodiment enables fast anomaly detection that takes long-term seasonal factors into account.

When generating a prediction model with LSTM, load information covering a long period must be learned in order for long-term seasonal factors to be reflected in the model. In other words, all of the long-term load information would have to be retained as training data. Retaining all past load information, however, requires an enormous storage capacity, and training on it takes an enormous amount of time. The system monitoring device of the present embodiment therefore retrains the already trained prediction model by transfer learning, using only the load information newly obtained since the previous training. The system monitoring device then needs to retain and learn only the load information obtained after the point used when the prediction model was last generated, which reduces both the cost of retaining load information and the time required for retraining.

Further, when an anomaly is detected, the system monitoring device of the present embodiment executes suitable operation information (a workflow) according to the detected anomaly and the predicted load. This reduces the operational burden on the administrator.

[2 Functional configuration]
[2.1 Overview of information processing system 1]
The overall configuration of the information processing system 1, which includes the system monitoring device 100 according to the present embodiment, is described below with reference to FIG. 1. The information processing system 1 broadly comprises servers 200a to 200n (hereinafter collectively referred to as the servers 200), which provide various services such as web services, and the system monitoring device 100, which monitors the system load status of the servers 200.

The system monitoring device 100 is communicably connected to each of the servers 200 and continuously monitors at least the system load status of the servers 200, for example the processing status of the CPU (Central Processing Unit), network communication capacity, memory usage, and storage usage. When a system anomaly is detected, the system monitoring device 100 also takes action (an operation) to deal with it, such as bringing a new server 200 into operation. The services provided by the servers 200 are not limited to web services and may be of any kind.

As shown in FIG. 1, the system monitoring device 100 includes a load detection unit 101, a load information DB (database) 103, a first prediction unit 107, a second prediction unit 109, a learning unit 111, an abnormality determination unit 113, an output unit 115, a coping unit 117, and an operation DB 119.

In the system monitoring device 100 according to the present embodiment, the first prediction unit 107 predicts the system load at the current time (call it time T) using the load information 105 up to a point a fixed interval before the current time (for example, time T - t1, where t1 > 0). The result is the predicted system load. Meanwhile, the load detection unit 101 detects the actual system load at the current time T (the measured system load). The abnormality determination unit 113 compares the predicted system load and the measured system load at time T, and if the two diverge by more than a threshold, it treats this as an unexpected anomaly and notifies the administrator, for example, via the output unit 115.

In parallel, the second prediction unit 109 uses the load information 105 up to the current time T to predict the system load at a point a fixed interval after the current time T (time T + t2, where t2 > 0). When the abnormality determination unit 113 detects an anomaly, the coping unit 117 reads the operation information 121 corresponding to the system load expected at the future time T + t2 and, based on that operation information 121, can deal with both the detected system anomaly and the system load expected in the future.

In the present embodiment, the operation information 121 corresponding to the predicted system load at time T + t2 is read and acted upon, but the invention is not limited to this. For example, the operation information 121 corresponding to the type of anomaly detected by the abnormality determination unit 113 may be read and acted upon instead. In that case, the system monitoring device 100 does not necessarily need to include the second prediction unit 109.

[2.2 Functions of system monitoring device 100]
The functions of the system monitoring device 100 are described below. The functions of the system monitoring device 100 shown in FIG. 1 need not be physically implemented as a single device and may be implemented by multiple cooperating computers. For example, the learning unit 111, which generates and retrains the prediction model 107a and the prediction model 109a held by the first prediction unit 107 and the second prediction unit 109, may be implemented on a different device from the other functions.

The load detection unit 101 is provided so that it can communicate with each of the servers 200, and it detects the actual system load of the servers 200. Various detection methods are possible: for example, the load detection unit 101 may receive from a server 200 information that the server has measured itself, or it may measure the system load by observing the operation of the server 200. The load detection unit 101 can be configured to detect multiple types of system load, such as processor utilization and memory usage.

The load information DB 103 manages the load information 105. The load information 105 includes at least the value of the system load detected by the load detection unit 101 and the time at which it was detected or measured (either an absolute time or a time relative to a predetermined reference point). If multiple types of load are monitored, for example processor utilization, memory usage, and network bandwidth usage, the value of each system load is associated with time information and stored in the load information DB 103 as load information 105.
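As an illustration only, the load information 105 could be modeled as timestamped records keyed by load type. The names below (`LoadRecord`, `LoadInfoDB`, `kind`, `up_to`) and the in-memory storage are hypothetical and do not appear in the specification.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LoadRecord:
    """One entry of load information: a measured value tied to a time."""
    time: float   # absolute or relative time, here in seconds (assumption)
    kind: str     # load type, e.g. "cpu", "memory", "network" (assumed labels)
    value: float  # measured system load


class LoadInfoDB:
    """In-memory stand-in for the load information DB 103."""

    def __init__(self) -> None:
        self._records: Dict[str, List[LoadRecord]] = defaultdict(list)

    def store(self, record: LoadRecord) -> None:
        """Store a record under its load type."""
        self._records[record.kind].append(record)

    def up_to(self, kind: str, time: float) -> List[LoadRecord]:
        """Return records of one load type measured at or before `time`, in time order."""
        return sorted(
            (r for r in self._records[kind] if r.time <= time),
            key=lambda r: r.time,
        )
```

Keeping each load type in its own series matches the per-type handling described in this section, but a real deployment would more likely use a time-series database.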

The first prediction unit 107 and the second prediction unit 109 use the prediction model 107a and the prediction model 109a, respectively, to predict the system load a predetermined time ahead (for example, 30 minutes ahead) based on the load information 105 obtained from the load information DB 103. For example, based on the load information 105 up to time Tx, the prediction model 107a and the prediction model 109a can calculate the system load a time t ahead of time Tx (t > 0; the same applies below), that is, the system load at time Tx + t. The first prediction unit 107 and the second prediction unit 109 perform this prediction periodically, for example every 60 seconds.

A separate first prediction unit 107 and second prediction unit 109 may be provided for each type of monitored system load, such as processor utilization, memory usage, and network bandwidth usage, or a single pair may be provided that processes multiple system loads together. Here, they are assumed to be provided per type of system load. In that case, the learning unit 111 described later trains and retrains the prediction model 107a and the prediction model 109a for each type of system load.

The prediction model 107a and the prediction model 109a may be the same model or different models. If they are different, the prediction model 107a predicts the system load at time Tx + t1 (t1 > 0) based on the load information 105 up to time Tx, and the prediction model 109a predicts the system load at time Tx + t2 (t2 > 0) based on the load information 105 up to time Tx. In this description, the prediction model 107a used by the first prediction unit 107 and the prediction model 109a used by the second prediction unit 109 are assumed to be identical; that is, given the same load information 105 as input, they produce the same predicted system load. Making the prediction models 107a and 109a identical means that training and retraining each need to be performed only once to produce both models, which lowers operating costs.

The load information 105 that the prediction model 107a and the prediction model 109a take as input may be only the load information 105 at a single time Tx, or it may be the load information 105 over a certain time span, for example from time Tx - t to time Tx.
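The two input options can be pictured as a sliding window over a regularly sampled load series. The sketch below is one possible way to build input/target pairs for such a model; the function name `make_windows` and the parameters `window` and `horizon` (both counted in samples) are assumptions for illustration, not terms from the specification.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], window: int, horizon: int
) -> List[Tuple[List[float], float]]:
    """Pair each input window ending at index x with the value at index x + horizon.

    window=1 uses only the load at time Tx; window>1 uses the loads from
    Tx-(window-1) through Tx. `horizon` is the look-ahead t, in samples.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    pairs = []
    for x in range(window - 1, len(series) - horizon):
        inputs = list(series[x - window + 1 : x + 1])  # loads up to and including Tx
        target = series[x + horizon]                   # load at Tx + t
        pairs.append((inputs, target))
    return pairs
```

For a series `[1, 2, 3, 4, 5]` with `window=2` and `horizon=1`, the first pair is `([1, 2], 3)`: the loads at the two most recent times are the input and the load one step ahead is the target.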

The first prediction unit 107 and the second prediction unit 109 differ in the load information 105 they receive as input. For example, the first prediction unit 107 takes the load information 105 up to time T - t as input and predicts the system load at time T, whereas the second prediction unit 109 takes the load information 105 up to time T as input and predicts the system load at time T + t. When time T is the current time, the first prediction unit 107 calculates the system load at the current time T as predicted from the system operating status up to the past time T - t (for example, 30 minutes before the current time). The second prediction unit 109, on the other hand, calculates the system load at the future time T + t (for example, 30 minutes after the current time) from the system operating status up to the current time T.
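The difference between the two prediction units comes down to where the input series is cut. The following sketch shows this with a deliberately naive placeholder in place of the trained LSTM model; `naive_model`, `predict_now_and_future`, and the index-based time axis are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def naive_model(history: Sequence[float]) -> float:
    """Placeholder for the trained model: simply repeats the most recent load."""
    return history[-1]


def predict_now_and_future(
    loads: Sequence[float], now: int, t: int, model: Model
) -> Tuple[float, float]:
    """Return (predicted load at `now`, predicted load at `now + t`).

    `loads[i]` is the measured load at sample index i; `now` is the index of T.
    """
    if t < 1 or now - t < 0 or now >= len(loads):
        raise ValueError("need measurements from index 0 through `now`, with t <= now")
    # First prediction unit: sees data only up to T - t, estimates the load at T.
    predicted_now = model(loads[: now - t + 1])
    # Second prediction unit: sees data up to T, estimates the load at T + t.
    predicted_future = model(loads[: now + 1])
    return predicted_now, predicted_future
```

Because the same `model` is passed to both calls, this mirrors the case where the prediction models 107a and 109a are identical and only the inputs differ.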

The learning unit 111 performs the training that generates the prediction model 107a and the prediction model 109a, and also retrains them. As noted above, the prediction models 107a and 109a are identical in the present embodiment, so the description here refers to generating and retraining the prediction model 107a. If the prediction model 107a and the prediction model 109a are made different, the training and retraining described below are simply performed for each of them.

The learning unit 111 generates the prediction model 107a by LSTM training, using the load information 105 stored in the load information DB 103 as training data. As noted above, the load on the services provided by the servers 200 varies with seasonal factors such as the season, month, day of the week, and time of day. The load information 105 used when first generating the prediction model 107a should therefore cover as long a period as possible. Because LSTM is a recurrent neural network capable of long-term memory, training it on load information 105 covering a longer period makes it possible to generate a prediction model 107a that takes long-term seasonal factors into account.

The learning unit 111 also has a function for retraining an already generated prediction model 107a by transfer learning. If, for example, the prediction model 107a was generated using the load information 105 up to some time Tn as training data, the learning unit 111 can retrain the prediction model 107a using the load information 105 from time Tn+1 (any time later than Tn on the time axis) through time Tn+x as training data. Once retraining on the load information 105 up to time Tn+x has finished and that load information 105 is reflected in the prediction model 107a, the load information 105 up to time Tn+x can be deleted from the load information DB 103. The load information 105 then no longer needs to be retained for long periods, which keeps down the operating cost of the system monitoring device 100.
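The retrain-then-delete cycle can be sketched as follows. This is not the transfer-learning procedure itself, which the text does not detail: `CountingModel.update` merely stands in for a retraining step, and `retrain_and_prune`, `trained_until`, and `until` are hypothetical names.

```python
from typing import List, Protocol, Tuple

Sample = Tuple[float, float]  # (time, measured load)


class Retrainable(Protocol):
    def update(self, samples: List[Sample]) -> None: ...


class CountingModel:
    """Stand-in for prediction model 107a; it only counts the samples it has seen."""

    def __init__(self) -> None:
        self.samples_seen = 0

    def update(self, samples: List[Sample]) -> None:
        self.samples_seen += len(samples)


def retrain_and_prune(
    model: Retrainable, stored: List[Sample], trained_until: float, until: float
) -> List[Sample]:
    """Retrain on samples in (trained_until, until], then drop everything up to `until`.

    `trained_until` plays the role of Tn and `until` the role of Tn+x.
    Returns the samples that must still be kept in storage.
    """
    new_samples = sorted(s for s in stored if trained_until < s[0] <= until)
    if new_samples:
        model.update(new_samples)
    # Samples up to `until` are now reflected in the model and can be deleted.
    return [s for s in stored if s[0] > until]
```

After a call, the caller would record `until` as the new `trained_until`, so that the next periodic run picks up only later samples.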

The learning unit 111 can retrain the prediction model 107a automatically at regular intervals, for example once a day or once a week. Retraining regularly reduces the amount of load information 105 used as training data in each retraining run, and therefore reduces the time each retraining of the prediction model 107a takes.

The abnormality determination unit 113 detects anomalies based on the difference between the predicted system load produced by the first prediction unit 107 and the actual system load (measured system load) detected by the load detection unit 101. Specifically, if the predicted system load at time T, predicted from the system operating status up to time T - t, and the system load actually observed at time T diverge by, for example, a threshold or more, the abnormality determination unit 113 determines that an anomaly has occurred. The reasoning is that the predicted system load generated by the first prediction unit 107 is calculated from the system operating status up to a fixed time t earlier, with seasonal factors taken into account. A large gap between this predicted value and the measured value therefore suggests that something unexpected has happened, for example that an unanticipated volume of processing requests is arriving or that some part of a server 200 is not operating properly.
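The check itself is small. The sketch below assumes the divergence is measured as an absolute difference, so that loads both well above and well below the prediction are flagged; the function name and the percentage units in the usage note are illustrative assumptions.

```python
def is_anomalous(predicted: float, measured: float, threshold: float) -> bool:
    """Flag an anomaly when prediction and measurement diverge by the threshold or more.

    The absolute difference is an assumption: it treats unexpectedly low load
    the same way as unexpectedly high load.
    """
    return abs(predicted - measured) >= threshold
```

With a predicted CPU load of 50% and a threshold of 30 points, a measured load of 90% or of 15% would both be flagged, whereas 60% would not. A static upper threshold alone would miss the 15% case.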

The output unit 115 outputs the result of the determination made by the abnormality determination unit 113. Possible output methods include notifying the administrator by displaying a message on a display device, producing audio output from a speaker, or sending a message by e-mail or a messenger application, as well as writing the time and the normal or abnormal status at that time to a log file.

When the abnormality determination unit 113 detects an anomaly, the coping unit 117 takes action (an operation) in response. In doing so, the coping unit 117 can refer to the operation information 121 stored in advance in the operation DB 119. The operations described in the operation information 121 may include, depending on the type of system load in which the anomaly was detected, starting a dormant server 200, stopping some of the servers 200, or restarting a server 200.

The coping unit 117 may also act according to the predicted future system load. In this case, the coping unit 117 refers to the operation information 121 using both the abnormality determination result for the current time T output from the abnormality determination unit 113 and the predicted system load at time T+t generated by the second prediction unit 109. Thus, for example, when the abnormality determination unit 113 detects an abnormality at the current time T, the coping unit 117 can take measures according to the system load at time T+t calculated by the second prediction unit 109. This also enables operations such as starting a new server if the predicted system load at time T+t is high, while taking no action if the predicted system load at time T+t is within the normal range.
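The selection logic described in the two paragraphs above can be sketched as a table lookup. Everything named here is hypothetical: the contents of `OPERATION_INFO`, the load-type keys, the operation names, and the single upper bound `normal_upper` standing in for the "normal range" are illustrative assumptions, not the format of the operation information 121.

```python
from typing import Optional

# Hypothetical stand-in for operation information 121:
# (type of system load, forecast level) -> operation to execute.
OPERATION_INFO = {
    ("cpu", "high"): "start_standby_server",
    ("memory", "high"): "restart_server",
}


def select_operation(abnormal_now: bool, load_type: str,
                     forecast_load: float, normal_upper: float) -> Optional[str]:
    """Choose an operation from the current judgment and the forecast for T+t."""
    if not abnormal_now:
        return None  # no abnormality at time T: nothing to do
    if forecast_load <= normal_upper:
        return None  # forecast for T+t is within the normal range: take no action
    return OPERATION_INFO.get((load_type, "high"))


print(select_operation(True, "cpu", forecast_load=95.0, normal_upper=80.0))  # start_standby_server
print(select_operation(True, "cpu", forecast_load=50.0, normal_upper=80.0))  # None
```

Keeping the mapping in data rather than in code mirrors the role of the operation DB 119: operators can change the response without changing the monitoring logic.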

[3 Processing flow]
The processing flow of the system monitoring device 100 is described below with reference to FIGS. 2 and 3. FIGS. 2 and 3 are flowcharts showing the processing flow of the system monitoring device 100.

The processing steps described below may be reordered arbitrarily or executed in parallel as long as no contradiction arises in the processing content, and other steps may be added between them. Furthermore, a step described as a single step for convenience may be executed as multiple steps, and steps described separately for convenience may be executed as a single step.

[3.1 Processing flow for abnormality detection]
First, the processing by which the system monitoring device 100 detects a system abnormality is described with reference to FIG. 2. FIG. 2 is a flowchart showing the processing flow of the system monitoring device 100 for system abnormality detection.

First, the load detection unit 101 detects the system load (measured system load) of the server 200 at the current time T (S201). As described above, the load detection unit 101 may detect the system load either by itself observing the operating status of the server 200 and the like, or by receiving it from the server 200. The detected measured system load is stored in the load information DB 103 as load information 105 in association with time T.

The first prediction unit 107 reads the load information 105 stored in the load information DB 103 up to time T-t, which is a time t (t > 0) earlier than time T, and inputs it into the prediction model 107a to calculate the predicted system load at time T (S203).

The abnormality determination unit 113 calculates the difference between the predicted system load at time T calculated by the first prediction unit 107 and the measured system load at time T detected by the load detection unit 101 (S205). If the difference between the predicted system load and the measured system load is equal to or greater than a predetermined threshold value (Yes in S207), the abnormality determination unit 113 determines that an abnormality has occurred and causes the output unit 115 to output an indication to that effect (S209). By viewing this output, the administrator can recognize that some abnormality has occurred in the system composed of the servers 200 and can take appropriate action against it.

The second prediction unit 109 reads the load information 105, that is, the measured system load up to time T, and inputs it into the prediction model 109a to calculate the predicted system load at time T+t (S211). The coping unit 117 reads from the operation DB 119 the operation information 121 corresponding to the abnormality detected by the abnormality determination unit 113 and to the predicted system load at time T+t, and executes the operation (S215). When the administrator is the one who mainly handles system abnormalities, the processing of S211 to S215 need not be performed.

If, in S207, the deviation between the predicted system load and the measured system load is less than the threshold value (No in S207), the processing of S209 to S215 is unnecessary.

When system management by the system monitoring device 100 is to continue (No in S217), the system monitoring device 100 updates time T to T+1 (a time a predetermined unit later on the time axis, for example 30 seconds after time T) (S219) and performs the processing from S201 again.
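The detection loop of FIG. 2 can be sketched as below, under several stated assumptions: time is a list index, `lag` plays the role of t, the list `load_info` stands in for the load information DB 103, and `mean_predictor` is a deliberately naive placeholder for the prediction model 107a (the embodiment uses a model generated by deep learning). None of these names appear in the original.

```python
from typing import Callable, Dict, List, Sequence


def mean_predictor(history: List[float]) -> float:
    """Naive placeholder for prediction model 107a: the mean of past loads."""
    return sum(history) / len(history)


def monitor(measurements: Sequence[float],
            predict: Callable[[List[float]], float],
            lag: int, threshold: float) -> Dict[int, bool]:
    """Run the detection loop and return {time index T: abnormal?}."""
    load_info: List[float] = []   # stands in for load information DB 103
    results: Dict[int, bool] = {}
    for t_now, measured in enumerate(measurements):  # S219: T advances each pass
        load_info.append(measured)                   # S201: detect and store load at T
        end = max(0, t_now - lag + 1)
        history = load_info[:end]                    # records up to time T - lag
        if not history:
            continue                                 # not enough history to predict yet
        predicted = predict(history)                 # S203: predicted load at T
        diff = abs(predicted - measured)             # S205: difference
        results[t_now] = diff >= threshold           # S207: compare with threshold
    return results


print(monitor([10, 11, 10, 12, 50, 11], mean_predictor, lag=1, threshold=20.0))
# {1: False, 2: False, 3: False, 4: True, 5: False}
```

Only the spike at index 4 is flagged; the return to the usual level at index 5 is not, because the prediction is built from the whole history rather than from the last sample alone.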

[3.2 Processing flow for learning of the prediction model 107a]
Next, the processing for learning the prediction model 107a, which the system monitoring device 100 uses to predict the system load, is described with reference to FIG. 3. FIG. 3 is a flowchart showing the processing flow of the system monitoring device 100 for learning the prediction model 107a. As described above, in the present embodiment the prediction model 109a is identical to the prediction model 107a, so the processing of FIG. 3 generates both. If the two models are to differ, the prediction model 109a can be generated by the same procedure, although the load information 105 used as training data will differ.

If the prediction model 107a has not yet been generated, the learning unit 111 of the system monitoring device 100 reads all the load information 105 up to time Tn (S301) and, using it as training data, generates the prediction model 107a by LSTM (S303). Loading the generated prediction model 107a into the first prediction unit 107 enables the first prediction unit 107 to predict the system load based on the load information 105.

Thereafter, when a preset re-learning time for the prediction model 107a arrives (Yes in S305), the learning unit 111 loads the prediction model 107a generated by learning from the load information 105 up to time Tn (S307) and reads the load information 105 from time Tn+1 to Tn+t (where Tn+t > Tn+1 and t is an arbitrary length of time) from the load information DB 103 (S309). Using the load information 105 read for times Tn+1 to Tn+t, the learning unit 111 re-trains the prediction model 107a by transfer learning. In this way, the learning unit 111 turns the prediction model 107a, which reflected the load information 105 up to time Tn, into one that reflects the load information 105 up to time Tn+t. The learning unit 111 therefore updates the time Tn, which indicates the latest time reflected in the prediction model 107a, to time Tn+t (S313). At this point, the learning unit 111 may also delete the load information 105 up to time Tn+t stored in the load information DB 103.
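The incremental re-learning idea, continuing from the existing model and training only on the newly accumulated data, can be illustrated with a much simpler model than the one in the embodiment. The sketch below swaps the LSTM for a hypothetical two-parameter linear predictor (next = a * current + b) fitted by stochastic gradient descent, so that it runs with the standard library alone. The function names, learning rate, epoch count, and load values are all assumptions made for illustration.

```python
from typing import List, Tuple

Weights = Tuple[float, float]  # (a, b) of the stand-in model: next = a * current + b


def train(weights: Weights, series: List[float],
          epochs: int = 500, lr: float = 0.1) -> Weights:
    """Continue fitting from the given weights using only the given series."""
    a, b = weights
    for _ in range(epochs):
        for current, nxt in zip(series, series[1:]):
            err = (a * current + b) - nxt
            a -= lr * err * current
            b -= lr * err
    return a, b


def mse(weights: Weights, series: List[float]) -> float:
    """Mean squared one-step-ahead prediction error on a series."""
    a, b = weights
    pairs = list(zip(series, series[1:]))
    return sum((a * x + b - y) ** 2 for x, y in pairs) / len(pairs)


# Hypothetical normalised loads.
old_data = [0.2, 0.4, 0.2, 0.4, 0.2, 0.4]   # load information up to time Tn
new_data = [0.3, 0.5, 0.3, 0.5]             # load information for Tn+1 .. Tn+t

initial_model = train((0.0, 0.0), old_data)     # S301, S303: initial learning
updated_model = train(initial_model, new_data)  # S307, S309: load model, re-train on new data

print(mse(initial_model, new_data), mse(updated_model, new_data))
```

Since the second call starts from `initial_model` and never touches `old_data`, the older records are no longer needed once the update has been applied, which is consistent with the remark that the load information up to time Tn+t may be deleted.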

When system management by the system monitoring device 100 is to continue (No in S315), the system monitoring device 100 can update the prediction model 107a periodically by repeating the processing of S305 to S313.

[4 Hardware configuration]
The hardware configuration of a computer (information processing apparatus) capable of implementing the system monitoring device 100 is described below with reference to FIG. 4. The system monitoring device 100 includes a control unit 401, a storage unit 405, a communication interface (I/F) unit 411, an input unit 413, and a display unit 415, which are connected to one another via a bus line 417.

The control unit 401 includes a CPU (Central Processing Unit, not shown), a ROM (Read Only Memory, not shown), a RAM (Random Access Memory) 403, and the like. By executing the control program 407 stored in the storage unit 405, the control unit 401 is configured to perform, in addition to the functions of a general computer, the processing of each component of the system monitoring device 100 shown in FIG. 1. For example, the load detection unit 101, the first prediction unit 107, the second prediction unit 109, the learning unit 111, the abnormality determination unit 113, the output unit 115, and the coping unit 117 shown in FIG. 1 can be implemented as the control program 407, which is temporarily stored in the RAM 403 and runs on the CPU.

The RAM 403 temporarily stores, in addition to the code included in the control program 407, some or all of the load information 105, the operation information 121, the determination results of the abnormality determination unit 113, and the like. The RAM 403 is also used as a work area when the CPU executes various processes.

The storage unit 405 is a non-volatile storage medium such as an HDD (Hard Disk Drive) or flash memory. The storage unit 405 stores an operating system (OS) for providing the functions of a general computer, the control program 407, and a DB 409 holding the data required to execute the control program 407. The DB 409 may include the load information DB 103 and the operation DB 119.

The communication I/F unit 411 is a device for performing wired or wireless data communication with the server 200 and other information processing apparatuses as needed. For example, the load detection processing by which the load detection unit 101 detects the load of the server 200 may be performed via the communication I/F unit 411.

The input unit 413 is a device for receiving various input operations from the administrator of the system monitoring device 100. Specific examples of the input unit 413 include a keyboard, a mouse, and a touch panel.

The display unit 415 is a display device for presenting various information to the administrator who manages the system monitoring device 100. Specific examples of the display unit 415 include a liquid crystal display and an organic EL (Electro-Luminescence) display. For example, when the abnormality determination unit 113 detects a system abnormality, the output unit 115 may cause the display unit 415 to display an indication to that effect.

[5 Effects of the present embodiment]
As described above, the system monitoring device 100 according to the present embodiment calculates the predicted system load using the prediction model 107a, which can take long-term seasonal factors into account, and detects abnormalities by comparing this predicted system load with the measured system load. As a result, an abnormality is detected whenever the situation differs from the prediction, even when the system load is not high, which helps prevent abnormalities from going undetected.

In addition, by calculating the predicted system load at time T+t, which is later than the monitored time T, and enabling countermeasures based on it, measures such as adding servers in advance can be taken when performance degradation due to high load or the like is expected in the future.
Furthermore, because abnormalities can be detected and handled automatically, the administrator's operating costs can be reduced.

[6 Addendum]
The embodiments described above are intended to facilitate understanding of the present invention and are not to be construed as limiting it. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and may be changed as appropriate. Configurations shown in different embodiments may also be partially replaced with or combined with one another.

1 ... information processing system, 100 ... system monitoring device, 101 ... load detection unit, 103 ... load information database (DB), 105 ... load information, 107 ... first prediction unit, 107a ... prediction model, 109 ... second prediction unit, 109a ... prediction model, 111 ... learning unit, 113 ... abnormality determination unit, 115 ... output unit, 117 ... coping unit, 119 ... operation DB, 121 ... operation information, 200 ... server, 401 ... control unit, 403 ... RAM, 405 ... storage unit, 407 ... control program, 411 ... communication interface (I/F) unit, 413 ... input unit, 415 ... display unit, 417 ... bus line

Claims

An information processing apparatus comprising:
a first prediction unit that, using a prediction model generated by deep learning of training data in which each time is associated with a measured system load, which is the system load measured at that time, generates a predicted system load predicting the system load at a second time later than a first time, based on the measured system load detected up to the first time;
a detection unit that detects the measured system load at the second time; and
an output unit that, when the difference between the predicted system load and the measured system load at the second time exceeds a threshold value, outputs an indication to that effect.

The information processing apparatus according to claim 1, further comprising a second prediction unit that, using the prediction model, generates a predicted system load at a third time later than the second time, based on the measured system load detected up to the second time.

The information processing apparatus according to claim 2, further comprising:
a management unit that manages operations according to system load; and
a control unit that, when the difference at the second time exceeds the threshold value, executes the operation according to the predicted system load at the third time.

The information processing apparatus according to any one of claims 1 to 3, further comprising a re-learning unit that re-trains the prediction model, which was generated from first training data relating to the measured system load up to a fourth time, using second training data relating to the measured system load from the fourth time to a fifth time.

An information processing method in which an information processing apparatus performs:
a step of generating, using a prediction model generated by deep learning of training data in which each time is associated with a measured system load, which is the system load measured at that time, a predicted system load predicting the system load at a second time later than a first time, based on the measured system load detected up to the first time;
a step of detecting the measured system load at the second time; and
a step of outputting, when the difference between the predicted system load and the measured system load at the second time exceeds a threshold value, an indication to that effect.

A program that causes a computer to execute:
a process of generating, using a prediction model generated by deep learning of training data in which each time is associated with a measured system load, which is the system load measured at that time, a predicted system load predicting the system load at a second time later than a first time, based on the measured system load detected up to the first time;
a process of detecting the measured system load at the second time; and
a process of outputting, when the difference between the predicted system load and the measured system load at the second time exceeds a threshold value, an indication to that effect.