JP2022087371A

JP2022087371A - Monitoring device, monitoring method, and program

Info

Publication number: JP2022087371A
Application number: JP2020199275A
Authority: JP
Inventors: 理仁深沢; Michihito Fukazawa
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2022-06-13

Abstract

PROBLEM TO BE SOLVED: To provide a monitoring device that controls monitoring in consideration of the operating environment of each server. SOLUTION: In a maintenance management system, a monitoring device 100 comprises: acquisition means 101 for acquiring, for each server, an individual failure rate attributable to hardware monitoring information, operating time, and operating environment; calculation means 102 for calculating the individual failure rate based on the hardware monitoring information, the operating time, and the operating environment temperature of each server; and determination means 103 for determining a monitoring interval for the server based on the individual failure rate. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to server monitoring.

Various services provided via the Internet are realized by running servers, that is, computers on which programs have been installed. However, computers do not always run stably, and an unexpected failure may bring a server down or slow it. Constantly monitoring servers and responding promptly when a failure occurs is therefore important for providing stable services.

To detect failures in electronic devices, the devices are monitored over a network. Patent Document 1 discloses that a maintenance-managed device such as a personal computer or a server, or a monitoring device, reports the occurrence of a failure in the maintenance-managed device. Patent Document 2 discloses monitoring a managed device such as a printer at a predetermined monitoring interval according to its usage.

As a technique related to the present disclosure, Patent Document 3 discloses an operation and maintenance support system that determines whether a device is abnormal in consideration of information on the cumulative operating status of the device and parameters specific to each device.

Patent Document 1: Japanese Unexamined Patent Publication No. 2016-081374
Patent Document 2: Japanese Unexamined Patent Publication No. 2014-053027
Patent Document 3: Japanese Unexamined Patent Publication No. 2010-271905

The failure rate of an electronic device changes with its operating environment, such as temperature and humidity. However, Patent Document 2 does not control the monitoring of electronic devices in consideration of their operating environment.

An object of the present disclosure is to provide a monitoring device and the like that control monitoring in consideration of the operating environment of a server.

A monitoring device according to the present disclosure includes: acquisition means for acquiring, for each server, an individual failure rate attributable to hardware, operating time, and operating environment; and determination means for determining a monitoring interval for the server based on the individual failure rate.

A monitoring method according to the present disclosure acquires, for each server, an individual failure rate attributable to hardware, operating time, and operating environment, and determines a monitoring interval for the server based on the individual failure rate.

A program according to the present disclosure causes a computer to execute processing of acquiring, for each server, an individual failure rate attributable to hardware, operating time, and operating environment, and determining a monitoring interval for the server based on the individual failure rate.

According to the present disclosure, monitoring can be controlled in consideration of the operating environment of the server.

FIG. 1 is a schematic diagram showing the configuration of the maintenance management system according to the first embodiment.
FIG. 2 is a schematic hardware diagram of the server 300.
FIG. 3 is a sequence diagram showing an operation example of the maintenance management system.
FIG. 4 is a graph showing an example of the relationship model.
FIG. 5 is a graph showing another example of the relationship model.
FIG. 6 is a block diagram showing the configuration of the monitoring device 100 according to the second embodiment.
FIG. 7 is a schematic diagram showing an arrangement example of the calculation device 120.
FIG. 8 is a graph showing another example of the relationship between the individual failure rate and the determined monitoring interval.
FIG. 9 is a flowchart showing an operation example of the monitoring device 100 according to the second embodiment.
FIG. 10 is a block diagram showing an example of the hardware configuration of the computer 500.

The present disclosure describes a monitoring device that monitors, for maintenance management, a plurality of electronic devices constituting a large-scale system. The following embodiments are described with reference to the drawings, taking a monitoring device that monitors servers as an example. However, the monitored targets may include electronic devices other than servers.

[First Embodiment]
(Configuration)
FIG. 1 is a schematic diagram showing the configuration of a maintenance management system according to the first embodiment. The maintenance management system includes a monitoring device 100 and one or more server racks 200 (200_1, ..., 200_y) to be monitored. Each server rack 200 includes one or more servers 300 (300_1, 300_2, ..., 300_x) and a temperature sensor 400. The monitoring device 100 is communicably connected to each server 300 and each temperature sensor 400 via a network 010. The number of servers 300 in each server rack 200 may be changed arbitrarily.

The monitoring device 100 monitors whether a failure has occurred in the servers 300. In the first embodiment, the monitoring device 100 includes an acquisition unit 101, a calculation unit 102, and a determination unit 103.

The acquisition unit 101 acquires hardware monitoring information, operation information, and information on the operating environment for each server 300. The hardware monitoring information is hardware monitoring data used to determine whether each component inside the server 300 shows an abnormal tendency. The operation information indicates the operating status of the server 300, for example the operating time since initial deployment or the number of years in use. The information on the operating environment indicates, for example, the temperature or humidity inside the server rack 200, or the temperature of the server 300. In addition, the acquisition unit 101 acquires the individual failure rate that the calculation unit 102 calculates for each server 300 from the hardware, the operation information, and the operating environment.

The calculation unit 102 calculates an individual failure rate P for each server 300 based on the hardware monitoring information, the operation information, and the information on the operating environment acquired by the acquisition unit 101. The calculation of the individual failure rate P is described later.

The determination unit 103 determines the monitoring interval for each server 300 based on a relationship model between the individual failure rate of the server 300 and the monitoring interval. The monitoring interval is the time between one execution of the failure monitoring process for a server 300 by the monitoring device 100 and the next execution. As the monitoring process, the monitoring device 100 may, for example, send an input/output request to the server 300 and check whether the response time is within a normal range. Alternatively, the monitoring device 100 may check whether the monitoring data of each component of the server 300 is within a normal range.

The relationship model between the individual failure rate and the monitoring interval may show the monitoring interval decreasing as the individual failure rate increases. The relationship model may be expressed as a relational expression between the individual failure rate and the monitoring interval, or as a table. The relationship model may also be obtained by machine learning that uses the network load as the objective variable and, for example, the failure rate, the monitoring interval, the number of servers, and the network bandwidth as explanatory variables.

In the first embodiment, the temperature sensor 400 measures the temperature inside the server rack 200. The temperature sensor 400 transmits the measured temperature to the monitoring device 100 as information on the operating environment of the servers 300_1, ..., 300_x in the server rack 200.

When the temperature of each server 300 is used as the information on the operating environment, one temperature sensor 400 may be installed for each server 300. In that case, each temperature sensor 400 transmits the measured temperature of its server to the monitoring device 100.

Other sensors may be installed in the server rack 200 depending on the type of operating-environment information that the acquisition unit 101 acquires. When humidity is used as the information on the operating environment, the temperature sensor 400 in the following description can be read as a humidity sensor.

FIG. 2 is a schematic hardware diagram of the server 300. The server 300 includes a CPU (Central Processing Unit) 301, a memory 302, an HDD (hard disk drive) 303, an Ethernet port 304, and a BMC (Baseboard Management Controller) 309. The BMC 309 acquires monitoring data from the hardware of the server 300 (the components and elements inside the server 300) and transmits it to the monitoring device 100 as hardware monitoring information.

(Operation)
FIG. 3 is a sequence diagram showing an operation example of the maintenance management system. For simplicity, FIG. 3 shows only one server 300, but every server 300 operates in the same manner.

<< Calculation of failure rate due to hardware >>
The acquisition unit 101 of the monitoring device 100 acquires hardware monitoring information from the server 300 via the network 010 and the BMC 309. The calculation unit 102 calculates the failure rate P_h due to hardware based on the monitoring information acquired by the acquisition unit 101 (step S101).

The failure rate P_h due to hardware is the probability that an individual device fails. The failure rate P_h is calculated, for example, as a value between 0 and 1 (or 0% and 100%).

The hardware monitoring information includes monitoring data for each of a plurality of monitored components, such as the CPU 301, the memory 302, the HDD 303, the Ethernet port 304, and the power supply. The monitoring data for the CPU 301 is, for example, the CPU usage rate and temperature. The monitoring data for the memory 302 is, for example, the number of errors reported by ECC (Error Checking and Correcting). The monitoring data for the HDD 303 is, for example, the inspection values reported by SMART (Self-Monitoring, Analysis and Reporting Technology). The monitoring data for the power supply indicates, for example, the redundancy status of the power supply.

The calculation unit 102 determines, for example, whether each hardware component subject to failure shows an abnormal tendency based on the value of the monitoring data acquired for that component. The calculation unit 102 determines that a component has an abnormal tendency when its monitoring data exceeds a predetermined threshold. Alternatively, the calculation unit 102 may determine that a component has an abnormal tendency when the failure rate of the component, calculated from the value of its monitoring data, is equal to or higher than a predetermined threshold. When the monitoring information covers 100 components, the calculation unit 102 increases the failure rate P_h by, for example, 0.01 (or 1%) for each component that shows an abnormal tendency. The amount by which P_h is increased may be adjusted for each component based on the importance of the component and the measured data values included in the monitoring information.
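The threshold-and-increment rule described above can be sketched as follows. This is only an illustration: the class name `ComponentReading`, the field names, and the sample readings are hypothetical, and capping P_h at 1 is an assumption made here so that the result stays within the stated 0 to 1 range.

```python
from dataclasses import dataclass


@dataclass
class ComponentReading:
    """One monitored value for one component (all names are illustrative)."""
    name: str
    value: float
    threshold: float          # a value above this counts as an abnormal tendency
    increment: float = 0.01   # contribution to P_h when abnormal (may vary by importance)


def hardware_failure_rate(readings: list[ComponentReading]) -> float:
    """Add up the increments of every component showing an abnormal tendency."""
    p_h = sum(r.increment for r in readings if r.value > r.threshold)
    return min(1.0, p_h)  # assumption: cap so that P_h stays within [0, 1]


# Hypothetical readings: two of the three components exceed their thresholds.
readings = [
    ComponentReading("cpu_temperature_c", 92.0, 85.0),
    ComponentReading("memory_ecc_errors", 3.0, 10.0),
    ComponentReading("hdd_smart_reallocated", 40.0, 20.0, increment=0.02),
]
p_h = hardware_failure_rate(readings)
```

The per-component `increment` field corresponds to the remark that the increase may be adjusted by component importance.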

<< Calculation of failure rate due to operating environment >>
The acquisition unit 101 of the monitoring device 100 acquires the temperature measured by the temperature sensor 400 installed in the server rack 200 as information on the operating environment of the server 300. The calculation unit 102 calculates the failure rate P_e due to the operating environment based on the temperature acquired by the acquisition unit 101 (step S102).

The failure rate P_e due to the operating environment is the probability of failure calculated from the difference between the environment in which the server 300 is installed and the conditions for stable operation. The failure rate P_e is calculated, for example, as a value between 0 and 1 (0% and 100%).

The calculation unit 102 calculates P_e as, for example, 0 (or 0%) when the temperature is at the median of the assumed operating temperature range, 1 (or 100%) when the temperature is at or above the upper temperature limit, and 1 (or 100%) when the temperature is at or below the lower temperature limit. The calculation unit 102 may also calculate the failure rate P_e in consideration of the rate of change of the operating environment temperature, how long a temperature has persisted, or the number of times the temperature has gone beyond a predetermined upper or lower limit.
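One way to realize the three anchor points above is sketched below. The text fixes only the values at the median and at the limits; the linear ramp between them, the use of the range midpoint as the median, and the function and parameter names are assumptions of this sketch.

```python
def environment_failure_rate(temp_c: float, lower_c: float, upper_c: float) -> float:
    """Return P_e in [0, 1] from a measured temperature.

    P_e is 0 at the midpoint of the assumed operating range and 1 at or
    beyond either limit. The linear ramp in between is an assumption.
    """
    if upper_c <= lower_c:
        raise ValueError("upper_c must be greater than lower_c")
    midpoint = (lower_c + upper_c) / 2.0
    half_range = (upper_c - lower_c) / 2.0
    # Distance from the midpoint, scaled so that either limit maps to 1.
    return min(1.0, abs(temp_c - midpoint) / half_range)
```

For a hypothetical operating range of 10 to 35 degrees Celsius, a rack at 22.5 degrees yields 0, and any reading at or outside the limits yields 1.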

The acquisition unit 101 may acquire, from an external device, the temperature distribution of the server room in which the servers 300 are placed. The acquisition unit 101 may then obtain the operating environment temperature of each server 300 from the temperature distribution of the server room and the placement of the servers 300.

<< Calculation of failure rate due to operating time >>
The server 300 records and accumulates its total operating time (total powered-on time) since initial deployment. The acquisition unit 101 acquires the operating time of the server 300 from the server 300 as operation information. The calculation unit 102 calculates the failure rate P_t based on the operating time (step S103).

The failure rate P_t due to operating time is the probability of failure calculated from the bathtub curve of the device, which represents the life characteristics of the server 300 as a function of total operating time. The failure rate P_t is 0 (0%) in the unused state and 1 (100%) at the upper limit of operating time.
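The text does not specify how the bathtub curve is turned into P_t. The sketch below shows one possible reading: it integrates an illustrative bathtub-shaped hazard and normalizes the result so that P_t is 0 when unused and 1 at the upper limit. The hazard form, every constant, and the function names are hypothetical and would need to be fitted to real field data.

```python
import math


def cumulative_hazard(t: float, t_max: float) -> float:
    """Integral from 0 to t of an illustrative bathtub-shaped hazard.

    hazard(t) = a*exp(-t/tau) + b + c*(t/t_max)**k
    (early failures + random failures + wear-out). All constants are
    hypothetical placeholders.
    """
    a, tau, b, c, k = 2e-5, 2000.0, 5e-6, 1e-4, 4
    early = a * tau * (1.0 - math.exp(-t / tau))
    random_part = b * t
    wear_out = c * t_max / (k + 1) * (t / t_max) ** (k + 1)
    return early + random_part + wear_out


def operating_time_failure_rate(hours: float, max_hours: float) -> float:
    """Return P_t: 0 when unused, 1 at (or beyond) the operating-time upper limit."""
    if max_hours <= 0:
        raise ValueError("max_hours must be positive")
    t = min(max(hours, 0.0), max_hours)  # clamp into [0, max_hours]
    return cumulative_hazard(t, max_hours) / cumulative_hazard(max_hours, max_hours)
```

Normalizing the cumulative hazard is a design choice made here to satisfy the two stated end points; a simple linear ratio of hours to the upper limit would satisfy them as well.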

<< Calculation and acquisition of individual failure rate P >>
The calculation unit 102 calculates the individual failure rate P for each server 300 by jointly considering the failure rate P_h due to hardware, the failure rate P_e due to the operating environment, and the failure rate P_t due to operating time calculated for that server (step S104). The acquisition unit 101 acquires the individual failure rate P calculated by the calculation unit 102 (step S105).

The order of steps S101 to S103 may be changed.

Calculation example 1 of the individual failure rate P: The calculation unit 102 may assign priorities or weights to the three failure rates P_h, P_e, and P_t. The individual failure rate P is expressed by, for example, the following equation.
P = W_h × P_h + W_e × P_e + W_t × P_t (weighting constants W_h, W_e, W_t > 0)
Calculation example 2 of the individual failure rate P: When the failure rates P_h and P_t have increased to predetermined values, the calculation unit 102 may set the weight W_e of the failure rate P_e to a small value. That is, for a server whose failure rate P_h due to hardware and failure rate P_t due to operating time are both high, the individual failure rate P may be calculated as high even if the failure rate P_e due to the operating environment is low. As a result, that server is monitored at a short monitoring interval.
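The two calculation examples can be combined in a short sketch. Dividing by the sum of the weights is an assumption made here so that P stays between 0 and 1 and so that shrinking W_e actually raises P when P_e is low; the text itself only gives the weighted sum. The threshold `high` and the value `reduced_w_e` are hypothetical.

```python
def individual_failure_rate(
    p_h: float, p_e: float, p_t: float,
    w_h: float = 1.0, w_e: float = 1.0, w_t: float = 1.0,
    high: float = 0.7, reduced_w_e: float = 0.1,
) -> float:
    """Weighted combination of P_h, P_e, and P_t (calculation examples 1 and 2)."""
    if min(w_h, w_e, w_t) <= 0:
        raise ValueError("weights must be positive")
    # Example 2: when both P_h and P_t are high, shrink the weight of P_e
    # so that a benign environment cannot mask the risk.
    if p_h >= high and p_t >= high:
        w_e = min(w_e, reduced_w_e)
    total = w_h + w_e + w_t
    # Assumption: normalize by the weight sum to keep P within [0, 1].
    return (w_h * p_h + w_e * p_e + w_t * p_t) / total
```

With equal weights, a server with P_h = 0.9, P_e = 0, and P_t = 0.9 would score 0.6 under a plain normalized average; with the reduced W_e it scores about 0.86, which matches the intent that such a server be monitored at a short interval.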

<< Determination of monitoring interval >>
The determination unit 103 determines the monitoring interval for the server 300 based on the relationship model between the individual failure rate of the server 300 and the monitoring interval (step S106). FIG. 4 is a graph showing an example of the relationship model. In FIG. 4, the relationship model between the individual failure rate P and the monitoring interval I is a straight line, but it may be a curve. Also, in FIG. 4 the monitoring interval I takes continuous values as the individual failure rate increases, but the monitoring interval I may instead be discontinuous (discrete). The monitoring interval I may be expressed by different relational expressions for different ranges of the individual failure rate P.

The closer the individual failure rate P is to 1 (100%), the shorter the monitoring interval I may be set relative to the normal monitoring interval of a general monitoring device. Conversely, the closer the individual failure rate P is to 0, the longer the monitoring interval I may be set relative to that normal monitoring interval.
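A straight-line relationship model of the kind shown in FIG. 4 can be sketched as a linear interpolation between a longest and a shortest interval. The end points of 3600 seconds and 120 seconds and the function name are illustrative assumptions, not values given for FIG. 4.

```python
def linear_monitoring_interval(
    p: float, shortest_s: float = 120.0, longest_s: float = 3600.0
) -> float:
    """Map an individual failure rate P in [0, 1] to a monitoring interval in seconds.

    P = 0 gives the longest interval and P = 1 the shortest; the end points
    are hypothetical.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be within [0, 1]")
    return longest_s - p * (longest_s - shortest_s)
```

A curve or a table lookup could replace the straight line without changing the interface.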

The monitoring device 100 monitors each server 300 at the monitoring interval determined for that server 300.
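Monitoring each server at its own interval amounts to a simple scheduling problem. The sketch below uses a priority queue to list when each check falls due within a time horizon; it is a simulation only, and the function name, the server labels, and the intervals are hypothetical. The actual monitoring process (for example, sending an input/output request) is not modeled.

```python
import heapq


def schedule_checks(
    intervals_s: dict[str, float], horizon_s: float
) -> list[tuple[float, str]]:
    """Return (time, server) pairs for every check due within horizon_s seconds."""
    if any(interval <= 0 for interval in intervals_s.values()):
        raise ValueError("intervals must be positive")
    # Each server's first check is due one interval after the start.
    queue = [(interval, name) for name, interval in intervals_s.items()]
    heapq.heapify(queue)
    due = []
    while queue and queue[0][0] <= horizon_s:
        when, name = heapq.heappop(queue)
        due.append((when, name))
        # Re-arm the same server one interval later.
        heapq.heappush(queue, (when + intervals_s[name], name))
    return due


checks = schedule_checks({"300_1": 120, "300_2": 3600}, horizon_s=600)
```

In this example only server 300_1, which has the short interval, is checked during the first ten minutes.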

(Effect)
According to the first embodiment, monitoring can be controlled in consideration of the operating environment of the server 300. This is because the acquisition unit 101 of the monitoring device 100 acquires, for each server, the individual failure rate attributable to hardware, operating time, and operating environment, and the determination unit 103 determines the monitoring interval for the server 300 based on the individual failure rate.

In system maintenance management, a single monitoring device often monitors many electronic devices. The larger the system, the more electronic devices there are to monitor. In addition, each added monitoring item increases the number of accesses by the number of electronic devices, which raises both the network load and the load on the monitoring device. For this reason, large systems reduce the network load and the load on the monitoring device by uniformly lengthening the monitoring interval.

However, a long monitoring interval lengthens the time from the occurrence of a failure until the monitoring device detects it. If failures cannot be detected promptly, the availability of the entire system falls, which in turn degrades service quality.

According to the first embodiment, the determination unit 103 sets the monitoring interval of a server 300 whose individual failure rate P is lower than a predetermined reference failure rate to be longer than a predetermined reference monitoring interval. This reduces the load on the network 010 and the monitoring load when many servers are monitored.

Furthermore, according to the first embodiment, the determination unit 103 sets the monitoring interval of a server 300 whose individual failure rate P is higher than the predetermined reference failure rate to be shorter than the predetermined reference monitoring interval. This shortens the time from the actual occurrence of a failure until the monitoring device 100 detects it and prevents delayed failure detection.
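The load argument can be made concrete with a small calculation. The figures below (1000 servers, a uniform 10-minute interval, and a split of 50 servers at 150 seconds against 950 servers at 1 hour) are purely hypothetical and are not taken from the disclosure; the function name is likewise illustrative.

```python
def checks_per_hour(intervals_s: list[float]) -> float:
    """Total monitoring accesses per hour for one monitoring item."""
    return sum(3600.0 / interval for interval in intervals_s)


# Uniform policy: every one of 1000 servers is checked every 10 minutes.
uniform = checks_per_hour([600.0] * 1000)

# Per-server policy: 50 high-risk servers every 150 s, the rest once per hour.
adaptive = checks_per_hour([150.0] * 50 + [3600.0] * 950)
```

Under these assumed numbers the per-server policy issues fewer accesses in total while checking the high-risk servers four times as often as the uniform policy does.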

(Modification)
FIG. 5 is a graph showing another example of the relationship model. In FIG. 5, the relationship model is expressed by different relational expressions depending on whether the individual failure rate P is smaller than the reference failure rate P_0 or is equal to or greater than P_0.

When P_0 ≤ P, I = I_A
When 0 < P < P_0, I = I_B
In the example of FIG. 5, the monitoring interval I_A may be set to 2 to 3 minutes and the monitoring interval I_B to 1 hour, for example. The determination unit 103 may use, for example, the mean or the median of the individual failure rates of the servers 300_1, 300_2, ..., 300_x of each server rack 200 over a predetermined period as the reference failure rate P_0. The reference failure rate P_0 may also be set using the machine learning described above.
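The two-level model of FIG. 5 and a median-based reference failure rate can be sketched as follows. The value of 150 seconds for I_A is one illustrative point within the stated 2 to 3 minutes, and the function names and sample rates are hypothetical.

```python
from statistics import median


def reference_failure_rate(rates: list[float]) -> float:
    """Use the median of the rack's individual failure rates as P_0."""
    return median(rates)


def two_level_interval(
    p: float, p_0: float, i_a_s: float = 150.0, i_b_s: float = 3600.0
) -> float:
    """Return I_A (short) when P is at or above P_0, otherwise I_B (long)."""
    return i_a_s if p >= p_0 else i_b_s


rack_rates = [0.1, 0.2, 0.6]          # hypothetical rates for one rack
p_0 = reference_failure_rate(rack_rates)
intervals = [two_level_interval(p, p_0) for p in rack_rates]
```

Replacing `median` with `statistics.mean` gives the mean-based variant that the text also allows.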

[Second Embodiment]
The first embodiment described the case in which the monitoring device 100 includes a calculation unit that calculates the individual failure rate P. The second embodiment describes the case in which another device has the function of the calculation unit 102 of the first embodiment and the monitoring device 100 acquires the individual failure rate P calculated by that calculation device. Descriptions that duplicate the first embodiment are omitted.

(Configuration)
FIG. 6 is a block diagram showing the configuration of the monitoring device 100 according to the second embodiment. The monitoring device 100 according to the second embodiment includes an acquisition unit 121 and a determination unit 122. The monitoring device 100 according to the second embodiment can replace the monitoring device 100 in FIG. 1.

FIG. 7 is a schematic diagram of the maintenance management system according to the second embodiment, showing an arrangement example of the calculation device 120. As shown in FIG. 7, a single calculation device 120 may be communicably connected to the monitoring device 100 and to each server 300 and temperature sensor 400 of each server rack 200. The calculation device 120 acquires hardware monitoring information and operation information from each server 300 and acquires information on the operating environment from the temperature sensor 400.

The acquisition unit 121 of the monitoring device 100 acquires, from the external calculation device 120, the individual failure rate of each server 300 based on its hardware, operation information, and operating environment.

The determination unit 122 determines the monitoring interval for each server 300 based on the individual failure rate of that server.

The determination unit 122 may set the monitoring interval of the server 300 to a first monitoring interval when the individual failure rate is higher than a predetermined reference failure rate, and to a second monitoring interval longer than the first monitoring interval when the individual failure rate is lower than the reference failure rate. When the individual failure rate equals the reference failure rate, the determination unit 122 sets the monitoring interval to any time that is at least the first monitoring interval and at most the second monitoring interval.

The relationship between the individual failure rate and the monitoring interval determined by the determination unit 122 may be as shown in the graph of FIG. 4 or FIG. 5. FIG. 8 is a graph showing another example of this relationship. In FIG. 8, the monitoring intervals I_1 to I_5 are determined in correspondence with the reference failure rates P_1 to P_4. Each of the reference failure rates P_1 to P_4 is one form of the predetermined reference failure rate. Each of the monitoring intervals I_1 to I_5 is one form of the first or second monitoring interval.
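A multi-level model with four reference failure rates and five intervals can be sketched with a sorted-threshold lookup. The ordering of the intervals (longest below P_1, shortest at or above P_4), the threshold values, and the function name are assumptions; the text does not give FIG. 8 in detail. A rate exactly equal to a threshold is placed in the shorter-interval band here, which is within the range the text permits for the equal case.

```python
from bisect import bisect_right


def stepped_interval(
    p: float, thresholds: list[float], intervals_s: list[float]
) -> float:
    """Pick the interval for the band that P falls into.

    thresholds are the ascending reference failure rates (P_1..P_4) and
    intervals_s the corresponding intervals (I_1..I_5), one more than
    the number of thresholds.
    """
    if len(intervals_s) != len(thresholds) + 1:
        raise ValueError("need exactly one more interval than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be ascending")
    # bisect_right counts how many thresholds are <= p, which is the band index.
    return intervals_s[bisect_right(thresholds, p)]


# Hypothetical values for P_1..P_4 and I_1..I_5 (seconds).
P_REFS = [0.2, 0.4, 0.6, 0.8]
I_STEPS = [3600.0, 1800.0, 600.0, 300.0, 120.0]
```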

The monitoring device 100 monitors the server 300 at the determined monitoring interval. Alternatively, the monitoring device 100 may transmit the determined monitoring interval to a separate monitoring-process execution device (not shown) and have that device monitor the server 300.

Although FIG. 7 shows a single calculation device 120, one calculation device 120 may be provided for each server rack 200. In that case, each calculation device 120 calculates the individual failure rate of each server 300 in its server rack 200. Alternatively, each server 300 may have the function of the calculation device 120. In that case, each server 300 acquires information on the operating environment from the temperature sensor 400 and calculates its own individual failure rate. The monitoring device 100 acquires the individual failure rate of each server 300 from each of the plurality of calculation devices 120.

(Operation)
FIG. 9 is a flowchart showing an operation example of the monitoring device 100 according to the second embodiment. The acquisition unit 121 acquires, for each server 300, the individual failure rate attributable to the hardware, operating time, and operating environment (step S121). The determination unit 122 determines the monitoring interval for each server 300 based on the individual failure rate of that server 300 (step S122).
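
The two steps of FIG. 9 can be sketched as follows. This is a minimal sketch under stated assumptions: the individual failure rates are supplied by a caller-provided function (`acquire`, hypothetical), and the interval is chosen with a two-level rule in which a rate higher than a single reference failure rate gives the shorter first monitoring interval. The function name, parameter names, and the treatment of a rate equal to the reference failure rate are assumptions.

```python
from typing import Callable, Dict


def run_monitoring_control(
    acquire: Callable[[], Dict[str, float]],
    reference_failure_rate: float,
    first_interval_s: int,
    second_interval_s: int,
) -> Dict[str, int]:
    """Return a monitoring interval in seconds for each server identifier."""
    # Step S121: acquire the individual failure rate of each server.
    individual_failure_rates = acquire()

    # Step S122: determine the monitoring interval from each failure rate.
    # A rate equal to the reference falls into the longer interval (assumption).
    return {
        server_id: first_interval_s if rate > reference_failure_rate else second_interval_s
        for server_id, rate in individual_failure_rates.items()
    }
```

Passing the acquisition step in as a function keeps the sketch independent of where the failure rates come from, for example a single calculation device, one per rack, or the servers themselves.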

(Effect)
According to the second embodiment, monitoring can be controlled in consideration of the operating environment of the server 300. This is because the acquisition unit 121 of the monitoring device 100 acquires, for each server, the individual failure rate attributable to the hardware, operating time, and operating environment, and the determination unit 122 determines the monitoring interval for the server 300 based on the individual failure rate.

(Modification)
In addition to the individual failure rate, the determination unit 122 may also determine the monitoring interval for a server based on the proportion of servers 300 assigned to each monitoring interval, or on an upper limit on their number. That is, for example, the determination unit 122 determines the reference failure rate such that, once the number of servers 300 monitored at the first monitoring interval reaches a predetermined upper limit, the remaining servers 300 are monitored at the longer second monitoring interval.
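
A hedged sketch of the upper-limit variant: servers are ranked by individual failure rate, and at most `upper_limit` of those exceeding the reference failure rate receive the first monitoring interval. Ranking and capping in this way has the same effect as raising the reference failure rate until the cap is met, but it is only one possible reading of the modification. The function name and all parameters are hypothetical.

```python
from typing import Dict


def assign_with_upper_limit(
    individual_failure_rates: Dict[str, float],
    reference_failure_rate: float,
    upper_limit: int,
    first_interval_s: int,
    second_interval_s: int,
) -> Dict[str, int]:
    """Assign the first (shorter) interval to at most upper_limit servers."""
    # Visit servers from the highest individual failure rate to the lowest.
    ranked = sorted(individual_failure_rates.items(), key=lambda kv: kv[1], reverse=True)
    intervals: Dict[str, int] = {}
    first_count = 0
    for server_id, rate in ranked:
        if rate > reference_failure_rate and first_count < upper_limit:
            intervals[server_id] = first_interval_s
            first_count += 1
        else:
            # Cap reached or rate not above the reference: use the longer interval.
            intervals[server_id] = second_interval_s
    return intervals
```

With an upper limit of two, only the two servers with the highest failure rates above the reference are monitored at the first interval, even if more servers exceed the reference.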

This modification can satisfy two demands at once: reducing the network load and the monitoring load when many servers are monitored, and shortening the time from the actual occurrence of a failure until the monitoring device 100 detects it, thereby preventing delayed failure detection.

[Hardware configuration]
In each of the embodiments described above, the components of the monitoring device 100 are shown as functional blocks. Some or all of the components of each device may be realized by any combination of a computer 500 and a program.

FIG. 10 is a block diagram showing an example of the hardware configuration of the computer 500. Referring to FIG. 10, the computer 500 includes, for example, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, a RAM (Random Access Memory) 503, a program 504, a storage device 505, a drive device 507, a communication interface 508, an input device 509, an input/output interface 511, and a bus 512.

The program 504 includes instructions for realizing each function of each device. The program 504 is stored in advance in the ROM 502, the RAM 503, or the storage device 505. The CPU 501 realizes each function of each device by executing the instructions included in the program 504. For example, the CPU 501 of the monitoring device 100 realizes the functions of the monitoring device 100 by executing the instructions included in the program 504. The RAM 503 may also store data processed by each function of each device. For example, the hardware monitoring information and the operating environment temperature used by the calculation unit 102 of the monitoring device 100 may be stored in the RAM 503 of the computer 500.

The drive device 507 reads from and writes to the recording medium 506. The communication interface 508 provides an interface to a communication network. The input device 509 is, for example, a mouse or a keyboard, and accepts input of information from an administrator of the maintenance management system. The output device 510 is, for example, a display, and outputs (displays) information to the administrator. The input/output interface 511 provides an interface to peripheral devices. The bus 512 connects these hardware components. The program 504 may be supplied to the CPU 501 via the communication network, or may be stored in advance in the recording medium 506, read out by the drive device 507, and supplied to the CPU 501.

The hardware configuration shown in FIG. 10 is an example; other components may be added, and some of the components may be omitted.

There are various modifications of how each device may be realized. For example, each device may be realized by any combination of a computer and a program that differs for each component. Alternatively, a plurality of components of each device may be realized by any combination of a single computer and a program.

Some or all of the components of each device may also be realized by general-purpose or dedicated circuitry including a processor or the like, or by a combination thereof. Such circuitry may consist of a single chip or of a plurality of chips connected via a bus. Some or all of the components of each device may be realized by a combination of the above circuitry and a program.

When some or all of the components of each device are realized by a plurality of computers, circuits, or the like, the plurality of computers, circuits, or the like may be arranged centrally or in a distributed manner.

At least a part of the monitoring device 100 may be provided in SaaS (Software as a Service) form. That is, at least some of the functions for realizing the monitoring device 100 may be performed by software executed via a network.

Although the present disclosure has been described above with reference to the embodiments, the present disclosure is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present disclosure within its scope. The configurations of the embodiments may also be combined with one another as long as they do not depart from the scope of the present disclosure.

100 Monitoring device
101, 121 Acquisition unit
102 Calculation unit
103, 122 Determination unit
120 Calculation device
200 Server rack
300 Server
400 Temperature sensor
500 Computer

Claims

A monitoring device comprising:
acquisition means for acquiring an individual failure rate attributable to the hardware, operating time, and operating environment of each server; and
determination means for determining a monitoring interval for the server based on the individual failure rate.

The monitoring device according to claim 1, wherein the acquisition means acquires the individual failure rate based on hardware monitoring information for each server, the operating time, and an operating environment temperature of the server in a server rack.

The monitoring device according to claim 1 or 2, further comprising calculation means for calculating the individual failure rate based on hardware monitoring information for each server, the operating time, and an operating environment temperature of the server.

The monitoring device according to any one of claims 1 to 3, wherein the determination means determines the monitoring interval for each server based on a relational model in which the monitoring interval tends to decrease as the individual failure rate increases.

The monitoring device according to any one of claims 1 to 4, wherein the determination means:
determines the monitoring interval of the server to be a first monitoring interval when the individual failure rate is higher than a predetermined reference failure rate; and
determines the monitoring interval of the server to be a second monitoring interval longer than the first monitoring interval when the individual failure rate is lower than the reference failure rate.

The monitoring device according to any one of claims 1 to 5, wherein the determination means further determines the monitoring interval based on the upper limit number of the servers monitored at the monitoring interval.

A monitoring method comprising:
acquiring an individual failure rate attributable to the hardware, operating time, and operating environment of each server; and
determining a monitoring interval for the server based on the individual failure rate.

A program that causes a computer to execute processing of:
acquiring an individual failure rate attributable to the hardware, operating time, and operating environment of each server; and
determining a monitoring interval for the server based on the individual failure rate.