JP2014153736A

JP2014153736A - Fault symptom detection method, program and device

Info

Publication number: JP2014153736A
Application number: JP2013020110A
Authority: JP
Inventors: Akira Goto; 公後藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-02-05
Filing date: 2013-02-05
Publication date: 2014-08-25

Abstract

Problem to be solved: In failure sign detection, to allow a failure sign to be judged against a threshold that matches the operating status of the monitoring target.
Solution: A failure sign detection device 1 classifies monitoring data of a monitoring target 2, collected during periods in which no abnormality was detected for the monitoring target 2, by day of the week, time slot, day of the month, or week number, and stores it in a storage unit (23) (14). It sets an allowable range based on the distribution of the stored monitoring data for each day of the week, time slot, day of the month, or week number. It then compares monitoring data acquired from the monitoring target 2 with the allowable range derived from the distribution for the day of the week, time slot, day of the month, or week number to which the monitoring date and time of that data belongs, and detects a failure sign of the monitoring target 2 when the monitoring data exceeds the upper limit or the lower limit of that range (13).
Selected drawing: Figure 2

Description

The present invention relates to a technique for detecting signs of failure in computer system monitoring.

In a computer system that runs 24 hours a day, system downtime caused by failures must be kept as short as possible. It is therefore necessary to detect the signs of a failure, rather than detecting the failure after the computer system has already stopped, so that the failure can be avoided before the system stops or the initial response of recovery work can begin sooner.

In conventional failure monitoring of computer systems, an abnormality is reported after a stop of the computer system has been detected, or when the operating status exceeds a preset threshold.

One known conventional failure detection method extracts time-series data representing the performance of the monitored system at a fixed interval, stores it as past time-series data in association with past metadata, collates it with metadata describing the real-time time-series data, detects upcoming changes, and outputs a failure.

Another known conventional method reads log information output by the target of failure management together with failure log information from past failures, determines the similarity between the log information and the failure log information, and outputs the failure-related information of the failure log information with high similarity.

A further known conventional method continuously collects monitoring information from a plurality of network devices as initial monitoring information, monitors the statistical behavior of the collected information, and, when that behavior differs from normal, treats it as detection of a sign of an abnormality and instructs the collection of a plurality of related items of monitoring information.

JP 2009-289221 A
JP 2006-099249 A
JP 2005-285040 A

A failure sign must be detected before a failure actually occurs in the monitored computer system. When failure signs are judged against a threshold, choosing that threshold is the difficulty: if it is set too low, false detections become likely; if it is set too high, the failure follows almost immediately after detection.

In addition, some computer systems run batch processing at night or are temporarily stopped at particular times, so the operating status of a computer system is not always constant. The threshold therefore needs to change with the fluctuating operating status.

Furthermore, because computer systems are expected to operate without failures, an appropriate threshold must be set before any failure actually occurs.

With the conventional methods, however, a failure sign cannot be detected with a threshold that matches the operating status of the monitoring target, and an appropriate threshold cannot be obtained unless a failure actually occurs.

In one aspect, an object of the present invention is to provide a method, a program, and a device for failure sign detection that can detect a failure sign by setting, from the normal operation information of the monitoring target, a threshold that matches the operating status at the time of monitoring. The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

In a failure sign detection method according to one embodiment, a computer executes processing that: classifies monitoring data of a monitored system, collected during a period in which no abnormality was detected for that system, by day of the week, time slot, day of the month, or week number, and stores it in a storage unit; sets an allowable range based on the distribution of the stored monitoring data for each day of the week, time slot, day of the month, or week number; compares monitoring data currently acquired from the monitored system with the allowable range based on the distribution of the monitoring data for the day of the week, time slot, day of the month, or week number to which the current date and time belongs; and detects a failure sign of the monitored system when the acquired monitoring data exceeds the upper limit or the lower limit of that allowable range.

This makes it possible to detect failure signs using an appropriate threshold that matches the operating status of the monitored computer system.

FIG. 1 is a diagram showing an example hardware configuration in one embodiment of the failure sign detection device.
FIG. 2 is a diagram showing an example of functional blocks in one embodiment of the disclosed failure sign detection device.
FIG. 3 is a diagram showing an example data structure in one embodiment of the monitoring result log table.
FIG. 4 is a diagram showing an example data structure in one embodiment of the monitoring threshold table.
FIG. 5 is a diagram showing an example data structure in one embodiment of the normal operation information table.
FIG. 6 is a diagram showing an example data structure in one embodiment of the operating system table.
FIG. 7 is a diagram showing the failure sign detection processing flow in one embodiment of the failure sign detection device.
FIG. 8 is a diagram showing an example of the relationship between the monitoring results acquired by the failure sign detection device and the allowable range.
FIG. 9 is a diagram showing the threshold setting processing flow in one embodiment of the failure sign detection device.

A failure sign detection device that executes the failure sign detection method disclosed as one aspect of the present invention is described below.

FIG. 1 is a diagram showing an example hardware configuration in one embodiment of the failure sign detection device 1.

The failure sign detection device 1 can be implemented as a computer in which a CPU 101, a short-term storage unit (DRAM) 102, a long-term storage unit (HDD) 103, a network interface 104, an input device (keyboard, mouse, etc.) 105, and an output device (display, printer, etc.) 106 are connected by an internal network or the like.

The failure sign detection device 1 stores the information needed to detect failure signs of the monitored computer system as files in the long-term storage unit 103. An execution program is started from the input device 105 and loaded into the short-term storage unit 102, and it executes the failure sign detection processing based on information indicating the normal operating status of the monitored computer system (normal operation information) received through the network interface 104.

The failure sign detection device 1 proceeds with the failure sign detection processing while reading information from the long-term storage unit 103 into the short-term storage unit 102 as needed. It stores the normal operation information of the monitoring target in association with date and time information, sets thresholds (an upper limit and a lower limit) that define the allowable range for the time of monitoring based on the stored normal operation information, and outputs a failure sign detection when operation information acquired in real time falls outside the allowable range for the time of monitoring.

As normal operation information, the failure sign detection device 1 uses information on the operating status of the monitored computer system during normal operation, for example the CPU usage rate, storage area usage rate, and number of unprocessed data items of the computer devices that run the systems making up the monitored computer system.

Note that the execution program for the failure sign detection processing may be stored not only on a recording medium such as a CD-ROM, CD-RW, DVD-R, DVD-RAM, DVD-RW, or flexible disk, but also in another storage device reached over a communication line, on a computer's hard disk, or the like.

FIG. 2 is a diagram showing an example of functional blocks in one embodiment of the disclosed failure sign detection device 1.

In one embodiment, the failure sign detection device 1 takes a computer system installed in a medical institution as the monitoring target 2, treats the systems that make up that computer system as monitored systems 2A to 2C, and detects failure signs for them.

To execute this processing, the failure sign detection device 1 includes a monitoring result acquisition unit 11, a monitoring result comparison unit 12, a normal operation information comparison unit 13, a normal operation information calculation unit 14, and a sign detection notification unit 15, and, as data stores, a monitoring result log table 21, a monitoring threshold table 22, and a normal operation information table 23.

The failure sign detection device 1 may further include an operating system comparison unit 16, a threshold setting unit 17, and an operating system table 24 as a data store.

The monitoring result acquisition unit 11 acquires monitoring result data, which indicates operating status during normal operation, from each of the monitored systems 2A to 2C of the computer system that is the monitoring target 2, and records monitoring result log data, in which the monitoring target and the monitoring date and time are attached to the monitoring data, in the monitoring result log table 21. The monitoring result data is assumed to be generated by a monitoring program or the like resident on each computer device that runs the monitored systems 2A to 2C and to be sent to the failure sign detection device 1.

FIG. 3 is a diagram showing an example data structure of the monitoring result log table 21.

The monitoring result log table 21 has the data items facility, monitoring date and time, monitored device, monitoring item, and monitoring result. "Facility" is information identifying the place where the computer system of the monitoring target 2 is installed. "Monitoring date and time" is information indicating when the monitoring result data was acquired. "Monitored device" is information identifying a device, such as a computer device, that runs one of the monitored systems of the computer system of the monitoring target 2.

"Monitoring item" is information indicating which aspect of the operating status is monitored for the monitoring target 2; for example, CPU usage rate, storage area usage rate (disk usage rate), and the number of data items still unprocessed among the data to be processed (unprocessed data count) are set in advance. "Monitoring result" is the value acquired at the monitoring date and time for the status of the monitoring item.

In the example of the monitoring result log table 21 shown in FIG. 3, the first record indicates that the monitoring result for "CPU usage rate", acquired at "00:00 on February 20, 2012" on the "electronic medical record server" that forms part of the computer system of the monitoring target 2 installed at "Hospital A", is "32%".
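As an illustration only, one row of the monitoring result log table 21 could be modeled as a small record type. The class name and field names below are hypothetical; only the five data items and the values of the first record come from the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MonitoringLogRecord:
    """One row of the monitoring result log table 21 (field names are assumptions)."""
    facility: str           # place where the monitored computer system is installed
    monitored_at: datetime  # monitoring date and time
    device: str             # monitored device
    item: str               # monitoring item, e.g. CPU usage rate
    result: float           # monitoring result observed at monitored_at


# The first record described for FIG. 3.
first_row = MonitoringLogRecord(
    facility="Hospital A",
    monitored_at=datetime(2012, 2, 20, 0, 0),
    device="electronic medical record server",
    item="CPU usage rate",
    result=32.0,
)
```

Keeping the timestamp as a `datetime` means the day of the week, week number, and time slot used later for classification can be derived from the record itself.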

When monitoring result log data acquired at the current time is recorded in the monitoring result log table 21, the monitoring result comparison unit 12 compares the monitoring result of that log data with the monitoring thresholds stored in the monitoring threshold table 22 and outputs "abnormality detection" when the monitoring result exceeds the corresponding monitoring threshold.

FIG. 4 is a diagram showing an example data structure of the monitoring threshold table 22.

The monitoring threshold table 22 has the data items facility, monitored device, monitoring item, threshold th1, and threshold th2.

"Facility", "monitored device", and "monitoring item" in the monitoring threshold table 22 hold the same information as the data items of the same names in the monitoring result log table 21. The monitoring thresholds "threshold th1" and "threshold th2" are the information used to decide whether to output an abnormality detection. It is sufficient for one monitoring threshold to be set, but, as shown in FIG. 4, a plurality of thresholds may be set according to the stage of the abnormality.

The first record of the monitoring threshold table 22 shown in FIG. 4 indicates that threshold th1 = 85% and threshold th2 = 90% are set for the "CPU usage rate" of the monitored system "electronic medical record server" of the computer system of the monitoring target 2 installed at "Hospital A".

When monitoring result log data acquired in real time is recorded in the monitoring result log table 21, the monitoring result comparison unit 12 extracts from the monitoring threshold table 22 the threshold th1 and threshold th2 whose facility, monitored device, and monitoring item match those of the log data, and outputs "abnormality detection" when it determines that the monitoring result of the log data exceeds either threshold th1 or threshold th2.
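The threshold lookup and comparison performed by the monitoring result comparison unit 12 might be sketched as follows. The dictionary layout, the function name, and the reading of "exceeds" as a strict greater-than comparison are assumptions made for illustration; the 85% and 90% values are the first record described for FIG. 4.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for the monitoring threshold table 22,
# keyed by (facility, monitored device, monitoring item).
ThresholdKey = Tuple[str, str, str]
THRESHOLD_TABLE: Dict[ThresholdKey, Tuple[float, ...]] = {
    ("Hospital A", "electronic medical record server", "CPU usage rate"): (85.0, 90.0),
}


def detect_abnormality(facility: str, device: str, item: str, result: float,
                       table: Dict[ThresholdKey, Tuple[float, ...]] = THRESHOLD_TABLE) -> bool:
    """Return True when the result exceeds any threshold (th1, th2, ...) for the key.

    Assumption: a key with no thresholds configured never triggers.
    """
    thresholds = table.get((facility, device, item), ())
    return any(result > th for th in thresholds)
```

For example, a CPU usage rate of 32% on the electronic medical record server stays below both thresholds, while 87% exceeds th1 and would lead to "abnormality detection".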

When the monitoring result of the monitoring result log data does not exceed the corresponding monitoring thresholds (threshold th1 and threshold th2), the normal operation information comparison unit 13 compares that monitoring result with the allowable range calculated from the normal operation information whose date-and-time-based conditions (day of the week, day of the month, week number, or time slot) match the monitoring date and time of the log data, and outputs "failure sign detection" when the monitoring result falls outside the allowable range.

FIG. 5 is a diagram showing an example data structure of the normal operation information table 23.

The normal operation information table 23 has the data items facility, monitored device, monitoring item, condition category, condition, monitoring time, and allowable range.

"Facility", "monitored device", and "monitoring item" in the normal operation information table 23 record the same information as the data items of the same names in the monitoring result log table 21.

"Condition category" is a condition for applying the allowable range and classifies the month and day of the monitoring date and time. Categories such as day of the week, week number, and day of the month are set as the "condition category". The conditions set for each category are: for "day of the week", each day from "Sunday" to "Saturday"; for "week number", the week number of each week within the year; and for "day of the month", the ordinal day within the month, the end of the month, and so on.
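A minimal sketch of how a monitoring date could be mapped to the condition categories described above. The function name and dictionary keys are hypothetical, and using the ISO 8601 week number for the "week number within the year" is an assumption; the description does not say how weeks are counted.

```python
import calendar
from datetime import date

# Explicit English names so the result does not depend on the process locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def condition_keys(d: date) -> dict:
    """Derive the condition for each condition category from a monitoring date."""
    last_day = calendar.monthrange(d.year, d.month)[1]  # number of days in the month
    return {
        "day_of_week": WEEKDAYS[d.weekday()],
        "week_number": d.isocalendar()[1],  # assumption: ISO 8601 week of the year
        "day_of_month": d.day,
        "is_month_end": d.day == last_day,
    }
```

Each returned value can then be matched against the "condition" column of the rows that share the same condition category.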

"Monitoring time" classifies the time of day of the monitoring date and time and indicates the central time of the monitoring time slot. For example, when the "monitoring time" is "0:00", a predetermined time slot before and after 0:00, centered on it, is the condition for the monitoring date and time.
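The time-slot condition can be illustrated with a check of whether an observation falls within a window centered on the "monitoring time". The 30-minute half-width is purely an assumed value, since the description only speaks of a predetermined time slot; the function name is also hypothetical. The wrap-around handling matters for a slot centered on 0:00.

```python
from datetime import time

MINUTES_PER_DAY = 24 * 60


def in_monitoring_window(observed: time, center: time, half_width_min: int = 30) -> bool:
    """Return True if `observed` lies within +/- half_width_min minutes of `center`.

    The distance is measured around the clock, so 23:45 counts as 15 minutes
    away from a slot centered on 0:00.
    """
    def to_minutes(t: time) -> int:
        return t.hour * 60 + t.minute

    diff = abs(to_minutes(observed) - to_minutes(center))
    diff = min(diff, MINUTES_PER_DAY - diff)  # wrap around midnight
    return diff <= half_width_min
```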

"Allowable range" is the range that can be accepted as normal, obtained from the distribution of monitoring results under normal operating conditions classified by date-and-time-based conditions. In FIG. 5 it is written as "lower limit" to "upper limit". How the allowable range is calculated is described later.

The first record of the normal operation information table 23 shown in FIG. 5 indicates that, for the "CPU usage rate" of the "electronic medical record server" of the computer system of the monitoring target 2 installed at "Hospital A", a failure sign is judged to have been detected when a monitoring result obtained on a "Sunday" at around "0:00" falls outside "30% to 35%".

The normal operation information comparison unit 13 identifies, for each condition category, the class (day of the week, week number, monitoring time) to which the monitoring date and time of the monitoring result log data belongs. Suppose here that "Sunday", "week 1", and "0:00" are identified from the monitoring date and time.

The normal operation information comparison unit 13 matches the identified classes of the monitoring date and time against the conditions in the normal operation information table 23, obtains the maximum upper limit and the minimum lower limit of the one or more applicable allowable ranges, and outputs "failure sign detection" when the monitoring result of the log data exceeds the maximum upper limit or falls below the minimum lower limit.
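When several allowable ranges apply to one observation (for example one for the day of the week and one for the week number), the combination rule described above can be sketched as below. The function names are hypothetical, and ranges are represented as (lower limit, upper limit) pairs by assumption.

```python
from typing import Iterable, Tuple

Range = Tuple[float, float]  # (lower limit, upper limit)


def combined_range(ranges: Iterable[Range]) -> Range:
    """Return (minimum lower limit, maximum upper limit) over all applicable ranges."""
    ranges = list(ranges)
    if not ranges:
        raise ValueError("at least one allowable range is required")
    lower = min(lo for lo, _ in ranges)
    upper = max(hi for _, hi in ranges)
    return lower, upper


def is_failure_sign(result: float, ranges: Iterable[Range]) -> bool:
    """True when the result lies outside the combined allowable range."""
    lower, upper = combined_range(ranges)
    return result < lower or result > upper
```

Taking the widest envelope means a value is flagged only if it is unusual under every applicable condition. In the usage below, (30, 35) is the FIG. 5 example and (28, 33) is a made-up second range.

```python
is_failure_sign(36.0, [(30.0, 35.0), (28.0, 33.0)])  # outside 28..35
is_failure_sign(29.0, [(30.0, 35.0), (28.0, 33.0)])  # inside 28..35
```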

The normal operation information calculation unit 14 takes monitoring result log data whose monitoring result did not fall outside the allowable range, that is, log data for which neither an abnormality nor a failure sign was detected, and classifies it, based on its monitoring date and time, by the applicable condition in each predefined condition category (day of the week, day of the month, week number) and by monitoring time (time slot). Based on the distribution of monitoring results in each condition category, it calculates the allowable range for each condition and records it in the normal operation information table 23. Specifically, the normal operation information calculation unit 14 calculates a frequency distribution of the monitoring results of the classified normal operation information over predetermined bins (for example, every 5 minutes), finds the monitoring result of the bin (range) where the distribution peaks, and determines fixed upper and lower limits from that result to form the allowable range.
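One possible reading of this calculation is sketched below: bin the normal monitoring results, take the most frequent bin, and widen it by a fixed margin. This is an interpretation rather than the patented procedure. Binning by value, the bin width of 5, the margin, the tie-breaking rule, and the function name are all assumptions, and the sample values are invented.

```python
from collections import Counter
from typing import Sequence, Tuple


def allowable_range(results: Sequence[float], bin_width: float = 5.0,
                    margin: float = 0.0) -> Tuple[float, float]:
    """Derive (lower limit, upper limit) from the peak of a frequency distribution.

    Results are grouped into bins of `bin_width`; the bin holding the most
    results is taken as the normal level and widened by `margin` on each side.
    Ties are resolved in favor of the lowest bin.
    """
    if not results:
        raise ValueError("no normal monitoring results to learn from")
    counts = Counter(int(r // bin_width) for r in results)
    # Highest count first; among equal counts prefer the smallest bin index.
    mode_bin, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    lower = mode_bin * bin_width - margin
    upper = (mode_bin + 1) * bin_width + margin
    return lower, upper
```

With invented CPU usage samples `[31, 32, 33, 34, 32, 41]`, five of the six values fall in the 30 to 35 bin, so the function returns `(30.0, 35.0)`; a margin of 2 widens that to `(28.0, 37.0)`.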

When the monitoring result comparison unit 12 outputs "abnormality detection" or the normal operation information comparison unit 13 outputs "failure sign detection", the sign detection notification unit 15 sends the output "abnormality detection" or "failure sign detection", as information indicating an abnormality in a monitored system of the monitoring target 2, to a preset notification destination such as a monitoring system or an administrator terminal.

When a computer system installed at a new facility becomes a monitoring target 2, the operating system comparison unit 16 acquires information on the system configuration and used functions of that computer system and adds it to the operating system table 24. It then compares the configuration and used functions of the operating systems of the added computer system with those of the computer systems of the existing monitoring targets 2, and identifies an existing computer system whose operating system configuration matches that of the new monitoring target 2 at a high rate.

FIG. 6 is a diagram showing an example data structure of the operating system table 24.

The operating system table 24 has the data items facility, operating system configuration, and used functions.

"Facility" is the place where the monitoring target 2 is installed. "Operating system" is information identifying an operating system (a system in operation) that the computer system of the monitoring target 2 includes. It may describe not only hardware such as the equipment and devices that make up the computer system but also software such as the OS and application programs.

"Used functions" is information indicating the usage state of the functions of an operating system in the computer system of the monitoring target 2; states such as all functions in use (all functions) or some functions unused (some functions unused) are recorded.

In the operating system table 24 shown in FIG. 6, the first to third records indicate that the computer system of the monitoring target 2 installed at "Hospital A" includes the operating systems electronic medical record system, medical accounting system, and meal service system, and that all functions are used in each of them. The fourth to sixth records indicate that the computer system of the monitoring target 2 installed at "Hospital B" includes an electronic medical record system, a medical accounting system, and an examination system, and that some functions of the examination system are unused.

When an existing computer system whose operating system configuration matches that of the new monitoring target 2 at a high rate can be identified in the operating system table 24, the threshold setting unit 17 extracts the monitoring thresholds and normal operation information for the operating systems of the identified computer system from the monitoring threshold table 22 and the normal operation information table 23, and copies them into the monitoring thresholds and normal operation information of the new monitoring target 2.

Suppose that a computer system installed at "Hospital C" is added to the operating system table 24 as a new monitoring target 2, and that the match rate required when judging whether monitoring targets match is set to 100% (exact match). In this case, in the operating system table 24 shown in FIG. 6, the computer system of "Hospital C" matches the computer system of the existing monitoring target 2 at "Hospital A" in both "operating system" and "used functions". The threshold setting unit 17 therefore extracts the monitoring thresholds and normal operation information for the computer system of "Hospital A" from the monitoring threshold table 22 and the normal operation information table 23 and copies them into the monitoring threshold data and normal operation information of "Hospital C".

On the other hand, suppose a computer system installed at "Hospital D" is added to the operating system table 24 as a new monitoring target 2. The computer system of "Hospital D" has the same "operating system configuration" as that of "Hospital A", but part of its "used functions" does not match. The threshold setting unit 17 therefore does not reuse the information of any existing monitoring target 2 and instead generates the monitoring threshold data and normal operation information for the computer system of "Hospital D" by setting predetermined initial values.
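The Hospital C and Hospital D cases can be illustrated with a small matching routine. The description does not define how the match rate is computed, so the ratio used here (identical entries divided by all systems present at either site) is an assumption, as are the function names, the "all" and "partial" labels, and the choice of which Hospital D function differs.

```python
from typing import Dict, Optional

Config = Dict[str, str]  # operating system name -> used-functions state


def match_ratio(a: Config, b: Config) -> float:
    """Share of systems (over both sites) present in both with the same usage state."""
    systems = set(a) | set(b)
    if not systems:
        return 0.0
    same = sum(1 for s in systems if s in a and s in b and a[s] == b[s])
    return same / len(systems)


def find_template(new: Config, existing: Dict[str, Config],
                  required: float = 1.0) -> Optional[str]:
    """Return the best-matching existing facility at or above `required`, else None."""
    best, best_ratio = None, 0.0
    for facility, config in existing.items():
        ratio = match_ratio(new, config)
        if ratio >= required and ratio > best_ratio:
            best, best_ratio = facility, ratio
    return best


existing = {
    "Hospital A": {"electronic medical record system": "all",
                   "medical accounting system": "all",
                   "meal service system": "all"},
    "Hospital B": {"electronic medical record system": "all",
                   "medical accounting system": "all",
                   "examination system": "partial"},
}
hospital_c = dict(existing["Hospital A"])                      # same as Hospital A
hospital_d = {**existing["Hospital A"], "meal service system": "partial"}
```

With the required rate at 100%, `find_template(hospital_c, existing)` selects "Hospital A" as the source to copy from, while `find_template(hospital_d, existing)` returns `None`, which corresponds to falling back to predetermined initial values.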

FIG. 7 is a diagram showing the failure sign detection processing flow in one embodiment of the failure sign detection device 1.

At fixed intervals, the monitoring result acquisition unit 11 of the failure sign detection device 1 acquires monitoring result data from the monitoring target devices that run each monitoring target system of the computer system of the monitoring target 2. The data includes the facility, the monitoring target device, the monitoring date and time, the monitoring item, and the monitoring result. The unit then updates the monitoring result log table 21 (step S1).

The monitoring result comparison unit 12 acquires the monitoring thresholds (threshold th1, threshold th2) corresponding to the added monitoring result log data from the monitoring threshold table 22 (step S2), and determines whether the monitoring result in the monitoring result log data exceeds a monitoring threshold (step S3).

If the monitoring result does not exceed the monitoring thresholds (N in step S3), the normal operation information comparison unit 13 acquires, from the normal operation information table 23, the allowable range of the normal operation information corresponding to that monitoring result log data (step S4), and determines whether the monitoring result exceeds the acquired allowable range (upper limit or lower limit) (step S5).

If the monitoring result does not exceed the acquired allowable range (upper limit and lower limit) (N in step S5), the normal operation information calculation unit 14 assigns to the monitoring result log data the condition and monitoring time corresponding to its monitoring date and time, calculates normal operation information in which the allowable range (upper limit and lower limit) of the same normal operation information is set (step S6), and updates the normal operation information table 23 with the calculated normal operation information (step S7).

If the monitoring result in the monitoring result data exceeds either monitoring threshold (threshold th1 or threshold th2) in step S3 (Y in step S3), or if the monitoring result exceeds either bound of the allowable range (upper limit or lower limit) in step S5 (Y in step S5), the sign detection notification unit 15 notifies a predetermined destination of the output abnormality information, which includes the abnormality detection or the failure sign detection (step S8).
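The flow of steps S2 to S8 can be sketched roughly as follows. This is a minimal illustration, not the implementation in the embodiment: the `Record` fields, the dictionary layouts standing in for tables 22 and 23, and the function name `check` are all hypothetical. The text does not say how th1 and th2 differ, so the sketch assumes both are upper limits (for example a warning level and an error level). Recalculating the allowable range in step S6 is omitted; the sample is only stored.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One monitoring result; field names are illustrative."""
    device: str
    item: str      # monitoring item, e.g. "cpu_usage"
    bucket: str    # condition / monitoring-time bucket of the timestamp
    value: float


def check(record, thresholds, ranges, history):
    """Classify one result as "abnormal", "sign", or "normal".

    thresholds: {(device, item): (th1, th2)}
    ranges:     {(device, item, bucket): (lower, upper)}
    history:    {(device, item, bucket): [values seen in normal operation]}
    """
    # Steps S2-S3: fixed monitoring thresholds (assumed to be upper limits).
    th1, th2 = thresholds[(record.device, record.item)]
    if record.value > th1 or record.value > th2:
        return "abnormal"              # reported in step S8

    # Steps S4-S5: allowable range for this condition and time bucket.
    key = (record.device, record.item, record.bucket)
    bounds = ranges.get(key)
    if bounds is not None:
        lower, upper = bounds
        if record.value < lower or record.value > upper:
            return "sign"              # reported in step S8

    # Steps S6-S7: keep the value as a normal-operation sample.
    history.setdefault(key, []).append(record.value)
    return "normal"
```

The point the sketch tries to show is the ordering: a value below the fixed thresholds can still be flagged as a failure sign when it falls outside the range that is normal for its own bucket, and only values that pass both checks feed back into the normal operation information.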

FIG. 8 is a diagram illustrating an example of the relationship between the monitoring results acquired by the failure sign detection device 1 and the allowable range.

The graph in FIG. 8 shows how the “CPU usage rate” monitoring result (n%) changes over time, in relation to the allowable range. The values come from the monitoring result log data that the failure sign detection device 1 acquired over one day (0:00 to 24:00) from a monitoring target installed in a hospital. The horizontal axis shows elapsed time and the vertical axis shows “CPU usage rate (%)”.

As the graph in FIG. 8 shows, the utilization of the system around 12:00 (the lunch break) is lower than in the time zones before and after it, and the monitoring result (CPU usage rate) obtained from the monitoring target 2 reflects this. For this monitoring target 2, therefore, a failure sign cannot be detected accurately unless the allowable range around 12:00 noon is also set lower than in the time zones before and after it.

The failure sign detection device 1 determines each allowable range from the normal operation information, that is, measured values showing the operating status during normal operation, classified by date-and-time-based condition and by monitoring time. As FIG. 8 shows, within a single day the time-dependent variation in the normal operating status of the monitoring target 2 is therefore reflected in the allowable range. The same holds if the graph in FIG. 8 is read as showing variation by time zone within a particular day of the week or date, or if its horizontal axis shows day-by-day variation within a month or a week: in each case the failure sign detection device 1 can set an allowable range that fits the normal operating status of the monitoring target 2.
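As an illustration of how an allowable range might be derived per classification, the sketch below groups normal-operation samples by weekday and hour and sets each range to the mean plus or minus k standard deviations. Both the weekday-and-hour split and the mean ± k·σ rule are assumptions made for the example; this passage says only that the range is based on the distribution of the classified data. The names `bucket_key` and `allowable_ranges` are hypothetical.

```python
from datetime import datetime
from statistics import mean, stdev


def bucket_key(ts: datetime) -> tuple:
    """Classify a timestamp by (weekday, hour); Monday is 0."""
    return (ts.weekday(), ts.hour)


def allowable_ranges(samples, k=3.0):
    """Build {bucket: (lower, upper)} from normal-operation samples.

    samples: iterable of (datetime, value) pairs.
    The mean +/- k * stdev rule is an assumed way to turn a
    distribution into a range, not one taken from the source.
    """
    grouped = {}
    for ts, value in samples:
        grouped.setdefault(bucket_key(ts), []).append(value)

    ranges = {}
    for key, values in grouped.items():
        if len(values) < 2:
            continue  # stdev needs at least two samples
        m, s = mean(values), stdev(values)
        ranges[key] = (m - k * s, m + k * s)
    return ranges
```

With ranges built this way, a CPU usage rate that is ordinary at 10:00 on a Monday can still fall outside the range for 12:00 on a Monday, which is the lunch-break effect described for FIG. 8.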

FIG. 9 is a diagram illustrating the threshold setting processing flow in one embodiment of the failure sign detection device 1.

The operating system comparison unit 16 of the failure sign detection device 1 acquires information indicating the operating systems of the monitoring target 2 (operating system information) from the operating system table 24 (step S11), and determines whether the operating system configuration of the computer system to be newly added as a monitoring target 2 matches the acquired operating system information (operating system configuration) of an existing monitoring target 2 at a high ratio (step S12).

If the operating system configuration of the new monitoring target 2 is determined to match the acquired operating system information of the existing monitoring target 2 at a high ratio (Y in step S12), the threshold setting unit 17 acquires the monitoring thresholds and normal operation information of the monitoring target devices corresponding to each operating system of that existing monitoring target 2 from the monitoring threshold table 22 and the normal operation information table 23, respectively (step S13). Based on the acquired monitoring thresholds and normal operation information, it then generates monitoring threshold data and normal operation information for the operating systems of the new monitoring target 2 and updates the monitoring threshold table 22 and the normal operation information table 23 (step S14).

If the operating system configuration of the new monitoring target 2 is not determined to match the acquired operating system information of the existing monitoring target 2 at a high ratio (N in step S12), the processing ends as is.
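Steps S11 to S14 can be sketched as below, under several assumptions. The text does not define how the matching ratio is computed, so the sketch uses the share of configuration entries common to both targets (a Jaccard ratio) as a stand-in. The dictionary layouts, the function names `match_ratio` and `inherit_settings`, and the idea of taking the first sufficiently similar target are all hypothetical.

```python
import copy


def match_ratio(new_cfg: dict, existing_cfg: dict) -> float:
    """Share of (category, entry) pairs common to both configurations.

    Each configuration maps a category such as "systems" or
    "functions" to a set of entries. Jaccard similarity is an
    assumed metric; the source only speaks of a matching ratio.
    """
    new_items = {(k, v) for k, vals in new_cfg.items() for v in vals}
    old_items = {(k, v) for k, vals in existing_cfg.items() for v in vals}
    union = new_items | old_items
    if not union:
        return 1.0
    return len(new_items & old_items) / len(union)


def inherit_settings(new_name, configs, thresholds, normal_info,
                     required_ratio=1.0):
    """Copy settings from the first sufficiently similar existing target.

    Returns the name of the target copied from, or None when no
    existing target reaches required_ratio (N in step S12).
    """
    new_cfg = configs[new_name]                       # step S11
    for name, cfg in configs.items():
        if name == new_name:
            continue
        if name not in thresholds or name not in normal_info:
            continue                                  # nothing to copy
        if match_ratio(new_cfg, cfg) >= required_ratio:   # step S12
            # Steps S13-S14: deep copies keep later updates independent.
            thresholds[new_name] = copy.deepcopy(thresholds[name])
            normal_info[new_name] = copy.deepcopy(normal_info[name])
            return name
    return None
```

With `required_ratio=1.0`, a target whose systems and functions are identical to an existing one inherits its settings, as in the “C hospital” case, while a target that differs in some functions does not, as in the “D hospital” case. When the function returns `None`, the caller would fall back to predetermined initial values.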

As described above, the disclosed failure sign detection device 1 can set, for each operating system that makes up a monitoring target, an allowable range that reflects how the operating status varies with date and time.

Furthermore, even for a monitoring target 2 that has been operated without any abnormal state occurring, the failure sign detection device 1 can set an allowable range that fits the operating status, based solely on the operating status during normal operation.

Thus, the failure sign detection device 1 can judge whether a monitoring result lies within the range regarded as normal by using thresholds that correspond to the operating status of the monitoring target, and can therefore achieve more accurate failure sign detection.

The constituent elements of the failure sign detection device 1 described above may be realized in any combination. Multiple constituent elements may be realized as a single member, and a single constituent element may be made up of multiple members. The failure sign detection device 1 is not limited to the embodiment described above, and various improvements and modifications may of course be made without departing from the gist of the present invention.

DESCRIPTION OF SYMBOLS
1 Failure sign detection device
11 Monitoring result acquisition unit
12 Monitoring result comparison unit
13 Normal operation information comparison unit
14 Normal operation information calculation unit
15 Sign detection notification unit
16 Operating system comparison unit
17 Threshold setting unit
21 Monitoring result log table
22 Monitoring threshold table
23 Normal operation information table
24 Operating system table
2 Monitoring target
2A to 2C Monitoring target systems

Claims

A failure sign detection method in which a computer executes processing comprising:
classifying monitoring data of a monitored system, from a period in which no abnormality was detected for the monitored system, by day of the week, time zone, date, or week number, and storing the classified monitoring data in a storage unit;
setting an allowable range based on the distribution of the monitoring data stored in the storage unit for each day of the week, time zone, date, or week number; and
comparing monitoring data currently acquired from the monitored system with the allowable range based on the distribution of the monitoring data for the day of the week, time zone, date, or week number to which the current date and time belongs, and detecting a failure sign of the monitored system when the acquired monitoring data exceeds the upper limit or the lower limit of the allowable range.

The failure sign detection method according to claim 1, wherein the computer further executes processing comprising:
storing the configuration of the monitored system in the storage unit; and
when the configuration of a system to be newly monitored matches the configuration of the monitored system at a high ratio, using the allowable range based on the distribution, for each day of the week, time zone, date, or week number, of the monitoring data of the monitored system stored in the storage unit as the allowable range of the newly monitored system.

A failure sign detection program for causing a computer to execute processing comprising:
classifying monitoring data of a monitored system, from a period in which no abnormality was detected for the monitored system, by day of the week, time zone, date, or week number, and storing the classified monitoring data in a storage unit;
setting an allowable range based on the distribution of the monitoring data stored in the storage unit for each day of the week, time zone, date, or week number; and
comparing monitoring data currently acquired from the monitored system with the allowable range based on the distribution of the monitoring data for the day of the week, time zone, date, or week number to which the current date and time belongs, and detecting a failure sign of the monitored system when the monitoring data exceeds the upper limit or the lower limit of the allowable range.

A failure sign detection device comprising:
a storage unit that classifies and stores monitoring data of a monitored system, from a period in which no abnormality was detected for the monitored system, by day of the week, time zone, date, or week number;
a normal operation information calculation unit that sets an allowable range based on the distribution of the monitoring data stored in the storage unit for each day of the week, time zone, date, or week number; and
a normal operation information comparison unit that compares monitoring data currently acquired from the monitored system with the allowable range based on the distribution of the monitoring data for the day of the week, time zone, date, or week number to which the current date and time belongs, and detects a failure sign of the monitored system when the monitoring data exceeds the upper limit or the lower limit of the allowable range.