TWI808762B

TWI808762B - Event monitoring method

Info

Publication number: TWI808762B
Application number: TW111118461A
Authority: TW
Inventors: 陳惠群; 趙偉庭
Original assignee: 動力安全資訊股份有限公司
Priority date: 2022-05-18
Filing date: 2022-05-18
Publication date: 2023-07-11
Also published as: TW202347129A

Abstract

An event monitoring method comprises performing, by a monitoring host: training a first learning model of the monitoring host with a plurality of pieces of first data and a plurality of first commands of a remote host to generate a second learning model; detecting an operating state of the remote host to generate second state data; generating, by the second learning model, a second command corresponding to the second state data; and outputting the second command to the remote host.

Description

Abnormal event monitoring method

The present invention relates to a method for monitoring abnormal events.

With the rapid development of information technology, most enterprises use various systems, such as cloud systems and information security systems, to make their internal operations more efficient. After a system is established, it still needs to be monitored to ensure that it operates well. This monitoring is usually still performed by the system vendor's engineers. Therefore, when the system faces an emergency or even a risk of interrupted operation, engineers often have to step in personally to handle it. However, if an engineer is inattentive or lacks the experience to handle the situation properly, the system may fail again after being temporarily repaired. In addition, engineers generally identify the problem from the relevant abnormality records and make repairs only after the system already has a problem and cannot operate normally, which often increases the system's repair time.

In view of the above, the present invention provides an abnormal event monitoring method that meets the above needs.

An abnormal event monitoring method according to an embodiment of the present invention includes performing, by a monitoring host: training a first learning model of the monitoring host with a plurality of pieces of first data and a plurality of first commands of a remote host to generate a second learning model; detecting, by the monitoring host, the operating state of the remote host to generate second data; generating, by the second learning model, a corresponding second command according to the second data; and outputting, by the monitoring host, the second command to the remote host.

In summary, with the abnormal event monitoring method according to one or more embodiments of the present invention, the monitoring host can predict in advance abnormal events that may occur on the remote host and can output the corresponding second command to the remote host, reducing the probability that an abnormal condition occurs on the remote host. The method also prevents human error from disrupting the repair of the remote host. In addition, the method can determine the second command applicable to the remote host automatically and accurately. Finally, by pruning and condensing the data used to train the first learning model so as to produce the first data set and the second data set, the method reduces the amount of computation the monitoring host performs while generating the first learning model.

The above description of the disclosure and the following description of the embodiments are intended to demonstrate and explain the spirit and principles of the present invention, and to provide further explanation of the scope of the claims.

The detailed features and advantages of the present invention are described in the embodiments below in sufficient detail to enable any person skilled in the relevant art to understand the technical content of the present invention and implement it accordingly. From the content disclosed in this specification, the claims, and the drawings, any person skilled in the relevant art can readily understand the related objectives and advantages of the present invention. The following embodiments further describe aspects of the present invention in detail, but do not limit the scope of the present invention in any way.

The abnormal event monitoring method of the present invention can be executed by a monitoring host, which monitors the operating state of a remote host; the monitoring host and the remote host are communicatively connected. For example, the remote host may be a network core switch, and the monitoring host is then a computing device that monitors the operating state of the network core switch. Alternatively, taking cloud computing or a cloud database as an example, the remote host is the computing device of a client using the cloud platform, and the monitoring host may be a computing device of the vendor that provides the cloud platform, or a computing device installed at the client site to monitor the cloud platform. The monitoring host may be a server, a general-purpose processor, a digital signal processor (DSP), a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) circuit, a complex programmable logic device (CPLD), or the like. The present invention does not limit the type of the monitoring host.

Please refer to FIG. 1, which is a flowchart of an abnormal event monitoring method according to an embodiment of the present invention. As shown in FIG. 1, the abnormal event monitoring method includes: step S1: training a first learning model of the monitoring host with a plurality of pieces of first data and a plurality of first commands to generate a second learning model; step S3: detecting the operating state of the remote host to generate second data; step S5: generating, by the second learning model, a corresponding second command according to the second data; and step S7: outputting the second command to the remote host.

In step S1, the monitoring host inputs the first data and the first commands into the untrained first learning model to produce the trained second learning model. The monitoring host stores the learning model, and the learning model may use an algorithm such as the K-nearest neighbors (KNN) algorithm.
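As a minimal sketch of step S1 under the KNN option: "training" a K-nearest-neighbors model amounts to storing labeled samples, and prediction takes a majority vote among the k closest stored samples. The feature layout (CPU and memory utilization), the command strings, and the names `train_knn` and `predict_command` are hypothetical illustrations and are not taken from the patent.

```python
import math
from collections import Counter

def train_knn(first_data, first_commands):
    """'Train' a KNN model by pairing each feature vector with its command."""
    if len(first_data) != len(first_commands):
        raise ValueError("each piece of first data needs one first command")
    return list(zip(first_data, first_commands))

def predict_command(model, second_data, k=3):
    """Return the majority command among the k nearest stored samples."""
    nearest = sorted(model, key=lambda pair: math.dist(pair[0], second_data))[:k]
    votes = Counter(command for _, command in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical history: (cpu_utilization, memory_utilization) -> command
first_data = [(0.90, 0.40), (0.95, 0.50), (0.88, 0.45),
              (0.30, 0.92), (0.35, 0.97), (0.25, 0.90)]
first_commands = ["throttle_traffic"] * 3 + ["free_memory"] * 3

second_model = train_knn(first_data, first_commands)
suggested = predict_command(second_model, (0.91, 0.42), k=3)
```

Here the stored pairs play the role of the second learning model, and `suggested` stands in for a second command chosen for a new observation.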

The first data are data on abnormal events that occurred during past operation of the remote host. Each piece of first data is generated from first state data, which are all of the raw data produced by the remote host during past operation. Specifically, the first state data may be taken from the remote host's system log (Syslog) and include raw data from both normal and abnormal operation. The monitoring host may pre-store preset values for different components. The preset values are used to determine whether the operating data of any component in the first state data exceed the corresponding preset value, and operating data that exceed their preset value serve as the first data. In other words, the first data may be the raw data from when the remote host operated abnormally.

Taking a network core switch as the remote host, the first data may be generated from first state data whose content indicates that "the utilization of components such as the central processing unit, memory, input/output interface and/or network interface of the network core switch reached 85% or more within three minutes". The first state data may also include utilization data of those components during past normal and abnormal operation, and the first data may include the utilization of the components within the three minutes and their utilization data during abnormal operation. In one example, the first data may be generated from first state data indicating that "the utilization of components such as the central processing unit, memory, input/output interface and/or network interface of the network core switch is rising faster than the respective preset rates". The first state data may also include utilization data of those components during past normal and abnormal operation together with the corresponding rates of change, and the first data may include the utilization of the components whose rate of increase exceeds the respective preset rate, as well as utilization data and the corresponding rates of change during abnormal operation.

In one example, the first data may be generated from first state data indicating that "the number of packets transmitted by the network interface is higher than a preset number". The first state data may also include the number of packets transmitted by the network interface at past points in time, and the first data may include the packet counts that exceed the preset number. In one example, the first state data may include all past messages in the system log, and the first data may include the error messages in the system log. In addition, the first state data may include the network device addresses associated with Media Access Control (MAC) addresses at past points in time, and the first data may include MAC address flapping. The first data may be generated from first state data indicating "a database failure of the network core switch (or any information device)"; the first state data may also include an excessive number of connections to the network core switch or heavy use of the network input/output (I/O) interface, and the first data may be the operating data of the network core switch at the time of the database failure together with its number of connections or the utilization of its network I/O interface. The first data may be generated from first state data indicating a "network loop"; the first state data may also include the load of the central processing unit of the network core switch, and the first data may be the operating data of the network core switch when the network loop occurred together with that load. The first data may also be a combination of the data of one or more of the above events.
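The preset-value comparison described above can be sketched as a simple filter over raw records. This is only an illustration: the record layout, the component names, the 0.85 thresholds, and the function name `extract_first_data` are assumptions made for the example.

```python
# Hypothetical preset values per component (utilization as a fraction).
PRESET_VALUES = {"cpu": 0.85, "memory": 0.85, "network_interface": 0.85}

def extract_first_data(first_state_data, preset_values=PRESET_VALUES):
    """Keep only readings whose value exceeds the preset value of its component."""
    return [
        record for record in first_state_data
        if record["component"] in preset_values
        and record["value"] > preset_values[record["component"]]
    ]

# Raw records covering both normal and abnormal operation.
first_state_data = [
    {"time": "10:00", "component": "cpu", "value": 0.42},
    {"time": "10:01", "component": "cpu", "value": 0.91},
    {"time": "10:01", "component": "memory", "value": 0.60},
    {"time": "10:02", "component": "network_interface", "value": 0.88},
]
first_data = extract_first_data(first_state_data)
```

Only the two readings above their preset value are kept as first data. Checks such as a rate of increase or a packet count would need their own comparison logic.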

The first commands are commands that the monitoring host previously sent to the remote host; that is, they are commands used to resolve or mitigate the conditions indicated by the first data. For example, when the operating state corresponding to the first data is that the utilization of a component of the network core switch is rising faster than the preset rate, the first command may include a command to reduce speed. In addition, step S1 may further include training the first learning model with a plurality of historical warning notifications, which may be output in text, voice, virtual reality, or other forms. Specifically, after the first learning model has been trained with the first data and the first commands (and possibly the historical warning notifications), the machine learning model of the monitoring host is the second learning model.

After the second learning model is generated, in step S3 the monitoring host detects the operating state of the remote host to obtain the second data of the remote host in real time or periodically. The second data may be raw data of the same type as the first state data. The difference between the second data of step S3 and the first state data is as follows: the first state data are used to produce the historical data (that is, the first data) that train the first learning model to generate the second learning model, whereas the second data of step S3 are raw data indicating the current operating state of the remote host. The second data are input into the second learning model so that the second learning model can predict abnormal events that may subsequently occur on the remote host and output the corresponding second command.

Specifically, take as an example the case in which the utilization of the remote host's central processing unit reaches 85%. A utilization close to 100% is usually only transient, but when it is not transient it may completely interrupt the remote host's service. Therefore, by training the first learning model with first data generated from the first state data, the trained second learning model can judge, before an abnormal state occurs, whether the abnormal state will in fact occur next.

In step S5, the monitoring host inputs the second data into the second learning model to generate the corresponding second command. Because the first data used for training are taken from the corresponding first state data, the second learning model can determine from the remote host's second data which abnormal event may be about to occur on the remote host. The second learning model can therefore judge, based on the first data and the first commands, whether an abnormal event is about to occur and, when it judges that one is, use the K-nearest neighbors algorithm to generate and output the second command corresponding to the second data. The second command generated by the second learning model can be used to mitigate, or resolve in advance, the abnormal event that may occur on the remote host. In addition, as mentioned above, the data used to train the first learning model in the training phase may include a plurality of historical warning notifications, so in step S5 the second learning model may also generate a corresponding warning notification.

In step S7, the monitoring host outputs the second command generated by the second learning model to the remote host. That is, after the second learning model generates the command corresponding to the second data, the monitoring host transmits the second command to the remote host, and the remote host can then correct its current operating state based on the received second command. In addition, if the second learning model also generates a warning notification in step S5, the monitoring host may also output the warning notification to the remote host in step S7. The warning notification may be output in text, voice, virtual reality, or other forms to remind the user of the remote host to pay attention to its operating state.
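One pass through steps S3, S5 and S7 can be sketched as a small function that takes the three operations as callables. The function name `monitoring_cycle`, the callables, and the stand-in model below are hypothetical; a real deployment would replace them with actual state collection, the trained second learning model, and a transport to the remote host.

```python
def monitoring_cycle(detect_state, model_predict, send_command):
    """Run one pass of steps S3, S5 and S7 and return the command sent, if any."""
    second_data = detect_state()                 # step S3: read current state
    second_command = model_predict(second_data)  # step S5: model picks a command
    if second_command is not None:
        send_command(second_command)             # step S7: output to remote host
    return second_command

# Stand-ins for a real remote host and a trained second learning model.
sent = []
result = monitoring_cycle(
    detect_state=lambda: {"cpu": 0.93},
    model_predict=lambda data: "throttle_traffic" if data["cpu"] > 0.85 else None,
    send_command=sent.append,
)
```

Passing the operations in as callables keeps the cycle testable without a live remote host. The cycle could be run on a timer to match the "real time or periodically" wording of step S3.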

Please refer to FIG. 2, which is a detailed flowchart of step S1 of FIG. 1. Steps S101, S103, S105, S107 and S109 of FIG. 2 describe how the first data are obtained, while steps S111, S113, S115, S117 and S119 describe how the first learning model is obtained. As shown in FIG. 2, step S1 may include: step S101: obtaining a first event of the remote host, wherein the first event occurred at a first time point; step S103: using a plurality of pieces of first state data in a time series that includes the first time point as a first data set; step S105: standardizing the first data set and generating a second data set, wherein the second data set includes a plurality of variables of the remote host over the time series; step S107: performing correlation analysis on the second data set; step S109: according to the result of the correlation analysis, removing one or more of the variables whose correlation is lower than a threshold to update the second data set, and using the updated second data set as one of the pieces of first data; step S111: using one of the pieces of first data as a training sample; step S113: determining, based on a regression model, whether the observation parameter and the training sample exhibit a trend pattern; if the training sample is determined not to exhibit a trend pattern, performing step S114: adjusting the observation parameter, or differencing the training sample; if the training sample is determined to exhibit a trend pattern, performing step S115: using another of the pieces of first data as a test sample; step S117: performing a time series test on the test sample according to the observation parameter and the regression model to determine whether the observation parameter and the regression model pass the time series test; if the observation parameter and the regression model do not pass the time series test, performing step S114; and if the observation parameter and the regression model pass the time series test, performing step S119: using the regression model as the first learning model.

Specifically, the first event of step S101 is an abnormal event that occurred on the remote host in the past, and the first time point is the historical time point at which the first event occurred. For example, if the historical abnormal event is a failure of the remote host, the first time point is the time at which the remote host failed. In short, the first event is the combination of the first data and the time point at which the abnormal condition indicated by the first data occurred.

In step S103, the monitoring host uses a plurality of pieces of first state data in a time series that includes the first time point as the first data set. The first state data are all of the data from the remote host's past operation (for example, fan speed and component utilization), and the first event and the first state data may be data of the same type. In other words, the monitoring host either goes back a preset period from the first time point to form the time series, or takes the first time point as the center and combines the preset periods before and after it to form the time series, and then generates the first data set from the time series and the remote host's first state data within that time series. For example, suppose the monitoring host sends a signal to the remote host in order to receive an acknowledgement (ACK) from it. Suppose further that the first data is that the time between sending the signal and receiving the acknowledgement (hereinafter, the response time) is 30 seconds, exceeding the preset 0.1 second; that the first data occurred (that is, the first time point is) at 10:00 am on September 2; and that the preset period is 12 hours. The first event is then the combination of the 30-second response time and 10:00 am on September 2, the time series runs from 10:00 pm on September 1 to 10:00 am on September 2, and the first data set is all of the remote host's response times during that period. Alternatively, take a network loop as the first data. Because a network loop is usually caused by human error (that is, a network cable is connected incorrectly), network loops may occur with no obvious periodicity. By retaining only the first state data for a period before and after the network loop occurred, a trend pattern that indicates a possible abnormal condition can be identified later. The trend pattern is described below.
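The window selection in step S103 can be sketched as follows, using the September 2 example. The function name `build_first_data_set`, the sample log, and the year 2022 (the text gives only the month and day) are assumptions for illustration.

```python
from datetime import datetime, timedelta

def build_first_data_set(first_state_data, first_time_point, period, centered=False):
    """Select the state records inside the time series around the first event.

    By default the window runs from (first_time_point - period) to
    first_time_point; with centered=True it extends `period` on both sides.
    """
    start = first_time_point - period
    end = first_time_point + period if centered else first_time_point
    return [(t, v) for t, v in first_state_data if start <= t <= end]

# Hypothetical response-time log in seconds, one reading every 6 hours.
log = [
    (datetime(2022, 9, 1, 16, 0), 0.1),
    (datetime(2022, 9, 1, 22, 0), 0.1),
    (datetime(2022, 9, 2, 4, 0), 5.0),
    (datetime(2022, 9, 2, 10, 0), 30.0),
    (datetime(2022, 9, 2, 16, 0), 0.1),
]
event_time = datetime(2022, 9, 2, 10, 0)
window = build_first_data_set(log, event_time, timedelta(hours=12))
```

With a 12-hour look-back, `window` covers 10:00 pm on September 1 through 10:00 am on September 2, matching the example in the text. Setting `centered=True` corresponds to the alternative of taking the first time point as the center of the series.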

In step S105, standardizing the first data set means that the monitoring host grades the first state data of the first data set into levels and then standardizes the graded first state data. Taking the response time above as an example, suppose the response times in the time series include 0.1 second, 30 seconds, 50 seconds, and timeout. The monitoring host may standardize them by labeling 0.1 second, 30 seconds, 50 seconds, and timeout as the first, second, third, and fourth levels respectively, and then converting the first to fourth levels into 0.25 step, 0.5 step, 0.75 step, and 1 step respectively. The monitoring host thereby has a more consistent standard for judging whether the remote host is operating normally.
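The two-stage standardization above (grade into levels, then convert levels to steps) can be sketched with the four example values. The lookup table handles only those four values, and the names `grade`, `standardize`, and the `TIMEOUT` marker are hypothetical.

```python
TIMEOUT = None  # hypothetical marker for a timed-out response

def grade(response_time):
    """Map a response time (seconds, or TIMEOUT) to a level from 1 to 4."""
    levels = {0.1: 1, 30: 2, 50: 3, TIMEOUT: 4}
    return levels[response_time]

def standardize(level, level_count=4):
    """Convert a level to a step value in the range (0, 1]."""
    return level / level_count

readings = [0.1, 30, 50, TIMEOUT]
steps = [standardize(grade(r)) for r in readings]
```

Dividing the level by the number of levels reproduces the 0.25, 0.5, 0.75 and 1 steps of the example. Other mappings from levels to steps would be equally consistent with the text.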

For each response time, the monitoring host also collects a plurality of variables related to that response time. Besides the response time itself, the variables may include the temperature of the remote host's processor, the speed of the remote host's fan, the storage status of the remote host's memory, and so on. Specifically, an abnormal event on the remote host may be caused by an abnormal state such as insufficient memory capacity or an excessively high processor temperature. The monitoring host then merges the first data set with these variables to produce the second data set. In other words, in addition to the first data set (the time series and the first state data), the second data set includes the state data of the remote host's other components over the time series.

It should also be noted that the first to fourth levels above are only an example; the monitoring host may divide the first state data into more or fewer levels, and the step values are likewise only an example. In addition, the first state data may be divided into levels at equal or unequal intervals. For example, when the response time is divided at equal intervals, the monitoring host may use every 20 seconds as one level; when the first state data are divided at unequal intervals, the monitoring host may label 0.1 second to 1 second as the first level and 40 seconds to 50 seconds as the fourth level. The present invention does not limit how the first state data are divided.
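Both the equal-interval and unequal-interval divisions can be expressed with a list of ascending upper bounds, one per level. In this sketch the function name `level_for` is hypothetical, and the boundaries of the second and third levels in the unequal case are assumed, since the text specifies only the first and fourth levels.

```python
from bisect import bisect_left

def level_for(value, upper_bounds):
    """Return the 1-based level of `value`, given ascending upper bounds per level."""
    if value > upper_bounds[-1]:
        raise ValueError("value exceeds the last level")
    return bisect_left(upper_bounds, value) + 1

equal_bounds = [20, 40, 60, 80]    # equal intervals: one level per 20 seconds
unequal_bounds = [1, 20, 40, 50]   # bounds of levels 2 and 3 are assumed

equal_level = level_for(45, equal_bounds)
unequal_level = level_for(45, unequal_bounds)
```

A 45-second response falls in the third level under the equal division and in the fourth level under the unequal one, which shows how the choice of boundaries changes the grading.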

In steps S107 and S109, the monitoring host performs correlation analysis on the second data set to remove one or more of its variables whose correlation is lower than a threshold, and uses the updated second data set as one of the pieces of first data. The correlation analysis may be a cross-correlation function (CCF), for example an autocorrelation function (ACF) or a partial autocorrelation function (PACF).

Specifically, step S107 determines the correlation between the variables in the second data set. For example, the monitoring host may determine the correlation between the remote host's processor temperature and the response time, between the remote host's fan speed and the response time, between the storage status of the remote host's memory and the response time, and so on, in order to determine which variables are related to the response time. The monitoring host then removes the variables whose correlation with the response time is lower than a threshold to update the second data set, and uses the updated second data set as one of the pieces of first data of step S1.
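The pruning in steps S107 and S109 can be sketched with a plain Pearson correlation, which is used here as a simple stand-in for the cross-correlation analysis named in the text (it corresponds to a normalized cross-correlation at lag zero). The variable names, the sample values, the 0.5 threshold, and the use of the absolute value are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def prune_variables(data_set, target, threshold):
    """Drop variables whose |correlation| with the target is below the threshold."""
    return {
        name: series for name, series in data_set.items()
        if name == target or abs(pearson(series, data_set[target])) >= threshold
    }

second_data_set = {
    "response_time": [0.25, 0.25, 0.5, 0.75, 1.0],
    "cpu_temperature": [50, 52, 61, 70, 80],
    "fan_speed": [3000, 2990, 3010, 2995, 3005],
}
updated = prune_variables(second_data_set, "response_time", threshold=0.5)
```

In this made-up data the processor temperature tracks the response time closely and is kept, while the fan speed is only weakly correlated and is removed from the updated second data set.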

In the original second data set, not all variables are necessarily related to the first event. Performing steps S107 and S109 filters out the data of variables that have low correlation with the first event, so that the first event is not diluted by weakly related data. This reduces the amount of computation the monitoring host performs while generating the first learning model.

For example, suppose the first event is a database failure. When a database fails, the central processing unit is usually at a low utilization level, so its correlation with the database failure is low. In this example, the autocorrelation function (ACF) may therefore be used for the correlation analysis. In other words, the ACF may be used to analyze the correlation between different systems, equipment, devices, and the like.

In steps S111 and S113, the monitoring host uses the first data to determine, based on a regression model, whether the observation parameter and the training sample exhibit a trend pattern. The regression model is, for example, an autoregressive integrated moving average (ARIMA) model. The observation parameter may be an observation period or an observation time point, and is used with the ARIMA model to determine whether the training sample shows a specific trend in every observation period or at every observation time point. The present invention does not limit the specific content of the observation parameter.

After obtaining the training sample, the monitoring host inputs the observation parameter and the training sample into the regression model to determine whether they exhibit a trend pattern. Suppose the observation period of the observation parameter is one hour and the training sample is the utilization of the remote host's processor or the amount of data stored in the remote host's memory. The trend pattern may then be that the processor utilization rises gradually every hour, or that the memory usage reaches a critical value every hour. Alternatively, take a network loop as the first data. Before a network loop occurs, the load of the central processing unit of the network core switch is usually low. When a network loop occurs, the load keeps rising until the central processing unit is fully loaded, which paralyzes the network. Because the load scale of the network is fixed, the monitoring host can use the ARIMA model to identify the similar pattern in which a network loop raises the load of the central processing unit, and this similar pattern of rising load is the trend pattern.

When the monitoring host determines that the observation parameter and the training sample do not have a trend pattern, it may execute step S114, that is, adjust the observation parameter or difference the training sample. To adjust the observation parameter, the monitoring host may lengthen or shorten the observation period. To difference the training sample, the monitoring host may subtract each data point in the training sample from the one that follows it, producing a sequence of differences. After executing step S114, the monitoring host may execute step S113 again to determine whether the adjusted observation parameter or the differenced training sample has a trend pattern.
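As a small illustration of the differencing operation in step S114, the following sketch subtracts each value from its successor and optionally repeats the operation. The function name and the `order` parameter are assumptions; in an ARIMA(p, d, q) model, the order of differencing corresponds to d.

```python
def difference(series, order=1):
    """Return the `order`-th difference of `series`.

    Each pass replaces the series with successive differences
    (value at t minus value at t-1), shortening it by one element.
    """
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


usage = [1, 3, 6, 10]
print(difference(usage))           # first difference
print(difference(usage, order=2))  # second difference
```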

When the monitoring host determines in step S113 that the observation parameter and the training sample have a trend pattern, it executes steps S115 and S117, using another of the first data as a test sample to determine whether the observation parameter and the regression model pass a time series test. The first data used as the test sample differ from the first data used as the training sample. Steps S111 and S115 differ in purpose: the training sample selected in step S111 is used to judge whether the regression model is an appropriate model (that is, whether it can identify the trend pattern), whereas the test sample selected in step S115 is used to judge the accuracy of the regression model.

In step S117, the monitoring host inputs the observation parameter and the test sample into the regression model to perform the time series test on the test sample according to the observation parameter and the regression model. The time series test may be a Box-Pierce test. Specifically, the time series test examines the residual series or its significance to judge whether the output of the regression model for the observation parameter and the test sample is stable (stationary). When the significance is greater than a preset value (for example, a p-value greater than 0.05), the observation parameter and the regression model pass the time series test.
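To make the pass criterion concrete, here is a minimal sketch of a Box-Pierce test on a residual series. The statistic is Q = n times the sum of the squared sample autocorrelations over the first h lags, compared against a chi-square distribution. Several things here are assumptions for illustration: the function names, the use of h degrees of freedom (when testing residuals of a fitted ARIMA model, the degrees of freedom are commonly reduced by the number of fitted parameters), and the restriction to an even number of lags, which allows a closed-form chi-square tail using only the standard library.

```python
import math


def autocorrelation(residuals, lag):
    """Sample autocorrelation of `residuals` at the given positive lag."""
    n = len(residuals)
    m = sum(residuals) / n
    denom = sum((r - m) ** 2 for r in residuals)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    num = sum((residuals[t] - m) * (residuals[t - lag] - m) for t in range(lag, n))
    return num / denom


def box_pierce_q(residuals, lags):
    """Box-Pierce statistic: n * sum of squared autocorrelations for lags 1..lags."""
    n = len(residuals)
    return n * sum(autocorrelation(residuals, k) ** 2 for k in range(1, lags + 1))


def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def passes_box_pierce(residuals, lags=2, alpha=0.05):
    """True when the p-value exceeds alpha, i.e. no significant autocorrelation is found."""
    return chi2_sf_even(box_pierce_q(residuals, lags), lags) > alpha


strongly_correlated = [1, -1] * 10             # alternating residuals
weakly_correlated = [1, -1, -1, 1, 1, 1, -1, -1]
print(passes_box_pierce(strongly_correlated))
print(passes_box_pierce(weakly_correlated))
```

In practice a statistics library would normally supply both the statistic and the p-value for any number of lags; the closed form above is only a convenience for the sketch.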

When the monitoring host determines that the observation parameter and the regression model fail the time series test, it may execute step S114 again. When the monitoring host determines that the observation parameter and the regression model pass the time series test, it may execute step S119 and use the regression model as the first learning model from which the second learning model is generated.
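The control flow of steps S113 through S119 can be sketched as a loop. The three callables (`has_trend`, `passes_test`, `adjust`) are hypothetical placeholders for the trend check, the time series test, and the adjustment of step S114. The `max_rounds` limit is also an assumption, since the text does not state a stopping rule.

```python
def choose_first_model(train, test, period, has_trend, passes_test, adjust, max_rounds=10):
    """Repeat S113/S117, falling back to S114, until the model is accepted.

    Returns the accepted observation period and training sample,
    or None if nothing is accepted within `max_rounds` attempts.
    """
    for _ in range(max_rounds):
        # S113: does the training sample show a trend pattern for this period?
        if has_trend(train, period):
            # S115/S117: run the time series test on the separate test sample.
            if passes_test(test, period):
                # S119: accept the regression model as the first learning model.
                return {"period": period, "train": train}
        # S114: adjust the observation parameter and/or difference the sample.
        period, train = adjust(period, train)
    return None


# Toy stand-ins: a trend appears from period 4, the test passes from period 8,
# and each adjustment doubles the observation period.
result = choose_first_model(
    train=[1, 2, 3],
    test=[4, 5, 6],
    period=1,
    has_trend=lambda sample, p: p >= 4,
    passes_test=lambda sample, p: p >= 8,
    adjust=lambda p, sample: (p * 2, sample),
)
print(result)
```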

Next, refer to FIG. 3, which is a detailed flowchart of step S5 of FIG. 1. As shown in FIG. 3, step S5 of FIG. 1 may include steps S51, S53 and S55, which illustrate a method for generating the second command.

After obtaining the second data, the monitoring host may execute step S51: receiving a system log file from the remote host, wherein the system log file contains a plurality of pieces of current state data; step S53: selecting, with the second learning model, at least one piece of the state data that belongs to the highest abnormality level; and step S55: generating, with the second learning model, the second command for the at least one piece of state data belonging to the highest abnormality level.

It should first be noted that, when the first learning model is trained to generate the second learning model, the first data may differ in severity, so the monitoring host may classify each piece of first data into a corresponding severity level. Taking the CPU utilization of the remote host as an example, if the utilization falls between 85% and 90%, the first data may be classified as a moderate abnormality level; if the utilization is above 90%, the first data may be classified as the highest abnormality level. Accordingly, when training the first learning model, the monitoring host may also input the abnormality level and the first command, together with the corresponding first data, into the first learning model. For example, for first data classified as the moderate abnormality level, the corresponding first command may include observing the CPU utilization for one more minute; for first data classified as the highest abnormality level, the corresponding first command may include directly correcting the operation of the CPU. The trained machine learning model (the second learning model) can then predict possible abnormal events and their severity from the real-time operating state of the remote host and generate the corresponding second command. Then, in actual operation, the monitoring host may execute step S3 of FIG. 1 to generate the second data.
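The labelling described above can be sketched as follows, using the CPU utilization thresholds from the example. Several details are assumptions for illustration: the `"normal"` label for readings below 85%, the treatment of exactly 90% as the highest level, the command strings, and the shape of the training rows.

```python
def abnormality_level(cpu_percent):
    """Map a CPU utilization reading to an abnormality level.

    Thresholds follow the example in the text; "normal" is a hypothetical
    label for readings that are not abnormal.
    """
    if cpu_percent >= 90:
        return "highest"
    if cpu_percent >= 85:
        return "moderate"
    return "normal"


# First commands paired with each abnormality level, paraphrasing the example.
FIRST_COMMANDS = {
    "moderate": "observe CPU utilization for one more minute",
    "highest": "directly correct the operation of the CPU",
}


def build_training_rows(cpu_readings):
    """Build (first data, abnormality level, first command) rows for training."""
    rows = []
    for value in cpu_readings:
        level = abnormality_level(value)
        if level in FIRST_COMMANDS:  # only abnormal readings become first data
            rows.append((value, level, FIRST_COMMANDS[level]))
    return rows


for row in build_training_rows([50, 87, 95]):
    print(row)
```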

Specifically, after the monitoring host receives the current state data in the system log file, the second learning model, having been trained with abnormality levels, can determine the abnormality level to which each piece of state data belongs and generate the second command corresponding to that level. The monitoring host may then execute step S7 and output the second command matching the abnormality level to the remote host. As described above, the abnormality levels include a highest abnormality level. Suppose a CPU utilization of 90% on the remote host is judged to be the highest abnormality level. The monitoring host may then use the second learning model to select, from the system log file, the current state data in which the CPU utilization of the remote host reaches 90%. Because the selected current state data belong to the highest abnormality level, the monitoring host may output the corresponding second command to the remote host to regulate the usage state of its CPU.
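Steps S53, S55 and S7 can be sketched as a small pipeline over log entries. The `ThresholdModel` class is a hypothetical stand-in for the trained second learning model, and the log entry fields (`host`, `cpu_percent`) and the command string are likewise assumptions, since the text does not specify a log format or command syntax.

```python
class ThresholdModel:
    """Hypothetical stand-in for the trained second learning model."""

    def level(self, entry):
        # Judge the abnormality level of one piece of state data.
        return "highest" if entry["cpu_percent"] >= 90 else "other"

    def command(self, entry):
        # Produce a second command for one piece of state data.
        return f"throttle CPU on {entry['host']}"


def commands_for_log(log_entries, model):
    """Select highest-level entries (S53) and generate their commands (S55)."""
    critical = [entry for entry in log_entries if model.level(entry) == "highest"]
    return [(entry["host"], model.command(entry)) for entry in critical]


log = [
    {"host": "a", "cpu_percent": 40},
    {"host": "b", "cpu_percent": 93},
]
# S7 would send each command to its host; here the pairs are only printed.
for host, command in commands_for_log(log, ThresholdModel()):
    print(host, command)
```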

In addition to the above embodiment, after step S51 the monitoring host may also use the second learning model with a K-nearest neighbor algorithm to output, according to the second data, the trend pattern corresponding to the second data to the remote host, so that a user of the remote host can judge from the trend pattern whether the usage state of the remote host needs to be regulated. In short, the trend pattern corresponding to the second data indicates the degree of abnormality of the second data.
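A K-nearest neighbor lookup of this kind can be sketched in a few lines: the trend pattern reported for new state data is the majority label among the k closest historical samples. The feature vectors (CPU percent, memory percent), the trend labels, and the choice of Euclidean distance with k = 3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_trend(query, labelled_history, k=3):
    """Return the majority trend label among the k samples nearest to `query`.

    `labelled_history` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(labelled_history,
                     key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical history: (cpu %, memory %) paired with a trend label.
history = [
    ((20, 30), "stable"),
    ((25, 35), "stable"),
    ((22, 28), "stable"),
    ((90, 80), "rising toward saturation"),
    ((95, 85), "rising toward saturation"),
    ((88, 90), "rising toward saturation"),
]
print(knn_trend((92, 82), history))
print(knn_trend((21, 31), history))
```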

Although the present invention is disclosed above by way of the foregoing embodiments, they are not intended to limit the present invention. Changes and modifications made without departing from the spirit and scope of the present invention fall within the scope of patent protection of the present invention. For the scope of protection defined by the present invention, refer to the appended claims.

S1, S3, S5, S7: steps
S101, S103, S105, S107, S109, S111, S113, S114, S115, S117, S119: steps
S51, S53, S55: steps

FIG. 1 is a flowchart of a method for monitoring abnormal events according to an embodiment of the present invention.
FIG. 2 is a detailed flowchart of step S1 in FIG. 1.
FIG. 3 is a detailed flowchart of step S5 in FIG. 1.

Claims

A method for monitoring abnormal events, comprising, executed by a monitoring host: training a first learning model of the monitoring host with a plurality of first data and a plurality of first commands of a remote host to generate a second learning model, wherein the first data are data of abnormal events in the past operation of the remote host, and the first commands are commands for resolving or alleviating the abnormal events of the first data; detecting an operating state of the remote host to generate second data; generating, with the second learning model, a corresponding second command according to the second data; and outputting the second command to the remote host; wherein training the first learning model of the monitoring host with the first data and the first commands of the remote host includes obtaining the first data, and obtaining the first data includes: obtaining a first event of the remote host, wherein the first event occurs at a first time point; using a plurality of first state data of a time series including the first time point as a first data set; standardizing the first data set to generate a second data set, wherein the second data set includes a plurality of variables of the remote host in the time series; performing a correlation analysis on the second data set; and, according to a result of the correlation analysis, removing one or more of the variables whose correlation is lower than a threshold value to update the second data set, and using the updated second data set as one of the first data.

The abnormal event monitoring method as described in claim 1, wherein training the first learning model with the first data and the first commands further includes: using one of the first data as a training sample; determining, based on a regression model, whether an observation parameter and the training sample have a trend pattern; when it is determined that the observation parameter and the training sample have the trend pattern, using another of the first data as a test sample; performing a time series test on the test sample according to the observation parameter and the regression model; and, when the observation parameter and the regression model pass the time series test, using the regression model as the first learning model.

The abnormal event monitoring method as described in claim 2, wherein when it is determined based on the regression model that the observation parameter and the training sample do not have the trend pattern, the method further includes, executed by the monitoring host: adjusting the observation parameter, or differencing the training sample; and using the adjusted observation parameter or the differenced training sample to determine whether the observation parameter and the training sample have the trend pattern.

The abnormal event monitoring method as described in claim 2, wherein when it is determined that the observation parameter and the regression model fail the time series test, the method further includes, executed by the monitoring host: adjusting the observation parameter, or differencing the training sample; and using the adjusted observation parameter or the differenced training sample to determine whether the observation parameter and the training sample have the trend pattern.

The abnormal event monitoring method as described in claim 1, wherein training the first learning model of the monitoring host with the first data and the first commands of the remote host further includes: training the first learning model with a plurality of abnormality levels respectively corresponding to the first data and the first commands; and wherein generating, with the second learning model, the corresponding second command according to the second data includes: determining, with the second learning model, the one of the abnormality levels to which the second data belong; and generating the second command corresponding to the one of the abnormality levels to which the second data belong.

The abnormal event monitoring method as described in claim 5, wherein the abnormality levels include a highest abnormality level, and generating the second command corresponding to the one of the abnormality levels to which the second data belong includes: receiving a system log file from the remote host, wherein the system log file includes a plurality of pieces of state data; selecting, with the second learning model, at least one piece of the state data that belongs to the highest abnormality level; and generating, with the second learning model, the second command for the at least one piece of state data belonging to the highest abnormality level.

The abnormal event monitoring method as described in claim 1, wherein generating, with the second learning model, the corresponding second command according to the second data includes: outputting, based on the second learning model and using a K-nearest neighbor algorithm, a trend pattern corresponding to the second data.

The abnormal event monitoring method as described in claim 1, wherein after the corresponding second command is generated with the second learning model according to the second data, the method further includes: outputting a warning notification to the remote host.