TWM622216U - Apparatuses for service anomaly detection and alerting

Info

Publication number: TWM622216U
Application number: TW110210722U
Authority: TW
Inventors: 林柏州
Original assignee: 伊雲谷數位科技股份有限公司
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2022-01-11

Abstract

An apparatus for service anomaly detection and alerting takes both operating data and system data into account when detecting and alerting on service anomalies or risks. Based on the operating data (for example, business operations and public information such as social media and public news) and the system data (for example, system software and hardware information and logs), it groups the monitored systems that provide services in order to determine their service characteristics. To further improve the accuracy of detection and alerting, when abnormal behavior is detected in a service, the apparatus further determines whether that behavior also occurs within the group sharing the same service characteristics, and thus whether it is a genuine abnormal event. Compared with the prior art, the apparatus therefore improves the efficiency of system monitoring and reduces operation and maintenance risk.

Description

Apparatus for service anomaly detection and alerting

The present utility model relates to a detection and alerting apparatus for analyzing whether the service of a monitored system is abnormal, and in particular to an intelligent apparatus for service anomaly detection and alerting.

In a typical enterprise, online services, whether internal or external, usually run on anywhere from several to dozens of systems and hundreds of software and hardware modules on servers. Every day (or even every minute or second) these may record a very large number of system monitoring values or a huge volume of log files. When a service or system anomaly, an information security attack, or a similar event occurs, system administrators or operations personnel must interpret (or inspect) the monitoring values and logs to find the cause of the anomaly and eliminate it. In traditional information technology (IT) operations, system administrators define rules for abnormal events based on past experience. However, as IT infrastructure and cloud services have spread and expanded, system architectures and operating environments have become complex, and erroneous or overly complex rules often trigger large numbers of false alarms. These leave administrators overwhelmed and may even cause serious threats to be overlooked. In addition, traditional operations personnel can usually only deal with problems passively, after an abnormal event or signal has occurred.

In recent years, operations that once relied on human judgment have advanced beyond automated monitoring: many operations methods, enterprises, and services have introduced artificial intelligence (AI) and machine learning into IT infrastructure and operations management. For example, historical system load monitoring values (such as, but not limited to, central processing unit (CPU) usage or memory load) are used to train, through machine learning, a normal system load curve and tolerance values. When a real-time monitoring value later deviates from the tolerance, a system alert is triggered, which shortens operations and response time.
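The baseline-and-tolerance idea described above can be sketched as follows. This is a minimal illustration rather than any particular product's method: a per-hour mean and standard deviation stand in for the learned load curve, and the names `learn_baseline` and `deviates` and the parameters `k` and `floor` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_baseline(history):
    """Learn a (mean, standard deviation) pair per hour of day.

    history: iterable of (hour_of_day, cpu_usage) pairs; the format is assumed.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def deviates(baseline, hour, value, k=3.0, floor=0.05):
    """Return True when value leaves the tolerance band for that hour.

    The band is k standard deviations wide, but never narrower than floor
    (both are illustrative choices, not values from the source).
    """
    center, spread = baseline[hour]
    return abs(value - center) > max(k * spread, floor)


baseline = learn_baseline([(12, 0.80), (12, 0.84), (12, 0.82),
                           (3, 0.10), (3, 0.12), (3, 0.11)])
print(deviates(baseline, 12, 0.85))  # usage that is high but normal at noon
print(deviates(baseline, 3, 0.85))   # the same usage at 3 a.m.
```

Keying the baseline by hour of day means the same reading can be normal at noon and abnormal at night, which is what a learned load curve is meant to capture.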

Other techniques use AI for correlation analysis of abnormal events: algorithms and text analysis are applied to historical logs to group seemingly unrelated events and thereby uncover implicit correlations among them. For example, suppose that service web page outages (recorded as Hypertext Transfer Protocol error response codes), excessively high CPU usage, and abnormally low page views, which appear unrelated, are found to occur together frequently. An abnormal-event correlation algorithm can then be used to perform root cause analysis on these events, which may make early warning possible in the future and increase IT operations efficiency.

One prior-art approach fully collects three kinds of heterogeneous data within an organization, namely the health status of all devices, the volume of network traffic used by the organization, and various logs (including information security events), and performs correlation analysis on them. This saves the time consumed by manual comparison and lookup and makes use of AI trend algorithms. Then, based on the collected historical data of the logs and/or the network traffic volume, it automatically learns to establish a dynamic baseline and continuously compares the log and/or traffic data arriving every minute against that baseline. In this way it detects, in real time, events in which the hit count, the number of traffic packets, or the number of bytes surges abnormally, together with the source Internet Protocol (IP) address (usually the attacker) and the destination IP address (usually the target). Because thresholds need not be set manually one by one, this approach makes operations and information security protection easier.

Another prior-art approach compares abnormal events and then uses machine learning algorithms to sort out events that exhibit similar behavior, automatically detecting whether the latency of a system service has spiked, whether the system error rate has risen, and even whether a public cloud vendor's network is abnormal. With this approach the user does not need to configure alert trigger conditions; the system automatically monitors the platform for abnormal performance events.

The AI-based detection described above collects only the system monitoring data of a single company or service. The domain knowledge of such AI-based operations cannot be linked to real-world business operations, and it does not take into account that different service systems may have the same or different characteristics. The accuracy of monitoring therefore still has room for improvement.

In accordance with the purpose of the present utility model, an embodiment provides an apparatus for service anomaly detection and alerting, which comprises a plurality of electrically connected hardware circuits. The hardware circuits are configured into a plurality of units, and the units are used to: receive operating data and system data of one of a plurality of monitored systems corresponding to a service, and perform data preprocessing on the operating data and the system data to obtain a plurality of state parameters of the monitored system; group each state parameter of the monitored system to obtain a cluster label corresponding to each state parameter; group the monitored system according to the cluster labels of its state parameters to obtain a group number corresponding to the monitored system; detect, according to at least one of the state parameters of the monitored system, whether the monitored system exhibits abnormal behavior; and, when abnormal behavior is detected in the monitored system, determine whether more than a specific number of monitored systems in the group corresponding to that group number also exhibit abnormal behavior, and generate an alert if the group does not contain more than the specific number of monitored systems that also exhibit abnormal behavior.
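The final determination in the paragraph above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `MonitoredSystem`, `should_alert`, and `specific_number` are hypothetical, and counting only the other members of the group (excluding the system under test) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoredSystem:
    name: str
    group_number: int   # result of the grouping step
    abnormal: bool      # result of the per-system detection step


def should_alert(target: MonitoredSystem,
                 systems: List[MonitoredSystem],
                 specific_number: int) -> bool:
    """Alert only when the abnormal behavior is not shared by the group."""
    if not target.abnormal:
        return False
    abnormal_peers = sum(
        1 for s in systems
        if s is not target
        and s.group_number == target.group_number
        and s.abnormal
    )
    # If more than specific_number peers behave the same way, the behavior
    # is treated as ordinary for this group and no alert is generated.
    return abnormal_peers <= specific_number


systems = [
    MonitoredSystem("news_a", 1, True),
    MonitoredSystem("news_b", 1, True),
    MonitoredSystem("news_c", 1, True),
    MonitoredSystem("erp_x", 2, True),
]
print(should_alert(systems[0], systems, specific_number=1))  # shared by group 1
print(should_alert(systems[3], systems, specific_number=1))  # isolated in group 2
```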

An embodiment of the present utility model further provides a service anomaly detection and alerting method that is similar to the one described above, except that the monitored systems are grouped in advance, so the method has no grouping step.

An embodiment of the present utility model further provides an apparatus for detecting anomalies and issuing alerts, which is configured with a plurality of units to execute the above service anomaly/risk detection and alerting method. An embodiment further provides a storage medium for storing program code associated with the above service anomaly detection and alerting method.

An embodiment of the present utility model further provides a plurality of computer software programs for determining that an abnormal event has occurred in a monitored system and generating an anomaly alert for that abnormal event.

In summary, the service anomaly/risk detection and alerting method of the embodiments, the cloud device using the method, and the storage medium storing the method can accurately detect service anomalies and risks.

For a further understanding of the techniques, means, and effects of the present utility model, reference may be made to the following detailed description and the accompanying drawings, through which its purposes, features, and concepts can be understood thoroughly and concretely. The following detailed description and drawings are, however, provided only for reference and to illustrate implementations of the present utility model, and are not intended to limit it.

Reference will now be made in detail to exemplary embodiments of the present utility model, which are illustrated in the accompanying drawings. Wherever possible, the same reference numerals are used in the drawings and the description to refer to the same or similar parts. The exemplary embodiments are only some of the ways of implementing the design concept of the present utility model, and none of the examples below is intended to limit it.

Prior-art approaches that use AI to detect service anomalies generally have the following technical problems. (1) They cannot be linked to real-world business operations. For example, traffic to news websites often surges during the daily lunch break, and such a surge is normal; if administrators and AI alerting treat the rise in CPU usage and system load during this period as abnormal, false alarms may be issued. (2) IT monitoring values and historical logs are generally infrastructure operations data from a single system or a single enterprise. Different service systems have different characteristics, and even similar systems may produce different monitoring records or logs in enterprises from different industries. If machine learning models are not built for the different business or service characteristics, the correctness of preventive alerts is often insufficient. (3) Traditional AI requires large amounts of data to train models, but service anomalies do not occur often. Using only infrastructure and environment data therefore leaves machine learning without enough abnormal data and annotations to train a model. Even when a curve of the normal service state has been trained, a relatively conservative alert trigger threshold must still be set to cope with unexpected risks.

To solve the above technical problems, the embodiments of the present utility model provide a service anomaly detection and alerting method, and an apparatus and system using the method, that take both operating data and system data into account when detecting and alerting on service anomalies. Based on the operating data (for example, business operations and public information such as social media and public news) and the system data (for example, system software and hardware information and run logs), the monitored systems that provide services are grouped in order to determine their service characteristics. To further improve the precision of detection and alerting, when an anomaly is detected, the method further determines whether the detected abnormal behavior also occurs frequently within the group sharing the same service characteristics, in which case it is ordinary behavior rather than a genuine anomaly. In other words, it determines whether the detected abnormal behavior is the same as or similar to ordinary behavior that occurs at a higher frequency, or abnormal behavior that occurs at a lower frequency, within the group sharing the same service characteristics. Compared with the prior art, the method and the apparatus and system using it therefore improve system monitoring efficiency and reduce operations risk.

In several embodiments of the present utility model, a service refers broadly to a digital service built from the software and hardware of information equipment, including online commerce systems, enterprise operation systems, and the like. A service may also refer broadly to a digital information interaction system, such as online purchase transactions for physical or virtual goods, financial transactions, message exchange and publishing, or the uploading, browsing, and downloading of audio, video, images, and text. A service can be identified and reflected by its actual workload, and the data related to a service may be structured or unstructured. Such service-related data may be any operational or system data concerning the service, for example input, output, and computation data processed by the service or system; configuration, runtime, and monitoring data of the service or system itself; or data derived from any of these.

In several embodiments, the collected service-related data includes operating data and system data. The operating data includes at least one of: the real-time order count (Gross Merchandise Volume, GMV), revenue, number of online users, page views (PV), repeat visitors (RV), number of visits (Unique Visitor, UV), number of IP addresses, traffic sources, regions, devices used, user browsing events, operation behavior on the service website or application, and customer service call records. The system data includes at least one of: system log files, infrastructure operation data, and system metrics. Infrastructure may refer to IT equipment, its software and hardware, or the environment or architecture of that software and hardware. Infrastructure operation data may refer to data on the energy consumption, traffic, other resources, or performance of the infrastructure that the system uses at runtime when providing these services. System metrics may refer to data on the resources used by, or the performance of, the monitored system providing these services, for example at least one of CPU usage, memory usage, I/O count (Read/Write PS), network outbound/inbound traffic, packet outbound/inbound volume, the number of elastically started machines/clusters, swap count, queries per second (QPS), responses per second (RPS), the number of database connections, and machine response time. The present utility model is not, however, limited to the above types of operating data and system data.

Refer first to FIG. 1, a block diagram of an anomaly alerting system that uses the service anomaly detection and alerting method of an embodiment of the present utility model. The present utility model can also be applied to an on-premises system or to a hybrid of on-premises and cloud systems; for example, the cloud device 11 of FIG. 1 may be replaced by an on-premises device or a hybrid cloud device. The anomaly alerting system 1 of the embodiment includes a cloud device 11, and a plurality of monitored systems 121 to 12N are communicatively connected to the cloud device 11, for example directly or indirectly through wired or wireless connections. The monitored systems 121 to 12N may include at least one of the following: services provided by a system; workloads carried by a system; servers providing services; networks; network-related devices such as switches and gateways; firewalls, storage devices, and similar equipment; components of such equipment such as CPUs, memory, and I/O ports; and virtual machines, containers, virtual private clouds, databases, and software programs running on such equipment. The monitored systems 121 to 12N may include cloud devices, on-premises devices, terminal devices, and the like. They may belong to several different service providers or to the same service provider, or some of them may belong to one service provider while others belong to another.

The monitored systems 121 to 12N provide the various services described above, and the service characteristics of these services are not necessarily all the same. The service characteristics of a service can be labeled manually in advance (when the service types are known beforehand, the systems can be grouped directly; for example, monitored systems involved in online shopping services are labeled as one group, and monitored systems involved in online consulting services are labeled as another group), or determined by other supervised machine learning methods. Alternatively, the service characteristics need not be known in advance: after data is collected, state parameters are derived from the collected data and interpreted with an unsupervised machine learning clustering method. The state parameters include operating data and system data such as CPU usage, number of online users, revenue, page views, memory usage, and the number of inputs/outputs.

After unsupervised clustering, each cluster can be labeled, and each label can represent the service characteristics of a different cluster. Labels can be assigned automatically as ordinal variables, type or categorical variables, index values, unique values, numbers, and so on; for example, labels can be obtained automatically and programmatically from the clustering model. A label description can also be attached to a label. For example, a cluster can be labeled or described according to the commercial nature of the service as "live streaming", "portal site", "e-commerce platform", or "forum"; according to load or traffic trends as "busy in the evening", "active on weekends", "winter period", or "office-worker pattern"; according to technical type or monitored parameters such as I/O trends or connection frequency; or according to the characteristics of anomalies that have occurred in the cluster, such as overload, underload, overclocking, timeout, or spikes. Services that were manually labeled in advance with the same service characteristics may, after data-driven grouping, fall into groups with different service characteristics, and services that were manually labeled with different service characteristics may fall into the same group. The cloud device 11 receives the operating data and system data provided by the monitored systems 121 to 12N and uses them to monitor whether the monitored systems 121 to 12N have service anomalies, thereby achieving service anomaly detection and alerting. The operating data and system data are as described above and are not repeated here.

Refer next to FIG. 2, a block diagram of the cloud device that uses the service anomaly detection and alerting method of an embodiment of the present utility model. As shown in FIG. 2, the cloud device 11 includes a data preprocessing unit 111, a state parameter grouping unit 112, an individual grouping unit 113, a peer-group comparison unit 114, and a detection unit 115. The functional units 111 to 115 can be implemented by hardware circuits together with the execution of software programs, but the present utility model is not limited by how they are implemented. The data preprocessing unit 111 is signal-connected to the state parameter grouping unit 112 and the detection unit 115, the state parameter grouping unit 112 is signal-connected to the individual grouping unit 113, and the individual grouping unit 113 is signal-connected to the peer-group comparison unit 114. The cloud device 11 may further include an alerting unit 116 signal-connected to the peer-group comparison unit 114.

The data preprocessing unit 111 receives the operating data and system data of each of the monitored systems 121 to 12N and performs data preprocessing on them. For example, preprocessing may perform text analysis on system log files to establish keyword frequencies at different points in time. The access log (Access.log) or error log (Error.log) of a monitored web system contains different keywords in each time period, such as "emerg", "alert", "err", and "warning" (the keywords may differ from one monitored system to another). TF-IDF (term frequency-inverse document frequency) analysis is applied to obtain term frequencies, which are later used as separate state parameters; for example, parsing the error log reveals the frequency or number of memory overflows. Preprocessing may also record infrastructure operation metrics such as CPU usage, memory usage, and I/O count, or anonymize the business operating data output by customer services. The purpose of preprocessing is to obtain, from the operating data and system data, a plurality of state parameters for each monitored system 121 to 12N (state parameters are quantitative indicators of how the resources of a monitored system are used or how it performs), so that service characteristics can be grouped afterwards. In some embodiments, 1000 state parameters are observed with a seven-day week as one observation period, and the value of each state parameter is collected once per minute, so each state parameter has 7 × 24 × 60 = 10080 data points (each data point having one parameter value).
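The keyword-frequency preprocessing described above might look like the following sketch. It is only an illustration under stated assumptions: each time window of log lines is treated as one "document", the weighting is the classic tf × log(N/df) variant (the source does not specify one), and the names `keyword_counts`, `tf_idf`, and `POINTS_PER_PARAMETER` and the sample log lines are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Keywords named in the text; real systems may use different ones.
KEYWORDS = ("emerg", "alert", "err", "warning")

# One value per minute over a seven-day observation period.
POINTS_PER_PARAMETER = 7 * 24 * 60  # 10080


def keyword_counts(window_lines: List[str]) -> Counter:
    """Count whole-word keyword occurrences in one time window of log lines."""
    tokens = re.findall(r"[a-z]+", " ".join(window_lines).lower())
    return Counter(t for t in tokens if t in KEYWORDS)


def tf_idf(windows: List[List[str]]) -> List[Dict[str, float]]:
    """Score each keyword per window as tf * log(N / df)."""
    counts = [keyword_counts(w) for w in windows]
    n = len(windows)
    # Document frequency: in how many windows each keyword appears.
    doc_freq = Counter(kw for c in counts for kw in c)
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append({kw: (c[kw] / total) * math.log(n / doc_freq[kw])
                       for kw in c})
    return scores


windows = [
    ["[err] upstream timed out", "[warning] slow response"],
    ["[warning] slow response"],
    ["[emerg] out of memory", "[emerg] out of memory"],
]
for minute, score in enumerate(tf_idf(windows)):
    print(minute, score)
```

Each keyword's score series over time can then serve as one state parameter, alongside numeric metrics such as CPU usage.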

The state parameter grouping unit 112 receives the state parameters of each monitored system 121 to 12N, groups each state parameter (its parameter values) based on a clustering model, and can assign corresponding cluster labels. For example, the CPU usage of the monitored system 121 is grouped based on the clustering model and marked with the corresponding cluster label; judging from the clustering results, the cluster labels may represent the service characteristics of different CPU usage patterns, such as being busier at night, busier in the morning, or busier at noon. In several embodiments, the clustering model can be updated and trained periodically on real-time data, or trained in advance before use, and the model can be trained or updated by unsupervised machine learning (without knowing the types or labeling in advance). For example, the CPU usage of each monitored system 121 to 12N is obtained periodically, and a clustering model is built with K-means, hierarchical agglomerative clustering (HAC), or the density-based clustering algorithm DBSCAN to group the CPU usage and assign cluster labels according to the result. The clustering model can also be trained or updated with supervised learning algorithms (in which known categories are labeled), such as support vector machines (SVM), k-nearest neighbors (KNN), decision trees, or random forests.
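As an illustration of grouping one state parameter, the sketch below runs a small K-means over CPU usage profiles. It is a simplified stand-in, not the embodiment's model: each system is summarized by a hypothetical three-value profile (morning, noon, night average usage), centroids are initialized from the first k points for determinism, and `k_means` and `squared_distance` are illustrative names.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Sequence[float]], k: int,
            iterations: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic init
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical CPU usage profiles: (morning, noon, night) averages.
cpu_profiles = [
    [0.80, 0.30, 0.20],  # busier in the morning
    [0.20, 0.30, 0.90],  # busier at night
    [0.75, 0.35, 0.25],
    [0.25, 0.20, 0.85],
]
print(k_means(cpu_profiles, k=2))  # cluster label of each system for CPU usage
```

In practice a library implementation with better initialization would normally be preferred; the point here is only that each state parameter yields one cluster label per monitored system.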

The individual grouping unit 113 groups the monitored systems 121 to 12N and assigns each a corresponding group number (or label, ordinal variable, type variable, index value, unique value, etc.). The grouping can be performed based on a clustering model, according to the cluster labels of the state parameters of each monitored system 121 to 12N. In some embodiments, a monitored system can be regarded as the service it provides or the workload it carries, in which case the individual grouping unit 113 can be said to group those services or workloads. A monitored system can also be regarded as an owner, customer, user, enterprise, or service recipient, in which case the individual grouping unit 113 can be said to group those owners, customers, users, enterprises, or service recipients. For example, when a single monitored system represents a single user, grouping monitored systems amounts to grouping users; when it represents a single service, it amounts to grouping services; and when it represents a single device or software program, it amounts to grouping devices or software programs.

For example, based on the clustering model, the cluster labels of the CPU usage, number of online users, revenue, and network outbound/inbound traffic of the monitored systems 121 to 12N determine which groups the services they provide belong to. Each group indicates that the services assigned to it share some identical or similar service characteristics, and a corresponding group number, variable, or label can be assigned. The model used for grouping can likewise be trained and updated with unsupervised or supervised machine learning. Without specifying service characteristics, the cluster labels of the state parameters can be used to train a model by unsupervised learning (for example K-means, HAC, or DBSCAN). Alternatively, services with known characteristics, such as e-commerce services, enterprise resource planning (ERP) systems, trading systems, or live streaming systems, can be labeled in advance, and the cluster labels of the state parameters of the labeled services used to train a model by supervised learning (for example SVM, KNN, decision trees, or random forests). Unknown services encountered later can then be grouped with this model to infer their service characteristics.
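A deliberately simple sketch of the second-level grouping follows. Instead of a trained clustering model, it assigns the same group number to systems whose cluster-label vectors match exactly; this exact-match rule is an assumption made for brevity, and `assign_group_numbers` and the sample system names are hypothetical.

```python
from typing import Dict, Tuple


def assign_group_numbers(
        label_vectors: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
    """Map each monitored system to a group number.

    label_vectors holds, per system, the cluster labels of its state
    parameters, e.g. (CPU usage, online users, revenue, network traffic).
    Systems with identical label vectors share a group number.
    """
    group_of_signature: Dict[Tuple[int, ...], int] = {}
    groups: Dict[str, int] = {}
    for system, signature in label_vectors.items():
        if signature not in group_of_signature:
            # First time this combination is seen: open a new group.
            group_of_signature[signature] = len(group_of_signature)
        groups[system] = group_of_signature[signature]
    return groups


label_vectors = {
    "shop_a": (0, 1, 2, 0),
    "shop_b": (0, 1, 2, 0),
    "forum_c": (1, 0, 0, 1),
}
print(assign_group_numbers(label_vectors))
```

A clustering or classification model over the label vectors, as the text describes, would additionally place similar but not identical vectors in the same group.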

The detection unit 115 builds models from the state parameters of each monitored system 121 to 12N in order to detect whether each monitored system is abnormal (for example, by predicting that a current or future state parameter exceeds a threshold, or that a specific abnormal event may occur in the future). The model here may be a time series model, a historical data model, and/or a risk prediction model. The historical data model of each monitored system 121 to 12N may be built by applying a machine learning algorithm to state parameters collected in the past, yielding a time series model of the system under normal conditions. The algorithm may be autoregressive integrated moving average (ARIMA), long short-term memory (LSTM), or Random Cut Forest (RCF). For example, a time series model of the CPU usage of the monitored system 121 can be built, and its predictions indicate whether CPU usage is currently abnormal or may become abnormal in the future. In some embodiments, if abnormal conditions that have already occurred are known, such as system problems, information security incidents, or attacks, each combination of parameter values of the state parameters of a monitored system can be annotated with a risk value in order to train a risk prediction model of each monitored system 121 to 12N at a single point in time. The algorithm for training the risk prediction model may be random forest or extreme gradient boosting (XGBoost). For example, in the risk prediction model of the monitored system 121, the risk value of a single record can be annotated according to whether an abnormality actually occurred (the observed parameter values of the record might be CPU usage = 0.5, memory usage = 15 GB, number of inputs/outputs = 2500, and so on). By annotating every record in the historical data set with a risk value and fitting the data set, the risk prediction model is trained and can then predict from new data whether the monitored system 121 is experiencing an abnormality. These risk values can form a risk value sequence, on which a time series model can be trained to predict, for example, the risk value at the next point in time. As explained below, whether an abnormality has occurred can be determined by comparing the risk values predicted by the risk prediction model and by the time series model, together with the determination of the peer-group comparison unit 114.
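As a non-limiting illustration of detecting an abnormal state parameter from its own history, the sketch below flags an observation that falls outside a band around the mean of recent values. This rolling mean and standard deviation band is only a simple stand-in assumed for illustration; it is not ARIMA, LSTM, or RCF, and the function names, window size, and multiplier `z` are hypothetical. The sample history reuses the CPU usage values of monitored system #4 from the Table 2 example later in this description.

```python
from statistics import mean, stdev

def forecast_band(history, window=6, z=3.0):
    """Return (lower, upper) bounds for the next value from the last `window` points."""
    recent = history[-window:]
    centre = mean(recent)
    spread = stdev(recent)
    return centre - z * spread, centre + z * spread

def is_anomalous(history, observed, window=6, z=3.0):
    """True if `observed` lies outside the band derived from `history`."""
    lower, upper = forecast_band(history, window, z)
    return not (lower <= observed <= upper)

# CPU usage (as fractions) at six past time points.
cpu_history = [0.20, 0.22, 0.21, 0.22, 0.21, 0.21]

print(is_anomalous(cpu_history, 0.22))  # False: inside the band
print(is_anomalous(cpu_history, 0.80))  # True: far above the band
```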

In some embodiments, for the services provided by the monitored systems, an abnormal behavior detected by the detection unit 115 for a service with a given service characteristic is not necessarily a true abnormality: within the group sharing that service characteristic, the detected behavior may in fact be judged ordinary. The peer-group comparison unit 114 therefore determines whether the detected abnormal behavior is abnormal or ordinary among the monitored systems in the group with the same service characteristic (for example, the same group number). For example, the monitored systems 121, 122, 129, and 12N all provide online shopping services. During the Mother's Day sales period, a surge in the number of users online is detected on the monitored system 121 and is initially judged to be abnormal behavior. The peer-group comparison unit 114 finds that a surge in the number of users online is also detected on the monitored systems 122, 129, and 12N, so it determines that the surge detected on the monitored system 121 is ordinary behavior within the group and that no abnormality has occurred on the monitored system 121. As another example, the monitored systems 123, 126, and 129 are all financial accounting systems. During the tax filing period, the traffic sources of the monitored system 123 increase rapidly, and the risk prediction model detects a risk of attack. The peer-group comparison unit 114, however, detects that the traffic sources of the monitored systems 126 and 129 also increase rapidly, so it does not treat the risk detected from the rapid increase in traffic sources of the monitored system 123 as an abnormal event; that is, the monitored system 123 is not considered to be at risk of attack. The cloud device 11 may be signal-connected to an alerting unit 116. The alerting unit 116 may generate an abnormality alert according to the abnormal event determined as above and send the alert to a terminal device (not shown) that is signal-connected to the cloud device 11, so that the terminal device displays the alert. In this way, the cloud device 11 avoids sending false alerts to system administrators, allowing service providers to operate more efficiently.

Referring to FIG. 3, FIG. 3 is a flowchart of the service anomaly detection and alerting method of an embodiment of the present utility model operating in interpretation mode. The method may be executed by the cloud device 11 described above. After the models are trained or updated, the cloud device 11 can operate in interpretation mode and perform the following steps. First, in step S31, the operational data and system data corresponding to each monitored system are received and preprocessed to generate a plurality of state parameters for each monitored system. Then, in step S32, each state parameter of each monitored system is clustered by the model used to cluster that kind of state parameter, so that each state parameter of each monitored system is given a cluster label. Then, in step S33, each monitored system is grouped by the model used to group services by service characteristic, according to the cluster labels of its state parameters, so that each monitored system is given a group number (which may also be called another set of cluster labels).

In step S34, whether each monitored system exhibits abnormal behavior is detected from its state parameters, based on the historical data models and/or risk prediction models of that monitored system. If no abnormal behavior is detected, no alert is needed; if abnormal behavior is detected, step S35 is performed. In step S35, it is determined whether the abnormal behavior detected on a monitored system is abnormal or ordinary within the group identified by its group number, that is, whether it constitutes an abnormality. One way to make this determination is to check whether at least a specific number of the monitored systems with the same group number have also been detected with this abnormal behavior. The specific number may be half, all, or a quarter of the monitored systems with the same group number, or another value such as "1" (one monitored system in the group) or "2" (two monitored systems in the group). Put simply, if more than the specific number of monitored systems in the group with the same group number are detected with the same or similar abnormal behavior, the detected behavior is not truly abnormal within the group but ordinary, so no alert is issued. Otherwise, step S36 is performed. In step S36, an abnormality alert is sent to a terminal device electrically connected to the cloud device 11, and the terminal device displays the alert to notify the corresponding system administrator or maintenance and operations personnel.
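As a non-limiting illustration of the peer-group check in steps S35 and S36, the sketch below decides whether a detected abnormal behavior should trigger an alert. The function name and argument names are hypothetical. The paragraph above describes the cut-off both as "at least" and as "more than" the specific number; this sketch assumes the "at least" reading and counts only the other systems in the group.

```python
def is_abnormal_event(target, flagged, group_members, specific_number):
    """Return True if the behavior detected on `target` should trigger an alert.

    flagged:          systems on which step S34 detected the abnormal behavior
    group_members:    all systems sharing the target's group number
    specific_number:  how many flagged peers make the behavior ordinary
    """
    if target not in flagged:
        return False  # nothing was detected on the target in step S34
    peers_flagged = sum(
        1 for member in group_members if member != target and member in flagged
    )
    # Enough peers show the same behavior: ordinary within the group, no alert.
    return peers_flagged < specific_number

shopping_group = ["121", "122", "129", "12N"]

# Surge seen on every system in the group: ordinary behavior, no alert.
print(is_abnormal_event("121", {"121", "122", "129", "12N"}, shopping_group, 2))  # False

# Surge seen only on system 121: abnormal event, alert in step S36.
print(is_abnormal_event("121", {"121"}, shopping_group, 2))  # True
```

The specific number could equally be derived from the group size, for example half of the group, as the paragraph above allows.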

Next, referring to FIG. 4, FIG. 4 is a flowchart of the service anomaly detection and alerting method of an embodiment of the present utility model operating in modeling mode. The models used in interpretation mode are built by the cloud device 11 in modeling mode, whose steps are as follows. In step S41, operational data and system data corresponding to a plurality of monitored systems are received and preprocessed to generate a plurality of state parameters for each monitored system. Next, in step S42, for each kind of state parameter, a model for clustering that state parameter is built from the values of that same state parameter across the monitored systems. Then, in step S43, for each monitored system, each state parameter is clustered so that it is given a cluster label. Then, in step S44, a grouping model for the service characteristics of the monitored systems is built from the cluster labels of the state parameters of the monitored systems. Next, in step S45, for each state parameter of each monitored system, a model for detecting anomalies is built from the sequence of values of that state parameter over a period of time.
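As a non-limiting illustration of steps S42 and S43, the sketch below clusters one state parameter (number of users online) across several monitored systems with a plain k-means loop and returns one cluster index per system. The function names are hypothetical. The starting centroids are chosen by hand here, as an assumption, so that the result is reproducible; the specification does not state how clustering is initialised. The sample values are the first six time points of the Table 1 example later in this description.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means with caller-supplied starting centroids.

    Returns one cluster index per point, in the order of `points`.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Users online at time points 1 to 6 for each monitored system.
profiles = {
    "system_1": (100, 2, 4, 5, 111, 121),
    "system_2": (101, 6, 3, 5, 100, 151),
    "system_3": (0, 55, 55, 55, 0, 0),
    "system_4": (21, 22, 19, 18, 20, 23),
    "system_N": (80, 77, 75, 4, 3, 0),
}
points = list(profiles.values())
seeds = [points[0], points[2], points[3], points[4]]  # hand-picked, k = 4

cluster_labels = dict(zip(profiles, kmeans(points, seeds)))
print(cluster_labels)
```

Systems #1 and #2 end up with the same cluster index, matching the shared label they carry in Table 1. The index values themselves are arbitrary and need not equal the label numbers shown in the table.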

An embodiment of the present utility model further provides a storage medium. The storage medium is non-volatile, such as a flash memory, an optical disc, or a hard disk, and stores program code that can be read by a computer device so that the computer device performs the steps of the service anomaly detection and alerting method shown in FIG. 3 and FIG. 4.

Based on the above, a practical example follows. In some embodiments, an alert is raised when the CPU usage of a monitored system stays above 80% for more than 5 minutes, or when the same external IP address connects more than 50 times within 1 minute. However, thresholds usually depend on the business and services the system actually runs, and as the business operates they need to be adjusted for different time periods rather than held at fixed values. Moreover, as time passes and rules keep being added, it becomes difficult for system administrators to keep track of the relationships or logic among the rules, and differences between systems providing different services easily lead to false alarms caused by noise.
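For contrast, the fixed-threshold CPU rule mentioned above can be sketched as follows. This is only an assumed reading of the rule: samples are taken once per minute, and "more than 5 minutes" is interpreted as more than five consecutive samples above the limit. The function and parameter names are hypothetical.

```python
def sustained_breach(samples, limit=0.80, max_minutes=5):
    """True if per-minute samples stay above `limit` for more than `max_minutes` in a row."""
    run = 0
    for value in samples:
        # Extend the current run of breaches, or reset it on a normal sample.
        run = run + 1 if value > limit else 0
        if run > max_minutes:
            return True
    return False

print(sustained_breach([0.85] * 6))                                  # True
print(sustained_breach([0.85] * 5))                                  # False
print(sustained_breach([0.85, 0.85, 0.85, 0.5, 0.85, 0.85, 0.85]))   # False
```

Both `limit` and `max_minutes` are constants in this rule, which is the rigidity the paragraph above points out.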

The service anomaly detection and alerting method used by the system or device of the embodiments of the present utility model can therefore solve the above technical problems. Take the state parameter "number of users online" as an example. For a system running a certain type of e-commerce, the number of users online is usually high between 20:00 and 24:00, whereas for a certain type of news website the peaks usually fall at 7:00 to 9:00, 12:00 to 13:00, and 18:00 to 21:00. In the state parameter clustering stage, each kind of state parameter can be clustered by an unsupervised learning method. For example, K-means can be used to derive 10 patterns of the number of users online across the monitored systems, to determine which pattern each monitored system belongs to, and to assign the corresponding cluster label (for example, ordinal labels 1, 2, 3, ...). Similarly, the pattern to which the CPU usage of each monitored system belongs can be obtained with the same clustering method. Once the cluster labels of all state parameters of each monitored system are obtained, the monitored systems themselves can be grouped. Tables 1 to 3 below give examples of state parameter clustering and monitored system grouping.

Table 1: Example of cluster labels after clustering the number of users online (values are numbers of users)

| Monitored system | Time point 1 | Time point 2 | Time point 3 | Time point 4 | Time point 5 | Time point 6 | … | Time point N | Cluster label for users online |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Monitored system #1 | 100 | 2 | 4 | 5 | 111 | 121 | … | 4 | 1 |
| Monitored system #2 | 101 | 6 | 3 | 5 | 100 | 151 | … | 4 | 1 |
| Monitored system #3 | 0 | 55 | 55 | 55 | 0 | 0 | … | 7 | 6 |
| Monitored system #4 | 21 | 22 | 19 | 18 | 20 | 23 | … | 21 | 7 |
| … | … | … | … | … | … | … | … | … | … |
| Monitored system #N | 80 | 77 | 75 | 4 | 3 | 0 | … | 0 | 3 |

Table 2: Example of cluster labels after clustering CPU usage

| Monitored system | Time point 1 | Time point 2 | Time point 3 | Time point 4 | Time point 5 | Time point 6 | … | Time point N | Cluster label for CPU usage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Monitored system #1 | 55% | 3% | 6% | 50% | 9% | 54% | … | 51% | 9 |
| Monitored system #2 | 50% | 3% | 4% | 57% | 9% | 51% | … | 45% | 9 |
| Monitored system #3 | 3% | 3% | 3% | 20% | 40% | 80% | … | 88% | 2 |
| Monitored system #4 | 20% | 22% | 21% | 22% | 21% | 21% | … | 22% | 1 |
| … | … | … | … | … | … | … | … | … | … |
| Monitored system #N | 50% | 49% | 49% | 3% | 1% | 1% | … | 1% | 3 |

Table 3: Example of grouping the monitored systems

| Monitored system | Cluster label for users online | Cluster label for CPU usage | … | Cluster label for page views | Cluster label for turnover | Group number |
| --- | --- | --- | --- | --- | --- | --- |
| Monitored system #1 | 1 | 9 | … | 4 | 5 | 1 |
| Monitored system #2 | 1 | 9 | … | 4 | 6 | 1 |
| Monitored system #3 | 6 | 2 | … | 1 | 1 | 4 |
| Monitored system #4 | 7 | 1 | … | 2 | 2 | 5 |
| … | … | … | … | … | … | … |
| Monitored system #N | 3 | 3 | … | 2 | 2 | 3 |

As the tables above illustrate, the cluster labels of the state parameters of monitored system #1 and monitored system #2 are similar, so the two systems are given the same group number, indicating that they are classified into a group with the same service characteristics. Next, future observed values of the number of users online and CPU usage of monitored system #1 can be compared with the values predicted by its historical data model and/or risk prediction model for those parameters, to detect whether either parameter behaves abnormally. If any state parameter behaves abnormally, it is checked whether abnormal behavior has also been detected in the number of users online or CPU usage of most of the monitored systems in the group with group number 1 (that is, the other monitored systems in the group). If not, the abnormal behavior detected in the number of users online or CPU usage of monitored system #1 is an abnormal event within the group, and the system administrator or maintenance and operations personnel of monitored system #1 must be alerted. Furthermore, when the abnormal behavior detected in the number of users online or CPU usage of monitored system #1 is deemed an abnormal event, the abnormal data, or the historical data preceding the abnormality, can be used to train another historical data model or risk prediction model, which the monitored systems in the same group can then use to detect anomalies.

Based on the above, in some embodiments, the operational data and system data corresponding to the monitored systems are still received and preprocessed to generate a plurality of state parameters for each monitored system. However, the monitored systems are grouped in advance, either manually or by an unsupervised grouping method. For each monitored system, each combination of parameter values of its state parameters is annotated with a risk value according to whether actual abnormal behavior occurred in the past (for example, an intrusion by hackers, a service interruption, or a system crash). In one embodiment, if it is known that when a monitored system was intruded its CPU usage was 0.6, its memory usage was 20 GB, and its number of inputs/outputs was 3000, that record can be annotated with a risk value of 1. Records in which no abnormality occurred are annotated with a risk value of 0. Next, a machine learning algorithm (for example, XGBoost) builds (trains) a risk prediction model of the monitored system from the annotated data, and this model can predict the risk value at future points in time. Incidentally, the risk prediction model of a monitored system may be periodic, may change after a long period of time, or may change after specific events (such as software or hardware version updates, new modules being added to the system, changes in business model, or changes in market demand). For example, the present utility model can apply time series analysis, such as seasonal or periodic analysis, and use the time series patterns learned from historical data to identify the trend of the risk value, and accordingly retrain the model or adjust its hyperparameters to obtain a model suited to the new period. Once the risk prediction model of a monitored system has been built, whenever new state parameters (parameter values) of the monitored system are obtained, a predicted risk value can be calculated from them with the risk prediction model; the value may be bounded between 0 and 1. If the predicted risk value exceeds a specific value, for example 0.5, abnormal behavior is deemed detected. It is then checked whether more than a specific number of other monitored systems in the same group also have predicted risk values exceeding the specific value. If not, the detected abnormal behavior is treated as an abnormality (abnormal event); otherwise, it is treated as non-abnormal (not an abnormal event). In this practical example, the predicted risk value may result from the influence of many parameters whose interrelationships are too complex to capture with manual rules, but these implicit inter-parameter and time series relationships can be learned through model training.
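As a non-limiting illustration of the risk-value workflow above, the sketch below first annotates historical records with a risk value of 1 or 0, and then decides from a set of predicted risk values which systems in one group have an abnormal event. Training the risk prediction model itself (for example with XGBoost) is omitted. The function names, record fields, time identifiers, and predicted values are hypothetical; the 0.5 cut-off and the "more than a specific number of other systems" comparison follow the paragraph above.

```python
def label_risk(records, incident_times):
    """Return copies of `records` with risk=1 for known incident times, else 0."""
    return [
        dict(record, risk=1 if record["time"] in incident_times else 0)
        for record in records
    ]

def abnormal_events(predicted_risk, risk_limit=0.5, specific_number=1):
    """Map each system of one group to True if it has an abnormal event.

    A system is flagged when its predicted risk exceeds `risk_limit`. The
    flag counts as an abnormal event unless more than `specific_number`
    other systems in the group are flagged as well.
    """
    flagged = {name for name, risk in predicted_risk.items() if risk > risk_limit}
    return {
        name: name in flagged and len(flagged - {name}) <= specific_number
        for name in predicted_risk
    }

history = [
    {"time": "t1", "cpu": 0.2, "memory_gb": 8, "io": 900},
    {"time": "t2", "cpu": 0.6, "memory_gb": 20, "io": 3000},  # known intrusion
]
training_set = label_risk(history, incident_times={"t2"})
print([record["risk"] for record in training_set])  # [0, 1]

# Only system_1 is risky: abnormal event on system_1.
print(abnormal_events({"system_1": 0.9, "system_2": 0.1, "system_3": 0.2}))

# The whole group is risky: treated as ordinary for every member.
print(abnormal_events({"system_1": 0.9, "system_2": 0.8, "system_3": 0.7}))
```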

Incidentally, suppose abnormal behavior is detected in state parameters such as the number of users online or CPU usage of monitored system #1 because a specific campaign was launched (for example, monitored system #1 provides an online shopping service and holds a specific online event such as an anniversary sale) or a specific event occurred (such as a change in CPU usage policy). In that case, no abnormal behavior is detected in the number of users online or CPU usage of the other monitored systems with group number 1, so the service anomaly detection and alerting method provided by the embodiments of the present utility model, and the devices and systems using it, will identify the behavior detected on monitored system #1 as an abnormality. However, when unsupervised real-time learning is used, once the system administrator or maintenance and operations personnel of monitored system #1 has confirmed the actual cause of the detected abnormality, subsequent anomalies detected in the number of users online and CPU usage of monitored system #1 can be assessed together with that specific campaign. In addition, the historical data model or risk prediction model can be corrected, updated, or retrained, for example by adjusting the risk values, so that the specific campaign is no longer predicted as abnormal behavior.

In some embodiments, the present utility model may be implemented by a computer software program for determining that an abnormal event has occurred in a monitored system and generating an abnormality alert for that event. The program may be loaded into computer equipment such as a server. After being loaded into the computer equipment, the program may perform the following steps: receiving a plurality of first monitoring data sets, each of which includes a plurality of monitoring data, the monitoring data including a first monitoring parameter and a second monitoring parameter, the first monitoring parameter and the second monitoring parameter each including a plurality of monitoring data points, each data point including a monitoring parameter value, the first monitoring data sets respectively containing the monitoring data of different monitored systems, and the monitoring data of each first monitoring data set including operational data and system data; grouping, with a first grouping model, the monitoring parameter values of the first monitoring parameter of the first monitoring data sets to generate a plurality of first cluster labels, so that each first monitoring data set corresponds to one of the first cluster labels, and grouping, with the first grouping model, the monitoring parameter values of the second monitoring parameter of the first monitoring data sets to generate a plurality of second cluster labels, so that each first monitoring data set corresponds to one of the second cluster labels; grouping, with a second grouping model, the first cluster labels and second cluster labels of the first monitoring data sets to generate a plurality of third cluster labels, so that each first monitoring data set corresponds to one of the third cluster labels, wherein at least a plurality of the first monitoring data sets correspond to a same one of the third cluster labels and thereby form a first monitoring group, the first monitoring group includes the plurality of monitored systems corresponding to its first monitoring data sets, and those monitored systems include a first monitored system; receiving a plurality of second monitoring data sets from the first monitoring group, the second monitoring data sets being respectively received from the monitored systems in the first monitoring group and each including the first monitoring parameter and the second monitoring parameter; based on the second monitoring data sets, predicting with a time series algorithm a plurality of first monitoring parameter predicted values and a plurality of second monitoring parameter predicted values respectively corresponding to the first monitoring parameter and the second monitoring parameter of the first monitoring group, so that each monitored system in the first monitoring group corresponds to its own first monitoring parameter predicted value and second monitoring parameter predicted value; receiving, from the monitored systems in the first monitoring group, a plurality of first monitoring parameter observed values for the first monitoring parameter, the observed values respectively corresponding to the first monitoring parameter predicted values; comparing, for each monitored system in the first monitoring group, its first monitoring parameter observed value with its first monitoring parameter predicted value, determining that the first monitoring parameter observed value of the first monitored system does not conform to its corresponding first monitoring parameter predicted value, and determining that, among the monitored systems in the first monitoring group other than the first monitored system, the number of monitored systems whose first monitoring parameter observed value does not conform to the corresponding first monitoring parameter predicted value is below a monitored system count threshold, thereby determining that an abnormal event has occurred in the first monitored system, wherein the monitored system count threshold is smaller than the number of monitored systems in the first monitoring group; and generating an abnormality alert according to the determination of the abnormal event.

某些實施例中，對於包含前述電腦軟體程式所進行之步驟，可包含以下之實施方式。一監控資料集可自一受監控系統直接或間接接收；監控資料集可為一歷史資料集，如受監控系統已發生或已藉由監控所蒐集之資料集。監控資料集可於不同時段或時間點接收，例如第一監控資料集可於第一時間點接收，第二監控資料集可於第二時間點接收。第一監控資料集及第二監控資料集可包含相同特徵維度之監控資料，例如第一監控資料集及第二監控資料集可包含相同之監控參數，如當第一監控資料集包含線上人數、CPU使用率等二十五個監控參數時，第二監控資料集亦包含相同的二十五個監控參數。第一監控參數及第二監控參數可為受監控系統不同之監控參數，例如第一監控參數可為線上人數，第二監控參數可為CPU使用率。監控參數可以是受監控系統之營運相關之參數，亦可為受監控系統之系統相關之參數。可對監控資料集進行資料前處理，以形成複數個監控參數及/或監控資料點。監控資料可指受監控系統受監控而產生之資料，而該等資料可包含本揭露所例示者。監控資料所包含之監控參數可包含營運資料參數及/或系統資料參數，且監控資料亦可包含表示該等參數之狀態的狀態值。一個監控資料可指自一受監控系統接收之一筆包含監控參數之資料。若以表格方式處理資料，監控資料可以「列」(row)之方式為其資料形態。監控資料點可指不同之時間點，複數個監控資料點可包含一時間序列形態。監控資料點可為表一中之時間點，且監控參數值可為受監控系統於各時間點之對應的參數值。第一監控資料集與第二監控資料集可包含不同時段或時間點之監控資料，例如每個監控資料包含有一時間值，該時間值可作為所述時間點或是所關注時段內之一時間點。第一監控資料集可包含第一時段之監控資料，第二監控資料集可包含第二時段之監控資料。第一分群模型及第二分群模型可採用相同之分群演算法。第一分群模型及第二分群模型可包含不同之超參數，如可包含不同之群集數。第一分群模型及第二分群模型之區分可僅為依所進行之分群階段不同所做之區分。監控資料集與群集標籤之關係可為多對一關係，即一或多個監控資料集可對應至一個群集標籤。當監控資料集與受監控系統具有一對一的關係時，受監控系統與群集標籤因此具有多對一關係，即一個群集標籤可指派或對應一或多個受監控系統。藉由本揭露之分群方式，可產生複數個監控群組，各監控群組對應或指派有一群集標籤，如第三群集標籤中其中一者，各第三群集標籤可以是獨特之標籤，使各監控群組得以彼此間區分，亦即相似之受監控系統可分群至同一群組，使各監控群組可包含一或多個受監控系統。時間序列演算法可採ARIMA、LSTM、RCF等可適用於時間序列型態資料的演算法。第二監控資料集可輸入由時間序列演算法所建立之模型，藉以產出基於第二監控資料集所預測之預測值。例如，第二監控資料集可以是欲預測之資料點的前一週的資料點，若每分鐘為一資料點，對單一監控參數而言，第二監控資料集可包含10080個資料點。時間序列模型可先經歷史資料訓練而建模，例如利用各受監控系統之各監控參數的歷史資料來建立各受監控系統的時間序列模型，因此產出各受監控系統所對應的時間序列模型。第二監控資料集包含如線上人數、CPU使用率等二十五個監控參數時，歷史資料可包含與第二監控資料集相同的二十五個監控參數。可設定一信賴區間，使監控參數預測值包含一範圍內之值，例如信賴區間可設為95%。對於不同之監控參數，可設定不同數值之信賴區間。信賴區間可藉由採用一時間序列模組或涵式庫來自動計算、設定。監控參數觀測值可於同一時間點上對應監控參數預測值，例如，對於某一監控參數而言，預測一受監控系統的下一個資料點的監控參數預測值，並可取得該受監控系統於同一個資料點的監控參數觀測值。進行監控參數觀測值與監控參數預測值的比對時，可將監控參數觀測值比對監控參數預測值的信賴區間，若監控參數觀測值不在信賴區間內，可判斷為監控參數觀測值不符合監控參數預測值。監控參數觀測值不在信賴區間可指該觀測值超出該信賴區間，例如該觀測值大於或小於該信賴區間的上限值或下限值。監控參數觀測值與監控參數預測值之比對，可包含計算或接收多個時間點的多個值後進行整體比對，例如可於相同時段內建立一監控參數預測值時間序列及其所對應之一監控參數觀測值時間序列，各時間序列包含該時段內之多個時間點對應的監控參數預測值或監控參數觀測值，進行該二時間序列之比對，若該二時間序列之間判斷有一組觀測值與預測值不符合、有數組觀測值與預測值不符合、有數組觀測值與預測值連續不符合、有數組觀測值與預測值連續不符合後又有一組觀測值與預測值不符合、或有數組觀測值與預測值連續不符合後又緊接一組觀測值與預測值不符合等，則可判斷為監控參數觀測值不符合監控參數預測值。或者，對於發生一組觀測值與預測值後連續數組觀測值與預測值皆無不符合，則可判斷未發生監控參數觀測值不符合監控參數
預測值。受監控系統數量閾值可以根據監控群組的受監控系統數量來設定。受監控系統數量閾值可設定為「2」，使群組內除了所關注之受監控系統(如第一受監控系統)判斷有監控參數觀測值不符合監控參數預測值之情形之外，僅有另一個受監控系統亦判斷有監控參數觀測值不符合監控參數預測值之情形或沒有另外的受監控系統判斷有監控參數觀測值不符合監控參數預測值之情形時，便可判斷發生異常事件；若欲將判斷異常事件的標準設為較寬鬆，換言之欲使異常偵測敏感度降低，則該閾值可設為較高之值，如「3」等；若欲採較嚴格標準，則該閾值可設為「1」，使當群組內只有所關注之受監控系統判斷有監控參數觀測值不符合監控參數預測值之情形時，便可判斷發生異常事件。受監控系統數量閾值亦可設為一比例，例如10%。此外，對於群組內除了所關注之受監控系統之外的其他受監控系統中判斷有觀察值不符合預測值者，可將之判斷為發生異常事件，亦即判斷該其他受監控系統中判斷有觀察值不符合預測值者每一者發生一異常事件，藉此對各異常事件發出告警；在此情況下，可基於群組中判斷為發生異常事件的受監控系統的數量少於同群組中未判斷為發生異常事件的受監控系統的數量，而此數量比例可藉由類似本揭露設定閾值方式設定。依本揭露之群組內判斷異常事件的原理，本新型亦包含依相同原理採相反之判斷方式，例如可判斷第一監控群組中除了第一受監控系統之外的其他受監控系統中所對應之第一監控參數觀測值符合相對應第一監控參數預測值的受監控系統的數量高於一受監控系統數量閾值，藉此判斷第一受監控系統發生異常事件，將該閾值設定為較高數值則反映出較嚴格標準等。異常告警可包含訊息、通知、旗標、標籤、音訊等。異常告警可包含與所欲關注之受監控系統(如該第一受監控系統)的相關資訊，例如指示發生異常者為所欲關注之受監控系統，以使異常告警及/或受監控系統可進行如記錄、統計、反映等後續處理。In some embodiments, the steps performed by the aforementioned computer software program may include the following implementations. A monitoring data set can be received directly or indirectly from a monitored system; the monitoring data set can be a historical data set, such as the data set that has occurred in the monitored system or has been collected by monitoring. The monitoring data set can be received at different time periods or time points, for example, the first monitoring data set can be received at the first time point, and the second monitoring data set can be received at the second time point. The first monitoring data set and the second monitoring data set may contain monitoring data of the same feature dimension. For example, the first monitoring data set and the second monitoring data set may contain the same monitoring parameters. When there are twenty-five monitoring parameters such as CPU usage, the second monitoring data set also includes the same twenty-five monitoring parameters. 
The first monitoring parameter and the second monitoring parameter may be different monitoring parameters of the monitored system; for example, the first monitoring parameter may be the number of online users and the second monitoring parameter may be CPU usage. A monitoring parameter may be an operations-related parameter or a system-related parameter of the monitored system. Data preprocessing may be performed on a monitoring data set to form a plurality of monitoring parameters and/or monitoring data points. Monitoring data refers to data generated by monitoring a monitored system, and may include the data exemplified in this disclosure. The monitoring parameters contained in the monitoring data may include operating data parameters and/or system data parameters, and the monitoring data may also include status values indicating the state of those parameters. One item of monitoring data may refer to one record, containing monitoring parameters, received from a monitored system. If the data is handled in tabular form, each item of monitoring data may take the form of a row. Monitoring data points may refer to different time points, and a plurality of monitoring data points may form a time series. The monitoring data points may be the time points in Table 1, and the monitoring parameter values may be the corresponding parameter values of the monitored system at each time point. The first monitoring data set and the second monitoring data set may contain monitoring data from different periods or time points; for example, each item of monitoring data contains a time value, which may serve as the time point or as a time point within the period of interest. The first monitoring data set may include monitoring data of a first period, and the second monitoring data set may include monitoring data of a second period. 
The first clustering model and the second clustering model may use the same clustering algorithm. They may have different hyperparameters, e.g., different numbers of clusters. The distinction between the first clustering model and the second clustering model may merely reflect the different clustering stages in which they are used. The relationship between monitoring data sets and cluster labels may be many-to-one; that is, one or more monitoring data sets may correspond to one cluster label. When monitoring data sets and monitored systems have a one-to-one relationship, the monitored systems and cluster labels therefore have a many-to-one relationship; that is, one cluster label may be assigned to, or correspond to, one or more monitored systems. Through the clustering approach of this disclosure, a plurality of monitoring groups can be generated, each corresponding to or assigned a cluster label, such as one of the third cluster labels. Each third cluster label may be a unique label, so that the monitoring groups can be distinguished from one another; in other words, similar monitored systems are clustered into the same group, and each monitoring group may contain one or more monitored systems. The time series algorithm may be ARIMA, LSTM, RCF, or another algorithm suited to time series data. The second monitoring data set may be input into a model built with the time series algorithm to produce a predicted value based on the second monitoring data set. For example, the second monitoring data set may consist of the data points from the week preceding the data point to be predicted; with one data point per minute, the second monitoring data set contains 10080 data points for a single monitoring parameter. The time series model may first be built by training on historical data. 
For example, the historical data of each monitoring parameter of each monitored system is used to build a time series model for that monitored system, yielding a time series model corresponding to each monitored system. When the second monitoring data set includes twenty-five monitoring parameters such as the number of online users and CPU usage, the historical data may include the same twenty-five monitoring parameters as the second monitoring data set. A confidence interval may be set so that the predicted value of a monitoring parameter covers a range of values; for example, the confidence interval may be set to 95%. Different monitoring parameters may be given confidence intervals of different values. Confidence intervals may be calculated and set automatically by a time series module or library. A monitoring parameter observed value may correspond to a monitoring parameter predicted value at the same time point; for example, for a given monitoring parameter, the predicted value at the next data point of a monitored system is forecast, and the observed value of that monitored system at the same data point is then obtained. When the observed value is compared with the predicted value, the observed value may be checked against the confidence interval of the predicted value; if the observed value is not within the confidence interval, the observed value is judged not to conform to the predicted value. 
An observed value that is not within the confidence interval is one that falls outside it, for example one that is greater than the upper limit or less than the lower limit of the interval. The comparison between observed and predicted values may also be made as a whole after values at multiple time points have been calculated or received. For example, a time series of predicted values and a corresponding time series of observed values may be built for the same period, each containing the predicted or observed values at multiple time points in that period, and the two time series are then compared. The observed value may be judged not to conform to the predicted value if, between the two series, one observed/predicted pair does not conform; several pairs do not conform; several consecutive pairs do not conform; several consecutive non-conforming pairs are later followed by another non-conforming pair; or several consecutive non-conforming pairs are immediately followed by another non-conforming pair. Alternatively, if one non-conforming pair is followed by several consecutive pairs that all conform, it may be judged that no non-conformance has occurred. The threshold for the number of monitored systems may be set according to the number of monitored systems in the monitoring group. 
The threshold for the number of monitored systems may be set to "2", so that an abnormal event is determined when, apart from the monitored system of interest (such as the first monitored system) being judged to have an observed value that does not conform to the predicted value, only one other monitored system in the group, or no other monitored system, is judged to have such a non-conformance. If a looser criterion for determining abnormal events is desired, in other words if the sensitivity of anomaly detection is to be reduced, the threshold may be set to a higher value such as "3"; if a stricter criterion is desired, the threshold may be set to "1", so that an abnormal event is determined only when the monitored system of interest is the sole system in the group judged to have a non-conformance. The threshold for the number of monitored systems may also be set as a proportion, for example 10%. 
In addition, any other monitored system in the group, apart from the monitored system of interest, that is judged to have an observed value not conforming to the predicted value may also be determined to have an abnormal event; that is, each such system is determined to have its own abnormal event, and an alarm is issued for each abnormal event. In this case, the determination may be based on the number of monitored systems in the group determined to have abnormal events being smaller than the number of monitored systems in the same group not so determined, and this proportion may be set in a manner similar to the threshold setting of this disclosure. Following the principle of in-group determination of abnormal events, the present utility model also covers the converse determination based on the same principle. For example, the number of other monitored systems in the first monitoring group, apart from the first monitored system, whose first monitoring parameter observed values conform to the corresponding first monitoring parameter predicted values may be judged to be higher than a threshold number of monitored systems, whereby the first monitored system is determined to have an abnormal event; setting this threshold to a higher value reflects a stricter criterion. An abnormality alarm may include a message, notification, flag, label, audio signal, and the like. An abnormality alarm may include information about the monitored system of interest (such as the first monitored system), for example an indication that the system in which the abnormality occurred is the monitored system of interest, so that follow-up processing such as recording, statistics, and reporting can be carried out on the abnormality alarm and/or the monitored system.
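
The interval check and the in-group count described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `Forecast`, `deviates`, and `is_abnormal_event` and all sample figures are hypothetical, and the interval bounds are assumed to be supplied by whatever time series model is in use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Forecast:
    """Predicted value of one monitoring parameter with its interval bounds."""
    value: float
    lower: float  # lower limit of the confidence interval
    upper: float  # upper limit of the confidence interval


def deviates(observed: float, forecast: Forecast) -> bool:
    """True when the observed value falls outside the forecast interval."""
    return observed < forecast.lower or observed > forecast.upper


def is_abnormal_event(target: str,
                      observed: dict[str, float],
                      forecasts: dict[str, Forecast],
                      threshold: int = 2) -> bool:
    """Decide whether the target system has an abnormal event.

    An event is reported only when the target deviates and fewer than
    `threshold` other systems in the same monitoring group also deviate.
    """
    if not deviates(observed[target], forecasts[target]):
        return False
    peers_deviating = sum(
        1 for name in observed
        if name != target and deviates(observed[name], forecasts[name])
    )
    return peers_deviating < threshold


# Hypothetical group of four systems sharing the same forecast interval.
forecasts = {name: Forecast(50.0, 40.0, 60.0) for name in ("a", "b", "c", "d")}
observed = {"a": 75.0, "b": 52.0, "c": 48.0, "d": 55.0}
print(is_abnormal_event("a", observed, forecasts))  # True: only "a" deviates
```

With `threshold=1` the event is reported only when the target is the sole deviating system, which matches the stricter setting described above.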

某些實施例中，本新型可藉由一種產生異常告警之電腦軟體程式，經由電腦載入該程式後，執行包含以下之步驟：接收複數個受監控系統每一者的第一組監控資料，該等第一組監控資料每一者包含所對應之受監控系統受監控的營運資料及系統資料，該等營運資料及系統資料以複數個狀態參數分類；利用該等第一組監控資料之狀態參數對該等受監控系統進行分群，以產生複數個狀態參數群集標籤，使該等受監控系統每一者對應該等狀態參數群集標籤其中一者；利用該等狀態參數的該等狀態參數群集標籤對該等受監控系統進行分群，以產生複數個受監控系統群集標籤，使該等受監控系統每一者對應該等受監控系統群集標籤其中一者，該等受監控系統形成複數個受監控系統群集，該等受監控系統群集分別對應至該等受監控系統群集標籤，該等受監控系統群集包含第一受監控系統群集，該第一受監控系統群集包含對應至同一受監控系統群集標籤的複數個受監控系統；針對該等狀態參數每一者，接收該第一受監控系統群集之複數個受監控系統每一者的一監控觀測值，且針對該第一受監控系統群集之複數個受監控系統每一者產生對應該等狀態參數每一者的一監控資料時間序列，並依據各監控資料時間序列產生一監控預測值，該監控預測值於一時間軸上係對應該監控觀測值；判斷該第一受監控系統群集之複數個受監控系統每一者的監控觀測值是否符合其對應之監控預測值，若不符合則判斷為所對應之受監控系統發生一異常行為，並當該第一受監控系統群集內判斷為發生異常行為的受監控系統的數量大於一且小於或等於一異常閾值時，判斷發生異常行為之該至少一受監控系統發生異常事件，該異常閾值係小於該第一受監控系統群集中全部受監控系統的數量；以及當判斷發生該異常事件時，產生一異常告警，該異常告警係指示發生異常事件者為該至少一受監控系統。In some embodiments, the present invention can use a computer software program that generates abnormal alarms, and after the program is loaded through a computer, the steps include the following: receiving the first set of monitoring data of each of the plurality of monitored systems, Each of the first set of monitoring data includes the monitored operation data and system data of the corresponding monitored system, and the operation data and system data are classified by a plurality of state parameters; the state of the first set of monitoring data is used. The parameters are grouped to the monitored systems to generate a plurality of state parameter cluster labels, so that each of the monitored systems corresponds to one of the state parameter cluster labels; the state parameter clusters of the state parameters are used The labels group the monitored systems to generate a plurality of monitored system cluster labels, so that each of the monitored systems corresponds to one of the monitored system cluster labels, and the monitored systems form a plurality of monitored system cluster labels. 
The monitored systems thereby form a plurality of monitored system clusters, which correspond respectively to the monitored system cluster labels; the monitored system clusters include a first monitored system cluster, which includes a plurality of monitored systems corresponding to the same monitored system cluster label; for each of the state parameters, receiving a monitoring observed value of each of the plurality of monitored systems of the first monitored system cluster, generating, for each of those monitored systems, a monitoring data time series corresponding to each of the state parameters, and generating a monitoring predicted value from each monitoring data time series, the monitoring predicted value corresponding to the monitoring observed value on a time axis; determining whether the monitoring observed value of each of the plurality of monitored systems of the first monitored system cluster conforms to its corresponding monitoring predicted value and, if not, determining that an abnormal behavior has occurred in the corresponding monitored system, and, when the number of monitored systems in the first monitored system cluster determined to have abnormal behavior is greater than one and less than or equal to an abnormality threshold, determining that an abnormal event has occurred in the at least one monitored system having the abnormal behavior, the abnormality threshold being less than the total number of monitored systems in the first monitored system cluster; and, when the abnormal event is determined to have occurred, generating an abnormality alarm indicating that the at least one monitored system is the one in which the abnormal event occurred.
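
The two-stage grouping in these steps (per-parameter cluster labels first, then grouping of systems by those labels) can be sketched as below. The text does not name a clustering algorithm at this point, so this sketch makes two loudly stated substitutions: equal-width binning stands in for the first clustering model, and grouping systems with identical label vectors stands in for the second. The function names and sample values are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def bin_labels(values: dict[str, float], n_bins: int = 2) -> dict[str, int]:
    """Stage 1 stand-in: label each system by the equal-width bin of its value."""
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values match
    return {
        name: min(int((v - lo) / width), n_bins - 1)
        for name, v in values.items()
    }


def group_systems(params: dict[str, dict[str, float]]) -> dict[tuple, list[str]]:
    """Stage 2 stand-in: systems sharing the same label vector form one group."""
    per_param = {p: bin_labels(vals) for p, vals in params.items()}
    groups = defaultdict(list)
    for name in next(iter(params.values())):
        # The label vector plays the role of the monitored system cluster label.
        key = tuple(per_param[p][name] for p in sorted(params))
        groups[key].append(name)
    return dict(groups)


# Hypothetical state parameters: one operating metric and one system metric.
params = {
    "cpu_usage": {"s1": 20.0, "s2": 25.0, "s3": 90.0, "s4": 85.0},
    "online_users": {"s1": 100.0, "s2": 120.0, "s3": 900.0, "s4": 950.0},
}
groups = group_systems(params)
print(groups)  # {(0, 0): ['s1', 's2'], (1, 1): ['s3', 's4']}
```

Clustering on the labels rather than on raw values keeps parameters with very different scales from dominating the second stage; a real deployment would likely replace both stand-ins with trained clustering models.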

某些實施例中，前述各步驟更可藉由以下方式實施。所謂異常行為，可意指或代表參數值不符合之判斷結果，並非一定指據以產生異常告警之異常事件。異常行為可視為異常判斷之初步結果或中繼結果，並藉由多個異常行為之綜合判斷，例如判斷發生異常行為之受監控系統的數量等，再判斷最終結果，例如將最終結果判斷為異常事件，以異常事件為產生異常告警之依據。異常行為可作為單一受監控系統的異常判斷結果，由於本新型可進行群組內多個受監控系統的整體判斷，因此群組內多個異常行為可整體作為單一受監控系統是否發生異常事件之判斷依據。例如，第一受監控系統群集包含15個受監控系統時，異常閾值可設為50%，則判斷群集內至少一個且不超過七個的受監控系統發生異常行為時，可判斷該等發生異常行為的受監控系統發生異常事件，據此可產生異常告警，所產生之異常告警可指示該至少一個且不超過七個的受監控系統發生異常事件。此外，若判斷超過七個受監控系統發生異常行為時，可對群集內未發生異常行為的受監控系統產生異常告警，亦即不將判斷發生異常行為的受監控系統判斷為發生異常事件，而是將未發生異常行為的受監控系統判斷為發生異常事件，此方式可使群集內在行為上與其他受監控系統不同的少數者被判斷為發生異常事件。據此，此方式可藉由下列步驟實施為：判斷該第一受監控系統群集之複數個受監控系統每一者的監控觀測值是否符合其對應之監控預測值，若不符合則判斷為所對應之受監控系統發生一異常行為，並當該第一受監控系統群集內判斷為發生異常行為的受監控系統的數量大於該群集內未判斷為發生異常行為的受監控系統的數量時，將該未判斷為發生異常行為的至少一受監控系統判斷為發生一異常事件，當判斷發生該至少一異常事件時，產生一異常告警，該異常告警係指示該發生異常事件之至少一受監控系統。當然，此方式中對於群集內之數量判斷(如多數/少數判斷，或以判斷異常行為來區分群集內的二個次群集)可視為等同設定一閾值，或是可藉由設定一閾值來控制此判斷方式。例如，針對包含多個受監控系統的一特定受監控系統群集，利用本揭露判斷異常行為的方式，建立第一異常事件判斷條件，即當群集內少於半數的受監控系統有異常行為時，判斷該少於半數的受監控系統發生異常事件；此外，可再建立第二異常事件判斷條件，即當群集內多於半數的受監控系統有異常行為時，判斷所剩的、未判斷為發生異常行為的受監控系統發生異常事件，而第二異常事件判斷條件可獨立作為群集內異常事件的判斷，或是與第一異常事件判斷條件一同使用。此外，此方式可同時利用本揭露其他閾值的方式一同實施，以建立不同情況下的判斷條件，且本新型可藉由設定複數個閾值來對不同的異常事件判斷方式進行告警。例如，可設定不同或層級化的數值門檻之閾值來表示不同等級之異常事件，如可設第一類型異常閾值為「2」及第二類型中度異常閾值為「5」，將二個以內之受監控系統的異常行為視為第一類型異常事件時，異常告警可指示發生第一類型異常事件，而將三個以上至五個以內之受監控系統的異常行為視為第二類型異常事件時，異常告警可指示發生第二類型異常事件等；不同類型之異常事件可代表不同異常程度等。針對各個受監控系統群集，可採本揭露之群集內異常告警方式，因此可確保每一受監控系統在其自身的群集內皆可受到監控而在發生異常事件時產生告警。In some embodiments, the aforementioned steps can be implemented in the following manner. The so-called abnormal behavior may refer to or represent the judgment result that the parameter value does not match, and does not necessarily refer to the abnormal event based on which the abnormal alarm is generated. 
Abnormal behavior may be regarded as a preliminary or intermediate result of the abnormality determination. The final result is determined from a combined assessment of multiple abnormal behaviors, for example from the number of monitored systems exhibiting abnormal behavior; the final result may be an abnormal event, and the abnormal event is the basis for generating an abnormality alarm. Abnormal behavior may serve as the abnormality determination result for a single monitored system. Because the present utility model can make an overall determination across multiple monitored systems in a group, the multiple abnormal behaviors in the group may together serve as the basis for determining whether a single monitored system has an abnormal event. For example, when the first monitored system cluster includes 15 monitored systems and the abnormality threshold is set to 50%, then if at least one and no more than seven monitored systems in the cluster are determined to have abnormal behavior, those monitored systems may be determined to have abnormal events and an abnormality alarm may be generated accordingly, indicating that the at least one and no more than seven monitored systems have abnormal events. Conversely, if more than seven monitored systems are determined to have abnormal behavior, an abnormality alarm may be generated for the monitored systems in the cluster that do not have abnormal behavior. That is, the monitored systems exhibiting abnormal behavior are not determined to have abnormal events; instead, the monitored systems without abnormal behavior are. In this way, the minority whose behavior differs from that of the other monitored systems in the cluster is determined to have abnormal events. 
Accordingly, this approach may be implemented by the following steps: determining whether the monitoring observed value of each of the plurality of monitored systems of the first monitored system cluster conforms to its corresponding monitoring predicted value and, if not, determining that an abnormal behavior has occurred in the corresponding monitored system; when the number of monitored systems in the first monitored system cluster determined to have abnormal behavior is greater than the number not so determined, determining that an abnormal event has occurred in the at least one monitored system not determined to have abnormal behavior; and, when the at least one abnormal event is determined to have occurred, generating an abnormality alarm indicating the at least one monitored system in which the abnormal event occurred. The count-based determination within the cluster in this approach (such as a majority/minority determination, or dividing the cluster into two sub-clusters according to abnormal behavior) may be regarded as equivalent to setting a threshold, or may be controlled by setting a threshold. 
For example, for a particular monitored system cluster containing multiple monitored systems, a first abnormal event condition may be established using the abnormal behavior determination of this disclosure: when fewer than half of the monitored systems in the cluster have abnormal behavior, those monitored systems are determined to have abnormal events. A second abnormal event condition may also be established: when more than half of the monitored systems in the cluster have abnormal behavior, the remaining monitored systems, which were not determined to have abnormal behavior, are determined to have abnormal events. The second condition may be used on its own to determine abnormal events in the cluster, or together with the first condition. This approach may also be combined with the other thresholds of this disclosure to establish conditions for different situations, and the present utility model may set a plurality of thresholds so that alarms are issued for different ways of determining abnormal events. For example, different or tiered numerical thresholds may be set to represent abnormal events of different levels: with a first-type abnormality threshold of "2" and a second-type (moderate) abnormality threshold of "5", abnormal behavior in up to two monitored systems is treated as a first-type abnormal event, and the abnormality alarm may indicate a first-type abnormal event; abnormal behavior in three to five monitored systems is treated as a second-type abnormal event, and the abnormality alarm may indicate a second-type abnormal event. Different types of abnormal events may represent different degrees of abnormality. 
For each monitored system cluster, the in-cluster abnormality alarm approach of this disclosure may be applied, ensuring that every monitored system is monitored within its own cluster and that an alarm is generated when an abnormal event occurs.
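
The minority rule and the tiered thresholds described in this paragraph can be sketched as below, using the 15-system, 50% example from the text. The function names, the system identifiers, and the tier labels are hypothetical; how ties and the "all systems deviate" case should be handled is an assumption of this sketch, not something the text specifies.

```python
from __future__ import annotations


def systems_with_events(cluster: list[str], deviating: set[str],
                        ratio: float = 0.5) -> list[str]:
    """Return the systems judged to have an abnormal event.

    If at least one system deviates and the count stays within the limit,
    the deviating systems are reported. If more systems deviate, the
    non-deviating minority is reported instead.
    """
    limit = int(len(cluster) * ratio)  # 15 systems at 50% gives a limit of 7
    flagged = [s for s in cluster if s in deviating]
    if not flagged:
        return []
    if len(flagged) <= limit:
        return flagged
    return [s for s in cluster if s not in deviating]


def event_type(n_deviating: int,
               tiers: tuple = ((2, "type 1"), (5, "type 2"))) -> str | None:
    """Map a count of deviating systems to a tiered event type, if any."""
    for upper, label in tiers:
        if 1 <= n_deviating <= upper:
            return label
    return None


cluster = [f"s{i}" for i in range(15)]
print(systems_with_events(cluster, {"s0", "s1"}))  # ['s0', 's1']
print(systems_with_events(cluster, set(cluster[:9])))  # the six systems s9 to s14
print(event_type(4))  # type 2
```

Reporting whichever side is smaller reflects the idea that, within a group of similar systems, the behavior shared by the majority is treated as the norm.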

某些實施例中，本新型可藉由一種產生異常告警之電腦軟體程式來實施，經由電腦載入該程式後可執行包含以下之步驟：對複數個受監控系統的複數個狀態參數進行分群並產生每一者包含複數個第一群集標籤之複數個第一群集標籤集，使該等受監控系統每一者被指派該等第一群集標籤集每一者中的一個第一群集標籤，該等第一群集標籤集係分別對應該等狀態參數，該複數個狀態參數係包含該等受監控系統之營運資料及系統資料的狀態參數；以該等第一群集標籤集的複數個第一群集標籤對該等受監控系統進行分群並產生包含複數個第二群集標籤之第二群集標籤集，使該等受監控系統每一者被指派該第二群集標籤集中的一個第二群集標籤，其中，該等受監控系統中有複數個受監控系統形成第一受監控系統群組，使該第一受監控系統群組之全部受監控系統係被指派同一個第二群集標籤，該第一受監控系統群組包含一目標受監控系統；針對該第一受監控系統群組，形成該群組中各受監控系統之一訓練資料集，該等訓練資料集每一者包含該等狀態參數且更包含一異常值參數，該異常值參數係指示所對應之受監控系統的一異常狀態；基於該等訓練資料集建立對應該第一受監控系統群組中各受監控系統的第一異常預測模型，該等第一異常預測模型每一者係用以預測所對應之受監控系統的第一異常值參數預測值；自該等訓練資料集每一者中之異常值參數形成一異常值參數時間序列，基於該等異常值參數時間序列建立對應該第一受監控系統群組中各受監控系統的第二異常預測模型，該等第二異常預測模型每一者係用以預測所對應之受監控系統的第二異常值參數預測值；針對該第一受監控系統群組中各受監控系統，利用該等第一異常預測模型中所對應者預測於一目標時間點之第一異常值參數預測值，利用該等第二異常預測模型中所對應者預測於該目標時間點之第二異常值參數預測值，於該第一異常值參數預測值不符合該第二異常值參數預測值時判斷所對應之受監控系統發生異常行為，依前述異常行為判斷方式判斷該目標受監控系統發生異常行為，並依前述異常行為判斷方式判斷該第一受監控系統群組中除了該目標受監控系統之外有低於一特定數量之其他受監控系統發生異常行為，該數量低於該第一受監控系統群組的受監控系統數量；以及產生關聯於該目標受監控系統之一異常告警。In some embodiments, the present invention can be implemented by a computer software program that generates abnormal alarms. 
After the program is loaded by a computer, steps including the following can be performed: clustering a plurality of state parameters of a plurality of monitored systems to generate a plurality of first cluster label sets, each including a plurality of first cluster labels, such that each of the monitored systems is assigned one first cluster label from each of the first cluster label sets, the first cluster label sets corresponding respectively to the state parameters, and the state parameters including state parameters of the operating data and system data of the monitored systems; clustering the monitored systems by the first cluster labels of the first cluster label sets to generate a second cluster label set including a plurality of second cluster labels, such that each of the monitored systems is assigned one second cluster label from the second cluster label set, wherein a plurality of the monitored systems form a first monitored system group, all monitored systems of which are assigned the same second cluster label, and the first monitored system group includes a target monitored system; for the first monitored system group, forming a training data set for each monitored system in the group, each training data set including the state parameters and further including an abnormal value parameter that indicates an abnormal state of the corresponding monitored system; building, from the training data sets, a first anomaly prediction model for each monitored system in the first monitored system group, each first anomaly prediction model being used to predict a first abnormal value parameter predicted value of the corresponding monitored system; forming an abnormal value parameter time series from the abnormal value parameter in each of the training data sets, and building, from these time series, a second anomaly prediction model for each monitored system in the first monitored system group, each second anomaly prediction model being used to predict a second abnormal value parameter predicted value of the corresponding monitored system; for each monitored system in the first monitored system group, using the corresponding first anomaly prediction model to predict the first abnormal value parameter predicted value at a target time point, using the corresponding second anomaly prediction model to predict the second abnormal value parameter predicted value at the target time point, and determining that the corresponding monitored system has abnormal behavior when the first predicted value does not conform to the second predicted value; determining in this manner that the target monitored system has abnormal behavior, and determining in the same manner that fewer than a specific number of the other monitored systems in the first monitored system group, apart from the target monitored system, have abnormal behavior, the specific number being lower than the number of monitored systems in the first monitored system group; and generating an abnormality alarm associated with the target monitored system.

某些實施例中，前述訓練資料集的異常值參數係代表一異常值或風險值，如可為本揭露所述之風險值。異常值參數可以包含多個時間點或一時間序列的異常值，該等異常值可以有無異常或有無風險來標註，如有異常者標註為「1」，無異常者標註為「0」；針對此異常值參數，訓練資料集中標註為「1」的資料可代表受監控系統的異常狀態為異常，而標註為「0」的資料可代表受監控系統的異常狀態為非異常。訓練資料集可包含歷史資料，例如包含異常值或風險值的資料。訓練資料集的每筆資料可包含一特定時間點之資料，據此，異常值參數反映受監控系統於該特定時間點之異常狀態。利用訓練資料集可訓練出第一異常預測模型，此模型所用之演算法可包含適合之監督式機器學習演算法。或者可說，異常值參數係對狀態參數聯集進行標註之資料，使在不需針對個別狀態參數進監控之情況下，可針對全部狀態參數來整體監控，並據此以第一異常預測模型來預測異常。此外，各個時間點的異常值可藉由形成一時間序列，所形成之異常值參數時間序列可用以訓練如採用LSTM演算法等所形成之第二異常預測模型。例如，針對一受監控系統，每分鐘為一資料點，對每一資料點標註異常值參數，若有標註了一週之資料，則有10080個異常值參數的參數值(或說異常值參數包含了該週各資料點)，此等參數值便可形成一異常值參數時間序列並用以建立時間序列模型。針對一筆新資料，如新的觀察值，該新資料可包含狀態參數但不包含異常值參數，利用第一及第二異常預測模型便可分別預測出各模型的預測值，該等預測值若有差異時，便可據以判斷有異常行為產生。例如，將第二異常預測模型作為基準模型時，第二異常值參數預測值可作為預測基準值，若第一異常值參數預測值不符合第二異常值參數預測值或超出其信賴區間之上、下限值時，便可判斷發生異常行為。In some embodiments, the outlier parameter of the aforementioned training data set represents an outlier or risk value, such as the risk value described in this disclosure. The outlier parameter can include outliers at multiple time points or a time series, and these outliers can be marked with or without abnormality or risk. If there is an abnormality, it is marked as "1", and if there is no abnormality, it is marked as "0"; for For this outlier parameter, the data marked with "1" in the training data set can represent that the abnormal state of the monitored system is abnormal, and the data marked with "0" can represent that the abnormal state of the monitored system is not abnormal. The training data set may contain historical data, such as data containing outliers or values at risk. Each piece of data in the training data set may include data at a specific time point, and accordingly, the abnormal value parameter reflects the abnormal state of the monitored system at the specific time point. Using the training data set, a first anomaly prediction model can be trained, and the algorithm used for this model can include a suitable supervised machine learning algorithm. 
In other words, the abnormal value parameter is a label applied to the state parameters taken together, so that all state parameters can be monitored as a whole, without monitoring each state parameter individually, and the first anomaly prediction model can predict anomalies on that basis. In addition, the abnormal values at successive time points can be formed into a time series, and the resulting abnormal value parameter time series can be used to train the second anomaly prediction model, built for example with an LSTM algorithm. For example, for a monitored system with one data point per minute and an abnormal value parameter labeled at every data point, one week of labeled data yields 10080 values of the abnormal value parameter (in other words, the abnormal value parameter covers every data point of that week); these values form an abnormal value parameter time series that is used to build the time series model. For a new item of data, such as a new observation, which may contain the state parameters but not the abnormal value parameter, the first and second anomaly prediction models each produce a predicted value; if the two predicted values differ, abnormal behavior can be determined accordingly. For example, when the second anomaly prediction model serves as the baseline model, the second abnormal value parameter predicted value serves as the baseline prediction; abnormal behavior is determined if the first abnormal value parameter predicted value does not conform to the second abnormal value parameter predicted value or falls above the upper limit or below the lower limit of its confidence interval.
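
The label series and the comparison of the two model outputs described here can be sketched as follows. Training the supervised model and the time series model is outside the scope of this sketch; their outputs are simply passed in as numbers. The names `label_series` and `abnormal_behavior`, and the sample values, are hypothetical.

```python
from __future__ import annotations

MINUTES_PER_WEEK = 7 * 24 * 60  # one data point per minute for one week


def label_series(abnormal_minutes: set[int]) -> list[int]:
    """Build a week-long label series: 1 marks abnormal, 0 marks normal."""
    return [1 if t in abnormal_minutes else 0 for t in range(MINUTES_PER_WEEK)]


def abnormal_behavior(first_pred: float,
                      baseline_lower: float,
                      baseline_upper: float) -> bool:
    """Flag abnormal behavior when the first model's prediction leaves the
    confidence interval of the second (baseline) model's prediction."""
    return first_pred < baseline_lower or first_pred > baseline_upper


series = label_series({5, 6, 7})
print(len(series))  # 10080
print(abnormal_behavior(0.9, 0.0, 0.3))  # True: 0.9 is above the upper limit
```

Treating the time series model as the baseline means the supervised model's output is judged against what the recent label history would lead one to expect.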

綜合以上所述，本新型實施例提供的服務異常偵測告警方法與使用此方法的設備與系統，相較於先前技術，可包含下述優點：(1)可避免僅從 IT維運等資料與日誌來監控或判斷異常事件，提升早期告警機會，亦降低誤報可能性；(2)藉由同類型服務或公司的資料判斷或訓練，避免資料不足狀況，提升模型準確性；(3)藉由同類型服務或公司的資料與建模，提升異常狀況的比對與告警效率；(4)可以事前防範在不明顯特徵或特徵組合下，即可預測異常，不再需要人為定義邏輯；(5)可考量事件的週期性、季節性，以及長短期資料的影響；以及(6)綜合時間序列趨勢的預防性告警，以及風險值評估方法，避免傳統風險預測方法的主觀偏差，以及對處理大量資料效率與準確率過低的狀況。另外，本新型實施例提供的服務異常偵測告警方法與使用此方法的設備與系統可以應用於各類AI相關產品與服務，例如電商、遊戲業的異常偵測、製造業的異常偵測、財務報表保險制度(FSI)金融保險產業的詐欺偵測，以及管理服務提供商(MSP)的自動化維運等。In summary, compared with the prior art, the service anomaly detection and alarm method provided by the embodiments of the present utility model, and the devices and systems using this method, may offer the following advantages: (1) abnormal events need not be monitored or determined solely from IT operations data and logs, which improves the chance of early warning and reduces the likelihood of false alarms; (2) determination or training with data from services or companies of the same type avoids data shortages and improves model accuracy; (3) data and modeling from services or companies of the same type improve the efficiency of comparing and alerting on abnormal conditions; (4) anomalies can be predicted in advance even when features or feature combinations are not obvious, so manually defined logic is no longer needed; (5) the periodicity and seasonality of events, and the influence of long-term and short-term data, can be taken into account; and (6) preventive alarms that incorporate time series trends, together with the risk value assessment method, avoid the subjective bias of traditional risk prediction methods and their low efficiency and accuracy when handling large amounts of data. In addition, the method and the devices and systems using it can be applied to a variety of AI-related products and services, such as anomaly detection in e-commerce and gaming, anomaly detection in manufacturing, fraud detection in the Financial Statement Insurance System (FSI) financial and insurance industry, and automated operations for managed service providers (MSP).

It should be understood that the examples and embodiments described herein are for illustrative purposes only, that the disclosed embodiments and technical features may be combined in various ways within the spirit of the present utility model, and that various modifications or changes in light thereof will be suggested to those skilled in the art and are to be included within the spirit and scope of this application and the scope of the appended claims.

1: Abnormal alarm system

11: Cloud device

121～12N: Monitored systems

111: Data preprocessing unit

112: Individual parameter clustering unit

113: Individual grouping unit

114: Same-group comparison unit

115: Detection unit

116: Alarm unit

S31～S45: Steps

The accompanying drawings are provided so that those having ordinary skill in the art to which the present utility model pertains can further understand it, and they are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present utility model and, together with the specification, serve to explain its principles.

FIG. 1 is a block diagram of an abnormal alarm system using the service anomaly detection and alerting method according to an embodiment of the present utility model.

FIG. 2 is a block diagram of a cloud device using the service anomaly detection and alerting method according to an embodiment of the present utility model.

FIG. 3 is a flowchart of the service anomaly detection and alerting method operating in the interpretation mode according to an embodiment of the present utility model.

FIG. 4 is a flowchart of the service anomaly detection and alerting method operating in the modeling mode according to an embodiment of the present utility model.

11: Cloud device

111: Data preprocessing unit

112: Individual parameter clustering unit

113: Individual grouping unit

114: Same-group comparison unit

115: Detection unit

116: Alarm unit

Claims

An apparatus for service anomaly detection and alerting, comprising a plurality of electrically connected hardware circuits, the hardware circuits being configured into a plurality of units, the units being used for: receiving operational data and system data of one of a plurality of monitored systems corresponding to a service, and performing data preprocessing on the operational data and the system data of the monitored system to obtain a plurality of state parameters of the monitored system; grouping each of the state parameters of the monitored system to obtain a cluster label corresponding to each of the state parameters of the monitored system; grouping the monitored system according to the cluster labels of the state parameters of the monitored system to obtain a group number corresponding to the monitored system; detecting whether the monitored system exhibits an abnormal behavior according to at least one of the state parameters of the monitored system; and, when detecting that the monitored system exhibits the abnormal behavior, determining whether more than a specific number of the monitored systems in a group corresponding to the group number of the monitored system also exhibit the abnormal behavior, and, if no more than the specific number of the monitored systems in the group corresponding to the group number also exhibit the abnormal behavior, generating an alarm.

The apparatus for service anomaly detection and alerting as claimed in claim 1, wherein the units are used to group each of the state parameters of the monitored system based on a model, and the model is obtained by training on the state parameters corresponding to each of the monitored systems.

The apparatus for service anomaly detection and alerting as claimed in claim 1, wherein the units are used to group the monitored system, based on a model, according to the cluster labels of the state parameters of the monitored system, and the model is obtained by training on the cluster labels of the state parameters of the monitored systems.

The apparatus for service anomaly detection and alerting as claimed in claim 1, wherein the units are used to determine, based on at least one model, whether the monitored system exhibits the abnormal behavior according to at least one of the state parameters of the monitored system, and the at least one model includes a risk prediction model.

The apparatus for service anomaly detection and alerting as claimed in claim 4, wherein the units are used to mark the state parameters corresponding to times when an abnormality occurred in the monitored system in the past with a risk value of 1, to mark the state parameters corresponding to times when no abnormality occurred in the monitored system in the past with a risk value of 0, and to establish the risk prediction model according to the risk values.

The apparatus for service anomaly detection and alerting as claimed in claim 5, wherein the units are used to calculate, based on the risk prediction model and according to the current state parameters of the monitored system, whether a predicted risk value exceeds a specific value, so as to detect whether the monitored system exhibits the abnormal behavior.

A device for detecting anomalies and issuing alarms, comprising: a data preprocessing unit, which receives operational data and system data of a monitored device corresponding to a service and performs data preprocessing on the operational data and system data of the monitored device to obtain a plurality of state parameters of the monitored device; a state parameter grouping unit, which groups each of the state parameters of the monitored device to obtain a cluster label corresponding to each of the state parameters of the monitored device; a grouping unit, which groups the monitored device according to the cluster labels of the state parameters of the monitored device to obtain a group number corresponding to the monitored device; a detection unit, which detects whether the monitored device exhibits an abnormal behavior according to at least one of the state parameters of the monitored device; a group comparison unit, which, when the monitored device is detected to exhibit the abnormal behavior, determines whether more than a specific number of monitored devices in a group corresponding to the group number of the monitored device are also detected to exhibit the abnormal behavior, and, if no more than the specific number of monitored devices in the group corresponding to the group number also exhibit the abnormal behavior, determines that the detected abnormal behavior of the monitored device is an abnormal event; and an alarm unit, which generates an abnormal alarm according to the abnormal event.

A device for judging the occurrence of an abnormal event in a monitored system and generating an abnormal alarm for the abnormal event, comprising a plurality of electrically connected hardware circuits, the hardware circuits being configured into a plurality of units, the units being used for: receiving a plurality of first monitoring data sets, each of the first monitoring data sets including a plurality of monitoring data, the monitoring data including a first monitoring parameter and a second monitoring parameter, the first monitoring parameter and the second monitoring parameter each including a plurality of monitoring data points, each of the data points including a monitoring parameter value, the first monitoring data sets respectively including monitoring data of different monitored systems, and the plurality of monitoring data of each of the first monitoring data sets including operational data and system data; using a first clustering model to group the monitoring parameter values of the first monitoring parameters of the first monitoring data sets and generate a plurality of first cluster labels, so that each of the first monitoring data sets corresponds to one of the first cluster labels, and using the first clustering model to group the monitoring parameter values of the second monitoring parameters of the first monitoring data sets and generate a plurality of second cluster labels, so that each of the first monitoring data sets corresponds to one of the second cluster labels; using a second clustering model to group the first cluster labels and the second cluster labels of the first monitoring data sets and generate a plurality of third cluster labels, so that each of the first monitoring data sets corresponds to one of the third cluster labels, wherein at least a plurality of the first monitoring data sets correspond to the same one of the third cluster labels to form a first monitoring group, the first monitoring group, by virtue of the corresponding plurality of first monitoring data sets, includes a corresponding plurality of monitored systems, and the plurality of monitored systems in the first monitoring group includes a first monitored system; receiving a plurality of second monitoring data sets from the first monitoring group, the second monitoring data sets being respectively received from the plurality of monitored systems in the first monitoring group, and each of the second monitoring data sets including the first monitoring parameter and the second monitoring parameter; based on the second monitoring data sets, using a time series algorithm to predict a plurality of first monitoring parameter predicted values and a plurality of second monitoring parameter predicted values respectively corresponding to the first monitoring parameter and the second monitoring parameter of the first monitoring group, so that each of the plurality of monitored systems in the first monitoring group corresponds to one first monitoring parameter predicted value and one second monitoring parameter predicted value; receiving, from the plurality of monitored systems in the first monitoring group, a plurality of first monitoring parameter observed values for the first monitoring parameter, the first monitoring parameter observed values respectively corresponding to the first monitoring parameter predicted values; comparing the first monitoring parameter observed value and the first monitoring parameter predicted value corresponding to each of the plurality of monitored systems in the first monitoring group, determining that the first monitoring parameter observed value corresponding to the first monitored system does not conform to its corresponding first monitoring parameter predicted value, and determining that, among the monitored systems in the first monitoring group other than the first monitored system, the number of monitored systems whose first monitoring parameter observed value does not conform to the corresponding first monitoring parameter predicted value is lower than a monitored system count threshold, thereby determining that an abnormal event has occurred in the first monitored system, wherein the monitored system count threshold is smaller than the number of monitored systems in the first monitoring group; and generating an abnormal alarm according to the judgment of the abnormal event.

A device for generating an abnormal alarm, comprising a plurality of electrically connected hardware circuits, the hardware circuits being configured into a plurality of units, the units being used for: receiving a first set of monitoring data of each of a plurality of monitored systems, each first set of monitoring data including monitored operational data and system data of the corresponding monitored system, the operational data and system data being classified into a plurality of state parameters; grouping the monitored systems using the state parameters of the first sets of monitoring data to generate a plurality of state parameter cluster labels, so that each of the monitored systems corresponds to one of the corresponding state parameter cluster labels; grouping the monitored systems using the state parameter cluster labels of the state parameters to generate a plurality of monitored system cluster labels, so that each of the monitored systems corresponds to one of the monitored system cluster labels, the monitored systems forming a plurality of monitored system clusters that respectively correspond to the monitored system cluster labels, the monitored system clusters including a first monitored system cluster, and the first monitored system cluster including a plurality of monitored systems corresponding to the same monitored system cluster label; for each of the state parameters, receiving a monitoring observed value for each of the plurality of monitored systems of the first monitored system cluster, generating, for each of the plurality of monitored systems of the first monitored system cluster, a monitoring data time series corresponding to each of the state parameters, and generating a monitoring predicted value according to each monitoring data time series, the monitoring predicted value corresponding to the monitoring observed value on a time axis; determining whether the monitoring observed value of each of the plurality of monitored systems in the first monitored system cluster conforms to its corresponding monitoring predicted value and, if not, determining that an abnormal behavior has occurred in the corresponding monitored system, and, when the number of monitored systems in the first monitored system cluster determined to have abnormal behavior is greater than one and less than or equal to an abnormality threshold, determining that an abnormal event has occurred in at least one monitored system with abnormal behavior, the abnormality threshold being less than the number of all monitored systems in the first monitored system cluster; and generating an abnormal alarm based on the at least one abnormal event, the abnormal alarm indicating the at least one monitored system in which the abnormal event occurred.

A device for generating an abnormal alarm, comprising a plurality of electrically connected hardware circuits, the hardware circuits being configured into a plurality of units, the units being used for: grouping a plurality of state parameters of a plurality of monitored systems and generating a plurality of first cluster label sets, each including a plurality of first cluster labels, so that each of the monitored systems is assigned one first cluster label in each of the first cluster label sets, the first cluster label sets respectively corresponding to the state parameters, and the plurality of state parameters including state parameters of the operational data and of the system data of the monitored systems; clustering the monitored systems by the plurality of first cluster labels of the first cluster label sets and generating a second cluster label set including a plurality of second cluster labels, so that each of the monitored systems is assigned one second cluster label in the second cluster label set, wherein a plurality of the monitored systems form a first monitored system group such that all monitored systems in the first monitored system group are assigned the same second cluster label, and the first monitored system group includes a target monitored system; forming, for the first monitored system group, a training data set for each monitored system in the group, each of the training data sets including the state parameters and further including an outlier parameter, the outlier parameter indicating an abnormal state of the corresponding monitored system; establishing, based on the training data sets, a first anomaly prediction model corresponding to each monitored system in the first monitored system group, each first anomaly prediction model being used to predict a first outlier parameter predicted value of the corresponding monitored system; forming an outlier parameter time series from the outlier parameters in each of the training data sets, and establishing, based on the outlier parameter time series, a second anomaly prediction model corresponding to each monitored system in the first monitored system group, each second anomaly prediction model being used to predict a second outlier parameter predicted value of the corresponding monitored system; for each monitored system in the first monitored system group, using the first outlier parameter predicted value predicted for a target time point by the corresponding one of the first anomaly prediction models and the second outlier parameter predicted value predicted for the target time point by the corresponding one of the second anomaly prediction models to judge whether the monitored system exhibits abnormal behavior; judging, according to the aforementioned abnormal behavior judgment, that the target monitored system exhibits abnormal behavior, and judging, according to the aforementioned abnormal behavior judgment, that fewer than a certain number of the monitored systems in the first monitored system group other than the target monitored system exhibit abnormal behavior, the certain number being lower than the number of monitored systems in the first monitored system group; and generating an abnormal alarm associated with the target monitored system.