WO2020259421A1 - Method and apparatus for monitoring service system - Google Patents

Method and apparatus for monitoring service system

Info

Publication number
WO2020259421A1
Authority
WO
WIPO (PCT)
Prior art keywords
time period
index data
business system
business
machine learning
Prior art date
Application number
PCT/CN2020/097249
Other languages
French (fr)
Chinese (zh)
Inventor
陈泽昊
邹高锋
Original Assignee
深圳前海微众银行股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2020259421A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]

Definitions

  • the present invention relates to the field of machine learning in Fintech, and in particular to a monitoring method and device of a business system.
  • in existing approaches, monitoring and alarm policies are warning thresholds that users pre-configure in the alarm tool. Such thresholds are usually set by operation and maintenance personnel based on historical experience, so their accuracy is low. In addition, the alarm process, from detecting an abnormality to confirming it and finally notifying the relevant operation and maintenance personnel, takes a long time. Therefore, in some cases, the alarm lags behind the abnormality.
  • once a business system becomes abnormal, the abnormality of its data is uncontrollable and may increase exponentially.
  • when the business system has only just become abnormal, its indicator data may not yet meet the alarm conditions.
  • by the time the operation and maintenance personnel receive the alarm notification, the abnormality of the business indicator is already very serious and its scope has spread rapidly. At this point the business has been damaged and the warning has lost its meaning.
  • This application provides a monitoring method and device for a business system to solve the problems of lag and low accuracy in business system alarms.
  • An embodiment of the present invention provides a monitoring method for a business system, including:
  • if the monitoring index data does not satisfy the direct alarm condition, inputting the monitoring index data into a pre-trained machine learning algorithm model, and using the machine learning algorithm model to determine the prediction result in the prediction time period;
  • comparing the prediction result with the predicted alarm condition to predict whether the business system will be abnormal in the prediction time period.
  • before inputting the monitoring index data into the pre-trained machine learning algorithm model and using the machine learning algorithm model to determine the prediction result in the prediction time period, the method further includes:
  • inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, to determine the model parameters of the machine learning algorithm model.
  • the predicted alarm condition is determined in the following manner:
  • the monitoring index data of the business system includes hardware index data of the business system
  • the obtaining the monitoring index data of the business system in a reference time period includes:
  • the using the machine learning algorithm model to determine the prediction result in the prediction time period includes:
  • the comparing the prediction result with the predicted alarm condition and predicting whether the business system is abnormal during the predicted time period includes:
  • the hardware failure prediction time period and the prediction accuracy rate are determined.
  • the monitoring index data of the business system includes the business index data of the business system
  • the obtaining the monitoring index data of the business system in a reference time period includes:
  • the using the machine learning algorithm model to determine the prediction result in the prediction time period includes:
  • the comparing the prediction result with the predicted alarm condition and predicting whether the business system is abnormal during the predicted time period includes:
  • the normal fluctuation range is determined by the machine learning algorithm model based on the fluctuation of the business index data in the historical time period.
  • the monitoring index data of the business system includes monitoring index data of multiple monitoring indexes
  • the method further includes:
  • the weight parameter corresponding to the monitoring index data is input into the machine learning algorithm model.
  • a monitoring device for a business system includes:
  • the obtaining unit is used to obtain the monitoring index data of the business system in the reference time period
  • the comparison unit is used to compare the monitoring index data with the direct alarm condition
  • the prediction unit is configured to input the monitoring index data into a pre-trained machine learning algorithm model if the monitoring index data does not meet the direct alarm condition, and use the machine learning algorithm model to determine the prediction result in the prediction time period;
  • the alarm unit is used to compare the prediction result with the predicted alarm condition and predict whether the business system is abnormal in the predicted time period.
  • a training unit is further included for:
  • inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, to determine the model parameters of the machine learning algorithm model.
  • a training unit is further included for:
  • the monitoring index data of the business system includes hardware index data of the business system
  • the acquiring unit is configured to acquire hardware index data of the business system in a first reference time period
  • the prediction unit is configured to determine the fluctuation of the hardware index data in the prediction time period
  • the alarm unit is configured to compare the fluctuation of the hardware index data with the failure condition, and determine whether the business system will have a hardware failure within the prediction time period; if the business system will have a hardware failure within the prediction time period, the hardware failure prediction time period and the prediction accuracy rate are determined.
  • the monitoring index data of the business system includes the business index data of the business system
  • the acquiring unit is configured to acquire business index data of the business system in a second reference time period
  • the prediction unit is configured to determine the fluctuation of the business index data in the prediction time period
  • the alarm unit is configured to compare the fluctuation of the business index data with the normal fluctuation range, and determine whether the business system will be abnormal in the prediction time period; if the business system will be abnormal in the prediction time period, the abnormality prediction time period is determined; the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of the business index data in the historical time period.
  • the monitoring index data of the business system includes monitoring index data of multiple monitoring indexes
  • the prediction unit is also used for:
  • the weight parameter corresponding to the monitoring index data is input into the machine learning algorithm model.
  • This application provides a computing device, which includes:
  • a processor, a memory, a transceiver, and a bus interface, wherein the processor, the memory and the transceiver are connected by a bus;
  • the processor is configured to read the program in the memory and execute the monitoring method of the business system described above;
  • the memory is used to store one or more executable programs, and can store data used by the processor when performing operations.
  • This application provides a non-transitory computer-readable storage medium in which instructions are stored in the computer storage medium, which when run on a computer, cause the computer to execute the above-mentioned monitoring method of the business system.
  • This application provides a computer program product containing instructions, which when running on a computer, enables the computer to execute the monitoring method of the above-mentioned business system.
  • the monitoring index data is compared with the direct alarm condition, and if the monitoring index data meets the direct alarm condition, the user is alerted directly. If the monitoring index data does not meet the direct alarm condition, it is input into the machine learning algorithm model, which determines the prediction result for the prediction time period. Because the prediction time period includes time after the current time point, the machine learning algorithm model can predict the operating status of the business system for a period of time in the future; comparing that status with the predicted alarm condition predicts whether an abnormality may occur. If an abnormality is predicted, the user is alerted.
  • the embodiment of the present invention uses a machine learning algorithm model to predict abnormalities that may occur in the future, so that business operation and maintenance personnel can prepare business disaster recovery in advance for upcoming abnormalities. This improves the availability of the business system, and the prediction accuracy is high.
  • FIG. 1 is a schematic structural diagram of a possible system architecture provided by an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a monitoring method for a business system provided by an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a monitoring device of a business system provided by an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.
  • a system architecture to which the embodiment of the present invention is applicable includes a business system 101, a monitoring platform 102, and a monitoring client 103.
  • the business system 101 and/or the monitoring platform 102 may be a network device such as a computer, an independent device, or a server cluster formed by multiple servers.
  • the business system 101 and/or the monitoring platform 102 can use cloud computing technology for information processing.
  • the monitoring client 103 is installed on the monitoring platform 102.
  • the monitoring platform 102 can be an electronic device with wireless communication functions, such as a mobile phone, a tablet computer or a dedicated handheld device, or it can be a wired access device connected to the Internet, such as a personal computer (PC), notebook computer, or server.
  • the monitoring platform 102 can communicate with the business system 101 over the Internet, or through a mobile communication system such as the Global System for Mobile Communications (GSM) or a long term evolution (LTE) system.
  • the monitoring client 103 can communicate with the monitoring platform 102 over the Internet, or through a mobile communication system such as the Global System for Mobile Communications (GSM) or a long term evolution (LTE) system.
  • the system architecture in the embodiment of the present invention is almost the same as that of a traditional monitoring platform. Users only need to configure the monitoring strategy for the business indicators they care about, so it is more user-friendly. Users do not need to know how fault prediction is implemented inside the monitoring platform, so there is no barrier to use.
  • the user in the embodiment of the present invention includes business system developers, business operation and maintenance personnel, and all relevant personnel who use the monitoring platform for business monitoring.
  • Intelligent monitoring platform: a tool responsible for monitoring and alerting on business systems. It monitors system business indicators and basic service indicators (such as server hardware health status, network connection status, etc.), and integrates the detected indicators through the machine learning algorithm model to predict possible failures and abnormalities in the future.
  • Alarm detection/prediction: also known as business system failure detection/prediction. The monitoring platform detects and predicts failures/abnormalities that may occur in the daily operation of the business system.
  • LSTM: Long Short-Term Memory.
  • Time series refers to the sequence of numbers of the same statistical indicator arranged in the order of the time of occurrence.
  • the main purpose of time series analysis is to predict the future based on existing historical data. Most of the economic data are given in the form of time series.
  • the time in the time series can be year, quarter, month or any other time format.
  • an embodiment of the present invention provides a monitoring method of a business system. As shown in FIG. 2, the monitoring method of a business system provided by the embodiment of the present invention includes the following steps:
  • Step 201 Obtain monitoring index data of the business system in a reference time period.
  • the formats of the collected data differ: different hardware records hardware data in different formats, and different service interfaces and different services may also use different data formats. The reported data therefore needs to be cleaned to unify the various data formats, so that the cleaned data can be used by the big data processing and machine learning algorithm modules for machine learning training, as well as for alarm matching and prediction.
  • because the monitoring indicators of server hardware and business interfaces differ, and each component and interface may have many dimensions of data, it is necessary to select data sources that are positively correlated with the monitoring indicators and eliminate interference items. Examples of such data sources are the SMART values of the hard disk and the health value of the motherboard.
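The cleaning step described above could be sketched as follows. This is a minimal illustration, not the patent's implementation: the unified record layout (`metric`, `timestamp`, `value`), the alias table, and the function names `clean_record` and `select_metrics` are all assumptions introduced here.

```python
from datetime import datetime, timezone

# Hypothetical aliases: each data source names the same quantity differently.
FIELD_ALIASES = {
    "ts": "timestamp", "time": "timestamp",
    "val": "value", "metric_value": "value",
    "name": "metric", "indicator": "metric",
}

def clean_record(raw):
    """Map one raw report onto a unified {metric, timestamp, value} record."""
    rec = {FIELD_ALIASES.get(key, key): val for key, val in raw.items()}
    ts = rec["timestamp"]
    if isinstance(ts, str):
        # ISO-8601 text -> epoch seconds; naive strings are assumed to be UTC.
        dt = datetime.fromisoformat(ts)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        ts = dt.timestamp()
    return {
        "metric": str(rec["metric"]),
        "timestamp": float(ts),
        "value": float(rec["value"]),
    }

def select_metrics(records, wanted):
    """Keep only the metrics chosen as relevant, dropping interference items."""
    return [r for r in records if r["metric"] in wanted]
```

For example, a hard-disk report `{"name": "disk_smart", "ts": "2020-01-01T00:00:00", "val": "97"}` and a fan report with epoch timestamps and different field names both end up in the same three-field layout, after which `select_metrics` keeps only the indicators chosen for monitoring.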
  • Step 202 Compare the monitoring index data with the direct alarm condition.
  • the cleaned monitoring index data is processed to determine whether it meets the direct alarm condition. If the direct alarm condition is met, the business system is currently abnormal and the user is alerted directly; if the direct alarm condition is not met, the reported monitoring index data is evaluated by the trained machine learning algorithm model to predict whether an abnormality may occur in a future time period.
  • the direct alarm conditions in the embodiment of the present invention can also be obtained by training the machine learning algorithm model, or determined by the system itself during daily iteration in the production environment. This reduces the time operation and maintenance personnel spend configuring direct alarm conditions, improves management efficiency, and avoids false alarms caused by manual configuration.
  • Step 203 If the monitoring index data does not meet the direct alarm condition, input the monitoring index data into a pre-trained machine learning algorithm model, and use the machine learning algorithm model to determine the prediction result in the prediction time period.
  • the machine learning algorithm model may include Convolutional Neural Networks (CNN), Support Vector Machine (SVM), K-Means clustering, and Logistic Regression.
  • Step 204 Compare the prediction result with the predicted alarm condition, and predict whether the business system is abnormal in the predicted time period.
  • the predicted alarm condition can be set by the operation and maintenance personnel based on experience, obtained through machine learning algorithm model training, or determined by the system during daily iteration in the production environment. If an abnormality is predicted, the user can be notified via email and/or SMS and/or phone call and/or WeChat.
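Steps 202 to 204 can be read as a two-stage check. The sketch below is only an illustration under simplifying assumptions: both alarm conditions are reduced to upper limits on a single indicator, and `naive_trend` is a placeholder predictor standing in for the trained machine learning model. The names `monitor`, `naive_trend`, `direct_limit` and `predicted_limit` are hypothetical.

```python
from typing import Callable, List, Sequence

def monitor(
    reference_data: Sequence[float],
    direct_limit: float,
    predict: Callable[[Sequence[float]], List[float]],
    predicted_limit: float,
) -> str:
    """Return 'direct_alarm', 'predicted_alarm' or 'normal'."""
    # Step 202: compare the latest observed value with the direct alarm condition.
    if reference_data[-1] > direct_limit:
        return "direct_alarm"
    # Step 203: otherwise ask the model for values in the prediction time period.
    forecast = predict(reference_data)
    # Step 204: compare the forecast with the predicted alarm condition.
    if any(value > predicted_limit for value in forecast):
        return "predicted_alarm"
    return "normal"

def naive_trend(data: Sequence[float], horizon: int = 3) -> List[float]:
    """Placeholder predictor: extend the last step-to-step change linearly."""
    step = data[-1] - data[-2]
    return [data[-1] + step * (i + 1) for i in range(horizon)]
```

With `monitor([50, 60, 70], 90, naive_trend, 85)` the current value is below the direct limit, but the extrapolated values 80, 90, 100 cross the predicted limit, so the call reports a predicted alarm before the direct condition is ever met.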
  • the prediction results for the future time period can be predicted from the monitoring index data, thereby indicating the operating status of the business system. The prediction results are then compared with the configured predicted alarm condition to determine whether the business system is likely to be abnormal in the future time period.
  • the LSTM algorithm model is trained on the training data in the historical time period. Before inputting the monitoring index data into the pre-trained machine learning algorithm model and using the machine learning algorithm model to determine the prediction result in the prediction time period, the method further includes:
  • inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, to determine the model parameters of the machine learning algorithm model.
  • the training data of the business system at each time point is used as the output parameter of the LSTM algorithm model.
  • the training data in the historical time period before the corresponding time point is used as the input parameters of the LSTM algorithm model.
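The input/output pairing described above (earlier values as input, the value at each time point as output) is commonly built with a sliding window. A minimal sketch follows, assuming a single indicator and a fixed window length; the function name `make_windows` and the `window` parameter are illustrative and do not come from the source.

```python
def make_windows(series, window):
    """Build supervised training pairs from one time series.

    For every time point t >= window, the `window` values before t form the
    model input and the value at t is the expected output.
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must be at least 1 and shorter than the series")
    inputs, targets = [], []
    for t in range(window, len(series)):
        inputs.append(list(series[t - window:t]))
        targets.append(series[t])
    return inputs, targets
```

For the series `[1, 2, 3, 4, 5]` with `window=3`, the inputs are `[1, 2, 3]` and `[2, 3, 4]` and the corresponding targets are `4` and `5`. These pairs are what a sequence model such as an LSTM would then be fitted on.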
  • the historical time period used in training and the reference time period used in prediction can be the same time period or different time periods. If they are different time periods, the two time periods may or may not overlap.
  • for example, the historical time period is the 1000 hours before the current time point, and the reference time period is the 999 hours before the current time point; or the historical time period is 9 am to 11 am every day from January to March 2018, and the reference time period is 9 am to 11 am every day from January to March 2019.
  • the selection of the historical time period and the reference time period is based on calculation requirements, and is not limited in the embodiment of the present invention.
  • the predicted alarm condition can also be obtained by training using the LSTM algorithm.
  • the predicted alarm condition is determined according to the following method:
  • the historical fault samples are the hardware index data collected when a hardware fault of the business system was confirmed. Inputting the historical fault samples into the LSTM algorithm model determines the fault model parameters of the business system hardware when it fails. The historical non-fault samples are the hardware index data collected during normal operation of the business system. Inputting the historical non-fault samples into the LSTM algorithm model determines the non-fault model parameters of the business system hardware during normal operation. Specific fault conditions can then be determined from the fault model parameters and the non-fault model parameters.
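The source does not say how the fault condition is derived from the two sets of model parameters. As a deliberately simplified stand-in (not the LSTM-based procedure described above), the sketch below summarises each sample set by its mean for one indicator and places the fault threshold midway between them. The function name `fault_condition` and the midpoint rule are assumptions made purely for illustration.

```python
from statistics import mean

def fault_condition(fault_samples, normal_samples):
    """Derive a one-indicator fault condition from labelled history.

    Simplification: each class is summarised by its mean, and the threshold
    is the midpoint between the two means.
    """
    fault_mean = mean(fault_samples)
    normal_mean = mean(normal_samples)
    threshold = (fault_mean + normal_mean) / 2
    fault_is_high = fault_mean > normal_mean

    def is_fault(value):
        # A value on the fault side of the threshold satisfies the condition.
        return value > threshold if fault_is_high else value < threshold

    return threshold, is_fault
```

With fault samples `[90, 95, 100]` and non-fault samples `[40, 50, 60]`, the threshold is 72.5, so a predicted value of 80 would satisfy the fault condition while 60 would not.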
  • the embodiment of the present invention respectively performs prediction and alarm for two different types of monitoring indexes.
  • the obtaining the monitoring index data of the business system in a reference time period includes:
  • the using the machine learning algorithm model to determine the prediction result in the prediction time period includes:
  • the comparing the prediction result with the predicted alarm condition and predicting whether the business system is abnormal during the predicted time period includes:
  • the hardware failure prediction time period and the prediction accuracy rate are determined.
  • each server has its own life cycle. The closer the time point is to the occurrence of the fault, the higher the accuracy of the prediction. Therefore, the first reference time period of the hardware index data should be as close as possible to the current time point.
  • Table 1 shows the failure prediction results of the hardware index data.
  • for example, the server hardware is predicted to possibly become abnormal within 45 days with a prediction accuracy rate of 78%, or within 60 days with a prediction accuracy rate of 80%.
  • the obtaining the monitoring index data of the business system in a reference time period includes:
  • the using the machine learning algorithm model to determine the prediction result in the prediction time period includes:
  • the comparing the prediction result with the predicted alarm condition and predicting whether the business system is abnormal during the predicted time period includes:
  • the normal fluctuation range is determined by the machine learning algorithm model based on the fluctuation of the business index data in the historical time period.
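The source leaves open how the model derives the normal fluctuation range from historical data. One common, simple choice is sketched below: a band of the historical mean plus or minus `k` population standard deviations. This is an assumed stand-in rather than the patent's method, and the names `normal_range`, `out_of_range` and `k` are hypothetical.

```python
from statistics import mean, pstdev

def normal_range(history, k=3.0):
    """Return (low, high): historical mean +/- k population standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - k * sigma, mu + k * sigma

def out_of_range(forecast, band):
    """Return the indices of predicted points that fall outside the band."""
    low, high = band
    return [i for i, value in enumerate(forecast) if value < low or value > high]
```

Given `band = normal_range([10, 12, 8, 10, 10], k=2)`, the band is centred on 10, and `out_of_range([9, 11, 15, 7], band)` flags the last two predicted points as abnormal. The position of a flagged point within the forecast gives the predicted time of the abnormality.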
  • the accuracy of the prediction also increases the longer the server hardware has been in production, or the longer the business indicators have been monitored.
  • the more monitoring indicator data is used for forecasting, the larger the sample and the more accurate the results. Therefore, the second reference time period of the business indicator data should be as long as possible.
  • the monitoring index data of the business system includes monitoring index data of multiple monitoring indexes
  • the method further includes:
  • the weight parameter corresponding to the monitoring index data is input into the machine learning algorithm model.
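How the weight parameters enter the model is not specified in the source. One plausible reading, sketched here as an assumption, is that each indicator value is scaled by its configured weight before the values are passed to the model as a feature vector. The function name `weighted_features` and the dictionary layout are hypothetical.

```python
def weighted_features(indicators, weights):
    """Scale each indicator by its weight and return a feature vector.

    `indicators` and `weights` are dicts keyed by indicator name; the vector
    is ordered by name so the layout is stable between calls.
    """
    missing = set(indicators) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    return [indicators[name] * weights[name] for name in sorted(indicators)]
```

For instance, `weighted_features({"cpu": 0.5, "error_rate": 0.25}, {"cpu": 1.0, "error_rate": 4.0})` yields `[0.5, 1.0]`, giving the error rate more influence on the model input than its raw value would.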
  • the LSTM model predicts the monitoring index values for a certain time period in the future. If, at the next monitoring time point, the monitoring index data is found to be outside the predicted normal fluctuation range, an alarm is sent to the business operation and maintenance personnel/developers.
  • business operation and maintenance personnel can also prepare in advance according to the fluctuations of the monitoring indicators predicted by the monitoring platform, to avoid business impact. For example, before holidays or before new business activities go online, the monitoring platform predicts the increase in daily access traffic that the business may see, so that business operation and maintenance personnel can expand system capacity in advance and avoid the business system becoming unavailable due to insufficient performance.
  • Step 300 Train the LSTM algorithm model to obtain model parameters.
  • Step 301 Obtain monitoring index data.
  • Step 302 Data preprocessing.
  • Step 303 Obtain the predicted alarm condition. Historical fault sample data and historical non-fault sample data are used separately for model training to obtain fault model parameters and non-fault model parameters, and the predicted alarm condition is then determined from the fault model parameters and the non-fault model parameters.
  • Step 304 Input the monitoring index data and weight parameters into the trained LSTM algorithm model, and use the LSTM algorithm model to calculate the prediction result of the prediction time period.
  • the LSTM algorithm model is used to predict a complete sequence: the window is initialized once with the first part of the training data, and then the sliding window is moved forward step by step, predicting the next point each time as in point-by-point prediction.
  • the LSTM algorithm model makes predictions from its own predicted data: in the second prediction, one data point (the last point) in the data used by the model comes from the previous prediction; in the third prediction, two points in the data come from previous predictions, and so on.
  • by the time of the 99th prediction, all the data in the test set has been predicted. This greatly extends the length of the time series that the algorithm model can predict.
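The sliding-window procedure above, where each prediction is fed back in as input, can be sketched independently of the model. Here `predict_next` stands in for the trained one-step LSTM predictor; it and the function name `recursive_forecast` are assumptions for illustration only.

```python
def recursive_forecast(history, window, steps, predict_next):
    """Predict `steps` future points, feeding each prediction back as input.

    `predict_next` maps a list of `window` values to the next value.
    """
    if len(history) < window:
        raise ValueError("history is shorter than the window")
    buffer = list(history[-window:])  # window initialised once from real data
    forecast = []
    for _ in range(steps):
        nxt = predict_next(buffer)
        forecast.append(nxt)
        # Slide the window: drop the oldest point, append the new prediction.
        buffer = buffer[1:] + [nxt]
    return forecast
```

Using a toy predictor that continues the last difference, `recursive_forecast([1, 2, 3, 4], window=2, steps=3, predict_next=lambda w: w[-1] + (w[-1] - w[-2]))` returns `[5, 6, 7]`. After `window` steps the buffer consists entirely of predicted values, which is why errors can accumulate over long horizons.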
  • Step 305 Compare the predicted result with the predicted alarm condition, determine whether the business system will be abnormal in the predicted time period, and display the predicted result to the user.
  • the embodiment of the present invention also provides a monitoring device for a business system, as shown in FIG. 3, including:
  • the obtaining unit 31 is configured to obtain monitoring index data of the business system in the reference time period
  • the comparing unit 32 is configured to compare the monitoring index data with the direct alarm condition
  • the prediction unit 33 is configured to input the monitoring index data into a pre-trained machine learning algorithm model if the monitoring index data does not meet the direct alarm condition, and use the machine learning algorithm model to determine the prediction result in the prediction time period;
  • the alarm unit 34 is configured to compare the prediction result with the predicted alarm condition, and predict whether the business system is abnormal in the predicted time period.
  • inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, to determine the model parameters of the machine learning algorithm model.
  • a training unit 35 for:
  • the monitoring index data of the business system includes hardware index data of the business system
  • the acquiring unit 31 is configured to acquire hardware index data of the business system in the first reference time period
  • the prediction unit 33 is configured to determine the fluctuation of the hardware indicator data in the prediction time period
  • the alarm unit 34 is configured to compare the fluctuation of the hardware index data with the failure condition, and determine whether the business system will have a hardware failure within the prediction time period; if the business system will have a hardware failure within the prediction time period, the hardware failure prediction time period and the prediction accuracy rate are determined.
  • the monitoring index data of the business system includes the business index data of the business system
  • the acquiring unit 31 is configured to acquire business index data of the business system in a second reference time period
  • the prediction unit 33 is configured to determine the fluctuation of the business index data in the prediction time period
  • the alarm unit 34 is configured to compare the fluctuation of the business index data with the normal fluctuation range, and determine whether the business system will be abnormal in the prediction time period; if the business system will be abnormal in the prediction time period, the abnormality prediction time period is determined; the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of the business index data in the historical time period.
  • the monitoring index data of the business system includes monitoring index data of multiple monitoring indexes
  • the prediction unit 33 is further configured to:
  • the weight parameter corresponding to the monitoring index data is input into the machine learning algorithm model.
  • the computing device includes:
  • the processor 401 is configured to read a program in the memory 402, and execute the foregoing monitoring method of the business system;
  • the processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It can also be a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the memory 402 is configured to store one or more executable programs, and can store data used by the processor 401 when performing operations.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 402 may include volatile memory, such as random-access memory (RAM); the memory 402 may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 402 may also include a combination of the foregoing types of memory.
  • the memory 402 stores the following elements, executable modules or data structures, or their subsets, or their extended sets:
  • Operating instructions including various operating instructions, used to implement various operations.
  • Operating system including various system programs, used to implement various basic services and process hardware-based tasks.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the bus interface 404 may be a wired communication access port, a wireless bus interface or a combination thereof, where the wired bus interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface or a combination thereof.
  • the wireless bus interface may be a WLAN interface.
  • the embodiments of the present application also provide a non-transitory computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the foregoing monitoring method of the business system.
  • the embodiments of the present application provide a computer program product containing instructions, which when running on a computer, cause the computer to execute the above-mentioned monitoring method of the business system.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Testing And Monitoring For Control Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present invention relate to the field of machine learning, and in particular to a method and apparatus for monitoring a service system, which are used for solving the problem of lag and low accuracy in a service system alarm. The embodiment of the present invention comprises: acquiring monitoring index data of a service system within a reference time period; comparing the monitoring index data with a direct alarm condition; if the monitoring index data does not meet the direct alarm condition, inputting the monitoring index data into a machine learning algorithm model trained in advance, and determining, using the machine learning algorithm model, a prediction result within a prediction time period; and comparing the prediction result with a predicted alarm condition to predict whether the service system is abnormal within the prediction time period.

Description

Method and apparatus for monitoring a business system
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201910580570.5, filed with the Chinese Patent Office on June 28, 2019 and entitled "A monitoring method and device for a business system", the entire content of which is incorporated herein by reference.
Technical field
The present invention relates to the field of machine learning in financial technology (Fintech), and in particular to a method and apparatus for monitoring a business system.
Background
With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, and so on) are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). The security and real-time requirements of the financial industry, however, place higher demands on these technologies. On a traditional business system monitoring platform, users configure alarm policies according to their needs. When a business system goes online and requires routine monitoring, business operation and maintenance or development personnel first identify the key points of the business system, define alarm policy conditions for them, and configure the corresponding monitoring and alarm policies on the monitoring platform. The monitoring platform then scans and probes the configured key points, obtains the corresponding probe indicators, and matches them against the user-configured monitoring and alarm policies (that is, checks whether an alarm condition is met). If a configured alarm condition is met, an alarm notification is sent to the user.
In the prior art, monitoring and alarm policies are warning thresholds that users preconfigure in the alarm tool. Such thresholds are usually configured by operation and maintenance or development personnel on the basis of past experience, so their accuracy is low. Moreover, the alarm process, from detecting an abnormality, to confirming that it has occurred, to finally notifying the relevant operation and maintenance personnel, takes a long time. In some cases the alarm therefore lags behind the abnormality. When a business system becomes abnormal, the abnormality of its data is uncontrollable and may grow exponentially. It can happen that the indicator data does not yet meet the alarm condition when the abnormality first appears, and by the time operation and maintenance personnel receive the alarm notification, the business indicators are already severely abnormal and the impact has spread rapidly. At that point the business has already been damaged and the alarm has lost its purpose.
Summary of the invention
This application provides a method and apparatus for monitoring a business system, to solve the problem that business system alarms lag and have low accuracy.
A method for monitoring a business system provided by an embodiment of the present invention includes:
acquiring monitoring index data of the business system in a reference time period;
comparing the monitoring index data with a direct alarm condition;
if the monitoring index data does not meet the direct alarm condition, inputting the monitoring index data into a pre-trained machine learning algorithm model, and using the machine learning algorithm model to determine a prediction result for a prediction time period;
comparing the prediction result with a predicted alarm condition to predict whether the business system will be abnormal in the prediction time period.
In an optional embodiment, before the inputting of the monitoring index data into the pre-trained machine learning algorithm model and the using of the machine learning algorithm model to determine the prediction result for the prediction time period, the method further includes:
acquiring training data of the business system in a historical time period;
inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, and determining model parameters of the machine learning algorithm model.
In an optional embodiment, the predicted alarm condition is determined as follows:
inputting historical fault sample data of the business system into the machine learning algorithm model for training, and determining fault model parameters;
inputting historical non-fault sample data of the business system into the machine learning algorithm model for training, and determining non-fault model parameters;
determining a fault condition according to the fault model parameters and the non-fault model parameters.
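The source does not say how the fault condition is derived from the two parameter sets. As a purely illustrative sketch, the code below assumes that each "model parameter" is simply the mean fluctuation of its sample set, that fault samples fluctuate more than non-fault samples, and that the fault condition is a threshold halfway between the two means. All function names are hypothetical.

```python
from statistics import mean


def fit_mean_fluctuation(samples):
    """Stand-in for training: summarize a sample set by its mean fluctuation."""
    return mean(samples)


def derive_fault_condition(fault_samples, non_fault_samples):
    """Return a threshold halfway between the two fitted parameters (assumed rule)."""
    fault_param = fit_mean_fluctuation(fault_samples)
    non_fault_param = fit_mean_fluctuation(non_fault_samples)
    return (fault_param + non_fault_param) / 2.0


def meets_fault_condition(predicted_fluctuation, threshold):
    """A predicted fluctuation at or above the threshold counts as a fault."""
    return predicted_fluctuation >= threshold


threshold = derive_fault_condition([8.0, 10.0, 12.0], [1.0, 2.0, 3.0])
```

A trained classifier would normally replace the midpoint rule; the point is only that the condition is computed from both sample sets rather than typed in by hand.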
In an optional embodiment, the monitoring index data of the business system includes hardware index data of the business system;
for the hardware index data of the business system, the acquiring of the monitoring index data of the business system in the reference time period includes:
acquiring the hardware index data of the business system in a first reference time period;
the using of the machine learning algorithm model to determine the prediction result for the prediction time period includes:
determining the fluctuation of the hardware index data in the prediction time period;
the comparing of the prediction result with the predicted alarm condition to predict whether the business system will be abnormal in the prediction time period includes:
comparing the fluctuation of the hardware index data with the fault condition, and determining whether a hardware fault will occur in the business system in the prediction time period;
if a hardware fault will occur in the business system in the prediction time period, determining a hardware fault prediction time period and a prediction accuracy rate.
In an optional embodiment, the monitoring index data of the business system includes business index data of the business system;
for the business index data of the business system, the acquiring of the monitoring index data of the business system in the reference time period includes:
acquiring the business index data of the business system in a second reference time period;
the using of the machine learning algorithm model to determine the prediction result for the prediction time period includes:
determining the fluctuation of the business index data in the prediction time period;
the comparing of the prediction result with the predicted alarm condition to predict whether the business system will be abnormal in the prediction time period includes:
comparing the fluctuation of the business index data with a normal fluctuation range, and determining whether the business system will be abnormal in the prediction time period;
if the business will be abnormal in the prediction time period, determining an abnormality prediction time period; the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of the business index data in the historical time period.
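To make the comparison against a normal fluctuation range concrete, the sketch below derives the range from historical values as mean plus or minus k standard deviations and reports the index intervals of the predicted series that fall outside it. The mean/standard-deviation rule, the value of k, and the function names are assumptions; the source only says that the model determines the range from historical fluctuation.

```python
from statistics import mean, stdev


def normal_fluctuation_range(history, k=3.0):
    """Assumed rule: the normal range is mean +/- k sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def abnormal_periods(predicted, low, high):
    """Return (start, end) index pairs of consecutive predicted points outside [low, high]."""
    periods = []
    start = None
    for i, value in enumerate(predicted):
        outside = value < low or value > high
        if outside and start is None:
            start = i
        elif not outside and start is not None:
            periods.append((start, i - 1))
            start = None
    if start is not None:
        periods.append((start, len(predicted) - 1))
    return periods


low, high = normal_fluctuation_range([10, 11, 9, 10, 10, 11, 9, 10])
periods = abnormal_periods([10, 11, 15, 16, 10, 5], low, high)
```

Each returned pair plays the role of an abnormality prediction time period, expressed here as positions in the predicted series rather than as timestamps.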
In an optional embodiment, the monitoring index data of the business system includes monitoring index data of multiple monitoring indexes;
before the using of the machine learning algorithm model to determine the prediction result for the prediction time period, the method further includes:
determining a weight parameter for each monitoring index;
inputting the weight parameters corresponding to the monitoring index data into the machine learning algorithm model.
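The source does not specify how the weight parameters enter the model. One simple possibility, shown below as an assumption, is to scale each monitoring index value by its weight before the values are passed to the model as a feature vector. The index names and weights are made up for illustration.

```python
def apply_weights(index_values, weights):
    """Scale each monitoring index value by its weight and return an ordered feature vector.

    Both arguments map a monitoring index name to a number. Every index
    must have a weight; features are ordered by index name so the vector
    layout is stable between calls.
    """
    missing = set(index_values) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    return [index_values[name] * weights[name] for name in sorted(index_values)]


# Hypothetical indexes: CPU utilization counts double, disk utilization half.
features = apply_weights({"cpu": 0.5, "disk": 0.8}, {"cpu": 2.0, "disk": 0.5})
```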
An apparatus for monitoring a business system includes:
an acquiring unit, configured to acquire monitoring index data of the business system in a reference time period;
a comparison unit, configured to compare the monitoring index data with a direct alarm condition;
a prediction unit, configured to, if the monitoring index data does not meet the direct alarm condition, input the monitoring index data into a pre-trained machine learning algorithm model and use the machine learning algorithm model to determine a prediction result for a prediction time period;
an alarm unit, configured to compare the prediction result with a predicted alarm condition and predict whether the business system will be abnormal in the prediction time period.
In an optional embodiment, the apparatus further includes a training unit, configured to:
acquire training data of the business system in a historical time period;
input the training data of the business system in the historical time period into the machine learning algorithm model as parameters, and determine model parameters of the machine learning algorithm model.
In an optional embodiment, the apparatus further includes a training unit, configured to:
input historical fault sample data of the business system in the historical time period into the machine learning algorithm model for training, and determine fault model parameters;
input historical non-fault sample data of the business system in the historical time period into the machine learning algorithm model for training, and determine non-fault model parameters;
determine a fault condition according to the fault model parameters and the non-fault model parameters.
In an optional embodiment, the monitoring index data of the business system includes hardware index data of the business system;
for the hardware index data of the business system,
the acquiring unit is configured to acquire the hardware index data of the business system in a first reference time period;
the prediction unit is configured to determine the fluctuation of the hardware index data in the prediction time period;
the alarm unit is configured to compare the fluctuation of the hardware index data with the fault condition and determine whether a hardware fault will occur in the business system in the prediction time period; if a hardware fault will occur in the business system in the prediction time period, a hardware fault prediction time period and a prediction accuracy rate are determined.
In an optional embodiment, the monitoring index data of the business system includes business index data of the business system;
for the business index data of the business system,
the acquiring unit is configured to acquire the business index data of the business system in a second reference time period;
the prediction unit is configured to determine the fluctuation of the business index data in the prediction time period;
the alarm unit is configured to compare the fluctuation of the business index data with a normal fluctuation range and determine whether the business system will be abnormal in the prediction time period; if the business will be abnormal in the prediction time period, an abnormality prediction time period is determined; the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of the business index data in the historical time period.
In an optional embodiment, the monitoring index data of the business system includes monitoring index data of multiple monitoring indexes;
the prediction unit is further configured to:
determine a weight parameter for each monitoring index;
input the weight parameters corresponding to the monitoring index data into the machine learning algorithm model.
This application provides a computing device, which includes:
a processor, a memory, a transceiver, and a bus interface, where the processor, the memory, and the transceiver are connected by a bus;
the processor is configured to read a program in the memory and execute the above monitoring method of the business system;
the memory is configured to store one or more executable programs and can store data used by the processor when performing operations.
This application provides a non-transitory computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the above monitoring method of the business system.
This application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the above monitoring method of the business system.
In the embodiments of the present invention, monitoring index data of the business system in a reference time period is acquired and first compared with the direct alarm condition. If the monitoring index data meets the direct alarm condition, an alarm is sent to the user directly. If the monitoring index data does not meet the direct alarm condition, it is input into the machine learning algorithm model, which determines a prediction result for the prediction time period. Because the prediction time period includes a time period after the current time point, the machine learning algorithm model can predict the operating status of the business system for a period of time in the future. The predicted operating status is compared with the predicted alarm condition to predict whether an abnormality may occur, and if so, an alarm is sent to the user. The embodiments of the present invention thus use a machine learning algorithm model to predict abnormalities that may occur in the near future, so that business operation and maintenance personnel can prepare business disaster recovery in advance for an impending abnormality, which improves the availability of the business system, and the prediction accuracy is relatively high.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a possible system architecture provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for monitoring a business system provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for monitoring a business system provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, a system architecture to which the embodiments of the present invention are applicable includes a business system 101, a monitoring platform 102, and a monitoring client 103. The business system 101 and/or the monitoring platform 102 may be a network device such as a computer, and may be a standalone device or a server cluster formed by multiple servers. Preferably, the business system 101 and/or the monitoring platform 102 may use cloud computing technology for information processing.
The monitoring client 103 is installed on the monitoring platform 102. The monitoring platform 102 may be an electronic device with a wireless communication function, such as a mobile phone, a tablet computer, or a dedicated handheld device, or may be a device that connects to the Internet through wired access, such as a personal computer (PC), a notebook computer, or a server.
The monitoring platform 102 may communicate with the business system 101 over the Internet, or over a mobile communication system such as the Global System for Mobile Communications (GSM) or a Long Term Evolution (LTE) system. The monitoring client 103 may communicate with the monitoring platform 102 over the Internet, or over a mobile communication system such as GSM or LTE.
From the user's point of view, the system architecture in the embodiments of the present invention differs little from a traditional monitoring platform. Users only need to configure monitoring policies for the business indicators they care about, which makes the platform more user-friendly: users do not need to know how the monitoring platform implements fault prediction internally, so there is no barrier to use.
For ease of understanding, terms that may be involved in the embodiments of the present invention are defined and explained below.
User: the users in the embodiments of the present invention include business system developers, business operation and maintenance personnel, and all other personnel who use the monitoring platform for business monitoring.
Intelligent monitoring platform: a tool responsible for monitoring business systems and raising alarms. It monitors the business indicators of a system and the indicators of basic services (such as server hardware health and network connectivity), integrates the probed indicators through the machine learning algorithm model, and predicts faults and abnormalities that may occur in the future.
Alarm detection/prediction: also called business system fault detection/prediction; the monitoring platform detects and predicts faults/abnormalities that may occur during the daily operation of the business system.
Long Short-Term Memory (LSTM): a recurrent neural network algorithm used in machine learning.
Time series: a sequence formed by arranging the values of the same statistical indicator in the chronological order in which they occurred. The main purpose of time series analysis is to predict the future from existing historical data. Most economic data is given in the form of time series. Depending on the observation interval, the time in a time series may be a year, a quarter, a month, or any other unit of time.
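A common way to turn a time series into training samples for a forecasting model is a sliding window: each sample pairs a run of past values (a reference period) with the values that immediately follow it (a prediction period). The sketch below illustrates that idea; the function name and window lengths are assumptions, not taken from the source.

```python
def make_windows(series, reference_len, prediction_len):
    """Split a time series into (reference window, prediction window) pairs.

    Windows advance one step at a time. A series shorter than
    reference_len + prediction_len yields no samples.
    """
    samples = []
    last_start = len(series) - reference_len - prediction_len
    for start in range(last_start + 1):
        split = start + reference_len
        samples.append((series[start:split], series[split:split + prediction_len]))
    return samples


samples = make_windows([1, 2, 3, 4, 5, 6], reference_len=3, prediction_len=2)
```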
To predict the position data of nodes and improve prediction accuracy, an embodiment of the present invention provides a method for monitoring a business system. As shown in FIG. 2, the method includes the following steps:
Step 201: Acquire monitoring index data of the business system in a reference time period.
Because servers come from different vendors, the formats of the collected data differ: different hardware records hardware data in different formats, and different business interfaces and different businesses may also use different data formats. The reported data therefore needs to be cleaned so that the various data formats are unified and the cleaned data can be used by the big data processing and machine learning algorithm modules for machine learning training, as well as for alarm matching and prediction.
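The cleaning step can be pictured as mapping each vendor-specific record onto one common schema. The two input formats below (field names, units, timestamp encodings) are invented for illustration, as is the unified schema; the source does not describe any concrete format.

```python
from datetime import datetime


def normalize_record(raw):
    """Convert a vendor-specific record into {"host", "timestamp", "cpu_ratio"}.

    timestamp is epoch seconds (int); cpu_ratio is a float in [0, 1].
    """
    if "cpu_pct" in raw:
        # Hypothetical vendor A: CPU as a percentage, time as epoch seconds.
        return {
            "host": raw["host"],
            "timestamp": int(raw["ts"]),
            "cpu_ratio": float(raw["cpu_pct"]) / 100.0,
        }
    if "cpu_usage" in raw:
        # Hypothetical vendor B: CPU as a ratio string, time as ISO 8601 with offset.
        parsed = datetime.fromisoformat(raw["timestamp"])
        return {
            "host": raw["hostname"],
            "timestamp": int(parsed.timestamp()),
            "cpu_ratio": float(raw["cpu_usage"]),
        }
    raise ValueError("unrecognized record format")


record_a = normalize_record({"host": "srv-1", "ts": 1561680000, "cpu_pct": 50.0})
record_b = normalize_record(
    {"hostname": "srv-2", "timestamp": "2019-06-28T00:00:00+00:00", "cpu_usage": "0.73"}
)
```

Rejecting unknown formats with an error, rather than passing them through, keeps malformed records out of both training and alarm matching.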
In addition, because the monitoring indicators of server hardware and of business interfaces differ, and each component and interface may produce data in many dimensions, it is necessary to select data sources that are positively correlated with the monitoring indicators and to exclude interfering items. Examples of such data sources include the SMART values of a hard disk and the Health value of a motherboard.
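One simple way to select positively correlated data sources is to compute the Pearson correlation between each candidate series and the monitored indicator and keep only the candidates above a cutoff. The sketch below assumes this approach; the cutoff of 0.5, the candidate names, and the sample values are all illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_sources(candidates, target, min_corr=0.5):
    """Keep the candidate series whose correlation with the target is at least min_corr."""
    return [name for name, values in candidates.items()
            if pearson(values, target) >= min_corr]


# Hypothetical data: a fault-count indicator and three candidate sources.
fault_count = [0, 0, 1, 1, 2]
selected = select_sources(
    {
        "smart_reallocated": [1, 2, 5, 6, 9],  # rises with the indicator
        "fan_noise": [3, 1, 3, 1, 3],          # weakly related
        "constant": [4, 4, 4, 4, 4],           # carries no information
    },
    fault_count,
)
```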
Step 202: Compare the monitoring index data with the direct alarm condition.
The cleaned monitoring index data is processed logically to determine whether it meets the direct alarm condition. If the direct alarm condition is met, the business system is already abnormal and an alarm is sent to the user directly. If the direct alarm condition is not met, the reported monitoring index data is fed to the trained machine learning algorithm model, which predicts whether an abnormality may occur in a future time period.
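The two-stage decision described here can be sketched as a small function that takes the conditions and the predictor as callables. The linear extrapolation used as the predictor is only a stand-in for the trained model, and the threshold of 90 is an arbitrary example; both are assumptions.

```python
def check_business_system(index_data, direct_condition, predict, predicted_condition):
    """Return "direct_alarm", "predicted_alarm", or "ok" for one batch of index data."""
    if direct_condition(index_data):
        # The system is already abnormal: alert immediately, no prediction needed.
        return "direct_alarm"
    prediction = predict(index_data)
    if predicted_condition(prediction):
        # The system is currently normal but is expected to become abnormal.
        return "predicted_alarm"
    return "ok"


def extrapolate(series, steps=3):
    """Stand-in predictor: continue the most recent slope for a few steps."""
    slope = series[-1] - series[-2]
    return [series[-1] + slope * (i + 1) for i in range(steps)]


def over_90(values):
    """Example condition: any value above 90 triggers."""
    return any(v > 90 for v in values)


status_now = check_business_system([95, 96], over_90, extrapolate, over_90)
status_soon = check_business_system([70, 80], over_90, extrapolate, over_90)
status_ok = check_business_system([50, 50], over_90, extrapolate, over_90)
```

Running the cheap direct check first means the model is only consulted when the current data looks normal.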
The direct alarm condition in the embodiments of the present invention may also be determined by the system itself, through training of the machine learning algorithm model and through daily iteration in the production environment. This reduces the time operation and maintenance personnel spend configuring direct alarm conditions, improves management efficiency, and avoids false alarms caused by manual configuration.
Step 203: If the monitoring index data does not meet the direct alarm condition, input the monitoring index data into the pre-trained machine learning algorithm model, and use the machine learning algorithm model to determine a prediction result for the prediction time period.
The machine learning algorithm model may include a convolutional neural network (CNN), a support vector machine (SVM), K-Means clustering, logistic regression, and so on. Considering the balance between training cost (computation time and the size of the server cluster required) and prediction results, the embodiments of the present invention preferably use a Long Short-Term Memory (LSTM) neural network algorithm for prediction.
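The source names LSTM as the preferred model but gives no architecture. To show how an LSTM consumes a window of index values, the sketch below implements the standard LSTM cell equations with a single scalar input and a single scalar hidden state. The weights are untrained placeholders, so the output is not a meaningful forecast; a real system would use a trained multi-unit network from a deep learning library.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p holds input, recurrent, and bias weights per gate."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])  # output gate
    g = tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])     # candidate cell value
    c = f * c_prev + i * g                                 # new cell state
    h = o * tanh(c)                                        # new hidden state
    return h, c


def encode_window(window, p):
    """Feed a reference window through the cell and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in window:
        h, c = lstm_step(x, h, c, p)
    return h


GATE_KEYS = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
PLACEHOLDER_PARAMS = {key: 0.5 for key in GATE_KEYS}  # untrained, for illustration only

hidden = encode_window([0.2, 0.4, 0.6], PLACEHOLDER_PARAMS)
```

In a full model the final hidden state would be passed through an output layer to produce the predicted values for the prediction time period.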
步骤204、将所述预测结果与预计告警条件相对比,预测所述业务系统在预测时间段内是否出现异常。Step 204: Compare the prediction result with the predicted alarm condition, and predict whether the business system is abnormal in the predicted time period.
具体实施过程中,预计告警条件可以为运维人员根据经验确定,也可以通过机器学习算法模型训练得出,或者在生产环境中的日常迭代过程中由系统自行判断。若出现异常,可以通过邮件和/或短信和/或电话和/或微信等方式通知给用户。In the specific implementation process, the expected alarm conditions can be determined by the operation and maintenance personnel based on experience, can also be obtained through machine learning algorithm model training, or be judged by the system during the daily iteration process in the production environment. If an abnormality occurs, it can be notified to the user via email and/or SMS and/or phone call and/or WeChat.
本发明实施例中,获取参考时间段内业务系统的监控指标数据,首先将监控指标数据与直接告警条件相对比,若监控指标数据满足直接告警条件,则直接向用户告警。若监控指标数据不满足直接告警条件,则将监控指标数据输入机器学习算法模型中,利用机器学习算法模型确定预测时间段内的预测结果。由于预测时间段包括当前时间点之后的时间段,即机器学习算法模型可以预测业务系统未来一段时间的运行状况,并将运行状况与预计告警条件相对比,从而对是否可能出现异常进行预测,若有可能出现异常则向用户进行告警。本发明实施例利用机器学习算法模型对未来即将可能产生的异常进行预测,使得业务运维人员能够为即将发生的异常提前做好业务容灾准备,提高业务系统的可用性,且预测准确率较高。In the embodiment of the present invention, to obtain the monitoring index data of the business system in the reference time period, firstly, the monitoring index data is compared with the direct alarm condition, and if the monitoring index data meets the direct alarm condition, the user is directly alerted. If the monitoring index data does not meet the direct warning conditions, the monitoring index data is input into the machine learning algorithm model, and the machine learning algorithm model is used to determine the prediction result within the prediction time period. Since the predicted time period includes the time period after the current time point, that is, the machine learning algorithm model can predict the operation status of the business system for a period of time in the future, and compare the operation status with the expected alarm conditions, so as to predict whether abnormalities may occur. If an abnormality occurs, the user will be alerted. The embodiment of the present invention uses a machine learning algorithm model to predict future abnormalities that may occur, so that business operation and maintenance personnel can prepare for business disaster recovery in advance for upcoming abnormalities, improve the availability of the business system, and have high prediction accuracy. .
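The two-stage flow of steps 201 to 204 can be sketched as follows. This is a minimal illustration only: the function name `monitor`, the callback parameters, the CPU-usage figures, and the naive trend predictor are all assumptions standing in for the direct alarm condition, the trained model, and the predicted alarm condition described above.

```python
from typing import Callable, Sequence


def monitor(
    metrics: Sequence[float],
    direct_alarm: Callable[[Sequence[float]], bool],
    predict: Callable[[Sequence[float]], Sequence[float]],
    predicted_alarm: Callable[[Sequence[float]], bool],
) -> str:
    """Return 'direct_alarm', 'predicted_alarm', or 'ok' for one monitoring pass."""
    # Stage 1: compare the reference-period data with the direct alarm condition.
    if direct_alarm(metrics):
        return "direct_alarm"
    # Stage 2: forecast the prediction period, then compare the forecast
    # with the predicted alarm condition.
    forecast = predict(metrics)
    if predicted_alarm(forecast):
        return "predicted_alarm"
    return "ok"


def naive_predict(history: Sequence[float]) -> list:
    """Hypothetical stand-in for the trained model: continue the last trend for 3 points."""
    step = history[-1] - history[-2]
    return [history[-1] + step * (i + 1) for i in range(3)]


# Hypothetical CPU usage (percent) over the reference time period.
cpu = [50.0, 60.0, 70.0]
status = monitor(
    cpu,
    direct_alarm=lambda m: m[-1] >= 95.0,      # assumed direct alarm condition
    predict=naive_predict,
    predicted_alarm=lambda f: max(f) >= 90.0,  # assumed predicted alarm condition
)
```

Here the current readings do not trigger a direct alarm, but the forecast does cross the assumed predicted alarm threshold, so `status` is `"predicted_alarm"`.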
Because the monitoring index data of the business system is time-dependent and forms time series data, a prediction result for a future time period can be predicted from the monitoring index data, thereby expressing the operating status of the business system as indices. The prediction result is then compared with the configured predicted alarm condition to determine whether the business system may become abnormal in the future time period.
Further, the LSTM algorithm model is trained on training data from a historical time period. Before the monitoring index data is input into the pre-trained machine learning algorithm model and the machine learning algorithm model is used to determine the prediction result for the prediction time period, the method further includes:
obtaining training data of the business system in a historical time period;
inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, and determining model parameters of the machine learning algorithm model.
In a specific implementation, the training data of the business system at each time point is used as an output parameter of the LSTM algorithm model, and for each output parameter, the training data in the historical time period preceding its time point is used as the input parameters of the LSTM algorithm model. After a large number of such input-output correspondences have been obtained, the model parameters of the LSTM algorithm model can be obtained using existing LSTM training methods.
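The pairing of input and output parameters described above can be illustrated with a sliding window over the series. This is a sketch under assumptions: the function name `make_training_pairs` and the fixed window length are illustrative choices, since the text does not specify how much history precedes each output.

```python
def make_training_pairs(series, window):
    """Build (input, output) pairs: each output is the value at time t,
    each input is the `window` values immediately preceding t."""
    if window <= 0:
        raise ValueError("window must be positive")
    pairs = []
    for t in range(window, len(series)):
        pairs.append((series[t - window:t], series[t]))
    return pairs


# Hypothetical monitoring series with an assumed window of 3 points.
pairs = make_training_pairs([1, 2, 3, 4, 5], window=3)
# pairs == [([1, 2, 3], 4), ([2, 3, 4], 5)]
```

Pairs in this form are what a sequence model such as an LSTM would typically be fitted on; the fitting itself is left to an existing training method, as the text states.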
It should be noted that the historical time period used for training and the reference time period used for prediction may be the same time period or different time periods; if they differ, the two periods may or may not overlap. For example, the historical time period may be the 1000 hours before the current time point and the reference time period the 999 hours before the current time point; or the historical time period may be 9 a.m. to 11 a.m. each day from January to March 2018 and the reference time period 9 a.m. to 11 a.m. each day from January to March 2019. The historical and reference time periods are selected according to computational needs and are not limited in the embodiment of the present invention.
Further, in the embodiment of the present invention, the predicted alarm condition may also be obtained by training with the LSTM algorithm. The predicted alarm condition is determined as follows:
inputting historical fault sample data of the business system into the machine learning algorithm model for training, and determining fault model parameters;
inputting historical non-fault sample data of the business system into the machine learning algorithm model for training, and determining non-fault model parameters;
determining a fault condition according to the fault model parameters and the non-fault model parameters.
In a specific implementation, the historical fault samples are the various hardware index data collected when the business system was confirmed to have a hardware fault; inputting them into the LSTM algorithm model yields the fault model parameters of the business system hardware in the faulty state. The historical non-fault samples are the various hardware index data collected while the business system was running normally; inputting them into the LSTM algorithm model yields the non-fault model parameters of the business system hardware during normal operation. A specific fault condition can then be determined from the fault model parameters and the non-fault model parameters.
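The text does not say how the fault condition is derived from the two sets of parameters. The sketch below shows one simple possibility, using an assumed stand-in: each sample set is summarized by its mean instead of by trained LSTM parameters, and the fault condition is a threshold midway between the two means. The function name and the sample values are hypothetical.

```python
from statistics import mean


def derive_fault_condition(fault_samples, normal_samples):
    """Return a predicate that flags a value as fault-like.

    Assumption: each sample set is reduced to its mean, and the decision
    threshold is the midpoint between the fault and non-fault means.
    """
    fault_level = mean(fault_samples)
    normal_level = mean(normal_samples)
    threshold = (fault_level + normal_level) / 2
    if fault_level >= normal_level:
        return lambda value: value >= threshold
    return lambda value: value <= threshold


# Hypothetical disk temperature readings (degrees Celsius).
is_fault = derive_fault_condition(
    fault_samples=[80.0, 90.0],
    normal_samples=[30.0, 40.0],
)
```

With these numbers the threshold is 60.0, so `is_fault(75.0)` is true and `is_fault(45.0)` is false.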
Because the monitoring index data of the business system includes hardware index data and business index data of the business system, the embodiment of the present invention performs prediction and alarming separately for these two types of monitoring indices.
Further, for the hardware index data of the business system, obtaining the monitoring index data of the business system in the reference time period includes:
obtaining hardware index data of the business system in a first reference time period;
using the machine learning algorithm model to determine the prediction result for the prediction time period includes:
determining the fluctuation of the hardware index data within the prediction time period;
comparing the prediction result with the predicted alarm condition and predicting whether the business system will become abnormal within the prediction time period includes:
comparing the fluctuation of the hardware index data with the fault condition, and judging whether a hardware failure will occur in the business system within the prediction time period;
if a hardware failure occurs in the business system within the prediction time period, determining a hardware failure prediction time period and a prediction accuracy.
In a specific implementation, as far as hardware index data is concerned, each server has its own life cycle, and the closer the data is to the time at which a failure occurs, the higher the prediction accuracy. The first reference time period for hardware index data should therefore be chosen as close as possible to the current time point.
Table 1 shows the failure prediction results for the hardware index data.
Table 1
[Table 1 appears in the original publication only as an image: Figure PCTCN2020097249-appb-000001]
For example, as shown in Table 1, for monitoring index 1, a prediction that the server hardware may become abnormal within 45 days has a prediction accuracy of 78%, and a prediction that it may become abnormal within 60 days has a prediction accuracy of 80%.
For the business index data of the business system, obtaining the monitoring index data of the business system in the reference time period includes:
obtaining business index data of the business system in a second reference time period;
using the machine learning algorithm model to determine the prediction result for the prediction time period includes:
determining the fluctuation of the business index data within the prediction time period;
comparing the prediction result with the predicted alarm condition and predicting whether the business system will become abnormal within the prediction time period includes:
comparing the fluctuation of the business index data with a normal fluctuation range, and judging whether the business system will become abnormal within the prediction time period;
if the business becomes abnormal within the prediction time period, determining an abnormality prediction time period; the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of business index data in a historical time period.
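The comparison against a normal fluctuation range can be sketched as below. The text says the range is determined by the machine learning algorithm model from historical fluctuation; as a labeled assumption, this sketch substitutes a simple statistical band (mean plus or minus `k` standard deviations). The function names, the value of `k`, and the traffic figures are hypothetical.

```python
from statistics import mean, pstdev


def normal_range(history, k=3.0):
    """Assumed stand-in for the learned range: mean +/- k population std devs."""
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center + k * spread


def abnormal_points(forecast, low, high):
    """Indices of forecast points that fall outside the normal range."""
    return [i for i, value in enumerate(forecast) if not (low <= value <= high)]


# Hypothetical requests-per-second history and forecast.
history = [100, 102, 98, 100, 100]
low, high = normal_range(history, k=3.0)
outliers = abnormal_points([101, 99, 120, 97], low, high)
```

The indices returned in `outliers` identify which forecast points, and therefore which part of the prediction time period, would be reported as the abnormality prediction time period.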
In a specific implementation, because business index data changes every day, the accuracy of the prediction model also improves the longer the server hardware has been in service after production, or the longer the business indices have been monitored. For business indices, the more monitoring index data is used for prediction and the larger the sample, the more accurate the result tends to be. The second reference time period for business index data should therefore be chosen to be as long as possible.
Because the volume of collected monitoring index data may be very large, and the monitoring indices differ in how strongly they bear on abnormalities, a weight range needs to be calculated and configured for each monitoring index. Further, the monitoring index data of the business system includes monitoring index data of multiple monitoring indices;
before using the machine learning algorithm model to determine the prediction result for the prediction time period, the method further includes:
determining a weight parameter for each monitoring index;
inputting the weight parameters corresponding to the monitoring index data into the machine learning algorithm model.
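How the weight parameters enter the model is not specified. One plausible reading, sketched here purely as an assumption, is to scale each monitoring index by its weight before the values are passed to the model as a feature vector. The function name, index names, and weights are hypothetical.

```python
def weighted_features(sample, weights):
    """Scale each monitoring index by its weight and return a feature vector.

    Assumption: weights are applied multiplicatively before model input.
    Features are ordered by index name so the layout is stable.
    """
    missing = set(sample) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    return [sample[name] * weights[name] for name in sorted(sample)]


# Hypothetical normalized readings and configured weights.
sample = {"cpu": 0.5, "disk_smart": 0.2}
weights = {"cpu": 1.0, "disk_smart": 2.0}
features = weighted_features(sample, weights)
```

Raising an error for an index without a configured weight keeps a misconfiguration from silently dropping a signal.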
In the embodiment of the present invention, the LSTM model predicts the monitoring indices for a future time period. If, at the next monitoring time point, the monitoring index data is found to be outside the predicted normal fluctuation range, an alarm is sent to the business operation and maintenance/development personnel. In addition, these personnel can prepare in advance based on the fluctuation of monitoring indices predicted by the monitoring platform, so that the business is not affected. For example, before a holiday or the launch of a new business campaign, the monitoring platform predicts the likely growth in daily access traffic, so that business operation and maintenance personnel can expand system capacity in advance and avoid the business system becoming unavailable due to insufficient performance.
To make the present invention clearer, the above process is described in detail below by way of a specific embodiment based on the architecture of FIG. 1. The steps of the specific embodiment are as follows:
Step 300: Train the LSTM algorithm model to obtain the model parameters.
Step 301: Obtain the monitoring index data.
Because the data indices of server hardware and of business interfaces differ, and each component and interface may have many dimensions of data, it is necessary to select data sources that are positively correlated with the business indices and to exclude interfering items, for example the SMART values of hard disks and the Health value of the motherboard.
Step 302: Preprocess the data.
Because the volume of collected monitoring index data may be very large, and the monitoring indices differ in how strongly they bear on abnormalities, the weight parameter of each monitoring index needs to be obtained.
Step 303: Obtain the predicted alarm condition. Models are trained separately on historical fault sample data and historical non-fault sample data to obtain fault model parameters and non-fault model parameters, and the predicted alarm condition is then determined from the fault model parameters and the non-fault model parameters.
Step 304: Input the monitoring index data and the weight parameters into the trained LSTM algorithm model, and use the LSTM algorithm model to calculate the prediction result for the prediction time period.
In a specific implementation, the LSTM algorithm model is used to predict a complete sequence: the training window is initialized only once, with the first part of the training data, and then, as in point-by-point prediction, the sliding window is moved forward repeatedly and the next point is predicted. The LSTM algorithm model makes predictions from previously predicted data. In the second prediction, one data point (the last one) in the data used by the model comes from the previous prediction; in the third prediction, two points come from previous predictions; and so on. By the 99th prediction, the data in the test set consists entirely of predicted data. This means that the time series the algorithm model can predict is greatly extended.
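The full-sequence prediction described above, where each prediction is appended to the window used for the next one, can be sketched as follows. The model is abstracted as a `predict_next` callback; the linear extrapolation used here is a hypothetical stand-in for the trained LSTM, and the function names are assumptions.

```python
def forecast_sequence(seed_window, predict_next, steps):
    """Predict `steps` future points, feeding each prediction back into the window."""
    window = list(seed_window)  # initialized once from real data
    predictions = []
    for _ in range(steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        # Slide the window: drop the oldest point, append the newest prediction.
        window = window[1:] + [next_value]
    return predictions


def extrapolate(window):
    """Hypothetical stand-in for the trained model: continue the last step."""
    return window[-1] + (window[-1] - window[-2])


predictions = forecast_sequence([1.0, 2.0, 3.0], extrapolate, steps=4)
# predictions == [4.0, 5.0, 6.0, 7.0]
```

With a window of three points, the fourth prediction is computed from a window that contains only earlier predictions, which mirrors the situation the text describes for the 99th prediction. Errors can accumulate in this mode because the model never sees fresh observations after initialization.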
Step 305: Compare the prediction result with the predicted alarm condition, determine whether the business system will become abnormal within the prediction time period, and display the prediction result to the user.
Different business systems and servers assign different priorities to alarms. Because the algorithm model also returns the prediction accuracy, whether fault prediction needs to be performed for a given business system can be decided from that prediction accuracy together with a threshold policy that the user defines in advance.
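A per-system threshold policy of this kind might look like the sketch below. The system names, thresholds, and default value are hypothetical; the 78% figure is the accuracy quoted for the 45-day prediction in the Table 1 example.

```python
def accept_prediction(system, accuracy, min_accuracy_by_system, default_min=0.8):
    """Decide whether a fault prediction is acted on for this system.

    Assumption: the user-defined policy maps each system to a minimum
    acceptable prediction accuracy, with a default for unlisted systems.
    """
    return accuracy >= min_accuracy_by_system.get(system, default_min)


# Hypothetical policy: the critical system accepts lower-confidence warnings.
policy = {"payments": 0.75, "batch-reports": 0.90}

payments_alert = accept_prediction("payments", 0.78, policy)
reports_alert = accept_prediction("batch-reports", 0.78, policy)
```

Under this policy the same 78% prediction is acted on for `payments` but not for `batch-reports`.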
The embodiment of the present invention further provides a monitoring device for a business system, which, as shown in FIG. 3, includes:
an obtaining unit 31, configured to obtain monitoring index data of the business system in a reference time period;
a comparing unit 32, configured to compare the monitoring index data with a direct alarm condition;
a prediction unit 33, configured to, if the monitoring index data does not satisfy the direct alarm condition, input the monitoring index data into a pre-trained machine learning algorithm model and use the machine learning algorithm model to determine a prediction result for a prediction time period;
an alarm unit 34, configured to compare the prediction result with a predicted alarm condition and predict whether the business system will become abnormal within the prediction time period.
The device further includes a training unit 35, configured to:
obtain training data of the business system in a historical time period;
input the training data of the business system in the historical time period into the machine learning algorithm model as parameters, and determine model parameters of the machine learning algorithm model.
Optionally, the device further includes a training unit 35, configured to:
input historical fault sample data of the business system in the historical time period into the machine learning algorithm model for training, and determine fault model parameters;
input historical non-fault sample data of the business system in the historical time period into the machine learning algorithm model for training, and determine non-fault model parameters;
determine a fault condition according to the fault model parameters and the non-fault model parameters.
Optionally, the monitoring index data of the business system includes hardware index data of the business system;
for the hardware index data of the business system,
the obtaining unit 31 is configured to obtain hardware index data of the business system in a first reference time period;
the prediction unit 33 is configured to determine the fluctuation of the hardware index data within the prediction time period;
the alarm unit 34 is configured to compare the fluctuation of the hardware index data with the fault condition and judge whether a hardware failure will occur in the business system within the prediction time period; and, if a hardware failure occurs in the business system within the prediction time period, determine a hardware failure prediction time period and a prediction accuracy.
The monitoring index data of the business system includes business index data of the business system; for the business index data of the business system,
the obtaining unit 31 is configured to obtain business index data of the business system in a second reference time period;
the prediction unit 33 is configured to determine the fluctuation of the business index data within the prediction time period;
the alarm unit 34 is configured to compare the fluctuation of the business index data with a normal fluctuation range and judge whether the business system will become abnormal within the prediction time period; and, if the business becomes abnormal within the prediction time period, determine an abnormality prediction time period; the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of business index data in a historical time period.
Optionally, the monitoring index data of the business system includes monitoring index data of multiple monitoring indices;
the prediction unit 33 is further configured to:
determine a weight parameter for each monitoring index;
input the weight parameters corresponding to the monitoring index data into the machine learning algorithm model.
Based on the same concept as the method shown in FIG. 2 above, this application further provides a computing device. As shown in FIG. 4, the computing device includes:
a processor 401, a memory 402, a transceiver 403, and a bus interface 404, wherein the processor 401, the memory 402, and the transceiver 403 are connected by a bus;
the processor 401 is configured to read a program in the memory 402 and execute the above monitoring method for a business system;
The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It may also be a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 402 is configured to store one or more executable programs, and may store data used by the processor 401 when performing operations.
Specifically, the program may include program code, and the program code includes computer operation instructions. The memory 402 may include volatile memory, such as random-access memory (RAM); the memory 402 may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 402 may also include a combination of the above types of memory.
The memory 402 stores the following elements, executable modules or data structures, or a subset or an extended set thereof:
Operation instructions: including various operation instructions, used to implement various operations.
Operating system: including various system programs, used to implement various basic services and to handle hardware-based tasks.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on.
The bus interface 404 may be a wired communication interface, a wireless bus interface, or a combination thereof. The wired bus interface may be, for example, an Ethernet interface, which may be an optical interface, an electrical interface, or a combination thereof. The wireless bus interface may be a WLAN interface.
Based on the same inventive concept, the embodiments of this application further provide a non-transitory computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the above monitoring method for a business system.
Based on the same inventive concept, the embodiments of this application provide a computer program product containing instructions which, when run on a computer, cause the computer to execute the above monitoring method for a business system.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these changes and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims (15)

  1. A monitoring method for a business system, characterized by comprising:
    obtaining monitoring index data of the business system in a reference time period;
    comparing the monitoring index data with a direct alarm condition;
    if the monitoring index data does not satisfy the direct alarm condition, inputting the monitoring index data into a pre-trained machine learning algorithm model, and using the machine learning algorithm model to determine a prediction result for a prediction time period;
    comparing the prediction result with a predicted alarm condition, and predicting whether the business system will become abnormal within the prediction time period.
  2. The method according to claim 1, characterized in that, before inputting the monitoring index data into the pre-trained machine learning algorithm model and using the machine learning algorithm model to determine the prediction result for the prediction time period, the method further comprises:
    obtaining training data of the business system in a historical time period;
    inputting the training data of the business system in the historical time period into the machine learning algorithm model as parameters, and determining model parameters of the machine learning algorithm model.
  3. The method according to claim 1, characterized in that the predicted alarm condition is determined as follows:
    inputting historical fault sample data of the business system into the machine learning algorithm model for training, and determining fault model parameters;
    inputting historical non-fault sample data of the business system into the machine learning algorithm model for training, and determining non-fault model parameters;
    determining a fault condition according to the fault model parameters and the non-fault model parameters.
  4. The method according to claim 3, wherein the monitoring index data of the business system comprises hardware index data of the business system;
    for the hardware index data of the business system, obtaining the monitoring index data of the business system in the reference time period comprises:
    obtaining the hardware index data of the business system in a first reference time period;
    using the machine learning algorithm model to determine the prediction result for the prediction time period comprises:
    determining the fluctuation of the hardware index data in the prediction time period;
    comparing the prediction result with the predicted alarm condition to predict whether the business system will be abnormal in the prediction time period comprises:
    comparing the fluctuation of the hardware index data with the fault condition, and determining whether the business system will have a hardware fault in the prediction time period;
    if the business system will have a hardware fault in the prediction time period, determining a hardware fault prediction time period and a prediction accuracy rate.
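The comparison in claim 4 can be illustrated by measuring fluctuation as the absolute change between consecutive predicted samples and reporting the time span in which that change exceeds a limit. Both the definition of "fluctuation" and the form of the fault condition (a maximum allowed jump) are assumptions for this sketch, and it does not attempt to compute the prediction accuracy rate mentioned in the claim.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[int, float]  # (minutes from now, predicted value)


def fluctuation(predicted: Sequence[Sample]) -> List[Sample]:
    """Absolute change between consecutive samples, stamped with the later time."""
    return [
        (t1, abs(v1 - v0))
        for (_, v0), (t1, v1) in zip(predicted, predicted[1:])
    ]


def fault_window(
    predicted: Sequence[Sample], max_jump: float
) -> Optional[Tuple[int, int]]:
    """Return (first, last) time at which the jump exceeds max_jump, else None."""
    hits = [t for t, jump in fluctuation(predicted) if jump > max_jump]
    return (hits[0], hits[-1]) if hits else None


# Hypothetical predicted temperature readings for the prediction time period.
forecast = [(0, 40.0), (5, 42.0), (10, 70.0), (15, 95.0), (20, 96.0)]
window = fault_window(forecast, max_jump=20.0)
```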
  5. The method according to claim 1, wherein the monitoring index data of the business system comprises business index data of the business system;
    for the business index data of the business system, obtaining the monitoring index data of the business system in the reference time period comprises:
    obtaining the business index data of the business system in a second reference time period;
    using the machine learning algorithm model to determine the prediction result for the prediction time period comprises:
    determining the fluctuation of the business index data in the prediction time period;
    comparing the prediction result with the predicted alarm condition to predict whether the business system will be abnormal in the prediction time period comprises:
    comparing the fluctuation of the business index data with a normal fluctuation range, and determining whether the business system will be abnormal in the prediction time period;
    if the business will be abnormal in the prediction time period, determining an abnormality prediction time period; wherein the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of the business index data in a historical time period.
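A common way to derive a "normal fluctuation range" from history is a band of k standard deviations around the historical mean. The sketch below uses that rule as an assumed stand-in for whatever the trained model learns; the value of k, the function names, and the sample data are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def normal_range(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Band of k population standard deviations around the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    return mu - k * sigma, mu + k * sigma


def abnormal_points(
    predicted: Sequence[float], bounds: Tuple[float, float]
) -> List[int]:
    """Indices of predicted samples that fall outside the normal range."""
    low, high = bounds
    return [i for i, value in enumerate(predicted) if not low <= value <= high]


# Hypothetical transactions-per-minute history and forecast.
bounds = normal_range([100.0, 102.0, 98.0, 100.0])
outliers = abnormal_points([101.0, 99.0, 120.0, 80.0], bounds)
```

The indices returned by `abnormal_points` map back to time steps, which is how an abnormality prediction time period could be reported.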
  6. The method according to claim 1, wherein the monitoring index data of the business system comprises monitoring index data of a plurality of monitoring indexes;
    before using the machine learning algorithm model to determine the prediction result for the prediction time period, the method further comprises:
    determining a weight parameter for each monitoring index;
    inputting the weight parameters corresponding to the monitoring index data into the machine learning algorithm model.
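To show how per-index weights might enter the model input in claim 6, the sketch below combines several normalised indexes into one weighted average score. Feeding the model a single weighted score is an assumption made for illustration; the claim only states that the weight parameters are input into the model. The index names and weights are hypothetical.

```python
from typing import Dict


def weighted_score(values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the indexes present in `values`."""
    total_weight = sum(weights[name] for name in values)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(values[name] * weights[name] for name in values) / total_weight


# Hypothetical normalised readings (0..1) and operator-chosen weights.
score = weighted_score(
    values={"cpu": 0.9, "memory": 0.5, "disk": 0.1},
    weights={"cpu": 2.0, "memory": 1.0, "disk": 1.0},
)
```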
  7. A monitoring apparatus for a business system, comprising:
    an obtaining unit, configured to obtain monitoring index data of the business system in a reference time period;
    a comparison unit, configured to compare the monitoring index data with a direct alarm condition;
    a prediction unit, configured to, if the monitoring index data does not satisfy the direct alarm condition, input the monitoring index data into a pre-trained machine learning algorithm model, and use the machine learning algorithm model to determine a prediction result for a prediction time period;
    an alarm unit, configured to compare the prediction result with a predicted alarm condition to predict whether the business system will be abnormal in the prediction time period.
  8. The apparatus according to claim 7, further comprising a training unit, configured to:
    obtain training data of the business system in a historical time period;
    input the training data of the business system in the historical time period into the machine learning algorithm model as parameters, and determine the model parameters of the machine learning algorithm model.
  9. The apparatus according to claim 7, further comprising a training unit, configured to:
    input historical fault sample data of the business system in a historical time period into the machine learning algorithm model for training, and determine fault model parameters;
    input historical non-fault sample data of the business system in the historical time period into the machine learning algorithm model for training, and determine non-fault model parameters;
    determine a fault condition according to the fault model parameters and the non-fault model parameters.
  10. The apparatus according to claim 9, wherein the monitoring index data of the business system comprises hardware index data of the business system;
    for the hardware index data of the business system,
    the obtaining unit is configured to obtain the hardware index data of the business system in a first reference time period;
    the prediction unit is configured to determine the fluctuation of the hardware index data in the prediction time period;
    the alarm unit is configured to compare the fluctuation of the hardware index data with the fault condition, and determine whether the business system will have a hardware fault in the prediction time period; and, if the business system will have a hardware fault in the prediction time period, determine a hardware fault prediction time period and a prediction accuracy rate.
  11. The apparatus according to claim 7, wherein the monitoring index data of the business system comprises business index data of the business system;
    for the business index data of the business system,
    the obtaining unit is configured to obtain the business index data of the business system in a second reference time period;
    the prediction unit is configured to determine the fluctuation of the business index data in the prediction time period;
    the alarm unit is configured to compare the fluctuation of the business index data with a normal fluctuation range, and determine whether the business system will be abnormal in the prediction time period; and, if the business will be abnormal in the prediction time period, determine an abnormality prediction time period; wherein the normal fluctuation range is determined by the machine learning algorithm model according to the fluctuation of the business index data in a historical time period.
  12. The apparatus according to claim 7, wherein the monitoring index data of the business system comprises monitoring index data of a plurality of monitoring indexes;
    the prediction unit is further configured to:
    determine a weight parameter for each monitoring index;
    input the weight parameters corresponding to the monitoring index data into the machine learning algorithm model.
  13. A computing device, comprising a processor, a memory, a transceiver, and a bus interface, wherein the processor, the memory, and the transceiver are connected by a bus;
    the processor is configured to read a program in the memory and execute the method according to any one of claims 1 to 6;
    the memory is configured to store one or more executable programs, and to store data used by the processor when performing operations.
  14. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute the method according to any one of claims 1 to 6.
  15. A computer program product, wherein the computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, the computer program comprises program instructions, and when the program instructions are executed by a computer, the computer is caused to execute the method according to any one of claims 1 to 6.
PCT/CN2020/097249 2019-06-28 2020-06-19 Method and apparatus for monitoring service system WO2020259421A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910580570.5 2019-06-28
CN201910580570.5A CN110275814A (en) 2019-06-28 2019-06-28 A kind of monitoring method and device of operation system

Publications (1)

Publication Number Publication Date
WO2020259421A1 (en) 2020-12-30

Family

ID=67963677

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097249 WO2020259421A1 (en) 2019-06-28 2020-06-19 Method and apparatus for monitoring service system

Country Status (2)

Country Link
CN (1) CN110275814A (en)
WO (1) WO2020259421A1 (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766698A (en) * 2021-01-13 2021-05-07 中国工商银行股份有限公司 Application service pressure determining method and device
CN113111589A (en) * 2021-04-25 2021-07-13 北京百度网讯科技有限公司 Training method of prediction model, method, device and equipment for predicting heat supply temperature
CN113127309A (en) * 2021-04-30 2021-07-16 北京奇艺世纪科技有限公司 Program monitoring method and device, electronic equipment and storage medium
CN113391981A (en) * 2021-06-30 2021-09-14 中国民航信息网络股份有限公司 Early warning method for monitoring index and related equipment
CN113468022A (en) * 2021-07-01 2021-10-01 丁鹤 Automatic operation and maintenance method for centralized monitoring of products
CN113590427A (en) * 2021-08-09 2021-11-02 中国建设银行股份有限公司 Alarm method, device, storage medium and equipment for monitoring index abnormity
CN113626285A (en) * 2021-07-30 2021-11-09 平安普惠企业管理有限公司 Model-based job monitoring method and device, computer equipment and storage medium
CN113835961A (en) * 2021-09-23 2021-12-24 中国联合网络通信集团有限公司 Alarm information monitoring method, device, server and storage medium
CN114003461A (en) * 2021-09-26 2022-02-01 苏州浪潮智能科技有限公司 Server failure prediction method, system, terminal and storage medium
CN114157585A (en) * 2021-12-09 2022-03-08 京东科技信息技术有限公司 Method and device for monitoring service resources
CN114971057A (en) * 2022-06-09 2022-08-30 支付宝(杭州)信息技术有限公司 Model selection method and device
CN115103386A (en) * 2021-03-05 2022-09-23 中国电信股份有限公司 Cell 5G wireless network performance early warning device, method and recording medium
CN115119237A (en) * 2021-03-17 2022-09-27 中国移动通信集团福建有限公司 Indoor classification hidden fault identification method and device
CN115314412A (en) * 2022-06-22 2022-11-08 北京邮电大学 Operation and maintenance-oriented type-adaptive index prediction early warning method and device
CN115473784A (en) * 2022-09-06 2022-12-13 中国银联股份有限公司 Method and device for determining invalid alarm
CN115981969A (en) * 2023-03-10 2023-04-18 中国信息通信研究院 Monitoring method and device for block chain data platform, electronic equipment and storage medium
CN116455679A (en) * 2023-06-16 2023-07-18 杭州美创科技股份有限公司 Abnormal database operation and maintenance flow monitoring method and device and computer equipment
CN116664110A (en) * 2023-06-08 2023-08-29 湖北华中电力科技开发有限责任公司 Electric power marketing digitizing method and system based on business center
CN116720048A (en) * 2022-07-18 2023-09-08 华能汕头海门发电有限责任公司 Power station auxiliary machine fault diagnosis method and system based on machine learning model
CN116895046A (en) * 2023-07-21 2023-10-17 北京亿宇嘉隆科技有限公司 Abnormal operation and maintenance data processing method based on virtualization
CN116991108A (en) * 2023-09-25 2023-11-03 四川公路桥梁建设集团有限公司 Intelligent management and control method, system and device for bridge girder erection machine and storage medium
CN117149552A (en) * 2023-10-31 2023-12-01 联通在线信息科技有限公司 Automatic interface detection method and device, electronic equipment and storage medium
WO2024040794A1 (en) * 2022-08-23 2024-02-29 天翼安全科技有限公司 Abnormal traffic detection method and apparatus, electronic device, and storage medium
CN117648383A (en) * 2024-01-30 2024-03-05 中国人民解放军国防科技大学 Heterogeneous database real-time data synchronization method, device, equipment and medium
CN117806900A (en) * 2023-07-28 2024-04-02 苏州浪潮智能科技有限公司 Server management method, device, electronic equipment and storage medium
CN117892249A (en) * 2024-03-15 2024-04-16 宁波析昶环保科技有限公司 Intelligent operation and maintenance platform early warning system
CN117896284A (en) * 2024-01-17 2024-04-16 北京奇虎科技有限公司 Performance fluctuation positioning method, device, equipment and storage medium
CN115473784B (en) * 2022-09-06 2024-07-09 中国银联股份有限公司 Method and device for determining invalid alarm

Families Citing this family (34)

Publication number Priority date Publication date Assignee Title
CN110275814A (en) * 2019-06-28 2019-09-24 深圳前海微众银行股份有限公司 A kind of monitoring method and device of operation system
CN112702184A (en) * 2019-10-23 2021-04-23 中国电信股份有限公司 Fault early warning method and device and computer-readable storage medium
CN110941797B (en) * 2019-11-07 2023-04-07 中信银行股份有限公司 Operation index monitoring and trend prediction system based on service index
CN112825175A (en) * 2019-11-20 2021-05-21 顺丰科技有限公司 Client abnormity early warning method, device and equipment
CN110865929B (en) * 2019-11-26 2024-01-23 携程旅游信息技术(上海)有限公司 Abnormality detection early warning method and system
CN112948223A (en) * 2019-11-26 2021-06-11 北京沃东天骏信息技术有限公司 Method and device for monitoring operation condition
CN111104299A (en) * 2019-11-29 2020-05-05 山东英信计算机技术有限公司 Server performance prediction method and device, electronic equipment and storage medium
CN112994960B (en) * 2019-12-02 2022-09-16 中国移动通信集团浙江有限公司 Method and device for detecting business data abnormity and computing equipment
CN111078503B (en) * 2019-12-23 2023-08-01 中国建设银行股份有限公司 Abnormality monitoring method and system
CN111241151A (en) * 2019-12-27 2020-06-05 北京健康之家科技有限公司 Service data analysis early warning method, system, storage medium and computing device
CN111339156B (en) * 2020-02-07 2023-09-26 京东城市(北京)数字科技有限公司 Method, apparatus and computer readable storage medium for long-term determination of business data
CN113535444B (en) * 2020-04-14 2023-11-03 中国移动通信集团浙江有限公司 Abnormal motion detection method, device, computing equipment and computer storage medium
CN113572625B (en) * 2020-04-28 2023-04-28 中国移动通信集团浙江有限公司 Fault early warning method, early warning device, equipment and computer medium
CN111563022B (en) * 2020-05-12 2023-09-05 中国民航信息网络股份有限公司 Centralized memory monitoring method and device
EP3913451A1 (en) * 2020-05-21 2021-11-24 Tata Consultancy Services Limited Predicting early warnings of an operating mode of equipment in industry plants
CN111708682B (en) * 2020-06-17 2021-10-26 腾讯科技(深圳)有限公司 Data prediction method, device, equipment and storage medium
CN111796995B (en) * 2020-06-30 2024-02-09 中国工商银行股份有限公司 Integrated learning-based cyclic serial number usage early warning method and system
CN111752816A (en) * 2020-06-30 2020-10-09 深圳前海微众银行股份有限公司 Operating system analysis method and device
CN111833557A (en) * 2020-07-27 2020-10-27 中国工商银行股份有限公司 Fault identification method and device
CN112019390A (en) * 2020-09-09 2020-12-01 腾讯科技(深圳)有限公司 Network fault positioning method and related device
CN112102049A (en) * 2020-09-23 2020-12-18 中国建设银行股份有限公司 Model training method, business processing method, device and equipment
CN112256526B (en) * 2020-10-14 2024-02-23 中国银联股份有限公司 Machine learning-based data real-time monitoring method and device
CN113516270A (en) * 2020-10-30 2021-10-19 腾讯科技(深圳)有限公司 Service data monitoring method and device
CN112486767B (en) * 2020-11-25 2022-10-18 中移(杭州)信息技术有限公司 Intelligent monitoring method, system, server and storage medium for cloud resources
CN113411549B (en) * 2021-06-11 2022-09-06 上海兴容信息技术有限公司 Method for judging whether business of target store is normal or not
CN113411233B (en) * 2021-06-17 2022-12-23 中国建设银行股份有限公司 Method and device for monitoring CPU utilization rate of central processing unit
CN113411217A (en) * 2021-06-21 2021-09-17 广州迷听科技有限公司 Method and device for monitoring and alarming call system
CN113537809A (en) * 2021-07-28 2021-10-22 深圳供电局有限公司 Active decision-making method and system for resource expansion in deep learning
CN113807690A (en) * 2021-09-09 2021-12-17 国网江苏省电力有限公司苏州供电分公司 Online evaluation and early warning method and system for operation state of regional power grid regulation and control system
CN113821416A (en) * 2021-09-18 2021-12-21 中国电信股份有限公司 Monitoring alarm method, device, storage medium and electronic equipment
CN114399321A (en) * 2021-11-15 2022-04-26 湖南快乐阳光互动娱乐传媒有限公司 Business system stability analysis method, device and equipment
CN114415602B (en) * 2021-12-03 2023-09-26 珠海格力电器股份有限公司 Monitoring method, device, system and storage medium for industrial equipment
CN114328118B (en) * 2021-12-30 2023-11-14 苏州浪潮智能科技有限公司 Intelligent alarming method, device, equipment and medium for operation and maintenance monitoring data
CN115439089B (en) * 2022-09-08 2023-09-08 江苏方洋智能科技有限公司 Service management system based on machine learning

Citations (3)

Publication number Priority date Publication date Assignee Title
US20090299695A1 (en) * 2008-05-29 2009-12-03 General Electric Company System and method for advanced condition monitoring of an asset system
CN108172288A (en) * 2018-01-05 2018-06-15 深圳倍佳医疗科技服务有限公司 Medical Devices intelligent control method, device and computer readable storage medium
CN110275814A (en) * 2019-06-28 2019-09-24 深圳前海微众银行股份有限公司 A kind of monitoring method and device of operation system

Cited By (41)

Publication number Priority date Publication date Assignee Title
CN112766698A (en) * 2021-01-13 2021-05-07 中国工商银行股份有限公司 Application service pressure determining method and device
CN112766698B (en) * 2021-01-13 2024-02-09 中国工商银行股份有限公司 Application service pressure determining method and device
CN115103386A (en) * 2021-03-05 2022-09-23 中国电信股份有限公司 Cell 5G wireless network performance early warning device, method and recording medium
CN115119237A (en) * 2021-03-17 2022-09-27 中国移动通信集团福建有限公司 Indoor classification hidden fault identification method and device
CN113111589A (en) * 2021-04-25 2021-07-13 北京百度网讯科技有限公司 Training method of prediction model, method, device and equipment for predicting heat supply temperature
CN113127309A (en) * 2021-04-30 2021-07-16 北京奇艺世纪科技有限公司 Program monitoring method and device, electronic equipment and storage medium
CN113127309B (en) * 2021-04-30 2023-10-10 北京奇艺世纪科技有限公司 Program monitoring method and device, electronic equipment and storage medium
CN113391981A (en) * 2021-06-30 2021-09-14 中国民航信息网络股份有限公司 Early warning method for monitoring index and related equipment
CN113468022A (en) * 2021-07-01 2021-10-01 丁鹤 Automatic operation and maintenance method for centralized monitoring of products
CN113468022B (en) * 2021-07-01 2024-02-09 丁鹤 Automatic operation and maintenance method for centralized monitoring of products
CN113626285A (en) * 2021-07-30 2021-11-09 平安普惠企业管理有限公司 Model-based job monitoring method and device, computer equipment and storage medium
CN113590427A (en) * 2021-08-09 2021-11-02 中国建设银行股份有限公司 Alarm method, device, storage medium and equipment for monitoring index abnormity
CN113590427B (en) * 2021-08-09 2024-05-03 中国建设银行股份有限公司 Alarm method, device, storage medium and equipment for monitoring index abnormality
CN113835961B (en) * 2021-09-23 2023-05-16 中国联合网络通信集团有限公司 Alarm information monitoring method, device, server and storage medium
CN113835961A (en) * 2021-09-23 2021-12-24 中国联合网络通信集团有限公司 Alarm information monitoring method, device, server and storage medium
CN114003461A (en) * 2021-09-26 2022-02-01 苏州浪潮智能科技有限公司 Server failure prediction method, system, terminal and storage medium
CN114157585A (en) * 2021-12-09 2022-03-08 京东科技信息技术有限公司 Method and device for monitoring service resources
CN114971057A (en) * 2022-06-09 2022-08-30 支付宝(杭州)信息技术有限公司 Model selection method and device
CN115314412A (en) * 2022-06-22 2022-11-08 北京邮电大学 Operation and maintenance-oriented type-adaptive index prediction early warning method and device
CN115314412B (en) * 2022-06-22 2023-09-05 北京邮电大学 Operation-and-maintenance-oriented type self-adaptive index prediction and early warning method and device
CN116720048A (en) * 2022-07-18 2023-09-08 华能汕头海门发电有限责任公司 Power station auxiliary machine fault diagnosis method and system based on machine learning model
WO2024040794A1 (en) * 2022-08-23 2024-02-29 天翼安全科技有限公司 Abnormal traffic detection method and apparatus, electronic device, and storage medium
CN115473784B (en) * 2022-09-06 2024-07-09 中国银联股份有限公司 Method and device for determining invalid alarm
CN115473784A (en) * 2022-09-06 2022-12-13 中国银联股份有限公司 Method and device for determining invalid alarm
CN115981969A (en) * 2023-03-10 2023-04-18 中国信息通信研究院 Monitoring method and device for block chain data platform, electronic equipment and storage medium
CN116664110A (en) * 2023-06-08 2023-08-29 湖北华中电力科技开发有限责任公司 Electric power marketing digitizing method and system based on business center
CN116664110B (en) * 2023-06-08 2024-03-29 湖北华中电力科技开发有限责任公司 Electric power marketing digitizing method and system based on business center
CN116455679B (en) * 2023-06-16 2023-09-08 杭州美创科技股份有限公司 Abnormal database operation and maintenance flow monitoring method and device and computer equipment
CN116455679A (en) * 2023-06-16 2023-07-18 杭州美创科技股份有限公司 Abnormal database operation and maintenance flow monitoring method and device and computer equipment
CN116895046B (en) * 2023-07-21 2024-05-07 北京亿宇嘉隆科技有限公司 Abnormal operation and maintenance data processing method based on virtualization
CN116895046A (en) * 2023-07-21 2023-10-17 北京亿宇嘉隆科技有限公司 Abnormal operation and maintenance data processing method based on virtualization
CN117806900A (en) * 2023-07-28 2024-04-02 苏州浪潮智能科技有限公司 Server management method, device, electronic equipment and storage medium
CN117806900B (en) * 2023-07-28 2024-05-07 苏州浪潮智能科技有限公司 Server management method, device, electronic equipment and storage medium
CN116991108B (en) * 2023-09-25 2023-12-12 四川公路桥梁建设集团有限公司 Intelligent management and control method, system and device for bridge girder erection machine and storage medium
CN116991108A (en) * 2023-09-25 2023-11-03 四川公路桥梁建设集团有限公司 Intelligent management and control method, system and device for bridge girder erection machine and storage medium
CN117149552A (en) * 2023-10-31 2023-12-01 联通在线信息科技有限公司 Automatic interface detection method and device, electronic equipment and storage medium
CN117896284A (en) * 2024-01-17 2024-04-16 北京奇虎科技有限公司 Performance fluctuation positioning method, device, equipment and storage medium
CN117648383A (en) * 2024-01-30 2024-03-05 中国人民解放军国防科技大学 Heterogeneous database real-time data synchronization method, device, equipment and medium
CN117648383B (en) * 2024-01-30 2024-06-11 中国人民解放军国防科技大学 Heterogeneous database real-time data synchronization method, device, equipment and medium
CN117892249A (en) * 2024-03-15 2024-04-16 宁波析昶环保科技有限公司 Intelligent operation and maintenance platform early warning system
CN117892249B (en) * 2024-03-15 2024-05-31 宁波析昶环保科技有限公司 Intelligent operation and maintenance platform early warning system

Also Published As

Publication number Publication date
CN110275814A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
WO2020259421A1 (en) Method and apparatus for monitoring service system
US20220036264A1 (en) Real-time adaptive operations performance management system
US10585774B2 (en) Detection of misbehaving components for large scale distributed systems
US10467533B2 (en) System and method for predicting response time of an enterprise system
JP7237110B2 (en) FAILURE PREDICTION METHOD, DEVICE, ELECTRONIC EQUIPMENT, STORAGE MEDIUM, AND PROGRAM
CN112712113B (en) Alarm method, device and computer system based on index
US20160042289A1 (en) Systems and methods for adaptive thresholding using maximum concentration intervals
WO2021213247A1 (en) Anomaly detection method and device
CN107766533B (en) Automatic detection method and system for telephone traffic abnormality, storage medium and electronic equipment
US11012289B2 (en) Reinforced machine learning tool for anomaly detection
CN104063747A (en) Performance abnormality prediction method in distributed system and system
Rajagopal et al. FedSDM: Federated learning based smart decision making module for ECG data in IoT integrated Edge–Fog–Cloud computing environments
CN109471783B (en) Method and device for predicting task operation parameters
CN112188531A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and computer storage medium
US11449798B2 (en) Automated problem detection for machine learning models
US20210232104A1 (en) Method and system for identifying and forecasting the development of faults in equipment
US20160364467A1 (en) Event notification system with cluster classification
US11392821B2 (en) Detecting behavior patterns utilizing machine learning model trained with multi-modal time series analysis of diagnostic data
Gupta et al. A supervised deep learning framework for proactive anomaly detection in cloud workloads
US20150120912A1 (en) Automated generation and dynamic update of rules
JP2023547849A (en) Method or non-transitory computer-readable medium for automated real-time detection, prediction, and prevention of rare failures in industrial systems using unlabeled sensor data
US20210279633A1 (en) Algorithmic learning engine for dynamically generating predictive analytics from high volume, high velocity streaming data
CN115686756A (en) Virtual machine migration method and device, storage medium and electronic equipment
CN114861909A (en) Model quality monitoring method and device, electronic equipment and storage medium
CN110413482B (en) Detection method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20832609; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established. Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.04.2022)
122 Ep: pct application non-entry in european phase. Ref document number: 20832609; Country of ref document: EP; Kind code of ref document: A1