CN106844138A - O&M warning system and method - Google Patents

Info

Publication number
CN106844138A
Authority
CN
China
Prior art keywords
alarm
index data
data
module
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611155555.9A
Other languages
Chinese (zh)
Inventor
胡嘉伟
许晓炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201611155555.9A priority Critical patent/CN106844138A/en
Publication of CN106844138A publication Critical patent/CN106844138A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/32Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324Display of status information
    • G06F11/327Alarm or error message display
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides an operation and maintenance (O&M) alarm system and method. The system includes: an offline model training module, configured to update, based on machine learning, the parameters required by the detection algorithm used by the online detection module, and to predict the distribution of index data based on the analysis result of historical data; an online detection module, configured to receive index data to be detected and to detect the index data using a preset detection algorithm, based on the prediction result of the distribution of the index data; and an alarm module, configured to determine, for the detection result of the online detection module, whether to alarm based on a preset alarm rule. The invention enables automatic operation and maintenance alarming.

Description

Operation and maintenance alarm system and method
Technical Field
The invention relates to the technical field of computers, in particular to an operation and maintenance alarm system and method.
Background
With the rapid development of networks, network systems that need to serve a large number of users have emerged. These network systems have a large number of computers (servers) or computing resources distributed at various locations, and these computers or computing resources are often structured in clusters to serve users. As more and more computers or computing resources are provided for service, monitoring various indexes of the computers or computing resources and timely and accurately alarming when a fault occurs are very important problems.
Taking a data center system as an example, various indexes of computers and computing resources of the data center need to be monitored so as to find abnormal conditions of the data center system, so that operation and maintenance personnel can remove faults as soon as possible and the stable operation of the system is ensured. The current monitoring method mainly comprises the steps that operation and maintenance personnel manually check operation and maintenance indexes or monitor the indexes in a mode of setting a fixed threshold value. Besides a large amount of manpower, the manual checking of the operation and maintenance indexes is very easy to miss abnormal conditions in a large amount of data, and when the data volume increases to a certain degree, the manual checking mode is not feasible. The method for monitoring by setting a fixed threshold requires setting a reasonable threshold for each index, and is not feasible when the number of the indexes is large. In addition, the mode of setting the fixed threshold value can only alarm the abnormal condition which accords with the simple rule, and a large amount of false alarms are easily generated in the complex actual production environment.
Disclosure of Invention
In order to avoid the disadvantage of manual operation and maintenance alarm, the embodiment of the invention provides a system and a method capable of realizing automatic operation and maintenance alarm.
According to an aspect of an embodiment of the present invention, an operation and maintenance alarm system is provided, configured to detect and alarm on abnormalities in index data of a device or a computing resource in a network system. The system includes an offline model training module, an online detection module, and an alarm module, wherein: the offline model training module is used for updating, based on machine learning, the parameters required by the detection algorithm used by the online detection module, and for predicting the distribution of index data based on the analysis result of historical data; the online detection module is used for receiving index data to be detected and detecting the index data using a preset detection algorithm, based on the prediction result of the distribution of the index data; and the alarm module determines, for the detection result of the online detection module, whether to alarm based on a preset alarm rule.
Preferably, the offline model training module predicts value distribution conditions of each index data at different moments based on analysis results of historical data, and predicts expected values of the periodic index data in a plurality of periods in the future.
Preferably, the offline model training module acquires index data required for machine learning from an offline database, determines parameters required for a detection algorithm used by the online detection module according to the acquired index data, stores the determined parameters required for the detection algorithm used by the online detection module in a model parameter database, and stores a distribution prediction result for the index data in a long-term prediction database.
Preferably, the online detection module is further configured to change the detection algorithm for the index data according to feedback information, returned by the alarm module, indicating that the user has ignored an abnormality reported for that index data, and to detect the index data again using the new detection algorithm.
Preferably, the online detection module acquires index data required by a detection algorithm used from an online cache database, and stores an abnormal record in a detection result and feature description information of the abnormal record in an abnormal database.
Preferably, the alarm module judges whether to alarm according to a preset alarm rule for the abnormality detected by the online detection module, and updates the alarm rule according to the user feedback and in combination with the abnormality record and the feature description information of the abnormality record.
Preferably, the alarm module acquires parameters of a semi-supervised learning algorithm from an alarm model database, stores relevant information of alarm index data in the alarm database, and stores user feedback in a user feedback database.
According to another aspect of the embodiments of the present invention, an operation and maintenance alarm method is provided for detecting and alarming on abnormalities in index data of a device or a computing resource in a network system, the method including: receiving index data to be detected, and detecting the index data using a preset detection algorithm, based on a prediction result of the distribution of the index data obtained in advance from the analysis result of historical data; and determining, for the detection result, whether to alarm based on a preset alarm rule.
Preferably, the method further comprises: and receiving user feedback after alarming, and updating the alarming rule according to the user feedback.
Preferably, the method further comprises: updating, based on machine learning, the parameters required by the detection algorithm used by the online detection module.
Preferably, the method further comprises: and predicting the value distribution condition of each index data at different moments based on the analysis result of the historical data, and predicting expected values of the periodic index data in a plurality of periods in the future to obtain the prediction result.
Preferably, the method further comprises: changing the detection algorithm for the index data according to returned feedback information indicating that the user has ignored an abnormality reported for that index data, and detecting the index data again using the new detection algorithm.
The operation and maintenance alarm system provided by the embodiment of the invention uses machine learning to learn the characteristics and rules of each index from the index data already acquired, and detects abnormal data by analyzing newly acquired index data. The process is largely automatic and requires almost no manual configuration. After abnormal data is detected, an alarm is not simply sent out directly; instead, individualized alarm rules are learned from the past feedback that operation and maintenance personnel responsible for different indexes gave on different alarms. Abnormalities that are of no interest to the operation and maintenance personnel can therefore be filtered out, which reduces both their workload and the false alarm rate.
Drawings
Fig. 1 is a schematic structural diagram of an operation and maintenance alarm system in a data center application scenario according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an operation and maintenance alarm system according to an embodiment of the present invention;
fig. 3a and fig. 3b are schematic diagrams of the workflow of the operation and maintenance alarm system according to an embodiment of the present invention;
fig. 4 is a flowchart of an operation and maintenance alarm method according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Operation and maintenance, together with research and development, testing, and system management, are the four major pillars of internet product technology. The core objective of operation and maintenance is to efficiently and reasonably integrate delivered business software and hardware infrastructure into products capable of continuously providing high-quality services, while reducing service operation cost as far as possible and guaranteeing service operation safety. One technical task of operation and maintenance is service fault management. The operation and maintenance alarm system provided by the invention provides index detection and alarming for equipment or services, and is part of operation and maintenance service fault management.
In the embodiment of the invention, the operation and maintenance alarm system is introduced by taking a data center as an application scene as an example. The data center is a specific equipment network with cooperation of multiple parties and is used for transmitting, accelerating, showing, calculating and storing data information on the internet network infrastructure. For a data center, various indexes need to be monitored, such as index data of network traffic, storage space, cpu, memory, network card, and the like of each machine.
It should be understood that the operation and maintenance alarm system provided by the embodiment of the present invention may be applied not only to a data center but also to operation and maintenance alarming for other types of monitoring data or in other application scenarios. The data center indexes in the above example are mostly low-level, machine-related indexes; however, different businesses also have many business-related indexes that need monitoring and alarming, such as the delay of search results, the delay of remote invocation interfaces, the throughput of message queues, the access traffic of websites, and the traffic of login requests.
In summary, the operation and maintenance alarm system provided in the embodiment of the present invention can be understood as an operation and maintenance alarm system for time-series-oriented indexes, and application scenarios thereof include the data center, the network service system, the website system, and the like (collectively expressed by "network system"), and monitored index data is different according to different scenarios.
Referring to fig. 1, an architecture diagram of an operation and maintenance alarm system provided in an application scenario of a data center is shown in an embodiment of the present invention.
The data center is a system composed of a plurality of computing resources, network resources and storage resources; its structure is complex and is not described in detail here. In cloud computing, a computing resource mainly refers to a resource whose computing power is provided by a device or a virtual machine. In a data center, the operation and maintenance alarm system is deployed downstream of the data monitoring equipment. As shown in the example of fig. 1, the operation and maintenance alarm system receives monitoring data from message queue monitoring, traffic monitoring, and service monitoring, and thereby acquires the index data stream to be detected and alarmed on.
Fig. 2 is a schematic structural diagram of an operation and maintenance alarm system according to an embodiment of the present invention. The operation and maintenance alarm system is used for detecting and alarming on abnormalities in index data of equipment or computing resources in a network system. As shown in fig. 2, the system includes: an offline model training module 1, an online detection module 2 and an alarm module 3.
The monitoring data in fig. 1 is input into the operation and maintenance alarm system in the form of data stream. For newly input data, the online detection module 2 uses various detection algorithms to detect abnormal data therein. The data such as parameters required by each detection algorithm in the online detection module 2 are obtained by calculation through the offline model training module 1. And for the detected abnormal data, the alarm module 3 is used for alarming the user by combining the feedback information of the user.
Each module is described in detail below.
1. Offline model training module 1
This module is mainly used for updating, based on machine learning, the parameters required by the detection algorithm used by the online detection module 2, and for predicting the distribution (trend) of index data based on the analysis result of historical data.
It can be seen that the offline model training module 1 mainly has the following two functions:
a. parameters required by each detection algorithm of the online detection module 2 are updated in a machine learning mode;
The specific parameters vary depending on the algorithm used, for example the GEN-ESD algorithm, the quantile statistical method based on LOWESS smoothing, the detection algorithm based on RPCA (data dimension reduction), the threshold judgment method, and the like. For each type of algorithm, the required parameters are determined by the algorithm itself and are known in the art, so they are not discussed further in the embodiments of the present invention.
b. And predicting the future long-term distribution of the index data by analyzing the historical data.
Specific indexes in the data center application scenario mainly include the cpu, memory and network card monitoring data of a machine, the QPS (queries per second) of a service, request delay, website login information, and the like.
The predicted content may include two parts: the distribution of the values of the indexes at different moments is the first, and the expected values of the periodic indexes in a plurality of future periods are the second.
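The two parts of the prediction can be illustrated with a minimal sketch. It is not the method of the embodiment, which does not specify how the prediction is computed; it assumes a fixed sampling interval and a known period, and the function names, the percentile band, and the dictionary keys are hypothetical.

```python
from statistics import mean, quantiles

def predict_distribution(history, period, low_q=5, high_q=95):
    """Summarise each time slot of the period from historical values.

    history: values sampled at a fixed interval, oldest first.
    period:  samples per cycle (e.g. 24 for hourly data with a daily cycle).
    Returns one dict per slot: expected value plus a percentile band.
    """
    slots = [[] for _ in range(period)]
    for i, value in enumerate(history):
        slots[i % period].append(value)
    forecast = []
    for values in slots:
        if len(values) < 2:
            raise ValueError("need at least two full periods of history")
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        forecast.append({
            "expected": mean(values),   # part two: expected value of the slot
            "low": cuts[low_q - 1],     # part one: value distribution at this slot
            "high": cuts[high_q - 1],
        })
    return forecast

def expected_values(forecast, n_periods):
    """Repeat the per-slot expected values over several future periods."""
    return [slot["expected"] for _ in range(n_periods) for slot in forecast]
```

Grouping by slot is only one possible choice; it ignores trend, which a real implementation would probably need to model.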
In a preferred mode of the embodiment of the present invention, three databases may be provided to cooperate with the offline model training module 1. Referring to fig. 2, the offline model training module 1 stores the relevant data in an offline database, a model parameter database, and a long-term prediction database:
a) The off-line database is used for storing operation and maintenance index data in a certain time window required by off-line model training;
b) the model parameter database is used for storing parameters required by each detection algorithm of the online detection module 2;
c) the long-term prediction database is used for storing the result of long-term distribution prediction of the operation and maintenance indexes in the offline model training module 1.
Setting up the three databases described above (the offline database, the model parameter database, and the long-term prediction database) is a preferred arrangement. Other arrangements may also be adopted: for example, a single database may store the different kinds of data in separate blocks, or the data required by the offline model training module 1 and its processing results may be stored by category on the device where the module is located.
2. On-line detection module 2
The method is mainly used for receiving index data to be detected, and detecting the index data by using a preset detection algorithm based on the prediction result of the off-line model training module 1 on the distribution of the index data.
The online detection module receives index data streams collected by the data center and detects different abnormal conditions by using various detection algorithms. For example, data can be collected by kafka (a high-throughput distributed publish-subscribe messaging system), and then an index requiring abnormality detection is accessed to the operation and maintenance alarm system through kafka, and data registration can be set in the system, and only registered data can be detected.
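The registration step can be sketched as a filter placed in front of the detectors. This is an illustration only: the stream is modelled as a plain Python iterable rather than a real kafka consumer, and the class name, method names, and the "metric" field are hypothetical.

```python
class MetricRegistry:
    """Keeps the set of registered index names and filters an incoming stream."""

    def __init__(self):
        self._names = set()

    def register(self, name):
        # Only registered indexes are passed on to anomaly detection.
        self._names.add(name)

    def accept(self, messages):
        # messages: iterable of dicts carrying at least a "metric" key (assumed layout).
        for message in messages:
            if message.get("metric") in self._names:
                yield message
```

In a deployment the iterable would be replaced by whatever consumer reads the topic; the filter itself does not depend on the transport.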
For different indexes, various detection algorithms can be combined to detect the indexes according to the respective characteristics of the indexes. For example, assuming that an anomaly detected by a certain algorithm is always fed back by a user as uninteresting for a certain index, the detection algorithm will not be used for detection in subsequent anomaly detection. For another example, if an algorithm detects a certain type of indicator very well (learning the effect through user feedback), the algorithm will be added to detect the type of indicator that has not been used before.
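A minimal sketch of this feedback-driven selection follows. The embodiment does not state how "always ignored" or "detects very well" is measured, so the minimum feedback count, the promotion ratio, and all names here are assumptions made for illustration.

```python
from collections import defaultdict

class DetectorSelector:
    """Tracks which detectors are active per index, driven by user feedback."""

    def __init__(self, min_feedback=3, promote_ratio=0.8):
        self.min_feedback = min_feedback      # assumed: feedback needed before acting
        self.promote_ratio = promote_ratio    # assumed: share of useful alarms to promote
        self.metric_type = {}                          # index -> type label
        self.active = defaultdict(set)                 # index -> detector names
        self.by_metric = defaultdict(lambda: [0, 0])   # (index, detector) -> [useful, total]
        self.by_type = defaultdict(lambda: [0, 0])     # (type, detector) -> [useful, total]

    def add_metric(self, metric, metric_type, detectors):
        self.metric_type[metric] = metric_type
        self.active[metric] = set(detectors)

    def record_feedback(self, metric, detector, useful):
        mtype = self.metric_type[metric]
        for counts in (self.by_metric[(metric, detector)],
                       self.by_type[(mtype, detector)]):
            counts[0] += int(useful)
            counts[1] += 1
        useful_n, total = self.by_metric[(metric, detector)]
        if total >= self.min_feedback and useful_n == 0:
            # Every anomaly from this detector was ignored: stop using it here.
            self.active[metric].discard(detector)
        useful_t, total_t = self.by_type[(mtype, detector)]
        if total_t >= self.min_feedback and useful_t / total_t >= self.promote_ratio:
            for other, other_type in self.metric_type.items():
                # Extend to same-type indexes that have never run this detector.
                if other_type == mtype and (other, detector) not in self.by_metric:
                    self.active[other].add(detector)
```

Keeping separate counters per index and per index type lets a detector be dropped for one index while still being promoted to others of the same type.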
The detection algorithm used in the embodiment of the present invention includes, but is not limited to, the following types:
(1) performing STL decomposition on the data with periodicity, and performing anomaly detection on the residual items through a GEN-ESD algorithm and quantile statistic;
(2) performing LOWESS smoothing on short-term data, and performing anomaly detection on residual items through quantile statistics;
(3) carrying out RPCA decomposition on data with periodicity, and judging the data with a sparse item exceeding a certain significance threshold value as abnormal data;
(4) comparing the actual data with the data predicted by the offline model training module, and judging the data with the deviation degree exceeding a certain significance threshold value as abnormal data;
(5) differentiating the index sequence, and judging the data with the change rate exceeding a certain significance threshold value as abnormal data.
The significance threshold values in the detection algorithm are all threshold values which are obtained by the offline model training module 1 according to historical data training and are stored in the model parameter database.
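Detection types (4) and (5) are simple enough to sketch directly. The sketch below assumes that "deviation degree" means absolute difference and that the derivative is approximated by the first difference between consecutive samples; both are assumptions, as are the function names. The thresholds are passed in, standing for values read from the model parameter database.

```python
def deviation_anomalies(actual, predicted, threshold):
    """Type (4): indices where the actual value deviates from the offline
    prediction by more than the trained threshold (absolute difference assumed)."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted))
            if abs(a - p) > threshold]

def rate_anomalies(series, threshold):
    """Type (5): indices where the change from the previous sample exceeds the
    trained threshold (first difference used as a discrete derivative)."""
    return [i for i in range(1, len(series))
            if abs(series[i] - series[i - 1]) > threshold]
```

For example, `rate_anomalies([1, 2, 10, 11], 5)` flags index 2, the only point whose jump from its predecessor exceeds 5.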
In a preferred mode of the embodiment of the present invention, two databases may be provided to cooperate with the online detection module 2. Referring to fig. 2, the online detection module 2 stores the relevant data in an online data cache database and an abnormal database. Wherein,
a) the online data cache database is used for storing operation and maintenance index data in a certain time window required by each detection algorithm of the online detection module 2;
b) the abnormal database is used for storing abnormal data records detected by the online detection module 2 and instantaneous feature description information of an index sequence corresponding to the abnormal records; the feature description information includes, but is not limited to: periodicity information, distribution information, autocorrelation information, skewness information, peak information.
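Some of the listed feature descriptions can be computed with standard statistics, as in the sketch below. It reads "peak information" as kurtosis, which is an interpretation of the translated term rather than something the text confirms; the function name, the returned keys, and the choice of population moments are likewise assumptions.

```python
from statistics import fmean, pstdev

def describe_sequence(series, lag=1):
    """Return skewness, excess kurtosis and lag-k autocorrelation of a sequence."""
    n = len(series)
    mu = fmean(series)
    sigma = pstdev(series, mu)
    if sigma == 0:
        raise ValueError("constant sequence has no defined shape statistics")
    z = [(x - mu) / sigma for x in series]           # standardised values
    skewness = fmean([v ** 3 for v in z])            # third standardised moment
    kurtosis = fmean([v ** 4 for v in z]) - 3.0      # excess kurtosis (0 for a normal)
    autocorr = sum(z[i] * z[i - lag] for i in range(lag, n)) / n
    return {"skewness": skewness, "kurtosis": kurtosis, "autocorrelation": autocorr}
```

A record stored in the abnormal database could carry such a dictionary alongside the anomalous value, so that the alarm module can later learn from it.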
Setting up the online data cache database and the abnormal database described above is a preferred arrangement. Other arrangements may also be adopted: for example, a single database may store the different kinds of data in separate blocks, or the data required by the online detection module 2 and its processing results may be stored by category on the device where the module is located.
3. Alarm module 3
The alarm module 3 is mainly used for determining whether to alarm or not based on a preset alarm rule according to the detection result of the online detection module 2, receiving user feedback after alarming, and updating the alarm rule according to the user feedback.
It can be seen that the alarm module 3 mainly comprises the following two functions:
a) Performing real-time alarm filtering on the abnormalities detected by the online detection module 2, and judging whether to alarm according to a preset alarm rule. For example, alarm information can be sent to the user by short message, mail, or internal app; the user is generally an operation and maintenance person or a related business person.
b) And learning the interest of the user according to the user feedback and the feature description of the sequence, and updating the alarm rule. The user feedback is feedback made by the user to the alarm, the user is generally an operation and maintenance person or a related business person, and the user can make feedback to the alarm by accessing an API (application programming interface) or in a form of a webpage. The alarm module 3 may learn whether the user is interested in a certain type of alarm information based on the user feedback using a semi-supervised learning algorithm. Among them, Semi-Supervised Learning (SSL) is a key problem in the field of pattern recognition and machine Learning, and is a Learning method combining Supervised Learning and unsupervised Learning. The method mainly considers the problem of how to train and classify by using a small amount of labeled samples and a large amount of unlabeled samples. The method mainly comprises semi-supervised classification, semi-supervised regression, semi-supervised clustering and semi-supervised dimension reduction algorithm.
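The idea of learning from a few labeled alarms and many unlabeled ones can be illustrated with self-training on a nearest-centroid classifier. The embodiment does not name a specific semi-supervised algorithm, so this technique is swapped in purely for illustration; the feature vectors, the margin, and all function names are hypothetical.

```python
import math

def centroid(points):
    if not points:
        raise ValueError("each class needs at least one labeled example")
    return [sum(column) / len(points) for column in zip(*points)]

def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """labeled:   list of (features, interested) pairs from user feedback.
    unlabeled: list of feature vectors of alarms that received no feedback.
    Returns (interested_centroid, ignored_centroid)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        pos = centroid([f for f, y in labeled if y])
        neg = centroid([f for f, y in labeled if not y])
        confident, rest = [], []
        for f in pool:
            d_pos, d_neg = math.dist(f, pos), math.dist(f, neg)
            if abs(d_pos - d_neg) >= margin:
                confident.append((f, d_pos < d_neg))   # pseudo-label confident points
            else:
                rest.append(f)
        if not confident:
            break
        labeled.extend(confident)
        pool = rest
    pos = centroid([f for f, y in labeled if y])
    neg = centroid([f for f, y in labeled if not y])
    return pos, neg

def is_interesting(features, pos, neg):
    """Alarm filter: keep anomalies closer to the 'interested' centroid."""
    return math.dist(features, pos) < math.dist(features, neg)
```

The centroids play the role of the model parameters that would be kept in the alarm model database; the feature vectors could be the feature description information stored with each abnormal record.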
In a preferred mode of the embodiment of the present invention, three databases may be provided to cooperate with the alarm module 3 to store data. Referring to fig. 2, these are an alarm database, a user feedback database, and an alarm model database. The alarm module 3 will store the relevant data in the alarm database, the user feedback database and the alarm model database. Wherein,
a) the alarm database is used for storing the related information of the data which finally gives an alarm;
b) the user feedback database is used for storing feedback information of the user on the alarm, such as whether the alarm is false alarm, whether the abnormality is interested, and the like;
c) the alarm model database is used for storing model parameters of a machine learning algorithm used by the alarm module 3.
The above-mentioned manner of setting the alarm database, the user feedback database, and the alarm model database is a preferred manner, and actually, other manners may be adopted, for example, only one database is set to store different data in blocks, or data required by or processing results of the alarm module 3 are stored in a classified manner on the device.
Referring to fig. 3a and 3b, schematic diagrams of a workflow of an operation and maintenance alarm system according to an embodiment of the present invention are shown.
The whole work flow comprises two parts of online detection alarm (figure 3a) and offline modeling (figure 3 b).
The on-line detection alarm part is started to work in real time after receiving the index data stream to be detected:
301a, business or operation and maintenance personnel feed the data to be detected into kafka and register the data;
302a, the online detection module receives the data stream from kafka;
303a, splicing the data stream fragments received in real time with historical data in a cache database by an online detection module;
304a, adding the data stream segment received in real time to an offline database by the online detection module (or setting a service special for storing the real-time data in the offline database);
305a, the online detection module acquires parameters of the relevant model from the model parameter database and acquires the corresponding predicted value from the long-term prediction database;
306a, performing anomaly detection on the sequence by using each detection algorithm by an online detection module according to the spliced data sequence, the relevant model parameters and the offline predicted value;
307a, the online detection module stores the abnormal detection result to an abnormal database and informs the alarm module to filter the abnormality;
308a, the alarm module filters the abnormity according to the alarm model database;
309a, the alarm module stores the filtered abnormity into an alarm database and sends alarm information to a user;
310a, when the user feeds back the alarm information, the alarm module stores the feedback information into a user feedback database;
311a, the alarm module updates the alarm model and strategy on line according to the alarm feedback.
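Steps 303a to 311a can be sketched end to end with in-memory stand-ins for the databases. Everything below is hypothetical scaffolding: the class and attribute names, the single deviation detector, and the per-index mute set used as an alarm model are assumptions chosen to keep the example short.

```python
from collections import defaultdict, deque

class AlarmPipeline:
    def __init__(self, window=100):
        self.cache = defaultdict(lambda: deque(maxlen=window))  # online cache database
        self.offline = defaultdict(list)   # offline database
        self.model_params = {}             # model parameter database: index -> threshold
        self.predictions = {}              # long-term prediction database: index -> expected
        self.anomalies = []                # abnormal database
        self.alarms = []                   # alarm database
        self.feedback = []                 # user feedback database
        self.muted = set()                 # alarm model: indexes the user ignores

    def on_data(self, metric, values, notify):
        self.cache[metric].extend(values)        # 303a: splice with cached history
        self.offline[metric].extend(values)      # 304a: keep for offline training
        threshold = self.model_params[metric]    # 305a: trained parameter
        expected = self.predictions[metric]      # 305a: offline predicted value
        # 306a: a fuller detector would analyse the spliced window in self.cache;
        # here only the new points are compared with the prediction.
        found = [v for v in values if abs(v - expected) > threshold]
        for value in found:
            record = {"metric": metric, "value": value}
            self.anomalies.append(record)        # 307a: store the anomaly
            if metric in self.muted:             # 308a: filter by the alarm model
                continue
            self.alarms.append(record)           # 309a: store and send the alarm
            notify(record)

    def on_feedback(self, metric, interested):
        self.feedback.append((metric, interested))   # 310a: store the feedback
        if interested:                               # 311a: update the alarm model
            self.muted.discard(metric)
        else:
            self.muted.add(metric)
```

Passing `notify` as a callable keeps the delivery channel (short message, mail, internal app) outside the sketch.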
The offline modeling part is run periodically (e.g., once a day):
301b, reading historical data from an offline database by an offline model training module;
302b, the offline model training module updates model parameters according to historical data and predicts the data situation of the future day;
303b, the offline model training module stores the updated model parameters into a model parameter database, and stores the prediction result into a long-term prediction database.
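The three offline steps can be sketched as one periodic job. Plain dictionaries stand in for the three databases, and the threshold is taken as a high percentile of historical deviations; that rule, the parameter key, and the function name are assumptions, since the embodiment leaves the training method to the chosen algorithm.

```python
from statistics import fmean, quantiles

def run_offline_training(offline_db, model_param_db, long_term_db, period, q=99):
    """offline_db: index name -> list of historical values, oldest first."""
    for metric, history in offline_db.items():            # 301b: read history
        if len(history) < 2 * period:
            continue                                       # too little data to train on
        slots = [history[i::period] for i in range(period)]
        expected = [fmean(slot) for slot in slots]         # 302b: predict the next period
        residuals = [abs(v - expected[i % period]) for i, v in enumerate(history)]
        cuts = quantiles(residuals, n=100, method="inclusive")
        threshold = cuts[q - 1]                            # 302b: update model parameter
        model_param_db[metric] = {"deviation_threshold": threshold}   # 303b
        long_term_db[metric] = expected                               # 303b
```

A scheduler (for example a daily cron entry) would call this function; the scheduling mechanism is outside the scope of the sketch.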
Referring to fig. 4, a flowchart of an operation and maintenance alarm method provided in an embodiment of the present invention is shown, where the method includes:
S401: receiving index data to be detected, and detecting the index data using a preset detection algorithm, based on a prediction result of the distribution of the index data obtained in advance from the analysis result of historical data.
S402: and determining whether to alarm or not based on a preset alarm rule aiming at the detection result.
S403: and receiving user feedback after alarming, and updating the alarming rule according to the user feedback.
The above method is a processing method based on an operation and maintenance alarm system, wherein the structure of the operation and maintenance alarm system still refers to fig. 2. The operation and maintenance alarm system is used for detecting and alarming abnormity of index data of equipment or computing resources in a network system. The system comprises: the device comprises an offline model training module 1, an online detection module 2 and an alarm module 3. For newly input data, the online detection module 2 uses various detection algorithms to detect abnormal data therein. The data such as parameters required by each detection algorithm in the online detection module 2 are obtained by calculation through the offline model training module 1. And for the detected abnormal data, the alarm module 3 is used for alarming the user by combining the feedback information of the user.
Preferably, the method further comprises: parameters required by the detection algorithm are updated based on machine learning.
Preferably, the method further comprises: and predicting the value distribution condition of each index data at different moments based on the analysis result of the historical data, and predicting expected values of the periodic index data in a plurality of periods in the future to obtain the prediction result.
Preferably, the method further comprises: and according to the returned feedback information that the abnormality fed back by the user aiming at the index data is ignored, changing a detection algorithm aiming at the index data, and adopting a new detection algorithm to detect the index data again.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily all required by the embodiments of the present invention.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The operation and maintenance alarm system and method provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to aid understanding of the method and its core idea. A person skilled in the art may, following the idea of the present invention, vary the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. An operation and maintenance alarm system for detecting abnormalities in, and raising alarms for, index data of a device or computing resource in a network system, the system comprising an offline model training module, an online detection module, and an alarm module, wherein:
the offline model training module is configured to update, based on machine learning, the parameters required by the detection algorithm used by the online detection module, and to predict the distribution of the index data based on an analysis of historical data;
the online detection module is configured to receive index data to be detected and to detect the index data using a preset detection algorithm, based on the prediction of the distribution of the index data; and
the alarm module is configured to determine, for the detection result of the online detection module, whether to raise an alarm based on a preset alarm rule.
2. The system of claim 1, wherein the offline model training module predicts, based on the analysis of historical data, the value distribution of each item of index data at different moments, and predicts the expected values of periodic index data over a plurality of future periods.
3. The system of claim 1, wherein the offline model training module acquires the index data required for machine learning from an offline database, determines from the acquired index data the parameters required by the detection algorithm used by the online detection module, stores the determined parameters in a model parameter database, and stores the distribution prediction result for the index data in a long-term prediction database.
4. The system according to claim 1, 2 or 3, wherein the online detection module is further configured to change the detection algorithm for the index data according to feedback information, returned by the alarm module, indicating that the user has ignored the abnormality reported for the index data, and to detect the index data again using the new detection algorithm.
5. The system of claim 4, wherein the online detection module obtains the index data required by the detection algorithm in use from an online cache database, and stores the abnormality records in the detection result, together with the feature description information of those records, in an abnormality database.
6. The system according to claim 5, wherein the alarm module determines, for an abnormality detected by the online detection module, whether to raise an alarm according to a preset alarm rule, and updates the alarm rule according to user feedback in combination with the abnormality record and the feature description information of the abnormality record.
7. The system of claim 6, wherein the alarm module obtains the parameters of a semi-supervised learning algorithm from an alarm model database, stores information about the index data for which alarms were raised in an alarm database, and stores user feedback in a user feedback database.
8. An operation and maintenance alarm method for detecting abnormalities in, and raising alarms for, index data of a device or computing resource in a network system, the method comprising:
receiving index data to be detected, and detecting the index data using a preset detection algorithm, based on a prediction of the distribution of the index data obtained in advance from an analysis of historical data; and
determining, for the detection result, whether to raise an alarm based on a preset alarm rule.
9. The method of claim 8, further comprising:
receiving user feedback after an alarm is raised, and updating the alarm rule according to the user feedback.
10. The method of claim 8, further comprising:
updating the parameters required by the detection algorithm based on machine learning.
11. The method of claim 10, further comprising:
predicting, based on the analysis of historical data, the value distribution of each item of index data at different moments, and predicting the expected values of periodic index data over a plurality of future periods, so as to obtain the prediction result.
12. The method of claim 8, 9, 10 or 11, further comprising:
when feedback information is returned indicating that the user has ignored an abnormality reported for the index data, changing the detection algorithm used for that index data, and detecting the index data again with the new detection algorithm.
CN201611155555.9A 2016-12-14 2016-12-14 O&M warning system and method Pending CN106844138A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611155555.9A CN106844138A (en) 2016-12-14 2016-12-14 O&M warning system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611155555.9A CN106844138A (en) 2016-12-14 2016-12-14 O&M warning system and method

Publications (1)

Publication Number Publication Date
CN106844138A true CN106844138A (en) 2017-06-13

Family

ID=59140095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611155555.9A Pending CN106844138A (en) 2016-12-14 2016-12-14 O&M warning system and method

Country Status (1)

Country Link
CN (1) CN106844138A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423205A (en) * 2017-07-11 2017-12-01 北京明朝万达科技股份有限公司 A kind of system failure method for early warning and system for anti-data-leakage system
CN107589735A (en) * 2017-08-31 2018-01-16 远景能源(江苏)有限公司 Photovoltaic O&M robot system
CN107608862A (en) * 2017-10-13 2018-01-19 众安信息技术服务有限公司 Monitoring alarm method, monitoring alarm device and computer-readable recording medium
CN107766204A (en) * 2017-10-10 2018-03-06 曙光信息产业(北京)有限公司 A kind of method and system for checking cluster health status
CN107885642A (en) * 2017-11-29 2018-04-06 小花互联网金融服务(深圳)有限公司 Business monitoring method and system based on machine learning
CN108226548A (en) * 2017-12-29 2018-06-29 江苏汇环环保科技有限公司 A kind of environmental unit operation management system based on life period of an equipment supervision
CN109427177A (en) * 2017-08-25 2019-03-05 贵州白山云科技股份有限公司 A kind of monitoring alarm method and device
CN109583475A (en) * 2018-11-02 2019-04-05 阿里巴巴集团控股有限公司 The monitoring method and device of exception information
CN109766244A (en) * 2019-01-04 2019-05-17 中国银行股份有限公司 A kind of distributed system CPU method for detecting abnormality, device and storage medium
CN109885556A (en) * 2019-01-10 2019-06-14 四川长虹电器股份有限公司 A kind of implementation method of device data model
CN110008100A (en) * 2019-03-08 2019-07-12 阿里巴巴集团控股有限公司 Method and device for web page access amount abnormality detection
CN110188793A (en) * 2019-04-18 2019-08-30 阿里巴巴集团控股有限公司 Data exception analysis method and device
CN110428018A (en) * 2019-08-09 2019-11-08 北京中电普华信息技术有限公司 A kind of predicting abnormality method and device in full link monitoring system
WO2020001642A1 (en) * 2018-06-28 2020-01-02 中兴通讯股份有限公司 Operation and maintenance system and method
CN110703743A (en) * 2019-11-12 2020-01-17 深圳市亲邻科技有限公司 Equipment failure prediction and detection system and method
CN110795324A (en) * 2019-10-30 2020-02-14 中国银联股份有限公司 Data processing method and device
CN110865929A (en) * 2019-11-26 2020-03-06 携程旅游信息技术(上海)有限公司 Abnormity detection early warning method and system
CN111204128A (en) * 2018-11-22 2020-05-29 精工爱普生株式会社 Electronic device
CN111204129A (en) * 2018-11-22 2020-05-29 精工爱普生株式会社 Electronic device
CN111325466A (en) * 2020-02-20 2020-06-23 深圳壹账通智能科技有限公司 Intelligent early warning method and system
CN111858231A (en) * 2020-05-11 2020-10-30 北京必示科技有限公司 Single index abnormality detection method based on operation and maintenance monitoring
CN112436968A (en) * 2020-11-23 2021-03-02 恒安嘉新(北京)科技股份公司 Network flow monitoring method, device, equipment and storage medium
CN113194297A (en) * 2021-04-30 2021-07-30 重庆市科学技术研究院 Intelligent monitoring system and method
CN114201374A (en) * 2021-12-07 2022-03-18 华融融通(北京)科技有限公司 Operation and maintenance time sequence data anomaly detection method and system based on hybrid machine learning
US11605093B1 (en) * 2017-02-22 2023-03-14 Amazon Technologies, Inc. Security policy enforcement

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905240A (en) * 2012-12-28 2014-07-02 中国电信股份有限公司 Method and system for active network service fault reminding and processing
CN105323111A (en) * 2015-11-17 2016-02-10 南京南瑞集团公司 Operation and maintenance automation system and method
CN105447518A (en) * 2015-11-19 2016-03-30 航天东方红卫星有限公司 Remote measurement data interpretation system based on K-means

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11605093B1 (en) * 2017-02-22 2023-03-14 Amazon Technologies, Inc. Security policy enforcement
US11810130B2 (en) 2017-02-22 2023-11-07 Amazon Technologies, Inc. Security policy enforcement
CN107423205A (en) * 2017-07-11 2017-12-01 北京明朝万达科技股份有限公司 A kind of system failure method for early warning and system for anti-data-leakage system
CN109427177A (en) * 2017-08-25 2019-03-05 贵州白山云科技股份有限公司 A kind of monitoring alarm method and device
CN107589735A (en) * 2017-08-31 2018-01-16 远景能源(江苏)有限公司 Photovoltaic O&M robot system
CN107766204A (en) * 2017-10-10 2018-03-06 曙光信息产业(北京)有限公司 A kind of method and system for checking cluster health status
CN107608862A (en) * 2017-10-13 2018-01-19 众安信息技术服务有限公司 Monitoring alarm method, monitoring alarm device and computer-readable recording medium
CN107608862B (en) * 2017-10-13 2020-10-27 众安信息技术服务有限公司 Monitoring alarm method, monitoring alarm device and computer readable storage medium
CN107885642A (en) * 2017-11-29 2018-04-06 小花互联网金融服务(深圳)有限公司 Business monitoring method and system based on machine learning
CN108226548A (en) * 2017-12-29 2018-06-29 江苏汇环环保科技有限公司 A kind of environmental unit operation management system based on life period of an equipment supervision
US11947438B2 (en) 2018-06-28 2024-04-02 Xi'an Zhongxing New Software Co., Ltd. Operation and maintenance system and method
WO2020001642A1 (en) * 2018-06-28 2020-01-02 中兴通讯股份有限公司 Operation and maintenance system and method
CN109583475A (en) * 2018-11-02 2019-04-05 阿里巴巴集团控股有限公司 The monitoring method and device of exception information
CN111204128A (en) * 2018-11-22 2020-05-29 精工爱普生株式会社 Electronic device
CN111204129A (en) * 2018-11-22 2020-05-29 精工爱普生株式会社 Electronic device
CN109766244A (en) * 2019-01-04 2019-05-17 中国银行股份有限公司 A kind of distributed system CPU method for detecting abnormality, device and storage medium
CN109885556A (en) * 2019-01-10 2019-06-14 四川长虹电器股份有限公司 A kind of implementation method of device data model
CN109885556B (en) * 2019-01-10 2021-12-21 四川长虹电器股份有限公司 Method for realizing equipment data model
CN110008100A (en) * 2019-03-08 2019-07-12 阿里巴巴集团控股有限公司 Method and device for web page access amount abnormality detection
CN110008100B (en) * 2019-03-08 2023-03-14 创新先进技术有限公司 Method and device for detecting abnormal access volume of web page
CN110188793B (en) * 2019-04-18 2024-02-09 创新先进技术有限公司 Data anomaly analysis method and device
CN110188793A (en) * 2019-04-18 2019-08-30 阿里巴巴集团控股有限公司 Data exception analysis method and device
CN110428018A (en) * 2019-08-09 2019-11-08 北京中电普华信息技术有限公司 A kind of predicting abnormality method and device in full link monitoring system
CN110795324B (en) * 2019-10-30 2023-06-20 中国银联股份有限公司 Data processing method and device
CN110795324A (en) * 2019-10-30 2020-02-14 中国银联股份有限公司 Data processing method and device
CN110703743A (en) * 2019-11-12 2020-01-17 深圳市亲邻科技有限公司 Equipment failure prediction and detection system and method
CN110865929A (en) * 2019-11-26 2020-03-06 携程旅游信息技术(上海)有限公司 Abnormity detection early warning method and system
CN110865929B (en) * 2019-11-26 2024-01-23 携程旅游信息技术(上海)有限公司 Abnormality detection early warning method and system
WO2021164465A1 (en) * 2020-02-20 2021-08-26 深圳壹账通智能科技有限公司 Intelligent early warning method and system
CN111325466A (en) * 2020-02-20 2020-06-23 深圳壹账通智能科技有限公司 Intelligent early warning method and system
CN111858231A (en) * 2020-05-11 2020-10-30 北京必示科技有限公司 Single index abnormality detection method based on operation and maintenance monitoring
CN112436968A (en) * 2020-11-23 2021-03-02 恒安嘉新(北京)科技股份公司 Network flow monitoring method, device, equipment and storage medium
CN112436968B (en) * 2020-11-23 2023-10-17 恒安嘉新(北京)科技股份公司 Network traffic monitoring method, device, equipment and storage medium
CN113194297A (en) * 2021-04-30 2021-07-30 重庆市科学技术研究院 Intelligent monitoring system and method
CN114201374A (en) * 2021-12-07 2022-03-18 华融融通(北京)科技有限公司 Operation and maintenance time sequence data anomaly detection method and system based on hybrid machine learning
CN114201374B (en) * 2021-12-07 2024-08-09 华融融通(北京)科技有限公司 Operation and maintenance time sequence data anomaly detection method and system based on hybrid machine learning

Similar Documents

Publication Publication Date Title
CN106844138A (en) O&M warning system and method
US10877863B2 (en) Automatic prediction system for server failure and method of automatically predicting server failure
CN111212038B (en) Open data API gateway system based on big data artificial intelligence
US11245713B2 (en) Enrichment and analysis of cybersecurity threat intelligence and orchestrating application of threat intelligence to selected network security events
CN110362612B (en) Abnormal data detection method and device executed by electronic equipment and electronic equipment
CN110851342A (en) Fault prediction method, device, computing equipment and computer readable storage medium
CN106664218A (en) Systems and methods for correlating derived metrics for system activity
CN112291266B (en) Data processing method, device, server and storage medium
US20220214948A1 (en) Unsupervised log data anomaly detection
CN115378711B (en) Intrusion detection method and system for industrial control network
CN112948223B (en) Method and device for monitoring running condition
Smrithy et al. Online anomaly detection using non-parametric technique for big data streams in cloud collaborative environment
CN108039971A (en) A kind of alarm method and device
CN110677271B (en) Big data alarm method, device, equipment and storage medium based on ELK
CN114138601A (en) Service alarm method, device, equipment and storage medium
AU2021218217A1 (en) Systems and methods for preventative monitoring using AI learning of outcomes and responses from previous experience.
CN111651652B (en) Emotion tendency identification method, device, equipment and medium based on artificial intelligence
CN112712369A (en) Method and device for monitoring suspicious transactions of anti-money laundering
CN116843395A (en) Alarm classification method, device, equipment and storage medium of service system
Hani et al. Support vector regression for service level agreement violation prediction
Fahrnberger Outlier removal for the reliable condition monitoring of telecommunication services
CN113992496A (en) Abnormal operation warning method and device based on quartile algorithm and computing equipment
Hu et al. A semi-supervised method for digital twin-enabled predictive maintenance in the building industry
Ahirwar et al. Anomaly detection in the services provided by multi cloud architectures: a survey
CN116405287B (en) Industrial control system network security assessment method, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170613