CN111181767A - Monitoring and fault self-healing system and method for complex system - Google Patents

Publication number
CN111181767A
Authority
CN
China
Prior art keywords
fault
monitoring
service
module
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911256239.4A
Other languages
Chinese (zh)
Inventor
杨科
艾国红
黎志碧
唐博
陆陈
冯大川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AVIC Chengdu Aircraft Design and Research Institute
Original Assignee
AVIC Chengdu Aircraft Design and Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AVIC Chengdu Aircraft Design and Research Institute
Priority to CN201911256239.4A
Publication of CN111181767A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L41/0661 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention belongs to the field of data center operation and maintenance technology and relates to a monitoring and fault self-healing system and method for complex systems. The system comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module and a fault processing module. Diagnosis and analysis are carried out using the business relation model and the monitoring index data, and the corresponding fault recovery operation is executed automatically according to the analysis result. This enables rapid fault recovery and prevention, improves operation and maintenance efficiency, and keeps the research and development system running well.

Description

Monitoring and fault self-healing system and method for complex system
Technical Field
The invention belongs to the field of data center operation and maintenance technology and relates to a monitoring and fault self-healing system and method for complex systems.
Background
As the digitalization and informatization of aviation equipment deepen, the overall research and development system has grown large and complex: the number of business systems keeps increasing, the components within each system keep expanding, and the relationships among them become ever more intricate. Monitoring the research and development system effectively, and locating, handling and preventing faults quickly, is therefore essential to keeping it running well.
The traditional approach to monitoring is to classify and enumerate hosts, networks, storage, application software and so on, and to collect the corresponding indexes for each. Fault location then requires cooperation among experts with broad knowledge and rich experience in several fields, and fault handling is largely manual, which is inefficient, repetitive and prone to omissions and mistakes. To improve efficiency, some repetitive work, including system monitoring, fault handling and routine inspection, is done by automated operation and maintenance scripts. Such automation can be regarded as an expert system based on domain knowledge and operation and maintenance scenarios. As systems grow in scale and service types become more complex and diverse, methods that rely on manual judgment often struggle to cope.
To meet the management requirements brought by data center expansion, operation and maintenance monitoring is shifting from traditional monitoring of basic resources to application-centered service monitoring, with faults diagnosed and handled according to dependency relationships.
Commercial systems from the internet industry can already provide monitoring and fault self-healing, but they are costly and mostly assume microservice and containerized applications, so they cannot be used effectively with the traditional professional software systems of the manufacturing industry.
Disclosure of Invention
The purpose of the invention is to provide a simple and effective monitoring and fault self-healing system and method for complex systems that enables rapid fault recovery and prevention, improves operation and maintenance efficiency, and keeps the research and development system running well.
The technical scheme of the invention is as follows:
A monitoring and fault self-healing system for complex systems comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module and a fault processing module, wherein:
the resource and application monitoring module is responsible for collecting monitoring index data from hosts, middleware, the application service layer and service logs;
the business relation model module is responsible for managing the business service topological relations and the service deployment information;
the fault diagnosis and analysis module is responsible for business service alarm processing and fault diagnosis; after diagnosis is completed, the objects on which the fault node depends are placed into a queue to be detected, a detection service analyzes their index data to judge whether they are abnormal, and if an object is abnormal the fault processing module is notified to handle it; the fault diagnosis is based on: a. the monitoring index data of the resource and application monitoring module; b. the service topological relations and service deployment information in the business relation model module;
the fault processing module is responsible for starting fault recovery operation jobs.
Further, in the fault processing module, the fault recovery operations include restarting a service process, cleaning a disk directory and restarting a server host; a fault recovery operation job is executed by a remote management and control Agent installed on the host.
Further, the job description of a fault recovery operation includes a job name, an execution object and a job script.
Further, in the fault diagnosis and analysis module, the ways of judging an abnormality include a static threshold, a period-over-period comparison and an availability check.
A method based on the system comprises the following steps:
Step one: monitoring index data are collected periodically by a data acquisition Agent deployed on each monitored object; the monitored objects include hosts, middleware and the application service layer;
Step two: operation and maintenance personnel construct the service topological relation information and service deployment information of each application system through the business relation model module;
Step three: the fault diagnosis and analysis module checks the service availability of the application service layer at regular intervals; if a service is found to be unavailable, the detection service job is started, the objects on which the fault node depends are put into the queue to be detected according to the business relation model data, and the detection service takes objects from the queue in turn and analyzes their monitoring index data to judge whether the data are abnormal; if they are abnormal, a fault notification message is sent to the fault processing module;
Step four: the fault processing module executes the fault recovery operation through the remote management and control Agent according to the fault notification message.
Further, in step three, the objects on which the fault node depends are put into the queue to be detected as follows: starting from the fault object node, the application service object nodes it depends on are found in the service topological relation graph using a breadth-first search and added to the queue to be detected; at the same time, the deployment location object of each such application service object node is added to the queue according to the service deployment information.
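The queue-filling procedure can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dependency edges in `DEPENDS_ON` are hypothetical (the actual topology of FIG. 2 is not reproduced in the text), the deployment mapping mirrors Table 1 of the description, and the function name `build_detection_queue` is an assumption. The text does not say whether the fault node's own host is enqueued, so this sketch leaves it out.

```python
from collections import deque

# Hypothetical service topology: each service maps to the services it depends on.
# The real edges of FIG. 2 are not given in the text; these are assumptions.
DEPENDS_ON = {
    "Service A": ["Service B", "Service C"],
    "Service B": ["Service D"],
    "Service C": ["Service D", "Service E"],
    "Service D": [],
    "Service E": [],
}

# Service deployment information, mirroring Table 1 of the description.
DEPLOYED_ON = {
    "Service A": "Host A",
    "Service B": "Host B",
    "Service C": "Host B",
    "Service D": "Host C",
    "Service E": "Host C",
}


def build_detection_queue(failed, depends_on, deployed_on):
    """Return the objects to inspect for a failed service, in BFS order."""
    to_detect = []            # the queue to be detected
    seen = {failed}           # avoids enqueueing a service or host twice
    frontier = deque([failed])
    while frontier:
        node = frontier.popleft()
        for dep in depends_on.get(node, []):
            if dep in seen:
                continue
            seen.add(dep)
            frontier.append(dep)
            to_detect.append(dep)
            # Enqueue the deployment location of the dependency as well.
            host = deployed_on.get(dep)
            if host is not None and host not in seen:
                seen.add(host)
                to_detect.append(host)
    return to_detect


queue = build_detection_queue("Service A", DEPENDS_ON, DEPLOYED_ON)
print(queue)
```

Because the topology is a DAG, a service reachable by several paths (Service D here) is visited once; the `seen` set also keeps a host shared by several services from being inspected repeatedly.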
Further, in step one, a preset set of monitoring indexes is collected for each kind of monitored object: the host monitoring indexes include CPU utilization, memory utilization, disk space usage, network traffic, number of TCP connections and number of processes; the middleware monitoring indexes include process liveness, memory used by the JVM, number of sessions and thread pool size; the application service layer monitoring indexes include service availability and response time, obtained by actively probing the target service over HTTP/TCP.
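A probe of this kind might look like the following Python sketch, which uses only the standard library. The function names, the 3-second default timeout and the rule that HTTP status codes from 200 to 399 count as available are assumptions made for illustration; the text only states that availability and response time are obtained over HTTP/TCP.

```python
import socket
import time
import urllib.error
import urllib.request


def probe_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection; return (available, response_time_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            available = True
    except OSError:
        # Refused, unreachable, or timed out: treat the service as unavailable.
        available = False
    return available, time.monotonic() - start


def probe_http(url, timeout=3.0):
    """Issue an HTTP GET; return (available, response_time_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Assumption: any 2xx or 3xx status counts as available.
            available = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        available = False
    return available, time.monotonic() - start
```

A collector would call one of these on a schedule and store the two returned values as the availability and response-time indexes of the target service.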
Further, in step one, the collected monitoring data are stored in the monitoring database of the resource and application monitoring module, with the following data point format: monitoring object name + label + index name + monitoring value + timestamp.
Further, in step three, the fault notification message includes the fault node, the abnormality type and the job name of the fault recovery operation.
The beneficial effects of the invention are as follows: by constructing an application-centered business relation model, the invention clearly describes the relationships among application services and between application services and their deployment nodes; fault diagnosis and analysis is performed on the basis of the business relation model and the monitoring index data, and the corresponding fault recovery operation is executed automatically according to the analysis result, so that faults can be recovered from and prevented quickly, manual omissions and misoperation are avoided, labor cost is saved, operation and maintenance efficiency is improved, and the research and development system is kept running well.
Drawings
FIG. 1 is a block diagram of a monitoring and fault self-healing system according to the present invention;
FIG. 2 is a schematic diagram of a business relationship model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the processing of the queue to be detected according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings.
The invention is a monitoring and fault self-healing system for complex systems, comprising a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module and a fault processing module. The system architecture is shown in FIG. 1.
The functions of each module of the system are as follows:
a) Resource and application monitoring module: responsible for collecting monitoring index data from hosts, middleware, the application service layer and service logs.
b) Business relation model module: responsible for managing the business service topological relations and the service deployment information.
c) Fault diagnosis and analysis module: responsible for business service alarm processing and fault diagnosis. During diagnosis, the objects on which the fault node depends are put into the queue to be detected according to the business relation model, and the detection service analyzes their index data to judge whether they are abnormal. If an abnormality is found, the fault processing module is notified to handle it.
d) Fault processing module: responsible for starting fault recovery operations, for example bringing a process back up through the remote management and control Agent.
In the business relation model module, the business service topological relations are described by a directed acyclic graph (DAG), and the service deployment information is described in the form of a data table, as shown in FIG. 2 and Table 1.
Table 1. Service deployment information (schematic)
Service name    Deployment location object
Service A       Host A
Service B       Host B
Service C       Host B
Service D       Host C
Service E       Host C
In the fault diagnosis and analysis module, when an application service is unavailable, the monitored object nodes on which the fault node depends are placed into the queue to be detected (as shown in FIG. 3) according to the topology DAG and the service deployment information table. The detection service then analyzes the monitoring index data of the object nodes in the queue in turn to determine whether they are abnormal; the determination can use a static threshold, a period-over-period comparison, an availability check, and so on. When a monitored object is determined to be abnormal, there are two ways of handling it: notifying operation and maintenance personnel by message, and notifying the fault processing module to carry out self-healing.
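The three determination modes can be illustrated with the short Python sketch below. The text names the modes but does not define them, so the formulas are assumptions: the static threshold is treated as an upper bound, and the period-over-period check compares the relative change between the current and the previous collection period against a limit. All function and parameter names are hypothetical.

```python
def exceeds_static_threshold(value, threshold):
    """Static threshold: abnormal when the index exceeds a fixed upper bound."""
    return value > threshold


def exceeds_period_over_period(current, previous, max_change_ratio):
    """Period-over-period: abnormal when the relative change from the
    previous period is larger than max_change_ratio (e.g. 0.3 for 30%)."""
    if previous == 0:
        # No baseline to divide by; any non-zero value counts as a change.
        return current != 0
    return abs(current - previous) / abs(previous) > max_change_ratio


def is_unavailable(available):
    """Availability: abnormal when the latest probe reported the object down."""
    return not available


# Example: disk usage of 92% against a 90% limit, and a TCP connection
# count that jumped from 100 to 150 between two periods.
print(exceeds_static_threshold(92.0, 90.0))
print(exceeds_period_over_period(150.0, 100.0, 0.3))
```

Which mode applies to which index, and the limits used, would be configured per monitored object in a real deployment.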
In the fault processing module, the self-healing operations include restarting a service process, cleaning a disk directory and restarting a server host. A job description comprises a job name, an execution object and a job script. Jobs are executed through the remote management and control Agent installed on the host.
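A job description with these three fields could be represented as in the following Python sketch. The class name, the registry keyed by job name, and the example job names and shell commands are all hypothetical; the text specifies only the three fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryJob:
    job_name: str          # name used to look the job up
    execution_object: str  # host on which the Agent runs the script
    job_script: str        # script text handed to the Agent


# Hypothetical job registry; names and commands are illustrative only.
JOBS = {
    job.job_name: job
    for job in [
        RecoveryJob("restart-service-b", "Host B", "systemctl restart service-b"),
        RecoveryJob("clean-app-tmp", "Host C", "find /var/tmp/app -type f -mtime +7 -delete"),
        RecoveryJob("reboot-host-c", "Host C", "shutdown -r now"),
    ]
}

job = JOBS["restart-service-b"]
print(job.execution_object, "->", job.job_script)
```

Keying the registry by job name lets the fault processing module resolve a job from the name alone before handing it to the Agent on the execution object.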
The method of the invention, based on the above system, comprises the following steps:
Step one: monitoring index data are collected periodically (for example, at an interval T) by a data acquisition Agent deployed on each monitored object. A preset set of monitoring indexes is collected for each kind of monitored object: the host monitoring indexes include CPU utilization, memory utilization, disk space usage, network traffic, number of TCP connections, number of processes and the like; the middleware monitoring indexes include process liveness, memory used by the JVM, number of sessions, thread pool size and the like; the application service monitoring indexes include service availability, response time and the like, obtained by actively probing the target service over HTTP/TCP. The collected monitoring data are stored in the monitoring database of the resource and application monitoring module, with the following data point format:
monitoring object name + label + index name + monitoring value + timestamp.
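As an illustration of this five-field data point, the Python sketch below models one sample and renders it as a single text line. The text lists the fields but not their encoding, so the key=value label syntax, the space separator and the Unix-seconds timestamp are assumptions, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    object_name: str   # monitoring object name, e.g. a host or service
    labels: dict       # label(s) qualifying the sample
    metric: str        # index name
    value: float       # monitoring value
    timestamp: int     # assumption: Unix time in seconds


def format_point(point):
    """Render a data point as one line, fields in the order given in the text."""
    labels = ",".join(f"{k}={v}" for k, v in sorted(point.labels.items()))
    return f"{point.object_name} {labels} {point.metric} {point.value} {point.timestamp}"


sample = DataPoint("Host B", {"mount": "/data"}, "disk_usage_percent", 87.5, 1575936000)
print(format_point(sample))  # Host B mount=/data disk_usage_percent 87.5 1575936000
```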
Step two: operation and maintenance personnel construct the service topological relation information and service deployment information of each application system through the business relation model module.
Step three: the fault diagnosis and analysis module checks service availability at regular intervals. If a service is found to be unavailable, the fault detection service job is started and the objects on which the fault node depends are put into the queue to be detected according to the business relation model data. The detection service takes objects from the queue in turn, analyzes their monitoring index data and judges whether the data are abnormal. If they are abnormal, the fault processing module is notified; the notification carries the fault node, the abnormality type and the name of the fault recovery operation job.
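The notification sent in step three can be pictured as a small structured message. The Python sketch below is one possible shape, assuming JSON as the wire format; the text specifies only the three pieces of information carried, so the field names, the example values and the choice of JSON are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FaultNotification:
    fault_node: str     # the object found to be abnormal
    anomaly_type: str   # how the abnormality was determined or classified
    job_name: str       # fault recovery operation job to run


def encode(message):
    """Serialize the notification for transport (JSON is an assumption)."""
    return json.dumps(asdict(message), sort_keys=True)


def decode(raw):
    """Rebuild the notification on the fault processing side."""
    return FaultNotification(**json.loads(raw))


msg = FaultNotification("Service B", "unavailable", "restart-service-b")
wire = encode(msg)
print(wire)
assert decode(wire) == msg
```

Carrying the job name in the message means the fault processing module needs no diagnosis logic of its own: it resolves the named job and passes it to the Agent.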
Step four: the fault processing module executes the fault recovery operation through the remote management and control Agent according to the fault notification message.

Claims (9)

1. A monitoring and fault self-healing system for complex systems, characterized in that the system comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module and a fault processing module, wherein:
the resource and application monitoring module is responsible for collecting monitoring index data from hosts, middleware, the application service layer and service logs;
the business relation model module is responsible for managing the business service topological relations and the service deployment information;
the fault diagnosis and analysis module is responsible for business service alarm processing and fault diagnosis; after diagnosis is completed, the objects on which the fault node depends are placed into a queue to be detected, a detection service analyzes their index data to judge whether they are abnormal, and if an object is abnormal the fault processing module is notified to handle it; the fault diagnosis is based on: a. the monitoring index data of the resource and application monitoring module; b. the service topological relations and service deployment information in the business relation model module;
the fault processing module is responsible for starting fault recovery operation jobs.
2. The monitoring and fault self-healing system according to claim 1, wherein: in the fault processing module, the fault recovery operations include restarting a service process, cleaning a disk directory and restarting a server host; a fault recovery operation job is executed by a remote management and control Agent installed on the host.
3. The monitoring and fault self-healing system according to claim 2, wherein: the job description of a fault recovery operation comprises a job name, an execution object and a job script.
4. The monitoring and fault self-healing system according to claim 1, wherein: in the fault diagnosis and analysis module, the ways of judging an abnormality include a static threshold, a period-over-period comparison and an availability check.
5. A monitoring and fault self-healing method based on the system of claim 1, characterized in that the method comprises the following steps:
step one: monitoring index data are collected periodically by a data acquisition Agent deployed on each monitored object; the monitored objects include hosts, middleware and the application service layer;
step two: operation and maintenance personnel construct the service topological relation information and service deployment information of each application system through the business relation model module;
step three: the fault diagnosis and analysis module checks the service availability of the application service layer at regular intervals; if a service is found to be unavailable, the detection service job is started, the objects on which the fault node depends are put into the queue to be detected according to the business relation model data, and the detection service takes objects from the queue in turn and analyzes their monitoring index data to judge whether the data are abnormal; if they are abnormal, a fault notification message is sent to the fault processing module;
step four: the fault processing module executes the fault recovery operation through the remote management and control Agent according to the fault notification message.
6. The monitoring and fault self-healing method according to claim 5, wherein: in step three, the objects on which the fault node depends are put into the queue to be detected as follows: starting from the fault object node, the application service object nodes it depends on are found in the service topological relation graph using a breadth-first search and added to the queue to be detected; at the same time, the deployment location object of each such application service object node is added to the queue according to the service deployment information.
7. The monitoring and fault self-healing method according to claim 5, wherein: in step one, a preset set of monitoring indexes is collected for each kind of monitored object: the host monitoring indexes include CPU utilization, memory utilization, disk space usage, network traffic, number of TCP connections and number of processes; the middleware monitoring indexes include process liveness, memory used by the JVM, number of sessions and thread pool size; the application service layer monitoring indexes include service availability and response time, obtained by actively probing the target service over HTTP/TCP.
8. The monitoring and fault self-healing method according to claim 7, wherein: in step one, the collected monitoring data are stored in the monitoring database of the resource and application monitoring module, with the following data point format: monitoring object name + label + index name + monitoring value + timestamp.
9. The monitoring and fault self-healing method according to claim 5, wherein: in step three, the fault notification message includes the fault node, the abnormality type and the job name of the fault recovery operation.

Application number: CN201911256239.4A
Title: Monitoring and fault self-healing system and method for complex system
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication number: CN111181767A (en)
Publication date: 2020-05-19
Status: Pending
Family ID: 70657200
Country: CN

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111865695A (en) * 2020-07-28 2020-10-30 浪潮云信息技术股份公司 Method and system for automatic fault handling in cloud environment
CN111858176A (en) * 2020-07-22 2020-10-30 欧冶云商股份有限公司 Remote monitoring fault self-healing system and method
CN111970168A (en) * 2020-08-11 2020-11-20 北京点众科技股份有限公司 Method and device for monitoring full-link service node and storage medium
CN112149975A (en) * 2020-09-11 2020-12-29 杭州东方通信软件技术有限公司 APM monitoring system and method based on artificial intelligence
CN112350862A (en) * 2020-10-30 2021-02-09 广州市汇聚支付电子科技有限公司 Monitoring alarm and fault self-healing system
CN113010331A (en) * 2021-03-12 2021-06-22 腾讯科技(深圳)有限公司 Abnormal data processing method and device and computer readable storage medium
CN113342560A (en) * 2021-06-04 2021-09-03 中国工商银行股份有限公司 Fault processing method, system, electronic equipment and storage medium
CN113590370A (en) * 2021-08-06 2021-11-02 北京百度网讯科技有限公司 Fault processing method, device, equipment and storage medium
CN114443443A (en) * 2022-04-11 2022-05-06 北京优特捷信息技术有限公司 Fault self-healing method, device, equipment and storage medium
WO2022252860A1 (en) * 2021-06-01 2022-12-08 中国民航信息网络股份有限公司 Event processing method and apparatus, and computer device and storage medium
CN116032723A (en) * 2022-12-20 2023-04-28 浪潮云信息技术股份公司 Fault root cause combination analysis method for application
WO2023104219A1 (en) * 2021-12-07 2023-06-15 广州地铁集团有限公司 Solution method based on internet of things rail transit for software and application fault self-healing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733941A (en) * 2016-08-11 2018-02-23 南京联成科技发展股份有限公司 A kind of realization method and system of the data acquisition platform based on big data
CN109343987A (en) * 2018-08-20 2019-02-15 科大国创软件股份有限公司 IT system fault diagnosis and restorative procedure, device, equipment, storage medium
CN109783322A (en) * 2018-11-22 2019-05-21 远光软件股份有限公司 A kind of monitoring analysis system and its method of enterprise information system operating status
CN109787816A (en) * 2018-12-28 2019-05-21 北京奇安信科技有限公司 Traffic failure localization method, device, equipment and medium
CN110428018A (en) * 2019-08-09 2019-11-08 北京中电普华信息技术有限公司 A kind of predicting abnormality method and device in full link monitoring system
CN110430071A (en) * 2019-07-19 2019-11-08 云南电网有限责任公司信息中心 Service node fault self-recovery method, apparatus, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200519)