CN111181767A

CN111181767A - Monitoring and fault self-healing system and method for complex system

Info

Publication number: CN111181767A
Application number: CN201911256239.4A
Authority: CN
Inventors: 杨科; 艾国红; 黎志碧; 唐博; 陆陈; 冯大川
Original assignee: AVIC Chengdu Aircraft Design and Research Institute
Current assignee: AVIC Chengdu Aircraft Design and Research Institute
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-05-19

Abstract

The invention belongs to the field of data center operation and maintenance technology and relates to a monitoring and fault self-healing system and method for complex systems. The system comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module, and a fault processing module. Diagnosis and analysis are carried out according to the business relation model and the monitoring index data, and the corresponding fault recovery operation is executed automatically according to the analysis result. This enables rapid fault recovery and prevention, improves operation and maintenance efficiency, and helps keep the research and development system running well.

Description

Monitoring and fault self-healing system and method for complex system

Technical Field

The invention belongs to the field of data center operation and maintenance technology and relates to a monitoring and fault self-healing system and method for complex systems.

Background

With the steady advance of digitalization and informatization of aviation equipment, the overall research and development system has grown increasingly large and complex: the number of business systems keeps increasing, the components within each system keep expanding, and the relationships among them become ever more intricate. It is therefore necessary to monitor the research and development system effectively, and to locate, handle, and prevent faults quickly, so that the system keeps running well.

The traditional monitoring approach sorts and enumerates hosts, networks, storage, application software, and the like, and collects the corresponding indexes for monitoring. Fault location requires cooperation among experts with broad knowledge and rich experience in several fields, and fault handling is largely manual, which is inefficient, repetitive, and prone to omissions and errors. To improve operation and maintenance efficiency, some repetitive work, including system monitoring, fault handling, and daily inspection, is completed by automated operation and maintenance scripts. Automated operation and maintenance of this kind can be regarded as an expert system based on industry domain knowledge and operation and maintenance scenarios. As system scale expands and service types become more complex and diverse, methods that rely on manual judgment often struggle to cope with operation and maintenance problems.

To meet the operation management requirements brought by the expanding scale of data centers, operation and maintenance monitoring is shifting from traditional monitoring oriented to basic resources toward application-centered service monitoring, in which faults are diagnosed and handled according to dependency relationships.

Commercial systems in the internet field can already realize monitoring and fault self-healing, but they are costly, and most of them are built for micro-service and containerized applications, so they cannot be used effectively with the traditional professional software systems of the manufacturing industry.

Disclosure of Invention

The purpose of the invention is to provide a simple and effective monitoring and fault self-healing system and method for complex systems that can recover from and prevent faults quickly, improve operation and maintenance efficiency, and keep the research and development system running well.

The technical scheme of the invention is as follows:

A monitoring and fault self-healing system for complex systems comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module, and a fault processing module, wherein:

Resource and application monitoring module: responsible for collecting monitoring index data from hosts, middleware, the application service layer, and service logs.

Business relation model module: responsible for managing the service topology relationships and the service deployment information.

Fault diagnosis and analysis module: responsible for business service alarm processing and fault diagnosis. After diagnosis is completed, the objects on which the faulty node depends are placed into a to-be-detected queue; the detection service analyzes their index data and judges whether each object is abnormal; if an object is abnormal, the fault processing module is notified to handle it. Fault diagnosis is based on: a. the monitoring index data of the resource and application monitoring module; b. the service topology relationships and service deployment information in the business relation model module.

Fault processing module: responsible for initiating fault recovery jobs.

Further, in the fault processing module, the fault recovery operations include restarting a service process, cleaning a disk directory, and restarting a server host; the fault recovery job is executed by a remote management and control Agent installed on the host.

Further, the description of a fault recovery job includes a job name, an execution object, and a job script.
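The three-part job description above can be pictured as a small record. The following is a minimal sketch only, not the implementation of the invention; the class name `RecoveryJob`, the helper `render_command`, and the `systemctl` script text are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryJob:
    """Job description with the three fields named in the text."""
    name: str    # job name
    target: str  # execution object, e.g. the host the Agent runs on
    script: str  # job script handed to the remote Agent (hypothetical content)


def render_command(job: RecoveryJob) -> str:
    """Format the job as a single line that an Agent might log before running it."""
    return f"[{job.name}] on {job.target}: {job.script}"


# Hypothetical example: restart a service process on Host B.
restart = RecoveryJob(
    name="restart-service-b",
    target="Host B",
    script="systemctl restart service-b",  # assumed command, not from the source
)
print(render_command(restart))
```

Keeping the job as plain data, separate from the Agent that executes it, is one way to let the same description be stored, looked up by name, and dispatched later.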

Further, in the fault diagnosis and analysis module, the abnormality determination modes include a static threshold, a period-over-period comparison (ring ratio), and an availability check.
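The three determination modes can be sketched as simple predicates. This is an illustrative sketch under stated assumptions: the function names, the "greater than" direction of the threshold, and the definition of the period-over-period change as a relative difference against the previous period are assumptions, since the text does not fix them.

```python
def exceeds_static_threshold(value: float, threshold: float) -> bool:
    """Static threshold: abnormal when the value is above a fixed limit (assumed direction)."""
    return value > threshold


def exceeds_ring_ratio(current: float, previous: float, max_change: float) -> bool:
    """Period-over-period check: abnormal when the relative change versus the
    previous period is larger than max_change (e.g. 0.3 for 30%)."""
    if previous == 0:
        # Relative change is undefined; treat any non-zero value as abnormal (assumption).
        return current != 0
    return abs(current - previous) / abs(previous) > max_change


def is_unavailable(probe_ok: bool) -> bool:
    """Availability check: abnormal when the most recent probe failed."""
    return not probe_ok
```

For example, with `max_change=0.3`, a rise from 100 to 150 is flagged while a rise from 100 to 110 is not.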

A method based on the system comprises the following steps:

Step one: monitoring index data are acquired periodically by a data acquisition Agent deployed on each monitored object; the monitored objects include hosts, middleware, and the application service layer.

Step two: operation and maintenance personnel construct the service topology relationship information and the service deployment information of each application system through the business relation model module.

Step three: the fault diagnosis and analysis module checks the service availability of the application service layer at regular intervals. If a service is found to be unavailable, the detection service job is started: the objects on which the faulty node depends are put into the to-be-detected queue according to the business relation model data, and the detection service takes the queued objects in order, analyzes their monitoring index data, and judges whether the index data are abnormal. If they are abnormal, a fault notification message is sent to notify the fault processing module.

Step four: the fault processing module executes the fault recovery job through the remote management and control Agent according to the fault notification message.

Further, in step three, the objects on which the faulty node depends are put into the to-be-detected queue as follows: starting from the faulty object node, the application service object nodes it depends on are searched in the service topology graph using a breadth-first search algorithm and are added to the to-be-detected queue; at the same time, the deployment location objects of those application service object nodes are added to the to-be-detected queue according to the service deployment information.
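The queueing procedure above can be sketched with a standard breadth-first traversal. The sketch rests on several assumptions: the dependency edges in `DEPENDS_ON` are hypothetical (the edges of FIG. 2 are not given in the text), the traversal follows dependencies transitively, and each deployment location is enqueued right after the first service found on it. The deployment mapping follows Table 1.

```python
from collections import deque

# Hypothetical dependency edges: service -> services it depends on.
DEPENDS_ON = {
    "Service A": ["Service B", "Service C"],
    "Service B": ["Service D"],
    "Service C": ["Service E"],
}

# Service deployment information, as listed in Table 1.
DEPLOYED_ON = {
    "Service A": "Host A",
    "Service B": "Host B",
    "Service C": "Host B",
    "Service D": "Host C",
    "Service E": "Host C",
}


def build_detection_queue(failed, depends_on, deployed_on):
    """Breadth-first search from the failed node; enqueue each dependency and its host once."""
    queue = deque()          # the to-be-detected queue
    seen = {failed}          # nodes and hosts already handled
    frontier = deque([failed])
    while frontier:
        node = frontier.popleft()
        for dep in depends_on.get(node, []):
            if dep in seen:
                continue
            seen.add(dep)
            frontier.append(dep)
            queue.append(dep)
            host = deployed_on.get(dep)
            if host is not None and host not in seen:
                seen.add(host)
                queue.append(host)
    return queue


print(list(build_detection_queue("Service A", DEPENDS_ON, DEPLOYED_ON)))
```

Whether the faulty node's own host should also be enqueued is not stated in the text; this sketch leaves it out.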

Further, in step one, preset monitoring index data are collected for each type of monitored object: the host monitoring indexes include CPU utilization, memory utilization, disk space occupancy, network traffic, number of TCP connections, and number of processes; the middleware monitoring indexes include process liveness, memory occupied by the JVM, number of sessions, and thread pool size; the application service layer monitoring indexes include service availability and response time, obtained by dial-testing (probing) the target service over HTTP/TCP.
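A TCP dial test of the kind mentioned above can be sketched with the Python standard library. This is only one possible probe, shown under assumptions: the function name `tcp_dial_test`, the 2-second default timeout, and the choice to report the elapsed connect time as the response time are illustrative and not taken from the source.

```python
import socket
import time


def tcp_dial_test(host: str, port: int, timeout: float = 2.0):
    """Try to open a TCP connection; return (available, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

An HTTP dial test would follow the same pattern but additionally check the response status; it is omitted here to keep the sketch short.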

Further, in step one, the collected monitoring data are stored in a monitoring database of the resource and application monitoring module, and each data point has the format: monitored object name + label + index name + monitored value + timestamp.
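The five-part data point can be sketched as a record with a text encoding. The field names, the use of key=value pairs for labels, and the space-separated encoding are assumptions made for illustration; the source specifies only the five components and their order.

```python
from dataclasses import dataclass


@dataclass
class DataPoint:
    """One monitoring sample: object name + label + index name + value + timestamp."""
    object_name: str
    labels: dict      # hypothetical key=value labels
    metric: str       # index name
    value: float
    timestamp: int    # seconds since the epoch (assumed unit)

    def encode(self) -> str:
        """Serialize the five components in the order given in the text."""
        tags = ",".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.object_name} {tags} {self.metric} {self.value} {self.timestamp}"


point = DataPoint("Host B", {"env": "prod"}, "cpu_usage", 87.5, 1575936000)
print(point.encode())
```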

Further, in step three, the fault notification message includes the faulty node, the abnormality type, and the name of the fault recovery job.
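A fault notification message with these three fields might be exchanged between the modules as JSON. The sketch below is an assumption about the wire format, which the source does not specify; the class name, field names, and example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FaultNotification:
    failed_node: str   # the faulty node
    anomaly_type: str  # the abnormality type
    job_name: str      # name of the fault recovery job to run


def encode(message: FaultNotification) -> str:
    """Serialize the notification to a JSON string with stable key order."""
    return json.dumps(asdict(message), sort_keys=True)


def decode(raw: str) -> FaultNotification:
    """Rebuild the notification from its JSON form."""
    return FaultNotification(**json.loads(raw))


sent = FaultNotification("Host B", "static_threshold", "clean-disk-directory")
received = decode(encode(sent))
```

Carrying the job name in the message lets the fault processing module look up the matching job description without re-running the diagnosis.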

The beneficial effects of the invention are as follows. By constructing an application-centered business relation model, the invention clearly describes both the relationships among application services and the relationships between application services and their deployment nodes. Fault diagnosis and analysis are performed automatically according to the business relation model and the monitoring index data, and the corresponding fault recovery operation is executed according to the analysis result. Faults can thus be recovered from and prevented quickly, manual omissions and misoperations are reduced, labor cost is saved, operation and maintenance efficiency is improved, and the research and development system is kept running well.

Drawings

FIG. 1 is a block diagram of a monitoring and fault self-healing system according to the present invention;

FIG. 2 is a schematic diagram of a business relationship model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the to-be-detected queue process according to an embodiment of the present invention.

Detailed Description

The following further describes embodiments of the present invention with reference to the accompanying drawings.

The invention relates to a monitoring and fault self-healing system for complex systems, which comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module, and a fault processing module. The system framework is shown in FIG. 1.

The functions of each module of the system are as follows:

a) Resource and application monitoring module: responsible for collecting monitoring index data from hosts, middleware, the application service layer, and service logs.

b) Business relation model module: responsible for managing the business service topology relationships and the service deployment information.

c) Fault diagnosis and analysis module: responsible for business service alarm processing and fault diagnosis. During diagnosis, the objects on which the faulty node depends are put into the to-be-detected queue according to the business relation model, and the detection service analyzes their index data to judge whether they are abnormal. If an object is judged abnormal, the fault processing module is notified to handle it.

d) Fault processing module: responsible for starting fault recovery operations, such as restarting a process through the remote management and control Agent.

In the business relation model module, the business service topology is described by a directed acyclic graph (DAG), and the service deployment information is described in the form of a data table, as shown in FIG. 2 and Table 1.

Table 1: Schematic service deployment information

Service name	Deployment location object
Service A	Host A
Service B	Host B
Service C	Host B
Service D	Host C
Service E	Host C
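Because the topology is required to be a DAG and every service needs a deployment location, the model can be sanity-checked before it is used for diagnosis. The following sketch uses the standard-library `graphlib` module (Python 3.9+). The dependency edges are hypothetical, since the edges of FIG. 2 are not given in the text; the deployment mapping follows Table 1, and the function name `validate_model` is an assumption.

```python
from graphlib import CycleError, TopologicalSorter

# Deployment information from Table 1.
DEPLOYMENT = {
    "Service A": "Host A",
    "Service B": "Host B",
    "Service C": "Host B",
    "Service D": "Host C",
    "Service E": "Host C",
}

# Hypothetical topology: service -> services it depends on.
TOPOLOGY = {
    "Service A": ["Service B", "Service C"],
    "Service B": ["Service D"],
    "Service C": ["Service E"],
}


def validate_model(topology, deployment):
    """Return a list of problems; an empty list means the model is consistent."""
    problems = []
    try:
        # static_order() raises CycleError if the graph is not acyclic.
        tuple(TopologicalSorter(topology).static_order())
    except CycleError as exc:
        problems.append(f"dependency cycle: {exc.args[1]}")
    services = set(topology) | {d for deps in topology.values() for d in deps}
    for service in sorted(services - set(deployment)):
        problems.append(f"no deployment location for {service}")
    return problems


print(validate_model(TOPOLOGY, DEPLOYMENT))
```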

In the fault diagnosis and analysis module, when an application service is unavailable, the monitored object nodes on which the faulty node depends are imported into the to-be-detected queue (as shown in FIG. 3) according to the topology DAG and the service deployment information table. The detection service then analyzes the monitoring index data of the object nodes in the queue in order to determine whether they are abnormal; the abnormality determination mode may be a static threshold, a period-over-period comparison (ring ratio), an availability check, and the like. When a monitored object is determined to be abnormal, there are two handling modes: notifying the operation and maintenance personnel by message, and notifying the fault processing module to perform self-healing.

In the fault processing module, the self-healing operations include restarting a service process, cleaning a disk directory, and restarting a server host. The job description comprises a job name, an execution object, and a job script. The job is executed through the remote management and control Agent installed on the host.

The method of the invention, based on the above system, comprises the following steps:

Step one: monitoring index data are acquired periodically (for example, every period T) by a data acquisition Agent deployed on each monitored object. Preset monitoring index data are collected for each type of monitored object: the host monitoring indexes include CPU utilization, memory utilization, disk space occupancy, network traffic, number of TCP connections, number of processes, and the like; the middleware monitoring indexes include process liveness, memory occupied by the JVM, number of sessions, thread pool size, and the like; the application service monitoring indexes include service availability, response time, and the like, obtained by dial-testing (probing) the target service over HTTP/TCP. The collected monitoring data are stored in a monitoring database of the resource and application monitoring module, and each data point has the format: monitored object name + label + index name + monitored value + timestamp.

Step two: the operation and maintenance personnel construct the service topology relationship information and the service deployment information of each application system through the business relation model module.

Step three: the fault diagnosis and analysis module checks service availability at regular intervals. If a service is found to be unavailable, the fault detection service job is started: the objects on which the faulty node depends are put into the to-be-detected queue according to the business relation model data, and the detection service takes the queued objects in order, analyzes their monitoring index data, and judges whether the index data are abnormal. If they are abnormal, the fault processing module is notified; the notification contains the faulty node, the abnormality type, and the name of the fault recovery job.

Step four: the fault processing module executes the fault recovery job through the remote management and control Agent according to the fault notification message.
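The hand-off from the detection service to the fault processing module can be sketched as a loop that drains the to-be-detected queue. This is a simplified illustration, not the method as claimed: it uses only the static threshold mode, and the function name, the metric names, the threshold values, and the metric-to-job mapping are all hypothetical.

```python
from collections import deque


def run_detection(queue, latest_metrics, thresholds, job_for_metric):
    """Take queued objects in order and emit one notification per abnormal index."""
    notifications = []
    while queue:
        obj = queue.popleft()
        for metric, value in latest_metrics.get(obj, {}).items():
            limit = thresholds.get(metric)
            if limit is None or value <= limit:
                continue  # no rule for this index, or value within the limit
            notifications.append({
                "failed_node": obj,
                "anomaly_type": f"static_threshold:{metric}",
                # None means no self-healing job is mapped (assumption:
                # such cases would go to the operators only).
                "job_name": job_for_metric.get(metric),
            })
    return notifications


# Hypothetical inputs for one detection pass.
pending = deque(["Service B", "Host B"])
metrics = {"Host B": {"disk_usage": 97.0, "cpu_usage": 20.0}}
limits = {"disk_usage": 90.0, "cpu_usage": 95.0}
jobs = {"disk_usage": "clean-disk-directory"}
alerts = run_detection(pending, metrics, limits, jobs)
```

Each resulting notification carries the three fields named in step three, so the fault processing module can select the recovery job by name.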

Claims

1. A monitoring and fault self-healing system for complex systems is characterized in that: the system comprises a resource and application monitoring module, a business relation model module, a fault diagnosis and analysis module and a fault processing module, wherein:

2. The monitoring and fault self-healing system according to claim 1, wherein: in the fault processing module, the fault recovery operations comprise restarting a service process, cleaning a disk directory, and restarting a server host; the fault recovery job is executed by a remote management and control Agent installed on the host.

3. The monitoring and fault self-healing system according to claim 2, wherein: the description of the fault recovery job comprises a job name, an execution object, and a job script.

4. The monitoring and fault self-healing system according to claim 1, wherein: in the fault diagnosis and analysis module, the abnormality determination modes comprise a static threshold, a period-over-period comparison (ring ratio), and an availability check.

5. A monitoring and fault self-healing method based on the system of claim 1, characterized in that the method comprises the following steps:

6. The monitoring and fault self-healing method according to claim 5, wherein: in step three, the objects on which the faulty node depends are put into the to-be-detected queue as follows: starting from the faulty object node, the application service object nodes it depends on are searched in the service topology graph using a breadth-first search algorithm and are added to the to-be-detected queue; at the same time, the deployment location objects of those application service object nodes are added to the to-be-detected queue according to the service deployment information.

7. The monitoring and fault self-healing method according to claim 5, wherein: in step one, preset monitoring index data are collected for each type of monitored object, wherein the host monitoring indexes comprise CPU utilization, memory utilization, disk space occupancy, network traffic, number of TCP connections, and number of processes; the middleware monitoring indexes comprise process liveness, memory occupied by the JVM, number of sessions, and thread pool size; the application service layer monitoring indexes comprise service availability and response time, obtained by dial-testing (probing) the target service over HTTP/TCP.

8. The monitoring and fault self-healing method according to claim 7, wherein: in step one, the collected monitoring data are stored in a monitoring database of the resource and application monitoring module, and each data point has the format: monitored object name + label + index name + monitored value + timestamp.

9. The monitoring and fault self-healing method according to claim 5, wherein: in step three, the fault notification message comprises the faulty node, the abnormality type, and the name of the fault recovery job.