CN115840656A - Automatic operation and maintenance method and system for application program based on fault self-healing - Google Patents

Info

Publication number
CN115840656A
Authority
CN
China
Prior art keywords
application program
healing
information
fault self
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211518354.6A
Other languages
Chinese (zh)
Inventor
刘德一
李钢
肖俟泽
陆玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHINA REALTIME DATABASE CO LTD
Original Assignee
CHINA REALTIME DATABASE CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA REALTIME DATABASE CO LTD
Priority to CN202211518354.6A
Publication of CN115840656A
Legal status: Pending

Abstract

The invention discloses an automatic operation and maintenance method and system for an application program based on fault self-healing. The method comprises the following steps: collecting information about the server hosting the application program in real time; collecting the runtime resource usage and log information of the application program in real time; analyzing and predicting the running state of the program from historical information and the information collected in real time; raising an alarm on an abnormal state; setting a fault self-healing procedure according to the program's start-stop method; and monitoring the running state of the program in real time to realize fault self-healing. The system comprises a data acquisition module, an alarm module and a fault self-healing module. The invention can continuously improve the operation and maintenance management and business service capability of large enterprise systems, helps the company realize the full value of its operation and maintenance services, serves internal management improvement, supports business applications for lean management and intelligent operation, and creates huge direct and indirect benefits for the company.

Description

Automatic operation and maintenance method and system for application program based on fault self-healing
Technical Field
The invention relates to application program operation and maintenance, in particular to an automatic operation and maintenance method and system for an application program based on fault self-healing.
Background
As power grid companies work toward becoming internationally leading energy Internet enterprises, they attach great importance to digital, networked and intelligent development. Advanced technologies such as big data, cloud computing, the Internet of Things, mobile Internet, artificial intelligence and blockchain are being fully applied to promote energy transformation together with the deep integration of information technology, technological innovation and industrial upgrading, and to strengthen new drivers of development.
Various production application systems support the safe production, operation and customer service of enterprises, but the operation of these systems is often affected by various factors, so abnormal conditions occur. When a system fails outside working hours, operation and maintenance personnel need time to reach the site and cannot handle the fault immediately. Prolonged system faults affect users and the company's indicator assessment by the State Grid company. To guarantee stable operation of the company's business systems, an automatic operation and maintenance mechanism based on application service state needs to be established. By configuring system service information, the important services of a business system are monitored, alarmed on and self-healed in real time. Once a system service exits abnormally, or a port, page response or database connection becomes abnormal, corresponding handling is carried out immediately according to user-defined configuration rules, which greatly shortens system downtime.
Disclosure of Invention
Purpose of the invention: the invention aims to provide an automatic operation and maintenance method and system for an application program based on fault self-healing, so that server resource usage and the running status of application services are collected promptly, the time spent on passive manual operation and maintenance is reduced, an effective early-warning and automatic fault recovery mechanism is established, and a new automated, intelligent and digital mode of operation and maintenance work is formed.
Technical solution: the automatic operation and maintenance method for an application program based on fault self-healing according to the invention comprises the following steps:
(1) Collecting information of a server where an application program is located in real time;
(2) Acquiring the operating resource occupation and log information of the application program in real time;
(3) Analyzing and predicting the running state of the program according to the historical information and the real-time acquisition information;
(4) Alarming in an abnormal state;
(5) Setting a fault self-healing process according to a program start-stop method;
(6) Monitoring the running state of the program in real time to realize fault self-healing.
Step (1) is specifically as follows:
(1.1) using a distributed real-time data acquisition technology, collecting in real time information about the server hosting the application program, including operating information for the CPU, memory, I/O, network and hard disk of the application server under operation and maintenance, as well as the run log, error log and operation log of the operating system;
(1.2) intelligently retrieving and classifying the collected server information and extracting the information that affects the operation of the application program.
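As a non-limiting illustration, step (1) could be sketched in Python as follows. The sketch uses only the standard library, so it samples just the CPU count and disk usage; the `ServerSnapshot` type, the `KEYWORDS` table and the keyword-matching approach to classification are assumptions made for illustration and are not specified by the invention.

```python
import os
import shutil
import time
from dataclasses import dataclass


@dataclass
class ServerSnapshot:
    """One real-time sample of server information (illustrative subset)."""
    timestamp: float
    cpu_count: int
    disk_total_bytes: int
    disk_used_bytes: int

    @property
    def disk_used_ratio(self) -> float:
        if self.disk_total_bytes == 0:
            return 0.0
        return self.disk_used_bytes / self.disk_total_bytes


def collect_server_snapshot(path: str = ".") -> ServerSnapshot:
    """Step (1.1): sample server information for the disk holding `path`."""
    usage = shutil.disk_usage(path)
    return ServerSnapshot(time.time(), os.cpu_count() or 1, usage.total, usage.used)


# Hypothetical keyword table: which log phrases are treated as affecting
# the application, grouped by resource category.
KEYWORDS = {
    "memory": ("out of memory", "oom"),
    "disk": ("no space left", "i/o error"),
    "network": ("connection refused", "timed out"),
}


def classify_log_lines(lines):
    """Step (1.2): keep only lines matching a keyword, grouped by category."""
    result = {category: [] for category in KEYWORDS}
    for line in lines:
        lowered = line.lower()
        for category, words in KEYWORDS.items():
            if any(word in lowered for word in words):
                result[category].append(line)
                break  # a line is filed under the first matching category
    return result
```

A production collector would typically add CPU load, memory, I/O and network counters from an operating-system-specific source; those are omitted here to keep the sketch portable.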
Step (2) is specifically as follows:
(2.1) using the distributed real-time acquisition technology, collecting state information of the monitored application program, including its process ID and its memory, CPU, port and bandwidth usage;
(2.2) collecting the runtime log information of the application program in real time.
In steps (1) and (2), the real-time data acquisition technology supports active push by users and user-defined plug-ins; it also supports dynamic horizontal scaling, handling on the order of one hundred million data acquisition, alarm judgment, historical data storage and query operations per cycle.
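The two acquisition paths mentioned above, active push and user-defined plug-ins, could for example be combined in one collector, as in the following minimal sketch. The `Collector` class and its method names are hypothetical; the invention does not prescribe any particular interface.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Metric = Tuple[str, float, float]  # (name, value, timestamp)


class Collector:
    """Buffers metrics from pushed values and from polled plug-ins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[], float]] = {}
        self._buffer: List[Metric] = []

    def register(self, name: str, plugin: Callable[[], float]) -> None:
        """Register a user-defined plug-in that returns one numeric value."""
        self._plugins[name] = plugin

    def push(self, name: str, value: float, timestamp: Optional[float] = None) -> None:
        """Active push: the user submits a value directly."""
        stamp = time.time() if timestamp is None else timestamp
        self._buffer.append((name, float(value), stamp))

    def poll(self) -> None:
        """Run every plug-in once; a failing plug-in is skipped, not fatal."""
        for name, plugin in self._plugins.items():
            try:
                self.push(name, plugin())
            except Exception:
                continue

    def drain(self) -> List[Metric]:
        """Return and clear everything collected so far."""
        collected, self._buffer = self._buffer, []
        return collected
```

Isolating plug-in failures inside `poll` is a design choice in this sketch: one faulty user plug-in should not interrupt the acquisition cycle for the others.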
Step (3) is specifically as follows:
(3.1) analyzing the running state and trend of the application program according to the historical information of the server and the application program;
(3.2) predicting the running state of the application program by analyzing the historical information together with the data collected in real time, and raising an early warning for possible abnormalities.
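The invention does not name a prediction algorithm. As one possible illustration, the following sketch fits a least-squares line to a metric's history plus its latest sample and raises an early warning when the extrapolated value would cross a threshold. The function names, the `threshold` and `steps_ahead` parameters, and the choice of linear extrapolation are all assumptions.

```python
from typing import Sequence


def linear_forecast(samples: Sequence[float], steps_ahead: int) -> float:
    """Fit y = a + b*x by least squares over x = 0..n-1 and extrapolate."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def should_warn(history: Sequence[float], latest: float,
                threshold: float, steps_ahead: int = 3) -> bool:
    """Warn if the latest sample, or the forecast, reaches the threshold."""
    series = list(history) + [latest]
    if latest >= threshold:
        return True
    return linear_forecast(series, steps_ahead) >= threshold
```

For instance, memory usage climbing steadily by ten percentage points per sample would trigger a warning a few samples before it actually reaches the limit, which is the kind of "possible abnormality" step (3.2) refers to.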
Step (4) is specifically as follows: raising alarms for abnormal servers and application programs.
Step (5) is specifically as follows: setting the fault self-healing procedure of the application program according to the start script of the application program.
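A fault self-healing procedure derived from an application's start and stop commands could, for example, be represented as follows. This is a sketch under stated assumptions: the `HealingProcedure` fields, the retry count and the wait time are illustrative, and the `run` and `sleep` parameters exist only so that the procedure can be exercised without launching real processes.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class HealingProcedure:
    """Self-healing procedure built from an application's start-stop method."""
    stop_cmd: Sequence[str]    # e.g. the application's own stop script
    start_cmd: Sequence[str]   # e.g. the application's own start script
    max_attempts: int = 3      # assumed retry limit
    wait_seconds: float = 5.0  # assumed warm-up time before re-checking


def run_command(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(list(cmd), check=False).returncode == 0


def heal(procedure: HealingProcedure,
         is_healthy: Callable[[], bool],
         run: Callable[[Sequence[str]], bool] = run_command,
         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Stop, start and re-check the application until healthy or out of attempts."""
    for _ in range(procedure.max_attempts):
        run(procedure.stop_cmd)  # result ignored: the app may already be down
        if run(procedure.start_cmd):
            sleep(procedure.wait_seconds)
            if is_healthy():
                return True
    return False
```

Bounding the number of attempts keeps a persistently failing application from being restarted indefinitely; when `heal` returns `False`, the fault would be escalated through the alarm step.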
Step (6) is specifically as follows:
(6.1) determining whether the application program is in an abnormal state through application monitoring, which comprises direct judgment and auxiliary judgment;
(6.1.1) direct judgment: using the collected application process information and port occupation information to judge whether the application program process is alive, and directly restarting the application if it is abnormal, thereby realizing fault self-healing;
(6.1.2) auxiliary judgment: for an application whose process is alive and whose port communication is normal but which is actually hung (a "false death" state), realizing fault self-healing by means of manually assisted judgment;
(6.2) probing the application state by means of host communication checks, database connection detection and simulated page access, sending an alarm promptly when an abnormal state occurs, and realizing fault judgment and self-healing of the application through manual judgment or predefined rules.
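The distinction between direct and auxiliary judgment in step (6) can be expressed as a small decision function. The following is a minimal sketch; the `AppStatus` fields, the `Action` values and the `auto_restart_on_hang` flag (standing in for a "predefined rule") are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART = "restart"
    ALERT_FOR_REVIEW = "alert_for_review"


@dataclass
class AppStatus:
    process_alive: bool   # from collected process information
    port_listening: bool  # from collected port occupation information
    probe_ok: bool        # e.g. a simulated page access succeeded


def decide(status: AppStatus, auto_restart_on_hang: bool = False) -> Action:
    """Map a monitored status to a self-healing action."""
    # Direct judgment: a dead process or closed port is restarted at once.
    if not (status.process_alive and status.port_listening):
        return Action.RESTART
    # Auxiliary judgment: process and port look fine but the application
    # does not respond (hung). Restart only if a predefined rule allows it;
    # otherwise alert so that a person can decide.
    if not status.probe_ok:
        return Action.RESTART if auto_restart_on_hang else Action.ALERT_FOR_REVIEW
    return Action.NONE
```

For example, `decide(AppStatus(True, True, False))` yields `Action.ALERT_FOR_REVIEW`, reflecting that a hung application is handed to manual judgment unless a rule says otherwise.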
An automatic operation and maintenance system for an application program based on fault self-healing comprises the following modules:
a data acquisition module: used for collecting in real time information about the server hosting the application program and the resource usage and log information of the application program, and for analyzing and predicting the running state of the program from the collected information;
an alarm module: used for raising alarms for abnormal servers and application programs;
a fault self-healing module: used for setting the fault self-healing procedure and realizing fault self-healing.
A computer storage medium stores a computer program which, when executed by a processor, implements the automatic operation and maintenance method for an application program based on fault self-healing described above.
A computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the automatic operation and maintenance method for an application program based on fault self-healing described above.
Advantageous effects: compared with the prior art, the invention has the following advantages:
1. the invention can continuously improve the operation and maintenance management and business service capability of large enterprise systems, helps the company realize the full value of its operation and maintenance services, serves internal management improvement, supports business applications for lean management and intelligent operation, and creates huge direct and indirect benefits for the company;
2. through its mechanism of "active discovery and automatic resolution", the invention saves manpower and improves the reliability of each business application, laying the foundation for business departments to create huge economic benefits;
3. by continuously improving operation and maintenance capability, the invention helps the company's internal organizations and departments to keep improving the quality of their business services and to keep creating value for customers, reduces the company's maintenance cost for application systems, provides application maintenance services for the enterprise, and generates economic benefits over the long term;
4. the invention can be integrated with various business systems, is convenient for application management and quick to extend, effectively ensures security throughout the system's operating life cycle by strengthening operation and maintenance control, prevents major security incidents, effectively saves the operation and maintenance support expenditure needed to deal with system security incidents, and improves the company's management benefits.
Drawings
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is an overall architecture diagram of the method of the present invention;
FIG. 3 is a flow chart of fault self-healing.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
As shown in FIGS. 1 and 2, an automatic operation and maintenance method for an application program based on fault self-healing comprises the following steps:
(1) Collecting information about the server hosting the application program in real time.
(1.1) Using a distributed real-time data acquisition technology, information about the server hosting the application program is collected in real time, including operating information for the CPU, memory, I/O, network and hard disk of the application server under operation and maintenance, as well as the run log, error log and operation log of the operating system.
(1.2) The collected server information is intelligently retrieved and classified, and the information that affects the operation of the application program is extracted.
(2) Collecting the runtime resource usage and log information of the application program in real time.
(2.1) Using the distributed real-time acquisition technology, state information of the monitored application program is collected, including its process ID and its memory, CPU, port and bandwidth usage.
(2.2) The runtime log information of the application program is collected in real time.
In steps (1) and (2), the real-time data acquisition technology supports active push by users and user-defined plug-ins; it also supports dynamic horizontal scaling, handling on the order of one hundred million data acquisition, alarm judgment, historical data storage and query operations per cycle.
(3) Analyzing and predicting the running state of the program from historical information and the information collected in real time.
(3.1) The running state and trend of the application program are analyzed according to the historical information of the server and the application program.
(3.2) The running state of the application program is predicted by analyzing the historical information together with the data collected in real time, and an early warning is raised for possible abnormalities.
(4) Raising an alarm on an abnormal state: alarms are raised for abnormal servers and application programs.
(5) Setting a fault self-healing procedure according to the program's start-stop method: the fault self-healing procedure of the application program is set according to the start script of the application program.
(6) As shown in FIG. 3, the running state of the program is monitored in real time to realize fault self-healing.
(6.1) Whether the application program is in an abnormal state is determined through application monitoring, which is divided into direct judgment and auxiliary judgment.
(6.1.1) Direct judgment: the collected application process information and port occupation information are used to judge whether the application program process is alive; if it is abnormal, the application is restarted directly, thereby realizing fault self-healing.
(6.1.2) Auxiliary judgment: for an application whose process is alive and whose port communication is normal but which is actually hung (a "false death" state), fault self-healing is realized by means of manually assisted judgment.
(6.2) The application state is probed by means of host communication checks, database connection detection and simulated page access; when an abnormal state occurs, an alarm is sent promptly, and fault judgment and self-healing of the application are realized through manual judgment or predefined rules.
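By way of example only, the three probes named in step (6.2) could be implemented with the Python standard library as sketched below. The function names are hypothetical, a plain TCP connection stands in for "host communication", and the database probe accepts any callable returning a DB-API style connection, since the invention does not specify a database product.

```python
import http.client
import socket
import urllib.error
import urllib.request
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Host communication probe: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_ok(url: str, timeout: float = 3.0) -> bool:
    """Simulated page access: does an HTTP GET return a 2xx or 3xx status?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def db_ok(connect: Callable[[], object]) -> bool:
    """Database connection probe: connect and run a trivial query."""
    try:
        connection = connect()
        try:
            cursor = connection.cursor()
            cursor.execute("SELECT 1")
            cursor.fetchone()
            return True
        finally:
            connection.close()
    except Exception:
        # Any failure to connect or query counts as an abnormal state.
        return False
```

A probe result of `False` would feed the alarm step, after which the restart decision is made either by a person or by a predefined rule, as the step describes.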
An automatic operation and maintenance system for an application program based on fault self-healing comprises the following modules:
a data acquisition module: used for collecting in real time information about the server hosting the application program and the resource usage and log information of the application program, and for analyzing and predicting the running state of the program from the collected information;
an alarm module: used for raising alarms for abnormal servers and application programs;
a fault self-healing module: used for setting the fault self-healing procedure and realizing fault self-healing.
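One way the three modules might cooperate in a single monitoring cycle is sketched below. The class and method names are hypothetical, and health checks and healing procedures are reduced to plain callables so that the interaction between modules stays visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataAcquisitionModule:
    """Holds one health check per application (True means healthy)."""
    checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def abnormal_apps(self) -> List[str]:
        return [app for app, check in self.checks.items() if not check()]


@dataclass
class AlarmModule:
    """Records alarms; a real system would notify operators instead."""
    sent: List[str] = field(default_factory=list)

    def alarm(self, message: str) -> None:
        self.sent.append(message)


@dataclass
class SelfHealingModule:
    """Stores a healing procedure per application and runs it on demand."""
    procedures: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def set_procedure(self, app: str, procedure: Callable[[], bool]) -> None:
        self.procedures[app] = procedure

    def heal(self, app: str) -> bool:
        procedure = self.procedures.get(app)
        return procedure() if procedure is not None else False


def run_cycle(acquisition: DataAcquisitionModule, alarms: AlarmModule,
              healer: SelfHealingModule) -> Dict[str, bool]:
    """Detect abnormal apps, alarm on each, attempt healing, report results."""
    results: Dict[str, bool] = {}
    for app in acquisition.abnormal_apps():
        alarms.alarm(f"{app}: abnormal state detected")
        healed = healer.heal(app)
        if not healed:
            alarms.alarm(f"{app}: self-healing failed, manual handling needed")
        results[app] = healed
    return results
```

In this sketch an application without a configured procedure is never restarted automatically; it only produces alarms, which keeps unconfigured services under manual control.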

Claims (10)

1. An automatic operation and maintenance method for an application program based on fault self-healing is characterized by comprising the following steps:
(1) Collecting information of a server where an application program is located in real time;
(2) Acquiring the operating resource occupation and log information of the application program in real time;
(3) Analyzing and predicting the running state of the program according to the historical information and the real-time acquisition information;
(4) Alarming in an abnormal state;
(5) Setting a fault self-healing process according to a program start-stop method;
(6) Monitoring the running state of the program in real time to realize fault self-healing.
2. The method for automatically operating and maintaining the application program based on fault self-healing according to claim 1, wherein the step (1) is specifically as follows:
(1.1) using a distributed real-time data acquisition technology, collecting in real time information about the server hosting the application program, including operating information for the CPU, memory, I/O, network and hard disk of the application server under operation and maintenance, as well as the run log, error log and operation log of the operating system;
(1.2) intelligently retrieving and classifying the collected server information, and extracting the information that affects the operation of the application program.
3. The method for automatically operating and maintaining the application program based on fault self-healing according to claim 1, wherein the step (2) is specifically as follows:
(2.1) using a distributed real-time acquisition technology, collecting state information of the monitored application program, including its process ID and its memory, CPU, port and bandwidth usage;
(2.2) collecting the runtime log information of the application program in real time.
4. The method for automatically operating and maintaining the application program based on fault self-healing according to claim 1, wherein the step (3) is specifically as follows:
(3.1) analyzing the running state and trend of the application program according to the historical information of the server and the application program;
(3.2) predicting the running state of the application program by analyzing the historical information together with the data collected in real time, and raising an early warning for possible abnormalities.
5. The method for automatically operating and maintaining the application program based on fault self-healing according to claim 1, wherein the step (4) is specifically as follows: raising alarms for abnormal servers and application programs.
6. The method for automatically operating and maintaining the application program based on fault self-healing according to claim 1, wherein the step (5) is specifically as follows: setting the fault self-healing procedure of the application program according to the start script of the application program.
7. The method for automatically operating and maintaining the application program based on fault self-healing according to claim 1, wherein the step (6) is specifically as follows:
(6.1) determining whether the application program is in an abnormal state through application monitoring, which comprises direct judgment and auxiliary judgment;
(6.1.1) direct judgment: using the collected application process information and port occupation information to judge whether the application program process is alive, and directly restarting the application if it is abnormal, thereby realizing fault self-healing;
(6.1.2) auxiliary judgment: for an application whose process is alive and whose port communication is normal but which is actually hung (a "false death" state), realizing fault self-healing by means of manually assisted judgment;
(6.2) probing the application state by means of host communication checks, database connection detection and simulated page access, sending an alarm promptly when an abnormal state occurs, and realizing fault judgment and self-healing of the application through manual judgment or predefined rules.
8. An automatic operation and maintenance system for an application program based on fault self-healing, characterized by comprising the following modules:
a data acquisition module: used for collecting in real time information about the server hosting the application program and the resource usage and log information of the application program, and for analyzing and predicting the running state of the program from the collected information;
an alarm module: used for raising alarms for abnormal servers and application programs;
a fault self-healing module: used for setting the fault self-healing procedure and realizing fault self-healing.
9. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the automatic operation and maintenance method for an application program based on fault self-healing according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the automatic operation and maintenance method for an application program based on fault self-healing according to any one of claims 1 to 7.
CN202211518354.6A 2022-11-30 2022-11-30 Automatic operation and maintenance method and system for application program based on fault self-healing Pending CN115840656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211518354.6A CN115840656A (en) 2022-11-30 2022-11-30 Automatic operation and maintenance method and system for application program based on fault self-healing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211518354.6A CN115840656A (en) 2022-11-30 2022-11-30 Automatic operation and maintenance method and system for application program based on fault self-healing

Publications (1)

Publication Number Publication Date
CN115840656A true CN115840656A (en) 2023-03-24

Family

ID=85577413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211518354.6A Pending CN115840656A (en) 2022-11-30 2022-11-30 Automatic operation and maintenance method and system for application program based on fault self-healing

Country Status (1)

Country Link
CN (1) CN115840656A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116560893A (en) * 2023-07-07 2023-08-08 湖南开放大学(湖南网络工程职业学院、湖南省干部教育培训网络学院) Computer application program operation data fault processing system
CN116560893B (en) * 2023-07-07 2023-09-22 湖南开放大学(湖南网络工程职业学院、湖南省干部教育培训网络学院) Computer application program operation data fault processing system

Similar Documents

Publication Publication Date Title
CN107943668B (en) Computer server cluster log monitoring method and monitor supervision platform
CN101854277B (en) Method for monitoring mobile communication operation analysis system
CN106371986A (en) Log treatment operation and maintenance monitoring system
CN111176879A (en) Fault repairing method and device for equipment
CN106778253A (en) Threat context aware information security Initiative Defense model based on big data
CN105656698A (en) Intelligent monitoring structure and method for network application system
CN104966172A (en) Large data visualization analysis and processing system for enterprise operation data analysis
CN106649040A (en) Automatic monitoring method and device for performance of Weblogic middleware
CN104881352A (en) System resource monitoring device based on mobile terminal
CN101436274A (en) Method for across-platform monitoring enterprise application system performance
CN105119757A (en) Method and system for operation and maintenance automation of enterprise servers
CN103186603B (en) Determine that SQL statement is on the method for the impact of the performance of key business, system and equipment
CN111259073A (en) Intelligent business system running state studying and judging system based on logs, flow and business access
CN109034580B (en) Information system overall health degree evaluation method based on big data analysis
CN111405032A (en) General cloud platform of industry thing networking
CN111125056A (en) Automatic operation and maintenance system and method for information system database
CN112559237B (en) Operation and maintenance system troubleshooting method and device, server and storage medium
CN115809183A (en) Method for discovering and disposing information-creating terminal fault based on knowledge graph
CN115840656A (en) Automatic operation and maintenance method and system for application program based on fault self-healing
CN103117878A (en) Design method of Nagios-based distribution monitoring system
CN110515799B (en) MySQL monitoring system based on python language and implementation method
CN114860830A (en) System for building operation and maintenance data middlings based on big data technology
CN109800133A (en) A kind of method, one-stop monitoring alarm platform and the system of unified monitoring alarm
CN113592210A (en) Internet of things integrated management platform for water supply non-negative-pressure secondary water supply facility
CN112883001A (en) Data processing method, device and medium based on marketing and distribution through data visualization platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination