CN111897671A - Failure recovery method, computer device, and storage medium - Google Patents

Failure recovery method, computer device, and storage medium

Info

Publication number
CN111897671A
Authority
CN
China
Prior art keywords
fault
server
fault recovery
data
recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010728104.XA
Other languages
Chinese (zh)
Inventor
郑磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Securities Co Ltd
Original Assignee
Ping An Securities Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Securities Co Ltd filed Critical Ping An Securities Co Ltd
Priority to CN202010728104.XA
Publication of CN111897671A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of security monitoring, and provides a fault recovery method, a computer device, and a storage medium. The fault recovery method comprises the following steps: acquiring identification information of a server after monitoring that the server has failed; identifying the fault type of the fault; generating a first fault alarm instruction when a fault recovery operation flow identification corresponding to the fault type is matched; hooking the first fault alarm instruction with a hook function and sending a second fault alarm instruction carrying the identification information of the server and the fault recovery operation flow identification to the automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction; and calling, through the client of the server, the fault recovery workflow script in the fault recovery instruction to execute the fault recovery instruction. The invention encapsulates the solution corresponding to each fault into a fault recovery workflow script; when a fault occurs, the monitoring system triggers the automatic fault processing system to call the fault recovery workflow script so that the fault is resolved automatically.

Description

Failure recovery method, computer device, and storage medium
Technical Field
The invention relates to the technical field of security monitoring, and in particular to a fault recovery method, a computer device, and a storage medium.
Background
In current server management systems, a fault monitoring platform (also called Uwork) monitors in real time whether any of the maintained servers has failed. When a server fails, alarm information is sent out to notify the relevant personnel (such as server administrators and operation and maintenance personnel) to handle the fault in time, and the relevant personnel log in to the fault monitoring platform to recover the fault manually.
The existing server fault recovery scheme requires manual intervention, which leads to high labor cost, low operating efficiency, and poor reliability.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a fault recovery method, a computer device, and a storage medium in which the solution corresponding to each fault is encapsulated into a fault recovery workflow script and, when a fault occurs, the monitoring system triggers the automatic fault processing system to call the fault recovery workflow script so that the fault is resolved automatically.
A first aspect of the present invention provides a fault recovery method applied in a monitoring system, where the method includes:
monitoring whether a server fails or not, and acquiring identification information of the server after monitoring that the server fails;
identifying a fault type of the fault;
matching whether a fault recovery operation flow identifier corresponding to the fault type exists or not;
generating a first fault alarm instruction after matching the fault recovery operation flow identification corresponding to the fault type;
hooking the first fault alarm instruction by using a hook function, and sending a second fault alarm instruction carrying identification information of the server and the fault recovery operation flow identification to an automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction according to the second fault alarm instruction;
and calling a fault recovery workflow script in the fault recovery instruction through a client of the server to execute the fault recovery instruction.
According to an alternative embodiment of the present invention, the monitoring whether the server fails includes:
acquiring a log reported by a client of the server, wherein a plurality of data are recorded in the log;
comparing each data to a corresponding data threshold;
when at least one data is larger than a corresponding data threshold value, determining that the server is monitored to have a fault;
and when all the data are less than or equal to the corresponding data threshold value, determining that the server is monitored to normally operate.
According to an alternative embodiment of the invention, said identifying a fault type of said fault comprises:
determining target data in the plurality of data that is greater than a data threshold;
matching preset keywords with the target data by a regular matching method;
and when a target keyword which is the same as the preset keyword is matched from the target data, determining a fault type corresponding to the target keyword according to a preset monitoring rule table.
According to an alternative embodiment of the invention, when all data are less than or equal to the corresponding data threshold, the method further comprises:
inputting each data into a fault prediction classifier;
predicting the risk fault type and probability of each data through the fault prediction classifier;
taking the risk fault type corresponding to the maximum probability as a target risk fault type;
and sending a risk alarm signal carrying the target risk fault type to the server.
According to an alternative embodiment of the present invention, the training process of the fault prediction classifier comprises:
acquiring historical data corresponding to each piece of data and a fault type of the historical data;
constructing a data array according to each historical data and the corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
According to an alternative embodiment of the invention, the method further comprises:
and when the matched fault recovery operation flow identification corresponding to the fault type does not exist, sending preset notification information to operation and maintenance service personnel, so that the operation and maintenance service personnel manually process the fault recovery.
A second aspect of the present invention provides a fault recovery method, which is applied in a fault automatic processing system, and the method includes:
receiving a fault alarm signal which is sent by a monitoring system and carries identification information of a server and a fault recovery operation flow identification;
matching a fault recovery workflow script corresponding to the fault recovery workflow identification;
and sending the fault recovery instruction carrying the fault recovery workflow script to a server corresponding to the identification information of the server, so that a client of the server calls the fault recovery workflow script to execute the fault recovery instruction.
According to an alternative embodiment of the invention, the method further comprises:
acquiring a plurality of fault types in an operation and maintenance service system and a fault recovery strategy of each fault type;
configuring a fault recovery workflow script according to the fault recovery strategy of each fault type;
setting a fault recovery workflow identification for the fault recovery workflow script;
and sending the fault recovery operation flow identification to the monitoring system.
A third aspect of the invention provides a computer device comprising a processor for implementing the fault recovery method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fault recovery method.
In summary, the fault recovery method, the computer device, and the storage medium according to the present invention perform fault recovery on a failed server through cooperation between a monitoring system and an automatic fault processing system. The automatic fault processing system encapsulates the solution corresponding to each fault into a fault recovery workflow script; when a fault occurs, the monitoring system triggers the automatic fault processing system to call the fault recovery workflow script, so that the fault is resolved automatically, manual participation is removed from the process, and rapid fault recovery is achieved.
Drawings
Fig. 1 is a flowchart of a failure recovery method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a failure recovery method according to a second embodiment of the present invention.
Fig. 3 is a structural diagram of a failure recovery apparatus according to a third embodiment of the present invention.
Fig. 4 is a structural diagram of a failure recovery apparatus according to a fourth embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of a failure recovery method according to an embodiment of the present invention. The failure recovery method specifically includes the following steps, and the order of the steps in the flowchart may be changed and some may be omitted according to different requirements.
S11, monitoring whether the server has a fault or not by the monitoring system, and acquiring the identification information of the server after monitoring that the server has a fault.
The fault recovery system comprises a monitoring system and an automatic fault processing system, wherein the monitoring system is responsible for monitoring whether the server fails or not and acquiring the identification information of the failed server when the server is monitored to fail.
The identification information may include: the IP address of the server, the MAC address of the server and the equipment identification number of the server.
The identification information is used to uniquely represent the server.
In an optional embodiment, the monitoring whether the server fails includes:
acquiring a log reported by a client of the server, wherein a plurality of data are recorded in the log;
comparing each data to a corresponding data threshold;
when at least one data is larger than a corresponding data threshold value, determining that the server is monitored to have a fault;
and when all the data are less than or equal to the corresponding data threshold value, determining that the server is monitored to normally operate.
In this optional embodiment, the monitoring system is connected to multiple servers simultaneously, each server is provided with a client in advance, and the client actively reports the log of its server to the monitoring system at regular or irregular intervals. The client can also report the log of its server when it receives a log reporting instruction, which the monitoring system may send regularly or irregularly.
The plurality of data may include, but is not limited to: CPU utilization, memory utilization, process survival status, disk utilization, database connections, and the like.
Different data correspond to different data thresholds, for example, the data threshold corresponding to the CPU utilization is 90%, and the data threshold corresponding to the database connection number is 100.
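As an illustration only, the threshold comparison described above can be sketched in Python as follows; the metric names, threshold values, and log structure are assumptions made for this example and are not specified by the method itself.

# Minimal sketch of the threshold check; metric names and values are illustrative.
DATA_THRESHOLDS = {
    "cpu_usage": 90.0,       # percent
    "memory_usage": 85.0,    # percent
    "disk_usage": 95.0,      # percent
    "db_connections": 100,   # count
}

def check_server_log(log_data):
    """Compare each data item in the reported log with its data threshold.

    Returns (faulty, exceeded), where exceeded maps every metric whose value
    is greater than its threshold to the reported value.
    """
    exceeded = {
        name: value
        for name, value in log_data.items()
        if name in DATA_THRESHOLDS and value > DATA_THRESHOLDS[name]
    }
    return bool(exceeded), exceeded

# Example: a log reported by a server client.
faulty, exceeded = check_server_log(
    {"cpu_usage": 97.5, "memory_usage": 60.0, "db_connections": 42}
)
print(faulty, exceeded)  # True {'cpu_usage': 97.5}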
S12, the monitoring system identifies the fault type of the fault.
And identifying the fault type of the server according to the acquired data.
In an optional embodiment, the identifying the fault type of the fault includes:
determining target data in the plurality of data that is greater than a data threshold;
matching preset keywords with the target data by a regular matching method;
and when a target keyword which is the same as the preset keyword is matched from the target data, determining a fault type corresponding to the target keyword according to a preset monitoring rule table.
In this optional embodiment, a monitoring rule table is preset in the monitoring system, where the monitoring rule table records a plurality of fault types and keywords corresponding to each fault type.
The fault types include: CPU failures, memory failures, process failures, disk failures, database failures, and the like. The keyword corresponding to the CPU fault is "CPU", and the keyword corresponding to the memory fault is "memory".
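A minimal sketch of the keyword matching against a preset monitoring rule table is given below; the rule table contents and the form of the target data are assumptions made for illustration.

import re

# Illustrative monitoring rule table: keyword pattern -> fault type.
MONITORING_RULES = {
    r"\bcpu\b": "CPU failure",
    r"\bmemory\b": "memory failure",
    r"\bdisk\b": "disk failure",
    r"\bdatabase\b": "database failure",
}

def identify_fault_type(target_data):
    """Return the fault type whose preset keyword matches the target data, if any."""
    for pattern, fault_type in MONITORING_RULES.items():
        if re.search(pattern, target_data, flags=re.IGNORECASE):
            return fault_type
    return None

print(identify_fault_type("cpu_usage exceeded threshold on host-01"))  # CPU failure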
According to the working experience of operation and maintenance service personnel, these faults have obvious characteristics and typical handling methods. An operation and maintenance knowledge base can therefore be summarized and formed, and fault recovery workflow scripts can be developed and configured according to that knowledge base so as to execute fault recovery instructions and perform fault recovery on the server.
S13, the monitoring system matches whether the fault recovery operation flow identification corresponding to the fault type exists.
The monitoring system records in advance the association between fault types and fault recovery operation flow identifications, and matches the fault recovery operation flow identification corresponding to the fault type against this association, so that the automatic fault processing system can then send the corresponding fault recovery workflow script according to that identification.
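The association recorded by the monitoring system can be pictured as a simple lookup table; the identifications below are hypothetical examples.

# Hypothetical association between fault types and fault recovery operation flow identifications.
FAULT_TYPE_TO_WORKFLOW_ID = {
    "CPU failure": "WF-CPU-001",
    "memory failure": "WF-MEM-001",
    "database failure": "WF-DB-001",
}

def match_workflow_id(fault_type):
    """Return the fault recovery operation flow identification, or None if there is no match."""
    return FAULT_TYPE_TO_WORKFLOW_ID.get(fault_type)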
And S14, generating a first fault alarm instruction after the monitoring system matches that the fault recovery operation flow identification corresponding to the fault type exists.
When the monitoring system monitors that a fault of a certain fault type (for example, fault a) occurs on a certain server, and matches that a fault recovery operation flow identification corresponding to that fault type exists, a first fault alarm instruction is generated. The first fault alarm instruction is only an alarm-triggering mechanism and does not itself carry any information.
And S15, the monitoring system hooks the first fault alarm instruction by using a hook function, and sends a second fault alarm instruction carrying the identification information of the server and the identification of the fault recovery operation flow to the automatic fault processing system, so that the automatic fault processing system feeds back the fault recovery instruction according to the second fault alarm instruction.
The monitoring system has a hook function and can hook the first fault alarm instruction, so that a second fault alarm instruction carrying the identification information of the server and the identification of the fault recovery operation flow is sent to the automatic fault processing system.
Because the second fault alarm instruction includes the identification information of the server and the fault recovery operation flow identification, the automatic fault processing system can determine which server has failed from the identification information of the server, determine which fault recovery workflow script needs to be called from the fault recovery operation flow identification, and then send that fault recovery workflow script to the failed server to perform fault recovery.
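One possible reading of the hook mechanism is sketched below: the first fault alarm instruction only fires the hook, and the hook assembles the second fault alarm instruction carrying the identification information of the server and the fault recovery operation flow identification. The HTTP transport, endpoint, and field names are assumptions, not something fixed by this description.

import json
import urllib.request

# Assumed endpoint of the automatic fault processing system.
AUTOMATIC_FAULT_SYSTEM_URL = "http://fault-handler.example/alarm"

def on_first_fault_alarm(server_id, workflow_id):
    """Hook attached to the first fault alarm instruction.

    The first alarm carries no information itself; the hook builds the second
    fault alarm instruction and forwards it to the automatic fault processing
    system, which then feeds back the fault recovery instruction.
    """
    second_alarm = {
        "server_id": server_id,
        "fault_recovery_workflow_id": workflow_id,
    }
    request = urllib.request.Request(
        AUTOMATIC_FAULT_SYSTEM_URL,
        data=json.dumps(second_alarm).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status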
The specific process of performing fault recovery can be referred to in embodiment two and the related description.
In an optional embodiment, when all the data are less than or equal to the corresponding data threshold, the method further comprises:
inputting each data into a fault prediction classifier;
predicting the risk fault type and probability of each data through the fault prediction classifier;
taking the risk fault type corresponding to the maximum probability as a target risk fault type;
and sending a risk alarm signal carrying the target risk fault type to the server.
The fault prediction classifier is a fault classifier based on a convolutional neural network model. Because the classifier is trained in advance, risk prediction can be performed on the data of the server even when no fault has occurred, determining which fault types are at risk and the probability of each fault type. The server reports a plurality of data items, each data item corresponds to a plurality of predicted risk fault types, and the risk fault type with the maximum probability is taken as the target risk fault type of that data item. The target risk fault type with the maximum probability among the target risk fault types of all the data items is then taken as the target risk fault type of the server. By predicting the target risk fault type of the server, preventive and preparatory measures can be taken in advance.
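A sketch of the risk prediction step, assuming a trained classifier object with a predict_proba-style interface; the interface and names are illustrative and not taken from this description.

def predict_target_risk(classifier, data_items, fault_types):
    """Pick the server-level target risk fault type.

    classifier.predict_proba(name, value) is assumed to return one probability
    per fault type for a single data item.
    """
    best_type, best_prob = None, -1.0
    for name, value in data_items.items():
        probs = classifier.predict_proba(name, value)
        # Target risk fault type of this data item: the one with maximum probability.
        idx = max(range(len(probs)), key=probs.__getitem__)
        if probs[idx] > best_prob:
            best_type, best_prob = fault_types[idx], probs[idx]
    return best_type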
In an alternative embodiment, the training process of the fault prediction classifier includes:
acquiring historical data corresponding to each piece of data and a fault type of the historical data;
constructing a data array according to each historical data and the corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
In this alternative embodiment, the fault prediction classifier is trained based on the data array, so that the trained fault prediction classifier has a function of fault type prediction.
In a specific implementation, all historical data and fault type identifications of the same index item are stored in a database in time order, the fault prediction classifier is trained on the basis of that database, and the trained classifier then performs fault classification on the plurality of data items reported by the server client, so that the probability of the fault type of each current data item is obtained.
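A training sketch under common assumptions: a small one-dimensional convolutional network over a time-ordered window of historical values for one index item. PyTorch is used here only as one possible realization; the framework, network size, and synthetic data are all assumptions made for illustration.

import torch
import torch.nn as nn

class FaultPredictionClassifier(nn.Module):
    """Tiny 1-D CNN over a window of time-ordered metric values."""
    def __init__(self, num_fault_types):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(8, num_fault_types),
        )

    def forward(self, x):  # x: (batch, 1, window)
        return self.net(x)

# Hypothetical data array built from historical data and fault-type labels.
window, num_types = 16, 5
x = torch.randn(32, 1, window)          # windows of historical metric values
y = torch.randint(0, num_types, (32,))  # corresponding fault types

model = FaultPredictionClassifier(num_types)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):  # illustrative training loop
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()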
In an optional embodiment, the method further comprises:
and when the matched fault recovery operation flow identification corresponding to the fault type does not exist, sending preset notification information to operation and maintenance service personnel, so that the operation and maintenance service personnel manually process the fault recovery.
Example two
Fig. 2 is a flowchart of a failure recovery method according to a second embodiment of the present invention. The failure recovery method specifically includes the following steps, and the order of the steps in the flowchart may be changed and some may be omitted according to different requirements.
And S21, the fault automatic processing system receives the fault alarm signal which is sent by the monitoring system and carries the identification information of the server and the identification of the fault recovery operation flow.
And S22, matching the fault recovery job flow script corresponding to the fault recovery job flow identification by the fault automatic processing system.
A plurality of fault recovery workflow scripts are pre-developed and configured in the fault automatic processing system, each fault recovery workflow script is used for executing a fault recovery workflow of the server, and each fault recovery workflow script has a unique ID as a fault recovery workflow identifier.
The fault recovery workflow script stores commonly used step instructions for processing problems. The fault recovery workflow script may include a middleware start script, a stop script, a restart script, a database table space expansion script, a database node restart script, an F5 load balancing device isolation script, a process shutdown script, and the like.
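The set of pre-configured fault recovery workflow scripts can be pictured as a registry keyed by the unique identification; the identifications, paths, and script bodies below are invented for the example.

# Hypothetical registry inside the automatic fault processing system.
WORKFLOW_SCRIPTS = {
    "WF-CPU-001": "#!/bin/sh\n# restart the affected middleware (illustrative)\nsystemctl restart app-middleware\n",
    "WF-DB-001": "#!/bin/sh\n# expand the database table space (illustrative)\n/opt/dba/extend_tablespace.sh main 10G\n",
}

def match_workflow_script(workflow_id):
    """Match the fault recovery workflow script corresponding to the identification."""
    return WORKFLOW_SCRIPTS.get(workflow_id)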
The input parameter of the automatic fault processing system is the identification information of the server; when this identification information is passed in, the automatic fault processing system issues a fault recovery instruction to the server corresponding to that identification information so that the fault recovery command is executed there.
And S23, the fault automatic processing system sends the fault recovery instruction carrying the fault recovery workflow script to the server corresponding to the identification information of the server, so that the client of the server calls the fault recovery workflow script to execute the fault recovery instruction.
When the client of the server receives the fault recovery instruction, it parses the fault recovery workflow script in the instruction and calls the step instructions in the script, so that fault recovery is completed automatically and the server is restored to a normal state.
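How the client of the server might parse the fault recovery instruction and run the step instructions in the workflow script is sketched below; the JSON layout of the instruction and the use of a shell subprocess are assumptions for illustration.

import json
import subprocess
import tempfile

def execute_fault_recovery(instruction_json):
    """Parse a fault recovery instruction and run the workflow script it carries."""
    instruction = json.loads(instruction_json)
    script_body = instruction["workflow_script"]  # assumed field name
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_body)
        path = f.name
    # The step instructions in the script are executed in order by the shell.
    result = subprocess.run(["sh", path], capture_output=True, text=True)
    print(result.stdout)
    return result.returncode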
In an optional embodiment, the method further comprises:
acquiring a plurality of fault types in an operation and maintenance service system and a fault recovery strategy of each fault type;
configuring a fault recovery workflow script according to the fault recovery strategy of each fault type;
setting a fault recovery workflow identification for the fault recovery workflow script;
and sending the fault recovery operation flow identification to the monitoring system.
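A sketch of this optional configuration flow, assuming the fault recovery strategy is already expressed as a script body and leaving the transport to the monitoring system unspecified; all names are invented for the example.

import uuid

def configure_workflow_scripts(fault_policies):
    """Configure one workflow script per fault type and assign identifications.

    fault_policies maps a fault type to its fault recovery strategy, here taken
    to be the script body derived from that strategy.
    """
    script_registry = {}
    fault_type_to_id = {}
    for fault_type, policy_script in fault_policies.items():
        workflow_id = "WF-" + uuid.uuid4().hex[:8]  # unique identification
        script_registry[workflow_id] = policy_script
        fault_type_to_id[fault_type] = workflow_id
    # The identifications in fault_type_to_id would then be sent to the monitoring system.
    return fault_type_to_id, script_registry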
The invention performs fault recovery on a failed server through cooperation between the monitoring system and the automatic fault processing system: the automatic fault processing system encapsulates the solution corresponding to each fault into a fault recovery workflow script, and when a fault occurs the monitoring system triggers the automatic fault processing system to call the fault recovery workflow script, so that the fault is resolved automatically, manual participation is removed from the process, and rapid fault recovery is achieved.
By means of the automatic fault processing system, human intervention is eliminated from fault handling, the notification and delay of manual online processing are avoided, and additional problems caused by possible misoperation during manual fault handling are avoided.
Example three
Fig. 3 is a structural diagram of a failure recovery apparatus according to a third embodiment of the present invention.
In some embodiments, the failure recovery apparatus 30 may include a plurality of functional modules composed of program code segments. The program code of the various program segments in the fault recovery apparatus 30 may be stored in a memory of a computer device and executed by the at least one processor to perform the fault recovery function (described in detail in fig. 1).
In this embodiment, the failure recovery device 30 may be divided into a plurality of functional modules according to the functions performed by the failure recovery device. The functional module may include: a monitoring module 301, an obtaining module 302, a recognition module 303, a matching module 304, a generating module 305, an alerting module 306, a predicting module 307, and a notifying module 308. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The monitoring module 301 is configured to monitor whether a server fails.
The fault recovery system comprises a monitoring system and an automatic fault processing system, wherein the monitoring system is responsible for monitoring whether the server fails.
In an optional embodiment, the monitoring module 301 monitoring whether the server fails includes:
acquiring a log reported by a client of the server, wherein a plurality of data are recorded in the log; comparing each data to a corresponding data threshold; when at least one data is larger than a corresponding data threshold value, determining that the server is monitored to have a fault; and when all the data are less than or equal to the corresponding data threshold value, determining that the server is monitored to normally operate.
In this optional embodiment, the monitoring system is connected to multiple servers simultaneously, each server is provided with a client in advance, and the client actively reports the log of the corresponding server to the monitoring system at regular time or irregular time. The client can also report the log of the corresponding server when receiving a log reporting instruction sent by the monitoring system regularly or irregularly.
The plurality of data may include, but is not limited to: CPU utilization, memory utilization, process survival status, disk utilization, database connections, and the like.
Different data correspond to different data thresholds, for example, the data threshold corresponding to the CPU utilization is 90%, and the data threshold corresponding to the database connection number is 100.
The obtaining module 302 is configured to obtain the identification information of the failed server after the monitoring module 301 monitors that the server fails.
The identification information may include: the IP address of the server, the MAC address of the server and the equipment identification number of the server.
The identification information is used to uniquely represent the server.
The identifying module 303 is configured to identify a fault type of the fault.
And identifying the fault type of the server according to the acquired data.
In an alternative embodiment, the identifying module 303 identifies the fault type of the fault by:
determining target data in the plurality of data that is greater than a data threshold;
matching preset keywords with the target data by a regular matching method;
and when a target keyword which is the same as the preset keyword is matched from the target data, determining a fault type corresponding to the target keyword according to a preset monitoring rule table.
In this optional embodiment, a monitoring rule table is preset in the monitoring system, where the monitoring rule table records a plurality of fault types and keywords corresponding to each fault type.
The fault types include: CPU failures, memory failures, process failures, disk failures, database failures, and the like. The keyword corresponding to the CPU fault is "CPU", and the keyword corresponding to the memory fault is "memory".
According to the working experience of the operation and maintenance service personnel, the faults have obvious characteristics and typical processing methods, so that an operation and maintenance knowledge base can be summarized and formed, and a fault recovery workflow script is developed and configured according to the operation and maintenance knowledge base so as to execute a fault recovery instruction and carry out fault recovery on the server.
The matching module 304 is configured to match whether a failure recovery job flow identifier corresponding to the failure type exists.
The monitoring system records in advance the association between fault types and fault recovery operation flow identifications, and matches the fault recovery operation flow identification corresponding to the fault type against this association, so that the automatic fault processing system can then send the corresponding fault recovery workflow script according to that identification.
The generating module 305 is configured to generate a first fault warning instruction after matching that the fault recovery job flow identifier corresponding to the fault type exists.
When the monitoring system monitors that a fault of a certain fault type (for example, fault a) occurs on a certain server, and matches that a fault recovery operation flow identification corresponding to that fault type exists, a first fault alarm instruction is generated. The first fault alarm instruction is only an alarm-triggering mechanism and does not itself carry any information.
The alarm module 306 is configured to hook the first fault alarm instruction by using a hook function, and send a second fault alarm instruction carrying the identification information of the server and the identifier of the fault recovery operation flow to the automatic fault handling system, so that the automatic fault handling system feeds back the fault recovery instruction according to the second fault alarm instruction.
The monitoring system has a hook function and can hook the first fault alarm instruction, so that a second fault alarm instruction carrying the identification information of the server and the identification of the fault recovery operation flow is sent to the automatic fault processing system.
Because the second fault alarm instruction includes the identification information of the server and the fault recovery operation flow identification, the automatic fault processing system can determine which server has failed from the identification information of the server, determine which fault recovery workflow script needs to be called from the fault recovery operation flow identification, and then send that fault recovery workflow script to the failed server to perform fault recovery.
The specific process of performing fault recovery can be referred to in embodiment two and the related description.
The prediction module 307 is configured to input each data into the fault prediction classifier when all data are less than or equal to the corresponding data threshold; predicting the risk fault type and probability of each data through the fault prediction classifier; taking the risk fault type corresponding to the maximum probability as a target risk fault type; and sending a risk alarm signal carrying the target risk fault type to the server.
The fault prediction classifier is a fault classifier based on a convolutional neural network model. By training the fault prediction classifier in advance, the risk prediction can be performed on data in the server when the server does not have faults, and the fault types with the risk faults and the probability of each fault type are determined. The server is provided with a plurality of data, each data corresponds to a plurality of predicted risk fault types, and the risk fault type corresponding to the maximum probability is used as the target risk fault type of the data. And taking the target risk fault type corresponding to the maximum probability in the target risk fault types of all the data as the target risk fault type of the server. By predicting the target risk failure type of the server, advance prevention and preparation measures can be taken.
In an alternative embodiment, the training process of the fault prediction classifier includes:
acquiring historical data corresponding to each piece of data and a fault type of the historical data;
constructing a data array according to each historical data and the corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
In this alternative embodiment, the fault prediction classifier is trained based on the data array, so that the trained fault prediction classifier has a function of fault type prediction.
During specific implementation, all historical data and fault type identifications of the same index item are stored in a database according to a time sequence, a fault prediction classifier is trained on the basis of the database, and fault classification is carried out by using the trained fault prediction classifier according to a plurality of data reported by a server client side, so that the probability of the fault type of each current data can be obtained.
The notifying module 308 is configured to send preset notification information to an operation and maintenance service staff when no fault recovery operation flow identifier corresponding to the fault type exists, so that the operation and maintenance service staff manually handles fault recovery.
Example four
Fig. 4 is a structural diagram of a failure recovery apparatus according to a fourth embodiment of the present invention.
In some embodiments, the fault recovery apparatus 40 may include a plurality of functional modules composed of program code segments. The program code of the various program segments in the fault recovery apparatus 40 may be stored in a memory of a computer device and executed by the at least one processor to perform the fault recovery function (described in detail with reference to fig. 2).
In this embodiment, the failure recovery apparatus 40 may be divided into a plurality of functional modules according to the functions performed by the apparatus. The functional module may include: the device comprises a receiving module 401, a selecting module 402, a sending module 403 and a configuring module 404. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The receiving module 401 is configured to receive a fault alarm signal that is sent by the monitoring system and carries the identification information of the server and the identification of the fault recovery operation flow.
The selecting module 402 is configured to match out a fault recovery workflow script corresponding to the fault recovery workflow identifier.
A plurality of fault recovery workflow scripts are pre-developed and configured in the fault automatic processing system, each fault recovery workflow script is used for executing a fault recovery workflow of the server, and each fault recovery workflow script has a unique ID as a fault recovery workflow identifier.
The fault recovery workflow script stores commonly used step instructions for processing problems. The fault recovery workflow script may include a middleware start script, a stop script, a restart script, a database table space expansion script, a database node restart script, an F5 load balancing device isolation script, a process shutdown script, and the like.
The input parameter of the fault automatic processing system is the identification information of the server, and when the identification information of the server is transmitted, the fault automatic processing system can issue a fault recovery instruction to the server corresponding to the identification information of the server to execute a fault recovery command.
The sending module 403 is configured to send the fault recovery instruction carrying the fault recovery workflow script to the server corresponding to the identification information of the server, so that the client of the server calls the fault recovery workflow script to execute the fault recovery instruction.
When the client of the server receives the fault recovery instruction, the fault recovery workflow script in the fault recovery instruction is analyzed, the step instruction in the fault recovery workflow script is called, and the fault recovery can be automatically completed, so that the server is recovered to a normal state.
The configuration module 404 is configured to obtain a plurality of fault types in the operation and maintenance service system and a fault recovery policy for each fault type; configuring a fault recovery workflow script according to the fault recovery strategy of each fault type; setting a fault recovery workflow identification for the fault recovery workflow script; and sending the fault recovery operation flow identification to the monitoring system.
The invention performs fault recovery on a failed server through cooperation between the monitoring system and the automatic fault processing system: the automatic fault processing system encapsulates the solution corresponding to each fault into a fault recovery workflow script, and when a fault occurs the monitoring system triggers the automatic fault processing system to call the fault recovery workflow script, so that the fault is resolved automatically, manual participation is removed from the process, and rapid fault recovery is achieved.
By means of the automatic fault processing system, human intervention is eliminated from fault handling, the notification and delay of manual online processing are avoided, and additional problems caused by possible misoperation during manual fault handling are avoided.
Example five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 5 includes a memory 51, at least one processor 52, at least one communication bus 53, and a transceiver 54.
It will be appreciated by those skilled in the art that the configuration of the computer device shown in fig. 5 does not limit the embodiments of the present invention; the configuration may be a bus type or a star type, and the computer device 5 may include more or fewer hardware or software components than those shown, or a different arrangement of components.
In some embodiments, the computer device 5 is a computer device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 5 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the computer device 5 is only an example, and other electronic products that are currently available or may come into existence in the future, such as electronic products that can be adapted to the present invention, should also be included in the scope of the present invention, and are incorporated herein by reference.
In some embodiments, the memory 51 is used for storing program code and various data, such as the apparatus installed in the computer device 5, and enables fast, automatic access to programs or data during the operation of the computer device 5. The memory 51 may include a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk memory, magnetic disk memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 52 may be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 52 is the control unit of the computer device 5; it connects the various components of the entire computer device 5 by using various interfaces and lines, and executes the various functions of the computer device 5 and processes data by running or executing the programs or modules stored in the memory 51 and calling the data stored in the memory 51.
In some embodiments, the at least one communication bus 53 is arranged to enable connection communication between the memory 51 and the at least one processor 52, etc.
Although not shown, the computer device 5 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 52 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 5 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In a further embodiment, the at least one processor 52 may execute operating means of the computer device 5 as well as installed various types of applications, program code, etc., such as the various modules described above.
The memory 51 has program code stored therein, and the at least one processor 52 can call the program code stored in the memory 51 to perform related functions.
In one embodiment of the invention, the memory 51 stores a plurality of instructions that are executed by the at least one processor 52 to implement all or a portion of the steps of the method of the invention.
Specifically, the at least one processor 52 may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 and fig. 2, and is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A fault recovery method is applied to a monitoring system, and is characterized by comprising the following steps:
monitoring whether a server fails or not, and acquiring identification information of the server after monitoring that the server fails;
identifying a fault type of the fault;
matching whether a fault recovery operation flow identifier corresponding to the fault type exists or not;
when the fault recovery operation flow identification corresponding to the fault type exists in the matching, generating a first fault alarm instruction;
hooking the first fault alarm instruction by using a hook function, and sending a second fault alarm instruction carrying identification information of the server and the fault recovery operation flow identification to an automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction according to the second fault alarm instruction;
and calling a fault recovery workflow script in the fault recovery instruction through a client of the server to execute the fault recovery instruction.
2. The method of claim 1, wherein monitoring whether the server fails comprises:
acquiring a log reported by a client of the server, wherein a plurality of data are recorded in the log;
comparing each data to a corresponding data threshold;
when at least one data is larger than a corresponding data threshold value, determining that the server is monitored to have a fault;
and when all the data are less than or equal to the corresponding data threshold value, determining that the server is monitored to normally operate.
3. The method of claim 2, wherein the identifying the fault type of the fault comprises:
determining target data in the plurality of data that is greater than a data threshold;
matching preset keywords with the target data by a regular matching method;
and when a target keyword which is the same as the preset keyword is matched from the target data, determining a fault type corresponding to the target keyword according to a preset monitoring rule table.
4. The method of claim 2, wherein when all of the data is less than or equal to the corresponding data threshold, the method further comprises:
inputting each data into a fault prediction classifier;
predicting the risk fault type and probability of each data through the fault prediction classifier;
taking the risk fault type corresponding to the maximum probability as a target risk fault type;
and sending a risk alarm signal carrying the target risk fault type to the server.
5. The method of claim 4, wherein the training process of the fault prediction classifier comprises:
acquiring historical data corresponding to each piece of data and a fault type of the historical data;
constructing a data array according to each historical data and the corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
6. The method of any one of claims 1 to 5, further comprising:
and when the matched fault recovery operation flow identification corresponding to the fault type does not exist, sending preset notification information to operation and maintenance service personnel, so that the operation and maintenance service personnel manually process the fault recovery.
7. A fault recovery method applied to a fault automatic processing system is characterized by comprising the following steps:
receiving a fault alarm signal which is sent by a monitoring system and carries identification information of a server and a fault recovery operation flow identification;
matching a fault recovery workflow script corresponding to the fault recovery workflow identification;
and sending the fault recovery instruction carrying the fault recovery workflow script to a server corresponding to the identification information of the server, so that a client of the server calls the fault recovery workflow script to execute the fault recovery instruction.
8. The method of claim 7, wherein the method further comprises:
acquiring a plurality of fault types in an operation and maintenance service system and a fault recovery strategy of each fault type;
configuring a fault recovery workflow script according to the fault recovery strategy of each fault type;
setting a fault recovery workflow identification for the fault recovery workflow script;
and sending the fault recovery operation flow identification to the monitoring system.
9. A computer device comprising a processor for implementing a method for fault recovery according to any one of claims 1 to 6, or for implementing a method for fault recovery according to claim 7 or 8, when executing a computer program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for fault recovery according to one of claims 1 to 6, or carries out the method for fault recovery according to claim 7 or 8.
CN202010728104.XA 2020-07-23 2020-07-23 Failure recovery method, computer device, and storage medium Pending CN111897671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728104.XA CN111897671A (en) 2020-07-23 2020-07-23 Failure recovery method, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728104.XA CN111897671A (en) 2020-07-23 2020-07-23 Failure recovery method, computer device, and storage medium

Publications (1)

Publication Number Publication Date
CN111897671A true CN111897671A (en) 2020-11-06

Family

ID=73190016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728104.XA Pending CN111897671A (en) 2020-07-23 2020-07-23 Failure recovery method, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN111897671A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112838965A (en) * 2021-02-19 2021-05-25 浪潮云信息技术股份公司 Method for identifying and recovering strong synchronization role fault
CN113157483A (en) * 2021-05-26 2021-07-23 中国银行股份有限公司 Exception handling method and device
CN113179180A (en) * 2021-04-23 2021-07-27 杭州安恒信息技术股份有限公司 Basalt client disaster fault repairing method, basalt client disaster fault repairing device and basalt client disaster storage medium
CN113176996A (en) * 2021-04-29 2021-07-27 深信服科技股份有限公司 Fault processing method, engine, plug-in probe, device and readable storage medium
CN113448811A (en) * 2021-05-31 2021-09-28 山东英信计算机技术有限公司 Method, device, equipment and readable medium for lighting fault lamp of server system
CN113535034A (en) * 2021-09-07 2021-10-22 北京轻松筹信息技术有限公司 Fault warning method, device, system and medium
CN113553210A (en) * 2021-07-30 2021-10-26 平安普惠企业管理有限公司 Alarm data processing method, device, equipment and storage medium
CN113592337A (en) * 2021-08-09 2021-11-02 北京豆萌信息技术有限公司 Fault processing method and device, electronic equipment and storage medium
CN114064341A (en) * 2021-11-22 2022-02-18 建信金融科技有限责任公司 Fault disposal method and system based on emergency plan
CN114095343A (en) * 2021-11-18 2022-02-25 深圳壹账通智能科技有限公司 Disaster recovery method, device, equipment and storage medium based on double-active system
CN116560908A (en) * 2023-05-09 2023-08-08 中工数保(北京)科技有限公司 Data recovery method of industrial control system and related equipment thereof
WO2024183555A1 (en) * 2023-03-06 2024-09-12 广州疆海科技有限公司 Energy storage device troubleshooting method and apparatus, and computer device, medium and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150074367A1 (en) * 2013-09-09 2015-03-12 International Business Machines Corporation Method and apparatus for faulty memory utilization
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
CN109634828A (en) * 2018-12-17 2019-04-16 浪潮电子信息产业股份有限公司 Failure prediction method, device, equipment and storage medium
CN110191003A (en) * 2019-06-18 2019-08-30 北京达佳互联信息技术有限公司 Fault repairing method, device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150074367A1 (en) * 2013-09-09 2015-03-12 International Business Machines Corporation Method and apparatus for faulty memory utilization
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
CN109634828A (en) * 2018-12-17 2019-04-16 浪潮电子信息产业股份有限公司 Failure prediction method, device, equipment and storage medium
CN110191003A (en) * 2019-06-18 2019-08-30 北京达佳互联信息技术有限公司 Fault repairing method, device, computer equipment and storage medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112838965A (en) * 2021-02-19 2021-05-25 浪潮云信息技术股份公司 Method for identifying and recovering strong synchronization role fault
CN112838965B (en) * 2021-02-19 2023-03-28 浪潮云信息技术股份公司 Method for identifying and recovering strong synchronization role fault
CN113179180A (en) * 2021-04-23 2021-07-27 杭州安恒信息技术股份有限公司 Basalt client disaster fault repairing method, basalt client disaster fault repairing device and basalt client disaster storage medium
CN113176996A (en) * 2021-04-29 2021-07-27 深信服科技股份有限公司 Fault processing method, engine, plug-in probe, device and readable storage medium
CN113157483A (en) * 2021-05-26 2021-07-23 中国银行股份有限公司 Exception handling method and device
CN113448811A (en) * 2021-05-31 2021-09-28 山东英信计算机技术有限公司 Method, device, equipment and readable medium for lighting fault lamp of server system
CN113553210A (en) * 2021-07-30 2021-10-26 平安普惠企业管理有限公司 Alarm data processing method, device, equipment and storage medium
CN113592337A (en) * 2021-08-09 2021-11-02 北京豆萌信息技术有限公司 Fault processing method and device, electronic equipment and storage medium
CN113535034A (en) * 2021-09-07 2021-10-22 北京轻松筹信息技术有限公司 Fault warning method, device, system and medium
CN114095343A (en) * 2021-11-18 2022-02-25 深圳壹账通智能科技有限公司 Disaster recovery method, device, equipment and storage medium based on double-active system
CN114064341A (en) * 2021-11-22 2022-02-18 建信金融科技有限责任公司 Fault disposal method and system based on emergency plan
WO2024183555A1 (en) * 2023-03-06 2024-09-12 广州疆海科技有限公司 Energy storage device troubleshooting method and apparatus, and computer device, medium and system
CN116560908A (en) * 2023-05-09 2023-08-08 中工数保(北京)科技有限公司 Data recovery method of industrial control system and related equipment thereof
CN116560908B (en) * 2023-05-09 2024-01-26 中工数保(北京)科技有限公司 Data recovery method of industrial control system and related equipment thereof

Similar Documents

Publication Publication Date Title
CN111897671A (en) Failure recovery method, computer device, and storage medium
CN108710544B (en) Process monitoring method of database system and rail transit comprehensive monitoring system
WO2016188100A1 (en) Information system fault scenario information collection method and system
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN105099783B (en) A kind of method and system for realizing operation system alarm emergency disposal automation
JP6647824B2 (en) Error diagnosis system and error diagnosis method
CN110738352A (en) Maintenance dispatching management method, device, equipment and medium based on fault big data
CN107660289A (en) Automatic network control
CN107844339B (en) Task scheduling method, plug-in and server
CN112631866A (en) Server hardware state monitoring method and device, electronic equipment and medium
CN110275795A (en) A kind of O&M method and device based on alarm
CN113592337A (en) Fault processing method and device, electronic equipment and storage medium
CN115202958A (en) Power abnormity monitoring method and device, electronic equipment and storage medium
CN113312200A (en) Event processing method and device, computer equipment and storage medium
CN117670033A (en) Security check method, system, electronic equipment and storage medium
CN117453036A (en) Method, system and device for adjusting power consumption of equipment in server
CN109995554A (en) The control method and cloud dispatch control device of multi-stage data center active-standby switch
CN112286762A (en) System information analysis method and device based on cloud environment, electronic equipment and medium
CN115102838B (en) Emergency processing method and device for server downtime risk and electronic equipment
US20080216057A1 (en) Recording medium storing monitoring program, monitoring method, and monitoring system
CN111710403A (en) Medical equipment supervision method, equipment and readable storage medium
CN114237196B (en) Split robot fault processing method and device, terminal equipment and medium
CN112152878B (en) Monitoring and management method, system, terminal and storage medium for digital channel of transformer substation
CN115225534A (en) Method for monitoring running state of monitoring server
CN115168137A (en) Monitoring method and system for timing task, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106