CN111897671A - Failure recovery method, computer device, and storage medium - Google Patents
- Publication number
- CN111897671A CN111897671A CN202010728104.XA CN202010728104A CN111897671A CN 111897671 A CN111897671 A CN 111897671A CN 202010728104 A CN202010728104 A CN 202010728104A CN 111897671 A CN111897671 A CN 111897671A
- Authority
- CN
- China
- Prior art keywords
- fault
- server
- fault recovery
- data
- recovery
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Quality & Reliability (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention relates to the technical field of security monitoring and provides a fault recovery method, a computer device, and a storage medium. The fault recovery method comprises the following steps: acquiring identification information of a server after monitoring that the server has failed; identifying the fault type of the fault; generating a first fault alarm instruction when a fault recovery workflow identifier corresponding to the fault type is matched; hooking the first fault alarm instruction with a hook function and sending a second fault alarm instruction, carrying the identification information of the server and the fault recovery workflow identifier, to the automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction; and calling the fault recovery workflow script in the fault recovery instruction through the client of the server to execute the fault recovery instruction. The invention encapsulates the solution for each fault into a fault recovery workflow script; when the fault occurs, the monitoring system triggers the automatic fault processing system to call the script so that the fault is resolved automatically.
Description
Technical Field
The invention relates to the technical field of security monitoring, and in particular to a fault recovery method, a computer device, and a storage medium.
Background
In current server management systems, a fault monitoring platform (also called Uwork) monitors in real time whether any of the maintained servers has failed. When a server fails, the platform sends out alarm information to notify relevant personnel (such as server administrators and operation and maintenance staff) to handle the fault promptly, and those personnel log in to the fault monitoring platform and recover the fault manually.
Because this existing server fault recovery scheme requires manual intervention, its labor cost is high, its operating efficiency is low, and its reliability is poor.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a fault recovery method, a computer device, and a storage medium that encapsulate the solution for each fault into a fault recovery workflow script; when the fault occurs, the monitoring system triggers the automatic fault processing system to call the script so that the fault is resolved automatically.
A first aspect of the present invention provides a fault recovery method applied in a monitoring system, where the method comprises:
monitoring whether a server has failed, and acquiring identification information of the server after a failure is detected;
identifying a fault type of the fault;
matching whether a fault recovery workflow identifier corresponding to the fault type exists;
generating a first fault alarm instruction after matching the fault recovery workflow identifier corresponding to the fault type;
hooking the first fault alarm instruction with a hook function, and sending a second fault alarm instruction carrying the identification information of the server and the fault recovery workflow identifier to an automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction according to the second fault alarm instruction;
and calling a fault recovery workflow script in the fault recovery instruction through a client of the server to execute the fault recovery instruction.
According to an alternative embodiment of the present invention, monitoring whether the server has failed comprises:
acquiring a log reported by a client of the server, wherein a plurality of items of data are recorded in the log;
comparing each item of data with its corresponding data threshold;
determining that a server fault has been detected when at least one item of data is greater than its corresponding data threshold;
and determining that the server is operating normally when all items of data are less than or equal to their corresponding data thresholds.
According to an alternative embodiment of the invention, identifying the fault type of the fault comprises:
determining the target data among the plurality of items of data that are greater than their data thresholds;
matching preset keywords against the target data by regular-expression matching;
and, when a target keyword identical to a preset keyword is matched in the target data, determining the fault type corresponding to the target keyword according to a preset monitoring rule table.
According to an alternative embodiment of the invention, when all items of data are less than or equal to their corresponding data thresholds, the method further comprises:
inputting each item of data into a fault prediction classifier;
predicting, with the fault prediction classifier, the risk fault types and their probabilities for each item of data;
taking the risk fault type with the highest probability as the target risk fault type;
and sending a risk alarm signal carrying the target risk fault type to the server.
According to an alternative embodiment of the present invention, the training process of the fault prediction classifier comprises:
acquiring the historical data corresponding to each item of data and the fault type of that historical data;
constructing a data array from each historical record and its corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
According to an alternative embodiment of the invention, the method further comprises:
sending preset notification information to operation and maintenance staff when no fault recovery workflow identifier corresponding to the fault type is matched, so that the staff can handle the fault recovery manually.
A second aspect of the present invention provides a fault recovery method applied in an automatic fault processing system, where the method comprises:
receiving a fault alarm signal sent by a monitoring system that carries the identification information of a server and a fault recovery workflow identifier;
matching the fault recovery workflow script corresponding to the fault recovery workflow identifier;
and sending a fault recovery instruction carrying the fault recovery workflow script to the server corresponding to the identification information, so that a client of the server calls the script to execute the fault recovery instruction.
According to an alternative embodiment of the invention, the method further comprises:
acquiring a plurality of fault types from an operation and maintenance service system together with a fault recovery strategy for each fault type;
configuring a fault recovery workflow script according to the fault recovery strategy of each fault type;
setting a fault recovery workflow identifier for each fault recovery workflow script;
and sending the fault recovery workflow identifiers to the monitoring system.
A third aspect of the invention provides a computer device comprising a processor for implementing the fault recovery method when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fault recovery method.
In summary, the fault recovery method, computer device, and storage medium of the present invention recover a failed server through the cooperation of a monitoring system and an automatic fault processing system. The automatic fault processing system encapsulates the solution for each fault into a fault recovery workflow script; when the fault occurs, the monitoring system triggers the automatic fault processing system to call the script, so that the fault is resolved automatically, manual participation is removed from the process, and rapid fault recovery is achieved.
Drawings
Fig. 1 is a flowchart of a failure recovery method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a failure recovery method according to a second embodiment of the present invention.
Fig. 3 is a structural diagram of a failure recovery apparatus according to a third embodiment of the present invention.
Fig. 4 is a structural diagram of a failure recovery apparatus according to a fourth embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of a failure recovery method according to an embodiment of the present invention. The failure recovery method includes the following steps; the order of the steps in the flowchart may be changed, and some steps may be omitted, according to different requirements.
S11, the monitoring system monitors whether the server has failed, and acquires the identification information of the server after detecting a failure.
The fault recovery system comprises a monitoring system and an automatic fault processing system. The monitoring system is responsible for monitoring whether a server has failed and for acquiring the identification information of the failed server when a failure is detected.
The identification information may include the IP address of the server, the MAC address of the server, and the device identification number of the server.
The identification information uniquely identifies the server.
In an optional embodiment, monitoring whether the server has failed includes:
acquiring a log reported by a client of the server, wherein a plurality of items of data are recorded in the log;
comparing each item of data with its corresponding data threshold;
determining that a server fault has been detected when at least one item of data is greater than its corresponding data threshold;
and determining that the server is operating normally when all items of data are less than or equal to their corresponding data thresholds.
In this optional embodiment, the monitoring system is connected to multiple servers simultaneously. Each server is provided with a client in advance, and the client actively reports the log of its server to the monitoring system at regular or irregular intervals. The client can also report the log when it receives a log reporting instruction that the monitoring system sends regularly or irregularly.
The items of data may include, but are not limited to: CPU utilization, memory utilization, process survival status, disk utilization, database connection count, and the like.
Different items of data correspond to different thresholds; for example, the threshold for CPU utilization is 90%, and the threshold for the database connection count is 100.
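For illustration, the threshold comparison can be sketched in a few lines of Python; the metric names and threshold values below are assumptions for the example, not values fixed by this disclosure.

```python
# Minimal sketch of the threshold-based fault check described above.
# Metric names and threshold values are illustrative assumptions.
THRESHOLDS = {
    "cpu_utilization": 90.0,      # percent
    "memory_utilization": 85.0,   # percent
    "disk_utilization": 80.0,     # percent
    "db_connections": 100,        # count
}

def check_server_log(metrics: dict) -> tuple:
    """Compare each reported metric with its threshold.

    Returns (faulty, exceeded), where `exceeded` holds the target data,
    i.e. every metric that is greater than its threshold.
    """
    exceeded = {
        name: value
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    }
    return bool(exceeded), exceeded

# Example: one metric above its threshold means the server is considered faulty.
faulty, target_data = check_server_log(
    {"cpu_utilization": 97.2, "memory_utilization": 40.0, "db_connections": 12}
)
```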
S12, the monitoring system identifies the fault type of the fault.
The fault type of the server is identified according to the acquired data.
In an optional embodiment, identifying the fault type of the fault includes:
determining the target data among the plurality of items of data that are greater than their data thresholds;
matching preset keywords against the target data by regular-expression matching;
and, when a target keyword identical to a preset keyword is matched in the target data, determining the fault type corresponding to the target keyword according to a preset monitoring rule table.
In this optional embodiment, a monitoring rule table is preset in the monitoring system; the table records a plurality of fault types and the keywords corresponding to each fault type.
The fault types include CPU faults, memory faults, process faults, disk faults, database faults, and the like. For example, the keyword corresponding to a CPU fault is "CPU", and the keyword corresponding to a memory fault is "memory".
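A minimal sketch of this rule-table lookup with regular expressions; the keyword patterns and fault-type names are illustrative assumptions, not the patent's actual rule table.

```python
import re

# Illustrative monitoring rule table mapping a keyword pattern to a fault
# type; both the patterns and the fault-type names are assumed values.
MONITORING_RULES = {
    r"cpu": "CPU fault",
    r"mem(ory)?": "memory fault",
    r"disk": "disk fault",
    r"db|database": "database fault",
    r"process": "process fault",
}

def identify_fault_type(target_data: dict) -> str:
    """Match preset keywords against the target data by regular expressions."""
    text = " ".join(target_data)  # e.g. "cpu_utilization db_connections"
    for pattern, fault_type in MONITORING_RULES.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return fault_type
    return "unknown"

# Continuing the earlier example: {"cpu_utilization": 97.2} -> "CPU fault".
print(identify_fault_type({"cpu_utilization": 97.2}))
```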
According to the working experience of operation and maintenance staff, such faults have obvious characteristics and typical handling methods, so an operation and maintenance knowledge base can be compiled, and fault recovery workflow scripts can be developed and configured from that knowledge base to execute fault recovery instructions and recover servers from faults.
S13, the monitoring system matches whether a fault recovery workflow identifier corresponding to the fault type exists.
The monitoring system records in advance the association between fault types and fault recovery workflow identifiers, and matches the fault type against this association to find the corresponding workflow identifier, so that the automatic fault processing system can later send the corresponding fault recovery workflow script according to that identifier.
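The recorded association can be as simple as a lookup table; the fault types and identifier values below are assumptions for the example.

```python
# Illustrative association, recorded in advance by the monitoring system,
# between fault types and fault recovery workflow identifiers.
FAULT_TYPE_TO_WORKFLOW_ID = {
    "CPU fault": "WF-CPU-001",
    "memory fault": "WF-MEM-001",
    "database fault": "WF-DB-001",
}

# A miss (None) is the "no identifier matched" case, which falls back to
# notifying operation and maintenance staff as described below.
workflow_id = FAULT_TYPE_TO_WORKFLOW_ID.get("CPU fault")
```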
S14, the monitoring system generates a first fault alarm instruction after matching a fault recovery workflow identifier corresponding to the fault type.
When the monitoring system detects that a server has a fault of a certain type (such as fault A) and matches an existing fault recovery workflow identifier for that fault type, it generates a first fault alarm instruction. The first fault alarm instruction is only an alarm trigger and does not carry any information.
S15, the monitoring system hooks the first fault alarm instruction with a hook function and sends a second fault alarm instruction carrying the identification information of the server and the fault recovery workflow identifier to the automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction according to the second fault alarm instruction.
The monitoring system has a hook function that can hook the first fault alarm instruction, so that a second fault alarm instruction carrying the identification information of the server and the fault recovery workflow identifier is sent to the automatic fault processing system.
Because the second fault alarm instruction includes the identification information of the server and the fault recovery workflow identifier, the automatic fault processing system can determine which server has failed from the identification information and determine which fault recovery workflow script to call from the workflow identifier, and then send that script to the failed server for fault recovery.
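Illustratively, the hook can be a callback that fires on the first alarm and builds and posts the second alarm. The Python sketch below assumes a hypothetical HTTP endpoint and JSON payload for the automatic fault processing system; neither is specified in this disclosure.

```python
import json
import urllib.request

# Assumed endpoint of the automatic fault processing system (hypothetical).
FAULT_SYSTEM_URL = "http://fault-handler.example.internal/alarm"

def alarm_hook(server_id: str, workflow_id: str) -> None:
    """Hook fired on the first alarm; builds and sends the second alarm."""
    second_alarm = {"server_id": server_id, "workflow_id": workflow_id}
    request = urllib.request.Request(
        FAULT_SYSTEM_URL,
        data=json.dumps(second_alarm).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The automatic fault processing system replies with a recovery instruction.
    urllib.request.urlopen(request)

def raise_first_alarm(server_id: str, workflow_id: str, hook=alarm_hook) -> None:
    # The first alarm carries no information itself; the hook attaches the
    # server identification and the workflow identifier before forwarding.
    hook(server_id, workflow_id)
```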
The specific process of performing fault recovery can be referred to in embodiment two and the related description.
In an optional embodiment, when all items of data are less than or equal to their corresponding data thresholds, the method further comprises:
inputting each item of data into a fault prediction classifier;
predicting, with the fault prediction classifier, the risk fault types and their probabilities for each item of data;
taking the risk fault type with the highest probability as the target risk fault type;
and sending a risk alarm signal carrying the target risk fault type to the server.
The fault prediction classifier is a fault classifier based on a convolutional neural network model. Because the classifier is trained in advance, risk prediction can be performed on the server's data even when no fault has occurred, determining which fault types are at risk and the probability of each. The server reports multiple items of data, and each item corresponds to several predicted risk fault types; the type with the highest probability is taken as that item's target risk fault type. The target risk fault type with the highest probability across all items of data is then taken as the server's target risk fault type. By predicting the server's target risk fault type, prevention and preparation measures can be taken in advance.
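The selection described above reduces to two argmax steps. A minimal sketch, assuming a `predict` callable that returns a {fault type: probability} mapping for one item of data (an assumption for the example):

```python
# Sketch of selecting the server's target risk fault type from classifier
# output; `predict(name, value)` is an assumed per-metric interface.
def target_risk_fault_type(metrics: dict, predict) -> str:
    per_metric = {}  # metric name -> (best fault type, its probability)
    for name, value in metrics.items():
        probabilities = predict(name, value)
        best = max(probabilities, key=probabilities.get)
        per_metric[name] = (best, probabilities[best])
    # The server-level target is the per-metric winner with the highest probability.
    return max(per_metric.values(), key=lambda pair: pair[1])[0]
```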
In an alternative embodiment, the training process of the fault prediction classifier includes:
acquiring the historical data corresponding to each item of data and the fault type of that historical data;
constructing a data array from each historical record and its corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
In this alternative embodiment, the fault prediction classifier is trained on the data array so that the trained classifier can predict fault types.
In a specific implementation, all historical data and fault type labels for the same metric are stored in a database in time order, the fault prediction classifier is trained on that database, and the trained classifier then classifies the items of data reported by each server's client, yielding the probability of each fault type for the current data.
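As an illustration only, a classifier of this kind could be a small one-dimensional convolutional network over time-ordered windows of a metric. The sketch below uses PyTorch; the window length, class count, and hyperparameters are assumptions, not values from this disclosure.

```python
import torch
import torch.nn as nn

NUM_CLASSES, WINDOW = 5, 32  # assumed fault-type count and window length

# Minimal 1-D CNN: two conv layers, global pooling, linear classifier head.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, NUM_CLASSES),
)

def train(history: torch.Tensor, labels: torch.Tensor, epochs: int = 10) -> None:
    """history: (N, 1, WINDOW) time-ordered metric windows; labels: (N,) fault types."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(history), labels)
        loss.backward()
        optimizer.step()

# At monitoring time, softmax over the logits gives the probability of each
# fault type for the current data, as described above.
probs = torch.softmax(model(torch.randn(1, 1, WINDOW)), dim=1)
```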
In an optional embodiment, the method further comprises:
sending preset notification information to operation and maintenance staff when no fault recovery workflow identifier corresponding to the fault type is matched, so that the staff can handle the fault recovery manually.
Example two
Fig. 2 is a flowchart of a failure recovery method according to a second embodiment of the present invention. The failure recovery method includes the following steps; the order of the steps in the flowchart may be changed, and some steps may be omitted, according to different requirements.
S21, the automatic fault processing system receives the fault alarm signal sent by the monitoring system that carries the identification information of the server and the fault recovery workflow identifier.
S22, the automatic fault processing system matches the fault recovery workflow script corresponding to the fault recovery workflow identifier.
A plurality of fault recovery workflow scripts are developed and configured in advance in the automatic fault processing system. Each script executes one fault recovery workflow on a server, and each script has a unique ID that serves as its fault recovery workflow identifier.
A fault recovery workflow script stores the step instructions commonly used to handle a problem. The scripts may include a middleware start script, a stop script, a restart script, a database tablespace expansion script, a database node restart script, an F5 load balancer isolation script, a process shutdown script, and the like.
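A minimal sketch of how the automatic fault processing system might map identifiers to scripts and build the recovery instruction; the identifier values, script paths, and instruction fields are assumptions for the example.

```python
# Illustrative registry in the automatic fault processing system mapping each
# fault recovery workflow identifier to its script (paths and IDs are assumed).
WORKFLOW_SCRIPTS = {
    "WF-CPU-001": "scripts/process_shutdown.sh",
    "WF-MEM-001": "scripts/middleware_restart.sh",
    "WF-DB-001": "scripts/db_tablespace_expand.sh",
}

def build_recovery_instruction(server_id: str, workflow_id: str) -> dict:
    """Match the script for the identifier and wrap it in a recovery instruction."""
    script = WORKFLOW_SCRIPTS[workflow_id]  # KeyError means no script configured
    return {"server_id": server_id, "workflow_id": workflow_id, "script": script}
```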
The input parameter of the automatic fault processing system is the identification information of the server; when this identification information is passed in, the system can issue a fault recovery instruction to the server corresponding to it so that the recovery command is executed.
S23, the automatic fault processing system sends the fault recovery instruction carrying the fault recovery workflow script to the server corresponding to the identification information, so that the client of the server calls the script to execute the fault recovery instruction.
When the client of the server receives the fault recovery instruction, it parses the fault recovery workflow script in the instruction and calls the step instructions in the script; fault recovery is completed automatically, and the server is restored to its normal state.
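On the client side, executing the workflow script can be as simple as running it in a subshell and reporting success. A minimal Python sketch, assuming the instruction format from the registry sketch above:

```python
import subprocess

def execute_recovery_instruction(instruction: dict) -> bool:
    """Run the fault recovery workflow script; True means the server recovered."""
    result = subprocess.run(
        ["/bin/sh", instruction["script"]],  # assumed shell-script workflow
        capture_output=True, text=True, timeout=600,
    )
    return result.returncode == 0
```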
In an optional embodiment, the method further comprises:
acquiring a plurality of fault types from an operation and maintenance service system together with a fault recovery strategy for each fault type;
configuring a fault recovery workflow script according to the fault recovery strategy of each fault type;
setting a fault recovery workflow identifier for each fault recovery workflow script;
and sending the fault recovery workflow identifiers to the monitoring system.
The invention recovers a failed server through the cooperation of the monitoring system and the automatic fault processing system. The automatic fault processing system encapsulates the solution for each fault into a fault recovery workflow script; when the fault occurs, the monitoring system triggers the automatic fault processing system to call the script, so that the fault is resolved automatically, manual participation is removed from the process, and rapid fault recovery is achieved.
With the automatic fault processing system, human intervention is removed from fault handling, the notification and delay of manual online processing are avoided, and the additional problems that misoperation during manual handling might cause are prevented.
Example three
Fig. 3 is a structural diagram of a failure recovery apparatus according to a third embodiment of the present invention.
In some embodiments, the failure recovery apparatus 30 may include a plurality of functional modules composed of program code segments. The program code of the program segments in the fault recovery apparatus 30 may be stored in the memory of a computer device and executed by at least one processor to perform the fault recovery function (described in detail in fig. 1).
In this embodiment, the failure recovery apparatus 30 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a monitoring module 301, an obtaining module 302, an identifying module 303, a matching module 304, a generating module 305, an alarm module 306, a predicting module 307, and a notifying module 308. A module referred to herein is a series of computer program segments that can be executed by at least one processor to perform a fixed function and that are stored in memory. The functions of the modules are described in detail in the following embodiments.
The monitoring module 301 is configured to monitor whether a server fails.
The fault recovery system comprises a monitoring system and an automatic fault processing system; the monitoring system is responsible for monitoring whether a server has failed.
In an optional embodiment, the monitoring module 301 monitoring whether the server has failed includes:
acquiring a log reported by a client of the server, wherein a plurality of items of data are recorded in the log; comparing each item of data with its corresponding data threshold; determining that a server fault has been detected when at least one item of data is greater than its corresponding data threshold; and determining that the server is operating normally when all items of data are less than or equal to their corresponding data thresholds.
In this optional embodiment, the monitoring system is connected to multiple servers simultaneously. Each server is provided with a client in advance, and the client actively reports the log of its server to the monitoring system at regular or irregular intervals. The client can also report the log when it receives a log reporting instruction that the monitoring system sends regularly or irregularly.
The items of data may include, but are not limited to: CPU utilization, memory utilization, process survival status, disk utilization, database connection count, and the like.
Different items of data correspond to different thresholds; for example, the threshold for CPU utilization is 90%, and the threshold for the database connection count is 100.
The obtaining module 302 is configured to acquire the identification information of the failed server after the monitoring module 301 detects that the server has failed.
The identification information may include the IP address of the server, the MAC address of the server, and the device identification number of the server.
The identification information uniquely identifies the server.
The identifying module 303 is configured to identify the fault type of the fault.
The fault type of the server is identified according to the acquired data.
In an alternative embodiment, the identifying module 303 identifies the fault type of the fault by:
determining the target data among the plurality of items of data that are greater than their data thresholds;
matching preset keywords against the target data by regular-expression matching;
and, when a target keyword identical to a preset keyword is matched in the target data, determining the fault type corresponding to the target keyword according to a preset monitoring rule table.
In this optional embodiment, a monitoring rule table is preset in the monitoring system; the table records a plurality of fault types and the keywords corresponding to each fault type.
The fault types include CPU faults, memory faults, process faults, disk faults, database faults, and the like. For example, the keyword corresponding to a CPU fault is "CPU", and the keyword corresponding to a memory fault is "memory".
According to the working experience of operation and maintenance staff, such faults have obvious characteristics and typical handling methods, so an operation and maintenance knowledge base can be compiled, and fault recovery workflow scripts can be developed and configured from that knowledge base to execute fault recovery instructions and recover servers from faults.
The matching module 304 is configured to match whether a fault recovery workflow identifier corresponding to the fault type exists.
The monitoring system records in advance the association between fault types and fault recovery workflow identifiers, and matches the fault type against this association to find the corresponding workflow identifier, so that the automatic fault processing system can later send the corresponding fault recovery workflow script according to that identifier.
The generating module 305 is configured to generate a first fault alarm instruction after matching that a fault recovery workflow identifier corresponding to the fault type exists.
When the monitoring system detects that a server has a fault of a certain type (such as fault A) and matches an existing fault recovery workflow identifier for that fault type, it generates a first fault alarm instruction. The first fault alarm instruction is only an alarm trigger and does not carry any information.
The alarm module 306 is configured to hook the first fault alarm instruction with a hook function and send a second fault alarm instruction carrying the identification information of the server and the fault recovery workflow identifier to the automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction according to the second fault alarm instruction.
The monitoring system has a hook function that can hook the first fault alarm instruction, so that a second fault alarm instruction carrying the identification information of the server and the fault recovery workflow identifier is sent to the automatic fault processing system.
Because the second fault alarm instruction includes the identification information of the server and the fault recovery workflow identifier, the automatic fault processing system can determine which server has failed from the identification information and determine which fault recovery workflow script to call from the workflow identifier, and then send that script to the failed server for fault recovery.
The specific process of performing fault recovery can be referred to in embodiment two and the related description.
The predicting module 307 is configured to input each item of data into the fault prediction classifier when all items of data are less than or equal to their corresponding data thresholds; predict, with the fault prediction classifier, the risk fault types and their probabilities for each item of data; take the risk fault type with the highest probability as the target risk fault type; and send a risk alarm signal carrying the target risk fault type to the server.
The fault prediction classifier is a fault classifier based on a convolutional neural network model. Because the classifier is trained in advance, risk prediction can be performed on the server's data even when no fault has occurred, determining which fault types are at risk and the probability of each. The server reports multiple items of data, and each item corresponds to several predicted risk fault types; the type with the highest probability is taken as that item's target risk fault type. The target risk fault type with the highest probability across all items of data is then taken as the server's target risk fault type. By predicting the server's target risk fault type, prevention and preparation measures can be taken in advance.
In an alternative embodiment, the training process of the fault prediction classifier includes:
acquiring the historical data corresponding to each item of data and the fault type of that historical data;
constructing a data array from each historical record and its corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
In this alternative embodiment, the fault prediction classifier is trained on the data array so that the trained classifier can predict fault types.
In a specific implementation, all historical data and fault type labels for the same metric are stored in a database in time order, the fault prediction classifier is trained on that database, and the trained classifier then classifies the items of data reported by each server's client, yielding the probability of each fault type for the current data.
The notifying module 308 is configured to send preset notification information to operation and maintenance staff when no fault recovery workflow identifier corresponding to the fault type exists, so that the staff can handle the fault recovery manually.
Example four
Fig. 4 is a structural diagram of a failure recovery apparatus according to a fourth embodiment of the present invention.
In some embodiments, the fault recovery apparatus 40 may include a plurality of functional modules composed of program code segments. The program code of the program segments in the fault recovery apparatus 40 may be stored in the memory of a computer device and executed by at least one processor to perform the fault recovery function (described in detail with reference to fig. 2).
In this embodiment, the failure recovery apparatus 40 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a receiving module 401, a selecting module 402, a sending module 403, and a configuration module 404. A module referred to herein is a series of computer program segments that can be executed by at least one processor to perform a fixed function and that are stored in memory. The functions of the modules are described in detail in the following embodiments.
The receiving module 401 is configured to receive the fault alarm signal sent by the monitoring system that carries the identification information of the server and the fault recovery workflow identifier.
The selecting module 402 is configured to match the fault recovery workflow script corresponding to the fault recovery workflow identifier.
A plurality of fault recovery workflow scripts are developed and configured in advance in the automatic fault processing system. Each script executes one fault recovery workflow on a server, and each script has a unique ID that serves as its fault recovery workflow identifier.
A fault recovery workflow script stores the step instructions commonly used to handle a problem. The scripts may include a middleware start script, a stop script, a restart script, a database tablespace expansion script, a database node restart script, an F5 load balancer isolation script, a process shutdown script, and the like.
The input parameter of the automatic fault processing system is the identification information of the server; when this identification information is passed in, the system can issue a fault recovery instruction to the server corresponding to it so that the recovery command is executed.
The sending module 403 is configured to send the fault recovery instruction carrying the fault recovery workflow script to the server corresponding to the identification information, so that the client of the server calls the script to execute the fault recovery instruction.
When the client of the server receives the fault recovery instruction, it parses the fault recovery workflow script in the instruction and calls the step instructions in the script; fault recovery is completed automatically, and the server is restored to its normal state.
The configuration module 404 is configured to acquire a plurality of fault types from the operation and maintenance service system together with a fault recovery strategy for each fault type; configure a fault recovery workflow script according to the fault recovery strategy of each fault type; set a fault recovery workflow identifier for each fault recovery workflow script; and send the fault recovery workflow identifiers to the monitoring system.
The invention recovers a failed server through the cooperation of the monitoring system and the automatic fault processing system. The automatic fault processing system encapsulates the solution for each fault into a fault recovery workflow script; when the fault occurs, the monitoring system triggers the automatic fault processing system to call the script, so that the fault is resolved automatically, manual participation is removed from the process, and rapid fault recovery is achieved.
With the automatic fault processing system, human intervention is removed from fault handling, the notification and delay of manual online processing are avoided, and the additional problems that misoperation during manual handling might cause are prevented.
Example five
Fig. 5 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. In the preferred embodiment of the present invention, the computer device 5 includes a memory 51, at least one processor 52, at least one communication bus 53, and a transceiver 54.
It will be appreciated by those skilled in the art that the structure of the computer device shown in fig. 5 does not limit the embodiments of the present invention; it may be a bus-type or a star-type configuration, and the computer device 5 may include more or less hardware or software than shown, or a different arrangement of components.
In some embodiments, the computer device 5 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The computer device 5 may also include a client device, which includes, but is not limited to, any electronic product capable of interacting with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example a personal computer, a tablet computer, a smart phone, or a digital camera.
It should be noted that the computer device 5 is only an example; other existing or future electronic products that can be adapted to the present invention should also be included in the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 51 is used to store program code and various data, such as the fault recovery apparatus installed in the computer device 5, and enables high-speed, automatic access to programs and data during the operation of the computer device 5. The memory 51 may include a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 52 may be composed of a single packaged integrated circuit or of a plurality of integrated circuits with the same or different functions packaged together, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 52 is the control unit of the computer device 5: it connects the components of the entire computer device 5 through various interfaces and lines, and executes the functions of the computer device 5 and processes data by running or executing the programs or modules stored in the memory 51 and calling the data stored in the memory 51.
In some embodiments, the at least one communication bus 53 is arranged to enable connection communication between the memory 51 and the at least one processor 52, etc.
Although not shown, the computer device 5 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 52 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The computer device 5 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a computer device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In a further embodiment, the at least one processor 52 may execute the operating system of the computer device 5 as well as various installed applications, program code, and the like, such as the modules described above.
The memory 51 has program code stored therein, and the at least one processor 52 can call the program code stored in the memory 51 to perform related functions.
In one embodiment of the invention, the memory 51 stores a plurality of instructions that are executed by the at least one processor 52 to implement all or a portion of the steps of the method of the invention.
Specifically, the at least one processor 52 may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 and fig. 2, and is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A fault recovery method applied to a monitoring system, characterized in that the method comprises:
monitoring whether a server has failed, and acquiring identification information of the server after a failure is detected;
identifying a fault type of the fault;
matching whether a fault recovery workflow identifier corresponding to the fault type exists;
generating a first fault alarm instruction when a fault recovery workflow identifier corresponding to the fault type exists in the matching;
hooking the first fault alarm instruction with a hook function, and sending a second fault alarm instruction carrying the identification information of the server and the fault recovery workflow identifier to an automatic fault processing system, so that the automatic fault processing system feeds back a fault recovery instruction according to the second fault alarm instruction;
and calling a fault recovery workflow script in the fault recovery instruction through a client of the server to execute the fault recovery instruction.
2. The method of claim 1, wherein monitoring whether the server has failed comprises:
acquiring a log reported by a client of the server, wherein a plurality of items of data are recorded in the log;
comparing each item of data with its corresponding data threshold;
determining that a server fault has been detected when at least one item of data is greater than its corresponding data threshold;
and determining that the server is operating normally when all items of data are less than or equal to their corresponding data thresholds.
3. The method of claim 2, wherein identifying the fault type of the fault comprises:
determining the target data among the plurality of items of data that are greater than their data thresholds;
matching preset keywords against the target data by regular-expression matching;
and, when a target keyword identical to a preset keyword is matched in the target data, determining the fault type corresponding to the target keyword according to a preset monitoring rule table.
4. The method of claim 2, wherein, when all items of data are less than or equal to their corresponding data thresholds, the method further comprises:
inputting each item of data into a fault prediction classifier;
predicting, with the fault prediction classifier, the risk fault types and their probabilities for each item of data;
taking the risk fault type with the highest probability as the target risk fault type;
and sending a risk alarm signal carrying the target risk fault type to the server.
5. The method of claim 4, wherein the training process of the fault prediction classifier comprises:
acquiring the historical data corresponding to each item of data and the fault type of that historical data;
constructing a data array from each historical record and its corresponding fault type;
and inputting the data array into a convolutional neural network for training to obtain the fault prediction classifier.
6. The method of any one of claims 1 to 5, further comprising:
sending preset notification information to operation and maintenance staff when no fault recovery workflow identifier corresponding to the fault type is matched, so that the staff can handle the fault recovery manually.
7. A fault recovery method applied to an automatic fault processing system, characterized in that the method comprises:
receiving a fault alarm signal sent by a monitoring system that carries the identification information of a server and a fault recovery workflow identifier;
matching the fault recovery workflow script corresponding to the fault recovery workflow identifier;
and sending a fault recovery instruction carrying the fault recovery workflow script to the server corresponding to the identification information, so that a client of the server calls the script to execute the fault recovery instruction.
8. The method of claim 7, further comprising:
acquiring a plurality of fault types from an operation and maintenance service system together with a fault recovery strategy for each fault type;
configuring a fault recovery workflow script according to the fault recovery strategy of each fault type;
setting a fault recovery workflow identifier for each fault recovery workflow script;
and sending the fault recovery workflow identifiers to the monitoring system.
9. A computer device comprising a processor, wherein the processor is configured to implement the fault recovery method according to any one of claims 1 to 6, or the fault recovery method according to claim 7 or 8, when executing a computer program stored in a memory.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the fault recovery method according to any one of claims 1 to 6, or the fault recovery method according to claim 7 or 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010728104.XA CN111897671A (en) | 2020-07-23 | 2020-07-23 | Failure recovery method, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010728104.XA CN111897671A (en) | 2020-07-23 | 2020-07-23 | Failure recovery method, computer device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111897671A true CN111897671A (en) | 2020-11-06 |
Family
ID=73190016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010728104.XA Pending CN111897671A (en) | 2020-07-23 | 2020-07-23 | Failure recovery method, computer device, and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897671A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150074367A1 (en) * | 2013-09-09 | 2015-03-12 | International Business Machines Corporation | Method and apparatus for faulty memory utilization |
CN105337765A (en) * | 2015-10-10 | 2016-02-17 | 上海新炬网络信息技术有限公司 | Distributed hadoop cluster fault automatic diagnosis and restoration system |
CN109634828A (en) * | 2018-12-17 | 2019-04-16 | 浪潮电子信息产业股份有限公司 | Failure prediction method, device, equipment and storage medium |
CN110191003A (en) * | 2019-06-18 | 2019-08-30 | 北京达佳互联信息技术有限公司 | Fault repairing method, device, computer equipment and storage medium |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112838965A (en) * | 2021-02-19 | 2021-05-25 | 浪潮云信息技术股份公司 | Method for identifying and recovering strong synchronization role fault |
CN112838965B (en) * | 2021-02-19 | 2023-03-28 | 浪潮云信息技术股份公司 | Method for identifying and recovering strong synchronization role fault |
CN113179180A (en) * | 2021-04-23 | 2021-07-27 | 杭州安恒信息技术股份有限公司 | Basalt client disaster fault repairing method, basalt client disaster fault repairing device and basalt client disaster storage medium |
CN113176996A (en) * | 2021-04-29 | 2021-07-27 | 深信服科技股份有限公司 | Fault processing method, engine, plug-in probe, device and readable storage medium |
CN113157483A (en) * | 2021-05-26 | 2021-07-23 | 中国银行股份有限公司 | Exception handling method and device |
CN113448811A (en) * | 2021-05-31 | 2021-09-28 | 山东英信计算机技术有限公司 | Method, device, equipment and readable medium for lighting fault lamp of server system |
CN113553210A (en) * | 2021-07-30 | 2021-10-26 | 平安普惠企业管理有限公司 | Alarm data processing method, device, equipment and storage medium |
CN113592337A (en) * | 2021-08-09 | 2021-11-02 | 北京豆萌信息技术有限公司 | Fault processing method and device, electronic equipment and storage medium |
CN113535034A (en) * | 2021-09-07 | 2021-10-22 | 北京轻松筹信息技术有限公司 | Fault warning method, device, system and medium |
CN114095343A (en) * | 2021-11-18 | 2022-02-25 | 深圳壹账通智能科技有限公司 | Disaster recovery method, device, equipment and storage medium based on double-active system |
CN114064341A (en) * | 2021-11-22 | 2022-02-18 | 建信金融科技有限责任公司 | Fault disposal method and system based on emergency plan |
WO2024183555A1 (en) * | 2023-03-06 | 2024-09-12 | 广州疆海科技有限公司 | Energy storage device troubleshooting method and apparatus, and computer device, medium and system |
CN116560908A (en) * | 2023-05-09 | 2023-08-08 | 中工数保(北京)科技有限公司 | Data recovery method of industrial control system and related equipment thereof |
CN116560908B (en) * | 2023-05-09 | 2024-01-26 | 中工数保(北京)科技有限公司 | Data recovery method of industrial control system and related equipment thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111897671A (en) | Failure recovery method, computer device, and storage medium | |
CN108710544B (en) | Process monitoring method of database system and rail transit comprehensive monitoring system | |
WO2016188100A1 (en) | Information system fault scenario information collection method and system | |
CN106789306B (en) | Method and system for detecting, collecting and recovering software fault of communication equipment | |
CN105099783B (en) | A kind of method and system for realizing operation system alarm emergency disposal automation | |
JP6647824B2 (en) | Error diagnosis system and error diagnosis method | |
CN110738352A (en) | Maintenance dispatching management method, device, equipment and medium based on fault big data | |
CN107660289A (en) | Automatic network control | |
CN107844339B (en) | Task scheduling method, plug-in and server | |
CN112631866A (en) | Server hardware state monitoring method and device, electronic equipment and medium | |
CN110275795A (en) | A kind of O&M method and device based on alarm | |
CN113592337A (en) | Fault processing method and device, electronic equipment and storage medium | |
CN115202958A (en) | Power abnormity monitoring method and device, electronic equipment and storage medium | |
CN113312200A (en) | Event processing method and device, computer equipment and storage medium | |
CN117670033A (en) | Security check method, system, electronic equipment and storage medium | |
CN117453036A (en) | Method, system and device for adjusting power consumption of equipment in server | |
CN109995554A (en) | The control method and cloud dispatch control device of multi-stage data center active-standby switch | |
CN112286762A (en) | System information analysis method and device based on cloud environment, electronic equipment and medium | |
CN115102838B (en) | Emergency processing method and device for server downtime risk and electronic equipment | |
US20080216057A1 (en) | Recording medium storing monitoring program, monitoring method, and monitoring system | |
CN111710403A (en) | Medical equipment supervision method, equipment and readable storage medium | |
CN114237196B (en) | Split robot fault processing method and device, terminal equipment and medium | |
CN112152878B (en) | Monitoring and management method, system, terminal and storage medium for digital channel of transformer substation | |
CN115225534A (en) | Method for monitoring running state of monitoring server | |
CN115168137A (en) | Monitoring method and system for timing task, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201106 |