CN107562556B

CN107562556B - Failure recovery method, recovery device and storage medium

Info

Publication number: CN107562556B
Application number: CN201710691358.7A
Authority: CN
Inventors: 陈薪; 袁佳; 秦涛; 雷教敏; 朱志武; 赵志辉; 刘光华; 付惠; 田盈盈; 杨文兵; 陈雷; 王正迪; 党受辉; 刘章雄; 王建学; 杨继宁; 梅璠
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-08-14
Filing date: 2017-08-14
Publication date: 2020-05-12
Anticipated expiration: 2037-08-14
Also published as: CN107562556A

Abstract

The invention discloses a fault recovery method, a fault recovery device and a storage medium, wherein the recovery method comprises the following steps: acquiring fault information; matching corresponding alarm parameters and a fault processing flow for the fault information according to a preset alarm matching rule; carrying out convergence processing on the alarm parameters to generate converged alarm parameters; and executing the fault processing flow according to the converged alarm parameters to recover the fault. The invention matches the corresponding alarm parameter and fault processing flow for the fault information through the preset alarm matching rule, and carries out convergence processing on the alarm parameter, and has the advantages of strong compatibility, simple processing flow, less occupied system resources, high processing efficiency and the like.

Description

Failure recovery method, recovery device and storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a method and an apparatus for recovering a failure, and a storage medium.

Background

With the rise of various application programs, various types of software failures also frequently occur, which not only causes service interruption, but also brings poor experience to corresponding merchants and users. Therefore, the problem of automatic recovery from failure is particularly urgent and prominent.

Conventional automatic fault recovery systems are fault-specific. The developer of such a system configures a processing flow for each specific fault, and the flow is then executed by a flow system. The processing flow is typically configured through a configuration file written in a domain-specific language (DSL), and some parts of a processing flow even require the developer to write dedicated code.

Conventional automatic fault recovery systems have certain disadvantages. For example, the barrier to user customization is very high, so such systems can usually cover only basic, common fault handling procedures. In terms of access and maintenance, they suffer from poor compatibility between recovery schemes, numerous and complicated processing flows, high system resource consumption, slow feedback, and similar technical problems.

Disclosure of Invention

The embodiment of the invention provides a fault recovery method, a fault recovery device and a storage medium, and has the advantages of strong compatibility, simple processing flow and the like.

In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:

a method of recovering from a failure, comprising:

acquiring fault information;

matching corresponding alarm parameters and a fault processing flow for the fault information according to a preset alarm matching rule;

carrying out convergence processing on the alarm parameters to generate converged alarm parameters; and

executing the fault processing flow according to the converged alarm parameters to recover the fault.

In order to solve the above technical problems, embodiments of the present invention further provide the following technical solutions:

a failure recovery device, comprising:

the acquisition module is used for acquiring fault information;

the matching module is used for matching corresponding alarm parameters and fault processing flows for the fault information according to a preset alarm matching rule;

the convergence module is used for carrying out convergence processing on the alarm parameters to generate converged alarm parameters; and

the recovery module is used for executing the fault processing flow according to the converged alarm parameters so as to recover the fault.

a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the above-described failure recovery method.

According to the fault recovery method, the fault recovery device and the storage medium provided by the embodiment of the invention, the corresponding alarm parameters and the corresponding fault processing flow are matched for the fault information through the preset alarm matching rules, and the alarm parameters are subjected to convergence processing, so that the fault recovery method, the fault recovery device and the storage medium have the advantages of strong compatibility, simple processing flow, less occupied system resources, high processing efficiency and the like.

Drawings

The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments of the present invention, which is to be read in connection with the accompanying drawings.

Fig. 1 is a schematic flow chart of a failure recovery method according to an embodiment of the present invention;

fig. 2 is another schematic flow chart of a failure recovery method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a user configuration interface for alert matching according to an embodiment of the present invention;

FIG. 4 is an interface diagram of a tree fault handling process according to an embodiment of the present invention;

FIG. 5 is a schematic processing logic diagram of a tree fault processing flow according to an embodiment of the present invention;

FIG. 6 is a technical side schematic diagram of a user-defined alert setting provided by an embodiment of the present invention;

FIG. 7 is a block diagram of a failure recovery device provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of another module of a failure recovery device according to an embodiment of the present invention;

fig. 9 is a schematic diagram of a hardware environment of a recovery method, a recovery apparatus, and a storage medium for recovering from a failure according to an embodiment of the present invention.

Detailed Description

Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present invention are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the invention and should not be taken as limiting the invention with regard to other embodiments that are not detailed herein.

In the following description, specific embodiments of the present invention are described with reference to steps and symbols of operations performed by one or more computers, unless otherwise indicated. These steps and operations are therefore referred to several times as being computer-executed, which includes the manipulation, by the computer's processing unit, of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the computer's memory system, which reconfigures or otherwise alters the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the data format. However, although the principles of the invention are described in the foregoing terms, this is not meant to be limiting, since one skilled in the art will recognize that the various steps and operations described below may also be implemented in hardware.

The terms "module" and "unit" as used herein may be considered software objects that execute on the computing system. The various components, modules, engines, and services described herein may be viewed as objects implemented on the computing system. The apparatus and method described herein are preferably implemented in software, but may also be implemented in hardware, and are within the scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for recovering from a failure according to an embodiment of the present invention. The fault recovery method can be applied to a server with related functions of operation and maintenance.

The fault recovery method comprises the following steps:

in step S101, failure information is acquired.

Wherein the fault information includes but is not limited to: abnormal alarm and health early warning.

The abnormal alarm refers to a service abnormality, or a false service abnormality, arising from various causes. Common causes include: service anomalies caused by network or Internet Data Center (IDC) anomalies, service anomalies caused by performance problems in key modules, service anomalies caused by host hardware or system anomalies, and false service anomalies caused by invalid error notifications. Of these, service anomalies caused by host hardware or system anomalies occur most frequently.

The health early warning refers to various indices collected from the system, which are used to evaluate the system and detect faults. The health early warning can be thought of as a physical examination report for the system: after the indices are compared with their thresholds, any abnormal points found can be regarded as fault information.
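The threshold comparison described above can be sketched as follows. This is a minimal illustration only: the metric names, the dictionary layout, the function name `find_abnormal_points`, and the assumption that "abnormal" means "greater than the threshold" are all hypothetical and are not specified in the original disclosure.

```python
from typing import Dict, List


def find_abnormal_points(metrics: Dict[str, float],
                         thresholds: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose value exceeds the configured threshold.

    Assumption: a metric is abnormal when its value is strictly greater than
    its threshold; metrics without a threshold are ignored.
    """
    return sorted(
        name for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


# Hypothetical "physical examination report" of a host.
report = {"cpu_usage": 0.97, "disk_usage": 0.40, "mem_usage": 0.91}
limits = {"cpu_usage": 0.90, "disk_usage": 0.85, "mem_usage": 0.95}

# Each abnormal point could then be treated as one piece of fault information.
abnormal = find_abnormal_points(report, limits)
```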

In step S102, according to a preset alarm matching rule, matching a corresponding alarm parameter and a corresponding fault processing procedure for the fault information.

Referring to fig. 3, a schematic diagram of a user configuration interface for alarm matching is shown. The alarm matching rule comprises an alarm parameter 310 corresponding to the fault information and a fault processing flow 320. Both the alarm parameter 310 and the fault processing flow 320 can be user-defined, so that the attributes of the fault information and the corresponding handling strategy can be adjusted.

The alarm parameters 310 may be defined or displayed as "self-healing scenes" in the actual operation interface, that is, the various parameters of the scenes that need to be recovered. The alarm parameters 310 include the following items corresponding to the fault information: a fault type 31, a fault event number 32, a text description 33, an environment attribute 34, a service partition 35 and a service module 36.

Specifically, the fault type 31, also called the alarm type, is used to define the attributes of the fault. The fault event number 32 is looked up and generated under the fault type, for example: 973469Hostsrv update failure alarm (alarm type). The text description 33 supports content filtering: only alarms whose content matches are selected, and matching uses a regular expression. The text description 33 is optional; if it is left blank, no filtering is applied. The environment attribute 34, which may also be displayed as the SET attribute, is used to set the application environment type. Application environment types include, but are not limited to: a test environment, an experience environment, and a formal environment. The service partition 35 is a virtual partition divided according to service content, for example 318 (Guangdong Zone 1). Guangdong Zone 1 generally refers to a partition recommended to users in the Guangdong region, without preventing users in other regions from accessing it. The service module 36 refers to a logic module of a service, such as a logic module on a host, a game module, or a chat module.
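As an illustration of how such a rule might be evaluated, the sketch below matches one piece of fault information against a list of rules, applying the optional regular-expression filter on the text description. The class `MatchRule`, the function `match_rule`, the field names, and the sample values are hypothetical; the original does not define a data layout.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchRule:
    fault_type: str                     # alarm type (fault type 31)
    environment: str                    # "test", "experience" or "formal" (34)
    partition: str                      # service partition identifier (35)
    module: str                         # service module name (36)
    flow_name: str                      # fault processing flow to run (320)
    text_pattern: Optional[str] = None  # optional regex (33); None = no filtering


def match_rule(fault: dict, rules: List[MatchRule]) -> Optional[MatchRule]:
    """Return the first rule whose fields all match the fault, or None."""
    for rule in rules:
        if (fault["fault_type"] != rule.fault_type
                or fault["environment"] != rule.environment
                or fault["partition"] != rule.partition
                or fault["module"] != rule.module):
            continue
        # An empty text description means the content is not filtered.
        if rule.text_pattern and not re.search(rule.text_pattern, fault["text"]):
            continue
        return rule
    return None


rules = [MatchRule("hostsrv_update_failure", "formal", "318", "game",
                   "restart_flow", r"update .* failed")]
fault = {"fault_type": "hostsrv_update_failure", "environment": "formal",
         "partition": "318", "module": "game",
         "text": "update of hostsrv failed on host A"}
matched = match_rule(fault, rules)
```

Here `matched.flow_name` names the flow that would be executed; a fault whose text does not satisfy the pattern yields `None` and is left unhandled by this rule.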

The fault processing flow 320 may be defined or displayed as "self-healing processing" in the user configuration interface, and is used to provide relevant options for fault processing, so that the user can customize the fault processing flow. The failure processing flow 320 includes: tree-like fault handling 37, notification 38, and enable control 39.

Specifically, the enable control 39 is used to determine whether the corresponding fault processing flow is enabled.

The notification 38 includes: the trigger of the notification, the notification channel, and the notified persons. Specifically, the trigger of the notification includes: trigger on start, trigger on success, and trigger on failure. The notification channels include: short messages, e-mail, social software (such as WeChat or Tencent RTX), telephone, and the like. The notified persons include: the responsible persons (e.g., the primary responsible person) and peripheral responsible persons (additional notified persons may be added).

The tree-shaped fault processing flow 37 may further refer to fig. 4, which is an interface schematic diagram of the tree-shaped fault processing flow. The tree fault processing flow 37 includes: a root node 371, a plurality of child nodes 372, and their associations 373. Wherein the child node 372 further comprises: leaf node 3720, meaning the corresponding failure handling flow is complete.

It will be appreciated that each node (371, 372 or 3720) includes a corresponding recovery instruction, and that when the node is triggered, fault recovery is performed according to that recovery instruction.

It will be appreciated that each tree fault handling flow 37 corresponds to a unique root node 371. In this step, a plurality of trigger nodes can be provided for the user to select. The trigger nodes are marked with labels or text; in the figure, taking 'shortcut' as an example, trigger nodes 371 and 374 are shown. In addition, a plurality of tree fault processing flows can be combined, for example by taking the root node of one tree fault processing flow as a child node of another tree fault processing flow.

In step S103, a convergence process is performed on the alarm parameter to generate a converged alarm parameter.

It will be appreciated that when a single alarm parameter is received, the subsequent steps may be performed directly. When multiple alarm parameters are received, it is judged whether they relate to the same event; if they do, the multiple alarm parameters are converged to generate a converged alarm parameter. Specifically, convergence means that when a plurality of alarm parameters represent the same event, only one of them (any one) is executed, and the other alarm parameters are stored without being executed.

For example, when the system crashes, alarm parameters are received separately from the communication module, the storage module and the processing process; convergence can then be carried out to generate a single converged alarm parameter.
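The convergence described above can be sketched as a grouping step. This is an illustrative sketch only: the function `converge`, the alarm dictionaries, and the assumption that alarms from the same host belong to the same event are hypothetical, since the original does not specify how "the same event" is determined.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple


def converge(alarms: List[dict],
             event_key: Callable[[dict], Hashable]) -> Tuple[List[dict], List[dict]]:
    """Group alarms by event; execute one per event and only store the rest."""
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for alarm in alarms:
        groups[event_key(alarm)].append(alarm)
    # Any one alarm of a group may be executed; this sketch picks the first.
    to_execute = [group[0] for group in groups.values()]
    stored_only = [a for group in groups.values() for a in group[1:]]
    return to_execute, stored_only


# A crash on host h1 raises three alarms; host h2 raises an unrelated one.
alarms = [
    {"host": "h1", "source": "communication module"},
    {"host": "h1", "source": "storage module"},
    {"host": "h1", "source": "processing process"},
    {"host": "h2", "source": "storage module"},
]
to_execute, stored_only = converge(alarms, event_key=lambda a: a["host"])
```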

In step S104, a corresponding fault processing procedure is executed according to the converged alarm parameters to perform fault recovery.

Specifically, the present step includes:

(1) Acquiring the service logic of the current node in the fault processing flow. The service logic comprises calling, through a network protocol, a corresponding function of a custom enterprise service bus (ESB) or of an operating system, where the corresponding function comprises a restart function, a transfer function, and/or a script execution function.

(2) And executing the service logic of the fault processing according to the converged alarm parameters.

If the recovery succeeds, an alarm processing list is generated, the processing procedure is recorded on the alarm processing list, and the alarm processing list is stored in an alarm database. If the recovery fails or times out, the alarm information is tracked so that the fault processing flow can be further optimized.

According to the fault recovery method provided by the embodiment of the invention, the corresponding alarm parameters and the fault processing flow are matched for the fault information through the preset alarm matching rules, the alarm parameters are subjected to convergence processing, and each specific fault does not need to be coded independently, so that the fault recovery method has the advantage of strong compatibility, the processing flow of configuration and recovery is simple, and the processing efficiency of fault recovery can be improved.

Referring to fig. 2, fig. 2 is another flow chart illustrating a method for recovering from a failure according to an embodiment of the present invention. The fault recovery method can be applied to a server with related functions of operation and maintenance.

The fault recovery method comprises the following steps:

in step S201, a tree-shaped failure processing flow is configured.

Please refer to the interface schematic diagram of the tree fault processing flow shown in fig. 4, specifically, the present step includes:

(1) a trigger node is set as the root node 371.

It will be appreciated that each tree fault handling flow 37 corresponds to a unique root node 371.

This step may also comprise: acquiring a plurality of trigger nodes and marking them with labels or text (in the figure, taking 'shortcut' as an example, trigger nodes 371 and 374 are shown); and selecting one trigger node from the plurality of trigger nodes as the unique root node.

(2) And sequentially acquiring the association options of the root node according to the processing logic to generate child nodes 372.

It is understood that the tree here generally refers to a binary tree: each node can be decomposed into a left subtree and a right subtree, which correspond to the processing logic for recovery success and recovery failure respectively.

Further, the step may be performed as: (2.1) sequentially setting a recovery instruction of the current node from the root node 371, wherein the recovery instruction is used for performing fault recovery on fault information; (2.2) according to the processing logic, respectively taking the recovery success and the recovery failure as the association options of the current node, and generating a child node 372 of the current node; (2.3) recording the association 373 of the current node with the parent node of the current node.

(3) According to the root node 371, the child node 372, and the association 373, a tree-shaped fault handling flow 37 is generated.

A plurality of tree fault processing flows can also be combined. That is, when a plurality of trigger nodes are included, the tree processing flows of the root node, the unselected trigger nodes and the child nodes are combined: the root node of one tree fault processing flow is used as a child node of another tree fault processing flow, as illustrated by node 374.

Wherein the child node 372 further comprises: leaf node 3720, meaning the corresponding failure handling flow is complete.
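The configuration in step S201 can be pictured as building a binary tree whose two branches correspond to recovery success and recovery failure. The sketch below is a hypothetical data structure (the class `FlowNode`, its fields, and the instruction strings are illustrative names, not taken from the original); it also shows two flows being combined by attaching the root of one as a child of the other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowNode:
    name: str
    instruction: str                         # recovery instruction of this node
    on_success: Optional["FlowNode"] = None  # child taken when recovery succeeds
    on_failure: Optional["FlowNode"] = None  # child taken when recovery fails

    @property
    def is_leaf(self) -> bool:
        """A leaf node has no children: the flow is complete once it has run."""
        return self.on_success is None and self.on_failure is None


# First flow: reboot the host, then either record the result or escalate.
reboot_root = FlowNode(
    "reboot_host", "restart_host",
    on_success=FlowNode("record", "record_result"),
    on_failure=FlowNode("escalate", "notify_owner"),
)

# Second flow: try restarting the service first. Its failure branch reuses the
# root of the first flow as a child node, which combines the two flows.
root = FlowNode(
    "restart_service", "restart_service",
    on_success=FlowNode("record_early", "record_result"),
    on_failure=reboot_root,
)
```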

In step S202, failure information is acquired.

This step may be used to obtain fault information from the associated alarm system. The fault information includes but is not limited to: abnormal alarm and health early warning.

In step S203, according to a preset alarm matching rule, matching a corresponding alarm parameter and a tree fault processing flow for the fault information.

In step S204, a convergence process is performed on the alarm parameter according to whether the alarm event corresponding to the alarm parameter occurs for the first time, so as to generate a converged alarm parameter.

Specifically, the present step includes:

(1) Judging whether the alarm event corresponding to the alarm parameter is occurring for the first time, where each alarm event corresponds to at least one alarm parameter. If it is the first occurrence, step (2) is executed; otherwise, step (3) is executed.

For example, when the system crashes, alarm parameters are received separately from the communication module, the storage module, the processing process, and so on; these alarm parameters can then be converged to generate a single converged alarm parameter.

(2) And taking the alarm parameters as the alarm parameters after convergence, and sending the alarm parameters to the fault processing flow.

(3) Recording the alarm parameter in the alarm event in an alarm database and marking the alarm parameter as converged.
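Steps (1) to (3) above can be sketched as a small stateful filter. The class `AlarmConverger`, the in-memory list standing in for the alarm database, and the use of an explicit `event_id` are assumptions made for illustration; the original does not state how an alarm event is identified.

```python
from typing import List, Optional, Set


class AlarmConverger:
    """Forward the first alarm of each event; mark later ones as converged."""

    def __init__(self) -> None:
        self.seen_events: Set[str] = set()
        self.database: List[dict] = []  # stand-in for the alarm database

    def handle(self, event_id: str, alarm: dict) -> Optional[dict]:
        if event_id not in self.seen_events:
            # Step (2): first occurrence, so the alarm parameter itself is the
            # converged alarm parameter and is sent to the processing flow.
            self.seen_events.add(event_id)
            return alarm
        # Step (3): not the first occurrence, so record it as converged.
        self.database.append(
            {"event_id": event_id, "alarm": alarm, "status": "converged"}
        )
        return None


converger = AlarmConverger()
first = converger.handle("crash-h1", {"source": "communication module"})
second = converger.handle("crash-h1", {"source": "storage module"})
```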

In step S205, the fault processing procedure is executed according to the converged alarm parameters to perform fault recovery.

Specifically, the present step includes:

(1) sending the converged alarm parameters to a root node corresponding to a fault processing flow;

(2) acquiring a recovery instruction corresponding to the root node;

(3) performing fault recovery on the fault information according to the recovery instruction;

(4) judging whether the current node is a leaf node;

(5) when the current node is a leaf node, ending the fault recovery;

(6) when the current node is not a leaf node, according to a fault recovery result, obtaining a recovery instruction of a child node corresponding to the root node to perform fault recovery, wherein the fault recovery result comprises: recovery success or recovery failure.
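The traversal in steps (1) to (6) can be sketched as a loop over the tree. The `Node` class, the `run_flow` function, and the lambda actions below are hypothetical stand-ins for real recovery instructions; the sketch also assumes that the flow ends if a non-leaf node has no child configured for the observed result.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    name: str
    action: Callable[[dict], bool]       # recovery instruction; True = success
    on_success: Optional["Node"] = None
    on_failure: Optional["Node"] = None


def run_flow(root: Node, alarm: dict) -> List[Tuple[str, bool]]:
    """Execute the flow from the root and return (node name, result) pairs."""
    history: List[Tuple[str, bool]] = []
    node = root
    while True:
        ok = node.action(alarm)              # perform fault recovery
        history.append((node.name, ok))
        if node.on_success is None and node.on_failure is None:
            break                            # leaf node: recovery ends
        nxt = node.on_success if ok else node.on_failure
        if nxt is None:
            break                            # assumption: missing branch ends the flow
        node = nxt
    return history


reboot = Node("reboot_host", lambda a: True,
              on_success=Node("record", lambda a: True),
              on_failure=Node("escalate", lambda a: True))
root = Node("restart_service", lambda a: False,   # simulated failure
            on_success=Node("record_early", lambda a: True),
            on_failure=reboot)
history = run_flow(root, {"host": "h1"})
```

In this run the service restart fails, so the failure branch reboots the host; the reboot succeeds and the flow ends at the `record` leaf.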

To further explain the above steps, please refer to fig. 5, which is a schematic processing logic diagram of a tree fault processing flow according to an embodiment of the present invention. The processing logic of the tree fault processing flow comprises:

the alarm matching module 51: acquiring fault information, and matching corresponding alarm parameters and tree fault processing flows for the fault information according to a preset alarm matching rule.

The alarm convergence module 52: the failure information is subjected to convergence processing, and the convergence processing result is written into the database 58.

The pending flow queue 53: receiving the converged alarm parameters and determining the current node;

the flow control module 54: acquiring a next node to be executed according to the current node;

unit task queue 55: if there is no next node to be executed, it is determined that the current node is a leaf node, the fault recovery is ended, and the flow control module 54 stores the recovery process in the database 58; if the current node is not the leaf node, acquiring and caching a recovery instruction corresponding to the current node;

the task execution module 56: sequentially reading the recovery instructions from the unit task queue 55, and performing fault recovery on the fault information according to the recovery instructions;

It can be understood that the task execution module 56 takes nodes out of the unit task queue 55, executes the specific service logic according to each node's task configuration, and, after the unit task finishes, notifies the pending flow queue to update the current node. The service logic here consists entirely of calls, over the HTTP protocol, to the enterprise's own ESB or to other systems that perform execution-type operations, such as a system that can restart a host or a system that can transfer files and execute scripts. The interface calling template for each execution system is implemented by developers; the user only needs to customize the specific calling logic through configuration options provided on a Web page.

Poll control and callback control module 57: some nodes that need to perform poll control and callback control may cycle through the unit task queue, the task execution module, and the poll/callback control module multiple times until the node finishes executing.

The flow corresponding to the executed unit task is pushed into the pending flow queue 53 again by the task execution module 56, and the next execution node is calculated by the flow control module 54. And the process is circulated until the whole process is finished.
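The cycle between the pending flow queue, the flow control module, the unit task queue and the task execution module can be sketched with two in-memory queues. Everything named here (the `FLOW` table, the `run` function, the instruction strings) is a hypothetical simplification, and the poll and callback control module 57 is omitted.

```python
from collections import deque
from typing import Callable, List, Tuple

# Hypothetical flow table: node -> (instruction, next on success, next on failure).
FLOW = {
    "restart_service": ("restart", "record", "reboot_host"),
    "reboot_host": ("reboot", "record", "escalate"),
    "record": ("log", None, None),
    "escalate": ("notify", None, None),
}


def run(alarm: str, start: str,
        execute: Callable[[str, str], bool]) -> List[Tuple[str, object]]:
    pending = deque([{"alarm": alarm, "node": start}])  # pending flow queue
    unit_tasks: deque = deque()                         # unit task queue
    log: List[Tuple[str, object]] = []                  # stand-in for the database
    while pending or unit_tasks:
        if pending:
            flow = pending.popleft()
            if flow["node"] is None:
                # Flow control: no next node, so the whole flow is finished.
                log.append(("finished", flow["alarm"]))
                continue
            unit_tasks.append(flow)
        if unit_tasks:
            task = unit_tasks.popleft()
            instruction, on_ok, on_fail = FLOW[task["node"]]
            ok = execute(instruction, task["alarm"])    # task execution
            log.append((task["node"], ok))
            # Flow control computes the next node; the flow is pushed back.
            task["node"] = on_ok if ok else on_fail
            pending.append(task)
    return log


# Simulated executor: restarting the service fails, everything else succeeds.
log = run("alarm-1", "restart_service", lambda instr, alarm: instr != "restart")
```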

In step S206, after the failure recovery is finished, the failure recovery process is saved in an alarm database.

And if the recovery is successful, generating an alarm processing list, maintaining the processing process on the alarm processing list, and storing the alarm processing list into an alarm database.

If the recovery fails or times out, the alarm information is tracked so that the fault processing flow can be further optimized.

In step S207, according to the query instruction, a query is performed from the alarm database, and a query result is output.

The alarm database mainly supports the query of processing state and the query of statistical quantity. The processing state refers to a current state of some fault information, such as: converged, recovery successful, recovery failed, etc. The statistical quantity refers to the quantity of fault information in a certain period of time, a certain partition or a certain type.
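The two kinds of query can be illustrated with an in-memory SQLite database. The table name `alarm_records`, its columns, and the sample rows are assumptions made for this sketch; the original does not describe the database schema or the database product used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alarm_records ("
    " id INTEGER PRIMARY KEY,"
    " fault_type TEXT,"
    " service_partition TEXT,"
    " status TEXT,"          # e.g. converged / recovery successful / recovery failed
    " occurred_at TEXT)"     # ISO date, so string comparison orders correctly
)
conn.executemany(
    "INSERT INTO alarm_records"
    " (fault_type, service_partition, status, occurred_at) VALUES (?, ?, ?, ?)",
    [
        ("host_down", "318", "recovery successful", "2017-08-10"),
        ("host_down", "318", "converged", "2017-08-10"),
        ("disk_full", "319", "recovery failed", "2017-08-12"),
        ("host_down", "318", "recovery failed", "2017-09-02"),
    ],
)

# Processing-state query: current state of one piece of fault information.
status = conn.execute(
    "SELECT status FROM alarm_records WHERE id = ?", (2,)
).fetchone()[0]

# Statistical query: number of faults in one partition within a time period.
count = conn.execute(
    "SELECT COUNT(*) FROM alarm_records"
    " WHERE service_partition = ? AND occurred_at BETWEEN ? AND ?",
    ("318", "2017-08-01", "2017-08-31"),
).fetchone()[0]
```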

According to the fault recovery method provided by the embodiment of the invention, the preset alarm matching rule is used for matching the corresponding alarm parameters and the fault processing flow for the fault information, the alarm parameters are subjected to convergence processing, and each specific fault does not need to be coded independently, so that the fault recovery method has the advantage of strong compatibility, and meanwhile, the processing flow of configuration and recovery is simple, so that the occupied system resources are few during operation, and the processing efficiency is high; in addition, the fault recovery result can be stored in an alarm database so as to track the alarm information and further optimize the processing flow.

FIG. 6 is a technical side view of a user-defined alert setting according to an embodiment of the present invention. The user 61 can perform man-machine interaction on the alarm system 62 and the fault recovery device 63, and process various fault information.

The alarm system 62 comprises: an alarm configuration module 621, an alarm generation module 622, an alarm storage module 623, and an alarm pull application programming interface (API) 624.

The alarm configuration module 621 is configured to configure the type of the alarm information, such as: abnormal alarm and health early warning.

The alarm generating module 622 is connected to the alarm configuring module 621, and is configured to generate fault information during operation.

The alarm storage module 623 is connected to the alarm generation module 622, and is configured to store the fault information in the alarm system 62.

The alarm pulling application interface 624 is connected to the alarm storage module 623, and is configured to read the fault information from the alarm storage module 623, and send the fault information to the fault recovery device 63, so as to support the acquisition of the fault information in the embodiment of the present invention.

The failure recovery device 63 includes: back end logic 631, database 632, and user interface 633.

The back-end logic 631 comprises: an alarm obtaining module 6311, an alarm matching module 6312, an alarm convergence module 6313, and a fault handling module 6314; for their operation, refer to the embodiments of fig. 1 or fig. 2.

A database 632 connected to the back-end logic 631, the database 632 comprising: a configuration storage module 6321, and a process record module 6322. The configuration storage module 6321 is used to store the configuration of the user. The processing recording module 6322 is used to store the processing procedure of the alarm information.

A user interface 633 coupled to the database 632, the user interface 633 comprising: a process flow configuration module 6331, an associated alarm and process flow module 6332, a process status query module 6333, and a statistical data query module 6334 to provide corresponding human-machine interfaces. The processing flow configuration module 6331 is configured to configure the unit processing logic and generate a corresponding tree combination flow, which may be specifically referred to in fig. 4. The associated alarm and processing flow module 6332 is configured to configure an alarm matching rule and specify an alarm processing flow, which may be specifically referred to in fig. 3. A processing status query module 6333, configured to query the processing status based on the database 632. The processing state refers to a current state of some fault information, such as: converged, recovery successful, recovery failed, etc. The statistical data query module 6334 is configured to perform a statistical number of queries based on the database 632. The statistical quantity refers to the quantity of fault information in a certain period of time, a certain partition or a certain type.

Referring to fig. 7, fig. 7 is a block diagram of a failure recovery apparatus according to an embodiment of the present invention. The fault recovery device can be applied to a server with the related functions of operation and maintenance.

The apparatus 700 for recovering from a failure includes: an acquisition module 71, a matching module 72, a convergence module 73, and a recovery module 74.

The acquisition module 71 is configured to obtain the fault information. The fault information includes but is not limited to: abnormal alarm and health early warning.

The matching module 72 is connected to the acquisition module 71 and is configured to match corresponding alarm parameters and a fault processing flow for the fault information according to a preset alarm matching rule.

Each node (371, 372 or 3720) comprises a corresponding recovery instruction, according to which fault recovery is performed when the node is triggered. It will be appreciated that each tree fault handling flow 37 corresponds to a unique root node 371. A plurality of trigger nodes can be provided for the user to select. The trigger nodes are marked with labels or text; in the figure, taking 'shortcut' as an example, trigger nodes 371 and 374 are shown.

The convergence module 73 is connected to the matching module 72 and is configured to perform convergence processing on the alarm parameters to generate a converged alarm parameter.

It will be appreciated that when a single alarm parameter is received, the subsequent steps may be performed directly. When multiple alarm parameters are received, it is judged whether they relate to the same event; if they do, the multiple alarm parameters are converged to generate a converged alarm parameter.

The recovery module 74 is connected to the convergence module 73 and is configured to execute the fault processing flow according to the converged alarm parameter, so as to perform fault recovery.

Wherein, the recovery module 74 includes: a logic unit 741, and a parameter unit 742.

Specifically, the logic unit 741 is configured to acquire the service logic of the fault processing flow.

The parameter unit 742 is configured to execute the service logic of the fault processing according to the converged alarm parameter. The service logic comprises calling, through a network protocol, a corresponding function of a custom enterprise service bus (ESB) or of an operating system, where the corresponding function comprises a restart function, a transfer function, and/or a script execution function.

The fault recovery device provided by the embodiment of the invention matches the corresponding alarm parameters and the fault processing flow for the fault information through the preset alarm matching rules, performs convergence processing on the alarm parameters, and does not need to perform independent coding on each specific fault, so that the fault recovery device has the advantage of strong compatibility, and meanwhile, the processing flow of configuration and recovery is simple, and the processing efficiency of fault recovery can be improved.

Referring to fig. 8, fig. 8 is a schematic block diagram of a failure recovery apparatus according to an embodiment of the present invention. The fault recovery device can be applied to a server with the related functions of operation and maintenance.

The apparatus 800 for recovering from a failure includes: a configuration module 81, an obtaining module 82, a matching module 83, a convergence module 84, and a recovery module 85.

And the configuration module 81 is used for configuring a tree-shaped fault processing flow.

Wherein the configuration module 81 comprises: a root configuration unit 811, a sub-configuration unit 812, a tree configuration unit 813, and a trigger labeling unit 814. Please refer to fig. 4, which is a schematic interface diagram of a tree fault handling process.

Specifically, the root configuration unit 811 is configured to set a trigger node as the root node 371.

The failure recovery apparatus 800 may provide a plurality of trigger nodes for the root configuration unit 811 to select from. When a plurality of trigger nodes exist, the trigger labeling unit 814 is configured to obtain the plurality of trigger nodes and label them with labels or characters; in fig. 4 the label "shortcut" is used as an example, marking the trigger node 371 and the trigger node 374. The root configuration unit 811 is further configured to select one trigger node from the plurality of trigger nodes as the unique root node.

A sub-configuration unit 812, connected to the root configuration unit 811, is configured to set, in sequence starting from the root node, a recovery instruction for the current node, where the recovery instruction is used to perform fault recovery for the fault information; and, according to the processing logic, to take recovery success and recovery failure respectively as association options of the current node and generate them as child nodes 372 of the current node.

Further, referring to fig. 4, the sub-configuration unit 812 sets, in sequence from the root node 371, a recovery instruction of a current node, where the recovery instruction is used to perform fault recovery on fault information; then, according to the processing logic, the recovery success and the recovery failure are respectively used as the association options of the current node, and are generated as child nodes 372 of the current node; and records the association 373 of the current node with the parent node of the current node.

The tree configuration unit 813 is configured to generate the tree-shaped fault handling flow 37 according to the root node 371, the child node 372, and the association 373.

The tree configuration unit 813 is further configured to combine a plurality of tree-shaped fault handling flows. That is, when a plurality of trigger nodes are included, the tree-shaped fault handling flows of the root node, the unselected trigger nodes, and the child nodes are combined. In other words, when there are multiple trigger nodes, the root node of one tree-shaped fault handling flow is used as a child node of another tree-shaped fault handling flow; see node 374.
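
The configuration described above can be sketched, under stated assumptions, as a small tree data structure. The class `Node`, its field names, the function `build_flow`, and the instruction strings are all hypothetical and are not taken from the embodiment; the sketch only shows a root trigger node, success and failure child nodes, and a second flow whose root is attached as a child of the first.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of a tree-shaped fault handling flow (hypothetical structure)."""
    instruction: str                      # recovery instruction run when triggered
    on_success: Optional["Node"] = None   # child chosen after recovery success
    on_failure: Optional["Node"] = None   # child chosen after recovery failure
    is_trigger: bool = False              # labeled trigger nodes may serve as roots

    def is_leaf(self) -> bool:
        return self.on_success is None and self.on_failure is None

def build_flow() -> Node:
    # Root configuration: one trigger node is chosen as the unique root.
    root = Node("restart_process", is_trigger=True)
    # Sub-configuration: the success option becomes a child node.
    root.on_success = Node("notify_recovered")
    # Tree combination: the root of a second flow (an unselected trigger
    # node) is attached as the failure child of the first flow.
    second_root = Node("transfer_service", is_trigger=True)
    second_root.on_success = Node("notify_recovered")
    second_root.on_failure = Node("escalate_to_operator")
    root.on_failure = second_root
    return root
```

Storing the success and failure children directly on each node records the association between a node and its parent implicitly; an explicit parent reference could be added if the association 373 needs to be queried in both directions.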

And an obtaining module 82, configured to obtain the fault information.

Wherein the obtaining module 82 may obtain the fault information from the associated alarm system. The fault information includes but is not limited to: abnormal alarm and health early warning.

And the matching module 83 is connected to the obtaining module 82, and is configured to match corresponding alarm parameters and a fault processing procedure for the fault information according to a preset alarm matching rule. The matching process refers to the alarm matching shown in fig. 3 and the tree fault handling flow shown in fig. 4.

And a convergence module 84, connected to the matching module 83, configured to perform convergence processing on the alarm parameter according to whether the alarm event corresponding to the alarm parameter occurs for the first time, so as to generate a converged alarm parameter.

Wherein the convergence module 84 comprises: an event judgment unit 841, a parameter transmission unit 842, and a parameter convergence unit 843.

Specifically, the event determining unit 841 is configured to determine whether the alarm event corresponding to an alarm parameter is occurring for the first time, where an alarm event corresponds to at least one alarm parameter. For example, when the system crashes, alarm parameters from a communication module, a storage module, a processing process, and so on are received separately; these can then be converged to generate a single converged alarm parameter.

A parameter sending unit 842 is configured to, if the alarm event occurs for the first time, take the alarm parameter as the converged alarm parameter and send it to the fault handling flow.

A parameter convergence unit 843, configured to record the alarm parameter in the alarm event in an alarm database if the alarm event does not occur for the first time, and mark the alarm parameter as converged.
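
A minimal sketch of this first-occurrence convergence follows, assuming that each alarm parameter arrives together with a key identifying its alarm event. The class `Converger`, the `event_key` argument, and the record fields are hypothetical; how alarm parameters are actually associated with one event is not specified here.

```python
from typing import Dict, List, Optional, Set

class Converger:
    """Illustrative stand-in for units 841 to 843 (names are assumptions)."""

    def __init__(self) -> None:
        self.seen_events: Set[str] = set()       # events already forwarded
        self.alarm_db: List[Dict[str, str]] = [] # stand-in for the alarm database

    def converge(self, event_key: str,
                 alarm_param: Dict[str, str]) -> Optional[Dict[str, str]]:
        # Event judgment: is this the first alarm parameter for the event?
        if event_key not in self.seen_events:
            self.seen_events.add(event_key)
            # Parameter sending: forward it as the converged alarm parameter.
            return alarm_param
        # Parameter convergence: record the repeat and mark it as converged.
        self.alarm_db.append({**alarm_param, "event": event_key,
                              "status": "converged"})
        return None
```

A return value of `None` means that nothing is forwarded to the fault handling flow, so a single crash that raises several alarms starts only one recovery.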

And the recovery module 85 is configured to execute the fault processing procedure according to the converged alarm parameter, so as to perform fault recovery.

The recovery module 85 includes: a parameter processing unit 851, an instruction processing unit 852, a recovery processing unit 853, a node processing unit 854, a leaf determining unit 855, and an end processing unit 856.

Specifically, the parameter processing unit 851 is configured to send the converged alarm parameter to a root node corresponding to the fault handling procedure.

The instruction processing unit 852 is connected to the parameter processing unit 851 and configured to obtain a recovery instruction corresponding to the root node.

And a recovery processing unit 853, connected to the instruction processing unit 852, configured to perform fault recovery on the fault information according to the recovery instruction.

And a node processing unit 854 connected to the recovery processing unit 853, configured to obtain a corresponding current node according to the failure recovery result. Wherein the failure recovery result comprises: recovery success or recovery failure.

The leaf determining unit 855 is connected to the node processing unit 854, and is configured to determine whether the current node is a leaf node.

The end processing unit 856 is connected to the leaf judging unit 855, and configured to end the failure recovery and store the failure recovery process in the alarm database when the current node is a leaf node.

The node processing unit 854 is connected to the leaf determining unit 855, and is further configured to obtain a recovery instruction of the current node according to the failure recovery result when the current node is not the leaf node, so as to perform failure recovery.
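
The cooperation of units 851 to 856 amounts to walking the tree from the root and choosing the next node by each recovery result. The sketch below is one possible reading, not the implementation of the embodiment: the dictionary encoding `Flow`, the function `run_flow`, and the `execute` callback are hypothetical, and the sketch assumes that a leaf node's instruction is still executed before recovery ends, which the text leaves open.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical encoding: node name -> (instruction, child on success, child on failure).
Flow = Dict[str, Tuple[str, Optional[str], Optional[str]]]

def run_flow(flow: Flow, root: str, params: dict,
             execute: Callable[[str, dict], bool]) -> List[Tuple[str, bool]]:
    """Walk the tree from the root; return the (node, success) history."""
    history: List[Tuple[str, bool]] = []
    current: Optional[str] = root              # parameter sent to the root node
    while current is not None:
        instruction, on_success, on_failure = flow[current]
        ok = execute(instruction, params)      # perform fault recovery
        history.append((current, ok))
        if on_success is None and on_failure is None:
            break                              # leaf node: recovery ends
        # Choose the next node according to the recovery result.
        current = on_success if ok else on_failure
    return history
```

The returned history corresponds to the fault recovery process that the end processing unit would store in the alarm database once a leaf is reached.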

And the alarm database 86 is connected to the recovery module 85 and is used for saving the fault recovery process after the fault recovery is finished.

If the recovery is successful, an alarm processing list is generated, the successful processing process is stored in the alarm processing list, and then the alarm processing list is stored in the alarm database 86.

If the recovery fails or times out, the alarm information is tracked, the failed or timed-out processing process is stored in the alarm processing list, and the alarm processing list is stored in the alarm database 86, so that the failed or timed-out fault processing flow can be further optimized.

And the query module 87 is connected to the alarm database 86 and is used for outputting a query result according to the query instruction.

Through the query module 87, the alarm database 86 mainly supports processing status queries and statistical quantity queries. The processing status is the current state of a given piece of fault information, such as converged, recovery successful, or recovery failed. The statistical quantity is the quantity of fault information within a certain period of time, a certain partition, or a certain type.
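
The two kinds of query can be illustrated with an in-memory list standing in for the alarm database. The record layout, the sample data in `RECORDS`, and the functions `query_status` and `count_by` are hypothetical; a query by time period would follow the same pattern with a timestamp field, which is omitted here.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, str]

# Made-up sample records, for illustration only.
RECORDS: List[Record] = [
    {"id": "a1", "type": "process", "partition": "zone-1",
     "status": "recovery successful"},
    {"id": "a2", "type": "process", "partition": "zone-1",
     "status": "converged"},
    {"id": "a3", "type": "storage", "partition": "zone-2",
     "status": "recovery failed"},
]

def query_status(records: List[Record], alarm_id: str) -> Optional[str]:
    """Processing status query: current state of one piece of fault information."""
    for record in records:
        if record["id"] == alarm_id:
            return record["status"]
    return None

def count_by(records: List[Record], field: str) -> Dict[str, int]:
    """Statistical quantity query: number of records per value of a field."""
    return dict(Counter(record[field] for record in records))
```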

According to the fault recovery device provided by the embodiment of the invention, the preset alarm matching rule is used for matching the corresponding alarm parameters and the fault processing flow for the fault information, and the alarm parameters are subjected to convergence processing, and each specific fault does not need to be coded independently, so that the fault recovery device has the advantage of strong compatibility, and meanwhile, the processing flow for configuration and recovery is simple, so that the occupied system resources are few during operation, and the processing efficiency is high; in addition, the fault recovery result can be stored in an alarm database so as to track the alarm information and further optimize the processing flow.

Correspondingly, an embodiment of the invention also provides a server, which illustrates the hardware environment of the fault recovery method and the fault recovery device. As shown in fig. 9, the server is used to perform the fault recovery method of figs. 1-2 or to run the fault recovery device of figs. 7-8. The server 900 includes: a processor 901 having one or more processing cores, a memory 902 comprising one or more computer-readable storage media, an input unit 903, a short-range wireless transmission (WiFi) module 904, a display screen 905, a power supply 906, and a fault recovery device 907, and is used to perform the fault recovery method and/or run the fault recovery device 907.

Those skilled in the art will appreciate that the above-described architecture does not limit the server 900, which may include more or fewer components than those described, may combine some of the components, or may arrange the components differently.

Specifically, in this embodiment, the processor 901 of the server 900 loads the executable file corresponding to the process of one or more application programs into the memory 902 and runs the application programs stored in the memory 902, thereby implementing a method of recovering from a failure, comprising: acquiring fault information; matching corresponding alarm parameters and a fault processing flow for the fault information according to a preset alarm matching rule; carrying out convergence processing on the alarm parameters to generate converged alarm parameters; and executing the fault processing flow according to the converged alarm parameters to recover the fault.

Preferably, the processor 901 is further configured to: setting a trigger node as a root node; sequentially acquiring the association options of the root node according to processing logic so as to generate child nodes; and generating a tree-shaped fault processing flow according to the root node, the child nodes and the incidence relation.

Preferably, the processor 901 is further configured to: starting from the root node, sequentially setting a recovery instruction of the current node, wherein the recovery instruction is used for performing fault recovery on fault information; according to the processing logic, the recovery success and the recovery failure are respectively used as the association options of the current node and are generated as child nodes of the current node; and recording the incidence relation between the current node and the parent node of the current node.

Preferably, the processor 901 is further configured to: judging whether an alarm event corresponding to an alarm parameter appears for the first time or not, wherein the alarm event corresponds to at least one alarm parameter; if the fault occurs for the first time, the alarm parameter is used as a converged alarm parameter and is sent to the fault processing flow; and/or if not, recording the alarm parameter in the alarm event in an alarm database and marking the alarm parameter as converged.

Preferably, the processor 901 is further configured to: sending the converged alarm parameters to a root node corresponding to the fault processing flow; acquiring a recovery instruction corresponding to the root node; performing fault recovery on the fault information according to the recovery instruction; according to the fault recovery result, obtaining a recovery instruction of a child node corresponding to the root node so as to perform fault recovery, wherein the fault recovery result comprises: recovery success or recovery failure.

Preferably, the processor 901 is further configured to: judging whether the current node is a leaf node; when the current node is a leaf node, ending the fault recovery; and when the current node is not the leaf node, executing a step of obtaining a recovery instruction of the child node corresponding to the root node according to the success or failure of recovery so as to carry out fault recovery.

Preferably, the processor 901 is further configured to: acquiring the service logic of the fault processing flow; executing the service logic of the fault processing according to the converged alarm parameters, wherein the service logic comprises: calling a self-defined enterprise service bus or a corresponding function of an operating system through a network protocol, wherein the corresponding function comprises the following steps: a restart function, a transfer function, and/or a script execution function.

Preferably, the processor 901 is further configured to: and after the fault recovery is finished, storing the fault recovery process in an alarm database.

The server provided by the embodiment of the invention matches the corresponding alarm parameters and the corresponding fault processing flow for the fault information through the preset alarm matching rules, and performs convergence processing on the alarm parameters, and has the advantages of strong compatibility, simple processing flow, less occupied system resources, high processing efficiency and the like.

The server provided by the embodiment of the invention has the same concept as the fault recovery method and the fault recovery device in the embodiment.

It should be noted that, for the method for recovering from a failure according to the present invention, a person skilled in the art will understand that all or part of the processes in the embodiments of the present invention may be implemented by a computer program controlling related hardware, where the computer program may be stored in a computer-readable storage medium, such as a memory of a terminal device, and executed by at least one processor in the terminal device; during execution, the processes of the embodiments of the failure recovery method may be included. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.

In the failure recovery apparatus according to the embodiment of the present invention, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.

The above detailed description is provided for the failure recovery method, the recovery device, the storage medium, and the server according to the embodiments of the present invention, and a specific example is applied in this document to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for recovering from a failure, comprising:

acquiring fault information;

matching corresponding alarm parameters and a fault processing flow for the fault information according to a preset alarm matching rule;

carrying out convergence processing on the alarm parameters to generate converged alarm parameters;

executing the fault processing flow according to the converged alarm parameters to recover the fault;

the fault processing flow is a tree-shaped fault processing flow, and the tree-shaped fault processing flow comprises: a root node, a plurality of child nodes, and an association relation between the root node and the child nodes, wherein the root node and each child node comprise a corresponding recovery instruction, which is used for recovering from the fault when the root node or the child node is triggered.

2. The method for recovering from a failure of claim 1, wherein, before acquiring the fault information, the method further comprises:

setting a trigger node as a root node;

acquiring the incidence relation of the root nodes in sequence according to processing logic so as to generate child nodes;

generating a tree-shaped fault processing flow according to the root node, the child nodes and the incidence relation;

the matching of corresponding alarm parameters and a fault processing flow for the fault information comprises: matching corresponding alarm parameters and the tree-shaped fault processing flow for the fault information.

3. The method for recovering from a failure according to claim 2, wherein sequentially obtaining the association relationship of the root node according to the processing logic to generate child nodes comprises:

starting from the root node, sequentially setting a recovery instruction of the current node, wherein the recovery instruction is used for performing fault recovery on fault information;

according to the processing logic, the recovery success and the recovery failure are respectively used as the incidence relation of the current node and are generated as child nodes of the current node; and

and recording the incidence relation between the current node and the parent node of the current node.

4. The method for recovering from a failure of claim 1, wherein the converging the alarm parameters to generate converged alarm parameters comprises:

judging whether an alarm event corresponding to an alarm parameter appears for the first time or not, wherein the alarm event corresponds to at least one alarm parameter;

if the fault occurs for the first time, the alarm parameter is used as a converged alarm parameter and is sent to the fault processing flow; and/or

If not, recording the alarm parameter in the alarm event in an alarm database, and marking the alarm parameter as converged.

5. The method for recovering from a failure according to claim 2, wherein executing the failure processing procedure according to the converged alarm parameters for failure recovery comprises:

sending the converged alarm parameters to a root node corresponding to a fault processing flow;

acquiring a recovery instruction corresponding to the root node;

performing primary fault recovery on the fault information according to the recovery instruction;

according to the fault recovery result, obtaining a recovery instruction of a child node corresponding to the root node so as to perform layer-by-layer fault recovery, wherein the fault recovery result comprises: recovery success or recovery failure.

6. The method for recovering from a failure according to claim 5, wherein, before obtaining, according to the fault recovery result, the recovery instruction of the child node corresponding to the root node so as to perform fault recovery, the method further comprises:

judging whether the current node is a leaf node; and

when the current node is a leaf node, ending the fault recovery;

and when the current node is not a leaf node, executing the step of obtaining the recovery instruction of the child node corresponding to the root node according to the fault recovery result so as to perform fault recovery.

7. The method for recovering from a failure according to claim 2, wherein a tree-shaped failure processing flow is generated according to the root node, the child nodes, and the association relationship, and further comprising:

acquiring a plurality of trigger nodes, and labeling the trigger nodes through labels or characters;

selecting one trigger node from the plurality of trigger nodes as a root node;

and combining the tree-shaped fault processing flows of the root node, the unselected trigger nodes and the child nodes according to processing logic.

8. A failure recovery apparatus, comprising:

the acquisition module is used for acquiring fault information;

the matching module is used for matching corresponding alarm parameters and a fault processing flow for the fault information according to a preset alarm matching rule;

the convergence module is used for carrying out convergence processing on the alarm parameters to generate converged alarm parameters;

the recovery module is used for executing the fault processing flow according to the converged alarm parameters so as to recover the fault;

9. The apparatus for recovering from a failure of claim 8, further comprising a configuration module, the configuration module comprising:

a root configuration unit, configured to set a trigger node as a root node;

the child configuration unit is used for sequentially setting a recovery instruction of the current node from the root node, wherein the recovery instruction is used for performing fault recovery on fault information, respectively taking recovery success and recovery failure as incidence relations of the current node according to processing logic, and generating child nodes of the current node;

the tree configuration unit is used for generating a tree-shaped fault processing flow according to the root node, the child nodes and the association relation;

and the matching module is also used for matching corresponding alarm parameters with the tree-shaped fault processing flow for the fault information.

10. The apparatus for recovering from a failure of claim 8, wherein the convergence module comprises:

the event judging unit is used for judging whether an alarm event corresponding to an alarm parameter appears for the first time or not, wherein the alarm event corresponds to at least one alarm parameter;

the parameter sending unit is used for taking the alarm parameter as the alarm parameter after convergence and sending the alarm parameter to the fault processing flow if the alarm parameter appears for the first time; and

and the parameter convergence unit is used for recording the alarm parameters in the alarm event in an alarm database if the alarm events do not occur for the first time, and marking the alarm parameters as converged.

11. The apparatus for recovering from a failure of claim 9, wherein the recovery module comprises:

the parameter processing unit is used for sending the converged alarm parameters to a root node corresponding to a fault processing flow;

the instruction processing unit is used for acquiring a recovery instruction corresponding to the root node;

the recovery processing unit is used for performing primary fault recovery on the fault information according to the recovery instruction;

a node processing unit, configured to obtain a corresponding current node according to a failure recovery result, where the failure recovery result includes: recovery success or recovery failure;

the leaf judging unit is used for judging whether the current node is a leaf node;

the end processing unit is used for ending the fault recovery when the current node is a leaf node and storing the fault recovery process in an alarm database;

and the node processing unit is also used for acquiring a recovery instruction of the current node according to the fault recovery result when the current node is not a leaf node, so as to carry out layer-by-layer fault recovery.

12. The apparatus for recovering from a failure of claim 9, wherein in the configuration module:

further comprising: the trigger marking unit is used for acquiring a plurality of trigger nodes and marking the trigger nodes through labels or characters;

the root configuration unit is further configured to select one trigger node from the plurality of trigger nodes as a root node;

and the tree configuration unit is further configured to perform tree-shaped fault processing flow combination on the root node, the unselected trigger nodes, and the child nodes according to processing logic.

13. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a method of recovering from a failure according to any one of claims 1 to 7.