CN110659147A - Self-repairing method and system based on module self-checking behavior

Info

Publication number: CN110659147A
Application number: CN201910757544.5A
Authority: CN
Inventors: 王鹏
Original assignee: Suzhou Wave Intelligent Technology Co Ltd
Current assignee: Suzhou Wave Intelligent Technology Co Ltd
Priority date: 2019-08-16
Filing date: 2019-08-16
Publication date: 2020-01-07
Anticipated expiration: 2039-08-16
Also published as: CN110659147B

Abstract

The invention provides a self-repairing method and system based on module self-checking behavior. In the method, exception detection code is first designed for modules with different functions, and the same exception arbitration code is built into each module. Each module calls its exception detection code to perform a self-check and writes any exception it finds into an abnormal condition table. At every fixed time interval, the module calls the exception arbitration code and checks whether it has generated an abnormal condition table; if not, the result is executed directly; if so, the abnormal condition table is taken as the input of exception arbitration, an arbitration result is obtained, and the accuracy is calculated. The modules that found exceptions elect a leader module, and the leader module recovers the abnormal modules through the recovery component. A self-repairing system based on module self-checking behavior is also provided on the basis of the method. Building on module-level exception detection, the invention carries out orderly system recovery after a decision by the exception arbitration center, which can prevent delayed handling of a module exception from causing a failure of the whole system and reduces catastrophic events.

Description

Self-repairing method and system based on module self-checking behavior

Technical Field

The invention belongs to the technical field of software design, and particularly relates to a self-repairing method and system based on module self-checking behaviors.

Background

Fault recovery technology comprises the strategies, methods and techniques used, once a fault has been detected in a system, to isolate the fault and to select another facility or method that returns the system to a task point before the fault occurred, so that the system can continue to work normally.

Fault recovery in a typical software system relies on centralized system logs or centralized monitoring data, and is therefore an after-the-fact behavior. By contrast, hard disk S.M.A.R.T. technology monitors and records the running condition of the disk hardware, such as the head, platter, motor and circuitry, through detection instructions in the disk hardware, and compares it with preset safety values set by the manufacturer. If the monitored condition is about to exceed or has exceeded the preset safe range, the monitoring hardware or software of the host automatically warns the user and performs minor automatic repairs, so that the safety of the data on the disk is protected in advance. A large-scale software system, however, is divided into many modules with complex relationships among them, and it still relies on after-the-fact repair based on overall system behavior, so its recovery requirements cannot be handled intelligently and in real time. Detection behavior at the module level is therefore of interest.

Disclosure of Invention

The invention provides a self-repairing method and a self-repairing system based on module self-checking behaviors. On the basis of a module-level anomaly detection technology, orderly system recovery is implemented after the judgment of an anomaly judgment center so as to meet the intelligent system recovery requirements in some application scenes.

In order to achieve the above object, the present invention provides a self-repairing method based on module self-checking behavior, which comprises the following steps:

S1: designing exception detection code for modules with different functions according to the resource requirements and functions of the modules, and building the same exception arbitration code into the different functional modules;

S2: the module calls the exception detection code to perform a self-check and writes any exception found into an abnormal condition table; the abnormal condition table comprises a host name, a module name, an exception name and a frequency;

S3: at every fixed time interval, the module calls the exception arbitration code and executes the corresponding arbitration operation according to whether the module itself has generated an abnormal condition table;

S4: the modules that found exceptions elect a leader module, the leader module sends a recovery list, together with the arbitration result, to a recovery component, and the recovery component restarts the modules that found exceptions.

Further, in step S1: the resources comprise the number of CPU cores, memory usage, disk usage and network usage;

the functions comprise a computing-type module, a storage-type module and a network-communication-type module; the exception detection code in the computing-type module is used to detect a number of CPU cores occupied for a long time above a threshold, memory usage above a threshold, out-of-memory errors and CPU unavailability; the exception detection code in the storage-type module is used to detect sustained high-speed disk reads and writes, frequent generation of large files, and intensive write errors; the exception detection code in the network-communication-type module is used to detect sustained high bandwidth and the absence of traffic.

Further, the arbitration principle of the exception arbitration code is as follows: there is no conflict between exceptions, the exception values are legal, and exceptions with a high frequency are processed first.

Further, in step S3, the module calling the exception arbitration code at every fixed time interval and executing the corresponding arbitration operation according to whether the module itself has generated an abnormal condition table comprises: the arbitration function of the module is started periodically; at every fixed time interval the module calls its built-in exception arbitration code, which is the same in every module, and first detects whether the module has generated an abnormal condition table; if no abnormal condition table has been generated, the result is executed directly;

if an abnormal condition table has been generated, the abnormal condition table is taken as the input of exception arbitration, an arbitration result is obtained, and the accuracy is calculated.
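The periodic branch described in step S3 can be sketched roughly as follows. This is only an illustration under assumptions: the names `arbitration_round` and `run_periodically`, the row layout, and the injected `sleep` function are hypothetical and do not come from the patent, and returning `None` stands in for the case where no abnormal condition table exists and the result is executed directly.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical row layout: (host name, module name, exception name, frequency)
Row = Tuple[str, str, str, int]


def arbitration_round(
    rows: Sequence[Row],
    arbitrate: Callable[[Sequence[Row]], List[Row]],
) -> Optional[List[Row]]:
    """Run one arbitration round for a single module."""
    if not rows:
        # No abnormal condition table was generated: nothing to arbitrate.
        return None
    # Table exists: use it as the input of exception arbitration.
    return arbitrate(rows)


def run_periodically(
    interval_s: float,
    rounds: int,
    read_rows: Callable[[], Sequence[Row]],
    arbitrate: Callable[[Sequence[Row]], List[Row]],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Optional[List[Row]]]:
    """Call the arbitration code once per fixed interval, for a fixed number of rounds."""
    results = []
    for _ in range(rounds):
        results.append(arbitration_round(read_rows(), arbitrate))
        sleep(interval_s)
    return results
```

Passing `sleep` as a parameter is a design choice of this sketch so that the loop can be exercised without real waiting; a deployed module would more likely use its own scheduler.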

Further, the accuracy is calculated as follows: an item that is the same as in the arbitration result is a correct item, and items that differ or are missing are error items; total items = correct items + error items;

accuracy = correct items / total items.

Further, in step S4, the method by which the modules that found exceptions elect a leader module is: each module that found an exception marks, in a leader table, the module it considers to have the highest accuracy as the leader module, and the module supported by more than half of the modules becomes the leader module.

The invention also provides a self-repairing system based on the module self-checking behavior, which comprises a module design unit, an abnormality detection unit, an abnormality arbitration unit and an execution leader unit;

the module design unit is used for designing an abnormal detection code for modules with different functions according to the resource requirements and the functions of the modules, and the same abnormal arbitration code is built in the different functional modules;

the abnormality detection unit is used for carrying out self-detection on the module and writing the found abnormality into an abnormal condition table;

the exception arbitration unit is used for executing, at every fixed time interval, the corresponding arbitration operation according to whether the module itself has generated an abnormal condition table;

the execution leader unit is used for electing a leader module from among the modules that found exceptions and for sending the recovery list, together with the arbitration result, to the recovery component, and the recovery component restarts the modules that found exceptions.

Further, the module design unit comprises a first design unit and a second design unit;

the first design unit is used for designing abnormal detection codes for modules with different functions according to resource requirements and functions of the modules;

and the second design unit is used for setting abnormal arbitration codes in modules with different functions.

Further, the exception arbitration unit comprises a first arbitration unit and a second arbitration unit;

the first arbitration unit is used for calling an abnormal arbitration code by the module at every fixed time interval, detecting whether the module generates an abnormal condition table or not, and if the abnormal condition table is not generated, directly executing a result;

the second arbitration unit is used for calling an abnormal arbitration code by the module every fixed time interval, detecting whether the module generates an abnormal condition table or not, if the abnormal condition table is generated, taking the abnormal condition table as the input for executing the abnormal arbitration to obtain an arbitration result, and calculating the accuracy.

Further, the execution leader unit comprises a leader module election unit and a leader module execution unit;

the leader module election unit is used for having each module that found an exception mark, in the leader table, the module it considers to have the highest accuracy as the leader module, and for electing the module supported by more than half of the modules as the leader module;

the leader module execution unit is used for sending the recovery list, together with the arbitration result, to the recovery component, and the recovery component restarts the modules that found exceptions.

The effect provided in the summary of the invention is only the effect of the embodiment, not all the effects of the invention, and one of the above technical solutions has the following advantages or beneficial effects:

The embodiment of the invention provides a self-repairing method and a self-repairing system based on module self-checking behavior. In the method, exception detection code is first designed for modules with different functions according to the resource requirements and functions of the modules, and the same exception arbitration code is built into the different functional modules. Each module calls its exception detection code to perform a self-check and writes any exception found into an abnormal condition table. The arbitration function of the module is started periodically: at every fixed time interval the module calls the exception arbitration code and detects whether it has generated an abnormal condition table; if not, the result is executed directly; if so, the abnormal condition table is taken as the input of exception arbitration, an arbitration result is obtained, and the accuracy is calculated. Each module that found an exception marks, in the leader table, the module it considers to have the highest accuracy as the leader module, and the module supported by more than half of the modules becomes the leader module. The leader module sends the recovery list, together with the arbitration result, to the recovery component, and the recovery component restarts the modules that found exceptions. The invention provides both a self-repairing method and a self-repairing system based on module self-checking behavior. Building on module-level exception detection, the invention carries out orderly system recovery after a decision by the exception arbitration center, so as to meet the intelligent system recovery requirements of certain application scenarios, prevent delayed handling of a module exception from causing a failure of the whole system, and reduce catastrophic events such as data loss.

Drawings

Fig. 1 is a flow chart of the self-repairing method based on module self-checking behavior according to embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of an example of the self-repairing method based on module self-checking behavior proposed in embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of the self-repairing system based on module self-checking behavior according to embodiment 1 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, are merely for convenience of description of the present invention, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.

Example 1

The invention provides a self-repairing method and system based on module self-checking behavior. Fig. 1 shows the flow of the self-repairing method based on module self-checking behavior provided by embodiment 1 of the invention.

In step S101, execution of the flow starts.

In step S102, exception detection codes are designed for modules with different functions according to their resource requirements and functions, and the same exception arbitration code is built in different functional modules.

The resources comprise the number of CPU cores, memory usage, disk usage and network usage, and the functions comprise a computing-type module, a storage-type module and a network-communication-type module. The exception detection code in the computing-type module is used to detect a number of CPU cores occupied for a long time above a threshold, memory usage above a threshold, out-of-memory errors and CPU unavailability; typically, it checks for sustained occupancy of more than 80% of the CPU cores and more than 70% of the memory quota. The protection of the invention is not limited to these values, which can be configured according to the actual situation. The exception detection code in the storage-type module is used to detect sustained high-speed disk reads and writes, frequent generation of large files, such as PB-level files, and intensive write errors. The exception detection code in the network-communication-type module is used to detect sustained high bandwidth and a long-term absence of traffic.
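A minimal sketch of what the exception detection code of a computing-type module might look like, using the 80% CPU and 70% memory figures given above. Everything else is an assumption made for illustration: the `Sample` type, the exception names, and the reading of "occupied for a long time" as "above the threshold for the last `min_consecutive` samples" are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Sequence

CPU_THRESHOLD = 0.80  # fraction of CPU cores occupied (configurable)
MEM_THRESHOLD = 0.70  # fraction of the memory quota in use (configurable)


@dataclass(frozen=True)
class Sample:
    cpu_fraction: float  # 0.0 to 1.0
    mem_fraction: float  # 0.0 to 1.0


def detect_compute_exceptions(
    samples: Sequence[Sample], min_consecutive: int = 3
) -> List[str]:
    """Return exception names for thresholds exceeded in all recent samples.

    'Long-time' occupancy is approximated (an assumption) as the threshold
    being exceeded in each of the last `min_consecutive` samples.
    """
    recent = list(samples)[-min_consecutive:]
    if len(recent) < min_consecutive:
        return []  # not enough history to call the condition sustained
    found = []
    if all(s.cpu_fraction > CPU_THRESHOLD for s in recent):
        found.append("cpu_occupancy_too_high")
    if all(s.mem_fraction > MEM_THRESHOLD for s in recent):
        found.append("memory_usage_too_high")
    return found
```

Storage-type and network-communication-type modules would carry analogous checks against their own metrics, such as disk throughput or bandwidth.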

The arbitration principle of the exception arbitration code is as follows: there is no conflict between exceptions; the exception values are legal (for example, a CPU core occupancy of 200% is illegal); exceptions with a high frequency are processed first; and at most 2 exceptions are processed at a time.
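The arbitration principles above could be sketched as a filter followed by a ranking. The patent does not say which exceptions conflict or how a conflict is resolved, so the `CONFLICTS` set (sustained high bandwidth versus no traffic on the same module) and the choice to drop both conflicting records are assumptions of this sketch, as are the record type and all names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class ExceptionRecord:
    host: str
    module: str
    name: str
    frequency: int
    value: Optional[float] = None  # measured fraction, if the exception has one


# Assumed example of mutually exclusive exceptions on one module.
CONFLICTS = {frozenset({"sustained_high_bandwidth", "no_traffic"})}
MAX_PER_ROUND = 2  # at most 2 exceptions are processed at a time


def is_legal(rec: ExceptionRecord) -> bool:
    """A fraction outside 0..1 (for example 200% CPU) is treated as illegal."""
    return rec.frequency > 0 and (rec.value is None or 0.0 <= rec.value <= 1.0)


def arbitrate(records: Sequence[ExceptionRecord]) -> List[ExceptionRecord]:
    legal = [r for r in records if is_legal(r)]

    # Collect the exception names reported per (host, module).
    names: Dict[Tuple[str, str], Set[str]] = {}
    for r in legal:
        names.setdefault((r.host, r.module), set()).add(r.name)

    def conflicted(r: ExceptionRecord) -> bool:
        reported = names[(r.host, r.module)]
        return any(r.name in pair and pair <= reported for pair in CONFLICTS)

    consistent = [r for r in legal if not conflicted(r)]
    # Highest frequency first; remaining keys only make the order deterministic.
    ranked = sorted(consistent, key=lambda r: (-r.frequency, r.host, r.module, r.name))
    return ranked[:MAX_PER_ROUND]
```

Because the function is deterministic, every module that runs it on the same table obtains the same arbitration result, which is the property the election step relies on.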

In step S103, the module calls the exception detection code to perform a self-check, and writes any exception found into the abnormal condition table.

The abnormal condition table comprises fields such as host name, module name, exception name and frequency. Examples of exception names are CPU occupancy too high, out-of-memory, disk write error and sustained high bandwidth. A module that reports the same exception repeatedly updates only the frequency field in the abnormal condition table.
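As an illustration, the abnormal condition table can be modeled as a mapping keyed by host name, module name and exception name, with the frequency as the value, so that a repeated report touches only the frequency. The class and method names below are hypothetical, and an in-memory dictionary is an assumption; the patent does not specify how the table is stored.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (host name, module name, exception name)


class AbnormalConditionTable:
    """In-memory stand-in for the abnormal condition table."""

    def __init__(self) -> None:
        self._frequency: Dict[Key, int] = {}

    def report(self, host: str, module: str, name: str) -> int:
        """Record one occurrence; a repeated report only bumps the frequency."""
        key = (host, module, name)
        self._frequency[key] = self._frequency.get(key, 0) + 1
        return self._frequency[key]

    def is_empty(self) -> bool:
        return not self._frequency

    def rows(self) -> List[Tuple[str, str, str, int]]:
        """Return (host, module, exception name, frequency) rows."""
        return [(h, m, n, f) for (h, m, n), f in self._frequency.items()]
```

For example, calling `report("host-1", "compute", "out_of_memory")` twice leaves a single row whose frequency is 2.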

In step S104, the module calls its built-in exception arbitration code, which is the same in every module.

In step S105, the exception arbitration code determines whether the module has generated an abnormal condition table; if no abnormal condition table has been generated, step S106 is executed; if an abnormal condition table has been generated, step S107 is executed.

In step S106, the result is directly executed.

In step S107, the abnormal condition table is taken as the input of exception arbitration, the arbitration result is obtained, and the accuracy is calculated.

The accuracy is calculated as follows: an item that is the same as in the arbitration result is a correct item, and items that differ or are missing are error items; total items = correct items + error items, and accuracy = correct items / total items.
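One way to read the accuracy rule is as a set comparison between what a module reported and the arbitration result. In this sketch, "different" is interpreted as an item the module reported that is not in the arbitration result, and "missing" as an item in the arbitration result that the module did not report; this interpretation, the function name, and the value returned when both sets are empty are assumptions.

```python
from typing import Set


def accuracy(reported: Set[str], arbitration_result: Set[str]) -> float:
    """accuracy = correct items / (correct items + error items)."""
    correct = len(reported & arbitration_result)  # same as the arbitration result
    errors = len(reported ^ arbitration_result)   # different or missing items
    total = correct + errors
    # With nothing reported and nothing arbitrated there is no item to score;
    # returning 0.0 here is an arbitrary choice of this sketch.
    return correct / total if total else 0.0
```

For instance, a module that reported `{"a", "b"}` against an arbitration result of `{"a", "c"}` has one correct item and two error items, giving an accuracy of 1/3.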

In step S108, each module that found an exception marks, in the leader table, the module it considers to have the highest accuracy as the leader module, and the module supported by more than half of the modules becomes the leader module.

Only one leader module is elected at a time, but it may be a different module each time. The election principle is that the module whose reported exceptions have the highest accuracy after arbitration becomes the leader module; if several modules have the same accuracy, any one of them may be the leader module. Each module that reports an exception calculates the arbitration result itself, and the results are identical because the algorithm and the inputs are the same. The recovery list is an output of arbitration from which the modules that reported illegal exceptions have been removed, and it lists all modules that need to be recovered in this round.
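The election could be sketched as a nomination step followed by a majority count over the leader table. The patent allows any module among those tied for the highest accuracy to become leader; picking the lexicographically smallest name in a tie is an assumption added here so that every module nominates the same candidate. The function names and the dictionary layout of the leader table are also hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def nominate(accuracies: Dict[str, float]) -> str:
    """Pick the module with the highest accuracy (smallest name on a tie)."""
    best = max(accuracies.values())
    return min(m for m, a in accuracies.items() if a == best)


def elect_leader(leader_table: Dict[str, str]) -> Optional[str]:
    """leader_table maps each voting module to the module it nominated.

    Returns the nominee supported by more than half of the voters, or None
    if no nominee reaches a strict majority.
    """
    if not leader_table:
        return None
    votes = Counter(leader_table.values())
    candidate, count = votes.most_common(1)[0]
    return candidate if count > len(leader_table) / 2 else None
```

Since all modules compute the same accuracies from the same inputs, they normally nominate the same module and the majority condition is met in a single round.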

In step S109, the leader module sends the recovery list to the recovery component with the arbitration result, and the recovery component restarts the module that found the abnormality.

The recovery component has functions such as accessing the original images of all modules, verifying the signatures of the original images, stopping and starting each module, and logging recovery operations. In the invention, the recovery component serves to restart the modules that found exceptions.
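A rough in-memory sketch of the recovery component's duties follows. It is heavily simplified and rests on assumptions: comparing a SHA-256 digest against an expected value stands in for real signature verification, stop and start are modeled as set operations rather than process control, and the class, method and log formats are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List, Sequence, Set


class RecoveryComponent:
    def __init__(self, images: Dict[str, bytes], expected_digests: Dict[str, str]):
        self._images = images              # original image of each module
        self._expected = expected_digests  # stand-in for image signatures
        self.running: Set[str] = set(images)
        self.log: List[str] = []

    def _verify(self, module: str) -> bool:
        """Digest comparison used here in place of signature verification."""
        actual = hashlib.sha256(self._images[module]).hexdigest()
        return hmac.compare_digest(actual, self._expected.get(module, ""))

    def recover(self, recovery_list: Sequence[str], arbitration_result: Sequence[str]) -> List[str]:
        """Restart every listed module whose original image verifies."""
        self.log.append("arbitration result: " + ", ".join(arbitration_result))
        restarted = []
        for module in recovery_list:
            if module not in self._images or not self._verify(module):
                self.log.append(f"skipped {module}: image missing or not verified")
                continue
            self.running.discard(module)  # stop
            self.running.add(module)      # start from the original image
            self.log.append(f"restarted {module}")
            restarted.append(module)
        return restarted
```

Refusing to restart a module whose image fails verification is a design choice of this sketch; the patent only states that the component can verify image signatures.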

In step S110, the entire flow ends.

Fig. 2 shows an example of the self-repairing method based on module self-checking behavior provided in embodiment 1 of the present invention. The system consists of n modules, among which module 1, module 3 and module 4 each have exception detection code designed according to the resource requirements and function of the module. The exception detection code of module 1 has exception detection function A, the exception detection code of module 3 has exception detection function A, and the exception detection code of module 4 has exception detection function C. Module 1 and module 3 perform exception detection according to function A, module 4 performs exception detection according to function C, and all detected exceptions are written into the abnormal condition table. The arbitration function of each module is started periodically: at every fixed time interval the module calls the same built-in exception arbitration code and first detects whether the module has generated an abnormal condition table; if not, the result is executed directly; if so, the abnormal condition table is taken as the input of exception arbitration, an arbitration result is obtained, and the accuracy is calculated. The accuracy is calculated as follows: an item that is the same as in the arbitration result is a correct item, and items that differ or are missing are error items; total items = correct items + error items; accuracy = correct items / total items. Each module that found an exception then marks, in the leader table, the module it considers to have the highest accuracy as the leader module, and the module supported by more than half of the modules becomes the leader module. The leader module in the illustrated example is module 3. When module 3 scans the leader table and learns that it is the leader, it sends the recovery list, together with the arbitration result, to the recovery component, and the recovery component restarts the modules that found exceptions.

Embodiment 1 of the invention also provides a self-repairing system based on module self-checking behavior, shown in Fig. 3. The system comprises a module design unit, an exception detection unit, an exception arbitration unit and an execution leader unit.

The module design unit is used for designing exception detection code for modules with different functions according to the resource requirements and functions of the modules, and for building the same exception arbitration code into the different functional modules. The resources comprise the number of CPU cores, memory usage, disk usage and network usage, and the functions comprise a computing-type module, a storage-type module and a network-communication-type module. The exception detection code in the computing-type module is used to detect a number of CPU cores occupied for a long time above a threshold, memory usage above a threshold, out-of-memory errors and CPU unavailability; typically, it checks for sustained occupancy of more than 80% of the CPU cores and more than 70% of the memory quota. The protection of the invention is not limited to these values, which can be configured according to the actual situation. The exception detection code in the storage-type module is used to detect sustained high-speed disk reads and writes, frequent generation of large files, such as PB-level files, and intensive write errors. The exception detection code in the network-communication-type module is used to detect sustained high bandwidth and a long-term absence of traffic. The arbitration principle of the exception arbitration code is as follows: there is no conflict between exceptions; the exception values are legal (for example, a CPU core occupancy of 200% is illegal); exceptions with a high frequency are processed first; and at most 2 exceptions are processed at a time.

The exception detection unit is used for having the module perform a self-check and write any exception found into the abnormal condition table. The abnormal condition table comprises fields such as host name, module name, exception name and frequency. Examples of exception names are CPU occupancy too high, out-of-memory, disk write error and sustained high bandwidth. A module that reports the same exception repeatedly updates only the frequency field in the abnormal condition table.

The exception arbitration unit is used for executing, at every fixed time interval, the corresponding arbitration operation according to whether the module itself has generated an abnormal condition table. Specifically, the arbitration function of the module is started periodically; at every fixed time interval the module calls its built-in exception arbitration code, which is the same in every module, and first detects whether the module has generated an abnormal condition table. If no abnormal condition table has been generated, the result is executed directly; if an abnormal condition table has been generated, it is taken as the input of exception arbitration, an arbitration result is obtained, and the accuracy is calculated.

The execution leader unit is used for electing a leader module from among the modules that found exceptions and for sending the recovery list, together with the arbitration result, to the recovery component, which restarts the modules that found exceptions. Only one leader module is elected at a time, but it may be a different module each time. The election principle is that the module whose reported exceptions have the highest accuracy after arbitration becomes the leader module; if several modules have the same accuracy, any one of them may be the leader module. Each module that reports an exception calculates the arbitration result itself, and the results are identical because the algorithm and the inputs are the same. The recovery list is an output of arbitration from which the modules that reported illegal exceptions have been removed, and it lists all modules that need to be recovered in this round.

The leader module sends the recovery list, together with the arbitration result, to the recovery component, and the recovery component restarts the modules that found exceptions. The recovery component has functions such as accessing the original images of all modules, verifying the signatures of the original images, stopping and starting each module, and logging recovery operations. In the invention, the recovery component serves to restart the modules that found exceptions.

The module design unit comprises a first design unit and a second design unit. The first design unit is used for designing exception detection code for modules with different functions according to the resource requirements and functions of the modules. The second design unit is used for building the same exception arbitration code into the modules with different functions.

The exception arbitration unit comprises a first arbitration unit and a second arbitration unit; the first arbitration unit is used for calling the same abnormal arbitration code built in the module at every fixed time interval, detecting whether the module generates an abnormal condition table or not, and if the abnormal condition table is not generated, directly executing the result; the second arbitration unit is used for calling the same abnormal arbitration codes built in the module every fixed time interval, detecting whether the module generates an abnormal condition table or not, and if the abnormal condition table is generated, taking the abnormal condition table as the input for executing the abnormal arbitration to obtain an arbitration result and calculate the accuracy.

The execution leader unit comprises a leader module election unit and a leader module execution unit. The leader module election unit is used for having each module that found an exception mark, in the leader table, the module it considers to have the highest accuracy as the leader module, and for electing the module supported by more than half of the modules as the leader module. The leader module execution unit is used for sending the recovery list, together with the arbitration result, to the recovery component, and the recovery component restarts the modules that found exceptions.

The foregoing is merely exemplary and illustrative of the present invention and various modifications, additions and substitutions may be made by those skilled in the art to the specific embodiments described without departing from the scope of the present invention as defined in the accompanying claims.

Claims

1. A self-repairing method based on module self-checking behavior is characterized by comprising the following steps:

2. The self-repairing method based on module self-test behavior of claim 1, wherein in step S1: the resources comprise the number of CPU cores, the usage amount of a memory, the usage amount of a disk and the usage amount of a network;

3. The self-repairing method based on module self-checking behavior as claimed in claim 1, wherein the arbitration principle of the exception arbitration code is: there is no conflict between exceptions, the exception values are legal, and exceptions with a high frequency are processed first.

4. The self-repairing method based on module self-checking behavior as claimed in claim 1, wherein in step S3, the module calling the exception arbitration code at every fixed time interval and executing the corresponding arbitration operation according to whether the module itself has generated an abnormal condition table comprises: the arbitration function of the module is started periodically; at every fixed time interval the module calls its built-in exception arbitration code, which is the same in every module, and first detects whether the module has generated an abnormal condition table; if no abnormal condition table has been generated, the result is executed directly;

5. The self-repairing method based on module self-checking behavior as claimed in claim 4, wherein the accuracy is calculated as follows: an item that is the same as in the arbitration result is a correct item, and items that differ or are missing are error items; total items = correct items + error items;

accuracy = correct items / total items.

6. The self-repairing method based on module self-checking behavior of claim 1, wherein in step S4, the method by which the modules that found exceptions elect a leader module is: each module that found an exception marks, in the leader table, the module it considers to have the highest accuracy as the leader module, and the module supported by more than half of the modules becomes the leader module.

7. A self-repairing system based on module self-checking behavior is characterized by comprising a module design unit, an abnormality detection unit, an abnormality arbitration unit and an execution leader unit;

the module design unit is used for designing abnormal detection codes for modules with different functions according to the resource requirements and the functions of the modules, and the same abnormal arbitration codes are built in the different functional modules;

8. The self-repairing system based on module self-checking behavior as claimed in claim 7, wherein the module design unit comprises a first design unit and a second design unit;

the second design unit is used for modules with different functions and internally provided with the same exception arbitration codes.

9. The self-repairing system based on module self-testing behavior of claim 7, wherein the exception arbitration unit comprises a first arbitration unit and a second arbitration unit;

the first arbitration unit is used for calling the same abnormal arbitration codes built in the module at every fixed time interval, detecting whether the module generates an abnormal condition table or not, and if the abnormal condition table is not generated, directly executing the result;

the second arbitration unit is used for calling the same abnormal arbitration codes built in the module every fixed time interval, detecting whether the module generates an abnormal condition table or not, and if the abnormal condition table is generated, taking the abnormal condition table as the input for executing the abnormal arbitration to obtain an arbitration result and calculate the accuracy.

10. The self-repairing system based on module self-checking behavior as claimed in claim 7, wherein the execution leader unit comprises a leader module election unit and a leader module execution unit;