CN105550056B - A fault self-healing system based on system reconfiguration and its implementation method - Google Patents
- Publication number: CN105550056B
- Application number: CN201510926572.7A
- Authority: CN (China)
- Prior art keywords: failure, configuration, self, fault, healing
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
Abstract
The present invention proposes a fault self-healing method based on system reconfiguration. The computer system to which it applies consists, in hardware, of multiple functional modules and backup modules and, in software, of modules including fault management, configuration management, and a system blueprint. The steps are, in order: system startup, configuration loading, fault detection, fault recording, fault screening, policy selection, configuration update, and system shutdown. The system blueprint is the key design element of system fault self-healing. It contains multiple configuration items for system operation; each configuration item describes the software, hardware, network, and other configuration of the computer system, and the computer system can run according to the description in a configuration item. The configuration items also describe all fault-handling policies. The backup module is a necessary condition for system self-healing: when some module in the system suffers an unrecoverable fault, the backup module takes over that module's work. The invention has the advantages of a simple self-healing policy, a controllable self-healing process, and a deterministic self-healing result.
Description
Technical field
The invention belongs to the computer field and provides a fault self-healing method.
Background
In complex computer systems that are used frequently but are difficult to maintain, such as civil aircraft avionics systems and satellite onboard electronic systems, designers hope to use system self-healing to improve system availability and to reduce the system's maintenance cycle and cost.
Current self-healing techniques achieve functional self-healing mainly by restoring the software or hardware at the fault location. Common technical measures include:
(1) designing multiple channels at locations where faults may occur; after a fault occurs, a fault-free channel is selected to bypass the faulty channel, achieving system self-healing;
(2) when software code or FPGA logic stored in non-volatile memory is corrupted, overwriting the faulty code or logic with correct code or logic, achieving system self-healing.
These self-healing techniques are designed mainly to restore local functions of the system. Although they work well, many hardware faults still cannot be self-healed.
Summary of the invention
The present invention targets frequently used complex computer systems. By providing backup modules in the system and using functional modules such as fault management, configuration management, and the system blueprint, it achieves fault self-healing of the system, thereby improving system availability and reducing the system's maintenance cycle and cost.
The specific technical solution of the invention is as follows:
A fault self-healing system based on system reconfiguration, characterized in that it comprises:
a system blueprint software module, containing one fault-free configuration item and multiple fault configuration items, one of which is set as the current configuration item; each configuration item fully describes the configuration required for the computer system to run (the computer system runs according to the description in the configuration item) and includes fault-handling policies; the fault-handling policies include system shutdown, system self-healing, and continued operation;
a configuration management software module, for loading the current configuration item of the system blueprint into the computer system after system initialization, so that the computer system runs normally or shuts down according to the description in the current configuration item;
a fault management software module, which periodically performs fault detection on the computer system, screens the faults that occur, filters out occasional recoverable faults, determines the location and type of each unrecoverable fault, then queries the system blueprint to determine the fault-handling policy and, if system self-healing is required, updates the current configuration item to a fault configuration item that can bypass the fault; and
a hardware backup module, for substituting for the corresponding hardware function in the computer system so that the fault can be bypassed.
The method by which the invention achieves fault self-healing comprises the following steps:
1] System startup: during system startup, complete the initialization of the software and hardware of each module of the system;
2] Configuration loading: load the current configuration item described in the system blueprint into the computer system;
3] System operation: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, periodically perform fault detection on each software and hardware module that makes up the computer system;
5] Fault recording: when the computer system is found to have a fault, make a preliminary judgment of the fault and record the fault information;
6] Fault screening: based on the recorded fault information, screen the faults further, filter out occasional recoverable faults, and determine the location and type of each unrecoverable fault;
7] Policy selection: based on the location and type of the unrecoverable fault, determine the fault-handling policy from the description in the system blueprint; if system self-healing is required, select a configuration item that can bypass the fault;
8] Configuration update: set the configuration item that can bypass the fault as the current configuration item; this configuration item will be loaded into the system at its next startup, and the update does not change the running state of the currently faulty system;
9] System shutdown: after the configuration update completes, shut down the system and wait for the updated configuration item to be loaded at the next startup.
The fault information recorded in step 5] above mainly includes the time, location, and type of the fault.
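A minimal sketch of the fault record of step 5] and the screening of step 6], for illustration only. The text does not say how occasional recoverable faults are told apart from unrecoverable ones; the rule used here (a fault counts as unrecoverable once it has been logged `threshold` times) is an assumption, as are the names `FaultRecord` and `screen_faults` and the example module names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    """One logged fault with the three fields named in the text."""
    time: float      # when the fault was detected (unit is hypothetical)
    location: str    # module that reported the fault, e.g. "FM2"
    kind: str        # fault type, e.g. "no_response"


def screen_faults(log, threshold=3):
    """Return the (location, kind) pairs logged at least `threshold` times.

    Assumption: a fault that recurs this often is treated as unrecoverable;
    rarer faults are treated as occasional, recoverable faults and dropped.
    """
    counts = Counter((record.location, record.kind) for record in log)
    return sorted(key for key, n in counts.items() if n >= threshold)


log = [
    FaultRecord(0.0, "FM1", "bit_flip"),       # seen once: filtered out
    FaultRecord(1.0, "FM2", "no_response"),
    FaultRecord(2.0, "FM2", "no_response"),
    FaultRecord(3.0, "FM2", "no_response"),    # third occurrence: kept
]
unrecoverable = screen_faults(log)
print(unrecoverable)  # [('FM2', 'no_response')]
```

The screened result carries exactly the location and type that step 7] needs to query the system blueprint.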
The invention has the following advantages:
Viewed at the level of the computer system, the invention replaces a faulty module with a backup module and uses the current configuration item in the system blueprint to control the self-healing process. It achieves fault self-healing for complex computer systems, with the advantages of a simple self-healing policy, a controllable self-healing process, and a deterministic self-healing result.
Brief description of the drawings
Fig. 1 is a structural diagram of the system of the invention. In the figure, FM: functional module; BM: backup module; A~J: applications.
Fig. 2 is a schematic diagram of the key self-healing process of the system. In the figure, FM: functional module; BM: backup module; A~J: applications; CF1: configuration item 1; CF2: configuration item 2; P1~Px: network addresses (or port numbers).
Fig. 3 is a structural diagram of the system blueprint of the invention.
Fig. 4 is a flow chart of fault self-healing according to the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings.
In the fault self-healing method based on system reconfiguration, the computer system to which the method applies consists, in hardware, of multiple functional modules and backup modules and, in software, of modules including fault management, configuration management, and the system blueprint.
The system blueprint is the key design element of system fault self-healing. It contains multiple configuration items for system operation. Each configuration item describes the software, hardware, network, and other configuration of the computer system, and the computer system can run according to the description in a configuration item. The configuration items also describe all fault-handling policies, which include system shutdown, system self-healing, and continued operation. The configuration items comprise one fault-free configuration item and multiple fault configuration items: the fault-free configuration item describes the configuration of the system when no fault is present, while each fault configuration item uses a backup module to bypass a particular unrecoverable fault so that the system can run normally despite that fault. Among all configuration items, exactly one is set as the current configuration item, and the system loads it after startup.
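One possible data model for such a blueprint is sketched below. The field names, the keying of policies by (fault location, fault type), and the default of shutting down on a fault the blueprint does not list are all assumptions made for illustration; the text only requires that each configuration item describe the system configuration and the fault-handling policies.

```python
from dataclasses import dataclass, field
from enum import Enum


class Strategy(Enum):
    SHUTDOWN = "system shutdown"
    SELF_HEAL = "system self-healing"
    CONTINUE = "continued operation"


@dataclass
class ConfigItem:
    name: str
    app_placement: dict  # application -> module that hosts it
    # (fault location, fault type) -> (strategy, configuration item to switch to)
    policies: dict = field(default_factory=dict)


@dataclass
class Blueprint:
    items: dict    # configuration item name -> ConfigItem
    current: str   # name of the current configuration item

    def lookup(self, location, kind):
        """Return (strategy, target item name) for a fault under the current item."""
        policies = self.items[self.current].policies
        # Assumption: a fault with no listed policy leads to shutdown.
        return policies.get((location, kind), (Strategy.SHUTDOWN, None))


# CF1 is the fault-free item; CF2 bypasses a failed FM2 by using backup BM1.
cf1 = ConfigItem(
    name="CF1",
    app_placement={"A": "FM1", "B": "FM2"},
    policies={("FM2", "no_response"): (Strategy.SELF_HEAL, "CF2")},
)
cf2 = ConfigItem(name="CF2", app_placement={"A": "FM1", "B": "BM1"})
blueprint = Blueprint(items={"CF1": cf1, "CF2": cf2}, current="CF1")

print(blueprint.lookup("FM2", "no_response"))
```

Keeping the policies inside each configuration item means the lookup always depends on which item is current, so the response to a fault is fixed in advance for every system state.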
The backup module is a necessary condition for system self-healing. When some module in the system suffers an unrecoverable fault, the backup module takes over that module's work.
As shown in Fig. 1, the whole system consists of multiple modules and input/output units interconnected by a communication network. Middleware that supports system self-healing realizes fault self-healing according to the description in the system configuration file. The system's applications run on processing modules, and their input and output are handled by the input/output units. The system's modules comprise multiple functional modules and backup modules; under the control of the middleware, when some module in the system suffers an unrecoverable fault, a backup module can take over that module's work.
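The takeover by a backup module can be pictured as a change in where applications are placed. The sketch below derives a new application-to-module map from an existing one by substituting a backup module for the failed module. The function name, the module names, and the particular placement are hypothetical; only the labels FM, BM, and A~J come from the figure legend.

```python
def bypass_with_backup(placement, failed_module, backup_module):
    """Return a new application -> module map with the failed module replaced.

    The input map is left unchanged, so the original placement stays available.
    """
    return {
        app: (backup_module if module == failed_module else module)
        for app, module in placement.items()
    }


# Hypothetical placement of four applications on three functional modules.
fault_free = {"A": "FM1", "B": "FM1", "C": "FM2", "D": "FM3"}
after_fm2_failure = bypass_with_backup(
    fault_free, failed_module="FM2", backup_module="BM1"
)
print(after_fm2_failure)  # {'A': 'FM1', 'B': 'FM1', 'C': 'BM1', 'D': 'FM3'}
```

In a real system the network addresses or port numbers of the moved applications would presumably change along with the placement, which this sketch leaves out.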
The key process of the invention is shown in Fig. 2. First, during system operation, once a system fault is detected, the fault is located and filtered and its type is determined. Second, a self-healing policy is selected according to the fault type, and the system's current configuration item is updated to a configuration item that can bypass the fault. Finally, after the system restarts, it works according to the new current configuration item, and system function is restored.
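The separation between updating the current configuration item and applying it at the next start can be sketched as follows. Storing the selection as a small JSON file, the file name, and the helper names `load_current` and `update_current` are assumptions for illustration; the text does not specify a storage mechanism.

```python
import json
import os
import tempfile


def load_current(path, default="CF1"):
    """Read the current configuration item name; fall back to the fault-free item."""
    if not os.path.exists(path):
        return default
    with open(path, encoding="utf-8") as f:
        return json.load(f)["current"]


def update_current(path, name):
    """Persist the new selection without touching the running system.

    The value is written to a temporary file and then renamed into place,
    so a reader never sees a half-written file.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"current": name}, f)
    os.replace(tmp_path, path)


store = os.path.join(tempfile.mkdtemp(), "current_item.json")

running = load_current(store)     # first start: fault-free item CF1
update_current(store, "CF2")      # fault handled while CF1 is still running
still_running = running           # in-memory configuration is unchanged
next_start = load_current(store)  # after restart the system loads CF2
print(running, still_running, next_start)  # CF1 CF1 CF2
```

Because only the stored selection changes, the faulty system keeps its running state until shutdown, and the reconfiguration takes effect solely through the normal start-up path.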
The middleware is the key software that implements the self-healing function. It includes modules such as fault management, configuration management, and the system blueprint, and achieves fault self-healing through the steps of system startup, configuration loading, fault detection, fault recording, fault screening, policy selection, configuration update, and system shutdown, as shown in Fig. 4. The steps are described below:
1] System startup: during system startup, complete the initialization of the software and hardware of each module of the system;
2] Configuration loading: the configuration management function loads the current configuration item described in the system blueprint into the computer system;
3] System operation: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, the fault management function periodically performs fault detection on each module that makes up the computer system;
5] Fault recording: when the computer system is found to have a fault, make a preliminary judgment of the fault and record its time, location, type, and other details for use in system maintenance and fault analysis;
6] Fault screening: screen the faults that occur, filter out occasional recoverable faults, and determine the location and type of each unrecoverable fault;
7] Policy selection: based on the location and type of the unrecoverable fault, select an appropriate handling method or reconfiguration policy from the description in the system blueprint;
8] Configuration update: set the configuration item that can bypass the fault as the current configuration item; this configuration item will be loaded into the system at its next startup, and the update does not change the running state of the currently faulty system;
9] System shutdown: shut down the system according to the predetermined instruction and wait for the next startup.
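The nine steps can be traced with a small simulation of two consecutive power cycles. This is a sketch under stated assumptions: the blueprint is reduced to plain dictionaries, `detect` is a hypothetical callback that returns an already screened (location, type) pair or `None`, and a fault without a listed policy leads to shutdown.

```python
def run_once(blueprint, current, detect):
    """Simulate one power cycle; return (next current item name, event log)."""
    events = ["start", f"load {current}", "run"]   # steps 1] to 3]
    fault = detect()                               # step 4]
    if fault is None:
        return current, events
    events.append(f"record {fault}")               # step 5]
    # Step 6] (screening) is assumed to have happened inside `detect`.
    policies = blueprint[current]["policies"]
    strategy, target = policies.get(fault, ("shutdown", None))   # step 7]
    if strategy == "continue":
        events.append("continue")
        return current, events
    if strategy == "self_heal":
        current = target                           # step 8]
        events.append(f"update current -> {target}")
    events.append("shutdown")                      # step 9]
    return current, events


blueprint = {
    "CF1": {"policies": {("FM2", "no_response"): ("self_heal", "CF2")}},
    "CF2": {"policies": {}},
}

faults = iter([("FM2", "no_response"), None])


def detect():
    return next(faults)


current, first_cycle = run_once(blueprint, "CF1", detect)
current, second_cycle = run_once(blueprint, current, detect)
print(first_cycle)
print(second_cycle)
```

In the first cycle the system records the fault, switches the current item to CF2, and shuts down; in the second cycle it starts directly on CF2 and runs without a fault.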
Claims (3)
1. A fault self-healing system based on system reconfiguration, characterized in that it comprises:
a system blueprint software module, containing one fault-free configuration item and multiple fault configuration items, one of which is set as the current configuration item; each configuration item fully describes the configuration required for the computer system to run, and the fault configuration items also include a description of all fault-handling policies; the fault-handling policies include system shutdown, system self-healing, and continued operation;
a configuration management software module, for loading the current configuration item of the system blueprint into the computer system after system initialization, so that the computer system runs normally or shuts down according to the description in the current configuration item;
a fault management software module, which periodically performs fault detection on the computer system, screens the faults that occur, filters out occasional recoverable faults, determines the location and type of each unrecoverable fault, then queries the system blueprint to determine the fault-handling policy and, if system self-healing is required, updates the current configuration item to a fault configuration item that can bypass the fault, this configuration item being loaded into the system at its next startup, and the update not changing the running state of the currently faulty system; and
a hardware backup module, for substituting for the corresponding hardware function in the computer system so that the fault can be bypassed.
2. A method for achieving fault self-healing with the fault self-healing system based on system reconfiguration according to claim 1, characterized in that it comprises the following steps:
1] System startup: during system startup, complete the initialization of the software and hardware of each module of the system;
2] Configuration loading: load the current configuration item described in the system blueprint into the computer system;
3] System operation: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, periodically perform fault detection on each software and hardware module that makes up the computer system;
5] Fault recording: when the computer system is found to have a fault, make a preliminary judgment of the fault and record the fault information;
6] Fault screening: based on the recorded fault information, screen the faults further, filter out occasional recoverable faults, and determine the location and type of each unrecoverable fault;
7] Policy selection: based on the location and type of the unrecoverable fault, determine the fault-handling policy from the description in the system blueprint; if system self-healing is required, select a configuration item that can bypass the fault;
8] Configuration update: set the configuration item that can bypass the fault as the current configuration item; this configuration item will be loaded into the system at its next startup, and the update does not change the running state of the currently faulty system;
9] System shutdown: after the configuration update completes, shut down the system and wait for the updated configuration item to be loaded at the next startup.
3. The method for achieving fault self-healing according to claim 2, characterized in that the fault information recorded in step 5] includes the time, location, and type of the fault.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510926572.7A CN105550056B (en) | 2015-12-11 | 2015-12-11 | A kind of fault self-recovery system and its implementation based on system reconfiguration |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105550056A | 2016-05-04 |
CN105550056B | 2019-08-06 |
Family
ID=55829253
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510926572.7A (Active, CN105550056B) | A kind of fault self-recovery system and its implementation based on system reconfiguration | 2015-12-11 | 2015-12-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105550056B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106451384B (en) * | 2016-11-09 | 2019-06-04 | 贵州电网有限责任公司兴义供电局 | Power grid self-healing DSS based on scheduling emergency preplan |
CN106411615A (en) * | 2016-11-22 | 2017-02-15 | 北京奇虎科技有限公司 | Device used for cloud remediation of system application and method |
CN107273232B (en) * | 2017-05-22 | 2020-04-28 | 国网安徽省电力公司信息通信分公司 | Self-healing scheduling method for enterprise information system |
CN108958989B (en) * | 2017-06-06 | 2021-09-17 | 北京猎户星空科技有限公司 | System fault recovery method and device |
CN108021827A (en) * | 2017-12-07 | 2018-05-11 | 中科开元信息技术(北京)有限公司 | A kind of method and system based on area mechanism structure security system |
CN115509803A (en) * | 2021-06-23 | 2022-12-23 | 中兴通讯股份有限公司 | Software recovery method, electronic device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662788A (en) * | 2012-04-28 | 2012-09-12 | 浪潮电子信息产业股份有限公司 | Computer system fault diagnosis decision and processing method |
CN104035831A (en) * | 2014-07-01 | 2014-09-10 | 浪潮(北京)电子信息产业有限公司 | High-end fault-tolerant computer management system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6567272B1 (en) * | 2001-11-09 | 2003-05-20 | Dell Products L.P. | System and method for utilizing system configurations in a modular computer system |
Also Published As
Publication number | Publication date |
---|---|
CN105550056A (en) | 2016-05-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |