CN105550056B - A kind of fault self-recovery system and its implementation based on system reconfiguration - Google Patents

Publication number: CN105550056B
Authority: CN (China)
Prior art keywords: failure, configuration, self, fault, healing
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201510926572.7A
Other languages: Chinese (zh)
Other versions: CN105550056A (en)
Inventors: 王乐, 郭鹏, 孙允明, 谢建春, 邸海涛, 黄英兰
Current Assignee: Xian Aeronautics Computing Technique Research Institute of AVIC (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date: 2015-12-11 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2015-12-11
Publication date: 2019-08-06
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN201510926572.7A
Publication of CN105550056A
Application granted
Publication of CN105550056B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation; the processing taking place on a specific hardware platform or in a specific software environment

Abstract

The present invention proposes a fault self-recovery method based on system reconfiguration. The computer system to which it applies consists, in hardware, of multiple functional modules and backup modules and, in software, of modules such as fault management, configuration management, and the system blueprint. The steps are, in order: system startup, configuration loading, fault detection, fault logging, fault recovery, policy selection, configuration update, and system shutdown. The system blueprint is the key design element of system fault self-healing. It contains multiple configuration items for system operation; each configuration item describes the software, hardware, network, and other configuration of the computer system, and the computer system can run according to the description in a configuration item. The configuration items also describe all fault-handling strategies. The backup module is a necessary condition for system self-healing: when some module in the system suffers an unrecoverable fault, the backup module takes over that module's work. The invention has the advantages of a simple self-healing strategy, a controllable self-healing process, and a deterministic self-healing result.

Description

A kind of fault self-recovery system and its implementation based on system reconfiguration
Technical field
The invention belongs to the computer field and provides a fault self-recovery method.
Background art
In complex computer systems that are used frequently but are difficult to maintain, such as the airborne electronic systems of civil aircraft and the onboard electronic systems of satellites, designers hope to use self-healing design to improve system availability and to reduce the system's maintenance cycle and cost.
Current self-healing techniques achieve functional self-healing mainly by restoring the software or hardware at the fault location. Common technical measures include:
(1) designing multiple channels at locations where faults may occur; after a fault occurs, a fault-free channel is selected to bypass the faulty channel, achieving system self-healing;
(2) when software code or FPGA logic stored in non-volatile memory is corrupted, overwriting the faulty code or logic with correct code or logic, achieving system self-healing.
These self-healing techniques are designed mainly to restore local system functions. Although they work well within that scope, many hardware faults still cannot be healed functionally.
Summary of the invention
The present invention targets complex computer systems that are used frequently. By providing backup modules in the system and using functional modules such as fault management, configuration management, and the system blueprint, it achieves fault self-recovery of the system, thereby improving system availability and reducing the system's maintenance cycle and cost.
The specific technical solution of the invention is as follows:
A fault self-recovery system based on system reconfiguration, characterized in that it comprises:
a system blueprint software module, which contains one fault-free configuration item and multiple fault configuration items, one of which is set as the current configuration item; each configuration item fully describes the various configurations needed for the computer system to run (the computer system runs according to the description in the configuration item) and includes fault-handling strategies; the fault-handling strategies include system shutdown, system self-healing, and continued operation;
a configuration management software module, which loads the current configuration item from the system blueprint into the computer system after system initialization, so that the computer system runs normally or shuts down according to the description in the current configuration item;
a fault management software module, which periodically performs fault detection on the computer system, screens the faults that occur, filters out occasional recoverable faults, determines the location and type of unrecoverable faults, and then queries the system blueprint to determine the fault-handling strategy; if system self-healing is required, it updates the current configuration item to the fault configuration item that can bypass the fault; and
a hardware backup module, which substitutes for the corresponding hardware function in the computer system to support bypassing the fault.
The method by which the invention achieves fault self-recovery comprises the following steps:
1] System startup: the software and hardware of each module of the system are initialized during system startup;
2] Configuration loading: the current configuration item described in the system blueprint is loaded into the computer system;
3] System run: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, fault detection is performed periodically on each software and hardware module that makes up the computer system;
5] Fault logging: after a fault is found in the computer system, a preliminary judgment is made on the fault and the fault information is recorded;
6] Fault recovery: based on the recorded fault information, the fault is screened further, occasional recoverable faults are filtered out, and the location and type of the unrecoverable fault are determined;
7] Policy selection: according to the location and type of the unrecoverable fault, the fault-handling strategy is determined from the description in the system blueprint; if system self-healing is required, the configuration item that can bypass the fault is selected;
8] Configuration update: the configuration item that can bypass the fault is set as the current configuration item; it will be loaded into the system at the next system startup, and the update process does not change the running state of the currently faulty system;
9] System shutdown: after the configuration update completes, the system is shut down and waits to load the updated configuration item at the next startup.
The fault information recorded in step 5] above mainly includes the time, location, and type of the fault.
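The nine steps can be traced across two power cycles with a minimal Python sketch. It is an illustration only, not the patented implementation: the JSON file standing in for persistent storage, the function names `load_current`, `update_current`, and `run_once`, and the item names CF1 and CF2 are all assumptions, and steps 4 to 7 are collapsed into a single callback.

```python
import json
import tempfile
from pathlib import Path


def load_current(store):
    """Step 2: read the name of the current configuration item."""
    return json.loads(store.read_text())["current"]


def update_current(store, name):
    """Step 8: persist a new current item; the running system is untouched."""
    store.write_text(json.dumps({"current": name}))


def run_once(store, choose_bypass):
    """One power cycle. Returns the configuration item the system ran with."""
    running = load_current(store)        # steps 1-2: start up and load
    bypass = choose_bypass(running)      # steps 4-7: detect, log, screen, select
    if bypass is not None:
        update_current(store, bypass)    # step 8: takes effect at next startup
    return running                       # step 9: shut down


# Hypothetical scenario: a module fails while CF1 is active and CF2 bypasses it.
def choose_bypass(running):
    return "CF2" if running == "CF1" else None


store = Path(tempfile.mkdtemp()) / "current_item.json"
update_current(store, "CF1")             # factory state: fault-free item

first_boot = run_once(store, choose_bypass)
second_boot = run_once(store, choose_bypass)
print(first_boot, second_boot)
```

The first cycle still runs with CF1 even though the update has already been written, which mirrors the statement that the update does not change the running state of the faulty system.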
The invention has the following advantages:
The invention approaches the problem at the level of the whole computer system: it replaces the faulty module with a backup module and uses the current configuration item in the system blueprint to control the self-healing process. This achieves fault self-recovery of complex computer systems, with the advantages of a simple self-healing strategy, a controllable self-healing process, and a deterministic self-healing result.
Brief description of the drawings
Fig. 1 is a structural diagram of the system of the invention. In the figure: FM, functional module; BM, backup module; A~J, applications.
Fig. 2 is a schematic diagram of the key self-healing process of the system of the invention. In the figure: FM, functional module; BM, backup module; A~J, applications; CF1, configuration item 1; CF2, configuration item 2; P1~Px, network addresses (or port numbers).
Fig. 3 is a structural diagram of the system blueprint of the invention.
Fig. 4 is a flowchart of fault self-recovery according to the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings.
The fault self-recovery method based on system reconfiguration applies to a computer system that consists, in hardware, of multiple functional modules and backup modules and, in software, of modules such as fault management, configuration management, and the system blueprint.
The system blueprint is the key design element of system fault self-healing. It contains multiple configuration items for system operation. Each configuration item describes the software, hardware, network, and other configuration of the computer system, and the computer system can run according to the description in a configuration item. The configuration items also describe all fault-handling strategies: system shutdown, system self-healing, and continued operation. The configuration items comprise one fault-free configuration item and multiple fault configuration items. The fault-free configuration item describes the system's configuration when no fault is present, while each fault configuration item uses a backup module to bypass a particular unrecoverable fault so that the system can run normally despite that fault. One of the configuration items is set as the current configuration item, and the system loads it after startup.
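One possible in-memory shape for such a blueprint is sketched below in Python. The class and field names (`ConfigItem`, `Blueprint`, `placement`, `bypasses`, `policies`) and the module names FM1, FM2, and BM1 are hypothetical; the patent does not prescribe a data format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Strategy(Enum):
    SHUTDOWN = "system shutdown"
    SELF_HEAL = "system self-healing"
    CONTINUE = "continued operation"


@dataclass
class ConfigItem:
    name: str
    placement: dict                      # assumed layout: application -> module
    bypasses: frozenset = frozenset()    # modules bypassed; empty if fault-free
    policies: dict = field(default_factory=dict)  # (module, fault type) -> Strategy


@dataclass
class Blueprint:
    fault_free: ConfigItem
    fault_items: list
    current_name: str = ""

    def __post_init__(self):
        # By default the fault-free item is the current configuration item.
        if not self.current_name:
            self.current_name = self.fault_free.name

    def items(self):
        return [self.fault_free, *self.fault_items]

    def current(self):
        return next(i for i in self.items() if i.name == self.current_name)

    def set_current(self, name):
        if name not in {i.name for i in self.items()}:
            raise KeyError(name)
        self.current_name = name


cf1 = ConfigItem("CF1", {"A": "FM1", "B": "FM2"},
                 policies={("FM1", "hardware"): Strategy.SELF_HEAL})
cf2 = ConfigItem("CF2", {"A": "BM1", "B": "FM2"}, bypasses=frozenset({"FM1"}))
blueprint = Blueprint(fault_free=cf1, fault_items=[cf2])
```

Keeping exactly one `current_name` in the blueprint reflects the rule that a single configuration item is marked as current and loaded after startup.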
The backup module is a necessary condition for system self-healing. When some module in the system suffers an unrecoverable fault, the backup module takes over that module's work.
As shown in Fig. 1, the whole system consists of multiple modules and input/output units interconnected by a communication network. Middleware that supports system self-healing realizes fault self-recovery according to the description in the system configuration file. The system's applications run on the processing modules, and application input and output are handled by the input/output units. The system's modules comprise multiple functional modules and backup modules; under the control of the middleware, when some module suffers an unrecoverable fault, a backup module can take over that module's work.
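Backup substitution can be pictured as remapping every application hosted on the failed module to a backup module. The Python sketch below assumes a simple application-to-module dictionary; the function name `substitute` and the module names are hypothetical and not taken from the patent.

```python
def substitute(placement, failed, backup):
    """Return a new placement with applications on `failed` moved to `backup`.

    The input mapping is left unchanged, so the running configuration is
    not modified while the replacement configuration is being derived.
    """
    if backup in placement.values():
        raise ValueError(f"backup module {backup} is already in use")
    return {
        app: (backup if module == failed else module)
        for app, module in placement.items()
    }


# Hypothetical layout: applications A and B on FM1, C on FM2, BM1 idle.
placement = {"A": "FM1", "B": "FM1", "C": "FM2"}
healed = substitute(placement, failed="FM1", backup="BM1")
print(healed)
```

In the patent the remapped layouts are prepared in advance as fault configuration items rather than computed at fault time; the function only illustrates what such an item encodes.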
The key process of the invention is shown in Fig. 2. First, when a fault is detected during system operation, the fault is located and filtered and its type is determined. Second, a self-healing strategy is selected according to the fault type, and the system's current configuration item is updated to a configuration item that can bypass the fault. Finally, after the system restarts, it works according to the new current configuration item, and system function is healed.
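The strategy selection in the second stage amounts to a table lookup keyed by fault location and type. A minimal Python sketch follows; the table contents, the names `POLICY_TABLE` and `select_policy`, and the shutdown fallback for unlisted faults are assumptions made for illustration.

```python
from enum import Enum


class Strategy(Enum):
    SHUTDOWN = "system shutdown"
    SELF_HEAL = "system self-healing"
    CONTINUE = "continued operation"


# Hypothetical blueprint excerpt:
# (fault location, fault type) -> (strategy, configuration item to activate)
POLICY_TABLE = {
    ("FM1", "hardware"): (Strategy.SELF_HEAL, "CF2"),
    ("FM2", "hardware"): (Strategy.SHUTDOWN, None),
    ("IO1", "degraded"): (Strategy.CONTINUE, None),
}


def select_policy(location, fault_type, table=POLICY_TABLE):
    """Look up the handling strategy for an unrecoverable fault.

    Faults missing from the table fall back to shutdown; this fallback
    is an assumption, not something the patent specifies.
    """
    return table.get((location, fault_type), (Strategy.SHUTDOWN, None))


strategy, next_item = select_policy("FM1", "hardware")
print(strategy.value, next_item)
```

Because every outcome is enumerated in advance, the result of self-healing is determined by the table rather than by run-time reasoning.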
The middleware is the key software that implements the system's self-healing function. It includes modules such as fault management, configuration management, and the system blueprint, and realizes fault self-healing through the steps of system startup, configuration loading, fault detection, fault logging, fault recovery, policy selection, configuration update, and system shutdown, as shown in Fig. 4. The steps are as follows:
1] System startup: the software and hardware of each module of the system are initialized during system startup;
2] Configuration loading: the configuration management function loads the current configuration item described in the system blueprint into the computer system;
3] System run: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, the fault management function periodically performs fault detection on each module that makes up the computer system;
5] Fault logging: after a fault is found in the computer system, a preliminary judgment is made on the fault, and the time, location, type, and other details of the fault are recorded for system maintenance and fault analysis;
6] Fault recovery: the faults that occur are screened, occasional recoverable faults are filtered out, and the location and type of the unrecoverable fault are determined;
7] Policy selection: according to the location and type of the unrecoverable fault, a suitable handling method or reconfiguration strategy is selected from the description in the system blueprint;
8] Configuration update: the configuration item that can bypass the fault is set as the current configuration item; it will be loaded into the system at the next system startup, and the update process does not change the running state of the currently faulty system;
9] System shutdown: the system is shut down according to the established instructions and waits for the next startup.
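Steps 5] and 6] can be sketched in Python as a fault log plus a screening pass. The patent does not state how occasional faults are told apart from unrecoverable ones; the rule used here, treating a fault as unrecoverable once the same location and type has been logged at least `threshold` times, is purely an assumption, as are the names `FaultRecord` and `screen`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    time: float        # when the fault was observed
    location: str      # which module reported it
    fault_type: str    # preliminary classification


def screen(records, threshold=3):
    """Return the (location, type) pairs logged at least `threshold` times.

    Faults seen fewer times are treated as occasional and filtered out.
    """
    counts = Counter((r.location, r.fault_type) for r in records)
    return sorted(key for key, n in counts.items() if n >= threshold)


# Hypothetical log: FM1 fails on three consecutive checks, FM2 glitches once.
log = [
    FaultRecord(10.0, "FM1", "hardware"),
    FaultRecord(11.0, "FM2", "timeout"),
    FaultRecord(12.0, "FM1", "hardware"),
    FaultRecord(14.0, "FM1", "hardware"),
]
unrecoverable = screen(log)
print(unrecoverable)
```

The full log is kept regardless of the screening result, since the recorded time, location, and type are also used for maintenance and fault analysis.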

Claims (3)

1. A fault self-recovery system based on system reconfiguration, characterized in that it comprises:
a system blueprint software module, which contains one fault-free configuration item and multiple fault configuration items, one of which is set as the current configuration item; each configuration item fully describes the various configurations needed for the computer system to run, and the fault configuration items additionally contain descriptions of all fault-handling strategies; the fault-handling strategies include system shutdown, system self-healing, and continued operation;
a configuration management software module, for loading the current configuration item from the system blueprint into the computer system after system initialization, so that the computer system runs normally or shuts down according to the description in the current configuration item;
a fault management software module, which periodically performs fault detection on the computer system, screens the faults that occur, filters out occasional recoverable faults, determines the location and type of unrecoverable faults, and then queries the system blueprint to determine the fault-handling strategy; if system self-healing is required, it updates the current configuration item to the fault configuration item that can bypass the fault; this configuration item will be loaded into the system at the next system startup, and the update process does not change the running state of the currently faulty system; and
a hardware backup module, which substitutes for the corresponding hardware function in the computer system to support bypassing the fault.
2. A method for achieving fault self-recovery with the fault self-recovery system based on system reconfiguration according to claim 1, characterized in that it comprises the following steps:
1] System startup: the software and hardware of each module of the system are initialized during system startup;
2] Configuration loading: the current configuration item described in the system blueprint is loaded into the computer system;
3] System run: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, fault detection is performed periodically on each software and hardware module that makes up the computer system;
5] Fault logging: after a fault is found in the computer system, a preliminary judgment is made on the fault and the fault information is recorded;
6] Fault recovery: based on the recorded fault information, the fault is screened further, occasional recoverable faults are filtered out, and the location and type of the unrecoverable fault are determined;
7] Policy selection: according to the location and type of the unrecoverable fault, the fault-handling strategy is determined from the description in the system blueprint; if system self-healing is required, the configuration item that can bypass the fault is selected;
8] Configuration update: the configuration item that can bypass the fault is set as the current configuration item; it will be loaded into the system at the next system startup, and the update process does not change the running state of the currently faulty system;
9] System shutdown: after the configuration update completes, the system is shut down and waits to load the updated configuration item at the next startup.
3. The method for achieving fault self-recovery according to claim 2, characterized in that the fault information recorded in step 5] includes the time, location, and type of the fault.
CN201510926572.7A 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration Active CN105550056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510926572.7A CN105550056B (en) 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510926572.7A CN105550056B (en) 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration

Publications (2)

Publication Number Publication Date
CN105550056A CN105550056A (en) 2016-05-04
CN105550056B true CN105550056B (en) 2019-08-06

Family

ID=55829253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510926572.7A Active CN105550056B (en) 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration

Country Status (1)

Country Link
CN (1) CN105550056B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106451384B (en) * 2016-11-09 2019-06-04 贵州电网有限责任公司兴义供电局 Power grid self-healing DSS based on scheduling emergency preplan
CN106411615A (en) * 2016-11-22 2017-02-15 北京奇虎科技有限公司 Device used for cloud remediation of system application and method
CN107273232B (en) * 2017-05-22 2020-04-28 国网安徽省电力公司信息通信分公司 Self-healing scheduling method for enterprise information system
CN108958989B (en) * 2017-06-06 2021-09-17 北京猎户星空科技有限公司 System fault recovery method and device
CN108021827A (en) * 2017-12-07 2018-05-11 中科开元信息技术(北京)有限公司 A kind of method and system based on area mechanism structure security system
CN115509803A (en) * 2021-06-23 2022-12-23 中兴通讯股份有限公司 Software recovery method, electronic device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662788A (en) * 2012-04-28 2012-09-12 浪潮电子信息产业股份有限公司 Computer system fault diagnosis decision and processing method
CN104035831A (en) * 2014-07-01 2014-09-10 浪潮(北京)电子信息产业有限公司 High-end fault-tolerant computer management system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6567272B1 (en) * 2001-11-09 2003-05-20 Dell Products L.P. System and method for utilizing system configurations in a modular computer system

Also Published As

Publication number Publication date
CN105550056A (en) 2016-05-04

Similar Documents

Publication Publication Date Title
CN105550056B (en) A kind of fault self-recovery system and its implementation based on system reconfiguration
CN106933570B (en) Aerospace test, launch and control software platform based on plug-in technology
CN104133734B (en) Distributed integrated modular avionic system hybrid dynamic reconfiguration system and method
CN102200944B (en) Test environment cloning method and system for enterprise resource planning (ERP) system
Lee et al. Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks
WO2008078281A2 (en) Distributed platform management for high availability systems
CN102207879B (en) Hot-updating method and hot-updating system of Lua script
CN109445825A (en) The method and apparatus that a kind of server cluster system updates upgrading
JPH04139544A (en) Data restoring method
CN105988885B (en) Operating system failure self-recovery method based on compensation rollback
CN104156369B (en) A kind of database mirroring production method and a kind of database
CN104780068B (en) A kind of method for switching network, the apparatus and system of computer room migration
CN110674192A (en) Redis high-availability VIP (very important person) drifting method, terminal and storage medium
CN105279042A (en) Redundant backup system and method for BSD system
CN106598703A (en) Transaction compensation method and device for integrated system
CN115687019A (en) Database cluster fault processing method, intelligent monitoring platform, equipment and medium
CN109189444A (en) A kind of upgrade control method and device of the management node of server virtualization system
CN109116818A (en) Real time data dump method and device when a kind of SCADA system upgrades
Dugan et al. Simple models of hardware and software fault tolerance
US20030126159A1 (en) Method and system for rollback of software system upgrade
CN105677515A (en) Online backup method and system for database
CN105404278A (en) Safety-critical software health management method
US10365864B2 (en) Information processing system and operation redundantizing method
Tóth et al. A structural decomposition-based diagnosis method for dynamic process systems using HAZID information
CN116545845B (en) Redundant backup device, system and method for production server

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant