CN105550056A - System reconfiguration based fault self-recovery system and realization method therefor - Google Patents

System reconfiguration based fault self-recovery system and realization method therefor

Info

Publication number
CN105550056A
Authority
CN
China
Prior art keywords
fault
self
configuration item
module
recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510926572.7A
Other languages
Chinese (zh)
Other versions
CN105550056B (en)
Inventor
王乐
郭鹏
孙允明
谢建春
邸海涛
黄英兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN201510926572.7A
Publication of CN105550056A
Application granted
Publication of CN105550056B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment

Abstract

The invention proposes a fault self-recovery method based on system reconfiguration. In hardware, the target computer system consists of multiple functional modules and a backup module; in software, it includes a fault management module, a configuration management module, a system blueprint module and other modules. The method comprises, in sequence, the steps of system startup, configuration loading, fault detection, fault recording, fault filtering, policy selection, configuration update and system shutdown. The system blueprint module is the key design element of system fault self-recovery. It contains multiple configuration items for system operation; each configuration item describes the software, hardware, network and other configurations of the computer system, and the computer system can run according to the description in a configuration item. The configuration items also contain descriptions of all fault handling policies. The backup module is a necessary condition for system self-recovery: when a module in the system suffers an unrecoverable fault, the backup module can take over that module's work. The method has the advantages of a simple self-recovery policy, a controllable self-recovery process and a definite self-recovery result.

Description

A fault self-recovery system based on system reconfiguration and a method for implementing it
Technical field
The invention belongs to the computer field and provides a fault self-recovery method.
Background art
For complex computer systems that are used frequently but are difficult to maintain, such as the airborne electronic systems of civil aircraft and the onboard electronic systems of satellites, designers hope that a self-recovery design will improve system availability and reduce the system's maintenance cycle and cost.
Current self-recovery techniques mainly restore the software or hardware at the fault location in order to restore system function. Common technical measures include:
(1) designing multiple channels at locations where faults may occur and, after a fault occurs, selecting a fault-free channel to bypass the faulty one, thereby achieving system self-recovery;
(2) when software code or FPGA logic stored in non-volatile memory is corrupted, overwriting the faulty code or logic with a correct copy, thereby achieving system self-recovery.
These techniques mainly target the recovery of local system functions. Although they are effective, many hardware faults still cannot be recovered from in this way.
The present invention takes the viewpoint of the computer system as a whole: a backup module is provided to replace a faulty module, and the current configuration item in the system blueprint controls the self-recovery process. This achieves fault self-recovery for complex computer systems, with the advantages of a simple self-recovery policy, a controllable self-recovery process and a definite self-recovery result.
Summary of the invention
The present invention targets complex computer systems that are used frequently. By providing a backup module in the system and using functional modules such as fault management, configuration management and the system blueprint, it achieves fault self-recovery of the system, thereby improving system availability and reducing the system's maintenance cycle and cost.
The specific technical solution of the invention is as follows:
A fault self-recovery system based on system reconfiguration, characterized in that it comprises:
A system blueprint software module, comprising one fault-free configuration item and multiple fault configuration items, with one of these configuration items set as the current configuration item. Each configuration item fully describes the configurations needed for the computer system to run (the computer system runs according to the description in the configuration item) and contains fault handling policies. The fault handling policies comprise system shutdown, system self-recovery and continued operation;
A configuration management software module, which loads the current configuration item of the system blueprint into the computer system after system initialization, so that the computer system runs normally or shuts down according to the description in the current configuration item;
A fault management software module, which periodically performs fault detection on the computer system, screens the faults that occur, filters out occasional recoverable faults, and determines the location and type of each unrecoverable fault. It then queries the system blueprint to determine the fault handling policy and, if system self-recovery is required, updates the current configuration item to a fault configuration item that can bypass the fault; and
A hardware backup module, which replaces the corresponding hardware function module in the computer system in order to support bypassing the fault.
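The division of labour among these modules can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Policy`, `ConfigItem`, `Blueprint` and `handle_unrecoverable_fault`, the keying of policies by (location, type) pairs, and the fallback to shutdown for a fault with no entry are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Tuple


class Policy(Enum):
    """The three fault handling policies named in the text."""
    SHUTDOWN = "shutdown"
    SELF_RECOVER = "self_recover"
    CONTINUE = "continue"


FaultKey = Tuple[str, str]  # (location, type) of an unrecoverable fault


@dataclass
class ConfigItem:
    name: str
    # Hypothetical layout: application -> hardware module that hosts it.
    placement: Dict[str, str]
    # Fault -> (policy, name of the fault configuration item that bypasses it).
    policies: Dict[FaultKey, Tuple[Policy, Optional[str]]] = field(default_factory=dict)


@dataclass
class Blueprint:
    items: Dict[str, ConfigItem]
    current: str  # name of the current configuration item


def handle_unrecoverable_fault(bp: Blueprint, fault: FaultKey) -> Policy:
    """Fault management: query the blueprint and, if self-recovery is
    required, make the bypassing fault configuration item current."""
    # Assumption: a fault with no entry falls back to shutdown.
    policy, target = bp.items[bp.current].policies.get(fault, (Policy.SHUTDOWN, None))
    if policy is Policy.SELF_RECOVER and target in bp.items:
        bp.current = target
    return policy


blueprint = Blueprint(
    items={
        "CF1": ConfigItem(
            "CF1",
            {"A": "FM1", "B": "FM2"},
            {("FM2", "no_response"): (Policy.SELF_RECOVER, "CF2")},
        ),
        "CF2": ConfigItem("CF2", {"A": "FM1", "B": "BM"}),
    },
    current="CF1",
)
result = handle_unrecoverable_fault(blueprint, ("FM2", "no_response"))
print(result.name, blueprint.current)
```

In this sketch only the name of the current configuration item changes; loading it remains the job of configuration management at the next startup.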
The method by which the invention achieves fault self-recovery comprises the following steps:
1] System startup: the software and hardware of each module of the system are initialized during system startup;
2] Configuration loading: the current configuration item described in the system blueprint is loaded into the computer system;
3] System operation: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, fault detection is performed periodically on each software and hardware module that makes up the computer system;
5] Fault recording: after a fault is found in the computer system, a preliminary judgment is made on the fault and the fault information is recorded;
6] Fault filtering: based on the recorded fault information, the faults are screened further, occasional recoverable faults are filtered out, and the location and type of each unrecoverable fault are determined;
7] Policy selection: based on the location and type of the unrecoverable fault, the fault handling policy is determined according to the description in the system blueprint; if system self-recovery is required, a configuration item that can bypass the fault is selected;
8] Configuration update: the configuration item that can bypass the fault is set as the current configuration item. This configuration item will be loaded into the system at the next startup; the update does not change the running state of the currently faulty system;
9] System shutdown: after the configuration update is complete, the system is shut down and waits to load the updated configuration item at the next startup.
The fault information recorded in step 5] mainly comprises the time, location and type of the fault.
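The recording and filtering steps (steps 5 and 6) can be sketched as follows. The text does not say how an occasional recoverable fault is told apart from an unrecoverable one, so the occurrence-count threshold used here is purely an assumption, as are the names `FaultRecord` and `filter_faults`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class FaultRecord:
    time: float       # when the fault was observed
    location: str     # which module reported it
    fault_type: str   # what kind of fault it was


def filter_faults(records: List[FaultRecord], threshold: int = 3) -> Set[Tuple[str, str]]:
    """Return the (location, type) pairs treated as unrecoverable.

    Assumption: a fault seen at least `threshold` times is unrecoverable;
    anything rarer is treated as an occasional, recoverable fault.
    """
    counts = Counter((r.location, r.fault_type) for r in records)
    return {key for key, seen in counts.items() if seen >= threshold}


log = [
    FaultRecord(10.0, "FM2", "no_response"),
    FaultRecord(11.0, "FM1", "bit_flip"),      # seen once: filtered out
    FaultRecord(20.0, "FM2", "no_response"),
    FaultRecord(30.0, "FM2", "no_response"),
]
unrecoverable = filter_faults(log)
print(unrecoverable)
```

A real system might filter on persistence across consecutive detection cycles or on retry results instead; the point of the sketch is only that filtering consumes the recorded time, location and type and yields the location and type of the faults that remain.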
The advantages of the invention are as follows:
By providing a backup module in the system and using the configurations and policies recorded in the system blueprint, the method achieves fault self-recovery for complex computer systems, with the advantages of a simple self-recovery policy, a controllable self-recovery process and a definite self-recovery result.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the invention. In the figure, FM: functional module; BM: backup module; A to J: applications.
Fig. 2 is a schematic diagram of the key self-recovery process of the system. In the figure, FM: functional module; BM: backup module; A to J: applications; CF1: configuration item 1; CF2: configuration item 2; P1 to Px: network addresses (or port numbers).
Fig. 3 is a structural diagram of the system blueprint of the invention.
Fig. 4 is a flow chart of fault self-recovery according to the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings.
In the fault self-recovery method based on system reconfiguration, the target computer system consists, in hardware, of multiple functional modules and a backup module; in software, it comprises modules such as fault management, configuration management and the system blueprint.
The system blueprint is the key design element of system fault self-recovery. It contains multiple configuration items for system operation. Each configuration item describes the software, hardware, network and other configurations of the computer system, and the computer system can run according to the description in a configuration item. The configuration items also describe all fault handling policies, which include system shutdown, system self-recovery, continued operation and so on. The configuration items consist of one fault-free configuration item and multiple fault configuration items: the fault-free configuration item describes the configuration of the system when no fault is present, while each fault configuration item uses the backup module to bypass a particular unrecoverable fault so that the system runs normally despite that fault. Among all configuration items, exactly one is set as the current configuration item, and this item is loaded after the system starts.
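The structural rules stated for the blueprint (one fault-free configuration item, one current configuration item, and a bypassing item for every fault whose policy is self-recovery) can be checked mechanically. The sketch below assumes a hypothetical dictionary layout with the keys `current`, `items`, `bypasses`, `policies`, `policy` and `use`; none of these names come from the patent.

```python
from typing import Any, Dict, List


def validate_blueprint(blueprint: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the blueprint is consistent."""
    problems: List[str] = []
    items = blueprint.get("items", {})

    # The fault-free item is the one that bypasses no fault.
    fault_free = [name for name, item in items.items() if item.get("bypasses") is None]
    if len(fault_free) != 1:
        problems.append(f"expected exactly 1 fault-free item, found {len(fault_free)}")

    if blueprint.get("current") not in items:
        problems.append("current configuration item is not defined")

    for name, item in items.items():
        for fault, rule in item.get("policies", {}).items():
            if rule["policy"] not in ("shutdown", "self_recover", "continue"):
                problems.append(f"{name}: unknown policy for {fault}")
            if rule["policy"] == "self_recover" and rule.get("use") not in items:
                problems.append(f"{name}: no bypass item for {fault}")
    return problems


blueprint = {
    "current": "CF1",
    "items": {
        "CF1": {  # fault-free configuration item
            "bypasses": None,
            "policies": {
                "FM2:no_response": {"policy": "self_recover", "use": "CF2"},
                "FM1:sensor_noise": {"policy": "continue"},
                "BM:no_response": {"policy": "shutdown"},
            },
        },
        "CF2": {  # fault configuration item: the backup module replaces FM2
            "bypasses": "FM2:no_response",
            "policies": {},
        },
    },
}
print(validate_blueprint(blueprint))
```

Checking the blueprint before deployment fits the claimed advantage of a definite self-recovery result: every fault that can trigger self-recovery already has a known target configuration.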
The backup module is a necessary condition for system self-recovery. When a module in the system suffers an unrecoverable fault, the backup module can take over that module's work.
As shown in Fig. 1, the whole system consists of multiple modules and input/output units interconnected by a communication network. Middleware that supports system self-recovery achieves fault self-recovery according to the description in the system configuration file. The system's applications run on the processing modules and perform their input and output through the input/output units. The modules of the system consist of multiple functional modules and a backup module; under the control of the middleware, when a module in the system suffers an unrecoverable fault, the backup module can take over that module's work.
The key process of the invention is shown in Fig. 2. First, when a fault is detected while the system is running, the fault is located and filtered and its type is determined. Second, a self-recovery policy is selected according to the fault type, and the current configuration item of the system is updated to a configuration item that can bypass the fault. Finally, after the system restarts, it works according to the new current configuration item and system function is restored.
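The relationship between the fault-free configuration item (CF1 in Fig. 2) and a fault configuration item (CF2) can be sketched as a remapping of applications from the failed functional module to the backup module. In the patent the fault configuration items are already stored in the blueprint, so this function is best read as a hypothetical authoring aid; `bypass_module` and the application-to-module dictionary are assumptions, and network addresses or ports are left out.

```python
from typing import Dict


def bypass_module(placement: Dict[str, str], failed: str, backup: str = "BM") -> Dict[str, str]:
    """Build the placement for a fault configuration item by moving every
    application hosted on the failed module to the backup module."""
    if failed == backup:
        raise ValueError("the backup module cannot replace itself")
    return {app: (backup if module == failed else module) for app, module in placement.items()}


# CF1: fault-free placement of applications on functional modules.
cf1 = {"A": "FM1", "B": "FM1", "C": "FM2", "D": "FM2", "E": "FM3"}
# CF2: placement that bypasses an unrecoverable fault in FM2.
cf2 = bypass_module(cf1, failed="FM2")
print(cf2)
```

Here CF2 leaves applications A, B and E where they were and places C and D on BM, which is the configuration the system would load after restarting.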
The middleware is the key software that implements the system self-recovery function. It comprises modules such as fault management, configuration management and the system blueprint, and achieves fault self-recovery through the steps of system startup, configuration loading, fault detection, fault recording, fault filtering, policy selection, configuration update and system shutdown, as shown in Fig. 4. The steps are described below:
1] System startup: the software and hardware of each module of the system are initialized during system startup;
2] Configuration loading: the configuration management function loads the current configuration item described in the system blueprint into the computer system;
3] System operation: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, the fault management function periodically performs fault detection on each module that makes up the computer system;
5] Fault recording: after a fault is found in the computer system, a preliminary judgment is made on the fault, and its time, location, type and so on are recorded for system maintenance and fault analysis;
6] Fault filtering: the faults that have occurred are screened, occasional recoverable faults are filtered out, and the location and type of each unrecoverable fault are determined;
7] Policy selection: based on the location and type of the unrecoverable fault, a suitable handling mode or reconfiguration policy is selected according to the description in the system blueprint;
8] Configuration update: the configuration item that can bypass the fault is set as the current configuration item. This configuration item will be loaded into the system at the next startup; the update does not change the running state of the currently faulty system;
9] System shutdown: the system is shut down according to the preset instruction and waits for the next startup.
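The point that the configuration update only takes effect at the next startup can be sketched with a small persistent store. This is an illustration under stated assumptions, not the patented middleware: the JSON file, the `BLUEPRINT` table and the functions `load_current`, `update_current` and `run_once` are all hypothetical, and fault detection, recording and filtering are assumed to have already produced a single unrecoverable fault.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional

# Hypothetical blueprint: configuration item -> {fault: bypassing item}.
BLUEPRINT: Dict[str, Dict[str, str]] = {
    "CF1": {"FM2:no_response": "CF2"},
    "CF2": {},
}


def load_current(store: Path) -> str:
    """Step 2: read the name of the current configuration item."""
    return json.loads(store.read_text())["current"]


def update_current(store: Path, name: str) -> None:
    """Step 8: persist the new current item; the running system is untouched."""
    store.write_text(json.dumps({"current": name}))


def run_once(store: Path, detected_fault: Optional[str]) -> str:
    """One power cycle: load, run, and react to at most one unrecoverable fault.

    Returns the configuration item the system actually ran with.
    """
    running = load_current(store)                        # steps 1 to 3
    if detected_fault is not None:                       # steps 4 to 6 assumed done
        target = BLUEPRINT[running].get(detected_fault)  # step 7
        if target is not None:
            update_current(store, target)                # step 8
    return running                                       # step 9: shut down


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "current.json"
    update_current(store, "CF1")
    first_boot = run_once(store, "FM2:no_response")
    second_boot = run_once(store, None)
print(first_boot, second_boot)
```

The first power cycle still runs with CF1 even though the fault was handled during it; only the second cycle picks up CF2, mirroring the statement that the update does not change the running state of the faulty system.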

Claims (3)

1. A fault self-recovery system based on system reconfiguration, characterized in that it comprises:
A system blueprint software module, comprising one fault-free configuration item and multiple fault configuration items, with one of these configuration items set as the current configuration item; each configuration item fully describes the configurations needed for the computer system to run and contains fault handling policies; the fault handling policies comprise system shutdown, system self-recovery and continued operation;
A configuration management software module, which loads the current configuration item of the system blueprint into the computer system after system initialization, so that the computer system runs normally or shuts down according to the description in the current configuration item;
A fault management software module, which periodically performs fault detection on the computer system, screens the faults that occur, filters out occasional recoverable faults, determines the location and type of each unrecoverable fault, then queries the system blueprint to determine the fault handling policy and, if system self-recovery is required, updates the current configuration item to a fault configuration item that can bypass the fault; and
A hardware backup module, which replaces the corresponding hardware function module in the computer system in order to support bypassing the fault.
2. A method for achieving fault self-recovery with the fault self-recovery system based on system reconfiguration according to claim 1, characterized in that it comprises the following steps:
1] System startup: the software and hardware of each module of the system are initialized during system startup;
2] Configuration loading: the current configuration item described in the system blueprint is loaded into the computer system;
3] System operation: the computer system runs normally or shuts down according to the description in the current configuration item;
4] Fault detection: during normal operation, fault detection is performed periodically on each software and hardware module that makes up the computer system;
5] Fault recording: after a fault is found in the computer system, a preliminary judgment is made on the fault and the fault information is recorded;
6] Fault filtering: based on the recorded fault information, the faults are screened further, occasional recoverable faults are filtered out, and the location and type of each unrecoverable fault are determined;
7] Policy selection: based on the location and type of the unrecoverable fault, the fault handling policy is determined according to the description in the system blueprint; if system self-recovery is required, a configuration item that can bypass the fault is selected;
8] Configuration update: the configuration item that can bypass the fault is set as the current configuration item; this configuration item will be loaded into the system at the next startup, and the update does not change the running state of the currently faulty system;
9] System shutdown: after the configuration update is complete, the system is shut down and waits to load the updated configuration item at the next startup.
3. The method for achieving fault self-recovery according to claim 2, characterized in that the fault information recorded in step 5] comprises the time, location and type of the fault.
CN201510926572.7A 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration Active CN105550056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510926572.7A CN105550056B (en) 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510926572.7A CN105550056B (en) 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration

Publications (2)

Publication Number Publication Date
CN105550056A true CN105550056A (en) 2016-05-04
CN105550056B CN105550056B (en) 2019-08-06

Family

ID=55829253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510926572.7A Active CN105550056B (en) 2015-12-11 2015-12-11 A kind of fault self-recovery system and its implementation based on system reconfiguration

Country Status (1)

Country Link
CN (1) CN105550056B (en)

Also Published As

Publication number Publication date
CN105550056B (en) 2019-08-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant