CN103593251A - Fault-tolerant system based on process redundancy and design method thereof - Google Patents

Fault-tolerant system based on process redundancy and design method thereof

Info

Publication number
CN103593251A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310546513.8A
Other languages
Chinese (zh)
Inventor
王恩东
胡雷钧
张东
吴楠
刘璧怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201310546513.8A
Publication of CN103593251A
Legal status: Pending

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The invention relates to the technical field of fault-tolerant system design methods, and in particular to a fault-tolerant system based on process redundancy and a design method thereof. The invention provides a fault-tolerance mechanism and strategy based on process redundancy: dual-mode or multi-mode redundancy is constructed for key processes, inter-process synchronization and other means ensure that the redundant processes run according to the same execution logic, and a monitoring system handles different errors in corresponding ways, thereby improving the reliability and availability of the system.

Description

A fault-tolerant system based on process redundancy and a design method thereof
Technical field
The present invention relates to the technical field of fault-tolerant system design methods, and in particular to a fault-tolerant system based on process redundancy and a design method thereof.
Background
As computer systems are widely used in fields such as banking transaction processing, information services, and financial computing, the requirements on computer system security keep rising. Fault tolerance is an important means of improving computer system security: it means that the computer can still execute the specified algorithm correctly when a fault occurs inside the system. For applications in critical fields such as banking and telecommunications, computer systems are extremely sensitive to system failure, so guaranteeing the reliability of key system processes is essential.
Common software and hardware fault-tolerance mechanisms, such as processor lock-step, memory mirroring, multipath I/O, and N-version programming, are mainly based on the principle of static structural redundancy. However, redundancy at the hardware layer is costly and complex to implement, and redundancy at the application software layer lacks generality.
Summary of the invention
To solve the problems of the prior art, the invention provides a fault-tolerant system based on process redundancy and a design method thereof. Dual-mode or multi-mode redundancy is constructed for key processes, means such as inter-process synchronization ensure that the redundant processes run according to the same execution logic, and a monitoring system handles different errors in corresponding ways, thereby improving the reliability and availability of the system.
The technical solution adopted by the invention is as follows:
A fault-tolerant system based on process redundancy, concentrated entirely in the operating system kernel layer, comprises a fault-tolerant process management module, an error handling module, an I/O fault-tolerant control module, and a monitoring management module, wherein:
The fault-tolerant process management module manages the life cycle of redundant processes, including their creation, scheduling, synchronization, communication, and destruction, so that the primary process and its redundant processes satisfy the fault-tolerance requirements while executing the original logic in order;
The error handling module acts when synchronous comparison detects an error in the running redundant processes: it diagnoses the fault type, applies the corresponding pre-configured handling, and completes error recovery quickly;
The I/O fault-tolerant control module is responsible for format conversion and redundancy control between the internal data of the fault-tolerant system and external data, and assists the synchronous comparison of I/O operations;
The monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode. The console gives the user an intuitive operation interface through which the user can monitor the running state of redundant processes, view the fault-tolerance event log, configure the main system parameters, and so on. The console obtains all of its data by interacting with the kernel monitoring module.
The fault-tolerant process management module applies fault-tolerant control to the fork, clone, and vfork system calls. On the basis of the do_fork() function, two functions are added, do_double_fork() and do_ft_fork(), which implement control over an application deriving dual-mode processes and over dual-mode processes each deriving their own child processes.
The I/O fault-tolerant control module comprises a terminal interface, a disk interface, and a network interface.
A design method for a fault-tolerant system based on process redundancy comprises: (1) taking the multiple CPUs of an SMP architecture as redundant hardware, constructing redundant tasks that execute in parallel on independent CPU groups, detecting errors by comparing the execution data of the redundant tasks, and completing error recovery through mechanisms such as mode degradation and restart; (2) adding a fault-tolerant container to the Linux operating system, in which applications run in redundant fashion; the process fault-tolerant system manages the redundant tasks, which execute independently on their respective redundant hardware, and controls task synchronization, data comparison, error detection, and recovery.
Within the fault-tolerant container, a group of redundant tasks executes the same function code. While they run, the fault-tolerant system manages, synchronizes, and monitors them and detects errors from the data comparison results. Tasks outside the fault-tolerant container still run in ordinary single-mode fashion.
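Step (1) relies on the redundant tasks running on independent CPU groups. The following is a minimal user-space sketch in Python of how the CPUs of an SMP machine might be partitioned into disjoint groups. The function name `split_cpu_groups` and the equal-size partitioning rule are assumptions made for illustration; the text does not specify how the groups are formed.

```python
def split_cpu_groups(cpu_ids, replicas=2):
    """Partition CPU ids into `replicas` disjoint, equally sized groups."""
    if replicas < 2:
        raise ValueError("redundancy needs at least two replicas")
    cpus = sorted(set(cpu_ids))
    per_group = len(cpus) // replicas
    if per_group == 0:
        raise ValueError("not enough CPUs for the requested replicas")
    # CPUs left over when the count is not divisible stay unassigned,
    # so they remain available to ordinary single-mode tasks.
    return [cpus[i * per_group:(i + 1) * per_group] for i in range(replicas)]


# Hypothetical 8-CPU machine: one group per replica.
groups = split_cpu_groups(range(8))
primary_cpus, redundant_cpus = groups
```

On Linux, a user-space process could be pinned to one of these groups with Python's `os.sched_setaffinity(pid, group)`. How the kernel-level system described here binds tasks to CPU groups is not stated in the text.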
The beneficial effects of the technical solution provided by the invention are as follows:
The invention provides a fault-tolerant system based on process redundancy and a design method thereof. A fault-tolerance strategy and method based on process redundancy are designed, and a prototype of the process fault-tolerant system is implemented: key processes are made redundant, a synchronization mechanism guarantees their correct execution, and system monitoring performs the corresponding error handling. Experiments show that the method incurs little performance loss and effectively improves system reliability, while avoiding the complexity of customized hardware and remaining transparent to applications and users.
Brief description of the drawings
Fig. 1 is a module relationship diagram of the fault-tolerant system based on process redundancy and its design method according to the invention;
Fig. 2 is a flowchart of fault-tolerant process creation in the fault-tolerant system based on process redundancy and its design method according to the invention;
Fig. 3 is a workflow diagram of the fault-tolerant system based on process redundancy and its design method according to the invention.
Detailed description
To make the objectives, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, a fault-tolerant system based on process redundancy, concentrated entirely in the operating system kernel layer, comprises a fault-tolerant process management module, an error handling module, an I/O fault-tolerant control module, and a monitoring management module, wherein:
The fault-tolerant process management module manages the life cycle of redundant processes, including their creation, scheduling, synchronization, communication, and destruction, so that the primary process and its redundant processes satisfy the fault-tolerance requirements while executing the original logic in order;
The error handling module acts when synchronous comparison detects an error in the running redundant processes: it diagnoses the fault type, applies the corresponding pre-configured handling, and completes error recovery quickly;
The I/O fault-tolerant control module is responsible for format conversion and redundancy control between the internal data of the fault-tolerant system and external data, and assists the synchronous comparison of I/O operations;
The monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode. The console gives the user an intuitive operation interface through which the user can monitor the running state of redundant processes, view the fault-tolerance event log, configure the main system parameters, and so on. The console obtains all of its data by interacting with the kernel monitoring module.
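The error handling module's "diagnose, then apply the pre-configured handling" behavior can be sketched as a lookup from fault type to recovery action. This Python sketch is illustrative only. The two fault types (synchronization timeout and data mismatch) and the two recovery mechanisms (mode degradation and restart) are taken from the description, but the names `Fault`, `Action`, `DEFAULT_POLICY`, `diagnose`, and `handle_error`, and the particular pairing of faults with actions, are assumptions.

```python
from enum import Enum


class Fault(Enum):
    SYNC_TIMEOUT = "sync_timeout"    # a replica waited too long at a sync point
    DATA_MISMATCH = "data_mismatch"  # replicas produced different output data


class Action(Enum):
    DEGRADE = "degrade"  # assumed meaning: fall back to fewer replicas
    RESTART = "restart"  # restart the redundant task


# Hypothetical pre-configured policy; this pairing is an assumption.
DEFAULT_POLICY = {
    Fault.SYNC_TIMEOUT: Action.DEGRADE,
    Fault.DATA_MISMATCH: Action.RESTART,
}


def diagnose(timed_out, data_equal):
    """Classify the fault, or return None when the replicas agree in time."""
    if timed_out:
        return Fault.SYNC_TIMEOUT
    if not data_equal:
        return Fault.DATA_MISMATCH
    return None


def handle_error(timed_out, data_equal, policy=DEFAULT_POLICY):
    """Return the configured recovery action, or None if no fault occurred."""
    fault = diagnose(timed_out, data_equal)
    return None if fault is None else policy[fault]
```

Keeping the policy in a table mirrors the idea that the handling is configured in advance rather than hard-coded, so a console could change it without touching the diagnosis logic.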
With reference to Fig. 2, the fault-tolerant process management module applies fault-tolerant control to the fork, clone, and vfork system calls. On the basis of the do_fork() function, two functions are added, do_double_fork() and do_ft_fork(), which implement control over an application deriving dual-mode processes and over dual-mode processes each deriving their own child processes. When a process in the fault-tolerant system derives a child process, it invokes the fork, clone, or vfork system call. The system first checks the ft_mak flag to determine whether the current process is a redundant process. If so, it calls do_ft_fork() to create the child process of the redundant process. If not, it checks whether the fault-tolerance flag ft_exec is set; if it is, fault tolerance is required, and do_double_fork() is called to derive the redundant processes. If fault tolerance is not required, the system's original do_fork() is called and the process is derived normally. This control covers the needs of process derivation in each of these cases.
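The three-way decision made on fork, clone, or vfork can be sketched as follows. This is a user-space Python model of the branching only, not kernel code: the flag names ft_mak and ft_exec and the function names come from the description, while the `Task` class and `select_fork_path` are hypothetical names introduced for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Task:
    ft_mak: bool = False   # set when the process is already a redundant process
    ft_exec: bool = False  # set when fault tolerance was requested for it


def select_fork_path(task):
    """Return the name of the fork routine the description selects."""
    if task.ft_mak:
        # A redundant process derives its own child process.
        return "do_ft_fork"
    if task.ft_exec:
        # Fault tolerance requested: derive the redundant processes.
        return "do_double_fork"
    # Ordinary process: unchanged kernel path.
    return "do_fork"
```

Note that ft_mak is tested first, so a process carrying both flags takes the do_ft_fork() path, matching the order given in the text.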
The I/O fault-tolerant control module comprises a terminal interface, a disk interface, and a network interface. According to the kind of I/O, the module may also be structured as two main submodules: disk/terminal read-write I/O control and network data read-write I/O control.
With reference to Fig. 3, the process of implementing this architecture is described below with a concrete example.
The user starts an application through the fault-tolerant console interface and turns on the fault-tolerance switch (that is, sets the flag ft_exec). The process fault-tolerant system creates a pair of processes, a primary process and a redundant process, for the application. Both processes load the same application code and start executing in parallel on independent CPU groups.
Under the control of the synchronization protocol, the primary and redundant processes are expected to arrive at a synchronization point at the same time and begin synchronizing. If one of them times out while waiting at the synchronization point, an error is triggered and the flow enters fault detection and error handling. If the operation at the synchronization point is an I/O operation, I/O conversion is required, and the system determines whether the operation is a write. For a write, the data written out by the primary and redundant processes are compared: identical data means that execution is correct, while differing data is treated as a fault, and the flow enters fault detection and error handling. For a read, the I/O fault-tolerant control module completes the data conversion at the I/O interface and performs the actual operation.
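The checks performed at one synchronization point can be summarized in a small decision function. This Python sketch models only the branching described above. The function name `check_sync_point`, its parameters, and the returned strings are assumptions for illustration, and real data comparison in the kernel would operate on I/O buffers rather than Python values.

```python
def check_sync_point(op, primary_data, redundant_data, waited_s, timeout_s):
    """Return the outcome of one synchronization point.

    op is "write", "read", or anything else for a non-I/O operation.
    waited_s is how long the slower replica took to arrive.
    """
    if waited_s > timeout_s:
        # One replica never arrived in time: go to fault detection.
        return "fault: sync timeout"
    if op == "write":
        # Output data of both replicas must match before it leaves the system.
        if primary_data == redundant_data:
            return "ok: write data agree"
        return "fault: data mismatch"
    if op == "read":
        # The I/O control module converts the data and performs the read.
        return "ok: convert and perform read"
    return "ok: non-I/O synchronization"
```

For example, `check_sync_point("write", b"abc", b"abd", 0.01, 1.0)` reports a data mismatch, while the same call with equal buffers reports agreement.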
If the operation at the synchronization point is not an I/O operation, state consistency adjustment is performed directly at the synchronization point, for example by unifying the function return value and recording the current process state information, which completes one synchronization.
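A minimal sketch of this state consistency adjustment is given below. It assumes that "unifying the function return value" means both processes continue with the primary's value, which the text does not state explicitly. The `Replica` class, the `adjust_state` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    return_value: int
    syncs_done: int = 0


def adjust_state(primary, redundant, log):
    """Make both replicas continue from the same observable state."""
    # Assumption: the primary's return value is the one both replicas keep.
    redundant.return_value = primary.return_value
    for replica in (primary, redundant):
        replica.syncs_done += 1
    # Record the state reached at this synchronization point.
    log.append((primary.syncs_done, primary.return_value))


log = []
p = Replica("primary", return_value=4121)
r = Replica("redundant", return_value=4122)  # e.g. a call that returns a per-process value
adjust_state(p, r, log)
```

Without such an adjustment, a call whose result legitimately differs between the two processes would make their later execution diverge even though no fault occurred.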
The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (5)

1. A fault-tolerant system based on process redundancy, concentrated entirely in the operating system kernel layer, comprising a fault-tolerant process management module, an error handling module, an I/O fault-tolerant control module, and a monitoring management module, wherein:
The fault-tolerant process management module manages the life cycle of redundant processes, including their creation, scheduling, synchronization, communication, and destruction, so that the primary process and its redundant processes satisfy the fault-tolerance requirements while executing the original logic in order;
The error handling module acts when synchronous comparison detects an error in the running redundant processes: it diagnoses the fault type, applies the corresponding pre-configured handling, and completes error recovery quickly;
The I/O fault-tolerant control module is responsible for format conversion and redundancy control between the internal data of the fault-tolerant system and external data, and assists the synchronous comparison of I/O operations;
The monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode; the console gives the user an intuitive operation interface through which the user can monitor the running state of redundant processes, view the fault-tolerance event log, configure the main system parameters, and so on; the console obtains all of its data by interacting with the kernel monitoring module.
2. The fault-tolerant system based on process redundancy according to claim 1, characterized in that the fault-tolerant process management module applies fault-tolerant control to the fork, clone, and vfork system calls, and that, on the basis of the do_fork() function, two functions are added, do_double_fork() and do_ft_fork(), which implement control over an application deriving dual-mode processes and over dual-mode processes each deriving their own child processes.
3. The fault-tolerant system based on process redundancy according to claim 1, characterized in that the I/O fault-tolerant control module comprises a terminal interface, a disk interface, and a network interface.
4. A design method for a fault-tolerant system based on process redundancy, comprising: (1) taking the multiple CPUs of an SMP architecture as redundant hardware, constructing redundant tasks that execute in parallel on independent CPU groups, detecting errors by comparing the execution data of the redundant tasks, and completing error recovery through mechanisms such as mode degradation and restart; (2) adding a fault-tolerant container to the Linux operating system, in which applications run in redundant fashion, the process fault-tolerant system managing the redundant tasks, which execute independently on their respective redundant hardware, and controlling task synchronization, data comparison, error detection, and recovery.
5. The design method for a fault-tolerant system based on process redundancy according to claim 4, characterized in that, within the fault-tolerant container, a group of redundant tasks executes the same function code; while they run, the fault-tolerant system manages, synchronizes, and monitors them and detects errors from the data comparison results; and tasks outside the fault-tolerant container still run in ordinary single-mode fashion.
CN201310546513.8A 2013-11-07 2013-11-07 Fault-tolerant system based on process redundancy and design method thereof Pending CN103593251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310546513.8A CN103593251A (en) 2013-11-07 2013-11-07 Fault-tolerant system based on process redundancy and design method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310546513.8A CN103593251A (en) 2013-11-07 2013-11-07 Fault-tolerant system based on process redundancy and design method thereof

Publications (1)

Publication Number Publication Date
CN103593251A true CN103593251A (en) 2014-02-19

Family

ID=50083406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310546513.8A Pending CN103593251A (en) 2013-11-07 2013-11-07 Fault-tolerant system based on process redundancy and design method thereof

Country Status (1)

Country Link
CN (1) CN103593251A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060150004A1 (en) * 2004-12-21 2006-07-06 Nec Corporation Fault tolerant system and controller, operation method, and operation program used in the fault tolerant system
CN102364448A (en) * 2011-09-19 2012-02-29 浪潮电子信息产业股份有限公司 Fault-tolerant method for computer fault management system
CN103455393A (en) * 2013-09-25 2013-12-18 浪潮电子信息产业股份有限公司 Fault tolerant system design method based on process redundancy

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502089A (en) * 2016-12-27 2017-03-15 河南森源重工有限公司 A kind of redundancy control method of compression type garbage truck loading process
CN109634769A (en) * 2018-12-13 2019-04-16 郑州云海信息技术有限公司 Fault-tolerance processing method, device, equipment and storage medium in a kind of storage of data
CN109634769B (en) * 2018-12-13 2021-11-09 郑州云海信息技术有限公司 Fault-tolerant processing method, device, equipment and storage medium in data storage
CN111143125A (en) * 2019-12-20 2020-05-12 浪潮电子信息产业股份有限公司 MCE error processing method and device, electronic equipment and storage medium
CN111143125B (en) * 2019-12-20 2022-04-22 浪潮电子信息产业股份有限公司 MCE error processing method and device, electronic equipment and storage medium
WO2023082819A1 (en) * 2021-11-10 2023-05-19 武汉路特斯汽车有限公司 Data processing method and apparatus, device, and storage medium
CN115981879A (en) * 2023-03-16 2023-04-18 北京全路通信信号研究设计院集团有限公司 Data synchronization method, device, equipment and storage medium of redundant structure
CN115981879B (en) * 2023-03-16 2023-05-23 北京全路通信信号研究设计院集团有限公司 Data synchronization method, device and equipment of redundant structure and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140219