CN103593251A - Fault-tolerant system based on process redundancy and design method thereof - Google Patents
- Publication number: CN103593251A
- Application number: CN201310546513.8A
- Authority
- CN
- China
- Prior art keywords
- tolerant
- fault
- redundancy
- module
- redundant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Hardware Redundancy (AREA)
Abstract
The invention relates to the design of fault-tolerant systems, and in particular to a fault-tolerant system based on process redundancy and a design method thereof. It provides a fault-tolerance mechanism and strategy based on process redundancy: dual-mode or multi-mode redundancy is constructed for critical processes, inter-process synchronization and related means ensure that the redundant processes follow the same execution logic, and a monitoring system applies the appropriate error handling to each kind of error, thereby improving the reliability and availability of the system.
Description
Technical field
The present invention relates to the technical field of fault-tolerant system design methods, and in particular to a fault-tolerant system based on process redundancy and a design method thereof.
Background
As computer systems come into wide use in fields such as banking transaction processing, information services, and financial computing, the demands on their safety keep rising. Fault tolerance is an important means of improving the safety of a computer system: it means that the computer still executes its specified algorithm correctly when a fault occurs inside the system. Applications in critical domains such as banking and telecommunications are extremely sensitive to system failure, so ensuring the reliability of critical system processes is essential.
Common hardware and software fault-tolerance mechanisms, such as processor lock-step, memory mirroring, multipath I/O, and N-version programming, rely mainly on the principle of static structural redundancy. However, redundancy at the hardware layer is costly and complex to implement, while redundancy at the application-software layer is not general-purpose.
Summary of the invention
To solve the problems of the prior art, the invention provides a fault-tolerant system based on process redundancy and a design method thereof. It constructs dual-mode or multi-mode redundancy for critical processes, uses inter-process synchronization and related means to ensure that the redundant processes follow the same execution logic, and has the monitoring system apply the appropriate error handling to each kind of error, thereby improving the reliability and availability of the system.
The technical solution adopted by the invention is as follows:
A fault-tolerant system based on process redundancy is concentrated in the operating system kernel layer and comprises a fault-tolerant process management module, an error handling module, an I/O fault-tolerant control module, and a monitoring management module, wherein:
The fault-tolerant process management module manages the life cycle of redundant processes, including their creation, scheduling, synchronization, communication, and destruction, so that the primary process and its redundant processes execute the original logic in order while meeting the needs of fault tolerance;
The error handling module acts when the synchronized comparison of the redundant processes detects an error at run time: it diagnoses the fault type and applies the handling specified by the pre-configured policy, so that error recovery completes quickly;
The I/O fault-tolerant control module is responsible for format conversion and redundancy control between the internal data and the external data of the fault-tolerant system, and assists the synchronized comparison of I/O operations;
The monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode. The console gives the user an intuitive interface for monitoring the running state of redundant processes, viewing the fault-tolerance event log, and configuring the main parameters of the system. All console data is obtained by interacting with the kernel monitoring module.
The fault-tolerant process management module applies fault-tolerant control to the fork, clone, and vfork system calls. Building on the do_fork() function, it adds two functions, do_double_fork() and do_ft_fork(), which let an application derive a dual-mode process pair and let each process of a dual-mode pair derive its own child process.
The I/O fault-tolerant control module comprises a terminal interface, a disk interface, and a network interface.
A design method for a fault-tolerant system based on process redundancy comprises: (1) using the multiple CPUs of an SMP architecture as redundant hardware, constructing redundant tasks that execute in parallel on independent CPU groups, detecting errors by comparing the execution data of the redundant tasks, and recovering from errors through mechanisms such as degradation to a lower redundancy mode and restart; (2) adding a fault-tolerant container to the Linux operating system, in which applications run redundantly; the process fault-tolerant system manages the redundant tasks, has each of them execute independently on the redundant hardware, and controls task synchronization, data comparison, error detection, and recovery.
Within the fault-tolerant container, a group of redundant tasks executes the same function code. While they run, the fault-tolerant system manages, synchronizes, and monitors them and detects errors from the data comparison results. Tasks outside the fault-tolerant container continue to run in ordinary single-mode fashion.
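The idea of running the same task redundantly and detecting errors by comparing outputs can be pictured with a small user-space sketch. This is only an illustration under stated assumptions: the names `run_redundant` and `outputs_agree` are hypothetical, and worker threads stand in for the kernel-level redundant processes bound to independent CPU groups that the method actually describes.

```python
from concurrent.futures import ThreadPoolExecutor


def outputs_agree(results):
    """Return True when every redundant copy produced the same output."""
    return all(r == results[0] for r in results[1:])


def run_redundant(task, args=(), copies=2):
    """Run `task` in `copies` parallel workers and compare their outputs.

    Hypothetical helper: threads are a simplification of the redundant
    processes on independent CPU groups described in the text.
    """
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(task, *args) for _ in range(copies)]
        results = [f.result() for f in futures]
    # A disagreement is what would trigger error handling (degrade or restart).
    return outputs_agree(results), results


if __name__ == "__main__":
    agreed, results = run_redundant(pow, (2, 10))
    print(agreed, results)
```

With two copies a mismatch only reveals that a fault occurred, not which copy is wrong, which is presumably why the method pairs detection with recovery mechanisms such as degradation and restart.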
The beneficial effects of the technical solution provided by the invention are as follows:
The invention provides a fault-tolerant system based on process redundancy and a design method thereof. It designs a fault-tolerance strategy and method based on process redundancy and implements a prototype of the process fault-tolerant system, which runs critical processes redundantly, uses a synchronization mechanism to ensure that they execute correctly, and applies the appropriate error handling under system monitoring. Experiments show that the method costs little in performance and effectively improves system reliability, while avoiding the complexity of custom hardware and remaining transparent to applications and users.
Brief description of the drawings
Fig. 1 is a diagram of the relationships among the system modules of the fault-tolerant system based on process redundancy and its design method;
Fig. 2 is a flowchart of fault-tolerant process creation in the fault-tolerant system based on process redundancy and its design method;
Fig. 3 is a workflow diagram of the fault-tolerant system based on process redundancy and its design method.
Detailed description
To make the objects, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, a fault-tolerant system based on process redundancy is concentrated in the operating system kernel layer and comprises a fault-tolerant process management module, an error handling module, an I/O fault-tolerant control module, and a monitoring management module, wherein:
The fault-tolerant process management module manages the life cycle of redundant processes, including their creation, scheduling, synchronization, communication, and destruction, so that the primary process and its redundant processes execute the original logic in order while meeting the needs of fault tolerance;
The error handling module acts when the synchronized comparison of the redundant processes detects an error at run time: it diagnoses the fault type and applies the handling specified by the pre-configured policy, so that error recovery completes quickly;
The I/O fault-tolerant control module is responsible for format conversion and redundancy control between the internal data and the external data of the fault-tolerant system, and assists the synchronized comparison of I/O operations;
The monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode. The console gives the user an intuitive interface for monitoring the running state of redundant processes, viewing the fault-tolerance event log, and configuring the main parameters of the system. All console data is obtained by interacting with the kernel monitoring module.
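The error handling module's pre-configured mapping from diagnosed fault type to handling can be pictured as a lookup table. The sketch below is an assumption throughout: the fault-type names, the particular pairing of faults with actions, and the function `choose_recovery` are hypothetical. The text itself only says that handling follows a pre-configured policy and names degradation and restart as recovery mechanisms.

```python
# Hypothetical policy table: fault type -> recovery action.
# The pairing shown here is an illustrative assumption, not taken from the patent.
DEFAULT_POLICY = {
    "sync_timeout": "restart",    # a process failed to reach a sync point in time
    "write_mismatch": "degrade",  # redundant copies wrote different data
}


def choose_recovery(fault_type, policy=DEFAULT_POLICY):
    """Look up the configured recovery action for a diagnosed fault type."""
    try:
        return policy[fault_type]
    except KeyError:
        raise ValueError(f"no recovery action configured for {fault_type!r}")


if __name__ == "__main__":
    print(choose_recovery("sync_timeout"))
```

Keeping the mapping in data rather than in code reflects the "pre-configured" wording: an operator could change the action for a fault type without touching the handler.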
With reference to Fig. 2, the fault-tolerant process management module applies fault-tolerant control to the fork, clone, and vfork system calls. Building on the do_fork() function, it adds two functions, do_double_fork() and do_ft_fork(), which let an application derive a dual-mode process pair and let each process of a dual-mode pair derive its own child process. In the fault-tolerant system, a process derives a child by invoking the fork, clone, or vfork system call. The system first checks the ft_mak flag to determine whether the current process is a redundant process; if so, it calls do_ft_fork() to create the child of the redundant process. If not, it checks whether the fault-tolerance flag ft_exec is set; if it is, fault tolerance is required and do_double_fork() is called to derive the redundant processes. If fault tolerance is not required, the system's original do_fork() is called and the process is derived normally. This control covers process derivation in each of these cases.
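The three-way decision described above can be summarized in a short sketch. It is a user-space illustration only: the `Task` class and `select_fork_path` are hypothetical names, and the function merely returns the name of the kernel routine that the text says would be called; it does not create any process.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for the per-process flags the kernel would check."""
    ft_mak: bool = False   # set when the process is already a redundant process
    ft_exec: bool = False  # set when fault tolerance was requested for the process


def select_fork_path(task: Task) -> str:
    """Return the name of the creation routine chosen for this process."""
    if task.ft_mak:
        # Already redundant: each copy derives its own child.
        return "do_ft_fork"
    if task.ft_exec:
        # Fault tolerance requested: derive the redundant process pair.
        return "do_double_fork"
    # Ordinary process: unmodified creation path.
    return "do_fork"


if __name__ == "__main__":
    print(select_fork_path(Task(ft_exec=True)))
```

Note that the ft_mak check comes first, so a process that is already redundant takes the do_ft_fork() path regardless of ft_exec, matching the order of the checks in the text.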
The I/O fault-tolerant control module comprises a terminal interface, a disk interface, and a network interface. According to the kinds of I/O involved, the module may also be structured mainly as two sub-modules: disk/terminal read-write I/O control and network data read-write I/O control.
With reference to Fig. 3, a concrete example illustrates how this architecture is realized.
The user starts an application through the fault-tolerant console and turns on the fault-tolerance switch, which sets the flag ft_exec. The process fault-tolerant system then creates a pair of processes (primary and redundant) for the application. Both processes load the same application code and begin executing in parallel on independent CPU groups.
Under the control of the synchronization protocol, the primary and redundant processes reach each synchronization point together and begin synchronizing. If either process times out while waiting at a synchronization point, an error is raised and the fault detection and error handling flow is entered. If the operation at the synchronization point is an I/O operation, I/O conversion is required, and the system determines whether the operation is a write. For a write, the data written by the primary and redundant processes are compared: identical data means execution is considered correct, while differing data is treated as a fault and the fault detection and error handling flow is entered. For a read, the I/O fault-tolerant control module performs the data conversion at the I/O interface and carries out the actual operation.
If the operation at the synchronization point is not an I/O operation, state consistency is adjusted directly at the synchronization point, for example by unifying the function return value and recording the current process state, which completes one synchronization.
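The decisions taken at a synchronization point can be condensed into one function. The sketch below is a simplified model under assumptions: `handle_sync_point`, the `Outcome` enum, and the string labels are hypothetical, and the timeout is passed in as a boolean rather than measured.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    FAULT = "fault"


def handle_sync_point(arrived_in_time, op_kind, primary_data=None, redundant_data=None):
    """Decide what happens at one synchronization point.

    arrived_in_time: True if both processes reached the point before the timeout.
    op_kind: "write", "read", or "other" (a non-I/O operation).
    Returns (Outcome, label); the labels are illustrative only.
    """
    if not arrived_in_time:
        # A wait timeout enters fault detection and error handling.
        return Outcome.FAULT, "sync_timeout"
    if op_kind == "write":
        # Writes are compared before the data leaves the system.
        if primary_data != redundant_data:
            return Outcome.FAULT, "write_mismatch"
        return Outcome.OK, "perform_io"
    if op_kind == "read":
        # Reads go through I/O conversion, then the actual operation.
        return Outcome.OK, "perform_io"
    # Non-I/O operation: align state (return value, process status) directly.
    return Outcome.OK, "align_state"


if __name__ == "__main__":
    print(handle_sync_point(True, "write", b"abc", b"abd"))
```

Only writes are compared in this model, following the text: output is the point at which a divergence between the two processes becomes externally visible.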
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (5)
1. A fault-tolerant system based on process redundancy, concentrated in the operating system kernel layer and comprising a fault-tolerant process management module, an error handling module, an I/O fault-tolerant control module, and a monitoring management module, wherein:
the fault-tolerant process management module manages the life cycle of redundant processes, including their creation, scheduling, synchronization, communication, and destruction, so that the primary process and its redundant processes execute the original logic in order while meeting the needs of fault tolerance;
the error handling module, when the synchronized comparison of the redundant processes detects an error at run time, diagnoses the fault type and applies the handling specified by the pre-configured policy, so that error recovery completes quickly;
the I/O fault-tolerant control module is responsible for format conversion and redundancy control between the internal data and the external data of the fault-tolerant system, and assists the synchronized comparison of I/O operations;
the monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode; the console gives the user an intuitive interface for monitoring the running state of redundant processes, viewing the fault-tolerance event log, and configuring the main parameters of the system; all console data is obtained by interacting with the kernel monitoring module.
2. The fault-tolerant system based on process redundancy according to claim 1, characterized in that the fault-tolerant process management module applies fault-tolerant control to the fork, clone, and vfork system calls and, building on the do_fork() function, adds two functions, do_double_fork() and do_ft_fork(), which let an application derive a dual-mode process pair and let each process of a dual-mode pair derive its own child process.
3. The fault-tolerant system based on process redundancy according to claim 1, characterized in that the I/O fault-tolerant control module comprises a terminal interface, a disk interface, and a network interface.
4. A design method for a fault-tolerant system based on process redundancy, comprising: (1) using the multiple CPUs of an SMP architecture as redundant hardware, constructing redundant tasks that execute in parallel on independent CPU groups, detecting errors by comparing the execution data of the redundant tasks, and recovering from errors through mechanisms such as degradation to a lower redundancy mode and restart; (2) adding a fault-tolerant container to the Linux operating system, in which applications run redundantly, the process fault-tolerant system managing the redundant tasks, having each of them execute independently on the redundant hardware, and controlling task synchronization, data comparison, error detection, and recovery.
5. The design method for a fault-tolerant system based on process redundancy according to claim 4, characterized in that, within the fault-tolerant container, a group of redundant tasks executes the same function code; while they run, the fault-tolerant system manages, synchronizes, and monitors them and detects errors from the data comparison results; tasks outside the fault-tolerant container continue to run in ordinary single-mode fashion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310546513.8A CN103593251A (en) | 2013-11-07 | 2013-11-07 | Fault-tolerant system based on process redundancy and design method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103593251A (en) | 2014-02-19 |
Family ID: 50083406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310546513.8A Pending CN103593251A (en) | 2013-11-07 | 2013-11-07 | Fault-tolerant system based on process redundancy and design method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103593251A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060150004A1 (en) * | 2004-12-21 | 2006-07-06 | Nec Corporation | Fault tolerant system and controller, operation method, and operation program used in the fault tolerant system |
CN102364448A (en) * | 2011-09-19 | 2012-02-29 | 浪潮电子信息产业股份有限公司 | Fault-tolerant method for computer fault management system |
CN103455393A (en) * | 2013-09-25 | 2013-12-18 | 浪潮电子信息产业股份有限公司 | Fault tolerant system design method based on process redundancy |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502089A (en) * | 2016-12-27 | 2017-03-15 | 河南森源重工有限公司 | A kind of redundancy control method of compression type garbage truck loading process |
CN109634769A (en) * | 2018-12-13 | 2019-04-16 | 郑州云海信息技术有限公司 | Fault-tolerance processing method, device, equipment and storage medium in a kind of storage of data |
CN109634769B (en) * | 2018-12-13 | 2021-11-09 | 郑州云海信息技术有限公司 | Fault-tolerant processing method, device, equipment and storage medium in data storage |
CN111143125A (en) * | 2019-12-20 | 2020-05-12 | 浪潮电子信息产业股份有限公司 | MCE error processing method and device, electronic equipment and storage medium |
CN111143125B (en) * | 2019-12-20 | 2022-04-22 | 浪潮电子信息产业股份有限公司 | MCE error processing method and device, electronic equipment and storage medium |
WO2023082819A1 (en) * | 2021-11-10 | 2023-05-19 | 武汉路特斯汽车有限公司 | Data processing method and apparatus, device, and storage medium |
CN115981879A (en) * | 2023-03-16 | 2023-04-18 | 北京全路通信信号研究设计院集团有限公司 | Data synchronization method, device, equipment and storage medium of redundant structure |
CN115981879B (en) * | 2023-03-16 | 2023-05-23 | 北京全路通信信号研究设计院集团有限公司 | Data synchronization method, device and equipment of redundant structure and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140219 |