CN103593251A - Fault-tolerant system based on process redundancy and design method thereof

Info

Publication number: CN103593251A
Application number: CN201310546513.8A
Authority: CN
Inventors: 王恩东; 胡雷钧; 张东; 吴楠; 刘璧怡
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2013-11-07
Filing date: 2013-11-07
Publication date: 2014-02-19

Abstract

The invention relates to the technical field of fault-tolerant system design, and in particular to a fault-tolerant system based on process redundancy and a design method thereof. The invention provides a fault-tolerance mechanism and strategy based on process redundancy: dual-mode or multi-mode redundancy is constructed for critical processes, inter-process synchronization and similar means ensure that the redundant processes run according to the same execution logic, and a monitoring system applies the corresponding error handling to each kind of error, thereby improving the reliability and availability of the system.

Description

A fault-tolerant system based on process redundancy and design method thereof

Technical field

The present invention relates to the technical field of fault-tolerant system design methods, and in particular to a fault-tolerant system based on process redundancy and a design method thereof.

Background technology

With the widespread use of computer systems in fields such as banking transaction processing, information services and financial computing, the requirements on computer system security keep rising. Fault tolerance is an important means of improving computer system security: it means that the computer can still correctly execute the specified algorithm when a fault occurs inside the system. Applications in critical fields such as banking and telecommunications are extremely sensitive to system failure, so guaranteeing the reliability of the system's critical processes is essential.

Common software and hardware fault-tolerance mechanisms, such as processor lock-step, memory mirroring, multipath I/O and N-version programming, are mainly based on the principle of static structural redundancy. However, redundancy at the hardware layer is costly and complex to implement, and redundancy at the application software layer lacks generality.

Summary of the invention

To solve the problems of the prior art, the invention provides a fault-tolerant system based on process redundancy and a design method thereof. Dual-mode or multi-mode redundancy is constructed for critical processes, inter-process synchronization and similar means ensure that the redundant processes run according to the same execution logic, and the monitoring system applies the corresponding error handling to each kind of error, thereby improving the reliability and availability of the system.

The technical solution adopted in the present invention is as follows:

A fault-tolerant system based on process redundancy is concentrated in the operating system kernel layer and comprises a fault-tolerant process management module, an error processing module, an I/O fault-tolerant control module and a monitoring management module, wherein:

The fault-tolerant process management module implements life-cycle management of the redundant processes, including their creation, scheduling, synchronization, communication and destruction, so that the primary process and its redundant processes satisfy the needs of fault tolerance while executing their original logic in order;

The error processing module acts when the synchronous comparison detects an error during the execution of the redundant processes: it diagnoses the fault type and takes the corresponding pre-configured action, completing error recovery quickly;

The I/O fault-tolerant control module is responsible for format conversion and redundancy control between data internal to the fault-tolerant system and external data, and assists the synchronous comparison of I/O operations;

The monitoring management module comprises a console running in user mode and a monitoring management module running in kernel mode. The console offers the user an intuitive operating interface, through which the user monitors the running state of the redundant processes, views the fault-tolerance event log, configures the main system parameters, and so on. All console data is obtained by interacting with the kernel monitoring module.
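The pre-configured handling performed by the error processing module can be pictured as a lookup from diagnosed fault type to recovery action. The following is a minimal user-space sketch in Python, not the kernel implementation. The names `Fault`, `Action`, `POLICY` and `handle_fault` are hypothetical, and the particular mapping (timeout leads to mode degradation, data mismatch leads to restart) is an assumption chosen for illustration; the text only states that handling follows a pre-configuration and that mode degradation and restart are available recovery mechanisms.

```python
from enum import Enum


class Fault(Enum):
    SYNC_TIMEOUT = "sync_timeout"    # a replica did not reach the sync point in time
    DATA_MISMATCH = "data_mismatch"  # replicas produced different output data


class Action(Enum):
    DEGRADE = "degrade"  # mode degradation: continue with the surviving replica
    RESTART = "restart"  # restart the redundant task group


# Hypothetical pre-configured policy (assumption, not taken from the patent).
POLICY = {
    Fault.SYNC_TIMEOUT: Action.DEGRADE,
    Fault.DATA_MISMATCH: Action.RESTART,
}


def handle_fault(fault, policy=POLICY, log=None):
    """Look up the configured action for a diagnosed fault and log the event."""
    action = policy[fault]
    if log is not None:
        # Stands in for the fault-tolerance event log shown on the console.
        log.append((fault.value, action.value))
    return action
```

Keeping the policy in a table rather than in code mirrors the idea that the user can reconfigure the handling through the console without touching the detection logic.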

The fault-tolerant process management module applies fault-tolerant control to the fork, clone and vfork system calls. On the basis of the do_fork() function, two functions, do_double_fork() and do_ft_fork(), are added; they respectively allow an application program to derive a dual-mode process pair and allow the dual-mode processes to derive their own child processes.

The I/O fault-tolerant control module comprises a terminal interface, a disk interface and a network interface.

A design method for a fault-tolerant system based on process redundancy comprises: (1) taking the multiple CPUs of an SMP architecture as redundant hardware, constructing redundant tasks that execute in parallel in independent CPU groups, detecting errors by comparing the execution data of the redundant tasks, and completing error recovery through mechanisms such as mode degradation and restart; (2) adding a fault-tolerant container to the Linux operating system, in which applications run in redundant fashion, the process fault-tolerant system being responsible for managing the redundant tasks, which execute independently on their respective redundant hardware, and for controlling task synchronization, data comparison, error detection and recovery.

Within the fault-tolerant container, a group of redundant tasks executes the same function code. While they run, the fault-tolerant system manages, synchronizes and monitors them and performs error detection according to the data comparison results. Tasks outside the fault-tolerant container still run in the ordinary single-mode manner.
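The container behaviour described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the patented implementation: the class `FaultTolerantContainer` and its methods are hypothetical names, and the redundant copies are run one after another in a single interpreter rather than in parallel on independent CPU groups, so only the "same code, compare results, single mode outside the container" logic is shown.

```python
class FaultTolerantContainer:
    """Toy model: tasks registered in the container run redundantly."""

    def __init__(self, copies=2):
        self.copies = copies  # 2 models dual-mode redundancy
        self.members = set()  # names of tasks placed in the container

    def add(self, task_name):
        self.members.add(task_name)

    def run(self, task_name, func, *args):
        # Tasks outside the container run once, in ordinary single mode.
        if task_name not in self.members:
            return func(*args), "single"
        # Tasks inside the container execute the same function code
        # once per redundant copy (sequentially here, for simplicity).
        results = [func(*args) for _ in range(self.copies)]
        # Error detection by comparing the execution data of the copies.
        if all(r == results[0] for r in results):
            return results[0], "redundant-ok"
        return None, "error-detected"
```

For example, a deterministic task such as `lambda x: x * 2` registered in the container passes the comparison, while a task whose copies return different values is flagged and would be handed to error recovery.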

The beneficial effects of the technical solution provided by the invention are as follows:

The invention provides a fault-tolerant system based on process redundancy and a design method thereof. A fault-tolerance strategy and method based on process redundancy have been designed, and a prototype of the process fault-tolerant system has been implemented: critical processes are made redundant, a synchronization mechanism guarantees their correct execution, and the monitoring system applies the corresponding error handling. Experiments show that the performance loss of the method is small and that it effectively improves system reliability, while avoiding the complexity of custom hardware and remaining transparent to application programs and users.

Brief description of the drawings

Fig. 1 is a diagram of the relationships among the system modules of the fault-tolerant system based on process redundancy and the design method thereof according to the invention;

Fig. 2 is a flow chart of fault-tolerant process creation in the fault-tolerant system based on process redundancy and the design method thereof according to the invention;

Fig. 3 is a workflow diagram of the fault-tolerant system based on process redundancy and the design method thereof according to the invention.

Embodiment

To make the objects, technical solutions and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, a fault-tolerant system based on process redundancy is concentrated in the operating system kernel layer and comprises a fault-tolerant process management module, an error processing module, an I/O fault-tolerant control module and a monitoring management module, wherein:

With reference to Fig. 2, the fault-tolerant process management module applies fault-tolerant control to the fork, clone and vfork system calls. On the basis of the do_fork() function, two functions, do_double_fork() and do_ft_fork(), are added; they respectively allow an application program to derive a dual-mode process pair and allow the dual-mode processes to derive their own child processes. In the fault-tolerant system, a process that is to derive a child process invokes the fork, clone or vfork system call. The system first checks the ft_mak flag to determine whether the current process is a redundant process; if so, it calls do_ft_fork() to create the child process of the redundant process. If not, it checks whether the fault-tolerance flag ft_exec is set; if it is, fault tolerance is required and do_double_fork() is called to derive the redundant processes. If fault tolerance is not required, the system's original do_fork() function is called and the process is derived normally. This control covers the process-derivation needs of the different situations.
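The decision sequence described above can be summarized in a short Python model. It is only a user-space sketch of the control flow: the real functions live in the kernel and are written in C. The flag names `ft_mak` and `ft_exec` and the function names come from the text; the `Task` class, the stub bodies and their return values are hypothetical placeholders added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    ft_mak: bool = False   # set when the process is already a redundant process
    ft_exec: bool = False  # set when fault tolerance was requested at start-up


# Stubs standing in for the kernel functions; return values are placeholders.
def do_fork(parent):
    return ["child"]


def do_double_fork(parent):
    return ["primary", "redundant"]


def do_ft_fork(parent):
    return ["child-of-redundant-process"]


def fork_dispatch(current):
    """Choose the fork path for fork/clone/vfork, in the order given in the text."""
    if current.ft_mak:
        # Already a redundant process: create its child via do_ft_fork().
        return "do_ft_fork", do_ft_fork(current)
    if current.ft_exec:
        # Fault tolerance requested: derive the redundant process pair.
        return "do_double_fork", do_double_fork(current)
    # No fault tolerance needed: ordinary process creation.
    return "do_fork", do_fork(current)
```

Note that the `ft_mak` check comes first, so a process that is already redundant never triggers a second duplication even if `ft_exec` is also set.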

The I/O fault-tolerant control module comprises a terminal interface, a disk interface and a network interface. According to the kind of I/O, the I/O fault-tolerant control module may also be structured as two main sub-modules: disk/terminal read-write I/O control and network data read-write I/O control.

With reference to Fig. 3, the process of implementing this architecture is described below through a concrete example.

The user starts an application program through the fault-tolerant console interface with the fault-tolerance switch turned on (the flag ft_exec is set). The process fault-tolerant system creates a pair of processes (a primary process and a redundant process) for the application; the pair loads the same executable code of the application and starts executing in parallel in independent CPU groups.

Under the control of the synchronization protocol, the primary and redundant processes arrive at a synchronization point together and start synchronizing. If either process times out while waiting at the synchronization point, an error is triggered and the fault detection and error processing flow is entered. If the operation at the synchronization point is an I/O operation, I/O conversion is required, and the system determines whether the operation is a write. If it is a write, the data written by the primary and redundant processes is compared: identical data means that execution is correct, whereas differing data indicates a fault, and the fault detection and error processing flow is entered. If the I/O operation is a read, the I/O fault-tolerant control module performs the data conversion at the I/O interface and carries out the final actual operation.
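The checks made at an I/O synchronization point can be sketched as a pure decision function. This is an illustrative Python model, not the kernel protocol: the `Arrival` record, the function `check_sync_point`, the result strings and the way a timeout is modelled (comparing recorded arrival times) are all assumptions. Treating differing operation types as a fault is likewise an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Arrival:
    arrived_at: Optional[float]  # seconds; None if the process never arrived
    op: str                      # "write", "read", or a non-I/O operation name
    data: bytes = b""            # payload of a write operation


def check_sync_point(primary, redundant, timeout):
    """Return the verdict for one synchronization point."""
    times = (primary.arrived_at, redundant.arrived_at)
    # A process that waits longer than `timeout` for its peer triggers an error.
    if None in times or abs(times[0] - times[1]) > timeout:
        return "fault: sync timeout"
    if primary.op != redundant.op:
        # Assumption: diverging operations are treated as a fault.
        return "fault: operations diverge"
    if primary.op == "write":
        # Writes are compared before any data leaves the system.
        if primary.data == redundant.data:
            return "ok: write data identical"
        return "fault: data mismatch"
    if primary.op == "read":
        # The I/O control module converts the data and performs the real read.
        return "ok: read"
    return "not I/O"
```

A `fault:` verdict corresponds to entering the fault detection and error processing flow; an `ok:` verdict lets the I/O fault-tolerant control module carry out the actual operation.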

If the operation at the synchronization point is not an I/O operation, state consistency adjustment is performed directly at the synchronization point, for example unifying the function return value and recording the current process state information, which completes this round of synchronization.
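As an illustration of state consistency adjustment for a non-I/O operation, consider a call whose result legitimately differs between the two processes, such as a process identifier. The Python sketch below unifies the return value and records the state. The function `align_state`, the record layout and the rule that the primary process's value is the one handed to both processes are assumptions made for this example; the text does not specify which value is chosen.

```python
def align_state(primary_ret, redundant_ret, state_log, sync_id):
    """Unify a non-I/O return value and record the state at this sync point."""
    unified = primary_ret  # assumption: the primary's value is used for both
    state_log.append({
        "sync_id": sync_id,          # hypothetical sync point identifier
        "primary": primary_ret,
        "redundant": redundant_ret,
        "unified": unified,
    })
    # Both processes continue with the same value, so later comparisons
    # are not disturbed by a difference that is not an error.
    return unified, unified
```

For instance, `align_state(1001, 1002, log, sync_id=7)` hands `1001` to both processes and leaves one entry in `log`.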

The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A fault-tolerant system based on process redundancy, concentrated in the operating system kernel layer and comprising a fault-tolerant process management module, an error processing module, an I/O fault-tolerant control module and a monitoring management module, wherein,

2. The fault-tolerant system based on process redundancy according to claim 1, characterized in that the fault-tolerant process management module applies fault-tolerant control to the fork, clone and vfork system calls; on the basis of the do_fork() function, two functions, do_double_fork() and do_ft_fork(), are added, which respectively allow an application program to derive a dual-mode process pair and allow the dual-mode processes to derive their own child processes.

3. The fault-tolerant system based on process redundancy according to claim 1, characterized in that the I/O fault-tolerant control module comprises a terminal interface, a disk interface and a network interface.

4. A design method for a fault-tolerant system based on process redundancy, comprising: (1) taking the multiple CPUs of an SMP architecture as redundant hardware, constructing redundant tasks that execute in parallel in independent CPU groups, detecting errors by comparing the execution data of the redundant tasks, and completing error recovery through mechanisms such as mode degradation and restart; (2) adding a fault-tolerant container to the Linux operating system, in which applications run in redundant fashion, the process fault-tolerant system being responsible for managing the redundant tasks, which execute independently on their respective redundant hardware, and for controlling task synchronization, data comparison, error detection and recovery.

5. The design method for a fault-tolerant system based on process redundancy according to claim 4, characterized in that, within the fault-tolerant container, a group of redundant tasks executes the same function code; while they run, the fault-tolerant system manages, synchronizes and monitors them and performs error detection according to the data comparison results; tasks outside the fault-tolerant container still run in the ordinary single-mode manner.