CN101369241A

CN101369241A - Cluster fault-tolerance system, apparatus and method

Info

Publication number: CN101369241A
Application number: CNA200810215663XA
Authority: CN
Inventors: 霍志刚
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2007-09-21
Filing date: 2008-09-12
Publication date: 2009-02-18
Also published as: CN101377750B; CN101377750A

Abstract

The invention discloses a cluster fault-tolerance system, apparatus and method. The system includes: a remote checkpoint server, which responds to remote checkpoint requests from a faulty node and executes the checkpoint operation; a node fault detection module, which monitors the operating system of the local node and the running state of designated processes and triggers a remote checkpoint; and a communication system checkpoint module, which implements checkpointing of the communication device and supports communication breakpoint recovery. The invention provides localized, fast fault recovery for parallel processing clusters, has low overhead and good scalability, and gives good availability to cluster systems with computation scales in the tens and hundreds of billions.

Description

Cluster fault-tolerance system, apparatus and method

Technical field

The present invention relates to the field of fault-tolerance technology for parallel computer processing, and in particular to a cluster fault-tolerance system, apparatus and method for parallel processing.

Background technology

Parallel computer processing, with clusters as its representative technology, is now applied widely and deeply in modern society. Because clusters are an important part of the information infrastructure, the reliability of parallel processing in cluster systems affects both the economy and society. As cluster systems keep growing in scale and complexity, the reliability of their parallel processing shows a downward trend, which has drawn wide attention from academia and industry. Both theoretical research on fault tolerance for cluster parallel processing and the engineering demand for it are becoming increasingly urgent.

A cluster is a parallel computing system made up of multiple independent computers (called the nodes of the cluster) that are interconnected by a network and can work cooperatively.

Fault, error and failure are the most basic concepts in fault-tolerant computing and the basis for understanding fault-tolerance technology. In brief, a failure is the event in which a system deviates from its correct service; an error is the part of a system's total state that may lead to a subsequent failure; and a fault is the cause of an error. A fault may originate inside or outside a system. A fault that has caused an error is in the active state; otherwise it is in the dormant state.

An error can be detected through an error message or an error signal. An error that has occurred but has not yet been detected is called a latent error. An error can be continually transformed by the computation or spread between system modules; this process is called error propagation.

At present there are four main approaches to handling hardware and software faults in computer systems:

Fault prevention: prevent faults from occurring in the first place;

Fault tolerance: prevent a fault from causing a service failure after it has occurred;

Fault removal: reduce the number and severity of faults;

Fault forecasting: estimate the current number of faults, their future incidence and their consequences.

Fault tolerance in the broad sense covers all of these approaches to handling faults, errors and failures.

Fault tolerance is generally achieved through error detection and system recovery; the latter can be divided, according to what it acts on, into fault handling and error handling.

Given the relationship among faults, errors and failures, error handling is the key link in avoiding service failure. Existing error-handling techniques fall into three strategies: rollback, roll-forward and compensation. Rollback recovery, used when the cause of an error cannot be determined and removed, restores the system to a previously saved correct state and reruns from there, in the hope that the error does not recur. Roll-forward recovery is usually based on N-modular redundancy: when periodic state comparison (N=2) or voting (N≥3) finds an error, all redundant modules keep running, while a spare unit reruns one period of the computation, and its result is used to identify and discard the state of the erroneous redundant module. Rollback recovery relies mainly on time redundancy, whereas roll-forward recovery depends on hardware redundancy. Of the two strategies, the former is more widely used. Conventional process checkpointing with rollback recovery is a typical rollback error-handling technique.
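
As an illustration only, the voting step of N-modular redundancy mentioned above can be sketched in Python. The function name `vote` and the use of a strict majority are assumptions made for this sketch; the source does not say how ties are resolved, and the sketch leaves out the spare-unit rerun that roll-forward recovery uses to confirm which module is wrong.

```python
from collections import Counter


def vote(outputs):
    """Majority vote over the outputs of N redundant modules (N >= 3).

    Returns (majority_value, indices_of_disagreeing_modules).
    Raises ValueError when no value is held by a strict majority.
    """
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # With N=2 a mismatch can only be detected, not resolved.
        raise ValueError("no strict majority among module outputs")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With two modules a disagreement raises an error, which matches the comparison-only case (N=2) in the text: the mismatch is detected but the faulty module cannot be identified by voting alone.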

Faults can be introduced at any system level, so each level needs a corresponding fault-tolerance mechanism. At the same time, every fault-tolerance mechanism makes assumptions about the faults or errors it can handle, such as fault type and fault frequency, so different fault types often call for different fault-tolerance methods. Failure handling in real computer systems therefore usually combines several fault-tolerance techniques and proceeds in multiple steps or levels. The ten steps commonly used in computer fault handling are, in order: fault containment, fault detection, fault masking, retry, diagnosis, reconfiguration, recovery, restart, repair and reintegration.

Classical fault-tolerance techniques include, on the hardware side, triple modular redundancy (TMR) and N-modular redundancy (NMR); and, on the software side, recovery blocks, N-version programming, algorithm-based fault tolerance (ABFT) and software self-checking, as well as software aging and rejuvenation, recovery-oriented computing (ROC) and failure-oblivious computing.

The traditional challenge in cluster design is linear speedup. With the computation granularity essentially unchanged, once the number of nodes grows beyond a certain point the overall performance of the cluster not only fails to reach linear speedup but may even fall rather than rise. Once node reliability is taken into account, and assuming that the reliability of each node is less than ideal, the concern as the number of nodes grows is no longer whether performance scales linearly but whether a computation of a given size can complete without a fault at all.

The redundancy inherent in the cluster architecture solves part of the cluster fault-tolerance problem, but cluster fault tolerance faces further challenges. First, as cluster systems grow in scale, the reliability of the whole system inevitably declines, as a matter of statistics. Second, the concurrency among cluster nodes makes it harder to capture and restore the complete state of an application. Inter-process communication creates complex precedence dependencies among the states of the processes in a parallel application, so handling any single fault must take the recoverability of the global state into account.
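
The statistical argument can be made concrete with a minimal sketch. It assumes, which the source does not state, that node failures are independent and that a job needs every one of its nodes to survive. Under those assumptions the probability that the job completes without a fault is the per-node reliability raised to the number of nodes. The function name and the figures are illustrative.

```python
def system_reliability(node_reliability: float, nodes: int) -> float:
    """Probability that all `nodes` nodes survive a run.

    Assumes independent node failures and that the job fails
    as soon as any single node fails.
    """
    if not 0.0 <= node_reliability <= 1.0:
        raise ValueError("node_reliability must lie in [0, 1]")
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return node_reliability ** nodes
```

For example, with a per-node reliability of 0.999 over the run, 1000 nodes give a fault-free probability of roughly 0.37, and the value keeps shrinking as nodes are added.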

There are two ways to improve the availability of a cluster system. One is to keep improving the reliability of each node, so that the reliability of the whole system improves accordingly. But since cluster systems generally use COTS (commercial off-the-shelf) components, this approach is quite constrained. The other is to focus on the availability of the system as a whole, so that a fault at a single node receives recovery handling that is localized (in time and space). The periodic global checkpointing commonly used in cluster fault tolerance can be called a "time-localized" technique. It divides a continuously running cluster system in time into shorter units, the traditional checkpoint intervals. By recording the system state at the start of each time unit, it ensures that a fault occurring within a unit can destroy only the results that the whole cluster computed in that unit. Practice has shown this to be a very effective cluster fault-tolerance strategy, but it does not achieve spatial locality in fault handling, and its overhead is directly tied to system scale. Checkpointing every process of a parallel application at the end of every checkpoint interval arguably tends toward aggressive redundancy. As the scale of cluster computing keeps growing, the drawbacks of global checkpointing become increasingly apparent, and cluster fault-tolerance mechanisms urgently need to become more lightweight.

Summary of the invention

The problem to be solved by the present invention is to provide a cluster fault-tolerance system, apparatus and method that give parallel processing clusters localized, fast fault recovery with low overhead and good scalability, so that very large cluster systems, up to peta-scale, can achieve good availability.

A cluster fault-tolerance system provided to achieve the object of the invention comprises the following functional modules:

A remote checkpoint server, used to respond to remote checkpoint requests from a faulty node and to execute the checkpoint operation;

A node fault detection module, used to monitor the operating system of the local node and the running state of designated processes, and to trigger a remote checkpoint;

A communication system checkpoint module, used to implement checkpointing of the communication device and to support communication breakpoint recovery.

The cluster fault-tolerance system further comprises the following functional modules:

A parallel application process manager, used to provide the crash-time checkpoint system with the process information of the monitored application and to manage process recovery;

A checkpoint file server, used to store checkpoint files and to provide access to them during crash-time checkpoint recovery.

A cluster fault-tolerance processing method comprises the following steps:

Step A1. Application registration;

Step B1. Remote checkpoint capture;

Step C1. Process recovery.

The cluster fault-tolerance processing method further comprises, after step A1 and before step B1, the following steps:

Step S1. Node monitoring;

Step S2. Fault reporting.

A remote checkpoint capture system comprises:

A kernel symbol addressing module, used to resolve the addresses of kernel symbols in the operating system of the node where the target process resides;

A data cache module, used to cache and prefetch data;

A pointer module, used for pointer operations.

A remote checkpoint capture method comprises the following steps:

Step S10: load the operating system symbol table of the target node;

Step S20: load the operating system kernel type table of the target node;

Step S30: look up the process control block (PCB) of the target process by its process ID and copy it into a local buffer;

Step S40: create the file header of the remote checkpoint image;

Step S50: save the full paths of the root directory and the current working directory;

Step S60: save the file descriptor table of the target process;

Step S70: save, one by one, the basic information of the files the target process has open;

Step S80: append the end marker to the remote checkpoint image file.
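
To make the order of steps S40 to S80 concrete, the following Python sketch serializes a checkpoint image in that order. The on-disk layout is entirely hypothetical: the magic bytes, the end marker, the field widths and the function names are assumptions for illustration, since the source does not describe the image format, and the remote memory reads of steps S10 to S30 are replaced by plain arguments.

```python
import io
import struct

MAGIC = b"RCKP"        # hypothetical header magic
END_MARK = b"RCKPEND"  # hypothetical end marker


def _put_str(buf, text):
    """Write a length-prefixed UTF-8 string."""
    data = text.encode("utf-8")
    buf.write(struct.pack("<I", len(data)))
    buf.write(data)


def write_image(os_release, pid, root, cwd, open_files):
    """Build an image; open_files is a list of (fd, path, offset)."""
    buf = io.BytesIO()
    buf.write(MAGIC)                       # S40: file header
    _put_str(buf, os_release)
    buf.write(struct.pack("<I", pid))
    _put_str(buf, root)                    # S50: root and working directory
    _put_str(buf, cwd)
    buf.write(struct.pack("<I", len(open_files)))
    for fd, _path, _offset in open_files:  # S60: descriptor table
        buf.write(struct.pack("<I", fd))
    for _fd, path, offset in open_files:   # S70: per-file information
        _put_str(buf, path)
        buf.write(struct.pack("<Q", offset))
    buf.write(END_MARK)                    # S80: end marker
    return buf.getvalue()
```

Writing the end marker last lets a reader tell a complete image from one that was cut short, which is presumably the purpose of step S80.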

A remote checkpoint recovery system comprises:

A state discrimination module, used to distinguish the state of the target process so as to avoid misusing registers such as RCX;

A trampoline module, used to restore the complete general-purpose register state, so that the process always leaves kernel mode with the IRET instruction and returns from kernel space to user space.

A remote checkpoint recovery method comprises the following steps:

Step S10': the checkpoint recovery tool creates a child process and waits for it to finish or to exit abnormally;

Step S20': validate the checkpoint file using its header information; if the operating system indicated in the header is not the expected one, exit; otherwise, proceed to the next step;

Step S30': reset the basic attributes of the target process;

Step S40': restore the CPU state portion of the target process's kernel stack;

Step S50': restore the signal-handling information of the target process;

Step S60': unmap all virtual memory areas of the host process;

Step S70': load the mappings of all virtual memory areas of the target process;

Step S80': set the virtual address space descriptor of the target process;

Step S90': restore the paths of the target process's root directory and current working directory;

Step S100': close every file of the host process;

Step S110': restore, one by one, the basic information of the files the target process had open;

Step S120': set the state of the target process to runnable, so that after leaving the remote checkpoint recovery procedure it can be scheduled and executed normally.
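
The validity check of step S20' can be sketched as follows. This is a minimal illustration under assumed conditions: the header is taken to be a 4-byte magic value followed by a length-prefixed operating system identifier. That layout, the magic value and the function name are hypothetical, since the source says only that the header identifies the operating system.

```python
import struct

MAGIC = b"RCKP"  # hypothetical header magic


def check_header(image: bytes, expected_os: str) -> bool:
    """Return True only if the image header names the expected OS."""
    if len(image) < 8 or image[:4] != MAGIC:
        return False
    (length,) = struct.unpack_from("<I", image, 4)
    os_field = image[8:8 + length]
    if len(os_field) != length:
        return False  # header truncated
    return os_field.decode("utf-8", errors="replace") == expected_os
```

A recovery tool would call such a check before touching the host process, so that an image from a mismatched kernel is rejected while exiting is still harmless.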

A checkpoint capture method for a communication system comprises the following steps:

Step S100: read the port state structure pointed to by the communication device file;

Step S200: send the NIC on which the target port resides a command to freeze the designated port;

Step S300: if it has been confirmed that the target process on the host side has stopped running, save the send request queue and every send buffer writable by the user process; otherwise, first send the target process a remote interrupt that suspends it;

Step S400: once the communication protocol processor confirms that the target port is frozen, save the receive buffers and the event queue in user space, together with the port control block on the NIC and all send handles of that port that are in use.

A checkpoint recovery method for a communication system comprises the following steps:

Step S100': port reconstruction;

Step S200': port relocation.

A communication breakpoint recovery method comprises the following steps:

Step 1. Freeze the send and receive operations of the local communication port;

Step 2. Save the state of the frozen port;

Step 3. Broadcast to the other MCPs.

A crash-time checkpoint capture method for a single node comprises the following steps:

Step S1000: preservation of the complete process state;

Step S2000: detection of node faults;

Step S3000: detection of the state of the remote checkpoint's target process.

A process detection method for suspension by remote interrupt comprises the following steps:

Step N1: look up the process control block corresponding to the specified process ID;

Step N2: fill in the remote interrupt request structure;

Step N3: if the ID of the CPU on which the process resides equals the ID of the CPU executing the remote interrupt service routine, set the flag of the interrupted process directly; otherwise, send an inter-processor interrupt to the CPU on which the process resides;

Subsequently, the CPU on which the process resides services, in its process scheduling module, the request registered in the remote interrupt request structure, and updates the request status flag in that structure once the request is complete.
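
The decision in steps N1 to N3 can be modeled in user-space Python as a sketch. The real logic would run inside the kernel's interrupt service routine; here the `Task` and `RemoteInterruptRequest` types, their fields, and the `send_ipi` callback are all hypothetical stand-ins introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    cpu: int
    stop_requested: bool = False  # stands in for the per-process flag


@dataclass
class RemoteInterruptRequest:
    pid: int
    action: str
    status: str = "pending"       # updated once the request is serviced


def handle_remote_interrupt(tasks, pid, action, current_cpu, send_ipi):
    """Model of steps N1-N3; returns the request, or None if pid is unknown."""
    task = tasks.get(pid)                          # N1: find the control block
    if task is None:
        return None
    request = RemoteInterruptRequest(pid, action)  # N2: fill in the request
    if task.cpu == current_cpu:                    # N3: same CPU, set flag
        task.stop_requested = True
    else:                                          # N3: other CPU, send IPI
        send_ipi(task.cpu, request)
    return request
```

The split in N3 avoids an inter-processor interrupt when the target process already belongs to the CPU that is handling the remote interrupt.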

The beneficial effects of the invention are as follows. The cluster fault-tolerance system, apparatus and method of the invention constitute a crash-time checkpointing mechanism for parallel applications on clusters, intended to give parallel applications localized, fast fault recovery. In a global checkpoint system, the states of all processes of a parallel application must be written periodically to non-volatile storage during normal operation, and the inter-process communication state must be kept consistent. The core idea of crash-time checkpointing, by contrast, is to save during normal operation only the state, such as CPU state, that cannot be obtained once a node has failed. When a fault is detected, a checkpoint is executed remotely on the processes of the faulty node, the state of their communication system is saved, and the cluster communication protocol guarantees correct recovery after the communication is interrupted.

The invention also proposes a remote checkpointing mechanism based on remote direct memory access (RDMA). Because RDMA communication requires no participation by the CPU or operating system of the target node, and because cluster high-speed communication systems perform well, this technique can capture application state efficiently even under fault conditions such as the target node's operating system denying service. It loosens the dependence of the application on the operating system, so that an operating system fault no longer threatens the continued execution of the processes it serves.

The invention also implements a checkpoint and recovery mechanism for user-level cluster communication systems. By exploiting the communication reliability mechanisms of the cluster communication system, such as acknowledgment, retransmission and message format, this mechanism relaxes the requirement that a parallel checkpoint maintain a globally consistent state of the communication system, and reduces the overhead of process checkpointing and recovery.

Building on the checkpoint and recovery mechanism for user-level cluster communication systems, the invention also explores how the cluster communication protocol can support crash-time checkpoint and recovery operations for parallel applications, and proposes a communication breakpoint recovery mechanism for checkpoints of a single process within a parallel application.

The invention also implements a node-level fault-tolerance mechanism that supports crash-time checkpointing, comprising a coprocessor-based node fault detection technique and a technique for saving CPU register state based on process context switching. Together they enable fast detection of node faults and guarantee the integrity of the target process's state after a node fault occurs.

The invention also implements a lightweight crash-time checkpoint and restart system for parallel programs (Crash-Time ChecKpoint and Restart system, CTCKR). The system triggers checkpoint and recovery operations only when a parallel application cannot continue because the operating system of one of its nodes has crashed, which avoids the time overhead of frequent periodic checkpointing. It also checkpoints and recovers only the affected processes on the faulty node, with no need for a global checkpoint, so the storage overhead is greatly reduced as well.

Evaluation experiments using benchmarks such as NPB (NAS Parallel Benchmarks) and LINPACK show that the cluster fault-tolerance system, apparatus and method of the invention met their design goals both in the performance tests and in the correctness tests based on fault injection, which demonstrates that they are a feasible lightweight cluster fault-tolerance technique.

Description of drawings

Fig. 1 is a schematic diagram of the structure of a process information entry;

Fig. 2 is a schematic diagram of the working process of the cluster fault-tolerance system of the invention;

Fig. 3 is a schematic diagram of the objects operated on by the remote checkpoint capture method;

Fig. 4 is a schematic diagram of user-space virtual address translation;

Fig. 5 is a schematic diagram of consistent states of a distributed system;

Fig. 6 is a schematic diagram of the optimization of checkpoint capture for the ring receive buffer;

Fig. 7 is a schematic diagram of breakpoint recovery support in the MX communication protocol;

Fig. 8 is a schematic diagram of the binary-tree broadcast method;

Fig. 9 shows the MX checkpoint overhead test results;

Fig. 10 is a schematic diagram of the top of the kernel stack.

Embodiment

To make the purpose, technical solution and advantages of the invention clearer, the cluster fault-tolerance system, apparatus and method of the invention are described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

The cluster fault-tolerance system pursues the following goals: 1) Broad applicability to applications. Cluster fault tolerance should be as independent as possible of the application and the parallel application middleware, and should not even depend on the node operating system, for the convenience of application developers and system administrators; that is, the fault-tolerance mechanism should stay as transparent to the application as possible. 2) Low overhead during fault-free operation. The performance loss caused by the fault-tolerance mechanism should be minimized, as should its use of resources such as storage. 3) Low recovery overhead. The level at which rollback recovery takes place should be lowered, thereby reducing the recovery overhead.

The cluster fault-tolerance system of the invention is based on the following techniques: 1) a coprocessor-based fault detection mechanism, which makes the state of a cluster node known to the outside world even after the node's operating system has denied service or crashed; 2) remote checkpointing, which allows a checkpoint of any process on that node to be captured remotely; and 3) a checkpointing mechanism for the cluster communication system, which can suspend and resume the communication of a parallel application at any time.

The cluster fault-tolerance system of the invention comprises the following functional modules:

As one possible implementation, the cluster fault-tolerance system of the invention uses MPI applications as the test subject, so the parallel application process manager is implemented as a functional extension of the MPI process manager.

The communication system checkpoint module is used to implement checkpointing of the communication device and to support communication breakpoint recovery. In one embodiment of the invention, this module is implemented as an extension of the MX communication system.

The remote checkpoint server of the invention is described further below.

The remote checkpoint server responds to remote checkpoint requests from faulty nodes and executes the checkpoint operation. Its workflow is as follows:

1. Load the operating system kernel symbol table and the data structure type table of each node on the target platform.

2. Wait in a loop for crash-time checkpoint registration messages from applications or emergency messages from faulty nodes:

(A) For a registration message from an application:

A1) Broadcast the application topology information to the nodes hosting each process of the application. On receiving this message, the communication protocol processor of each node starts local state monitoring.

A2) Begin handling emergency messages sent by the nodes of this application.

(B) For an emergency message from a faulty node:

B1) Reply to the emergency message. In a system configured with multiple remote checkpoint servers, only the server whose emergency response is returned first continues with the checkpoint.

B2) Carry out the remote checkpoint.

B3) Notify the MPI control program of the rank of the process whose checkpoint was captured and of the full path of the checkpoint.

(C) For a registration message from a recovering process:

Return to the recovering process the topology information of the parallel application it belongs to, so that it can initiate communication breakpoint recovery.
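
The message handling in step 2 of the workflow above can be sketched as a small dispatcher. This is a model only: the message dictionaries, the `kind` values and the action names are hypothetical, and the function returns the list of actions to take rather than performing any network or checkpoint work.

```python
def handle_message(message, registry):
    """Map one incoming message to the ordered actions the server takes.

    registry maps an application id to its registered topology.
    """
    kind = message["kind"]
    if kind == "app_register":
        # (A) remember the topology, then broadcast it to the nodes
        registry[message["app_id"]] = message["topology"]
        return [("broadcast_topology", message["app_id"])]
    if kind == "emergency":
        # (B) reply, capture the checkpoint, notify the MPI control program
        node = message["node"]
        return [
            ("reply_emergency", node),
            ("remote_checkpoint", node),
            ("notify_mpi_control", node),
        ]
    if kind == "recover_register":
        # (C) hand the topology back to the recovering process
        return [("send_topology", registry[message["app_id"]])]
    raise ValueError(f"unknown message kind: {kind!r}")
```

The sketch does not model B1's arbitration among several servers; in a real deployment some tie-breaking step would decide which server's reply came first.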

The node fault detection module of the invention is described further below.

The function of the node fault detection module in the cluster fault-tolerance system of the invention is to monitor the operating system of the local node and the running state of designated processes, and to trigger the remote checkpoint procedure.

It uses a coprocessor-based node state monitoring method.

It also uses a remote interrupt mechanism, that is, a mechanism that delivers control commands to the operating system of a target node through messages in the cluster communication system. The remote interrupt function is used mainly to control the running state of a process on a remote node that has not yet crashed.

The parallel application process manager of the invention is described further below.

Although the communication system checkpointing mechanism on which the crash-time checkpoint system relies can support various parallel application middleware layered on the cluster communication system, such as MPI and PVM, different middleware requires different parallel process management mechanisms. Because MPI is currently the de facto standard middleware for cluster parallel computing, the embodiment of the invention uses MPI applications as the test subject for the crash-time checkpoint system.

There are currently several implementations of the MPI standard; the embodiment uses MPICH version 1.2.5 with support for the MX communication system. The default process manager of MPICH is MPIRUN, whose original flow is:

1. Parse the command-line input and set the control parameters for this run of the application.

2. Generate the target node array from the list of candidate nodes.

3. Set up a listening port for receiving basic information from the child processes.

4. Create, one per MPI process rank (RANK), child processes that execute a remote command (RSH/SSH). Each of these MPIRUN child processes in turn creates a process of the MPI parallel application on the corresponding target node.

5. Receive on the listening port the basic information returned by all child processes, such as process ID, NIC number and port number.

6. After the basic information of all child processes has been collected, broadcast it to all child processes of this parallel application. Once each child process has received the broadcast, MPI initialization is complete.

7. Wait on the listening port for a message sent by a child process that calls MPI_Abort(). If such a message arrives, send a termination command to the remaining child processes.

In the cluster fault-tolerance system of the invention, the parallel application process manager must not only load processes for MPI applications but also provide the remote checkpoint server with the topology information of the target application and manage process recovery. To this end, the invention extends the MPIRUN flow:

1) After broadcasting the collected child-process information to all child processes, MPIRUN registers the application's process information with the remote checkpoint server; the format of each entry is shown in Fig. 1. Each entry contains the fields RANK, PID, NIC_ID, PORT_ID and STATUS of a process.

Based on the application's registration information, the remote checkpoint server then notifies the node fault detection modules of all involved nodes to begin monitoring the running state of their local nodes.

2) After the remote checkpoint server has successfully captured the crash-time checkpoint of a process, it notifies MPIRUN, which takes charge of the subsequent process recovery.

MPIRUN creates a new child process, which sends a process recovery command to the designated standby node. MPIRUN then updates its mapping table between child processes and MPI process ranks, so that the exit procedure of the MPI application proceeds normally.

The extended MPIRUN obtains the address of the remote checkpoint server from its command-line parameters, and supports several ways of specifying standby nodes, such as a standby node list file or command-line input.
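
A process information entry with the five fields named above, and the table update that follows a recovery, might be modeled as in the sketch below. The field types, the status strings and the function name `mark_recovered` are assumptions; the source gives only the field names and refers to Fig. 1 for the layout.

```python
from dataclasses import dataclass


@dataclass
class ProcEntry:
    rank: int      # RANK: MPI process rank
    pid: int       # PID: process ID on its node
    nic_id: int    # NIC_ID: network interface card number
    port_id: int   # PORT_ID: communication port number
    status: str = "running"  # STATUS: value set is hypothetical


def mark_recovered(table, rank, pid, nic_id, port_id):
    """Point a rank at its restarted process on the standby node."""
    entry = table[rank]
    entry.pid = pid
    entry.nic_id = nic_id
    entry.port_id = port_id
    entry.status = "recovered"
    return entry
```

Keying the table by rank keeps the MPI-level identity stable while the process ID, NIC and port change when the process moves to a standby node.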

The checkpoint file server of the invention is described further below.

The checkpoint file server of the invention implements MXFS, a virtual file system based on the MX communication system, for accessing crash-time checkpoint image files. It exposes the standard file operation interfaces open(), read(), lseek() and write() through the procfs virtual file system of the Linux operating system. When a user program opens a file name of the form "/proc/mxfs/NODE/FILENAME", the MXFS client kernel module automatically sends the MXFS server running on node "NODE" a request to establish a link for accessing the file "FILENAME". In its response to this request, the MXFS server returns to the client the return value of the open command on the designated local file. Once the access link for the file has been established, the MXFS client can access the file with ordinary read and write commands. When the local file system selected by the MXFS server is RAMFS, MXFS read performance reaches 97 MB/s. This is only 42% of the peak performance of the MX communication system, mainly because of the low efficiency of the MXFS server implementation and of procfs.
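
The path convention "/proc/mxfs/NODE/FILENAME" can be illustrated with a short parser. This is a sketch of the naming scheme only, not of the MXFS client itself; the function name is hypothetical, and treating everything after the node component as the file name is an assumption.

```python
MXFS_PREFIX = "/proc/mxfs/"


def split_mxfs_path(path: str):
    """Split '/proc/mxfs/NODE/FILENAME' into (node, filename)."""
    if not path.startswith(MXFS_PREFIX):
        raise ValueError(f"not an MXFS path: {path!r}")
    node, sep, filename = path[len(MXFS_PREFIX):].partition("/")
    if not node or not sep or not filename:
        raise ValueError(f"missing node or file name: {path!r}")
    return node, filename
```

For instance, a path naming node "node07" and file "ckpt.img" splits into those two parts, which tell the client which server to contact and which file to request.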

In the implementation of the fault-time checkpoint system, the MXFS server and the remote checkpoint server run on the same node, and may even be implemented in the same program, so that the MXFS server can directly access the checkpoint files saved by the remote checkpoint server.

The workflow of the cluster fault-tolerance processing method of the present invention is described in detail below.

Taking as background an operating system fault on one node of an MPI application with N processes, this embodiment describes in detail the fault-time checkpoint workflow of the cluster fault-tolerance method of the present invention.

Step S1. Application registration.

When the MPI process manager loads an MPI application, it collects the basic information of all child processes and then sends an application registration request to the remote checkpoint server. On receiving this request, the remote checkpoint server uses the process information it contains to send a node monitoring request to the node fault detection modules of all associated nodes. This request contains the registration number of the monitored application and the basic information of all its processes, the address (or list of addresses) of the available remote checkpoint servers, and a node monitoring policy code. The node monitoring policy code configures what kind of monitoring the node monitoring module performs on the local system state, including the monitoring interval and the monitored content.

Step S2. Node monitoring.

While the monitored application is running, the node fault detection module periodically and actively monitors the operating system of its node for faults, and also passively receives and handles node fault recovery requests sent from the host side.

Step S3. Fault report.

When the node fault detection module detects a fault or receives a fault recovery request:

First, it freezes the communication ports opened by all locally monitored processes;

Then, it sends a fault report to the default remote checkpoint server, requesting that the latter perform a remote checkpoint of the specified local processes;

Finally, by a remote interrupt broadcast, it notifies the MCPs of the other processes in the monitored application that checkpointing of the specified processes has started, as shown in Fig. 2 (2).

Step S4. Remote checkpointing.

When a remote checkpoint server receives a valid fault report, it starts the remote checkpoint procedure, as shown in Fig. 2 (3). After this procedure completes successfully, the remote checkpoint server sends the location of the resulting checkpoint image file, together with information such as the RANK of the corresponding process, to the MPI process manager, which is responsible for the subsequent process recovery.

Step S5. Process recovery.

The MPI process manager first determines the node on which the process will be recovered, for example by selecting a new node from a preset list of idle nodes, or by waiting for the faulty node to restart.

The MPI process manager then sends a process recovery command to the target node, as shown in Fig. 2 (4). In this command, the location of the checkpoint file is given as its address in the MXFS file system. During recovery of the target process, once its communication port has been restored, the port notifies the other processes by broadcast to resume communication, as shown in Fig. 2 (5). If no other node fails during the checkpointing and recovery of this process, the monitored application continues to run.

Preferably, in the final step of the above procedure, the MPI process manager can restart the faulty node through remote power management software. Fairly mature remote power management solutions already exist, such as the various remote power management devices from CPS that connect over Ethernet, RS-232, and so on. This embodiment uses the timed reboot mechanism that Linux provides in the panic() function: by setting the Linux boot option panic, the operating system reboots automatically a specified time after entering the panic state.

To further illustrate the cluster fault-tolerance system and processing method of the present invention, this embodiment uses the following development platform. The hardware is a NUMA server with two AMD Opteron processors; the network interface card is a Myricom PCIXD-2, whose communication coprocessor is a LanaiX. The operating system kernel of the development platform is Linux 2.6.12; the parallel computing middleware is MPICH version 1.2.6; the underlying communication system is Myricom MX version 1.1.0. The implementation details of the system and method are introduced below at three levels: the operating system, the underlying communication system, and the remote checkpoint server.

1. Operating system layer

11) Saving process state

The file arch/x86_64/kernel/entry.S in the Linux kernel source contains the assembly routines by which the Opteron CPU enters and leaves kernel mode through system calls, interrupts, exceptions, and so on. So that the CPU registers and FPU registers of the user process context are saved every time the CPU enters kernel mode, the present invention modifies this file: after the kernel stack pointer has been taken from each CPU's x8664_pda structure, the above state is saved in full. The main modified fragment for the system call entry is as follows:

    ...
    SAVE_REST
    pushq %rax
    pushq %rdx
    call ctckr_save_fpu
    popq %rdx
    popq %rax
    call *sys_call_table(,%rax,8)
    movq %rax,RAX(%rsp)
    RESTORE_REST
    ...

Here, SAVE_REST and RESTORE_REST save and restore the RBX, RBP, and R12 to R15 registers.

The function that saves the FPU state in the above code is defined as:

    void ctckr_save_fpu(void)
    {
        struct task_struct *tsk = current;

        if (!used_math())
            return;
        if ((tsk)->thread_info->status & TS_USEDFPU) {
            asm volatile("rex64; fxsave %0"
                         : "=m" (tsk->thread.i387.fxsave));
        }
        return;
    }

Here, FXSAVE is the Opteron CPU instruction that saves the XMM, MMX, and x87 register state.

Saving the FPU state is the most time-consuming operation in the process state saving of the cluster fault-tolerance system and method of the present invention. As the above code shows, for non-scientific programs that do not use the FPU, the performance overhead that the system and method add to fault-free operation is negligible.

12) Operating system fault detection

To support the cluster fault detection mechanism based on operating system fault counting, the present invention defines an integer array ctckr_danger_level[], in which each CPU corresponds to one element indexed by its CPU number. So that the coprocessor can obtain both the contents of this array and the current clock interrupt count in a single DMA read, the present invention modifies the linker script that generates the operating system kernel image, namely arch/x86_64/kernel/vmlinux.lds.S, so that the array and the clock interrupt count variable occupy adjacent addresses. When the kernel module of the communication NIC is loaded, this address is registered with the communication coprocessor, so that the latter can periodically check the fault state of the host side.
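The check performed on each DMA snapshot could be sketched as follows. This is only an illustration under stated assumptions: the snapshot layout (counter first, array second), the threshold value, the two-CPU configuration, and the names ctckr_host_snapshot and ctckr_host_faulty are all hypothetical, and the decision rule shown (a stalled clock interrupt counter or a danger level at or above a threshold) is one plausible policy rather than the policy of the described system.

```c
#include <assert.h>
#include <stdint.h>

#define CTCKR_NR_CPUS          2  /* assumed: two-processor platform */
#define CTCKR_DANGER_THRESHOLD 3  /* hypothetical threshold */

/* Assumed layout of the region fetched by one DMA read: the clock
 * interrupt counter followed directly by the per-CPU danger levels. */
struct ctckr_host_snapshot {
    uint64_t clock_ticks;
    int32_t  danger_level[CTCKR_NR_CPUS];
};

/* Compare two consecutive snapshots taken by the coprocessor.
 * Returns 1 if the host looks faulty, 0 otherwise. */
int ctckr_host_faulty(const struct ctckr_host_snapshot *prev,
                      const struct ctckr_host_snapshot *cur)
{
    int cpu;

    assert(prev && cur);
    /* A clock interrupt counter that stopped advancing suggests a hang. */
    if (cur->clock_ticks == prev->clock_ticks)
        return 1;
    /* Any CPU whose danger level reached the threshold is treated as failed. */
    for (cpu = 0; cpu < CTCKR_NR_CPUS; cpu++)
        if (cur->danger_level[cpu] >= CTCKR_DANGER_THRESHOLD)
            return 1;
    return 0;
}
```

Keeping the counter and the array adjacent means that both inputs of this check arrive in the same read, so the coprocessor never compares a fresh counter with stale danger levels.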

2. Underlying communication system

The extensions to the communication system required by the cluster fault-tolerance system and method include remote interrupts, RDMA read, MX port freezing, and MX port recovery. Implementing each of these functions involves modifying two parts of the MX communication system, the MCP and the user library; because the MCP is embedded software running on the communication NIC, it is harder to develop and debug than the host-side code. The present invention uses the implementation of RDMA read as an example to explain how these functions are implemented.

In the user library, a new request type MX_REQUEST_TYPE_RMA_GET is added, and the corresponding subtype struct rma_get is added to the user request data structure union mx_request.

    struct {
        struct mx_basic_request basic;
        mx_segment_t *segments;
        uint32_t count;
        mx_segment_t segment;
        uint64_t remote_addr;
        mx_endpoint_addr_t dest_address;
        uint16_t lib_seqnum;
        uint16_t pad;
        uint32_t remote_len;
        uint16_t msg_seq;
        uint8_t unexpected;
        uint8_t pad2;
        struct mx_partner *partner;
    } rma_get;

After processing the user request, the user library sends a request to the MCP. The present invention therefore adds an MCP-layer user request type MX_MCP_UREQ_RMA_GET and the corresponding MCP request structure mcp_ureq_rma_get_t.

    typedef struct
    {
        uint16_t target_peer_index;
        uint8_t target_endpt;
        uint8_t is_reply;
        uint32_t target_session;
        uint32_t timeout;
        uint16_t lib_seqnum;
        uint32_t remote_addr_high;
        uint32_t remote_addr_low;
        uint32_t remote_len;
        uint16_t lib_cookie;
        uint8_t pad1;
        uint8_t type;
    } mcp_ureq_rma_get_t;

After processing this request, the MCP layer sends a packet of the following format, using the newly added MX packet type PKT_TYPE_RMA_GET:

    typedef struct {
        pkt_data_common_t common;
        uint32_t remote_len;
        uint32_t remote_addr_high;
        uint32_t remote_addr_low;
        uint8_t is_reply;
        uint8_t pad;
        uint16_t return_peer_index;
        uint32_t src_mac_low32;
    } pkt_data_rma_get_t;

On receiving this message, the receiver's MCP starts the specified DMA read operation and returns a packet of the newly defined type PKT_TYPE_RMA_GET_REPLY. On receiving that reply, the sender's MCP first starts the DMA that delivers the received data, and then DMAs an event of type MX_MCP_UEVT_RECV_RMA_GET to the host-side user library.
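The user-level request carries a 64-bit remote_addr, while the MCP request and the packet carry it as remote_addr_high and remote_addr_low. A minimal sketch of that conversion is given below; the reduced structure rma_get_addr and the two helper functions are hypothetical and model only the address and length fields, not the full MX structures.

```c
#include <assert.h>
#include <stdint.h>

/* Only the address and length fields of the request are modelled here. */
struct rma_get_addr {
    uint32_t remote_addr_high;
    uint32_t remote_addr_low;
    uint32_t remote_len;
};

/* Split a 64-bit remote physical address into the two 32-bit fields. */
void rma_get_pack(struct rma_get_addr *out, uint64_t remote_addr, uint32_t len)
{
    assert(out);
    out->remote_addr_high = (uint32_t)(remote_addr >> 32);
    out->remote_addr_low  = (uint32_t)(remote_addr & 0xffffffffu);
    out->remote_len       = len;
}

/* Rebuild the 64-bit address on the receiving side. */
uint64_t rma_get_unpack_addr(const struct rma_get_addr *in)
{
    assert(in);
    return ((uint64_t)in->remote_addr_high << 32) | in->remote_addr_low;
}
```

Splitting the address this way keeps every field at most 32 bits wide, which is presumably convenient for a 32-bit coprocessor; that motivation is an inference, not a statement from the description.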

To implement MX port recovery, the process recovery module of the present invention first creates a new MX kernel port and then modifies its state according to the contents of the process checkpoint, so that recovery is completely transparent to the user process.

3. Remote checkpoint server

The remote checkpoint server is implemented as a background service process. Because the nodes of a typical cluster all run the same operating system version, the server supports configuring the default node operating system version at startup. As one possible implementation, its user interface is:

    Usage: rmac [options]
      -m, --map <System.map>         System.map file
      -k, --kerntypes <Kerntypes>    Kerntypes file
      -s, --server                   Daemon mode
      -g, --debug LEVEL              Debug level
      -h, --help                     Display this help
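A command-line interface of this shape can be parsed with the getopt_long function of the C library. The sketch below is illustrative only: the structure rmac_options and the function rmac_parse_args are hypothetical names, and the actual rmac program may parse its options differently.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical container for the options listed in the usage text. */
struct rmac_options {
    const char *map_file;        /* -m, --map */
    const char *kerntypes_file;  /* -k, --kerntypes */
    int server_mode;             /* -s, --server */
    int debug_level;             /* -g, --debug LEVEL */
    int show_help;               /* -h, --help */
};

/* Returns 0 on success, -1 on an unknown option or a missing argument. */
int rmac_parse_args(int argc, char **argv, struct rmac_options *opt)
{
    static const struct option long_opts[] = {
        { "map",       required_argument, NULL, 'm' },
        { "kerntypes", required_argument, NULL, 'k' },
        { "server",    no_argument,       NULL, 's' },
        { "debug",     required_argument, NULL, 'g' },
        { "help",      no_argument,       NULL, 'h' },
        { NULL, 0, NULL, 0 }
    };
    int c;

    assert(argv && opt);
    opt->map_file = opt->kerntypes_file = NULL;
    opt->server_mode = opt->debug_level = opt->show_help = 0;
    optind = 1;  /* start scanning at the first argument */
    opterr = 0;  /* report errors through the return value only */
    while ((c = getopt_long(argc, argv, "m:k:sg:h", long_opts, NULL)) != -1) {
        switch (c) {
        case 'm': opt->map_file = optarg; break;
        case 'k': opt->kerntypes_file = optarg; break;
        case 's': opt->server_mode = 1; break;
        case 'g': opt->debug_level = (int)strtol(optarg, NULL, 10); break;
        case 'h': opt->show_help = 1; break;
        default:  return -1;
        }
    }
    return 0;
}
```

With this sketch, an invocation such as "rmac -m System.map -k Kerntypes -s" would yield both file names and server_mode set to 1.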

During recovery from a fault-time checkpoint of the cluster fault-tolerance system and method of the present invention, the recovering process registers with the remote checkpoint server in order to obtain the topology information of the parallel application to which it belongs. This is because the remote checkpoint server holds the latest process state information of the registered application (including information on processes that are still being remotely checkpointed), whereas MPIRUN is a Perl script that, apart from setting parameters in the process recovery command, cannot easily communicate with the process being recovered. In the process recovery command, MPIRUN passes the address of the remote checkpoint server to the process being recovered; the format of this command is as follows:

    rsh NEW_NODE "cd CTCKR_IMG_PATH; \
    ctck_restart -r CTCKR_SERV_MAC:CTCKR_SERV_PORT ctckr.TARGET_PID"
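For illustration, a command string of this form could be assembled as follows. The function ctckr_build_restart_cmd is hypothetical, the sketch assumes that the image file name ctckr.TARGET_PID is a separate argument after the server address, and in the described system the command is actually produced by the MPIRUN Perl script rather than by C code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the recovery command for one process.
 * Returns 0 on success, -1 if an argument would break the quoting
 * or the result does not fit in the buffer. (Hypothetical helper.) */
int ctckr_build_restart_cmd(char *buf, size_t len,
                            const char *new_node, const char *img_path,
                            const char *serv_mac, unsigned serv_port,
                            long target_pid)
{
    int n;

    assert(buf && new_node && img_path && serv_mac);
    /* The remote command is wrapped in double quotes, so reject
     * arguments that themselves contain a double quote. */
    if (strchr(new_node, '"') || strchr(img_path, '"') || strchr(serv_mac, '"'))
        return -1;
    n = snprintf(buf, len,
                 "rsh %s \"cd %s; ctck_restart -r %s:%u ctckr.%ld\"",
                 new_node, img_path, serv_mac, serv_port, target_pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```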

It should be noted that this embodiment uses the Myrinet/MX communication system as the implementation platform of the cluster fault-tolerance system and method of the present invention, but the invention can equally be implemented on other communication systems, such as QsNet or InfiniBand.

The remote checkpoint operation of the present invention is described in detail below. It comprises a remote checkpointing system and method, and a remote checkpoint recovery system and method.

Checkpointing is a fault-tolerance technique widely used for both rollback and roll-forward error recovery, and its underlying idea is simple. The mechanism can be implemented at the level of the processor, physical memory, virtual memory, and so on; in clusters built from COTS components, it is usually implemented in software layers such as the operating system and application programs. The main function of checkpointing is to reduce the time lost to a single fault. In fault-tolerant schemes that use process duplication (Process/Task Duplication), such as DMR-F (Double Modular Redundancy with Forward Recovery), TMR-F (Triple Modular Redundancy with Forward Recovery), and RFCS (Roll-Forward Checkpointing Scheme), checkpoints also serve as a convenient and accurate means of fault detection. Beyond fault tolerance, the technique is also used for load balancing, job scheduling, and system maintenance. In engineering practice, the OS/360 operating system released by IBM in the late 1960s already provided applications with checkpoint and restart support. As users' demands for reliability and availability keep rising, the technique has become common on high-end scientific and engineering computing platforms.

In the present invention, process checkpointing and recovery means saving the running state of a target process at some moment and, at a later moment, rebuilding the process from that state so that it continues to run. The saved process state is called the checkpoint of the process; the operation of saving a checkpoint is called checkpointing (Checkpointing); and the operation of rebuilding the process from a checkpoint so that it can continue to run is called recovery (Recovery or Restart). For checkpoint operations performed periodically while an application runs, the time between two consecutive operations is called the checkpoint interval (Checkpoint Interval).

The content of a process checkpoint includes not only the basic process attributes and the current contents of storage areas in the user address space, such as the data segment, stack segment, and heap, but also the current state of the various system resources used for interprocess communication and I/O, such as created sockets, semaphores, shared memory, message queues, and open files.

Although the idea of process checkpointing and recovery is simple, implementing it is often complex. Taking Linux, the operating system platform commonly used in cluster computing, as an example, no checkpoint and recovery system so far can checkpoint and recover an arbitrary process at an arbitrary time. The difficulty has two main causes. First, operating system design is increasingly complex: as it provides more support for running and managing processes, the content of a process checkpoint keeps growing. Second, for processes that communicate, perform file I/O, or interact with the outside world, checkpointing and recovery must take special account of the effect on that outside world, including other processes, the storage system, and users.

According to the level at which they operate, checkpoints can be divided into system-level checkpoints and user-level checkpoints; among user-level checkpoints, the main one is the file checkpoint.

A system-level checkpoint saves and restores process state in the kernel layer, by modifying the operating system or loading a kernel extension module.

User-level checkpointing saves and restores the state of the target process within the process's own user-mode context.

The main purpose of a file checkpoint is to allow the state changes a file undergoes between two adjacent checkpoints to be rolled back, at as little cost as possible and in a way that is as transparent to the application as possible.

With the development of cluster file systems, the checkpoint system and method of the present invention use the snapshot function (Snapshot) of log-structured file systems (Log-Structured File System) such as WAFL in place of file checkpoints. This approach makes full use of existing functionality in a specific file system, is convenient to implement, and is efficient; in particular, it avoids the problem that the cost of file checkpoints becomes significant when the checkpoint interval is long. For a large cluster application whose processes perform file I/O in a shared file system over a network storage system, this approach is even more efficient.

The remote checkpointing system and method of the present invention are based on remote direct memory access (Remote Direct Memory Access, RDMA); they require no participation from the CPU or operating system of the node on which the target process resides.

Remote memory access (Remote Memory Access, RMA) is a data transfer mode, implemented through hardware mechanisms such as shared memory, a communication coprocessor, a DMA controller, or hardware remote Put/Get operations, that lets one node directly read or write a storage region of another node.

The remote checkpointing system and method of this embodiment exploit the fact that RDMA communication needs no participation from the CPU or operating system of the destination node. In this embodiment, as one possible implementation, RDMA is an RDMA read extension of the MX communication system based on the LANai communication coprocessor of the Myrinet network. This RDMA read can read the memory at a specified physical address in the target node.

The complete state of a process may be spread across the general-purpose registers, floating-point registers, and data cache of the CPU, as well as memory and disk. This embodiment relies on the hardware cache coherence protocol (Cache Coherence Protocol, CCP) so that RDMA operations read the cached content.

The CCP is prior art that those skilled in the art can reproduce from the description of the present invention, so it is not described in detail in this embodiment.

The remote checkpointing system of the present invention can be regarded as a special kind of kernel-level checkpoint mechanism. During remote checkpointing, the process state information that the checkpoint tool accesses remotely is essentially the same as that accessed by kernel-level checkpoint systems such as BLCR. The remote checkpoint system of the present invention uses the same checkpoint image format as BLCR.

The remote checkpointing system of the present invention differs from a kernel-level checkpoint mainly in three respects: addressing the kernel symbols of the operating system on the target process's node, caching and prefetching data, and pointer operations. These three respects also correspond to three modules in the design and implementation of the remote checkpoint.

The remote checkpointing system of the present invention comprises a kernel symbol addressing module, which addresses the kernel symbols of the operating system on the target process's node; a data cache module, which caches and prefetches data; and a pointer module, which handles pointer operations.

The kernel symbol addressing module is described in detail below.

To extract the state of the target process, the remote checkpoint system must access various variables and data structures of the operating system kernel that manages the target process. On the platform of this embodiment, as one implementation, the remote checkpoint system of the present invention uses two files generated when the Linux kernel is compiled: System.map and Kerntypes.

The former is a table mapping all Linux kernel symbols, including all kernel variables and functions, to their virtual addresses;

The latter contains the type description information of all data structures in the Linux kernel. Combining the contents of these two files, the virtual address and length of every kernel variable, and of any member of a data structure, can be calculated.

In addition, the mapping between the virtual addresses used by the Linux kernel and the physical addresses used by RDMA operations is fixed, so the remote checkpoint system of the present invention can access the various kernel symbols of the target operating system.
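The address calculation described above can be sketched as follows. The sketch parses one System.map line of the form "ADDRESS TYPE NAME", adds a member offset, and converts the resulting kernel virtual address to a physical address. The two mapping constants are the values believed to apply to x86_64 Linux 2.6.12 and should be treated as assumptions; the function names are hypothetical, and the symbol address and member offset used in any call are example values, since real offsets would come from Kerntypes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed constants for x86_64 Linux 2.6.12. */
#define START_KERNEL_MAP 0xffffffff80000000ULL  /* kernel image mapping */
#define PAGE_OFFSET_DM   0xffff810000000000ULL  /* direct mapping of RAM */

/* Parse one System.map line; store the address if it names `sym`.
 * Returns 0 on a match, -1 otherwise. */
int sysmap_match(const char *line, const char *sym, uint64_t *vaddr)
{
    unsigned long long addr;
    char type;
    char name[128];

    assert(line && sym && vaddr);
    if (sscanf(line, "%llx %c %127s", &addr, &type, name) != 3)
        return -1;
    if (strcmp(name, sym) != 0)
        return -1;
    *vaddr = addr;
    return 0;
}

/* Translate a kernel virtual address to the physical address used by RDMA.
 * Returns 0 on success, -1 for addresses outside the two linear mappings. */
int kvirt_to_phys(uint64_t vaddr, uint64_t *paddr)
{
    assert(paddr);
    if (vaddr >= START_KERNEL_MAP)
        *paddr = vaddr - START_KERNEL_MAP;   /* static data and BSS */
    else if (vaddr >= PAGE_OFFSET_DM)
        *paddr = vaddr - PAGE_OFFSET_DM;     /* dynamically allocated memory */
    else
        return -1;                           /* user-space address */
    return 0;
}
```

The address of a structure member is then the symbol address plus the member offset, passed through kvirt_to_phys before the RDMA read is issued. Addresses in the kernel module area are not linearly mapped and would need separate handling, which this sketch omits.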

In this embodiment, the Linux kernel data structures that the remote checkpointing system needs to access fall into two classes by storage location: those in the statically allocated kernel data and BSS segments, and those in memory dynamically allocated by the kernel. The first class corresponds to initialized or uninitialized global variables, whose identifiers and addresses can be found directly in System.map; the addresses of the second class must be found indirectly through known global variables. For example, to find the process control block (task_struct structure) of the process with process number P, the system first reads the global variable init_task of type task_struct in the Linux kernel, then follows the tasks field of type list_head in the task_struct structure to visit each successive entry in the list of process control blocks, until it finds the process whose process number equals P. Once the process control block of the target process has been obtained, the various system resources belonging to the process can be accessed one by one through the pointers it contains, as shown in Figure 3.
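The list walk described above can be sketched as follows, with a small local array standing in for the memory of the target node. Everything in the sketch is an assumption made for illustration: the member offsets OFF_PID and OFF_TASKS_NEXT, the simulated memory sim_mem, and the function names. A real tool would take the offsets from Kerntypes and replace sim_rdma_read with an actual RDMA read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical member offsets inside the simplified task record. */
#define OFF_PID        0u
#define OFF_TASKS_NEXT 8u

/* Stand-in for the target node's memory. */
static uint8_t sim_mem[256];

static void sim_rdma_read(uint64_t addr, void *buf, size_t len)
{
    assert(addr + len <= sizeof sim_mem);
    memcpy(buf, sim_mem + addr, len);
}

/* Helper that builds one record of the simulated circular list. */
void sim_put_task(uint64_t addr, int32_t pid, uint64_t next_task)
{
    /* As with list_head, the link points at the member inside the
     * next record, not at the start of that record. */
    uint64_t next_field = next_task + OFF_TASKS_NEXT;

    memcpy(sim_mem + addr + OFF_PID, &pid, sizeof pid);
    memcpy(sim_mem + addr + OFF_TASKS_NEXT, &next_field, sizeof next_field);
}

/* Walk the circular list starting at init_task.
 * Returns the address of the record with the given pid,
 * or UINT64_MAX if the list contains no such process. */
uint64_t find_task(uint64_t init_task, int32_t pid)
{
    uint64_t cur = init_task;

    do {
        int32_t cur_pid;
        uint64_t next_field;

        sim_rdma_read(cur + OFF_PID, &cur_pid, sizeof cur_pid);
        if (cur_pid == pid)
            return cur;
        sim_rdma_read(cur + OFF_TASKS_NEXT, &next_field, sizeof next_field);
        cur = next_field - OFF_TASKS_NEXT;  /* back to the record start */
    } while (cur != init_task);
    return UINT64_MAX;
}
```

The subtraction of the member offset after each link is the remote equivalent of the kernel's container_of idiom; omitting it would make every subsequent read land at the wrong offset.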

For the user address space of the target process, the remote checkpoint system must perform accesses according to the virtual address translation mechanism of the CPU in the target node. To improve checkpoint performance, the system of the present invention reads the user space of the target process in units of pages. On the platform of this embodiment, as one implementation, the Opteron CPU uses paged memory management based on four-level page tables, as shown in Figure 4. The remote checkpoint system of the present invention first reads the physical page holding the PGD (Page Global Directory) table, using the page directory base address contained in the memory management structure mm_struct. It then looks up the corresponding tables in turn, following the steps shown in Fig. 4, using the PGD index, PUD (Page Upper Directory) index, PMD (Page Middle Directory) index, and PTE (Page Table Entry) index of the virtual address, until it finds the physical page holding the user-space virtual address.
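The index extraction used in such a walk can be sketched as follows for x86_64 four-level paging with 4 KB pages, where each level is indexed by 9 bits of the virtual address and each table entry is 8 bytes. The function and type names are hypothetical, and the sketch ignores large pages, which a complete implementation would have to detect at the PMD level.

```c
#include <assert.h>
#include <stdint.h>

#define PT_INDEX_MASK 0x1ffu                 /* 9 index bits per level */
#define PT_ADDR_MASK  0x000ffffffffff000ULL  /* bits 12..51 of an entry */
#define PT_PRESENT    0x1ULL                 /* bit 0: entry is present */

struct va_indices {
    unsigned pgd, pud, pmd, pte, offset;
};

/* Split a virtual address into its four table indices and page offset. */
struct va_indices va_split(uint64_t va)
{
    struct va_indices ix;

    ix.pgd    = (unsigned)((va >> 39) & PT_INDEX_MASK);
    ix.pud    = (unsigned)((va >> 30) & PT_INDEX_MASK);
    ix.pmd    = (unsigned)((va >> 21) & PT_INDEX_MASK);
    ix.pte    = (unsigned)((va >> 12) & PT_INDEX_MASK);
    ix.offset = (unsigned)(va & 0xfffu);
    return ix;
}

/* Physical address of entry `index` in the table based at `table_phys`;
 * this is the address the next remote read would target. */
uint64_t pt_entry_addr(uint64_t table_phys, unsigned index)
{
    assert(index <= PT_INDEX_MASK);
    return (table_phys & PT_ADDR_MASK) + (uint64_t)index * 8u;
}

/* Extract the next-level table (or final page) base from an entry.
 * Returns 0 on success, -1 if the entry is not present. */
int pt_entry_next(uint64_t entry, uint64_t *next_phys)
{
    assert(next_phys);
    if (!(entry & PT_PRESENT))
        return -1;
    *next_phys = entry & PT_ADDR_MASK;
    return 0;
}
```

A full walk would call pt_entry_addr and pt_entry_next four times, once per level, issuing one remote read per level unless the table page is already cached locally.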

The data cache module of the present invention is described in detail below.

Accessing a remote node's memory by RDMA is far slower than accessing memory within a node. To address this, the remote checkpoint system of the present invention uses a caching mechanism and a data prefetching mechanism to reduce the impact of RDMA access on checkpoint performance.

Memory accesses during remote checkpointing exhibit high data locality. There are many repeated RDMA reads with identical destination address and length; simply caching what has been read for later lookups reduces the number of RDMA operations to 25% of the original. There are also many RDMA reads to adjacent destination addresses; setting a minimum RDMA read length greatly reduces the number of RDMA reads needed for a remote checkpoint.
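The combined effect of caching and a minimum read length can be sketched as follows. The sketch is a simplified model under stated assumptions: the block size MIN_READ, the direct-mapped cache of CACHE_SLOTS entries, the simulated remote memory, and all function names are hypothetical, and the cache organization of the described system is not specified in this text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MIN_READ    64u  /* hypothetical minimum RDMA read length, bytes */
#define CACHE_SLOTS 16u  /* hypothetical number of cached blocks */

uint8_t sim_remote_mem[1024];      /* stand-in for the target node */
static unsigned rdma_reads;        /* simulated RDMA operations issued */

static void sim_fetch(uint64_t addr, void *buf, size_t len)
{
    assert(addr + len <= sizeof sim_remote_mem);
    memcpy(buf, sim_remote_mem + addr, len);
    rdma_reads++;
}

struct cache_slot {
    int valid;
    uint64_t base;
    uint8_t data[MIN_READ];
};
static struct cache_slot cache[CACHE_SLOTS];

/* Return the cached block starting at `base`, fetching it on a miss. */
static const uint8_t *get_block(uint64_t base)
{
    struct cache_slot *s = &cache[(size_t)((base / MIN_READ) % CACHE_SLOTS)];

    if (!s->valid || s->base != base) {
        sim_fetch(base, s->data, MIN_READ);  /* always a full block */
        s->base = base;
        s->valid = 1;
    }
    return s->data;
}

/* Read `len` bytes at remote address `addr` through the cache. */
void cached_read(uint64_t addr, void *buf, size_t len)
{
    uint8_t *out = buf;

    assert(buf);
    while (len > 0) {
        uint64_t base = addr - (addr % MIN_READ);
        size_t off = (size_t)(addr - base);
        size_t n = MIN_READ - off;

        if (n > len)
            n = len;
        memcpy(out, get_block(base) + off, n);
        addr += n;
        out += n;
        len -= n;
    }
}

unsigned rdma_read_count(void)
{
    return rdma_reads;
}
```

In this model, two reads of adjacent 4-byte fields and a repeated read of the first field cost a single simulated RDMA operation, because the first access already fetched the whole block.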

The pointer module is described in detail below.

The remote checkpointing system of this embodiment is in essence a kernel-level checkpoint, but with the characteristic that the operating system kernel managing the target process and the checkpoint system reside in separate contexts. Because a pointer is valid only within its own address space, when pointer variables in the operating system kernel that manages the target process are copied into the remote checkpoint server by remote reads, measures must be taken to prevent improper dereferences of these remote pointer variables from causing program errors.

In pointer comparison, the address of a data structure cached in the remote checkpoint server (the cache address) may end up being compared with the value of a pointer variable read directly from the target node's memory (the original address). The former is a virtual address in the user address space of the remote checkpoint server process; the latter is a virtual address in the kernel address space of the target node's operating system. Comparing the two is clearly meaningless. For this reason, the remote checkpoint system of the present invention maintains the correspondence between the original address and the cache address of every data structure involved in address comparisons.

Before the addresses of data structures from the target node are compared, the hash table holding this correspondence is consulted first, and every cache address is converted to the original address in the target node before the comparison is made.
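A minimal sketch of such a correspondence table is given below. It uses a fixed-size open-addressing hash table keyed by the cache address; the table size, the hash function, and all names are assumptions made for this example and are not taken from the described implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_BUCKETS 64u  /* hypothetical table size, a power of two */

struct addr_pair {
    const void *cache_addr;  /* address of the local cached copy */
    uint64_t orig_addr;      /* kernel virtual address on the target node */
    int used;
};
static struct addr_pair addr_map[MAP_BUCKETS];

static unsigned hash_ptr(const void *p)
{
    return (unsigned)(((uintptr_t)p >> 4) & (MAP_BUCKETS - 1u));
}

/* Record that the copy at cache_addr came from orig_addr.
 * Returns 0 on success, -1 if the table is full. */
int addr_map_insert(const void *cache_addr, uint64_t orig_addr)
{
    unsigned i, h = hash_ptr(cache_addr);

    for (i = 0; i < MAP_BUCKETS; i++) {
        struct addr_pair *e = &addr_map[(h + i) & (MAP_BUCKETS - 1u)];
        if (!e->used || e->cache_addr == cache_addr) {
            e->cache_addr = cache_addr;
            e->orig_addr = orig_addr;
            e->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Look up the original address of a cached copy; 0 on success. */
int addr_map_lookup(const void *cache_addr, uint64_t *orig_addr)
{
    unsigned i, h = hash_ptr(cache_addr);

    assert(orig_addr);
    for (i = 0; i < MAP_BUCKETS; i++) {
        const struct addr_pair *e = &addr_map[(h + i) & (MAP_BUCKETS - 1u)];
        if (!e->used)
            return -1;
        if (e->cache_addr == cache_addr) {
            *orig_addr = e->orig_addr;
            return 0;
        }
    }
    return -1;
}

/* Compare a cached structure with a pointer value read from the target
 * node by first converting the cache address to its original address. */
int same_remote_object(const void *cache_addr, uint64_t remote_ptr)
{
    uint64_t orig;

    return addr_map_lookup(cache_addr, &orig) == 0 && orig == remote_ptr;
}
```

A list traversal can use same_remote_object to detect that a remotely read next pointer refers back to the structure it started from, which a direct comparison of the two raw addresses could never detect.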

In pointer arithmetic, addressing an element of a complex data structure or array often involves pointer computation. If remote reads and local caching change the original relative positions of the data, the result may be data errors or even out-of-bounds pointers. The remote checkpoint system of the present invention therefore copies in full every data structure involved in pointer arithmetic, preserving the relative positions of its elements.

The remote checkpointing method of the present invention is described in detail below.

The basic procedure of remote checkpointing is as follows:

Step S10, load the operating system symbol table System.map of the target node.

Step S20, load the operating system kernel type table Kerntypes of the target node.

Step S30, find the process control block of the target process by its process number, and copy it into a local buffer.

Step S40, create the file header of the remote checkpoint image;

Step S41, save the basic attributes of the target process, such as PID, UID, EUID, GID, EGID, and name (comm[]).

Step S42, save the CPU state, including the state of the general-purpose registers, debug registers, and coprocessor.

Step S43, save the signal (Signal) handling information.

Step S44, save the virtual address space of the process according to the mm_struct structure.

i) First, save the start and end addresses of the code segment, data segment, heap, stack segment, and environment variable area;

ii) Then, save the virtual memory area (Virtual Memory Area, VMA) corresponding to each vma_area_struct. The physical pages of the data segment, heap, and stack segment are all read remotely, and every page whose contents are not all zero is saved into the checkpoint image.

Step S50, save the full paths of root, altroot, and the current working directory.

Step S60, save the file descriptor table of the target process.

Step S70, save, one by one, the basic information of the files opened by the target process.

Step S71, for ordinary files, this embodiment records information such as file name, access mode, length, and offset. Preferably, the ordinary files are read-only files and read/write files opened by memory mapping.

Step S72, for character devices, call the corresponding device freeze function according to the major and minor device numbers. In this embodiment, for example, the checkpoint of character devices such as the Myrinet communication port is invoked here.

Step S80, append the end marker to the remote checkpoint image file.

Correspondingly, the present invention also provides a remote checkpoint recovery system and method.

The checkpoint recovery system and method use the process state information saved in the checkpoint file to rebuild the target process so that it can correctly continue to run.

A checkpoint recovery system and method generally need to be implemented at the same level of the cluster as the corresponding checkpointing system and method; that is, kernel-level and user-level checkpointing should be paired with kernel-level and user-level recovery respectively. This is because process state information has different locations, formats, and access methods at different levels of a computer system. For the remote checkpointing system and method of the present invention, however, the checkpoint file format is compatible with that of the kernel-level checkpoint system BLCR, so, as one implementation, the remote checkpoint recovery system and method adopt essentially the same recovery procedure as BLCR.

The target process handled by the remote checkpointing system and method may have entered the operating system in either of two ways: by interrupt or by system call. The state of the target process when it is checkpointed does not affect the remote checkpointing procedure, but it may lead to differences in the initial phase of recovery.

For example, for Opteron CPU, by interrupting or the mode of system call when entering kernel mode, the function of some registers can be had any different.In order to reduce CPU passes in and out kernel mode in the system call process expense, the X86_64 processor is that flat sections memorymodel (Flat-Segment Memory Model) provides SYSCALL and two instructions of SYSRET.Under 64 long patterns (Long Mode), SYSCALL instruction meeting is saved in RCX with the RIP that points to next bar instruction, and loads new RIP values from low 64 of LSTAR register.LSTAR belong to the model particular register (Model-SpecificRegister, MSR).When the linux system initialization, this register has been written into the entry address of system call.When returning user's space by SYSRET, CPU obtains the RIP value again from RCX.If CPU enters kernel mode by interrupt mode, and return user's attitude by IRET instruction, the RCX register can not be used to.

Therefore, a process that entered kernel mode by an interrupt should, on recovery, resume from the instruction following the interrupted instruction. A process that entered kernel mode by a system call should, on recovery, re-execute the system call that was interrupted by the fault.
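The two resume rules above can be illustrated with a minimal sketch. It is not the patent's implementation: the `SavedRegs` record, its `orig_rax` and `via_syscall` fields, and the function name are hypothetical, and backing RIP up by the two-byte length of the SYSCALL instruction is an assumed (though common) way to make the call execute again.

```python
from dataclasses import dataclass

SYSCALL_INSN_LEN = 2  # the x86-64 SYSCALL opcode is the two bytes 0F 05


@dataclass
class SavedRegs:
    rip: int           # user-space instruction pointer saved at kernel entry
    rax: int           # RAX as saved in the checkpoint
    orig_rax: int      # hypothetical field: system call number recorded at entry
    via_syscall: bool  # hypothetical flag: True if entry was through SYSCALL


def prepare_resume(regs: SavedRegs) -> SavedRegs:
    """Return the register values the restored process should resume with."""
    if regs.via_syscall:
        # Assumption: re-execute the call by pointing RIP back at the SYSCALL
        # instruction and reloading RAX with the original call number.
        return SavedRegs(rip=regs.rip - SYSCALL_INSN_LEN,
                         rax=regs.orig_rax,
                         orig_rax=regs.orig_rax,
                         via_syscall=True)
    # Interrupt entry: the saved RIP already names the instruction to resume at.
    return regs
```

For example, a process checkpointed inside a system call whose saved RIP is 0x401002 would resume at 0x401000 with RAX holding the original call number, while a process checkpointed after an interrupt resumes with its registers unchanged.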

The remote checkpoint recovery system of the present invention comprises a state distinguishing module, which distinguishes the state of the target process so as to avoid misuse of registers such as RCX.

For example, on an Opteron CPU an interrupt vector is written into the RAX register when kernel mode is entered by an interrupt, and the system call number is written into the RAX register when kernel mode is entered by a system call, so the RAX register serves as the flag that distinguishes the state of the target process.

The remote checkpoint recovery system of the present invention also comprises a trampoline module, which restores the complete general-purpose register state so that the process always leaves kernel mode with the IRET instruction and returns from kernel space to user space.

The remote checkpoint recovery method of the present invention is described in detail below. It is the inverse of the remote checkpointing method and comprises the following steps:

Step S10', the checkpoint restart tool rmac_restart creates a child process and waits for it to finish or to exit abnormally.

This child process first creates as many threads as the process to be restored has, and then enters the operating system kernel through a system call. In the following steps, this system call reads the contents of the checkpoint file sequentially and rebuilds the restored process on top of the threads of the rmac_restart child process (also called the host process).

Step S20', the validity of the checkpoint file is checked against its header; if the versions of the operating system kernel, the checkpoint tool and so on given in the header do not match what is expected, recovery exits; otherwise it proceeds to the next step.

Step S30', basic attributes of the target process such as the PID, UID, EUID, GID, EGID and process name are reset.

Step S40', the CPU state part of the target process's kernel stack is restored, including the general-purpose registers and the debug registers.

Step S41', if the process state flag in the checkpoint indicates that the target process was checkpointed after a system call failed, the RIP and RAX of the target process are set so that the target process re-executes this system call after it resumes.

Step S42', otherwise the target process was checkpointed after entering kernel mode through an interrupt or exception, and the address of the trampoline routine that helps the target process return to user mode is set directly.

Step S50', the signal-handling information of the target process is restored.

Step S60', the mappings of all virtual memory areas of the host process are removed.

Step S70', the mappings of all virtual memory areas of the target process are loaded.

For data in the data segment, heap and stack segment, the checkpoint restart tool reads from the checkpoint file page by page and copies each page into the physical page allocated for the corresponding virtual address.

Step S80', the virtual address space descriptor of the target process is set, such as the start and end addresses of the code segment, data segment, heap, stack segment and environment variable area in the mm_struct structure.

Step S90', the paths of the root, altroot and current working directory of the target process are restored.

Step S100', every file in the close_on_exec file descriptor set (fd_set) of the host process is closed.

Step S110', the basic information of the files opened by the target process is restored one by one.

Step S111', for regular files, attributes such as access mode, length and offset are restored.

Step S112', for character devices, the corresponding device restore function is called according to the major and minor device numbers. The checkpoint recovery of a Myrinet communication port, for example, is invoked here.
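Step S112' amounts to a dispatch on the device number pair. The following is a minimal sketch of such a dispatch table, not the patent's code: the registry, the function names and the device numbers (220, 0) are all hypothetical and serve only to illustrate the lookup.

```python
from typing import Callable, Dict, Tuple

RestoreFn = Callable[[dict], str]

# Hypothetical registry: (major, minor) -> restore function for that device.
_restore_handlers: Dict[Tuple[int, int], RestoreFn] = {}


def register_restore(major: int, minor: int, fn: RestoreFn) -> None:
    """Register the restore function for one character device."""
    _restore_handlers[(major, minor)] = fn


def restore_char_device(major: int, minor: int, saved_state: dict) -> str:
    """Look up and call the restore function for a checkpointed device."""
    fn = _restore_handlers.get((major, minor))
    if fn is None:
        raise LookupError(f"no restore function for device {major}:{minor}")
    return fn(saved_state)


def restore_mx_port(saved_state: dict) -> str:
    # Placeholder for the MX port recovery described later in the text.
    return f"restored MX port {saved_state['port']}"


# The device numbers below are made up for the example.
register_restore(220, 0, restore_mx_port)
```

With this table, `restore_char_device(220, 0, {"port": 3})` calls the MX handler, and an unregistered device number raises `LookupError` instead of being silently skipped.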

The checkpointing and recovery system and method for a communication system according to the present invention, and the breakpoint recovery method of the communication protocol, are described in detail below. They correspond to the fault reporting and recovery of the communication system in the cluster fault-tolerance system and method of the present invention.

In general, a checkpoint system cannot drive the communication system of the target process into a designated state before a fault-time checkpoint operation, which raises two problems. First, how can the complete state of the target process's communication system be obtained? Second, how can inter-process communication that was interrupted by the checkpoint and recovery of the target process be guaranteed to continue? To address these two problems, the present invention examines how a cluster communication system can support fault-time checkpointing and recovery. This mainly comprises a checkpointing and recovery system and method for the communication device in a user-level cluster communication system, and a breakpoint recovery method for the communication protocol that allows a single process of a parallel application to be checkpointed and recovered.

Distributed-system checkpoint techniques for communication systems rest on the global consistency of the distributed system. In one embodiment of the present invention, the distributed system and its checkpoint technique for the communication system are based on the following system model:

1, a parallel application is a set of N (N ≥ 2) processes P_i (0 ≤ i < N) that execute the target program (Target Program); the processor running process P_i is denoted p_i.

2, message passing (Message Passing) is the only means of inter-process communication; there is no ideal global clock and no shared memory.

3, inter-process communication is reliable: messages are never corrupted, lost, or received more than once. In the present invention, messages produced by the inter-process communication of the parallel application are called computation messages, and messages produced for purposes such as checkpointing are control messages.

4, every inter-process communication link is first-in-first-out (First-In-First-Out, FIFO): for two messages sent from process P_i to P_j over one link, if message M_1 is sent before M_2, then M_1 must be received before M_2.

5, the failure model of a process is Fail-Stop: in the failed state, a process stops computing and communicating.

6, the j-th checkpoint of process P_i is denoted C_{i,j}.

The state of a parallel application running in a distributed system comprises not only the state of each of its processes but also the state of the messages produced by inter-process communication. Figure 5 shows a parallel application composed of three processes P_0, P_1 and P_2. The three lines c_0, c_1 and c_2 represent three state cuts (Cut) of this parallel application. A global checkpoint of the parallel application can be regarded as the set of checkpoints of its processes taken at the moments where they intersect a given state cut. When obtaining a global checkpoint of a parallel application, one always hopes that the chosen cut, like c_0, does not intersect any message; but unless special measures are taken, messages such as M1 and M2 are in practice hard to avoid. In the present invention, M1 is called an in-transit message (In-Transit Message) and M2 an orphan message (Orphan Message).

In the k-th global checkpoint of a parallel application, for a message M sent from P_i to P_j, if M has not yet been sent in checkpoint C_{i,k} but has already been received in C_{j,k}, then M is called an orphan message.

In the k-th global checkpoint of a parallel application, for a message M sent from P_i to P_j, if M has been sent in checkpoint C_{i,k} but has not yet been received in C_{j,k}, then M is called an in-transit message, also known as a lost message.

The essence of a globally consistent state in a distributed system is precisely the absence of orphan messages.
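The two definitions and the consistency criterion can be made concrete with a small sketch. It assumes, purely for illustration, that each process numbers its own local events and that a checkpoint is identified by the local event index at which it was taken; the `Message` record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Message:
    name: str
    src: int                  # index i of the sending process P_i
    dst: int                  # index j of the receiving process P_j
    send_idx: int             # local event index at P_i when M was sent
    recv_idx: Optional[int]   # local event index at P_j when M was received


def classify(msg: Message, cut: Dict[int, int]) -> str:
    """Classify msg against a cut given as {process index: checkpoint event index}."""
    sent = msg.send_idx <= cut[msg.src]
    received = msg.recv_idx is not None and msg.recv_idx <= cut[msg.dst]
    if received and not sent:
        return "orphan"        # recorded as received but not as sent
    if sent and not received:
        return "in-transit"    # recorded as sent but not as received
    return "not crossing"


def is_consistent(messages: Iterable[Message], cut: Dict[int, int]) -> bool:
    """A cut is globally consistent when it contains no orphan message."""
    return all(classify(m, cut) != "orphan" for m in messages)
```

In-transit messages do not break consistency, which is why the recovery method can handle them by retransmission, whereas a single orphan message makes the cut unusable as a global checkpoint.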

In implementing a parallel checkpoint system, the handling of communication system state falls roughly into two strategies. One is the black-box strategy. Here the checkpoint protocol is implemented above the communication system layer, and the designer and implementer of the checkpoint system need not care about the internal implementation of the underlying communication system. This makes the checkpoint protocol independent of the underlying communication system and gives it good portability. For example, in a blocking coordinated checkpoint protocol, the communication system state is drained (Quiesce) at the start of the checkpoint process, so the process checkpoint need not record any communication system state. In the non-blocking C&L protocol, the communication system state is not drained globally; instead, according to the marker information in the checkpoint protocol, the messages received within a designated time window are saved in order in the process checkpoint.

The embodiment of the invention describes the other strategy, which integrates the checkpoint system with the communication system in order to reduce the overhead of checkpointing and recovery and to improve flexibility. With this strategy the checkpoint process can make use of the internal state of the communication system, so that the state of inter-process communication, including the state of the communication device used by the process and the state of some of the messages the process has sent and received, becomes part of the process checkpoint. The embodiment presents checkpoint and recovery support for user-level communication systems, which are widely used in clusters.

In one embodiment, checkpointing and recovery of a user-level communication system are described using the MX communication system on a Myrinet network as the platform. It should be noted that this does not limit the invention, which can equally be applied to other communication systems.

Myrinet is a high-speed cluster interconnect introduced in 1994 by the U.S. company Myricom, Inc.

In the present invention, checkpoint support for the MX communication system has two parts: checkpointing and recovery of the communication device, and breakpoint recovery in the communication protocol; the former is the basis of the latter. The embodiment first discusses the former, that is, checkpointing and recovery of an MX port. To a process, an MX port is a special character device that it has opened, so checkpointing the MX port is one step of the fault-time process checkpoint.

Checkpointing an MX port first requires defining what the MX port state comprises. Each MX port has exactly one send request queue, short-message send buffer, medium-message send buffer, receive request queue, receive buffer and event queue, together with a port control structure and a send-handle linked list located on the Myrinet NIC. When the checkpoint protocol does not drain the channels, there may at any moment be a message whose complete control information and data reside in one or more of these structures. Therefore all of these structures must be saved in full when the MX port is checkpointed.

The embodiment describes, one by one, how the send request queue, short-message send buffer, medium-message send buffer, receive request queue, receive buffer, event queue, port control structure and send-handle list of the MX communication system are checkpointed, with emphasis on how the integrity of the saved content is guaranteed.

(1) Send request queue

The send request queue resides in NIC memory that is mapped into the user address space. The queue is logically organized as a ring; it is filled by the user process and polled and read by the MCP. The last step in filling a send request is for the user process to write the request type at the end of the send request structure, and the MCP decides whether a new send request is present by checking whether this field is empty (UREQ_NONE). When the MCP processes a send request, it creates a send handle from the contents of the request and then, in general, sets the request type field back to empty.

Checkpointing this structure cannot go wrong. First, when the MCP saves the structure, the user process has already stopped running, so the saved state cannot become stale because the user process keeps writing; this guarantees that no send request is lost. Second, creating a send handle from a send request structure and marking that request empty are performed within the same MCP subroutine, whose execution cannot be interrupted by a checkpoint process, so no send request is processed twice.

(2) Short-message send buffer

Like the send request queue, the short-message send buffer resides in NIC memory that is mapped into the user address space; it is about 14KB long. When handling a short-message send request, the MX communication library first copies the user data into free space in the short-message send buffer, and only then fills in the corresponding send request. When the MCP processes a short-message send request, it computes the address of the user data from the base address of this buffer in NIC memory and the offset of the user data. The MCP does not modify any content of this buffer and does not need to maintain any pointer for accessing it. Therefore the MX checkpoint only needs to save the content of this buffer in full.

(3) Medium-message send buffer

The medium-message send buffer in MX is 4MB of host memory, written by the user process and read by the MCP. In the address space of a process, this structure, the receive buffer and the event queue each occupy a separate virtual memory area (Virtual Memory Area, VMA). The embodiment adds flag information to the VMA structure so that these three structures can be identified during the process checkpoint. The buffer is allocated and used in units of pages. Because the completion order of medium-length message send requests addressed to different destinations cannot be guaranteed, this buffer does not use a ring structure. When handling a medium-length message send request, the MX communication library first copies the user data into free pages of the medium-message send buffer, and only then fills in the corresponding send request. When the MCP processes such a request, it starts a host-side DMA read of the user data using the physical address of the page that holds it. The MCP does not modify any content of this buffer, so the MX checkpoint only needs to save the content of this buffer in full.

During recovery of an MX port, the virtual address of this buffer can stay the same, but its physical addresses inevitably change, so the physical address of every page of the buffer has to be re-registered with the MCP. The receive buffer and the event buffer require the same treatment during recovery.

(4) Receive buffer

The receive buffer in MX is 8MB of host memory, written by the MCP and read by the user process. It is used in units of pages and is logically organized as a ring. When the MCP handles a network receive event, it allocates the next page of this buffer whenever the received user data is longer than RECV_INLINE_SIZE (43 bytes by default). In the corresponding user-level receive event, the MCP reports the location of the user data as the number of that page within the receive buffer. When handling receive events for short and medium-length messages, the MX communication library copies the user data once more, into the receive buffer supplied by the user.

After the MX communication library has finished reading the user data from a page of the receive buffer, it does not clear the page, for performance reasons. In a fault-time checkpoint of the communication system, every page of the user address space that has not been zeroed is saved to the checkpoint file, which means that a large amount of already processed content in the receive buffer would be saved as well. To avoid this and reduce the cost of saving the checkpoint, the embodiment takes the following measure, shown in Figure 6. The MCP increments recvq_vpage_index after filling a receive buffer page. The MX communication library increments recvq_offset after reading a receive buffer page. The physical address of this variable is registered with the MCP so that the MCP can read its current value during checkpointing. The MX checkpoint then only needs to save the hatched part in Figure 6.
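The saving described above can be sketched as a computation over the two indices. This is only an illustration under stated assumptions: `recvq_offset` and `recvq_vpage_index` are treated as monotonically increasing page counters that are reduced modulo the ring size, and the 4 KB page size (hence 2048 pages for an 8 MB buffer) is assumed rather than taken from the source.

```python
PAGE_SIZE = 4096                                     # assumed page size
RECV_BUFFER_PAGES = (8 * 1024 * 1024) // PAGE_SIZE   # 8 MB ring -> 2048 pages


def pages_to_save(recvq_offset: int, recvq_vpage_index: int,
                  num_pages: int = RECV_BUFFER_PAGES) -> list:
    """Ring slots filled by the MCP but not yet read by the MX library.

    recvq_offset      -- pages consumed so far by the library (read counter)
    recvq_vpage_index -- pages filled so far by the MCP (write counter)
    """
    unread = recvq_vpage_index - recvq_offset
    if not 0 <= unread <= num_pages:
        raise ValueError("inconsistent ring indices")
    # Walk from the read position to the write position, wrapping around.
    return [(recvq_offset + k) % num_pages for k in range(unread)]
```

For instance, with a read counter of 2046 and a write counter of 2050, only the four slots 2046, 2047, 0 and 1 would be saved instead of all 2048 pages.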

(5) Event buffer

The event buffer in MX is 128KB of host memory, filled by the MCP and read by the user process. Its allocation unit is 64 bytes, the length of one event structure. The event types filled in by the MCP include receive-completion events, send-completion events for short and medium-length messages, connection requests and replies, and so on. The MX communication library processes newly arrived events in the event buffer in order and clears the type field of each event structure to mark it as handled. Because this buffer, like the receive buffer, is a fairly large ring buffer, the present invention applies the same optimization to reduce the amount of data saved in the checkpoint.

(6) RDMA window

MX uses RDMA to transfer messages larger than 100KB, and the send buffer and receive buffer of an RDMA message must each be registered as an RDMA window. The detailed registration information of each RDMA window is kept in host memory and includes the RDMA window number, the registration sequence number, the window base address, the window length and the physical address of each page. The RDMA window information registered in the MCP comprises only the RDMA window number, the registration sequence number, the window length and the base address of the host-side registration page table. While sending or receiving an RDMA message, the MCP reads the physical addresses of only 8 or 32 pages at a time.

Checkpointing an MX port requires special handling of RDMA windows, so that an RDMA transfer interrupted by the checkpoint process does not cause a communication failure during recovery.

First, every RDMA window that was open when the MX port was checkpointed has to be re-registered when the MX port is recovered.

Second, if an RDMA request is found to be still incomplete during recovery of the MX port, the RDMA transfer is simply restarted, beginning with reading the window registration information from the host side.

(7) Related data structures in the MCP

The port control block in NIC memory is the most important structure recording the current state of an MX port. Its contents include the address of the send request queue, the address of the short-message send buffer, the registration information of the medium-message send buffer, receive buffer and event queue, the current write pointers of the latter two, the head pointer of the send-handle linked list, and so on. A send handle records the processing state of one send request; all send handles belonging to an MX port are kept in a singly linked list. During the checkpoint of an MX port, once the send and receive operations of the port have been frozen, this port information is read (remotely) and written into the process checkpoint image.

The checkpointing method for the communication system of the present invention is described in detail below, using MX communication as an example of device checkpointing.

The MX communication system presents the Myrinet NIC to user programs as a character device file. Thus, if the process checkpoint finds that the target process has opened an MX device, the checkpoint of the MX communication device can begin. In the embodiment, checkpointing an MX port comprises the following steps:

Step S100, the MX port state structure pointed to by the private_data field of the MX device file is read.

This structure resides in the kernel address space and contains the port number, the NIC number, the base addresses of structures such as the receive buffer and the event queue, and their current read pointers.

Step S200, a command to freeze the designated port is sent to the Myrinet NIC on which the target port resides.

Step S300, if the host-side target process has been confirmed to have stopped running, the send request queue and every send buffer that the user process can write are saved; otherwise a remote interrupt that suspends execution must first be sent to the target process.

The receive buffer and event queue in MX are filled by the MCP via DMA, and the MCP runs completely independently of the host-side process: even after the host-side process has stopped running because of normal process scheduling or a system crash, the MCP can still receive new messages and fill the corresponding host-side queues. To prevent loss of received data and to keep the data intact, it must be ensured that the communication protocol processor has stopped receive operations on the port before any host-side structure that it can write is saved.

The checkpoint recovery process of the communication system of the present invention is described below.

Recovery of an MX port from a checkpoint comprises re-creating the MX port, relocating the MX port, and resending messages; resending means re-executing, in order, all send handles saved in the MX port checkpoint.

(A) Re-creation of the MX port

When the MX port is re-created, the contents of the original MX port saved in the checkpoint are written to the corresponding locations of the newly created port, including the port control block in the MCP, the send request queue, the short-message send buffer, and the port state structure in the operating system kernel. The medium-message send buffer, receive buffer and event buffer, which reside in the user address space, are restored in the same way as any other region of the user address space.

(B) Relocation of the MX port

In recovering an MX port from a checkpoint, relocation of the port determines whether communication can continue after the port is restored. The issue involves several components of the MX communication system. At the communication firmware layer, each Myrinet NIC has a MAC address beginning with 00:60:DD, which is its unique identifier. Once a Myrinet NIC is connected to the Myrinet network, it can also be located uniquely by its position in the network. After detecting the network topology, the Myrinet mapper computes the routing information between every pair of NICs and writes it to a designated location in the communication firmware layer. During communication, the link state between an MX port and each other port is maintained in a separate Partner structure, an important data structure for reliable end-to-end communication. This structure mainly contains the send message sequence number, the receive message sequence number, the session index of the current link, the destination MAC address, and so on. Hence, in the MX communication system, the destination address visible to the user consists of three parts: the destination port number, the routing index of the destination NIC in the local MCP, and a pointer to the Partner structure corresponding to this destination address.

During recovery of an MX port, both the port number of the original communication port and the address of its NIC may change. The purpose of port relocation is to eliminate the effect of these changes on subsequent communication.

The breakpoint recovery method for the communication protocol of the present invention is described in detail below, using breakpoint recovery of MX communication as an example.

The breakpoint recovery method of the communication protocol supports communication that has been interrupted by causes such as process checkpointing. The idea is first to obtain a global checkpoint of a group of underlying communication systems that communicate with one another, and then to recover from it. In the MX communication system used in the embodiment, the method is implemented by extending the existing acknowledgement mechanism.

In a communication system that uses acknowledgements, the sender confirms that each message has been delivered by receiving a reply message. A message that is not acknowledged within a specified time generally has to be resent. This means the sender keeps a record of every message that has not yet been acknowledged. From the point of view of the communication system, a message that has been sent but not yet acknowledged can be regarded as an in-transit message, whereas an in-transit message in traditional checkpointing is one that the sending process has sent but the receiving process has not yet received. In what follows, in-transit messages are always meant in the communication-system sense. When obtaining a global checkpoint of the communication system, having the sender actively save its local in-transit messages is faster and more flexible than having the receiver wait passively for flush messages on all receive channels, and its overhead does not grow exponentially with system size, which makes it better suited to checkpointing large-scale cluster applications. The method adopted here for obtaining a global checkpoint of the communication system is characterized as follows: during checkpointing, each communication system records only its local state, avoiding global coordination; during recovery, in-transit messages are retransmitted and possible duplicate messages are filtered out.
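The retransmit-and-filter idea can be illustrated with a toy model of one FIFO link. It is a sketch, not the MX implementation: the `Sender` and `Receiver` classes and the use of a simple per-link sequence number are assumptions made for the example.

```python
class Sender:
    """Keeps every message that has been sent but not yet acknowledged."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}              # seq -> payload, i.e. in-transit messages

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload

    def ack(self, seq):
        self.unacked.pop(seq, None)    # a None or unknown seq is ignored

    def in_transit(self):
        """What a checkpoint saves, and what recovery retransmits, in order."""
        return sorted(self.unacked.items())


class Receiver:
    """Delivers messages in sequence order and drops duplicates."""

    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
            return seq                 # acknowledge the new message
        if seq < self.expected:
            return seq                 # duplicate: drop it, but acknowledge again
        return None                    # gap: cannot happen on a FIFO link
```

If the acknowledgement of a delivered message is lost before the checkpoint, the sender still lists that message as in transit and retransmits it on recovery; the receiver recognizes the old sequence number, discards the copy and acknowledges it again, so each payload is delivered exactly once.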

In the breakpoint recovery method of the present invention, the MCP that initiates a communication breakpoint actively notifies the other MCPs to stop sending messages to it, and actively notifies them to resume communication during breakpoint recovery. To reduce the impact of the communication breakpoint on the parallel application, communication that does not involve the initiator of the breakpoint continues until the breakpoint is recovered. Let the MCP that initiates the communication breakpoint be MCPi; the breakpoint recovery method then comprises the following steps:

MCPi:

1. Freeze the send and receive operations of the local communication port.

2. Save the state of the frozen port.

3. Broadcast M(CKPT, i) to the other MCPs.

--- process recovery ---

4. Broadcast M(WAKE, i) to the other MCPs.

MCP of a normal node:

On receiving M(CKPT, i):

1. Set MCP.state = PRECKPT.

2. Start checking the destination of each send request and the remaining space of the receive buffer. If the destination of the next message is MCPi, or the remaining space of the receive buffer is below a predetermined value:

(1) Freeze the send and receive operations of the local communication port.

(2) Suspend execution of the local process by an interrupt.

(3) Set the local MCP.state = CKPT. In this state, reply with M(NACK_CKPT, i) to every message received subsequently.

On receiving M(NACK_CKPT, i):

1. Freeze the send and receive operations of the local communication port.

2. Suspend execution of the local process by an interrupt.

3. Set the local MCP.state = CKPT.

On receiving M(WAKE, i):

1. If MCP.state = PRECKPT:

(1) If no M(CKPT, j) with j ≠ i has been received, set MCP.state = RUNNING.

2. If MCP.state = CKPT:

(1) Continue processing the send requests that were blocked because of M(CKPT, i).

(2) If no M(CKPT, j) with j ≠ i has been received:

(a) Set MCP.state = RUNNING.

(b) Wake the host process.
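The behaviour of a normal (non-initiating) node in the steps above can be sketched as a small state machine. This is an illustrative model only: the class and method names are hypothetical, and two details are assumptions not stated in the text, namely that a node already in the CKPT state stays there when a further M(CKPT, j) arrives, and that M(NACK_CKPT, i) also marks initiator i as pending.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    PRECKPT = "PRECKPT"
    CKPT = "CKPT"


class PeerMCP:
    """Model of one normal MCP; message transport itself is not modelled."""

    def __init__(self, min_recv_space=1):
        self.state = State.RUNNING
        self.pending = set()           # initiators i announced but not yet woken
        self.min_recv_space = min_recv_space
        self.host_running = True

    def _freeze(self):
        self.host_running = False      # port frozen, host process suspended
        self.state = State.CKPT

    def on_ckpt(self, i):              # received M(CKPT, i)
        self.pending.add(i)
        if self.state is State.RUNNING:
            self.state = State.PRECKPT

    def may_send(self, dest, recv_space):
        """Check performed before each send; may freeze the port."""
        if self.state is State.CKPT:
            return False
        if self.state is State.PRECKPT and (
                dest in self.pending or recv_space < self.min_recv_space):
            self._freeze()
            return False
        return True

    def on_data(self):
        """Reply to an incoming computation message."""
        return "NACK_CKPT" if self.state is State.CKPT else "ACK"

    def on_nack_ckpt(self, i):         # received M(NACK_CKPT, i)
        self.pending.add(i)            # assumption, see the lead-in
        self._freeze()

    def on_wake(self, i):              # received M(WAKE, i)
        self.pending.discard(i)
        if not self.pending and self.state is not State.RUNNING:
            self.state = State.RUNNING
            self.host_running = True   # wake the host process if it was suspended
```

A node stays frozen while any other initiator is still pending, which mirrors the "no M(CKPT, j) with j ≠ i" condition in the wake-up steps.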

In the implementation of this method, checkpoint control messages use reliable transmission, that is, every message must be acknowledged by the receiver with a reply message. Figure 7 is a schematic of the method with four processes. After MCP1, MCP2 and so on receive M(CKPT, 0), they enter the PRECKPT state, and communication among them can still continue, for example message M1. When MCP1 first tries to send message M2, it finds that the receiver, MCP0, is in the checkpoint state, and therefore enters the port-frozen state (MCP.state=CKPT). Later, when MCP1 receives message M3, it replies with M(NACK_CKPT, 0), which makes MCP3 enter the port-frozen state as well. When MCP1 receives M(WAKE, 0) and returns to the normal running state, it continues sending M2.

If the MCP that initiates the breakpoint can keep running, it can preferably also be used to maintain the consistency of the network state, saving one broadcast: instead of broadcasting M(CKPT, i) to tell the other MCPs to stop sending to it, MCPi notifies the sender of each message it receives, in the reply to that message, to stop sending to it. Further, by recording the destination of every M(CKPT, i) or M(NACK_CKPT, i) that each MCP sends, the broadcast of M(WAKE, i) can also be avoided, although this increases the overhead of breakpoint recovery.

Preferably, the present invention adopts the following binary-tree flooding algorithm to speed up the broadcasts in the above method.

1. For MCPi, which initiates the broadcast:

If i > 0, send M(CKPT, i) to MCP(i-1);

If i < N-1, send M(CKPT, i) to MCP(i+1).

2. For MCPj, which receives M(CKPT, i):

If 0 ≤ 2j-i < N, send M(CKPT, i) to MCP(2j-i);

If j < i and 2j-i-1 ≥ 0, send M(CKPT, i) to MCP(2j-i-1);

If j > i and 2j-i+1 < N, send M(CKPT, i) to MCP(2j-i+1).

3. Each MCP can finish only after all the M(CKPT, i) messages it has sent have been acknowledged.
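The forwarding rules above can be checked with a small simulation. The following C sketch is illustrative only: the function names and the MAX_RANKS bound are assumptions, not part of the described implementation, and acknowledgements (rule 3) are not modelled. It computes the ranks to which a given MCP forwards M(CKPT, i) and verifies that every other rank receives exactly one copy.

```c
#define MAX_RANKS 1024  /* illustrative upper bound, not from the source */

/*
 * Fills out[] with the ranks to which MCP j sends M(CKPT, i) and returns
 * how many there are (at most 2). j == i denotes the initiator.
 */
int ckpt_forward_targets(int i, int j, int n, int out[2])
{
    int count = 0;

    if (j == i) {                       /* rule 1: the initiating MCP */
        if (i > 0)
            out[count++] = i - 1;
        if (i < n - 1)
            out[count++] = i + 1;
        return count;
    }

    /* rule 2: an MCP that has received M(CKPT, i) */
    int same = 2 * j - i;
    if (same >= 0 && same < n)
        out[count++] = same;
    if (j < i && same - 1 >= 0)
        out[count++] = same - 1;
    if (j > i && same + 1 < n)
        out[count++] = same + 1;
    return count;
}

/*
 * Simulates the broadcast started by rank i among n ranks.
 * Returns 1 if every rank other than i receives exactly one message.
 */
int ckpt_broadcast_covers_all(int i, int n)
{
    int received[MAX_RANKS] = {0};
    int queue[MAX_RANKS + 1];
    int head = 0, tail = 0;

    if (n <= 0 || n > MAX_RANKS || i < 0 || i >= n)
        return 0;

    queue[tail++] = i;
    while (head < tail) {
        int j = queue[head++];
        int targets[2];
        int count = ckpt_forward_targets(i, j, n, targets);

        for (int k = 0; k < count; k++) {
            int t = targets[k];
            if (t == i || ++received[t] > 1)
                return 0;               /* duplicate or looped delivery */
            queue[tail++] = t;
        }
    }

    for (int r = 0; r < n; r++)
        if (r != i && received[r] != 1)
            return 0;
    return 1;
}
```

In this model an MCP at distance d from the initiator forwards to the ranks at distances 2d and 2d+1 on its own side, so each side of the initiator forms a binary tree and the number of forwarding rounds grows logarithmically with N.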

To support the case in which several nodes fail at the same time, if an MCP discovers, through a timeout while receiving or sending a CKPT message, that its target MCPk has failed, it takes over for MCPk and sends the two broadcast messages that the above algorithm requires of MCPk, as shown in Figure 8. Obviously, this is a recursive process.

The embodiment of the invention was tested for MX device checkpoint performance on the following platform. Each node has two-way 1.6 GHz AMD Opteron processors and 2 GB of memory. The network interface card is a Myricom PCIXD-2, whose communication coprocessor and memory run at 225 MHz. The node operating system kernel is Linux 2.6.12, the MPICH version is 1.2.6, and the communication system checkpoint is implemented on MX 1.1.0. The test program is a ping-pong test that exchanges messages between nodes intensively and without interruption. When checkpointing one MX port, the overheads of port freezing and port wake-up are about 16.0 microseconds and 8.6 microseconds respectively, and the overhead of saving the port state on the network interface card into the checkpoint image is about 54 microseconds, as shown in Figure 9.

The node monitoring of the present invention is described in detail below. It is a single-node checkpoint capture method, which may also be called the checkpoint capture method at single-node failure.

Checkpoint capture at single-node failure is the basis of checkpoint capture at cluster failure. Its implementation relies on node-level supporting techniques such as saving process state, detecting node failures, and a remote checkpoint mechanism based on remote memory access; it does not involve saving or restoring inter-process communication state.

Step S1000: saving the complete state of the process

The complete state of a process may comprise several parts located in different components such as the CPU, memory and disk. Given the nature of failure-time checkpointing, the process state held in each component must be obtained with a method suited to that component. State held in memory can still be accessed through RDMA even if the operating system of the target node has crashed. The general-purpose and floating-point registers inside the CPU cannot be read by instructions once the system has failed; the present invention therefore saves them actively while the process is running normally.

Step S1001: saving the CPU general-purpose registers

When the CPU changes privilege level, the current values of the registers are saved as part of the context switch. Taking the x86_64 code of Linux 2.6 as an example, when the CPU is about to enter kernel mode, the general-purpose registers and status registers are all saved at the top of the kernel stack of the current process, as shown in Figure 10.

Step S1002: saving the floating-point registers

The state held in the floating-point processor (Floating-Point Processor, FPU), which scientific computing applications use frequently, is usually larger. On the platform of this embodiment, the FPU state of the Opteron processor comprises sixteen 128-bit XMM registers (XMM0 to XMM15), eight 128-bit floating-point registers and several FPU control registers, 512 bytes in total. To avoid saving the FPU state on every process context switch, the x86/x86_64 processor family has a dedicated built-in optimization, commonly called lazy context switching. It relies on the fact that most processes in an operating system use the FPU rarely or not at all, and it defers saving the FPU state until the moment another process needs to use the FPU. When the TS (Task Switched) bit of the CR0 register is 1, an attempt by the CPU to execute an x87 instruction or a media instruction triggers the #NM (Device-Not-Available) exception. In handling this exception, the operating system first saves the FPU contents to the designated space of the process that owns them, and then loads the FPU contents of the current process.
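The deferred-save behaviour described above can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not kernel code: the structures, the names cpu_init, context_switch and use_fpu, and the single-byte "instruction" are all hypothetical; only the 512-byte state size, the TS bit and the #NM trap come from the text.

```c
#include <string.h>

#define FPU_STATE_BYTES 512   /* FPU state size stated in the text */
#define NO_OWNER (-1)

struct task {
    unsigned char fpu_image[FPU_STATE_BYTES];  /* per-process save area */
};

struct cpu {
    unsigned char fpu_regs[FPU_STATE_BYTES];   /* stands in for the hardware FPU */
    int ts;         /* models the CR0.TS bit */
    int fpu_owner;  /* task whose state is currently live in fpu_regs */
    int current;    /* index of the running task */
    int saves;      /* number of FPU saves actually performed */
};

void cpu_init(struct cpu *c)
{
    memset(c, 0, sizeof *c);
    c->fpu_owner = NO_OWNER;
    c->ts = 1;                  /* the first FPU use must trap */
}

/* Context switch: no FPU state is copied, only TS is set. */
void context_switch(struct cpu *c, int next)
{
    c->current = next;
    c->ts = 1;
}

/* Models the #NM handler: save the previous owner's state, load ours. */
static void handle_nm(struct cpu *c, struct task *tasks)
{
    if (c->fpu_owner != c->current) {
        if (c->fpu_owner != NO_OWNER) {
            memcpy(tasks[c->fpu_owner].fpu_image, c->fpu_regs, FPU_STATE_BYTES);
            c->saves++;
        }
        memcpy(c->fpu_regs, tasks[c->current].fpu_image, FPU_STATE_BYTES);
        c->fpu_owner = c->current;
    }
    c->ts = 0;
}

/* The running task executes one modelled FPU instruction. */
void use_fpu(struct cpu *c, struct task *tasks, unsigned char value)
{
    if (c->ts)
        handle_nm(c, tasks);    /* trap taken only when TS is set */
    c->fpu_regs[0] = value;
}
```

The model makes the consequence for failure-time checkpointing visible: until another task touches the FPU, the owner's newest floating-point state exists only in the CPU and not in its save area in memory.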

Step S1003: saving the CPU cache

Failure-time checkpointing does not rely on the CPU of the target node to perform any specific operation at the time of failure, which means that the contents of the CPU cache cannot be written back to memory in time before the checkpoint is captured. Because cache coherence in the x86 multiprocessor architecture is implemented through a bus snooping protocol, the latest contents of a given memory address can still be obtained during checkpoint capture. After a node failure, if the CPU receives a RESET signal, the state of all its registers is reset, and the data in its internal caches is invalidated immediately without being written back to memory.

Step S2000: detection of node failures

Accurate and efficient node failure detection is a prerequisite for preventing system failures and recovering from them quickly. In terms of implementation level, node failure detection can generally be implemented at two levels: the operating system and the application (process). In designing an operating-system failure detection system, the main issues come down to two aspects: the means of detection and the objects of detection. The present invention implements a coprocessor-based method that detects the operating-system state from several angles.

Step S2100: coprocessor-based node failure detection method.

Earlier node failure detection systems generally rely on the host CPU to self-check the state of the operating system, for example methods based on a software or hardware watchdog timer. Because they depend on the host CPU, these methods cannot handle situations such as the CPU running with interrupts disabled or the relevant control structures having been corrupted. To avoid these problems, and also to avoid the cost of purchasing dedicated node-monitoring hardware, the embodiment of the invention may use the communication coprocessor to monitor the host state. Communication coprocessors are now increasingly common in cluster high-speed communication systems, and their performance has improved quickly.

In the embodiment of the invention, the coprocessor-based node state monitoring method first registers the physical addresses of the given host-state monitored variables with the coprocessor. The coprocessor then reads each monitored variable automatically at regular intervals and compares it with a preset threshold to judge whether the node state is normal.

To implement this monitoring function, the embodiment uses the timer interrupt mechanism of the LanaiX communication coprocessor. The internal timer IT2 of the LanaiX decrements its assigned value at a frequency of 2 MHz and triggers the corresponding timer interrupt as soon as the value reaches 0. In this interrupt, the LanaiX issues DMA reads of the specified host memory addresses to obtain the current values of the selected monitored variables.

Step S2200: detection methods for operating-system failures.

Step S2210: timer interrupt counting

In general-purpose operating systems such as Unix and Linux, process scheduling, system resource monitoring and timekeeping all depend on timer interrupts. A failure of the timer interrupt is not only a major cause of catastrophic failures such as an operating-system hang, but also one of the main symptoms of such failures. In the Linux operating system, the timer interrupt frequency is generally set to between 100 and 1000 times per second. The jiffies/jiffies_64 variable, which is incremented on every timer interrupt, maintains the count of timer interrupts that the operating system has handled. The embodiment of the invention uses whether this variable changes in each monitoring period by an amount consistent with that period as the basic sign of whether the host operating system is normal.

The timer-interrupt counting method must pay particular attention to the precision of the coprocessor's monitoring period, to avoid false failure reports caused by drift in the length of the period.

The LanaiX processor has a real-time clock (Real-Time Clock, RTC) register with a frequency of 2 MHz, and its readings allow the length of each monitoring period to be recorded accurately. In the present invention, a fluctuation of less than 5 in the host timer-interrupt count per monitoring period is considered normal; otherwise the host operating system is considered to have suffered a serious denial-of-service failure.

The advantage of this method is that it requires no modification of the operating system, yet it can accurately detect serious denial-of-service failures of the operating-system kernel.
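The check described in step S2210 can be sketched in C as follows. This is a minimal sketch under several assumptions: the function name host_clock_ok, the 32-bit width of the RTC reading, and the reading of "fluctuation less than 5" as the absolute difference between the observed and expected jiffies increments are all choices made for illustration, not details confirmed by the text.

```c
#include <stdint.h>

#define RTC_HZ 2000000u        /* LanaiX RTC frequency stated in the text */
#define MAX_TICK_DEVIATION 5u  /* a deviation below 5 counts as normal */

/*
 * Returns 1 if the host timer-interrupt count advanced consistently with
 * the measured length of the monitoring period, and 0 otherwise.
 *
 * rtc_start/rtc_end:         RTC readings at both ends of the period
 *                            (assumed 32 bits wide, wrap-around tolerated)
 * jiffies_start/jiffies_end: host counter values read by DMA
 * host_hz:                   host timer interrupt frequency (100..1000)
 */
int host_clock_ok(uint32_t rtc_start, uint32_t rtc_end,
                  uint64_t jiffies_start, uint64_t jiffies_end,
                  uint32_t host_hz)
{
    /* Unsigned subtraction yields the elapsed ticks even across a wrap. */
    uint32_t rtc_elapsed = rtc_end - rtc_start;

    /* Timer interrupts expected during the measured period. */
    uint64_t expected = ((uint64_t)rtc_elapsed * host_hz) / RTC_HZ;
    uint64_t observed = jiffies_end - jiffies_start;

    uint64_t deviation = observed > expected ? observed - expected
                                             : expected - observed;
    return deviation < MAX_TICK_DEVIATION;
}
```

Measuring the period with the RTC rather than assuming its nominal length is what keeps drift in the coprocessor's timer from being reported as a host failure.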

Step S2220: operating-system failure counting

In one embodiment, this monitored variable is based on extensions to the failure handling of the Linux operating system. These extensions include checking and recording the return values of key kernel function interfaces and system call interfaces.

The kernel code of the Linux operating system contains many function interfaces that are vital to the normal operation of user processes or other kernel modules, for example the various system call interfaces and kernel memory management interfaces such as kmalloc() and kmem_cache_alloc(). A failed return from one of these interfaces often causes the calling module to malfunction or even to fail completely. In the original Linux kernel code, failed returns from these interfaces are generally handled rather simply, for example by returning an error code to the next level up, or by continuing without any check at all.

To support finer-grained and more timely detection of operating-system failures, the present invention provides checking of the return values of critical kernel interface functions. It defines the unlucky() macro for checking function return values.

#define unlucky(condition, level)               \
do {                                            \
    if (unlikely((condition) != 0)) {           \
        inc_danger_level(level);                \
    }                                           \
} while (0)

When a condition such as "the return value is NULL" in the first argument of unlucky() holds, its second argument is added to the operating-system failure count variable defined herein. The address of this variable is registered with the coprocessor that monitors the operating-system state, and the coprocessor reads it periodically.

The unlucky() macro is added to the code of several kernel modules such as memory management and process scheduling, so that the count reflects the number of system failures in a timely manner.

Compared with monitoring the timer interrupt count, this method supports a higher monitoring frequency, but it requires a small amount of modification to the operating-system code. The method can equally be used to count failures of user processes, for example by embedding failure counting in glibc. In that case the physical address of the failure count variable located in user space is registered with the coprocessor that monitors the system state.
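The use of unlucky() can be illustrated with a user-space sketch. Everything except the macro itself is an assumption made for illustration: the counter name danger_level, the body of inc_danger_level(), the stand-in allocator fake_alloc(), the weight 2 and the threshold check are hypothetical, and unlikely() is defined here through the GCC/Clang builtin because the kernel definition is not available in user space.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's unlikely() branch hint. */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

/* Failure count variable; in the text its address is registered with the
 * coprocessor. The name is hypothetical. */
static volatile unsigned long danger_level;

static void inc_danger_level(unsigned long level)
{
    danger_level += level;
}

#define unlucky(condition, level)               \
    do {                                        \
        if (unlikely((condition) != 0)) {       \
            inc_danger_level(level);            \
        }                                       \
    } while (0)

/* Hypothetical allocator that can be forced to fail. */
static void *fake_alloc(size_t size, int fail)
{
    static char pool[64];
    return (fail || size > sizeof pool) ? NULL : pool;
}

/* Call site instrumented the way the text describes for kernel interfaces. */
void *checked_alloc(size_t size, int fail)
{
    void *p = fake_alloc(size, fail);
    unlucky(p == NULL, 2);      /* weight 2 is an arbitrary example */
    return p;
}

/* What the monitoring side would read periodically. */
unsigned long read_danger_level(void)
{
    return danger_level;
}

/* Comparison against a preset threshold, as in step S2100. */
int danger_exceeds(unsigned long threshold)
{
    return danger_level > threshold;
}
```

Wrapping the macro body in do { } while (0) lets unlucky() be used as a single statement, including inside an unbraced if/else.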

Step S2230: detection of operating-system catastrophic failure codes.

The two variables described above for monitoring the operating-system state both reside in host memory and are read periodically by the coprocessor through DMA. The present invention also provides a third method of reporting the node failure state to the coprocessor. In this method, the host operating system actively writes a failure code into the memory of the coprocessor through PIO. Because accessing this failure code does not require reading host memory, the coprocessor can monitor the node state at a higher frequency. The method mainly suits cases in which the operating-system failure can be positively confirmed. At present it is mainly used to extend the function interfaces in the Linux operating system that already handle catastrophic failures, such as BUG() and panic().

Step S3000: state detection of the target process of a remote checkpoint

A kernel-level checkpoint system can conveniently use mechanisms such as signals and system calls to control and influence the running of the target process, so that it reaches a specified state before checkpointing begins. A remote checkpoint system cannot control the running of the target process so easily; therefore, before remote checkpointing begins, the target process may be in any of several states. Remote checkpointing may involve detecting target processes in the following two situations:

(1) Detection of processes on a failed node:

For a failed node detected only through the operating-system failure count mechanism, the target process may still be running, so the remote interrupt mechanism described above is needed to stop it. For a failed node detected through the operating-system timer interrupt count or a catastrophic failure code, the failure-time checkpoint system can no longer use the remote interrupt mechanism to control the state of any of its processes. Although the specific state of the remote checkpoint's target process cannot be confirmed, the nature of these failures makes it certain that the target process has entered kernel mode through a system call, an interrupt or an exception. As long as the target process is not running in user context, the remote checkpoint can perform the checkpoint operation on it.

(2) Detection of processes stopped by remote interrupt

So that remote checkpointing can also be used for purposes such as cluster management, a target process that is running normally can be stopped through the remote interrupt function. A remote interrupt is implemented by having the communication device raise a host interrupt after it receives a message carrying a special flag.

To avoid performing process scheduling inside the interrupt handler, a remote interrupt request structure is set up for each CPU. This structure is filled in by the remote interrupt service routine; on entering the scheduler, each CPU checks the structure to decide whether to stop or resume a given process immediately. Stopping the target process in the shortest possible time therefore means getting the CPU on which it runs into the process scheduler as soon as possible. In one embodiment, in a Linux 2.6 kernel that supports preemptive scheduling, the return of a CPU from any interrupt service routine is one such opportunity to enter the process scheduler.

The basic flow of the remote interrupt service routine is as follows:

1. Look up the process control block (PCB) corresponding to the specified process ID.

The embodiment of the invention denotes the CPU on which this process resides as CPUTarget, and the CPU executing the remote interrupt service routine as CPUIntr.

2. Fill in the remote interrupt request structure.

3. If CPUTarget = CPUIntr, directly set the NEED_RESCHED flag of the interrupted process; otherwise, send an inter-processor interrupt with interrupt vector RESCHEDULE_VECTOR to CPUTarget.

Subsequently, CPUTarget services the request registered in the remote interrupt request structure inside the process scheduler, and updates the request status flag in the structure after completing the request.
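The three steps above, together with the scheduler-side handling, can be modelled in user space as follows. This is a hedged sketch and not kernel code: the structures, field names, table sizes and the functions add_task, remote_irq_service and schedule_entry are hypothetical, and the inter-processor interrupt is modelled by a simple counter.

```c
#include <stddef.h>

#define NR_CPUS  4   /* illustrative sizes, not from the source */
#define NR_TASKS 8

enum req_state  { REQ_NONE, REQ_PENDING, REQ_DONE };
enum req_action { ACT_STOP, ACT_CONTINUE };

struct task {
    int in_use;
    int pid;
    int cpu;           /* CPU the task runs on (CPUTarget) */
    int need_resched;  /* models the NEED_RESCHED flag */
    int stopped;
};

/* One remote interrupt request structure per CPU. */
struct remote_irq_request {
    enum req_state state;
    enum req_action action;
    struct task *target;
};

static struct task tasks[NR_TASKS];
static struct remote_irq_request requests[NR_CPUS];
static int ipi_sent[NR_CPUS];  /* modelled RESCHEDULE_VECTOR interrupts */

void add_task(int slot, int pid, int cpu)
{
    if (slot < 0 || slot >= NR_TASKS || cpu < 0 || cpu >= NR_CPUS)
        return;
    tasks[slot].in_use = 1;
    tasks[slot].pid = pid;
    tasks[slot].cpu = cpu;
    tasks[slot].need_resched = 0;
    tasks[slot].stopped = 0;
}

static struct task *find_task(int pid)
{
    for (int i = 0; i < NR_TASKS; i++)
        if (tasks[i].in_use && tasks[i].pid == pid)
            return &tasks[i];
    return NULL;
}

/* Remote interrupt service routine running on cpu_intr (CPUIntr).
 * Returns 0 on success, -1 if the process is unknown. */
int remote_irq_service(int cpu_intr, int pid, enum req_action action)
{
    struct task *t = find_task(pid);            /* step 1 */
    if (t == NULL)
        return -1;

    struct remote_irq_request *r = &requests[t->cpu];
    r->target = t;                              /* step 2 */
    r->action = action;
    r->state = REQ_PENDING;

    if (t->cpu == cpu_intr)                     /* step 3 */
        t->need_resched = 1;
    else
        ipi_sent[t->cpu]++;
    return 0;
}

/* Called when `cpu` enters the process scheduler. */
void schedule_entry(int cpu)
{
    struct remote_irq_request *r = &requests[cpu];
    if (r->state != REQ_PENDING)
        return;
    r->target->stopped = (r->action == ACT_STOP);
    r->target->need_resched = 0;
    r->state = REQ_DONE;                        /* update the status flag */
}
```

Deferring the actual stop to schedule_entry() mirrors the stated design goal of performing no process scheduling inside the interrupt handler itself.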

The cluster fault-tolerance system, apparatus and method of the present invention provide a failure-time checkpoint mechanism for parallel applications on a cluster, intended to give parallel applications fast, localized failure recovery. A global checkpoint system must, while the parallel application is running normally, periodically write the state of all its processes to non-volatile storage and keep the inter-process communication state consistent. The core idea of the cluster failure-time checkpoint mechanism, by contrast, is to save during normal running only the state, such as the CPU state, that cannot be obtained once a node has failed. When a failure is found, a checkpoint is performed remotely on the processes of the failed node, the state of their communication system is saved, and the cluster communication protocol guarantees correct recovery after the communication interruption.

Other aspects and features of the present invention will be apparent to those skilled in the art from the description of the specific embodiments in conjunction with the accompanying drawings.

The specific embodiments of the invention described and illustrated above should be regarded as exemplary and not as limiting the invention; the invention should be interpreted according to the appended claims.

Claims

1. A cluster fault-tolerance system, characterized in that it comprises the following functional modules:

2. The cluster fault-tolerance system according to claim 1, characterized in that it further comprises the following functional modules:

3. A cluster fault-tolerance processing method, characterized in that it comprises the following steps:

Step A1. After collecting the basic information of all child processes, a parallel process manager sends an application registration request to a remote checkpoint server; after receiving this request, the remote checkpoint server sends node monitoring requests to the node failure detection modules of all relevant nodes according to the process information contained in the registration request;

Step B1. After receiving a valid failure report, the remote checkpoint server starts a remote checkpoint process; after the remote checkpoint process completes successfully, the remote checkpoint server sends the location of the resulting checkpoint image file and the corresponding process information to the parallel process manager;

Step C1. The parallel process manager determines the node to be used for process recovery and sends a process recovery command to the target node.

4. The cluster fault-tolerance processing method according to claim 3, characterized in that, after step A1 and before step B1, it further comprises the following steps:

Step S1. While the monitored application is running, the node failure detection module periodically and actively monitors the operating system of its own node for failures, and also receives and handles node failure recovery requests;

Step S2. When the node failure detection module detects a failure or receives a failure recovery request, it freezes the communication ports opened by all monitored processes on its node and sends a failure report to the preset remote checkpoint server, requesting the remote checkpoint server to perform a remote checkpoint on the specified processes.

5. A remote checkpoint capture system, characterized in that it comprises:

A data cache module, used for caching and prefetching data;

A pointer module, used for pointer operations: before the addresses of data structures in the target node are compared, the hash table that maintains the correspondence between original addresses and cache addresses is queried, and all cache addresses are converted to original addresses in the target node before the comparison is made.

6. A remote checkpoint capture method, characterized in that it comprises the following steps:

Step S10, load the operating-system symbol table of the target node;

Step S20, load the operating-system kernel type table of the target node;

Step S40, create the file header of the remote checkpoint image;

Step S60, save the file descriptor table of the target process;

Step S80, append an end marker to the remote checkpoint image file.

7. The remote checkpoint capture method according to claim 6, characterized in that step S40 further comprises:

Step S41, save the basic attributes of the target process;

Step S42, save the CPU state information, including the state of the general-purpose registers, the debug registers and the coprocessor;

Step S43, save the signal handling information;

Step S44, save the virtual address space of the process.

8. A remote checkpoint recovery system, characterized in that it comprises:

A state distinguishing module, used to distinguish the state of the target process, to avoid misuse of registers;

9. A remote checkpoint recovery method, characterized in that it comprises the following steps:

Step S30', reset the basic attributes of the target process;

Step S40', restore the CPU state portion in the kernel stack of the target process;

Step S100', close each file of the host process;

10. The remote checkpoint recovery method according to claim 9, characterized in that step S40' further comprises:

Step S41', if the process state flag in the checkpoint indicates that the target process was captured after a failed system call, set the RIP and RAX of the target process so that, after the target process resumes running, it re-executes this system call; otherwise, execute step S42';

Step S42', the target process was captured after entering kernel mode through an interrupt or an exception; directly set the address of the trampoline program that helps the target process return to user mode.

11. A checkpoint capture method for a communication system, characterized in that it comprises the following steps:

Step S100, read the port state structure pointed to by the communication device file;

12. A checkpoint recovery method for a communication system, characterized in that it comprises the following steps:

Step S100', port reconstruction: fill the contents of the original port saved in the checkpoint into the corresponding positions of the newly created port;

Step S200', port relocation.

13. A communication breakpoint recovery method, characterized in that it comprises the following steps:

Step 1. Freeze the send and receive operations of the local communication port;

Step 2. Save the state of the frozen port;

Step 3. Broadcast to the other MCPs.

14. A checkpoint capture method at single-node failure, characterized in that it comprises the following steps:

Step S1000, save the complete state of the process;

Step S2000, detect node failures;

Step S3000, detect the state of the target process of the remote checkpoint.

15. The checkpoint capture method at single-node failure according to claim 14, characterized in that step S1000 further comprises:

Step S1001, when the CPU changes privilege level, save the current values of the registers;

Step S1002, when a process needs to use the floating-point registers, save the state of the floating-point registers;

Step S1003, save the CPU cache.

16. The checkpoint capture method at single-node failure according to claim 14, characterized in that step S2000 further comprises:

Step S2100, register the physical addresses of the given host-state monitored variables with a coprocessor; the coprocessor reads each monitored variable automatically at regular intervals and compares it with a preset threshold to judge whether the node state is normal;

Step S2210, count timer interrupts: determine whether the operating system is normal by judging whether the variable that maintains the count of timer interrupts handled by the operating system changes, in each monitoring period, by an amount consistent with the monitoring period;

Step S2220, count operating-system failures;

Step S2230, detect operating-system catastrophic failure codes.

17. A method for detecting a process stopped by remote interrupt, characterized in that it comprises the following steps:

Step N2, fill in the remote interrupt request structure;