CN103150236A - Parallel communication library state self-recovery method for process failure faults

Info

Publication number: CN103150236A
Application number: CN2013100969203A
Authority: CN
Inventors: 廖湘科; 卢宇彤; 谢旻; 所光; 曹宏嘉; 蒋艳凰; 董勇; 陈海涛
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-03-25
Filing date: 2013-03-25
Publication date: 2013-06-12
Anticipated expiration: 2033-03-25
Also published as: CN103150236B

Abstract

The invention discloses a parallel communication library state self-recovery method for process failure faults. The method is implemented as follows: compute processes, spawned by a node management process at the request of a job management process, execute local computation that involves no communication; the job management process monitors for process failures, and when one occurs the node management process delivers a failure message to the compute processes; each compute process queries shared memory for the number of compute processes that failed in this event and the list of failed processes in the global communicator, and performs a fault recovery operation for the process failure fault accordingly; the failed parallel program is thereby restored to a consistent state. With this method, a parallel program does not abort or exit when a process failure fault occurs, the job management system does not need to reload the whole failed parallel program, and the failed compute process is recovered automatically. The method offers strong fault tolerance and high computational efficiency.

Description

Parallel communication library state self-recovery method for process failure faults

Technical field

The present invention relates to the field of parallel computing, and specifically to a parallel communication library state self-recovery method for parallel programs that use the message-passing programming model, applied after a process failure fault occurs while the parallel program is running.

Background art

With the development and spread of high-performance computing, parallel computers have in recent years come into ever wider use while also growing ever larger in scale. As the scale of parallel computers grows, the mean time between failures (MTBF) of parallel computing systems becomes shorter and shorter, so the probability that a parallel program (also called a parallel application or parallel task) fails while running becomes higher and higher.

A parallel computer comprises multiple compute nodes. Its resource management system is the interface through which users access the computer's computing resources: with the tools it provides, system administrators and users can check system status, submit jobs, load computing tasks, view historical information, and so on. The resource management system consists mainly of two parts: the job management process and the node management process. The two may run on the same physical node or on different ones. The job management process mainly receives jobs, allocates resources, and performs statistical analysis; the node management process mainly receives tasks from the job management process and spawns compute processes. Fig. 1 shows the relationship among the job management process, the node management processes, and the compute processes for a parallel computer made up of multiple compute nodes: the leftmost compute node runs the job management process, and each of the other compute nodes runs a node management process and one or more compute processes.

Message passing, as standardized by the Message Passing Interface (MPI), has become the mainstream programming model for parallel programs because of its low overhead and scalability. A message-passing parallel communication library is the foundation of this programming model. A single process failure causes the running parallel program to abort with an error. Automatic recovery of the communication library's state after a failure would therefore do much to improve the reliability and availability of parallel programs. However, the current mainstream message-passing parallel communication libraries, such as MPICH2 and OpenMPI, do not address recovery of the communication library state after a failure.

When writing a message-passing parallel program against the MPI standard, the programmer must first call the initialization function of the parallel communication library at the start of the program. During initialization, the library creates a global communicator and initializes the sequence number (called the rank) of the calling process in that communicator. The global communicator is one kind of communicator; in an MPI communication library it is represented by the process rank list MPI_COMM_WORLD. All other communicators are derived, directly or indirectly, from the global communicator. When a program sends or receives a message through the MPI parallel communication library, the sender and the receiver must use the same communicator. Internally, MPI uses an integer context to tell communicators apart, and the process rank list MPI_COMM_WORLD of the global communicator is likewise identified by its context inside the MPI parallel communication library. For example, in an MPI parallel task of four processes, each process creates the global communicator during initialization, but the processes have different ranks: 0, 1, 2, and 3. The processes of a parallel task can therefore identify one another by their ranks in a communicator, and can send messages to or receive messages from the other processes of that communicator by rank.
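The rule that a message is matched only within the communicator it was sent on can be sketched as follows. This is a minimal illustration in plain Python, not MPI code: the names Communicator and Mailbox, and the choice of context values 0 and 1, are hypothetical and are not taken from any MPI implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Communicator:
    context_id: int   # integer context that tells communicators apart
    ranks: tuple      # ranks of the processes that belong to it


@dataclass
class Mailbox:
    """Receive queue of one process; matching uses (context_id, source)."""
    queue: deque = field(default_factory=deque)

    def deliver(self, comm, source, payload):
        if source not in comm.ranks:
            raise ValueError("source rank is not in the communicator")
        self.queue.append((comm.context_id, source, payload))

    def receive(self, comm, source):
        # Return the first message sent on this communicator by this source.
        for i, (ctx, src, payload) in enumerate(self.queue):
            if ctx == comm.context_id and src == source:
                del self.queue[i]
                return payload
        return None  # nothing matching has arrived yet


world = Communicator(context_id=0, ranks=(0, 1, 2, 3))  # global communicator
sub = Communicator(context_id=1, ranks=(0, 1))          # derived communicator

box = Mailbox()
box.deliver(sub, source=1, payload="sent on sub")
box.deliver(world, source=1, payload="sent on world")

# Although the message on `sub` arrived first, a receive on `world`
# only matches the message that carries the context of `world`.
first = box.receive(world, source=1)
```

Both messages come from rank 1, yet the receive on the global communicator skips the earlier message because its context differs.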

The Linux operating system offers several inter-process communication mechanisms, including pipes, message queues, network interfaces, and shared memory. Shared memory is the fastest of these and has the lowest latency. Two processes that communicate through shared memory must create the shared memory segment before communicating, and the creation can be performed by only one of them. During communication, one side writes data into the shared memory and the other side sees it immediately; neither side interferes with the other's execution.

Fault models for parallel programs generally fall into two classes: the Byzantine fault model and the fail-stop fault model. In the Byzantine model, a failed process can drive other processes into erroneous states, for example by sending incorrect data. The Byzantine model can represent arbitrary faults, but such faults are very difficult to detect. In the fail-stop model, a failed process simply stops running and does not drive other processes in the system into erroneous states. The fail-stop fault model describes the situation in which a process of a parallel program hangs or crashes, and it is the common hardware fault model in parallel computing. Most fault-tolerance techniques used in high-performance computing follow the fail-stop model. This patent therefore adopts the fail-stop fault model and refers to the hanging or crashing of a process in a parallel program as process failure.
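The create-once, attach-by-identifier pattern of shared memory communication can be sketched with the Python standard library. This is only an illustration under stated assumptions: it uses multiprocessing.shared_memory rather than the operating system calls a communication library would call directly, and both handles live in one process here so that the example is self-contained, whereas in practice the creator and the peer are different processes.

```python
from multiprocessing import shared_memory

# Creator side: exactly one party creates the segment and zero-fills it.
seg = shared_memory.SharedMemory(create=True, size=16)
seg.buf[:16] = bytes(16)

# Peer side: attach to the existing segment using its identifier (its name).
peer = shared_memory.SharedMemory(name=seg.name)

seg.buf[0] = 7        # one side writes a byte into the segment
seen = peer.buf[0]    # the other side reads it; no message is exchanged

peer.close()          # each side detaches its own mapping
seg.close()
seg.unlink()          # the creator removes the segment when it is done
```

The peer needs only the identifier to attach, so the identifier can be handed over through any side channel, such as an environment variable.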

At present, the technical scheme of U.S. Patent Publication No. US7475274 B2, entitled "FAULT TOLERANCE AND RECOVERY IN A HIGH-PERFORMANCE COMPUTING (HPC) SYSTEM", describes a fault tolerance and recovery technique for high-performance computer systems. That scheme assumes an HPC system composed of management nodes, compute nodes, a high-performance communication network, and similar units; computing tasks run on the compute nodes, while resource management and monitoring services run on the management node. The patent analyzes how the resource management part of the system detects node failures, the policy for handling a failure, and resource allocation that takes the network topology into account. The scheme does not, however, address a state recovery policy for the parallel communication library after a fault.

In summary, no current patent or publication reports a method for automatically recovering the state of the parallel communication library after a process failure fault on high-performance parallel computers. This is a technical problem that developers of high-performance parallel programs and administrators of high-performance computers urgently need solved.

Summary of the invention

The technical problem to be solved by the present invention is to provide a parallel communication library state self-recovery method for process failure faults under which the program does not abort or exit after a process fails, the job management system does not need to reload the failed program, the failed compute process is recovered automatically, fault tolerance is strong, and computational efficiency is high.

To solve the above technical problem, the present invention adopts the following technical scheme:

A parallel communication library state self-recovery method for process failure faults, implemented in the following steps:

1) Start the job management process and the node management processes. The user submits a parallel task to the job management process, which allocates compute nodes according to the parallelism of said parallel task and notifies the node management processes. The job management process then monitors in real time whether any compute process has failed; if one has, it sends a failure message to the node management processes;

2) The node management process receives a request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared memory creation system call provided by the operating system, sets the initial value of the shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process. After the compute process has been created, the node management process monitors in real time for failure notification messages from the job management process; when it receives one, it passes the failure notification to the compute process through the shared memory;

3) The compute process first initializes the parallel communication library, reads said designated environment variable to obtain the identifier of the shared memory, and attaches to that shared memory with the shared memory attach system call provided by the operating system, using the identifier. It then performs local computation. During computation, the compute process calls the message-passing interface of the parallel communication library to transfer messages and to check the arrival and sending status of messages. At the same time, the compute process checks the attached shared memory to determine whether a process failure has occurred; if one has, it proceeds to step 4);

4) The compute process queries the attached shared memory to obtain the number of compute processes that failed in this event and the list of those failed processes in the global communicator. Based on said number and said list, it performs the fault recovery operation for the process failure fault, recovers the compute processes, and continues local computation.

As a further improvement of the technical scheme of the present invention:

In said step 4), the fault recovery operation for the process failure fault comprises the following detailed steps:

4.1) Clear the error state in the parallel communication library, deleting the messages that were sent or received after the fault occurred and before fault handling began;

4.2) Reorder the compute processes that remain in the global communicator after the fault. The reordering rule is that the rank of each surviving process located to the right shifts left; reordering the global communicator yields a shrunken global communicator;

4.3) The processes in said shrunken global communicator use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this event and the list of failed processes in the global communicator; the number of replacement processes equals the number of processes that failed in this event;

4.4) Combine the processes in said shrunken global communicator with the replacement processes to create a new global communicator. Said new global communicator has two parts: the left part holds the processes of the shrunken global communicator, and the right part holds the replacement processes;

4.5) Reorder the processes in said new global communicator so that the replacement processes fill the positions of the failed processes and each surviving process keeps the rank it had before the failure;

4.6) Once the failed processes have been successfully recovered, the compute process returns and continues computing.
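The rank bookkeeping of steps 4.2) to 4.5) can be sketched as list manipulation. This is a minimal model in plain Python, not MPI code: a communicator is modeled as a list whose index is the rank, and the function names and the labels "p0" or "replacement0" are hypothetical. In a real MPI library the spawning in step 4.3) would go through the library's own process spawning function (the MPI standard defines MPI_Comm_spawn for this purpose); the patent text does not name the functions it uses.

```python
def shrink(world, failed_ranks):
    """Step 4.2: drop failed ranks; survivors to their right shift left."""
    failed = set(failed_ranks)
    return [proc for rank, proc in enumerate(world) if rank not in failed]


def spawn_replacements(count):
    """Step 4.3: stand-in for spawning `count` replacement processes."""
    return [f"replacement{i}" for i in range(count)]


def merge(shrunk, replacements):
    """Step 4.4: survivors form the left part, replacements the right part."""
    return shrunk + replacements


def restore_order(merged, original_size, failed_ranks):
    """Step 4.5: replacements fill failed ranks; survivors keep old ranks."""
    failed = set(failed_ranks)
    n_survivors = original_size - len(failed)
    survivors = iter(merged[:n_survivors])
    replacements = iter(merged[n_survivors:])
    return [next(replacements) if rank in failed else next(survivors)
            for rank in range(original_size)]


def recover(world, failed_ranks):
    """Run steps 4.2 to 4.5 in order and return the repaired rank list."""
    shrunk = shrink(world, failed_ranks)
    replacements = spawn_replacements(len(set(failed_ranks)))
    merged = merge(shrunk, replacements)
    return restore_order(merged, len(world), failed_ranks)


world = ["p0", "p1", "p2", "p3"]
repaired = recover(world, failed_ranks=[2])
```

With four processes and a failure at rank 2, the survivor "p3" is at rank 2 in the shrunken list and returns to rank 3 after the final reordering, while the replacement takes rank 2.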

The parallel communication library state self-recovery method for process failure faults of the present invention has the following advantages. While the compute processes, spawned by the node management process at the request of the job management process, perform local computation that involves no communication, the job management process monitors for process failures and the node management process delivers the failure message to the compute processes. Each compute process queries the attached shared memory for the number of compute processes that failed in this event and the list of failed processes in the global communicator, and performs the fault recovery operation for the process failure fault based on that number and list, restoring the failed parallel program to a consistent state. As a result, the parallel program does not abort or exit after a process failure, the job management system does not need to reload the whole parallel program, and the failed compute process is recovered automatically. The method thus offers strong fault tolerance and high computational efficiency.

Brief description of the drawings

Fig. 1 is a schematic diagram of a prior-art parallel computer architecture.

Fig. 2 is a flow chart of the method of the embodiment of the present invention.

Fig. 3 is a schematic diagram of the principle of the fault recovery operation for a process failure fault in the embodiment of the present invention.

Fig. 4 is a schematic diagram of the logical structure of a compute process of a parallel task in the embodiment of the present invention.

Detailed description of the embodiments

This embodiment addresses the situation in which a parallel program on a parallel computer aborts with an error when a process fails. It proposes a parallel communication library state self-recovery method for process failure faults that autonomously recovers the communication state of a parallel program composed of multiple parallel processes after a failure, so that the parallel program continues to run after the fault. For convenience, a process of the parallel program is hereinafter called a compute process.

As shown in Fig. 2, the parallel communication library state self-recovery method for process failure faults of this embodiment is implemented in the following steps:

1) Start the job management process and the node management processes. The user submits a parallel task to the job management process, which allocates compute nodes according to the parallelism of the parallel task and notifies the node management processes. The job management process then monitors in real time whether any compute process has failed; if one has, it sends a failure message to the node management processes;

2) The node management process receives a request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared memory creation system call provided by the operating system, sets the initial value of the shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process (the name of the designated environment variable must differ from the names of the environment variables that already exist in the compute process; in this embodiment the designated environment variable is named FD_SHM_ID). After the compute process has been created, the node management process monitors in real time for failure notification messages from the job management process; when it receives one, it passes the failure notification to the compute process through the shared memory;

3) The compute process first initializes the parallel communication library, reads the designated environment variable to obtain the identifier of the shared memory, and attaches to that shared memory with the shared memory attach system call provided by the operating system, using the identifier. It then performs local computation. During computation, the compute process calls the message-passing interface of the parallel communication library to transfer messages and to check the arrival and sending status of messages. At the same time, the compute process checks the attached shared memory to determine whether a process failure has occurred; if one has, it proceeds to step 4);

4) The compute process queries the attached shared memory to obtain the number of compute processes that failed in this event and the list of those failed processes in the global communicator. Based on that number and list, it performs the fault recovery operation for the process failure fault, recovers the compute processes, and continues local computation.

The job management process of this embodiment receives a fault-tolerant parallel job from the user and uses the node management processes to spawn the job's group of compute processes on one or more nodes. If a failure occurs while the fault-tolerant job is running, the node management process or the job management process detects it and notifies the other compute processes of the job; those compute processes then repair, inside the parallel communication library, the communication library state damaged by the failure; finally, the failure is reported to the fault-tolerant parallel program.

Referring to Fig. 3, this embodiment uses four compute processes as an example. The node management process creates four compute processes for the current parallel job, "compute process 0", "compute process 1", "compute process 2", and "compute process 3", so the process rank list of the global communicator is {0, 1, 2, 3}. If "compute process 1" fails, "compute process 0", "compute process 2", and "compute process 3" each obtain the current failed-process list from their local shared memory; the list contains a single process, {1}. If "compute process 2" and "compute process 3" of the job fail, "compute process 0" and "compute process 1" obtain the failed-process list {2, 3} from the shared memory. This embodiment assumes that "compute process 2" fails during execution: by querying the attached shared memory, each compute process obtains the number of compute processes that failed in this event and the list of failed processes in the global communicator; the resulting number of failed compute processes is 1, and the failed-process list in the global communicator is {2}.
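How a failure count and a failed-process list might be laid out in the per-process segment can be sketched as follows. The layout is an assumption made for illustration only: the patent states that the segment starts as all zeros and that its identifier is passed in FD_SHM_ID, but it does not specify the byte format. The sketch assumes a 32-bit count followed by a fixed number of 32-bit rank slots, and uses a bytearray in place of the attached segment; MAX_FAILED and the function names are hypothetical.

```python
import struct

MAX_FAILED = 8                      # assumed capacity of the failure list
LAYOUT = f"<I{MAX_FAILED}I"         # assumed layout: count, then rank slots
SEGMENT_SIZE = struct.calcsize(LAYOUT)


def new_segment():
    """All-zero buffer, matching the initial value set at creation."""
    return bytearray(SEGMENT_SIZE)


def write_failure(buf, failed_ranks):
    """Node management side: publish the ranks that failed in this event."""
    ranks = sorted(failed_ranks)
    if len(ranks) > MAX_FAILED:
        raise ValueError("too many failed ranks for the assumed layout")
    slots = ranks + [0] * (MAX_FAILED - len(ranks))
    struct.pack_into(LAYOUT, buf, 0, len(ranks), *slots)


def read_failure(buf):
    """Compute process side: a count of zero means no failure was reported."""
    count, *slots = struct.unpack_from(LAYOUT, buf, 0)
    return count, slots[:count]


segment = new_segment()
before = read_failure(segment)      # nothing reported yet
write_failure(segment, [2])         # "compute process 2" failed
after = read_failure(segment)
```

Because an all-zero segment decodes to a count of zero, a compute process can poll the count cheaply between communication calls. A real implementation would also have to order the writes so that a reader never sees a nonzero count before the rank slots are filled in.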

In this embodiment, the fault recovery operation for the process failure fault in step 4) comprises the following detailed steps:

4.1) Clear the error state in the parallel communication library, deleting the messages that were sent or received after the fault occurred and before fault handling began. Referring to Fig. 4, each of "compute process 0" through "compute process 3" contains a send request queue and a receive message queue. To delete the messages sent or received after the fault occurred and before fault handling began, it suffices to empty the send requests in the send request queue and the received messages in the receive message queue. After step 4.1) has cleared the error state in the parallel communication library, the global communicator contains the three compute processes "compute process 0", "compute process 1", and "compute process 3", so its process rank list becomes {0, 1, 3}.

4.2) Reorder the compute processes that remain in the global communicator after the fault. The reordering rule is that the rank of each surviving process located to the right shifts left; reordering the global communicator yields a shrunken global communicator. In this embodiment, after step 4.2) reorders the compute processes, rank "3" shifts left to "2", so that "compute process 3" becomes "compute process 2" of the shrunken communicator; the shrunken global communicator, containing "compute process 0", "compute process 1", and "compute process 2", has the process rank list {0, 1, 2}.

4.3) The processes in the shrunken global communicator use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this event and the list of failed processes in the global communicator; the number of replacement processes equals the number of processes that failed in this event. In this embodiment, one "replacement process" is spawned to replace the failed "compute process 2".

4.4) Combine the processes in the shrunken global communicator with the replacement processes to create a new global communicator. The new global communicator has two parts: the left part holds the processes of the shrunken global communicator, and the right part holds the replacement processes. The result is a new global communicator whose left part is "compute process 0", "compute process 1", and "compute process 2" and whose right part is "compute process 3"; its process rank list is {0, 1, 2, replacement process 0}.

4.5) Reorder the processes in the new global communicator so that the replacement processes fill the positions of the failed processes and each surviving process keeps the process rank it had before the failure. After reordering, the replacement process "compute process 3" fills the position of the failed former "compute process 2", and the process rank list of the global communicator is {0, 1, 2, 3}.

4.6) Once the failed process has been successfully recovered, the compute process returns and continues local computation. When step 4.6) completes, the failed compute process has been spawned again and reinitialized in the global communicator. At this point the user data of the failed process is still lost. The programmer therefore needs to determine from the return value of the communication statement whether a failure occurred and, if so, recover the lost user data.
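The programmer's side of step 4.6), checking the return value of a communication call and restoring user data, can be sketched with a simple rollback to a saved copy. Everything named here is hypothetical: the patent does not give the return code, the communication call, or the way user data is saved, so ERR_PROC_FAILED, communicate, and the periodic in-memory checkpoint are assumptions chosen only to make the control flow concrete.

```python
SUCCESS = 0
ERR_PROC_FAILED = 1   # hypothetical code; the patent does not name one


def run(steps, checkpoint_every, fail_at):
    """Sum 1..steps; a simulated failure is reported once, at step `fail_at`."""
    state = {"step": 0, "total": 0}
    saved = dict(state)           # last copy of the user data
    pending_failure = fail_at
    restores = 0

    def communicate(step):
        # Stand-in for a communication call of the recovered library.
        nonlocal pending_failure
        if step == pending_failure:
            pending_failure = None
            return ERR_PROC_FAILED
        return SUCCESS

    while state["step"] < steps:
        nxt = state["step"] + 1
        if communicate(nxt) == ERR_PROC_FAILED:
            # The library has repaired the communicator; user data is the
            # programmer's job, so roll back to the last saved copy.
            state = dict(saved)
            restores += 1
            continue
        state["step"] = nxt
        state["total"] += nxt
        if nxt % checkpoint_every == 0:
            saved = dict(state)
    return state["total"], restores


result = run(steps=10, checkpoint_every=3, fail_at=5)
```

In this run the failure is reported at step 5, the state rolls back to the copy saved at step 3, and the computation redoes steps 4 and 5 before finishing with the same total as a fault-free run.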

In this embodiment, the job management process loads the job and detects process failures, and the node management process notifies the compute processes of a process failure through shared memory. Each compute process first recovers the damaged communication library state dynamically and automatically inside the communication library according to the notification; the user routine defined by the programmer then recovers the lost user data, restoring the failed parallel program to a consistent state. In a parallel computer composed of multiple compute nodes, the present invention allows a parallel program based on the message-passing programming interface to resume execution quickly after a process failure, without exiting and being reloaded, and thus solves the problem of recovery after a process failure in a parallel program running on a parallel computer.

To examine the effect of this embodiment, it was verified experimentally on the TH-1A supercomputer. Each TH-1A node is configured as follows: two six-core Intel Xeon 5670 CPUs with each core running at 2.93 GHz, giving a theoretical double-precision floating-point peak of 140 Gflops for the two CPUs; the bidirectional physical bandwidth of the communication network is 160 Gbps, and the bidirectional MPI communication bandwidth is 6.3 GB/s. The resource management system used in the test is SLURM-2.4.0, and the parallel program under test is adapted from the CG (Conjugate Gradient) test case of NPB-3.3-MPI. The CG program uses the conjugate gradient method to approximate the smallest eigenvalue of a large sparse symmetric positive-definite matrix. The parallel communication library of this embodiment is adapted from mpich2 and is called ft-mpich2. ft-mpich2 extends the error handling module of mpich2 by adding the fault recovery operation for process failure faults performed in step 4). The problem size of the CG program is class D, and the test parallelism is 1024. To create a fault while the parallel program is running, this embodiment uses the kill command provided by the Linux operating system to kill one randomly chosen compute process of the CG parallel program during its execution. Without the method of this embodiment, the whole CG program aborts with an error.

The experimental verification of this embodiment on the TH-1A supercomputer proceeds as follows:

Step 1: start the job management process and the node management processes, each with fault-tolerance support;

Step 2: submit the parallel job of the CG program (named "CG.D.1024") to the job management process;

Step 3: the job management process schedules the job onto compute nodes, and the node management processes spawn the compute processes;

Step 4: while the CG program is running, select a node at random and use the kill command to kill one process of the CG program;

Step 5: the job management process detects that a process of CG has died;

Step 6: the job management process sends the process failure message to all node management processes;

Step 7: each node management process receives the process failure message and writes it into the message-passing shared memory of every surviving process;

Step 8: the other surviving processes of the CG program detect the failure notification and, inside the parallel communication library, recover the damaged process rank list MPI_COMM_WORLD of the global communicator;

Step 9: the user data recovery routine defined by the programmer is called, the data lost by the killed process is restored, the parallel job returns to a consistent state, and execution resumes.

The experiment shows that without any fault the running time of the CG program "CG.D.1024" is 46.2 seconds, and that with one process failure and the method of this embodiment in use the running time is 46.6 seconds. Without the method of this embodiment, a process failure during the run of "CG.D.1024" causes the whole computing task to exit. The experiment demonstrates that after a failure occurs at run time, this embodiment can still restore the system to a consistent state through the parallel communication library and thereby recover the failed parallel job; throughout this process the parallel job does not abort or exit, and the job management system does not need to reload the failed parallel job.
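The cost of recovery implied by the two reported running times can be worked out directly. The snippet below uses only the 46.2 s and 46.6 s figures stated above; the variable names are illustrative.

```python
baseline_s = 46.2        # running time of CG.D.1024 without a fault
with_recovery_s = 46.6   # running time with one failure and recovery

overhead_s = with_recovery_s - baseline_s
overhead_pct = 100.0 * overhead_s / baseline_s

print(f"overhead: {overhead_s:.1f} s ({overhead_pct:.2f} % of the fault-free run)")
```

One failure and its recovery therefore add about 0.4 seconds, or roughly 0.87 percent, to this run.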

The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical schemes that fall under the idea of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as within the scope of protection of the present invention.

Claims

1. A parallel communication library state self-recovery method for process failure faults, characterized in that it is implemented in the following steps:

2) The node management process receives a request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared memory creation system call provided by the operating system, sets the initial value of said shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process. After the compute process has been created, the node management process monitors in real time for failure notification messages from the job management process; when it receives one, it passes the failure notification to the compute process through the shared memory;

4) The compute process queries the attached shared memory to obtain the number of compute processes that failed in this event and the list of those failed processes in the global communicator, performs the fault recovery operation for the process failure fault based on said number and said list, recovers the compute processes, and continues local computation.

2. The parallel communication library state self-recovery method for process failure faults according to claim 1, characterized in that the fault recovery operation for the process failure fault in said step 4) comprises the following detailed steps:

4.6) Once the failed processes have been successfully recovered, the compute process returns and continues local computation.