CN103150236B

CN103150236B - Parallel communication library state self-recovery method for process failure faults

Info

Publication number: CN103150236B
Application number: CN201310096920.3A
Authority: CN
Inventors: 廖湘科; 卢宇彤; 谢旻; 所光; 曹宏嘉; 蒋艳凰; 董勇; 陈海涛
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-03-25
Filing date: 2013-03-25
Publication date: 2014-03-19
Anticipated expiration: 2033-03-25
Also published as: CN103150236A

Abstract

The invention discloses a parallel communication library state self-recovery method for process failure faults. The method comprises the following steps: a job management process has a node management process spawn compute processes, which execute non-communication local computation; the job management process monitors for process failures, and a failure message is sent to the compute processes by the node management process; each compute process queries shared memory for the number of compute processes that failed in the current failure event and the list of failed processes in the global communicator, and uses them to perform the fault recovery operation for the process failure fault; the failed parallel program is thereby restored to a consistent state. With this method, a parallel program is not interrupted and does not exit when a process failure occurs, the job management system does not need to reload the whole failed parallel program, and the failed compute processes are recovered automatically; the method offers strong fault tolerance and high computational efficiency.

Description

Parallel communication library state self-recovery method for process failure faults

Technical field

The present invention relates to the field of parallel computing, and in particular to a parallel communication library state self-recovery method for parallel programs that use the message-passing programming model and suffer a process failure fault while running.

Background

With the development and spread of high-performance computing in recent years, parallel computers are being used ever more widely on the one hand, and are growing ever larger in scale on the other. As parallel computers grow in scale, the mean time between failures (MTBF) of parallel computing systems becomes shorter and shorter, so the probability that a parallel program (also called a parallel application or parallel task) fails while running becomes higher and higher.

A parallel computer comprises multiple compute nodes. The resource management system of a parallel computer is the interface through which users access its computing resources. Using the tools the resource management system provides, system administrators and users can check system status, submit jobs, load computing tasks, view historical information, and so on. The resource management system consists mainly of two parts: the job management process and the node management process. The job management process and the node management processes may run on the same or on different physical nodes. The job management process mainly receives jobs, allocates resources and performs statistical analysis; the node management process mainly receives tasks from the job management process and spawns compute processes. The relationship among the job management process, the node management processes and the compute processes is shown in Fig. 1: in a parallel computer made up of multiple compute nodes, the leftmost compute node runs the job management process, and each of the other compute nodes runs compute processes (possibly several) and a node management process.

The message-passing programming model (Message Passing Interface, MPI for short) has become the mainstream programming model for parallel programs because of its low overhead and scalability. The message-passing parallel communication library is the foundation of the message-passing programming model. A process failure causes a running parallel program to fail and exit. Automatically recovering the state of the communication library after a failure therefore does much to improve the reliability and availability of parallel programs. However, the current mainstream message-passing parallel communication libraries, for example MPICH2 and Open MPI, do not address recovery of the communication library state after a failure.

When a programmer writes a message-passing parallel program according to the MPI standard, the program must first call the initialization function of the parallel communication library. During initialization, the library creates a global communicator and initializes the sequence number of the calling process (called its rank) in the global communicator. The global communicator is one kind of communicator; in an MPI communication library it is denoted MPI_COMM_WORLD, and all other communicators are derived, directly or indirectly, from the global communicator. When messages are sent or received through an MPI parallel communication library, the sender and the receiver must use the same communicator. Inside the library, communicators are distinguished by an integer context, and the global communicator MPI_COMM_WORLD is likewise identified by its context. For example, when an MPI parallel task consisting of four processes is initialized, every process creates a global communicator, but the processes have different ranks: 0, 1, 2 and 3. The processes of a parallel task can therefore identify one another by their rank in a communicator, and can send messages to or receive messages from other processes of that communicator by rank.
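The relationship among communicator, context and rank described above can be illustrated with a small model. This is only a sketch in plain Python, not the MPI API: the class `ToyCommunicator`, its fields and the process labels `p0` to `p3` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import count

_next_context = count(0)  # hypothetical source of unique integer contexts


@dataclass(frozen=True)
class ToyCommunicator:
    """Toy stand-in for a communicator: an integer context plus an ordered process list."""
    context: int
    members: tuple  # members[rank] is the process that holds that rank

    def rank_of(self, process_id):
        return self.members.index(process_id)

    def derive(self, process_ids):
        """Create a derived communicator with its own context."""
        return ToyCommunicator(next(_next_context), tuple(process_ids))


def can_match(send_comm, recv_comm):
    """A send and a receive can match only when both sides use the same communicator."""
    return send_comm.context == recv_comm.context


# Four processes, as in the example above: ranks 0..3 in the global communicator.
world = ToyCommunicator(next(_next_context), ("p0", "p1", "p2", "p3"))
evens = world.derive(["p0", "p2"])
```

In this model `p2` has rank 2 in `world` but rank 1 in `evens`, and a message sent on `world` cannot be matched by a receive posted on `evens` because the contexts differ.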

The Linux operating system offers several inter-process communication mechanisms, including pipes, message queues, network interfaces and shared memory. Shared memory is the fastest of these and has the lowest latency. Two processes that communicate through shared memory must create the shared memory before communicating, and the creation can only be done by one side. During communication, when one side writes data into shared memory the other side sees it immediately, and neither side interferes with the other's execution.

The fault models of parallel programs generally fall into two classes: the Byzantine fault model and the fail-stop fault model. In the Byzantine model, a process that fails may cause other processes to enter erroneous states, for example by sending wrong data. The Byzantine model can represent arbitrary faults, but such faults are very hard to detect. In the fail-stop model, a process that fails stops running and does not cause other processes in the system to enter erroneous states. The fail-stop model describes a process of a parallel program hanging or crashing, and is the usual hardware fault model in parallel computing. Most fault-tolerance techniques in common use in high-performance computing follow the fail-stop model. This patent therefore targets the fail-stop fault model, and refers to the hanging or crashing of a process of a parallel program as a process failure.
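The shared-memory communication pattern described above (one side creates the segment, the other attaches, and a write by one side is visible to the other) can be sketched as follows. The sketch assumes Python's `multiprocessing.shared_memory` module as a stand-in for the operating system's shared-memory system calls; the helper names and the segment size are hypothetical, and both ends live in one process only to keep the example short.

```python
from multiprocessing import shared_memory

SEGMENT_SIZE = 64  # bytes; an arbitrary size chosen for this sketch


def create_segment(size=SEGMENT_SIZE):
    """Creator side: allocate a segment and zero it explicitly."""
    seg = shared_memory.SharedMemory(create=True, size=size)
    seg.buf[:size] = bytes(size)
    return seg


def attach_segment(name):
    """Peer side: bind to a segment that already exists, by name."""
    return shared_memory.SharedMemory(name=name)


creator = create_segment()
peer = attach_segment(creator.name)

creator.buf[0] = 7            # one side writes ...
seen_by_peer = peer.buf[0]    # ... and the other side reads the same byte

peer.close()
creator.close()
creator.unlink()              # only the creator removes the segment
```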

At present, US patent publication US7475274 B2, entitled "FAULT TOLERANCE AND RECOVERY IN A HIGH-PERFORMANCE COMPUTING (HPC) SYSTEM", describes a fault tolerance and recovery technique for high-performance computer systems. It assumes that an HPC system consists of units such as management nodes, compute nodes and a high-performance communication network; computing tasks run on the compute nodes, and resource management and monitoring services run on the management nodes. The patent analyzes how the system detects node failures, how failures are handled, and how the resource management component allocates resources in light of the network topology after a failure. However, it does not address a state recovery strategy for the parallel communication library after a fault.

In summary, current patents and literature contain no report of a method for automatically recovering the state of a parallel communication library after a process failure fault on a high-performance parallel computer. Developers of high-performance parallel programs and administrators of high-performance computers urgently need this technical problem solved.

Summary of the invention

The technical problem to be solved by the present invention is to provide a parallel communication library state self-recovery method for process failure faults with which a parallel program is not interrupted and does not exit after a process fails, the job management system does not need to reload the failed processes, failed compute processes are recovered automatically, fault tolerance is strong and computational efficiency is high.

To solve the above technical problem, the present invention adopts the following technical solution:

A parallel communication library state self-recovery method for process failure faults, implemented in the following steps:

1) Start the job management process and the node management processes. The user submits a parallel task to the job management process, which allocates compute nodes according to the degree of parallelism of the parallel task and notifies the node management processes. The job management process then monitors in real time whether any compute process has failed; if one has, the job management process sends a failure message to the node management processes;

2) The node management process receives the request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared-memory creation system call provided by the operating system, sets the initial value of the shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process. After the compute process has been created, the node management process monitors in real time whether a failure notification message has arrived from the job management process; if one has, the node management process sends the failure notification message to the compute process through the shared memory;

3) The compute process first initializes the parallel communication library, looks up the designated environment variable to obtain the identifier of the shared memory, attaches to the shared memory using the shared-memory attach system call provided by the operating system, and then carries out local computation. During computation, the compute process calls the message passing interface of the parallel communication library to transfer messages and to check the arrival and sending status of messages. Meanwhile, the compute process checks the attached shared memory to determine whether a process failure has occurred; if so, it proceeds to step 4);

4) The compute process queries the shared memory to obtain the number of compute processes that failed in the current failure event and the list of failed processes in the global communicator. Using this number and list, it performs the fault recovery operation for the process failure fault, recovers the failed compute processes, and continues with local computation.
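The notification path in steps 1) to 4) can be sketched as a small in-memory simulation. The classes `JobManager` and `NodeManager`, the function `poll` and the byte layout of the notice are assumptions made for illustration; a zeroed `bytearray` stands in for each compute process's shared memory segment, and no real processes are involved.

```python
class NodeManager:
    """Hypothetical node management process: owns one notice buffer per compute process."""

    def __init__(self, ranks):
        # A zeroed bytearray stands in for each compute process's shared memory.
        self.segments = {rank: bytearray(8) for rank in ranks}

    def on_failure_notice(self, failed_ranks):
        for rank in failed_ranks:
            self.segments.pop(rank, None)      # a dead process has nothing to read
        for seg in self.segments.values():     # forward the notice to every survivor
            seg[0] = len(failed_ranks)
            seg[1:1 + len(failed_ranks)] = bytes(failed_ranks)


class JobManager:
    """Hypothetical job management process: reports failures to all node managers."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def report_failure(self, failed_ranks):
        for nm in self.node_managers:
            nm.on_failure_notice(failed_ranks)


def poll(segment):
    """Compute-process side: return the failed ranks, or [] while the segment is all zero."""
    n = segment[0]
    return list(segment[1:1 + n])


nodes = [NodeManager([0, 1]), NodeManager([2, 3])]
jm = JobManager(nodes)
before = poll(nodes[0].segments[0])   # nothing has failed yet
jm.report_failure([2])
after = poll(nodes[0].segments[0])    # rank 0 now sees that rank 2 failed
```

The all-zero initial value matters here: a compute process can tell "no failure yet" from "failure reported" by reading a single byte, without any handshake with the node management process.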

As a further improvement of the above technical solution:

In step 4), the fault recovery operation for the process failure fault is performed in the following detailed steps:

4.1) Clear the error state in the parallel communication library, and delete the messages sent or received after the fault occurred and before fault handling began;

4.2) Reorder the compute processes remaining in the global communicator after the fault. The rule is that the ranks of surviving processes to the right of a failed process shift left; the reordered global communicator is the shrunk global communicator;

4.3) The processes in the shrunk global communicator use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this event and the list of failed processes in the global communicator; the number of replacement processes equals the number of processes that failed in this event;

4.4) Combine the processes of the shrunk global communicator with the replacement processes to create a new global communicator. The new global communicator has two parts: the left part holds the processes of the shrunk global communicator and the right part holds the replacement processes;

4.5) Reorder the processes in the new global communicator so that the replacement processes fill the positions of the failed processes and the surviving processes keep the ranks they had before the failure;

4.6) After the failed processes have been successfully recovered, the compute processes return and continue computing.

The parallel communication library state self-recovery method for process failure faults of the present invention has the following advantages: while the compute processes spawned by the node management processes on behalf of the job management process are performing non-communication local computation, the job management process monitors for process failures and the node management processes send the process failure message to the compute processes; each compute process queries shared memory for the number of compute processes that failed in this event and the list of failed processes in the global communicator, and uses them to perform the fault recovery operation for the process failure fault, restoring the failed parallel program to a consistent state. As a result, a parallel program is not interrupted and does not exit after a process failure, the job management system does not need to reload the whole parallel program, and failed compute processes are recovered automatically; the method offers strong fault tolerance and high computational efficiency.

Brief description of the drawings

Fig. 1 is a schematic diagram of a prior-art parallel computer architecture.

Fig. 2 is a flow chart of the method of the embodiment of the present invention.

Fig. 3 is a schematic diagram of the principle of the fault recovery operation for process failure faults in the embodiment of the present invention.

Fig. 4 is a schematic diagram of the logical structure of the compute processes of a parallel task in the embodiment of the present invention.

Detailed description of the embodiment

This embodiment addresses the situation in which a parallel program on a parallel computer fails and exits because of a process failure. It proposes a parallel communication library state self-recovery method for process failure faults that autonomously recovers the communication state of a parallel program made up of multiple concurrent processes after a failure, so that the parallel program continues to run after the fault. For convenience, the processes of the parallel program are referred to below as compute processes.

As shown in Fig. 2, the parallel communication library state self-recovery method for process failure faults of this embodiment is implemented in the following steps:

1) Start the job management process and the node management processes. The user submits a parallel task to the job management process, which allocates compute nodes according to the degree of parallelism of the parallel task and notifies the node management processes. The job management process then monitors in real time whether any compute process has failed; if one has, the job management process sends a failure message to the node management processes;

2) The node management process receives the request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared-memory creation system call provided by the operating system, sets the initial value of the shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process (the name of the designated environment variable must differ from the names of the environment variables that already exist in the compute process; in this embodiment it is FD_SHM_ID). After the compute process has been created, the node management process monitors in real time whether a failure notification message has arrived from the job management process; if one has, the node management process sends the failure notification message to the compute process through the shared memory;

3) The compute process first initializes the parallel communication library, looks up the designated environment variable to obtain the identifier of the shared memory, attaches to the shared memory using the shared-memory attach system call provided by the operating system, and then carries out local computation. During computation, the compute process calls the message passing interface of the parallel communication library to transfer messages and to check the arrival and sending status of messages. Meanwhile, the compute process checks the attached shared memory to determine whether a process failure has occurred; if so, it proceeds to step 4);

4) The compute process queries the shared memory to obtain the number of compute processes that failed in the current failure event and the list of failed processes in the global communicator. Using this number and list, it performs the fault recovery operation for the process failure fault, recovers the failed compute processes, and continues with local computation.
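The handoff of the shared memory identifier through the FD_SHM_ID environment variable in steps 2) and 3) can be sketched as below. This is a minimal sketch under stated assumptions: Python's `multiprocessing.shared_memory` is substituted for the operating system's shared-memory system calls, the function names and the 16-byte segment size are hypothetical, and "any non-zero byte means a failure was reported" is an assumed convention that follows from the all-zero initial value.

```python
import os
from multiprocessing import shared_memory

ENV_NAME = "FD_SHM_ID"   # environment variable name used in this embodiment
SEGMENT_SIZE = 16        # bytes; hypothetical size for this sketch


def node_manager_prepare(env, size=SEGMENT_SIZE):
    """Node-manager side: create a zeroed segment and publish its identifier in env."""
    seg = shared_memory.SharedMemory(create=True, size=size)
    seg.buf[:size] = bytes(size)
    env[ENV_NAME] = seg.name
    return seg


def compute_process_bind(env):
    """Compute-process side: look up the identifier and attach to the segment."""
    return shared_memory.SharedMemory(name=env[ENV_NAME])


def failure_pending(seg, size=SEGMENT_SIZE):
    """Assumed convention: any non-zero byte means a failure notice has been written."""
    return any(seg.buf[:size])


child_env = dict(os.environ)            # environment that would be given to the spawned process
owner = node_manager_prepare(child_env)
bound = compute_process_bind(child_env)

pending_before = failure_pending(bound)
owner.buf[0] = 1                        # node manager records that one process has failed
pending_after = failure_pending(bound)

bound.close()
owner.close()
owner.unlink()
```

In a real deployment `child_env` would be passed to the process-creation call, so that the compute process finds the identifier in its own environment.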

The job management process of this embodiment receives a fault-tolerant parallel job from the user and uses the node management processes to spawn the group of compute processes of the job on one or more nodes. If a failure occurs while the fault-tolerant job is running, the node management process or the job management process detects it and then notifies the other compute processes of the job; those compute processes then repair, inside the parallel communication library, the communication library state damaged by the failure; finally, the failure is reported to the fault-tolerant parallel program. Referring to Fig. 3, this embodiment takes four compute processes as an example: the node management process creates "compute process 0", "compute process 1", "compute process 2" and "compute process 3" for the current parallel job, so the process rank list of the global communicator is {0, 1, 2, 3}. When "compute process 1" fails, "compute process 0", "compute process 2" and "compute process 3" each obtain the current failed-process list from local shared memory; the list contains a single process, {1}. When "compute process 2" and "compute process 3" of the job fail, "compute process 0" and "compute process 1" obtain the failed-process list {2, 3} from shared memory. This embodiment assumes that "compute process 2" fails during execution: the compute processes query shared memory for the number of compute processes that failed in this event and the list of failed processes in the global communicator, and obtain a failed compute process count of 1 and the failed-process list {2}.
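The source does not specify how the failed-process count and list are laid out in shared memory. One possible layout, shown purely as an assumption, is a 32-bit count followed by a fixed number of 32-bit rank slots; the names `MAX_FAILED`, `LAYOUT`, `encode_notice` and `decode_notice` are hypothetical.

```python
import struct

MAX_FAILED = 8                                # hypothetical capacity of one notice
LAYOUT = struct.Struct(f"<I{MAX_FAILED}I")    # count, then MAX_FAILED rank slots


def encode_notice(failed_ranks):
    """Pack the failed ranks into the assumed layout, padding unused slots with zeros."""
    ranks = sorted(failed_ranks)
    if len(ranks) > MAX_FAILED:
        raise ValueError("too many failed ranks for this layout")
    padded = ranks + [0] * (MAX_FAILED - len(ranks))
    return LAYOUT.pack(len(ranks), *padded)


def decode_notice(raw):
    """Return (count, failed_ranks); an all-zero buffer decodes to (0, [])."""
    count, *slots = LAYOUT.unpack(raw[:LAYOUT.size])
    return count, slots[:count]
```

With this layout the three cases in the example above decode to `(1, [1])`, `(2, [2, 3])` and `(1, [2])`, and a freshly zeroed segment decodes to `(0, [])`, which is why initializing the shared memory to all zeros is enough to signal "no failure".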

In this embodiment, the fault recovery operation for the process failure fault in step 4) is performed in the following detailed steps:

4.1) Clear the error state in the parallel communication library, and delete the messages sent or received after the fault occurred and before fault handling began. Referring to Fig. 4, each of "compute process 0" to "compute process 3" has a send-request queue and a receive-message queue; deleting the messages sent or received after the fault occurred and before fault handling began means emptying the send requests in the send-request queue and the received messages in the receive-message queue. After step 4.1) has cleared the error state in the parallel communication library, the global communicator contains three compute processes, "compute process 0", "compute process 1" and "compute process 3", so its process rank list becomes {0, 1, 3}.

4.2) Reorder the compute processes remaining in the global communicator after the fault. The rule is that the ranks of surviving processes to the right of a failed process shift left; the reordered global communicator is the shrunk global communicator. In this embodiment, after step 4.2) reorders the compute processes, rank "3" shifts left to "2", so that "compute process 3" becomes "compute process 2" of the shrunk global communicator; the process rank list of the shrunk global communicator, which contains "compute process 0", "compute process 1" and "compute process 2", is {0, 1, 2}.

4.3) The processes in the shrunk global communicator use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this event and the list of failed processes in the global communicator; the number of replacement processes equals the number of processes that failed in this event. In this embodiment, one "replacement process" is spawned to replace the failed "compute process 2".

4.4) Combine the processes of the shrunk global communicator with the replacement processes to create a new global communicator. The new global communicator has two parts: the left part holds the processes of the shrunk global communicator and the right part holds the replacement processes. In this embodiment, the resulting new global communicator has "compute process 0", "compute process 1" and "compute process 2" as its left part and the replacement process, as "compute process 3", as its right part; its process rank list is {0, 1, 2, replacement process 0}.

4.5) Reorder the processes in the new global communicator so that the replacement processes fill the positions of the failed processes and the surviving processes keep the ranks they had before the failure. After reordering, the replacement process ("compute process 3" of the new global communicator) fills the position of the failed former "compute process 2", and the process rank list of the global communicator is {0, 1, 2, 3}.

4.6) After the failed process has been successfully recovered, the compute processes return and continue with local computation. When step 4.6) completes, the failed compute process has been respawned and has been reinitialized in the global communicator. At this point the user data of the failed process is still lost. The programmer therefore needs to check the return value of communication calls to determine whether a failure has occurred and, if so, recover the lost user data.
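The rank bookkeeping of steps 4.2) to 4.5) can be traced with a short simulation. This is a sketch of the ordering logic only, in plain Python lists; it does not call any parallel communication library, and the function names and process labels (`P0` to `P3`, `replacement-for-P2`) are hypothetical.

```python
def shrink(world, failed):
    """Step 4.2: drop failed members; survivors to their right shift left."""
    return [p for p in world if p not in failed]


def spawn_replacements(failed):
    """Step 4.3: one replacement per failed process (labels are hypothetical)."""
    return [f"replacement-for-{p}" for p in failed]


def merge(shrunk, replacements):
    """Step 4.4: survivors on the left, replacements on the right."""
    return shrunk + replacements


def reorder(merged, world, failed, replacements):
    """Step 4.5: replacements take the failed ranks; survivors regain their old ranks."""
    original_rank = {p: r for r, p in enumerate(world)}
    target = {rep: original_rank[f] for rep, f in zip(replacements, failed)}
    return sorted(merged, key=lambda p: target.get(p, original_rank.get(p)))


# The embodiment's example: four processes, "compute process 2" fails.
world = ["P0", "P1", "P2", "P3"]
failed = ["P2"]

shrunk = shrink(world, failed)                   # P3 now holds rank 2
replacements = spawn_replacements(failed)
merged = merge(shrunk, replacements)             # replacement holds rank 3
recovered = reorder(merged, world, failed, replacements)
```

Here `shrunk` is `["P0", "P1", "P3"]`, `merged` places the replacement last, and `recovered` puts the replacement at rank 2 while `P3` returns to rank 3, matching the rank lists {0, 1, 3}, {0, 1, 2}, {0, 1, 2, replacement process 0} and {0, 1, 2, 3} given in the steps above.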

In this embodiment, the job management process loads jobs and detects process failures, and the node management processes notify the compute processes of a process failure through shared memory. The compute processes first recover the failed communication library state dynamically and automatically inside the communication library, according to the notification; the user program defined by the programmer then recovers the lost user data, so that the failed parallel program is restored to a consistent state. On a parallel computer made up of multiple compute nodes, the present invention allows a parallel program based on the message passing interface to resume execution quickly after a process failure, without exiting and being reloaded, and thus solves the problem of recovering a parallel program running on a parallel computer after a process failure.
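The division of labor described above (the library repairs its own state, then the programmer's code restores user data after seeing a failure in a return value) might look like the following from the application side. The source does not name the return codes or prescribe a recovery strategy, so the constants `OK` and `PROC_FAILED`, the in-memory checkpoint and the stand-in `communicate` callable are all assumptions for illustration.

```python
OK, PROC_FAILED = 0, 1  # hypothetical return codes; the source does not name them


def run(steps, communicate):
    """Iterate a toy computation, rolling back to the last checkpoint on a failure code."""
    state = {"step": 0, "value": 0}
    checkpoint = dict(state)
    rollbacks = 0
    while state["step"] < steps:
        state["value"] += state["step"]          # local computation
        state["step"] += 1
        if communicate(state) == PROC_FAILED:    # library state is assumed already repaired
            state = dict(checkpoint)             # user-defined recovery of user data
            rollbacks += 1
            continue
        checkpoint = dict(state)                 # communication succeeded: advance checkpoint
    return state["value"], rollbacks


def flaky_once():
    """Build a stand-in communication call that reports one failure on its third call."""
    calls = {"n": 0}

    def communicate(_state):
        calls["n"] += 1
        return PROC_FAILED if calls["n"] == 3 else OK

    return communicate
```

Calling `run(5, flaky_once())` redoes the step whose communication reported a failure and still finishes with the same value as a fault-free run, which is the sense in which the program is returned to a consistent state.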

To check the effect of this embodiment, it was verified experimentally on the TH-1A supercomputer. Each node of the TH-1A supercomputer is configured as follows: two Intel Xeon 5670 six-core CPUs, each core running at 2.93 GHz, with a theoretical double-precision floating-point peak of 140 Gflops for the two CPUs; the bidirectional physical bandwidth of the communication network is 160 Gbps, and the bidirectional MPI communication bandwidth is 6.3 GB/s. The resource management system used in the test is SLURM-2.4.0, and the parallel program under test is adapted from the CG (Conjugate Gradient) test case in NPB-3.3-MPI. The CG program uses the conjugate gradient method to approximate the smallest eigenvalue of a large sparse symmetric positive definite matrix. The parallel communication library of this embodiment is adapted from mpich2 and is called ft-mpich2. ft-mpich2 extends the error handling module of mpich2 by adding the fault recovery operation for process failure faults performed in step 4). The problem size of the CG program is class D, and the degree of parallelism of the test is 1024. To create a fault during the execution of the parallel program, this embodiment uses the kill command provided by the Linux operating system to kill one randomly chosen compute process of the CG program while it is running. Without the method of this embodiment, the whole CG program fails and exits.

The experimental verification of this embodiment on the TH-1A supercomputer proceeded as follows:

Step 1: Start the job management process and the node management processes, both with fault-tolerance support;

Step 2: Submit the parallel job of the CG program (named "CG.D.1024") to the job management process;

Step 3: The job management process schedules the job onto compute nodes, and the node management processes spawn the compute processes;

Step 4: While the CG program is running, select a node at random and use the kill command to kill one process of the CG program;

Step 5: The job management process detects that a process of CG has died;

Step 6: The job management process sends the process failure message to all node management processes;

Step 7: Each node management process receives the process failure message and writes it into the message-passing shared memory of all surviving processes;

Step 8: The other surviving processes of the CG program detect the failure notification and recover the damaged global communicator MPI_COMM_WORLD inside the parallel communication library;

Step 9: The programmer-defined user data recovery procedure is called, the data lost by the killed process is recovered, the parallel job is restored to a consistent state, and execution resumes.
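The fault injection in Step 4 and the detection in Step 5 can be imitated on a single POSIX machine. The sketch below is an assumption-laden stand-in, not the experimental setup: idle Python child processes take the place of CG compute processes, `SIGKILL` is sent directly instead of through the kill command, and the helper names are hypothetical. It relies on `signal.SIGKILL`, so it does not run on Windows.

```python
import random
import signal
import subprocess
import sys


def launch(n):
    """Start n idle child processes that stand in for compute processes."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]


def kill_random(procs, rng=random):
    """Step 4: pick one process at random and kill it with SIGKILL."""
    victim = rng.randrange(len(procs))
    procs[victim].send_signal(signal.SIGKILL)
    procs[victim].wait()
    return victim


def failed_ranks(procs):
    """Step 5: report the ranks whose process has exited."""
    return [rank for rank, p in enumerate(procs) if p.poll() is not None]


def shutdown(procs):
    """Clean up every child that is still running."""
    for p in procs:
        if p.poll() is None:
            p.kill()
        p.wait()


procs = launch(4)
victim = kill_random(procs)
detected = failed_ranks(procs)
shutdown(procs)
```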

In the experiment, the run time of the CG program "CG.D.1024" was 46.2 seconds when no fault occurred, and 46.6 seconds when one process failed and the method of this embodiment was applied. Without the method of this embodiment, when a process failed while "CG.D.1024" was running, the whole computing task exited because of the process failure. The experiment shows that, after a failure occurs at run time, this embodiment can still restore the system to a consistent state through the parallel communication library and thereby recover the failed parallel job; throughout, the parallel job is not interrupted and does not exit, and the job management system does not need to reload the failed parallel job.
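The two run times reported above imply a small relative cost for recovering from one process failure. The calculation below uses only the figures 46.2 seconds and 46.6 seconds from the experiment; the function name is hypothetical.

```python
def relative_overhead(baseline_s, with_recovery_s):
    """Extra run time caused by recovery, as a fraction of the fault-free run time."""
    return (with_recovery_s - baseline_s) / baseline_s


overhead = relative_overhead(46.2, 46.6)
print(f"{overhead:.2%}")  # prints 0.87%
```

That is, recovery added 0.4 seconds, a little under one percent of the fault-free run time, in this single measurement.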

The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall under the concept of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the scope of protection of the present invention.

Claims

1. A parallel communication library state self-recovery method for process failure faults, characterized in that it is implemented in the following steps:

2) The node management process receives the request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared-memory creation system call provided by the operating system, sets the initial value of said shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process. After the compute process has been created, the node management process monitors in real time whether a failure notification message has arrived from the job management process; if one has, the node management process sends the failure notification message to the compute process through the shared memory;

4) The compute process queries the shared memory to obtain the number of compute processes that failed in the current failure event and the list of failed processes in the global communicator; using said number and list, it performs the fault recovery operation for the process failure fault, recovers the failed compute processes, and continues with local computation;

4.6) After the failed processes have been successfully recovered, the compute processes return and continue with local computation.