CN103150236A - Parallel communication library state self-recovery method facing to process failure fault - Google Patents


Info

Publication number
CN103150236A
CN103150236A CN2013100969203A CN201310096920A CN103150236A CN 103150236 A CN103150236 A CN 103150236A CN 2013100969203 A CN2013100969203 A CN 2013100969203A CN 201310096920 A CN201310096920 A CN 201310096920A CN 103150236 A CN103150236 A CN 103150236A
Authority
CN
China
Prior art keywords
compute process
communicator
global communication
failure
shared memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100969203A
Other languages
Chinese (zh)
Other versions
CN103150236B (en)
Inventor
廖湘科
卢宇彤
谢旻
所光
曹宏嘉
蒋艳凰
董勇
陈海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310096920.3A
Publication of CN103150236A
Application granted
Publication of CN103150236B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a state self-recovery method for a parallel communication library in the face of process failure faults. The method comprises the following steps: while the compute processes that the job management process spawned through the node management process perform non-communication local computation, the job management process monitors for process failures, and a failure message is sent to the compute processes by the node management process; each compute process queries the shared memory to obtain the number of compute processes that failed in this event and the failed process list in the global communicator, and uses them to perform the fault recovery operation for the process failure; the failed parallel program is thereby restored to a consistent state. With this method, a parallel program does not abort or exit when a process fails, the job management system does not need to reload the whole failed parallel program, and failed compute processes are recovered automatically. The method offers strong fault tolerance and high computational efficiency.

Description

Parallel communication library state self-recovery method for process failure errors
Technical field
The present invention relates to the field of parallel computing, and in particular to a parallel communication library state self-recovery method for parallel programs that use the message passing programming model, applied after a process failure error occurs while the parallel program is running.
Background art
In recent years, with the development and spread of high-performance computing, parallel computers have come to be used ever more widely and, at the same time, have grown ever larger in scale. As parallel computers scale up, the mean time between failures (MTBF) of parallel computer systems becomes shorter and shorter, so the probability that a parallel program (also called a parallel application or parallel task) fails during execution keeps rising.
A parallel computer comprises multiple compute nodes. Its resource management system is the interface through which users access the computing resources: with the tools it provides, system administrators and users can check system status, submit jobs, load computing tasks, view history, and so on. The resource management system consists mainly of two parts, the job management process and the node management process, which may run on the same physical node or on different ones. The job management process mainly receives jobs, allocates resources, and performs statistical analysis; the node management process mainly receives tasks from the job management process and spawns compute processes. Fig. 1 shows the relationship among the job management process, the node management processes, and the compute processes for a parallel computer composed of multiple compute nodes: the leftmost compute node runs the job management process, and each of the other compute nodes runs a node management process and one or more compute processes.
Because of its low overhead and scalability, the message passing programming model (the Message Passing Interface, MPI for short) has become the mainstream programming model for parallel programs, and the message passing parallel communication library is the foundation of that model. A single process failure is enough to make a running parallel program abort with an error. Automatically recovering the state of the communication library after a failure therefore does much to improve the reliability and availability of parallel programs. However, the current mainstream message passing libraries, such as MPICH2 and OpenMPI, do not address recovery of the communication library state after a failure.
When writing a message passing parallel program to the MPI standard, the programmer must first call the initialization function of the parallel communication library at the start of the program. During initialization, the library creates a global communicator and initializes the sequence number of the calling process within it (called its rank). The global communicator is one kind of communicator; in an MPI library it is denoted by MPI_COMM_WORLD, which carries the list of process ranks. All other communicators are derived from the global communicator, directly or indirectly. When a message is sent or received through the MPI library, the sender and the receiver must use the same communicator. Internally, MPI uses an integer context to tell communicators apart, and MPI_COMM_WORLD is likewise identified by its context inside the library. For example, when an MPI parallel task with four processes initializes, each process creates the global communicator, but each gets a different rank: 0, 1, 2, and 3 respectively. The processes of a parallel task can therefore identify one another by rank within a communicator and use those ranks to send messages to, or receive messages from, other processes in that communicator.
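As a rough illustration of why sender and receiver must use the same communicator, the following toy model (plain Python, not MPI) keys every mailbox by an integer context. The class `ToyCommunicator`, the `send` and `recv` helpers, and the way contexts are allocated are all assumptions made for this sketch; real MPI libraries manage contexts internally and differently.

```python
from collections import defaultdict, deque


class ToyCommunicator:
    """Toy stand-in for a communicator: an integer context plus a rank list."""

    _next_context = 0  # hypothetical allocator, not how an MPI library does it

    def __init__(self, ranks):
        self.context = ToyCommunicator._next_context
        ToyCommunicator._next_context += 1
        self.ranks = list(ranks)


# One mailbox per (context, destination rank), so traffic on
# different communicators can never be confused.
mailboxes = defaultdict(deque)


def send(comm, src, dst, payload):
    mailboxes[(comm.context, dst)].append((src, payload))


def recv(comm, dst):
    box = mailboxes[(comm.context, dst)]
    return box.popleft() if box else None


world = ToyCommunicator([0, 1, 2, 3])  # plays the role of MPI_COMM_WORLD
sub = ToyCommunicator([0, 1])          # a communicator derived from it

send(world, src=0, dst=1, payload="hello")
print(recv(sub, dst=1))    # no message: the context does not match
print(recv(world, dst=1))  # delivered on the communicator it was sent on
```

In this model a rank only has meaning together with its communicator, which is why the recovery procedure later in the document has to rebuild the global communicator rather than just a list of process IDs.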
The Linux operating system offers several inter-process communication mechanisms, including pipes, message queues, network interfaces, and shared memory. Of these, shared memory is the fastest and has the lowest latency. Two processes that communicate through shared memory must create the shared memory segment before communicating, and the creation is performed by only one side. During communication, one side writes data to the shared memory and the other side sees it immediately; neither side interferes with the other's execution.
Fault models for parallel programs generally fall into two classes: the Byzantine fault model and the fail-stop fault model. In the Byzantine model, a process that fails can drive other processes into erroneous states, for example by sending wrong data. The Byzantine model can represent arbitrary faults, but such faults are very hard to detect. In the fail-stop model, a process that fails simply stops running and does not cause other processes in the system to enter erroneous states. The fail-stop model describes processes that hang or crash in a parallel program and is a common hardware fault model in parallel computing. Most fault tolerance techniques used in high-performance computing follow the fail-stop model. This patent therefore targets the fail-stop model and refers to the hanging or crashing of a process in a parallel program as a process failure.
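The shared memory behaviour described above can be tried out with Python's standard `multiprocessing.shared_memory` module. This is only a minimal sketch: both handles live in one process for brevity, whereas in practice the second handle would be opened by a different process that has been told the segment name.

```python
from multiprocessing import shared_memory

# Creator side: exactly one party creates the segment.
segment = shared_memory.SharedMemory(create=True, size=16)
segment.buf[0] = 0  # start from a known value

# Peer side: attach to the existing segment by name
# (normally done from another process).
peer = shared_memory.SharedMemory(name=segment.name)

segment.buf[0] = 42  # one side writes ...
seen = peer.buf[0]   # ... the other side reads the same memory directly
print(seen)

# Detach both handles; the creator also removes the segment.
peer.close()
segment.close()
segment.unlink()
```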
At present, the US patent with publication number US7475274 B2, entitled "FAULT TOLERANCE AND RECOVERY IN A HIGH-PERFORMANCE COMPUTING (HPC) SYSTEM", describes a fault tolerance and recovery technique for high-performance computer systems. That technical solution assumes an HPC system composed of management nodes, compute nodes, a high-performance communication network, and similar units; computing tasks run on the compute nodes, while resource management and monitoring services run on the management nodes. The patent analyzes how the resource management part of the system detects node failures, its failure handling policy, and resource allocation combined with the network topology. However, it does not address a state recovery strategy for the parallel communication library after an error.
In summary, no current patent or publication reports a method for automatically recovering the state of the parallel communication library after a process failure error on high-performance parallel computers. This is a technical problem that developers of high-performance parallel programs and administrators of high-performance computers urgently need solved.
Summary of the invention
The technical problem to be solved by the present invention is to provide a parallel communication library state self-recovery method for process failure errors with which the parallel program does not abort or exit after a process fails, the job management system does not need to reload the failed processes, failed compute processes are recovered automatically, fault tolerance is strong, and computational efficiency is high.
To solve the above technical problem, the present invention adopts the following technical solution:
A parallel communication library state self-recovery method for process failure errors, implemented in the following steps:
1) Start the job management process and the node management processes. The user submits a parallel task to the job management process, which allocates compute nodes according to the degree of parallelism of the parallel task and notifies the node management processes. The job management process then monitors in real time whether any compute process has failed; if one has, it sends a failure message to the node management processes;
2) The node management process receives the request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared memory creation system call provided by the operating system, sets the initial value of the segment to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory segment to a designated environment variable of the compute process. Once the compute process has been created successfully, the node management process monitors in real time for failure notification messages from the job management process; when one arrives, the node management process forwards the failure notification to the compute process through the shared memory;
3) The compute process first initializes the parallel communication library, looks up the designated environment variable to obtain the identifier of the shared memory segment, and attaches to that segment using the shared memory attach system call provided by the operating system; it then performs local computation. During computation, the compute process calls the message passing interface of the parallel communication library to pass messages and to check the arrival and sending status of messages. At the same time, the compute process checks the attached shared memory to determine whether a process failure has occurred; if one has, it proceeds to step 4);
4) The compute process queries the shared memory to obtain the number of compute processes that failed in this event and the list of failed processes in the global communicator. Based on that number and list, it performs the error recovery operation for the process failure, recovers the failed compute processes, and continues local computation.
As a further improvement of the above technical solution:
In step 4), the detailed steps of the error recovery operation for the process failure are as follows:
4.1) Clear the error state in the parallel communication library, deleting the messages that were sent or received after the error occurred and before error handling began;
4.2) Reorder the compute processes in the global communicator after the error. The reordering rule is that the ranks of the surviving processes to the right of a failed process shift left; the reordered global communicator becomes the shrunken global communicator;
4.3) The processes in the shrunken global communicator use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this event and the failed process list in the global communicator; the number of replacement processes equals the number of processes that failed;
4.4) Combine the processes in the shrunken global communicator with the replacement processes to create a new global communicator. The new global communicator has two parts: the left part holds the processes of the shrunken global communicator, and the right part holds the replacement processes;
4.5) Reorder the processes in the new global communicator so that the replacement processes fill the positions of the failed processes and the surviving processes keep the ranks they had before the failure;
4.6) After the failed processes have been recovered successfully, the compute process returns and continues computing.
The parallel communication library state self-recovery method for process failure errors of the present invention has the following advantages. While the compute processes that the job management process spawned through the node management process perform non-communication local computation, the job management process monitors for process failures and the node management process sends the failure message to the compute processes. Each compute process queries the shared memory to obtain the number of compute processes that failed in this event and the failed process list in the global communicator, and performs the error recovery operation for the process failure on that basis, returning the failed parallel program to a consistent state. As a result, the parallel program does not abort or exit after a process failure, the job management system does not need to reload the whole parallel program, and failed compute processes are recovered automatically, giving strong fault tolerance and high computational efficiency.
Description of the drawings
Fig. 1 is a schematic diagram of a prior-art parallel computer architecture.
Fig. 2 is a flow chart of the method of the embodiment of the present invention.
Fig. 3 is a schematic diagram of the principle of the error recovery operation for process failure errors in the embodiment of the present invention.
Fig. 4 is a schematic diagram of the logical structure of the compute processes of a parallel task in the embodiment of the present invention.
Embodiment
This embodiment addresses the situation in which a parallel program on a parallel computer aborts with an error because a process has failed. It proposes a parallel communication library state self-recovery method for failure errors that lets the communication state of a parallel program made up of multiple parallel processes recover autonomously after a failure, so that the parallel program keeps running after the error. For convenience, the processes of the parallel program are referred to below as compute processes.
As shown in Fig. 2, the parallel communication library state self-recovery method for process failure errors of this embodiment is implemented in the following steps:
1) Start the job management process and the node management processes. The user submits a parallel task to the job management process, which allocates compute nodes according to the degree of parallelism of the parallel task and notifies the node management processes. The job management process then monitors in real time whether any compute process has failed; if one has, it sends a failure message to the node management processes;
2) The node management process receives the request from the job management process to spawn compute processes. For each compute process to be spawned, the node management process creates a shared memory segment using the shared memory creation system call provided by the operating system, sets the initial value of the segment to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory segment to a designated environment variable of the compute process (the name of the designated environment variable must differ from the names of the environment variables that already exist in the compute process; in this embodiment it is FD_SHM_ID). Once the compute process has been created successfully, the node management process monitors in real time for failure notification messages from the job management process; when one arrives, the node management process forwards the failure notification to the compute process through the shared memory;
3) The compute process first initializes the parallel communication library, looks up the designated environment variable to obtain the identifier of the shared memory segment, and attaches to that segment using the shared memory attach system call provided by the operating system; it then performs local computation. During computation, the compute process calls the message passing interface of the parallel communication library to pass messages and to check the arrival and sending status of messages. At the same time, the compute process checks the attached shared memory to determine whether a process failure has occurred; if one has, it proceeds to step 4);
4) The compute process queries the shared memory to obtain the number of compute processes that failed in this event and the list of failed processes in the global communicator. Based on that number and list, it performs the error recovery operation for the process failure, recovers the failed compute processes, and continues local computation.
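A minimal sketch of the notification path in steps 2) to 4), assuming a simple segment layout that the patent does not specify: one 32-bit failure count followed by a fixed number of 32-bit rank slots. `FD_SHM_ID` is the environment variable named in this embodiment; `MAX_PROCS`, `LAYOUT`, and the three helper functions are hypothetical names, and Python's `multiprocessing.shared_memory` stands in for the operating system's shared memory system calls.

```python
import os
import struct
from multiprocessing import shared_memory

MAX_PROCS = 8  # assumed upper bound on ranks, used only to size the segment
# Assumed layout: failure count, then MAX_PROCS rank slots (little-endian int32).
LAYOUT = struct.Struct(f"<{1 + MAX_PROCS}i")


def create_notification_segment():
    """Node management side: create a zeroed segment and export its name."""
    shm = shared_memory.SharedMemory(create=True, size=LAYOUT.size)
    shm.buf[:LAYOUT.size] = bytes(LAYOUT.size)  # initial value: all zeros
    os.environ["FD_SHM_ID"] = shm.name          # inherited by spawned children
    return shm


def publish_failure(shm, failed_ranks):
    """Node management side: write the failure count and the failed ranks."""
    padded = list(failed_ranks) + [0] * (MAX_PROCS - len(failed_ranks))
    LAYOUT.pack_into(shm.buf, 0, len(failed_ranks), *padded)


def poll_failure(shm):
    """Compute process side: return the failed rank list (empty if none)."""
    count, *ranks = LAYOUT.unpack_from(shm.buf, 0)
    return ranks[:count]


manager_view = create_notification_segment()
# A compute process would do this after initializing the communication library.
worker_view = shared_memory.SharedMemory(name=os.environ["FD_SHM_ID"])

before = poll_failure(worker_view)  # all zeros: no failure reported yet
publish_failure(manager_view, [2])  # rank 2 is reported as failed
after = poll_failure(worker_view)
print(before, after)

worker_view.close()
manager_view.close()
manager_view.unlink()
```

Because the segment starts as all zeros, a count of zero doubles as "no failure yet", so the compute process can poll it cheaply between communication calls.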
The job management process of this embodiment receives a fault-tolerant parallel job from the user and uses the node management processes to spawn the job's group of compute processes on one or more nodes. If a failure occurs while the fault-tolerant job is running, the node management process or the job management process detects it and notifies the other compute processes of the job; those compute processes then repair, inside the parallel communication library, the communication library state damaged by the failure; finally, the failure is reported to the fault-tolerant parallel program. Referring to Fig. 3, this embodiment uses four compute processes as an example: the node management process creates "compute process 0", "compute process 1", "compute process 2", and "compute process 3" for the current parallel job, so the process rank list of the global communicator is {0, 1, 2, 3}. When "compute process 1" fails, "compute process 0", "compute process 2", and "compute process 3" each read the current failed process list from their local shared memory; the list contains a single process, {1}. When "compute process 2" and "compute process 3" of the job fail, "compute process 0" and "compute process 1" read the failed process list {2, 3} from shared memory. In this embodiment, assume that "compute process 2" fails during execution. By querying the shared memory, each surviving compute process obtains the number of compute processes that failed in this event and the failed process list in the global communicator: the number of failed compute processes is 1, and the failed process list in the global communicator is {2}.
In this embodiment, the detailed steps of the error recovery operation for the process failure in step 4) are as follows:
4.1) Clear the error state in the parallel communication library, deleting the messages that were sent or received after the error occurred and before error handling began. Referring to Fig. 4, each of "compute process 0" to "compute process 3" holds a send request queue and a receive message queue; deleting those messages amounts to emptying the send requests and the received messages in the two queues. After step 4.1) has cleared the error state in the parallel communication library, the global communicator contains three compute processes, "compute process 0", "compute process 1", and "compute process 3", so its process rank list becomes {0, 1, 3}.
4.2) Reorder the compute processes in the global communicator after the error. The reordering rule is that the ranks of the surviving processes to the right of a failed process shift left; the reordered global communicator becomes the shrunken global communicator. In this embodiment, after step 4.2) reorders the compute processes, rank 3 shifts left to 2, so "compute process 3" becomes "compute process 2" of the shrunken communicator; the shrunken global communicator contains "compute process 0", "compute process 1", and "compute process 2", and its process rank list is {0, 1, 2}.
4.3) The processes in the shrunken global communicator use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this event and the failed process list in the global communicator; the number of replacement processes equals the number of processes that failed. In this embodiment, one "replacement process" is spawned to replace the failed "compute process 2".
4.4) Combine the processes in the shrunken global communicator with the replacement processes to create a new global communicator. The new global communicator has two parts: the left part holds the processes of the shrunken global communicator, and the right part holds the replacement processes. The result is a new global communicator whose left part is "compute process 0", "compute process 1", and "compute process 2" and whose right part is "compute process 3" (the replacement); its process rank list is {0, 1, 2, replacement process 0}.
4.5) Reorder the processes in the new global communicator so that the replacement processes fill the positions of the failed processes and the surviving processes keep the ranks they had before the failure. After reordering, the replacement process "compute process 3" fills the position of the failed original "compute process 2", and the process rank list of the global communicator is {0, 1, 2, 3}.
4.6) After the failed processes have been recovered successfully, the compute process returns and continues local computation. Once step 4.6) has finished, the failed compute process has been respawned and reinitialized in the global communicator. At this point the user data of the failed process is still lost, so the programmer must determine from the return value of communication calls whether a failure has occurred and, if so, restore the lost user data.
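The rank bookkeeping in steps 4.1) to 4.5) can be sketched as plain list manipulation. This is an illustrative model only, not the communication library's implementation: process labels such as `P0` and `replacement-0` and all function names are hypothetical, and pairing replacements with failed ranks in ascending order is an assumption for the multi-failure case.

```python
from collections import deque


def clear_pending(send_queue, recv_queue):
    """Step 4.1: drop send requests and received messages left over from the error."""
    send_queue.clear()
    recv_queue.clear()


def shrink(world, failed):
    """Step 4.2: remove failed ranks; survivors to their right shift left."""
    return [proc for rank, proc in enumerate(world) if rank not in failed]


def spawn_replacements(failed):
    """Step 4.3: one replacement per failed process (labels are placeholders)."""
    return [f"replacement-{i}" for i in range(len(failed))]


def merge(shrunk, replacements):
    """Step 4.4: survivors form the left part, replacements the right part."""
    return shrunk + replacements


def reorder(merged, failed, world_size):
    """Step 4.5: replacements take the failed ranks; survivors regain old ranks."""
    n_survivors = world_size - len(failed)
    survivors = iter(merged[:n_survivors])
    replacements = iter(merged[n_survivors:])
    failed_set = set(failed)
    return [next(replacements) if rank in failed_set else next(survivors)
            for rank in range(world_size)]


# Walk through the embodiment's example: four processes, rank 2 fails.
world = ["P0", "P1", "P2", "P3"]  # list index = rank
failed = [2]

send_queue, recv_queue = deque(["send to 2"]), deque(["message from 2"])
clear_pending(send_queue, recv_queue)

shrunk = shrink(world, failed)                   # P3 now holds rank 2
merged = merge(shrunk, spawn_replacements(failed))
recovered = reorder(merged, failed, len(world))  # replacement ends up at rank 2
print(shrunk, merged, recovered)
```

The intermediate `merged` list corresponds to the rank list {0, 1, 2, replacement process 0} in step 4.4), and `recovered` restores the original four ranks with the replacement in the failed position.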
In this embodiment, the job management process loads the job and detects process failures, and the node management process notifies the compute processes of a process failure through shared memory. Each compute process first recovers the failed communication library state automatically and dynamically inside the communication library according to the notification; the programmer-defined user routine then restores the lost user data, returning the failed parallel program to a consistent state. On a parallel computer composed of multiple compute nodes, the present invention lets a parallel program based on the message passing programming interface resume execution quickly after a process failure without exiting and being reloaded, and thus solves the problem of recovering parallel programs running on parallel computers after a process failure.
To check the effect of this embodiment, it was verified experimentally on the TH-1A supercomputer. Each TH-1A node is configured as follows: two six-core Intel Xeon 5670 CPUs, each core running at 2.93 GHz, with a theoretical double-precision floating-point peak of 140 Gflops for the two CPUs; the communication network has a bidirectional physical bandwidth of 160 Gbps and a bidirectional MPI communication bandwidth of 6.3 GB/s. The resource management system used in the test is SLURM-2.4.0, and the parallel program under test is adapted from the CG (Conjugate Gradient) test case in NPB-3.3-MPI. The CG program uses the conjugate gradient method to approximate the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. The parallel communication library of this embodiment is adapted from mpich2 and is called ft-mpich2; it extends the error handling module of mpich2 with the error recovery procedure for process failure errors performed in step 4). The problem size of the CG program is class D, and the test runs with a parallelism of 1024. To create an error during execution of the parallel program, the kill command provided by the Linux operating system is used to kill one randomly chosen compute process of the CG program while it is running. Without the method of this embodiment, the whole CG program aborts with an error.
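The quoted per-node peak can be checked with simple arithmetic. The figure of 4 double-precision floating-point operations per core per cycle is an assumption made for this check; it is not stated in the source.

```python
cpus_per_node = 2
cores_per_cpu = 6
clock_ghz = 2.93
flops_per_cycle = 4  # assumption: double-precision flops per core per cycle

# Peak = CPUs x cores x clock (GHz) x flops per cycle, in Gflops.
peak_gflops = cpus_per_node * cores_per_cpu * clock_ghz * flops_per_cycle
print(f"{peak_gflops:.2f} Gflops")
```

Under that assumption the product is about 140.6 Gflops, consistent with the 140 Gflops quoted for the two CPUs.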
The experimental verification of this embodiment on the TH-1A supercomputer proceeds as follows:
Step 1: start the job management process and the node management processes, each with fault tolerance enabled;
Step 2: submit the parallel job of the CG program (named "CG.D.1024") to the job management process;
Step 3: the job management process schedules the job onto compute nodes, and the node management processes spawn the compute processes;
Step 4: while the CG program is running, pick a node at random and use the kill command to kill one process of the CG program;
Step 5: the job management process detects that a process of CG has died;
Step 6: the job management process sends the process failure message to all node management processes;
Step 7: each node management process receives the process failure message and writes it into the message passing shared memory of every surviving process;
Step 8: the other surviving processes of the CG program detect the failure notification and repair the damaged global communicator MPI_COMM_WORLD inside the parallel communication library;
Step 9: the programmer-defined user data recovery routine is invoked, the data lost by the killed process is restored, the parallel job returns to a consistent state, and execution resumes.
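The fault injection and detection in steps 4 and 5 can be imitated on a single machine with Python's `subprocess` module. This is a small stand-in for the experiment, not the SLURM or ft-mpich2 mechanism: the sleeping child processes play the role of compute processes, and the parent plays the role of the monitoring process. Note that `Popen.kill()` sends SIGKILL on POSIX systems, whereas the source only says that the `kill` command was used.

```python
import random
import subprocess
import sys

# Start a few stand-in "compute processes" that simply sleep.
workers = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    for _ in range(3)
]

# Fault injection: kill one randomly chosen process.
victim = random.randrange(len(workers))
workers[victim].kill()
workers[victim].wait()  # reap it so that its exit status is recorded

# Monitoring side: poll() returns None while a process is still running.
failed_ranks = [rank for rank, proc in enumerate(workers)
                if proc.poll() is not None]
print(failed_ranks)

# Clean up the surviving processes.
for proc in workers:
    if proc.poll() is None:
        proc.kill()
    proc.wait()
```

The resulting `failed_ranks` list is the kind of information that steps 6 and 7 then propagate to the surviving processes through shared memory.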
The experiment shows that, when no fault occurs, the running time of the CG program "CG.D.1024" is 46.2 seconds; when one process fails and the method of the present embodiment is used, the running time of "CG.D.1024" is 46.6 seconds. Without the method of the present embodiment, a process failure during the run of "CG.D.1024" causes the whole computation task to exit. These experiments show that, after a failure occurs at run time, the present embodiment can still restore the system state to a consistent state through the parallel communication library and thereby recover the failed parallel job; during this process the parallel job is neither interrupted nor exits, and the job management system does not need to reload the failed parallel job.
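The cost of recovery implied by the two reported running times can be worked out directly. The short calculation below uses only the 46.2 s and 46.6 s figures from the experiment; the variable names are illustrative.

```python
baseline_s = 46.2   # running time of CG.D.1024 without any fault
recovered_s = 46.6  # running time with one process failure and self-recovery

extra_s = recovered_s - baseline_s
overhead_pct = 100.0 * extra_s / baseline_s
print(f"extra time: {extra_s:.1f} s, overhead: {overhead_pct:.2f} %")
```

The recovery adds about 0.4 s, which is less than 1 % of the fault-free running time for this single measurement.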
The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the idea of the present invention belong to its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A parallel communication library state self-recovery method for process failure faults, characterized in that the implementation steps are as follows:
1) Start the job management process and the node management processes; the user submits a parallel job to the job management process, and the job management process allocates compute nodes according to the parallelism of the parallel job and notifies the node management processes; the job management process then monitors in real time whether any compute process has failed and, if one has, sends a failure message to the node management processes;
2) The node management process receives from the job management process a request to spawn compute processes; for each compute process to be spawned, the node management process creates a shared memory segment using the shared-memory creation system call provided by the operating system, sets the initial value of the shared memory to all zeros, creates the compute process of the parallel job, and assigns the identifier of the shared memory to a designated environment variable of the compute process; after the compute process has been created successfully, the node management process monitors in real time whether a failure notification message has arrived from the job management process and, if so, forwards the failure notification message to the compute process through the shared memory;
3) The compute process first initializes the parallel communication library, obtains the identifier of the shared memory by querying the designated environment variable, and attaches to that shared memory using the shared-memory attach system call provided by the operating system; it then performs local computation; during the computation, the compute process calls the message-passing interface of the parallel communication library to transfer messages and to detect the arrival and sending status of messages; at the same time, the compute process checks the attached shared memory to determine whether a process failure has occurred, and proceeds to step 4) if one has;
4) The compute process queries the shared memory for the number of compute processes that failed in this occurrence and for the list of those failed processes in the global communicator, performs the fault recovery operation for the process failure fault according to that number and list, recovers the compute processes, and continues the local computation.
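The shared-memory notification path of steps 2) to 4) can be sketched as follows. This is only an illustration under stated assumptions: Python's `multiprocessing.shared_memory` stands in for the operating system's shared-memory system calls, the environment variable name `FT_SHM_NAME` is hypothetical, and the segment layout (one 32-bit count followed by a list of 32-bit ranks) is an assumption, since the claim does not specify any layout.

```python
import os
import struct
from multiprocessing import shared_memory

ENV_VAR = "FT_SHM_NAME"          # hypothetical designated environment variable
MAX_FAILED = 15                  # hypothetical capacity of the failed-rank list
SEG_SIZE = 4 * (1 + MAX_FAILED)  # assumed layout: int32 count, then int32 ranks


def manager_create_segment():
    """Node manager side: create a zeroed segment and publish its identifier."""
    seg = shared_memory.SharedMemory(create=True, size=SEG_SIZE)
    seg.buf[:SEG_SIZE] = bytes(SEG_SIZE)  # initial value is all zeros
    os.environ[ENV_VAR] = seg.name        # spawned children inherit this variable
    return seg


def manager_notify(seg, failed_ranks):
    """Node manager side: publish the failed ranks, then the failure count."""
    if len(failed_ranks) > MAX_FAILED:
        raise ValueError("too many failed ranks for this segment")
    struct.pack_into(f"<{len(failed_ranks)}i", seg.buf, 4, *failed_ranks)
    struct.pack_into("<i", seg.buf, 0, len(failed_ranks))


def compute_attach():
    """Compute process side: attach to the segment named in the environment."""
    return shared_memory.SharedMemory(name=os.environ[ENV_VAR])


def compute_poll(seg):
    """Compute process side: return the failed ranks, or [] if none so far."""
    (count,) = struct.unpack_from("<i", seg.buf, 0)
    return list(struct.unpack_from(f"<{count}i", seg.buf, 4))
```

A compute process would call `compute_poll` between communication calls and enter recovery when the returned list is non-empty. A real implementation would also need proper synchronization between writer and reader, which this sketch leaves out.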
2. The parallel communication library state self-recovery method for process failure faults according to claim 1, characterized in that the detailed steps of the fault recovery operation performed in step 4) are as follows:
4.1) Clear the error state in the parallel communication library, and delete the messages that were sent or received after the fault occurred and before the fault was handled;
4.2) Reorder the compute processes in the global communicator after the fault; the reordering rule is that the rank numbers of the normal processes located to the right are shifted left, and reordering the global communicator yields a shrunk global communicator;
4.3) For the processes in the shrunk global communicator, use the process spawning function provided by the parallel communication library to spawn replacement processes according to the number of compute processes that failed in this occurrence and the list of failed processes in the global communicator; the number of replacement processes equals the number of processes that failed in this occurrence;
4.4) Combine the processes in the shrunk global communicator with the replacement processes to create a new global communicator, which consists of two parts: the left part holds the processes of the shrunk global communicator and the right part holds the replacement processes;
4.5) Reorder the processes in the new global communicator so that the replacement processes fill the positions of the failed processes and the normal processes keep the rank numbers they had before the failure;
4.6) After the failed processes have been successfully recovered, the compute process returns and continues the local computation.
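The rank bookkeeping of steps 4.2) to 4.5) can be illustrated with plain lists. The sketch below is not the ft-mpich2 implementation: processes are represented by string labels, replacement processes are simply named `new0`, `new1`, and so on, and every function name is hypothetical. In an MPI library such steps might map onto calls like `MPI_Comm_spawn`, `MPI_Intercomm_merge` and `MPI_Comm_split`, but the claims do not name any specific function, so that mapping is only an assumption.

```python
def shrink(world, failed):
    """Step 4.2: drop failed ranks so survivors to the right shift left."""
    return [proc for rank, proc in enumerate(world) if rank not in failed]


def merge(shrunk, replacements):
    """Step 4.4: survivors form the left part, replacements the right part."""
    return shrunk + replacements


def restore_order(merged, world_size, failed):
    """Step 4.5: survivors regain old ranks, replacements fill the gaps."""
    n_survivors = world_size - len(failed)
    survivors = iter(merged[:n_survivors])
    replacements = iter(merged[n_survivors:])
    return [next(replacements) if rank in failed else next(survivors)
            for rank in range(world_size)]


def recover(world, failed_ranks):
    """Run steps 4.2 to 4.5 on a list-based model of the global communicator."""
    failed = set(failed_ranks)
    shrunk = shrink(world, failed)
    # Step 4.3 stand-in: one replacement per failed process.
    replacements = [f"new{k}" for k in range(len(failed))]
    return restore_order(merge(shrunk, replacements), len(world), failed)
```

For a six-process communicator in which ranks 1 and 4 have failed, the shrunk communicator holds the four survivors in ranks 0 to 3, and after `recover` the survivors are back at ranks 0, 2, 3 and 5 while the two replacements occupy ranks 1 and 4.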
CN201310096920.3A 2013-03-25 2013-03-25 Parallel communication library state self-recovery method facing to process failure fault Active CN103150236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310096920.3A CN103150236B (en) 2013-03-25 2013-03-25 Parallel communication library state self-recovery method facing to process failure fault

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310096920.3A CN103150236B (en) 2013-03-25 2013-03-25 Parallel communication library state self-recovery method facing to process failure fault

Publications (2)

Publication Number Publication Date
CN103150236A (en) 2013-06-12
CN103150236B (en) 2014-03-19

Family

ID=48548333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310096920.3A Active CN103150236B (en) 2013-03-25 2013-03-25 Parallel communication library state self-recovery method facing to process failure fault

Country Status (1)

Country Link
CN (1) CN103150236B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391740A (en) * 2014-11-11 2015-03-04 上海斐讯数据通信技术有限公司 Deadlock unlocking method
WO2017050165A1 (en) * 2015-09-24 2017-03-30 阿里巴巴集团控股有限公司 Data synchronization method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1117766A (en) * 1993-02-10 1996-02-28 艾利森电话股份有限公司 A method and a system in a distributed operating system
US5900020A (en) * 1996-06-27 1999-05-04 Sequent Computer Systems, Inc. Method and apparatus for maintaining an order of write operations by processors in a multiprocessor computer to maintain memory consistency
CN1913481A (en) * 2005-08-10 2007-02-14 阿尔卡特公司 System comprising execution nodes for executing schedules

Also Published As

Publication number Publication date
CN103150236B (en) 2014-03-19

Similar Documents

Publication Publication Date Title
CN108604202B (en) Working node reconstruction for parallel processing system
Fagg et al. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
Silva et al. Fault-tolerant execution of mobile agents
US8615578B2 (en) Using a standby data storage system to detect the health of a cluster of data storage servers
EP3142011B9 (en) Anomaly recovery method for virtual machine in distributed environment
CN110535680B (en) Byzantine fault-tolerant method
CN109669762B (en) Cloud computing resource management method, device, equipment and computer readable storage medium
JP2017084333A (en) Method and system for monitoring virtual machine cluster
CN105933407B (en) method and system for realizing high availability of Redis cluster
CN108270726B (en) Application instance deployment method and device
WO2015058711A1 (en) Rapid fault detection method and device
CN108512753B (en) Method and device for transmitting messages in cluster file system
CN113965576A (en) Container-based big data acquisition method and device, storage medium and equipment
CN103150236B (en) Parallel communication library state self-recovery method facing to process failure fault
US20100085871A1 (en) Resource leak recovery in a multi-node computer system
CN113872997B (en) Container group POD reconstruction method based on container cluster service and related equipment
US8537662B2 (en) Global detection of resource leaks in a multi-node computer system
CN109117317A (en) A kind of clustering fault restoration methods and relevant apparatus
US11762741B2 (en) Storage system, storage node virtual machine restore method, and recording medium
CN104516778B (en) The preservation of process checkpoint and recovery system and method under a kind of multitask environment
CN109947593B (en) Data disaster tolerance method, system, strategy arbitration device and storage medium
CN116932274B (en) Heterogeneous computing system and server system
CN104516790A (en) System and method for recording and recovering checking point in distributed environment
CN109522158A (en) A kind of disaster-tolerant backup method and relevant apparatus
US11954509B2 (en) Service continuation system and service continuation method between active and standby virtual servers

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant