CN102915257A

CN102915257A - TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method

Info

Publication number: CN102915257A
Application number: CN2012103676534A
Authority: CN
Inventors: 林霞
Original assignee: Dawning Information Industry Beijing Co Ltd
Current assignee: Shuguang zhisuan Information Technology Co.,Ltd.
Priority date: 2012-09-28
Filing date: 2012-09-28
Publication date: 2013-02-06
Anticipated expiration: 2032-09-28
Also published as: CN102915257B

Abstract

The invention provides a parallel checkpoint execution method based on TORQUE (Tera-scale Open-source Resource and Queue manager). Checkpoint operations are performed on compute nodes in an NFS (Network File System) shared file storage system. The method comprises the following steps: (1) a user submits a job to the TORQUE server daemon pbs_server and adds a checkpoint request to the submission command, and the job script starts the task with the job start-up command chkp_mpirun; (2) the TORQUE server daemon sends the task information to the TORQUE scheduler pbs_sched, which looks for compute nodes that meet the parameter requirements specified in the job; and (3) the checkpoint operation is performed on the compute nodes. With this improved TORQUE-based checkpoint technique, TORQUE, which in the prior art supports only single-process checkpoints, can also support multi-process checkpoints, handle node failures automatically, and migrate processes.

Description

TORQUE-based parallel checkpoint execution method

Technical field

The invention belongs to the field of computers and specifically relates to a TORQUE-based parallel checkpoint execution method.

Background technology

A job scheduling system is application management software that runs on top of a high-performance computer system; its functionality and performance directly affect the efficiency and reliability of the whole system. On parallel platforms, however, checkpoint technology is still not widely supported, even though fault tolerance is a typical application of checkpointing.

Existing checkpoint technologies:

Libckpt: a checkpoint system integrated into the Condor system. Because it lacks kernel support, it can only checkpoint a limited set of user processes, which makes it difficult to apply in cluster/job management systems.

Irix (SGI): implemented at the kernel level, with a rich user interface and practical applications. However, Irix is not an open-source system.

Epckpt: a Linux-based checkpoint system that performs no storage optimization and is therefore inefficient.

Although the existing TORQUE implements checkpointing, it can only checkpoint single-process tasks and cannot support checkpointing or process migration for multi-process tasks. The reasons are as follows. TORQUE performs process checkpointing simply by integrating BLCR (Berkeley Lab's Linux Checkpoint/Restart, the checkpoint and recovery technology implemented at Berkeley Lab), and BLCR by itself cannot support distributed multi-process tasks. In addition, the checkpoint image file exists only on the node where the task runs; other nodes cannot use this file, so process migration is not possible.

Summary of the invention

To overcome the above defects, the invention provides a TORQUE-based parallel checkpoint execution method. Using an improved TORQUE-based checkpoint technique, TORQUE, which originally supported only single-process checkpoints, can now also support multi-process checkpoints, handle node failures automatically, and migrate processes.

To achieve the above object, the invention provides a TORQUE-based parallel checkpoint execution method in which checkpoint operations are performed on compute nodes in an NFS shared file storage system. The improvement is that the method comprises the following steps:

(1) The user submits a job to the TORQUE server daemon pbs_server;

(2) The TORQUE server daemon sends a task message to the TORQUE scheduler pbs_sched, and pbs_sched looks for compute nodes according to the parameter requirements specified in the job;

(3) The checkpoint operation is performed on the compute nodes.

In the first preferred technical solution provided by the invention, in step 1, the user submits a job script with the TORQUE job submission command qsub; the job script uses the MPI process start-up command chkp_mpirun to start MPI, and a job checkpoint request is added to the submission command.

In the second preferred technical solution provided by the invention, in step 2, the TORQUE scheduler pbs_sched polls the state of each compute node through the compute node daemon pbs_mom and returns the result to the TORQUE server daemon pbs_server.

In the third preferred technical solution provided by the invention, in step 3, the TORQUE server daemon pbs_server sends the user's checkpoint request and the periodic task to the corresponding compute nodes; each compute node has its own independent checkpoint image.

In the fourth preferred technical solution provided by the invention, the independent checkpoint images of the compute nodes that periodically perform the checkpoint operation are merged into a global checkpoint file.

In the fifth preferred technical solution provided by the invention, the global checkpoint file is stored in the NFS shared file storage system.

In the sixth preferred technical solution provided by the invention, step 3 is followed by step a:

(a) While the compute nodes are being polled, if a compute node is found to have failed, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered according to the global checkpoint file.

In the seventh preferred technical solution provided by the invention, in step a, the TORQUE scheduler pbs_sched sends the task numbers of all tasks on the failed compute node back to the TORQUE server daemon pbs_server; pbs_server reads the most recent global process image file from the NFS shared file storage system and resubmits the tasks in rollback mode.

Compared with the prior art, the TORQUE-based parallel checkpoint execution method provided by the invention improves TORQUE's checkpoint technique so that TORQUE can checkpoint the tasks running on it, whether single-process or multi-process, and automatically performs rollback recovery when a node fails. This avoids the inconvenience of users having to discover node failures and resubmit jobs themselves, as well as the low system utilization caused by jobs not being restarted promptly, and thereby improves the reliability of the scheduling system. Moreover, experimental tests show that with the improved TORQUE-based checkpoint technique, TORQUE, which originally supported only single-process checkpoints, can now also support multi-process checkpoints, handle node failures automatically, and migrate processes, which reduces the waste of computing resources caused by node failures.

Description of drawings

Fig. 1 is a schematic flow chart of the TORQUE-based parallel checkpoint execution method.

Embodiment

Explanation of the key technical terms involved

1. TORQUE (Tera-scale Open-source Resource and Queue manager): an open-source tera-scale computing resource and queue manager.

2. Checkpoint: a snapshot of the system state taken after one transaction has finished and before the next one is about to begin.

3. Image file: a checkpoint file that stores all the information about a process's execution.

4. Process migration: the ability to move a running process between different processors that are connected to each other by a network rather than by local shared memory.

5. NFS (Network File System): network file system.

As shown in Fig. 1, in the TORQUE-based parallel checkpoint execution method, checkpoint operations are performed on compute nodes in an NFS shared file storage system. The method comprises the following steps:

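(1) The user submits a job to the TORQUE server daemon pbs_server;

(2) The TORQUE server daemon sends a task message to the TORQUE scheduler pbs_sched, and pbs_sched looks for compute nodes according to the parameter requirements specified in the job;

(3) The checkpoint operation is performed on the compute nodes.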

In step 1, the user submits a job script with the TORQUE job submission command qsub, and a job checkpoint request must be added to the submission command. The job script uses the MPI process start-up command chkp_mpirun to start MPI. Alternatively, the MPI job can be submitted directly from the client command line by running the MPI start-up command chkp_mpirun there.
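The submission in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the argument syntax of chkp_mpirun, so passing the program and its arguments positionally is hypothetical, and it is assumed that the checkpoint request is expressed through the standard qsub `-c` option. The helper names `build_job_script` and `build_qsub_command` are invented for the example, and nothing is actually submitted.

```python
import shlex


def build_job_script(mpi_program, program_args=()):
    """Return the text of a job script that starts the MPI program via chkp_mpirun."""
    # chkp_mpirun is the launcher named in the text; its argument syntax is not
    # specified there, so this positional form is an assumption.
    launch = shlex.join(["chkp_mpirun", mpi_program, *program_args])
    return "#!/bin/sh\n" + launch + "\n"


def build_qsub_command(script_path, interval_minutes):
    """Return the argv for submitting the script with a periodic checkpoint request."""
    # Assumption: the checkpoint request maps onto qsub's -c checkpoint options.
    checkpoint_opts = f"enabled,periodic,interval={interval_minutes}"
    return ["qsub", "-c", checkpoint_opts, script_path]


if __name__ == "__main__":
    print(build_job_script("./solver", ["input.dat"]), end="")
    print(" ".join(build_qsub_command("job.sh", 10)))
```

Building the command as an argument list rather than a single string keeps the sketch independent of shell quoting; the list could be handed to `subprocess.run` on a machine where TORQUE is installed.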

In step 2, the TORQUE scheduler pbs_sched polls the state of each compute node through the compute node daemon pbs_mom and returns the result to the TORQUE server daemon pbs_server.
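A minimal sketch of the node search in step 2 is given below. It assumes the polled state of a node can be reduced to a liveness flag and a free-core count, and that the job's parameter requirements are a node count and a per-node core count; the `NodeState` type and the `select_nodes` function are hypothetical and do not reflect the internal data structures of pbs_sched or pbs_mom.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical summary of what one poll of a compute node reports."""
    name: str
    alive: bool
    free_cores: int


def select_nodes(polled, nodes_needed, cores_per_node):
    """Pick the first nodes that are alive and have enough free cores.

    Returns a list of node names, or None when the request cannot be met yet.
    """
    suitable = [n.name for n in polled
                if n.alive and n.free_cores >= cores_per_node]
    if len(suitable) < nodes_needed:
        return None  # not enough resources; the job would stay queued
    return suitable[:nodes_needed]


if __name__ == "__main__":
    states = [NodeState("n1", True, 8), NodeState("n2", False, 8),
              NodeState("n3", True, 2), NodeState("n4", True, 8)]
    print(select_nodes(states, 2, 4))
```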

In step 3, the TORQUE server daemon pbs_server sends the user's checkpoint request and the periodic task to the corresponding compute nodes; each compute node has its own independent checkpoint image. The independent checkpoint images of the compute nodes that periodically perform the checkpoint operation are merged into a global checkpoint file. The global checkpoint file is stored in the NFS shared file storage system.
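The merge of per-node images into one global checkpoint file can be illustrated with the sketch below. The text does not describe the file format or the merge procedure, so this example substitutes a plain tar archive for the global file; the `<node>.img` naming and the `merge_node_images` function are assumptions made for illustration only.

```python
import tarfile
from pathlib import Path


def merge_node_images(epoch_dir, expected_nodes, global_path):
    """Bundle one image per node into a single global checkpoint archive.

    Returns True on success, False if any node's image is still missing.
    """
    epoch_dir = Path(epoch_dir)
    images = [epoch_dir / f"{node}.img" for node in expected_nodes]
    if not all(p.is_file() for p in images):
        # An incomplete set would not be a consistent global checkpoint.
        return False
    tmp_path = Path(str(global_path) + ".tmp")
    with tarfile.open(tmp_path, "w") as archive:
        for image in images:
            archive.add(image, arcname=image.name)
    # Publish under the final name only once the archive is complete.
    tmp_path.replace(global_path)
    return True
```

Refusing to merge until every expected node has delivered its image reflects the requirement that the global file be consistent across nodes; a real implementation would also need to coordinate the processes so that the images correspond to the same point in the computation.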

Step 3 is followed by step a:

(a) While the compute nodes are being polled, if a compute node is found to have failed, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered according to the global checkpoint file. In step a, the TORQUE scheduler pbs_sched sends the task numbers of all tasks on the failed compute node back to the TORQUE server daemon pbs_server; pbs_server reads the most recent global process image file from the NFS shared file storage system and resubmits the tasks in rollback mode.
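The decision made in step a can be sketched as below, under stated assumptions: task-to-node assignments and poll results are plain dictionaries, and global checkpoint files are named `<task_id>_<zero-padded epoch>.ckpt` so that lexical order matches creation order. All function names and the naming scheme are hypothetical; the text does not specify how pbs_sched and pbs_server represent this information.

```python
from pathlib import Path


def tasks_on_failed_nodes(assignments, node_alive):
    """Return the IDs of tasks that were assigned to any node polled as failed."""
    failed = {node for node, ok in node_alive.items() if not ok}
    return sorted(task for task, nodes in assignments.items()
                  if failed & set(nodes))


def latest_global_checkpoint(shared_dir, task_id):
    """Return the most recent global checkpoint file for a task, or None."""
    # Zero-padded epochs make lexical order equal to chronological order.
    candidates = sorted(Path(shared_dir).glob(f"{task_id}_*.ckpt"))
    return candidates[-1] if candidates else None


def plan_rollback(assignments, node_alive, shared_dir):
    """Map each affected task to the checkpoint it should be resubmitted from."""
    return {task: latest_global_checkpoint(shared_dir, task)
            for task in tasks_on_failed_nodes(assignments, node_alive)}
```

A task that maps to `None` has no global checkpoint yet and would have to be resubmitted from the beginning.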

The TORQUE-based parallel checkpoint execution method is further described through the following example.

The TORQUE-based parallel checkpoint execution method comprises:

I. NFS (Network File System)

This method uses NFS shared storage to hold the memory image files. When each process sets a checkpoint for itself, its checkpoint file is actually sent via NFS to a stable storage array mounted on the management node; the reliability of the storage array is generally guaranteed by its own RAID (Redundant Arrays of Inexpensive Disks) mechanism. Because the storage is served through the NFS shared file system, all compute nodes can access this storage space, which facilitates process migration and recovery on other nodes after a node failure.
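As a rough illustration of how a process might place its checkpoint file on the shared mount so that other nodes can later read it, consider the sketch below. The directory layout (`<job>/<epoch>/rank_<n>.img`), the function name `save_checkpoint`, and the write-then-rename step are assumptions added for the example; the text only states that the file is stored via NFS.

```python
import os
from pathlib import Path


def save_checkpoint(shared_root, job_id, epoch, rank, data):
    """Write one process's checkpoint image under the shared mount.

    Returns the final path of the image file.
    """
    target_dir = Path(shared_root) / job_id / f"{epoch:04d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    final_path = target_dir / f"rank_{rank}.img"
    tmp_path = target_dir / f".rank_{rank}.img.tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # flush the bytes to storage before publishing
    # Rename so that readers on other nodes never see a half-written image.
    os.replace(tmp_path, final_path)
    return final_path
```

Because every node mounts the same storage, a replacement node can open the path returned here without any copy step, which is the property the recovery procedure relies on.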

II. Parallel checkpoint execution flow:

1. Parallel tasks are usually implemented on top of MPI (Message Passing Interface). When executing a job command, this method does not use MPI's own mpirun (the MPI start-up command) but the chkp_mpirun command (the MPI process start-up command provided by this method). The specific flow is as follows:

2. The user can run the chkp_mpirun command directly on the command line to start the MPI executable. Alternatively, the user can submit a job script with qsub (TORQUE's job submission command), start the MPI executable with chkp_mpirun inside the job script, and add a job checkpoint request to the submission command. After pbs_server (TORQUE's daemon on the management node) receives the user's script, it sends a task message to pbs_sched (TORQUE's scheduler on the management node), and the scheduler looks for suitable nodes according to the parameter requirements specified in the script file. The scheduler polls the state of each compute node through pbs_mom (TORQUE's daemon on the compute nodes) and returns the result to pbs_server. pbs_server then sends the user's task program over the network to the corresponding compute nodes, where it runs while checkpoints are taken periodically. The node that launched the task merges the independent checkpoint images on the individual compute nodes into a globally consistent checkpoint image, achieving a single-system-image effect. The periodically generated checkpoint files are stored in the NFS shared storage.

3. Automatic rollback recovery, that is, process migration, requires the management node to poll the state of each compute node at regular intervals. If a node failure is found, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered. This is achieved by modifying the timed execution function of the pbs_sched scheduler. The pbs_sched scheduler sends the task numbers of all tasks on the failed node back to pbs_server; pbs_server reads the most recent global process image file from the shared storage and resubmits the tasks in rollback mode.

It should be noted that the content and embodiments of the invention are intended to demonstrate the practical application of the technical solution provided by the invention and should not be construed as limiting its scope of protection. Those skilled in the art, inspired by the spirit and principles of the invention, may make various modifications, equivalent replacements, or improvements, but all such changes or modifications fall within the scope of protection of the pending application.

Claims

1. A TORQUE-based parallel checkpoint execution method in which checkpoint operations are performed on compute nodes in an NFS shared file storage system, characterized in that the method comprises the following steps:

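(1) The user submits a job to the TORQUE server daemon pbs_server;

(2) The TORQUE server daemon sends a task message to the TORQUE scheduler pbs_sched, and pbs_sched looks for compute nodes according to the parameter requirements specified in the job;

(3) The checkpoint operation is performed on the compute nodes.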

2. The method according to claim 1, characterized in that, in step 1, the user submits a job script with the TORQUE job submission command qsub, the job script uses the MPI process start-up command chkp_mpirun to start MPI, and a job checkpoint request is added to the submission command.

3. The method according to claim 1, characterized in that, in step 2, the TORQUE scheduler pbs_sched polls the state of each compute node through the compute node daemon pbs_mom and returns the result to the TORQUE server daemon pbs_server.

4. The method according to claim 1, characterized in that, in step 3, the TORQUE server daemon pbs_server sends the user's checkpoint request and the periodic task to the corresponding compute nodes, wherein each compute node has its own independent checkpoint image.

5. The method according to claim 4, characterized in that the independent checkpoint images of the compute nodes that periodically perform the checkpoint operation are merged into a global checkpoint file.

6. The method according to claim 5, characterized in that the global checkpoint file is stored in the NFS shared file storage system.

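7. The method according to claim 1 or 4, characterized in that step 3 is followed by step a:

(a) While the compute nodes are being polled, if a compute node is found to have failed, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered according to the global checkpoint file.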

8. The method according to claim 7, characterized in that, in step a, the TORQUE scheduler pbs_sched sends the task numbers of all tasks on the failed compute node back to the TORQUE server daemon pbs_server, and pbs_server reads the most recent global process image file from the NFS shared file storage system and resubmits the tasks in rollback mode.