CN102915257A - TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method - Google Patents

Publication number
CN102915257A
Authority
CN
China
Prior art keywords
torque
pbs
server
computing node
checkpoint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103676534A
Other languages
Chinese (zh)
Other versions
CN102915257B (en)
Inventor
林霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuguang zhisuan Information Technology Co.,Ltd.
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd
Priority to CN201210367653.4A
Publication of CN102915257A
Application granted
Publication of CN102915257B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a parallel checkpoint execution method based on TORQUE (Tera-scale Open-source Resource and QUEue manager). Checkpoint operations are performed on computing nodes in an NFS (Network File System) shared file storage system. The method comprises the following steps: (1) a user submits a job to the TORQUE server daemon pbs_server and adds a checkpoint request to the submission command, and the job script starts the task with the job launch command chkp_mpirun; (2) the TORQUE server daemon sends the task information to the TORQUE scheduler pbs_sched, and pbs_sched looks for computing nodes according to the parameter requirements specified in the job; and (3) the checkpoint operation is performed on the computing nodes. With this improved TORQUE-based checkpoint technique, TORQUE, which previously supported only single-process checkpoints, can support multi-process checkpoints, handle node failures automatically, and migrate processes.

Description

TORQUE-based parallel checkpoint execution method
Technical field
The invention belongs to the field of computing and specifically relates to a TORQUE-based parallel checkpoint execution method.
Background
A job scheduling system is the application management software that sits on top of a high-performance computer system; its functionality and performance directly affect the efficiency and reliability of the whole system. Checkpointing, however, is still not widely supported on parallel platforms, even though fault tolerance is one of its typical uses.
Existing checkpoint techniques: Libckpt: a checkpoint system integrated into the Condor system. Because it lacks kernel support, it can checkpoint only a limited class of user processes, which makes it hard to apply in cluster/job management systems.
Irix from SGI: implemented at the kernel level, with a rich user interface and real-world use. However, Irix is not an open-source system.
Epckpt: a Linux-based checkpoint system; it does no storage optimization and is inefficient.
Although the existing TORQUE implements checkpointing, it can checkpoint only single-process tasks and supports neither checkpointing of multi-process tasks nor process migration. The reasons are as follows. TORQUE performs process checkpointing simply by integrating BLCR (Berkeley Lab's Linux Checkpoint/Restart, a checkpoint and recovery technique implemented at Berkeley Lab), and BLCR by itself cannot support distributed multi-process tasks. In addition, the checkpoint image file exists only on the node where the task runs; other nodes cannot use this file, so process migration is not possible.
Summary of the invention
To overcome the above defects, the invention provides a TORQUE-based parallel checkpoint execution method. With this improved TORQUE-based checkpoint technique, TORQUE, which originally supported only single-process checkpoints, can also support multi-process checkpoints, handle node failures automatically, and migrate processes.
To achieve the above object, the invention provides a TORQUE-based parallel checkpoint execution method in which checkpoint operations are performed on computing nodes in an NFS shared file storage system. The improvement is that the method comprises the following steps:
(1). The user submits a job to the TORQUE server daemon pbs_server;
(2). The TORQUE server daemon sends the task information to the TORQUE scheduler pbs_sched, and pbs_sched looks for computing nodes according to the parameter requirements specified in the job;
(3). The checkpoint operation is performed on the computing nodes.
In a first preferred technical scheme provided by the invention, in step 1 the user submits a job script with the TORQUE job submission command qsub; the job script uses the MPI process launch command chkp_mpirun to start MPI, and a job checkpoint request is added to the submission command.
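As a rough illustration of step 1, the following Python sketch builds the text of a job script and the argument list for a qsub call that carries a periodic checkpoint request. It only constructs strings and does not contact a TORQUE server. The `-np` flag passed to chkp_mpirun is an assumption borrowed from common mpirun usage (the patent does not list chkp_mpirun's options), the helper names are hypothetical, and the exact `-c` option string accepted by qsub may vary with the TORQUE version and build.

```python
import shlex


def build_job_script(program, nprocs, launcher="chkp_mpirun"):
    """Return the text of a job script that starts an MPI program.

    The "-np" flag is an assumption; the patent does not document the
    options of chkp_mpirun.
    """
    lines = [
        "#!/bin/sh",
        # PBS_O_WORKDIR is set by TORQUE to the directory qsub was run from.
        'cd "$PBS_O_WORKDIR"',
        f"{launcher} -np {nprocs} {shlex.quote(program)}",
    ]
    return "\n".join(lines) + "\n"


def build_qsub_argv(script_path, nodes, ppn, interval_minutes):
    """Return a qsub argument list that requests periodic checkpoints."""
    return [
        "qsub",
        "-l", f"nodes={nodes}:ppn={ppn}",
        # Checkpoint request added to the submission command (step 1).
        "-c", f"periodic,interval={interval_minutes}",
        script_path,
    ]


if __name__ == "__main__":
    print(build_job_script("./a.out", 8), end="")
    print(" ".join(build_qsub_argv("job.sh", nodes=2, ppn=4, interval_minutes=10)))
```

Keeping the command construction separate from its execution makes it easy to inspect what would be submitted before handing the list to a process launcher.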
In a second preferred technical scheme provided by the invention, in step 2 the TORQUE scheduler pbs_sched polls the state of each computing node through the computing node daemon pbs_mom and returns the result to the TORQUE server daemon pbs_server.
In a third preferred technical scheme provided by the invention, in step 3 the TORQUE server daemon pbs_server sends the user's checkpoint request to the corresponding computing nodes, where it is executed periodically; each computing node has its own independent checkpoint image.
In a fourth preferred technical scheme provided by the invention, the independent checkpoint images of the computing nodes that periodically perform the checkpoint operation are merged into a global checkpoint file.
In a fifth preferred technical scheme provided by the invention, the global checkpoint file is stored in the NFS shared file storage system.
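The patent does not specify the format of the global checkpoint file. As one possible reading of the merge step, the sketch below bundles the per-node image files into a single tar archive in a shared directory. The archive layout, the file naming scheme `<job_id>.<timestamp>.gckpt.tar` and the function name are all assumptions made for illustration.

```python
import os
import tarfile
import time


def merge_checkpoint_images(node_images, shared_dir, job_id, timestamp=None):
    """Bundle per-node checkpoint images into one global checkpoint archive.

    node_images maps a node name to the path of that node's image file.
    shared_dir stands in for a directory on the NFS shared storage.
    """
    if timestamp is None:
        timestamp = int(time.time())
    global_path = os.path.join(shared_dir, f"{job_id}.{timestamp}.gckpt.tar")
    with tarfile.open(global_path, "w") as archive:
        for node in sorted(node_images):
            # Store each image under its node name so that a restart can
            # tell which process image came from which node.
            archive.add(node_images[node], arcname=f"{node}/image.ckpt")
    return global_path
```

A single archive per checkpoint round means a restart only has to locate one file, which matches the idea of a global checkpoint file that any node can read from shared storage.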
In a sixth preferred technical scheme provided by the invention, the following step a is performed after step 3:
(a). While the computing nodes are being polled, if a computing node is found to have failed, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered from the global checkpoint file.
In a seventh preferred technical scheme provided by the invention, in step a the TORQUE scheduler pbs_sched sends the numbers of all tasks on the failed computing node back to the TORQUE server daemon pbs_server; pbs_server reads the most recent global process image file from the NFS shared file storage system and resubmits the tasks in rollback mode.
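The selection of tasks in step a can be sketched as a small pure function. This is only an illustration of the rule stated above, not TORQUE code: the data structures (a mapping of node states from one polling round, a mapping of task assignments, a set of finished tasks) and the state strings "up" and "down" are assumptions.

```python
def tasks_to_roll_back(node_states, assignments, finished):
    """Return the sorted task numbers that need rollback recovery.

    node_states: node name -> "up" or "down" (result of one polling round)
    assignments: node name -> list of task numbers placed on that node
    finished:    set of task numbers that have already completed
    """
    failed_nodes = [node for node, state in node_states.items() if state == "down"]
    pending = set()
    for node in failed_nodes:
        for task in assignments.get(node, []):
            # Both unfinished and not-yet-started tasks are rolled back;
            # only tasks that already completed are left alone.
            if task not in finished:
                pending.add(task)
    return sorted(pending)
```

In the method described here, this list corresponds to the task numbers that pbs_sched sends back to pbs_server.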
Compared with the prior art, the TORQUE-based parallel checkpoint execution method provided by the invention improves TORQUE's checkpoint technique so that TORQUE can checkpoint the tasks running on it, whether single-process or multi-process, and performs rollback recovery automatically when a node fails. This spares the user the inconvenience of discovering the node failure and resubmitting the job, avoids the low system availability caused by jobs not being restarted promptly, and thus improves the reliability of the scheduling system. Moreover, experimental testing shows that with the improved technique, TORQUE, which originally supported only single-process checkpoints, can also support multi-process checkpoints, handle node failures automatically, and migrate processes, which reduces the computing resources wasted because of node failures.
Description of drawings
Fig. 1 is a flow chart of the TORQUE-based parallel checkpoint execution method.
Embodiment
Explanation of the key technical terms involved
1. TORQUE: Tera-scale Open-source Resource and QUEue manager, an open-source tera-scale computing resource and queue manager.
2. Checkpoint: a snapshot of the system state taken when one transaction has finished and the next is about to begin.
3. Image file: the checkpoint file, which stores all the information about a process's execution.
4. Process migration: the ability to move an executing process between different processors that are connected to each other by a network rather than by local shared memory.
5. NFS: Network File System.
As shown in Fig. 1, in a TORQUE-based parallel checkpoint execution method, checkpoint operations are performed on computing nodes in an NFS shared file storage system, and the method comprises the following steps:
(1). The user submits a job to the TORQUE server daemon pbs_server;
(2). The TORQUE server daemon sends the task information to the TORQUE scheduler pbs_sched, and pbs_sched looks for computing nodes according to the parameter requirements specified in the job;
(3). The checkpoint operation is performed on the computing nodes.
In step 1, the user submits a job script with the TORQUE job submission command qsub; a job checkpoint request must be added to the submission command, and the job script uses the MPI process launch command chkp_mpirun to start MPI. Alternatively, the MPI job can be submitted directly from the client command line by running the MPI launch command chkp_mpirun there.
In step 2, the TORQUE scheduler pbs_sched polls the state of each computing node through the computing node daemon pbs_mom and returns the result to the TORQUE server daemon pbs_server.
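Step 2 can be pictured as filtering the polled node states against the resources the job asks for. The sketch below is a simplified stand-in for that selection and does not reproduce the actual pbs_sched logic; the `NodeStatus` record, its fields and the first-fit policy are assumptions made for illustration.

```python
from collections import namedtuple

# One polling result per computing node (fields are illustrative).
NodeStatus = namedtuple("NodeStatus", ["name", "state", "free_cores"])


def select_nodes(statuses, nodes_needed, cores_per_node):
    """Pick the first nodes that are free and have enough idle cores.

    Returns an empty list when the request cannot be satisfied yet, in
    which case the job would simply stay queued.
    """
    eligible = [
        status.name
        for status in statuses
        if status.state == "free" and status.free_cores >= cores_per_node
    ]
    if len(eligible) < nodes_needed:
        return []
    return eligible[:nodes_needed]
```

For example, with four polled nodes of which one is down and one has too few idle cores, a request for two nodes with four cores each would return the two remaining nodes.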
In step 3, the TORQUE server daemon pbs_server sends the user's checkpoint request to the corresponding computing nodes, where it is executed periodically; each computing node has its own independent checkpoint image. The independent checkpoint images of the computing nodes that periodically perform the checkpoint operation are merged into a global checkpoint file. The global checkpoint file is stored in the NFS shared file storage system.
After step 3, the following step a is performed:
(a). While the computing nodes are being polled, if a computing node is found to have failed, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered from the global checkpoint file. In step a, the TORQUE scheduler pbs_sched sends the numbers of all tasks on the failed computing node back to the TORQUE server daemon pbs_server; pbs_server reads the most recent global process image file from the NFS shared file storage system and resubmits the tasks in rollback mode.
The TORQUE-based parallel checkpoint execution method is further described through the following embodiment.
The TORQUE-based parallel checkpoint execution method comprises:
I. NFS file system (Network File System)
This method uses NFS shared storage to hold the memory image files. When each process checkpoints itself, its checkpoint file is actually sent over NFS to a stable storage array mounted on the management node; the reliability of the storage array is generally guaranteed by the array itself (RAID, Redundant Arrays of Inexpensive Disks). The NFS shared file system service lets every computing node access this storage space, which makes it possible to migrate processes and to recover on other nodes after a node failure.
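Because other nodes may read a checkpoint file from the shared storage at any time, a writer would typically want to avoid exposing a half-written file. The patent does not describe how files are written; the sketch below shows one common pattern (write to a temporary name, flush, then rename) under the assumption that the shared directory is an ordinary mounted path. The function name and the ".tmp" suffix are hypothetical.

```python
import os


def write_checkpoint(shared_dir, name, data):
    """Write checkpoint bytes so that readers never see a partial file.

    shared_dir stands in for the NFS mount point; name is the final file
    name; data is the checkpoint content as bytes.
    """
    final_path = os.path.join(shared_dir, name)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        # Ask the operating system to push the data to the storage backend
        # before the file becomes visible under its final name.
        os.fsync(handle.fileno())
    # Rename within one directory replaces the target in a single step.
    os.replace(tmp_path, final_path)
    return final_path
```

With this pattern a recovering node either finds the previous complete checkpoint or the new complete one, never a truncated file.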
II. Parallel checkpoint execution flow:
1. Parallel tasks are usually implemented on top of MPI (Message Passing Interface). When executing a job, this method no longer uses MPI's own mpirun (the MPI launch command) but the chkp_mpirun command provided by this method (an MPI process launch command). The flow is as follows:
2. The user can run the chkp_mpirun command directly on the command line to start an MPI executable. Alternatively, the user can submit a job script with qsub (TORQUE's job submission command); the job script uses chkp_mpirun to start the MPI executable, and a job checkpoint request is added to the submission command. After pbs_server (the TORQUE daemon on the management node) receives the user's script, it sends the task information to the pbs_sched scheduler (the TORQUE scheduler on the management node), which looks for suitable nodes according to the parameter requirements specified in the script file. The scheduler polls the state of each computing node through pbs_mom (the TORQUE daemon on the computing nodes) and returns the result to pbs_server. pbs_server then sends the user's task program over the network to the corresponding computing nodes, where it runs and is checkpointed periodically. The node that launched the task merges the independent checkpoint images on the individual computing nodes into a globally consistent checkpoint image, achieving a single-system-image effect. The periodically produced checkpoint files are stored in the NFS shared storage.
3. To implement automatic rollback recovery, that is, process migration, the management node must poll the state of each computing node at regular intervals. If a node failure is found, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered. This is done by modifying the periodic execution function of the pbs_sched scheduler. The pbs_sched scheduler sends the numbers of all tasks on the failed node back to pbs_server; pbs_server reads the most recent global process image file from the shared storage and resubmits the tasks in rollback mode.
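Finding "the most recent global process image file" requires some way of ordering the files in shared storage, which the patent leaves open. The sketch below assumes a hypothetical naming scheme, `<job_id>.<timestamp>.gckpt.tar`, and picks the file with the largest timestamp in its name rather than relying on file modification times.

```python
import os


def latest_global_checkpoint(shared_dir, job_id, suffix=".gckpt.tar"):
    """Return the path of the newest global checkpoint for job_id, or None.

    File names are assumed to look like "<job_id>.<timestamp><suffix>",
    where <timestamp> is a non-negative integer.
    """
    prefix = f"{job_id}."
    candidates = []
    for entry in os.listdir(shared_dir):
        if entry.startswith(prefix) and entry.endswith(suffix):
            stamp = entry[len(prefix):-len(suffix)]
            if stamp.isdigit():
                candidates.append((int(stamp), entry))
    if not candidates:
        return None
    # max() compares the integer timestamps first.
    newest = max(candidates)[1]
    return os.path.join(shared_dir, newest)
```

Encoding the timestamp in the name keeps the ordering independent of clock differences between the nodes that touch the shared directory; a real deployment might prefer a sequence number for the same reason.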
It should be noted that the content and embodiments of the invention are intended to demonstrate the practical application of the technical scheme provided by the invention and should not be construed as limiting its scope of protection. Those skilled in the art, inspired by the spirit and principles of the invention, may make various modifications, equivalent substitutions, or improvements, but all such changes or modifications fall within the scope of protection of the pending claims.

Claims (8)

1. A TORQUE-based parallel checkpoint execution method in which checkpoint operations are performed on computing nodes in an NFS shared file storage system, characterized in that the method comprises the following steps:
(1). The user submits a job to the TORQUE server daemon pbs_server;
(2). The TORQUE server daemon sends the task information to the TORQUE scheduler pbs_sched, and pbs_sched looks for computing nodes according to the parameter requirements specified in the job;
(3). The checkpoint operation is performed on the computing nodes.
2. The method according to claim 1, characterized in that in step 1 the user submits a job script with the TORQUE job submission command qsub, the job script uses the MPI process launch command chkp_mpirun to start MPI, and a job checkpoint request is added to the submission command.
3. The method according to claim 1, characterized in that in step 2 the TORQUE scheduler pbs_sched polls the state of each computing node through the computing node daemon pbs_mom and returns the result to the TORQUE server daemon pbs_server.
4. The method according to claim 1, characterized in that in step 3 the TORQUE server daemon pbs_server sends the user's checkpoint request to the corresponding computing nodes, where it is executed periodically; each computing node has its own independent checkpoint image.
5. The method according to claim 4, characterized in that the independent checkpoint images of the computing nodes that periodically perform the checkpoint operation are merged into a global checkpoint file.
6. The method according to claim 5, characterized in that the global checkpoint file is stored in the NFS shared file storage system.
7. The method according to claim 1 or 4, characterized in that the following step a is performed after step 3:
(a). While the computing nodes are being polled, if a computing node is found to have failed, all tasks assigned to that node that have not finished or have not yet been executed are rolled back and recovered from the global checkpoint file.
8. The method according to claim 7, characterized in that in step a the TORQUE scheduler pbs_sched sends the numbers of all tasks on the failed computing node back to the TORQUE server daemon pbs_server, and pbs_server reads the most recent global process image file from the NFS shared file storage system and resubmits the tasks in rollback mode.
CN201210367653.4A 2012-09-28 2012-09-28 TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method Active CN102915257B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210367653.4A CN102915257B (en) 2012-09-28 2012-09-28 TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210367653.4A CN102915257B (en) 2012-09-28 2012-09-28 TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method

Publications (2)

Publication Number Publication Date
CN102915257A (en) 2013-02-06
CN102915257B (en) 2017-02-08

Family

ID=47613630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210367653.4A Active CN102915257B (en) 2012-09-28 2012-09-28 TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method

Country Status (1)

Country Link
CN (1) CN102915257B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145947B1 (en) * 2006-09-29 2012-03-27 Emc Corporation User customizable CVFS namespace
US7933991B2 (en) * 2007-10-25 2011-04-26 International Business Machines Corporation Preservation of file locks during checkpoint and restart of a mobile software partition
CN101256526A (en) * 2008-03-10 2008-09-03 清华大学 Method for implementing document condition compatibility maintenance in inspection point fault-tolerant technique
CN101881996A (en) * 2010-07-19 2010-11-10 中国人民解放军国防科学技术大学 Parallel memory system check-point power consumption optimization method
CN102147755A (en) * 2011-04-14 2011-08-10 中国人民解放军国防科学技术大学 Multi-core system fault tolerance method based on memory caching technology

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107743618A (en) * 2015-06-24 2018-02-27 英特尔公司 Technology for data center environment checkpointing
CN107066205A (en) * 2016-12-30 2017-08-18 曙光信息产业(北京)有限公司 A kind of data-storage system
CN107066205B (en) * 2016-12-30 2020-06-05 曙光信息产业(北京)有限公司 Data storage system
CN108491159A (en) * 2018-03-07 2018-09-04 北京航空航天大学 A kind of massively parallel system checkpoint method for writing data for alleviating I/O bottlenecks based on random delay
CN108491159B (en) * 2018-03-07 2020-07-17 北京航空航天大学 Large-scale parallel system check point data writing method for relieving I/O bottleneck based on random delay

Also Published As

Publication number Publication date
CN102915257B (en) 2017-02-08

Similar Documents

Publication Publication Date Title
CN109062655B (en) Containerized cloud platform and server
US10275851B1 (en) Checkpointing for GPU-as-a-service in cloud computing environment
Liu Cutting {MapReduce} Cost with Spot Market
Maas et al. Taurus: A holistic language runtime system for coordinating distributed managed-language applications
US8560889B2 (en) Adding scalability and fault tolerance to generic finite state machine frameworks for use in automated incident management of cloud computing infrastructures
Riesen et al. Alleviating scalability issues of checkpointing protocols
Bouguerra et al. A flexible checkpoint/restart model in distributed systems
Yang et al. Reliable computing service in massive-scale systems through rapid low-cost failover
CN102915257A (en) TORQUE(tera-scale open-source resource and queue manager)-based parallel checkpoint execution method
e Silva et al. Application execution management on the InteGrade opportunistic grid middleware
CN116954944A (en) Distributed data stream processing method, device and equipment based on memory grid
Friedman et al. Transparent fault-tolerant Java virtual machine
Liu et al. Supporting fault-tolerance in presence of in-situ analytics
Goulart et al. Checkpointing techniques in distributed systems: A synopsis of diverse strategies over the last decades
JPH11353284A (en) Job re-executing method
Li et al. A replication structure for efficient and fault-tolerant parallel and distributed simulations
El-Desoky et al. Improving fault tolerance in desktop grids based on incremental checkpointing
Cogorno et al. Fault tolerance in Hadoop MapReduce implementation
Rong Design and Implementation of Operating System in Distributed Computer System Based on Virtual Machine
Sunil et al. An Innovative Approach for Cloud-Based Web Dev App Migration
Mehta et al. Checkpointing and recovery mechanism in grid
Yan et al. Veca: A High-Performance Consensus Algorithm for State Machine Replication
Wrzesinska et al. Persistent fault-tolerance for divide-and-conquer applications on the grid
Yucheng MapReduce model implementation on MPI platform
Rodríguez et al. Performance evaluation of an application-level checkpointing solution on grids

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211025

Address after: 100089 zone A-1, floor 2, building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing

Patentee after: Shuguang zhisuan Information Technology Co.,Ltd.

Address before: 100193 Building 36, Zhongguancun Software Park, 8 Dongbeiwang West Road, Haidian District, Beijing

Patentee before: Dawning Information Industry (Beijing) Co.,Ltd.

TR01 Transfer of patent right