CN102289392A - Operation scheduling method and system based on check point - Google Patents

Publication number
CN102289392A
Authority
CN
China
Prior art keywords
user job
described user
job
checkpoint
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102653649A
Other languages
Chinese (zh)
Inventor
马少杰
戴荣
王璟
许涛
李斌
李程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Co Ltd
Original Assignee
Dawning Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Co Ltd
Priority to CN2011102653649A
Publication of CN102289392A
Legal status: Pending

Landscapes

  • Retry When Errors Occur (AREA)

Abstract

The invention provides an operation scheduling method based on a check point. The operation scheduling method comprises the steps of: guiding a user operation to enter a queue for waiting; when resources are acquired, guiding the user operation to enter a memory area for running; saving the user operation based on a time point according to a preset migration parameter, and setting the time point as a check point; when the user operation interrupts abnormally, submitting the user operation again, guiding the user operation to enter a queue for waiting, when resources are acquired, running the user operation, reading the information of the user operation corresponding to the check point, and continuing the user operation. The invention also provides an operation scheduling system based on a check point.

Description

Checkpoint-based job scheduling method and system
Technical field
The present invention relates generally to the field of networking and, more specifically, to a checkpoint-based job scheduling method and system.
Background technology
In current network hardware configurations, a user job running in a cluster environment depends on the stability of the computing environment of each node. A cluster contains many nodes, and unavoidable factors such as operator error or hardware failure can cause a job to stop, resulting in losses for the user. Checkpoint technology can protect a user job at fixed intervals so that, when the job stops, it can be resumed quickly, which reduces losses and improves the efficiency with which jobs run.
Many software packages provide a similar function themselves, but these implementations lack generality. Such software is also expensive, which makes it hard for users to accept.
Summary of the invention
To solve the above problems, the present invention provides a checkpoint-based job scheduling method comprising the following steps: a user job enters a queue and waits, and when resources are obtained the user job enters memory and runs; according to a preset transfer parameter, the user job is saved at a time point, and that time point is set as a checkpoint; when the user job is abnormally interrupted, the user job is submitted again and enters the queue to wait, and when resources are obtained the user job runs, reads the user job information corresponding to the checkpoint, and continues executing.
When the user job is not abnormally interrupted, the user job runs to completion.
When an error occurs in submitting the user job, the user job is exited.
Before the step in which the user job enters the queue and waits, the user job is submitted, and a job parameter and the transfer parameter are set when the user job starts.
In addition, the present invention provides a checkpoint-based job scheduling system comprising: a waiting module, configured to make a user job enter a queue and wait, the user job entering memory and running when resources are obtained; and a checkpoint saving module, configured to save the user job at a time point according to a preset transfer parameter and to set that time point as a checkpoint. When the user job is abnormally interrupted, the user job is submitted again and enters the queue to wait; when resources are obtained, the user job runs, reads the user job information corresponding to the checkpoint, and continues executing.
The system further comprises an exit module, configured to exit the user job when an error occurs in submitting the user job.
The system further comprises a submission module, configured to submit the user job, and a parameter setting module, configured to set a job parameter and the transfer parameter when the user job starts.
The proposed combination of checkpointing with a job scheduling system allows a stopped job to be rerun automatically. It uses the properties of the job queue to resubmit a failed job automatically and resume it, so that the job is run again without manual intervention. The technique is widely applicable and suits most software systems.
Description of drawings
The present invention is best understood from the following detailed description when read with the accompanying drawings. It should be emphasized that, in accordance with standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
Fig. 1 shows a flowchart of a checkpoint-based job scheduling method according to an exemplary embodiment of the present invention.
Embodiment
The following description provides many different embodiments, or examples, for implementing different features of the present invention. Specific examples of elements and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Moreover, the formation of a first feature on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features are formed between the first and second features, such that the first and second features are not in direct contact. Various features may be arbitrarily drawn at different scales for simplicity and clarity.
The general technical approach of the present invention is as follows:
1. Checkpoint/Restart technology is used to protect a process at specific times by saving it to a file on hardware storage. When a user process stops running because of an abnormal interruption, the process can be restored quickly from the checkpoint file in storage.
2. A job scheduling system is used to resubmit the job, ensuring that the user job executes without interruption.
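The save-and-restore idea in point 1 can be illustrated with a minimal Python sketch. Note that this is application-level state saving, not the process-level Checkpoint/Restart the text refers to, and the function names `save_checkpoint` and `load_checkpoint`, the file path, and the use of `pickle` are all assumptions made for illustration. The file is written to a temporary name and then renamed, so an interruption during the save should not leave a half-written checkpoint behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())      # push the bytes to storage before renaming
    os.replace(tmp_path, path)      # replaces any older checkpoint in one step


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default
```

A job written against these helpers would call `load_checkpoint` once at start-up and `save_checkpoint` at each checkpoint interval, so that a resubmitted run picks up from the most recent saved state.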
Fig. 1 shows a flowchart of a checkpoint-based job scheduling method according to an exemplary embodiment of the present invention. As shown in Fig. 1, the present invention provides a checkpoint-based job scheduling method comprising the following steps: S101, a user job is submitted, and a job parameter and a transfer parameter are set when the user job starts; S103, the user job enters a queue and waits, and when resources are obtained the user job enters memory and runs; S105, according to the preset transfer parameter, the user job is saved at a time point, and that time point is set as a checkpoint. When the user job is abnormally interrupted, operation is halted and the fault is investigated; once the fault has been cleared, the user job is restored by reading the user job information corresponding to the checkpoint.
Preferably, when the user job is not abnormally interrupted, the user job runs to completion.
Preferably, when an error occurs in submitting the user job, the user job is exited.
Preferably, before the step in which the user job enters the queue and waits, the user job is submitted, and the job parameter and the transfer parameter are set when the user job starts.
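One way to read the steps above is as a small state machine for a job. The Python sketch below is an interpretation, not something the source defines: the state names, the `ALLOWED` table, and the `advance` helper are hypothetical, and the transition from `INTERRUPTED` back to `SUBMITTED` models the resubmission described in the summary.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    RUNNING = "running"
    INTERRUPTED = "interrupted"
    COMPLETED = "completed"
    EXITED = "exited"


# Transitions implied by the described flow (an assumed reading of the text).
ALLOWED = {
    JobState.SUBMITTED: {JobState.QUEUED, JobState.EXITED},   # submission error -> exit
    JobState.QUEUED: {JobState.RUNNING},                      # resources obtained
    JobState.RUNNING: {JobState.COMPLETED, JobState.INTERRUPTED},
    JobState.INTERRUPTED: {JobState.SUBMITTED},               # submitted again
    JobState.COMPLETED: set(),
    JobState.EXITED: set(),
}


def advance(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Checkpoint saving does not appear as a state here because it happens while the job stays in `RUNNING`.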
In addition, the present invention provides a checkpoint-based job scheduling system (not shown) comprising: a waiting module, configured to make a user job enter a queue and wait, the user job entering memory and running when resources are obtained; and a checkpoint saving module, configured to save the user job at a time point according to a preset transfer parameter and to set that time point as a checkpoint. When the user job is abnormally interrupted, the user job is submitted again and enters the queue to wait; when resources are obtained, the user job runs, reads the user job information corresponding to the checkpoint, and continues executing.
The system further comprises an exit module, configured to exit the user job when an error occurs in submitting the user job.
The system further comprises a submission module, configured to submit the user job, and a parameter setting module, configured to set a job parameter and the transfer parameter when the user job starts.
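The module breakdown above might be sketched as a handful of small Python classes. All class, method, and field names here are hypothetical, a dictionary stands in for checkpoint storage, and treating a non-positive transfer interval as a "submission error" is an assumption, since the source does not say what counts as one.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    job_params: dict = field(default_factory=dict)
    transfer_interval: int = 0      # stand-in for the transfer parameter
    status: str = "new"


class ExitModule:
    """Exits a job whose submission went wrong."""
    def exit(self, job, reason):
        job.status = f"exited: {reason}"


class WaitModule:
    """Holds jobs in a queue until a resource slot is free."""
    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.queue = deque()

    def enqueue(self, job):
        job.status = "queued"
        self.queue.append(job)

    def dispatch(self):
        """Start queued jobs while slots remain; return the jobs started."""
        started = []
        while self.queue and self.free_slots > 0:
            job = self.queue.popleft()
            self.free_slots -= 1
            job.status = "running"
            started.append(job)
        return started


class ParameterModule:
    """Sets the job parameter and the transfer parameter."""
    def configure(self, job, job_params, transfer_interval):
        job.job_params = dict(job_params)
        job.transfer_interval = transfer_interval


class SubmitModule:
    """Submits a job, handing it to the wait module or the exit module."""
    def __init__(self, wait_module, exit_module):
        self.wait_module = wait_module
        self.exit_module = exit_module

    def submit(self, job):
        if job.transfer_interval <= 0:   # assumed notion of a submission error
            self.exit_module.exit(job, "invalid transfer parameter")
            return False
        self.wait_module.enqueue(job)
        return True


class CheckpointModule:
    """Keeps the latest saved state per job (a dict stands in for storage)."""
    def __init__(self):
        self.saved = {}

    def save(self, job, state):
        self.saved[job.name] = state

    def load(self, job, default=None):
        return self.saved.get(job.name, default)
```

Keeping the wait queue and the checkpoint store in separate objects mirrors the text: resubmission only needs the queue, while recovery only needs the last saved state.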
Specifically, processing may follow the flow below:
The user submits a job, and the job parameter and transfer parameter are set when the job starts to run.
The job is submitted to the job scheduling system and enters a queue to wait. Once resources are obtained, the job enters memory and runs as a user process.
At the interval given by the transfer parameter, a process checkpoint of the job process is saved on schedule, and the checkpoint information is written to physical storage.
If the job is abnormally interrupted, the job submission system submits the job again. The job then re-enters the job queue, enters the running state once resources are obtained, reads the checkpoint information, and continues executing. If an error occurs again, this step is repeated.
If there is no abnormal interruption, the job completes successfully.
If job submission fails, the job exits.
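The flow above can be simulated end to end in a few lines of Python. This is a toy model under stated assumptions: a job is a counter of work steps, faults are injected at chosen steps through the hypothetical `fail_at` set, a dictionary stands in for physical storage, and `max_resubmits` is a safety limit added here (the source simply repeats the resubmission).

```python
from collections import deque


class JobInterrupted(Exception):
    """Stands in for an abnormal interruption such as a node failure."""


def run_job(job_id, total_steps, interval, checkpoints, fail_at):
    """Run from the last checkpoint, saving progress every `interval` steps."""
    step = checkpoints.get(job_id, 0)        # read checkpoint information
    while step < total_steps:
        step += 1                            # one unit of work
        if step in fail_at:
            fail_at.discard(step)            # each injected fault fires once
            raise JobInterrupted(f"{job_id} interrupted at step {step}")
        if step % interval == 0:
            checkpoints[job_id] = step       # write checkpoint information
    return step


def schedule(job_id, total_steps, interval, fail_at, max_resubmits=5):
    """Queue the job, run it, and resubmit it after each interruption.

    Returns the step each run resumed from, in order.
    """
    queue = deque([job_id])                  # the job enters the queue
    checkpoints = {}                         # stand-in for physical storage
    resumed_from = []
    resubmits = 0
    while queue:
        current = queue.popleft()            # resources obtained: job runs
        resumed_from.append(checkpoints.get(current, 0))
        try:
            run_job(current, total_steps, interval, checkpoints, fail_at)
        except JobInterrupted:
            resubmits += 1
            if resubmits > max_resubmits:    # assumed limit, not in the source
                raise RuntimeError("resubmission limit reached")
            queue.append(current)            # the job is submitted again
    return resumed_from
```

For example, `schedule("job-1", total_steps=10, interval=3, fail_at={5, 8})` returns `[0, 3, 6]`: the first run starts from scratch, the second resumes from the checkpoint at step 3, and the third from the checkpoint at step 6.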
The above processing achieves the following technical effects:
1. In a high-performance computing environment, abnormal interruptions that prevent user jobs from continuing occur from time to time. With checkpoint technology, a user job can be checkpointed on schedule, so that it does not have to start over from the beginning after an abnormal interruption.
2. With the job scheduling system, a user job can be submitted again. This restores the user job quickly and ensures that it can still run normally after an abnormal interruption.
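The first effect can be made concrete with a little arithmetic. The sketch below assumes checkpoints are taken at exact multiples of the interval and that saving one takes no time; the function name `lost_work` and its units are illustrative only.

```python
def lost_work(failure_time, checkpoint_interval=None):
    """Amount of work that must be redone after a failure at `failure_time`."""
    if checkpoint_interval is None:
        return failure_time                     # no checkpoint: start over
    return failure_time % checkpoint_interval   # resume from the last checkpoint
```

With a failure 95 minutes into a run, `lost_work(95)` is 95 minutes, while `lost_work(95, 30)` is 5 minutes, because the job resumes from the checkpoint taken at minute 90. Under these assumptions the loss never exceeds one checkpoint interval.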
The foregoing outlines features of several embodiments so that those of ordinary skill in the art may better understand the aspects of the present invention. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those of ordinary skill in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present invention, and that they may make various changes, substitutions, and alterations without departing from the spirit and scope of the present invention.

Claims (7)

1. A checkpoint-based job scheduling method, characterized in that it comprises the following steps:
A user job enters a queue and waits, and when resources are obtained, said user job enters memory and runs;
According to a preset transfer parameter, said user job is saved at a time point, and said time point is set as a checkpoint;
When said user job is abnormally interrupted, said user job is submitted again and enters the queue to wait, and when resources are obtained, said user job runs, reads the information of said user job corresponding to said checkpoint, and continues executing.
2. The method according to claim 1, characterized in that, when said user job is not abnormally interrupted, said user job runs to completion.
3. The method according to claim 1 or 2, characterized in that, when an error occurs in submitting said user job, said user job is exited.
4. The method according to claim 1, characterized in that, before the step in which said user job enters the queue and waits, said user job is submitted, and a job parameter and said transfer parameter are set when said user job starts.
5. A checkpoint-based job scheduling system, characterized in that it comprises:
A waiting module, configured to make a user job enter a queue and wait, said user job entering memory and running when resources are obtained;
A checkpoint saving module, configured to save said user job at a time point according to a preset transfer parameter and to set said time point as a checkpoint;
Wherein, when said user job is abnormally interrupted, said user job is submitted again and enters the queue to wait, and when resources are obtained, said user job runs, reads the information of said user job corresponding to said checkpoint, and continues executing.
6. The system according to claim 5, characterized in that it further comprises: an exit module, configured to exit said user job when an error occurs in submitting said user job.
7. The system according to claim 5, characterized in that it further comprises:
A submission module, configured to submit said user job; and
A parameter setting module, configured to set a job parameter and said transfer parameter when said user job starts.
CN2011102653649A 2011-09-08 2011-09-08 Operation scheduling method and system based on check point Pending CN102289392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102653649A CN102289392A (en) 2011-09-08 2011-09-08 Operation scheduling method and system based on check point

Publications (1)

Publication Number Publication Date
CN102289392A true CN102289392A (en) 2011-12-21

Family

ID=45335840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102653649A Pending CN102289392A (en) 2011-09-08 2011-09-08 Operation scheduling method and system based on check point

Country Status (1)

Country Link
CN (1) CN102289392A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1155340A (en) * 1994-07-25 1997-07-23 英国电讯有限公司 Computer system having client-server architecture
US20050028159A1 (en) * 2003-07-30 2005-02-03 Masayoshi Kodama Memory managing system and task controller in multitask system
CN102012843A (en) * 2010-11-19 2011-04-13 曙光信息产业(北京)有限公司 Task migration system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106663031A (en) * 2014-08-21 2017-05-10 微软技术许可有限责任公司 Equitable sharing of system resources in workflow execution
US10554575B2 (en) 2014-08-21 2020-02-04 Microsoft Technology Licensing, Llc Equitable sharing of system resources in workflow execution
WO2017114176A1 (en) * 2015-12-30 2017-07-06 阿里巴巴集团控股有限公司 Method and apparatus for coordinating consumption queue in distributed environment
TWI735519B (en) * 2017-01-24 2021-08-11 香港商阿里巴巴集團服務有限公司 Distributed environment coordinated consumption queue method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20111221