CN108536528A - Application-aware large-scale grid job scheduling method - Google Patents

Application-aware large-scale grid job scheduling method Download PDF

Info

Publication number
CN108536528A
CN108536528A
Authority
CN
China
Prior art keywords
job
node
time
grid
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810245680.1A
Other languages
Chinese (zh)
Inventor
唐小勇 (Tang Xiaoyong)
李肯立 (Li Kenli)
刘楚波 (Liu Chubo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201810245680.1A priority Critical patent/CN108536528A/en
Publication of CN108536528A publication Critical patent/CN108536528A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an application-aware large-scale grid job scheduling method. The method mainly comprises the following steps. First, a user submits a job through the grid job-submission interface. Second, the system predicts the job's computational workload from its application feature information. Third, the system queries the current state information of the large-scale distributed grid nodes. Fourth, the job scheduler finds the computable nodes according to the job's application requirements. Fifth, the job is distributed to the corresponding compute node. Sixth, the system judges whether all jobs in the job queue have been scheduled; if not, the fourth and fifth steps are executed in a loop, otherwise the method waits for the next scheduling point and runs again. Compared with the existing job scheduling method based on a first-come-first-served policy, this method can effectively improve grid-system resource utilization and better satisfy user demands.

Description

Application-aware large-scale grid job scheduling method
Technical Field
The invention belongs to the technical field of resource management and task scheduling of computer software and a large-scale parallel distributed processing system, and relates to an application-aware large-scale grid job scheduling method.
Background
The China National Grid is a new-generation information-infrastructure test bed that aggregates high-performance computing and transaction-processing capabilities. Through resource sharing, cooperative work and service mechanisms, the grid effectively supports applications in scientific research, resources and environment, advanced manufacturing, information services and other fields. Building mainly on the existing high-performance computing environment, it concentrates research on key technologies for optimizing application services of that environment, further improves the resource-construction mechanism, and establishes a practical high-performance computing application service environment and application-domain communities with a novel operating mechanism and rich application resources, thereby reducing the cost of high-performance computing applications and comprehensively raising the level of high-performance computing application services.
At present, the China National Grid links its northern and southern main nodes, the national supercomputing centers and a number of ordinary nodes into a large-scale distributed computing grid. The National Supercomputing Center in Wuxi hosts the Sunway TaihuLight, then the world's top-ranked supercomputer, with a peak performance exceeding 100 petaflops; the system is built entirely on the Chinese-designed Sunway SW26010 many-core processor and was the first world-leading supercomputer constructed wholly from domestic Chinese processors. The National Supercomputing Center in Guangzhou hosts the Tianhe-2 system, whose first-stage peak speed is 54.9 petaflops, with a sustained speed of 33.9 petaflops and an energy-efficiency ratio of 1.9 gigaflops per watt for double-precision floating-point operations.
The National Supercomputing Center in Tianjin is equipped with the Tianhe-1A high-performance computer system, with a peak performance of 4.7 petaflops, which ranked first in the HPC TOP500 list of 2010. It additionally hosts a Tianhe TH-1 system with a computing performance in the hundreds of teraflops; a Tianhe Tianxiang system comprising 128 Intel Xeon X5675 CPUs; and a Tianhe Tianchi system comprising 96 CPUs. All of these supercomputing centers possess remarkable computing and storage capability, and the ordinary nodes can also carry out large-scale computation. For example, the Inspur Tiansuo high-performance computer system deployed at Tsinghua University delivers 104 teraflops from its general-purpose processors and 64 teraflops from its GPU acceleration components. The Shenzhen Institute of Advanced Technology of the Chinese Academy of Sciences runs a Dawning 5000A cluster system with 10 teraflops of general-purpose computing power, 200 teraflops of special-purpose computing power, 500 TB of storage, 2 GB/s of internal data-exchange bandwidth, and overall system resource availability above 90%. The Gansu Provincial Computing Center hosts a scientific and engineering computing cluster with a peak of 40 teraflops and a distributed network storage system of 35 TB. The University of Hong Kong operates multiple clusters across two organizations, its computer science department and its computer centre, with a general-purpose computing peak of 23.45 teraflops and a special-purpose peak of 7.7 teraflops.
However, how to use these resources fully and effectively, improve user job-completion efficiency and reduce user cost has user job scheduling at its core. The main objective of job scheduling is to determine, under constraints such as the user's time limit and cost, a job allocation scheme and an execution order for user jobs according to a scheduling policy, so as to satisfy both user and system requirements. At present, job scheduling on the China National Grid adopts only a simple first-come-first-served strategy, whose relatively low efficiency hinders effective use of the national grid. Aiming at this problem, the invention makes a comprehensive, balanced decision over the applications each compute node can support, the user's application, and the current compute-node resource status, so as to schedule system jobs efficiently and thereby provide effective technical support for the national grid and other large-scale distributed systems.
Disclosure of Invention
The invention provides an application-aware large-scale grid job scheduling method, aiming at the low efficiency of the national grid and other large-scale distributed systems caused by resource heterogeneity, geographical distribution and the diversity of user applications. To solve these problems, the invention adopts the following technical scheme:
an application-aware large-scale grid job scheduling method is characterized by comprising the following steps:
step 1: a user submits a job to the grid system; the grid system stores the job information into the corresponding database grid-job table, and the job is then inserted, according to its processing state, into a job queue based on a multi-thread sharing mechanism, wherein the job queues comprise a ready queue, a running queue and a result-feedback queue: the ready queue holds jobs waiting to be scheduled, the running queue holds jobs scheduled to run at some computing node, and the result-feedback queue holds jobs that have finished running, the results returned from the nodes being saved in that queue;
step 2: predicting the running time of the job submitted by the user according to a pre-established user-job running-time model;
step 3: querying the current state information of the large-scale distributed grid nodes in real time, and storing it into a grid-node resource information table of a database;
step 4: searching, according to the application requirements of the job, for the grid nodes that can be used to compute the job, namely the computable nodes;
step 5: searching, among the computable node set found in step 4, for the nodes that can meet the running-time limit of the user job, taking them as the schedulable node set, and then selecting from the schedulable node set the node with the lowest resource utilization as the grid node for executing the job;
step 6: distributing the job to the grid node obtained in step 5;
step 7: judging whether all jobs in the job queue have been scheduled; if scheduling is finished, waiting for the next scheduling point; otherwise, taking a job from the job queue and returning to step 4 for cyclic execution.
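The seven steps above amount to one scheduling pass per scheduling point. A minimal sketch in Python follows; the helper callables `find_capable` and `find_schedulable` and the node field `util` are illustrative stand-ins, not names from the patent:

```python
from collections import deque

def scheduling_pass(ready_jobs, nodes, find_capable, find_schedulable):
    """One pass over the ready queue: steps 4-7 of the method.

    find_capable / find_schedulable are hypothetical stand-ins for the
    application-requirement filter (step 4) and the deadline filter (step 5).
    """
    assignments = {}
    pending = deque(ready_jobs)
    while pending:                                   # step 7: loop until drained
        job = pending.popleft()
        capable = find_capable(job, nodes)           # step 4: computable nodes
        schedulable = find_schedulable(job, capable) # step 5: deadline check
        if not schedulable:
            continue                                 # job waits for next point
        best = min(schedulable, key=lambda n: n["util"])  # lowest utilization
        assignments[job["id"]] = best["id"]          # step 6: dispatch
    return assignments
```

The pass returns a job-to-node mapping; jobs with no feasible node simply remain for the next scheduling point.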
In step 1, the job information includes a user ID, a job ID, application software requirements, a version number, a license, the number of nodes, the number of CPUs, the number of many-core processors, the running time, the job data volume, and the expected completion time.
In step 2, the user-job running-time model is built on a historical database of system jobs run at each computing node, with records described as ⟨Job_i, Time_{i,j}⟩, where the application features of Job_i include its data volume Jd(Job_i) and its job size Js(Job_i). The proximity ρ_i between the job Job to be scheduled and a historical job Job_i on these application features is computed by the following formula:

ρ_i = |Jd(Job_i) − Jd(Job)| + |Js(Job_i) − Js(Job)|

The historical running time Time_{i,j} of the job with minimum proximity ρ_i is taken as the predicted running time of the job at each grid compute node.
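As a sketch, the nearest-neighbour lookup this model describes can be written as follows; the tuple layout of the history records is an assumption for illustration:

```python
def predict_runtime(job, history):
    """Nearest-neighbour run-time prediction over application features.

    job     : (Jd, Js) - data volume and job size of the job to schedule
    history : list of (Jd_i, Js_i, Time_ij) records for one compute node
    Uses rho_i = |Jd_i - Jd| + |Js_i - Js|, as in the formula above.
    """
    if not history:
        raise ValueError("no historical jobs recorded for this node")
    best = min(history,
               key=lambda h: abs(h[0] - job[0]) + abs(h[1] - job[1]))
    return best[2]  # Time_ij of the closest historical job
```

Running this once per compute node yields the per-node predicted run times used later by the deadline check in step 5.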
In step 3, the current state information of the large-scale distributed grid nodes includes the online job workload, the resource utilization, and the hardware and software the nodes can support.
In the application-aware large-scale grid job scheduling method, step 5 comprises the following sub-steps:
step 5.1: initializing a data structure jobnode_starttime for the job start times on the computable node set according to the real-time node resource information;
step 5.2: calculating the deviation θ between the running times historically submitted by the user and the actual running times:

θ = (1/m) · Σ_{i=1}^{m} |t_i^s − t_i^a| / t_i^a

where t_i^s and t_i^a are, respectively, the submitted running time and the actual running time of the i-th job submitted by the user, and m is the number of jobs the user has submitted in the grid system;
step 5.3: if the deviation θ is less than 0.2, using as the job running time for the scheduling decision the running time in the job information submitted by the user, and otherwise the running time predicted by the system in step 2;
step 5.4: sequentially calculating the job completion time jobnode_endtime of each node in the computable node set from the job running time obtained in step 5.3 and the job's start execution time on that node;
step 5.5: judging from the completion time on each computable node whether the job's time limit is met, and if so, putting the node into the schedulable node set;
step 5.6: selecting the node with the lowest resource utilization from the schedulable node set, and scheduling the job to that grid node.
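Sub-steps 5.1–5.6 can be sketched as a single selection function. The exact form of θ is taken here as the mean relative deviation (one reasonable reading, given only the threshold 0.2), and the node fields `start` and `util` are illustrative assumptions:

```python
def choose_node(job, capable_nodes, user_runtime, predicted_runtime, history):
    """Sub-steps 5.1-5.6: pick the execution node for one job.

    history : list of (submitted_runtime, actual_runtime) pairs for this
              user; theta is computed as their mean relative deviation
              (an assumption - the method only fixes the threshold 0.2).
    Each node dict carries `start` (earliest job start time on the node)
    and `util` (current resource utilization) - illustrative fields.
    """
    if history:
        theta = sum(abs(s - a) / a for s, a in history) / len(history)
    else:
        theta = float("inf")       # first submission: trust the predictor
    runtime = user_runtime if theta < 0.2 else predicted_runtime  # step 5.3
    schedulable = [n for n in capable_nodes                       # step 5.5
                   if n["start"] + runtime <= job["deadline"]]
    if not schedulable:
        return None                # no node meets the deadline
    return min(schedulable, key=lambda n: n["util"])              # step 5.6
```

When the user's historical estimates are accurate (θ < 0.2) the user's own estimate drives the deadline check; otherwise the system's prediction does.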
In step 6, distributing the job to the corresponding computing node comprises the system issuing the job computation request to the computing node, transmitting the job data, querying the job execution state, and feeding system scheduling information back to the user.
The technical effect of the method is that, compared with the existing job scheduling method based on a first-come-first-served strategy, the method provided by the invention can effectively improve the resource utilization of the grid system and satisfy user requirements.
The invention will be further explained with reference to the drawings.
Drawings
FIG. 1 is a flow chart of a method for application-aware large-scale grid job scheduling provided by the present invention;
FIG. 2 is a diagram of a large-scale grid global scheduler architecture provided by an implementation of the present invention;
FIG. 3 is a diagram of a user job interaction interface provided by implementations of the invention;
fig. 4 is an operation state transition diagram.
Detailed Description
The invention provides an application-aware large-scale grid job scheduling method; a flow chart of the method is shown in figure 1. Based on the large-scale China National Grid and starting from the application characteristics of user jobs, the method offers a large-scale distributed job-scheduling approach that satisfies both job-duration and application requirements, and can effectively improve the job-scheduling efficiency of the China National Grid.
The invention is realized by the following technical scheme:
the embodiment is oriented to the China national grid of the large-scale distributed computing system, and the operation is dispatched to each computing node in the overall operation dispatcher so as to integrate the idle resources of the large-scale computing nodes, fully utilize the idle resources and improve the service quality of the China national grid. The development of the embodiment is based on a Linux platform and adopts a MySql database. The system structure is shown in fig. 2, and the global job scheduler mainly comprises a scheduling decision module, a job distribution module, an information acquisition module and a communication module. The scheduling decision module integrates a scheduling algorithm and makes decisions on scheduling of jobs and corresponding computing nodes; the operation execution module automatically generates an instruction for submitting the current operation to the computing center according to the computing node selected by the scheduling decision module; the information collection module is responsible for collecting resource conditions (such as the number of idle servers, the number of CPU cores, the use of a memory and the like) of each computing node, and can also collect job completion condition information (such as running time, predicted completion time and the like of a job) and feed back the information to a user in real time; the communication module is responsible for the specific communication between the user main program and each computing node, for example, the instruction generated by the job execution module is sent to the corresponding computing node, so that the job is submitted and executed at the node.
A user first submits a job through the grid system, for example via the job interaction interface shown in fig. 3. The main job information includes the user ID, job ID, application software requirements, version number, license, number of nodes, number of CPUs, number of many-core processors, running time, job data volume, expected completion time, and so on. After the user clicks the job-submission button, this embodiment first stores the user's job information in the MySQL database grid-job table, whose main attributes are: jobId (job number), jobName (job name), software_type (application-software requirements), software_system (software environment), resource_type (resource type), resource_value (resource value under the corresponding type), cpu (CPUs required by the user), runtime (user-estimated job running time), deadline (the time by which the user wants the job completed), budget (user budget), priority (job priority), memory (memory required by the job), and disk (disk size required by the job). The job is then inserted into a job queue based on a multi-thread sharing mechanism. Job queues are the technique this patent uses to manage user-submitted jobs: when a job is ready to be scheduled to run on some compute node, it is transferred from the ready queue to the running queue; when the job finishes running, it is transferred from the running queue to the result-feedback queue, where the result returned from the node is saved. Multiple users may submit jobs at the same time, i.e., operate on the same queue, so mutual exclusion of queue operations must be guaranteed while still maintaining a degree of concurrency. The job state-transition diagram is shown in fig. 4.
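The three queues with the mutual exclusion described above can be sketched with Python's thread-safe `queue.Queue`; the class and method names here are illustrative, not from the patent:

```python
import queue

class JobQueues:
    """Ready / running / result-feedback queues with mutual exclusion.

    A minimal sketch of the multi-thread shared queue mechanism: queue.Queue
    serializes concurrent access internally, so many user threads can
    submit jobs at once without explicit locking here.
    """
    def __init__(self):
        self.ready = queue.Queue()
        self.running = queue.Queue()
        self.feedback = queue.Queue()

    def submit(self, job):
        self.ready.put(job)                 # user submission

    def dispatch(self):
        job = self.ready.get()              # scheduled to a compute node
        self.running.put(job)
        return job

    def finish(self, job, result):
        self.feedback.put((job, result))    # node returns its result
```

Each transition mirrors an arrow in the state diagram of fig. 4: submit → ready, dispatch → running, finish → result feedback.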
The second step of the embodiment is to predict the running time of the job submitted by the user. The embodiment establishes a historical database of user jobs run at each computing node. It mainly collects job parameters such as the user number/user name, job number, number of CPUs used by the job, job queue, job working directory, job command, submission time, start time, end time, and exit reason, together with grid-computing-node parameters such as the cluster, host name, collection time, CPU occupancy, CPU micro-architectural data, CPU floating-point capability, CPU instruction-execution speed, last-level-cache hit rate, memory occupancy, memory read-write bandwidth, InfiniBand network usage, Ethernet usage, and disk/NFS usage. Analyzing the running-time distribution over all jobs, and over jobs broken down by user and by application, the embodiment finds that the majority of jobs are short-duration jobs; with a sufficient number of jobs, the statistics show that the distribution of job durations follows a power law. Moreover, the times at which a user submits jobs tend to cluster, and the running times of two clusters of jobs far apart in time differ more strongly. Finally, the patent studies the behavioral habits of the user and, combining application features with the user's job characteristics, establishes a user-specific application running-time model, so that a predicted value of the job running time is obtained from the user's job information and the model.
The user-job running-time model referred to in this embodiment is built on a historical database of system jobs run at each computing node, with records described as ⟨Job_i, Time_{i,j}⟩, where the application features of Job_i include its data volume Jd(Job_i) and its job size Js(Job_i). The proximity ρ_i between the job Job to be scheduled and a historical job Job_i on these application features is computed by the following formula:

ρ_i = |Jd(Job_i) − Jd(Job)| + |Js(Job_i) − Js(Job)|

The historical running time Time_{i,j} of the job with minimum proximity ρ_i is taken as the predicted running time of the job at each grid compute node.
In the third step of this embodiment, the current state information of the large-scale distributed grid nodes, such as the online job workload, the resource utilization, and the hardware and software the node can support, is queried in real time and stored in the grid-node resource information table of the database. The main fields of this table are node_Id (node ID), node_name (node name), resource_type (the resource type owned by the corresponding node), resource_software (the application software available to the user), wait_job_num (the number of jobs currently waiting to run), wait_time (roughly how long the user would need to wait for execution), cpu (the number of CPU cores currently idle), memory (the memory currently available), disk (the disk capacity currently remaining), max_run (the maximum time a user job is allowed to run), predict_run (the predicted job running time), net_delay (the network delay, which must be considered when scheduling across centers), cpu_use (CPU usage), and memory_use (memory usage). This part adopts a C/S model: the client is responsible for collecting the current state information of each grid node in real time, and the server receives the information and stores it in the MySQL node-resource information table. Communication uses sockets under the Linux platform; a select call monitors the sockets, and connections are established to carry out the communication. The communication protocol is mainly the connection-oriented TCP protocol, which ensures reliable delivery of the information, and concurrent access is handled mainly through multi-threaded multiplexing.
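The select-based collection loop can be sketched as follows, using Python's `selectors` wrapper over select; this is a sketch only, and a real server would loop forever and write each parsed report into the MySQL node-resource table instead of returning it:

```python
import selectors
import socket

def serve_one_report(srv):
    """Accept one grid-node client on listening socket `srv` and read its
    complete state report (until the client closes the connection).

    A minimal sketch of the select-style C/S collector described above.
    """
    sel = selectors.DefaultSelector()
    sel.register(srv, selectors.EVENT_READ)
    chunks = []
    while True:
        for key, _ in sel.select(timeout=5):
            sock = key.fileobj
            if sock is srv:                      # a node client connecting
                conn, _addr = srv.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = sock.recv(4096)
                if data:                         # part of the state report
                    chunks.append(data)
                else:                            # client closed: report done
                    sel.unregister(sock)
                    sock.close()
                    sel.close()
                    return b"".join(chunks).decode()
```

The same event loop extends naturally to many simultaneous node clients, which is what the select call buys over one blocking socket per node.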
The fourth step searches for the corresponding grid computing nodes according to the application requirements of the job. The fields software_type and software_system in the grid-job table give the application software and supporting software environment the job needs in order to run. This embodiment uses this information to search the node resources for the set that meets the requirements; the implementation uses the following data structures and queue-operation functions. According to the information of the job currently to be scheduled, the queues that satisfy the application requirements are screened from the current grid-node queues, and the queue names and grid nodes are stored in an array. The key function is intra_info_read(void *job_info, void *sw_match[]), whose parameters are a job pointer and a pointer array of the structure type sw_match_info. The array stores sw_match_info structure pointers for the grid queues that meet the software requirements, and the return value is the number of such grid queues. In this function, the number of candidate grid queues is the product of the constants GN_NUM, QUEUE_NUM and SOFT_NUM (the number of grid nodes, of queues, and of software packages). After the function call completes, the grid IDs and queue names of the software-matched grid queues are obtained. Next, according to the sw_match array just obtained, the entries with their grid IDs and queue names are put into a linked list. The key function is void queue_info_read(List l, void *sw_match[]), where the parameter List is the linked list of software-matched queues.
Then, according to the current job information, each queue in the software-matched queue linked list that also meets the hardware requirements is put into the hardware-matched queue linked list, while queues that do not meet the hardware requirements are deleted directly from the software-matched list. This yields the node-queue linked list after hardware screening. The key function is int queShift(List sw_list, List hw_list, void *data); the linked-list pointer to the hardware-matched grid queues is obtained through the second parameter, and the return value is the number of grid queues after hardware matching. This completes the search for the job's computable node set.
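The two-stage software-then-hardware screening can be sketched as follows; the dictionary field names such as `free_cpus` are illustrative stand-ins for the C structures above:

```python
def find_capable_nodes(job, node_queues):
    """Step 4's two-stage filter: software match first, then hardware match.

    node_queues : list of dicts standing in for the grid-queue records.
    Returns the computable node set for the job.
    """
    # Stage 1: keep queues offering the required software and environment
    sw_match = [q for q in node_queues
                if job["software"] in q["software"]
                and job["sw_system"] == q["sw_system"]]
    # Stage 2: of those, keep queues with enough free hardware
    return [q for q in sw_match
            if q["free_cpus"] >= job["cpus"]
            and q["free_mem"] >= job["memory"]]
```

Queues that fail either stage simply drop out, mirroring the deletion from the software-matched linked list described above.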
In the fifth step, the job scheduler performs, on the basis of the computable node set, job scheduling that satisfies the user's job running-time limit. This embodiment first builds a data structure jobnode_starttime recording the job start time on each node of the computable node set from the real-time node-resource information, and then computes the deviation θ between the running times historically submitted by the user and the actual running times. The concrete procedure is as follows. First, the data structure jobnode_starttime of the computable node set is initialized from the real-time node-resource information; then the deviation θ between the user's historically submitted running times and the actual running times is computed as

θ = (1/m) · Σ_{i=1}^{m} |t_i^s − t_i^a| / t_i^a

where t_i^s and t_i^a are, respectively, the submitted running time and the actual running time of the i-th job submitted by the user, and m is the number of jobs the user has submitted in the grid system. If the deviation θ is below 0.2, the time used by the job-scheduling decision is the running time in the job information submitted by the user; otherwise the running time predicted in the second step of this patent serves as the scheduling basis. If the user is submitting a job for the first time, the predicted time is used as the job's scheduling time. From this scheduling time and the job's start execution time on each node, the job completion time jobnode_endtime of every node in the computable node set is calculated in turn. Then, from the job's completion time on each computable node, it is judged whether the job's running-time limit is met; if so, the node is placed into the schedulable node set. Finally, the patent selects the node with the lowest resource utilization from the schedulable node set and schedules the job to that grid node.
In the sixth step, the system is responsible for distributing the job to the corresponding computing node, including the job computation request, job data transmission, job execution-state query, and scheduling-information feedback. The system then judges whether all jobs in the job queue have been scheduled; if scheduling is finished, it waits for the next scheduling point. Otherwise, it takes a job from the job queue and repeats the fourth, fifth and sixth steps to carry out job scheduling.

Claims (6)

1. An application-aware large-scale grid job scheduling method is characterized by comprising the following steps:
step 1: a user submits a job to the grid system; the grid system stores the job information into the corresponding database grid-job table, and the job is then inserted, according to its processing state, into a job queue based on a multi-thread sharing mechanism, wherein the job queues comprise a ready queue, a running queue and a result-feedback queue: the ready queue holds jobs waiting to be scheduled, the running queue holds jobs scheduled to run at some computing node, and the result-feedback queue holds jobs that have finished running, the results returned from the nodes being saved in that queue;
step 2: predicting the running time of the job submitted by the user according to a pre-established user-job running-time model;
step 3: querying the current state information of the large-scale distributed grid nodes in real time, and storing it into a grid-node resource information table of a database;
step 4: searching, according to the application requirements of the job, for the grid nodes that can be used to compute the job, namely the computable nodes;
step 5: searching, among the computable node set found in step 4, for the nodes that can meet the running-time limit of the user job, taking them as the schedulable node set, and then selecting from the schedulable node set the node with the lowest resource utilization as the grid node for executing the job;
step 6: distributing the job to the grid node obtained in step 5;
step 7: judging whether all jobs in the job queue have been scheduled; if scheduling is finished, waiting for the next scheduling point; otherwise, taking a job from the job queue and returning to step 4 for cyclic execution.
2. The method according to claim 1, wherein in step 1 the job information includes a user ID, a job ID, application software requirements, a version number, a license, the number of nodes, the number of CPUs, the number of many-core processors, the running time, the job data volume, and the expected completion time.
3. The application-aware large-scale grid job scheduling method according to claim 1, wherein in step 2 the user-job running-time model is built on a historical database of system jobs run at each computing node, with records described as ⟨Job_i, Time_{i,j}⟩, where the application features of Job_i include its data volume Jd(Job_i) and its job size Js(Job_i); the proximity ρ_i between the job Job to be scheduled and a historical job Job_i on these application features is computed by the following formula:

ρ_i = |Jd(Job_i) − Jd(Job)| + |Js(Job_i) − Js(Job)|

and the historical running time Time_{i,j} of the job with minimum proximity ρ_i is taken as the predicted running time of the job at each grid compute node.
4. The method according to claim 1, wherein in step 3 the current state information of the large-scale distributed grid nodes includes the online job workload, the resource utilization, and the hardware and software the nodes can support.
5. The application-aware large-scale grid job scheduling method according to claim 1, wherein the step 5 comprises the following steps:
step 5.1: initializing a data structure jobnode _ start time capable of calculating the node set operation start time according to the real-time node resource information;
step 5.2: calculating the deviation value theta of the user job history submission running time and the actual running time as shown in the following formula:
wherein,submitting the operation time and the actual operation time for the ith job submitted by the user respectively, wherein m is the number of jobs submitted by the user in the grid system;
step 5.3: if the deviation value θ is less than 0.2, the job run time used for the scheduling decision is the run time in the job information submitted by the user; otherwise, the run time predicted by the system in step 2 is used as the basis of the scheduling decision;
step 5.4: sequentially calculating the job completion time jobnode_endtime on each node in the computable node set from the job run time obtained in step 5.3 and the job's start execution time on that node;
step 5.5: judging, from the job completion time on each computable node, whether the job deadline is met, and if so, adding the node to the schedulable node set;
step 5.6: selecting the node with the lowest resource utilization from the schedulable node set, and scheduling the job to that grid node.
6. The method according to claim 1, wherein distributing the job to the corresponding computing node in step 6 comprises the system sending a job computation request and the job data to the computing node, querying the job execution status, and feeding the system scheduling information back to the user.
CN201810245680.1A 2018-03-23 2018-03-23 Application-aware large-scale grid job scheduling method Pending CN108536528A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810245680.1A CN108536528A (en) 2018-03-23 2018-03-23 Application-aware large-scale grid job scheduling method

Publications (1)

Publication Number Publication Date
CN108536528A true CN108536528A (en) 2018-09-14

Family

ID=63485138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810245680.1A Pending CN108536528A (en) Application-aware large-scale grid job scheduling method

Country Status (1)

Country Link
CN (1) CN108536528A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101309208A (en) * 2008-06-21 2008-11-19 华中科技大学 Job scheduling system for grid environments based on reliable cost
CN101697141A (en) * 2009-10-30 2010-04-21 清华大学 Prediction method of operational performance based on historical data modeling in grid
US9747130B2 (en) * 2011-06-16 2017-08-29 Microsoft Technology Licensing, Llc Managing nodes in a high-performance computing system using a node registrar
CN103324534A (en) * 2012-03-22 2013-09-25 阿里巴巴集团控股有限公司 Operation scheduling method and operation scheduler
CN103729246A (en) * 2013-12-31 2014-04-16 浪潮(北京)电子信息产业有限公司 Method and device for dispatching tasks
CN104516784A (en) * 2014-07-11 2015-04-15 中国科学院计算技术研究所 Method and system for forecasting task resource waiting time

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ERIC GAUSSIER等: "Improving Backfilling by using Machine Learning to Predict Running Times", 《PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS》 *
YUPING FAN等: "Trade-off between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates", 《2017 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING》 *
CHAI Yahui et al.: "A job scheduling model for computation-intensive grids based on a dynamic bidding mechanism", Journal of East China Jiaotong University *
SHEN Xinchao et al.: "A grid scheduling system based on a global bidding mechanism", Computer Applications and Software *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110191155A (en) * 2019-05-07 2019-08-30 中国人民解放军国防科技大学 Parallel job scheduling method, system and storage medium for fat tree interconnection network
CN110191155B (en) * 2019-05-07 2022-01-18 中国人民解放军国防科技大学 Parallel job scheduling method, system and storage medium for fat tree interconnection network
CN110442445A (en) * 2019-06-28 2019-11-12 苏州浪潮智能科技有限公司 Design method and device based on computing grid in large-scale container cloud scene
CN110442445B (en) * 2019-06-28 2022-04-22 苏州浪潮智能科技有限公司 Design method and device based on computing grid in large-scale container cloud scene
CN113424152A (en) * 2019-08-27 2021-09-21 微软技术许可有限责任公司 Workflow-based scheduling and batching in a multi-tenant distributed system
CN110597639A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 CPU distribution control method, device, server and storage medium
CN113037800A (en) * 2019-12-09 2021-06-25 华为技术有限公司 Job scheduling method and job scheduling device
CN113037800B (en) * 2019-12-09 2024-03-05 华为云计算技术有限公司 Job scheduling method and job scheduling device
CN114780000A (en) * 2022-03-18 2022-07-22 江苏红网技术股份有限公司 Multipath large-scale real-time data job scheduling system and method thereof

Similar Documents

Publication Publication Date Title
CN108536528A (en) Application-aware large-scale grid job scheduling method
Zhong et al. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling
CN102063336B (en) Distributed computing multiple application function asynchronous concurrent scheduling method
CN101957780A (en) Resource state information-based grid task scheduling processor and grid task scheduling processing method
CN109491761A (en) Cloud computing multiple target method for scheduling task based on EDA-GA hybrid algorithm
CN111782627B (en) Task and data cooperative scheduling method for wide-area high-performance computing environment
CN111651864B (en) Event centralized emission type multi-heterogeneous time queue optimization simulation execution method and system
CN114741200A (en) Data center station-oriented computing resource allocation method and device and electronic equipment
CN102855157A (en) Method for comprehensively scheduling load of servers
CN111506407B (en) Resource management and job scheduling method and system combining Pull mode and Push mode
CN117909061A (en) Model task processing system and resource scheduling method based on GPU hybrid cluster
Wang et al. HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce
Mehta et al. A modified delay strategy for dynamic load balancing in cluster and grid environment
CN116755888A (en) High-performance computing cloud platform-oriented job scheduling device and method
Wang et al. A survey of system scheduling for hpc and big data
Gomathi et al. An adaptive grouping based job scheduling in grid computing
Zhao et al. RAS: a task scheduling algorithm based on resource attribute selection in a task scheduling framework
Mishra et al. A memory-aware dynamic job scheduling model in Grid computing
Ding et al. Data locality-aware and QoS-aware dynamic cloud workflow scheduling in Hadoop for heterogeneous environment
Yuan et al. PPCTS: Performance Prediction-Based Co-Located Task Scheduling in Clouds
Han et al. A Review of Hadoop Resource Scheduling Research
Liu et al. Dynamic co-scheduling of distributed computation and replication
Liu et al. An adaptive strategy for scheduling data-intensive applications in grid environments
Bakni et al. Survey on improving the performance of MapReduce in Hadoop
Lin et al. A multi-centric model of resource and capability management in cloud simulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180914