CN113703952A - Resource allocation method for queue resource scheduling based on super computer - Google Patents

Publication number
CN113703952A
Authority
CN
China
Prior art keywords
resources
queue
user
resource
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010429029.7A
Other languages
Chinese (zh)
Other versions
CN113703952B (en)
Inventor
刘弢
田敏
潘景山
郭莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202010429029.7A
Publication of CN113703952A
Application granted
Publication of CN113703952B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a resource allocation method for queue resource scheduling on a supercomputer, comprising the following steps: (1) a user submits a job and specifies the number of computing resources and a private queue name; (2) the submitted parameters, namely the number of computing resources and the private queue name specified by the user, are passed to the system for evaluation; if the private queue has enough resources, that is, the number of resources in the private queue is greater than the number of computing resources requested, the user's job runs normally and the process ends; otherwise, the system checks whether the conditions are met; (3) if the conditions are met, the required temporary nodes are moved from the resource pool into the private queue identified by the private queue name, and the user's job runs to completion; otherwise, the reason the conditions are not met is printed; (4) the system returns the temporary nodes to the resource pool, and the process ends. The invention optimizes the allocation of computing resources and improves efficiency. An ample resource queue can be kept in reserve for urgent resource requests.

Description

Resource allocation method for queue resource scheduling based on super computer
Technical Field
The invention relates to a resource allocation method for queue resource scheduling on a supercomputer, and belongs to the technical field of dynamic scheduling algorithms for computing resources in high-performance computing (supercomputing).
Background
Supercomputers are used mainly in national high-technology fields and cutting-edge research. They embody a nation's research strength, are significant for national security and for economic and social development, and are an important indicator of a country's level of scientific and technological development and of its comprehensive national strength. A national supercomputer is generally operated and maintained by a national-level supercomputing center. By the end of May 2020, China had built or was building seven supercomputing centers: the National Supercomputing Centers in Tianjin, Changsha, Jinan, Guangzhou, Shenzhen, Wuxi, and Zhengzhou.
At present, whether a national supercomputing center runs commercial or domestically developed computing resources, queue resources are allocated in essentially two modes: shared compute-node queues and exclusive compute-node queues. In the supercomputer field, compute-node resources have uniform attributes, and no dynamic scheduling algorithm operates at the logical level. Most computing resources are allocated manually according to users' applications, so flexibility and real-time responsiveness need improvement.
In the initial stage of building a supercomputer, its overall performance is generally evaluated using all of its compute nodes. Once the supercomputing center goes into operation and compute nodes are progressively leased out, it becomes difficult to set aside a relatively large pool of computing resources to coordinate and support important scientific computing. The following problems arise during operation of a supercomputing center: (1) users frequently occupy most of the computing resources of the shared queue, and burst computing causes resource shortages at particular moments and excessive system pressure; (2) an exclusive queue is owned by a single user or a single class of users, so its computing resources are occupied yet largely idle, and the supercomputer cannot deliver a large amount of concentrated computing power; (3) some large-scale scientific computing tasks cannot obtain sufficient computing resources at short notice.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a resource allocation method for queue resource scheduling on a supercomputer.
Existing approaches have no dynamic allocation capability. Achieving dynamic resource scheduling at the logical level requires modifying and re-wrapping the scheduling algorithm at the layer of the user-facing scheduling system. Accordingly, the existing resources and each user's resources are analyzed under dynamically triggered monitoring, and a logical-level dynamic scheduling mechanism for the supercomputing center is designed. The mechanism raises resource utilization during scheduling and thereby addresses the shortage of computing resources at the supercomputing center.
The technical solution of the invention is as follows:
A resource allocation method for queue resource scheduling on a supercomputer comprises the following steps:
(1) A user submits a job and specifies the number of computing resources required and a private queue name; for example, the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is to be submitted.
(2) The submitted parameters, namely the number of computing resources and the private queue name specified by the user in step (1), are evaluated by the resource allocation method of the invention. If the private queue has enough resources, that is, the number of resources in the private queue is greater than the number requested in step (1), the user's job runs normally and the process ends. Otherwise, the method determines whether the existing computing resources can satisfy the number requested, and proceeds to step (3).
(3) If the existing computing resources can satisfy the number requested, the required temporary nodes are moved from the resource pool into the private queue named in step (1), the user's job runs to completion, and the method proceeds to step (4). Otherwise, the reason the condition is not met is printed, for example: the number of nodes requested by the submitted job exceeds the total number of nodes actually purchased.
(4) The method returns the temporary nodes to the resource pool, and the process ends.
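Steps (1) to (4) can be sketched in a few lines of Python. This is a minimal in-memory sketch under stated assumptions, not the patented implementation: the `Cluster` class, its fields, and the `run_job` callback are hypothetical, node counts stand in for "computing resources", and the full requested node count is borrowed from the pool (as step E of the preferred embodiment reads). Step (2) says "greater than" while step A says the private queue must "meet" the request; the sketch assumes `>=`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cluster:
    """Hypothetical in-memory model holding idle node counts."""
    pool_free: int      # idle nodes in the shared resource pool
    private_free: int   # idle nodes in the user's private queue


def submit(cluster: Cluster, nodes: int, run_job: Callable[[], None]) -> str:
    """Follow steps (1)-(4) for a job that requests `nodes` nodes."""
    # Step (2): the private queue alone is sufficient (assumed >=).
    if cluster.private_free >= nodes:
        run_job()
        return "ran in private queue"
    # Step (3): otherwise the pool must be able to lend the nodes.
    if cluster.pool_free < nodes:
        return "rejected: not enough nodes in the resource pool"
    cluster.pool_free -= nodes
    cluster.private_free += nodes
    run_job()
    # Step (4): return the temporary nodes to the pool.
    cluster.private_free -= nodes
    cluster.pool_free += nodes
    return "ran with temporary nodes"
```

After a borrowed run the node counts are back where they started, which is the property step (4) is meant to guarantee.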
Preferably, steps (2) to (4) comprise the following steps:
A. Determine whether the number of resources in the private queue satisfies the number of computing resources the user has requested. If so, pass the bsub1 parameters to the system's bsub2 and go to step F; otherwise, go to step B. The bsub1 parameters are all the parameters the user supplies when submitting with the bsub command, including the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is to be submitted. bsub2 is the bsub command that is invoked after step (4) is finished; that is, it obtains the number of nodes the job runs on, monitors the job's state, and, after the job finishes normally, executes a system command that moves the corresponding number of temporary resources from the user queue back to the resource pool queue.
B. Compute the sum of the number of nodes used by the user's already-submitted jobs and the number of nodes the job being submitted expects to use. If the sum exceeds the total number of nodes the user has purchased, return and print a message telling the user that the number of nodes requested by the submitted job exceeds the total number actually purchased; otherwise, go to step C.
C. The system calculates the computing resources currently available in the resource pool. If they are fewer than the number of nodes the submitted job expects to use, go to step D; otherwise, go to step E.
D. After t minutes, the system again calculates the computing resources currently available in the resource pool. If they are still fewer than the number of nodes the submitted job expects to use, return and print a message telling the user that system computing resources are insufficient and that the system administrator should be contacted; otherwise, go to step E.
E. Execute a scheduling system command that moves the number of nodes the submitted job expects to use from the resource pool into the private queue identified by the user's private queue name, then pass the bsub1 parameters to bsub2.
F. Execute bsub2, obtain the number of nodes the submitted job runs on, and run the submitted job.
G. After the submitted job finishes normally, execute a system command that moves the number of nodes the submitted job expected to use from the user's private queue back to the resource pool.
More preferably, t is 1.
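The control flow of steps A to G can be sketched as follows. This is an illustrative sketch only: `UserQueue`, `schedule`, `pool_free`, `move_nodes`, and `run_job` are hypothetical names, the callbacks stand in for the scheduling-system commands that the text does not spell out, and nodes are returned in step G only when they were actually borrowed in step E (an assumption; the text does not distinguish the two paths).

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class UserQueue:
    """Hypothetical per-user accounting; all field names are illustrative."""
    purchased: int      # total nodes the user has purchased
    in_use: int         # nodes used by the user's already-submitted jobs
    private_free: int   # idle nodes in the user's private queue


def schedule(user: UserQueue, pool_free: Callable[[], int], nodes: int,
             move_nodes: Callable[[int], None], run_job: Callable[[], None],
             t_minutes: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Run one job through steps A-G; move_nodes(k) moves k nodes from the
    pool to the private queue (negative k moves them back)."""
    borrowed = 0
    if user.private_free < nodes:                    # step A: private queue too small
        if user.in_use + nodes > user.purchased:     # step B: quota check
            return "requested nodes exceed the purchased total"
        if pool_free() < nodes:                      # step C: pool too small right now
            sleep(t_minutes * 60)                    # step D: wait t minutes, re-check
            if pool_free() < nodes:
                return "insufficient resources, contact the administrator"
        move_nodes(nodes)                            # step E: pool -> private queue
        borrowed = nodes
    run_job()                                        # step F: the real submission
    if borrowed:
        move_nodes(-borrowed)                        # step G: private queue -> pool
    return "done"
```

Passing `sleep` as a parameter keeps the t-minute wait of step D testable without actually blocking.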
According to a preferred embodiment of the invention, the number of resources in the private queues is greater than the number of resources in the resource pool. Generally speaking, the annual utilization of an x86-architecture cluster at a supercomputing center fluctuates around 75%.
The beneficial effects of the invention are as follows:
1. The invention optimizes the allocation of computing resources and improves efficiency. Even when computing resources cannot be unified, the algorithm remains effective for cluster resource management, and the larger the resource base it manages, the more useful it is. An ample resource queue can be kept in reserve for urgent resource requests.
2. The invention removes the need to modify each user's attributes during initial setup. It can be maintained automatically afterwards and runs automatically, which saves labor cost.
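The sizing preference and the utilization figure lend themselves to a quick arithmetic check. The sketch below is illustrative only: both function names are hypothetical, and treating the roughly 75% annual utilization as a uniform average across nodes is an assumption, not something the text states.

```python
def expected_idle_nodes(total_nodes: int, utilization: float) -> float:
    """Average number of idle nodes, assuming uniform utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return total_nodes * (1.0 - utilization)


def satisfies_sizing_preference(private_nodes: int, pool_nodes: int) -> bool:
    """Preferred embodiment: private queues hold more nodes than the pool."""
    return private_nodes > pool_nodes
```

Under these assumptions, a hypothetical 1000-node cluster at 75% utilization has about 250 idle nodes on average, which indicates the order of magnitude of capacity a pool could draw on.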
Drawings
Fig. 1 is a schematic flowchart of a resource allocation method based on queue resource scheduling of a supercomputer.
Detailed Description
The invention is further described below with reference to the drawings and embodiments, but is not limited to them.
Example 1
A resource allocation method for queue resource scheduling on a supercomputer, as shown in FIG. 1, comprises the following steps:
(1) A user submits a job and specifies the number of computing resources required and a private queue name; for example, the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is to be submitted.
(2) The submitted parameters, namely the number of computing resources and the private queue name specified by the user in step (1), are evaluated by the resource allocation method of the invention. If the private queue has enough resources, that is, the number of resources in the private queue is greater than the number requested in step (1), the user's job runs normally and the process ends. Otherwise, the method determines whether the existing computing resources can satisfy the number requested, and proceeds to step (3).
(3) If the existing computing resources can satisfy the number requested, the required temporary nodes are moved from the resource pool into the private queue named in step (1), the user's job runs to completion, and the method proceeds to step (4). Otherwise, the reason the condition is not met is printed, for example: the number of nodes requested by the submitted job exceeds the total number of nodes actually purchased.
(4) The method returns the temporary nodes to the resource pool, and the process ends.
Example 2
The resource allocation method for queue resource scheduling on a supercomputer according to Example 1, wherein steps (2) to (4) comprise the following steps:
A. Determine whether the number of resources in the private queue satisfies the number of computing resources the user has requested. If so, pass the bsub1 parameters to the system's bsub2 and go to step F; otherwise, go to step B. The bsub1 parameters are all the parameters the user supplies when submitting with the bsub command, including the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is to be submitted. bsub2 is the bsub command that is invoked after step (4) is finished; that is, it obtains the number of nodes the job runs on, monitors the job's state, and, after the job finishes normally, executes a system command that moves the corresponding number of temporary resources from the user queue back to the resource pool queue.
B. Compute the sum of the number of nodes used by the user's already-submitted jobs and the number of nodes the job being submitted expects to use. If the sum exceeds the total number of nodes the user has purchased, return and print a message telling the user that the number of nodes requested by the submitted job exceeds the total number actually purchased; otherwise, go to step C.
C. The system calculates the computing resources currently available in the resource pool. If they are fewer than the number of nodes the submitted job expects to use, go to step D; otherwise, go to step E.
D. After t minutes, the system again calculates the computing resources currently available in the resource pool. If they are still fewer than the number of nodes the submitted job expects to use, return and print a message telling the user that system computing resources are insufficient and that the system administrator should be contacted; otherwise, go to step E.
E. Execute a scheduling system command that moves the number of nodes the submitted job expects to use from the resource pool into the private queue identified by the user's private queue name, then pass the bsub1 parameters to bsub2.
F. Execute bsub2, obtain the number of nodes the submitted job runs on, and run the submitted job.
G. After the submitted job finishes normally, execute a system command that moves the number of nodes the submitted job expected to use from the user's private queue back to the resource pool.
Here, t = 1.
In this resource allocation method, the number of resources in the private queues is greater than the number of resources in the resource pool. Generally speaking, the annual utilization of an x86-architecture cluster at a supercomputing center fluctuates around 75%.
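Step A describes the bsub1 parameters as carrying the node count, the cores per node, and the private queue name. A wrapper has to extract these before it can decide anything, which might look like the sketch below. The mapping is an assumption, not taken from the patent: it reads the LSF `bsub` options `-n` (number of tasks), `-q` (queue name), and `-R "span[ptile=N]"` (tasks per host), and derives the node count as tasks divided by tasks per host. The function name `parse_bsub1` and the returned dictionary keys are hypothetical.

```python
import argparse
import math
import re


def parse_bsub1(argv):
    """Extract the scheduling-relevant parameters from bsub-style arguments."""
    parser = argparse.ArgumentParser(prog="bsub1", add_help=False)
    parser.add_argument("-n", dest="tasks", type=int, required=True)
    parser.add_argument("-q", dest="queue", required=True)
    parser.add_argument("-R", dest="res_req", default="")
    # Anything argparse does not recognise is treated as the job command.
    args, job_command = parser.parse_known_args(argv)

    # Assumed convention: span[ptile=N] gives the cores used on each node.
    match = re.search(r"span\[ptile=(\d+)\]", args.res_req)
    cores_per_node = int(match.group(1)) if match else None
    nodes = math.ceil(args.tasks / cores_per_node) if cores_per_node else None
    return {
        "tasks": args.tasks,
        "queue": args.queue,
        "cores_per_node": cores_per_node,
        "nodes": nodes,            # None when ptile was not given
        "command": job_command,
    }
```

For example, `parse_bsub1(["-n", "48", "-q", "userA_q", "-R", "span[ptile=24]", "./solver"])` yields a two-node request for the hypothetical queue `userA_q`. The sketch leaves the node count undefined rather than guessing when no `ptile` is supplied.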

Claims (4)

1. A resource allocation method for queue resource scheduling on a supercomputer, characterized by comprising the following steps:
(1) a user submits a job and specifies the number of computing resources required and a private queue name;
(2) if the private queue has enough resources, that is, the number of resources in the private queue is greater than the number requested in step (1), the user's job runs normally and the process ends; otherwise, it is determined whether the existing computing resources can satisfy the number requested, and the method proceeds to step (3); the submitted parameters are the number of computing resources and the private queue name specified by the user in step (1);
(3) if the existing computing resources can satisfy the number requested, the required temporary nodes are moved from the resource pool into the private queue named in step (1), the user's job runs to completion, and the method proceeds to step (4); otherwise, the reason the condition is not met is printed;
(4) the temporary nodes are returned to the resource pool, and the process ends.
2. The resource allocation method for queue resource scheduling on a supercomputer according to claim 1, wherein steps (2) to (4) comprise the following steps:
A. determining whether the number of resources in the private queue satisfies the number of computing resources the user has requested; if so, passing the bsub1 parameters to the system's bsub2 and going to step F; otherwise, going to step B; the bsub1 parameters are all the parameters the user supplies when submitting with the bsub command, including the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is to be submitted, and bsub2 is the bsub command that is invoked after step (4) is finished;
B. computing the sum of the number of nodes used by the user's already-submitted jobs and the number of nodes the job being submitted expects to use; if the sum exceeds the total number of nodes the user has purchased, returning and printing a message telling the user that the number of nodes requested by the submitted job exceeds the total number actually purchased; otherwise, going to step C;
C. the system calculates the computing resources currently available in the resource pool; if they are fewer than the number of nodes the submitted job expects to use, going to step D; otherwise, going to step E;
D. after t minutes, the system again calculates the computing resources currently available in the resource pool; if they are still fewer than the number of nodes the submitted job expects to use, returning and printing a message telling the user that system computing resources are insufficient and that the system administrator should be contacted; otherwise, going to step E;
E. executing a scheduling system command that moves the number of nodes the submitted job expects to use from the resource pool into the private queue identified by the user's private queue name, then passing the bsub1 parameters to bsub2;
F. executing bsub2, obtaining the number of nodes the submitted job runs on, and running the submitted job;
G. after the submitted job finishes normally, executing a system command that moves the number of nodes the submitted job expected to use from the user's private queue back to the resource pool.
3. The resource allocation method for queue resource scheduling on a supercomputer according to claim 2, characterized in that t = 1.
4. The resource allocation method for queue resource scheduling on a supercomputer according to any one of claims 1 to 3, characterized in that the number of resources in the private queues is greater than the number of resources in the resource pool.
CN202010429029.7A 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer Active CN113703952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010429029.7A CN113703952B (en) 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010429029.7A CN113703952B (en) 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer

Publications (2)

Publication Number Publication Date
CN113703952A (en) 2021-11-26
CN113703952B (en) 2023-10-10

Family

ID=78645441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010429029.7A Active CN113703952B (en) 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer

Country Status (1)

Country Link
CN (1) CN113703952B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114020443A (en) * 2022-01-05 2022-02-08 国家超级计算天津中心 Supercomputer resource scheduling method, electronic device and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080066070A1 (en) * 2006-09-12 2008-03-13 Sun Microsystems, Inc. Method and system for the dynamic scheduling of jobs in a computing system
CN102902592A (en) * 2012-09-10 2013-01-30 曙光信息产业(北京)有限公司 Zoning scheduling management method of cluster computing resources
CN105320565A (en) * 2014-07-31 2016-02-10 中国石油化工股份有限公司 Computer resource scheduling method for various application software
CN106708622A (en) * 2016-07-18 2017-05-24 腾讯科技(深圳)有限公司 Cluster resource processing method and system, and resource processing cluster
CN106844056A (en) * 2017-01-25 2017-06-13 北京百分点信息科技有限公司 Hadoop big datas platform multi-tenant job management method and its system
CN108304260A (en) * 2017-12-15 2018-07-20 上海超算科技有限公司 A kind of virtualization job scheduling system and its implementation based on high-performance cloud calculating
CN109445919A (en) * 2018-10-19 2019-03-08 曙光信息产业(北京)有限公司 Online computing resource transaction system based on cloud service
CN110806928A (en) * 2019-10-16 2020-02-18 北京并行科技股份有限公司 Job submitting method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
VINEETHA KONDAMEEDI等: "Adaptive Hybrid Queue Configuration for Supercomputer Systems", 《2017 17TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID)》, pages 90 - 99 *
王硕: "Hama中满足公平性和负载均衡资源调度器的研究及实现", 《中国优秀硕士学位论文全文数据库信息科技辑》, pages 138 - 2267 *

Also Published As

Publication number Publication date
CN113703952B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
CN105302638B (en) MPP cluster task dispatching method based on system load
CN107247651B (en) Cloud computing platform monitoring and early warning method and system
CN103179048B (en) Main frame qos policy transform method and the system of cloud data center
CN104598426B (en) Method for scheduling task for heterogeneous multi-nucleus processor system
WO2015139374A1 (en) Virtual machine distributed task scheduling method in cloud computing platform
CN113535409B (en) Server-free computing resource distribution system oriented to energy consumption optimization
WO2004084069A3 (en) Load balancing and taskdistribution system
TW201205441A (en) Multi-CPU domain mobile electronic device and operation method thereof
CN108408514B (en) Multi-connected machine group control type elevator dispatching method
CN114816715B (en) Cross-region-oriented flow calculation delay optimization method and device
CN113703952B (en) Resource allocation method for queue resource scheduling based on supercomputer
US20190138354A1 (en) Method for scheduling jobs with idle resources
CN104572286A (en) Task scheduling method based on distributed memory clusters
CN109597378A (en) A kind of resource-constrained hybrid task energy consumption cognitive method
CN109918181B (en) Worst response time-based task schedulability analysis method for hybrid critical system
CN102043676B (en) Visualized data centre dispatching method and system
CN108429784B (en) Energy efficiency priority cloud resource allocation and scheduling method
CN110850957B (en) Scheduling method for reducing system power consumption through dormancy in edge computing scene
CN105183563A (en) CPU resource dynamic self-configuration method facing mission critical computer
CN112148546A (en) Static safety analysis parallel computing system and method for power system
WO2012167591A1 (en) Processing method and system for distributed operating command
CN107391248B (en) Multilevel feedback queue dispatching method for STM32 system
CN114741200A (en) Data center station-oriented computing resource allocation method and device and electronic equipment
CN111506407B (en) Resource management and job scheduling method and system combining Pull mode and Push mode
CN112764883A (en) Energy management method of cloud desktop system based on software definition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant