CN113703952B - Resource allocation method for queue resource scheduling based on supercomputer

Info

Publication number
CN113703952B
Authority
CN
China
Prior art keywords
resources
user
queue
resource
private
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010429029.7A
Other languages
Chinese (zh)
Other versions
CN113703952A (en)
Inventor
刘弢
田敏
潘景山
郭莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202010429029.7A
Publication of CN113703952A
Application granted
Publication of CN113703952B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request

Abstract

The invention relates to a resource allocation method for queue resource scheduling based on a supercomputer, comprising the following steps: (1) a user submits a job, specifying the number of computing resources and the name of a private queue; (2) the submitted parameters, namely the number of computing resources and the private queue name specified by the user, are passed to the system for evaluation; if the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the number of computing resources requested, the user's job runs normally and the method ends; otherwise, the system judges whether the conditions for temporary allocation are met; (3) if the conditions are met, the required temporary nodes are moved from the resource pool into the private queue corresponding to the private queue name, and the user's job runs normally; otherwise, the reason the conditions are not met is printed; (4) the system returns the temporary nodes to the resource pool, and the method ends. The invention optimizes the configuration of computing resources and improves efficiency. A well-stocked resource queue can be maintained for emergency resource calls.

Description

Resource allocation method for queue resource scheduling based on supercomputer
Technical Field
The invention relates to a resource allocation method for queue resource scheduling based on a supercomputer, and belongs to the technical field of high-performance computing supercomputer resource dynamic scheduling algorithms.
Background
Supercomputers are used for research in national high-technology fields and cutting-edge technology. They embody a country's scientific research strength, are significant for national security and for economic and social development, and are an important indicator of a nation's level of scientific and technological development and its comprehensive national strength. A country's supercomputers are generally operated and maintained by national-level supercomputing centers. As of the end of May 2020, seven national supercomputing centers had been built or were under construction in China: the Tianjin, Changsha, Jinan, Guangzhou, Shenzhen, Wuxi, and Zhengzhou centers.
Currently, queue resources in a national supercomputer are allocated along essentially two lines: commercial computing resources versus domestic computing resources, and queues of shared computing nodes versus queues of exclusive computing nodes. In the supercomputing field, computing node resources have uniform properties, and no dynamic scheduling algorithm that schedules at the logical level is available. Computing resources are therefore mostly allocated manually according to users' purchase applications, and flexibility and timeliness leave room for improvement.
In the initial stage of supercomputer construction, the overall performance of the supercomputer is generally evaluated using all of its computing node resources. Once the supercomputing center gradually enters operation and its computing node resources are gradually leased out, it becomes difficult to coordinate a relatively large pool of computing resource queues to support important scientific computations. During operation of a supercomputing center, the following problems exist: (1) Users frequently occupy most of the computing resources of the shared queue, and burst computation makes resources tight at certain moments, putting excessive pressure on the system. (2) An exclusive queue is owned by one user or one class of users; its computing resources are reserved but often sit idle, so the supercomputer cannot provide a large amount of concentrated computing power. (3) Some large-scale scientific computing tasks cannot obtain sufficient computing resources at short notice.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a resource allocation method for queue resource scheduling based on a supercomputer.
The prior art has no dynamic allocation capability; the scheduling algorithm must be modified and re-wrapped at the user scheduling system layer to achieve dynamic resource scheduling at the logical level. Therefore, driven by dynamic trigger monitoring, both the currently available resources and the user's own resources need to be analyzed. A dynamic scheduling mechanism for the supercomputing center is designed at the resource logic layer; it improves resource utilization during scheduling through a series of measures and thereby alleviates the shortage of computing resources in the supercomputing center.
The technical scheme of the invention is as follows:
A resource allocation method for queue resource scheduling based on a supercomputer comprises the following steps:
(1) A user submits a job and specifies the number of required computing resources and the private queue name; for example, the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is submitted;
(2) The submitted parameters, namely the number of computing resources and the private queue name specified by the user in step (1), are evaluated by the queue resource scheduling resource allocation method of the invention. If the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the number of computing resources in step (1), the user's job runs normally and the method ends; otherwise, the method judges whether the existing computing resources meet the required number of computing resources, and step (3) is entered;
(3) If the existing computing resources meet the required number of computing resources, the required temporary nodes are moved from the resource pool into the private queue corresponding to the private queue name in step (1), the user's job runs normally, and step (4) is entered; otherwise, the reason the condition is not met is printed, for example: the number of nodes requested by the submitted job exceeds the total actually purchased.
(4) The queue resource scheduling resource allocation method returns the temporary nodes to the resource pool, and the method ends.
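The top-level decision in steps (2) and (3) can be sketched as a small pure function. This is a minimal illustration, not the patented implementation: the names `Outcome` and `decide` are hypothetical, and "sufficient" is read here as greater than or equal to, which is an assumption (the text says "greater than" in step (2)).

```python
from enum import Enum


class Outcome(Enum):
    RUN_IN_PRIVATE_QUEUE = "run in the private queue as-is"
    RUN_WITH_TEMPORARY_NODES = "borrow nodes from the pool, run, then return them"
    REJECTED = "print the reason and stop"


def decide(requested: int, private_free: int, pool_free: int) -> Outcome:
    """Hypothetical sketch of the decision in steps (2) and (3).

    Assumption: "sufficient" means >=; the source wording for step (2)
    is "greater than", so a strict comparison may be intended there.
    """
    # Step (2): the private queue alone can host the job.
    if private_free >= requested:
        return Outcome.RUN_IN_PRIVATE_QUEUE
    # Step (3): the pool can lend the requested number of temporary nodes.
    if pool_free >= requested:
        return Outcome.RUN_WITH_TEMPORARY_NODES
    # Otherwise the reason for non-compliance is printed.
    return Outcome.REJECTED
```

For example, `decide(4, 2, 10)` selects the temporary-node path, after which step (4) would return the borrowed nodes to the pool.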
According to a preferred embodiment of the present invention, steps (2) to (4) comprise the following steps:
A. Judge whether the number of resources in the private queue meets the number of resources required by the user, that is, the number of computing resources requested; if so, pass the bsub1 parameters to the system bsub2 and go to step F; otherwise, go to step B. The bsub1 parameters are all the parameters configured and submitted by the user, including the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is submitted. bsub2 is the bsub command that additionally carries out step (4); namely, it calls the bsub command, acquires the job node number of the job, monitors the job state, and, after the job ends normally, executes the system command that moves the corresponding number of temporary resources from the user queue back to the resource pool queue.
B. Compute the sum of the number of nodes in the user's already submitted jobs and the number of nodes the job being submitted expects to use; if the sum is greater than the total number of nodes purchased by the user, print a prompt to the user and return; otherwise, go to step C;
C. The system calculates the remaining usable computing resources in the resource pool at that moment; if they are fewer than the number of nodes the job being submitted expects to use, go to step D; otherwise, go to step E;
D. After t minutes (min), the system again calculates the remaining usable computing resources in the resource pool; if they are still fewer than the number of nodes the job being submitted expects to use, print a prompt telling the user that the system has insufficient computing resources and asking the user to contact the system administrator, and return; otherwise, go to step E;
E. Execute a scheduling system command that transfers the number of nodes the submitted job expects to use from the resource pool to the private queue corresponding to the user's private queue name, and then pass the bsub1 parameters to bsub2;
F. Execute bsub2, acquire the job node number of the submitted job, and run the submitted job;
G. After the submitted job ends normally, execute a system command that moves the nodes allocated from the resource pool for the submitted job from the user's private queue back to the resource pool.
Further preferably, t=1.
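The admission checks in steps B to D (purchase quota, pool availability, and one re-check after t minutes) can be sketched as follows. This is a hedged illustration only: the function `admit`, its parameter names, and the message strings are hypothetical, and the pool query is passed in as a callable because the source does not say how the system measures the pool.

```python
import time
from typing import Callable, Optional


def admit(requested: int,
          nodes_in_use: int,
          purchased_total: int,
          pool_free: Callable[[], int],
          t_minutes: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Return None if nodes may be borrowed, else the reason to print.

    All names here are illustrative assumptions, not from the source.
    """
    # Step B: already-submitted nodes plus this request must stay within
    # the total number of nodes the user purchased.
    if nodes_in_use + requested > purchased_total:
        return "requested nodes exceed the purchased total"
    # Step C: check the pool now.
    if pool_free() >= requested:
        return None
    # Step D: wait t minutes and check the pool exactly once more.
    sleep(t_minutes * 60)
    if pool_free() >= requested:
        return None
    return "insufficient computing resources; please contact the system administrator"
```

Injecting `sleep` keeps the sketch testable without actually waiting; with the default `t_minutes=1.0` it matches the preferred value t=1.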
According to a preferred embodiment of the present invention, in the above resource allocation method for queue resource scheduling based on a supercomputer, the number of resources in the private queues is greater than the number of resources in the resource pool. Generally, the annual utilization of the X86-architecture clusters of a supercomputing center fluctuates around 75%.
The beneficial effects of the invention are as follows:
1. The invention optimizes the configuration of computing resources and improves efficiency. Even when the computing resources are not uniform, the algorithm remains effective for cluster resource management, and the larger the resource base it is applied to, the greater its benefit. A well-stocked resource queue can be maintained for emergency resource calls.
2. The invention removes the need to modify each user's attributes during initial setup; later maintenance is automatic and the system runs automatically, which saves labor costs.
Drawings
FIG. 1 is a flow chart of a method for allocating resources for queue resource scheduling based on a supercomputer.
Detailed Description
The invention is further described below with reference to the drawings and examples, but is not limited thereto.
Example 1
A resource allocation method for queue resource scheduling based on a supercomputer, as shown in FIG. 1, comprises the following steps:
(1) A user submits a job and specifies the number of required computing resources and the private queue name; for example, the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is submitted;
(2) The submitted parameters, namely the number of computing resources and the private queue name specified by the user in step (1), are evaluated by the queue resource scheduling resource allocation method of the invention. If the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the number of computing resources in step (1), the user's job runs normally and the method ends; otherwise, the method judges whether the existing computing resources meet the required number of computing resources, and step (3) is entered;
(3) If the existing computing resources meet the required number of computing resources, the required temporary nodes are moved from the resource pool into the private queue corresponding to the private queue name in step (1), the user's job runs normally, and step (4) is entered; otherwise, the reason the condition is not met is printed, for example: the number of nodes requested by the submitted job exceeds the total actually purchased.
(4) The queue resource scheduling resource allocation method returns the temporary nodes to the resource pool, and the method ends.
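The parameters submitted in step (1) can be modeled as a small record. The sketch below is an assumption-laden illustration: `JobRequest` and `to_bsub_args` are hypothetical names, and the mapping onto `bsub -q`, `-n`, and `-R "span[ptile=...]"` assumes an IBM LSF-style scheduler, which the source only hints at by naming the bsub command.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical record of the parameters submitted in step (1)."""
    nodes: int            # number of nodes requested
    cores_per_node: int   # cores required on each node
    queue: str            # name of the user's private queue

    def __post_init__(self) -> None:
        if self.nodes <= 0 or self.cores_per_node <= 0:
            raise ValueError("nodes and cores_per_node must be positive")
        if not self.queue:
            raise ValueError("a private queue name is required")

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    def to_bsub_args(self, command: List[str]) -> List[str]:
        # Assumption: LSF-style options; -n is the total slot count and
        # span[ptile=K] places K slots on each host.
        return ["bsub", "-q", self.queue, "-n", str(self.total_cores),
                "-R", f"span[ptile={self.cores_per_node}]", *command]
```

For instance, `JobRequest(4, 24, "user_a_private")` describes 4 nodes with 24 cores each, 96 cores in total; the queue name is made up for the example.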
Example 2
The resource allocation method for queue resource scheduling based on a supercomputer according to Example 1, wherein steps (2) to (4) comprise the following steps:
A. Judge whether the number of resources in the private queue meets the number of resources required by the user, that is, the number of computing resources requested; if so, pass the bsub1 parameters to the system bsub2 and go to step F; otherwise, go to step B. The bsub1 parameters are all the parameters configured and submitted by the user, including the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is submitted. bsub2 is the bsub command that additionally carries out step (4); namely, it calls the bsub command, acquires the job node number of the job, monitors the job state, and, after the job ends normally, executes the system command that moves the corresponding number of temporary resources from the user queue back to the resource pool queue.
B. Compute the sum of the number of nodes in the user's already submitted jobs and the number of nodes the job being submitted expects to use; if the sum is greater than the total number of nodes purchased by the user, print a prompt to the user and return; otherwise, go to step C;
C. The system calculates the remaining usable computing resources in the resource pool at that moment; if they are fewer than the number of nodes the job being submitted expects to use, go to step D; otherwise, go to step E;
D. After t minutes (min), the system again calculates the remaining usable computing resources in the resource pool; if they are still fewer than the number of nodes the job being submitted expects to use, print a prompt telling the user that the system has insufficient computing resources and asking the user to contact the system administrator, and return; otherwise, go to step E;
E. Execute a scheduling system command that transfers the number of nodes the submitted job expects to use from the resource pool to the private queue corresponding to the user's private queue name, and then pass the bsub1 parameters to bsub2;
F. Execute bsub2, acquire the job node number of the submitted job, and run the submitted job;
G. After the submitted job ends normally, execute a system command that moves the nodes allocated from the resource pool for the submitted job from the user's private queue back to the resource pool.
In this example, t=1.
In this resource allocation method for queue resource scheduling based on a supercomputer, the number of resources in the private queues is greater than the number of resources in the resource pool. Generally, the annual utilization of the X86-architecture clusters of a supercomputing center fluctuates around 75%.
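Steps E to G (lend nodes, run the job, return the nodes) map naturally onto a context manager. The following is a minimal in-memory sketch under stated assumptions: `Cluster`, `borrowed_nodes`, and `run_with_temporary_nodes` are hypothetical names, the real method would issue scheduler commands rather than update counters, and returning nodes in a `finally` block (so they also come back when the job fails) is a design choice of this sketch, since the source only describes the case where the job ends normally.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for the scheduler's queue membership."""
    pool_nodes: int
    queue_nodes: Dict[str, int] = field(default_factory=dict)

    def move(self, queue: str, count: int) -> None:
        """Move count nodes from the pool to a queue (negative: back to the pool)."""
        if count > self.pool_nodes or -count > self.queue_nodes.get(queue, 0):
            raise ValueError("not enough nodes to move")
        self.pool_nodes -= count
        self.queue_nodes[queue] = self.queue_nodes.get(queue, 0) + count


@contextmanager
def borrowed_nodes(cluster: Cluster, queue: str, count: int) -> Iterator[None]:
    cluster.move(queue, count)       # step E: pool -> private queue
    try:
        yield                        # step F: the job runs here
    finally:
        cluster.move(queue, -count)  # step G: private queue -> pool


def run_with_temporary_nodes(cluster: Cluster, queue: str, count: int,
                             job: Callable[[], int]) -> int:
    """Run job while count temporary nodes are lent to the queue."""
    with borrowed_nodes(cluster, queue, count):
        return job()
```

With a pool of 10 nodes and a private queue of 2, lending 4 nodes leaves 6 in each while the job runs and restores 10 and 2 afterwards.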

Claims (3)

1. A resource allocation method for queue resource scheduling based on a supercomputer, characterized by comprising the following steps:
(1) A user submits a job and specifies the number of required computing resources and the private queue name;
(2) If the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the number of computing resources in step (1), the user's job runs normally and the method ends; otherwise, it is judged whether the existing computing resources meet the required number of computing resources, and step (3) is entered; the submitted parameters are the number of computing resources and the private queue name specified by the user in step (1);
(3) If the existing computing resources meet the required number of computing resources, the required temporary nodes are moved from the resource pool into the private queue corresponding to the private queue name in step (1), the user's job runs normally, and step (4) is entered; otherwise, the reason the condition is not met is printed;
(4) The temporary nodes are returned to the resource pool, and the method ends;
wherein steps (2) to (4) comprise the following steps:
A. Judge whether the number of resources in the private queue meets the number of resources required by the user, that is, the number of computing resources requested; if so, pass the bsub1 parameters to the system bsub2 and go to step F; otherwise, go to step B; the bsub1 parameters are all the parameters configured and submitted by the user, including the number of nodes, the number of cores required on each node, and the name of the private queue to which the task is submitted, and bsub2 is the bsub command that additionally carries out step (4);
B. Compute the sum of the number of nodes in the user's already submitted jobs and the number of nodes the job being submitted expects to use; if the sum is greater than the total number of nodes purchased by the user, print a prompt to the user and return; otherwise, go to step C;
C. The system calculates the remaining usable computing resources in the resource pool at that moment; if they are fewer than the number of nodes the job being submitted expects to use, go to step D; otherwise, go to step E;
D. After t minutes, the system again calculates the remaining usable computing resources in the resource pool; if they are still fewer than the number of nodes the job being submitted expects to use, print a prompt telling the user that the system has insufficient computing resources and asking the user to contact the system administrator, and return; otherwise, go to step E;
E. Execute a scheduling system command that transfers the number of nodes the submitted job expects to use from the resource pool to the private queue corresponding to the user's private queue name, and then pass the bsub1 parameters to bsub2;
F. Execute bsub2, acquire the job node number of the submitted job, and run the submitted job;
G. After the submitted job ends normally, execute a system command that moves the nodes allocated from the resource pool for the submitted job from the user's private queue back to the resource pool.
2. The resource allocation method for queue resource scheduling based on a supercomputer according to claim 1, wherein t = 1.
3. The resource allocation method for queue resource scheduling based on a supercomputer according to claim 1 or 2, wherein the number of resources in the private queues is greater than the number of resources in the resource pool.
CN202010429029.7A 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer Active CN113703952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010429029.7A CN113703952B (en) 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010429029.7A CN113703952B (en) 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer

Publications (2)

Publication Number Publication Date
CN113703952A CN113703952A (en) 2021-11-26
CN113703952B true CN113703952B (en) 2023-10-10

Family

ID=78645441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010429029.7A Active CN113703952B (en) 2020-05-20 2020-05-20 Resource allocation method for queue resource scheduling based on supercomputer

Country Status (1)

Country Link
CN (1) CN113703952B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114020443B (en) * 2022-01-05 2022-03-18 国家超级计算天津中心 Supercomputer resource scheduling method, electronic device and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902592A (en) * 2012-09-10 2013-01-30 曙光信息产业(北京)有限公司 Zoning scheduling management method of cluster computing resources
CN105320565A (en) * 2014-07-31 2016-02-10 中国石油化工股份有限公司 Computer resource scheduling method for various application software
CN106708622A (en) * 2016-07-18 2017-05-24 腾讯科技(深圳)有限公司 Cluster resource processing method and system, and resource processing cluster
CN106844056A (en) * 2017-01-25 2017-06-13 北京百分点信息科技有限公司 Hadoop big datas platform multi-tenant job management method and its system
CN108304260A (en) * 2017-12-15 2018-07-20 上海超算科技有限公司 A kind of virtualization job scheduling system and its implementation based on high-performance cloud calculating
CN109445919A (en) * 2018-10-19 2019-03-08 曙光信息产业(北京)有限公司 Online computing resource transaction system based on cloud service
CN110806928A (en) * 2019-10-16 2020-02-18 北京并行科技股份有限公司 Job submitting method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8020161B2 (en) * 2006-09-12 2011-09-13 Oracle America, Inc. Method and system for the dynamic scheduling of a stream of computing jobs based on priority and trigger threshold

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Vineetha Kondameedi等.Adaptive Hybrid Queue Configuration for Supercomputer Systems.《2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)》.2017,90-99. *
王硕.Hama中满足公平性和负载均衡资源调度器的研究及实现.《中国优秀硕士学位论文全文数据库信息科技辑》.2017,I138-2267. *

Also Published As

Publication number Publication date
CN113703952A (en) 2021-11-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant