CN113703952B

CN113703952B - Resource allocation method for queue resource scheduling based on supercomputer

Info

Publication number: CN113703952B
Application number: CN202010429029.7A
Authority: CN
Inventors: 刘弢; 田敏; 潘景山; 郭莹
Original assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2020-05-20
Filing date: 2020-05-20
Publication date: 2023-10-10
Anticipated expiration: 2040-05-20
Also published as: CN113703952A

Abstract

The invention relates to a resource allocation method for queue resource scheduling on a supercomputer, comprising the following steps: (1) a user submits a job, specifying the number of computing resources and the name of a private queue; (2) the submitted parameters, namely the number of computing resources and the private queue name specified by the user, are passed to the system for evaluation; if the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the requested number of computing resources, the user's job runs normally and the method ends; otherwise, the system checks whether the conditions for temporary allocation are met; (3) if the conditions are met, the required temporary nodes are moved from the resource pool into the private queue named in the submission, and the user's job runs to completion; otherwise, the reason the conditions were not met is printed; (4) the system returns the temporary nodes to the resource pool and the method ends. The invention optimizes the allocation of computing resources and improves efficiency. A resource queue with ample capacity can be kept in reserve for emergency resource requests.

Description

Resource allocation method for queue resource scheduling based on supercomputer

Technical Field

The invention relates to a resource allocation method for queue resource scheduling based on a supercomputer, and belongs to the technical field of dynamic resource scheduling algorithms for high-performance computing on supercomputers.

Background

Supercomputers serve research in national high-technology fields and cutting-edge technologies. They embody a country's scientific research strength, are significant for national security and for economic and social development, and are an important indicator of a country's level of scientific and technological development and of its comprehensive national strength. A country's supercomputers are generally operated and maintained by national-level supercomputing centers. As of the end of May 2020, seven national supercomputing centers had been built or were under construction in China: the National Supercomputing Centers in Tianjin, Changsha, Jinan, Guangzhou, Shenzhen, Wuxi, and Zhengzhou.

Currently, queue resources on national supercomputers are allocated along essentially two lines: commercial versus domestic computing resources, and queues of shared computing nodes versus queues of exclusive computing nodes. In the supercomputing field, computing node resources have uniform properties, and no dynamic scheduling algorithm exists that schedules them at the logical level. Computing resources are therefore mostly allocated manually according to users' purchase applications, and flexibility and responsiveness leave room for improvement.

In the early stage of building a supercomputer, its overall performance is generally evaluated using all of its computing nodes. Once the supercomputing center goes into operation and computing nodes are progressively leased out, it becomes difficult to assemble a sufficiently large pool of computing resources to support important scientific computations. The following problems arise during operation of a supercomputing center: (1) Users frequently occupy most of the computing resources of the shared queue, and bursts of computation strain resources at particular moments, placing excessive pressure on the system. (2) An exclusive queue belongs to one user or one class of users, so its computing resources are reserved, yet they are often idle, and the supercomputer cannot provide a large amount of concentrated computing power. (3) Some large scientific computing tasks cannot obtain sufficient computing resources on short notice.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a resource allocation method for queue resource scheduling based on a supercomputer.

Existing approaches have no dynamic allocation capability; dynamic resource scheduling at the logical level requires modifying and re-wrapping the scheduling algorithm at the user-facing scheduling system layer. Accordingly, the existing resources and each user's own resources are analyzed upon dynamically triggered monitoring, and a dynamic scheduling mechanism for the supercomputing center is designed at the resource logic layer. This improves resource utilization during scheduling and in turn alleviates the shortage of computing resources at the supercomputing center.

The technical scheme of the invention is as follows:

A resource allocation method for queue resource scheduling based on a supercomputer comprises the following steps:

(1) A user submits a job, specifying the number of required computing resources and the private queue name; for example, the specification includes the number of nodes, the number of cores required per node, and the name of the private queue to which the task is submitted;

(2) The submitted parameters, namely the number of computing resources and the private queue name specified by the user in step (1), are evaluated by the queue resource scheduling resource allocation method of the invention. If the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the number of computing resources in step (1), the user's job runs normally and the method ends; otherwise, the method determines whether the existing computing resources can satisfy the required number of computing resources, and proceeds to step (3);

(3) If the existing computing resources satisfy the required number, the required temporary nodes are moved from the resource pool into the private queue named in step (1), the user's job runs to completion, and the method proceeds to step (4); otherwise, the reason the condition was not met is printed, for example: the number of computing nodes requested by the submitted job exceeds the total number actually purchased;

(4) The queue resource scheduling resource allocation method returns the temporary nodes to the resource pool, and the method ends.
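
The four steps above can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the names `Cluster`, `submit`, and `run_job` are hypothetical, node counts are plain integers rather than scheduler state, "sufficient" is taken to mean `>=` (an assumption), and the number of temporary nodes is assumed to equal the full requested node count.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model: idle node counts only, no real scheduler calls."""
    pool_free: int                                    # idle nodes in the resource pool
    private_free: dict = field(default_factory=dict)  # idle nodes per private queue


def submit(cluster, queue, nodes, run_job):
    """Steps (1)-(4): run in the private queue, borrowing from the pool if needed."""
    # Step (2): the private queue alone is sufficient (assumed to mean >=).
    if cluster.private_free.get(queue, 0) >= nodes:
        run_job()
        return "ran in private queue"

    # Step (3): otherwise check whether the pool can cover the request.
    if cluster.pool_free < nodes:
        return "rejected: resource pool cannot cover the request"

    # Move temporary nodes from the pool into the named private queue.
    cluster.pool_free -= nodes
    cluster.private_free[queue] = cluster.private_free.get(queue, 0) + nodes
    try:
        run_job()
    finally:
        # Step (4): return the temporary nodes to the pool.
        cluster.private_free[queue] -= nodes
        cluster.pool_free += nodes
    return "ran with temporary nodes"
```

Returning the nodes in a `finally` block is a design choice of this sketch: it keeps the pool consistent even if the job callback raises, whereas the text above only describes the return of nodes after the job completes.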

According to a preferred embodiment of the present invention, steps (2) to (4) include the steps of:

A. Determine whether the number of resources in the private queue satisfies the number of resources required by the user, that is, the requested number of computing resources. If so, pass the bsub1 parameters to the system bsub2 and go to step F; otherwise, go to step B. The bsub1 parameters are all the parameters configured and submitted by the user, including the number of nodes, the number of cores required per node, and the name of the private queue to which the task is submitted. bsub2 is the bsub command that is called after step (4) is finished; that is, it acquires the job node number of the job, monitors the job state, and, after the job ends normally, executes the system command that moves the corresponding number of temporary resources from the user queue back to the resource pool queue;

B. Compute the sum of the number of nodes in the user's already submitted jobs and the number of nodes the job being submitted expects to use. If the sum is greater than the total number of nodes purchased by the user, return a printed prompt to the user; otherwise, go to step C;

C. The system computes the computing resources currently remaining available in the resource pool. If they are fewer than the number of nodes the job being submitted expects to use, go to step D; otherwise, go to step E;

D. After t minutes, the system again computes the computing resources remaining available in the resource pool. If they are still fewer than the number of nodes the job expects to use, return a printed prompt informing the user that the system has insufficient computing resources and asking the user to contact the system administrator; otherwise, go to step E;

E. Execute the scheduling system command that transfers the number of nodes the submitted job expects to use from the resource pool to the private queue corresponding to the user's private queue name, and then pass the bsub1 parameters to bsub2;

F. Execute bsub2, acquire the job node number of the submitted job, and run the submitted job;

G. After the submitted job ends normally, execute the system command that moves the nodes allocated from the resource pool for the submitted job from the user's private queue back into the resource pool.
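
The branching in steps A to C can be written as a single pure function, which may make the order of the checks easier to follow. This is a sketch under stated assumptions: the `Decision` enum, the function `decide`, and all parameter names are hypothetical, and "satisfies" in step A is taken to mean `>=`. It covers only the decision; the scheduler commands and the bsub2 invocation are not modeled.

```python
from enum import Enum


class Decision(Enum):
    RUN_IN_PRIVATE = "step A -> step F"
    REJECT_OVER_PURCHASE = "step B: prompt the user"
    WAIT_AND_RECHECK = "step C -> step D"
    BORROW_FROM_POOL = "step C -> step E"


def decide(requested, private_free, nodes_in_submitted_jobs,
           purchased_total, pool_free):
    """Return which branch of steps A-C applies to one submission."""
    # Step A: the private queue alone satisfies the request.
    if private_free >= requested:
        return Decision.RUN_IN_PRIVATE
    # Step B: nodes already in the user's submitted jobs plus this request
    # must not exceed the total number of nodes the user purchased.
    if nodes_in_submitted_jobs + requested > purchased_total:
        return Decision.REJECT_OVER_PURCHASE
    # Step C: the pool must currently hold enough free nodes.
    if pool_free < requested:
        return Decision.WAIT_AND_RECHECK
    return Decision.BORROW_FROM_POOL
```

For example, with 2 free private nodes, 4 nodes already in submitted jobs, 16 purchased nodes, and 10 free pool nodes, a request for 8 nodes passes steps A and B and is served from the pool: `decide(8, 2, 4, 16, 10)` returns `Decision.BORROW_FROM_POOL`.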

Further preferably, t=1.
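
The single delayed recheck of step D, with t given in minutes, might look like the following sketch. The function name `recheck_pool`, the callback `pool_free_fn` (standing in for whatever query reports free pool nodes), and the injectable `sleep` parameter are all assumptions made for illustration; the message text paraphrases the prompt described in step D.

```python
import time


def recheck_pool(requested, pool_free_fn, t_minutes=1.0, sleep=time.sleep):
    """Step D: wait t minutes, then read the free pool node count once more.

    Returns (True, "") if the pool can now cover the request, otherwise
    (False, message) with the prompt to show the user.
    """
    sleep(t_minutes * 60)  # t is expressed in minutes; sleep takes seconds
    if pool_free_fn() < requested:
        return False, ("Insufficient computing resources; "
                       "please contact the system administrator.")
    return True, ""
```

Passing `sleep` as a parameter lets the wait be replaced in tests, for example `recheck_pool(4, lambda: 6, sleep=lambda seconds: None)`, so the logic can be exercised without actually waiting one minute.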

According to a preferred embodiment of the present invention, in the above resource allocation method for queue resource scheduling based on a supercomputer, the number of resources in the private queue resources is greater than the number of resources in the resource pool. Generally, the annual utilization of the X86-architecture clusters at a supercomputing center fluctuates around 75%.

The beneficial effects of the invention are as follows:

1. The invention optimizes the allocation of computing resources and improves efficiency. Even where computing resources cannot be unified, the algorithm remains effective for cluster resource management, and the larger the resource base it manages, the greater its benefit. A resource queue with ample capacity can be kept in reserve for emergency resource requests.

2. The invention removes the need to modify each user's attributes at initial setup, can be maintained automatically afterward, and lets the system run automatically, saving labor costs.

Drawings

FIG. 1 is a flow chart of a method for allocating resources for queue resource scheduling based on a supercomputer.

Detailed Description

The invention is further described below with reference to the drawings and examples, but is not limited thereto.

Example 1

A resource allocation method for queue resource scheduling based on supercomputer, as shown in figure 1, comprises the following steps:

Example 2

The resource allocation method for queue resource scheduling based on a supercomputer according to embodiment 1 is characterized in that: step (2) to step (4), comprising the steps of:

t=1.

In the resource allocation method for queue resource scheduling based on a supercomputer, the number of resources in the private queue resources is greater than the number of resources in the resource pool. Generally, the annual utilization of the X86-architecture clusters at a supercomputing center fluctuates around 75%.

Claims

1. A resource allocation method for queue resource scheduling based on a supercomputer is characterized by comprising the following steps:

(1) A user submits a job, specifying the number of required computing resources and the private queue name;

(2) If the private queue has sufficient resources, that is, the number of resources in the private queue is greater than the number of computing resources in step (1), the user's job runs normally and the method ends; otherwise, it is determined whether the existing computing resources satisfy the required number of computing resources, and the method proceeds to step (3); the submitted parameters are the number of computing resources and the private queue name specified by the user in step (1);

(3) If the existing computing resources satisfy the required number of computing resources, the required temporary nodes are moved from a resource pool into the private queue corresponding to the private queue name in step (1), the user's job runs to completion, and the method proceeds to step (4); otherwise, the reason the condition was not met is printed;

(4) The temporary nodes are returned to the resource pool, and the method ends;

wherein steps (2) to (4) comprise the following steps:

A. Determining whether the number of resources in the private queue satisfies the number of resources required by the user, the number of resources required by the user being the number of computing resources; if so, passing the bsub1 parameters to the system bsub2 and going to step F; otherwise, going to step B; the bsub1 parameters being all the parameters configured and submitted by the user, including the number of nodes, the number of cores required per node, and the name of the private queue to which the task is submitted, and bsub2 being the bsub command that is called after step (4) is finished;

B. Computing the sum of the number of nodes in the user's already submitted jobs and the number of nodes the job being submitted expects to use; if the sum is greater than the total number of nodes purchased by the user, returning a printed prompt to the user; if the sum is not greater than the total number of nodes purchased by the user, executing step C;

D. After t minutes, the system computes the computing resources remaining available in the resource pool at that moment; if they are still fewer than the number of nodes the submitted job expects to use, a printed prompt is returned informing the user that the system has insufficient computing resources and asking the user to contact the system administrator; otherwise, going to step E;

2. A method for resource allocation for supercomputer-based queue resource scheduling as recited in claim 1, wherein t = 1.

3. The resource allocation method for queue resource scheduling based on a supercomputer according to claim 1 or 2, wherein the number of resources in the private queue resources is greater than the number of resources in the resource pool.