CN117687774A - Task model training method for computing power scheduling and computing power scheduling method and system - Google Patents
- Publication number
- CN117687774A (application CN202311507154.5A)
- Authority
- CN
- China
- Prior art keywords
- task
- resource
- execution time
- scheduling
- resource usage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
Abstract
The invention provides a task model training method for computing power scheduling, together with a computing power scheduling method and system. The computing power scheduling method comprises: acquiring a current task to be scheduled, the task comprising a task type and a first resource usage amount requested for executing the task; using the trained task model to dynamically scale, such that task performance is unaffected or remains within an acceptable range, the requested quota of those resources in the first resource usage amount that have little influence on task performance, thereby obtaining the resource usage amount allocated for executing the task; and performing computing power scheduling of the task according to the allocated resource usage amount. By accounting for the combined influence of multi-dimensional resources on task performance, the method achieves adaptive dynamic resource scheduling and effectively improves both task deployment completion and multi-dimensional resource utilization in AI computing power scheduling.
Description
Technical Field
The invention relates to the field of cloud computing, in particular to resource scheduling in cloud computing, and more particularly to a task model training method for computing power scheduling and to a computing power scheduling method and system.
Background
With the wide application of artificial intelligence technology in recent years, more and more intelligent computing clusters have been built. In cloud computing resource scheduling, the cluster scheduling problem for artificial intelligence (AI) computing power addresses the utilization of AI computing power resources and the deployment efficiency of machine learning tasks, with the ultimate goal of increasing the computing efficiency of AI computing power. At present, AI computing power cluster scheduling mainly analyzes the characteristics of machine learning (ML) tasks and optimizes the allocation of graphics processing unit (GPU) resources, so as to improve ML task performance or GPU resource utilization. For example, the periodicity, preemption and placement sensitivity of ML tasks during training are used to guide GPU allocation and sharing, so that training tasks with different requirements are scheduled efficiently and GPU utilization improves. Existing AI computing resource scheduling methods generally design the scheduling policy around the single dimension of GPU resources. Actual task performance, however, is determined not by GPU allocation alone but by the combined effect of multi-dimensional resources, so the designed performance of existing scheduling methods cannot be fully realized in real production clusters. Moreover, such AI computing power scheduling methods cannot make full use of the multi-dimensional resources of an AI cluster as a whole, which wastes limited cluster resources.
Disclosure of Invention
To solve the above problems, the invention provides a task model built with a machine learning method, and uses the task model to scale (expand or contract) the resource allocation of tasks in the current dynamic queue, thereby completing computing power scheduling.
According to a first aspect, an embodiment of the present invention provides a training method for a task model for computing power scheduling, including: taking task execution time, task GPU utilization, task average memory usage, task CPU core count and task type as training samples, wherein the task GPU utilization, task average memory usage, task CPU core count and task type are the inputs of the task model, and the task execution time is the label, serving as the expected output value; training the task model on the data in the training samples based on gradient-boosted regression trees; and using the trained task model as the prediction model of task execution time in the computing power resource allocation and scheduling method.
Preferably, the task types in the same training sample are the same.
Preferably, the task types include a computer vision task, a natural language processing task, a reinforcement learning task, a graph neural network task and a recommendation task.
Preferably, the depth of the regression tree is 10, the number of base learners is 100, and the learning rate is 0.1.
According to a second aspect, an embodiment of the present invention provides a computing power scheduling method, including: acquiring a current task to be scheduled, the task comprising a task type and a first resource usage amount requested for executing the task; using the task model obtained by any one of the methods of the first aspect to dynamically scale, such that task performance is unaffected or remains within an acceptable range, the requested quota of those resources in the first resource usage amount that have little influence on task performance, thereby obtaining the resource usage amount allocated for executing the task; and performing computing power scheduling of the task according to the allocated resource usage amount.
Preferably, using the task model obtained by any one of the methods of the first aspect to dynamically scale the requested quota of resources having little influence on task performance in the first resource usage amount, so as to obtain the resource usage amount allocated for executing the task, includes: outputting, with the task model, a first expected execution time of the task according to the task type and the first resource usage amount; scaling, such that task performance is unaffected or remains within an acceptable range, the requested quota of the resource dimensions having little influence on task performance, to obtain a second resource usage amount; outputting, with the task model, a second expected execution time of the task according to the second resource usage amount; and when the difference between the first expected execution time and the second expected execution time is below a preset difference threshold, taking the second resource usage amount as the allocated resource usage amount.
Preferably, the method further comprises: if the difference between the first expected execution time and the second expected execution time exceeds the preset difference threshold, performing the resource scaling adjustment on the requested quota of the resource dimensions having little influence on task performance using another resource scaling rate.
Preferably, the method further comprises: when, after the resource scaling adjustment has been performed on the resource dimensions with every resource scaling rate, the difference between the first expected execution time and the second expected execution time still exceeds the difference threshold, taking the first resource usage amount as the allocated resource usage amount.
Preferably, performing, by using the task model, the resource scaling adjustment at a set resource scaling rate on the resource dimensions to obtain the second resource usage amount includes: screening out, from the resource dimensions, those whose degree of influence on the execution time of the task is smaller than a set influence threshold, as the resource dimensions to be scaled; and performing the resource scaling adjustment at the set resource scaling rate on the resource dimensions to be scaled, to obtain the second resource usage amount.
According to a third aspect, an embodiment of the present invention provides a computing power scheduling system comprising a server, a task queue and a scheduler, and further comprising a task history running log database and a task model obtained by any one of the methods of the first aspect; the task history running log database is used for collecting and storing historical task data, including resource usage, execution time and task type; the task model is used for dynamically adjusting the requested quota of resources that have little influence on task performance in a task to be scheduled, so as to obtain the resource usage amount allocated for executing the task.
According to a fourth aspect, embodiments of the present invention provide a storage medium having stored therein computer executable instructions which when loaded and executed by a processor implement the steps of any of the methods as in the first and second aspects.
Compared with the prior art, the invention has the advantages that:
Compared with current AI computing power scheduling methods, which are limited to designs centered on GPU resource allocation, the scheduling method of the invention achieves adaptive dynamic resource scheduling by accounting for the combined influence of multi-dimensional resources on task performance, and effectively improves task deployment completion and multi-dimensional resource utilization in AI computing power scheduling. Dynamically adjusting resource requirements mitigates the over-requesting of resources when users submit tasks and reduces resource waste. The method also mitigates the inefficient use of some resource dimensions by ML tasks that results from mismatches between dimensions, and improves the overall resource utilization efficiency of AI computing power.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of the architecture of a computing power scheduling system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a task-model-based computing power scheduling method according to an embodiment of the present invention;
FIG. 3 is a flowchart of the multi-dimensional resource scaling and scheduling steps according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the architecture of a hardware device hosting the computing power scheduling system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by means of specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
While studying AI computing power scheduling, the inventors found that the low utilization, and even waste, of existing computing power resources arises because only GPU resources are considered during scheduling, while the influence of the multi-dimensional resources as a whole on actual task performance is ignored. To improve the task running performance and resource utilization of AI computing power, and to overcome the defect that existing AI computing power scheduling methods design their scheduling policy around the single dimension of GPU resources, the inventors combined an analysis of existing multi-dimensional resources with an analysis of real production clusters. They propose to use machine learning to model how the multi-dimensional resource requirements of different types of machine learning (ML) tasks in AI computing power clusters affect task execution and, in combination with the state of the computing power resources, to perform adaptive resource scaling, allocation and scheduling. This guides resource scheduling in AI computing power scenarios and improves both task deployment and execution efficiency on the user task side and resource utilization on the hardware resource side. In the present invention, the multi-dimensional resources generally include the number of CPU cores, GPU utilization and memory usage; in some embodiments, they may further include the requested network resource capacity.
Fig. 1 is a schematic diagram of the architecture of a computing power scheduling system according to an embodiment of the invention. The computing power scheduling system consists of users, servers of different types, a task queue, a task history running log database, a task resource performance model (hereinafter the task model) and a scheduler. Users submit machine learning tasks online or offline, and the task queue of the current scheduling round is updated with the tasks they submit. The servers are where tasks run, and computing power scheduling is the process of deploying tasks to servers; server types cover core computing power resources and edge computing power resources, and specifically include physical machines, virtual machines, edge computing nodes and the like. The task queue maintains the dynamic task list formed by tasks that users submit online and stores all tasks waiting to be scheduled; during computing power scheduling, tasks that can be executed in the current server state are selected from the task queue and scheduled. The task history running log database collects and stores historical task data, including resource usage, execution time, task type and the like; in some embodiments, the historical data includes task start time, task end time, task GPU utilization, task average memory usage, task CPU core count and task type (i.e., the application type of the task load). The task model guides the adaptive resource scaling of tasks during scheduling.
The operation flow of the computing power scheduling system is as follows. The scheduler updates the dynamic task queue according to the tasks dynamically submitted by users, together with their task types, multi-dimensional resource requirements and other task attributes. In each scheduling round: at the task level, the scheduler uses the task model to adaptively scale the resource requirements of the current user task in the task queue, and the updated resource demand is returned to the scheduler; at the cluster level, the scheduler combines the scaled task demand with the multi-dimensional resource usage state of the cluster and deploys the task to a server node that can satisfy the demand, thereby dynamically optimizing resource allocation for cluster jobs. Finally, the scheduling system updates the execution state at the task level and the resource state at the cluster level, and prepares for the next scheduling round.
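The cluster-level step can be pictured with a minimal sketch. The source does not specify a placement rule, so the first-fit strategy below, the `Server` and `Demand` classes, and the units (cores, GPU percent, GB) are all illustrative assumptions rather than the patented method.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    free_cpu: int      # free CPU cores
    free_gpu: float    # free GPU share, in percent (assumed unit)
    free_mem: float    # free memory, in GB (assumed unit)


@dataclass
class Demand:
    cpu: int
    gpu: float
    mem: float


def place(demand, servers):
    """First-fit placement (an assumption): pick the first server that
    satisfies every resource dimension, reserve the resources, and return
    its name; return None when no server fits."""
    for s in servers:
        if (s.free_cpu >= demand.cpu
                and s.free_gpu >= demand.gpu
                and s.free_mem >= demand.mem):
            s.free_cpu -= demand.cpu
            s.free_gpu -= demand.gpu
            s.free_mem -= demand.mem
            return s.name
    return None
```

In this sketch a request whose memory quota has been scaled down can fit on a smaller node that the original request could not use, which is the utilization benefit the paragraph describes.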
According to one embodiment of the invention, a training method for the task model is provided, in which the task model is built by machine learning regression analysis on the historical data in the task history running log database. Specifically, the modeling process uses extreme gradient boosting (XGBoost) to analyze the multi-dimensional resource requirements of different task types, that is, the influence of multi-dimensional resources on task execution time. XGBoost trains an ensemble of decision trees by boosting, with each new tree iteratively fitted to the residuals of the trees before it by gradient boosting. The output of the XGBoost model is the weighted sum of the predictions of all the iteratively fitted trees. The objective function of the XGBoost-based task resource demand prediction model consists of a loss function, which measures how well the model fits the data, and a regularization term, which penalizes model complexity. Building the structure of the task resource demand prediction model amounts to minimizing this objective. Concretely, the split point of each leaf node is chosen by maximizing the split gain: at each step the feature with the largest split gain is selected as the splitting feature of the leaf node, and new branches of the tree are constructed feature by feature in this way. The model that minimizes the total objective function is taken as the fitted task resource demand prediction model.
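For reference, the objective and split gain described above are usually written as follows in the standard XGBoost formulation. The patent text does not give these formulas, so the symbols here are assumptions: $l$ is the loss, $f_k$ the $k$-th tree, $T$ its number of leaves, $w_j$ its leaf weights, $\gamma$ and $\lambda$ the regularization coefficients, and $G$, $H$ the sums of first- and second-order loss gradients over the samples in the left (L) and right (R) child.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}

\mathrm{Gain} = \frac{1}{2}\left[
  \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

A candidate split is kept only when its gain is positive, which is how the regularization term limits tree growth.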
Preferably, the training method comprises: taking task execution time, task GPU utilization, task average memory usage, task CPU core count and task type as training samples. The features (inputs) of a training sample are the task GPU utilization, task average memory usage, task CPU core count and task type; the label (expected output) is the task execution time, that is, the time from the start of task execution to its end. Because the raw trace log records only the task start and end times, the task execution time in the training samples must be computed during data preprocessing. The task types in the training samples may differ; in some embodiments, however, to achieve more accurate computing power scheduling, all training samples used to train one task model share the same task type, that is, a separate task model is trained for each task type. The task types may include common machine learning task types from various fields, such as computer vision tasks, natural language processing tasks, reinforcement learning tasks, graph neural network tasks and recommendation tasks. The task model is trained on the data in the training samples; in some embodiments, modeling uses gradient-boosted tree regression, namely the regression method provided in the XGBoost library. The task regression model for the current task type regresses task execution time on the task's resource usage: the input is the resource usage and task type, and the output is the predicted task execution time.
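The preprocessing step described above can be sketched as follows. The record field names (`start_time`, `end_time`, `gpu_util`, `avg_mem_gb`, `cpu_cores`, `task_type`) and the timestamp layout are hypothetical, since the source does not describe the trace log schema; the sketch only shows deriving the execution-time label and grouping samples by task type so that one model can be trained per type.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout of the trace log


def build_samples(records):
    """Group log records by task type and derive (features, label) pairs.

    The label is the execution time in seconds, computed as end - start.
    """
    samples = defaultdict(list)
    for r in records:
        start = datetime.strptime(r["start_time"], TIME_FORMAT)
        end = datetime.strptime(r["end_time"], TIME_FORMAT)
        duration = (end - start).total_seconds()
        if duration <= 0:
            continue  # skip malformed records (assumed cleaning rule)
        features = [r["gpu_util"], r["avg_mem_gb"], r["cpu_cores"]]
        samples[r["task_type"]].append((features, duration))
    return dict(samples)
```

Because samples are keyed by task type, the task type itself is not included in the feature vector here; a single shared model would instead need it encoded as an input feature.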
In some embodiments, during training the regression tree depth is 10, the number of base learners used in the boosting ensemble is 100, and the learning rate is 0.1. The trained task model is used as the prediction model of task execution time in the computing power scheduling method.
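To make the three hyperparameters concrete, here is a dependency-free sketch of gradient boosting with squared-error loss. It is a simplified stand-in, not XGBoost itself: it has no regularization term and no second-order gradients, and the function names are illustrative. With the XGBoost library, the equivalent settings would be passed as `max_depth`, `n_estimators` and `learning_rate`.

```python
def _mean(values):
    return sum(values) / len(values)


def _sse(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, depth, max_depth, min_samples=2):
    """Fit a regression tree; a node is a leaf value or (feature, threshold, left, right)."""
    if depth >= max_depth or len(y) < min_samples or len(set(y)) == 1:
        return _mean(y)
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so both children are always non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return _mean(y)
    _, j, t, left, right = best
    return (j, t,
            build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def fit_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=10):
    """Each round fits a tree to the current residuals and adds a shrunken copy."""
    base = _mean(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = build_tree(X, residuals, 0, max_depth)
        trees.append(tree)
        pred = [pi + learning_rate * predict_tree(tree, row) for pi, row in zip(pred, X)]
    return base, trees, learning_rate


def predict(model, row):
    base, trees, lr = model
    return base + lr * sum(predict_tree(t, row) for t in trees)
```

The learning rate shrinks each tree's contribution, so more base learners are needed to fit the data but each step is less prone to overfitting; depth bounds how many resource features can interact within one tree.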
The embodiment of the invention builds task models for different task types with a machine learning method and uses them to guide dynamic resource scaling during online scheduling. This addresses the over-requesting of resources by tasks during resource scheduling and allocation, and improves cluster resource utilization and task deployment completion.
The computing power scheduling system of the invention scales the allocation of online task resource demands based on task performance modeling. The system optimizes the resource scheduling of AI computing power at two levels, user tasks and cluster resources. Specifically: at the task level, task demands are adaptively scaled based on the expected task performance and the task resource performance model, which improves task deployment and execution efficiency; at the cluster level, resource allocation and scheduling are optimized based on the utilization state of the multi-dimensional resources, and a scheduling policy is designed for ML tasks whose resources have been scaled, which improves resource utilization in AI computing power scheduling.
Fig. 2 is a flowchart of a task-model-based computing power scheduling method according to an embodiment of the invention. The method comprises the following steps. A task submitted by a user is received; the task comprises a task type and the amount of multi-dimensional resources requested for executing the task (i.e., the first resource usage amount). The resource usage amount includes the number of CPU cores, the GPU utilization percentage, the average memory usage and the task load type. In some embodiments, the resource usage amount may also include the requested network resource capacity. The task queue of the scheduling system is updated with the user tasks to be executed in the current scheduling round, and the scheduler is notified periodically to perform scheduling. The period of a scheduling round can be set according to the operating requirements of the system and adjusted to the load level; preferably, it is set to 5 minutes. The scheduling queue maintains, as a list, the pending tasks submitted by users together with their specific resource requirements.
The scheduling queue orders the tasks currently waiting for scheduling according to a scheduling policy. Scheduling policies include first-come, first-served (tasks submitted first are scheduled first), shortest remaining time first, dominant resource fairness and the like, and the ordering of waiting tasks can be further modified and customized. Shortest remaining time first means that each time a new task arrives, the scheduling queue is adjusted so that the task that will finish soonest is placed first. Dominant resource fairness means that, because tasks request resources in multiple dimensions, fairness across resources must be ensured during task scheduling: the task whose requested resources make up the smallest proportion of the total resource pool is placed first in the scheduling queue, which achieves fair resource allocation among tasks. The scheduler then requests scheduling of the current task in the queue. Based on the ordering of tasks under the currently specified scheduling policy, the scheduling queue changes the state of the first task from its initial waiting state to the scheduling state, and the scheduling process for that task is then executed.
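The three ordering policies can be expressed as sort keys, as in the sketch below. The `QueuedTask` fields and policy names are hypothetical. For the fairness policy, the source only says that the task with the smallest share of the total pool goes first; taking the largest per-dimension share as the task's "dominant share" follows the usual dominant resource fairness definition and is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class QueuedTask:
    name: str
    submit_time: float      # seconds since an arbitrary epoch
    remaining_time: float   # predicted remaining execution time, seconds
    cpu: int                # requested CPU cores
    gpu: float              # requested GPU share, percent
    mem: float              # requested memory, GB


def dominant_share(task, total):
    """Largest fraction of any single pooled resource that the task requests."""
    return max(task.cpu / total["cpu"],
               task.gpu / total["gpu"],
               task.mem / total["mem"])


def order_queue(tasks, policy, total=None):
    """Return the waiting tasks sorted under the named policy."""
    if policy == "fcfs":      # first-come, first-served
        key = lambda t: t.submit_time
    elif policy == "srtf":    # shortest remaining time first
        key = lambda t: t.remaining_time
    elif policy == "drf":     # smallest dominant share first
        key = lambda t: dominant_share(t, total)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(tasks, key=key)
```

Keeping the policy as a sort key makes it straightforward to add customized orderings, which the paragraph notes the queue should support.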
An adaptive resource scaling process based on task model prediction is then executed to adjust the task's resource request. FIG. 3 is a flowchart of the multi-dimensional resource scaling and scheduling steps. First, the trained task model corresponding to the current task is determined from its task type, and the task model is used to predict the execution time of the current task under different multi-dimensional resource request amounts. The task model is built from the usage of resources of different dimensions in the historical log data and the corresponding task execution times; because it is a regression model obtained by machine learning, it can predict the expected task execution time for a given resource request. According to one embodiment of the invention, one model is trained per task type, which yields a more accurate model. An existing model feature importance analysis method is then applied to obtain the degree to which each resource dimension in the current task model influences the final task execution time. Starting from the original resource request of the current task, the resource dimensions with little influence on task execution time are scaled at a set resource scaling rate to obtain the allocated resource usage amount. The resource scaling rate is a parameter of the invention and denotes the percentage by which the resource usage amount is adjusted; choosing a suitable value reduces resource waste while preserving actual task execution efficiency. Because different tasks differ in their sensitivity to resources, the invention adaptively searches for a resource scaling rate that meets the performance requirement of the current task.
The resource scaling rate is found by combining the task model with a binary search over different resource scaling rates, which saves as much of the requested resources as possible while still meeting the task's performance requirement. Specifically, the multi-dimensional resource features are ranked by importance according to the degree to which each dimension's feature influences task execution time in the task model. The degree of influence is determined by randomly shuffling the feature values of each resource dimension and computing how much the prediction accuracy of the task model drops compared with the unshuffled case; the larger the drop, the greater the influence of that resource dimension on task execution time and the higher its importance. The quota of resources with low feature importance (little influence on task performance) is then adjusted, that is, adaptively and dynamically scaled such that task performance is unaffected or remains within a range acceptable to the user. For example, suppose the task resource performance model of the current task A indicates that CPU and GPU are important features for task execution time while memory has little influence on task performance. In the resource scaling stage, the memory requested by the user is then scaled, and the scaling rate is determined by combining the task model with a binary search over different resource scaling rates, yielding a resource scaling rate that meets the performance requirement of the current task.
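The shuffling-based influence measure described above is commonly known as permutation importance. The sketch below assumes the task model is any callable that maps a feature row to a predicted execution time, and it measures the accuracy drop as the increase in mean absolute error; the metric, the repeat count and the function names are assumptions, since the source does not fix them.

```python
import random


def mean_abs_error(predict, X, y):
    return sum(abs(predict(row) - yi) for row, yi in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Per feature, the mean increase in error after shuffling that column.

    A larger value means the resource dimension matters more for the
    predicted execution time; near-zero values mark candidates for scaling.
    """
    rng = random.Random(seed)
    baseline = mean_abs_error(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increase += mean_abs_error(predict, shuffled, y) - baseline
        importances.append(increase / n_repeats)
    return importances
```

Dimensions whose importance falls below a chosen influence threshold would then be selected as the dimensions to scale, as in the memory example for task A.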
The current task model then predicts the task execution time for the resource request after scaling. The task execution times before and after scaling are compared against the performance adjustment threshold to judge whether task performance is significantly affected. If the impact is large, the scheduling method tries scaling at other resource scaling rates. Once the impact falls within the threshold range, the scheduling method accepts the current adjusted resource request (that is, the second resource usage is taken as the allocated resource usage) and applies it in the scheduling process. If every configured resource scaling rate has a large impact on task execution time, the scheduling method abandons scaling for this task and applies the original resource request directly in the scheduling process. In the example of task A, while the memory is being scaled, the task model of task A predicts the expected execution time before and after scaling (the first expected execution time and the second expected execution time); if the difference between them is smaller than a set difference threshold, the memory scaling is executed, and otherwise the scaling amount is adjusted until the condition is satisfied.
If, after the current resource scaling, the task's performance or the execution time difference is not within the acceptable range, the original resource request of the current task is returned directly (that is, the resources requested when the user submitted the task, namely the first resource usage amount, are used as the allocated resource usage), and no scaling adjustment is applied to the request, so that the service quality of the system is not affected.
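The search and fallback logic of the preceding paragraphs might look like the sketch below. It assumes that scaling shrinks the selected dimensions by the given rate, that the candidate rates are sorted in ascending order, and that the predicted slowdown never decreases as the rate grows (which is what makes binary search valid). The function names, the toy predictor, and all numeric values are hypothetical.

```python
def scale_request(request, dims, rate):
    """Shrink the selected dimensions of a request by `rate` (0.2 means 20%)."""
    return {k: (v * (1.0 - rate) if k in dims else v) for k, v in request.items()}


def choose_allocation(predict, request, dims, rates, max_slowdown):
    """Binary-search the largest acceptable rate; fall back to the request.

    Assumes `rates` is sorted ascending and that predicted slowdown does not
    decrease as the rate grows.
    """
    first_time = predict(request)         # first expected execution time
    best = dict(request)                  # fallback: original request
    lo, hi = 0, len(rates) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = scale_request(request, dims, rates[mid])
        second_time = predict(candidate)  # second expected execution time
        if second_time - first_time <= max_slowdown:
            best = candidate              # acceptable: try a larger rate
            lo = mid + 1
        else:
            hi = mid - 1                  # too slow: try a smaller rate
    return best


# Hypothetical predictor: memory only matters once it drops below 10 GB.
def toy_predict(req):
    penalty = max(0.0, 10.0 - req["mem_gb"]) * 50.0
    return 1000.0 / req["gpu"] + 200.0 / req["cpu"] + penalty


rates = [0.1, 0.25, 0.5, 0.75, 0.9]
request = {"gpu": 1.0, "cpu": 4, "mem_gb": 32.0}
allocated = choose_allocation(toy_predict, request, {"mem_gb"}, rates, 30.0)

tight = {"gpu": 1.0, "cpu": 4, "mem_gb": 5.0}
unchanged = choose_allocation(toy_predict, tight, {"mem_gb"}, rates, 1.0)
```

In the first call the memory request is halved, since a 50% rate is the largest candidate that keeps the predicted slowdown within the threshold. In the second call every candidate rate exceeds the threshold, so the original request is returned unchanged.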
After resource scaling has been performed, the final resource request of the task in the current scheduling state is submitted to the scheduler, and the task waits for the scheduler to allocate the corresponding available resources. The scheduler allocates server resources according to the resource availability of the servers in the current AI computing power cluster, including the current usage of the multi-dimensional resources and the load balance among the servers. If the current server state cannot accommodate the current task, scheduling fails, and the task is resubmitted to the task queue to wait for the next round of scheduling. In other words, a user task that fails to be scheduled is automatically resubmitted until it is scheduled successfully.
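One scheduling round with automatic resubmission can be sketched as follows. The placement rule (pick the fitting server with the most free GPUs) is an assumption standing in for the load-balancing policy, which the text does not specify; server names, task names, and capacities are made up.

```python
from collections import deque


def fits(server_free, demand):
    """True if the server has enough free capacity in every dimension."""
    return all(server_free.get(k, 0) >= v for k, v in demand.items())


def schedule_round(queue, servers):
    """Try each queued task once; failed tasks are re-queued for the next round."""
    placed = {}
    for _ in range(len(queue)):
        name, demand = queue.popleft()
        candidates = [s for s in servers if fits(servers[s], demand)]
        if not candidates:
            queue.append((name, demand))  # scheduling failed: resubmit
            continue
        # Crude load-balancing rule (assumption): most free GPUs wins.
        target = max(candidates, key=lambda s: servers[s]["gpu"])
        for k, v in demand.items():
            servers[target][k] -= v       # update server resource state
        placed[name] = target
    return placed


servers = {
    "s1": {"gpu": 2, "cpu": 16, "mem_gb": 64},
    "s2": {"gpu": 4, "cpu": 32, "mem_gb": 128},
}
queue = deque([
    ("a", {"gpu": 4, "cpu": 8, "mem_gb": 32}),
    ("b", {"gpu": 4, "cpu": 8, "mem_gb": 32}),
    ("c", {"gpu": 1, "cpu": 4, "mem_gb": 16}),
])
placed = schedule_round(queue, servers)
```

After this round, task `a` occupies all GPUs of `s2`, task `c` lands on `s1`, and task `b` stays in the queue until a later round finds capacity for it.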
Task deployment is then executed: the tasks are scheduled to the corresponding servers for execution, the resource usage state of the servers running tasks and the completion state of the current tasks are updated, and the system waits for the next round to perform a new round of online scheduling. This completes the adaptive resource scaling scheduling method based on task prediction.
The computing power scheduling method guides resource scheduling in AI computing power scenarios. It jointly considers the scheduling and allocation of resources in multiple dimensions and is used to improve the deployment and execution efficiency of user tasks. At the user task level, the influence of multi-dimensional resource factors on the task is analyzed and modeled, so that adaptive dynamic resource scaling is achieved, task deployment and completion are optimized, and task efficiency is improved. Because the invention takes into account both the resource demand characteristics of different task types and the cluster utilization of multi-dimensional resources, it can resolve the mismatch among the multi-dimensional resources allocated at the task level and improve both the deployment efficiency of tasks at the cluster level and the utilization of multi-dimensional resources.
FIG. 4 shows a hardware device carrying the adaptive task resource scaling scheduling system, including a functional implementation of the scheduling step and the corresponding configuration of architecture modules. The scheduling architecture modules may be implemented by the hardware or processors required for the corresponding steps, and the scheduling process is carried out by executing the scheduling system code stored in a computer readable medium.
The device includes a processor (a central processing unit), an internal bus, a network interface, and a computer readable storage medium. The processor and the computer readable storage medium communicate with each other via the bus. Program code implementing all or part of the functions of the adaptive task resource scaling scheduling system of the present invention is stored in the computer readable storage medium. When the program is executed by the processor, the adaptive task resource scaling scheduling function is realized.
The invention provides an adaptive task resource scaling scheduling algorithm based on a machine learning task prediction method. The method uses a task performance model over multi-dimensional resources to address both the over-requesting of resources and the mismatch among multi-dimensional resources that occur when users submit tasks. A machine learning method is used to establish how strongly each resource dimension influences the final performance of the task, and the over-requested portion is adaptively scaled accordingly. By optimizing the resource requirements of user tasks, the resource scheduling method allows the resources of every dimension of the computing power cluster to be used efficiently; in particular, scarce and expensive GPU resources can be fully used, which effectively improves the overall utilization efficiency of the cluster.
The embodiments of the present disclosure are described in a progressive manner, and each embodiment focuses on the difference from other embodiments or implementations, so that the same or similar parts between the embodiments of the present disclosure may be referred to each other, and the implementation principles around the inventive concept and the technical effects produced may be referred to each other, which are not repeated herein. The various embodiments or implementations of the invention may be combined with one another without conflict.
It should be noted that, although the steps are described above in a specific order, it is not meant to necessarily be performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order, as long as the required functions are achieved.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanical encoding devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used in the embodiments of the present invention is chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed in the embodiments of the present invention.
Claims (11)
1. A method of training a task model for computational power scheduling, comprising:
taking task execution time, task GPU utilization rate, task average memory occupation, task CPU core number and task type as training samples; the task GPU utilization rate, the task average memory occupation, the task CPU core number and the task type are used as inputs of a task model, and the task execution time is a label and is used as an expected output value;
training the task model based on gradient boosting regression trees according to the data in the training samples;
and taking the trained task model as a prediction model of task execution time in the computational power scheduling method.
2. The method of claim 1, wherein the task types in the same training sample are the same.
3. The method of claim 2, wherein the task types include computer vision tasks, natural language processing tasks, reinforcement learning tasks, graph neural network tasks, and recommendation tasks.
4. The method of claim 1, wherein the regression tree has a depth of 10, the number of base learners employed is 100, and the learning rate is 0.1.
5. A computing power scheduling method, comprising:
acquiring a current task to be scheduled, wherein the task comprises a task type and a first resource usage amount applied for executing the task;
using the task model obtained by the method of any one of claims 1 to 4, dynamically scaling, on the condition that the task performance is not affected or remains within an acceptable range, the requested quota of those resources in the first resource usage amount that have little influence on the task performance, so as to obtain the resource usage amount allocated for execution of the task;
and performing computational scheduling on the task according to the allocated resource usage.
6. The method according to claim 5, wherein using the task model obtained by the method of any one of claims 1 to 4 to dynamically scale, on the condition that the task performance is not affected or remains within an acceptable range, the requested quota of those resources in the first resource usage amount that have little influence on the task performance, so as to obtain the resource usage amount allocated for execution of the task, includes:
outputting a first expected execution time of the task according to the task type and the first resource usage amount by using the task model obtained by the method of any one of claims 1 to 4;
scaling, on the condition that the task performance is not affected or remains within an acceptable range, the requested quota of those resource dimensions that have little influence on the task performance, so as to obtain a second resource usage amount;
outputting a second expected execution time of the task according to the second resource usage amount by using the task model obtained by the method of any one of claims 1 to 4;
and when the difference between the first expected execution time and the second expected execution time is below a preset difference threshold, taking the second resource usage as the allocated resource usage.
7. The method as recited in claim 6, further comprising:
if the difference between the first expected execution time and the second expected execution time exceeds the set difference threshold, performing resource scaling adjustment, using another resource scaling rate, on the requested quota of those resource dimensions that have little influence on the task performance.
8. The method as recited in claim 6, further comprising:
when the difference between the first expected execution time and the second expected execution time exceeds the difference threshold after resource scaling adjustment has been performed on the resource dimension with every resource scaling rate, taking the first resource usage amount as the allocated resource usage amount.
9. The method of claim 6, wherein the performing, with the task model, resource scaling adjustments at a set resource scaling rate for the resource dimension to obtain a second resource usage amount comprises:
screening out the resource dimension with the influence degree on the execution time of the task smaller than a set influence degree threshold from the resource dimension, and taking the resource dimension as the resource dimension to be stretched;
and executing resource expansion adjustment under the set resource expansion rate aiming at the dimension of the resource to be expanded, so as to obtain a second resource usage amount.
10. A computing power scheduling system, comprising a server, a task queue, and a scheduler, and further comprising a task history running log database and a task model obtained by the method of any one of claims 1 to 4;
the task history running log database is used for collecting and storing the history data of tasks, including resource use conditions, execution time and task types;
the task model is used for dynamically adjusting the application quota of the resource with small influence on the task performance in the task to be scheduled, so as to obtain the resource usage amount allocated for the execution of the task.
11. A storage medium having stored therein computer executable instructions which when loaded and executed by a processor perform the steps of the method according to any of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311507154.5A CN117687774A (en) | 2023-11-13 | 2023-11-13 | Task model training method for computing power scheduling and computing power scheduling method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311507154.5A CN117687774A (en) | 2023-11-13 | 2023-11-13 | Task model training method for computing power scheduling and computing power scheduling method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117687774A (en) | 2024-03-12 |
Family
ID=90136126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311507154.5A Pending CN117687774A (en) | 2023-11-13 | 2023-11-13 | Task model training method for computing power scheduling and computing power scheduling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117687774A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118467187A (en) * | 2024-07-15 | 2024-08-09 | 云南神经元信息技术有限公司 | Distributed cluster data production system |
2023-11-13: CN application CN202311507154.5A (patent/CN117687774A/en), status: active, pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118467187A (en) * | 2024-07-15 | 2024-08-09 | 云南神经元信息技术有限公司 | Distributed cluster data production system |
CN118467187B (en) * | 2024-07-15 | 2024-09-17 | 云南神经元信息技术有限公司 | Distributed cluster data production system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111738434B (en) | Method for executing deep neural network on heterogeneous processing unit | |
CN109788046B (en) | Multi-strategy edge computing resource scheduling method based on improved bee colony algorithm | |
CN107404523A (en) | Cloud platform adaptive resource dispatches system and method | |
CN114721833A (en) | Intelligent cloud coordination method and device based on platform service type | |
CN112732444A (en) | Distributed machine learning-oriented data partitioning method | |
CN117687774A (en) | Task model training method for computing power scheduling and computing power scheduling method and system | |
CN116594748B (en) | Model customization processing method, device, equipment and medium for task | |
CN112114973A (en) | Data processing method and device | |
CN117271101B (en) | Operator fusion method and device, electronic equipment and storage medium | |
CN113032367A (en) | Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system | |
CN115586961A (en) | AI platform computing resource task scheduling method, device and medium | |
CN118484277A (en) | Task scheduling method, task scheduling system and computer storage medium | |
US11775344B1 (en) | Training task queuing cause analysis method and system, device and medium | |
CN113407343A (en) | Service processing method, device and equipment based on resource allocation | |
CN116302448B (en) | Task scheduling method and system | |
CN116896591A (en) | Scheduling method and device for network data analysis model and computer equipment | |
CN114490094B (en) | GPU (graphics processing Unit) video memory allocation method and system based on machine learning | |
CN112598112B (en) | Resource scheduling method based on graph neural network | |
CN112363819B (en) | Big data task dynamic arrangement scheduling method and device and computing equipment | |
CN115904708A (en) | AI platform dynamic weighting scheduling method, device and storage medium | |
CN114327925A (en) | Power data real-time calculation scheduling optimization method and system | |
CN110008002B (en) | Job scheduling method, device, terminal and medium based on stable distribution probability | |
Górski et al. | Adaptive GP-based Algorithm for Hardware/Software Co-design of Distributed Embedded Systems. | |
JP7532666B2 (en) | Solving mixed integer problems using neural networks | |
CN117742928B (en) | Algorithm component execution scheduling method for federal learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||