CN113886080A - High-performance cluster task scheduling method and device, electronic equipment and storage medium - Google Patents

High-performance cluster task scheduling method and device, electronic equipment and storage medium

Info

Publication number
CN113886080A
CN113886080A
Authority
CN
China
Prior art keywords: task, task scheduling, model, prediction model, performance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111150923.1A
Other languages
Chinese (zh)
Inventor
李龙翔
刘羽
王倩
边晴云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111150923.1A
Publication of CN113886080A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038 — Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods


Abstract

The application discloses a high-performance cluster task scheduling method and apparatus, an electronic device, and a readable storage medium. The method comprises: constructing a task prediction model in advance based on a deep learning algorithm, the task prediction model being used to predict the application running time of each user task while different jobs run simultaneously; training a task scheduling model in advance, the task scheduling model dynamically optimizing the task scheduling process based on the trial-and-error and delayed-reward characteristics of reinforcement learning; inputting the current task operation parameters and calculation parameters of high-performance cluster task scheduling into the task prediction model to obtain a predicted application running time; and calling the task scheduling model to obtain a task scheduling result based on the predicted running time. With this method and apparatus, users do not need to specify computing resources, and the utilization rate of the high-performance cluster and the accuracy of task scheduling are effectively improved.

Description

High-performance cluster task scheduling method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a high-performance cluster task scheduling method and apparatus, an electronic device, and a readable storage medium.
Background
With the rapid expansion of computing demands across industries and the availability of commodity CPUs (central processing units) and high-speed interconnection network devices, High Performance Computing (HPC) clusters have developed rapidly over the last 20 years. By aggregating a large number of microprocessor units, a high-performance computing cluster offers excellent scalability and an extremely high performance-to-cost ratio and can solve complex problems quickly, so it is widely applied in technical fields such as nuclear explosion simulation, weather forecasting, and engineering computation. According to Moore's law, the computing power of high-performance cluster platforms increases exponentially year over year, but in actual operation the achieved performance of an application program often fails to meet current demands. According to NERSC research, the ratio of the peak running performance of large-scale scientific applications to the theoretical performance of the running platform in past ACM Gordon Bell Prize winning cases has dropped from 40%-50% in the 1990s to 5%-10% at present. It can be seen that even large, highly optimized high-performance computing applications achieve a peak performance far below the theoretical maximum of the cluster. Therefore, improving the resource utilization rate of high-performance clusters, reducing the completion time of all jobs, and improving cluster throughput and fairness have become increasingly pressing technical problems in high-performance cluster operation.
Existing high-performance cluster scheduling systems are implemented based on classical heuristic algorithms, such as the commonly used Min-Min and Max-Min algorithms. These methods automate high-performance cluster task scheduling by constructing a minimum-completion-time matrix and searching for the allocation strategy corresponding to each computing task. However, because such scheduling systems cannot predict application running time, the user must specify the size of the requested resources when submitting a task, and the allocated resources cannot be adaptively adjusted according to the actual state of the cluster. Competition among different applications for resources such as memory bandwidth affects running time, yet the prior art cannot account for multiple applications running simultaneously on a single cluster node, so the final task scheduling is inaccurate. Traditional application prediction models use simple regression methods that cannot describe the mutual interference of multiple applications running simultaneously, and therefore cannot provide accurate results under complex conditions. Moreover, a regression model cannot take resource constraints into account during application running, such as the number of cores, memory capacity, and storage bandwidth, so the resulting model is usually only applicable when all resources of a single node are occupied, and its running-time accuracy is low when such constraints exist. In addition, conventional heuristic scheduling has shortcomings; for example, the Min-Min algorithm assigns tasks to faster nodes whenever possible, so slower nodes remain in a starvation state and the overall node utilization rate is low.
Disclosure of Invention
The application provides a high-performance cluster task scheduling method and apparatus, an electronic device, and a readable storage medium, with which users do not need to specify computing resources, and the utilization rate of the high-performance cluster and the accuracy of task scheduling are effectively improved.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
an embodiment of the present invention provides a high performance cluster task scheduling method, including:
a task prediction model is constructed in advance based on a deep learning algorithm, and the task prediction model is used for predicting the application running time of each user task in the process of simultaneously running different jobs;
pre-training a task scheduling model, wherein the task scheduling model is used for dynamically optimizing a task scheduling process based on trial and error and delayed reward characteristics of reinforcement learning;
inputting current task operation parameters and calculation parameters of high-performance cluster task scheduling into the task prediction model to obtain application operation prediction time;
and calling the task scheduling model to obtain a task scheduling result based on the application running prediction time.
Optionally, the task prediction model includes a preprocessing module and a deep learning module;
the preprocessing module is uniquely corresponding to the high-performance cluster type and is used for analyzing the current task operation parameters to obtain network parameters input into the deep learning module;
the deep learning module is used for training a deep neural network model through a sample data set based on the network parameters to obtain the task prediction model.
Optionally, training the deep neural network model through the sample data set based on the network parameters to obtain the task prediction model includes:
constructing a speed prediction model based on a deep neural network model in advance;
constructing a sample data set according to real state data of a plurality of acquired high-performance computing tasks trained in different high-performance clusters;
constructing target characteristics according to actual configuration information of a high-performance cluster and a historical state of each high-performance computing task running in the high-performance cluster; based on the target characteristics, obtaining the training speed of each high-performance computing task in different characteristic states to serve as a label of the corresponding high-performance computing task of the sample data set;
and training the speed prediction model by utilizing a sample data set carrying a label based on the network parameters.
Optionally, the training the speed prediction model by using the sample data set carrying the label includes:
and (3) by utilizing a sample data set carrying a label, using Adam as an optimizer, calling a supervised learning algorithm and training the speed prediction model by using a gradient descent method.
Optionally, the dynamically optimizing task scheduling process based on the trial-and-error and delayed reward features of reinforcement learning includes:
using cluster node numbers and task running sequence vectors of all task loads as states;
taking the cluster computing resources distributed by the task load and the change of the execution sequence of each task as actions to construct an action value function;
inputting a corresponding initial scheduling result into the task scheduling model, and taking the total predicted operation time as a delay reward characteristic;
and performing iterative computation on the action-value function by a reinforcement learning algorithm according to the delayed reward feature, so as to determine the final task scheduling result from the convergence result.
Optionally, the performing iterative computation on the action-value function by a reinforcement learning algorithm according to the delayed reward feature includes:
initializing the values of the action-value function, and setting the current state of the action-value function;
inputting the current state into the task scheduling model to obtain the task running time;
taking the task running time as the delayed reward feature, and obtaining the current reward and the next state by executing an action;
updating the action-value function according to the current reward and the next state, taking the next state as the current state, and iterating until the end state of the state set is reached; the state set is constructed from all task resource allocation states and task scheduling sequences.
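The iterative update described in these steps resembles tabular Q-learning; the sketch below is written under that assumption, with a toy chain of states standing in for the full scheduling state set and a fixed per-action cost standing in for the predicted run time. All names are illustrative, not from the patent.

```python
import random

def q_learning(states, actions, reward_fn, next_state_fn, terminal,
               episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning sketch of the iteration described above:
    epsilon-greedy trial and error, delayed reward, value update."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = states[0]                       # initial scheduling state
        while s != terminal:
            if rng.random() < epsilon:      # explore (trial and error)
                a = rng.choice(actions)
            else:                           # exploit the current estimate
                a = max(actions, key=lambda act: q[(s, act)])
            r = reward_fn(s, a)             # delayed reward, e.g. negative predicted run time
            s2 = next_state_fn(s, a)
            best_next = 0.0 if s2 == terminal else max(q[(s2, act)] for act in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

# Toy chain of scheduling states 0 -> 1 -> 2 (terminal); action 0 is the
# cheaper choice at every step, so its learned value should come out higher.
q = q_learning(states=[0, 1, 2], actions=[0, 1],
               reward_fn=lambda s, a: -1.0 if a == 0 else -2.0,
               next_state_fn=lambda s, a: s + 1, terminal=2)
```

After convergence the action-value table directly encodes the preferred scheduling action in each state, which is how the convergence result determines the final scheduling result.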
Another aspect of the embodiments of the present invention provides a high-performance cluster task scheduling apparatus, including:
the model training module is used for constructing a task prediction model in advance based on a deep learning algorithm, and the task prediction model is used for predicting the application running time of each user task in the process of simultaneously running different jobs; pre-training a task scheduling model, wherein the task scheduling model is used for dynamically optimizing a task scheduling process based on trial and error and delayed reward characteristics of reinforcement learning;
the running time prediction module is used for inputting the current task running parameters and the calculation parameters of the high-performance cluster task scheduling into the task prediction model to obtain application running prediction time;
and the task calling module is used for calling the task scheduling model to obtain a task scheduling result based on the application running prediction time.
Optionally, the task prediction model includes a preprocessing module and a deep learning module;
the preprocessing module is uniquely corresponding to the high-performance cluster type and is used for analyzing the current task operation parameters to obtain network parameters input into the deep learning module;
the deep learning module is used for training a deep neural network model through a sample data set based on the network parameters to obtain the task prediction model.
An embodiment of the present invention further provides an electronic device, which includes a processor, and the processor is configured to implement the steps of the high performance cluster task scheduling method according to any one of the foregoing items when executing the computer program stored in the memory.
Finally, an embodiment of the present invention provides a readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the steps of the high performance cluster task scheduling method according to any of the foregoing embodiments are implemented.
The technical solution provided by the application has the following advantages. A high-performance cluster application running-time prediction model is established using a deep learning method. Because it accounts for system resource occupation while a high-performance computing application runs and for mutual interference when different applications run simultaneously, the model can predict the various complex conditions arising in actual cluster operation and is applicable to high-performance cluster task scheduling systems with complex running tasks; unlike a traditional regression model, its applicability is wider and the final scheduling result more accurate. The application running time predicted by the task prediction model provides data support for task scheduling, and the optimization of task scheduling is converted into a reinforcement learning training process, realizing dynamic optimization of task scheduling and effectively improving the overall running efficiency of the cluster. During cluster operation, users submit tasks without specifying computing resources; the scheduling system adaptively adjusts the actually used computing resources according to the predicted times and the cluster resources and executes all tasks in a certain order, so that submitted tasks complete in the shortest time, maximizing cluster utilization and the total execution speed of all tasks.
In addition, the embodiment of the invention also provides a corresponding implementation device, electronic equipment and a readable storage medium for the high-performance cluster task scheduling method, so that the method has higher practicability, and the device, the electronic equipment and the readable storage medium have corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the related art, the drawings required to be used in the description of the embodiments or the related art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a high-performance cluster task scheduling method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an exemplary application scenario provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a deep learning module of an exemplary application scenario according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a training process of a task scheduling model in an exemplary application scenario according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an exemplary application scenario in which an action set and a state set are iterated using an action cost function according to an embodiment of the present invention;
fig. 6 is a structural diagram of a specific implementation manner of the high-performance cluster task scheduling device according to the embodiment of the present invention;
fig. 7 is a block diagram of an embodiment of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.
Having described the technical solutions of the embodiments of the present invention, various non-limiting embodiments of the present application are described in detail below.
Referring to fig. 1, fig. 1 is a schematic flowchart of a high-performance cluster task scheduling method provided in an embodiment of the present invention, where the embodiment of the present invention may include the following:
s101: and constructing a task prediction model in advance based on a deep learning algorithm.
In this embodiment, the task prediction model is used to predict the application running time of each user task in the process of running different jobs simultaneously. The task prediction model is constructed based on a deep learning method, and the resource use conditions such as the running time of a high-performance computing application task, the memory occupation and the like are predicted. Different from the traditional prediction module using a regression method, the task prediction model constructed in the step considers the interference condition existing in the simultaneous operation of different jobs, improves the accuracy of prediction time, and is suitable for prediction of any high-performance application. The main function of the task prediction model is to provide support for the optimization process of the whole high-performance cluster task scheduling system.
S102: and training a task scheduling model in advance.
The task scheduling model of this embodiment is used to dynamically optimize task scheduling based on the trial-and-error and delayed-reward characteristics of reinforcement learning. The model is constructed using a reinforcement learning method, and the task scheduling result is optimized during training through these trial-and-error and delayed-reward characteristics.
S103: and inputting the current task operation parameters and the calculation parameters of the high-performance cluster task scheduling into the task prediction model to obtain the application operation prediction time.
S104: and calling the task scheduling model to obtain a task scheduling result based on the application running prediction time.
This embodiment aims to solve the problem of high-performance cluster task scheduling. As shown in fig. 2, during cluster operation the task scheduling system needs to allocate cluster resources and execute all tasks in a certain order, so that the tasks submitted by users complete in the shortest time and the overall utilization rate of the cluster is improved. Assume a high-performance cluster has P nodes in total; the node set can be written as R = {R_1, R_2, ..., R_P}. Users submit a total of S computing tasks, represented as W = {W_1, W_2, ..., W_S}. Each computing task W_j can be subdivided into N_j computing loads T_j = {T_j,1, T_j,2, ..., T_j,N_j} serving as the minimum execution units, so the total number of loads over all tasks is N = Σ_{j=1}^{S} N_j.
Each task W_j takes time T(W_j) from submission to completion, and the total time after all tasks finish running is
F_T = Σ_{j=1}^{S} T(W_j).
For the high-performance cluster task scheduling system, F_T is the corresponding optimization objective function. In practical use of the high-performance cluster computing task scheduling system, the task prediction model adopts a deep learning method and, by extracting a module with uniform parameters, makes the system more flexible as a whole and applicable to predicting different HPC (High Performance Computing) applications. The task scheduling model adopts a reinforcement learning method: by converting the scheduling result into a state vector, the optimization of the task scheduling result is carried out during reinforcement learning training.
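Taking the objective as the total of per-task completion times T(W_j) (a reading of the flattened formula above, assumed here), the objective evaluation is a one-liner:

```python
def total_completion_time(task_times):
    """F_T = sum over j = 1..S of T(W_j): the total time of all submitted
    tasks, taken here as the scheduler's optimization objective."""
    return sum(task_times)

# Two candidate schedules for the same task set: the scheduler prefers
# the allocation with the smaller objective value F_T.
schedule_a = [4.0, 6.0, 5.0]   # per-task completion times under allocation A
schedule_b = [3.5, 6.0, 4.5]   # per-task completion times under allocation B
better = min((schedule_a, schedule_b), key=total_completion_time)
```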
In the technical solution provided by the embodiment of the invention, a high-performance cluster application running-time prediction model is established using a deep learning method. Because it accounts for system resource occupation while a high-performance computing application runs and for mutual interference when different applications run simultaneously, the model can predict the various complex conditions arising in actual cluster operation and is applicable to high-performance cluster task scheduling systems with complex running tasks; unlike a traditional regression model, its applicability is wider and the final scheduling result more accurate. The application running time predicted by the task prediction model provides data support for task scheduling, and the optimization of task scheduling is converted into a reinforcement learning training process, realizing dynamic optimization of task scheduling and effectively improving the overall running efficiency of the cluster. During cluster operation, users submit tasks without specifying computing resources; the scheduling system adaptively adjusts the actually used computing resources according to the predicted times and the cluster resources and executes all tasks in a certain order, so that submitted tasks complete in the shortest time, maximizing cluster utilization and the total execution speed of all tasks.
It should be noted that there is no strict execution order among the steps in the present application; as long as the logical order is met, the steps may be executed simultaneously or in a preset order. Fig. 1 is only an exemplary manner and does not represent the only possible execution order.
In the foregoing embodiment, the structure of the task prediction model is not limited, and an optional implementation of the task prediction model in this embodiment may include:
the task prediction model comprises a preprocessing module and a deep learning module; the preprocessing module is uniquely corresponding to the high-performance cluster type and is used for analyzing the current task operation parameters to obtain network parameters input into the deep learning module; the deep learning module is used for training the deep neural network model through the sample data set based on the network parameters to obtain a task prediction model. The task prediction model inputs application calculation parameters and operation parameters serving as feature vectors into the deep neural network model for training to obtain a unified prediction model suitable for different applications, and the running time prediction of different HPC applications is realized by matching with different preprocessing modules.
In this embodiment, within the task prediction model, a deep neural network method is used to predict a task submitted by a user, obtaining information such as the predicted running time and occupied resources. First, the submitted application running task parameters are analyzed to obtain relevant parameter information such as the running state and application resources, and finally this information, together with the calculation parameters, is input into the deep neural network model as a feature vector for prediction. The task prediction model comprises a preprocessing module that performs the preprocessing operation and a deep learning module that contains the neural network model. Preprocessing can parse the input files and operating parameters of specific application software and extract the parameters required for neural network prediction. For different high-performance application software, only the corresponding preprocessing module needs to be adjusted, so the prediction module has advantages such as wide applicability and simple application. The deep neural network model of the deep learning module may adopt, for example, a three-layer fully-connected DNN (Deep Neural Network) model, each layer having 80 neural units and using the ReLU (Rectified Linear Unit) activation function, as shown in fig. 3. The output of the deep neural network is the application running time.
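As an illustration only, the described architecture (three fully-connected ReLU layers of 80 units, scalar running-time output) can be sketched in pure Python with randomly initialized, untrained weights. All names and the feature-vector size below are hypothetical; the patent itself builds the model with a deep learning framework.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    # weights: one row of input coefficients per output unit
    return [sum(wi * xi for wi, xi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def make_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5           # He initialization, suited to ReLU
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def predict_runtime(features, hidden_layers, out_layer):
    """Forward pass: three hidden ReLU layers, linear scalar output."""
    v = features
    for w, b in hidden_layers:
        v = relu(dense(v, w, b))
    return dense(v, *out_layer)[0]

rng = random.Random(0)
n_features = 12                           # assumed size of the preprocessed feature vector
hidden = [make_layer(n_features, 80, rng)]
hidden += [make_layer(80, 80, rng) for _ in range(2)]
out = make_layer(80, 1, rng)
t = predict_runtime([0.5] * n_features, hidden, out)
```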
In this embodiment, while the task prediction model operates, the application to run is determined according to the call command. Before predicting time, the corresponding preprocessing module is first called to parse the operation parameters and extract the parameters required by the deep learning module. Because of the specificity of different high-performance cluster computing applications, the preprocessing module needs to be developed separately for each application; however, since most of the extracted parameters are directly contained in the HPC application's run-time parameters or parameter files, the workload of the preprocessing module is small, and the parameters each model must produce are identical. In addition, during high-performance application running, most of the running time is spent on the iterative solution of linear equation systems, so the amount of computation can be predicted from the size of the linear system, giving a more accurate simulation time. The calculation parameters extracted during preprocessing include the following:
    • A 0-1 binary vector of length m representing the application type: if an element is 1, the HPC job is an application of that type, and all other elements are 0. m is the number of all HPC application types.
    • w: the number of unknowns in the HPC application's calculation process.
    • A 0-1 binary vector of length n representing the data type used to store the coefficient matrix of the linear equation system. In the Intel MKL there are 6 commonly used sparse matrix formats (csr, bsr, coo, dia, dok, and csc), so n may be set to 6. When the coefficient matrix is not stored in a sparse format, all elements of this vector are 0.
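A minimal sketch of assembling such a feature vector. The application-type catalogue below is a made-up placeholder (the patent does not enumerate one), while the six sparse formats follow the Intel MKL list above.

```python
APP_TYPES = ["cfd", "md", "fem", "weather"]                   # hypothetical catalogue, m = 4
SPARSE_FORMATS = ["csr", "bsr", "coo", "dia", "dok", "csc"]   # n = 6, as in Intel MKL

def one_hot(item, vocabulary):
    """0-1 binary vector with a single 1 marking the item's position;
    all zeros if the item is absent (e.g. a dense coefficient matrix)."""
    return [1 if v == item else 0 for v in vocabulary]

def build_features(app_type, n_unknowns, sparse_format=None):
    # Layout: [application one-hot | w | storage-format one-hot]
    return (one_hot(app_type, APP_TYPES)
            + [n_unknowns]
            + one_hot(sparse_format, SPARSE_FORMATS))

f = build_features("cfd", 1_000_000, "csr")
```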
And (4) obtaining the operation parameters in the preprocessing process to form a characteristic vector, inputting the characteristic vector into the deep neural network model, and predicting the operation time of the application under the given condition. As an optional implementation manner of this embodiment, based on the network parameters, the process of training the deep neural network model through the sample data set to obtain the task prediction model may include:
constructing a speed prediction model based on a deep neural network model in advance; constructing a sample data set according to real state data of a plurality of acquired high-performance computing tasks trained in different high-performance clusters; constructing target characteristics according to actual configuration information of the high-performance cluster and the historical state of each high-performance computing task running in the high-performance cluster; based on the target characteristics, obtaining the training speed of each high-performance computing task in different characteristic states to serve as a label of the corresponding high-performance computing task of the sample data set; and training the speed prediction model by using the sample data set carrying the label based on the network parameters.
The input features of the speed prediction model characterize the training process of a task, and each record in the data set represents the real state of a task in a cluster under different values of the features. The elements of the feature variables are defined as follows:
A computation-parameter vector describes the application's calculation and is extracted mainly by the preprocessing module; P represents the number of processes used by the HPC application's computation.
A process-placement vector of length k gives the placement of the HPC application's computing processes, where k is the number of compute nodes in the cluster. For example, if the server has 4 nodes, node01 to node04, and an HPC job has 30 processes distributed evenly over the first 3 nodes, the placement vector is (10, 10, 10, 0).
A node-load vector of length k gives the total number of processes already running on each compute node when the HPC application runs.
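The placement vector from the node01–node04 example can be sketched as follows; the function name is assumed, and only the vector's meaning comes from the text above:

```python
def placement_vector(processes_per_node, k):
    """Length-k vector: this job's process count on each compute node.

    processes_per_node maps node index -> process count; nodes absent
    from the map receive 0 (the job places no processes there).
    """
    return [processes_per_node.get(i, 0) for i in range(k)]

# 30 processes spread evenly over the first 3 of 4 nodes:
vec = placement_vector({0: 10, 1: 10, 2: 10}, k=4)
print(vec)  # [10, 10, 10, 0]
```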
To predict HPC job speed, the corresponding features are constructed from the actual configuration of the cluster and the historical state of HPC jobs run in the cluster, and an automation tool obtains the training speed of each HPC job under the different feature states as the label of the data set. The speed prediction model may be built with Keras as a three-layer fully-connected DNN whose output is the running time of the HPC job, trained with supervised learning using Adam as the optimizer. The training data set is constructed by collecting real data of HPC job runs in the cluster. Since most HPC applications compute iteratively, and the computation and communication demands are basically stable over the whole run, a mini-batch gradient descent method can be used to train the prediction model so that training converges quickly.
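A minimal Keras sketch consistent with this description — three fully-connected layers, a single run-time output, Adam, supervised mini-batch training. The layer widths, feature dimension, and placeholder data are assumptions, not values from the patent:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

FEATURE_DIM = 32  # assumed; in practice this is the feature-vector length

# Three fully-connected layers; the single output is the predicted run time.
model = keras.Sequential([
    keras.Input(shape=(FEATURE_DIM,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # predicted HPC job running time
])
model.compile(optimizer="adam", loss="mse")

# Supervised mini-batch training; X and y stand in for the collected
# job features and measured run-time labels.
X = np.random.rand(256, FEATURE_DIM)
y = np.random.rand(256)
model.fit(X, y, batch_size=32, epochs=2, verbose=0)
```

Mini-batch size 32 here is illustrative; per the text, the iterative and stable character of HPC workloads is what makes mini-batch gradient descent converge quickly on this data.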
The foregoing embodiment does not limit S102. This application further provides an implementation of task scheduling optimization in which the task scheduling model is trained through reinforcement learning and obtains the running time from the task prediction model as its reward, so that an optimal scheduling result is achieved when the method is applied to different clusters or high-performance applications and the cluster utilization is improved to the greatest extent. The implementation may include:
using cluster node numbers and task running sequence vectors of all task loads as states; taking the cluster computing resources distributed by the task load and the change of the execution sequence of each task as actions to construct an action value function; inputting the corresponding initial scheduling result into a task scheduling model, and taking the total predicted operation time as a delay reward characteristic; and performing iterative computation on the action value function by adopting a reinforcement learning algorithm according to the delay reward characteristic so as to determine a final task calling result according to the convergence result.
The task scheduling model of this embodiment uses a reinforcement learning method, integrates the optimization of the application task run into the training process, and finally determines the minimum value of FT. During training, the cluster node numbers and the task running-order vector of all task loads are used as the state; changes to the cluster computing resources allocated to the task loads and to the task execution order are used as actions to construct an action-value function; the corresponding scheduling result is input to the prediction module, and the total predicted running time is taken as the reward; the action-value function is computed iteratively with a reinforcement learning algorithm according to the reward until a converged result is obtained, so that the scheduling process is optimized from the states, actions, and convergence function, as shown in fig. 4. In this embodiment, for example, the Q-Learning algorithm can be used for reinforcement learning, with the state set being the set formed by all task resource allocation states and task scheduling orders. A vector p_s of length N + S is used as the state, where the first N elements are the node numbers forming the load distribution of all tasks and the last S elements are the execution numbers of the computing tasks W_j. Among the first N elements, the i-th element takes values in [0, P-1] and gives the number of the node to which the i-th load is distributed; among the last S elements, the i-th element takes values in [0, S-1] and indicates the running order of the i-th task. When two of these elements have the same value, the corresponding tasks start running simultaneously. When training with the reinforcement learning method, all changes of load-distribution resources and of the task scheduling order form the action set A.
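Under the state definition above, one possible encoding is sketched below. The values of N, S, P and the exact shape of the action set are illustrative assumptions; the patent only specifies that states are length-(N+S) vectors and that actions change load placement or task order:

```python
import itertools
import random

N, S, P = 3, 2, 4   # task loads, computing tasks, cluster nodes (illustrative)

def random_state():
    """State p_s: first N elements are node numbers of the load
    distribution, last S elements are task execution numbers."""
    placement = [random.randrange(P) for _ in range(N)]   # each in [0, P-1]
    order = list(range(S))
    random.shuffle(order)                                  # each in [0, S-1]
    return placement + order

def actions(state):
    """Action set A (assumed form): move one load to another node, or
    swap the execution numbers of two tasks."""
    acts = []
    for i in range(N):
        for node in range(P):
            if node != state[i]:
                acts.append(("move", i, node))
    for i, j in itertools.combinations(range(S), 2):
        acts.append(("swap", N + i, N + j))
    return acts

def apply_action(state, act):
    s = list(state)
    if act[0] == "move":
        _, i, node = act
        s[i] = node
    else:
        _, i, j = act
        s[i], s[j] = s[j], s[i]
    return s
```

With these sizes, each state offers N·(P-1) move actions plus C(S, 2) swap actions.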
Iteratively computing the action-value function with the Q-Learning algorithm according to the reward may include the following steps:
Initialize the action-value function and set its current state p_s.
Input the current state into the task scheduling model to obtain the task running time FT.
Take the task running time FT as the delayed reward feature, and obtain the current reward feature and the next state by performing an action, as shown in fig. 5. To ensure that the task scheduling model can search all possible scheduling processes and that the Q-Learning algorithm converges, a greedy algorithm can be used for the search.
Update the action-value function according to the current reward feature and the next state; the updated action-value function may be expressed as:
Q(p_s, α) = (1 - β)·Q(p_s, α) + β[R + γ·max_α Q(p′_s, α)];
where p_s = p′_s after the update; p_s is the current state, α is the action, Q(p_s, α) is the action-value function, i.e. the convergence-function value obtained by performing action α in the current state p_s; β is the learning rate, R is the reward, γ is the discount factor, and max_α Q(p′_s, α) is the maximum convergence-function value over the actions available in the next state p′_s.
Take the next state as the current state and repeat the iteration until the termination state of the state set is reached; the state set is constructed from all task resource allocation states and task scheduling orders.
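The update rule above can be sketched as one tabular Q-Learning step. The run-time predictor is stubbed with a toy function, and using the negated predicted time as the reward R (so that shorter schedules score higher) is an interpretation for the sake of the sketch, not a formula stated in the patent:

```python
from collections import defaultdict

BETA, GAMMA = 0.1, 0.9       # learning rate and discount factor

Q = defaultdict(float)        # action-value table, keyed by (state, action)

def predicted_total_time(state):
    """Stub for the task prediction model: total run time FT of the
    schedule encoded by `state` (a toy deterministic value here)."""
    return sum(state) + 1.0

def q_update(state, action, next_state, next_actions):
    """One application of
    Q(p_s, a) = (1 - beta)*Q(p_s, a) + beta*[R + gamma*max_a' Q(p'_s, a')]
    with the negated predicted run time as the delayed reward."""
    reward = -predicted_total_time(next_state)   # shorter FT -> larger reward
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    key = (state, action)
    Q[key] = (1 - BETA) * Q[key] + BETA * (reward + GAMMA * best_next)
    return Q[key]
```

States must be hashable (e.g. tuples) to serve as table keys; repeating `q_update` along greedily chosen actions until the terminal state mirrors the iteration described above.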
As can be seen from the above, this embodiment is applicable to different high-performance clusters and applications and has good universality. By integrating the high-precision task prediction model, the scheduling result can be optimized automatically within the task scheduling model after the user submits a task, rather than by running the application many times, which saves computing resources, speeds up training, and effectively increases the overall utilization of the cluster.
The embodiment of the present invention also provides a corresponding device for the high-performance cluster task scheduling method, making the method more practical. The device can be described from the perspective of functional modules and from the perspective of hardware. The high-performance cluster task scheduling device introduced below and the high-performance cluster task scheduling method described above may be referred to correspondingly.
From the perspective of functional modules, referring to fig. 6, fig. 6 is a structural diagram of a high-performance cluster task scheduling device according to an embodiment of the present invention. In a specific implementation, the device may include:
the model training module 601 is used for constructing a task prediction model in advance based on a deep learning algorithm, and the task prediction model is used for predicting the application running time of each user task in the process of simultaneously running different jobs; and training a task scheduling model in advance, wherein the task scheduling model is used for dynamically optimizing a task scheduling process based on trial and error and delayed reward characteristics of reinforcement learning.
And the running time prediction module 602 is configured to input the current task running parameters and the calculation parameters of the high-performance cluster task scheduling into the task prediction model to obtain application running prediction time.
And the task calling module 603 is configured to call the task scheduling model to obtain a task scheduling result based on the application running prediction time.
Optionally, in some embodiments of this embodiment, the task prediction model may include a preprocessing module and a deep learning module;
the preprocessing module is uniquely corresponding to the high-performance cluster type and is used for analyzing the current task operation parameters to obtain network parameters input into the deep learning module;
the deep learning module is used for training the deep neural network model through the sample data set based on the network parameters to obtain a task prediction model.
As an optional implementation manner of this embodiment, the model training module 601 may be further configured to: constructing a speed prediction model based on a deep neural network model in advance; constructing a sample data set according to real state data of a plurality of acquired high-performance computing tasks trained in different high-performance clusters; constructing target characteristics according to actual configuration information of the high-performance cluster and the historical state of each high-performance computing task running in the high-performance cluster; based on the target characteristics, obtaining the training speed of each high-performance computing task in different characteristic states to serve as a label of the corresponding high-performance computing task of the sample data set; and training the speed prediction model by using the sample data set carrying the label based on the network parameters.
As an optional implementation of the foregoing embodiment, the model training module 601 may further be configured to: train the speed prediction model by utilizing the sample data set carrying a label, using Adam as the optimizer, and invoking a supervised learning algorithm and a gradient descent method.
Optionally, in other embodiments of this embodiment, the model training module may be further configured to: using cluster node numbers and task running sequence vectors of all task loads as states; taking the cluster computing resources distributed by the task load and the change of the execution sequence of each task as actions to construct an action value function; inputting the corresponding initial scheduling result into a task scheduling model, and taking the total predicted operation time as a delay reward characteristic; and performing iterative computation on the action value function by adopting a reinforcement learning algorithm according to the delay reward characteristic so as to determine a final task calling result according to the convergence result.
As an optional implementation manner of this embodiment, the model training module may further be configured to: initializing a function value of the action value, and setting the current state of the action value function; inputting the current state into a task scheduling model to obtain the task running time; taking the task running time as a delay reward characteristic, and obtaining the current reward characteristic and the next state by executing an action; updating the action value function according to the current reward characteristic and the next state, taking the next state as the current state, and repeating iteration until the termination state of the state set is reached; the state set is constructed according to the allocation states of all task resources and the task scheduling sequence.
The functions of the functional modules of the high-performance cluster task scheduling device according to the embodiments of the present invention may be specifically implemented according to the method in the above method embodiments, and the specific implementation process may refer to the description related to the above method embodiments, which is not described herein again.
Therefore, the embodiment of the present invention requires no manual design of computing resources by the user, and effectively improves the utilization rate of the high-performance cluster and the accuracy of task scheduling.
The high-performance cluster task scheduling device mentioned above is described from the perspective of a functional module, and further, the present application also provides an electronic device, which is described from the perspective of hardware. Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device includes a memory 70 for storing a computer program; a processor 71, configured to execute a computer program to implement the steps of the high performance cluster task scheduling method according to any of the above embodiments.
The processor 71 may include one or more processing cores, such as a 4-core or 8-core processor; the processor 71 may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 71 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 71 may also include a main processor and a coprocessor: the main processor, also called a Central Processing Unit (CPU), processes data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 71 may integrate a GPU (Graphics Processing Unit), which renders and draws the content that the display screen needs to display. In some embodiments, the processor 71 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 70 may include one or more computer-readable storage media, which may be non-transitory. The memory 70 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 70 may be an internal storage unit of the electronic device, for example a hard disk of a server. In other embodiments, the memory 70 may be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card. Further, the memory 70 may include both an internal storage unit and an external storage device of the electronic device. The memory 70 may be used for storing various data and application software installed in the electronic device, such as the code of the program that executes the high-performance cluster task scheduling method, and may also be used to temporarily store data that has been or is to be output. In this embodiment, the memory 70 at least stores the computer program 701, which, after being loaded and executed by the processor 71, implements the relevant steps of the high-performance cluster task scheduling method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 70 may also include an operating system 702 and data 703, stored transiently or permanently. The operating system 702 may include Windows, Unix, Linux, etc. The data 703 may include, but is not limited to, data corresponding to high-performance cluster task scheduling results.
In some embodiments, the electronic device may further include a display 72, an input/output interface 73, a communication interface 74, alternatively referred to as a network interface, a power supply 75, and a communication bus 76. The display 72 and the input/output interface 73, such as a Keyboard (Keyboard), belong to a user interface, and the optional user interface may also include a standard wired interface, a wireless interface, and the like. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, as appropriate, is used for displaying information processed in the electronic device and for displaying a visualized user interface. The communication interface 74 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface, a bluetooth interface, etc., typically used to establish a communication connection between an electronic device and other electronic devices. The communication bus 76 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
Those skilled in the art will appreciate that the configuration shown in fig. 7 is not intended to be limiting of the electronic device and may include more or fewer components than those shown, such as a sensor 77 that performs various functions.
The functions of the functional modules of the electronic device according to the embodiments of the present invention may be specifically implemented according to the method in the above method embodiments, and the specific implementation process may refer to the description related to the above method embodiments, which is not described herein again.
Therefore, the embodiment of the present invention requires no manual design of computing resources by the user, and effectively improves the utilization rate of the high-performance cluster and the accuracy of task scheduling.
It is understood that, if the high-performance cluster task scheduling method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application may be substantially or partially embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and other media capable of storing program code.
Based on this, the embodiment of the present invention further provides a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the high-performance cluster task scheduling method according to any of the above embodiments.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. For hardware including devices and electronic equipment disclosed by the embodiment, the description is relatively simple because the hardware includes the devices and the electronic equipment correspond to the method disclosed by the embodiment, and the relevant points can be obtained by referring to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The high-performance cluster task scheduling method, the high-performance cluster task scheduling device, the electronic device and the readable storage medium provided by the application are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A high-performance cluster task scheduling method is characterized by comprising the following steps:
a task prediction model is constructed in advance based on a deep learning algorithm, and the task prediction model is used for predicting the application running time of each user task in the process of simultaneously running different jobs;
pre-training a task scheduling model, wherein the task scheduling model is used for dynamically optimizing a task scheduling process based on trial and error and delayed reward characteristics of reinforcement learning;
inputting current task operation parameters and calculation parameters of high-performance cluster task scheduling into the task prediction model to obtain application operation prediction time;
and calling the task scheduling model to obtain a task scheduling result based on the application running prediction time.
2. The method according to claim 1, wherein the task prediction model comprises a preprocessing module and a deep learning module;
the preprocessing module is uniquely corresponding to the high-performance cluster type and is used for analyzing the current task operation parameters to obtain network parameters input into the deep learning module;
the deep learning module is used for training a deep neural network model through a sample data set based on the network parameters to obtain the task prediction model.
3. The method according to claim 2, wherein training a deep neural network model through a sample data set to obtain the task prediction model based on the network parameters comprises:
constructing a speed prediction model based on a deep neural network model in advance;
constructing a sample data set according to real state data of a plurality of acquired high-performance computing tasks trained in different high-performance clusters;
constructing target characteristics according to actual configuration information of a high-performance cluster and a historical state of each high-performance computing task running in the high-performance cluster; based on the target characteristics, obtaining the training speed of each high-performance computing task in different characteristic states to serve as a label of the corresponding high-performance computing task of the sample data set;
and training the speed prediction model by utilizing a sample data set carrying a label based on the network parameters.
4. The method according to claim 3, wherein the training the speed prediction model using the sample data set with labels comprises:
and (3) by utilizing a sample data set carrying a label, using Adam as an optimizer, calling a supervised learning algorithm and training the speed prediction model by using a gradient descent method.
5. The method according to any one of claims 1 to 4, wherein the dynamically optimizing task scheduling process based on the reinforcement learning trial-and-error and delay reward features comprises:
using cluster node numbers and task running sequence vectors of all task loads as states;
taking the cluster computing resources distributed by the task load and the change of the execution sequence of each task as actions to construct an action value function;
inputting a corresponding initial scheduling result into the task scheduling model, and taking the total predicted operation time as a delay reward characteristic;
and performing iterative computation on the action value function by adopting a reinforcement learning algorithm according to the delay reward characteristic so as to determine a final task calling result according to a convergence result.
6. The method according to claim 5, wherein the iteratively calculating the action value function by using a reinforcement learning algorithm according to the delay reward feature comprises:
initializing a function value of the action value, and setting the current state of the action value function;
inputting the current state into the task scheduling model to obtain task running time;
taking the task running time as the delay reward characteristic, and obtaining the current reward characteristic and the next state by executing an action;
updating the action value function according to the current reward feature and the next state, taking the next state as the current state, and repeating iteration until the termination state of the state set is reached; the state set is constructed according to all task resource allocation states and task scheduling sequences.
7. A high performance cluster task scheduler comprising:
the model training module is used for constructing a task prediction model in advance based on a deep learning algorithm, and the task prediction model is used for predicting the application running time of each user task in the process of simultaneously running different jobs; pre-training a task scheduling model, wherein the task scheduling model is used for dynamically optimizing a task scheduling process based on trial and error and delayed reward characteristics of reinforcement learning;
the running time prediction module is used for inputting the current task running parameters and the calculation parameters of the high-performance cluster task scheduling into the task prediction model to obtain application running prediction time;
and the task calling module is used for calling the task scheduling model to obtain a task scheduling result based on the application running prediction time.
8. The high-performance cluster task scheduler of claim 7, wherein the task prediction model comprises a pre-processing module and a deep learning module;
the preprocessing module is uniquely corresponding to the high-performance cluster type and is used for analyzing the current task operation parameters to obtain network parameters input into the deep learning module;
the deep learning module is used for training a deep neural network model through a sample data set based on the network parameters to obtain the task prediction model.
9. An electronic device, comprising a processor and a memory, the processor being configured to carry out the steps of the high performance cluster task scheduling method according to any one of claims 1 to 6 when executing a computer program stored in the memory.
10. A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the high performance cluster task scheduling method according to any one of claims 1 to 6.
CN202111150923.1A 2021-09-29 2021-09-29 High-performance cluster task scheduling method and device, electronic equipment and storage medium Pending CN113886080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111150923.1A CN113886080A (en) 2021-09-29 2021-09-29 High-performance cluster task scheduling method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113886080A true CN113886080A (en) 2022-01-04

Family

ID=79007902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150923.1A Pending CN113886080A (en) 2021-09-29 2021-09-29 High-performance cluster task scheduling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113886080A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936085A (en) * 2022-07-21 2022-08-23 联通沃音乐文化有限公司 ETL scheduling method and device based on deep learning algorithm
CN115168016A (en) * 2022-09-07 2022-10-11 浙江大华技术股份有限公司 Task scheduling method and related device, chip, device and medium
CN115907022A (en) * 2023-01-04 2023-04-04 苏州浪潮智能科技有限公司 Multi-quantum service conversion and simulation scheduling method, device, equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination