CN116541176A - Optimization method and optimization device for computing power resource allocation, electronic equipment and medium - Google Patents


Info

Publication number
CN116541176A
CN116541176A (application CN202310596058.6A)
Authority
CN
China
Prior art keywords
computing power
power resource
model
task
optimization
Prior art date
Legal status: Pending
Application number
CN202310596058.6A
Other languages
Chinese (zh)
Inventor
杜洋
Current Assignee
Beijing Research Institute Of China Telecom Corp ltd
China Telecom Corp Ltd
Original Assignee
Beijing Research Institute Of China Telecom Corp ltd
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Research Institute Of China Telecom Corp ltd, China Telecom Corp Ltd filed Critical Beijing Research Institute Of China Telecom Corp ltd
Priority to CN202310596058.6A priority Critical patent/CN116541176A/en
Publication of CN116541176A publication Critical patent/CN116541176A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 - Partitioning or combining of resources
    • G06F 9/5072 - Grid computing
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 - Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 - Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/12 - Computing arrangements based on biological models using genetic models
    • G06N 3/126 - Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The disclosure provides an optimization method, an optimization device, electronic equipment, and a medium for computing power resource allocation, and relates to the field of computer technology. The optimization method comprises the following steps: obtaining description information based on monitoring of a task to be processed by a cluster monitoring program, the description information describing a plurality of batches of subtasks of the task to be processed; inputting the description information into a prediction model to predict the predicted completion time and computing power resource demand of each subtask based on the prediction model; constructing a target optimization model of the predicted completion times of the plurality of batches of subtasks based on the computing power resource demand and the computing power resource occupancy of the computer cluster; and optimizing the computing power resource allocation of each subtask based on the target optimization model. The technical scheme improves the rationality of computing power resource allocation for complex, long-running HPC cluster tasks, thereby reducing the probability of allocation conflicts and maintaining a high utilization rate of computing power resources.

Description

Optimization method and optimization device for computing power resource allocation, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a method for optimizing computing power resource allocation, an apparatus for optimizing computing power resource allocation, an electronic device, and a computer readable storage medium.
Background
An HPC (High Performance Computing) cluster system is a computer system composed of high-specification computers and used for medium- and large-scale scientific and engineering computing. Through extremely high (trillion-operations-per-second) computing speed, it can process large-scale, long-running tasks that ordinary computers cannot handle. This places high demands on how shared resources in the cluster, such as computing power, network, and remote parallel file systems, are allocated. In particular, when multiple tasks execute concurrently, unreasonable resource allocation causes highly variable computing performance, for example conflicts between real-time resource usage and pre-allocated resource requirements caused by intra-task or cross-task interference.
Current computing power resource allocation in HPC cluster systems mainly pre-allocates the resources a task will occupy, or queues tasks by priority, according to the task's stated resource requirements. Because the real-time utilization of computing power resources during task execution is not considered, the utilization rate of computing power resources is lower than expected.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide an optimization method for computing power resource allocation, an optimization device for computing power resource allocation, an electronic device and a computer readable storage medium, which overcome at least to some extent the problem that the utilization rate of computing power resources is lower than expected in the related art.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a method of optimizing computing power resource allocation, comprising: acquiring description information based on monitoring of a task to be processed by a cluster monitoring program, wherein the description information describes a plurality of batches of subtasks of the task to be processed; inputting the description information into a prediction model to predict the predicted completion time and computing power resource demand of each subtask based on the prediction model; constructing a target optimization model of the predicted completion times of the plurality of batches of subtasks based on the computing power resource demand and the computing power resource occupancy of the computer cluster; and optimizing the computing power resource allocation of each subtask based on the target optimization model.
In one embodiment, the constructing a target optimization model of the predicted completion times of the plurality of batches of subtasks based on the computing power resource demand and the computing power resource occupancy of the computer cluster comprises: taking the computing power resource demand as the optimization parameter and the predicted completion times of the plurality of batches of subtasks as the optimization objectives, constructing a multi-objective computing power resource optimization model; and configuring an adjustable weight for each optimization objective based on the corresponding computing power resource demand and computing power resource occupancy, so as to convert the multi-objective computing power resource optimization model into a single-objective computing power resource optimization model, which is taken as the target optimization model.
In one embodiment, the configuring an adjustable weight for each optimization objective based on the corresponding computing power resource demand and computing power resource occupancy comprises: calculating the demand ratio of each subtask, the demand ratio being the ratio between its computing power resource demand and its computing power resource occupancy; summing the demand ratios of all subtasks to obtain the total demand ratio; and taking the ratio between each subtask's demand ratio and the total demand ratio as the adjustable weight of the corresponding optimization objective.
In one embodiment, the optimizing the computing power resource allocation of each subtask based on the target optimization model comprises: with the goal of minimizing the sum of the predicted completion times of the plurality of batches of subtasks output by the target optimization model, adjusting the adjustable weights under a constraint condition based on a genetic algorithm or a particle swarm algorithm, so that the computing power resource demands are globally and combinatorially optimized while the adjustable weights are adjusted; determining the adjusted adjustable weights as the target weights based on the global combinatorial optimization result; and optimizing the computing power resource allocation of each subtask based on the target weights.
In one embodiment, the constraint condition comprises: the computing power resource demand of each task to be processed is less than or equal to the total computing power resources of the computer cluster.
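As a minimal illustrative sketch (the function name and the numeric values below are assumptions, not part of the patent), this capacity constraint can be checked as follows:

```python
# Hypothetical sketch: the summed computing power resource demand of all
# pending tasks must not exceed the cluster's total capacity.

def is_feasible(demands, cluster_total):
    """Return True if the summed resource demand fits within the cluster."""
    return sum(demands) <= cluster_total

# Example: three subtask demands against a cluster with 100 units of capacity.
print(is_feasible([30, 25, 40], 100))  # total demand 95 <= 100 -> True
print(is_feasible([30, 45, 40], 100))  # total demand 115 > 100 -> False
```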
In one embodiment, the obtaining description information used to describe the plurality of batches of subtasks of the task to be processed comprises: acquiring the pre-processing model of the prediction model; splitting the task to be processed into the plurality of batches of subtasks in time order based on the pre-processing model; and acquiring the pending task amount of each subtask and the executed duration of the subtask as the description information.
In one embodiment, the inputting the description information into a prediction model to predict the predicted completion time and computing power resource demand of each subtask based on the prediction model comprises: monitoring the execution process of the task to be processed based on the cluster monitoring program to obtain the computing power resource occupancy, which comprises the accumulated CPU instruction execution amount, the accumulated memory read/write amount, and the accumulated IO read/write amount; constructing an input sequence based on the pending task amount, the executed duration, and the computing power resource occupancy; and inputting the input sequence into the prediction model to obtain an output sequence comprising the predicted completion time and the computing power resource demand, where the computing power resource demand comprises the predicted CPU instruction execution amount, the predicted memory read/write amount, and the predicted IO read/write amount for executing the subtask.
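A hedged sketch of assembling such an input sequence is shown below. The field names (`pending_amount`, `elapsed_s`, `cpu_instr`, `mem_rw`, `io_rw`) and all values are illustrative assumptions, not the patent's actual data layout:

```python
# Illustrative sketch: build a fixed-order feature vector per subtask batch
# from the pending task amount, the executed duration, and the monitored
# resource occupancy (accumulated CPU instructions, memory and IO read/writes).

def build_input_sequence(batches):
    """Turn each batch's monitoring dict into a fixed-order feature vector."""
    keys = ("pending_amount", "elapsed_s", "cpu_instr", "mem_rw", "io_rw")
    return [[b[k] for k in keys] for b in batches]

batches = [
    {"pending_amount": 1200, "elapsed_s": 300, "cpu_instr": 5e9, "mem_rw": 2e8, "io_rw": 1e7},
    {"pending_amount": 900,  "elapsed_s": 180, "cpu_instr": 3e9, "mem_rw": 1e8, "io_rw": 8e6},
]
seq = build_input_sequence(batches)
print(len(seq), len(seq[0]))  # 2 batches, 5 features each
```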
In one embodiment, before predicting the predicted completion time and computing power resource demand of each subtask based on the prediction model, the method further comprises: constructing a sample set based on the task to be processed and the computing power resource monitoring data of the computer cluster; building a recurrent neural network that takes, as the input sample sequence, the pending task amount of each batch of the task to be processed, the computing power resource occupancy on the computer cluster, and the executed duration of the subtask, and, as the output sample sequence, the predicted completion time and the computing power resource demand; and training the recurrent neural network on the sample set until the model converges, to obtain the computing power resource prediction model.
In one embodiment, the recurrent neural network comprises a gated recurrent neural network, and training it on the sample set until the model converges to obtain the computing power resource prediction model comprises: obtaining the reset gating state and the update gating state of the gated recurrent neural network based on the input sequence and the last state sequence preceding the current task state; updating the candidate hidden layer state of the gated recurrent neural network based on the reset gating state; generating the output sequence based on the updated candidate hidden layer state and the update gating state; detecting the loss value between the output sequence and the actual sequence based on a model loss function; and iterating model training based on the loss value until the model converges.
In one embodiment, the method further comprises: using a root mean square error loss function as the model loss function.
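The gated recurrent steps and the root mean square error loss described above can be sketched from scratch as follows. This is a loose, from-first-principles illustration under assumed dimensions and random parameters, not the patent's actual prediction network:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_cell(x, h, p):
    """One GRU step: reset gate, update gate, candidate hidden state, blend."""
    # reset gating state and update gating state from the input and prior state
    r = [sigmoid(a + b + c) for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    z = [sigmoid(a + b + c) for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    # candidate hidden layer state, computed from the reset-gated prior state
    rh = [ri * hi for ri, hi in zip(r, h)]
    hc = [math.tanh(a + b + c) for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh), p["bh"])]
    # new hidden state blends the prior state and the candidate via the update gate
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, hc)]

def rmse(pred, actual):
    """Root mean square error between an output sequence and the actual one."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, actual)) / len(pred))

rng = random.Random(0)
n_in, n_h = 5, 3  # assumed feature and hidden sizes

def mat(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

params = {"Wr": mat(n_h, n_in), "Ur": mat(n_h, n_h), "br": [0.0] * n_h,
          "Wz": mat(n_h, n_in), "Uz": mat(n_h, n_h), "bz": [0.0] * n_h,
          "Wh": mat(n_h, n_in), "Uh": mat(n_h, n_h), "bh": [0.0] * n_h}
h = [0.0] * n_h
for _ in range(4):  # four subtask batches as timesteps
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    h = gru_cell(x, h, params)
print(len(h), rmse([1.0, 2.0], [1.0, 2.0]))  # hidden size 3, zero loss
```

In a full training loop, the RMSE between the generated output sequence and the actual sequence would drive gradient updates of these parameters until convergence.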
According to another aspect of the present disclosure, there is provided an optimization apparatus for computing power resource allocation, comprising: an acquisition module for monitoring a task to be processed based on a cluster monitoring program and acquiring description information describing a plurality of batches of subtasks of the task to be processed; a prediction module for inputting the description information into a prediction model to predict the predicted completion time and computing power resource demand of each subtask based on the prediction model; a construction module for constructing a target optimization model of the predicted completion times of the plurality of batches of subtasks based on the computing power resource demand and the computing power resource occupancy of the computer cluster; and an optimization module for optimizing the computing power resource allocation of each subtask based on the target optimization model.
According to still another aspect of the present disclosure, there is provided an electronic apparatus comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above optimization method of computing power resource allocation via execution of the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method of optimizing computing power resource allocation.
According to the optimization scheme for computing power resource allocation provided by the embodiments of the disclosure, a large-scale, long-running task to be processed by an HPC cluster is processed into a plurality of batches of subtasks, and a prediction model predicts the predicted completion time and computing power resource demand of each subtask. The process of dynamically optimizing the required computing power resources then combines the prediction results with the monitored computing power resource occupancy of the computer cluster, taking into account both the multi-objective optimization of the predicted completion times of the plurality of batches of subtasks and the real-time operating conditions of the HPC cluster. This improves the rationality of computing power resource allocation for complex, long-running HPC cluster tasks, reduces the probability of allocation conflicts, and keeps the utilization rate of computing power resources high.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 illustrates a flowchart of a method for optimizing computing power resource allocation in an embodiment of the present disclosure;
FIG. 2 illustrates another method flow diagram for optimizing computing power resource allocation in an embodiment of the present disclosure;
FIG. 3 illustrates another method flow diagram for optimizing computing power resource allocation in an embodiment of the present disclosure;
FIG. 4 illustrates a flowchart of another method for optimizing computing power resource allocation in an embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of yet another method for optimizing computing power resource allocation in an embodiment of the present disclosure;
FIG. 6 illustrates a schematic diagram of an optimization apparatus for computing power resource allocation in an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of a computer device in an embodiment of the present disclosure; and
fig. 8 shows a block diagram of a program product in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
At present, existing computing power resource allocation methods mainly pre-allocate the resources a task requires according to its stated resource requirements, but those requirements are difficult to predict accurately. For example, the traditional scheme in which users apply for resources directly tends to overestimate the required resources, and schemes that allocate resources on re-execution according to a task's historical performance are poorly suited to HPC clusters: for some high-cost, long-running computing tasks, the cost of obtaining historical performance is high, and the mutually interfering multi-task running environment makes the collected monitoring data insufficiently accurate.
In addition, existing computing power resource allocation methods generally pre-allocate resources or queue tasks by priority, without considering the real-time utilization of computing power resources during task execution; for example, most HPC tasks alternate periodically between computation and IO. As a result, the utilization rate of computing power resources is low.
The respective steps of the optimization method and the model training method for computing power resource allocation in the present exemplary embodiment will be described in more detail with reference to the accompanying drawings and examples.
FIG. 1 illustrates a flowchart of a method for optimizing computing power resource allocation in an embodiment of the present disclosure.
As shown in fig. 1, a method of optimizing computing power resource allocation according to one embodiment of the present disclosure includes the following steps.
Step S102, based on the monitoring of the task to be processed by the cluster monitoring program, descriptive information is obtained, and the descriptive information is used for describing sub-tasks of a plurality of batches of the task to be processed.
A timing task or a hook task may be registered with the cluster monitoring program in advance; when a task is detected to have been submitted or triggered, the task to be processed is confirmed as acquired.
In one embodiment, description information is obtained, the description information being used to describe sub-tasks of a plurality of batches of tasks to be processed, including: dividing a task to be processed into sub-tasks of a plurality of batches based on time sequence based on a pre-processing model of the prediction model; and acquiring the amount of the tasks to be processed of each subtask and the executed time length of the subtask as description information.
Specifically, the prediction model comprises a pre-processing model, and the description information may be generated by it. During training of the prediction model, the pre-processing model processes highly complex, long-running HPC cluster tasks into a plurality of batches of subtasks in time order, and outputs the description information of each subtask.
In addition, by dividing the task to be processed into a plurality of batches of subtasks, a highly complex, computation-heavy cluster task is split in time order into multiple pending subtasks of smaller orders of magnitude, so that by separately computing the computing power resources of each subtask, the computing power resources of the task to be processed can be adjusted in real time on a per-batch basis.
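A minimal sketch of this time-ordered splitting, assuming the task's work items are already in time order and using a hypothetical `batch_size` parameter:

```python
# Hedged sketch: divide a long-running task's time-ordered work items into
# consecutive batches so each subtask has a smaller order of magnitude.

def split_into_batches(work_items, batch_size):
    """Split a time-ordered list of work items into consecutive batches."""
    return [work_items[i:i + batch_size]
            for i in range(0, len(work_items), batch_size)]

items = list(range(10))            # 10 time-ordered work units (assumed)
subtasks = split_into_batches(items, 4)
print(subtasks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```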
Step S104, inputting the description information into a prediction model to predict the predicted completion time and the computational power resource demand of each subtask based on the prediction model.
Computing power resources include, but are not limited to, CPU resources, memory resources, and IO resources.
Through the above batch processing operation, the task to be processed is described as multiple batches of smaller-magnitude data, from which a resource occupation sample set is constructed; combined with the prediction model, the predicted completion time and computing power resource demand of each batch's subtask are predicted, enabling prediction of a task's resource occupation characteristics even at very large data volumes.
Step S106, constructing a target optimization model of the predicted completion times of the plurality of batches of subtasks based on the computing power resource demand and the computing power resource occupancy of the computer cluster.
The computing power resource occupancy of the computer cluster may be collected in an independent environment, improving the accuracy of the collected occupancy data and eliminating cross-application interference.
And step S108, optimizing the computing power resource allocation of each subtask based on the target optimization model.
Optimizing the computing power resource allocation of each subtask based on the target optimization model means minimizing the predicted completion time by improving the utilization rate of computing power resources.
In this embodiment, a large-scale, long-running task to be processed by the HPC cluster is processed into a plurality of batches of subtasks, and a prediction model predicts the predicted completion time and computing power resource demand of each subtask. The process of dynamically optimizing the required computing power resources combines the prediction results with the monitored computing power resource occupancy of the computer cluster, considering both the multi-objective optimization of the predicted completion times of the plurality of batches of subtasks and the real-time operating conditions of the HPC cluster. This improves the rationality of computing power resource allocation for complex, long-running HPC cluster tasks, which in turn helps reduce the probability of allocation conflicts and keeps the utilization rate of computing power resources high.
As shown in fig. 2, a method of optimizing computing power resource allocation according to one embodiment of the present disclosure includes the following steps.
In step S202, the description information of the task to be processed is input into the prediction model to predict the predicted completion time and the computational power resource demand of each sub-task based on the prediction model.
Step S204, the computing power resource demand is used as an optimization parameter, the predicted completion time of the subtasks of a plurality of batches is determined as an optimization target, and a multi-target computing power resource optimization model is constructed.
The multi-objective computing power resource optimization model weights multiple objectives with different weights under given constraints, so that they are optimized as far as possible simultaneously.
Taking the predicted completion time of the subtask of the i-th batch, as computed by the prediction network, as one of the optimization objectives, the multi-objective problem is as shown in formula (1):

min F(x) = (T_1(x), T_2(x), ..., T_n(x))    (1)

where T_i(x) is the predicted completion time of the subtask of the i-th batch under computing power resource allocation x, and n is the number of batches into which the task to be processed is divided.
Further, based on the corresponding computing power resource demand and computing power resource occupancy, an adjustable weight is configured for each optimization objective so as to convert the multi-objective computing power resource optimization model into a single-objective computing power resource optimization model, which is taken as the target optimization model, as shown in formula (2):

min U(x) = w_1*T_1(x) + w_2*T_2(x) + ... + w_n*T_n(x)    (2)

where w_i is the adjustable weight configured for the i-th optimization objective.
By formulating a corresponding weight for each objective function and linearly weighting all objective functions, a comprehensive utility function represents the overall optimization target, and the solution corresponding to the optimal utility is regarded as the optimal computing power resource demand.
In addition, according to the prediction model, when the computing power resources allocated to each task satisfy as far as possible those the prediction model indicates are required, the task best matches the running progress it would have in an independent running environment, that is, its predicted completion time is shortest. The problem is thus converted, via the linear weighting method, into a single-objective optimization problem that can be solved by optimization algorithms such as particle swarm optimization or genetic algorithms.
In this embodiment, by configuring a corresponding weight for the predicted completion time of each subtask and adjusting the weights based on the computing power resource demand and occupancy, multi-objective optimization is converted into single-objective optimization, which facilitates the optimization of computing power resource allocation.
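The linear weighting conversion can be sketched as a comprehensive utility function; the weights and completion times below are illustrative assumptions:

```python
# Sketch of the single-objective form: the utility to be minimized is the
# weighted sum of the predicted completion times T_i (formula (2) style).

def utility(times, weights):
    """Comprehensive utility: sum over i of w_i * T_i (to be minimized)."""
    return sum(w * t for w, t in zip(weights, times))

times = [120.0, 90.0, 200.0]   # predicted completion times per batch (assumed)
weights = [0.2, 0.3, 0.5]      # adjustable weights, summing to 1 (assumed)
print(utility(times, weights))  # 0.2*120 + 0.3*90 + 0.5*200 = 151.0
```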
In one embodiment, configuring the adjustable weights for each optimization objective based on the corresponding computing power resource demand and computing power resource occupancy, respectively, includes:
in step S206, the demand ratio of each subtask is calculated, where the demand ratio is the ratio between the demand of computing power resources and the occupation of computing power resources.
Specifically, taking the CPU execution amount as an example, the demand ratio between the computing power resource demand and the computing power resource occupancy of each subtask is

r_i = D_i / O_i

where D_i is the computing power resource demand, O_i is the currently occupied amount of computing power resources, and i is the subtask index.
Step S208, the sum of the demand ratios of all subtasks is calculated to obtain the total demand ratio.
Step S210, the ratio between each demand ratio and the total demand ratio is taken as the adjustable weight of the corresponding optimization target, and the target optimization model is constructed based on the adjustable weights, as shown in formula (3).
In addition, a higher weight can be given to a task that is in a period of alternating computation and IO phases, or that has occupied few resources for a long time, so that its computing power resource amount is adjusted in time.
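Steps S206 to S210 can be sketched as follows; the function name and the sample numbers are illustrative:

```python
def demand_ratio_weights(demands, occupations):
    """Steps S206-S210: ratio_i = demand_i / occupation_i for each
    subtask; the weight is each ratio over the total of all ratios."""
    ratios = [d / o for d, o in zip(demands, occupations)]
    total = sum(ratios)
    return [r / total for r in ratios]

# Subtask 0 demands 4x its current occupation, subtask 1 demands 1x,
# so subtask 0 receives the larger adjustable weight.
w = demand_ratio_weights(demands=[8.0, 2.0], occupations=[2.0, 2.0])
# ratios = [4.0, 1.0], total = 5.0, so w = [0.8, 0.2]
```

The weights sum to one, so the weighted sum of predicted completion times remains a single scalar objective.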
In one embodiment, optimizing the allocation of computing power resources for each subtask based on a target optimization model includes:
Step S212, with the aim of minimizing the sum of the predicted completion times of the subtasks of the plurality of batches output by the target optimization model, the adjustable weights are adjusted under a constraint condition based on a genetic algorithm or a particle swarm algorithm, so that global combinatorial optimization is performed on the computing power resource demand while the adjustable weights are adjusted.
Step S214, determining the adjusted adjustable weights as target weights based on the result of the global combinatorial optimization.
Step S216, optimizing the computing power resource allocation of each subtask based on the target weight.
Specifically, taking the genetic algorithm as an example, the adjustable weights are adjusted under the constraint condition so that global combinatorial optimization is performed on the computing power resource demand during the adjustment. The process includes: encoding the computing power resource demands in a floating-point encoding mode, with the chromosome length equal to the total number of computing power resource demands; initializing the population and the weights; establishing a fitness function that converts the objective function value into a relative fitness value; and performing the selection, crossover (genetic), and mutation operations.
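A minimal sketch of such a genetic algorithm with floating-point encoding is given below. The stand-in `predicted_time` function replaces the GRU prediction, and all names, population sizes, and rates are illustrative assumptions, not values from the patent:

```python
import random

random.seed(0)

TOTAL_CPU = 100.0   # total cluster resource, as in the constraint of formula (4)
N_TASKS = 4         # one floating-point gene per subtask's resource demand

def predicted_time(alloc):
    # Stand-in for the GRU prediction: each task has 60 "work units",
    # so a larger allocation means a shorter predicted completion time.
    return sum(60.0 / max(a, 1e-6) for a in alloc)

def repair(chrom):
    # Enforce the constraint: total demand must not exceed cluster capacity.
    s = sum(chrom)
    return [c * TOTAL_CPU / s for c in chrom] if s > TOTAL_CPU else chrom

def evolve(pop_size=30, generations=40, mutation_rate=0.2):
    # Floating-point encoding: a chromosome is a list of resource demands.
    pop = [repair([random.uniform(1, 50) for _ in range(N_TASKS)])
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predicted_time)        # lower total time means fitter
        survivors = pop[:pop_size // 2]     # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_TASKS)      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:     # mutation
                i = random.randrange(N_TASKS)
                child[i] = max(1.0, child[i] + random.gauss(0, 5))
            children.append(repair(child))
        pop = survivors + children
    return min(pop, key=predicted_time)

best_alloc = evolve()
```

Because selection keeps the fitter half of each generation and `repair` rescales over-budget chromosomes, every returned allocation respects the cluster capacity.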
In one embodiment, the constraints include: the computing power resource demand of each task to be processed is less than or equal to the computing power resource total of the computer cluster.
Since the computing power resources of the HPC cluster are limited, for each task the total resources to be occupied, g_{cpu,t}, g_{memo,t}, g_{io,t}, must sum to less than the total amounts of computing power resources cpu_total, memo_total, io_total; that is, the constraint condition is as shown in formula (4).
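A direct check of the constraint in formula (4) might look like this; the dictionary layout and the numbers are illustrative:

```python
def satisfies_constraint(demands, totals):
    """Formula (4): for each resource type, the summed demand over all
    pending tasks must not exceed the cluster total."""
    return all(sum(d[kind] for d in demands) <= totals[kind]
               for kind in ("cpu", "memo", "io"))

tasks = [{"cpu": 30, "memo": 16, "io": 5},
         {"cpu": 50, "memo": 32, "io": 10}]
ok = satisfies_constraint(tasks, {"cpu": 100, "memo": 64, "io": 20})  # True
bad = satisfies_constraint(tasks, {"cpu": 64, "memo": 64, "io": 20})  # False, 80 > 64
```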
In one embodiment, inputting descriptive information into a predictive model to predict an estimated completion time and computational power resource requirements for each subtask based on the predictive model includes: monitoring the execution process of a task to be processed based on a cluster monitoring program to obtain the occupation amount of computing power resources, wherein the occupation amount of computing power resources comprises CPU instruction accumulated execution amount, memory accumulated read-write amount and IO accumulated read-write amount; constructing an input sequence based on description information and the occupation amount of computing power resources, wherein the description information comprises the amount of tasks to be processed of each subtask and the executed duration of the subtask; and inputting the input sequence into a prediction model to obtain an output sequence, wherein the output sequence comprises the predicted completion time and the computing power resource demand, and the computing power resource demand comprises the CPU instruction execution prediction quantity, the memory read-write prediction quantity and the IO read-write prediction quantity of the execution subtasks.
In the embodiment, the predicted completion time and the computing power resource demand are predicted by respectively acquiring the task quantity to be processed of the subtasks, the executed time length in the task to be processed and the current computing power resource occupation quantity as input sequences of a prediction model, so that the reliability of a prediction result is ensured.
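The construction of the input sequence and the derivation of the demand from the prediction can be sketched as follows; the dictionary keys follow the patent's notation, while the helper names and numbers are illustrative:

```python
def build_input_sequence(input_size, elapsed, cpu, memo, io):
    """Assemble the monitored state x_t; keys follow the patent's notation."""
    return {"input_size": input_size, "time_t": elapsed,
            "cpu_t": cpu, "memo_t": memo, "io_t": io}

def demand_from_prediction(pred_next, current):
    """The demand is the predicted next-period accumulated amount minus
    the currently occupied amount, per resource type."""
    return {k: pred_next[k] - current[k] for k in ("cpu", "memo", "io")}

x_t = build_input_sequence(10_000, 120.5, cpu=3.2e9, memo=5.0e8, io=1.2e7)
demand = demand_from_prediction(
    pred_next={"cpu": 4.0e9, "memo": 6.0e8, "io": 2.0e7},
    current={"cpu": 3.2e9, "memo": 5.0e8, "io": 1.2e7})
```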
As shown in fig. 3, a training method of a computational power resource prediction model according to one embodiment of the present disclosure includes:
step S302, a sample set is constructed based on the task to be processed and the computing power resource monitoring data of the computer cluster.
For an HPC task of high complexity and heavy computation, a small-batch data test is usually required first to preliminarily verify that the task functions as expected, preventing the waste of computing power resources caused by a faulty task. The monitoring data collected while the program runs is used to construct a sample set and train the GRU model, so that, given the input data amount input_size to be processed and the running time time_t, the HPC task can be pre-modeled to predict the computing power resource usage of its next stage.
Step S304, taking the amount of the task to be processed, the computing power resource occupation amount of the computer cluster and the executed time length of the subtask of each batch in the task to be processed as an input sample sequence, taking the predicted completion time of the amount of the task to be processed and the computing power resource demand amount as an output sample sequence, and building a circulating neural network.
An RNN (Recurrent Neural Network) is a recursive neural network that takes sequence data as input and performs recursion along the evolution direction of the sequence, with all nodes (recurrent units) connected in a chain; such a network model can effectively capture the relational characteristics between sequences.
Specifically, the HPC task to be submitted is converted into smaller data batches of different orders of magnitude to generate the sample set, and is represented as X = {x_0, x_1, x_2, ..., x_t}, i.e., the monitoring data sequence serves as the sequence input data, wherein:
the input sequence is specifically as follows:input_size represents the amount of data to be processed input, time t Indicating execution time, cpu t 、memo t 、io t The CPU instruction accumulated execution amount, the memory accumulated read-write amount and the IO accumulated read-write amount obtained according to monitoring are respectively used for predicting the resource demand by training the neural network.
The output sequence is specifically y_t = {finish_t, cpu_{t+1}, memo_{t+1}, io_{t+1}}, which represents the prediction result of the computing power resource prediction model and reflects its predictive capability. Similar to the input, finish_t in y_t represents the expected completion time based on the resources accumulated so far, and the other entries represent the CPU instruction accumulated execution amount, the memory accumulated read-write amount, and the IO accumulated read-write amount in the next time period; subtracting the current occupation from the accumulated amount of the next time period yields the corresponding computing power resource demand.
And step S306, performing model training on the cyclic neural network based on the sample set until the model converges to obtain a computational power resource prediction model.
In this embodiment, the computing power resource prediction model takes the amount of tasks to be processed and the computing power resource occupation of the computer cluster as the input sample sequence, and the predicted completion time and the computing power resource demand as the output sample sequence. Through training, the model learns the internal law between the input and output sequences, autonomously learning the resource occupation during task running, and the prediction results are combined to improve the computing power resource utilization on the basis of full data use.
As shown in fig. 4, in one embodiment, the recurrent neural network includes a gated recurrent neural network, and a specific implementation of step S306, performing model training on the recurrent neural network based on the sample set until the model converges to obtain the computing power resource prediction model, includes:
step S402, obtaining the reset gating state and updating the gating state of the gating circulating neural network based on the input sequence and the last state sequence before the current task state.
A GRU (Gated Recurrent Unit) is a kind of RNN that can better capture long-interval dependencies in time-series data, with fewer parameters and high efficiency.
Specifically, the input is the current monitored state:
x_t = {input_size, time_t, cpu_t, memo_t, io_t}
Internally, the GRU combines the last state h_{t-1} and the current input x_t to obtain two gating states, namely the reset gate state r_t and the update gate state z_t, as shown in formulas (5) and (6), respectively.
r_t = σ(x_t W_xr + h_{t-1} W_hr + b_r)  (5)
z_t = σ(x_t W_xz + h_{t-1} W_hz + b_z)  (6)
wherein x_t ∈ R^{n×d} (n is the number of samples, d is the number of inputs), h_{t-1} ∈ R^{n×h} (h is the number of hidden units), W_xr, W_xz ∈ R^{d×h} and W_hr, W_hz ∈ R^{h×h} are weight parameters, and b_r, b_z ∈ R^{1×h} are bias parameters.
Step S404, updating the candidate hidden layer state of the gated loop neural network based on the reset gating state.
Specifically, the candidate hidden layer state h'_t is updated based on the reset gating state, so as to determine the information to be forgotten, as shown in formula (7).
h'_t = tanh(W_h' · [r_t * h_{t-1}, x_t])  (7)
Step S406, generating an output sequence based on the updated candidate hidden layer state and the updated gating state.
Specifically, based on the update gate state, the information to be passed downward is determined as h_t = (1 - z_t) * h_{t-1} + z_t * h'_t, and the output sequence y_t is obtained from h_t, as shown in formula (8).
y_t = σ(W_O · h_t)  (8)
Step S408, detecting a loss value between the output sequence and the actual sequence based on the model loss function.
Step S410, performing loop iteration on model training based on the loss value until the model converges.
In this embodiment, the prediction model is trained based on the gated recurrent neural network, and the trained computing power resource prediction model can learn the real-time utilization demand for computing power resources, so that the computing power resources required in real time by tasks with ultra-large data volumes can be reliably predicted.
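A minimal NumPy sketch of one GRU step following formulas (5) to (8) is given below. The split-weight form of the candidate state used here is mathematically equivalent to the concatenated form W_h' · [r_t * h_{t-1}, x_t] of formula (7); the dimensions and the random initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, h = 1, 5, 8   # samples, input width (x_t has 5 fields), hidden units

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized parameters with the shapes stated after formula (6).
W_xr, W_xz = rng.normal(size=(d, h)), rng.normal(size=(d, h))
W_hr, W_hz = rng.normal(size=(h, h)), rng.normal(size=(h, h))
b_r, b_z = np.zeros((1, h)), np.zeros((1, h))
W_xh, W_hh, b_h = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros((1, h))

def gru_step(x_t, h_prev):
    r_t = sigmoid(x_t @ W_xr + h_prev @ W_hr + b_r)              # formula (5)
    z_t = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)              # formula (6)
    h_cand = np.tanh(x_t @ W_xh + (r_t * h_prev) @ W_hh + b_h)   # formula (7)
    return (1.0 - z_t) * h_prev + z_t * h_cand                   # h_t

x_t = rng.normal(size=(n, d))
h_t = gru_step(x_t, np.zeros((n, h)))
```

A trained model would additionally map h_t through the output layer of formula (8) to obtain y_t.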
In one embodiment, further comprising: and adopting a root mean square error loss function as a model loss function, wherein the root mean square error loss function is shown as a formula (9).
RMSE = √((1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²)  (9)

wherein ŷ_i is the actual value and y_i is the predicted value.
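Formula (9) can be implemented directly; the numbers below are illustrative:

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

loss = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
# squared errors: 4 + 4 + 9 = 17; mean = 17/3; loss ≈ 2.3805
```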
As shown in fig. 5, another optimization scheme for computing power resource allocation according to an embodiment of the disclosure specifically includes a preprocessing module and an optimization module.
The preprocessing module specifically executes the following processes, and specifically comprises:
step S502, initializing the LSTM network randomly.
Wherein LSTM is the optimized structure of RNN.
Step S504, modeling a prediction model of the computational power resources based on small batches of sample data of different orders of magnitude and the LSTM network.
And step S506, performing model training until the model converges to obtain a prediction model.
Step S508, splitting the task to be processed into subtasks of a plurality of batches in time sequence, based on a pre-processing model of the prediction model.
Step S510, obtaining the amount of tasks to be processed and the executed time length of each subtask as description information.
Step S512, submitting the prediction model and the description information as metadata to the HPC cluster.
Specifically, in the RNN pre-modeling stage, a task to be submitted is first pre-modeled by the time-series network GRU, so as to predict, under conditions such as input data of a certain order of magnitude, the running time, and the amount of computation to be performed, the computation or IO period in which the task is likely to be and the amount of resources it needs to occupy. The GRU model trained in this stage is exported as metadata and submitted together with the task. This stage can also be completed by user training when the user tests the task.
Step S514, a timed task or a hooked task is registered based on the cluster monitor.
Step S516, when the submitted task or the timing task trigger is detected, a task to be processed composed of a plurality of batches of subtasks is acquired.
Step S518, calculating the required resource proportion of each subtask based on the task monitoring data and the prediction model, and performing optimized allocation of the HPC cluster resources through the multi-objective optimization algorithm.
It is noted that the above-described figures are only schematic illustrations of processes involved in a method according to an exemplary embodiment of the invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
An optimization apparatus 600 for computing power resource allocation according to an embodiment of the present invention is described below with reference to fig. 6. The computing power resource allocation optimizing apparatus 600 shown in fig. 6 is only an example, and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
The optimizing means 600 of the computing power resource allocation is presented in the form of hardware modules. The components of the computing resource allocation optimization apparatus 600 may include, but are not limited to: the acquisition module 602 is configured to monitor a task to be processed based on the cluster monitoring program, and acquire description information, where the description information is used to describe subtasks of multiple batches of the task to be processed; a prediction module 604 for inputting the descriptive information into a prediction model to predict an estimated completion time and a computational power resource demand for each subtask based on the prediction model; a building module 606 for building a target optimization model of predicted completion times for sub-tasks of multiple batches based on the computing power resource demand and the computing power resource occupancy of the occupied computer cluster; an optimization module 608 is configured to optimize the computing power resource allocation of each subtask based on the target optimization model.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 700 according to this embodiment of the invention is described below with reference to fig. 7. The electronic device 700 shown in fig. 7 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 7, the electronic device 700 is embodied in the form of a general purpose computing device. Components of electronic device 700 may include, but are not limited to: the at least one processing unit 710, the at least one memory unit 720, and a bus 730 connecting the different system components, including the memory unit 720 and the processing unit 710.
Wherein the storage unit stores program code that is executable by the processing unit 710 such that the processing unit 710 performs steps according to various exemplary embodiments of the present invention described in the above-mentioned "exemplary methods" section of the present specification. For example, the processing unit 710 may perform the schemes described in steps S102 to S110 shown in fig. 1.
The memory unit 720 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 7201 and/or cache memory 7202, and may further include Read Only Memory (ROM) 7203.
The storage unit 720 may also include a program/utility 7204 having a set (at least one) of program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 730 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 700 may also communicate with one or more external devices 770 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 700, and/or any device (e.g., router, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 750. Also, electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 760. As shown, network adapter 760 communicates with other modules of electronic device 700 over bus 730. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, aspects of the invention may also be implemented in the form of a program product comprising program code for causing an electronic device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when the program product is run on the electronic device.
Referring to fig. 8, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on an electronic device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (13)

1. A method for optimizing computing power resource allocation, comprising:
acquiring description information based on monitoring of a task to be processed by a cluster monitoring program, wherein the description information is used for describing subtasks of a plurality of batches of the task to be processed;
inputting the description information into a prediction model to predict the predicted completion time and computational power resource requirements of each of the subtasks based on the prediction model;
constructing a target optimization model of the predicted completion time of the subtasks of the lots based on the computing power resource demand and the occupied computing power resource occupation of the computer cluster;
optimizing the computing power resource allocation of each subtask based on the target optimization model.
2. The method of optimizing computing power resource allocation according to claim 1, wherein the constructing a target optimization model of predicted completion times of sub-tasks of the plurality of batches based on the computing power resource demand and the computing power resource occupancy of the occupied computer cluster comprises:
taking the computing power resource demand as an optimization parameter, determining the predicted completion time of the subtasks of the batches as an optimization target, and constructing a multi-target computing power resource optimization model;
and respectively configuring adjustable weights for each optimization target based on the corresponding computing power resource demand and the computing power resource occupation so as to convert the multi-target computing power resource optimization model into a single-target computing power resource optimization model, and taking the single-target computing power resource optimization model as the target optimization model.
3. The method of optimizing the allocation of computing power resources according to claim 2, wherein the configuring of the adjustable weights for each of the optimization objectives based on the corresponding computing power resource demand amounts and computing power resource occupation amounts, respectively, includes:
calculating the demand ratio of each subtask, wherein the demand ratio is the ratio between the demand of the computing power resource and the occupation of the computing power resource;
Calculating the sum of the demand ratios of all the subtasks to obtain a total demand ratio;
and taking the ratio between the demand ratio and the total demand ratio as the adjustable weight of each optimization target.
4. The method of optimizing computing power resource allocation according to claim 2, wherein optimizing computing power resource allocation for each of the subtasks based on the target optimization model comprises:
the sum of the predicted completion times of the subtasks of the batches output by the target optimization model is minimized as a target, and the adjustable weight is adjusted under the constraint condition based on a genetic algorithm or a particle swarm algorithm, so that the computing power resource demand is globally combined and optimized in the process of adjusting the adjustable weight;
determining the adjusted adjustable weight based on the global combination optimizing result as a target weight;
and optimizing the computing power resource allocation of each subtask based on the target weight.
5. The method of optimizing computing power resource allocation according to claim 4, wherein the constraint comprises:
the computing power resource demand of each task to be processed is less than or equal to the computing power resource total of the computer cluster.
6. The method of optimizing computing power resource allocation according to any one of claims 1 to 5, wherein the acquiring the descriptive information includes:
acquiring a pre-processing model of the prediction model;
splitting the task to be processed into sub-tasks of the plurality of batches based on time sequence based on the pre-processing model;
and acquiring the task quantity to be processed of each subtask and the executed duration of the subtask to serve as the description information.
7. The method of optimizing computing power resource allocation according to claim 6, wherein said inputting the description information into a predictive model to predict an estimated completion time and a computing power resource demand for each of the subtasks based on the predictive model comprises:
monitoring the execution process of the task to be processed based on the cluster monitoring program to obtain the occupation amount of computing power resources, wherein the occupation amount of computing power resources comprises CPU instruction accumulated execution amount, memory accumulated read-write amount and IO accumulated read-write amount;
constructing an input sequence based on the task quantity to be processed, the executed time length and the computing power resource occupation amount;
and inputting the input sequence into the prediction model to obtain an output sequence, wherein the output sequence comprises the predicted completion time and the computing power resource demand, and the computing power resource demand comprises a CPU instruction execution prediction amount, a memory read-write prediction amount and an IO read-write prediction amount for executing the subtasks.
8. The method of optimizing computing power resource allocation according to claim 7, further comprising, prior to predicting the estimated completion time and computing power resource demand for each of the subtasks based on the prediction model:
constructing a sample set based on the task to be processed and computing power resource monitoring data of the computer cluster;
constructing a recurrent neural network, taking the amount of tasks to be processed in each batch of the task to be processed, the computing power resource occupation amount on the computer cluster and the executed duration of the subtasks as input sample sequences, and taking the estimated completion time and the computing power resource demand as output sample sequences;
and training the recurrent neural network model based on the sample set until the model converges, to obtain the computing power resource prediction model.
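The "train until the model converges" step can be sketched minimally with a linear model as a stand-in for the recurrent network, stopping when the loss improvement falls below a threshold; all values and the stopping tolerance are illustrative, not taken from the patent:

```python
import random

random.seed(1)

# Synthetic noiseless data: 64 samples, 3 features, known true weights.
true_w = [1.0, -2.0, 0.5]
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(64)]
y = [sum(wj * xj for wj, xj in zip(true_w, row)) for row in X]

w = [0.0, 0.0, 0.0]
prev_loss, lr = float("inf"), 0.1
for _ in range(10000):
    errs = [sum(wj * xj for wj, xj in zip(w, row)) - yi for row, yi in zip(X, y)]
    loss = sum(e * e for e in errs) / len(X)          # mean squared error
    if prev_loss - loss < 1e-12:                      # convergence criterion
        break
    prev_loss = loss
    # Gradient of the MSE with respect to each weight, then one descent step.
    grad = [2.0 / len(X) * sum(e * row[j] for e, row in zip(errs, X)) for j in range(3)]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
```

In the claimed method the model would be the recurrent network of claim 8 and the loss the model loss function of claim 9, but the loop structure (forward pass, loss, check convergence, update) is the same.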
9. The method of optimizing computing power resource allocation according to claim 8, wherein the recurrent neural network comprises a gated recurrent neural network, and said training the recurrent neural network model based on the sample set until the model converges to obtain the computing power resource prediction model comprises:
acquiring a reset gating state and an update gating state of the gated recurrent neural network based on the input sequence and the state sequence preceding the current task state;
updating the candidate hidden layer state of the gated recurrent neural network based on the reset gating state;
generating the output sequence based on the updated candidate hidden layer state and the update gating state;
detecting a loss value between the output sequence and the actual sequence based on a model loss function;
and iterating the model training in a loop based on the loss value until the model converges.
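The gate computations recited in this claim (reset gate, update gate, candidate hidden layer state) correspond to the standard gated recurrent unit step. A plain-Python sketch with illustrative random weights, assuming one common GRU update convention:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # W: list of rows; v: input vector.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def gru_cell(x_t, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step (biases omitted for brevity)."""
    r = [sigmoid(v) for v in add(matvec(Wr, x_t), matvec(Ur, h_prev))]  # reset gate
    z = [sigmoid(v) for v in add(matvec(Wz, x_t), matvec(Uz, h_prev))]  # update gate
    rh = [ri * hi for ri, hi in zip(r, h_prev)]                         # reset applied to h
    h_tilde = [math.tanh(v) for v in add(matvec(Wh, x_t), matvec(Uh, rh))]  # candidate hidden state
    # Blend previous state and candidate via the update gate.
    return [(1.0 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h_prev, h_tilde)]

random.seed(0)
n_in, n_h = 5, 3

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

Wr, Wz, Wh = (rand_mat(n_h, n_in) for _ in range(3))
Ur, Uz, Uh = (rand_mat(n_h, n_h) for _ in range(3))

h = [0.0] * n_h
for x_t in ([1.0, 0.5, 0.2, 0.1, 0.0], [0.8, 0.4, 0.3, 0.2, 0.1]):
    h = gru_cell(x_t, h, Wr, Ur, Wz, Uz, Wh, Uh)
```

The input vectors here stand in for the five monitored quantities per time step; a production implementation would use a deep-learning framework rather than hand-rolled matrix products.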
10. The method of optimizing computing power resource allocation according to claim 8, further comprising:
using a root mean square error loss function as the model loss function.
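The root mean square error loss named in this claim is simply the square root of the mean squared difference between predicted and actual sequences, e.g.:

```python
import math

def rmse_loss(pred, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))
```

Here `pred` would be the model's output sequence and `actual` the observed completion times and resource usages from the sample set.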
11. An apparatus for optimizing computing power resource allocation, comprising:
an acquisition module for monitoring a task to be processed based on a cluster monitoring program and acquiring description information, wherein the description information describes a plurality of batches of subtasks of the task to be processed;
a prediction module for inputting the description information into a prediction model to predict the estimated completion time and the computing power resource demand of each of the subtasks based on the prediction model;
a construction module for constructing a target optimization model of the estimated completion time of the subtasks of the batches based on the computing power resource demand and the computing power resource occupation amount of the occupied computer cluster;
and an optimization module for optimizing the computing power resource allocation of each subtask based on the target optimization model.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform, via execution of the executable instructions, the method of optimizing computing power resource allocation of any one of claims 1-10.
13. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the method for optimizing the allocation of computational resources according to any one of claims 1 to 10.
CN202310596058.6A 2023-05-24 2023-05-24 Optimization method and optimization device for computing power resource allocation, electronic equipment and medium Pending CN116541176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310596058.6A CN116541176A (en) 2023-05-24 2023-05-24 Optimization method and optimization device for computing power resource allocation, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310596058.6A CN116541176A (en) 2023-05-24 2023-05-24 Optimization method and optimization device for computing power resource allocation, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN116541176A true CN116541176A (en) 2023-08-04

Family

ID=87443468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310596058.6A Pending CN116541176A (en) 2023-05-24 2023-05-24 Optimization method and optimization device for computing power resource allocation, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116541176A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820784A (en) * 2023-08-30 2023-09-29 杭州谐云科技有限公司 GPU real-time scheduling method and system for reasoning task QoS
CN116820784B (en) * 2023-08-30 2023-11-07 杭州谐云科技有限公司 GPU real-time scheduling method and system for reasoning task QoS
CN117370022A (en) * 2023-10-25 2024-01-09 国网江苏省电力有限公司南通供电分公司 Computing power resource allocation method and system for power grid operation
CN117311996A (en) * 2023-11-29 2023-12-29 之江实验室 Batch data configuration and processing and running method and device for computing tasks
CN117407177A (en) * 2023-12-13 2024-01-16 苏州元脑智能科技有限公司 Task execution method, device, electronic equipment and readable storage medium
CN117407177B (en) * 2023-12-13 2024-03-08 苏州元脑智能科技有限公司 Task execution method, device, electronic equipment and readable storage medium
CN117724823A (en) * 2024-02-07 2024-03-19 之江实验室 Task execution method of multi-model workflow description based on declarative semantics
CN117931459A (en) * 2024-03-22 2024-04-26 深圳威尔视觉科技有限公司 Elasticity evaluation method and system for computing power resources
CN117931459B (en) * 2024-03-22 2024-06-07 深圳威尔视觉科技有限公司 Elasticity evaluation method and system for computing power resources
CN118014168A (en) * 2024-04-10 2024-05-10 沈阳德成软件技术有限公司 Particle swarm optimization-based enterprise operation management optimization method
CN118034941A (en) * 2024-04-12 2024-05-14 深圳市捷易科技有限公司 Cluster computing power optimization method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN116541176A (en) Optimization method and optimization device for computing power resource allocation, electronic equipment and medium
Karim et al. BHyPreC: a novel Bi-LSTM based hybrid recurrent neural network model to predict the CPU workload of cloud virtual machine
CN110399222B (en) GPU cluster deep learning task parallelization method and device and electronic equipment
CN109478045B (en) Controlling a target system using prediction
Gupta et al. Resource usage prediction of cloud workloads using deep bidirectional long short term memory networks
CN110852438A (en) Model generation method and device
CN111340221B (en) Neural network structure sampling method and device
CN114490057B (en) MEC offloaded task resource allocation method based on deep reinforcement learning
CN112561069A (en) Model processing method, device, equipment, storage medium and product
CN116627630A (en) Resource demand prediction model training, demand prediction and resource scheduling method and system
CN115168027A (en) Calculation power resource measurement method based on deep reinforcement learning
CN111027672A (en) Time sequence prediction method based on interactive multi-scale recurrent neural network
CN114580678A (en) Product maintenance resource scheduling method and system
CN113886080A (en) High-performance cluster task scheduling method and device, electronic equipment and storage medium
CN114895773A (en) Energy consumption optimization method, system and device of heterogeneous multi-core processor and storage medium
CN116684330A (en) Traffic prediction method, device, equipment and storage medium based on artificial intelligence
CN108897673B (en) System capacity evaluation method and device
CN113220466A (en) Cloud service load universal prediction method based on long-term and short-term memory model
CN114650321A (en) Task scheduling method for edge computing and edge computing terminal
CN117273233A (en) User future task amount and resource demand accurate prediction method based on long-short-term memory network model
CN112580885A (en) Method, device and equipment for predicting accessory qualification rate and storage medium
May et al. Multi-variate time-series for time constraint adherence prediction in complex job shops
JP2022023420A (en) Information processing apparatus, information processing method and information processing program
Sun et al. An improved hybrid algorithm based on PSO and BP for stock price forecasting
CN116910467A (en) Online runtime environment prediction method and device for disturbance of complex mixing part

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination