CN113254192A - Resource allocation method, resource allocation device, electronic device, and storage medium - Google Patents

Info

Publication number
CN113254192A
Authority
CN
China
Prior art keywords
resource
task
task queue
state data
resource allocation
Prior art date
Legal status
Granted
Application number
CN202010088853.0A
Other languages
Chinese (zh)
Other versions
CN113254192B (en)
Inventor
毕钰
包勇军
崔永雄
张泽华
熊浪涛
Current Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202010088853.0A priority Critical patent/CN113254192B/en
Publication of CN113254192A publication Critical patent/CN113254192A/en
Application granted granted Critical
Publication of CN113254192B publication Critical patent/CN113254192B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a resource allocation method, a resource allocation apparatus, an electronic device, and a computer-readable storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: determining initial resource quotas for a plurality of task queues, so that each task queue runs its tasks with the corresponding initial resource quota; when a first preset condition is met, acquiring resource usage state data of at least one task queue; processing the resource usage state data with the latest reinforcement learning model to obtain resource allocation action data, and adjusting the resource quotas of the plurality of task queues according to the resource allocation action data; when a second preset condition is met, acquiring task running state data of at least one task queue; and determining a reward value according to the task running state data and updating the reinforcement learning model with the reward value. The method and the apparatus enable efficient and reasonable resource allocation for task queues.

Description

Resource allocation method, resource allocation device, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a resource allocation method, a resource allocation apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of the internet, the volume of data it generates is growing explosively, and clustered, distributed big data platforms have emerged in response. When such platforms run many tasks, the queues running those tasks often compete for the limited resources in the system. It is therefore necessary to allocate system resources reasonably.
In existing resource allocation methods, a dedicated scheduler is usually used to manage the allocation of cluster resources. However, existing schedulers require rules to be set by the system or by an operator, so the resource allocation is rigid and cannot be updated flexibly while the task queues are running tasks; as a result, an important task may take a long time to complete, which affects the efficiency of the overall workload.
Therefore, how to adopt an effective and reasonable resource allocation method to enable the task queue to run the task efficiently is a technical problem to be solved urgently in the prior art.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a resource allocation method, a resource allocation apparatus, an electronic device, and a computer-readable storage medium, so as to overcome, at least to a certain extent, the problems of low allocation efficiency and the difficulty of ensuring reasonable allocation in existing resource allocation methods.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided a resource allocation method, including: determining initial resource quotas of a plurality of task queues, and enabling each task queue to adopt a corresponding initial resource quota to run a task; when a first preset condition is met, acquiring resource use state data of at least one task queue; processing the resource use state data by using the latest reinforcement learning model to obtain resource allocation action data, and adjusting the resource quotas of the plurality of task queues by using the resource allocation action data; when a second preset condition is met, acquiring task running state data of at least one task queue; and determining a reward value according to the task running state data, and updating the reinforcement learning model according to the reward value.
In an exemplary embodiment of the present disclosure, the determining an initial resource quota for a plurality of task queues includes: and determining the initial resource quota of each task queue according to the importance level of each task queue.
In an exemplary embodiment of the present disclosure, the determining a reward value according to the task running state data includes: and when determining that a task completed in advance or a task failed to run exists according to the task running state data, calculating the reward value based on the importance level of the task queue to which the task belongs.
In an exemplary embodiment of the present disclosure, the first preset condition includes any one or more of: reaching a first predetermined cycle time; the resource utilization rate of any task queue exceeds a first preset threshold value; and adding a new task in any task queue.
In an exemplary embodiment of the disclosure, the obtaining resource usage status data of at least one of the task queues includes: acquiring resource use state data of each task queue; the processing the resource usage state data by using the latest reinforcement learning model to obtain resource allocation action data comprises the following steps: taking the resource use state data of each task queue as a row, and converting the resource use state data of the plurality of task queues into a resource use state matrix; inputting the resource usage state matrix into the latest reinforcement learning model, and outputting corresponding resource allocation action data.
In an exemplary embodiment of the present disclosure, the second preset condition includes any one or more of: reaching a second predetermined cycle time; adjusting the resource quota m times, wherein m is a first preset number of times; and the resource utilization rate of any task queue exceeding a second preset threshold during n consecutive resource quota adjustments, wherein n is a second preset number of times.
In an exemplary embodiment of the disclosure, the updating the reinforcement learning model by the reward value includes: and updating the value function of the reinforcement learning model by adopting a Bellman equation based on the reward value, the current resource use state data, the resource use state data when the resource quota is adjusted last time and the resource allocation action data when the resource quota is adjusted last time.
In an exemplary embodiment of the present disclosure, the value function includes a neural network, and the parameters of the neural network are updated when the value function of the reinforcement learning model is updated.
In an exemplary embodiment of the present disclosure, the resource allocation action data includes: increasing a preset resource quota for one task queue and reducing a preset resource quota for the other task queue; or keeping the current resource quota of each task queue.
According to an aspect of the present disclosure, there is provided a resource allocation apparatus, including: the resource determining module is used for determining initial resource quotas of a plurality of task queues and enabling each task queue to adopt the corresponding initial resource quotas to run tasks; the first data acquisition module is used for acquiring resource use state data of at least one task queue when a first preset condition is met; the resource adjusting module is used for processing the resource using state data by utilizing the latest reinforcement learning model to obtain resource allocation action data and adjusting the resource quotas of the plurality of task queues by adopting the resource allocation action data; the second data acquisition module is used for acquiring task running state data of at least one task queue when a second preset condition is met; and the model updating module is used for determining an incentive value according to the task running state data and updating the reinforcement learning model according to the incentive value.
In an exemplary embodiment of the disclosure, the resource determining module includes a resource quota determining unit, configured to determine an initial resource quota for each of the task queues according to the importance level of each of the task queues.
In an exemplary embodiment of the present disclosure, the model update module includes: and the reward value calculation unit is used for calculating the reward value based on the importance level of the task queue to which the task belongs when determining that a task completed in advance or a task failed to run exists according to the task running state data.
In an exemplary embodiment of the present disclosure, the first preset condition includes any one or more of: reaching a first predetermined cycle time; the resource utilization rate of any task queue exceeds a first preset threshold value; and adding a new task in any task queue.
In an exemplary embodiment of the present disclosure, the first data acquisition module includes: the state data acquisition unit is used for acquiring the resource use state data of each task queue; the resource adjusting module comprises: the matrix conversion unit is used for converting the resource use state data of the plurality of task queues into a resource use state matrix by taking the resource use state data of each task queue as a row; and the action data output unit is used for inputting the resource use state matrix into the latest reinforcement learning model, outputting corresponding resource allocation action data, and adjusting the resource quotas of the plurality of task queues by adopting the resource allocation action data.
In an exemplary embodiment of the present disclosure, the second preset condition includes any one or more of: reaching a second predetermined cycle time; adjusting the resource quota m times, wherein m is a first preset number of times; and the resource utilization rate of any task queue exceeding a second preset threshold during n consecutive resource quota adjustments, wherein n is a second preset number of times.
In an exemplary embodiment of the present disclosure, the model update module includes: a value function updating unit, configured to determine a reward value according to the task running state data and to update the value function of the reinforcement learning model by using the Bellman equation, based on the reward value, the current resource usage state data, the resource usage state data at the previous resource quota adjustment, and the resource allocation action data at the previous resource quota adjustment.
In an exemplary embodiment of the present disclosure, the value function includes a neural network, and the parameters of the neural network are updated when the value function of the reinforcement learning model is updated.
In an exemplary embodiment of the present disclosure, the resource allocation action data includes: increasing a preset resource quota for one task queue and reducing a preset resource quota for the other task queue; or keeping the current resource quota of each task queue.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure have the following advantageous effects:
the initial resource quotas of multiple task queues are determined and each task queue runs its tasks with the corresponding initial resource quota; when a first preset condition is met, the resource usage state data of at least one task queue are acquired and processed by the latest reinforcement learning model to obtain resource allocation action data, and the resource quotas of the task queues are adjusted accordingly; when a second preset condition is met, the task running state data of at least one task queue are acquired, a reward value is determined from them, and the reinforcement learning model is updated with the reward value. On one hand, allocating resources to the task queues through the reinforcement learning model makes the allocation process simple and efficient and the allocation result more accurate, compared with the manual allocation of queue resources in the prior art or the related technology. On the other hand, in this exemplary embodiment, the first preset condition determines when to acquire the resource usage state data of the task queues, determine the resource allocation action data, and perform the allocation, while the second preset condition ensures that the model is updated in time, thereby optimizing the application of the model to queue resource allocation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically shows a flowchart of a resource allocation method in the present exemplary embodiment;
FIG. 2 schematically illustrates an interaction diagram of an agent with an environment in reinforcement learning;
FIG. 3 schematically illustrates a sub-flow diagram of a method of resource allocation in the present exemplary embodiment;
FIG. 4 is a diagram schematically illustrating a resource allocation process of a task queue in the exemplary embodiment;
FIG. 5 is a flow chart schematically illustrating another resource allocation method in the present exemplary embodiment;
FIG. 6 is a flow chart that schematically illustrates a reinforcement learning model update process in the present exemplary embodiment;
fig. 7 is a block diagram schematically showing the structure of a resource allocation apparatus in the present exemplary embodiment;
fig. 8 schematically illustrates an electronic device for implementing the above method in the present exemplary embodiment;
fig. 9 schematically illustrates a computer-readable storage medium for implementing the above-described method in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the related art, two cluster resource management techniques are commonly used: the Yarn (Yet Another Resource Negotiator) scheduler and the Mesos scheduler.
Yarn provides three schedulers, namely the FIFO (First In First Out) scheduler, the Capacity scheduler, and the Fair scheduler. Their resource allocation methods are as follows:
The FIFO scheduler arranges tasks into a single first-in-first-out queue in the order they are submitted. When allocating resources, it first serves the task at the head of the queue; once that task's requirements are met, it allocates resources to the next task, and so on.
With the Capacity scheduler, resources are allocated to the queues manually before any task runs; for example, with two queues, queue A may be set to use 60% of the resources and queue B 40%, and the quotas do not change afterwards. The Capacity scheduler also has an elastic allocation mechanism: free resources can be allocated to any queue, and when multiple queues contend, resources are rebalanced proportionally.
The Fair scheduler is a dynamic resource scheduler that allocates resources fairly across all queues. For example, when there is only one queue A, it may occupy all resources; when a queue B is added so that there are two queues, the Fair scheduler gradually reclaims resources from queue A until A and B each occupy half of the resources.
However, with the FIFO scheduler, cluster resources may be occupied by large tasks for a long time, causing small tasks to wait excessively; or the resources may be occupied by many small tasks, so that large tasks cannot obtain sufficient resources and starve. With the Capacity scheduler, the queue reserved for small tasks occupies a portion of the cluster resources in advance, so large tasks finish later than they would under the FIFO scheduler. The Fair scheduler treats all tasks identically, so it cannot be applied to scenarios where tasks differ in importance, which limits its applicability.
The Mesos scheduler uses the DRF (Dominant Resource Fairness) algorithm. Its idea is that, with multiple resource types, allocation should be determined by each user's dominant share, i.e., the largest share the user holds among all resource types already allocated to it. With a single resource type, the algorithm degenerates into max-min fairness: resources are first divided evenly or by weight, then users are served in increasing order of demand; a user whose demand is smaller than its share keeps only what it needs, and the surplus is redistributed in the same way among the remaining users. However, the Mesos scheduler is statically configured and cannot modify the allocation dynamically according to how tasks actually run, so it cannot allocate resources flexibly in the application scenario.
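As an illustration of the single-resource max-min fair allocation described above (the degenerate case of DRF), the following sketch assumes demands and capacity are expressed in the same arbitrary units; the function name and the example figures are illustrative assumptions, not taken from any cited scheduler.

```python
# Illustrative sketch of single-resource max-min fairness; demands and
# capacity below are assumed example values.
def max_min_fair(demands, capacity):
    """Allocate `capacity` so nobody gets more than it demands, and the
    surplus from small demands is redistributed among the remaining users."""
    allocation = {user: 0.0 for user in demands}
    remaining_users = dict(demands)
    remaining_capacity = float(capacity)
    while remaining_users and remaining_capacity > 1e-9:
        share = remaining_capacity / len(remaining_users)
        satisfied = {u: d for u, d in remaining_users.items() if d <= share}
        if not satisfied:
            # Everyone still wants more than an equal share: split evenly.
            for u in remaining_users:
                allocation[u] += share
            break
        for u, d in satisfied.items():
            allocation[u] += d
            remaining_capacity -= d
            del remaining_users[u]
    return allocation

print(max_min_fair({"u1": 2, "u2": 6, "u3": 8}, 12))  # {'u1': 2, 'u2': 5.0, 'u3': 5.0}
```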
Based on this, the exemplary embodiments of the present disclosure first provide a resource allocation method. Resources may include, but are not limited to, computing resources, storage resources, network resources, and the like. Computing resources may involve various devices, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an NPU (Neural-network Processing Unit), a TPU (Tensor Processing Unit), and so on; storage resources may include memory, hard disks, and the like; network resources may take the form of bandwidth or periodically supplemented traffic. In general, a task requires several kinds of resources to run, and the present exemplary embodiment may allocate one or more of them. An application scenario of the method of this embodiment may be as follows: in a platform or system, reasonable resources are automatically allocated to the applications added to the queues, so that each queue can run its tasks efficiently and accurately.
The exemplary embodiment is further described with reference to fig. 1, and as shown in fig. 1, the resource allocation method may include the following steps S110 to S140:
step S110, determining initial resource quotas of the plurality of task queues, so that each task queue runs the task by using the corresponding initial resource quotas.
The initial resource quota is the amount of resources originally allocated to a task queue. It may be set manually, or the system may assign a default initial value to each task queue; the present disclosure does not specifically limit its size. In this exemplary embodiment, centralized scheduling may be adopted, with one processor acting as a dedicated scheduler: all submitted tasks first reach this central scheduler and are then run under a certain initial resource quota.
In an exemplary embodiment, the determining the initial resource quotas of the plurality of task queues may include:
and determining the initial resource quota of each task queue according to the importance level of each task queue.
In practice, tasks differ in importance. If important and unimportant tasks were placed in the same task queue, or if every task queue were allocated the same resources to run its tasks, not only would more time be spent, but important tasks would also be hard to run quickly and accurately because the allocation lacks pertinence. Therefore, in the present exemplary embodiment, the task queues may first be divided into several types, such as important, more important, and common task queues, and different initial resource quotas are then allocated according to queue type; for example, according to the importance degree (i.e., importance level) of each task queue, the important queue may be allocated a 50% initial resource quota, the more important queue 30%, the common queue 20%, and so on. A minimal sketch of this proportional split follows.
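The sketch below (in Python) splits the total resource in proportion to a numeric importance weight per queue; the queue names and weights are illustrative assumptions only, not values fixed by this disclosure.

```python
# Illustrative only: queue names and importance weights are assumptions.
def initial_quotas(importance_levels):
    """Split 100% of a resource among queues in proportion to importance."""
    total = sum(importance_levels.values())
    return {name: level / total for name, level in importance_levels.items()}

# Example: an "important", a "more important", and a "common" queue.
quotas = initial_quotas({"A": 5, "B": 3, "C": 2})
print(quotas)  # {'A': 0.5, 'B': 0.3, 'C': 0.2}
```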
Step S120, when a first preset condition is satisfied, obtaining resource usage status data of at least one task queue.
The resource usage state data of a task queue are data that reflect how the queue currently uses resources, such as memory usage rate, Central Processing Unit (CPU) occupancy rate, Input Output (IO) usage rate, number of tasks, total waiting time, total response time, and currently occupied quota. In the present exemplary embodiment, a running task may be in multiple states, i.e., the resource usage state data of a task queue may have multiple dimensions. Therefore, for convenience of subsequent calculation and processing, a vector of resource usage state data may be established for the task queues, for example by generating a corresponding vector with a word embedding method for at least one task queue. Each dimension of the vector represents one aspect of the queue's current resource usage, such as memory usage rate, CPU occupancy rate, IO usage rate, number of tasks, total waiting time, total response time, and currently occupied quota.
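As an illustration, a per-queue usage-state record and its flattened vector might look as follows; the field names, their ordering, and the example values are assumptions made for this sketch, not fixed by the disclosure.

```python
# Hypothetical structure for one queue's resource usage state.
from dataclasses import dataclass, astuple

@dataclass
class QueueState:
    memory_usage: float        # fraction of memory used
    cpu_usage: float           # CPU occupancy rate
    io_usage: float            # IO usage rate
    num_tasks: int             # number of tasks in the queue
    total_wait_time: float     # total waiting time (seconds)
    total_response_time: float # total response time (seconds)
    current_quota: float       # fraction of cluster resources currently held

def to_vector(state: QueueState) -> list:
    """Flatten the per-queue usage state into a fixed-length vector."""
    return [float(x) for x in astuple(state)]

vec = to_vector(QueueState(0.7, 0.55, 0.2, 12, 340.0, 510.0, 0.5))
print(vec)
```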
In practical application, in order to be better applied to various scenes and flexibly adjust and allocate resources of various task queues under different conditions, the exemplary embodiment may further set a mechanism of a first preset condition, and when the first preset condition is met, the method triggers obtaining of resource usage state data of the task queues to perform subsequent steps of resource allocation on the various task queues. Specifically, in an exemplary embodiment, the first preset condition may include any one or more of the following:
(1) reaching a first predetermined cycle time;
(2) the resource utilization rate of any task queue exceeds a first preset threshold;
(3) and adding a new task in any task queue.
When one or more of the above first preset conditions are met, the resource usage state data may be acquired so that the subsequent steps can be carried out. Specifically, since the running situation of each task queue may change after its tasks have run for a period of time, the first preset condition (1) allows the resource quotas of the task queues to be adjusted periodically; for example, if the first predetermined cycle time is set to 5 seconds, the resource usage state data of the task queues are acquired every 5 seconds. During task running, if the resource utilization rate of some task queue exceeds a certain level, the running of that queue is also affected; a first preset threshold may therefore be set on the resource utilization rate, and when the utilization rate of a task queue exceeds this threshold, the resources of the task queues may be reallocated, which is the first preset condition (2). The first preset condition (3) covers the case where a new task is added: to better rebalance resources among the task queues, resources are reallocated when a new task arrives, so that each task queue can run its tasks reasonably and efficiently.
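A minimal sketch of such a trigger check is shown below; the 5-second period, the threshold value, and the bookkeeping arguments are assumed examples rather than values prescribed here.

```python
import time

FIRST_PERIOD = 5.0     # seconds, assumed "first predetermined cycle time"
FIRST_THRESHOLD = 0.9  # assumed "first preset threshold" on utilization

def first_condition_met(last_trigger_ts, utilizations, new_task_added):
    """Return True if any of the three first preset conditions holds.

    utilizations: dict mapping queue name -> current resource utilization.
    """
    periodic = time.time() - last_trigger_ts >= FIRST_PERIOD
    overloaded = any(u > FIRST_THRESHOLD for u in utilizations.values())
    return periodic or overloaded or new_task_added
```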
Step S130, processing the resource using state data by using the latest reinforcement learning model to obtain resource allocation action data, and adjusting resource quotas of the plurality of task queues by using the resource allocation action data.
Reinforcement learning is a branch of machine learning that studies how to act in the current environment so as to maximize the expected return, i.e., how, under reward or punishment stimuli from the environment, an agent gradually forms expectations about those stimuli and develops habitual behaviors that maximize its benefit. Unlike supervised and unsupervised learning, reinforcement learning is based on the interaction between an agent and its environment: the environment does not provide labels or data to the agent, but gives it stimuli that change its behavior.
There are 4 elements in a reinforcement learning model: (1) the State, which may be denoted by State or S; (2) the Action, which may be denoted by Action or A; an action can bring benefit to the agent and move it to a new state; (3) the immediate Reward, which may be denoted by Reward or R and describes the positive or negative feedback from the environment after an action is performed; (4) the Policy, which may be denoted by Policy or π and describes the agent's mapping from states to actions. The training of a reinforcement learning model is a process in which the agent interacts with the environment in real time and affects the environment through its actions; the goal is to find an optimal policy so that the agent obtains as much reward from the environment as possible.
Fig. 2 shows a schematic diagram of the interaction between an agent and the environment in reinforcement learning. The agent 210 is the individual making decisions, for example a scheduler that allocates resources or another terminal device, and the environment 220 is the scenario that the decision-making individual is in, for example the scenario in which a resource allocation scheduler allocates resources to multiple queues. The training of the reinforcement learning model may proceed as follows: at each time step, the agent 210 determines the next resource allocation action (Action) according to its observation (Observation) of the current resource usage state; the agent 210 keeps interacting with the environment 220, and its behavior policy is updated based on the feedback (Reward) from the environment 220 to the agent 210, until the agent 210 obtains the optimal feedback from the environment 220.
The latest reinforcement learning model is the most recently updated, i.e., most recently trained, model; it may be the model updated the last time a resource allocation operation was performed, or a model updated some time earlier. In this exemplary embodiment, the resource allocation action data are data that describe the resource allocation action, i.e., essentially how to allocate resources to each task queue, or how much resource to allocate to a task queue. For example, for task queues A, B, C with initial resource quotas of 50%, 30%, and 20% respectively, the resource allocation action data may be 40%, 35%, and 25%, and the resource quotas of the task queues are then adjusted according to these data; the action data may also describe adding resources to, subtracting resources from, or leaving unchanged the resources of a task queue.
In an exemplary embodiment, the resource allocation action data may include:
increasing a preset resource quota for one task queue and reducing the preset resource quota for the other task queue; or
And keeping the current resource quota of each task queue.
For example, two queues are chosen from A, B, C; a fixed quota is added to one of them and the same quota is subtracted from the other, where the fixed quota is the preset resource quota of this exemplary embodiment. For example, if queues A and B are selected and a 2% quota is added to (or subtracted from) queue A, then queue B correspondingly loses (or gains) a 2% quota relative to its previous value. The resource allocation action may also choose not to change the current resource quotas. On this basis, the resource allocation action data may cover 7 actions: A increases and B decreases, A decreases and B increases, A increases and C decreases, A decreases and C increases, B increases and C decreases, B decreases and C increases, and no change.
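The 7-action space above could be encoded as follows, assuming a fixed 2% step and the three queues A, B, C; the dictionary-based encoding is an illustrative choice, not a required representation.

```python
# Assumed 2% step and queues A, B, C, following the example in the text.
STEP = 0.02
ACTIONS = [
    {"A": +STEP, "B": -STEP},  # A increases, B decreases
    {"A": -STEP, "B": +STEP},  # A decreases, B increases
    {"A": +STEP, "C": -STEP},  # A increases, C decreases
    {"A": -STEP, "C": +STEP},  # A decreases, C increases
    {"B": +STEP, "C": -STEP},  # B increases, C decreases
    {"B": -STEP, "C": +STEP},  # B decreases, C increases
    {},                        # keep current quotas unchanged
]

def apply_action(quotas: dict, action: dict) -> dict:
    """Apply one resource allocation action to the current quotas."""
    new_quotas = dict(quotas)
    for queue, delta in action.items():
        new_quotas[queue] = round(new_quotas[queue] + delta, 4)
    return new_quotas

print(apply_action({"A": 0.60, "B": 0.20, "C": 0.20}, ACTIONS[2]))
```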
Step S140, when a second preset condition is satisfied, acquiring task running state data of at least one task queue.
The task running state data may include, for example, the time spent running tasks in a task queue, whether tasks finished ahead of schedule, and whether the queue contains tasks that failed to run. In the present exemplary embodiment, updating the reinforcement learning model and allocating resources to the task queues may be two parallel processes, i.e., allocating queue resources does not interfere with updating the model; each time queue resources are allocated, the most recently updated reinforcement learning model is used to process the resource usage state data and obtain the resource allocation action data. The second preset condition determines whether a model update is triggered.
Specifically, in an exemplary embodiment, the second preset condition includes any one or more of the following:
(1) reaching a second predetermined cycle time;
(2) adjusting the resource quota m times, wherein m is a first preset number of times;
(3) the resource utilization rate of any task queue exceeding a second preset threshold during n consecutive resource quota adjustments, wherein n is a second preset number of times.
When one or more of the above second preset conditions are met, task running state data may be acquired so that the subsequent model update can be carried out. In this exemplary embodiment, the second preset condition may be set from three aspects. First, in terms of time, the reinforcement learning model is updated periodically: when the second predetermined cycle time is reached, acquisition of the task running state data of at least one task queue is triggered. Second, in terms of the number of quota adjustments: when the resource quota has been adjusted more than a certain number of times, acquisition of the task running state data of at least one task queue is triggered; for example, resources may be allocated dynamically 500 times, and the data generated during those 500 allocations are then used as samples to update the reinforcement learning model. Third, in terms of resource utilization: when the resource utilization rate of a task queue exceeds the second preset threshold during n consecutive quota adjustments, the currently used reinforcement learning model is probably no longer well suited to allocating the current resource quotas, and acquisition of task running state data of at least one task queue may be triggered to update the model.
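A sketch of this second trigger is given below; the period, m, n, and the threshold are assumed example values rather than values prescribed by the disclosure.

```python
import time

SECOND_PERIOD = 60.0     # seconds, assumed "second predetermined cycle time"
M_ADJUSTMENTS = 500      # assumed first preset number of quota adjustments
N_CONSECUTIVE = 10       # assumed second preset number of consecutive adjustments
SECOND_THRESHOLD = 0.95  # assumed second preset threshold on utilization

def second_condition_met(last_update_ts, adjust_count, recent_utilizations):
    """recent_utilizations maps each queue to its utilization history over
    the most recent quota adjustments (newest last)."""
    periodic = time.time() - last_update_ts >= SECOND_PERIOD
    enough_adjustments = adjust_count > 0 and adjust_count % M_ADJUSTMENTS == 0
    persistent_overload = any(
        len(history) >= N_CONSECUTIVE
        and all(u > SECOND_THRESHOLD for u in history[-N_CONSECUTIVE:])
        for history in recent_utilizations.values()
    )
    return periodic or enough_adjustments or persistent_overload
```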
And S150, determining a reward value according to the task running state data, and updating the reinforcement learning model according to the reward value.
The basic idea of reinforcement learning is to update the behavior policy by continuously interacting with the environment and receiving feedback, so as to determine the best response to the environment. Once all states and actions in the environment have been explored, the best action in each state, i.e., the best decision, can be found. In this exemplary embodiment, the reward value is determined according to the acquired task running state, and the reinforcement learning model is updated according to the reward value; for example, during training, when a statistics window is reached, a higher reward value is given if a task in a task queue completes successfully and its running time is shorter than expected, and a lower reward value is given if the task fails. When updating the reinforcement learning model, the reward value, the state matrix, and the historical allocation action data may all be used.
Specifically, in an exemplary embodiment, the updating the reinforcement learning model by the reward value in step S150 may include:
and updating the value function of the reinforcement learning model by adopting a Bellman equation based on the reward value, the current resource use state data, the resource use state data when the resource quota is adjusted last time and the resource allocation action data when the resource quota is adjusted last time.
In general, the value function reflects the value of taking a given action in the current state. The present exemplary embodiment may update the value function of the reinforcement learning model with the Bellman equation. The value function Q(s, a) can be obtained from the following formula:

Q(s, a) = r + γ · max_{a'} Q(s', a')

That is, the value of the current resource usage state data and action is obtained iteratively as the current reward value r plus the discounted value of the next resource usage state data, where γ is the discount factor. Here s denotes the current resource usage state data of the task queues, a denotes the resource allocation action data of the current task queues, r denotes the current reward value, s' denotes the new resource usage state data reached from s after the resource allocation action a, and a' denotes the action that maximizes the value function in state s'.
In an exemplary embodiment, as shown in fig. 3, the obtaining resource usage status data of at least one task queue in step S120 may include the following steps:
step S310, acquiring resource use state data of each task queue;
further, step S130 may include:
step S320, taking the resource using state data of each task queue as a row, converting the resource using state data of a plurality of task queues into a resource using state matrix;
and step S330, inputting the resource use state matrix into the latest reinforcement learning model, outputting corresponding resource allocation action data, and adjusting the resource quotas of the plurality of task queues by adopting the resource allocation action data.
In order to adjust the resources of all the task queues reasonably and improve the balance of the overall task operation, the present exemplary embodiment may acquire the resource usage state data of every task queue. For example, for the three task queues A, B, C, the memory usage rate, CPU occupancy rate, IO usage rate, number of tasks, total waiting time, total response time, and currently occupied quota of each queue may be acquired; the resource usage state data of each task queue are used as one row to build the corresponding vector, a resource usage state matrix is then generated from the vectors of all the task queues, the matrix is used as the input of the reinforcement learning model, and the resource allocation action data corresponding to the resource usage state matrix are output. Each row vector in the resource usage state matrix reflects the resource usage state of the corresponding task queue. In this exemplary embodiment, the number of columns of the resource usage state matrix, i.e., the dimension of the vector corresponding to each task queue, may be adjusted as needed for the specific situation.
Fig. 4 schematically illustrates the resource allocation process for the three task queues A, B, C: task queue A 410, task queue B 411, and task queue C 412 have initial resource quotas of 60%, 20%, and 20%, respectively, and t1 and t2 denote tasks to be run in each task queue (t1 and t2 are merely exemplary; the queues are not limited to two tasks). First, the vectors V1, V2, V3 corresponding to the resource usage state data of task queues A, B, C are determined, each with components x1, x2, x3 representing data in multiple dimensions such as memory usage rate, CPU occupancy rate, IO usage rate, number of tasks, total waiting time, total response time, and currently occupied quota; it should be noted that x1, x2, x3 are only schematic and the vector dimension is not limited to 3. Based on these data, V1, V2, V3 are used as the row vectors of the matrix to determine the resource usage state matrix of task queues A, B, C. After the resource usage state matrix is input into the reinforcement learning model 420, the resource allocation action data are obtained, and finally the resource quota of each task queue is updated according to the resource allocation action data. As shown, the quota of task queue A 430 increases by 2% to 62%, the quota of task queue B 431 remains 20%, and the quota of task queue C 432 decreases by 2% to 18%.
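The matrix construction and model query of steps S320/S330 could look as follows; the `predict` interface on the model is an assumed placeholder for whatever inference call the deployed model exposes, not an API defined by this disclosure.

```python
import numpy as np

def build_state_matrix(queue_vectors):
    """Each queue's usage-state vector becomes one row of the matrix."""
    return np.stack([np.asarray(v, dtype=np.float32) for v in queue_vectors])

def select_allocation_action(model, queue_vectors):
    """Query the latest model for the index of the allocation action to apply.

    `model` is assumed to expose predict(state_matrix) -> one score per action.
    """
    state_matrix = build_state_matrix(queue_vectors)
    scores = model.predict(state_matrix)
    return int(np.argmax(scores))
```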
Fig. 5 schematically shows a flowchart of another resource allocation method in this exemplary embodiment, which may specifically include the following steps:
step S510, determining initial resource quotas of a plurality of task queues, and enabling each task queue to adopt the corresponding initial resource quotas to run tasks;
step S520, judging whether a first preset condition is met;
if the first preset condition is met, executing step S530 to obtain resource use state data of at least one task queue;
step S540 is executed, the latest reinforcement learning model is utilized to process the resource use state data, and resource allocation action data is obtained;
step S550, adjusting resource quotas of the plurality of task queues by adopting the resource allocation action data;
step S560, judging whether a second preset condition is met;
if the second preset condition is met, executing step S570 to obtain task running state data of at least one task queue;
and step S580 is executed to determine a reward value according to the task running state data, and update the reinforcement learning model according to the reward value.
After step S520, if the first preset condition is not satisfied, step S530 and the subsequent resource quota adjustment steps are not executed, and step S520 continues to check whether the current state satisfies the first preset condition. After step S550, the flow may return to step S520 and continue checking the first preset condition in order to carry out the resource adjustment process. After step S560, if the second preset condition is not satisfied, the step of updating the reinforcement learning model is not executed, and step S560 continues to check whether the current state satisfies the second preset condition, i.e., whether to update the reinforcement learning model. After step S580, the updated reinforcement learning model may be applied as the latest model in step S540.
Fig. 6 schematically illustrates a flowchart of updating the reinforcement learning model in the resource allocation method in the present exemplary embodiment, which may specifically include the following steps:
step S610, acquiring resource use state data of each task queue;
step S620, taking the resource using state data of each task queue as a row, converting the resource using state data of a plurality of task queues into a resource using state matrix;
step S630, inputting the resource usage state matrix into the latest reinforcement learning model, and outputting corresponding resource allocation action data;
step S640, when a second preset condition is met, task running state data of at least one task queue is obtained, and a reward value is determined according to the task running state data;
step S650, updating the value function of the reinforcement learning model by adopting a Bellman equation based on the reward value, the current resource usage state data, the resource usage state data when the resource quota is adjusted last time and the resource allocation action data when the resource quota is adjusted last time.
In the present exemplary embodiment, the data produced during the historical resource adjustment process, such as the resource usage state matrix, the resource allocation action data, and the reward value, may be combined and added to a memory; when a second preset condition is met, for example after 500 resource adjustments have been performed, these historical data may be extracted from the memory as sample data to train the reinforcement learning model.
Based on the above description, in the present exemplary embodiment, the initial resource quotas of multiple task queues are determined and each task queue runs its tasks with the corresponding initial resource quota; when a first preset condition is met, the resource usage state data of at least one task queue are acquired and processed by the latest reinforcement learning model to obtain resource allocation action data, and the resource quotas of the task queues are adjusted accordingly; when a second preset condition is met, the task running state data of at least one task queue are acquired, a reward value is determined from them, and the reinforcement learning model is updated with the reward value. On one hand, allocating resources to the task queues through the reinforcement learning model makes the allocation process simple and efficient and the allocation result more accurate, compared with the manual allocation of queue resources in the prior art or the related technology. On the other hand, in this exemplary embodiment, the first preset condition determines when to acquire the resource usage state data of the task queues, determine the resource allocation action data, and perform the allocation, while the second preset condition ensures that the model is updated in time, thereby optimizing the application of the model to queue resource allocation.
In an exemplary embodiment, the determining of the reward value according to the task running state in step S150 may include:
and when determining that a task completed in advance or a task failed to run exists according to the task running state data, calculating the reward value based on the importance level of the task queue to which the task belongs.
During the training process of the reinforcement learning model, a value function may be established; the value function may take as inputs the resource usage state data of the queue tasks and the resource allocation action data. All possible resource allocation action data are scored by the value function to determine which resource allocation action data may yield better results. As the number of iterations increases, the scores of the resource allocation action data in different states become stable, at which point the better resource allocation action data can be determined. After the resource allocation action data are determined, when it is determined from the task running state that a task has completed ahead of schedule or a task has failed to run, the reward value is calculated based on the importance level of the task queue to which the task belongs. For example, if a task in a task queue completes successfully and its running time is shorter than expected, the latest value function is set to obtain 100 multiplied by the importance level coefficient of the task queue (e.g., the importance coefficients of task queues A, B, C may be set to 1 for A-queue tasks, 0.6 for B-queue tasks, and 0.3 for C-queue tasks); and when a task fails, the latest value function is set to obtain -100 multiplied by the importance level coefficient of the task. The importance level coefficient of a task queue can be customized as needed, which the present disclosure does not specifically limit.
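A direct transcription of this reward rule, using the example coefficients from the text, might read as follows; the function signature is an assumption for illustration.

```python
# Importance coefficients follow the example in the text (A: 1, B: 0.6, C: 0.3).
IMPORTANCE = {"A": 1.0, "B": 0.6, "C": 0.3}

def compute_reward(queue, finished_early, failed):
    """Reward for one task, weighted by its queue's importance level."""
    if finished_early:
        return 100.0 * IMPORTANCE[queue]
    if failed:
        return -100.0 * IMPORTANCE[queue]
    return 0.0

print(compute_reward("B", finished_early=True, failed=False))  # 60.0
```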
In an exemplary embodiment, the value function includes a neural network, and the parameters of the neural network are updated when the value function of the reinforcement learning model is updated.
In the present exemplary embodiment, the value function established according to the reinforcement learning model DQN (Deep Q-Network) may include a neural network, and the value function may be expressed as Q(s, a, θ), where s denotes the resource usage state data of the task queues, a denotes the resource allocation action data of the current task queues, and θ denotes the parameters of the neural network model; θ is held fixed within each round of training iteration of the reinforcement learning model. When s is determined, the current resource allocation action data can be scored according to the value function, and the score determines which resource allocation action gives a better result; for example, the resource allocation action data cover 7 cases, and the 7 actions are scored separately to determine the better resource allocation action data. In the present exemplary embodiment, during training of the reinforcement learning model, the parameters of the neural network may be updated when the value function of the reinforcement learning model is updated.
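A minimal Q-network sketch for Q(s, a, θ), written with PyTorch as one possible implementation, is shown below; the layer sizes and the flattening of the state matrix are assumptions of this sketch, not choices mandated by the disclosure.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Scores all 7 candidate allocation actions from the usage state matrix."""

    def __init__(self, num_queues=3, state_dim=7, num_actions=7, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                               # (queues, dims) -> vector
            nn.Linear(num_queues * state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),             # one score per action
        )

    def forward(self, state_matrix):
        # state_matrix: (batch, num_queues, state_dim)
        return self.net(state_matrix)
```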
In addition, the reinforcement learning model of this exemplary embodiment may also use an ε-greedy strategy: according to a number generated by the algorithm, it decides whether to select the resource allocation action data with the highest score or to select one of the candidate resource allocation action data at random. The ε-greedy algorithm lets the strategy act both in the direction already known to be good and in unknown directions where something better may be found. In the early iterations of model training, the strategy is made as exploratory as possible, i.e., every resource allocation action datum is tried as far as possible.
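An ε-greedy selection over the 7 actions could then be sketched as follows, with ε = 0.1 as an assumed value (a larger ε would typically be used early in training to encourage exploration); the QNetwork interface matches the sketch above.

```python
import random
import torch

EPSILON = 0.1  # assumed exploration rate

def choose_action(q_network, state_matrix, num_actions=7, epsilon=EPSILON):
    """Pick a random action with probability epsilon, else the best-scored one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        scores = q_network(state_matrix.unsqueeze(0))  # add batch dimension
    return int(scores.argmax(dim=1).item())
```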
In this exemplary embodiment, the resource usage state matrix of the task queues observed after each execution of the resource allocation action data is denoted s, the resource allocation action data used are denoted a, the resource usage state matrix observed at the next resource allocation is denoted s', and the reward obtained is denoted r. A quadruple (s, s', a, r) is generated from these and stored in the memory. Resources are allocated dynamically 500 times, and 500 samples are randomly drawn from the memory; when the neural network parameters are updated, random sampling is performed with the experience replay technique, and θ in Q(s, a, θ) is updated by stochastic mini-batch gradient descent.
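Putting the pieces together, a replay memory of (s, s', a, r) quadruples and one mini-batch update step might be sketched as below; the buffer capacity, the batch size of 500, the discount factor, and the absence of a separate target network are assumptions of this sketch rather than details fixed by the disclosure.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

class ReplayMemory:
    """Stores (s, s', a, r) quadruples for experience replay."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, s_next, a, r):
        self.buffer.append((s, s_next, a, r))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def update_q_network(q_network, optimizer, memory, batch_size=500, gamma=0.9):
    """One stochastic mini-batch gradient step on the Bellman error."""
    if len(memory.buffer) < batch_size:
        return
    batch = memory.sample(batch_size)
    s = torch.stack([b[0] for b in batch])
    s_next = torch.stack([b[1] for b in batch])
    a = torch.tensor([b[2] for b in batch])
    r = torch.tensor([b[3] for b in batch], dtype=torch.float32)

    q_sa = q_network(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a, theta)
    with torch.no_grad():
        target = r + gamma * q_network(s_next).max(dim=1).values  # Bellman target
    loss = F.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```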
An exemplary embodiment of the present disclosure also provides a resource allocation apparatus. Referring to fig. 7, the apparatus 700 may include: a resource determining module 710, configured to determine initial resource quotas of multiple task queues, so that each task queue runs a task using a corresponding initial resource quota; a first data obtaining module 720, configured to obtain resource usage status data of at least one task queue when a first preset condition is met; the resource adjusting module 730 is configured to process the resource usage state data by using the latest reinforcement learning model to obtain resource allocation action data, and adjust resource quotas of the plurality of task queues by using the resource allocation action data; the second data obtaining module 740 is configured to obtain task running state data of at least one task queue when a second preset condition is met; and the model updating module 750 is used for determining the reward value according to the task running state data and updating the reinforcement learning model according to the reward value.
In an exemplary embodiment, the resource determining module includes a resource quota determining unit, configured to determine an initial resource quota for each task queue according to the importance level of each task queue.
In an exemplary embodiment, the model update module includes: and the reward value calculation unit is used for calculating a reward value based on the importance level of the task queue to which the task belongs when determining that a task completed in advance or a task failed to run exists according to the task running state data.
In an exemplary embodiment, the first preset condition includes any one or more of: reaching a first predetermined cycle time; the resource utilization rate of any task queue exceeds a first preset threshold; and adding a new task in any task queue.
In an exemplary embodiment, the first data acquisition module includes: the state data acquisition unit is used for acquiring the resource use state data of each task queue; the resource adjusting module comprises: the matrix conversion unit is used for converting the resource use state data of the plurality of task queues into a resource use state matrix by taking the resource use state data of each task queue as a row; and the action data output unit is used for inputting the resource use state matrix into the latest reinforcement learning model, outputting corresponding resource allocation action data, and adjusting the resource quotas of the plurality of task queues by adopting the resource allocation action data.
In an exemplary embodiment, the second preset condition includes any one or more of: reaching a second predetermined cycle time; adjusting the resource quota m times, wherein m is a first preset number of times; and the resource utilization rate of any task queue exceeding a second preset threshold during n consecutive resource quota adjustments, wherein n is a second preset number of times.
In an exemplary embodiment, the model update module includes: and the value function updating unit is used for determining a reward value according to the task running state data, and updating the value function of the reinforcement learning model by adopting a Bellman equation based on the reward value, the current resource use state data, the resource use state data when the resource quota is adjusted last time and the resource allocation action data when the resource quota is adjusted last time.
In an exemplary embodiment, the value function includes a neural network, and the parameters of the neural network are updated when the value function of the reinforcement learning model is updated.
In an exemplary embodiment, the resource allocation action data includes: increasing a preset resource quota for one task queue and reducing the preset resource quota for the other task queue; or maintain the current resource quotas of each task queue.
The specific details of each module/unit in the above apparatus have been described in detail in the corresponding method embodiments; for details not disclosed here, reference may be made to those method embodiments, which are therefore not repeated.
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to such an exemplary embodiment of the present disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 8, the electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810), and a display unit 840.
The storage unit stores program code, which may be executed by the processing unit 810 to cause the processing unit 810 to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may execute steps S110 to S150 shown in fig. 1, or steps S310 to S330 shown in fig. 3.
The storage unit 820 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM) 821 and/or a cache storage unit 822, and may further include a read-only storage unit (ROM) 823.
Storage unit 820 may also include a program/utility 824 having a set (at least one) of program modules 825, such program modules 825 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
Referring to fig. 9, a program product 900 for implementing the above method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit according to an exemplary embodiment of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (12)

1. A method for resource allocation, comprising:
determining initial resource quotas of a plurality of task queues, so that each task queue runs tasks using a corresponding initial resource quota;
when a first preset condition is met, acquiring resource use state data of at least one task queue;
processing the resource use state data by using the latest reinforcement learning model to obtain resource allocation action data, and adjusting the resource quotas of the plurality of task queues by using the resource allocation action data;
when a second preset condition is met, acquiring task running state data of at least one task queue;
and determining a reward value according to the task running state data, and updating the reinforcement learning model according to the reward value.
2. The method of claim 1, wherein determining the initial resource quotas of the plurality of task queues comprises:
determining the initial resource quota of each task queue according to the importance level of each task queue.
3. The method of claim 1, wherein determining a reward value according to the task running state data comprises:
when it is determined according to the task running state data that there is a task completed ahead of schedule or a task that failed to run, calculating the reward value based on the importance level of the task queue to which the task belongs.
4. The method of claim 1, wherein the first preset condition comprises any one or more of:
reaching a first predetermined cycle time;
the resource utilization rate of any task queue exceeds a first preset threshold value;
and adding a new task in any task queue.
5. The method of claim 1, wherein acquiring the resource use state data of at least one task queue comprises:
acquiring resource use state data of each task queue;
and wherein processing the resource use state data by using the latest reinforcement learning model to obtain resource allocation action data comprises:
converting the resource use state data of the plurality of task queues into a resource use state matrix, with the resource use state data of each task queue as one row; and
inputting the resource use state matrix into the latest reinforcement learning model, and outputting the corresponding resource allocation action data.
6. The method according to claim 1, wherein the second preset condition comprises any one or more of:
reaching a second predetermined cycle time;
adjusting the resource quota m times, wherein m is a first preset number;
and the resource utilization rate of any task queue exceeds a second preset threshold during n consecutive resource quota adjustments, wherein n is a second preset number.
7. The method of claim 1, wherein updating the reinforcement learning model according to the reward value comprises:
updating the value function of the reinforcement learning model by using the Bellman equation based on the reward value, the current resource use state data, the resource use state data from the previous resource quota adjustment, and the resource allocation action data from the previous resource quota adjustment.
8. The method of claim 7, wherein the value function comprises a neural network, and wherein parameters of the neural network are updated when the value function of the reinforcement learning model is updated.
9. The method according to any of claims 1 to 8, wherein the resource allocation action data comprises:
increasing a preset resource quota for one task queue and reducing a preset resource quota for another task queue; or
keeping the current resource quota of each task queue.
10. A resource allocation apparatus, comprising:
a resource determining module, configured to determine initial resource quotas of a plurality of task queues, so that each task queue runs tasks using a corresponding initial resource quota;
a first data acquisition module, configured to acquire resource use state data of at least one task queue when a first preset condition is met;
a resource adjusting module, configured to process the resource use state data by using the latest reinforcement learning model to obtain resource allocation action data, and to adjust the resource quotas of the plurality of task queues by using the resource allocation action data;
a second data acquisition module, configured to acquire task running state data of at least one task queue when a second preset condition is met; and
a model updating module, configured to determine a reward value according to the task running state data and to update the reinforcement learning model according to the reward value.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-9 via execution of the executable instructions.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-9.
CN202010088853.0A 2020-02-12 2020-02-12 Resource allocation method, resource allocation device, electronic device and storage medium Active CN113254192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010088853.0A CN113254192B (en) 2020-02-12 2020-02-12 Resource allocation method, resource allocation device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010088853.0A CN113254192B (en) 2020-02-12 2020-02-12 Resource allocation method, resource allocation device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113254192A true CN113254192A (en) 2021-08-13
CN113254192B (en) 2024-04-16

Family

ID=77219793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088853.0A Active CN113254192B (en) 2020-02-12 2020-02-12 Resource allocation method, resource allocation device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113254192B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210073056A1 (en) * 2019-09-11 2021-03-11 Advanced Micro Devices, Inc. Distributed scheduler providing execution pipe balance
CN114461053A (en) * 2021-08-24 2022-05-10 荣耀终端有限公司 Resource scheduling method and related device
CN114565064A (en) * 2022-04-26 2022-05-31 心鉴智控(深圳)科技有限公司 Method, system and equipment for identifying multitask learning deep network
WO2024037368A1 (en) * 2022-08-19 2024-02-22 中兴通讯股份有限公司 Scheduling optimization method of scheduling apparatus, scheduling apparatus and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090228888A1 (en) * 2008-03-10 2009-09-10 Sun Microsystems, Inc. Dynamic scheduling of application tasks in a distributed task based system
CN107145387A * 2017-05-23 2017-09-08 南京大学 A task scheduling method based on deep reinforcement learning in a vehicular network environment
CN108401254A * 2018-02-27 2018-08-14 苏州经贸职业技术学院 A wireless network resource allocation method based on reinforcement learning
CN109656702A * 2018-12-20 2019-04-19 西安电子科技大学 A cross-data-center network task scheduling method based on reinforcement learning
US20190124667A1 * 2017-10-23 2019-04-25 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for allocating transmission resources using reinforcement learning
CN110351571A * 2019-07-05 2019-10-18 清华大学 Live video cloud transcoding resource allocation and scheduling method based on deep reinforcement learning
CN110413396A * 2019-07-30 2019-11-05 广东工业大学 A resource scheduling method, apparatus, device, and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余萌迪; 唐俊华; 李建华: "A Multi-Node MEC Computing Resource Allocation Scheme Based on Reinforcement Learning" (一种基于强化学习的多节点MEC计算资源分配方案), Communications Technology (通信技术), no. 12, 10 December 2019 (2019-12-10) *

Also Published As

Publication number Publication date
CN113254192B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN113254192B (en) Resource allocation method, resource allocation device, electronic device and storage medium
CN109976909B (en) Learning-based low-delay task scheduling method in edge computing network
US7752239B2 (en) Risk-modulated proactive data migration for maximizing utility in storage systems
JP2022137182A (en) Federated learning method, device, equipment and storage medium
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
US20220012089A1 (en) System for computational resource prediction and subsequent workload provisioning
JP2020127182A (en) Control device, control method, and program
CN113946431B (en) Resource scheduling method, system, medium and computing device
CN112667400A (en) Edge cloud resource scheduling method, device and system managed and controlled by edge autonomous center
US20230254214A1 (en) Control apparatus, virtual network assignment method and program
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
CN113543160B (en) 5G slice resource allocation method, device, computing equipment and computer storage medium
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN113485833A (en) Resource prediction method and device
Marcus et al. Workload management for cloud databases via machine learning
US11669442B2 (en) Co-operative memory management system
US11513866B1 (en) Method and system for managing resource utilization based on reinforcement learning
CN114090239A (en) Model-based reinforcement learning edge resource scheduling method and device
Wei et al. RLConfig: Run-time configuration of cluster schedulers via deep reinforcement learning
CN117793805B (en) Dynamic user random access mobile edge computing resource allocation method and system
CN114443258B (en) Resource scheduling method, device, equipment and storage medium for virtual machine
US11765036B2 (en) Control apparatus, control method and program
WO2022137574A1 (en) Control device, virtual network allocation method, and program
US20220237045A1 (en) Method, device, and program product for managing computing system
Mohan et al. A contemporary access of improved BEE colony optimization for scheduling problems in cloud environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant