CN112416578B - Container cloud cluster resource utilization optimization method based on deep reinforcement learning - Google Patents

Container cloud cluster resource utilization optimization method based on deep reinforcement learning

Info

Publication number
CN112416578B
Authority
CN
China
Prior art keywords
node
network model
load
action
depth
Prior art date
Legal status
Active
Application number
CN202011225270.4A
Other languages
Chinese (zh)
Other versions
CN112416578A (en)
Inventor
吴迪
吴灿豪
胡淼
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202011225270.4A priority Critical patent/CN112416578B/en
Publication of CN112416578A publication Critical patent/CN112416578A/en
Application granted granted Critical
Publication of CN112416578B publication Critical patent/CN112416578B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a container cloud cluster resource utilization optimization method based on deep reinforcement learning, which comprises the following steps: preprocessing the original load data and assembling it into an input state s; constructing a deep Q network model, inputting the input state s into the deep Q network model, which with a certain probability randomly selects an action a, or otherwise selects the action a judged optimal by the deep Q network model, and executes one oversell-ratio prediction; evaluating the selected action a through a reward function to obtain a reward r and enter the next state s'; forming the input state s, the action a, the reward r and the next state s' into a quadruple and placing it into a cache as a training sample; when a preset training interval is reached, sampling e training samples from the cache, inputting them into the deep Q network model for training, and updating the parameters of the deep Q network model; and after E rounds of training, applying the deep Q network model with updated parameters to determine the oversell policy.

Description

Container cloud cluster resource utilization optimization method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of cloud computing resource management, in particular to a container cloud cluster resource utilization optimization method based on deep reinforcement learning.
Background
Docker is an open-source application container engine. With the wide application of Docker container technology in application development, testing and release, Google put forward Kubernetes (k8s), a distributed architecture scheme based on Docker container technology, in 2015. K8s provides complete cluster management capabilities, such as multi-level security and admission control, transparent service registration and service discovery mechanisms, and multi-granularity resource management. In addition, k8s provides both a built-in load balancer and an extensible automatic resource scheduling capability; the latter is provided by the built-in scheduler and involves the following steps: (1) node pre-selection: excluding nodes that do not meet the conditions at all, for example nodes failing memory-size or port requirements; (2) node prioritization: selecting the optimal node according to priority; (3) binding: binding the Pod to the optimal node screened in the previous step. One of the screening conditions in (1) is the node's remaining resources: when the remaining resources of a node are smaller than the resource request of the Pod, the node is directly excluded by the k8s scheduler, i.e. no more Pods can be scheduled onto it.
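As a plain illustration of this static, request-based filtering (a minimal sketch under assumed field names, not the actual k8s scheduler code):

```python
def preselect_nodes(nodes, pod_request):
    """Static node pre-selection: keep only nodes whose requested-resource
    headroom (capacity minus the sum of existing Pod requests) can hold the
    new Pod's request. Actual node load is not considered."""
    feasible = []
    for node in nodes:
        cpu_left = node["cpu_capacity"] - node["cpu_requested"]
        mem_left = node["mem_capacity"] - node["mem_requested"]
        if cpu_left >= pod_request["cpu"] and mem_left >= pod_request["memory"]:
            feasible.append(node)
    return feasible
```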
It can be seen that k8s uses static scheduling: Pods are bin-packed according to the resources requested by their containers rather than according to the actual load of the nodes. Although static scheduling is simple and effective, it easily leads to low cluster resource utilization, which is a common problem in the industry.
Disclosure of Invention
The invention provides a container cloud cluster resource utilization optimization method based on deep reinforcement learning for overcoming the defect of low resource utilization rate of clusters in the prior art.
In order to solve the above technical problem, the technical solution of the invention is as follows:
A container cloud cluster resource utilization optimization method based on deep reinforcement learning comprises the following steps:
S1: preprocessing original load data, and assembling the preprocessed original load data into an input state s;
S2: constructing a deep Q network model for determining the oversell policy, inputting the input state s into the deep Q network model, which with a certain probability randomly selects an action a or otherwise selects the action a judged optimal by the deep Q network model, and executes one oversell-ratio prediction;
S3: evaluating the selected action a through a reward function to obtain a reward r and enter the next state s′;
S4: forming the input state s, the action a, the reward r and the next state s′ into a quadruple and placing it into a cache as a training sample;
S5: when a preset training interval is reached, sampling e training samples from the cache, inputting them into the deep Q network model for training, and updating the parameters of the deep Q network model, wherein e is a positive integer;
S6: after E rounds of training, applying the deep Q network model with updated parameters to determine the oversell policy.
Preferably, in step S1, preprocessing the raw load data includes a binning operation, specifically as follows:
let the load update period be T; in the k-th time period, the original load data of node n is l_n^k, and the dimension of l_n^k equals the number of sampling points within one update period T; the number of bins is set to B, and the corresponding boundary values b_i take values in {b_i | 0 ≤ i ≤ B, i ∈ N}; assume there are M clusters, and the m-th cluster has N_m nodes;
the binning operation then maps the original load data l_n^k to a B-dimensional count vector, whose i-th component is
Σ_j I( b_{i-1} ≤ l_n^k[j] / C_{m,n} < b_i )
where C_{m,n} is the actual resource capacity of the n-th node in cluster m, I(·) is the indicator function, and l_n^k[j] is the j-th load sample of the period: whenever a load sample falls within the i-th binning interval, the i-th component of the count vector is incremented by 1.
Preferably, the input state s includes node information C_n, node container information and node historical load; the node information C_n comprises the ID of the cluster where the n-th node is located, the ID of the city where it is located, the actual resource capacity and the current oversell ratio; the node container information represents the number of online-service Pods, the number of offline-service Pods and the total resource request of the Pods on the n-th node in the k-th load update period; the node historical load represents the binned historical load of the node over the last 7 days.
Preferably, in step S2, after the input state s is input into the deep Q network model, the model calculates a Q value for each action a according to the input state s, and judges from the Q value whether the current action a is optimal for the deep Q network model; the Q value is calculated as:
Q(s, a, θ) = r′_current + γ · r′_future
where the Q value represents the immediate reward r′_current obtainable by the deep Q network model in state s when executing action a, plus the predicted value r′_future of future rewards; γ is the discount factor weighing immediate against future rewards; θ denotes the parameters of the deep Q network model.
Preferably, in step S3, the selected action a is evaluated through a reward function, in which the reward r is computed as the reciprocal of the weighted decision cost:
r = 1 / (w_o · o_k + w_u · u_k)
where w_o and w_u are the trade-off factors of the excess loss and the shortage loss, respectively, and w_o + w_u = 1; o_k denotes the excess loss, i.e. the high-load alarm risk a node incurs when its load is higher than L_target; u_k denotes the shortage loss, i.e. the resource waste a node incurs when its load is lower than L_target; L_target is the preset target load level of a node; h_o and h_u are the half-lives of the excess loss and the shortage loss, respectively; and an estimate of the node load after overselling is used in evaluating these losses.
Preferably, regarding the estimate of the oversold node load: for compressible resources, the average resource utilization of cluster m is used to estimate the load state of the oversold node; for incompressible resources, since a process is terminated directly by the operating system when a node cannot provide the memory it requires, the maximum utilization of cluster m must be used to estimate the load state of the oversold node.
Preferably, the deep Q network model includes a target Q network and an online Q network, whose parameters are denoted θ and θ′, respectively;
in step S2, the online Q network is used to select the action a;
in step S5, when a preset training interval is reached, e training samples are sampled from the cache and input into the online Q network for training, and the parameter θ′ is updated;
in step S6, after E rounds of training of the online Q network, the updated parameter θ′ is used to update the parameter θ of the target Q network, and the target Q network is applied to determine the oversell policy.
Preferably, in step S5, the parameter θ′ is updated by applying a gradient descent algorithm to (y − Q(s, a, θ′))², where y is given by:
y = r + γ · max_{a′} Q(s′, a′, θ)
and y represents the current target Q value while a′ represents the action executed in the next state s′.
Preferably, the container cloud cluster resource utilization optimization method further comprises the following step:
S7: deploying the deep Q network model with updated parameters outside the cluster as the oversell policy interface to provide service, and deploying an event interception module inside the cluster, wherein the event interception module uses the admission controller provided by k8s to intercept Pod creation and deletion events and node heartbeat events.
Preferably, the event interception module performs the following steps:
1) the k8s server attempts to store the node's real resource state into the database through a heartbeat event;
2) the node heartbeat event is intercepted, and the node's real resource state is replaced with the oversell calculation result;
3) Pod creation and deletion events in the cluster are intercepted, and the corresponding node is marked as a "dirty node";
4) the dirty node is added to a node pool;
5) an independent thread monitors the node states in a local cache;
6) when no Pod creation or deletion event occurs on a node, the node is marked as an "old node" and added to the node pool;
7) the nodes in the node pool call the oversell policy interface concurrently;
8) for nodes with high-load alarms, the oversell-value policy interface is called.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention adopts a deep Q network model as the oversell policy model for determining the oversell policy; the data input to the deep Q network model is preprocessed, reducing the storage space needed for the original load data and the input dimension of the subsequent reinforcement learning model; and the action output by the deep Q network model is evaluated through the reward function, the model being trained with the goal of reducing both the high-load risk caused by overselling and the waste of resources, thereby effectively improving the resource utilization of the cluster.
Drawings
FIG. 1 is a flow chart of a method for optimizing the resource utilization of a container cloud cluster based on deep reinforcement learning.
Fig. 2 is a flow chart of the design of the event interception module according to the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
This embodiment provides a container cloud cluster resource utilization optimization method based on deep reinforcement learning; FIG. 1 is a flowchart of the method in this embodiment.
The method provided by this embodiment specifically comprises the following steps:
s1: preprocessing the original load data, and assembling the preprocessed original load data into an input state s.
In this embodiment, preprocessing the original load data according to the load update period includes a binning operation, which reduces the space required for storing the original load data and the input dimension of the subsequent reinforcement learning model. The specific steps are as follows:
let the load update period be T; in the k-th time period, the original load data of node n is l_n^k, and the dimension of l_n^k equals the number of sampling points within one update period T; the number of bins is set to B, and the corresponding boundary values b_i take values in {b_i | 0 ≤ i ≤ B, i ∈ N}; assume there are M clusters, and the m-th cluster has N_m nodes.
The binning operation then maps the original load data l_n^k to a B-dimensional count vector, whose i-th component is
Σ_j I( b_{i-1} ≤ l_n^k[j] / C_{m,n} < b_i )
where C_{m,n} is the actual resource capacity of the n-th node in cluster m, I(·) is the indicator function, and l_n^k[j] is the j-th load sample of the period: whenever a load sample falls within the i-th binning interval, the i-th component of the count vector is incremented by 1.
The input state s in this embodiment includes node information C_n, node container information and node historical load. The node information C_n comprises the ID of the cluster where the n-th node is located, the ID of the city where it is located, the actual resource capacity and the current oversell ratio. The node container information represents the number of online-service Pods, the number of offline-service Pods and the total resource request of the Pods on the n-th node in the k-th load update period. The node historical load represents the binned historical load of the node over the last 7 days.
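A minimal sketch of the binning preprocessing and state assembly described above, using numpy; the function names, field ordering and the equal-width bin boundaries are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

BOUNDARIES = np.linspace(0.0, 1.0, 11)  # B = 10 equal-width bins over utilization (illustrative)

def bin_load(raw_load, capacity, boundaries=BOUNDARIES):
    """Histogram the raw load samples l_n^k of one update period T.
    Each sample is normalized by the node capacity C_{m,n}; whenever the
    normalized load falls into the i-th interval [b_{i-1}, b_i), the i-th
    count is incremented by 1, yielding a B-dimensional vector."""
    utilization = np.asarray(raw_load, dtype=np.float64) / capacity
    counts, _ = np.histogram(utilization, bins=boundaries)
    return counts

def assemble_state(node_info, container_info, history_loads, capacity):
    """Concatenate node information C_n (cluster ID, city ID, capacity,
    current oversell ratio), node container information (online Pods,
    offline Pods, total Pod resource requests) and the binned load history
    of the last 7 days into one flat input state s."""
    history = [bin_load(l, capacity) for l in history_loads]
    return np.concatenate([
        np.asarray(node_info, dtype=np.float32),
        np.asarray(container_info, dtype=np.float32),
        np.concatenate(history).astype(np.float32),
    ])
```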
S2: and constructing a deep Q network model for determining the overstock strategy, inputting an input state s into the deep Q network model, randomly selecting an action a by the deep Q network model with a certain probability, or selecting the action a which enables the deep Q network model to be optimal, and executing one-time overstock ratio prediction.
This embodiment uses the deep Q network (DQN) model proposed by the Google DeepMind team, in which a neural network replaces the value function of reinforcement learning. In the reinforcement learning model, the deep Q network model acts as the agent that interacts with the environment: based on its observation of the environment, the agent makes a decision and performs the corresponding action. The deep Q network model comprises a target Q network and an online Q network, whose parameters are denoted θ and θ′, respectively.
In this step, the input state s is fed into the online Q network, which with a certain probability randomly selects an action a, or otherwise selects the action a judged optimal by the deep Q network model, and executes one oversell-ratio prediction. The action space covers all possible oversell ratios; since the oversell ratio is a continuous value, for simplicity it is discretized into B levels, where B is the bin count. The action a selected by the online Q network is therefore the agent's predicted oversell ratio.
In addition, in this embodiment, after the input state s is input into the deep Q network model, the model calculates a Q value for each action a according to the input state s, and judges from the Q value whether the current action a is optimal for the deep Q network model. The Q value is calculated as:
Q(s, a, θ) = r′_current + γ · r′_future
where the Q value represents the immediate reward r′_current obtainable by the deep Q network model in state s when executing action a, plus the predicted value r′_future of future rewards; γ is the discount factor weighing immediate against future rewards; θ denotes the parameters of the deep Q network model.
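A minimal PyTorch sketch of the Q network and the ε-greedy action selection described in this step; the layer sizes, the exploration rate ε and the linear mapping from the discrete action index to an oversell ratio are illustrative assumptions:

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an input state s to one Q value per discrete action,
    each action being one of the B discretized oversell ratios."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_action(online_q: QNetwork, state: torch.Tensor,
                  num_actions: int, epsilon: float = 0.1) -> int:
    """ε-greedy selection: with probability ε pick a random action
    (exploration), otherwise the action with the largest Q value."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(online_q(state.unsqueeze(0)).argmax(dim=1).item())

def action_to_ratio(action: int, num_actions: int, max_ratio: float = 2.0) -> float:
    """Illustrative mapping from the discrete action index to an oversell
    ratio in [1.0, max_ratio]; the patent only states that the continuous
    ratio is discretized into B levels."""
    return 1.0 + (max_ratio - 1.0) * action / (num_actions - 1)
```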
S3: the selected action a is evaluated by the reward function, yielding the reward r and entering the next state s'.
When the agent selects an action a, i.e. executes one oversell-ratio prediction, this action must be evaluated by the reward function, and the state of the oversold node is therefore estimated from the relevant statistical characteristics of the cluster. To this end, this embodiment collects key statistics (a profile) for each cluster, including the average resource utilization, the maximum resource utilization and the minimum resource utilization of cluster m.
In this step, the selected action a is evaluated by the reward function. The current decision cost of the agent is a weighted sum of the excess loss and the shortage loss, and this embodiment uses the reciprocal of the decision cost as the reward:
r = 1 / (w_o · o_k + w_u · u_k)
where w_o and w_u are the trade-off factors of the excess loss and the shortage loss, respectively, and w_o + w_u = 1. The two trade-off factors also differ between resource types; for example, overselling memory carries a greater risk than overselling CPU, so the corresponding excess-loss trade-off factor w_o is set larger for memory.
o_k denotes the excess loss, i.e. the high-load alarm risk a node incurs when its load is higher than L_target; u_k denotes the shortage loss, i.e. the resource waste a node incurs when its load is lower than L_target. L_target is the preset target load level of a node, set according to practical experience; node resource utilization near this level is considered ideal. h_o and h_u are the half-lives of the excess loss and the shortage loss, respectively.
The estimate of the oversold node load used in these losses is computed differently depending on the resource type. For compressible resources, the average resource utilization of cluster m is used to estimate the load state of the oversold node. For incompressible resources, since the operating system terminates a process directly when a node cannot provide the memory it requires, the maximum utilization of cluster m is used instead, which guides the agent to act more conservatively. Taking CPU (a compressible resource) as an example, the estimate of the oversold node load is computed from the cluster's average resource utilization.
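A sketch of the reward evaluation: the reciprocal-of-weighted-cost form follows the text above, while the exponential expressions for o_k and u_k and the load-estimate combination below are assumptions made only for illustration (the patent gives the exact formulas as images not reproduced here):

```python
def estimate_oversold_load(current_load, oversell_ratio, cluster_avg_util,
                           cluster_max_util, compressible=True):
    """Estimate the node load after overselling: compressible resources
    (e.g. CPU) use the cluster's average utilization, incompressible
    resources (e.g. memory) use the cluster's maximum utilization, which
    is more conservative. The additive combination is an assumption."""
    util = cluster_avg_util if compressible else cluster_max_util
    return current_load + oversell_ratio * util

def reward(load_est, l_target, w_o, w_u, h_o, h_u):
    """Reciprocal of the weighted decision cost w_o*o_k + w_u*u_k.
    o_k penalizes load above L_target (alarm risk), u_k penalizes load
    below L_target (wasted resources); both are assumed here to grow
    exponentially with the deviation, governed by half-lives h_o and h_u."""
    if load_est > l_target:
        o_k, u_k = 2.0 ** ((load_est - l_target) / h_o), 0.0
    else:
        o_k, u_k = 0.0, 2.0 ** ((l_target - load_est) / h_u)
    return 1.0 / (w_o * o_k + w_u * u_k + 1e-8)
```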
s4: the input state s, action a, prize r, next state s 'are formed into quadruples (s, a, r, s') and placed in a buffer as training samples.
S5: and when a preset training interval is reached, e training samples are sampled from the buffer memory and input into the online Q network for training, and the parameter theta' of the online Q network is updated.
In this step, the parameter θ′ is updated by applying a gradient descent algorithm to (y − Q(s, a, θ′))², where y is given by:
y = r + γ · max_{a′} Q(s′, a′, θ)
and y represents the current target Q value while a′ represents the action executed in the next state s′.
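A sketch of the experience cache (step S4), one training step on the loss (y − Q(s, a, θ′))² (step S5) and the target-network synchronization performed in step S6 below, assuming the QNetwork sketched earlier; the buffer capacity, batch size e and discount γ are illustrative choices:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Fixed-size cache of (s, a, r, s') quadruples."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, e: int):
        return random.sample(list(self.buffer), e)

    def __len__(self):
        return len(self.buffer)

def train_step(online_q, target_q, buffer, optimizer, e=32, gamma=0.99):
    """Sample e quadruples and apply one gradient-descent update to the
    online parameters θ' on (y - Q(s, a, θ'))^2, where the target
    y = r + γ·max_a' Q(s', a', θ) is computed with the target network."""
    states, actions, rewards, next_states = zip(*buffer.sample(e))
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    next_states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states])
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)

    q_sa = online_q(states).gather(1, actions).squeeze(1)                # Q(s, a, θ')
    with torch.no_grad():
        y = rewards + gamma * target_q(next_states).max(dim=1).values    # target Q value
    loss = F.mse_loss(q_sa, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(online_q, target_q):
    """After E rounds of training, copy the online parameters θ' into θ."""
    target_q.load_state_dict(online_q.state_dict())
```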
S6: and after the online Q network is trained by the E round, assigning the parameter theta' to the parameter theta of the target Q network, updating the target Q network, and applying the target Q network with the updated parameters to determine the overstock strategy.
Further, the target Q network with updated parameters is deployed outside the cluster to serve as the oversell policy interface, and an event interception module is deployed inside the cluster to improve the ability to handle node load changes. The event interception module uses the admission controller provided by k8s to intercept Pod creation and deletion events and node heartbeat events.
Fig. 2 is a schematic flow chart of an event interception module according to the present embodiment.
The event interception module performs the following steps:
1) the k8s server attempts to store the node's real resource state into the database through a heartbeat event;
2) the node heartbeat event is intercepted, and the node's real resource state is replaced with the oversell calculation result (see the sketch after this list);
3) Pod creation and deletion events in the cluster are intercepted, and the corresponding node is marked as a "dirty node";
4) the dirty node is added to a node pool;
5) an independent thread monitors the node states in a local cache;
6) when no Pod creation or deletion event occurs on a node, the node is marked as an "old node" and added to the node pool;
7) the nodes in the node pool call the oversell policy interface concurrently;
8) for nodes with high-load alarms, the oversell-value policy interface is called.
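The heartbeat-interception idea of steps 1)–2) can be illustrated with a mutating admission webhook. The sketch below uses Flask, a hypothetical fetch_oversell_ratio client for the deployed oversell policy interface, and assumes the webhook is registered for node status updates; it is an illustration under these assumptions, not the patent's actual implementation:

```python
import base64
import copy
import json

from flask import Flask, request, jsonify  # illustrative choice of web framework

app = Flask(__name__)

def fetch_oversell_ratio(node_name: str) -> float:
    """Hypothetical client of the deployed oversell policy interface;
    a real implementation would call the DQN service over HTTP."""
    return 1.5  # placeholder: 50% oversell

def oversold_allocatable(node_name: str, reported: dict) -> dict:
    """Scale the node's reported allocatable CPU by the predicted oversell
    ratio (assumes the CPU quantity is expressed in whole cores)."""
    patched = copy.deepcopy(reported)
    patched["cpu"] = str(int(float(reported["cpu"]) * fetch_oversell_ratio(node_name)))
    return patched

@app.route("/mutate-node-status", methods=["POST"])
def mutate_node_status():
    """Admission-review handler for node heartbeat (status) updates:
    replaces status.allocatable with the oversell calculation result
    before it is persisted (steps 1-2 of the event interception module)."""
    review = request.get_json()
    node = review["request"]["object"]
    patch = [{
        "op": "replace",
        "path": "/status/allocatable",
        "value": oversold_allocatable(node["metadata"]["name"],
                                      node["status"]["allocatable"]),
    }]
    return jsonify({
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })
```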
In this embodiment, k8s is used as the management scheme of the container cloud platform, and the container cloud cluster resource utilization optimization method based on deep reinforcement learning is proposed to solve the low cluster resource utilization caused by the static scheduling mechanism built into k8s, improving cluster resource utilization while minimizing interference with running services.
This embodiment fully considers the availability and quality requirements of different services. The proposed overselling scheme, based on the resource capacity state of k8s nodes, is transparent to containers at every stage and can oversell cluster resources while still meeting the service availability requirements of the service provider. The embodiment provides several techniques, including a deep reinforcement learning model for determining the oversell policy, a reward function based on cluster profiles, and an event interception module design combining periodic updates with event-driven triggers. Specifically, the oversell policy model (the deep Q network model) is trained with deep reinforcement learning, taking the states of the nodes in the cluster (node information and current load state) as network inputs, with the goal of reducing both the high-load risk caused by overselling and the waste of resources, thereby effectively improving the resource utilization of the cluster.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. A container cloud cluster resource utilization optimization method based on deep reinforcement learning, characterized by comprising the following steps:
S1: preprocessing original load data, and assembling the preprocessed original load data into an input state s; the preprocessing of the original load data comprises a binning operation; the number of bins is set to B, and the corresponding boundary values b_i take values in {b_i | 0 ≤ i ≤ B, i ∈ N};
S2: constructing a deep Q network model for determining the oversell policy, inputting the input state s into the deep Q network model, which with a certain probability randomly selects an action a or otherwise selects the action a judged optimal by the deep Q network model, and executes one oversell-ratio prediction; after the input state s is input into the deep Q network model, the model calculates a Q value for each action a according to the input state s, and judges from the Q value whether the current action a is optimal for the deep Q network model; the Q value is calculated as:
Q(s, a, θ) = r′_current + γ · r′_future
where the Q value represents the immediate reward r′_current obtainable by the deep Q network model in state s when executing action a, plus the predicted value r′_future of future rewards; γ is the discount factor weighing immediate against future rewards; θ denotes the parameters of the deep Q network model;
S3: evaluating the selected action a through a reward function to obtain a reward r and enter the next state s′; the reward r is computed as the reciprocal of the weighted decision cost:
r = 1 / (w_o · o_k + w_u · u_k)
where w_o and w_u are the trade-off factors of the excess loss and the shortage loss, respectively, and w_o + w_u = 1; o_k denotes the excess loss, i.e. the high-load alarm risk a node incurs when its load is higher than L_target; u_k denotes the shortage loss, i.e. the resource waste a node incurs when its load is lower than L_target; L_target is the preset target load level of a node; h_o and h_u are the half-lives of the excess loss and the shortage loss, respectively; and an estimate of the node load after overselling is used in evaluating these losses;
S4: forming the input state s, the action a, the reward r and the next state s′ into a quadruple and placing it into a cache as a training sample;
S5: when a preset training interval is reached, sampling e training samples from the cache, inputting them into the deep Q network model for training, and updating the parameters of the deep Q network model, wherein e is a positive integer;
S6: after E rounds of training, applying the deep Q network model with updated parameters to determine the oversell policy.
2. The container cloud cluster resource utilization optimization method of claim 1, wherein in step S1, preprocessing the raw load data includes a binning operation, specifically as follows:
let the load update period be T; in the k-th time period, the original load data of node n is l_n^k, and the dimension of l_n^k equals the number of sampling points within one update period T; the number of bins is set to B, and the corresponding boundary values b_i take values in {b_i | 0 ≤ i ≤ B, i ∈ N}; assume there are M clusters, and the m-th cluster has N_m nodes;
the binning operation then maps the original load data l_n^k to a B-dimensional count vector, whose i-th component is
Σ_j I( b_{i-1} ≤ l_n^k[j] / C_{m,n} < b_i )
where C_{m,n} is the actual resource capacity of the n-th node in cluster m, I(·) is the indicator function, and l_n^k[j] is the j-th load sample of the period: whenever a load sample falls within the i-th binning interval, the i-th component of the count vector is incremented by 1.
3. The container cloud cluster resource utilization optimization method of claim 2, wherein the input state s includes node information C_n, node container information and node historical load; the node information C_n comprises the ID of the cluster where the n-th node is located, the ID of the city where it is located, the actual resource capacity and the current oversell ratio; the node container information represents the number of online-service Pods, the number of offline-service Pods and the total resource request of the Pods on the n-th node in the k-th load update period; the node historical load represents the binned historical load of the node over the last 7 days.
4. The container cloud cluster resource utilization optimization method of claim 2, wherein in step S2, after the input state s is input into the deep Q network model, the model calculates a Q value for each action a according to the input state s, and judges from the Q value whether the current action a is optimal for the deep Q network model; the Q value is calculated as:
Q(s, a, θ) = r′_current + γ · r′_future
where the Q value represents the immediate reward r′_current obtainable by the deep Q network model in state s when executing action a, plus the predicted value r′_future of future rewards; γ is the discount factor weighing immediate against future rewards; θ denotes the parameters of the deep Q network model.
5. The container cloud cluster resource utilization optimization method of claim 4, wherein in step S3, the selected action a is evaluated through a reward function, in which the reward r is computed as the reciprocal of the weighted decision cost:
r = 1 / (w_o · o_k + w_u · u_k)
where w_o and w_u are the trade-off factors of the excess loss and the shortage loss, respectively, and w_o + w_u = 1; o_k denotes the excess loss, i.e. the high-load alarm risk a node incurs when its load is higher than L_target; u_k denotes the shortage loss, i.e. the resource waste a node incurs when its load is lower than L_target; L_target is the preset target load level of a node; h_o and h_u are the half-lives of the excess loss and the shortage loss, respectively; and an estimate of the node load after overselling is used in evaluating these losses.
6. The container cloud cluster resource utilization optimization method of claim 5, wherein, regarding the estimate of the oversold node load: for compressible resources, the average resource utilization of cluster m is used to estimate the load state of the oversold node; for incompressible resources, since a process is terminated directly by the operating system when a node cannot provide the memory it requires, the maximum utilization of cluster m must be used to estimate the load state of the oversold node.
7. The container cloud cluster resource utilization optimization method of claim 6, wherein the deep Q network model comprises a target Q network and an online Q network, whose parameters are denoted θ and θ′, respectively;
in step S2, the online Q network is used to select the action a;
in step S5, when a preset training interval is reached, e training samples are sampled from the cache and input into the online Q network for training, and the parameter θ′ is updated;
in step S6, after E rounds of training of the online Q network, the updated parameter θ′ is used to update the parameter θ of the target Q network, and the target Q network is applied to determine the oversell policy.
8. The container cloud cluster resource utilization optimization method of claim 7, wherein in step S5, the parameter θ′ is updated by applying a gradient descent algorithm to (y − Q(s, a, θ′))², where y is given by:
y = r + γ · max_{a′} Q(s′, a′, θ)
and y represents the current target Q value while a′ represents the action executed in the next state s′.
9. The container cloud cluster resource utilization optimization method of any one of claims 1-8, further comprising the following step:
S7: deploying the deep Q network model with updated parameters outside the cluster as the oversell policy interface to provide service, and deploying an event interception module inside the cluster, wherein the event interception module uses the admission controller provided by k8s to intercept Pod creation and deletion events and node heartbeat events.
10. The container cloud cluster resource utilization optimization method of claim 9, wherein the event interception module performs the following steps:
1) the k8s server attempts to store the node's real resource state into the database through a heartbeat event;
2) the node heartbeat event is intercepted, and the node's real resource state is replaced with the oversell calculation result;
3) Pod creation and deletion events in the cluster are intercepted, and the corresponding node is marked as a "dirty node";
4) the dirty node is added to a node pool;
5) an independent thread monitors the node states in a local cache;
6) when no Pod creation or deletion event occurs on a node, the node is marked as an "old node" and added to the node pool;
7) the nodes in the node pool call the oversell policy interface concurrently;
8) for nodes with high-load alarms, the oversell-value policy interface is called.
CN202011225270.4A 2020-11-05 2020-11-05 Container cloud cluster resource utilization optimization method based on deep reinforcement learning Active CN112416578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011225270.4A CN112416578B (en) 2020-11-05 2020-11-05 Container cloud cluster resource utilization optimization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011225270.4A CN112416578B (en) 2020-11-05 2020-11-05 Container cloud cluster resource utilization optimization method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112416578A CN112416578A (en) 2021-02-26
CN112416578B true CN112416578B (en) 2023-08-15

Family

ID=74828183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011225270.4A Active CN112416578B (en) 2020-11-05 2020-11-05 Container cloud cluster resource utilization optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112416578B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113485792B (en) * 2021-07-08 2023-05-26 厦门服云信息科技有限公司 Pod scheduling method in kubernetes cluster, terminal equipment and storage medium
CN114389990A (en) * 2022-01-07 2022-04-22 中国人民解放军国防科技大学 Shortest path blocking method and device based on deep reinforcement learning
CN115130929B (en) * 2022-08-29 2022-11-15 中国西安卫星测控中心 Resource pool intelligent generation method based on machine learning classification

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109491790A (en) * 2018-11-02 2019-03-19 中山大学 Industrial Internet of Things edge calculations resource allocation methods and system based on container
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110688202A (en) * 2019-10-09 2020-01-14 腾讯科技(深圳)有限公司 Service process scheduling method, device, equipment and storage medium
WO2020206705A1 (en) * 2019-04-10 2020-10-15 山东科技大学 Cluster node load state prediction-based job scheduling method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109491790A (en) * 2018-11-02 2019-03-19 中山大学 Industrial Internet of Things edge calculations resource allocation methods and system based on container
WO2020206705A1 (en) * 2019-04-10 2020-10-15 山东科技大学 Cluster node load state prediction-based job scheduling method
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110688202A (en) * 2019-10-09 2020-01-14 腾讯科技(深圳)有限公司 Service process scheduling method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mohamed Handaoui et al. "ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources." arXiv:2009.11208v3, 2020, pp. 1-9. *

Also Published As

Publication number Publication date
CN112416578A (en) 2021-02-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant