CN112311578B - VNF scheduling method and device based on deep reinforcement learning

VNF scheduling method and device based on deep reinforcement learning

Info

Publication number
CN112311578B
CN112311578B (application number CN201910704763.7A)
Authority
CN
China
Prior art keywords
state data
action
vnf
network
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910704763.7A
Other languages
Chinese (zh)
Other versions
CN112311578A (en)
Inventor
邢彪
郑屹峰
张卷卷
陈维新
章淑敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Zhejiang Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201910704763.7A priority Critical patent/CN112311578B/en
Publication of CN112311578A publication Critical patent/CN112311578A/en
Application granted granted Critical
Publication of CN112311578B publication Critical patent/CN112311578B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0896 Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a VNF scheduling method and device based on deep reinforcement learning. The method comprises the following steps: collecting historical state data corresponding to a VNF; performing model training on a deep reinforcement learning neural network based on the historical state data to obtain a deep reinforcement scheduling model; acquiring real-time state data of a VNF to be scheduled, and inputting the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled; and scheduling the VNF to be scheduled based on the scaling action. With the scheme provided by the invention, the VNFM can automatically add or delete VNFs according to the real-time state data of the VNFs, thereby realizing elastic scheduling of the VNFs, improving scheduling accuracy and reducing scheduling time. The scheme overcomes the problems that manually formulated policies are time-consuming, labor-intensive, and error-prone and cannot be adjusted in time when services change, and thus avoids impact on services.

Description

VNF scheduling method and device based on deep reinforcement learning
Technical Field
The invention relates to the technical field of communication, in particular to a VNF scheduling method and device based on deep reinforcement learning.
Background
In the prior art, the VNFM can obtain elasticity indicators (performance or alarm metrics) defined in an elasticity policy from the VNF, the EMS, or the VIM, and trigger automatic scaling of a VNF instance according to manually defined thresholds in the elasticity policy. The VNFM notifies the NFVO to scale the VNF, the interface carries the JobID, and during the automatic scaling process the VNFM returns the operation status upon request of the NFVO. The VNFM also requests from the NFVO the permission to operate the virtual resources required for scaling. Whether a scaling action is needed is determined as shown in FIG. 1: the workload is sampled at a configured sampling interval, and whenever a sample exceeds the scale-out watermark threshold within a monitoring period, scale-out is deemed necessary; when a sample falls below the scale-in watermark threshold, scale-in is deemed necessary.
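For illustration only, the watermark check described above can be sketched as follows (the function and threshold names are hypothetical, not taken from the patent):

    def scaling_decision(sample, scale_out_threshold, scale_in_threshold):
        """Prior-art elasticity check: compare one workload sample against the watermark thresholds."""
        if sample > scale_out_threshold:
            return "scale_out"   # sample exceeds the scale-out watermark
        if sample < scale_in_threshold:
            return "scale_in"    # sample falls below the scale-in watermark
        return "none"            # within the watermarks: no action needed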
The VNF elastic scaling policy in existing NFV networks is implemented by relying on expert experience to simply set scaling thresholds for each KPI. However, this one-size-fits-all approach ignores changes in the services carried by the VNF network elements and is poorly suited to complex network-element environments such as NFV: it easily causes frequent scale-out and scale-in operations, which affects service perception. In addition, formulating policies manually is time-consuming, labor-intensive, and error-prone, and the policies cannot be adjusted in time when services change.
Disclosure of Invention
In view of the above, the present invention is proposed to provide a VNF scheduling method and apparatus based on deep reinforcement learning, which overcomes or at least partially solves the above problems.
According to an aspect of the present invention, a VNF scheduling method based on deep reinforcement learning is provided, including:
collecting historical state data corresponding to the VNF;
performing model training on the deep reinforcement learning neural network based on historical state data to obtain a deep reinforcement scheduling model;
acquiring real-time state data of the VNF to be scheduled, and inputting the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled;
and scheduling the VNF to be scheduled based on the scaling action.
According to another aspect of the present invention, there is provided a VNF scheduling apparatus based on deep reinforcement learning, including:
the collecting module is suitable for collecting historical state data corresponding to the VNF;
the training module is suitable for carrying out model training on the deep reinforcement learning neural network based on historical state data to obtain a deep reinforcement scheduling model;
the scaling action determining module is suitable for acquiring real-time state data of the VNF to be scheduled and inputting the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled;
and the scheduling processing module is suitable for scheduling the VNF to be scheduled based on the scaling action.
According to still another aspect of the present invention, there is provided an electronic device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the deep reinforcement learning-based VNF scheduling method.
According to another aspect of the present invention, a computer storage medium is provided, where at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform an operation corresponding to the deep reinforcement learning-based VNF scheduling method as described above.
According to the scheme provided by the invention, historical state data corresponding to the VNF is collected; model training is performed on a deep reinforcement learning neural network based on the historical state data to obtain a deep reinforcement scheduling model; real-time state data of the VNF to be scheduled is acquired and input into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled; and the VNF to be scheduled is scheduled based on the scaling action. With this scheme, the VNFM can automatically add or delete VNFs according to the real-time state data of the VNFs, thereby realizing elastic scheduling of the VNFs, improving scheduling accuracy and reducing scheduling time; it overcomes the problems that manually formulated policies are time-consuming, labor-intensive, and error-prone and cannot be adjusted in time when services change, and thus avoids impact on services.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of the prior-art process for determining whether a VNF needs scaling;
FIG. 2 is a schematic diagram of the NFV architecture;
FIG. 3A is a flowchart of a VNF scheduling method based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 3B is a signaling diagram for collecting historical state data;
FIG. 3C is a schematic diagram of the Actor-Critic method;
FIG. 3D is a schematic diagram of VNF scheduling;
FIG. 4A is a flowchart of a specific training method of the deep reinforcement scheduling model in step S302 of the embodiment of FIG. 3A;
FIG. 4B is a schematic structural diagram of the Actor network and the Critic network;
FIG. 5 is a schematic structural diagram of a VNF scheduling apparatus based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
NFV (Network Function Virtualization) refers to implementing various network device functions on standardized, general-purpose IT equipment (x86 servers, storage, and switching devices) by means of virtualization technology. The VNFM (Virtualized Network Function Manager) is the functional module in NFV responsible for lifecycle management of virtualized network functions. The VNFM implements lifecycle management of a virtualized network element (VNF), including management and processing of the VNFD, instantiation of a VNF instance, scale-out/scale-in of the VNF, and termination of the VNF instance.
A VNF (Virtualized Network Function) corresponds to a network element in a conventional telecommunication service network: each physical network element is mapped to a virtual network element (VNF). A VNF is a network element function implemented purely in software that runs on the NFVI and corresponds to the function of a conventional physical network element.
As shown in fig. 2, the VNFM interacts with the VNF through the C10 interface, wherein:
1) VNFM-to-VNF messages: complete deployment-related configuration management (non-VNF application layer), VNF self-healing, and the like;
2) VNF-to-VNFM messages: the VNF reports loading-completion information, informing the VNFM that the services on the established virtual machine are ready to provide service; the VNF also reports its performance information, which the VNFM uses for elastic scaling.
Fig. 3A is a flowchart illustrating a VNF scheduling method based on deep reinforcement learning according to an embodiment of the present invention. As shown in fig. 3A, the method includes the steps of:
Step S301: collecting historical state data corresponding to the VNF.
Specifically, the VNFM collects historical state data, also referred to as KPI data, from the VNF, including: traffic load occupancy, CPU load occupancy, number of users, process memory occupancy, BHSA (busy-hour sessions per single user), maximum control block resource occupancy, and the like. The collection granularity is 5 minutes, i.e., historical state data is collected once every 5 minutes.
Specifically, as shown in fig. 3B, the VNFM sends a KPI data query request to the OMU, the OMU forwards the KPI data query request to the VNF, the VNF collects KPI data according to the KPI data query request, and feeds back the collected KPI data to the OMU, and then the OMU feeds back the KPI data to the VNFM.
In an optional embodiment of the present invention, after the historical state data corresponding to the VNF is collected, the collected historical state data may be normalized. Normalization scales the data so that it falls within a small, specific interval, typically between 0 and 1, i.e., the historical state data is mapped into the range [0, 1]. Normalizing the historical state data improves both the convergence speed and the accuracy of the model during training.
Specifically, the historical state data may be normalized according to equation (1):

X_std = (X - X_min) / (X_max - X_min)
X_scaled = X_std * (max - min) + min        Formula (1)

where X_min and X_max are the minimum and maximum values of the data, and [min, max] is the target interval, here [0, 1] (so that X_scaled = X_std).
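A minimal NumPy sketch of this min-max normalization (the function name, sample values, and per-column treatment are illustrative assumptions):

    import numpy as np

    def min_max_normalize(kpi_data, target_min=0.0, target_max=1.0):
        """Scale each KPI column into [target_min, target_max] as in equation (1)."""
        x_min = kpi_data.min(axis=0)
        x_max = kpi_data.max(axis=0)
        denom = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant columns
        x_std = (kpi_data - x_min) / denom
        return x_std * (target_max - target_min) + target_min

    # Rows are 5-minute samples, columns are KPIs (illustrative values only)
    samples = np.array([[0.62, 1200.0], [0.85, 2100.0], [0.47, 900.0]])
    print(min_max_normalize(samples))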
Step S302: performing model training on the deep reinforcement learning neural network based on the historical state data to obtain a deep reinforcement scheduling model.
In this embodiment, the deep reinforcement learning neural network is model-trained using the historical state data, and a deep reinforcement scheduling model is obtained through training. Deep reinforcement learning is developed from reinforcement learning, an important machine learning method that involves three elements: state, action, and reward. The agent takes an action according to the current state and, after obtaining the corresponding reward, improves its action policy so that it takes a better action the next time it reaches the same state. The agent's main goal is to maximize the cumulative reward by performing a specific sequence of actions in the environment.
In deep reinforcement learning, a deep neural network is used to extract data features and train the reinforcement learning model, so that the model can fully learn the regularities of a complex external environment, take correct actions under different conditions, and obtain a higher cumulative return in long-term interaction.
DQN (Deep Q-Network), proposed by DeepMind in 2015, represents the value function with a deep network; following Q-learning in reinforcement learning, a target value is provided for the deep network, and the network is updated continuously until convergence. Inside DQN there are two neural networks: a target network (target_net) with relatively fixed parameters, used to obtain the Q-target value, and an evaluation network (eval_net), used to obtain the Q-evaluation value. The training data are drawn at random from a replay memory that records the transition (s, a, r, s') of each step. The memory is limited in size, so when it is full, new data overwrite the oldest data in the memory. However, DQN is a value-function-based method and has difficulty coping with a large action space, particularly continuous actions. DDPG, by contrast, is based on the Actor-Critic method: it uses a network to fit the policy function and directly outputs the action, so it can handle continuous actions and a large action space.
DPG (Deterministic Policy Gradient) was proposed by DeepMind in 2014: the action of each step is obtained directly, as a deterministic value, from the function μ:

a_t = μ(s_t | θ^μ)        Formula (2)

where a_t is the action selected at time t, s_t is the state data of the environment at time t, θ^μ is the weight parameter, and the function μ is the optimal behavior policy.
The function μ, i.e., the optimal behavior policy, is no longer a stochastic policy that requires sampling. DPG is needed because, after a stochastic policy is obtained through policy gradient learning, the resulting optimal policy probability distribution must be sampled at every step to obtain the concrete value of the action; since the action is usually a high-dimensional vector, frequent sampling in a high-dimensional action space is computationally very expensive. The DPG algorithm is integrated into the Actor-Critic framework and combined with traditional Q-function learning methods such as Q-learning, and a deterministic optimal behavior policy function is obtained through training. The deterministic policy gradient equation is as follows (the symbols in equation (3) have their standard meanings in the art):

∇_{θ^μ} J ≈ E[ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]        Formula (3)
DDPG (Deep Deterministic Policy Gradient) is a policy learning method that fuses deep neural networks into DPG: both the value function and the policy function are represented by neural networks. Its core improvements over DPG are: using neural networks to approximate the policy function μ and the Q function, i.e., the policy network and the Q network, and then training these networks with deep learning methods.
DDPG is a deep reinforcement learning method based on the Actor-Critic architecture, proposed by Google DeepMind. It uses an Actor-Critic structure, but the output is not a probability over actions; instead it outputs a concrete action, which makes it suitable for predicting continuous actions. DDPG also incorporates the previously successful DQN structure, improving the stability and convergence of Actor-Critic.
The Actor-Critic method is an important reinforcement learning algorithm. It is a temporal-difference (TD) method that combines value-function-based and policy-function-based approaches. The policy function is the Actor, which outputs the action; the value function is the evaluator (Critic), which evaluates how good the Actor's action is and produces a temporal-difference signal to guide the update of both the value function and the policy function. Combined with deep learning, two deep networks are used to represent the value function and the policy function respectively. The Actor selects a behavior based on probability, the Critic scores the behavior selected by the Actor, and the Actor adjusts its selection according to the Critic's score, as shown in fig. 3C.
In the application scenario of the present invention, the state s_t is the KPI data of the VNF at time t; the action a_t is the scaling action at time t, which belongs to the continuous action space: after the action selected by the model is executed via the VNFM, the VNF state transitions from s_t to s_{t+1}. The function r(s_t, a_t) is the single-step reward value returned after the VNF performs action a_t in state s_t; the specific reward is determined by the resulting VNF state s_{t+1}. R_t is the weighted (discounted) sum of the reward values obtained by all actions from the current state onward into the future.
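As a small illustration of the discounted return R_t described above (a sketch under the usual discounting convention; the discount factor and reward values are illustrative, not from the patent):

    def discounted_return(rewards, gamma=0.99):
        """R_t: weighted sum of the single-step rewards from the current state onward (sum of gamma**k * r_{t+k})."""
        ret = 0.0
        for r in reversed(rewards):
            ret = r + gamma * ret
        return ret

    print(discounted_return([1.0, 0.5, -0.2]))  # rewards observed after three successive scaling actions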
In an alternative embodiment of the present invention, the deep reinforcement learning neural network may be model-trained as follows: the historical state data are input into the actor network to obtain the scaling actions corresponding to the historical state data; the historical state data and the scaling actions are input into the critic network to obtain the reward values corresponding to the scaling actions; and the actor-critic network is trained using the historical state data, the scaling actions, the reward values, and the state data after the scaling actions are executed as training data.
Step S303: acquiring real-time state data of the VNF to be scheduled, and inputting the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled.
The VNFM acquires real-time state data from the VNF to be scheduled and then inputs the acquired real-time state data into the deep reinforcement scheduling model obtained in step S302. The deep reinforcement scheduling model outputs the scaling action corresponding to the real-time state data; the scaling action indicates whether the VNF needs to be scaled out or scaled in.
Optionally, after the real-time state data of the VNF to be scheduled are acquired, they may be normalized, and the normalized real-time state data are input into the deep reinforcement scheduling model to obtain the scaling action corresponding to the VNF to be scheduled.
The scaling action includes a horizontal scaling action and/or a vertical scaling action. The horizontal scaling action is scaling of the number of VNFs; the vertical scaling action is scaling of the CPU, memory, and/or storage resources occupied by the VNF. That is, horizontal scaling expands or shrinks a VNF instance by increasing or decreasing the number of virtual machines, while vertical scaling expands or shrinks a VNF instance by increasing or decreasing the resources, such as CPU, memory, and/or storage, occupied by the virtual machines.
Step S304: scheduling the VNF to be scheduled based on the scaling action.
In this embodiment, the scaling action step value is output as a percentage. If the output is non-zero, the VNF to be scheduled is scheduled based on the scaling action and automatic elastic scaling of the VNF is triggered; if the output is 0, no scheduling is performed on the VNF, as shown in fig. 3D.
Taking scaling of the VNF count as an example, the "scaling action step value" represents the percentage of the VNF count by which to scale. The principle is: the current VNF count multiplied by the scaling step value gives the number of VNFs to be scaled, rounded up when scaling out and rounded down when scaling in. For example, when the current VNF count is 5 and the step value is 0.3, the pre-scaling count is calculated as 1.5: on scale-out, 2 VNFs are finally added; on scale-in, 1 VNF is finally removed.
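A minimal sketch of this rounding rule (the function name, and treating the step value as a signed fraction of the current count, are assumptions made for illustration consistent with the worked example above):

    import math

    def vnfs_to_scale(current_count, step_value):
        """Number of VNF instances to add (positive step value) or remove (negative step value)."""
        raw = current_count * abs(step_value)
        if step_value > 0:
            return math.ceil(raw)    # scale-out: round up
        if step_value < 0:
            return math.floor(raw)   # scale-in: round down
        return 0                     # step value 0: no scaling

    print(vnfs_to_scale(5, 0.3))     # scale-out: 2 VNFs are added
    print(vnfs_to_scale(5, -0.3))    # scale-in: 1 VNF is removed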
With the method provided by this embodiment of the invention, the VNFM can automatically add or delete VNFs according to the real-time state data of the VNFs, thereby realizing elastic scheduling of the VNFs, improving scheduling accuracy, and reducing scheduling time. It overcomes the problems that manually formulated policies are time-consuming, labor-intensive, and error-prone and cannot be adjusted in time when services change, and thus avoids impact on services.
Fig. 4A is a flowchart illustrating a specific training method of the deep reinforcement scheduling model in step S302 of the embodiment of fig. 3A. As shown in fig. 4A, the method includes the following steps:
starting from j = 1:
Step S401: inputting any unselected j-th state data, randomly selected from the historical state data, into the actor network to obtain the corresponding scaling action; and inputting the j-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action.
In this embodiment, the actor network comprises an actor target network (target_net) and an actor estimation network (eval_net); the two are neural networks with the same structure and differ only in the frequency with which their parameters are updated. The actor network takes the VNF state data as input and outputs a scaling action in the continuous action space.
As shown in fig. 4B, the actor network is configured as follows: an input layer receives the state data of the VNF; the hidden part comprises 3 fully connected (Dense) layers with 300, 200, and 100 neurons respectively, each with the "relu" activation function; the output layer outputs continuous actions of two categories, horizontal scaling actions and vertical scaling actions, where horizontal scaling refers to scaling the number of VNFs out or in and vertical scaling refers to scaling the CPU, memory, and storage resources occupied by the VNF. The output layer is a fully connected (Dense) layer with 4 neurons and the "tanh" activation function, correspondingly outputting 4 continuous actions: the scaling percentage of the VNF count, of the VNF CPU, of the VNF memory, and of the VNF storage. Each value lies in the range -1 to 1: a positive value means scale-out, a negative value means scale-in, and 0 means no change.
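A minimal tf.keras sketch of the actor network described above (the layer sizes and activations follow the text; the number of input KPIs and all names are illustrative assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_KPIS = 6  # assumed: e.g. traffic load, CPU load, user count, process memory, BHSA, control blocks

    def build_actor(num_kpis=NUM_KPIS):
        """Actor: VNF state data -> 4 continuous scaling actions in [-1, 1]."""
        state_in = layers.Input(shape=(num_kpis,), name="vnf_state")
        x = layers.Dense(300, activation="relu")(state_in)
        x = layers.Dense(200, activation="relu")(x)
        x = layers.Dense(100, activation="relu")(x)
        # 4 outputs: scaling percentages for VNF count, CPU, memory, and storage
        action_out = layers.Dense(4, activation="tanh", name="scaling_action")(x)
        return tf.keras.Model(state_in, action_out)

    actor = build_actor()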
The critic network comprises a critic target network (target_net) and a critic estimation network (eval_net); the two are neural networks with the same structure and differ only in the frequency with which their parameters are updated. The inputs of the critic network are the state data of the VNF and the scaling action selected by the actor network; the output is the reward value Q(s, a) for the selected scaling action. The Q value is fed back to the actor network by the critic network, and the actor network selects, according to the Q value, the scaling action that can obtain the maximum benefit.
As shown in fig. 4B, the critic network includes: two input layers (input layer 1 and input layer 2), where one receives the state data of the VNF and the other receives the corresponding scaling action; input layer 1 passes through two fully connected (Dense) layers with 300 and 200 neurons respectively, each with the "relu" activation function; input layer 2 passes through 1 fully connected (Dense) layer with 200 neurons and the "relu" activation function; the scaling action and state data branches are then combined through a merge layer; finally, a fully connected layer (200 neurons, "relu" activation) and an output layer (4 neurons, "tanh" activation) follow, outputting the Q value.
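A corresponding tf.keras sketch of the critic network, reusing the imports and NUM_KPIS above (the layer sizes follow the text; the 4-unit tanh output is kept as described, although a conventional DDPG critic outputs a single linear Q value, and this is an illustrative sketch rather than the patent's code):

    def build_critic(num_kpis=NUM_KPIS, action_dim=4):
        """Critic: (VNF state data, scaling action) -> Q value."""
        state_in = layers.Input(shape=(num_kpis,), name="vnf_state")   # input layer 1
        action_in = layers.Input(shape=(action_dim,), name="action")   # input layer 2
        s = layers.Dense(300, activation="relu")(state_in)
        s = layers.Dense(200, activation="relu")(s)
        a = layers.Dense(200, activation="relu")(action_in)
        x = layers.Concatenate()([s, a])                               # merge layer
        x = layers.Dense(200, activation="relu")(x)
        q_out = layers.Dense(4, activation="tanh", name="q_value")(x)  # 4-unit tanh output per the text
        return tf.keras.Model([state_in, action_in], q_out)

    critic = build_critic()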
In this step, the deep reinforcement scheduling model is trained. Specifically, any unselected j-th state data randomly selected from the historical state data is input into the actor network and, as introduced above, the scaling action corresponding to the j-th state data is obtained; the j-th state data and the corresponding scaling action are then input into the critic network to obtain the reward value corresponding to the scaling action.
When j = 1, the actor network and the critic network are first initialized: the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ) are randomly initialized with weights θ^Q and θ^μ, respectively. The target networks are then initialized as Q' = Q(s, a | θ^Q) and μ' = μ(s | θ^μ), i.e., θ^{Q'} ← θ^Q and θ^{μ'} ← θ^μ.
Step S402: storing the j-th state data, the scaling action, the reward value, and the state data after the scaling action is executed into the replay memory as one piece of training data, where the state data after the scaling action is executed is defined as the i-th state data.
After step S401 is executed, the scaling action corresponding to the j-th state data, the resulting reward value, and the state data after the scaling action are obtained; the j-th state data, the scaling action, the reward value, and the post-action state data are stored in the replay memory as one piece of training data.
Step S403: inputting the i-th state data into the actor network to obtain the corresponding scaling action; and inputting the i-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action.
Step S404: storing the i-th state data, the scaling action, the reward value, and the (i+1)-th state data after the scaling action is executed into the replay memory as one piece of training data; assigning i = i + 1, and repeating steps S403-S404 until the number of pieces of training data stored in the replay memory is greater than or equal to a preset threshold.
Steps S403-S404 are the preparation of training data. The training data are stored in the replay memory so that, during training, the actor network and the critic network can be updated with small batches of data selected at random from the replay memory rather than with only the latest training data. This breaks the correlation between training samples and greatly improves the stability of the model.
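A minimal replay-memory sketch consistent with the description above (the capacity, batch size, and class name are illustrative assumptions, not values from the patent):

    import random
    from collections import deque

    class ReplayMemory:
        """Fixed-size store of (state, action, reward, next_state) transitions."""
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)      # oldest entries are overwritten when full

        def store(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size=64):
            # Random mini-batch, which breaks the correlation between consecutive samples
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)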
Step S405: randomly selecting several pieces of training data from the replay memory to calculate the loss function of the critic network, and updating the critic network with the loss function; and updating the actor network based on the updated critic network.
Several pieces of training data (s_i, a_i, r_i, s_{i+1}) are randomly extracted from the replay memory, and the target value y_i is set as:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | θ^{μ'}) | θ^{Q'})

where y_i is the target value computed with the target networks, r_i is the return obtained in the i-th transition, θ^{μ'} and θ^{Q'} are the target network weights, and γ is the discount factor.
The critic network is updated according to the target value y_i, specifically by minimizing the following loss function L:

L = (1/N) Σ_i ( y_i - Q(s_i, a_i | θ^Q) )^2

In other words, the critic estimation network is trained on the squared loss between the estimated Q value, obtained by feeding the state data s and the scaling action a output by the actor estimation network into the critic estimation network, and the target value, obtained by adding the real reward value r to the discounted Q value produced by feeding the next state s' and the scaling action a' from the actor target network into the critic target network.
After the critic network is updated, the actor network is updated. Since a is a deterministic policy, i.e., a = μ(s | θ^μ), the actor network is updated with the deterministic policy gradient: the action gradient ∇_a Q(s, a | θ^Q) obtained from the critic network is multiplied by the parameter gradient ∇_{θ^μ} μ(s | θ^μ) obtained from the actor network, so that the actor network modifies its parameters in the direction that is more likely to obtain a larger Q value:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q) |_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ) |_{s=s_i}

where ∇_{θ^μ} J denotes the policy gradient used to adjust the actor network weights, ∇_a Q(s, a | θ^Q) denotes the action gradient, and ∇_{θ^μ} μ(s | θ^μ) denotes the parameter gradient. The purpose of the actor network is to obtain scaling actions with as high a Q value as possible; the loss of the actor network can therefore be understood simply as: the larger the feedback Q value, the smaller the loss, and the smaller the feedback Q value, the larger the loss.
Optionally, the target networks also need to be updated. Since directly implementing the Q-value function with a neural network has proved unstable, DeepMind proposed using target networks: copies of the actor network and the critic network are created, respectively, and used to calculate the target values. The weights of these target networks are updated by slowly tracking the learned networks:

θ^{Q'} ← τ θ^Q + (1 - τ) θ^{Q'}
θ^{μ'} ← τ θ^μ + (1 - τ) θ^{μ'}

where τ is set to a value much smaller than 1 (so that 1 - τ is very close to 1), ensuring that the target network parameters θ' change only slowly.
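Putting the updates of step S405 together, the following is a minimal tf.keras sketch of one training update. It reuses the actor, critic, build_actor, and build_critic names from the sketches above; the discount factor, learning rates, τ value, and the broadcasting of a scalar reward against the critic output are illustrative assumptions rather than values taken from the patent.

    import numpy as np
    import tensorflow as tf

    gamma, tau = 0.99, 0.001                      # assumed hyperparameters
    critic_opt = tf.keras.optimizers.Adam(1e-3)
    actor_opt = tf.keras.optimizers.Adam(1e-4)
    target_actor = build_actor()                  # target networks start as copies of the learned networks
    target_actor.set_weights(actor.get_weights())
    target_critic = build_critic()
    target_critic.set_weights(critic.get_weights())

    def train_step(batch):
        states, actions, rewards, next_states = [np.asarray(x, np.float32) for x in zip(*batch)]
        rewards = rewards.reshape(-1, 1)

        # Target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
        y = gamma * target_critic([next_states, target_actor(next_states)]) + rewards

        # Critic update: minimize L = mean (y_i - Q(s_i, a_i))^2
        with tf.GradientTape() as tape:
            q = critic([states, actions])
            critic_loss = tf.reduce_mean(tf.square(y - q))
        grads = tape.gradient(critic_loss, critic.trainable_variables)
        critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

        # Actor update: deterministic policy gradient, i.e. maximize Q(s, mu(s))
        with tf.GradientTape() as tape:
            actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
        grads = tape.gradient(actor_loss, actor.trainable_variables)
        actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

        # Soft (slow-tracking) target update: theta' <- tau*theta + (1 - tau)*theta'
        for target, source in ((target_critic, critic), (target_actor, actor)):
            for t_var, s_var in zip(target.variables, source.variables):
                t_var.assign(tau * s_var + (1.0 - tau) * t_var)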
Step S406: assigning j = j + 1.
Step S407: judging whether j is greater than n; if so, executing step S408; if not, jumping back to step S401.
Step S408: the training of the deep reinforcement scheduling model is finished.
Steps S401-S405 describe one iteration of the training process. After step S405 is finished, j is assigned j + 1 and whether j is greater than n is judged; if so, the training of the deep reinforcement scheduling model is finished; if not, the next iteration of the training process continues and execution jumps back to step S401.
In this embodiment, n may be set to 1000; this is only an example and is not limiting. During training, the Adam optimizer is selected as the gradient descent optimization algorithm (optimizer = 'adam') to improve the learning speed over traditional gradient descent. Through gradient descent, the neural network finds the optimal weights that minimize the loss function; the training error gradually decreases as the number of training rounds increases, and the model gradually converges. After training is completed, the deep reinforcement scheduling model is obtained.
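For orientation only, the overall loop of fig. 4A might be sketched as follows, reusing ReplayMemory, actor, critic, and train_step from the sketches above. The helpers pick_unselected_sample, state_after_action, and the scalar reduction in reward_from_critic are hypothetical placeholders for the VNFM/VNF interactions that the text does not spell out.

    memory = ReplayMemory(capacity=10000)
    n, batch_size, threshold = 1000, 64, 500      # n = 1000 per the text; the rest are illustrative

    def reward_from_critic(s, a):
        """Hypothetical: the text obtains the reward value from the critic network; reduced to a scalar here."""
        return float(critic([s[np.newaxis, :], a[np.newaxis, :]]).numpy().mean())

    for j in range(1, n + 1):                                 # S406/S407: iterate until j > n
        s = pick_unselected_sample()                          # hypothetical: random unselected j-th historical state
        while True:
            a = actor(s[np.newaxis, :]).numpy()[0]            # S401/S403: actor outputs the scaling action
            r = reward_from_critic(s, a)                      # critic returns the reward value for (s, a)
            s_next = state_after_action(s, a)                 # hypothetical: state data after the action executes
            memory.store(s, a, r, s_next)                     # S402/S404: store one piece of training data
            s = s_next
            if len(memory) >= threshold:                      # S404: stop once enough training data is stored
                break
        train_step(memory.sample(batch_size))                 # S405: update critic, actor, and target networks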
Fig. 5 is a schematic structural diagram of a VNF scheduling apparatus based on deep reinforcement learning according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes: a collection module 501, a training module 502, a scaling action determination module 503, and a scheduling processing module 504.
A collecting module 501 adapted to collect historical state data corresponding to the VNF;
the training module 502 is adapted to perform model training on the deep reinforcement learning neural network based on the historical state data to obtain a deep reinforcement scheduling model;
the scaling action determining module 503 is adapted to obtain real-time state data of the VNF to be scheduled, and input the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled;
and the scheduling processing module 504 is adapted to perform scheduling processing on the VNFs to be scheduled based on the scaling action.
Optionally, the deep reinforcement learning neural network is specifically an actor network-critic network.
Optionally, the training module is further adapted to: input the historical state data into the actor network to obtain the scaling actions corresponding to the historical state data;
input the historical state data and the scaling actions into the critic network to obtain the reward values corresponding to the scaling actions;
and train the actor network-critic network using the historical state data, the scaling actions, the reward values, and the state data after the scaling actions are executed as training data.
Optionally, the training module is further adapted to, starting from j = 1: S1, input any unselected j-th state data, randomly selected from the historical state data, into the actor network to obtain the corresponding scaling action; and input the j-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action;
S2, store the j-th state data, the scaling action, the reward value, and the state data after the scaling action is executed into the replay memory as one piece of training data, where the state data after the scaling action is executed is defined as the i-th state data;
S3, input the i-th state data into the actor network to obtain the corresponding scaling action; and input the i-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action;
S4, store the i-th state data, the scaling action, the reward value, and the (i+1)-th state data after the scaling action is executed into the replay memory as one piece of training data; assign i = i + 1, and repeat S3-S4 until the number of pieces of training data stored in the replay memory is greater than or equal to a preset threshold;
S5, randomly select several pieces of training data from the replay memory to calculate the loss function of the critic network, and update the critic network with the loss function; and update the actor network based on the updated critic network;
S6, assign j = j + 1;
S7, judge whether j is greater than n; if so, the training of the deep reinforcement scheduling model is finished; if not, jump back to S1.
Optionally, the apparatus further comprises: a normalization processing module suitable for performing data normalization processing on the historical state data.
Optionally, the scaling action comprises: a horizontal scaling action and/or a vertical scaling action;
the horizontal scaling action comprises: scaling of the number of VNFs; the vertical scaling action comprises: scaling of CPU, memory and/or storage resources occupied by the VNF.
Optionally, the status data comprises one or more of the following: service load occupancy rate, CPU load occupancy rate, user number, process memory occupancy rate, BHSA, and control block resource occupancy rate.
With the apparatus provided by this embodiment of the invention, the VNFM can automatically add or delete VNFs according to the real-time state data of the VNFs, thereby realizing elastic scheduling of the VNFs, improving scheduling accuracy, and reducing scheduling time. It overcomes the problems that manually formulated policies are time-consuming, labor-intensive, and error-prone and cannot be adjusted in time when services change, and thus avoids impact on services.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores at least one executable instruction, and the computer executable instruction can execute the VNF scheduling method based on deep reinforcement learning in any method embodiment.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the electronic device.
As shown in fig. 6, the electronic device may include: a processor, a communication interface, a memory, and a communication bus.
Wherein:
the processor, the communication interface, and the memory communicate with each other via a communication bus.
A communication interface for communicating with network elements of other devices, such as clients or other servers.
The processor is configured to execute a program, and may specifically execute relevant steps in the foregoing VNF scheduling method embodiment based on deep reinforcement learning.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The electronic device comprises one or more processors, which may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
And the memory is used for storing programs. The memory may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program may specifically be configured to cause the processor to execute the VNF scheduling method based on deep reinforcement learning in any of the method embodiments described above. For specific implementation of each step in the program, reference may be made to corresponding descriptions in corresponding steps and units in the foregoing VNF scheduling embodiment based on deep reinforcement learning, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the devices in an embodiment may be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in a deep reinforcement learning based VNF scheduling apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, or provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (6)

1. A VNF scheduling method based on deep reinforcement learning comprises the following steps:
collecting historical state data corresponding to the VNF;
performing model training on a deep reinforcement learning neural network based on the historical state data to obtain a deep reinforcement scheduling model, wherein the deep reinforcement learning neural network is specifically an actor network-critic network;
acquiring real-time state data of a VNF to be scheduled, and inputting the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled, wherein the scaling action comprises: a horizontal scaling action and/or a vertical scaling action; the horizontal scaling action comprises: scaling of the number of VNFs; the vertical scaling action comprises: scaling of CPU, memory and/or storage resources occupied by the VNF;
scheduling the VNF to be scheduled based on the scaling action;
wherein performing model training on the deep reinforcement learning neural network based on the historical state data to obtain the deep reinforcement scheduling model further comprises:
starting from j = 1: S1, inputting any unselected j-th state data randomly selected from the historical state data into the actor network to obtain the corresponding scaling action; and inputting the j-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action;
S2, storing the j-th state data, the scaling action, the reward value, and the state data after the scaling action is executed into a replay memory as one piece of training data, wherein the state data after the scaling action is executed is defined as the i-th state data;
S3, inputting the i-th state data into the actor network to obtain the corresponding scaling action; and inputting the i-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action;
S4, storing the i-th state data, the scaling action, the reward value, and the (i+1)-th state data after the scaling action is executed into the replay memory as one piece of training data; assigning i = i + 1, and repeating S3-S4 until the number of pieces of training data stored in the replay memory is greater than or equal to a preset threshold;
S5, randomly selecting several pieces of training data from the replay memory to calculate a loss function of the critic network, and updating the critic network with the loss function; and updating the actor network based on the updated critic network;
S6, assigning j = j + 1;
S7, judging whether j is greater than n; if so, the training of the deep reinforcement scheduling model is finished; if not, jumping back to S1.
2. The method of claim 1, wherein after collecting historical state data corresponding to VNFs, the method further comprises: and carrying out data normalization processing on the historical state data.
3. The method of claim 1, wherein the status data comprises one or more of: service load occupancy rate, CPU load occupancy rate, user number, process memory occupancy rate, BHSA, and control block resource occupancy rate.
4. A VNF scheduling apparatus based on deep reinforcement learning, comprising:
the collecting module is suitable for collecting historical state data corresponding to the VNF;
the training module is suitable for performing model training on a deep reinforcement learning neural network based on the historical state data to obtain a deep reinforcement scheduling model, wherein the deep reinforcement learning neural network is specifically an actor network-critic network;
the scaling action determining module is suitable for acquiring real-time state data of the VNF to be scheduled and inputting the real-time state data into the deep reinforcement scheduling model to obtain a scaling action corresponding to the VNF to be scheduled, wherein the scaling action comprises: a horizontal scaling action and/or a vertical scaling action; the horizontal scaling action comprises: scaling of the number of VNFs; the vertical scaling action comprises: scaling of CPU, memory and/or storage resources occupied by the VNF;
the scheduling processing module is suitable for scheduling the VNF to be scheduled based on the scaling action;
wherein the training module is further adapted to, starting from j = 1: S1, input any unselected j-th state data randomly selected from the historical state data into the actor network to obtain the corresponding scaling action; and input the j-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action;
S2, store the j-th state data, the scaling action, the reward value, and the state data after the scaling action is executed into a replay memory as one piece of training data, wherein the state data after the scaling action is executed is defined as the i-th state data;
S3, input the i-th state data into the actor network to obtain the corresponding scaling action; and input the i-th state data and the corresponding scaling action into the critic network to obtain the reward value corresponding to the scaling action;
S4, store the i-th state data, the scaling action, the reward value, and the (i+1)-th state data after the scaling action is executed into the replay memory as one piece of training data; assign i = i + 1, and repeat S3-S4 until the number of pieces of training data stored in the replay memory is greater than or equal to a preset threshold;
S5, randomly select several pieces of training data from the replay memory to calculate the loss function of the critic network, and update the critic network with the loss function; and update the actor network based on the updated critic network;
S6, assign j = j + 1;
S7, judge whether j is greater than n; if so, the training of the deep reinforcement scheduling model is finished; if not, jump back to S1.
5. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface are communicated with each other through the communication bus;
the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the deep reinforcement learning-based VNF scheduling method of any one of claims 1-3.
6. A computer storage medium having stored therein at least one executable instruction to cause a processor to perform operations corresponding to the deep reinforcement learning-based VNF scheduling method of any one of claims 1-3.
CN201910704763.7A 2019-07-31 2019-07-31 VNF scheduling method and device based on deep reinforcement learning Active CN112311578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910704763.7A CN112311578B (en) 2019-07-31 2019-07-31 VNF scheduling method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910704763.7A CN112311578B (en) 2019-07-31 2019-07-31 VNF scheduling method and device based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112311578A CN112311578A (en) 2021-02-02
CN112311578B true CN112311578B (en) 2023-04-07

Family

ID=74485778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910704763.7A Active CN112311578B (en) 2019-07-31 2019-07-31 VNF scheduling method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112311578B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420326B (en) * 2021-06-08 2022-06-21 浙江工业大学之江学院 Deep reinforcement learning-oriented model privacy protection method and system
CN113810954B (en) * 2021-09-08 2023-12-29 国网宁夏电力有限公司信息通信公司 Virtual resource dynamic expansion and contraction method based on flow prediction and deep reinforcement learning
CN113886095A (en) * 2021-12-08 2022-01-04 北京广通优云科技股份有限公司 Container memory elastic expansion method based on combination of fuzzy reasoning and reinforcement learning
CN114745392A (en) * 2022-04-29 2022-07-12 阿里云计算有限公司 Flow scheduling method
CN116610454B (en) * 2023-07-17 2023-10-17 中国海洋大学 MADDPG algorithm-based hybrid cloud resource elastic expansion system and operation method
CN117094376B (en) * 2023-10-19 2024-02-23 浪潮电子信息产业股份有限公司 Task processing method, device, system, equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107534570A (en) * 2015-06-16 2018-01-02 慧与发展有限责任合伙企业 Virtualize network function monitoring
CN108881028A (en) * 2018-06-06 2018-11-23 北京邮电大学 The SDN network resource regulating method of application perception is realized based on deep learning
CN108965024A (en) * 2018-08-01 2018-12-07 重庆邮电大学 A kind of virtual network function dispatching method of the 5G network slice based on prediction
CN109617738A (en) * 2018-12-28 2019-04-12 优刻得科技股份有限公司 Method, system and the non-volatile memory medium of the scalable appearance of user service
CN110275758A (en) * 2019-05-09 2019-09-24 重庆邮电大学 A kind of virtual network function intelligence moving method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107534570A (en) * 2015-06-16 2018-01-02 慧与发展有限责任合伙企业 Virtualize network function monitoring
CN108881028A (en) * 2018-06-06 2018-11-23 北京邮电大学 The SDN network resource regulating method of application perception is realized based on deep learning
CN108965024A (en) * 2018-08-01 2018-12-07 重庆邮电大学 A kind of virtual network function dispatching method of the 5G network slice based on prediction
CN109617738A (en) * 2018-12-28 2019-04-12 优刻得科技股份有限公司 Method, system and the non-volatile memory medium of the scalable appearance of user service
CN110275758A (en) * 2019-05-09 2019-09-24 重庆邮电大学 A kind of virtual network function intelligence moving method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Reinforcement Learning - 1. DDPG Principle and Algorithm (DDPG原理和算法); kenneth_yu; CSDN; 2017-11-08; pp. 1-6 *

Also Published As

Publication number Publication date
CN112311578A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN112311578B (en) VNF scheduling method and device based on deep reinforcement learning
Jiao et al. Toward an automated auction framework for wireless federated learning services market
CN111091200B (en) Updating method and system of training model, intelligent device, server and storage medium
US11809977B2 (en) Weakly supervised reinforcement learning
CN113015219B (en) Network resource selection method and device based on strategy gradient and storage medium
WO2017197330A1 (en) Two-stage training of a spoken dialogue system
CN116257363B (en) Resource scheduling method, device, equipment and storage medium
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
CN113962390A (en) Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114882307A (en) Classification model training and image feature extraction method and device
CN114090108B (en) Method and device for executing computing task, electronic equipment and storage medium
CN115892067B (en) Driving method and device of target vehicle, storage medium and electronic device
CN113825165A (en) 5G slice network congestion early warning method and device based on time chart network
CN115330556B (en) Training method, device and product of information adjustment model of charging station
WO2017062984A1 (en) Continual learning in slowly-varying environments
CN116367190A (en) Digital twin function virtualization method for 6G mobile network
CN114613159B (en) Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN116974584A (en) Model deployment method, device, equipment and storage medium
CN111461228B (en) Image recommendation method and device and storage medium
CN112242959B (en) Micro-service current-limiting control method, device, equipment and computer storage medium
CN112418349A (en) Distributed multi-agent deterministic strategy control method for large complex system
CN114528893A (en) Machine learning model training method, electronic device and storage medium
CN113949633A (en) 5G network slice disaster recovery pool resource management method and device based on machine learning
CN114374608B (en) Slice instance backup task scheduling method and device and electronic equipment
CN113573264A (en) Pricing processing method and device of 5G slice based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant