CN114154566A - Edge computing active service method and system based on deep reinforcement learning - Google Patents

Edge computing active service method and system based on deep reinforcement learning

Info

Publication number
CN114154566A
CN114154566A
Authority
CN
China
Prior art keywords
user
intention
model
function
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111370645.0A
Other languages
Chinese (zh)
Inventor
缪巍巍
张明轩
曾锃
黄进
张瑞
张震
李世豪
滕昌志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202111370645.0A
Publication of CN114154566A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses an edge computing active service method and system based on deep reinforcement learning, wherein the method comprises the following steps: 1) extracting user characteristic information and, at the same time, extracting the user intention classification; 2) pre-training an intention pre-judgment model through a deep neural network, whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the optimized model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the intention pre-judgment model as the representation vector; 3) optimizing the DDPG model through online exploration; 4) setting a reinforcement-learning reward function, wherein if the user uses one of the services, the reward value is 1, otherwise it is 0, and pre-judging the user resource request according to the reward value. The method improves the service efficiency of edge nodes and user satisfaction.

Description

Edge computing active service method and system based on deep reinforcement learning
Technical Field
The invention relates to an edge computing active service system and method based on deep reinforcement learning, and belongs to the technical field of edge computing.
Background
While a user of edge computing (such as an AR user or an intrusion-detection terminal device) interacts with an edge node, the edge node may provide active edge services, such as computation offloading and edge caching, according to the user's load condition, so as to improve the user experience. If the user's performance bottleneck can be pre-judged in advance, the user can be served proactively based on his or her usage information, further improving the experience. Pre-judging the user's load condition and providing active service can effectively improve user satisfaction; existing methods mainly comprise the following:
1) Configuration based on manual rules: according to user preferences, historical loads and the like, relevant rules are configured manually to pre-judge user resource demands; for example, video resources can be deployed in advance for users who like watching movies, and more computing resources can be pre-allocated for users who enjoy playing games.
Problems with manual rule configuration:
a) expert domain knowledge and a large amount of manual involvement are required;
b) user resource demands may be changeable and complex, so the configuration has to be adjusted step by step;
c) user portraits, application information and the like are very complex, with thousands of features, so it is difficult to configure reasonable rules manually.
2) Methods based on supervised learning: supervised models (neural networks, tree models and the like) are trained on user features, historical loads and so on; user resource demands are predicted through multi-class classification and deployed in advance.
Problems with supervised learning:
a) user resource demands are sequential: the quality of service of the previous resource request influences the user's next demand, which supervised learning has difficulty taking into account;
b) as services evolve, the resource-request characteristics of a user's different applications also change; every time an application is updated, supervised learning must retrain the model, which is computationally expensive and time-consuming.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: in an edge computing scenario, user resource requests are sequential and may change dynamically, and existing methods that retrain a model for each change are computationally expensive and time-consuming.
The working principle of the technical scheme adopted by the invention is as follows: the invention pre-judges user resource demand in the following two scenarios:
edge caching: when a user browses video resources and the like, if the user's video requests can be predicted, content can be cached at the edge in advance, providing the user with faster bandwidth resources;
compute-intensive applications: for application requests such as gaming and data computation, if the user's request can be predicted, more efficient computing services can be provided proactively for the user's computing tasks, improving the user's computational efficiency.
For these situations, the method and system pre-judge user resource requests in a reinforcement-learning manner: each time, different resource services are explored and provided to the user, rewards are obtained through user clicks or other feedback, and the final objective is to maximize the user's long-term cumulative reward, i.e. user satisfaction.
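Stated formally, this objective is the standard discounted-return objective of reinforcement learning; the formula below is an assumed formalization (the patent states the goal only in words), using the 0/1 usage reward defined in step 4) below and the discount factor γ = 0.95 from the detailed description:

```latex
% Long-term cumulative (discounted) reward maximized by the active-service
% policy \pi; r_t is the 0/1 usage reward of step 4).
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right],
\qquad
r_{t} =
\begin{cases}
1, & \text{if the user uses the pushed service at step } t,\\
0, & \text{otherwise,}
\end{cases}
\qquad \gamma = 0.95 .
```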
The technical scheme of the invention is as follows:
an edge computing active service method based on deep reinforcement learning comprises the following steps:
1) extracting user characteristic information, comprising a user portrait, the user's application load within a set period, the user's position and the like, and extracting the user intention classification;
2) pre-training an intention pre-judgment model through a deep neural network, wherein the model is a multi-class neural network whose inputs are the user portrait, the user's application load within a set period and the user's position, and whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the trained model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the trained model as the representation vector;
3) optimizing the DDPG model through online exploration;
4) setting a reinforcement-learning reward function: if the user uses the service corresponding to an intention, the reward value is 1, otherwise it is 0; while interacting with the user, the active service system pre-judges the user resource request according to the reward value and selects the action that maximizes the critic value function, i.e. provides the corresponding service.
An edge computing active service system based on deep reinforcement learning comprises the following program modules;
a feature extraction module: extracts user characteristic information, comprising a user portrait, the user's application load within a set period, the user's position and the like, and extracts the user intention classification;
a neural network training module: pre-trains an intention pre-judgment model through a deep neural network, wherein the model is a multi-class neural network whose inputs are the user portrait, the user's application load within a set period and the user's position, and whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the trained model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the trained model as the representation vector;
a model optimization module: optimizes the DDPG model through online exploration;
a pre-judgment module: sets a reinforcement-learning reward function, wherein if the user uses the service corresponding to an intention, the reward value is 1, otherwise it is 0; while interacting with the user, the active service system pre-judges the user resource request according to the reward value and selects the action that maximizes the critic value function, i.e. provides the corresponding service.
The invention achieves the following beneficial effects: through deep reinforcement learning, the method and system actively push services to the user in a dynamic environment and optimize the quality of the pushed services through continuous trial and error, which improves the service efficiency of the edge node and user satisfaction. Meanwhile, user intentions can be added or removed dynamically; through reinforcement learning the model updates itself automatically and provides optimal intention pre-judgment results for the user's sequential behavior.
Drawings
FIG. 1 is a flowchart of the edge computing active service method based on deep reinforcement learning according to the present invention;
FIG. 2 is a structural diagram of the DDPG model of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings and specific embodiments.
Example 1
As shown in FIG. 1, an edge computing active service method based on deep reinforcement learning according to the present invention comprises the following steps:
1) extracting user characteristic information, comprising a user portrait, the user's application load within a set period, the user's position and the like, and extracting the user intention classification;
2) pre-training an intention pre-judgment model through a deep neural network, wherein the model is a multi-class neural network whose inputs are the user portrait, the user's application load within a set period and the user's position, and whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the trained model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the trained model as the representation vector; because the last layer of the intention pre-judgment model is the normalized exponential function softmax and is irrelevant to the downstream task, the output of an intermediate network layer is used as the representation vector, following the transfer-learning approach (a minimal sketch of such a model is given after this step list);
3) optimizing the DDPG model through online exploration, which specifically comprises the following steps:
31) reinforcement learning is implemented with the DDPG (deep deterministic policy gradient) algorithm: the actor network takes the representation vector obtained in step 2) as input, and the DDPG algorithm outputs the storage or computing service to be provided to the user;
32) the critic network predicts the long-term return after a service is provided, taking the representation vector and the pushed service as input, and is optimized via the temporal-difference error:
L(w) = E[(r + γQ(s', a', w) - Q(s, a, w))^2]
wherein Q denotes the critic network, s is the current environment state, a is the selected service action, and w are the parameters of the critic network; s' and a' are respectively the state and action at the next moment, r is the reward function, and γ is the discount factor, typically 0.95; L(w) is the loss to be minimized, E[·] denotes the expectation, and a' is the action that maximizes the critic network Q(s', a', w);
33) the DDPG algorithm explores dynamically through Ornstein-Uhlenbeck (OU) noise;
4) setting a reinforcement-learning reward function: if the user uses the service corresponding to an intention, the reward value is 1, otherwise it is 0; while interacting with the user, the active service system pre-judges the user resource request according to the reward value and selects the action that maximizes the critic value function, i.e. provides the corresponding service. Through interaction with the user, the reinforcement-learning DDPG algorithm is used to optimize the policy function, i.e. the active service model, improving the pre-judgment accuracy;
5) when a new user demand is added, the deep neural network of step 2) is kept unchanged while the actor network output and the critic network input of step 3) are modified, so that the new intention is explored dynamically and the user's click-through rate is improved.
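As a concrete illustration of steps 2) and 5), the following is a minimal PyTorch sketch of the intention pre-judgment model; the layer sizes, feature dimension, and names such as IntentModel are illustrative assumptions, since the patent does not specify a concrete architecture:

```python
# Hedged sketch of the intention pre-judgment model of step 2).
# All dimensions and names are assumptions, not taken from the patent.
import torch
import torch.nn as nn

class IntentModel(nn.Module):
    def __init__(self, feature_dim: int, num_intents: int, hidden_dim: int = 128):
        super().__init__()
        # Backbone; its last hidden layer doubles as the representation vector.
        self.backbone = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, num_intents)  # logits for softmax

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))  # raw logits

    def representation(self, x: torch.Tensor) -> torch.Tensor:
        # Second-to-last layer output, reused as the DDPG state representation;
        # the softmax head is task-specific and dropped (transfer learning).
        return self.backbone(x)

# Pre-training with cross-entropy (the softmax is folded into the loss).
model = IntentModel(feature_dim=64, num_intents=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 64)         # user portrait, application load, position
labels = torch.randint(0, 10, (32,))   # intention categories
loss = criterion(model(features), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()

intent = model(features).softmax(dim=-1).argmax(dim=-1)  # current intention category
state = model.representation(features)                   # input to the DDPG model
```

Because only the backbone is reused as the representation, intention categories can be added or removed (step 5) without changing the input fed to the DDPG model.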
The structure of the DDPG model is shown in FIG. 2; the model works through the following specific steps (a sketch of this loop follows the list):
1) a computing or storage service is pushed to the user according to the policy function; during training, OU noise is added to the policy output and the action that maximizes the critic value function is selected; during testing, the action that maximizes the critic value function is selected directly; the policy function refers to the output value of the policy network, which outputs the corresponding action, namely the pushed service, for each state;
2) at the user side, the user chooses whether to use the pushed service;
3) the reward is obtained according to the user's choice, and the value function and the policy function are updated simultaneously;
4) return to step 1) and repeat the cycle.
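The following is a compact sketch of this working loop, assuming a continuous action vector whose components score the candidate services; the network sizes, OU-noise parameters, and interaction interface are assumptions, and the target networks and replay buffer of full DDPG are omitted for brevity:

```python
# Hedged sketch of the DDPG working loop (steps 1)-4) above).
# Sizes, noise parameters, and the interaction interface are assumptions.
import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 128, 8  # representation vector -> service action scores

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())   # policy network
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                      # Q(s, a, w)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.95  # discount factor from the description

class OUNoise:
    """Ornstein-Uhlenbeck noise for dynamic exploration during training."""
    def __init__(self, dim: int, theta: float = 0.15, sigma: float = 0.2):
        self.theta, self.sigma, self.x = theta, sigma, np.zeros(dim)

    def sample(self) -> torch.Tensor:
        self.x += -self.theta * self.x + self.sigma * np.random.randn(*self.x.shape)
        return torch.as_tensor(self.x, dtype=torch.float32)

def q(s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    return critic(torch.cat([s, a], dim=-1))

def train_step(s, a, r, s_next):
    # Temporal-difference update: L(w) = E[(r + gamma*Q(s',a',w) - Q(s,a,w))^2].
    with torch.no_grad():
        target = r + gamma * q(s_next, actor(s_next))  # a' from the policy
    critic_loss = ((q(s, a) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Policy update: push services that maximize the critic's value estimate
    # (gradients this loss leaves on the critic are cleared by the next zero_grad).
    actor_loss = -q(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

noise = OUNoise(ACTION_DIM)
s = torch.randn(1, STATE_DIM)               # representation vector from the intent model
a = actor(s).detach() + noise.sample()      # push a service, exploring with OU noise
used = True                                 # whether the user used the pushed service
r = torch.tensor([[1.0 if used else 0.0]])  # reward: 1 if the service was used, else 0
s_next = torch.randn(1, STATE_DIM)
train_step(s, a, r, s_next)                 # update value and policy, then repeat
```

At test time one would skip the noise term and push the service whose action maximizes the critic's value estimate, matching step 1) above.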
An edge computing active service system based on deep reinforcement learning comprises the following program modules;
a feature extraction module: extracts user characteristic information, comprising a user portrait, the user's application load within a set period, the user's position and the like, and extracts the user intention classification;
a neural network training module: pre-trains an intention pre-judgment model through a deep neural network, wherein the model is a multi-class neural network whose inputs are the user portrait, the user's application load within a set period and the user's position, and whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the trained model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the trained model as the representation vector;
a model optimization module: optimizes the DDPG model through online exploration;
a pre-judgment module: sets a reinforcement-learning reward function, wherein if the user uses the service corresponding to an intention, the reward value is 1, otherwise it is 0; while interacting with the user, the active service system pre-judges the user resource request according to the reward value and selects the action that maximizes the critic value function, i.e. provides the corresponding service.
an enhancement module: when a new user demand is added, the deep neural network in the neural network training module is kept unchanged while the actor network output and the critic network input in the model optimization module are modified, so that the new intention is explored dynamically and the user's click-through rate is improved.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. An edge computing active service method based on deep reinforcement learning is characterized by comprising the following steps:
1) extracting user characteristic information, comprising a user portrait, the user's application load within a set period and the user's position, and extracting the user intention classification;
2) pre-training an intention pre-judgment model through a deep neural network, wherein the model is a multi-class neural network whose inputs are the user portrait, the user's application load within a set period and the user's position, and whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the trained model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the trained model as the representation vector;
3) optimizing the DDPG model through online exploration;
4) setting a reinforcement-learning reward function: if the user uses the service corresponding to an intention, the reward value is 1, otherwise it is 0; while interacting with the user, the active service system pre-judges the user resource request according to the reward value and selects the action that maximizes the critic value function, i.e. provides the corresponding service.
2. The edge computing active service method based on deep reinforcement learning of claim 1, further comprising:
5) when a new user demand is added, the deep neural network of step 2) is kept unchanged while the actor network output and the critic network input of step 3) are modified, so that the new intention is explored dynamically and the user's click-through rate is improved.
3. The edge computing active service method based on deep reinforcement learning according to claim 1 or 2, wherein step 3) specifically comprises the following steps:
31) reinforcement learning is implemented with the reinforcement-learning DDPG algorithm: the actor network takes the representation vector obtained in step 2) as input, and the DDPG algorithm outputs the storage or computing service to be provided to the user;
32) the critic network predicts the long-term return after a service is provided, taking the representation vector and the pushed service as input, and is optimized via the temporal-difference error:
L(w) = E[(r + γQ(s', a', w) - Q(s, a, w))^2]
wherein Q denotes the critic network, s is the current environment state, a is the selected service action, and w are the parameters of the critic network; s' and a' are respectively the state and action at the next moment, r is the reward function, and γ is the discount factor; L(w) is the loss to be minimized, E[·] denotes the expectation, and a' is the action that maximizes the critic network Q(s', a', w);
33) the DDPG algorithm explores dynamically through Ornstein-Uhlenbeck (OU) noise.
4. The edge computing active service method based on deep reinforcement learning according to claim 1 or 2, characterized in that the DDPG model comprises the following specific working steps:
1) a computing or storage service is pushed to the user according to the policy function; during training, OU noise is added to the policy output and the action that maximizes the critic value function is selected; during testing, the action that maximizes the critic value function is selected directly; the policy function refers to the output value of the policy network, which outputs the corresponding action, namely the pushed service, for each state;
2) at the user side, the user chooses whether to use the pushed service;
3) the reward is obtained according to the user's choice, and the value function and the policy function are updated simultaneously;
4) return to step 1) and repeat the cycle.
5. An edge computing active service system based on deep reinforcement learning is characterized by comprising the following program modules;
a feature extraction module: extracts user characteristic information, comprising a user portrait, the user's application load within a set period and the user's position, and extracts the user intention classification;
a neural network training module: pre-trains an intention pre-judgment model through a deep neural network, wherein the model is a multi-class neural network whose output is a multi-class user intention probability produced by the normalized exponential function softmax; the model is then optimized with a cross-entropy loss function, the optimized model outputs the category of the current intention, and a DDPG model is established using the second-to-last layer of the model as the representation vector;
a model optimization module: optimizes the DDPG model through online exploration;
a pre-judgment module: sets a reinforcement-learning reward function, wherein if the user uses one of the services, the reward value is 1, otherwise it is 0; while interacting with the user, pre-judges the user resource request according to the reward value and selects the action that maximizes the critic value function.
6. The deep reinforcement learning-based edge computing active service system according to claim 5, further comprising:
an enhancement module: when a new user demand is added, the deep neural network in the neural network training module is kept unchanged while the actor network output and the critic network input in the model optimization module are modified, so that the new intention is explored dynamically and the user's click-through rate is improved.
7. The deep reinforcement learning-based edge computing active service system according to claim 5, wherein the DDPG model comprises the following specific working steps:
1) a computing or storage service is pushed to the user according to the policy function; during training, OU noise is added to the policy output and the action that maximizes the critic value function is selected; during testing, the action that maximizes the critic value function is selected directly; the policy function refers to the output value of the policy network, which outputs the corresponding action, namely the pushed service, for each state;
2) at the user side, the user chooses whether to use the pushed service;
3) the reward is obtained according to the user's choice, and the value function and the policy function are updated simultaneously;
4) return to step 1) and repeat the cycle.
CN202111370645.0A 2021-11-18 2021-11-18 Edge computing active service method and system based on deep reinforcement learning Pending CN114154566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111370645.0A CN114154566A (en) 2021-11-18 2021-11-18 Edge computing active service method and system based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114154566A (en) 2022-03-08

Family

ID=80457057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111370645.0A (en) Pending CN114154566A (en) Edge computing active service method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114154566A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination