CN117521717A - Improved DDPG strategy method based on HER and ICM realization - Google Patents

Improved DDPG strategy method based on HER and ICM realization

Info

Publication number
CN117521717A
CN117521717A (application CN202311242639.6A)
Authority
CN
China
Prior art keywords
experience
network
icm
state
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311242639.6A
Other languages
Chinese (zh)
Inventor
臧兆祥
李思博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202311242639.6A
Publication of CN117521717A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An improved DDPG strategy method based on HER and ICM realization. Step 1: create an experimental environment and construct the DDPG, HER and ICM models; step 2: set training parameters and create an experience pool; step 3: initialize the networks and optimizers; step 4: run the model in the environment for training and store states, actions, rewards and related information in the experience pool; step 5: process the samples in the experience pool with the HER algorithm to generate new samples; step 6: calculate rewards with the ICM and integrate them; step 7: update the network parameters to train the model. The invention aims to solve the technical problems of slow agent learning and narrow sample coverage caused by sparse rewards and low exploration efficiency in the conventional DDPG algorithm, and for this purpose provides an improved DDPG strategy method based on HER and ICM.

Description

Improved DDPG strategy method based on HER and ICM realization
Technical Field
The invention relates to the technical field of deep learning, in particular to an improved DDPG strategy method based on HER and ICM realization.
Background
Deep reinforcement learning is widely applied in games, robot control, autonomous driving, finance, resource management, natural language processing, medical care and other fields. Strategy methods realized with deep reinforcement learning help improve an agent's autonomous decision-making, adaptability and learning capability, enable advanced strategies and complex behaviors, alleviate the sparse reward problem, and support multi-agent cooperation.
The patent document with application publication number CN116533249A discloses a mechanical arm control method based on the deep reinforcement learning algorithm DDPG, and the patent document with application publication number CN116321057A discloses a vehicle crowd-sensing user recruitment method based on deep reinforcement learning DDPG. The deep reinforcement learning algorithms described above still fall short in several respects:
1) The sparse reward problem: in many cases the agent receives a positive reward signal only when the goal is reached or the task is completed, while the reward received at other time steps is small or zero;
2) Low exploration efficiency: in many cases the environment contains a large number of unknown states and action combinations, but conventional reward signals may fail to effectively guide the agent to explore these unknown areas, which makes it difficult for the agent to discover new, valuable information efficiently.
Therefore, an improved DDPG strategy method based on HER and ICM is proposed; starting from the experience samples and the exploration scheme, it optimizes DDPG against the problems described above.
Disclosure of Invention
The invention aims to solve the technical problems of slow agent learning and narrow sample coverage caused by sparse rewards and low exploration efficiency in the conventional DDPG algorithm, and provides an improved DDPG strategy method based on HER and ICM.
In order to solve the technical problems, the invention adopts the following technical scheme:
an improved DDPG policy method based on HER and ICM implementation, comprising the steps of:
step 1) creating an experimental environment and constructing models of DDPG, HER and ICM;
step 2) setting training parameters and creating an experience pool;
step 3) initializing a network and an optimizer;
step 4) run the model in the environment for training, and store the state, action, reward and related information in the experience pool;
step 5) processing the samples in the experience pool using a HER algorithm to generate new samples;
step 6) calculating rewards and integrating rewards using the ICM;
step 7) calculating a loss function, and updating a network parameter training model by using a gradient descent method;
the DDPG strategy is realized through the steps.
In step 1), the method specifically comprises the following steps:
1-1) constructing a basic algorithm DDPG model, wherein the basic algorithm DDPG model comprises a Critic network, an Actor network and a learning function; the Critic network is used for estimating the Q value of the state-action and providing the value information of the action; the Actor network is used for selecting actions according to the states, and optimizing strategies to obtain the maximum cumulative rewards;
The Critic network and the Actor network are each built from three fully connected layers; the computation of a fully connected layer is shown in formula (1):
a_i = f_i(W_i × a_{i-1} + b_i) (1)
where a_i is the activation output of the i-th layer, f_i is the activation function of the i-th layer, W_i is the weight matrix of the i-th layer, a_{i-1} is the output of the (i-1)-th layer, and b_i is the bias vector of the i-th layer;
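The following is a minimal PyTorch sketch of the three-layer fully connected Actor and Critic described in step 1-1). The layer widths, activation choices and the action bound are assumptions of this sketch, not values specified by the invention.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a bounded continuous action (three fully connected layers)."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        # Each layer computes a_i = f_i(W_i * a_{i-1} + b_i) as in formula (1)
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # bounded continuous output
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Estimates Q(s, a); state and action are concatenated at the input."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```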
1-2) constructing an ICM network model, comprising a forward model and a reverse model; the forward model predicts the next state from the current state and action, and the reverse model predicts the executed action from the current state and the next state; the network parts of both models are also built from fully connected layers;
The forward model is computed as shown in formula (2):
ŝ_{t+1} = f(s_t, a_t) (2)
where f is the forward network model, s_t is the current observed state, a_t is the current action, f(s_t, a_t) is the output of the forward model, and ŝ_{t+1} is the predicted next state;
The reverse model is computed as shown in formula (3):
â_t = g(s_t, s_{t+1}) (3)
where g is the reverse network model, s_t is the current observed state, s_{t+1} is the next state, g(s_t, s_{t+1}) is the output of the reverse model, and â_t is the predicted action;
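A minimal sketch of the two ICM sub-networks of step 1-2), again using fully connected layers. The hidden width and the `ICM` class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        # Forward model f: (s_t, a_t) -> predicted s_{t+1}, formula (2)
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
        # Reverse model g: (s_t, s_{t+1}) -> predicted a_t, formula (3)
        self.reverse_model = nn.Sequential(
            nn.Linear(state_dim * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def predict_next_state(self, s_t, a_t):
        return self.forward_model(torch.cat([s_t, a_t], dim=-1))

    def predict_action(self, s_t, s_next):
        return self.reverse_model(torch.cat([s_t, s_next], dim=-1))
```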
1-3) constructing the computational functions required by the HER algorithm, including the reward recalculation function shown in formula (4) and the new-target generation function.
In step 2), the method specifically comprises the following steps:
2-1) setting the hyperparameters, including the number of training rounds, the number of steps per round, the learning sample size, the learning rate, the experience pool size and the HER new-target number; the number of training rounds is the number of games run; the number of steps per round is the maximum length of each game, and if the steps are exhausted the game is forcibly ended and judged a failure; the learning sample size is the number of experience samples drawn at each learning step; the learning rate is the learning efficiency; the experience pool size is the number of experience samples the pool can hold, and once the pool is full the oldest experience samples are released first (first in, first out); the HER new-target number is the number of new targets to select, used for the reward calculation of new experience samples;
2-2) creating an experience pool for storing the current state, reward, action, next state and termination information; the experience pool is divided into a total experience pool and a temporary experience pool: the total experience pool stores all experience samples, the temporary experience pool stores each round's group of experience samples for the HER calculation, and the temporary experience pool is emptied at the end of each round.
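A minimal sketch of the two pools described in step 2-2): a capacity-bounded total pool with first-in-first-out release, and a per-round temporary pool. The class and variable names are assumptions of this sketch.

```python
from collections import deque
import random

class ExperiencePool:
    def __init__(self, capacity=10000):
        # deque with maxlen drops the oldest samples once the pool is full (FIFO)
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

total_pool = ExperiencePool(capacity=10000)   # stores all experience samples
temporary_pool = []                           # cleared at the end of each round
```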
In step 4), the method specifically comprises the following steps:
4-1) resetting the experimental environment, and resetting the environment parameters and the temporary experience pool before training begins;
4-2) selecting an action with the DDPG network, running the agent in the environment with that action, and storing the resulting current state, action, reward, next state and termination flag in the experience pool.
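A sketch of one interaction step of step 4-2). The Gaussian exploration noise and the `run_one_step` helper are assumptions (the text only says the action is selected through DDPG); the environment follows the Gym interface.

```python
import numpy as np
import torch

def run_one_step(env, actor, state, total_pool, temporary_pool, noise_std=0.1):
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        action = actor(state_t).squeeze(0).numpy()
    # Exploration noise (assumed), clipped to the valid action range
    action = np.clip(action + np.random.normal(0.0, noise_std, size=action.shape),
                     env.action_space.low, env.action_space.high)
    # Older Gym returns 4 values; newer Gym/Gymnasium also returns a `truncated` flag
    next_state, reward, done, *_ = env.step(action)
    total_pool.store(state, action, reward, next_state, done)
    temporary_pool.append((state, action, reward, next_state, done))
    return next_state, reward, done
```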
In step 5), the method specifically comprises the following steps:
5-1) the HER algorithm takes the states of several experiences as new targets by randomly selecting from the last experience samples of each round's temporary experience pool;
5-2) for each experience in the temporary experience pool, the reward is recalculated from the new target with the reward function, the recalculated reward replaces the reward in the initial experience sample to obtain a new experience sample, and the new experience sample is stored in the experience pool; the algorithm structure is shown in fig. 2; the recalculated reward is computed as shown in formula (4):
R(s_t, a_t, s_g) = R + λ(-α·distance(s_t, s_g)) (4)
where the current state is s_t, the action performed is a_t, the target state is s_g, distance denotes the Euclidean distance between states, α is a positive coefficient used to adjust the magnitude of the virtual reward (its value is usually tuned to the characteristics of the problem), R is the reward value in the initial experience sample, λ is an impact factor, and R(s_t, a_t, s_g) is the recalculated reward value.
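A sketch of the HER relabelling of step 5). The choice of λ and α, the way new targets are drawn from the temporary pool, and the helper names are assumptions of this sketch; only formula (4) itself is taken from the text.

```python
import random
import numpy as np

def her_recompute_reward(s_t, s_g, r_old, lambda_=1.0, alpha=1.0):
    # Formula (4): R(s_t, a_t, s_g) = R + lambda * (-alpha * distance(s_t, s_g))
    distance = np.linalg.norm(np.asarray(s_t) - np.asarray(s_g))
    return r_old + lambda_ * (-alpha * distance)

def her_relabel(temporary_pool, total_pool, n_new_targets=4):
    if not temporary_pool:
        return
    # States of randomly chosen experiences from the round serve as new targets
    new_targets = [random.choice(temporary_pool)[0] for _ in range(n_new_targets)]
    for (s, a, r, s_next, done) in temporary_pool:
        for s_g in new_targets:
            new_r = her_recompute_reward(s, s_g, r)
            total_pool.store(s, a, new_r, s_next, done)  # relabelled sample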
In step 6), the method specifically comprises the following steps:
6-1) the ICM algorithm structure is as shown in figure 3, training is started when the quantity of experience data in the experience pool reaches a set quantity, experience in the experience pool is extracted, and a forward model of the ICM algorithm predicts the next state according to the state and action in the experience data;
6-2) calculating the predicted loss of the predicted next state and the next state in the experience data;
6-3) the ICM intrinsic reward is obtained from this loss calculation, as shown in formula (5):
R_icm = β·‖ŝ_{t+1} - s_{t+1}‖² + (1 - β)·‖â_t - a_t‖² (5)
where R_icm is the ICM intrinsic reward, ‖…‖² is the squared norm of a vector, ŝ_{t+1} is the predicted next state, s_{t+1} is the next state, â_t is the predicted action, a_t is the action in the experience sample, and β and (1 - β) are the impact factors of the forward model and the reverse model respectively;
the ICM reward is added to the reward of the experience sample to obtain the final training reward; the final reward value is computed as shown in formula (6):
R = R_exp + σ·R_icm (6)
where R is the final training reward, R_exp is the reward in the experience sample, R_icm is the ICM intrinsic reward, and σ is a hyperparameter controlling the impact factor of the intrinsic reward.
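A sketch of formulas (5) and (6) using the `ICM` sketch given earlier. The β and σ defaults are assumptions, and training of the ICM networks themselves (the loss of step 6-2) is omitted here for brevity.

```python
import torch

def icm_reward(icm, s_t, a_t, s_next, beta=0.2):
    with torch.no_grad():
        s_pred = icm.predict_next_state(s_t, a_t)   # forward model prediction
        a_pred = icm.predict_action(s_t, s_next)    # reverse model prediction
    forward_err = ((s_pred - s_next) ** 2).sum(dim=-1)   # ||s_pred - s_{t+1}||^2
    reverse_err = ((a_pred - a_t) ** 2).sum(dim=-1)      # ||a_pred - a_t||^2
    # Formula (5): weighted sum of forward and reverse prediction errors
    return beta * forward_err + (1.0 - beta) * reverse_err

def combined_reward(r_exp, r_icm, sigma=0.1):
    # Formula (6): R = R_exp + sigma * R_icm
    return r_exp + sigma * r_icm
```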
In step 7), the method specifically comprises the following steps:
7-1) calculating the mean squared error loss of the Critic network; the loss function is computed as shown in formula (7):
L(θ) = (R + γ·Q(s', a'; θ⁻) - Q(s, a; θ))² (7)
where Q(s, a; θ) is the Q value of taking action a in the current state s, computed with the neural network parameters θ, R is the reward value computed with formula (6), γ is the discount factor weighing the importance of future rewards, s' is the new state after executing action a, a' is the action selected for state s' (in DDPG, given by the Actor's target network), and θ⁻ are the parameters of the target network used to compute the Q value in the target state s'; during training the network parameters θ are updated by minimizing this loss function so that the Q value gradually approaches the optimal Q-value function;
7-2) calculating the policy gradient loss of the Actor network; the loss function is computed as shown in formula (8):
L_actor = -(1/N)·Σ_{i=1}^{N} Q(s_i, μ(s_i)) (8)
where N is the batch size, s_i is the state, μ(s_i) is the action selected by the Actor network in state s_i, and Q(s_i, μ(s_i)) is the Critic network's predicted value for state s_i and the action selected by the Actor network; by computing these two losses, the DDPG algorithm updates the parameters of the Critic network and the Actor network respectively, so that the Critic's value function approximates the true return and the Actor's policy gradually improves to obtain a higher cumulative return; during training, the network parameters are adjusted slightly along the gradient direction of the loss function to achieve the optimization goal.
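A sketch of the step 7) updates, using the network sketches above: the Critic minimizes the squared TD error of formula (7) and the Actor minimizes the policy loss of formula (8). The target networks, the soft-update rate `tau` and the default γ are standard DDPG practice assumed here, not values stated in the text.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next, done = batch  # tensors of shape (N, ...)

    # Critic loss, formula (7): (R + gamma * Q(s', a'; theta^-) - Q(s, a; theta))^2
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * (1.0 - done) * critic_target(s_next, a_next).squeeze(-1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss, formula (8): -1/N * sum_i Q(s_i, mu(s_i))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (standard DDPG practice, assumed)
    for p, tp in zip(critic.parameters(), critic_target.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
    for p, tp in zip(actor.parameters(), actor_target.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
```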
Compared with the prior art, the invention has the following technical effects:
1) The invention uses the HER algorithm to convert failed experiences into more valuable training signals through hindsight experience replay: target states that were not reached are replaced with states that were reached, so failed experiences become useful training signals and the learning process is accelerated. In each training round, training uses not only the current target state but also the alternative target states, helping the agent learn more strategies and adapt to environmental changes.
2) The invention uses the ICM algorithm to improve the model's exploration efficiency through an intrinsic curiosity module. The module predicts the next state from the current observed state and action and compares it with the actual state, which stimulates active exploration by the agent.
Together these improvements raise learning efficiency, enabling the agent to better cope with complex situations and unknown environments.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a schematic diagram of the HER model structure of the present invention;
FIG. 3 is a schematic view of an ICM model structure according to the present invention;
FIG. 4 is a detailed logic flow diagram of the present invention;
FIG. 5 is a game screen according to an embodiment of the present invention.
Detailed Description
An improved DDPG policy method based on HER and ICM implementation, comprising the steps of:
step 1) creating an experimental environment and constructing the models of DDPG (Deep Deterministic Policy Gradient), HER (Hindsight Experience Replay) and ICM (Intrinsic Curiosity Module);
step 2) setting training parameters and creating an experience pool;
step 3) initializing a network and an optimizer;
step 4) run the model in the environment for training, and store the state, action, reward and related information in the experience pool;
step 5) processing the samples in the experience pool using a HER algorithm to generate new samples;
step 6) calculating rewards and integrating rewards using the ICM;
step 7) calculating a loss function, and updating a network parameter training model by using a gradient descent method;
in step 1), the method specifically comprises the following steps:
1-1) constructing a basic algorithm DDPG model, wherein the basic algorithm DDPG model comprises a Critic network, an Actor network and a learning function. Critic networks are used to estimate the Q value of a state-action, providing value information for the action. The Actor network is used to select actions based on the state, optimizing policies to obtain maximum jackpots.
The Critic network and the Actor network are built from fully connected layers; the computation of a fully connected layer is shown in formula (1):
a_i = f_i(W_i × a_{i-1} + b_i) (1)
where a_i is the activation output of the i-th layer, f_i is the activation function of the i-th layer, W_i is the weight matrix of the i-th layer, a_{i-1} is the output of the (i-1)-th layer, and b_i is the bias vector of the i-th layer;
1-2) constructing an ICM network model, including a forward model and a reverse model. The forward model predicts the next state by state and action, and the reverse model predicts the action performed by state and next state. The network portions of the forward model and the reverse model are also constructed from full connectivity.
The forward model is computed as shown in formula (2):
ŝ_{t+1} = f(s_t, a_t) (2)
where f is the forward network model, s_t is the current observed state, a_t is the current action, f(s_t, a_t) is the output of the forward model, and ŝ_{t+1} is the predicted next state.
The reverse model is computed as shown in formula (3):
â_t = g(s_t, s_{t+1}) (3)
where g is the reverse network model, s_t is the current observed state, s_{t+1} is the next state, g(s_t, s_{t+1}) is the output of the reverse model, and â_t is the predicted action.
1-3) constructing the computational functions required by the HER algorithm, including the reward recalculation function and the new-target generation function.
In step 2), the method specifically comprises the following steps:
2-1) setting the hyperparameters, including the number of training rounds, the number of steps per round, the learning sample size, the learning rate, the experience pool size and the HER new-target number. The number of training rounds is the number of games run; the number of steps per round is the maximum length of each game, and if the steps are exhausted the game is forcibly ended and judged a failure; the learning sample size is the number of experience samples drawn at each learning step; the learning rate is the learning efficiency; the experience pool size is the number of experience samples the pool can hold, and once the pool is full the oldest experience samples are released first (first in, first out); the HER new-target number is the number of new targets to select, used for the reward calculation of new experience samples.
2-2) creating an experience pool for storing the current state, reward, action, next state and termination information. The experience pool is divided into a total experience pool and a temporary experience pool: the total experience pool stores all experience samples, the temporary experience pool stores each round's group of experience samples for the HER calculation, and the temporary experience pool is emptied at the end of each round.
In step 4), the method specifically comprises the following steps:
4-1) resetting the experimental environment, resetting the environment parameters and temporary experience pools before training begins.
4-2) selecting an action with the DDPG network, running the agent in the environment with that action, and storing the resulting current state, action, reward, next state and termination flag in the experience pool.
In step 5), the method specifically comprises the following steps:
5-1) the HER algorithm takes the states of several experiences as new targets by randomly selecting from the last experience samples of each round's temporary experience pool.
5-2) for each experience in the temporary experience pool, the reward is recalculated from the new target with the reward function, the recalculated reward replaces the reward in the initial experience sample to obtain a new experience sample, and the new experience sample is stored in the experience pool. The algorithm structure is shown in fig. 2. The recalculated reward is computed as shown in formula (4):
R(s_t, a_t, s_g) = R + λ(-α·distance(s_t, s_g)) (4)
where the current state is s_t, the action performed is a_t, the target state is s_g, distance denotes the Euclidean distance between states, and α is a positive coefficient used to adjust the magnitude of the virtual reward. In general, the value of α is tuned to the characteristics of the problem. R is the reward value in the initial experience sample, λ is an impact factor, and R(s_t, a_t, s_g) is the recalculated reward value.
In step 6), the method specifically comprises the following steps:
6-1) The ICM algorithm architecture is shown in FIG. 3. Training begins when the amount of experience data in the experience pool reaches a set amount: experiences are drawn from the experience pool, and the forward model of the ICM algorithm predicts the next state from the state and action in the experience data.
6-2) calculating a predicted loss for the predicted next state and the next state in the empirical data.
6-3) the ICM intrinsic reward is obtained from this loss calculation, as shown in formula (5):
R_icm = β·‖ŝ_{t+1} - s_{t+1}‖² + (1 - β)·‖â_t - a_t‖² (5)
where R_icm is the ICM intrinsic reward, ‖…‖² is the squared norm of a vector, ŝ_{t+1} is the predicted next state, s_{t+1} is the next state, â_t is the predicted action, a_t is the action in the experience sample, and β and (1 - β) are the impact factors of the forward model and the reverse model, respectively.
The ICM reward is added to the reward of the experience sample to obtain the final training reward; the final reward value is computed as shown in formula (6):
R = R_exp + σ·R_icm (6)
where R is the final training reward, R_exp is the reward in the experience sample, R_icm is the ICM intrinsic reward, and σ is a hyperparameter controlling the impact factor of the intrinsic reward.
In step 7), the method specifically comprises the following steps:
7-1) calculating the mean squared error loss of the Critic network; the loss function is computed as shown in formula (7).
L(θ) = (R + γ·Q(s', a'; θ⁻) - Q(s, a; θ))² (7)
where Q(s, a; θ) is the Q value of taking action a in the current state s, computed with the neural network parameters θ, R is the reward value computed with formula (6), γ is the discount factor weighing the importance of future rewards, s' is the new state after executing action a, a' is the action selected for state s' (in DDPG, given by the Actor's target network), and θ⁻ are the parameters of the target network used to compute the Q value in the target state s'; during training the network parameters θ are updated by minimizing this loss function so that the Q value gradually approaches the optimal Q-value function;
7-2) calculating the policy gradient loss of the Actor network; the loss function is computed as shown in formula (8);
L_actor = -(1/N)·Σ_{i=1}^{N} Q(s_i, μ(s_i)) (8)
where N is the batch size, s_i is the state, μ(s_i) is the action selected by the Actor network in state s_i, and Q(s_i, μ(s_i)) is the Critic network's predicted value for state s_i and the action selected by the Actor network. By computing these two losses, the DDPG algorithm updates the parameters of the Critic network and the Actor network respectively, so that the Critic's value function approximates the true return and the Actor's policy gradually improves to obtain a higher cumulative return. During training, the network parameters are adjusted slightly along the gradient direction of the loss function to achieve the optimization goal.
Examples:
the invention tests in a mountain CarContinuous-v0 environment in the ym functional network, which is a continuous action space and continuous state space environment. A screenshot of the rendering of the environment is shown in fig. 5. By controlling the trolley to left or right, the object is to the flag of the trolley reaching the high point on the right. The trolley is one step to the left or right each time, the score is higher as the time for reaching the flag is shorter, and the purpose of the algorithm is to enable the trolley to reach the flag at the highest speed.
Firstly, an Agent model is built, and an algorithm model is composed of DDPG, ICM and HER algorithms. DDPG includes Critic network, actor network, and learning function. The HER algorithm principle is shown in fig. 2, and the ICM algorithm principle is shown in fig. 3. An experience pool is constructed for storing experiences and initialized.
Initialize the game environment, obtain the action space and state space, initialize the Agent, and set the hyperparameter values: the number of training rounds is 100; 1000 steps per round; 128 learning samples per batch; experience pool size 10000; HER new-target number 4. If the cart reaches the flag on the right plateau within 1000 steps, the game is judged finished and the agent has completed the goal; if the cart still has not reached the flag after 1000 steps, the game is judged finished and the agent has failed the task. The agent receives a penalty score at every step and a reward of 100 points upon reaching the flag on the right plateau; the total score at the end of the final game is used to judge the training result of the algorithm.
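A sketch of the embodiment's setup: the MountainCarContinuous-v0 environment from Gym with the hyperparameters listed above. The `config` dictionary, γ and σ values are assumptions of this sketch.

```python
import gym

env = gym.make("MountainCarContinuous-v0")
state_dim = env.observation_space.shape[0]   # continuous state space (2-dimensional)
action_dim = env.action_space.shape[0]       # continuous action space (1-dimensional)

config = {
    "episodes": 100,        # training rounds
    "max_steps": 1000,      # steps per round before the game is forcibly ended
    "batch_size": 128,      # learning samples per batch
    "pool_size": 10000,     # experience pool size
    "her_new_targets": 4,   # HER new-target number
    "gamma": 0.99,          # assumed discount factor (not specified in the text)
    "sigma": 0.1,           # assumed ICM reward weight (not specified in the text)
}
```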
In each round, the environment, the Agent and the experience pool are first initialized. The initial environment state is obtained, the agent decides which action to execute from this state, and the environment returns the next state, reward, termination information and so on according to the action. The state, action, reward, next state and termination information are taken as one experience and stored in the temporary experience pool. When a round of the game is completed, the group of experiences in the temporary experience pool is taken out, the HER algorithm recalculates the reward values with the new targets to obtain a new group of experience samples, and the new group is stored in the total experience pool. The temporary experience pool is then emptied.
When there are enough experience samples in the experience pool, the DDPG algorithm is invoked to train on them. The state, reward, action, next-state and termination information of a batch of experiences are extracted. The ICM algorithm predicts the next state from the state and action, a loss is computed between the predicted next state and the next state actually obtained, and the ICM reward is computed from the value of this loss. The computed ICM reward is combined with the reward extracted from the experience samples to obtain a new reward, the new reward is used to compute the target Q value, and finally the loss functions are computed to update the network parameters by backpropagation.
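A condensed end-to-end sketch of the training run, tying together the helper sketches from the earlier steps (`Actor`, `Critic`, `ICM`, `ExperiencePool`, `run_one_step`, `her_relabel`, `icm_reward`, `combined_reward`, `ddpg_update`, `config`). The learning rates and other unspecified settings are assumptions, not the patent's reference implementation.

```python
import numpy as np
import torch

# Networks, optimizers and pools built with the sketches from the earlier steps
actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
actor_target, critic_target = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # assumed learning rates
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
icm = ICM(state_dim, action_dim)
total_pool, temporary_pool = ExperiencePool(config["pool_size"]), []

for episode in range(config["episodes"]):
    reset_out = env.reset()
    state = reset_out[0] if isinstance(reset_out, tuple) else reset_out  # handle both Gym APIs
    temporary_pool.clear()
    for step in range(config["max_steps"]):
        state, reward, done = run_one_step(env, actor, state, total_pool, temporary_pool)
        if len(total_pool) >= config["batch_size"]:
            batch = total_pool.sample(config["batch_size"])
            s, a, r, s_next, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                                  for x in zip(*batch))
            # Combine the ICM intrinsic reward with the stored reward (formula (6))
            r = combined_reward(r, icm_reward(icm, s, a, s_next), sigma=config["sigma"])
            ddpg_update((s, a, r, s_next, d), actor, critic, actor_target, critic_target,
                        actor_opt, critic_opt, gamma=config["gamma"])
        if done:
            break
    # HER relabelling of the round's experiences, then empty the temporary pool
    her_relabel(temporary_pool, total_pool, n_new_targets=config["her_new_targets"])
    temporary_pool.clear()
```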
After 100 rounds of training, the model parameters obtained from training are saved, and 100 further rounds are run in a new environment to test the model. The standard DDPG algorithm scores about 80 points, while the DDPG algorithm improved with the HER and ICM algorithms scores about 93 points; the experimental results show that the improved DDPG algorithm outperforms the standard DDPG algorithm in the experimental environment.

Claims (7)

1. An improved DDPG policy method based on HER and ICM implementation, characterized in that it comprises the steps of:
step 1) creating an experimental environment and constructing models of DDPG, HER and ICM;
step 2) setting training parameters and creating an experience pool;
step 3) initializing a network and an optimizer;
step 4) run the model in the environment for training, and store the state, action, reward and related information in the experience pool;
step 5) processing the samples in the experience pool using a HER algorithm to generate new samples;
step 6) calculating rewards and integrating rewards using the ICM;
step 7) calculating a loss function, and updating a network parameter training model by using a gradient descent method;
the DDPG strategy is realized through the steps.
2. The method according to claim 1, characterized in that in step 1), it comprises in particular the following steps:
1-1) constructing a basic algorithm DDPG model, wherein the basic algorithm DDPG model comprises a Critic network, an Actor network and a learning function; the Critic network is used for estimating the Q value of the state-action and providing the value information of the action; the Actor network is used for selecting actions according to the states, and optimizing strategies to obtain the maximum cumulative rewards;
The Critic network and the Actor network are each built from three fully connected layers; the computation of a fully connected layer is shown in formula (1):
a_i = f_i(W_i × a_{i-1} + b_i) (1)
where a_i is the activation output of the i-th layer, f_i is the activation function of the i-th layer, W_i is the weight matrix of the i-th layer, a_{i-1} is the output of the (i-1)-th layer, and b_i is the bias vector of the i-th layer;
1-2) constructing an ICM network model, comprising a forward model and a reverse model; the forward model predicts the next state from the current state and action, and the reverse model predicts the executed action from the current state and the next state; the network parts of both models are also built from fully connected layers;
The forward model is computed as shown in formula (2):
ŝ_{t+1} = f(s_t, a_t) (2)
where f is the forward network model, s_t is the current observed state, a_t is the current action, f(s_t, a_t) is the output of the forward model, and ŝ_{t+1} is the predicted next state;
The reverse model is computed as shown in formula (3):
â_t = g(s_t, s_{t+1}) (3)
where g is the reverse network model, s_t is the current observed state, s_{t+1} is the next state, g(s_t, s_{t+1}) is the output of the reverse model, and â_t is the predicted action;
1-3) constructing the computational functions required by the HER algorithm, including the reward recalculation function shown in formula (4) and the new-target generation function.
3. The method according to claim 1, characterized in that in step 2) it comprises in particular the following steps:
2-1) setting the hyperparameters, including the number of training rounds, the number of steps per round, the learning sample size, the learning rate, the experience pool size and the HER new-target number; the number of training rounds is the number of games run; the number of steps per round is the maximum length of each game, and if the steps are exhausted the game is forcibly ended and judged a failure; the learning sample size is the number of experience samples drawn at each learning step; the learning rate is the learning efficiency; the experience pool size is the number of experience samples the pool can hold, and once the pool is full the oldest experience samples are released first (first in, first out); the HER new-target number is the number of new targets to select, used for the reward calculation of new experience samples;
2-2) creating an experience pool for storing the current state, reward, action, next state and termination information; the experience pool is divided into a total experience pool and a temporary experience pool: the total experience pool stores all experience samples, the temporary experience pool stores each round's group of experience samples for the HER calculation, and the temporary experience pool is emptied at the end of each round.
4. The method according to claim 1, characterized in that in step 4) it comprises in particular the following steps:
4-1) resetting the experimental environment, and resetting the environment parameters and the temporary experience pool before training begins;
4-2) selecting an action with the DDPG network, running the agent in the environment with that action, and storing the resulting current state, action, reward, next state and termination flag in the experience pool.
5. The method according to claim 1, characterized in that in step 5), it comprises in particular the following steps:
5-1) the HER algorithm takes the states of several experiences as new targets by randomly selecting from the last experience samples of each round's temporary experience pool;
5-2) for each experience in the temporary experience pool, the reward is recalculated from the new target with the reward function, the recalculated reward replaces the reward in the initial experience sample to obtain a new experience sample, and the new experience sample is stored in the experience pool; the recalculated reward is computed as shown in formula (4):
R(s_t, a_t, s_g) = R + λ(-α·distance(s_t, s_g)) (4)
where the current state is s_t, the action performed is a_t, the target state is s_g, distance denotes the Euclidean distance between states, α is a positive coefficient used to adjust the magnitude of the virtual reward (its value is usually tuned to the characteristics of the problem), R is the reward value in the initial experience sample, λ is an impact factor, and R(s_t, a_t, s_g) is the recalculated reward value.
6. The method according to claim 1, characterized in that in step 6), it comprises in particular the following steps:
6-1) when the number of experience data in the experience pool reaches a set amount, starting training, extracting experience in the experience pool, and predicting a next state by a forward model of the ICM algorithm according to the state and the action in the experience data;
6-2) calculating the predicted loss of the predicted next state and the next state in the experience data;
6-3) the ICM intrinsic reward is obtained from this loss calculation, as shown in formula (5):
R_icm = β·‖ŝ_{t+1} - s_{t+1}‖² + (1 - β)·‖â_t - a_t‖² (5)
where R_icm is the ICM intrinsic reward, ‖…‖² is the squared norm of a vector, ŝ_{t+1} is the predicted next state, s_{t+1} is the next state, â_t is the predicted action, a_t is the action in the experience sample, and β and (1 - β) are the impact factors of the forward model and the reverse model respectively;
the ICM reward is added to the reward of the experience sample to obtain the final training reward; the final reward value is computed as shown in formula (6):
R = R_exp + σ·R_icm (6)
where R is the final training reward, R_exp is the reward in the experience sample, R_icm is the ICM intrinsic reward, and σ is a hyperparameter controlling the impact factor of the intrinsic reward.
7. Method according to claim 1, characterized in that in step 7) it comprises in particular the following steps:
7-1) calculating the mean squared error loss of the Critic network; the loss function is computed as shown in formula (7):
L(θ) = (R + γ·Q(s', a'; θ⁻) - Q(s, a; θ))² (7)
where Q(s, a; θ) is the Q value of taking action a in the current state s, computed with the neural network parameters θ, R is the reward value computed with formula (6), γ is the discount factor weighing the importance of future rewards, s' is the new state after executing action a, a' is the action selected for state s' (in DDPG, given by the Actor's target network), and θ⁻ are the parameters of the target network used to compute the Q value in the target state s'; during training the network parameters θ are updated by minimizing this loss function so that the Q value gradually approaches the optimal Q-value function;
7-2) calculating the policy gradient loss of the Actor network; the loss function is computed as shown in formula (8);
L_actor = -(1/N)·Σ_{i=1}^{N} Q(s_i, μ(s_i)) (8)
where N is the batch size, s_i is the state, μ(s_i) is the action selected by the Actor network in state s_i, and Q(s_i, μ(s_i)) is the Critic network's predicted value for state s_i and the action selected by the Actor network; by computing these two losses, the DDPG algorithm updates the parameters of the Critic network and the Actor network respectively, so that the Critic's value function approximates the true return and the Actor's policy gradually improves to obtain a higher cumulative return; during training, the network parameters are adjusted slightly along the gradient direction of the loss function to achieve the optimization goal.
CN202311242639.6A 2023-09-25 2023-09-25 Improved DDPG strategy method based on HER and ICM realization Pending CN117521717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311242639.6A CN117521717A (en) 2023-09-25 2023-09-25 Improved DDPG strategy method based on HER and ICM realization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311242639.6A CN117521717A (en) 2023-09-25 2023-09-25 Improved DDPG strategy method based on HER and ICM realization

Publications (1)

Publication Number Publication Date
CN117521717A true CN117521717A (en) 2024-02-06

Family

ID=89750178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311242639.6A Pending CN117521717A (en) 2023-09-25 2023-09-25 Improved DDPG strategy method based on HER and ICM realization

Country Status (1)

Country Link
CN (1) CN117521717A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination