CN114492215A - GP world model for assisting training by utilizing strategy model and training method thereof - Google Patents


Info

Publication number
CN114492215A
Authority
CN
China
Prior art keywords
model
function
loss function
training
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210404483.6A
Other languages
Chinese (zh)
Inventor
葛品
吴冠霖
方文其
平洋
栾绍童
缪正元
戴迎枫
沈源源
金新竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanhu Laboratory
Original Assignee
Nanhu Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanhu Laboratory filed Critical Nanhu Laboratory
Priority to CN202210404483.6A priority Critical patent/CN114492215A/en
Publication of CN114492215A publication Critical patent/CN114492215A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06Q50/40

Abstract

The invention discloses a GP world model that uses a policy model to assist its training, and a training method thereof. The GP world model comprises a loss function used for training the world model; the loss function comprises a first loss function and a second loss function, the first loss function being the GP world model's own loss function and the second loss function being the loss function of the policy model. The training method comprises the following steps: S1, the world model uses the loss function $L_w$ to update its model parameters $\theta_w$; S2, the policy model uses the loss function $L_p$ to update its model parameters $\theta_p$, and $L_p$ is stored at every step of the training; S3, the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is substituted into $L_w$ and used for the next round of world-model training. The invention provides a training mechanism in which a policy model assists the training of the GP world model, so that the stability of policy training can be used to modulate the training of the world model, thereby improving the training effect and performance of the world model.

Description

GP world model for assisting training by utilizing strategy model and training method thereof
Technical Field
The invention belongs to the technical field of world models, and particularly relates to a GP world model whose training is assisted by a policy model, and to a training method thereof.
Background
The deep reinforcement learning framework is a framework that can deal well with the problem of limited sample data. It mainly comprises two parts: a policy model and a world model. The policy model is trained with experiences from an experience pool; the world model simulates the environment by learning state transitions and rewards, and the experiences generated by the environment learned by the world model are also stored in the experience pool to provide more training data for the policy model, which alleviates the problem of insufficient sample data.
At present, the policy model and the world model of deep reinforcement learning are trained separately: the simulated experiences generated by the world model and the real experiences generated by interaction with the environment are stored in the experience pool and used to train the policy model by updating the policy model's loss function, while the world model is trained with the real experiences generated by interaction between the policy model and the environment by updating the world model's loss function. In long-term research the applicant has found that the training effect of a deep reinforcement learning world model obtained in this way is poor, but no suitable solution existed before.
In the latest research, the applicant has tried to assist the training of the GP world model with the policy model, and has since shown that the stability of the policy can modulate the training of the world model, so that the world model achieves a better training effect. Meanwhile, in research on continuous-action intelligent decision making, the applicant has proposed a deep reinforcement learning framework that uses the PPO algorithm in place of the DQN algorithm of traditional deep reinforcement learning; combining the two lines of research, the PPO algorithm is used to assist the training of the GP world model, and experiments show that deep reinforcement learning in which the PPO algorithm assists the training of the GP model trains faster and performs better than deep reinforcement learning without this assistance.
Disclosure of Invention
The invention aims to solve the above problems and provides a GP world model whose training is assisted by a policy model, and a training method thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
the GP world model comprises a loss function used for training the world model, wherein the loss function comprises a first loss function and a second loss function, the first loss function is the own loss function of the GP world model, and the second loss function is the loss function of the strategy model.
In the GP world model trained with the aid of the policy model, the second loss function is the average of the loss functions obtained each time the policy model updates its model parameters $\theta_p$ during each step of training.
In the GP world model trained with the aid of the policy model, the loss function includes:

$L_w = \alpha\, L_{GP} + \beta\, \bar{L}_p$ (1)

where $\alpha$ and $\beta$ are adjustable coefficients, $L_{GP}$ is the GP world model's own loss function, and $\bar{L}_p$ is the loss function of the policy model.
In the GP world model trained with the aid of the policy model, the GP world model's own loss function $L_{GP}$ comprises:

$L_{GP} = \tfrac{1}{2}\, y^{\top} \Sigma^{-1} y + \tfrac{1}{2} \log \lvert \Sigma \rvert + \mathrm{const}$ (2)

where $\Sigma$ is the predicted covariance and $y$ represents the output values in the training data.
In the GP world model trained with the aid of the policy model, the covariance $\Sigma$ is predicted by:

$\Sigma = k^{f}_{\ell\ell}\, k^{x}(x_*, x_*) - \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)$ (3)

where $D$ is a diagonal matrix of dimension N x M, $I$ denotes an identity matrix, $K^{f}$ describes the association between the different tasks, and $K^{x}$ denotes the correlation matrix between the training data.
In the GP world model trained with the aid of the policy model, the policy model comprises a PPO algorithm, and the second loss function is the loss function of the PPO algorithm.
In the GP world model trained with the aid of the policy model, the PPO algorithm loss functions include a policy loss function:

$L_{policy} = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$ (4)

where $\hat{\mathbb{E}}_t$ denotes taking the average, $\min$ denotes taking the smaller value, $r_t(\theta)$ denotes the change ratio between the new policy and the old policy, $\hat{A}_t$ denotes the advantage function of the PPO algorithm, $\mathrm{clip}$ denotes the truncation function, and $\epsilon$ is the truncation factor.
In the GP world model trained with the aid of the policy model, the PPO algorithm loss functions further include a value function loss function and an entropy loss function:

$L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$ (5)

where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values;

the value function loss function includes:

$L_{value} = \hat{\mathbb{E}}_t\!\left[\left(R_t - V(s_t)\right)^2\right]$ (6)

where $R_t$ denotes the return accumulated along the following trajectory, $V$ is the value function, and $\hat{\mathbb{E}}_t$ denotes taking the average.
A method for assisting GP world model training by using a policy model comprises the following steps:

S1. the world model uses the loss function $L_w$ to update the model parameters $\theta_w$;

S2. the policy model uses the loss function $L_p$ to update the model parameters $\theta_p$, and $L_p$ is stored at every step of the training;

S3. the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is then substituted into $L_w$ and used for the next round of training of the world model.
In the method for assisting GP world model training by using the policy model, the policy model is a PPO algorithm, and the loss function of the PPO algorithm is $L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$, where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values.
The invention has the following advantages:
1. A training mechanism in which a policy model assists the training of the GP world model is provided; the stability of policy training can be used to modulate the training of the world model, thereby improving the training effect and performance of the world model;
2. In a Dyna-PPO framework capable of continuous-action decision making, using the PPO algorithm to assist the training of the GP world model can both accelerate the training of the framework and improve its performance.
Drawings
FIG. 1 is a block diagram of a GP-based Dyna-PPO method;
FIG. 2 is a structure diagram of the GP model in the training and prediction phases;
FIG. 3 is a flow chart of the algorithm used in the experiments;
FIG. 4a shows the learning curves of PPO, GPPPO and i-GPPPO in the CarRacing-v0 experiment with M = 30, K = 1 and N = 8;
FIG. 4b shows the learning curves of PPO, GPPPO and i-GPPPO in the CarRacing-v0 experiment with M = 30, K = 3 and N = 8;
FIG. 4c shows the learning curves of PPO, GPPPO and i-GPPPO in the CarRacing-v0 experiment with M = 30, K = 5 and N = 8;
FIG. 5a shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 4;
FIG. 5b shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 8;
FIG. 5c shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 16;
FIG. 5d shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 32.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present embodiment discloses a GP world model whose training is assisted by a policy model. As in the prior art, the GP world model comprises a loss function for training the world model, and this loss function comprises a first loss function, the first loss function being the GP world model's own loss function $L_{GP}$. The present scheme is characterized in that the loss function of the GP world model further comprises a second loss function, the second loss function being the loss function of the policy model. The scheme provides a training mechanism in which the policy model assists the training of the GP world model, so that the stability of policy training can be used to modulate the training of the world model, thereby improving the training effect and performance of the world model.
Specifically, the loss function of the GP world model includes:

$L_w = \alpha\, L_{GP} + \beta\, \bar{L}_p$ (1)

where $\alpha$ and $\beta$ are adjustable coefficients, $L_{GP}$ is the GP world model's own loss function, and $\bar{L}_p$ is the loss function of the policy model.
Preferably, a multi-output GP model is adopted to construct the world model: the multi-dimensional output problem is treated as several related tasks, the correlations between the output dimensions are fully considered, and prior knowledge can be incorporated, which reduces the dependence on training data and thereby improves prediction accuracy.
Consider a set X containing N different inputs $x_1, \dots, x_N$ and the corresponding outputs of M tasks, where $y_{i\ell}$ corresponds to the i-th input and the $\ell$-th task; as in a GP with one-dimensional output, the outputs follow a joint Gaussian distribution. As in a general Gaussian model, the mean of the GP model can be assumed to be 0, and the correlation function between different tasks and different inputs can be written as:

$\left\langle f_{\ell}(x)\, f_{k}(x') \right\rangle = K^{f}_{\ell k}\, k^{x}(x, x')$ (2)

where $K^{f}$ describes the association between the different tasks and $k^{x}$ describes the correlation between different inputs; the same kind of function can be chosen for both. In general, to satisfy the intrinsic requirements of the Gaussian distribution, $K^{f}$ must be a positive semi-definite matrix; to guarantee its positive semi-definiteness, $K^{f}$ can be parameterized via the Cholesky decomposition as the product $LL^{T}$ of two matrices, where $L$ is a lower triangular matrix, although its form may also be chosen to be an existing common kernel function. The multi-dimensional-output GP model of this scheme also follows the standard GP method: for task $\ell$ and a test input $x_*$, the predicted mean and covariance can be expressed by the following expressions:
$\bar{f}_{\ell}(x_*) = \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} y$

$\Sigma = k^{f}_{\ell\ell}\, k^{x}(x_*, x_*) - \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)$ (3)

where $\bar{f}_{\ell}(x_*)$ represents the predicted average, $y$ represents the output values in the training data, $\Sigma$ represents the predicted covariance, $I$ represents the identity matrix, $\otimes$ represents the Kronecker product, $k^{f}_{\ell}$ represents the $\ell$-th column of $K^{f}$, $k^{x}_{*}$ represents the correlation vector between the training inputs and $x_*$, $K^{x}$ represents the correlation matrix between the training data, and $D$ is a diagonal matrix of dimension N x M whose elements are the corresponding noise values. In the same way, the intrinsic loss function of the GP world model in this scheme is:
$L_{GP} = \tfrac{1}{2}\, y^{\top} \Sigma^{-1} y + \tfrac{1}{2} \log \lvert \Sigma \rvert + \mathrm{const}$ (4)

where $\Sigma$ is the predicted covariance and $y$ represents the output values in the training data.
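As an illustration of the multi-output GP prediction and loss described above, the following sketch (NumPy only) computes a predictive mean and covariance in the style of equation (3) and a Gaussian negative log-likelihood in the spirit of equation (4). The function names, the column-stacking convention and the per-task noise vector are assumptions for the sketch and follow the standard multi-task GP formulation rather than the patent's exact implementation.

```python
import numpy as np

def multitask_gp_predict(Kf, Kx, kx_star, kxx_star, D, Y, task):
    """Multi-task GP prediction in the style of equation (3).

    Kf       : (M, M) task-similarity matrix K^f (positive semi-definite)
    Kx       : (N, N) kernel matrix K^x between the N training inputs
    kx_star  : (N,)   kernel vector between the training inputs and the test input x_*
    kxx_star : scalar k^x(x_*, x_*)
    D        : (M,)   per-task noise variances (diagonal of the noise matrix, an assumption)
    Y        : (N, M) training outputs; y is its task-by-task stacked vector
    task     : index l of the task to predict
    """
    N, M = Y.shape
    # Gram matrix of the joint prior: K^f kron K^x + D kron I
    K = np.kron(Kf, Kx) + np.kron(np.diag(D), np.eye(N))
    y = Y.T.reshape(-1)                        # outputs stacked task by task
    k_star = np.kron(Kf[:, task], kx_star)     # (k^f_l kron k^x_*)
    alpha = np.linalg.solve(K, y)
    mean = k_star @ alpha                      # predictive mean, first line of (3)
    cov = Kf[task, task] * kxx_star - k_star @ np.linalg.solve(K, k_star)
    return mean, cov

def gp_nll_loss(Sigma, y):
    """Gaussian negative log-likelihood of y under covariance Sigma,
    used as a stand-in for the GP world model's own loss of equation (4)."""
    logdet = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (y @ np.linalg.solve(Sigma, y) + logdet + len(y) * np.log(2 * np.pi))
```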
Preferably, the second loss function is the average of the loss functions obtained each time the policy model updates its model parameters $\theta_p$ during each step of training. As shown in FIG. 2, the specific training method is as follows:

S1. the world model uses the loss function $L_w$ and the Adam algorithm with a correspondingly set step size to update the model parameters $\theta_w$;

S2. the policy model uses the loss function $L_p$ and the Adam algorithm with a correspondingly set step size to update the model parameters $\theta_p$, and $L_p$ is stored at every step of the training;

S3. the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is then substituted into $L_w$ and used for the next round of training of the world model.
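A minimal PyTorch-style sketch of one round of the S1-S3 procedure is given below. The module interfaces `world_model.gp_loss()` and `policy_model.ppo_loss()`, the coefficients `alpha` and `beta`, and the step sizes are hypothetical placeholders rather than the patent's actual code; the sketch only illustrates where the averaged policy loss enters the world-model update.

```python
import torch

def train_one_round(world_model, policy_model, world_batch, policy_batches,
                    alpha=1.0, beta=0.1, lr_world=1e-3, lr_policy=3e-4,
                    mean_policy_loss_prev=0.0):
    """One round of the S1-S3 procedure (illustrative sketch).

    mean_policy_loss_prev is the averaged policy loss from the previous round,
    i.e. the second term substituted into the world-model loss L_w (step S3).
    In a full implementation the optimizers would persist across rounds.
    """
    opt_w = torch.optim.Adam(world_model.parameters(), lr=lr_world)
    opt_p = torch.optim.Adam(policy_model.parameters(), lr=lr_policy)

    # S1: update world-model parameters with L_w = alpha * L_GP + beta * mean(L_p)
    loss_w = alpha * world_model.gp_loss(world_batch) + beta * mean_policy_loss_prev
    opt_w.zero_grad()
    loss_w.backward()
    opt_w.step()

    # S2: update policy parameters and store the policy loss at every step
    policy_losses = []
    for batch in policy_batches:
        loss_p = policy_model.ppo_loss(batch)
        opt_p.zero_grad()
        loss_p.backward()
        opt_p.step()
        policy_losses.append(loss_p.detach())

    # S3: average the stored policy losses; the mean feeds the next world-model update
    return torch.stack(policy_losses).mean().item()
```

In use, the value returned by this round would be passed back in as `mean_policy_loss_prev` for the next round, which is the substitution that step S3 describes.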
Example two
In this embodiment, the scheme is applied to a Dyna-PPO framework, researched and designed by the applicant, that can realize continuous-action decision making. In this framework the policy model includes a PPO algorithm, so the second loss function of this embodiment is the loss function of the PPO algorithm.
The PPO algorithm is a recent Policy Gradient (PG) algorithm. The PPO method encourages exploration while limiting the change of the policy so that the policy is updated slowly; it integrates intelligent optimization and policy optimization and can be used to handle continuous-action problems. The PPO algorithm updates the objective function in small batches over several training steps, which solves the problem that the step size is difficult to determine in traditional policy gradient algorithms. At each iteration it computes a new policy, achieving a new balance between implementation difficulty, sampling complexity and the effort required for tuning, minimizing the loss function while ensuring that the deviation from the policy of the previous iteration is relatively small. Meanwhile, the PPO algorithm uses importance sampling so that samples can be reused, and the algorithm does not need to interact with the environment to collect new data after every parameter update.
The principle of importance sampling is to introduce an importance weight into the summation; the key is to introduce the importance weight through a policy ratio, which considers the ratio of the new policy to the old policy:

$r_t(\theta) = \dfrac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$

where $\pi$ denotes the policy network, $\theta$ denotes the parameters of the policy network, $a_t$ denotes the action at time t, and $s_t$ denotes the state at time t.
Specifically, the PPO algorithm loss functions include a policy loss function:

$L_{policy} = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$ (5)

where $\hat{\mathbb{E}}_t$ denotes taking the average, $\min$ denotes taking the smaller value, $\mathrm{clip}$ denotes the truncation function, and $\epsilon$ is the truncation factor.
$\hat{A}_t$ denotes the advantage function of the PPO algorithm, whose expression is:

$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1}, \qquad \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)$ (6)

where $\delta_t$ is the difference between the reward at time t plus the discounted value at time t+1 and the value function at time t, $V(s_{t+1})$ is the value function in the state at the next time t+1, $r_t$ and $V(s_t)$ respectively denote the reward and the value function at step t, $\gamma$ is the discount factor, and $\lambda$ is a constant introduced so that the advantage function $\hat{A}_t$ has a more general mathematical form; it is usually taken to be a constant approximately equal to 1.
T denotes the number of time steps in an episode; starting from a particular state s and proceeding until the end of the task is called a complete episode. At each step a reward r is obtained, and the final return obtained over a complete task is denoted R.
The first term in the policy loss function represents conservative policy iteration; if this term is optimized without limitation, it results in a very large policy update, while the clipping in the second term removes the possibility of the ratio moving outside the interval $[1-\epsilon,\ 1+\epsilon]$. Because of the minimum function $\min$, the objective always takes the lower bound of the first term, so that the policies before and after optimization in the PPO algorithm do not change abruptly and a better policy is obtained. This embodiment proposes using the PPO method to assist in training the world model; the loss function of the PPO algorithm is added to the loss of the GP world model so as to assist it, which can both accelerate the training of the framework and improve its performance.
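The clipped surrogate objective of equation (5) and the advantage estimate of equation (6) can be sketched as follows. This is the standard PPO/GAE computation and is shown only as an illustration; the tensor layout and default hyper-parameter values are assumptions.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimate in the spirit of equation (6):
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated with factor gamma * lam.
    rewards and values are 1-D tensors of length T; V after the last step is taken as 0."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def clipped_policy_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of equation (5), returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # new/old policy ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```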
Further, the PPO algorithm loss function also comprises the loss function of the value function, $L_{value}$, and the loss function of the entropy, $L_{entropy}$. For the entropy loss function $L_{entropy}$, the existing method is adopted directly. For the value function loss function $L_{value}$, the loss of the value function part is designed to be composed of the difference between the value given by the policy and the return: the return R of each step and its discounting are obtained by computing along the following trajectory, the return accumulated along the trajectory is recorded as $R_t$, the value function $V$ computed by the policy is subtracted from it, and the average is evaluated, giving the loss function of the value function:

$L_{value} = \hat{\mathbb{E}}_t\!\left[\left(R_t - V(s_t)\right)^2\right]$ (7)

where $R_t$ denotes the return accumulated along the following trajectory, $V$ is the value function, and $\hat{\mathbb{E}}_t$ denotes taking the average. The PPO algorithm is thus optimized and improved so that it can better assist the training of the GP model.
Thus, the loss function of the entire PPO algorithm consists of a linear combination of these three functions:

$L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$ (8)

where $c_1$ and $c_2$ are introduced to balance the values of the three terms and obtain better optimization results; $c_1$ and $c_2$ can be determined by the skilled person as required.
To illustrate the effectiveness and superiority of this scheme, this example compares the performance of the Dyna-PPO framework whose GP world model is optimized with the PPO loss function against that of the Dyna-PPO framework whose GP world model is not optimized with the PPO loss function, in the two environments CarRacing-v0 and CARLA and under different hyper-parameters. In these experiments, the images received by the vehicle sensors are processed by a variational autoencoder model, as shown in FIG. 3, and the image information is concatenated with the external state as the input of the variational autoencoder model.
The names of the algorithms involved in the experimental tests are as follows:
GPPPO(M, K, N): learning based on the GPPPO method of this scheme, where M is the number of warm-up steps, K is the number of planning steps and N is the batch size; this model does not use the PPO loss function to optimize the world model;
i-GPPPO(M, K, N): the same as GPPPO(M, K, N), but additionally using the loss function of PPO to optimize the world model;
PPO(N): learning with the traditional PPO method only.
The PPO method was proposed by OpenAI in 2017 and is a milestone in reinforcement learning. The PPO method encourages exploration while limiting the change of the policy so that the policy is updated slowly, and it is a method framework that integrates intelligent optimization and policy optimization. Therefore, using PPO within a deep reinforcement learning framework has become a promising approach for controlling autonomous vehicles, and PPO-based deep reinforcement learning has accordingly been applied to common driving tasks.
Experimental environment setup:
1) CarRacing-v0: developed by the OpenAI team, this is an environment in Gym for benchmarking reinforcement learning algorithms, originally intended for racing. To make the environment suit the goal of lane keeping, it is modified as follows:
the turning radius is reduced, and the action space is reduced by removing the braking action, so that the action space contains only the two-element vector a = {steer, acc}, where steer and acc denote the steering-wheel angle and the throttle of the car, respectively. To better control the car and limit its maximum speed, the throttle is closed when the speed approaches 30 pixels/time step. The termination conditions mainly include driving off the road, the speed of the car staying below 0.1 pixels/time step over 30 actions, and visiting the same track tile twice. Further, after the RGB frame of the environment is converted into an 84 x 84 grayscale image, pixels are cropped from the left, right and lower sides of the image (6 pixels from each side and 12 from the bottom) to reduce the state space before the grayscale image is input to a variational autoencoder (VAE) model. After processing by the VAE model, the input state of the policy model is an 8-dimensional vector. The reward function of this experimental environment is defined by equation (9) as a function of the car's speed v, measured in pixels/time step.
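As an illustration of the observation pipeline described above (grayscale conversion, resizing and cropping, then VAE encoding to an 8-dimensional state), a hedged sketch is shown below; the exact crop margins and the `vae.encode` interface are assumptions, and the actual VAE used in the experiments is not reproduced here.

```python
import numpy as np
import cv2  # OpenCV, e.g. pip install opencv-python

def preprocess_frame(rgb_frame, vae=None):
    """Convert a CarRacing-v0 RGB frame into the low-dimensional policy input.

    rgb_frame : (H, W, 3) uint8 frame from the environment
    vae       : optional encoder exposing encode(), assumed to return an 8-dim vector
    """
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    gray = cv2.resize(gray, (84, 84))
    # Crop the sides and the bottom to reduce the state space
    # (margins are illustrative; the text mentions crops of 6 and 12 pixels).
    cropped = gray[:-12, 6:-6].astype(np.float32) / 255.0
    if vae is not None:
        return vae.encode(cropped)   # assumed interface: 8-dimensional latent state
    return cropped
```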
2) CARLA Simulator: since the CarRacing-v0 environment is a big gap from the real road, the experiment also uses the city driving simulator carala (version 0.9.11) to test in order to better illustrate the superiority of the algorithm. CARL is an open source autopilot simulator, built on top of the ghost engine 4, which allows all vehicles, maps and sensors to be controlled in an arbitrary way. In the present experiment, each algorithm was tested using a dense road map Town07 with many intersections.
Since the braking action may be somewhat detrimental to the training algorithm when traffic rules are not considered, the experiment keeps only the action tuple {steer, acc}, as in the CarRacing-v0 environment described above. The reward function is defined by equation (10) in terms of d_norm, a function of the distance from the center of the lane, and a heading term given by equation (11); the heading term depends on the angle between the current orientation of the vehicle and the direction vector of the road centerline, which is compared with a threshold, and exceeding the threshold means that the heading deviates too much.
In this experimental environment, the VAE model is likewise applied to pre-process the images before they are input to the policy model and the world model.
Experimental results in the CarRacing-v0 environment:
In this set of experiments, performance under different parameters is evaluated by varying the planning step count of the algorithm.
In FIG. 4a, FIG. 4b and FIG. 4c, curve A is the learning curve of PPO(8), curve B is the learning curve of GPPPO(30, 1, 8), and curve C is the learning curve of i-GPPPO(30, 1, 8). FIG. 4a, FIG. 4b and FIG. 4c compare the three algorithms under the same batch size N = 8 and warm-up steps M = 30 with different planning steps K = 1, 3, 5. The results show that when the planning step count is small, the i-GPPPO method performs better than GPPPO in the convergence stage, and it also performs better when the number of iterations is small, particularly below 50; although the i-GPPPO method does show some oscillation at around 100 iterations, its performance in the later stage is smoother than that of the other two methods.
Experimental results in the CARLA Simulator environment:
Since CARLA is more complex than CarRacing-v0, the time required for convergence is much longer than for CarRacing-v0. Furthermore, the CarRacing-v0 experiments indicate that the planning step count and the batch size play the more important roles in performance and that a relatively small planning step count is the better choice; therefore a small planning step count is used, the planning steps and warm-up steps are fixed at M = 25 and K = 10, and the batch sizes are N = 4, 8, 16 and 32 respectively.
In this more complex experiment, the output dimension of the VAE model is set to 10 and is concatenated with the motion and speed of the vehicle. In FIG. 5a, 5b, 5c and 5d, curve A is the learning curve of PPO(4), curve B is the learning curve of GPPPO(25, 10, 4), and curve C is the learning curve of i-GPPPO(25, 10, 4). As can be seen from FIG. 5a to FIG. 5d, i-GPPPO is the best overall, GPPPO comes second, and PPO is the worst, especially in the early and convergence stages: the curve of i-GPPPO rises faster, and in the convergence stage the vehicle trained by the i-GPPPO method can also travel a greater distance.
Furthermore, it can be noted that the GPPPO method is at its worst with a batch size of 32. The interaction between the world model and the policy model does increase the diversity of the samples, but it also produces too much similar data, which can penalize the policy model when the world model performs poorly. However, according to the analysis of this experiment, an appropriate threshold may be set for the reward r predicted by the world model, and an experience is added to the experience pool only when its value of r is greater than the threshold, i.e. only experiences considered good are added; this has already been mentioned in the applicant's previous schemes and has been shown to produce a better effect, so it is not described in detail here.
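A minimal sketch of the reward-threshold filter mentioned above: a simulated experience from the world model is added to the experience pool only when its predicted reward exceeds a threshold. The `world_model.predict` interface and the default threshold are illustrative assumptions.

```python
def add_simulated_experience(experience_pool, world_model, state, action,
                             reward_threshold=0.0):
    """Store a world-model rollout step only if its predicted reward is good enough."""
    next_state, reward = world_model.predict(state, action)   # assumed interface
    if reward > reward_threshold:
        experience_pool.append((state, action, reward, next_state))
    return next_state, reward
```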
This scheme proposes using the loss function of PPO to assist the training of the world model, and the implemented algorithm achieves both fast training and good performance.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although the terms world model, policy loss function, value function loss function, entropy loss function, etc. are used frequently herein, the possibility of using other terms is not excluded. These terms are used only to describe and explain the essence of the invention more conveniently, and construing them as imposing any additional limitation would be contrary to the spirit of the invention.

Claims (10)

1. A GP world model trained with the aid of a policy model, comprising a loss function used for training the world model, characterized in that the loss function comprises a first loss function and a second loss function, the first loss function being the GP world model's own loss function and the second loss function being the loss function of the policy model.
2. The GP world model according to claim 1, wherein the second loss function is the average of the loss functions obtained each time the policy model updates its model parameters $\theta_p$ during each step of training.
3. The GP world model trained with the aid of the policy model according to claim 1 or 2, wherein the loss function includes:

$L_w = \alpha\, L_{GP} + \beta\, \bar{L}_p$ (1)

where $\alpha$ and $\beta$ are adjustable coefficients, $L_{GP}$ is the GP world model's own loss function, and $\bar{L}_p$ is the loss function of the policy model.
4. The GP world model trained with the aid of the policy model according to claim 3, wherein the GP world model's own loss function $L_{GP}$ comprises:

$L_{GP} = \tfrac{1}{2}\, y^{\top} \Sigma^{-1} y + \tfrac{1}{2} \log \lvert \Sigma \rvert + \mathrm{const}$ (2)

where $\Sigma$ is the predicted covariance and $y$ represents the output values in the training data.
5. The GP world model trained with the aid of the policy model according to claim 4, wherein the covariance $\Sigma$ is predicted by:

$\Sigma = k^{f}_{\ell\ell}\, k^{x}(x_*, x_*) - \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)$ (3)

where $D$ is a diagonal matrix of dimension N x M, $I$ denotes an identity matrix, $K^{f}$ describes the association between the different tasks, and $K^{x}$ denotes the correlation matrix between the training data.
6. The GP world model trained with the aid of the policy model according to claim 3, wherein the policy model comprises a PPO algorithm, and the second loss function is the loss function of the PPO algorithm.
7. The GP world model trained with the aid of the policy model according to claim 6, wherein the PPO algorithm loss functions include a policy loss function:

$L_{policy} = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$ (4)

where $\hat{\mathbb{E}}_t$ denotes taking the average, $\min$ denotes taking the smaller value, $r_t(\theta)$ denotes the change ratio between the new policy and the old policy, $\hat{A}_t$ denotes the advantage function of the PPO algorithm, $\mathrm{clip}$ denotes the truncation function, and $\epsilon$ is the truncation factor.
8. The GP world model trained with the aid of the policy model according to claim 7, wherein the PPO algorithm loss functions further comprise a value function loss function and an entropy loss function:

$L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$ (5)

where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values;

the value function loss function includes:

$L_{value} = \hat{\mathbb{E}}_t\!\left[\left(R_t - V(s_t)\right)^2\right]$ (6)

where $R_t$ denotes the return accumulated along the following trajectory, $V$ is the value function, and $\hat{\mathbb{E}}_t$ denotes taking the average.
9. A method for assisting GP world model training by using a policy model, characterized by comprising the following steps:

S1. the world model uses the loss function $L_w$ to update the model parameters $\theta_w$;

S2. the policy model uses the loss function $L_p$ to update the model parameters $\theta_p$, and $L_p$ is stored at every step of the training;

S3. the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is then substituted into $L_w$ and used for the next round of training of the world model.
10. The method of claim 9, wherein the policy model is a PPO algorithm and the loss function of the PPO algorithm is $L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$, where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values.
CN202210404483.6A 2022-04-18 2022-04-18 GP world model for assisting training by utilizing strategy model and training method thereof Pending CN114492215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404483.6A CN114492215A (en) 2022-04-18 2022-04-18 GP world model for assisting training by utilizing strategy model and training method thereof


Publications (1)

Publication Number Publication Date
CN114492215A true CN114492215A (en) 2022-05-13

Family

ID=81489469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210404483.6A Pending CN114492215A (en) 2022-04-18 2022-04-18 GP world model for assisting training by utilizing strategy model and training method thereof

Country Status (1)

Country Link
CN (1) CN114492215A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019222734A1 (en) * 2018-05-18 2019-11-21 Google Llc Learning data augmentation policies
US20200160168A1 (en) * 2018-11-16 2020-05-21 Honda Motor Co., Ltd. Cooperative multi-goal, multi-agent, multi-stage reinforcement learning
US20200234144A1 (en) * 2019-01-18 2020-07-23 Uber Technologies, Inc. Generating training datasets for training neural networks
EP3745326A1 (en) * 2019-05-27 2020-12-02 Siemens Aktiengesellschaft Method for determining a plurality of trained machine learning models
US20210034976A1 (en) * 2019-08-02 2021-02-04 Google Llc Framework for Learning to Transfer Learn
CN112989017A (en) * 2021-05-17 2021-06-18 南湖实验室 Method for generating high-quality simulation experience for dialogue strategy learning
CN113392956A (en) * 2021-05-17 2021-09-14 南湖实验室 GP-based deep Dyna-Q method for dialogue strategy learning
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENQI FANG et al.: "Spectrum Gaussian Processes Based On Tunable Basis Functions", arXiv *
CHEN Jianting et al.: "A Survey of Gradient Instability in Deep Neural Network Training", Journal of Software *

Similar Documents

Publication Publication Date Title
CN110568760B (en) Parameterized learning decision control system and method suitable for lane changing and lane keeping
Liang et al. Cirl: Controllable imitative reinforcement learning for vision-based self-driving
Chen et al. Model-free deep reinforcement learning for urban autonomous driving
CN112132263B (en) Multi-agent autonomous navigation method based on reinforcement learning
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
Yin et al. Data-driven models for train control dynamics in high-speed railways: LAG-LSTM for train trajectory prediction
CN113264043A (en) Unmanned driving layered motion decision control method based on deep reinforcement learning
Zuo et al. Continuous reinforcement learning from human demonstrations with integrated experience replay for autonomous driving
Wang et al. Lane keeping assist for an autonomous vehicle based on deep reinforcement learning
Löckel et al. A probabilistic framework for imitating human race driver behavior
Liu et al. Mtd-gpt: A multi-task decision-making gpt model for autonomous driving at unsignalized intersections
Guo et al. Koopman operator-based driver-vehicle dynamic model for shared control systems
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
Huang et al. An efficient self-evolution method of autonomous driving for any given algorithm
CN114492215A (en) GP world model for assisting training by utilizing strategy model and training method thereof
Fang et al. A maximum entropy inverse reinforcement learning algorithm for automatic parking
Guo et al. Modeling, learning and prediction of longitudinal behaviors of human-driven vehicles by incorporating internal human DecisionMaking process using inverse model predictive control
Xiao DDK: A deep Koopman approach for dynamics modeling and trajectory tracking of autonomous vehicles
CN114997048A (en) Automatic driving vehicle lane keeping method based on TD3 algorithm improved by exploration strategy
CN115116240A (en) Lantern-free intersection vehicle cooperative control method and system
CN114647986A (en) Intelligent decision method and system for realizing continuous action decision based on GP (GP) and PPO (Peer-to-Peer)
Zhang et al. Learning how to drive using DDPG algorithm with double experience buffer priority sampling
Samsani et al. Rapid Autonomous Vehicle Drifting with Deep Reinforcement Learning
Wang et al. An End-to-End Deep Reinforcement Learning Model Based on Proximal Policy Optimization Algorithm for Autonomous Driving of Off-Road Vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220513)