CN114492215A - GP world model for assisting training by utilizing strategy model and training method thereof - Google Patents


Info

Publication number
CN114492215A
Authority
CN
China
Prior art keywords
model
function
loss function
training
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210404483.6A
Other languages
Chinese (zh)
Inventor
葛品
吴冠霖
方文其
平洋
栾绍童
缪正元
戴迎枫
沈源源
金新竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanhu Laboratory
Original Assignee
Nanhu Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanhu Laboratory filed Critical Nanhu Laboratory
Priority to CN202210404483.6A priority Critical patent/CN114492215A/en
Publication of CN114492215A publication Critical patent/CN114492215A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06Q50/40

Abstract

The invention discloses a GP world model that uses a policy model to assist its training, and a training method thereof. The GP world model comprises a loss function used for training the world model; the loss function comprises a first loss function and a second loss function, the first loss function being the GP world model's own loss function and the second loss function being the loss function of the policy model. The training method comprises the following steps: S1, the world model uses the loss function $L_w$ to update its model parameters $\theta_w$; S2, the policy model uses the loss function $L_p$ to update its model parameters $\theta_p$, and $L_p$ is stored at every step of the training; S3, the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is substituted into $L_w$ and used for the next round of world-model training. The invention provides a training mechanism in which a policy model assists the training of the GP world model, so that the stability of policy training can be used to modulate the training of the world model, thereby improving the training effect and performance of the world model.

Description

GP world model for assisting training by utilizing strategy model and training method thereof
Technical Field
The invention belongs to the technical field of world models, and particularly relates to a GP world model whose training is assisted by a policy model, and to a training method thereof.
Background
The deep reinforcement learning framework is a framework that can deal well with the problem of limited sample data. It mainly comprises two parts: a policy model and a world model. The policy model is trained with experiences from an experience pool; the world model simulates the environment by learning state transitions and rewards, and the experiences generated by the environment learned by the world model are also stored in the experience pool to provide more training data for the policy model, which alleviates the problem of insufficient sample data.
At present, the policy model and the world model of deep reinforcement learning are trained separately: the simulated experiences generated by the world model and the real experiences generated by interaction with the environment are stored in the experience pool and used to train the policy model by updating the policy model's loss function, while the world model is trained with the real experiences generated by interaction between the policy model and the environment by updating the world model's loss function. In long-term research the applicant has found that the training effect of a deep reinforcement learning world model obtained in this way is poor, but no suitable solution existed before.
In the latest research, the applicant has tried to assist the training of the GP world model with the policy model, and has since shown that the stability of the policy can modulate the training of the world model, so that the world model achieves a better training effect. Meanwhile, in research on continuous-action intelligent decision making, the applicant has proposed a deep reinforcement learning framework that uses the PPO algorithm in place of the DQN algorithm of traditional deep reinforcement learning; combining the two lines of research, the PPO algorithm is used to assist the training of the GP world model, and experiments show that deep reinforcement learning in which the PPO algorithm assists the training of the GP model trains faster and performs better than deep reinforcement learning without this assistance.
Disclosure of Invention
The invention aims to solve the above problems and provides a GP world model whose training is assisted by a policy model, and a training method thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
the GP world model comprises a loss function used for training the world model, wherein the loss function comprises a first loss function and a second loss function, the first loss function is the own loss function of the GP world model, and the second loss function is the loss function of the strategy model.
In the GP world model trained with the aid of the policy model, the second loss function is the average of the loss functions obtained each time the policy model updates its model parameters $\theta_p$ during each step of training.
In the GP world model trained with the aid of the policy model, the loss function includes:

$L_w = \alpha\, L_{GP} + \beta\, \bar{L}_p$ (1)

where $\alpha$ and $\beta$ are adjustable coefficients, $L_{GP}$ is the GP world model's own loss function, and $\bar{L}_p$ is the loss function of the policy model.
In the GP world model trained with the aid of the policy model, the GP world model's own loss function $L_{GP}$ comprises:

$L_{GP} = \tfrac{1}{2}\, y^{\top} \Sigma^{-1} y + \tfrac{1}{2} \log \lvert \Sigma \rvert + \mathrm{const}$ (2)

where $\Sigma$ is the predicted covariance and $y$ represents the output values in the training data.
In the GP world model trained with the aid of the policy model, the covariance $\Sigma$ is predicted by:

$\Sigma = k^{f}_{\ell\ell}\, k^{x}(x_*, x_*) - \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)$ (3)

where $D$ is a diagonal matrix of dimension N x M, $I$ denotes an identity matrix, $K^{f}$ describes the association between the different tasks, and $K^{x}$ denotes the correlation matrix between the training data.
In the GP world model trained with the aid of the policy model, the policy model comprises a PPO algorithm, and the second loss function is the loss function of the PPO algorithm.
In the GP world model trained with the aid of the policy model, the PPO algorithm loss functions include a policy loss function:

$L_{policy} = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$ (4)

where $\hat{\mathbb{E}}_t$ denotes taking the average, $\min$ denotes taking the smaller value, $r_t(\theta)$ denotes the change ratio between the new policy and the old policy, $\hat{A}_t$ denotes the advantage function of the PPO algorithm, $\mathrm{clip}$ denotes the truncation function, and $\epsilon$ is the truncation factor.
In the GP world model trained with the aid of the policy model, the PPO algorithm loss functions further include a value function loss function and an entropy loss function:

$L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$ (5)

where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values;

the value function loss function includes:

$L_{value} = \hat{\mathbb{E}}_t\!\left[\left(R_t - V(s_t)\right)^2\right]$ (6)

where $R_t$ denotes the return accumulated along the following trajectory, $V$ is the value function, and $\hat{\mathbb{E}}_t$ denotes taking the average.
A method for assisting GP world model training by using a policy model comprises the following steps:

S1. the world model uses the loss function $L_w$ to update the model parameters $\theta_w$;

S2. the policy model uses the loss function $L_p$ to update the model parameters $\theta_p$, and $L_p$ is stored at every step of the training;

S3. the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is then substituted into $L_w$ and used for the next round of training of the world model.
In the method for assisting GP world model training by using the policy model, the policy model is a PPO algorithm, and the loss function of the PPO algorithm is $L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$, where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values.
The invention has the following advantages:
1. A training mechanism in which a policy model assists the training of the GP world model is provided; the stability of policy training can be used to modulate the training of the world model, thereby improving the training effect and performance of the world model;
2. In a Dyna-PPO framework capable of continuous-action decision making, using the PPO algorithm to assist the training of the GP world model can both accelerate the training of the framework and improve its performance.
Drawings
FIG. 1 is a block diagram of a GP-based Dyna-PPO method;
FIG. 2 is a structure diagram of the GP model in the training and prediction phases;
FIG. 3 is a flow chart of the algorithm used in the experiments;
FIG. 4a shows the learning curves of PPO, GPPPO and i-GPPPO in the CarRacing-v0 experiment with M = 30, K = 1 and N = 8;
FIG. 4b shows the learning curves of PPO, GPPPO and i-GPPPO in the CarRacing-v0 experiment with M = 30, K = 3 and N = 8;
FIG. 4c shows the learning curves of PPO, GPPPO and i-GPPPO in the CarRacing-v0 experiment with M = 30, K = 5 and N = 8;
FIG. 5a shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 4;
FIG. 5b shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 8;
FIG. 5c shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 16;
FIG. 5d shows the learning curves of PPO, GPPPO and i-GPPPO in the CARLA Simulator experiment with M = 25, K = 10 and N = 32.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present embodiment discloses a GP world model whose training is assisted by a policy model. As in the prior art, the GP world model comprises a loss function for training the world model, and this loss function comprises a first loss function, the first loss function being the GP world model's own loss function $L_{GP}$. The present scheme is characterized in that the loss function of the GP world model further comprises a second loss function, the second loss function being the loss function of the policy model. The scheme provides a training mechanism in which the policy model assists the training of the GP world model, so that the stability of policy training can be used to modulate the training of the world model, thereby improving the training effect and performance of the world model.
Specifically, the loss function of the GP world model includes:

$L_w = \alpha\, L_{GP} + \beta\, \bar{L}_p$ (1)

where $\alpha$ and $\beta$ are adjustable coefficients, $L_{GP}$ is the GP world model's own loss function, and $\bar{L}_p$ is the loss function of the policy model.
Preferably, a multi-output GP model is adopted to construct the world model: the multi-dimensional output problem is treated as several related tasks, the correlations between the output dimensions are fully considered, and prior knowledge can be incorporated, which reduces the dependence on training data and thereby improves prediction accuracy.
Consider a set X containing N different inputs $x_1, \dots, x_N$ and the corresponding outputs of M tasks, where $y_{i\ell}$ corresponds to the i-th input and the $\ell$-th task; as in a GP with one-dimensional output, the outputs follow a joint Gaussian distribution. As in a general Gaussian model, the mean of the GP model can be assumed to be 0, and the correlation function between different tasks and different inputs can be written as:

$\left\langle f_{\ell}(x)\, f_{k}(x') \right\rangle = K^{f}_{\ell k}\, k^{x}(x, x')$ (2)

where $K^{f}$ describes the association between the different tasks and $k^{x}$ describes the correlation between different inputs; the same kind of function can be chosen for both. In general, to satisfy the intrinsic requirements of the Gaussian distribution, $K^{f}$ must be a positive semi-definite matrix; to guarantee its positive semi-definiteness, $K^{f}$ can be parameterized via the Cholesky decomposition as the product $LL^{T}$ of two matrices, where $L$ is a lower triangular matrix, although its form may also be chosen to be an existing common kernel function. The multi-dimensional-output GP model of this scheme also follows the standard GP method: for task $\ell$ and a test input $x_*$, the predicted mean and covariance can be expressed by the following expressions:
$\bar{f}_{\ell}(x_*) = \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} y$

$\Sigma = k^{f}_{\ell\ell}\, k^{x}(x_*, x_*) - \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)$ (3)

where $\bar{f}_{\ell}(x_*)$ represents the predicted average, $y$ represents the output values in the training data, $\Sigma$ represents the predicted covariance, $I$ represents the identity matrix, $\otimes$ represents the Kronecker product, $k^{f}_{\ell}$ represents the $\ell$-th column of $K^{f}$, $k^{x}_{*}$ represents the correlation vector between the training inputs and $x_*$, $K^{x}$ represents the correlation matrix between the training data, and $D$ is a diagonal matrix of dimension N x M whose elements are the corresponding noise values. In the same way, the intrinsic loss function of the GP world model in this scheme is:
$L_{GP} = \tfrac{1}{2}\, y^{\top} \Sigma^{-1} y + \tfrac{1}{2} \log \lvert \Sigma \rvert + \mathrm{const}$ (4)

where $\Sigma$ is the predicted covariance and $y$ represents the output values in the training data.
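As an illustration of the multi-output GP prediction and loss described above, the following sketch (NumPy only) computes a predictive mean and covariance in the style of equation (3) and a Gaussian negative log-likelihood in the spirit of equation (4). The function names, the column-stacking convention and the per-task noise vector are assumptions for the sketch and follow the standard multi-task GP formulation rather than the patent's exact implementation.

```python
import numpy as np

def multitask_gp_predict(Kf, Kx, kx_star, kxx_star, D, Y, task):
    """Multi-task GP prediction in the style of equation (3).

    Kf       : (M, M) task-similarity matrix K^f (positive semi-definite)
    Kx       : (N, N) kernel matrix K^x between the N training inputs
    kx_star  : (N,)   kernel vector between the training inputs and the test input x_*
    kxx_star : scalar k^x(x_*, x_*)
    D        : (M,)   per-task noise variances (diagonal of the noise matrix, an assumption)
    Y        : (N, M) training outputs; y is its task-by-task stacked vector
    task     : index l of the task to predict
    """
    N, M = Y.shape
    # Gram matrix of the joint prior: K^f kron K^x + D kron I
    K = np.kron(Kf, Kx) + np.kron(np.diag(D), np.eye(N))
    y = Y.T.reshape(-1)                        # outputs stacked task by task
    k_star = np.kron(Kf[:, task], kx_star)     # (k^f_l kron k^x_*)
    alpha = np.linalg.solve(K, y)
    mean = k_star @ alpha                      # predictive mean, first line of (3)
    cov = Kf[task, task] * kxx_star - k_star @ np.linalg.solve(K, k_star)
    return mean, cov

def gp_nll_loss(Sigma, y):
    """Gaussian negative log-likelihood of y under covariance Sigma,
    used as a stand-in for the GP world model's own loss of equation (4)."""
    logdet = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (y @ np.linalg.solve(Sigma, y) + logdet + len(y) * np.log(2 * np.pi))
```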
Preferably, the second loss function is the average of the loss functions obtained each time the policy model updates its model parameters $\theta_p$ during each step of training. As shown in FIG. 2, the specific training method is as follows:

S1. the world model uses the loss function $L_w$ and the Adam algorithm with a correspondingly set step size to update the model parameters $\theta_w$;

S2. the policy model uses the loss function $L_p$ and the Adam algorithm with a correspondingly set step size to update the model parameters $\theta_p$, and $L_p$ is stored at every step of the training;

S3. the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is then substituted into $L_w$ and used for the next round of training of the world model.
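A minimal PyTorch-style sketch of one round of the S1-S3 procedure is given below. The module interfaces `world_model.gp_loss()` and `policy_model.ppo_loss()`, the coefficients `alpha` and `beta`, and the step sizes are hypothetical placeholders rather than the patent's actual code; the sketch only illustrates where the averaged policy loss enters the world-model update.

```python
import torch

def train_one_round(world_model, policy_model, world_batch, policy_batches,
                    alpha=1.0, beta=0.1, lr_world=1e-3, lr_policy=3e-4,
                    mean_policy_loss_prev=0.0):
    """One round of the S1-S3 procedure (illustrative sketch).

    mean_policy_loss_prev is the averaged policy loss from the previous round,
    i.e. the second term substituted into the world-model loss L_w (step S3).
    In a full implementation the optimizers would persist across rounds.
    """
    opt_w = torch.optim.Adam(world_model.parameters(), lr=lr_world)
    opt_p = torch.optim.Adam(policy_model.parameters(), lr=lr_policy)

    # S1: update world-model parameters with L_w = alpha * L_GP + beta * mean(L_p)
    loss_w = alpha * world_model.gp_loss(world_batch) + beta * mean_policy_loss_prev
    opt_w.zero_grad()
    loss_w.backward()
    opt_w.step()

    # S2: update policy parameters and store the policy loss at every step
    policy_losses = []
    for batch in policy_batches:
        loss_p = policy_model.ppo_loss(batch)
        opt_p.zero_grad()
        loss_p.backward()
        opt_p.step()
        policy_losses.append(loss_p.detach())

    # S3: average the stored policy losses; the mean feeds the next world-model update
    return torch.stack(policy_losses).mean().item()
```

In use, the value returned by this round would be passed back in as `mean_policy_loss_prev` for the next round, which is the substitution that step S3 describes.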
Example two
In this embodiment, the scheme is applied to a Dyna-PPO framework, researched and designed by the applicant, that can realize continuous-action decision making. In this framework the policy model includes a PPO algorithm, so the second loss function of this embodiment is the loss function of the PPO algorithm.
The PPO algorithm is a recent Policy Gradient (PG) algorithm. The PPO method encourages exploration while limiting the change of the policy so that the policy is updated slowly; it integrates intelligent optimization and policy optimization and can be used to handle continuous-action problems. The PPO algorithm updates the objective function in small batches over several training steps, which solves the problem that the step size is difficult to determine in traditional policy gradient algorithms. At each iteration it computes a new policy, achieving a new balance between implementation difficulty, sampling complexity and the effort required for tuning, minimizing the loss function while ensuring that the deviation from the policy of the previous iteration is relatively small. Meanwhile, the PPO algorithm uses importance sampling so that samples can be reused, and the algorithm does not need to interact with the environment to collect new data after every parameter update.
The principle of importance sampling is to introduce an importance weight into the summation; the key is to introduce the importance weight through a policy ratio, which considers the ratio of the new policy to the old policy:

$r_t(\theta) = \dfrac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$

where $\pi$ denotes the policy network, $\theta$ denotes the parameters of the policy network, $a_t$ denotes the action at time t, and $s_t$ denotes the state at time t.
Specifically, the PPO algorithm loss functions include a policy loss function:

$L_{policy} = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$ (5)

where $\hat{\mathbb{E}}_t$ denotes taking the average, $\min$ denotes taking the smaller value, $\mathrm{clip}$ denotes the truncation function, and $\epsilon$ is the truncation factor.
$\hat{A}_t$ denotes the advantage function of the PPO algorithm, whose expression is:

$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1}, \qquad \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)$ (6)

where $\delta_t$ is the difference between the reward at time t plus the discounted value at time t+1 and the value function at time t, $V(s_{t+1})$ is the value function in the state at the next time t+1, $r_t$ and $V(s_t)$ respectively denote the reward and the value function at step t, $\gamma$ is the discount factor, and $\lambda$ is a constant introduced so that the advantage function $\hat{A}_t$ has a more general mathematical form; it is usually taken to be a constant approximately equal to 1.
T denotes the number of time steps in an episode; starting from a particular state s and proceeding until the end of the task is called a complete episode. At each step a reward r is obtained, and the final return obtained over a complete task is denoted R.
The first term in the policy loss function represents conservative policy iteration; if this term is optimized without limitation, it results in a very large policy update, while the clipping in the second term removes the possibility of the ratio moving outside the interval $[1-\epsilon,\ 1+\epsilon]$. Because of the minimum function $\min$, the objective always takes the lower bound of the first term, so that the policies before and after optimization in the PPO algorithm do not change abruptly and a better policy is obtained. This embodiment proposes using the PPO method to assist in training the world model; the loss function of the PPO algorithm is added to the loss of the GP world model so as to assist it, which can both accelerate the training of the framework and improve its performance.
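The clipped surrogate objective of equation (5) and the advantage estimate of equation (6) can be sketched as follows. This is the standard PPO/GAE computation and is shown only as an illustration; the tensor layout and default hyper-parameter values are assumptions.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimate in the spirit of equation (6):
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated with factor gamma * lam.
    rewards and values are 1-D tensors of length T; V after the last step is taken as 0."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def clipped_policy_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of equation (5), returned as a loss to minimize."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # new/old policy ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```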
Further, the PPO algorithm loss function also comprises the loss function of the value function, $L_{value}$, and the loss function of the entropy, $L_{entropy}$. For the entropy loss function $L_{entropy}$, the existing method is adopted directly. For the value function loss function $L_{value}$, the loss of the value function part is designed to be composed of the difference between the value given by the policy and the return: the return R of each step and its discounting are obtained by computing along the following trajectory, the return accumulated along the trajectory is recorded as $R_t$, the value function $V$ computed by the policy is subtracted from it, and the average is evaluated, giving the loss function of the value function:

$L_{value} = \hat{\mathbb{E}}_t\!\left[\left(R_t - V(s_t)\right)^2\right]$ (7)

where $R_t$ denotes the return accumulated along the following trajectory, $V$ is the value function, and $\hat{\mathbb{E}}_t$ denotes taking the average. The PPO algorithm is thus optimized and improved so that it can better assist the training of the GP model.
Thus, the loss function of the entire PPO algorithm consists of a linear combination of these three functions:

$L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$ (8)

where $c_1$ and $c_2$ are introduced to balance the values of the three terms and obtain better optimization results; $c_1$ and $c_2$ can be determined by the skilled person as required.
To illustrate the effectiveness and superiority of this scheme, this example compares the performance of the Dyna-PPO framework whose GP world model is optimized with the PPO loss function against that of the Dyna-PPO framework whose GP world model is not optimized with the PPO loss function, in the two environments CarRacing-v0 and CARLA and under different hyper-parameters. In these experiments, the images received by the vehicle sensors are processed by a variational autoencoder model, as shown in FIG. 3, and the image information is concatenated with the external state as the input of the variational autoencoder model.
The names of the algorithms involved in the experimental tests are as follows:
GPPPO(M, K, N): learning based on the GPPPO method of this scheme, where M is the number of warm-up steps, K is the number of planning steps and N is the batch size; this model does not use the PPO loss function to optimize the world model;
i-GPPPO(M, K, N): the same as GPPPO(M, K, N), but additionally using the loss function of PPO to optimize the world model;
PPO(N): learning with the traditional PPO method only.
The PPO method was proposed by OpenAI in 2017 and is a milestone in reinforcement learning. The PPO method encourages exploration while limiting the change of the policy so that the policy is updated slowly, and it is a method framework that integrates intelligent optimization and policy optimization. Therefore, using PPO within a deep reinforcement learning framework has become a promising approach for controlling autonomous vehicles, and PPO-based deep reinforcement learning has accordingly been applied to common driving tasks.
Experimental environment setup:
1) CarRacing-v0: developed by the OpenAI team, this is an environment in Gym for benchmarking reinforcement learning algorithms, originally intended for racing. To make the environment suit the goal of lane keeping, it is modified as follows:
the turning radius is reduced, and the action space is reduced by removing the braking action, so that the action space contains only the two-element vector a = {steer, acc}, where steer and acc denote the steering-wheel angle and the throttle of the car, respectively. To better control the car and limit its maximum speed, the throttle is closed when the speed approaches 30 pixels/time step. The termination conditions mainly include driving off the road, the speed of the car staying below 0.1 pixels/time step over 30 actions, and visiting the same track tile twice. Further, after the RGB frame of the environment is converted into an 84 x 84 grayscale image, pixels are cropped from the left, right and lower sides of the image (6 pixels from each side and 12 from the bottom) to reduce the state space before the grayscale image is input to a variational autoencoder (VAE) model. After processing by the VAE model, the input state of the policy model is an 8-dimensional vector. The reward function of this experimental environment is defined by equation (9) as a function of the car's speed v, measured in pixels/time step.
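As an illustration of the observation pipeline described above (grayscale conversion, resizing and cropping, then VAE encoding to an 8-dimensional state), a hedged sketch is shown below; the exact crop margins and the `vae.encode` interface are assumptions, and the actual VAE used in the experiments is not reproduced here.

```python
import numpy as np
import cv2  # OpenCV, e.g. pip install opencv-python

def preprocess_frame(rgb_frame, vae=None):
    """Convert a CarRacing-v0 RGB frame into the low-dimensional policy input.

    rgb_frame : (H, W, 3) uint8 frame from the environment
    vae       : optional encoder exposing encode(), assumed to return an 8-dim vector
    """
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    gray = cv2.resize(gray, (84, 84))
    # Crop the sides and the bottom to reduce the state space
    # (margins are illustrative; the text mentions crops of 6 and 12 pixels).
    cropped = gray[:-12, 6:-6].astype(np.float32) / 255.0
    if vae is not None:
        return vae.encode(cropped)   # assumed interface: 8-dimensional latent state
    return cropped
```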
2) CARLA Simulator: since the CarRacing-v0 environment is a big gap from the real road, the experiment also uses the city driving simulator carala (version 0.9.11) to test in order to better illustrate the superiority of the algorithm. CARL is an open source autopilot simulator, built on top of the ghost engine 4, which allows all vehicles, maps and sensors to be controlled in an arbitrary way. In the present experiment, each algorithm was tested using a dense road map Town07 with many intersections.
Since the braking action may be somewhat detrimental to the training algorithm when traffic rules are not considered, the experiment keeps only the action tuple {steer, acc}, as in the CarRacing-v0 environment described above. The reward function is defined by equation (10) in terms of d_norm, a function of the distance from the center of the lane, and a heading term given by equation (11); the heading term depends on the angle between the current orientation of the vehicle and the direction vector of the road centerline, which is compared with a threshold, and exceeding the threshold means that the heading deviates too much.
In this experimental environment, the VAE model is likewise applied to pre-process the images before they are input to the policy model and the world model.
Experimental results in the CarRacing-v0 environment:
In this set of experiments, performance under different parameters is evaluated by varying the planning step count of the algorithm.
In FIG. 4a, FIG. 4b and FIG. 4c, curve A is the learning curve of PPO(8), curve B is the learning curve of GPPPO(30, 1, 8), and curve C is the learning curve of i-GPPPO(30, 1, 8). FIG. 4a, FIG. 4b and FIG. 4c compare the three algorithms under the same batch size N = 8 and warm-up steps M = 30 with different planning steps K = 1, 3, 5. The results show that when the planning step count is small, the i-GPPPO method performs better than GPPPO in the convergence stage, and it also performs better when the number of iterations is small, particularly below 50; although the i-GPPPO method does show some oscillation at around 100 iterations, its performance in the later stage is smoother than that of the other two methods.
Experimental results in the CARLA Simulator environment:
Since CARLA is more complex than CarRacing-v0, the time required for convergence is much longer than for CarRacing-v0. Furthermore, the CarRacing-v0 experiments indicate that the planning step count and the batch size play the more important roles in performance and that a relatively small planning step count is the better choice; therefore a small planning step count is used, the planning steps and warm-up steps are fixed at M = 25 and K = 10, and the batch sizes are N = 4, 8, 16 and 32 respectively.
In this more complex experiment, the output dimension of the VAE model is set to 10 and is concatenated with the motion and speed of the vehicle. In FIG. 5a, 5b, 5c and 5d, curve A is the learning curve of PPO(4), curve B is the learning curve of GPPPO(25, 10, 4), and curve C is the learning curve of i-GPPPO(25, 10, 4). As can be seen from FIG. 5a to FIG. 5d, i-GPPPO is the best overall, GPPPO comes second, and PPO is the worst, especially in the early and convergence stages: the curve of i-GPPPO rises faster, and in the convergence stage the vehicle trained by the i-GPPPO method can also travel a greater distance.
Furthermore, it can be noted that the GPPPO method is at its worst with a batch size of 32. The interaction between the world model and the policy model does increase the diversity of the samples, but it also produces too much similar data, which can penalize the policy model when the world model performs poorly. However, according to the analysis of this experiment, an appropriate threshold may be set for the reward r predicted by the world model, and an experience is added to the experience pool only when its value of r is greater than the threshold, i.e. only experiences considered good are added; this has already been mentioned in the applicant's previous schemes and has been shown to produce a better effect, so it is not described in detail here.
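A minimal sketch of the reward-threshold filter mentioned above: a simulated experience from the world model is added to the experience pool only when its predicted reward exceeds a threshold. The `world_model.predict` interface and the default threshold are illustrative assumptions.

```python
def add_simulated_experience(experience_pool, world_model, state, action,
                             reward_threshold=0.0):
    """Store a world-model rollout step only if its predicted reward is good enough."""
    next_state, reward = world_model.predict(state, action)   # assumed interface
    if reward > reward_threshold:
        experience_pool.append((state, action, reward, next_state))
    return next_state, reward
```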
This scheme proposes using the loss function of PPO to assist the training of the world model, and the implemented algorithm achieves both fast training and good performance.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although the terms world model, policy loss function, value function loss function, entropy loss function, etc. are used frequently herein, the possibility of using other terms is not excluded. These terms are used only to describe and explain the essence of the invention more conveniently, and construing them as imposing any additional limitation would be contrary to the spirit of the invention.

Claims (10)

1. A GP world model trained with the aid of a policy model, comprising a loss function used for training the world model, characterized in that the loss function comprises a first loss function and a second loss function, the first loss function being the GP world model's own loss function and the second loss function being the loss function of the policy model.
2. The GP world model according to claim 1, wherein the second loss function is the average of the loss functions obtained each time the policy model updates its model parameters $\theta_p$ during each step of training.
3. The GP world model trained with the aid of the policy model according to claim 1 or 2, wherein the loss function includes:

$L_w = \alpha\, L_{GP} + \beta\, \bar{L}_p$ (1)

where $\alpha$ and $\beta$ are adjustable coefficients, $L_{GP}$ is the GP world model's own loss function, and $\bar{L}_p$ is the loss function of the policy model.
4. The GP world model trained with the aid of the policy model according to claim 3, wherein the GP world model's own loss function $L_{GP}$ comprises:

$L_{GP} = \tfrac{1}{2}\, y^{\top} \Sigma^{-1} y + \tfrac{1}{2} \log \lvert \Sigma \rvert + \mathrm{const}$ (2)

where $\Sigma$ is the predicted covariance and $y$ represents the output values in the training data.
5. The GP world model trained with the aid of the policy model according to claim 4, wherein the covariance $\Sigma$ is predicted by:

$\Sigma = k^{f}_{\ell\ell}\, k^{x}(x_*, x_*) - \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)^{\top} \left(K^{f} \otimes K^{x} + D \otimes I\right)^{-1} \left(k^{f}_{\ell} \otimes k^{x}_{*}\right)$ (3)

where $D$ is a diagonal matrix of dimension N x M, $I$ denotes an identity matrix, $K^{f}$ describes the association between the different tasks, and $K^{x}$ denotes the correlation matrix between the training data.
6. The GP world model trained with the aid of the policy model according to claim 3, wherein the policy model comprises a PPO algorithm, and the second loss function is the loss function of the PPO algorithm.
7. The GP world model trained with the aid of the policy model according to claim 6, wherein the PPO algorithm loss functions include a policy loss function:

$L_{policy} = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$ (4)

where $\hat{\mathbb{E}}_t$ denotes taking the average, $\min$ denotes taking the smaller value, $r_t(\theta)$ denotes the change ratio between the new policy and the old policy, $\hat{A}_t$ denotes the advantage function of the PPO algorithm, $\mathrm{clip}$ denotes the truncation function, and $\epsilon$ is the truncation factor.
8. The GP world model trained with the aid of the policy model according to claim 7, wherein the PPO algorithm loss functions further comprise a value function loss function and an entropy loss function:

$L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$ (5)

where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values;

the value function loss function includes:

$L_{value} = \hat{\mathbb{E}}_t\!\left[\left(R_t - V(s_t)\right)^2\right]$ (6)

where $R_t$ denotes the return accumulated along the following trajectory, $V$ is the value function, and $\hat{\mathbb{E}}_t$ denotes taking the average.
9. A method for assisting GP world model training by using a policy model, characterized by comprising the following steps:

S1. the world model uses the loss function $L_w$ to update the model parameters $\theta_w$;

S2. the policy model uses the loss function $L_p$ to update the model parameters $\theta_p$, and $L_p$ is stored at every step of the training;

S3. the stored values of $L_p$ are averaged to obtain $\bar{L}_p$, which is then substituted into $L_w$ and used for the next round of training of the world model.
10. The method of claim 9, wherein the policy model is a PPO algorithm and the loss function of the PPO algorithm is $L_{PPO} = L_{policy} + c_1\, L_{value} + c_2\, L_{entropy}$, where $L_{policy}$ is the policy loss function, $L_{value}$ is the value function loss function, $L_{entropy}$ is the entropy loss function, and $c_1$, $c_2$ are weight values.
CN202210404483.6A 2022-04-18 2022-04-18 GP world model for assisting training by utilizing strategy model and training method thereof Pending CN114492215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404483.6A CN114492215A (en) 2022-04-18 2022-04-18 GP world model for assisting training by utilizing strategy model and training method thereof


Publications (1)

Publication Number Publication Date
CN114492215A true CN114492215A (en) 2022-05-13

Family

ID=81489469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210404483.6A Pending CN114492215A (en) 2022-04-18 2022-04-18 GP world model for assisting training by utilizing strategy model and training method thereof

Country Status (1)

Country Link
CN (1) CN114492215A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019222734A1 (en) * 2018-05-18 2019-11-21 Google Llc Learning data augmentation policies
US20200160168A1 (en) * 2018-11-16 2020-05-21 Honda Motor Co., Ltd. Cooperative multi-goal, multi-agent, multi-stage reinforcement learning
US20200234144A1 (en) * 2019-01-18 2020-07-23 Uber Technologies, Inc. Generating training datasets for training neural networks
EP3745326A1 (en) * 2019-05-27 2020-12-02 Siemens Aktiengesellschaft Method for determining a plurality of trained machine learning models
US20210034976A1 (en) * 2019-08-02 2021-02-04 Google Llc Framework for Learning to Transfer Learn
CN112989017A (en) * 2021-05-17 2021-06-18 南湖实验室 Method for generating high-quality simulation experience for dialogue strategy learning
CN113392956A (en) * 2021-05-17 2021-09-14 南湖实验室 GP-based deep Dyna-Q method for dialogue strategy learning
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENQI FANG et al.: "Spectrum Gaussian Processes Based On Tunable Basis Functions", arXiv *
CHEN Jianting et al.: "A Survey of Gradient Instability in Deep Neural Network Training", Journal of Software *

Similar Documents

Publication Publication Date Title
CN110568760B (en) Parameterized learning decision control system and method suitable for lane changing and lane keeping
Liang et al. Cirl: Controllable imitative reinforcement learning for vision-based self-driving
Chen et al. Model-free deep reinforcement learning for urban autonomous driving
CN112132263B (en) Multi-agent autonomous navigation method based on reinforcement learning
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
Yin et al. Data-driven models for train control dynamics in high-speed railways: LAG-LSTM for train trajectory prediction
CN113264043A (en) Unmanned driving layered motion decision control method based on deep reinforcement learning
Zuo et al. Continuous reinforcement learning from human demonstrations with integrated experience replay for autonomous driving
Wang et al. Lane keeping assist for an autonomous vehicle based on deep reinforcement learning
Löckel et al. A probabilistic framework for imitating human race driver behavior
Liu et al. Mtd-gpt: A multi-task decision-making gpt model for autonomous driving at unsignalized intersections
Guo et al. Koopman operator-based driver-vehicle dynamic model for shared control systems
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
Huang et al. An efficient self-evolution method of autonomous driving for any given algorithm
CN114492215A (en) GP world model for assisting training by utilizing strategy model and training method thereof
Fang et al. A maximum entropy inverse reinforcement learning algorithm for automatic parking
Guo et al. Modeling, learning and prediction of longitudinal behaviors of human-driven vehicles by incorporating internal human DecisionMaking process using inverse model predictive control
Xiao DDK: A deep Koopman approach for dynamics modeling and trajectory tracking of autonomous vehicles
CN114997048A (en) Automatic driving vehicle lane keeping method based on TD3 algorithm improved by exploration strategy
CN115116240A (en) Lantern-free intersection vehicle cooperative control method and system
CN114647986A (en) Intelligent decision method and system for realizing continuous action decision based on GP (GP) and PPO (Peer-to-Peer)
Zhang et al. Learning how to drive using DDPG algorithm with double experience buffer priority sampling
Samsani et al. Rapid Autonomous Vehicle Drifting with Deep Reinforcement Learning
Wang et al. An End-to-End Deep Reinforcement Learning Model Based on Proximal Policy Optimization Algorithm for Autonomous Driving of Off-Road Vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220513)