CN117973497A - Automatic driving reinforcement learning method, device and storage medium - Google Patents

Automatic driving reinforcement learning method, device and storage medium

Info

Publication number
CN117973497A
CN117973497A (application CN202410283603.0A)
Authority
CN
China
Prior art keywords: img, low dimensional, reinforcement learning, image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410283603.0A
Other languages
Chinese (zh)
Inventor
何弢
严骏驰
廖文龙
李奇峰
贾萧松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kuyi Robot Co ltd
Kuwa Technology Co ltd
Original Assignee
Shanghai Kuyi Robot Co ltd
Kuwa Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kuyi Robot Co ltd, Kuwa Technology Co ltd filed Critical Shanghai Kuyi Robot Co ltd
Priority to CN202410283603.0A priority Critical patent/CN117973497A/en
Publication of CN117973497A publication Critical patent/CN117973497A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an automatic driving reinforcement learning method, device and storage medium, in which video and action information are used as inputs. A large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image serves as a reward function for training a downstream reinforcement learning algorithm. The technical scheme of this patent brings the following benefits: the automatic driving reinforcement learning method that automatically generates rewards with a large model can effectively exploit massive offline real driving data and enables an automatic driving vehicle to learn a human-like driving strategy; it automates the reward-generation process, can be connected to any form of downstream automatic driving reinforcement learning planning-and-control device for training, and accelerates the development of automatic driving planning-and-control algorithms.

Description

Automatic driving reinforcement learning method, device and storage medium
Technical Field
The present invention relates to the field of advanced driver assistance and automatic driving, and more particularly, to an automatic driving reinforcement learning method, apparatus and storage medium that automatically generate rewards using a large model.
Background
With the development of automatic driving technology, there is an increasing demand for efficient and safe vehicle control strategies. Reinforcement learning is widely adopted as an effective way to learn automatic driving strategies: following the logic of trial-and-error learning, the vehicle explores in simulation and optimizes its driving strategy according to the rewards it collects. The reward function therefore plays a vital role in reinforcement learning, as it guides the choice of behavior during learning. However, designing the reward function requires a great deal of manual effort, it must resolve possible reward conflicts across scenarios, and the enumerative style of hand-crafted reward design cannot cope with an unbounded set of extreme scenarios. In addition, even when existing reinforcement learning methods find a feasible strategy in a known scenario, the resulting driving style differs greatly from that of a human driver, leading to a poor experience. Meanwhile, there is a huge amount of human driving video data on the internet; how to use the human driving knowledge contained in these data to make automatic driving algorithms more human-like, and to automatically learn driving strategies for a large number of different scenarios from video, is a major challenge and the problem this invention seeks to solve.
However, this massive video data differs in many ways, such as viewing angle, vehicle dynamics and labeling, and it is difficult to extract a unified decision trajectory directly from the video to generate rewards. How to map state transitions in video space to specific low-dimensional vehicle actions is also a difficult problem. Evaluating driving behavior in such heterogeneous data, and performing reasonable action abstraction and mapping, is therefore the key to automated reward generation.
In the prior art, Video Language Planning (VLP) mainly comprises a video-language large model and a video large model. The video-language large model takes the image currently acquired by the camera and a language instruction as input, outputs actions in a high-level semantic space, and predicts the state transition in image space caused by a specific action. VLP has the video-language large model perform action exploration on each input video frame to generate various possible actions, then uses the video large model as a dynamics (transition) model to simulate multiple video-action trajectories formed from those actions, and finally uses the video-language large model again as a heuristic function to select the action trajectory most conducive to achieving the final task goal; the generated video-action plan can be converted into real robot actions with a goal-conditioned policy. This algorithm can learn planning strategies from offline expert video data, but it is mainly focused on the robotics field and cannot generate rewards, so other reinforcement learning algorithms cannot be plugged in for training. Source: Video Language Planning. arXiv preprint arXiv:2310.10625 (2023).
At the application level, VLP consists of two modules: a video-language large model and a video large model. The video-language large model is responsible for action generation and heuristic trajectory selection, while the video large model is responsible for trajectory exploration in video space based on the generated actions. In the action-generation stage, the video-language large model takes images and a high-level task instruction as input and outputs several possible actions that could complete the final task instruction. The video large model then performs video-trajectory exploration based on these actions, producing multiple video-action trajectories. Finally, the video-language large model is used once more to heuristically select the trajectory best able to fulfill the final task instruction. In the execution phase, a goal-conditioned policy translates actions in the trajectory into actions that a specific robot can execute directly. The algorithm flow can be summarized as follows:
Table 1 algorithm process flow pseudocode
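As a rough illustration of the flow described above, the VLP loop might be sketched as follows; the `vlm` and `video_model` interfaces (propose_action, rollout, score) are assumptions for this example, not the published VLP API:

```python
# Illustrative sketch of the VLP planning loop described above; interfaces are assumed.
def vlp_plan(image, instruction, vlm, video_model, horizon=8, n_samples=16):
    """Return the sampled video-action trajectory the VLM scores highest for the task."""
    trajectories = []
    for _ in range(n_samples):
        frames, actions = [image], []
        for _ in range(horizon):
            # Video-language model proposes a high-level action for the latest frame.
            action = vlm.propose_action(frames[-1], instruction)
            # Video model acts as the dynamics model: roll the scene forward under that action.
            frames.append(video_model.rollout(frames[-1], action))
            actions.append(action)
        trajectories.append((frames, actions))
    # The video-language model is reused as a heuristic to pick the best trajectory;
    # a goal-conditioned policy would then turn it into concrete robot commands.
    return max(trajectories, key=lambda traj: vlm.score(traj[0], instruction))
```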
The operational flow of VLP is shown in figure 1. Its drawbacks are that it cannot explicitly generate rewards, so downstream reinforcement learning algorithms cannot be attached for training, and that it depends on robot-specific instructions and therefore cannot be applied to the automatic driving field. The present invention accordingly aims to solve the problems of automatically generating a reward function from massive real driving data and of learning human-like driving strategies in complex scenarios.
Disclosure of Invention
In order to solve the problems in the prior art, the present invention provides an automatic driving reinforcement learning method, apparatus and storage medium that automatically generate rewards using a large model.
To this end, the present invention adopts the following technical scheme: an automatic driving reinforcement learning method uses video and action information as inputs; a large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image is used as a reward function for training a downstream reinforcement learning algorithm.
Further, the video and action information comprises the video and the specific action-instruction data contained in offline driving data, which are used as input to fine-tune the large model so that it predicts future video frames. Meanwhile, the large model encodes the current frame of the video into a low-dimensional feature state, the state transition driven by a specific vehicle action is predicted in this low-dimensional feature space, the decoder reconstructs the low-dimensional state back to the original image space, and the difference between the reconstructed image and the image predicted by the large model is used as the reward to train a downstream reinforcement learning controller.
Further, the training and deployment process of the automatic driving reinforcement learning method comprises the following steps:
S1: the large model takes as input a video made up of a sequence of images and a corresponding sequence of actions, x= { (img 0,a0),…,(imgT,aT) }, where img i represents the image at time i, a i represents the action taken at time i, predicting the image frames at k future times:
{img′T+1,…,img′T+k}=fhigh({(img0,a0),…,(imgT,aT)})#(1)
Where f high (·) represents the predictor of the large model in high dimensional space, imp' T+j represents the predicted image at the future jth moment given the input sequence of T moments;
s2: the encoder f enc (·) in the large model encodes the high-dimensional image img as a low-dimensional state feature z:
z=fenc(img)#(2)
Then the low-dimensional dynamic model f low (·) in the large model predicts the state z' at the next moment based on action a and state features z:
z'=flow(z,a)#(3)
Finally the decoder f_dec(·) in the large model reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
S3: A reward is generated from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
Further, in S3 the reward value is logarithmically scaled as r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
Further, a reinforcement learning master controller π_θ(a|z) finally outputs actions from the low-dimensional features; the objective of the master controller is to maximize the reward:
max_θ E_{π_θ}[Σ_t r'_t]
where θ denotes the parameters of the master controller.
Further, the specific steps of the reward-generation algorithm are as follows:
Step 1: sample a video of length T and the action-sequence data X = {(img_0, a_0), …, (img_T, a_T)} from offline human driving data;
Step 2: the predictor in the large model predicts the video output for k future times: {img'_{T+1}, …, img'_{T+k}} = f_high(X);
Step 3: the encoder in the large model encodes the image at time t into a low-dimensional feature state z_t = f_enc(img_t);
Step 4: the master controller outputs an action a_t = π_θ(a|z_t) according to the low-dimensional state;
Step 5: the dynamics model in the large model predicts the low-dimensional state feature of the next time step from the action, z'_{t+1} = f_low(z_t, a_t);
Step 6: repeat Step 2 to Step 5 until the master controller converges.
The invention also provides a device for automatic driving reinforcement learning, comprising:
a predictor, which takes the video and action information as input and is fine-tuned to predict future video frames;
an encoder, which encodes the current frame of the video into a low-dimensional feature state, in which space the state transition driven by a specific vehicle action is predicted;
a decoder, which reconstructs the low-dimensional state back to the original image space;
a low-dimensional dynamics model processor, which predicts the state at the next time step from the action and the state feature;
and a reward-function generating device, which generates a reward function from the difference between the reconstructed image and the image predicted by the large model, for automatic driving reinforcement learning training of the downstream controller.
Further, the encoder takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and the image frames at k future times are predicted:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
The encoder f_enc(·) encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
The low-dimensional dynamics model processor f_low(·) predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
The decoder f_dec(·) then reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
The reward-function generating device generates a reward function from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
Further, the reward-function generating device applies logarithmic scaling to the reward value, r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
The present invention also provides a computer-readable storage medium containing a computer program which, when executed by one or more processors, implements any of the automatic driving reinforcement learning methods described above.
The invention has the following advantages:
1. The automatic driving reinforcement learning method that automatically generates rewards with a large model can effectively exploit massive offline real driving data and enables an automatic driving vehicle to learn a human-like driving strategy.
2. The method automates the reward-generation process, can be connected to any form of downstream automatic driving reinforcement learning planning-and-control device for training, and accelerates the development of automatic driving planning-and-control algorithms.
Drawings
FIG. 1 is a schematic diagram of a prior art VLP operational flow;
FIG. 2 is a schematic diagram of an automated driving reinforcement learning method for automatically generating rewards using a large model in accordance with the present invention;
FIG. 3 is a flow chart of an automated driving reinforcement learning method for automatically generating rewards using a large model in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in fig. 2 and fig. 3, an automatic driving reinforcement learning method uses video and action information as inputs; the inputs are not limited to these, and using additional information, such as high-level language instructions, radar information and other modalities, also falls within the scope of protection. A large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image is used as a reward function for training a downstream reinforcement learning algorithm.
The invention uses the difference between the future video trajectory predicted by the large model from the video input and the picture sequence decoded from the low-dimensional features as the reward function. Different kinds of difference metrics, such as neural networks or heuristic functions, also fall within the scope of protection. The scheme of the invention offers the following advantages:
1. The automatic driving reinforcement learning method that automatically generates rewards with a large model can effectively exploit massive offline real driving data and enables an automatic driving vehicle to learn a human-like driving strategy.
2. The method automates the reward-generation process, can be connected to any form of downstream automatic driving reinforcement learning planning-and-control device for training, and accelerates the development of automatic driving planning-and-control algorithms.
In practical application, the video and action information comprises the video and the specific action-instruction data contained in offline driving data, which are used as input to fine-tune the large model so that it predicts future video frames. Meanwhile, the large model encodes the current frame of the video into a low-dimensional feature state, the state transition driven by a specific vehicle action is predicted in this low-dimensional feature space, the decoder reconstructs the low-dimensional state back to the original image space, and the difference between the reconstructed image and the image predicted by the large model is used as the reward to train a downstream reinforcement learning controller. The overall model structure and data flow are shown in fig. 2.
The training and deployment process of the automatic driving reinforcement learning method comprises the following steps:
S1: the large model takes as input a video made up of a sequence of images and a corresponding sequence of actions, x= { (img 0,a0),…,(imgT,aT) }, where img i represents the image at time i, a i represents the action taken at time i, predicting the image frames at k future times:
{img'T+1,…,img'T+k}=fhigh({(img0,a0),…,(imgT,aT)})#(1)
Where f high (·) represents the predictor of the large model in high dimensional space, img' T+j represents the predicted image at the future jth moment given the input sequence of T moments;
s2: the encoder f enc (·) in the large model encodes the high-dimensional image img as a low-dimensional state feature z:
z=fenc(img)#(2)
Then the low-dimensional dynamic model f low (·) in the large model predicts the state z' at the next moment based on action a and state features z:
z'=flow(z,a)#(3)
Finally the decoder f_dec(·) in the large model reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
S3: A reward is generated from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
The difference can be measured in various ways, for example with the squared-error distance or the Manhattan distance, among others.
In a specific application, to avoid training instability caused by fluctuations in the magnitude of the reward value, the reward is logarithmically scaled in S3 as r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
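As an illustration only, the reward computation and the symlog scaling might be sketched as follows; the use of mean per-pixel differences, pixel values in [0, 1], and the form symlog(x) = sign(x)·log(1+|x|) are assumptions for this example rather than requirements of the scheme:

```python
import numpy as np

def symlog(x):
    # Signed logarithmic scaling: compresses large reward magnitudes while keeping the sign.
    return np.sign(x) * np.log1p(np.abs(x))

def reward(img_rec, img_pred, metric="l1"):
    """Reward = theoretical maximum difference minus actual difference.

    img_rec:  image decoded from the predicted low-dimensional state, f_dec(z').
    img_pred: image predicted by the large model in pixel space, img'.
    Pixel values are assumed to lie in [0, 1], so the per-pixel difference is at most 1.
    """
    if metric == "l1":          # Manhattan distance per pixel
        diff = np.mean(np.abs(img_rec - img_pred))
    else:                       # squared-error distance per pixel
        diff = np.mean((img_rec - img_pred) ** 2)
    max_diff = 1.0              # theoretical maximum of the per-pixel difference
    r = max_diff - diff
    return symlog(r)            # log-scaled reward r' used for training stability
```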
Finally, a reinforcement learning master controller π_θ(a|z) outputs actions from the low-dimensional features; the objective of the master controller is to maximize the reward:
max_θ E_{π_θ}[Σ_t r'_t]
where θ denotes the parameters of the master controller.
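Purely for illustration, one standard way such an objective can be optimized is a REINFORCE-style update; the scheme itself does not prescribe a particular reinforcement learning algorithm:

```python
import torch

def policy_loss(log_probs, rewards):
    # One common way to maximize the expected reward (REINFORCE-style objective).
    # log_probs: log π_θ(a_t | z_t) for the actions actually taken, shape (T,)
    # rewards:   symlog-scaled rewards r'_t generated by the large model, shape (T,)
    returns = torch.cumsum(rewards.flip(0), dim=0).flip(0)   # reward-to-go at each step
    return -(log_probs * returns.detach()).mean()            # minimize the negative return
```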
As shown in fig. 3 (the master controller is the controller in the figure), the specific steps of the reward-generation algorithm are as follows; an illustrative code sketch of this loop follows the steps:
Step 1: sample a video of length T and the action-sequence data X = {(img_0, a_0), …, (img_T, a_T)} from offline human driving data;
Step 2: the predictor in the large model predicts the video output for k future times: {img'_{T+1}, …, img'_{T+k}} = f_high(X);
Step 3: the encoder in the large model encodes the image at time t into a low-dimensional feature state z_t = f_enc(img_t);
Step 4: the master controller outputs an action a_t = π_θ(a|z_t) according to the low-dimensional state;
Step 5: the dynamics model in the large model predicts the low-dimensional state feature of the next time step from the action, z'_{t+1} = f_low(z_t, a_t);
Step 6: repeat Step 2 to Step 5 until the master controller converges.
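For illustration, the Step 1-6 loop might be sketched as follows; f_high, f_enc, f_low, f_dec and the policy are assumed to be callables exposed by the large model and the controller, `policy(z)` is assumed to return an action and its log-probability, and `reward` and `policy_loss` refer to the example functions above:

```python
import torch

def train_controller(dataset, f_high, f_enc, f_low, f_dec, policy, optimizer,
                     k=5, max_iters=10000):
    """Sketch of the Step 1-6 loop: the large model generates rewards, the controller learns."""
    for _ in range(max_iters):
        imgs, acts = dataset.sample_clip()                   # Step 1: length-T video + actions
        pred_frames = f_high(imgs, acts, horizon=k)          # Step 2: k future frames, high-dim

        z = f_enc(imgs[-1])                                  # Step 3: current frame -> low-dim state
        log_probs, rewards = [], []
        for j in range(k):
            a, log_p = policy(z)                             # Step 4: controller picks an action
            z = f_low(z, a)                                  # Step 5: low-dim dynamics step
            img_rec = f_dec(z)                               #         decode back to image space
            rewards.append(reward(img_rec, pred_frames[j]))  # difference-based reward
            log_probs.append(log_p)

        loss = policy_loss(torch.stack(log_probs),
                           torch.tensor(rewards, dtype=torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                     # Step 6: repeat until convergence
```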
Take the offline nuScenes dataset as an example. It contains more than 1000 driving scenes, including video acquired by a six-camera surround-view rig and point clouds from a 32-beam lidar. All data were collected by human drivers, so they implicitly contain the driving strategies that humans prefer in complex scenarios. Video of length T is sampled from the nuScenes dataset. Because the original dataset does not contain specific driving actions, a unified acceleration and steering angle are used as the basic action and, together with the video data, form the video-action sequence. This sequence is fed into the large model's predictor to predict the images of the next k frames, and the real frames at those k future times in the dataset can be used to fine-tune the large model. Meanwhile, the encoder in the large model maps the images in the video to low-dimensional state features z, which are passed to the downstream reinforcement learning controller to obtain the action a that the ego vehicle should take. The dynamics model in the large model then rolls the low-dimensional state features forward for the next k steps based on these actions. Finally, the decoder in the large model maps the predicted k-step state features back to image space to obtain a reconstructed picture sequence, and the degree of difference between this sequence and the k frames predicted by the predictor generates the reward signal required by the controller during reinforcement learning training. The master controller learns a driving strategy from the provided reward signal; because the rewards generated by the large model steer the master controller's driving behavior to align with the expert driving strategies in the dataset, the master controller learns a human-like driving strategy and massive real driving data are used effectively.
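A minimal sketch of how such a length-T video-action clip might be assembled from a nuScenes-style drive log; the field names and the derivation of acceleration and steering from speed and heading are illustrative assumptions:

```python
def sample_clip(log, t0, T, dt=0.5):
    """Assemble X = {(img_0, a_0), ..., (img_T, a_T)} from a nuScenes-style drive log.

    `log` is assumed to expose per-timestep front-camera images, speed and heading;
    acceleration and steering-angle change are used as the unified low-dimensional action.
    """
    clip = []
    for t in range(t0, t0 + T + 1):
        img = log.camera_front[t]
        accel = (log.speed[t + 1] - log.speed[t]) / dt      # unified acceleration
        steer = log.heading[t + 1] - log.heading[t]         # approximate steering command
        clip.append((img, (accel, steer)))
    return clip
```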
The invention also provides a device for automatic driving reinforcement learning, comprising the following modules (an illustrative composition sketch follows the list):
a predictor, which takes the video and action information as input and is fine-tuned to predict future video frames;
an encoder, which encodes the current frame of the video into a low-dimensional feature state, in which space the state transition driven by a specific vehicle action is predicted;
a decoder, which reconstructs the low-dimensional state back to the original image space;
a low-dimensional dynamics model processor, which predicts the state at the next time step from the action and the state feature;
and a reward-function generating device, which generates a reward function from the difference between the reconstructed image and the image predicted by the large model, for automatic driving reinforcement learning training of the downstream controller.
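Purely as an illustration of how the listed modules might be composed (the class name, attribute names and interfaces are invented for this example):

```python
import torch.nn as nn

class RewardGenerationDevice(nn.Module):
    """Illustrative composition of the modules listed above; internals are placeholders."""

    def __init__(self, predictor, encoder, decoder, dynamics, reward_fn):
        super().__init__()
        self.predictor = predictor      # f_high: video + actions -> k predicted frames
        self.encoder = encoder          # f_enc: image -> low-dimensional state z
        self.decoder = decoder          # f_dec: low-dimensional state -> image
        self.dynamics = dynamics        # f_low: (z, action) -> next state z'
        self.reward_fn = reward_fn      # difference-based reward, e.g. the sketch above

    def forward(self, imgs, acts, action, k=1):
        pred_frames = self.predictor(imgs, acts, horizon=k)   # high-dimensional prediction
        z = self.encoder(imgs[-1])                            # encode current frame
        z_next = self.dynamics(z, action)                     # action-conditioned transition
        img_rec = self.decoder(z_next)                        # reconstruct to image space
        return self.reward_fn(img_rec, pred_frames[0])        # reward for downstream controller
```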
The encoder takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and the image frames at k future times are predicted:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
The encoder f_enc(·) encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
The low-dimensional dynamics model processor f_low(·) predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
The decoder f_dec(·) then reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
The reward-function generating device generates a reward function from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
The reward-function generating device applies logarithmic scaling to the reward value, r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
The present invention also provides a computer-readable storage medium containing a computer program which, when executed by one or more processors, implements any of the automatic driving reinforcement learning methods described above.
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any person skilled in the art who, within the technical scope disclosed by the present invention, makes equivalent substitutions or modifications according to the technical scheme and inventive concept of the present invention shall be covered by the scope of protection of the present invention.

Claims (10)

1. An automatic driving reinforcement learning method, characterized in that video and action information are used as inputs; a large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image is used as a reward function for training a downstream reinforcement learning algorithm.
2. The automatic driving reinforcement learning method of claim 1, wherein,
The video and action information comprises the video and the specific action-instruction data contained in offline driving data, which are used as input to fine-tune the large model so that it predicts future video frames; meanwhile, the large model encodes the current frame of the video into a low-dimensional feature state, the state transition driven by a specific vehicle action is predicted in this low-dimensional feature space, the decoder reconstructs the low-dimensional state back to the original image space, and the difference between the reconstructed image and the image predicted by the large model is used as the reward to train a downstream reinforcement learning controller.
3. The automatic driving reinforcement learning method of claim 2, wherein,
The training and deployment process of the automatic driving reinforcement learning method comprises the following steps:
S1: the large model takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and predicts the image frames at k future times:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
S2: the encoder f_enc(·) in the large model encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
then the low-dimensional dynamics model f_low(·) in the large model predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
finally the decoder f_dec(·) in the large model reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
S3: a reward is generated from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
4. The automatic driving reinforcement learning method of claim 2, wherein the reward value is logarithmically scaled in S3 as r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
5. The automatic driving reinforcement learning method of claim 4, wherein a reinforcement learning master controller π_θ(a|z) is finally used to output actions from the low-dimensional features, and the objective of the master controller is to maximize the reward:
max_θ E_{π_θ}[Σ_t r'_t]
where θ denotes the parameters of the master controller.
6. The automatic driving reinforcement learning method of any one of claims 1-5, wherein the specific steps of the reward-generation algorithm are as follows:
Step 1: sample a video of length T and the action-sequence data X = {(img_0, a_0), …, (img_T, a_T)} from offline human driving data;
Step 2: the predictor in the large model predicts the video output for k future times: {img'_{T+1}, …, img'_{T+k}} = f_high(X);
Step 3: the encoder in the large model encodes the image at time t into a low-dimensional feature state z_t = f_enc(img_t);
Step 4: the master controller outputs an action a_t = π_θ(a|z_t) according to the low-dimensional state;
Step 5: the dynamics model in the large model predicts the low-dimensional state feature of the next time step from the action, z'_{t+1} = f_low(z_t, a_t);
Step 6: repeat Steps 2 through 5 until the master controller converges.
7. An apparatus for automatic driving reinforcement learning, characterized by comprising:
a predictor, which takes the video and action information as input and is fine-tuned to predict future video frames;
an encoder, which encodes the current frame of the video into a low-dimensional feature state, in which space the state transition driven by a specific vehicle action is predicted;
a decoder, which reconstructs the low-dimensional state back to the original image space;
a low-dimensional dynamics model processor, which predicts the state at the next time step from the action and the state feature;
and a reward-function generating device, which generates a reward function from the difference between the reconstructed image and the image predicted by the large model, for automatic driving reinforcement learning training of the downstream controller.
8. The apparatus for automatic driving reinforcement learning of claim 7, wherein,
The encoder takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and the image frames at k future times are predicted:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
the encoder f_enc(·) encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
the low-dimensional dynamics model processor f_low(·) predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
the decoder f_dec(·) then reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
the reward-function generating device generates a reward function from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
9. The apparatus for automatic driving reinforcement learning of claim 8, wherein the reward-function generating device applies logarithmic scaling to the reward value, r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
10. A computer-readable storage medium containing a computer program, characterized in that the automatic driving reinforcement learning method of any one of claims 1-6 is implemented when the computer program is executed by one or more processors.
CN202410283603.0A 2024-03-13 2024-03-13 Automatic driving reinforcement learning method, device and storage medium Pending CN117973497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410283603.0A CN117973497A (en) 2024-03-13 2024-03-13 Automatic driving reinforcement learning method, device and storage medium

Publications (1)

Publication Number Publication Date
CN117973497A true CN117973497A (en) 2024-05-03

Family

ID=90849567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410283603.0A Pending CN117973497A (en) 2024-03-13 2024-03-13 Automatic driving reinforcement learning method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117973497A (en)


Legal Events

Date Code Title Description
PB01 Publication