CN117973497A - Automatic driving reinforcement learning method, device and storage medium - Google Patents

Automatic driving reinforcement learning method, device and storage medium

Info

Publication number
CN117973497A
CN117973497A (application CN202410283603.0A)
Authority
CN
China
Prior art keywords: img, low dimensional, reinforcement learning, image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410283603.0A
Other languages
Chinese (zh)
Inventor
何弢
严骏驰
廖文龙
李奇峰
贾萧松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kuyi Robot Co ltd
Kuwa Technology Co ltd
Original Assignee
Shanghai Kuyi Robot Co ltd
Kuwa Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kuyi Robot Co ltd, Kuwa Technology Co ltd filed Critical Shanghai Kuyi Robot Co ltd
Priority to CN202410283603.0A priority Critical patent/CN117973497A/en
Publication of CN117973497A publication Critical patent/CN117973497A/en
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an automatic driving reinforcement learning method, device and storage medium, in which video and action information are used as inputs. A large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image serves as a reward function for training a downstream reinforcement learning algorithm. The technical scheme of this patent brings the following benefits: the automatic driving reinforcement learning method that automatically generates rewards with a large model can effectively exploit massive offline real driving data and enables an automatic driving vehicle to learn a human-like driving strategy; it automates the reward-generation process, can be connected to any form of downstream automatic driving reinforcement learning planning-and-control device for training, and accelerates the development of automatic driving planning-and-control algorithms.

Description

Automatic driving reinforcement learning method, device and storage medium
Technical Field
The present invention relates to the field of advanced driver assistance and automatic driving, and more particularly, to an automatic driving reinforcement learning method, apparatus and storage medium that automatically generate rewards using a large model.
Background
With the development of automatic driving technology, there is an increasing demand for efficient and safe vehicle control strategies. Reinforcement learning is widely adopted as an effective way to learn automatic driving strategies: following the logic of trial-and-error learning, the vehicle explores in simulation and optimizes its driving strategy according to the rewards it collects. The reward function therefore plays a vital role in reinforcement learning, as it guides the choice of behavior during learning. However, designing the reward function requires a great deal of manual effort, it must resolve possible reward conflicts across scenarios, and the enumerative style of hand-crafted reward design cannot cope with an unbounded set of extreme scenarios. In addition, even when existing reinforcement learning methods find a feasible strategy in a known scenario, the resulting driving style differs greatly from that of a human driver, leading to a poor experience. Meanwhile, there is a huge amount of human driving video data on the internet; how to use the human driving knowledge contained in these data to make automatic driving algorithms more human-like, and to automatically learn driving strategies for a large number of different scenarios from video, is a major challenge and the problem this invention seeks to solve.
However, this massive video data differs in many ways, such as viewing angle, vehicle dynamics and labeling, and it is difficult to extract a unified decision trajectory directly from the video to generate rewards. How to map state transitions in video space to specific low-dimensional vehicle actions is also a difficult problem. Evaluating driving behavior in such heterogeneous data, and performing reasonable action abstraction and mapping, is therefore the key to automated reward generation.
In the prior art, Video Language Planning (VLP) mainly comprises a video-language large model and a video large model. The video-language large model takes the image currently acquired by the camera and a language instruction as input, outputs actions in a high-level semantic space, and predicts the state transition in image space caused by a specific action. VLP has the video-language large model perform action exploration on each input video frame to generate various possible actions, then uses the video large model as a dynamics (transition) model to simulate multiple video-action trajectories formed from those actions, and finally uses the video-language large model again as a heuristic function to select the action trajectory most conducive to achieving the final task goal; the generated video-action plan can be converted into real robot actions with a goal-conditioned policy. This algorithm can learn planning strategies from offline expert video data, but it is mainly focused on the robotics field and cannot generate rewards, so other reinforcement learning algorithms cannot be plugged in for training. Source: Video Language Planning. arXiv preprint arXiv:2310.10625 (2023).
At the application level, VLP consists of two modules: a video-language large model and a video large model. The video-language large model is responsible for action generation and heuristic trajectory selection, while the video large model is responsible for trajectory exploration in video space based on the generated actions. In the action-generation stage, the video-language large model takes images and a high-level task instruction as input and outputs several possible actions that could complete the final task instruction. The video large model then performs video-trajectory exploration based on these actions, producing multiple video-action trajectories. Finally, the video-language large model is used once more to heuristically select the trajectory best able to fulfill the final task instruction. In the execution phase, a goal-conditioned policy translates actions in the trajectory into actions that a specific robot can execute directly. The algorithm flow can be summarized as follows:
Table 1 algorithm process flow pseudocode
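As a rough illustration of the flow described above, the VLP loop might be sketched as follows; the `vlm` and `video_model` interfaces (propose_action, rollout, score) are assumptions for this example, not the published VLP API:

```python
# Illustrative sketch of the VLP planning loop described above; interfaces are assumed.
def vlp_plan(image, instruction, vlm, video_model, horizon=8, n_samples=16):
    """Return the sampled video-action trajectory the VLM scores highest for the task."""
    trajectories = []
    for _ in range(n_samples):
        frames, actions = [image], []
        for _ in range(horizon):
            # Video-language model proposes a high-level action for the latest frame.
            action = vlm.propose_action(frames[-1], instruction)
            # Video model acts as the dynamics model: roll the scene forward under that action.
            frames.append(video_model.rollout(frames[-1], action))
            actions.append(action)
        trajectories.append((frames, actions))
    # The video-language model is reused as a heuristic to pick the best trajectory;
    # a goal-conditioned policy would then turn it into concrete robot commands.
    return max(trajectories, key=lambda traj: vlm.score(traj[0], instruction))
```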
The operational flow of VLP is shown in figure 1. Its drawbacks are that it cannot explicitly generate rewards, so downstream reinforcement learning algorithms cannot be attached for training, and that it depends on robot-specific instructions and therefore cannot be applied to the automatic driving field. The present invention accordingly aims to solve the problems of automatically generating a reward function from massive real driving data and of learning human-like driving strategies in complex scenarios.
Disclosure of Invention
In order to solve the problems in the prior art, the present invention provides an automatic driving reinforcement learning method, apparatus and storage medium that automatically generate rewards using a large model.
To this end, the present invention adopts the following technical scheme: an automatic driving reinforcement learning method uses video and action information as inputs; a large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image is used as a reward function for training a downstream reinforcement learning algorithm.
Further, the video and action information comprises the video and the specific action-instruction data contained in offline driving data, which are used as input to fine-tune the large model so that it predicts future video frames. Meanwhile, the large model encodes the current frame of the video into a low-dimensional feature state, the state transition driven by a specific vehicle action is predicted in this low-dimensional feature space, the decoder reconstructs the low-dimensional state back to the original image space, and the difference between the reconstructed image and the image predicted by the large model is used as the reward to train a downstream reinforcement learning controller.
Further, the training and deployment process of the automatic driving reinforcement learning method comprises the following steps:
S1: the large model takes as input a video made up of a sequence of images and a corresponding sequence of actions, x= { (img 0,a0),…,(imgT,aT) }, where img i represents the image at time i, a i represents the action taken at time i, predicting the image frames at k future times:
{img′T+1,…,img′T+k}=fhigh({(img0,a0),…,(imgT,aT)})#(1)
Where f high (·) represents the predictor of the large model in high dimensional space, imp' T+j represents the predicted image at the future jth moment given the input sequence of T moments;
s2: the encoder f enc (·) in the large model encodes the high-dimensional image img as a low-dimensional state feature z:
z=fenc(img)#(2)
Then the low-dimensional dynamic model f low (·) in the large model predicts the state z' at the next moment based on action a and state features z:
z'=flow(z,a)#(3)
Finally the decoder f_dec(·) in the large model reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
S3: A reward is generated from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
Further, in S3 the reward value is logarithmically scaled as r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
Further, a reinforcement learning master controller π_θ(a|z) finally outputs actions from the low-dimensional features; the objective of the master controller is to maximize the reward:
max_θ E_{π_θ}[Σ_t r'_t]
where θ denotes the parameters of the master controller.
Further, the specific steps of the reward-generation algorithm are as follows:
Step 1: sample a video of length T and the action-sequence data X = {(img_0, a_0), …, (img_T, a_T)} from offline human driving data;
Step 2: the predictor in the large model predicts the video output for k future times: {img'_{T+1}, …, img'_{T+k}} = f_high(X);
Step 3: the encoder in the large model encodes the image at time t into a low-dimensional feature state z_t = f_enc(img_t);
Step 4: the master controller outputs an action a_t = π_θ(a|z_t) according to the low-dimensional state;
Step 5: the dynamics model in the large model predicts the low-dimensional state feature of the next time step from the action, z'_{t+1} = f_low(z_t, a_t);
Step 6: repeat Step 2 to Step 5 until the master controller converges.
The invention also provides a device for automatic driving reinforcement learning, comprising:
a predictor, which takes the video and action information as input and is fine-tuned to predict future video frames;
an encoder, which encodes the current frame of the video into a low-dimensional feature state, in which space the state transition driven by a specific vehicle action is predicted;
a decoder, which reconstructs the low-dimensional state back to the original image space;
a low-dimensional dynamics model processor, which predicts the state at the next time step from the action and the state feature;
and a reward-function generating device, which generates a reward function from the difference between the reconstructed image and the image predicted by the large model, for automatic driving reinforcement learning training of the downstream controller.
Further, the encoder takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and the image frames at k future times are predicted:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
The encoder f_enc(·) encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
The low-dimensional dynamics model processor f_low(·) predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
The decoder f_dec(·) then reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
The reward-function generating device generates a reward function from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
Further, the reward-function generating device applies logarithmic scaling to the reward value, r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
The present invention also provides a computer-readable storage medium containing a computer program which, when executed by one or more processors, implements any of the automatic driving reinforcement learning methods described above.
The invention has the following advantages:
1. The automatic driving reinforcement learning method that automatically generates rewards with a large model can effectively exploit massive offline real driving data and enables an automatic driving vehicle to learn a human-like driving strategy.
2. The method automates the reward-generation process, can be connected to any form of downstream automatic driving reinforcement learning planning-and-control device for training, and accelerates the development of automatic driving planning-and-control algorithms.
Drawings
FIG. 1 is a schematic diagram of a prior art VLP operational flow;
FIG. 2 is a schematic diagram of an automated driving reinforcement learning method for automatically generating rewards using a large model in accordance with the present invention;
FIG. 3 is a flow chart of an automated driving reinforcement learning method for automatically generating rewards using a large model in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in fig. 2 and fig. 3, an automatic driving reinforcement learning method uses video and action information as inputs; the inputs are not limited to these, and using additional information, such as high-level language instructions, radar information and other modalities, also falls within the scope of protection. A large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image is used as a reward function for training a downstream reinforcement learning algorithm.
The invention uses the difference between the future video trajectory predicted by the large model from the video input and the picture sequence decoded from the low-dimensional features as the reward function. Different kinds of difference metrics, such as neural networks or heuristic functions, also fall within the scope of protection. The scheme of the invention offers the following advantages:
1. The automatic driving reinforcement learning method that automatically generates rewards with a large model can effectively exploit massive offline real driving data and enables an automatic driving vehicle to learn a human-like driving strategy.
2. The method automates the reward-generation process, can be connected to any form of downstream automatic driving reinforcement learning planning-and-control device for training, and accelerates the development of automatic driving planning-and-control algorithms.
In practical application, the video and action information comprises the video and the specific action-instruction data contained in offline driving data, which are used as input to fine-tune the large model so that it predicts future video frames. Meanwhile, the large model encodes the current frame of the video into a low-dimensional feature state, the state transition driven by a specific vehicle action is predicted in this low-dimensional feature space, the decoder reconstructs the low-dimensional state back to the original image space, and the difference between the reconstructed image and the image predicted by the large model is used as the reward to train a downstream reinforcement learning controller. The overall model structure and data flow are shown in fig. 2.
The training and deployment process of the automatic driving reinforcement learning method comprises the following steps:
S1: the large model takes as input a video made up of a sequence of images and a corresponding sequence of actions, x= { (img 0,a0),…,(imgT,aT) }, where img i represents the image at time i, a i represents the action taken at time i, predicting the image frames at k future times:
{img'T+1,…,img'T+k}=fhigh({(img0,a0),…,(imgT,aT)})#(1)
Where f high (·) represents the predictor of the large model in high dimensional space, img' T+j represents the predicted image at the future jth moment given the input sequence of T moments;
s2: the encoder f enc (·) in the large model encodes the high-dimensional image img as a low-dimensional state feature z:
z=fenc(img)#(2)
Then the low-dimensional dynamic model f low (·) in the large model predicts the state z' at the next moment based on action a and state features z:
z'=flow(z,a)#(3)
Finally the decoder f_dec(·) in the large model reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
S3: A reward is generated from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
The difference can be measured in various ways, for example with the squared-error distance or the Manhattan distance, among others.
In a specific application, to avoid training instability caused by fluctuations in the magnitude of the reward value, the reward is logarithmically scaled in S3 as r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
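As an illustration only, the reward computation and the symlog scaling might be sketched as follows; the use of mean per-pixel differences, pixel values in [0, 1], and the form symlog(x) = sign(x)·log(1+|x|) are assumptions for this example rather than requirements of the scheme:

```python
import numpy as np

def symlog(x):
    # Signed logarithmic scaling: compresses large reward magnitudes while keeping the sign.
    return np.sign(x) * np.log1p(np.abs(x))

def reward(img_rec, img_pred, metric="l1"):
    """Reward = theoretical maximum difference minus actual difference.

    img_rec:  image decoded from the predicted low-dimensional state, f_dec(z').
    img_pred: image predicted by the large model in pixel space, img'.
    Pixel values are assumed to lie in [0, 1], so the per-pixel difference is at most 1.
    """
    if metric == "l1":          # Manhattan distance per pixel
        diff = np.mean(np.abs(img_rec - img_pred))
    else:                       # squared-error distance per pixel
        diff = np.mean((img_rec - img_pred) ** 2)
    max_diff = 1.0              # theoretical maximum of the per-pixel difference
    r = max_diff - diff
    return symlog(r)            # log-scaled reward r' used for training stability
```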
Finally, a reinforcement learning master controller π_θ(a|z) outputs actions from the low-dimensional features; the objective of the master controller is to maximize the reward:
max_θ E_{π_θ}[Σ_t r'_t]
where θ denotes the parameters of the master controller.
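Purely for illustration, one standard way such an objective can be optimized is a REINFORCE-style update; the scheme itself does not prescribe a particular reinforcement learning algorithm:

```python
import torch

def policy_loss(log_probs, rewards):
    # One common way to maximize the expected reward (REINFORCE-style objective).
    # log_probs: log π_θ(a_t | z_t) for the actions actually taken, shape (T,)
    # rewards:   symlog-scaled rewards r'_t generated by the large model, shape (T,)
    returns = torch.cumsum(rewards.flip(0), dim=0).flip(0)   # reward-to-go at each step
    return -(log_probs * returns.detach()).mean()            # minimize the negative return
```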
As shown in fig. 3 (the master controller is the controller in the figure), the specific steps of the reward-generation algorithm are as follows; an illustrative code sketch of this loop follows the steps:
Step 1: sample a video of length T and the action-sequence data X = {(img_0, a_0), …, (img_T, a_T)} from offline human driving data;
Step 2: the predictor in the large model predicts the video output for k future times: {img'_{T+1}, …, img'_{T+k}} = f_high(X);
Step 3: the encoder in the large model encodes the image at time t into a low-dimensional feature state z_t = f_enc(img_t);
Step 4: the master controller outputs an action a_t = π_θ(a|z_t) according to the low-dimensional state;
Step 5: the dynamics model in the large model predicts the low-dimensional state feature of the next time step from the action, z'_{t+1} = f_low(z_t, a_t);
Step 6: repeat Step 2 to Step 5 until the master controller converges.
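For illustration, the Step 1-6 loop might be sketched as follows; f_high, f_enc, f_low, f_dec and the policy are assumed to be callables exposed by the large model and the controller, `policy(z)` is assumed to return an action and its log-probability, and `reward` and `policy_loss` refer to the example functions above:

```python
import torch

def train_controller(dataset, f_high, f_enc, f_low, f_dec, policy, optimizer,
                     k=5, max_iters=10000):
    """Sketch of the Step 1-6 loop: the large model generates rewards, the controller learns."""
    for _ in range(max_iters):
        imgs, acts = dataset.sample_clip()                   # Step 1: length-T video + actions
        pred_frames = f_high(imgs, acts, horizon=k)          # Step 2: k future frames, high-dim

        z = f_enc(imgs[-1])                                  # Step 3: current frame -> low-dim state
        log_probs, rewards = [], []
        for j in range(k):
            a, log_p = policy(z)                             # Step 4: controller picks an action
            z = f_low(z, a)                                  # Step 5: low-dim dynamics step
            img_rec = f_dec(z)                               #         decode back to image space
            rewards.append(reward(img_rec, pred_frames[j]))  # difference-based reward
            log_probs.append(log_p)

        loss = policy_loss(torch.stack(log_probs),
                           torch.tensor(rewards, dtype=torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                     # Step 6: repeat until convergence
```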
Take the offline nuScenes dataset as an example. It contains more than 1000 driving scenes, including video acquired by a six-camera surround-view rig and point clouds from a 32-beam lidar. All data were collected by human drivers, so they implicitly contain the driving strategies that humans prefer in complex scenarios. Video of length T is sampled from the nuScenes dataset. Because the original dataset does not contain specific driving actions, a unified acceleration and steering angle are used as the basic action and, together with the video data, form the video-action sequence. This sequence is fed into the large model's predictor to predict the images of the next k frames, and the real frames at those k future times in the dataset can be used to fine-tune the large model. Meanwhile, the encoder in the large model maps the images in the video to low-dimensional state features z, which are passed to the downstream reinforcement learning controller to obtain the action a that the ego vehicle should take. The dynamics model in the large model then rolls the low-dimensional state features forward for the next k steps based on these actions. Finally, the decoder in the large model maps the predicted k-step state features back to image space to obtain a reconstructed picture sequence, and the degree of difference between this sequence and the k frames predicted by the predictor generates the reward signal required by the controller during reinforcement learning training. The master controller learns a driving strategy from the provided reward signal; because the rewards generated by the large model steer the master controller's driving behavior to align with the expert driving strategies in the dataset, the master controller learns a human-like driving strategy and massive real driving data are used effectively.
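A minimal sketch of how such a length-T video-action clip might be assembled from a nuScenes-style drive log; the field names and the derivation of acceleration and steering from speed and heading are illustrative assumptions:

```python
def sample_clip(log, t0, T, dt=0.5):
    """Assemble X = {(img_0, a_0), ..., (img_T, a_T)} from a nuScenes-style drive log.

    `log` is assumed to expose per-timestep front-camera images, speed and heading;
    acceleration and steering-angle change are used as the unified low-dimensional action.
    """
    clip = []
    for t in range(t0, t0 + T + 1):
        img = log.camera_front[t]
        accel = (log.speed[t + 1] - log.speed[t]) / dt      # unified acceleration
        steer = log.heading[t + 1] - log.heading[t]         # approximate steering command
        clip.append((img, (accel, steer)))
    return clip
```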
The invention also provides a device for automatic driving reinforcement learning, comprising the following modules (an illustrative composition sketch follows the list):
a predictor, which takes the video and action information as input and is fine-tuned to predict future video frames;
an encoder, which encodes the current frame of the video into a low-dimensional feature state, in which space the state transition driven by a specific vehicle action is predicted;
a decoder, which reconstructs the low-dimensional state back to the original image space;
a low-dimensional dynamics model processor, which predicts the state at the next time step from the action and the state feature;
and a reward-function generating device, which generates a reward function from the difference between the reconstructed image and the image predicted by the large model, for automatic driving reinforcement learning training of the downstream controller.
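Purely as an illustration of how the listed modules might be composed (the class name, attribute names and interfaces are invented for this example):

```python
import torch.nn as nn

class RewardGenerationDevice(nn.Module):
    """Illustrative composition of the modules listed above; internals are placeholders."""

    def __init__(self, predictor, encoder, decoder, dynamics, reward_fn):
        super().__init__()
        self.predictor = predictor      # f_high: video + actions -> k predicted frames
        self.encoder = encoder          # f_enc: image -> low-dimensional state z
        self.decoder = decoder          # f_dec: low-dimensional state -> image
        self.dynamics = dynamics        # f_low: (z, action) -> next state z'
        self.reward_fn = reward_fn      # difference-based reward, e.g. the sketch above

    def forward(self, imgs, acts, action, k=1):
        pred_frames = self.predictor(imgs, acts, horizon=k)   # high-dimensional prediction
        z = self.encoder(imgs[-1])                            # encode current frame
        z_next = self.dynamics(z, action)                     # action-conditioned transition
        img_rec = self.decoder(z_next)                        # reconstruct to image space
        return self.reward_fn(img_rec, pred_frames[0])        # reward for downstream controller
```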
The encoder takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and the image frames at k future times are predicted:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
The encoder f_enc(·) encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
The low-dimensional dynamics model processor f_low(·) predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
The decoder f_dec(·) then reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
The reward-function generating device generates a reward function from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
The reward-function generating device applies logarithmic scaling to the reward value, r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
The present invention also provides a computer-readable storage medium containing a computer program which, when executed by one or more processors, implements any of the automatic driving reinforcement learning methods described above.
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any person skilled in the art who, within the technical scope disclosed by the present invention, makes equivalent substitutions or modifications according to the technical scheme and inventive concept of the present invention shall be covered by the scope of protection of the present invention.

Claims (10)

1. An automatic driving reinforcement learning method, characterized in that video and action information are used as inputs; a large model predicts action-conditioned state transitions in a high-dimensional image space and in a low-dimensional feature space, and the difference between the image reconstructed (decoded) from the low-dimensional features and the real image is used as a reward function for training a downstream reinforcement learning algorithm.
2. The automatic driving reinforcement learning method of claim 1, wherein,
The video and action information comprises the video and the specific action-instruction data contained in offline driving data, which are used as input to fine-tune the large model so that it predicts future video frames; meanwhile, the large model encodes the current frame of the video into a low-dimensional feature state, the state transition driven by a specific vehicle action is predicted in this low-dimensional feature space, the decoder reconstructs the low-dimensional state back to the original image space, and the difference between the reconstructed image and the image predicted by the large model is used as the reward to train a downstream reinforcement learning controller.
3. The automatic driving reinforcement learning method of claim 2, wherein,
The training and deployment process of the automatic driving reinforcement learning method comprises the following steps:
S1: the large model takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and predicts the image frames at k future times:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
S2: the encoder f_enc(·) in the large model encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
then the low-dimensional dynamics model f_low(·) in the large model predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
finally the decoder f_dec(·) in the large model reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
S3: a reward is generated from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
4. The automatic driving reinforcement learning method of claim 2, wherein the reward value is logarithmically scaled in S3 as r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
5. The automatic driving reinforcement learning method of claim 4, wherein a reinforcement learning master controller π_θ(a|z) is finally used to output actions from the low-dimensional features, and the objective of the master controller is to maximize the reward:
max_θ E_{π_θ}[Σ_t r'_t]
where θ denotes the parameters of the master controller.
6. The automatic driving reinforcement learning method of any one of claims 1-5, wherein the specific steps of the reward-generation algorithm are as follows:
Step 1: sample a video of length T and the action-sequence data X = {(img_0, a_0), …, (img_T, a_T)} from offline human driving data;
Step 2: the predictor in the large model predicts the video output for k future times: {img'_{T+1}, …, img'_{T+k}} = f_high(X);
Step 3: the encoder in the large model encodes the image at time t into a low-dimensional feature state z_t = f_enc(img_t);
Step 4: the master controller outputs an action a_t = π_θ(a|z_t) according to the low-dimensional state;
Step 5: the dynamics model in the large model predicts the low-dimensional state feature of the next time step from the action, z'_{t+1} = f_low(z_t, a_t);
Step 6: repeat Steps 2 through 5 until the master controller converges.
7. An apparatus for automatic driving reinforcement learning, characterized by comprising:
a predictor, which takes the video and action information as input and is fine-tuned to predict future video frames;
an encoder, which encodes the current frame of the video into a low-dimensional feature state, in which space the state transition driven by a specific vehicle action is predicted;
a decoder, which reconstructs the low-dimensional state back to the original image space;
a low-dimensional dynamics model processor, which predicts the state at the next time step from the action and the state feature;
and a reward-function generating device, which generates a reward function from the difference between the reconstructed image and the image predicted by the large model, for automatic driving reinforcement learning training of the downstream controller.
8. The apparatus for automatic driving reinforcement learning of claim 7, wherein,
The encoder takes as input a video composed of an image sequence and the corresponding action sequence, X = {(img_0, a_0), …, (img_T, a_T)}, where img_i denotes the image at time i and a_i the action taken at time i, and the image frames at k future times are predicted:
{img'_{T+1}, …, img'_{T+k}} = f_high({(img_0, a_0), …, (img_T, a_T)})   #(1)
where f_high(·) denotes the large model's predictor in the high-dimensional space, and img'_{T+j} denotes the predicted image at the j-th future time given the input sequence up to time T;
the encoder f_enc(·) encodes the high-dimensional image img into a low-dimensional state feature z:
z = f_enc(img)   #(2)
the low-dimensional dynamics model processor f_low(·) predicts the state z' at the next time step from the action a and the state feature z:
z' = f_low(z, a)   #(3)
the decoder f_dec(·) then reconstructs the low-dimensional state z' back into the high-dimensional image space:
\hat{img}' = f_dec(z')   #(4)
the reward-function generating device generates a reward function from the difference between the decoder-reconstructed image \hat{img}' and the image img' predicted by the large model:
r = R(\hat{img}', img')   #(5)
where R(·) denotes the theoretical maximum difference between the two images minus their actual difference.
9. The apparatus for automatic driving reinforcement learning of claim 8, wherein the reward-function generating device applies logarithmic scaling to the reward value, r' = symlog(r), where symlog(·) denotes signed (symmetric) logarithmic scaling of its argument.
10. A computer-readable storage medium containing a computer program, characterized in that the automatic driving reinforcement learning method of any one of claims 1-6 is implemented when the computer program is executed by one or more processors.
CN202410283603.0A 2024-03-13 2024-03-13 Automatic driving reinforcement learning method, device and storage medium Pending CN117973497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410283603.0A CN117973497A (en) 2024-03-13 2024-03-13 Automatic driving reinforcement learning method, device and storage medium

Publications (1)

Publication Number Publication Date
CN117973497A true CN117973497A (en) 2024-05-03

Family

ID=90849567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410283603.0A Pending CN117973497A (en) 2024-03-13 2024-03-13 Automatic driving reinforcement learning method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117973497A (en)


Legal Events

Date Code Title Description
PB01 Publication