WO2022052406A1 - Automatic driving training method, apparatus and device, and medium - Google Patents

Automatic driving training method, apparatus and device, and medium

Info

Publication number
WO2022052406A1
WO2022052406A1 PCT/CN2021/073449 CN2021073449W WO2022052406A1 WO 2022052406 A1 WO2022052406 A1 WO 2022052406A1 CN 2021073449 W CN2021073449 W CN 2021073449W WO 2022052406 A1 WO2022052406 A1 WO 2022052406A1
Authority
WO
WIPO (PCT)
Prior art keywords
automatic driving
policy
structured noise
network
historical data
Prior art date
Application number
PCT/CN2021/073449
Other languages
English (en)
Chinese (zh)
Inventor
李仁刚
赵雅倩
李茹杨
李雪雷
金良
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Publication of WO2022052406A1 publication Critical patent/WO2022052406A1/fr

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0238Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
    • G05D1/024Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0253Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting relative motion information from a plurality of images taken successively, e.g. visual odometry, optical flow
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257Control of position or course in two dimensions specially adapted to land vehicles using a radar
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • G05D1/0278Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using satellite positioning signals, e.g. GPS
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B9/00Simulators for teaching or training purposes
    • G09B9/02Simulators for teaching or training purposes for teaching control of vehicles or other craft
    • G09B9/04Simulators for teaching or training purposes for teaching control of vehicles or other craft for teaching control of land vehicles

Definitions

  • the present application relates to the technical field of automatic driving, and in particular, to an automatic driving training method, device, equipment and medium.
  • FIG. 1 is a schematic diagram of a control architecture of an automatic driving vehicle provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a modular method in the prior art provided by this application, in which the automatic driving system is decomposed into several independent but interrelated modules, such as Perception, Localization, Planning and Control modules; this approach has good interpretability and can quickly locate the faulty module when the system fails, and it is the conventional method widely used in the industry at this stage.
  • FIG. 3 is a schematic diagram of an end-to-end method in the prior art provided by this application.
  • the end-to-end method regards the automatic driving problem as a machine learning problem, and directly optimizes the "sensor data processing - generate control command - execute command" process.
  • the end-to-end method is simple to build and has achieved rapid development in the field of autonomous driving, but the method itself is also a "black box" with poor interpretability.
  • FIG. 4 is a schematic diagram of an Open-loop imitation learning method in the prior art provided by this application.
  • the Open-loop imitation learning method learns to drive autonomously in a supervised learning manner by imitating the behavior of human drivers, emphasizing a kind of "predictive ability".
  • FIG. 5 is a schematic diagram of a closed-loop reinforcement learning method in the prior art provided by this application.
  • the reinforcement learning method of Closed-loop uses the Markov Decision Process (MDP, Markov Decision Process) to explore and improve automatic driving strategies from scratch, emphasizing a kind of "driving ability".
  • MDP Markov Decision Process
  • Reinforcement learning (RL) is a type of machine learning method that has developed rapidly in recent years; its agent-environment interaction mechanism and sequential decision-making mechanism are close to the process of human learning, so it is also regarded as an important path toward artificial general intelligence (AGI).
  • the deep reinforcement learning (DRL, Deep Reinforcement Learning) algorithm, combined with deep learning (DL, Deep Learning), can automatically learn abstract representations of large-scale input data and achieves better decision-making performance; it has been widely used in fields such as video games, mechanical control, advertising recommendation, financial transactions, and urban transportation.
  • When DRL is applied to autonomous driving problems, it requires neither domain expert knowledge nor model building; it has a wide range of adaptability and can cope with changing and complex road environments.
  • However, because DRL-based autonomous vehicles learn autonomous driving from scratch, the selection of poor actions in the sequential decision-making process leads to a large training variance, which is reflected in unstable driving of the vehicle, and even accidents such as running out of lanes and collisions.
  • Existing research results show that, compared with the modular method and the Open-loop imitation learning method, DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather.
  • the purpose of the present application is to provide an automatic driving training method, device, equipment and medium, which can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents. Its specific solution is as follows:
  • an automatic driving training method including:
  • the structured noise is the structured noise determined based on historical data
  • the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information
  • the policy network parameters are updated using the policy gradient algorithm.
  • the automatic driving training method further includes:
  • the corresponding pre-training data is stored in the playback buffer, and the data stored in the playback buffer is used as the historical data.
  • the updating of the evaluation network parameters through back-propagation operation based on the reward includes:
  • the use of the policy gradient algorithm to update the policy network parameters includes:
  • the policy gradient operation is performed using the value function of the evaluation network and the current policy of the policy network, and the policy network parameters are updated.
  • the automatic driving training method further includes:
  • the structured noise is precomputed.
  • the pre-calculating the structured noise includes:
  • the structured noise corresponding to the minibatch is calculated using all the Gaussian factors.
  • the pre-calculating the structured noise includes:
  • an automatic driving training device comprising:
  • the data acquisition module is used to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the automatic driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
  • an action determination module configured to use the traffic environment state and the structured noise to determine a corresponding execution action through a policy network
  • an action control module configured to control the autonomous driving vehicle to execute the execution action
  • a strategy evaluation module configured to evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward
  • an evaluation network update module for updating the evaluation network parameters through back-propagation operation based on the return
  • the policy network update module is used to update the policy network parameters using the policy gradient algorithm.
  • an automatic driving training device including a processor and a memory; wherein,
  • the memory for storing computer programs
  • the processor is configured to execute the computer program to implement the aforementioned automatic driving training method.
  • the present application discloses a computer-readable storage medium for storing a computer program, wherein when the computer program is executed by a processor, the aforementioned automatic driving training method is implemented.
  • the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; the traffic environment state and the structured noise are then used to determine a corresponding execution action through the policy network, the autonomous vehicle is controlled to perform the execution action, the strategy of the policy network is evaluated through the evaluation network according to the execution action to obtain a corresponding reward, and then the evaluation network parameters are updated through a back-propagation operation based on the reward and the policy network parameters are updated using the policy gradient algorithm.
  • structured noise based on historical data is thus introduced, and the historical data includes historical action information and historical traffic environment state information, which can improve the stability of automatic driving training and reduce the probability of dangerous accidents.
  • FIG. 1 is a schematic diagram of a control architecture of an autonomous driving vehicle provided by the present application
  • Fig. 2 is a kind of modularization method schematic diagram in the prior art
  • FIG. 3 is a schematic diagram of an end-to-end method in the prior art
  • FIG. 4 is a schematic diagram of an imitation learning method of Open-loop in the prior art
  • FIG. 5 is a schematic diagram of a closed-loop reinforcement learning method in the prior art
  • FIG. 10 is a schematic structural diagram of an automatic driving training device disclosed in the application.
  • FIG. 11 is a structural diagram of an automatic driving training device disclosed in this application.
  • When DRL is applied to autonomous driving problems, it requires neither domain expert knowledge nor model building; it has a wide range of adaptability and can cope with changing and complex road environments.
  • However, because DRL-based autonomous vehicles learn autonomous driving from scratch, the selection of poor actions in the sequential decision-making process leads to a large training variance, which is reflected in unstable driving of the vehicle, and even accidents such as running out of lanes and collisions.
  • Existing research results show that, compared with the modular method and the Open-loop imitation learning method, DRL-based autonomous driving training has the worst stability and is very sensitive to changes in the environment and weather. Therefore, the present application provides an automatic driving training solution, which can improve the stability of automatic driving training, thereby reducing the probability of occurrence of dangerous accidents.
  • an embodiment of the present application discloses an automatic driving training method, including:
  • Step S11 Obtain the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information.
  • the sequential decision-making process of the DRL-based automatic driving system is: the automatic driving vehicle (i.e., the agent) observes the environment state S_t at time t, such as kinematic information (position, speed, acceleration, etc.) of itself and other traffic participants, traffic lights, road topology features and other information, represents the policy π_θ with a nonlinear neural network (NN, Neural Network), and selects a vehicle action a_t, such as acceleration/deceleration, steering, lane changing, braking, etc.
  • NN Neural Network
  • the environment calculates the reward r_{t+1} according to the action a_t taken by the autonomous driving vehicle, combined with preset benchmarks such as the average driving speed of the autonomous driving vehicle, the distance from the lane center, running a red light, collisions and other factors, and then enters a new state S_{t+1}.
  • the self-driving vehicle adjusts the policy π_θ according to the obtained reward r_{t+1}, and enters the next decision-making process in combination with the new state S_{t+1}.
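  • As an illustration of this sequential decision-making loop, the following minimal Python sketch (assuming a Gym-style env.reset()/env.step() simulator interface and a generic policy object; both are hypothetical placeholders rather than components disclosed in this application) shows one episode of agent-environment interaction.

```python
# Minimal sketch of the DRL agent-environment interaction loop described above.
# `env` is assumed to expose a Gym-style reset()/step() interface and `policy`
# is assumed to map a traffic environment state to a vehicle action; both are
# hypothetical placeholders.

def run_episode(env, policy, max_steps=1000):
    state = env.reset()                          # observe the initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy.select_action(state)     # a_t, e.g. accelerate / steer / change lane
        next_state, reward, done, _ = env.step(action)   # environment returns r_{t+1}, S_{t+1}
        policy.observe(state, action, reward, next_state, done)  # kept for later updates
        total_reward += reward
        state = next_state
        if done:                                  # e.g. collision or episode end
            break
    return total_reward
```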
  • DRL-based autonomous driving research applications mostly use algorithms that can deal with continuous action spaces, such as Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO) and the more recent Proximal Policy Optimization (PPO).
  • DDPG Deep Deterministic Policy Gradient
  • TRPO Trust Region Policy Optimization
  • PPO Proximal Policy Optimization
  • DRL and structured noise can be fused to make automatic driving decisions.
  • this embodiment can use the DDPG algorithm with high sample efficiency and computational efficiency.
  • other applicable algorithms include the Asynchronous Advantage Actor-Critic algorithm (A3C), the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3), and the Soft Actor-Critic algorithm (SAC).
  • this embodiment can acquire traffic environment state data collected by vehicle sensors.
  • the driving environment status, such as weather data, traffic lights, road topology information, and the positions and running states of the automatic driving vehicle and other traffic participants, can be obtained with the help of vehicle-mounted sensor devices such as cameras, GPS (Global Positioning System), IMU (Inertial Measurement Unit), millimeter-wave radar and LiDAR.
  • the traffic environment status in this embodiment not only includes the original image data directly obtained by the camera, but also includes data processed by deep learning models, such as depth maps and semantic segmentation maps produced by RefineNet.
  • Step S12 Determine a corresponding execution action by using the traffic environment state and the structured noise through a policy network.
  • the policy network (Actor Net) selects the action a_t based on the policy function π_θ(a|s_t, z_t), where θ is the network parameter of Actor Net and z_t is the structured noise.
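  • A minimal sketch of such a policy network is shown below, assuming a PyTorch implementation in which the structured noise vector z is concatenated with the traffic environment state before being mapped to a bounded continuous action; the layer sizes and the concatenation scheme are illustrative assumptions, not details fixed by this application.

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Policy network pi_theta(a | s, z): maps state plus structured noise to an action."""
    def __init__(self, state_dim, noise_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions bounded to [-1, 1]
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))
```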
  • Step S13 Control the automatic driving vehicle to execute the execution action.
  • Step S14 Evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward.
  • the evaluation network (Critic Net) evaluates Actor Net's strategy based on the value function Q_ω(s, a, z) according to the action a_t performed by the autonomous vehicle, and obtains the reward r_{t+1} given by the traffic environment, where ω is the network parameter of Critic Net.
  • the value function Q_ω(s, a, z) is transformed from the preset reward function.
  • the reward function r_t for the automatic driving problem under study can also be designed in advance.
  • the reward function of the autonomous driving vehicle can be designed in different forms.
  • the reward function can be designed as a function of the driving speed, where v is the driving speed of the autonomous vehicle, v_ref is the reference speed set according to the road speed limit, together with a manually set coefficient.
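  • As an illustration only, a minimal sketch of such a reward is given below, under the assumption that it penalizes deviation from the reference speed together with collision and lane-departure penalties; the function name, the coefficient alpha and the penalty values are hypothetical choices, not the form fixed by this application.

```python
def driving_reward(v, v_ref, collided=False, off_lane=False,
                   alpha=0.1, collision_penalty=100.0, lane_penalty=10.0):
    """Illustrative reward sketch (assumed form).

    v      -- current driving speed of the autonomous vehicle
    v_ref  -- reference speed set according to the road speed limit
    alpha  -- manually set coefficient (hypothetical value)
    """
    r = -alpha * abs(v - v_ref)      # track the reference speed
    if collided:
        r -= collision_penalty       # large penalty for collisions
    if off_lane:
        r -= lane_penalty            # penalty for leaving the lane
    return r
```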
  • the value function can be calculated from the reward function, for example in the form Q(s, a) = E[ Σ_{k≥0} γ^k · r_{t+k+1} | s_t = s, a_t = a ], where γ ∈ (0,1] is the discount factor and E denotes the expectation operation; since structured noise is introduced in this application, the corresponding value function is written Q_ω(s, a, z).
  • Step S15 Update evaluation network parameters through back-propagation operation based on the reward.
  • a back-propagation operation on the evaluation network loss function is performed based on the reward, and the evaluation network parameters are updated in a single step. Specifically, through the back-propagation operation, the loss function of the evaluation network is minimized, and the network parameter ω is updated in a single step.
  • the evaluation network loss function is L(ω) = (1/N) · Σ_t [ y_t − Q_ω(s_t, a_t, z_t) ]², with the target value y_t = r_{t+1} + γ · Q′_ω(s_{t+1}, a_{t+1}, z_{t+1}).
  • Q′_ω(s_{t+1}, a_{t+1}, z_{t+1}) and Q_ω(s_t, a_t, z_t) are the value functions of the target network and the prediction network, respectively.
  • N is the number of samples collected, and γ ∈ (0,1] is the discount factor.
  • the target network and prediction network are neural networks designed based on the DQN (i.e., Deep Q-Network, a deep value-function neural network) algorithm.
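  • A minimal PyTorch-style sketch of this single-step evaluation network update is given below; critic, critic_target, actor_target and the optimizer are assumed to be already constructed, and the mean-squared-error form of the loss follows the standard DDPG critic update rather than any additional detail of this application.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_target, actor_target, critic_opt, batch, gamma=0.99):
    """One back-propagation step on the evaluation (Critic) network.

    `batch` is assumed to hold tensors (s, a, z, r, s_next, z_next) sampled from
    the playback buffer; the network objects and optimizer are placeholders.
    """
    s, a, z, r, s_next, z_next = batch
    with torch.no_grad():
        a_next = actor_target(s_next, z_next)                   # a_{t+1} from the target policy
        y = r + gamma * critic_target(s_next, a_next, z_next)   # y_t = r_{t+1} + gamma * Q'(s', a', z')
    q = critic(s, a, z)                                         # Q_omega(s_t, a_t, z_t)
    loss = F.mse_loss(q, y)                                     # (1/N) * sum (y_t - Q)^2
    critic_opt.zero_grad()
    loss.backward()                                             # back-propagation operation
    critic_opt.step()                                           # single-step parameter update
    return loss.item()
```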
  • Step S16 Use the policy gradient algorithm to update the policy network parameters.
  • this embodiment may use the value function of the evaluation network and the current strategy of the strategy network to perform a strategy gradient operation, and update the strategy network parameters.
  • this embodiment updates the network parameter θ of Actor Net through the following policy gradient: ∇_θ J(θ) ≈ (1/N) · Σ ∇_a Q_ω(s, a, z)|_{a = π_θ(s, z)} · ∇_θ π_θ(s, z).
  • J(θ) is the objective function of the policy gradient method, usually expressed in some form of reward; ∇_a Q_ω(s, a, z) is obtained by differentiating the value function of Critic Net with respect to the action a, and ∇_θ π_θ(s, z) is derived from the policy of Actor Net at the current step.
  • the task of the policy gradient method is to maximize this objective function, which is achieved by gradient ascent.
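  • A corresponding sketch of the policy gradient step is shown below, again assuming PyTorch and pre-constructed actor, critic and optimizer objects; maximizing J(θ) is implemented, as usual, by gradient ascent, i.e. by minimizing the negative of the Critic's value for the actions chosen by the current policy.

```python
def update_actor(actor, critic, actor_opt, batch):
    """One policy-gradient step on the policy (Actor) network.

    Gradient ascent on J(theta) is realized by minimizing -Q for the actions
    produced by the current policy; `batch` layout matches update_critic above.
    """
    s, _, z, _, _, _ = batch
    a = actor(s, z)                        # actions from the current policy pi_theta(s, z)
    actor_loss = -critic(s, a, z).mean()   # -J(theta); its gradient is the policy gradient
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()
```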
  • FIG. 7 is a schematic diagram of an automatic driving training disclosed in the present application.
  • the DDPG algorithm is used to train vehicles for autonomous driving.
  • the DDPG algorithm is a typical Actor-Critic reinforcement learning algorithm.
  • the policy network (Actor Net) updates the policy according to the value function fed back by the evaluation network (Critic Net), while Critic Net trains the value function and uses the temporal-difference (TD) method for single-step updates.
  • Critic Net includes a target network (Target Net) and a prediction network (Pred Net) designed based on the DQN algorithm, and the value functions of the two networks are used when the network parameters are updated.
  • the Actor Net and Critic Net work together to maximize the cumulative reward for the actions chosen by the agent.
  • the embodiment of the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the automatic driving vehicle, and the historical data includes historical action information and historical traffic environment state information; the traffic environment state and the structured noise are then used to determine a corresponding execution action through the policy network, the self-driving vehicle is controlled to perform the execution action, the strategy of the policy network is evaluated through the evaluation network according to the execution action to obtain a corresponding reward, and then the evaluation network parameters are updated through a back-propagation operation based on the reward and the policy network parameters are updated using the policy gradient algorithm.
  • structured noise based on historical data is thus introduced, and the historical data includes historical action information and historical traffic environment state information, which can improve the stability of automatic driving training and reduce the probability of dangerous accidents.
  • an embodiment of the present application discloses a specific automatic driving training method, including:
  • Step S21 Use the DQN algorithm to pre-train the autonomous vehicle.
  • Step S22 Store the corresponding pre-training data in the playback buffer, and use the data stored in the playback buffer as the historical data.
  • the classical DQN algorithm is used to pre-train the vehicle for automatic driving, and the playback buffer data B is accumulated.
  • in the classic DQN method, two neural networks with the same structure but different parameters are constructed, namely the target network (Target Net), which updates its parameters at a certain interval, and the prediction network (Pred Net), which updates its parameters at every step.
  • the action space of the autonomous vehicle at each time t is [a_t1, a_t2, a_t3], which represent "lane change to the left", "lane change to the right" and "keep current lane", respectively.
  • Both Target Net and Pred Net use a simple 3-layer neural network with only one hidden layer in the middle.
  • the traffic environment state S_t collected by the vehicle sensor device is input, the output target value Q_target and the predicted value Q_pred are calculated, and the action a_t corresponding to the largest Q_pred is selected as the driving action of the autonomous vehicle.
  • the network parameters are updated by minimizing the loss function with the RMSProp optimizer, and the self-driving vehicle is continuously pre-trained until sufficient playback buffer data B has been accumulated.
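  • The following sketch illustrates this pre-training stage in PyTorch; the three-action space, the single hidden layer and the RMSProp optimizer follow the description above, while the hidden-layer width, buffer size, epsilon-greedy exploration and Gym-style environment interface are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """3-layer network with a single hidden layer; outputs Q-values for the three
    discrete actions [change lane left, change lane right, keep current lane]."""
    def __init__(self, state_dim, hidden=64, n_actions=3):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, s):
        return self.fc2(torch.relu(self.fc1(s)))

def pretrain_dqn(env, state_dim, steps=10_000, gamma=0.99, eps=0.1,
                 batch_size=64, target_update=500):
    pred_net = QNet(state_dim)                                 # Pred Net: updated every step
    target_net = QNet(state_dim)                               # Target Net: updated at intervals
    target_net.load_state_dict(pred_net.state_dict())
    opt = torch.optim.RMSprop(pred_net.parameters(), lr=1e-3)  # RMSProp, as described above
    buffer = deque(maxlen=100_000)                             # playback buffer B

    s = env.reset()
    for step in range(steps):
        s_t = torch.as_tensor(s, dtype=torch.float32)
        if random.random() < eps:                              # assumed epsilon-greedy exploration
            a = random.randrange(3)
        else:
            a = int(pred_net(s_t).argmax())                    # action with the largest Q_pred
        s_next, r, done, _ = env.step(a)
        buffer.append((s, a, r, s_next, float(done)))          # accumulate historical data
        s = env.reset() if done else s_next

        if len(buffer) >= batch_size:
            batch = random.sample(buffer, batch_size)
            bs, ba, br, bs2, bd = (torch.as_tensor(np.array(x), dtype=torch.float32)
                                   for x in zip(*batch))
            q_pred = pred_net(bs).gather(1, ba.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                q_target = br + gamma * (1 - bd) * target_net(bs2).max(1).values
            loss = F.mse_loss(q_pred, q_target)                # minimize the loss, single-step update
            opt.zero_grad(); loss.backward(); opt.step()

        if step % target_update == 0:
            target_net.load_state_dict(pred_net.state_dict())
    return buffer                                              # B is later used for structured noise
```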
  • Step S23 Calculate the structured noise.
  • a preset number of data can be randomly extracted from the historical data to obtain a corresponding minibatch (i.e., small-batch data); the Gaussian factor of each piece of historical data in the minibatch is calculated; and the structured noise corresponding to the minibatch is calculated using all the Gaussian factors.
  • this embodiment can also randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of historical data in each of the minibatches, and then calculate the structured noise corresponding to each minibatch using all the Gaussian factors corresponding to that minibatch.
  • multiple structured noises can be calculated by using multiple minibatches, so that during automatic driving training, different structured noises can be used for training to improve the robustness of automatic driving.
  • the Gaussian factor of each sampled piece of historical data c_n is Ψ_φ(z | c_n) = N(μ_n, σ_n), where N represents the Gaussian distribution.
  • the parameters (μ_n, σ_n) of the Gaussian factor of the historical data c_n are produced by a neural network f, and φ is the parameter of the neural network f.
  • the latent variable z is then computed to obtain a probabilistic representation, namely the structured noise: the distribution q(z | c_1:N) is obtained by accumulating the Gaussian factors Ψ_φ(z | c_n) over all sampled data.
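  • A minimal sketch of this computation is given below, assuming that a small network f_phi predicts a mean and standard deviation per sampled piece of historical data and that the factors are accumulated as a product of Gaussians; the network shape and the product-of-Gaussians combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianFactorNet(nn.Module):
    """f_phi: maps one piece of historical data c_n to the parameters (mu_n, sigma_n)
    of its Gaussian factor Psi_phi(z | c_n) = N(mu_n, sigma_n)."""
    def __init__(self, data_dim, latent_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_sigma = nn.Linear(hidden, latent_dim)

    def forward(self, c):
        h = self.body(c)
        return self.mu(h), self.log_sigma(h).exp()

def structured_noise(f_phi, minibatch):
    """Accumulate the Gaussian factors of all sampled data into one distribution
    over the latent variable z (product of Gaussians, in precision form), then
    sample the structured noise from it."""
    mu, sigma = f_phi(minibatch)                       # per-sample factors, shape (N, latent_dim)
    precision = 1.0 / sigma.pow(2)
    post_var = 1.0 / precision.sum(dim=0)              # combine factors over the minibatch
    post_mu = post_var * (precision * mu).sum(dim=0)
    z = post_mu + post_var.sqrt() * torch.randn_like(post_mu)   # z ~ q(z | c_1:N)
    return z
```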
  • the structured noise may be pre-calculated.
  • a minibatch may be extracted from historical data when obtaining the traffic environment state at the current moment, and the structured noise corresponding to the current moment may be calculated.
  • Step S24 Obtain the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the autonomous vehicle, and the historical data includes historical action information and historical traffic environment state information.
  • this embodiment can acquire the traffic environment state at the current moment and the corresponding structured noise; wherein the structured noise is a pre-calculated fixed value, and the structured noise used at each moment is the same.
  • alternatively, this embodiment can acquire the traffic environment state at the current moment and the corresponding structured noise, wherein the structured noise acquired at the current moment is obtained from a plurality of pre-calculated structured noises.
  • the structured noise corresponding to the current moment may be obtained cyclically from a plurality of the pre-calculated structured noises. For example, if 100 structured noises are pre-calculated, the structured noise corresponding to the current moment can be obtained cyclically from the 100 structured noises.
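  • A sketch of such cyclic selection, with the list of pre-calculated noises and the decision step index t as placeholders, could be as simple as the following.

```python
def noise_for_step(precomputed_noises, t):
    """Cyclically pick the structured noise for decision step t from a
    pre-calculated list (e.g. 100 noises computed before training)."""
    return precomputed_noises[t % len(precomputed_noises)]
```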
  • the specific process of obtaining the structured noise corresponding to the current moment may include: randomly extracting a preset number of data from the historical data in real time to obtain a corresponding minibatch, calculating the Gaussian factor of each piece of historical data in the minibatch, and calculating the structured noise corresponding to the minibatch using all the Gaussian factors.
  • Step S25 Determine a corresponding execution action by using the traffic environment state and the structured noise through a policy network.
  • Step S26 Control the automatic driving vehicle to execute the execution action.
  • Step S27 Evaluate the strategy of the strategy network through the evaluation network according to the execution action, and obtain a corresponding reward.
  • the evaluation network inherits the pre-trained target network and prediction network, thereby improving the efficiency of automatic driving training.
  • Step S28 Update evaluation network parameters through back-propagation operation based on the reward.
  • Step S29 Use the policy gradient algorithm to update the policy network parameters.
  • the present application provides an automatic driving decision-making method based on the fusion of DRL and structured noise.
  • environmental state information is obtained through the vehicle sensor device, and historical data is sampled from the playback buffer (Replay Buffer).
  • structured noise is introduced into the policy function and value function to solve the robustness problem of DRL-based automatic driving sequence decision-making, and to avoid the dangerous situation of unstable driving and even causing accidents when the automatic driving vehicle faces a complex environment.
  • an embodiment of the present application discloses a specific automatic driving training method, including: (1) acquiring the traffic environment state S_t collected by the vehicle sensor device; (2) designing the reward function r_t for the automatic driving problem under study; (3) using the classical DQN algorithm to pre-train the vehicle for autonomous driving and accumulating the playback buffer data B; (4) sampling historical data c from the playback buffer B and using Gaussian factors to calculate the latent variable z represented by a probability distribution, i.e., the structured noise; and (5) combining the structured noise z with the DDPG algorithm to train the vehicle to drive automatically.
  • an embodiment of the present application discloses an automatic driving training device, including:
  • the data acquisition module 11 is used to acquire the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the automatic driving vehicle, and the historical data includes historical action information and historical traffic environment state information;
  • an action determination module 12 configured to use the traffic environment state and the structured noise to determine a corresponding execution action through a policy network
  • an action control module 13 configured to control the autonomous driving vehicle to execute the execution action
  • the strategy evaluation module 14 is configured to evaluate the strategy of the strategy network through the evaluation network according to the execution action to obtain a corresponding reward;
  • an evaluation network update module 15 configured to update the evaluation network parameters through back-propagation operation based on the return;
  • the policy network updating module 16 is used for updating the parameters of the policy network by using the policy gradient algorithm.
  • the embodiment of the present application obtains the traffic environment state at the current moment and the corresponding structured noise; wherein, the structured noise is the structured noise determined based on historical data, the historical data is data saved in the process of pre-training the automatic driving vehicle, and the historical data includes historical action information and historical traffic environment state information; the traffic environment state and the structured noise are then used to determine a corresponding execution action through the policy network, the self-driving vehicle is controlled to perform the execution action, the strategy of the policy network is evaluated through the evaluation network according to the execution action to obtain a corresponding reward, and then the evaluation network parameters are updated through a back-propagation operation based on the reward and the policy network parameters are updated using the policy gradient algorithm.
  • structured noise based on historical data is thus introduced, and the historical data includes historical action information and historical traffic environment state information, which can improve the stability of automatic driving training and reduce the probability of dangerous accidents.
  • the device further includes a pre-training module for pre-training the self-driving vehicle by using the DQN algorithm; storing the corresponding pre-training data in a playback buffer, and using the data stored in the playback buffer as the historical data.
  • the evaluation network updating module 15 is specifically configured to perform a back-propagation operation on the evaluation network loss function based on the reward, and update the evaluation network parameters in a single step.
  • the policy network update module 16 is specifically configured to perform policy gradient operation by using the value function of the evaluation network and the current policy of the policy network to update the policy network parameters.
  • the apparatus further includes a structured noise calculation module for pre-calculating the structured noise.
  • the structured noise calculation module is specifically used to randomly extract a preset number of data from the historical data to obtain a corresponding minibatch, calculate the Gaussian factor of each piece of historical data in the minibatch, and calculate the structured noise corresponding to the minibatch using all the Gaussian factors.
  • the structured noise calculation module is specifically configured to randomly extract data from the historical data to obtain multiple minibatches, calculate the Gaussian factor of each piece of historical data in each of the minibatches, and then calculate the structured noise corresponding to each minibatch using all the Gaussian factors corresponding to that minibatch.
  • an embodiment of the present application discloses an automatic driving training device, including a processor 21 and a memory 22; wherein, the memory 22 is used to save a computer program, and the processor 21 is used to execute the computer program to implement the automatic driving training method disclosed in the foregoing embodiments.
  • the embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program implements the automatic driving training method disclosed in the foregoing embodiments when the computer program is executed by the processor.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein may be directly implemented in hardware, a software module executed by a processor, or a combination of the two.
  • the software module can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Electromagnetism (AREA)
  • Theoretical Computer Science (AREA)
  • Educational Technology (AREA)
  • Optics & Photonics (AREA)
  • Educational Administration (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Traffic Control Systems (AREA)

Abstract

Disclosed are an automatic driving training method, apparatus and device, and a medium. The method comprises: acquiring the traffic environment state at the current moment and the corresponding structured noise (S11), wherein the structured noise is determined based on historical data, the historical data is data saved during pre-training of an autonomous driving vehicle, and the historical data includes historical action information and historical traffic environment state information; determining a corresponding execution action by using the traffic environment state and the structured noise by means of a policy network (S12); controlling the autonomous driving vehicle to perform the execution action (S13); evaluating a policy of the policy network according to the execution action by means of an evaluation network to obtain a corresponding reward (S14); updating evaluation network parameters on the basis of the reward by means of a back-propagation operation (S15); and updating policy network parameters by using a policy gradient algorithm (S16). The method can improve the stability of automatic driving training, thereby reducing the probability of dangerous accidents occurring.
PCT/CN2021/073449 2020-09-08 2021-01-23 Procédé, appareil et dispositif d'entraînement de conduite automatique, et support WO2022052406A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010934770.9A CN112099496B (zh) 2020-09-08 2020-09-08 一种自动驾驶训练方法、装置、设备及介质
CN202010934770.9 2020-09-08

Publications (1)

Publication Number Publication Date
WO2022052406A1 true WO2022052406A1 (fr) 2022-03-17

Family

ID=73752230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/073449 WO2022052406A1 (fr) 2020-09-08 2021-01-23 Procédé, appareil et dispositif d'entraînement de conduite automatique, et support

Country Status (2)

Country Link
CN (1) CN112099496B (fr)
WO (1) WO2022052406A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114859734A (zh) * 2022-06-15 2022-08-05 厦门大学 一种基于改进sac算法的温室环境参数优化决策方法
CN114859899A (zh) * 2022-04-18 2022-08-05 哈尔滨工业大学人工智能研究院有限公司 移动机器人导航避障的演员-评论家稳定性强化学习方法
CN114895697A (zh) * 2022-05-27 2022-08-12 西北工业大学 一种基于元强化学习并行训练算法的无人机飞行决策方法
CN115903457A (zh) * 2022-11-02 2023-04-04 曲阜师范大学 一种基于深度强化学习的低风速永磁同步风力发电机控制方法
CN116946162A (zh) * 2023-09-19 2023-10-27 东南大学 考虑路面附着条件的智能网联商用车安全驾驶决策方法
CN117078923A (zh) * 2023-07-19 2023-11-17 苏州大学 面向自动驾驶环境的语义分割自动化方法、系统及介质
CN117330063A (zh) * 2023-12-01 2024-01-02 华南理工大学 一种提升imu和轮速计组合定位算法精度的方法

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099496B (zh) * 2020-09-08 2023-03-21 苏州浪潮智能科技有限公司 一种自动驾驶训练方法、装置、设备及介质
CN112835368A (zh) * 2021-01-06 2021-05-25 上海大学 一种多无人艇协同编队控制方法及系统
CN112904864B (zh) * 2021-01-28 2023-01-03 的卢技术有限公司 基于深度强化学习的自动驾驶方法和系统
CN113253612B (zh) 2021-06-01 2021-09-17 苏州浪潮智能科技有限公司 一种自动驾驶控制方法、装置、设备及可读存储介质
CN113743469B (zh) * 2021-08-04 2024-05-28 北京理工大学 一种融合多源数据及综合多维指标的自动驾驶决策方法
CN113449823B (zh) * 2021-08-31 2021-11-19 成都深蓝思维信息技术有限公司 自动驾驶模型训练方法及数据处理设备
CN113991654B (zh) * 2021-10-28 2024-01-23 东华大学 一种能源互联网混合能量系统及其调度方法
CN114118276B (zh) * 2021-11-29 2024-08-20 北京触达无界科技有限公司 一种网络训练的方法、控制方法以及装置
CN114104005B (zh) * 2022-01-26 2022-04-19 苏州浪潮智能科技有限公司 自动驾驶设备的决策方法、装置、设备及可读存储介质
CN114120653A (zh) * 2022-01-26 2022-03-01 苏州浪潮智能科技有限公司 一种集中式车群决策控制方法、装置及电子设备
CN116811915A (zh) * 2023-06-30 2023-09-29 清华大学 基于乘员脑电信号的车辆决策方法、装置和计算机设备
CN117041916B (zh) * 2023-09-27 2024-01-09 创意信息技术股份有限公司 一种海量数据处理方法、装置、系统及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108600379A (zh) * 2018-04-28 2018-09-28 中国科学院软件研究所 一种基于深度确定性策略梯度的异构多智能体协同决策方法
WO2019089591A1 (fr) * 2017-10-30 2019-05-09 Mobileye Vision Technologies Ltd. Circulation d'un véhicule sur la base d'une activité humaine
CN110481536A (zh) * 2019-07-03 2019-11-22 中国科学院深圳先进技术研究院 一种应用于混合动力汽车的控制方法及设备
CN111310915A (zh) * 2020-01-21 2020-06-19 浙江工业大学 一种面向强化学习的数据异常检测防御方法
CN112099496A (zh) * 2020-09-08 2020-12-18 苏州浪潮智能科技有限公司 一种自动驾驶训练方法、装置、设备及介质
CN112256746A (zh) * 2020-09-11 2021-01-22 安徽中科新辰技术有限公司 一种基于标签化数据治理技术实现方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196587A (zh) * 2018-02-27 2019-09-03 中国科学院深圳先进技术研究院 车辆自动驾驶控制策略模型生成方法、装置、设备及介质
CN110989577B (zh) * 2019-11-15 2023-06-23 深圳先进技术研究院 自动驾驶决策方法及车辆的自动驾驶装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019089591A1 (fr) * 2017-10-30 2019-05-09 Mobileye Vision Technologies Ltd. Circulation d'un véhicule sur la base d'une activité humaine
CN108600379A (zh) * 2018-04-28 2018-09-28 中国科学院软件研究所 一种基于深度确定性策略梯度的异构多智能体协同决策方法
CN110481536A (zh) * 2019-07-03 2019-11-22 中国科学院深圳先进技术研究院 一种应用于混合动力汽车的控制方法及设备
CN111310915A (zh) * 2020-01-21 2020-06-19 浙江工业大学 一种面向强化学习的数据异常检测防御方法
CN112099496A (zh) * 2020-09-08 2020-12-18 苏州浪潮智能科技有限公司 一种自动驾驶训练方法、装置、设备及介质
CN112256746A (zh) * 2020-09-11 2021-01-22 安徽中科新辰技术有限公司 一种基于标签化数据治理技术实现方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG YILIN: "Study on Self-driving Cars Overtaking Control Method Based on Deep Reinforced Learning", CHINESE MASTER’S THESES FULL-TEXT DATABASE, no. 4, 30 April 2020 (2020-04-30), XP055910492 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114859899A (zh) * 2022-04-18 2022-08-05 哈尔滨工业大学人工智能研究院有限公司 移动机器人导航避障的演员-评论家稳定性强化学习方法
CN114859899B (zh) * 2022-04-18 2024-05-31 哈尔滨工业大学人工智能研究院有限公司 移动机器人导航避障的演员-评论家稳定性强化学习方法
CN114895697B (zh) * 2022-05-27 2024-04-30 西北工业大学 一种基于元强化学习并行训练算法的无人机飞行决策方法
CN114895697A (zh) * 2022-05-27 2022-08-12 西北工业大学 一种基于元强化学习并行训练算法的无人机飞行决策方法
CN114859734B (zh) * 2022-06-15 2024-06-07 厦门大学 一种基于改进sac算法的温室环境参数优化决策方法
CN114859734A (zh) * 2022-06-15 2022-08-05 厦门大学 一种基于改进sac算法的温室环境参数优化决策方法
CN115903457A (zh) * 2022-11-02 2023-04-04 曲阜师范大学 一种基于深度强化学习的低风速永磁同步风力发电机控制方法
CN115903457B (zh) * 2022-11-02 2023-09-08 曲阜师范大学 一种基于深度强化学习的低风速永磁同步风力发电机控制方法
CN117078923A (zh) * 2023-07-19 2023-11-17 苏州大学 面向自动驾驶环境的语义分割自动化方法、系统及介质
CN116946162B (zh) * 2023-09-19 2023-12-15 东南大学 考虑路面附着条件的智能网联商用车安全驾驶决策方法
CN116946162A (zh) * 2023-09-19 2023-10-27 东南大学 考虑路面附着条件的智能网联商用车安全驾驶决策方法
CN117330063B (zh) * 2023-12-01 2024-03-22 华南理工大学 一种提升imu和轮速计组合定位算法精度的方法
CN117330063A (zh) * 2023-12-01 2024-01-02 华南理工大学 一种提升imu和轮速计组合定位算法精度的方法

Also Published As

Publication number Publication date
CN112099496B (zh) 2023-03-21
CN112099496A (zh) 2020-12-18

Similar Documents

Publication Publication Date Title
WO2022052406A1 (fr) Procédé, appareil et dispositif d'entraînement de conduite automatique, et support
CN110834644B (zh) 一种车辆控制方法、装置、待控制车辆及存储介质
JP7532615B2 (ja) 自律型車両の計画
US11899411B2 (en) Hybrid reinforcement learning for autonomous driving
CN110796856B (zh) 车辆变道意图预测方法及变道意图预测网络的训练方法
US11243532B1 (en) Evaluating varying-sized action spaces using reinforcement learning
US20230124864A1 (en) Graph Representation Querying of Machine Learning Models for Traffic or Safety Rules
Min et al. Deep Q learning based high level driving policy determination
CN112034834A (zh) 使用强化学习来加速自动驾驶车辆的轨迹规划的离线代理
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
CN113044064B (zh) 基于元强化学习的车辆自适应的自动驾驶决策方法及系统
CN115303297B (zh) 基于注意力机制与图模型强化学习的城市场景下端到端自动驾驶控制方法及装置
WO2022252457A1 (fr) Procédé, appareil et dispositif de commande de conduite autonome, et support de stockage lisible
CN112406904B (zh) 自动驾驶策略的训练方法、自动驾驶方法、设备和车辆
CN114919578B (zh) 智能车行为决策方法、规划方法、系统及存储介质
CN113743469A (zh) 一种融合多源数据及综合多维指标的自动驾驶决策方法
CN116476863A (zh) 基于深度强化学习的自动驾驶横纵向一体化决策方法
CN114926823A (zh) 基于wgcn的车辆驾驶行为预测方法
Youssef et al. Comparative study of end-to-end deep learning methods for self-driving car
CN114267191B (zh) 缓解交通拥堵驾驶员控制系统、方法、介质、设备及应用
Chen et al. Automatic overtaking on two-way roads with vehicle interactions based on proximal policy optimization
US20210398014A1 (en) Reinforcement learning based control of imitative policies for autonomous driving
Ren et al. Intelligent path planning and obstacle avoidance algorithms for autonomous vehicles based on enhanced rrt algorithm
KR20230024392A (ko) 주행 의사 결정 방법 및 장치 및 칩
Arbabi et al. Planning for autonomous driving via interaction-aware probabilistic action policies

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21865470

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21865470

Country of ref document: EP

Kind code of ref document: A1