CN114194211B - Automatic driving method and device, electronic equipment and storage medium - Google Patents

Automatic driving method and device, electronic equipment and storage medium

Info

Publication number
CN114194211B
CN114194211B CN202111442633.4A
Authority
CN
China
Prior art keywords
driving
data
neural network
strategy model
automatic driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111442633.4A
Other languages
Chinese (zh)
Other versions
CN114194211A (en)
Inventor
邓琪
李茹杨
张亚强
赵雅倩
李仁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202111442633.4A priority Critical patent/CN114194211B/en
Publication of CN114194211A publication Critical patent/CN114194211A/en
Application granted granted Critical
Publication of CN114194211B publication Critical patent/CN114194211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions
    • B60W2050/0028Mathematical models, e.g. for simulation
    • B60W2050/0029Mathematical model of the driver
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses an automatic driving method, an automatic driving device, an electronic device and a computer readable storage medium, wherein the method comprises the following steps: acquiring multi-mode sensing information and driving behavior data of a driving environment; extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data; combining the fused feature data and the driving behavior data into expert demonstration data, and modeling the automatic driving process as a Markov decision process; obtaining a reward function of the automatic driving process from the expert demonstration data by maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by deep reinforcement learning; and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to environment perception information by using the optimized driving strategy model. The reliability of the automatic driving perception data is guaranteed, and the rationality of decision planning in the automatic driving process is improved.

Description

Automatic driving method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and more particularly, to an automatic driving method and apparatus, and an electronic device and a computer readable storage medium.
Background
Automatic driving is a complex system integrating functions such as environment perception, decision planning and control execution. With the development of artificial intelligence and information technology, automatic driving has received great attention in academia, industry and the defense and military fields.
The end-to-end driving method takes the sensing information measured by the sensors (such as lidar point clouds and RGB images) as the input of a neural network, and the neural network directly outputs control signals such as steering commands and acceleration. The main advantages of this framework are that it is easy to implement and that labeled training data can be obtained by recording the human driving process on an automatic driving platform. In recent years, advances in computing hardware have greatly facilitated the use of end-to-end learning models, and the back-propagation algorithm for gradient estimation of deep neural networks (Deep Neural Network, DNN) can be implemented in parallel on graphics processing units (Graphics Processing Unit, GPU). This helps train large automatic driving network architectures, but it also relies on a large number of training samples. Meanwhile, automatic driving control is in essence a sequential decision problem, and the end-to-end driving method generally needs to generalize from a large amount of data, so it can be affected by compounding errors in actual engineering practice.
Another technical route for the end-to-end driving approach is to explore different driving strategies based on deep reinforcement learning (Deep Reinforcement Learning, DRL). Generally, researchers train and evaluate DRL systems online in a simulation platform; research has also been conducted on transferring DRL models trained in simulation to real driving environments, and even on training DRL systems directly on real-world image data. However, the existing DRL-based end-to-end driving methods mainly focus on driving scenes with a limited number of dynamic participants and assume that the behavior of the other participants in the scene is close to ideal, so it is difficult for them to cover the various complex problems of the real world. In addition, such methods mostly focus on a single input modality, i.e., the acquired perception information is based only on images or only on radar, and when the adversarial behavior of participants in the scene increases, the system may fail due to the lack of key perception information.
Therefore, how to prevent the loss of key perception information caused by the increasing complexity and adversarial behavior of the driving scene during automatic driving is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an automatic driving method and apparatus, an electronic device and a computer readable storage medium, which prevent the loss of key perception information caused by the increasing complexity and adversarial behavior of the driving scene and ensure the reliability of the automatic driving perception data. Meanwhile, the complex prior knowledge in the real driving scene is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved.
To achieve the above object, the present application provides an automatic driving method, including:
acquiring multi-mode sensing information and driving behavior data of a driving environment;
extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
combining the fusion feature data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
obtaining a reward function of an automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by using deep reinforcement learning;
and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by utilizing the optimized driving strategy model.
The obtaining the multi-mode sensing information and the driving behavior data of the driving environment comprises the following steps:
acquiring driving states as multi-mode sensing information of a driving environment through a plurality of vehicle-mounted sensor devices;
acquiring operations or commands executed for different driving scenes in the driving process as driving behavior data; wherein the driving behavior data includes any one or any combination of time stamps, speed data, rapid acceleration and deceleration data, and lane departure data;
And aligning the multi-modal awareness information and the driving behavior data in time sequence according to the time stamp.
The method for extracting the multi-scale features of the multi-mode sensing information by using the convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data comprises the following steps:
coding the multi-mode sensing information at different network layers by utilizing a convolutional neural network so as to extract an intermediate feature map;
fusing the intermediate feature maps by using a Transformer to obtain a fusion feature map;
summing the elements of the fusion feature map and returning the elements to each modal branch to obtain a multi-modal feature vector;
and summing the multi-mode feature vectors element by element to obtain fusion feature data.
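As a non-limiting illustration of the feature-extraction step above, the following Python sketch (using PyTorch) shows one way a per-modality CNN could expose intermediate feature maps at several scales; the module structure, layer sizes and channel counts are assumptions made purely for illustration and do not correspond to the concrete network of this application.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Small CNN that encodes one sensing modality (e.g. an RGB image or a
    lidar bird's-eye-view projection) and returns intermediate feature maps
    at several scales, as required by the multi-scale fusion step."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)   # high-resolution intermediate feature map
        f2 = self.stage2(f1)  # medium resolution
        f3 = self.stage3(f2)  # low resolution
        return [f1, f2, f3]   # multi-scale intermediate feature maps

# One encoder per modality branch (input sizes are illustrative).
rgb_encoder = ModalityEncoder(in_channels=3)
lidar_encoder = ModalityEncoder(in_channels=1)
rgb_maps = rgb_encoder(torch.randn(1, 3, 256, 256))
lidar_maps = lidar_encoder(torch.randn(1, 1, 256, 256))
```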
The method for obtaining the reward function of the automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning and optimizing the driving strategy model by using deep reinforcement learning comprises the following steps:
initializing a reward function and a driving strategy model based on the reward function by using a deep neural network;
estimating the state distribution probability density of the driving strategy model by using the expert demonstration data, and updating the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
Iteratively calculating the expectation of the driving state-driving behavior visitation count, calculating a maximum entropy gradient using the expectation, and updating weights of the deep neural network based on the maximum entropy gradient;
judging whether the updated driving strategy model meets the convergence condition or not; if yes, recording the weight of the deep neural network to obtain an optimized driving strategy model; and if not, re-entering the step of estimating the state distribution probability density of the driving strategy by using the expert demonstration data.
Wherein the initializing the reward function using the deep neural network comprises:
defining the expert presentation data as a set of data pairs of driving state-driving behavior;
a reward function in the form of driving state-driving behavior-reward value is initialized with the deep neural network.
The input of the deep neural network is the driving state and driving behavior, and the output is a reward value;
or, the input of the deep neural network is a driving state, and the deep neural network comprises a plurality of output channels, and each output channel corresponds to a reward value corresponding to driving behavior.
The convergence condition includes that the iteration times reach a preset iteration times, or that the modulus of the gradient of the weight of the deep neural network reaches a preset threshold.
To achieve the above object, the present application provides an automatic driving apparatus comprising:
the data acquisition module is used for acquiring multi-mode perception information and driving behavior data of a driving environment;
the feature fusion module is used for extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
the modeling module is used for combining the fusion characteristic data and the driving behavior data into expert demonstration data and modeling an automatic driving process into a Markov decision process;
the optimizing module is used for acquiring a reward function of an automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning and optimizing a driving strategy model by using deep reinforcement learning;
and the output module is used for outputting the optimized driving strategy model to the client so that the client can realize automatic driving according to the environment perception information by utilizing the optimized driving strategy model.
To achieve the above object, the present application provides an electronic device, including:
a memory for storing a computer program;
and a processor for implementing the steps of the automatic driving method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the automatic driving method as described above.
According to the scheme, the automatic driving method provided by the application comprises the following steps: acquiring multi-mode sensing information and driving behavior data of a driving environment; extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data; combining the fusion feature data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process; obtaining a reward function of an automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by using deep reinforcement learning; and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by utilizing the optimized driving strategy model.
According to the automatic driving method, the multi-mode sensing information and the driving behavior data of the driving environment are synchronously collected, and a Transformer is used to fuse the multi-modal sensing data to obtain a fused feature representation of the 3D driving scene, which improves the global expression capability of the acquired perception data for the driving scene, prevents the loss of key perception information caused by the increasing complexity and adversarial behavior of the driving scene, and ensures the reliability of the automatic driving perception data. Furthermore, the fused perception data and the driving behavior are combined as expert demonstration data, MDP (Markov Decision Process) modeling is performed on the automatic driving process, the reward function is obtained based on maximum entropy inverse reinforcement learning, and the strategy model is optimized in combination with DRL, so that the complex prior knowledge in the real driving scene is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved. The application also discloses an automatic driving device, an electronic device and a computer readable storage medium, which can achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating an autopilot method according to one exemplary embodiment;
FIG. 2 is a block diagram illustrating an autopilot technique according to one exemplary embodiment;
FIG. 3 is a block diagram of a data acquisition system according to an exemplary embodiment;
FIG. 4 is a diagram illustrating Transformer-based multi-modal perception data feature fusion according to an exemplary embodiment;
FIG. 5 is a flow chart illustrating another method of autopilot according to one exemplary embodiment;
FIG. 6 is a flow chart illustrating a maximum entropy inverse reinforcement learning acquisition driving strategy according to an exemplary embodiment;
FIG. 7 is a block diagram of an autopilot device shown in accordance with one exemplary embodiment;
fig. 8 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application. In addition, in the embodiments of the present application, "first," "second," and the like are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence.
The embodiment of the application discloses an automatic driving method, which prevents the loss of key perception information caused by the increasing complexity and adversarial behavior of the driving scene and ensures the reliability of the automatic driving perception data. Meanwhile, the complex prior knowledge in the real driving scene is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved.
Referring to fig. 1 and 2, fig. 1 is a flowchart illustrating an automatic driving method according to an exemplary embodiment, and fig. 2 is a block diagram illustrating an automatic driving technique according to an exemplary embodiment. As shown in fig. 1, includes:
s101: acquiring multi-mode sensing information and driving behavior data of a driving environment;
In a specific implementation, the data acquisition system acquires the multi-modal awareness information of the driving environment and the driving behavior data. As a possible implementation, this step may include: acquiring driving states as multi-mode sensing information of the driving environment through a plurality of vehicle-mounted sensor devices; acquiring operations or commands executed for different driving scenes in the driving process as driving behavior data, wherein the driving behavior data includes any one or any combination of time stamps, speed data, rapid acceleration and deceleration data, and lane departure data; and aligning the multi-modal awareness information and the driving behavior data in time sequence according to the time stamp.
As shown in fig. 3, the data acquisition system includes a data acquisition module 1, a data acquisition module 2 and a data storage module, wherein the data acquisition module 1 acquires multi-modal sensing information of a driving environment, the data acquisition module 2 synchronously records driving behavior data, and the data storage module is responsible for storing the acquired driving data, namely the multi-modal sensing information and the driving behavior data.
In the running process of the vehicle, data acquisition module 1 acquires the driving environment information, namely the driving state s, through cameras, radar and other vehicle-mounted sensor devices, so as to obtain the multi-mode perception information. Data acquisition module 2 is responsible for recording the operations or commands executed by the driver or the vehicle control center for different driving scenes during driving, namely the driving behavior a; this part can be acquired through the vehicle's built-in driving behavior acquisition device, and the driving behavior data can include time stamps, speed data, rapid acceleration and deceleration data, lane departure data, and the like.
The driving behavior data acquired by data acquisition module 2 is stored directly in the data storage module, while the multi-mode sensing information, such as RGB images and radar point clouds, acquired by data acquisition module 1 is sent to the data processing system; processed multi-modal fusion data is obtained after a series of data operations such as feature extraction and feature fusion, and is then stored in the data storage module. The multi-modal fusion data is aligned with the driving behavior data in time sequence according to the time stamps recorded earlier, so as to obtain a series of data sets of driving state-driving behavior data pairs (s, a) for later use.
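As a rough, non-limiting sketch of the time-stamp alignment described above, the following Python example pairs each fused perception sample with the nearest driving behavior record; the column names, sampling times and tolerance are hypothetical values used only for illustration.

```python
import pandas as pd

# Hypothetical fused perception samples and driving behavior records,
# each carrying the acquisition time stamp.
perception = pd.DataFrame({
    "timestamp": [0.00, 0.10, 0.20, 0.30],
    "fused_feature": [[0.10, 0.40], [0.20, 0.30], [0.25, 0.31], [0.30, 0.20]],
})
behavior = pd.DataFrame({
    "timestamp": [0.02, 0.11, 0.19, 0.31],
    "steering": [0.00, -0.05, -0.10, -0.08],
    "speed": [10.2, 10.4, 10.5, 10.3],
})

# Align the two streams in time: for every perception sample, take the nearest
# behavior record within a small tolerance, yielding (s, a) data pairs.
pairs = pd.merge_asof(
    perception.sort_values("timestamp"),
    behavior.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
    tolerance=0.05,
)
print(pairs)  # each row is one driving state-driving behavior pair (s, a)
```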
S102: extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
In this step, the data processing system adopts a CNN (Convolutional Neural Network) to extract multi-scale features for each modality of the perception information, and obtains the fused feature data in combination with a Transformer. As a possible implementation, this step may include: encoding the multi-mode sensing information at different network layers by using a convolutional neural network so as to extract intermediate feature maps; fusing the intermediate feature maps by using a Transformer to obtain a fusion feature map; summing the elements of the fusion feature map and returning them to each modal branch to obtain multi-modal feature vectors; and summing the multi-modal feature vectors element by element to obtain the fused feature data.
For multi-mode sensing information comprising laser radar point clouds and RGB images, the key point is the fusion of information from the different modalities. The usual fusion approach is based on a late-fusion architecture, i.e., each information input is encoded in a separate stream and then integrated together. Because such a mechanism cannot account for the behavior of multiple agents in the scene, it can produce large errors in complex scenes. In order to better describe the driving behavior of the vehicle, the data processing system of the present application uses a multi-modal fusion Transformer to process environment perception information of multiple modalities, such as single-view images and laser radar point clouds. The key idea of the method is to use the attention mechanism of the Transformer to combine the global information about the 3D scene contained in the RGB image and radar point cloud data and to integrate it directly into the feature extraction layers of the different modalities, so as to effectively integrate perception information from different modalities at multiple stages during feature encoding.
As shown in fig. 4, fig. 4 is a schematic diagram of Transformer-based multi-modal perception data feature fusion. The Transformer model is based on an encoder-decoder architecture, in which the encoder module comprises a self-attention layer and a feed-forward neural network, which helps different feature extraction layers acquire multi-scale features of the sensing information of different modalities; compared with the encoder module, the decoder module has one additional encoder-decoder attention layer, which helps acquire the key features after the multi-modal sensing information is fused. Unlike the conventional token input structure of the Transformer, when the Transformer is used to process multi-modal driving data it needs to operate on feature maps, so the intermediate feature map of each modality can be regarded as a set, and each element in the set can be treated as a token. In the whole processing procedure, this embodiment uses a CNN to encode the input image and radar point cloud information at different network layers into different aspects of the scene, i.e., to extract intermediate feature maps, where the CNN comprises multiple convolution+pooling layers and fully-connected+softmax layers; the attention layers of the Transformer are then used to fuse the intermediate feature maps at multiple scales to obtain a fusion feature map of the multi-modal sensing information, and the elements of the fusion feature map are summed and fed back to each independent modal branch. After a series of multi-scale feature fusion operations are completed, the multi-modal feature vectors are summed element by element, and the processed 3D scene representation, i.e., the fused feature, is obtained through an MLP (Multi-Layer Perceptron).
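The following PyTorch sketch illustrates the general idea of the attention-based fusion described above: the intermediate feature maps of the two modal branches are flattened into token sets, fused by a Transformer encoder, and the attended tokens are added back to each branch. The token construction, residual feedback and final pooling are simplifying assumptions for illustration and are not the exact architecture of this application.

```python
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    """Fuses the intermediate feature maps of two modality branches with a
    Transformer encoder (self-attention + feed-forward) and feeds the fused
    tokens back to each branch."""

    def __init__(self, channels: int, n_heads: int = 4, n_layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, image_map: torch.Tensor, lidar_map: torch.Tensor):
        b, c, h, w = image_map.shape
        # Treat every spatial position of each intermediate feature map as a token.
        img_tokens = image_map.flatten(2).transpose(1, 2)      # (B, H*W, C)
        lidar_tokens = lidar_map.flatten(2).transpose(1, 2)    # (B, H*W, C)
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)  # joint token set
        fused = self.encoder(tokens)                           # attention fusion
        # Split the fused tokens and return them to the two modal branches,
        # summed element-wise with the original maps.
        img_fused, lidar_fused = fused.split(h * w, dim=1)
        image_out = image_map + img_fused.transpose(1, 2).reshape(b, c, h, w)
        lidar_out = lidar_map + lidar_fused.transpose(1, 2).reshape(b, c, h, w)
        return image_out, lidar_out

fusion = TransformerFusion(channels=128)
img_feat, lid_feat = fusion(torch.randn(1, 128, 8, 8), torch.randn(1, 128, 8, 8))
# Element-wise summation of the two branches, pooled to a 3D scene descriptor.
fused_vector = (img_feat + lid_feat).mean(dim=[2, 3])  # shape (B, C)
```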
S103: combining the fusion feature data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
In this step, the fused feature data is combined with the synchronously recorded driving behavior as expert demonstration data, and the automatic driving process is modeled as a Markov decision process. The MDP is defined as a five-tuple (S, A, T, R, γ), where S is the state space of the autonomous vehicle, A is the behavior decision space of the autonomous vehicle, T is the state transition function, R is the reward function, and γ ∈ (0, 1) is the reward discount factor. According to the above definition, the automatic driving process can be described as finding an optimal driving strategy π: S → A for the vehicle at each moment; once the strategy is determined, the effect of the vehicle taking an action in a given state depends only on the current driving strategy, so the entire driving process can be regarded as a Markov chain. The goal of strategy selection is typically to maximize the discounted cumulative future reward from the current state; assuming the vehicle state is s and the behavior generated by each strategy is denoted as a, this can be expressed as:
$$V = \sum_{t} \gamma^{t} R_{a}(s_{t}, s_{t+1})$$

where t is the time step and $R_a(s_t, s_{t+1})$ is the reward value obtained by taking driving action a in driving state $s_t$ and transitioning to driving state $s_{t+1}$. The selection process of the optimal strategy $\pi^*$ is:

$$\pi^{*} = \arg\max_{\pi}\, E\!\left[\sum_{t} \gamma^{t} R_{a}(s_{t}, s_{t+1})\right]$$

$$V(s_{t}) = \sum_{s_{t+1}} P_{a}(s_{t}, s_{t+1})\left(R_{a}(s_{t}, s_{t+1}) + \gamma V(s_{t+1})\right)$$

where $P_a(s_t, s_{t+1})$ is the probability of transitioning from driving state $s_t$ to driving state $s_{t+1}$ when taking driving action a, and $V(s_t)$ denotes the future discounted cumulative reward. When specifically solving for the optimal strategy, it is usually expressed as an iterative convergence process over all possible states s and s', i.e.

$$V_{i+1}(s) = \max_{a}\left\{\sum_{s'} P_{a}(s, s')\left(R_{a}(s, s') + \gamma V_{i}(s')\right)\right\}$$

where i is the iteration index; when V(s) gradually stabilizes, the iteration ends and the optimal strategy is output.
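For a small, purely illustrative discrete case, the iterative convergence process above can be written as follows (NumPy); the patent's actual state space comes from continuous fused perception features and the reward is learned, so the transition and reward tables here are arbitrary assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration.
    P[a, s, s'] : probability of reaching s' from s under action a
    R[a, s, s'] : reward value for that transition
    Returns the converged V(s) and the greedy (optimal) policy pi*(s)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
        Q = np.einsum("asn,asn->as", P, R + gamma * V[None, None, :]).T
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # V(s) has stabilized
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny random MDP: 2 actions, 3 states (assumed for demonstration only).
P = np.random.dirichlet(np.ones(3), size=(2, 3))   # each row sums to 1
R = np.random.uniform(-1.0, 1.0, size=(2, 3, 3))
V_opt, pi_opt = value_iteration(P, R)
```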
S104: obtaining a reward function of an automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by using deep reinforcement learning;
In this step, the expert demonstration data is used to acquire the reward function by maximum entropy inverse reinforcement learning, and the driving strategy model is learned in combination with DRL so as to optimize the driving strategy model. In a specific implementation, a DNN reward function model is initialized; the state distribution probability density of the expert driving strategy is estimated by using the expert demonstration data samples; the driving strategy is updated by DRL based on the strategy state distribution probability density of the expert demonstration data; the expectation of the driving state-driving behavior visitation count is calculated iteratively; the maximum entropy gradient is calculated by using the expected state-action visitation counts, and the DNN weights are updated accordingly; and the convergence condition is then checked, where the convergence condition includes that the number of iterations reaches a preset number of iterations or that the modulus of the gradient of the deep neural network weights reaches a preset threshold. If the convergence condition is not met, the above update operations are repeated; otherwise, the update process ends, the DNN model parameters are retained, and the optimal strategy model π* is output.
S105: and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by utilizing the optimized driving strategy model.
In the step, the optimized driving strategy model is output to an automatic driving client, and automatic driving is implemented according to the environment perception information.
According to the automatic driving method provided by the embodiment of the application, the multi-mode sensing information and the driving behavior data of the driving environment are synchronously acquired, and a Transformer is used to fuse the multi-modal sensing data to obtain a fused feature representation of the 3D driving scene, which improves the global expression capability of the acquired sensing data for the driving scene, prevents the loss of key sensing information caused by the increasing complexity and adversarial behavior of the driving scene, and ensures the reliability of the automatic driving sensing data. Further, the embodiment of the application combines the fused perception data and the driving behavior as expert demonstration data, performs MDP modeling on the automatic driving process, acquires the reward function based on maximum entropy inverse reinforcement learning, and optimizes the strategy model in combination with DRL, so that the complex prior knowledge in real driving scenes is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved.
The embodiment of the application discloses an automatic driving method, and compared with the previous embodiment, the technical scheme is further described and optimized. Specific:
referring to fig. 5, a flowchart of another automatic driving method is shown according to an exemplary embodiment, as shown in fig. 5, including:
s201: acquiring multi-mode sensing information and driving behavior data of a driving environment;
s202: extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a transducer to obtain fused feature data;
s203: combining the fusion feature data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
s204: initializing a reward function and a driving strategy model based on the reward function by using a deep neural network;
In the process of modeling automatic driving as an MDP, the key is the design of the reward function R, which needs to consider as many influencing factors as possible, including route completion, driving safety, riding comfort, and so on. However, all environmental conditions often cannot be obtained accurately while the vehicle is traveling, and the mapping between sensor inputs and output actions can be very complex. Thus, in some real-world tasks, manually designing the reward function of the environment is a difficult and laborious task. Therefore, the present application adopts maximum entropy inverse reinforcement learning to help build the reward function model of the environment based on the processed expert demonstration data, and trains the automatic driving strategy based on a DRL algorithm in combination with the DNN-parameterized reward function.
As shown in fig. 6, fig. 6 is a flowchart of acquiring a driving strategy by maximum entropy inverse reinforcement learning. In this step, the reward function is defined. Since the reward function R of the MDP is unknown at this stage, a set of expert demonstration data is required; a DNN is used as the parameterized reward function, and the problem of inverse reinforcement learning of the driving strategy is solved based on the maximum entropy principle.
First, the expert demonstration data is defined as a set of driving state-driving behavior pairs $\{(s_1, a_1), (s_2, a_2), \ldots, (s_n, a_n)\}$, where $s_i$ denotes a driving state and $a_i$ denotes the driving behavior selected by the expert in state $s_i$. Then, the reward function is defined in the form of driving state-driving behavior-reward value, i.e., $R: S \times A \to \mathbb{R}$, denoted $R(s, a)$. This form of definition takes the action into account and can embody the preferences in the expert data for particular actions, which facilitates reproducing driving behaviors with different preferences over the available actions.
As a possible implementation, the initializing the reward function with the deep neural network includes: defining the expert demonstration data as a set of driving state-driving behavior data pairs; and initializing a reward function in the form of driving state-driving behavior-reward value with the deep neural network. In a specific implementation, based on the above definition, two network structures can be selected when a DNN is adopted to learn the reward function: one takes the driving state vector and the driving behavior vector as input at the same time and outputs the reward value; the other takes only the state vector as input and outputs a plurality of channels, representing the reward values corresponding to the respective driving behaviors. Both DNNs can be used as approximate models of the driving state-driving behavior-reward function, and whichever structure is convenient to implement can be selected in practical applications. That is, the input of the deep neural network is the driving state and driving behavior and the output is the reward value; or the input of the deep neural network is the driving state, and the deep neural network comprises a plurality of output channels, each output channel corresponding to the reward value of a driving behavior.
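The two optional network structures described above can be sketched as follows (PyTorch); the layer widths, state dimension and number of discrete behaviors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardNetSA(nn.Module):
    """Structure 1: input is the concatenated driving state and driving
    behavior vectors, output is a single reward value R(s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class RewardNetS(nn.Module):
    """Structure 2: input is the driving state only; the output has one
    channel per discrete driving behavior, each giving that behavior's reward."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)

state = torch.randn(1, 16)
action = torch.randn(1, 4)
r1 = RewardNetSA(16, 4)(state, action)    # scalar reward for the pair (s, a)
r2 = RewardNetS(16, n_actions=5)(state)   # one reward value per candidate behavior
```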
S205: estimating the state distribution probability density of the driving strategy model by using the expert demonstration data, and updating the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
s206: iteratively calculating a desire for driving state-driving behavior access count, calculating a maximum entropy gradient using the desire, and updating weights of the deep neural network based on the maximum entropy gradient;
the purpose of S205-S206 is to update the driving strategy pi based on the bonus function model, first to initialize the DNN bonus function model. In a specific implementation, expert demonstration data samples are utilized to estimate the state distribution probability density of an expert driving strategy. Taking into account the probabilistic model P a Unknown, (s, s ') and the number of times each driving state-driving behavior-driving state triplet (s, a, s') is analyzed to calculate the state transition probability of each possible result, which can be expressed as
Figure BDA0003383168210000111
Where c (s, a, s ') is the cumulative number of times that the driving behavior a is taken to transition from the driving state s to the driving state s'. Along with the continuous interaction of the strategy model and the driving environment, the state access times approach infinity, and the probability value P a (s, s') will gradually approach the true probability distribution.
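A minimal Python sketch of this count-based transition estimate is given below; the state and action labels are hypothetical and the normalization follows the counting rule described above.

```python
from collections import defaultdict

class TransitionModel:
    """Empirical estimate of P_a(s, s') from observed (s, a, s') triplets."""

    def __init__(self):
        # (s, a) -> {s': cumulative count c(s, a, s')}
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def prob(self, s, a, s_next):
        c = self.counts[(s, a)]
        total = sum(c.values())
        return c[s_next] / total if total else 0.0

model = TransitionModel()
for s, a, s_next in [("lane", "keep", "lane"),
                     ("lane", "keep", "lane"),
                     ("lane", "keep", "offset")]:
    model.observe(s, a, s_next)
print(model.prob("lane", "keep", "lane"))  # 2/3 after three observations
```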
Based on the strategy state distribution probability density obtained from the expert demonstration data, the present invention adopts PPO with model learning to update the current driving strategy π, and introduces the following iterative formulas to calculate the expectation of the driving state-driving behavior visitation count:
$$E_{i+1}[\mu(s)] = \sum_{s' \in S}\sum_{a \in A} P_{a}(s, s')\,\pi(s', a)\,E_{i}[\mu(s')]$$

$$E_{i+1}[\mu(s, a)] = \pi(s, a)\,E_{i+1}[\mu(s)]$$
This embodiment adopts the PPO (Proximal Policy Optimization) algorithm, which has good robustness to hyperparameters, as an illustration; other DRL algorithms can also be selected, such as DDPG (Deep Deterministic Policy Gradient), SAC (Soft Actor-Critic), TD3 (Twin Delayed Deep Deterministic Policy Gradient), and the like.
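For a small discrete case, the visitation-count recursion above can be sketched as follows (NumPy); the transition table, policy and initial distribution are random placeholders, since in the application the states come from fused perception features.

```python
import numpy as np

def visitation_expectation(P, policy, mu0, n_iters=50):
    """Propagates the expected state visitation counts E[mu(s)] and the
    state-action counts E[mu(s, a)] with the recursions above.
    P[a, s, s']  : estimated transition probabilities
    policy[s, a] : probability pi(s, a) of taking action a in state s
    mu0[s]       : initial state visitation expectation."""
    mu_s = mu0.copy()
    for _ in range(n_iters):
        # E_{i+1}[mu(s)] = sum_{s', a} P_a(s, s') * pi(s', a) * E_i[mu(s')]
        mu_s = np.einsum("asn,na,n->s", P, policy, mu_s)
    mu_sa = policy * mu_s[:, None]  # E[mu(s, a)] = pi(s, a) * E[mu(s)]
    return mu_s, mu_sa

P = np.random.dirichlet(np.ones(3), size=(2, 3))  # 2 actions, 3 states
policy = np.random.dirichlet(np.ones(2), size=3)   # pi(s, a), rows sum to 1
mu_s, mu_sa = visitation_expectation(P, policy, mu0=np.ones(3) / 3)
```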
Further, the DNN reward function model parameters are updated. Specifically, when the driving state s reaches the final state or target state $s_{final}$, no further state transitions occur. The maximum entropy gradient can then be determined as

$$\frac{\partial L_D}{\partial \theta} = \frac{\partial L_D}{\partial R}\cdot\frac{\partial R}{\partial \theta} = \left(\mu_D - E[\mu]\right)\cdot\frac{\partial R}{\partial \theta}$$

where $L_D$ is the likelihood function of the expert demonstration data, $\theta$ is the network weight of the DNN, $\mu_D$ is the driving state-driving behavior visitation count observed in the expert demonstration data, and $E[\mu]$ is the visitation count expectation calculated above. The partial derivative $\partial R/\partial \theta$ is obtained by DNN back-propagation, and the DNN weights are updated as

$$\theta \leftarrow \theta + \lambda\left(\frac{\partial L_D}{\partial \theta} - \beta\theta\right)$$

where λ is the learning rate and β is the weight decay coefficient.
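The weight update above can be sketched as follows (PyTorch), assuming a small discrete state set with per-state features, an expert visitation count μ_D, and the policy visitation expectation E[μ] computed by the recursion above; all names and dimensions are illustrative, and SGD with weight decay reproduces the θ ← θ + λ(∂L/∂θ − βθ) rule by minimizing the negative log-likelihood objective.

```python
import torch
import torch.nn as nn

state_features = torch.randn(10, 6)   # 10 states, 6 features each (assumed)
mu_expert = torch.rand(10)            # visitation counts from expert demonstrations
mu_policy = torch.rand(10)            # E[mu] under the current driving strategy

reward_net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
lam, beta = 1e-2, 1e-4                # learning rate lambda and weight decay beta
optimizer = torch.optim.SGD(reward_net.parameters(), lr=lam, weight_decay=beta)

rewards = reward_net(state_features).squeeze(-1)  # r_theta(s) for every state
# Maximizing the demonstration likelihood gives dL/dtheta = (mu_D - E[mu]) . dr/dtheta,
# where dr/dtheta comes from back-propagation; we therefore minimize the negative.
loss = -torch.dot(mu_expert - mu_policy, rewards)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # theta <- theta + lam * (dL/dtheta - beta * theta)
```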
S207: judging whether the updated driving strategy model meets the convergence condition or not; if yes, go to S208; if not, reenter S205;
s208: recording the weight of the deep neural network to obtain an optimized driving strategy model;
S209: and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by utilizing the optimized driving strategy model.
In a specific implementation, whether the process ends is determined by judging whether the strategy model converges. The convergence condition can be that the number of update iterations reaches the initially set upper limit, or that the modulus of the gradient of the weights θ reaches the initially set threshold; the specific convergence condition can be set according to the task requirements in practical applications. If the algorithm has not reached the convergence condition, the update operations are repeated; when the set convergence condition is met, the learning process ends, the DNN model parameters are retained, and the optimal strategy model π* acquired by PPO is output.
An autopilot device according to an embodiment of the present application is described below, and an autopilot device described below and an autopilot method described above may be referred to with reference to each other.
Referring to fig. 7, a block diagram of an automatic driving apparatus according to an exemplary embodiment is shown, as shown in fig. 7, including:
the data acquisition module 701 is configured to acquire multi-modal sensing information and driving behavior data of a driving environment;
The feature fusion module 702 is configured to extract multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fuse the multi-scale features by using a Transformer to obtain fused feature data;
a modeling module 703, configured to combine the fusion feature data and the driving behavior data into expert demonstration data, and to model an autopilot process into a markov decision process;
the optimizing module 704 is configured to acquire a reward function of an automatic driving process by using the expert demonstration data and using maximum entropy inverse reinforcement learning, and optimize a driving strategy model by using deep reinforcement learning;
and the output module 705 is configured to output the optimized driving strategy model to a client, so that the client uses the optimized driving strategy model to implement automatic driving according to the environmental awareness information.
According to the automatic driving device provided by the embodiment of the application, the multi-mode sensing information and the driving behavior data of the driving environment are synchronously acquired, and a Transformer is adopted to fuse the multi-modal sensing data to obtain a fused feature representation of the 3D driving scene, which improves the global expression capability of the acquired perception data for the driving scene, prevents the loss of key perception information caused by the increasing complexity and adversarial behavior of the driving scene, and ensures the reliability of the automatic driving perception data. Further, the embodiment of the application combines the fused perception data and the driving behavior as expert demonstration data, performs MDP modeling on the automatic driving process, acquires the reward function based on maximum entropy inverse reinforcement learning, and optimizes the strategy model in combination with DRL, so that the complex prior knowledge in real driving scenes is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved.
Based on the above embodiment, as a preferred implementation manner, the data acquisition module 701 includes:
a first acquisition unit configured to acquire driving states as multi-modal awareness information of a driving environment through a plurality of in-vehicle sensor devices;
the second acquisition unit is used for acquiring operations or commands executed for different driving scenes in the driving process as driving behavior data; wherein the driving behavior data includes any one or any combination of time stamps, speed data, rapid acceleration and deceleration data, and lane departure data;
and the alignment unit is used for aligning the multi-modal awareness information and the driving behavior data in time sequence according to the time stamp.
Based on the above embodiment, as a preferred implementation manner, the feature fusion module 702 includes:
the extraction unit is used for encoding the multi-mode sensing information at different network layers by utilizing a convolutional neural network so as to extract an intermediate feature map;
the fusion unit is used for fusing the intermediate feature maps by using a Transformer to obtain a fusion feature map;
the first summation unit is used for summing the elements of the fusion feature map and returning the elements to each modal branch to obtain a multi-modal feature vector;
And the second summation unit is used for summing the multi-mode feature vectors element by element to obtain the fusion feature data.
Based on the above embodiment, as a preferred implementation manner, the optimizing module 704 includes:
an initializing unit for initializing a reward function and a driving strategy model based on the reward function by using a deep neural network;
a first updating unit for estimating a state distribution probability density of a driving strategy model using the expert demonstration data, and updating the driving strategy model using deep reinforcement learning based on the state distribution probability density;
a second updating unit for iteratively calculating the expectation of the driving state-driving behavior visitation count, calculating a maximum entropy gradient using the expectation, and updating the weights of the deep neural network based on the maximum entropy gradient;
the judging unit is used for judging whether the updated driving strategy model meets the convergence condition; if yes, recording the weight of the deep neural network to obtain an optimized driving strategy model; if not, restarting the workflow of the first updating unit.
On the basis of the above embodiment, as a preferred implementation manner, the initializing unit is specifically configured to:
Defining the expert presentation data as a set of data pairs of driving state-driving behavior;
a reward function in the form of driving state-driving behavior-reward value is initialized with the deep neural network.
On the basis of the above embodiment, as a preferred implementation manner, the input of the deep neural network is a driving state and driving behavior, and the output is a reward value; or, the input of the deep neural network is a driving state, and the deep neural network comprises a plurality of output channels, and each output channel corresponds to a reward value corresponding to driving behavior.
On the basis of the foregoing embodiment, as a preferred implementation manner, the convergence condition includes that the number of iterations reaches a preset number of iterations, or that a modulus of a gradient of the weight of the deep neural network reaches a preset threshold.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiments of the present application, the embodiments of the present application further provide an electronic device, fig. 8 is a block diagram of an electronic device according to an exemplary embodiment, and as shown in fig. 8, the electronic device includes:
A communication interface 1 capable of information interaction with other devices such as network devices and the like;
and the processor 2 is connected with the communication interface 1 to realize information interaction with other devices and is used for executing the automatic driving method provided by one or more technical schemes when running the computer program. And the computer program is stored on the memory 3.
Of course, in practice, the various components in the electronic device are coupled together by a bus system 4. It will be appreciated that the bus system 4 is used to enable connected communications between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. But for clarity of illustration the various buses are labeled as bus system 4 in fig. 8.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. Wherein the nonvolatile Memory may be Read Only Memory (ROM), programmable Read Only Memory (PROM, programmable Read-Only Memory), erasable programmable Read Only Memory (EPROM, erasable Programmable Read-Only Memory), electrically erasable programmable Read Only Memory (EEPROM, electrically Erasable Programmable Read-Only Memory), magnetic random access Memory (FRAM, ferromagnetic random access Memory), flash Memory (Flash Memory), magnetic surface Memory, optical disk, or compact disk Read Only Memory (CD-ROM, compact Disc Read-Only Memory); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be random access memory (RAM, random Access Memory), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM, static Random Access Memory), synchronous static random access memory (SSRAM, synchronous Static Random Access Memory), dynamic random access memory (DRAM, dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (ddr SDRAM, double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, syncLink Dynamic Random Access Memory), direct memory bus random access memory (DRRAM, direct Rambus Random Access Memory). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the embodiments of the present application may be applied to the processor 2 or implemented by the processor 2. The processor 2 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 2 or by instructions in the form of software. The processor 2 described above may be a general purpose processor, DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied in a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium in the memory 3 and the processor 2 reads the program in the memory 3 to perform the steps of the method described above in connection with its hardware.
The processor 2 implements corresponding flows in the methods of the embodiments of the present application when executing the program, and for brevity, will not be described in detail herein.
In an exemplary embodiment, the present application also provides a storage medium, i.e. a computer storage medium, in particular a computer readable storage medium, for example comprising a memory 3 storing a computer program executable by the processor 2 for performing the steps of the method described above. The computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
Alternatively, the integrated units described above may be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partly contributing to the prior art, and the computer software product may be stored in a storage medium, and include several instructions to cause an electronic device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An automatic driving method, comprising:
acquiring multi-mode sensing information and driving behavior data of a driving environment; the method comprises the steps that a plurality of vehicle-mounted sensor devices acquire driving states as multi-mode sensing information of a driving environment;
extracting multi-scale features of the multi-mode sensing information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
combining the fusion feature data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
obtaining a reward function of an automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by using deep reinforcement learning;
Outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by utilizing the optimized driving strategy model;
the method for obtaining the reward function of the automatic driving process by using the expert demonstration data through maximum entropy inverse reinforcement learning and optimizing the driving strategy model by using deep reinforcement learning comprises the following steps:
initializing a reward function and a driving strategy model based on the reward function by using a deep neural network;
estimating the state distribution probability density of the driving strategy model by using the expert demonstration data, and updating the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
iteratively calculating a desire for driving state-driving behavior access count, calculating a maximum entropy gradient using the desire, and updating weights of the deep neural network based on the maximum entropy gradient;
judging whether the updated driving strategy model meets the convergence condition or not; if yes, recording the weight of the deep neural network to obtain an optimized driving strategy model; and if not, re-entering the step of estimating the state distribution probability density of the driving strategy by using the expert demonstration data.
2. The automatic driving method according to claim 1, wherein the acquiring multi-modal perception information and driving behavior data of a driving environment comprises:
acquiring, by a plurality of vehicle-mounted sensor devices, driving states as the multi-modal perception information of the driving environment;
acquiring operations or commands executed for different driving scenes during driving as the driving behavior data, wherein the driving behavior data comprises any one or any combination of timestamps, speed data, hard acceleration and deceleration data, and lane departure data;
and aligning the multi-modal perception information and the driving behavior data in time order according to the timestamps.
3. The automatic driving method according to claim 1, wherein the extracting multi-scale features of the multi-modal perception information by using a convolutional neural network and fusing the multi-scale features by using a Transformer to obtain fused feature data comprises:
encoding the multi-modal perception information at different network layers by using the convolutional neural network to extract intermediate feature maps;
fusing the intermediate feature maps by using the Transformer to obtain a fused feature map;
summing the fused feature map element by element and returning the result to each modal branch to obtain multi-modal feature vectors;
and summing the multi-modal feature vectors element by element to obtain the fused feature data.
4. The automatic driving method according to claim 1, wherein the initializing a reward function by using a deep neural network comprises:
defining the expert demonstration data as a set of driving state-driving behavior data pairs;
and initializing, by using the deep neural network, a reward function in the form of driving state-driving behavior-reward value.
5. The automatic driving method according to claim 4, wherein inputs of the deep neural network are a driving state and a driving behavior, and an output of the deep neural network is a reward value.
6. The automatic driving method according to claim 4, wherein an input of the deep neural network is a driving state, the deep neural network comprises a plurality of output channels, and each output channel outputs the reward value corresponding to one driving behavior.
7. The automatic driving method according to claim 1, wherein the convergence condition comprises the number of iterations reaching a preset number of iterations, or the modulus of the gradient of the weights of the deep neural network reaching a preset threshold.
8. An automatic driving apparatus, comprising:
a data acquisition module, configured to acquire multi-modal perception information and driving behavior data of a driving environment, wherein driving states acquired by a plurality of vehicle-mounted sensor devices serve as the multi-modal perception information of the driving environment;
a feature fusion module, configured to extract multi-scale features of the multi-modal perception information by using a convolutional neural network, and fuse the multi-scale features by using a Transformer to obtain fused feature data;
a modeling module, configured to combine the fused feature data and the driving behavior data into expert demonstration data, and model an automatic driving process as a Markov decision process;
an optimization module, configured to obtain a reward function of the automatic driving process through maximum entropy inverse reinforcement learning by using the expert demonstration data, and optimize a driving strategy model by using deep reinforcement learning;
an output module, configured to output the optimized driving strategy model to a client, so that the client realizes automatic driving according to environment perception information by using the optimized driving strategy model;
wherein the optimization module comprises:
an initialization unit, configured to initialize a reward function by using a deep neural network and initialize a driving strategy model based on the reward function;
a first updating unit, configured to estimate a state distribution probability density of the driving strategy model by using the expert demonstration data, and update the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
a second updating unit, configured to iteratively calculate an expectation of driving state-driving behavior visitation counts, calculate a maximum entropy gradient by using the expectation, and update weights of the deep neural network based on the maximum entropy gradient;
and a judging unit, configured to judge whether the updated driving strategy model meets a convergence condition; if yes, record the weights of the deep neural network to obtain the optimized driving strategy model; if not, restart the workflow of the first updating unit.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor, configured to implement the steps of the automatic driving method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the automatic driving method according to any one of claims 1 to 7.
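For reference, the following is a minimal sketch in Python (PyTorch) of the multi-scale feature extraction and Transformer fusion recited in claims 1 and 3. The two modalities (a camera image and a LiDAR bird's-eye-view grid), the layer widths, and the token layout are illustrative assumptions not fixed by the claims; the claims only require that intermediate feature maps from a convolutional neural network be fused by a Transformer, summed element by element, returned to each modal branch, and summed again into the fused feature data.

import torch
import torch.nn as nn

class ModalBranch(nn.Module):
    """Small CNN encoding one modality; exposes an intermediate feature map."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)                                    # (B, feat_dim, H, W)

class TransformerFusion(nn.Module):
    """Fuses intermediate feature maps of all modal branches with a Transformer encoder."""
    def __init__(self, feat_dim=128, num_layers=2, nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feature_maps):
        # Flatten each (B, C, H, W) map into (B, H*W, C) tokens and concatenate modalities.
        tokens = [f.flatten(2).transpose(1, 2) for f in feature_maps]
        fused = self.encoder(torch.cat(tokens, dim=1))        # fused feature map (token form)
        summed = fused.sum(dim=1)                             # element-wise sum of the fused map
        # Return the summed result to each modal branch to form multi-modal feature vectors,
        # then sum those vectors element by element to obtain the fused feature data.
        branch_vectors = [summed + t.mean(dim=1) for t in tokens]
        return torch.stack(branch_vectors).sum(dim=0)         # (B, feat_dim)

# Usage with two assumed modalities: an RGB camera image and a LiDAR occupancy grid.
camera_branch, lidar_branch = ModalBranch(3), ModalBranch(1)
fusion = TransformerFusion()
image, bev = torch.randn(4, 3, 128, 128), torch.randn(4, 1, 128, 128)
fused_feature_data = fusion([camera_branch(image), lidar_branch(bev)])    # (4, 128)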
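Claims 4 to 6 recite a reward function implemented by a deep neural network in two alternative forms. The sketch below, assuming illustrative state and action dimensions, shows both: a network that maps a driving state-driving behavior pair to a single reward value (claim 5), and a network that maps a driving state to one output channel per discretised driving behavior (claim 6).

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_ACTIONS, HIDDEN = 128, 2, 9, 256   # illustrative sizes

# Claim 5 form: input is a driving state and a driving behavior, output is one reward value.
reward_net_sa = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)

# Claim 6 form: input is only the driving state; each output channel gives the reward
# value of one discretised driving behavior.
reward_net_s = nn.Sequential(
    nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, NUM_ACTIONS),
)

state = torch.randn(1, STATE_DIM)
action = torch.randn(1, ACTION_DIM)
r_sa = reward_net_sa(torch.cat([state, action], dim=-1))      # shape (1, 1)
r_all = reward_net_s(state)                                   # shape (1, NUM_ACTIONS)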
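The optimisation loop of claims 1, 4 and 7 alternates between re-optimising the driving strategy under the current reward and updating the reward network with a maximum entropy gradient computed from expected driving state-driving behavior visitation counts, until an iteration cap or a gradient-norm threshold is reached. The sketch below is a toy, self-contained version under strong assumptions: a small discrete Markov decision process with known transitions and soft value iteration stand in for the driving simulator and the deep reinforcement learning policy update, and the expert demonstration data are fabricated placeholder trajectories.

import torch
import torch.nn as nn

N_STATES, N_ACTIONS, HORIZON, GAMMA = 8, 3, 15, 0.95
torch.manual_seed(0)
P = torch.softmax(torch.randn(N_STATES, N_ACTIONS, N_STATES), dim=-1)   # toy transition model
expert = [[(0, 1), (2, 0), (5, 2)], [(0, 1), (3, 2), (6, 0)]]            # toy expert (state, action) pairs

reward_net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
optim = torch.optim.Adam(reward_net.parameters(), lr=1e-2)
eye = torch.eye(N_STATES)

def soft_policy(r):
    """Soft value iteration: the maximum-entropy policy for the current reward."""
    v = torch.zeros(N_STATES)
    for _ in range(50):
        q = r + GAMMA * P @ v                       # (S, A)
        v = torch.logsumexp(q, dim=1)
    return torch.softmax(q, dim=1)                  # pi(a|s)

def expected_visits(pi):
    """Expected driving state-driving behavior visitation counts under pi."""
    d = torch.zeros(N_STATES); d[0] = 1.0           # all episodes start in state 0
    visits = torch.zeros(N_STATES, N_ACTIONS)
    for _ in range(HORIZON):
        visits += d[:, None] * pi
        d = torch.einsum('s,sa,sat->t', d, pi, P)   # propagate the state distribution
    return visits

expert_visits = torch.zeros(N_STATES, N_ACTIONS)
for traj in expert:
    for s, a in traj:
        expert_visits[s, a] += 1.0
expert_visits /= len(expert)

MAX_ITERS, GRAD_TOL = 200, 1e-3                     # convergence conditions of claim 7
for it in range(MAX_ITERS):
    r = reward_net(eye)                             # per-state, per-action reward (claim 6 form)
    pi = soft_policy(r.detach())                    # stand-in for the deep-RL policy update
    grad_signal = expert_visits - expected_visits(pi)   # maximum-entropy gradient on the rewards
    optim.zero_grad()
    (-(r * grad_signal).sum()).backward()           # push the gradient through the reward network
    optim.step()
    gnorm = sum(p.grad.norm() for p in reward_net.parameters() if p.grad is not None)
    if gnorm < GRAD_TOL:                            # stop when the gradient modulus is small enough
        break
torch.save(reward_net.state_dict(), "reward_net.pt")   # record the converged network weights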
CN202111442633.4A 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium Active CN114194211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111442633.4A CN114194211B (en) 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111442633.4A CN114194211B (en) 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114194211A CN114194211A (en) 2022-03-18
CN114194211B true CN114194211B (en) 2023-04-25

Family

ID=80649717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111442633.4A Active CN114194211B (en) 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114194211B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881078A (en) * 2022-05-07 2022-08-09 安徽蔚来智驾科技有限公司 Method and system for screening data under predetermined scene
CN115092178B (en) * 2022-06-27 2024-08-16 电子科技大学 End-to-end automatic driving behavior decision method of multi-view attention mechanism
CN115303297B (en) * 2022-07-25 2024-06-18 武汉理工大学 Urban scene end-to-end automatic driving control method and device based on attention mechanism and graph model reinforcement learning
CN116416586B (en) * 2022-12-19 2024-04-02 香港中文大学(深圳) Map element sensing method, terminal and storage medium based on RGB point cloud
CN115951587B (en) * 2023-03-10 2023-07-14 苏州浪潮智能科技有限公司 Automatic driving control method, device, equipment, medium and automatic driving vehicle
CN116881707A (en) * 2023-03-17 2023-10-13 北京百度网讯科技有限公司 Automatic driving model, training method, training device and vehicle
CN118171684A (en) * 2023-03-27 2024-06-11 华为技术有限公司 Neural network, automatic driving method and device
CN116991157B (en) * 2023-04-14 2024-09-10 北京百度网讯科技有限公司 Automatic driving model with human expert driving capability, training method and vehicle
CN116859724B (en) * 2023-06-21 2024-03-15 北京百度网讯科技有限公司 Automatic driving model for simultaneous decision and prediction of time sequence autoregressive and training method thereof
CN117391181A (en) * 2023-10-18 2024-01-12 清华大学 Control method and device of intelligent body, computer equipment and storage medium
CN117521715A (en) * 2023-11-30 2024-02-06 中科南京智能技术研究院 Intelligent brain local obstacle avoidance method, device, storage medium and equipment
CN118494511B (en) * 2024-07-17 2024-10-11 比亚迪股份有限公司 Vehicle-mounted data processing method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107544516A (en) * 2017-10-11 2018-01-05 苏州大学 Automated driving system and method based on relative entropy depth against intensified learning
CN111907527A (en) * 2019-05-08 2020-11-10 通用汽车环球科技运作有限责任公司 Interpretable learning system and method for autonomous driving
DE102019216232A1 (en) * 2019-10-22 2021-04-22 Volkswagen Aktiengesellschaft Method and device for providing a driving strategy for the automated driving of a vehicle
CN113264043A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned driving layered motion decision control method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11364899B2 (en) * 2017-06-02 2022-06-21 Toyota Motor Europe Driving assistance method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107544516A (en) * 2017-10-11 2018-01-05 苏州大学 Automated driving system and method based on relative entropy depth against intensified learning
CN111907527A (en) * 2019-05-08 2020-11-10 通用汽车环球科技运作有限责任公司 Interpretable learning system and method for autonomous driving
DE102019216232A1 (en) * 2019-10-22 2021-04-22 Volkswagen Aktiengesellschaft Method and device for providing a driving strategy for the automated driving of a vehicle
CN113264043A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned driving layered motion decision control method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang Zhiqing et al. End-to-end decision-making for driverless driving based on deep reinforcement learning. Acta Electronica Sinica. 2020, Vol. 48, No. 9, full text. *

Also Published As

Publication number Publication date
CN114194211A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN114194211B (en) Automatic driving method and device, electronic equipment and storage medium
CN109215092B (en) Simulation scene generation method and device
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN109726676B (en) Planning method for automatic driving system
CN111507459B (en) Method and apparatus for reducing annotation costs for neural networks
KR20220036200A (en) Apparatus for predicting speed of vehicle and method thereof
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
CN114162146A (en) Driving strategy model training method and automatic driving control method
CN112257751A (en) Neural network pruning method
CN110281949A (en) A kind of automatic Pilot unifies hierarchical decision making method
CN113920484A (en) Monocular RGB-D feature and reinforcement learning based end-to-end automatic driving decision method
CN118015283B (en) Image segmentation method, device, equipment and storage medium
CN116859931A (en) Training method of track planning model, vehicle control mode and device
CN112947466B (en) Parallel planning method and equipment for automatic driving and storage medium
CN113625753B (en) Method for guiding neural network to learn unmanned aerial vehicle maneuver flight by expert rules
EP4124999A1 (en) Method and system for predicting trajectories of agents
CN114690639A (en) Control method and device for single-track two-wheeled robot and computer equipment
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
Ge et al. Deep reinforcement learning navigation via decision transformer in autonomous driving
CN111738046B (en) Method and apparatus for calibrating a physics engine of a virtual world simulator for learning of a deep learning-based apparatus
CN117873070A (en) Robot path planning method and device based on HER-SAC algorithm
CN111443701A (en) Unmanned vehicle/robot behavior planning method based on heterogeneous deep learning
Yang et al. Deep Reinforcement Learning Lane-Changing Decision Algorithm for Intelligent Vehicles Combining LSTM Trajectory Prediction
CN118393973B (en) Automatic driving control method, device, system, equipment and storage medium
CN118393900B (en) Automatic driving decision control method, device, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant