CN114194211A - Automatic driving method and device, electronic equipment and storage medium - Google Patents

Automatic driving method and device, electronic equipment and storage medium

Info

Publication number
CN114194211A
Authority
CN
China
Prior art keywords
driving
data
neural network
automatic driving
strategy model
Prior art date
Legal status
Granted
Application number
CN202111442633.4A
Other languages
Chinese (zh)
Other versions
CN114194211B (en)
Inventor
邓琪
李茹杨
张亚强
赵雅倩
李仁刚
Current Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202111442633.4A
Publication of CN114194211A
Application granted
Publication of CN114194211B
Legal status: Active
Anticipated expiration

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 Planning or execution of driving tasks
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001 Details of the control system
    • B60W2050/0019 Control system elements or transfer functions
    • B60W2050/0028 Mathematical models, e.g. for simulation
    • B60W2050/0029 Mathematical model of the driver
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application discloses an automatic driving method, an automatic driving device, an electronic device and a computer readable storage medium, wherein the method comprises the following steps: acquiring multi-modal perception information and driving behavior data of a driving environment; extracting multi-scale features of the multi-mode perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data; combining the fusion characteristic data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process; acquiring a reward function of the automatic driving process by using expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by using deep reinforcement learning; and outputting the optimized driving strategy model to the client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model. The method and the device ensure the reliability of the automatic driving perception data and improve the rationality of decision planning in the automatic driving process.

Description

Automatic driving method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an automatic driving method, an automatic driving apparatus, an electronic device, and a computer-readable storage medium.
Background
Automatic driving is a complex system integrating functions such as environment perception, decision planning and control execution. With the development of artificial intelligence and information technology, automatic driving has received great attention in the academic, industrial, and national defense and military fields.
The end-to-end driving method takes the perception information measured by sensors (such as lidar point clouds and RGB images) as the input of a neural network, and the neural network directly outputs control signals such as steering commands and acceleration. The main advantages of this framework are that it is easy to implement and that labeled training data can be obtained by recording the human driving process on the autonomous driving platform. In recent years, advances in computing hardware have greatly promoted the use of end-to-end learning models, and the back-propagation algorithm for Deep Neural Network (DNN) gradient estimation can be executed in parallel on a Graphics Processing Unit (GPU). This approach helps in training large autopilot network architectures, but also relies on a large number of training samples. Meanwhile, automatic driving control is by nature a sequential decision problem, whereas an end-to-end driving method generally has to generalize from a large amount of data, so it is affected by compounding errors in actual engineering practice.
Another technical route for end-to-end driving methods is to explore different driving strategies based on Deep Reinforcement Learning (DRL). Typically, researchers train and evaluate DRL systems online in a simulation platform; there are also studies that transfer simulation-trained DRL models to real driving environments, or even train DRL systems directly on real-world image data. However, existing DRL-based end-to-end driving methods mainly focus on driving scenes with a limited set of dynamic participants and assume that the behaviors of the other participants in the scene are close to ideal, so the various complex situations of the real world are difficult to take into account. Furthermore, this type of method focuses on a single input modality, i.e. the acquired perception information is based on images only or on radar only; when the antagonism of the participants in the scene increases, the system fails due to the lack of critical perception information.
Therefore, how to prevent the loss of key perception information caused by the increase of complexity and antagonism of driving scenes in the automatic driving process is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide an automatic driving method, an automatic driving device, electronic equipment and a computer readable storage medium, so that loss of key perception information caused by increase of complexity and antagonism of a driving scene is prevented, and reliability of automatic driving perception data is guaranteed. Meanwhile, the driving strategy model is learned by fully utilizing the complex prior knowledge in the real driving scene, and the reasonability of decision planning in the automatic driving process is improved.
To achieve the above object, the present application provides an automatic driving method, comprising:
acquiring multi-modal perception information and driving behavior data of a driving environment;
extracting multi-scale features of the multi-mode perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
combining the fusion characteristic data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
acquiring a reward function of the automatic driving process by utilizing the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning;
and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
Wherein, the acquiring of the multi-modal perception information and the driving behavior data of the driving environment comprises:
acquiring a driving state as multi-mode perception information of a driving environment through a plurality of vehicle-mounted sensor devices;
acquiring operation or commands executed aiming at different driving scenes in the driving process as driving behavior data; wherein the driving behavior data comprises any one or a combination of time stamps, speed data, rapid acceleration and deceleration data and lane departure data;
and aligning the multi-modal perception information and the driving behavior data in time sequence according to a timestamp.
The method for extracting the multi-scale features of the multi-modal perception information by using the convolutional neural network and fusing the multi-scale features by using a Transformer to obtain fused feature data comprises the following steps:
coding the multi-modal perception information at different network layers by utilizing a convolutional neural network to extract an intermediate feature map;
fusing the intermediate characteristic diagram by using a Transformer to obtain a fused characteristic diagram;
summing the elements of the fusion feature map and returning the elements to each modal branch to obtain a multi-modal feature vector;
and summing the multi-modal feature vectors element by element to obtain fusion feature data.
The method for acquiring the reward function of the automatic driving process by utilizing the expert demonstration data and adopting maximum entropy inverse reinforcement learning and optimizing the driving strategy model by utilizing deep reinforcement learning comprises the following steps:
initializing a reward function and a driving strategy model based on the reward function by utilizing a deep neural network;
estimating the state distribution probability density of the driving strategy model by using the expert demonstration data, and updating the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
iteratively calculating an expectation of an access count of driving state-driving behavior, calculating a maximum entropy gradient using the expectation, and updating a weight of the deep neural network based on the maximum entropy gradient;
judging whether the updated driving strategy model meets a convergence condition; if so, recording the weight of the deep neural network to obtain an optimized driving strategy model; and if not, re-entering the step of estimating the state distribution probability density of the driving strategy by using the expert demonstration data.
Wherein the initializing the reward function with the deep neural network includes:
defining the expert demonstration data as a set of driving state-driving behavior data pairs;
a reward function in the form of a driving state-driving behavior-reward value is initialized with a deep neural network.
The input of the deep neural network is a driving state and a driving behavior, and the output is a reward value;
or the input of the deep neural network is a driving state, the deep neural network comprises a plurality of output channels, and each output channel corresponds to a reward value corresponding to a driving behavior.
The convergence condition includes that the iteration number reaches a preset iteration number, or a modulus of a gradient of the weight of the deep neural network reaches a preset threshold value.
To achieve the above object, the present application provides an automatic driving apparatus, comprising:
the data acquisition module is used for acquiring multi-modal perception information and driving behavior data of a driving environment;
the feature fusion module is used for extracting multi-scale features of the multi-modal perception information by using a convolutional neural network and fusing the multi-scale features by using a Transformer to obtain fusion feature data;
the modeling module is used for combining the fusion characteristic data and the driving behavior data into expert demonstration data and modeling an automatic driving process into a Markov decision process;
the optimization module is used for acquiring a reward function in the automatic driving process by utilizing the expert demonstration data and adopting maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning;
and the output module is used for outputting the optimized driving strategy model to the client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
a processor for implementing the steps of the above described automated driving method when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, realizes the steps of the above described autopilot method.
According to the scheme, the automatic driving method comprises the following steps: acquiring multi-modal perception information and driving behavior data of a driving environment; extracting multi-scale features of the multi-mode perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data; combining the fusion characteristic data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process; acquiring a reward function of the automatic driving process by utilizing the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning; and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
According to the automatic driving method, the multi-modal perception information and the driving behavior data of the driving environment are synchronously acquired, and a Transformer is adopted to fuse the multi-modal perception data to obtain a fused feature representation of the 3D driving scene, which improves the global expression capability of the acquired perception data for the driving scene, prevents the loss of key perception information caused by the increase of the complexity and antagonism of the driving scene, and ensures the reliability of the automatic driving perception data. Furthermore, the method combines the fused perception data and the driving behaviors as expert demonstration data, performs MDP (Markov Decision Process) modeling on the automatic driving process, obtains the reward function based on maximum entropy inverse reinforcement learning, and optimizes the strategy model in combination with DRL (Deep Reinforcement Learning), so that the complex prior knowledge in real driving scenes is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved. The application also discloses an automatic driving device, an electronic device and a computer-readable storage medium, which can achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating a method of automatic driving according to an exemplary embodiment;
FIG. 2 is a block diagram illustrating an automated driving technique according to an exemplary embodiment;
FIG. 3 is a block diagram illustrating a data acquisition system in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating Transformer-based multi-modal perceptual data feature fusion according to an exemplary embodiment;
FIG. 5 is a flow chart illustrating another method of autonomous driving according to an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a maximum entropy inverse reinforcement learning derived driving strategy in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating an autopilot device according to one exemplary embodiment;
FIG. 8 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.
The embodiment of the application discloses an automatic driving method, which can prevent the loss of key perception information caused by the increase of complexity and antagonism of a driving scene and ensure the reliability of automatic driving perception data. Meanwhile, the driving strategy model is learned by fully utilizing the complex prior knowledge in the real driving scene, and the reasonability of decision planning in the automatic driving process is improved.
Referring to fig. 1 and 2, fig. 1 is a flowchart illustrating an automatic driving method according to an exemplary embodiment, and fig. 2 is a block diagram illustrating an automatic driving technique according to an exemplary embodiment. As shown in fig. 1, the method includes:
s101: acquiring multi-modal perception information and driving behavior data of a driving environment;
in a specific implementation, the data acquisition system acquires multi-modal perception information and driving behavior data of the driving environment. As a possible implementation, the step may include: acquiring a driving state as multi-mode perception information of a driving environment through a plurality of vehicle-mounted sensor devices; acquiring operation or commands executed aiming at different driving scenes in the driving process as driving behavior data; wherein the driving behavior data comprises any one or a combination of time stamps, speed data, rapid acceleration and deceleration data and lane departure data; and aligning the multi-modal perception information and the driving behavior data in time sequence according to a timestamp.
As shown in fig. 3, the data acquisition system includes a data acquisition module 1, a data acquisition module 2, and a data storage module, wherein the data acquisition module 1 acquires multimodal perception information of the driving environment, the data acquisition module 2 synchronously records driving behavior data, and the data storage module is responsible for storing the acquired driving data, that is, the multimodal perception information and the driving behavior data.
During the driving of the vehicle, the data acquisition module 1 acquires driving environment information, namely the driving state s, through vehicle-mounted sensor devices such as cameras and radars, thereby obtaining the multi-modal perception information. The data acquisition module 2 is responsible for recording the operations or commands executed by the driver or the vehicle control center for different driving scenes during the driving process, namely the driving behaviors a; these operations or commands can be acquired by a driving behavior acquisition device arranged in the vehicle, and the driving behavior data can include timestamps, speed data, rapid acceleration and deceleration data, lane departure data and the like.
The driving behavior data acquired by the data acquisition module 2 are stored directly in the data storage module, while the multi-modal perception information acquired by the data acquisition module 1, such as RGB images and radar point clouds, is sent to the data processing system; after a series of data operations such as feature extraction and feature fusion, the processed multi-modal fusion data are obtained and then stored in the data storage module. The multi-modal fusion data are aligned with the driving behavior data in time sequence according to the previously recorded timestamps, thereby obtaining a data set of driving state-driving behavior data pairs (s, a) for later use.
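As an illustrative sketch of the timestamp alignment described above (the data layout, the tolerance value and the function name are assumptions made for illustration, not details given in this application), each driving behavior record can be matched to the nearest fused perception frame:

from bisect import bisect_left

def align_by_timestamp(perception_frames, behavior_records, tolerance=0.05):
    """Pair each driving-behavior record with the closest fused perception frame.

    perception_frames: list of (timestamp, fused_feature) sorted by timestamp
    behavior_records:  list of (timestamp, behavior) sorted by timestamp
    tolerance:         maximum allowed time offset in seconds (assumed value)
    Returns a list of (state, action) pairs, i.e. the (s, a) data set.
    """
    frame_times = [t for t, _ in perception_frames]
    pairs = []
    for t_b, action in behavior_records:
        i = bisect_left(frame_times, t_b)
        # candidate frames just before and just after the behavior timestamp
        candidates = [j for j in (i - 1, i) if 0 <= j < len(frame_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(frame_times[k] - t_b))
        if abs(frame_times[j] - t_b) <= tolerance:
            pairs.append((perception_frames[j][1], action))
    return pairs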
S102: extracting multi-scale features of the multi-mode perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
in this step, the data processing system respectively extracts multi-scale features from the multi-modal perceptual information by using a CNN (Convolutional Neural Networks), and acquires fusion feature data by combining with a Transformer. As a possible implementation, the step may include: coding the multi-modal perception information at different network layers by utilizing a convolutional neural network to extract an intermediate feature map; fusing the intermediate characteristic diagram by using a Transformer to obtain a fused characteristic diagram; summing the elements of the fusion feature map and returning the elements to each modal branch to obtain a multi-modal feature vector; and summing the multi-modal feature vectors element by element to obtain fusion feature data.
The key to handling multi-modal perception information, including lidar point clouds and RGB images, is the fusion of information from these different types of modalities. A common fusion approach is based on a late-fusion architecture, i.e. each information input is encoded in a separate stream and then integrated. This fusion mechanism can produce large errors in complex scenes because it cannot account for the behaviors of the multiple agents in the scene. In order to better describe the driving behavior of the vehicle, a multi-modal fusion Transformer is used in the data processing system to process environment perception information of multiple modalities such as single-view images and lidar point clouds. The key idea is to combine the global 3D-scene information in the RGB image and the radar point cloud data using the Transformer attention mechanism and to integrate it directly into the feature extraction layers of the different modalities, so that the perception information from the different modalities is fused efficiently at multiple stages of feature encoding.
As shown in fig. 4, fig. 4 is a schematic diagram of Transformer-based multi-modal perception data feature fusion. The Transformer model is based on an encoder-decoder framework: the encoding module comprises a self-attention layer and a feed-forward neural network and helps the different feature extraction layers to acquire multi-scale features of the different modal perception information, while the decoding module has an additional encoder-decoder attention layer that helps to acquire the key features after multi-modal perception information fusion. Unlike the conventional token input structure of the Transformer, feature maps need to be operated on when processing multi-modal driving data with the Transformer, so the intermediate feature map of each modality can be regarded as a set, and each element in the set can be processed as a token. In the whole processing procedure, this embodiment uses a CNN to encode the input image and the radar point cloud information covering different aspects of the scene at different network layers, i.e. to extract intermediate feature maps, where the CNN comprises multiple convolution + Pooling layers followed by fully-connected + Softmax layers; multi-scale fusion of the intermediate feature maps is then completed with a Transformer attention layer to obtain a fused feature map of the multi-modal perception information, and the elements of the fused feature map are summed and fed back to each individual modal branch. After this series of multi-scale feature fusion operations is completed, the multi-modal feature vectors are summed element by element, and the processed 3D scene representation, i.e. the fused feature, is obtained through an MLP (Multi-Layer Perceptron).
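A simplified, single-stage sketch of this fusion mechanism is given below, using PyTorch as an assumed framework; the channel sizes, number of heads and the use of nn.TransformerEncoderLayer are illustrative choices, not details specified by this application:

import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    """Fuse intermediate CNN feature maps of two modalities with self-attention."""

    def __init__(self, channels=256, num_heads=8):
        super().__init__()
        # per-modality CNN encoders producing intermediate feature maps
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        self.lidar_encoder = nn.Sequential(
            nn.Conv2d(1, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        # attention layer that mixes tokens from both modalities
        self.attn = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                               batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                 nn.Linear(channels, channels))

    def forward(self, image, lidar_bev):
        f_img = self.image_encoder(image)         # (B, C, H, W)
        f_pc = self.lidar_encoder(lidar_bev)      # (B, C, H', W')
        # treat every spatial location of each feature map as a token
        t_img = f_img.flatten(2).transpose(1, 2)  # (B, H*W, C)
        t_pc = f_pc.flatten(2).transpose(1, 2)    # (B, H'*W', C)
        tokens = torch.cat([t_img, t_pc], dim=1)
        fused = self.attn(tokens)
        # sum fused tokens back per modality branch, then sum element-wise
        v_img = fused[:, :t_img.shape[1]].sum(dim=1)
        v_pc = fused[:, t_img.shape[1]:].sum(dim=1)
        return self.mlp(v_img + v_pc)             # fused 3D-scene feature

# usage sketch: feat = TransformerFusion()(rgb_batch, lidar_bev_batch)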
S103: combining the fusion characteristic data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
In this step, the fused feature data are combined with the synchronously recorded driving behaviors as expert demonstration data, and the automatic driving process is modeled as a Markov decision process. The MDP is defined as a quintuple (S, A, T, R, γ), where S is the state space in which the automatic driving vehicle is located, A is the behavior decision space of the automatic driving vehicle, T is the state transition function, R is the reward function, and γ ∈ (0, 1) is the decay factor of the reward. According to the above definition, the automatic driving process can be described as finding an optimal driving strategy π: S → A for the vehicle at each moment of time; once the strategy is determined, the effect of the vehicle taking an action in a given state depends only on the current driving strategy, so the entire driving process can be regarded as a Markov chain. The goal of strategy selection is typically to optimize the cumulative reward from the current state into the future; given a vehicle state s, with the behavior produced by the strategy denoted a, the cumulative reward can be expressed as:

$$\sum_{t} \gamma^{t} R_a(s_t, s_{t+1}),$$

where t is time and $R_a(s_t, s_{t+1})$ is the reward value obtained when driving behavior a is taken in driving state $s_t$ and the state transitions to $s_{t+1}$. The optimal strategy $\pi^*$ is selected as follows:

$$\pi^*(s) = \arg\max_{a} \sum_{s'} P_a(s, s')\big(R_a(s, s') + \gamma V(s')\big),$$

$$V(s) = \sum_{s'} P_a(s, s')\big(R_a(s, s') + \gamma V(s')\big),$$

where $P_a(s_t, s_{t+1})$ is the probability of transitioning to driving state $s_{t+1}$ when driving behavior a is taken in driving state $s_t$, and $V(s_t)$ represents the accumulated future discounted reward. Solving for the optimal strategy in detail is usually an iterative convergence process over all possible states s and s', i.e.

$$V_{i+1}(s) = \max_{a}\Big\{\sum_{s'} P_a(s, s')\big(R_a(s, s') + \gamma V_i(s')\big)\Big\},$$

where i is the iteration index; when V(s) gradually becomes stable, the iteration ends and the optimal strategy is output.
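As a toy illustration of the value-iteration recursion above (a minimal sketch assuming a small discrete state/action space; the nested-dictionary data layout is hypothetical and not part of this application):

def value_iteration(states, actions, P, R, gamma=0.99, tol=1e-6):
    """Iterate V_{i+1}(s) = max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V_i(s'))
    until V stabilizes, then read off the greedy (optimal) policy.

    P[s][a][s'] -- transition probability, R[s][a][s'] -- reward value.
    """
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                   for a in actions)
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            break
        V = V_new
    policy = {
        s: max(actions,
               key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                 for s2 in states))
        for s in states
    }
    return V, policy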
S104: acquiring a reward function of the automatic driving process by utilizing the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning;
In this step, the reward function is obtained from the expert demonstration data by maximum entropy inverse reinforcement learning, and the driving strategy model is optimized in combination with DRL. In the specific implementation, a DNN reward function model is initialized; the state distribution probability density of the expert driving strategy is estimated from the expert demonstration data samples; the driving strategy is updated by DRL based on that state distribution probability density; the expectation of the driving state-driving behavior visitation count is calculated iteratively; the maximum entropy gradient is calculated from this expectation and the DNN weights are then updated; finally, a convergence condition is checked, where the convergence condition is that the number of iterations reaches a preset number of iterations, or that the modulus of the gradient of the weights of the deep neural network reaches a preset threshold. If the condition is not met, the above updating operations are repeated; otherwise the updating process ends, the DNN model parameters are retained, and the optimal strategy model π* is output.
S105: and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
In the step, the optimized driving strategy model is output to an automatic driving client, and automatic driving is implemented according to the environment perception information.
According to the automatic driving method provided by the embodiment of the application, the multi-modal perception information and the driving behavior data of the driving environment are synchronously acquired, the transformer is adopted to fuse the multi-modal perception data to obtain the fusion characteristic representation of the 3D driving scene, the global expression capability of the acquired perception data to the driving scene is improved, the loss of key perception information caused by the increase of the complexity and the antagonism of the driving scene is prevented, and the reliability of the automatic driving perception data is ensured. Furthermore, the embodiment of the application combines the fusion perception data and the driving behavior as expert demonstration data, MDP modeling is carried out on the automatic driving process, the reward function is obtained based on maximum entropy inverse reinforcement learning, the DRL optimization strategy model is combined, the complex priori knowledge in the real driving scene is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved.
The embodiment of the application discloses an automatic driving method, and compared with the previous embodiment, the technical scheme is further explained and optimized in the embodiment. Specifically, the method comprises the following steps:
referring to FIG. 5, a flow chart of another automated driving method according to an exemplary embodiment is shown, as shown in FIG. 5, including:
s201: acquiring multi-modal perception information and driving behavior data of a driving environment;
s202: extracting multi-scale features of the multi-mode perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
s203: combining the fusion characteristic data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
s204: initializing a reward function and a driving strategy model based on the reward function by utilizing a deep neural network;
in the process of automatic driving by using MDP modeling, the key point is the design of the reward function R, which needs to consider various influence factors as much as possible, including route completion, driving safety, riding comfort, and the like. However, all environmental conditions are often not accurately obtained while the vehicle is traveling, and the mapping between sensor input and output actions can be very complex. Thus, in some real-world tasks, artificially setting the reward function of an environment is a difficult and laborious task. Therefore, the method and the device help to establish a reward function model of the environment by adopting maximum entropy inverse reinforcement learning based on the expert demonstration data obtained after processing. And combining the DNN parameterized reward function to train the automatic driving strategy based on the DRL algorithm.
As shown in fig. 6, fig. 6 is a flowchart of obtaining the driving strategy by maximum entropy inverse reinforcement learning. In this step, the reward function is defined first. Since the reward function R of the MDP is unknown at this point, it has to be inferred from a set of expert demonstration data; a DNN is therefore used as a parameterized reward function, and the inverse-reinforcement-learning driving strategy problem is solved based on the maximum entropy principle.
First, the expert demonstration data are defined as a set of driving state-driving behavior pairs $\{(s_1, a_1), (s_2, a_2), \ldots, (s_n, a_n)\}$, where $s_i$ represents a driving state and $a_i$ represents the driving behavior selected by the expert in state $s_i$. Then, the reward function is defined in the form of driving state-driving behavior-reward value, i.e. $R: S \times A \to \mathbb{R}$, denoted $R(s, a)$. This form of definition takes actions into account and can embody a preference for particular actions in the expert data, thus facilitating the reproduction of driving behaviors with different preferences over the available actions.
As a possible implementation, initializing the reward function using the deep neural network includes: defining the expert demonstration data as a set of driving state-driving behavior data pairs; and initializing a reward function in the form of driving state-driving behavior-reward value with a deep neural network. In the specific implementation, based on the above definition, two network structures can be selected when a DNN is used to learn the reward function: one takes the driving state vector and the driving behavior vector as input simultaneously and outputs a single reward value; the other takes only the state vector as input and outputs a plurality of channels, which respectively represent the reward values corresponding to the individual driving behaviors. Both DNNs can serve as approximate models of the driving behavior-driving state-reward function, and whichever structure is more convenient can be selected in practical applications. That is, the input of the deep neural network is the driving state and the driving behavior and the output is the reward value; or the input of the deep neural network is the driving state, and the deep neural network comprises a plurality of output channels, each output channel corresponding to the reward value of one driving behavior.
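A minimal sketch of the two reward-network structures described above, written with PyTorch as an assumed framework (the layer sizes are illustrative assumptions):

import torch
import torch.nn as nn

class StateActionReward(nn.Module):
    """Structure 1: input is the driving state and driving behavior, output is one reward value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class StateReward(nn.Module):
    """Structure 2: input is the driving state only, one output channel per driving behavior."""
    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, state):
        return self.net(state)  # reward value for each discrete driving behavior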
S205: estimating the state distribution probability density of the driving strategy model by using the expert demonstration data, and updating the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
s206: iteratively calculating an expectation of an access count of driving state-driving behavior, calculating a maximum entropy gradient using the expectation, and updating a weight of the deep neural network based on the maximum entropy gradient;
the purpose of S205-S206 is to update the reward function model-based driving strategy π, first initialize the DNN reward function model. In a specific implementation, the state distribution probability density of the expert driving strategy is estimated using expert demonstration data samples. Taking into account a probabilistic model Pa(s, s ') is unknown, and the number of times each driving state-driving behavior-driving state triple (s, a, s') is analyzed to calculate the state transition probability for each possible outcome, which can be expressed as
Figure BDA0003383168210000111
Where c (s, a, s ') is the cumulative number of transitions from the driving state s to the driving state s' in which the driving behavior a is taken. With the continuous interaction between the strategy model and the driving environment, the state access times are close to infinity, and the probability value Pa(s, s') will gradually approach the true probability distribution.
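The counting estimate above can be sketched as follows (a minimal illustration over hashable state and action identifiers; the data layout and function name are assumptions for illustration):

from collections import defaultdict

def estimate_transition_probs(transitions):
    """Estimate P_a(s, s') from observed (s, a, s') triples by normalized counts."""
    counts = defaultdict(float)   # c(s, a, s')
    totals = defaultdict(float)   # sum over s'' of c(s, a, s'')
    for s, a, s_next in transitions:
        counts[(s, a, s_next)] += 1.0
        totals[(s, a)] += 1.0
    return {(s, a, s_next): c / totals[(s, a)]
            for (s, a, s_next), c in counts.items()}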
Based on the policy state distribution probability density obtained from the expert demonstration data, the invention adopts PPO with model learning to update the current driving strategy π, and introduces the following iterative formulas to calculate the expectation of the driving state-driving behavior visitation count:

$$E_{i+1}[\mu(s)] = \sum_{s' \in S} \sum_{a \in A} P_a(s, s')\,\pi(s', a)\,E_i[\mu(s')],$$

$$E_{i+1}[\mu(s, a)] = \pi(s, a)\,E_{i+1}[\mu(s)].$$
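A sketch of this visitation-count recursion over a discrete state/action space (the dictionary-based representation is an illustrative assumption, not part of this application):

def iterate_visitation_expectation(states, actions, P, policy, E_mu):
    """One step of E_{i+1}[mu(s)] = sum_{s', a} P_a(s, s') * pi(s', a) * E_i[mu(s')],
    followed by E_{i+1}[mu(s, a)] = pi(s, a) * E_{i+1}[mu(s)].

    P[(a, s, s_next)] -- transition probabilities, policy[(s, a)] -- action probabilities,
    E_mu[s] -- current expectation of the state visitation count.
    """
    E_mu_next = {}
    for s in states:
        E_mu_next[s] = sum(P.get((a, s, s2), 0.0) * policy[(s2, a)] * E_mu[s2]
                           for s2 in states for a in actions)
    E_mu_sa = {(s, a): policy[(s, a)] * E_mu_next[s] for s in states for a in actions}
    return E_mu_next, E_mu_sa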
in this embodiment, a PPO (Proximal Policy Optimization) algorithm with better hyper-parameter performance may be used as an illustration, and of course, other DRL algorithms may be selected, such as DDPG (Deep Deterministic Policy Gradient), SAC (soft actuator-critical, flexible actuation-evaluation), TD3 (double Delayed Deep Deterministic Policy Gradient), and the like.
Further, the parameters of the DNN reward function model are updated. In the specific implementation, once the driving state s reaches a final or target state $s_{final}$, no further state transitions occur. The maximum entropy gradient can then be determined:

$$\frac{\partial L(\theta)}{\partial \theta} = \frac{\partial L}{\partial R}\cdot\frac{\partial R_\theta}{\partial \theta} = \big(\mu_D - E[\mu]\big)\cdot\frac{\partial R_\theta}{\partial \theta},$$

where $L(\theta)$ is the likelihood function of the expert demonstration data, θ is the network weight of the DNN, $\mu_D$ is the visitation count observed in the expert demonstrations, and $E[\mu]$ is the expectation of the visitation count computed above. The partial derivative $\partial R_\theta / \partial \theta$ of the reward output with respect to θ is obtained by DNN back-propagation, and $\partial L / \partial \theta$ can then be used to update the DNN weights:

$$\theta_{i+1} = \theta_i + \lambda\Big(\frac{\partial L}{\partial \theta} - \beta\,\theta_i\Big),$$

where λ is the learning rate and β is the weight attenuation coefficient.
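This weight update can be sketched as a plain gradient-ascent step with weight decay, assuming the gradients of the demonstration likelihood with respect to the network weights have already been obtained by back-propagation (a PyTorch-style illustration under those assumptions, not the exact procedure of this application):

import torch

@torch.no_grad()
def update_reward_weights(reward_net, likelihood_grads, lr=1e-3, weight_decay=1e-4):
    """theta <- theta + lr * (dL/dtheta - weight_decay * theta)

    likelihood_grads: per-parameter gradients of the expert-demonstration
    log-likelihood L(theta), e.g. collected after back-propagation.
    """
    for param, grad in zip(reward_net.parameters(), likelihood_grads):
        param.add_(lr * (grad - weight_decay * param))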
S207: judging whether the updated driving strategy model meets a convergence condition; if yes, entering S208; if not, re-entering S205;
s208: recording the weight of the deep neural network to obtain an optimized driving strategy model;
s209: and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
In the specific implementation, whether the process is finished is determined by judging whether the strategy model has converged. The convergence condition can be that the number of update iterations reaches an initially set upper limit, or that the modulus of the gradient of the weights θ reaches an initially set threshold; the specific convergence condition can be set according to the task requirements in practical applications. If the algorithm has not reached the convergence condition, the above updating operations are repeated; when the set convergence condition is met, the learning process ends, the DNN model parameters are retained, and the optimal strategy model π* obtained by PPO is output.
An automatic driving device provided by the embodiment of the present application is described below, and an automatic driving device described below and an automatic driving method described above may be referred to each other.
Referring to fig. 7, a block diagram of an automatic driving apparatus according to an exemplary embodiment is shown, as shown in fig. 7, including:
the data acquisition module 701 is used for acquiring multi-modal perception information and driving behavior data of a driving environment;
the feature fusion module 702 is configured to extract multi-scale features of the multi-modal perceptual information by using a convolutional neural network, and fuse the multi-scale features by using a Transformer to obtain fusion feature data;
a modeling module 703 for combining the fusion feature data and the driving behavior data into expert demonstration data and modeling an automatic driving process as a markov decision process;
the optimization module 704 is used for acquiring a reward function in the automatic driving process by utilizing the expert demonstration data and adopting maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning;
the output module 705 is configured to output the optimized driving strategy model to the client, so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
The automatic driving device provided by the embodiment of the application synchronously acquires the multi-modal perception information and the driving behavior data of the driving environment, adopts the transformer to fuse the multi-modal perception data to obtain the fusion characteristic representation of the 3D driving scene, improves the global expression capability of the acquired perception data to the driving scene, prevents the loss of the key perception information caused by the increase of the complexity and the antagonism of the driving scene, and ensures the reliability of the automatic driving perception data. Furthermore, the embodiment of the application combines the fusion perception data and the driving behavior as expert demonstration data, MDP modeling is carried out on the automatic driving process, the reward function is obtained based on maximum entropy inverse reinforcement learning, the DRL optimization strategy model is combined, the complex priori knowledge in the real driving scene is fully utilized to learn the driving strategy model, and the rationality of decision planning in the automatic driving process is improved.
On the basis of the above embodiment, as a preferred implementation, the data acquisition module 701 includes:
a first acquisition unit configured to acquire a driving state as multimodal perception information of a driving environment through a plurality of in-vehicle sensor devices;
the second acquisition unit is used for acquiring operations or commands executed aiming at different driving scenes in the driving process as driving behavior data; wherein the driving behavior data comprises any one or a combination of time stamps, speed data, rapid acceleration and deceleration data and lane departure data;
and the aligning unit is used for aligning the multi-modal perception information and the driving behavior data in time sequence according to a timestamp.
On the basis of the foregoing embodiment, as a preferred implementation, the feature fusion module 702 includes:
the extraction unit is used for encoding the multi-modal perception information at different network layers by utilizing a convolutional neural network so as to extract an intermediate feature map;
the fusion unit is used for fusing the intermediate characteristic diagram by using a Transformer to obtain a fusion characteristic diagram;
the first summation unit is used for summing the elements of the fusion feature map and returning the elements to each modal branch to obtain a multi-modal feature vector;
and the second summation unit is used for carrying out element-by-element summation on the multi-modal feature vectors to obtain fusion feature data.
On the basis of the foregoing embodiment, as a preferred implementation, the optimization module 704 includes:
the device comprises an initialization unit, a calculation unit and a control unit, wherein the initialization unit is used for initializing a reward function and a driving strategy model based on the reward function by utilizing a deep neural network;
the first updating unit is used for estimating the state distribution probability density of the driving strategy model by utilizing the expert demonstration data and updating the driving strategy model by utilizing deep reinforcement learning based on the state distribution probability density;
a second updating unit for iteratively calculating an expectation of an access count of driving state-driving behavior, calculating a maximum entropy gradient using the expectation, and updating the weight of the deep neural network based on the maximum entropy gradient;
the judging unit is used for judging whether the updated driving strategy model meets the convergence condition; if so, recording the weight of the deep neural network to obtain an optimized driving strategy model; if not, the work flow of the first updating unit is restarted.
On the basis of the foregoing embodiment, as a preferred implementation manner, the initialization unit is specifically configured to:
defining the expert demonstration data as a set of driving state-driving behavior data pairs;
a reward function in the form of a driving state-driving behavior-reward value is initialized with a deep neural network.
On the basis of the above embodiment, as a preferred implementation, the input of the deep neural network is the driving state and the driving behavior, and the output is the reward value; or the input of the deep neural network is a driving state, the deep neural network comprises a plurality of output channels, and each output channel corresponds to a reward value corresponding to a driving behavior.
On the basis of the foregoing embodiment, as a preferred implementation manner, the convergence condition includes that the number of iterations reaches a preset number of iterations, or a modulus of a gradient of the weight of the deep neural network reaches a preset threshold.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 8 is a structural diagram of an electronic device according to an exemplary embodiment, as shown in fig. 8, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
and the processor 2 is connected with the communication interface 1 to realize information interaction with other equipment, and is used for executing the automatic driving method provided by one or more technical schemes when running a computer program. And the computer program is stored on the memory 3.
In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 8.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic random access Memory (FRAM), a Flash Memory, a magnetic surface Memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface storage may be disk storage or tape storage. The volatile Memory can be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present application may be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.
When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.
In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An automatic driving method, characterized by comprising:
acquiring multi-modal perception information and driving behavior data of a driving environment;
extracting multi-scale features of the multi-mode perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
combining the fusion characteristic data and the driving behavior data into expert demonstration data, and modeling an automatic driving process into a Markov decision process;
acquiring a reward function of the automatic driving process by utilizing the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning;
and outputting the optimized driving strategy model to a client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
2. The automatic driving method according to claim 1, wherein the acquiring multi-modal perception information and driving behavior data of a driving environment comprises:
acquiring driving states through a plurality of vehicle-mounted sensor devices as the multi-modal perception information of the driving environment;
acquiring operations or commands executed for different driving scenarios during driving as the driving behavior data, wherein the driving behavior data comprises any one or a combination of timestamps, speed data, rapid acceleration and deceleration data, and lane departure data;
and aligning the multi-modal perception information and the driving behavior data in time sequence according to the timestamps.
3. The automatic driving method according to claim 1, wherein the extracting multi-scale features of the multi-modal perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data comprises:
encoding the multi-modal perception information at different network layers of a convolutional neural network to extract intermediate feature maps;
fusing the intermediate feature maps by using a Transformer to obtain a fused feature map;
summing the fused feature map element-wise and returning the result to each modal branch to obtain multi-modal feature vectors;
and summing the multi-modal feature vectors element-wise to obtain the fused feature data.
4. The automatic driving method according to claim 1, wherein the acquiring a reward function of the automatic driving process by utilizing the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning comprises:
initializing a reward function and a driving strategy model based on the reward function by utilizing a deep neural network;
estimating the state distribution probability density of the driving strategy model by using the expert demonstration data, and updating the driving strategy model by using deep reinforcement learning based on the state distribution probability density;
iteratively calculating an expectation of the visit counts of driving state-driving behavior pairs, calculating a maximum entropy gradient using the expectation, and updating the weights of the deep neural network based on the maximum entropy gradient;
determining whether the updated driving strategy model meets a convergence condition; if so, recording the weights of the deep neural network to obtain the optimized driving strategy model; and if not, returning to the step of estimating the state distribution probability density of the driving strategy model by using the expert demonstration data.
5. The automatic driving method according to claim 1, wherein the initializing a reward function by utilizing a deep neural network comprises:
defining the expert demonstration data as a set of driving state-driving behavior data pairs;
and initializing, by using a deep neural network, a reward function in the form of driving state-driving behavior-reward value.
6. The automatic driving method according to claim 5, wherein the input of the deep neural network is a driving state and a driving behavior, and the output is a reward value;
or the input of the deep neural network is a driving state, the deep neural network comprises a plurality of output channels, and each output channel corresponds to the reward value of one driving behavior.
7. The automatic driving method according to claim 4, wherein the convergence condition comprises the number of iterations reaching a preset number of iterations, or the norm of the gradient of the weights of the deep neural network reaching a preset threshold.
8. An automatic driving device, comprising:
the data acquisition module is used for acquiring multi-modal perception information and driving behavior data of a driving environment;
the feature fusion module is used for extracting multi-scale features of the multi-modal perception information by using a convolutional neural network, and fusing the multi-scale features by using a Transformer to obtain fused feature data;
the modeling module is used for combining the fused feature data and the driving behavior data into expert demonstration data, and modeling the automatic driving process as a Markov decision process;
the optimization module is used for acquiring a reward function of the automatic driving process by utilizing the expert demonstration data through maximum entropy inverse reinforcement learning, and optimizing a driving strategy model by utilizing deep reinforcement learning;
and the output module is used for outputting the optimized driving strategy model to the client so that the client can realize automatic driving according to the environment perception information by using the optimized driving strategy model.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the automatic driving method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the automatic driving method according to any one of claims 1 to 7.
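
The timestamp alignment recited in claim 2 can be pictured with a short, self-contained Python sketch. The column names, sampling times and the 50 ms tolerance below are illustrative assumptions rather than values from the application; the point is only the nearest-timestamp join between perception frames and driving behavior records before they are combined into expert demonstration pairs.

# Hypothetical example: align perception frames and behavior logs by timestamp.
import pandas as pd

# Camera/LiDAR/radar frames, one row per capture timestamp (assumed columns).
perception = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-11-30 08:00:00.00",
                                 "2021-11-30 08:00:00.10",
                                 "2021-11-30 08:00:00.20"]),
    "frame_id": [0, 1, 2],
})

# Driving behavior log: speed and lane-departure samples with their own timestamps.
behavior = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-11-30 08:00:00.02",
                                 "2021-11-30 08:00:00.11",
                                 "2021-11-30 08:00:00.21"]),
    "speed_mps": [12.0, 12.3, 12.1],
    "lane_offset_m": [0.10, 0.00, -0.05],
})

# Nearest-neighbour join within an assumed 50 ms tolerance keeps the two
# streams consistent in time sequence.
aligned = pd.merge_asof(
    perception.sort_values("timestamp"),
    behavior.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("50ms"),
)
print(aligned)

An explicit tolerance drops behavior samples that have no sufficiently close perception frame, so every expert demonstration pair remains consistent in time.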
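
For the feature fusion of claim 3, the PyTorch sketch below shows one plausible two-modality arrangement at a single scale; the claimed method extracts features at several network layers, and all layer sizes, modality choices and the mean-pooling step here are assumptions made for brevity, not the application's architecture.

# Minimal sketch: per-modality CNN encoders, Transformer fusion of their tokens,
# return of the fused tokens to each branch, and element-wise summation.
import torch
import torch.nn as nn

class FusionBackbone(nn.Module):
    def __init__(self, channels=32, d_model=64):
        super().__init__()
        # One small convolutional encoder per modality (e.g. camera image, LiDAR grid).
        self.cam_enc = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, d_model, 3, stride=2, padding=1), nn.ReLU())
        self.lidar_enc = nn.Sequential(
            nn.Conv2d(1, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, d_model, 3, stride=2, padding=1), nn.ReLU())
        # Transformer encoder attends across tokens from both modalities.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, cam, lidar):
        f_cam = self.cam_enc(cam)        # intermediate feature map (B, D, H, W)
        f_lidar = self.lidar_enc(lidar)  # intermediate feature map (B, D, H, W)
        b, d, h, w = f_cam.shape
        # Flatten spatial positions into tokens and concatenate the two branches.
        tokens = torch.cat([f_cam.flatten(2).transpose(1, 2),
                            f_lidar.flatten(2).transpose(1, 2)], dim=1)
        fused = self.transformer(tokens)
        # Return the fused tokens to each modal branch, pool each branch, and
        # sum element-wise to obtain a single fused feature vector.
        cam_tok, lidar_tok = fused.split(h * w, dim=1)
        return cam_tok.mean(dim=1) + lidar_tok.mean(dim=1)   # (B, D)

model = FusionBackbone()
feat = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(feat.shape)  # torch.Size([2, 64])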
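
The iteration of claim 4 can be illustrated on a toy discrete Markov decision process. In the sketch below, tabular soft value iteration stands in for the deep-reinforcement-learning policy update, a linear reward over one-hot state features stands in for the deep reward network, and the dynamics, demonstrations and hyper-parameters are invented for illustration; the gradient-norm test at the end corresponds to the convergence condition of claim 7.

# Toy maximum-entropy inverse reinforcement learning loop (illustrative only).
import numpy as np

n_states, n_actions, horizon = 5, 2, 10
# Deterministic toy dynamics: action 0 stays in place, action 1 moves right.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, s] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

# "Expert demonstrations": state-action pairs that keep driving to the right.
expert = [(s, 1) for s in range(n_states - 1)] * 4
feat = np.eye(n_states)                               # one-hot state features
mu_expert = feat[[s for s, _ in expert]].mean(axis=0) # expert feature expectation

theta = np.zeros(n_states)                            # reward weights (initialization)
for it in range(200):
    r = feat @ theta
    # Soft (maximum-entropy) value iteration yields a stochastic driving policy.
    V = np.zeros(n_states)
    for _ in range(horizon):
        Q = r[:, None] + P @ V
        V = np.log(np.exp(Q).sum(axis=1))
    policy = np.exp(Q - V[:, None])
    # Expected (time-averaged) state visitation frequencies under that policy.
    d = np.full(n_states, 1.0 / n_states)
    mu_policy = np.zeros(n_states)
    for _ in range(horizon):
        mu_policy += d
        d = np.einsum("s,sa,sat->t", d, policy, P)
    mu_policy /= horizon
    # Maximum-entropy gradient: expert minus policy feature expectations.
    grad = mu_expert - feat.T @ mu_policy
    theta += 0.1 * grad
    if np.linalg.norm(grad) < 1e-3:                   # convergence test
        break
print("learned reward per state:", np.round(feat @ theta, 2))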
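
Claims 5 and 6 admit two input-output forms for the reward network. The PyTorch sketch below shows both under assumed dimensions: a network that takes a concatenated driving state and driving behavior and outputs a single reward value, and a state-only network with one output channel per discrete driving behavior.

# Two illustrative reward-network forms; all dimensions are assumptions.
import torch
import torch.nn as nn

state_dim, action_dim, n_actions = 64, 2, 5

# (a) R(s, a): driving state and driving behavior in, one reward value out.
reward_sa = nn.Sequential(
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, 1))

# (b) R(s)[a]: driving state in, one reward channel per discrete driving behavior.
reward_s = nn.Sequential(
    nn.Linear(state_dim, 128), nn.ReLU(),
    nn.Linear(128, n_actions))

s = torch.randn(8, state_dim)
a = torch.randn(8, action_dim)
print(reward_sa(torch.cat([s, a], dim=-1)).shape)  # torch.Size([8, 1])
print(reward_s(s).shape)                           # torch.Size([8, 5])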
CN202111442633.4A 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium Active CN114194211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111442633.4A CN114194211B (en) 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111442633.4A CN114194211B (en) 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114194211A true CN114194211A (en) 2022-03-18
CN114194211B CN114194211B (en) 2023-04-25

Family

ID=80649717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111442633.4A Active CN114194211B (en) 2021-11-30 2021-11-30 Automatic driving method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114194211B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200189574A1 (en) * 2017-06-02 2020-06-18 Toyota Motor Europe Driving assistance method and system
CN107544516A (en) * 2017-10-11 2018-01-05 苏州大学 Automated driving system and method based on relative entropy depth against intensified learning
CN111907527A (en) * 2019-05-08 2020-11-10 通用汽车环球科技运作有限责任公司 Interpretable learning system and method for autonomous driving
DE102019216232A1 (en) * 2019-10-22 2021-04-22 Volkswagen Aktiengesellschaft Method and device for providing a driving strategy for the automated driving of a vehicle
CN113264043A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned driving layered motion decision control method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄志清 et al.: "End-to-end autonomous driving decision-making based on deep reinforcement learning" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416586A (en) * 2022-12-19 2023-07-11 香港中文大学(深圳) Map element sensing method, terminal and storage medium based on RGB point cloud
CN116416586B (en) * 2022-12-19 2024-04-02 香港中文大学(深圳) Map element sensing method, terminal and storage medium based on RGB point cloud
CN115951587A (en) * 2023-03-10 2023-04-11 苏州浪潮智能科技有限公司 Automatic driving control method, device, equipment, medium and automatic driving vehicle
CN116881707A (en) * 2023-03-17 2023-10-13 北京百度网讯科技有限公司 Automatic driving model, training method, training device and vehicle
CN116859724A (en) * 2023-06-21 2023-10-10 北京百度网讯科技有限公司 Automatic driving model for simultaneous decision and prediction of time sequence autoregressive and training method thereof
CN116859724B (en) * 2023-06-21 2024-03-15 北京百度网讯科技有限公司 Automatic driving model for simultaneous decision and prediction of time sequence autoregressive and training method thereof

Also Published As

Publication number Publication date
CN114194211B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN114194211B (en) Automatic driving method and device, electronic equipment and storage medium
CN113039563B (en) Learning to generate synthetic data sets for training neural networks
CN112529254B (en) Path planning method and device and electronic equipment
US10671077B2 (en) System and method for full-stack verification of autonomous agents
CN111898635A (en) Neural network training method, data acquisition method and device
KR20180111959A (en) Circular networks by motion-based attention for video understanding
KR20180051335A (en) A method for input processing based on neural network learning algorithm and a device thereof
JP2022547611A (en) Simulation of various long-term future trajectories in road scenes
CN113044064B (en) Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN109726676B (en) Planning method for automatic driving system
CN111507459B (en) Method and apparatus for reducing annotation costs for neural networks
CN111401557B (en) Agent decision making method, AI model training method, server and medium
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN113139446B (en) End-to-end automatic driving behavior decision method, system and terminal equipment
CN112256037B (en) Control method and device applied to automatic driving, electronic equipment and medium
CN114889638A (en) Trajectory prediction method and system in automatic driving system
CN114162146B (en) Driving strategy model training method and automatic driving control method
CN111487992A (en) Unmanned aerial vehicle sensing and obstacle avoidance integrated method and device based on deep reinforcement learning
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN113625753A (en) Method for guiding neural network to learn maneuvering flight of unmanned aerial vehicle by expert rules
CN115272433B (en) Light-weight point cloud registration method and system for automatic obstacle avoidance of unmanned aerial vehicle
CN111738046A (en) Method and apparatus for calibrating a physics engine of a virtual world simulator for deep learning based device learning
CN112947466B (en) Parallel planning method and equipment for automatic driving and storage medium
CN115937801A (en) Vehicle track prediction method and device based on graph convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant