CN113071524A - Decision control method, decision control device, autonomous driving vehicle and storage medium - Google Patents

Decision control method, decision control device, autonomous driving vehicle and storage medium

Info

Publication number
CN113071524A
CN113071524A
Authority
CN
China
Prior art keywords
information
vehicle
target
decision
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110474518.9A
Other languages
Chinese (zh)
Other versions
CN113071524B (en)
Inventor
陈龙权
贺颖
邹广源
潘微科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202110474518.9A priority Critical patent/CN113071524B/en
Publication of CN113071524A publication Critical patent/CN113071524A/en
Application granted granted Critical
Publication of CN113071524B publication Critical patent/CN113071524B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • B60W60/0015Planning or execution of driving tasks specially adapted for safety
    • B60W60/0016Planning or execution of driving tasks specially adapted for safety of the vehicle or its occupants
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • B60W2554/40Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404Characteristics
    • B60W2554/4042Longitudinal speed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • B60W2554/40Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404Characteristics
    • B60W2554/4043Lateral speed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • B60W2554/80Spatial relation or speed relative to objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The application is applicable to the technical field of automatic driving, and provides a decision control method, a decision control device, an autonomous driving vehicle and a storage medium. The decision control method comprises the following steps: acquiring current running information of an autonomous vehicle and target running information of surrounding vehicles, wherein the surrounding vehicles are vehicles whose distance from the autonomous vehicle is smaller than a preset distance; inputting the current running information of the autonomous vehicle and the target running information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information; inputting the current running information of the autonomous vehicle, the target running information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information; and controlling the autonomous vehicle to run according to the target control information. The safety of automatic driving can be improved through the method and the device.

Description

Decision control method, decision control device, autonomous driving vehicle and storage medium
Technical Field
The application belongs to the technical field of automatic driving, and particularly relates to a decision control method and device, an automatic driving vehicle and a storage medium.
Background
An autonomous vehicle, also called an unmanned vehicle, a computer-driven vehicle or a wheeled mobile robot, is an intelligent vehicle that realizes unmanned driving through a computer system.
Automatic driving is an intelligent system integrating functions such as environmental perception, decision making and control. It is an important component of future intelligent transportation systems and will bring great changes to how people travel and even live. In the field of automatic driving, how to improve safety is an urgent technical problem to be solved.
Disclosure of Invention
The embodiment of the application provides a decision control method and device, an automatic driving vehicle and a storage medium, and the safety of automatic driving can be improved.
In a first aspect, an embodiment of the present application provides a decision control method, where the decision control method includes:
acquiring current running information of an automatic driving vehicle and target running information of surrounding vehicles, wherein the surrounding vehicles are vehicles with a distance smaller than a preset distance from the automatic driving vehicle;
inputting the current running information of the autonomous vehicle and the target running information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information;
inputting the current running information of the automatic driving vehicle, the target running information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information;
and controlling the automatic driving vehicle to run according to the target control information.
In a second aspect, an embodiment of the present application provides a decision control apparatus, including:
an information acquisition module, used for acquiring current running information of an autonomous vehicle and target running information of surrounding vehicles, wherein the surrounding vehicles are vehicles whose distance from the autonomous vehicle is smaller than a preset distance;
a first input module, used for inputting the current running information of the autonomous vehicle and the target running information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information;
a second input module, used for inputting the current running information of the autonomous vehicle, the target running information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information;
and the vehicle control module is used for controlling the automatic driving vehicle to run according to the target control information.
In a third aspect, an embodiment of the present application provides an autonomous vehicle, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the decision control method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the decision control method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on an autonomous vehicle, causes the autonomous vehicle to perform the steps of the decision control method according to the first aspect.
As can be seen from the above, according to this scheme, target decision information can be obtained by acquiring the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles and inputting them into the decision network in the trained actor network; target control information can then be obtained by inputting the current driving information, the target driving information and the target decision information into the control network in the actor network. Since the control network considers both comprehensive road vehicle information and the target decision information, more accurate target control information is obtained, improving the safety of automatic driving.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a decision control method according to an embodiment of the present application;
FIG. 2 is an illustration of a road;
FIG. 3a is a diagram of an example architecture of an actor network; FIG. 3b is a diagram illustrating an example structure of an attention mechanism layer; FIG. 3c is a diagram of an example structure of a convolutional neural network layer;
fig. 4 is a schematic flow chart illustrating an implementation of a decision control method according to a second embodiment of the present application;
fig. 5 is a schematic structural diagram of a decision control device according to a third embodiment of the present application;
fig. 6 is a schematic structural diagram of an autonomous vehicle according to a fourth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "first," "second," "third," and the like in the description of the present application and in the appended claims are used to distinguish between similar descriptions and are not intended to indicate or imply relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
It should also be understood that, the sequence numbers of the steps in this embodiment do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation to the implementation process of this embodiment.
The decision control method provided by the embodiments of the application can be applied to autonomous vehicles at automation levels such as driver assistance, partial automation, conditional automation, high automation and full automation; the specific type of the autonomous vehicle is not limited in any way.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Referring to fig. 1, which is a schematic view of an implementation flow of a decision control method provided in an embodiment of the present application, as shown in fig. 1, the decision control method may include the following steps:
step 101, obtaining current running information of an autonomous vehicle and target running information of surrounding vehicles.
The surrounding vehicle is a vehicle which is less than a preset distance away from the automatic driving vehicle.
In this embodiment, a coordinate system is established with a certain point on the current road where the autonomous vehicle is located as the origin; for example, the traffic direction of the current road serves as the vertical axis of the coordinate system, and the direction at 90° to the traffic direction serves as the horizontal axis. According to this coordinate system, the distance between a surrounding vehicle and the autonomous vehicle can be divided into a lateral distance and a longitudinal distance. A smaller longitudinal distance means the vehicle is closer to the autonomous vehicle, whereas a vehicle with a small lateral distance may be either close to or far from the autonomous vehicle. Surrounding vehicles can therefore be selected more accurately by longitudinal distance. That is, the above-mentioned surrounding vehicle may specifically refer to a vehicle whose longitudinal distance from the autonomous vehicle is less than the preset distance.
It should be noted that, in order to obtain more comprehensive road vehicle information, the number of the above-mentioned surrounding vehicles may be at least two.
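As a simple illustration, the selection of surrounding vehicles by longitudinal distance might look as follows in Python; the function name, the data layout and the 100 m threshold are assumptions of this sketch, not values from the patent.

```python
def select_surrounding_vehicles(ego_y, vehicles, preset_distance=100.0):
    """Return the vehicles whose longitudinal distance from the
    autonomous vehicle is below the preset distance, nearest first.

    ego_y    -- longitudinal position of the autonomous vehicle (m)
    vehicles -- iterable of (vehicle_id, longitudinal_position) pairs
    """
    nearby = [(vid, y) for vid, y in vehicles
              if abs(y - ego_y) < preset_distance]
    return sorted(nearby, key=lambda v: abs(v[1] - ego_y))
```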
The current travel information of the autonomous vehicle includes, but is not limited to, a lateral position of the autonomous vehicle on the current road, a longitudinal position of the autonomous vehicle on the current road, a lateral velocity of the autonomous vehicle, a longitudinal velocity of the autonomous vehicle, and the like.
The target travel information of a surrounding vehicle includes, but is not limited to, the lateral distance of the surrounding vehicle from the autonomous vehicle, the longitudinal position of the surrounding vehicle on the current road, the lateral speed of the surrounding vehicle, the longitudinal speed of the surrounding vehicle, the time until the autonomous vehicle would collide with the surrounding vehicle, and the like. The time until collision may specifically be the time until the autonomous vehicle, maintaining its current speed, collides with the surrounding vehicle.
The current road may be a road on which the autonomous vehicle is located. The surrounding vehicle may be located on the same road as the autonomous vehicle, or may be located on a different road, which is not limited herein.
In this embodiment, a base station may be set near the current road. All vehicles near the base station (for example, all vehicles whose distance from the base station is smaller than a target distance, the target distance being greater than or equal to the preset distance) may send their current driving information to the base station. Since the base station is set near the current road of the autonomous vehicle, the autonomous vehicle of step 101 is among these vehicles. The base station may calculate the longitudinal distance between each remaining vehicle and the autonomous vehicle according to the longitudinal position of the autonomous vehicle on the current road and the longitudinal position of the remaining vehicle on the current road; determine the remaining vehicles whose longitudinal distance is smaller than the preset distance as surrounding vehicles; and calculate the time until the autonomous vehicle collides with a surrounding vehicle according to the following formula (1). After calculating the longitudinal distance between a surrounding vehicle and the autonomous vehicle and the time until collision, the base station may combine them with the longitudinal position of the surrounding vehicle on the current road and the lateral and longitudinal speeds of the surrounding vehicle to form the target driving information of the surrounding vehicle, and transmit this target driving information to the autonomous vehicle.
The calculation formula of the time at which the autonomous vehicle collides with the surrounding vehicle is as follows:
$$ttc = \begin{cases} \dfrac{y_{other} - y_{ego}}{v_{y,ego} - v_{y,other}}, & v_{y,ego} > v_{y,other} \\[6pt] A, & \text{otherwise} \end{cases} \tag{1}$$

wherein $y_{ego}$ indicates the longitudinal position of the autonomous vehicle; $y_{other}$ indicates the longitudinal position of the surrounding vehicle; $v_{y,ego}$ represents the longitudinal speed of the autonomous vehicle; $v_{y,other}$ represents the longitudinal speed of the surrounding vehicle; and $A$ represents a constant greater than zero, e.g., 10 s.
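A minimal Python sketch of equation (1) follows; treating A both as the value returned when the vehicles are not closing in and as an upper bound on ttc is an assumption of this sketch.

```python
def time_to_collision(y_ego, y_other, v_y_ego, v_y_other, A=10.0):
    """Time (s) until the autonomous vehicle, keeping its current
    longitudinal speed, would reach the surrounding vehicle (eq. (1))."""
    closing_speed = v_y_ego - v_y_other
    if closing_speed <= 0.0:               # not closing in: no collision expected
        return A
    ttc = (y_other - y_ego) / closing_speed
    return ttc if 0.0 <= ttc <= A else A   # clamping is an assumption here
```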
As shown in fig. 2, which is a road example diagram, the diagram includes one autonomous vehicle, six surrounding vehicles, and two base stations, the base station 1 transmits the target traveling information of the vehicles a1, a2, and a3 to the autonomous vehicle, and the base station 2 transmits the target traveling information of the vehicles a4, a5, and a6 to the autonomous vehicle, so that the autonomous vehicle can obtain the target traveling information of the six surrounding vehicles.
Step 102, inputting the current running information of the autonomous vehicle and the target running information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information.
The actor network includes a decision network and a control network. The decision network is used for outputting decision information; the decision information is discrete and comprises instructions such as lane keeping, changing to the left lane or changing to the right lane. The control network is used for outputting control information. The control information is continuous and includes, but is not limited to, the angle of the steering wheel, the strength of the throttle or brake, and the like.
In order to distinguish the decision information output by the decision network in the trained actor network from the decision information output by the decision network in the actor network during training, the former may be referred to as target decision information and the latter as candidate decision information.
Similarly, in order to distinguish the control information output by the control network in the trained actor network from the control information output by the control network in the actor network during training, the former may be referred to as target control information and the latter as candidate control information.
As an optional embodiment, the decision network sequentially includes a first attention mechanism layer, a first convolutional neural network layer, and a first fully-connected layer, and the current driving information of the autonomous vehicle and the target driving information of surrounding vehicles are input to the decision network, and obtaining the target decision information includes:
inputting the current running information of the automatic driving vehicle and the target running information of the surrounding vehicles into a first attention mechanism layer, and respectively increasing the dimensionality of the current running information of the automatic driving vehicle and the dimensionality of the target running information of the surrounding vehicles to obtain a first running vector corresponding to the current running information of the automatic driving vehicle and a second running vector corresponding to the target running information of the surrounding vehicles;
determining a first similarity of the autonomous vehicle and surrounding vehicles according to the first driving vector and the second driving vector;
multiplying the first similarity by the second driving vector to obtain a third driving vector;
and inputting the first running vector and the third running vector into the remaining layers of the decision network to obtain target decision information, wherein the remaining layers of the decision network comprise a first convolution neural network layer and a first full-connection layer.
After the current driving information of the automatic driving vehicle and the target driving information of surrounding vehicles are input into the first attention mechanism layer, the dimension of the current driving information and the dimension of the target driving information can be increased in the first attention mechanism layer, so that information related to the automatic driving vehicle is enriched, and the accuracy of the target decision information is improved.
Specifically, the first attention mechanism layer may include an embedded layer to which current driving information of the autonomous vehicle is input, and the dimension of the current driving information may be increased by the embedded layer to obtain the first driving vector; target driving information of surrounding vehicles is input into the embedding layer, and the dimensionality of the target driving information can be increased through the embedding layer to obtain a second driving vector. The first driving vector and the second driving vector have the same dimension. For example, the dimension of the current driving information of the autonomous vehicle is four dimensions, and a first driving vector of eight dimensions can be obtained through the embedded layer; the target running information of the surrounding vehicles has five dimensions, and a second running vector of eight dimensions can be obtained through the embedded layer.
The first similarity between the autonomous vehicle and a surrounding vehicle characterizes the attention the autonomous vehicle pays to that surrounding vehicle: the larger the first similarity, the more attention the autonomous vehicle pays to the surrounding vehicle, and the greater the influence of the surrounding vehicle on the target decision information.
When calculating the first similarity, the first travel vector and the second travel vector may be input to a similarity layer in the first attention mechanism layer, and the similarity layer outputs the first similarity. The similarity layer may be a neural network; the specific similarity calculation method is not limited herein.
As an alternative embodiment, the first attention mechanism layer may further include a Softmax layer. Before multiplying the first similarity by the second travel vector, the first similarity may be input to the Softmax layer, which transforms each first similarity into the range of 0 to 1 such that all the first similarities sum to 1, reducing the amount of computation.
The transformation formula of the first similarity is as follows:
$$\alpha_i = \frac{e^{\omega_i}}{\sum_{n=1}^{N} e^{\omega_n}} \tag{2}$$

wherein $e$ represents Euler's number; $\omega_i$ represents the $i$-th first similarity before transformation; $\omega_n$ represents the $n$-th first similarity before transformation; $\alpha_i$ represents the $i$-th first similarity after the Softmax-layer transformation; and $N$ denotes the number of first similarities, which can also be understood as the number of surrounding vehicles.
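As a quick numeric illustration of equation (2) (with made-up similarity values), the transformed weights lie in (0, 1) and sum to 1:

```python
import math

# Made-up raw similarities for three surrounding vehicles.
omega = [1.2, -0.3, 0.5]
denom = sum(math.exp(w) for w in omega)
alpha = [math.exp(w) / denom for w in omega]
print(alpha)        # each weight lies in (0, 1)
print(sum(alpha))   # the weights sum to 1.0
```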
As an alternative embodiment, inputting the first driving vector and the third driving vector into the remaining layers of the decision network to obtain the target decision information includes:
inputting the third driving vector to the first convolution neural network layer to obtain a fourth driving vector;
and inputting the first driving vector and the fourth driving vector to a first full-connection layer to obtain target decision information.
The first convolutional neural network layer may sequentially include two convolutional layers, a max-pooling layer and a fully connected layer. Passing the third driving vector through the two convolutional layers, the max-pooling layer and the fully connected layer yields higher-level information about the surrounding vehicles.
Before the first driving vector and the fourth driving vector are input to the first fully connected layer, they may be concatenated to obtain a first concatenated vector, which is then input to the first fully connected layer. Concatenating the first driving vector and the fourth driving vector allows them to be input to the first fully connected layer as a whole. A sketch of the whole decision network is given below.
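The following PyTorch sketch shows one plausible realization of the decision network just described (first attention mechanism layer, first convolutional neural network layer, first fully connected layer). All layer sizes, the three-way action space and the linear similarity layer are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecisionNetwork(nn.Module):
    def __init__(self, ego_dim=4, other_dim=5, embed_dim=8, n_actions=3):
        super().__init__()
        self.embed_ego = nn.Linear(ego_dim, embed_dim)      # embedding layer
        self.embed_other = nn.Linear(other_dim, embed_dim)  # embedding layer
        self.similarity = nn.Linear(2 * embed_dim, 1)       # similarity layer
        # first convolutional neural network layer: two conv layers,
        # a max-pooling layer and a fully connected layer
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 16, kernel_size=1), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.conv_fc = nn.Linear(16, embed_dim)
        self.head = nn.Linear(2 * embed_dim, n_actions)     # first FC layer

    def forward(self, s_ego, s_others):
        # s_ego: (B, ego_dim); s_others: (B, N, other_dim)
        e_ego = self.embed_ego(s_ego)                 # first driving vector
        e_other = self.embed_other(s_others)          # second driving vectors
        n = e_other.size(1)
        pair = torch.cat([e_ego.unsqueeze(1).expand(-1, n, -1), e_other], dim=-1)
        omega = self.similarity(pair).squeeze(-1)     # first similarities
        alpha = F.softmax(omega, dim=-1)              # equation (2)
        e_weighted = alpha.unsqueeze(-1) * e_other    # third driving vectors
        h = self.conv(e_weighted.transpose(1, 2)).squeeze(-1)
        h = self.conv_fc(h)                           # fourth driving vector
        s_att = torch.cat([e_ego, h], dim=-1)         # concatenation (S_att)
        return self.head(s_att)                       # decision logits
```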
Step 103, inputting the current running information of the automatic driving vehicle, the target running information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information.
When the target control information is obtained based on the control network, the current running information of the autonomous vehicle and the target running information of the surrounding vehicles are considered, so more comprehensive road vehicle information is available; considering the target decision information as well strengthens the association between the target decision information and the target control information. Together, the more comprehensive road vehicle information and the strengthened association yield more accurate target control information and improve the safety of automatic driving.
As an optional embodiment, the control network sequentially includes a second attention mechanism layer, a second convolutional neural network layer, and a second fully-connected layer, and the current driving information of the autonomous vehicle, the target driving information of the surrounding vehicles, and the target decision information are input to the control network, and obtaining the target control information includes:
inputting the current running information of the automatic driving vehicle and the target running information of the surrounding vehicles into a second attention mechanism layer, and respectively increasing the dimensionality of the current running information of the automatic driving vehicle and the dimensionality of the target running information of the surrounding vehicles to obtain a fifth running vector corresponding to the current running information of the automatic driving vehicle and a sixth running vector corresponding to the target running information of the surrounding vehicles;
determining a second similarity between the automatic driving vehicle and surrounding vehicles according to the fifth driving vector, the sixth driving vector and the target decision information;
multiplying the second similarity by the sixth driving vector to obtain a seventh driving vector;
and inputting the fifth driving vector and the seventh driving vector into the remaining layers of the control network to obtain target control information, wherein the remaining layers of the control network comprise a second convolutional neural network layer and a second fully connected layer.
After the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles are input into the second attention mechanism layer, the dimension of the current driving information and the dimension of the target driving information can be increased in the second attention mechanism layer, thereby enriching the information related to the autonomous vehicle and improving the accuracy of the target control information.
Specifically, the second attention mechanism layer may include an embedded layer to which current driving information of the autonomous vehicle is input, and the dimension of the current driving information may be increased by the embedded layer to obtain a fifth driving vector; and inputting the target running information of the surrounding vehicles into the embedded layer, and increasing the dimensionality of the target running information through the embedded layer to obtain a sixth running vector.
The second similarity between the autonomous vehicle and a surrounding vehicle characterizes the attention the autonomous vehicle pays to that surrounding vehicle: the larger the second similarity, the more attention the autonomous vehicle pays to the surrounding vehicle, and the greater the influence of the surrounding vehicle on the target control information.
In calculating the second similarity, the fifth driving vector, the sixth driving vector, and the target decision information may be input to a similarity layer in the second attention mechanism layer, and the similarity layer outputs the second similarity. The similarity layer may be a neural network, and a specific similarity calculation method is not limited herein.
As an alternative embodiment, the second attention mechanism layer may further include a Softmax layer. Before multiplying the second similarity by the sixth driving vector, the second similarity may be input to the Softmax layer, which transforms each second similarity into the range of 0 to 1 such that all the second similarities sum to 1, reducing the amount of computation.
As an alternative embodiment, inputting the fifth running vector and the seventh running vector to the remaining layers of the control network, and obtaining the target control information includes:
inputting the seventh driving vector into the second convolution neural network layer to obtain an eighth driving vector;
and inputting the fifth driving vector and the eighth driving vector to the second fully connected layer to obtain target control information.
The second convolutional neural network layer may sequentially include two convolutional layers, a max-pooling layer and a fully connected layer. Passing the seventh driving vector through the two convolutional layers, the max-pooling layer and the fully connected layer yields higher-level information about the surrounding vehicles.
Before the fifth driving vector and the eighth driving vector are input to the second fully connected layer, they may be concatenated to obtain a second concatenated vector, which is then input to the second fully connected layer. Concatenating the fifth driving vector and the eighth driving vector allows them to be input to the second fully connected layer as a whole. The similarity computation that distinguishes the control network is sketched below.
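Since the control network differs from the decision network mainly in that the target decision information joins the similarity computation (cf. Fig. 3b), only that part is sketched here; the one-hot encoding of the decision and all dimensions are assumptions of this example.

```python
import torch
import torch.nn as nn

class DecisionAwareSimilarity(nn.Module):
    """Similarity layer of the second attention mechanism layer: the
    target decision information is appended to each ego/other pair."""
    def __init__(self, embed_dim=8, n_actions=3):
        super().__init__()
        self.similarity = nn.Linear(2 * embed_dim + n_actions, 1)

    def forward(self, e_ego, e_other, decision_onehot):
        # e_ego: (B, E); e_other: (B, N, E); decision_onehot: (B, n_actions)
        n = e_other.size(1)
        pair = torch.cat(
            [e_ego.unsqueeze(1).expand(-1, n, -1),
             e_other,
             decision_onehot.unsqueeze(1).expand(-1, n, -1)],
            dim=-1,
        )
        # second similarities, normalized as in equation (2)
        return torch.softmax(self.similarity(pair).squeeze(-1), dim=-1)
```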
Step 104, controlling the autonomous vehicle to run according to the target control information.
The autonomous vehicle may implement autonomous driving according to the target control information after acquiring the target control information.
Fig. 3a shows an example of the structure of the actor network. In Fig. 3a, a'_d represents the target decision information; a'_c represents the target control information; s_ego represents the current travel information of the autonomous vehicle; and (s_1, s_2, …, s_N) represents the target travel information of the surrounding vehicles.
Fig. 3b is a diagram illustrating an example structure of the attention mechanism layer. The solid line portion in Fig. 3b represents the first attention mechanism layer, and the solid line portion together with the dotted line portion represents the second attention mechanism layer; that is, compared with the first attention mechanism layer, the second attention mechanism layer adds the target decision information when calculating the similarity.
The attention mechanism layer enables the autonomous vehicle to focus on the information of certain vehicles according to its current needs, instead of attending to all vehicles equally. For example, in the decision process the autonomous vehicle needs to pay more attention to the vehicle ahead in order to judge whether a lane change is needed, while in the control process it pays more attention to the vehicles in the lane it has decided to change into, so as to better control its speed.
Take the first attention mechanism layer in Fig. 3b as an example. E_ego represents the first driving vector; (E_1, E_2, …, E_N) represent the second driving vectors; Embedding Layer denotes the embedding layer; Similarity Layer denotes the similarity layer; Softmax Layer denotes the Softmax layer; (α_1, α_2, …, α_N) represent the first similarities after the Softmax-layer transformation; (E'_1, E'_2, …, E'_N) represent the third driving vectors; S_att denotes the vector obtained by concatenating the first driving vector and the third driving vectors; Concatenate denotes concatenation; and Dot product denotes the dot product (multiplication).
Fig. 3c is a diagram illustrating an example structure of the convolutional neural network layer; it applies to both the first convolutional neural network layer and the second convolutional neural network layer.
The convolutional neural network layer is mainly used to guarantee translation invariance over the surrounding vehicles and to extract higher-level information. If the target running information of the surrounding vehicles is input ordered by distance, training of the actor network becomes unstable as traffic alternates between sparse and dense; exploiting the translation invariance of the convolutional neural network improves the training stability of the actor network. In addition, using a convolutional neural network reduces the number of network parameters while extracting higher-level information.
According to this embodiment of the application, target decision information can be obtained by acquiring the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles and inputting them into the decision network in the trained actor network, and target control information can then be obtained by inputting the current driving information, the target driving information and the target decision information into the control network in the actor network. Since the control network considers both comprehensive road vehicle information and the target decision information, more accurate target control information is obtained, improving the safety of automatic driving.
Fig. 4 is a schematic diagram of an implementation flow of the decision control method provided in the second embodiment of the present application. As shown in fig. 4, the decision control method may include the steps of:
step 401, obtaining a current environmental state of a first test vehicle.
The current environment state comprises current running information of a first test vehicle and target running information of a second test vehicle, and the second test vehicle is a vehicle with a distance from the first test vehicle being smaller than a preset distance.
The first test vehicle may be an autonomous vehicle used to train the actor network. The second test vehicle may be a vehicle used for training the actor network, and may be an autonomous vehicle or a non-autonomous vehicle, which is not limited herein. The first test vehicle and the second test vehicle may be virtual vehicles constructed using simulation software for training the actor network.
Step 402, inputting the current environment state into a decision network to obtain candidate decision information.
Step 403, inputting the current environment state and the candidate decision information into the control network to obtain candidate control information.
For the candidate decision information and the candidate control information, reference may be made to the related description of the first embodiment, which is not described herein again.
Step 404, determining the next environmental state of the first test vehicle and the reward corresponding to the current environmental state according to the candidate control information.
The reward corresponding to the current environment state may be a reward obtained by taking the candidate control information in the current environment state.
The autonomous vehicle may control the first test vehicle to travel according to the candidate control information. As the first test vehicle travels, the environmental state of the first test vehicle changes, and the changed environmental state is the next environmental state.
Step 405, training the actor network according to the current environment state, the next environment state and the reward corresponding to the current environment state.
The autonomous vehicle can update the network parameters of the actor network according to the current environment state, the next environment state and the reward of the current environment state, thereby completing the training of the actor network.
The network parameters of the actor network may include the network parameters of the decision network and the network parameters of the control network. An illustrative outline of the steps above follows.
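Purely as an illustration, steps 401 to 404 can be summarized in the following skeleton, where env and actor are stand-ins for the simulator and the actor network; none of these names come from the patent.

```python
def collect_transition(env, actor):
    """One interaction step used to train the actor network."""
    s = env.current_state()            # step 401: current environment state
    a_d = actor.decide(s)              # step 402: candidate decision information
    a_c = actor.control(s, a_d)        # step 403: candidate control information
    s_next, r = env.step(a_c)          # step 404: next state and reward
    return s, a_d, a_c, r, s_next      # consumed by the update in step 405
```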
As an optional embodiment, the current driving information of the first test vehicle includes the longitudinal speed of the first test vehicle, the target driving information of the second test vehicle includes the time until the first test vehicle collides with the vehicle ahead of it, and determining the reward corresponding to the current environmental state according to the candidate decision information and the candidate control information includes:
controlling the first test vehicle to run according to the candidate decision information and the candidate control information;
detecting whether the first test vehicle collides or not in the running process of the first test vehicle;
if the first test vehicle collides, determining the reward corresponding to the current environment state as a target value;
if the first test vehicle does not collide, determining a collision reward according to the longitudinal speed of the first test vehicle and the time until the first test vehicle collides with the vehicle ahead of it; detecting whether the first test vehicle safely reaches the destination to obtain a first detection result, and determining a safe arrival reward according to the first detection result; detecting whether the first test vehicle changes lanes to obtain a second detection result, and determining a lane change reward according to the second detection result; and adding the collision reward, the safe arrival reward and the lane change reward, and determining the resulting sum as the reward corresponding to the current environment state.
The autonomous vehicle may calculate the time until the first test vehicle collides with the vehicle in front of it according to formula (1) in the first embodiment.
While the first test vehicle runs according to the candidate control information, the autonomous vehicle can detect whether the first test vehicle collides. If the first test vehicle collides, the reward corresponding to the current environment state is determined to be the target value; if it does not collide, the reward corresponding to the current environment state is composed of the collision reward, the safe arrival reward and the lane change reward.
The calculation formula of the reward corresponding to the current environment state is as follows:
$$r = \begin{cases} B, & \text{if the first test vehicle collides} \\ r_v + r_{safe} + r_{comf}, & \text{otherwise} \end{cases} \tag{3}$$

wherein $B$ represents the target value, a constant less than zero, e.g., -10; $r_v$ indicates the collision reward; $r_{safe}$ indicates the safe arrival reward; and $r_{comf}$ indicates the lane change reward.
The calculation formula of the collision reward is as follows:
$$r_v = \begin{cases} -\dfrac{\beta}{ttc}, & ttc < P \\[6pt] \dfrac{\min(v_y,\,C) - D}{v_{max}}, & ttc \ge P \end{cases} \tag{4}$$

wherein $\beta$ represents the penalty strength and is a constant greater than zero, e.g., β is 7; $v_y$ represents the longitudinal speed of the first test vehicle; $v_{max}$ represents the maximum longitudinal speed of the first test vehicle; $ttc$ represents the time until the first test vehicle collides with the vehicle in front of it; and $C$, $D$ and $P$ are all constants greater than zero, e.g., C is 30, D is 27 and P is 2.
A reward less than zero indicates a penalty. If ttc < P, the current speed of the first test vehicle is at a dangerous value and a collision will occur very soon, so a smaller (more negative) reward is incurred. If ttc ≥ P (i.e., the other condition in equation (4)), the current speed of the first test vehicle is at a safer value; here the minimum of v_y and C is taken, which limits the first test vehicle from driving too fast.
The safe arrival reward is intended to incentivize the first test vehicle to safely reach the destination.
The safe arrival reward is calculated as follows:
$$r_{safe} = \begin{cases} F, & \text{if the first test vehicle safely reaches the destination} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $F$ is a constant greater than zero, e.g., F is 3.
The lane change reward is intended to prevent the first test vehicle from changing lanes constantly.
In order to improve the accuracy of the lane change reward, the longitudinal distance between the first test vehicle and the vehicle ahead before and after the lane change can be considered when calculating it.
The calculation formula of lane change reward is as follows:
$$r_{comf} = \begin{cases} G, & \text{if a lane change occurs and } \Delta y - \Delta y' \le H \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

wherein $\Delta y$ represents the longitudinal distance between the first test vehicle and the vehicle in front after the lane change, and $\Delta y'$ represents the longitudinal distance between the first test vehicle and the vehicle in front before the lane change; $H$ represents a constant greater than zero, e.g., H is 10; $G$ represents a constant less than zero, e.g., G is -0.5.
As an alternative embodiment, training the actor network according to the current environment state, the next environment state and the reward corresponding to the current environment state includes:
inputting the current environment state into a critic network to obtain a state value function of the current environment state;
inputting the next environment state into the critic network to obtain a state value function of the next environment state;
and training the actor network according to the state value function of the current environment state, the state value function of the next environment state and the reward corresponding to the current environment state.
The state value function of the current environment state is used to represent the quality of the current environment state. For example, if in the current environment state the first test vehicle is fast and close to the vehicle ahead, the state value function of the current environment state can be determined to be low; if the first test vehicle is at a moderate speed and in a safer state, the state value function of the current environment state can be determined to be high.
The state value function of the next environment state likewise represents the quality of the next environment state: if in the next environment state the first test vehicle is fast and close to the vehicle ahead, the state value function of the next environment state can be determined to be low; if the first test vehicle is at a moderate speed and in a safer state, it can be determined to be high.
As an optional embodiment, training the actor network according to the state value function of the current environment state, the state value function of the next environment state and the reward corresponding to the current environment state includes:
determining an action value function of the current environment state according to the state value function of the next environment state and the reward corresponding to the current environment state;
determining an advantage function of the actor network according to the state value function of the current environment state and the action value function of the current environment state;
determining an objective function corresponding to the decision network and an objective function corresponding to the control network according to the advantage function;
and training the actor network according to the objective function corresponding to the decision network and the objective function corresponding to the control network.
The calculation formula of the action value function of the current environment state is as follows:
$$Q(s, a_d, a_c) = r + \gamma V(s') \tag{7}$$

wherein $a_d$ represents the candidate decision information; $a_c$ represents the candidate control information; $r$ represents the reward corresponding to the current environment state; $\gamma$ represents the discount factor; and $V(s')$ represents the state value function of the next environment state.
The calculation formula of the advantage function is as follows:

$$A_{\pi_{\theta_d^{old}},\,\pi_{\theta_c^{old}}}(s, a_d, a_c) = Q(s, a_d, a_c) - V(s) \tag{8}$$

wherein $\pi_{\theta_d^{old}}$ denotes the decision network with its old network parameters; $\pi_{\theta_c^{old}}$ denotes the control network with its old network parameters; and $V(s)$ represents the state value function of the current environment state.
The calculation formula of the objective function corresponding to the decision network is as follows:

$$J(\theta_d) = \mathbb{E}\!\left[\rho_d(\theta_d)\,A_{\pi_{\theta_d^{old}},\,\pi_{\theta_c^{old}}}(s, a_d, a_c)\right], \qquad \rho_d(\theta_d) = \frac{\pi_{\theta_d}(a_d \mid s)}{\pi_{\theta_d^{old}}(a_d \mid s)} \tag{9}$$

wherein $\rho_d(\theta_d)$ represents the ratio corresponding to the candidate decision information; $\pi_{\theta_d}(a_d \mid s)$ represents the current probability of taking the candidate decision information in the current environment state; $\pi_{\theta_d^{old}}(a_d \mid s)$ represents the last probability of taking the candidate decision information in the current environment state; and $\theta_d$ represents the new network parameters of the decision network. For example, the old network parameters of the decision network are the network parameters of the decision network in the last iteration, and the new network parameters of the decision network are the network parameters of the decision network in this iteration.
The calculation formula of the objective function corresponding to the control network is as follows:

$$J(\theta_c) = \mathbb{E}\!\left[\rho_c(\theta_c)\,A_{\pi_{\theta_d^{old}},\,\pi_{\theta_c^{old}}}(s, a_d, a_c)\right], \qquad \rho_c(\theta_c) = \frac{\pi_{\theta_c}(a_c \mid s, a_d)}{\pi_{\theta_c^{old}}(a_c \mid s, a_d)} \tag{10}$$

wherein $\rho_c(\theta_c)$ represents the ratio corresponding to the candidate control information; $\pi_{\theta_c}(a_c \mid s, a_d)$ represents the current probability of taking the candidate control information in the current environment state; $\pi_{\theta_c^{old}}(a_c \mid s, a_d)$ represents the last probability of taking the candidate control information in the current environment state; and $\theta_c$ represents the new network parameters of the control network. For example, the old network parameters of the control network are the network parameters of the control network in the last iteration, and the new network parameters of the control network are the network parameters of the control network in this iteration.
$\mathbb{E}$ in formulas (9) and (10) denotes the average value (expectation).
It should be noted that, in this embodiment, the training of the actor network is implemented by maximizing the objective function corresponding to the decision network and the objective function corresponding to the control network.
In this embodiment, the actor network outputs the decision information and the control information at the same time, and the decision information and the control information are strongly associated, so joint training of the decision network and the control network can be realized through formulas (9) and (10).
As an alternative embodiment, this embodiment further includes:
determining an objective function of the critic network according to the advantage function;
and training the critic network according to the objective function of the critic network.
Specifically, the square of the advantage function can be used as the objective function of the critic network.
The autonomous vehicle can train the critic network by minimizing the objective function of the critic network. A sketch of one joint update follows.
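A minimal sketch of one joint update combining equations (7) to (10) with the critic objective just described is shown below; the unclipped surrogate objective and the helper methods on actor (prob_decision, old_prob_decision, etc.) are assumptions of this example, not APIs from the patent.

```python
import torch

def joint_update(batch, actor, critic, opt_actor, opt_critic, gamma=0.99):
    """One joint update of the actor (decision + control) and the critic."""
    s, a_d, a_c, r, s_next = batch
    with torch.no_grad():
        q = r + gamma * critic(s_next)      # equation (7)
        adv = q - critic(s)                 # equation (8): advantage
    # probability ratios for the candidate decision and control information
    ratio_d = actor.prob_decision(s, a_d) / actor.old_prob_decision(s, a_d)
    ratio_c = actor.prob_control(s, a_d, a_c) / actor.old_prob_control(s, a_d, a_c)
    # maximize objectives (9) and (10) <=> minimize their negations
    actor_loss = -(ratio_d * adv).mean() - (ratio_c * adv).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    # critic objective: squared advantage (TD error), minimized
    td_error = r + gamma * critic(s_next).detach() - critic(s)
    critic_loss = td_error.pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
```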
Step 406, obtaining current driving information of the autonomous vehicle and target driving information of the surrounding vehicles.
The step is the same as step 101, and reference may be made to the related description of step 101, which is not described herein again.
Step 407, inputting the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into the decision network in the trained actor network to obtain target decision information.
The step is the same as step 102, and reference may be made to the related description of step 102, which is not repeated herein.
Step 408, inputting the current driving information of the autonomous vehicle, the target driving information of the surrounding vehicles and the target decision information into the control network in the trained actor network to obtain target control information.
The step is the same as step 103, and reference may be made to the related description of step 103, which is not described herein again.
Step 409, controlling the autonomous vehicle to run according to the target control information.
The step is the same as step 104, and reference may be made to the related description of step 104, which is not described herein again.
According to this embodiment of the application, the actor network is trained according to the current environment state, the next environment state and the reward corresponding to the current environment state, so that the actor network jointly learns the decision information and the control information. The joint learning strengthens the association between the decision information and the control information, so the trained actor network can output more accurate control information, improving the safety of automatic driving.
In the formulas of the first and second embodiments, the unit of the position and the distance may be meter, the unit of the speed may be meter/second, and the unit of the time may be second.
Fig. 5 is a schematic structural diagram of a decision control device provided in the third embodiment of the present application, and for convenience of description, only the parts related to the third embodiment of the present application are shown.
The decision control device comprises:
the information acquisition module 51 is configured to acquire current driving information of the autonomous vehicle and target driving information of surrounding vehicles, where the surrounding vehicles are vehicles whose distance from the autonomous vehicle is less than a preset distance;
a first input module 52, configured to input the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information;
a second input module 53, configured to input the current driving information of the autonomous vehicle, the target driving information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information;
and a vehicle control module 54 for controlling the autonomous vehicle to run according to the target control information.
Optionally, the decision control device further includes:
a state acquisition module, used for acquiring the current environment state of a first test vehicle, wherein the current environment state comprises the current running information of the first test vehicle and the target running information of a second test vehicle, and the second test vehicle is a vehicle whose distance from the first test vehicle is less than the preset distance;
the third input module is used for inputting the current environment state into a decision network to obtain candidate decision information;
the fourth input module is used for inputting the current environment state and the candidate decision information into the control network to obtain candidate control information;
the information confirmation module is used for determining the next environment state of the first test vehicle and the reward corresponding to the current environment state according to the candidate control information;
and the network training module is used for training the actor network according to the current environment state, the next environment state and the reward corresponding to the current environment state.
Optionally, the current driving information of the first test vehicle includes a longitudinal speed of the first test vehicle, the target driving information of the second test vehicle includes a time when the first test vehicle collides with a vehicle ahead of the first test vehicle, and the information confirmation module is specifically configured to:
controlling the first test vehicle to run according to the candidate control information;
detecting whether the first test vehicle collides or not in the running process of the first test vehicle;
if the first test vehicle collides, determining the reward corresponding to the current environment state as a target value;
if the first test vehicle does not collide, determining a collision reward according to the longitudinal speed of the first test vehicle and the time to collision between the first test vehicle and the vehicle ahead of it; detecting whether the first test vehicle safely reaches the destination to obtain a first detection result, and determining a safe arrival reward according to the first detection result; detecting whether the first test vehicle changes lanes to obtain a second detection result, and determining a lane change reward according to the second detection result; and adding the collision reward, the safe arrival reward and the lane change reward, and determining the sum as the reward corresponding to the current environment state.
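As a rough illustration of this reward structure, the sketch below composes the three terms. The numeric constants and the speed-over-time-to-collision shaping are assumptions made for the example; the patent fixes only the structure (a target value on collision, otherwise the sum of collision, safe-arrival, and lane-change rewards).

    COLLISION_TARGET_VALUE = -10.0   # hypothetical "target value" returned on a crash

    def environment_reward(collided: bool, longitudinal_speed: float,
                           time_to_collision: float, reached_destination: bool,
                           changed_lane: bool) -> float:
        if collided:
            return COLLISION_TARGET_VALUE
        # One plausible shaping: penalize driving fast while temporally
        # close to the vehicle ahead (small time to collision).
        collision_reward = -longitudinal_speed / max(time_to_collision, 1e-3)
        safe_arrival_reward = 5.0 if reached_destination else 0.0
        lane_change_reward = -0.1 if changed_lane else 0.0
        return collision_reward + safe_arrival_reward + lane_change_reward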
The network training module comprises:
the state input unit is used for inputting the current environment state into a critic network to obtain a state value function of the current environment state;
the environment input unit is used for inputting the next environment state into the critic network to obtain a state value function of the next environment state;
and the first training unit is used for training the actor network according to the state value function of the current environment state, the state value function of the next environment state and the reward corresponding to the current environment state.
The first training unit is specifically configured to:
determining an action value function of the current environment state according to the state value function of the next environment state and the reward corresponding to the current environment state;
determining an advantage function of the actor network according to the state value function of the current environment state and the action value function of the current environment state;
determining an objective function corresponding to the decision network and an objective function corresponding to the control network according to the advantage function;
and training the actor network according to the objective function corresponding to the decision network and the objective function corresponding to the control network.
The network training module further comprises:
the function determining unit is used for determining an objective function of the critic network according to the advantage function;
and the second training unit is used for training the critic network according to the objective function of the critic network.
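Under standard advantage actor-critic conventions, which these units appear to follow, the quantities above relate as Q(s, a) = r + γ·V(s') and A(s, a) = Q(s, a) − V(s). A hedged sketch of the resulting losses is below; the discount factor and the log-probability policy-gradient objective are assumptions, since the patent's exact objective functions are defined by its own formulas.

    import torch

    def actor_critic_losses(critic, reward, state, next_state, log_prob, gamma=0.99):
        """Generic advantage actor-critic losses (illustrative, not the patented formulas)."""
        v = critic(state)                                # state value function of s
        with torch.no_grad():
            q = reward + gamma * critic(next_state)     # action value via one-step bootstrap
        advantage = q - v                                # advantage function A(s, a)
        actor_loss = -(log_prob * advantage.detach()).mean()  # objective for the actor network
        critic_loss = advantage.pow(2).mean()            # critic regresses V(s) toward Q(s, a)
        return actor_loss, critic_loss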
Optionally, the decision network sequentially includes a first attention mechanism layer, a first convolutional neural network layer, and a first fully-connected layer, and the first input module 52 includes:
a first adding unit, configured to input the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into the first attention mechanism layer, and respectively increase the dimension of the current driving information of the autonomous vehicle and the dimension of the target driving information of the surrounding vehicles to obtain a first driving vector corresponding to the current driving information of the autonomous vehicle and a second driving vector corresponding to the target driving information of the surrounding vehicles;
a first determination unit, configured to determine a first similarity between the autonomous vehicle and the surrounding vehicles according to the first driving vector and the second driving vector;
a first multiplying unit, configured to multiply the first similarity by the second driving vector to obtain a third driving vector;
and a first input unit, configured to input the first driving vector and the third driving vector into the remaining layers of the decision network to obtain the target decision information, where the remaining layers of the decision network comprise the first convolutional neural network layer and the first fully-connected layer.
Optionally, the first input unit is specifically configured to:
inputting the third driving vector into the first convolutional neural network layer to obtain a fourth driving vector;
and inputting the first driving vector and the fourth driving vector into the first fully-connected layer to obtain the target decision information.
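A minimal sketch of this attention step follows, assuming dot-product similarity and softmax normalization (the patent does not specify the similarity measure): the ego driving vector scores each surrounding-vehicle vector, and the scores weight those vectors before they enter the convolutional and fully-connected layers.

    import torch
    import torch.nn.functional as F

    def decision_attention(ego: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        """ego: (1, d) first driving vector; neighbors: (n, d) second driving vectors."""
        similarity = F.softmax(neighbors @ ego.t(), dim=0)  # (n, 1) first similarity
        return similarity * neighbors                       # (n, d) third driving vectors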
Optionally, the control network sequentially includes a second attention mechanism layer, a second convolutional neural network layer, and a second fully-connected layer, and the second input module 53 includes:
a second adding unit, configured to input the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into the second attention mechanism layer, and respectively increase the dimension of the current driving information of the autonomous vehicle and the dimension of the target driving information of the surrounding vehicles to obtain a fifth driving vector corresponding to the current driving information of the autonomous vehicle and a sixth driving vector corresponding to the target driving information of the surrounding vehicles;
a second determining unit, configured to determine a second similarity between the autonomous vehicle and the surrounding vehicles according to the fifth driving vector, the sixth driving vector and the target decision information;
a second multiplying unit, configured to multiply the second similarity by the sixth driving vector to obtain a seventh driving vector;
and a second input unit, configured to input the fifth driving vector and the seventh driving vector into the remaining layers of the control network to obtain the target control information, where the remaining layers of the control network comprise the second convolutional neural network layer and the second fully-connected layer.
Optionally, the second input unit is specifically configured to:
inputting the seventh driving vector into the second convolutional neural network layer to obtain an eighth driving vector;
and inputting the fifth driving vector and the eighth driving vector into the second fully-connected layer to obtain the target control information.
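The control network's attention is analogous, except that the similarity also depends on the target decision information. One plausible way to realize that dependence, sketched below, is to fold the decision into the query via a learned projection; this conditioning scheme is an assumption, as the patent states only that the second similarity is computed from the two driving vectors and the decision information.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d, k = 8, 3                           # assumed driving-vector and decision dimensions
    project = nn.Linear(d + k, d)         # learned decision-aware query projection

    def control_attention(ego: torch.Tensor, neighbors: torch.Tensor,
                          decision: torch.Tensor) -> torch.Tensor:
        """ego: (1, d) fifth driving vector; neighbors: (n, d) sixth driving vectors."""
        query = project(torch.cat([ego, decision], dim=-1))   # decision conditions the query
        similarity = F.softmax(neighbors @ query.t(), dim=0)  # (n, 1) second similarity
        return similarity * neighbors                         # (n, d) seventh driving vectors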
The decision control device provided in this embodiment of the present application can be applied to the first and second method embodiments; for details, refer to the description of those embodiments, which is not repeated here.
Fig. 6 is a schematic structural diagram of an autonomous vehicle according to a fourth embodiment of the present application. As shown in Fig. 6, the autonomous vehicle 6 of this embodiment includes: one or more processors 60 (only one of which is shown), a memory 61, and a computer program 62 stored in the memory 61 and executable on the processors 60. The processor 60, when executing the computer program 62, implements the steps in the decision control method embodiments described above.
The autonomous vehicle 6 may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will appreciate that Fig. 6 is merely an example of the autonomous vehicle 6 and does not constitute a limitation on it; the autonomous vehicle 6 may include more or fewer components than shown, combine some components, or use different components. For example, it may also include input/output devices, network access devices, a bus, and the like.
The processor 60 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 61 may be an internal storage unit of the autonomous vehicle 6, such as a hard disk or a memory of the autonomous vehicle 6. The memory 61 may also be an external storage device of the autonomous vehicle 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) provided on the autonomous vehicle 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the autonomous vehicle 6. The memory 61 is used to store the computer program and other programs and data required by the autonomous vehicle 6, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided herein, it should be understood that the disclosed apparatus/autonomous vehicle and method may be implemented in other ways. For example, the above-described device/autonomous vehicle embodiments are merely illustrative, and for example, a division of modules or units is merely a logical division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the methods of the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer readable medium may be suitably increased or decreased as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
All or part of the flow in the methods of the above embodiments may also be implemented by a computer program product: when the computer program product runs on an autonomous vehicle, the autonomous vehicle, in executing it, implements the steps of the above method embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (13)

1. A method of decision control, the method comprising:
acquiring current driving information of an autonomous vehicle and target driving information of surrounding vehicles, wherein the surrounding vehicles are vehicles whose distance from the autonomous vehicle is less than a preset distance;
inputting the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information;
inputting the current driving information of the autonomous vehicle, the target driving information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information;
and controlling the autonomous vehicle to run according to the target control information.
2. The decision control method of claim 1, wherein the training process of the actor network comprises:
acquiring a current environment state of a first test vehicle, wherein the current environment state comprises current driving information of the first test vehicle and target driving information of a second test vehicle, and the second test vehicle is a vehicle whose distance from the first test vehicle is less than the preset distance;
inputting the current environment state into the decision network to obtain candidate decision information;
inputting the current environment state and the candidate decision information into the control network to obtain candidate control information;
determining, according to the candidate control information, the next environment state of the first test vehicle and the reward corresponding to the current environment state;
and training the actor network according to the current environment state, the next environment state and the reward corresponding to the current environment state.
3. The decision control method according to claim 2, wherein the current driving information of the first test vehicle comprises a longitudinal speed of the first test vehicle, the target driving information of the second test vehicle comprises a time to collision between the first test vehicle and the vehicle ahead of it, and the determining the reward corresponding to the current environment state according to the candidate control information comprises:
controlling the first test vehicle to run according to the candidate control information;
detecting whether the first test vehicle collides or not in the running process of the first test vehicle;
if the first test vehicle collides, determining that the reward corresponding to the current environment state is a target value;
if the first test vehicle does not collide, determining a collision reward according to the longitudinal speed of the first test vehicle and the time to collision between the first test vehicle and the vehicle ahead of it; detecting whether the first test vehicle safely reaches the destination to obtain a first detection result, and determining a safe arrival reward according to the first detection result; detecting whether the first test vehicle changes lanes to obtain a second detection result, and determining a lane change reward according to the second detection result; and adding the collision reward, the safe arrival reward and the lane change reward, and determining the sum as the reward corresponding to the current environment state.
4. The decision control method of claim 2, wherein the training the actor network according to the current environment state, the next environment state and the reward corresponding to the current environment state comprises:
inputting the current environment state into a critic network to obtain a state value function of the current environment state;
inputting the next environment state into the critic network to obtain a state value function of the next environment state;
and training the actor network according to the state value function of the current environment state, the state value function of the next environment state and the reward corresponding to the current environment state.
5. The decision control method according to claim 4, wherein the training the actor network according to the state value function of the current environment state, the state value function of the next environment state and the reward corresponding to the current environment state comprises:
determining an action value function of the current environment state according to the state value function of the next environment state and the reward corresponding to the current environment state;
determining an advantage function of the actor network according to the state value function of the current environment state and the action value function of the current environment state;
determining an objective function corresponding to the decision network and an objective function corresponding to the control network according to the advantage function;
and training the actor network according to the objective function corresponding to the decision network and the objective function corresponding to the control network.
6. The decision control method according to claim 5, wherein the decision control method further comprises:
determining an objective function of the critic network according to the advantage function;
and training the critic network according to the objective function of the critic network.
7. The decision control method according to claim 1, wherein the decision network comprises a first attention mechanism layer, a first convolutional neural network layer and a first fully-connected layer in this order, and the inputting the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into the decision network to obtain the target decision information comprises:
inputting the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicle into the first attention mechanism layer, and respectively increasing the dimension of the current driving information of the autonomous vehicle and the dimension of the target driving information of the surrounding vehicle to obtain a first driving vector corresponding to the current driving information of the autonomous vehicle and a second driving vector corresponding to the target driving information of the surrounding vehicle;
determining a first similarity of the autonomous vehicle to the surrounding vehicle according to the first driving vector and the second driving vector;
multiplying the first similarity by the second driving vector to obtain a third driving vector;
and inputting the first driving vector and the third driving vector into the remaining layers of the decision network to obtain the target decision information, wherein the remaining layers of the decision network comprise the first convolutional neural network layer and the first fully-connected layer.
8. The decision control method of claim 7, wherein the inputting the first driving vector and the third driving vector into the remaining layers of the decision network to obtain the target decision information comprises:
inputting the third driving vector into the first convolutional neural network layer to obtain a fourth driving vector;
and inputting the first driving vector and the fourth driving vector into the first fully-connected layer to obtain the target decision information.
9. The decision control method according to any one of claims 1 to 8, wherein the control network comprises a second attention mechanism layer, a second convolutional neural network layer and a second fully-connected layer in this order, and the inputting the current driving information of the autonomous vehicle, the target driving information of the surrounding vehicles and the target decision information into the control network to obtain the target control information comprises:
inputting the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicle into the second attention mechanism layer, and respectively increasing the dimension of the current driving information of the autonomous vehicle and the dimension of the target driving information of the surrounding vehicle to obtain a fifth driving vector corresponding to the current driving information of the autonomous vehicle and a sixth driving vector corresponding to the target driving information of the surrounding vehicle;
determining a second similarity of the autonomous vehicle and the surrounding vehicle according to the fifth driving vector, the sixth driving vector and the target decision information;
multiplying the second similarity by the sixth driving vector to obtain a seventh driving vector;
and inputting the fifth driving vector and the seventh driving vector into the remaining layers of the control network to obtain the target control information, wherein the remaining layers of the control network comprise the second convolutional neural network layer and the second fully-connected layer.
10. The decision control method according to claim 9, wherein the inputting the fifth driving vector and the seventh driving vector into the remaining layers of the control network to obtain the target control information comprises:
inputting the seventh driving vector into the second convolutional neural network layer to obtain an eighth driving vector;
and inputting the fifth driving vector and the eighth driving vector into the second fully-connected layer to obtain the target control information.
11. A decision control device, characterized in that the decision control device comprises:
an information acquisition module, used for acquiring current driving information of an autonomous vehicle and target driving information of surrounding vehicles, wherein the surrounding vehicles are vehicles whose distance from the autonomous vehicle is less than a preset distance;
a first input module, used for inputting the current driving information of the autonomous vehicle and the target driving information of the surrounding vehicles into a decision network in a trained actor network to obtain target decision information;
a second input module, used for inputting the current driving information of the autonomous vehicle, the target driving information of the surrounding vehicles and the target decision information into a control network in the actor network to obtain target control information;
and a vehicle control module, used for controlling the autonomous vehicle to run according to the target control information.
12. An autonomous vehicle comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the decision control method according to any one of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the decision control method according to any one of claims 1 to 10.
CN202110474518.9A 2021-04-29 2021-04-29 Decision control method, decision control device, autonomous driving vehicle and storage medium Active CN113071524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110474518.9A CN113071524B (en) 2021-04-29 2021-04-29 Decision control method, decision control device, autonomous driving vehicle and storage medium


Publications (2)

Publication Number Publication Date
CN113071524A true CN113071524A (en) 2021-07-06
CN113071524B CN113071524B (en) 2022-04-12

Family

ID=76615978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110474518.9A Active CN113071524B (en) 2021-04-29 2021-04-29 Decision control method, decision control device, autonomous driving vehicle and storage medium

Country Status (1)

Country Link
CN (1) CN113071524B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200331493A1 (en) * 2018-01-08 2020-10-22 Beijing Tusen Weilai Technology Co., Ltd. Autonomous driving system
CN109358614A (en) * 2018-08-30 2019-02-19 深圳市易成自动驾驶技术有限公司 Automatic Pilot method, system, device and readable storage medium storing program for executing
CN110244728A (en) * 2019-06-17 2019-09-17 北京三快在线科技有限公司 Determine the method, apparatus, equipment and storage medium of unmanned control strategy
US20210081787A1 (en) * 2019-09-12 2021-03-18 Beijing University Of Posts And Telecommunications Method and apparatus for task scheduling based on deep reinforcement learning, and device
CN111845773A (en) * 2020-07-06 2020-10-30 北京邮电大学 Automatic driving vehicle micro-decision-making method based on reinforcement learning
CN112085165A (en) * 2020-09-02 2020-12-15 中国第一汽车股份有限公司 Decision information generation method, device, equipment and storage medium
CN112644516A (en) * 2020-12-16 2021-04-13 吉林大学青岛汽车研究院 Unmanned control system and control method suitable for roundabout scene

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705802A (en) * 2021-07-26 2021-11-26 深圳市易成自动驾驶技术有限公司 Method, apparatus, system, program product and medium for synchronous calculation of automatic driving
CN113705802B (en) * 2021-07-26 2023-09-08 深圳市易成自动驾驶技术有限公司 Synchronous calculation method, device, system, program product and medium for automatic driving
CN114162144A (en) * 2022-01-06 2022-03-11 苏州挚途科技有限公司 Automatic driving decision method and device and electronic equipment
CN114162144B (en) * 2022-01-06 2024-02-02 苏州挚途科技有限公司 Automatic driving decision method and device and electronic equipment
CN116259194A (en) * 2023-03-21 2023-06-13 阿维塔科技(重庆)有限公司 Anti-collision method and device for vehicle, equipment and storage medium

Also Published As

Publication number Publication date
CN113071524B (en) 2022-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: He Ying

Inventor after: Chen Longquan

Inventor after: Zou Guangyuan

Inventor after: Pan Weike

Inventor before: Chen Longquan

Inventor before: He Ying

Inventor before: Zou Guangyuan

Inventor before: Pan Weike

GR01 Patent grant