CN115472038B

CN115472038B - Automatic parking method and system based on deep reinforcement learning

Info

Publication number: CN115472038B
Application number: CN202211353517.XA
Authority: CN
Inventors: 邱思杰; 黄忠虎; 贾鹏; 马豪; 伍坪; 谢华; 刘春明; 纪联南
Original assignee: Nanjing Jiezhiyi Technology Co ltd; Sanming University
Current assignee: Nanjing Jiezhiyi Technology Co ltd; Sanming University
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2023-02-03
Anticipated expiration: 2042-11-01
Also published as: CN115472038A

Abstract

The invention provides an automatic parking method and system based on deep reinforcement learning, which comprises the steps of constructing an initial evaluator network and an initial executor network; training the initial evaluator network and the initial executor network to obtain an executor network based on a state value baseline of a state; acquiring a current image of a vehicle; acquiring the current vehicle position and the parking position; inputting the current image, the current vehicle position and the parking space position into the executor network, and outputting a current action execution strategy by the executor network; the vehicle executes the strategy execution action based on the current action, and acquires a next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle finishes an automatic parking task; the control instruction of the vehicle is generated by using the deep neural network, and the training of the deep neural network is completed by an evaluator executor algorithm, so that automatic parking can be better realized.

Description

Automatic parking method and system based on deep reinforcement learning

Technical Field

The invention relates to the technical field of automatic driving, in particular to an automatic parking method and system based on deep reinforcement learning.

Background

The parking task is a situation frequently encountered in daily life, and particularly when the range of a feasible driving space around a target parking space is small, the parking task often requires a lot of driving experience and driving skill of a driver, which cannot ensure completion of the corresponding parking task for inexperienced drivers. In the traditional scheme, multiple cameras and a vehicle-mounted radar are mostly adopted as a vehicle environment sensing means, the system cost is improved, the complexity of characteristic information extraction is increased, the vehicle path planning and the motion control of the vehicle are mutually split, and the parking system module is complex in design.

In view of the above, the present invention provides an automatic parking method and system based on deep reinforcement learning, so as to provide an end-to-end automatic parking solution while meeting the requirements for automatic parking tasks in daily life. The invention adopts a camera as an environment perception means, uses a deep neural network to generate a control instruction of a vehicle, completes the training of the deep neural network through an evaluator executor algorithm, and finally realizes the automatic parking function.

Disclosure of Invention

The invention aims to provide an automatic parking method based on deep reinforcement learning, which comprises the steps of constructing an initial evaluator network and an initial executor network; training the initial evaluator network and the initial executor network based on a state value baseline of a state to obtain an executor network; training to obtain an executor network, and constructing a profit gradient of the initial executor network based on the value of an action execution strategy and the state value baseline; wherein the formula for constructing the profit gradient is:

wherein the content of the first and second substances,

representing the revenue gradient;

representing the accumulated revenue;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing a state value baseline of the vehicle at time t;

is shown in a state

Perform an action

The sample action execution policy of (1); updating network parameters of the initial actor network based on the benefit gradient until the benefit gradient reaches a maximum value; using the initial executor network as training when obtaining maximum profit gradientA trained actor network; acquiring a current image of a vehicle; the current image includes a state of the vehicle in a current environment; acquiring the current vehicle position and the parking position; inputting the current image, the current vehicle position and the parking space position into the executor network, and outputting a current action execution strategy by the executor network; and the vehicle executes the strategy execution action based on the current action, and acquires a next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle finishes the automatic parking task.

Further, by constructing a multi-layer data structure, obtaining the initial evaluator network and the initial executor network, including a convolution operation and a maximal pooling operation with 7 × 7 in a first layer of the data structure; a second layer of the data structure adopts a residual error module to perform feature extraction; the third layer of the data structure adopts a residual error module to extract the characteristics; a fourth layer of the data structure adopts a residual error module to perform feature extraction; a fifth layer of the data structure adopts a residual error module to extract the characteristics; the sixth layer of the data structure employs an averaging pooling operation.

Further, the training is performed to obtain an executor network, which comprises inputting a sample image, a sample vehicle position and a sample parking space position into the initial executor network, and the initial executor network outputs a sample action execution strategy; the vehicle executes a policy-enforcement action based on the sample action; obtaining an action reward for executing the sample action execution strategy; taking the sample image, the execution action, the action reward and the next sample image as training samples and storing the training samples in an experience pool; the next sample image is an image of the vehicle environment obtained after the action is executed; randomly extracting training samples from the experience pool; inputting a sample image in the extracted training sample and a next sample image into the initial executor network to obtain the value of an action execution strategy and the state value baseline; updating network parameters of the initial actor network and the initial evaluator network based on the value of the action execution policy and the state value baseline; and when the vehicle is not collided and the training of the initial executor network and the initial evaluator network is finished, obtaining the trained executor network and the trained evaluator network.

Further, the formula for updating the network parameters of the initial actor network is:

wherein, the first and the second end of the pipe are connected with each other,

network parameters representing the updated initial actor network;

network parameters representing the initial actor network;

representing a learning rate of the initial actor network;

a discount rate representing an action award;

representing a value of the action execution policy;

representing the state value baseline;

a sample action execution strategy representing the extracted training samples; the formula for updating the network parameters of the initial evaluator network is:

network parameters representing the updated initial evaluator network;

network parameters representing the initial evaluator network;

representing a learning rate of the initial evaluator network;

representing a value of the action execution policy;

representing the state value baseline;

representing a state value baseline of the selected training sample.

Further, the evaluator network training is completed, including constructing a loss function of the initial evaluator network based on the state value baseline; updating network parameters of the initial evaluator network based on the loss function until the loss function reaches a minimum value; and taking the initial evaluator network when the minimum loss function is obtained as the trained evaluator network.

Further, the formula for constructing the loss function is as follows:

wherein the content of the first and second substances,

indicating that the initial evaluator network has network parameters of

A loss function of time;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing the vehicle's state value baseline at time t.

Further, the formula of the action execution strategy is as follows:

wherein the content of the first and second substances,

representing the selected action;

indicating a driving direction of the vehicle;

indicating steering of the steering wheel.

The invention aims to provide an automatic parking system based on deep reinforcement learning, which comprises a deep neural network construction module, a deep neural network training module, an image acquisition module, a position acquisition module, a determination module and a circulation module; the deep neural network construction module is used for constructing an initial evaluator network and an initial executor network; the deep neural network training module is used for training the initial evaluator network and the initial executor network to obtain an executor network based on a state value baseline of a state; training to obtain an executor network, and constructing a profit gradient of the initial executor network based on the value of an action execution strategy and the state value baseline; wherein the formula for constructing the profit gradient is:

representing the revenue gradient;

representing the accumulated revenue;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing a state value baseline of the vehicle at time t;

is shown in a state

Perform an action

The sample action execution policy of (1); updating network parameters of the initial actor network based on the benefit gradient until the benefit gradient reaches a maximum value; taking the initial executor network when the maximum profit gradient is obtained as a trained executor network; the image acquisition module is used for acquiring a current image of the vehicle; the current image includes a state of the vehicle in a current environment; the position acquisition module is used for acquiring the current vehicle position and the parking space position; the determining module is used for inputting the current image, the current vehicle position and the parking space position into the executor network, and the executor network outputs a current action execution strategy; and the circulation module is used for executing the strategy execution action by the vehicle based on the current action, and acquiring a next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle finishes an automatic parking task.

The technical scheme of the embodiment of the invention at least has the following advantages and beneficial effects:

some embodiments of the invention can greatly improve the convergence rate of network training by constructing a profit gradient and updating the actor network based on the maximum value of the profit gradient.

Some embodiments of the invention train the network by adopting an evaluator executor algorithm based on a state baseline, train the evaluator network while training the executor network, so that the updated executor network can update the parameters of the network based on the evaluation of the updated evaluator network, improve the accuracy of parameter update, and the state value baseline is an evaluation reference obtained by the evaluator network according to historical actions and values, so that the change of the evaluation value can be within a certain range, and reduce errors.

Some embodiments of the present invention enable further increasing the efficiency of vehicle exploration for the environment space in the deep reinforcement learning environment by using a reward function based on a potential function difference form.

Drawings

FIG. 1 is a flowchart illustrating an example method for automatic parking based on deep reinforcement learning according to some embodiments of the present disclosure;

FIG. 2 is an exemplary flow diagram of training a resulting actor network provided by some embodiments of the present invention;

fig. 3 is a block diagram of an automatic parking system based on deep reinforcement learning according to some embodiments of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Fig. 1 is an exemplary flowchart of an automatic parking method based on deep reinforcement learning according to some embodiments of the present invention. In some embodiments, process 100 may be performed by system 300. The process 100 shown in fig. 1 may include the following steps:

step 110, an initial evaluator network and an initial executor network are constructed. In some embodiments, step 110 may be performed by deep neural network building module 310.

The initial actor network may refer to a deep neural network used to train the resulting actor network. The actor network may be used to determine an action execution policy based on the input current image of the vehicle. The current image may refer to an image of the vehicle in the current environment. In some embodiments, a camera is disposed on the vehicle, and the camera may acquire an image of an environment in which the vehicle is located. An action execution strategy may refer to an action that can be performed based on the environment in which the vehicle is currently located. For example, the current image of the vehicle with dimension 3 × 224 is input to the actor network, and the actor network outputs an action execution policy of 10 × 1. Where 10 represents ten execution actions that may be made.

The execution of the action takes a discrete action space, and in some embodiments, the formula of the action execution policy is:

wherein the content of the first and second substances,

an action that represents that may be selected;

indicating the driving direction of the vehicle. For example, forward or backward.

Indicating the steering of the steering wheel. For example, the steering wheel steering angles are five, i.e., 90 degrees left, 45 degrees left, neutral, 45 degrees right, and 90 degrees right.

The initial evaluator network may refer to a deep neural network used to train the resulting evaluator network. The evaluator network may be configured to determine a value of an action execution policy based on the action execution policy.

In some embodiments, the deep neural network construction module 310 may construct the actor network and the evaluator network in various ways to construct a deep neural network.

And step 120, training the initial evaluator network and the initial executor network based on the action value baseline of the state to obtain an executor network. In some embodiments, step 120 may be performed by deep neural network training module 320.

The state value baseline may reflect the current state of the vehicle. In some embodiments, the state value baseline may be obtained through an evaluator network. For example, a current image of the vehicle, a next image after the execution of the action, and a corresponding action execution policy may be input to an evaluator network, which outputs a state value baseline of the vehicle in a current state and a state value baseline of the vehicle in a next state.

In some embodiments, the deep neural network training module 320 may train the initial actor network by various methods of training the deep machine learning model, resulting in an actor network. For more on training the executive network, see figure 2 and its associated description.

Step 130, acquiring a current image of the vehicle; the current image includes a state in which the vehicle is in the current environment. In some embodiments, step 130 may be performed by image acquisition module 330.

See step 110 and its associated description for the current image and the associated content of the acquired current image.

And 140, acquiring the current vehicle position and the parking space position. In some embodiments, step 140 may be performed by location acquisition module 340.

In some embodiments, the position obtaining module 340 may obtain the current vehicle position and the parking space position in various feasible manners. For example, the current position of the vehicle may be acquired by an onboard GPS; and the position information of the idle parking spaces in the garage is acquired through communication connection with the garage.

And 150, inputting the current image, the current vehicle position and the parking space position into an executor network, and outputting a current action execution strategy by the executor network. In some embodiments, step 150 may be performed by determination module 350.

For example, will

The environmental image of the vehicle, the position of the vehicle and the position of the parking space at the moment are input into an executor network, and the executor network outputs the action that the vehicle can be selected at the current position and the probability of the selection. In some embodiments, the action with the highest probability may be taken as the execution action.

And step 160, the vehicle executes the action based on the current action execution strategy, and acquires the next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle finishes the automatic parking task. In some embodiments, step 160 may be performed by loop module 360.

The next image may refer to an image of the environment around the vehicle acquired when the vehicle reaches the next state after performing the motion. The next image is acquired in the same manner as the current image. The next vehicle position may refer to a position that the vehicle reaches after performing the action. The next vehicle position is acquired in the same manner as the current vehicle position is acquired.

In some embodiments, the performer network may determine a number of performance actions of the vehicle based on the environment in which the vehicle is located a number of times, respectively. The vehicles may each perform an action until the vehicle reaches a parking location. In some embodiments, a GNSS sensor may be provided on the vehicle, and whether the vehicle reaches the parking location may be determined based on a difference between a sensor longitude, a sensor latitude, and a vehicle attitude of the vehicle and a parking space longitude, a parking space latitude, and an attitude of the requested vehicle, which are acquired by the GNSS sensor. In some embodiments, a potential function representation of the state may be constructed by selecting a distance between a latitude and longitude coordinate corresponding to a GNSS of a current location of the vehicle and a GNSS latitude and longitude coordinate of the vehicle at a destination location (parking space):

wherein the content of the first and second substances,

to represent

At the first moment

The dimensional information displayed by the individual GNSS sensors,

represent

At the first moment

The longitude information displayed by the individual GNSS sensors,

represents the first

The latitude coordinate of the sensor at the end position,

represents the first

Longitude coordinates of the sensor at the end position.

The greater the difference in vehicle position from the terminal, the greater the corresponding state-based potential function. Value of current potential function

And when the difference value is smaller than the preset difference value threshold value, determining that the vehicle finishes automatic parking. The preset difference threshold may refer to a maximum value of a potential function for the vehicle to complete parking. The preset difference threshold may be set empirically.

According to some embodiments of the invention, the change of the vehicle state is represented through the change of the image information, and then the optimal action execution strategy is determined according to different states of the vehicle, so that the dynamic planning of the vehicle parking path is realized, and the motion control of the vehicle is completed through the action execution strategy. And the control strategy of the vehicle is output after the input image of the front camera of the vehicle is calculated through a deep neural network, so that the end-to-end automatic parking function is realized.

The architecture of the initial evaluator network and the initial executor network may include an input layer, an output layer, and six-layer data structure layers. Obtaining an initial evaluator network and an initial executor network by constructing a multi-layer data structure, wherein the first layer of the data structure adopts 7-by-7 convolution operation and maximum pooling operation; the second layer of the data structure adopts a residual error module to extract the characteristics; the third layer of the data structure adopts a residual error module to extract the characteristics; the fourth layer of the data structure adopts a residual error module to extract the characteristics; the fifth layer of the data structure adopts a residual error module to extract the characteristics; the sixth layer of the data structure employs an averaging pooling operation.

The input layer may be used to input 3 x 224 image data. In some embodiments, the input pixel size accepts an RGB color picture of size 224 x 224. And the output layer is used for outputting the obtained feature vectors after full connection. For the actor network, the dimensions of the feature vectors it outputs are 10 × 1, where 10 corresponds to 10 actions of the vehicle. For the evaluator network, the dimension of the output feature vector is 1 × 1, where 1 corresponds to the value of the action.

Figure 2 is an exemplary flow diagram of training a resulting actor network provided by some embodiments of the present invention. In some embodiments, the process 200 may be performed by the deep neural network training module 320. As shown in fig. 2, the process 200 may include the following steps:

and step 210, inputting the sample image, the sample vehicle position and the sample parking space position into an initial executor network, and outputting a sample action execution strategy by the initial executor network. Wherein the sample image includes the current state of the vehicle, which may be recorded as

(ii) a The sample action execution policy may be noted as

。

The sample image may refer to an image of the current environment of a vehicle used to train the actor network. The sample vehicle location may refer to the current location of the vehicle used to train the actor network. The sample slot location may refer to a location of a slot in which the vehicle is required to park during the training process. In some embodiments, the sample image, sample vehicle location, and sample parking spot locations may be obtained by automatic parking of the vehicle. For example, a sample parking space position for parking and an initial position of the vehicle may be preset, and then an actual working scene of the vehicle may be simulated to obtain a sample image and a sample vehicle position. Wherein the initial position may be based on the design of the environment.

In step 220, the vehicle executes a policy enforcement action based on the sample action. Wherein the execution action may be noted as

。

Performing the action may refer to an action of the vehicle to go from a current state to a next state. For example, the vehicle moves according to the action with the highest probability of being selected in the execution strategy.

An action reward for executing the sample action execution policy is obtained, step 230. Wherein the action reward may be recorded as

。

In some embodiments, an action reward is calculated

The formula of (1) is as follows:

wherein the content of the first and second substances,

represents a proportionality coefficient, the effect of which is to

The scaling is carried out to a reasonable interval, and the scaling can be actually determined according to requirements;

to represent

Time reward letterA component of the number based on the potential function difference;

represents a collision penalty of 0 when no collision occurs and-2 when a collision occurs;

indicating a reward after completion of the auto-dock, a reward of +5 is given when the task is completed.

In some embodiments, calculating

Component of time reward function based on potential function difference

The formula of (1) is:

i.e. the difference between the two preceding and following vehicle situational functions, with respect to

See fig. 1 and its associated description.

Step 240, taking the sample image, the execution action, the action reward and the next sample image as training samples and storing the training samples in an experience pool; the next sample image is an image of the vehicle environment obtained after the action is performed. Wherein the next sample image can be used to represent the next state of the vehicle, which can be noted as

。

In some embodiments, the training samples stored in the experience pool may be in the format of

。

At step 250, training samples are randomly drawn from the experience pool. Wherein the extracted training samples may be recorded as

。

Step 260, inputting the sample image in the extracted training sample and the next sample image into the initial executor network to obtain the value and state value baseline of the action execution strategy.

Step 270, updating the network parameters of the initial actor network and the initial evaluator network based on the value of the action execution policy and the state value baseline.

In some embodiments, the network parameters of the initial actor network and the initial evaluator network may be updated in an iterative manner.

In some embodiments, the formula for updating the network parameters of the initial actor network is:

network parameters representing the updated initial actor network;

representing the network parameters of the initial executor network, wherein the initial network parameters are obtained through an initialization model;

representing a learning rate of an initial actor network, the learning rate of the initial actor network being determined by initializing a simulated parking environment;

represents the value of the action execution policy;

representing a status value baseline;

a sample action execution policy representing the extracted training samples.

The formula for updating the network parameters of the initial evaluator network is:

network parameters representing the updated initial evaluator network;

representing the network parameters of an initial evaluator network, wherein the initial network parameters are obtained through an initialization model;

representing the learning rate of an initial evaluator network, the learning rate of the initial evaluator network being determined by initializing a simulated parking environment;

represents the value of the action execution policy;

representing a status value baseline;

representing the state value baseline of the selected training sample.

In some embodiments of the present invention, the,

representing an action award;

the discount rate of the action reward is represented and is determined by initializing a simulated parking environment;

representing a state value baseline of the vehicle at time t + 1;

representing the vehicle's state value baseline at time t.

And step 280, when the vehicle is not collided and the training of the initial executor network and the initial evaluator network is finished, obtaining the trained executor network and the trained evaluator network.

The non-collision means that the vehicle does not collide during the completion of parking.

In some embodiments, the actor network training is complete, including constructing a revenue gradient for the initial actor network based on the value and state value baselines of the action execution strategy; updating network parameters of the initial executor network based on the profit gradient until the profit gradient reaches a maximum value; and taking the initial executor network obtained when the maximum profit gradient is obtained as a trained executor network.

In some embodiments, the formula for constructing the revenue gradient is:

wherein the content of the first and second substances,

representing a revenue gradient;

representing the accumulated revenue;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing a state value baseline of the vehicle at time t;

is shown in a state

Perform an action

The sample action execution policy of (1).

In some embodiments, evaluator network training is complete, including constructing a loss function for an initial evaluator network based on a state value baseline; updating the network parameters of the initial evaluator network based on the loss function until the loss function reaches a minimum value; and taking the initial evaluator network when the minimum loss function is obtained as the trained evaluator network.

In some embodiments, the formula for constructing the loss function is:

indicating that the initial evaluator network has network parameters of

A loss function of time;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing the vehicle's state value baseline at time t.

And if the vehicle is collided or the automatic parking task is finished, re-initializing the system environment of the vehicle, and training again until the action strategy output by the network meets the automatic parking requirement.

Some embodiments herein employ two deep neural networks for the generation of a vehicle action enforcement strategy, i.e., an actor, respectively; and an evaluator, which is an estimate of the vehicle state value. The vehicle action value function in the process of the strategy gradient algorithm is estimated by using the reinforcement learning algorithm based on the value function, so that the defect that the vehicle state value cannot be accurately obtained by the strategy gradient algorithm in an unknown environment is overcome.

Fig. 3 is a block diagram of an automatic parking system based on deep reinforcement learning according to some embodiments of the present invention. As shown in fig. 3, the system 300 may include a deep neural network construction module 310, a deep neural network training module 320, an image acquisition module 330, a location acquisition module 340, a determination module 350, and a loop module 360.

The deep neural network building module 310 is used to build an initial evaluator network and an initial performer network. For more on the deep neural network building block 310, refer to fig. 1 and its associated description.

The deep neural network training module 320 is configured to train the initial evaluator network and the initial executor network to obtain an executor network based on the state value baseline of the state. For more on the deep neural network training module 320, refer to fig. 1 and its associated description.

The image acquisition module 330 is used for acquiring a current image of the vehicle; the current image includes a state in which the vehicle is in the current environment. For more on the image acquisition module 330, refer to fig. 1 and its associated description.

The position obtaining module 340 is used for obtaining the current vehicle position and the parking space position. For more on the location acquisition module 340, refer to fig. 1 and its related description.

The determination module 350 is configured to input the current image, the current vehicle position, and the parking space position into the executor network, where the executor network outputs a current action execution policy. For more of the determination module 350, refer to fig. 1 and its associated description.

The loop module 360 is used for the vehicle to execute the action based on the current action execution strategy, and to obtain the next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle completes the automatic parking task. For more of the loop module 360, see FIG. 1 and its associated description.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An automatic parking method based on deep reinforcement learning is characterized by comprising the following steps:

constructing an initial evaluator network and an initial executor network; obtaining the initial evaluator network and the initial executor network by constructing a multi-layer data structure, including:

a first level of the data structure employs a convolution operation of 7 x 7 and a max pooling operation;

a second layer of the data structure adopts a residual error module to perform feature extraction;

the third layer of the data structure adopts a residual error module to extract the characteristics;

a fourth layer of the data structure adopts a residual error module to perform feature extraction;

the fifth layer of the data structure adopts a residual error module to extract the characteristics;

the sixth layer of the data structure adopts an average pooling operation;

training the initial evaluator network and the initial executor network based on a state value baseline of a state to obtain an executor network; wherein training obtains an actor network comprising:

constructing a revenue gradient for the initial actor network based on the value of an action execution policy and the state value baseline; wherein the formula for constructing the profit gradient is:

wherein the content of the first and second substances,

representing the revenue gradient;

representing the accumulated revenue;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing a state value baseline of the vehicle at time t;

is shown in a state

Perform an action

The sample action execution policy of (1);

updating network parameters of the initial actor network based on the benefit gradient until the benefit gradient reaches a maximum value;

taking the initial executor network when the maximum profit gradient is obtained as a trained executor network;

acquiring a current image of a vehicle; the current image includes a state of the vehicle in a current environment;

acquiring the current vehicle position and the parking position;

inputting the current image, the current vehicle position and the parking space position into the executor network, and outputting a current action execution strategy by the executor network; the action execution strategy refers to the execution action made based on the environment where the vehicle is currently located; the formula of the action execution strategy is:

representing the selected action;

indicating a driving direction of the vehicle;

indicating steering of the steering wheel;

the training obtains an executor network, further comprising:

inputting a sample image, a sample vehicle position and a sample parking space position into the initial executor network, and outputting a sample action execution strategy by the initial executor network;

the vehicle executes a policy-enforcement action based on the sample action;

obtaining an action reward for executing the sample action execution strategy;

taking the sample image, the execution action, the action reward and the next sample image as training samples and storing the training samples in an experience pool; the next sample image is an image of the vehicle environment obtained after the action is executed;

randomly extracting training samples from the experience pool;

inputting a sample image in the extracted training sample and a next sample image into the initial executor network to obtain the value of an action execution strategy and the state value baseline;

updating network parameters of the initial actor network and the initial evaluator network based on the value of the action execution policy and the state value baseline;

when the vehicle is not collided and the training of the initial executor network and the initial evaluator network is finished, obtaining the trained executor network and the trained evaluator network;

and the vehicle executes the strategy execution action based on the current action, and acquires a next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle finishes the automatic parking task.

2. The deep reinforcement learning-based automatic parking method according to claim 1, wherein the formula for updating the network parameters of the initial actor network is as follows:

network parameters representing the updated initial actor network;

network parameters representing the initial actor network;

representing a learning rate of the initial actor network;

a discount rate representing an action award;

representing a value of the action execution policy;

representing the state value baseline;

a sample action execution strategy representing the extracted training samples;

network parameters representing the updated initial evaluator network;

a network parameter representing the initial evaluator network;

representing a learning rate of the initial evaluator network;

representing a value of the action execution policy;

representing the state value baseline;

representing a state value baseline of the selected training sample.

3. The automatic parking method based on the deep reinforcement learning of claim 1, wherein the evaluator network training is completed and comprises:

constructing a loss function of the initial evaluator network based on the state value baseline;

updating network parameters of the initial evaluator network based on the loss function until the loss function reaches a minimum value;

and taking the initial evaluator network when the minimum loss function is obtained as the trained evaluator network.

4. The automatic parking method based on deep reinforcement learning according to claim 3, wherein the formula for constructing the loss function is as follows:

wherein the content of the first and second substances,

indicating that the initial evaluator network has network parameters of

A loss function of time;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing the vehicle's state value baseline at time t.

5. An automatic parking system based on deep reinforcement learning is characterized by comprising a deep neural network construction module, a deep neural network training module, an image acquisition module, a position acquisition module, a determination module and a circulation module;

the deep neural network construction module is used for constructing an initial evaluator network and an initial executor network; obtaining the initial evaluator network and the initial executor network by constructing a multi-layer data structure, including:

the sixth layer of the data structure adopts an average pooling operation;

the deep neural network training module is used for training the initial evaluator network and the initial executor network to obtain an executor network based on a state value baseline of a state; training to obtain an executor network, and constructing a profit gradient of the initial executor network based on the value of an action execution strategy and the state value baseline; wherein the formula for constructing the profit gradient is:

representing the revenue gradient;

representing the accumulated revenue;

representing an action award;

a discount rate representing an action award;

representing a state value baseline of the vehicle at time t + 1;

representing a state value baseline of the vehicle at time t;

is shown in a state

Perform an action

The sample action execution policy of (1); updating network parameters of the initial actor network based on the benefit gradient until the benefit gradient reaches a maximum value; taking the initial executor network when the maximum profit gradient is obtained as a trained executor network;

the image acquisition module is used for acquiring a current image of the vehicle; the current image includes a state of the vehicle in a current environment;

the position acquisition module is used for acquiring the current vehicle position and the parking space position;

the determining module is used for inputting the current image, the current vehicle position and the parking space position into the executor network, and the executor network outputs a current action execution strategy; the action execution strategy refers to the execution action made based on the environment where the vehicle is currently located; the formula of the action execution strategy is:

representing the selected action;

indicating a driving direction of the vehicle;

indicating steering of the steering wheel;

the training obtains an executor network, further comprising:

the vehicle executes a policy-enforcement action based on the sample action;

obtaining an action reward for executing the sample action execution strategy;

randomly extracting training samples from the experience pool;

when the vehicle is not collided and the training of the initial executor network and the initial evaluator network is completed, obtaining the trained executor network and the trained evaluator network;

and the circulation module is used for executing the action by the vehicle based on the current action execution strategy and acquiring a next action execution strategy based on the executed next image, the next vehicle position and the parking space position until the vehicle finishes an automatic parking task.