CN114662656A - Deep neural network model training method, autonomous navigation method and system - Google Patents

Deep neural network model training method, autonomous navigation method and system

Info

Publication number
CN114662656A
CN114662656A (application number CN202210210763.3A)
Authority
CN
China
Prior art keywords
layer
neural network
network model
deep neural
aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210210763.3A
Other languages
Chinese (zh)
Inventor
曹远强
陈剑勇
刘尊
李坚强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202210210763.3A priority Critical patent/CN114662656A/en
Publication of CN114662656A publication Critical patent/CN114662656A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a deep neural network model training method, an autonomous navigation method and an autonomous navigation system. The autonomous navigation method comprises the steps of setting an unmanned aerial vehicle target location and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by an unmanned aerial vehicle camera; training the deep neural network model by using the deep neural network model training method of the first aspect to obtain the deep neural network model; inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model and predicting the flight action; if the unmanned aerial vehicle has neither collided nor reached the target location, continuing to execute, otherwise stopping the flight; a temporal attention module is also set. A model training method and an autonomous navigation system are also disclosed. By acquiring a state average value and adding the temporal attention module, the invention solves the problem of unstable training caused by overestimation of the Q value in existing deep neural network model algorithms and the problem of their weak capability in processing time-series data.

Description

Deep neural network model training method, autonomous navigation method and system
Technical Field
The invention relates to the technical field of mobile communication, in particular to a deep neural network model training method, an autonomous navigation method and an autonomous navigation system.
Background
Autonomous navigation is a hotspot of current unmanned aerial vehicle research. As unmanned aerial vehicle technology develops rapidly and application scenarios become increasingly complex, achieving autonomous navigation in complex, unknown environments is a very important problem for the unmanned aerial vehicle. Traditional autonomous navigation methods are mainly based on Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM). SLAM alone, however, is not sufficient: it only computes the vehicle's own pose and reconstructs a map of the environment from the camera imaging principle together with geometric and mathematical foundations, so the computer cannot understand the content of the constructed map. For the machine to perceive and interact within the environment, the computer must understand the observed surrounding objects, that is, the meaning of each object in the map, which requires applying artificial intelligence to computer vision. Moreover, these methods place high demands on the unmanned aerial vehicle's hardware, require large storage and computing resources, and are costly.
For a machine to perceive, interact with and understand the objects it observes in an environment, deep learning is required. Deep learning adjusts the parameters of an artificial neural network through iterative gradient-descent optimization, so that the network can best describe the nonlinear mapping between input and output. Deep reinforcement learning combines deep learning with reinforcement learning, giving it both the perception capability of deep learning and the decision-making capability of reinforcement learning, and it has shown great potential in the field of unmanned aerial vehicle autonomous navigation.
The unmanned aerial vehicle autonomous navigation method based on deep reinforcement learning has the advantages of low cost, good real-time performance and the like, but the existing autonomous navigation algorithm still has the defects of unstable training, weak processing capacity on time sequence data and the like.
Disclosure of Invention
The existing deep learning algorithm for autonomous navigation of the unmanned aerial vehicle has the problems of unstable training and weaker time sequence data processing capability.
Aiming at these problems, a deep neural network model training method, an autonomous navigation method and an autonomous navigation system are provided. Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms. By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
In a first aspect, a deep neural network model training method is used for obtaining a deep neural network model through deep reinforcement learning, and improving algorithm stability, and includes:
step 100, initializing the deep neural network model;
step 200, updating an array of the playback experience pool D;
step 300, averaging K state values learned by a value network to obtain a state average value;
step 400, updating network parameters of the deep neural network model by using the state average value;
the deep neural network model comprises an Actor network and a Critic network.
In a first possible implementation manner of the deep neural network model training method according to the first aspect of the present invention, the step 200 includes:
step 210, obtaining the current state information s_t;
step 220, inputting the state information s_t into the Actor network to obtain the predicted action a_t;
step 230, executing the action a_t, receiving the reward r(s_t, a_t) and entering the next state s_{t+1};
step 240, storing the array (s_t, a_t, r(s_t, a_t), s_{t+1}) into the playback experience pool D;
wherein s_t is the state at time t, a_t is the action at time t, and r(s_t, a_t) is the reward when the state is s_t and the action is a_t.
With reference to the first possible implementation manner of the first aspect of the present invention, in a second possible implementation manner, the step 300 includes:
step 310, randomly sampling N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) from the playback experience pool D;
step 320, obtaining the average V̄(s_{t+1}) of the K learned state values of s_{t+1} by using the value network formulas (1) and (2):
V(s_{t+1}) = E_{a_{t+1}~π_φ}[ Q_θ̄(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1} | s_{t+1}) ]   (1)
V̄(s_{t+1}) = (1/K)·Σ_{k=1}^{K} V_k(s_{t+1})   (2)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, π is the policy, π_φ represents the policy whose Actor network parameter is φ, α is the temperature parameter, V(s_{t+1}) is the state value at time t+1, V̄(s_{t+1}) is the average of the state values at time t+1, Q_θ̄(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the target Critic network parameter is θ̄, and π_φ(a_t | s_t) is the probability of selecting action a_t in state s_t when the Actor network parameter is φ.
With reference to the second possible implementation manner of the first aspect of the present invention, in a third possible implementation manner, the step 400 includes:
step 410, updating the Critic network parameter θ by using the state average value calculation formula (2) and formula (3):
J_Q(θ) = E_{(s_t, a_t)~D}[ ½·( Q_θ(s_t, a_t) - ( r(s_t, a_t) + γ·V̄(s_{t+1}) ) )² ]   (3)
step 420, updating the Actor network parameter φ by using formula (4):
J_π(φ) = E_{s_t~D, ε_t~N}[ α·log π_φ( f_φ(ε_t; s_t) | s_t ) - Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (4)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, a_t is the predicted action at time t, s_{t+1} is the state at time t+1, J_Q(θ) is the Critic network loss when the network parameter is θ, D is the playback experience pool, r(s_t, a_t) is the reward when the state is s_t and the action is a_t, γ is the discount factor, Q_θ(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the Critic network parameter is θ, J_π(φ) is the Actor network loss when the network parameter is φ, ε_t is the noise sampled from the Gaussian distribution, f_φ(ε_t; s_t) is the action obtained by Gaussian sampling when the Actor network parameter is φ, the noise is ε_t and the state is s_t, π_φ(f_φ(ε_t; s_t) | s_t) is the probability of selecting action f_φ(ε_t; s_t) in state s_t when the Actor network parameter is φ, and Q_θ(s_t, f_φ(ε_t; s_t)) is the Q value when the Actor network parameter is φ, the Critic network parameter is θ, the state is s_t and the action is f_φ(ε_t; s_t).
With reference to the third possible implementation manner of the first aspect of the present invention, in a fourth possible implementation manner, the step 400 further includes:
step 430, setting a Critic network parameter updating interval N;
and step 440, if the time t is divisible by the Critic network parameter update interval N, updating the Critic network parameter θ into the target Critic network.
In a second aspect, an unmanned aerial vehicle autonomous navigation method based on deep reinforcement learning includes:
step 500, setting the unmanned aerial vehicle target location and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by an unmanned aerial vehicle camera;
step 600, training the neural network model by using the training method to obtain the deep neural network model;
step 700, inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model, and predicting the flight action;
step 800, if the unmanned aerial vehicle has neither collided nor reached the target location, returning to step 500 to continue execution, otherwise the unmanned aerial vehicle stops flying;
wherein the step 600 comprises:
step 610: setting a temporal attention module to the deep neural network model.
In a first possible implementation manner of the method for autonomous navigation of a drone based on deep reinforcement learning according to the second aspect, the step 600 further includes:
step 620, respectively configuring the Actor network and the Critic network to at least comprise a convolutional layer, a global average pooling layer, an LSTM unit layer, a linear fully-connected layer and a ReLU function layer;
step 630, configuring the Actor network and the Critic network in sequence as: a convolution layer, a global average pooling layer, an LSTM unit layer, a first linear fully-connected layer, a first ReLU function layer, a second linear fully-connected layer, a second ReLU function layer and a third linear fully-connected layer;
wherein the linear fully-connected layer comprises:
a first linear fully-connected layer, a second linear fully-connected layer, a third linear fully-connected layer;
the ReLU function layer includes:
a first ReLU function layer and a second ReLU function layer.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the step 600 further includes:
step 640, setting a time attention module in the Actor network;
step 650, setting the time attention module between the LSTM unit layer and the first linear full connection layer;
step 660, calculating the output weight of the given frame LSTM unit layer by using the time attention module to determine the output action.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the step 660 includes:
step 661, learning the scalar weight of the LSTM unit output at each time step by using formula (5):
W_{t-i} = Softmax( ReLU( w_{t-i}·h_{t-i} + b_{t-i} ) )   (5)
where i = 0, …, T-1, Softmax() is the normalization function, h_{t-i} is the LSTM unit hidden vector, w_{t-i} and b_{t-i} are learnable parameters, t denotes the time, and ReLU(w_{t-i}·h_{t-i} + b_{t-i}) denotes the activation value when the LSTM unit hidden vector is h_{t-i} and the learnable parameters are w_{t-i} and b_{t-i};
step 662, calculating the context vector U_T by using formula (6):
U_T = Σ_{i=0}^{T-1} W_{t-i}·h_{t-i}   (6)
step 663, concatenating the context vector U_T with the unmanned aerial vehicle state features as the input of the next layer.
In a third aspect, an unmanned aerial vehicle autonomous navigation system based on deep reinforcement learning adopts the autonomous navigation method of the second aspect, and the system includes
A deep neural network model unit;
an acquisition unit;
a training unit;
the acquisition unit is used for acquiring a T-frame depth map and unmanned aerial vehicle state information acquired by an unmanned aerial vehicle camera;
the training unit is used for averaging K state values learned by the value network and updating the network parameters of the deep neural network model by using the state average values;
the depth neural network model unit is used for predicting the flight action of the unmanned aerial vehicle according to the acquired M-frame depth map acquired by the unmanned aerial vehicle camera and the unmanned aerial vehicle state information;
the deep neural network model comprises: an Actor network and a Critic network;
the Actor network is sequentially configured as follows: the system comprises a convolutional layer, a global average pooling layer, an LSTM unit layer, a time attention module, a first linear full-link layer, a first ReLU function layer, a second linear full-link layer, a second ReLU function layer and a third linear full-link layer;
the temporal attention module is used to calculate output weights for a given frame LSTM unit layer to calculate motion output.
The implementation of the deep neural network model training method, the autonomous navigation method and the system has the following technical effects:
(1) Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms.
(2) By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a deep neural network model training method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep neural network model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of a deep neural network model training method in the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the deep neural network model training method in the present invention;
FIG. 5 is a diagram illustrating a fifth embodiment of a deep neural network model training method according to the present invention;
FIG. 6 is a schematic diagram of a first embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 7 is a schematic diagram of a second embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 8 is a schematic diagram of a third embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 9 is a schematic diagram of a fourth embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 10 is a schematic diagram of an Actor network structure of the deep neural network model according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a Critic network structure of the deep neural network model according to an embodiment of the present invention;
FIG. 12 is a schematic view of a first embodiment of the autonomous navigation system in accordance with the present invention;
the part names indicated by the numbers in the drawings are as follows: 10-deep neural network model unit, 20-acquisition unit, 30-training unit.
Detailed Description
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The existing deep learning algorithm for autonomous navigation of the unmanned aerial vehicle has the problems of unstable training and weaker time sequence data processing capability.
Aiming at the problems, a deep neural network model training method, an autonomous navigation method and an autonomous navigation system are provided.
In a first aspect, as shown in fig. 1, fig. 1 is a schematic diagram of a deep neural network model training method according to a first embodiment of the present invention, and a deep neural network model training method is used for obtaining a deep neural network model through deep reinforcement learning to improve algorithm stability, and includes: step 100, initializing a deep neural network model; step 200, updating an array of the playback experience pool D; step 300, averaging K state values learned by a value network to obtain a state average value; step 400, updating network parameters of the deep neural network model by using the state average value; the deep neural network model comprises an Actor network and a Critic network.
Preferably, as shown in fig. 10, fig. 10 is a schematic diagram of an Actor network structure of the deep neural network model in the present invention, and the Actor network may be implemented to include:
1) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 2) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 3) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 4) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 5) a global average pooling layer; 6) an LSTM unit layer with a hidden size of 8; 7) a temporal attention module; 8) a linear fully-connected layer of scale 64; 9) a ReLU function layer; 10) a linear fully-connected layer of scale 32; 11) a ReLU function layer; 12) a linear fully-connected layer of scale 2.
Referring to fig. 11, fig. 11 is a schematic diagram of an embodiment of a Critic network structure of the deep neural network model in the present invention; the Critic network may be implemented to include: 1) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 2) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 3) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 4) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 5) a global average pooling layer; 6) an LSTM unit layer with a hidden size of 8; 7) a linear fully-connected layer of scale 64; 8) a ReLU function layer; 9) a linear fully-connected layer of scale 32; 10) a ReLU function layer; 11) a linear fully-connected layer of scale 1.
The LSTM unit layer is a long short-term memory network layer, and the ReLU function layer is an activation function layer.
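For illustration, the Actor network described above may be sketched in PyTorch as follows. This is a minimal sketch assuming single-channel depth-map frames as input; the temporal attention module (sketched later in this description) is replaced here by the last LSTM hidden state, and the concatenation with the unmanned aerial vehicle state features is omitted for brevity.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, in_channels=1, lstm_hidden=8, action_dim=2):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):                                  # four conv layers: 8 filters of 2x2, stride 2
            layers.append(nn.Conv2d(c, 8, kernel_size=2, stride=2))
            c = 8
        self.conv = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool2d(1)                  # global average pooling layer
        self.lstm = nn.LSTM(8, lstm_hidden, batch_first=True)
        self.head = nn.Sequential(                          # FC 64 -> ReLU -> FC 32 -> ReLU -> FC 2
            nn.Linear(lstm_hidden, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, action_dim),
        )

    def forward(self, frames):                              # frames: (batch, T, C, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1))                 # per-frame convolutional features
        x = self.gap(x).flatten(1).view(b, t, -1)           # (batch, T, 8)
        h, _ = self.lstm(x)                                 # (batch, T, lstm_hidden)
        return self.head(h[:, -1])                          # stand-in for attention + state features
```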
Firstly, the deep neural network model is initialized: the Actor network parameter φ, the Critic network parameter θ, the target Critic network parameter θ̄, the playback experience pool D, and the target Critic network parameter update interval N.
Preferably, as shown in fig. 2, fig. 2 is a schematic diagram of a deep neural network model training method according to a second embodiment of the present invention, and step 200 includes: step 210, obtaining the current state information s_t; step 220, inputting the state information s_t into the Actor network to obtain the predicted action a_t; step 230, executing the action a_t, receiving the reward r(s_t, a_t) and entering the next state s_{t+1}; step 240, storing the array (s_t, a_t, r(s_t, a_t), s_{t+1}) into the playback experience pool D, wherein s_t is the state at time t, a_t is the action at time t, and r(s_t, a_t) is the reward when the state is s_t and the action is a_t.
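For illustration, the interaction loop of steps 210 to 240 may be sketched in Python as follows, assuming a Gym-style environment interface (reset/step) and placeholder routines actor.predict and update_networks that stand in for the concrete procedures of steps 300 and 400; none of these names are part of the present disclosure.

```python
import random
from collections import deque

def run_training(env, actor, update_networks, episodes=1000, batch_size=64):
    replay_pool = deque(maxlen=100_000)                  # playback experience pool D
    for _ in range(episodes):
        s_t = env.reset()                                # step 210: current state information
        done = False
        while not done:
            a_t = actor.predict(s_t)                     # step 220: Actor predicts the action
            s_next, r_t, done, _ = env.step(a_t)         # step 230: execute, receive reward, next state
            replay_pool.append((s_t, a_t, r_t, s_next))  # step 240: store the array in D
            s_t = s_next
            if len(replay_pool) >= batch_size:           # steps 300-400: sample and update
                batch = random.sample(replay_pool, batch_size)
                update_networks(batch)
```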
Preferably, as shown in fig. 3, fig. 3 is a schematic diagram of a third embodiment of the deep neural network model training method in the present invention, and step 300 includes: step 310, randomly sampling N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) from the playback experience pool D; step 320, obtaining the average V̄(s_{t+1}) of the K learned state values of s_{t+1} by using the value network formulas (1) and (2):
V(s_{t+1}) = E_{a_{t+1}~π_φ}[ Q_θ̄(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1} | s_{t+1}) ]   (1)
V̄(s_{t+1}) = (1/K)·Σ_{k=1}^{K} V_k(s_{t+1})   (2)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, π is the policy, π_φ represents the policy whose Actor network parameter is φ, α is the temperature parameter, V(s_{t+1}) is the state value at time t+1, V̄(s_{t+1}) is the average of the state values at time t+1, Q_θ̄(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the target Critic network parameter is θ̄, and π_φ(a_t | s_t) is the probability of selecting action a_t in state s_t when the Actor network parameter is φ.
Preferably, as shown in fig. 4, fig. 4 is a schematic diagram of a fourth embodiment of the deep neural network model training method in the present invention, and step 400 includes: step 410, updating the Critic network parameter θ by using the state average value calculation formula (2) and formula (3):
J_Q(θ) = E_{(s_t, a_t)~D}[ ½·( Q_θ(s_t, a_t) - ( r(s_t, a_t) + γ·V̄(s_{t+1}) ) )² ]   (3)
step 420, updating the Actor network parameter φ by using formula (4):
J_π(φ) = E_{s_t~D, ε_t~N}[ α·log π_φ( f_φ(ε_t; s_t) | s_t ) - Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (4)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, a_t is the predicted action at time t, s_{t+1} is the state at time t+1, J_Q(θ) is the Critic network loss when the network parameter is θ, D is the playback experience pool, r(s_t, a_t) is the reward when the state is s_t and the action is a_t, γ is the discount factor, Q_θ(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the Critic network parameter is θ, J_π(φ) is the Actor network loss when the network parameter is φ, ε_t is the noise sampled from the Gaussian distribution, f_φ(ε_t; s_t) is the action obtained by Gaussian sampling when the Actor network parameter is φ, the noise is ε_t and the state is s_t, π_φ(f_φ(ε_t; s_t) | s_t) is the probability of selecting action f_φ(ε_t; s_t) in state s_t when the Actor network parameter is φ, and Q_θ(s_t, f_φ(ε_t; s_t)) is the Q value when the Actor network parameter is φ, the Critic network parameter is θ, the state is s_t and the action is f_φ(ε_t; s_t).
N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) are randomly sampled from the playback experience pool D; the loss of the Critic network is calculated through formula (3) and the Critic network parameter θ is updated, and the loss of the Actor network is calculated through formula (4) and the Actor network parameter φ is updated. Formula (2) uses the average of the K state values V(s_{t+1}) learned by the value network to reduce the overestimation of the Q value, thereby reducing the variance in the parameter transfer process and improving the training stability.
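Under the same assumptions, the updates of formulas (3) and (4) may be sketched as follows; the optimizers, the reparameterised actor.rsample interface and the averaged_state_value helper above are illustrative only.

```python
import torch
import torch.nn.functional as F

def update_networks(actor, critic, target_critics, batch, gamma, alpha,
                    actor_opt, critic_opt):
    s, a, r, s_next = batch                                     # N sampled transitions as tensors

    # Step 410: Critic loss J_Q(theta), formula (3)
    v_bar = averaged_state_value(actor, target_critics, s_next, alpha)
    target = r + gamma * v_bar
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 420: Actor loss J_pi(phi), formula (4), with the reparameterised action f_phi(eps; s)
    a_new, log_pi = actor.rsample(s)
    actor_loss = (alpha * log_pi - critic(s, a_new)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```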
Preferably, as shown in fig. 5, fig. 5 is a schematic diagram of a fifth embodiment of the deep neural network model training method in the present invention, and step 400 further includes: step 430, setting a Critic network parameter update interval N; and step 440, if the time t is divisible by the Critic network parameter update interval N, updating the Critic network parameter θ into the target Critic network.
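Steps 430 and 440 may then be sketched as a periodic hard copy of the Critic into the pool of target Critic snapshots; the hard-copy variant and the bookkeeping of the last K snapshots are assumptions consistent with the averaging sketched above.

```python
import copy

def maybe_update_targets(t, N, critic, target_critics, K):
    if t % N == 0:                        # the time t is divisible by the update interval N
        target_critics.append(copy.deepcopy(critic))
        if len(target_critics) > K:       # keep only the last K learned snapshots
            target_critics.pop(0)
```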
Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms.
In a second aspect, as shown in fig. 6, fig. 6 is a schematic view of a first embodiment of the autonomous navigation method in the present invention; an unmanned aerial vehicle autonomous navigation method based on deep reinforcement learning includes: step 500, setting the target location of the unmanned aerial vehicle and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by the unmanned aerial vehicle camera; step 600, training the neural network model by using the training method to obtain the deep neural network model; step 700, inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model and predicting the flight action; step 800, if the unmanned aerial vehicle has neither collided nor reached the target location, returning to step 500 to continue execution, otherwise the unmanned aerial vehicle stops flying; wherein step 600 comprises step 610: setting a temporal attention module.
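For illustration, the navigation loop of steps 500 to 800 may be sketched as follows, assuming a trained model and a flight interface exposing get_depth_frames, get_state, execute, collided and reached; these method names are placeholders and not an interface defined by the present disclosure.

```python
def navigate(drone, model, goal, M=4):
    drone.set_goal(goal)                               # step 500: set the target location
    while True:
        frames = drone.get_depth_frames(M)             # last M depth-map frames from the camera
        state = drone.get_state()                      # unmanned aerial vehicle state information
        action = model.predict(frames, state)          # step 700: predict the flight action
        drone.execute(action)
        if drone.collided() or drone.reached(goal):    # step 800: stop on collision or arrival
            drone.stop()
            break
```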
Preferably, as shown in fig. 7, fig. 7 is a schematic view of a second embodiment of the autonomous navigation method in the present invention, and the step 600 further includes:
step 620, respectively configuring the Actor network and the Critic network to at least comprise a convolutional layer, a global average pooling layer, an LSTM unit layer, a linear fully-connected layer and a ReLU function layer; step 630, configuring the Actor network and the Critic network in sequence as: a convolution layer, a global average pooling layer, an LSTM unit layer, a first linear fully-connected layer, a first ReLU function layer, a second linear fully-connected layer, a second ReLU function layer and a third linear fully-connected layer; the linear fully-connected layer comprises the first linear fully-connected layer, the second linear fully-connected layer and the third linear fully-connected layer; the ReLU function layer comprises the first ReLU function layer and the second ReLU function layer.
Preferably, as shown in fig. 8, fig. 8 is a schematic view of a third embodiment of the autonomous navigation method in the present invention, and the step 600 further includes: step 640, setting a time attention module in the Actor network; step 650, arranging a time attention module between the LSTM unit layer and the first linear full connection layer; step 660, calculate the output weight for a given frame LSTM unit layer using the temporal attention module to determine the output action.
Preferably, as shown in fig. 9, fig. 9 is a schematic view of a fourth embodiment of the autonomous navigation method in the present invention, and step 660 includes: step 661, learning the scalar weight of the LSTM unit output at each time step by using formula (5):
W_{t-i} = Softmax( ReLU( w_{t-i}·h_{t-i} + b_{t-i} ) )   (5)
where i = 0, …, T-1, Softmax() is the normalization function, h_{t-i} is the LSTM unit hidden vector, w_{t-i} and b_{t-i} are learnable parameters, t denotes the time, and ReLU(w_{t-i}·h_{t-i} + b_{t-i}) denotes the activation value when the LSTM unit hidden vector is h_{t-i} and the learnable parameters are w_{t-i} and b_{t-i};
step 662, calculating the context vector U_T by using formula (6):
U_T = Σ_{i=0}^{T-1} W_{t-i}·h_{t-i}   (6)
step 663, concatenating the context vector U_T with the unmanned aerial vehicle state features as the input of the next layer.
The weight W_i of each LSTM unit output is defined by formula (5), where h_i is the LSTM unit hidden vector and w_i and b_i are learnable parameters; the activation function is ReLU, and the ReLU layer is followed by a Softmax function that normalizes the weights so that they sum to 1. According to this definition, each learned weight depends on the information of the previous time steps and the current state information.
The combined context vector U_T is then computed as shown in formula (6); the context vector U_T is the weighted sum of the LSTM unit layer outputs over the T time steps.
The derived context vector U_T is concatenated with the unmanned aerial vehicle state features as the input of the next layer. The learned weight W_{t-i} can be understood as the importance of the LSTM unit layer output at a given frame. Thus, the optimization process can be viewed as learning which observations are relatively more important for learning the correct action.
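A minimal PyTorch sketch of this temporal attention module is given below; it shares a single scoring layer across time steps, which is one possible way to realise the learnable parameters w and b of formula (5), and it concatenates the context vector with the unmanned aerial vehicle state features. Both choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)                   # learnable w and b of formula (5)

    def forward(self, h, drone_state):
        # h: (batch, T, hidden_size) LSTM outputs; drone_state: (batch, state_dim)
        logits = torch.relu(self.score(h)).squeeze(-1)           # ReLU(w·h + b) for each time step
        weights = torch.softmax(logits, dim=1)                   # formula (5): weights sum to 1 over T
        context = (weights.unsqueeze(-1) * h).sum(dim=1)         # formula (6): context vector U_T
        return torch.cat([context, drone_state], dim=-1)         # input of the next layer
```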
Working principle of temporal attention:
action output is calculated explicitly taking into account LSTM unit-level output characteristics of past M frames, and this information is only implicitly conveyed by normal LSTM units. By increasing the value of the T time step, the deep neural network model can take into account a longer historical frame sequence, and thus can make better action choices. By introducing a temporal attention module, the neural network module can handle the ability of longer input sequences, exploration of the temporal dependence of the output actions, and better performance with a partially observable experience.
By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
In a third aspect, as shown in fig. 12, fig. 12 is a schematic view of a first embodiment of the autonomous navigation system in the present invention; an unmanned aerial vehicle autonomous navigation system based on deep reinforcement learning adopts the autonomous navigation method of the second aspect and includes a deep neural network model unit 10, an obtaining unit 20 and a training unit 30; the obtaining unit 20 is configured to acquire a T-frame depth map and unmanned aerial vehicle state information collected by the unmanned aerial vehicle camera; the training unit 30 is configured to average the K state values learned by the value network and to update the network parameters of the deep neural network model with the state average value; the deep neural network model unit 10 is configured to predict the flight action of the unmanned aerial vehicle according to the acquired T-frame depth map collected by the unmanned aerial vehicle camera and the unmanned aerial vehicle state information; the deep neural network model comprises an Actor network and a Critic network; the Actor network is sequentially configured as: a convolutional layer, a global average pooling layer, an LSTM unit layer, a temporal attention module, a first linear fully-connected layer, a first ReLU function layer, a second linear fully-connected layer, a second ReLU function layer and a third linear fully-connected layer; the temporal attention module is used to calculate the output weights of the LSTM unit layer for a given frame in order to calculate the action output.
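As a structural illustration of the system of fig. 12, the three units may be sketched as plain Python classes; the method names below are illustrative placeholders rather than interfaces fixed by the present disclosure.

```python
class AcquisitionUnit:                      # obtaining unit 20
    def __init__(self, camera):
        self.camera = camera

    def acquire(self, num_frames):
        # depth-map frames from the drone camera plus the drone state information
        return self.camera.depth_frames(num_frames), self.camera.drone_state()


class TrainingUnit:                         # training unit 30
    def train_step(self, model, batch, K):
        # average the K learned state values, then update the model's network parameters
        raise NotImplementedError


class DeepModelUnit:                        # deep neural network model unit 10
    def __init__(self, actor):
        self.actor = actor

    def predict_action(self, depth_frames, drone_state):
        return self.actor(depth_frames, drone_state)
```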
The implementation of the deep neural network model training method, the autonomous navigation method and the system has the following technical effects:
(1) Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms.
(2) By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A deep neural network model training method is used for obtaining a deep neural network model through deep reinforcement learning and improving the stability of an algorithm of the deep neural network model, and is characterized by comprising the following steps:
step 100, initializing the deep neural network model;
step 200, updating an array of the playback experience pool D;
step 300, averaging K state values learned by a value network to obtain a state average value;
step 400, updating network parameters of the deep neural network model by using the state average value;
the deep neural network model comprises an Actor network and a Critic network.
2. The deep neural network model training method of claim 1, wherein the step 200 comprises:
step 210, obtaining current state information st
Step 220, converting the state information stInputting the predicted action into the Actor network to obtain the predicted action at
Step 230, executing the action atReceive a reward r(s)t,at) And enter the next state st+1
Step 240, apply the array(s)t,at,r(st,at),st+1) Storing the data into a playback experience pool D;
wherein s istIs the state at time t, atR(s) is the movement at time tt,at) Is in a state of stThe action is atThe prize of the time.
3. The deep neural network model training method of claim 2, wherein the step 300 comprises:
step 310, randomly sampling N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) from the playback experience pool D;
step 320, obtaining the average V̄(s_{t+1}) of the K learned state values of s_{t+1} by using the value network formulas (1) and (2):
V(s_{t+1}) = E_{a_{t+1}~π_φ}[ Q_θ̄(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1} | s_{t+1}) ]   (1)
V̄(s_{t+1}) = (1/K)·Σ_{k=1}^{K} V_k(s_{t+1})   (2)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, π is the policy, π_φ represents the policy whose Actor network parameter is φ, α is the temperature parameter, V(s_{t+1}) is the state value at time t+1, V̄(s_{t+1}) is the average of the state values at time t+1, Q_θ̄(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the target Critic network parameter is θ̄, and π_φ(a_t | s_t) is the probability of selecting action a_t in state s_t when the Actor network parameter is φ.
4. The method of claim 3, wherein the step 400 comprises:
step 410, updating the Critic network parameter θ by using the state average value calculation formula (2) and the Q value loss calculation formula (3):
J_Q(θ) = E_{(s_t, a_t)~D}[ ½·( Q_θ(s_t, a_t) - ( r(s_t, a_t) + γ·V̄(s_{t+1}) ) )² ]   (3)
step 420, updating the Actor network parameter φ by using the policy loss calculation formula (4):
J_π(φ) = E_{s_t~D, ε_t~N}[ α·log π_φ( f_φ(ε_t; s_t) | s_t ) - Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (4)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, a_t is the predicted action at time t, s_{t+1} is the state at time t+1, J_Q(θ) is the Critic network loss when the network parameter is θ, D is the playback experience pool, r(s_t, a_t) is the reward when the state is s_t and the action is a_t, γ is the discount factor, Q_θ(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the Critic network parameter is θ, J_π(φ) is the Actor network loss when the network parameter is φ, ε_t is the noise sampled from the Gaussian distribution, f_φ(ε_t; s_t) is the action obtained by Gaussian sampling when the Actor network parameter is φ, the noise is ε_t and the state is s_t, π_φ(f_φ(ε_t; s_t) | s_t) is the probability of selecting action f_φ(ε_t; s_t) in state s_t when the Actor network parameter is φ, and Q_θ(s_t, f_φ(ε_t; s_t)) is the Q value when the Actor network parameter is φ, the Critic network parameter is θ, the state is s_t and the action is f_φ(ε_t; s_t).
5. The deep neural network model training method of claim 4, wherein the step 400 further comprises:
step 430, setting a Critic network parameter update interval N;
and step 440, if the time t is divisible by the Critic network parameter update interval N, updating the Critic network parameter θ into the target Critic network.
6. An autonomous navigation method for training a deep neural network model by using the training method according to any one of claims 1 to 5, comprising:
step 500, setting the target location of the unmanned aerial vehicle and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by the unmanned aerial vehicle;
step 600, training the neural network model by using the training method to obtain the deep neural network model;
step 700, inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model, and predicting the flight action;
step 800, if the unmanned aerial vehicle has neither collided nor reached the target location, returning to step 500 to continue execution, otherwise the unmanned aerial vehicle stops flying;
wherein the step 600 comprises:
step 610: setting a temporal attention module to the deep neural network model.
7. The autonomous navigation method according to claim 6, characterized in that said step 600 further comprises:
step 620, respectively configuring the Actor network and the Critic network to at least comprise a convolutional layer, a global average pooling layer, an LSTM unit layer, a linear full-link layer and a ReLU function layer;
step 630, configuring the Actor network and the Critic network in sequence as: a convolution layer, a global average pooling layer, an LSTM unit layer, a first linear full-connection layer, a first ReLU function layer, a second linear full-connection layer, a second ReLU function layer and a third linear full-connection layer;
wherein the linear fully-connected layer comprises:
a first linear fully-connected layer, a second linear fully-connected layer, a third linear fully-connected layer;
the ReLU function layer includes:
a first ReLU function layer and a second ReLU function layer.
8. The autonomous navigation method according to claim 7, characterized in that said step 600 further comprises:
step 640, setting a time attention module in the Actor network;
step 650, setting the time attention module between the LSTM unit layer and the first linear full connection layer;
step 660, calculating the output weight of the given frame LSTM unit layer by using the time attention module to determine the output action.
9. The autonomous navigation method of claim 8, wherein the step 660 comprises:
step 661, learning the scalar weights output by the LSTM unit at different time steps by using formula (5):
W_{t-i} = Softmax( ReLU( w_{t-i}·h_{t-i} + b_{t-i} ) )   (5)
where i = 0, …, T-1, Softmax() is the normalization function, h_{t-i} is the LSTM unit hidden vector, w_{t-i} and b_{t-i} are learnable parameters, t denotes the time, and ReLU(w_{t-i}·h_{t-i} + b_{t-i}) denotes the activation value when the LSTM unit hidden vector is h_{t-i} and the learnable parameters are w_{t-i} and b_{t-i};
step 662, calculating the context vector U_T by using formula (6):
U_T = Σ_{i=0}^{T-1} W_{t-i}·h_{t-i}   (6)
step 663, concatenating the context vector U_T with the unmanned aerial vehicle state features as the input of the next layer.
10. An autonomous navigation system for navigating a drone using the autonomous navigation method of any one of claims 6 to 9, comprising
A deep neural network model unit;
an acquisition unit;
a training unit;
the acquisition unit is used for acquiring a T-frame depth map and unmanned aerial vehicle state information acquired by an unmanned aerial vehicle camera;
the training unit is used for averaging K state values learned by the value network and updating network parameters of the deep neural network model by using the state average values;
the depth neural network model unit is used for predicting the flight action of the unmanned aerial vehicle according to the acquired M-frame depth map acquired by the unmanned aerial vehicle camera and the unmanned aerial vehicle state information;
the deep neural network model includes: actor network and Critic network;
the Actor network configuration sequentially comprises: a convolutional layer, a global average pooling layer, an LSTM unit layer, a time attention module, a first linear full-link layer, a first ReLU function layer, a second linear full-link layer, a second ReLU function layer and a third linear full-link layer;
the temporal attention module is used to compute output weights for a given frame LSTM unit layer to compute action output.
CN202210210763.3A 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system Pending CN114662656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210210763.3A CN114662656A (en) 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210210763.3A CN114662656A (en) 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system

Publications (1)

Publication Number Publication Date
CN114662656A true CN114662656A (en) 2022-06-24

Family

ID=82027579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210210763.3A Pending CN114662656A (en) 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system

Country Status (1)

Country Link
CN (1) CN114662656A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114964268A (en) * 2022-07-29 2022-08-30 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device


Similar Documents

Publication Publication Date Title
CN112256056B (en) Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN107833183B (en) Method for simultaneously super-resolving and coloring satellite image based on multitask deep neural network
CN112119409A (en) Neural network with relational memory
CN110874578A (en) Unmanned aerial vehicle visual angle vehicle identification and tracking method based on reinforcement learning
CN110447041B (en) Noise neural network layer
EP3568810A1 (en) Action selection for reinforcement learning using neural networks
CN112313672A (en) Stacked convolutional long-short term memory for model-free reinforcement learning
CN112819253A (en) Unmanned aerial vehicle obstacle avoidance and path planning device and method
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
US11650551B2 (en) System and method for policy optimization using quasi-Newton trust region method
CN115018017B (en) Multi-agent credit allocation method, system and equipment based on ensemble learning
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN113537365B (en) Information entropy dynamic weighting-based multi-task learning self-adaptive balancing method
US20230074148A1 (en) Controller for Optimizing Motion Trajectory to Control Motion of One or More Devices
CN114355915B (en) AGV path planning based on deep reinforcement learning
CN114662656A (en) Deep neural network model training method, autonomous navigation method and system
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Ou et al. Hybrid path planning based on adaptive visibility graph initialization and edge computing for mobile robots
CN116679710A (en) Robot obstacle avoidance strategy training and deployment method based on multitask learning
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN116091776A (en) Semantic segmentation method based on field increment learning
CN115936058A (en) Multi-agent migration reinforcement learning method based on graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination