CN114662656A - Deep neural network model training method, autonomous navigation method and system - Google Patents

Deep neural network model training method, autonomous navigation method and system

Info

Publication number
CN114662656A
CN114662656A (application number CN202210210763.3A)
Authority
CN
China
Prior art keywords
layer
neural network
network model
deep neural
aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210210763.3A
Other languages
Chinese (zh)
Inventor
曹远强
陈剑勇
刘尊
李坚强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202210210763.3A priority Critical patent/CN114662656A/en
Publication of CN114662656A publication Critical patent/CN114662656A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a deep neural network model training method, an autonomous navigation method and an autonomous navigation system. The autonomous navigation method comprises the steps of setting an unmanned aerial vehicle target location and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by an unmanned aerial vehicle camera; training the deep neural network model by using the deep neural network model training method of the first aspect to obtain the deep neural network model; inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model and predicting the flight action; if the unmanned aerial vehicle has neither collided nor reached the target location, continuing to execute, otherwise stopping the flight; a temporal attention module is also set. A model training method and an autonomous navigation system are also disclosed. By acquiring a state average value and adding the temporal attention module, the invention solves the problem of unstable training caused by overestimation of the Q value in existing deep neural network model algorithms and the problem of their weak capability in processing time-series data.

Description

Deep neural network model training method, autonomous navigation method and system
Technical Field
The invention relates to the technical field of mobile communication, in particular to a deep neural network model training method, an autonomous navigation method and an autonomous navigation system.
Background
Autonomous navigation is a hotspot of current unmanned aerial vehicle research. As unmanned aerial vehicle technology develops rapidly and application scenarios become increasingly complex, achieving autonomous navigation in complex, unknown environments is a very important problem for the unmanned aerial vehicle. Traditional autonomous navigation methods are mainly based on Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM). SLAM alone, however, is not sufficient: it only computes the vehicle's own pose and reconstructs a map of the environment from the camera imaging principle together with geometric and mathematical foundations, so the computer cannot understand the content of the constructed map. For the machine to perceive and interact within the environment, the computer must understand the observed surrounding objects, that is, the meaning of each object in the map, which requires applying artificial intelligence to computer vision. Moreover, these methods place high demands on the unmanned aerial vehicle's hardware, require large storage and computing resources, and are costly.
For a machine to perceive, interact with and understand the objects it observes in an environment, deep learning is required. Deep learning adjusts the parameters of an artificial neural network through iterative gradient-descent optimization, so that the network can best describe the nonlinear mapping between input and output. Deep reinforcement learning combines deep learning with reinforcement learning, giving it both the perception capability of deep learning and the decision-making capability of reinforcement learning, and it has shown great potential in the field of unmanned aerial vehicle autonomous navigation.
The unmanned aerial vehicle autonomous navigation method based on deep reinforcement learning has the advantages of low cost, good real-time performance and the like, but the existing autonomous navigation algorithm still has the defects of unstable training, weak processing capacity on time sequence data and the like.
Disclosure of Invention
The existing deep learning algorithm for autonomous navigation of the unmanned aerial vehicle has the problems of unstable training and weaker time sequence data processing capability.
Aiming at these problems, a deep neural network model training method, an autonomous navigation method and an autonomous navigation system are provided. Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms. By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
In a first aspect, a deep neural network model training method is used for obtaining a deep neural network model through deep reinforcement learning, and improving algorithm stability, and includes:
step 100, initializing the deep neural network model;
step 200, updating an array of the playback experience pool D;
step 300, averaging K state values learned by a value network to obtain a state average value;
step 400, updating network parameters of the deep neural network model by using the state average value;
the deep neural network model comprises an Actor network and a Critic network.
In a first possible implementation manner of the deep neural network model training method according to the first aspect of the present invention, the step 200 includes:
step 210, obtaining the current state information s_t;
step 220, inputting the state information s_t into the Actor network to obtain the predicted action a_t;
step 230, executing the action a_t, receiving the reward r(s_t, a_t) and entering the next state s_{t+1};
step 240, storing the array (s_t, a_t, r(s_t, a_t), s_{t+1}) into the playback experience pool D;
wherein s_t is the state at time t, a_t is the action at time t, and r(s_t, a_t) is the reward when the state is s_t and the action is a_t.
With reference to the first possible implementation manner of the first aspect of the present invention, in a second possible implementation manner, the step 300 includes:
step 310, randomly sampling N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) from the playback experience pool D;
step 320, obtaining the average V̄(s_{t+1}) of the K learned state values of s_{t+1} by using the value network formulas (1) and (2):
V(s_{t+1}) = E_{a_{t+1}~π_φ}[ Q_θ̄(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1} | s_{t+1}) ]   (1)
V̄(s_{t+1}) = (1/K)·Σ_{k=1}^{K} V_k(s_{t+1})   (2)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, π is the policy, π_φ represents the policy whose Actor network parameter is φ, α is the temperature parameter, V(s_{t+1}) is the state value at time t+1, V̄(s_{t+1}) is the average of the state values at time t+1, Q_θ̄(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the target Critic network parameter is θ̄, and π_φ(a_t | s_t) is the probability of selecting action a_t in state s_t when the Actor network parameter is φ.
With reference to the second possible implementation manner of the first aspect of the present invention, in a third possible implementation manner, the step 400 includes:
step 410, updating the Critic network parameter θ by using the state average value calculation formula (2) and formula (3):
J_Q(θ) = E_{(s_t, a_t)~D}[ ½·( Q_θ(s_t, a_t) - ( r(s_t, a_t) + γ·V̄(s_{t+1}) ) )² ]   (3)
step 420, updating the Actor network parameter φ by using formula (4):
J_π(φ) = E_{s_t~D, ε_t~N}[ α·log π_φ( f_φ(ε_t; s_t) | s_t ) - Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (4)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, a_t is the predicted action at time t, s_{t+1} is the state at time t+1, J_Q(θ) is the Critic network loss when the network parameter is θ, D is the playback experience pool, r(s_t, a_t) is the reward when the state is s_t and the action is a_t, γ is the discount factor, Q_θ(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the Critic network parameter is θ, J_π(φ) is the Actor network loss when the network parameter is φ, ε_t is the noise sampled from the Gaussian distribution, f_φ(ε_t; s_t) is the action obtained by Gaussian sampling when the Actor network parameter is φ, the noise is ε_t and the state is s_t, π_φ(f_φ(ε_t; s_t) | s_t) is the probability of selecting action f_φ(ε_t; s_t) in state s_t when the Actor network parameter is φ, and Q_θ(s_t, f_φ(ε_t; s_t)) is the Q value when the Actor network parameter is φ, the Critic network parameter is θ, the state is s_t and the action is f_φ(ε_t; s_t).
With reference to the third possible implementation manner of the first aspect of the present invention, in a fourth possible implementation manner, the step 400 further includes:
step 430, setting a Critic network parameter updating interval N;
and step 440, if the time t is divisible by the Critic network parameter update interval N, updating the Critic network parameter θ into the target Critic network.
In a second aspect, an unmanned aerial vehicle autonomous navigation method based on deep reinforcement learning includes:
step 500, setting the unmanned aerial vehicle target location and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by an unmanned aerial vehicle camera;
step 600, training the neural network model by using the training method to obtain the deep neural network model;
step 700, inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model, and predicting the flight action;
step 800, if the unmanned aerial vehicle has neither collided nor reached the target location, returning to step 500 to continue execution, otherwise the unmanned aerial vehicle stops flying;
wherein the step 600 comprises:
step 610: setting a temporal attention module to the deep neural network model.
In a first possible implementation manner of the method for autonomous navigation of a drone based on deep reinforcement learning according to the second aspect, the step 600 further includes:
step 620, respectively configuring the Actor network and the Critic network to at least comprise a convolutional layer, a global average pooling layer, an LSTM unit layer, a linear fully-connected layer and a ReLU function layer;
step 630, configuring the Actor network and the Critic network in sequence as: a convolution layer, a global average pooling layer, an LSTM unit layer, a first linear fully-connected layer, a first ReLU function layer, a second linear fully-connected layer, a second ReLU function layer and a third linear fully-connected layer;
wherein the linear fully-connected layer comprises:
a first linear fully-connected layer, a second linear fully-connected layer, a third linear fully-connected layer;
the ReLU function layer includes:
a first ReLU function layer and a second ReLU function layer.
With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the step 600 further includes:
step 640, setting a time attention module in the Actor network;
step 650, setting the time attention module between the LSTM unit layer and the first linear full connection layer;
step 660, calculating the output weight of the given frame LSTM unit layer by using the time attention module to determine the output action.
With reference to the second possible implementation manner of the second aspect, in a third possible implementation manner, the step 660 includes:
step 661, learning the scalar weight of the LSTM unit output at each time step by using formula (5):
W_{t-i} = Softmax( ReLU( w_{t-i}·h_{t-i} + b_{t-i} ) )   (5)
where i = 0, …, T-1, Softmax() is the normalization function, h_{t-i} is the LSTM unit hidden vector, w_{t-i} and b_{t-i} are learnable parameters, t denotes the time, and ReLU(w_{t-i}·h_{t-i} + b_{t-i}) denotes the activation value when the LSTM unit hidden vector is h_{t-i} and the learnable parameters are w_{t-i} and b_{t-i};
step 662, calculating the context vector U_T by using formula (6):
U_T = Σ_{i=0}^{T-1} W_{t-i}·h_{t-i}   (6)
step 663, concatenating the context vector U_T with the unmanned aerial vehicle state features as the input of the next layer.
In a third aspect, an unmanned aerial vehicle autonomous navigation system based on deep reinforcement learning adopts the autonomous navigation method of the second aspect, and the system includes
A deep neural network model unit;
an acquisition unit;
a training unit;
the acquisition unit is used for acquiring a T-frame depth map and unmanned aerial vehicle state information acquired by an unmanned aerial vehicle camera;
the training unit is used for averaging K state values learned by the value network and updating the network parameters of the deep neural network model by using the state average values;
the depth neural network model unit is used for predicting the flight action of the unmanned aerial vehicle according to the acquired M-frame depth map acquired by the unmanned aerial vehicle camera and the unmanned aerial vehicle state information;
the deep neural network model comprises: an Actor network and a Critic network;
the Actor network is sequentially configured as follows: the system comprises a convolutional layer, a global average pooling layer, an LSTM unit layer, a time attention module, a first linear full-link layer, a first ReLU function layer, a second linear full-link layer, a second ReLU function layer and a third linear full-link layer;
the temporal attention module is used to calculate output weights for a given frame LSTM unit layer to calculate motion output.
The implementation of the deep neural network model training method, the autonomous navigation method and the system has the following technical effects:
(1) Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms.
(2) By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a deep neural network model training method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep neural network model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a third embodiment of a deep neural network model training method in the present invention;
FIG. 4 is a schematic diagram of a fourth embodiment of the deep neural network model training method in the present invention;
FIG. 5 is a diagram illustrating a fifth embodiment of a deep neural network model training method according to the present invention;
FIG. 6 is a schematic diagram of a first embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 7 is a schematic diagram of a second embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 8 is a schematic diagram of a third embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 9 is a schematic diagram of a fourth embodiment of the autonomous navigation method in accordance with the present invention;
FIG. 10 is a schematic diagram of an Actor network structure of the deep neural network model according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a Critic network structure of the deep neural network model according to an embodiment of the present invention;
FIG. 12 is a schematic view of a first embodiment of the autonomous navigation system in accordance with the present invention;
the part names indicated by the numbers in the drawings are as follows: 10-deep neural network model unit, 20-acquisition unit, 30-training unit.
Detailed Description
The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The existing deep learning algorithm for autonomous navigation of the unmanned aerial vehicle has the problems of unstable training and weaker time sequence data processing capability.
Aiming at the problems, a deep neural network model training method, an autonomous navigation method and an autonomous navigation system are provided.
In a first aspect, as shown in fig. 1, fig. 1 is a schematic diagram of a deep neural network model training method according to a first embodiment of the present invention, and a deep neural network model training method is used for obtaining a deep neural network model through deep reinforcement learning to improve algorithm stability, and includes: step 100, initializing a deep neural network model; step 200, updating an array of the playback experience pool D; step 300, averaging K state values learned by a value network to obtain a state average value; step 400, updating network parameters of the deep neural network model by using the state average value; the deep neural network model comprises an Actor network and a Critic network.
Preferably, as shown in fig. 10, fig. 10 is a schematic diagram of an Actor network structure of the deep neural network model in the present invention, and the Actor network may be implemented to include:
1) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 2) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 3) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 4) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 5) a global average pooling layer; 6) an LSTM unit layer with a hidden size of 8; 7) a temporal attention module; 8) a linear fully-connected layer of scale 64; 9) a ReLU function layer; 10) a linear fully-connected layer of scale 32; 11) a ReLU function layer; 12) a linear fully-connected layer of scale 2.
Referring to fig. 11, fig. 11 is a schematic diagram of an embodiment of a Critic network structure of the deep neural network model in the present invention; the Critic network may be implemented to include: 1) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 2) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 3) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 4) a convolutional layer consisting of 8 filters of 2×2 with a step size of 2; 5) a global average pooling layer; 6) an LSTM unit layer with a hidden size of 8; 7) a linear fully-connected layer of scale 64; 8) a ReLU function layer; 9) a linear fully-connected layer of scale 32; 10) a ReLU function layer; 11) a linear fully-connected layer of scale 1.
The LSTM unit layer is a long short-term memory network layer, and the ReLU function layer is an activation function layer.
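For illustration, the Actor network described above may be sketched in PyTorch as follows. This is a minimal sketch assuming single-channel depth-map frames as input; the temporal attention module (sketched later in this description) is replaced here by the last LSTM hidden state, and the concatenation with the unmanned aerial vehicle state features is omitted for brevity.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, in_channels=1, lstm_hidden=8, action_dim=2):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):                                  # four conv layers: 8 filters of 2x2, stride 2
            layers.append(nn.Conv2d(c, 8, kernel_size=2, stride=2))
            c = 8
        self.conv = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool2d(1)                  # global average pooling layer
        self.lstm = nn.LSTM(8, lstm_hidden, batch_first=True)
        self.head = nn.Sequential(                          # FC 64 -> ReLU -> FC 32 -> ReLU -> FC 2
            nn.Linear(lstm_hidden, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, action_dim),
        )

    def forward(self, frames):                              # frames: (batch, T, C, H, W)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1))                 # per-frame convolutional features
        x = self.gap(x).flatten(1).view(b, t, -1)           # (batch, T, 8)
        h, _ = self.lstm(x)                                 # (batch, T, lstm_hidden)
        return self.head(h[:, -1])                          # stand-in for attention + state features
```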
Firstly, the deep neural network model is initialized: the Actor network parameter φ, the Critic network parameter θ, the target Critic network parameter θ̄, the playback experience pool D, and the target Critic network parameter update interval N.
Preferably, as shown in fig. 2, fig. 2 is a schematic diagram of a deep neural network model training method according to a second embodiment of the present invention, and step 200 includes: step 210, obtaining the current state information s_t; step 220, inputting the state information s_t into the Actor network to obtain the predicted action a_t; step 230, executing the action a_t, receiving the reward r(s_t, a_t) and entering the next state s_{t+1}; step 240, storing the array (s_t, a_t, r(s_t, a_t), s_{t+1}) into the playback experience pool D, wherein s_t is the state at time t, a_t is the action at time t, and r(s_t, a_t) is the reward when the state is s_t and the action is a_t.
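For illustration, the interaction loop of steps 210 to 240 may be sketched in Python as follows, assuming a Gym-style environment interface (reset/step) and placeholder routines actor.predict and update_networks that stand in for the concrete procedures of steps 300 and 400; none of these names are part of the present disclosure.

```python
import random
from collections import deque

def run_training(env, actor, update_networks, episodes=1000, batch_size=64):
    replay_pool = deque(maxlen=100_000)                  # playback experience pool D
    for _ in range(episodes):
        s_t = env.reset()                                # step 210: current state information
        done = False
        while not done:
            a_t = actor.predict(s_t)                     # step 220: Actor predicts the action
            s_next, r_t, done, _ = env.step(a_t)         # step 230: execute, receive reward, next state
            replay_pool.append((s_t, a_t, r_t, s_next))  # step 240: store the array in D
            s_t = s_next
            if len(replay_pool) >= batch_size:           # steps 300-400: sample and update
                batch = random.sample(replay_pool, batch_size)
                update_networks(batch)
```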
Preferably, as shown in fig. 3, fig. 3 is a schematic diagram of a third embodiment of the deep neural network model training method in the present invention, and step 300 includes: step 310, randomly sampling N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) from the playback experience pool D; step 320, obtaining the average V̄(s_{t+1}) of the K learned state values of s_{t+1} by using the value network formulas (1) and (2):
V(s_{t+1}) = E_{a_{t+1}~π_φ}[ Q_θ̄(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1} | s_{t+1}) ]   (1)
V̄(s_{t+1}) = (1/K)·Σ_{k=1}^{K} V_k(s_{t+1})   (2)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, π is the policy, π_φ represents the policy whose Actor network parameter is φ, α is the temperature parameter, V(s_{t+1}) is the state value at time t+1, V̄(s_{t+1}) is the average of the state values at time t+1, Q_θ̄(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the target Critic network parameter is θ̄, and π_φ(a_t | s_t) is the probability of selecting action a_t in state s_t when the Actor network parameter is φ.
Preferably, as shown in fig. 4, fig. 4 is a schematic diagram of a fourth embodiment of the deep neural network model training method in the present invention, and step 400 includes: step 410, updating the Critic network parameter θ by using the state average value calculation formula (2) and formula (3):
J_Q(θ) = E_{(s_t, a_t)~D}[ ½·( Q_θ(s_t, a_t) - ( r(s_t, a_t) + γ·V̄(s_{t+1}) ) )² ]   (3)
step 420, updating the Actor network parameter φ by using formula (4):
J_π(φ) = E_{s_t~D, ε_t~N}[ α·log π_φ( f_φ(ε_t; s_t) | s_t ) - Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (4)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, a_t is the predicted action at time t, s_{t+1} is the state at time t+1, J_Q(θ) is the Critic network loss when the network parameter is θ, D is the playback experience pool, r(s_t, a_t) is the reward when the state is s_t and the action is a_t, γ is the discount factor, Q_θ(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the Critic network parameter is θ, J_π(φ) is the Actor network loss when the network parameter is φ, ε_t is the noise sampled from the Gaussian distribution, f_φ(ε_t; s_t) is the action obtained by Gaussian sampling when the Actor network parameter is φ, the noise is ε_t and the state is s_t, π_φ(f_φ(ε_t; s_t) | s_t) is the probability of selecting action f_φ(ε_t; s_t) in state s_t when the Actor network parameter is φ, and Q_θ(s_t, f_φ(ε_t; s_t)) is the Q value when the Actor network parameter is φ, the Critic network parameter is θ, the state is s_t and the action is f_φ(ε_t; s_t).
N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) are randomly sampled from the playback experience pool D; the loss of the Critic network is calculated through formula (3) and the Critic network parameter θ is updated, and the loss of the Actor network is calculated through formula (4) and the Actor network parameter φ is updated. Formula (2) uses the average of the K state values V(s_{t+1}) learned by the value network to reduce the overestimation of the Q value, thereby reducing the variance in the parameter transfer process and improving the training stability.
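Under the same assumptions, the updates of formulas (3) and (4) may be sketched as follows; the optimizers, the reparameterised actor.rsample interface and the averaged_state_value helper above are illustrative only.

```python
import torch
import torch.nn.functional as F

def update_networks(actor, critic, target_critics, batch, gamma, alpha,
                    actor_opt, critic_opt):
    s, a, r, s_next = batch                                     # N sampled transitions as tensors

    # Step 410: Critic loss J_Q(theta), formula (3)
    v_bar = averaged_state_value(actor, target_critics, s_next, alpha)
    target = r + gamma * v_bar
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 420: Actor loss J_pi(phi), formula (4), with the reparameterised action f_phi(eps; s)
    a_new, log_pi = actor.rsample(s)
    actor_loss = (alpha * log_pi - critic(s, a_new)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```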
Preferably, as shown in fig. 5, fig. 5 is a schematic diagram of a fifth embodiment of the deep neural network model training method in the present invention, and step 400 further includes: step 430, setting a Critic network parameter update interval N; and step 440, if the time t is divisible by the Critic network parameter update interval N, updating the Critic network parameter θ into the target Critic network.
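Steps 430 and 440 may then be sketched as a periodic hard copy of the Critic into the pool of target Critic snapshots; the hard-copy variant and the bookkeeping of the last K snapshots are assumptions consistent with the averaging sketched above.

```python
import copy

def maybe_update_targets(t, N, critic, target_critics, K):
    if t % N == 0:                        # the time t is divisible by the update interval N
        target_critics.append(copy.deepcopy(critic))
        if len(target_critics) > K:       # keep only the last K learned snapshots
            target_critics.pop(0)
```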
Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms.
In a second aspect, as shown in fig. 6, fig. 6 is a schematic view of a first embodiment of the autonomous navigation method in the present invention; an unmanned aerial vehicle autonomous navigation method based on deep reinforcement learning includes: step 500, setting the target location of the unmanned aerial vehicle and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by the unmanned aerial vehicle camera; step 600, training the neural network model by using the training method to obtain the deep neural network model; step 700, inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model and predicting the flight action; step 800, if the unmanned aerial vehicle has neither collided nor reached the target location, returning to step 500 to continue execution, otherwise the unmanned aerial vehicle stops flying; wherein step 600 comprises step 610: setting a temporal attention module.
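For illustration, the navigation loop of steps 500 to 800 may be sketched as follows, assuming a trained model and a flight interface exposing get_depth_frames, get_state, execute, collided and reached; these method names are placeholders and not an interface defined by the present disclosure.

```python
def navigate(drone, model, goal, M=4):
    drone.set_goal(goal)                               # step 500: set the target location
    while True:
        frames = drone.get_depth_frames(M)             # last M depth-map frames from the camera
        state = drone.get_state()                      # unmanned aerial vehicle state information
        action = model.predict(frames, state)          # step 700: predict the flight action
        drone.execute(action)
        if drone.collided() or drone.reached(goal):    # step 800: stop on collision or arrival
            drone.stop()
            break
```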
Preferably, as shown in fig. 7, fig. 7 is a schematic view of a second embodiment of the autonomous navigation method in the present invention, and the step 600 further includes:
step 620, respectively configuring the Actor network and the Critic network to at least comprise a convolutional layer, a global average pooling layer, an LSTM unit layer, a linear fully-connected layer and a ReLU function layer; step 630, configuring the Actor network and the Critic network in sequence as: a convolution layer, a global average pooling layer, an LSTM unit layer, a first linear fully-connected layer, a first ReLU function layer, a second linear fully-connected layer, a second ReLU function layer and a third linear fully-connected layer; the linear fully-connected layer comprises the first linear fully-connected layer, the second linear fully-connected layer and the third linear fully-connected layer; the ReLU function layer comprises the first ReLU function layer and the second ReLU function layer.
Preferably, as shown in fig. 8, fig. 8 is a schematic view of a third embodiment of the autonomous navigation method in the present invention, and the step 600 further includes: step 640, setting a time attention module in the Actor network; step 650, arranging a time attention module between the LSTM unit layer and the first linear full connection layer; step 660, calculate the output weight for a given frame LSTM unit layer using the temporal attention module to determine the output action.
Preferably, as shown in fig. 9, fig. 9 is a schematic view of a fourth embodiment of the autonomous navigation method in the present invention, and step 660 includes: step 661, learning the scalar weight of the LSTM unit output at each time step by using formula (5):
W_{t-i} = Softmax( ReLU( w_{t-i}·h_{t-i} + b_{t-i} ) )   (5)
where i = 0, …, T-1, Softmax() is the normalization function, h_{t-i} is the LSTM unit hidden vector, w_{t-i} and b_{t-i} are learnable parameters, t denotes the time, and ReLU(w_{t-i}·h_{t-i} + b_{t-i}) denotes the activation value when the LSTM unit hidden vector is h_{t-i} and the learnable parameters are w_{t-i} and b_{t-i};
step 662, calculating the context vector U_T by using formula (6):
U_T = Σ_{i=0}^{T-1} W_{t-i}·h_{t-i}   (6)
step 663, concatenating the context vector U_T with the unmanned aerial vehicle state features as the input of the next layer.
The weight W_i of each LSTM unit output is defined by formula (5), where h_i is the LSTM unit hidden vector and w_i and b_i are learnable parameters; the activation function is ReLU, and the ReLU layer is followed by a Softmax function that normalizes the weights so that they sum to 1. According to this definition, each learned weight depends on the information of the previous time steps and the current state information.
The combined context vector U_T is then computed as shown in formula (6); the context vector U_T is the weighted sum of the LSTM unit layer outputs over the T time steps.
The derived context vector U_T is concatenated with the unmanned aerial vehicle state features as the input of the next layer. The learned weight W_{t-i} can be understood as the importance of the LSTM unit layer output at a given frame. Thus, the optimization process can be viewed as learning which observations are relatively more important for learning the correct action.
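A minimal PyTorch sketch of this temporal attention module is given below; it shares a single scoring layer across time steps, which is one possible way to realise the learnable parameters w and b of formula (5), and it concatenates the context vector with the unmanned aerial vehicle state features. Both choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)                   # learnable w and b of formula (5)

    def forward(self, h, drone_state):
        # h: (batch, T, hidden_size) LSTM outputs; drone_state: (batch, state_dim)
        logits = torch.relu(self.score(h)).squeeze(-1)           # ReLU(w·h + b) for each time step
        weights = torch.softmax(logits, dim=1)                   # formula (5): weights sum to 1 over T
        context = (weights.unsqueeze(-1) * h).sum(dim=1)         # formula (6): context vector U_T
        return torch.cat([context, drone_state], dim=-1)         # input of the next layer
```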
Working principle of temporal attention:
action output is calculated explicitly taking into account LSTM unit-level output characteristics of past M frames, and this information is only implicitly conveyed by normal LSTM units. By increasing the value of the T time step, the deep neural network model can take into account a longer historical frame sequence, and thus can make better action choices. By introducing a temporal attention module, the neural network module can handle the ability of longer input sequences, exploration of the temporal dependence of the output actions, and better performance with a partially observable experience.
By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
In a third aspect, as shown in fig. 12, fig. 12 is a schematic view of a first embodiment of the autonomous navigation system in the present invention; an unmanned aerial vehicle autonomous navigation system based on deep reinforcement learning adopts the autonomous navigation method of the second aspect and includes a deep neural network model unit 10, an obtaining unit 20 and a training unit 30; the obtaining unit 20 is configured to acquire a T-frame depth map and unmanned aerial vehicle state information collected by the unmanned aerial vehicle camera; the training unit 30 is configured to average the K state values learned by the value network and to update the network parameters of the deep neural network model with the state average value; the deep neural network model unit 10 is configured to predict the flight action of the unmanned aerial vehicle according to the acquired T-frame depth map collected by the unmanned aerial vehicle camera and the unmanned aerial vehicle state information; the deep neural network model comprises an Actor network and a Critic network; the Actor network is sequentially configured as: a convolutional layer, a global average pooling layer, an LSTM unit layer, a temporal attention module, a first linear fully-connected layer, a first ReLU function layer, a second linear fully-connected layer, a second ReLU function layer and a third linear fully-connected layer; the temporal attention module is used to calculate the output weights of the LSTM unit layer for a given frame in order to calculate the action output.
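As a structural illustration of the system of fig. 12, the three units may be sketched as plain Python classes; the method names below are illustrative placeholders rather than interfaces fixed by the present disclosure.

```python
class AcquisitionUnit:                      # obtaining unit 20
    def __init__(self, camera):
        self.camera = camera

    def acquire(self, num_frames):
        # depth-map frames from the drone camera plus the drone state information
        return self.camera.depth_frames(num_frames), self.camera.drone_state()


class TrainingUnit:                         # training unit 30
    def train_step(self, model, batch, K):
        # average the K learned state values, then update the model's network parameters
        raise NotImplementedError


class DeepModelUnit:                        # deep neural network model unit 10
    def __init__(self, actor):
        self.actor = actor

    def predict_action(self, depth_frames, drone_state):
        return self.actor(depth_frames, drone_state)
```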
The implementation of the deep neural network model training method, the autonomous navigation method and the system has the following technical effects:
(1) Overestimation of the Q value is reduced by averaging the first K state values learned by the value network, which reduces the variance in the transfer of the Critic network parameters and improves both the stability of the training process and the algorithm performance, thereby solving the problem of unstable training caused by Q-value overestimation in existing deep neural network model algorithms.
(2) By adding a temporal attention module to the deep neural network model algorithm, the attention mechanism considers the inputs of the past M frames and determines the importance of each frame, so that the deep neural network model can predict the unmanned aerial vehicle's actions more accurately, further improving the model's performance and solving the problem that existing deep neural network model algorithms handle time-series data poorly.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A deep neural network model training method is used for obtaining a deep neural network model through deep reinforcement learning and improving the stability of an algorithm of the deep neural network model, and is characterized by comprising the following steps:
step 100, initializing the deep neural network model;
step 200, updating an array of the playback experience pool D;
step 300, averaging K state values learned by a value network to obtain a state average value;
step 400, updating network parameters of the deep neural network model by using the state average value;
the deep neural network model comprises an Actor network and a Critic network.
2. The deep neural network model training method of claim 1, wherein the step 200 comprises:
step 210, obtaining current state information st
Step 220, converting the state information stInputting the predicted action into the Actor network to obtain the predicted action at
Step 230, executing the action atReceive a reward r(s)t,at) And enter the next state st+1
Step 240, apply the array(s)t,at,r(st,at),st+1) Storing the data into a playback experience pool D;
wherein s istIs the state at time t, atR(s) is the movement at time tt,at) Is in a state of stThe action is atThe prize of the time.
3. The deep neural network model training method of claim 2, wherein the step 300 comprises:
step 310, randomly sampling N sets of data (s_t, a_t, r(s_t, a_t), s_{t+1}) from the playback experience pool D;
step 320, obtaining the average V̄(s_{t+1}) of the K learned state values of s_{t+1} by using the value network formulas (1) and (2):
V(s_{t+1}) = E_{a_{t+1}~π_φ}[ Q_θ̄(s_{t+1}, a_{t+1}) - α·log π_φ(a_{t+1} | s_{t+1}) ]   (1)
V̄(s_{t+1}) = (1/K)·Σ_{k=1}^{K} V_k(s_{t+1})   (2)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, π is the policy, π_φ represents the policy whose Actor network parameter is φ, α is the temperature parameter, V(s_{t+1}) is the state value at time t+1, V̄(s_{t+1}) is the average of the state values at time t+1, Q_θ̄(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the target Critic network parameter is θ̄, and π_φ(a_t | s_t) is the probability of selecting action a_t in state s_t when the Actor network parameter is φ.
4. The method of claim 3, wherein the step 400 comprises:
step 410, updating the Critic network parameter θ by using the state average value calculation formula (2) and the Q value loss calculation formula (3):
J_Q(θ) = E_{(s_t, a_t)~D}[ ½·( Q_θ(s_t, a_t) - ( r(s_t, a_t) + γ·V̄(s_{t+1}) ) )² ]   (3)
step 420, updating the Actor network parameter φ by using the policy loss calculation formula (4):
J_π(φ) = E_{s_t~D, ε_t~N}[ α·log π_φ( f_φ(ε_t; s_t) | s_t ) - Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (4)
wherein φ is the Actor network parameter, θ is the Critic network parameter, θ̄ is the target network parameter, a_t is the predicted action at time t, s_{t+1} is the state at time t+1, J_Q(θ) is the Critic network loss when the network parameter is θ, D is the playback experience pool, r(s_t, a_t) is the reward when the state is s_t and the action is a_t, γ is the discount factor, Q_θ(s_t, a_t) is the Q value when the state is s_t, the action is a_t and the Critic network parameter is θ, J_π(φ) is the Actor network loss when the network parameter is φ, ε_t is the noise sampled from the Gaussian distribution, f_φ(ε_t; s_t) is the action obtained by Gaussian sampling when the Actor network parameter is φ, the noise is ε_t and the state is s_t, π_φ(f_φ(ε_t; s_t) | s_t) is the probability of selecting action f_φ(ε_t; s_t) in state s_t when the Actor network parameter is φ, and Q_θ(s_t, f_φ(ε_t; s_t)) is the Q value when the Actor network parameter is φ, the Critic network parameter is θ, the state is s_t and the action is f_φ(ε_t; s_t).
5. The deep neural network model training method of claim 4, wherein the step 400 further comprises:
step 430, setting a Critic network parameter update interval N;
and step 440, if the time t is divisible by the Critic network parameter update interval N, updating the Critic network parameter θ into the target Critic network.
6. An autonomous navigation method for training a deep neural network model by using the training method according to any one of claims 1 to 5, comprising:
step 500, setting the target location of the unmanned aerial vehicle and acquiring an M-frame depth map and unmanned aerial vehicle state information collected by the unmanned aerial vehicle;
step 600, training the neural network model by using the training method to obtain the deep neural network model;
step 700, inputting the depth image information and the unmanned aerial vehicle state information into the deep neural network model, and predicting the flight action;
step 800, if the unmanned aerial vehicle has neither collided nor reached the target location, returning to step 500 to continue execution, otherwise the unmanned aerial vehicle stops flying;
wherein the step 600 comprises:
step 610: setting a temporal attention module to the deep neural network model.
7. The autonomous navigation method according to claim 6, characterized in that said step 600 further comprises:
step 620, respectively configuring the Actor network and the Critic network to at least comprise a convolutional layer, a global average pooling layer, an LSTM unit layer, a linear full-link layer and a ReLU function layer;
step 630, configuring the Actor network and the Critic network in sequence as: a convolution layer, a global average pooling layer, an LSTM unit layer, a first linear full-connection layer, a first ReLU function layer, a second linear full-connection layer, a second ReLU function layer and a third linear full-connection layer;
wherein the linear fully-connected layer comprises:
a first linear fully-connected layer, a second linear fully-connected layer, a third linear fully-connected layer;
the ReLU function layer includes:
a first ReLU function layer and a second ReLU function layer.
8. The autonomous navigation method according to claim 7, characterized in that said step 600 further comprises:
step 640, setting a time attention module in the Actor network;
step 650, setting the time attention module between the LSTM unit layer and the first linear full connection layer;
step 660, calculating the output weight of the given frame LSTM unit layer by using the time attention module to determine the output action.
9. The autonomous navigation method of claim 8, wherein the step 660 comprises:
step 661, learning the scalar weights output by the LSTM unit at different time steps by using formula (5):
W_{t-i} = Softmax( ReLU( w_{t-i}·h_{t-i} + b_{t-i} ) )   (5)
where i = 0, …, T-1, Softmax() is the normalization function, h_{t-i} is the LSTM unit hidden vector, w_{t-i} and b_{t-i} are learnable parameters, t denotes the time, and ReLU(w_{t-i}·h_{t-i} + b_{t-i}) denotes the activation value when the LSTM unit hidden vector is h_{t-i} and the learnable parameters are w_{t-i} and b_{t-i};
step 662, calculating the context vector U_T by using formula (6):
U_T = Σ_{i=0}^{T-1} W_{t-i}·h_{t-i}   (6)
step 663, concatenating the context vector U_T with the unmanned aerial vehicle state features as the input of the next layer.
10. An autonomous navigation system for navigating a drone using the autonomous navigation method of any one of claims 6 to 9, comprising
A deep neural network model unit;
an acquisition unit;
a training unit;
the acquisition unit is used for acquiring a T-frame depth map and unmanned aerial vehicle state information acquired by an unmanned aerial vehicle camera;
the training unit is used for averaging K state values learned by the value network and updating network parameters of the deep neural network model by using the state average values;
the depth neural network model unit is used for predicting the flight action of the unmanned aerial vehicle according to the acquired M-frame depth map acquired by the unmanned aerial vehicle camera and the unmanned aerial vehicle state information;
the deep neural network model includes: actor network and Critic network;
the Actor network configuration sequentially comprises: a convolutional layer, a global average pooling layer, an LSTM unit layer, a time attention module, a first linear full-link layer, a first ReLU function layer, a second linear full-link layer, a second ReLU function layer and a third linear full-link layer;
the temporal attention module is used to compute output weights for a given frame LSTM unit layer to compute action output.
CN202210210763.3A 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system Pending CN114662656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210210763.3A CN114662656A (en) 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210210763.3A CN114662656A (en) 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system

Publications (1)

Publication Number Publication Date
CN114662656A true CN114662656A (en) 2022-06-24

Family

ID=82027579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210210763.3A Pending CN114662656A (en) 2022-03-04 2022-03-04 Deep neural network model training method, autonomous navigation method and system

Country Status (1)

Country Link
CN (1) CN114662656A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114964268A (en) * 2022-07-29 2022-08-30 白杨时代(北京)科技有限公司 Unmanned aerial vehicle navigation method and device


Similar Documents

Publication Publication Date Title
CN112256056B (en) Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN107833183B (en) Method for simultaneously super-resolving and coloring satellite image based on multitask deep neural network
CN112119409A (en) Neural network with relational memory
CN110874578A (en) Unmanned aerial vehicle visual angle vehicle identification and tracking method based on reinforcement learning
CN110447041B (en) Noise neural network layer
EP3568810A1 (en) Action selection for reinforcement learning using neural networks
CN112313672A (en) Stacked convolutional long-short term memory for model-free reinforcement learning
CN112819253A (en) Unmanned aerial vehicle obstacle avoidance and path planning device and method
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
US11650551B2 (en) System and method for policy optimization using quasi-Newton trust region method
CN115018017B (en) Multi-agent credit allocation method, system and equipment based on ensemble learning
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN113537365B (en) Information entropy dynamic weighting-based multi-task learning self-adaptive balancing method
US20230074148A1 (en) Controller for Optimizing Motion Trajectory to Control Motion of One or More Devices
CN114355915B (en) AGV path planning based on deep reinforcement learning
CN114662656A (en) Deep neural network model training method, autonomous navigation method and system
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Ou et al. Hybrid path planning based on adaptive visibility graph initialization and edge computing for mobile robots
CN116679710A (en) Robot obstacle avoidance strategy training and deployment method based on multitask learning
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN116091776A (en) Semantic segmentation method based on field increment learning
CN115936058A (en) Multi-agent migration reinforcement learning method based on graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination