CN116203979A - Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on deep deterministic policy gradient - Google Patents

Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on deep deterministic policy gradient

Info

Publication number
CN116203979A
Authority
CN
China
Prior art keywords
network
aerial vehicle
unmanned aerial
actor
critic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211612609.5A
Other languages
Chinese (zh)
Inventor
张凯
邵艳明
魏瑶
钮赛赛
杨尧
李妍琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Shanghai Aerospace Control Technology Institute
Original Assignee
Northwestern Polytechnical University
Shanghai Aerospace Control Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University and Shanghai Aerospace Control Technology Institute
Priority to CN202211612609.5A
Publication of CN116203979A
Legal status: Pending

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on a deep deterministic policy gradient. The method comprises the following steps: converting an original RGB image acquired by a monocular camera carried by the unmanned aerial vehicle into a depth image by a conditional generative adversarial network (CGAN); constructing an actor (Actor) network and a critic (Critic) network in a deep deterministic policy gradient (DDPG) network, with the depth image and the current speed information of the unmanned aerial vehicle serving as state information; constructing a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle; training the Actor network and the Critic network in the DDPG network according to the reward function, thereby obtaining parameters of the Actor network and the Critic network; and performing obstacle avoidance control on the monocular unmanned aerial vehicle with the trained DDPG network.

Description

Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on deep deterministic policy gradient
Technical Field
The embodiment of the invention relates to the technical field of unmanned aerial vehicle flight control, and in particular to a monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on a deep deterministic policy gradient.
Background
Small unmanned aerial vehicles are compact and highly agile, which enables them to complete a variety of tasks such as search and rescue, environment mapping and exploration of unknown areas, and they play an increasingly important role in many industries. Because obstacle avoidance planning is required in complex indoor environments, an effective obstacle avoidance algorithm is particularly important to ensure that the unmanned aerial vehicle can perform these tasks safely.
Currently, sensors commonly used in unmanned aerial vehicle obstacle avoidance tasks include ultrasonic sensors, radar sensors, depth cameras and monocular cameras, each of which has its own advantages and disadvantages. Compared with a lidar or a depth camera, a monocular camera provides less information, but it is small, lightweight, low in energy consumption and has a wide field of view, making it better suited to obstacle avoidance perception for a small unmanned aerial vehicle.
An unmanned aerial vehicle carrying only a monocular camera (hereinafter simply referred to as a monocular unmanned aerial vehicle) cannot measure depth directly, so depth information cannot be extracted from the images captured by the monocular camera for obstacle avoidance; in addition, existing unmanned aerial vehicle obstacle avoidance schemes suffer from discontinuous obstacle avoidance velocities.
Disclosure of Invention
In view of the above, the embodiments of the invention provide a monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on a deep deterministic policy gradient. A depth image can be predicted from the image captured by the monocular camera and used for action decisions, which improves the stability and effectiveness of unmanned aerial vehicle flight control, so that the unmanned aerial vehicle can avoid obstacles in a continuous action space and the obstacle avoidance trajectory is smoother and more continuous.
The technical scheme of the embodiment of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a monocular unmanned aerial vehicle obstacle avoidance method based on a deep deterministic policy gradient, where the method includes:
converting, by a conditional generative adversarial network (CGAN, Conditional Generative Adversarial Network), an original RGB image acquired by a monocular camera onboard the unmanned aerial vehicle into a depth image;
constructing an actor (Actor) network and a critic (Critic) network in a deep deterministic policy gradient (DDPG, Deep Deterministic Policy Gradient) network, with the depth image and the current speed information of the unmanned aerial vehicle as state information;
constructing a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle;
training the Actor network and the Critic network in the DDPG network according to the reward function, thereby obtaining parameters of the Actor network and the Critic network;
and performing obstacle avoidance control on the monocular unmanned aerial vehicle with the trained DDPG network.
In a second aspect, an embodiment of the present invention provides a monocular unmanned aerial vehicle obstacle avoidance device based on a deep deterministic policy gradient, the device comprising an acquisition part, a conversion part, a first construction part, a second construction part, a training part and an obstacle avoidance control part; wherein,
the acquisition part is configured to acquire an original RGB image through a monocular camera carried by the unmanned aerial vehicle;
the conversion part is configured to convert the original RGB image into a depth image by a conditional generative adversarial network CGAN;
the first construction part is configured to construct an Actor network and a Critic network in a deep deterministic policy gradient DDPG network, with the depth image and the current speed information of the unmanned aerial vehicle as state information;
the second construction part is configured to construct a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle;
the training part is configured to train the Actor network and the Critic network in the DDPG network according to the reward function, thereby obtaining parameters of the Actor network and the Critic network;
the obstacle avoidance control part is configured to perform obstacle avoidance control on the monocular unmanned aerial vehicle with the trained DDPG network.
In a third aspect, an embodiment of the present invention provides a computing device comprising a communication interface, a memory and a processor, the components being coupled together by a bus system; wherein,
the communication interface is used for receiving and sending signals in the process of exchanging information with other external network elements;
the memory is used for storing a computer program capable of running on the processor;
the processor is configured to execute, when running the computer program, the steps of the monocular unmanned aerial vehicle obstacle avoidance method based on the deep deterministic policy gradient according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium storing a monocular unmanned aerial vehicle obstacle avoidance program based on a deep deterministic policy gradient, where the program, when executed by at least one processor, implements the steps of the monocular unmanned aerial vehicle obstacle avoidance method based on the deep deterministic policy gradient according to the first aspect.
The embodiments of the invention provide a monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on a deep deterministic policy gradient: a depth image is predicted from the image captured by the monocular camera and used for action decisions; the depth information and the state information are fed together into a DDPG network for processing, which improves the stability and effectiveness of unmanned aerial vehicle flight control; and the reward function is established based on the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle, so that the unmanned aerial vehicle can avoid obstacles in a continuous action space and the obstacle avoidance trajectory is smoother and more continuous.
Drawings
Fig. 1 is a schematic flow chart of a monocular unmanned aerial vehicle obstacle avoidance method based on a deep deterministic policy gradient according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a network architecture of a CGAN according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a deep reinforcement learning architecture according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating generation of status information according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an Actor network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Critic network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an overall process flow architecture according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a simulation environment provided by an embodiment of the present invention;
fig. 9 is a schematic diagram of an obstacle avoidance training process of the unmanned aerial vehicle according to an embodiment of the present invention;
fig. 10 is a schematic diagram of total score in the unmanned aerial vehicle training process according to the embodiment of the present invention;
fig. 11 is a schematic diagram of an average prize change curve in the unmanned aerial vehicle training process according to an embodiment of the present invention;
fig. 12 is a schematic diagram of a Q-value change curve of network estimation in the unmanned aerial vehicle training process according to the embodiment of the present invention;
fig. 13 is a schematic diagram of an Actor network loss value in the unmanned aerial vehicle training process according to the embodiment of the present invention;
fig. 14 is a schematic diagram of Critic network loss values in the unmanned aerial vehicle training process according to the embodiment of the present invention;
fig. 15 is a schematic diagram of a speed change curve in the unmanned aerial vehicle training process according to the embodiment of the present invention;
fig. 16 is a comparison diagram of obstacle avoidance routes of the unmanned aerial vehicle in a top view provided by the embodiment of the invention;
fig. 17 is a schematic diagram of a monocular unmanned aerial vehicle obstacle avoidance device based on a deep deterministic policy gradient according to an embodiment of the present invention;
fig. 18 is a schematic diagram of a specific hardware structure of a computing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 1, a monocular unmanned aerial vehicle obstacle avoidance method based on a deep deterministic policy gradient according to an embodiment of the present invention may include:
S101: converting, by a conditional generative adversarial network (CGAN, Conditional Generative Adversarial Network), an original RGB image acquired by a monocular camera onboard the unmanned aerial vehicle into a depth image;
S102: constructing an actor (Actor) network and a critic (Critic) network in a deep deterministic policy gradient (DDPG, Deep Deterministic Policy Gradient) network, with the depth image and the current speed information of the unmanned aerial vehicle as state information;
S103: constructing a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle;
S104: training the Actor network and the Critic network in the DDPG network according to the reward function, thereby obtaining parameters of the Actor network and the Critic network;
S105: performing obstacle avoidance control on the monocular unmanned aerial vehicle with the trained DDPG network.
With the above technical solution, a depth picture is predicted from the image captured by the monocular camera for action decisions; the depth information and the state information are fed together into a DDPG network for processing, which improves the stability and effectiveness of unmanned aerial vehicle flight control; and the reward function is established based on the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle, so that the unmanned aerial vehicle can avoid obstacles in a continuous action space and the obstacle avoidance trajectory is smoother and more continuous.
For the technical solution shown in fig. 1, in some implementations, converting the original RGB image acquired by the monocular camera carried by the unmanned aerial vehicle into a depth image by the conditional generative adversarial network CGAN includes:
constructing two independent convolutional networks and setting them as a generator and a discriminator respectively, wherein the generator is used for generating a pseudo depth image from the RGB image and inputting the pseudo depth image to the discriminator, and the discriminator is used for judging whether an input image is a true depth image;
training the network parameters of the generator and the discriminator on a training sample set so as to minimize the loss function of the CGAN expressed as:

L_{CGAN}(\theta_D, \theta_G) = E_{x, y \sim p_{data}}[\log D(x, y)] + E_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))]

where θ_D denotes the network parameters of the discriminator and θ_G denotes the network parameters of the generator; the variable x denotes the RGB image; z is a noise image; the variable y is the corresponding real depth image; G(x, z) is the pseudo depth image generated by the generator; D(x, y) is the probability that the discriminator judges the input image to be a true depth image; p_data denotes the data sample distribution, p_data(x) denotes the real data sample distribution, and p_z(z) denotes the noise distribution; the symbol "∼" indicates that a variable obeys the given probability distribution;
and inputting the original RGB image into the trained CGAN to obtain the depth image corresponding to the original RGB image.
For the above implementation, it should be noted that the generator and the discriminator are two separate convolutional networks: the generator learns to generate realistic samples for each label in the training dataset, and the discriminator learns to distinguish true sample-label pairs from false ones. The objectives of the generator and the discriminator are, in a sense, opposed to each other, which is reflected in the loss function shown above.
For the above implementation, in a specific implementation process, the embodiment of the present invention preferably uses a U-net as the generator of the CGAN model. The U-net can be regarded as an encoder-decoder model that connects the corresponding feature maps of different layers between the encoder and the decoder. The discriminator is composed of an encoder and is trained to distinguish real pictures from fake ones. In detail, as shown in fig. 2, the network architecture of the CGAN according to the embodiment of the present invention takes a random noise image z and an RGB image x as inputs and generates a "pseudo" depth image G(x, z) that follows the real image data distribution; the discriminator takes the RGB image x, the real depth image y and G(x, z) as inputs to judge whether the image produced by the generator is real. In the embodiment of the present invention, the specific network architectures of the generator and the discriminator of the CGAN network follow this network structure design.
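To make the adversarial objective above concrete, the following is a minimal sketch of one CGAN training step for RGB-to-depth translation in PyTorch. The generator G and discriminator D objects, the channel layout (the noise image is concatenated to the RGB input) and the optimizers are assumptions made for the example rather than the patent's exact architecture, and the non-saturating generator loss is a common practical substitute for directly minimizing log(1 - D(x, G(x, z))).

```python
# Minimal sketch of one CGAN training step (RGB -> depth), assuming a
# pix2pix-style generator G and a discriminator D that ends with a sigmoid.
# The channel layout and optimizer objects are illustrative assumptions.
import torch
import torch.nn as nn

def cgan_train_step(G, D, opt_G, opt_D, rgb, depth_real, bce=nn.BCELoss()):
    """rgb: (N,3,H,W) input images; depth_real: (N,1,H,W) real depth images."""
    noise = torch.randn_like(depth_real)              # noise image z
    depth_fake = G(torch.cat([rgb, noise], dim=1))    # G(x, z): pseudo depth image

    # Discriminator update: maximize log D(x, y) + log(1 - D(x, G(x, z)))
    opt_D.zero_grad()
    d_real = D(torch.cat([rgb, depth_real], dim=1))
    d_fake = D(torch.cat([rgb, depth_fake.detach()], dim=1))
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # Generator update: fool the discriminator (non-saturating form)
    opt_G.zero_grad()
    d_fake = D(torch.cat([rgb, depth_fake], dim=1))
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```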
For the technical solution shown in fig. 1, in some implementations, constructing the Actor network and the evaluator Critic network in the deep deterministic policy gradient DDPG network with the depth image and the current speed information of the unmanned aerial vehicle as state information includes:
defining the state information at the current moment as s_t = [d_t, v_t], where d_t denotes the depth image obtained by converting, through the CGAN, the RGB image acquired by the monocular unmanned aerial vehicle at the current moment, and the speed at the current moment is v_t = [v_xt, v_yt, v_zt];
constructing the Actor network μ(s|θ^μ) to deterministically output an action policy based on the state information at the current moment, where μ denotes the Actor network, θ^μ denotes the Actor network parameters, and s denotes the input state information;
constructing the Critic network Q(s, a|θ^Q) to give an evaluation of the action policy output by the Actor network, where θ^Q denotes the Critic network parameters, s denotes the state information input to the Actor network, and a denotes the action policy output by the Actor network.
For the above implementation, it should be noted that deep reinforcement learning combines the "perception" capability of deep learning with the sequential decision-making capability of reinforcement learning, thereby solving the decision problem of an agent in a complex state space. As shown in fig. 3, a deep reinforcement learning architecture is generally composed of the following parts: an agent, an interaction environment E, a state transition equation P (representing the probability of the agent transitioning from one state to the next) and a reward function R. Briefly, the agent perceives the state of the environment, the state information is fed into a deep neural network, the network gives an action policy according to the state information, the agent executes the corresponding action to interact with the environment, and the environment returns a reward r and the next state S' to the agent according to the reward function and the state transition rule. In the obstacle avoidance process of the unmanned aerial vehicle, the state at the next moment depends only on the current scene and the state transition equation after the unmanned aerial vehicle takes an action, and is independent of the unmanned aerial vehicle's earlier history of states. Therefore, the obstacle avoidance problem of the unmanned aerial vehicle can be regarded as a Markov decision process and can thus be solved within a reinforcement learning framework. Specifically, the current state s_t, the current action a_t, the reward value r and the next state s_{t+1} are collected as a tuple (s_t, a_t, r, s_{t+1}), forming a set (S, A, R, S'). The reinforcement learning algorithm optimizes the target policy π: S → A so that the cumulative reward obtained from moment t onward is maximized, where the cumulative reward R is expressed as:

R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)

where γ ∈ (0, 1) is a decay factor representing the influence of the reward at the current moment on future moments. The Q-value function is defined as Q^π = E[R_t | s_t, a_t]. The training goal of reinforcement learning is to obtain the optimal policy π* that maximizes the expectation of R, i.e., to find the optimal policy π* such that the following holds:

\pi^{*} = \arg\max_{\pi} E\left[ R_t \right]

Based on the above description of the reinforcement learning objective, the embodiment of the present invention prefers DDPG as the algorithm model for deep reinforcement learning, which uses an Actor-Critic network structure. In connection with what is set forth in the foregoing implementations, the Actor network (denoted μ(s|θ^μ)) deterministically maps the state s_t to an action, i.e., gives the unmanned aerial vehicle a deterministic action based on the current environment state, and the Critic network (denoted Q(s, a|θ^Q)) approximates the action-value function from the current state and action, where θ^μ and θ^Q are the network parameters of the Actor network and the Critic network, respectively. Furthermore, in some examples, to reduce overestimation of the value function, an Actor target network μ'(s|θ^{μ'}) and a Critic target network Q'(s, a|θ^{Q'}) are also constructed.
For the above implementation, in some examples, the training the Actor network and the Critic network in the DDPG network according to the reward function to obtain parameters of the Actor network and the Critic network includes:
constructing an Actor target network and a Critic target network corresponding to the Actor network and the Critic network respectively, where the Actor target network parameters are θ^{μ'} and the Critic target network parameters are θ^{Q'};
executing an iterative training process based on the set iteration conditions; the training process of each iteration is as follows:
obtaining, through the Actor network, the action policy at the current moment a_t = μ(s_t|θ^μ) + N(t) based on the state information s_t at the current moment, where the random noise N(t) decays linearly to 0 with the number of training iterations;
executing the action policy through the unmanned aerial vehicle and interacting with the current environment to obtain the state information s_{t+1} at the next moment and the reward function value r_t = R(s_{t+1});
storing the data tuple (s_t, a_t, r_t, s_{t+1}) into an experience replay pool D as a sample for offline training of the DDPG network;
randomly sampling N samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D;
updating the network parameters θ^μ of the Actor network according to the N samples in the direction of the sampled policy gradient shown in Equation 1:

\nabla_{\theta^{\mu}} J \approx E_{s_t \sim \rho^{\beta}} \left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_t,\, a = \mu(s_t)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_t} \right]   (Equation 1)

where E denotes the expectation, ρ^β denotes the state distribution of the behaviour policy, and Q(s, a|θ^Q)|_{s=s_t, a=μ(s_t)} denotes the evaluation value given by the Critic network for the state at the current moment under the action policy output by the Actor network;
the Actor target network obtains the corresponding action policy a_{t+1} = μ'(s_{t+1}|θ^{μ'}) from the sampled next state s_{t+1}, and s_{t+1} and a_{t+1} are fed to the Critic target network to compute the next-state evaluation value Q'(s_{t+1}, a_{t+1}|θ^{Q'});
based on the evaluation value Q(s_t, a_t|θ^Q) of the Critic network and the next-state evaluation value given by the Critic target network, updating the network parameters θ^Q of the Critic network by minimizing the loss function L(θ^Q) shown in Equation 2:

L(\theta^{Q}) = E_{s_t \sim \rho^{\beta},\, a_t \sim \beta,\, r_t \sim E} \left[ \left( Q(s_t, a_t \mid \theta^{Q}) - \hat{Q} \right)^{2} \right]   (Equation 2)

where \hat{Q} = r_t + \gamma Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'}) = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}); s_t ∼ ρ^β, a_t ∼ β and r_t ∼ E indicate that the states, actions and rewards are drawn from the behaviour-policy state distribution, the behaviour policy and the environment respectively; γ ∈ (0, 1) is the decay factor representing the influence of the reward at the current moment on future moments; μ'(s_{t+1}|θ^{μ'}) denotes the action policy correspondingly obtained by the Actor target network from the sampled next state s_{t+1};
updating the network parameters of the Actor target network and the Critic target network according to Equation 3:

\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}   (Equation 3)

where τ is a moving-average (soft update) coefficient whose value is smaller than 1.
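The sketch below shows how Equations 1 to 3 can be realized as one DDPG update step. The PyTorch network objects, optimizers and the batch layout are assumptions for illustration (here a generic state tensor s stands in for the bundled depth image and speed vector), and the actor gradient is implemented in the usual way by minimizing -Q(s, μ(s)).

```python
# One DDPG update step following Equations 1-3: critic regression to the TD
# target, actor ascent along the sampled policy gradient, soft target update.
# Network/optimizer objects and batch tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from the experience replay pool

    # Critic update: minimize L(theta_Q) = E[(Q(s,a) - Q_hat)^2]   (Equation 2)
    with torch.no_grad():
        a_next = actor_target(s_next)                      # mu'(s_{t+1} | theta_mu')
        q_hat = r + gamma * critic_target(s_next, a_next)  # r_t + gamma * Q'(...)
    critic_loss = F.mse_loss(critic(s, a), q_hat)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: ascend the sampled policy gradient of Equation 1,
    # implemented by minimizing -Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks                    (Equation 3)
    for net, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
    return critic_loss.item(), actor_loss.item()
```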
In combination with the above implementation and its examples, specifically, regarding the random noise N(t): since the Actor network gives a deterministic action according to the current state information of the agent, it always outputs the same action for the same state, which reduces the number of explored samples. To let the Actor network explore more fully, random noise N(t) is therefore added to the policy μ so that the output action policy becomes stochastic and more exploration can be performed.
Specifically, for the state information space and the action policy space, take as an example a monocular camera used as the sensor to acquire an RGB image with a resolution of 640 x 480; the CGAN described in step S101 can produce a depth image with a resolution of 84 x 84, which is preprocessed so that effective information is extracted from the complex information, reducing complexity and improving algorithm efficiency. As shown in fig. 4, the 84 x 84 depth image is converted into an 84 x 84 gray image d_t, and the speed information v_t of the unmanned aerial vehicle (comprising the speed in the x, y and z directions) is superimposed on it as the state information at the current moment s_t = [d_t, v_t], where v_t = [v_x, v_y, v_z] (a sketch of this preprocessing is given after Table 1). For the action policy space, the speed v (m/s) is taken as the main action, which avoids the need for global positioning based on position information and reduces the computational load and algorithm complexity. The unmanned aerial vehicle is regarded as a particle with three degrees of freedom, and the action policy given by the Actor network is the speed in three different directions. The action space of the unmanned aerial vehicle is shown in Table 1 below.
TABLE 1
Symbol | Action description                                                       | Value range
v_x    | Speed of the unmanned aerial vehicle along the x-axis (forward/backward) | (-2, 2)
v_y    | Speed of the unmanned aerial vehicle along the y-axis (left/right)       | (-2, 2)
v_z    | Speed of the unmanned aerial vehicle along the z-axis (up/down)          | (-2, 2)
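As referenced above, the sketch below shows one way the state s_t = [d_t, v_t] could be assembled from the CGAN depth output and the measured speed; the resizing and normalization choices are illustrative assumptions.

```python
# Minimal sketch of state preprocessing: 84x84 gray depth image + 3-D speed.
# The normalization to [0, 1] and the dict layout are illustrative assumptions.
import numpy as np
import cv2

def build_state(depth_img, velocity_xyz):
    """depth_img: HxW array from the CGAN; velocity_xyz: (v_x, v_y, v_z) in m/s."""
    d = cv2.resize(depth_img, (84, 84)).astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-6)   # scale gray image to [0, 1]
    v = np.asarray(velocity_xyz, dtype=np.float32)   # current speed v_t
    return {"depth": d[None, ...], "velocity": v}    # s_t = [d_t, v_t]
```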
For the Actor-Critic structure of the DDPG model, in combination with the above exemplary depth pictures, the specific structures of the Actor network and the Critic network are shown in fig. 5 and fig. 6 respectively, and their specific parameter settings are shown in Table 2 below.
TABLE 2
Parameter                       | Type/value
Actor Optimizer                 | Adam, α = 0.0001
Critic Optimizer                | Adam, α = 0.0005
γ                               | 0.99
τ                               | 0.005
D (experience replay pool size) | 100000
epsilon                         | 0.98
gamma                           | 0.99
batch size                      | 64
Specifically, the Actor network takes the current state information of the unmanned aerial vehicle as input and gives an action based on that state. Compared with the conventional scheme in which the Actor network makes decisions from depth information alone, the embodiment of the invention uses a multi-modal network for decision information processing, which improves the network's ability to control the unmanned aerial vehicle. Following a collaborative representation approach, the speed information and the 84 x 84 image information are mapped into different feature subspaces for processing: nine convolution layers are used to extract image feature information, and a fully connected layer is used to process the speed information. The image features output by the convolution layers and the fully connected speed features are then merged and passed through three fully connected layers to output the action (a sketch is given below).
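As referenced above, the following is a sketch of a multi-modal Actor network of the kind described: nine convolution layers for the 84 x 84 image, one fully connected layer for the speed, a merge, and three fully connected layers producing the three-axis speed command. The kernel sizes, channel counts and the tanh scaling to the (-2, 2) m/s range of Table 1 are illustrative assumptions, not the patent's exact layer parameters.

```python
# Sketch of the multi-modal Actor network (image branch + speed branch + head).
# Layer sizes are illustrative assumptions consistent with the description above.
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [1, 16, 16, 32, 32, 64, 64, 64, 64, 64]
        convs = []
        for i in range(9):                                # 9 convolution layers
            stride = 2 if i % 3 == 0 else 1
            convs += [nn.Conv2d(chans[i], chans[i + 1], 3, stride, 1), nn.ReLU()]
        self.encoder = nn.Sequential(*convs, nn.Flatten())
        self.speed_fc = nn.Sequential(nn.Linear(3, 128), nn.ReLU())  # speed branch
        feat_dim = self._feat_dim()
        self.head = nn.Sequential(                        # 3 fully connected layers
            nn.Linear(feat_dim + 128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Tanh())                 # action (v_x, v_y, v_z)

    def _feat_dim(self):
        with torch.no_grad():
            return self.encoder(torch.zeros(1, 1, 84, 84)).shape[1]

    def forward(self, depth, velocity):
        x = torch.cat([self.encoder(depth), self.speed_fc(velocity)], dim=1)
        return 2.0 * self.head(x)                         # scale to (-2, 2) m/s
```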
Next, in the Critic network, the state information is processed in the same way as in the Actor network, with the same corresponding network structure; this procedure is called "observation preprocessing". Since the action is input as one-dimensional data, it is passed through a fully connected layer with 128 units so that it has the same shape as the processed state information, which is called "action preprocessing". Then, the "observation preprocessing" result and the "action preprocessing" result are combined by a merging layer, and the state-based action evaluation value is output through three fully connected layers (a sketch is given below).
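As referenced above, the Critic can reuse the same kind of state encoding ("observation preprocessing"), add a 128-unit fully connected layer for the action ("action preprocessing"), merge the two, and output the scalar Q-value through three fully connected layers. The sketch below is an illustrative assumption of that layout; in DDPG the Critic would own its own copy of the convolutional encoder rather than sharing parameters with the Actor.

```python
# Sketch of the Critic network: observation preprocessing + action preprocessing,
# merge, then 3 fully connected layers giving Q(s, a). Layer sizes are assumptions.
import torch
import torch.nn as nn

class CriticNet(nn.Module):
    def __init__(self, state_encoder, speed_fc, feat_dim):
        super().__init__()
        self.encoder = state_encoder        # conv encoder for the 84x84 depth image
        self.speed_fc = speed_fc            # FC branch for the speed observation
        self.action_fc = nn.Sequential(nn.Linear(3, 128), nn.ReLU())  # action preprocessing
        self.head = nn.Sequential(          # merge + 3 fully connected layers
            nn.Linear(feat_dim + 128 + 128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1))              # state-based action evaluation Q(s, a)

    def forward(self, depth, velocity, action):
        obs = torch.cat([self.encoder(depth), self.speed_fc(velocity)], dim=1)
        act = self.action_fc(action)
        return self.head(torch.cat([obs, act], dim=1))
```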
For the reward function, in some possible implementations, constructing the reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle includes:
generating the state information s_{t+1} at the next moment based on the action policy output by the Actor network according to the state information at the current moment;
when the state information at the next moment indicates that the unmanned aerial vehicle has collided, the reward function value r = R(s_{t+1}) is the maximum penalty value and the current training round is ended;
when the state information at the next moment indicates that the speed of the unmanned aerial vehicle is lower than a set speed threshold, the reward function value is a minimum penalty value;
when the state information at the next moment indicates that the distance between the unmanned aerial vehicle and the target is smaller than a set distance threshold, the reward function value is the maximum reward value;
otherwise, the difference between the distance from the unmanned aerial vehicle to the target at the current moment and that at the previous moment is taken as the reward function value.
For the above implementation, specifically, the input of the reward function is {p, speed, collision}, where p is the position information used to calculate the straight-line distance dist between the unmanned aerial vehicle and the target point at this moment, speed is the norm of the unmanned aerial vehicle's current speed in the three directions, and collision is the collision flag bit, i.e., the collision information given by the simulation environment. For example, when collision is 1, the unmanned aerial vehicle has currently collided; when it is 0, the unmanned aerial vehicle is in a safe flight state. The algorithm interaction process is shown in Table 3 below: when the collision flag bit of the unmanned aerial vehicle is 1, the unmanned aerial vehicle has collided, a penalty of -20 is given, and the training round is ended; when the unmanned aerial vehicle has not collided, it is checked whether the speed is smaller than speed_limit (in the embodiment of the invention, speed_limit is set to 0.1 to prevent the DDPG network from hovering in place to avoid the collision penalty); when dist < 3, the unmanned aerial vehicle has reached the vicinity of the target point and a reward of 50 is given; otherwise, the reward value is the difference between the distance from the unmanned aerial vehicle to the target point at the current moment and that at the previous moment, so as to encourage the unmanned aerial vehicle to reach the designated target position while avoiding collisions (a sketch of this logic is given after Table 3).
TABLE 3
(Table 3 is reproduced in the original as an image; it summarizes the interaction and reward logic described in the preceding paragraph.)
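As referenced above, the following is a minimal sketch of the reward logic just described. The -20 collision penalty, speed_limit = 0.1, the dist < 3 success radius and the +50 success reward follow the text; the concrete low-speed penalty value and the sign convention of the progress term (previous distance minus current distance, so that approaching the target is rewarded) are assumptions made for illustration.

```python
# Minimal sketch of the reward function R(s_{t+1}) described above.
# low_speed_penalty and the progress-term sign convention are assumptions.
import numpy as np

def compute_reward(position, target, speed, collision, prev_dist,
                   speed_limit=0.1, low_speed_penalty=-0.5):
    dist = float(np.linalg.norm(np.asarray(target) - np.asarray(position)))
    if collision == 1:
        return -20.0, dist, True           # maximum penalty, end the training round
    if speed < speed_limit:
        return low_speed_penalty, dist, False  # discourage hovering in place
    if dist < 3.0:
        return 50.0, dist, True            # reached the vicinity of the target point
    return prev_dist - dist, dist, False   # reward progress toward the target
```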
With reference to fig. 7, the overall processing flow of the technical solution provided by the embodiment of the present invention can be divided into three parts: the CGAN network, the interaction environment and the DDPG network. The interaction environment contains an unmanned aerial vehicle carrying a monocular camera. Taking the current moment as moment t, the unmanned aerial vehicle collects an RGB picture of the current environment state and inputs it into the trained CGAN network, which generates and outputs a depth picture; the generated picture is processed to obtain the state information s_t at moment t. The Actor network gives the action policy of the unmanned aerial vehicle a_t = μ(s_t|θ^μ) according to the state s_t, noise is added to the action to obtain a_t = μ(s_t|θ^μ) + N, and the action a_t is sent to the unmanned aerial vehicle via data communication. The unmanned aerial vehicle interacts with the environment by executing the action, producing the next state value s_{t+1} and the action reward value r_t = R(s_{t+1}). At this point the unmanned aerial vehicle has completed one environment exploration step, and the resulting data (s_t, a_t, r_t, s_{t+1}) are collected and stored into the experience replay pool for subsequent network training.
The DDPG network is trained offline. First, data (s_i, a_i, r_i, s_{i+1}) are sampled from the experience pool; then the network parameters of the Actor network are updated in the direction of increasing Q value according to Equation 1 in the above implementation. The Target-Actor network calculates the action value a_{t+1} from the sampled data s_{t+1} and passes s_{t+1} and a_{t+1} to the Target-Critic network, which calculates the Q value of the next state Q'(s_{t+1}, a_{t+1}|θ^{Q'}); at the same time, the Critic network calculates the current-state Q value Q = Q(s_t, a_t|θ^Q). The Critic network is updated by minimizing the loss function of Equation 2, and the target networks are updated according to Equation 3 (a combined sketch of this loop is given below).
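The sketch below strings the fig. 7 pieces together into a single interaction-and-training loop, reusing the build_state and ddpg_update sketches above. The environment interface (env.reset(), env.step()), the cgan.predict() call and the replay buffer object are illustrative placeholders rather than APIs defined by the patent, and the state is passed as the (depth, velocity) pair assumed in the earlier sketches.

```python
# Sketch of the fig. 7 loop: CGAN depth prediction, Actor action with exploration
# noise, environment interaction, replay storage, offline DDPG updates.
# env, cgan, replay_buffer and ddpg_update_fn are illustrative placeholders.
import numpy as np
import torch

def run_episode(env, cgan, actor, ddpg_update_fn, replay_buffer,
                noise_scale, batch_size=64):
    rgb, velocity = env.reset()
    state = build_state(cgan.predict(rgb), velocity)         # s_t = [d_t, v_t]
    done, total_reward = False, 0.0
    while not done:
        with torch.no_grad():
            a = actor(torch.as_tensor(state["depth"])[None],
                      torch.as_tensor(state["velocity"])[None])[0].numpy()
        a = np.clip(a + noise_scale * np.random.randn(3), -2.0, 2.0)  # mu(s_t) + N
        (rgb, velocity), reward, done = env.step(a)           # interact with the env
        next_state = build_state(cgan.predict(rgb), velocity)
        replay_buffer.add(state, a, reward, next_state)       # store (s_t, a_t, r_t, s_{t+1})
        state, total_reward = next_state, total_reward + reward
        if len(replay_buffer) >= batch_size:                  # offline-style update
            ddpg_update_fn(replay_buffer.sample(batch_size))
    return total_reward
```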
For the foregoing technical solutions and their implementations and examples, the embodiments of the present invention illustrate the technical effect through simulation experiments. The simulation conditions are as follows: the unmanned aerial vehicle uses a monocular camera to collect RGB pictures of the environment within a forward 180-degree range, and an inertial measurement unit (IMU, Inertial Measurement Unit) measures its own linear velocity as a state observation. Fig. 8 is a schematic diagram of the simulation environment used, and fig. 9 is a schematic diagram of the obstacle avoidance training process of the unmanned aerial vehicle: the initial reward value is 0 when training starts, a penalty of -20 is received after the unmanned aerial vehicle collides, rewards are obtained when the unmanned aerial vehicle flies along the correct route toward the target position, and a final reward of +50 is obtained when the unmanned aerial vehicle safely reaches the target point. The total score of a training round is the sum of the rewards of each step; therefore, after training, a total score greater than 50 indicates that, in the current training round, the unmanned aerial vehicle successfully avoided all obstacles and reached the designated target point, and a higher total score indicates a better chosen obstacle avoidance route. To verify the effectiveness of the algorithm, the DQN algorithm, DDPG, and DDPG with prioritized experience replay (DDPG-PER, Prioritized Experience Replay) were each trained and tested in the simulation environment. DDPG-PER introduces a prioritized experience replay mechanism that aims to improve data utilization and speed up training.
Fig. 10 to fig. 15 respectively show how the total score, the average reward, the network-estimated Q value, the Actor network loss value, the Critic network loss value and the average speed change with training time during unmanned aerial vehicle training; training was carried out for 6000 episodes.
As can be seen from fig. 10, the DDPG algorithm provided by the embodiment of the present invention reaches a cumulative score greater than 0 after about 500 training rounds, indicating that the unmanned aerial vehicle has by then acquired the ability to avoid obstacles, and reaches a cumulative score greater than 50 after about 1000 training rounds, indicating that the unmanned aerial vehicle can by then successfully avoid obstacles and reach the vicinity of the designated target location. Compared with DDPG, the total score of the DQN algorithm stays around 0 throughout training, indicating that the unmanned aerial vehicle only learns a simple obstacle avoidance behaviour and can merely wander in place or "roam" in the simulation environment without reaching the designated position.
FIG. 11 is the average reward curve, calculated as the total score divided by the number of steps in the episode, which is used to gauge whether the action decision given at each step of the training algorithm is useful. By comparison, the average reward value of the DDPG algorithm is clearly larger than that of the DQN algorithm, and the convergence and stability of the algorithm improve after the prioritized experience replay mechanism is introduced.
Fig. 12 is a schematic diagram of the network-estimated Q-value curve, from which it can be seen that the DQN algorithm suffers from Q-value overestimation, caused by its repeated use of the maximization operation, which affects the correctness of the decision actions it gives, whereas the DDPG algorithm clearly reduces overestimation of the value function because the action decision is decoupled from the value estimation. In addition, since the network structure of the DQN algorithm differs from that of DDPG, their loss values are not comparable; therefore only the loss values of DDPG and DDPG-PER are compared in fig. 13 and fig. 14, and it can be seen that both drop quickly and converge.
Fig. 15 shows the speed curve of the unmanned aerial vehicle, where the speed is the norm of the unmanned aerial vehicle's speed in the x, y and z directions; it can be seen from the figure that the speed finally converges to about 2.75 m/s.
Based on the above simulation experiments, the average reward value of the DDPG algorithm is significantly larger than that of the DQN algorithm, the convergence and stability of the algorithm improve after the prioritized experience replay mechanism is introduced, and the DDPG algorithm significantly reduces overestimation of the value function because the action decision is decoupled from the value estimation.
Finally, the trained DQN model and the DDPG algorithm model proposed in the embodiment of the present invention were each tested 100 times in the test environment, and the results are shown in Table 4 below.
TABLE 4
Algorithm type | Average cumulative score | Average reward value
DQN            | -35.46                   | -2.77
DDPG           | 66.75                    | 2.73
As can be seen from Table 4, the average cumulative score and the average reward value of DQN are much lower than those of the DDPG model proposed by the embodiments of the present invention. Specifically, a DQN cumulative score of less than 0 indicates that, although the DQN algorithm can avoid obstacles during training, it is not robust during testing, and a small disturbance is likely to cause the unmanned aerial vehicle to fail to avoid an obstacle. By contrast, an average cumulative score of the DDPG algorithm model greater than 50 over 100 tests indicates that the unmanned aerial vehicle can avoid obstacles and reach the target position in almost every test. Fig. 16 is a comparison diagram of the unmanned aerial vehicle obstacle avoidance routes in a top view; because the DDPG algorithm adopts continuous speed control, its obstacle avoidance trajectory is smoother than those of the other algorithms and its obstacle avoidance effect is better.
Based on the same inventive concept as the foregoing technical solution, referring to fig. 17, a monocular unmanned aerial vehicle obstacle avoidance device 170 provided by an embodiment of the present invention is shown, where the device 170 includes: an acquisition portion 1701, a conversion portion 1702, a first construction portion 1703, a second construction portion 1704, a training portion 1705 and an obstacle avoidance control portion 1706; wherein,
the acquisition portion 1701 is configured to acquire an original RGB image through a monocular camera carried by the unmanned aerial vehicle;
the conversion portion 1702 is configured to convert the original RGB image into a depth image by a conditional generative adversarial network CGAN;
the first construction portion 1703 is configured to construct an Actor network and an evaluator Critic network in a deep deterministic policy gradient DDPG network, with the depth image and the current speed information of the unmanned aerial vehicle as state information;
the second construction portion 1704 is configured to construct a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle;
the training portion 1705 is configured to train the Actor network and the Critic network in the DDPG network according to the reward function, thereby obtaining parameters of the Actor network and the Critic network;
the obstacle avoidance control portion 1706 is configured to perform obstacle avoidance control on the monocular unmanned aerial vehicle with the trained DDPG network.
In some examples, the conversion portion 1702 is configured to:
constructing two independent convolutional networks and setting them as a generator and a discriminator respectively, wherein the generator is used for generating a pseudo depth image from the RGB image and inputting the pseudo depth image to the discriminator, and the discriminator is used for judging whether an input image is a true depth image;
training the network parameters of the generator and the discriminator on a training sample set so as to minimize the loss function of the CGAN expressed as:

L_{CGAN}(\theta_D, \theta_G) = E_{x, y \sim p_{data}}[\log D(x, y)] + E_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))]

where θ_D denotes the network parameters of the discriminator and θ_G denotes the network parameters of the generator; the variable x denotes the RGB image; z is a noise image; the variable y is the corresponding real depth image; G(x, z) is the pseudo depth image generated by the generator; D(x, y) is the probability that the discriminator judges the input image to be a true depth image; p_data denotes the data sample distribution, p_data(x) denotes the real data sample distribution, and p_z(z) denotes the noise distribution; the symbol "∼" indicates that a variable obeys the given probability distribution;
and inputting the original RGB image into the trained CGAN to obtain the depth image corresponding to the original RGB image.
In some examples, the first construction portion 1703 is configured to:
defining the state information at the current moment as s_t = [d_t, v_t], where d_t denotes the depth image obtained by converting, through the CGAN, the RGB image acquired by the monocular unmanned aerial vehicle at the current moment, and the speed at the current moment is v_t = [v_xt, v_yt, v_zt];
constructing the Actor network μ(s|θ^μ) to deterministically output an action policy based on the state information at the current moment, where μ denotes the Actor network, θ^μ denotes the Actor network parameters, and s denotes the input state information;
constructing the Critic network Q(s, a|θ^Q) to give an evaluation of the action policy output by the Actor network, where θ^Q denotes the Critic network parameters, s denotes the state information input to the Actor network, and a denotes the action policy output by the Actor network.
In some examples, the training portion 1705 is configured to:
constructing an Actor target network and a Critic target network corresponding to the Actor network and the Critic network respectively, where the Actor target network parameters are θ^{μ'} and the Critic target network parameters are θ^{Q'};
executing an iterative training process based on the set iteration conditions; the training process of each iteration is as follows:
obtaining, through the Actor network, the action policy at the current moment a_t = μ(s_t|θ^μ) + N(t) based on the state information s_t at the current moment, where the random noise N(t) decays linearly to 0 with the number of training iterations;
executing the action policy through the unmanned aerial vehicle and interacting with the current environment to obtain the state information s_{t+1} at the next moment and the reward function value r_t = R(s_{t+1});
storing the data tuple (s_t, a_t, r_t, s_{t+1}) into the experience replay pool D as a sample for offline training of the DDPG network;
randomly sampling N samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D;
updating the network parameters θ^μ of the Actor network according to the N samples in the direction of the sampled policy gradient shown in the following formula (Equation 1):

\nabla_{\theta^{\mu}} J \approx E_{s_t \sim \rho^{\beta}} \left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_t,\, a = \mu(s_t)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_t} \right]

where E denotes the expectation, ρ^β denotes the state distribution of the behaviour policy, and Q(s, a|θ^Q)|_{s=s_t, a=μ(s_t)} denotes the evaluation value given by the Critic network for the state at the current moment under the action policy output by the Actor network;
the Actor target network obtains the corresponding action policy a_{t+1} = μ'(s_{t+1}|θ^{μ'}) from the sampled next state s_{t+1}, and s_{t+1} and a_{t+1} are fed to the Critic target network to compute the next-state evaluation value Q'(s_{t+1}, a_{t+1}|θ^{Q'});
based on the evaluation value Q(s_t, a_t|θ^Q) of the Critic network and the next-state evaluation value given by the Critic target network, updating the network parameters θ^Q of the Critic network by minimizing the loss function L(θ^Q) shown in the following formula (Equation 2):

L(\theta^{Q}) = E_{s_t \sim \rho^{\beta},\, a_t \sim \beta,\, r_t \sim E} \left[ \left( Q(s_t, a_t \mid \theta^{Q}) - \hat{Q} \right)^{2} \right]

where \hat{Q} = r_t + \gamma Q'(s_{t+1}, a_{t+1} \mid \theta^{Q'}) = r_t + \gamma Q'(s_{t+1}, \mu'(s_{t+1} \mid \theta^{\mu'}) \mid \theta^{Q'}); s_t ∼ ρ^β, a_t ∼ β and r_t ∼ E indicate that the states, actions and rewards are drawn from the behaviour-policy state distribution, the behaviour policy and the environment respectively; γ ∈ (0, 1) is the decay factor representing the influence of the reward at the current moment on future moments; μ'(s_{t+1}|θ^{μ'}) denotes the action policy correspondingly obtained by the Actor target network from the sampled next state s_{t+1};
updating the network parameters of the Actor target network and the Critic target network according to the following formula (Equation 3):

\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}

where τ is a moving-average (soft update) coefficient whose value is smaller than 1.
In some examples, the second construction portion 1704 is configured to:
generating the state information s_{t+1} at the next moment based on the action policy output by the Actor network according to the state information at the current moment;
when the state information at the next moment indicates that the unmanned aerial vehicle has collided, the reward function value r = R(s_{t+1}) is the maximum penalty value and the current training round is ended;
when the state information at the next moment indicates that the speed of the unmanned aerial vehicle is lower than a set speed threshold, the reward function value is a minimum penalty value;
when the state information at the next moment indicates that the distance between the unmanned aerial vehicle and the target is smaller than a set distance threshold, the reward function value is the maximum reward value;
otherwise, the difference between the distance from the unmanned aerial vehicle to the target at the current moment and that at the previous moment is taken as the reward function value.
It will be appreciated that in this embodiment, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may of course also be a unit, and it may be modular or non-modular.
In addition, each component in the present embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional modules.
If the integrated units are implemented in the form of software functional modules and are not sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of this embodiment may essentially, or in the part contributing to the prior art, be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method described in this embodiment. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Therefore, this embodiment provides a computer storage medium storing a monocular unmanned aerial vehicle obstacle avoidance program based on a deep deterministic policy gradient; when the program is executed by at least one processor, the monocular unmanned aerial vehicle obstacle avoidance method based on the deep deterministic policy gradient of the above technical solution is implemented.
In accordance with the above monocular unmanned aerial vehicle obstacle avoidance device 170 based on the deep deterministic policy gradient and the computer storage medium, referring to fig. 18, a specific hardware structure of a computing device 180 capable of implementing the above device 170 is shown. The computing device 180 provided by the embodiment of the present invention may be a wireless device, a mobile or cellular phone (including a so-called smart phone), a personal digital assistant (PDA), a video game console (including a video display, a mobile video game device or a mobile video conferencing unit), a laptop computer, a desktop computer, a television set-top box, a tablet computing device, an e-book reader, a fixed or mobile media player, or the like. The computing device 180 includes: a communication interface 1801, a memory 1802 and a processor 1803; the components are coupled together by a bus system 1804. It will be appreciated that the bus system 1804 is used to enable connected communication between these components. In addition to the data bus, the bus system 1804 includes a power bus, a control bus and a status signal bus; for clarity of illustration, however, the various buses are labeled as the bus system 1804 in fig. 18. Wherein,
The communication interface 1801 is configured to receive and send signals during the process of receiving and sending information with other external network elements;
the memory 1802 for storing a computer program capable of running on the processor 1803;
the processor 1803 is configured to execute, when running the computer program, the steps of the monocular unmanned aerial vehicle obstacle avoidance method based on the deep deterministic policy gradient in the foregoing technical solution, which are not repeated here.
It is to be appreciated that the memory 1802 in embodiments of the invention may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct Rambus RAM (DRRAM). The memory 1802 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The processor 1803 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 1803 or by instructions in the form of software. The processor 1803 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1802, and the processor 1803 reads the information in the memory 1802 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
It will be appreciated that the exemplary technical solutions of the monocular unmanned aerial vehicle obstacle avoidance device 170 and the computing device 180 based on the depth deterministic strategy gradient described above belong to the same concept as the technical solution of the monocular unmanned aerial vehicle obstacle avoidance method based on the depth deterministic strategy gradient described above. Therefore, for details of the technical solutions of the device 170 and the computing device 180 that are not described here, reference may be made to the description of the technical solution of the method above; they are not repeated in the embodiments of the present invention.
It should be noted that: the technical solutions described in the embodiments of the present invention may be combined arbitrarily, provided that no conflict arises.
The foregoing is merely a specific embodiment of the present invention, and the protection scope of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A monocular unmanned aerial vehicle obstacle avoidance method based on depth deterministic strategy gradients, the method comprising:
converting an original RGB image acquired by a monocular camera onboard the unmanned aerial vehicle into a depth image based on a conditional generative adversarial network CGAN;
taking the depth image and the current speed information of the unmanned aerial vehicle as state information, and constructing an Actor network and an evaluator Critic network in a depth deterministic strategy gradient DDPG network;
constructing a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle;
training the Actor network and the Critic network in the DDPG network according to the reward function, so as to obtain parameters of the Actor network and the Critic network;
And performing obstacle avoidance control on the monocular unmanned aerial vehicle according to the DDPG network after training.
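
As a non-authoritative illustration of the control loop recited in claim 1, the following Python sketch chains a trained CGAN and a trained Actor network at inference time. The objects cgan, actor and drone_env and all of their methods are hypothetical placeholders; the patent does not prescribe any API.

```python
# Minimal inference-time sketch of the claim-1 pipeline (all object names hypothetical).
import numpy as np

def obstacle_avoidance_step(cgan, actor, drone_env):
    rgb = drone_env.get_rgb_image()          # original RGB image from the monocular camera
    depth = cgan.to_depth(rgb)               # CGAN converts the RGB image into a depth image
    velocity = np.asarray(drone_env.get_velocity(), dtype=np.float32)  # [vx, vy, vz]
    state = (depth, velocity)                # state information: depth image + current speed
    action = actor.act(state)                # deterministic action from the trained Actor network
    drone_env.apply_action(action)           # obstacle-avoidance control command to the UAV
    return state, action
```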
2. The method of claim 1, wherein the converting the original RGB image acquired by the monocular camera onboard the unmanned aerial vehicle into a depth image based on the conditional generative adversarial network CGAN comprises:
constructing two independent convolution networks and setting the two independent convolution networks as a generator and a discriminator respectively; the generator is used for generating a pseudo-depth image from the RGB image and inputting the pseudo-depth image to the discriminator; the discriminator is used for judging whether the input image is a true depth image or not;
training the network parameters of the generator and the discriminator through a training sample set to minimize the loss function of the CGAN represented by the following formula:
$$\min_{\theta_G}\max_{\theta_D} L_{CGAN}(\theta_G,\theta_D)=\mathbb{E}_{x\sim p_{data}(x),\,y\sim p_{data}(y)}\big[\log D(x,y)\big]+\mathbb{E}_{x\sim p_{data}(x),\,z\sim p_z(z)}\big[\log\big(1-D(x,G(x,z))\big)\big]$$
wherein θ_D represents the network parameters of the discriminator and θ_G represents the network parameters of the generator; the variable x represents the RGB image; z is the noise image; the variable y is the corresponding real depth image; G(x, z) is the pseudo-depth image generated by the generator; D(x, y) is the probability that the discriminator judges the pseudo-depth image generated by the generator to be a real depth image; p_data denotes the data sample distribution, p_data(x) denotes the real data sample distribution, and p_z(z) denotes the noise distribution; the symbol "∼" indicates obedience to the given probability distribution;
and inputting the original RGB image into the CGAN after training is completed, and obtaining a depth image corresponding to the original RGB image.
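
For a concrete reference point, the following PyTorch sketch evaluates discriminator and generator losses consistent with the conditional GAN objective of claim 2. The network definitions are omitted, the discriminator is assumed to output a probability in (0, 1), and the generator term uses the common non-saturating surrogate rather than the literal log(1 − D(x, G(x, z))) term; none of these choices are prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, rgb, real_depth):
    """Discriminator/generator losses for a conditional GAN on (RGB, depth) pairs."""
    z = torch.randn_like(rgb)                         # noise image z ~ p_z(z)
    fake_depth = G(rgb, z)                            # pseudo-depth image G(x, z)

    # Discriminator: push D(x, y) toward 1 for real pairs and D(x, G(x, z)) toward 0
    d_real = D(rgb, real_depth)
    d_fake = D(rgb, fake_depth.detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # Generator: push D(x, G(x, z)) toward 1 (non-saturating form of the adversarial loss)
    g_loss = F.binary_cross_entropy(D(rgb, fake_depth), torch.ones_like(d_fake))
    return d_loss, g_loss
```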
3. The method of claim 1, wherein constructing the reward function based on the distance of the drone from the target, the speed of the drone, and collision information of the drone, comprises:
generating state information s_{t+1} of the next moment based on the action strategy output by the Actor network according to the state information of the current moment;
when the state information of the next moment indicates that the unmanned aerial vehicle has collided, the reward function value r = R(s_{t+1}) is the maximum penalty value, and the current training round is ended;
when the state information of the next moment indicates that the speed of the unmanned aerial vehicle is lower than a set speed threshold, the reward function value is a minimum penalty value;
when the state information of the next moment indicates that the distance between the unmanned aerial vehicle and the target is smaller than a set distance threshold, the reward function value is the maximum reward value;
otherwise, taking the difference value between the distance between the unmanned aerial vehicle and the target at the current moment and the distance between the unmanned aerial vehicle and the target at the last moment as the reward function value.
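
A compact Python sketch of the piecewise reward of claim 3 is given below. The threshold and reward magnitudes are illustrative placeholders, and terminating the episode when the target is reached is an added assumption; the patent only states that a collision ends the training round.

```python
def reward(dist_prev, dist_now, speed, collided,
           speed_threshold=0.2, dist_threshold=1.0,
           max_penalty=-10.0, min_penalty=-1.0, max_reward=10.0):
    """Returns (reward, done) following the case order of claim 3; all constants are placeholders."""
    if collided:                        # collision: maximum penalty, end the training round
        return max_penalty, True
    if speed < speed_threshold:         # flying too slowly: minimum penalty
        return min_penalty, False
    if dist_now < dist_threshold:       # close enough to the target: maximum reward (done is assumed)
        return max_reward, True
    return dist_prev - dist_now, False  # otherwise: progress made towards the target
```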
4. The method of claim 1, wherein constructing Actor and evaluator Critic networks in a depth deterministic policy gradient DDPG network using the depth image and current speed information of the drone as state information comprises:
defining the state information at the current moment as s_t = [d_t, v_t]; wherein d_t represents the depth image obtained by converting, through the CGAN, the RGB image acquired by the monocular unmanned aerial vehicle at the current moment, and the speed at the current moment is v_t = [v_xt, v_yt, v_zt];
constructing the Actor network μ(s|θ^μ) to represent deterministically outputting an action strategy based on the state information of the current moment; wherein μ represents the Actor network, θ^μ represents the Actor network parameters, and s represents the input state information;
constructing the Critic network Q(s, a|θ^Q) to give an evaluation of the action strategy output by the Actor network; wherein θ^Q represents the Critic network parameters, s represents the state information input to the Actor network, and a represents the action strategy output by the Actor network.
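
The patent does not specify network architectures, so the PyTorch sketch below is only one plausible realisation of the Actor μ(s|θ^μ) and Critic Q(s, a|θ^Q) of claim 4: a small convolutional encoder for the depth image d_t whose features are concatenated with the velocity v_t (and, for the Critic, the action a). The layer sizes and the 3-dimensional action are assumptions.

```python
import torch
import torch.nn as nn

def depth_encoder():
    # Illustrative depth-image feature extractor shared in structure by Actor and Critic
    return nn.Sequential(
        nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten())       # -> 32 * 4 * 4 = 512 features

class Actor(nn.Module):
    """mu(s | theta_mu): maps (depth image, velocity) to a bounded continuous action."""
    def __init__(self, action_dim=3):
        super().__init__()
        self.encoder = depth_encoder()
        self.head = nn.Sequential(
            nn.Linear(512 + 3, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, depth, velocity):
        feat = self.encoder(depth)
        return self.head(torch.cat([feat, velocity], dim=1))

class Critic(nn.Module):
    """Q(s, a | theta_Q): scores a state-action pair with a scalar value."""
    def __init__(self, action_dim=3):
        super().__init__()
        self.encoder = depth_encoder()
        self.head = nn.Sequential(
            nn.Linear(512 + 3 + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, depth, velocity, action):
        feat = self.encoder(depth)
        return self.head(torch.cat([feat, velocity, action], dim=1))
```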
5. The method of claim 3, wherein the training the Actor network and the Critic network in the DDPG network according to the reward function to obtain parameters of the Actor network and the Critic network comprises:
constructing an Actor target network and a Critic target network corresponding to the Actor network and the Critic network respectively; wherein the Actor target network parameters are θ^{μ'} and the Critic target network parameters are θ^{Q'};
Executing an iterative training process based on the set iterative conditions; the training process for each iteration is as follows:
obtaining, through the Actor network, the action strategy at the current moment a_t = μ(s_t|θ^μ) + N(t) based on the state information s_t at the current moment; wherein the random noise N(t) decays linearly to 0 with the number of training iterations;
executing the action strategy through the unmanned aerial vehicle and interacting with the current environment to obtain the state information s_{t+1} of the next moment and the reward function value r_t = R(s_{t+1});
storing the data tuple (s_t, a_t, r_t, s_{t+1}) into an experience replay pool D as a sample for offline training of the DDPG network;
randomly sampling N samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D;
updating the network parameters θ^μ of the Actor network according to the N samples in the direction of the sampled gradient ascent shown in the following formula:
$$\nabla_{\theta^{\mu}}J\approx\mathbb{E}_{s_t\sim p^{\beta}}\Big[\nabla_{a}Q(s,a|\theta^{Q})\big|_{s=s_t,\,a=\mu(s_t)}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s=s_t}\Big]$$
wherein E denotes the expectation, p^β denotes the state distribution of the deterministic policy, and Q(s_t, a_t|θ^Q) denotes the evaluation value given by the Critic network for the state information at the current moment under the action strategy;
the Actor target network obtains the corresponding action strategy a_{i+1} = μ'(s_{i+1}|θ^{μ'}) based on s_{i+1} in the sampled samples, and s_{i+1} and a_{i+1} are input to the Critic target network to calculate the evaluation value Q̂ = Q(s_{i+1}, a_{i+1}|θ^{Q'});
based on the evaluation value Q(s_t, a_t|θ^Q) of the Critic network and the evaluation value Q̂ of the next state given by the Critic target network, the network parameters θ^Q of the Critic network are updated by minimizing the loss function L(θ^Q) shown in the following formula:
$$L(\theta^{Q})=\mathbb{E}_{s_t\sim p^{\beta},\,a_t\sim\beta,\,r_t\sim E}\Big[\big(\hat{Q}-Q(s_t,a_t|\theta^{Q})\big)^{2}\Big]$$
wherein Q̂ = Q(s_{t+1}, a_{t+1}|θ^{Q'}) = r_t + γQ'(s_{t+1}, μ'(s_{t+1}|θ^{μ'})|θ^{Q'}); s_t ∼ p^β, a_t ∼ β and r_t ∼ E denote the distributions from which the states, actions and rewards in the expectation are drawn; γ ∈ (0, 1) is the attenuation factor, indicating the influence of the reward at the current moment on future moments; μ'(s_{t+1}|θ^{μ'}) denotes the action strategy, with random noise added, obtained correspondingly by the Actor target network based on s_{i+1} in the sampled samples;
updating the network parameters of the Actor target network and the Critic target network according to the following formula:
$$\theta^{Q'}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q'},\qquad\theta^{\mu'}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu'}$$
wherein τ is a sliding average coefficient, and the value is smaller than 1.
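
The following PyTorch sketch condenses the claim-5 update into one function: a Critic regression toward the target value r + γQ', a gradient-ascent step on Q(s, μ(s)) for the Actor, and the τ-weighted soft update of both target networks. For brevity the state is assumed to be a single pre-batched tensor (rather than the depth/velocity pair sketched after claim 4), done is a float tensor that is 1.0 when the episode ended, and the values of γ and τ are illustrative.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update on a minibatch (s, a, r, s_next, done) sampled from the replay pool."""
    s, a, r, s_next, done = batch

    # Critic: minimise (Q(s, a) - (r + gamma * Q'(s', mu'(s'))))^2
    with torch.no_grad():
        q_hat = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(-1), q_hat)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the sampled policy gradient by minimising -Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: theta' <- tau * theta + (1 - tau) * theta'
    for target, online in ((actor_target, actor), (critic_target, critic)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
    return critic_loss.item(), actor_loss.item()
```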
6. A monocular unmanned aerial vehicle obstacle avoidance device based on depth deterministic strategy gradients, the device comprising: an acquisition part, a conversion part, a first construction part, a second construction part, a training part and an obstacle avoidance control part; wherein:
the acquisition part is configured to acquire an original RGB image through a monocular camera carried by the unmanned aerial vehicle;
the conversion part is configured to convert the original RGB image into a depth image based on a conditional generative adversarial network CGAN;
the first construction part is configured to construct an actor Actor network and an evaluator Critic network in a depth deterministic strategy gradient DDPG network by taking the depth image and the current speed information of the unmanned aerial vehicle as state information;
the second construction part is configured to construct a reward function according to the distance between the unmanned aerial vehicle and the target, the speed of the unmanned aerial vehicle and the collision information of the unmanned aerial vehicle;
the training part is configured to train the Actor network and the Critic network in the DDPG network according to the reward function so as to obtain parameters of the Actor network and the Critic network;
the obstacle avoidance control part is configured to perform obstacle avoidance control on the monocular unmanned aerial vehicle according to the DDPG network after training.
7. The apparatus of claim 6, wherein the conversion portion is configured to:
constructing two independent convolution networks and setting the two independent convolution networks as a generator and a discriminator respectively; the generator is used for generating a pseudo-depth image from the RGB image and inputting the pseudo-depth image to the discriminator; the discriminator is used for judging whether the input image is a true depth image or not;
training the network parameters of the generator and the discriminator through a training sample set to minimize the loss function of the CGAN represented by the following formula:
$$\min_{\theta_G}\max_{\theta_D} L_{CGAN}(\theta_G,\theta_D)=\mathbb{E}_{x\sim p_{data}(x),\,y\sim p_{data}(y)}\big[\log D(x,y)\big]+\mathbb{E}_{x\sim p_{data}(x),\,z\sim p_z(z)}\big[\log\big(1-D(x,G(x,z))\big)\big]$$
wherein θ_D represents the network parameters of the discriminator and θ_G represents the network parameters of the generator; the variable x represents the RGB image; z is the noise image; the variable y is the corresponding real depth image; G(x, z) is the pseudo-depth image generated by the generator; D(x, y) is the probability that the discriminator judges the pseudo-depth image generated by the generator to be a real depth image; p_data denotes the data sample distribution, p_data(x) denotes the real data sample distribution, and p_z(z) denotes the noise distribution; the symbol "∼" indicates obedience to the given probability distribution;
and inputting the original RGB image into the CGAN after training is completed, and obtaining a depth image corresponding to the original RGB image.
8. The apparatus of claim 6, wherein the first build portion is configured to:
defining the state information at the current moment as s_t = [d_t, v_t]; wherein d_t represents the depth image obtained by converting, through the CGAN, the RGB image acquired by the monocular unmanned aerial vehicle at the current moment, and the speed at the current moment is v_t = [v_xt, v_yt, v_zt];
constructing the Actor network μ(s|θ^μ) to represent deterministically outputting an action strategy based on the state information of the current moment; wherein μ represents the Actor network, θ^μ represents the Actor network parameters, and s represents the input state information;
constructing the Critic network Q(s, a|θ^Q) to give an evaluation of the action strategy output by the Actor network; wherein θ^Q represents the Critic network parameters, s represents the state information input to the Actor network, and a represents the action strategy output by the Actor network.
9. The apparatus of claim 8, wherein the training portion is configured to:
constructing an Actor target network and a Critic target network corresponding to the Actor network and the Critic network respectively; wherein the Actor target network parameters are θ^{μ'} and the Critic target network parameters are θ^{Q'};
Executing an iterative training process based on the set iterative conditions; the training process for each iteration is as follows:
obtaining, through the Actor network, the action strategy at the current moment a_t = μ(s_t|θ^μ) + N(t) based on the state information s_t at the current moment; wherein the random noise N(t) decays linearly to 0 with the number of training iterations;
executing the action strategy through the unmanned aerial vehicle and interacting with the current environment to obtain the state information s_{t+1} of the next moment and the reward function value r_t = R(s_{t+1});
storing the data tuple (s_t, a_t, r_t, s_{t+1}) into an experience replay pool D as a sample for offline training of the DDPG network;
randomly sampling N samples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool D;
updating the network parameters θ^μ of the Actor network according to the N samples in the direction of the sampled gradient ascent shown in the following formula:
$$\nabla_{\theta^{\mu}}J\approx\mathbb{E}_{s_t\sim p^{\beta}}\Big[\nabla_{a}Q(s,a|\theta^{Q})\big|_{s=s_t,\,a=\mu(s_t)}\,\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})\big|_{s=s_t}\Big]$$
wherein E denotes the expectation, p^β denotes the state distribution of the deterministic policy, and Q(s_t, a_t|θ^Q) denotes the evaluation value given by the Critic network for the state information at the current moment under the action strategy;
the Actor target network obtains the corresponding action strategy a_{i+1} = μ'(s_{i+1}|θ^{μ'}) based on s_{i+1} in the sampled samples, and s_{i+1} and a_{i+1} are input to the Critic target network to calculate the evaluation value Q̂ = Q(s_{i+1}, a_{i+1}|θ^{Q'});
based on the evaluation value Q(s_t, a_t|θ^Q) of the Critic network and the evaluation value Q̂ of the next state given by the Critic target network, the network parameters θ^Q of the Critic network are updated by minimizing the loss function L(θ^Q) shown in the following formula:
$$L(\theta^{Q})=\mathbb{E}_{s_t\sim p^{\beta},\,a_t\sim\beta,\,r_t\sim E}\Big[\big(\hat{Q}-Q(s_t,a_t|\theta^{Q})\big)^{2}\Big]$$
wherein Q̂ = Q(s_{t+1}, a_{t+1}|θ^{Q'}) = r_t + γQ'(s_{t+1}, μ'(s_{t+1}|θ^{μ'})|θ^{Q'}); s_t ∼ p^β, a_t ∼ β and r_t ∼ E denote the distributions from which the states, actions and rewards in the expectation are drawn; γ ∈ (0, 1) is the attenuation factor, indicating the influence of the reward at the current moment on future moments; μ'(s_{t+1}|θ^{μ'}) denotes the action strategy, with random noise added, obtained correspondingly by the Actor target network based on s_{i+1} in the sampled samples;
updating the network parameters of the Actor target network and the Critic target network according to the following formula:
$$\theta^{Q'}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q'},\qquad\theta^{\mu'}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu'}$$
wherein τ is a sliding average coefficient, and the value is smaller than 1.
10. A computer storage medium, characterized in that it stores a depth deterministic strategy gradient-based monocular drone obstacle avoidance program, which when executed by at least one processor implements the steps of the depth deterministic strategy gradient-based monocular drone obstacle avoidance method according to any one of claims 1 to 5.
CN202211612609.5A 2022-12-14 2022-12-14 Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on depth deterministic strategy gradient Pending CN116203979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211612609.5A CN116203979A (en) 2022-12-14 2022-12-14 Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on depth deterministic strategy gradient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211612609.5A CN116203979A (en) 2022-12-14 2022-12-14 Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on depth deterministic strategy gradient

Publications (1)

Publication Number Publication Date
CN116203979A true CN116203979A (en) 2023-06-02

Family

ID=86508391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211612609.5A Pending CN116203979A (en) 2022-12-14 2022-12-14 Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on depth deterministic strategy gradient

Country Status (1)

Country Link
CN (1) CN116203979A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578102A (en) * 2023-07-13 2023-08-11 清华大学 Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
CN116578102B (en) * 2023-07-13 2023-09-19 清华大学 Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning
CN117406706B (en) * 2023-08-11 2024-04-09 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning

Similar Documents

Publication Publication Date Title
US11842261B2 (en) Deep reinforcement learning with fast updating recurrent neural networks and slow updating recurrent neural networks
CN110326004B (en) Training a strategic neural network using path consistency learning
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN116203979A (en) Monocular unmanned aerial vehicle obstacle avoidance method, device and medium based on depth deterministic strategy gradient
US20210201156A1 (en) Sample-efficient reinforcement learning
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN116776964A (en) Method, program product and storage medium for distributed reinforcement learning
US20210103815A1 (en) Domain adaptation for robotic control using self-supervised learning
US20220036186A1 (en) Accelerated deep reinforcement learning of agent control policies
CN113168566A (en) Controlling a robot by using entropy constraints
CN111707270B (en) Map-free obstacle avoidance navigation method based on distribution estimation and reinforcement learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
Yan et al. Reinforcement Learning‐Based Autonomous Navigation and Obstacle Avoidance for USVs under Partially Observable Conditions
Sun et al. A 2D optimal path planning algorithm for autonomous underwater vehicle driving in unknown underwater canyons
WO2022069732A1 (en) Cross-domain imitation learning using goal conditioned policies
CN112256037A (en) Control method and device applied to automatic driving, electronic equipment and medium
CN117058547A (en) Unmanned ship dynamic target tracking method
US20210398014A1 (en) Reinforcement learning based control of imitative policies for autonomous driving
Yue et al. A new search scheme using multi‐bee‐colony elite learning method for unmanned aerial vehicles in unknown environments
CN116039957B (en) Spacecraft online game planning method, device and medium considering barrier constraint
Han et al. Three‐dimensional obstacle avoidance for UAV based on reinforcement learning and RealSense
US12061474B2 (en) Controller for optimizing motion trajectory to control motion of one or more devices
US20230074148A1 (en) Controller for Optimizing Motion Trajectory to Control Motion of One or More Devices
CN117032297A (en) Training method and using method of unmanned aerial vehicle tracking control model and terminal equipment
Guo et al. Adaptive Whale Optimization Algorithm–DBiLSTM for Autonomous Underwater Vehicle (AUV) Trajectory Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination