CN112506210A - Unmanned aerial vehicle control method for autonomous target tracking - Google Patents

Unmanned aerial vehicle control method for autonomous target tracking

Info

Publication number
CN112506210A
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
reward
robot
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011402067.XA
Other languages
Chinese (zh)
Other versions
CN112506210B (en)
Inventor
徐乐玏
孙长银
陆科林
王腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202011402067.XA priority Critical patent/CN112506210B/en
Publication of CN112506210A publication Critical patent/CN112506210A/en
Application granted granted Critical
Publication of CN112506210B publication Critical patent/CN112506210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/08: Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808: Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for aircraft
    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft

Abstract

The invention relates to an unmanned aerial vehicle control method for autonomous target tracking. A neural network outputs a four-dimensional action, which a PID (proportional-integral-derivative) controller converts into low-level motor commands, so that the unmanned aerial vehicle flies more stably; in later improvements the PID controller can be replaced by other, more optimized control methods. The hierarchical control system makes it easy to transfer a strategy trained in the simulation environment to the real environment, and it has good generalization ability. The method first performs CNN pre-training on images collected in the simulated environment to obtain the relative distance between the unmanned aerial vehicle and the target object in the three dimensions x, y and h. Then, taking the attitude of the unmanned aerial vehicle into account, a strategy is selected and a four-dimensional action is output; the action is passed through the PID controller to the low-level motors of the unmanned aerial vehicle, a reward is obtained through the DDPG (deep deterministic policy gradient) reinforcement learning method, and the strategy is updated through learning and training.

Description

Unmanned aerial vehicle control method for autonomous target tracking
Technical Field
The invention relates to an unmanned aerial vehicle control method for autonomous target tracking, and belongs to the technical field of unmanned aerial vehicle tracking.
Background
Technology for unmanned aerial vehicles to track moving objects and people is increasingly needed in military, surveillance and inspection applications. This requires the drone to combine visual perception techniques with suitable control methods. However, the unmanned aerial vehicle is a very fragile system, so the strategy needs to be updated and optimized through model-free reinforcement learning while the stability of the controller is guaranteed.
Disclosure of Invention
The invention aims to provide a control method for an unmanned aerial vehicle autonomously tracking a moving robot, which combines the self-improvement capability of a model-free reinforcement learning method with the stability of a conventional PID controller: a neural network outputs a four-dimensional action, which the PID controller converts into low-level motor commands, so that the unmanned aerial vehicle flies more stably.
In order to achieve the purpose, the invention adopts the following technical scheme: a drone control method for autonomous target tracking, the control method comprising the steps of:
s1: building a simulation platform environment, adding models of an unmanned aerial vehicle and a robot by modifying a launch file, and setting initial positions of the unmanned aerial vehicle and the robot in the simulation environment;
s2: the robot and the unmanned aerial vehicle can each be moved through instructions or code; the lidar of the robot is used to realize a simple obstacle-avoidance function; the movement of the unmanned aerial vehicle and the robot is controlled by issuing commands that set the linear velocities in the x, y and z directions and the angular velocities about the x, y and z axes, and images of the environment and of the robot on the ground are acquired through the camera at the bottom of the unmanned aerial vehicle;
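As an illustration of step S2, the following is a minimal sketch of issuing velocity commands in a ROS-based simulation using standard geometry_msgs/Twist messages; the topic names "/cmd_vel" and "/drone/cmd_vel" are placeholders and are not specified in the patent.

```python
# Minimal sketch of publishing velocity commands to the robot and the drone in
# a ROS simulation. Topic names are illustrative assumptions.
import rospy
from geometry_msgs.msg import Twist

def send_velocity(pub, vx, vy, vz, wx, wy, wz):
    """Publish linear velocities (x, y, z) and angular velocities about x, y, z."""
    cmd = Twist()
    cmd.linear.x, cmd.linear.y, cmd.linear.z = vx, vy, vz
    cmd.angular.x, cmd.angular.y, cmd.angular.z = wx, wy, wz
    pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("velocity_commander")
    robot_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    drone_pub = rospy.Publisher("/drone/cmd_vel", Twist, queue_size=1)
    rate = rospy.Rate(10)  # 10 Hz command loop
    while not rospy.is_shutdown():
        send_velocity(robot_pub, 0.3, 0.0, 0.0, 0.0, 0.0, 0.1)  # robot: forward, slow turn
        send_velocity(drone_pub, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)  # drone: hold position
        rate.sleep()
```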
s3: images are collected and the perception layer is pre-trained in the simulated environment. The unmanned aerial vehicle is kept at a fixed height above the environment, and the pictures collected by the fixed bottom camera have a resolution of 256 x 256 pixels; with the unmanned aerial vehicle fixed, the relative size of the environment visible to the bottom camera, i.e. the area of the visible region, is fixed. The robot moves randomly within the x and y field of view while the unmanned aerial vehicle stays at that height, and 10000 images are collected, each of which contains the robot; these are used to train the CNN neural network shown in figure 1.
S4: the real-time positions of the robot and the unmanned aerial vehicle in the simulated environment are obtained through a topic subscriber (get position and the like), and the relative position of the robot and the unmanned aerial vehicle is computed from them; the way picture names are stored is modified so that the relative position of the robot and the unmanned aerial vehicle becomes the name of the image acquired by the unmanned aerial vehicle, so that pictures and labels correspond one to one for subsequent image processing. Meanwhile, while the unmanned aerial vehicle collects images it is controlled by the PID controller, so that it stays within a small area and does not shake noticeably.
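A minimal sketch of the labelling scheme in step S4 is given below: each captured frame is saved under a file name that encodes the drone-robot relative position, so images and labels stay in one-to-one correspondence. The file-name format and the use of OpenCV are assumptions for illustration.

```python
# Minimal sketch of saving each frame with its relative-position label encoded
# in the file name (format "x_y_h.png" is an assumption, not from the patent).
import os
import cv2  # OpenCV, assumed available for writing images

def save_labelled_image(frame, drone_pos, robot_pos, out_dir="dataset"):
    """Save an image named after the relative (x, y, h) offset between drone and robot."""
    os.makedirs(out_dir, exist_ok=True)
    rel_x = robot_pos[0] - drone_pos[0]
    rel_y = robot_pos[1] - drone_pos[1]
    rel_h = drone_pos[2] - robot_pos[2]  # drone height above the robot
    name = f"{rel_x:.3f}_{rel_y:.3f}_{rel_h:.3f}.png"
    cv2.imwrite(os.path.join(out_dir, name), frame)
    return name
```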
S5: with images and labels known in the simulation environment, supervised pre-training is carried out so that, in the real environment, the relative position can be predicted from an input image. The training range of x and y is about 6 m, the loss after 2000 episodes is 0.03 m, and the average distance between the relative positions predicted on the test set and the actual labels is about 0.1 m.
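A minimal sketch of the supervised pre-training in step S5 is given below, regressing the relative position (x, y, h) from a 256 x 256 image with an L2 loss; the network architecture, dataset handling and hyperparameters are illustrative assumptions rather than the patent's exact configuration (the patent's perception network uses a spatial softmax layer, described later).

```python
# Minimal sketch of supervised pre-training: regress (x, y, h) from an image.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class PerceptionCNN(nn.Module):
    """Illustrative placeholder network, not the patent's architecture."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 3)  # predicts (x, y, h)

    def forward(self, img):
        return self.head(self.features(img).flatten(1))

def pretrain(model, images, labels, epochs=10, lr=1e-3):
    """Supervised regression of the relative position with an L2 (MSE) loss."""
    loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for img, target in loader:
            loss = loss_fn(model(img), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```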
S6: quad-rotor aircraft have complex nonlinear aerodynamics that are difficult to learn with model-free RL methods. This challenge can be addressed by incorporating a conventional PID controller. Figure 2 shows the proposed hierarchical control system: at each time step t, given the observed image, the policy network generates a four-dimensional high-level reference action u_t.
S7: by means of the reinforcement learning method of the DDPG, a reward function is shaped, and the reward function simultaneously considers the four-rotor state and the target related state.
S8: transferring to the real environment.
Wherein, the environment in step S1 includes obstacles such as a football field, a football, a cleaning cart, a garbage bin, a ladder rack, a colored fence, a scooter, warning posts and desks.
Wherein, in step S5, the last convolutional layer is merged with a spatial softmax layer to convert each pixel-wise feature map into spatial coordinates in image space. The spatial softmax layer consists of a spatial softmax function applied to the last convolutional feature map, followed by a fixed, sparse fully connected layer that computes the expected image position of each feature map. The spatial feature points are then regressed, through another fully connected layer, into a three-dimensional vector s_{o,t} = (x_t, y_t, h_t), which represents the 2D position and the height of the target on the image plane (the height is kept fixed here). To achieve stable flight, the quad-rotor state s_{q,t} = (z_t, v_t, q_t, w_t), comprising altitude, linear velocity, orientation and angular velocity, is used as an additional input to the neural network. After the perception layer, the target-related state s_{o,t} and the quad-rotor state s_{q,t} are merged together and followed by a fully connected layer that outputs the action.
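A minimal sketch of a spatial softmax layer of the kind described above is shown below; it converts each feature map into the expected image coordinates of its activations. This is a generic implementation under the usual definition of spatial softmax, not the patent's code.

```python
# Minimal sketch of a spatial softmax layer: each channel of the feature map is
# reduced to the expected (x, y) image coordinate of its activation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, height),
            torch.linspace(-1.0, 1.0, width),
            indexing="ij",
        )
        self.register_buffer("xs", xs.reshape(-1))
        self.register_buffer("ys", ys.reshape(-1))

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        attn = F.softmax(feat.view(b, c, h * w), dim=-1)
        ex = (attn * self.xs).sum(-1)             # expected x per channel
        ey = (attn * self.ys).sum(-1)             # expected y per channel
        return torch.cat([ex, ey], dim=1)         # (B, 2C) spatial feature points
```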
Wherein, in the step S6,

u_t = (p_x, p_y, p_z, p_ψ)

where u_t denotes the four-dimensional high-level reference action, divided into four components: p_x corresponds to the relative position offset in the x direction, p_y to the relative position offset in the y direction, p_z to the relative position offset in the z direction, and p_ψ to the relative angular offset about the yaw axis.
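The following is a minimal sketch of the hierarchical idea: each component of the high-level reference action u_t = (p_x, p_y, p_z, p_ψ) is tracked by an independent PID loop whose output becomes a low-level command. The gains, time step and command interface are illustrative assumptions, not values given in the patent.

```python
# Minimal sketch of converting the 4-D high-level reference action into
# low-level velocity commands with one PID loop per axis. Gains are placeholders.
class PID:
    def __init__(self, kp, ki, kd, dt=0.02):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, error):
        self.integral += error * self.dt
        deriv = (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# one PID loop per high-level axis: x, y, z offsets and yaw offset
pids = {axis: PID(kp=1.0, ki=0.0, kd=0.2) for axis in ("x", "y", "z", "yaw")}

def high_level_to_low_level(u_t):
    """Track the policy's reference action and return low-level commands."""
    px, py, pz, pyaw = u_t
    return (
        pids["x"].step(px),     # forward/backward velocity command
        pids["y"].step(py),     # lateral velocity command
        pids["z"].step(pz),     # vertical velocity command
        pids["yaw"].step(pyaw)  # yaw-rate command
    )
```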
In step S7, it is assumed that the environment is fully observed, so the observation o_t of the environment at time t equals the state s_t. The Q function Q^π(s_t, a_t) denotes the expected return after taking action a_t in state s_t and then following policy π:

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]

where Q^π(s_t, a_t) is the Q function and R_t is the return obtained by taking action a_t in state s_t at time t, from which the expectation yields the Q value. A Q-function approximator parameterized by θ^Q is considered, which can be optimized by minimizing the loss L(θ^Q), where Q(s_t, a_t | θ^Q) is the Q value of the critic network:

L(θ^Q) = E_π[(Q(s_t, a_t | θ^Q) - y_t)^2]

where

y_t = r(s_t, a_t) + γ max_a Q(s_{t+1}, a | θ^{Q'})

is the target Q value estimated by the target Q network, θ^Q denotes the weights of the critic network, θ^{Q'} denotes the weights of the critic's target network, and γ is the discount factor. The gradient computed from the loss L(θ^Q) is used to update the weights of the critic network.
For continuous-action problems Q-learning becomes difficult, and continuous domains are usually handled with actor-critic (AC) methods. DDPG combines AC with DQN and learns more efficiently on continuous actions; its deterministic policy maps each state to a unique action, and the actor is updated by performing gradient ascent on the policy gradient.
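A minimal sketch of one DDPG update consistent with the loss above is given below; in DDPG the maximization over actions in y_t is realized by a target actor network. The network classes, optimizers and replay batch are assumptions for illustration, not the patent's implementation.

```python
# Minimal sketch of a DDPG update: critic regression toward the target Q value
# and actor gradient ascent on Q(s, mu(s)), followed by soft target updates.
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from a replay buffer

    # ---- critic update: minimize (Q(s, a) - y)^2 ----
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # ---- actor update: gradient ascent on Q(s, mu(s)) ----
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # ---- soft update of the target networks ----
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```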
To this end, the reward function is designed as a combination of a goal-oriented target reward and an auxiliary quad-rotor reward, where r denotes the total reward, r_g(s_g) the goal-oriented target reward and r_q(s_q) the auxiliary quad-rotor reward:

r = r_g(s_g) + r_q(s_q)
for simplicity of notation, the subscripted time step t is omitted. The target prize is expressed as the sum of two components:
rg(sg)=rg(x,y)+rg(h)
corresponding to the position reward and the scale reward, respectively. Let s_part denote the component of the state s_g corresponding to (x, y) or h; the corresponding reward then takes a piecewise form defined by the thresholds τ_1 and τ_2 (given in the original as an equation image), where

Δs_part = ||s_part - s*_part||_2

denotes the 2-norm between the current state and the desired target state and τ_1, τ_2 denote different thresholds. The desired state is that the drone stays at a fixed height h above the robot at all times with the relative distances in x and y equal to 0, and the reward is computed by comparing the desired state with the current state. Likewise, with a slight abuse of notation, the auxiliary reward is expressed as follows:
r_q(s_q) = r_q(z) + r_q(q_1, q_2, q_3, q_4);
These terms correspond to the height and the orientation (as a quaternion) of the quad-rotor, respectively. Since the height is kept constant, the quaternion of the drone, i.e. its roll-pitch-yaw attitude, is considered; unlike the target reward, this term is used to impose additional constraints on the attitude, so only penalty terms are introduced. Using the same notation as in the target reward, the auxiliary reward takes a penalty form defined by the threshold τ_1 and the penalty weight c (given in the original as an equation image), where τ_1 denotes the same threshold as in the formula above and c denotes a penalty weight; τ_1 = 0.05, τ_2 = 0.2 and c = 0.5. It is desirable that the attitude of the drone does not change much from its previous attitude. In step S8, the positions and state information of the unmanned aerial vehicle and the robot over the whole map cannot be known in the real scene, so training, parameter updating and strategy determination need to be carried out in the simulation environment; the resulting strategy is finally applied to the real world, and the generalization capability is good.
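The following is a minimal sketch of the reward shaping described above, using the stated values τ_1 = 0.05, τ_2 = 0.2 and c = 0.5. Because the patent gives the piecewise formulas only as equation images, the exact shaping below (full reward inside τ_1, linear decay up to τ_2, a fixed penalty for large attitude changes) is an assumed form for illustration only.

```python
# Minimal sketch of the shaped reward: target reward on (x, y) and h plus an
# auxiliary attitude penalty. The piecewise shaping is an assumption.
import numpy as np

TAU1, TAU2, C = 0.05, 0.2, 0.5

def component_reward(s_part, s_target):
    """Assumed thresholded target-reward term for (x, y) or h."""
    delta = np.linalg.norm(np.asarray(s_part) - np.asarray(s_target))  # 2-norm
    if delta < TAU1:
        return 1.0
    if delta < TAU2:
        return 1.0 - (delta - TAU1) / (TAU2 - TAU1)  # linear decay (assumption)
    return 0.0

def attitude_penalty(quat, prev_quat):
    """Assumed auxiliary penalty discouraging large attitude changes."""
    delta = np.linalg.norm(np.asarray(quat) - np.asarray(prev_quat))
    return -C if delta > TAU1 else 0.0

def total_reward(rel_xy, rel_h_error, quat, prev_quat):
    r_goal = component_reward(rel_xy, (0.0, 0.0)) + component_reward([rel_h_error], [0.0])
    r_aux = attitude_penalty(quat, prev_quat)  # height term omitted: height is kept constant
    return r_goal + r_aux
```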
Compared with the prior art, the invention has the following advantages: the invention combines the self-improvement capability of the model-free reinforcement learning method with the stability of a conventional PID controller; a neural network outputs a four-dimensional action, which the PID controller converts into low-level motor commands, so that the unmanned aerial vehicle flies more stably, and in later improvements the PID controller can be replaced by other, more optimized control methods. The hierarchical control system makes it easy to transfer a strategy trained in the simulation environment to the real environment, and it has good generalization ability. The method first performs CNN pre-training on images collected in the simulated environment to obtain the relative distance between the unmanned aerial vehicle and the target object in the three dimensions x, y and h. Then, taking the attitude of the unmanned aerial vehicle into account, a strategy is selected and a four-dimensional action is output; the action is passed through the PID controller to the low-level motors of the unmanned aerial vehicle, a reward is obtained through the DDPG reinforcement learning method, and the strategy is updated through learning and training.
Drawings
FIG. 1 is a block diagram of a policy network architecture of the present invention, with a sensing layer estimating target states and a control layer learning control behavior;
FIG. 2 is a hierarchical control system of the present invention incorporating a policy network and a PID controller;
Detailed Description
For the purposes of promoting an understanding and appreciation of the invention, reference will now be made in detail to the embodiments illustrated in the drawings. Example 1: referring to fig. 1-2, a drone control method for autonomous target tracking, the method comprising the steps of:
s1: building a simulation platform environment, adding models of an unmanned aerial vehicle and a robot by modifying a launch file, and setting initial positions of the unmanned aerial vehicle and the robot in the simulation environment;
s2: the robot and the unmanned aerial vehicle can each be moved through instructions or code. The lidar of the robot is used to realize a simple obstacle-avoidance function; the movement of the unmanned aerial vehicle and the robot is controlled by issuing commands that set the linear velocities in the x, y and z directions and the angular velocities about the x, y and z axes, and the camera at the bottom of the unmanned aerial vehicle is used to acquire images of the environment and of the robot on the ground;
s3: images are collected and the perception layer is pre-trained in the simulation environment. The unmanned aerial vehicle is kept at a fixed height above the environment and the pictures collected by the bottom camera have a fixed resolution of 256 x 256 pixels; with the unmanned aerial vehicle fixed, the relative size of the environment visible to the bottom camera, i.e. the area of the visible region, is fixed. The robot moves randomly within the x and y field of view while the unmanned aerial vehicle stays at the same height, and 10000 images are collected, each of which contains the robot, for training the CNN neural network shown in figure 1;
s4: the real-time positions of the robot and the unmanned aerial vehicle in the simulated environment are obtained through a topic subscriber (get position and the like), and the relative position of the robot and the unmanned aerial vehicle is computed from them; the way picture names are stored is modified so that the relative position of the robot and the unmanned aerial vehicle becomes the name of the image acquired by the unmanned aerial vehicle, so that pictures and labels correspond one to one for subsequent image processing. Meanwhile, while the unmanned aerial vehicle collects images it is controlled by the PID controller, so that it stays within a small area and does not shake noticeably.
S5: with images and labels known in the simulation environment, supervised pre-training is carried out so that, in the real environment, the relative position can be predicted from an input image. The training range of x and y is about 6 m, the loss after 2000 episodes is 0.03 m, and the average distance between the relative positions predicted on the test set and the actual labels is about 0.1 m.
S6: quad-rotor aircraft have complex nonlinear aerodynamics that are difficult to learn with model-free RL methods. This challenge can be addressed by incorporating a conventional PID controller. Figure 2 shows the proposed hierarchical control system: at each time step t, given the observed image, the policy network generates a four-dimensional high-level reference action u_t.
S7: by means of the reinforcement learning method of the DDPG, a reward function is shaped, and the reward function simultaneously considers the four-rotor state and the target related state.
S8: and transferring to a real environment.
Wherein, the environment in step S1 includes obstacles such as a football field, a football, a cleaning cart, a garbage bin, a ladder rack, a colored fence, a scooter, warning posts and desks.
Wherein, in step S5, the last convolutional layer is merged with a spatial softmax layer to convert each pixel-wise feature map into spatial coordinates in image space. The spatial softmax layer consists of a spatial softmax function applied to the last convolutional feature map, followed by a fixed, sparse fully connected layer that computes the expected image position of each feature map. The spatial feature points are then regressed, through another fully connected layer, into a three-dimensional vector s_{o,t} = (x_t, y_t, h_t), which represents the 2D position and the height of the target on the image plane (the height is kept fixed here). To achieve stable flight, the quad-rotor state s_{q,t} = (z_t, v_t, q_t, w_t), comprising altitude, linear velocity, orientation and angular velocity, is used as an additional input to the neural network. After the perception layer, the target-related state s_{o,t} and the quad-rotor state s_{q,t} are merged together and followed by a fully connected layer that outputs the action.
Wherein, in the step S6,

u_t = (p_x, p_y, p_z, p_ψ)

where u_t denotes the four-dimensional high-level reference action, divided into four components: p_x corresponds to the relative position offset in the x direction, p_y to the relative position offset in the y direction, p_z to the relative position offset in the z direction, and p_ψ to the relative angular offset about the yaw axis.
In step S7, it is assumed that the environment is fully observed, so the observation o_t of the environment at time t equals the state s_t. The Q function Q^π(s_t, a_t) denotes the expected return after taking action a_t in state s_t and then following policy π:

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]

where Q^π(s_t, a_t) is the Q function and R_t is the return obtained by taking action a_t in state s_t at time t, from which the expectation yields the Q value. A Q-function approximator parameterized by θ^Q is considered, which can be optimized by minimizing the loss L(θ^Q), where Q(s_t, a_t | θ^Q) is the Q value of the critic network:

L(θ^Q) = E_π[(Q(s_t, a_t | θ^Q) - y_t)^2]

where

y_t = r(s_t, a_t) + γ max_a Q(s_{t+1}, a | θ^{Q'})

is the target Q value estimated by the target Q network, θ^Q denotes the weights of the critic network, θ^{Q'} denotes the weights of the critic's target network, and γ is the discount factor. The gradient computed from the loss L(θ^Q) is used to update the weights of the critic network.
For continuous-action problems Q-learning becomes difficult, and continuous domains are usually handled with actor-critic (AC) methods. DDPG combines AC with DQN and learns more efficiently on continuous actions; its deterministic policy maps each state to a unique action, and the actor is updated by performing gradient ascent on the policy gradient.
To this end, the reward function is designed as a combination of a goal-oriented target reward and an auxiliary quad-rotor reward, where r denotes the total reward, r_g(s_g) the goal-oriented target reward and r_q(s_q) the auxiliary quad-rotor reward:

r = r_g(s_g) + r_q(s_q)

For simplicity of notation, the subscripted time step t is omitted. The target reward is expressed as the sum of two components:

r_g(s_g) = r_g(x, y) + r_g(h)
corresponding to the position reward and the scale reward, respectively. Let s_part denote the component of the state s_g corresponding to (x, y) or h; the corresponding reward then takes a piecewise form defined by the thresholds τ_1 and τ_2 (given in the original as an equation image), where

Δs_part = ||s_part - s*_part||_2

denotes the 2-norm between the current state and the desired target state and τ_1, τ_2 denote different thresholds. The desired state is that the drone stays at a fixed height h above the robot at all times with the relative distances in x and y equal to 0, and the reward is computed by comparing the desired state with the current state. Likewise, with a slight abuse of notation, the auxiliary reward is expressed as follows:
r_q(s_q) = r_q(z) + r_q(q_1, q_2, q_3, q_4);
These terms correspond to the height and the orientation (as a quaternion) of the quad-rotor, respectively. Since the height is kept constant, the quaternion of the drone, i.e. its roll-pitch-yaw attitude, is considered; unlike the target reward, this term is used to impose additional constraints on the attitude, so only penalty terms are introduced. Using the same notation as in the target reward, the auxiliary reward takes a penalty form defined by the threshold τ_1 and the penalty weight c (given in the original as an equation image), where τ_1 denotes the same threshold as in the formula above and c denotes a penalty weight; τ_1 = 0.05, τ_2 = 0.2 and c = 0.5. It is desirable that the attitude of the drone does not change much from its previous attitude. In step S8, the positions and state information of the unmanned aerial vehicle and the robot over the whole map cannot be known in the real scene, so training, parameter updating and strategy determination need to be carried out in the simulation environment; the resulting strategy is finally applied to the real world, and the method has good generalization capability.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and all equivalent modifications or substitutions based on the above-mentioned technical solutions are within the scope of the present invention.

Claims (6)

1. An unmanned aerial vehicle control method for autonomous target tracking is characterized in that: the method comprises the following steps:
s1: building a simulation platform environment, adding models of an unmanned aerial vehicle and a robot by modifying a launch file, and setting initial positions of the unmanned aerial vehicle and the robot in the simulation environment;
s2: the robot and the unmanned aerial vehicle can each be moved through instructions or code; the lidar of the robot is used to realize a simple obstacle-avoidance function; the movement of the unmanned aerial vehicle and the robot is controlled by issuing commands that set the linear velocities in the x, y and z directions and the angular velocities about the x, y and z axes, and images of the environment and of the robot on the ground are acquired through the camera at the bottom of the unmanned aerial vehicle;
s3: images are collected and the perception layer is pre-trained in the simulation environment. The unmanned aerial vehicle is kept at a fixed height above the environment and the pictures collected by the bottom camera have a fixed resolution of 256 x 256 pixels; with the unmanned aerial vehicle fixed, the relative size of the environment visible to the bottom camera, i.e. the area of the visible region, is fixed. The robot moves randomly within the x and y field of view while the unmanned aerial vehicle stays at the same height, and 10000 images are collected, each of which contains the robot, to train the CNN neural network;
s4: the real-time positions of the robot and the unmanned aerial vehicle in the simulated environment are obtained through a topic subscriber (get position and the like), the relative position of the robot and the unmanned aerial vehicle is computed from them, and the way picture names are stored is modified so that the relative position of the robot and the unmanned aerial vehicle becomes the name of the image acquired by the unmanned aerial vehicle;
s5: with images and labels known in the simulation environment, supervised pre-training is carried out so that, in the real environment, the relative position can be predicted from an input image; the training range of x and y is about 6 m, the loss after 2000 episodes is 0.03 m, and the average distance between the relative positions predicted on the test set and the actual labels is about 0.1 m;
s6: at each time step t, given the observed image, the hierarchical control system generates a four-dimensional high-level reference action u_t;
S7: building a reward function through the DDPG (deep deterministic policy gradient) reinforcement learning method, wherein the reward function simultaneously considers the quad-rotor state and the target-related state;
s8: and transferring to a real environment.
2. The unmanned aerial vehicle control method for autonomous target tracking according to claim 1, characterized in that: the environment in step S1 includes obstacles such as a football field, a football, a cleaning cart, a garbage bin, a ladder rack, a colored fence, a scooter, warning posts and desks.
3. The unmanned aerial vehicle control method for autonomous target tracking according to claim 1, characterized in that: in step S5, the last convolutional layer is merged with a spatial softmax layer to convert each pixel-wise feature map into spatial coordinates in image space; the spatial softmax layer consists of a spatial softmax function applied to the last convolutional feature map and a fixed, sparse fully connected layer that computes the expected image position of each feature map; the spatial feature points are then regressed, through another fully connected layer, into a three-dimensional vector s_{o,t} = (x_t, y_t, h_t), which represents the 2D position and the height of the target on the image plane; the quad-rotor state s_{q,t} = (z_t, v_t, q_t, w_t), comprising altitude, linear velocity, orientation and angular velocity, is used as an additional input to the neural network; after the perception layer, the target-related state s_{o,t} and the quad-rotor state s_{q,t} are merged together and followed by a fully connected layer that outputs the action.
4. The unmanned aerial vehicle control method for autonomous target tracking according to claim 1, characterized in that: in step S6,

u_t = (p_x, p_y, p_z, p_ψ)

where u_t denotes the four-dimensional high-level reference action, divided into four components: p_x corresponds to the relative position offset in the x direction, p_y to the relative position offset in the y direction, p_z to the relative position offset in the z direction, and p_ψ to the relative angular offset about the yaw axis.
5. The unmanned aerial vehicle control method for autonomous target tracking according to claim 1, characterized in that: in step S7, it is assumed that the environment is fully observed, so the observation o_t of the environment at time t equals the state s_t; the Q function Q^π(s_t, a_t) denotes the expected return after taking action a_t in state s_t and then following policy π:

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]

where Q^π(s_t, a_t) is the Q function and R_t is the return obtained by taking action a_t in state s_t at time t, from which the expectation yields the Q value; a Q-function approximator parameterized by θ^Q is considered, which can be optimized by minimizing the loss L(θ^Q), where Q(s_t, a_t | θ^Q) is the Q value of the critic network:

L(θ^Q) = E_π[(Q(s_t, a_t | θ^Q) - y_t)^2]

where

y_t = r(s_t, a_t) + γ max_a Q(s_{t+1}, a | θ^{Q'})

is the target Q value estimated by the target Q network, θ^Q denotes the weights of the critic network, θ^{Q'} denotes the weights of the critic's target network, and γ is the discount factor; the gradient computed from the loss L(θ^Q) is used to update the weights of the critic network;
for continuous-action problems Q-learning becomes difficult, and continuous domains are usually handled with actor-critic (AC) methods; DDPG combines AC with DQN and learns more efficiently on continuous actions; its deterministic policy maps each state to a unique action, and the actor is updated by performing gradient ascent on the policy gradient.
To this end, the reward function is designed as a combination of a goal-oriented target reward and an auxiliary quad-rotor reward, where r denotes the total reward, r_g(s_g) the goal-oriented target reward and r_q(s_q) the auxiliary quad-rotor reward:

r = r_g(s_g) + r_q(s_q)

For simplicity of notation, the subscripted time step t is omitted. The target reward is expressed as the sum of two components:

r_g(s_g) = r_g(x, y) + r_g(h)
corresponding to the position reward and the scale reward, respectively. Let s_part denote the component of the state s_g corresponding to (x, y) or h; the corresponding reward then takes a piecewise form defined by the thresholds τ_1 and τ_2 (given in the original as an equation image), where

Δs_part = ||s_part - s*_part||_2

denotes the 2-norm between the current state and the desired target state and τ_1, τ_2 denote different thresholds. The desired state is that the drone stays at a fixed height h above the robot at all times with the relative distances in x and y equal to 0, and the reward is computed by comparing the desired state with the current state. Likewise, with a slight abuse of notation, the auxiliary reward is expressed as follows:
r_q(s_q) = r_q(z) + r_q(q_1, q_2, q_3, q_4)
These terms correspond to the height and the orientation (as a quaternion) of the quad-rotor, respectively. Since the height is kept constant, the quaternion of the drone, i.e. its roll-pitch-yaw attitude, is considered; unlike the target reward, this term is used to impose additional constraints on the attitude, so only penalty terms are introduced. Using the same notation as in the target reward, the auxiliary reward takes a penalty form defined by the threshold τ_1 and the penalty weight c (given in the original as an equation image), where τ_1 denotes the same threshold as in the formula above, c denotes a penalty weight, τ_1 = 0.05, τ_2 = 0.2 and c = 0.5. It is desirable that the attitude of the drone does not change much from its previous attitude.
6. The drone controlling method for autonomous target tracking according to claim 1, characterized in that: in step S8, training needs to be performed in a simulation environment, parameters are updated, strategies are determined, and the method is finally applied to the real world, and has a good generalization capability.
CN202011402067.XA 2020-12-04 2020-12-04 Unmanned aerial vehicle control method for autonomous target tracking Active CN112506210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011402067.XA CN112506210B (en) 2020-12-04 2020-12-04 Unmanned aerial vehicle control method for autonomous target tracking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011402067.XA CN112506210B (en) 2020-12-04 2020-12-04 Unmanned aerial vehicle control method for autonomous target tracking

Publications (2)

Publication Number Publication Date
CN112506210A true CN112506210A (en) 2021-03-16
CN112506210B CN112506210B (en) 2022-12-27

Family

ID=74969812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011402067.XA Active CN112506210B (en) 2020-12-04 2020-12-04 Unmanned aerial vehicle control method for autonomous target tracking

Country Status (1)

Country Link
CN (1) CN112506210B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107479368A (en) * 2017-06-30 2017-12-15 北京百度网讯科技有限公司 A kind of method and system of the training unmanned aerial vehicle (UAV) control model based on artificial intelligence
US20190325584A1 (en) * 2018-04-18 2019-10-24 Tg-17, Llc Systems and Methods for Real-Time Adjustment of Neural Networks for Autonomous Tracking and Localization of Moving Subject
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN109164821A (en) * 2018-09-26 2019-01-08 中科物栖(北京)科技有限责任公司 A kind of UAV Attitude training method and device
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114721412A (en) * 2022-03-16 2022-07-08 北京理工大学 Unmanned aerial vehicle trajectory tracking obstacle avoidance method based on model predictive control
CN115098941A (en) * 2022-05-31 2022-09-23 复旦大学 Unmanned aerial vehicle digital twin control method and platform for agile deployment of intelligent algorithm
CN115098941B (en) * 2022-05-31 2023-08-04 复旦大学 Unmanned aerial vehicle digital twin control method and platform for smart deployment of intelligent algorithm
CN116203992A (en) * 2023-04-28 2023-06-02 北京航空航天大学 Tailstock type unmanned aerial vehicle high-dynamic target tracking method for multi-mode flight control

Also Published As

Publication number Publication date
CN112506210B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN112506210B (en) Unmanned aerial vehicle control method for autonomous target tracking
Singla et al. Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge
Chen et al. Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety
Amini et al. Vista 2.0: An open, data-driven simulator for multimodal sensing and policy learning for autonomous vehicles
Ross et al. Learning monocular reactive uav control in cluttered natural environments
Sampedro et al. Image-based visual servoing controller for multirotor aerial robots using deep reinforcement learning
Zhou et al. A deep Q-network (DQN) based path planning method for mobile robots
Hong et al. Energy-efficient online path planning of multiple drones using reinforcement learning
US11561544B2 (en) Indoor monocular navigation method based on cross-sensor transfer learning and system thereof
Kelchtermans et al. How hard is it to cross the room?--Training (Recurrent) Neural Networks to steer a UAV
Olivares-Mendez et al. Vision based fuzzy control autonomous landing with UAVs: From V-REP to real experiments
Bipin et al. Autonomous navigation of generic monocular quadcopter in natural environment
Xu et al. Monocular vision based autonomous landing of quadrotor through deep reinforcement learning
Wang et al. Learning interactive driving policies via data-driven simulation
CN108288038A (en) Night robot motion's decision-making technique based on scene cut
Olivares-Mendez et al. Setting up a testbed for UAV vision based control using V-REP & ROS: A case study on aerial visual inspection
Fu et al. Memory-enhanced deep reinforcement learning for UAV navigation in 3D environment
Doukhi et al. Deep reinforcement learning for autonomous map-less navigation of a flying robot
Chen et al. Deep reinforcement learning of map-based obstacle avoidance for mobile robot navigation
CN111611869B (en) End-to-end monocular vision obstacle avoidance method based on serial deep neural network
Zhou et al. Vision-based navigation of uav with continuous action space using deep reinforcement learning
Walvekar et al. Vision based autonomous navigation of quadcopter using reinforcement learning
CN116679710A (en) Robot obstacle avoidance strategy training and deployment method based on multitask learning
Shi et al. Path Planning of Unmanned Aerial Vehicle Based on Supervised Learning
Prazenica et al. Multiresolution and adaptive path planning for maneuver of micro-air-vehicles in urban environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant