CN117406762A - Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Info

Publication number
CN117406762A
Authority
CN
China
Prior art keywords
network
aerial vehicle
unmanned aerial
value
theta
Prior art date
Legal status
Pending
Application number
CN202311006154.7A
Other languages
Chinese (zh)
Inventor
廖扬志
邹莹
陈祖国
陈超洋
卢明
黄毅
李沛
张红强
陈永伟
Current Assignee
Hunan University of Science and Technology
Original Assignee
Hunan University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Hunan University of Science and Technology filed Critical Hunan University of Science and Technology
Priority to CN202311006154.7A priority Critical patent/CN117406762A/en
Publication of CN117406762A publication Critical patent/CN117406762A/en
Pending legal-status Critical Current


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64UUNMANNED AERIAL VEHICLES [UAV]; EQUIPMENT THEREFOR
    • B64U10/00Type of UAV
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64UUNMANNED AERIAL VEHICLES [UAV]; EQUIPMENT THEREFOR
    • B64U2201/00UAVs characterised by their flight controls
    • B64U2201/20Remote controls

Landscapes

  • Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning, which comprises the following steps: definition of the state space and action space, reward setting and loss function, network structure design, network training method, network updating method, and selection of a simulation environment. The algorithm has the following beneficial effects: the network is trained with a sectional reinforcement learning method inspired by curriculum learning, which divides a complex network into two sections for training, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be run separately when necessary, giving the method strong adaptability.

Description

Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning
Technical Field
The invention relates to the field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning.
Background
In the prior art, remote control of an unmanned aerial vehicle often requires a pilot with considerable skill and experience and the use of a traditional remote controller, which is difficult for ordinary users. In addition, a traditional unmanned aerial vehicle remote control system often requires equipment such as GPS for positioning, which increases equipment cost and prevents use in environments, such as indoors, where GPS signals cannot be received. Moreover, because unmanned aerial vehicle control involves many input signals, traditional control methods often require multiple sensors for data acquisition and processing, which increases the complexity and cost of the system.
Vision-based unmanned aerial vehicle remote control systems have received a great deal of attention in the prior art. Such a system obtains environmental information around the unmanned aerial vehicle with equipment such as cameras and analyzes it with image processing algorithms to realize control. However, a traditional vision-based remote control system often needs several cameras for data collection and processing, which increases the complexity and cost of the system; at the same time, because of the high complexity of image processing algorithms, such a system often needs high-performance computing equipment, which further increases its cost and energy consumption.
Therefore, it is necessary to provide an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning to solve the above technical problems.
Disclosure of Invention
The invention provides an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning. It addresses two problems: first, unmanned aerial vehicle control involves many input signals, so traditional control methods often require multiple sensors for data acquisition and processing, which increases the complexity and cost of the system; second, because of the high complexity of image processing algorithms, traditional vision-based remote control systems often require high-performance computing equipment, which increases system cost and energy consumption.
In order to solve the technical problems, the unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning provided by the invention comprises the following steps:
S1, defining the state space and the action space: the state space and an action space consisting of classical RC remote control signals;
S2, reward setting and loss function: to train the neural network, a reward and a loss function must be specified; the reward is the negative root mean square of the error between the speed and its reference value, plus a collision penalty, where the collision penalty depends on the speed at the moment an obstacle is hit and is 0 when no obstacle is hit;
S3, designing the network structure: the basic network structure consists of convolution layers and fully connected layers, and part of the network is a speed monitoring network that is trained in advance in a safe environment;
S4, a network training method: a sectional reinforcement learning method inspired by curriculum learning is adopted so that the network can be trained efficiently and quickly;
S5, a network updating method: the network update follows the PPO update scheme, which comprises two parts, a policy network that generates the action sequence and a value network that evaluates the value of the current state;
S6, selecting a simulation environment.
Preferably, the state space in S1 includes: the speed instruction reference value, the filtered IMU data of the past three time steps, and the 640×480×2 binocular images of the past three time steps, which together form the state space; this gives the neural network predictive capability to cope with high transmission delay during flight while not imposing excessive training difficulty on the network.
Preferably, the action space in S1 consists of the classical RC remote control signals: throttle, pitch, roll and yaw; this action space is sufficient for the unmanned aerial vehicle to perform high-difficulty maneuvers in three-dimensional space and realizes remote control of the unmanned aerial vehicle.
Preferably, the mathematical expression of the reward in S2 is:
$R=-\sqrt{\frac{1}{N}\sum_{i=1}^{N}(v_i-v_{ref,i})^2}-Collision*(\sum_{i=1}^{N}v_i^2)$;
wherein:
$R$: reward value;
$N$: data set size;
$v_i$: speed of the i-th sample, so that $v_i-v_{ref,i}$ is the speed error;
$v_{ref,i}$: speed reference value of the i-th sample;
$Collision$: collision penalty indicator, equal to 1 when the unmanned aerial vehicle collides with an obstacle and 0 otherwise;
For the loss function, the present patent uses a mean square error loss function. This loss function measures the error between the predicted value and the true value, so that the unmanned aerial vehicle can continuously optimize its prediction capability and improve the control effect; through continuous learning and optimization the unmanned aerial vehicle can achieve autonomous control and stable flight, which improves its application value and range of application;
the mathematical expression of the loss function is:
$L=\frac{1}{2N}\sum_{i=1}^{N}(Q(s_i,a_i)-y_i)^2$;
wherein:
$ N $: training set size;
$ s_i $: the status of the ith sample;
$ a_i $: an action of the ith sample;
$Q(s_i,a_i)$: the output of the neural network;
$ y_i $: target value of the i-th sample;
When the unmanned aerial vehicle moves along the expected trajectory it obtains a positive reward, and otherwise a negative reward; this reward drives the unmanned aerial vehicle to continuously optimize its own control strategy during learning and finally achieve a stable control effect.
Preferably, in the structure in S4, the actor network and the critic network of PPO are both derived from the basic network structure, and they share a part of the feature extraction network.
Preferably, the training method in S4 includes the following steps:
the network training is divided into two phases:
S41, first stage: training a portion of the model in a safe, obstacle-free, floor-free environment;
S42, second stage: training in an environment with obstacles while keeping the model parameters of the speed monitoring network fixed and updating the parameters of the remaining networks;
The segmented learning method divides a complex task into two sections, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be operated separately when necessary, giving the method relatively strong adaptability.
Preferably, the policy network in S5 includes the following steps:
S51, updating the policy network:
first, the advantage function $A_t$ is calculated:
$A_t=Q_t-V(s_t)$;
where $Q_t$ is the expected return obtained after performing action $a_t$ in state $s_t$ at time $t$, and $V(s_t)$ is the expected return from state $s_t$, i.e. the average performance level over actions in that state, so that $A_t$ measures how much better the performed action is than this average level;
then, the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$ is calculated:
$r_t(\theta)=\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$;
then, the updated policy parameter $\theta^{'}$ is calculated:
$\theta^{'}=argmax_{\theta}L_{CLIP}(\theta)$;
where $L_{CLIP}$ is the clipped objective function employed in the PPO algorithm to constrain the difference between the new policy and the old policy; specifically, $L_{CLIP}$ is defined as:
$L_{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)A_t,clip(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]$;
where $\epsilon$ is a hyper-parameter used to control the step size of the policy update.
Preferably, the value network in S5 includes the following steps:
S52, updating the value network:
the value network is updated mainly by minimizing a squared error; specifically, the target value $y_t$ is defined as the sum of the return $r_t$ of the current state and the discounted estimated value $V(s_{t+1})$ of the next state:
$y_t=r_t+\gamma V(s_{t+1})$;
where $\gamma$ is a discount factor that controls the importance of future rewards; the loss function $L_{VF}$ is then defined as the squared error between the predicted value and the target value:
$L_{VF}=(y_t-V(s_t))^2$;
then stochastic gradient descent is used to minimize the loss function $L_{VF}$ and update the parameters of the value network;
thus, the network update of the PPO algorithm can be summarized as the following steps:
calculating an advantage function $A_t$;
calculating the probability ratio $r_t(\theta)$;
calculating the clipped objective function $L_{CLIP}(\theta)$ and updating the policy network parameters $\theta^{'}$;
calculating the target value $y_t$;
calculating the loss function $L_{VF}$ and updating the value network parameters.
Preferably, the selecting of the simulation environment in S6 includes the following steps:
the training simulator used is Flightmare, developed by the University of Zurich research team and based on Gazebo physics simulation and Unity rendering; it provides good unmanned aerial vehicle dynamics simulation and binocular images close to the real world, while its running speed is sufficient to support the huge sampling volume required by reinforcement learning;
acquisition of state space data from the environment and operation of the unmanned aerial vehicle are based on the Gym interface, and the implementation, training and gradient updating of the neural network rely on mature tools that are easy to implement.
Compared with the related art, the unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning has the following beneficial effects:
training difficulty reduction and training efficiency improvement caused by sectional training
The method for training the network by adopting the sectional reinforcement learning inspired by course learning divides a complex network into two sections for training, one is speed control and the other is obstacle avoidance control, the training difficulty of a model can be reduced, the decoupled neural network is beneficial to enhancing the interpretability of the neural network, the two networks can be separately operated when necessary, and the method has stronger adaptability.
The design of the multi-mode fusion network improves the perception of the nerve unmanned aerial vehicle and can also finish the target under high delay.
The neural network provided by the invention has a multi-mode fusion design, which comprises a speed instruction reference value, a binocular image, IMU data and other data inputs, and the multi-mode fusion design can help the unmanned aerial vehicle to estimate the state value more accurately, so that the control effect is improved.
In addition, the invention also adopts a basic network structure formed by a convolution layer and a full connection layer, and designs a plurality of network structures such as a speed monitoring network, an actor network, a critic network and the like on the basis of the basic network structure, thereby further improving the expression capacity and the control effect of the network.
Based on binocular vision and imu data, the data is directly input into a neural network for remote control, and the control implementation cost is low.
Based on binocular vision and imu data, expensive other sensors are not needed, gps is not relied on, and obstacle avoidance can be achieved indoors.
Drawings
Fig. 1 is a schematic structural diagram of a preferred embodiment of an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning according to the present invention;
fig. 2 is a schematic diagram of an Actor network.
Detailed Description
The invention will be further described with reference to the drawings and embodiments.
Referring to fig. 1 and fig. 2 in combination, fig. 1 is a schematic structural diagram of a preferred embodiment of an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning according to the present invention; fig. 2 is a schematic diagram of an Actor network. An unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning comprises the following steps:
S1, defining the state space and the action space: the state space and an action space consisting of classical RC remote control signals;
S2, reward setting and loss function: to train the neural network, a reward and a loss function must be specified; the reward is the negative root mean square of the error between the speed and its reference value, plus a collision penalty, where the collision penalty depends on the speed at the moment an obstacle is hit and is 0 when no obstacle is hit;
S3, designing the network structure: the basic network structure consists of convolution layers and fully connected layers, and part of the network is a speed monitoring network that is trained in advance in a safe environment;
S4, a network training method: a sectional reinforcement learning method inspired by curriculum learning is adopted so that the network can be trained efficiently and quickly;
S5, a network updating method: the network update follows the PPO update scheme, which comprises two parts, a policy network that generates the action sequence and a value network that evaluates the value of the current state;
S6, selecting a simulation environment.
The state space in S1 includes: the speed instruction reference value, the filtered IMU data of the past three time steps, and the 640×480×2 binocular images of the past three time steps, which together form the state space; this gives the neural network predictive capability to cope with high transmission delay during flight while not imposing excessive training difficulty on the network.
The action space in S1 consists of the classical RC remote control signals: throttle, pitch, roll and yaw; this action space is sufficient for the unmanned aerial vehicle to perform high-difficulty maneuvers in three-dimensional space and realizes remote control of the unmanned aerial vehicle.
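As a non-limiting illustration, the state space and the action space described above can be expressed through the Gym interface mentioned in S6; in this minimal Python sketch the three-dimensional velocity reference, the 6-value IMU reading per time step, and the normalization of the RC channels to [-1, 1] are assumptions, while only the 640×480×2 binocular image shape comes from the description:

```python
import numpy as np
from gym import spaces  # the Gym interface referenced in S6

# State: velocity command reference, filtered IMU data of the past 3 time steps
# (assumed 6 values each: 3-axis accelerometer + 3-axis gyroscope), and the past
# 3 binocular image frames of size 640x480x2.
observation_space = spaces.Dict({
    "vel_ref": spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32),
    "imu": spaces.Box(low=-np.inf, high=np.inf, shape=(3, 6), dtype=np.float32),
    "stereo": spaces.Box(low=0, high=255, shape=(3, 640, 480, 2), dtype=np.uint8),
})

# Action: the classical RC channels throttle, pitch, roll, yaw (assumed normalized).
action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
```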
The mathematical expression of the reward in S2 is:
$R=-\sqrt{\frac{1}{N}\sum_{i=1}^{N}(v_i-v_{ref,i})^2}-Collision*(\sum_{i=1}^{N}v_i^2)$;
wherein:
$R$: reward value;
$N$: data set size;
$v_i$: speed of the i-th sample, so that $v_i-v_{ref,i}$ is the speed error;
$v_{ref,i}$: speed reference value of the i-th sample;
$Collision$: collision penalty indicator, equal to 1 when the unmanned aerial vehicle collides with an obstacle and 0 otherwise;
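As a non-limiting illustration, the reward formula above maps directly to the following Python sketch; the array-based interface is an assumption and is not part of the claimed method:

```python
import numpy as np

def compute_reward(v, v_ref, collided):
    """R = -sqrt(mean((v_i - v_ref_i)^2)) - Collision * sum(v_i^2)."""
    v = np.asarray(v, dtype=np.float64)          # speeds of the N samples
    v_ref = np.asarray(v_ref, dtype=np.float64)  # speed reference values
    rms_error = np.sqrt(np.mean((v - v_ref) ** 2))
    collision = 1.0 if collided else 0.0         # 1 only when an obstacle is hit
    return -rms_error - collision * np.sum(v ** 2)
```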
For the loss function, the present patent uses a mean square error loss function. This loss function measures the error between the predicted value and the true value, so that the unmanned aerial vehicle can continuously optimize its prediction capability and improve the control effect; through continuous learning and optimization the unmanned aerial vehicle can achieve autonomous control and stable flight, which improves its application value and range of application;
the mathematical expression of the loss function is:
$L=\frac{1}{2N}\sum_{i=1}^{N}(Q(s_i,a_i)-y_i)^2$;
wherein:
$ N $: training set size;
$ s_i $: the status of the ith sample;
$ a_i $: an action of the ith sample;
$Q(s_i,a_i)$: the output of the neural network;
$ y_i $: target value of the i-th sample;
When the unmanned aerial vehicle moves along the expected trajectory it obtains a positive reward, and otherwise a negative reward; this reward drives the unmanned aerial vehicle to continuously optimize its own control strategy during learning and finally achieve a stable control effect.
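As a non-limiting illustration of the mean square error loss defined above (a minimal PyTorch sketch; the choice of PyTorch is an assumption):

```python
import torch

def mse_loss(q_pred, y_target):
    """L = 1/(2N) * sum_i (Q(s_i, a_i) - y_i)^2, averaged over the batch."""
    return 0.5 * torch.mean((q_pred - y_target) ** 2)
```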
In the structure in S4, the actor network and the critic network of PPO are both derived from the basic network structure, and they share a part of the feature extraction network.
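As a non-limiting illustration of a basic network built from convolution layers and fully connected layers, with actor and critic heads sharing part of the feature extraction network; all layer sizes and channel counts below are assumptions, not the patented architecture:

```python
import torch
import torch.nn as nn

class SharedFeatures(nn.Module):
    """Illustrative base network: convolution layers for the stereo frames plus a
    fully connected layer fusing IMU data and the velocity reference."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=5, stride=4), nn.ReLU(),   # 3 frames x 2 views stacked as channels
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),             # -> 32 * 4 * 4 = 512 features
        )
        self.fc = nn.Sequential(nn.Linear(512 + 3 * 6 + 3, 128), nn.ReLU())

    def forward(self, stereo, imu, vel_ref):
        x = torch.cat([self.conv(stereo), imu.flatten(1), vel_ref], dim=1)
        return self.fc(x)

class ActorCritic(nn.Module):
    """Actor and critic heads sharing part of the feature extraction network."""
    def __init__(self):
        super().__init__()
        self.features = SharedFeatures()
        self.actor = nn.Linear(128, 4)   # throttle, pitch, roll, yaw
        self.critic = nn.Linear(128, 1)  # state value V(s)

    def forward(self, stereo, imu, vel_ref):
        h = self.features(stereo, imu, vel_ref)
        return self.actor(h), self.critic(h)
```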
The training method in S4 includes the following steps:
the network training is divided into two phases:
S41, first stage: training a portion of the model in a safe, obstacle-free, floor-free environment;
S42, second stage: training in an environment with obstacles while keeping the model parameters of the speed monitoring network fixed and updating the parameters of the remaining networks;
The segmented learning method divides a complex task into two sections, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be operated separately when necessary, giving the method relatively strong adaptability.
The policy network in S5 includes the following steps:
S51, updating the policy network:
first, the advantage function $A_t$ is calculated:
$A_t=Q_t-V(s_t)$;
where $Q_t$ is the expected return obtained after performing action $a_t$ in state $s_t$ at time $t$, and $V(s_t)$ is the expected return from state $s_t$, i.e. the average performance level over actions in that state, so that $A_t$ measures how much better the performed action is than this average level;
then, the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$ is calculated:
$r_t(\theta)=\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$;
then, the updated policy parameter $\theta^{'}$ is calculated:
$\theta^{'}=argmax_{\theta}L_{CLIP}(\theta)$;
where $L_{CLIP}$ is the clipped objective function employed in the PPO algorithm to constrain the difference between the new policy and the old policy; specifically, $L_{CLIP}$ is defined as:
$L_{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)A_t,clip(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]$;
where $\epsilon$ is a hyper-parameter used to control the step size of the policy update.
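As a non-limiting illustration, $L_{CLIP}$ can be computed from per-sample log-probabilities as follows (a PyTorch sketch; the negative is returned so that a gradient-descent optimizer maximizes $L_{CLIP}$, and $\epsilon=0.2$ is an assumed value, not one taken from the patent):

```python
import torch

def clipped_policy_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Negative of the clipped surrogate L_CLIP over a batch of samples."""
    ratio = torch.exp(log_prob_new - log_prob_old)                    # r_t(theta)
    unclipped = ratio * advantage                                     # r_t * A_t
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return -torch.mean(torch.min(unclipped, clipped))                 # -E_t[min(., .)]
```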
The value network in S5 includes the following steps:
S52, updating the value network:
the value network is updated mainly by minimizing a squared error; specifically, the target value $y_t$ is defined as the sum of the return $r_t$ of the current state and the discounted estimated value $V(s_{t+1})$ of the next state:
$y_t=r_t+\gamma V(s_{t+1})$;
where $\gamma$ is a discount factor that controls the importance of future rewards; the loss function $L_{VF}$ is then defined as the squared error between the predicted value and the target value:
$L_{VF}=(y_t-V(s_t))^2$;
then stochastic gradient descent is used to minimize the loss function $L_{VF}$ and update the parameters of the value network;
thus, the network update of the PPO algorithm can be summarized as the following steps:
calculating an advantage function $A_t$;
calculating the probability ratio $r_t(\theta)$;
calculating the clipped objective function $L_{CLIP}(\theta)$ and updating the policy network parameters $\theta^{'}$;
calculating the target value $y_t$;
calculating the loss function $L_{VF}$ and updating the value network parameters.
It is noted that, when calculating the target value $y_t$, since the state space of the unmanned aerial vehicle includes the speed command reference value, the binocular images and the IMU data, a neural network is needed to estimate the value $V(s_{t+1})$. Specifically, the current state is taken as the input of the neural network, the predicted value $V(s_{t+1})$ is used to form the target value, and stochastic gradient descent is then used to minimize the squared error. In this way the state value of the unmanned aerial vehicle can be estimated by the neural network, better realizing autonomous control and stable flight.
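As a non-limiting illustration of the value update described above (a PyTorch sketch; $\gamma=0.99$ is an assumed default discount factor):

```python
import torch

def value_loss(rewards, values, next_values, gamma=0.99):
    """y_t = r_t + gamma * V(s_{t+1}); L_VF = (y_t - V(s_t))^2, averaged over the batch."""
    targets = (rewards + gamma * next_values).detach()   # y_t, treated as a fixed target
    return torch.mean((targets - values) ** 2)           # mean squared error L_VF
```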
The selection of the simulation environment in S6 includes the following steps:
the training simulator used is Flightmare, developed by the University of Zurich research team and based on Gazebo physics simulation and Unity rendering; it provides good unmanned aerial vehicle dynamics simulation and binocular images close to the real world, while its running speed is sufficient to support the huge sampling volume required by reinforcement learning;
acquisition of state space data from the environment and operation of the unmanned aerial vehicle are based on the Gym interface, and the implementation, training and gradient updating of the neural network rely on mature tools that are easy to implement.
The network is trained with a sectional reinforcement learning method inspired by curriculum learning: a complex task is divided into two sections, one for speed control and the other for obstacle avoidance control, which reduces the training difficulty of the model and improves training efficiency. In addition, the multi-mode fusion network design improves the perception of the neural-network-controlled unmanned aerial vehicle and allows the goal to be completed even under high delay.
The parts, steps or components with a direct influence are the sectional learning method in the fourth part and the design of the multi-mode fusion network; these two innovations enable the neural network to complete tasks more efficiently and give it stronger adaptability and better interpretability.
Under the condition of achieving the same technical purpose, other alternatives to the technical scheme of the invention may exist. For example, other reinforcement learning algorithms such as DDPG or TD3 could be used to train the neural network, and other unmanned aerial vehicle dynamics simulators and robot operating systems could be used to realize remote control of the unmanned aerial vehicle. However, the sectional reinforcement learning method and the multi-mode fusion network design adopted by the invention improve the perception and control effect of the neural-network-controlled unmanned aerial vehicle while providing better interpretability and adaptability, advantages that the alternatives cannot replace; therefore, the technical scheme of the invention has higher superiority and feasibility under the condition of achieving the same technical purpose.
Compared with the related art, the unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning has the following beneficial effects:
training difficulty reduction and training efficiency improvement caused by sectional training
The method for training the network by adopting the sectional reinforcement learning inspired by course learning divides a complex network into two sections for training, one is speed control and the other is obstacle avoidance control, the training difficulty of a model can be reduced, the decoupled neural network is beneficial to enhancing the interpretability of the neural network, the two networks can be separately operated when necessary, and the method has stronger adaptability.
The design of the multi-mode fusion network improves the perception of the nerve unmanned aerial vehicle and can also finish the target under high delay.
The neural network provided by the invention has a multi-mode fusion design, which comprises a speed instruction reference value, a binocular image, IMU data and other data inputs, and the multi-mode fusion design can help the unmanned aerial vehicle to estimate the state value more accurately, so that the control effect is improved.
In addition, the invention also adopts a basic network structure formed by a convolution layer and a full connection layer, and designs a plurality of network structures such as a speed monitoring network, an actor network, a critic network and the like on the basis of the basic network structure, thereby further improving the expression capacity and the control effect of the network.
Based on binocular vision and imu data, the data is directly input into a neural network for remote control, and the control implementation cost is low.
Based on binocular vision and imu data, expensive other sensors are not needed, gps is not relied on, and obstacle avoidance can be achieved indoors.
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (9)

1. The unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning is characterized by comprising the following steps of:
S1, defining the state space and the action space: the state space and an action space consisting of classical RC remote control signals;
S2, reward setting and loss function: to train the neural network, a reward and a loss function must be specified; the reward is the negative root mean square of the error between the speed and its reference value, plus a collision penalty, where the collision penalty depends on the speed at the moment an obstacle is hit and is 0 when no obstacle is hit;
S3, designing the network structure: the basic network structure consists of convolution layers and fully connected layers, and part of the network is a speed monitoring network that is trained in advance in a safe environment;
S4, a network training method: a sectional reinforcement learning method inspired by curriculum learning is adopted so that the network can be trained efficiently and quickly;
S5, a network updating method: the network update follows the PPO update scheme, which comprises two parts, a policy network that generates the action sequence and a value network that evaluates the value of the current state;
S6, selecting a simulation environment.
2. The unmanned aerial vehicle remote control algorithm based on segmented reinforcement learning of claim 1, wherein the state space in S1 comprises: the speed instruction reference value, the filtered IMU data of the past three time steps, and the 640×480×2 binocular images of the past three time steps, which together form the state space; this gives the neural network predictive capability to cope with high transmission delay during flight while not imposing excessive training difficulty on the network.
3. The unmanned aerial vehicle remote control algorithm based on segmented reinforcement learning of claim 1, wherein the action space in S1 consists of the classical RC remote control signals: throttle, pitch, roll and yaw; this action space is sufficient for the unmanned aerial vehicle to perform high-difficulty maneuvers in three-dimensional space and realizes remote control of the unmanned aerial vehicle.
4. The unmanned aerial vehicle remote control algorithm based on segmented reinforcement learning of claim 1, wherein the mathematical expression of the reward in S2 is:
$R=-\sqrt{\frac{1}{N}\sum_{i=1}^{N}(v_i-v_{ref,i})^2}-Collision*(\sum_{i=1}^{N}v_i^2)$;
wherein:
$R$: reward value;
$N$: data set size;
$v_i$: speed of the i-th sample, so that $v_i-v_{ref,i}$ is the speed error;
$v_{ref,i}$: speed reference value of the i-th sample;
$Collision$: collision penalty indicator, equal to 1 when the unmanned aerial vehicle collides with an obstacle and 0 otherwise;
For the loss function, the present patent uses a mean square error loss function. This loss function measures the error between the predicted value and the true value, so that the unmanned aerial vehicle can continuously optimize its prediction capability and improve the control effect; through continuous learning and optimization the unmanned aerial vehicle can achieve autonomous control and stable flight, which improves its application value and range of application;
the mathematical expression of the loss function is:
$L=\frac{1}{2N}\sum_{i=1}^{N}(Q(s_i,a_i)-y_i)^2$;
wherein:
$ N $: training set size;
$ s_i $: the status of the ith sample;
$ a_i $: an action of the ith sample;
$Q(s_i,a_i)$: the output of the neural network;
$ y_i $: target value of the i-th sample;
When the unmanned aerial vehicle moves along the expected trajectory it obtains a positive reward, and otherwise a negative reward; this reward drives the unmanned aerial vehicle to continuously optimize its own control strategy during learning and finally achieve a stable control effect.
5. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning of claim 1, wherein, in the structure in S4, the actor network and the critic network of PPO are both derived from the basic network structure, and they share a part of the feature extraction network.
6. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning according to claim 1, wherein the training method in S4 comprises the steps of:
the network training is divided into two phases:
S41, first stage: training a portion of the model in a safe, obstacle-free, floor-free environment;
S42, second stage: training in an environment with obstacles while keeping the model parameters of the speed monitoring network fixed and updating the parameters of the remaining networks;
The segmented learning method divides a complex task into two sections, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be operated separately when necessary, giving the method relatively strong adaptability.
7. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning of claim 1, wherein the policy network in S5 comprises the steps of:
S51, updating the policy network:
first, the advantage function $A_t$ is calculated:
$A_t=Q_t-V(s_t)$;
where $Q_t$ is the expected return obtained after performing action $a_t$ in state $s_t$ at time $t$, and $V(s_t)$ is the expected return from state $s_t$, i.e. the average performance level over actions in that state, so that $A_t$ measures how much better the performed action is than this average level;
then, the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$ is calculated:
$r_t(\theta)=\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$;
then, the updated policy parameter $\theta^{'}$ is calculated:
$\theta^{'}=argmax_{\theta}L_{CLIP}(\theta)$;
where $L_{CLIP}$ is the clipped objective function employed in the PPO algorithm to constrain the difference between the new policy and the old policy; specifically, $L_{CLIP}$ is defined as:
$L_{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)A_t,clip(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]$;
where $\epsilon$ is a hyper-parameter used to control the step size of the policy update.
8. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning of claim 7, wherein the value network in S5 comprises the steps of:
S52, updating the value network:
the value network is updated mainly by minimizing a squared error; specifically, the target value $y_t$ is defined as the sum of the return $r_t$ of the current state and the discounted estimated value $V(s_{t+1})$ of the next state:
$y_t=r_t+\gamma V(s_{t+1})$;
where $\gamma$ is a discount factor that controls the importance of future rewards; the loss function $L_{VF}$ is then defined as the squared error between the predicted value and the target value:
$L_{VF}=(y_t-V(s_t))^2$;
then stochastic gradient descent is used to minimize the loss function $L_{VF}$ and update the parameters of the value network;
thus, the network update of the PPO algorithm can be summarized as the following steps:
calculating an advantage function $A_t$;
calculating the probability ratio $r_t(\theta)$;
calculating the clipped objective function $L_{CLIP}(\theta)$ and updating the policy network parameters $\theta^{'}$;
calculating the target value $y_t$;
calculating the loss function $L_{VF}$ and updating the value network parameters.
9. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning according to claim 1, wherein the selection of the simulation environment in S6 comprises the steps of:
the training simulator used is Flightmare, developed by the University of Zurich research team and based on Gazebo physics simulation and Unity rendering; it provides good unmanned aerial vehicle dynamics simulation and binocular images close to the real world, while its running speed is sufficient to support the huge sampling volume required by reinforcement learning;
acquisition of state space data from the environment and operation of the unmanned aerial vehicle are based on the Gym interface, and the implementation, training and gradient updating of the neural network rely on mature tools that are easy to implement.
CN202311006154.7A 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning Pending CN117406762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311006154.7A CN117406762A (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311006154.7A CN117406762A (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Publications (1)

Publication Number Publication Date
CN117406762A true CN117406762A (en) 2024-01-16

Family

ID=89487756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311006154.7A Pending CN117406762A (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Country Status (1)

Country Link
CN (1) CN117406762A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117826867A (en) * 2024-03-04 2024-04-05 之江实验室 Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium
CN117826867B (en) * 2024-03-04 2024-06-11 之江实验室 Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium


Similar Documents

Publication Publication Date Title
Ruan et al. Mobile robot navigation based on deep reinforcement learning
US11062617B2 (en) Training system for autonomous driving control policy
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
CN108819948B (en) Driver behavior modeling method based on reverse reinforcement learning
CN111578940B (en) Indoor monocular navigation method and system based on cross-sensor transfer learning
WO2019076044A1 (en) Mobile robot local motion planning method and apparatus and computer storage medium
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN112034887A (en) Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
Li et al. Oil: Observational imitation learning
CN111580544A (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN108791302B (en) Driver behavior modeling system
Maravall et al. Vision-based anticipatory controller for the autonomous navigation of an UAV using artificial neural networks
Kim et al. Towards monocular vision-based autonomous flight through deep reinforcement learning
CN113741533A (en) Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning
Zijian et al. Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments
Gao et al. Autonomous driving based on modified sac algorithm through imitation learning pretraining
CN114077258A (en) Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
CN114170454A (en) Intelligent voxel action learning method based on joint grouping strategy
CN116824303B (en) Structure inspection agent navigation method based on damage driving and multi-mode multi-task learning
Wang et al. Autonomous target tracking of multi-UAV: A two-stage deep reinforcement learning approach with expert experience
Helble et al. 3-d path planning and target trajectory prediction for the oxford aerial tracking system
WO2021008798A1 (en) Training of a convolutional neural network
CN117406762A (en) Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
Wang et al. Towards better generalization in quadrotor landing using deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination