CN117406762A - Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Info

Publication number
CN117406762A
Authority
CN
China
Prior art keywords
network
aerial vehicle
unmanned aerial
value
theta
Prior art date
Legal status
Pending
Application number
CN202311006154.7A
Other languages
Chinese (zh)
Inventor
廖扬志
邹莹
陈祖国
陈超洋
卢明
黄毅
李沛
张红强
陈永伟
Current Assignee
Hunan University of Science and Technology
Original Assignee
Hunan University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Hunan University of Science and Technology filed Critical Hunan University of Science and Technology
Priority to CN202311006154.7A priority Critical patent/CN117406762A/en
Publication of CN117406762A publication Critical patent/CN117406762A/en
Pending legal-status Critical Current


Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64UUNMANNED AERIAL VEHICLES [UAV]; EQUIPMENT THEREFOR
    • B64U10/00Type of UAV
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B64AIRCRAFT; AVIATION; COSMONAUTICS
    • B64UUNMANNED AERIAL VEHICLES [UAV]; EQUIPMENT THEREFOR
    • B64U2201/00UAVs characterised by their flight controls
    • B64U2201/20Remote controls

Landscapes

  • Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning, which comprises the following steps: definition of the state space and action space, reward setting and loss function, network structure design, network training method, network updating method, and selection of a simulation environment. The algorithm has the following beneficial effects: the network is trained with a sectional reinforcement learning method inspired by curriculum learning, which divides a complex network into two sections for training, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be run separately when necessary, giving the method strong adaptability.

Description

Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning
Technical Field
The invention relates to the field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning.
Background
In the prior art, remote control of an unmanned aerial vehicle often requires a pilot with considerable skill and experience and the use of a traditional remote controller, which is difficult for ordinary users. In addition, a traditional unmanned aerial vehicle remote control system often requires equipment such as GPS for positioning, which increases equipment cost and prevents use in environments, such as indoors, where GPS signals cannot be received. Moreover, because unmanned aerial vehicle control involves many input signals, traditional control methods often require multiple sensors for data acquisition and processing, which increases the complexity and cost of the system.
Vision-based unmanned aerial vehicle remote control systems have received a great deal of attention in the prior art. Such a system obtains environmental information around the unmanned aerial vehicle with equipment such as cameras and analyzes it with image processing algorithms to realize control. However, a traditional vision-based remote control system often needs several cameras for data collection and processing, which increases the complexity and cost of the system; at the same time, because of the high complexity of image processing algorithms, such a system often needs high-performance computing equipment, which further increases its cost and energy consumption.
Therefore, it is necessary to provide an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning to solve the above technical problems.
Disclosure of Invention
The invention provides an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning. It addresses two problems: first, unmanned aerial vehicle control involves many input signals, so traditional control methods often require multiple sensors for data acquisition and processing, which increases the complexity and cost of the system; second, because of the high complexity of image processing algorithms, traditional vision-based remote control systems often require high-performance computing equipment, which increases system cost and energy consumption.
In order to solve the technical problems, the unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning provided by the invention comprises the following steps:
S1, defining the state space and the action space: the state space and an action space consisting of classical RC remote control signals;
S2, reward setting and loss function: to train the neural network, a reward and a loss function must be specified; the reward is the negative root mean square of the error between the speed and its reference value, plus a collision penalty, where the collision penalty depends on the speed at the moment an obstacle is hit and is 0 when no obstacle is hit;
S3, designing the network structure: the basic network structure consists of convolution layers and fully connected layers, and part of the network is a speed monitoring network that is trained in advance in a safe environment;
S4, a network training method: a sectional reinforcement learning method inspired by curriculum learning is adopted so that the network can be trained efficiently and quickly;
S5, a network updating method: the network update follows the PPO update scheme, which comprises two parts, a policy network that generates the action sequence and a value network that evaluates the value of the current state;
S6, selecting a simulation environment.
Preferably, the state space in S1 includes: the speed instruction reference value, the filtered IMU data of the past three time steps, and the 640×480×2 binocular images of the past three time steps, which together form the state space; this gives the neural network predictive capability to cope with high transmission delay during flight while not imposing excessive training difficulty on the network.
Preferably, the action space in S1 consists of the classical RC remote control signals: throttle, pitch, roll and yaw; this action space is sufficient for the unmanned aerial vehicle to perform high-difficulty maneuvers in three-dimensional space and realizes remote control of the unmanned aerial vehicle.
Preferably, the mathematical expression of the reward in S2 is:
$R=-\sqrt{\frac{1}{N}\sum_{i=1}^{N}(v_i-v_{ref,i})^2}-Collision*(\sum_{i=1}^{N}v_i^2)$;
wherein:
$R$: reward value;
$N$: data set size;
$v_i$: speed of the i-th sample, so that $v_i-v_{ref,i}$ is the speed error;
$v_{ref,i}$: speed reference value of the i-th sample;
$Collision$: collision penalty indicator, equal to 1 when the unmanned aerial vehicle collides with an obstacle and 0 otherwise;
For the loss function, the present patent uses a mean square error loss function. This loss function measures the error between the predicted value and the true value, so that the unmanned aerial vehicle can continuously optimize its prediction capability and improve the control effect; through continuous learning and optimization the unmanned aerial vehicle can achieve autonomous control and stable flight, which improves its application value and range of application;
the mathematical expression of the loss function is:
$L=\frac{1}{2N}\sum_{i=1}^{N}(Q(s_i,a_i)-y_i)^2$;
wherein:
$ N $: training set size;
$ s_i $: the status of the ith sample;
$ a_i $: an action of the ith sample;
$Q(s_i,a_i)$: the output of the neural network;
$ y_i $: target value of the i-th sample;
When the unmanned aerial vehicle moves along the expected trajectory it obtains a positive reward, and otherwise a negative reward; this reward drives the unmanned aerial vehicle to continuously optimize its own control strategy during learning and finally achieve a stable control effect.
Preferably, in the structure in S4, the actor network and the critic network of PPO are both derived from the basic network structure, and they share a part of the feature extraction network.
Preferably, the training method in S4 includes the following steps:
the network training is divided into two phases:
S41, first stage: training a portion of the model in a safe, obstacle-free, floor-free environment;
S42, second stage: training in an environment with obstacles while keeping the model parameters of the speed monitoring network fixed and updating the parameters of the remaining networks;
The segmented learning method divides a complex task into two sections, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be operated separately when necessary, giving the method relatively strong adaptability.
Preferably, the policy network in S5 includes the following steps:
S51, updating the policy network:
first, the advantage function $A_t$ is calculated:
$A_t=Q_t-V(s_t)$;
where $Q_t$ is the expected return obtained after performing action $a_t$ in state $s_t$ at time $t$, and $V(s_t)$ is the expected return from state $s_t$, i.e. the average performance level over actions in that state, so that $A_t$ measures how much better the performed action is than this average level;
then, the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$ is calculated:
$r_t(\theta)=\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$;
then, the updated policy parameter $\theta^{'}$ is calculated:
$\theta^{'}=argmax_{\theta}L_{CLIP}(\theta)$;
where $L_{CLIP}$ is the clipped objective function employed in the PPO algorithm to constrain the difference between the new policy and the old policy; specifically, $L_{CLIP}$ is defined as:
$L_{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)A_t,clip(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]$;
where $\epsilon$ is a hyper-parameter used to control the step size of the policy update.
Preferably, the value network in S5 includes the following steps:
S52, updating the value network:
the value network is updated mainly by minimizing a squared error; specifically, the target value $y_t$ is defined as the sum of the return $r_t$ of the current state and the discounted estimated value $V(s_{t+1})$ of the next state:
$y_t=r_t+\gamma V(s_{t+1})$;
where $\gamma$ is a discount factor that controls the importance of future rewards; the loss function $L_{VF}$ is then defined as the squared error between the predicted value and the target value:
$L_{VF}=(y_t-V(s_t))^2$;
then stochastic gradient descent is used to minimize the loss function $L_{VF}$ and update the parameters of the value network;
thus, the network update of the PPO algorithm can be summarized as the following steps:
calculating an advantage function $A_t$;
calculating the probability ratio $r_t(\theta)$;
calculating the clipped objective function $L_{CLIP}(\theta)$ and updating the policy network parameters $\theta^{'}$;
calculating the target value $y_t$;
calculating the loss function $L_{VF}$ and updating the value network parameters.
Preferably, the selecting of the simulation environment in S6 includes the following steps:
the training simulator used is Flightmare, developed by the University of Zurich research team and based on Gazebo physics simulation and Unity rendering; it provides good unmanned aerial vehicle dynamics simulation and binocular images close to the real world, while its running speed is sufficient to support the huge sampling volume required by reinforcement learning;
acquisition of state space data from the environment and operation of the unmanned aerial vehicle are based on the Gym interface, and the implementation, training and gradient updating of the neural network rely on mature tools that are easy to implement.
Compared with the related art, the unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning has the following beneficial effects:
training difficulty reduction and training efficiency improvement caused by sectional training
The method for training the network by adopting the sectional reinforcement learning inspired by course learning divides a complex network into two sections for training, one is speed control and the other is obstacle avoidance control, the training difficulty of a model can be reduced, the decoupled neural network is beneficial to enhancing the interpretability of the neural network, the two networks can be separately operated when necessary, and the method has stronger adaptability.
The design of the multi-mode fusion network improves the perception of the nerve unmanned aerial vehicle and can also finish the target under high delay.
The neural network provided by the invention has a multi-mode fusion design, which comprises a speed instruction reference value, a binocular image, IMU data and other data inputs, and the multi-mode fusion design can help the unmanned aerial vehicle to estimate the state value more accurately, so that the control effect is improved.
In addition, the invention also adopts a basic network structure formed by a convolution layer and a full connection layer, and designs a plurality of network structures such as a speed monitoring network, an actor network, a critic network and the like on the basis of the basic network structure, thereby further improving the expression capacity and the control effect of the network.
Based on binocular vision and imu data, the data is directly input into a neural network for remote control, and the control implementation cost is low.
Based on binocular vision and imu data, expensive other sensors are not needed, gps is not relied on, and obstacle avoidance can be achieved indoors.
Drawings
Fig. 1 is a schematic structural diagram of a preferred embodiment of an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning according to the present invention;
fig. 2 is a schematic diagram of an Actor network.
Detailed Description
The invention will be further described with reference to the drawings and embodiments.
Referring to fig. 1 and fig. 2 in combination, fig. 1 is a schematic structural diagram of a preferred embodiment of an unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning according to the present invention; fig. 2 is a schematic diagram of an Actor network. An unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning comprises the following steps:
S1, defining the state space and the action space: the state space and an action space consisting of classical RC remote control signals;
S2, reward setting and loss function: to train the neural network, a reward and a loss function must be specified; the reward is the negative root mean square of the error between the speed and its reference value, plus a collision penalty, where the collision penalty depends on the speed at the moment an obstacle is hit and is 0 when no obstacle is hit;
S3, designing the network structure: the basic network structure consists of convolution layers and fully connected layers, and part of the network is a speed monitoring network that is trained in advance in a safe environment;
S4, a network training method: a sectional reinforcement learning method inspired by curriculum learning is adopted so that the network can be trained efficiently and quickly;
S5, a network updating method: the network update follows the PPO update scheme, which comprises two parts, a policy network that generates the action sequence and a value network that evaluates the value of the current state;
S6, selecting a simulation environment.
The state space in S1 includes: the speed instruction reference value, the filtered IMU data of the past three time steps, and the 640×480×2 binocular images of the past three time steps, which together form the state space; this gives the neural network predictive capability to cope with high transmission delay during flight while not imposing excessive training difficulty on the network.
The action space in S1 consists of the classical RC remote control signals: throttle, pitch, roll and yaw; this action space is sufficient for the unmanned aerial vehicle to perform high-difficulty maneuvers in three-dimensional space and realizes remote control of the unmanned aerial vehicle.
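As a non-limiting illustration, the state space and the action space described above can be expressed through the Gym interface mentioned in S6; in this minimal Python sketch the three-dimensional velocity reference, the 6-value IMU reading per time step, and the normalization of the RC channels to [-1, 1] are assumptions, while only the 640×480×2 binocular image shape comes from the description:

```python
import numpy as np
from gym import spaces  # the Gym interface referenced in S6

# State: velocity command reference, filtered IMU data of the past 3 time steps
# (assumed 6 values each: 3-axis accelerometer + 3-axis gyroscope), and the past
# 3 binocular image frames of size 640x480x2.
observation_space = spaces.Dict({
    "vel_ref": spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32),
    "imu": spaces.Box(low=-np.inf, high=np.inf, shape=(3, 6), dtype=np.float32),
    "stereo": spaces.Box(low=0, high=255, shape=(3, 640, 480, 2), dtype=np.uint8),
})

# Action: the classical RC channels throttle, pitch, roll, yaw (assumed normalized).
action_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
```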
The mathematical expression of the reward in S2 is:
$R=-\sqrt{\frac{1}{N}\sum_{i=1}^{N}(v_i-v_{ref,i})^2}-Collision*(\sum_{i=1}^{N}v_i^2)$;
wherein:
$R$: reward value;
$N$: data set size;
$v_i$: speed of the i-th sample, so that $v_i-v_{ref,i}$ is the speed error;
$v_{ref,i}$: speed reference value of the i-th sample;
$Collision$: collision penalty indicator, equal to 1 when the unmanned aerial vehicle collides with an obstacle and 0 otherwise;
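As a non-limiting illustration, the reward formula above maps directly to the following Python sketch; the array-based interface is an assumption and is not part of the claimed method:

```python
import numpy as np

def compute_reward(v, v_ref, collided):
    """R = -sqrt(mean((v_i - v_ref_i)^2)) - Collision * sum(v_i^2)."""
    v = np.asarray(v, dtype=np.float64)          # speeds of the N samples
    v_ref = np.asarray(v_ref, dtype=np.float64)  # speed reference values
    rms_error = np.sqrt(np.mean((v - v_ref) ** 2))
    collision = 1.0 if collided else 0.0         # 1 only when an obstacle is hit
    return -rms_error - collision * np.sum(v ** 2)
```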
For the loss function, the present patent uses a mean square error loss function. This loss function measures the error between the predicted value and the true value, so that the unmanned aerial vehicle can continuously optimize its prediction capability and improve the control effect; through continuous learning and optimization the unmanned aerial vehicle can achieve autonomous control and stable flight, which improves its application value and range of application;
the mathematical expression of the loss function is:
$L=\frac{1}{2N}\sum_{i=1}^{N}(Q(s_i,a_i)-y_i)^2$;
wherein:
$ N $: training set size;
$ s_i $: the status of the ith sample;
$ a_i $: an action of the ith sample;
$Q(s_i,a_i)$: the output of the neural network;
$ y_i $: target value of the i-th sample;
When the unmanned aerial vehicle moves along the expected trajectory it obtains a positive reward, and otherwise a negative reward; this reward drives the unmanned aerial vehicle to continuously optimize its own control strategy during learning and finally achieve a stable control effect.
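As a non-limiting illustration of the mean square error loss defined above (a minimal PyTorch sketch; the choice of PyTorch is an assumption):

```python
import torch

def mse_loss(q_pred, y_target):
    """L = 1/(2N) * sum_i (Q(s_i, a_i) - y_i)^2, averaged over the batch."""
    return 0.5 * torch.mean((q_pred - y_target) ** 2)
```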
In the structure in S4, the actor network and the critic network of PPO are both derived from the basic network structure, and they share a part of the feature extraction network.
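As a non-limiting illustration of a basic network built from convolution layers and fully connected layers, with actor and critic heads sharing part of the feature extraction network; all layer sizes and channel counts below are assumptions, not the patented architecture:

```python
import torch
import torch.nn as nn

class SharedFeatures(nn.Module):
    """Illustrative base network: convolution layers for the stereo frames plus a
    fully connected layer fusing IMU data and the velocity reference."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=5, stride=4), nn.ReLU(),   # 3 frames x 2 views stacked as channels
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),             # -> 32 * 4 * 4 = 512 features
        )
        self.fc = nn.Sequential(nn.Linear(512 + 3 * 6 + 3, 128), nn.ReLU())

    def forward(self, stereo, imu, vel_ref):
        x = torch.cat([self.conv(stereo), imu.flatten(1), vel_ref], dim=1)
        return self.fc(x)

class ActorCritic(nn.Module):
    """Actor and critic heads sharing part of the feature extraction network."""
    def __init__(self):
        super().__init__()
        self.features = SharedFeatures()
        self.actor = nn.Linear(128, 4)   # throttle, pitch, roll, yaw
        self.critic = nn.Linear(128, 1)  # state value V(s)

    def forward(self, stereo, imu, vel_ref):
        h = self.features(stereo, imu, vel_ref)
        return self.actor(h), self.critic(h)
```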
The training method in S4 includes the following steps:
the network training is divided into two phases:
S41, first stage: training a portion of the model in a safe, obstacle-free, floor-free environment;
S42, second stage: training in an environment with obstacles while keeping the model parameters of the speed monitoring network fixed and updating the parameters of the remaining networks;
The segmented learning method divides a complex task into two sections, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be operated separately when necessary, giving the method relatively strong adaptability.
The policy network in S5 includes the following steps:
S51, updating the policy network:
first, the advantage function $A_t$ is calculated:
$A_t=Q_t-V(s_t)$;
where $Q_t$ is the expected return obtained after performing action $a_t$ in state $s_t$ at time $t$, and $V(s_t)$ is the expected return from state $s_t$, i.e. the average performance level over actions in that state, so that $A_t$ measures how much better the performed action is than this average level;
then, the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$ is calculated:
$r_t(\theta)=\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$;
then, the updated policy parameter $\theta^{'}$ is calculated:
$\theta^{'}=argmax_{\theta}L_{CLIP}(\theta)$;
where $L_{CLIP}$ is the clipped objective function employed in the PPO algorithm to constrain the difference between the new policy and the old policy; specifically, $L_{CLIP}$ is defined as:
$L_{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)A_t,clip(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]$;
where $\epsilon$ is a hyper-parameter used to control the step size of the policy update.
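As a non-limiting illustration, $L_{CLIP}$ can be computed from per-sample log-probabilities as follows (a PyTorch sketch; the negative is returned so that a gradient-descent optimizer maximizes $L_{CLIP}$, and $\epsilon=0.2$ is an assumed value, not one taken from the patent):

```python
import torch

def clipped_policy_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Negative of the clipped surrogate L_CLIP over a batch of samples."""
    ratio = torch.exp(log_prob_new - log_prob_old)                    # r_t(theta)
    unclipped = ratio * advantage                                     # r_t * A_t
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return -torch.mean(torch.min(unclipped, clipped))                 # -E_t[min(., .)]
```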
The value network in S5 includes the following steps:
S52, updating the value network:
the value network is updated mainly by minimizing a squared error; specifically, the target value $y_t$ is defined as the sum of the return $r_t$ of the current state and the discounted estimated value $V(s_{t+1})$ of the next state:
$y_t=r_t+\gamma V(s_{t+1})$;
where $\gamma$ is a discount factor that controls the importance of future rewards; the loss function $L_{VF}$ is then defined as the squared error between the predicted value and the target value:
$L_{VF}=(y_t-V(s_t))^2$;
then stochastic gradient descent is used to minimize the loss function $L_{VF}$ and update the parameters of the value network;
thus, the network update of the PPO algorithm can be summarized as the following steps:
calculating an advantage function $A_t$;
calculating the probability ratio $r_t(\theta)$;
calculating the clipped objective function $L_{CLIP}(\theta)$ and updating the policy network parameters $\theta^{'}$;
calculating the target value $y_t$;
calculating the loss function $L_{VF}$ and updating the value network parameters.
It is noted that, when calculating the target value $y_t$, since the state space of the unmanned aerial vehicle includes the speed command reference value, the binocular images and the IMU data, a neural network is needed to estimate the value $V(s_{t+1})$. Specifically, the current state is taken as the input of the neural network, the predicted value $V(s_{t+1})$ is used to form the target value, and stochastic gradient descent is then used to minimize the squared error. In this way the state value of the unmanned aerial vehicle can be estimated by the neural network, better realizing autonomous control and stable flight.
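As a non-limiting illustration of the value update described above (a PyTorch sketch; $\gamma=0.99$ is an assumed default discount factor):

```python
import torch

def value_loss(rewards, values, next_values, gamma=0.99):
    """y_t = r_t + gamma * V(s_{t+1}); L_VF = (y_t - V(s_t))^2, averaged over the batch."""
    targets = (rewards + gamma * next_values).detach()   # y_t, treated as a fixed target
    return torch.mean((targets - values) ** 2)           # mean squared error L_VF
```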
The selection of the simulation environment in S6 includes the following steps:
the training simulator used is Flightmare, developed by the University of Zurich research team and based on Gazebo physics simulation and Unity rendering; it provides good unmanned aerial vehicle dynamics simulation and binocular images close to the real world, while its running speed is sufficient to support the huge sampling volume required by reinforcement learning;
acquisition of state space data from the environment and operation of the unmanned aerial vehicle are based on the Gym interface, and the implementation, training and gradient updating of the neural network rely on mature tools that are easy to implement.
The network is trained with a sectional reinforcement learning method inspired by curriculum learning: a complex task is divided into two sections, one for speed control and the other for obstacle avoidance control, which reduces the training difficulty of the model and improves training efficiency. In addition, the multi-mode fusion network design improves the perception of the neural-network-controlled unmanned aerial vehicle and allows the goal to be completed even under high delay.
The parts, steps or components with a direct influence are the sectional learning method in the fourth part and the design of the multi-mode fusion network; these two innovations enable the neural network to complete tasks more efficiently and give it stronger adaptability and better interpretability.
Under the condition of achieving the same technical purpose, other alternatives to the technical scheme of the invention may exist. For example, other reinforcement learning algorithms such as DDPG or TD3 could be used to train the neural network, and other unmanned aerial vehicle dynamics simulators and robot operating systems could be used to realize remote control of the unmanned aerial vehicle. However, the sectional reinforcement learning method and the multi-mode fusion network design adopted by the invention improve the perception and control effect of the neural-network-controlled unmanned aerial vehicle while providing better interpretability and adaptability, advantages that the alternatives cannot replace; therefore, the technical scheme of the invention has higher superiority and feasibility under the condition of achieving the same technical purpose.
Compared with the related art, the unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning has the following beneficial effects:
training difficulty reduction and training efficiency improvement caused by sectional training
The method for training the network by adopting the sectional reinforcement learning inspired by course learning divides a complex network into two sections for training, one is speed control and the other is obstacle avoidance control, the training difficulty of a model can be reduced, the decoupled neural network is beneficial to enhancing the interpretability of the neural network, the two networks can be separately operated when necessary, and the method has stronger adaptability.
The design of the multi-mode fusion network improves the perception of the nerve unmanned aerial vehicle and can also finish the target under high delay.
The neural network provided by the invention has a multi-mode fusion design, which comprises a speed instruction reference value, a binocular image, IMU data and other data inputs, and the multi-mode fusion design can help the unmanned aerial vehicle to estimate the state value more accurately, so that the control effect is improved.
In addition, the invention also adopts a basic network structure formed by a convolution layer and a full connection layer, and designs a plurality of network structures such as a speed monitoring network, an actor network, a critic network and the like on the basis of the basic network structure, thereby further improving the expression capacity and the control effect of the network.
Based on binocular vision and imu data, the data is directly input into a neural network for remote control, and the control implementation cost is low.
Based on binocular vision and imu data, expensive other sensors are not needed, gps is not relied on, and obstacle avoidance can be achieved indoors.
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (9)

1. The unmanned aerial vehicle remote control algorithm based on the sectional reinforcement learning is characterized by comprising the following steps of:
S1, defining the state space and the action space: the state space and an action space consisting of classical RC remote control signals;
S2, reward setting and loss function: to train the neural network, a reward and a loss function must be specified; the reward is the negative root mean square of the error between the speed and its reference value, plus a collision penalty, where the collision penalty depends on the speed at the moment an obstacle is hit and is 0 when no obstacle is hit;
S3, designing the network structure: the basic network structure consists of convolution layers and fully connected layers, and part of the network is a speed monitoring network that is trained in advance in a safe environment;
S4, a network training method: a sectional reinforcement learning method inspired by curriculum learning is adopted so that the network can be trained efficiently and quickly;
S5, a network updating method: the network update follows the PPO update scheme, which comprises two parts, a policy network that generates the action sequence and a value network that evaluates the value of the current state;
S6, selecting a simulation environment.
2. The unmanned aerial vehicle remote control algorithm based on segmented reinforcement learning of claim 1, wherein the state space in S1 comprises: the speed instruction reference value, the filtered IMU data of the past three time steps, and the 640×480×2 binocular images of the past three time steps, which together form the state space; this gives the neural network predictive capability to cope with high transmission delay during flight while not imposing excessive training difficulty on the network.
3. The unmanned aerial vehicle remote control algorithm based on segmented reinforcement learning of claim 1, wherein the action space in S1 consists of the classical RC remote control signals: throttle, pitch, roll and yaw; this action space is sufficient for the unmanned aerial vehicle to perform high-difficulty maneuvers in three-dimensional space and realizes remote control of the unmanned aerial vehicle.
4. The unmanned aerial vehicle remote control algorithm based on segmented reinforcement learning of claim 1, wherein the mathematical expression of the reward in S2 is:
$R=-\sqrt{\frac{1}{N}\sum_{i=1}^{N}(v_i-v_{ref,i})^2}-Collision*(\sum_{i=1}^{N}v_i^2)$;
wherein:
$R$: reward value;
$N$: data set size;
$v_i$: speed of the i-th sample, so that $v_i-v_{ref,i}$ is the speed error;
$v_{ref,i}$: speed reference value of the i-th sample;
$Collision$: collision penalty indicator, equal to 1 when the unmanned aerial vehicle collides with an obstacle and 0 otherwise;
For the loss function, the present patent uses a mean square error loss function. This loss function measures the error between the predicted value and the true value, so that the unmanned aerial vehicle can continuously optimize its prediction capability and improve the control effect; through continuous learning and optimization the unmanned aerial vehicle can achieve autonomous control and stable flight, which improves its application value and range of application;
the mathematical expression of the loss function is:
$L=\frac{1}{2N}\sum_{i=1}^{N}(Q(s_i,a_i)-y_i)^2$;
wherein:
$ N $: training set size;
$ s_i $: the status of the ith sample;
$ a_i $: an action of the ith sample;
$Q(s_i,a_i)$: the output of the neural network;
$ y_i $: target value of the i-th sample;
When the unmanned aerial vehicle moves along the expected trajectory it obtains a positive reward, and otherwise a negative reward; this reward drives the unmanned aerial vehicle to continuously optimize its own control strategy during learning and finally achieve a stable control effect.
5. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning of claim 1, wherein, in the structure in S4, the actor network and the critic network of PPO are both derived from the basic network structure, and they share a part of the feature extraction network.
6. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning according to claim 1, wherein the training method in S4 comprises the steps of:
the network training is divided into two phases:
S41, first stage: training a portion of the model in a safe, obstacle-free, floor-free environment;
S42, second stage: training in an environment with obstacles while keeping the model parameters of the speed monitoring network fixed and updating the parameters of the remaining networks;
The segmented learning method divides a complex task into two sections, one for speed control and the other for obstacle avoidance control; this reduces the training difficulty of the model, the decoupled neural network has better interpretability, and the two networks can be operated separately when necessary, giving the method relatively strong adaptability.
7. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning of claim 1, wherein the policy network in S5 comprises the steps of:
S51, updating the policy network:
first, the advantage function $A_t$ is calculated:
$A_t=Q_t-V(s_t)$;
where $Q_t$ is the expected return obtained after performing action $a_t$ in state $s_t$ at time $t$, and $V(s_t)$ is the expected return from state $s_t$, i.e. the average performance level over actions in that state, so that $A_t$ measures how much better the performed action is than this average level;
then, the probability ratio between the new policy $\pi_{\theta}$ and the old policy $\pi_{\theta_{old}}$ is calculated:
$r_t(\theta)=\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$;
then, the updated policy parameter $\theta^{'}$ is calculated:
$\theta^{'}=argmax_{\theta}L_{CLIP}(\theta)$;
where $L_{CLIP}$ is the clipped objective function employed in the PPO algorithm to constrain the difference between the new policy and the old policy; specifically, $L_{CLIP}$ is defined as:
$L_{CLIP}(\theta)=E_{t}\left[\min\left(r_t(\theta)A_t,clip(r_t(\theta),1-\epsilon,1+\epsilon)A_t\right)\right]$;
where $\epsilon$ is a hyper-parameter used to control the step size of the policy update.
8. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning of claim 7, wherein the value network in S5 comprises the steps of:
S52, updating the value network:
the value network is updated mainly by minimizing a squared error; specifically, the target value $y_t$ is defined as the sum of the return $r_t$ of the current state and the discounted estimated value $V(s_{t+1})$ of the next state:
$y_t=r_t+\gamma V(s_{t+1})$;
where $\gamma$ is a discount factor that controls the importance of future rewards; the loss function $L_{VF}$ is then defined as the squared error between the predicted value and the target value:
$L_{VF}=(y_t-V(s_t))^2$;
then stochastic gradient descent is used to minimize the loss function $L_{VF}$ and update the parameters of the value network;
thus, the network update of the PPO algorithm can be summarized as the following steps:
calculating an advantage function $A_t$;
calculating the probability ratio $r_t(\theta)$;
calculating the clipped objective function $L_{CLIP}(\theta)$ and updating the policy network parameters $\theta^{'}$;
calculating the target value $y_t$;
calculating the loss function $L_{VF}$ and updating the value network parameters.
9. The unmanned aerial vehicle remote control algorithm based on the segmented reinforcement learning according to claim 1, wherein the selection of the simulation environment in S6 comprises the steps of:
the training simulator used is Flightmare, developed by the University of Zurich research team and based on Gazebo physics simulation and Unity rendering; it provides good unmanned aerial vehicle dynamics simulation and binocular images close to the real world, while its running speed is sufficient to support the huge sampling volume required by reinforcement learning;
acquisition of state space data from the environment and operation of the unmanned aerial vehicle are based on the Gym interface, and the implementation, training and gradient updating of the neural network rely on mature tools that are easy to implement.
CN202311006154.7A 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning Pending CN117406762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311006154.7A CN117406762A (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311006154.7A CN117406762A (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Publications (1)

Publication Number Publication Date
CN117406762A true CN117406762A (en) 2024-01-16

Family

ID=89487756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311006154.7A Pending CN117406762A (en) 2023-08-10 2023-08-10 Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning

Country Status (1)

Country Link
CN (1) CN117406762A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117826867A (en) * 2024-03-04 2024-04-05 之江实验室 Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium
CN117826867B (en) * 2024-03-04 2024-06-11 之江实验室 Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium


Similar Documents

Publication Publication Date Title
Ruan et al. Mobile robot navigation based on deep reinforcement learning
US11062617B2 (en) Training system for autonomous driving control policy
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
CN108819948B (en) Driver behavior modeling method based on reverse reinforcement learning
CN111578940B (en) Indoor monocular navigation method and system based on cross-sensor transfer learning
WO2019076044A1 (en) Mobile robot local motion planning method and apparatus and computer storage medium
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN112034887A (en) Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
Li et al. Oil: Observational imitation learning
CN111580544A (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN108791302B (en) Driver behavior modeling system
Maravall et al. Vision-based anticipatory controller for the autonomous navigation of an UAV using artificial neural networks
Kim et al. Towards monocular vision-based autonomous flight through deep reinforcement learning
CN113741533A (en) Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning
Zijian et al. Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments
Gao et al. Autonomous driving based on modified sac algorithm through imitation learning pretraining
CN114077258A (en) Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
CN114170454A (en) Intelligent voxel action learning method based on joint grouping strategy
CN116824303B (en) Structure inspection agent navigation method based on damage driving and multi-mode multi-task learning
Wang et al. Autonomous target tracking of multi-UAV: A two-stage deep reinforcement learning approach with expert experience
Helble et al. 3-d path planning and target trajectory prediction for the oxford aerial tracking system
WO2021008798A1 (en) Training of a convolutional neural network
CN117406762A (en) Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning
CN114609925B (en) Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
Wang et al. Towards better generalization in quadrotor landing using deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination