CN116225046A - Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment - Google Patents

Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Info

Publication number
CN116225046A
CN116225046A (application CN202211099640.3A)
Authority
CN
China
Prior art keywords
action
state
unmanned aerial
aerial vehicle
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211099640.3A
Other languages
Chinese (zh)
Inventor
贺楚超
田琳宇
辛泊言
王鹏
吕志刚
邸若海
李晓艳
许韫韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Technological University
Original Assignee
Xian Technological University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Technological University filed Critical Xian Technological University
Priority to CN202211099640.3A priority Critical patent/CN116225046A/en
Publication of CN116225046A publication Critical patent/CN116225046A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/106Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, which solves the problems in the prior art that the experience replay mechanism cannot preferentially extract important samples and that rewards are sparse. The invention comprises the following steps: 1) establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, and randomly generating the number and positions of the obstacles and the starting point of the unmanned aerial vehicle; 2) establishing an environment model based on the Markov decision process framework and designing a ladder reward mechanism; 3) selecting an action based on the state and the strategy, forming the information obtained by interaction with the environment into a five-tuple, storing it in an experience pool, and sampling according to the designed priority experience replay mechanism; 4) performing a network update with an improved DQN algorithm on the samples drawn from the environment model, and assigning a value to the state-action pair of each sample; 5) selecting the optimal action according to the Q value of each action in the sampled state, and thereby obtaining the optimal strategy.

Description

Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
Technical field:
the invention belongs to the technical field of reinforcement learning and unmanned aerial vehicle obstacle avoidance, and relates to an unmanned aerial vehicle autonomous motion path planning method based on deep reinforcement learning under an unknown environment.
Background art:
the use of unmanned aerial vehicles in a variety of practical tasks, such as intelligence, surveillance and reconnaissance, suppression of enemy air defense, search and rescue, and cargo transportation, has been on the rise over the past few years. In these applications, a key requirement is to build an intelligent system that allows the drone to perform tasks autonomously, without any human intervention. In particular, there is a need to develop advanced intelligent technologies that autonomously navigate a drone from an arbitrary departure point to a destination in a dynamic, unknown environment while avoiding obstacles and threats en route. To achieve this, two challenges need to be overcome:
1) Partial observability of the environment. The unmanned aerial vehicle has no knowledge of the environment at the beginning and can perceive only partial information during the task. This feature makes some rule-based path planning methods unusable, because it is impossible to design complete rules for all possible scenarios in the face of an uncertain environment.
2) Unpredictability of the environment. Irregular movement of scattered objects creates an unstable environment for the unmanned aerial vehicle, and navigation methods based on simultaneous localization and mapping (Simultaneous Localization And Mapping, SLAM) become problematic because moving objects require continuous re-mapping, which results in an unacceptable computational cost. Furthermore, the open-loop mechanism of sensor-planning-based approaches makes decisions without any prediction or inference about the future, hampering their suitability for dynamic environments.
To address these challenges, researchers have resorted to reinforcement learning (Reinforcement Learning, RL) techniques and have focused on designing learning-based planners for unmanned aerial vehicles. As a machine learning algorithm, RL is often used to solve sequential decision problems and has a profound link to approximate dynamic programming (Approximate Dynamic Programming, ADP). The special mechanism of RL enables it to learn an intelligent planner through trial-and-error interaction with the environment. An RL-based planner uses a Markov decision process (Markov Decision Process, MDP) to model the problem and generates a strategy based on the predicted long-term return, which enables RL to adapt to a random dynamic environment without knowing the system model. However, the "dimension disaster" (curse of dimensionality) problem prevents further application of the conventional RL algorithm. To solve this problem and maintain a better representation of high-dimensional continuous state spaces, deep neural networks were introduced into traditional RL and deep reinforcement learning (Deep Reinforcement Learning, DRL) methods were developed. By combining the perception capability of deep learning (Deep Learning, DL) with the decision capability of RL, DRL achieves excellent performance in the field of unmanned aerial vehicle motion planning.
However, conventional deep reinforcement learning methods still have drawbacks in certain aspects; for example, the conventional experience replay mechanism extracts samples with equal probability, so that a large number of valuable samples are overwritten without ever being extracted. In addition, conventional reward mechanisms often face sparse rewards in larger environments, which greatly affects the learning efficiency of the algorithm.
Summary of the invention:
the invention aims to provide an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, which solves the problems that an experience replay mechanism cannot extract important samples and sparse rewards in the prior art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment is characterized by comprising the following steps of: the method comprises the following steps:
1) Establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, randomly generating the number and the positions of the obstacles and the starting point of the unmanned aerial vehicle;
2) Establishing an environment model based on a Markov decision process framework, and designing a ladder rewarding mechanism;
3) Selecting an action based on the state and the strategy; after taking the action, the unmanned aerial vehicle interacts with the environment to generate a new state and calculates the obtained reward; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool, and a SumTree sampling method is used for batch sampling from the obtained experience pool according to a priority experience replay mechanism to train a network model;
4) Performing network updating on a sample obtained by sampling an environment model by adopting an improved DQN algorithm, and assigning a value to a state-action pair of the sample;
5) And selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
In step 1), a two-dimensional world is created for training and testing and threats are set in the two-dimensional world, wherein the starting position of the drone is fixed while the positions of the threats and the target vary randomly.
Step 2) comprises the steps of:
S2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states;
S2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set;
S2-3, the reward signal R describes the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1; a ladder reward mechanism is designed, namely, on the premise of fully considering the characteristics of the motion planning problem, rewards are dynamically set according to the distance between the unmanned aerial vehicle and the designated target, so as to enrich the intermediate reward information in the movement process of the unmanned aerial vehicle;
S2-4, the policy π of an individual is described as the basis on which the individual takes actions, namely, the individual selects actions according to the policy π;
S2-5, the value v_π(s) after an Agent action describes the value obtained by the individual after taking an action, expressed as an expectation in terms of the policy π and the state s;
S2-6, the reward decay factor γ lies in [0,1]; if γ = 0, the approach is greedy, i.e., the value is determined only by the current delayed reward; if γ = 1, all subsequent state rewards are weighted equally with the current reward; most of the time a number between 0 and 1 is taken, i.e., the current delayed reward carries more weight than subsequent rewards;
S2-7, the state transition model of the environment is expressed as a probability model, i.e., the probability of taking action a in state s and moving to the next state s′, expressed as
P_{ss′}^a = P(S_{t+1} = s′ | S_t = s, A_t = a)
S2-8, the exploration rate ε is described as the probability that the Agent explores when selecting the next action; this ratio is used in the reinforcement learning training iteration process.
Step 3) comprises the steps of:
s3-1, establishing a data buffer area with the capacity of MEMORY_SIZE for storing historical experience, and initializing to be empty;
s3-2, continuously collecting historical experiences of interaction between the unmanned aerial vehicle and the environment, and storing the historical experiences into an experience pool;
the interaction process is as follows: the unmanned aerial vehicle acquires the environmental state information as the current state information S and obtains its feature vector φ(S); taking the obtained feature vector φ(S) as input, the Q values of all actions in the current state are evaluated, and the optimal action A in the current Q value output is selected according to an ε-greedy strategy combined with heuristic search rules; the unmanned aerial vehicle executes the action to obtain the environmental state at the next moment, and thereby the state information S′ at the next moment.
S3-3, storing the historical experience data into the experience pool; if the number of data items in the experience pool is greater than its maximum capacity, the latest experience data replaces the oldest experience data;
S3-4, carrying out batch sampling from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism; the priority of each sample is set proportional to the absolute value |δ(t)| of its TD error, and the priority value is stored in the experience replay pool; samples are extracted under priority considerations using a binary tree structure based on SumTree.
Step 4) comprises the steps of:
S4-1, calculating the current target Q value y_j;
S4-2, using the mean square error loss function
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²
and updating all parameters w of the Q network by gradient back-propagation of the neural network, wherein m is the number of samples of the batch gradient descent and A_j is the action set of the current iteration round;
S4-3, recalculating the TD error of all samples: δ_j = y_j + γ max_{a′} Q(s′, a′) − Q(s, a) + τ, and updating the priority of all nodes in SumTree: p_j = |δ_j|, where τ varies with distance;
S4-4, if i % C = 1, updating the target Q network parameter w′ = w, wherein i is the current iteration round number and C is the update frequency of the target Q network parameters;
S4-5, after gradient updating of the Q network parameters, the TD error is recalculated and updated in SumTree.
Step 5) comprises the steps of:
the feature vector φ(S) obtained from the state sequence is used as input to evaluate the Q value of each action in the current state; the action selection strategy then selects the corresponding action from the current Q value output according to an ε-greedy strategy combined with heuristic search rules, and determines the flight direction of the unmanned aerial vehicle.
Compared with the prior art, the invention has the following advantages and effects:
according to the unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning disclosed by the invention, the problems of obstacle avoidance and path optimization of the unmanned aerial vehicle in a dynamic unknown environment are solved by applying deep reinforcement learning, and the autonomous flight capability of the unmanned aerial vehicle is improved without depending on an environment model or prior knowledge. The ladder reward mechanism provided by the invention dynamically sets rewards according to the distance between the unmanned aerial vehicle and the designated target, enriches the intermediate reward information in the movement process of the unmanned aerial vehicle, and overcomes the sparse reward problem; the priority-based experience replay mechanism calculates the priority of each sample based on the TD error, fully considers the importance of sample information in the sampling process, raises the sampling probability of important samples, and improves the overall learning efficiency of the algorithm.
Description of the drawings:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic representation of a SumTree in accordance with a preferred embodiment of the present invention;
FIG. 3 is a diagram of a DQN network architecture according to a preferred embodiment of the present invention;
FIG. 4 is a diagram comparing the optimal routes and the required numbers of steps found by the different agents during the experimental stage of a preferred embodiment of the present invention;
FIG. 5 is a comparison diagram of learning efficiency according to a preferred embodiment of the present invention.
Detailed description of the embodiments:
the exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
The invention provides an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, which comprises the following steps:
step 1, establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, and randomly generating the number and positions of barriers (threats) and the starting point of the unmanned aerial vehicle;
step 2, establishing an environment model based on a Markov process framework, wherein the environment model comprises a state space S, an action space A, a reward function R, a reward attenuation factor gamma, an exploration rate epsilon and the like; designing a ladder rewarding mechanism;
step 3, selecting an action based on the state and the strategy; after taking the action, the unmanned aerial vehicle interacts with the environment to generate a new state and calculates the reward; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool with capacity MEMORY_SIZE (in the initial state, the experience pool is empty), and batch sampling is carried out from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism to train the network model;
step 4, adopting an improved DQN algorithm to update the network based on the samples obtained by sampling the environmental model, and assigning values to the state-action pairs of the samples;
step 5, selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
Examples:
referring to fig. 1, a flow chart of an autonomous path planning method of an unmanned aerial vehicle in an unknown environment based on deep reinforcement learning according to the present invention includes the following steps:
step 1, an unmanned aerial vehicle autonomous movement flight model is established in a two-dimensional space, the number and the positions of barriers (threats) and the starting point of the unmanned aerial vehicle are randomly generated, and the step 1 specifically comprises:
A two-dimensional 40 × 20 m² world is built for training and testing, and 150 threats are set in the two-dimensional world. The starting position of the unmanned aerial vehicle is fixed; the invention places it in the upper-left corner of the 2-dimensional world, while the threat locations vary randomly, appearing at random in the 2-dimensional world as the drone moves. In addition, the target is not stationary and changes position as the drone moves. A minimal sketch of such a world is given below.
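For concreteness, the following is a minimal Python sketch of such a two-dimensional world, assuming the four actions {up, down, left, right} used in step 2-2 and the episode-end conditions listed later in the embodiment. The reward returned by step() follows the ladder (tiered) idea of step 2-3 below; since formula (1) is given only as an image in the original filing, the distance tiers, reward values, class name and function names here are illustrative assumptions, and the re-randomisation of threat and target positions during flight is omitted.

```python
import math
import random

ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}   # up, down, left, right

def ladder_reward(uav, target, hit_threat=False, reached=False,
                  tiers=((5.0, 2.0), (10.0, 1.0), (20.0, 0.5))):
    """Tiered (ladder) reward based on the UAV-target distance; values are illustrative."""
    if reached:
        return 10.0                      # large positive terminal reward at the target
    if hit_threat:
        return -10.0                     # large negative terminal reward on collision
    d = math.dist(uav, target)
    for threshold, r in tiers:           # closer tiers earn larger intermediate rewards
        if d <= threshold:
            return r
    return -0.1                          # small step penalty far from the target

class GridWorld:
    """40 x 20 two-dimensional world with a fixed start and 150 randomly placed threats."""

    def __init__(self, width=40, height=20, n_threats=150):
        self.width, self.height, self.n_threats = width, height, n_threats
        self.reset()

    def reset(self):
        self.uav = (0, 0)                # start fixed in the upper-left corner
        cells = [(x, y) for x in range(self.width) for y in range(self.height)
                 if (x, y) != self.uav]
        picks = random.sample(cells, self.n_threats + 1)
        self.threats, self.target = set(picks[:-1]), picks[-1]
        return self.uav

    def step(self, action):
        dx, dy = ACTIONS[action]
        x, y = self.uav[0] + dx, self.uav[1] + dy
        out = not (0 <= x < self.width and 0 <= y < self.height)
        if not out:
            self.uav = (x, y)
        reached = self.uav == self.target
        hit = self.uav in self.threats
        done = reached or hit or out     # episode ends: target reached, threat hit, or boundary left
        return self.uav, ladder_reward(self.uav, self.target, hit, reached), done
```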
Step 2, establishing an environment model based on a Markov process framework; including state space S, action space a, reward function R, reward decay factor γ, exploration rate epsilon, etc. Step 2 can be divided into the following sub-steps:
Step 2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states.
Step 2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set, and the unmanned aerial vehicle selects among the four actions {up, down, left, right}.
Step 2-3, the reward signal R is described as the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t will be obtained at time t+1.
As a further improvement of the invention: considering that the unmanned aerial vehicle often needs to go through many steps in the flight process in order to avoid the obstacle to reach the specified target when performing the daily task, the adoption of the conventional sparse rewards brings a plurality of ineffective rewards to the unmanned aerial vehicle Agent. In order to solve the problem, the invention provides a step rewarding mechanism, namely, rewards are dynamically set according to the distance between the unmanned aerial vehicle and a designated target on the premise of fully considering the characteristics of a movement planning problem, so that intermediate rewarding information in the movement process of the unmanned aerial vehicle is enriched, and the implementation method is shown in a formula (1).
Formula (1): the ladder reward, assigned in tiers according to the distance between the unmanned aerial vehicle and the designated target (an illustrative tiered reward of this kind is included in the sketch given after step 1 above).
Compared with the traditional sparse reward, the improved reward signal strengthens the connection between the reward and the motion path of the unmanned aerial vehicle and avoids the occurrence of a large number of useless rewards. Meanwhile, the improved reward signal greatly improves the convergence efficiency of the training process, reduces the overall training time and facilitates the interpretation of the optimal solution.
Step 2-4, the policy pi of the individual is described as the basis for the individual to take an action, i.e., the individual will select an action according to the policy pi.
Step 2-5, the value v_π(s) after an Agent action describes the value obtained by the individual after taking an action, typically as an expectation in terms of the policy π and the state s. Although the current action yields a delayed reward R_{t+1}, looking only at this delayed reward is not feasible, because a high current delayed reward does not imply high subsequent rewards at t+2, t+3, and so on. The value therefore integrates the current and subsequent delayed rewards, and the value function can generally be expressed as equation (2).
v_π(s) = E_π(R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s)    (2)
Step 2-6, rewarding attenuation factor gamma, taking a number between 0 and 1, namely, the weight of the current delay rewarding is larger than that of the subsequent rewarding; in this embodiment, the prize decay factor is taken to be 0.90.
The state transition model of the environment in step 2-7 can be understood as a probabilistic state machine and expressed as a probability model, i.e., the probability of taking action a in state s and going to the next state s′, expressed as
P_{ss′}^a = P(S_{t+1} = s′ | S_t = s, A_t = a)
Step 2-8, the exploration rate epsilon is described as the probability of the Agent to select the next action. Taken as 0.90 in this embodiment.
Step 3, selecting an action based on the state and the strategy; after the unmanned aerial vehicle takes the action, interaction with the environment generates a new state and the obtained reward is calculated; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool with capacity MEMORY_SIZE (the experience pool is empty in the initial state); batch sampling is then carried out from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism to train the network model. Step 3 can be divided into the following sub-steps:
Step 3-1, input and output of the algorithm: the iteration round number T, the state feature dimension n, the action set A, the step length α, the sampling weight coefficient β, the decay factor γ, the exploration rate ε, the current Q network Q, the target Q network Q′, the number m of samples for batch gradient descent, and the target Q network parameter update frequency C are taken as the input of the algorithm; the Q network parameters are the output of the algorithm.
Step 3-2, parameter initialization of the algorithm: randomly initialize the values Q corresponding to all states and actions; randomly initialize all parameters w of the current Q network; initialize the parameters of the target Q network Q′ as w′ = w; initialize the default data structure of the experience replay SumTree, setting the priorities p_j of all SumTree leaf nodes to 1.
Step 3-3, carrying out T iterations from 1 to T in the following substeps:
and step 3-3-1, initializing the S to be the first state of the current state sequence, and obtaining the characteristic vector phi (S) of the current state sequence.
And step 3-3-2, using the characteristic vector phi (S) as input in the Q network to obtain Q value output corresponding to all actions of the Q network. And selecting a corresponding action A in the current Q value output according to an epsilon-greedy strategy combined with heuristic search rules.
Step 3-3-3, executing the current action A in the state S to obtain the feature vector φ(S′) and the reward signal R corresponding to the new state S′, together with the termination flag is_end.
Step 3-3-4, storing the five-tuple {φ(S), A, R, φ(S′), is_end} into SumTree;
Step 3-3-5, updating the state S = S′ at this time; a sketch of this interaction loop is given below.
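A sketch of sub-steps 3-3-1 to 3-3-5 is given below. It assumes an environment like the GridWorld sketched earlier, a Q-network callable returning one Q value per action, and a replay structure exposing an add() method; the heuristic search rules combined with the ε-greedy selection and the exact feature mapping φ(S) are simplified to placeholders.

```python
import random
import numpy as np

def epsilon_greedy(q_net, phi_s, epsilon, n_actions=4):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_net(phi_s)))             # action with the largest Q value

def feature_fn(state, width=40, height=20):
    """Placeholder feature vector phi(S): normalised UAV coordinates."""
    return np.array([state[0] / width, state[1] / height], dtype=np.float32)

def run_episode(env, q_net, replay, epsilon):
    """Collect one episode and store (phi(S), A, R, phi(S'), is_end) five-tuples."""
    s, done = env.reset(), False
    while not done:
        phi_s = feature_fn(s)
        a = epsilon_greedy(q_net, phi_s, epsilon)   # sub-step 3-3-2
        s_next, r, done = env.step(a)               # sub-step 3-3-3
        # sub-step 3-3-4: new transitions enter the tree with the initial priority of 1
        replay.add(1.0, (phi_s, a, r, feature_fn(s_next), done))
        s = s_next                                  # sub-step 3-3-5
```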
Step 3-4, carrying out batch sampling from the obtained experience pool using the SumTree sampling method according to priority.
As a further improvement of the invention: the experience replay mechanism in the traditional DQN algorithm only saves the sample states, actions, rewards and other data obtained by interaction with the environment, and does not introduce the concept of priority. The invention improves the experience replay mechanism of the traditional DQN algorithm: the priority of each sample is set proportional to the absolute value |δ(t)| of its TD error, and the priority value is stored in the experience replay pool.
In the empirical playback pool, the effect on back propagation is different for different samples due to different TD errors. The larger the TD error, the greater the effect on the back propagation. And the sample with small TD error has little influence on the calculation of the inverse gradient. In the Q network, the TD error is the difference between the target Q value calculated by the target Q network and the Q value calculated by the current Q network.
Sampling method of the improved experience replay mechanism: sample m samples {φ(S_j), A_j, R_j, φ(S′_j), is_end_j}, j = 1, 2, ..., m; the probability of each sample being sampled is
P(j) = p_j / Σ_i p_i
and the loss-function weight is calculated as w_j = (N·P(j))^(−β) / max_i(w_i), wherein p_i, which determines the probability of the i-th sample being sampled in the current iteration of the 1-to-T iterations, is proportional to |δ(t)|, w_j is the loss-function weight for the j-th sample, and β is the sampling weight coefficient.
Fig. 2 illustrates the sampling procedure of SumTree. In the binary tree structure of SumTree, each leaf node corresponds to a value interval; the larger the value (i.e., the priority) of a leaf node, the larger its interval length, and therefore the larger the probability that a value sampled uniformly from the total value interval falls into that interval. The implementation is as follows (see the sketch after this list):
1) To extract one data item, uniformly sample a value in the total value interval (0–29 in the example of Fig. 2); assume it is v;
2) Starting from node 0 (the root) as the parent node, traverse its child nodes;
3) If the value of the left child node is greater than v, take the left child node as the new parent node and continue traversing its child nodes;
4) If the value of the left child node is smaller than v, subtract the value of the left child node from v, select the right child node as the new parent node, and continue traversing its child nodes;
5) Continue until a leaf node is reached; the value of that leaf node is the priority.
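The traversal in steps 1) to 5) can be realised with an array-backed binary tree whose internal nodes store the sum of their children's priorities. The sketch below also derives the importance-sampling weight w_j = (N·P(j))^(−β) / max_i w_i used in the loss function; the capacity value, variable names and the use of the tree capacity as N are illustrative assumptions.

```python
import numpy as np

class SumTree:
    """Array-backed binary tree: leaves hold priorities, parents hold sums of children."""

    def __init__(self, capacity):
        self.capacity = capacity                    # number of leaves (MEMORY_SIZE)
        self.tree = np.zeros(2 * capacity - 1)      # internal nodes followed by leaves
        self.data = [None] * capacity               # transitions stored beside the leaves
        self.write = 0                              # index of the slot to overwrite next

    def add(self, priority, data):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = data
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity   # replace the oldest entry when full

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                            # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, v):
        """Descend from the root: go left if v fits in the left sum, else subtract and go right."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):         # stop once a leaf index is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if v <= self.tree[left]:
                idx = left
            else:
                v -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

def sample(tree, m, beta):
    """Draw m transitions with probability proportional to priority; return IS weights."""
    total = tree.tree[0]                            # root node holds the sum of all priorities
    batch, weights = [], []
    for _ in range(m):
        v = np.random.uniform(0.0, total)           # uniform draw over the total value interval
        idx, p, data = tree.get(v)
        prob = p / total                            # P(j) = p_j / sum_i p_i
        weights.append((tree.capacity * prob) ** (-beta))   # (N * P(j))^(-beta), N ~ capacity
        batch.append((idx, data))
    w = np.array(weights)
    return batch, w / w.max()                       # normalise by max_i w_i
```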
Step 4, performing a network update on the samples obtained by sampling the environment model using the improved DQN algorithm, and assigning values to the state-action pairs of the samples; the network structure of the DQN is shown in Fig. 3. Step 4 may be divided into the following sub-steps:
Step 4-1, calculating the current target Q value y_j; the implementation method is shown in formula (3):
y_j = R_j                                          (is_end_j is true)
y_j = R_j + γ max_{a′} Q′(φ(S′_j), a′, w′)          (is_end_j is false)    (3)
Step 4-2, using the mean square error loss function
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²
and updating all parameters w of the Q network by gradient back-propagation of the neural network; wherein m is the number of samples of the batch gradient descent, and A_j is the action set of the current iteration round.
As part of the improvement of the invention, in addition to priority-based empirical replay, the invention also optimizes the loss function of the Q network:
the conventional loss function is:
L(w) = (1/m) Σ_{j=1}^{m} (y_j − Q(φ(S_j), A_j, w))²    (4)
The invention adds the sample priority to obtain a new loss function:
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²    (5)
wherein w_j, the priority weight of the j-th sample, is obtained by normalizing the TD error |δ(t)|:
w_j = (N·P(j))^(−β) / max_i(w_i) = (N·P(j))^(−β) / max_i((N·P(i))^(−β)) = (P(j))^(−β) / max_i((P(i))^(−β)) = (P(j) / min_i P(i))^(−β)    (6)
A sketch of one update step with this weighted loss is given below.
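As a sketch of how equation (3) and the weighted loss of equation (5) combine in one gradient step, the following assumes a PyTorch Q network and target network; the tensor layout, the optimiser and the helper name are illustrative, and the distance-dependent term τ of step 4-3 is omitted.

```python
import torch

def dqn_update(q_net, target_net, optimizer, batch, weights, gamma):
    """One priority-weighted MSE update; returns |TD error| for refreshing the priorities."""
    phi_s, a, r, phi_s_next, is_end = batch                   # tensors stacked over the m samples
    q_sa = q_net(phi_s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(phi(S_j), A_j, w)
    with torch.no_grad():
        max_next = target_net(phi_s_next).max(dim=1).values   # max_a' Q'(phi(S'_j), a', w')
        y = r + gamma * max_next * (1.0 - is_end)             # equation (3); y_j = R_j when terminal
    td_error = y - q_sa
    loss = (weights * td_error.pow(2)).mean()                 # equation (5): priority-weighted MSE
    optimizer.zero_grad()
    loss.backward()                                           # gradient back-propagation updates w
    optimizer.step()
    return td_error.detach().abs()                            # |delta_j| -> new SumTree priorities
```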
step 4-3, recalculating TD-error of all samples: delta j =y j +γmax a′ Q (s ', a') -Q (s, a) +τ, updating the priority p of all nodes in SumPree j =|δ j I (I); where τ varies with distance.
Step 4-4, if i% c=1, updating the target Q network parameter w' =w; wherein i is the current iteration round number, and C is the update frequency of the target Q network parameters.
And step 4-5, if S' is in a termination state, finishing the iteration of the current round, otherwise, jumping to step 3-3-2.
The Q values in the above steps 4-1 and 4-2 are calculated through the Q network. Meanwhile, for the algorithm to converge better, the exploration rate ε needs to decrease as the iterations progress.
After gradient updating of the Q network parameters, the TD error must be recalculated and updated in SumTree; when the number of training episodes reaches a preset condition, the updating process ends and the model parameters of the DQN are saved.
The update process is performed in units of episodes (Episode). In the updating process, each Episode starts from the initial state, and when the unmanned aerial vehicle meets any one of the following conditions the Episode ends and learning restarts with the next Episode: 1) reaching the target; 2) encountering a threat (task failure); 3) flying out of the task area boundary. A sketch of this outer training loop is given below.
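Putting the pieces together, a sketch of the outer update loop implied by steps 3-3 to 4-5 might look as follows. It assumes the helpers sketched above and a to_tensors() helper (assumed here) that stacks a sampled batch into tensors; the episode budget, the target-network period C and the ε decay schedule are illustrative values, not the patented settings.

```python
import torch

def train(env, q_net, target_net, optimizer, replay,
          episodes=2000, m=32, beta=0.4, gamma=0.90,
          eps=0.90, eps_min=0.05, eps_decay=0.995, sync_every=100):
    """Interact, sample by priority, update the Q network and refresh the priorities."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            phi_s = feature_fn(s)
            a = epsilon_greedy(q_net, phi_s, eps)
            s_next, r, done = env.step(a)
            replay.add(1.0, (phi_s, a, r, feature_fn(s_next), done))
            s = s_next
            batch, weights = sample(replay, m, beta)          # priority-proportional sampling
            td_abs = dqn_update(q_net, target_net, optimizer,
                                to_tensors([d for _, d in batch]),   # assumed helper
                                torch.as_tensor(weights, dtype=torch.float32), gamma)
            for (idx, _), p in zip(batch, td_abs.tolist()):
                replay.update(idx, p)                         # p_j = |delta_j|
            step += 1
            if step % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())    # w' = w every C iterations
        eps = max(eps_min, eps * eps_decay)                   # shrink exploration as training proceeds
```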
Step 5, selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
The effect of the invention can be further illustrated by the following simulation experiment:
for a full comparison, the invention constructs 4 agents: Agent1 and Agent3 adopt the traditional experience replay mechanism, Agent2 and Agent4 adopt the improved experience replay mechanism; meanwhile, Agent1 and Agent2 adopt the traditional reward mechanism, and Agent3 and Agent4 adopt the improved reward mechanism, as shown in Table 1:
table 1 experiment setting table
Agent1: traditional experience replay mechanism, traditional reward mechanism
Agent2: improved experience replay mechanism, traditional reward mechanism
Agent3: traditional experience replay mechanism, improved reward mechanism
Agent4: improved experience replay mechanism, improved reward mechanism
The embodiment evaluates the advantages and disadvantages of the algorithms in solving the autonomous path planning problem from a practical application perspective. The specific evaluation is as follows: based on the same simulation environment, all agents start from the same place, and the number of steps required to capture the target is used as the evaluation criterion.
As shown in FIG. 4, the four agents completed the task in an average of 67.5, 61.6, 66.4 and 58.3 steps, respectively. Clearly, the improvements of the invention perform better than the traditional algorithm in solving the unmanned aerial vehicle autonomous path planning problem. The invention also compares the training time of Agent3 and Agent4; the result, shown in FIG. 5, indicates that the improvements of the invention learn faster than the traditional algorithm.
The foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, and all changes that may be made in the equivalent structures described in the specification and drawings of the present invention are intended to be included in the scope of the invention.

Claims (6)

1. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment is characterized by comprising the following steps:
1) Establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, randomly generating the number and the positions of the obstacles and the starting point of the unmanned aerial vehicle;
2) Establishing an environment model based on a Markov decision process framework, and designing a ladder rewarding mechanism;
3) Selecting an action based on the state and the strategy; after taking the action, the unmanned aerial vehicle interacts with the environment to generate a new state and calculates the obtained reward; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool, and a SumTree sampling method is used for batch sampling from the obtained experience pool according to a priority experience replay mechanism to train a network model;
4) Performing network updating on a sample obtained by sampling an environment model by adopting an improved DQN algorithm, and assigning a value to a state-action pair of the sample;
5) And selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
2. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein: in step 1), a two-dimensional world is created for training and testing and threats are set in the two-dimensional world, the starting position of the drone being fixed while the positions of the threats and the target vary randomly.
3. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 2) comprises the steps of:
S2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states;
S2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set;
S2-3, the reward signal R describes the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1; a ladder reward mechanism is designed, namely, on the premise of fully considering the characteristics of the motion planning problem, rewards are dynamically set according to the distance between the unmanned aerial vehicle and the designated target, so as to enrich the intermediate reward information in the movement process of the unmanned aerial vehicle;
S2-4, the policy π of an individual is described as the basis on which the individual takes actions, namely, the individual selects actions according to the policy π;
S2-5, the value v_π(s) after an Agent action describes the value obtained by the individual after taking an action, expressed as an expectation in terms of the policy π and the state s;
S2-6, the reward decay factor γ lies in [0,1]; if γ = 0, the approach is greedy, i.e., the value is determined only by the current delayed reward; if γ = 1, all subsequent state rewards are weighted equally with the current reward; most of the time a number between 0 and 1 is taken, i.e., the current delayed reward carries more weight than subsequent rewards;
S2-7, the state transition model of the environment is expressed as a probability model, i.e., the probability of taking action a in state s and moving to the next state s′, expressed as
P_{ss′}^a = P(S_{t+1} = s′ | S_t = s, A_t = a)
S2-8, the exploration rate ε is described as the probability that the Agent explores when selecting the next action; this ratio is used in the reinforcement learning training iteration process.
4. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 3) comprises the steps of:
s3-1, establishing a data buffer area with the capacity of MEMORY_SIZE for storing historical experience, and initializing to be empty;
s3-2, continuously collecting historical experiences of interaction between the unmanned aerial vehicle and the environment, and storing the historical experiences into an experience pool;
the interaction process is as follows: the unmanned aerial vehicle acquires the environmental state information as the current state information S and obtains its feature vector φ(S); taking the obtained feature vector φ(S) as input, the Q values of all actions in the current state are evaluated, and the optimal action A in the current Q value output is selected according to an ε-greedy strategy combined with heuristic search rules; the unmanned aerial vehicle executes the action to obtain the environmental state at the next moment, and thereby the state information S′ at the next moment.
S3-3, storing the historical experience data into the experience pool; if the number of data items in the experience pool is greater than its maximum capacity, the latest experience data replaces the oldest experience data;
S3-4, carrying out batch sampling from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism; the priority of each sample is set proportional to the absolute value |δ(t)| of its TD error, and the priority value is stored in the experience replay pool; samples are extracted under priority considerations using a binary tree structure based on SumTree.
5. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 4) comprises the steps of:
S4-1, calculating the current target Q value y_j;
S4-2, using the mean square error loss function
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²
and updating all parameters w of the Q network by gradient back-propagation of the neural network, wherein m is the number of samples of the batch gradient descent and A_j is the action set of the current iteration round;
S4-3, recalculating the TD error of all samples: δ_j = y_j + γ max_{a′} Q(s′, a′) − Q(s, a) + τ, and updating the priority of all nodes in SumTree: p_j = |δ_j|, where τ varies with distance;
S4-4, if i % C = 1, updating the target Q network parameter w′ = w, wherein i is the current iteration round number and C is the update frequency of the target Q network parameters;
S4-5, after gradient updating of the Q network parameters, the TD error is recalculated and updated in SumTree.
6. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 5) comprises the steps of:
the feature vector φ(S) obtained from the state sequence is used as input to evaluate the Q value of each action in the current state; the action selection strategy then selects the corresponding action from the current Q value output according to an ε-greedy strategy combined with heuristic search rules, and determines the flight direction of the unmanned aerial vehicle.
CN202211099640.3A 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment Pending CN116225046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099640.3A CN116225046A (en) 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099640.3A CN116225046A (en) 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Publications (1)

Publication Number Publication Date
CN116225046A true CN116225046A (en) 2023-06-06

Family

ID=86570281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099640.3A Pending CN116225046A (en) 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Country Status (1)

Country Link
CN (1) CN116225046A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116718198A (en) * 2023-08-10 2023-09-08 湖南璟德科技有限公司 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
CN116718198B (en) * 2023-08-10 2023-11-03 湖南璟德科技有限公司 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination