CN116225046A - Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment - Google Patents
- Publication number: CN116225046A (application CN202211099640.3A)
- Authority
- CN
- China
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/106—Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an unmanned aerial vehicle (UAV) autonomous path planning method based on deep reinforcement learning in an unknown environment, which addresses two shortcomings of the prior art: an experience replay mechanism that cannot preferentially extract important samples, and sparse rewards. The invention comprises the following steps: 1) establish a UAV autonomous-motion flight model in two-dimensional space, randomly generating the number and positions of the obstacles and the UAV's starting point; 2) establish an environment model based on the Markov decision process framework and design a ladder (stepped) reward mechanism; 3) select an action based on the state and the policy, form the information obtained from interaction with the environment into a quintuple stored in an experience pool, and sample according to the designed prioritized experience replay mechanism; 4) update the network on the sampled batch using an improved DQN algorithm, assigning a value to each sample's state-action pair; 5) select the optimal action according to the Q value of each action in the sampled state, thereby obtaining the optimal policy.
Description
Technical field:
The invention belongs to the technical fields of reinforcement learning and UAV obstacle avoidance, and relates to a UAV autonomous motion path planning method based on deep reinforcement learning in an unknown environment.
Background art:
The use of unmanned aerial vehicles (UAVs) in a variety of practical tasks, such as intelligence, surveillance and reconnaissance, suppression of enemy air defenses, search and rescue, and cargo transportation, has been on the rise over the past few years. In these applications, a key requirement is to build an intelligent system that lets the drone perform tasks autonomously without any human intervention. In particular, there is a need for advanced intelligent technologies that autonomously navigate a drone from an arbitrary departure point to a destination in a dynamic, unknown environment while avoiding obstacles and threats en route. To accomplish this, two challenges must be overcome:
1) Partial observability of the environment. The UAV knows nothing about the environment at the beginning and can perceive only partial information during the task. This makes rule-based path planning methods unusable, because in the face of an uncertain environment it is impossible to design complete rules for all possible scenarios.
2) Unpredictability of the environment. The irregular movement of scattered objects creates a non-stationary environment for the UAV, and navigation methods based on simultaneous localization and mapping (SLAM) become problematic because moving objects require continuous re-mapping, which leads to unacceptable computational cost. Furthermore, the open-loop mechanism of sensor-planning-based approaches makes decisions without any prediction of or inference about the future, limiting their suitability for dynamic environments.
To address these challenges, researchers have turned to reinforcement learning (RL) techniques and focused on designing learning-based planners for unmanned aerial vehicles. As a machine learning algorithm, RL is often used to solve sequential decision problems and has a profound link to approximate dynamic programming (ADP). The special mechanism of RL enables it to learn an intelligent planner through trial-and-error interaction with the environment. An RL-based planner models the problem as a Markov decision process (MDP) and generates a strategy based on the predicted long-term return, which lets RL adapt to a stochastic dynamic environment without knowing the system model. However, the "curse of dimensionality" prevents further application of conventional RL algorithms. To solve this problem and maintain a good representation of high-dimensional continuous state spaces, deep neural networks were introduced into traditional RL, yielding deep reinforcement learning (DRL) methods. By combining the perception capability of deep learning (DL) with the decision-making capability of RL, DRL achieves excellent performance in the field of UAV motion planning.
However, conventional deep reinforcement learning methods still have drawbacks in certain respects. For example, conventional experience replay mechanisms draw samples with equal probability, so that a large number of valuable samples are overwritten before ever being drawn. In addition, conventional reward mechanisms often face sparse rewards in larger environments, which greatly affects the learning efficiency of the algorithm.
Summary of the invention:
The invention aims to provide a UAV autonomous path planning method based on deep reinforcement learning in an unknown environment, addressing two problems of the prior art: an experience replay mechanism that cannot preferentially extract important samples, and sparse rewards.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment is characterized by comprising the following steps of: the method comprises the following steps:
1) Establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, randomly generating the number and the positions of the obstacles and the starting point of the unmanned aerial vehicle;
2) Establishing an environment model based on a Markov decision process framework, and designing a ladder rewarding mechanism;
3) Select an action based on the state and the policy; after the UAV takes the action, its interaction with the environment produces a new state and the obtained reward is calculated. The feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag form a quintuple that is stored in the experience pool, and batches are sampled from the experience pool with the SumTree sampling method according to the prioritized experience replay mechanism to train the network model;
4) Performing network updating on a sample obtained by sampling an environment model by adopting an improved DQN algorithm, and assigning a value to a state-action pair of the sample;
5) And selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
In step 1), a two-dimensional world is created for training and testing, and threats are set in it. The starting position of the drone is fixed, while the positions of the threats and the target vary randomly.
Step 2) comprises the steps of:
S2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states;
S2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set;
S2-3, the reward signal R describes the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1. A ladder reward mechanism is designed: on the premise of fully considering the characteristics of the motion planning problem, the reward is set dynamically according to the distance between the UAV and the designated target, so as to enrich the intermediate reward information during the UAV's motion;
S2-4, the policy π of an individual is described as the basis on which the individual takes actions, i.e., the individual selects actions according to the policy π;
S2-5, the value v_π(s) after the Agent acts describes the value to the individual of taking an action, expressed as an expectation in terms of the policy π and the state s;
S2-6, the reward decay factor γ lies in [0,1]. If γ = 0, the approach is greedy, i.e., the value is determined only by the current delayed reward; if γ = 1, all subsequent state rewards are weighted equally with the current reward. Most of the time a value between 0 and 1 is taken, so that the current delayed reward has a larger weight than subsequent rewards;
S2-7, the state transition model of the environment is expressed as a probability model, i.e., the probability of going to the next state s' after taking action a in state s, written P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a);
S2-8, the exploration rate ε is described as the probability governing the Agent's selection of the next action; this ratio is used during the reinforcement-learning training iterations.
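To make the components S2-1 through S2-8 concrete, here is a minimal Python sketch of the MDP pieces. The distance bands and reward magnitudes of the ladder reward are hypothetical, since the patent's formula (1) is not reproduced in this text; only the action set {up, down, left, right}, γ = 0.90 and ε = 0.90 come from the embodiment.

```python
import math

ACTIONS = ["up", "down", "left", "right"]  # discrete action space A (S2-2)
GAMMA = 0.90                               # reward decay factor gamma (S2-6)
EPSILON = 0.90                             # exploration rate epsilon (S2-8)

def ladder_reward(uav_pos, target_pos, hit_obstacle):
    """Ladder (stepped) reward of S2-3: the reward is set dynamically from
    the UAV-target distance; the thresholds below are illustrative."""
    if hit_obstacle:
        return -10.0                       # threat / obstacle penalty
    d = math.dist(uav_pos, target_pos)     # Euclidean distance to target
    if d == 0:
        return 10.0                        # target reached
    if d < 5:
        return 1.0                         # near band
    if d < 15:
        return 0.2                         # middle band
    return -0.1                            # far band: small per-step cost
```

Each band supplies intermediate reward information, which is exactly what the mechanism is designed to add over a sparse terminal-only reward.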
Step 3) comprises the steps of:
s3-1, establishing a data buffer area with the capacity of MEMORY_SIZE for storing historical experience, and initializing to be empty;
s3-2, continuously collecting historical experiences of interaction between the unmanned aerial vehicle and the environment, and storing the historical experiences into an experience pool;
The interaction process is as follows: the UAV acquires the environmental state information as the current state S and obtains its feature vector φ(S); with φ(S) as input, the Q values of all actions in the current state are evaluated, and the optimal action A in the current Q value output is selected according to an ε-greedy strategy combined with heuristic search rules. The UAV executes the action, obtains the environmental state at the next moment, and thus the next-moment state information S'.
S3-3, store the historical experience data in the experience pool. If the number of items in the experience pool exceeds its maximum capacity, the newest experience data replaces the oldest;
S3-4, sample in batches from the experience pool with the SumTree sampling method according to the prioritized experience replay mechanism: assign each sample a priority proportional to |δ(t)|, the absolute value of its TD error, store the priority in the experience replay pool, and extract samples under priority using the SumTree binary tree structure.
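A minimal sketch of the quintuple storage described in S3-1 through S3-4 (class and variable names are illustrative). The proportional SumTree draw is stood in for by a simple priority ordering, and the capacity is kept tiny for demonstration:

```python
from collections import deque

MEMORY_SIZE = 4  # illustrative; a real experience pool would be much larger

class PrioritizedPool:
    """Stores (phi(S), A, R, phi(S'), is_end) quintuples with |TD error| priorities."""

    def __init__(self, capacity=MEMORY_SIZE):
        # deque(maxlen=...) evicts the oldest entry when full (step S3-3)
        self.data = deque(maxlen=capacity)

    def store(self, phi_s, a, r, phi_s2, is_end, td_error=1.0):
        # New samples receive priority |delta|; defaulting to 1.0 mirrors
        # the initialization of all leaf priorities to 1.
        self.data.append(((phi_s, a, r, phi_s2, is_end), abs(td_error)))

    def sample_order(self):
        # Indices ordered by priority, highest first: a stand-in for the
        # probability-proportional SumTree sampling of step S3-4.
        return sorted(range(len(self.data)),
                      key=lambda i: self.data[i][1], reverse=True)
```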
Step 4) comprises the steps of:
S4-1, compute the current target Q value y_j: y_j = R_j if is_end_j is true; otherwise y_j = R_j + γ·max_{a'} Q'(φ(S'_j), a', w');
S4-2, using the mean-square-error loss function J(w) = (1/m)·Σ_{j=1}^{m} w_j·(y_j − Q(φ(S_j), A_j, w))², update all parameters w of the Q network by gradient back-propagation of the neural network, where m is the number of samples of the batch gradient descent and A_j is the action of the current iteration round;
S4-3, recompute the TD error of all samples, δ_j = R_j + γ·max_{a'} Q'(φ(S'_j), a', w') − Q(φ(S_j), A_j, w) + τ, and update the priority of every node in the SumTree to p_j = |δ_j|, where τ varies with distance.
S4-4, if i % C == 1, update the target Q network parameters: w' = w, where i is the current iteration round number and C is the update frequency of the target Q network parameters;
S4-5, after the gradient update of the Q network parameters, recompute the TD errors and update them into the SumTree.
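Steps S4-1 and S4-3 can be sketched with a toy tabular Q standing in for the neural network; τ and all numeric values here are illustrative assumptions:

```python
GAMMA = 0.9  # reward decay factor, as in the embodiment

def target_q(r, s2, is_end, q_target, actions):
    """S4-1: y_j = R_j for a terminal sample, otherwise
    y_j = R_j + gamma * max_a' Q'(s', a') from the target network."""
    if is_end:
        return r
    return r + GAMMA * max(q_target[(s2, a)] for a in actions)

def td_error(r, s, a, s2, is_end, q, q_target, actions, tau=0.0):
    """S4-3: delta_j, whose absolute value becomes the new SumTree priority."""
    return target_q(r, s2, is_end, q_target, actions) - q[(s, a)] + tau
```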
Step 5) comprises the steps of:
The feature vector φ(S) obtained from the state sequence is used as input to evaluate the Q value of each action in the current state; the action selection strategy selects the corresponding action in the current Q value output according to an ε-greedy strategy combined with heuristic search rules, which determines the UAV's flight direction.
Compared with the prior art, the invention has the following advantages and effects:
The UAV autonomous path planning method based on deep reinforcement learning disclosed by the invention applies deep reinforcement learning to solve the obstacle-avoidance and path-optimization problems of a UAV in a dynamic unknown environment, improving the UAV's autonomous flight capability without depending on an environment model or prior knowledge. The proposed ladder reward mechanism dynamically sets the reward according to the distance between the UAV and the designated target, enriching the intermediate reward information during the UAV's motion and overcoming the sparse reward problem. The priority-based experience replay mechanism computes each sample's priority from its TD error, fully accounts for the importance of sample information during sampling, raises the sampling probability of important samples, and improves the overall learning efficiency of the algorithm.
Description of the drawings:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic representation of a SumTree in accordance with a preferred embodiment of the present invention;
FIG. 3 is a diagram of a DQN network architecture according to a preferred embodiment of the present invention;
FIG. 4 is a diagram showing the comparison of the optimal routes and the required steps found by different agents (agents) during the experimental stage of a preferred embodiment of the present invention;
FIG. 5 is a comparison diagram of learning efficiency according to a preferred embodiment of the present invention.
The specific embodiment is as follows:
the exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
The invention provides an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning in an unknown environment, comprising steps 1) through 5) as summarized above; in step 5, the optimal action is selected according to the Q value of each action in the sampled state, and the optimal policy is thereby obtained.
Examples:
referring to fig. 1, a flow chart of an autonomous path planning method of an unmanned aerial vehicle in an unknown environment based on deep reinforcement learning according to the present invention includes the following steps:
A two-dimensional 40 × 20 m² world is built for training and testing, and 150 threats are set in it. The starting position of the UAV is fixed: the invention places it in the upper-left corner of the 2-D world. The threat positions vary randomly, appearing in the 2-D world as the drone moves. In addition, the target is not stationary and changes position as the drone moves.
Step 2-1: the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states.
Step 2-2: the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set, and the UAV selects among the four actions {up, down, left, right}.
Step 2-3: the reward signal R is described as the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1.
As a further improvement of the invention: when performing routine tasks, the UAV often needs many steps during flight to avoid obstacles and reach the designated target, so adopting a conventional sparse reward gives the UAV Agent many uninformative rewards. To solve this problem, the invention proposes a ladder reward mechanism: on the premise of fully considering the characteristics of the motion planning problem, the reward is set dynamically according to the distance between the UAV and the designated target, enriching the intermediate reward information during the UAV's motion; the implementation is shown in formula (1).
Compared with a conventional sparse reward, the improved reward signal strengthens the connection between the reward and the UAV's motion path and avoids a large number of useless rewards. At the same time, the improved reward signal greatly improves the convergence efficiency of the training process, reduces the overall training time, and facilitates interpretation of the optimal solution.
Step 2-4, the policy pi of the individual is described as the basis for the individual to take an action, i.e., the individual will select an action according to the policy pi.
In step 2-5, the value v_π(s) after the Agent acts describes the value to the individual of taking an action, usually expressed as an expectation in terms of the policy π and the state s. Although the current action yields a delayed reward R_{t+1}, looking only at this delayed reward is not feasible, because a high current delayed reward does not imply high rewards at t+1, t+2, .... The value therefore integrates the current and the subsequent delayed rewards. The value function can generally be expressed as equation (2):
v_π(s) = E_π(R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... | S_t = s)   (2)
Step 2-6: the reward decay factor γ takes a value between 0 and 1, i.e., the current delayed reward has a larger weight than subsequent rewards; in this embodiment the reward decay factor is taken as 0.90.
In step 2-7, the state transition model of the environment can be understood as a probabilistic state machine and expressed as a probability model, i.e., the probability of going to the next state s' after taking action a in state s, written P^a_{ss'} = P(S_{t+1} = s' | S_t = s, A_t = a).
Step 2-8: the exploration rate ε is described as the probability governing the Agent's selection of the next action; it is taken as 0.90 in this embodiment.
Step 3: select an action based on the state and the policy. After the UAV takes the action, its interaction with the environment generates a new state and the obtained reward is computed; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag form a quintuple stored in an experience pool of capacity MEMORY_SIZE (the pool is empty in the initial state). Batches are then sampled from the pool with the SumTree sampling method according to the prioritized experience replay mechanism to train the network model. Step 3 can be divided into the following sub-steps:
step 3-1, input and output of algorithm: selecting the iteration round number T, the state characteristic dimension n, the action set A, the step length alpha, the sampling weight coefficient beta, the attenuation factor gamma, the exploration rate epsilon, the current Q network Q, the target Q network Q', the number m of samples with batch gradient descent, and the target Q network parameter updating frequency C as the input of an algorithm; and outputting the Q network parameters as an algorithm.
Step 3-2, parameter initialization of the algorithm: randomly initialize the value Q corresponding to all states and actions; randomly initialize all parameters w of the current Q network; initialize the parameters of the target Q network Q' as w' = w; initialize the default data structure of the experience replay SumTree, setting the priority p_j of all SumTree leaf nodes to 1.
Step 3-3, perform iterations t = 1 to T with the following sub-steps:
and step 3-3-1, initializing the S to be the first state of the current state sequence, and obtaining the characteristic vector phi (S) of the current state sequence.
Step 3-3-2, using the feature vector φ(S) as input to the Q network, obtain the Q value outputs corresponding to all actions of the Q network, and select the corresponding action A in the current Q value output according to an ε-greedy strategy combined with heuristic search rules.
Step 3-3-3, execute the current action A in state S to obtain the feature vector φ(S') and reward signal R corresponding to the new state S', as well as the termination flag is_end.
Step 3-3-4, store the quintuple {φ(S), A, R, φ(S'), is_end} into the SumTree;
Step 3-3-5, update the state: S = S';
and 3-4, carrying out batch sampling according to a SumPree sampling method from the obtained experience pool according to the priority.
As a further improvement of the invention: the experience replay mechanism in the conventional DQN algorithm only saves the data obtained by interacting with the environment (sample states, actions, rewards and so on) and does not introduce the concept of priority. The invention improves the experience replay mechanism of the conventional DQN algorithm: each sample is given a priority proportional to |δ(t)|, the absolute value of its TD error, and the priority value is stored in the experience replay pool.
In the experience replay pool, different samples have different effects on back-propagation because their TD errors differ: the larger the TD error, the greater the effect on back-propagation, while samples with small TD errors have little influence on the computation of the backward gradient. In the Q network, the TD error is the difference between the target Q value computed by the target Q network and the Q value computed by the current Q network.
Sampling method of the improved experience replay mechanism: sample m samples {φ(S_j), A_j, R_j, φ(S'_j), is_end_j}, j = 1, 2, ..., m. The probability of each sample being drawn is P(j) = p_j^α / Σ_i p_i^α, and the loss-function weight is computed as w_j = (N·P(j))^{-β} / max_i(w_i), where p_i is the priority of sample i (proportional to |δ(t)|) during the iterations from 1 to T, w_j is the loss-function weight for the j-th sample, and β is the sampling weight coefficient.
Fig. 2 illustrates the sampling procedure of the SumTree. In the SumTree binary tree structure, each leaf node corresponds to a value interval; the larger a leaf node's value (the higher its priority), the longer its interval, and hence the greater the probability that a value drawn uniformly from the total interval falls into it. The procedure is as follows:
1) To extract one sample, uniformly draw a value v from the total interval (e.g., 0-29);
2) Traverse the child nodes starting from node 0 as the parent;
3) If the value of the left child is greater than or equal to v, take the left child as the parent and continue traversing its children;
4) If the value of the left child is smaller than v, subtract the left child's value from v, take the right child as the parent, and continue traversing its children;
5) When a leaf node is reached, its value is the priority of the extracted sample.
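The traversal 1) to 5) can be sketched as a compact array-based SumTree; the capacity and priorities below are illustrative, with the total priority 29 echoing the 0-29 interval mentioned above:

```python
class SumTree:
    """Binary tree whose internal nodes store the sum of their children;
    leaves store per-sample priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes then leaves

    def update(self, leaf_idx, priority):
        i = leaf_idx + self.capacity - 1        # array position of the leaf
        change = priority - self.tree[i]
        self.tree[i] = priority
        while i:                                # propagate the change upward
            i = (i - 1) // 2
            self.tree[i] += change

    def sample(self, v):
        """Walk down from the root comparing v with the left child."""
        i = 0
        while i < self.capacity - 1:            # until a leaf is reached
            left = 2 * i + 1
            if v <= self.tree[left]:            # step 3): descend left
                i = left
            else:                               # step 4): subtract, go right
                v -= self.tree[left]
                i = left + 1
        return i - (self.capacity - 1), self.tree[i]
```

Drawing v uniformly from [0, tree[0]] and then calling sample(v) selects each leaf with probability proportional to its priority.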
Step 4: use the improved DQN algorithm to update the network on the samples drawn from the environment model and assign a value to each sample's state-action pair; the network structure of the DQN is shown in Fig. 3. Step 4 can be divided into the following sub-steps:
Step 4-1, compute the current target Q value y_j; the implementation is shown in formula (3): y_j = R_j if is_end_j is true, otherwise y_j = R_j + γ·max_{a'} Q'(φ(S'_j), a', w').   (3)
Step 4-2, update all parameters w of the Q network by gradient back-propagation of the neural network using the mean-square-error loss function, where m is the number of samples of the batch gradient descent and A_j is the action of the current iteration round.
As part of the improvement of the invention, in addition to priority-based experience replay, the invention also optimizes the loss function of the Q network.
The conventional loss function is: J(w) = (1/m)·Σ_{j=1}^{m} (y_j − Q(φ(S_j), A_j, w))²
The invention adds the sample priority to obtain a new loss function: J(w) = (1/m)·Σ_{j=1}^{m} w_j·(y_j − Q(φ(S_j), A_j, w))²
where w_j, the priority weight of the j-th sample, is obtained by normalizing the TD error |δ(t)|:
w_j = (N·P(j))^{-β} / max_i(w_i) = (N·P(j))^{-β} / max_i((N·P(i))^{-β}) = (P(j))^{-β} / max_i((P(i))^{-β}) = (P(j) / min_i P(i))^{-β}   (6)
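The chain of equalities in formula (6) can be checked numerically; the priority values below are illustrative:

```python
def weights_full(P, beta, N):
    """Left-hand form of (6): (N*P(j))^-beta normalized by its maximum."""
    raw = [(N * p) ** -beta for p in P]
    m = max(raw)
    return [r / m for r in raw]

def weights_simplified(P, beta):
    """Simplified right-hand form of (6): (P(j)/min_i P(i))^-beta."""
    pmin = min(P)
    return [(p / pmin) ** -beta for p in P]
```

Because the factor N^{-β} cancels in the normalization, the importance weight depends only on the ratio of each sampling probability to the smallest one.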
Step 4-3, recompute the TD error of all samples, δ_j = R_j + γ·max_{a'} Q'(φ(S'_j), a', w') − Q(φ(S_j), A_j, w) + τ, and update the priority of every node in the SumTree: p_j = |δ_j|; here τ varies with distance.
Step 4-4, if i % C == 1, update the target Q network parameters: w' = w, where i is the current iteration round number and C is the update frequency of the target Q network parameters.
Step 4-5: if S' is a terminal state, the current iteration round ends; otherwise jump to step 3-3-2.
The Q values in steps 4-1 and 4-2 above are computed by the Q network. Meanwhile, for better convergence of the algorithm, the exploration rate ε must shrink as the iterations progress.
After the gradient update of the Q network parameters, the TD errors must be recomputed and updated into the SumTree; when the number of training Episodes reaches a preset condition, the update process ends and the DQN model parameters are saved.
The update process is performed in units of Episodes. During updating, each Episode starts from the initial state; when the UAV meets any of the following conditions, the Episode ends and learning restarts with the next Episode: 1) the target is reached; 2) a threat is encountered and the task fails; 3) the UAV flies out of the task-area boundary.
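The Episode-based update loop and its three termination conditions can be sketched as follows; the environment interface and the action-selection callable are assumptions for illustration, not part of the patent:

```python
def run_episode(env, choose_action, epsilon, max_steps=1000):
    """Run one Episode from the initial state until 1) the target is reached,
    2) a threat ends the task, or 3) the UAV leaves the task-area boundary."""
    s = env.reset()
    for _ in range(max_steps):
        a = choose_action(s, epsilon)
        # reason is one of 'target', 'threat', 'out' when done is True
        s, done, reason = env.step(a)
        if done:
            return reason   # the next Episode restarts from the initial state
    return "max_steps"
```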
Step 5: select the optimal action according to the Q value of each action in the sampled state, thereby obtaining the optimal policy.
The effect of the invention can be further illustrated by the following simulation experiment:
For a full comparison, the invention constructs 4 agents: Agent1 and Agent3 adopt the traditional experience replay mechanism, while Agent2 and Agent4 adopt the improved experience replay mechanism; meanwhile, Agent1 and Agent2 adopt the traditional reward mechanism, while Agent3 and Agent4 adopt the improved reward mechanism, as shown in Table 1:
Table 1 Experiment settings
Agent | Traditional experience replay mechanism | Traditional reward mechanism | Improved experience replay mechanism | Improved reward mechanism
---|---|---|---|---
Agent1 | √ | √ | |
Agent2 | | √ | √ |
Agent3 | √ | | | √
Agent4 | | | √ | √
From a practical application perspective, this embodiment evaluates the merits of the algorithms for solving the autonomous path planning problem. The specific implementation is as follows: in the same simulation environment, all agents start from the same location, and the number of steps required to capture the target is used as the evaluation criterion.
As shown in FIG. 4, the four agents completed the task in an average of 67.5, 61.6, 66.4 and 58.3 steps, respectively. Evidently, the improvements of the invention perform better than the traditional algorithm in solving the unmanned aerial vehicle autonomous path planning problem. The invention also compares the training duration of Agent3 and Agent4; the result, shown in FIG. 5, indicates that the improvements of the invention achieve faster learning efficiency than the traditional algorithm.
The foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, and all changes that may be made in the equivalent structures described in the specification and drawings of the present invention are intended to be included in the scope of the invention.
Claims (6)
1. An unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, characterized by comprising the following steps:
1) Establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, randomly generating the number and the positions of the obstacles and the starting point of the unmanned aerial vehicle;
2) Establishing an environment model based on a Markov decision process framework, and designing a ladder rewarding mechanism;
3) Based on the state and the strategy, an action is selected; the unmanned aerial vehicle interacts with the environment to generate a new state, and the reward obtained after taking the action is calculated; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the state at the next moment and the termination flag bit form a five-tuple, which is stored in an experience pool; batch sampling is performed from the obtained experience pool with the SumTree sampling method according to a priority experience replay mechanism to train the network model;
4) Using the improved DQN algorithm, a network update is performed on the samples obtained by sampling the environment model, and a value is assigned to each state-action pair of the samples;
5) And selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
2. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, characterized in that: in step 1), a two-dimensional world is created for training and testing, and threats are set in the two-dimensional world, wherein the starting position of the drone is fixed while the positions of the threats and the target vary randomly.
3. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 2) comprises the steps of:
S2-1, the state space S is described as a vector space; the state S_t of the environment at time t is a state in its environmental state set;
S2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set;
S2-3, the reward signal R describes the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1; a step reward mechanism is designed, i.e., on the premise of fully considering the characteristics of the motion planning problem, rewards are dynamically set according to the distance between the unmanned aerial vehicle and the designated target, thereby enriching the intermediate reward information in the movement process of the unmanned aerial vehicle;
S2-4, the policy π of an individual is described as the basis on which the individual takes actions, i.e., the individual selects actions according to the policy π;
S2-5, the value v_π(s) after the Agent takes an action describes the value obtained by the individual after taking the action, as an expectation function of the policy π and the state s;
S2-6, the reward attenuation factor γ lies in [0,1]; if γ = 0, the approach is greedy, i.e., the value is determined only by the current delayed reward; if γ = 1, all subsequent state rewards and the current reward are weighted equally; most of the time a value between 0 and 1 is taken, i.e., the current delayed reward has a larger weight than subsequent rewards;
S2-7, the state transition model of the environment is expressed as a probability model, i.e., the probability of taking action a in state s and transferring to the next state s′, expressed as P^a_{ss′};
S2-8, the exploration rate ε is described as the probability with which the Agent selects the next action at random; this ratio is used in the reinforcement learning training iteration process.
4. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 3) comprises the steps of:
s3-1, establishing a data buffer area with the capacity of MEMORY_SIZE for storing historical experience, and initializing to be empty;
s3-2, continuously collecting historical experiences of interaction between the unmanned aerial vehicle and the environment, and storing the historical experiences into an experience pool;
The interaction process is as follows: the unmanned aerial vehicle acquires the environmental state information as the current state information S and obtains its feature vector φ(S); with φ(S) as input, the Q values of all actions in the current state are evaluated, and the optimal action A in the current Q value output is selected according to an ε-greedy strategy combined with heuristic search rules; the unmanned aerial vehicle executes the action to obtain the environmental state at the next moment, and thus the next-moment state information S′.
S3-3, storing the historical experience data into an experience pool. If the number of data in the experience pool is greater than the maximum capacity of the experience pool, using the latest experience data to replace the oldest experience data;
S3-4, batch sampling is performed from the obtained experience pool with the SumTree sampling method according to the priority experience replay mechanism: each sample is assigned a priority proportional to the absolute value |δ(t)| of its TD error, the priority value is stored in the experience replay pool, and samples are extracted under priority considerations using the SumTree binary tree structure.
5. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 4) comprises the steps of:
S4-1, calculate the current target Q value y_j:
y_j = R_j, if S′_j is a termination state; otherwise y_j = R_j + γ max_{a′} Q(φ(S′_j), a′, w′);
S4-2, using the mean square error loss function L(w) = 1/m Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))², update all parameters w of the Q network by gradient back-propagation of the neural network; where m is the number of samples in the batch gradient descent and A_j is the action set of the current iteration round;
S4-3, recalculate the TD-error of all samples: δ_j = y_j − Q(φ(S_j), A_j, w) + τ, and update the priority of all nodes in the SumTree: p_j = |δ_j|; where τ varies with distance;
S4-4, if i % C = 1, update the target Q network parameters: w′ = w; where i is the current iteration round number and C is the update frequency of the target Q network parameters;
S4-5, after the gradient update of the Q network parameters, recalculate the TD errors and update them into the SumTree.
6. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 5) comprises the steps of:
the characteristic vector phi (S) obtained by the state sequence is used as input to evaluate the Q value of each action in the current state; and selecting a corresponding action in the current Q value output according to an epsilon-greedy strategy combined with a heuristic search rule by the action selection strategy, and determining the flight direction of the unmanned aerial vehicle.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211099640.3A CN116225046A (en) | 2022-09-09 | 2022-09-09 | Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116225046A true CN116225046A (en) | 2023-06-06 |
Family
ID=86570281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211099640.3A Pending CN116225046A (en) | 2022-09-09 | 2022-09-09 | Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116225046A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116718198A (en) * | 2023-08-10 | 2023-09-08 | 湖南璟德科技有限公司 | Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph |
CN116718198B (en) * | 2023-08-10 | 2023-11-03 | 湖南璟德科技有限公司 | Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||