CN116225046A - Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment - Google Patents

Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Info

Publication number
CN116225046A
CN116225046A (application CN202211099640.3A)
Authority
CN
China
Prior art keywords
action
state
unmanned aerial
aerial vehicle
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211099640.3A
Other languages
Chinese (zh)
Inventor
贺楚超
田琳宇
辛泊言
王鹏
吕志刚
邸若海
李晓艳
许韫韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Technological University
Original Assignee
Xian Technological University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Technological University filed Critical Xian Technological University
Priority to CN202211099640.3A priority Critical patent/CN116225046A/en
Publication of CN116225046A publication Critical patent/CN116225046A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/106Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, which solves the problems in the prior art that the experience replay mechanism cannot preferentially extract important samples and that rewards are sparse. The invention comprises the following steps: 1) establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, and randomly generating the number and positions of the obstacles and the starting point of the unmanned aerial vehicle; 2) establishing an environment model based on the Markov decision process framework and designing a ladder reward mechanism; 3) selecting an action based on the state and the strategy, forming the information obtained by interaction with the environment into a five-tuple, storing it in an experience pool, and sampling according to the designed priority experience replay mechanism; 4) performing a network update with an improved DQN algorithm on the samples drawn from the environment model, and assigning a value to the state-action pair of each sample; 5) selecting the optimal action according to the Q value of each action in the sampled state, and thereby obtaining the optimal strategy.

Description

Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
Technical field:
the invention belongs to the technical field of reinforcement learning and unmanned aerial vehicle obstacle avoidance, and relates to an unmanned aerial vehicle autonomous motion path planning method based on deep reinforcement learning under an unknown environment.
Background art:
the use of unmanned aerial vehicles in a variety of practical tasks, such as intelligence, surveillance and reconnaissance, suppression of enemy air defense, search and rescue, and cargo transportation, has been on the rise over the past few years. In these applications, a key requirement is to build an intelligent system that allows the drone to perform tasks autonomously, without any human intervention. In particular, there is a need to develop advanced intelligent technologies that autonomously navigate a drone from an arbitrary departure point to a destination in a dynamic, unknown environment while avoiding obstacles and threats en route. To achieve this, two challenges need to be overcome:
1) Partial observability of the environment. The unmanned aerial vehicle has no knowledge of the environment at the beginning and can perceive only partial information during the task. This feature makes some rule-based path planning methods unusable, because it is impossible to design complete rules for all possible scenarios in the face of an uncertain environment.
2) Unpredictability of the environment. Irregular movement of scattered objects creates an unstable environment for the unmanned aerial vehicle, and navigation methods based on simultaneous localization and mapping (Simultaneous Localization And Mapping, SLAM) become problematic because moving objects require continuous re-mapping, which results in an unacceptable computational cost. Furthermore, the open-loop mechanism of sensor-planning-based approaches makes decisions without any prediction or inference about the future, hampering their suitability for dynamic environments.
To address these challenges, researchers have resorted to reinforcement learning (Reinforcement Learning, RL) techniques and have focused on designing learning-based planners for unmanned aerial vehicles. As a machine learning algorithm, RL is often used to solve sequential decision problems and has a profound link to approximate dynamic programming (Approximate Dynamic Programming, ADP). The special mechanism of RL enables it to learn an intelligent planner through trial-and-error interaction with the environment. An RL-based planner uses a Markov decision process (Markov Decision Process, MDP) to model the problem and generates a strategy based on the predicted long-term return, which enables RL to adapt to a random dynamic environment without knowing the system model. However, the "dimension disaster" (curse of dimensionality) problem prevents further application of the conventional RL algorithm. To solve this problem and maintain a better representation of high-dimensional continuous state spaces, deep neural networks were introduced into traditional RL and deep reinforcement learning (Deep Reinforcement Learning, DRL) methods were developed. By combining the perception capability of deep learning (Deep Learning, DL) with the decision capability of RL, DRL achieves excellent performance in the field of unmanned aerial vehicle motion planning.
However, conventional deep reinforcement learning methods still have drawbacks in certain aspects; for example, the conventional experience replay mechanism extracts samples with equal probability, so that a large number of valuable samples are overwritten without ever being extracted. In addition, conventional reward mechanisms often face sparse rewards in larger environments, which greatly affects the learning efficiency of the algorithm.
Summary of the invention:
the invention aims to provide an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, which solves the problems that an experience replay mechanism cannot extract important samples and sparse rewards in the prior art.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment is characterized by comprising the following steps of: the method comprises the following steps:
1) Establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, randomly generating the number and the positions of the obstacles and the starting point of the unmanned aerial vehicle;
2) Establishing an environment model based on a Markov decision process framework, and designing a ladder rewarding mechanism;
3) Selecting an action based on the state and the strategy; after taking the action, the unmanned aerial vehicle interacts with the environment to generate a new state and calculates the obtained reward; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool, and a SumTree sampling method is used for batch sampling from the obtained experience pool according to a priority experience replay mechanism to train a network model;
4) Performing network updating on a sample obtained by sampling an environment model by adopting an improved DQN algorithm, and assigning a value to a state-action pair of the sample;
5) And selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
In step 1), a two-dimensional world is created for training and testing and threats are set in the two-dimensional world, wherein the starting position of the drone is fixed while the positions of the threats and the target vary randomly.
Step 2) comprises the steps of:
S2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states;
S2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set;
S2-3, the reward signal R describes the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1; a ladder reward mechanism is designed, namely, on the premise of fully considering the characteristics of the motion planning problem, rewards are dynamically set according to the distance between the unmanned aerial vehicle and the designated target, so as to enrich the intermediate reward information in the movement process of the unmanned aerial vehicle;
S2-4, the policy π of an individual is described as the basis on which the individual takes actions, namely, the individual selects actions according to the policy π;
S2-5, the value v_π(s) after an Agent action describes the value obtained by the individual after taking an action, expressed as an expectation in terms of the policy π and the state s;
S2-6, the reward decay factor γ lies in [0,1]; if γ = 0, the approach is greedy, i.e., the value is determined only by the current delayed reward; if γ = 1, all subsequent state rewards are weighted equally with the current reward; most of the time a number between 0 and 1 is taken, i.e., the current delayed reward carries more weight than subsequent rewards;
S2-7, the state transition model of the environment is expressed as a probability model, i.e., the probability of taking action a in state s and moving to the next state s′, expressed as
P_{ss′}^a = P(S_{t+1} = s′ | S_t = s, A_t = a)
S2-8, the exploration rate ε is described as the probability that the Agent explores when selecting the next action; this ratio is used in the reinforcement learning training iteration process.
Step 3) comprises the steps of:
s3-1, establishing a data buffer area with the capacity of MEMORY_SIZE for storing historical experience, and initializing to be empty;
s3-2, continuously collecting historical experiences of interaction between the unmanned aerial vehicle and the environment, and storing the historical experiences into an experience pool;
the interaction process is as follows: the unmanned aerial vehicle acquires the environmental state information as the current state information S and obtains its feature vector φ(S); taking the obtained feature vector φ(S) as input, the Q values of all actions in the current state are evaluated, and the optimal action A in the current Q value output is selected according to an ε-greedy strategy combined with heuristic search rules; the unmanned aerial vehicle executes the action to obtain the environmental state at the next moment, and thereby the state information S′ at the next moment.
S3-3, storing the historical experience data into the experience pool; if the number of data items in the experience pool is greater than its maximum capacity, the latest experience data replaces the oldest experience data;
S3-4, carrying out batch sampling from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism; the priority of each sample is set proportional to the absolute value |δ(t)| of its TD error, and the priority value is stored in the experience replay pool; samples are extracted under priority considerations using a binary tree structure based on SumTree.
Step 4) comprises the steps of:
S4-1, calculating the current target Q value y_j;
S4-2, using the mean square error loss function
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²
and updating all parameters w of the Q network by gradient back-propagation of the neural network, wherein m is the number of samples of the batch gradient descent and A_j is the action set of the current iteration round;
S4-3, recalculating the TD error of all samples: δ_j = y_j + γ max_{a′} Q(s′, a′) − Q(s, a) + τ, and updating the priority of all nodes in SumTree: p_j = |δ_j|, where τ varies with distance;
S4-4, if i % C = 1, updating the target Q network parameter w′ = w, wherein i is the current iteration round number and C is the update frequency of the target Q network parameters;
S4-5, after gradient updating of the Q network parameters, the TD error is recalculated and updated in SumTree.
Step 5) comprises the steps of:
the feature vector φ(S) obtained from the state sequence is used as input to evaluate the Q value of each action in the current state; the action selection strategy then selects the corresponding action from the current Q value output according to an ε-greedy strategy combined with heuristic search rules, and determines the flight direction of the unmanned aerial vehicle.
Compared with the prior art, the invention has the following advantages and effects:
according to the unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning disclosed by the invention, the problems of obstacle avoidance and path optimization of the unmanned aerial vehicle in a dynamic unknown environment are solved by applying deep reinforcement learning, and the autonomous flight capability of the unmanned aerial vehicle is improved without depending on an environment model or prior knowledge. The ladder reward mechanism provided by the invention dynamically sets rewards according to the distance between the unmanned aerial vehicle and the designated target, enriches the intermediate reward information in the movement process of the unmanned aerial vehicle, and overcomes the sparse reward problem; the priority-based experience replay mechanism calculates the priority of each sample based on the TD error, fully considers the importance of sample information in the sampling process, raises the sampling probability of important samples, and improves the overall learning efficiency of the algorithm.
Description of the drawings:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic representation of a SumTree in accordance with a preferred embodiment of the present invention;
FIG. 3 is a diagram of a DQN network architecture according to a preferred embodiment of the present invention;
FIG. 4 is a diagram comparing the optimal routes and the required numbers of steps found by the different agents during the experimental stage of a preferred embodiment of the present invention;
FIG. 5 is a comparison diagram of learning efficiency according to a preferred embodiment of the present invention.
Detailed description of the embodiments:
the exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
The invention provides an unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment, which comprises the following steps:
step 1, establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, and randomly generating the number and positions of barriers (threats) and the starting point of the unmanned aerial vehicle;
step 2, establishing an environment model based on a Markov process framework, wherein the environment model comprises a state space S, an action space A, a reward function R, a reward attenuation factor gamma, an exploration rate epsilon and the like; designing a ladder rewarding mechanism;
step 3, selecting an action based on the state and the strategy; after taking the action, the unmanned aerial vehicle interacts with the environment to generate a new state and calculates the reward; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool with capacity MEMORY_SIZE (in the initial state, the experience pool is empty), and batch sampling is carried out from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism to train the network model;
step 4, adopting an improved DQN algorithm to update the network based on the samples obtained by sampling the environmental model, and assigning values to the state-action pairs of the samples;
step 5, selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
Examples:
referring to fig. 1, a flow chart of an autonomous path planning method of an unmanned aerial vehicle in an unknown environment based on deep reinforcement learning according to the present invention includes the following steps:
step 1, an unmanned aerial vehicle autonomous movement flight model is established in a two-dimensional space, the number and the positions of barriers (threats) and the starting point of the unmanned aerial vehicle are randomly generated, and the step 1 specifically comprises:
A two-dimensional 40 × 20 m² world is built for training and testing, and 150 threats are set in the two-dimensional world. The starting position of the unmanned aerial vehicle is fixed; the invention places it in the upper-left corner of the 2-dimensional world, while the threat locations vary randomly, appearing at random in the 2-dimensional world as the drone moves. In addition, the target is not stationary and changes position as the drone moves. A minimal sketch of such a world is given below.
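For concreteness, the following is a minimal Python sketch of such a two-dimensional world, assuming the four actions {up, down, left, right} used in step 2-2 and the episode-end conditions listed later in the embodiment. The reward returned by step() follows the ladder (tiered) idea of step 2-3 below; since formula (1) is given only as an image in the original filing, the distance tiers, reward values, class name and function names here are illustrative assumptions, and the re-randomisation of threat and target positions during flight is omitted.

```python
import math
import random

ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}   # up, down, left, right

def ladder_reward(uav, target, hit_threat=False, reached=False,
                  tiers=((5.0, 2.0), (10.0, 1.0), (20.0, 0.5))):
    """Tiered (ladder) reward based on the UAV-target distance; values are illustrative."""
    if reached:
        return 10.0                      # large positive terminal reward at the target
    if hit_threat:
        return -10.0                     # large negative terminal reward on collision
    d = math.dist(uav, target)
    for threshold, r in tiers:           # closer tiers earn larger intermediate rewards
        if d <= threshold:
            return r
    return -0.1                          # small step penalty far from the target

class GridWorld:
    """40 x 20 two-dimensional world with a fixed start and 150 randomly placed threats."""

    def __init__(self, width=40, height=20, n_threats=150):
        self.width, self.height, self.n_threats = width, height, n_threats
        self.reset()

    def reset(self):
        self.uav = (0, 0)                # start fixed in the upper-left corner
        cells = [(x, y) for x in range(self.width) for y in range(self.height)
                 if (x, y) != self.uav]
        picks = random.sample(cells, self.n_threats + 1)
        self.threats, self.target = set(picks[:-1]), picks[-1]
        return self.uav

    def step(self, action):
        dx, dy = ACTIONS[action]
        x, y = self.uav[0] + dx, self.uav[1] + dy
        out = not (0 <= x < self.width and 0 <= y < self.height)
        if not out:
            self.uav = (x, y)
        reached = self.uav == self.target
        hit = self.uav in self.threats
        done = reached or hit or out     # episode ends: target reached, threat hit, or boundary left
        return self.uav, ladder_reward(self.uav, self.target, hit, reached), done
```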
Step 2, establishing an environment model based on a Markov process framework; including state space S, action space a, reward function R, reward decay factor γ, exploration rate epsilon, etc. Step 2 can be divided into the following sub-steps:
Step 2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states.
Step 2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set, and the unmanned aerial vehicle selects among the four actions {up, down, left, right}.
Step 2-3, the reward signal R is described as the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t will be obtained at time t+1.
As a further improvement of the invention: considering that the unmanned aerial vehicle often needs to go through many steps in the flight process in order to avoid the obstacle to reach the specified target when performing the daily task, the adoption of the conventional sparse rewards brings a plurality of ineffective rewards to the unmanned aerial vehicle Agent. In order to solve the problem, the invention provides a step rewarding mechanism, namely, rewards are dynamically set according to the distance between the unmanned aerial vehicle and a designated target on the premise of fully considering the characteristics of a movement planning problem, so that intermediate rewarding information in the movement process of the unmanned aerial vehicle is enriched, and the implementation method is shown in a formula (1).
Formula (1): the ladder reward, assigned in tiers according to the distance between the unmanned aerial vehicle and the designated target (an illustrative tiered reward of this kind is included in the sketch given after step 1 above).
Compared with the traditional sparse reward, the improved reward signal strengthens the connection between the reward and the motion path of the unmanned aerial vehicle and avoids the occurrence of a large number of useless rewards. Meanwhile, the improved reward signal greatly improves the convergence efficiency of the training process, reduces the overall training time and facilitates the interpretation of the optimal solution.
Step 2-4, the policy pi of the individual is described as the basis for the individual to take an action, i.e., the individual will select an action according to the policy pi.
Step 2-5, the value v_π(s) after an Agent action describes the value obtained by the individual after taking an action, typically as an expectation in terms of the policy π and the state s. Although the current action yields a delayed reward R_{t+1}, looking only at this delayed reward is not feasible, because a high current delayed reward does not imply high subsequent rewards at t+2, t+3, and so on. The value therefore integrates the current and subsequent delayed rewards, and the value function can generally be expressed as equation (2).
v_π(s) = E_π(R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s)    (2)
Step 2-6, rewarding attenuation factor gamma, taking a number between 0 and 1, namely, the weight of the current delay rewarding is larger than that of the subsequent rewarding; in this embodiment, the prize decay factor is taken to be 0.90.
The state transition model of the environment in step 2-7 can be understood as a probabilistic state machine and expressed as a probability model, i.e., the probability of taking action a in state s and going to the next state s′, expressed as
P_{ss′}^a = P(S_{t+1} = s′ | S_t = s, A_t = a)
Step 2-8, the exploration rate epsilon is described as the probability of the Agent to select the next action. Taken as 0.90 in this embodiment.
Step 3, selecting an action based on the state and the strategy; after the unmanned aerial vehicle takes the action, interaction with the environment generates a new state and the obtained reward is calculated; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool with capacity MEMORY_SIZE (the experience pool is empty in the initial state); batch sampling is then carried out from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism to train the network model. Step 3 can be divided into the following sub-steps:
Step 3-1, input and output of the algorithm: the iteration round number T, the state feature dimension n, the action set A, the step length α, the sampling weight coefficient β, the decay factor γ, the exploration rate ε, the current Q network Q, the target Q network Q′, the number m of samples for batch gradient descent, and the target Q network parameter update frequency C are taken as the input of the algorithm; the Q network parameters are the output of the algorithm.
Step 3-2, parameter initialization of the algorithm: randomly initialize the values Q corresponding to all states and actions; randomly initialize all parameters w of the current Q network; initialize the parameters of the target Q network Q′ as w′ = w; initialize the default data structure of the experience replay SumTree, setting the priorities p_j of all SumTree leaf nodes to 1.
Step 3-3, carrying out T iterations from 1 to T in the following substeps:
and step 3-3-1, initializing the S to be the first state of the current state sequence, and obtaining the characteristic vector phi (S) of the current state sequence.
And step 3-3-2, using the characteristic vector phi (S) as input in the Q network to obtain Q value output corresponding to all actions of the Q network. And selecting a corresponding action A in the current Q value output according to an epsilon-greedy strategy combined with heuristic search rules.
Step 3-3-3, executing the current action A in the state S to obtain the feature vector φ(S′) and the reward signal R corresponding to the new state S′, together with the termination flag is_end.
Step 3-3-4, storing the five-tuple {φ(S), A, R, φ(S′), is_end} into SumTree;
Step 3-3-5, updating the state S = S′ at this time; a sketch of this interaction loop is given below.
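A sketch of sub-steps 3-3-1 to 3-3-5 is given below. It assumes an environment like the GridWorld sketched earlier, a Q-network callable returning one Q value per action, and a replay structure exposing an add() method; the heuristic search rules combined with the ε-greedy selection and the exact feature mapping φ(S) are simplified to placeholders.

```python
import random
import numpy as np

def epsilon_greedy(q_net, phi_s, epsilon, n_actions=4):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_net(phi_s)))             # action with the largest Q value

def feature_fn(state, width=40, height=20):
    """Placeholder feature vector phi(S): normalised UAV coordinates."""
    return np.array([state[0] / width, state[1] / height], dtype=np.float32)

def run_episode(env, q_net, replay, epsilon):
    """Collect one episode and store (phi(S), A, R, phi(S'), is_end) five-tuples."""
    s, done = env.reset(), False
    while not done:
        phi_s = feature_fn(s)
        a = epsilon_greedy(q_net, phi_s, epsilon)   # sub-step 3-3-2
        s_next, r, done = env.step(a)               # sub-step 3-3-3
        # sub-step 3-3-4: new transitions enter the tree with the initial priority of 1
        replay.add(1.0, (phi_s, a, r, feature_fn(s_next), done))
        s = s_next                                  # sub-step 3-3-5
```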
Step 3-4, carrying out batch sampling from the obtained experience pool using the SumTree sampling method according to priority.
As a further improvement of the invention: the experience replay mechanism in the traditional DQN algorithm only saves the sample states, actions, rewards and other data obtained by interaction with the environment, and does not introduce the concept of priority. The invention improves the experience replay mechanism of the traditional DQN algorithm: the priority of each sample is set proportional to the absolute value |δ(t)| of its TD error, and the priority value is stored in the experience replay pool.
In the empirical playback pool, the effect on back propagation is different for different samples due to different TD errors. The larger the TD error, the greater the effect on the back propagation. And the sample with small TD error has little influence on the calculation of the inverse gradient. In the Q network, the TD error is the difference between the target Q value calculated by the target Q network and the Q value calculated by the current Q network.
Sampling method of the improved experience replay mechanism: sample m samples {φ(S_j), A_j, R_j, φ(S′_j), is_end_j}, j = 1, 2, ..., m; the probability of each sample being sampled is
P(j) = p_j / Σ_i p_i
and the loss-function weight is calculated as w_j = (N·P(j))^(−β) / max_i(w_i), wherein p_i, which determines the probability of the i-th sample being sampled in the current iteration of the 1-to-T iterations, is proportional to |δ(t)|, w_j is the loss-function weight for the j-th sample, and β is the sampling weight coefficient.
Fig. 2 illustrates the sampling procedure of SumTree. In the binary tree structure of SumTree, each leaf node corresponds to a value interval; the larger the value (i.e., the priority) of a leaf node, the larger its interval length, and therefore the larger the probability that a value sampled uniformly from the total value interval falls into that interval. The implementation is as follows (see the sketch after this list):
1) To extract one data item, uniformly sample a value in the total value interval (0–29 in the example of Fig. 2); assume it is v;
2) Starting from node 0 (the root) as the parent node, traverse its child nodes;
3) If the value of the left child node is greater than v, take the left child node as the new parent node and continue traversing its child nodes;
4) If the value of the left child node is smaller than v, subtract the value of the left child node from v, select the right child node as the new parent node, and continue traversing its child nodes;
5) Continue until a leaf node is reached; the value of that leaf node is the priority.
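The traversal in steps 1) to 5) can be realised with an array-backed binary tree whose internal nodes store the sum of their children's priorities. The sketch below also derives the importance-sampling weight w_j = (N·P(j))^(−β) / max_i w_i used in the loss function; the capacity value, variable names and the use of the tree capacity as N are illustrative assumptions.

```python
import numpy as np

class SumTree:
    """Array-backed binary tree: leaves hold priorities, parents hold sums of children."""

    def __init__(self, capacity):
        self.capacity = capacity                    # number of leaves (MEMORY_SIZE)
        self.tree = np.zeros(2 * capacity - 1)      # internal nodes followed by leaves
        self.data = [None] * capacity               # transitions stored beside the leaves
        self.write = 0                              # index of the slot to overwrite next

    def add(self, priority, data):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = data
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity   # replace the oldest entry when full

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                            # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, v):
        """Descend from the root: go left if v fits in the left sum, else subtract and go right."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):         # stop once a leaf index is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if v <= self.tree[left]:
                idx = left
            else:
                v -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

def sample(tree, m, beta):
    """Draw m transitions with probability proportional to priority; return IS weights."""
    total = tree.tree[0]                            # root node holds the sum of all priorities
    batch, weights = [], []
    for _ in range(m):
        v = np.random.uniform(0.0, total)           # uniform draw over the total value interval
        idx, p, data = tree.get(v)
        prob = p / total                            # P(j) = p_j / sum_i p_i
        weights.append((tree.capacity * prob) ** (-beta))   # (N * P(j))^(-beta), N ~ capacity
        batch.append((idx, data))
    w = np.array(weights)
    return batch, w / w.max()                       # normalise by max_i w_i
```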
Step 4, performing a network update on the samples obtained by sampling the environment model using the improved DQN algorithm, and assigning values to the state-action pairs of the samples; the network structure of the DQN is shown in Fig. 3. Step 4 may be divided into the following sub-steps:
Step 4-1, calculating the current target Q value y_j; the implementation method is shown in formula (3):
y_j = R_j                                          (is_end_j is true)
y_j = R_j + γ max_{a′} Q′(φ(S′_j), a′, w′)          (is_end_j is false)    (3)
Step 4-2, using the mean square error loss function
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²
and updating all parameters w of the Q network by gradient back-propagation of the neural network; wherein m is the number of samples of the batch gradient descent, and A_j is the action set of the current iteration round.
As part of the improvement of the invention, in addition to priority-based empirical replay, the invention also optimizes the loss function of the Q network:
the conventional loss function is:
L(w) = (1/m) Σ_{j=1}^{m} (y_j − Q(φ(S_j), A_j, w))²    (4)
The invention adds the sample priority to obtain a new loss function:
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²    (5)
wherein w_j, the priority weight of the j-th sample, is obtained by normalizing the TD error |δ(t)|:
w_j = (N·P(j))^(−β) / max_i(w_i) = (N·P(j))^(−β) / max_i((N·P(i))^(−β)) = (P(j))^(−β) / max_i((P(i))^(−β)) = (P(j) / min_i P(i))^(−β)    (6)
A sketch of one update step with this weighted loss is given below.
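As a sketch of how equation (3) and the weighted loss of equation (5) combine in one gradient step, the following assumes a PyTorch Q network and target network; the tensor layout, the optimiser and the helper name are illustrative, and the distance-dependent term τ of step 4-3 is omitted.

```python
import torch

def dqn_update(q_net, target_net, optimizer, batch, weights, gamma):
    """One priority-weighted MSE update; returns |TD error| for refreshing the priorities."""
    phi_s, a, r, phi_s_next, is_end = batch                   # tensors stacked over the m samples
    q_sa = q_net(phi_s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(phi(S_j), A_j, w)
    with torch.no_grad():
        max_next = target_net(phi_s_next).max(dim=1).values   # max_a' Q'(phi(S'_j), a', w')
        y = r + gamma * max_next * (1.0 - is_end)             # equation (3); y_j = R_j when terminal
    td_error = y - q_sa
    loss = (weights * td_error.pow(2)).mean()                 # equation (5): priority-weighted MSE
    optimizer.zero_grad()
    loss.backward()                                           # gradient back-propagation updates w
    optimizer.step()
    return td_error.detach().abs()                            # |delta_j| -> new SumTree priorities
```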
step 4-3, recalculating TD-error of all samples: delta j =y j +γmax a′ Q (s ', a') -Q (s, a) +τ, updating the priority p of all nodes in SumPree j =|δ j I (I); where τ varies with distance.
Step 4-4, if i% c=1, updating the target Q network parameter w' =w; wherein i is the current iteration round number, and C is the update frequency of the target Q network parameters.
And step 4-5, if S' is in a termination state, finishing the iteration of the current round, otherwise, jumping to step 3-3-2.
The Q values in the above steps 4-1 and 4-2 are calculated through the Q network. Meanwhile, for the algorithm to converge better, the exploration rate ε needs to decrease as the iterations progress.
After gradient updating of the Q network parameters, the TD error must be recalculated and updated in SumTree; when the number of training episodes reaches a preset condition, the updating process ends and the model parameters of the DQN are saved.
The update process is performed in units of episodes (Episode). In the updating process, each Episode starts from the initial state, and when the unmanned aerial vehicle meets any one of the following conditions the Episode ends and learning restarts with the next Episode: 1) reaching the target; 2) encountering a threat (task failure); 3) flying out of the task area boundary. A sketch of this outer training loop is given below.
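Putting the pieces together, a sketch of the outer update loop implied by steps 3-3 to 4-5 might look as follows. It assumes the helpers sketched above and a to_tensors() helper (assumed here) that stacks a sampled batch into tensors; the episode budget, the target-network period C and the ε decay schedule are illustrative values, not the patented settings.

```python
import torch

def train(env, q_net, target_net, optimizer, replay,
          episodes=2000, m=32, beta=0.4, gamma=0.90,
          eps=0.90, eps_min=0.05, eps_decay=0.995, sync_every=100):
    """Interact, sample by priority, update the Q network and refresh the priorities."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            phi_s = feature_fn(s)
            a = epsilon_greedy(q_net, phi_s, eps)
            s_next, r, done = env.step(a)
            replay.add(1.0, (phi_s, a, r, feature_fn(s_next), done))
            s = s_next
            batch, weights = sample(replay, m, beta)          # priority-proportional sampling
            td_abs = dqn_update(q_net, target_net, optimizer,
                                to_tensors([d for _, d in batch]),   # assumed helper
                                torch.as_tensor(weights, dtype=torch.float32), gamma)
            for (idx, _), p in zip(batch, td_abs.tolist()):
                replay.update(idx, p)                         # p_j = |delta_j|
            step += 1
            if step % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())    # w' = w every C iterations
        eps = max(eps_min, eps * eps_decay)                   # shrink exploration as training proceeds
```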
Step 5, selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
The effect of the invention can be further illustrated by the following simulation experiment:
for a full comparison, the invention constructs 4 agents: Agent1 and Agent3 adopt the traditional experience replay mechanism, Agent2 and Agent4 adopt the improved experience replay mechanism; meanwhile, Agent1 and Agent2 adopt the traditional reward mechanism, and Agent3 and Agent4 adopt the improved reward mechanism, as shown in Table 1:
table 1 experiment setting table
Agent1: traditional experience replay mechanism, traditional reward mechanism
Agent2: improved experience replay mechanism, traditional reward mechanism
Agent3: traditional experience replay mechanism, improved reward mechanism
Agent4: improved experience replay mechanism, improved reward mechanism
The embodiment evaluates the advantages and disadvantages of the algorithms in solving the autonomous path planning problem from a practical application perspective. The specific evaluation is as follows: based on the same simulation environment, all agents start from the same place, and the number of steps required to capture the target is used as the evaluation criterion.
As shown in FIG. 4, the four agents completed the task in an average of 67.5, 61.6, 66.4 and 58.3 steps, respectively. Clearly, the improvements of the invention perform better than the traditional algorithm in solving the unmanned aerial vehicle autonomous path planning problem. The invention also compares the training time of Agent3 and Agent4; the result, shown in FIG. 5, indicates that the improvements of the invention learn faster than the traditional algorithm.
The foregoing description is only illustrative of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, and all changes that may be made in the equivalent structures described in the specification and drawings of the present invention are intended to be included in the scope of the invention.

Claims (6)

1. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment is characterized by comprising the following steps:
1) Establishing an unmanned aerial vehicle autonomous movement flight model in a two-dimensional space, randomly generating the number and the positions of the obstacles and the starting point of the unmanned aerial vehicle;
2) Establishing an environment model based on a Markov decision process framework, and designing a ladder rewarding mechanism;
3) Selecting an action based on the state and the strategy; after taking the action, the unmanned aerial vehicle interacts with the environment to generate a new state and calculates the obtained reward; the feature vector corresponding to the state, the action, the reward, the feature vector corresponding to the next-moment state and the termination flag bit form a five-tuple, which is stored in an experience pool, and a SumTree sampling method is used for batch sampling from the obtained experience pool according to a priority experience replay mechanism to train a network model;
4) Performing network updating on a sample obtained by sampling an environment model by adopting an improved DQN algorithm, and assigning a value to a state-action pair of the sample;
5) And selecting an optimal action according to the Q value of each action in the state in the sample, and further obtaining an optimal strategy.
2. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein: in step 1), a two-dimensional world is created for training and testing and threats are set in the two-dimensional world, the starting position of the drone being fixed while the positions of the threats and the target vary randomly.
3. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 2) comprises the steps of:
S2-1, the state space S is described as a vector space; the state S_t of the environment at time t is one state in its set of environmental states;
S2-2, the action space A is described as a discrete vector space; the action A_t taken by the individual at time t is one action in the action set;
S2-3, the reward signal R describes the environment's judgment of the Agent's action; the reward R_{t+1} corresponding to the action A_t taken by the individual in state S_t at time t is obtained at time t+1; a ladder reward mechanism is designed, namely, on the premise of fully considering the characteristics of the motion planning problem, rewards are dynamically set according to the distance between the unmanned aerial vehicle and the designated target, so as to enrich the intermediate reward information in the movement process of the unmanned aerial vehicle;
S2-4, the policy π of an individual is described as the basis on which the individual takes actions, namely, the individual selects actions according to the policy π;
S2-5, the value v_π(s) after an Agent action describes the value obtained by the individual after taking an action, expressed as an expectation in terms of the policy π and the state s;
S2-6, the reward decay factor γ lies in [0,1]; if γ = 0, the approach is greedy, i.e., the value is determined only by the current delayed reward; if γ = 1, all subsequent state rewards are weighted equally with the current reward; most of the time a number between 0 and 1 is taken, i.e., the current delayed reward carries more weight than subsequent rewards;
S2-7, the state transition model of the environment is expressed as a probability model, i.e., the probability of taking action a in state s and moving to the next state s′, expressed as
P_{ss′}^a = P(S_{t+1} = s′ | S_t = s, A_t = a)
S2-8, the exploration rate ε is described as the probability that the Agent explores when selecting the next action; this ratio is used in the reinforcement learning training iteration process.
4. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 3) comprises the steps of:
s3-1, establishing a data buffer area with the capacity of MEMORY_SIZE for storing historical experience, and initializing to be empty;
s3-2, continuously collecting historical experiences of interaction between the unmanned aerial vehicle and the environment, and storing the historical experiences into an experience pool;
the interaction process is as follows: the unmanned aerial vehicle acquires the environmental state information as the current state information S and obtains its feature vector φ(S); taking the obtained feature vector φ(S) as input, the Q values of all actions in the current state are evaluated, and the optimal action A in the current Q value output is selected according to an ε-greedy strategy combined with heuristic search rules; the unmanned aerial vehicle executes the action to obtain the environmental state at the next moment, and thereby the state information S′ at the next moment.
S3-3, storing the historical experience data into the experience pool; if the number of data items in the experience pool is greater than its maximum capacity, the latest experience data replaces the oldest experience data;
S3-4, carrying out batch sampling from the obtained experience pool using the SumTree sampling method according to a priority experience replay mechanism; the priority of each sample is set proportional to the absolute value |δ(t)| of its TD error, and the priority value is stored in the experience replay pool; samples are extracted under priority considerations using a binary tree structure based on SumTree.
5. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 4) comprises the steps of:
S4-1, calculating the current target Q value y_j;
S4-2, using the mean square error loss function
L(w) = (1/m) Σ_{j=1}^{m} w_j (y_j − Q(φ(S_j), A_j, w))²
and updating all parameters w of the Q network by gradient back-propagation of the neural network, wherein m is the number of samples of the batch gradient descent and A_j is the action set of the current iteration round;
S4-3, recalculating the TD error of all samples: δ_j = y_j + γ max_{a′} Q(s′, a′) − Q(s, a) + τ, and updating the priority of all nodes in SumTree: p_j = |δ_j|, where τ varies with distance;
S4-4, if i % C = 1, updating the target Q network parameter w′ = w, wherein i is the current iteration round number and C is the update frequency of the target Q network parameters;
S4-5, after gradient updating of the Q network parameters, the TD error is recalculated and updated in SumTree.
6. The unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under an unknown environment according to claim 1, wherein the method comprises the following steps: step 5) comprises the steps of:
the feature vector φ(S) obtained from the state sequence is used as input to evaluate the Q value of each action in the current state; the action selection strategy then selects the corresponding action from the current Q value output according to an ε-greedy strategy combined with heuristic search rules, and determines the flight direction of the unmanned aerial vehicle.
CN202211099640.3A 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment Pending CN116225046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099640.3A CN116225046A (en) 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099640.3A CN116225046A (en) 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Publications (1)

Publication Number Publication Date
CN116225046A true CN116225046A (en) 2023-06-06

Family

ID=86570281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099640.3A Pending CN116225046A (en) 2022-09-09 2022-09-09 Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment

Country Status (1)

Country Link
CN (1) CN116225046A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116718198A (en) * 2023-08-10 2023-09-08 湖南璟德科技有限公司 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
CN116718198B (en) * 2023-08-10 2023-11-03 湖南璟德科技有限公司 Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination