CN114564039A - Flight path planning method based on deep Q network and fast search random tree algorithm - Google Patents

Flight path planning method based on deep Q network and fast search random tree algorithm

Info

Publication number
CN114564039A
Authority
CN
China
Prior art keywords
space
network
tree
algorithm
state
Prior art date
Legal status
Granted
Application number
CN202210089643.2A
Other languages
Chinese (zh)
Other versions
CN114564039B (en)
Inventor
李昭莹
石若凌
欧一鸣
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210089643.2A priority Critical patent/CN114564039B/en
Publication of CN114564039A publication Critical patent/CN114564039A/en
Application granted granted Critical
Publication of CN114564039B publication Critical patent/CN114564039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/106Change initiated in response to external conditions, e.g. avoidance of elevated terrain or of no-fly zones

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a flight path planning method based on a deep Q network and a fast search random tree algorithm. First, the Markov decision process in the RRT algorithm is abstracted: because of the randomness of RRT growth, each expansion step can be regarded as a Markov process, and an MDP model of the RRT can be built. A deep Q network is then trained, and the optimal action corresponding to each state is obtained by querying the network; improved RRT path planning is performed after introducing this optimal action. The invention gives the RRT-GoalBias algorithm stronger obstacle avoidance capability and increases the chance of greedy expansion, thereby improving the efficiency and stability of the algorithm.

Description

Flight path planning method based on deep Q network and fast search random tree algorithm
Technical Field
The invention relates to the field of flight path planning, in particular to a flight path planning method combining a deep Q network and a fast search random tree algorithm.
Background
Flight path planning is one of the key links in unmanned aerial vehicle intelligence and has developed rapidly, driven by computer, information, and artificial intelligence technologies. The path planning algorithm is the core of flight path planning; its aim is to find, in a model space, a path from a starting point to a target point that satisfies certain constraints and meets certain performance indexes (path length, time, energy consumption, and the like) according to actual needs. The path therefore not only needs to satisfy various platform constraints but must also ensure that the agent does not collide with obstacles when moving along it. The Rapidly-exploring Random Tree (RRT) algorithm has the advantages of requiring no preprocessing of the state space and having a simple procedure, but it suffers from high randomness, redundant and inefficient search, and non-optimal path quality, which restricts its application. The RRT-GoalBias algorithm is a variant of RRT that adds target guidance; it is simple, efficient, and fast to converge, but its randomness is still high and it lacks obstacle avoidance capability, which reduces its efficiency to some extent.
Disclosure of Invention
To address the reduced efficiency and unstable operation caused by the high randomness and poor obstacle avoidance capability of the RRT-GoalBias algorithm, the invention provides a deep-Q-network-based RRT-GoalBias path planning optimization algorithm (DQN-RRTGoalBias) that improves algorithm efficiency and stability.
The invention relates to a flight path planning method based on a deep Q network and a fast search random tree algorithm, which comprises the following steps:
Step 1: model the Markov decision process of the RRT algorithm.
Step 2: train a deep Q network.
The purpose of using the DQN algorithm is to obtain a function that can evaluate the value of a state-action pair (s_t, a_t), called the target Q value and denoted Q(s_t, a_t); a neural network is used as a function approximator to approximate Q(s_t, a_t), the error is minimized by gradient descent, and the approximating function is Q(s, a; w), where w are the fitting parameters.
Step 3: plan the path according to the deep Q network.
According to the deep Q network, the optimal action a_opt corresponding to each state can be obtained:
a_opt = argmax_a Q(s, a; w)
The improved RRT path planning process introducing the optimal action is as follows:
1) Add the starting point S to the random tree table X_tree;
2) Sample a tree node P_samp in the state space X; the sampling method is:
P_samp = P_goal, with probability p;  P_samp = P_rand, with probability 1 − p
where P_goal is the end point, P_rand is a random point in the state space, and p is a constant (0 < p < 1);
3) In the random tree table X_tree, find the tree node P_near nearest to the sampled node P_samp;
4) If P_samp ≠ P_goal, extend from the nearest tree node P_near toward the random node P_rand by the step length d_0 to obtain a new tree node P_new; otherwise, execute the optimal action a_opt of the current state to obtain the new tree node P_new;
5) Judge whether the new tree node P_new and the new branch P_near P_new lie in the free space X_free; if so, add P_new to the random tree table X_tree; if not, return to step 2);
6) Repeat steps 2) to 5) until the random tree extends to the end point.
The invention has the advantages that:
1. The flight path planning method based on the deep Q network and the fast search random tree algorithm gives the RRT-GoalBias algorithm stronger obstacle avoidance capability and increases the probability of greedy expansion, thereby improving the efficiency and stability of the algorithm.
2. The method models the node exploration process of RRT as a Markov Decision Process (MDP) model and embodies the exploration preference of the RRT algorithm through the design of the environment-feedback reward function.
3. The method designs a new exploration mechanism based on the RRT algorithm that improves exploration efficiency while preserving the probabilistic completeness of the algorithm.
Drawings
Fig. 1 is a flow chart of the flight path planning method based on the deep Q network and the fast search random tree algorithm according to the present invention.
Fig. 2 is a schematic diagram of the complex-domain variable-step-size obstacle avoidance strategy.
Fig. 3 is a structure diagram of the BP neural network.
Fig. 4 is map a.
Fig. 5 is map b.
Fig. 6 is the optimal-action visualization for map a.
Fig. 7 is the optimal-action visualization for map b.
Fig. 8 shows the simulation example setup.
Fig. 9 compares algorithm performance in each example.
Fig. 10 shows the algorithm running time in example 1.
Fig. 11 shows the algorithm running time in example 2.
Fig. 12 shows the algorithm running time in example 3.
Detailed Description
The present invention will be described in further detail with reference to examples.
The invention relates to a flight path planning method based on a deep Q network and a fast search random tree algorithm; the specific steps, shown in Fig. 1, are as follows:
Step 1: Markov decision process modeling of the RRT algorithm
Reinforcement learning generally uses the Markov Decision Process (MDP) as its basic framework. In an MDP, the agent senses the current system state, selects and executes an action from the action space according to the optimal strategy, thereby changing the environment and its own state and receiving feedback (a reward) from the environment. To introduce a reinforcement learning algorithm, the Markov decision process in the RRT algorithm must first be abstracted. Because of the randomness of RRT growth, each expansion step can be regarded as a Markov process, and an MDP model of the RRT can be built. The elements of the MDP model in the RRT algorithm according to the present invention are defined as follows.
① State space
The area specified by the path planning task is called the "planning space" and can be described as:
X = {(x, y) | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max}
where X is the planning space; x and y are the two-dimensional coordinates of the agent's position; x_min and y_min are the minimum values of the two coordinates; x_max and y_max are the maximum values of the two coordinates.
The planning space can be divided into "free space" and "obstacle space": the free space represents the area the agent can pass through, and the obstacle space represents the area the agent cannot pass through, so that:
X = X_free + X_obs
where X_free is the free space and X_obs is the obstacle space.
For convenience of calculation, a binary grid map is often used in path planning to discretely represent the planning space model: a grid cell with value 0 represents a free-space node and a grid cell with value 1 represents an obstacle-space node, that is:
map(x, y) = 0, if (x, y) ∈ X_free;  map(x, y) = 1, if (x, y) ∈ X_obs
where map(x, y) is the binary grid map.
On a binary grid map, free space can be defined as:
X_free = {(x, y) | map(x, y) = 0}
accordingly, the obstacle space may be defined as:
X_obs = {(x, y) | map(x, y) = 1}
In the present invention, the two-dimensional space composed of the coordinates of all points in the free space X_free is called the state space, represented as:
S = {(x, y) | map(x, y) = 0}
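To make the state-space definitions above concrete, the following is a minimal sketch (not part of the patent) of a binary grid map with a free-space membership test; the map size, layout, and function names are illustrative assumptions.

```python
import numpy as np

# Illustrative 10 x 10 binary grid map: 0 = free-space node, 1 = obstacle-space node.
# The layout is an assumption for demonstration, not the patent's map a or map b.
grid = np.zeros((10, 10), dtype=int)
grid[3:7, 4] = 1          # a short vertical wall of obstacle cells

def in_free_space(x, y, grid_map):
    """Return True if the integer cell (x, y) lies in X_free = {(x, y) | map(x, y) = 0}."""
    h, w = grid_map.shape
    if not (0 <= x < w and 0 <= y < h):
        return False       # outside the planning space X
    return grid_map[y, x] == 0

print(in_free_space(2, 2, grid))   # True: free-space node
print(in_free_space(4, 5, grid))   # False: obstacle-space node
```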
② Action space
To enable the RRT to avoid obstacles autonomously by adjusting its growth direction, the action in the random tree exploration process is designed as the growth angle of the branch, and a complex-domain variable-step-size obstacle avoidance strategy is introduced, in which the complex step is defined as:
d = d_0 e^(jθ)
where d_0 denotes the branch length and is a positive real constant; j is the imaginary unit; θ denotes the rotation angle of the new branch relative to the direction toward the target point P_goal, with value range (−π, π). As shown in Fig. 2, the action space designed by the present invention is a set of 5 complex steps, that is, the growth direction of each branch has five choices, with the following expression:
Figure BDA0003488677860000051
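The complex step d = d_0 e^(jθ) can be read as rotating, by the angle θ, the unit vector that points from P_near toward the target P_goal, then scaling it by the branch length d_0. The sketch below illustrates this; since the patent's expression for the five-action set is not reproduced above, the angle values are an assumption for illustration only.

```python
import numpy as np

d0 = 5.0                                                              # branch length (positive real constant)
ACTION_ANGLES = [0.0, np.pi / 4, -np.pi / 4, np.pi / 2, -np.pi / 2]   # assumed 5-action set

def expand(p_near, p_goal, theta, d0=d0):
    """Grow a new branch of length d0, rotated by theta from the direction toward P_goal."""
    p_near, p_goal = complex(*p_near), complex(*p_goal)
    heading = (p_goal - p_near) / abs(p_goal - p_near)   # unit vector toward the target point
    step = d0 * np.exp(1j * theta) * heading             # complex step d = d0 * e^(j*theta)
    p_new = p_near + step
    return (p_new.real, p_new.imag)

# Executing action a = d0 * e^(j0) grows the branch straight toward the goal:
print(expand((0.0, 0.0), (10.0, 0.0), 0.0))              # approximately (5.0, 0.0)
```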
③ Reward function
In the RRT algorithm, generally, the fewer the collisions with obstacles and the faster the approach to the target point, the higher the path planning efficiency. Therefore, the reward function is designed to measure whether the current state-action pair causes the random tree to encounter an obstacle and whether it moves closer to the target point. Meanwhile, the obstacle avoidance of the random tree needs a certain degree of foresight, that is, obstacle-avoidance actions should already be taken while P_near is still some distance from the obstacle. The reward function is thus designed as:
Figure BDA0003488677860000052
where c_1, c_2, c_3 and k are positive constants, and |arg(a)| denotes the argument of action a. Condition P is expressed as: from the current state s_t, take action a = d_0 e^(j0) to expand the random tree to the next state s_{t+1}; based on the result of the reinforcement learning training, the optimal action corresponding to state s_{t+1} is
Figure BDA0003488677860000053
Or
Figure BDA0003488677860000054
Step 2: Training the deep Q network
The purpose of using the DQN algorithm is to obtain a function that can evaluate the value of a state-action pair (s_t, a_t), called the target Q value and denoted Q(s_t, a_t). The invention uses a neural network as a function approximator to approximate Q(s_t, a_t) and minimizes the error by gradient descent; the approximating function is Q(s, a; w), where w are the fitting parameters.
The invention designs the neural network using the back propagation algorithm. The input is the current state s_t and the output is the Q value of each of the five actions in the action space for that state, so a 2-input, 5-output feedforward neural network is used; its structure is shown in Fig. 3.
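As a concrete illustration of the 2-input, 5-output feedforward structure, the sketch below defines such a network in PyTorch; the hidden-layer width and activation are assumptions, since Fig. 3 is not reproduced here.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """2 inputs (state coordinates x, y) -> 5 outputs (Q value of each action in the action space)."""
    def __init__(self, hidden=64):            # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 5),
        )

    def forward(self, state):
        return self.net(state)                # shape (..., 5): one Q value per action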
The main process of training is as follows (a code sketch is given after the list):
(1) Initialize the replay buffer D, the prediction network Q(s, a; w), and the target network Q(s, a; w^-), where w^- = w is a random weight;
(2) Initialize the state s_t;
(3) Use the prediction network to return the Q values of the five actions in the current state, and select and perform the action a_t with the maximum Q value;
(4) After performing the action, compute the reward R_{t+1} using the reward function, transition to the new state s_{t+1}, and update the corresponding Q value according to:
Q(s_t, a_t) = R_{t+1} + λ max_a Q(s_{t+1}, a; w^-)
wherein λ is a discount factor;
(5) Store the transition <s_t, a_t, R_{t+1}, s_{t+1}> in the replay buffer D;
(6) Randomly sample a batch of transitions from the replay buffer D and compute the loss function:
L(w) = [R_{t+1} + λ max_a Q(s_{t+1}, a; w^-) − Q(s_t, a_t; w)]^2
And updating w by using a gradient descent method;
(7) Repeat steps (3) to (6) C times, then set w^- = w;
(8) Repeat steps (3) to (7) until the final state is reached (that is, the random tree has extended to the end point).
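Steps (1) to (8) can be sketched as the loop below. It assumes the QNetwork class from the previous sketch, a hypothetical environment object `env` whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)` (implementing the reward function above), and illustrative hyperparameters; the `(1 - done)` terminal mask is a standard detail added here for correctness and is not stated in the patent.

```python
import random
from collections import deque

import torch

q_net = QNetwork()                                   # prediction network Q(s, a; w)
target_net = QNetwork()                              # target network Q(s, a; w^-)
target_net.load_state_dict(q_net.state_dict())       # (1) w^- = w
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                        # replay buffer D
lam, C, batch_size = 0.95, 100, 32                   # discount factor, sync period, batch size (assumed)

def train_episode(env):
    state = env.reset()                              # (2) initial state s_t
    done, step = False, 0
    while not done:
        with torch.no_grad():                        # (3) greedy action: maximum predicted Q value
            q_values = q_net(torch.tensor(state, dtype=torch.float32))
        action = int(q_values.argmax())
        next_state, reward, done = env.step(action)  # (4) reward R_{t+1} and new state s_{t+1}
        replay.append((state, action, reward, next_state, done))   # (5) store the transition
        if len(replay) >= batch_size:                # (6) sample a batch, one gradient step on L(w)
            s, a, r, s2, d = zip(*random.sample(replay, batch_size))
            s, s2 = (torch.tensor(v, dtype=torch.float32) for v in (s, s2))
            a = torch.tensor(a, dtype=torch.int64)
            r, d = (torch.tensor(v, dtype=torch.float32) for v in (r, d))
            with torch.no_grad():                    # target: R_{t+1} + lam * max_a Q(s_{t+1}, a; w^-)
                target = r + lam * (1.0 - d) * target_net(s2).max(dim=1).values
            pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((target - pred) ** 2).mean()     # squared TD error L(w)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        step += 1
        if step % C == 0:                            # (7) periodically sync the target network
            target_net.load_state_dict(q_net.state_dict())
        state = next_state
```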
Step 3: Path planning according to the deep Q network
According to the deep Q network, the optimal action a_opt corresponding to each state can be obtained:
a_opt = argmax_a Q(s, a; w)
The improved RRT path planning process that introduces the optimal action is as follows (a code sketch follows the list):
1) Add the starting point S to the random tree table X_tree;
2) Sample a tree node P_samp in the state space X; the sampling method is:
P_samp = P_goal, with probability p;  P_samp = P_rand, with probability 1 − p
where P_goal is the end point, P_rand is a random point in the state space, and p is a constant (0 < p < 1);
3) In the random tree table X_tree, find the tree node P_near nearest to the sampled node P_samp;
4) If P_samp ≠ P_goal, extend from the nearest tree node P_near toward the random node P_rand by the step length d_0 to obtain a new tree node P_new; otherwise, execute the optimal action a_opt of the current state to obtain the new tree node P_new;
5) Judge whether the new tree node P_new and the new branch P_near P_new lie in the free space X_free; if so, add P_new to the random tree table X_tree; if not, return to step 2);
6) Repeat steps 2) to 5) until the random tree extends to the end point.
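Putting steps 1) to 6) together, the sketch below shows one way the DQN-guided expansion loop could look. It assumes the `in_free_space` and `expand` helpers and the grid map from the earlier sketches, plus a hypothetical `optimal_action(state)` lookup that returns the angle a_opt learned by the deep Q network for that state; the goal-bias probability, step length, iteration limit, and collision-check granularity are illustrative.

```python
import random
import numpy as np

def branch_is_free(p_a, p_b, grid, n_checks=10):
    """Sample points along the branch P_near-P_new and test each against the grid map."""
    for t in np.linspace(0.0, 1.0, n_checks):
        x = p_a[0] + t * (p_b[0] - p_a[0])
        y = p_a[1] + t * (p_b[1] - p_a[1])
        if not in_free_space(int(x), int(y), grid):
            return False
    return True

def plan(start, goal, grid, optimal_action, p=0.3, d0=5.0, max_iter=5000):
    """DQN-guided RRT-GoalBias expansion loop (illustrative sketch, not the patent's code)."""
    tree = [start]                                   # 1) random tree table X_tree
    parents = {start: None}
    for _ in range(max_iter):
        # 2) goal-biased sampling: P_goal with probability p, otherwise a random point
        if random.random() < p:
            p_samp = goal
        else:
            p_samp = (random.uniform(0, grid.shape[1]), random.uniform(0, grid.shape[0]))
        # 3) nearest tree node P_near
        p_near = min(tree, key=lambda q: (q[0] - p_samp[0]) ** 2 + (q[1] - p_samp[1]) ** 2)
        # 4) straight step toward the sample, or the learned optimal action toward the goal
        if p_samp != goal:
            p_new = expand(p_near, p_samp, 0.0, d0)
        else:
            p_new = expand(p_near, goal, optimal_action(p_near), d0)
        # 5) keep P_new only if the new node and the new branch lie in free space
        if branch_is_free(p_near, p_new, grid):
            tree.append(p_new)
            parents[p_new] = p_near
            # 6) stop once the tree has extended to the end point
            if (p_new[0] - goal[0]) ** 2 + (p_new[1] - goal[1]) ** 2 <= d0 ** 2:
                return tree, parents
    return tree, parents
```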
To verify the effect of the algorithm, two 500 × 500 maps, labeled map a and map b, are selected, as shown in Figs. 4 and 5. The optimal action table is trained and extracted on the MATLAB platform, as shown in Figs. 6 and 7, and a path planning simulation experiment is performed according to this table. Three examples are set up in the simulation, as shown in Fig. 8; 1000 runs are performed for each example and the planning time of each run is recorded. The results are shown in Figs. 9 to 12. The simulation results show that the DQN-RRTGoalBias algorithm improves efficiency and time-performance stability and achieves the expected effect under different conditions.
By designing a complex-domain variable-step-size obstacle avoidance strategy, the deep-Q-network-based RRT-GoalBias path planning optimization algorithm combines the relatively mechanical RRT-GoalBias algorithm, which has poor obstacle avoidance capability, with a deep Q network, giving the algorithm the ability to flexibly avoid obstacles according to learned experience. Simulation shows that the optimized algorithm improves the efficiency and time-performance stability of the algorithm.

Claims (4)

1. A flight path planning method based on a deep Q network and a fast search random tree algorithm is characterized in that:
step 1: modeling a Markov decision process of the RRT algorithm;
step 2: training a deep Q network;
the purpose of using the DQN algorithm is to obtain a function that can evaluate the value of a state-action pair (s_t, a_t), called the target Q value and denoted Q(s_t, a_t); a neural network is used as a function approximator to approximate Q(s_t, a_t), the error is minimized by gradient descent, and the approximating function is Q(s, a; w), where w are the fitting parameters;
and step 3: planning a path according to the deep Q network;
according to the deep Q network, the optimal action a_opt corresponding to each state can be obtained:
a_opt = argmax_a Q(s, a; w)
The improved RRT path planning process introducing the optimal action is as follows:
1) adding the starting point S to the random tree table X_tree;
2) sampling a tree node P_samp in the state space X, the sampling method being:
P_samp = P_goal, with probability p;  P_samp = P_rand, with probability 1 − p
wherein P_goal is the end point, P_rand is a random point in the state space, and p is a constant (0 < p < 1);
3) in the random tree table X_tree, finding the tree node P_near nearest to the sampled node P_samp;
4) if P_samp ≠ P_goal, extending from the nearest tree node P_near toward the random node P_rand by the step length d_0 to obtain a new tree node P_new; otherwise, executing the optimal action a_opt of the current state to obtain the new tree node P_new;
5) judging whether the new tree node P_new and the new branch P_near P_new lie in the free space X_free; if so, adding P_new to the random tree table X_tree; if not, returning to step 2);
6) repeating steps 2) to 5) until the random tree extends to the end point.
2. The flight path planning method based on the deep Q network and the fast search random tree algorithm as claimed in claim 1, characterized in that:
the elements of the markov decision process model are defined as:
① state space
The area specified by the path planning task is called a planning space, and can be described as follows:
X = {(x, y) | x_min ≤ x ≤ x_max, y_min ≤ y ≤ y_max}
in the formula, X is the planning space; x and y are the two-dimensional coordinates of the agent's position; x_min and y_min are the minimum values of the two coordinates; x_max and y_max are the maximum values of the two coordinates;
the planning space can be divided into free space and obstacle space; the free space represents the region the agent can pass through, and the obstacle space represents the region the agent cannot pass through, so that:
X = X_free + X_obs
in the formula, X_free is the free space and X_obs is the obstacle space;
for convenience of calculation, a binary grid map is often used in path planning to discretely represent the planning space model; a grid cell with value 0 represents a free-space node and a grid cell with value 1 represents an obstacle-space node, that is:
map(x, y) = 0, if (x, y) ∈ X_free;  map(x, y) = 1, if (x, y) ∈ X_obs
in the formula, map(x, y) is the binary grid map;
on a binary grid map, free space can be defined as:
X_free = {(x, y) | map(x, y) = 0}
accordingly, the obstacle space may be defined as:
X_obs = {(x, y) | map(x, y) = 1}
the two-dimensional space composed of the coordinates of all points in the free space X_free is called the state space, represented as:
S = {(x, y) | map(x, y) = 0}
② action space
in order to enable the RRT to avoid obstacles autonomously by adjusting its growth direction, the action in the random tree exploration process is designed as the growth angle of the branch, and a complex-domain variable-step-size obstacle avoidance strategy is introduced, in which the complex step is defined as:
d = d_0 e^(jθ)
wherein d_0 denotes the branch length and is a positive real constant; j is the imaginary unit; θ denotes the rotation angle of the new branch relative to the direction toward the target point P_goal, with value range (−π, π);
③ reward function
The reward function is designed as:
Figure FDA0003488677850000031
wherein c_1, c_2, c_3 and k are positive constants, and |arg(a)| denotes the argument of action a; condition P is expressed as: from the current state s_t, taking action a = d_0 e^(j0) to expand the random tree to the next state s_{t+1}; based on the result of the reinforcement learning training, the optimal action corresponding to state s_{t+1} is
Figure FDA0003488677850000032
Or
Figure FDA0003488677850000033
3. The flight path planning method based on the deep Q network and the fast search random tree algorithm as claimed in claim 2, characterized in that: the action space is designed as a set of 5 complex steps, with the following expression:
Figure FDA0003488677850000034
4. The flight path planning method based on the deep Q network and the fast search random tree algorithm as claimed in claim 1, characterized in that: in step 2, a neural network is designed using the back propagation algorithm; the input is the current state s_t and the output is the Q value of each action in the action space for that state, so a 2-input, 5-output feedforward neural network is used; the deep Q network training method is as follows:
(1) initializing the replay buffer D, the prediction network Q(s, a; w) and the target network Q(s, a; w^-), wherein w^- = w is a random weight;
(2) initializing the state s_t;
(3) using the prediction network to return the Q values of all actions in the current state, and selecting and performing the action a_t with the maximum Q value;
(4) after performing the action, computing the reward R_{t+1} using the reward function, transitioning to the new state s_{t+1}, and updating the corresponding Q value according to:
Q(s_t, a_t) = R_{t+1} + λ max_a Q(s_{t+1}, a; w^-)
Wherein λ is a discounting factor;
(5) storing the transition <s_t, a_t, R_{t+1}, s_{t+1}> in the replay buffer D;
(6) randomly sampling a batch of transitions from the replay buffer D and computing the loss function:
L(w) = [R_{t+1} + λ max_a Q(s_{t+1}, a; w^-) − Q(s_t, a_t; w)]^2
updating w by using a gradient descent method;
(7) repeating steps (3) to (6) C times, then setting w^- = w;
(8) repeating steps (3) to (7) until the final state is reached.
CN202210089643.2A 2022-01-25 2022-01-25 Flight path planning method based on deep Q network and rapid search random tree algorithm Active CN114564039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089643.2A CN114564039B (en) 2022-01-25 2022-01-25 Flight path planning method based on deep Q network and rapid search random tree algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210089643.2A CN114564039B (en) 2022-01-25 2022-01-25 Flight path planning method based on deep Q network and rapid search random tree algorithm

Publications (2)

Publication Number Publication Date
CN114564039A true CN114564039A (en) 2022-05-31
CN114564039B CN114564039B (en) 2024-08-02

Family

ID=81713754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089643.2A Active CN114564039B (en) 2022-01-25 2022-01-25 Flight path planning method based on deep Q network and rapid search random tree algorithm

Country Status (1)

Country Link
CN (1) CN114564039B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117970931A (en) * 2024-03-29 2024-05-03 青岛科技大学 Robot dynamic path planning method, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107085437A (en) * 2017-03-20 2017-08-22 浙江工业大学 A kind of unmanned aerial vehicle flight path planing method based on EB RRT
CN111487992A (en) * 2020-04-22 2020-08-04 北京航空航天大学 Unmanned aerial vehicle sensing and obstacle avoidance integrated method and device based on deep reinforcement learning
CN111752306A (en) * 2020-08-14 2020-10-09 西北工业大学 Unmanned aerial vehicle route planning method based on fast-expanding random tree
CN112799420A (en) * 2021-01-08 2021-05-14 南京邮电大学 Real-time track planning method based on multi-sensor unmanned aerial vehicle
US20210165405A1 (en) * 2019-12-03 2021-06-03 University-Industry Cooperation Group Of Kyung Hee University Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107085437A (en) * 2017-03-20 2017-08-22 浙江工业大学 A kind of unmanned aerial vehicle flight path planing method based on EB RRT
US20210165405A1 (en) * 2019-12-03 2021-06-03 University-Industry Cooperation Group Of Kyung Hee University Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same
CN111487992A (en) * 2020-04-22 2020-08-04 北京航空航天大学 Unmanned aerial vehicle sensing and obstacle avoidance integrated method and device based on deep reinforcement learning
CN111752306A (en) * 2020-08-14 2020-10-09 西北工业大学 Unmanned aerial vehicle route planning method based on fast-expanding random tree
CN112799420A (en) * 2021-01-08 2021-05-14 南京邮电大学 Real-time track planning method based on multi-sensor unmanned aerial vehicle
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴健发; 王宏伦; 刘一恒; 姚鹏: "Review of UAV obstacle avoidance route planning methods" (无人机避障航路规划方法研究综述), 无人系统技术 (Unmanned Systems Technology), no. 01, 15 January 2020 (2020-01-15) *
潘广贞; 秦帆; 张文斌: "Research on a dynamic adaptive rapidly-exploring tree path planning algorithm" (动态自适应快速扩展树航迹规划算法研究), 微电子学与计算机 (Microelectronics & Computer), no. 01, 5 January 2013 (2013-01-05) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117970931A (en) * 2024-03-29 2024-05-03 青岛科技大学 Robot dynamic path planning method, equipment and medium
CN117970931B (en) * 2024-03-29 2024-07-05 青岛科技大学 Robot dynamic path planning method, equipment and medium

Also Published As

Publication number Publication date
CN114564039B (en) 2024-08-02

Similar Documents

Publication Publication Date Title
CN110083165B (en) Path planning method of robot in complex narrow environment
CN110347151B (en) Robot path planning method fused with Bezier optimization genetic algorithm
CN104155974B (en) Path planning method and apparatus for robot fast collision avoidance
CN110297490B (en) Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm
CN116242383B (en) Unmanned vehicle path planning method based on reinforced Harris eagle algorithm
CN110883776A (en) Robot path planning algorithm for improving DQN under quick search mechanism
CN108413963A (en) Bar-type machine people's paths planning method based on self study ant group algorithm
CN114489052B (en) Path planning method for improving RRT algorithm reconnection strategy
CN111159489B (en) Searching method
CN111191785A (en) Structure searching method based on expanded search space
CN114564039A (en) Flight path planning method based on deep Q network and fast search random tree algorithm
Chen et al. Intelligent warehouse robot path planning based on improved ant colony algorithm
CN115493597A (en) AUV path planning control method based on SAC algorithm
Tusi et al. Using ABC and RRT algorithms to improve mobile robot path planning with danger degree
CN117520956A (en) Two-stage automatic feature engineering method based on reinforcement learning and meta learning
Yu et al. AGV multi-objective path planning method based on improved cuckoo algorithm
Cui et al. Improved multi-objective artificial bee colony algorithm-based path planning for mobile robots
Huang et al. An Improved Q-Learning Algorithm for Path Planning
Qiu et al. Obstacle avoidance planning combining reinforcement learning and RRT* applied to underwater operations
Chen et al. Optimization of robot path planning based on improved BP algorithm
Ji et al. Research on Path Planning of Mobile Robot Based on Reinforcement Learning
Li et al. Mobile Robot Path Planning Algorithm Based on Improved RRT* FN
Zhang et al. Robot path planning based on shuffled frog leaping algorithm combined with genetic algorithm
CN117388643B (en) Method, system, equipment and storage medium for positioning fault section of active power distribution network
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant