CN114003059A - UAV path planning method based on deep reinforcement learning under kinematic constraint condition - Google Patents

UAV path planning method based on deep reinforcement learning under kinematic constraint condition Download PDF

Info

Publication number
CN114003059A
CN114003059A (application CN202111282488.8A; granted publication CN114003059B)
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, neural network, reinforcement learning, points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111282488.8A
Other languages
Chinese (zh)
Other versions
CN114003059B (en)
Inventor
高明生 (Mingsheng Gao)
张晓璇 (Xiaoxuan Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN202111282488.8A priority Critical patent/CN114003059B/en
Publication of CN114003059A publication Critical patent/CN114003059A/en
Application granted granted Critical
Publication of CN114003059B publication Critical patent/CN114003059B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Abstract

The invention discloses a UAV path planning method based on deep reinforcement learning under kinematic constraints, comprising the following specific steps: S1: a deep reinforcement learning neural network derives the shortest path from the vector coordinates of the task points and the static obstacles; S2: after take-off, the unmanned aerial vehicle flies along the shortest path and executes its tasks; S3: when a dynamic obstacle is detected, the unmanned aerial vehicle sends a signal to the base station, and the supercomputer predicts the position of the unmanned aerial vehicle at the moment the signal is received; S4: a new flight path is output by the deep reinforcement learning neural network from the coordinates of the dynamic obstacle and the remaining task points and is sent to the unmanned aerial vehicle by radio; S5: the unmanned aerial vehicle executes its tasks along the new path and finally returns to the base after all tasks are completed. The invention provides an online-plus-offline framework that not only solves the problem of high-dimensional states and actions in Q-Learning, but also, while solving the TSP problem, takes the kinematic model into account and avoids dynamic obstacles.

Description

UAV path planning method based on deep reinforcement learning under kinematic constraint condition
Technical Field
The invention belongs to the field of unmanned aerial vehicle path planning design, and particularly relates to a UAV path planning method based on deep reinforcement learning under a kinematic constraint condition.
Background
In the civil and military fields, an unmanned aerial vehicle usually needs to perform tasks at multiple target points, and finding an optimal path to traverse all the target points is a key technology of unmanned aerial vehicle application research, namely a path planning problem.
Generally, path planning problems fall into three categories:
1) Numerical methods, such as mixed-integer programming. However, numerical methods usually have to solve non-convex optimization problems, which not only requires specialised commercial software (such as CPLEX) but also takes a long time.
2) Traditional intelligent algorithms, such as the genetic algorithm, ant colony algorithm, greedy algorithm and simulated annealing. However, swarm intelligence algorithms easily fall into local optima, and because their operators involve many parameters, such as the crossover rate and mutation rate, poor parameter choices can cause premature convergence; moreover, traditional intelligent algorithms can only provide a solution close to the optimum and cannot guarantee a globally optimal solution.
3) Algorithms based on reinforcement learning. In reinforcement learning, an agent selects an action by observing the current state and learns from the reward it receives. Compared with numerical algorithms and traditional intelligent algorithms, reinforcement learning is based on a Markov process and exploits the property that Markov matrices necessarily converge, which benefits global planning.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention provides a UAV path planning method based on deep reinforcement learning under kinematic constraints. It provides an online-plus-offline framework that not only solves the problem of high-dimensional states and actions in Q-Learning, but also, while solving the TSP problem, takes the kinematic model into account and avoids dynamic obstacles.
The invention mainly adopts the technical scheme that:
a UAV path planning method based on deep reinforcement learning under a kinematic constraint condition comprises the following specific steps:
S1: when the unmanned aerial vehicle is at the base, the shortest path of the unmanned aerial vehicle under the kinematic constraint is obtained with the deep reinforcement learning neural network according to the vector coordinates of the plurality of task points and the static obstacles;
S2: after take-off, the unmanned aerial vehicle flies along the shortest path and executes its tasks;
S3: during task execution, when the radar on the unmanned aerial vehicle detects a dynamic obstacle within 5 km, the unmanned aerial vehicle transmits the vector coordinates of the dynamic obstacle and the remaining task points to the base station by radio and keeps flying along the original path until it receives the base station's feedback signal; the supercomputer of the base, according to the time t_0 from the unmanned aerial vehicle sending the signal to receiving the reply, predicts the position of the unmanned aerial vehicle at the moment it receives the signal;
S4: the supercomputer of the base uses the deep reinforcement learning neural network to output the Q values of all actions from the coordinates of the dynamic obstacle and the remaining task points, generates a new action selection strategy ε-greedy from these Q values, selects actions according to this strategy to obtain a new flight path, and sends the new flight path to the unmanned aerial vehicle by radio;
S5: after receiving the feedback signal, the unmanned aerial vehicle executes its tasks along the new path, finally returns to the base after all tasks have been executed, and thereby completes its mission.
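For illustration, a minimal Python sketch of the online hand-off in steps S3 to S5 is given below. The message fields, the constant ground speed and the straight-leg dead-reckoning used to predict the position from t_0, and the planner interface base_planner are illustrative assumptions and are not specified by the method itself.

```python
import math

def predict_uav_position(last_known_xy, heading_rad, speed_mps, t0_s):
    """Dead-reckoning estimate of where the UAV will be when the reply arrives.

    Assumes the UAV keeps flying its current, locally straight leg at constant
    speed during the delay t0; a fuller version would advance along the stored
    Dubins path instead.
    """
    x, y = last_known_xy
    dist = speed_mps * t0_s
    return (x + dist * math.cos(heading_rad), y + dist * math.sin(heading_rad))

def replan_on_dynamic_obstacle(base_planner, uav_report, speed_mps):
    """Base-side handling of steps S3 to S5 (hypothetical interfaces).

    uav_report carries the detected obstacle, the remaining task points and the
    time stamps from which the delay t0 is computed; base_planner stands for the
    trained deep reinforcement learning planner (Q_eval wrapped in an epsilon-greedy
    policy), whose exact signature is assumed here.
    """
    t0 = uav_report["t_received"] - uav_report["t_sent"]
    start = predict_uav_position(uav_report["position"],
                                 uav_report["heading"], speed_mps, t0)
    # Re-plan from the predicted position over the remaining task points,
    # now treating the reported dynamic obstacle as a known obstacle.
    new_path = base_planner(start=start,
                            task_points=uav_report["remaining_points"],
                            obstacles=uav_report["obstacles"])
    return new_path  # transmitted back to the UAV by radio
```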
Preferably, the specific steps of using the deep reinforcement learning neural network to obtain the shortest path of the drone under the kinematic constraint in the step S1 are as follows:
S1-1: when the unmanned aerial vehicle is at the base, the N target task points are numbered 1, 2, 3 … N in sequence and the base is numbered 0; the dimension of the state vector of the unmanned aerial vehicle is set to N+2, the first entry of the state vector is 0 and represents the base number, the last entry is θ_i, the incidence angle at the current task point numbered i, and the middle entries are updated to the task point numbers as the unmanned aerial vehicle reaches them, so that the initial state vector of the unmanned aerial vehicle at the base is:

s_initial = [0, 0, 0, …, 0, θ_0]^T    (1);

where the first 0 represents the base number, the remaining 0 entries represent the initial state in which no task point has been reached, and θ_0 represents the incidence angle of the unmanned aerial vehicle at base 0;

S1-2: the state vector of the unmanned aerial vehicle is used as the input of the deep reinforcement learning neural network, which works out which action to select so that the total distance under the kinematic constraint is shortest, i.e. the Q value is largest, and generates the action selection strategy ε-greedy;

S1-3: the deep reinforcement learning neural network selects actions according to the action selection strategy ε-greedy, deciding which task point to go to and at what angle to fly out: when the random number is smaller than ε it explores randomly, and when the random number is greater than or equal to ε it selects the action with the largest Q value, so that the state of the unmanned aerial vehicle is updated to:

s_bcd = [0, b, c, d, 0, …, 0, θ_d]^T    (2);

where b, c and d are the numbers of the task points reached by the unmanned aerial vehicle in sequence and θ_d is the incidence angle at the task point numbered d; the state vector of the unmanned aerial vehicle is therefore the sequence of numbers of the task points it has flown to, and it is updated once every time an action is taken.
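To make the state encoding of formulas (1) and (2) and the ε-greedy selection concrete, a minimal Python sketch follows. The value of N, the discretisation of the exit heading into 24 bins, and the way an action index is decoded into a (task point, heading) pair are illustrative assumptions; only the state layout and the ε-greedy rule come from the text above.

```python
import random
import numpy as np

N_POINTS = 5      # number of task points N (example value)
N_HEADINGS = 24   # discretised exit headings, matching the N x 24 action output

def initial_state(theta0=0.0):
    """State vector of formula (1): [0, 0, ..., 0, theta_0]^T with N+2 entries."""
    s = np.zeros(N_POINTS + 2, dtype=np.float32)
    s[-1] = theta0
    return s

def update_state(state, visited, point_id, theta_d):
    """After flying to task point `point_id` with incidence angle theta_d,
    fill the next free middle slot with the point number and record the new
    angle in the last entry, as in formula (2)."""
    s = state.copy()
    s[1 + len(visited)] = point_id
    s[-1] = theta_d
    visited.append(point_id)
    return s

def epsilon_greedy(q_values, epsilon):
    """Random exploration when rand < epsilon, otherwise the action with the
    largest Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def decode_action(a):
    """One possible encoding of an action index in [0, N*24) as a
    (task point number, exit heading) pair (assumed, not from the patent)."""
    point_id = a // N_HEADINGS + 1                      # task points numbered 1..N
    theta = 2 * np.pi * (a % N_HEADINGS) / N_HEADINGS   # discretised heading
    return point_id, theta
```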
Preferably, the deep reinforcement learning neural network comprises two neural networks of the same structure: a neural network Q_eval and a neural network Q_target. At initialization the parameter weights of the two networks are identical; thereafter, while the neural network Q_eval generates the action selection strategy ε-greedy, the network parameters ω of the Q_eval network are trained and updated by back-propagation every h steps, yielding a new neural network Q_eval. The specific steps are as follows:
s1-21: shortest Dubins curve distance l between two pointsDubinsThe calculation formula of (a) is as follows:
Figure BDA0003331688900000041
in the formula (3), alpha and beta are incident angles of two points respectively, d is a linear distance between the two points, R is a turning radius of a Dubins curve, R represents clockwise motion, S represents linear motion, and L represents anticlockwise motion;
when any two task points P1And P2When no barrier exists between the two task points, substituting the vector coordinates of the two task points into a formula (3) to calculate the shortest Dubins curve distance of the two task points
Figure BDA0003331688900000042
When any two task points P1And P2When a static obstacle or a dynamic obstacle exists between the two, the shortest Dubins curve distance of the two task points
Figure BDA0003331688900000043
The specific calculation steps are as follows:
firstly, a circle C_2 of radius r is drawn with the centre of the dynamic or static obstacle as its centre, where r is the turning radius of the Dubins curve; then tangent lines are drawn from the position of the unmanned aerial vehicle, along its direction of motion, to the circle C_2, giving two common tangent points P_t1 and P_t2 and a vector p_t, expressed as

p_t = [x_t1, y_t1, θ_t1, x_t2, y_t2, θ_t2]^T    (4);

where (x_t1, y_t1) and (x_t2, y_t2) are the coordinates of the two common tangent points P_t1 and P_t2, and θ_t1 and θ_t2 are the incidence angles at the two common tangent points P_t1 and P_t2;
according to the vector coordinates of the two task points P_1 and P_2 and the vector p_t, the shortest Dubins curve distance l_Dubins^{P1P2} between the two task points is calculated as shown in formula (5):

l_Dubins^{P1P2} = l_Dubins^{P1,Pt1} + l_Dubins^{Pt1,Pt2} + l_Dubins^{Pt2,P2}    (5);

where the vector coordinates of the task points are those of the current task point P_1 and the next task point P_2, and the segment distances l_Dubins^{P1,Pt1}, l_Dubins^{Pt1,Pt2} and l_Dubins^{Pt2,P2} are each obtained by calculation according to formula (3);
s1-22: according to the shortest Dubins curve distance between two task points
Figure BDA0003331688900000055
Calculating the prize value p as shown in equation (6):
Figure BDA0003331688900000056
in the formula (6), γ1The discount coefficient is set to be 0.1 and is used for preventing gradient explosion caused by too large difference of training data;
s1-23: calculating the Loss value of the Loss function by using the reward value p calculated in the step S1-22, wherein the Loss function is expressed by the formula (7):
Figure BDA0003331688900000057
in the formula (7), the reaction mixture is,
Figure BDA0003331688900000058
neural network Q for deep reinforcement learningevalApproximate Q value of output, sjIs the state of the jth data, ajFor the j-th data, ω is the deep reinforcement learning neural network QevalParameter of (2) need to be trained, yjA Q value calculated for the drone by the instant prize value, as shown in equation (8):
Figure BDA0003331688900000059
in the formula (8), ρjIs shown in state sjTaking action ajInstant prize value, gamma, obtained2For the discount coefficient, set to 0.01,
Figure BDA00033316889000000510
neural network Q for deep reinforcement learningtargetPredicted at state s'j+1Take action a'jMaximum Q value obtainable, wherein, state s'j+1Is the state s in the formula (8)jTaking action ajRear state, a'jIs unmanned plane at state s'j+1An action capable of obtaining a maximum Q value;
S1-24: the network parameters ω of the Q_eval network are trained and updated by back-propagation according to the Loss value obtained in step S1-23; in addition, every 5×h steps the parameters of the neural network Q_eval are copied to the neural network Q_target to update it.
Preferably, the neural network Q_eval and the neural network Q_target each comprise 3 convolutional layers and 3 fully-connected layers; each convolutional layer has a convolution kernel size of 4×4 and a stride of 3×3, and the number of output actions is N×24.
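A minimal PyTorch sketch of such a network is given below. The 1×48×48 input encoding of the state, the channel counts and the fully-connected widths are illustrative assumptions, since the text only fixes the number of layers, the 4×4 kernels, the 3×3 stride and the N×24 output size.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of Q_eval / Q_target: 3 conv layers (4x4 kernels, stride 3)
    followed by 3 fully-connected layers with N*24 output actions."""

    def __init__(self, n_points, n_headings=24, in_size=48):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=3), nn.ReLU(),   # 48 -> 15
            nn.Conv2d(16, 32, kernel_size=4, stride=3), nn.ReLU(),  # 15 -> 4
            nn.Conv2d(32, 64, kernel_size=4, stride=3), nn.ReLU(),  # 4  -> 1
        )
        self.fc = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_points * n_headings),  # one Q value per (point, heading)
        )

    def forward(self, x):  # x: (batch, 1, 48, 48)
        h = self.conv(x).flatten(1)
        return self.fc(h)

# Q_eval and Q_target share the same structure and start from identical weights:
q_eval = QNetwork(n_points=5)
q_target = QNetwork(n_points=5)
q_target.load_state_dict(q_eval.state_dict())
```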
Beneficial effects: the invention provides a UAV path planning method based on deep reinforcement learning under kinematic constraints, which obtains the path plan of the unmanned aerial vehicle by means of the deep reinforcement learning algorithm DQN (deep Q-network), and has the following advantages:
(1) to address the problem that reinforcement learning cannot handle high-dimensional state and action spaces, deep reinforcement learning (DRL) uses a neural network to approximate the Q value, overcoming this shortcoming of reinforcement learning;
(2) thanks to the exploration rate ε, the algorithm can explore towards the globally optimal solution, which alleviates premature convergence;
(3) compared with traditional intelligent algorithms, the method can obtain the optimal solution under the kinematic constraint, has a certain obstacle-avoidance capability, and can be widely applied to multi-target-point inspection, detection or logistics dispatching in civil or military settings.
Drawings
FIG. 1 is a flow chart of a path planning method of the present invention;
FIG. 2 is a schematic diagram of path planning when encountering dynamic or static obstacles.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example 1
As shown in fig. 1, a UAV path planning method based on deep reinforcement learning under the kinematic constraint condition includes the following specific steps:
S1: when the unmanned aerial vehicle is at the base, the shortest path of the unmanned aerial vehicle under the kinematic constraint is obtained with the deep reinforcement learning neural network according to the vector coordinates of the plurality of task points and the static obstacles, wherein the specific steps are as follows:
S1-1: when the unmanned aerial vehicle is at the base, the N target task points are numbered 1, 2, 3 … N in sequence and the base is numbered 0; the dimension of the state vector of the unmanned aerial vehicle is set to N+2, the first entry of the state vector is 0 and represents the base number, the last entry is θ_i, the incidence angle at the current task point numbered i, and the middle entries are updated to the task point numbers as the unmanned aerial vehicle reaches them, so that the initial state vector of the unmanned aerial vehicle at the base is:

s_initial = [0, 0, 0, …, 0, θ_0]^T    (1);

where the first 0 represents the base number, the remaining 0 entries represent the initial state in which no task point has been reached, and θ_0 represents the incidence angle of the unmanned aerial vehicle at base 0;

S1-2: the state vector of the unmanned aerial vehicle is used as the input of the deep reinforcement learning neural network, which works out which action to select so that the total distance under the kinematic constraint is shortest, i.e. the Q value is largest, and generates the action selection strategy ε-greedy;

S1-3: the deep reinforcement learning neural network selects actions according to the action selection strategy ε-greedy, deciding which task point to go to and at what angle to fly out: when the random number is smaller than ε it explores randomly, and when the random number is greater than or equal to ε it selects the action with the largest Q value, so that the state of the unmanned aerial vehicle is updated to:

s_bcd = [0, b, c, d, 0, …, 0, θ_d]^T    (2);

where b, c and d are the numbers of the task points reached by the unmanned aerial vehicle in sequence and θ_d is the incidence angle at the task point numbered d; the state vector of the unmanned aerial vehicle is therefore the sequence of numbers of the task points it has flown to, and it is updated once every time an action is taken.
S2: after take-off, the unmanned aerial vehicle flies along the shortest path and executes its tasks;
S3: during task execution, when the radar on the unmanned aerial vehicle detects a dynamic obstacle within 5 km, the unmanned aerial vehicle transmits the vector coordinates of the dynamic obstacle and the remaining task points to the base by radio and keeps flying along the original path until it receives the base's feedback signal; the supercomputer of the base, according to the time t_0 from the unmanned aerial vehicle sending the signal to receiving the reply, predicts the position of the unmanned aerial vehicle at the moment it receives the signal;
S4: the supercomputer of the base uses the deep reinforcement learning neural network to output the Q values of all actions from the coordinates of the dynamic obstacle and the remaining task points, generates a new action selection strategy ε-greedy from these Q values, selects actions according to this strategy to obtain a new flight path, and sends the new flight path to the unmanned aerial vehicle by radio;
S5: after receiving the feedback signal, the unmanned aerial vehicle executes its tasks along the new path, finally returns to the base after all tasks have been executed, and thereby completes its mission.
In this embodiment 1, the deep reinforcement learning neural network comprises two neural networks of the same structure, namely a neural network Q_eval and a neural network Q_target, each of which comprises 3 convolutional layers and 3 fully-connected layers; each convolutional layer has a convolution kernel size of 4×4 and a stride of 3×3, and the number of output actions is N×24. At initialization the parameter weights of the two networks are identical; thereafter, while the neural network Q_eval generates the action selection strategy ε-greedy, the network parameters ω of the Q_eval network are trained and updated by back-propagation every h steps, yielding a new neural network Q_eval. The specific steps are as follows:
S1-21: the shortest Dubins curve distance l_Dubins between two points is calculated as

l_Dubins = min{ l_LSL, l_RSR, l_LSR, l_RSL, l_RLR, l_LRL }    (3);

in formula (3), α and β are the incidence angles of the two points, d is the straight-line distance between the two points, r is the turning radius of the Dubins curve, R denotes a clockwise turn, S denotes straight-line motion and L denotes an anticlockwise turn, so that each candidate length corresponds to one admissible sequence of turning and straight segments;
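A minimal Python sketch of this distance computation follows. It takes the minimum over the four CSC path words LSL, RSR, LSR and RSL only; omitting the CCC words RLR and LRL, which matter when the two points are very close relative to r, is an assumption made here for brevity, and the segment-length formulas used are the standard Dubins expressions rather than anything quoted from the patent.

```python
import math

def _mod2pi(x):
    return x % (2.0 * math.pi)

def dubins_shortest_length(p1, theta1, p2, theta2, r):
    """Shortest Dubins distance between (p1, theta1) and (p2, theta2) with
    turning radius r, minimising over the CSC words LSL/RSR/LSR/RSL."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    d = math.hypot(dx, dy) / r                 # normalised straight-line distance
    phi = math.atan2(dy, dx)
    a, b = _mod2pi(theta1 - phi), _mod2pi(theta2 - phi)
    sa, ca, sb, cb = math.sin(a), math.cos(a), math.sin(b), math.cos(b)
    cab = math.cos(a - b)

    lengths = []

    # LSL
    p_sq = 2 + d * d - 2 * cab + 2 * d * (sa - sb)
    if p_sq >= 0:
        tmp = math.atan2(cb - ca, d + sa - sb)
        lengths.append(_mod2pi(-a + tmp) + math.sqrt(p_sq) + _mod2pi(b - tmp))

    # RSR
    p_sq = 2 + d * d - 2 * cab + 2 * d * (sb - sa)
    if p_sq >= 0:
        tmp = math.atan2(ca - cb, d - sa + sb)
        lengths.append(_mod2pi(a - tmp) + math.sqrt(p_sq) + _mod2pi(-b + tmp))

    # LSR
    p_sq = -2 + d * d + 2 * cab + 2 * d * (sa + sb)
    if p_sq >= 0:
        p = math.sqrt(p_sq)
        tmp = math.atan2(-ca - cb, d + sa + sb) - math.atan2(-2.0, p)
        lengths.append(_mod2pi(-a + tmp) + p + _mod2pi(-b + tmp))

    # RSL
    p_sq = d * d - 2 + 2 * cab - 2 * d * (sa + sb)
    if p_sq >= 0:
        p = math.sqrt(p_sq)
        tmp = math.atan2(ca + cb, d - sa - sb) - math.atan2(2.0, p)
        lengths.append(_mod2pi(a - tmp) + p + _mod2pi(b - tmp))

    if not lengths:
        raise ValueError("points too close for the CSC-only sketch; CCC words needed")
    return r * min(lengths)
```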
when there is no obstacle between any two task points P_1 and P_2, the vector coordinates of the two task points are substituted into formula (3) to calculate their shortest Dubins curve distance l_Dubins^{P1P2};

when a static obstacle or a dynamic obstacle lies between any two task points P_1 and P_2, their shortest Dubins curve distance l_Dubins^{P1P2} is calculated as follows:
as shown in FIG. 2, a circle C with radius r is first formed by using the center of the dynamic obstacle or the static obstacle as the center of a circle2Wherein r is the turning radius of the Dubins curve; then, the unmanned plane moves towards the circle C from the moving direction of the position2Making tangent lines to obtain common tangent points
Figure BDA0003331688900000094
And a vector
Figure BDA0003331688900000095
(Vector)
Figure BDA0003331688900000096
Expressed as:
Figure BDA0003331688900000097
wherein the content of the first and second substances,
Figure BDA0003331688900000101
respectively two common tangent points
Figure BDA0003331688900000102
Is determined by the coordinate of (a) in the space,
Figure BDA0003331688900000103
is two common tangent points
Figure BDA0003331688900000104
Angle of incidence of;
according to the vector coordinates of the two task points P_1 and P_2 and the vector p_t, the shortest Dubins curve distance l_Dubins^{P1P2} between the two task points is calculated as shown in formula (5):

l_Dubins^{P1P2} = l_Dubins^{P1,Pt1} + l_Dubins^{Pt1,Pt2} + l_Dubins^{Pt2,P2}    (5);

where the vector coordinates of the task points are those of the current task point P_1 and the next task point P_2, and the segment distances l_Dubins^{P1,Pt1}, l_Dubins^{Pt1,Pt2} and l_Dubins^{Pt2,P2} are each obtained by calculation according to formula (3);
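A short Python sketch of the tangent-point construction behind formula (4) is given below. Taking the heading of each tangent leg as the incidence angle at the corresponding tangent point is an assumption made here for illustration; the three legs of formula (5) would then each be evaluated with formula (3).

```python
import math

def tangent_points(p, obstacle_center, r):
    """Tangent points of the two lines from point p to the circle C2 of
    radius r centred on the obstacle.  Returns each tangent point together
    with the heading of its tangent leg, used here as the incidence angle."""
    px, py = p
    ox, oy = obstacle_center
    d = math.hypot(px - ox, py - oy)
    if d <= r:
        raise ValueError("point lies inside the avoidance circle C2")
    # Angle from the obstacle centre to p, and the half-angle subtended by the
    # tangent points as seen from the centre (right triangle with legs r and
    # the tangent length).
    base = math.atan2(py - oy, px - ox)
    half = math.acos(r / d)
    points = []
    for sign in (+1, -1):
        ang = base + sign * half
        tx, ty = ox + r * math.cos(ang), oy + r * math.sin(ang)
        heading = math.atan2(ty - py, tx - px)  # direction of the tangent leg from p
        points.append(((tx, ty), heading))
    return points  # [(P_t1, theta_t1), (P_t2, theta_t2)] as in formula (4)
```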
s1-22: according to the shortest Dubins curve distance between two task points
Figure BDA00033316889000001010
Calculating the prize value p as shown in equation (6):
Figure BDA00033316889000001011
in the formula (6), γ1The discount coefficient is set to be 0.1 and is used for preventing gradient explosion caused by too large difference of training data;
s1-23: calculating the Loss value of the Loss function by using the reward value p calculated in the step S1-22, wherein the Loss function is expressed by the formula (7):
Figure BDA00033316889000001012
in the formula (7), the reaction mixture is,
Figure BDA00033316889000001013
neural network Q for deep reinforcement learningevalApproximate Q value of output, sjIs the state of the jth data, ajFor the j-th data, ω is the deep reinforcement learning neural network QevalParameter of (2) need to be trained, yjA Q value calculated for the drone by the instant prize value, as shown in equation (8):
Figure BDA00033316889000001014
in the formula (8), ρjIs shown in state sjTaking action ajInstant prize value, gamma, obtained2For the discount coefficient, set to 0.01,
Figure BDA0003331688900000111
neural network Q for deep reinforcement learningtargetPredicted at state s'j+1Take action a'jMaximum Q value obtainable, wherein, state s'j+1Is the state s in the formula (8)jTaking action ajRear state, a'jIs unmanned plane at state s'j+1An action capable of obtaining a maximum Q value;
S1-24: the network parameters ω of the Q_eval network are trained and updated by back-propagation according to the Loss value obtained in step S1-23; in addition, every 5×h steps the parameters of the neural network Q_eval are copied to the neural network Q_target to update it.
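A minimal PyTorch sketch of one training step covering formulas (7) and (8) and the periodic copy of step S1-24 follows. The two networks, the optimizer, the replay-sampling machinery and the value of h are assumptions supplied by the caller; only the squared-error loss, the target y_j with γ_2 = 0.01 and the copy every 5×h steps come from the text above.

```python
import torch
import torch.nn as nn

GAMMA2 = 0.01   # discount coefficient gamma_2 of formula (8)
H = 100         # back-propagation interval h (example value)

def dqn_update(q_eval, q_target, optimizer, batch, step):
    """One update of Q_eval; intended to be invoked every h steps.

    `batch` holds tensors (states, actions, rewards, next_states) sampled from
    experience; actions must be int64 indices into the N*24 action outputs.
    """
    states, actions, rewards, next_states = batch

    # Q_eval(s_j, a_j; omega): Q values of the actions actually taken.
    q_sa = q_eval(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = rho_j + gamma_2 * max_a' Q_target(s'_{j+1}, a')    (formula (8))
    with torch.no_grad():
        y = rewards + GAMMA2 * q_target(next_states).max(dim=1).values

    # Loss(omega) = mean of (y_j - Q_eval(s_j, a_j; omega))^2   (formula (7))
    loss = nn.functional.mse_loss(q_sa, y)

    optimizer.zero_grad()
    loss.backward()        # back-propagate to update omega (step S1-24)
    optimizer.step()

    # Every 5*h steps, copy the Q_eval parameters into Q_target.
    if step % (5 * H) == 0:
        q_target.load_state_dict(q_eval.state_dict())

    return loss.item()
```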
In the invention, the tangent lines are drawn as shown in FIG. 2 both when a static obstacle and when a dynamic obstacle is encountered, and the shortest Dubins curve distance l_Dubins^{P1P2} of the two task points is calculated in the same way. This is because the action selected by the action selection strategy of the deep reinforcement learning determines at what angle the next point is approached, so the incidence angles of the two points are known, the coordinates of the obstacle are known, and the calculation is identical. Therefore, when a dynamic obstacle is detected, the unmanned aerial vehicle sends the coordinates of the dynamic obstacle to the base; with these coordinates and the incidence angles of the two points available, the tangent construction and the calculation method are the same as for a static obstacle.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (4)

1. A UAV path planning method based on deep reinforcement learning under a kinematic constraint condition is characterized by comprising the following specific steps:
S1: when the unmanned aerial vehicle is at the base, the shortest path of the unmanned aerial vehicle under the kinematic constraint is obtained with the deep reinforcement learning neural network according to the vector coordinates of the plurality of task points and the static obstacles;
S2: after take-off, the unmanned aerial vehicle flies along the shortest path and executes its tasks;
S3: during task execution, when the radar on the unmanned aerial vehicle detects a dynamic obstacle within 5 km, the unmanned aerial vehicle transmits the vector coordinates of the dynamic obstacle and the remaining task points to the base by radio and keeps flying along the original path until it receives the base's feedback signal; the supercomputer of the base, according to the time t_0 from the unmanned aerial vehicle sending the signal to receiving the reply, predicts the position of the unmanned aerial vehicle at the moment it receives the signal;
S4: the supercomputer of the base uses the deep reinforcement learning neural network to output the Q values of all actions from the coordinates of the dynamic obstacle and the remaining task points, generates a new action selection strategy ε-greedy from these Q values, selects actions according to this strategy to obtain a new flight path, and sends the new flight path to the unmanned aerial vehicle by radio;
S5: after receiving the feedback signal, the unmanned aerial vehicle executes its tasks along the new path, finally returns to the base after all tasks have been executed, and thereby completes its mission.
2. The method for planning UAV path based on deep reinforcement learning under the kinematic constraint condition of claim 1, wherein the specific steps of using the deep reinforcement learning neural network to derive the shortest path of the drone under the kinematic constraint in step S1 are as follows:
S1-1: when the unmanned aerial vehicle is at the base, the N target task points are numbered 1, 2, 3 … N in sequence and the base is numbered 0; the dimension of the state vector of the unmanned aerial vehicle is set to N+2, the first entry of the state vector is 0 and represents the base number, the last entry is θ_i, the incidence angle at the current task point numbered i, and the middle entries are updated to the task point numbers as the unmanned aerial vehicle reaches them, so that the initial state vector of the unmanned aerial vehicle at the base is:

s_initial = [0, 0, 0, …, 0, θ_0]^T    (1);

where the first 0 represents the base number, the remaining 0 entries represent the initial state in which no task point has been reached, and θ_0 represents the incidence angle of the unmanned aerial vehicle at base 0;

S1-2: the state vector of the unmanned aerial vehicle is used as the input of the deep reinforcement learning neural network, which works out which action to select so that the total distance under the kinematic constraint is shortest, i.e. the Q value is largest, and generates the action selection strategy ε-greedy;

S1-3: the deep reinforcement learning neural network selects actions according to the action selection strategy ε-greedy, deciding which task point to go to and at what angle to fly out: when the random number is smaller than ε it explores randomly, and when the random number is greater than or equal to ε it selects the action with the largest Q value, so that the state of the unmanned aerial vehicle is updated to:

s_bcd = [0, b, c, d, 0, …, 0, θ_d]^T    (2);

where b, c and d are the numbers of the task points reached by the unmanned aerial vehicle in sequence and θ_d is the incidence angle at the task point numbered d; the state vector of the unmanned aerial vehicle is therefore the sequence of numbers of the task points it has flown to, and it is updated once every time an action is taken.
3. The UAV path planning method based on deep reinforcement learning under the kinematic constraint condition according to claim 2, wherein the deep reinforcement learning neural network comprises two neural networks of the same structure: a neural network Q_eval and a neural network Q_target; at initialization the parameter weights of the two networks are identical, and thereafter, while the neural network Q_eval generates the action selection strategy ε-greedy, the network parameters ω of the Q_eval network are trained and updated by back-propagation every h steps, yielding a new neural network Q_eval; the specific steps are as follows:
s1-21: shortest Dubins curve distance l between two pointsDubinsThe calculation formula of (a) is as follows:
Figure FDA0003331688890000031
in the formula (3), alpha and beta are incident angles of two points respectively, d is a linear distance between the two points, R is a turning radius of a Dubins curve, R represents clockwise motion, S represents linear motion, and L represents anticlockwise motion;
when there is no obstacle between any two task points P_1 and P_2, the vector coordinates of the two task points are substituted into formula (3) to calculate their shortest Dubins curve distance l_Dubins^{P1P2};

when a static obstacle or a dynamic obstacle lies between any two task points P_1 and P_2, their shortest Dubins curve distance l_Dubins^{P1P2} is calculated as follows:
firstly, a circle C_2 of radius r is drawn with the centre of the dynamic or static obstacle as its centre, where r is the turning radius of the Dubins curve; then tangent lines are drawn from the position of the unmanned aerial vehicle, along its direction of motion, to the circle C_2, giving two common tangent points P_t1 and P_t2 and a vector p_t, expressed as

p_t = [x_t1, y_t1, θ_t1, x_t2, y_t2, θ_t2]^T    (4);

where (x_t1, y_t1) and (x_t2, y_t2) are the coordinates of the two common tangent points P_t1 and P_t2, and θ_t1 and θ_t2 are the incidence angles at the two common tangent points P_t1 and P_t2;
according to the vector coordinates of the two task points P_1 and P_2 and the vector p_t, the shortest Dubins curve distance l_Dubins^{P1P2} between the two task points is calculated as shown in formula (5):

l_Dubins^{P1P2} = l_Dubins^{P1,Pt1} + l_Dubins^{Pt1,Pt2} + l_Dubins^{Pt2,P2}    (5);

where the vector coordinates of the task points are those of the current task point P_1 and the next task point P_2, and the segment distances l_Dubins^{P1,Pt1}, l_Dubins^{Pt1,Pt2} and l_Dubins^{Pt2,P2} are each obtained by calculation according to formula (3);
s1-22: according to the shortest Dubins curve distance between two task points
Figure FDA0003331688890000043
Calculating the prize value p as shown in equation (6):
Figure FDA0003331688890000044
in the formula (6), γ1The discount coefficient is set to be 0.1 and is used for preventing gradient explosion caused by too large difference of training data;
s1-23: calculating the Loss value of the Loss function by using the reward value p calculated in the step S1-22, wherein the Loss function is expressed by the formula (7):
Figure FDA0003331688890000045
in the formula (7), the reaction mixture is,
Figure FDA0003331688890000046
neural network Q for deep reinforcement learningevalOf the outputApproximate Q value, sjIs the state of the jth data, ajFor the j-th data, ω is the deep reinforcement learning neural network QevalParameter of (2) need to be trained, yjA Q value calculated for the drone by the instant prize value, as shown in equation (8):
Figure FDA0003331688890000047
in the formula (8), ρjIs shown in state sjTaking action ajInstant prize value, gamma, obtained2For the discount coefficient, set to 0.01,
Figure FDA0003331688890000048
neural network Q for deep reinforcement learningtargetPredicted at state s'j+1Take action a'jMaximum Q value obtainable, wherein, state s'j+1Is the state s in the formula (8)jTaking action ajRear state, a'jIs unmanned plane at state s'j+1An action capable of obtaining a maximum Q value;
S1-24: the network parameters ω of the Q_eval network are trained and updated by back-propagation according to the Loss value obtained in step S1-23; in addition, every 5×h steps the parameters of the neural network Q_eval are copied to the neural network Q_target to update it.
4. The UAV path planning method based on deep reinforcement learning under the kinematic constraint condition according to claim 3, wherein the neural network Q_eval and the neural network Q_target each comprise 3 convolutional layers and 3 fully-connected layers; each convolutional layer has a convolution kernel size of 4×4 and a stride of 3×3, and the number of output actions is N×24.
CN202111282488.8A 2021-11-01 2021-11-01 UAV path planning method based on deep reinforcement learning under kinematic constraint condition Active CN114003059B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111282488.8A CN114003059B (en) 2021-11-01 2021-11-01 UAV path planning method based on deep reinforcement learning under kinematic constraint condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111282488.8A CN114003059B (en) 2021-11-01 2021-11-01 UAV path planning method based on deep reinforcement learning under kinematic constraint condition

Publications (2)

Publication Number Publication Date
CN114003059A (en) 2022-02-01
CN114003059B CN114003059B (en) 2024-04-16

Family

ID=79926040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111282488.8A Active CN114003059B (en) 2021-11-01 2021-11-01 UAV path planning method based on deep reinforcement learning under kinematic constraint condition

Country Status (1)

Country Link
CN (1) CN114003059B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115268494A (en) * 2022-07-26 2022-11-01 江苏科技大学 Unmanned aerial vehicle path planning method based on layered reinforcement learning
CN116661501A (en) * 2023-07-24 2023-08-29 北京航空航天大学 Unmanned aerial vehicle cluster high dynamic environment obstacle avoidance and moving platform landing combined planning method
CN116683349A (en) * 2023-06-27 2023-09-01 国网青海省电力公司海北供电公司 Correction method and system for power equipment sky inspection line and inspection unmanned aerial vehicle

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106406346A (en) * 2016-11-01 2017-02-15 北京理工大学 Plan method for rapid coverage track search coordinated by multiple UAVs (Unmanned Aerial Vehicles)
CN109685237A (en) * 2017-10-19 2019-04-26 北京航空航天大学 A kind of real-time planing method of unmanned aerial vehicle flight path based on the path Dubins and branch and bound
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110470301A (en) * 2019-08-13 2019-11-19 上海交通大学 Unmanned plane paths planning method under more dynamic task target points
CN110488872A (en) * 2019-09-04 2019-11-22 中国人民解放军国防科技大学 A kind of unmanned plane real-time route planing method based on deeply study
WO2020056875A1 (en) * 2018-09-20 2020-03-26 初速度(苏州)科技有限公司 Parking strategy based on deep reinforcement learning
CN111027143A (en) * 2019-12-18 2020-04-17 四川大学 Shipboard aircraft approach guiding method based on deep reinforcement learning
CN112947594A (en) * 2021-04-07 2021-06-11 东北大学 Unmanned aerial vehicle-oriented flight path planning method
CN113064422A (en) * 2021-03-09 2021-07-02 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN113190039A (en) * 2021-04-27 2021-07-30 大连理工大学 Unmanned aerial vehicle acquisition path planning method based on hierarchical deep reinforcement learning

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106406346A (en) * 2016-11-01 2017-02-15 北京理工大学 Plan method for rapid coverage track search coordinated by multiple UAVs (Unmanned Aerial Vehicles)
CN109685237A (en) * 2017-10-19 2019-04-26 北京航空航天大学 A kind of real-time planing method of unmanned aerial vehicle flight path based on the path Dubins and branch and bound
WO2020056875A1 (en) * 2018-09-20 2020-03-26 初速度(苏州)科技有限公司 Parking strategy based on deep reinforcement learning
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110470301A (en) * 2019-08-13 2019-11-19 上海交通大学 Unmanned plane paths planning method under more dynamic task target points
CN110488872A (en) * 2019-09-04 2019-11-22 中国人民解放军国防科技大学 A kind of unmanned plane real-time route planing method based on deeply study
CN111027143A (en) * 2019-12-18 2020-04-17 四川大学 Shipboard aircraft approach guiding method based on deep reinforcement learning
CN113064422A (en) * 2021-03-09 2021-07-02 河海大学 Autonomous underwater vehicle path planning method based on double neural network reinforcement learning
CN112947594A (en) * 2021-04-07 2021-06-11 东北大学 Unmanned aerial vehicle-oriented flight path planning method
CN113190039A (en) * 2021-04-27 2021-07-30 大连理工大学 Unmanned aerial vehicle acquisition path planning method based on hierarchical deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MINGSHENG GAO: "UAV path planning with kinematic constraints based on deep reinforcement learning", EVENT: 4TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE, 1 August 2022 (2022-08-01), pages 1 - 10 *
XIANG Xiaojia (相晓嘉): "Coordinated control method for fixed-wing UAV formations based on deep reinforcement learning", Acta Aeronautica et Astronautica Sinica, vol. 42, no. 4, 30 April 2021 (2021-04-30), pages 1 - 14 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115268494A (en) * 2022-07-26 2022-11-01 江苏科技大学 Unmanned aerial vehicle path planning method based on layered reinforcement learning
CN116683349A (en) * 2023-06-27 2023-09-01 国网青海省电力公司海北供电公司 Correction method and system for power equipment sky inspection line and inspection unmanned aerial vehicle
CN116683349B (en) * 2023-06-27 2024-01-26 国网青海省电力公司海北供电公司 Correction method and system for power equipment sky inspection line and inspection unmanned aerial vehicle
CN116661501A (en) * 2023-07-24 2023-08-29 北京航空航天大学 Unmanned aerial vehicle cluster high dynamic environment obstacle avoidance and moving platform landing combined planning method
CN116661501B (en) * 2023-07-24 2023-10-10 北京航空航天大学 Unmanned aerial vehicle cluster high dynamic environment obstacle avoidance and moving platform landing combined planning method

Also Published As

Publication number Publication date
CN114003059B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN109992000B (en) Multi-unmanned aerial vehicle path collaborative planning method and device based on hierarchical reinforcement learning
CN114003059A (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
US11794898B2 (en) Air combat maneuvering method based on parallel self-play
CN110991972B (en) Cargo transportation system based on multi-agent reinforcement learning
CN110488872B (en) Unmanned aerial vehicle real-time path planning method based on deep reinforcement learning
Duan et al. Hybrid particle swarm optimization and genetic algorithm for multi-UAV formation reconfiguration
CN110488859B (en) Unmanned aerial vehicle route planning method based on improved Q-learning algorithm
Liu et al. Multi-UAV path planning based on fusion of sparrow search algorithm and improved bioinspired neural network
Sahingoz Flyable path planning for a multi-UAV system with Genetic Algorithms and Bezier curves
US20210325891A1 (en) Graph construction and execution ml techniques
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN112947592B (en) Reentry vehicle trajectory planning method based on reinforcement learning
Cao et al. Hunting algorithm for multi-auv based on dynamic prediction of target trajectory in 3d underwater environment
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN113093733B (en) Sea-to-sea striking method for unmanned boat cluster
CN113065709A (en) Cross-domain heterogeneous cluster path planning method based on reinforcement learning
Xiang et al. An effective memetic algorithm for UAV routing and orientation under uncertain navigation environments
Ke et al. Cooperative path planning for air–sea heterogeneous unmanned vehicles using search-and-tracking mission
Xue et al. Multi-agent deep reinforcement learning for uavs navigation in unknown complex environment
Jin et al. Cooperative path planning with priority target assignment and collision avoidance guidance for rescue unmanned surface vehicles in a complex ocean environment
CN114967721A (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
Sun et al. Cooperative strategy for pursuit-evasion problem with collision avoidance
Liang et al. Multi-UAV autonomous collision avoidance based on PPO-GIC algorithm with CNN–LSTM fusion network
Zeng et al. Path planning for rendezvous of multiple AUVs operating in a variable ocean
Ali et al. Feature selection-based decision model for UAV path planning on rough terrains

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant