CN115268494A - Unmanned aerial vehicle path planning method based on layered reinforcement learning - Google Patents

Unmanned aerial vehicle path planning method based on layered reinforcement learning

Info

Publication number
CN115268494A
Authority
CN
China
Prior art keywords
algorithm
unmanned aerial
aerial vehicle
path
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210883240.5A
Other languages
Chinese (zh)
Other versions
CN115268494B (en)
Inventor
王琦
潘德民
王栋
高尚
于化龙
崔弘杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202210883240.5A priority Critical patent/CN115268494B/en
Publication of CN115268494A publication Critical patent/CN115268494A/en
Application granted granted Critical
Publication of CN115268494B publication Critical patent/CN115268494B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, which comprises the following steps. Step 1: initialize a deep Q network algorithm and a Q learning algorithm. Step 2: drive the unmanned aerial vehicle to move from the starting point to the target point, and train the deep Q network algorithm and the Q learning algorithm; when the unmanned aerial vehicle does not detect a dynamic obstacle while moving, the path is planned with the deep Q network algorithm; when the unmanned aerial vehicle detects a dynamic obstacle while moving, the path is planned with the Q learning algorithm. Step 3: repeat step 2 until training of the deep Q network algorithm and the Q learning algorithm is completed, set the actual coordinates, starting point coordinates and target point coordinates of the unmanned aerial vehicle, and plan the path with the trained deep Q network algorithm and Q learning algorithm. The invention overcomes the problem that network fitting is easily disturbed by dynamic obstacles when a single algorithm is applied to a dynamic environment, and improves path planning performance.

Description

Unmanned aerial vehicle path planning method based on layered reinforcement learning
Technical Field
The invention relates to the technical field of unmanned aerial vehicle path planning, in particular to an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning.
Background
In recent years, the wide application of unmanned aerial vehicles in many military and civil fields has strengthened the demand for their autonomy, and autonomous path planning is a key research topic. Current research on unmanned aerial vehicle path planning focuses mostly on static environments, with far less work on dynamic environments. In the prior art, reinforcement learning has become a popular path planning method because of its reward-and-punishment mechanism and its ability to learn an optimal strategy autonomously through interaction with the environment. Q learning, the most classical reinforcement learning algorithm, is widely applied to the unmanned aerial vehicle path planning problem. However, because it learns a table, Q learning cannot be applied to scenes with complex environments or high-dimensional state spaces. Deep reinforcement learning, which combines reinforcement learning with deep learning, has therefore been proposed and applied to various complicated unmanned aerial vehicle path planning problems, the most widely used method being the deep Q network (DQN) algorithm.
However, the inventors found that when implementing dynamic path planning for an unmanned aerial vehicle based on the deep Q network algorithm, the reinforcement learning algorithm explores with randomly selected actions, which leads to low efficiency in the early stage of training, too many iterations, and a planned path that is not optimal. This is aggravated in complex environments where dynamic and static obstacles coexist. In addition, when a single deep Q network algorithm faces a dynamic environment, the network fits poorly during training because the positions of the dynamic obstacles are not fixed, and the finally trained network also performs poorly.
Therefore, the prior art has the technical problems that training efficiency is low and network fitting is easily disturbed.
Disclosure of Invention
The invention provides an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, aiming to solve the problems in the prior art that training efficiency is low and network fitting is easily disturbed.
The invention provides an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, which comprises the following steps:
step 1: initializing a deep Q network algorithm and a Q learning algorithm;
step 2: driving the unmanned aerial vehicle to move from a starting point to a target point, and training the deep Q network algorithm and the Q learning algorithm;
when the unmanned aerial vehicle does not detect a dynamic obstacle in the moving process, planning a path by using the deep Q network algorithm;
when the unmanned aerial vehicle detects a dynamic obstacle in the moving process, planning a path by using a Q learning algorithm;
and step 3: repeating step 2 until the training of the deep Q network algorithm and the Q learning algorithm is completed, setting the actual coordinates, the starting point coordinates and the target point coordinates of the unmanned aerial vehicle, and planning the path with the trained deep Q network algorithm and the trained Q learning algorithm.
Further, when the unmanned aerial vehicle does not detect a dynamic obstacle, the deep Q network algorithm plans the path, and the method further comprises updating the Q learning algorithm with the experience tuple generated by the deep Q network algorithm after the current path step is planned. In this case, the reward function used to update the deep Q network algorithm remains consistent with its normal update;
when the unmanned aerial vehicle detects a dynamic obstacle, the Q learning algorithm plans the path, and the method further comprises updating the deep Q network algorithm with the experience tuple generated by the Q learning algorithm after the current path step is planned.
Further, when the Q learning algorithm is updated with the experience tuple generated by the deep Q network algorithm after the current path step is planned, the reward function used by the Q learning algorithm is:
reward = η(d_{s-1} - d_s)
where η is a constant, d_{s-1} is the distance from the unmanned aerial vehicle to the target point at the previous moment, and d_s is the distance from the unmanned aerial vehicle to the target point at the current moment.
Further, in step 2, before the path is planned by the deep Q network algorithm and the Q learning algorithm, the method further includes: using a heuristic fish algorithm as the action guide of the deep Q network algorithm and the Q learning algorithm during path planning; wherein the heuristic fish algorithm comprises a traveling behavior process and a foraging behavior process. The traveling behavior process acquires the directions in which the unmanned aerial vehicle would collide with surrounding obstacles; the foraging behavior process acquires a plurality of high-priority directions in which the unmanned aerial vehicle moves towards the target point, and the heuristic fish algorithm removes the collision directions from the high-priority directions and uses the remainder as the action guide.
Further, when acquiring the directions in which the unmanned aerial vehicle may collide with surrounding obstacles, if an obstacle is dynamic, whether the unmanned aerial vehicle will collide with it is judged according to the movement direction and movement speed of the obstacle.
The invention has the beneficial effects that:
the invention adds the action guidance strategy of the heuristic fish algorithm into the action selection strategy of the basic deep Q network algorithm and the Q learning algorithm. The method carries out action guidance on two aspects of fast reaching a target point and avoiding of dynamic and static obstacles, and the action guidance greatly reduces unnecessary exploration in the initial stage of algorithm training so as to reduce the blindness of original algorithm exploration.
The invention utilizes layered reinforcement learning to respectively process static and dynamic obstacles by using two algorithms when facing a dynamic complex environment. The design overcomes the problem that the network fitting is easily influenced by dynamic obstacles when a single algorithm is applied to a dynamic environment, and improves the performance of algorithm path planning.
The two effects respectively solve the problems that algorithm training efficiency is low and a planning path is lack of safety consideration in the prior art.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and not to be construed as limiting the invention in any way, and in which:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the detection of UAV sensors in an environment according to an embodiment of the present invention;
FIG. 3 is a flow chart of a heuristic fish algorithm according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a situation in foraging behavior of a heuristic fish algorithm described in an embodiment of the present invention;
FIG. 5 is a diagram illustrating a case of the traveling behavior of the heuristic fish algorithm described in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an unmanned aerial vehicle path planning method based on hierarchical reinforcement learning, the flow structure of the method is shown in figure 1, and the method comprises the following steps:
Step 1: initialize the network parameters θ of the deep Q network algorithm and the experience replay buffer,
together with the Q table of the Q learning algorithm; initialize the number of training rounds N_episode, and set the starting point P_O and the target point P_T of the unmanned aerial vehicle flight mission;
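As an illustrative sketch of step 1 only: the grid size, state encoding and network architecture below are assumptions, since the embodiment fixes only N_episode, the replay buffer, the Q table, P_O and P_T.
```python
# Minimal initialization sketch for step 1 (hypothetical choices marked below).
from collections import deque

import numpy as np
import torch.nn as nn

GRID = 30                 # assumed 30x30 grid (example 1 flies from [0, 0] to [29, 29])
N_ACTIONS = 8             # eight movement directions
N_EPISODE = 500           # total training rounds used in example 1
REPLAY_SIZE = 1_000_000   # experience replay buffer size used in example 1

# Deep Q network with parameters theta: maps a state encoding to 8 action values.
# The 4-dimensional state encoding and layer sizes are assumptions for illustration.
q_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
replay_buffer = deque(maxlen=REPLAY_SIZE)

# Q table of the Q learning algorithm, indexed by (x, y, action).
q_table = np.zeros((GRID, GRID, N_ACTIONS))

start_point = np.array([0, 0])     # P_O
target_point = np.array([29, 29])  # P_T
```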
Step 2: when the current number of training rounds is less than the set maximum, the state and the environment are reset and a new training round starts. The environment is detected with the sensor to judge whether a dynamic obstacle exists within the detection range; the sensor detection range is shown in FIG. 2;
when the unmanned aerial vehicle does not detect a dynamic obstacle in the moving process, planning a path by using a depth Q network algorithm;
and the depth Q network algorithm selects and executes actions according to the current position of the unmanned aerial vehicle and the position information of the static obstacle by using a heuristic fish algorithm as an action guide of the algorithm, and then reaches the next state. For the current action, the reward can be obtained by a reward function, and the embodiment of the invention sets the reward function of the static path planning part as follows:
[static path planning reward function: formula given as an image in the original, combining a target-approach term weighted by α and a static-obstacle-distance term weighted by β]
α and β are constants that determine the weights of the two reward terms in the total reward function. According to experimental tuning, this example sets α = 1.1 and β = 2. d_{s-1} denotes the distance between the unmanned aerial vehicle and the target point in the previous state; d_s denotes the distance between the unmanned aerial vehicle and the target point in the next state; the remaining symbol (also given as an image) denotes the distance from the unmanned aerial vehicle to each static obstacle.
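Because the formula itself appears only as an image, the following sketch shows one plausible reading of the static reward: an α-weighted approach term plus a β-weighted penalty that grows as the unmanned aerial vehicle nears a static obstacle. The penalty shape and the safe_dist threshold are assumptions, not the patented formula.
```python
import numpy as np

ALPHA, BETA = 1.1, 2.0   # weights used in this embodiment

def static_reward(d_prev, d_curr, static_obstacle_dists, safe_dist=2.0):
    """Hypothetical static-planning reward: alpha-weighted progress toward the
    target, minus a beta-weighted penalty when a static obstacle is closer than
    safe_dist. The penalty form is assumed, since the original formula image is
    not reproduced in the text."""
    approach = ALPHA * (d_prev - d_curr)          # positive when moving toward the target
    nearest = min(static_obstacle_dists) if len(static_obstacle_dists) else np.inf
    penalty = BETA * max(0.0, safe_dist - nearest)
    return approach - penalty
```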
The experience tuple [S, A, R, S'] consisting of the current state, the action, the reward and the next state obtained from the interaction is stored in the experience replay buffer. The algorithm then samples data from the experience replay buffer according to the set batch size m and updates the Q network of the deep Q network algorithm.
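A hedged sketch of this sampled update follows; the target network, discount factor and mean-squared-error loss are standard deep Q network ingredients assumed here, since the text only states that a batch of size m is sampled from the replay buffer.
```python
import random

import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(q_network, target_network, optimizer, replay_buffer, m=16, gamma=0.9):
    """One sketched deep Q network update from m sampled transitions
    (s, a, r, s_next, done). The terminal flag, gamma and the target network are
    implementation conveniences not spelled out in the [S, A, R, S'] tuple."""
    if len(replay_buffer) < m:
        return
    batch = random.sample(list(replay_buffer), m)
    s, a, r, s_next, done = (np.array(x, dtype=np.float32) for x in zip(*batch))
    s, r, s_next, done = map(torch.as_tensor, (s, r, s_next, done))
    a = torch.as_tensor(a, dtype=torch.int64)
    q_sa = q_network(s).gather(1, a.view(-1, 1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():                                            # bootstrapped target
        target = r + gamma * target_network(s_next).max(dim=1).values * (1 - done)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```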
Meanwhile, when switching between the deep Q network algorithm and the Q learning algorithm, if the algorithm that is not in use stopped learning entirely, Q values for some state-action pairs would be missing after training. To avoid this, while the deep Q network algorithm is working, the Q table of the Q learning algorithm is also updated with the experience tuple generated by the interaction in the previous step; because no dynamic obstacle is within the range of the drone's sensor while the Q learning algorithm is not working, its reward function is defined as:
reward = η(d_{s-1} - d_s)
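A sketch of this cross-update is shown below; the value of η, the learning rate and the discount factor are assumed, and q_table, s and s_next follow the (x, y, action) indexing of the initialization sketch above.
```python
ETA = 1.0                  # assumed value of the constant eta
LR, DISCOUNT = 0.1, 0.9    # assumed Q-learning step size and discount factor

def cross_update_q_table(q_table, s, a, s_next, d_prev, d_curr):
    """Keep the Q table warm while the deep Q network is working: the transition
    produced by the DQN step also updates the Q table, using the simplified
    reward eta * (d_{s-1} - d_s) because no dynamic obstacle is in sensor range."""
    reward = ETA * (d_prev - d_curr)
    best_next = q_table[s_next].max()                         # max over next-state actions
    q_table[s][a] += LR * (reward + DISCOUNT * best_next - q_table[s][a])
    return reward
```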
finally, if the action taken by the drone this time results in a collision, ending and starting a new training round; if no collision is caused, the training of the current round is continued.
When the unmanned aerial vehicle detects a dynamic barrier in the moving process, planning a path by using a Q learning algorithm;
and the Q learning algorithm selects and executes actions according to the current position of the unmanned aerial vehicle and the information of the detected dynamic obstacles by using a heuristic fish algorithm as the action guidance of the algorithm, and the next state is reached. For the reward function of the dynamic path planning part, the embodiment of the invention sets the reward function as follows:
[dynamic path planning reward function: formula given as an image in the original, combining a target-approach term weighted by γ and a dynamic-obstacle-avoidance term weighted by δ]
γ and δ are weight constants; according to experimental tuning, this example sets γ = 1.1 and δ = 1. d'_{u→t} and d_{u→t} denote the distances between the unmanned aerial vehicle and the target point at the previous moment and the current moment, respectively; d'_{u→o} and d_{u→o} denote the distances between the unmanned aerial vehicle and the detected dynamic obstacle at the previous moment and the current moment, respectively.
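As with the static case, the formula itself appears only as an image; the sketch below is a plausible additive reading with a γ-weighted approach term and a δ-weighted term that rewards increasing the distance to the detected dynamic obstacle.
```python
GAMMA_W, DELTA_W = 1.1, 1.0   # the weight constants gamma and delta of this embodiment

def dynamic_reward(d_target_prev, d_target_curr, d_obstacle_prev, d_obstacle_curr):
    """Hypothetical dynamic-planning reward: progress toward the target weighted
    by gamma, plus increased separation from the dynamic obstacle weighted by
    delta. The additive form is an assumption."""
    approach = GAMMA_W * (d_target_prev - d_target_curr)
    avoidance = DELTA_W * (d_obstacle_curr - d_obstacle_prev)   # positive when moving away
    return approach + avoidance
```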
Then, the Q table of the Q learning algorithm is updated with the experience tuple [S, A, R, S'] obtained from the interaction.
Similarly, the network of the deep Q network algorithm is updated with the experience tuple obtained from the interaction in the previous step. In this case, the reward function is consistent with the reward function used when the deep Q network algorithm itself performs static path planning.
Finally, if the action taken by the drone this time results in a collision, ending and starting a new training round; if no collision is caused, the training of the current round is continued.
Step 3: repeat step 2, ending the current round when the unmanned aerial vehicle reaches the target point. When the number of training rounds reaches the set maximum N_episode, training of the deep Q network algorithm and the Q learning algorithm is finished. At this point, the actual coordinates, starting point coordinates and target point coordinates of the unmanned aerial vehicle are set, and the path is planned with the trained deep Q network algorithm and Q learning algorithm.
In step 2, before the path is planned by the deep Q network algorithm and the Q learning algorithm, the method further comprises: using a heuristic fish algorithm as the action guide of the deep Q network algorithm and the Q learning algorithm during path planning. The heuristic fish algorithm is inspired by the way fish in nature forage with their lateral-line organs in dark environments, and comprises a traveling behavior process and a foraging behavior process: the traveling behavior process acquires the directions in which the unmanned aerial vehicle would collide with surrounding obstacles, the foraging behavior process acquires several high-priority directions in which the unmanned aerial vehicle moves towards the target point, and the heuristic fish algorithm removes the collision directions from the high-priority directions and uses the remainder as the action guide. The algorithm flow is shown in FIG. 3 and comprises the following steps:
step 21: when the depth Q network algorithm or the Q learning algorithm calls the heuristic fish algorithm to select actions, the current state, the position of a target point and information containing dynamic and static obstacles are input into the heuristic fish algorithm. The experimental environment adopted by the invention is a grid environment, the unmanned aerial vehicle can take actions in eight directions, and the heuristic fish algorithm is responsible for selecting the optimal action in the current state from the actions.
Step 22: the foraging behavior calculates a set of selectable actions according to the current state and the target point position, as shown in FIG. 4. Let L_u→t be the direction vector from the current position of the unmanned aerial vehicle to the target point and L_horizontal be the unit vector in the forward direction of the unmanned aerial vehicle; the included angle between the two vectors is:
θ_t = arccos( (L_u→t · L_horizontal) / (|L_u→t| |L_horizontal|) )
Next, L_action is the unit direction vector of an action in the action space A; the included angle between each action and L_horizontal is:
θ_action = arccos( (L_action · L_horizontal) / (|L_action| |L_horizontal|) )
The difference between θ_t and each θ_action is then:
Δθ_action = |θ_t - θ_action|
and finally, giving priority to each action from high to low according to the difference from small to large, and returning to the action set with the first five priorities.
Step 23: the traveling behavior calculates an optional set of actions that will not cause a collision based on the current state and the information of the dynamic and static obstacles, as shown in fig. 5, where the gray squares represent static obstacles and the slashed squares represent dynamic obstacles.
For static obstacle avoidance, the position information of the static obstacles is used: if executing an action would take the unmanned aerial vehicle into the area of a static obstacle, that action is marked as forbidden in the current state, and the available actions are returned.
For dynamic obstacle avoidance, the threat area of the dynamic obstacle at the next moment is predicted from the information set [speed, direction, position] detected by the sensor: if executing an action would take the unmanned aerial vehicle into the threat area, that action is marked as forbidden in the current state, and the available actions are returned.
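A matching sketch of the traveling behavior is given below (the assumed DIRS table from the foraging sketch is repeated so the snippet is self-contained); the one-cell-per-step prediction and the shape of the threat area are assumptions.
```python
import numpy as np

DIRS = {
    "right": (1, 0), "right front": (1, 1), "front": (0, 1), "left front": (-1, 1),
    "left": (-1, 0), "left back": (-1, -1), "back": (0, -1), "right back": (1, -1),
}

def traveling_behavior(uav_pos, static_cells, dynamic_obstacles):
    """Sketch of step 23: forbid actions that would move the UAV into a static
    obstacle cell, the current cell of a dynamic obstacle, or its predicted
    next-moment cell derived from [speed, direction, position]."""
    threat = {tuple(int(v) for v in c) for c in static_cells}
    for speed, direction, position in dynamic_obstacles:
        pos = np.asarray(position)
        threat.add(tuple(int(v) for v in pos))                       # current cell
        predicted = pos + speed * np.asarray(DIRS[direction])        # next-moment cell
        threat.add(tuple(int(v) for v in predicted))
    allowed = []
    for name, vec in DIRS.items():
        cell = tuple(int(v) for v in np.asarray(uav_pos) + np.asarray(vec))
        if cell not in threat:
            allowed.append(name)
    return allowed
```
In step 24, the foraging priorities are intersected with this allowed set to obtain the guided actions.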
Step 24: the actions returned in step 22 and step 23 are combined, and the actions that both have high priority and cause no collision are returned to the deep Q network algorithm or the Q learning algorithm. The call then ends.
The specific embodiment process is exemplified in a simulation manner, which specifically includes the following steps:
example 1: layered reinforcement learning
Step 1: initialize the network parameters of the deep Q network algorithm and set the experience replay buffer size as
1000000; initialize the Q table of the Q learning algorithm. Set the total number of training rounds to 500, the starting point of the unmanned aerial vehicle flight task to P_O = [0, 0], and the target point to P_T = [29, 29];
Step 2: the sensor detection range is set to 3 as shown in fig. 2.
If no dynamic obstacle exists within the current detection range of the unmanned aerial vehicle, the deep Q network algorithm is called to plan the static path, and the heuristic fish algorithm is called to select the action. The drone executes the selected action, enters the next state, and receives the reward for the action. The algorithm stores the experience tuple in the experience replay buffer, updates the network parameters with samples drawn from the replay buffer according to the set batch size m = 16, and updates the Q table of the Q learning algorithm with the experience tuple.
If a dynamic obstacle exists within the detection range, as in the situation shown in FIG. 2, the Q learning algorithm is called for dynamic path planning. The heuristic fish algorithm is called to select the action; the unmanned aerial vehicle then executes the selected action, enters the next state, and obtains the reward for the action. Finally, the Q learning algorithm updates its Q table with the experience tuple, and the network of the deep Q network algorithm is also updated with the experience tuple.
Step 3: the unmanned aerial vehicle interacts with the environment continuously: detect dynamic obstacles → switch algorithm → select action → execute action → calculate reward → update Q network / Q table, until it collides with an obstacle or reaches the target point, which ends the current round. When the total number of training rounds reaches the set N_episode, the whole training is finished.
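The loop of example 1 can be summarised by the skeleton below; env, dqn_step, qlearn_step and detect_dynamic are hypothetical interfaces standing in for the components described above, not names used in the patent.
```python
def train_hierarchical(env, dqn_step, qlearn_step, detect_dynamic, n_episode=500):
    """Skeleton of the hierarchical training loop: every interaction first checks
    the sensor, then hands control to the deep Q network branch (no dynamic
    obstacle in range) or the Q learning branch (dynamic obstacle detected);
    each branch also cross-updates the other learner, as described above."""
    for episode in range(n_episode):
        state = env.reset()                       # reset state and environment
        done = False
        while not done:                           # until collision or target reached
            if detect_dynamic(state):             # dynamic obstacle within sensor range
                state, done = qlearn_step(state)  # Q learning plans; DQN also updated
            else:
                state, done = dqn_step(state)     # DQN plans; Q table also updated
```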
Example 2: heuristic fish algorithm
Step 1: the heuristic fish algorithm is called by the deep Q network algorithm or the Q learning algorithm, with the current state, the target point position, and the dynamic and static obstacle information as inputs. The foraging behavior and the traveling behavior then each produce a set of available actions.
Step 2: θ_t is calculated from the current state and the target point position, and θ_action is calculated for each action; the difference between θ_t and each θ_action is then computed, the eight actions are assigned different priorities according to these differences, and the actions with the top five priorities are returned. Referring to FIG. 4, the returned set of priority actions in this case is [left front, left, right front, left back].
Step 3: the traveling behavior returns the actions that do not cause a collision according to the information of the static and dynamic obstacles. For a static obstacle, whose position is fixed, any action that enters its area is forbidden; for a dynamic obstacle, its position at the next moment is predicted with the set [speed, direction, position], and any action that enters that area is forbidden. In the traveling behavior scenario shown in FIG. 5, the gray box is a static obstacle and the hatched box is a dynamic obstacle; the information of the dynamic obstacle is [1, left, current position], so at the next moment it occupies the marked area in the figure. Finally, the actions [left, right rear] that would cause a collision are removed, and the remaining 6 actions are selectable.
Step 4: the actions returned in step 2 and step 3 are combined, the selectable action set [front left, front right, front left, back left] is returned, and the call ends.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (5)

1. An unmanned aerial vehicle path planning method based on layered reinforcement learning is characterized by comprising the following steps:
step 1: initializing a deep Q network algorithm and a Q learning algorithm;
step 2: driving the unmanned aerial vehicle to move from a starting point to a target point, and training the deep Q network algorithm and the Q learning algorithm;
when the unmanned aerial vehicle does not detect a dynamic obstacle in the moving process, planning a path by using the deep Q network algorithm;
when the unmanned aerial vehicle detects a dynamic obstacle in the moving process, planning a path by using the Q learning algorithm;
and step 3: repeating step 2 until the training of the deep Q network algorithm and the Q learning algorithm is completed, setting the actual coordinates, the starting point coordinates and the target point coordinates of the unmanned aerial vehicle, and planning the path with the trained deep Q network algorithm and the trained Q learning algorithm.
2. The method for unmanned aerial vehicle path planning based on hierarchical reinforcement learning of claim 1, wherein when the unmanned aerial vehicle does not detect a dynamic obstacle, the deep Q network algorithm plans the path, and the method further comprises updating the Q learning algorithm with the experience tuple generated by the deep Q network algorithm after the current path step is planned;
when the unmanned aerial vehicle detects a dynamic obstacle, the Q learning algorithm plans the path, and the method further comprises updating the deep Q network algorithm with the experience tuple generated by the Q learning algorithm after the current path step is planned.
3. The unmanned aerial vehicle path planning method based on hierarchical reinforcement learning of claim 2, wherein when the Q learning algorithm is updated with the experience tuple generated by the deep Q network algorithm after the current path step is planned, the reward function used by the Q learning algorithm is:
reward = η(d_{s-1} - d_s)
where η is a constant, d_{s-1} is the distance from the unmanned aerial vehicle to the target point at the previous moment, and d_s is the distance from the unmanned aerial vehicle to the target point at the current moment.
4. The method for unmanned aerial vehicle path planning based on hierarchical reinforcement learning according to claim 1, wherein in step 2, before the path is planned by the deep Q network algorithm and the Q learning algorithm, the method further includes: using a heuristic fish algorithm as the action guide of the deep Q network algorithm and the Q learning algorithm during path planning; wherein the heuristic fish algorithm comprises a traveling behavior process and a foraging behavior process, the traveling behavior process acquires the directions in which the unmanned aerial vehicle would collide with surrounding obstacles, the foraging behavior process acquires a plurality of high-priority directions in which the unmanned aerial vehicle moves towards the target point, and the heuristic fish algorithm removes the collision directions from the plurality of high-priority directions and uses the remainder as the action guide.
5. The method for planning the path of the unmanned aerial vehicle based on hierarchical reinforcement learning of claim 4, wherein when acquiring the directions in which the unmanned aerial vehicle may collide with surrounding obstacles, if an obstacle is dynamic, whether the unmanned aerial vehicle will collide with it is judged according to the movement direction and the movement speed of the obstacle.
CN202210883240.5A 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning Active CN115268494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210883240.5A CN115268494B (en) 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210883240.5A CN115268494B (en) 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning

Publications (2)

Publication Number Publication Date
CN115268494A true CN115268494A (en) 2022-11-01
CN115268494B CN115268494B (en) 2024-05-28

Family

ID=83769868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210883240.5A Active CN115268494B (en) 2022-07-26 2022-07-26 Unmanned aerial vehicle path planning method based on layered reinforcement learning

Country Status (1)

Country Link
CN (1) CN115268494B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019147235A1 (en) * 2018-01-24 2019-08-01 Ford Global Technologies, Llc Path planning for autonomous moving devices
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN113821041A (en) * 2021-10-09 2021-12-21 中山大学 Multi-robot collaborative navigation and obstacle avoidance method
CN114003059A (en) * 2021-11-01 2022-02-01 河海大学常州校区 UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN114529061A (en) * 2022-01-26 2022-05-24 江苏科技大学 Method for automatically predicting garbage output distribution and planning optimal transportation route
CN114527759A (en) * 2022-02-25 2022-05-24 重庆大学 End-to-end driving method based on layered reinforcement learning
CN114518770A (en) * 2022-03-01 2022-05-20 西安交通大学 Unmanned aerial vehicle path planning method integrating potential field and deep reinforcement learning
CN114625151A (en) * 2022-03-10 2022-06-14 大连理工大学 Underwater robot obstacle avoidance path planning method based on reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DEMIN PAN,等: "D3QHF: A Hybrid Double-deck Heuristic Reinforcement Learning Approach for UAV Path Planning", IEEE, 31 December 2022 (2022-12-31), pages 1221 - 1226 *
QI WANG, 等: "Study on interface temperature control of laser direct joining of CFRTP and aluminum alloy based on staged laser path planning", OPTICS AND LASER TECHNOLOGY, vol. 154, 9 June 2022 (2022-06-09), pages 1 - 13 *
唐博文,等: "基于事件驱动的无人机强化学习避障研究", 广西科技大学学报, no. 1, 31 March 2019 (2019-03-31), pages 96 - 102 *
程先峰,严勇杰: "基于MAXQ分层强化学习的有人机/无人机协同路径规划研究", 信息化研究, vol. 46, no. 1, 29 February 2020 (2020-02-29), pages 13 - 19 *
陈开元,等: "基于分数阶MRAC 的四旋翼姿态控制", 电光与控制, vol. 28, no. 12, 31 December 2021 (2021-12-31), pages 1 - 5 *

Also Published As

Publication number Publication date
CN115268494B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
Wang et al. Learning to navigate through complex dynamic environment with modular deep reinforcement learning
CN113110509B (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN109254588B (en) Unmanned aerial vehicle cluster cooperative reconnaissance method based on cross variation pigeon swarm optimization
US20220315219A1 (en) Air combat maneuvering method based on parallel self-play
Botteghi et al. On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach
Huq et al. Mobile robot navigation using motor schema and fuzzy context dependent behavior modulation
CN113534819A (en) Method and storage medium for pilot-follow multi-agent formation path planning
CN111723931B (en) Multi-agent confrontation action prediction method and device
CN114967721B (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
Xue et al. Multi-agent deep reinforcement learning for UAVs navigation in unknown complex environment
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN115237151A (en) Multi-moving-object searching method for group unmanned aerial vehicle based on pheromone elicitation
Santos et al. Exploratory path planning using the Max-min ant system algorithm
CN115268494A (en) Unmanned aerial vehicle path planning method based on layered reinforcement learning
Panda et al. Autonomous mobile robot path planning using hybridization of particle swarm optimization and Tabu search
Liang et al. Hierarchical deep reinforcement learning for multi-robot cooperation in partially observable environment
Schwartz An object oriented approach to fuzzy actor-critic learning for multi-agent differential games
Duo et al. A deep reinforcement learning based mapless navigation algorithm using continuous actions
Patel et al. Scalable monte carlo tree search for cav s action planning in colliding scenarios
CN113189985B (en) Partially observable driving planning method based on adaptive particle and belief filling
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN114386556A (en) Target source positioning and obstacle avoidance method based on tabu search and particle swarm optimization
Patel et al. Adaptive reward for CAV action planning using Monte Carlo tree search
CN111562740A (en) Automatic control method based on multi-target reinforcement learning algorithm utilizing gradient
Niu et al. A plume-tracing strategy via continuous state-action reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant