CN114020024B - Unmanned aerial vehicle path planning method based on Monte Carlo tree search - Google Patents

Unmanned aerial vehicle path planning method based on Monte Carlo tree search

Publication number
CN114020024B
Authority
CN
China
Prior art keywords
node, unmanned aerial vehicle, Monte Carlo tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111305350.5A
Other languages
Chinese (zh)
Other versions
CN114020024A (en)
Inventor
盛可欣
马川
钱玉文
时龙
王喆
李骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111305350.5A
Publication of CN114020024A
Application granted
Publication of CN114020024B

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention discloses an unmanned aerial vehicle path planning method based on Monte Carlo tree search, which is high in algorithm efficiency, good in performance and capable of better adapting to a dynamic environment. The method comprises the following steps: (10) Establishing a Monte Carlo tree, initializing a root node and initializing the position of the unmanned aerial vehicle; (20) Setting the total training times of the Monte Carlo tree search algorithm according to experimental data; (30) Within the set total training times, carrying out search algorithm training on the Monte Carlo tree, enabling Monte Carlo tree parameters to iterate according to specific steps, and enabling the unmanned aerial vehicle to make corresponding actions; (40) When the training times are equal to the total training times, finishing the training to obtain a trained Monte Carlo tree; and continuously selecting the child node with the maximum UCT value downwards by using the UCT algorithm from the root node until reaching one leaf node according to the tree structure of the trained Monte Carlo tree, and executing corresponding actions by the unmanned aerial vehicle according to the selected node to obtain the optimal unmanned aerial vehicle path.

Description

Unmanned aerial vehicle path planning method based on Monte Carlo tree search
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle path planning, and particularly relates to an unmanned aerial vehicle path planning method based on Monte Carlo tree search.
Background
In recent years, drones have proven to be one of the most challenging and promising technologies in aeronautics. Due to their high mobility and low cost, drones have gained widespread use in the field of communications over the past few decades. Meanwhile, unmanned aerial vehicle path planning is becoming one of the key technologies of drones and has been widely studied by scholars all over the world. The main objective of drone path planning is to design an optimal flight path toward a target, that is, a path that satisfies both the performance requirements of the drone and the given target conditions.
Drone-assisted wireless communication can provide wireless connectivity to devices outside communication infrastructure coverage, e.g., areas that infrastructure cannot cover due to severe building shadowing, natural disaster damage, etc. In a communication system, a deployed drone may operate as a mobile relay or as a flying base station. In certain military scenarios, drone-assisted relaying can provide reliable wireless connectivity between two or more remote devices. Since drones greatly reduce the size and weight of wireless network equipment and communicate with ground devices over line-of-sight (LoS) links, they are very attractive to wireless service providers.
Mobile edge computing deploys task computing and storage resources at the edge of a mobile network, provides cloud computing services for the mobile network and for users at the edge, and accelerates the downloading of various contents and applications in the network, thereby offering users a network service solution with ultra-low latency and high bandwidth. Deploying mobile edge servers reduces the transmission demand on the core network and relieves its load. In a mobile edge computing service scenario, the drone may serve as a mobile cloud server providing services to users at the edge of the network. When a user's current task download speed is low, the user can upload the task to the drone, and the drone completes the task offloading and computation.
An appropriate flight path can shorten the communication distance between the user and the drone, which is very important for improving system performance. In a mobile edge computing scenario, in addition to designing the optimal flight trajectory of the drone, the migration throughput should usually be maximized as far as possible. Furthermore, the path planning problem is limited by various constraints, such as energy consumption constraints, user quality-of-service constraints, power constraints, etc., and is typically NP-hard. Conventional algorithms such as the A* algorithm, the genetic algorithm and the ant colony algorithm have great limitations, and solving the problem with such constraint conditions becomes very difficult, so researchers have applied machine learning and reinforcement learning to the field of drone path planning to solve the path planning problem under these complex conditions.
Drone path planning in static environments is becoming more and more mature, but the problem grows considerably more complex when the environment changes. For example, user locations, user task demands and drone-user channel conditions that change in real time all affect the design of the drone's flight trajectory. The task demand of a user in each time slot affects the migration throughput of the system, and user movement affects the state of the communication channel and the energy consumption of the drone. Traditional reinforcement learning methods such as Q-learning incur considerable time and space costs when handling a large number of state-action pairs, which reduces the efficiency of drone path planning and prevents direct application to dynamic environments.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle path planning method based on Monte Carlo tree search, which is high in algorithm efficiency, good in performance and capable of better adapting to a dynamic environment.
The technical solution for realizing the purpose of the invention is as follows:
an unmanned aerial vehicle path planning method based on Monte Carlo tree search comprises the following steps:
(10) Initializing unmanned aerial vehicles and Monte Carlo trees: establishing a Monte Carlo tree, initializing a root node and initializing the position of the unmanned aerial vehicle;
(20) Setting the total training times: setting the total training times of the Monte Carlo tree search algorithm according to experimental data;
(30) Training a Monte Carlo tree search algorithm: within the set total training times, carrying out search algorithm training on the Monte Carlo tree, enabling Monte Carlo tree parameters to iterate according to specific steps, and enabling the unmanned aerial vehicle to make corresponding actions according to the specific steps;
(40) Obtaining an optimal unmanned aerial vehicle path: when the training times are equal to the total training times, finishing the training to obtain a trained Monte Carlo tree; and continuously selecting a child node with the maximum UCT value downwards by using a UCT algorithm from a root node until reaching a leaf node according to the tree structure of the trained Monte Carlo tree, and executing corresponding actions by the unmanned aerial vehicle according to the selected node to obtain the optimal unmanned aerial vehicle path.
Compared with the prior art, the invention has the remarkable advantages that:
1. More applicable to dynamic environments: in the flight environment of the invention, the user positions, the task quantities and the channel states all change over time. Monte Carlo tree search can strike a better action balance in a dynamic, unknown environment and obtain a heuristic solution. Compared with Q-learning, which records state-action pairs in a table, Monte Carlo tree search stores the actions and states of the current situation in a tree structure and adds a strategy of randomly selecting actions, so it is better suited to a dynamically changing environment and solves the drone path planning problem in a complex dynamic environment;
2. the path planning efficiency is high: compared with other traditional reinforcement learning algorithms, the path planning method can effectively reduce training time, greatly reduces the time complexity of the algorithm in a top-down tree search mode, and is more suitable for real-time unmanned aerial vehicle path planning.
The invention is described in further detail below with reference to the figures and the detailed description.
Drawings
Fig. 1 is a main flow chart of the path planning method based on monte carlo tree search according to the present invention.
FIG. 2 is a flow chart of the training steps of the Monte Carlo tree search algorithm of FIG. 1.
FIG. 3 is a flow chart of the step of calculating the prize value of FIG. 2.
FIG. 4 is a schematic diagram of a Monte Carlo tree search algorithm.
Fig. 4 (1) shows the selection of a node, fig. 4 (2) shows the expansion of a node, fig. 4 (3) shows the simulation of a node, and fig. 4 (4) shows the update of the simulation result upward.
Fig. 5 is an exemplary diagram of a flight trajectory of the drone.
Detailed Description
The path planning method based on the Monte Carlo tree search is implemented based on the following scenes:
A scene model of mobile edge computing is established, in which the drone serves as a mobile edge server providing services to a group of users on the ground. For convenience of calculation, the drone can fly only among K given fixed points, and the flight time is discretized into M time slots. In each time slot, each user sends a task offloading request to the drone, and the number of user tasks obeys a Gaussian distribution; the drone may fly from the current fixed point to another fixed point and serve one of the users, typically the user closest to the drone. The position of each user changes dynamically in every time slot.
As shown in fig. 1, the path planning method based on Monte Carlo tree search of the present invention includes the following steps: (10) initializing the drone and the Monte Carlo tree: a Monte Carlo tree is established, the root node is initialized, and the position of the drone is initialized.
The tree structure initially contains only one root node. The information in each node comprises a state S, a quality value Q, a visit count N, a parent node n_p and child nodes n_c; the state S comprises the selected action a, the position l of the currently served user, the remaining battery capacity E of the drone, and so on. An action a can only be selected from a given action set A, which is the set of all actions the drone can perform. In the scenario of the present invention, the action set A consists of the coordinates of the K given fixed points.
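The node record described above can be sketched in Python as follows; the class and field names are illustrative assumptions for this sketch, not identifiers from the patent.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class State:
    action: tuple              # selected action a: one of the K fixed points
    user_pos: Optional[tuple]  # position l of the currently served user
    battery: float             # remaining battery capacity E

@dataclass
class Node:
    state: State
    q: float = 0.0                    # quality value Q
    n: int = 0                        # visit count N
    parent: Optional["Node"] = None   # parent node n_p
    children: list = field(default_factory=list)  # child nodes n_c

# The tree initially contains only the root; the drone starts at (0, 0).
root = Node(State(action=(0.0, 0.0), user_pos=None, battery=1.0))
```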
The drone is initially positioned at the location of coordinates (0, 0).
(20) Setting the total training times: the total training times of the Monte Carlo tree search algorithm is set according to experimental data.
According to conventional experimental data, the total training times of the Monte Carlo tree search algorithm is set to N_e; that is, the parameters of the Monte Carlo tree are iterated N_e times. As long as the training count has not reached N_e, the iteration continues. Preferably, according to experimental experience, N_e can be taken as 500 to 1000.
(30) Training the Monte Carlo tree search algorithm: within the set total training times, search algorithm training is carried out on the Monte Carlo tree, the Monte Carlo tree parameters iterate according to the specific steps, and the drone makes corresponding actions according to those steps.
The Monte Carlo tree is trained N_e times. During training, the parameters of the tree iterate continuously according to the specific steps of the algorithm, and the drone makes corresponding actions according to those steps, so the drone's actions in turn influence the iteration and change of the parameters.
As shown in fig. 2, the monte carlo tree search algorithm training comprises the following steps:
(31) Node selection: starting from the root node, the child node with the largest UCT value is selected downward using the UCT algorithm; the drone executes the corresponding action contained in the node; the selection of child nodes continues downward with the UCT algorithm until a node that is not fully expanded is reached, at which point node selection stops.
As shown in fig. 4 (1), the child node having the largest UCT value is selected downward from the root node by the UCT algorithm. The specific formula is:

UCT(n') = Q(n')/N(n') + C·√( 2·ln N(n) / N(n') )

wherein the constant C is a trade-off factor, Q(n') is the quality value of the child node, N(n') is the number of visits of the child node, and N(n) is the number of visits of the current node.
After the selection of a child node, the drone executes the corresponding action contained in the node, i.e., flies from the fixed point where it is currently located to another fixed point and serves one user. The selection of child nodes continues downward with the UCT algorithm until a node that is not fully expanded is reached. A fully expanded node is one whose number of child nodes equals the number of actions contained in the action set A.
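Selection by UCT can be sketched as below; this assumes the standard UCB1 form of the formula above, and the helper names and the value of the trade-off constant C are illustrative.

```python
import math

def uct_value(parent_visits, child_q, child_visits, c=1.4):
    """UCT score of one child; unvisited children score infinity so they
    are always tried before re-visiting siblings."""
    if child_visits == 0:
        return float("inf")
    return child_q / child_visits + c * math.sqrt(
        2 * math.log(parent_visits) / child_visits)

def select_child(children, parent_visits, c=1.4):
    """children is a list of (Q, N) pairs; return the index of the child
    maximizing the UCT value."""
    return max(range(len(children)),
               key=lambda i: uct_value(parent_visits, *children[i], c=c))
```

For example, among two visited children with (Q, N) = (10, 5) and (9, 3) under a parent visited 8 times, the second child wins: its higher mean reward and lower visit count both raise its score.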
(32) Node expansion: a new node is established as a child node of the current node, and an action a' is randomly selected from the action set A and bound to the new node.
As shown in fig. 4 (2), the current node performs a node expansion operation. A new node is established as a child of the current node, and an action a' is randomly selected from the action set A and bound to the node; the selected action must not repeat the actions of sibling nodes on the same layer.
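A minimal sketch of this expansion step, using plain dictionaries as nodes; the field names are assumptions for illustration.

```python
import random

def expand(node, action_set):
    """Create one child of `node`, bound to a randomly chosen action a'
    that no sibling on the same layer already uses (step (32))."""
    used = {c["action"] for c in node["children"]}
    free = [a for a in action_set if a not in used]
    child = {"action": random.choice(free), "q": 0.0, "n": 0,
             "children": [], "parent": node}
    node["children"].append(child)
    return child
```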
(33) Node simulation: starting from the state corresponding to the new node, simulated action selection and the drone flight process are performed.
As shown in fig. 4 (3), from the state corresponding to the newly expanded node, simulated action selection and the flight process of the drone are carried out. During the simulation, all actions of the drone are randomly selected from the action set A: the drone keeps flying to another random fixed point and selects a user to serve until the battery reaches a low level, at which point it returns to a charging site to recharge. In this process, the drone consumes hovering energy, computing energy and flight energy: the hovering energy consumption is generated while the drone hovers over a fixed point to serve a user, the computing energy consumption is generated by the task computation after a task is offloaded to the drone, and the flight energy consumption is generated while the drone flies between fixed points. After the simulation process is completed, the corresponding reward value is calculated to evaluate the whole simulation. This step does not create new tree nodes; its purpose is to update the parameters of the existing tree nodes, which facilitates trajectory optimization and more favorable action selection by the drone.
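The random playout can be sketched as follows; the per-slot energy decrement and the low-battery threshold are simplified placeholders, not the patent's energy model.

```python
import random

def rollout(battery, action_set, energy_per_slot, low_level=0.1):
    """Step (33): fly to random fixed points until the battery drops to a
    low level; returns the sequence of visited waypoints."""
    path = []
    while battery > low_level:
        path.append(random.choice(action_set))  # random action from A
        battery -= energy_per_slot              # hover + compute + flight
    return path
```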
(34) Calculating the reward value:
The reward value obtained in step (33) is calculated. In the present invention, the reward value is defined in terms of the energy consumption of the drone and the throughput of the user; as shown in fig. 3, the specific calculation method is as follows:
(341) Calculating the energy consumption of the drone:
The total energy consumption is calculated from the hovering energy, computing energy and flight energy consumed by the drone in step (33).
The specific calculation formula of the hovering energy consumption is:

e_h(t) = p_h(t) · S(t) / (B · R_{i,k}(t))

wherein p_h(t) is the drone hover power,

R_{i,k}(t) = log₂( 1 + P_u · h_{i,k}(t) / σ² )

is the data transmission rate between the ith user and the UAV at the kth fixed point, P_u represents the transmission power, σ² is the power of the additive white Gaussian noise,

h_{i,k}(t) = p_0 / ( H² + ‖D_uav(t) − U_i(t)‖² )

is the channel power gain of the path loss model, p_0 is the channel gain at a reference distance of 1 m, H is the flight altitude of the drone, D_uav(t) is the coordinates of the drone at time slot t, and U_i(t) is the coordinates of the ith user. S(t) = μ_i(t)·R_{i,k}(t)·Δt·B is the total number of bits of user tasks offloaded to the drone, μ_i(t) is the number of tasks of the user, B is the channel bandwidth, and Δt is the drone hover time.
The specific calculation formula of the computing energy consumption is:

e_c(t) = γ_c · C · S(t) · f_c²

wherein γ_c is the effective switched capacitance, C is the number of CPU cycles required to compute each bit, and f_c is the CPU frequency.
The specific calculation formula of the flight energy consumption is:

e_f(t) = κ₁ · ‖v(t)‖³ + ( κ₂ / ‖v(t)‖ ) · ( 1 + a_uav² / g² )

wherein κ₁ and κ₂ are constant parameters, v(t) is the flight speed of the drone, a_uav is the acceleration of the drone during takeoff, and g is the gravitational acceleration.
Therefore, the total energy consumption generated by the drone in a single time slot is:

e_total(t) = e_h(t) + e_c(t) + e_f(t)
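A self-contained sketch of this per-slot energy model follows; all numeric parameters in the test values are illustrative placeholders, and the hover-energy form (hover power times the time needed to receive S(t) bits) is an assumption consistent with the definitions above, not the patent's exact implementation.

```python
import math

def slot_energy(mu, p_hover, p_tx, sigma2, p0, H, d_uav, d_user,
                n_bits, bandwidth, gamma_c, cpu_cycles, f_cpu,
                k1, k2, v, a_uav, g=9.8):
    """Total energy e_total(t) = e_h(t) + e_c(t) + e_f(t) for one slot."""
    dist2 = H**2 + (d_uav[0] - d_user[0])**2 + (d_uav[1] - d_user[1])**2
    h_gain = p0 / dist2                             # path-loss channel gain
    rate = math.log2(1 + p_tx * h_gain / sigma2)    # R_{i,k}(t)
    s_bits = mu * n_bits                            # offloaded bits S(t)
    e_h = p_hover * s_bits / (bandwidth * rate)     # hover while receiving
    e_c = gamma_c * cpu_cycles * s_bits * f_cpu**2  # CPU computation
    e_f = k1 * v**3 + (k2 / v) * (1 + a_uav**2 / g**2)  # flight
    return e_h + e_c + e_f
```

Serving more tasks (larger mu) raises both the hovering and computing terms, so the per-slot energy grows with the offloaded load.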
(342) Calculating the throughput of the user:
according to the user served in step (33), assuming that the number of task requests of the user is mu i (t) each task has N number of bits b Bit, then user throughput is μ i (t)N b
(343) Calculating the specific reward value according to the formula:
The reward value generated in a single time slot is calculated as:

r(t) = μ_i(t)/μ_max − e_total(t)/W_max

wherein μ_max is the maximum value of the number of tasks and W_max is the maximum value of the total energy consumption.
The reward value of the whole simulation process in step (33) is:

R = Σ_{τ=t}^{T} r(τ)

wherein t is the time slot at the beginning of the simulation process and T is the total number of time slots.
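Under the normalized reward form given above (served tasks rewarded, consumed energy penalized), the computation can be sketched as:

```python
def slot_reward(mu, e_total, mu_max, w_max):
    """Single-slot reward r(t): normalized throughput minus normalized
    energy consumption."""
    return mu / mu_max - e_total / w_max

def rollout_reward(slots, mu_max, w_max):
    """Total reward R of one simulation: sum of r(t) over the slots from
    the start of the simulation to the last slot T; `slots` is a list of
    (mu, e_total) pairs."""
    return sum(slot_reward(mu, e, mu_max, w_max) for mu, e in slots)
```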
(35) Updating the simulation result upwards: starting from the newly expanded node, the quality values and visit counts of all nodes on the path are updated upwards, along the path opposite to that taken during node selection, until the root node is reached.
As shown in fig. 4 (4), starting from the newly expanded node, the Q values and N values of all nodes on the path are updated upwards along the reverse of the path selected in step (31), until the root node. The quality value Q of a node is updated according to the reward value calculated in step (34), specifically: Q(n) = Q(n) + R; the visit count N of a node is updated by the formula: N(n) = N(n) + 1.
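This backward update can be sketched as follows, again with dictionary nodes whose field names are illustrative:

```python
def backpropagate(node, reward):
    """Step (35): from the newly expanded node, follow parent links up to
    the root, applying Q(n) += R and N(n) += 1 at every node visited."""
    while node is not None:
        node["q"] += reward
        node["n"] += 1
        node = node.get("parent")
```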
At this point, one training iteration ends. The next iteration also begins from the root node and repeats steps (31)-(35). After the training process has been carried out N_e times, the training ends.
(40) Obtaining an optimal unmanned aerial vehicle path: when the training times are equal to the total training times, finishing the training to obtain a trained Monte Carlo tree; and continuously selecting the child node with the maximum UCT value downwards by utilizing the UCT algorithm from the root node until reaching one leaf node according to the tree structure of the trained Monte Carlo tree. And the unmanned plane executes corresponding actions according to the selected nodes, so that the optimal unmanned plane path can be obtained.
After the N_e training iterations are finished, a trained Monte Carlo tree is obtained. At this point, the parameters and structure of the Monte Carlo tree are complete, and the optimal drone trajectory can be obtained more accurately. According to the trained tree structure, the child node with the largest UCT value is selected downward using the UCT algorithm, starting from the root node, and the drone executes the corresponding action a contained in that node, i.e., flies from the current fixed point to another fixed point and serves one user. The selection of child nodes continues downward until a leaf node is reached, with the drone executing the corresponding action at each node selection, which yields the drone trajectory shown in fig. 5.
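Extraction of the final path can be sketched as below. For simplicity this descent ranks children by quality value and visit count rather than by the full UCT formula; with a fully trained tree the two orderings typically agree, but this is a simplifying assumption of the sketch.

```python
def best_path(root):
    """Descend from the root to a leaf, at each level picking the child
    with the highest (Q, N), and collect the actions along the way."""
    path, node = [], root
    while node["children"]:
        node = max(node["children"], key=lambda c: (c["q"], c["n"]))
        path.append(node["action"])
    return path
```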
As shown in the above flow, the unmanned aerial vehicle path planning method based on monte carlo tree search aims to continuously optimize the flight trajectory of the unmanned aerial vehicle and find an optimal path, so that the average throughput of a user is maximized.
In the environment set by the invention, the position of the user, the requirement of the user on the task quantity and the channel state between the user and the unmanned aerial vehicle all change in real time. Unlike other traditional reinforcement learning algorithms, the Monte Carlo Tree search utilizes a tree structure to store the state and actions of the current situation, and adds randomness in the steps of the algorithm, speeding up the training process. Therefore, the path planning method provided by the invention can effectively reduce the training time, reduce the time complexity and the space complexity of the algorithm, and is more suitable for real-time unmanned aerial vehicle path planning in a dynamic environment.

Claims (1)

1. An unmanned aerial vehicle path planning method based on Monte Carlo tree search is characterized by comprising the following steps:
(10) Initializing unmanned aerial vehicles and Monte Carlo trees: establishing a Monte Carlo tree, initializing a root node and initializing the position of the unmanned aerial vehicle;
(20) Setting the total training times: setting the total training times of the Monte Carlo tree search algorithm according to experimental data;
(30) Training a Monte Carlo tree search algorithm: within the set total training times, carrying out search algorithm training on the Monte Carlo tree, enabling Monte Carlo tree parameters to iterate according to specific steps, and enabling the unmanned aerial vehicle to make corresponding actions according to the specific steps;
(40) Obtaining an optimal unmanned aerial vehicle path: when the training times are equal to the total training times, finishing training to obtain a trained Monte Carlo tree; continuously selecting a child node with the maximum UCT value downwards by using a UCT algorithm from a root node until reaching a leaf node according to the tree structure of the trained Monte Carlo tree, and executing corresponding actions by the unmanned aerial vehicle according to the selected node to obtain an optimal unmanned aerial vehicle path;
the (30) Monte Carlo tree search algorithm training step comprises:
(31) Node selection: starting from a root node, selecting the child node with the maximum UCT value downwards by using a UCT algorithm; the unmanned aerial vehicle executes the corresponding action contained in the node; continuing to select child nodes downwards using the UCT algorithm until a node which is not fully expanded is reached, and stopping node selection;
(32) Node expansion: establishing a new node as a child node of the current node, and randomly selecting an action a' from the action set A to bind with the new node;
(33) Node simulation: starting from the state corresponding to the new node, performing simulated action selection and the unmanned aerial vehicle flight process;
(34) Calculating the reward value: after the simulation process is finished, calculating the reward value obtained in the process;
(35) Updating the simulation result upwards: starting from the newly expanded node, updating the quality values and visit counts of all nodes on the path upwards, along the path opposite to that taken during node selection, until the root node is reached;
the step of (34) calculating a prize value comprises:
(341) Calculating the energy consumption of the unmanned aerial vehicle: the total energy consumption of the drone in a single time slot is calculated as follows,

e_total(t) = e_h(t) + e_c(t) + e_f(t),

wherein the hovering energy consumption is

e_h(t) = p_h(t) · S(t) / (B · R_{i,k}(t)),

p_h(t) is the drone hover power, R_{i,k}(t) = log₂(1 + P_u·h_{i,k}(t)/σ²) is the data transmission rate between the ith user and the UAV at the kth fixed point, P_u represents the transmission power, σ² is the power of the additive white Gaussian noise, h_{i,k}(t) = p_0/(H² + ‖D_uav(t) − U_i(t)‖²) is the channel power gain of the path loss model, p_0 is the channel gain at a reference distance of 1 m, H is the flight altitude of the drone, D_uav(t) is the coordinates of the drone at time slot t, and U_i(t) is the coordinates of the ith user; S(t) = μ_i(t)·R_{i,k}(t)·Δt·B is the total number of bits of user tasks offloaded to the drone, μ_i(t) is the number of tasks of the user, B is the channel bandwidth, and Δt is the drone hover time;

the computing energy consumption is

e_c(t) = γ_c · C · S(t) · f_c²,

wherein γ_c is the effective switched capacitance, C is the number of CPU cycles required to compute each bit, and f_c is the CPU frequency;

the flight energy consumption is

e_f(t) = κ₁ · ‖v(t)‖³ + ( κ₂ / ‖v(t)‖ ) · ( 1 + a_uav² / g² ),

wherein κ₁ and κ₂ are constant parameters, v(t) is the flight speed of the drone, a_uav is the acceleration of the drone during takeoff, and g is the gravitational acceleration;
(342) Calculating the user throughput: assuming that the number of task requests of the user is μ_i(t) and each task contains N_b bits, the user throughput is μ_i(t)·N_b;
(343) Calculating the reward value: the reward value of the entire simulation process is calculated as follows,

R = Σ_{τ=t}^{T} r(τ),

wherein the reward value generated in a single time slot is

r(t) = μ_i(t)/μ_max − e_total(t)/W_max,

μ_max is the maximum value of the number of tasks, W_max is the maximum value of the total energy consumption, t is the time slot at the beginning of the simulation process, and T is the total number of time slots.
CN202111305350.5A 2021-11-05 2021-11-05 Unmanned aerial vehicle path planning method based on Monte Carlo tree search Active CN114020024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111305350.5A CN114020024B (en) 2021-11-05 2021-11-05 Unmanned aerial vehicle path planning method based on Monte Carlo tree search


Publications (2)

Publication Number Publication Date
CN114020024A CN114020024A (en) 2022-02-08
CN114020024B true CN114020024B (en) 2023-03-31

Family

ID=80061708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111305350.5A Active CN114020024B (en) 2021-11-05 2021-11-05 Unmanned aerial vehicle path planning method based on Monte Carlo tree search

Country Status (1)

Country Link
CN (1) CN114020024B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115494840B (en) * 2022-08-20 2024-04-12 安徽工程大学 Monte Carlo factor-based MC-IACO welding robot path planning method
CN116185079B (en) * 2023-04-28 2023-08-04 西安迈远科技有限公司 Unmanned aerial vehicle construction inspection route planning method based on self-adaptive cruising

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9417070B1 (en) * 2013-04-01 2016-08-16 Nextgen Aerosciences, Inc. Systems and methods for continuous replanning of vehicle trajectories
CN106959700A (en) * 2017-03-21 2017-07-18 Beihang University UAV swarm cooperative patrol and tracking path planning method based on the upper confidence bound algorithm
CN110166110A (en) * 2019-05-22 2019-08-23 Nanjing University of Science and Technology UAV path planning method based on edge computing
CN110297490A (en) * 2019-06-17 2019-10-01 Northwestern Polytechnical University Self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm
CN110989352A (en) * 2019-12-06 2020-04-10 Shanghai Institute of Technology Swarm robot collaborative search method based on the Monte Carlo tree search algorithm
CN111367317A (en) * 2020-03-27 2020-07-03 National University of Defense Technology Online mission planning method for UAV clusters based on Bayesian learning
CN111552313A (en) * 2020-04-29 2020-08-18 Nanjing University of Science and Technology Multi-UAV path planning method for dynamic task arrival in edge computing
CN113110592A (en) * 2021-04-23 2021-07-13 Nanjing University UAV obstacle avoidance and path planning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A UAV random-search collision-avoidance decision method with jump-point guidance (结合跳点引导的无人机随机搜索避撞决策方法); Li Anti (李安醍); Acta Aeronautica et Astronautica Sinica; 2020-08-25; Vol. 41, No. 8; pp. 2-13 *

Also Published As

Publication number Publication date
CN114020024A (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN109286913B (en) Energy consumption optimization method of unmanned aerial vehicle mobile edge computing system based on cellular network connection
CN111786713B (en) Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
Zhang et al. Energy-efficient trajectory optimization for UAV-assisted IoT networks
CN109831797B Joint optimization method for UAV base station bandwidth and trajectory under limited transmit power
US20210165405A1 (en) Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same
Bayerlein et al. Trajectory optimization for autonomous flying base station via reinforcement learning
CN109067490B (en) Resource allocation method for multi-unmanned aerial vehicle cooperative mobile edge computing system under cellular network connection
KR102394237B1 (en) Multiple unmanned aerial vehicles navigation oprimizaition method and multiple unmanned aerial vehicles system using the same
CN114020024B (en) Unmanned aerial vehicle path planning method based on Monte Carlo tree search
CN114690799A (en) Air-space-ground integrated unmanned aerial vehicle Internet of things data acquisition method based on information age
CN114025330B (en) Air-ground cooperative self-organizing network data transmission method
CN113359480B (en) Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113660681A Multi-agent resource optimization method for UAV-cluster-assisted transmission
CN113919483A (en) Method and system for constructing and positioning radio map in wireless communication network
Shi et al. Age of information optimization with heterogeneous uavs based on deep reinforcement learning
CN117873172A (en) Multi-unmanned aerial vehicle track planning method and system
CN112579290A UAV-based computation task migration method for ground terminal devices
CN116321237A (en) Unmanned aerial vehicle auxiliary internet of vehicles data collection method based on deep reinforcement learning
CN112867023B (en) Method for minimizing perception data acquisition delay through dynamic scheduling of unmanned terminal
CN112383893B (en) Time-sharing-based wireless power transmission method for chargeable sensing network
CN115119174A (en) Unmanned aerial vehicle autonomous deployment method based on energy consumption optimization in irrigation area scene
CN114727323A (en) Unmanned aerial vehicle base station control method and device and model training method and device
Song et al. Personalized Federated Deep Reinforcement Learning-based Trajectory Optimization for Multi-UAV Assisted Edge Computing
Cao et al. Average transmission rate and energy efficiency optimization in uav-assisted IoT
CN116017472B (en) Unmanned aerial vehicle track planning and resource allocation method for emergency network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant