CN113283827A - Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning - Google Patents

Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning Download PDF

Info

Publication number
CN113283827A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
logistics
reinforcement learning
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110413367.6A
Other languages
Chinese (zh)
Other versions
CN113283827B (en)
Inventor
于滨
张力
崔少华
刘家铭
单文轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Innovation Research Institute of Beihang University
Original Assignee
Hefei Innovation Research Institute of Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Innovation Research Institute of Beihang University filed Critical Hefei Innovation Research Institute of Beihang University
Priority to CN202110413367.6A priority Critical patent/CN113283827B/en
Publication of CN113283827A publication Critical patent/CN113283827A/en
Application granted granted Critical
Publication of CN113283827B publication Critical patent/CN113283827B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083Shipping
    • G06Q10/0835Relationships between shipper or supplier and carriers
    • G06Q10/08355Routing methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Physics (AREA)
  • Development Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning. First, a deep reinforcement learning model is constructed by preprocessing the logistics distribution area and related data; the corresponding unmanned aerial vehicle flight state space, action space and return value function are established, and the model is trained by combining offline and online learning. Second, a two-stage optimization method plans both the logistics distribution path and the flight path during unmanned aerial vehicle delivery, where the flight path planning stage is completed mainly by real-time action selection based on deep reinforcement learning. Because the delivery cost of the unmanned aerial vehicle is estimated through deep reinforcement learning in the logistics path planning stage, the optimized logistics path better matches the actual flight process of the unmanned aerial vehicle; real-time flight path planning is realized on the basis of deep reinforcement learning, with the advantages of faster computation and higher robustness.

Description

Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning
Technical Field
The invention belongs to the field of intelligent logistics, and particularly relates to an unmanned aerial vehicle logistics path planning method based on deep reinforcement learning.
Background
With the rapid development of unmanned aerial vehicle technology in recent years, more and more logistics enterprises are trying to use unmanned aerial vehicles as a supplement to conventional urban logistics. Compared with traditional ground distribution, using unmanned aerial vehicles for logistics distribution is more flexible, reduces manual labour and improves delivery coverage, so unmanned aerial vehicle logistics is regarded as a reasonable way to solve the last-kilometre problem of logistics. However, using unmanned aerial vehicles for logistics transportation requires not only planning a reasonable distribution path but also considering the safety trajectory of the unmanned aerial vehicle during delivery. A corresponding unmanned aerial vehicle logistics path planning method therefore has to address both path optimization in the distribution process and airspace management during flight, and reducing the distribution cost as much as possible while the unmanned aerial vehicle operates safely is the key objective of unmanned aerial vehicle logistics path planning.
Compared with the traditional vehicle routing problem, path planning for unmanned aerial vehicle logistics additionally involves planning the takeoff and landing positions of the unmanned aerial vehicle and real-time path planning during its flight. Existing logistics path planning methods at home and abroad mainly study the vehicle routing problem with heuristic or exact algorithms and do not involve the flight path and flight control process of the unmanned aerial vehicle. A method that can consider both logistics routing and flight path planning when unmanned aerial vehicles are used for logistics distribution is therefore needed.
Disclosure of Invention
The invention provides a two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning. The method divides the unmanned-aerial-vehicle-based logistics path planning problem into two processes: preprocessing and model training; and two-stage unmanned aerial vehicle path planning. Specifically, the two-stage unmanned aerial vehicle path planning includes an unmanned aerial vehicle logistics path planning stage and an unmanned aerial vehicle flight path planning stage.
The preprocessing and model training process collects the data required to train the deep reinforcement learning model and trains the model by combining offline and online data, and is characterized by comprising the following steps:
1) Firstly, a space rasterization operation is performed on the interior of the logistics service area; an initial state is set for each grid cell according to the distribution of obstacles in the space, grid cells that are forbidden to enter are marked, and a simulation environment based on the actual space is constructed. Combined with the space rasterization result, an offline training data set is constructed from existing manually operated unmanned aerial vehicle trajectory data;
2) The state space S and the action space A of the deep reinforcement learning are determined, and the return value r of the deep reinforcement learning is set according to the distribution task. The return value consists of two parts, namely r = r_l + r_s, where r_l denotes the distance return between the current position of the unmanned aerial vehicle and the target position, and r_s denotes the action safety return value of the unmanned aerial vehicle;
3) A training experience pool is constructed during training to store experience tuples (s, a, r, s'); data are sampled from the experience pool in batches, and the parameters of the neural network that provides the Q value are trained with a gradient descent algorithm;
4) In the simulation environment, logistics paths are generated at random, the unmanned aerial vehicle flight path planning stage is simulated based on the trained deep reinforcement learning model, and the model is trained online. At the same time, this simulation serves as the means by which the first stage of the two-stage unmanned aerial vehicle logistics path planning method estimates the flight cost of the unmanned aerial vehicle.
The unmanned aerial vehicle logistics path planning stage determines the visiting order of the customers to be served and the takeoff and landing stop points of the unmanned aerial vehicles during logistics distribution, and determines the optimal delivery strategy under the condition of delivery safety by taking the unmanned aerial vehicle flight paths into account. It is characterized by comprising the following steps:
1) The position l_i, delivery demand q_i, service time s_i and serviceable time window [a_i, b_i] of each customer point i to be served inside the service area are collected, and a customer data set is constructed;
2) Based on the number of unmanned aerial vehicles N_m at each unmanned aerial vehicle stop point m, the maximum number of unmanned aerial vehicles the stop point can accommodate, the maximum cargo capacity Q and the endurance time T of the unmanned aerial vehicle, an initial logistics distribution path scheme is constructed with a greedy insertion method. The scheme mainly determines: (1) the assignment of the customers to be served to the available unmanned aerial vehicles n_i; (2) the order in which unmanned aerial vehicle n_i visits the customer positions; (3) the takeoff position and landing position of unmanned aerial vehicle n_i. The safety, cost and time consumption of logistics distribution while constructing the initial scheme are estimated by running the deep reinforcement learning model trained in the preprocessing and model training process in the simulation environment.
3) The initial logistics distribution path scheme is optimized with a neighborhood-search-based algorithm, which mainly comprises the following steps: (1) a customer deletion operation is performed on the current logistics distribution path scheme, i.e. part of the customer nodes are removed from the current scheme according to a given deletion strategy and put into a set of customers to be inserted; (2) customers that have not yet been scheduled are selected from the set of customers to be inserted and inserted into the logistics distribution path scheme according to a given insertion strategy until all customers have been assigned; (3) a local neighborhood search is performed on the new scheme obtained after deletion and insertion to find a lower-cost logistics distribution path; (4) whether the neighborhood search process has converged is judged; if not, the process returns to step (1) and continues, and if so, the logistics distribution path scheme with the lowest distribution cost is adopted.
The unmanned aerial vehicle flight path planning stage plans and adjusts the flight path of each unmanned aerial vehicle in real time based on deep reinforcement learning, ensuring safe flight during delivery. It is characterized by comprising the following steps:
1) An unmanned aerial vehicle flight path task set is constructed: based on the unmanned aerial vehicle logistics distribution path scheme obtained in the unmanned aerial vehicle logistics path planning stage, a service sequence q_{n_i} = {m, …, i, …, m'} is generated for the flight controller of each unmanned aerial vehicle n_i, where m and m' denote the stop points from which unmanned aerial vehicle n_i takes off and at which it lands, respectively;
2) Based on the deep reinforcement learning model, the flight actions of all unmanned aerial vehicles on the scheduled logistics distribution paths are selected in real time, and the state space and the accessibility states of the surrounding space grid cells are updated. When the position of unmanned aerial vehicle n_i coincides with the destination and all its distribution tasks are completed, the flight path planning process of unmanned aerial vehicle n_i terminates;
3) Step 2) is repeated until all unmanned aerial vehicles reach their preset destinations and complete their distribution tasks.
The invention has the following advantages:
1. The invention embeds a deep-reinforcement-learning-based unmanned aerial vehicle flight path planning process into the unmanned aerial vehicle logistics distribution path planning method, so that the path planning problems of the two dimensions in unmanned aerial vehicle logistics are optimized simultaneously, and a corresponding two-stage unmanned aerial vehicle path planning method is designed. The two-stage unmanned aerial vehicle path planning method adopted by the invention can effectively guarantee the safety and efficiency of the optimized unmanned aerial vehicle logistics paths.
2. In the first-stage distribution path planning adopted by the invention, the distribution cost, distribution time and path safety of the unmanned aerial vehicle delivery process are estimated from simulation results obtained by running the deep-reinforcement-learning-based flight path planning model in the simulation environment, so that the first-stage logistics path planning result better matches the actual flight process of the unmanned aerial vehicle, the gap between the cost estimates of the two stages is reduced, and the accuracy of the invention in actual use is improved.
3. The invention constructs the deep-reinforcement-learning-based unmanned aerial vehicle flight path planning method by combining static training on existing unmanned aerial vehicle flight trajectory data with a dynamic training process in the simulation environment. In actual use, the delivery process of the unmanned aerial vehicle is controlled by the trained deep reinforcement learning model; compared with traditional path planning algorithms, this saves the time needed to compute the optimal strategy of the unmanned aerial vehicle in real time, keeps the plan consistent with the actual delivery environment, and guarantees the safety of the delivery process.
Drawings
FIG. 1 is a basic flow chart of a two-stage unmanned aerial vehicle path planning method based on deep reinforcement learning;
FIG. 2 is a schematic diagram of an optional action in a flight path planning phase of an unmanned aerial vehicle;
FIG. 3 is a schematic diagram of the unmanned aerial vehicle logistics distribution path planning stage.
Detailed Description
The following detailed description of specific embodiments of the invention is provided in conjunction with the accompanying drawings:
the invention adopts a two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning, which comprises the following specific steps as shown in figure 1:
1) a preprocessing and model training stage:
(1) Firstly, a space rasterization operation is performed in the distribution area and a simulation environment is constructed; inaccessible airspace is set according to the distribution of obstacles in the distribution airspace, and an initial value is set for each spatial grid cell, where 1 indicates that the unmanned aerial vehicle may enter and 0 indicates that it may not. Existing manually operated unmanned aerial vehicle trajectory data are collected, and an offline training data set is constructed;
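By way of illustration only (not part of the original disclosure), the following minimal Python sketch shows how such an occupancy grid could be built, assuming the obstacles are given as axis-aligned boxes already mapped to grid indices; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def build_occupancy_grid(x_cells, y_cells, h_cells, obstacle_boxes):
    """Build a 3-D occupancy grid: 1 = drone may enter, 0 = forbidden.

    obstacle_boxes: list of (x0, x1, y0, y1, h0, h1) index ranges describing
    obstacles already mapped to grid coordinates (an assumption for this sketch).
    """
    grid = np.ones((x_cells, y_cells, h_cells), dtype=np.int8)
    for x0, x1, y0, y1, h0, h1 in obstacle_boxes:
        grid[x0:x1, y0:y1, h0:h1] = 0  # mark obstacle cells as not enterable
    return grid

# Example: a 50 x 50 x 10 delivery area with one building-shaped obstacle.
grid = build_occupancy_grid(50, 50, 10, [(10, 15, 20, 25, 0, 6)])
print(grid.sum(), "cells are enterable")
```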
(2) The state space S and the action space A of the deep reinforcement learning are determined. The state space mainly reflects the spatial position of the unmanned aerial vehicle, its load state and its remaining endurance; concretely, the state at time t can be written as

s_t = (x_t, y_t, h_t, q_t, e_t, c_t^1, …, c_t^n),

where (x_t, y_t, h_t) denotes the coordinates and altitude of the unmanned aerial vehicle at time t, q_t denotes its cargo load at time t, e_t denotes its remaining endurance time at time t, and c_t^i denotes the completion status of the delivery task of customer i at time t (0 means the delivery task has not been completed, 1 means it has been completed). The action space A contains the actions selectable at time t, specifically the 7 actions {climb, descend, advance, retreat, turn left, turn right, hold position}, as shown in FIG. 2, where the basic unit of climb, descend, advance and retreat is one cell of the spatial grid.
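To make the state and action definitions concrete, the sketch below (an illustration, not part of the patent) encodes the state tuple and enumerates the 7 actions as grid displacements; the field names and the reading of "turn left/right" as lateral grid moves are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# The 7 selectable actions, expressed as unit displacements on the spatial grid.
# "Turn left/right" is modelled here as a lateral move; the patent does not fix
# its exact meaning, so this is an assumption of the sketch.
ACTIONS: Dict[str, Tuple[int, int, int]] = {
    "climb":   (0, 0, 1),
    "descend": (0, 0, -1),
    "advance": (1, 0, 0),
    "retreat": (-1, 0, 0),
    "left":    (0, 1, 0),
    "right":   (0, -1, 0),
    "hold":    (0, 0, 0),
}

@dataclass
class DroneState:
    x: int                     # grid x coordinate at time t
    y: int                     # grid y coordinate at time t
    h: int                     # altitude level at time t
    q: float                   # remaining cargo load at time t
    endurance: float           # remaining endurance time at time t
    done: Dict[int, int] = field(default_factory=dict)  # customer i -> 0/1 status

    def as_vector(self):
        """Flatten the state into the feature vector fed to the Q network."""
        return [self.x, self.y, self.h, self.q, self.endurance,
                *self.done.values()]
```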
(3) The deep reinforcement learning return value r is set according to the distribution task. The return value consists of two parts, r = r_l + r_s, where r_l denotes the distance return between the current position of the unmanned aerial vehicle and the target position, and r_s denotes the action safety return value of the unmanned aerial vehicle.
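The exact formulas for r_l and r_s appear only as images in the source publication. The sketch below therefore shows one plausible shaping of the two terms (progress toward the target plus a penalty for entering a forbidden cell); it is an assumption for illustration, not the patented formula.

```python
import math

def distance_return(pos, prev_pos, target):
    """r_l: reward progress toward the target (illustrative shaping only)."""
    d_prev = math.dist(prev_pos, target)
    d_now = math.dist(pos, target)
    return d_prev - d_now          # positive when the drone gets closer

def safety_return(grid, pos, safe_bonus=0.1, crash_penalty=-10.0):
    """r_s: penalize entering a cell marked 0, small bonus otherwise."""
    x, y, h = pos
    return crash_penalty if grid[x, y, h] == 0 else safe_bonus

def total_return(grid, pos, prev_pos, target):
    # r = r_l + r_s as defined in the description
    return distance_return(pos, prev_pos, target) + safety_return(grid, pos)
```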
(4) A training experience pool is established during training to store experience tuples (s, a, r, s'). Data are sampled from the pool in batches, the parameters of the neural network providing the Q value are trained, and the parameters are updated with a gradient descent algorithm. The loss function is expressed as (y_t − Q(s, a; θ))², where the target y_t is computed as

y_t = r_t, if the transition ends in a terminal state;
y_t = r_t + γ · max over a' of Q(s', a'; θ), otherwise,

where the parameter γ denotes the discount factor of the return value (0.95 is used in the specific example), and the termination conditions include returning to the unmanned aerial vehicle stop point after completing the delivery task, entering a grid cell marked 0 or possibly occupied by another unmanned aerial vehicle, and reaching the endurance limit of the unmanned aerial vehicle. Action selection during training follows an ε-greedy strategy, i.e. the action with the largest predicted return value is selected with probability ε, and an action is drawn at random from the action space A with probability 1−ε.
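A compact sketch of the experience replay and batched gradient update described above is given below, using a small PyTorch Q network; the network architecture, replay capacity and ε handling are assumptions made for illustration, not details fixed by the patent.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected network producing one Q value per action."""
    def __init__(self, state_dim, n_actions=7, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

replay = deque(maxlen=100_000)   # experience pool storing (s, a, r, s', done)

def select_action(qnet, state, eps, n_actions=7):
    """Greedy action with probability eps, random action with probability 1-eps,
    following the convention stated in the description."""
    if random.random() < eps:
        with torch.no_grad():
            return int(qnet(torch.tensor(state).float()).argmax())
    return random.randrange(n_actions)

def train_step(qnet, optimizer, batch_size=64, gamma=0.95):
    """One batched gradient-descent update on the squared TD error."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2, r = s.float(), s2.float(), r.float()
    q_sa = qnet(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_t = r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise
        target = r + gamma * (1 - done.float()) * qnet(s2).max(dim=1).values
    loss = ((target - q_sa) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```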
(5) In the simulation environment, logistics path data are generated at random to obtain an online training data set; each path specifically includes its start position, its end position, the positions of the intermediate customer points to be served and the expected arrival times at those points. The unmanned aerial vehicle flight path planning stage is then simulated based on the deep reinforcement learning model trained in step (4): flight paths for the online training data set are planned according to the ε-greedy strategy, and the model is trained online. At the same time, this simulation serves as the means by which the first stage of the two-stage unmanned aerial vehicle logistics path planning method estimates the flight cost, flight time and flight safety of the unmanned aerial vehicle. The flight cost and flight time of the unmanned aerial vehicle are obtained by recording its energy consumption and flight time in real time during the simulated flight.
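How the first stage could query the simulator for the cost of a tentative route is sketched below; `simulate_leg` stands in for a rollout of the trained policy between two grid positions and is an assumption of this sketch, not an interface specified by the patent.

```python
def estimate_route_cost(route, simulate_leg):
    """Estimate cost/time/safety of a delivery route by simulated rollout.

    route: sequence of grid positions [start stop m, customer ..., end stop m']
    simulate_leg(a, b): runs the trained policy from a to b in the simulator
                        and returns (energy_used, flight_time, violated_safety)
    """
    energy = time = 0.0
    safe = True
    for a, b in zip(route[:-1], route[1:]):
        e, t, violated = simulate_leg(a, b)
        energy += e          # energy consumption recorded during the rollout
        time += t            # flight time recorded during the rollout
        safe = safe and not violated
    return energy, time, safe
```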
2) Unmanned aerial vehicle logistics distribution route planning stage:
(1) A customer data set is constructed: the position l_i, delivery demand q_i, service time s_i and serviceable time window [a_i, b_i] of each customer point i to be served within the distribution area are collected, with customer nodes indexed by the serial number i. The number of unmanned aerial vehicles N_m at each unmanned aerial vehicle stop point (indexed by the serial number m) and the maximum number of unmanned aerial vehicles the stop point can accommodate are determined, as are the maximum cargo capacity Q and endurance time T of each unmanned aerial vehicle (indexed by the serial number n_i).
(2) An initial logistics distribution path scheme is constructed with a greedy insertion method; it specifically includes the takeoff and landing stop points m and m' of each unmanned aerial vehicle n_i and the sequence of customers {…, i, …} it serves. As shown in FIG. 3, m and m' respectively denote the start and end of an unmanned aerial vehicle logistics path along which three customers i, j and k are served during delivery. The greedy insertion method can be summarized as follows: customers are taken from the set of customers to be served one at a time and inserted into the current set of unmanned aerial vehicle logistics distribution paths according to the insertion rule; the insertion position chosen is the one for which the distribution cost of the new path after insertion increases the least over the path before insertion. This insertion operation is repeated until all customers have been assigned to paths. In particular, the unmanned aerial vehicle distribution cost is obtained by simulating, in the simulation environment, the flight cost consumed by the unmanned aerial vehicle on the tentatively constructed path, with the unmanned aerial vehicle actions generated by the strategy of selecting the maximum return.
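A minimal sketch of this greedy construction is given below: each unassigned customer is placed at the position that increases the simulated route cost the least. Capacity and endurance feasibility checks are omitted for brevity, and `route_cost` is assumed to wrap the simulation-based estimator sketched earlier; these simplifications are assumptions of the sketch.

```python
def greedy_insert(customers, routes, route_cost):
    """Insert every customer at the cheapest position of the cheapest route.

    customers:  list of customer ids still to be served
    routes:     non-empty list of routes, each a list [m, ..., m'] of stop/customer ids
    route_cost: function(route) -> simulated distribution cost of that route
    """
    for c in customers:
        best = None                      # (extra_cost, route_idx, position)
        for ri, route in enumerate(routes):
            base = route_cost(route)
            for pos in range(1, len(route)):        # keep m first and m' last
                cand = route[:pos] + [c] + route[pos:]
                extra = route_cost(cand) - base
                if best is None or extra < best[0]:
                    best = (extra, ri, pos)
        _, ri, pos = best
        routes[ri].insert(pos, c)        # commit the cheapest insertion
    return routes
```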
(3) The logistics distribution path is optimized with a neighborhood-search-based algorithm. The specific process is as follows:
Step 1: a proportion α of the customers in the current path set is deleted according to a given deletion strategy, where α lies between 0 and 1 in the example. Specifically, the deletion strategies used include: randomly selecting a proportion α of the customers in the current path set for deletion; selecting the proportion α of customers whose deletion reduces the path cost the most; selecting for deletion the customer whose removal yields the k-th largest path cost reduction (k-regret deletion; k is chosen as 2, 3 and 4 in the example); randomly selecting one unmanned aerial vehicle and deleting all of the customers it serves; and selecting the unmanned aerial vehicle with the largest current cost and deleting all of its customers. All deleted customers are put into the set of customers to be inserted.
Step 2: customers are taken from the set of customers to be inserted and insertion positions are selected according to a given insertion strategy so that the flight cost after insertion is minimal, where the flight cost is obtained from the unmanned aerial vehicle flight simulation environment combined with the trained deep reinforcement learning model under the strategy of selecting the maximum-return action. The specific insertion strategies include: randomly selecting a customer from the set of customers to be inserted; selecting the customer whose insertion increases the cost the least; and selecting customers according to the cost increase of the k-th best insertion (k-regret insertion; k is chosen as 2, 3 and 4 in the example).
Each deletion and insertion operation i has a selection weight w_i. In each iteration the selection probability of each deletion/insertion operation is calculated as

p_i = w_i / Σ_j w_j,

and the deletion and insertion operations are selected according to these probabilities.
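The weight-proportional operator choice can be sketched as a roulette-wheel draw; the normalization p_i = w_i / Σ_j w_j used here is the usual adaptive large neighborhood search convention and is assumed to match the formula shown only as an image in the original.

```python
import random

def pick_operator(weights):
    """Draw one deletion or insertion operator with probability w_i / sum(w)."""
    names, w = zip(*weights.items())
    return random.choices(names, weights=w, k=1)[0]

# Example weights for three (illustrative) deletion operators.
delete_weights = {"random": 1.0, "worst": 1.0, "regret": 1.0}
op = pick_operator(delete_weights)
```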
Step 3: if the maximum number of cycles L_1 has not been reached, return to Step 1 and continue the loop with l_1 = l_1 + 1. If the maximum number of cycles has been reached, a local neighborhood search strategy is invoked to refine the current result. Specifically, the local neighborhood search strategies include: exchanging the order of two customers within a path, exchanging two customers between paths, and exchanging several customers that occupy the same position in the service sequence between paths. The number of local neighborhood search iterations in the example is L_2.
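One of the listed neighborhood moves, exchanging two customers between routes, could look like the sketch below; accepting the swap only if the simulated cost drops is an illustrative simplification, not a rule fixed by the patent.

```python
def try_inter_route_swap(routes, i, pi, j, pj, route_cost):
    """Swap the customer at position pi of route i with position pj of route j.

    The swap is kept only if the total simulated cost of the two routes drops.
    Positions 0 and -1 are the takeoff/landing stops and are never swapped.
    """
    before = route_cost(routes[i]) + route_cost(routes[j])
    routes[i][pi], routes[j][pj] = routes[j][pj], routes[i][pi]
    after = route_cost(routes[i]) + route_cost(routes[j])
    if after >= before:                       # undo non-improving swaps
        routes[i][pi], routes[j][pj] = routes[j][pj], routes[i][pi]
    return routes
```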
Step 4: whether the maximum number of search cycles L has been reached is judged. If not, the deletion and insertion operation weights are updated, the process returns to Step 1 and the cycle counter is set to l = l + 1; otherwise the current best result is output to the second-stage unmanned aerial vehicle flight path planning model. Specifically, the weights of the deletion and insertion operations are updated as

w_i ← (1 − η) · w_i + η · π_i / ρ_i,

where the parameter η is the coefficient that controls how strongly the weights are updated according to the scores of the deletion and insertion operations (its value in the example lies between 0 and 1), ρ_i denotes the number of times each operation was used during the iterations, and π_i denotes the score accumulated by the operation during the iterations. Concretely, an operation scores 33 when it yields a new best solution, 9 when it yields a solution that is not the best so far but better than the solution before the operation, and 13 when it yields a solution worse than the one before the operation that is nevertheless accepted by the simulated annealing mechanism.
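The score-based weight adaptation can be sketched as follows, using the scores 33/9/13 from the description; the exact update rule w_i ← (1 − η)·w_i + η·π_i/ρ_i is the standard adaptive weighting form and is assumed where the original formula appears only as an image.

```python
SCORES = {"new_best": 33, "improving": 9, "accepted_worse": 13}

def update_weights(weights, usage, score_sum, eta=0.5):
    """Blend old weights with the average score earned per use of each operator.

    weights:   {operator: w_i}
    usage:     {operator: rho_i}, times the operator was used in this segment
    score_sum: {operator: pi_i}, total score the operator earned in this segment
    eta:       reaction coefficient in (0, 1), illustrative default
    """
    for op in weights:
        if usage.get(op, 0) > 0:
            weights[op] = (1 - eta) * weights[op] + eta * score_sum[op] / usage[op]
    return weights
```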
3) An unmanned aerial vehicle flight path planning stage based on deep reinforcement learning:
(1) The unmanned aerial vehicle flight path task set is constructed. Specifically, based on the logistics distribution paths obtained by the neighborhood algorithm in the first stage, a flight path sequence q_{n_i} = {m, …, i, …, m'} is built for each unmanned aerial vehicle n_i, where m and m' denote the takeoff and landing positions of unmanned aerial vehicle n_i. The start point and end point of the flight path planning stage are determined from this sequence, namely (x_m, y_m, h_m) and (x_m', y_m', h_m').
(2) Unmanned aerial vehicle action selection. Specifically, unmanned aerial vehicle n_i starts at the initial time t_0 from the start point (x_m, y_m, h_m). At any time t it is first checked whether the end point (x_m', y_m', h_m') has been reached; if so, the path planning of unmanned aerial vehicle n_i is complete. If the end point has not been reached, the action corresponding to the maximum Q value output by the neural network is selected, and the unmanned aerial vehicle state s' at time t+1 is generated. In particular, during flight the radar carried by the unmanned aerial vehicle detects in real time whether there are obstacles in the adjacent space grid cells; if an obstacle is detected, the state of that cell is immediately marked as 0, i.e. it may not be entered.
(3) Step (2) is repeated until the unmanned aerial vehicle completes the distribution tasks arranged in its sequence and finally reaches the preset landing position.
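Putting step (2) into code, the sketch below shows one drone's execution loop: at each step the radar observation marks blocked neighbours, the greedy (maximum-Q) action is applied, and the loop ends when the landing stop is reached. The helper functions are assumptions standing in for the trained network and the simulator, and the state fields reuse the DroneState sketch above.

```python
def fly_route(state, goal, qnet_best_action, apply_action, sense_obstacles,
              max_steps=10_000):
    """Execute one drone's flight with the trained policy (illustrative only).

    qnet_best_action(state) -> action with the largest Q value
    apply_action(state, action) -> next state
    sense_obstacles(state) -> marks newly detected obstacle cells as 0 in the grid
    """
    for _ in range(max_steps):
        if (state.x, state.y, state.h) == goal:   # landing stop m' reached
            return state
        sense_obstacles(state)                    # radar update of nearby cells
        action = qnet_best_action(state)          # greedy action from the Q network
        state = apply_action(state, action)
    raise RuntimeError("step limit reached before the landing stop")
```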
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment. It should be noted that modifications and improvements made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning is characterized by comprising the following stages:
a preprocessing and model training stage: performing space rasterization on the space within the distribution range, collecting flight trajectory data of manually operated unmanned aerial vehicles to construct an offline training set, constructing an unmanned aerial vehicle flight simulation environment in combination with the airspace characteristics, designing the unmanned aerial vehicle state space, action space and return value function, and training the model offline and online in combination with the unmanned aerial vehicle distribution process;
unmanned aerial vehicle logistics path planning stage: collecting logistics distribution data in a distribution range, constructing an initial distribution path scheme, evaluating the flight path of the unmanned aerial vehicle by combining a trained model, and optimizing the logistics path of the unmanned aerial vehicle;
unmanned aerial vehicle flight path planning stage: and determining an unmanned aerial vehicle task sequence by combining the output of the unmanned aerial vehicle path planning stage, and outputting the flight path of the unmanned aerial vehicle based on deep reinforcement learning.
2. The deep reinforcement learning-based two-stage unmanned aerial vehicle logistics path planning method of claim 1, wherein the preprocessing and model training stage comprises:
performing the space rasterization operation in the distribution area, constructing the simulation environment, and setting an entry state for each spatial grid cell;
determining the state space S and the action space A of the deep reinforcement learning;
setting the deep reinforcement learning return value function r according to the distribution task; and
collecting flight trajectory data of manually operated unmanned aerial vehicles and constructing the offline training data set.
3. The method of claim 2, wherein the state space comprises four types of state information, namely the spatial position of the unmanned aerial vehicle, its load state, its remaining endurance and the customer point service states, and the action space comprises the selectable actions {climbing, descending, advancing, retreating, turning left, turning right, and holding position}.
4. The method of claim 2, wherein the deep reinforcement learning return value function is composed of two parts: the distance return value r_l between the unmanned aerial vehicle and the target position, and the unmanned aerial vehicle action safety return value r_s.
5. The deep reinforcement learning-based two-phase unmanned aerial vehicle logistics path planning method according to claim 1, wherein for the unmanned aerial vehicle logistics path planning phase, the method comprises the following steps:
collecting unmanned aerial vehicle logistics demand data and constructing an unmanned aerial vehicle logistics demand data set;
determining an unmanned aerial vehicle logistics distribution path planning initial scheme;
and optimizing the unmanned aerial vehicle logistics distribution path planning scheme, and taking the optimized unmanned aerial vehicle logistics distribution path planning scheme as the input of the unmanned aerial vehicle flight path planning stage.
6. The method of claim 5, wherein a neighborhood-search-based method is adopted to optimize the unmanned aerial vehicle logistics distribution path, and the specific neighborhood search process comprises:
a large neighborhood search process based on delete and insert operations;
and (3) optimizing based on local neighborhood searching.
7. The method of claim 5, wherein, when the unmanned aerial vehicle logistics distribution path is optimized, the flight path cost is obtained by running the deep reinforcement learning model in the simulation environment of claim 2.
8. The deep reinforcement learning-based two-phase unmanned aerial vehicle logistics path planning method according to claim 1, is characterized in that the unmanned aerial vehicle flight path planning phase comprises the following steps:
constructing a flight path task set of the unmanned aerial vehicle;
and carrying out unmanned aerial vehicle flight action selection based on deep reinforcement learning.
CN202110413367.6A 2021-04-16 2021-04-16 Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning Active CN113283827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110413367.6A CN113283827B (en) 2021-04-16 2021-04-16 Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110413367.6A CN113283827B (en) 2021-04-16 2021-04-16 Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113283827A true CN113283827A (en) 2021-08-20
CN113283827B CN113283827B (en) 2024-03-12

Family

ID=77276893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110413367.6A Active CN113283827B (en) 2021-04-16 2021-04-16 Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113283827B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201925A (en) * 2022-02-17 2022-03-18 佛山科学技术学院 Unmanned aerial vehicle cluster cooperative task planning method, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110155328A (en) * 2019-05-21 2019-08-23 上海理工大学 The method that unmanned plane carries out medical material dispatching for the mobile clinic in earthquake-stricken area
CN110673637A (en) * 2019-10-08 2020-01-10 福建工程学院 Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning
CN111142557A (en) * 2019-12-23 2020-05-12 清华大学 Unmanned aerial vehicle path planning method and system, computer equipment and readable storage medium
CN112034887A (en) * 2020-09-10 2020-12-04 南京大学 Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
CN112148008A (en) * 2020-09-18 2020-12-29 中国航空无线电电子研究所 Real-time unmanned aerial vehicle path prediction method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113283827B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN107169608B (en) Distribution method and device for multiple unmanned aerial vehicles to execute multiple tasks
CN113159432B (en) Multi-agent path planning method based on deep reinforcement learning
CN110544296B (en) Intelligent planning method for three-dimensional global track of unmanned aerial vehicle in uncertain enemy threat environment
CN106529674B (en) Multiple no-manned plane cooperates with mine to target assignment method
CN107103164B (en) Distribution method and device for unmanned aerial vehicle to execute multiple tasks
CN107977743B (en) Multi-unmanned aerial vehicle cooperative task allocation method and device
CN113110592A (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN107807665B (en) Unmanned aerial vehicle formation detection task cooperative allocation method and device
CN110673637A (en) Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning
CN111678524B (en) Rescue aircraft path planning method and system based on flight safety
CN111813144B (en) Multi-unmanned aerial vehicle collaborative route planning method based on improved flocks of sheep algorithm
CN115730700A (en) Self-adaptive multi-target task planning method, system and equipment based on reference point
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
Ding et al. Improved GWO algorithm for UAV path planning on crop pest monitoring
CN113283827A (en) Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning
CN114578845B (en) Unmanned aerial vehicle track planning method based on improved ant colony algorithm
CN115759328A (en) Helicopter task planning method, system and equipment based on multi-objective optimization
CN115479608A (en) Terminal area approach aircraft four-dimensional track planning method based on time attributes
CN114967748A (en) Unmanned aerial vehicle path planning method based on space deformation
CN116225046A (en) Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
CN114237282A (en) Intelligent unmanned aerial vehicle flight path planning method for intelligent industrial park monitoring
CN114186924A (en) Collaborative distribution path planning method and device, electronic equipment and storage medium
Zhang et al. A UAV autonomous maneuver decision-making algorithm for route guidance
CN114924593B (en) Quick planning method for vehicle and multi-unmanned aerial vehicle combined route
EP4177865A1 (en) Method for determining a flight plan

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant