CN113377131A - Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning


Info

Publication number
CN113377131A
CN113377131A
Authority
CN
China
Prior art keywords
neural network
actor
state
unmanned aerial vehicle
Prior art date
Legal status
Granted
Application number
CN202110697404.0A
Other languages
Chinese (zh)
Other versions
CN113377131B (en)
Inventor
刘楠
慕红伟
潘志文
尤肖虎
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110697404.0A
Publication of CN113377131A
Application granted
Publication of CN113377131B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/12 Target-seeking control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning. With the goal of minimizing the completion time of the data collection task, the method fully accounts for the different amounts of data to be collected from each ground node and for each node's energy limit. As for the solution method, the continuous-time trajectory design problem of the unmanned aerial vehicle is converted into a discrete-time Markov decision process, and the optimal data collection decision and motion decision of the unmanned aerial vehicle in every state are obtained on the basis of an Actor-Critic algorithm. The optimal data collection trajectory of the unmanned aerial vehicle can thus be designed, and the collection time is markedly shortened while ensuring that all the data to be transmitted by the ground nodes is collected and that the energy limit of each ground node is respected.

Description

Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning
Technical Field
The invention belongs to the technical field of mobile communication, and particularly relates to a method for obtaining the data collection trajectory of an unmanned aerial vehicle by using reinforcement learning.
Background
With the development of the Internet of Things industry, data collection has become an essential foundation for realizing Internet of Things functions. Although many communication protocols and routing algorithms have been proposed to carry out data collection tasks in the Internet of Things and in wireless sensor networks, the mobility of sensor nodes and the impossibility of guaranteeing network connectivity in the event of natural disasters make it difficult for these communication protocols and routing algorithms to fulfil their intended functions.
Disclosure of Invention
The invention aims to provide a method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning, so as to solve the technical problem that, because of sensor node mobility and the impossibility of guaranteeing network connectivity when natural disasters occur, existing communication protocols and routing algorithms struggle to fulfil their intended functions.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
A method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning comprises the following steps: input the start and end positions of the unmanned aerial vehicle, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of every ground node; taking into account the different amounts of data to be collected and the respective energy limits of the ground nodes, design the data collection trajectory of the unmanned aerial vehicle with the goal of minimizing the data collection task completion time by adopting an Actor-Critic algorithm, as follows:
Step 1: divide the region to be simulated into a grid according to the step length, and define the state space S, the action space A and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic neural network, with the same structure as the Critic neural network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), i.e. the probability of selecting action a in state s; the target Actor neural network, with the same structure as the Actor neural network, has parameter θ⁻;
Step 3: randomly initialize the Critic neural network parameter ω and the Actor neural network parameter θ, and initialize the target Critic neural network parameter ω⁻ = ω and the target Actor neural network parameter θ⁻ = θ; set the experience replay pool size to D, used to store tuples ⟨s, a, r, s_{t+1}⟩, where s_{t+1} is the next state, and set the number of samples drawn in each update to B;
Step 4: set the episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s_1;
Step 5: within a single episode, increment t from 1 to the step limit T:
Step 6: select an action a_t ~ π_θ(a|s_t) according to the current Actor neural network policy, and obtain the immediate reward r_t and the next state s_{t+1};
Step 7: store the state transition record ⟨s_t, a_t, r_t, s_{t+1}⟩ in the experience replay pool;
Step 8: randomly select B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, representing the current state s_i, the action a_i performed, the immediate reward r_i and the next state s_{i+1};
Step 9: compute the Actor update target
    y_i = r_i + γ · Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1}))
where γ is the discount rate, π_{θ⁻}(s_{i+1}) is the action given by the policy of the current target Actor neural network with parameter θ⁻, and Q_{ω⁻} is the state-action value function of the current target Critic neural network with parameter ω⁻;
Step 10: update the Critic neural network parameter ω by minimizing the loss function
    L(ω) = (1/B) Σ_{i=1}^{B} (y_i - Q_ω(s_i, a_i))²;
Step 11: compute the policy gradient
    ∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} ∇_θ log π_θ(a_i|s_i) · Q_ω(s_i, a_i)
and update the Actor neural network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic neural network parameter as ω⁻ ← τω + (1 - τ)ω⁻ and the target Actor neural network parameter as θ⁻ ← τθ + (1 - τ)θ⁻, where the update coefficient τ takes the value 0.01. An illustrative sketch of this training procedure is given below.
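By way of illustration only, the following Python sketch outlines the training procedure of steps 1 to 12 under simplifying assumptions: the data-collection environment is replaced by a DummyEnv placeholder with random dynamics; the state dimension, number of discrete actions, learning rates, optimizer (Adam) and episode/step counts are assumed values not taken from the invention; the Critic is modelled as Q_ω(s, a) with a one-hot action input; and the target-Actor action in step 9 is taken greedily. It is a schematic of the update order, not the exact implementation of the invention.

    import copy
    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Illustrative dimensions and hyper-parameters (assumed), except tau = 0.01 from step 12.
    STATE_DIM, N_ACTIONS = 8, 10
    GAMMA, TAU, POOL_D, BATCH_B, EPISODES_M, STEPS_T = 0.9, 0.01, 10000, 64, 50, 100

    class DummyEnv:
        """Stand-in for the UAV data-collection environment (random dynamics)."""
        def reset(self):
            return torch.randn(STATE_DIM)
        def step(self, action):
            # returns next state s_{t+1} and immediate reward r_t
            return torch.randn(STATE_DIM), random.random()

    # Steps 2-3: Actor (softmax policy), Critic (Q-value) and their target copies.
    actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    critic = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
    pool = deque(maxlen=POOL_D)                        # experience replay pool of size D
    env = DummyEnv()

    def one_hot(idx):
        return F.one_hot(idx, N_ACTIONS).float()

    for episode in range(EPISODES_M):                  # step 4: loop over episodes
        s = env.reset()
        for t in range(STEPS_T):                       # step 5: loop over time steps
            probs = F.softmax(actor(s), dim=-1)        # step 6: a_t ~ pi_theta(.|s_t)
            a = torch.multinomial(probs, 1).squeeze()
            s_next, r = env.step(a.item())
            pool.append((s, a, r, s_next))             # step 7: store <s_t, a_t, r_t, s_{t+1}>
            s = s_next
            if len(pool) < BATCH_B:
                continue
            batch = random.sample(list(pool), BATCH_B) # step 8: sample B records
            bs = torch.stack([b[0] for b in batch])
            ba = torch.stack([b[1] for b in batch])
            br = torch.tensor([b[2] for b in batch])
            bs2 = torch.stack([b[3] for b in batch])
            with torch.no_grad():                      # step 9: target y_i (greedy target-actor action)
                a2 = one_hot(actor_t(bs2).argmax(dim=-1))
                y = br + GAMMA * critic_t(torch.cat([bs2, a2], dim=-1)).squeeze(-1)
            q = critic(torch.cat([bs, one_hot(ba)], dim=-1)).squeeze(-1)
            critic_loss = F.mse_loss(q, y)             # step 10: minimise (y_i - Q_w(s_i, a_i))^2
            opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
            logp = F.log_softmax(actor(bs), dim=-1).gather(1, ba.unsqueeze(1)).squeeze(1)
            actor_loss = -(logp * q.detach()).mean()   # step 11: policy-gradient update of theta
            opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
            for p, tp in zip(critic.parameters(), critic_t.parameters()):  # step 12: soft updates
                tp.data.mul_(1 - TAU).add_(TAU * p.data)
            for p, tp in zip(actor.parameters(), actor_t.parameters()):
                tp.data.mul_(1 - TAU).add_(TAU * p.data)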
Further, the policy-based Actor neural network is used to select the action a(m) at each step m, the value-based Critic neural network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m), and the Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)).
Further, the Actor neural network and the Critic neural network are each composed of a multilayer feedforward neural network.
Further, the number of nodes in the last layer of the Actor corresponds to the number of actions, and a softmax function converts the output into normalized selection probabilities over the actions; the last layer of the Critic is a single node that represents the estimated value of the input state.
Further, the Actor neural network receives the state vector and selects an action, and the Critic neural network receives the state vector and estimates the state value, the state value being the long-term cumulative reward under the current policy.
Further, during training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
The method for obtaining the unmanned aerial vehicle data collection trajectory using reinforcement learning has the following advantages: with the goal of minimizing the completion time of the data collection task, it fully accounts for the different amounts of data to be collected from each ground node and for each node's energy limit. As for the solution method, the continuous-time unmanned aerial vehicle trajectory design problem is converted into a discrete-time Markov decision process, and the optimal data collection decision and motion decision of the unmanned aerial vehicle in every state are obtained on the basis of the Actor-Critic algorithm. The unmanned aerial vehicle assisted ground node data collection trajectory designed by the algorithm markedly reduces the collection time while ensuring that the data to be transmitted by all nodes is fully collected and that the energy limit of each ground node is respected.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the method for obtaining the data collection trajectory of the unmanned aerial vehicle using reinforcement learning is described in further detail below.
Consider a wireless communication system in which an unmanned aerial vehicle collects data during its flight from N ground nodes (GUs), denoted by the set N = {1, 2, …, N}. The unmanned aerial vehicle flies at a fixed altitude H from a start point S ∈ R^{2×1} to an end point E ∈ R^{2×1}, where R denotes the set of real numbers.
The horizontal coordinate of ground node n can be expressed as G_n ∈ R^{2×1}, n ∈ N. The trajectory of the unmanned aerial vehicle over time is defined as:
    U(t) ∈ R^{2×1}, 0 ≤ t ≤ T;
T represents the time required to complete the task. The start-point constraint U(0) and end-point constraint U(T), i.e. that the unmanned aerial vehicle flies from the start point S to the end point E, are therefore:
    U(0) = S, U(T) = E
The maximum speed of the unmanned aerial vehicle during flight is denoted v_max, so the speed limit during flight can be expressed as:
    ‖U(t + Δ) - U(t)‖ / Δ ≤ v_max, 0 ≤ t ≤ T
Here, ‖·‖ denotes the Euclidean norm, Δ denotes an infinitesimally small time interval, and ‖U(t + Δ) - U(t)‖ is the displacement of the unmanned aerial vehicle within the infinitesimal time Δ. The system model for the unmanned aerial vehicle data collection problem is described in detail as follows:
1. Transmission model
Consider a delay-tolerant application scenario in which each ground node is equipped with an omnidirectional antenna and, at time t, transmits its data to the unmanned aerial vehicle with power P_{n,t} over a bandwidth B. The amount of data to be transmitted by node n is denoted M_n, n ∈ N.
The transmission rate R_{n,t} from ground node n to the unmanned aerial vehicle can be expressed as:
    R_{n,t} = B log2(1 + γ_{n,t})
Here, γ_{n,t} denotes the signal-to-noise ratio of the signal received by the unmanned aerial vehicle from ground node n at time t, and can be expressed as:
    γ_{n,t} = P_{n,t} / (λ σ² L_{n,t})
where σ² denotes the power of the white Gaussian noise at the receiving unmanned aerial vehicle, λ (λ > 1) is the signal-to-noise ratio gap between the actual modulation scheme and theoretical Gaussian signalling, and L_{n,t} denotes the average path loss from ground node n to the unmanned aerial vehicle at time t; its specific formula is given in the channel model section below. To avoid transmission interference between ground nodes, it is assumed that ground nodes do not transmit data to the unmanned aerial vehicle at the same time. Therefore, the transmission schedule of all ground nodes also needs to be considered when designing the collection trajectory of the unmanned aerial vehicle:
    C_n(t) ∈ {0, 1}, ∀ n ∈ N, 0 ≤ t ≤ T
    Σ_{n∈N} C_n(t) ≤ 1, 0 ≤ t ≤ T
Here, C_n(t) = 1 means that the unmanned aerial vehicle is currently collecting the data of ground node n, and at most one ground node transmits data to the unmanned aerial vehicle at any moment.
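As a small numerical illustration of the transmission model, the sketch below computes the achievable rate R_{n,t} = B·log2(1 + γ_{n,t}); the SNR expression mirrors the one given above, and the bandwidth, noise power, SNR gap and path-loss values are placeholder assumptions rather than parameters disclosed by the invention.

    import math

    def achievable_rate(p_tx_w, path_loss_linear, bandwidth_hz, noise_power_w, snr_gap=1.5):
        """R_{n,t} = B * log2(1 + gamma), with gamma = P / (lambda * sigma^2 * L).

        All quantities are in linear scale; the numeric values below are placeholders.
        """
        gamma = p_tx_w / (snr_gap * noise_power_w * path_loss_linear)
        return bandwidth_hz * math.log2(1.0 + gamma)

    # Illustrative numbers: 0.1 W transmit power, 1 MHz bandwidth,
    # -110 dBm noise power, 90 dB average path loss.
    noise_w = 10 ** ((-110 - 30) / 10)   # dBm -> W
    loss_lin = 10 ** (90 / 10)           # dB  -> linear
    print(achievable_rate(0.1, loss_lin, 1e6, noise_w))   # bits per second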
2. Channel model
Since the overall data collection task time is relatively long compared to the channel coherence time, we focus on the average statistics of the channel states rather than on the instantaneous statistics, i.e., only the large scale path loss effects are considered in designing the channel gain expression.
The average path loss between ground node n and the unmanned aerial vehicle located at U(t) at time t can be expressed as:
    L_{n,t} = P_{n,t}^{LoS} · L_{n,t}^{LoS} + (1 - P_{n,t}^{LoS}) · L_{n,t}^{NLoS}
L_{n,t}^{LoS} and L_{n,t}^{NLoS} are the average path losses from ground node n to the unmanned aerial vehicle located at U(t) under line-of-sight and non-line-of-sight conditions, respectively, and can be expressed as:
    L_{n,t}^{LoS} = 20 log10(4π f_c d_{n,t} / c) + ξ_LoS
    L_{n,t}^{NLoS} = 20 log10(4π f_c d_{n,t} / c) + ξ_NLoS
The first term of the above two equations is the free-space propagation loss, f_c is the carrier frequency and c is the speed of light; ξ_LoS and ξ_NLoS are the average additional path losses on top of the free-space propagation loss in the line-of-sight and non-line-of-sight cases, respectively (ξ_LoS < ξ_NLoS). d_{n,t} is the distance between ground node n and the unmanned aerial vehicle at time t and can be expressed as:
    d_{n,t} = (‖G_n - U(t)‖² + H²)^{1/2}
where G_n ∈ R^{2×1} is the position of ground node n and H is the flight altitude of the unmanned aerial vehicle. The probability that the link between ground node n and the unmanned aerial vehicle is in line of sight can be expressed as:
    P_{n,t}^{LoS} = 1 / (1 + a·exp(-b(θ_{n,t} - a)))
where a and b are environment-dependent constants and θ_{n,t} is the elevation angle from ground node n to the unmanned aerial vehicle.
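The channel model can be illustrated with the following Python sketch; the environment constants a and b of the line-of-sight probability and the excess losses ξ_LoS and ξ_NLoS are placeholder values for a generic urban setting (assumptions, not values specified by the invention).

    import math

    C = 3.0e8  # speed of light, m/s

    def avg_path_loss_db(node_xy, uav_xy, h, f_c, a=9.61, b=0.16, xi_los=1.0, xi_nlos=20.0):
        """Average path loss L_{n,t} in dB: P_LoS * L_LoS + (1 - P_LoS) * L_NLoS."""
        d = math.sqrt(math.dist(node_xy, uav_xy) ** 2 + h ** 2)   # d_{n,t}
        fspl = 20 * math.log10(4 * math.pi * f_c * d / C)         # free-space propagation loss
        theta = math.degrees(math.asin(h / d))                    # elevation angle, degrees
        p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))      # LoS probability (assumed model)
        return p_los * (fspl + xi_los) + (1.0 - p_los) * (fspl + xi_nlos)

    # Illustrative call: node at (100, 50) m, UAV above (120, 80) m at altitude 100 m, 2 GHz carrier.
    print(avg_path_loss_db((100.0, 50.0), (120.0, 80.0), 100.0, 2.0e9))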
3. Problem description
The invention formulates the trajectory design problem for unmanned aerial vehicle assisted data collection. The aim is to jointly optimize the trajectory U(t) of the unmanned aerial vehicle, the transmission strategy C_n(t), 1 ≤ n ≤ N, of each ground node, and the transmit power P_{n,t}, 1 ≤ n ≤ N, of each ground node, so that the data to be transmitted by all ground nodes is collected in the shortest time while flying from the start point to the end point, taking into account the different amounts of data to be transmitted and the energy limits of the ground nodes. The joint optimization of trajectory, connection strategy and transmit power to minimize the task completion time can be expressed as:
    min_{U(t), C_n(t), P_{n,t}}  T
    s.t. (1) U(0) = S
         (2) U(T) = E
         (3) C_n(t) ∈ {0, 1}, ∀ n ∈ N, 0 ≤ t ≤ T
         (4) R_{n,t} = B log2(1 + γ_{n,t})
         (5) γ_{n,t} = P_{n,t} / (λ σ² L_{n,t})
         (6) ∫_0^T C_n(t) R_{n,t} dt ≥ M_n, ∀ n ∈ N
         (7) ∫_0^T C_n(t) P_{n,t} dt ≤ E_n, ∀ n ∈ N
         (8) Σ_{n∈N} C_n(t) ≤ 1, 0 ≤ t ≤ T
         (9) ‖U(t + Δ) - U(t)‖ / Δ ≤ v_max, 0 ≤ t ≤ T
Here, P_{n,t} denotes the transmit power of ground node n at time t and R_{n,t} its transmission rate at time t; L_{n,t} denotes the average path loss between ground node n and the unmanned aerial vehicle located at U(t) at time t. Constraints (1) and (2) are the start-point and end-point constraints of the unmanned aerial vehicle; constraints (3) and (8) describe the ground node transmission strategy, i.e. no two nodes transmit data to the unmanned aerial vehicle at the same time, so as to avoid interference; constraint (6) requires the unmanned aerial vehicle to stay connected to each ground node long enough for all of its data to be collected; constraint (7) is the energy limit E_n of each ground node; constraint (9) is the maximum-speed limit of the unmanned aerial vehicle.
Next, the state space, the action space and the reward function are defined. Under the reinforcement learning framework, the unmanned aerial vehicle acts as the agent and learns the optimal control policy according to the reinforcement learning algorithm principle: at each time step it receives an observation of the environment and a reward, and performs an action on the environment. A typical Markov decision process can be expressed as the tuple:
    (S, A, P, R, γ)
where S is the state space, A the action space, P the state transition probability, R the reward function and γ the discount factor. The components are defined as follows:
(1) State space:
The projection of the unmanned aerial vehicle's position onto the ground at the end of the m-th time slot can be represented as: s_u(m) = [x(m), y(m)] ∈ L = {Ω_1, Ω_2, …, Ω_I}
The state of ground node n at the end of time slot m can be represented as:
    s_n(m) = [M_n(m), E_n(m)]
where M_n(m) is the amount of data remaining at node n at the end of time slot m and E_n(m) is the residual energy of node n at the end of time slot m. Overall, the system state can be denoted s(m) = [s_u(m), s_1(m), …, s_N(m), sim_t], where sim_t records the current flight time of the unmanned aerial vehicle.
(2) Action space:
The action in time slot m can be expressed as: a(m) = [a_f(m), π(m), P_1(m), …, P_N(m)],
where a_f(m) = [v_m, φ_m] indicates the movement (step length and heading) of the unmanned aerial vehicle; π(m) ∈ {0, 1, …, N} is the ground node connection strategy describing C_{π(m)}(m) = 1, i.e. node π(m) transmits data to the unmanned aerial vehicle; and P_1(m), …, P_N(m) are the transmit powers of the ground nodes in time slot m.
(3) State update procedure
The state update covers the position of the unmanned aerial vehicle and the remaining data and energy of each ground node. Since the simulation area is divided into a grid with step sizes x_s = y_s (each dimension divided into 10 steps), the movement indication of the unmanned aerial vehicle is
    a_f(m) = [v_m, φ_m],
where the heading φ_m takes one of eight values spaced π/4 apart and the step length v_m is either zero (hover) or the distance to the corresponding adjacent grid point; that is, in each state the unmanned aerial vehicle may choose to hover or to move to one of the 8 adjacent grid points. The system state update therefore comprises the position of the unmanned aerial vehicle and, according to the transmission strategy π(m), the residual energy and remaining data of each ground node. It can be expressed as:
    x(m) = x(m-1) + v_m cos φ_m
    y(m) = y(m-1) + v_m sin φ_m
    M_{π(m)}(m) = M_{π(m)}(m-1) - min{R_{π(m)}, M_{π(m)}(m-1)}
    E_n(m) = E_n(m-1) - P_n(m), n ∈ {1, 2, …, N}
    sim_t = sim_t + 1
The first two equations describe the change in the position coordinates of the unmanned aerial vehicle, i.e. its x-coordinate x(m) and y-coordinate y(m); π(m) indicates which ground node is currently uploading data (C_{π(m)}(m) = 1 means that node π(m) uploads data in time slot m), and R_{π(m)} is its transmission rate. The third equation updates the remaining data of that node: when the remaining data M_{π(m)}(m-1) of the ground node is less than the transmission rate R_{π(m)}, its remaining data is updated to 0. The fourth equation describes the residual energy of each ground node, and the last one updates the flight time of the unmanned aerial vehicle.
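The state update equations translate directly into a transition function; the sketch below is a minimal illustration in which the transmission rate R_{π(m)} is passed in as a placeholder argument and π(m) = 0 is taken to mean that no node transmits (an assumed convention).

    import math

    def step_state(x, y, data_left, energy_left, sim_t, v, phi, connect, tx_power, rate):
        """One slot of the state update: UAV position, remaining data/energy, flight time.

        `connect` is pi(m), with 0 taken to mean that no node transmits (assumed convention);
        `rate` stands in for R_{pi(m)} and would come from the transmission model above.
        """
        x, y = x + v * math.cos(phi), y + v * math.sin(phi)            # position update
        if connect > 0:
            n = connect - 1
            data_left[n] -= min(rate, data_left[n])                    # M_{pi(m)} update
        energy_left = [e - p for e, p in zip(energy_left, tx_power)]   # E_n update
        return x, y, data_left, energy_left, sim_t + 1

    # Illustrative call: move one grid step north-east while node 1 uploads at rate 2.
    print(step_state(0.0, 0.0, [5.0, 3.0], [2.0, 2.0], 0,
                     v=1.0, phi=math.pi / 4, connect=1, tx_power=[0.5, 0.0], rate=2.0))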
(4) Reward function
During reinforcement learning, the unmanned aerial vehicle takes an action a in time slot m and obtains a reward, and its evaluation of the action is updated according to the reward that the action produces. Here, the reward function R: S × A → R consists of the following parts:
    r_m = r_data - G × r_p + r_end
First, the amount of data collected in time slot m is calculated: r_data = min{R_{π(m)}, M_{π(m)}(m-1)}, i.e. the smaller of the transmission rate between the currently transmitting node π(m) and the unmanned aerial vehicle and the remaining amount of data to be transmitted by that node. It is assumed that once the unmanned aerial vehicle starts collecting the data of ground node n, all the data stored in its sensor will be acquired. Second, a constraint penalty r_p is calculated: whenever the invalid-ground-node power-consumption constraint, the constraint that a ground node's energy is exhausted while data remains to be transmitted, or the constraint that the unmanned aerial vehicle moves out of the boundary (the unmanned aerial vehicle may only move within the fixed simulation region) is violated, the indicator function G takes the value 1; otherwise it is 0. Here, the invalid-ground-node power-consumption condition means the occurrence of
    P_{π(m)}(m) < Σ_{n∈N} P_n(m)
where P_{π(m)}(m) is the power consumed by the currently transmitting node π(m) and Σ_{n∈N} P_n(m) is the total power consumed by all nodes. The condition that a ground node's energy is exhausted while data remains to be transmitted means that the residual energy E_n(m) ≤ 0 while the amount of data to be transmitted M_n(m) > 0. Meanwhile, in order to encourage the unmanned aerial vehicle to identify and move to the destination as early as possible during learning, sim_t is taken into account in the reward r_end = sim_t × r_e for collecting all the data and reaching the end point, where r_e is a large reward coefficient.
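The reward r_m = r_data - G × r_p + r_end can be assembled as in the following sketch; the penalty magnitude r_p, the reward factor r_e, the out-of-bounds test and the exact form of the invalid-power-consumption check are illustrative assumptions.

    def slot_reward(rate, data_left_prev, tx_power, connect, energy_left, data_left,
                    in_bounds, reached_end, all_collected, sim_t, r_p=10.0, r_e=5.0):
        """r_m = r_data - G * r_p + r_end; r_p and r_e magnitudes are placeholders."""
        # r_data: data collected this slot from the transmitting node pi(m)
        r_data = min(rate, data_left_prev) if connect > 0 else 0.0
        # G = 1 if any constraint is violated: a non-transmitting node consumes power,
        # a node's energy is exhausted while data remains, or the UAV leaves the region.
        tx = tx_power[connect - 1] if connect > 0 else 0.0
        invalid_power = sum(tx_power) > tx
        drained_with_data = any(e <= 0.0 and m > 0.0 for e, m in zip(energy_left, data_left))
        g = 1.0 if (invalid_power or drained_with_data or not in_bounds) else 0.0
        # r_end: terminal bonus once all data is collected and the end point is reached
        r_end = sim_t * r_e if (reached_end and all_collected) else 0.0
        return r_data - g * r_p + r_end

    # Illustrative call for an ordinary (non-terminal) slot.
    print(slot_reward(rate=2.0, data_left_prev=5.0, tx_power=[0.5, 0.0], connect=1,
                      energy_left=[1.5, 2.0], data_left=[3.0, 3.0],
                      in_bounds=True, reached_end=False, all_collected=False, sim_t=7))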
In the invention, the transition probabilities between states in the discrete-time Markov decision problem are unknown. Moreover, since both the state space and the action space of the problem are large, traditional methods for solving Markov decision problems, such as value iteration and policy iteration, are not suitable for the problem model of the invention; therefore the Actor-Critic algorithm from deep reinforcement learning (DRL) is used to solve it. The algorithm uses two networks to find the optimal policy of the Markov decision process: the policy-based Actor network is used to select the action a(m) at each step m, and the value-based Critic network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m). The Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)). The Actor neural network and the Critic neural network are both multilayer feedforward neural networks. The number of nodes in the last layer of the Actor corresponds to the number of actions, and a softmax function converts the output into normalized selection probabilities over the actions; the last layer of the Critic is a single node that represents the estimated value of the input state. The Actor neural network receives the state vector and selects an action, and the Critic neural network likewise receives the state vector and estimates the state value (the long-term cumulative reward under the current policy). During training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
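The Actor and Critic networks described in this paragraph (multilayer feedforward networks, a softmax output over the discrete actions for the Actor, and a single value node for the Critic) can be sketched in PyTorch as follows; the layer widths, state dimension and number of actions are assumed values. Note that the Critic here outputs a state value V(s), following this paragraph, whereas the training steps above use a state-action value Q(s, a).

    import torch
    import torch.nn as nn

    class ActorNet(nn.Module):
        """Feedforward policy network; softmax output gives pi(a|s) over the discrete actions."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

        def forward(self, state):
            return torch.softmax(self.body(state), dim=-1)   # normalised action-selection probabilities

    class CriticNet(nn.Module):
        """Feedforward value network; the single output node estimates the value of the input state."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

        def forward(self, state):
            return self.body(state)

    # Illustrative instantiation for an 8-dimensional state and 10 discrete actions.
    actor, critic = ActorNet(8, 10), CriticNet(8)
    print(actor(torch.randn(8)).sum())   # action probabilities sum to 1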
The invention discloses a method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning, comprising the following steps: input the start and end positions of the unmanned aerial vehicle, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of every ground node; fully taking into account the different amounts of data to be collected and the respective energy limits of the ground nodes, design the unmanned aerial vehicle's ground-node data collection trajectory with the goal of minimizing the data collection task completion time using an Actor-Critic algorithm:
Step 1: divide the region to be simulated into a grid according to the step length, and define the state space S, the action space A and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic neural network, with the same structure as the Critic neural network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), i.e. the probability of selecting action a in state s; the target Actor neural network, with the same structure as the Actor neural network, has parameter θ⁻;
Step 3: randomly initialize the Critic neural network parameter ω and the Actor neural network parameter θ, and initialize the target Critic neural network parameter ω⁻ = ω and the target Actor neural network parameter θ⁻ = θ. Set the experience replay pool size to D (used to store tuples ⟨s, a, r, s_{t+1}⟩), and set the number of samples drawn in each update to B;
Step 4: set the episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s_1;
Step 5: within a single episode, increment t from 1 to the step limit T:
Step 6: select an action a_t ~ π_θ(a|s_t) according to the current Actor neural network policy, and obtain the immediate reward r_t and the next state s_{t+1};
Step 7: store the state transition record ⟨s_t, a_t, r_t, s_{t+1}⟩ in the experience replay pool;
Step 8: randomly select B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, representing the current state s_i, the action a_i performed, the immediate reward r_i and the next state s_{i+1};
Step 9: compute the Actor update target
    y_i = r_i + γ · Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1}))
where γ is the discount rate, π_{θ⁻}(s_{i+1}) is the action given by the policy of the current target Actor neural network with parameter θ⁻, and Q_{ω⁻} is the state-action value function of the current target Critic neural network with parameter ω⁻;
Step 10: update the Critic neural network parameter ω by minimizing the loss function
    L(ω) = (1/B) Σ_{i=1}^{B} (y_i - Q_ω(s_i, a_i))²;
Step 11: compute the policy gradient
    ∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} ∇_θ log π_θ(a_i|s_i) · Q_ω(s_i, a_i)
and update the Actor neural network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic neural network parameter as ω⁻ ← τω + (1 - τ)ω⁻ and the target Actor neural network parameter as θ⁻ ← τθ + (1 - τ)θ⁻, where the update coefficient τ takes the value 0.01.
For performance comparison, the unmanned aerial vehicle data collection trajectory obtained with the Actor-Critic algorithm is compared with the following unmanned aerial vehicle flight schemes:
1. Travelling salesman problem: the unmanned aerial vehicle collects data only while hovering directly above a ground node, and the shortest path for collecting the ground node data is determined by solving a travelling salesman problem (a brute-force sketch of this baseline is given after this list);
2. Travelling salesman problem with optimized collection strategy and ground node transmit power: on the basis of the unmanned aerial vehicle data collection trajectory obtained from the travelling salesman problem, the collection strategy and the ground node transmit power are optimized: assuming the unmanned aerial vehicle moves at constant speed during collection, a dynamic programming algorithm optimizes, for each ground node, the position at which data collection starts, the position at which it ends, the node's transmit power during data collection, and the speed of the unmanned aerial vehicle during collection;
3. Best set of ordered waypoints: given a start point and an end point, the goal is likewise to minimize the time needed to collect all ground node data, but it is assumed that each ground node transmits data to the unmanned aerial vehicle with a fixed transmit power P_t and that the data is collected at a constant rate R whenever the unmanned aerial vehicle comes within range of the node.
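The first baseline can be reproduced for a small number of nodes with a brute-force travelling salesman search, as in the sketch below; the node coordinates, maximum speed and per-node hover times are arbitrary placeholder values.

    import itertools
    import math

    def tsp_visit_order(start, end, nodes):
        """Brute-force shortest start -> all nodes -> end tour (feasible only for small N)."""
        best_order, best_len = None, math.inf
        for order in itertools.permutations(range(len(nodes))):
            pts = [start] + [nodes[i] for i in order] + [end]
            length = sum(math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))
            if length < best_len:
                best_order, best_len = order, length
        return best_order, best_len

    # Illustrative use: the UAV flies the shortest visiting path and hovers above
    # each node until its data is collected (placeholder hover times).
    nodes = [(100, 50), (300, 200), (250, 400), (50, 350)]
    order, dist_m = tsp_visit_order((0, 0), (500, 500), nodes)
    v_max = 20.0                     # m/s, placeholder
    hover_s = [30, 45, 20, 60]       # seconds to drain each node, placeholder
    print(order, round(dist_m / v_max + sum(hover_s), 1))   # total collection time in seconds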
Compared with the prior art, the unmanned aerial vehicle assisted ground node data collection trajectory designed with the Actor-Critic algorithm markedly reduces the collection time while ensuring that the data to be transmitted by all nodes is fully collected and that the energy limit of each ground node is respected.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (6)

1. A method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning, characterized by comprising the following steps: inputting the start and end positions of the unmanned aerial vehicle, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of every ground node; taking into account the different amounts of data to be collected and the respective energy limits of the ground nodes, designing the data collection trajectory of the unmanned aerial vehicle with the goal of minimizing the data collection task completion time by adopting an Actor-Critic algorithm:
Step 1: divide the region to be simulated into a grid according to the step length, and define the state space S, the action space A and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic neural network, with the same structure as the Critic neural network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), i.e. the probability of selecting action a in state s; the target Actor neural network, with the same structure as the Actor neural network, has parameter θ⁻;
Step 3: randomly initialize the Critic neural network parameter ω and the Actor neural network parameter θ, and initialize the target Critic neural network parameter ω⁻ = ω and the target Actor neural network parameter θ⁻ = θ; set the experience replay pool size to D, used to store tuples ⟨s, a, r, s_{t+1}⟩, where s_{t+1} is the next state, and set the number of samples drawn in each update to B;
Step 4: set the episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s_1;
Step 5: within a single episode, increment t from 1 to the step limit T:
Step 6: select an action a_t ~ π_θ(a|s_t) according to the current Actor neural network policy, and obtain the immediate reward r_t and the next state s_{t+1};
Step 7: store the state transition record ⟨s_t, a_t, r_t, s_{t+1}⟩ in the experience replay pool;
Step 8: randomly select B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, representing the current state s_i, the action a_i performed, the immediate reward r_i and the next state s_{i+1};
Step 9: compute the Actor update target
    y_i = r_i + γ · Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1}))
where γ is the discount rate, π_{θ⁻}(s_{i+1}) is the action given by the policy of the current target Actor neural network with parameter θ⁻, and Q_{ω⁻} is the state-action value function of the current target Critic neural network with parameter ω⁻;
Step 10: update the Critic neural network parameter ω by minimizing the loss function
    L(ω) = (1/B) Σ_{i=1}^{B} (y_i - Q_ω(s_i, a_i))²;
Step 11: compute the policy gradient
    ∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} ∇_θ log π_θ(a_i|s_i) · Q_ω(s_i, a_i)
and update the Actor neural network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic neural network parameter as ω⁻ ← τω + (1 - τ)ω⁻ and the target Actor neural network parameter as θ⁻ ← τθ + (1 - τ)θ⁻, where the update coefficient τ takes the value 0.01.
2. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 1, wherein the policy-based Actor neural network is used to select the action a(m) at each step m, the value-based Critic neural network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m), and the Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)).
3. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 2, wherein the Actor neural network and the Critic neural network are each composed of a multilayer feedforward neural network.
4. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 3, wherein the number of nodes in the last layer of the Actor corresponds to the number of actions, a softmax function converts the output into normalized selection probabilities over the actions, and the last layer of the Critic is a single node representing the estimated value of the input state.
5. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 4, wherein the Actor neural network receives the state vector and selects an action, and the Critic neural network receives the state vector and estimates the state value, the state value being the long-term cumulative reward under the current policy.
6. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 5, wherein, during training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
CN202110697404.0A 2021-06-23 2021-06-23 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning Active CN113377131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110697404.0A CN113377131B (en) 2021-06-23 2021-06-23 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning


Publications (2)

Publication Number Publication Date
CN113377131A true CN113377131A (en) 2021-09-10
CN113377131B CN113377131B (en) 2022-06-03

Family

ID=77578579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110697404.0A Active CN113377131B (en) 2021-06-23 2021-06-23 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning

Country Status (1)

Country Link
CN (1) CN113377131B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210074167A1 (en) * 2018-05-10 2021-03-11 Beijing Xiaomi Mobile Software Co., Ltd. Method and apparatus for reporting flight path information, and method and apparatus for determining information
US20200201316A1 (en) * 2018-12-21 2020-06-25 Airbus Defence and Space GmbH Method For Operating An Unmanned Aerial Vehicle As Well As An Unmanned Aerial Vehicle
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN110879610A (en) * 2019-10-24 2020-03-13 北京航空航天大学 Reinforced learning method for autonomous optimizing track planning of solar unmanned aerial vehicle
CN111260031A (en) * 2020-01-14 2020-06-09 西北工业大学 Unmanned aerial vehicle cluster target defense method based on deep reinforcement learning
CN112711271A (en) * 2020-12-16 2021-04-27 中山大学 Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
CN112902969A (en) * 2021-02-03 2021-06-04 重庆大学 Path planning method for unmanned aerial vehicle in data collection process

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113885566A (en) * 2021-10-21 2022-01-04 重庆邮电大学 V-shaped track planning method for minimizing data acquisition time of multiple unmanned aerial vehicles
CN113885566B (en) * 2021-10-21 2024-01-23 重庆邮电大学 V-shaped track planning method oriented to minimization of data acquisition time of multiple unmanned aerial vehicles
CN114025330A (en) * 2022-01-07 2022-02-08 北京航空航天大学 Air-ground cooperative self-organizing network data transmission method
CN114025330B (en) * 2022-01-07 2022-03-25 北京航空航天大学 Air-ground cooperative self-organizing network data transmission method
CN116760888A (en) * 2023-05-31 2023-09-15 中国科学院软件研究所 Intelligent organization and pushing method for data among multiple unmanned aerial vehicles

Also Published As

Publication number Publication date
CN113377131B (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant