CN113377131A - Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning


Info

Publication number
CN113377131A
CN113377131A
Authority
CN
China
Prior art keywords
neural network
actor
state
unmanned aerial vehicle
Prior art date
Legal status
Granted
Application number
CN202110697404.0A
Other languages
Chinese (zh)
Other versions
CN113377131B (en)
Inventor
刘楠
慕红伟
潘志文
尤肖虎
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110697404.0A
Publication of CN113377131A
Application granted
Publication of CN113377131B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/12 Target-seeking control
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning. With the goal of minimizing the completion time of the data collection task, the method fully accounts for the different amounts of data to be collected from each ground node and for each node's energy limit. As for the solution method, the continuous-time trajectory design problem of the unmanned aerial vehicle is converted into a discrete-time Markov decision process, and the optimal data collection decision and motion decision of the unmanned aerial vehicle in every state are obtained on the basis of an Actor-Critic algorithm. The optimal data collection trajectory of the unmanned aerial vehicle can thus be designed, and the collection time is markedly shortened while ensuring that all the data to be transmitted by the ground nodes is collected and that the energy limit of each ground node is respected.

Description

Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning
Technical Field
The invention belongs to the technical field of mobile communication, and particularly relates to a method for obtaining the data collection trajectory of an unmanned aerial vehicle by using reinforcement learning.
Background
With the development of the Internet of Things industry, data collection has become an essential foundation for realizing Internet of Things functions. Although many communication protocols and routing algorithms have been proposed to carry out data collection tasks in the Internet of Things and in wireless sensor networks, the mobility of sensor nodes and the impossibility of guaranteeing network connectivity in the event of natural disasters make it difficult for these communication protocols and routing algorithms to fulfil their intended functions.
Disclosure of Invention
The invention aims to provide a method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning, so as to solve the technical problem that, because of sensor node mobility and the impossibility of guaranteeing network connectivity when natural disasters occur, existing communication protocols and routing algorithms struggle to fulfil their intended functions.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
A method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning comprises the following steps: input the start and end positions of the unmanned aerial vehicle, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of every ground node; taking into account the different amounts of data to be collected and the respective energy limits of the ground nodes, design the data collection trajectory of the unmanned aerial vehicle with the goal of minimizing the data collection task completion time by adopting an Actor-Critic algorithm, as follows:
Step 1: divide the region to be simulated into a grid according to the step length, and define the state space S, the action space A and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic neural network, with the same structure as the Critic neural network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), i.e. the probability of selecting action a in state s; the target Actor neural network, with the same structure as the Actor neural network, has parameter θ⁻;
Step 3: randomly initialize the Critic neural network parameter ω and the Actor neural network parameter θ, and initialize the target Critic neural network parameter ω⁻ = ω and the target Actor neural network parameter θ⁻ = θ; set the experience replay pool size to D, used to store tuples ⟨s, a, r, s_{t+1}⟩, where s_{t+1} is the next state, and set the number of samples drawn in each update to B;
Step 4: set the episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s_1;
Step 5: within a single episode, increment t from 1 to the step limit T:
Step 6: select an action a_t ~ π_θ(a|s_t) according to the current Actor neural network policy, and obtain the immediate reward r_t and the next state s_{t+1};
Step 7: store the state transition record ⟨s_t, a_t, r_t, s_{t+1}⟩ in the experience replay pool;
Step 8: randomly select B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, representing the current state s_i, the action a_i performed, the immediate reward r_i and the next state s_{i+1};
Step 9: compute the Actor update target
    y_i = r_i + γ · Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1}))
where γ is the discount rate, π_{θ⁻}(s_{i+1}) is the action given by the policy of the current target Actor neural network with parameter θ⁻, and Q_{ω⁻} is the state-action value function of the current target Critic neural network with parameter ω⁻;
Step 10: update the Critic neural network parameter ω by minimizing the loss function
    L(ω) = (1/B) Σ_{i=1}^{B} (y_i - Q_ω(s_i, a_i))²;
Step 11: compute the policy gradient
    ∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} ∇_θ log π_θ(a_i|s_i) · Q_ω(s_i, a_i)
and update the Actor neural network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic neural network parameter as ω⁻ ← τω + (1 - τ)ω⁻ and the target Actor neural network parameter as θ⁻ ← τθ + (1 - τ)θ⁻, where the update coefficient τ takes the value 0.01. An illustrative sketch of this training procedure is given below.
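By way of illustration only, the following Python sketch outlines the training procedure of steps 1 to 12 under simplifying assumptions: the data-collection environment is replaced by a DummyEnv placeholder with random dynamics; the state dimension, number of discrete actions, learning rates, optimizer (Adam) and episode/step counts are assumed values not taken from the invention; the Critic is modelled as Q_ω(s, a) with a one-hot action input; and the target-Actor action in step 9 is taken greedily. It is a schematic of the update order, not the exact implementation of the invention.

    import copy
    import random
    from collections import deque

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Illustrative dimensions and hyper-parameters (assumed), except tau = 0.01 from step 12.
    STATE_DIM, N_ACTIONS = 8, 10
    GAMMA, TAU, POOL_D, BATCH_B, EPISODES_M, STEPS_T = 0.9, 0.01, 10000, 64, 50, 100

    class DummyEnv:
        """Stand-in for the UAV data-collection environment (random dynamics)."""
        def reset(self):
            return torch.randn(STATE_DIM)
        def step(self, action):
            # returns next state s_{t+1} and immediate reward r_t
            return torch.randn(STATE_DIM), random.random()

    # Steps 2-3: Actor (softmax policy), Critic (Q-value) and their target copies.
    actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    critic = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
    pool = deque(maxlen=POOL_D)                        # experience replay pool of size D
    env = DummyEnv()

    def one_hot(idx):
        return F.one_hot(idx, N_ACTIONS).float()

    for episode in range(EPISODES_M):                  # step 4: loop over episodes
        s = env.reset()
        for t in range(STEPS_T):                       # step 5: loop over time steps
            probs = F.softmax(actor(s), dim=-1)        # step 6: a_t ~ pi_theta(.|s_t)
            a = torch.multinomial(probs, 1).squeeze()
            s_next, r = env.step(a.item())
            pool.append((s, a, r, s_next))             # step 7: store <s_t, a_t, r_t, s_{t+1}>
            s = s_next
            if len(pool) < BATCH_B:
                continue
            batch = random.sample(list(pool), BATCH_B) # step 8: sample B records
            bs = torch.stack([b[0] for b in batch])
            ba = torch.stack([b[1] for b in batch])
            br = torch.tensor([b[2] for b in batch])
            bs2 = torch.stack([b[3] for b in batch])
            with torch.no_grad():                      # step 9: target y_i (greedy target-actor action)
                a2 = one_hot(actor_t(bs2).argmax(dim=-1))
                y = br + GAMMA * critic_t(torch.cat([bs2, a2], dim=-1)).squeeze(-1)
            q = critic(torch.cat([bs, one_hot(ba)], dim=-1)).squeeze(-1)
            critic_loss = F.mse_loss(q, y)             # step 10: minimise (y_i - Q_w(s_i, a_i))^2
            opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
            logp = F.log_softmax(actor(bs), dim=-1).gather(1, ba.unsqueeze(1)).squeeze(1)
            actor_loss = -(logp * q.detach()).mean()   # step 11: policy-gradient update of theta
            opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
            for p, tp in zip(critic.parameters(), critic_t.parameters()):  # step 12: soft updates
                tp.data.mul_(1 - TAU).add_(TAU * p.data)
            for p, tp in zip(actor.parameters(), actor_t.parameters()):
                tp.data.mul_(1 - TAU).add_(TAU * p.data)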
Further, the policy-based Actor neural network is used to select the action a(m) at each step m, the value-based Critic neural network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m), and the Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)).
Further, the Actor neural network and the Critic neural network are each composed of a multilayer feedforward neural network.
Further, the number of nodes in the last layer of the Actor corresponds to the number of actions, and a softmax function converts the output into normalized selection probabilities over the actions; the last layer of the Critic is a single node that represents the estimated value of the input state.
Further, the Actor neural network receives the state vector and selects an action, and the Critic neural network receives the state vector and estimates the state value, the state value being the long-term cumulative reward under the current policy.
Further, during training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
The method for obtaining the unmanned aerial vehicle data collection trajectory using reinforcement learning has the following advantages: with the goal of minimizing the completion time of the data collection task, it fully accounts for the different amounts of data to be collected from each ground node and for each node's energy limit. As for the solution method, the continuous-time unmanned aerial vehicle trajectory design problem is converted into a discrete-time Markov decision process, and the optimal data collection decision and motion decision of the unmanned aerial vehicle in every state are obtained on the basis of the Actor-Critic algorithm. The unmanned aerial vehicle assisted ground node data collection trajectory designed by the algorithm markedly reduces the collection time while ensuring that the data to be transmitted by all nodes is fully collected and that the energy limit of each ground node is respected.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the method for obtaining the data collection trajectory of the unmanned aerial vehicle using reinforcement learning is described in further detail below.
Consider a wireless communication system in which an unmanned aerial vehicle collects data during its flight from N ground nodes (GUs), denoted by the set N = {1, 2, …, N}. The unmanned aerial vehicle flies at a fixed altitude H from a start point S ∈ R^{2×1} to an end point E ∈ R^{2×1}, where R denotes the set of real numbers.
The horizontal coordinate of ground node n can be expressed as G_n ∈ R^{2×1}, n ∈ N. The trajectory of the unmanned aerial vehicle over time is defined as:
    U(t) ∈ R^{2×1}, 0 ≤ t ≤ T;
T represents the time required to complete the task. The start-point constraint U(0) and end-point constraint U(T), i.e. that the unmanned aerial vehicle flies from the start point S to the end point E, are therefore:
    U(0) = S, U(T) = E
The maximum speed of the unmanned aerial vehicle during flight is denoted v_max, so the speed limit during flight can be expressed as:
    ‖U(t + Δ) - U(t)‖ / Δ ≤ v_max, 0 ≤ t ≤ T
Here, ‖·‖ denotes the Euclidean norm, Δ denotes an infinitesimally small time interval, and ‖U(t + Δ) - U(t)‖ is the displacement of the unmanned aerial vehicle within the infinitesimal time Δ. The system model for the unmanned aerial vehicle data collection problem is described in detail as follows:
1. Transmission model
Consider a delay-tolerant application scenario in which each ground node is equipped with an omnidirectional antenna and, at time t, transmits its data to the unmanned aerial vehicle with power P_{n,t} over a bandwidth B. The amount of data to be transmitted by node n is denoted M_n, n ∈ N.
The transmission rate R_{n,t} from ground node n to the unmanned aerial vehicle can be expressed as:
    R_{n,t} = B log2(1 + γ_{n,t})
Here, γ_{n,t} denotes the signal-to-noise ratio of the signal received by the unmanned aerial vehicle from ground node n at time t, and can be expressed as:
    γ_{n,t} = P_{n,t} / (λ σ² L_{n,t})
where σ² denotes the power of the white Gaussian noise at the receiving unmanned aerial vehicle, λ (λ > 1) is the signal-to-noise ratio gap between the actual modulation scheme and theoretical Gaussian signalling, and L_{n,t} denotes the average path loss from ground node n to the unmanned aerial vehicle at time t; its specific formula is given in the channel model section below. To avoid transmission interference between ground nodes, it is assumed that ground nodes do not transmit data to the unmanned aerial vehicle at the same time. Therefore, the transmission schedule of all ground nodes also needs to be considered when designing the collection trajectory of the unmanned aerial vehicle:
    C_n(t) ∈ {0, 1}, ∀ n ∈ N, 0 ≤ t ≤ T
    Σ_{n∈N} C_n(t) ≤ 1, 0 ≤ t ≤ T
Here, C_n(t) = 1 means that the unmanned aerial vehicle is currently collecting the data of ground node n, and at most one ground node transmits data to the unmanned aerial vehicle at any moment.
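As a small numerical illustration of the transmission model, the sketch below computes the achievable rate R_{n,t} = B·log2(1 + γ_{n,t}); the SNR expression mirrors the one given above, and the bandwidth, noise power, SNR gap and path-loss values are placeholder assumptions rather than parameters disclosed by the invention.

    import math

    def achievable_rate(p_tx_w, path_loss_linear, bandwidth_hz, noise_power_w, snr_gap=1.5):
        """R_{n,t} = B * log2(1 + gamma), with gamma = P / (lambda * sigma^2 * L).

        All quantities are in linear scale; the numeric values below are placeholders.
        """
        gamma = p_tx_w / (snr_gap * noise_power_w * path_loss_linear)
        return bandwidth_hz * math.log2(1.0 + gamma)

    # Illustrative numbers: 0.1 W transmit power, 1 MHz bandwidth,
    # -110 dBm noise power, 90 dB average path loss.
    noise_w = 10 ** ((-110 - 30) / 10)   # dBm -> W
    loss_lin = 10 ** (90 / 10)           # dB  -> linear
    print(achievable_rate(0.1, loss_lin, 1e6, noise_w))   # bits per second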
2. Channel model
Since the overall data collection task time is relatively long compared to the channel coherence time, we focus on the average statistics of the channel states rather than on the instantaneous statistics, i.e., only the large scale path loss effects are considered in designing the channel gain expression.
The average path loss between ground node n and the unmanned aerial vehicle located at U(t) at time t can be expressed as:
    L_{n,t} = P_{n,t}^{LoS} · L_{n,t}^{LoS} + (1 - P_{n,t}^{LoS}) · L_{n,t}^{NLoS}
L_{n,t}^{LoS} and L_{n,t}^{NLoS} are the average path losses from ground node n to the unmanned aerial vehicle located at U(t) under line-of-sight and non-line-of-sight conditions, respectively, and can be expressed as:
    L_{n,t}^{LoS} = 20 log10(4π f_c d_{n,t} / c) + ξ_LoS
    L_{n,t}^{NLoS} = 20 log10(4π f_c d_{n,t} / c) + ξ_NLoS
The first term of the above two equations is the free-space propagation loss, f_c is the carrier frequency and c is the speed of light; ξ_LoS and ξ_NLoS are the average additional path losses on top of the free-space propagation loss in the line-of-sight and non-line-of-sight cases, respectively (ξ_LoS < ξ_NLoS). d_{n,t} is the distance between ground node n and the unmanned aerial vehicle at time t and can be expressed as:
    d_{n,t} = (‖G_n - U(t)‖² + H²)^{1/2}
where G_n ∈ R^{2×1} is the position of ground node n and H is the flight altitude of the unmanned aerial vehicle. The probability that the link between ground node n and the unmanned aerial vehicle is in line of sight can be expressed as:
    P_{n,t}^{LoS} = 1 / (1 + a·exp(-b(θ_{n,t} - a)))
where a and b are environment-dependent constants and θ_{n,t} is the elevation angle from ground node n to the unmanned aerial vehicle.
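The channel model can be illustrated with the following Python sketch; the environment constants a and b of the line-of-sight probability and the excess losses ξ_LoS and ξ_NLoS are placeholder values for a generic urban setting (assumptions, not values specified by the invention).

    import math

    C = 3.0e8  # speed of light, m/s

    def avg_path_loss_db(node_xy, uav_xy, h, f_c, a=9.61, b=0.16, xi_los=1.0, xi_nlos=20.0):
        """Average path loss L_{n,t} in dB: P_LoS * L_LoS + (1 - P_LoS) * L_NLoS."""
        d = math.sqrt(math.dist(node_xy, uav_xy) ** 2 + h ** 2)   # d_{n,t}
        fspl = 20 * math.log10(4 * math.pi * f_c * d / C)         # free-space propagation loss
        theta = math.degrees(math.asin(h / d))                    # elevation angle, degrees
        p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))      # LoS probability (assumed model)
        return p_los * (fspl + xi_los) + (1.0 - p_los) * (fspl + xi_nlos)

    # Illustrative call: node at (100, 50) m, UAV above (120, 80) m at altitude 100 m, 2 GHz carrier.
    print(avg_path_loss_db((100.0, 50.0), (120.0, 80.0), 100.0, 2.0e9))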
3. Problem description
The invention formulates the trajectory design problem for unmanned aerial vehicle assisted data collection. The aim is to jointly optimize the trajectory U(t) of the unmanned aerial vehicle, the transmission strategy C_n(t), 1 ≤ n ≤ N, of each ground node, and the transmit power P_{n,t}, 1 ≤ n ≤ N, of each ground node, so that the data to be transmitted by all ground nodes is collected in the shortest time while flying from the start point to the end point, taking into account the different amounts of data to be transmitted and the energy limits of the ground nodes. The joint optimization of trajectory, connection strategy and transmit power to minimize the task completion time can be expressed as:
    min_{U(t), C_n(t), P_{n,t}}  T
    s.t. (1) U(0) = S
         (2) U(T) = E
         (3) C_n(t) ∈ {0, 1}, ∀ n ∈ N, 0 ≤ t ≤ T
         (4) R_{n,t} = B log2(1 + γ_{n,t})
         (5) γ_{n,t} = P_{n,t} / (λ σ² L_{n,t})
         (6) ∫_0^T C_n(t) R_{n,t} dt ≥ M_n, ∀ n ∈ N
         (7) ∫_0^T C_n(t) P_{n,t} dt ≤ E_n, ∀ n ∈ N
         (8) Σ_{n∈N} C_n(t) ≤ 1, 0 ≤ t ≤ T
         (9) ‖U(t + Δ) - U(t)‖ / Δ ≤ v_max, 0 ≤ t ≤ T
Here, P_{n,t} denotes the transmit power of ground node n at time t and R_{n,t} its transmission rate at time t; L_{n,t} denotes the average path loss between ground node n and the unmanned aerial vehicle located at U(t) at time t. Constraints (1) and (2) are the start-point and end-point constraints of the unmanned aerial vehicle; constraints (3) and (8) describe the ground node transmission strategy, i.e. no two nodes transmit data to the unmanned aerial vehicle at the same time, so as to avoid interference; constraint (6) requires the unmanned aerial vehicle to stay connected to each ground node long enough for all of its data to be collected; constraint (7) is the energy limit E_n of each ground node; constraint (9) is the maximum-speed limit of the unmanned aerial vehicle.
Next, the state space, the action space and the reward function are defined. Under the reinforcement learning framework, the unmanned aerial vehicle acts as the agent and learns the optimal control policy according to the reinforcement learning algorithm principle: at each time step it receives an observation of the environment and a reward, and performs an action on the environment. A typical Markov decision process can be expressed as the tuple:
    (S, A, P, R, γ)
where S is the state space, A the action space, P the state transition probability, R the reward function and γ the discount factor. The components are defined as follows:
(1) State space:
The projection of the unmanned aerial vehicle's position onto the ground at the end of the m-th time slot can be represented as: s_u(m) = [x(m), y(m)] ∈ L = {Ω_1, Ω_2, …, Ω_I}
The state of ground node n at the end of time slot m can be represented as:
    s_n(m) = [M_n(m), E_n(m)]
where M_n(m) is the amount of data remaining at node n at the end of time slot m and E_n(m) is the residual energy of node n at the end of time slot m. Overall, the system state can be denoted s(m) = [s_u(m), s_1(m), …, s_N(m), sim_t], where sim_t records the current flight time of the unmanned aerial vehicle.
(2) Action space:
The action in time slot m can be expressed as: a(m) = [a_f(m), π(m), P_1(m), …, P_N(m)],
where a_f(m) = [v_m, φ_m] indicates the movement (step length and heading) of the unmanned aerial vehicle; π(m) ∈ {0, 1, …, N} is the ground node connection strategy describing C_{π(m)}(m) = 1, i.e. node π(m) transmits data to the unmanned aerial vehicle; and P_1(m), …, P_N(m) are the transmit powers of the ground nodes in time slot m.
(3) State update procedure
The state update covers the position of the unmanned aerial vehicle and the remaining data and energy of each ground node. Since the simulation area is divided into a grid with step sizes x_s = y_s (each dimension divided into 10 steps), the movement indication of the unmanned aerial vehicle is
    a_f(m) = [v_m, φ_m],
where the heading φ_m takes one of eight values spaced π/4 apart and the step length v_m is either zero (hover) or the distance to the corresponding adjacent grid point; that is, in each state the unmanned aerial vehicle may choose to hover or to move to one of the 8 adjacent grid points. The system state update therefore comprises the position of the unmanned aerial vehicle and, according to the transmission strategy π(m), the residual energy and remaining data of each ground node. It can be expressed as:
    x(m) = x(m-1) + v_m cos φ_m
    y(m) = y(m-1) + v_m sin φ_m
    M_{π(m)}(m) = M_{π(m)}(m-1) - min{R_{π(m)}, M_{π(m)}(m-1)}
    E_n(m) = E_n(m-1) - P_n(m), n ∈ {1, 2, …, N}
    sim_t = sim_t + 1
The first two equations describe the change in the position coordinates of the unmanned aerial vehicle, i.e. its x-coordinate x(m) and y-coordinate y(m); π(m) indicates which ground node is currently uploading data (C_{π(m)}(m) = 1 means that node π(m) uploads data in time slot m), and R_{π(m)} is its transmission rate. The third equation updates the remaining data of that node: when the remaining data M_{π(m)}(m-1) of the ground node is less than the transmission rate R_{π(m)}, its remaining data is updated to 0. The fourth equation describes the residual energy of each ground node, and the last one updates the flight time of the unmanned aerial vehicle.
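The state update equations translate directly into a transition function; the sketch below is a minimal illustration in which the transmission rate R_{π(m)} is passed in as a placeholder argument and π(m) = 0 is taken to mean that no node transmits (an assumed convention).

    import math

    def step_state(x, y, data_left, energy_left, sim_t, v, phi, connect, tx_power, rate):
        """One slot of the state update: UAV position, remaining data/energy, flight time.

        `connect` is pi(m), with 0 taken to mean that no node transmits (assumed convention);
        `rate` stands in for R_{pi(m)} and would come from the transmission model above.
        """
        x, y = x + v * math.cos(phi), y + v * math.sin(phi)            # position update
        if connect > 0:
            n = connect - 1
            data_left[n] -= min(rate, data_left[n])                    # M_{pi(m)} update
        energy_left = [e - p for e, p in zip(energy_left, tx_power)]   # E_n update
        return x, y, data_left, energy_left, sim_t + 1

    # Illustrative call: move one grid step north-east while node 1 uploads at rate 2.
    print(step_state(0.0, 0.0, [5.0, 3.0], [2.0, 2.0], 0,
                     v=1.0, phi=math.pi / 4, connect=1, tx_power=[0.5, 0.0], rate=2.0))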
(4) Reward function
During reinforcement learning, the unmanned aerial vehicle takes an action a in time slot m and obtains a reward, and its evaluation of the action is updated according to the reward that the action produces. Here, the reward function R: S × A → R consists of the following parts:
    r_m = r_data - G × r_p + r_end
First, the amount of data collected in time slot m is calculated: r_data = min{R_{π(m)}, M_{π(m)}(m-1)}, i.e. the smaller of the transmission rate between the currently transmitting node π(m) and the unmanned aerial vehicle and the remaining amount of data to be transmitted by that node. It is assumed that once the unmanned aerial vehicle starts collecting the data of ground node n, all the data stored in its sensor will be acquired. Second, a constraint penalty r_p is calculated: whenever the invalid-ground-node power-consumption constraint, the constraint that a ground node's energy is exhausted while data remains to be transmitted, or the constraint that the unmanned aerial vehicle moves out of the boundary (the unmanned aerial vehicle may only move within the fixed simulation region) is violated, the indicator function G takes the value 1; otherwise it is 0. Here, the invalid-ground-node power-consumption condition means the occurrence of
    P_{π(m)}(m) < Σ_{n∈N} P_n(m)
where P_{π(m)}(m) is the power consumed by the currently transmitting node π(m) and Σ_{n∈N} P_n(m) is the total power consumed by all nodes. The condition that a ground node's energy is exhausted while data remains to be transmitted means that the residual energy E_n(m) ≤ 0 while the amount of data to be transmitted M_n(m) > 0. Meanwhile, in order to encourage the unmanned aerial vehicle to identify and move to the destination as early as possible during learning, sim_t is taken into account in the reward r_end = sim_t × r_e for collecting all the data and reaching the end point, where r_e is a large reward coefficient.
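The reward r_m = r_data - G × r_p + r_end can be assembled as in the following sketch; the penalty magnitude r_p, the reward factor r_e, the out-of-bounds test and the exact form of the invalid-power-consumption check are illustrative assumptions.

    def slot_reward(rate, data_left_prev, tx_power, connect, energy_left, data_left,
                    in_bounds, reached_end, all_collected, sim_t, r_p=10.0, r_e=5.0):
        """r_m = r_data - G * r_p + r_end; r_p and r_e magnitudes are placeholders."""
        # r_data: data collected this slot from the transmitting node pi(m)
        r_data = min(rate, data_left_prev) if connect > 0 else 0.0
        # G = 1 if any constraint is violated: a non-transmitting node consumes power,
        # a node's energy is exhausted while data remains, or the UAV leaves the region.
        tx = tx_power[connect - 1] if connect > 0 else 0.0
        invalid_power = sum(tx_power) > tx
        drained_with_data = any(e <= 0.0 and m > 0.0 for e, m in zip(energy_left, data_left))
        g = 1.0 if (invalid_power or drained_with_data or not in_bounds) else 0.0
        # r_end: terminal bonus once all data is collected and the end point is reached
        r_end = sim_t * r_e if (reached_end and all_collected) else 0.0
        return r_data - g * r_p + r_end

    # Illustrative call for an ordinary (non-terminal) slot.
    print(slot_reward(rate=2.0, data_left_prev=5.0, tx_power=[0.5, 0.0], connect=1,
                      energy_left=[1.5, 2.0], data_left=[3.0, 3.0],
                      in_bounds=True, reached_end=False, all_collected=False, sim_t=7))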
In the invention, the transition probabilities between states in the discrete-time Markov decision problem are unknown. Moreover, since both the state space and the action space of the problem are large, traditional methods for solving Markov decision problems, such as value iteration and policy iteration, are not suitable for the problem model of the invention; therefore the Actor-Critic algorithm from deep reinforcement learning (DRL) is used to solve it. The algorithm uses two networks to find the optimal policy of the Markov decision process: the policy-based Actor network is used to select the action a(m) at each step m, and the value-based Critic network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m). The Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)). The Actor neural network and the Critic neural network are both multilayer feedforward neural networks. The number of nodes in the last layer of the Actor corresponds to the number of actions, and a softmax function converts the output into normalized selection probabilities over the actions; the last layer of the Critic is a single node that represents the estimated value of the input state. The Actor neural network receives the state vector and selects an action, and the Critic neural network likewise receives the state vector and estimates the state value (the long-term cumulative reward under the current policy). During training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
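The Actor and Critic networks described in this paragraph (multilayer feedforward networks, a softmax output over the discrete actions for the Actor, and a single value node for the Critic) can be sketched in PyTorch as follows; the layer widths, state dimension and number of actions are assumed values. Note that the Critic here outputs a state value V(s), following this paragraph, whereas the training steps above use a state-action value Q(s, a).

    import torch
    import torch.nn as nn

    class ActorNet(nn.Module):
        """Feedforward policy network; softmax output gives pi(a|s) over the discrete actions."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

        def forward(self, state):
            return torch.softmax(self.body(state), dim=-1)   # normalised action-selection probabilities

    class CriticNet(nn.Module):
        """Feedforward value network; the single output node estimates the value of the input state."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

        def forward(self, state):
            return self.body(state)

    # Illustrative instantiation for an 8-dimensional state and 10 discrete actions.
    actor, critic = ActorNet(8, 10), CriticNet(8)
    print(actor(torch.randn(8)).sum())   # action probabilities sum to 1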
The invention discloses a method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning, comprising the following steps: input the start and end positions of the unmanned aerial vehicle, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of every ground node; fully taking into account the different amounts of data to be collected and the respective energy limits of the ground nodes, design the unmanned aerial vehicle's ground-node data collection trajectory with the goal of minimizing the data collection task completion time using an Actor-Critic algorithm:
Step 1: divide the region to be simulated into a grid according to the step length, and define the state space S, the action space A and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic neural network, with the same structure as the Critic neural network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), i.e. the probability of selecting action a in state s; the target Actor neural network, with the same structure as the Actor neural network, has parameter θ⁻;
Step 3: randomly initialize the Critic neural network parameter ω and the Actor neural network parameter θ, and initialize the target Critic neural network parameter ω⁻ = ω and the target Actor neural network parameter θ⁻ = θ. Set the experience replay pool size to D (used to store tuples ⟨s, a, r, s_{t+1}⟩), and set the number of samples drawn in each update to B;
Step 4: set the episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s_1;
Step 5: within a single episode, increment t from 1 to the step limit T:
Step 6: select an action a_t ~ π_θ(a|s_t) according to the current Actor neural network policy, and obtain the immediate reward r_t and the next state s_{t+1};
Step 7: store the state transition record ⟨s_t, a_t, r_t, s_{t+1}⟩ in the experience replay pool;
Step 8: randomly select B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, representing the current state s_i, the action a_i performed, the immediate reward r_i and the next state s_{i+1};
Step 9: compute the Actor update target
    y_i = r_i + γ · Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1}))
where γ is the discount rate, π_{θ⁻}(s_{i+1}) is the action given by the policy of the current target Actor neural network with parameter θ⁻, and Q_{ω⁻} is the state-action value function of the current target Critic neural network with parameter ω⁻;
Step 10: update the Critic neural network parameter ω by minimizing the loss function
    L(ω) = (1/B) Σ_{i=1}^{B} (y_i - Q_ω(s_i, a_i))²;
Step 11: compute the policy gradient
    ∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} ∇_θ log π_θ(a_i|s_i) · Q_ω(s_i, a_i)
and update the Actor neural network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic neural network parameter as ω⁻ ← τω + (1 - τ)ω⁻ and the target Actor neural network parameter as θ⁻ ← τθ + (1 - τ)θ⁻, where the update coefficient τ takes the value 0.01.
For performance comparison, the unmanned aerial vehicle data collection trajectory obtained with the Actor-Critic algorithm is compared with the following unmanned aerial vehicle flight schemes:
1. Travelling salesman problem: the unmanned aerial vehicle collects data only while hovering directly above a ground node, and the shortest path for collecting the ground node data is determined by solving a travelling salesman problem (a brute-force sketch of this baseline is given after this list);
2. Travelling salesman problem with optimized collection strategy and ground node transmit power: on the basis of the unmanned aerial vehicle data collection trajectory obtained from the travelling salesman problem, the collection strategy and the ground node transmit power are optimized: assuming the unmanned aerial vehicle moves at constant speed during collection, a dynamic programming algorithm optimizes, for each ground node, the position at which data collection starts, the position at which it ends, the node's transmit power during data collection, and the speed of the unmanned aerial vehicle during collection;
3. Best set of ordered waypoints: given a start point and an end point, the goal is likewise to minimize the time needed to collect all ground node data, but it is assumed that each ground node transmits data to the unmanned aerial vehicle with a fixed transmit power P_t and that the data is collected at a constant rate R whenever the unmanned aerial vehicle comes within range of the node.
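The first baseline can be reproduced for a small number of nodes with a brute-force travelling salesman search, as in the sketch below; the node coordinates, maximum speed and per-node hover times are arbitrary placeholder values.

    import itertools
    import math

    def tsp_visit_order(start, end, nodes):
        """Brute-force shortest start -> all nodes -> end tour (feasible only for small N)."""
        best_order, best_len = None, math.inf
        for order in itertools.permutations(range(len(nodes))):
            pts = [start] + [nodes[i] for i in order] + [end]
            length = sum(math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1))
            if length < best_len:
                best_order, best_len = order, length
        return best_order, best_len

    # Illustrative use: the UAV flies the shortest visiting path and hovers above
    # each node until its data is collected (placeholder hover times).
    nodes = [(100, 50), (300, 200), (250, 400), (50, 350)]
    order, dist_m = tsp_visit_order((0, 0), (500, 500), nodes)
    v_max = 20.0                     # m/s, placeholder
    hover_s = [30, 45, 20, 60]       # seconds to drain each node, placeholder
    print(order, round(dist_m / v_max + sum(hover_s), 1))   # total collection time in seconds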
Compared with the prior art, the unmanned aerial vehicle assisted ground node data collection trajectory designed with the Actor-Critic algorithm markedly reduces the collection time while ensuring that the data to be transmitted by all nodes is fully collected and that the energy limit of each ground node is respected.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (6)

1. A method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning, characterized by comprising the following steps: inputting the start and end positions of the unmanned aerial vehicle, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of every ground node; taking into account the different amounts of data to be collected and the respective energy limits of the ground nodes, designing the data collection trajectory of the unmanned aerial vehicle with the goal of minimizing the data collection task completion time by adopting an Actor-Critic algorithm:
Step 1: divide the region to be simulated into a grid according to the step length, and define the state space S, the action space A and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic neural network, with the same structure as the Critic neural network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), i.e. the probability of selecting action a in state s; the target Actor neural network, with the same structure as the Actor neural network, has parameter θ⁻;
Step 3: randomly initialize the Critic neural network parameter ω and the Actor neural network parameter θ, and initialize the target Critic neural network parameter ω⁻ = ω and the target Actor neural network parameter θ⁻ = θ; set the experience replay pool size to D, used to store tuples ⟨s, a, r, s_{t+1}⟩, where s_{t+1} is the next state, and set the number of samples drawn in each update to B;
Step 4: set the episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s_1;
Step 5: within a single episode, increment t from 1 to the step limit T:
Step 6: select an action a_t ~ π_θ(a|s_t) according to the current Actor neural network policy, and obtain the immediate reward r_t and the next state s_{t+1};
Step 7: store the state transition record ⟨s_t, a_t, r_t, s_{t+1}⟩ in the experience replay pool;
Step 8: randomly select B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, representing the current state s_i, the action a_i performed, the immediate reward r_i and the next state s_{i+1};
Step 9: compute the Actor update target
    y_i = r_i + γ · Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1}))
where γ is the discount rate, π_{θ⁻}(s_{i+1}) is the action given by the policy of the current target Actor neural network with parameter θ⁻, and Q_{ω⁻} is the state-action value function of the current target Critic neural network with parameter ω⁻;
Step 10: update the Critic neural network parameter ω by minimizing the loss function
    L(ω) = (1/B) Σ_{i=1}^{B} (y_i - Q_ω(s_i, a_i))²;
Step 11: compute the policy gradient
    ∇_θ J(θ) ≈ (1/B) Σ_{i=1}^{B} ∇_θ log π_θ(a_i|s_i) · Q_ω(s_i, a_i)
and update the Actor neural network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic neural network parameter as ω⁻ ← τω + (1 - τ)ω⁻ and the target Actor neural network parameter as θ⁻ ← τθ + (1 - τ)θ⁻, where the update coefficient τ takes the value 0.01.
2. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 1, wherein the policy-based Actor neural network is used to select the action a(m) at each step m, the value-based Critic neural network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m), and the Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)).
3. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 2, wherein the Actor neural network and the Critic neural network are each composed of a multilayer feedforward neural network.
4. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 3, wherein the number of nodes in the last layer of the Actor corresponds to the number of actions, a softmax function converts the output into normalized selection probabilities over the actions, and the last layer of the Critic is a single node representing the estimated value of the input state.
5. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 4, wherein the Actor neural network receives the state vector and selects an action, and the Critic neural network receives the state vector and estimates the state value, the state value being the long-term cumulative reward under the current policy.
6. The method for obtaining the data collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 5, wherein, during training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
CN202110697404.0A 2021-06-23 2021-06-23 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning Active CN113377131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110697404.0A CN113377131B (en) 2021-06-23 2021-06-23 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning


Publications (2)

Publication Number Publication Date
CN113377131A true CN113377131A (en) 2021-09-10
CN113377131B CN113377131B (en) 2022-06-03

Family

ID=77578579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110697404.0A Active CN113377131B (en) 2021-06-23 2021-06-23 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning

Country Status (1)

Country Link
CN (1) CN113377131B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210074167A1 (en) * 2018-05-10 2021-03-11 Beijing Xiaomi Mobile Software Co., Ltd. Method and apparatus for reporting flight path information, and method and apparatus for determining information
US20200201316A1 (en) * 2018-12-21 2020-06-25 Airbus Defence and Space GmbH Method For Operating An Unmanned Aerial Vehicle As Well As An Unmanned Aerial Vehicle
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN110879610A (en) * 2019-10-24 2020-03-13 北京航空航天大学 Reinforced learning method for autonomous optimizing track planning of solar unmanned aerial vehicle
CN111260031A (en) * 2020-01-14 2020-06-09 西北工业大学 Unmanned aerial vehicle cluster target defense method based on deep reinforcement learning
CN112711271A (en) * 2020-12-16 2021-04-27 中山大学 Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
CN112902969A (en) * 2021-02-03 2021-06-04 重庆大学 Path planning method for unmanned aerial vehicle in data collection process

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113885566A (en) * 2021-10-21 2022-01-04 重庆邮电大学 V-shaped track planning method for minimizing data acquisition time of multiple unmanned aerial vehicles
CN113885566B (en) * 2021-10-21 2024-01-23 重庆邮电大学 V-shaped track planning method oriented to minimization of data acquisition time of multiple unmanned aerial vehicles
CN114025330A (en) * 2022-01-07 2022-02-08 北京航空航天大学 Air-ground cooperative self-organizing network data transmission method
CN114025330B (en) * 2022-01-07 2022-03-25 北京航空航天大学 Air-ground cooperative self-organizing network data transmission method
CN116760888A (en) * 2023-05-31 2023-09-15 中国科学院软件研究所 Intelligent organization and pushing method for data among multiple unmanned aerial vehicles

Also Published As

Publication number Publication date
CN113377131B (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant