CN113377131A - Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning - Google Patents
- Publication number
- CN113377131A CN113377131A CN202110697404.0A CN202110697404A CN113377131A CN 113377131 A CN113377131 A CN 113377131A CN 202110697404 A CN202110697404 A CN 202110697404A CN 113377131 A CN113377131 A CN 113377131A
- Authority
- CN
- China
- Prior art keywords
- neural network
- actor
- state
- unmanned aerial vehicle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/12—Target-seeking control
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a method for obtaining the data-collection trajectory of an unmanned aerial vehicle (UAV) using reinforcement learning. With the objective of minimizing the completion time of the data-collection task, the method fully accounts for the different amounts of data to be collected from, and the individual energy limits of, each ground node. As for the solution approach, the continuous-time UAV trajectory-design problem is converted into a discrete-time Markov decision process, and the optimal data-collection decision and motion decision of the UAV in each state are obtained with an Actor-Critic algorithm. The method can design the optimal data-collection trajectory of the UAV and significantly shorten the collection time while guaranteeing that the data to be transmitted by all ground nodes is fully collected and that each ground node's energy limit is respected.
Description
Technical Field
The invention belongs to the technical field of mobile communication, and particularly relates to a method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning.
Background
With the development of the Internet of Things (IoT) industry, data collection has become an important foundation for realizing IoT functions. Although many communication protocols and routing algorithms have been proposed to carry out data-collection tasks in the IoT and in wireless sensor networks, the mobility of sensor nodes and the impossibility of guaranteeing network connectivity in the event of natural disasters make it difficult for such protocols and algorithms to fulfil their intended functions.
Disclosure of Invention
The invention aims to provide a method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning, so as to solve the technical problem that, because of the mobility of sensor nodes and the impossibility of guaranteeing network connectivity when natural disasters occur, communication protocols and routing algorithms can hardly fulfil their intended functions.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
A method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning comprises the following steps. The inputs are the start and end positions of the UAV, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of each ground node. Taking into account the different amounts of data to be collected from the ground nodes and their respective energy limits, the Actor-Critic algorithm is adopted to design the UAV data-collection trajectory with the objective of minimizing the completion time of the data-collection task:
Step 1: divide the area to be simulated into a grid according to the step length, and define the state space S, the action space A, and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic network, with the same structure as the Critic network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), the probability of selecting action a in state s; the target Actor network, with the same structure as the Actor network, has parameter θ⁻;
Step 3: randomly initialize the Critic network parameter ω and the Actor network parameter θ; initialize the target Critic parameter ω⁻ = ω and the target Actor parameter θ⁻ = θ. Set the experience replay pool size to D, for storing tuples <s, a, r, s_{t+1}>, where s_{t+1} is the next state; the number of samples drawn per update is B;
Step 4: set the initial episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s₁:
Step 5: within a single episode, increment t from 1 to the limit T:
Step 6: select the action a_t = π_θ(a|s_t) according to the current Actor network policy, obtaining the immediate reward r_t and the next state s_{t+1};
Step 7: store the state-transition record <s_t, a_t, r_t, s_{t+1}> in the experience replay pool;
Step 8: randomly draw B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, denoting respectively the current state s_i, the action performed a_i, the immediate reward r_i, and the next state s_{i+1};
Step 9: compute the update target y_i = r_i + γ Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1})), where γ denotes the discount rate, π_{θ⁻} denotes the policy given by the current target Actor network parameter θ⁻, and Q_{ω⁻} denotes the state-action value function given by the current target Critic network parameter ω⁻;
Steps 10-11: update the Critic network parameter ω using the targets y_i (minimizing the mean-squared error between Q_ω(s_i, a_i) and y_i over the batch), and update the Actor network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic network parameter ω⁻ ← τω + (1 − τ)ω⁻ and the target Actor network parameter θ⁻ ← τθ + (1 − τ)θ⁻, where τ denotes the update coefficient, taken as 0.01.
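Steps 3-12 above can be sketched in Python. The following is a minimal illustrative sketch, not the patent's implementation: it replaces the neural networks with tabular parameters and the UAV environment with a toy one-dimensional chain, but keeps the replay pool, target copies, TD target, and soft updates of Steps 7-12. All names and hyperparameter values here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D chain environment standing in for the gridded collection area:
# states 0..4, action 0 = move left, action 1 = move right, reward at the end.
N_STATES, N_ACTIONS = 5, 2

def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def policy(th, s):
    z = th[s] - th[s].max()          # softmax over action logits
    p = np.exp(z)
    return p / p.sum()

# Steps 2-3: tabular Critic Q_w(s, a) and Actor logits theta, with target
# copies w_bar and theta_bar initialised to the online parameters.
w = np.zeros((N_STATES, N_ACTIONS)); w_bar = w.copy()
theta = np.zeros((N_STATES, N_ACTIONS)); theta_bar = theta.copy()
replay, D, B = [], 500, 16           # replay pool size D, batch size B
gamma, tau, lr = 0.95, 0.01, 0.05

for episode in range(200):           # Step 4: outer loop over episodes
    s = 0                            # initial state s1
    for t in range(20):              # Step 5: t = 1..T
        a = int(rng.choice(N_ACTIONS, p=policy(theta, s)))  # Step 6
        s2, r = env_step(s, a)
        replay.append((s, a, r, s2))                        # Step 7
        replay = replay[-D:]
        if len(replay) >= B:                                # Step 8
            for i in rng.integers(0, len(replay), B):
                si, ai, ri, si2 = replay[i]
                a2 = int(np.argmax(theta_bar[si2]))         # target Actor
                y = ri + gamma * w_bar[si2, a2]             # Step 9: target
                td = y - w[si, ai]
                w[si, ai] += lr * td                        # Step 10: Critic
                g = -policy(theta, si); g[ai] += 1.0        # grad log pi
                theta[si] += lr * td * g                    # Step 11: Actor
            w_bar = tau * w + (1 - tau) * w_bar             # Step 12: soft
            theta_bar = tau * theta + (1 - tau) * theta_bar # target updates
        s = s2
```

The soft-update coefficient τ = 0.01 matches Step 12; the rest of the constants are placeholders.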
Further, the policy-based Actor neural network is used to select the action a(m) at each step m, the value-based Critic neural network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m), and the Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)).
Further, the Actor neural network and the Critic neural network are both multilayer feedforward neural networks.
Furthermore, the number of nodes in the last layer of the Actor corresponds to the number of actions, and a softmax function converts the output into a normalized probability over actions; the last layer of the Critic is a single node representing the estimated value of the input state.
Further, the Actor neural network receives the state vector and selects an action, and the Critic neural network receives the state vector and estimates the state value, i.e., the long-term cumulative reward under the current policy.
Further, during training, the Critic network's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
The method for obtaining the UAV data-collection trajectory using reinforcement learning has the following advantages. With the objective of minimizing the completion time of the data-collection task, it fully accounts for the different amounts of data to be collected from, and the individual energy limits of, each ground node. As for the solution approach, the continuous-time UAV trajectory-design problem is converted into a discrete-time Markov decision process, and the optimal data-collection decision and motion decision of the UAV in each state are obtained with an Actor-Critic algorithm. The UAV-assisted ground-node data-collection trajectory designed by the algorithm can significantly reduce the collection time while guaranteeing that the data of all nodes is fully collected and that each ground node's energy limit is respected.
Detailed Description
To better explain the purpose, structure, and function of the present invention, the method for obtaining the UAV data-collection trajectory using reinforcement learning is described in further detail below.
Consider a wireless communication system in which a UAV collects data from N ground nodes (GUs), denoted by the set N = {1, 2, …, N}, during its flight. The UAV flies at a fixed altitude H from a starting point S ∈ ℝ^{2×1} to an end point E ∈ ℝ^{2×1}, where ℝ denotes the real numbers.
The horizontal coordinate of node n can be expressed as G_n ∈ ℝ^{2×1}, n ∈ N. The trajectory of the UAV over time is defined as:
U(t)∈R2×1,0≤t≤T;
T denotes the time required to complete the task. The start and end constraints U(0) and U(T) follow, i.e., the UAV flies from the starting point S to the end point E:
U(0)=S,U(T)=E
Denoting the maximum speed of the UAV during flight by v_max, the speed constraint can be expressed as:
‖U(t + Δ) − U(t)‖ ≤ v_max Δ, 0 ≤ t ≤ T − Δ
Here ‖·‖ denotes the Euclidean norm, Δ denotes an infinitesimal time interval, and ‖U(t + Δ) − U(t)‖ is the change in the UAV position within the infinitesimal time Δ. The system model for the UAV data-collection problem is described in detail as follows:
1. transmission model
Consider a delay-tolerant application scenario in which each ground node is equipped with an omnidirectional antenna and, at time t, transmits to the UAV with power P_{n,t} over bandwidth B. The amount of data to be transmitted by node n is denoted M_n, n ∈ N.
The transmission rate R_{n,t} from ground node n to the UAV can be expressed as:
R_{n,t} = B log₂(1 + γ_{n,t})
Here γ_{n,t} denotes the signal-to-noise ratio received at the UAV from ground node n at time t, computed as:
γ_{n,t} = P_{n,t} / (λ σ² L_{n,t})
where σ² denotes the power of the white Gaussian noise at the receiving UAV, λ (λ > 1) is the signal-to-noise gap between the actual modulation scheme and the theoretical Gaussian signal, and L_{n,t} denotes the average path loss from ground node n to the UAV at time t, whose specific formula is given in the channel-model section below. To avoid transmission interference between ground nodes, we assume that no two ground nodes transmit data to the UAV at the same time. Therefore, the transmission schedule of all ground nodes must also be considered when designing the UAV collection trajectory:
Σ_{n=1}^{N} C_n(t) ≤ 1, C_n(t) ∈ {0, 1}
Here C_n(t) = 1 indicates that the UAV is currently collecting data from ground node n, and at most one ground node transmits data to the UAV at any moment.
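The transmission model above can be sketched in a few lines of Python. This is an illustrative sketch: the function names are made up here, and the SNR expression γ = P/(λσ²L) follows the reconstruction above rather than a formula confirmed symbol-by-symbol in the original.

```python
import math

def snr(p_tx, path_loss, noise_var, gap):
    # gamma_{n,t} = P_{n,t} / (lambda * sigma^2 * L_{n,t})  (assumed form)
    return p_tx / (gap * noise_var * path_loss)

def rate(bandwidth_hz, gamma):
    # R_{n,t} = B * log2(1 + gamma_{n,t})
    return bandwidth_hz * math.log2(1.0 + gamma)

def schedule_ok(c):
    # scheduling constraint: at most one ground node transmits per slot
    return all(x in (0, 1) for x in c) and sum(c) <= 1
```

For example, with γ = 3 and B = 1 MHz the rate is B·log₂(4) = 2 Mbit/s, and `schedule_ok([1, 1, 0])` rejects a slot in which two nodes transmit simultaneously.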
2. Channel model
Since the overall data collection task time is relatively long compared to the channel coherence time, we focus on the average statistics of the channel states rather than on the instantaneous statistics, i.e., only the large scale path loss effects are considered in designing the channel gain expression.
The average path loss between ground node n and the UAV at position U(t) at time t can be expressed as:
L_{n,t} = P^{LoS}_{n,t} L^{LoS}_{n,t} + (1 − P^{LoS}_{n,t}) L^{NLoS}_{n,t}
L^{LoS}_{n,t} and L^{NLoS}_{n,t} are the average path losses from ground node n to the UAV at position U(t) in the line-of-sight (LoS) and non-line-of-sight (NLoS) scenarios, respectively, and can be expressed as:
L^{LoS}_{n,t} = 20 log₁₀(4π f_c d_{n,t} / c) + ξ_{LoS}
L^{NLoS}_{n,t} = 20 log₁₀(4π f_c d_{n,t} / c) + ξ_{NLoS}
The first term of each equation is the free-space propagation loss, where f_c denotes the carrier frequency and c the speed of light; ξ_{LoS} and ξ_{NLoS} are the average additional path losses over the free-space loss in the LoS and NLoS scenarios, respectively (ξ_{LoS} < ξ_{NLoS}). The distance d_{n,t} between ground node n and the UAV at time t can be expressed as:
d_{n,t} = (‖G_n − U(t)‖² + H²)^{1/2}
where G_n ∈ ℝ^{2×1} denotes the position of ground node n and H denotes the flight altitude of the UAV. The probability of a line-of-sight link between ground node n and the UAV can be expressed as a function of their elevation angle.
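The channel model can be sketched as follows; a minimal illustrative sketch assuming the probability-weighted LoS/NLoS path-loss form reconstructed above (function names and the speed-of-light constant are assumptions, not from the patent).

```python
import math

C_LIGHT = 3.0e8  # speed of light, m/s

def dist(g_n, u, h):
    # d_{n,t} = (||G_n - U(t)||^2 + H^2)^{1/2}
    return math.sqrt((g_n[0] - u[0]) ** 2 + (g_n[1] - u[1]) ** 2 + h ** 2)

def free_space_loss_db(d, f_c):
    # free-space propagation loss in dB: 20*log10(4*pi*f_c*d / c)
    return 20.0 * math.log10(4.0 * math.pi * f_c * d / C_LIGHT)

def avg_path_loss_db(d, f_c, p_los, xi_los, xi_nlos):
    # L = P_LoS*(FSPL + xi_LoS) + (1 - P_LoS)*(FSPL + xi_NLoS)
    fspl = free_space_loss_db(d, f_c)
    return p_los * (fspl + xi_los) + (1.0 - p_los) * (fspl + xi_nlos)
```

Since ξ_LoS < ξ_NLoS, a higher LoS probability yields a lower average path loss, which is what makes flying closer to (and more directly above) a node pay off.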
3. description of the problem
The invention addresses the trajectory-design problem of UAV-assisted data collection. The aim is to jointly optimize the UAV trajectory U, the transmission strategy C_n(t), 1 ≤ n ≤ N, of each ground node, and the transmit power P_n(t), 1 ≤ n ≤ N, of each ground node, so that, subject to the different amounts of data to be transmitted and the energy limits of the ground nodes, all the data is collected in the shortest possible time as the UAV flies from the start point to the end point. The joint optimization of trajectory, connection strategy, and transmit power to minimize the task completion time can be expressed as:
min_{U, C, P} T
s.t. (1) U(0) = S
(2) U(T) = E
(3) Σ_{n=1}^{N} C_n(t) ≤ 1
(4) R_{n,t} = B log₂(1 + γ_{n,t})
(6) ∫₀ᵀ C_n(t) R_{n,t} dt ≥ M_n, n ∈ N
(7) ∫₀ᵀ C_n(t) P_n(t) dt ≤ E_n, n ∈ N
(8) C_n(t) ∈ {0, 1}, n ∈ N
(9) ‖U(t + Δ) − U(t)‖ ≤ v_max Δ
Here P_{n,t} denotes the transmit power of ground node n at time t, and R_{n,t} its transmission rate at time t; L_{n,t} denotes the average path loss between ground node n and the UAV at position U(t) at time t. Constraints (1) and (2) are the UAV start and end limits; (3) and (8) describe the ground-node transmission strategy, i.e., no two nodes transmit to the UAV simultaneously, so as to avoid interference; (6) requires the UAV to stay connected to each ground node long enough for its data to be fully collected; (7) is each ground node's own energy limit; (9) is the UAV maximum-speed limit.
Next, the state space, the action space, and the reward function are defined. Under the reinforcement-learning framework, the UAV acts as the agent and learns the optimal control policy according to the reinforcement-learning algorithm: at each interval it receives an observation and a reward from the environment and performs an action on it. A typical Markov decision process can be expressed as the tuple (S, A, P, R), with the components as follows:
(1) state space:
The projection of the UAV position on the ground at the end of the m-th time slot can be represented as: s_u[m] = [x(m), y(m)] ∈ L = {Ω₁, Ω₂, …, Ω_I}
The state of ground node n at the end of time slot m can be represented as:
s_n(m) = [M_n(m), E_n(m)]
where M_n(m) denotes the amount of data remaining at node n at the end of time slot m and E_n(m) denotes its remaining energy. Overall, the system state can be written as s(m) = [s_u(m), s_1(m), …, s_N(m), sim_t], where sim_t records the current flight time of the UAV.
(2) An action space:
The action in slot m can be expressed as: a(m) = [a_f(m), π(m), P_1(m), …, P_N(m)],
where a_f(m) = [v_m, φ_m] indicates the direction of UAV motion; π(m) ∈ {0, 1, …, N} is the ground-node connection strategy describing C_{π(m)}(t) = 1, i.e., node π(m) transmits data to the UAV; and P_1(m), …, P_N(m) are the transmit powers of the ground nodes in time slot m.
(3) Status update procedure
The status update includes the drone location and the remaining data and power of each ground node, here, since the simulation area is scaled by step x heres=ysDividing by 10, the indication of the direction of motion of the drone can be expressed as:
in this connection, it is possible to use,i.e. in each state the drone may choose to hover or move to one of the 8 mesh points that are adjacent. Therefore, the updating of the system state comprises the position of the unmanned aerial vehicle and the updating of the residual electric quantity and the data quantity of each ground node according to the transmission strategy pi (m). Can be expressed as:
x(m) = x(m−1) + v_m cos φ_m
y(m) = y(m−1) + v_m sin φ_m
M_{π(m)}(m) = M_{π(m)}(m−1) − min{R_{π(m)}, M_{π(m)}(m−1)}
E_n(m) = E_n(m−1) − P_n(m), n ∈ {1, 2, …, N}
sim_t = sim_t + 1
The first two equations describe the change in the UAV position coordinates, the x-coordinate x(m) and the y-coordinate y(m). π(m) indicates which ground node is currently uploading data (i.e., C_{π(m)}(m) = 1: node π(m) uploads data in slot m), and R_{π(m)} is its transmission rate. The third equation updates that node's remaining data: when the remaining data M_{π(m)}(m−1) of the ground node is less than the transmission rate R_{π(m)}, its remaining data is set to 0. The fourth equation describes the remaining energy of each ground node, and the last updates the flight time of the UAV.
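The state-update equations can be sketched as one pure Python function. This is an illustrative sketch; the dictionary layout, key names, and the convention π(m) = 0 for "no node transmitting" are assumptions, not part of the patent.

```python
import math

def update_state(state, action, rates):
    """One slot of the state transition; a sketch on plain dicts."""
    x, y = state["pos"]
    v, phi = action["vf"]                       # a_f(m) = [v_m, phi_m]
    x, y = x + v * math.cos(phi), y + v * math.sin(phi)
    data = list(state["data"])
    energy = list(state["energy"])
    k = action["node"]                          # pi(m); 0 means "no node"
    if k > 0:
        i = k - 1
        data[i] -= min(rates[i], data[i])       # data collected this slot
    for i, p in enumerate(action["power"]):     # P_1(m)..P_N(m)
        energy[i] -= p                          # remaining energy update
    return {"pos": (x, y), "data": data,
            "energy": energy, "sim_t": state["sim_t"] + 1}

s0 = {"pos": (0.0, 0.0), "data": [5.0, 2.0],
      "energy": [10.0, 10.0], "sim_t": 0}
a = {"vf": (1.0, 0.0), "node": 1, "power": [1.0, 0.0]}
s1 = update_state(s0, a, rates=[3.0, 3.0])
```

Note the `min(rates[i], data[i])` term: it both drains the node's buffer and caps the drain at what remains, matching the third update equation.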
(4) Reward function
In the reinforcement-learning process, the UAV takes an action a in time slot m and obtains a reward, and the value estimate of that action is updated according to the reward it produces. The reward function r_m consists of the following parts:
r_m = r_data − G × r_p + r_end
First, the amount of data collected in time slot m is computed: r_data = min{R_{π(m)}, M_{π(m)}(m−1)}, the minimum of the transmission rate between the current transmitting node π(m) and the UAV and that node's remaining data to be transmitted. It is assumed that once the UAV starts collecting data from ground node n, all the data stored in the sensor will be acquired. Second, the constraint-penalty factor r_p is computed: when any constraint is violated — ineffective ground-node power consumption, a ground node exhausting its energy while data remains to be transmitted, or the UAV moving out of bounds (the UAV may only move within the fixed simulation region) — the indicator function G takes the value 1, otherwise 0. The ineffective power-consumption condition refers to the occurrence of Σ_{n=1}^{N} P_n(m) > P_{π(m)}, where P_{π(m)} denotes the power consumed by the current transmitting node π(m) and Σ_{n=1}^{N} P_n(m) the total power consumed by all nodes. The energy-exhausted condition refers to node n having remaining energy E_n(m) ≤ 0 while its remaining data to be transmitted satisfies M_n(m) > 0. Finally, to encourage the UAV to recognize and move toward the destination as early as possible during learning, the flight time sim_t is taken into account in the reward r_end = sim_t × r_e obtained for collecting all data and reaching the end point, where r_e denotes a large reward factor.
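The three reward components combine as follows; a minimal sketch in which the penalty and end-reward magnitudes `r_p` and `r_e` are illustrative parameters, not values given in the patent.

```python
def reward(rate_k, data_k_prev, violated, reached_end, sim_t, r_p, r_e):
    # r_m = r_data - G * r_p + r_end
    r_data = min(rate_k, data_k_prev)          # data collected this slot
    g = 1.0 if violated else 0.0               # indicator function G
    r_end = sim_t * r_e if reached_end else 0.0  # terminal reward
    return r_data - g * r_p + r_end
```

For example, an ordinary slot collecting 3 units yields r_m = 3, while a slot that empties a node's 2 remaining units but violates a constraint yields r_m = 2 − r_p.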
In the invention, the transition probabilities between states in the discrete-time Markov decision problem are unknown; moreover, since both the state space and the action space of the problem are large, traditional methods for solving Markov decision problems, such as value iteration and policy iteration, are unsuitable for the present problem model. We therefore use the Actor-Critic algorithm from deep reinforcement learning (DRL) to solve our problem. It uses two networks to find the optimal policy of the Markov decision process. The policy-based Actor network is used to select the action a(m) at each step m, and the value-based Critic network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m). The Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)). The Actor and Critic neural networks are both multilayer feedforward neural networks. The number of nodes in the last layer of the Actor corresponds to the number of actions, and a softmax function converts the output into a normalized probability over actions; the last layer of the Critic is a single node representing the estimated value of the input state. The Actor and Critic neural network models are illustrated in the accompanying figure. The Actor network receives the state vector and selects an action; the Critic network likewise receives the state vector and estimates the state value (the long-term cumulative reward under the current policy). During training, the Critic's estimate of the state value is used to update the Actor's action-selection policy by means of temporal-difference learning.
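The two network shapes described above (Actor: softmax head with one node per action; Critic: single value node) can be sketched with plain NumPy forward passes. Layer widths, the tanh activation, and the state dimension are assumptions for illustration; the patent does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(sizes):
    # multilayer feedforward network: list of (weights, bias) per layer
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)                  # hidden-layer activation
    return x

def actor_probs(actor, state):
    z = forward(actor, state)
    z = np.exp(z - z.max())
    return z / z.sum()                      # softmax: normalized action probs

STATE_DIM, N_ACTIONS = 6, 9                 # e.g. hover + 8 grid moves
actor = mlp([STATE_DIM, 32, N_ACTIONS])     # last layer: one node per action
critic = mlp([STATE_DIM, 32, 1])            # last layer: single value node

s = rng.normal(size=STATE_DIM)
p = actor_probs(actor, s)                   # action distribution
v = forward(critic, s)[0]                   # estimated state value
```

Both networks consume the same state vector; only their heads differ, which is exactly the split between action selection and state evaluation described in the text.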
The invention discloses a method for obtaining the data-collection trajectory of a UAV using reinforcement learning, comprising the following steps. The inputs are the start and end positions of the UAV, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of each node. Fully taking into account the different amounts of data to be collected from the ground nodes and their respective energy limits, the Actor-Critic algorithm is adopted to design the UAV's ground-node data-collection trajectory with the objective of minimizing the completion time of the data-collection task:
Step 1: divide the area to be simulated into a grid according to the step length, and define the state space S, the action space A, and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic network, with the same structure as the Critic network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), the probability of selecting action a in state s; the target Actor network, with the same structure as the Actor network, has parameter θ⁻;
Step 3: randomly initialize the Critic network parameter ω and the Actor network parameter θ; initialize the target Critic parameter ω⁻ = ω and the target Actor parameter θ⁻ = θ. Set the experience replay pool size to D, for storing tuples <s, a, r, s_{t+1}>; the number of samples drawn per update is B;
Step 4: set the initial episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s₁:
Step 5: within a single episode, increment t from 1 to the limit T:
Step 6: select the action a_t = π_θ(a|s_t) according to the current Actor network policy, obtaining the immediate reward r_t and the next state s_{t+1};
Step 7: store the state-transition record <s_t, a_t, r_t, s_{t+1}> in the experience replay pool;
Step 8: randomly draw B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, denoting respectively the current state s_i, the action performed a_i, the immediate reward r_i, and the next state s_{i+1};
Step 9: compute the update target y_i = r_i + γ Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1})), where γ denotes the discount rate, π_{θ⁻} denotes the policy given by the current target Actor network parameter θ⁻, and Q_{ω⁻} denotes the state-action value function given by the current target Critic network parameter ω⁻;
Steps 10-11: update the Critic network parameter ω using the targets y_i (minimizing the mean-squared error between Q_ω(s_i, a_i) and y_i over the batch), and update the Actor network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic network parameter ω⁻ ← τω + (1 − τ)ω⁻ and the target Actor network parameter θ⁻ ← τθ + (1 − τ)θ⁻, where τ denotes the update coefficient, taken as 0.01.
For performance comparison, the UAV data-collection trajectory obtained by the Actor-Critic algorithm is compared with the following UAV flight schemes:
1. Traveling salesman problem (TSP): the UAV collects data only while hovering directly above a ground node, and the shortest path visiting the ground nodes is determined by solving a traveling salesman problem;
2. TSP with optimized collection strategy and ground-node transmit power: on top of the UAV data-collection trajectory obtained from the traveling salesman problem, the collection strategy and the ground-node transmit powers are optimized. Considering uniform UAV motion during collection, a dynamic-programming algorithm optimizes, for each ground node, the position at which data collection starts, the position at which it ends, the node's transmit power during collection, and the UAV speed during collection;
3. Best set of ordered waypoints: given the same start and end points and the same goal of minimizing the time to collect all ground-node data, the ground nodes are assumed to transmit to the UAV at a fixed transmit power P_t, and data is collected at a constant rate R whenever the UAV is within a node's communication range.
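The TSP-style baselines (schemes 1 and 3) amount to choosing a visiting order of the ground nodes between fixed start and end points. A minimal brute-force sketch for a handful of nodes, not the patent's baseline implementation, might look like:

```python
import math
from itertools import permutations

def path_len(order, start, end, pts):
    # total Euclidean length of start -> pts[order...] -> end
    seq = [start] + [pts[i] for i in order] + [end]
    return sum(math.dist(a, b) for a, b in zip(seq, seq[1:]))

def best_visit_order(start, end, pts):
    # exhaustive search over visiting orders (fine for a handful of nodes;
    # a real TSP solver would be needed for many ground nodes)
    return min(permutations(range(len(pts))),
               key=lambda o: path_len(o, start, end, pts))

pts = [(9, 0), (1, 0), (5, 0)]
order = best_visit_order((0, 0), (10, 0), pts)
```

With the three collinear nodes above, the best order visits them left to right, giving a total path length of 10.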
Compared with the prior art, the UAV-assisted ground-node data-collection trajectory designed with the Actor-Critic algorithm can significantly reduce the collection time while guaranteeing that the data of all nodes is fully collected and that each ground node's energy limit is respected.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.
Claims (6)
1. A method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning, characterized by comprising the following steps. The inputs are the start and end positions of the UAV, the positions of all ground nodes, and the amount of data to be transmitted and the energy limit of each ground node. Taking into account the differences in the amounts of data to be collected from the ground nodes and their respective energy limits, the Actor-Critic algorithm is adopted to design the UAV data-collection trajectory with the objective of minimizing the completion time of the data-collection task:
Step 1: divide the area to be simulated into a grid according to the step length, and define the state space S, the action space A, and the immediate reward r;
Step 2: use a Critic neural network with parameter ω to represent the state-action value function Q_ω(s, a); the target Critic network, with the same structure as the Critic network, has parameter ω⁻. Use an Actor neural network with parameter θ to represent the policy π_θ(a|s), the probability of selecting action a in state s; the target Actor network, with the same structure as the Actor network, has parameter θ⁻;
Step 3: randomly initialize the Critic network parameter ω and the Actor network parameter θ; initialize the target Critic parameter ω⁻ = ω and the target Actor parameter θ⁻ = θ. Set the experience replay pool size to D, for storing tuples <s, a, r, s_{t+1}>, where s_{t+1} is the next state; the number of samples drawn per update is B;
Step 4: set the initial episode index to 1 and enter the outer loop, incrementing the index until the maximum number of episodes M is reached; initialize the state to the initial state s₁:
Step 5: within a single episode, increment t from 1 to the limit T:
Step 6: select the action a_t = π_θ(a|s_t) according to the current Actor network policy, obtaining the immediate reward r_t and the next state s_{t+1};
Step 7: store the state-transition record <s_t, a_t, r_t, s_{t+1}> in the experience replay pool;
Step 8: randomly draw B records (s_i, a_i, r_i, s_{i+1}) from the experience replay pool, denoting respectively the current state s_i, the action performed a_i, the immediate reward r_i, and the next state s_{i+1};
Step 9: compute the update target y_i = r_i + γ Q_{ω⁻}(s_{i+1}, π_{θ⁻}(s_{i+1})), where γ denotes the discount rate, π_{θ⁻} denotes the policy given by the current target Actor network parameter θ⁻, and Q_{ω⁻} denotes the state-action value function given by the current target Critic network parameter ω⁻;
Steps 10-11: update the Critic network parameter ω using the targets y_i (minimizing the mean-squared error between Q_ω(s_i, a_i) and y_i over the batch), and update the Actor network parameter θ by stochastic gradient descent;
Step 12: at fixed intervals, update the target Critic network parameter ω⁻ ← τω + (1 − τ)ω⁻ and the target Actor network parameter θ⁻ ← τθ + (1 − τ)θ⁻, where τ denotes the update coefficient, taken as 0.01.
2. The method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 1, wherein the policy-based Actor neural network is used to select the action a(m) at each step m, the value-based Critic neural network is used to evaluate the value function V(s(m)) of performing action a(m) in state s(m), and the Actor continuously adjusts and optimizes the policy π(a(m)|s(m)) according to V(s(m)).
3. The method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 2, wherein the Actor neural network and the Critic neural network are both multilayer feedforward neural networks.
4. The method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 3, wherein the number of nodes in the last layer of the Actor corresponds to the number of actions, a softmax function converts the output into a normalized probability over actions, and the last layer of the Critic is a single node representing the estimated value of the input state.
5. The method for obtaining the data-collection trajectory of an unmanned aerial vehicle using reinforcement learning according to claim 4, wherein the Actor neural network receives the state vector and selects an action, and the Critic neural network receives the state vector and estimates the state value, the state value being the long-term cumulative reward under the current policy.
6. The method for obtaining the trajectory of the data collected by the UAV using reinforcement learning of claim 5, wherein during training, the Critic neural network's estimate of the state value is used to update the Actor's action-selection policy in a temporal-difference manner.
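The temporal-difference mechanism of claim 6 can be sketched as a one-step TD error; the function signature and learning-rate symbols in the comment are illustrative assumptions:

```python
def td_error(reward, v_s, v_s_next, gamma=0.99, done=False):
    """One-step temporal-difference error: delta = r + gamma*V(s') - V(s).

    In an Actor-Critic update, this delta drives both networks, e.g.
    theta <- theta + alpha_actor  * delta * grad log pi(a|s)   (Actor)
    omega <- omega + alpha_critic * delta * grad V(s)          (Critic)
    where alpha_actor, alpha_critic are step sizes.
    """
    target = reward + (0.0 if done else gamma * v_s_next)
    return target - v_s

print(td_error(1.0, v_s=0.5, v_s_next=0.7, gamma=0.9))  # 1.0 + 0.63 - 0.5 = 1.13
```

A positive delta indicates the taken action led to a better-than-expected state, so the Actor increases that action's probability; a negative delta decreases it.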
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110697404.0A CN113377131B (en) | 2021-06-23 | 2021-06-23 | Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113377131A true CN113377131A (en) | 2021-09-10 |
CN113377131B CN113377131B (en) | 2022-06-03 |
Family
ID=77578579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110697404.0A Active CN113377131B (en) | 2021-06-23 | 2021-06-23 | Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113377131B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110488861A (en) * | 2019-07-30 | 2019-11-22 | 北京邮电大学 | Unmanned plane track optimizing method, device and unmanned plane based on deeply study |
CN110879610A (en) * | 2019-10-24 | 2020-03-13 | 北京航空航天大学 | Reinforced learning method for autonomous optimizing track planning of solar unmanned aerial vehicle |
CN111260031A (en) * | 2020-01-14 | 2020-06-09 | 西北工业大学 | Unmanned aerial vehicle cluster target defense method based on deep reinforcement learning |
US20200201316A1 (en) * | 2018-12-21 | 2020-06-25 | Airbus Defence and Space GmbH | Method For Operating An Unmanned Aerial Vehicle As Well As An Unmanned Aerial Vehicle |
US20210074167A1 (en) * | 2018-05-10 | 2021-03-11 | Beijing Xiaomi Mobile Software Co., Ltd. | Method and apparatus for reporting flight path information, and method and apparatus for determining information |
CN112711271A (en) * | 2020-12-16 | 2021-04-27 | 中山大学 | Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning |
CN112902969A (en) * | 2021-02-03 | 2021-06-04 | 重庆大学 | Path planning method for unmanned aerial vehicle in data collection process |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113885566A (en) * | 2021-10-21 | 2022-01-04 | 重庆邮电大学 | V-shaped track planning method for minimizing data acquisition time of multiple unmanned aerial vehicles |
CN113885566B (en) * | 2021-10-21 | 2024-01-23 | 重庆邮电大学 | V-shaped track planning method oriented to minimization of data acquisition time of multiple unmanned aerial vehicles |
CN114025330A (en) * | 2022-01-07 | 2022-02-08 | 北京航空航天大学 | Air-ground cooperative self-organizing network data transmission method |
CN114025330B (en) * | 2022-01-07 | 2022-03-25 | 北京航空航天大学 | Air-ground cooperative self-organizing network data transmission method |
CN116760888A (en) * | 2023-05-31 | 2023-09-15 | 中国科学院软件研究所 | Intelligent organization and pushing method for data among multiple unmanned aerial vehicles |
Also Published As
Publication number | Publication date |
---|---|
CN113377131B (en) | 2022-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113377131B (en) | Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning | |
CN110488861B (en) | Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle | |
Bayerlein et al. | Trajectory optimization for autonomous flying base station via reinforcement learning | |
CN113395654A (en) | Method for task unloading and resource allocation of multiple unmanned aerial vehicles of edge computing system | |
CN112511250A (en) | DRL-based multi-unmanned aerial vehicle air base station dynamic deployment method and system | |
US20230297859A1 (en) | Method and apparatus for generating multi-drone network cooperative operation plan based on reinforcement learning | |
CN115696211A (en) | Unmanned aerial vehicle track self-adaptive optimization method based on information age | |
CN114169234A (en) | Scheduling optimization method and system for unmanned aerial vehicle-assisted mobile edge calculation | |
CN112671451A (en) | Unmanned aerial vehicle data collection method and device, electronic device and storage medium | |
CN113507717A (en) | Unmanned aerial vehicle track optimization method and system based on vehicle track prediction | |
CN116700343A (en) | Unmanned aerial vehicle path planning method, unmanned aerial vehicle path planning equipment and storage medium | |
CN113255218A (en) | Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network | |
CN113382060B (en) | Unmanned aerial vehicle track optimization method and system in Internet of things data collection | |
CN114268986A (en) | Unmanned aerial vehicle computing unloading and charging service efficiency optimization method | |
CN114339842A (en) | Method and device for designing dynamic trajectory of unmanned aerial vehicle cluster under time-varying scene based on deep reinforcement learning | |
CN114548663A (en) | Scheduling method for charging unmanned aerial vehicle to charge task unmanned aerial vehicle in air | |
CN113554680A (en) | Target tracking method and device, unmanned aerial vehicle and storage medium | |
CN117236561A (en) | SAC-based multi-unmanned aerial vehicle auxiliary mobile edge computing method, device and storage medium | |
CN116227767A (en) | Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning | |
CN114879726A (en) | Path planning method based on multi-unmanned-aerial-vehicle auxiliary data collection | |
CN114727323A (en) | Unmanned aerial vehicle base station control method and device and model training method and device | |
CN114327876A (en) | Task unloading method and device for unmanned aerial vehicle-assisted mobile edge computing | |
CN116484732A (en) | Unmanned aerial vehicle energized digital twin model construction method | |
CN112383893B (en) | Time-sharing-based wireless power transmission method for chargeable sensing network | |
CN117135376A (en) | Multi-unmanned aerial vehicle video transmission method based on prediction and target tracking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||