CN117062182A

CN117062182A - DRL-based wireless chargeable unmanned aerial vehicle data uploading path optimization method

Info

Publication number: CN117062182A
Application number: CN202311129223.3A
Authority: CN
Inventors: 贾兆红; 张博文; 王辛迪
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2023-09-04
Filing date: 2023-09-04
Publication date: 2023-11-14

Abstract

The invention relates to a DRL-based data uploading path optimization method for a wirelessly rechargeable unmanned aerial vehicle (UAV), comprising the following steps: constructing an Internet of Things (IoT) communication and wireless charging scene system; establishing a first mathematical model for the on-board battery consumption of the task machine; establishing a second mathematical model for the data uploading channel; establishing an energy replenishment model, namely a third mathematical model, for the wireless charging process; establishing a fourth mathematical model for the path optimization objective; determining a state set S, an action set A and a reward function $r^t$; and performing offline learning with an improved Double DQN algorithm to obtain an optimal path strategy $\pi^*$. The invention provides a more convenient charging service for the task machine that assists the base station in data uploading within the IoT communication system. Under demanding real-time data requirements, this charging method markedly improves the endurance of the task machine and achieves high data uploading efficiency in the UAV-assisted IoT communication system.

Description

DRL-based wireless chargeable unmanned aerial vehicle data uploading path optimization method

Technical Field

The invention relates to the technical fields of artificial intelligence, power and communication, in particular to a wireless chargeable unmanned aerial vehicle data uploading path optimization method based on DRL.

Background

Owing to their highly flexible deployment, unmanned aerial vehicles (UAVs) have in recent years been widely applied in emerging fields such as the Internet of Things (IoT) to compensate for the shortcomings of existing base-station communication. A UAV can act as a mobile communication access point that connects ground users and provides emergency data services, thereby ensuring the quality of communication service for mobile users in large-scale scenes. However, because of the energy constraint of the on-board battery, UAVs face a serious energy shortage while providing these services.

The inefficiency caused by this on-board energy constraint shows up as follows: as time passes and tasks proceed, the energy of an individual UAV is continuously consumed, so that it must return to a charging station during task execution and cannot meet highly real-time data uploading demands. With the emergence of new energy supply technologies, battery replenishment has advanced significantly. In particular, wireless power transfer decouples the location of the energy source from the sensing location and transfers energy from energy-rich regions to energy-poor regions, allowing a UAV to harvest energy efficiently while it performs data transmission tasks.

A UAV carrying a high-gain radio-frequency antenna can serve as a mobile charger that provides charging service to the UAV performing the data uploading task, which makes it an effective energy solution. Because communication in a task-machine-assisted IoT system involves both user data uploading and on-demand charging, an effective UAV path planning strategy is needed to ensure that an individual UAV executes its data uploading task efficiently.

Researchers have studied the optimization of UAV data uploading paths, for example by using ant colony algorithms, genetic algorithms and reinforcement learning algorithms to solve for optimized paths. However, most of this work optimizes paths only for flight tasks powered by the UAV's on-board energy; it focuses on the energy utilization of a single flight without charging and neglects the use of energy replenishment technology to further improve task execution efficiency. A data uploading path optimization method for a wirelessly rechargeable UAV is therefore urgently needed, one that meets the real-time data uploading requirements of IoT devices while using wireless energy replenishment to extend the endurance of a single flight. Such a method has research significance and application value for optimizing the flight path of the task machine.

Disclosure of Invention

The invention aims to solve the inefficiency caused by UAV energy shortage in the existing task scenario in which a task UAV assists IoT system communication. It provides a DRL-based data uploading path optimization method for a wirelessly rechargeable UAV that markedly improves the endurance of the task UAV and achieves high data uploading efficiency in the UAV-assisted IoT communication system under demanding real-time data requirements.

In order to achieve the above purpose, the present invention adopts the following technical scheme: a method for optimizing a data uploading path of a wireless chargeable unmanned aerial vehicle based on a DRL, comprising the following sequential steps:

(1) Constructing an Internet of Things (IoT) communication and wireless charging scene system, which comprises a task machine, a mobile charger and M mobile IoT devices;

(2) Establishing a first mathematical model for the on-board battery consumption of the task machine;

(3) Establishing a second mathematical model for the data uploading channel;

(4) Establishing an energy replenishment model, namely a third mathematical model, for the wireless charging process;

(5) Establishing a fourth mathematical model for the path optimization objective according to the IoT communication and wireless charging scene system, the first mathematical model, the second mathematical model and the third mathematical model;

(6) Determining a state set S, an action set A and a reward function $r^t$ according to the IoT communication and wireless charging scene system and the first, second, third and fourth mathematical models;

(7) Performing offline learning with an improved Double DQN algorithm according to the state set S, the action set A and the reward function $r^t$ to obtain an optimal path strategy $\pi^*$.

Step (1) specifically means the following. The IoT communication and wireless charging scene system is referred to as the system. The system space is divided into N×N grids, where each grid is a square cell of side length c and N is the number of grids in both the horizontal and vertical directions. A task machine is deployed in the scene to perform the uploading task, a mobile charger provides wireless charging service, and each of the M mobile IoT devices holds a data volume $d_m$ waiting to be uploaded;

The current time slot of the system is denoted $t$, $t \in \{0, \tau, 2\tau, \dots, T\}$, where $\tau$ is the length of a single time slot and $T$ is the time of the system's terminal state. The position of mobile IoT device $m$ at time slot $t$ is $(x_m^t, y_m^t, h_m)$, where $m \in \{1, 2, \dots, M\}$, $x_m^t$ is the abscissa of device $m$ in the grid space, $y_m^t$ is its ordinate in the grid space, and $h_m$ is its fixed height. The amount of data that device $m$ still has to upload at time slot $t$ is denoted $d_m^t$.

The mobile charger, acting as the energy supply end, leaves its stop point when the task begins and moves along a pre-deployed flight path at a pre-deployed flight speed $v_k$ while providing energy to the task machine. Its real-time position is $(x_k^t, y_k^t, h_k)$, where $x_k^t$ is the abscissa of the mobile charger in the grid space, $y_k^t$ is its ordinate in the grid space, and $h_k$ is its fixed flight height. When the task begins, the task machine leaves its stop point and flies at a constant speed $v_u$. Its real-time position is $(x_u^t, y_u^t, h_u)$, where $x_u^t$ is the abscissa of the task machine in the grid space, $y_u^t$ is its ordinate in the grid space, and $h_u$ is its fixed flight height.

Step (2) specifically means the following. The maximum battery capacity of the task machine is $b_{max}$, and $b_u^t$ denotes its remaining energy at time slot $t$, with $b_u^t \in [0, b_{max}]$. The task machine consumes a constant energy $e_u$ each time it executes a flight action. Without considering charging, the battery level of the task machine at time slot $t+1$ is modeled as follows and recorded as the first mathematical model:

$$b_u^{t+1} = b_u^t - e_u$$

where $b_u^{t+1}$ denotes the remaining energy at time slot $t+1$.
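The battery bookkeeping described in step (2) can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name, the clamping to the range [0, b_max] and the optional charging term are assumptions added here, and the 500 J per-action cost in the usage note is a hypothetical value (the 150 kJ capacity is taken from the embodiment).

```python
def next_battery_level(b_t: float, e_move: float,
                       e_charge: float = 0.0, b_max: float = 150e3) -> float:
    """Return the battery level (joules) at the next time slot.

    b_t      -- remaining energy at the current slot
    e_move   -- constant energy consumed by one flight action
    e_charge -- energy received wirelessly in this slot (0 reproduces the
                charging-free first model)
    b_max    -- battery capacity; clamping to [0, b_max] is an assumption
    """
    return max(0.0, min(b_max, b_t - e_move + e_charge))
```

With `e_charge` left at zero the function reduces to subtracting the per-action cost, for example `next_battery_level(150e3, 500.0)` gives `149500.0`.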

Step (3) specifically means the following. The task machine establishes a communication connection with a mobile IoT device and starts the data upload. The task machine flies high enough that line-of-sight wireless transmission is guaranteed between the mobile IoT device and the task machine, so the channel gain between them is:

$$g_m^t = \frac{\rho_0}{\left(d_{m,u}^t\right)^2}$$

where $\rho_0$ is the channel gain at the reference distance of 1 m and $d_{m,u}^t$ is the Euclidean distance between the task machine and mobile IoT device $m$ at time slot $t$:

$$d_{m,u}^t = \sqrt{\left(x_u^t - x_m^t\right)^2 + \left(y_u^t - y_m^t\right)^2 + h_u^2}$$

where $x_u^t - x_m^t$ is the distance between the task machine and mobile IoT device $m$ along the abscissa, $y_u^t - y_m^t$ is their distance along the ordinate, and $h_u$ is the fixed flight height of the task machine. The data transmission at time slot $t$ is modeled as follows and recorded as the second mathematical model:

$$R_m^t = W \log_2\left(1 + \frac{P_{IoT}\, g_m^t}{\sigma^2}\right)$$

where $R_m^t$ is the transmission rate of the data link established between the task machine and mobile IoT device $m$ at time slot $t$, $W$ is the signal bandwidth, $P_{IoT}$ is the transmit power of the IoT device, and $\sigma^2$ is the noise power.
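To make the channel model of step (3) concrete, the following sketch evaluates a line-of-sight channel gain and a Shannon-type upload rate. It assumes an inverse-square gain law and a distance that combines the horizontal offsets with the task machine's flight height; the function names are hypothetical. The default values are the ones listed in the embodiment (reference gain -50 dB, bandwidth 1 MHz, transmit power 0.1 W, noise -100 dB, height 41 m), and treating the noise figure as a power relative to 1 W is an assumption.

```python
import math


def db_to_linear(x_db: float) -> float:
    """Convert a decibel value to a linear power ratio."""
    return 10 ** (x_db / 10)


def channel_gain(dx: float, dy: float, h_u: float, rho0_db: float = -50.0) -> float:
    """Line-of-sight gain: reference gain divided by squared distance (assumed form)."""
    d_squared = dx * dx + dy * dy + h_u * h_u
    return db_to_linear(rho0_db) / d_squared


def upload_rate(dx: float, dy: float, h_u: float = 41.0,
                bandwidth_hz: float = 1e6, p_iot_w: float = 0.1,
                noise_db: float = -100.0, rho0_db: float = -50.0) -> float:
    """Shannon rate in bit/s for a device offset (dx, dy) metres from the task machine."""
    snr = p_iot_w * channel_gain(dx, dy, h_u, rho0_db) / db_to_linear(noise_db)
    return bandwidth_hz * math.log2(1 + snr)
```

Under these assumptions a device directly below the task machine gets roughly 2.8 Mbit/s, and the rate falls as the horizontal offset grows.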

Step (4) specifically means the following. The mobile charger carries a high-gain radio-frequency antenna for transmitting wireless power and supplies energy to the task machine while following a fixed, pre-deployed trajectory. During task execution the distance between the task machine and the mobile charger differs from time slot to time slot, and the wireless charging power drops sharply as that distance increases. Assuming lossless energy conversion in the wireless transmission link, the power $P_r^t$ received at the task machine is calculated with the Friis free-space propagation model:

$$P_r^t = P_t\, G_t\, G_r \left(\frac{\lambda}{4\pi\, d_{k,u}^t}\right)^2$$

where $P_t$ is the power of the transmitting end, $G_t$ and $G_r$ are the antenna gains of the transmitting and receiving ends, $\lambda$ is the transmission wavelength, and $d_{k,u}^t$ is the Euclidean distance between the transmitting end and the receiving end:

$$d_{k,u}^t = \sqrt{\left(x_k^t - x_u^t\right)^2 + \left(y_k^t - y_u^t\right)^2 + h_{k,u}^2}$$

where $x_k^t - x_u^t$ is the distance between the task machine and the mobile charger in the horizontal direction at time slot $t$, $y_k^t - y_u^t$ is their distance in the vertical direction at time slot $t$, and $h_{k,u}$ is the constant height difference between the task machine and the mobile charger. The energy replenishment during the wireless charging process is modeled as follows and recorded as the third mathematical model:

$$E_r^t = P_r^t\, \tau$$

where $E_r^t$ is the energy received by the task machine in a single time slot and $\tau$ is the length of a single time slot.
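The charging model of step (4) can be illustrated with the standard Friis free-space formula. The sketch below is an assumption-laden example rather than the patented code: the function names and the slot length passed in the usage are hypothetical, while the defaults (20 W transmit power, 25 dBi antenna gains, 1 m wavelength) come from the embodiment. Friis is a far-field formula, so results at very small separations are not physically meaningful.

```python
import math


def friis_received_power(d_m: float, p_t_w: float = 20.0,
                         g_t_dbi: float = 25.0, g_r_dbi: float = 25.0,
                         wavelength_m: float = 1.0) -> float:
    """Received power (W) at distance d_m under the Friis free-space model."""
    g_t = 10 ** (g_t_dbi / 10)  # transmit antenna gain as a linear ratio
    g_r = 10 ** (g_r_dbi / 10)  # receive antenna gain as a linear ratio
    return p_t_w * g_t * g_r * (wavelength_m / (4 * math.pi * d_m)) ** 2


def harvested_energy(dx: float, dy: float, dh: float, tau_s: float) -> float:
    """Energy (J) received in one slot of length tau_s, assuming lossless conversion."""
    distance = math.sqrt(dx * dx + dy * dy + dh * dh)
    return friis_received_power(distance) * tau_s
```

With these defaults, a separation of 100 m yields about 1.27 W, so a hypothetical one-second slot would deliver about 1.27 J.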

Step (5) specifically means the following. In the dynamic scene, the variables of the task machine's path planning problem are the positions of the mobile IoT devices, the mobile charger and the task machine, the data uploading queues of the devices, and the battery state of the task machine. The optimization seeks a path strategy that helps the task machine make the best trade-off between energy consumption and uploaded data volume, so that the data uploading efficiency of a single flight is maximized by optimizing the flight trajectory. The main factors considered are the uploaded data volume and the task energy consumption, and the fourth mathematical model expresses this objective in terms of the following quantities:

$R_m^t$ is the transmission rate of the data link established between the task machine and mobile IoT device $m$ at time slot $t$, and $e_u$ is the constant energy consumed each time the task machine executes a flight action; the current time slot of the system is denoted $t$, $t \in \{0, \tau, 2\tau, \dots, T\}$, where $\tau$ is the length of a single time slot and $T$ is the time of the system's terminal state; $m \in \{1, 2, \dots, M\}$.

The step (6) specifically comprises the following steps:

(6a) Determining a state set S:

the expression of the state set S is as follows:

$$S = \{ s^t \}, \quad s^t = \left( L_m(t),\ L_u(t),\ B_u(t) \right)$$

where $s^t$ is the system state at time slot $t$ and consists of the three parts $L_m(t)$, $L_u(t)$ and $B_u(t)$. $L_m(t)$ represents the state of every mobile IoT device in the data uploading task and guides the task objective of the upload; it contains the positions and remaining data volumes of all mobile IoT devices. Let the two-dimensional coordinates at time slot $t$ of the device with the last serial number $M$ in the task area be $(x_M^t, y_M^t)$ and its remaining data volume be $d_M^t$; then $L_m(t)$ is:

$$L_m(t) = \left[ x_1^t,\ y_1^t,\ d_1^t,\ \dots,\ x_M^t,\ y_M^t,\ d_M^t \right]$$

$L_u(t)$ simulates the field of view of the task machine: it represents the data transmission rate available at different positions within the effective transmission distance of the task machine, together with the position of the task machine. According to the system characteristics, the field of view is centered on the task machine and extends $n$ grids in each direction. Positions outside the task area are shown as black grids and their matrix value is set to -1 to indicate that they lie outside the task area. The data uploading rate at each position is represented according to its distance from the task machine: the rate is highest at the position of the task machine, whose matrix element is set to 50; positions one grid away are set to 20; positions two grids away are set to 5; and the remaining white grids are set to 0;

A view matrix $Y \in \mathbb{R}^{(2n+1)\times(2n+1)}$ is then constructed, where $Y_{i,j}^t$ is the element in row $i$ and column $j$ of the view matrix, with $i, j \in (0, 2n+1]$. Let the element in row $i$ and column $j$ of $Y$ correspond to the position with abscissa $x_i$ and ordinate $y_j$ in the overall grid space, and let $d_{i,j}^t$ be the grid distance at time slot $t$ between that position and the task machine. The value of the view matrix element $Y_{i,j}^t$ is:

$$Y_{i,j}^t = \begin{cases} -1, & (x_i, y_j) \text{ outside the task area} \\ 50, & d_{i,j}^t = 0 \\ 20, & d_{i,j}^t = 1 \\ 5, & d_{i,j}^t = 2 \\ 0, & \text{otherwise} \end{cases}$$

The resulting view matrix $Y$ is a square matrix of $(2n+1)^2$ elements. Its elements are flattened and combined with the abscissa $x_u^t$ and the ordinate $y_u^t$ of the task machine in the grid space to form the one-dimensional column vector $L_u(t)$:

$$L_u(t) = \left[ Y_{1,1}^t,\ \dots,\ Y_{2n+1,2n+1}^t,\ x_u^t,\ y_u^t \right]^{\mathsf{T}}$$
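The construction of the view matrix and of the vector built from it can be sketched as follows. The values -1, 50, 20, 5 and 0 are those stated in the text; everything else is an assumption made for illustration: the function names, zero-based grid indices, and in particular the use of the Chebyshev (chessboard) distance as the "grid distance", which the text does not specify.

```python
def build_view_matrix(u_x: int, u_y: int, n: int, grid_size: int) -> list:
    """Return the (2n+1) x (2n+1) field-of-view matrix centred on the task machine.

    Cells outside the grid_size x grid_size task area are -1; inside it the
    value depends on the grid distance to the task machine (0 -> 50, 1 -> 20,
    2 -> 5, farther -> 0). Chebyshev distance is an assumption.
    """
    value_by_distance = {0: 50, 1: 20, 2: 5}
    size = 2 * n + 1
    view = []
    for i in range(size):
        row = []
        for j in range(size):
            gx, gy = u_x + (j - n), u_y + (i - n)  # absolute grid position of this cell
            if not (0 <= gx < grid_size and 0 <= gy < grid_size):
                row.append(-1)
            else:
                dist = max(abs(gx - u_x), abs(gy - u_y))
                row.append(value_by_distance.get(dist, 0))
        view.append(row)
    return view


def build_l_u(u_x: int, u_y: int, n: int, grid_size: int) -> list:
    """Flatten the view matrix and append the task machine's grid coordinates."""
    flat = [v for row in build_view_matrix(u_x, u_y, n, grid_size) for v in row]
    return flat + [u_x, u_y]
```

For a view radius of n = 3 the resulting vector has 7 × 7 + 2 = 51 entries; when the task machine sits in a corner of the area, a large share of the matrix is filled with -1.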

$B_u(t)$ contains the current battery level $b_u^t$ of the task machine, the abscissa $x_k^t$ and the ordinate $y_k^t$ of the mobile charger in the grid space, and the Euclidean distance $d_{k,u}^t$ between the task machine and the mobile charger. $B_u(t)$ is expressed as follows:

$$B_u(t) = \left[ b_u^t,\ x_k^t,\ y_k^t,\ d_{k,u}^t \right]$$

(6b) Determining an action set A:

The movement of the task machine in the grid space of the system is selected by movement direction, and the action space A consists of the actions $a^t$, where $a^t$ is the action performed by the task machine at time slot $t$. During flight the task machine moves at the fixed speed $v_u$ in one of four flight directions. While flying it obtains energy from the transmitting end of the mobile charger by wireless charging, so that uploading scene data and obtaining energy proceed in parallel;
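A grid-world action set with four flight directions can be sketched as below. The text only states that there are four directions; the specific mapping of action indices to moves, the function name and the choice to keep the task machine in place when a move would leave the task area are all assumptions for illustration.

```python
# Hypothetical encoding of the four flight directions as grid offsets (dx, dy).
ACTIONS = {
    0: (0, 1),    # one cell in +y
    1: (0, -1),   # one cell in -y
    2: (-1, 0),   # one cell in -x
    3: (1, 0),    # one cell in +x
}


def apply_action(x: int, y: int, action: int, grid_size: int) -> tuple:
    """Return the task machine's grid position after taking `action`.

    A move that would leave the grid_size x grid_size area is ignored
    (assumption: the machine stays where it is).
    """
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < grid_size and 0 <= ny < grid_size:
        return nx, ny
    return x, y
```

Because the speed is fixed, each action advances the task machine by exactly one cell per time slot in this sketch.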

(6c) Determining a reward function r ^t ：

For the task machine, the reward mechanism reflects how different behavior decisions affect each part of the system. The instant reward obtained by the task machine at time slot $t$ is denoted $r^t$ and is expressed as follows:

$$r^t = \mu_1 r_1^t + \mu_2 r_2^t + \mu_3 r_3^t$$

where $\mu_1$, $\mu_2$ and $\mu_3$ are weighting factors that balance the three reward components $r_1^t$, $r_2^t$ and $r_3^t$;

$r_1^t$ takes maximizing the data throughput of the whole data uploading process as the core objective and gives each behavior a positive reward according to the data throughput of the single communication, computed from the amount of data uploaded to the task machine by each user $m$ at time $t$;

$r_2^t$ aims at maximizing the data uploading efficiency of the whole data uploading process: it gives a negative reward to every movement step of the task machine, which pushes the task machine to improve its path selection, reduces unnecessary energy loss and promotes convergence to the optimal path;

$r_3^t$ aims at maximizing the endurance of the task machine during the data uploading task: it gives a positive reward for the wireless charging that results from the task machine's movement decision. Its expression depends on the remaining energy $b_u^t$ of the task machine at time slot $t$, the energy $E_r^t$ received by the task machine in a single time slot, the threshold $b_{th}$ that determines whether the task machine has entered a low-energy state, and the constant coefficients $\beta_1$ and $\beta_2$.

The step (7) specifically comprises the following steps:

(7a) Initialize the estimated value network parameters $\theta_1$ and the target value network parameters $\theta_2$, letting $\theta_2 = \theta_1$; initialize an experience replay pool with capacity D; initialize the network learning rate $\alpha$ and the discount factor $\gamma$;

(7b) For the current state-action pair $(s^t, a^t)$, where $s^t$ is the system state at time slot $t$ and $a^t$ is the action performed by the task machine at time slot $t$, the estimated value network outputs the Q-value estimate $Q_{predicted}(s^t, a^t; \theta_1)$, where $\theta_1$ denotes the estimated value network parameters. The action for the next state is selected as

$$a^{t+1} = \arg\max_{a} Q_{predicted}(s^{t+1}, a; \theta_1)$$

where $s^{t+1}$ is the state input to the target value network and $a^{t+1}$ is the action input to the target value network. The target value network then gives the Q value of the next state-action pair $(s^{t+1}, a^{t+1})$ as $Q_{target}(s^{t+1}, a^{t+1}; \theta_2)$, and the target value of the Double DQN algorithm is defined as:

$$y^t = r^t + \gamma\, Q_{target}(s^{t+1}, a^{t+1}; \theta_2)$$
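The target value of step (7b) can be illustrated with a small helper. This sketch follows the standard Double DQN decoupling, in which the estimated value network picks the next action and the target value network evaluates it; that reading of the text is an assumption, as are the function name and the `done` flag for terminal states. The Q values are passed in as plain lists, one entry per action, and the default discount factor 0.95 is the value given in the embodiment.

```python
def double_dqn_target(reward: float, q_online_next: list, q_target_next: list,
                      gamma: float = 0.95, done: bool = False) -> float:
    """Compute the Double DQN target for one transition.

    q_online_next -- Q values of the next state from the estimated value network
    q_target_next -- Q values of the next state from the target value network
    """
    if done:
        return reward  # no bootstrapping from a terminal state (assumption)
    # Action selection uses the estimated value network ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... while its value is read from the target value network.
    return reward + gamma * q_target_next[best_action]
```

Selecting with one network and evaluating with the other is what distinguishes this target from the plain DQN target, which takes the maximum of the target network's own outputs.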

(7c) In the current state $s^t$, select an action $a^t$ according to the improved Double DQN algorithm and execute it, transition to the new state $s^{t+1}$, compute the single-step reward value according to the reward function $r^t$, and store the transition $(s^t, a^t, r^t, s^{t+1})$ in the experience replay pool;

Steps (7b) to (7c) are executed cyclically until the number of transitions stored in the experience replay pool equals D, whereupon step (7d) is entered;

The Double DQN algorithm has two neural networks, an estimated value network and a target value network; given a training interval step, training is performed once every step steps;

The total reward value of each training round is defined as R, so that the reward obtained by executing a single flight path is expressed as:

$$R = \sum_{t=0}^{T} r^t$$

The improvement to the Double DQN algorithm concerns the greedy coefficient $\varepsilon$ used for action selection, so that the action with the maximum reward value is secured at each step according to the optimization measure. When executing the $\varepsilon$-greedy strategy, the agent selects the current optimal action with probability $\varepsilon$ and explores other actions with the remaining probability $1-\varepsilon$. The value of $\varepsilon$ varies with the training progress, where episode is the current number of flight rounds and K is the number of flight rounds at which $\varepsilon$ reaches 1;
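The action-selection rule can be sketched as follows. Note the convention used in the text: ε is the probability of exploiting the current best action, and it reaches 1 after K flight rounds. The exact schedule is not recoverable from the text, so the linear ramp below is purely a hypothetical placeholder; the function names are also assumptions, and the default K = 8000 is the value listed in the embodiment.

```python
import random


def epsilon_for_episode(episode: int, k_max: int = 8000) -> float:
    """Hypothetical linear ramp: epsilon grows from 0 to 1 over k_max episodes."""
    return min(1.0, episode / k_max)


def select_action(q_values: list, epsilon: float, rng=random) -> int:
    """Exploit the best-valued action with probability epsilon, else explore."""
    if rng.random() < epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return rng.randrange(len(q_values))
```

Early rounds therefore explore almost uniformly, and from round K onward the agent always follows its current Q estimates.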

z experiences are randomly drawn from the experience replay pool of capacity D to form the offline learning training data set; the maximum number of task rounds is F;

(7d) Randomly draw z = 32 transitions from the experience replay pool; the k-th state transition in this mini-batch is denoted $(s^k, a^k, r^k, s^{k+1})$, $k = 1, 2, 3, \dots, z$;
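The experience replay pool used in steps (7c) and (7d) can be sketched with the standard library. The class and method names are assumptions; the default capacity 55000 and batch size 32 are the values D and z given in the embodiment. Discarding the oldest transition once the pool is full is a common choice and is assumed here.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity: int = 55000):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state) -> None:
        self.buffer.append((state, action, reward, next_state))

    def is_full(self) -> bool:
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, z: int = 32) -> list:
        """Draw z distinct transitions uniformly at random."""
        return random.sample(list(self.buffer), z)
```

In the procedure described above, sampling would only begin once `is_full()` returns true, that is, after D transitions have been collected.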

(7e) From the transitions $(s^k, a^k, r^k, s^{k+1})$, $k = 1, 2, 3, \dots, z$, of step (7d), compute the target Q value $y^k$ and the loss value $L(\theta_1)$. Let $y^k = r^k + \gamma\, Q_{target}(s^{k+1}, a^{k+1}; \theta_2)$, with $a^{k+1}$ selected as in step (7b); the loss function is expressed as follows:

$$L(\theta_1) = \frac{1}{z} \sum_{k=1}^{z} \left( y^k - Q_{predicted}(s^k, a^k; \theta_1) \right)^2$$

(7f) Minimize the loss function $L(\theta_1)$ by gradient descent, expressed as follows:

$$\theta_1 \leftarrow \theta_1 - \alpha\, \frac{\partial L(\theta_1)}{\partial \theta_1}$$

where $\partial$ is the partial derivative symbol and $\frac{\partial L(\theta_1)}{\partial \theta_1}$ is the partial derivative of the error function being minimized with respect to $\theta_1$, that is, the derivative with respect to the estimated value network parameters $\theta_1$ of the squared difference between the Q value computed by the estimated value network for its input state and action and the target Q value given by the target value network. The update above yields the new estimated value network parameters, and every step steps the target value network parameters $\theta_2$ are replaced by $\theta_1$;
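The loss, the gradient step and the periodic target-network update of steps (7e) and (7f) can be sketched without any deep-learning library. This is only an illustration under simplifying assumptions: parameters are represented as a flat list of floats, the gradient is supplied by the caller, and all function names are hypothetical. The defaults α = 0.0001 and step = 25 are the values given in the embodiment.

```python
def mse_loss(targets: list, predictions: list) -> float:
    """Mean squared difference between target and predicted Q values over a batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def gradient_step(theta_1: list, grads: list, alpha: float = 1e-4) -> list:
    """One gradient-descent update of the estimated value network parameters."""
    return [w - alpha * g for w, g in zip(theta_1, grads)]


def maybe_sync_target(theta_1: list, theta_2: list,
                      train_count: int, step: int = 25) -> list:
    """Copy theta_1 into the target parameters every `step` training updates."""
    if train_count % step == 0:
        return list(theta_1)
    return theta_2
```

Keeping the target parameters fixed between synchronizations is what stabilizes the regression targets while the estimated value network is being updated.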

One pass through steps (7d) to (7f) completes a single training update. Steps (7b) to (7f) are executed repeatedly during each round of the flight task; training ends after F rounds, which completes the learning process of the Double DQN algorithm;

(7g) After training finishes, save the optimal path strategy $\pi^*$ and record the reward value, the number of flight steps and the uploaded data volume. Over the F training rounds, the estimated value network parameters $\theta_1$ and the target value network parameters $\theta_2$ are updated in the direction that maximizes the total reward value R, and the optimal path strategy $\pi^*$ is finally found.

According to the above technical scheme, the beneficial effects of the invention are as follows. First, compared with the prior art, the invention uses far-field wireless charging technology and combines a highly mobile UAV with a high-gain radio-frequency antenna to form a mobile charger, which provides a more convenient charging service for the task machine that assists the base station in data uploading within the IoT communication system; under demanding real-time data requirements, this charging method markedly improves the endurance of the task machine and achieves high data uploading efficiency in the UAV-assisted IoT communication system. Second, given the high dimensionality of the problem's state space, the path optimization method uses Double DQN, which has a double deep neural network structure, and improves its exploration coefficient.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a grid scene representation in an implementation of the present invention;

FIG. 3 is a block diagram of a model of the improved Double DQN algorithm of the present invention;

FIG. 4 is a graph of the training convergence of the reward value in the present invention;

FIG. 5 is a graph of the training results for the number of steps of a single flight task in the present invention;

FIG. 6 is a graph of the training results for the data volume uploaded in a single flight task in the present invention.

Detailed Description

As shown in fig. 1, a method for optimizing a data uploading path of a wireless chargeable unmanned aerial vehicle based on a DRL includes the following sequential steps:

(3) Establishing a second mathematical model for the data uploading channel;

(7) Performing offline learning with an improved Double DQN algorithm according to the state set S, the action set A and the reward function $r^t$ to obtain an optimal path strategy $\pi^*$.

Step (1) specifically means the following. The IoT communication and wireless charging scene system is referred to as the system. The system space is divided into N×N grids, where each grid is a square cell of side length c and N is the number of grids in both the horizontal and vertical directions. A task machine is deployed in the scene to perform the uploading task, a mobile charger provides wireless charging service, and each of the M mobile IoT devices holds a data volume $d_m$ waiting to be uploaded;

The mobile charger, acting as the energy supply end, leaves its stop point when the task begins and moves along a pre-deployed flight path at a pre-deployed flight speed $v_k$ while providing energy to the task machine. Its real-time position is $(x_k^t, y_k^t, h_k)$, where $x_k^t$ is the abscissa of the mobile charger in the grid space, $y_k^t$ is its ordinate in the grid space, and $h_k$ is its fixed flight height. When the task begins, the task machine leaves its stop point and flies at a constant speed $v_u$. Its real-time position is $(x_u^t, y_u^t, h_u)$, where $x_u^t$ is the abscissa of the task machine in the grid space, $y_u^t$ is its ordinate in the grid space, and $h_u$ is its fixed flight height. Here, $v_k$ = 10 m/s, $h_k$ = 40 m, $v_u$ = 10 m/s and $h_u$ = 41 m.

Step (2) specifically means the following. The maximum battery capacity of the task machine is $b_{max}$, and $b_u^t$ denotes its remaining energy at time slot $t$, with $b_u^t \in [0, b_{max}]$. The task machine consumes a constant energy $e_u$ each time it executes a flight action. Without considering charging, the battery level of the task machine at time slot $t+1$ is modeled as follows and recorded as the first mathematical model:

$$b_u^{t+1} = b_u^t - e_u$$

where $b_u^{t+1}$ denotes the remaining energy at time slot $t+1$.

Here, $b_{max}$ = 150 kJ.

where $R_m^t$ is the transmission rate of the data link established between the task machine and mobile IoT device $m$ at time slot $t$, $W$ is the signal bandwidth, $P_{IoT}$ is the transmit power of the IoT device, and $\sigma^2$ is the noise power.

Here, $\rho_0$ = -50 dB, $W$ = 1 MHz, $P_{IoT}$ = 0.1 W and $\sigma^2$ = -100 dB.

where $x_k^t - x_u^t$ is the distance between the task machine and the mobile charger in the horizontal direction at time slot $t$, $y_k^t - y_u^t$ is their distance in the vertical direction at time slot $t$, and $h_{k,u}$ is the constant height difference between the task machine and the mobile charger. The energy replenishment during the wireless charging process is modeled and recorded as the third mathematical model, in which the energy received by the task machine in a single time slot equals the received power multiplied by the slot length $\tau$.

Here, $P_t$ = 20 W, $G_t$ = 25 dBi, $G_r$ = 25 dBi and $\lambda$ = 1 m.

where $R_m^t$ is the transmission rate of the data link established between the task machine and mobile IoT device $m$ at time slot $t$, and $e_u$ is the constant energy consumed each time the task machine executes a flight action; the current time slot of the system is denoted $t$, $t \in \{0, \tau, 2\tau, \dots, T\}$, where $\tau$ is the length of a single time slot and $T$ is the time of the system's terminal state; $m \in \{1, 2, \dots, M\}$.

The step (6) specifically comprises the following steps:

(6a) Determining a state set S:

the expression of the state set S is as follows:

$L_u(t)$ simulates the field of view of the task machine: it represents the data transmission rate available at different positions within the effective transmission distance of the task machine, together with the position of the task machine. According to the system characteristics, the field of view is centered on the task machine and extends $n$ grids in each direction. Positions outside the task area are shown as black grids and their matrix value is set to -1 to indicate that they lie outside the task area. The data uploading rate at each position is represented according to its distance from the task machine: the rate is highest at the position of the task machine, whose matrix element is set to 50; positions one grid away are set to 20; positions two grids away are set to 5; and the remaining white grids are set to 0;

The resulting view matrix $Y$ is a square matrix of $(2n+1)^2$ elements. Its elements are flattened and combined with the abscissa $x_u^t$ and the ordinate $y_u^t$ of the task machine in the grid space to form the one-dimensional column vector $L_u(t)$.

$B_u(t)$ contains the current battery level $b_u^t$ of the task machine, the abscissa $x_k^t$ and the ordinate $y_k^t$ of the mobile charger in the grid space, and the Euclidean distance $d_{k,u}^t$ between the task machine and the mobile charger.

(6b) Determining an action set A:

(6c) Determining a reward function r ^t ：

where $\mu_1$, $\mu_2$ and $\mu_3$ are weighting factors that balance the three reward components; here, $\mu_1$ = 1, $\mu_2$ = 60 and $\mu_3$ = 5. The first component takes maximizing the data throughput of the whole data uploading process as the core objective and gives each behavior a positive reward according to the data throughput of the single communication.

The step (7) specifically comprises the following steps:

(7a) Initializing the estimated value network parameters θ₁ and the target value network parameters θ₂, and letting θ₂ = θ₁; initializing an experience replay pool with capacity D; initializing the network learning rate α and the attenuation (discount) coefficient γ; here, α = 0.0001 and γ = 0.95.

(7b) Based on the current state-action pair (s^t, a^t), where s^t is the system state at time slot t and a^t is the action performed by the task machine at time slot t, the estimated value network outputs the estimated Q value, Q_predicted(s^t, a^t; θ₁), where θ₁ denotes the parameters of the estimated value network. An action a^t+1 is generated for the next state s^t+1, where s^t+1 is the state value input to the target value network and a^t+1 is the action value input to the target value network; the target value network then determines the Q value of the next state-action pair (s^t+1, a^t+1). The target value for the Double DQN algorithm is defined as:
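For reference, the Double DQN target as it is commonly stated in the literature can be written in the notation of step (7b) as below. This is an assumption about the form intended here: y^t is a hypothetical name for the target value, the action for the next state is chosen with the estimated value network parameters θ₁, and that action is evaluated with the target value network parameters θ₂.

```latex
y^{t} = r^{t} + \gamma \, Q\!\left(s^{t+1},\ \arg\max_{a \in A} Q\!\left(s^{t+1}, a;\ \theta_1\right);\ \theta_2\right)
```

Separating action selection from action evaluation in this way is what reduces the overestimation that a single-network maximum would introduce.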

here, k=8000, d=55000, z=32, f=16000, step=25.

As shown in Fig. 3, the improved Double DQN algorithm modifies the greedy coefficient ε used for action selection, so that the action with the maximum reward value according to the optimization criterion tends to be obtained at each step. During execution of the ε-greedy strategy, the agent selects the currently optimal action with probability ε and explores other actions with the remaining probability 1 - ε. The value of the coefficient ε varies according to the following expression:
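The selection rule can be sketched as follows, using the convention of the text in which ε is the probability of taking the currently best action, so that ε grows as training proceeds. The linear ramp in `epsilon_for_round` and the values `eps_min`, `eps_max` and `ramp_rounds` are purely hypothetical placeholders, not the patented expression for ε.

```python
import random
from typing import Sequence


def epsilon_for_round(
    round_idx: int,
    ramp_rounds: int,
    eps_min: float = 0.1,
    eps_max: float = 0.95,
) -> float:
    """ASSUMPTION: linear increase of epsilon with the flight round index."""
    frac = min(max(round_idx / ramp_rounds, 0.0), 1.0)
    return eps_min + (eps_max - eps_min) * frac


def select_action(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Exploit with probability epsilon, otherwise explore one of the other actions."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if len(q_values) == 1 or rng.random() < epsilon:
        return greedy
    others = [a for a in range(len(q_values)) if a != greedy]
    return rng.choice(others)


rng = random.Random(0)
eps = epsilon_for_round(round_idx=4000, ramp_rounds=16000)
print(select_action([0.2, 1.5, -0.3, 0.0], eps, rng))
```

A small ε in early rounds yields mostly exploratory transitions, which matches the stated aim of collecting more effective samples early in training.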

(7e) For the samples (s^k, a^k, r^k, s^k+1), k = 1, 2, 3, …, z, obtained in step (7d), calculating the target Q value and the loss value L(θ₁); the loss function is expressed as follows:
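A minimal sketch of the mini-batch computation in step (7e) follows. It assumes the standard Double DQN target and a mean-squared-error loss, which may differ from the patented expressions; `q_online` and `q_target` are hypothetical callables standing in for the estimated value network (θ₁) and the target value network (θ₂), and terminal-state handling is omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Transition = Tuple[State, int, float, State]  # (s^k, a^k, r^k, s^k+1)
QFunc = Callable[[State], List[float]]        # state -> Q value per action


def double_dqn_loss(
    batch: Sequence[Transition],
    q_online: QFunc,
    q_target: QFunc,
    gamma: float = 0.95,  # attenuation coefficient given in step (7a)
) -> float:
    """Mean squared error between Double DQN targets and online estimates."""
    total = 0.0
    for s, a, r, s_next in batch:
        next_q = q_online(s_next)
        # Action selection with the estimated value network ...
        a_star = max(range(len(next_q)), key=lambda i: next_q[i])
        # ... and evaluation of that action with the target value network
        y = r + gamma * q_target(s_next)[a_star]
        total += (y - q_online(s)[a]) ** 2
    return total / len(batch)


online = lambda s: [1.0, 2.0]
target = lambda s: [5.0, 3.0]
print(double_dqn_loss([((0.0,), 0, 1.0, (1.0,))], online, target, gamma=0.5))
```

In the toy call the target uses the value 3.0 (the target network's estimate for the action the online network prefers) rather than the target network's own maximum of 5.0, which is the defining difference from plain DQN.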

To improve the task execution efficiency of the task unmanned aerial vehicle (task machine for short) in a communication system in which it assists data uploading, the invention combines the concept of far-field wireless charging with the flexibility and easy deployment of unmanned aerial vehicles: a charging unmanned aerial vehicle (mobile charger for short) equipped with a high-gain radio frequency antenna provides charging service for the task machine. For this communication and wireless charging system, the improved deep reinforcement learning algorithm Double DQN is used to obtain a flight path optimization strategy that balances the task machine's data uploading task against its need to replenish energy, thereby improving the task execution efficiency of the task machine.

Based on the characteristics of the path optimization problem, the invention designs the state set S, the action set A and the reward function r^t. In addition, the actual field of view of the task machine is simulated, and the neural network inputs L_m(t), L_u(t) and B_u(t) are designed from the effective information of the system. In the algorithm, the greedy coefficient ε is designed to increase with the number of flight rounds so as to increase the number of effective samples collected in the early stage. An estimated value network and a target value network that are independent of each other are designed; during learning, the two networks update their parameters step by step to reduce the estimation error, which effectively mitigates the overestimation problem and improves the accuracy of the algorithm.

As shown in Fig. 2, the Internet of Things communication and wireless charging scene system is denoted as the system. The system space is divided into N × N grids; each grid is a square cell with side length c, where N is the number of grids in both the transverse and longitudinal directions. A task machine is deployed in the scene to perform uploading tasks, a mobile charger provides wireless charging service, and each of the M mobile Internet of Things devices has a data volume d_m waiting to be uploaded. In this embodiment, N = 15, c = 10 meters, M = 10, and d_m = 1000 kB.
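The embodiment values can be collected in a small configuration object to check the derived quantities. The class and field names are hypothetical; only the four numeric values come from the text. The resulting total of 10000 kB agrees with the maximum upload amount reported for Fig. 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneConfig:
    grids_per_side: int = 15            # N
    cell_side_m: float = 10.0           # c, in meters
    num_devices: int = 10               # M
    data_per_device_kb: float = 1000.0  # d_m, in kB

    @property
    def area_side_m(self) -> float:
        """Side length of the whole task area in meters."""
        return self.grids_per_side * self.cell_side_m

    @property
    def num_cells(self) -> int:
        """Total number of grid cells in the N x N task area."""
        return self.grids_per_side ** 2

    @property
    def total_data_kb(self) -> float:
        """Total data volume waiting to be uploaded across all devices."""
        return self.num_devices * self.data_per_device_kb


cfg = SceneConfig()
print(cfg.area_side_m, cfg.num_cells, cfg.total_data_kb)
```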

As shown in Fig. 4, the abscissa represents the number of task rounds and the ordinate represents the reward value of a single flight round. As can be seen from Fig. 4, the reward value gradually increases and tends to stabilize as the number of task rounds increases, and the training effect is best when the number of rounds reaches about 14000. The parameters θ₁ and θ₂ of the estimated value network and the target value network are updated to obtain the flight path strategy π* with the maximum task execution efficiency.

As shown in Fig. 5, the abscissa represents the number of task rounds and the ordinate represents the number of steps taken by the task machine in a single flight round. As can be seen from Fig. 5, during rounds 0 to 10000 the number of steps per round gradually increases while oscillating: because the initial onboard energy of the task machine is insufficient for it to complete all data uploading, the task machine learns to approach the charger at suitable times to replenish its energy. During rounds 10000 to 16000 the number of flight steps per round gradually decreases and finally converges to 72 steps per flight round: within the set of optimized strategies already generated, the task machine learns further toward fewer flight steps and finally finds the path flight strategy π* with the minimum energy consumption.

As shown in Fig. 6, the abscissa represents the number of task rounds and the ordinate represents the total amount of data uploaded in a single flight round. As can be seen from Fig. 6, the total amount of data uploaded gradually increases with the number of task rounds and stabilizes at the maximum value of 10000 kB when the number of rounds reaches about 12000, yielding a strategy that completes the data uploading task efficiently.

In summary, compared with the prior art, the invention uses far-field wireless charging technology together with a highly mobile unmanned aerial vehicle carrying a high-gain radio frequency antenna as the mobile charger, thereby providing a more convenient charging service for the task machine that assists the base station in uploading data in the Internet of Things communication system. Where the real-time requirements on data are high, this more convenient charging method significantly improves the endurance of the task machine and achieves high data uploading efficiency for the unmanned-aerial-vehicle-assisted Internet of Things communication system.

Claims

1. A wireless chargeable unmanned aerial vehicle data uploading path optimizing method based on DRL is characterized in that: the method comprises the following steps in sequence:

(3) Establishing a second mathematical model for the data uploading channel;

(7) According to the state set S, the action set A and the reward function r^t, offline learning is performed using an improved Double DQN algorithm to obtain an optimal path strategy π*.

2. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (1) specifically refers to: the Internet of Things communication and wireless charging scene system is denoted as the system; the system space is divided into N × N grids, each grid being a square cell with side length c, where N is the number of grids in both the transverse and longitudinal directions; a task machine is deployed in the scene to perform uploading tasks, a mobile charger provides wireless charging service, and each of the M mobile Internet of Things devices has a data volume d_m waiting to be uploaded;

the current time slot of the system is denoted t, with t = 0, τ, 2τ, …, T, where τ is the length of a single time slot and T is the time of the system termination state; the position of mobile Internet of Things device m at time slot t, where m ∈ {1, 2, …, M}, is given by its abscissa and its ordinate in grid space together with h_m, the fixed height of mobile Internet of Things device m; the remaining amount of data to be uploaded by mobile Internet of Things device m at time slot t is likewise defined;

the mobile charger, as the energy supply end, departs from a stop point at the start of the task and moves along a pre-deployed flight path at a pre-deployed flight speed v_k while providing energy to the task machine; its real-time position is given by its abscissa and its ordinate in grid space together with h_k, the fixed flight altitude of the mobile charger; when the task starts, the task machine departs from the stop point and flies at a constant flight speed v_u; its real-time position is given by its abscissa and its ordinate in grid space together with h_u, the fixed flight altitude of the task machine.

3. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (2) specifically refers to: the maximum battery capacity of the task machine is b_max, and the remaining energy of the task machine at time slot t is defined; the task machine consumes a constant amount of energy each time it executes a flight action; then, without considering charging, a mathematical model of the battery level of the task machine at time slot t+1 is established and recorded as the first mathematical model, expressed as follows:

4. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (3) specifically refers to: the task machine establishes a communication connection with a mobile Internet of Things device and starts data uploading; the flight altitude of the task machine is high enough that line-of-sight wireless transmission is guaranteed between the mobile Internet of Things device and the task machine, and the channel gain between the mobile Internet of Things device and the task machine is then expressed as follows:

5. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (4) specifically refers to: the mobile charger carries a high-gain radio frequency antenna for transmitting wireless power and provides energy supply service to the task machine along a fixed deployment trajectory; during task execution, the distance between the task machine and the mobile charger differs from time slot to time slot, and the wireless charging power drops sharply as this distance increases; assuming perfect energy conversion efficiency in the wireless transmission link, the power received at the task machine is calculated with the Friis free-space propagation model and expressed by the following formula:

wherein the two distance terms denote the distance between the task machine and the mobile charger at time slot t in the transverse direction and in the longitudinal direction, respectively, and h_k,u denotes the constant height difference between the task machine and the mobile charger; an energy replenishment mathematical model for the wireless charging process is established and recorded as the third mathematical model, expressed as follows:

6. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (5) specifically refers to: for the path planning optimization problem of the task machine, the variables in the dynamic scene are the position information of the mobile Internet of Things devices, the mobile charger and the task machine, the device data upload queue, and the battery state information of the task machine; the optimization goal is to find a path strategy that helps the task machine make the best trade-off between energy consumption and the amount of data uploaded, maximizing the data uploading efficiency of a single flight from the perspective of optimizing the movement trajectory; the main factors considered are the amount of data uploaded and the task energy consumption, and the fourth mathematical model is expressed as follows:

wherein the rate term is the transmission rate of the data transmission link established between the task machine and mobile Internet of Things device m at time slot t, and the energy term is the constant energy consumed each time the task machine executes a flight action; the current time slot of the system is denoted t, with t = 0, τ, 2τ, …, T, where τ is the length of a single time slot and T is the time of the system termination state; m ∈ {1, 2, …, M}.

7. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (6) specifically comprises the following steps:

(6a) Determining a state set S:

the expression of the state set S is as follows:

wherein s^t is the system state at time slot t and consists of three parts, L_m(t), L_u(t) and B_u(t); L_m(t) represents the state of each mobile Internet of Things device in the data uploading task and is used to guide the task target of data uploading; L_m(t) includes the positions and the remaining data amounts of all the mobile Internet of Things devices; taking, for the mobile Internet of Things device with the last index M in the uploading task area, its two-dimensional coordinates in the horizontal plane at time slot t and its remaining amount of data to be uploaded, L_m(t) is represented by the following formula:

L_u(t) simulates the field of view of the task machine and represents the data transmission rate between the mobile Internet of Things devices and the task machine at different positions within the effective transmission distance of the task machine; according to the system characteristics, the field of view is centered on the task machine and is assumed to extend n grids; positions outside the task area are shown as black grids, and the corresponding matrix value is set to -1 to mark them as lying outside the task area; data upload rates at different positions are represented according to their distance from the task machine: the upload rate is highest at the position of the task machine, where the matrix element is set to 50, the matrix value at a distance of one grid is set to 20, the matrix value at a distance of two grids is set to 5, and the matrix values of the remaining white grids are set to 0;

B_u(t) includes the current battery level of the task machine, the abscissa of the mobile charger in grid space, the ordinate of the mobile charger in grid space, and the Euclidean distance between the task machine and the mobile charger; B_u(t) is expressed as follows:

(6b) Determining an action set A:

(6c) Determining a reward function r^t:

wherein the battery term denotes the remaining energy of the task machine at time slot t, the charging term is the energy received by the task machine within a single time slot, b_th is the threshold for deciding whether the task machine has entered a low-energy state, and β₁ and β₂ are constant coefficients.

8. The DRL-based wireless chargeable unmanned aerial vehicle data upload path optimization method of claim 1, wherein: the step (7) specifically comprises the following steps:

(7c) In the current state s^t, selecting an action a^t according to the improved Double DQN algorithm and executing it, thereby transitioning to the new state s^t+1; calculating the single-step reward value according to the reward function r^t, and storing the transition (s^t, a^t, r^t, s^t+1) in the experience replay pool;

(7e) For the samples (s^k, a^k, r^k, s^k+1), k = 1, 2, 3, …, z, obtained in step (7d), calculating the target Q value and the loss value L(θ₁); the loss function is expressed as follows: