CN117858015A

CN117858015A - Air edge calculation data safe transmission and resource allocation method based on deep reinforcement learning

Info

Publication number: CN117858015A
Application number: CN202311591342.0A
Authority: CN
Inventors: 雷宏江; 杨明绪
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2023-11-27
Filing date: 2023-11-27
Publication date: 2024-04-09

Abstract

The invention relates to an air edge calculation data safe transmission and resource allocation method based on deep reinforcement learning, which comprises the steps of constructing an unmanned aerial vehicle auxiliary edge calculation model under the condition of an air eavesdropper; calculating the time delay and energy consumption weighted by the system user as a system optimization target, and constructing a dynamic resource allocation and track joint optimization problem; modeling the dynamic resource allocation and track joint optimization problem into a Markov decision process by using a minimum system optimization target, and adopting a DDPG algorithm to jointly optimize the dynamic resource allocation and a 3D track strategy of the unmanned aerial vehicle; and finally, carrying out system dynamic resource allocation and unmanned aerial vehicle track optimization on the trained strategy network. The invention provides a depth deterministic strategy gradient (DDPG) algorithm based on a DRL to solve a 3D track and a dynamic resource allocation strategy of an unmanned aerial vehicle, and the time delay and the energy consumption of the whole system user are reduced on the premise of guaranteeing the safety of unloading data of the user, so that the calculation cost of the system user is reduced.

Description

Air edge calculation data safe transmission and resource allocation method based on deep reinforcement learning

Technical Field

The invention relates to the technical field of unmanned aerial vehicle auxiliary mobile edge computing, in particular to a method for guaranteeing user unloading data safety and edge computing resource allocation and track joint optimization based on depth reinforcement.

Background

The statements in this section merely provide background information related to the present disclosure and may constitute prior art. In carrying out the present invention, the inventors have found that at least the following problems exist in the prior art.

In the 5G age, the interconnection of intelligent devices causes the task of computationally intensive and time delay sensitive to be exhibited in a communication network, and the intelligent device interconnection system forms a great challenge for the terminal equipment of the Internet of things with limited computing capacity and low power consumption. Mobile Edge Computing (MEC) is seen as a promising solution to reduce data transmission latency and bandwidth requirements by sinking computing and storage resources to the network edge, enabling task-near-source processing. MEC also supports near field communication and real-time data analysis, providing more efficient computing and response performance for Internet of things applications. However, terminal devices in remote or mountainous areas face challenges because they have difficulty obtaining reliable MEC server and infrastructure coverage, resulting in serious limitations in terms of computation and communication.

Unmanned aerial vehicles are widely applied to assist MEC systems in executing computation-intensive tasks due to the characteristics of flexible deployment, wide coverage, high probability of line-of-sight link communication and the like. By establishing a line-of-sight connection with a ground user device, the drone may act as a "flying MEC server" providing an important offload service with low network overhead and execution delay. Non-orthogonal multiple access allows multiple users to share link resources to improve spectral efficiency. The application number 202310399786. X patent name is 'a unmanned aerial vehicle auxiliary MEC resource optimization method based on NOMA', the unmanned aerial vehicle auxiliary MEC resource optimization allocation research based on NOMA is used for providing services for users with ground base stations simultaneously, so that the system capacity is improved, the calculation pressure of the ground base stations is greatly relieved, and the service requirements and QoE requirements of the users are met.

Although the unmanned aerial vehicle-assisted MEC system can significantly improve the computing performance of the terminal device, in the unmanned aerial vehicle-assisted edge computing network based on non-orthogonal multiple access, the broadcasting characteristics of the line-of-sight link signal propagation thereof make eavesdropping also benefit from the line-of-sight channel provided by the unmanned aerial vehicle-assisted MEC communication system, and the user unloading information is easily intercepted by a potential malicious eavesdropper, which would lead to risks of data security and user privacy disclosure. Therefore, in the unmanned aerial vehicle-assisted MEC application scenario, reducing the risk of data leakage is a real and important issue. To prevent data from being intercepted maliciously in wireless communications, solutions based on physical layer security techniques are widely used to enhance the security of the data transmission link. As a supplement to the upper layer traditional encryption technology, the physical layer security technology utilizes the uncertainty of noise and multipath transmission to increase the channel capacity difference between a legal receiving end and a eavesdropping receiving end, and ensures the communication security through the natural characteristics of time variability, mutual variability, spatial uniqueness and the like of wireless channels.

However, due to the complexity of resource allocation and task scheduling in MEC networks and environmental uncertainties, conventional optimization methods generally have difficulty in solving such problems. Especially, under the condition that a malicious eavesdropper with uncertain positions exists in the air, how to ensure the safety of unloading data of a user and reduce the calculation time delay and the energy consumption of the user is a current difficult problem. If the random movement characteristics of the user and the 3D flight path of the unmanned aerial vehicle are considered, the problems are more complex and difficult.

Disclosure of Invention

In view of the above, it is an object of the present invention to solve some of the problems of the prior art, or at least to alleviate them.

The method for safely transmitting and distributing the data by the air edge calculation based on the deep reinforcement learning comprises the following steps:

constructing an unmanned aerial vehicle auxiliary edge calculation model considering the existence of an air eavesdropper;

according to the unmanned aerial vehicle auxiliary edge calculation model, on the premise of ensuring user data safety, calculating time delay and energy consumption weighted by a system user as a system optimization target, and constructing a dynamic resource allocation and track joint optimization problem; the mathematical expression of the system user calculation cost minimization objective function is as follows:

P1:

s.t.C1:

C2:

C3:Z _min ≤z _S (n)≤Z _max ,

C4:λ _k,l (n)∈{0,1},

C5:

C6:

C7:

C8:

C9:

C10:

C11:

C12:E _S (N)≥0.

wherein C1-C3 respectively represent the constraint on the flying speed, collision avoidance and flying height of an Unmanned Aerial Vehicle (UAV); C4-C5 are for strong users when communicating in non-orthogonal multiple access (NOMA)And a weak user judgment; c6 is a user transmit power constraint; c7 is a minimum security rate set for guaranteeing the security of user unloading data; C8-C10 are calculation frequency limits distributed to users by the MEC server; c11 represents that the user has to process all data within a specified time; c12 is the energy consumption limit of UAV; the unmanned plane flight time T is equally divided into N time slots, and the length of each time slot is delta _t ＝T/N；q _S (n) is the position of the server unmanned aerial vehicle S, q _E (n) is the location of the eavesdropping drone E;maximum flight speed for S; d, d _min Is the minimum safe distance between S and E; z _S (n) is the flight level of S in the nth slot, where Z _min For the lowest flight altitude of the unmanned aerial vehicle, Z _max Is the maximum flying height; lambda (lambda) _k,l (n) is a binary variable representing the relationship of user channel strength, where l is the weak user relative to the k user; u (U) _k For ground user equipment (U) _k ,k＝1,2,…,K)，p _k (n) is U _k Transmit power at n slots, f _k (n) is U _k CPU frequency at n slots, P _max And->The local maximum transmitting power and the local maximum calculating frequency of the user are respectively calculated; r is R _k,sec (n) is U _k Instantaneous safe offload rate at nth time slot, < >>A security threshold for user offloading; />For the maximum calculated frequency of S, f _Sk (n) assigning S to U _k Is a calculated frequency of (2);unloading data volume for nth time slot user; c (C) _S The number of CPU turns required for calculating 1 bit data for S; e (E) _S (N) remaining energy for last time slot S of unmanned aerial vehicle flight;

modeling the dynamic resource allocation and track joint optimization problem as a Markov decision process with a minimum system optimization target;

the DDPG algorithm in the deep reinforcement learning is adopted to jointly optimize dynamic resource allocation and a 3D track strategy of the unmanned aerial vehicle for solving, so that the weighted time delay and the energy consumption of a user are reduced; the network framework of the DDPG algorithm comprises a strategy network, a value network, a target strategy network and a target value network;

and performing system dynamic resource allocation and unmanned aerial vehicle track optimization by using the trained strategy network.

The DDPG algorithm in the deep reinforcement learning is adopted to jointly optimize dynamic resource allocation and a 3D track strategy of the unmanned aerial vehicle for solving, and the method comprises the following steps:

constructing a DDPG algorithm network frame, which comprises a strategy network, a value network, a target strategy network and a target value network;

updating policy network weights θ based on gradient direction to maximize cumulative discount rewards ^μ ；

Updating the weight θ of a value network by minimizing a loss function ^Q ；

Respectively updating the weight theta of the target strategy network by adopting a soft updating method ^μ′ And the weight θ of the target value network ^Q′ 。

Further, the soft update policy is:

θ ^Q′ ←τθ ^Q +(1-τ)θ ^Q′

θ ^μ′ ←τθ ^μ +(1-τ)θ ^μ′

where τ is the soft update parameter.

The DDPG algorithm employs an empirical playback mechanism to eliminate correlation between samples.

In the unmanned plane auxiliary edge calculation model, by scaling the distance between a user and E, taking the worst safety condition of a system into consideration, estimating the eavesdropping range when the eavesdropping capability of E is strongest; setting a ground friendly jammer J to send an artificial interference signal to E so as to inhibit eavesdropping; the ground user equipment U _k And unloading data to S by adopting a NOMA communication mode.

Further, estimating the eavesdropping range of E when the eavesdropping capability is greatest includes assuming that E is hidden at a rate of q' _E (n)＝(x′ _E (n),y′ _E (n),z′ _E (n)) as the center of a circle and the radius r _E And r is within the circular region of _E Satisfy ||q _E (n)-q′ _E (n)||≤r _E The method comprises the steps of carrying out a first treatment on the surface of the Wherein x' _E (n)、y′ _E (n)、z′ _E (n) coordinates of the eavesdropping center position in the x-axis, y-axis and z-axis, respectively.

The time delay and the energy consumption weighted by the system user are calculated, and the method also comprises the step of respectively setting corresponding energy and time delay weight factors.

The Markov decision process is < S, A, R >;

s is a system state set:

s(n)＝{q _S (n),E _S (n),R _k,sec (n),L _k (n)}.

a is dynamic resource allocation and track action set:

r is a set of reward functions:

r(n)＝-U _c (n)+r _off (n)+r _p (n).

wherein: u (U) _c (n) r is an optimization target _off (n) offloading data rewards for users, r _p (n) a penalty for violating the constraint; l (L) _k (n) an amount of unprocessed data remaining for the user; v is the flying speed of the unmanned plane, theta is the polar angle,is a horizontal angle.

The method for carrying out system dynamic resource allocation and unmanned aerial vehicle track optimization by using the trained strategy network comprises the following steps:

after full training, when the accumulated prize value tends to be in a stable state, the training process is stopped;

the well-trained strategy network is deployed to the unmanned aerial vehicle base station platform to guide the unmanned aerial vehicle system to perform tasks in practice quickly and efficiently to minimize system user delay and energy consumption optimization goals.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the deep reinforcement learning based over-the-air edge computing data security transmission and resource allocation method.

The invention has the following beneficial effects:

1. according to the method for safely transmitting and distributing the data by the air edge calculation based on the deep reinforcement learning, the safety problem of unloading the data by the user under the condition that a malicious eavesdropper with uncertain positions exists in the air is considered, the data safety of the user is ensured, the calculation time delay and the energy consumption of the system user are reduced as much as possible, and the average calculation cost of the user is saved; the user can set corresponding weight according to own preference of energy and time delay;

2. according to the method, the random movement characteristics of the user and the 3D track flight characteristics of the unmanned aerial vehicle are considered, and under the condition that the real factors are considered, the dynamic resource allocation strategy and the 3D track of the unmanned aerial vehicle are designed to minimize the calculation cost of the system user; introducing NOMA transmission strategy to improve the utilization efficiency of the system spectrum;

3. because the provided optimization problem has the problem of optimization variable coupling and the solution space of decision is larger, and the movement of a user and an unmanned aerial vehicle causes the dynamic change of the environment, the provided optimization problem is difficult to solve by the traditional convex optimization method. Considering a high-dimensional continuous action space, the invention adopts a DDPG method to obtain an effective user resource allocation scheme and an unmanned aerial vehicle flight path planning strategy;

4. in case of uncertainty of the air eavesdropping location, the applicant considers the worst security case through the estimated eavesdropping range, and the 3D trajectory of the unmanned aerial vehicle, the transmitting power of the user and the calculation frequency are jointly designed to minimize the long-term average network calculation cost.

Drawings

FIG. 1 is a NOMA-based unmanned aerial vehicle assisted edge computing communication model;

fig. 2 is a network structure diagram of the DDPG algorithm;

FIG. 3a is a graph of the cumulative prize versus training round number;

FIG. 3b is a graph showing the variation of the calculation cost with the number of training rounds;

fig. 4a is a 3D trajectory of a drone;

fig. 4b is a 2D trajectory of the drone;

fig. 5 is a graph of cost versus time for different offloading schemes.

Detailed Description

The present invention will be further described with reference to the accompanying drawings, wherein the embodiments of the present invention are for the purpose of illustrating the invention only and not for the purpose of limiting the same, and wherein various substitutions and modifications are made by the person of ordinary skill in the art without departing from the technical spirit of the present invention, and are intended to be included in the scope of the present invention.

Deep Reinforcement Learning (DRL) is used as an effective solving method to solve the optimization problem of an unmanned aerial vehicle-assisted MEC communication system, feedback rewards are obtained through interaction with an unknown environment, and a better strategy is learned in the process of continuous trial and error.

The invention considers the problems of data safety transmission and resource allocation of the unmanned aerial vehicle auxiliary edge computing system under the condition of an air eavesdropper with uncertain position, and aims to reduce the overall time delay and energy consumption of the system user on the premise of guaranteeing the data safety of user unloading. Because the provided optimization problem is a non-convex multi-element joint problem, the problem of optimization variable coupling exists, the solution space of decision is large, the environment dynamic change is caused by the movement of a user and an unmanned aerial vehicle, and the solution is difficult to solve by adopting a traditional optimization method. Therefore, the invention provides a depth deterministic strategy gradient (DDPG) algorithm based on DRL to solve the 3D track and dynamic resource allocation strategy of the unmanned aerial vehicle.

P1:

s.t.C1:

C2:

C3:Z _min ≤z _S (n)≤Z _max ,

C4:λ _k,l (n)∈{0,1},

C5:

C6:

C7:

C8:

C9:

C10:

C11:

C12:E _S (N)≥0.

wherein C1-C3 respectively represent the constraint on the flying speed, collision avoidance and flying height of an Unmanned Aerial Vehicle (UAV); C4-C5 are the judgment of strong users and weak users in non-orthogonal multiple access (NOMA) communication; c6 is a user transmit power constraint; c7 is a minimum security rate set for guaranteeing the security of user unloading data; C8-C10 are calculation frequency limits distributed to users by the MEC server; c11 represents that the user has to process all data within a specified time; c12 is the energy consumption limit of UAV; the unmanned plane flight time T is equally divided into N time slots, and the length of each time slot is delta _t ＝T/N；q _S (n) is the position of the server unmanned aerial vehicle S, q _E (n) is the location of the eavesdropping drone E;maximum flight speed for S; d, d _min Is the minimum safe distance between S and E; z _S (n) is the flight level of S in the nth slot, where Z _min For the lowest flight altitude of the unmanned aerial vehicle, Z _max Is the maximum flying height; lambda (lambda) _k,l (n) is a binary variable representing the relationship of user channel strength, where l is the weak user relative to the k user; u (U) _k For ground user equipment (U) _k ,k＝1,2,…,K)，p _k (n) is U _k Transmit power at n slots, f _k (n) is U _k CPU frequency at n slots, P _max And->The local maximum transmitting power and the local maximum calculating frequency of the user are respectively calculated; r is R _k,sec (n) is U _k Instantaneous safe offload rate at nth time slot, < >>A security threshold for user offloading; />For the maximum calculated frequency of S, f _Sk (n) assigning S to U _k Is a calculated frequency of (2);unloading data volume for nth time slot user; c (C) _S The number of CPU turns required for calculating 1 bit data for S; e (E) _S (N) remaining energy for last time slot S of unmanned aerial vehicle flight;

The method for safely transmitting the data and distributing the resources by the air edge calculation based on the deep reinforcement learning provided by the invention considers the safety problem of unloading the data by the user under the condition that a malicious eavesdropper with uncertain positions exists in the air, reduces the calculation time delay and the energy consumption of the system user as much as possible while ensuring the safety of the data of the user, and saves the average calculation cost of the user. And a DDPG method is adopted to obtain an effective user resource allocation scheme and an unmanned aerial vehicle flight path planning strategy for solving, so that the problem that the provided optimization problem has optimized variable coupling is solved, the decision solution space is large, the problem of environmental dynamic change caused by the movement of a user and an unmanned aerial vehicle is solved, and the control problem of high-dimensional and continuous action space in an unmanned aerial vehicle-assisted mobile edge computing system scene is solved.

The unmanned aerial vehicle auxiliary edge calculation model under the condition of considering the existence of an air eavesdropper is specifically constructed by the following steps:

as shown in fig. 1, the present invention contemplates an unmanned aerial vehicle-assisted edge computing system, where S is an unmanned aerial vehicle equipped with an MEC server, acting as an over-the-air edge computing server, providing edge computing services for ground mobile user equipment. Due to the ground user equipment (U) _k K=1, 2, …, K) is limited in computing and storage capacity, and a plurality of ground mobile users offload data to S to reduce the time delay and energy consumption of the users. Meanwhile, an over-the-air potential eavesdropping E with uncertain position exists in a certain area in the air, and continuous attempts are made to acquire data about unloading of a ground user to S.

In the unmanned aerial vehicle auxiliary edge calculation model, as the exact position of the eavesdropping E cannot be accurately obtained, by scaling the distance between a user and the E, the worst safety condition of the system is considered, and at the moment, the eavesdropping capacity of the E is strongest, so that the eavesdropping range of the E is estimated; setting a ground friendly jammer J to send an interference signal to E so as to inhibit eavesdropping; the ground user equipment U _k And unloading data to S by adopting a NOMA communication mode so as to improve the utilization rate of spectrum resources of the system.

To suppress eavesdropping and interference of eavesdropping E on the communication link, the invention provides a ground-friendly jammer J to send an interference signal to E to suppress eavesdropping. Assuming that S is able to fully decode the interference signal sent by J without being affected in the system, all devices are equipped with a single antenna and operate in full duplex mode.

The unmanned plane flight time T is equally divided into N time slots, and the length of each time slot is delta _t =t/N. The positions in slots n, S and E are denoted q, respectively _S (n)＝[x _S (n),y _S (n),z _S (n)] ^T ，q _E (n)＝[x _E (n),y _E (n),z _E (n)] ^T 。U _k And J is at position w _Uk (n)＝[x _k (n),y _k (n)] ^T ，w _J (n)＝[x _J (n),y _J (n)] ^T . Since the exact location of E is not known, the eavesdropping range at which E's eavesdropping capability is the greatest is estimated, including assuming that E is hidden at a rate of q' _E (n)＝(x′ _E (n),y′ _E (n),z′ _E (n)) as the center of a circle and the radius r _E And r is within the circular region of _E Satisfy ||q _E (n)-q′ _E (n)||≤r _E The method comprises the steps of carrying out a first treatment on the surface of the Wherein x' _E (n)、y′ _E (n)、z′ _E (n) coordinates of the eavesdropping center position in the x-axis, y-axis and z-axis, respectively.

By means of a spherical coordinate systemDescribing the speed and flight direction of S, wherein v is the unmanned aerial vehicle flight speed, θ is the polar angle, +.>Is a horizontal angle. Definition d _S (n)＝||q _S (n+1)-q _S (n) |, then the location update rule for S is:

the invention considers a probability path loss model, U _k The line of sight link (LoS) connection probability with the drone is:

where u.epsilon.S, E,q _u (n) is the position of the unmanned plane u in n time slots, and zu (n) is the flying height of u, eta _a And eta _b Is an environment-dependent parameter.

U _k The average path loss expression with u is:

wherein,the path loss for Los link, denoted +.> For the pathloss of NLos link, denoted +.>Free space path loss ofd _k,u (n) is the distance between the nth time slot user k and the unmanned plane u, f _c Is the carrier frequency, c is the speed of light, eta _LoS ,η _NLoS Excessive path loss for LoS and NLoS, respectively.

Thus U _k The channel gain with u is:

U _k the signal-to-interference-and-noise ratio with S is:

wherein h is _k,S (n) is the channel gain, h between the nth time slot user k and the unmanned S _l,S (n) is the channel gain between weak users l and S, p _l (n) is the transmit power of user i in the nth slot. P is p _k (n) is U _k Is used for the transmission power of the (c),is U (U) _k Additive white gaussian noise between S and lambda _k,l (n) for binary variables representing the relationship between the user channel strengths, the following is defined:

wherein lambda is _k,l (n)+λ _l,k (n)＝1。λ _l,k (n) is the channel strength relationship between user L and user k, L _k,S (n) is the average path loss between users k and S in the nth slot, L _l,S (n) is the average path loss between users l and S. U (U) _k The instantaneous achievable offloading rate from S is:

R _k,S (n)＝Blog ₂ (1+r _k,S (n)), (7)

wherein B is the channel bandwidth, r _k,S (n) is the nth time slot U _k Signal to interference plus noise ratio with S.

Similarly, U can be obtained _k The instantaneous achievable offload rate with E is:

R _k,E (n)＝Blog ₂ (1+r _k,E (n)), (8)

wherein,K＝{U _z |,L _z,E (n)≥L _k,E (n) is that the channel gain at eavesdropping node E is less than user U _k A weak set of users of channel gain. h is a _k,E (n) is the nth time slot U _k Channel gain with unmanned plane E, h _z,E (n) is the channel gain between weak users z and E, h _J,E (n) is the channel gain between J and E, P _J Transmit power of J, p _z (n) is the transmit power of the weak user, < >>Is U (U) _k And E.

Thus U _k The instantaneous safe offload rate at the nth time slot is:

wherein [ x ]] ⁺ ＝max{x,0}。

To ensure the security of user offloaded data, we define a minimum secure offload rateR _k,sec (n) should be greater thanExpressed as:

suppose U _k With L _k The data amount of the (2) needs to be processed within the flight time T, and the user cannot locally calculate all task data due to limited calculation capacity and battery capacity of the user equipment, so that the data is unloaded to S to reduce the time delay and energy consumption of the user equipment. It is assumed that task data is bit independent and can be arbitrarily split into local computations and offloaded computations that can be performed in parallel.

1) Local computing

The nth time slot, the data volume calculated locally by the user is:

the consumption energy of the local calculation of the user is as follows:

wherein C is _k The number of CPU turns, f, required to calculate 1 bit of data _k (n) is U _k At the CPU frequency of the n-slot,is U (U) _k Is effective in terms of the effective capacitance coefficient.

2) Computing offloading

The nth time slot, the user offloads the data volume as follows:

the user unloading energy consumption is as follows:

s is allocated to U _k The calculated frequency of (2) is:

wherein C is _S The number of CPU turns required for 1 bit data is calculated for S.

Nth time slot, S process U _k The energy consumed to offload data is:

wherein,is the effective capacitance coefficient of S.

Since the transmit power of S is typically much greater than the transmit power of the user and the amount of data of the task calculation is much less than the amount of data generated by the user, the time delay and energy consumption of returning the calculation is ignored.

U _k The remaining data in the n+1 time slot is expressed as:

wherein L is _k (n) is U _k Remaining data at nth time slot, and L _k [0]＝L _k 。

T _k (n) is defined as the remaining data state (not 0, i.e. 1) of the user in each time slot, expressed as:

thus U _k Is of (1)Wherein N is _k Is U (U) _k Treated L _k Total number of slots of data.

The sum of the data amounts offloaded by all users should be no greater than S can process the data amount at each time slot, which is:

wherein,the frequency is calculated for S maximum.

The server unmanned aerial vehicle S adopts a rotor unmanned aerial vehicle, and the propulsion power consumption of the rotor unmanned aerial vehicle in the nth time slot is as follows:

wherein P is ₀ And P _i Respectively the blade profile power and the induced power in a hovering state; u (U) _tip Is the tip speed of the rotor wing; v ₀ Is the average induction speed of the rotor wing in a hovering state; d, d ₀ And rho, s, A are fuselage resistance ratio, air density, rotor solidity, and rotor disk area, respectively.

The energy consumption of the rotor unmanned aerial vehicle S in the nth time slot is as follows:

let S be the battery capacityThe remaining energy of the n slots S is:

wherein,processing U for ith time slot S _k Energy consumed for offloading data, +.>The flight energy consumption for the ith slot S.

In order to ensure that the task data of all users can be processed before the S energy is exhausted, the following conditions should be satisfied:

E _S (N)≥0. (23)

on the premise of ensuring the safety of user data, calculating the weighted time delay and energy consumption of the system user, and taking the weighted time delay and energy consumption as the system optimization targets, wherein the system optimization targets also comprise corresponding energy and time delay weight factors respectively, and the user can set corresponding weights according to the preference of the user to the energy and the time delay.

The method comprises the following steps:

when each U _k After the completion of the calculation task of (1), all U's of the system _k The energy consumption of (2) is expressed as:

the time delay of all users in the system is as follows:

we will consume energy E _c And execution delay T _c Represents the average computational cost for the system user, namely:

wherein c _E And c _T Representing the unit cost, ω, of energy consumption and time delay, respectively ₁ ∈[0,1]And omega ₂ ∈[0,1]For the corresponding weight factor, omega is satisfied ₁ +ω ₂ ＝1。

By combining the above, constructing a dynamic resource allocation and track joint optimization problem (namely an unmanned plane-edge calculation model optimization problem);

the mathematical expression of the system user calculation cost minimization objective function is as follows:

the dynamic resource allocation and trajectory joint optimization problem (i.e., the unmanned aerial vehicle-edge computing model optimization problem) is modeled as a markov decision process with a minimum system optimization objective. The Markov decision process is < S, A, R >, wherein S is a system state set, A is a dynamic resource allocation and track action set, and R is a reward function set.

The state set is:

s(n)＝{q _S (n),E _S (n),R _k,sec (n),L _k (n)}. (28)

the action set is as follows:

the reward function is:

r(n)＝-U _c (n)+r _off (n)+r _p (n). (30)

wherein: u (U) _c (n) r is an optimization target _off (n) offloading data rewards for users, r _p (n) a penalty for violating the constraint; l (L) _k (n) user U with nth time slot _k The amount of unprocessed data remains.

r _off (n) is expressed as:

wherein, kappa _f Positive integers are awarded for adjusting roff (n).

r _p (n) is defined as:

r _p (n)＝-λ _ac (n)κ _ac -λ _rc (n)κ _rc +κ _er (n) (32)

wherein, kappa _ac ,κ _rc ,κ _er (n) represents penalties for constraint (formula 27C 2), unmanned aerial vehicle calculation frequency constraint (formulas 27C8 to 27C 10), and unmanned aerial vehicle remaining energy constraint (formula 27C 12), respectively. Lambda (lambda) _ac (n) and lambda _rc (n) is a binary coefficient, lambda _ac (n) =1 indicates that the constraint (formula 27C 2) is not satisfied. Similarly, lambda _rc (n) is a binary coefficient related to unmanned aerial vehicle computing resources, κ _er (n) is forJudging sparse rewards related to energy surplus of the unmanned aerial vehicle after all user data are processed, and defining the sparse rewards as follows:

wherein ζ is used for adjusting κ _er Normal number of (n). When E is _S (n) > 0, will give positive rewards to the agent, when E _S (n) < 0, a negative penalty will be given to the agent.

At each time step, the smart will select an action a (n) and jump to the next state s (n+1) based on the current state s (n) of the environment, and then the smart will get a prize r (n). (s (n), a (n), r (n), s (n+1)) are stored as experiences in an experience pool and randomly sampled to train the strategic and value neural networks to approximate the action and value functions, respectively.

In order to solve the problem that the action space is high-dimensional and continuous, a DDPG algorithm in deep reinforcement learning is used for jointly optimizing a 3D track and a dynamic resource allocation strategy of the unmanned aerial vehicle so as to reduce the weighted time delay and the energy consumption of a user.

updating policy network weights θ based on gradient direction to maximize discount jackpot ^μ ；

Updating the weight θ of a value network by minimizing a loss function ^Q ；

The DDPG algorithm is adopted to solve the optimization problem, and the method specifically comprises the following steps:

as shown in fig. 2, the DDPG algorithm employs an actor-critter architecture that contains a policy network, a value network, a target policyNetwork, target value network. Wherein the policy network takes as input the current environmental state s (n) and then based on the weight θ of the policy network ^μ Random noise N _t Outputting corresponding actions a (n) into an unmanned aerial vehicle assisted edge computing network based on a non-orthogonal multiple access (NOMA) technology, wherein the current a (n) is expressed as

a(n)＝μ(s(n)；θ ^μ )+N _t (34)

Wherein N is _t For exploring random noise, the method can increase the randomness and diversity of actions, so that the state space is better explored, and a better strategy is found. μ is a function approximating the strategy network, and random noise gradually decreases as training progresses, allowing the strategy network to gradually transition from exploration to utilization to better utilize learned knowledge and experience. It is worth noting that the range of action a (n) is limited to [0,1]]Therefore, if the action value after the noise is superimposed exceeds the range limited by us, the action value will be cut. Then, the agent executes the current action a (n), the environment transits to the next state s (n+1) according to a certain state transition probability, and at the same time, the environment feeds back a reward r (n) to the agent according to the state at the moment.

In order to solve the problem that data samples are independent and distributed at the same time, the DDPG algorithm adopts an empirical playback mechanism to eliminate the correlation between the samples. Specifically, the agent stores the current experience (s (n), (a) n, (s+) n1 ()) r at a size M _r And randomly sampling small batches of M from B _b The neural network is trained with empirical samples.

Furthermore, (s (n), a (n)) will be input into the value network to evaluate a (n), weighted by θ ^Q Will output the estimated Q value Q (s (n), a (n); θ ^Q ) Wherein the Q value is the expected long-term prize. In order to make the algorithm training process more stable and easier to converge, a weight of θ is used in the DDPG algorithm ^μ′ Target policy network of (c) and weight θ ^Q′ Is a target value network of (1).

Based on the policy gradient theorem, the DDPG algorithm aims to maximize the discount jackpot J (μ), and therefore, according to the equation48 Along the boost action value Q (s (n), a (n); θ ^Q ) Gradient direction update policy network weights θ ^μ

Wherein,for objective function J (μ) with respect to policy network weight θ ^μ Gradients, s _t For the current state +.>Q (s, a; θ) ^Q ) Gradient with respect to motion.

By minimizing the loss function L (θ ^Q ) Updating value networks

L(θ ^Q )＝(y _i -Q(s _i ,a _i ；θ ^Q )) ² (36)

Wherein s is _i ，a _i The state and action of extracting small batches of data from the experience pool, respectively.

Wherein y is _i The target Q value output for the target value network is defined as

y _i ＝r _i +γQ(s _i+1 ,μ′(s _i+1 ；θ ^μ′ )；θ ^Q′ ) (37)

Wherein r is _i For immediate rewards, gamma is the discount factor, θ ^μ′ And (3) weighting the target policy network, wherein mu' is the target policy network.

In terms of the target network, the target network is basically a copy of the original network, but the update frequency is slower, so that the strategy network and the value network can be trained more stably. The soft update policy is:

θ ^Q′ ←τθ ^Q +(1-τ)θ ^Q′

θ ^μ′ ←τθ ^μ +(1-τ)θ ^μ′

where τ is the soft update parameter.

The following steps are simulation experiment steps of the invention:

step S1: and constructing the unmanned plane-mobile edge computing system.

As shown in fig. 1, in order to implement the unloading function of unmanned aerial vehicle assisted edge calculation based on deep reinforcement learning, this step constructs an unmanned aerial vehicle assisted MEC system, in which an unmanned aerial vehicle is equipped with a MEC server, and provides an unloading data calculation service for ground multi-users in a specified rectangular area, while taking into account the mobility of the users and the factors of the position uncertainty of the air eavesdropper. According to the method for safely transmitting and distributing the data of the air edge calculation based on the deep reinforcement learning, the training process is carried out in a simulated urban virtual environment, and the server unmanned aerial vehicle processes the unloading data of the ground user and the virtual unmanned aerial vehicle is carried out in the virtual urban environment. Therefore, the virtual city environment should be modeled before the deep reinforcement learning network framework is constructed.

In the invention, the communication link between the ground user and the unmanned aerial vehicle is a probabilistic line-of-sight link model, the channel gains between the user and the server unmanned aerial vehicle and between the user and the eavesdropping unmanned aerial vehicle can be calculated respectively through formulas (2) - (4), and the strength relation of the user is judged according to the channel gain, wherein in the uplink NOMA, the unmanned aerial vehicle can decode the data of the strong user preferentially, and the weak user can cause interference to the strong user. According to the strong and weak relation of the user and the channel noise power, the transmission rate between the user and the server unmanned aerial vehicle and the transmission rate between the user and the eavesdropping unmanned aerial vehicle can be calculated by the formulas (4) - (8), and the instantaneous reachable safe rate of the user can be obtained by taking the difference between the transmission rate and the transmission rate. When the instantaneous safety rate of the user is larger than the minimum safety rate threshold value specified by the user, the user can unload the data, and the data unloaded by the user is absolutely safe. When the instantaneous safe rate of the user does not reach the minimum safe rate threshold value, the data of the user will not be unloaded and only calculated locally at the time slot.

In the edge calculation model, the local calculation and the unloading calculation of the users are performed simultaneously, the local calculation data quantity and the local calculation energy consumption of each time slot user are calculated according to formulas (11) - (12), the unloading data quantity and the unloading energy consumption of each time slot user are calculated according to formulas (13) - (14), and the calculated total time delay and total energy consumption of the users are weighted as the system optimization target.

The rotary wing unmanned aerial vehicle adopted in the invention considers the energy consumption condition of the unmanned aerial vehicle so as to ensure that the data of all users can be processed before the energy consumption of the unmanned aerial vehicle. The unmanned energy consumption mainly comprises two parts of energy consumption of calculation energy consumption and flight energy consumption, wherein the calculation energy consumption is calculated as formula (16), and the flight energy consumption is calculated as formulas (20) - (21).

Taking the factors into consideration, the optimization problem of the constructed unmanned plane-edge calculation model is shown in a formula (27).

Step S2: and converting the optimization problem of the unmanned aerial vehicle-edge calculation model into a Markov decision process.

The specific implementation method of the steps is as follows: and converting the optimization problem into a triplet S, A and R, wherein the state space S comprises the position of the unmanned aerial vehicle, the residual energy of the unmanned aerial vehicle, the instantaneous safety rate of the user and the total amount of unprocessed data remained by the user. The action space A comprises the vector speed of the unmanned plane, the transmitting power of the user and the calculating frequency of the user. The design of the rewarding function is shown in a formula (30), and the designed rewarding function consists of three parts, namely, a negative value of an optimization target, rewarding of user unloading data and punishment of an agent against constraint conditions.

Step S3: and solving the optimization problem by adopting a DDPG algorithm, and constructing a DDPG algorithm network framework comprising a strategy network, a value network, a target strategy network and a target value network.

The DDPG algorithm network frame is shown in fig. 2, and comprises 4 neural networks, namely a strategy network, a value network, a target strategy network and a target value network. Wherein the policy network will select an action a (n) based on the current state s (n), the output actions a (n) and s (n) will be used as inputs to evaluate the incoming value network, the value network will score the action a (n) as Q (n), a (n), θ ^Q ). The state s (n+1) will be the input to the target policy network, which will select actions a (n+1) based on the input, s (n+1) and a (n+1) will evaluate as inputs to the target value network, denoted Q' (s (n+1), a (n+1); θ ^Q′ ). Calculating a loss function according to equation (49), updating the weight θ of the value network by minimizing the loss function ^Q And updating the policy network weights θ based on gradient directions that maximize the discounted jackpot ^μ . Finally, the weight of the target network is updated according to the soft update strategy of the formula (51).

Experience data generated in the process of continuously interacting with the environment by the intelligent agent are stored in the experience pool, and after the experience pool is filled, training and updating are carried out on the strategy network and the value network respectively by randomly extracting sample data in the experience pool in small batches, so that weight parameters of the corresponding neural network are continuously optimized. Until the jackpot value has stabilized, at which point the training process of the network is completed.

Step S4: and designing each neural network architecture in the DDPG algorithm.

The invention uses the fully connected neural network structure as a framework of a strategy network, a value network, a target strategy network and a target value network.

The strategy neural network and the value neural network are all 6-layer fully-connected neural networks, wherein the strategy neural network and the value neural network comprise four hidden layer networks, and the numbers of hidden layer neurons are 128, 256, 256 and 128 respectively. The activation functions of all layers except the output layer activation function of the policy network are sigmod and the activation functions of all other layers are ReLU functions. It is worth to say that the state of the input neural network comprises the unmanned plane position, the unmanned plane residual energy, the user reachable safety rate and the user residual data, and the dimensions are 3,1, K and K respectively. In the simulation, k=5, which results in that the dimension representing the state of the unmanned aerial vehicle occupies a relatively small area in the whole state dimension, and may not well influence the training process of the neural network. However, the position and remaining energy of the drone is important, and by adjusting the trajectory of the drone, it is necessary to ensure that the task data of all users is efficiently processed before the drone's energy is exhausted. Therefore, there is a dimension balancing problem, we extend the unmanned plane position from 3 to 8 dimensions and the unmanned plane residual energy from 1 to 5 dimensions to ensure that each input state is at a level. After dimension expansion, the policy network input layer has 2K+13 neurons, and the value network has 4K+16 neurons. Since the range of values for each state is not an order of magnitude, we normalize the state values to [0,1] to speed up the training speed of the neural network.

Step S5: and (3) carrying out system dynamic resource allocation and unmanned aerial vehicle track optimization by adopting a trained strategy network.

Once sufficiently trained, the jackpot value tends to steady state, i.e., the jackpot value reaches a maximum and is in a small range of fluctuations, the training process stops. At this time, the trained and complete strategy network is directly deployed to the optimization module, that is, the unmanned aerial vehicle base station platform, so as to guide the unmanned aerial vehicle system to perform tasks in practice rapidly and efficiently, so as to minimize the time delay of the system user and the energy consumption optimization target.

FIG. 3a is a plot of cumulative rewards versus training round number, and FIG. 3b is a plot of user calculation costs versus training round number, demonstrating the convergence and learning performance of NOMA and OMA in the proposed DDPG-based secure UAV-MEC algorithm. It can be seen from the two figures that under both NOMA and OMA schemes, the reward function we design can effectively reduce the cost of the user and the convergence is stable.

Fig. 4a is a 3D trajectory diagram of the unmanned aerial vehicle, fig. 4b is a 2D trajectory diagram of the unmanned aerial vehicle, and as can be seen from the overall movement trend, the unmanned aerial vehicle flies first in the horizontal plane and approaches the users to increase the overall unloading rate of all users, and then when only one user' S data remains unprocessed (user 4 in the figure), the unmanned aerial vehicle selects the elevation to increase the elevation angle with the users, thereby increasing the channel gain between the users and S.

Fig. 5 is a graph of the trend of average cost of a user as a function of user data under different offloading schemes. As can be seen from the figure, the cost of the 3 schemes increases to different extents as the user data increases, and the larger the user data, the greater the cost difference between the full local calculation and the offload calculation. And the computational cost of employing the NOMA scheme is always lower than that of other computational schemes.

The method and the device can be applied to the application scene of unmanned aerial vehicle-assisted high-density communication with the risk of wireless data leakage, and can reduce the average calculation cost of the user on the premise of ensuring the data safety. The method can be particularly applied to the fields of outdoor real-time live broadcasting, disaster association, field investigation and the like. Expected benefits include improved communication availability, reduced network latency, increased data analysis speed, and wider service coverage. This will help to increase production efficiency, enhance emergency response and improve decision making, thereby achieving higher benefits and efficiencies in various areas.

Claims

1. The method for safely transmitting and distributing the data by the air edge calculation based on the deep reinforcement learning is characterized by comprising the following steps:

C3:Z _min ≤z _S (n)≤Z _max ,

C4:λ _k,l (n)∈{0,1},

C12:E _S (N)≥0.

wherein C1-C3 respectively represent the constraint on the flying speed, collision avoidance and flying height of an Unmanned Aerial Vehicle (UAV); C4-C5 are the judgment of strong users and weak users in non-orthogonal multiple access (NOMA) communication; c6 is a user transmit power constraint; c7 is a minimum security rate set for guaranteeing the security of user unloading data; C8-C10 are calculation frequency limits distributed to users by the MEC server; c11 represents that the user has to process all data within a specified time; c12 is the energy consumption limit of UAV; the unmanned plane flight time T is equally divided into N time slots, and the length of each time slot is delta _t ＝T/N；q _S (n) is the position of the server unmanned aerial vehicle S, q _E (n) is the location of the eavesdropping drone E;maximum flight speed for S; d, d _min Is the minimum safe distance between S and E; z _S (n) is the flight level of S in the nth slot, where Z _min For the lowest flight altitude of the unmanned aerial vehicle, Z _max Is the maximum flying height; lambda (lambda) _k,l (n) is a binary variable representing the relationship of user channel strength, where l is the weak user relative to the k user, λ _l,k (n) is the channel strength relationship between user l and user k; u (U) _k For ground user equipment (U) _k ,k＝1,2,…,K)，p _k (n) is U _k Transmit power at n slots, f _k (n) is U _k CPU frequency at n slots, P _max And F _k ^max The local maximum transmitting power and the local maximum calculating frequency of the user are respectively calculated; r is R _k,sec (n) is U _k Instantaneous safe offload rate at nth time slot, < >>A security threshold for user offloading; />For the maximum calculated frequency of S, f _Sk (n) Assigning S to U _k Is a calculated frequency of (2); />Unloading data volume for nth time slot user; c (C) _S The number of CPU turns required for calculating 1 bit data for S; e (E) _S (N) remaining energy for last time slot S of unmanned aerial vehicle flight;

2. The method for secure transmission and resource allocation of data calculated by an air edge based on deep reinforcement learning according to claim 1, wherein the method for solving by adopting a DDPG algorithm in the deep reinforcement learning to jointly optimize dynamic resource allocation and a 3D track strategy of an unmanned aerial vehicle comprises the following steps:

Updating the weight θ of a value network by minimizing a loss function ^Q ；

3. The deep reinforcement learning-based over-the-air edge computation data security transmission and resource allocation method of claim 2, wherein the soft update strategy is:

θ ^Q′ ←τθ ^Q +(1-τ)θ ^Q′

θ ^μ′ ←τθ ^μ +(1-τ)θ ^μ′

where τ is the soft update parameter.

4. The method for secure transmission and resource allocation of data calculated on the basis of deep reinforcement learning as claimed in claim 1 or 2, wherein the DDPG algorithm uses an empirical playback mechanism to eliminate correlation between samples.

5. The method for secure transmission and resource allocation of data calculated by an air edge based on deep reinforcement learning according to claim 1, wherein in the unmanned aerial vehicle auxiliary edge calculation model, by scaling the distance between a user and E, the worst security condition of the system is considered, and the eavesdropping range when the eavesdropping capability of E is the strongest is estimated; setting a ground friendly jammer J to send an artificial interference signal to E so as to inhibit eavesdropping; the ground user equipment U _k And unloading data to S by adopting a NOMA communication mode.

6. The method for secure transmission and resource allocation of data calculated over-the-air edges based on deep reinforcement learning of claim 5, wherein estimating the eavesdropping range of E when the eavesdropping capability is the strongest, comprises assuming that E is hidden at a rate of q' _E (n)＝(x′ _E (n),y′ _E (n),z′ _E (n)) as the center of a circle and the radius r _E And r is within the circular region of _E Satisfy ||q _E (n)-q′ _E (n)||≤r _E The method comprises the steps of carrying out a first treatment on the surface of the Wherein x' _E (n)、y′ _E (n)、z′ _E (n) coordinates of the eavesdropping center position in the x-axis, y-axis and z-axis, respectively.

7. The method for secure transmission and resource allocation of data calculated on the basis of deep reinforcement learning according to claim 1, wherein the time delay and energy consumption weighted by the user of the computing system are calculated, and the method further comprises the step of providing corresponding energy and time delay weight factors respectively.

8. The method for secure transmission and resource allocation of data calculated on the basis of deep reinforcement learning according to claim 1, wherein the markov decision process < S, a, R >;

s is a system state set:

s(n)＝{q _S (n),E _S (n),R _k,sec (n),L _k (n)}.

a is dynamic resource allocation and track action set:

r is a set of reward functions:

r(n)＝-U _c (n)+r _off (n)+r _p (n).

9. The method for secure transmission and resource allocation of data calculated on the air edge based on deep reinforcement learning according to claim 1, wherein the system dynamic resource allocation and unmanned aerial vehicle trajectory optimization are performed by using a trained strategy network, comprising the following steps:

10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the deep reinforcement learning based over-the-air edge computing data secure transmission and resource allocation method of any one of claims 1 to 9.