CN114065963A - Computing task unloading method based on deep reinforcement learning in power Internet of things

Computing task unloading method based on deep reinforcement learning in power Internet of things

Info

Publication number
CN114065963A
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
time slot
current
network
Prior art date
Legal status
Pending
Application number
CN202111297200.4A
Other languages
Chinese (zh)
Inventor
赵楠
任凡
杜威
叶智养
Current Assignee
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202111297200.4A priority Critical patent/CN114065963A/en
Publication of CN114065963A publication Critical patent/CN114065963A/en

Classifications

    • G06Q 10/20 Administration; Management; Administration of product repair or maintenance
    • G06F 18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/295 Pattern recognition; Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06Q 10/0631 Operations research, analysis or management; Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q 50/06 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors; Energy or water supply


Abstract

The invention discloses a computing task offloading method based on deep reinforcement learning in the power Internet of Things, which aims to minimize energy consumption and delay by jointly optimizing the position of the unmanned aerial vehicle (UAV), its transmit power, and the task partition variables. First, a power Internet of Things system model oriented to power transmission line inspection is established, involving the interaction among the acquisition devices, the inspection UAV, and the edge servers, and the computing task offloading problem is formulated. Then, in view of the non-convexity of the offloading problem, a Markov decision process is constructed by designing the state space, the action space, and the reward function. On this basis, since the Markov model has a continuous action space, a twin delayed deep deterministic policy gradient (TD3) algorithm is proposed to obtain the optimal task offloading strategy.

Description

Computing task unloading method based on deep reinforcement learning in power Internet of things
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a calculation task unloading method based on deep reinforcement learning in an electric power Internet of things.
Background
The power Internet of Things is an application of Internet of Things technology in the smart grid and plays a role in power generation, transmission, transformation, distribution, and consumption. It integrates communication infrastructure resources with power infrastructure resources and uses advanced information and communication technologies to exchange information among interconnected devices, thereby further improving the overall efficiency of the power system. Specifically, the power Internet of Things deploys different types of sensing devices (such as global positioning systems, cameras, infrared sensors, and radio frequency identification devices) and advanced metering infrastructure in different geographic areas, so that it can sense the physical world, acquire and monitor data for processing, and make intelligent decisions on energy management schemes.
However, the data collected by the power internet of things is likely to be huge and highly heterogeneous, which puts strict requirements on communication and computing resources. Furthermore, some real-time demand-side management schemes are typically delay sensitive, which makes the data processing task more difficult. Despite the recent increase in computing power of smart devices, complex tasks are still difficult to handle under strict latency constraints. Fortunately, mobile edge computing is an ultra-low latency technology capable of data analysis near the data source, pushing the leading edge of services and data from a centralized cloud to the edge of a mobile network with computing and storage resources, and then supporting resource intensive and delay sensitive applications with cloud computing deployed at the edge of the mobile network.
However, when mobile edge computing services are provided for devices in the power Internet of Things, a dedicated wireless connection is required between the edge server and the devices, and signal blockage, shadowing, and similar effects make it difficult for terrestrial wireless links to deliver effective mobile edge computing services. To this end, unmanned aerial vehicles (UAVs) are deployed as small edge clouds with embedded computing modules. In the power Internet of Things system, the devices need to process the generated data quickly, but their communication, computing, and storage resources are very limited, so the UAV provides computing services for them with its onboard computing module. However, given the limited battery life and computing power of UAVs, an efficient computing task offloading method needs to be designed.
Disclosure of Invention
In order to overcome the non-convexity of the existing computing task offloading problem, the invention aims to provide a deep-reinforcement-learning-based method for optimizing the delay and energy consumption of the UAV-served power Internet of Things.
In order to achieve the purpose, the invention adopts the technical scheme that: a computing task unloading method based on deep reinforcement learning in an electric power Internet of things comprises the following steps:
Step 1: establish the Markov model used by the subsequent deep reinforcement learning algorithm for the flight position of the inspection unmanned aerial vehicle (UAV):
model the quintuple (S, A, O, R, γ) of the Markov decision process (MDP), wherein S is the input state set of the inspection UAV, A is the output action set of the inspection UAV, O is the state transition probability function, R is the reward function, and γ is the discount coefficient;
Step 1.1: define the input state set S of the inspection UAV and determine the input state of the inspection UAV at each time slot t as s_t = [u_t, h_t], where u_t is the horizontal coordinate of the inspection UAV at time slot t and h_t is the height of the inspection UAV at time slot t; s_t belongs to the input state set S;
Step 1.2: define the output action set A of the inspection UAV, where A represents the set of all actions the inspection UAV can take for its input state after receiving the following external feedback: when the inspection UAV flies out of the power transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flying height h_t of the inspection UAV exceeds its height range, it remains at the minimum height H_min or the maximum height H_max; once the inspection UAV covers all the acquisition devices, it stays still and prepares to adjust its transmit power; when the task partition variables exceed their limit (the fractions processed locally and on the edge servers must not sum to more than one), the inspection UAV performs a normalization operation to obtain valid task partition variables β_{nm}^t again; the output action of the inspection UAV at each time slot t is a_t = {l_t, Φ_t, Δh_t, p_t, β_{nm}^t};
wherein l_t ∈ [0, l_1] denotes the horizontal distance the inspection UAV flies in time slot t, and l_1 denotes the maximum horizontal distance the inspection UAV can fly in each time slot t; Φ_t ∈ [0, 2π] denotes the direction angle of the inspection UAV at time slot t; Δh_t = (h_t − h_{t−1}) ∈ [−l_2, l_2] denotes the vertical movement distance of the inspection UAV in time slot t, h_t and h_{t−1} respectively denote the height of the inspection UAV at time slot t and at the previous time slot, and l_2 denotes the maximum vertical distance the inspection UAV can fly in each time slot t; p_t ∈ [0, P] denotes the transmit power of the inspection UAV at time slot t, and P denotes its maximum transmit power; β_{nm}^t denotes the fraction of the fault identification and foreign object detection task received from the nth acquisition device that is processed on the mth edge server, i.e. the task partition variables of the inspection UAV at time slot t are {β_{nm}^t, n ∈ N, m ∈ M}, where N and M respectively denote the sets of acquisition devices and edge servers in the system;
Step 1.3: define the probability that the inspection UAV, starting from the input state s_t of the current time slot t and taking action a_t, reaches the next input state s_{t+1} as the state transition probability function O;
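As an illustration only, the following is a minimal Python/NumPy sketch of how the state and action quantities of steps 1.1-1.3 might be represented and post-processed; the bounds H_MIN, H_MAX, L1, L2, P_MAX and the renormalization of the task partition variables are assumptions for illustration, not values fixed by the patent.

```python
import numpy as np

H_MIN, H_MAX = 50.0, 150.0      # assumed height range [H_min, H_max] of the inspection UAV
L1, L2, P_MAX = 20.0, 5.0, 1.0  # assumed per-slot horizontal/vertical limits and max transmit power

def clip_action(l, phi, dh, p, beta_nm, beta_n0):
    """Map a raw action onto the feasible set described in step 1.2."""
    l = np.clip(l, 0.0, L1)            # horizontal flight distance l_t in [0, l1]
    phi = np.mod(phi, 2 * np.pi)       # direction angle Phi_t in [0, 2*pi)
    dh = np.clip(dh, -L2, L2)          # vertical movement Delta h_t in [-l2, l2]
    p = np.clip(p, 0.0, P_MAX)         # transmit power p_t in [0, P]
    # Task partition variables: if the local + edge fractions of a device exceed 1,
    # renormalize them (the corrective operation of step 1.2, assumed here to be a normalization).
    beta_nm = np.clip(beta_nm, 0.0, 1.0)   # shape (N, M)
    beta_n0 = np.clip(beta_n0, 0.0, 1.0)   # shape (N,)
    total = beta_n0 + beta_nm.sum(axis=1)
    over = total > 1.0
    beta_nm[over] /= total[over, None]
    beta_n0[over] /= total[over]
    return l, phi, dh, p, beta_nm, beta_n0

def next_state(u, h, l, phi, dh):
    """State transition of the UAV position s_t = [u_t, h_t] (step 1.3)."""
    u_next = u + l * np.array([np.cos(phi), np.sin(phi)])
    h_next = np.clip(h + dh, H_MIN, H_MAX)   # stay at H_min/H_max when the height bound is hit
    return u_next, h_next
```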
Step 1.4: define the reward function R, which represents the instantaneous feedback obtained after the inspection UAV selects action a_t in input state s_t at the current time slot t. When the inspection UAV does not yet cover all the acquisition devices, the reward is dominated by the coverage penalty term weighted by ξ_1; once all the acquisition devices are covered, the reward is the negative of the sum, over all acquisition devices, of the total service delay T_t^n and the total energy consumption E_t^n, together with the penalty ξ_n^2 whenever the task partition of the nth acquisition device exceeds its limit.
Here the horizontal direction angle Φ_t and the height h_t of the inspection UAV determine the radius R_t of the range covering the acquisition devices. If the distance between the horizontal position of the inspection UAV and an acquisition device is smaller than the coverage radius, that acquisition device is covered by the inspection UAV; N' is the number of acquisition devices within the coverage range; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the fault identification and foreign object detection tasks on the nth acquisition device are partitioned beyond the limit β_{n0}^t + Σ_{m∈M} β_{nm}^t ≤ 1, where β_{nm}^t and β_{n0}^t are the fractions of the task received from the nth acquisition device in time slot t that are processed on the mth edge server and on the inspection UAV, respectively;
T_t^n is the total service delay of the nth acquisition device within time slot t, T_t^n = T_1^{n,t} + max{T_2^{n,t}, max_{m∈M}(T_1^{nm,t} + T_2^{nm,t})};
E_t^n is the total energy consumption of the inspection UAV within time slot t given the task arrival rate λ_n of the nth acquisition device, E_t^n = λ_n (E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t});
T_1^{n,t} is the uplink transmission delay from the nth acquisition device to the inspection UAV in time slot t, T_1^{n,t} = D_n / v_n^t, where D_n is the input data volume of the nth acquisition device when all its tasks are offloaded onto the inspection UAV and v_n^t is the uplink transmission data rate, v_n^t = W_n log_2(1 + p_n α_n^t / σ²), with α_n^t the uplink channel gain from the nth acquisition device to the inspection UAV, W_n and p_n respectively the bandwidth and transmit power allocated to the nth acquisition device, and σ² the noise power at the inspection UAV; E_1^{n,t} is the uplink transmission energy consumption in time slot t, E_1^{n,t} = p T_1^{n,t}, where p is the receive power of the inspection UAV;
T_2^{n,t} is the delay for the inspection UAV to compute the task received from the nth acquisition device in time slot t, T_2^{n,t} = β_{n0}^t C_n D_n / f_n, where C_n is the number of CPU cycles required to process a unit of task data and f_n is the computing resource allocated by the inspection UAV to the nth acquisition device, with {β_{n0}^t ∈ [0, 1], n ∈ N}; E_2^{n,t} is the computation energy consumption of the inspection UAV in time slot t, E_2^{n,t} = κ (f_n)³ T_2^{n,t}, where κ is the effective switched capacitance and the power consumption of the central processing unit in the inspection UAV is modeled as κ (f_n)³;
T_1^{nm,t} is the downlink transmission delay in time slot t, T_1^{nm,t} = β_{nm}^t D_n / v_m^t, where v_m^t is the downlink transmission data rate, v_m^t = W_m log_2(1 + p_t α_m^t / σ²), W_m denotes the bandwidth allocated to the mth edge server, α_m^t is the downlink channel gain from the inspection UAV to the mth edge server in time slot t, and p_t denotes the transmit power of the inspection UAV in time slot t, with {β_{nm}^t ∈ [0, 1], n ∈ N, m ∈ M}; E_1^{nm,t} is the downlink transmission energy consumption, E_1^{nm,t} = p_t T_1^{nm,t};
T_2^{nm,t} is the computation delay at the mth edge server in time slot t, T_2^{nm,t} = β_{nm}^t C_n D_n / f_{nm}, where f_{nm} is the computing resource allocated by the mth edge server to the nth acquisition device in time slot t;
Step 1.5: define γ ∈ (0, 1) as the discount coefficient used to compute the accumulated return over the whole process; the closer the discount coefficient is to 1, the more weight is given to long-term returns;
Step 1.6: over a finite horizon of T time slots, learn the optimal strategy so as to maximize the expected long-term return G_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where r_{t'} is the reward at time slot t'; the expected return obtained by taking action a_t in state s_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator;
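For clarity, the discounted return G_t of step 1.6, whose expectation gives the state-action value Q(s_t, a_t), can be computed as in the following sketch; the reward values are illustrative only.

```python
def discounted_return(rewards, gamma=0.99, t=0):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_{t'}  (step 1.6)."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

# Example: per-slot rewards of one episode (illustrative values only).
rewards = [-3.2, -2.9, -2.5, -2.6]
print(discounted_return(rewards, gamma=0.9, t=0))  # Q(s_0, a_0) is the expectation of such returns
```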
Step 2: according to the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, train the actions of the inspection UAV using the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
Step 2.1: adopt two types of neural networks, namely 2 Actor networks and 4 Critic networks. The Actor networks comprise a Current-Actor network with parameter ω and a Target-Actor network with parameter ω'; the Critic networks comprise 2 Current-Critic networks with parameters θ_1 and θ_2 and 2 Target-Critic networks with parameters θ_1' and θ_2'. Randomly initialize the experience replay memory F and the parameters of the Current-Actor and Current-Critic networks, and set ω' = ω, θ_1' = θ_1, θ_2' = θ_2;
Step 2.2: set the maximum number of training rounds EP, with T time slots in each round, and initialize the training round counter ep = 1;
Step 2.3: initialize the time slot t = 0 and the initial input state s_0 of the inspection UAV;
Step 2.4: the inspection UAV observes the input state s_t of the current power Internet of Things system and takes an action according to the policy function π_ω(s_t) of the current Current-Actor network, with Gaussian noise σ added to perturb the action, giving a_t = π_ω(s_t) + σ;
Step 2.5: in input state s_t, execute action a_t, compute the current reward r_t of the inspection UAV from the obtained instantaneous feedback according to step 1.4, and obtain the next input state s_{t+1} using steps 1.1-1.2; store the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F; the experience tuples in F are stored in order, and once their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
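A minimal sketch of the cyclic experience replay memory F of step 2.5 follows (fixed capacity, oldest entries overwritten, uniform mini-batch sampling); the capacity and batch size are assumptions.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0                       # next write position; wraps around when full

    def push(self, s, a, r, s_next):
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.pos] = Transition(s, a, r, s_next)
        self.pos = (self.pos + 1) % self.capacity   # new records overwrite the oldest (a cycle)

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)  # the mini-batch used in step 2.6

    def __len__(self):
        return len(self.buffer)
```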
Step 2.6: randomly draw a mini-batch of samples from the experience replay memory F and feed them to the Actor and Critic networks, where a sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection UAV took action a_i in state s_i, received reward r_i, and the environment transitioned to the next state s_{i+1}; "mini-batch" refers to a small batch of training data selected at random;
Step 2.7: take s_i as the input of the Current-Actor network; the updated Current-Actor network computes a new action a_i from s_i, and (s_i, a_i) is taken as the input of the two Current-Critic networks, which compute Q_{θ_1}(s_i, a_i) and Q_{θ_2}(s_i, a_i) respectively. Take s_{i+1} as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action a_{i+1} from s_{i+1}, noise is then added, and the noisy target action ã_{i+1} is computed using target policy smoothing regularization; ã_{i+1} is taken as the input of the two Target-Critic networks, which compute Q_{θ_1'}(s_{i+1}, ã_{i+1}) and Q_{θ_2'}(s_{i+1}, ã_{i+1}) respectively, and the smaller of the two is used to compute the target value y_i; finally, the target value y_i is used to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function;
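A hedged PyTorch sketch of the step 2.7 update follows: target-policy smoothing, the clipped double-Q target y_i, and minimization of the mean-squared Bellman error for the two Current-Critic networks. The network objects, optimizer, and hyperparameters (gamma, eps, c) are placeholders, not the patent's exact settings.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_target, critic1, critic2, critic1_target, critic2_target,
                  critic_optim, gamma=0.99, eps=0.2, c=0.5):
    s, a, r, s_next = batch                          # tensors sampled from the replay memory F
    with torch.no_grad():
        # Target-policy smoothing: a~_{i+1} = pi_{omega'}(s_{i+1}) + clipped Gaussian noise
        noise = (torch.randn_like(a) * eps).clamp(-c, c)
        a_next = actor_target(s_next) + noise
        # Clipped double-Q target: y_i = r_i + gamma * min_j Q_{theta'_j}(s_{i+1}, a~_{i+1})
        q_next = torch.min(critic1_target(s_next, a_next), critic2_target(s_next, a_next))
        y = r + gamma * q_next
    # Minimize the mean-squared Bellman error for both Current-Critic networks
    loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```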
Step 2.8: update the Current-Actor network parameter ω through the deterministic policy gradient algorithm;
Step 2.9: to stabilize the training process, further update the Target-Actor parameter ω' and the parameters θ_1' and θ_2' of the 2 Target-Critic networks using the soft update method;
step 3, training the model by adopting the following steps:
Step 3.1: if the inspection UAV covers all the acquisition devices and the task partition variables do not exceed their limit, go to step 3.2; otherwise add 1 to t and check it: if t < T, jump to step 2.4, and if t ≥ T, go to step 3.2;
Step 3.2: add 1 to the training round counter ep and check it: if ep < EP, jump to step 2.3; otherwise, when ep ≥ EP, go to step 3.3;
Step 3.3: the iteration is finished and the neural network training process terminates; store the current Target-Actor and Target-Critic network data and load the stored data into the inspection UAV system, which then executes the flight actions, completing the optimization of the delay and energy consumption of the UAV-served power Internet of Things.
Further, in step 2.7 the noisy target action is computed as ã_{i+1} = π_{ω'}(s_{i+1}) + ε̃, where the noise component ε̃ ~ clip(N(0, ε), −c, c) is Gaussian noise with zero mean and variance ε clipped by the constant c, and π_{ω'}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω' evaluated at the next state s_{i+1};
the target value is y_i = r_i + γ min_{j=1,2} Q_{θ_j'}(s_{i+1}, ã_{i+1}), where r_i is the reward and γ is the discount factor;
the parameters θ_1 and θ_2 of the 2 Current-Critic networks are updated by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss, M is the number of randomly drawn mini-batch samples, and Q_{θ_j}(s_i, a_i), j = 1, 2, are the state-action value functions of the two Current-Critic networks with parameters θ_1 and θ_2.
Further, the update formula of the Current-Actor network in step 2.8 is:
∇_ω J(ω) ≈ (1/M) Σ_i ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i),
where ∇_ω J(ω) denotes the policy gradient with respect to the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} and ∇_ω π_ω(s_i) are respectively the gradient of the state-action value function of the Current-Critic network with parameter θ_1 and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action selected by the Current-Actor network for input state s_i, and Q_{θ_1}(s_i, π_ω(s_i)) denotes the Current-Critic state-action value function for input state s_i and action a = π_ω(s_i).
Further, in step 2.9 the Target-Actor parameter ω' is updated as ω' ← μω + (1 − μ)ω', and the 2 Target-Critic parameters θ_1' and θ_2' are updated as θ_1' ← μθ_1 + (1 − μ)θ_1' and θ_2' ← μθ_2 + (1 − μ)θ_2', where μ denotes the update scale factor.
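A short sketch of the soft (Polyak) update of step 2.9, applied in the same way to the Target-Actor and both Target-Critic networks; the scale factor mu is illustrative.

```python
import torch

@torch.no_grad()
def soft_update(current_net, target_net, mu=0.005):
    """theta' <- mu * theta + (1 - mu) * theta' for every parameter pair."""
    for p, p_targ in zip(current_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - mu)
        p_targ.add_(mu * p)
```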
The invention has the beneficial effects that:
(1) the electric power internet of things computing task unloading model for actual transmission line inspection established by the method is complete, and the inspection unmanned aerial vehicle finds the optimal target strategy for computing task unloading through continuous interaction of the inspection unmanned aerial vehicle and the system environment due to unknown model environment, so that the method has high practical application value.
(2) The method uses the twin delayed deep deterministic policy gradient (TD3) algorithm, which effectively handles the continuous control of the inspection UAV; the Target-Actor and Target-Critic networks effectively alleviate the persistent overestimation of the state-action value of the inspection UAV, so that training is more stable and a better offloading strategy is found.
(3) The method combines reinforcement learning and the deep neural network, improves the learning ability and the generalization ability of the inspection unmanned aerial vehicle, avoids the complexity and the sparseness when the inspection unmanned aerial vehicle is manually operated to unload the calculation task in an uncertain environment, and ensures that the inspection unmanned aerial vehicle can safely and efficiently complete the unloading of the calculation task.
Drawings
FIG. 1 is a general framework diagram of a TD 3-based computing task offloading method in a power Internet of things;
fig. 2 is a flowchart of a calculation task unloading algorithm based on TD3 in the power internet of things.
Detailed Description
The present invention will be described in further detail with reference to examples for the purpose of facilitating understanding and practice of the invention by those of ordinary skill in the art, and it is to be understood that the present invention has been described in the illustrative embodiments and is not to be construed as limited thereto.
Firstly, a power Internet of Things system model oriented to power transmission line inspection is established, involving the interaction among the acquisition devices, the inspection UAV, and the edge servers, and the computing task offloading problem is formulated. Secondly, in view of the non-convexity of the task offloading problem, the invention provides a deep-reinforcement-learning-based method in which a Markov decision process is constructed by designing the state, the action space, and the reward function. On this basis, since the Markov model has a continuous action space, a twin delayed deep deterministic policy gradient (TD3) algorithm is proposed to obtain the optimal task offloading strategy.
The optimal strategy jointly optimizes the position q_t of the inspection UAV, the task partition variables β_{nm}^t and β_{n0}^t, and the transmit power p_t so as to minimize the sum of the total service delay of all the acquisition devices and the total energy consumption of the inspection UAV, namely:
minimize over {q_t, β_{nm}^t, β_{n0}^t, p_t} the objective Σ_{t=1}^{T} Σ_{n∈N} (T_t^n + E_t^n),
subject to H_min ≤ h_t ≤ H_max, ||u_{t+1} − u_t|| ≤ l_1, |h_{t+1} − h_t| ≤ l_2, 0 ≤ p_t ≤ P, and β_{nm}^t, β_{n0}^t ∈ [0, 1] with β_{n0}^t + Σ_{m∈M} β_{nm}^t ≤ 1.
The position of the inspection UAV in time slot t is denoted q_t = (u_t, h_t), where u_t is the horizontal coordinate of the inspection UAV and h_t ∈ [H_min, H_max] is its height, H_min and H_max being the minimum and maximum flying heights; the maximum horizontal and vertical distances the inspection UAV can fly in each time slot t are l_1 and l_2, respectively. N and M denote the sets of acquisition devices and edge servers, T_t^n denotes the total service delay of the nth acquisition device within time slot t, E_t^n denotes the total energy consumption of the inspection UAV for offloading and computing the task of the nth acquisition device within time slot t, P is the maximum transmit power of the inspection UAV, ||·|| denotes the Euclidean norm of a vector, and u_{t+1} and h_{t+1} denote the horizontal coordinate and the height of the inspection UAV in the next time slot. From this formulation of task offloading for fault identification, foreign object detection, and similar tasks in power transmission line inspection, it can be seen that the problem is strongly non-convex and combinatorial, so a globally optimal solution is difficult to find. In addition, because the information and channel conditions of the acquisition devices in the power Internet of Things system for transmission line inspection are not easy to obtain, traditional optimization strategies are even more challenging to apply.
A computing task unloading method based on deep reinforcement learning in an electric power Internet of things comprises the following steps:
step 1, establishing a Markov model of a follow-up depth reinforcement learning algorithm of the flight position of the inspection unmanned aerial vehicle:
model the quintuple (S, A, O, R, γ) of the Markov decision process (MDP), wherein S is the input state set of the inspection UAV, A is the output action set of the inspection UAV, O is the state transition probability function, R is the reward function, and γ is the discount coefficient; s_t belongs to the input state set S;
Step 1.1: determine the input state of the inspection UAV at each time slot t as s_t = [u_t, h_t], where u_t is the horizontal coordinate of the inspection UAV at time slot t and h_t is the height of the inspection UAV at time slot t;
Step 1.2: define the output action set A of the inspection UAV, where A represents the set of all actions the inspection UAV can take for its input state after receiving the following external feedback: when the inspection UAV flies out of the power transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flying height h_t of the inspection UAV exceeds its height range, it remains at the minimum height H_min or the maximum height H_max; once the inspection UAV covers all the acquisition devices, it stays still and prepares to adjust its transmit power; when the task partition variables exceed their limit, the inspection UAV performs a normalization operation to obtain valid task partition variables β_{nm}^t again; the output action of the inspection UAV at each time slot t is a_t = {l_t, Φ_t, Δh_t, p_t, β_{nm}^t};
wherein l_t ∈ [0, l_1] denotes the horizontal distance the inspection UAV flies in time slot t, and l_1 denotes the maximum horizontal distance the inspection UAV can fly in each time slot t; Φ_t ∈ [0, 2π] denotes the direction angle of the inspection UAV at time slot t; Δh_t = (h_t − h_{t−1}) ∈ [−l_2, l_2] denotes the vertical movement distance of the inspection UAV in time slot t, h_t and h_{t−1} respectively denote the height of the inspection UAV at time slot t and at the previous time slot, and l_2 denotes the maximum vertical distance the inspection UAV can fly in each time slot t; p_t ∈ [0, P] denotes the transmit power of the inspection UAV at time slot t, and P denotes its maximum transmit power; β_{nm}^t denotes the fraction of the fault identification and foreign object detection task received from the nth acquisition device that is processed on the mth edge server, i.e. the task partition variables of the inspection UAV at time slot t are {β_{nm}^t, n ∈ N, m ∈ M}, where N and M respectively denote the sets of acquisition devices and edge servers in the system;
Step 1.3: define the probability that the inspection UAV, starting from the input state s_t of the current time slot t and taking action a_t, reaches the next input state s_{t+1} as the state transition probability function O;
Step 1.4: define the reward function R, which represents the instantaneous feedback obtained after the inspection UAV selects action a_t in input state s_t at the current time slot t. When the inspection UAV does not yet cover all the acquisition devices, the reward is dominated by the coverage penalty term weighted by ξ_1; once all the acquisition devices are covered, the reward is the negative of the sum, over all acquisition devices, of the total service delay T_t^n and the total energy consumption E_t^n, together with the penalty ξ_n^2 whenever the task partition of the nth acquisition device exceeds its limit.
Here the horizontal direction angle Φ_t and the height h_t of the inspection UAV determine the radius R_t of the range covering the acquisition devices. If the distance between the horizontal position of the inspection UAV (whose state is s_t = [u_t, h_t]) and an acquisition device is smaller than the coverage radius, that acquisition device is covered by the inspection UAV; the horizontal coordinates of the acquisition devices are fixed, known quantities; N' is the number of acquisition devices covered by the inspection UAV; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the fault identification and foreign object detection tasks on the nth acquisition device are partitioned beyond the limit β_{n0}^t + Σ_{m∈M} β_{nm}^t ≤ 1; β_{nm}^t, β_{n0}^t, T_t^n and E_t^n are obtained from the power Internet of Things system model, where β_{nm}^t and β_{n0}^t are the fractions of the task received from the nth acquisition device in time slot t that are processed on the mth edge server and on the inspection UAV, respectively;
T_t^n is the total service delay of the nth acquisition device within time slot t, T_t^n = T_1^{n,t} + max{T_2^{n,t}, max_{m∈M}(T_1^{nm,t} + T_2^{nm,t})};
E_t^n is the total energy consumption of the inspection UAV within time slot t given the task arrival rate λ_n of the nth acquisition device, E_t^n = λ_n (E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t});
A. electric power internet of things system model
The power internet of things system for power transmission line inspection comprises 1 inspection unmanned aerial vehicle, N information acquisition devices such as miniature meteorological stations, sensors and high-definition cameras and M edge servers. The collection of the acquisition devices and the edge servers are denoted as N and M, respectively. After receiving the calculation tasks such as fault identification and foreign object detection of the acquisition equipment through the uplink communication link, the patrol unmanned aerial vehicle distributes part or all of the tasks to the edge server through the downlink communication link.
1) Uplink transmission delay and energy consumption:
It is assumed that the acquisition devices share a broadband frequency division multiple access protocol when offloading tasks such as fault identification and foreign object detection. According to the Shannon law, considering the uplink channel gain α_n^t from the nth acquisition device to the inspection UAV in time slot t, the achievable uplink transmission data rate is v_n^t = W_n log_2(1 + p_n α_n^t / σ²), where W_n and p_n respectively denote the bandwidth and transmit power allocated to the nth acquisition device, and σ² denotes the noise power at the inspection UAV. Thus, considering the input data volume D_n when all the tasks of the nth acquisition device are offloaded onto the inspection UAV, the uplink transmission delay in time slot t is T_1^{n,t} = D_n / v_n^t. Considering the receive power p of the inspection UAV, the uplink transmission energy consumption in time slot t is E_1^{n,t} = p T_1^{n,t}.
2) Patrol and examine unmanned aerial vehicle's calculation time delay and energy consumption:
After receiving the fault identification and foreign object detection tasks from the acquisition devices, the inspection UAV decides which portion of the tasks to process itself and which portion to offload to the edge servers. {β_{n0}^t ∈ [0, 1], n ∈ N} denotes the fraction of the task received from the nth acquisition device in time slot t that is processed on the inspection UAV, where N denotes the set of acquisition devices in the system. Thus, considering the number of CPU cycles C_n required to process a unit of task data and the computing resource f_n allocated by the inspection UAV to the nth acquisition device, the computation delay of the inspection UAV in time slot t is T_2^{n,t} = β_{n0}^t C_n D_n / f_n, where D_n denotes the task input data volume of the nth acquisition device. Furthermore, modeling the power consumption of the central processing unit in the inspection UAV as κ (f_n)³, with κ the effective switched capacitance, the computation energy consumption of the inspection UAV in time slot t is E_2^{n,t} = κ (f_n)³ T_2^{n,t}.
3) Downlink transmission delay and energy consumption:
For simplicity, it is assumed that the noise power at the inspection UAV and at the edge servers is the same. Considering the downlink channel gain α_m^t from the inspection UAV to the mth edge server in time slot t, the downlink transmission data rate is v_m^t = W_m log_2(1 + p_t α_m^t / σ²), where W_m and σ² respectively denote the bandwidth allocated to the mth edge server and the noise power at the edge server, and p_t denotes the transmit power of the inspection UAV in time slot t. Because the edge servers have more communication, computing, and storage resources than the inspection UAV, the inspection UAV is more likely to offload tasks such as fault identification and foreign object detection to the edge servers so as to reduce the system delay. Considering the fraction {β_{nm}^t ∈ [0, 1], n ∈ N, m ∈ M} of the task received from the nth acquisition device in time slot t that is processed on the mth edge server, where N and M respectively denote the sets of acquisition devices and edge servers in the system, the downlink transmission delay is T_1^{nm,t} = β_{nm}^t D_n / v_m^t, where D_n denotes the task input data volume of the nth acquisition device. Furthermore, considering the transmit power p_t of the inspection UAV in time slot t, the downlink transmission energy consumption is E_1^{nm,t} = p_t T_1^{nm,t}.
4) Calculating time delay of the edge server:
The edge server starts its computation only after receiving all the task data (fault identification, foreign object detection, and the like) offloaded by the inspection UAV. Therefore, considering the computing resource f_{nm} allocated by the mth edge server to the nth acquisition device in time slot t, the computation delay at the mth edge server is T_2^{nm,t} = β_{nm}^t C_n D_n / f_{nm}, where C_n denotes the number of CPU cycles required to process a unit of task data, D_n denotes the task input data volume of the nth acquisition device, and β_{nm}^t denotes the fraction of the task received from the nth acquisition device in time slot t that is processed on the mth edge server.
5) Total service delay and energy consumption of the system:
The inspection UAV cannot start processing tasks such as fault identification and foreign object detection until the uplink communication channel has completed the full data transfer. Since the communication and computation modules on the UAV are separate, the inspection UAV can offload tasks to the edge servers through the downlink communication channel while performing local task computation. Furthermore, an edge server can start task processing only after all of its data transfer is completed. Therefore, in time slot t, the total service delay of the nth acquisition device is
T_t^n = T_1^{n,t} + max{T_2^{n,t}, max_{m∈M}(T_1^{nm,t} + T_2^{nm,t})},
where T_1^{n,t} is the uplink transmission delay from the nth acquisition device to the inspection UAV in time slot t, T_2^{n,t} is the delay for the inspection UAV to compute the task received from the nth acquisition device in time slot t, T_1^{nm,t} is the downlink transmission delay from the inspection UAV to the mth edge server in time slot t, T_2^{nm,t} is the computation delay at the mth edge server in time slot t, and M denotes the set of edge servers in the system.
Furthermore, in time slot t, considering the task arrival rate λ_n of the nth acquisition device, the total energy consumption of the inspection UAV is
E_t^n = λ_n (E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t}),
where E_1^{n,t} is the uplink transmission energy consumption for receiving the task from the nth acquisition device in time slot t, E_2^{n,t} is the energy consumption for the inspection UAV to process the fault identification and foreign object detection tasks of the nth acquisition device in time slot t, and E_1^{nm,t} is the downlink transmission energy consumption for offloading the task to the mth edge server in time slot t.
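To make sections 1)-5) of the system model concrete, the following sketch evaluates the uplink/downlink rates, the four delay terms, and the energy terms for one acquisition device n in one time slot. All parameters are illustrative assumptions, and the aggregation of T_t^n follows the reading above that local computation and edge offloading proceed in parallel after the uplink transfer.

```python
import numpy as np

def service_delay_and_energy(D_n, C_n, beta_n0, beta_nm,      # task size, CPU cycles/bit, partitions
                             W_n, p_n, alpha_n, sigma2,        # uplink bandwidth/power/gain, noise power
                             W_m, p_t, alpha_m,                # per-server downlink bandwidth/gain, UAV tx power
                             f_n, f_nm, p_rx, kappa, lam):     # CPU freqs, UAV rx power, switch cap, arrival rate
    # 1) Uplink (device -> UAV): Shannon rate, transmission delay and energy
    v_n = W_n * np.log2(1 + p_n * alpha_n / sigma2)
    T1_n = D_n / v_n
    E1_n = p_rx * T1_n
    # 2) Local computation on the UAV for the fraction beta_n0
    T2_n = beta_n0 * C_n * D_n / f_n
    E2_n = kappa * f_n ** 3 * T2_n
    # 3)-4) Downlink (UAV -> edge server m) and edge computation for the fractions beta_nm (arrays over M)
    v_m = W_m * np.log2(1 + p_t * alpha_m / sigma2)
    T1_nm = beta_nm * D_n / v_m
    T2_nm = beta_nm * C_n * D_n / f_nm
    E1_nm = p_t * T1_nm
    # 5) Total service delay: uplink first, then the local and edge branches run in parallel
    T_total = T1_n + max(T2_n, np.max(T1_nm + T2_nm))
    E_total = lam * (E1_n + E2_n + np.sum(E1_nm))
    return T_total, E_total
```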
Considering that the network information in the communication process of the power Internet of Things system for transmission line inspection is unknown, strategy optimization based on the Markov model cannot adopt traditional iterative methods. In deep reinforcement learning, the inspection UAV only needs to optimize its action from the state information and does not require complete environment information. Considering that the state and action spaces in this task offloading context (fault identification, foreign object detection, and the like) are continuous, and that the value function is persistently overestimated when training a deep deterministic policy gradient algorithm, the twin delayed deep deterministic policy gradient (TD3) algorithm is adopted, which combines a policy-function-based Actor network with a value-function-based Critic network.
Step 1.5, defining gamma epsilon (0,1) as a discount coefficient for calculating a return accumulated value in the whole process, wherein the closer the discount coefficient is to 1, the more important the long-term income is;
step 1.6, during a limited period T, learning the optimal strategy to obtain the maximum predictionThe long term gain of the period is
Figure BDA0003338015070000131
Wherein r ist′Is a reward function at time slot t', will state stCorresponding to action atThe expected reward under is recorded as the state-action cost function Q(s)t,at)=E[Gt|st,at]Wherein E [ alpha ], [ alpha ]]Is the desired operator.
The TD3 architecture consists of a Current-Network and a Target-Network. The Current-Network comprises a Current-Actor network and Current-Critic networks: the Current-Actor network with parameter ω selects an action through the policy function π_ω(s_t) and passes it to the Current-Critic networks, which estimate the state-action value functions Q_{θ_1}(s_t, a_t) and Q_{θ_2}(s_t, a_t). The two Current-Critic networks have identical structure but separately initialized and trained parameters θ_1 and θ_2; they compute Q values from the input state and the selected action and update the Current-Actor network based on the deterministic policy gradient. Since the Current-Network persistently overestimates the Q value, which makes network training unstable, TD3 uses the Target-Network to slowly track the updates of the Current-Actor and Current-Critic networks. The Target-Network comprises a Target-Actor network with parameter ω' and two Target-Critic networks with parameters θ_1' and θ_2'. The Target-Actor network passes its policy function to the Target-Critic networks, which estimate state-action values; the two Target-Critic networks have identical structure but separately initialized and trained parameters, compute Q values from the input state and the selected action, compute the target value based on the smaller of the two Q values, and thereby update the Current-Critic networks.
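A possible PyTorch realization of the Actor and Critic networks described above is sketched below (the 2 Actor and 4 Critic networks are instances of these two classes); the layer sizes, dimensions, and tanh output scaling are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Tanh())
        self.max_action = max_action          # actions are later rescaled to the physical ranges

    def forward(self, s):
        return self.max_action * self.net(s)  # policy function pi_omega(s)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))   # state-action value Q_theta(s, a)

# Current networks and their Target copies initialized with identical parameters (omega' = omega, theta'_j = theta_j)
actor = Actor(3, 5); critic1, critic2 = Critic(3, 5), Critic(3, 5)
actor_target, critic1_target, critic2_target = map(copy.deepcopy, (actor, critic1, critic2))
```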
Step 2, according to the Markov decision quintuple (S, A, O, R, gamma) modeled in the step 1, the action training of the unmanned aerial vehicle is realized by using a double-delay depth certainty strategy gradient (TD3) algorithm, and the steps are as follows:
step 2.1, adopting 2 Actor networks and 4 Critic networks, wherein the Actor networks comprise a Current-Actor network with a parameter of omega and a Target-Actor network with a parameter of omega'; the Critic network includes 2 parameters of theta1、θ2The Current-critical network and 2 parameters are theta1’、θ2'and randomly initializing parameters of the empirical playback memory F, the Current-Actor network, and the Current-Critic network, setting ω' ═ ω, θ1’=θ1、θ2’=θ2
Step 2.2, setting a maximum training process round number EP, setting T time slots in each round, and initializing a training round number EP to be 1;
step 2.3, initializing T to 0, initializing time slot T to 0, and initializing the input state s of the inspection unmanned aerial vehicle0
Step 2.4, observing the input state s of the current electric power Internet of things system by the inspection unmanned aerial vehicletAnd according to a policy function pi of the initial Current-Actor networkω(st) Taking action, adding Gaussian noise sigma to disturb the action to obtain action at=πω(st) + sigma; policy function piω(st) Is that the patrol unmanned aerial vehicle decides at the input state stTake output action atThe manner of (a);
step 2.5, in input state stThen, perform action atAnd calculating the current reward function r of the inspection unmanned aerial vehicle according to the step 1.4 by the obtained instant feedbacktAnd (3) utilizing the steps 1.1-1.2 to obtain the next input state s againt+1An experience bar(s)t,at,rt,st+1) Storing the experience strips in the experience playback memory F into the experience playback memory F, wherein the experience strips in the experience playback memory F are stored in sequence, and when the number of the experience strips reaches the capacity of the experience playback memory F, new records are stored from the beginning to form a cycle;
step 2.2-step 2.5 are strategy exploration phases, and off-line setting is carried outThe standby environment is used as a training environment, and the Current-Actor network observes the input state s of the time slot ttAnd as input, output the recommended action atAnd selecting a by the inspection unmanned aerial vehicletThe state is represented by stIs converted into st+1And gives a reward feedback rtThe Current-Actor network follows the new state st+1And generating a new recommended action, and repeating the steps to search the strategy, wherein the experience record of the strategy search is stored as the experience in the experience playback memory F.
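A brief sketch of this exploration phase (steps 2.4-2.5): the Current-Actor proposes an action for the observed state, Gaussian noise σ perturbs it, and the resulting transition is stored in the replay memory. The objects `env`, `actor`, and `memory` stand for the environment, Actor network, and ReplayMemory of the other sketches and are assumptions.

```python
import torch

def explore_one_slot(env, actor, memory, sigma=0.1):
    s = torch.as_tensor(env.observe(), dtype=torch.float32)   # input state s_t of the current slot
    with torch.no_grad():
        a = actor(s)
        a = a + sigma * torch.randn_like(a)                   # a_t = pi_omega(s_t) + Gaussian noise
    s_next, r = env.step(a.numpy())                           # execute a_t, receive reward r_t (step 1.4)
    memory.push(s.numpy(), a.numpy(), r, s_next)              # store the experience (s_t, a_t, r_t, s_{t+1})
    return s_next, r
```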
The network update phase is as follows: the Current-Actor network no longer needs to interact with the environment; instead, experience replay is carried out by sampling from the experience replay memory F.
Step 2.6, randomly extracting the mini-batch sample from the empirical playback memory F and inputting the mini-batch sample(s) into the Actor and Critic networki,ai,ri,si+1) Indicating patrol unmanned aerial vehicle at state siAt the same time take action aiGive a reward of riAnd converting the environment to the next state si+1Wherein, mini-batch refers to randomly selecting a small batch of data in training data;
Step 2.7: take s_i as the input of the Current-Actor network; the updated Current-Actor network computes a new action a_i from s_i, and (s_i, a_i) is taken as the input of the two Current-Critic networks, which compute Q_{θ_1}(s_i, a_i) and Q_{θ_2}(s_i, a_i) respectively. Take s_{i+1} as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action a_{i+1} from s_{i+1}, noise is then added, and the noisy target action is computed using target policy smoothing regularization as ã_{i+1} = π_{ω'}(s_{i+1}) + ε̃, where the noise component ε̃ ~ clip(N(0, ε), −c, c) is Gaussian noise with zero mean and variance ε clipped by the constant c, and π_{ω'}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω' evaluated at the next state s_{i+1}.
ã_{i+1} is taken as the input of the two Target-Critic networks, which compute Q_{θ_1'}(s_{i+1}, ã_{i+1}) and Q_{θ_2'}(s_{i+1}, ã_{i+1}) respectively, and the smaller of the two is used to compute the target value y_i = r_i + γ min_{j=1,2} Q_{θ_j'}(s_{i+1}, ã_{i+1}), where r_i is the reward and γ is the discount factor. Finally, the target value y_i is used to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss, M is the number of randomly drawn mini-batch samples, and Q_{θ_j}(s_i, a_i), j = 1, 2, are the state-action value functions of the two Current-Critic networks with parameters θ_1 and θ_2;
Step 2.8: update the Current-Actor network parameter ω through the deterministic policy gradient algorithm:
∇_ω J(ω) ≈ (1/M) Σ_i ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i),
where ∇_ω J(ω) denotes the policy gradient with respect to the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} and ∇_ω π_ω(s_i) are respectively the gradient of the state-action value function of the Current-Critic network with parameter θ_1 and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action selected by the Current-Actor network for input state s_i, and Q_{θ_1}(s_i, π_ω(s_i)) denotes the Current-Critic state-action value function for input state s_i and action a = π_ω(s_i);
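A sketch of this step 2.8 deterministic policy gradient update: the Current-Actor is updated by ascending Q_{θ_1}(s_i, π_ω(s_i)), i.e. by minimizing its negative mean over the mini-batch; the optimizer object is an assumption.

```python
def actor_update(batch_states, actor, critic1, actor_optim):
    # J(omega) is estimated by the mini-batch mean of Q_{theta_1}(s_i, pi_omega(s_i));
    # minimizing -J realizes the deterministic policy gradient ascent of step 2.8.
    loss = -critic1(batch_states, actor(batch_states)).mean()
    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
    return loss.item()
```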
Step 2.9: to stabilize the training process, further update the Target-Actor parameter ω' using the soft update method: ω' ← μω + (1 − μ)ω', and update the 2 Target-Critic parameters θ_1' and θ_2': θ_1' ← μθ_1 + (1 − μ)θ_1', θ_2' ← μθ_2 + (1 − μ)θ_2', where μ denotes the update scale factor.
TD3 reduces the update frequency of the Current-Actor network and the Target-Network: the Current-Actor network and the Target-Network are updated once for every d updates of the Current-Critic networks.
Step 3, training the model by adopting the following steps:
Step 3.1: if the inspection UAV covers all the acquisition devices and the task partition variables do not exceed their limit, go to step 3.2; otherwise add 1 to t and check it: if t < T, jump to step 2.4, and if t ≥ T, go to step 3.2;
Step 3.2: add 1 to the training round counter ep and check it: if ep < EP, jump to step 2.3; otherwise, when ep ≥ EP, go to step 3.3;
Step 3.3: the iteration is finished and the neural network training process terminates; store the current Target-Actor and Target-Critic network data and load the stored data into the inspection UAV system, which then executes the flight actions, completing the optimization of the delay and energy consumption of the UAV-served power Internet of Things.
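Putting the pieces together, a condensed sketch of the step 2-3 training procedure follows (exploration, critic updates every slot, delayed actor and soft target updates every d slots). It reuses the helper sketches above; `to_tensors` and the `env` methods are assumed helpers, and all hyperparameters are illustrative.

```python
def train(env, actor, critics, targets, optims, memory, EP=500, T=200, d=2,
          sigma=0.1, batch_size=64):
    critic1, critic2 = critics
    actor_target, critic1_target, critic2_target = targets
    actor_optim, critic_optim = optims
    for ep in range(EP):                                   # training rounds (step 2.2)
        env.reset()                                        # initial state s_0 (step 2.3)
        for t in range(T):                                 # time slots within one round
            explore_one_slot(env, actor, memory, sigma)    # steps 2.4-2.5
            if len(memory) < batch_size:
                continue
            batch = to_tensors(memory.sample(batch_size))  # step 2.6 (to_tensors: assumed helper)
            critic_update(batch, actor_target, critic1, critic2,
                          critic1_target, critic2_target, critic_optim)   # step 2.7
            if t % d == 0:                                 # delayed updates (TD3)
                actor_update(batch[0], actor, critic1, actor_optim)       # step 2.8
                for cur, tgt in ((actor, actor_target), (critic1, critic1_target),
                                 (critic2, critic2_target)):
                    soft_update(cur, tgt)                  # step 2.9
            if env.all_devices_covered() and env.partition_valid():
                break                                      # step 3.1 early-exit condition
```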

Claims (4)

1. A computing task unloading method based on deep reinforcement learning in an electric power Internet of things is characterized by comprising the following steps:
step 1, establishing a Markov model of a follow-up depth reinforcement learning algorithm of the flight position of the inspection unmanned aerial vehicle:
modeling a quintuple (S, A, O, R, gamma) of the MDP, wherein S is an input state set of the inspection unmanned aerial vehicle, A is an output action set of the inspection unmanned aerial vehicle, O is a state transition probability function, R is a reward function, and gamma is a discount coefficient;
step 1.1, defining an input state set S of the inspection unmanned aerial vehicle, and determining the input state S of the inspection unmanned aerial vehicle at each time slot tt=[ut ht],utHorizontal coordinate, h, of the patrol drone at time slot ttThe height of the unmanned aerial vehicle is patrolled and examined in a time slot t; stBelongs to the input state set S;
step 1.2, defining the output action set A of the inspection unmanned aerial vehicle, where A is the set of all actions the inspection unmanned aerial vehicle may take for its input state after receiving the following external feedback: when the unmanned aerial vehicle flies out of the power transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flying height h_t exceeds its allowed range, it is held at the minimum height H_min or the maximum height H_max; once the inspection unmanned aerial vehicle covers all acquisition devices, it stays in place and prepares to adjust its transmit power; when the task division variables violate their limit (the limit condition and the corresponding correction operation are given as formula images in the original claims), the inspection unmanned aerial vehicle re-selects the task division variables β_nm^t. The output action of the inspection unmanned aerial vehicle at each time slot t is a_t = [l_t, Φ_t, Δh_t, p_t, β_nm^t];
where l_t ∈ {0, l_1} denotes the horizontal distance flown by the unmanned aerial vehicle at time slot t, and l_1 is the maximum horizontal distance the inspection unmanned aerial vehicle can fly in each time slot t; Φ_t ∈ {0, 2π} denotes the direction angle of the inspection unmanned aerial vehicle at time slot t; Δh_t = (h_t − h_{t−1}) ∈ {−l_2, l_2} denotes the vertical movement distance of the inspection unmanned aerial vehicle at time slot t, h_t and h_{t−1} are the heights of the inspection unmanned aerial vehicle at time slot t and at the previous time slot, and l_2 is the maximum vertical distance flown in each time slot t; p_t ∈ {0, P} denotes the transmit power of the inspection unmanned aerial vehicle at time slot t, and P is its maximum transmit power; β_nm^t denotes the fraction of the fault identification and foreign-object detection task received from the n-th acquisition device that is processed on the m-th edge server, i.e. the task division variables of the inspection unmanned aerial vehicle at time slot t are {β_nm^t, n ∈ N, m ∈ M}, where N and M denote the sets of acquisition devices and edge servers in the system, respectively;
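Purely as an illustrative sketch (not the patented procedure), the bounded action vector a_t described above could be produced from a raw policy output by clipping each component to its range; the bounds l1, l2, P and the device/server counts are symbolic placeholders.

```python
import numpy as np

def clip_action(raw, l1, l2, P, n_devices, n_servers):
    """Map a raw policy output to the bounded action a_t = [l_t, Phi_t, dh_t, p_t, beta_nm_t]."""
    l_t   = np.clip(raw[0], 0.0, l1)            # horizontal distance in [0, l1]
    phi_t = np.clip(raw[1], 0.0, 2 * np.pi)     # direction angle in [0, 2*pi]
    dh_t  = np.clip(raw[2], -l2, l2)            # vertical move in [-l2, l2]
    p_t   = np.clip(raw[3], 0.0, P)             # transmit power in [0, P]
    # remaining entries are the task split ratios; raw is assumed long enough to hold them
    beta  = np.clip(raw[4:], 0.0, 1.0).reshape(n_devices, n_servers)
    return l_t, phi_t, dh_t, p_t, beta
```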
step 1.3, defining the state transition probability function O as the probability that the inspection unmanned aerial vehicle, in the input state s_t of the current time slot t, takes action a_t and reaches the next input state s_{t+1};
step 1.4, defining the reward function R as the instantaneous feedback obtained after the inspection unmanned aerial vehicle selects action a_t in the input state s_t of the current time slot t. The reward takes one of two forms according to a condition on coverage and on the task division limits; the condition and both reward expressions are given as formula images in the original claims and, in brief, combine the penalty factors ξ_1 and ξ_n^2 when the condition is not met, and the total service delay and total energy consumption otherwise.
The horizontal direction angle Φ_t and the height h_t of the inspection unmanned aerial vehicle determine the radius R_t of the range covering the acquisition devices (formula image in the original claims). If the distance between the horizontal position of the inspection unmanned aerial vehicle and an acquisition device is smaller than the coverage radius, that acquisition device is covered by the inspection unmanned aerial vehicle. N′ is the number of acquisition devices within the coverage range, ξ_1 is a penalty factor related to the degree of coverage, and ξ_n^2 is the penalty factor applied when the division of the fault identification and foreign-object detection task of the n-th acquisition device exceeds its limit (the limit condition is given as a formula image), where β_nm^t and β_n0^t are the fractions of the task received from the n-th acquisition device at time slot t that are processed on the m-th edge server and on the inspection unmanned aerial vehicle, respectively.
T_t^n is the total service delay of the n-th acquisition device within time slot t, and E_t^n is the total energy consumption of the inspection unmanned aerial vehicle within time slot t under the task arrival rate λ_n of the n-th acquisition device (both defined by formula images in the original claims);
T1_{n,t} is the uplink transmission delay from the n-th acquisition device to the inspection unmanned aerial vehicle at time slot t, T1_{n,t} = D_n / v_n^t, where D_n is the input data volume of the n-th acquisition device when all of its tasks are offloaded onto the inspection unmanned aerial vehicle and v_n^t is the uplink data rate, v_n^t = W_n log_2(1 + p_n α_n^t / σ²), where α_n^t is the uplink channel gain from the n-th acquisition device to the inspection unmanned aerial vehicle, W_n and p_n denote the bandwidth and transmit power allocated to the n-th acquisition device, and σ² is the noise power at the inspection unmanned aerial vehicle; E1_{n,t} is the uplink transmission energy consumption at time slot t, E1_{n,t} = p_n T1_{n,t};
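A small numerical sketch of the uplink model just stated, assuming the standard Shannon-rate form reconstructed above; all parameter values in the example call are arbitrary and not taken from the patent.

```python
import numpy as np

def uplink_delay_energy(D_n, W_n, p_n, alpha_n, sigma2):
    """Uplink rate (Shannon form), transmission delay and transmission energy for device n."""
    v_n = W_n * np.log2(1.0 + p_n * alpha_n / sigma2)  # uplink data rate [bit/s]
    T1  = D_n / v_n                                     # uplink transmission delay [s]
    E1  = p_n * T1                                      # uplink transmission energy [J]
    return v_n, T1, E1

# Example with arbitrary values: 1 Mbit task, 1 MHz bandwidth, 0.1 W transmit power.
print(uplink_delay_energy(D_n=1e6, W_n=1e6, p_n=0.1, alpha_n=1e-6, sigma2=1e-9))
```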
T2_{n,t} is the delay for the inspection unmanned aerial vehicle to compute the task received from the n-th acquisition device at time slot t, T2_{n,t} = β_n0^t D_n C_n / f_n, where C_n is the number of CPU cycles required to process one unit of task data volume, f_n is the computing resource allocated by the inspection unmanned aerial vehicle to the n-th acquisition device, and {β_n0^t ∈ [0,1], n ∈ N}; E2_{n,t} is the computing energy consumption of the inspection unmanned aerial vehicle at time slot t, E2_{n,t} = κ(f_n)³ T2_{n,t}, where κ is the effective switched capacitance and the power consumption of the central processing unit of the inspection unmanned aerial vehicle is modeled as κ(f_n)³;
T1_{nm,t} is the downlink transmission delay at time slot t, T1_{nm,t} = β_nm^t D_n / v_m^t, where v_m^t is the downlink data rate, v_m^t = W_m log_2(1 + p_t α_m^t / σ²), W_m denotes the bandwidth allocated to the m-th edge server, α_m^t is the downlink channel gain from the inspection unmanned aerial vehicle to the m-th edge server at time slot t, p_t denotes the transmit power of the inspection unmanned aerial vehicle at time slot t, and {β_nm^t ∈ [0,1], n ∈ N, m ∈ M}; E1_{nm,t} is the downlink transmission energy consumption, E1_{nm,t} = p_t T1_{nm,t};
T2_{nm,t} is the computing delay on the m-th edge server at time slot t, T2_{nm,t} = β_nm^t D_n C_n / f_nm, where f_nm is the computing resource allocated by the m-th edge server to the n-th acquisition device at time slot t;
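A hedged sketch combining these delay and energy terms for one acquisition device. It assumes, for illustration only, that the local (UAV) and edge branches run in parallel so the service delay is dominated by the slowest branch; the patent's own aggregation formulas are given as images in the source and are not reproduced here, and all inputs below are symbolic placeholders.

```python
def device_cost(D_n, C_n, beta_n0, beta_nm, f_uav, f_edge, v_up, v_down, p_n, p_t, kappa=1e-27):
    """Illustrative per-device delay and energy under a split beta_n0 (UAV) / beta_nm (edge)."""
    T1_up  = D_n / v_up                                   # device -> UAV uplink delay
    T2_uav = beta_n0 * D_n * C_n / f_uav                  # computing delay on the UAV
    # UAV -> edge transmission plus edge computing for each edge server m (assumed parallel)
    T_edge = [b * D_n / v_down + b * D_n * C_n / f_edge[m] for m, b in enumerate(beta_nm)]
    delay  = T1_up + max([T2_uav] + T_edge)               # assumption: slowest branch dominates
    energy = (p_n * T1_up                                 # uplink transmission energy
              + kappa * f_uav**3 * T2_uav                 # UAV computing energy
              + p_t * sum(b * D_n / v_down for b in beta_nm))  # downlink transmission energy
    return delay, energy
```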
step 1.5, defining gamma epsilon (0,1) as a discount coefficient for calculating a return accumulated value in the whole process, wherein the closer the discount coefficient is to 1, the more important the long-term income is;
step 1.6, within the finite horizon T, the inspection unmanned aerial vehicle learns the optimal policy to obtain the maximum expected long-term return G_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}, where r_{t′} is the reward at time slot t′; the expected return of taking action a_t in state s_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator.
Step 2, according to the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, the action training of the inspection unmanned aerial vehicle is realized with the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
Step 2.1, two types of neural networks are adopted, 2 Actor networks and 4 Critic networks in total: the Actor networks comprise a Current-Actor network with parameter ω and a Target-Actor network with parameter ω′; the Critic networks comprise 2 Current-Critic networks with parameters θ₁ and θ₂ and 2 Target-Critic networks with parameters θ₁′ and θ₂′. The experience replay memory F and the parameters of the Current-Actor and Current-Critic networks are randomly initialized, and ω′ = ω, θ₁′ = θ₁, θ₂′ = θ₂ are set;
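An illustrative PyTorch-style sketch of the six networks of step 2.1; the layer sizes, state dimension and action dimension below are placeholders, not values taken from the patent.

```python
import copy
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    """Small fully connected network used as a stand-in for Actor/Critic bodies."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim = 3, 5                     # placeholder sizes, e.g. [u_x, u_y, h] and a_t
actor          = mlp(state_dim, action_dim)      # Current-Actor, parameter omega
critic1        = mlp(state_dim + action_dim, 1)  # Current-Critic, parameter theta_1
critic2        = mlp(state_dim + action_dim, 1)  # Current-Critic, parameter theta_2
actor_target   = copy.deepcopy(actor)            # Target-Actor,  omega'  = omega
critic1_target = copy.deepcopy(critic1)          # Target-Critic, theta_1' = theta_1
critic2_target = copy.deepcopy(critic2)          # Target-Critic, theta_2' = theta_2
```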
Step 2.2, set the maximum number of training rounds EP, with T time slots in each round, and initialize the training round counter ep = 1;
Step 2.3, initialize the time slot t = 0 and the input state s_0 of the inspection unmanned aerial vehicle;
Step 2.4, the inspection unmanned aerial vehicle observes the current input state s_t of the power Internet of Things system and takes an action according to the policy function π_ω(s_t) of the initial Current-Actor network; Gaussian noise σ is added to disturb the action, giving a_t = π_ω(s_t) + σ;
Step 2.5, in the input state s_t, execute the action a_t, compute the current reward r_t of the inspection unmanned aerial vehicle from the obtained instantaneous feedback according to step 1.4, and obtain the next input state s_{t+1} using steps 1.1-1.2; store the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F. Tuples in F are stored in order, and once their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
Step 2.6, randomly draw a mini-batch of samples from the experience replay memory F and feed them to the Actor and Critic networks; the i-th sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection unmanned aerial vehicle took action a_i in state s_i, received reward r_i, and the environment moved to the next state s_{i+1}, where mini-batch refers to a small batch of data selected at random from the training data;
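An illustrative sketch of such a cyclic experience replay memory and mini-batch sampling; the capacity and batch size are placeholder values.

```python
import random
from collections import deque

class ReplayMemory:
    """Cyclic experience replay memory F: oldest tuples are overwritten once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)   # random mini-batch (step 2.6)
```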
step 2.7, s_i is used as the input of the Current-Actor network, and the updated Current-Actor network computes a new action a_i from s_i; (s_i, a_i) is used as the input of the two Current-Critic networks, which compute Q_{θ1}(s_i, a_i) and Q_{θ2}(s_i, a_i), respectively. s_{i+1} is used as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action a_{i+1} from s_{i+1}, noise is added, and the noisy target action ã_{i+1} is computed with the target policy smoothing regularization method. ã_{i+1} is fed, together with s_{i+1}, to the two Target-Critic networks, which compute Q_{θ1′}(s_{i+1}, ã_{i+1}) and Q_{θ2′}(s_{i+1}, ã_{i+1}), respectively; the smaller of Q_{θ1′}(s_{i+1}, ã_{i+1}) and Q_{θ2′}(s_{i+1}, ã_{i+1}) is used to compute the target value y_i. Finally, the parameters θ₁ and θ₂ of the 2 Current-Critic networks are updated with the target value y_i by minimizing the mean-squared Bellman error loss function;
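A hedged PyTorch-style sketch of the target computation and Current-Critic update of step 2.7; the noise scale eps, clip bound c, discount gamma and optimizer are illustrative placeholders, and the batch is assumed to contain float tensors of matching shapes.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_target, critic1, critic2, critic1_target, critic2_target,
                  critic_opt, gamma=0.99, eps=0.2, c=0.5):
    """One TD3 critic step: smoothed target action, min of the two target critics, MSBE loss."""
    s, a, r, s_next = batch                                   # tensors drawn from the replay memory
    with torch.no_grad():
        noise = (torch.randn_like(a) * eps).clamp(-c, c)      # clipped Gaussian smoothing noise
        a_next = actor_target(s_next) + noise                 # smoothed target action
        q_next = torch.min(critic1_target(torch.cat([s_next, a_next], dim=-1)),
                           critic2_target(torch.cat([s_next, a_next], dim=-1)))
        y = r + gamma * q_next                                # target value y_i
    q1 = critic1(torch.cat([s, a], dim=-1))
    q2 = critic2(torch.cat([s, a], dim=-1))
    loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)              # mean-squared Bellman error
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```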
Step 2.8, update the Current-Actor network parameter ω through the deterministic policy gradient algorithm;
Step 2.9, to stabilize the training process, the parameter ω′ of the Target-Actor network and the parameters θ₁′ and θ₂′ of the 2 Target-Critic networks are further updated with a soft update method;
step 3, training the model by adopting the following steps:
Step 3.1, if the inspection unmanned aerial vehicle covers all acquisition devices and no task division variable exceeds its limit, go to step 3.2; otherwise add 1 to the time slot t and compare it with T: if t < T, jump to step 2.4; if t ≥ T, go to step 3.2;
Step 3.2, add 1 to the training round counter ep and compare it with EP: if ep < EP, jump to step 2.3; otherwise (ep ≥ EP) go to step 3.3;
Step 3.3, the iteration ends and the neural network training process terminates; the current Target-Actor network data are stored and loaded into the inspection unmanned aerial vehicle system, which then executes the flight actions to complete the optimization of the delay and energy consumption of the unmanned aerial vehicle serving the power Internet of Things.
2. The computing task unloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein in step 2.7 the noisy target action is computed as ã_{i+1} = π_{ω′}(s_{i+1}) + ε̃, with ε̃ ~ clip(N(0, ε), −c, c), where the noise component ε̃ is drawn from a zero-mean normal distribution with variance ε and clipped by the constant c, and π_{ω′}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω′ evaluated at the next state s_{i+1};
the target value is y_i = r_i + γ min_{j=1,2} Q_{θ_j′}(s_{i+1}, ã_{i+1}), where r_i is the reward of the i-th sample and γ is the discount coefficient;
the parameters θ₁ and θ₂ of the 2 Current-Critic networks are updated by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss function, M is the number of randomly drawn mini-batch samples, and Q_{θ1}(s_i, a_i) and Q_{θ2}(s_i, a_i) are the state-action value functions of the Current-Critic networks with parameters θ₁ and θ₂.
3. The computing task unloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein the update formula of the Current-Actor network in step 2.8 is the deterministic policy gradient ∇_ω J(ω) = (1/M) Σ_i ∇_a Q_{θ1}(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i), where ∇_ω J(ω) denotes the policy gradient under the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ1}(s_i, a) and ∇_ω π_ω(s_i) are, respectively, the gradient of the state-action value function of the Current-Critic network with parameter θ₁ and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action policy selected by the Current-Actor network for input state s_i, and Q_{θ1}(s_i, a)|_{a=π_ω(s_i)} denotes the state-action value function of the Current-Critic network at input state s_i with action a = π_ω(s_i).
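An illustrative PyTorch-style sketch of the deterministic policy gradient step of claim 3; autograd applies the chain rule that the formula writes out explicitly, and the optimizer and tensors are placeholders.

```python
import torch

def actor_update(states, actor, critic1, actor_opt):
    """Ascend Q_theta1(s, pi_omega(s)) by minimizing its negative mean over the mini-batch."""
    actions = actor(states)                             # a = pi_omega(s_i)
    q = critic1(torch.cat([states, actions], dim=-1))   # Q_theta1(s_i, pi_omega(s_i))
    loss = -q.mean()                                     # gradient ascent on J(omega)
    actor_opt.zero_grad()
    loss.backward()                                      # autograd computes the chained gradient
    actor_opt.step()
    return -loss.item()
```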
4. The computing task unloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein in step 2.9 the parameter ω′ of the Target-Actor network is updated as ω′ ← μω + (1−μ)ω′, and the 2 Target-Critic parameters θ₁′ and θ₂′ are updated as θ₁′ ← μθ₁ + (1−μ)θ₁′, θ₂′ ← μθ₂ + (1−μ)θ₂′, where μ denotes the update scale factor.
CN202111297200.4A 2021-11-04 2021-11-04 Computing task unloading method based on deep reinforcement learning in power Internet of things Pending CN114065963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111297200.4A CN114065963A (en) 2021-11-04 2021-11-04 Computing task unloading method based on deep reinforcement learning in power Internet of things

Publications (1)

Publication Number Publication Date
CN114065963A true CN114065963A (en) 2022-02-18

Family

ID=80273861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111297200.4A Pending CN114065963A (en) 2021-11-04 2021-11-04 Computing task unloading method based on deep reinforcement learning in power Internet of things

Country Status (1)

Country Link
CN (1) CN114065963A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598702A (en) * 2022-02-24 2022-06-07 宁波大学 VR (virtual reality) service unmanned aerial vehicle edge calculation method based on deep learning
WO2023160012A1 (en) * 2022-02-25 2023-08-31 南京信息工程大学 Unmanned aerial vehicle assisted edge computing method for random inspection of power grid line
CN114727316A (en) * 2022-03-29 2022-07-08 江南大学 Internet of things transmission method and device based on depth certainty strategy
CN114727316B (en) * 2022-03-29 2023-01-06 江南大学 Internet of things transmission method and device based on depth certainty strategy
CN115022319A (en) * 2022-05-31 2022-09-06 浙江理工大学 DRL-based edge video target detection task unloading method and system
CN115249134A (en) * 2022-09-23 2022-10-28 江西锦路科技开发有限公司 Resource allocation method, device and equipment for unmanned aerial vehicle and storage medium
CN115249134B (en) * 2022-09-23 2022-12-23 江西锦路科技开发有限公司 Resource allocation method, device and equipment for unmanned aerial vehicle and storage medium
CN115686669A (en) * 2022-10-17 2023-02-03 中国矿业大学 Mine Internet of things intelligent computing unloading method assisted by energy collection
CN117149444A (en) * 2023-10-31 2023-12-01 华东交通大学 Deep neural network hybrid division method suitable for inspection system
CN117149444B (en) * 2023-10-31 2024-01-26 华东交通大学 Deep neural network hybrid division method suitable for inspection system
CN117541025A (en) * 2024-01-05 2024-02-09 南京信息工程大学 Edge calculation method for intensive transmission line inspection
CN117541025B (en) * 2024-01-05 2024-03-19 南京信息工程大学 Edge calculation method for intensive transmission line inspection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination