CN114065963A - Computing task unloading method based on deep reinforcement learning in power Internet of things

Computing task unloading method based on deep reinforcement learning in power Internet of things

Info

Publication number
CN114065963A
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
time slot
current
network
Prior art date
Legal status
Pending
Application number
CN202111297200.4A
Other languages
Chinese (zh)
Inventor
赵楠
任凡
杜威
叶智养
Current Assignee
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202111297200.4A priority Critical patent/CN114065963A/en
Publication of CN114065963A publication Critical patent/CN114065963A/en

Classifications

    • G06Q 10/20 Administration; Management; Administration of product repair or maintenance
    • G06F 18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/295 Pattern recognition; Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06Q 10/0631 Operations research, analysis or management; Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q 50/06 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors; Energy or water supply


Abstract

The invention discloses a computing task offloading method based on deep reinforcement learning in the power Internet of Things, which aims to minimize energy consumption and delay by jointly optimizing the position of the unmanned aerial vehicle (UAV), its transmit power, and the task partition variables. First, a power Internet of Things system model oriented to power transmission line inspection is established, involving the interaction among the acquisition devices, the inspection UAV, and the edge servers, and the computing task offloading problem is formulated. Then, in view of the non-convexity of the offloading problem, a Markov decision process is constructed by designing the state space, the action space, and the reward function. On this basis, since the Markov model has a continuous action space, a twin delayed deep deterministic policy gradient (TD3) algorithm is proposed to obtain the optimal task offloading strategy.

Description

Computing task unloading method based on deep reinforcement learning in power Internet of things
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a calculation task unloading method based on deep reinforcement learning in an electric power Internet of things.
Background
The power Internet of Things is an application of Internet of Things technology in the smart grid and plays a role in power generation, transmission, transformation, distribution, and consumption. It integrates communication infrastructure resources with power infrastructure resources and uses advanced information and communication technologies to exchange information among interconnected devices, thereby further improving the overall efficiency of the power system. Specifically, the power Internet of Things deploys different types of sensing devices (such as global positioning systems, cameras, infrared sensors, and radio frequency identification devices) and advanced metering infrastructure in different geographic areas, so that it can sense the physical world, acquire and monitor data for processing, and make intelligent decisions on energy management schemes.
However, the data collected by the power internet of things is likely to be huge and highly heterogeneous, which puts strict requirements on communication and computing resources. Furthermore, some real-time demand-side management schemes are typically delay sensitive, which makes the data processing task more difficult. Despite the recent increase in computing power of smart devices, complex tasks are still difficult to handle under strict latency constraints. Fortunately, mobile edge computing is an ultra-low latency technology capable of data analysis near the data source, pushing the leading edge of services and data from a centralized cloud to the edge of a mobile network with computing and storage resources, and then supporting resource intensive and delay sensitive applications with cloud computing deployed at the edge of the mobile network.
However, when mobile edge computing services are provided for devices in the power Internet of Things, a dedicated wireless connection is required between the edge server and the devices, and signal blockage, shadowing, and similar effects make it difficult for terrestrial wireless links to deliver effective mobile edge computing services. To this end, unmanned aerial vehicles (UAVs) are deployed as small edge clouds with embedded computing modules. In the power Internet of Things system, the devices need to process the generated data quickly, but their communication, computing, and storage resources are very limited, so the UAV provides computing services for them with its onboard computing module. However, given the limited battery life and computing power of UAVs, an efficient computing task offloading method needs to be designed.
Disclosure of Invention
In order to overcome the non-convexity of the existing computing task offloading problem, the invention aims to provide a deep-reinforcement-learning-based method for optimizing the delay and energy consumption of the UAV-served power Internet of Things.
In order to achieve the purpose, the invention adopts the technical scheme that: a computing task unloading method based on deep reinforcement learning in an electric power Internet of things comprises the following steps:
Step 1: establish the Markov model used by the subsequent deep reinforcement learning algorithm for the flight position of the inspection unmanned aerial vehicle (UAV):
model the quintuple (S, A, O, R, γ) of the Markov decision process (MDP), wherein S is the input state set of the inspection UAV, A is the output action set of the inspection UAV, O is the state transition probability function, R is the reward function, and γ is the discount coefficient;
Step 1.1: define the input state set S of the inspection UAV and determine the input state of the inspection UAV at each time slot t as s_t = [u_t, h_t], where u_t is the horizontal coordinate of the inspection UAV at time slot t and h_t is the height of the inspection UAV at time slot t; s_t belongs to the input state set S;
Step 1.2: define the output action set A of the inspection UAV, where A represents the set of all actions the inspection UAV can take for its input state after receiving the following external feedback: when the inspection UAV flies out of the power transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flying height h_t of the inspection UAV exceeds its height range, it remains at the minimum height H_min or the maximum height H_max; once the inspection UAV covers all the acquisition devices, it stays still and prepares to adjust its transmit power; when the task partition variables exceed their limit (the fractions processed locally and on the edge servers must not sum to more than one), the inspection UAV performs a normalization operation to obtain valid task partition variables β_{nm}^t again; the output action of the inspection UAV at each time slot t is a_t = {l_t, Φ_t, Δh_t, p_t, β_{nm}^t};
wherein l_t ∈ [0, l_1] denotes the horizontal distance the inspection UAV flies in time slot t, and l_1 denotes the maximum horizontal distance the inspection UAV can fly in each time slot t; Φ_t ∈ [0, 2π] denotes the direction angle of the inspection UAV at time slot t; Δh_t = (h_t − h_{t−1}) ∈ [−l_2, l_2] denotes the vertical movement distance of the inspection UAV in time slot t, h_t and h_{t−1} respectively denote the height of the inspection UAV at time slot t and at the previous time slot, and l_2 denotes the maximum vertical distance the inspection UAV can fly in each time slot t; p_t ∈ [0, P] denotes the transmit power of the inspection UAV at time slot t, and P denotes its maximum transmit power; β_{nm}^t denotes the fraction of the fault identification and foreign object detection task received from the nth acquisition device that is processed on the mth edge server, i.e. the task partition variables of the inspection UAV at time slot t are {β_{nm}^t, n ∈ N, m ∈ M}, where N and M respectively denote the sets of acquisition devices and edge servers in the system;
Step 1.3: define the probability that the inspection UAV, starting from the input state s_t of the current time slot t and taking action a_t, reaches the next input state s_{t+1} as the state transition probability function O;
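As an illustration only, the following is a minimal Python/NumPy sketch of how the state and action quantities of steps 1.1-1.3 might be represented and post-processed; the bounds H_MIN, H_MAX, L1, L2, P_MAX and the renormalization of the task partition variables are assumptions for illustration, not values fixed by the patent.

```python
import numpy as np

H_MIN, H_MAX = 50.0, 150.0      # assumed height range [H_min, H_max] of the inspection UAV
L1, L2, P_MAX = 20.0, 5.0, 1.0  # assumed per-slot horizontal/vertical limits and max transmit power

def clip_action(l, phi, dh, p, beta_nm, beta_n0):
    """Map a raw action onto the feasible set described in step 1.2."""
    l = np.clip(l, 0.0, L1)            # horizontal flight distance l_t in [0, l1]
    phi = np.mod(phi, 2 * np.pi)       # direction angle Phi_t in [0, 2*pi)
    dh = np.clip(dh, -L2, L2)          # vertical movement Delta h_t in [-l2, l2]
    p = np.clip(p, 0.0, P_MAX)         # transmit power p_t in [0, P]
    # Task partition variables: if the local + edge fractions of a device exceed 1,
    # renormalize them (the corrective operation of step 1.2, assumed here to be a normalization).
    beta_nm = np.clip(beta_nm, 0.0, 1.0)   # shape (N, M)
    beta_n0 = np.clip(beta_n0, 0.0, 1.0)   # shape (N,)
    total = beta_n0 + beta_nm.sum(axis=1)
    over = total > 1.0
    beta_nm[over] /= total[over, None]
    beta_n0[over] /= total[over]
    return l, phi, dh, p, beta_nm, beta_n0

def next_state(u, h, l, phi, dh):
    """State transition of the UAV position s_t = [u_t, h_t] (step 1.3)."""
    u_next = u + l * np.array([np.cos(phi), np.sin(phi)])
    h_next = np.clip(h + dh, H_MIN, H_MAX)   # stay at H_min/H_max when the height bound is hit
    return u_next, h_next
```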
Step 1.4: define the reward function R, which represents the instantaneous feedback obtained after the inspection UAV selects action a_t in input state s_t at the current time slot t. When the inspection UAV does not yet cover all the acquisition devices, the reward is dominated by the coverage penalty term weighted by ξ_1; once all the acquisition devices are covered, the reward is the negative of the sum, over all acquisition devices, of the total service delay T_t^n and the total energy consumption E_t^n, together with the penalty ξ_n^2 whenever the task partition of the nth acquisition device exceeds its limit.
Here the horizontal direction angle Φ_t and the height h_t of the inspection UAV determine the radius R_t of the range covering the acquisition devices. If the distance between the horizontal position of the inspection UAV and an acquisition device is smaller than the coverage radius, that acquisition device is covered by the inspection UAV; N' is the number of acquisition devices within the coverage range; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the fault identification and foreign object detection tasks on the nth acquisition device are partitioned beyond the limit β_{n0}^t + Σ_{m∈M} β_{nm}^t ≤ 1, where β_{nm}^t and β_{n0}^t are the fractions of the task received from the nth acquisition device in time slot t that are processed on the mth edge server and on the inspection UAV, respectively;
T_t^n is the total service delay of the nth acquisition device within time slot t, T_t^n = T_1^{n,t} + max{T_2^{n,t}, max_{m∈M}(T_1^{nm,t} + T_2^{nm,t})};
E_t^n is the total energy consumption of the inspection UAV within time slot t given the task arrival rate λ_n of the nth acquisition device, E_t^n = λ_n (E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t});
T_1^{n,t} is the uplink transmission delay from the nth acquisition device to the inspection UAV in time slot t, T_1^{n,t} = D_n / v_n^t, where D_n is the input data volume of the nth acquisition device when all its tasks are offloaded onto the inspection UAV and v_n^t is the uplink transmission data rate, v_n^t = W_n log_2(1 + p_n α_n^t / σ²), with α_n^t the uplink channel gain from the nth acquisition device to the inspection UAV, W_n and p_n respectively the bandwidth and transmit power allocated to the nth acquisition device, and σ² the noise power at the inspection UAV; E_1^{n,t} is the uplink transmission energy consumption in time slot t, E_1^{n,t} = p T_1^{n,t}, where p is the receive power of the inspection UAV;
T_2^{n,t} is the delay for the inspection UAV to compute the task received from the nth acquisition device in time slot t, T_2^{n,t} = β_{n0}^t C_n D_n / f_n, where C_n is the number of CPU cycles required to process a unit of task data and f_n is the computing resource allocated by the inspection UAV to the nth acquisition device, with {β_{n0}^t ∈ [0, 1], n ∈ N}; E_2^{n,t} is the computation energy consumption of the inspection UAV in time slot t, E_2^{n,t} = κ (f_n)³ T_2^{n,t}, where κ is the effective switched capacitance and the power consumption of the central processing unit in the inspection UAV is modeled as κ (f_n)³;
T_1^{nm,t} is the downlink transmission delay in time slot t, T_1^{nm,t} = β_{nm}^t D_n / v_m^t, where v_m^t is the downlink transmission data rate, v_m^t = W_m log_2(1 + p_t α_m^t / σ²), W_m denotes the bandwidth allocated to the mth edge server, α_m^t is the downlink channel gain from the inspection UAV to the mth edge server in time slot t, and p_t denotes the transmit power of the inspection UAV in time slot t, with {β_{nm}^t ∈ [0, 1], n ∈ N, m ∈ M}; E_1^{nm,t} is the downlink transmission energy consumption, E_1^{nm,t} = p_t T_1^{nm,t};
T_2^{nm,t} is the computation delay at the mth edge server in time slot t, T_2^{nm,t} = β_{nm}^t C_n D_n / f_{nm}, where f_{nm} is the computing resource allocated by the mth edge server to the nth acquisition device in time slot t;
Step 1.5: define γ ∈ (0, 1) as the discount coefficient used to compute the accumulated return over the whole process; the closer the discount coefficient is to 1, the more weight is given to long-term returns;
Step 1.6: over a finite horizon of T time slots, learn the optimal strategy so as to maximize the expected long-term return G_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where r_{t'} is the reward at time slot t'; the expected return obtained by taking action a_t in state s_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator;
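For clarity, the discounted return G_t of step 1.6, whose expectation gives the state-action value Q(s_t, a_t), can be computed as in the following sketch; the reward values are illustrative only.

```python
def discounted_return(rewards, gamma=0.99, t=0):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_{t'}  (step 1.6)."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

# Example: per-slot rewards of one episode (illustrative values only).
rewards = [-3.2, -2.9, -2.5, -2.6]
print(discounted_return(rewards, gamma=0.9, t=0))  # Q(s_0, a_0) is the expectation of such returns
```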
Step 2: according to the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, train the actions of the inspection UAV using the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
Step 2.1: adopt two types of neural networks, namely 2 Actor networks and 4 Critic networks. The Actor networks comprise a Current-Actor network with parameter ω and a Target-Actor network with parameter ω'; the Critic networks comprise 2 Current-Critic networks with parameters θ_1 and θ_2 and 2 Target-Critic networks with parameters θ_1' and θ_2'. Randomly initialize the experience replay memory F and the parameters of the Current-Actor and Current-Critic networks, and set ω' = ω, θ_1' = θ_1, θ_2' = θ_2;
Step 2.2: set the maximum number of training rounds EP, with T time slots in each round, and initialize the training round counter ep = 1;
Step 2.3: initialize the time slot t = 0 and the initial input state s_0 of the inspection UAV;
Step 2.4: the inspection UAV observes the input state s_t of the current power Internet of Things system and takes an action according to the policy function π_ω(s_t) of the current Current-Actor network, with Gaussian noise σ added to perturb the action, giving a_t = π_ω(s_t) + σ;
Step 2.5: in input state s_t, execute action a_t, compute the current reward r_t of the inspection UAV from the obtained instantaneous feedback according to step 1.4, and obtain the next input state s_{t+1} using steps 1.1-1.2; store the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F; the experience tuples in F are stored in order, and once their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
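A minimal sketch of the cyclic experience replay memory F of step 2.5 follows (fixed capacity, oldest entries overwritten, uniform mini-batch sampling); the capacity and batch size are assumptions.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0                       # next write position; wraps around when full

    def push(self, s, a, r, s_next):
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.pos] = Transition(s, a, r, s_next)
        self.pos = (self.pos + 1) % self.capacity   # new records overwrite the oldest (a cycle)

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)  # the mini-batch used in step 2.6

    def __len__(self):
        return len(self.buffer)
```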
Step 2.6: randomly draw a mini-batch of samples from the experience replay memory F and feed them to the Actor and Critic networks, where a sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection UAV took action a_i in state s_i, received reward r_i, and the environment transitioned to the next state s_{i+1}; "mini-batch" refers to a small batch of training data selected at random;
Step 2.7: take s_i as the input of the Current-Actor network; the updated Current-Actor network computes a new action a_i from s_i, and (s_i, a_i) is taken as the input of the two Current-Critic networks, which compute Q_{θ_1}(s_i, a_i) and Q_{θ_2}(s_i, a_i) respectively. Take s_{i+1} as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action a_{i+1} from s_{i+1}, noise is then added, and the noisy target action ã_{i+1} is computed using target policy smoothing regularization; ã_{i+1} is taken as the input of the two Target-Critic networks, which compute Q_{θ_1'}(s_{i+1}, ã_{i+1}) and Q_{θ_2'}(s_{i+1}, ã_{i+1}) respectively, and the smaller of the two is used to compute the target value y_i; finally, the target value y_i is used to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function;
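A hedged PyTorch sketch of the step 2.7 update follows: target-policy smoothing, the clipped double-Q target y_i, and minimization of the mean-squared Bellman error for the two Current-Critic networks. The network objects, optimizer, and hyperparameters (gamma, eps, c) are placeholders, not the patent's exact settings.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_target, critic1, critic2, critic1_target, critic2_target,
                  critic_optim, gamma=0.99, eps=0.2, c=0.5):
    s, a, r, s_next = batch                          # tensors sampled from the replay memory F
    with torch.no_grad():
        # Target-policy smoothing: a~_{i+1} = pi_{omega'}(s_{i+1}) + clipped Gaussian noise
        noise = (torch.randn_like(a) * eps).clamp(-c, c)
        a_next = actor_target(s_next) + noise
        # Clipped double-Q target: y_i = r_i + gamma * min_j Q_{theta'_j}(s_{i+1}, a~_{i+1})
        q_next = torch.min(critic1_target(s_next, a_next), critic2_target(s_next, a_next))
        y = r + gamma * q_next
    # Minimize the mean-squared Bellman error for both Current-Critic networks
    loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```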
Step 2.8: update the Current-Actor network parameter ω through the deterministic policy gradient algorithm;
Step 2.9: to stabilize the training process, further update the Target-Actor parameter ω' and the parameters θ_1' and θ_2' of the 2 Target-Critic networks using the soft update method;
step 3, training the model by adopting the following steps:
Step 3.1: if the inspection UAV covers all the acquisition devices and the task partition variables do not exceed their limit, go to step 3.2; otherwise add 1 to t and check it: if t < T, jump to step 2.4, and if t ≥ T, go to step 3.2;
Step 3.2: add 1 to the training round counter ep and check it: if ep < EP, jump to step 2.3; otherwise, when ep ≥ EP, go to step 3.3;
Step 3.3: the iteration is finished and the neural network training process terminates; store the current Target-Actor and Target-Critic network data and load the stored data into the inspection UAV system, which then executes the flight actions, completing the optimization of the delay and energy consumption of the UAV-served power Internet of Things.
Further, in step 2.7 the noisy target action is computed as ã_{i+1} = π_{ω'}(s_{i+1}) + ε̃, where the noise component ε̃ ~ clip(N(0, ε), −c, c) is Gaussian noise with zero mean and variance ε clipped by the constant c, and π_{ω'}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω' evaluated at the next state s_{i+1};
the target value is y_i = r_i + γ min_{j=1,2} Q_{θ_j'}(s_{i+1}, ã_{i+1}), where r_i is the reward and γ is the discount factor;
the parameters θ_1 and θ_2 of the 2 Current-Critic networks are updated by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss, M is the number of randomly drawn mini-batch samples, and Q_{θ_j}(s_i, a_i), j = 1, 2, are the state-action value functions of the two Current-Critic networks with parameters θ_1 and θ_2.
Further, the update formula of the Current-Actor network in step 2.8 is:
∇_ω J(ω) ≈ (1/M) Σ_i ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i),
where ∇_ω J(ω) denotes the policy gradient with respect to the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} and ∇_ω π_ω(s_i) are respectively the gradient of the state-action value function of the Current-Critic network with parameter θ_1 and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action selected by the Current-Actor network for input state s_i, and Q_{θ_1}(s_i, π_ω(s_i)) denotes the Current-Critic state-action value function for input state s_i and action a = π_ω(s_i).
Further, in step 2.9 the Target-Actor parameter ω' is updated as ω' ← μω + (1 − μ)ω', and the 2 Target-Critic parameters θ_1' and θ_2' are updated as θ_1' ← μθ_1 + (1 − μ)θ_1' and θ_2' ← μθ_2 + (1 − μ)θ_2', where μ denotes the update scale factor.
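A short sketch of the soft (Polyak) update of step 2.9, applied in the same way to the Target-Actor and both Target-Critic networks; the scale factor mu is illustrative.

```python
import torch

@torch.no_grad()
def soft_update(current_net, target_net, mu=0.005):
    """theta' <- mu * theta + (1 - mu) * theta' for every parameter pair."""
    for p, p_targ in zip(current_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - mu)
        p_targ.add_(mu * p)
```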
The invention has the beneficial effects that:
(1) the electric power internet of things computing task unloading model for actual transmission line inspection established by the method is complete, and the inspection unmanned aerial vehicle finds the optimal target strategy for computing task unloading through continuous interaction of the inspection unmanned aerial vehicle and the system environment due to unknown model environment, so that the method has high practical application value.
(2) The method uses the twin delayed deep deterministic policy gradient (TD3) algorithm, which effectively handles the continuous control of the inspection UAV; the Target-Actor and Target-Critic networks effectively alleviate the persistent overestimation of the state-action value of the inspection UAV, so that training is more stable and a better offloading strategy is found.
(3) The method combines reinforcement learning and the deep neural network, improves the learning ability and the generalization ability of the inspection unmanned aerial vehicle, avoids the complexity and the sparseness when the inspection unmanned aerial vehicle is manually operated to unload the calculation task in an uncertain environment, and ensures that the inspection unmanned aerial vehicle can safely and efficiently complete the unloading of the calculation task.
Drawings
FIG. 1 is a general framework diagram of a TD 3-based computing task offloading method in a power Internet of things;
fig. 2 is a flowchart of a calculation task unloading algorithm based on TD3 in the power internet of things.
Detailed Description
The present invention will be described in further detail with reference to examples for the purpose of facilitating understanding and practice of the invention by those of ordinary skill in the art, and it is to be understood that the present invention has been described in the illustrative embodiments and is not to be construed as limited thereto.
Firstly, a power Internet of Things system model oriented to power transmission line inspection is established, involving the interaction among the acquisition devices, the inspection UAV, and the edge servers, and the computing task offloading problem is formulated. Secondly, in view of the non-convexity of the task offloading problem, the invention provides a deep-reinforcement-learning-based method in which a Markov decision process is constructed by designing the state, the action space, and the reward function. On this basis, since the Markov model has a continuous action space, a twin delayed deep deterministic policy gradient (TD3) algorithm is proposed to obtain the optimal task offloading strategy.
The optimal strategy jointly optimizes the position q_t of the inspection UAV, the task partition variables β_{nm}^t and β_{n0}^t, and the transmit power p_t so as to minimize the sum of the total service delay of all the acquisition devices and the total energy consumption of the inspection UAV, namely:
minimize over {q_t, β_{nm}^t, β_{n0}^t, p_t} the objective Σ_{t=1}^{T} Σ_{n∈N} (T_t^n + E_t^n),
subject to H_min ≤ h_t ≤ H_max, ||u_{t+1} − u_t|| ≤ l_1, |h_{t+1} − h_t| ≤ l_2, 0 ≤ p_t ≤ P, and β_{nm}^t, β_{n0}^t ∈ [0, 1] with β_{n0}^t + Σ_{m∈M} β_{nm}^t ≤ 1.
The position of the inspection UAV in time slot t is denoted q_t = (u_t, h_t), where u_t is the horizontal coordinate of the inspection UAV and h_t ∈ [H_min, H_max] is its height, H_min and H_max being the minimum and maximum flying heights; the maximum horizontal and vertical distances the inspection UAV can fly in each time slot t are l_1 and l_2, respectively. N and M denote the sets of acquisition devices and edge servers, T_t^n denotes the total service delay of the nth acquisition device within time slot t, E_t^n denotes the total energy consumption of the inspection UAV for offloading and computing the task of the nth acquisition device within time slot t, P is the maximum transmit power of the inspection UAV, ||·|| denotes the Euclidean norm of a vector, and u_{t+1} and h_{t+1} denote the horizontal coordinate and the height of the inspection UAV in the next time slot. From this formulation of task offloading for fault identification, foreign object detection, and similar tasks in power transmission line inspection, it can be seen that the problem is strongly non-convex and combinatorial, so a globally optimal solution is difficult to find. In addition, because the information and channel conditions of the acquisition devices in the power Internet of Things system for transmission line inspection are not easy to obtain, traditional optimization strategies are even more challenging to apply.
A computing task unloading method based on deep reinforcement learning in an electric power Internet of things comprises the following steps:
step 1, establishing a Markov model of a follow-up depth reinforcement learning algorithm of the flight position of the inspection unmanned aerial vehicle:
model the quintuple (S, A, O, R, γ) of the Markov decision process (MDP), wherein S is the input state set of the inspection UAV, A is the output action set of the inspection UAV, O is the state transition probability function, R is the reward function, and γ is the discount coefficient; s_t belongs to the input state set S;
Step 1.1: determine the input state of the inspection UAV at each time slot t as s_t = [u_t, h_t], where u_t is the horizontal coordinate of the inspection UAV at time slot t and h_t is the height of the inspection UAV at time slot t;
Step 1.2: define the output action set A of the inspection UAV, where A represents the set of all actions the inspection UAV can take for its input state after receiving the following external feedback: when the inspection UAV flies out of the power transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flying height h_t of the inspection UAV exceeds its height range, it remains at the minimum height H_min or the maximum height H_max; once the inspection UAV covers all the acquisition devices, it stays still and prepares to adjust its transmit power; when the task partition variables exceed their limit, the inspection UAV performs a normalization operation to obtain valid task partition variables β_{nm}^t again; the output action of the inspection UAV at each time slot t is a_t = {l_t, Φ_t, Δh_t, p_t, β_{nm}^t};
wherein l_t ∈ [0, l_1] denotes the horizontal distance the inspection UAV flies in time slot t, and l_1 denotes the maximum horizontal distance the inspection UAV can fly in each time slot t; Φ_t ∈ [0, 2π] denotes the direction angle of the inspection UAV at time slot t; Δh_t = (h_t − h_{t−1}) ∈ [−l_2, l_2] denotes the vertical movement distance of the inspection UAV in time slot t, h_t and h_{t−1} respectively denote the height of the inspection UAV at time slot t and at the previous time slot, and l_2 denotes the maximum vertical distance the inspection UAV can fly in each time slot t; p_t ∈ [0, P] denotes the transmit power of the inspection UAV at time slot t, and P denotes its maximum transmit power; β_{nm}^t denotes the fraction of the fault identification and foreign object detection task received from the nth acquisition device that is processed on the mth edge server, i.e. the task partition variables of the inspection UAV at time slot t are {β_{nm}^t, n ∈ N, m ∈ M}, where N and M respectively denote the sets of acquisition devices and edge servers in the system;
Step 1.3: define the probability that the inspection UAV, starting from the input state s_t of the current time slot t and taking action a_t, reaches the next input state s_{t+1} as the state transition probability function O;
Step 1.4: define the reward function R, which represents the instantaneous feedback obtained after the inspection UAV selects action a_t in input state s_t at the current time slot t. When the inspection UAV does not yet cover all the acquisition devices, the reward is dominated by the coverage penalty term weighted by ξ_1; once all the acquisition devices are covered, the reward is the negative of the sum, over all acquisition devices, of the total service delay T_t^n and the total energy consumption E_t^n, together with the penalty ξ_n^2 whenever the task partition of the nth acquisition device exceeds its limit.
Here the horizontal direction angle Φ_t and the height h_t of the inspection UAV determine the radius R_t of the range covering the acquisition devices. If the distance between the horizontal position of the inspection UAV (whose state is s_t = [u_t, h_t]) and an acquisition device is smaller than the coverage radius, that acquisition device is covered by the inspection UAV; the horizontal coordinates of the acquisition devices are fixed, known quantities; N' is the number of acquisition devices covered by the inspection UAV; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the fault identification and foreign object detection tasks on the nth acquisition device are partitioned beyond the limit β_{n0}^t + Σ_{m∈M} β_{nm}^t ≤ 1; β_{nm}^t, β_{n0}^t, T_t^n and E_t^n are obtained from the power Internet of Things system model, where β_{nm}^t and β_{n0}^t are the fractions of the task received from the nth acquisition device in time slot t that are processed on the mth edge server and on the inspection UAV, respectively;
T_t^n is the total service delay of the nth acquisition device within time slot t, T_t^n = T_1^{n,t} + max{T_2^{n,t}, max_{m∈M}(T_1^{nm,t} + T_2^{nm,t})};
E_t^n is the total energy consumption of the inspection UAV within time slot t given the task arrival rate λ_n of the nth acquisition device, E_t^n = λ_n (E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t});
A. electric power internet of things system model
The power internet of things system for power transmission line inspection comprises 1 inspection unmanned aerial vehicle, N information acquisition devices such as miniature meteorological stations, sensors and high-definition cameras and M edge servers. The collection of the acquisition devices and the edge servers are denoted as N and M, respectively. After receiving the calculation tasks such as fault identification and foreign object detection of the acquisition equipment through the uplink communication link, the patrol unmanned aerial vehicle distributes part or all of the tasks to the edge server through the downlink communication link.
1) Uplink transmission delay and energy consumption:
It is assumed that the acquisition devices share a broadband frequency division multiple access protocol when offloading tasks such as fault identification and foreign object detection. According to the Shannon law, considering the uplink channel gain α_n^t from the nth acquisition device to the inspection UAV in time slot t, the achievable uplink transmission data rate is v_n^t = W_n log_2(1 + p_n α_n^t / σ²), where W_n and p_n respectively denote the bandwidth and transmit power allocated to the nth acquisition device, and σ² denotes the noise power at the inspection UAV. Thus, considering the input data volume D_n when all the tasks of the nth acquisition device are offloaded onto the inspection UAV, the uplink transmission delay in time slot t is T_1^{n,t} = D_n / v_n^t. Considering the receive power p of the inspection UAV, the uplink transmission energy consumption in time slot t is E_1^{n,t} = p T_1^{n,t}.
2) Patrol and examine unmanned aerial vehicle's calculation time delay and energy consumption:
After receiving the fault identification and foreign object detection tasks from the acquisition devices, the inspection UAV decides which portion of the tasks to process itself and which portion to offload to the edge servers. {β_{n0}^t ∈ [0, 1], n ∈ N} denotes the fraction of the task received from the nth acquisition device in time slot t that is processed on the inspection UAV, where N denotes the set of acquisition devices in the system. Thus, considering the number of CPU cycles C_n required to process a unit of task data and the computing resource f_n allocated by the inspection UAV to the nth acquisition device, the computation delay of the inspection UAV in time slot t is T_2^{n,t} = β_{n0}^t C_n D_n / f_n, where D_n denotes the task input data volume of the nth acquisition device. Furthermore, modeling the power consumption of the central processing unit in the inspection UAV as κ (f_n)³, with κ the effective switched capacitance, the computation energy consumption of the inspection UAV in time slot t is E_2^{n,t} = κ (f_n)³ T_2^{n,t}.
3) Downlink transmission delay and energy consumption:
For simplicity, it is assumed that the noise power at the inspection UAV and at the edge servers is the same. Considering the downlink channel gain α_m^t from the inspection UAV to the mth edge server in time slot t, the downlink transmission data rate is v_m^t = W_m log_2(1 + p_t α_m^t / σ²), where W_m and σ² respectively denote the bandwidth allocated to the mth edge server and the noise power at the edge server, and p_t denotes the transmit power of the inspection UAV in time slot t. Because the edge servers have more communication, computing, and storage resources than the inspection UAV, the inspection UAV is more likely to offload tasks such as fault identification and foreign object detection to the edge servers so as to reduce the system delay. Considering the fraction {β_{nm}^t ∈ [0, 1], n ∈ N, m ∈ M} of the task received from the nth acquisition device in time slot t that is processed on the mth edge server, where N and M respectively denote the sets of acquisition devices and edge servers in the system, the downlink transmission delay is T_1^{nm,t} = β_{nm}^t D_n / v_m^t, where D_n denotes the task input data volume of the nth acquisition device. Furthermore, considering the transmit power p_t of the inspection UAV in time slot t, the downlink transmission energy consumption is E_1^{nm,t} = p_t T_1^{nm,t}.
4) Calculating time delay of the edge server:
The edge server starts its computation only after receiving all the task data (fault identification, foreign object detection, and the like) offloaded by the inspection UAV. Therefore, considering the computing resource f_{nm} allocated by the mth edge server to the nth acquisition device in time slot t, the computation delay at the mth edge server is T_2^{nm,t} = β_{nm}^t C_n D_n / f_{nm}, where C_n denotes the number of CPU cycles required to process a unit of task data, D_n denotes the task input data volume of the nth acquisition device, and β_{nm}^t denotes the fraction of the task received from the nth acquisition device in time slot t that is processed on the mth edge server.
5) Total service delay and energy consumption of the system:
The inspection UAV cannot start processing tasks such as fault identification and foreign object detection until the uplink communication channel has completed the full data transfer. Since the communication and computation modules on the UAV are separate, the inspection UAV can offload tasks to the edge servers through the downlink communication channel while performing local task computation. Furthermore, an edge server can start task processing only after all of its data transfer is completed. Therefore, in time slot t, the total service delay of the nth acquisition device is
T_t^n = T_1^{n,t} + max{T_2^{n,t}, max_{m∈M}(T_1^{nm,t} + T_2^{nm,t})},
where T_1^{n,t} is the uplink transmission delay from the nth acquisition device to the inspection UAV in time slot t, T_2^{n,t} is the delay for the inspection UAV to compute the task received from the nth acquisition device in time slot t, T_1^{nm,t} is the downlink transmission delay from the inspection UAV to the mth edge server in time slot t, T_2^{nm,t} is the computation delay at the mth edge server in time slot t, and M denotes the set of edge servers in the system.
Furthermore, in time slot t, considering the task arrival rate λ_n of the nth acquisition device, the total energy consumption of the inspection UAV is
E_t^n = λ_n (E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t}),
where E_1^{n,t} is the uplink transmission energy consumption for receiving the task from the nth acquisition device in time slot t, E_2^{n,t} is the energy consumption for the inspection UAV to process the fault identification and foreign object detection tasks of the nth acquisition device in time slot t, and E_1^{nm,t} is the downlink transmission energy consumption for offloading the task to the mth edge server in time slot t.
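To make sections 1)-5) of the system model concrete, the following sketch evaluates the uplink/downlink rates, the four delay terms, and the energy terms for one acquisition device n in one time slot. All parameters are illustrative assumptions, and the aggregation of T_t^n follows the reading above that local computation and edge offloading proceed in parallel after the uplink transfer.

```python
import numpy as np

def service_delay_and_energy(D_n, C_n, beta_n0, beta_nm,      # task size, CPU cycles/bit, partitions
                             W_n, p_n, alpha_n, sigma2,        # uplink bandwidth/power/gain, noise power
                             W_m, p_t, alpha_m,                # per-server downlink bandwidth/gain, UAV tx power
                             f_n, f_nm, p_rx, kappa, lam):     # CPU freqs, UAV rx power, switch cap, arrival rate
    # 1) Uplink (device -> UAV): Shannon rate, transmission delay and energy
    v_n = W_n * np.log2(1 + p_n * alpha_n / sigma2)
    T1_n = D_n / v_n
    E1_n = p_rx * T1_n
    # 2) Local computation on the UAV for the fraction beta_n0
    T2_n = beta_n0 * C_n * D_n / f_n
    E2_n = kappa * f_n ** 3 * T2_n
    # 3)-4) Downlink (UAV -> edge server m) and edge computation for the fractions beta_nm (arrays over M)
    v_m = W_m * np.log2(1 + p_t * alpha_m / sigma2)
    T1_nm = beta_nm * D_n / v_m
    T2_nm = beta_nm * C_n * D_n / f_nm
    E1_nm = p_t * T1_nm
    # 5) Total service delay: uplink first, then the local and edge branches run in parallel
    T_total = T1_n + max(T2_n, np.max(T1_nm + T2_nm))
    E_total = lam * (E1_n + E2_n + np.sum(E1_nm))
    return T_total, E_total
```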
Considering that the network information in the communication process of the power Internet of Things system for transmission line inspection is unknown, strategy optimization based on the Markov model cannot adopt traditional iterative methods. In deep reinforcement learning, the inspection UAV only needs to optimize its action from the state information and does not require complete environment information. Considering that the state and action spaces in this task offloading context (fault identification, foreign object detection, and the like) are continuous, and that the value function is persistently overestimated when training a deep deterministic policy gradient algorithm, the twin delayed deep deterministic policy gradient (TD3) algorithm is adopted, which combines a policy-function-based Actor network with a value-function-based Critic network.
Step 1.5, defining gamma epsilon (0,1) as a discount coefficient for calculating a return accumulated value in the whole process, wherein the closer the discount coefficient is to 1, the more important the long-term income is;
step 1.6, during a limited period T, learning the optimal strategy to obtain the maximum predictionThe long term gain of the period is
Figure BDA0003338015070000131
Wherein r ist′Is a reward function at time slot t', will state stCorresponding to action atThe expected reward under is recorded as the state-action cost function Q(s)t,at)=E[Gt|st,at]Wherein E [ alpha ], [ alpha ]]Is the desired operator.
The TD3 architecture consists of a Current-Network and a Target-Network. The Current-Network comprises a Current-Actor network and Current-Critic networks: the Current-Actor network with parameter ω selects an action through the policy function π_ω(s_t) and passes it to the Current-Critic networks, which estimate the state-action value functions Q_{θ_1}(s_t, a_t) and Q_{θ_2}(s_t, a_t). The two Current-Critic networks have identical structure but separately initialized and trained parameters θ_1 and θ_2; they compute Q values from the input state and the selected action and update the Current-Actor network based on the deterministic policy gradient. Since the Current-Network persistently overestimates the Q value, which makes network training unstable, TD3 uses the Target-Network to slowly track the updates of the Current-Actor and Current-Critic networks. The Target-Network comprises a Target-Actor network with parameter ω' and two Target-Critic networks with parameters θ_1' and θ_2'. The Target-Actor network passes its policy function to the Target-Critic networks, which estimate state-action values; the two Target-Critic networks have identical structure but separately initialized and trained parameters, compute Q values from the input state and the selected action, compute the target value based on the smaller of the two Q values, and thereby update the Current-Critic networks.
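A possible PyTorch realization of the Actor and Critic networks described above is sketched below (the 2 Actor and 4 Critic networks are instances of these two classes); the layer sizes, dimensions, and tanh output scaling are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Tanh())
        self.max_action = max_action          # actions are later rescaled to the physical ranges

    def forward(self, s):
        return self.max_action * self.net(s)  # policy function pi_omega(s)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))   # state-action value Q_theta(s, a)

# Current networks and their Target copies initialized with identical parameters (omega' = omega, theta'_j = theta_j)
actor = Actor(3, 5); critic1, critic2 = Critic(3, 5), Critic(3, 5)
actor_target, critic1_target, critic2_target = map(copy.deepcopy, (actor, critic1, critic2))
```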
Step 2, according to the Markov decision quintuple (S, A, O, R, gamma) modeled in the step 1, the action training of the unmanned aerial vehicle is realized by using a double-delay depth certainty strategy gradient (TD3) algorithm, and the steps are as follows:
step 2.1, adopting 2 Actor networks and 4 Critic networks, wherein the Actor networks comprise a Current-Actor network with a parameter of omega and a Target-Actor network with a parameter of omega'; the Critic network includes 2 parameters of theta1、θ2The Current-critical network and 2 parameters are theta1’、θ2'and randomly initializing parameters of the empirical playback memory F, the Current-Actor network, and the Current-Critic network, setting ω' ═ ω, θ1’=θ1、θ2’=θ2
Step 2.2, setting a maximum training process round number EP, setting T time slots in each round, and initializing a training round number EP to be 1;
step 2.3, initializing T to 0, initializing time slot T to 0, and initializing the input state s of the inspection unmanned aerial vehicle0
Step 2.4, observing the input state s of the current electric power Internet of things system by the inspection unmanned aerial vehicletAnd according to a policy function pi of the initial Current-Actor networkω(st) Taking action, adding Gaussian noise sigma to disturb the action to obtain action at=πω(st) + sigma; policy function piω(st) Is that the patrol unmanned aerial vehicle decides at the input state stTake output action atThe manner of (a);
step 2.5, in input state stThen, perform action atAnd calculating the current reward function r of the inspection unmanned aerial vehicle according to the step 1.4 by the obtained instant feedbacktAnd (3) utilizing the steps 1.1-1.2 to obtain the next input state s againt+1An experience bar(s)t,at,rt,st+1) Storing the experience strips in the experience playback memory F into the experience playback memory F, wherein the experience strips in the experience playback memory F are stored in sequence, and when the number of the experience strips reaches the capacity of the experience playback memory F, new records are stored from the beginning to form a cycle;
step 2.2-step 2.5 are strategy exploration phases, and off-line setting is carried outThe standby environment is used as a training environment, and the Current-Actor network observes the input state s of the time slot ttAnd as input, output the recommended action atAnd selecting a by the inspection unmanned aerial vehicletThe state is represented by stIs converted into st+1And gives a reward feedback rtThe Current-Actor network follows the new state st+1And generating a new recommended action, and repeating the steps to search the strategy, wherein the experience record of the strategy search is stored as the experience in the experience playback memory F.
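A brief sketch of this exploration phase (steps 2.4-2.5): the Current-Actor proposes an action for the observed state, Gaussian noise σ perturbs it, and the resulting transition is stored in the replay memory. The objects `env`, `actor`, and `memory` stand for the environment, Actor network, and ReplayMemory of the other sketches and are assumptions.

```python
import torch

def explore_one_slot(env, actor, memory, sigma=0.1):
    s = torch.as_tensor(env.observe(), dtype=torch.float32)   # input state s_t of the current slot
    with torch.no_grad():
        a = actor(s)
        a = a + sigma * torch.randn_like(a)                   # a_t = pi_omega(s_t) + Gaussian noise
    s_next, r = env.step(a.numpy())                           # execute a_t, receive reward r_t (step 1.4)
    memory.push(s.numpy(), a.numpy(), r, s_next)              # store the experience (s_t, a_t, r_t, s_{t+1})
    return s_next, r
```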
The network update phase is as follows: the Current-Actor network no longer needs to interact with the environment; instead, experience replay is carried out by sampling from the experience replay memory F.
Step 2.6, randomly extracting the mini-batch sample from the empirical playback memory F and inputting the mini-batch sample(s) into the Actor and Critic networki,ai,ri,si+1) Indicating patrol unmanned aerial vehicle at state siAt the same time take action aiGive a reward of riAnd converting the environment to the next state si+1Wherein, mini-batch refers to randomly selecting a small batch of data in training data;
Step 2.7: take s_i as the input of the Current-Actor network; the updated Current-Actor network computes a new action a_i from s_i, and (s_i, a_i) is taken as the input of the two Current-Critic networks, which compute Q_{θ_1}(s_i, a_i) and Q_{θ_2}(s_i, a_i) respectively. Take s_{i+1} as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action a_{i+1} from s_{i+1}, noise is then added, and the noisy target action is computed using target policy smoothing regularization as ã_{i+1} = π_{ω'}(s_{i+1}) + ε̃, where the noise component ε̃ ~ clip(N(0, ε), −c, c) is Gaussian noise with zero mean and variance ε clipped by the constant c, and π_{ω'}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω' evaluated at the next state s_{i+1}.
ã_{i+1} is taken as the input of the two Target-Critic networks, which compute Q_{θ_1'}(s_{i+1}, ã_{i+1}) and Q_{θ_2'}(s_{i+1}, ã_{i+1}) respectively, and the smaller of the two is used to compute the target value y_i = r_i + γ min_{j=1,2} Q_{θ_j'}(s_{i+1}, ã_{i+1}), where r_i is the reward and γ is the discount factor. Finally, the target value y_i is used to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss, M is the number of randomly drawn mini-batch samples, and Q_{θ_j}(s_i, a_i), j = 1, 2, are the state-action value functions of the two Current-Critic networks with parameters θ_1 and θ_2;
Step 2.8: update the Current-Actor network parameter ω through the deterministic policy gradient algorithm:
∇_ω J(ω) ≈ (1/M) Σ_i ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i),
where ∇_ω J(ω) denotes the policy gradient with respect to the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} and ∇_ω π_ω(s_i) are respectively the gradient of the state-action value function of the Current-Critic network with parameter θ_1 and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action selected by the Current-Actor network for input state s_i, and Q_{θ_1}(s_i, π_ω(s_i)) denotes the Current-Critic state-action value function for input state s_i and action a = π_ω(s_i);
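A sketch of this step 2.8 deterministic policy gradient update: the Current-Actor is updated by ascending Q_{θ_1}(s_i, π_ω(s_i)), i.e. by minimizing its negative mean over the mini-batch; the optimizer object is an assumption.

```python
def actor_update(batch_states, actor, critic1, actor_optim):
    # J(omega) is estimated by the mini-batch mean of Q_{theta_1}(s_i, pi_omega(s_i));
    # minimizing -J realizes the deterministic policy gradient ascent of step 2.8.
    loss = -critic1(batch_states, actor(batch_states)).mean()
    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
    return loss.item()
```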
Step 2.9: to stabilize the training process, further update the Target-Actor parameter ω' using the soft update method: ω' ← μω + (1 − μ)ω', and update the 2 Target-Critic parameters θ_1' and θ_2': θ_1' ← μθ_1 + (1 − μ)θ_1', θ_2' ← μθ_2 + (1 − μ)θ_2', where μ denotes the update scale factor.
TD3 reduces the update frequency of the Current-Actor network and the Target-Network: the Current-Actor network and the Target-Network are updated once for every d updates of the Current-Critic networks.
Step 3, training the model by adopting the following steps:
Step 3.1: if the inspection UAV covers all the acquisition devices and the task partition variables do not exceed their limit, go to step 3.2; otherwise add 1 to t and check it: if t < T, jump to step 2.4, and if t ≥ T, go to step 3.2;
Step 3.2: add 1 to the training round counter ep and check it: if ep < EP, jump to step 2.3; otherwise, when ep ≥ EP, go to step 3.3;
Step 3.3: the iteration is finished and the neural network training process terminates; store the current Target-Actor and Target-Critic network data and load the stored data into the inspection UAV system, which then executes the flight actions, completing the optimization of the delay and energy consumption of the UAV-served power Internet of Things.
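Putting the pieces together, a condensed sketch of the step 2-3 training procedure follows (exploration, critic updates every slot, delayed actor and soft target updates every d slots). It reuses the helper sketches above; `to_tensors` and the `env` methods are assumed helpers, and all hyperparameters are illustrative.

```python
def train(env, actor, critics, targets, optims, memory, EP=500, T=200, d=2,
          sigma=0.1, batch_size=64):
    critic1, critic2 = critics
    actor_target, critic1_target, critic2_target = targets
    actor_optim, critic_optim = optims
    for ep in range(EP):                                   # training rounds (step 2.2)
        env.reset()                                        # initial state s_0 (step 2.3)
        for t in range(T):                                 # time slots within one round
            explore_one_slot(env, actor, memory, sigma)    # steps 2.4-2.5
            if len(memory) < batch_size:
                continue
            batch = to_tensors(memory.sample(batch_size))  # step 2.6 (to_tensors: assumed helper)
            critic_update(batch, actor_target, critic1, critic2,
                          critic1_target, critic2_target, critic_optim)   # step 2.7
            if t % d == 0:                                 # delayed updates (TD3)
                actor_update(batch[0], actor, critic1, actor_optim)       # step 2.8
                for cur, tgt in ((actor, actor_target), (critic1, critic1_target),
                                 (critic2, critic2_target)):
                    soft_update(cur, tgt)                  # step 2.9
            if env.all_devices_covered() and env.partition_valid():
                break                                      # step 3.1 early-exit condition
```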

Claims (4)

1. A computing task unloading method based on deep reinforcement learning in an electric power Internet of things is characterized by comprising the following steps:
step 1, establishing a Markov model of a follow-up depth reinforcement learning algorithm of the flight position of the inspection unmanned aerial vehicle:
modeling a quintuple (S, A, O, R, gamma) of the MDP, wherein S is an input state set of the inspection unmanned aerial vehicle, A is an output action set of the inspection unmanned aerial vehicle, O is a state transition probability function, R is a reward function, and gamma is a discount coefficient;
step 1.1, defining an input state set S of the inspection unmanned aerial vehicle, and determining the input state S of the inspection unmanned aerial vehicle at each time slot tt=[ut ht],utHorizontal coordinate, h, of the patrol drone at time slot ttThe height of the unmanned aerial vehicle is patrolled and examined in a time slot t; stBelongs to the input state set S;
step 1.2, defining the output action set A of the inspection unmanned aerial vehicle, where A is the set of all actions the inspection unmanned aerial vehicle may take for its input state after receiving the following external feedback: when the unmanned aerial vehicle flies out of the power transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flying height h_t exceeds its allowed range, it is held at the minimum height H_min or the maximum height H_max; once the inspection unmanned aerial vehicle covers all acquisition devices, it stays in place and prepares to adjust its transmit power; when the task division variables violate their limit (the limit condition and the corresponding correction operation are given as formula images in the original claims), the inspection unmanned aerial vehicle re-selects the task division variables β_nm^t. The output action of the inspection unmanned aerial vehicle at each time slot t is a_t = [l_t, Φ_t, Δh_t, p_t, β_nm^t];
where l_t ∈ {0, l_1} denotes the horizontal distance flown by the unmanned aerial vehicle at time slot t, and l_1 is the maximum horizontal distance the inspection unmanned aerial vehicle can fly in each time slot t; Φ_t ∈ {0, 2π} denotes the direction angle of the inspection unmanned aerial vehicle at time slot t; Δh_t = (h_t − h_{t−1}) ∈ {−l_2, l_2} denotes the vertical movement distance of the inspection unmanned aerial vehicle at time slot t, h_t and h_{t−1} are the heights of the inspection unmanned aerial vehicle at time slot t and at the previous time slot, and l_2 is the maximum vertical distance flown in each time slot t; p_t ∈ {0, P} denotes the transmit power of the inspection unmanned aerial vehicle at time slot t, and P is its maximum transmit power; β_nm^t denotes the fraction of the fault identification and foreign-object detection task received from the n-th acquisition device that is processed on the m-th edge server, i.e. the task division variables of the inspection unmanned aerial vehicle at time slot t are {β_nm^t, n ∈ N, m ∈ M}, where N and M denote the sets of acquisition devices and edge servers in the system, respectively;
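Purely as an illustrative sketch (not the patented procedure), the bounded action vector a_t described above could be produced from a raw policy output by clipping each component to its range; the bounds l1, l2, P and the device/server counts are symbolic placeholders.

```python
import numpy as np

def clip_action(raw, l1, l2, P, n_devices, n_servers):
    """Map a raw policy output to the bounded action a_t = [l_t, Phi_t, dh_t, p_t, beta_nm_t]."""
    l_t   = np.clip(raw[0], 0.0, l1)            # horizontal distance in [0, l1]
    phi_t = np.clip(raw[1], 0.0, 2 * np.pi)     # direction angle in [0, 2*pi]
    dh_t  = np.clip(raw[2], -l2, l2)            # vertical move in [-l2, l2]
    p_t   = np.clip(raw[3], 0.0, P)             # transmit power in [0, P]
    # remaining entries are the task split ratios; raw is assumed long enough to hold them
    beta  = np.clip(raw[4:], 0.0, 1.0).reshape(n_devices, n_servers)
    return l_t, phi_t, dh_t, p_t, beta
```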
step 1.3, defining the state transition probability function O as the probability that the inspection unmanned aerial vehicle, in the input state s_t of the current time slot t, takes action a_t and reaches the next input state s_{t+1};
step 1.4, defining the reward function R as the instantaneous feedback obtained after the inspection unmanned aerial vehicle selects action a_t in the input state s_t of the current time slot t. The reward takes one of two forms according to a condition on coverage and on the task division limits; the condition and both reward expressions are given as formula images in the original claims and, in brief, combine the penalty factors ξ_1 and ξ_n^2 when the condition is not met, and the total service delay and total energy consumption otherwise.
The horizontal direction angle Φ_t and the height h_t of the inspection unmanned aerial vehicle determine the radius R_t of the range covering the acquisition devices (formula image in the original claims). If the distance between the horizontal position of the inspection unmanned aerial vehicle and an acquisition device is smaller than the coverage radius, that acquisition device is covered by the inspection unmanned aerial vehicle. N′ is the number of acquisition devices within the coverage range, ξ_1 is a penalty factor related to the degree of coverage, and ξ_n^2 is the penalty factor applied when the division of the fault identification and foreign-object detection task of the n-th acquisition device exceeds its limit (the limit condition is given as a formula image), where β_nm^t and β_n0^t are the fractions of the task received from the n-th acquisition device at time slot t that are processed on the m-th edge server and on the inspection unmanned aerial vehicle, respectively.
T_t^n is the total service delay of the n-th acquisition device within time slot t, and E_t^n is the total energy consumption of the inspection unmanned aerial vehicle within time slot t under the task arrival rate λ_n of the n-th acquisition device (both defined by formula images in the original claims);
T1_{n,t} is the uplink transmission delay from the n-th acquisition device to the inspection unmanned aerial vehicle at time slot t, T1_{n,t} = D_n / v_n^t, where D_n is the input data volume of the n-th acquisition device when all of its tasks are offloaded onto the inspection unmanned aerial vehicle and v_n^t is the uplink data rate, v_n^t = W_n log_2(1 + p_n α_n^t / σ²), where α_n^t is the uplink channel gain from the n-th acquisition device to the inspection unmanned aerial vehicle, W_n and p_n denote the bandwidth and transmit power allocated to the n-th acquisition device, and σ² is the noise power at the inspection unmanned aerial vehicle; E1_{n,t} is the uplink transmission energy consumption at time slot t, E1_{n,t} = p_n T1_{n,t};
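A small numerical sketch of the uplink model just stated, assuming the standard Shannon-rate form reconstructed above; all parameter values in the example call are arbitrary and not taken from the patent.

```python
import numpy as np

def uplink_delay_energy(D_n, W_n, p_n, alpha_n, sigma2):
    """Uplink rate (Shannon form), transmission delay and transmission energy for device n."""
    v_n = W_n * np.log2(1.0 + p_n * alpha_n / sigma2)  # uplink data rate [bit/s]
    T1  = D_n / v_n                                     # uplink transmission delay [s]
    E1  = p_n * T1                                      # uplink transmission energy [J]
    return v_n, T1, E1

# Example with arbitrary values: 1 Mbit task, 1 MHz bandwidth, 0.1 W transmit power.
print(uplink_delay_energy(D_n=1e6, W_n=1e6, p_n=0.1, alpha_n=1e-6, sigma2=1e-9))
```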
T2_{n,t} is the delay for the inspection unmanned aerial vehicle to compute the task received from the n-th acquisition device at time slot t, T2_{n,t} = β_n0^t D_n C_n / f_n, where C_n is the number of CPU cycles required to process one unit of task data volume, f_n is the computing resource allocated by the inspection unmanned aerial vehicle to the n-th acquisition device, and {β_n0^t ∈ [0,1], n ∈ N}; E2_{n,t} is the computing energy consumption of the inspection unmanned aerial vehicle at time slot t, E2_{n,t} = κ(f_n)³ T2_{n,t}, where κ is the effective switched capacitance and the power consumption of the central processing unit of the inspection unmanned aerial vehicle is modeled as κ(f_n)³;
T1_{nm,t} is the downlink transmission delay at time slot t, T1_{nm,t} = β_nm^t D_n / v_m^t, where v_m^t is the downlink data rate, v_m^t = W_m log_2(1 + p_t α_m^t / σ²), W_m denotes the bandwidth allocated to the m-th edge server, α_m^t is the downlink channel gain from the inspection unmanned aerial vehicle to the m-th edge server at time slot t, p_t denotes the transmit power of the inspection unmanned aerial vehicle at time slot t, and {β_nm^t ∈ [0,1], n ∈ N, m ∈ M}; E1_{nm,t} is the downlink transmission energy consumption, E1_{nm,t} = p_t T1_{nm,t};
T2_{nm,t} is the computing delay on the m-th edge server at time slot t, T2_{nm,t} = β_nm^t D_n C_n / f_nm, where f_nm is the computing resource allocated by the m-th edge server to the n-th acquisition device at time slot t;
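A hedged sketch combining these delay and energy terms for one acquisition device. It assumes, for illustration only, that the local (UAV) and edge branches run in parallel so the service delay is dominated by the slowest branch; the patent's own aggregation formulas are given as images in the source and are not reproduced here, and all inputs below are symbolic placeholders.

```python
def device_cost(D_n, C_n, beta_n0, beta_nm, f_uav, f_edge, v_up, v_down, p_n, p_t, kappa=1e-27):
    """Illustrative per-device delay and energy under a split beta_n0 (UAV) / beta_nm (edge)."""
    T1_up  = D_n / v_up                                   # device -> UAV uplink delay
    T2_uav = beta_n0 * D_n * C_n / f_uav                  # computing delay on the UAV
    # UAV -> edge transmission plus edge computing for each edge server m (assumed parallel)
    T_edge = [b * D_n / v_down + b * D_n * C_n / f_edge[m] for m, b in enumerate(beta_nm)]
    delay  = T1_up + max([T2_uav] + T_edge)               # assumption: slowest branch dominates
    energy = (p_n * T1_up                                 # uplink transmission energy
              + kappa * f_uav**3 * T2_uav                 # UAV computing energy
              + p_t * sum(b * D_n / v_down for b in beta_nm))  # downlink transmission energy
    return delay, energy
```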
step 1.5, defining gamma epsilon (0,1) as a discount coefficient for calculating a return accumulated value in the whole process, wherein the closer the discount coefficient is to 1, the more important the long-term income is;
step 1.6, within the finite horizon T, the inspection unmanned aerial vehicle learns the optimal policy to obtain the maximum expected long-term return G_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}, where r_{t′} is the reward at time slot t′; the expected return of taking action a_t in state s_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator.
Step 2, according to the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, the action training of the inspection unmanned aerial vehicle is realized with the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
Step 2.1, two types of neural networks are adopted, 2 Actor networks and 4 Critic networks in total: the Actor networks comprise a Current-Actor network with parameter ω and a Target-Actor network with parameter ω′; the Critic networks comprise 2 Current-Critic networks with parameters θ₁ and θ₂ and 2 Target-Critic networks with parameters θ₁′ and θ₂′. The experience replay memory F and the parameters of the Current-Actor and Current-Critic networks are randomly initialized, and ω′ = ω, θ₁′ = θ₁, θ₂′ = θ₂ are set;
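An illustrative PyTorch-style sketch of the six networks of step 2.1; the layer sizes, state dimension and action dimension below are placeholders, not values taken from the patent.

```python
import copy
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    """Small fully connected network used as a stand-in for Actor/Critic bodies."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim = 3, 5                     # placeholder sizes, e.g. [u_x, u_y, h] and a_t
actor          = mlp(state_dim, action_dim)      # Current-Actor, parameter omega
critic1        = mlp(state_dim + action_dim, 1)  # Current-Critic, parameter theta_1
critic2        = mlp(state_dim + action_dim, 1)  # Current-Critic, parameter theta_2
actor_target   = copy.deepcopy(actor)            # Target-Actor,  omega'  = omega
critic1_target = copy.deepcopy(critic1)          # Target-Critic, theta_1' = theta_1
critic2_target = copy.deepcopy(critic2)          # Target-Critic, theta_2' = theta_2
```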
Step 2.2, set the maximum number of training rounds EP, with T time slots in each round, and initialize the training round counter ep = 1;
Step 2.3, initialize the time slot t = 0 and the input state s_0 of the inspection unmanned aerial vehicle;
Step 2.4, the inspection unmanned aerial vehicle observes the current input state s_t of the power Internet of Things system and takes an action according to the policy function π_ω(s_t) of the initial Current-Actor network; Gaussian noise σ is added to disturb the action, giving a_t = π_ω(s_t) + σ;
Step 2.5, in the input state s_t, execute the action a_t, compute the current reward r_t of the inspection unmanned aerial vehicle from the obtained instantaneous feedback according to step 1.4, and obtain the next input state s_{t+1} using steps 1.1-1.2; store the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F. Tuples in F are stored in order, and once their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
Step 2.6, randomly draw a mini-batch of samples from the experience replay memory F and feed them to the Actor and Critic networks; the i-th sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection unmanned aerial vehicle took action a_i in state s_i, received reward r_i, and the environment moved to the next state s_{i+1}, where mini-batch refers to a small batch of data selected at random from the training data;
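An illustrative sketch of such a cyclic experience replay memory and mini-batch sampling; the capacity and batch size are placeholder values.

```python
import random
from collections import deque

class ReplayMemory:
    """Cyclic experience replay memory F: oldest tuples are overwritten once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        return random.sample(self.buffer, batch_size)   # random mini-batch (step 2.6)
```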
step 2.7, s_i is used as the input of the Current-Actor network, and the updated Current-Actor network computes a new action a_i from s_i; (s_i, a_i) is used as the input of the two Current-Critic networks, which compute Q_{θ1}(s_i, a_i) and Q_{θ2}(s_i, a_i), respectively. s_{i+1} is used as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action a_{i+1} from s_{i+1}, noise is added, and the noisy target action ã_{i+1} is computed with the target policy smoothing regularization method. ã_{i+1} is fed, together with s_{i+1}, to the two Target-Critic networks, which compute Q_{θ1′}(s_{i+1}, ã_{i+1}) and Q_{θ2′}(s_{i+1}, ã_{i+1}), respectively; the smaller of Q_{θ1′}(s_{i+1}, ã_{i+1}) and Q_{θ2′}(s_{i+1}, ã_{i+1}) is used to compute the target value y_i. Finally, the parameters θ₁ and θ₂ of the 2 Current-Critic networks are updated with the target value y_i by minimizing the mean-squared Bellman error loss function;
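A hedged PyTorch-style sketch of the target computation and Current-Critic update of step 2.7; the noise scale eps, clip bound c, discount gamma and optimizer are illustrative placeholders, and the batch is assumed to contain float tensors of matching shapes.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_target, critic1, critic2, critic1_target, critic2_target,
                  critic_opt, gamma=0.99, eps=0.2, c=0.5):
    """One TD3 critic step: smoothed target action, min of the two target critics, MSBE loss."""
    s, a, r, s_next = batch                                   # tensors drawn from the replay memory
    with torch.no_grad():
        noise = (torch.randn_like(a) * eps).clamp(-c, c)      # clipped Gaussian smoothing noise
        a_next = actor_target(s_next) + noise                 # smoothed target action
        q_next = torch.min(critic1_target(torch.cat([s_next, a_next], dim=-1)),
                           critic2_target(torch.cat([s_next, a_next], dim=-1)))
        y = r + gamma * q_next                                # target value y_i
    q1 = critic1(torch.cat([s, a], dim=-1))
    q2 = critic2(torch.cat([s, a], dim=-1))
    loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)              # mean-squared Bellman error
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```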
Step 2.8, update the Current-Actor network parameter ω through the deterministic policy gradient algorithm;
Step 2.9, to stabilize the training process, the parameter ω′ of the Target-Actor network and the parameters θ₁′ and θ₂′ of the 2 Target-Critic networks are further updated with a soft update method;
step 3, training the model by adopting the following steps:
Step 3.1, if the inspection unmanned aerial vehicle covers all acquisition devices and no task division variable exceeds its limit, go to step 3.2; otherwise add 1 to the time slot t and compare it with T: if t < T, jump to step 2.4; if t ≥ T, go to step 3.2;
Step 3.2, add 1 to the training round counter ep and compare it with EP: if ep < EP, jump to step 2.3; otherwise (ep ≥ EP) go to step 3.3;
Step 3.3, the iteration ends and the neural network training process terminates; the current Target-Actor network data are stored and loaded into the inspection unmanned aerial vehicle system, which then executes the flight actions to complete the optimization of the delay and energy consumption of the unmanned aerial vehicle serving the power Internet of Things.
2. The computing task unloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein in step 2.7 the noisy target action is computed as ã_{i+1} = π_{ω′}(s_{i+1}) + ε̃, with ε̃ ~ clip(N(0, ε), −c, c), where the noise component ε̃ is drawn from a zero-mean normal distribution with variance ε and clipped by the constant c, and π_{ω′}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω′ evaluated at the next state s_{i+1};
the target value is y_i = r_i + γ min_{j=1,2} Q_{θ_j′}(s_{i+1}, ã_{i+1}), where r_i is the reward of the i-th sample and γ is the discount coefficient;
the parameters θ₁ and θ₂ of the 2 Current-Critic networks are updated by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss function, M is the number of randomly drawn mini-batch samples, and Q_{θ1}(s_i, a_i) and Q_{θ2}(s_i, a_i) are the state-action value functions of the Current-Critic networks with parameters θ₁ and θ₂.
3. The computing task unloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein the update formula of the Current-Actor network in step 2.8 is the deterministic policy gradient ∇_ω J(ω) = (1/M) Σ_i ∇_a Q_{θ1}(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i), where ∇_ω J(ω) denotes the policy gradient under the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ1}(s_i, a) and ∇_ω π_ω(s_i) are, respectively, the gradient of the state-action value function of the Current-Critic network with parameter θ₁ and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action policy selected by the Current-Actor network for input state s_i, and Q_{θ1}(s_i, a)|_{a=π_ω(s_i)} denotes the state-action value function of the Current-Critic network at input state s_i with action a = π_ω(s_i).
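An illustrative PyTorch-style sketch of the deterministic policy gradient step of claim 3; autograd applies the chain rule that the formula writes out explicitly, and the optimizer and tensors are placeholders.

```python
import torch

def actor_update(states, actor, critic1, actor_opt):
    """Ascend Q_theta1(s, pi_omega(s)) by minimizing its negative mean over the mini-batch."""
    actions = actor(states)                             # a = pi_omega(s_i)
    q = critic1(torch.cat([states, actions], dim=-1))   # Q_theta1(s_i, pi_omega(s_i))
    loss = -q.mean()                                     # gradient ascent on J(omega)
    actor_opt.zero_grad()
    loss.backward()                                      # autograd computes the chained gradient
    actor_opt.step()
    return -loss.item()
```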
4. The computing task unloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein in step 2.9 the parameter ω′ of the Target-Actor network is updated as ω′ ← μω + (1−μ)ω′, and the 2 Target-Critic parameters θ₁′ and θ₂′ are updated as θ₁′ ← μθ₁ + (1−μ)θ₁′, θ₂′ ← μθ₂ + (1−μ)θ₂′, where μ denotes the update scale factor.
CN202111297200.4A 2021-11-04 2021-11-04 Computing task unloading method based on deep reinforcement learning in power Internet of things Pending CN114065963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111297200.4A CN114065963A (en) 2021-11-04 2021-11-04 Computing task unloading method based on deep reinforcement learning in power Internet of things

Publications (1)

Publication Number Publication Date
CN114065963A true CN114065963A (en) 2022-02-18

Family

ID=80273861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111297200.4A Pending CN114065963A (en) 2021-11-04 2021-11-04 Computing task unloading method based on deep reinforcement learning in power Internet of things

Country Status (1)

Country Link
CN (1) CN114065963A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598702A (en) * 2022-02-24 2022-06-07 宁波大学 VR (virtual reality) service unmanned aerial vehicle edge calculation method based on deep learning
WO2023160012A1 (en) * 2022-02-25 2023-08-31 南京信息工程大学 Unmanned aerial vehicle assisted edge computing method for random inspection of power grid line
CN114727316A (en) * 2022-03-29 2022-07-08 江南大学 Internet of things transmission method and device based on depth certainty strategy
CN114727316B (en) * 2022-03-29 2023-01-06 江南大学 Internet of things transmission method and device based on depth certainty strategy
CN115022319A (en) * 2022-05-31 2022-09-06 浙江理工大学 DRL-based edge video target detection task unloading method and system
CN115249134A (en) * 2022-09-23 2022-10-28 江西锦路科技开发有限公司 Resource allocation method, device and equipment for unmanned aerial vehicle and storage medium
CN115249134B (en) * 2022-09-23 2022-12-23 江西锦路科技开发有限公司 Resource allocation method, device and equipment for unmanned aerial vehicle and storage medium
CN115686669A (en) * 2022-10-17 2023-02-03 中国矿业大学 Mine Internet of things intelligent computing unloading method assisted by energy collection
CN117149444A (en) * 2023-10-31 2023-12-01 华东交通大学 Deep neural network hybrid division method suitable for inspection system
CN117149444B (en) * 2023-10-31 2024-01-26 华东交通大学 Deep neural network hybrid division method suitable for inspection system
CN117541025A (en) * 2024-01-05 2024-02-09 南京信息工程大学 Edge calculation method for intensive transmission line inspection
CN117541025B (en) * 2024-01-05 2024-03-19 南京信息工程大学 Edge calculation method for intensive transmission line inspection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination