CN114065963A - Computing task unloading method based on deep reinforcement learning in power Internet of things - Google Patents
Abstract
The invention discloses a computing task offloading method based on deep reinforcement learning in the power Internet of Things, which aims to minimize energy consumption and delay by jointly optimizing the position of the unmanned aerial vehicle (UAV), its transmit power, and the task-division variables. First, a power Internet of Things system model oriented to power transmission line inspection is established, involving the interaction among the acquisition devices, the inspection UAV, and the edge servers, and the computing task offloading problem is formulated. Then, to address the non-convexity of the computing task offloading problem, a Markov decision process is constructed by designing the state, the action space, and the reward function. On this basis, since the Markov model has a continuous action space, a twin delayed deep deterministic policy gradient (TD3) algorithm is proposed to obtain the optimal task offloading strategy.
Description
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a computing task offloading method based on deep reinforcement learning in the power Internet of Things.
Background
The power Internet of Things is an application of Internet of Things technology in the smart grid and plays a role in every stage of the grid: power generation, transmission, transformation, distribution, and consumption. It integrates communication infrastructure with power infrastructure and uses advanced information and communication technologies to exchange information among interconnected devices, thereby further improving the overall efficiency of the power system. Specifically, the power Internet of Things deploys different types of sensing devices (such as global positioning systems, cameras, infrared sensors, and radio-frequency identification devices) and advanced metering infrastructure across different geographic areas, so that it can sense the physical world, acquire and monitor data for processing, and make intelligent decisions for energy management.
However, the data collected by the power Internet of Things tends to be massive and highly heterogeneous, which imposes strict requirements on communication and computing resources. Furthermore, some real-time demand-side management schemes are delay sensitive, which makes the data processing task even harder. Although the computing power of smart devices has recently increased, complex tasks remain difficult to handle under strict latency constraints. Fortunately, mobile edge computing is an ultra-low-latency technology that analyzes data near its source: it pushes services and data from the centralized cloud to the edge of the mobile network, where computing and storage resources are deployed, and thereby supports resource-intensive and delay-sensitive applications.
However, providing mobile edge computing services to power Internet of Things devices requires a dedicated wireless connection between the edge server and the devices, and signal blockage, shadowing, and similar effects make it difficult for such wireless links to deliver effective edge computing services. To this end, unmanned aerial vehicles (UAVs) with embedded computing modules are deployed as small edge clouds. In the power Internet of Things system, the devices must process the generated data quickly, but their communication, computing, and storage resources are very limited, so the UAV provides computing services with its onboard computing module. However, given the limited battery life and computing power of UAVs, efficient computing task offloading methods need to be designed.
Disclosure of Invention
To overcome the non-convexity of the existing computing task offloading problem, the invention aims to provide a deep-reinforcement-learning-based method for optimizing the delay and energy consumption of a UAV serving the power Internet of Things.
To achieve this purpose, the invention adopts the following technical scheme. A computing task offloading method based on deep reinforcement learning in the power Internet of Things comprises the following steps:
Step 1, model the Markov decision process (MDP) as a quintuple (S, A, O, R, γ), where S is the input state set of the inspection UAV, A is the output action set of the inspection UAV, O is the state transition probability function, R is the reward function, and γ is the discount coefficient;
Step 1.1, define the input state set S of the inspection UAV, and determine the input state of the inspection UAV at each time slot t as s_t = [u_t, h_t], where u_t is the horizontal coordinate of the inspection UAV at slot t and h_t is its altitude at slot t; s_t belongs to the input state set S;
Step 1.2, define the output action set A of the inspection UAV, where A represents the set of all actions the inspection UAV can take for its input state after receiving the following external feedback: when the UAV flies out of the transmission-line inspection area, it selects a random direction angle φ_t and flies back; when the flight altitude h_t exceeds its range, it is kept at the minimum altitude H_min or the maximum altitude H_max; once the inspection UAV covers all acquisition devices, it remains stationary and prepares to adjust its transmit power; when the task-division variables exceed their limit (Σ_{m∈M} β_nm^t > 1), the inspection UAV performs a normalization operation to obtain valid task-division variables β_nm^t again. The output action of the inspection UAV at each time slot t comprises: a_t = [l_t, φ_t, Δh_t, p_t, β_nm^t];
where l_t ∈ [0, l_1] denotes the horizontal distance the UAV flies in slot t, with l_1 the maximum horizontal distance the inspection UAV can fly per slot t; φ_t ∈ [0, 2π] denotes the direction angle of the inspection UAV at slot t; Δh_t = (h_t − h_{t−1}) ∈ [−l_2, l_2] denotes the vertical movement distance of the inspection UAV in slot t, where h_t and h_{t−1} are the UAV altitudes at slot t and the previous slot, respectively, and l_2 is the maximum vertical distance the UAV can fly per slot t; p_t ∈ [0, P] denotes the transmit power of the inspection UAV at slot t, with P the maximum transmit power; β_nm^t denotes the fraction of the fault-identification and foreign-object-detection task received from the n-th acquisition device that is processed on the m-th edge server, i.e., the task-division variables of the inspection UAV at slot t are {β_nm^t, n ∈ N, m ∈ M}, where N and M denote the sets of acquisition devices and edge servers in the system, respectively;
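As a non-limiting illustration, the action-boundary handling and task-division normalization described above can be sketched in Python; all constants and the helper name `project_action` are hypothetical and not part of the patent.

```python
import numpy as np

# Illustrative bounds (assumed values, not from the patent):
l1, l2 = 10.0, 5.0          # max horizontal / vertical distance per slot
P_MAX = 2.0                 # max transmit power P

def project_action(l_t, phi_t, dh_t, p_t, beta_t):
    """Clamp a raw continuous action into the feasible action space."""
    l_t = float(np.clip(l_t, 0.0, l1))          # horizontal distance in [0, l1]
    phi_t = float(np.mod(phi_t, 2 * np.pi))     # direction angle wrapped to [0, 2*pi)
    dh_t = float(np.clip(dh_t, -l2, l2))        # vertical move in [-l2, l2]
    p_t = float(np.clip(p_t, 0.0, P_MAX))       # transmit power in [0, P]
    beta_t = np.clip(np.asarray(beta_t, dtype=float), 0.0, 1.0)
    s = beta_t.sum()
    if s > 1.0:              # task-division variables exceed the limit:
        beta_t = beta_t / s  # renormalize so the fractions sum to at most 1
    return l_t, phi_t, dh_t, p_t, beta_t
```

The renormalization step mirrors the "normalization operation" the UAV performs when the task-division variables become invalid.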
Step 1.3, define the state transition probability function O as the probability of reaching the next input state s_{t+1} from the input state s_t of the inspection UAV at the current slot t under action a_t;
Step 1.4, define the reward function R, representing the instantaneous feedback obtained after the inspection UAV selects action a_t in input state s_t at the current time slot t: when the UAV has not covered all acquisition devices (N′ < N), the reward function is the coverage penalty r_t = ξ_1(N′ − N); when all devices are covered (N′ = N), the reward function is r_t = −Σ_{n∈N}(T_t^n + E_t^n), minus the penalty ξ_n^2 for any device whose task division exceeds the limit;
where the direction angle φ_t and altitude h_t of the inspection UAV determine the radius R_t of the range covering the acquisition devices; if the distance between the horizontal position of the inspection UAV and an acquisition device is smaller than the coverage radius, that device is covered by the inspection UAV; N′ is the number of acquisition devices within the coverage range; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the division of the fault-identification and foreign-object-detection task on the n-th acquisition device exceeds the limit (β_n0^t + Σ_{m∈M} β_nm^t > 1), where β_nm^t and β_n0^t are the fractions of the task received from the n-th acquisition device in time slot t that are processed on the m-th edge server and on the inspection UAV, respectively; T_t^n is the total service delay of the n-th acquisition device within slot t, composed of the transmission and computation delays T_1^{n,t}, T_2^{n,t}, T_1^{nm,t}, and T_2^{nm,t} defined below; E_t^n is the total energy consumption of the inspection UAV for tasks arriving from the n-th acquisition device at task arrival rate λ_n within slot t, composed of the energy terms E_1^{n,t}, E_2^{n,t}, and E_1^{nm,t} defined below;
T_1^{n,t} is the uplink transmission delay from the n-th acquisition device to the inspection UAV in time slot t, T_1^{n,t} = D_n / v_n^t, where D_n is the input data volume of the n-th acquisition device when all its tasks are offloaded onto the inspection UAV, and v_n^t is the uplink transmission data rate, v_n^t = W_n log_2(1 + p_n α_n^t / σ²); α_n^t is the uplink channel gain from the n-th acquisition device to the inspection UAV, W_n and p_n denote the bandwidth and transmit power allocated to the n-th acquisition device, respectively, and σ² denotes the noise power at the inspection UAV; E_1^{n,t} is the uplink transmission energy consumption in slot t, E_1^{n,t} = p_n T_1^{n,t};
T_2^{n,t} is the delay for the inspection UAV to compute the task received from the n-th acquisition device in time slot t, T_2^{n,t} = β_n0^t D_n C_n / f_n, where C_n is the number of CPU cycles required to process a unit of task data volume and f_n is the computing resource the inspection UAV allocates to the n-th acquisition device, with {β_n0^t ∈ [0, 1], n ∈ N}; E_2^{n,t} is the computation energy consumption of the inspection UAV in slot t, E_2^{n,t} = κ(f_n)³ T_2^{n,t}, where κ is the effective switched capacitance and the power consumption of the central processing unit in the inspection UAV is modeled as κ(f_n)³;
T_1^{nm,t} is the downlink transmission delay in time slot t, T_1^{nm,t} = β_nm^t D_n / v_m^t, where v_m^t is the downlink transmission data rate, v_m^t = W_m log_2(1 + p_t α_m^t / σ²); W_m denotes the bandwidth allocated to the m-th edge server, α_m^t is the downlink channel gain from the inspection UAV to the m-th edge server in slot t, and p_t denotes the transmit power of the inspection UAV in slot t, with {β_nm^t ∈ [0, 1], n ∈ N, m ∈ M}; E_1^{nm,t} is the downlink transmission energy consumption, E_1^{nm,t} = p_t T_1^{nm,t};
T_2^{nm,t} is the computation delay at the m-th edge server in time slot t, T_2^{nm,t} = β_nm^t D_n C_n / f_nm, where f_nm is the computing resource the m-th edge server allocates to the n-th acquisition device in slot t;
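As a non-limiting illustration, the per-device delay and energy terms can be sketched in Python, assuming the standard Shannon-rate and CPU-cycle models suggested by the symbols defined above; the way the individual terms combine into a total is an assumption of this sketch, and all numeric values are illustrative.

```python
import numpy as np

def uplink_rate(W_n, p_n, alpha_n, sigma2):
    """Shannon-style rate v = W * log2(1 + p*alpha/sigma^2)."""
    return W_n * np.log2(1.0 + p_n * alpha_n / sigma2)

def service_terms(D_n, C_n, beta_n0, beta_nm, f_n, f_nm,
                  v_up, v_down, p_n, p_t, kappa):
    T1 = D_n / v_up                     # uplink transmission delay T1^{n,t}
    E1 = p_n * T1                       # uplink transmission energy E1^{n,t}
    T2 = beta_n0 * D_n * C_n / f_n      # UAV computation delay T2^{n,t}
    E2 = kappa * f_n**3 * T2            # UAV computation energy E2^{n,t}
    T1m = beta_nm * D_n / v_down        # downlink transmission delay T1^{nm,t}
    E1m = p_t * T1m                     # downlink transmission energy E1^{nm,t}
    T2m = beta_nm * D_n * C_n / f_nm    # edge-server computation delay T2^{nm,t}
    # Assumed combination: uplink first, then the slower of the local
    # and offloaded processing paths determines the service delay.
    T_total = T1 + max(T2, T1m + T2m)
    E_total = E1 + E2 + E1m             # energy borne on the UAV side
    return T_total, E_total
```

With illustrative numbers (1 Mbit task, 10 Mbit/s links, 1 GHz UAV CPU, 2 GHz server CPU), an even 50/50 split gives a service delay dominated by the local computation path.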
Step 1.5, define γ ∈ (0, 1) as the discount coefficient used to compute the cumulative return over the whole process; the closer the discount coefficient is to 1, the more weight is placed on long-term gains;
Step 1.6, over a finite horizon T, learn the best policy to obtain the maximum expected long-term return G_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}, where r_{t′} is the reward at time slot t′; the expected return of state s_t under action a_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator;
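The finite-horizon discounted return G_t of step 1.6 can be computed by backward accumulation; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} over a finite horizon."""
    G = 0.0
    for r in reversed(rewards):   # backward recursion: G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G
```

For example, three unit rewards with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.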
Step 2, based on the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, train the UAV's actions using the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
step 2.1, two independent neural networks are adopted: the network communication system comprises 2 Actor networks and 4 Critic networks, wherein the Actor networks comprise a Current-Actor network with a parameter of omega and a Target-Actor network with a parameter of omega'; the Critic network includes 2 parameters of theta1、θ2The Current-critical network and 2 parameters are theta1’、θ2'and randomly initializing parameters of the empirical playback memory F, the Current-Actor network, and the Current-Critic network, setting ω' ═ ω, θ1’=θ1、θ2’=θ2;
Step 2.2, set the maximum number of training rounds EP, with T time slots per round, and initialize the training round counter ep = 1;
Step 2.3, initialize the time slot t = 0 and the input state s_0 of the inspection UAV;
Step 2.4, the inspection UAV observes the current input state s_t of the power Internet of Things system, takes an action according to the policy function π_ω(s_t) of the initial Current-Actor network, and adds Gaussian noise σ to perturb the action, obtaining a_t = π_ω(s_t) + σ;
Step 2.5, in input state s_t, execute action a_t, compute the current reward r_t of the inspection UAV from the obtained instantaneous feedback according to step 1.4, and obtain the next input state s_{t+1} again using steps 1.1-1.2; store the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F; the tuples in F are stored in order, and when their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
Step 2.6, randomly extract a mini-batch of samples from the experience replay memory F and input them to the Actor and Critic networks; a sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection UAV took action a_i in state s_i, received reward r_i, and the environment transitioned to the next state s_{i+1}; here, mini-batch refers to a small batch of data randomly selected from the training data;
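A minimal sketch of the cyclic experience replay memory F and mini-batch sampling of steps 2.5-2.6 (capacity and field layout are illustrative):

```python
import random

class ReplayMemory:
    """Circular experience replay: overwrites the oldest tuples when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = []
        self.pos = 0          # next write index; wraps around when full

    def store(self, s, a, r, s_next):
        item = (s, a, r, s_next)
        if len(self.buf) < self.capacity:
            self.buf.append(item)
        else:                 # capacity reached: new records start from
            self.buf[self.pos] = item  # the beginning, forming a cycle
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        """Randomly draw a mini-batch (without replacement)."""
        return random.sample(self.buf, min(batch_size, len(self.buf)))
```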
Step 2.7, take s_i as the input of the Current-Actor network, which computes a new action a_i = π_ω(s_i); take (s_i, a_i) as the input of the two Current-Critic networks, which compute Q_{θ1}(s_i, a_i) and Q_{θ2}(s_i, a_i), respectively. Take s_{i+1} as the input of the Target-Actor network, which computes a new action and adds noise to it using the target-policy smoothing regularization method, yielding the noisy target action ã_{i+1}; take (s_{i+1}, ã_{i+1}) as the input of the two Target-Critic networks, which compute Q_{θ1′}(s_{i+1}, ã_{i+1}) and Q_{θ2′}(s_{i+1}, ã_{i+1}), respectively; take the minimum of the two to compute the target value y_i = r_i + γ min_{j=1,2} Q_{θj′}(s_{i+1}, ã_{i+1}). Finally, use the target value y_i to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function;
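The target-value computation of step 2.7 (target-policy smoothing plus the clipped double-Q minimum) can be sketched with NumPy; the stand-in policy and value functions here are hypothetical placeholders for the patent's networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_value(r_i, s_next, pi_target, q1_target, q2_target,
                 gamma=0.99, eps=0.2, c=0.5):
    """y_i = r_i + gamma * min_j Q_{theta_j'}(s_{i+1}, a~_{i+1})."""
    a0 = pi_target(s_next)
    noise = np.clip(rng.normal(0.0, eps, size=np.shape(a0)), -c, c)
    a_tilde = a0 + noise                         # smoothed target action
    q_min = np.minimum(q1_target(s_next, a_tilde),
                       q2_target(s_next, a_tilde))  # clipped double-Q
    return r_i + gamma * q_min

def mse_bellman_loss(y, q_vals):
    """Mean-squared Bellman error over a mini-batch."""
    return float(np.mean((np.asarray(y) - np.asarray(q_vals)) ** 2))
```

Taking the smaller of the two target-critic estimates is what counters the overestimation of the state-action value function mentioned later in the beneficial effects.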
Step 2.8, updating a Current-Actor network parameter omega through a deterministic strategy gradient algorithm;
Step 2.9, to stabilize the training process, further update the Target-Actor parameter ω′ and the parameters θ_1′ and θ_2′ of the 2 Target-Critic networks using the soft update method;
Step 3, train the model with the following steps:
Step 3.1, if the inspection UAV covers all acquisition devices and the task-division variables do not exceed their limits, go to step 3.2; otherwise add 1 to t and test it: if t < T, jump to step 2.4; if t ≥ T, go to step 3.2;
Step 3.2, add 1 to the training round counter ep and test it: if ep < EP, jump to step 2.3; otherwise, when ep ≥ EP, go to step 3.3;
Step 3.3, end the iteration and terminate the neural network training process; store the current Target-Actor and Target-Critic network data and load the stored data into the inspection UAV system, which then executes the flight actions, completing the optimization of the delay and energy consumption of the UAV serving the power Internet of Things.
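The two-level loop of steps 2.2-3.3 can be summarized in the following skeleton; `env` and `agent` are hypothetical stand-ins for the power Internet of Things environment and the TD3 agent, not interfaces defined by the patent.

```python
def train(env, agent, EP, T):
    """EP training rounds of at most T time slots each."""
    for ep in range(EP):
        s = env.reset()                    # step 2.3: initial state s_0
        for t in range(T):
            a = agent.act(s)               # step 2.4: policy + exploration noise
            s_next, r, done = env.step(a)  # step 2.5: execute, observe reward
            agent.remember(s, a, r, s_next)
            agent.update()                 # steps 2.6-2.9: critic/actor updates
            s = s_next
            if done:                       # step 3.1: all devices covered and
                break                      # task division within limits
    return agent
```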
Further, the noisy target action in step 2.7 is computed as ã_{i+1} = π_{ω′}(s_{i+1}) + ε̃, where the noise component ε̃ = clip(N(0, ε), −c, c) is drawn from a normal distribution with mean zero and variance ε and clipped by the constant c, and π_{ω′}(s_{i+1}) is the policy function of the Target-Actor network with parameter ω′ evaluated at the next state s_{i+1};
the parameters θ_1 and θ_2 of the 2 Current-Critic networks are updated by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_{θ_j}(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss function, M is the number of randomly drawn mini-batch samples, and Q_{θ_j}(s_i, a_i), j = 1, 2, are the state-action value functions of the Current-Critic networks with parameters θ_1 and θ_2.
Further, the update formula of the Current-Actor network in step 2.8 is ∇_ω J(ω) = (1/M) Σ_i ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} · ∇_ω π_ω(s_i), where ∇_ω J(ω) represents the policy gradient under the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} and ∇_ω π_ω(s_i) are, respectively, the gradient of the state-action value function of the Current-Critic network with parameter θ_1 and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action selected by the Current-Actor network in input state s_i, and Q_{θ_1}(s_i, a)|_{a=π_ω(s_i)} is the Current-Critic state-action value function at input state s_i with action a = π_ω(s_i).
Further, in step 2.9 the Target-Actor parameter ω′ is updated as ω′ ← μω + (1 − μ)ω′, and the 2 Target-Critic parameters θ_1′ and θ_2′ are updated as θ_1′ ← μθ_1 + (1 − μ)θ_1′ and θ_2′ ← μθ_2 + (1 − μ)θ_2′, where μ denotes the update scale factor.
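The soft (Polyak) update above can be sketched as a parameter-wise interpolation; parameters are represented here as plain lists purely for illustration.

```python
def soft_update(target_params, current_params, mu):
    """theta' <- mu * theta + (1 - mu) * theta', applied element-wise."""
    return [mu * w + (1.0 - mu) * w_tgt
            for w, w_tgt in zip(current_params, target_params)]
```

A small μ (e.g. 0.005) makes the target networks track the current networks slowly, which is the stabilizing effect step 2.9 refers to.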
The beneficial effects of the invention are:
(1) The computing task offloading model established by the method for the power Internet of Things in actual transmission-line inspection is complete; because the model environment is unknown, the inspection UAV finds the optimal target strategy for computing task offloading through continuous interaction with the system environment, so the method has high practical application value.
(2) The method uses the twin delayed deep deterministic policy gradient (TD3) algorithm, which effectively handles the continuous control of the inspection UAV and, through the Target-Actor and Target-Critic networks, effectively alleviates the persistent overestimation of the UAV's state-action value function, making the training of the inspection UAV more stable and yielding a better offloading strategy.
(3) The method combines reinforcement learning with deep neural networks, improving the learning and generalization abilities of the inspection UAV, avoiding the complexity and sparsity of manually operating the UAV to offload computing tasks in an uncertain environment, and ensuring that the inspection UAV completes computing task offloading safely and efficiently.
Drawings
FIG. 1 is a general framework diagram of the TD3-based computing task offloading method in the power Internet of Things;
FIG. 2 is a flowchart of the TD3-based computing task offloading algorithm in the power Internet of Things.
Detailed Description
The present invention will be described in further detail below with reference to examples, to help those of ordinary skill in the art understand and practice it; it should be understood that the embodiments described here are illustrative and are not to be construed as limiting the invention.
First, a power Internet of Things system model oriented to power transmission line inspection is established, involving the interaction among the acquisition devices, the inspection UAV, and the edge servers, and the computing task offloading problem is formulated. Second, to address the non-convexity of the task offloading problem, the invention proposes a deep-reinforcement-learning-based method and formulates a Markov decision process by designing the state, the action space, and the reward function. On this basis, since the Markov model has a continuous action space, a twin delayed deep deterministic policy gradient algorithm is proposed to obtain the optimal task offloading strategy.
The optimal strategy jointly optimizes the position q_t of the inspection UAV, the task-division variables β_nm^t and β_n0^t, and the transmit power p_t so as to minimize the sum of the total service delay of all acquisition devices and the total energy consumption of the inspection UAV, namely: min_{q_t, β_nm^t, β_n0^t, p_t} Σ_t Σ_{n∈N} (T_t^n + E_t^n), subject to h_t ∈ [H_min, H_max], ‖u_{t+1} − u_t‖ ≤ l_1, |h_{t+1} − h_t| ≤ l_2, 0 ≤ p_t ≤ P, and β_nm^t, β_n0^t ∈ [0, 1] with β_n0^t + Σ_{m∈M} β_nm^t ≤ 1.
Suppose the position of the inspection UAV in time slot t is denoted q_t = (u_t, h_t), where u_t is the horizontal coordinate of the inspection UAV and h_t ∈ [H_min, H_max] is its altitude, with H_min and H_max the minimum and maximum flight altitudes of the inspection UAV, and l_1 and l_2 the maximum horizontal and vertical distances the UAV can fly in each slot t, respectively. N and M represent the sets of acquisition devices and edge servers, T_t^n represents the total service delay of the n-th acquisition device within slot t, E_t^n represents the total energy consumption of the inspection UAV for offloading and computing the task of the n-th acquisition device in slot t, P is the maximum transmit power of the inspection UAV, ‖·‖ represents the Euclidean norm of a vector, u_{t+1} represents the horizontal coordinate of the inspection UAV at the next slot, and h_{t+1} represents its altitude at the next slot. From the analysis of task offloading problems such as fault identification and foreign-object detection in transmission-line inspection, the problem is strongly non-convex and combinatorial, so a globally optimal solution is hard to find. In addition, because the information and channel conditions of the acquisition devices in the power Internet of Things system for transmission-line inspection are not easy to acquire, traditional optimization strategies are even more challenging.
A computing task offloading method based on deep reinforcement learning in the power Internet of Things comprises the following steps:
Step 1, model the Markov decision process (MDP) as a quintuple (S, A, O, R, γ), where S is the input state set of the inspection UAV, A is the output action set of the inspection UAV, O is the state transition probability function, R is the reward function, and γ is the discount coefficient; s_t belongs to the input state set S;
Step 1.1, determine the input state of the inspection UAV at each time slot t as s_t = [u_t, h_t], where u_t is the horizontal coordinate of the inspection UAV at slot t and h_t is its altitude at slot t;
Step 1.2, define the output action set A of the inspection UAV, where A represents the set of all actions the inspection UAV can take for its input state after receiving the following external feedback: when the UAV flies out of the transmission-line inspection area, it selects a random direction angle φ_t and flies back; when the flight altitude h_t exceeds its range, it is kept at the minimum altitude H_min or the maximum altitude H_max; once the inspection UAV covers all acquisition devices, it remains stationary and prepares to adjust its transmit power; when the task-division variables exceed their limit (Σ_{m∈M} β_nm^t > 1), the inspection UAV performs a normalization operation to obtain valid task-division variables β_nm^t again. The output action of the inspection UAV at each time slot t comprises: a_t = [l_t, φ_t, Δh_t, p_t, β_nm^t];
wherein l_t ∈ [0, l_1] denotes the horizontal distance the drone flies in time slot t, and l_1 is the maximum horizontal distance flown per time slot t; Φ_t ∈ [0, 2π] denotes the drone's direction angle in time slot t; Δh_t = (h_t − h_{t+1}) ∈ [−l_2, l_2] denotes the drone's vertical movement distance in time slot t, where h_t and h_{t+1} denote the drone's height in time slot t and the next time slot, respectively, and l_2 is the maximum vertical distance flown per time slot t; p_t ∈ [0, P] denotes the drone's transmission power in time slot t, and P is the drone's maximum transmission power; β_nm^t denotes the proportion of the fault identification and foreign object detection tasks received from the nth acquisition device that is processed on the mth edge server, i.e. the drone's task partition variables in time slot t are {β_nm^t, n ∈ N, m ∈ M}, where N and M denote the sets of acquisition devices and edge servers in the system, respectively;
step 1.3, defining the probability that the inspection drone, starting from the input state s_t of the current time slot t and taking action a_t, reaches the next input state s_{t+1}; this probability is the state transition probability function O;
step 1.4, defining the reward function R, which represents the instantaneous feedback the inspection drone obtains after selecting action a_t in the input state s_t of the current time slot t; the reward function takes one form when the task partition variables satisfy their limit, and a penalized form when the limit is exceeded;
wherein the drone's horizontal direction angle Φ_t and height h_t determine the radius R_t of the coverage range over the acquisition devices; if the distance between the drone's horizontal position u_t and an acquisition device is smaller than the coverage radius, the device is covered by the inspection drone (the horizontal coordinates of the acquisition devices are fixed, known quantities); N′ is the number of acquisition devices the drone covers; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the division of the fault identification and foreign object detection tasks on the nth acquisition device exceeds its limit; β_nm^t, β_n0^t, T_t^n and E_t^n are solved from the power Internet of Things system model: β_nm^t and β_n0^t are the proportions of tasks received from the nth acquisition device in time slot t that are processed on the mth edge server and on the inspection drone itself, respectively; T_t^n is the total service delay of the nth acquisition device in time slot t; and E_t^n is the total energy the drone consumes in time slot t given the task arrival rate λ_n of the nth acquisition device;
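The coverage test described above (a device counts as covered when its horizontal distance to the drone is below the coverage radius R_t) can be sketched as follows; the function name and the tuple-based coordinates are illustrative assumptions, not part of the patent:

```python
import math

def covered_devices(u_t, device_positions, r_t):
    """Count the acquisition devices whose horizontal distance to the
    drone position u_t is smaller than the coverage radius R_t; this
    count is the quantity N' appearing in the reward function."""
    n_prime = 0
    for w in device_positions:
        if math.dist(u_t, w) < r_t:
            n_prime += 1
    return n_prime
```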
A. Power Internet of Things system model
The power Internet of Things system for transmission line inspection comprises 1 inspection drone, N information acquisition devices (such as micro weather stations, sensors and high-definition cameras) and M edge servers. The sets of acquisition devices and edge servers are denoted N and M, respectively. After receiving computation tasks such as fault identification and foreign object detection from the acquisition devices over the uplink communication link, the inspection drone distributes part or all of the tasks to the edge servers over the downlink communication link.
1) Uplink transmission delay and energy consumption:
It is assumed that the acquisition devices share a broadband frequency division multiple access protocol when offloading tasks such as fault identification and foreign object detection. By the Shannon formula, given the uplink channel gain α_n^t from the nth acquisition device to the inspection drone in time slot t, the achievable uplink transmission data rate is:

v_n^t = W_n log_2(1 + p_n α_n^t / σ²)

where W_n and p_n denote the bandwidth and transmission power allocated to the nth acquisition device, respectively, and σ² is the noise power of the inspection drone. Hence, given the input data amount D_n when all tasks of the nth acquisition device are offloaded onto the drone, the uplink transmission delay in time slot t is T_1^{n,t} = D_n / v_n^t; given the drone's receiving power p, the uplink transmission energy consumption in time slot t is E_1^{n,t} = p T_1^{n,t}.
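The uplink model of this subsection (Shannon rate, then delay and energy) can be sketched numerically as follows; all parameter values are illustrative assumptions, not figures from the patent:

```python
import math

def uplink_rate(w_n, p_n, alpha_n, sigma2):
    # Shannon formula: v_n^t = W_n * log2(1 + p_n * alpha_n^t / sigma^2)
    return w_n * math.log2(1.0 + p_n * alpha_n / sigma2)

def uplink_delay_energy(d_n, v_n, p_rx):
    # T_1^{n,t} = D_n / v_n^t  and  E_1^{n,t} = p * T_1^{n,t}
    t1 = d_n / v_n
    return t1, p_rx * t1

# Illustrative values (assumed): 1 MHz bandwidth, 0.1 W device power,
# 1e-5 channel gain, 1e-9 W noise, 1 Mbit task, 0.5 W receive power.
v = uplink_rate(w_n=1e6, p_n=0.1, alpha_n=1e-5, sigma2=1e-9)  # bit/s
t1, e1 = uplink_delay_energy(d_n=1e6, v_n=v, p_rx=0.5)
```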
2) Computation delay and energy consumption of the inspection drone:
After receiving tasks such as fault identification and foreign object detection from the acquisition devices, the inspection drone decides whether to process each task itself or offload it to an edge server. {β_n0^t ∈ [0,1], n ∈ N} denotes the proportion of tasks received from the nth acquisition device in time slot t that is processed on the drone, where N is the set of acquisition devices in the system. Hence, given the number of central processor cycles C_n required per unit of task data and the computing resource f_n the drone allocates to the nth acquisition device, the drone's computation delay in time slot t is T_2^{n,t} = β_n0^t C_n D_n / f_n, where D_n is the task input data amount of the nth acquisition device. Further, modeling the power consumption of the drone's central processor as κ(f_n)³, with κ the effective switched capacitance, the drone's computation energy consumption in time slot t is E_2^{n,t} = κ(f_n)³ T_2^{n,t}.
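A minimal sketch of the drone's local computation delay and energy model; the parameter values (cycles per bit, CPU frequency, switched capacitance) are assumptions for illustration, not taken from the patent:

```python
def local_compute(beta_n0, c_n, d_n, f_n, kappa):
    # T_2^{n,t} = beta_n0^t * C_n * D_n / f_n
    t2 = beta_n0 * c_n * d_n / f_n
    # E_2^{n,t} = kappa * f_n^3 * T_2^{n,t}  (CPU power modeled as kappa * f^3)
    e2 = kappa * f_n ** 3 * t2
    return t2, e2

# Illustrative values (assumed): half the task kept locally, 1000 cycles
# per bit, 1 Mbit task, 1 GHz CPU, kappa = 1e-28.
t2, e2 = local_compute(beta_n0=0.5, c_n=1000, d_n=1e6, f_n=1e9, kappa=1e-28)
```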
3) Downlink transmission delay and energy consumption:
For simplicity, assume the noise powers of the inspection drone and the edge servers are equal. Given the downlink channel gain α_m^t from the drone to the mth edge server in time slot t, the downlink transmission data rate is:

v_m^t = W_m log_2(1 + p_t α_m^t / σ²)

where W_m and σ² denote the bandwidth allocated to the mth edge server and the noise power of the edge server, respectively, and p_t denotes the drone's transmission power in time slot t. Because the edge servers have more communication, computation and storage resources than the inspection drone, the drone preferentially offloads tasks such as fault identification and foreign object detection to the edge servers to reduce system delay. Given the proportion {β_nm^t ∈ [0,1], n ∈ N, m ∈ M} of tasks received from the nth acquisition device in time slot t that is processed on the mth edge server, where N and M denote the sets of acquisition devices and edge servers in the system, the downlink transmission delay is T_1^{nm,t} = β_nm^t D_n / v_m^t, where D_n is the task input data amount of the nth acquisition device. Further, given the drone's transmission power p_t in time slot t, the downlink transmission energy consumption is E_1^{nm,t} = p_t T_1^{nm,t}.
4) Calculating time delay of the edge server:
An edge server starts its computation only after receiving all the task data (fault identification, foreign object detection, etc.) offloaded by the inspection drone. Hence, given the computing resource f_nm allocated by the mth edge server to the nth acquisition device in time slot t, the computation delay at the mth edge server is T_2^{nm,t} = β_nm^t C_n D_n / f_nm, where C_n is the number of CPU cycles required per unit of task data, D_n is the task input data amount of the nth acquisition device, and β_nm^t is the proportion of tasks received from the nth acquisition device in time slot t that is processed on the mth edge server.
5) Total service delay and energy consumption of the system:
The inspection drone cannot start processing tasks such as fault identification and foreign object detection until the uplink communication channel has finished transferring all the data. Because the communication and computation modules on the drone are separate, the drone can offload tasks to the edge servers over the downlink channel while performing local task computation. Each edge server, in turn, can start processing only after its data transfer completes. Therefore, in time slot t, the total service delay of the nth acquisition device is:

T_t^n = T_1^{n,t} + max{ T_2^{n,t}, max_{m∈M} ( T_1^{nm,t} + T_2^{nm,t} ) }

where T_1^{n,t} is the uplink transmission delay from the nth acquisition device to the inspection drone in time slot t, T_2^{n,t} is the drone's delay for computing the tasks received from the nth acquisition device in time slot t, T_1^{nm,t} is the downlink transmission delay from the drone to the mth edge server in time slot t, T_2^{nm,t} is the computation delay at the mth edge server in time slot t, and M denotes the set of edge servers in the system.
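The parallel timing structure described above can be sketched as a small helper; the max-over-paths form is a reconstruction from the stated timing assumptions (uplink first, then local computation in parallel with per-server transfer plus computation), not the patent's literal formula:

```python
def total_service_delay(t1_up, t2_local, t1_down, t2_edge):
    """T_t^n = T1 + max(local compute, max_m(downlink_m + edge_m)):
    the uplink must finish first; local computation and offloading then
    proceed in parallel; each edge server computes only after its own
    transfer completes."""
    edge_paths = [td + te for td, te in zip(t1_down, t2_edge)]
    return t1_up + max([t2_local] + edge_paths)
```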
Furthermore, in time slot t, given the task arrival rate λ_n of the nth acquisition device, the total energy consumption of the inspection drone is:

E_t^n = λ_n ( E_1^{n,t} + E_2^{n,t} + Σ_{m∈M} E_1^{nm,t} )

where E_1^{n,t} is the uplink transmission energy the drone consumes receiving tasks from the nth acquisition device in time slot t, E_2^{n,t} is the energy the drone consumes processing the fault identification, foreign object detection and similar tasks of the nth acquisition device in time slot t, and E_1^{nm,t} is the downlink transmission energy the drone consumes offloading tasks to the mth edge server in time slot t.
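A sketch of the drone's per-device energy total; note that aggregating the three energy terms via the task arrival rate λ_n is an assumption here, since the original expression is elided in the extraction:

```python
def uav_total_energy(lambda_n, e1_up, e2_local, e1_down):
    # E_t^n = lambda_n * (E_1^{n,t} + E_2^{n,t} + sum_m E_1^{nm,t})
    # (aggregation via the task arrival rate is an assumed reading)
    return lambda_n * (e1_up + e2_local + sum(e1_down))
```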
Because the network information in the communication process of the power Internet of Things system for transmission line inspection is unknown, policy optimization over the Markov model cannot use conventional iterative methods. With deep reinforcement learning, the inspection drone needs only the state information to optimize its action and does not require complete environment information. Because the state and action spaces in the task offloading setting (fault identification, foreign object detection, etc.) are continuous, and the value function is persistently overestimated when training a deep deterministic policy gradient algorithm, the twin delayed deep deterministic policy gradient (TD3) algorithm is adopted; it combines a policy-function network (Actor network) with an action-value-function network (Critic network).
Step 1.5, defining γ ∈ (0,1) as the discount coefficient for computing the accumulated return over the whole process; the closer the discount coefficient is to 1, the more weight long-term gains receive;
step 1.6, over a finite horizon T, learning the optimal policy to obtain the maximum expected long-term gain G_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where r_{t'} is the reward at time slot t'. The expected reward of state s_t under action a_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator.
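The discounted return G_t can be computed with a standard backward recursion; this is a generic illustration, not code from the patent:

```python
def discounted_return(rewards, gamma):
    # G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, accumulated backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```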
The TD3 architecture consists of a Current-Network and a Target-Network. The Current-Network comprises a Current-Actor network and a Current-Critic network: the Current-Actor network, with parameter ω, passes the policy function π_ω(s_t) used to select an action to the Current-Critic network, which estimates the state-action value functions Q_θ1(s_t, a_t) and Q_θ2(s_t, a_t). The Current-Critic network comprises two structurally identical networks with parameters θ_1 and θ_2, independently initialized and trained; they compute a Q-value from the input state and the selected action and update the Current-Actor network via the deterministic policy gradient. Because the Current-Network persistently overestimates the Q-value, which destabilizes training, TD3 uses the Target-Network to slowly track the updates of the Current-Actor and Current-Critic networks. The Target-Network comprises a Target-Actor network with parameter ω′ and two Target-Critic networks with parameters θ′_1 and θ′_2. The Target-Actor network passes its policy function to the Target-Critic networks, which estimate the state-value function; the two structurally identical, independently initialized and trained Target-Critic networks compute Q-values from the input state and the selected action, compute the target value from the smaller of the two Q-values, and thereby update the Current-Critic network.
Step 2, according to the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, implementing the drone's action training with the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
step 2.1, adopting 2 Actor networks and 4 Critic networks: the Actor networks comprise a Current-Actor network with parameter ω and a Target-Actor network with parameter ω′; the Critic networks comprise 2 Current-Critic networks with parameters θ_1 and θ_2 and 2 Target-Critic networks with parameters θ′_1 and θ′_2; randomly initializing the experience replay memory F and the parameters of the Current-Actor and Current-Critic networks, and setting ω′ = ω, θ′_1 = θ_1, θ′_2 = θ_2;
Step 2.2, setting the maximum number of training rounds EP, with T time slots per round, and initializing the training round counter ep = 1;
step 2.3, initializing the time slot t = 0 and initializing the drone's input state s_0;
Step 2.4, the inspection drone observing the current input state s_t of the power Internet of Things system, taking an action according to the policy function π_ω(s_t) of the initial Current-Actor network, and perturbing it with Gaussian noise σ to obtain the action a_t = π_ω(s_t) + σ; the policy function π_ω(s_t) is the rule by which the drone decides to take the output action a_t in the input state s_t;
step 2.5, in the input state s_t, executing action a_t, computing the drone's current reward r_t from the obtained instantaneous feedback according to step 1.4, and obtaining the next input state s_{t+1} using steps 1.1-1.2; storing the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F; the tuples in F are stored in order, and when their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
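The cyclic experience replay memory described in step 2.5 can be sketched as follows (the class and method names are assumptions for illustration):

```python
import random

class ReplayMemory:
    """Fixed-capacity experience replay: tuples are stored in order and,
    once capacity is reached, new records overwrite the oldest (a cycle)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0

    def store(self, s, a, r, s_next):
        if len(self.buffer) < self.capacity:
            self.buffer.append((s, a, r, s_next))
        else:
            self.buffer[self.pos] = (s, a, r, s_next)
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # mini-batch: a small batch drawn at random from stored experience
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```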
steps 2.2-2.5 constitute the policy exploration phase, with the offline device environment serving as the training environment: the Current-Actor network observes the input state s_t of time slot t as its input and outputs a recommended action a_t; the inspection drone selects a_t, the state transitions from s_t to s_{t+1}, and the reward feedback r_t is given; the Current-Actor network then generates a new recommended action from the new state s_{t+1}, and so on; the records of this policy exploration are stored as experience in the experience replay memory F.
The network update phase is as follows: the Current-Actor network no longer needs to interact with the environment; instead, experience replay is performed by sampling from the experience replay memory F.
Step 2.6, randomly drawing a mini-batch of samples from the experience replay memory F and feeding them to the Actor and Critic networks; the ith sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection drone took action a_i in state s_i, received reward r_i, and the environment transitioned to the next state s_{i+1}, where mini-batch refers to a small batch of data randomly selected from the training data;
step 2.7, taking s_i as the input of the Current-Actor network; the updated Current-Actor network computes a new action a_i from s_i; (s_i, a_i) serves as the input of the two Current-Critic networks, which compute Q_θ1(s_i, a_i) and Q_θ2(s_i, a_i), respectively; s_{i+1} serves as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action from s_{i+1} and adds noise, using target policy smoothing regularization to obtain the noisy target action ã_{i+1} = π_ω′(s_{i+1}) + clip(N(0, ε), −c, c), where the noise term is drawn from a zero-mean normal distribution with variance ε and clipped by the constant c, and π_ω′(s_{i+1}) is the policy function of the Target-Actor network with parameter ω′ at the next state s_{i+1}; (s_{i+1}, ã_{i+1}) serves as the input of the two Target-Critic networks, which compute Q_θ′1(s_{i+1}, ã_{i+1}) and Q_θ′2(s_{i+1}, ã_{i+1}), respectively; the smaller of the two is used to compute the target value y_i = r_i + γ min_{j=1,2} Q_θ′j(s_{i+1}, ã_{i+1}), where r_i is the reward and γ is the discount coefficient; finally, the target value y_i is used to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_i (y_i − Q_θj(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss, M is the number of randomly drawn mini-batch samples, and Q_θj(s_i, a_i), j = 1, 2, are the state-action value functions of the two Current-Critic networks with parameters θ_1 and θ_2;
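The core TD3 quantities of step 2.7 (smoothed target action, clipped double-Q target, and mean-squared Bellman error) can be sketched as plain functions; network evaluation is abstracted away, so the Q-values are passed in as numbers, and all names are illustrative:

```python
import random

def smoothed_target_action(mu_next, eps, c, a_low, a_high):
    # a~_{i+1} = pi_omega'(s_{i+1}) + clip(N(0, eps), -c, c),
    # then clipped to the valid action range
    noise = max(-c, min(c, random.gauss(0.0, eps)))
    return max(a_low, min(a_high, mu_next + noise))

def td3_target(r_i, gamma, q1_next, q2_next):
    # y_i = r_i + gamma * min(Q_theta'_1, Q_theta'_2):
    # taking the smaller target-critic value curbs Q overestimation
    return r_i + gamma * min(q1_next, q2_next)

def critic_loss(targets, q_values):
    # L(theta_j) = (1/M) * sum_i (y_i - Q_theta_j(s_i, a_i))^2
    m = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / m
```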
step 2.8, updating the Current-Actor network parameter ω via the deterministic policy gradient algorithm: ∇_ω J(ω) = (1/M) Σ_i ∇_a Q_θ1(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i), where ∇_ω J(ω) denotes the policy gradient under the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_θ1(s_i, a) and ∇_ω π_ω(s_i) are, respectively, the gradient of the state-action value function of the Current-Critic network with parameter θ_1 and the gradient of the policy function of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action policy selected for the Current-Actor network input state s_i, and Q_θ1(s_i, a)|_{a=π_ω(s_i)} denotes the Current-Critic network state-action value function for input state s_i and action a = π_ω(s_i);
step 2.9, to stabilize the training process, further soft-updating the Target-Actor parameter ω′ and the 2 Target-Critic parameters θ′_1 and θ′_2: ω′ ← μω + (1−μ)ω′, θ′_1 ← μθ_1 + (1−μ)θ′_1, θ′_2 ← μθ_2 + (1−μ)θ′_2, where μ denotes the update scale factor.
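The soft update of step 2.9 can be sketched element-wise over parameter lists (a generic illustration, assuming parameters are stored as flat lists of floats):

```python
def soft_update(theta_target, theta_current, mu):
    # theta' <- mu * theta + (1 - mu) * theta', applied element-wise,
    # so the target network slowly tracks the current network
    return [mu * c + (1.0 - mu) * t
            for t, c in zip(theta_target, theta_current)]
```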
TD3 reduces the update frequency of the Current-Actor network and the Target-Network: the Current-Actor and Target networks are updated once for every d updates of the Current-Critic networks.
Step 3, training the model by adopting the following steps:
step 3.1, if the inspection drone covers all acquisition devices and the task partition variables do not exceed their limits, going to step 3.2; otherwise incrementing t by 1 and testing it: if t < T, jumping to step 2.4; if t ≥ T, going to step 3.2;
step 3.2, incrementing the training round counter ep by 1 and testing it: if ep < EP, jumping to step 2.3; otherwise, when ep ≥ EP, going to step 3.3;
step 3.3, the iteration ends: terminating the neural network training process, saving the current Target-Actor and Target-Critic network data, and loading the saved data into the inspection drone system, which then executes the flight actions, completing the optimization of the delay and energy consumption of the drone serving the power Internet of Things.
Claims (4)
1. A computing task unloading method based on deep reinforcement learning in an electric power Internet of things is characterized by comprising the following steps:
step 1, establishing the Markov model over the inspection drone's flight position for the subsequent deep reinforcement learning algorithm:
modeling the quintuple (S, A, O, R, γ) of the MDP, wherein S is the input state set of the inspection drone, A is the output action set of the inspection drone, O is the state transition probability function, R is the reward function, and γ is the discount coefficient;
step 1.1, defining the input state set S of the inspection drone and determining the input state at each time slot t as s_t = [u_t, h_t], where u_t is the drone's horizontal coordinate in time slot t and h_t is the drone's height in time slot t; s_t belongs to the input state set S;
step 1.2, defining the output action set A of the inspection drone, where A represents the set of all actions the drone may take for its input state after receiving the following external feedback: when the drone flies out of the transmission line inspection area, it selects a random direction angle Φ_t and flies back; when the flight height h_t exceeds its allowed range, it is held at the minimum height H_min or the maximum height H_max; once the drone covers all acquisition devices, it holds its position and prepares to adjust its transmission power; when the task partition variables exceed their limit, the drone performs a corrective operation to obtain the task partition variables β_nm^t anew. The output action a_t of the inspection drone in each time slot t comprises: l_t, Φ_t, Δh_t, p_t, β_nm^t;
wherein l_t ∈ [0, l_1] denotes the horizontal distance the drone flies in time slot t, and l_1 is the maximum horizontal distance flown per time slot t; Φ_t ∈ [0, 2π] denotes the drone's direction angle in time slot t; Δh_t = (h_t − h_{t−1}) ∈ [−l_2, l_2] denotes the drone's vertical movement distance in time slot t, where h_t and h_{t−1} denote the drone's height in time slot t and the previous time slot, respectively, and l_2 is the maximum vertical distance flown per time slot t; p_t ∈ [0, P] denotes the drone's transmission power in time slot t, and P is the drone's maximum transmission power; β_nm^t denotes the proportion of the fault identification and foreign object detection tasks received from the nth acquisition device that is processed on the mth edge server, i.e. the drone's task partition variables in time slot t are {β_nm^t, n ∈ N, m ∈ M}, where N and M denote the sets of acquisition devices and edge servers in the system, respectively;
step 1.3, defining the probability that the inspection drone, starting from the input state s_t of the current time slot t and taking action a_t, reaches the next input state s_{t+1}; this probability is the state transition probability function O;
step 1.4, defining the reward function R, which represents the instantaneous feedback the inspection drone obtains after selecting action a_t in the input state s_t of the current time slot t; the reward function takes one form when the task partition variables satisfy their limit, and a penalized form when the limit is exceeded;
wherein the drone's horizontal direction angle Φ_t and height h_t determine the radius R_t of the coverage range over the acquisition devices; if the distance between the drone's horizontal position and an acquisition device is smaller than the coverage radius, the device is covered by the inspection drone; N′ is the number of acquisition devices within the coverage range; ξ_1 is a penalty factor related to the degree of coverage; ξ_n^2 is the penalty factor applied when the division of the fault identification and foreign object detection tasks on the nth acquisition device exceeds its limit; β_nm^t and β_n0^t are the proportions of tasks received from the nth acquisition device in time slot t that are processed on the mth edge server and on the inspection drone, respectively; T_t^n is the total service delay of the nth acquisition device in time slot t; E_t^n is the total energy the drone consumes in time slot t given the task arrival rate λ_n of the nth acquisition device;
T_1^{n,t} is the uplink transmission delay from the nth acquisition device to the inspection drone in time slot t, T_1^{n,t} = D_n / v_n^t, where D_n is the input data amount of the nth acquisition device when all its tasks are offloaded onto the drone and v_n^t is the uplink transmission data rate, v_n^t = W_n log_2(1 + p_n α_n^t / σ²), with α_n^t the uplink channel gain from the nth acquisition device to the drone, W_n and p_n the bandwidth and transmission power allocated to the nth acquisition device, respectively, and σ² the noise power of the inspection drone; E_1^{n,t} is the uplink transmission energy consumption in time slot t, E_1^{n,t} = p T_1^{n,t};
T_2^{n,t} is the drone's delay for computing the tasks received from the nth acquisition device in time slot t, T_2^{n,t} = β_n0^t C_n D_n / f_n, where C_n is the number of CPU cycles required per unit of task data, f_n is the computing resource the drone allocates to the nth acquisition device, and {β_n0^t ∈ [0,1], n ∈ N}; E_2^{n,t} is the drone's computation energy consumption in time slot t, E_2^{n,t} = κ(f_n)³ T_2^{n,t}, where κ is the effective switched capacitance and the power consumption of the drone's central processor is modeled as κ(f_n)³;
T_1^{nm,t} is the downlink transmission delay in time slot t, T_1^{nm,t} = β_nm^t D_n / v_m^t, where v_m^t is the downlink transmission data rate, v_m^t = W_m log_2(1 + p_t α_m^t / σ²), W_m denotes the bandwidth allocated to the mth edge server, α_m^t is the downlink channel gain from the drone to the mth edge server in time slot t, p_t denotes the drone's transmission power in time slot t, and {β_nm^t ∈ [0,1], n ∈ N, m ∈ M}; E_1^{nm,t} is the downlink transmission energy consumption, E_1^{nm,t} = p_t T_1^{nm,t};
T_2^{nm,t} is the computation delay at the mth edge server in time slot t, T_2^{nm,t} = β_nm^t C_n D_n / f_nm, where f_nm is the computing resource allocated by the mth edge server to the nth acquisition device in time slot t;
step 1.5, defining γ ∈ (0,1) as the discount coefficient for computing the accumulated return over the whole process; the closer the discount coefficient is to 1, the more weight long-term gains receive;
step 1.6, over a finite horizon T, learning the optimal policy to obtain the maximum expected long-term gain G_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where r_{t'} is the reward at time slot t'; the expected reward of state s_t under action a_t is recorded as the state-action value function Q(s_t, a_t) = E[G_t | s_t, a_t], where E[·] is the expectation operator.
Step 2, according to the Markov decision quintuple (S, A, O, R, γ) modeled in step 1, implementing the drone's action training with the twin delayed deep deterministic policy gradient (TD3) algorithm, as follows:
step 2.1, adopting two types of neural networks, 2 Actor networks and 4 Critic networks: the Actor networks comprise a Current-Actor network with parameter ω and a Target-Actor network with parameter ω′; the Critic networks comprise 2 Current-Critic networks with parameters θ_1 and θ_2 and 2 Target-Critic networks with parameters θ′_1 and θ′_2; randomly initializing the experience replay memory F and the parameters of the Current-Actor and Current-Critic networks, and setting ω′ = ω, θ′_1 = θ_1, θ′_2 = θ_2;
Step 2.2, setting the maximum number of training rounds EP, with T time slots per round, and initializing the training round counter ep = 1;
step 2.3, initializing the time slot t = 0 and initializing the drone's input state s_0;
Step 2.4, the inspection drone observing the current input state s_t of the power Internet of Things system, taking an action according to the policy function π_ω(s_t) of the initial Current-Actor network, and perturbing it with Gaussian noise σ to obtain the action a_t = π_ω(s_t) + σ;
Step 2.5, in the input state s_t, executing action a_t, computing the drone's current reward r_t from the obtained instantaneous feedback according to step 1.4, and obtaining the next input state s_{t+1} using steps 1.1-1.2; storing the experience tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay memory F; the tuples in F are stored in order, and when their number reaches the capacity of F, new records are stored from the beginning, forming a cycle;
step 2.6, randomly drawing a mini-batch of samples from the experience replay memory F and feeding them to the Actor and Critic networks; the ith sample (s_i, a_i, r_i, s_{i+1}) indicates that the inspection drone took action a_i in state s_i, received reward r_i, and the environment transitioned to the next state s_{i+1}, where mini-batch refers to a small batch of data randomly selected from the training data;
step 2.7, taking s_i as the input of the Current-Actor network; the updated Current-Actor network computes a new action a_i from s_i; (s_i, a_i) serves as the input of the two Current-Critic networks, which compute Q_θ1(s_i, a_i) and Q_θ2(s_i, a_i), respectively; s_{i+1} serves as the input of the Target-Actor network and the two Target-Critic networks; the updated Target-Actor network computes a new action from s_{i+1}, adds noise, and uses target policy smoothing regularization to compute the noisy target action ã_{i+1}; (s_{i+1}, ã_{i+1}) serves as the input of the two Target-Critic networks, which compute Q_θ′1(s_{i+1}, ã_{i+1}) and Q_θ′2(s_{i+1}, ã_{i+1}), respectively; the smaller of the two is used to compute the target value y_i; finally, the target value y_i is used to update the parameters θ_1 and θ_2 of the 2 Current-Critic networks by minimizing the mean-squared Bellman error loss function;
Step 2.8, updating the Current-Actor network parameter ω via the deterministic policy gradient algorithm;
step 2.9, to stabilize the training process, further soft-updating the Target-Actor parameter ω′ and the 2 Target-Critic parameters θ′_1 and θ′_2;
Step 3, train the model by the following steps:
Step 3.1, if the inspection unmanned aerial vehicle has covered all the acquisition devices and the task-division variable does not exceed the limit of the acquisition devices, go to step 3.2; otherwise, add 1 to the time slot t and judge t: if t < T, jump to step 2.4, and if t ≥ T, go to step 3.2;
Step 3.2, add 1 to the number of training rounds ep and judge ep: if ep < EP, jump to step 2.3; otherwise (ep ≥ EP), go to step 3.3;
Step 3.3, finish the iteration and terminate the neural network training process; save the current Target-Actor network data and Target-Critic network data, and load the saved data into the inspection unmanned aerial vehicle system, which then executes flight actions to optimize the time delay and energy consumption of the unmanned aerial vehicle serving the power Internet of Things.
2. The computing task offloading method based on deep reinforcement learning in the power Internet of Things as claimed in claim 1, wherein the noisy target action in step 2.7 is computed as ã = π_ω'(s_{i+1}) + ε̃, where the noise component ε̃ = clip(N(0, ε), −c, c) is drawn from a normal distribution with zero mean and variance ε and clipped by the constant c, and π_ω'(s_{i+1}) is the policy function of the Target-Actor network with parameter ω' for the next state s_{i+1};
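The target policy smoothing of claim 2 can be sketched in one dimension. `sigma` is the noise standard deviation (the claim's ε is a variance, so σ = √ε), and the action bounds `a_low`, `a_high` are an illustrative assumption to keep the smoothed action feasible:

```python
import random

def smoothed_target_action(pi_next, sigma, c, a_low, a_high):
    """Target policy smoothing: add clipped zero-mean Gaussian noise to the
    Target-Actor's action, then clip the result to the valid action range."""
    noise = random.gauss(0.0, sigma)
    noise = max(-c, min(c, noise))      # clip noise to [-c, c]
    a = pi_next + noise
    return max(a_low, min(a_high, a))   # keep the action feasible
```

Clipping the noise keeps the smoothed target action close to the Target-Actor's output, so the regularization smooths the value estimate without letting outlier noise dominate.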
the parameters θ_1 and θ_2 of the two Current-Critic networks are updated by minimizing the mean-squared Bellman error loss function L(θ_j) = (1/M) Σ_{i=1}^{M} (y_i − Q_θj(s_i, a_i))², j = 1, 2, where L(θ_j) is the mean-squared Bellman error loss function, M is the number of randomly drawn mini-batch samples, and Q_θ1 and Q_θ2 are the state-action value functions of the two Current-Critic networks with parameters θ_1 and θ_2.
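The mean-squared Bellman error loss above reduces to a simple average of squared residuals between targets and critic outputs, sketched here over plain Python lists rather than network tensors:

```python
def msbe_loss(targets, q_values):
    """Mean-squared Bellman error:
    L(theta_j) = (1/M) * sum_i (y_i - Q_theta_j(s_i, a_i))^2."""
    M = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / M
```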
3. The computing task offloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein the update formula of the Current-Actor network in step 2.8 is ∇_ω J(ω) = (1/M) Σ_{i=1}^{M} ∇_a Q_θ1(s_i, a)|_{a=π_ω(s_i)} ∇_ω π_ω(s_i), where ∇_ω J(ω) is the policy gradient under the Current-Actor network parameter ω, M is the number of randomly drawn mini-batch samples, ∇_a Q_θ1(s_i, a) and ∇_ω π_ω(s_i) are respectively the state-action value function gradient of the Current-Critic network with parameter θ_1 and the policy function gradient of the Current-Actor network with parameter ω, π_ω(s_i) denotes the action selected by the Current-Actor network for input state s_i, and Q_θ1(s_i, π_ω(s_i)) is the Current-Critic network state-action value function when the action a = π_ω(s_i) is taken in input state s_i.
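The chain rule behind the deterministic policy gradient can be checked on a toy one-dimensional case: a linear policy π_ω(s) = ω·s and a quadratic critic Q(s, a) = −(a − s)², which is maximized at a = s (so the optimal ω is 1). All functions here are illustrative stand-ins, not the patent's networks:

```python
def policy(omega, s):
    return omega * s                  # toy linear policy pi_omega(s)

def q_value(s, a):
    return -(a - s) ** 2              # toy critic: maximized at a = s

def dq_da(s, a):
    return -2.0 * (a - s)             # grad_a Q(s, a)

def dpi_domega(s):
    return s                          # grad_omega pi_omega(s)

def dpg(omega, states):
    """Deterministic policy gradient:
    (1/M) * sum_i grad_a Q(s_i, a)|_{a=pi(s_i)} * grad_omega pi(s_i)."""
    M = len(states)
    return sum(dq_da(s, policy(omega, s)) * dpi_domega(s) for s in states) / M
```

Ascending this gradient (ω ← ω + η·∇_ω J) moves ω toward 1, where the critic's value is highest, which is exactly the direction the Current-Actor update of step 2.8 follows.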
4. The computing task offloading method based on deep reinforcement learning in the power Internet of Things according to claim 1, wherein the parameter ω' of the Target-Actor network is updated in step 2.9 as ω' ← μω + (1 − μ)ω', and the two Target-Critic parameters θ'_1 and θ'_2 are updated as θ'_1 ← μθ_1 + (1 − μ)θ'_1 and θ'_2 ← μθ_2 + (1 − μ)θ'_2, where μ denotes the update scale factor.
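The soft (Polyak) update of claim 4 applies the same interpolation to every parameter, sketched here over plain lists of parameters:

```python
def soft_update(current_params, target_params, mu):
    """Polyak soft update: theta' <- mu * theta + (1 - mu) * theta'.

    With a small mu, the target parameters track the current networks
    slowly, which is what stabilizes training in step 2.9.
    """
    return [mu * c + (1.0 - mu) * t
            for c, t in zip(current_params, target_params)]
```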
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111297200.4A CN114065963A (en) | 2021-11-04 | 2021-11-04 | Computing task unloading method based on deep reinforcement learning in power Internet of things |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114065963A true CN114065963A (en) | 2022-02-18 |
Family
ID=80273861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111297200.4A Pending CN114065963A (en) | 2021-11-04 | 2021-11-04 | Computing task unloading method based on deep reinforcement learning in power Internet of things |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114065963A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114598702A (en) * | 2022-02-24 | 2022-06-07 | 宁波大学 | VR (virtual reality) service unmanned aerial vehicle edge calculation method based on deep learning |
CN114727316A (en) * | 2022-03-29 | 2022-07-08 | 江南大学 | Internet of things transmission method and device based on depth certainty strategy |
CN115022319A (en) * | 2022-05-31 | 2022-09-06 | 浙江理工大学 | DRL-based edge video target detection task unloading method and system |
CN115249134A (en) * | 2022-09-23 | 2022-10-28 | 江西锦路科技开发有限公司 | Resource allocation method, device and equipment for unmanned aerial vehicle and storage medium |
CN115686669A (en) * | 2022-10-17 | 2023-02-03 | 中国矿业大学 | Mine Internet of things intelligent computing unloading method assisted by energy collection |
WO2023160012A1 (en) * | 2022-02-25 | 2023-08-31 | 南京信息工程大学 | Unmanned aerial vehicle assisted edge computing method for random inspection of power grid line |
CN117149444A (en) * | 2023-10-31 | 2023-12-01 | 华东交通大学 | Deep neural network hybrid division method suitable for inspection system |
CN117541025A (en) * | 2024-01-05 | 2024-02-09 | 南京信息工程大学 | Edge calculation method for intensive transmission line inspection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114065963A (en) | Computing task unloading method based on deep reinforcement learning in power Internet of things | |
CN112995913B (en) | Unmanned aerial vehicle track, user association and resource allocation joint optimization method | |
CN113346944B (en) | Time delay minimization calculation task unloading method and system in air-space-ground integrated network | |
CN113543176B (en) | Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance | |
CN113905347B (en) | Cloud edge end cooperation method for air-ground integrated power Internet of things | |
CN113254188B (en) | Scheduling optimization method and device, electronic equipment and storage medium | |
CN112835715B (en) | Method and device for determining task unloading strategy of unmanned aerial vehicle based on reinforcement learning | |
CN112988285B (en) | Task unloading method and device, electronic equipment and storage medium | |
Zhou et al. | Computation bits maximization in UAV-assisted MEC networks with fairness constraint | |
CN113810233A (en) | Distributed computation unloading method based on computation network cooperation in random network | |
CN113406974A (en) | Learning and resource joint optimization method for unmanned aerial vehicle cluster federal learning | |
WO2022242468A1 (en) | Task offloading method and apparatus, scheduling optimization method and apparatus, electronic device, and storage medium | |
CN117580105B (en) | Unmanned aerial vehicle task unloading optimization method for power grid inspection | |
Ebrahim et al. | A deep learning approach for task offloading in multi-UAV aided mobile edge computing | |
CN112579290B (en) | Computing task migration method of ground terminal equipment based on unmanned aerial vehicle | |
CN114007231A (en) | Heterogeneous unmanned aerial vehicle data unloading method and device, electronic equipment and storage medium | |
CN116663644A (en) | Multi-compression version Yun Bianduan DNN collaborative reasoning acceleration method | |
Tharmarasa et al. | Closed-loop multi-satellite scheduling based on hierarchical MDP | |
CN113157344B (en) | DRL-based energy consumption perception task unloading method in mobile edge computing environment | |
Lu et al. | Trajectory design for unmanned aerial vehicles via meta-reinforcement learning | |
CN114698125A (en) | Method, device and system for optimizing computation offload of mobile edge computing network | |
CN114693141A (en) | Transformer substation inspection method based on end edge cooperation | |
CN114599102A (en) | Method for unloading linear dependent tasks of edge computing network of unmanned aerial vehicle | |
CN107787002B (en) | Method for rapidly evaluating information transmission rate in wireless power supply communication | |
CN115226130B (en) | Multi-unmanned aerial vehicle data unloading method based on fairness perception and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||