CN108632860B - Mobile edge calculation rate maximization method based on deep reinforcement learning - Google Patents

Mobile edge calculation rate maximization method based on deep reinforcement learning Download PDF

Info

Publication number
CN108632860B
Authority
CN
China
Prior art keywords
wireless device
wireless devices
energy
reinforcement learning
base station
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810342359.5A
Other languages
Chinese (zh)
Other versions
CN108632860A (en)
Inventor
黄亮
冯旭
钱丽萍
吴远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810342359.5A priority Critical patent/CN108632860B/en
Publication of CN108632860A publication Critical patent/CN108632860A/en
Application granted granted Critical
Publication of CN108632860B publication Critical patent/CN108632860B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/02: Arrangements for optimising operational condition
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W28/00: Network traffic management; Network resource management
    • H04W28/02: Traffic management, e.g. flow control or congestion control
    • H04W28/06: Optimizing the usage of the radio link, e.g. header compression, information sizing, discarding information
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A mobile edge computation rate maximization method based on deep reinforcement learning comprises the following steps: 1) in a wirelessly powered edge computing system comprising a base station and a plurality of wireless devices, compute the sum of the computation rates of all wireless devices in the system for a given mode selection; 2) find the optimal mode selection, i.e., the mode selection M_0 and M_1 of all wireless devices, by a reinforcement learning algorithm; 3) take the mode selection M_0 and M_1 of all wireless devices as the reinforcement learning system state x_t, where an action a is a modification of the system state x_t; if the total computation rate of the modified system is greater than before, the current reward r(x_t, a) is set to a positive value, otherwise to a negative value, and the system enters the next state x_{t+1}; this iterative process is repeated until the best mode selection M_0 and M_1 is obtained. The invention maximizes the total computation rate of all wireless devices while guaranteeing user experience.

Description

Mobile edge calculation rate maximization method based on deep reinforcement learning
Technical Field
The invention belongs to the field of communication, and particularly relates to a communication system for mobile edge calculation and a mobile edge calculation rate maximization method based on deep reinforcement learning.
Background
The recent development of Internet of Things technology is a key step toward true intelligence and autonomous control, and it is particularly prominent in many important industrial and commercial systems. In an Internet of Things network, a large number of wireless devices (WDs) capable of communication and computing are deployed. Due to device size limitations and manufacturing cost considerations, Internet of Things devices (e.g., sensors) often carry batteries with limited capacity and energy-efficient low-performance processors; the resulting limited device lifetime and low computing power cannot support the growing number of new applications that require sustained high-performance computing, such as autonomous driving and augmented reality. Deploying wireless power transfer (WPT) can alleviate these two performance problems: frequent device battery depletion not only disrupts the normal operation of individual wireless devices but can also significantly degrade overall network performance, e.g., the sensing accuracy in a wireless sensor network, and conventional wireless systems require frequent manual battery replacement, which is expensive and inconvenient. Owing to severe battery capacity limitations, minimizing energy consumption and extending the operational life of the wireless devices are critical design goals in battery-powered wireless systems. Each energy-harvesting wireless device follows a binary computation offloading policy, i.e., the data set of one task is either executed locally or offloaded to a remote server. In order to maximize the total computation rate of all wireless devices, the optimal individual computation mode selection must be found.
Disclosure of Invention
In order to overcome the low sum computation rate of existing wireless power transfer systems, and to maximize the sum computation rate of all wireless devices by finding the optimal individual computation mode selection and system transmission time allocation, the invention provides a mobile edge computation rate maximization method based on deep reinforcement learning that maximizes the sum computation rate of all wireless devices while guaranteeing user experience.
The technical solution adopted by the invention to solve this technical problem is as follows:
A mobile edge computation rate maximization method based on deep reinforcement learning, the method comprising the following steps:
1) In a wirelessly powered edge computing system comprising a base station and a plurality of wireless devices, the base station and each wireless device each have their own antenna; a radio frequency energy transmitter and an edge computing server are integrated in the base station, the base station is assumed to have a stable energy supply, and it broadcasts radio frequency energy to all wireless devices; each wireless device has an energy harvesting circuit and a rechargeable battery and performs its tasks using the stored harvested energy; in this wireless communication system, each wireless device needs to establish a connection with the base station, and the channel gain h_i between wireless device i and the base station is calculated as follows:
[The expression for h_i appears as an equation image in the original publication and is not reproduced here.]
wherein each parameter is defined as follows:
A_d: antenna gain;
π: the circumferential ratio (pi);
f_c: carrier frequency;
d_i: distance between wireless device i and the base station;
d_e: path loss exponent;
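The exact expression for h_i is only available as an image in the original publication. As a hedged reconstruction, assuming the free-space path-loss model commonly used in the wirelessly powered edge computing literature that this patent cites (Bi et al., listed under Non-Patent Citations), the channel gain would take the form

h_i = A_d * ( 3×10^8 / (4 π f_c d_i) )^{d_e}

where 3×10^8 m/s is the speed of light; this form matches the parameters listed above but is an assumption, not the verified original equation.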
2) It is assumed that the computing task of each wireless device can either be executed on a local low-performance microprocessor or be offloaded to the edge computing server, which has far stronger processing power, processes the task, and then sends the result back to the wireless device; each wireless device is assumed to follow a binary computation offloading rule, i.e., it must choose either the local computation mode or the offloading mode; two non-overlapping sets M_0 and M_1 denote the wireless devices in the local computation mode and in the offloading mode, respectively, and the set N of all wireless devices is expressed as N = M_0 ∪ M_1.
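For illustration only, the binary offloading rule above can be represented as one bit per wireless device, which is also a convenient encoding for the reinforcement learning state used in steps 4) and 5); the helper below is a sketch with assumed names, not notation taken from the patent.

# Mode selection of the wireless devices as a binary vector:
# 0 = local computation mode (set M_0), 1 = offloading mode (set M_1).
def split_modes(x):
    M0 = [i for i, m in enumerate(x) if m == 0]  # devices computing locally
    M1 = [i for i, m in enumerate(x) if m == 1]  # devices offloading to the base station
    return M0, M1

# Example: devices 0 and 3 compute locally, devices 1 and 2 offload.
M0, M1 = split_modes([0, 1, 1, 0])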
3) A wireless device in set M_0 can harvest energy and process its local task at the same time, while a wireless device in set M_1 can only offload its task to the base station for processing after harvesting energy; since the computing and transmission capabilities of the base station are assumed to be much stronger than those of the energy-harvesting wireless devices, each offloading wireless device exhausts its harvested energy during task offloading; the computation rate sum maximization problem over all wireless devices is described as:
[Objective function: appears as an equation image in the original publication and is not reproduced here.]
the constraint conditions are as follows:
[Constraints: appear as equation images in the original publication and are not reproduced here.]
in the formula:
[The local-computation and offloading rate expressions appear as equation images in the original publication and are not reproduced here.]
wherein each parameter is defined as follows:
ω_i: a transition weight for the i-th wireless device;
μ: energy collection efficiency;
p: radio frequency energy transmission power;
φ: the number of computation cycles required to process each bit of data;
h_i: channel gain of the i-th wireless device;
k_i: energy efficiency coefficient of the i-th wireless device;
a: a time coefficient;
v_μ: conversion efficiency;
b: bandwidth;
τ_j: a time coefficient for the j-th wireless device;
N_0: the number of wireless devices in the local processing mode;
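The objective and constraints above are only available as equation images. Purely as an illustrative sketch, the snippet below evaluates a weighted sum computation rate for a fixed mode selection using the local-computing and offloading rate expressions from the binary-offloading reference cited under Non-Patent Citations (Bi et al.); the exact expressions, the interpretation of N_0 as noise power in the log term, and the time constraint a + Στ_j ≤ 1 are assumptions here and may differ from the patent's unreproduced equations.

# Illustrative sketch of the step-3 objective for a fixed mode selection.
# Assumed rate models (not reproduced from the patent's equation images):
#   local device i:      r_i = (1/phi) * (mu * p * h[i] * a / k[i]) ** (1/3)
#   offloading device j: r_j = (b * tau[j] / v_u) * log2(1 + mu * p * h[j]**2 * a / (tau[j] * noise))
import math

def sum_computation_rate(M0, M1, a, tau, w, h, k, mu, p, phi, b, v_u, noise):
    total = 0.0
    for i in M0:   # devices in local computation mode
        total += w[i] * (mu * p * h[i] * a / k[i]) ** (1.0 / 3.0) / phi
    for j in M1:   # devices in offloading mode
        total += w[j] * (b * tau[j] / v_u) * math.log2(1.0 + mu * p * h[j] ** 2 * a / (tau[j] * noise))
    return total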
4) Find the optimal mode selection, i.e., the mode selection M_0 and M_1 of all wireless devices, by a reinforcement learning algorithm; the reinforcement learning system is composed of an agent and an environment; the mode selection M_0 and M_1 of all users is encoded as the current system state x_t; the agent takes an action a in the current state, enters the next state x_{t+1}, and receives the reward r(x_t, a) returned by the environment; through the continuous interaction of the agent with the environment, the mode selection M_0 and M_1 is optimized until the optimum is found; the agent is updated as:
Q_θ(x_t, a) = r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′)    (4)
wherein each parameter is defined as follows:
θ: parameter of the evaluation network;
θ′: parameter of the target network;
x_t: the system state at time t;
Q_θ(x_t, a): the Q value obtained by taking action a in state x_t;
r(x_t, a): the reward obtained by taking action a in state x_t;
γ: the discount factor applied to future rewards;
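Equation (4) is the one-step target used to update the agent; in deep Q-learning the evaluation network is trained by minimizing the squared difference between this target and its own prediction (see steps 5.5 and 5.6 below), which with the symbols already defined can be written as

L(θ) = ( r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′) − Q_θ(x_t, a) )²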
5) Take the mode selection M_0 and M_1 of all wireless devices as the deep reinforcement learning system state x_t; an action a is a modification of the system state x_t; if the total computation rate of the modified system is greater than before, the current reward r(x_t, a) is set to a positive value, otherwise it is set to a negative value, and the system enters the next state x_{t+1}.
Further, in step 5), the iterative process of reinforcement learning is as follows:
step 5.1: initialize the evaluation network, the target network and the memory base used in reinforcement learning; the current system state is x_t, t is initialized to 1, and the iteration counter k is initialized to 1;
step 5.2: while k is less than or equal to the given iteration number K, randomly draw a probability p;
step 5.3: if p is less than or equal to ε, select the action a(t) output by the evaluation network; otherwise, randomly select an action;
step 5.4: after action a(t) is taken, obtain the reward r(t) and the next state x(t+1), and store the tuple (x(t), a(t), r(t), x(t+1)) in the memory base;
step 5.5: combining the output of the target network, calculate the target of the evaluation network y = r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′);
step 5.6: minimize the error (y − Q(x(t), a(t); θ))² and update the parameter θ of the evaluation network so that the next prediction is more accurate;
step 5.7: every S steps, copy the parameters of the evaluation network to the target network; set k = k + 1 and return to step 5.2;
step 5.8: when k is greater than the given iteration number K, the learning process ends and the best mode selection M_0 and M_1 is obtained.
the technical conception of the invention is as follows: first, in an internet of things network, a large number of Wireless Devices (WDs) capable of communication and computation are deployed, and due to device size constraints and manufacturing cost considerations, internet of things devices (e.g., sensors) often carry batteries with limited capacity and energy-saving low-performance processors, so that the limited device lifetime and low computing power cannot support more and more sustainable new applications requiring high-performance computation, and due to strict battery capacity constraints, in a battery-powered wireless system, minimizing energy consumption and extending the wireless device operational life cycle is a critical design. Each energy harvesting wireless device follows a binary computation offload policy, i.e., the data set for one task may be performed locally or by remote server offload. To maximize the total computation rate of all wireless devices, an optimal individual computation mode selection method is proposed.
The invention has the following beneficial effects: the optimal mode selection method is found through deep reinforcement learning, the total calculation rate of all wireless devices is maximized, the energy consumption is minimized, and the operation life cycle of the wireless devices is prolonged.
Drawings
FIG. 1 is a system model diagram.
Fig. 2 is a flow chart of a method of finding an optimal mode selection.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
Referring to fig. 1 and 2, a mobile edge computation rate maximization method based on deep reinforcement learning maximizes the sum computation rate of all wireless devices, minimizes energy consumption, and prolongs the operational life cycle of the wireless devices. Based on a system model with multiple wireless devices (as shown in fig. 1), the invention provides an optimal individual computation mode selection method to decide which wireless devices offload their tasks to the base station. The method comprises the following steps (as shown in fig. 2):
1) In a wirelessly powered edge computing system comprising a base station and a plurality of wireless devices, the base station and each wireless device each have their own antenna; a radio frequency energy transmitter and an edge computing server are integrated in the base station, the base station is assumed to have a stable energy supply, and it broadcasts radio frequency energy to all wireless devices; each wireless device has an energy harvesting circuit and a rechargeable battery and performs its tasks using the stored harvested energy; in this wireless communication system, each wireless device needs to establish a connection with the base station, and the channel gain h_i between wireless device i and the base station is calculated as follows:
[The expression for h_i appears as an equation image in the original publication and is not reproduced here.]
wherein each parameter is defined as follows:
A_d: antenna gain;
π: the circumferential ratio (pi);
f_c: carrier frequency;
d_i: distance between wireless device i and the base station;
d_e: path loss exponent;
2) It is assumed that the computing task of each wireless device can either be executed on a local low-performance microprocessor or be offloaded to the edge computing server, which has far stronger processing power, processes the task, and then sends the result back to the wireless device; each wireless device is assumed to follow a binary computation offloading rule, i.e., it must choose either the local computation mode or the offloading mode; two non-overlapping sets M_0 and M_1 denote the wireless devices in the local computation mode and in the offloading mode, respectively, and the set N of all wireless devices is expressed as N = M_0 ∪ M_1.
3) A wireless device in set M_0 can harvest energy and process its local task at the same time, while a wireless device in set M_1 can only offload its task to the base station for processing after harvesting energy; since the computing and transmission capabilities of the base station are assumed to be much stronger than those of the energy-harvesting wireless devices, each offloading wireless device exhausts its harvested energy during task offloading; the computation rate sum maximization problem over all wireless devices is described as:
[Objective function: appears as an equation image in the original publication and is not reproduced here.]
the constraint conditions are as follows:
[Constraints: appear as equation images in the original publication and are not reproduced here.]
in the formula:
[The local-computation and offloading rate expressions appear as equation images in the original publication and are not reproduced here.]
wherein each parameter is defined as follows:
ω_i: a transition weight for the i-th wireless device;
μ: energy collection efficiency;
p: radio frequency energy transmission power;
φ: the number of computation cycles required to process each bit of data;
h_i: channel gain of the i-th wireless device;
k_i: energy efficiency coefficient of the i-th wireless device;
a: a time coefficient;
v_μ: conversion efficiency;
b: bandwidth;
τ_j: a time coefficient for the j-th wireless device;
N_0: the number of wireless devices in the local processing mode;
4) Find the optimal mode selection, i.e., the mode selection M_0 and M_1 of all wireless devices, by a reinforcement learning algorithm; the reinforcement learning system is composed of an agent and an environment; the mode selection M_0 and M_1 of all users is encoded as the current system state x_t; the agent takes an action a in the current state, enters the next state x_{t+1}, and receives the reward r(x_t, a) returned by the environment; through the continuous interaction of the agent with the environment, the mode selection M_0 and M_1 is optimized until the optimum is found; the agent is updated as:
Q_θ(x_t, a) = r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′)    (4)
wherein each parameter is defined as follows:
θ: parameter of the evaluation network;
θ′: parameter of the target network;
x_t: the system state at time t;
Q_θ(x_t, a): the Q value obtained by taking action a in state x_t;
r(x_t, a): the reward obtained by taking action a in state x_t;
γ: the discount factor applied to future rewards;
5) Take the mode selection M_0 and M_1 of all wireless devices as the deep reinforcement learning system state x_t; an action a is a modification of the system state x_t; if the total computation rate of the modified system is greater than before, the current reward r(x_t, a) is set to a positive value, otherwise it is set to a negative value, and the system enters the next state x_{t+1}.
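As a minimal sketch of the reward rule just described, assuming a helper total_rate that returns the sum computation rate of step 3) for a given mode selection (an illustrative placeholder, not a function defined by the patent):

# Reward r(x_t, a): positive when the action increases the total computation rate,
# negative otherwise, as described in step 5).
def reward(total_rate, x_t, x_next, pos=1.0, neg=-1.0):
    return pos if total_rate(x_next) > total_rate(x_t) else neg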
In step 5), the iterative process of reinforcement learning is as follows (a minimal sketch of this loop, under stated assumptions, is given after step 5.8):
step 5.1: initialize the evaluation network, the target network and the memory base used in reinforcement learning; the current system state is x_t, t is initialized to 1, and the iteration counter k is initialized to 1;
step 5.2: while k is less than or equal to the given iteration number K, randomly draw a probability p;
step 5.3: if p is less than or equal to ε, select the action a(t) output by the evaluation network; otherwise, randomly select an action;
step 5.4: after action a(t) is taken, obtain the reward r(t) and the next state x(t+1), and store the tuple (x(t), a(t), r(t), x(t+1)) in the memory base;
step 5.5: combining the output of the target network, calculate the target of the evaluation network y = r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′);
step 5.6: minimize the error (y − Q(x(t), a(t); θ))² and update the parameter θ of the evaluation network so that the next prediction is more accurate;
step 5.7: every S steps, copy the parameters of the evaluation network to the target network; set k = k + 1 and return to step 5.2;
step 5.8: when k is greater than the given iteration number K, the learning process ends and the best mode selection M_0 and M_1 is obtained.
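The following is a minimal sketch of the iterative procedure in steps 5.1-5.8, written with PyTorch purely for illustration; the network sizes, the toy stand-in for the computation rate, the action definition (toggling one device's mode), and all numeric settings are assumptions and would need to be replaced by the quantities defined in steps 1)-3).

# Sketch of steps 5.1-5.8: evaluation network, target network, memory base,
# epsilon rule of step 5.3 (p <= EPS selects the network's action), and
# target-network synchronization every S steps.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

N_DEVICES = 5          # number of wireless devices (placeholder)
N_ACTIONS = N_DEVICES  # an action toggles the mode of one device (assumption)
GAMMA = 0.9            # reward discount γ
EPS = 0.9              # probability threshold ε of step 5.3
S_SYNC = 20            # copy evaluation net -> target net every S steps (step 5.7)
K_ITER = 500           # total iteration number K
BATCH = 32

def total_rate(x):
    """Toy stand-in for the weighted sum computation rate of step 3)."""
    return float(sum(x))

class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(N_DEVICES, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    def forward(self, x):
        return self.f(x)

eval_net, target_net = QNet(), QNet()                 # step 5.1
target_net.load_state_dict(eval_net.state_dict())
memory = deque(maxlen=1000)                           # memory base
opt = optim.Adam(eval_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = [0] * N_DEVICES                                   # initial mode selection (all local)
for k in range(1, K_ITER + 1):                        # step 5.2
    p = random.random()
    if p <= EPS:                                      # step 5.3: use the evaluation network
        with torch.no_grad():
            a = int(torch.argmax(eval_net(torch.tensor(x, dtype=torch.float32))))
    else:
        a = random.randrange(N_ACTIONS)
    x_next = list(x)
    x_next[a] ^= 1                                    # toggle the selected device's mode
    r = 1.0 if total_rate(x_next) > total_rate(x) else -1.0   # reward rule of step 5)
    memory.append((x, a, r, x_next))                  # step 5.4
    x = x_next

    if len(memory) >= BATCH:
        xs, acts, rs, xns = zip(*random.sample(memory, BATCH))
        xs = torch.tensor(xs, dtype=torch.float32)
        xns = torch.tensor(xns, dtype=torch.float32)
        acts = torch.tensor(acts)
        rs = torch.tensor(rs, dtype=torch.float32)
        with torch.no_grad():                         # step 5.5: target from the target network
            y = rs + GAMMA * target_net(xns).max(dim=1).values
        q = eval_net(xs).gather(1, acts.unsqueeze(1)).squeeze(1)
        loss = loss_fn(q, y)                          # step 5.6: minimize (y - Q(x,a;θ))²
        opt.zero_grad()
        loss.backward()
        opt.step()

    if k % S_SYNC == 0:                               # step 5.7: synchronize the target network
        target_net.load_state_dict(eval_net.state_dict())

# step 5.8: read off the final mode selection as the sets M_0 and M_1.
best_M0 = [i for i, m in enumerate(x) if m == 0]
best_M1 = [i for i, m in enumerate(x) if m == 1]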

Claims (2)

1. A mobile edge computation rate maximization method based on deep reinforcement learning, characterized by comprising the following steps:
1) In a wirelessly powered edge computing system comprising a base station and a plurality of wireless devices, the base station and each wireless device each have their own antenna; a radio frequency energy transmitter and an edge computing server are integrated in the base station, the base station is assumed to have a stable energy supply, and it broadcasts radio frequency energy to all wireless devices; each wireless device has an energy harvesting circuit and a rechargeable battery and performs its tasks using the stored harvested energy; in this wireless communication system, each wireless device needs to establish a connection with the base station, and the channel gain h_i between wireless device i and the base station is calculated as follows:
[The expression for h_i appears as an equation image in the original publication and is not reproduced here.]
wherein each parameter is defined as follows:
A_d: antenna gain;
π: the circumferential ratio (pi);
f_c: carrier frequency;
d_i: distance between wireless device i and the base station;
d_e: path loss exponent;
2) It is assumed that the computing task of each wireless device can either be executed on a local low-performance microprocessor or be offloaded to the edge computing server, which has far stronger processing power, processes the task, and then sends the result back to the wireless device; each wireless device is assumed to follow a binary computation offloading rule, i.e., it must choose either the local computation mode or the offloading mode; two non-overlapping sets M_0 and M_1 denote the wireless devices in the local computation mode and in the offloading mode, respectively, and the set N of all wireless devices is expressed as N = M_0 ∪ M_1.
3) A wireless device in set M_0 can harvest energy and process its local task at the same time, while a wireless device in set M_1 can only offload its task to the base station for processing after harvesting energy; since the computing and transmission capabilities of the base station are assumed to be much stronger than those of the energy-harvesting wireless devices, each offloading wireless device exhausts its harvested energy during task offloading; the computation rate sum maximization problem over all wireless devices is described as:
[Objective function: appears as an equation image in the original publication and is not reproduced here.]
the constraint conditions are as follows:
[Constraints: appear as equation images in the original publication and are not reproduced here.]
in the formula:
[The local-computation and offloading rate expressions appear as equation images in the original publication and are not reproduced here.]
wherein each parameter is defined as follows:
ω_i: a transition weight for the i-th wireless device;
μ: energy collection efficiency;
p: radio frequency energy transmission power;
φ: the number of computation cycles required to process each bit of data;
h_i: channel gain of the i-th wireless device;
k_i: energy efficiency coefficient of the i-th wireless device;
α: a time coefficient;
v_μ: conversion efficiency;
b: bandwidth;
τ_j: a time coefficient for the j-th wireless device;
N_0: the number of wireless devices in the local processing mode;
4) Find the optimal mode selection, i.e., the mode selection M_0 and M_1 of all wireless devices, by a reinforcement learning algorithm; the reinforcement learning system is composed of an agent and an environment; the mode selection M_0 and M_1 of all users is encoded as the current system state x_t; the agent takes an action a in the current state, enters the next state x_{t+1}, and receives the reward r(x_t, a) returned by the environment; through the continuous interaction of the agent with the environment, the mode selection M_0 and M_1 is optimized until the optimum is found; the agent is updated as:
Q_θ(x_t, a) = r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′)    (4)
wherein each parameter is defined as follows:
θ: parameter of the evaluation network;
θ′: parameter of the target network;
x_t: the system state at time t;
Q_θ(x_t, a): the Q value obtained by taking action a in state x_t;
r(x_t, a): the reward obtained by taking action a in state x_t;
γ: the discount factor applied to future rewards;
5) Take the mode selection M_0 and M_1 of all wireless devices as the deep reinforcement learning system state x_t; an action a is a modification of the system state x_t; if the total computation rate of the modified system is greater than before, the current reward r(x_t, a) is set to a positive value, otherwise it is set to a negative value, and the system enters the next state x_{t+1}.
2. The method according to claim 1, characterized in that in step 5), the iterative process of reinforcement learning is as follows:
step 5.1: initialize the evaluation network, the target network and the memory base used in reinforcement learning; the current system state is x_t, t is initialized to 1, and the iteration counter k is initialized to 1;
step 5.2: while k is less than or equal to the given iteration number K, randomly draw a probability p;
step 5.3: if p is less than or equal to ε, select the action a(t) output by the evaluation network; otherwise, randomly select an action;
step 5.4: after action a(t) is taken, obtain the reward r(t) and the next state x(t+1), and store the tuple (x(t), a(t), r(t), x(t+1)) in the memory base;
step 5.5: combining the output of the target network, calculate the target of the evaluation network y = r(x_t, a) + γ max_{a′} Q_{θ′}(x_{t+1}, a′);
step 5.6: minimize the error (y − Q_θ(x_t, a))² and update the parameter θ of the evaluation network so that the next prediction is more accurate;
step 5.7: every S steps, copy the parameters of the evaluation network to the target network; set k = k + 1 and return to step 5.2;
step 5.8: when k is greater than the given iteration number K, the learning process ends and the best mode selection M_0 and M_1 is obtained.
CN201810342359.5A 2018-04-17 2018-04-17 Mobile edge calculation rate maximization method based on deep reinforcement learning Active CN108632860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810342359.5A CN108632860B (en) 2018-04-17 2018-04-17 Mobile edge calculation rate maximization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810342359.5A CN108632860B (en) 2018-04-17 2018-04-17 Mobile edge calculation rate maximization method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108632860A CN108632860A (en) 2018-10-09
CN108632860B true CN108632860B (en) 2021-06-18

Family

ID=63705383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810342359.5A Active CN108632860B (en) 2018-04-17 2018-04-17 Mobile edge calculation rate maximization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108632860B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109618399A (en) * 2018-12-26 2019-04-12 东华大学 Distributed energy management solutions optimization method in the mobile edge calculations system of multi-user
CN109803292B (en) * 2018-12-26 2022-03-04 佛山市顺德区中山大学研究院 Multi-level user moving edge calculation method based on reinforcement learning
CN109756371B (en) * 2018-12-27 2022-04-29 上海无线通信研究中心 Game-based network node resource perception excitation method and system
CN110809306B (en) * 2019-11-04 2021-03-16 电子科技大学 Terminal access selection method based on deep reinforcement learning
CN113222166A (en) * 2020-01-21 2021-08-06 厦门邑通软件科技有限公司 Machine heuristic learning method, system and device for operation behavior record management
CN111556461B (en) * 2020-04-29 2023-04-21 南京邮电大学 Vehicle-mounted edge network task distribution and unloading method based on deep Q network
CN113727362B (en) * 2021-05-31 2022-10-28 南京邮电大学 Unloading strategy method of wireless power supply system based on deep reinforcement learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107708135A (en) * 2017-07-21 2018-02-16 上海交通大学 A kind of resource allocation methods for being applied to mobile edge calculations scene
CN107734558A (en) * 2017-10-26 2018-02-23 北京邮电大学 A kind of control of mobile edge calculations and resource regulating method based on multiserver
CN107846704A (en) * 2017-10-26 2018-03-27 北京邮电大学 A kind of resource allocation and base station service arrangement method based on mobile edge calculations
CN107872823A (en) * 2016-09-28 2018-04-03 维布络有限公司 The method and system of communication operational mode in the mobile edge calculations environment of identification
US9942825B1 (en) * 2017-03-27 2018-04-10 Verizon Patent And Licensing Inc. System and method for lawful interception (LI) of Network traffic in a mobile edge computing environment
CN107911242A (en) * 2017-11-15 2018-04-13 北京工业大学 A kind of cognitive radio based on industry wireless network and edge calculations method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107872823A (en) * 2016-09-28 2018-04-03 维布络有限公司 The method and system of communication operational mode in the mobile edge calculations environment of identification
US9942825B1 (en) * 2017-03-27 2018-04-10 Verizon Patent And Licensing Inc. System and method for lawful interception (LI) of Network traffic in a mobile edge computing environment
CN107708135A (en) * 2017-07-21 2018-02-16 上海交通大学 A kind of resource allocation methods for being applied to mobile edge calculations scene
CN107734558A (en) * 2017-10-26 2018-02-23 北京邮电大学 A kind of control of mobile edge calculations and resource regulating method based on multiserver
CN107846704A (en) * 2017-10-26 2018-03-27 北京邮电大学 A kind of resource allocation and base station service arrangement method based on mobile edge calculations
CN107911242A (en) * 2017-11-15 2018-04-13 北京工业大学 A kind of cognitive radio based on industry wireless network and edge calculations method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Computation Rate Maximization for Wireless Powered Mobile-Edge Computing With Binary Computation Offloading; Suzhi Bi et al.; IEEE Transactions on Wireless Communications; 2018-04-09; full text *

Also Published As

Publication number Publication date
CN108632860A (en) 2018-10-09

Similar Documents

Publication Publication Date Title
CN108632860B (en) Mobile edge calculation rate maximization method based on deep reinforcement learning
Engmann et al. Prolonging the lifetime of wireless sensor networks: a review of current techniques
Adu-Manu et al. Energy-harvesting wireless sensor networks (EH-WSNs) A review
CN107743308B (en) Node clustering data collection method and device for environmental monitoring
Zhang et al. An analytical approach to the design of energy harvesting wireless sensor nodes
Xie et al. Backscatter-assisted computation offloading for energy harvesting IoT devices via policy-based deep reinforcement learning
CN102316496A (en) Data merging method based on Kalman filtering in wireless sensor network
Siew et al. Cluster heads distribution of wireless sensor networks via adaptive particle swarm optimization
CN113286317B (en) Task scheduling method based on wireless energy supply edge network
WO2022242468A1 (en) Task offloading method and apparatus, scheduling optimization method and apparatus, electronic device, and storage medium
CN108738045B (en) Moving edge calculation rate maximization method based on depth certainty strategy gradient
CN114727359A (en) Unmanned aerial vehicle-assisted post-disaster clustering mine Internet of things data acquisition method
Chen et al. Learning aided joint sensor activation and mobile charging vehicle scheduling for energy-efficient WRSN-based industrial IoT
CN115562756A (en) Multi-access edge computing vehicle task unloading method and system
Koulali et al. Dynamic power control for energy harvesting wireless multimedia sensor networks
CN108738046B (en) Mobile edge calculation rate maximization method based on semi-supervised learning
CN115175347A (en) Wireless energy-carrying communication network resource allocation optimization method
Thiyagarajan et al. An investigation on energy consumption in wireless sensor network
CN114521023A (en) SWIPT-assisted NOMA-MEC system resource allocation modeling method
Benmad et al. Data collection in UAV-assisted wireless sensor networks powered by harvested energy
Alageswaran et al. Design and implementation of dynamic sink node placement using Particle Swarm Optimization for life time maximization of WSN applications
Liu et al. Learning-based multi-UAV assisted data acquisition and computation for information freshness in WPT enabled space-air-ground PIoT
Lin et al. Maximum data collection rate routing for data gather trees with data aggregation in rechargeable wireless sensor networks
CN111162852B (en) Ubiquitous power Internet of things access method based on matching learning
Alsharif et al. Notice of Retraction: Enabling Hardware Green Internet of Things: A review of Substantial Issues

Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant