CN112351433A - Heterogeneous network resource allocation method based on reinforcement learning - Google Patents

Heterogeneous network resource allocation method based on reinforcement learning

Info

Publication number
CN112351433A
Authority
CN
China
Prior art keywords
base station
user
state
network
resource allocation
Prior art date
Legal status
Granted
Application number
CN202110006111.3A
Other languages
Chinese (zh)
Other versions
CN112351433B (en)
Inventor
孙君
吴锡
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110006111.3A
Publication of CN112351433A
Application granted
Publication of CN112351433B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W 16/02 Resource partitioning among network components, e.g. reuse partitioning
    • H04W 16/10 Dynamic resource partitioning
    • H04W 52/00 Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W 52/04 TPC
    • H04W 52/18 TPC being performed according to specific parameters
    • H04W 52/24 TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W 52/243 TPC using SIR or other wireless path parameters taking into account interferences
    • H04W 52/244 Interferences in heterogeneous networks, e.g. among macro and femto or pico cells or other sector/system interference [OSI]
    • H04W 52/30 TPC using constraints in the total amount of available transmission power
    • H04W 52/34 TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
    • H04W 52/346 TPC management distributing total power among users or channels
    • H04W 72/00 Local resource management
    • H04W 72/04 Wireless resource allocation
    • H04W 72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0453 Resources in frequency domain, e.g. a carrier in FDMA
    • H04W 72/0473 Wireless resource allocation based on the type of the allocated resource, the resource being transmission power

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a heterogeneous network resource allocation method based on reinforcement learning. First, a DNN framework is deployed on each base station; the framework is based on the ADMM algorithm and uses channel information as the weights of the network. According to the data obtained by the base station, namely the current user association information and the average interference power, it gives the optimal resource allocation strategy in the current state. Each base station is then regarded as an independent agent, and the states of the base stations are taken as the modelled environment; multiple agents observe the same heterogeneous network environment, take actions, and communicate with one another through the rewards returned by the environment, adjusting their policies according to those rewards. Because it is based on a deep learning network, the proposed resource allocation method can provide an allocation scheme without full CSI; it also takes spectrum efficiency into account by setting the spectrum efficiency function as the agent's reward, so spectrum efficiency is guaranteed while the system throughput is maintained.

Description

Heterogeneous network resource allocation method based on reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to a heterogeneous network resource allocation method based on reinforcement learning.
Background
With the rapid growth of mobile devices and the emergence of the Internet of Things, next-generation wireless networks face great challenges from the proliferation of wireless applications. The most promising solution is to augment existing cellular networks with pico and femto cells having various transmission powers and coverage areas. These heterogeneous networks (HetNets) can offload user equipments (UEs) from macro base stations (MBS) to pico base stations (PBS). In addition, to achieve high spectral efficiency, a PBS may reuse the spectrum of the MBS and share the same channels with it. Heterogeneous networks are therefore considered a good strategy for increasing the capacity of future wireless communication systems. Such networks pose several optimization problems, such as spectrum allocation and resource allocation. Recent studies have proposed new approaches such as game-theoretic methods, linear programming and Markov approximation strategies. However, these methods require almost complete information, which is generally not available, so it is challenging for them to reach an optimal solution without such complete information.
Disclosure of Invention
The invention provides a dynamic resource allocation scheme for the downlink resource allocation problem in heterogeneous cellular networks. In particular, dynamic power allocation and channel allocation strategies are provided for the base stations. To improve spectral efficiency and energy efficiency in heterogeneous cellular networks, an optimization framework based on deep neural networks (DNN) is first constructed from a series of alternating direction method of multipliers (ADMM) iterations, with the channel state information (CSI) used as the learned weights. A deep reinforcement learning (DRL) framework is then applied to obtain a resource allocation scheme that accounts for both spectrum efficiency (SE) and energy efficiency (EE).
In the downlink of a heterogeneous network with M base stations and N mobile users, the number of macro base stations (MBS) plus the number of micro base stations (PBS) equals M.

Let x(m,n) denote the association relationship between base station m and user n: x(m,n) = 1 indicates that base station m and user n are associated, and x(m,n) = 0 indicates that they are not associated.

Let the spectrum state ρ(n,k) indicate whether user n, associated with base station m, occupies subcarrier k; it is determined by the following rule: ρ(n,k) = 1 indicates that the user uses channel k, and ρ(n,k) = 0 indicates that the user does not use channel k.

Let p(m,n,k) denote the transmission power from base station m to user n on channel k. The total transmit power of each cell's base station must stay below a preset power limit P_max, i.e. the powers p(m,n,k) allocated by base station m sum to no more than P_max.

A block fading model is used, and the downlink channel gain from base station m to user n in time slot t is the product of a large-scale fading component and a small-scale fading component. The large-scale component includes path loss and lognormal shadowing and follows the Jakes fading model. The small-scale Rayleigh fading component h(t) is expressed as a first-order Gauss-Markov process

h(t) = ρ_0 · h(t−1) + sqrt(1 − ρ_0²) · e(t),

where the innovations e(t) are independent, identically distributed circularly symmetric complex Gaussian random variables with unit variance and ρ_0 = J_0(2π f_d T), with J_0 the zero-order Bessel function of the first kind and f_d the maximum Doppler frequency.

Users in different cells that are allocated the same subcarrier experience inter-cell interference (ICI). The ICI experienced by user n served by base station m on subcarrier k is

I(m,n,k) = Σ_{m'≠m} p(m',n',k) · |g(m',n,k)|²,

where p(m',n',k) is the transmission power from base station m' to its user n' on subcarrier k and |g(m',n,k)|² is the squared channel gain from base station m' to user n on subcarrier k.

The signal-to-interference-plus-noise ratio of user n served by base station m on subcarrier k is then

SINR(m,n,k) = p(m,n,k) · |g(m,n,k)|² / (I(m,n,k) + σ²),

where σ² is the power of the additive white Gaussian noise on the link from base station m to user n. When base station m serves user n and base station m' serves user n' on the same subcarrier k, base station m' causes interference to user n of base station m.
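As an illustration of how the interference and SINR quantities above combine, the following minimal NumPy sketch computes per-link ICI and SINR from given powers and squared channel gains; the array layout, the variable names and the assumption that interference sums over the total power of every other base station on the same subcarrier are choices made for this example, not definitions taken from the patent.

    import numpy as np

    def downlink_sinr(p, g2, assoc, noise_power):
        """Per-link SINR under subcarrier reuse across cells.

        p[m, n, k]  : transmit power of base station m to user n on subcarrier k
        g2[m, n, k] : squared channel gain from base station m to user n on subcarrier k
        assoc[m, n] : 1 if user n is served by base station m, else 0
        noise_power : AWGN power sigma^2
        """
        M, N, K = p.shape
        signal = p * g2 * assoc[:, :, None]          # received power on served links only
        tx_per_bs = p.sum(axis=1)                    # [m, k]: total power of BS m on subcarrier k
        ici = np.zeros_like(p)
        for m in range(M):
            for n in range(N):
                # interference from every other base station transmitting on the same subcarrier
                ici[m, n, :] = sum(tx_per_bs[mp, :] * g2[mp, n, :] for mp in range(M) if mp != m)
        return np.where(signal > 0.0, signal / (ici + noise_power), 0.0)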
Step S1: a DNN framework is deployed for each base station; the framework is based on the ADMM algorithm and uses the channel information CSI as the weights of the heterogeneous network. According to the user association information and the average interference power obtained by the base station, the framework gives the optimal resource allocation strategy in the current state. Specifically:

A spectral efficiency objective optimization function and an energy efficiency objective optimization function are defined over the association, subcarrier and power variables. The resulting multi-objective optimization problem is solved with the ADMM algorithm: the constraints are absorbed into an augmented Lagrangian that combines the objectives with Lagrange multipliers λ and a penalty parameter, so the problem can be expressed as an unconstrained optimization problem. Taking partial derivatives of the augmented Lagrangian with respect to each block of variables and setting them to zero yields the optimal value of each variable in closed form; the resulting update expressions, together with the auxiliary quantities they depend on, constitute one ADMM iteration.
Step S2: each base station is regarded as an independent agent, and the states of the base stations are taken as the modelled environment. Multiple agents observe the same heterogeneous network environment and take actions, and they communicate with one another through the rewards returned by the environment; each agent adjusts its policy according to the reward. Specifically:

State set S: the environment is described by the states of the M base stations. The state characterizing the heterogeneous network environment consists of the user association information X and the interference power I, so a heterogeneous network state is represented as s = {X, I}.

Action set A: based on the current state, an agent takes an action according to a decision policy π. An action consists of selecting a subcarrier k and the corresponding transmission power p, so an action is represented as a = {k, p}.

Reward: after an action is taken, the agent computes the reward returned by the environment. The energy efficiency function of the system model is defined as the reward.
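As a hedged illustration of using an efficiency measure as the reward, one common form (Shannon sum rate divided by total consumed power) is sketched below; the patent gives its energy efficiency function only as an image, so this exact expression, the bandwidth value and the fixed circuit-power term are assumptions for the example.

    import numpy as np

    def energy_efficiency_reward(sinr, p, bandwidth_hz=180e3, circuit_power_w=1.0):
        """Energy-efficiency style reward: total throughput divided by total consumed power.

        sinr[m, n, k] : per-link SINR values
        p[m, n, k]    : allocated transmit powers in watts
        """
        throughput = bandwidth_hz * np.log2(1.0 + sinr).sum()   # Shannon sum rate over all links
        total_power = p.sum() + circuit_power_w                 # transmit power plus a fixed circuit term
        return throughput / total_power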
A DNN-based optimization framework is designed and combined with Q-learning to generate the policy π. The input to the DNN-based optimization framework is the set of observed states S, and its output covers all executable actions in the action set A; each state-action pair has a corresponding Q value Q(s, a). At each step the action achieving the maximum Q value in the current state is selected:

a = argmax_{a' ∈ A(s)} Q(s, a').

The Q value is updated according to the Q-learning algorithm:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a' ∈ A(s')} Q(s', a') − Q(s, a) ],

where α and γ are the learning rate and the discount factor respectively, s' is the next state, r is the reward obtained after taking the action in state s, A(s') is the set of executable actions in state s', Q(s, a) is the Q value in state s, and max_{a' ∈ A(s')} Q(s', a') is the maximum Q value over the executable actions in state s'.

The loss function in each agent is the squared difference between the target value computed with the target network, whose weights are denoted θ⁻, and the Q value Q(s, a; θ) predicted by the online network. An ε-greedy policy selects the action from the online network Q(s, a; θ); the target network Q(s, a; θ⁻) keeps its weights fixed while multiple iterations are performed updating the weights of the online network.
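The interaction between the online network, the fixed target network and the ε-greedy action selection described above can be sketched with a generic DQN update; the PyTorch code below is an illustrative sketch, not the patent's network, and the batch format and hyper-parameters are assumptions.

    import random
    import torch
    import torch.nn as nn

    def dqn_update(online, target, optimizer, batch, gamma=0.99):
        """One DQN step: TD target from the frozen target network, squared loss on the online network."""
        s, a, r, s_next = batch                                   # states [B,D], actions [B], rewards [B], next states [B,D]
        q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
        with torch.no_grad():
            td_target = r + gamma * target(s_next).max(dim=1).values   # r + gamma * max_a' Q(s', a'; theta^-)
        loss = nn.functional.mse_loss(q_sa, td_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    def epsilon_greedy(online, state, n_actions, eps=0.1):
        """Select an action from the online network with probability 1 - eps, otherwise explore."""
        if random.random() < eps:
            return random.randrange(n_actions)
        with torch.no_grad():
            return int(online(state.unsqueeze(0)).argmax(dim=1).item())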
Further, the resource allocation method based on the ADMM algorithm in step S1 specifically includes the following steps:

Step S1.1: update the currently observed state (user association information and average interference power).

Step S1.2: initialize the optimization variables and the Lagrange multipliers.

Step S1.3: set the convergence threshold and the maximum number of iterations, and start iterating; the DNN-based network computes the variable updates of each iteration, and once the convergence threshold is met the corresponding resource allocation result is output.
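Steps S1.1 to S1.3 amount to an iterate-until-threshold loop around the closed-form ADMM updates; the sketch below shows that control flow only, with the update functions passed in as placeholders because the patent presents the closed-form expressions as images.

    import numpy as np

    def admm_allocate(state, init, update_primal, update_aux, update_dual,
                      tol=1e-4, max_iter=200):
        """Generic ADMM-style loop with a threshold / max-iteration stopping rule.

        state    : currently observed (association, interference) information
        init     : (x0, z0, lam0) initial variables and multipliers
        update_* : callables standing in for the closed-form updates (placeholders)
        """
        x, z, lam = init
        for _ in range(max_iter):
            x_new = update_primal(state, z, lam)       # allocation-variable update
            z_new = update_aux(state, x_new, lam)      # auxiliary-variable update
            lam = update_dual(lam, x_new, z_new)       # multiplier update with the penalty parameter
            if np.linalg.norm(x_new - x) < tol and np.linalg.norm(z_new - z) < tol:
                return x_new, z_new                    # threshold met: output the allocation
            x, z = x_new, z_new
        return x, z                                    # maximum number of iterations reached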
Further, in step S2 the optimal resource allocation scheme is obtained by using the ADMM network that takes the channel state information as the network weights; the specific steps are as follows:

Step S2.1: initialize the replay memory D, the DQN network parameters θ, and the target network replacement step size C.

Step S2.2: initialize the online network Q with weights θ, and initialize the target network Q̂ with weights θ⁻ set equal to θ.

Step S2.3: set the convergence threshold.

Step S2.4: according to the current state information, each agent selects a decision with an ε-greedy policy.

Step S2.5: update the environment and receive the reward.

Step S2.6: each agent observes the rewards obtained by all agents and stores the experience into its own replay memory D.

Step S2.7: sample randomly from D, compute the loss function and update the weights θ; every C steps, update the target network parameters θ⁻; repeat until all agents meet the threshold or the maximum number of iteration steps is reached.
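Putting steps S2.1 to S2.7 together, a hedged outline of the per-agent training loop could look like the following; the environment interface, the agent methods and the memory and update-period values are illustrative placeholders rather than details fixed by the patent.

    import random

    def train_agents(env, agents, episodes=500, replace_every=100, batch_size=32):
        """Multi-agent DQN loop: epsilon-greedy acting, shared-reward storage, periodic target sync.

        env    : exposes reset() -> states and step(actions) -> (next_states, rewards, done)
        agents : objects with act(state), a replay memory (list/deque), learn(batch) and sync_target()
        """
        step = 0
        for _ in range(episodes):
            states = env.reset()
            done = False
            while not done:
                actions = [ag.act(s) for ag, s in zip(agents, states)]          # S2.4: epsilon-greedy decisions
                next_states, rewards, done = env.step(actions)                  # S2.5: environment update and rewards
                for ag, s, a, s_next in zip(agents, states, actions, next_states):
                    ag.memory.append((s, a, rewards, s_next))                   # S2.6: each agent stores all agents' rewards
                for ag in agents:
                    if len(ag.memory) >= batch_size:
                        ag.learn(random.sample(ag.memory, batch_size))          # S2.7: sample, compute loss, update weights
                if step % replace_every == 0:
                    for ag in agents:
                        ag.sync_target()                                        # periodic target-network replacement
                states = next_states
                step += 1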
Compared with the prior art, the invention has the following technical advantages:
(1) When solving the resource allocation problem in a heterogeneous network, traditional convex optimization methods struggle to provide an allocation scheme when the CSI is incomplete; because the proposed method is based on a deep learning network, it can provide a resource allocation scheme without requiring all of the CSI.
(2) Spectrum efficiency is considered alongside resource allocation, and the method applies model-driven deep reinforcement learning, which has not previously been applied to heterogeneous network resource allocation schemes. Setting the spectrum efficiency function as the agent's reward ensures spectrum efficiency while maintaining the system throughput.
Drawings
FIG. 1 is a schematic diagram of a dual-layer heterogeneous cellular network provided by the present invention;
fig. 2 is a structural diagram of a DNN optimization framework based on an ADMM algorithm provided by the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The dual-layer heterogeneous cellular network shown in FIG. 1 comprises M base stations and N mobile users; the number of macro base stations (MBS) plus the number of micro base stations (PBS) equals M. Each base station is located at the center of its cell, and the licensed mobile users are randomly distributed in the cell. It is assumed that every two adjacent small cells have an overlapping area, and that each communication terminal is equipped with an antenna for signal transmission. To maximize the use of radio resources and avoid trivial cases, the frequency reuse factor is set to 1; to avoid intra-cell interference, it is assumed that each user in a cell is allocated only one subcarrier, so all signals within the same cell are orthogonal. The N orthogonal subcarriers used in a cell may be reused in each neighboring cell. Users in the overlapping areas, however, are served by the nearest small-cell BS and may suffer severe inter-cell interference (ICI), because the neighboring cells may use the same spectral resources.
Let x(m,n) denote the association relationship between base station m and user n: x(m,n) = 1 indicates that base station m and user n are associated, and x(m,n) = 0 indicates that they are not associated.

Let the spectrum state ρ(n,k) indicate whether user n, associated with base station m, occupies subcarrier k; it is determined by the following rule: ρ(n,k) = 1 indicates that the user uses channel k, and ρ(n,k) = 0 indicates that the user does not use channel k.

Let p(m,n,k) denote the transmission power from base station m to user n on channel k. The total transmit power of each cell's base station must stay below a preset power limit P_max, i.e. the powers p(m,n,k) allocated by base station m sum to no more than P_max.

A block fading model is used, and the downlink channel gain from base station m to user n in time slot t is the product of a large-scale fading component and a small-scale fading component. The large-scale component includes path loss and lognormal shadowing and follows the Jakes fading model. The small-scale Rayleigh fading component h(t) is expressed as a first-order Gauss-Markov process

h(t) = ρ_0 · h(t−1) + sqrt(1 − ρ_0²) · e(t),

where the innovations e(t) are independent, identically distributed circularly symmetric complex Gaussian random variables with unit variance and ρ_0 = J_0(2π f_d T), with J_0 the zero-order Bessel function of the first kind and f_d the maximum Doppler frequency.

Users in different cells that are allocated the same subcarrier experience inter-cell interference (ICI). The ICI experienced by user n served by base station m on subcarrier k is

I(m,n,k) = Σ_{m'≠m} p(m',n',k) · |g(m',n,k)|²,

where p(m',n',k) is the transmission power from base station m' to its user n' on subcarrier k and |g(m',n,k)|² is the squared channel gain from base station m' to user n on subcarrier k.

The signal-to-interference-plus-noise ratio of user n served by base station m on subcarrier k is then

SINR(m,n,k) = p(m,n,k) · |g(m,n,k)|² / (I(m,n,k) + σ²),

where σ² is the power of the additive white Gaussian noise on the link from base station m to user n. When base station m serves user n and base station m' serves user n' on the same subcarrier k, base station m' causes interference to user n of base station m.
the embodiment of the invention is divided into two parts, firstly, a DNN frame is deployed for each base station, the DNN frame is based on an ADMM algorithm, and channel information CSI is used as the weight of a heterogeneous network; it is assumed that the long term average interference power received by each UE can be estimated and fed back to the serving base station through a feedback channel. This information exchange requires very limited resources to be obtained with very low frequency compared to the required signal CSI. Giving an optimal resource allocation strategy in the current state according to the user association information and the average interference power obtained by the base station; in particular, the amount of the solvent to be used,
deploying a DNN framework for each base station, wherein the DNN framework is based on an ADMM algorithm and takes channel information CSI as heterogeneous network weight; giving an optimal resource allocation strategy in the current state according to the user association information and the average interference power obtained by the base station; in particular, the amount of the solvent to be used,
A spectral efficiency objective optimization function and an energy efficiency objective optimization function are defined over the association, subcarrier and power variables. The resulting multi-objective optimization problem is solved with the ADMM algorithm: the constraints are absorbed into an augmented Lagrangian that combines the objectives with Lagrange multipliers λ and a penalty parameter, so the problem can be expressed as an unconstrained optimization problem. Taking partial derivatives of the augmented Lagrangian with respect to each block of variables and setting them to zero yields the optimal value of each variable in closed form; the resulting update expressions, together with the auxiliary quantities they depend on, constitute one ADMM iteration.
the DNN-based optimization framework shown in fig. 2 includes neurons corresponding to different operations in the ADMM iteration process, and directed edges corresponding to the data flow between the operations. Thus, the first of the DNN-based optimization frameworkskLayer corresponds to the second of ADMM procedurekAnd (6) iteration. Upon entering the DNN-based optimization framework, the input data flows through multiple layers of repetition, which correspond to successive iterations in the ADMM. When the convergence condition is satisfied, the DNN-based optimization framework will generate a resource allocation result. Specifically, the resource allocation method based on the ADMM algorithm comprises the following specific steps:
Step S1.1: update the currently observed state (user association information and average interference power).

Step S1.2: initialize the optimization variables and the Lagrange multipliers.

Step S1.3: set the convergence threshold and the maximum number of iterations, and start iterating; the DNN-based network computes the variable updates of each iteration, and once the convergence threshold is met the corresponding resource allocation result is output.
In the second part, each base station is regarded as an independent agent, and the states of the base stations are taken as the modelled environment. Multiple agents observe the same heterogeneous network environment and take actions, and they communicate with one another through the rewards returned by the environment; each agent adjusts its policy according to the reward. Specifically:

State set S: the environment is described by the states of the M base stations. The state characterizing the heterogeneous network environment consists of the user association information X and the interference power I, so a heterogeneous network state is represented as s = {X, I}.

Action set A: based on the current state, an agent takes an action according to a decision policy π. An action consists of selecting a subcarrier k and the corresponding transmission power p, so an action is represented as a = {k, p}.

Reward: after an action is taken, the agent computes the reward returned by the environment. The energy efficiency function of the system model is defined as the reward.
A DNN-based optimization framework is designed and combined with Q-learning to generate the policy π. The input to the DNN-based optimization framework is the set of observed states S, and its output covers all executable actions in the action set A; each state-action pair has a corresponding Q value Q(s, a). At each step the action achieving the maximum Q value in the current state is selected:

a = argmax_{a' ∈ A(s)} Q(s, a').

The Q value is updated according to the Q-learning algorithm:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a' ∈ A(s')} Q(s', a') − Q(s, a) ],

where α and γ are the learning rate and the discount factor respectively, s' is the next state, r is the reward obtained after taking the action in state s, A(s') is the set of executable actions in state s', Q(s, a) is the Q value in state s, and max_{a' ∈ A(s')} Q(s', a') is the maximum Q value over the executable actions in state s'.

The loss function in each agent is the squared difference between the target value computed with the target network, whose weights are denoted θ⁻, and the Q value Q(s, a; θ) predicted by the online network. An ε-greedy policy selects the action from the online network Q(s, a; θ); the target network Q(s, a; θ⁻) keeps its weights fixed while multiple iterations are performed updating the weights of the online network.
Specifically, the steps for obtaining the optimal resource allocation scheme by using the ADMM network that takes the channel state information as the network weights are as follows:

Step S2.1: initialize the replay memory D, the DQN network parameters θ, and the target network replacement step size C.

Step S2.2: initialize the online network Q with weights θ, and initialize the target network Q̂ with weights θ⁻ set equal to θ.

Step S2.3: set the convergence threshold.

Step S2.4: according to the current state information, each agent selects a decision with an ε-greedy policy.

Step S2.5: update the environment and receive the reward.

Step S2.6: each agent observes the rewards obtained by all agents and stores the experience into its own replay memory D.

Step S2.7: sample randomly from D, compute the loss function and update the weights θ; every C steps, update the target network parameters θ⁻; repeat until all agents meet the threshold or the maximum number of iteration steps is reached.

Claims (3)

1. A heterogeneous network resource allocation method based on reinforcement learning, characterized in that, in the downlink of a heterogeneous network with M base stations and N mobile users, the number of macro base stations (MBS) plus the number of micro base stations (PBS) equals M;

let x(m,n) denote the association relationship between base station m and user n: x(m,n) = 1 indicates that base station m and user n are associated, and x(m,n) = 0 indicates that they are not associated;

let the spectrum state ρ(n,k) indicate whether user n, associated with base station m, occupies subcarrier k, determined by the following rule: ρ(n,k) = 1 indicates that the user uses channel k, and ρ(n,k) = 0 indicates that the user does not use channel k;

let p(m,n,k) denote the transmission power from base station m to user n on channel k, where the total transmit power of each cell's base station must stay below a preset power limit P_max;

a block fading model is used, and the downlink channel gain from base station m to user n in time slot t is the product of a large-scale fading component, which includes path loss and lognormal shadowing and follows the Jakes fading model, and a small-scale Rayleigh fading component h(t) expressed as a first-order Gauss-Markov process h(t) = ρ_0 · h(t−1) + sqrt(1 − ρ_0²) · e(t), where the innovations e(t) are independent, identically distributed circularly symmetric complex Gaussian random variables with unit variance and ρ_0 = J_0(2π f_d T), with J_0 the zero-order Bessel function of the first kind and f_d the maximum Doppler frequency;

the inter-cell interference ICI experienced when users in different cells are allocated the same subcarrier is I(m,n,k) = Σ_{m'≠m} p(m',n',k) · |g(m',n,k)|², where p(m',n',k) is the transmission power from base station m' to its user n' on subcarrier k and |g(m',n,k)|² is the squared channel gain from base station m' to user n on subcarrier k;

the signal-to-interference-plus-noise ratio of user n served by base station m on subcarrier k is SINR(m,n,k) = p(m,n,k) · |g(m,n,k)|² / (I(m,n,k) + σ²), where σ² is the power of the additive white Gaussian noise from base station m to user n; when base station m serves user n and base station m' serves user n' on the same subcarrier k, base station m' causes interference to user n of base station m;
step S1, deploying a DNN framework for each base station, the framework being based on the ADMM algorithm and using the channel information CSI as the weights of the heterogeneous network, and giving the optimal resource allocation strategy in the current state according to the user association information and the average interference power obtained by the base station; specifically, a spectral efficiency objective optimization function and an energy efficiency objective optimization function are defined, the resulting multi-objective optimization problem is solved with the ADMM algorithm by forming an augmented Lagrangian with Lagrange multipliers λ and a penalty parameter so that the problem becomes unconstrained, and the optimal value of each variable is obtained in closed form by taking partial derivatives of the augmented Lagrangian and setting them to zero;
step S2, regarding each base station as an independent agent and taking the states of the base stations as the modelled environment; multiple agents observe the same heterogeneous network environment and take actions, communicate with one another through the rewards returned by the environment, and adjust their policies according to the reward; specifically:

state set S: the environment is described by the states of the M base stations; the state characterizing the heterogeneous network environment consists of the user association information X and the interference power I, so a heterogeneous network state is represented as s = {X, I};

action set A: based on the current state, an agent takes an action according to a decision policy π; an action consists of selecting a subcarrier k and the corresponding transmission power p, so an action is represented as a = {k, p};

reward: after an action is taken, the agent computes the reward returned by the environment; the energy efficiency function of the system model is defined as the reward;

a DNN-based optimization framework is designed and combined with Q-learning to generate the policy π; the input to the DNN-based optimization framework is the set of observed states S, and its output covers all executable actions in the action set A; each state-action pair has a corresponding Q value Q(s, a); at each step the action achieving the maximum Q value in the current state is selected, a = argmax_{a' ∈ A(s)} Q(s, a'); the Q value is updated according to the Q-learning algorithm Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a' ∈ A(s')} Q(s', a') − Q(s, a) ], where α and γ are the learning rate and the discount factor respectively, s' is the next state, r is the reward obtained after taking the action in state s, A(s') is the set of executable actions in state s', and max_{a' ∈ A(s')} Q(s', a') is the maximum Q value over the executable actions in state s'; the loss function in each agent is the squared difference between the target value computed with the target network, whose weights are denoted θ⁻, and the Q value Q(s, a; θ) predicted by the online network; an ε-greedy policy selects the action from the online network Q(s, a; θ), and the target network Q(s, a; θ⁻) keeps its weights fixed while multiple iterations are performed updating the weights of the online network.
2. The reinforcement learning-based heterogeneous network resource allocation method according to claim 1, wherein the resource allocation method based on the ADMM algorithm in step S1 specifically includes the following steps:

step S1.1, updating the currently observed state (user association information and average interference power);

step S1.2, initializing the optimization variables and the Lagrange multipliers;

step S1.3, setting the convergence threshold and the maximum number of iterations and starting the iteration; the DNN-based network computes the variable updates of each iteration, and once the convergence threshold is met the corresponding resource allocation result is output.
3. The method according to claim 1, wherein step S2 obtains the optimal resource allocation scheme using the ADMM network that takes the channel state information as the network weights, and includes the following steps:

step S2.1, initializing the replay memory D, the DQN network parameters θ, and the target network replacement step size C;

step S2.2, initializing the online network Q with weights θ, and initializing the target network Q̂ with weights θ⁻ set equal to θ;

step S2.3, setting the convergence threshold;

step S2.4, each agent selecting a decision with an ε-greedy policy according to the current state information;

step S2.5, updating the environment and receiving the reward;

step S2.6, each agent observing the rewards obtained by all agents and storing the experience into its own replay memory D;

step S2.7, sampling randomly from D, computing the loss function and updating the weights θ, and every C steps updating the target network parameters θ⁻, until all agents meet the threshold or the maximum number of iteration steps is reached.
CN202110006111.3A 2021-01-05 2021-01-05 Heterogeneous network resource allocation method based on reinforcement learning Active CN112351433B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110006111.3A CN112351433B (en) 2021-01-05 2021-01-05 Heterogeneous network resource allocation method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110006111.3A CN112351433B (en) 2021-01-05 2021-01-05 Heterogeneous network resource allocation method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112351433A true CN112351433A (en) 2021-02-09
CN112351433B CN112351433B (en) 2021-05-25

Family

ID=74427832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110006111.3A Active CN112351433B (en) 2021-01-05 2021-01-05 Heterogeneous network resource allocation method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112351433B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113162682A (en) * 2021-05-13 2021-07-23 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113242602A (en) * 2021-05-10 2021-08-10 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113473580A (en) * 2021-05-14 2021-10-01 南京信息工程大学滨江学院 Deep learning-based user association joint power distribution strategy in heterogeneous network
CN113613301A (en) * 2021-08-04 2021-11-05 北京航空航天大学 Air-space-ground integrated network intelligent switching method based on DQN
CN114116156A (en) * 2021-10-18 2022-03-01 武汉理工大学 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
CN114205899A (en) * 2022-01-18 2022-03-18 电子科技大学 Heterogeneous network high energy efficiency power control method based on deep reinforcement learning
CN114340017A (en) * 2022-03-17 2022-04-12 山东科技大学 Heterogeneous network resource slicing method with eMBB and URLLC mixed service

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102238631A (en) * 2011-08-17 2011-11-09 南京邮电大学 Method for managing heterogeneous network resources based on reinforcement learning
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
CN108521673A (en) * 2018-04-09 2018-09-11 湖北工业大学 Resource allocation and power control combined optimization method based on intensified learning in a kind of heterogeneous network
CN108848561A (en) * 2018-04-11 2018-11-20 湖北工业大学 A kind of isomery cellular network combined optimization method based on deeply study
US20190124667A1 (en) * 2017-10-23 2019-04-25 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for allocating transmission resources using reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102238631A (en) * 2011-08-17 2011-11-09 南京邮电大学 Method for managing heterogeneous network resources based on reinforcement learning
CN106358308A (en) * 2015-07-14 2017-01-25 北京化工大学 Resource allocation method for reinforcement learning in ultra-dense network
US20190124667A1 (en) * 2017-10-23 2019-04-25 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for allocating transmission resources using reinforcement learning
CN108521673A (en) * 2018-04-09 2018-09-11 湖北工业大学 Resource allocation and power control combined optimization method based on intensified learning in a kind of heterogeneous network
CN108848561A (en) * 2018-04-11 2018-11-20 湖北工业大学 A kind of isomery cellular network combined optimization method based on deeply study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯陈伟 et al.: "Resource management algorithm for heterogeneous wireless networks based on reinforcement learning" (基于强化学习的异构无线网络资源管理算法), 《电信科学》 (Telecommunications Science) *
陈前斌 et al.: "Adaptive radio resource allocation algorithm based on deep reinforcement learning for heterogeneous cloud radio access networks" (基于深度强化学习的异构云无线接入网自适应无线资源分配算法), 《电子与信息学报》 (Journal of Electronics & Information Technology) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113242602A (en) * 2021-05-10 2021-08-10 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113242602B (en) * 2021-05-10 2022-04-22 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113162682A (en) * 2021-05-13 2021-07-23 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113162682B (en) * 2021-05-13 2022-06-24 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113473580A (en) * 2021-05-14 2021-10-01 南京信息工程大学滨江学院 Deep learning-based user association joint power distribution strategy in heterogeneous network
CN113473580B (en) * 2021-05-14 2024-04-26 南京信息工程大学滨江学院 User association joint power distribution method based on deep learning in heterogeneous network
CN113613301A (en) * 2021-08-04 2021-11-05 北京航空航天大学 Air-space-ground integrated network intelligent switching method based on DQN
CN113613301B (en) * 2021-08-04 2022-05-13 北京航空航天大学 Air-ground integrated network intelligent switching method based on DQN
CN114116156A (en) * 2021-10-18 2022-03-01 武汉理工大学 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
CN114205899A (en) * 2022-01-18 2022-03-18 电子科技大学 Heterogeneous network high energy efficiency power control method based on deep reinforcement learning
CN114205899B (en) * 2022-01-18 2023-04-07 电子科技大学 Heterogeneous network high-energy-efficiency power control method based on deep reinforcement learning
CN114340017A (en) * 2022-03-17 2022-04-12 山东科技大学 Heterogeneous network resource slicing method with eMBB and URLLC mixed service

Also Published As

Publication number Publication date
CN112351433B (en) 2021-05-25

Similar Documents

Publication Publication Date Title
CN112351433B (en) Heterogeneous network resource allocation method based on reinforcement learning
Alqerm et al. Sophisticated online learning scheme for green resource allocation in 5G heterogeneous cloud radio access networks
Zappone et al. User association and load balancing for massive MIMO through deep learning
Xie et al. Energy-efficient resource allocation for heterogeneous cognitive radio networks with femtocells
Wang et al. Price-based spectrum management in cognitive radio networks
Samarakoon et al. Backhaul-aware interference management in the uplink of wireless small cell networks
CN107426773B (en) Energy efficiency-oriented distributed resource allocation method and device in wireless heterogeneous network
CN106358308A (en) Resource allocation method for reinforcement learning in ultra-dense network
Dai et al. Energy-efficient resource allocation for energy harvesting-based device-to-device communication
Wu et al. QoE-based distributed multichannel allocation in 5G heterogeneous cellular networks: A matching-coalitional game solution
CN106792451B (en) D2D communication resource optimization method based on multi-population genetic algorithm
CN104717755A (en) Downlink frequency spectrum resource distribution method with D2D technology introduced in cellular network
CN113316154A (en) Authorized and unauthorized D2D communication resource joint intelligent distribution method
CN110191489B (en) Resource allocation method and device based on reinforcement learning in ultra-dense network
Bi et al. Deep reinforcement learning based power allocation for D2D network
Yu et al. Interference coordination strategy based on Nash bargaining for small‐cell networks
Han et al. Power allocation for device-to-device underlay communication with femtocell using stackelberg game
Mach et al. Power allocation, channel reuse, and positioning of flying base stations with realistic backhaul
Aboagye et al. Energy-efficient resource allocation for aggregated RF/VLC systems
Venkateswararao et al. Traffic aware sleeping strategies for small-cell base station in the ultra dense 5G small cell networks
Najla et al. Efficient exploitation of radio frequency and visible light communication bands for D2D in mobile networks
Su et al. User-centric base station clustering and resource allocation for cell-edge users in 6G ultra-dense networks
Pantisano et al. On the dynamic formation of cooperative multipoint transmissions in small cell networks
Marshoud et al. Macrocell–femtocells resource allocation with hybrid access motivational model
Eliodorou et al. User association coalition games with zero-forcing beamforming and NOMA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant