CN112351433B

CN112351433B - Heterogeneous network resource allocation method based on reinforcement learning

Info

Publication number: CN112351433B
Application number: CN202110006111.3A
Authority: CN
Inventors: 孙君; 吴锡
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2021-05-25
Anticipated expiration: 2041-01-05
Also published as: CN112351433A

Abstract

The invention discloses a heterogeneous network resource allocation method based on reinforcement learning, which comprises the steps of firstly deploying a DNN framework on each base station, wherein the DNN framework is based on an ADMM algorithm and takes channel information as the weight of a network; giving an optimal resource allocation strategy in the current state according to data obtained by a base station, namely current user association information and average interference power; regarding each base station as an independent subject, and regarding the state of the base station as a modeling environment; a plurality of agent programs observe the same heterogeneous network environment and take action, and meanwhile, the agent programs are communicated with each other through the awards of the environment; the agent adjusts the policy according to the reward; the resource allocation method provided by the invention is based on the deep learning network, can provide a resource allocation scheme without all CSI information, considers the spectrum efficiency at the same time, sets the spectrum efficiency function as the reward of an agent, and can ensure the spectrum efficiency while ensuring the system throughput.

Description

Heterogeneous network resource allocation method based on reinforcement learning

Technical Field

The invention relates to the technical field of wireless communication, in particular to a heterogeneous network resource allocation method based on reinforcement learning.

Background

With the rapid growth of mobile devices and the emergence of the internet of things, next generation wireless networks face a great challenge to the proliferation of wireless applications. The most promising solution is to augment existing cellular networks with pico and femto cells with various transmission powers and coverage areas. These heterogeneous networks (hetnets) may transfer User Equipments (UEs) from Macro Base Stations (MBS) to Pico Base Stations (PBS), with different transmission powers and coverage. In addition, to achieve high spectral efficiency of the heterogeneous network, the PBS may reuse the MBS and share the same channel with the MBS. Therefore, heterogeneous networks are considered as a good strategy to increase the capacity of future wireless communication systems. There are some optimization problems in such heterogeneous networks, such as spectrum allocation and resource allocation. Recent studies have proposed new methods such as the game theory method, the linear programming method and the markov approximation strategy. However, these methods require almost complete information, which may not be generally available. Thus, it is challenging for the above-described approaches to achieve an optimal solution without such complete information.

Disclosure of Invention

The invention provides a dynamic resource allocation scheme aiming at the problem of downlink resource allocation in a heterogeneous cellular network. In particular, dynamic power allocation and channel allocation strategies are provided for the base station. To improve spectral efficiency, energy efficiency in heterogeneous cellular networks, an optimization framework based on Deep Neural Networks (DNN) is first composed of a series of multiplier alternating direction method (ADMM) iterative processes, making Channel State Information (CSI) the weight of learning. And applying a Deep Reinforcement Learning (DRL) framework to obtain a resource allocation scheme with Spectrum Efficiency (SE) and Energy Efficiency (EE).

A heterogeneous network resource allocation method based on reinforcement learning is characterized in that in a downlink of a heterogeneous network with M base stations and N mobile users, a macro base station MBS has

A micro base station PBS has

And satisfy

；

Setting up

Indicating a base stationmAnd the usernThe association relationship between the two or more of the three,

indicating a base stationmAnd the usernAssociating;

indicating a base stationmAnd the usernIrrelevant;

setting up

Indicating the state of the spectrum when the user is presentnAnd sub-carrierkBase station ofmAssociated, spectral state

The following rules were used to determine:

representing a usernUsing sub-carriersk；

Representing a usernNot using sub-carriersk；

Setting up

Representing a usernAnd a base stationmOn the sub-carrierkA transmission power of; the method comprises the following specific steps:

indicating that the total transmit power of each cell base station should be at a preset power limit

Below;

representing time slots using block fading modelstUser's devicenTo the base stationmThe downlink channel gains of (c) are as follows:

wherein

Representation including path loss and lognormal shadingThe large-scale fading component follows a Jakes fading model; small scale Rayleigh fading component

Expressed as a first order gaussian-markov process:

wherein

Are independent and have uniformly distributed circularly symmetric complex gaussian random variables of unit variance;

wherein

Is a first type of zero order bessel function,

is the maximum doppler frequency;

the inter-cell interference ICI experienced when users in different cells are allocated the same sub-carriers is expressed as follows:

wherein

Indicating a base stationmOn the sub-carrierkUser of upper servicen(ii) experienced inter-cell interference;

is represented in sub-carrierskUpper base stationm'To the usern'The transmit power of (a);

is at the sub-carrierkUpper slave base stationm'To the usernThe square of the channel gain of (d); when in use

From the base stationmOn the sub-carrierkUser of upper servicenThe signal to interference plus noise ratio of (c) is as follows:

wherein

Is from the base stationmTo the usernA power of additive white gaussian noise; when the base stationmTo a usernAnd base stationm' To a usern'Are allocated sub-carriers simultaneouslykWhen the temperature of the water is higher than the set temperature,

will interfere with the base stationmTo a usernAnd is and

；

step S1, deploying a DNN framework for each base station, wherein the DNN framework is based on an ADMM algorithm and takes channel information CSI as heterogeneous network weight; giving an optimal resource allocation strategy in the current state according to the user association information and the average interference power obtained by the base station; in particular, the amount of the solvent to be used,

the spectral efficiency objective optimization function is as follows:

the energy efficiency objective optimization function is as follows:

solving the target optimization function of the frequency spectrum efficiency based on an ADMM algorithm, wherein the augmented Lagrangian function is as follows:

wherein

The values, representing the lagrangian multiplier,

is a penalty parameter; at this time, the spectral efficiency optimization function is expressed as:

by respectively pairing

Finding the deviation

The best solution of (1):

the following can be obtained:

wherein:

；

stands for ADMM algorithmlThe number of sub-iterations is,

,

,

,

,

,

respectively representlIn the sub-iteration

A value of (d);

step S2, regarding each base station as an independent agent, and taking the state of the base station as a modeling environment; a plurality of agent programs observe the same heterogeneous network environment and take action, and meanwhile, the agent programs are communicated with each other through the awards of the environment; the agent adjusts the policy according to the reward; specifically, the method comprises the following steps:

and (4) state set S: by

A state component comprising

(ii) a States observed by the agent to characterize a heterogeneous network environment

Including user association information

And interference power

Then the heterogeneous network state is represented as:

；

action set A: based on the current state, the agent is pi-in based on a decision policy

Taking an action; the actions include selecting a subcarrier

And corresponding transmission power

(ii) a Then the action is represented as

；

Reward: agent computing environment reward after action is taken

(ii) a Defining the energy efficiency function as a reward in the system model:

designing a DNN-based optimization framework, and combining Q learning to generate a strategy pi; wherein the input to the DNN-based optimization framework is the set of observed states S and the output of the DNN-based optimization framework is all executable actions in the set of actions A; each state action pair has a corresponding Q value

(ii) a Each step selects the action that achieves the maximum Q value at each state, as shown below

Updating the Q value according to the Q learning algorithm by the following formula

：

Wherein

And

learning rate and discount factor, respectively;

represents the next oneThe status of the mobile station is,

is shown in a state

The reward obtained after the action is taken is,

indicating a state

The following executable acts may be performed,

is a set of executable actions;

indicating a state

The value of Q in the following (A),

represents the updated Q value;

indicating a state

Set of executable actions under

Maximum Q value of (1); the loss function in each agent can be expressed as:

wherein

A network parameter indicative of the target network,

a network parameter representing an online network; squaring channel gain

And additive Gaussian noise

As network parameters of the l-th layer;

use of

Greedy policy from online network

In the selection action

Target network

Is an online network

But the network parameters are fixed in the iteration; and replacing the network parameters of the target network with the network parameters in the online network after each iteration.

Further, the resource allocation method based on the ADMM algorithm in step S1 specifically includes the following steps:

step S1.1, updating the currently observed state

；

Step S1.2, initializing network parameters

；

Step S1.3, setting threshold value

Maximum number of iterations

Starting iteration; network computing based on DNN

(ii) a When in use

Then output the corresponding

。

Further, the step S2 obtains an optimal resource allocation scheme by using the ADMM network that uses the channel state information as the network weight, and includes the following specific steps:

step S2.1, initializing the reproduction memory

DQN network parameters

And target network replacement step size

；

Step S2.2, initializing the on-line network

And

initializing the target network

And make

；

Step S2.3, setting threshold value

；

Step S2.4, each agent program is used according to the current state information

Greedy policy selection decision

；

Step S2.5, updating environment

Receive a reward

；

Step S2.6, each agent program observes the rewards obtained by all agents

Storing into respective D;

step S2.7, randomly sampling from D, and calculating loss function

And update

Every other, at

Step updating target network parameters

Up to all generationsThe process meets a threshold or reaches a maximum iteration step.

Compared with the prior art, the invention has the following technical advantages:

(1) when the problem of resource allocation in a heterogeneous network is solved, because the traditional convex optimization method is difficult to provide a resource allocation scheme under the condition of incomplete CSI information, the method can provide the resource allocation scheme without all CSI information based on a deep learning network;

(2) when the resource allocation is considered, the spectrum efficiency is considered at the same time, and the method is applied to the deep reinforcement learning based on model driving, and the deep reinforcement learning method driven by the model is not applied to the resource allocation scheme of the heterogeneous network at present. The spectrum efficiency function is set as the reward of the agent, so that the spectrum efficiency can be ensured while the system throughput is ensured.

Drawings

FIG. 1 is a schematic diagram of a dual-layer heterogeneous cellular network provided by the present invention;

fig. 2 is a structural diagram of a DNN optimization framework based on an ADMM algorithm provided by the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings.

The dual-layer heterogeneous cellular network shown in FIG. 1 comprises M base stations and N mobile users, wherein the macro base station MBS has

A micro base station PBS has

And satisfy

. Each cell site is located at the center of each cell and authorizes mobile users to be randomly distributed in the cell. It is assumed that there is an overlapping area between every two adjacent small cells. It is assumed that each communication terminal is equipped with an antenna for signal transmission. In order to utilize the radio resources to the maximum extent and avoidWithout trivial details, the frequency reuse factor is set to 1, and in order to avoid intra-cell interference, it is assumed that each user in each cell is allocated only one subcarrier, so that the cells are orthogonal in the same subcarrier for all signals. The N orthogonal subcarriers used in a cell may be reused in each neighboring cell. However, users in the overlapping area are served by the nearest small cell BS and may suffer from severe inter-cell interference (ICI) due to the fact that they may use the same spectral resources.

Setting up

indicating a base stationmAnd the usernAssociating;

indicating a base stationmAnd the usernIrrelevant;

setting up

The following rules were used to determine:

representing a usernUsing sub-carriersk；

Representing a usernNot using sub-carriersk；

Setting up

For indicatingHouseholdnAnd a base stationmOn the sub-carrierkA transmission power of; the method comprises the following specific steps:

Below;

wherein

Representing large-scale fading components including path loss and lognormal shading, and following a Jakes fading model; small scale Rayleigh fading component

Expressed as a first order gaussian-markov process:

wherein

wherein

Is a first type of zero order bessel function,

is the maximum doppler frequency;

wherein

wherein

will interfere with the base stationmTo a usernAnd is and

；

the embodiment of the invention is divided into two parts, firstly, a DNN frame is deployed for each base station, the DNN frame is based on an ADMM algorithm, and channel information CSI is used as the weight of a heterogeneous network; it is assumed that the long term average interference power received by each UE can be estimated and fed back to the serving base station through a feedback channel. This information exchange requires very limited resources to be obtained with very low frequency compared to the required signal CSI. Giving an optimal resource allocation strategy in the current state according to the user association information and the average interference power obtained by the base station; in particular, the amount of the solvent to be used,

deploying a DNN framework for each base station, wherein the DNN framework is based on an ADMM algorithm and takes channel information CSI as heterogeneous network weight; giving an optimal resource allocation strategy in the current state according to the user association information and the average interference power obtained by the base station; in particular, the amount of the solvent to be used,

the spectral efficiency objective optimization function is as follows:

the energy efficiency objective optimization function is as follows:

wherein

The values, representing the lagrangian multiplier,

by respectively pairing

Finding the deviation

The best solution of (1):

the following can be obtained:

wherein:

；

stands for ADMM algorithmlThe number of sub-iterations is,

,

,

,

,

,

respectively representlIn the sub-iteration

The value of (c).

The DNN-based optimization framework shown in fig. 2 includes neurons corresponding to different operations in the ADMM iteration process, and directed edges corresponding to the data flow between the operations. Thus, the first of the DNN-based optimization frameworkskLayer corresponds to the second of ADMM procedurekAnd (6) iteration. Upon entering the DNN-based optimization framework, the input data flows through multiple layers of repetition, which correspond to successive iterations in the ADMM. DNN-based when convergence conditions are satisfiedThe optimization framework will generate resource allocation results. Specifically, the resource allocation method based on the ADMM algorithm comprises the following specific steps:

step S1.1, updating the currently observed state

；

Step S1.2, initializing network parameters

；

Step S1.3, setting threshold value

Maximum number of iterations

Starting iteration; network computing based on DNN

(ii) a When in use

Then output the corresponding

。

The second part is that each base station is regarded as an independent subject, and the state of the base station is used as a modeling environment; a plurality of agent programs observe the same heterogeneous network environment and take action, and meanwhile, the agent programs are communicated with each other through the awards of the environment; the agent adjusts the policy according to the reward; specifically, the method comprises the following steps:

and (4) state set S: by

A state component comprising

(ii) a Observed by the agent for characterizing heterogeneous networksState of the network environment

Including user association information

And interference power

Then the heterogeneous network state is represented as:

；

Taking an action; the actions include selecting a subcarrier

And corresponding transmission power

(ii) a Then the action is represented as

；

Rewarding: agent computing environment reward after action is taken

(ii) a Defining the energy efficiency function as a reward in the system model:

designing a DNN-based optimization framework, and combining Q learning to generate a strategy pi; wherein the input to the DNN based optimization framework is the set of observed states S, of the DNN based optimization frameworkThe output is all executable actions in action set A; each state action pair has a corresponding Q value

：

Wherein

And

learning rate and discount factor, respectively;

it is shown that in the next state,

is shown in a state

The reward obtained after the action is taken is,

indicating a state

The following executable acts may be performed,

is a set of executable actions;

indicating a state

The value of Q in the following (A),

represents the updated Q value;

indicating a state

Set of executable actions under

Maximum Q value of (1); the loss function in each agent can be expressed as:

wherein

A network parameter indicative of the target network,

a network parameter representing an online network; squaring channel gain

And additive Gaussian noise

As network parameters of the l-th layer;

use of

Greedy policy from online network

In the selection action

Target network

Is an online network

Specifically, the steps of obtaining the optimal resource allocation scheme by using the ADMM network using the channel state information as the network weight are as follows:

step S2.1, initializing the reproduction memory

DQN network parameters

And target network replacement step size

；

Step S2.2, initializing the on-line network

And

initializing the target network

And make

；

Step S2.3, setting threshold value

；

Greedy policy selection decision

；

Step S2.5, updating environment

Receive a reward

；

Step S2.6, each agent program observes the rewards obtained by all agents

Storing into respective D;

step S2.7, randomly sampling from D, and calculating loss function

And update

Every other, at

Step updating target network parameters

Up to thenThere are agents that meet the threshold or reach the maximum iteration step.

Claims

1. A heterogeneous network resource allocation method based on reinforcement learning is characterized in that in a downlink of a heterogeneous network with M base stations and N mobile users, a macro base station MBS has

A micro base station PBS has

And satisfy

；

Setting up

indicating a base stationmAnd the usernAssociating;

indicating a base stationmAnd the usernIrrelevant;

setting up

The following rules were used to determine:

to representUser' snUsing sub-carriersk；

Representing a usernNot using sub-carriersk；

Setting up

Below;

wherein

Expressed as a first order gaussian-markov process:

wherein

wherein

Is a first type of zero order bessel function,

is the maximum doppler frequency;

wherein

From the base stationmOn the sub-carrierWave (wave)kUser of upper servicenThe signal to interference plus noise ratio of (c) is as follows:

wherein

will interfere with the base stationmTo a usernAnd is and

；

the spectral efficiency objective optimization function is as follows:

the energy efficiency objective optimization function is as follows:

wherein

The values, representing the lagrangian multiplier,

is a penalty parameter; at this time, the spectral efficiency objective optimization function is expressed as:

by respectively pairing

Finding the deviation

The best solution of (1);

and (4) state set S: by

A state component comprising

Including user association information

And interference power

Then the heterogeneous network state is represented as:

；

Taking an action; the actions include selecting a subcarrier

And corresponding transmission power

(ii) a Then the action is represented as

；

Rewarding: agent computing environment reward after action is taken

(ii) a Defining the energy efficiency function as a reward in the system model:

designing a DNN-based optimization framework, and combining Q learning to generate a strategy pi; wherein the input to the DNN-based optimization framework is the set of observed states S and the output of the DNN-based optimization framework is all executable actions in the set of actions A; each stateThe action pairs all have corresponding Q values

：

Wherein

And

learning rate and discount factor, respectively;

it is shown that in the next state,

is shown in a state

The reward obtained after the action is taken is,

indicating a state

The following executable acts may be performed,

is a set of executable actions;

indicating a state

The value of Q in the following (A),

represents the updated Q value;

indicating a state

Set of executable actions under

Maximum Q value of (1); the loss function in each agent can be expressed as:

wherein

A network parameter indicative of the target network,

a network parameter representing an online network; squaring channel gain

And additive Gaussian noise

As a network parameter of the l-th layer, wherein

Stands for ADMM algorithmlPerforming secondary iteration;

use of

Greedy policy from online network

In the selection action

Target network

Is an online network

2. The reinforcement learning-based heterogeneous network resource allocation method according to claim 1, wherein the resource allocation method based on the ADMM algorithm in step S1 specifically includes the following steps:

step S1.1, updating the currently observed state

；

Step S1.2, initializing network parameters

；

Step S1.3, setting threshold value

Maximum number of iterations

Starting iteration; network computing based on DNN

(ii) a When in use

Then output the corresponding

。

3. The method according to claim 1, wherein the step S2 obtains an optimal resource allocation scheme using an ADMM network that uses channel state information as a network weight, and includes the following steps:

step S2.1, initializing the reproduction memory