CN116567667A - Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning - Google Patents

Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning

Info

Publication number
CN116567667A
Authority
CN
China
Prior art keywords
base station
network
user equipment
energy efficiency
femto base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310514670.4A
Other languages
Chinese (zh)
Inventor
徐钰龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310514670.4A priority Critical patent/CN116567667A/en
Publication of CN116567667A publication Critical patent/CN116567667A/en
Pending legal-status Critical Current

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 - Supervisory, monitoring or testing arrangements
    • H04W 24/02 - Arrangements for optimising operational condition
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 - Supervisory, monitoring or testing arrangements
    • H04W 24/06 - Testing, supervising or monitoring using simulated traffic
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00 - Local resource management
    • H04W 72/50 - Allocation or scheduling criteria for wireless resources
    • H04W 72/53 - Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning. For communication scenarios in which the real environment changes rapidly, a multi-Actor, single-Critic network architecture is used to allocate transmit power and subcarriers: information is acquired through interaction with the environment, and long-term maximized returns are obtained through continuous deep reinforcement learning. The method resolves the correlation between successive parameter updates of the Actor-Critic neural network and enhances robustness. Meanwhile, each agent executes in a distributed manner to obtain its own action output, which improves training speed and stability and thereby improves system performance.

Description

Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to a heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning.
Background
With the formal commercial deployment of 5G, wireless communication has entered a new stage. According to Ericsson's forecasts, Internet-of-Things devices were expected to reach 29 billion by 2022, and mobile data traffic to grow by about 35% per year through 2024. Growing communication demand places great pressure on current wireless networks and imposes higher requirements on communication technology. The emergence of heterogeneous networks eases this pressure. A heterogeneous network is a network-architecture technique that expands coverage and improves spectrum efficiency and system capacity. To meet wireless communication demand, heterogeneous networking adds multiple types of small base stations on top of conventional cellular coverage to serve specific areas, eliminating blind spots and covering hot spots; the distance between terminal devices and base stations is reduced, so more devices obtain better communication quality when accessing the network. A heterogeneous network can deploy several micro or femto base stations with small coverage within one macro cell to improve spectrum utilization and network coverage. In particular, micro and femto base stations can reuse the same spectrum as the macro base station, improving spectral efficiency. Heterogeneous networks therefore not only increase network capacity but also meet users' growing communication demands in future wireless networks while reducing deployment cost.
However, dense, random deployment of small base stations creates severe interference and higher energy consumption, so a resource-allocation and energy-efficiency optimization framework must be constructed for heterogeneous networks in order to reduce network interference, guarantee users' quality of service (QoS), and improve network energy efficiency. In the real environment, however, users are mostly dynamic, and the state space of the wireless network (location information, channel gain, power, etc.) is huge, so conventional reinforcement learning methods are not applicable. With the traditional Q-learning method, the huge real-world state space leads to an enormous Q-value table; searching and storing it consumes a large amount of time and space, which greatly slows the convergence of the algorithm.
In view of the above, there is a need to provide a new approach in an attempt to solve at least some of the above problems.
Disclosure of Invention
In view of one or more problems in the prior art, the invention provides a heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning, which overcomes the inability of traditional algorithms to handle a large state space, resolves the correlation between successive parameter updates of the Actor-Critic neural network, and enhances robustness.
The technical solution for realizing the purpose of the invention is as follows:
a heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning comprises the following steps:
S1, establishing a heterogeneous network model, initializing the communication environment and setting the simulation environment area, including the base station layout, the number of base stations, the number of user equipments and the number of subcarriers, wherein the user equipments and the base stations are associated based on the maximum signal to interference plus noise ratio (SINR) principle, and the base stations adopt orthogonal frequency division multiple access to allocate resources to their associated user equipments;
S2, determining an optimization target according to the signal-to-noise ratio of the user equipment, the capacity of the network, and the energy efficiency;
S3, introducing a Markov model, and determining the agent, the state space, the action space and the reward function;
S4, constructing an improved deep deterministic policy gradient algorithm (DDPG), wherein the improved DDPG adopts multiple policy networks (Actor networks) and a single value network (Critic network) for training and for outputting the allocated transmit power and subcarriers, wherein the input of an Actor network is the state of the current agent and the output is the subcarrier allocation strategy and the transmit power on the subcarriers, and the input of the Critic network is the action and state of the agents and the output is the loss of the action and the learned weight parameters;
S5, setting the number of training episodes and the number of training steps per episode of the agents, and having each agent continuously interact with the configured environment through the improved DDPG algorithm, thereby optimizing and updating the network parameters and obtaining an optimal resource allocation scheme.
Further, according to the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning, the communication environment comprises a macro base station, N femto base stations and M user equipment, the number of subcarriers is K, the M user equipment and the N femto base stations are covered by the macro base station, wherein the N femto base stations obey Poisson distribution, and the M user equipment are uniformly and randomly distributed.
Further, in the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of the invention, determining the optimization target and constraint conditions in S2 comprises the following steps:
S2-1, determining the interference signal received by a user and calculating the signal-to-noise ratio information of the user equipment;
S2-2, processing the interference noise using a Gaussian approximation and calculating the capacity and energy efficiency of the network;
S2-3, determining the optimization target as: maximize the energy efficiency while the signal-to-noise ratio of the user equipment satisfies the minimum quality-of-service requirement.
Furthermore, in the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of the invention, calculating the signal-to-noise ratio information of the user equipment in S2-1 specifically comprises:
S2-1-1, assuming that each user equipment can select at most one base station at any time, when the i-th user equipment selects and connects to the n-th base station: a_{i,l}(t)=1 when l=n, and a_{i,l}(t)=0 when l≠n, where n={1,…,N}, a_{i,l}(t) represents the connection relationship between base station l and user equipment i at time t, i∈M, l∈N, N is the number of femto base stations, and M is the number of user equipments;
S2-1-2, the signal-to-noise ratio of user equipment i served by the l-th base station on the k-th subcarrier is:

$$\gamma_{i,l}^{k}(t)=\frac{a_{i,l}\,p_{l}^{k}\,g_{i,l}^{k}}{\sum_{l'\neq l}p_{l'}^{k}\,g_{i,l'}^{k}+\sigma^{2}}$$

where k∈K, K is the number of subcarriers, a_{i,l} represents the connection coefficient between base station l and user equipment i, g_{i,l}^{k} and g_{i,l'}^{k} respectively represent the channel gains on the k-th subcarrier between the user and the l-th and l'-th base stations, σ² is the Gaussian white noise power, and p_{l}^{k} and p_{l'}^{k} respectively represent the transmit powers of the l-th and l'-th base stations on the k-th subcarrier.
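For illustration, a minimal Python sketch of this SINR computation is given below. The array layout (assoc, power, gain), the function name and the default noise value are assumptions introduced here for readability and are not part of the original disclosure.

```python
import numpy as np

def sinr(i, l, k, assoc, power, gain, noise=1e-13):
    """SINR of user equipment i served by base station l on subcarrier k.

    assoc[i, l]   -- binary association a_{i,l}
    power[l, k]   -- transmit power p_l^k of base station l on subcarrier k
    gain[i, l, k] -- channel gain g_{i,l}^k between user i and base station l
    noise         -- Gaussian white noise power sigma^2 (linear scale)
    """
    num_bs = power.shape[0]
    signal = assoc[i, l] * power[l, k] * gain[i, l, k]
    interference = sum(power[m, k] * gain[i, m, k] for m in range(num_bs) if m != l)
    return signal / (interference + noise)
```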
Furthermore, in the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of the invention, computing the network capacity and energy efficiency in S2-2 specifically comprises:
S2-2-1, on the k-th subcarrier, the capacity achieved by the macro base station and its associated user equipment is:

$$C_{h,i}^{k}=\log_{2}\!\left(1+\frac{g_{h,i}^{k}\,p_{h}^{k}}{\sum_{n=1}^{N}g_{n,i}^{k}\,p_{n}^{k}+\sigma^{2}}\right)$$

where g_{h,i}^{k} represents the channel gain between macro base station h and user equipment i, p_{h}^{k} represents the transmit power of macro base station h on the k-th subcarrier, g_{n,i}^{k} represents the channel gain between femto base station n and user equipment i, p_{n}^{k} represents the transmit power of femto base station n on the k-th subcarrier, σ² is the Gaussian white noise power, and N is the number of femto base stations;
S2-2-2, on the k-th subcarrier, the capacity achieved by femto base station n and its associated user equipment is:

$$C_{n,i}^{k}=\log_{2}\!\left(1+\frac{g_{n,i}^{k}\,p_{n}^{k}}{\sum_{n'\neq n}g_{n',i}^{k}\,p_{n'}^{k}+\sigma^{2}}\right)$$

where g_{n,i}^{k} represents the channel gain between femto base station n and user equipment i and p_{n}^{k} represents the transmit power of femto base station n on the k-th subcarrier;
S2-2-3, the capacity C_sum of the network in which the macro base station and the femto base stations coexist is:

$$C_{sum}=\sum_{k=1}^{K}C_{h,i}^{k}+\sum_{n=1}^{N}\sum_{k=1}^{K}C_{n,i}^{k}$$

where N is the number of femto base stations;
S2-2-4, the energy efficiency η_EE of the network is:

$$\eta_{EE}=\frac{C_{sum}}{P_{sum}},\qquad P_{sum}=P_{h}+\sum_{n=1}^{N}P_{n}+(N+1)P_{c}$$

where P_sum is the power consumption of all base stations per unit time in the network model, P_n is the transmit power of femto base station n, P_h is the transmit power of the macro base station, and P_c is the circuit power consumption of each macro and femto base station.
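The capacity and energy-efficiency computation of S2-2 can be sketched numerically as follows. This is an illustrative sketch under assumptions: the interference terms follow the formulas reconstructed above (no cross-tier interference from the macro base station to femto users), the circuit-power accounting with N+1 circuits is an assumed reading of P_sum, and all names and array shapes are hypothetical.

```python
import numpy as np

def capacity_and_energy_efficiency(g_h, p_h, g_f, p_f, P_h, P_f, P_c, noise=1e-13):
    """C_sum and eta_EE for one macro base station and N femto base stations.

    g_h[k], p_h[k]       -- macro-to-user gain and macro transmit power on subcarrier k
    g_f[n, k], p_f[n, k] -- femto-n-to-user gain and femto-n transmit power on subcarrier k
    P_h, P_f[n], P_c     -- macro power, femto powers and per-circuit power (Watts)
    """
    N, K = g_f.shape
    # Macro tier: all femto transmissions are treated as interference.
    c_macro = np.log2(1.0 + g_h * p_h / ((g_f * p_f).sum(axis=0) + noise))
    # Femto tier: co-tier interference from the other femto base stations.
    c_femto = np.zeros((N, K))
    for n in range(N):
        co_tier = (g_f * p_f).sum(axis=0) - g_f[n] * p_f[n]
        c_femto[n] = np.log2(1.0 + g_f[n] * p_f[n] / (co_tier + noise))
    c_sum = c_macro.sum() + c_femto.sum()
    p_sum = P_h + P_f.sum() + (N + 1) * P_c   # assumed circuit-power accounting
    return c_sum, c_sum / p_sum
```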
Furthermore, in the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of the invention, the optimization target and constraint conditions in S2-3 specifically comprise:
The optimization target is: argmax η_EE
The constraint conditions include:
(a) $p_{n}^{k,\min}\le p_{n}^{k}\le p_{n}^{k,\max}$
(b) $p_{h}^{k,\min}\le p_{h}^{k}\le p_{h}^{k,\max}$
(c) $\gamma_{i,l}^{k}\ge\gamma_{\min}$
(d) $a_{i,l}(t)\in\{0,1\}$
(e) $P_{c}=C$
where η_EE represents the energy efficiency of the network; p_n^k represents the transmit power of femto base station n on the k-th subcarrier, and p_n^{k,min} and p_n^{k,max} are its minimum and maximum transmit powers; p_h^k represents the transmit power of macro base station h on the k-th subcarrier, and p_h^{k,min} and p_h^{k,max} are its minimum and maximum transmit powers; γ_{i,l}^{k} represents the signal-to-noise ratio of user equipment i served by the l-th base station on the k-th subcarrier; γ_min represents the minimum quality-of-service requirement; a_{i,l}(t) represents the connection relationship between base station l and user equipment i at time t; and P_c, the circuit power consumption of each macro and femto base station, is a constant.
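A hedged helper for checking constraints (a)-(d) is sketched below; the array names and shapes are illustrative only and not part of the original disclosure.

```python
import numpy as np

def constraints_satisfied(p_f, p_h, sinr_values, assoc,
                          p_f_min, p_f_max, p_h_min, p_h_max, gamma_min):
    """Check constraints (a)-(d): femto/macro power bounds on every subcarrier,
    the SINR (QoS) floor for every served user, and single association.

    p_f, p_h    -- femto and macro transmit powers per subcarrier
    sinr_values -- per-user SINR values
    assoc       -- binary association matrix a_{i,l}
    """
    return bool(np.all((p_f >= p_f_min) & (p_f <= p_f_max)) and
                np.all((p_h >= p_h_min) & (p_h <= p_h_max)) and
                np.all(sinr_values >= gamma_min) and
                np.all(assoc.sum(axis=1) <= 1))
```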
Further, in the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of the invention, determining the agent, the state space, the action space and the reward function in S3 specifically comprises:
1) Each femto base station n is taken as an agent, where 1≤n≤N; each agent independently updates its policy, collects information from its own area, explores the network environment, and autonomously selects subcarriers and transmit power;
2) The state space S_{n,k}(t) is defined as: S_{n,k}(t)={M_n(t), P_n(t), I_k(t), G_{n,k}(t), a_{i,l}(t)}, where M_n(t) represents the number of users of the femto base station at time t; P_n(t) represents the power of the femto base station at time t; I_k(t)∈{0,1} represents the interference level with respect to the macro base station on the k-th subcarrier at time t: assuming the minimum capacity requirement of the macro base station according to its quality-of-service performance is α_h, the interference level is I_k(t)=0 when C_{h,i}^{k}(t) ≥ α_h and I_k(t)=1 when C_{h,i}^{k}(t) < α_h; G_{n,k}(t) represents the channel information between femto base station n and its users on the k-th subcarrier at time t; a_{i,l}(t) represents the connection relationship between the base station and the user at time t;
3) The action space A is defined as: A={k_n, p_{n,k}(t)}, where k_n represents the k-th subcarrier of the n-th base station, k∈K; p_{n,k}(t) represents the power value on the k-th subcarrier of the n-th femto base station at time t, which is adjusted autonomously through algorithm learning;
4) Based on the optimization target, the reward function r_n(t) is defined as the energy efficiency obtained by the user, namely:

$$r_{n}(t)=\begin{cases}\eta_{EE}, & \text{if constraints (a), (b) and (c) are satisfied}\\ \beta, & \text{otherwise}\end{cases}$$

where β is a constant less than 0.
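The reward of S3 can then be sketched as a small Python function; the default penalty value beta=-1.0 is an assumption, since the disclosure only states that β is a negative constant.

```python
def agent_reward(eta_ee, feasible, beta=-1.0):
    """Reward r_n(t) of a femto-BS agent: the network energy efficiency eta_EE
    when constraints (a)-(c) hold (`feasible` is the boolean result of a check
    such as constraints_satisfied above), otherwise the penalty beta < 0."""
    return eta_ee if feasible else beta
```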
Further, in the heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of the invention, the improved DDPG algorithm specifically comprises:
Sampling:
The agent interacts with the communication environment and inputs the current state s(t) to the original Actor network μ(·|θ^μ); the original Actor network μ(·|θ^μ) selects action a(t) according to the policy μ: a(t)=μ(s(t)|θ^μ)+N_0, where N_0 is noise;
After the agent executes action a(t), the environment reward r(t) is obtained and the environment enters the next state s(t+1); the experience sample {s(t), a(t), r(t), s(t+1)} is obtained and stored in the experience pool D until the stored amount reaches the threshold of the experience pool D;
Training phase: N experience samples are randomly drawn from the experience pool D as training data for the original Actor network and the original Critic network; one training sample is denoted {s'(t), a'(t), r'(t), s'(t+1)};
The loss function Loss of the original Critic network is computed and minimized by the gradient method, and the Critic network parameters θ^Q are updated by back-propagation with the Adam optimizer, where the loss function Loss is:

$$Loss=\frac{1}{N}\sum_{i}\left(y_{i}-Q(s_{i},a_{i}|\theta^{Q})\right)^{2}$$

where y_i = r_i + γQ'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'}) and γ is the discount factor; the agent objective function is defined as J(θ^μ)=E[Q^μ(s,μ(s))], the objective function is maximized, and the Actor network parameters θ^μ are updated with the Adam optimizer;
The old target network parameters and the new corresponding online network parameters are weighted and averaged to softly update the target Actor network and the target Critic network:

$$\theta^{Q'}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q'},\qquad \theta^{\mu'}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu'}$$

where τ is the soft-update coefficient.
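A minimal PyTorch sketch of this soft (EMA) target update is given below, assuming standard nn.Module networks; τ=0.001 follows the later embodiment, and the helper name is an assumption.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """Soft update of the target Actor/Critic parameters:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```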
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning provided by the invention is designed for communication scenarios in which the real environment changes rapidly: information is acquired through interaction with the environment, and long-term maximized returns are obtained through continuous deep reinforcement learning, which gives the method practical significance.
2. The method improves the traditional deep deterministic policy gradient algorithm and uses a multi-Actor, single-Critic network architecture to allocate transmit power and subcarriers. During training, each Actor network makes local observations and generates actions according to its learned policy, while the Critic network gives feedback on the Actors' policies from global information. After model training is finished, each agent executes in a distributed manner to obtain its own action output, which improves training speed and stability and thereby improves system performance.
3. In a heterogeneous wireless network, the method uses the deep reinforcement learning algorithm to select subcarriers and transmit power adaptively while guaranteeing users' quality of service, improving the throughput of the whole system, increasing energy efficiency, and reducing power consumption.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and together with the description serve to explain the embodiments of the invention, and do not constitute a limitation of the invention. In the drawings:
fig. 1 shows a schematic diagram of a communication system model of a heterogeneous wireless network.
Fig. 2 shows the Actor network and Critic network architecture of the improved DDPG algorithm of the present invention.
Fig. 3 shows an overall flow chart of the improved DDPG algorithm of the present invention.
Fig. 4 shows a flow chart of a training model completion resource allocation scheme of the present invention.
Fig. 5 shows an example training flowchart of the improved DDPG algorithm of the present invention.
Detailed Description
For a further understanding of the present invention, preferred embodiments of the invention are described below in conjunction with the examples, but it should be understood that these descriptions are merely intended to illustrate further features and advantages of the invention, and are not limiting of the claims of the invention.
The description of this section is intended to be illustrative of only exemplary embodiments and is not intended to be limiting of the scope of the embodiments described herein. Combinations of the different embodiments, and alternatives of features from the same or similar prior art means and embodiments are also within the scope of the description and protection of the invention.
Heterogeneous networking deploys more small base stations of multiple types, which shortens the distance between terminal devices and base stations and effectively increases system capacity, thereby meeting wireless communication demand. In addition, to address the problems of interference, energy-efficiency improvement and resource allocation, to handle a large state space, and to make wireless communication more efficient, the invention provides a heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning. In the context of a cellular heterogeneous network, the method uses deep reinforcement learning to select subcarriers and transmit power autonomously while guaranteeing users' quality of service, improving overall system throughput, increasing energy efficiency, and reducing power consumption.
The primary scenario considered is the downlink of a heterogeneous wireless network, which consists of one macro base station and multiple femto base stations. The macro base station is located at the center of the geographical area, and the user equipments and the femto base stations are covered by the macro base station; the femto base stations obey a Poisson distribution and the users are uniformly and randomly distributed in the area, as shown in fig. 1. In particular, each femto base station can connect at most 8 users in order to meet the users' minimum communication requirements; users exceeding this upper limit are associated to the macro base station. The base stations allocate resources to their associated users using Orthogonal Frequency Division Multiple Access (OFDMA), and it is assumed that each base station can use all available resources. Each user may be assigned multiple subcarriers, and each subcarrier can serve at most one user in a time slot. The received signal of a user equipment in the downlink includes interference from base stations and thermal noise. The channel is assumed to follow a Rayleigh fading model. The invention adopts a deep reinforcement learning method to select the subcarriers and transmit power autonomously and maximizes the energy efficiency of the system while satisfying the users' QoS.
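As an illustrative sketch of this scenario setup (not part of the original disclosure), Rayleigh-fading channel gains and the max-SINR association rule could be simulated as follows; path loss and shadowing are omitted, and averaging the SINR over subcarriers is a simplification of the association rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def rayleigh_gain(n_users, n_bs, n_subcarriers):
    """Rayleigh-fading channel gains |h|^2 with unit average power; path loss
    and shadowing would be multiplied in separately in a full model."""
    h = (rng.standard_normal((n_users, n_bs, n_subcarriers)) +
         1j * rng.standard_normal((n_users, n_bs, n_subcarriers))) / np.sqrt(2)
    return np.abs(h) ** 2

def associate_by_max_sinr(gain, power, noise=1e-13):
    """Associate each user with the base station giving the largest SINR.

    gain[u, b, k]  -- channel gain between user u and base station b on subcarrier k
    power[b, k]    -- transmit power of base station b on subcarrier k
    """
    rx = gain * power[None, :, :]             # received power per candidate BS
    total = rx.sum(axis=1, keepdims=True)     # total received power at the user
    sinr = rx / (total - rx + noise)          # other base stations are interference
    return sinr.mean(axis=2).argmax(axis=1)   # best BS index per user
```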
The scheme adopted by the invention is shown in fig. 4, and comprises the following steps:
A network model is built, the communication environment is initialized, and the base station layout, the number of base stations and the number of user equipments are set.
An optimization target and constraint conditions are determined according to the signal-to-noise ratio of the user equipment, the capacity and energy efficiency of the network, and the guarantee of communication quality of service.
Because the state space describing the problem is huge, the optimization problem is converted into a Markov decision process using a deep reinforcement learning method; the Agent, State space, Action space, Reward function, objective function and loss function are determined, and an improved DDPG algorithm is constructed to solve the problem.
By means of the improved DDPG algorithm, the agents continuously interact with the configured environment so that the parameters are updated to optimize the network; each agent obtains an optimal policy, achieving autonomous optimal resource allocation.
Example 1
The communication environment in this example mainly includes one Macro Base Station (MBS) located at the center of a geographical area, with M User Equipments (UEs) and N Femto Base Stations (FBSs) covered by the macro base station; the femto base stations obey a Poisson distribution and the users are uniformly and randomly distributed in the area. The communication system model is shown in fig. 1. To make the objects and advantages of the invention clearer, the specific technical scheme is further described below.
Step 1: initializing the number of cellular users as M, the number of subcarriers as K, and randomly moving the users in the area at the moving speed V t m And movement angle->If users leave the area, they will reappear from the other end, the system will randomly assign them with a probability of 0.1, the association between users and base station being based on maximum signal to interference plus noise ratio (Signal to Interference plus Noise Ratio, SINR) principle.
Step 2.1: Determine the interference signal received by the user and calculate the user's signal-to-noise ratio information.
In this communication environment, since the Femto Base Stations (FBSs) are deployed within the coverage of the Macro Base Station (MBS), the existing interference derives from the interference generated by the femto base stations to the user equipments, as shown in fig. 1.
A binary variable a_{i,l}(t), i∈M, l∈N, is defined to represent the connection relationship between base station and user at time t. When the i-th user equipment selects and connects to the n-th base station, a_{i,l}(t)=1 for l=n and a_{i,l}(t)=0 for l≠n, where n={1,…,N}. It is assumed that each user equipment can select at most one base station at any time.
The signal-to-noise ratio (SINR) of the user served by the l-th base station on the k-th (k∈K) subcarrier is expressed as

$$\gamma_{i,l}^{k}(t)=\frac{a_{i,l}\,p_{l}^{k}\,g_{i,l}^{k}}{\sum_{l'\neq l}p_{l'}^{k}\,g_{i,l'}^{k}+\sigma^{2}}$$

All user equipments wish to satisfy a minimum quality of service (QoS) requirement γ_min while obtaining the maximum transmission capacity from their selected base stations; the SINR of a user equipment should therefore be not less than γ_min. Here a_{i,l} is the connection coefficient between base station and user, g_{i,l}^{k} and g_{i,l'}^{k} respectively represent the channel gains on the k-th subcarrier between the user and the l-th and l'-th base stations, σ² is the Gaussian white noise power, and p_{l}^{k} and p_{l'}^{k} respectively represent the transmit powers of the l-th and l'-th base stations on the k-th subcarrier.
Through the capacity and energy-efficiency formulas the performance of the system can be evaluated, where the interference is handled using a Gaussian approximation. On the k-th subcarrier, the capacity achieved by the macro base station and its associated users can be expressed as:

$$C_{h,i}^{k}=\log_{2}\!\left(1+\frac{g_{h,i}^{k}\,p_{h}^{k}}{\sum_{n=1}^{N}g_{n,i}^{k}\,p_{n}^{k}+\sigma^{2}}\right)$$

where g_{h,i}^{k} is the channel gain between the macro base station and the user on the k-th subcarrier, p_{h}^{k} is the transmit power of the macro base station on the k-th subcarrier, g_{n,i}^{k} is the channel gain between femto base station n and the user on the k-th subcarrier, p_{n}^{k} is the transmit power of femto base station n on the k-th subcarrier, and σ² is the Gaussian white noise power.
On the k-th subcarrier, the capacity achieved by a femto base station and its associated users can be expressed as:

$$C_{n,i}^{k}=\log_{2}\!\left(1+\frac{g_{n,i}^{k}\,p_{n}^{k}}{\sum_{n'\neq n}g_{n',i}^{k}\,p_{n'}^{k}+\sigma^{2}}\right)$$

where g_{n,i}^{k} is the channel gain between femto base station n and the user on the k-th subcarrier, p_{n}^{k} is the transmit power of femto base station n on the k-th subcarrier, and σ² is the Gaussian white noise power.
Thus, the capacity of the network in which the macro base station and the femto base stations coexist can be expressed as:

$$C_{sum}=\sum_{k=1}^{K}C_{h,i}^{k}+\sum_{n=1}^{N}\sum_{k=1}^{K}C_{n,i}^{k}$$
energy efficiency can generally be defined as the ratio of the total throughput per unit time to the total power consumption. In the present invention, the energy efficiency is expressed as:
wherein ,Psum Expressed as power consumption of all base stations per unit time in the system model, can be expressed as:
wherein ,Pc Power consumption of each circuit of the macro base station and the femto base station.
Step 2.2: an optimization objective is determined.
The invention aims to ensure that SINR of a user is larger than gamma min Maximizing energy efficiency eta at the same time EE
The optimization objective can be described as: argmax eta EE
Limiting conditionsThe following are provided: (a)
(b)
(c)
(d)a i,l (t)∈{0,1}
(e)P c =C
(a) And (b) means that the base station ensures that the transmit power allocated to the user satisfies the minimum received power. (c) represents SINR requirements of the user. (d) means that each user is associated with at most one base station. (e) The power consumption of each circuit of the macro base station and the femto base station is shown as a constant.
Step 3: and constructing a reinforcement learning model, introducing a Markov model, and determining an agent, a state space, an action space and a reward function. And training by using a deep reinforcement learning algorithm, and distributing an optimal strategy for each agent.
Agent (Agent): taking femto base station N (1.ltoreq.n.ltoreq.n) as agents, each agent independently updates its policy, each agent can collect information from its own area and explore the network environment, each agent can select subcarriers and transmit power autonomously.
State space (State): defined as S n,k (t)={M n (t),P n (t),I k (t),G n,k (t),a i,l (t)}。
wherein ,Mn (t) represents the number of users of the femto base station at time t; p (P) n (t) represents the power of the femto base station at time t; i k (t) ∈ {0,1} represents the interference level from the macro base station on the kth subcarrier at time t, assuming that the minimum capacity requirement of the macro base station according to the quality of service performance is α h When (when)Time interference class I k (t) =0, when>Time interference class I k (t)=1;G n,k (t) channel information indicating the kth subcarrier femto base station n and users at time t, a i,l And (t) represents the connection relation between the base station and the user at the time t.
Action space (Action): the agent has K (K e K) subcarriers available for selection. Defining actions as transmit power and subcarriers selected by the agent, a= { k n ,p n,k (t) }. Wherein k is n A kth subcarrier representing an nth base station; p is p n,k (t) represents a power value on a kth subcarrier of an nth femto base station at time t, which value is to be automatically adjusted later by algorithm learning.
Bonus function (Reward): when the agent performs an action and conditions (a), (b), and (c) are restricted, a prize value is obtained. The reward function is defined in terms of data rate and energy efficiency targets. According to the objective of optimization, we define rewards as the energy efficiency of the user:
where β is a constant less than 0, each agent will adjust toward maximizing rewards as the learning process is further trained.
Step 4: and a deep reinforcement learning algorithm is adopted to enable the intelligent body to learn autonomously. Reinforcement learning algorithms include policy-based methods such as: policy Gradient (PG) and Actor-criter (AC) algorithms; value-based methods such as: Q-Learning, DQN: the traditional algorithm is simple and easy to realize, but in practical application, the problem of large state space cannot be solved, the convergence speed of the algorithm is greatly reduced, and even the situation of unstable training can be encountered. Therefore, the invention uses the DDPG algorithm of the improved version to solve the defects of the algorithm, uses the convolutional neural network to simulate the strategy function and the Q function, uses the deep learning method to train, expands the DQN to continuous action space or high-dimensional discrete value, absorbs the experience recovery mode in the DQN, improves the fixed-target method in the DQN to make the learning process more stable, uses soft update (soft target update), adds random noise to interact with the environment, and increases the robustness of the system. In addition, the invention improves the system, and uses a multi-Actor and single-Critic architecture to distribute transmission power and sub-carriers, wherein the Actor needs to be locally observed in a training stage, and generates actions according to the learned strategies, critic simultaneously feeds back the strategies of the Actor according to global information, and after model training is finished, each intelligent agent performs distributed execution to obtain own action output, so that training speed and stability are improved, and system performance is improved.
The DDPG algorithm is based on the Actor-Critic framework and comprises four networks: the original Actor network μ(·|θ^μ) and the original Critic network Q(·|θ^Q), each with a corresponding target network, namely the target Actor network μ'(·|θ^{μ'}) and the target Critic network Q'(·|θ^{Q'}). The Actor observes the network state s(t) at time t and takes action a(t); after executing the action, the agent transitions to the next state s(t+1) and receives an immediate reward r(t). The Critic is used to evaluate the quality of the action generated by the Actor. The improved DDPG algorithm employed by the invention uses a multi-Actor, single-Critic network architecture, as shown in FIG. 2. The learning of an agent can be divided into a sampling phase and a training (learning) phase.
In the sampling phase, each agent continuously interacts with the environment: the current state s(t) is input to the original Actor network, which selects action a(t) according to the policy μ, i.e., a random process formed by the current original Actor's policy plus random noise N_0; the value of action a(t) is sampled from this process, the action is executed, the environment returns the reward r(t), and the environment enters the next state s(t+1). The Critic is used to evaluate the quality of the action. The DDPG algorithm adopts the experience replay mechanism of DQN and stores the experience transitions {s(t), a(t), r(t), s(t+1)} obtained from environment interaction into the experience pool D. Note that because the experience samples generated while the Actor interacts with the environment are highly correlated in time, using these data sequences directly for training would cause the neural network to overfit and converge poorly. The Actor in the DDPG algorithm therefore stores experience samples into the experience pool until the stored amount exceeds the set threshold, and N experience samples are then randomly drawn from the experience pool D, so that the sampled data can be regarded as uncorrelated and overfitting does not occur.
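A minimal sketch of the experience pool D is given below; the capacity (5000) and batch size (32) follow the embodiment in Example 2, while the class interface itself is an assumption made for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool D used by the DDPG agents; uniform random sampling
    breaks the temporal correlation of consecutive transitions."""
    def __init__(self, capacity=5000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```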
In the training (learning) phase, the gradient of the original Critic network is computed using the mean square error (MSE); the loss function Loss of the original Critic network is defined as:

$$Loss=\frac{1}{N}\sum_{i}\left(y_{i}-Q(s_{i},a_{i}|\theta^{Q})\right)^{2}$$

where y_i can be regarded as a "label":

$$y_{i}=r_{i}+\gamma Q'(s_{i+1},\mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$$ (equation 8)

The computation of y_i uses the target policy network μ' and the target Q network Q', which makes the learning of the Q-network parameters more stable and easier to converge; γ is the discount factor. The loss function is minimized by the gradient method, and the Critic network parameters θ^Q are updated by back-propagation.
The objective function of the agent is defined as J(θ^μ)=E[Q^μ(s,μ(s))].
The Actor network updates θ^μ by maximizing the cumulative expected return.
The target networks are updated by soft update, also known as exponential moving average (EMA): a learning rate τ is introduced, the old target network parameters and the new corresponding online network parameters are weighted and averaged, and the result is assigned to the target networks. The main algorithm flow is shown in fig. 3.
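For illustration, one training step on a sampled mini-batch could look like the following PyTorch sketch. The Actor/Critic modules and optimizers are placeholders (not the original implementation), γ=0.9 follows the embodiment, and a soft update of the two target networks, as in the earlier sketch, would follow each step.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.9):
    """One mini-batch update: the Critic minimises the MSE to the target y_i,
    the Actor maximises Q(s, mu(s)). `batch` holds tensors (s, a, r, s_next)."""
    s, a, r, s_next = batch
    with torch.no_grad():
        # Target label y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    critic_loss = F.mse_loss(critic(s, a), y)     # Critic MSE loss
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()      # maximise E[Q(s, mu(s))]
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # A soft update of target_actor / target_critic would be applied here.
    return critic_loss.item(), actor_loss.item()
```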
The specific improved DDPG algorithm flow is shown in the following table.
Example 2
Step one: initializing the communication environment is modeled as a heterogeneous network architecture comprising one macro base station and 5 femto base stations. The simulated environment area is a 600m multiplied by 600m rectangular area, the macro base station is positioned at the central position of the area, the coverage radius of the macro base station is 300m, and the coverage radius of the femto base station is 30m.60 users move at 36km/h and in random directions, and if a user leaves the area, it will reappear from the other end, randomly reassigning the users with a probability of 0.1. The association between the user and the base station is based on the maximum SINR principle.
Step two: the maximum transmission power of the macro base station is set to 46dBm, the maximum transmission power of the femto base station is set to 30dBm, and the minimum transmission power of the femto base station is set to 20dBm. The number of subcarriers is 64, the lowest SINR of the user is-6 dB, the discount factor tau is 0.001, the discount factor gamma is 0.9, the path loss from the cellular user to the macro base station is 34+40lg (d [ km ]]) The bandwidth is 10MHz, and the Gaussian white noise sigma 2 Noise power density N = -114dBm 0 The Dropput rate was 0.8 at-174 dBm/Hz.
Step three: initializing network parameters, wherein the Actor network model of the improved DDPG algorithm consists of four layers of one-dimensional convolution and two layers of full-connection layers, inputting the state of the current intelligent agent, and outputting the state including a subcarrier allocation strategy and the transmitting power on subcarriers. The Critic network model consists of four two-dimensional convolution layers and two full-connection layers, wherein the input is the action and state of the intelligent agent, and the output is the loss of the action and the learned weight parameters. The experience pool capacity is set to 5000 and the batch size for updating is set to 32.
Step four: setting the training round number of the intelligent agent, namely Episode=10000, setting the training Step number of each round, namely step=100 steps, optimizing parameters of the neural network by using an Adam optimizer every 50 steps, recording obtained rewards according to the training process of the intelligent agent, and continuously optimizing the strategy of the intelligent agent according to the proposed algorithm to finally obtain the optimal resource allocation scheme. And finally, applying the trained model to an actual scene, namely, enabling a user to autonomously select an optimal resource allocation scheme, and improving energy efficiency. The main training flow is shown in fig. 5.
The description and applications of the present invention herein are illustrative and are not intended to limit the scope of the invention to the embodiments described above. The relevant descriptions of effects, advantages and the like in the description may not be presented in practical experimental examples due to uncertainty of specific condition parameters or influence of other factors, and the relevant descriptions of effects, advantages and the like are not used for limiting the scope of the invention. Variations and modifications of the embodiments disclosed herein are possible, and alternatives and equivalents of the various components of the embodiments are known to those of ordinary skill in the art. It will be clear to those skilled in the art that the present invention may be embodied in other forms, structures, arrangements, proportions, and with other assemblies, materials, and components, without departing from the spirit or essential characteristics thereof. Other variations and modifications of the embodiments disclosed herein may be made without departing from the scope and spirit of the invention.

Claims (8)

1. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning is characterized by comprising the following steps of:
S1, establishing a heterogeneous network model, initializing the communication environment and setting the simulation environment area, including the base station layout, the number of base stations, the number of user equipments and the number of subcarriers, wherein the user equipments and the base stations are associated based on the maximum signal to interference plus noise ratio (SINR) principle, and the base stations adopt orthogonal frequency division multiple access to allocate resources to the associated user equipments;
S2, determining an optimization target according to the signal-to-noise ratio of the user equipment, the network capacity and the energy efficiency η_EE;
S3, introducing a Markov model, and determining the agent, the state space, the action space and the reward function;
S4, constructing an improved deep deterministic policy gradient algorithm (DDPG), wherein the improved DDPG adopts multiple policy networks (Actor networks) and a single value network (Critic network) for training and for outputting the allocated transmit power and subcarriers, wherein the input of an Actor network is the state of the current agent and the output is the subcarrier allocation strategy and the transmit power on the subcarriers, and the input of the Critic network is the action and state of the agents and the output is the loss of the action and the learned weight parameters;
S5, setting the number of training episodes and the number of training steps per episode of the agents, and having each agent continuously interact with the configured environment through the improved DDPG algorithm, thereby optimizing and updating the network parameters and obtaining an optimal heterogeneous network resource allocation scheme.
2. The heterogeneous network resource efficiency optimization method based on deep reinforcement learning according to claim 1, wherein the communication environment comprises a macro base station, N femto base stations and M user equipments, the number of subcarriers is K, the M user equipments and the N femto base stations are covered by the macro base station, wherein the N femto base stations obey poisson distribution, and the M user equipments are uniformly and randomly distributed.
3. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning of claim 1, wherein S2 determining the optimization objective and constraint conditions comprises:
S2-1, determining the interference signal received by the user equipment and calculating the signal-to-noise ratio information of the user equipment;
S2-2, processing the interference noise using a Gaussian approximation and calculating the capacity and energy efficiency of the network;
S2-3, determining the optimization target as: maximize the energy efficiency while the signal-to-noise ratio of the user equipment satisfies the minimum quality-of-service requirement.
4. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning according to claim 1, wherein the calculating of the signal-to-noise ratio information of the user in S2-1 specifically includes:
S2-1-1, assuming that each user equipment can select at most one base station at any time, when the i-th user equipment selects and connects to the n-th base station: a_{i,l}(t)=1 when l=n, and a_{i,l}(t)=0 when l≠n, where n={1,…,N}, a_{i,l}(t) represents the connection relationship between base station l and user equipment i at time t, i∈M, l∈N, N is the number of femto base stations, and M is the number of user equipments;
S2-1-2, the signal-to-noise ratio of user equipment i served by the l-th base station on the k-th subcarrier is:

$$\gamma_{i,l}^{k}(t)=\frac{a_{i,l}\,p_{l}^{k}\,g_{i,l}^{k}}{\sum_{l'\neq l}p_{l'}^{k}\,g_{i,l'}^{k}+\sigma^{2}}$$

where k∈K, K is the number of subcarriers, a_{i,l} represents the connection coefficient between base station l and user equipment i, g_{i,l}^{k} and g_{i,l'}^{k} respectively represent the channel gains on the k-th subcarrier between the user and the l-th and l'-th base stations, σ² is the Gaussian white noise power, and p_{l}^{k} and p_{l'}^{k} respectively represent the transmit powers of the l-th and l'-th base stations on the k-th subcarrier.
5. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning according to claim 1, wherein the computing network capacity and energy efficiency in S2-2 specifically comprises:
S2-2-1, on the k-th subcarrier, the capacity achieved by the macro base station and its associated user equipment is:

$$C_{h,i}^{k}=\log_{2}\!\left(1+\frac{g_{h,i}^{k}\,p_{h}^{k}}{\sum_{n=1}^{N}g_{n,i}^{k}\,p_{n}^{k}+\sigma^{2}}\right)$$

where g_{h,i}^{k} represents the channel gain between macro base station h and user equipment i, p_{h}^{k} represents the transmit power of macro base station h on the k-th subcarrier, g_{n,i}^{k} represents the channel gain between femto base station n and user equipment i, p_{n}^{k} represents the transmit power of femto base station n on the k-th subcarrier, σ² is the Gaussian white noise power, and N is the number of femto base stations;
S2-2-2, on the k-th subcarrier, the capacity achieved by femto base station n and its associated user equipment is:

$$C_{n,i}^{k}=\log_{2}\!\left(1+\frac{g_{n,i}^{k}\,p_{n}^{k}}{\sum_{n'\neq n}g_{n',i}^{k}\,p_{n'}^{k}+\sigma^{2}}\right)$$

where g_{n,i}^{k} represents the channel gain between femto base station n and user equipment i and p_{n}^{k} represents the transmit power of femto base station n on the k-th subcarrier;
S2-2-3, the capacity C_sum of the network in which the macro base station and the femto base stations coexist is:

$$C_{sum}=\sum_{k=1}^{K}C_{h,i}^{k}+\sum_{n=1}^{N}\sum_{k=1}^{K}C_{n,i}^{k}$$

where N is the number of femto base stations;
S2-2-4, the energy efficiency η_EE of the network is:

$$\eta_{EE}=\frac{C_{sum}}{P_{sum}},\qquad P_{sum}=P_{h}+\sum_{n=1}^{N}P_{n}+(N+1)P_{c}$$

where P_sum is the power consumption of all base stations per unit time in the network model, P_n is the transmit power of femto base station n, P_h is the transmit power of the macro base station, and P_c is the circuit power consumption of each macro and femto base station.
6. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning as set forth in claim 1, wherein the optimization objective and constraint conditions in S2-3 specifically include:
the optimization targets are as follows: argmax eta EE
The constraint conditions include:
(d)a i,l (t)∈{0,1}
(e)P c =C
wherein ,ηEE Representing the energy efficiency of the network;representing the transmit power of femto base station n on the kth subcarrier,respectively the minimum transmitting power and the maximum transmitting power of the femto base station n on the kth subcarrier;indicating the transmit power of macro base station h on the kth subcarrier,/for>Respectively the minimum transmitting power and the maximum transmitting power of the macro base station h on the kth subcarrier; />Representing the signal-to-noise ratio, gamma, of user equipment i served by the first base station on the kth subcarrier min Representing a minimum quality of service requirement; a, a i,l (t) represents the connection relationship between the base station l and the user equipment i at the time t; p (P) c The power consumption of each circuit of the macro base station and the femto base station is constant.
7. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning according to claim 1, wherein determining the agent, the state space, the action space and the rewarding function in S3 specifically comprises:
1) Each femto base station n is taken as an agent, where 1≤n≤N; each agent independently updates its policy, collects information from its own area, explores the network environment, and autonomously selects subcarriers and transmit power;
2) The state space S_{n,k}(t) is defined as: S_{n,k}(t)={M_n(t), P_n(t), I_k(t), G_{n,k}(t), a_{i,l}(t)}, where M_n(t) represents the number of users of the femto base station at time t; P_n(t) represents the power of the femto base station at time t; I_k(t)∈{0,1} represents the interference level with respect to the macro base station on the k-th subcarrier at time t: assuming the minimum capacity requirement of the macro base station according to its quality-of-service performance is α_h, the interference level is I_k(t)=0 when C_{h,i}^{k}(t) ≥ α_h and I_k(t)=1 when C_{h,i}^{k}(t) < α_h; G_{n,k}(t) represents the channel information between femto base station n and its users on the k-th subcarrier at time t; a_{i,l}(t) represents the connection relationship between the base station and the user at time t;
3) The action space A is defined as: A={k_n, p_{n,k}(t)}, where k_n represents the k-th subcarrier of the n-th base station, k∈K; p_{n,k}(t) represents the power value on the k-th subcarrier of the n-th femto base station at time t, which is adjusted autonomously through algorithm learning;
4) Based on the optimization target, the reward function r_n(t) is defined as the energy efficiency obtained by the user, namely:

$$r_{n}(t)=\begin{cases}\eta_{EE}, & \text{if constraints (a), (b) and (c) are satisfied}\\ \beta, & \text{otherwise}\end{cases}$$

where β is a constant less than 0.
8. The heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning according to claim 1, wherein the improved DDPG algorithm specifically comprises:
Sampling:
The agent interacts with the communication environment and inputs the current state s(t) to the original Actor network μ(·|θ^μ); the original Actor network μ(·|θ^μ) selects action a(t) according to the policy μ: a(t)=μ(s(t)|θ^μ)+N_0, where N_0 is noise;
After the agent executes action a(t), the environment reward r(t) is obtained and the environment enters the next state s(t+1); the experience sample {s(t), a(t), r(t), s(t+1)} is obtained and stored into the experience pool D until the stored amount reaches the threshold of the experience pool D;
Training phase:
N experience samples are randomly drawn from the experience pool D as training data for the original Actor network and the original Critic network Q(·|θ^Q); one training sample is denoted {s'(t), a'(t), r'(t), s'(t+1)};
The loss function Loss of the original Critic network Q(·|θ^Q) is computed and minimized by the gradient method, and the Critic network parameters θ^Q are updated by back-propagation with the Adam optimizer, where the loss function Loss is:

$$Loss=\frac{1}{N}\sum_{i}\left(y_{i}-Q(s_{i},a_{i}|\theta^{Q})\right)^{2}$$

where y_i = r_i + γQ'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'}) and γ is the discount factor;
The agent objective function is defined as J(θ^μ)=E[Q^μ(s,μ(s))]; the objective function is maximized and the Actor network parameters θ^μ are updated with the Adam optimizer;
The old target network parameters and the new corresponding online network parameters are weighted and averaged to softly update the target Actor network and the target Critic network:

$$\theta^{Q'}\leftarrow\tau\theta^{Q}+(1-\tau)\theta^{Q'},\qquad \theta^{\mu'}\leftarrow\tau\theta^{\mu}+(1-\tau)\theta^{\mu'}$$

where τ is the soft-update coefficient.
CN202310514670.4A 2023-05-09 2023-05-09 Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning Pending CN116567667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310514670.4A CN116567667A (en) 2023-05-09 2023-05-09 Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310514670.4A CN116567667A (en) 2023-05-09 2023-05-09 Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116567667A true CN116567667A (en) 2023-08-08

Family

ID=87495943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310514670.4A Pending CN116567667A (en) 2023-05-09 2023-05-09 Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116567667A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117749625A (en) * 2023-12-27 2024-03-22 融鼎岳(北京)科技有限公司 Network performance optimization system and method based on deep Q network

Similar Documents

Publication Publication Date Title
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109474980B (en) Wireless network resource allocation method based on deep reinforcement learning
CN109947545B (en) Task unloading and migration decision method based on user mobility
Wilhelmi et al. Collaborative spatial reuse in wireless networks via selfish multi-armed bandits
CN111683381B (en) End-to-end network slice resource allocation method based on deep reinforcement learning
CN106604401B (en) Resource allocation method in heterogeneous network
CN108322938B (en) Power distribution method based on double-layer non-cooperative game theory under ultra-dense networking and modeling method thereof
Wang et al. Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC
CN112020103A (en) Content cache deployment method in mobile edge cloud
CN113038616B (en) Frequency spectrum resource management and allocation method based on federal learning
CN104378772B (en) Towards the small base station deployment method of the amorphous covering of cell in a kind of cellular network
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
CN106454920A (en) Resource allocation optimization algorithm based on time delay guarantee in LTE (Long Term Evolution) and D2D (Device-to-Device) hybrid network
CN104770004B (en) A kind of communication system and method
CN107105455A (en) It is a kind of that load-balancing method is accessed based on the user perceived from backhaul
CN112887999B (en) Intelligent access control and resource allocation method based on distributed A-C
CN105490794B (en) The packet-based resource allocation methods of the Femto cell OFDMA double-layer network
CN116567667A (en) Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning
Gao et al. Multi-armed bandits scheme for tasks offloading in MEC-enabled maritime communication networks
Yin et al. Decentralized federated reinforcement learning for user-centric dynamic TFDD control
Gao et al. Reinforcement learning based resource allocation in cache-enabled small cell networks with mobile users
Iturria-Rivera et al. Cooperate or not Cooperate: Transfer Learning with Multi-Armed Bandit for Spatial Reuse in Wi-Fi
Liu et al. Energy-saving predictive video streaming with deep reinforcement learning
Lotfi et al. Attention-based open ran slice management using deep reinforcement learning
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination