CN116132997A - Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm - Google Patents

Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm

Info

Publication number
CN116132997A
CN116132997A (application CN202310082022.6A)
Authority
CN
China
Prior art keywords
base station
heterogeneous network
state
algorithm
energy efficiency
Prior art date
Legal status
Pending
Application number
CN202310082022.6A
Other languages
Chinese (zh)
Inventor
李君�
刘子怡
刘兴鑫
李晨
Current Assignee
Wuxi University
Original Assignee
Wuxi University
Priority date
Filing date
Publication date
Application filed by Wuxi University filed Critical Wuxi University
Priority to CN202310082022.6A priority Critical patent/CN116132997A/en
Publication of CN116132997A publication Critical patent/CN116132997A/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W16/00: Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/18: Network planning tools
    • H04W16/22: Traffic simulation tools or models
    • H04W88/00: Devices specially adapted for wireless communication networks, e.g. terminals, base stations or access point devices
    • H04W88/08: Access point devices
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a method for optimizing energy efficiency in a hybrid power supply heterogeneous network based on the A2C algorithm, which comprises: determining the user positions of the small base stations according to the number and distribution of the macro base station and the small base stations; regarding a single small base station as an agent and establishing the state space, action space, and reward function of a Markov decision process; randomly obtaining a state from the interaction of a small-cell user with the environment; transferring the transition (s_t, a_t, r_t, s_{t+1}) to the critic network; transmitting the optimal action learned by each small base station to the macro base station as a state, and repeatedly deploying small base stations within the coverage of the macro base station to obtain the optimal small base station deployment strategy, i.e., the optimal resource allocation; users connect to the corresponding small base stations to obtain better channels, maximizing the energy efficiency of the heterogeneous network system. By applying the A2C algorithm from reinforcement learning, the invention improves the energy efficiency of the heterogeneous network, approximates the state-action value function with a Gaussian distribution, saves conventional grid resources, and reduces the cost of grid communication energy consumption.

Description

Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm
Technical Field
The invention relates to the technical field of physical layers of communication systems, in particular to a method for optimizing energy efficiency in a hybrid power supply heterogeneous network based on an A2C algorithm.
Background
With the growing number of terminals and the rapid increase in data traffic demands, conventional single-layer networks can no longer keep pace with current technological development, and wireless communication networks face severe challenges. To relieve this pressure, researchers proposed the heterogeneous network, and the current radio access network has evolved into a two-tier architecture consisting of macro base stations that satisfy wide-area access requirements and small base stations that satisfy high-density local access requirements. To support high-speed mobile data services and provide better coverage, next-generation cellular networks are expected to widely deploy micro or small cell base stations that can offload some users and traffic from traditional macro base stations. While this increases system capacity and enhances network coverage, deploying a large number of small base stations also brings new challenges, such as inter-cell interference, wasted resources, and huge energy consumption, so optimizing grid resources from a global perspective is increasingly important. As the number of base stations and the price of energy rise, energy efficiency has become a key issue in grid management.
Disclosure of Invention
The invention provides a method for optimizing energy efficiency in a hybrid power supply heterogeneous network based on an A2C algorithm, which optimizes energy efficiency based on the A2C algorithm in reinforcement learning.
In order to achieve the above effects, the technical scheme of the invention is as follows:
the method for optimizing the energy efficiency in the hybrid power supply heterogeneous network based on the A2C algorithm comprises the following steps:
step 1: constructing a heterogeneous network system according to an optimization target, wherein the system consists of a macro base station, small base stations and users, a single small base station is regarded as an intelligent body, a state space, an action space and a reward function of a Markov decision process are established, and the signal-to-interference-and-noise ratio from the macro base station to the users is calculated;
step 2: calculating total conventional power of the macro base station, and constructing an actor network and a critic network according to a defined Markov decision process; training a heterogeneous network system by utilizing an A2C algorithm;
step 3: random acquisition state s of small base station user and environment interaction t State s t Information is transmitted to an actor network, and the actor network is used for transmitting information according to the state s of the current environment t And the self state of the intelligent agent to select proper action a t Obtain instant rewards r t And state s at time t+1 t+1 Obtain action information(s) t ,a t ,r t ,s t+1 );
Step 4: action information(s) t ,a t ,r t ,s t+1 ) Transmitting to the critic network, updating the parameters of the critic network and maximizing rewards to obtain an optimal small base station deployment strategy, namely, optimal resource allocation; the users are connected to the corresponding small base stations to obtain better channels, and the energy efficiency of the heterogeneous network system is maximized.
The heterogeneous network system comprises a macro base station and a plurality of small base stations, and users are uniformly distributed in the coverage area of the base stations.
Further, in step 1, K small base stations and M users are deployed within the macro base station; the set of small base stations is κ = {0, 1, 2, ..., K} and the set of users is M = {0, 1, 2, ..., M}. Assuming that the conventional macro base station and the subordinate small base stations use the same radio spectrum, the signal-to-interference-plus-noise ratio γ_m(t) of user m during time slot t is:

γ_m(t) = g_{k,m}(t) p_{k,m}(t) / ( Σ_{i∈κ, i≠k} g_{i,m}(t) p_i(t) + σ_m² ),

where g_{k,m}(t) is the average channel gain from the serving base station k to user m at time slot t; p_{k,m}(t) is the transmit power allocated to user m; g_{i,m}(t) is the average channel gain from each interfering base station i to user m at time slot t; p_i(t) = Σ_{u∈u_i(t)} p_{i,u}(t) is the total radio transmission power of base station i for the users u_i(t) it serves; and σ_m² is the variance of the additive white Gaussian noise at user m.
Further, in step 1, the total bandwidth W of the agent is divided into W/B sub-channels of bandwidth B; during time slot t, user m is allocated b_m(t) ∈ {0, 1, ..., W/B} sub-channels, subject to

Σ_{m∈M} b_m(t) ≤ W/B.

The total information rate r_sum(t) achieved by the agent during time slot t is:

r_sum(t) = Σ_{m∈M} b_m(t) · B · log2(1 + γ_m(t)).
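For illustration, the per-slot SINR and sum rate above can be computed as in the following minimal sketch; the array layout and the function and variable names are assumptions made for this example, not part of the original disclosure.

```python
import numpy as np

def sinr(m, k, g, p_serving, p_total, noise_var):
    """gamma_m(t): SINR of user m served by base station k in one time slot.

    g         : (K+1, M) array of average channel gains g_{i,m}(t)
    p_serving : transmit power p_{k,m}(t) spent on user m
    p_total   : (K+1,) array of total transmit powers p_i(t)
    noise_var : AWGN variance sigma_m^2 at user m
    """
    interference = sum(g[i, m] * p_total[i]
                       for i in range(g.shape[0]) if i != k)
    return g[k, m] * p_serving / (interference + noise_var)

def sum_rate(b, gamma, bandwidth=1.0):
    """r_sum(t) = sum over m of b_m(t) * B * log2(1 + gamma_m(t))."""
    b, gamma = np.asarray(b, float), np.asarray(gamma, float)
    return float(np.sum(b * bandwidth * np.log2(1.0 + gamma)))
```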
the small base station acquires energy from the power grid and the renewable energy collection device.
Further, wireless transmission power drawn from the conventional power grid is taken as positive, and wireless transmission power drawn from the renewable-energy device is taken as negative. At time slot t, the total power p_k^{tot}(t) of the macro base station is:

p_k^{tot}(t) = p_k^{st} + η · p_k(t),

where p_k^{st} denotes the static power, covering the circuit boards of processors, cooling components, and the like; η is a coefficient factor related to the efficiency of the wireless power amplifier; and p_k(t) is the total wireless transmission power of all users u(t) associated with the macro base station in time slot t.
the static power
Figure BDA0004067819750000029
To obtain the static power of each base station from the regular grid, the total regular power of the macro base station at time slot t +.>
Figure BDA00040678197500000210
The method comprises the following steps:
Figure BDA0004067819750000031
wherein p is k,u (t) represents the power used by the user in the macro base station, u k (t) represents a user.
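As a minimal sketch of this hybrid power accounting (the names and the split into two helper functions are illustrative assumptions):

```python
import numpy as np

def total_power(p_users, p_static, eta=1.0):
    """p_k^tot(t) = p_k^st + eta * p_k(t), where p_k(t) sums the
    transmit powers of all users associated with the base station."""
    return p_static + eta * float(np.sum(p_users))

def conventional_power(p_users, p_static, eta=1.0):
    """Grid-supplied power p_k^co(t): static power plus only the positive
    part of each user's transmit power; negative entries are covered by
    the renewable-energy device under the sign convention above."""
    return p_static + eta * float(np.sum(np.maximum(p_users, 0.0)))
```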
Further, the energy efficiency of the agent in step 1 during time slot t is defined as the ratio ρ_t of the information rate to the conventional power consumption:

ρ_t = r_sum(t) / p^{co}(t).
further, in step 1, the states are the signal-to-interference-and-noise ratio γ (t) of each user and the battery power e (t) of each small cell, that is:
s t =(γ 1 (t),γ 2 (t),...,γ M (t),e 1 (t),e 2 (t),...,e K (t));
action a t Represented as the number u of users in macro base station k (t), number of subchannels b for user m m (t), transmission power p of user m k,m (t), action a t =(u k (t),b m (t),p k,m (t));
Rewards r t Equivalent to the energy efficiency of heterogeneous network systems, i.e. r t =ρ t
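The MDP tuple defined above maps directly onto a few small helpers; the following is a sketch under assumed names, not code from the filing.

```python
import numpy as np

def make_state(gamma, e):
    """s_t = (gamma_1(t), ..., gamma_M(t), e_1(t), ..., e_K(t))."""
    return np.concatenate([np.asarray(gamma, float), np.asarray(e, float)])

def make_action(n_users, n_subchannels, tx_power):
    """a_t = (u_k(t), b_m(t), p_{k,m}(t))."""
    return (n_users, n_subchannels, tx_power)

def reward(rate_sum, conv_power):
    """r_t = rho_t: information rate over conventional power consumption."""
    return rate_sum / conv_power
```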
Further, step 2 trains the heterogeneous network system with the A2C algorithm as follows:
Step 2.1: establish the heterogeneous network system according to the optimization target; set up the reinforcement learning framework, initialize the state space, select the optimal action according to the state obtained from the agent's interaction with the A2C environment, and transmit the optimal action to the macro base station as state information;
Step 2.2: input the state s_t into the heterogeneous network system, and update the state-action value function and state value function of the heterogeneous network system;
Step 2.3: run multiple rounds of iterative training on the heterogeneous network until the reward function converges, obtaining the trained heterogeneous network.
Further, in step 2.2, the state-action value function Q_w(s, a) and the state value function V_v(s) of the A2C algorithm are updated: Q_w(s, a) is approximated with a Gaussian distribution; V_v(s) is approximated with tile coding, using 32 tilings of 4*4 = 16 tiles each; eligibility traces are introduced so that all action values visited within an episode are updated to different degrees, with the update decaying over elapsed time; and an advantage function is set in the actor network.
Further, the state-action value function Q_w(s, a) is expressed as:

Q_w(s, a) = w^T ψ(s, a),

where w^T is the parameter vector, w = (w_1, w_2, ..., w_n)^T; the parameter θ = (θ_1, θ_2, ..., θ_n)^T constructs the policy π_θ(a|s), written π_θ(s, a) = Pr(a|s, θ), with π_θ(s, a) ~ N(μ(s), σ²), where μ(s) = θ^T ψ(s) is the mean and σ is the standard deviation; ψ(s) = (γ_1, γ_2, ..., γ_M, e_1, e_2, ..., e_K)^T, where γ denotes the signal-to-interference-plus-noise ratio of each user and e denotes the battery level of each small base station.
Further, the state value function V_v(s) is expressed as:

V_v(s) = w_{It(1)}·1 + w_{It(2)}·1 + ... + w_{It(32)}·1,

where w_{It(1)}, w_{It(2)}, ..., w_{It(32)} are the weights of the activated tiles and It(1), It(2), ..., It(32) are the index values of the activated tiles; all other tiles contribute 0.
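A tile-coded value function of this shape can be sketched as below. The patent does not state which two state components are tiled or how the tilings are offset, so the 2-D projection and the uniform offsets here are assumptions.

```python
import numpy as np

class TileCodedValue:
    """V_v(s) via tile coding: 32 offset 4x4 tilings over a 2-D projection
    of the state; V is the sum of one activated weight per tiling."""

    def __init__(self, n_tilings=32, tiles=4, lo=0.0, hi=1.0):
        self.n_tilings, self.tiles = n_tilings, tiles
        self.lo, self.span = lo, hi - lo
        self.w = np.zeros((n_tilings, tiles, tiles))  # one weight per tile
        # each tiling is shifted by a fraction of one tile width
        self.shift = np.linspace(0.0, 1.0, n_tilings, endpoint=False)

    def active(self, x, y):
        """Tile indices It(1)..It(32): the activated tile in each tiling."""
        out = []
        for t in range(self.n_tilings):
            i = int(np.clip((x - self.lo) / self.span * self.tiles
                            + self.shift[t], 0, self.tiles - 1))
            j = int(np.clip((y - self.lo) / self.span * self.tiles
                            + self.shift[t], 0, self.tiles - 1))
            out.append((t, i, j))
        return out

    def value(self, x, y):
        return sum(self.w[t, i, j] for t, i, j in self.active(x, y))

    def update(self, x, y, td_error, alpha=0.02):
        for t, i, j in self.active(x, y):
            self.w[t, i, j] += alpha / self.n_tilings * td_error
```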
Further, in step 4, the transition (s_t, a_t, r_t, s_{t+1}) is transmitted to the critic network; the critic network evaluates the action return value q_t of the agent and updates the critic network parameters with the TD error (in temporal-difference learning, the deviation between the bootstrapped estimate and the current value); (s_t, a_t, q_t) is transmitted to the actor network, which updates the action selection probabilities along the policy gradient to maximize the reward r_t; the optimal action learned by each small base station is transmitted to the macro base station as a state, and small base stations are repeatedly deployed within the coverage of the macro base station to obtain the optimal small base station deployment strategy, i.e., the optimal resource allocation; users connect to the corresponding small base stations to obtain better channels, maximizing the energy efficiency of the heterogeneous network system.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the invention, the heterogeneous network model is trained by using the A2C algorithm in reinforcement learning, so that the energy efficiency of a heterogeneous network system is improved, the state action value function is approximated by using a Gaussian distribution method, the state value function is approximated by using tile coding, the actor network is helped to know the strategy gradient, the variance of the strategy gradient is further reduced by using the advantage function, and the A2C algorithm is enabled to converge faster.
Drawings
The drawings are for illustrative purposes only and are not to be construed as limiting the invention; for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
FIG. 1 is a schematic flow chart of an energy efficiency optimization method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a heterogeneous network system according to an embodiment of the present invention;
fig. 3 is a training flowchart of a heterogeneous network system according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
For ease of understanding, referring to fig. 1-2, an embodiment of a method for optimizing energy efficiency in a hybrid power heterogeneous network based on an A2C algorithm provided by the present invention includes the following steps:
step 1: determining the user position of the small base station according to the quantity distribution condition of the macro base station and the small base station; a single small base station is regarded as an intelligent agent, and a state space, an action space and a reward function of a Markov decision process are established; setting the rewards as the energy efficiency of the agent;
step 2: constructing an actor network and a critic network according to a defined Markov decision process; training a heterogeneous network system by utilizing an A2C algorithm;
step 3: small base station userAmbient interaction randomly obtains state s t State s t Transmitting to an actor network, wherein the actor network is used for transmitting the state s of the current environment to the actor network t And the self state of the intelligent agent to select the proper action a t Obtain instant rewards r t And state s at time t+1 t+1 Obtain action information(s) t ,a t ,r t ,s t+1 );
Step 4: action information(s) t ,a t ,r t ,s t+1 ) Pass to the critic network, and the action taken by the critic network on the agent returns q t Evaluating the value, and updating the critic network parameters by adopting TD errors; will(s) t ,a t ,q t ) Transmitting to an actor network, updating action selection probability according to strategy gradient, and maximizing rewards r t The method comprises the steps of carrying out a first treatment on the surface of the Transmitting the optimal actions obtained by learning of each small base station as states to a macro base station, and repeatedly deploying the small base stations in the coverage area of the macro base station to obtain an optimal small base station deployment strategy, namely optimal resource allocation; the users are connected to the corresponding small base stations to obtain better channels, and the energy efficiency of the heterogeneous network system is maximized. When the state of the critic network changes, a new small base station deployment strategy can be obtained only by re-inputting a new state into the actor network.
The invention optimizes the energy efficiency by using the A2C algorithm in reinforcement learning, saves the resources of the traditional power grid and saves the cost of the energy consumption of the communication network.
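The interaction loop of steps 3 and 4 can be summarized in the following sketch; the env, actor, and critic objects and their method names are hypothetical interfaces assumed for illustration.

```python
def train_episode(env, actor, critic, beta=0.99, steps=50):
    """One episode of the step 3/4 loop: the actor picks a_t from s_t, the
    critic scores the transition with a TD error, and both networks update."""
    s = env.reset()                      # random initial state s_t
    for _ in range(steps):
        a = actor.sample(s)              # a_t ~ pi_theta(.|s_t), Gaussian policy
        s_next, r, done = env.step(a)    # instant reward r_t and state s_{t+1}
        # TD error: bootstrapped target minus the current estimate V_v(s_t)
        td = r + beta * critic.value(s_next) - critic.value(s)
        critic.update(s, td)             # move V_v(s_t) toward the target
        actor.update(s, a, td)           # policy-gradient step weighted by td
        if done:
            break
        s = s_next
```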
Example 2
This embodiment builds on Embodiment 1 and describes a specific implementation, so as to further demonstrate the technical effects. The method is specified as follows:
specifically, in step 1, K small base stations and M users are deployed in the macro base station, where the set of small base stations is κ= {0,1,2, … …, K }, the set of users is m= {0,1,2, … … M }, and if the conventional macro base station and the subordinate small base stations use the same radio spectrum, the signal-to-interference-and-noise ratio γ of the user M during the time slot t m The method comprises the following steps:
Figure BDA0004067819750000061
/>
wherein k represents macro base station, g k,m (t) is the average channel gain of macro base station at time slot t, p for user m k,m (t) is the transmit power of the small cell; g i,m (t) is the average channel gain of user m from other interfering base stations i at time slot t; p is p i (t) is for user u i (t) the total radio transmission power of the serving base station i,
Figure BDA0004067819750000062
Figure BDA0004067819750000063
is the variance of the additive white gaussian noise at user m.
Specifically, in step 1, the total bandwidth W of the agent is divided into W/B sub-channels of bandwidth B; during time slot t, user m is allocated b_m(t) ∈ {0, 1, ..., W/B} sub-channels, subject to

Σ_{m∈M} b_m(t) ≤ W/B.

The total information rate r_sum(t) achieved by the agent during time slot t is:

r_sum(t) = Σ_{m∈M} b_m(t) · B · log2(1 + γ_m(t)).
specifically, the wireless transmission power obtained by the conventional power grid is set to be a positive value, and the wireless transmission power obtained in the renewable energy source equipment is set to be a negative value; at time slot t, the total power of the macro base station
Figure BDA0004067819750000066
The method comprises the following steps:
Figure BDA0004067819750000067
in the method, in the process of the invention,
Figure BDA0004067819750000068
representing static power; η is a coefficient factor related to the efficiency of the wireless power amplifier; p is p k (t) is the total wireless transmission power of all users u (t) associated with the macro base station in time slot t;
the static power
Figure BDA0004067819750000069
To obtain the static power of each base station from the regular grid, the total regular power of the macro base station at time slot t +.>
Figure BDA00040678197500000610
The method comprises the following steps:
Figure BDA00040678197500000611
wherein p is k,u (t) represents the power used by the user in the macro base station, u k (t) represents a user.
Specifically, the energy efficiency of the agent in step 1 during time slot t is defined as the ratio ρ_t of the information rate to the conventional power consumption:

ρ_t = r_sum(t) / p^{co}(t).
specifically, in step 1, the states are a signal-to-interference-and-noise ratio γ (t) of each user and a battery power e (t) of each small cell, that is:
s t =(γ 1 (t),γ 2 (t),...,γ M (t),e 1 (t),e 2 (t),...,e K (t));
action a t Represented as the number u of users in macro base station k (t), number of subchannels b for user m m (t), transmission power p of user m k,m (t), action a t =(u k (t),b m (t),p k,m (t)); rewards r t Equivalent to the energy efficiency of heterogeneous network systems, i.e. r t =ρ t
Specifically, as shown in fig. 3, step 2 trains the heterogeneous network system with the A2C algorithm as follows. Step 2.1: set up the reinforcement learning framework, initialize the state space, select the optimal action according to the state obtained from the agent's interaction with the A2C environment, and transmit the optimal action to the macro base station as state information;
Step 2.2: input the state s_t into the heterogeneous network system, and update the state-action value function and state value function of the heterogeneous network system;
Step 2.3: run multiple rounds of iterative training on the heterogeneous network until the reward function converges, obtaining the trained heterogeneous network.
In a specific implementation, assume 20 users within one macro base station, with randomly distributed positions and a randomly generated initial state; the bandwidth of each sub-channel is set to B = 1 Hz, and the information rate of each user is r_m ∈ [0, 1]. The maximum transmit power of a single user is set to P_max = 1 W, so the transmit power of user m is p_{k,m} ∈ [-1, 1]. The static power p_k^{st} is fixed, η = 1, the discount coefficient β = 0.99, and the decay factor λ = 0.5. Most parameters are normalized: the renewable energy collected at a small base station is normalized to e_k ∈ [0, 1], and the signal-to-interference-plus-noise ratio between a small base station and its served user m is normalized to γ_m ∈ [0, 1]. The learning rate of the critic network is 0.02, the learning rate of the actor network is gradually increased from 0.01 to 0.03, and training runs for 10000 episodes of 50 steps each.
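Collected in one place, the embodiment's settings look like the following configuration sketch (the static power value appears only as an image in the original, so it is left unset here; all key names are assumptions):

```python
config = {
    "n_users": 20,                   # users per macro base station, random positions
    "subchannel_bandwidth_hz": 1.0,  # B = 1 Hz
    "user_rate_range": (0.0, 1.0),   # normalized information rate r_m
    "p_max_w": 1.0,                  # per-user cap, so p_{k,m} in [-1, 1]
    "p_static_w": None,              # p_k^st: fixed; value not recoverable here
    "eta": 1.0,                      # power-amplifier coefficient factor
    "discount_beta": 0.99,
    "trace_decay_lambda": 0.5,
    "critic_lr": 0.02,
    "actor_lr": (0.01, 0.03),        # ramped up over the course of training
    "episodes": 10000,
    "steps_per_episode": 50,
}
```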
Specifically, in step 2.2, the state-action value function Q_w(s, a) and the state value function V_v(s) of the A2C algorithm are updated: Q_w(s, a) is approximated with a Gaussian distribution; V_v(s) is approximated with tile coding, using 32 tilings of 4*4 = 16 tiles each, to cope with the huge state-action space. The critic network introduces eligibility traces so that all action values visited within an episode are updated to different degrees, with the update decaying over elapsed time, and an advantage function is set in the actor network.
The advantage function is expressed as: A(s_t, a_t) = Q_w(s_t, a_t) - [τ·V_v(s_t) + (1 - τ)·V_v(s_{t-1})]. The Q table underlying the state-action value function is obtained by reinforcement-learning iteration over the different actions executed under the various states.
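A sketch of the advantage computation and an eligibility-trace weight update consistent with this description (the accumulating-trace form and the parameter names are assumptions):

```python
import numpy as np

def advantage(q_sa, v_s, v_s_prev, tau=0.5):
    """A(s_t, a_t) = Q_w(s_t, a_t) - [tau*V_v(s_t) + (1 - tau)*V_v(s_{t-1})]."""
    return q_sa - (tau * v_s + (1.0 - tau) * v_s_prev)

class TracedWeights:
    """Linear value weights with an eligibility trace: one TD error nudges
    every recently visited feature, the most recent the most."""

    def __init__(self, n_features, lam=0.5, beta=0.99, alpha=0.02):
        self.w = np.zeros(n_features)
        self.z = np.zeros(n_features)   # eligibility trace
        self.lam, self.beta, self.alpha = lam, beta, alpha

    def step(self, features, td_error):
        self.z = self.beta * self.lam * self.z + features   # decay, accumulate
        self.w += self.alpha * td_error * self.z
```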
The state-action value function Q_w(s, a) is expressed as:

Q_w(s, a) = w^T ψ(s, a),

where w^T is the parameter vector, w = (w_1, w_2, ..., w_n)^T; the parameter θ = (θ_1, θ_2, ..., θ_n)^T constructs the policy π_θ(a|s), written π_θ(s, a) = Pr(a|s, θ), with π_θ(s, a) ~ N(μ(s), σ²), where μ(s) = θ^T ψ(s) is the mean and σ is the standard deviation; ψ(s) = (γ_1, γ_2, ..., γ_M, e_1, e_2, ..., e_K)^T, where γ denotes the signal-to-interference-plus-noise ratio of each user and e denotes the battery level of each small base station.
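The Gaussian parameterized policy above admits a compact sketch; the linear mean mu(s) = theta^T psi(s) and the fixed sigma are assumptions consistent with the feature vector psi(s) defined in the text.

```python
import numpy as np

class GaussianPolicy:
    """pi_theta(a|s) = N(mu(s), sigma^2), with linear mean mu(s) = theta^T psi(s)."""

    def __init__(self, n_features, sigma=0.1, lr=0.01):
        self.theta = np.zeros(n_features)
        self.sigma, self.lr = sigma, lr

    def mean(self, psi):
        return float(self.theta @ psi)

    def sample(self, psi):
        """Draw a continuous random action from N(mu(s), sigma^2)."""
        return np.random.normal(self.mean(psi), self.sigma)

    def update(self, psi, a, adv):
        # score function: grad_theta log pi = (a - mu(s)) / sigma^2 * psi(s)
        score = (a - self.mean(psi)) / self.sigma ** 2 * psi
        self.theta += self.lr * adv * score   # gradient ascent on the reward
```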
Since the Bellman equation cannot be solved exactly over such a huge state-action space, function approximation is used instead. The difference between the A2C algorithm and the plain actor-critic (AC) algorithm is that A2C introduces a baseline to construct an advantage function, which further reduces variance and makes the algorithm converge faster; the algorithm therefore maintains two sets of parameters, the state-action value function Q_w(s, a) and the state value function V_v(s).
The state value function V_v(s) is expressed as:

V_v(s) = w_{It(1)}·1 + w_{It(2)}·1 + ... + w_{It(32)}·1,

where w_{It(1)}, w_{It(2)}, ..., w_{It(32)} are the weights of the activated tiles and It(1), It(2), ..., It(32) are their index values; tiles that are not activated have feature value 0 and do not contribute to the state value function.
Aiming at the problem of energy waste in grid communication and at reducing the grid's energy cost and burden, the invention provides a novel small base station with an integrated energy-harvesting device that substitutes renewable energy (such as wind and solar energy) for grid supply. The novel small base station is powered by renewable energy, falling back to the conventional grid only when the renewable energy stored in the device is exhausted. In the A2C environment, the novel small base stations are deployed within the coverage of a traditional macro base station; the small base stations offload traffic from the macro base station, and the macro base station makes joint user-scheduling and resource-allocation decisions. The A2C algorithm is applied to the macro base station and the small base stations, which are randomly deployed within the macro base station's coverage. Users connected to a base station are divided into macro-base-station users and small-base-station users, with generally more users connected to the macro base station than to a small base station. The energy efficiency of the hybrid-powered heterogeneous network system is improved with the A2C algorithm.
The invention provides a method for optimizing the energy efficiency of the hybrid power supply based on the A2C algorithm from reinforcement learning: the actor network uses a Gaussian distribution as the parameterized policy to generate continuous random actions and updates the policy parameters by gradient ascent; the critic network uses tile coding to estimate the performance of the policy and to help the actor network estimate the policy gradient, with the advantage function further reducing the variance of the policy gradient.
The pseudocode of the A2C algorithm in the embodiment of the present invention is given in Table 1.
Table 1: A2C algorithm pseudocode (reproduced as an image in the original publication).
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The method for optimizing the energy efficiency in the hybrid power supply heterogeneous network based on the A2C algorithm is characterized by comprising the following steps:
step 1: constructing a heterogeneous network system according to an optimization target, the system consisting of a macro base station, small base stations, and users; regarding a single small base station as an agent; establishing the state space, action space, and reward function of a Markov decision process; and calculating the signal-to-interference-plus-noise ratio from the macro base station to the users;
step 2: calculating the total conventional power of the macro base station, constructing an actor network and a critic network according to the defined Markov decision process, and training the heterogeneous network system with the A2C algorithm;
step 3: randomly obtaining a state s_t from the interaction of a small-cell user with the environment, transmitting the state s_t to the actor network, the actor network selecting a suitable action a_t according to the current environment state s_t and the agent's own state, obtaining an instant reward r_t and the state s_{t+1} at time t+1, and yielding the transition (s_t, a_t, r_t, s_{t+1});
step 4: transmitting the transition (s_t, a_t, r_t, s_{t+1}) to the critic network, updating the critic network parameters and maximizing the reward to obtain the optimal small base station deployment strategy, i.e., the optimal resource allocation; users connect to the corresponding small base stations to obtain better channels, maximizing the energy efficiency of the heterogeneous network system.
2. The method for optimizing energy efficiency in a hybrid power supply heterogeneous network based on the A2C algorithm according to claim 1, wherein the signal-to-interference-plus-noise ratio from the macro base station to a user is calculated in step 1 as follows: K small base stations and M users are deployed within the macro base station, and the user positions of the small base stations are determined according to the number and distribution of the macro base station and the small base stations; the set of small base stations is κ = {0, 1, 2, ..., K} and the set of users is M = {0, 1, 2, ..., M}; assuming the legacy macro base station and the attached small base stations use the same radio spectrum, the signal-to-interference-plus-noise ratio γ_m(t) of user m during time slot t is:

γ_m(t) = g_{k,m}(t) p_{k,m}(t) / ( Σ_{i∈κ, i≠k} g_{i,m}(t) p_i(t) + σ_m² ),

where g_{k,m}(t) is the average channel gain from the serving base station k to user m at time slot t; p_{k,m}(t) is the transmit power allocated to user m; g_{i,m}(t) is the average channel gain from each interfering base station i to user m at time slot t; p_i(t) = Σ_{u∈u_i(t)} p_{i,u}(t) is the total radio transmission power of base station i for the users u_i(t) it serves; and σ_m² is the variance of the additive white Gaussian noise at user m.
3. The method for optimizing energy efficiency in a hybrid power heterogeneous network based on the A2C algorithm according to claim 2, wherein calculating the total conventional power of the macro base station in step 2 specifically comprises: the total bandwidth W of the heterogeneous network system is divided into W/B sub-channels of bandwidth B; during time slot t, user m is allocated b_m(t) ∈ {0, 1, ..., W/B} sub-channels, subject to

Σ_{m∈M} b_m(t) ≤ W/B;

the total information rate r_sum(t) achieved by the heterogeneous network system during time slot t is:

r_sum(t) = Σ_{m∈M} b_m(t) · B · log2(1 + γ_m(t));

wireless transmission power drawn from the conventional power grid is taken as positive, and wireless transmission power drawn from the renewable-energy device is taken as negative; at time slot t, the total power p_k^{tot}(t) of the macro base station is:

p_k^{tot}(t) = p_k^{st} + η · p_k(t),

where p_k^{st} denotes the static power; η is a coefficient factor related to the efficiency of the wireless power amplifier; and p_k(t) is the total wireless transmission power of all users u(t) associated with the macro base station in time slot t;

the static power p_k^{st} is drawn entirely from the conventional grid, so the total conventional power p_k^{co}(t) of the macro base station at time slot t is:

p_k^{co}(t) = p_k^{st} + η · Σ_{u∈u_k(t)} max{p_{k,u}(t), 0},

where p_{k,u}(t) denotes the power used by user u in the macro base station and u_k(t) denotes the set of associated users.
4. The method for optimizing energy efficiency in a hybrid power heterogeneous network based on the A2C algorithm according to claim 3, wherein the energy efficiency of the heterogeneous network system in step 1 during time slot t is defined as the ratio ρ_t of the information rate to the conventional power consumption:

ρ_t = r_sum(t) / p^{co}(t).
5. the method for optimizing energy efficiency in a hybrid power heterogeneous network based on an A2C algorithm according to claim 4, wherein in step 1, the states are a signal-to-interference-and-noise ratio γ (t) of each user and a battery power e (t) of each small base station, namely:
s t =(γ 1 (t),γ 2 (t),...,γ M (t),e 1 (t),e 2 (t),...,e K (t));
action a t Represented as the number u of users in macro base station k (t), number of subchannels b for user m m (t), transmission power p of user m k,m (t), action a t =(u k (t),b m (t),p k,m (t));
Rewards r t Equivalent to the energy efficiency of heterogeneous network systems, i.e. r t =ρ t
6. The method for optimizing energy efficiency in a hybrid power supply heterogeneous network based on the A2C algorithm according to claim 5, wherein step 2 trains the heterogeneous network system with the A2C algorithm as follows:
step 2.1: setting up the reinforcement learning framework, initializing the state space, selecting the optimal action according to the state obtained from the agent's interaction with the A2C environment, and transmitting the optimal action to the macro base station as state information;
step 2.2: inputting the state s_t into the heterogeneous network system, and updating the state-action value function and state value function of the heterogeneous network system;
step 2.3: running multiple rounds of iterative training on the heterogeneous network until the reward function converges, obtaining the trained heterogeneous network.
7. The method for optimizing energy efficiency in a hybrid power heterogeneous network based on the A2C algorithm according to claim 6, wherein step 2.2 updates the state-action value function Q_w(s, a) and the state value function V_v(s) of the A2C algorithm: Q_w(s, a) is approximated with a Gaussian distribution; V_v(s) is approximated with tile coding, using 32 tilings of 4*4 = 16 tiles each; eligibility traces are introduced so that all action values visited within an episode are updated to different degrees, with the update decaying over elapsed time; and an advantage function is set in the actor network.
8. The method for optimizing energy efficiency in a hybrid power heterogeneous network based on the A2C algorithm according to claim 7, wherein the state-action value function Q_w(s, a) is expressed as:

Q_w(s, a) = w^T ψ(s, a),

where w^T is the parameter vector, w = (w_1, w_2, ..., w_n)^T; the parameter θ = (θ_1, θ_2, ..., θ_n)^T constructs the policy π_θ(a|s), written π_θ(s, a) = Pr(a|s, θ), with π_θ(s, a) ~ N(μ(s), σ²), where μ(s) = θ^T ψ(s) is the mean and σ is the standard deviation; ψ(s) = (γ_1, γ_2, ..., γ_M, e_1, e_2, ..., e_K)^T, where γ denotes the signal-to-interference-plus-noise ratio of each user and e denotes the battery level of each small base station.
9. The method for optimizing energy efficiency in a hybrid power heterogeneous network based on the A2C algorithm according to claim 8, wherein the state value function V_v(s) is expressed as:

V_v(s) = w_{It(1)}·1 + w_{It(2)}·1 + ... + w_{It(32)}·1,

where w_{It(1)}, w_{It(2)}, ..., w_{It(32)} are the weights of the activated tiles and It(1), It(2), ..., It(32) are the index values of the activated tiles; all other tiles contribute 0.
10. The method for optimizing energy efficiency in a hybrid power heterogeneous network based on the A2C algorithm according to claim 9, wherein step 4 is specifically: the transition (s_t, a_t, r_t, s_{t+1}) is transmitted to the critic network; the critic network evaluates the action return value q_t of the agent and updates the critic network parameters with the TD error; (s_t, a_t, q_t) is transmitted to the actor network, which updates the action selection probabilities along the policy gradient to maximize the reward r_t; the optimal action learned by each small base station is transmitted to the macro base station as a state, and small base stations are repeatedly deployed within the coverage of the macro base station to obtain the optimal small base station deployment strategy, i.e., the optimal resource allocation; users connect to the corresponding small base stations to obtain better channels, maximizing the energy efficiency of the heterogeneous network system.
CN202310082022.6A 2023-01-17 2023-01-17 Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm Pending CN116132997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310082022.6A CN116132997A (en) 2023-01-17 2023-01-17 Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310082022.6A CN116132997A (en) 2023-01-17 2023-01-17 Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm

Publications (1)

Publication Number Publication Date
CN116132997A true CN116132997A (en) 2023-05-16

Family

ID=86297064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310082022.6A Pending CN116132997A (en) 2023-01-17 2023-01-17 Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm

Country Status (1)

Country Link
CN (1) CN116132997A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726143A (en) * 2024-02-07 2024-03-19 山东大学 Environment-friendly micro-grid optimal scheduling method and system based on deep reinforcement learning
CN117726143B (en) * 2024-02-07 2024-05-17 山东大学 Environment-friendly micro-grid optimal scheduling method and system based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN109474980B (en) Wireless network resource allocation method based on deep reinforcement learning
Wei et al. User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach
CN110417496B (en) Cognitive NOMA network stubborn resource allocation method based on energy efficiency
Ng et al. Energy-efficient resource allocation in OFDMA systems with hybrid energy harvesting base station
CN110708711B (en) Heterogeneous energy-carrying communication network resource allocation method based on non-orthogonal multiple access
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
CN109413724A (en) A kind of task unloading and Resource Allocation Formula based on MEC
CN111586720A (en) Task unloading and resource allocation combined optimization method in multi-cell scene
Budhiraja et al. Cross-layer interference management scheme for D2D mobile users using NOMA
CN104378772B (en) Towards the small base station deployment method of the amorphous covering of cell in a kind of cellular network
CN106231610B (en) Based on the resource allocation methods of sub-clustering in Femtocell double-layer network
CN111446992B (en) Method for allocating resources with maximized minimum energy efficiency in wireless power supply large-scale MIMO network
CN107343268B (en) Non-orthogonal multicast and unicast transmission beamforming method and system
CN114885420A (en) User grouping and resource allocation method and device in NOMA-MEC system
CN116132997A (en) Method for optimizing energy efficiency in hybrid power supply heterogeneous network based on A2C algorithm
Chuang et al. Dynamic multiobjective approach for power and spectrum allocation in cognitive radio networks
Zhang et al. A dynamic power allocation scheme in power-domain NOMA using actor-critic reinforcement learning
Liu et al. Robust resource allocation in two-tier NOMA heterogeneous networks toward 5G
CN113473580A (en) Deep learning-based user association joint power distribution strategy in heterogeneous network
CN105490794A (en) Packet-based resource distribution method for orthogonal frequency division multiple access (OFDMA) femtocell double-layer network
CN108965034B (en) Method for associating user to network under ultra-dense deployment of small cell base station
CN114615730A (en) Content coverage oriented power distribution method for backhaul limited dense wireless network
CN114521023A (en) SWIPT-assisted NOMA-MEC system resource allocation modeling method
CN107426775B (en) Distributed multi-user access method for high-energy-efficiency heterogeneous network
CN112954806A (en) Chord graph coloring-based joint interference alignment and resource allocation method in heterogeneous network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination