CN117615393A - Resource optimization method of STAR-RIS communication system based on deep reinforcement learning - Google Patents
Resource optimization method of STAR-RIS communication system based on deep reinforcement learning

Info
- Publication number
- CN117615393A CN117615393A CN202311692409.XA CN202311692409A CN117615393A CN 117615393 A CN117615393 A CN 117615393A CN 202311692409 A CN202311692409 A CN 202311692409A CN 117615393 A CN117615393 A CN 117615393A
- Authority
- CN
- China
- Prior art keywords
- ris
- star
- representing
- base station
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/16—Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
- H04W28/18—Negotiating wireless communication parameters
- H04W28/22—Negotiating communication rate
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention provides a resource optimization method for a STAR-RIS communication system based on deep reinforcement learning. In the presence of an untrusted eavesdropper, the method adopts deep reinforcement learning to maximize the secrecy rate of the legitimate user while satisfying constraints such as the base-station transmit power, the STAR-RIS coefficient matrices and energy splitting, together with the minimum energy-harvesting requirement of the untrusted eavesdropper. The method proposes a deep reinforcement learning algorithm based on soft-update actor-critic, comprehensively considers the number of users, the number of base-station antennas and the number of reflecting elements, and introduces an agent: the transmission and reflection coefficient matrices and the beamforming matrix form the action space, the channel state information forms the state space, and, with the secrecy rate as the basis, the secrecy rate under the instantaneous channel and the action at time step t serves as the reward. A reinforcement learning environment is constructed and the networks are trained to solve the optimization problem. With this method, the secrecy rate of the system can be greatly improved.
Description
Technical Field
The invention relates to the technical field of wireless communication, in particular to a resource optimization method of a STAR-RIS communication system based on deep reinforcement learning.
Background
Ubiquitous connectivity will become a reality in the future: network structures will become increasingly complex, the demands on hardware and network capacity will rise, and energy consumption will grow, so high-frequency communication technologies such as millimeter wave and terahertz will become ideal choices for future wireless networks. In recent years, reconfigurable intelligent surfaces (RIS) have received considerable attention and are regarded as a key technology for next-generation wireless networks, with advantages such as low cost, programmability, ease of deployment and low energy consumption. A RIS is built from programmable metamaterials, so its electromagnetic characteristics can be changed and parameters such as amplitude, phase, frequency and polarization can be reconfigured.
Although the RIS is considered a potentially transformative technology for wireless communication systems, most prior work considers only the case where the RIS reflects the incident wireless signal. In that case the transmitter and receiver must be on the same side of the RIS, so the RIS covers only one half-space, i.e., 180° coverage on the reflection side, leaving a coverage dead zone. In practice, however, users are typically located on both sides of the RIS. The simultaneously transmitting and reflecting RIS (STAR-RIS) solves this problem: in a STAR-RIS aided communication system, part of the incident signal is reflected into the same half-space as the incident signal, called the reflection half-space (or simply "reflection space"), while the other part is transmitted into the half-space opposite to the incident space, called the transmission half-space (or simply "transmission space"), thereby achieving 360° full coverage.
In the field of wireless communications, energy is a very important resource. With the emergence of new wireless network applications such as smart grids, smart homes and intelligent detection, the energy consumption of wireless communication devices keeps increasing. Wireless energy transfer is an emerging technology that provides wireless charging services to wireless devices via radio frequency signals to address the energy limitations of wireless sensor networks. In a wirelessly powered communication network, an energy-harvesting user can collect and utilize energy from the electromagnetic waves transmitted by the base station, which greatly reduces operating cost and simplifies network maintenance. In addition, information security is one of the most serious problems in wireless communication. Because of the broadcast nature of the wireless channel, every user in a network can receive the electromagnetic waves the base station sends to other users and attempt to demodulate and decode them, which poses a significant challenge to secure communication. In the communication system considered by the invention in particular, the energy-harvesting user also enjoys the performance gain brought by the STAR-RIS and acts as a potential eavesdropper, creating a serious threat to the information security and privacy of the legitimate user; the invention therefore takes the physical-layer secrecy rate as the optimization objective.
Finally, in terms of conventional wireless resource optimization design, although convex optimization algorithms are widely applied to various complex optimization problems, conventional optimization methods are not suitable for the low-power devices of wireless networks and do not meet the low-latency requirements of future wireless networks. Moreover, the joint design of beamforming and of the STAR-RIS amplitudes and phases remains challenging in increasingly complex and dynamic environments; for example, adjusting the surface coefficients to best serve users' reflection or information needs is difficult for conventional optimization methods to handle, whereas reinforcement learning can cope with dynamic environments and learn and make decisions as the environment changes.
Disclosure of Invention
The invention provides a resource optimization method for a STAR-RIS communication system based on deep reinforcement learning, which addresses the problems described in the background art and comprises the following steps:
step 1: establishing a communication system model based on STAR-RIS assistance, wherein the communication system comprises a base station, STAR-RIS, a reflection user positioned in a reflection space, a transmission user positioned in a transmission space and an intelligent controller, the intelligent controller can adjust relevant parameters of the STAR-RIS according to the requirements of the communication system, and in the communication system model, a direct link exists between the base station and the transmission user as well as between the base station and the reflection user;
step 2: acquiring channel data, a beam forming matrix of a base station, a reflection coefficient matrix and a transmission coefficient matrix of STAR-RIS in an energy splitting mode, wherein the channel data comprises channel data from the base station to R users, channel data from the base station to T users, channel data from the base station to STAR-RIS, channel data from STAR-RIS to R users and channel data from STAR-RIS to T users;
step 3: establishing an optimization problem by taking maximization of the secrecy rate between the reflecting user and the transmitting user as the objective function, and taking the STAR-RIS energy conservation relation, the transmitting user's energy harvesting, and the maximum power of the base station as constraint conditions;
step 4: and designing a state space, an action space and a reward function of reinforcement learning, constructing a reinforcement learning environment according to the requirements of a communication system, and solving an optimization problem by adopting a SAC algorithm so as to determine the resource optimization configuration of the communication system.
Further, the objective function is:
s.t. P_e ≥ E_max,
P_B ≤ P_max
where H_BU denotes the base-station-to-reflecting-user direct channel, H_BE denotes the base-station-to-transmitting-user direct channel, H_BR denotes the base-station-to-STAR-RIS channel, h_r denotes the STAR-RIS-to-reflecting-user channel, h_t' denotes the STAR-RIS-to-transmitting-user channel, Θ_r denotes the reflection coefficient matrix of the STAR-RIS, Θ_t' denotes the transmission coefficient matrix of the STAR-RIS, w denotes the beamforming matrix of the base station, σ² denotes the noise variance, P_e denotes the energy harvested by the transmitting user, E_max denotes the set threshold, β_n^r and β_n^t denote the reflection and transmission amplitudes of the n-th element of the STAR-RIS, N denotes the total number of elements of the STAR-RIS, P_B denotes the transmit power of the base station, and P_max denotes the maximum power that the base station must satisfy.
Further, the action space consists of the beamforming matrix of the base station and the reflection and transmission coefficient matrices of the STAR-RIS in energy splitting mode, i.e., a_t = {ω_t, Θ_{r,t}, Θ_{t',t}}, where a_t denotes the action at time step t, ω_t denotes the beamforming matrix of the base station at time step t, and Θ_{r,t} and Θ_{t',t} denote the reflection coefficient matrix and the transmission coefficient matrix at time step t, respectively.
The state space consists of the channel data from the base station to the R user, from the base station to the T user, from the base station to the STAR-RIS, from the STAR-RIS to the R user and from the STAR-RIS to the T user, i.e., s_t = {H_{BR,t}, H_{BU,t}, H_{BE,t}, h_{r,t}, h_{t',t}}, where s_t denotes the state at time step t, H_{BR,t}, H_{BU,t} and H_{BE,t} denote, at time step t, the base-station-to-STAR-RIS channel, the base-station-to-reflecting-user direct channel and the base-station-to-transmitting-user direct channel, respectively, and h_{r,t} and h_{t',t} denote, at time step t, the STAR-RIS-to-reflecting-user channel and the STAR-RIS-to-transmitting-user channel, respectively;
The reward function is the secrecy rate at time step t, i.e.,
where R(ω, Θ_r, Θ_t')_t denotes the reward at time step t.
Further, for the transmitting user's energy harvesting constraint, a penalty term is constructed to enforce the constraint, and the penalty rule is:
R(ω, Θ_r, Θ_t')′_t = ξ · R(ω, Θ_r, Θ_t')_t
where ξ denotes the penalty factor applied to the reward, which takes the value:
where P_{e,t} denotes the energy harvested by the transmitting user at time step t; when the harvested energy is greater than the threshold E_max, the penalty factor is 1, i.e., no penalty is given when the constraint is satisfied; otherwise, a corresponding penalty is given, i.e., a penalty term is subtracted from the original reward.
Further, the SAC algorithm model comprises an action network, an evaluation network comprising two Q networks, and a value network, each of which comprises, in sequence, an input layer, a hidden layer, a ReLU activation layer and an output layer.
Further, the SAC algorithm is adopted to solve the optimization problem. Unlike other reinforcement learning algorithms, the SAC algorithm treats the entropy as part of the reward and maximizes the entropy while maximizing the reward, so the maximum-entropy objective function is defined as:
π* = arg max_π Σ_t E_{(s_t, a_t)∼ρ_π}[ r(s_t, a_t) + α·H(π(·|s_t)) ]
where π* denotes the optimal solution of policy π, ρ_π denotes the distribution of state-action pairs, r(s_t, a_t) denotes the return of the trajectory, π(·|s_t) denotes the probability distribution over actions in state s_t, H(π(·|s_t)) is the entropy term, α is the temperature parameter, and E denotes the expectation.
Further, the two soft Q networks of the critic, with parameters θ, are updated in the SAC algorithm using the following formula:
∇̂_θ L_Q(θ) = ∇_θ Q_θ(a_t, s_t) · ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ·V_ψ̄(s_{t+1}) )
where ∇_θ denotes the gradient operator with respect to θ, L_Q(θ) denotes the soft Q network loss function with parameters θ, ∇̂_θ L_Q(θ) denotes the gradient of the soft Q network Bellman loss function, ∇_θ Q_θ(a_t, s_t) denotes the gradient with respect to θ of the Q-value function evaluated at action a_t and state s_t, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy, r(s_t, a_t) denotes the immediate reward obtained after taking action a_t in state s_t, V_ψ̄ is the target value network, V_ψ̄(s_{t+1}) denotes the target state value function at state s_{t+1}, and γ denotes the discount rate, which discounts future rewards.
Further, the soft value network with parameter ψ in the SAC algorithm is updated using the following formula:
∇̂_ψ L_V(ψ) = ∇_ψ V_ψ(s_t) · ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t | s_t) )
where ∇_ψ denotes the gradient operator with respect to ψ, L_V(ψ) denotes the soft value network loss function, ∇̂_ψ L_V(ψ) denotes the gradient of the soft value network loss function, ∇_ψ V_ψ(s_t) denotes the derivative of V_ψ(s_t) with respect to ψ, V_ψ(s_t) denotes the state value function with parameter ψ, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy with parameter θ, and log π_φ(a_t | s_t) is the log policy probability given state s_t and action a_t.
Further, the policy network parameter φ in the SAC algorithm is updated using the following formula:
∇̂_φ L_π(φ) = ∇_φ log π_φ(a_t | s_t) + ( ∇_{a_t} log π_φ(a_t | s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) · ∇_φ f_φ(τ_t; s_t)
where ∇_φ denotes the gradient operator with respect to φ, L_π(φ) denotes the policy network loss function, ∇̂_φ L_π(φ) denotes the gradient of the policy network loss function, ∇_φ log π_φ(a_t | s_t) denotes the derivative with respect to φ of the log policy probability, π_φ denotes the policy function with parameter φ and, given state s_t and action a_t, its gradient is taken with respect to the parameter φ; f_φ(τ_t; s_t) denotes the neural-network re-parameterized policy, ∇_{a_t} denotes the derivative with respect to a_t, a_t is implicitly defined by a_t = f_φ(τ_t; s_t), and τ_t denotes an input noise vector; Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy.
The beneficial effects of this technical solution are as follows: the system can effectively exploit the STAR-RIS to obtain wireless signal gains, and, compared with other reinforcement learning algorithms, the proposed deep reinforcement learning algorithm achieves better results and greatly improves the secrecy rate of the system.
Drawings
The drawings of the present invention are described below.
FIG. 1 is a schematic diagram of a STAR-RIS;
FIG. 2 is a schematic diagram of a STAR-RIS assistance based communication system model according to the present invention;
FIG. 3 is a schematic diagram of the algorithm flow structure of the present invention;
FIG. 4 is a schematic diagram of the structure of a neural network involved in the algorithm of the present invention;
FIG. 5 is a graph of simulation results of the total secrecy rate of the system's reflecting and transmitting users under different learning rates;
FIG. 6 is a diagram of the results of a simulation comparing the algorithm proposed by the present invention with other reinforcement learning algorithms;
FIG. 7 is a diagram of the results of comparing the proposed algorithm with other reinforcement learning algorithms under different numbers of base-station antennas.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The invention adopts deep reinforcement learning and, in the presence of an untrusted eavesdropper, aims to maximize the secrecy rate of the legitimate user while satisfying constraints such as the base-station transmit power, the STAR-RIS coefficient matrices and energy splitting, together with the minimum energy-harvesting requirement of the untrusted eavesdropper. The method proposes a deep reinforcement learning algorithm based on soft-update actor-critic, comprehensively considers the number of users, the number of base-station antennas and the number of reflecting elements, and introduces an agent: the transmission and reflection coefficient matrices and the beamforming matrix form the action space, the channel state information forms the state space, and, with the secrecy rate as the basis, the secrecy rate under the instantaneous channel and the action at time step t serves as the reward. A reinforcement learning environment is constructed and the networks are trained to solve the optimization problem.
Specifically, the invention provides a resource optimization method of a STAR-RIS communication system based on deep reinforcement learning, which can comprise the following steps:
step 1: establishing a communication system model based on STAR-RIS assistance, wherein the communication system comprises a base station, STAR-RIS, a reflection user positioned in a reflection space, a transmission user positioned in a transmission space and an intelligent controller, the intelligent controller can adjust relevant parameters of the STAR-RIS according to the requirements of the communication system, and in the communication system model, a direct link exists between the base station and the transmission user as well as between the base station and the reflection user;
step 2: acquiring channel data, a beam forming matrix of a base station, a reflection coefficient matrix and a transmission coefficient matrix of STAR-RIS in an energy splitting mode, wherein the channel data comprises channel data from the base station to R users, channel data from the base station to T users, channel data from the base station to STAR-RIS, channel data from STAR-RIS to R users and channel data from STAR-RIS to T users;
step 3: establishing an optimization problem by taking maximization of the secrecy rate between the reflecting user and the transmitting user as the objective function, and taking the STAR-RIS energy conservation relation, the transmitting user's energy harvesting, and the maximum power of the base station as constraint conditions;
step 4: and designing a state space, an action space and a reward function of reinforcement learning, constructing a reinforcement learning environment according to the requirements of a communication system, and solving an optimization problem by adopting a SAC algorithm so as to determine the resource optimization configuration of the communication system.
FIG. 1 illustrates a STAR-RIS assisted communication system. The base station transmits a signal; a portion of the signal is reflected into the same space as the incident signal, referred to as the reflection half-space (also called the "reflection space"), while another portion is transmitted into the space opposite to the incident space, called the transmission half-space (also called the "transmission space"). R users denote users located in the reflection space, i.e., reflecting users; T users denote users located in the transmission space, i.e., transmitting users. Regarding the coefficient adjustment of the STAR-RIS, the electromagnetic currents on the STAR-RIS elements can be manipulated to achieve intelligent tuning of the surface; compared with a conventional reflecting-only RIS, which has only reflection coefficients, a STAR-RIS needs two sets of coefficients, namely transmission coefficients and reflection coefficients, to reconfigure the transmitted and reflected signals. Compared with the conventional reflecting-only intelligent surface, the STAR-RIS can construct full-space coverage, which increases the design flexibility of the system and can improve the strength of the information received at the access points; it supports three operating modes: Energy Splitting (ES), Mode Switching (MS) and Time Switching (TS).
FIG. 2 is a schematic diagram of the STAR-RIS assisted communication system model of the invention. As shown in FIG. 2, the invention considers a downlink secure transmission communication system scenario. The communication system comprises a base station, a STAR-RIS, a reflecting user located in the reflection space, a transmitting user located in the transmission space, and an intelligent controller that can adjust the relevant parameters of the STAR-RIS according to the requirements of the communication system. The base station transmits signals through its antennas and provides communication coverage, ensuring that users obtain signal connectivity in different geographic locations and environments. The base station is equipped with M antennas, and the STAR-RIS comprises N reflective units. The R user is on the same side of the STAR-RIS as the base station, i.e., in the reflection space, and can receive confidential information from the base station. The T user is on the other side of the STAR-RIS, i.e., in the transmission space; this user may eavesdrop on the R user and needs to meet a certain energy-harvesting requirement. Both the R user and the T user are single-antenna users. In the communication system model, there are direct links from the base station to both the R user and the T user.
Taking into account the positions and heights of the base station, the STAR-RIS and the users, the positions of the base station, the R user and the T user are defined as (x_b, y_b, z_b)^T, (x_r, y_r, z_r)^T and (x_t', y_t', z_t')^T, respectively, and the position of the STAR-RIS is defined likewise. Without loss of generality, the STAR-RIS is provided with N reflective elements. The invention employs the Energy Splitting (ES) mode, in which the signal incident on each element is split into a reflected signal and a transmitted signal, and the signal energy is split into two parts. In this mode, the transmission parameters (i.e., the transmission coefficient matrix) and the reflection parameters (i.e., the reflection coefficient matrix) of the STAR-RIS can be defined as Θ_t' = diag(β_1^t e^{jθ_1^t}, …, β_N^t e^{jθ_N^t}) and Θ_r = diag(β_1^r e^{jθ_1^r}, …, β_N^r e^{jθ_N^r}), where β_n^t and β_n^r denote the amplitude of the n-th element in transmission and in reflection, respectively, which satisfy the energy conservation relation, and θ_n^t and θ_n^r denote the phase shift of the n-th element in transmission and in reflection, respectively, with θ_n^t, θ_n^r ∈ [0, 2π).
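As an illustration of the energy-splitting model above, the following minimal sketch builds the diagonal reflection and transmission coefficient matrices from per-element splitting ratios and phases. The helper name is hypothetical, and it assumes the convention common in the STAR-RIS literature in which the per-element energy fractions satisfy β_n^t + β_n^r = 1 and the matrix entries are √β·e^{jθ}; the patent's exact convention is not reproduced here.

```python
import numpy as np

def es_coefficient_matrices(beta_r, theta_r, theta_t):
    """Build STAR-RIS reflection/transmission coefficient matrices (energy-splitting mode).

    beta_r  : (N,) reflected energy fraction per element, in [0, 1]
    theta_r : (N,) reflection phase shifts, in [0, 2*pi)
    theta_t : (N,) transmission phase shifts, in [0, 2*pi)
    Assumes the common ES convention beta_t = 1 - beta_r (energy conservation).
    """
    beta_t = 1.0 - beta_r
    Theta_r = np.diag(np.sqrt(beta_r) * np.exp(1j * theta_r))
    Theta_t = np.diag(np.sqrt(beta_t) * np.exp(1j * theta_t))
    return Theta_r, Theta_t

# Example: N = 10 elements with random splitting ratios and phases.
rng = np.random.default_rng(0)
N = 10
Theta_r, Theta_t = es_coefficient_matrices(
    rng.uniform(0, 1, N), rng.uniform(0, 2 * np.pi, N), rng.uniform(0, 2 * np.pi, N)
)
# Per-element energies of the two matrices sum to one (energy conservation check).
assert np.allclose(np.abs(np.diag(Theta_r)) ** 2 + np.abs(np.diag(Theta_t)) ** 2, 1.0)
```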
In this system, there are direct links from the base station to both the T user and the R user. Thus, the base-station-to-STAR-RIS channel is H_BR, the direct channel from the base station to the R user is H_BU, the direct channel from the base station to the T user is H_BE, and the channels from the STAR-RIS to the R user and to the T user are denoted h_r and h_t', respectively. It is assumed that all channels follow the Rician distribution. Owing to the positions of the base station and the STAR-RIS, the channel H_BR has a line-of-sight (LoS) path; taking path loss and small-scale fading into account, H_BR can be expressed as H_BR = √ρ(d_{B,R}) · ( √(K_r/(K_r+1)) L_{B,R} + √(1/(K_r+1)) G_{B,R} ), where K_r denotes the Rician factor, G_{B,R} denotes the scattered (non-LoS) component, L_{B,R} denotes the line-of-sight component between the base station and the STAR-RIS, ρ(·) denotes the power-domain path loss (which depends on the distance d_{B,R} between the base station and the STAR-RIS and on the carrier frequency f_c).
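For concreteness, a minimal sketch of how a Rician channel of this form might be generated is given below; the function name, the path-loss value and the deterministic LoS component are placeholders (assumptions), since their exact expressions are not reproduced in this text.

```python
import numpy as np

def rician_channel(N, M, K_r, path_loss, rng):
    """Sample an N x M Rician channel: a scaled mix of a LoS component and Rayleigh scattering.

    K_r       : Rician factor (ratio of LoS power to scattered power)
    path_loss : linear power-domain path loss (placeholder for the model in the text)
    """
    # Placeholder LoS component; in practice it is built from array steering vectors.
    los = np.exp(1j * 2 * np.pi * rng.uniform(size=(N, M)))
    scatter = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    return np.sqrt(path_loss) * (np.sqrt(K_r / (1 + K_r)) * los
                                 + np.sqrt(1 / (1 + K_r)) * scatter)

rng = np.random.default_rng(1)
H_BR = rician_channel(N=10, M=4, K_r=3.0, path_loss=1e-6, rng=rng)
```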
The users move randomly, so a direct line-of-sight path between a user and the base station or the STAR-RIS cannot be guaranteed. Defining the beamforming vector as w, the signal received by the R user is S_u = [H_BU + H_BR Θ_r h_r] w + n_u, where n_u denotes the Gaussian noise contained in the signal received by the R user.
Similarly, the signal received by the T user is S_e = [H_BE + H_BR Θ_t' h_t'] w + n_c, where n_c denotes the Gaussian noise contained in the signal received by the T user. The energy harvested by the T user can be expressed as P_e = η_eh |[H_BE + H_BR Θ_t' h_t'] w|², where η_eh denotes the energy-harvesting efficiency (discount) factor.
The signal-to-noise ratios (SNRs) of the R user and the T user are expressed from the received signals above. For the STAR-RIS assisted secure-transmission communication system shown in FIG. 2, the objective of the invention is to maximize the secrecy rate achievable by the R user while meeting the energy-harvesting requirement of the T user; the variables to be optimized are the coefficients of the STAR-RIS elements, so the secrecy rate maximization problem can be modeled as:
s.t. P_e ≥ E_max,
P_B ≤ P_max
where σ² denotes the noise variance, P_e denotes the energy harvested by the transmitting user, E_max denotes the set threshold, β_n^r and β_n^t denote the reflection and transmission amplitudes of the n-th STAR-RIS element, N denotes the total number of STAR-RIS elements, P_B denotes the transmit power of the base station, and P_max denotes the maximum power that the base station must satisfy. The first constraint indicates that the harvested energy must be greater than the set minimum requirement; since the STAR-RIS is a passive device, the second constraint indicates that its amplitudes must obey the energy conservation theorem; and the third constraint indicates that the base station must not exceed the maximum power. Because the variables are coupled, the problem is non-convex, and the communication environment changes in real time and is complex (for example, users can move randomly, so the beams and coefficients must be adjusted to best serve the users' demands), so the optimization problem has a complex, high-dimensional structure; the invention therefore adopts reinforcement learning to solve it.
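To make the objective and constraint evaluation above concrete, the following minimal sketch computes the secrecy rate, the harvested energy and the power constraint for one candidate set (w, Θ_r, Θ_t'). The function name, the assumed matrix dimensions and the [·]^+ clipping of the secrecy rate are illustrative assumptions rather than the patent's exact formulation.

```python
import numpy as np

def secrecy_rate_and_constraints(w, Theta_r, Theta_t, H_BR, h_bu, h_be, h_r, h_t,
                                 sigma2=1e-9, eta_eh=0.8, E_max=1e-6, P_max=1.0):
    """Evaluate the objective and constraints for one candidate action.

    Dimensions assumed here: H_BR is N x M, h_bu/h_be are M-vectors (direct links),
    h_r/h_t are N-vectors (STAR-RIS to users), and w is an M-vector beamformer.
    The cascaded links are written as h^H Theta H_BR, a common convention.
    """
    g_u = h_bu.conj() + h_r.conj() @ Theta_r @ H_BR      # effective channel to the R user
    g_e = h_be.conj() + h_t.conj() @ Theta_t @ H_BR      # effective channel to the T user
    snr_u = np.abs(g_u @ w) ** 2 / sigma2                # SNR of the legitimate (R) user
    snr_e = np.abs(g_e @ w) ** 2 / sigma2                # SNR of the eavesdropping (T) user
    secrecy = max(np.log2(1 + snr_u) - np.log2(1 + snr_e), 0.0)   # [.]^+ secrecy rate
    P_e = eta_eh * np.abs(g_e @ w) ** 2                  # energy harvested by the T user
    P_B = np.linalg.norm(w) ** 2                         # transmit power of the base station
    return secrecy, (P_e >= E_max), (P_B <= P_max)
```

A reinforcement-learning environment can call such a routine at each step to produce the reward and to check the energy-harvesting and power constraints.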
FIG. 3 shows a STAR-RIS assisted secure communication framework based on the SAC deep reinforcement learning algorithm, in which an agent constantly interacts with the environment to obtain experience, saves the experience into an experience pool, then samples the experience and refines the strategy based on it, and eventually learns the optimal strategy that obtains the maximum reward. At each discrete time step t, the agent takes an action a_t ∈ A based on the state s_t ∈ S; the environment then gives feedback based on the action made by the agent, returns the corresponding reward r_t, and the next state s_{t+1} is observed, where S and A denote the state space and the action space, respectively. The SAC algorithm is an advanced off-policy algorithm for continuous control based on maximum entropy.
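A minimal sketch of this interaction loop (experience pool, sampling, and policy refinement) is given below; `env` and `agent` are placeholders standing in for the STAR-RIS environment and the SAC agent described in this document, and the method names are assumptions.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience pool

def train(env, agent, episodes=100, steps_per_episode=1000, batch_size=256):
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps_per_episode):
            action = agent.select_action(state)            # a_t ~ pi(.|s_t)
            next_state, reward, done = env.step(action)    # environment feedback r_t, s_{t+1}
            replay_buffer.append((state, action, reward, next_state, done))
            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                agent.update(batch)                        # soft Q / value / policy updates
            state = next_state
            if done:
                break
```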
The SAC algorithm proposed by the invention comprises an action network, an evaluation network and a value network (also referred to as the "V network"); the action network and the value network each contain a single network, while the evaluation network comprises two Q networks, the purpose of which is to reduce overestimation. For the secure transmission design problem, the invention considers the number of users, the number of base-station antennas and the number of reflecting elements, introduces an agent, takes the transmission and reflection coefficient matrices together with the beamforming matrix as the action space, takes the channel state information (i.e., the channel data) as the state space, and, with the secrecy rate as the basis, takes the secrecy rate under the instantaneous channel and the action at time step t as the reward, thereby constructing the environment. The specific details are as follows:
(1) State space: the state is designed to be all of the channel state information (CSI), namely all channel data from the base station to the STAR-RIS, from the base station to the reflection-space user (the R user) and to the transmission-space energy-harvesting user (the T user), and from the STAR-RIS to the reflecting user and to the transmission-space energy-harvesting user; that is, the state space is recorded as:
s_t = {H_{BR,t}, H_{BU,t}, H_{BE,t}, h_{r,t}, h_{t',t}}
where s_t denotes the state at time step t.
Considering that a neural network accepts only real numbers as inputs, the real and imaginary parts of the complex-valued state must be separated as independent inputs, so the dimension of the state space is 2MK + 2NK + 2MN, where M is the number of antennas, N is the number of reflecting elements, and K is the number of users in the whole space.
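The following sketch shows one way to flatten the complex channel matrices into the real-valued state vector described above; the helper name and the stacking order are assumptions, and the dimension check uses M = 4, N = 10 and K = 2 (one R user and one T user).

```python
import numpy as np

def build_state(H_BR, H_BU, H_BE, h_r, h_t):
    """Concatenate real and imaginary parts of all CSI into one real vector."""
    parts = [np.asarray(x).ravel() for x in (H_BR, H_BU, H_BE, h_r, h_t)]
    flat = np.concatenate(parts)
    return np.concatenate([flat.real, flat.imag])

# Dimension check for M = 4 antennas, N = 10 elements, K = 2 users (one R, one T):
M, N, K = 4, 10, 2
rng = np.random.default_rng(2)
c = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
s = build_state(c(N, M), c(M), c(M), c(N), c(N))
assert s.size == 2 * M * K + 2 * N * K + 2 * M * N   # = 2MK + 2NK + 2MN = 136
```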
(2) Action space: the transmission and reflection coefficient matrices and the beamforming matrix form the action space, which has dimension 3N + 2MK and is recorded as:
a_t = {ω_t, Θ_{r,t}, Θ_{t',t}}
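As an illustration, the sketch below maps a flat real-valued action of length 3N + 2MK back to the beamforming matrix and the two coefficient matrices. The particular decomposition (N splitting ratios, N reflection phases, N transmission phases, and 2MK real beamformer entries) is an assumption that is merely consistent with the stated dimension.

```python
import numpy as np

def decode_action(a, M, N, K):
    """Map a flat action vector of length 3N + 2MK to (w, Theta_r, Theta_t)."""
    beta_r = np.clip(a[:N], 0.0, 1.0)                    # reflected energy fractions
    theta_r = a[N:2 * N] * 2 * np.pi                     # reflection phases
    theta_t = a[2 * N:3 * N] * 2 * np.pi                 # transmission phases
    w_flat = a[3 * N:]
    w = (w_flat[:M * K] + 1j * w_flat[M * K:]).reshape(M, K)   # beamforming matrix
    Theta_r = np.diag(np.sqrt(beta_r) * np.exp(1j * theta_r))
    Theta_t = np.diag(np.sqrt(1.0 - beta_r) * np.exp(1j * theta_t))
    return w, Theta_r, Theta_t

M, N, K = 4, 10, 2
a = np.random.default_rng(3).uniform(-1, 1, 3 * N + 2 * M * K)
w, Theta_r, Theta_t = decode_action(a, M, N, K)
```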
(3) Reward: in reinforcement learning, the purpose of the agent is to maximize the reward. Since the optimization objective of the communication system model in the invention is to maximize the secrecy rate, the secrecy rate is used as the reward at time step t during training; specifically, the achievable secrecy rate can be given by:
when reinforcement learning is used, if constraints exist on the system model, special algorithms are required to handle the constraints to avoid exceeding the constraints. Because of the energy harvesting constraints in the system, i.e. the energy harvested by the energy harvesting user must be greater than a threshold, a penalty term is constructed to be constrained, which penalty rule can be expressed as:
R(ω,Θ r ,Θ t' )′ t =ξR(ω,Θ r ,Θ t' ) t
where ζ represents a penalty factor with respect to rewards, which may be expressed as:
wherein,the t time step is represented to transmit the energy collected by the user, namely when the collected energy is larger than a designed threshold value, the penalty factor is 1, and no penalty is given when the constraint is met; when the energy is smaller than the designed threshold, a certain penalty is given, namely, subtracting +_on the basis of the original rewards>
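A minimal sketch of this penalty-augmented reward is shown below; since the exact penalty magnitude is not reproduced here, a simple penalty proportional to the energy shortfall is assumed.

```python
def penalized_reward(secrecy_rate, harvested_energy, E_max):
    """Scale/penalize the secrecy-rate reward according to the energy-harvesting constraint."""
    if harvested_energy >= E_max:
        return secrecy_rate            # constraint met: xi = 1, no penalty
    # Constraint violated: subtract a penalty from the original reward
    # (proportional to the energy shortfall; the exact form is an assumption).
    return secrecy_rate - (E_max - harvested_energy) / E_max
```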
Unlike traditional deep reinforcement learning algorithms, SAC adopts a stochastic policy and, in addition to maximizing the reward, also considers maximizing the entropy, which makes the policy more exploratory. Thus, the maximum-entropy objective function is defined as follows:
π* = arg max_π Σ_t E_{(s_t, a_t)∼ρ_π}[ r(s_t, a_t) + α·H(π(·|s_t)) ]
where π* denotes the optimal solution of policy π, ρ_π denotes the distribution of state-action pairs, r(s_t, a_t) denotes the return of the trajectory, π(·|s_t) denotes the probability distribution over actions in state s_t, H(π(·|s_t)) is the entropy term, E denotes the expectation, and α is the temperature parameter, which determines the importance of the entropy term relative to the reward.
In the soft policy evaluation, the state value function V_soft(s_t) and the action-state value function Q_soft(s_t, a_t) of SAC can be defined, respectively, as V_soft(s_t) = E_{a_t∼π}[ Q_soft(s_t, a_t) − log π(a_t | s_t) ] and Q_soft(s_t, a_t) = r(s_t, a_t) + γ·E_{s_{t+1}}[ V_soft(s_{t+1}) ].
In the policy improvement update step, the KL divergence is adopted for simplicity, and the formula is as follows:
π_new = arg min_{π'∈Π} D_KL( π'(·|s_t) ‖ exp(Q^{π_old}(s_t, ·)) / Z^{π_old}(s_t) )
where π_new denotes the new policy; in order to ensure that the policy remains tractable in practical scenarios, the invention defines a constraint, i.e., the policy is limited to a policy set Π; D_KL denotes the KL divergence; Q^{π_old}(s_t, ·) denotes the action-state value function corresponding to the old policy; and Z^{π_old}(s_t) is the partition function that normalizes the distribution, which can be neglected when taking the gradient.
In the algorithm provided by the invention, several sets of network parameters need to be updated, namely the soft value network parameter ψ, the two soft Q network parameters θ of the critic, and the policy network parameter φ. The soft Q networks are updated using the following formula:
∇̂_θ L_Q(θ) = ∇_θ Q_θ(a_t, s_t) · ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ·V_ψ̄(s_{t+1}) )
where ∇_θ denotes the gradient operator with respect to θ, L_Q(θ) denotes the soft Q network loss function with parameters θ, ∇̂_θ L_Q(θ) denotes the gradient of the soft Q network Bellman loss function, ∇_θ Q_θ(a_t, s_t) denotes the gradient with respect to θ of the Q-value function evaluated at action a_t and state s_t, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy, r(s_t, a_t) denotes the immediate reward obtained after taking action a_t in state s_t, V_ψ̄ is the target value network, V_ψ̄(s_{t+1}) denotes the target state value function at state s_{t+1}, and γ denotes the discount rate, which discounts future rewards.
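A PyTorch sketch of this soft Q update is given below; the target construction follows the standard SAC recipe, and the network classes, tensor shapes and the absence of a terminal-state mask are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q1, q2, target_value, batch, gamma=0.99):
    """Bellman loss for the two critic (soft Q) networks.

    q1, q2       : Q-networks mapping (state, action) -> scalar
    target_value : target value network V_psi_bar (slowly updated copy)
    batch        : tensors (state, action, reward, next_state)
    """
    state, action, reward, next_state = batch
    with torch.no_grad():
        q_target = reward + gamma * target_value(next_state)   # r + gamma * V_psi_bar(s')
    loss1 = F.mse_loss(q1(state, action), q_target)
    loss2 = F.mse_loss(q2(state, action), q_target)
    return loss1 + loss2
```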
The soft value function is trained using an unbiased gradient estimate:
∇̂_ψ L_V(ψ) = ∇_ψ V_ψ(s_t) · ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t | s_t) )
where ∇_ψ denotes the gradient operator with respect to ψ, L_V(ψ) denotes the soft value network loss function, ∇̂_ψ L_V(ψ) denotes the gradient of the soft value network loss function, ∇_ψ V_ψ(s_t) denotes the derivative of V_ψ(s_t) with respect to ψ, V_ψ(s_t) denotes the state value function with parameter ψ, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy with parameter θ, and log π_φ(a_t | s_t) is the log policy probability given state s_t and action a_t.
The policy network (action network) is updated by minimizing the KL divergence and then performing a gradient update; the update formula is:
∇̂_φ L_π(φ) = ∇_φ log π_φ(a_t | s_t) + ( ∇_{a_t} log π_φ(a_t | s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) · ∇_φ f_φ(τ_t; s_t)
where ∇_φ denotes the gradient operator with respect to φ, L_π(φ) denotes the policy network loss function, ∇̂_φ L_π(φ) denotes the gradient of the policy network loss function, ∇_φ log π_φ(a_t | s_t) denotes the derivative with respect to φ of the log policy probability given state s_t and action a_t, and π_φ denotes the policy function with parameter φ. For policy-gradient methods, a typical solution is to use a likelihood-ratio gradient estimator, which does not require back-propagation through the policy and the target density network. In this case, however, the target density is the Q function, which is represented by a differentiable neural network, so it is more convenient to use the re-parameterization technique, which yields a lower-variance estimate. The policy is re-parameterized by a neural network transformation: f_φ(τ_t; s_t) denotes the re-parameterized policy, ∇_{a_t} denotes the derivative with respect to a_t, a_t is implicitly defined by a_t = f_φ(τ_t; s_t), and τ_t denotes an input noise vector.
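The sketch below expresses the value-network and re-parameterized policy losses described above in PyTorch; it assumes the actor returns a torch.distributions.Normal over actions, and the temperature α and the use of the minimum of the two Q networks are conventional SAC choices rather than details taken from this text.

```python
import torch

def value_and_policy_losses(actor, q1, q2, value_net, state, alpha=0.2):
    """Soft value loss and reparameterized policy loss (SAC-style sketch).

    actor(state) is assumed to return a torch.distributions.Normal over actions.
    """
    dist = actor(state)
    action = dist.rsample()                              # reparameterization: a = f(tau; s)
    log_prob = dist.log_prob(action).sum(-1, keepdim=True)
    q_min = torch.min(q1(state, action), q2(state, action))

    # Value target: E_a[ Q(s,a) - alpha * log pi(a|s) ]
    value_loss = 0.5 * (value_net(state) - (q_min - alpha * log_prob).detach()).pow(2).mean()

    # Policy loss: E[ alpha * log pi(a|s) - Q(s,a) ]; gradients flow through rsample()
    policy_loss = (alpha * log_prob - q_min).mean()
    return value_loss, policy_loss
```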
FIG. 4 is a schematic diagram of the structure of the neural networks involved in the algorithm of the invention. As shown in FIG. 4, the action network, the evaluation network and the value network each comprise, in sequence, an input layer, a hidden layer, a ReLU activation layer and an output layer. That is, each network has the structure: an input layer, followed by a hidden layer with ReLU activation, followed by another hidden layer with ReLU activation, and finally the output layer. In addition, the action network adopts a Gaussian stochastic policy, and its final output is split into the mean and the logarithm of the variance. In FIG. 4, D_s denotes the state-space dimension and D_a denotes the action-space dimension.
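A PyTorch sketch of this structure (two hidden layers with ReLU activations; a Gaussian actor that outputs the mean and log-variance) might look as follows; the class names are assumptions, and the hidden width of 256 follows the simulation settings below.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q / value network: input -> hidden -> ReLU -> hidden -> ReLU -> output."""
    def __init__(self, d_in, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

class GaussianActor(nn.Module):
    """Action network: outputs the mean and log-variance of a Gaussian policy."""
    def __init__(self, d_s, d_a, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_s, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, d_a)
        self.log_var = nn.Linear(hidden, d_a)

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_var(h)
```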
Simulation results corresponding to the above optimization method are given below. The basic simulation parameters are set as follows: the number of antennas M is 4, the number of reflecting elements N is 10, a stochastic Gaussian policy is adopted, the maximum number of steps is 100000, the batch size is set to 256, the discount rate is 0.99, the number of hidden-layer units in the neural networks is 256, and the Adam optimizer is adopted as the optimizer of the neural networks. With the same parameters, different random seeds are designed so that multiple experiments are carried out for each method, and the corresponding Seaborn plots are drawn from the obtained data: the dark curve is the average secrecy rate over the multiple experiments, and the light shaded regions on both sides of the corresponding curve are the confidence intervals estimating the range of the values; the narrower the confidence interval, the higher the estimation precision and the better the performance and stability of the algorithm. FIG. 5 shows the relationship between the average secrecy rate and the number of steps for different learning rates; it can be seen from the simulation plot that the optimal learning rate (LR) is 0.001. In order to demonstrate the advantages of the proposed method, comparisons with other reinforcement learning algorithms, namely the DDPG algorithm and the TD3 algorithm, are added in the simulation, as shown in FIG. 6. Compared with DDPG and TD3, the proposed algorithm (SAC) has a narrower confidence interval, better stability and convergence, a higher obtained reward value and better performance. FIG. 7 shows a comparison of the average secrecy rate of the different algorithms for different numbers of antennas M; it can be seen that as the number of antennas increases, the achieved average secrecy rate increases, and the SAC algorithm provided by the invention achieves better performance than DDPG and TD3.
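For reference, these simulation settings can be collected into a small configuration sketch; the state and action dimensions are derived from the formulas above with K = 2, and everything not stated in the text (for example, the placeholder network) is an assumption.

```python
import torch
import torch.nn as nn

# Simulation settings from the text; values not stated there are assumptions.
M, N, K = 4, 10, 2
state_dim = 2 * M * K + 2 * N * K + 2 * M * N    # 136
action_dim = 3 * N + 2 * M * K                   # 46
hidden_units, batch_size, gamma, lr = 256, 256, 0.99, 1e-3   # lr = 0.001 was best in FIG. 5

# Example: an Adam optimizer (as used in the text) attached to a placeholder network.
net = nn.Sequential(nn.Linear(state_dim, hidden_units), nn.ReLU(),
                    nn.Linear(hidden_units, action_dim))
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
```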
Finally, it is noted that the embodiments described in this specification are merely illustrative of the invention. Those of ordinary skill in the art will appreciate that: various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (9)
1. A method for optimizing resources of a deep reinforcement learning-based STAR-RIS communication system, comprising:
step 1: establishing a communication system model based on STAR-RIS assistance, wherein the communication system comprises a base station, STAR-RIS, a reflection user positioned in a reflection space, a transmission user positioned in a transmission space and an intelligent controller, the intelligent controller can adjust relevant parameters of the STAR-RIS according to the requirements of the communication system, and in the communication system model, a direct link exists between the base station and the transmission user as well as between the base station and the reflection user;
step 2: acquiring channel data, a beam forming matrix of a base station, a reflection coefficient matrix and a transmission coefficient matrix of STAR-RIS in an energy splitting mode, wherein the channel data comprises channel data from the base station to R users, channel data from the base station to T users, channel data from the base station to STAR-RIS, channel data from STAR-RIS to R users and channel data from STAR-RIS to T users;
step 3: establishing an optimization problem by taking maximization of the secrecy rate between the reflecting user and the transmitting user as the objective function, and taking the STAR-RIS energy conservation relation, the transmitting user's energy harvesting, and the maximum power of the base station as constraint conditions;
step 4: and designing a state space, an action space and a reward function of reinforcement learning, constructing a reinforcement learning environment according to the requirements of a communication system, and solving an optimization problem by adopting a SAC algorithm so as to determine the resource optimization configuration of the communication system.
2. The method for optimizing resources of deep reinforcement learning based STAR-RIS communication system of claim 1, wherein the objective function is:
s.t. P_e ≥ E_max,
P_B ≤ P_max
wherein H_BU denotes the base-station-to-reflecting-user direct channel data, H_BE denotes the base-station-to-transmitting-user direct channel data, H_BR denotes the base-station-to-STAR-RIS channel data, h_r denotes the STAR-RIS-to-reflecting-user channel data, h_t' denotes the STAR-RIS-to-transmitting-user channel data, Θ_r denotes the reflection coefficient matrix of the STAR-RIS, Θ_t' denotes the transmission coefficient matrix of the STAR-RIS, w denotes the beamforming matrix of the base station, σ² denotes the noise variance, P_e denotes the energy harvested by the transmitting user, E_max denotes the set threshold, β_n^r and β_n^t denote the reflection and transmission amplitudes of the n-th element of the STAR-RIS, N denotes the total number of elements of the STAR-RIS, P_B denotes the transmit power of the base station, and P_max denotes the maximum power that the base station must satisfy.
3. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system according to claim 2, wherein the action space consists of the beamforming matrix of the base station and the reflection and transmission coefficient matrices of the STAR-RIS in energy splitting mode, i.e., a_t = {ω_t, Θ_{r,t}, Θ_{t',t}}, wherein a_t denotes the action at time step t, ω_t denotes the beamforming matrix of the base station at time step t, and Θ_{r,t} and Θ_{t',t} denote the reflection coefficient matrix and the transmission coefficient matrix at time step t, respectively.
The state space consists of the channel data from the base station to the R user, from the base station to the T user, from the base station to the STAR-RIS, from the STAR-RIS to the R user and from the STAR-RIS to the T user, i.e., s_t = {H_{BR,t}, H_{BU,t}, H_{BE,t}, h_{r,t}, h_{t',t}}, wherein s_t denotes the state at time step t, H_{BR,t}, H_{BU,t} and H_{BE,t} denote, at time step t, the base-station-to-STAR-RIS channel data, the base-station-to-reflecting-user direct channel data and the base-station-to-transmitting-user direct channel data, respectively, and h_{r,t} and h_{t',t} denote, at time step t, the STAR-RIS-to-reflecting-user channel data and the STAR-RIS-to-transmitting-user channel data, respectively;
The reward function is the secrecy rate at time step t, i.e.,
wherein R(ω, Θ_r, Θ_t')_t denotes the reward at time step t.
4. A method for optimizing resources of a deep reinforcement learning based STAR-RIS communication system according to claim 3, wherein, for the transmitting user's energy harvesting constraint, a penalty term is constructed to enforce the constraint, and the penalty rule is:
R(ω, Θ_r, Θ_t')′_t = ξ · R(ω, Θ_r, Θ_t')_t
wherein ξ denotes the penalty factor applied to the reward, which takes the value:
wherein P_{e,t} denotes the energy harvested by the transmitting user at time step t; when the harvested energy is greater than the threshold E_max, the penalty factor is 1, i.e., no penalty is given when the constraint is satisfied; otherwise, a penalty is given, i.e., a penalty term is subtracted from the original reward.
5. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system of claim 3, wherein the SAC algorithm model comprises an action network, an evaluation network comprising two Q networks, and a value network, wherein the action network, the evaluation network and the value network each comprise, in sequence, an input layer, a hidden layer, a ReLU activation layer and an output layer.
6. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system of claim 5, wherein solving the optimization problem by SAC algorithm comprises:
the entropy is taken as a part of the reward, and the maximum-entropy objective function is defined as:
π* = arg max_π Σ_t E_{(s_t, a_t)∼ρ_π}[ r(s_t, a_t) + α·H(π(·|s_t)) ]
wherein π* denotes the optimal solution of policy π, ρ_π denotes the distribution of state-action pairs, r(s_t, a_t) denotes the return of the trajectory, π(·|s_t) denotes the probability distribution over actions in state s_t, H(π(·|s_t)) is the entropy term, α is the temperature parameter, and E denotes the expectation.
7. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system of claim 6, wherein the two soft Q networks of the critic, with parameters θ, are updated in the SAC algorithm using the following formula:
∇̂_θ L_Q(θ) = ∇_θ Q_θ(a_t, s_t) · ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ·V_ψ̄(s_{t+1}) )
wherein ∇_θ denotes the gradient operator with respect to θ, L_Q(θ) denotes the soft Q network loss function with parameters θ, ∇̂_θ L_Q(θ) denotes the gradient of the soft Q network Bellman loss function, ∇_θ Q_θ(a_t, s_t) denotes the gradient with respect to θ of the Q-value function evaluated at action a_t and state s_t, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy, r(s_t, a_t) denotes the immediate reward obtained after taking action a_t in state s_t, V_ψ̄ is the target value network, V_ψ̄(s_{t+1}) denotes the target state value function at state s_{t+1}, and γ denotes the discount rate, which discounts future rewards.
8. The resource optimization method of deep reinforcement learning based STAR-RIS communication system according to claim 6, wherein the soft value network with parameter ψ in the SAC algorithm is updated using the following formula:
∇̂_ψ L_V(ψ) = ∇_ψ V_ψ(s_t) · ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t | s_t) )
wherein ∇_ψ denotes the gradient operator with respect to ψ, L_V(ψ) denotes the soft value network loss function, ∇̂_ψ L_V(ψ) denotes the gradient of the soft value network loss function, ∇_ψ V_ψ(s_t) denotes the derivative of V_ψ(s_t) with respect to ψ, V_ψ(s_t) denotes the state value function with parameter ψ; Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy with parameter θ, and log π_φ(a_t | s_t) is the log policy probability given state s_t and action a_t.
9. The method for optimizing resources of deep reinforcement learning based STAR-RIS communication system of claim 6, wherein the policy network parameter φ in the SAC algorithm is updated using the following formula:
∇̂_φ L_π(φ) = ∇_φ log π_φ(a_t | s_t) + ( ∇_{a_t} log π_φ(a_t | s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) · ∇_φ f_φ(τ_t; s_t)
wherein ∇_φ denotes the gradient operator with respect to φ, L_π(φ) denotes the policy network loss function, ∇̂_φ L_π(φ) denotes the gradient of the policy network loss function, ∇_φ log π_φ(a_t | s_t) denotes the derivative with respect to φ of the log policy probability, π_φ denotes the policy function with parameter φ and, given state s_t and action a_t, its gradient is taken with respect to the parameter φ; f_φ(τ_t; s_t) denotes the neural-network re-parameterized policy, ∇_{a_t} denotes the derivative with respect to a_t, a_t is implicitly defined by a_t = f_φ(τ_t; s_t), and τ_t denotes an input noise vector; Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311692409.XA CN117615393A (en) | 2023-12-11 | 2023-12-11 | Resource optimization method of STAR-RIS communication system based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311692409.XA CN117615393A (en) | 2023-12-11 | 2023-12-11 | Resource optimization method of STAR-RIS communication system based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117615393A true CN117615393A (en) | 2024-02-27 |
Family
ID=89953437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311692409.XA Pending CN117615393A (en) | 2023-12-11 | 2023-12-11 | Resource optimization method of STAR-RIS communication system based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117615393A (en) |
- 2023-12-11: Chinese application CN202311692409.XA filed; published as CN117615393A (legal status: pending)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118054828A (en) * | 2024-04-08 | 2024-05-17 | Ut斯达康通讯有限公司 | Intelligent super-surface-oriented beam forming method, device, equipment and storage medium |
CN118138175A (en) * | 2024-04-11 | 2024-06-04 | 江苏海洋大学 | Unmanned aerial vehicle anti-eavesdropping safety communication method based on reconfigurable intelligent reflecting surface |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 