CN117615393A - Resource optimization method of STAR-RIS communication system based on deep reinforcement learning - Google Patents
Resource optimization method of STAR-RIS communication system based on deep reinforcement learning

Info
- Publication number
- CN117615393A CN117615393A CN202311692409.XA CN202311692409A CN117615393A CN 117615393 A CN117615393 A CN 117615393A CN 202311692409 A CN202311692409 A CN 202311692409A CN 117615393 A CN117615393 A CN 117615393A
- Authority
- CN
- China
- Prior art keywords
- ris
- star
- representing
- base station
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/16—Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
- H04W28/18—Negotiating wireless communication parameters
- H04W28/22—Negotiating communication rate
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention provides a resource optimization method for a STAR-RIS communication system based on deep reinforcement learning. In the presence of an untrusted eavesdropper, the method adopts deep reinforcement learning to maximize the secrecy rate of the legitimate user while satisfying constraints such as the base-station transmit power, the STAR-RIS coefficient matrices and energy splitting, together with the minimum energy-harvesting requirement of the untrusted eavesdropper. The method proposes a deep reinforcement learning algorithm based on soft-update actor-critic, comprehensively considers the number of users, the number of base-station antennas and the number of reflecting elements, and introduces an agent: the transmission and reflection coefficient matrices and the beamforming matrix form the action space, the channel state information forms the state space, and, with the secrecy rate as the basis, the secrecy rate under the instantaneous channel and the action at time step t serves as the reward. A reinforcement learning environment is constructed and the networks are trained to solve the optimization problem. With this method, the secrecy rate of the system can be greatly improved.
Description
Technical Field
The invention relates to the technical field of wireless communication, in particular to a resource optimization method of a STAR-RIS communication system based on deep reinforcement learning.
Background
Ubiquitous connectivity will become a reality in the future: network structures will become increasingly complex, the demands on hardware and network capacity will rise, and energy consumption will grow, so high-frequency communication technologies such as millimeter wave and terahertz will become ideal choices for future wireless networks. In recent years, reconfigurable intelligent surfaces (RIS) have received considerable attention and are regarded as a key technology for next-generation wireless networks, with advantages such as low cost, programmability, ease of deployment and low energy consumption. A RIS is built from programmable metamaterials, so its electromagnetic characteristics can be changed and parameters such as amplitude, phase, frequency and polarization can be reconfigured.
Although the RIS is considered a potentially transformative technology for wireless communication systems, most prior work considers only the case where the RIS reflects the incident wireless signal. In that case the transmitter and receiver must be on the same side of the RIS, so the RIS covers only one half-space, i.e., 180° coverage on the reflection side, leaving a coverage dead zone. In practice, however, users are typically located on both sides of the RIS. The simultaneously transmitting and reflecting RIS (STAR-RIS) solves this problem: in a STAR-RIS aided communication system, part of the incident signal is reflected into the same half-space as the incident signal, called the reflection half-space (or simply "reflection space"), while the other part is transmitted into the half-space opposite to the incident space, called the transmission half-space (or simply "transmission space"), thereby achieving 360° full coverage.
In the field of wireless communications, energy is a very important resource. With the emergence of new wireless network applications such as smart grids, smart homes and intelligent detection, the energy consumption of wireless communication devices keeps increasing. Wireless energy transfer is an emerging technology that provides wireless charging services to wireless devices via radio frequency signals to address the energy limitations of wireless sensor networks. In a wirelessly powered communication network, an energy-harvesting user can collect and utilize energy from the electromagnetic waves transmitted by the base station, which greatly reduces operating cost and simplifies network maintenance. In addition, information security is one of the most serious problems in wireless communication. Because of the broadcast nature of the wireless channel, every user in a network can receive the electromagnetic waves the base station sends to other users and attempt to demodulate and decode them, which poses a significant challenge to secure communication. In the communication system considered by the invention in particular, the energy-harvesting user also enjoys the performance gain brought by the STAR-RIS and acts as a potential eavesdropper, creating a serious threat to the information security and privacy of the legitimate user; the invention therefore takes the physical-layer secrecy rate as the optimization objective.
Finally, in terms of conventional wireless resource optimization design, although convex optimization algorithms are widely applied to various complex optimization problems, conventional optimization methods are not suitable for the low-power devices of wireless networks and do not meet the low-latency requirements of future wireless networks. Moreover, the joint design of beamforming and of the STAR-RIS amplitudes and phases remains challenging in increasingly complex and dynamic environments; for example, adjusting the surface coefficients to best serve users' reflection or information needs is difficult for conventional optimization methods to handle, whereas reinforcement learning can cope with dynamic environments and learn and make decisions as the environment changes.
Disclosure of Invention
The invention provides a resource optimization method for a STAR-RIS communication system based on deep reinforcement learning, which addresses the problems described in the background art and comprises the following steps:
step 1: establishing a communication system model based on STAR-RIS assistance, wherein the communication system comprises a base station, STAR-RIS, a reflection user positioned in a reflection space, a transmission user positioned in a transmission space and an intelligent controller, the intelligent controller can adjust relevant parameters of the STAR-RIS according to the requirements of the communication system, and in the communication system model, a direct link exists between the base station and the transmission user as well as between the base station and the reflection user;
step 2: acquiring channel data, a beam forming matrix of a base station, a reflection coefficient matrix and a transmission coefficient matrix of STAR-RIS in an energy splitting mode, wherein the channel data comprises channel data from the base station to R users, channel data from the base station to T users, channel data from the base station to STAR-RIS, channel data from STAR-RIS to R users and channel data from STAR-RIS to T users;
step 3: establishing an optimization problem by taking maximization of the secrecy rate between the reflecting user and the transmitting user as the objective function, and taking the STAR-RIS energy conservation relation, the transmitting user's energy harvesting, and the maximum power of the base station as constraint conditions;
step 4: and designing a state space, an action space and a reward function of reinforcement learning, constructing a reinforcement learning environment according to the requirements of a communication system, and solving an optimization problem by adopting a SAC algorithm so as to determine the resource optimization configuration of the communication system.
Further, the objective function is:
s.t. P_e ≥ E_max,
P_B ≤ P_max
where H_BU denotes the base-station-to-reflecting-user direct channel, H_BE denotes the base-station-to-transmitting-user direct channel, H_BR denotes the base-station-to-STAR-RIS channel, h_r denotes the STAR-RIS-to-reflecting-user channel, h_t' denotes the STAR-RIS-to-transmitting-user channel, Θ_r denotes the reflection coefficient matrix of the STAR-RIS, Θ_t' denotes the transmission coefficient matrix of the STAR-RIS, w denotes the beamforming matrix of the base station, σ² denotes the noise variance, P_e denotes the energy harvested by the transmitting user, E_max denotes the set threshold, β_n^r and β_n^t denote the reflection and transmission amplitudes of the n-th element of the STAR-RIS, N denotes the total number of elements of the STAR-RIS, P_B denotes the transmit power of the base station, and P_max denotes the maximum power that the base station must satisfy.
Further, the action space consists of the beamforming matrix of the base station and the reflection and transmission coefficient matrices of the STAR-RIS in energy splitting mode, i.e., a_t = {ω_t, Θ_{r,t}, Θ_{t',t}}, where a_t denotes the action at time step t, ω_t denotes the beamforming matrix of the base station at time step t, and Θ_{r,t} and Θ_{t',t} denote the reflection coefficient matrix and the transmission coefficient matrix at time step t, respectively.
The state space consists of the channel data from the base station to the R user, from the base station to the T user, from the base station to the STAR-RIS, from the STAR-RIS to the R user and from the STAR-RIS to the T user, i.e., s_t = {H_{BR,t}, H_{BU,t}, H_{BE,t}, h_{r,t}, h_{t',t}}, where s_t denotes the state at time step t, H_{BR,t}, H_{BU,t} and H_{BE,t} denote, at time step t, the base-station-to-STAR-RIS channel, the base-station-to-reflecting-user direct channel and the base-station-to-transmitting-user direct channel, respectively, and h_{r,t} and h_{t',t} denote, at time step t, the STAR-RIS-to-reflecting-user channel and the STAR-RIS-to-transmitting-user channel, respectively;
The reward function is the secrecy rate at time step t, i.e.,
where R(ω, Θ_r, Θ_t')_t denotes the reward at time step t.
Further, for the transmitting user's energy harvesting constraint, a penalty term is constructed to enforce the constraint, and the penalty rule is:
R(ω, Θ_r, Θ_t')′_t = ξ · R(ω, Θ_r, Θ_t')_t
where ξ denotes the penalty factor applied to the reward, which takes the value:
where P_{e,t} denotes the energy harvested by the transmitting user at time step t; when the harvested energy is greater than the threshold E_max, the penalty factor is 1, i.e., no penalty is given when the constraint is satisfied; otherwise, a corresponding penalty is given, i.e., a penalty term is subtracted from the original reward.
Further, the SAC algorithm model comprises an action network, an evaluation network comprising two Q networks, and a value network, each of which comprises, in sequence, an input layer, a hidden layer, a ReLU activation layer and an output layer.
Further, the SAC algorithm is adopted to solve the optimization problem. Unlike other reinforcement learning algorithms, the SAC algorithm treats the entropy as part of the reward and maximizes the entropy while maximizing the reward, so the maximum-entropy objective function is defined as:
π* = arg max_π Σ_t E_{(s_t, a_t)∼ρ_π}[ r(s_t, a_t) + α·H(π(·|s_t)) ]
where π* denotes the optimal solution of policy π, ρ_π denotes the distribution of state-action pairs, r(s_t, a_t) denotes the return of the trajectory, π(·|s_t) denotes the probability distribution over actions in state s_t, H(π(·|s_t)) is the entropy term, α is the temperature parameter, and E denotes the expectation.
Further, the two soft Q networks of the critic, with parameters θ, are updated in the SAC algorithm using the following formula:
∇̂_θ L_Q(θ) = ∇_θ Q_θ(a_t, s_t) · ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ·V_ψ̄(s_{t+1}) )
where ∇_θ denotes the gradient operator with respect to θ, L_Q(θ) denotes the soft Q network loss function with parameters θ, ∇̂_θ L_Q(θ) denotes the gradient of the soft Q network Bellman loss function, ∇_θ Q_θ(a_t, s_t) denotes the gradient with respect to θ of the Q-value function evaluated at action a_t and state s_t, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy, r(s_t, a_t) denotes the immediate reward obtained after taking action a_t in state s_t, V_ψ̄ is the target value network, V_ψ̄(s_{t+1}) denotes the target state value function at state s_{t+1}, and γ denotes the discount rate, which discounts future rewards.
Further, the soft value network with parameter ψ in the SAC algorithm is updated using the following formula:
∇̂_ψ L_V(ψ) = ∇_ψ V_ψ(s_t) · ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t | s_t) )
where ∇_ψ denotes the gradient operator with respect to ψ, L_V(ψ) denotes the soft value network loss function, ∇̂_ψ L_V(ψ) denotes the gradient of the soft value network loss function, ∇_ψ V_ψ(s_t) denotes the derivative of V_ψ(s_t) with respect to ψ, V_ψ(s_t) denotes the state value function with parameter ψ, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy with parameter θ, and log π_φ(a_t | s_t) is the log policy probability given state s_t and action a_t.
Further, the policy network parameter φ in the SAC algorithm is updated using the following formula:
∇̂_φ L_π(φ) = ∇_φ log π_φ(a_t | s_t) + ( ∇_{a_t} log π_φ(a_t | s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) · ∇_φ f_φ(τ_t; s_t)
where ∇_φ denotes the gradient operator with respect to φ, L_π(φ) denotes the policy network loss function, ∇̂_φ L_π(φ) denotes the gradient of the policy network loss function, ∇_φ log π_φ(a_t | s_t) denotes the derivative with respect to φ of the log policy probability, π_φ denotes the policy function with parameter φ and, given state s_t and action a_t, its gradient is taken with respect to the parameter φ; f_φ(τ_t; s_t) denotes the neural-network re-parameterized policy, ∇_{a_t} denotes the derivative with respect to a_t, a_t is implicitly defined by a_t = f_φ(τ_t; s_t), and τ_t denotes an input noise vector; Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy.
The beneficial effects of this technical solution are as follows: the system can effectively exploit the STAR-RIS to obtain wireless signal gains, and, compared with other reinforcement learning algorithms, the proposed deep reinforcement learning algorithm achieves better results and greatly improves the secrecy rate of the system.
Drawings
The drawings of the present invention are described below.
FIG. 1 is a schematic diagram of a STAR-RIS;
FIG. 2 is a schematic diagram of a STAR-RIS assistance based communication system model according to the present invention;
FIG. 3 is a schematic diagram of the algorithm flow structure of the present invention;
FIG. 4 is a schematic diagram of the structure of a neural network involved in the algorithm of the present invention;
FIG. 5 is a graph of simulation results of the total secrecy rate of the system's reflecting and transmitting users under different learning rates;
FIG. 6 is a diagram of the results of a simulation comparing the algorithm proposed by the present invention with other reinforcement learning algorithms;
FIG. 7 is a diagram of the results of comparing the proposed algorithm with other reinforcement learning algorithms under different numbers of base-station antennas.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The invention adopts deep reinforcement learning and, in the presence of an untrusted eavesdropper, aims to maximize the secrecy rate of the legitimate user while satisfying constraints such as the base-station transmit power, the STAR-RIS coefficient matrices and energy splitting, together with the minimum energy-harvesting requirement of the untrusted eavesdropper. The method proposes a deep reinforcement learning algorithm based on soft-update actor-critic, comprehensively considers the number of users, the number of base-station antennas and the number of reflecting elements, and introduces an agent: the transmission and reflection coefficient matrices and the beamforming matrix form the action space, the channel state information forms the state space, and, with the secrecy rate as the basis, the secrecy rate under the instantaneous channel and the action at time step t serves as the reward. A reinforcement learning environment is constructed and the networks are trained to solve the optimization problem.
Specifically, the invention provides a resource optimization method of a STAR-RIS communication system based on deep reinforcement learning, which can comprise the following steps:
step 1: establishing a communication system model based on STAR-RIS assistance, wherein the communication system comprises a base station, STAR-RIS, a reflection user positioned in a reflection space, a transmission user positioned in a transmission space and an intelligent controller, the intelligent controller can adjust relevant parameters of the STAR-RIS according to the requirements of the communication system, and in the communication system model, a direct link exists between the base station and the transmission user as well as between the base station and the reflection user;
step 2: acquiring channel data, a beam forming matrix of a base station, a reflection coefficient matrix and a transmission coefficient matrix of STAR-RIS in an energy splitting mode, wherein the channel data comprises channel data from the base station to R users, channel data from the base station to T users, channel data from the base station to STAR-RIS, channel data from STAR-RIS to R users and channel data from STAR-RIS to T users;
step 3: establishing an optimization problem by taking maximization of the secrecy rate between the reflecting user and the transmitting user as the objective function, and taking the STAR-RIS energy conservation relation, the transmitting user's energy harvesting, and the maximum power of the base station as constraint conditions;
step 4: and designing a state space, an action space and a reward function of reinforcement learning, constructing a reinforcement learning environment according to the requirements of a communication system, and solving an optimization problem by adopting a SAC algorithm so as to determine the resource optimization configuration of the communication system.
FIG. 1 illustrates a STAR-RIS assisted communication system. The base station transmits a signal; a portion of the signal is reflected into the same space as the incident signal, referred to as the reflection half-space (also called the "reflection space"), while another portion is transmitted into the space opposite to the incident space, called the transmission half-space (also called the "transmission space"). R users denote users located in the reflection space, i.e., reflecting users; T users denote users located in the transmission space, i.e., transmitting users. Regarding the coefficient adjustment of the STAR-RIS, the electromagnetic currents on the STAR-RIS elements can be manipulated to achieve intelligent tuning of the surface; compared with a conventional reflecting-only RIS, which has only reflection coefficients, a STAR-RIS needs two sets of coefficients, namely transmission coefficients and reflection coefficients, to reconfigure the transmitted and reflected signals. Compared with the conventional reflecting-only intelligent surface, the STAR-RIS can construct full-space coverage, which increases the design flexibility of the system and can improve the strength of the information received at the access points; it supports three operating modes: Energy Splitting (ES), Mode Switching (MS) and Time Switching (TS).
FIG. 2 is a schematic diagram of the STAR-RIS assisted communication system model of the invention. As shown in FIG. 2, the invention considers a downlink secure transmission communication system scenario. The communication system comprises a base station, a STAR-RIS, a reflecting user located in the reflection space, a transmitting user located in the transmission space, and an intelligent controller that can adjust the relevant parameters of the STAR-RIS according to the requirements of the communication system. The base station transmits signals through its antennas and provides communication coverage, ensuring that users obtain signal connectivity in different geographic locations and environments. The base station is equipped with M antennas, and the STAR-RIS comprises N reflective units. The R user is on the same side of the STAR-RIS as the base station, i.e., in the reflection space, and can receive confidential information from the base station. The T user is on the other side of the STAR-RIS, i.e., in the transmission space; this user may eavesdrop on the R user and needs to meet a certain energy-harvesting requirement. Both the R user and the T user are single-antenna users. In the communication system model, there are direct links from the base station to both the R user and the T user.
Taking into account the positions and heights of the base station, the STAR-RIS and the users, the positions of the base station, the R user and the T user are defined as (x_b, y_b, z_b)^T, (x_r, y_r, z_r)^T and (x_t', y_t', z_t')^T, respectively, and the position of the STAR-RIS is defined likewise. Without loss of generality, the STAR-RIS is provided with N reflective elements. The invention employs the Energy Splitting (ES) mode, in which the signal incident on each element is split into a reflected signal and a transmitted signal, and the signal energy is split into two parts. In this mode, the transmission parameters (i.e., the transmission coefficient matrix) and the reflection parameters (i.e., the reflection coefficient matrix) of the STAR-RIS can be defined as Θ_t' = diag(β_1^t e^{jθ_1^t}, …, β_N^t e^{jθ_N^t}) and Θ_r = diag(β_1^r e^{jθ_1^r}, …, β_N^r e^{jθ_N^r}), where β_n^t and β_n^r denote the amplitude of the n-th element in transmission and in reflection, respectively, which satisfy the energy conservation relation, and θ_n^t and θ_n^r denote the phase shift of the n-th element in transmission and in reflection, respectively, with θ_n^t, θ_n^r ∈ [0, 2π).
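As an illustration of the energy-splitting model above, the following minimal sketch builds the diagonal reflection and transmission coefficient matrices from per-element splitting ratios and phases. The helper name is hypothetical, and it assumes the convention common in the STAR-RIS literature in which the per-element energy fractions satisfy β_n^t + β_n^r = 1 and the matrix entries are √β·e^{jθ}; the patent's exact convention is not reproduced here.

```python
import numpy as np

def es_coefficient_matrices(beta_r, theta_r, theta_t):
    """Build STAR-RIS reflection/transmission coefficient matrices (energy-splitting mode).

    beta_r  : (N,) reflected energy fraction per element, in [0, 1]
    theta_r : (N,) reflection phase shifts, in [0, 2*pi)
    theta_t : (N,) transmission phase shifts, in [0, 2*pi)
    Assumes the common ES convention beta_t = 1 - beta_r (energy conservation).
    """
    beta_t = 1.0 - beta_r
    Theta_r = np.diag(np.sqrt(beta_r) * np.exp(1j * theta_r))
    Theta_t = np.diag(np.sqrt(beta_t) * np.exp(1j * theta_t))
    return Theta_r, Theta_t

# Example: N = 10 elements with random splitting ratios and phases.
rng = np.random.default_rng(0)
N = 10
Theta_r, Theta_t = es_coefficient_matrices(
    rng.uniform(0, 1, N), rng.uniform(0, 2 * np.pi, N), rng.uniform(0, 2 * np.pi, N)
)
# Per-element energies of the two matrices sum to one (energy conservation check).
assert np.allclose(np.abs(np.diag(Theta_r)) ** 2 + np.abs(np.diag(Theta_t)) ** 2, 1.0)
```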
In this system, there are direct links from the base station to both the T user and the R user. Thus, the base-station-to-STAR-RIS channel is H_BR, the direct channel from the base station to the R user is H_BU, the direct channel from the base station to the T user is H_BE, and the channels from the STAR-RIS to the R user and to the T user are denoted h_r and h_t', respectively. It is assumed that all channels follow the Rician distribution. Owing to the positions of the base station and the STAR-RIS, the channel H_BR has a line-of-sight (LoS) path; taking path loss and small-scale fading into account, H_BR can be expressed as H_BR = √ρ(d_{B,R}) · ( √(K_r/(K_r+1)) L_{B,R} + √(1/(K_r+1)) G_{B,R} ), where K_r denotes the Rician factor, G_{B,R} denotes the scattered (non-LoS) component, L_{B,R} denotes the line-of-sight component between the base station and the STAR-RIS, ρ(·) denotes the power-domain path loss (which depends on the distance d_{B,R} between the base station and the STAR-RIS and on the carrier frequency f_c).
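For concreteness, a minimal sketch of how a Rician channel of this form might be generated is given below; the function name, the path-loss value and the deterministic LoS component are placeholders (assumptions), since their exact expressions are not reproduced in this text.

```python
import numpy as np

def rician_channel(N, M, K_r, path_loss, rng):
    """Sample an N x M Rician channel: a scaled mix of a LoS component and Rayleigh scattering.

    K_r       : Rician factor (ratio of LoS power to scattered power)
    path_loss : linear power-domain path loss (placeholder for the model in the text)
    """
    # Placeholder LoS component; in practice it is built from array steering vectors.
    los = np.exp(1j * 2 * np.pi * rng.uniform(size=(N, M)))
    scatter = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    return np.sqrt(path_loss) * (np.sqrt(K_r / (1 + K_r)) * los
                                 + np.sqrt(1 / (1 + K_r)) * scatter)

rng = np.random.default_rng(1)
H_BR = rician_channel(N=10, M=4, K_r=3.0, path_loss=1e-6, rng=rng)
```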
The users move randomly, so a direct line-of-sight path between a user and the base station or the STAR-RIS cannot be guaranteed. Defining the beamforming vector as w, the signal received by the R user is S_u = [H_BU + H_BR Θ_r h_r] w + n_u, where n_u denotes the Gaussian noise contained in the signal received by the R user.
Similarly, the signal received by the T user is S_e = [H_BE + H_BR Θ_t' h_t'] w + n_c, where n_c denotes the Gaussian noise contained in the signal received by the T user. The energy harvested by the T user can be expressed as P_e = η_eh |[H_BE + H_BR Θ_t' h_t'] w|², where η_eh denotes the energy-harvesting efficiency (discount) factor.
The signal-to-noise ratios (SNRs) of the R user and the T user are expressed from the received signals above. For the STAR-RIS assisted secure-transmission communication system shown in FIG. 2, the objective of the invention is to maximize the secrecy rate achievable by the R user while meeting the energy-harvesting requirement of the T user; the variables to be optimized are the coefficients of the STAR-RIS elements, so the secrecy rate maximization problem can be modeled as:
s.t. P_e ≥ E_max,
P_B ≤ P_max
where σ² denotes the noise variance, P_e denotes the energy harvested by the transmitting user, E_max denotes the set threshold, β_n^r and β_n^t denote the reflection and transmission amplitudes of the n-th STAR-RIS element, N denotes the total number of STAR-RIS elements, P_B denotes the transmit power of the base station, and P_max denotes the maximum power that the base station must satisfy. The first constraint indicates that the harvested energy must be greater than the set minimum requirement; since the STAR-RIS is a passive device, the second constraint indicates that its amplitudes must obey the energy conservation theorem; and the third constraint indicates that the base station must not exceed the maximum power. Because the variables are coupled, the problem is non-convex, and the communication environment changes in real time and is complex (for example, users can move randomly, so the beams and coefficients must be adjusted to best serve the users' demands), so the optimization problem has a complex, high-dimensional structure; the invention therefore adopts reinforcement learning to solve it.
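To make the objective and constraint evaluation above concrete, the following minimal sketch computes the secrecy rate, the harvested energy and the power constraint for one candidate set (w, Θ_r, Θ_t'). The function name, the assumed matrix dimensions and the [·]^+ clipping of the secrecy rate are illustrative assumptions rather than the patent's exact formulation.

```python
import numpy as np

def secrecy_rate_and_constraints(w, Theta_r, Theta_t, H_BR, h_bu, h_be, h_r, h_t,
                                 sigma2=1e-9, eta_eh=0.8, E_max=1e-6, P_max=1.0):
    """Evaluate the objective and constraints for one candidate action.

    Dimensions assumed here: H_BR is N x M, h_bu/h_be are M-vectors (direct links),
    h_r/h_t are N-vectors (STAR-RIS to users), and w is an M-vector beamformer.
    The cascaded links are written as h^H Theta H_BR, a common convention.
    """
    g_u = h_bu.conj() + h_r.conj() @ Theta_r @ H_BR      # effective channel to the R user
    g_e = h_be.conj() + h_t.conj() @ Theta_t @ H_BR      # effective channel to the T user
    snr_u = np.abs(g_u @ w) ** 2 / sigma2                # SNR of the legitimate (R) user
    snr_e = np.abs(g_e @ w) ** 2 / sigma2                # SNR of the eavesdropping (T) user
    secrecy = max(np.log2(1 + snr_u) - np.log2(1 + snr_e), 0.0)   # [.]^+ secrecy rate
    P_e = eta_eh * np.abs(g_e @ w) ** 2                  # energy harvested by the T user
    P_B = np.linalg.norm(w) ** 2                         # transmit power of the base station
    return secrecy, (P_e >= E_max), (P_B <= P_max)
```

A reinforcement-learning environment can call such a routine at each step to produce the reward and to check the energy-harvesting and power constraints.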
FIG. 3 shows a STAR-RIS assisted secure communication framework based on the SAC deep reinforcement learning algorithm, in which an agent constantly interacts with the environment to obtain experience, saves the experience into an experience pool, then samples the experience and refines the strategy based on it, and eventually learns the optimal strategy that obtains the maximum reward. At each discrete time step t, the agent takes an action a_t ∈ A based on the state s_t ∈ S; the environment then gives feedback based on the action made by the agent, returns the corresponding reward r_t, and the next state s_{t+1} is observed, where S and A denote the state space and the action space, respectively. The SAC algorithm is an advanced off-policy algorithm for continuous control based on maximum entropy.
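A minimal sketch of this interaction loop (experience pool, sampling, and policy refinement) is given below; `env` and `agent` are placeholders standing in for the STAR-RIS environment and the SAC agent described in this document, and the method names are assumptions.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience pool

def train(env, agent, episodes=100, steps_per_episode=1000, batch_size=256):
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps_per_episode):
            action = agent.select_action(state)            # a_t ~ pi(.|s_t)
            next_state, reward, done = env.step(action)    # environment feedback r_t, s_{t+1}
            replay_buffer.append((state, action, reward, next_state, done))
            if len(replay_buffer) >= batch_size:
                batch = random.sample(replay_buffer, batch_size)
                agent.update(batch)                        # soft Q / value / policy updates
            state = next_state
            if done:
                break
```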
The SAC algorithm proposed by the invention comprises an action network, an evaluation network and a value network (also referred to as the "V network"); the action network and the value network each contain a single network, while the evaluation network comprises two Q networks, the purpose of which is to reduce overestimation. For the secure transmission design problem, the invention considers the number of users, the number of base-station antennas and the number of reflecting elements, introduces an agent, takes the transmission and reflection coefficient matrices together with the beamforming matrix as the action space, takes the channel state information (i.e., the channel data) as the state space, and, with the secrecy rate as the basis, takes the secrecy rate under the instantaneous channel and the action at time step t as the reward, thereby constructing the environment. The specific details are as follows:
(1) State space: the state is designed to be all of the channel state information (CSI), namely all channel data from the base station to the STAR-RIS, from the base station to the reflection-space user (the R user) and to the transmission-space energy-harvesting user (the T user), and from the STAR-RIS to the reflecting user and to the transmission-space energy-harvesting user; that is, the state space is recorded as:
s_t = {H_{BR,t}, H_{BU,t}, H_{BE,t}, h_{r,t}, h_{t',t}}
where s_t denotes the state at time step t.
Considering that a neural network accepts only real numbers as inputs, the real and imaginary parts of the complex-valued state must be separated as independent inputs, so the dimension of the state space is 2MK + 2NK + 2MN, where M is the number of antennas, N is the number of reflecting elements, and K is the number of users in the whole space.
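The following sketch shows one way to flatten the complex channel matrices into the real-valued state vector described above; the helper name and the stacking order are assumptions, and the dimension check uses M = 4, N = 10 and K = 2 (one R user and one T user).

```python
import numpy as np

def build_state(H_BR, H_BU, H_BE, h_r, h_t):
    """Concatenate real and imaginary parts of all CSI into one real vector."""
    parts = [np.asarray(x).ravel() for x in (H_BR, H_BU, H_BE, h_r, h_t)]
    flat = np.concatenate(parts)
    return np.concatenate([flat.real, flat.imag])

# Dimension check for M = 4 antennas, N = 10 elements, K = 2 users (one R, one T):
M, N, K = 4, 10, 2
rng = np.random.default_rng(2)
c = lambda *shape: rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
s = build_state(c(N, M), c(M), c(M), c(N), c(N))
assert s.size == 2 * M * K + 2 * N * K + 2 * M * N   # = 2MK + 2NK + 2MN = 136
```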
(2) Action space: the transmission and reflection coefficient matrices and the beamforming matrix form the action space, which has dimension 3N + 2MK and is recorded as:
a_t = {ω_t, Θ_{r,t}, Θ_{t',t}}
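As an illustration, the sketch below maps a flat real-valued action of length 3N + 2MK back to the beamforming matrix and the two coefficient matrices. The particular decomposition (N splitting ratios, N reflection phases, N transmission phases, and 2MK real beamformer entries) is an assumption that is merely consistent with the stated dimension.

```python
import numpy as np

def decode_action(a, M, N, K):
    """Map a flat action vector of length 3N + 2MK to (w, Theta_r, Theta_t)."""
    beta_r = np.clip(a[:N], 0.0, 1.0)                    # reflected energy fractions
    theta_r = a[N:2 * N] * 2 * np.pi                     # reflection phases
    theta_t = a[2 * N:3 * N] * 2 * np.pi                 # transmission phases
    w_flat = a[3 * N:]
    w = (w_flat[:M * K] + 1j * w_flat[M * K:]).reshape(M, K)   # beamforming matrix
    Theta_r = np.diag(np.sqrt(beta_r) * np.exp(1j * theta_r))
    Theta_t = np.diag(np.sqrt(1.0 - beta_r) * np.exp(1j * theta_t))
    return w, Theta_r, Theta_t

M, N, K = 4, 10, 2
a = np.random.default_rng(3).uniform(-1, 1, 3 * N + 2 * M * K)
w, Theta_r, Theta_t = decode_action(a, M, N, K)
```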
(3) Reward: in reinforcement learning, the purpose of the agent is to maximize the reward. Since the optimization objective of the communication system model in the invention is to maximize the secrecy rate, the secrecy rate is used as the reward at time step t during training; specifically, the achievable secrecy rate can be given by:
when reinforcement learning is used, if constraints exist on the system model, special algorithms are required to handle the constraints to avoid exceeding the constraints. Because of the energy harvesting constraints in the system, i.e. the energy harvested by the energy harvesting user must be greater than a threshold, a penalty term is constructed to be constrained, which penalty rule can be expressed as:
R(ω,Θ r ,Θ t' )′ t =ξR(ω,Θ r ,Θ t' ) t
where ζ represents a penalty factor with respect to rewards, which may be expressed as:
wherein,the t time step is represented to transmit the energy collected by the user, namely when the collected energy is larger than a designed threshold value, the penalty factor is 1, and no penalty is given when the constraint is met; when the energy is smaller than the designed threshold, a certain penalty is given, namely, subtracting +_on the basis of the original rewards>
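A minimal sketch of this penalty-augmented reward is shown below; since the exact penalty magnitude is not reproduced here, a simple penalty proportional to the energy shortfall is assumed.

```python
def penalized_reward(secrecy_rate, harvested_energy, E_max):
    """Scale/penalize the secrecy-rate reward according to the energy-harvesting constraint."""
    if harvested_energy >= E_max:
        return secrecy_rate            # constraint met: xi = 1, no penalty
    # Constraint violated: subtract a penalty from the original reward
    # (proportional to the energy shortfall; the exact form is an assumption).
    return secrecy_rate - (E_max - harvested_energy) / E_max
```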
Unlike traditional deep reinforcement learning algorithms, SAC adopts a stochastic policy and, in addition to maximizing the reward, also considers maximizing the entropy, which makes the policy more exploratory. Thus, the maximum-entropy objective function is defined as follows:
π* = arg max_π Σ_t E_{(s_t, a_t)∼ρ_π}[ r(s_t, a_t) + α·H(π(·|s_t)) ]
where π* denotes the optimal solution of policy π, ρ_π denotes the distribution of state-action pairs, r(s_t, a_t) denotes the return of the trajectory, π(·|s_t) denotes the probability distribution over actions in state s_t, H(π(·|s_t)) is the entropy term, E denotes the expectation, and α is the temperature parameter, which determines the importance of the entropy term relative to the reward.
In the soft policy evaluation, the state value function V_soft(s_t) and the action-state value function Q_soft(s_t, a_t) of SAC can be defined, respectively, as V_soft(s_t) = E_{a_t∼π}[ Q_soft(s_t, a_t) − log π(a_t | s_t) ] and Q_soft(s_t, a_t) = r(s_t, a_t) + γ·E_{s_{t+1}}[ V_soft(s_{t+1}) ].
In the policy improvement update step, the KL divergence is adopted for simplicity, and the formula is as follows:
π_new = arg min_{π'∈Π} D_KL( π'(·|s_t) ‖ exp(Q^{π_old}(s_t, ·)) / Z^{π_old}(s_t) )
where π_new denotes the new policy; in order to ensure that the policy remains tractable in practical scenarios, the invention defines a constraint, i.e., the policy is limited to a policy set Π; D_KL denotes the KL divergence; Q^{π_old}(s_t, ·) denotes the action-state value function corresponding to the old policy; and Z^{π_old}(s_t) is the partition function that normalizes the distribution, which can be neglected when taking the gradient.
In the algorithm provided by the invention, several sets of network parameters need to be updated, namely the soft value network parameter ψ, the two soft Q network parameters θ of the critic, and the policy network parameter φ. The soft Q networks are updated using the following formula:
∇̂_θ L_Q(θ) = ∇_θ Q_θ(a_t, s_t) · ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ·V_ψ̄(s_{t+1}) )
where ∇_θ denotes the gradient operator with respect to θ, L_Q(θ) denotes the soft Q network loss function with parameters θ, ∇̂_θ L_Q(θ) denotes the gradient of the soft Q network Bellman loss function, ∇_θ Q_θ(a_t, s_t) denotes the gradient with respect to θ of the Q-value function evaluated at action a_t and state s_t, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy, r(s_t, a_t) denotes the immediate reward obtained after taking action a_t in state s_t, V_ψ̄ is the target value network, V_ψ̄(s_{t+1}) denotes the target state value function at state s_{t+1}, and γ denotes the discount rate, which discounts future rewards.
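A PyTorch sketch of this soft Q update is given below; the target construction follows the standard SAC recipe, and the network classes, tensor shapes and the absence of a terminal-state mask are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q1, q2, target_value, batch, gamma=0.99):
    """Bellman loss for the two critic (soft Q) networks.

    q1, q2       : Q-networks mapping (state, action) -> scalar
    target_value : target value network V_psi_bar (slowly updated copy)
    batch        : tensors (state, action, reward, next_state)
    """
    state, action, reward, next_state = batch
    with torch.no_grad():
        q_target = reward + gamma * target_value(next_state)   # r + gamma * V_psi_bar(s')
    loss1 = F.mse_loss(q1(state, action), q_target)
    loss2 = F.mse_loss(q2(state, action), q_target)
    return loss1 + loss2
```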
The soft value function is trained using an unbiased gradient estimate:
∇̂_ψ L_V(ψ) = ∇_ψ V_ψ(s_t) · ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t | s_t) )
where ∇_ψ denotes the gradient operator with respect to ψ, L_V(ψ) denotes the soft value network loss function, ∇̂_ψ L_V(ψ) denotes the gradient of the soft value network loss function, ∇_ψ V_ψ(s_t) denotes the derivative of V_ψ(s_t) with respect to ψ, V_ψ(s_t) denotes the state value function with parameter ψ, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy with parameter θ, and log π_φ(a_t | s_t) is the log policy probability given state s_t and action a_t.
The policy network (action network) is updated by minimizing the KL divergence and then performing a gradient update; the update formula is:
∇̂_φ L_π(φ) = ∇_φ log π_φ(a_t | s_t) + ( ∇_{a_t} log π_φ(a_t | s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) · ∇_φ f_φ(τ_t; s_t)
where ∇_φ denotes the gradient operator with respect to φ, L_π(φ) denotes the policy network loss function, ∇̂_φ L_π(φ) denotes the gradient of the policy network loss function, ∇_φ log π_φ(a_t | s_t) denotes the derivative with respect to φ of the log policy probability given state s_t and action a_t, and π_φ denotes the policy function with parameter φ. For policy-gradient methods, a typical solution is to use a likelihood-ratio gradient estimator, which does not require back-propagation through the policy and the target density network. In this case, however, the target density is the Q function, which is represented by a differentiable neural network, so it is more convenient to use the re-parameterization technique, which yields a lower-variance estimate. The policy is re-parameterized by a neural network transformation: f_φ(τ_t; s_t) denotes the re-parameterized policy, ∇_{a_t} denotes the derivative with respect to a_t, a_t is implicitly defined by a_t = f_φ(τ_t; s_t), and τ_t denotes an input noise vector.
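The sketch below expresses the value-network and re-parameterized policy losses described above in PyTorch; it assumes the actor returns a torch.distributions.Normal over actions, and the temperature α and the use of the minimum of the two Q networks are conventional SAC choices rather than details taken from this text.

```python
import torch

def value_and_policy_losses(actor, q1, q2, value_net, state, alpha=0.2):
    """Soft value loss and reparameterized policy loss (SAC-style sketch).

    actor(state) is assumed to return a torch.distributions.Normal over actions.
    """
    dist = actor(state)
    action = dist.rsample()                              # reparameterization: a = f(tau; s)
    log_prob = dist.log_prob(action).sum(-1, keepdim=True)
    q_min = torch.min(q1(state, action), q2(state, action))

    # Value target: E_a[ Q(s,a) - alpha * log pi(a|s) ]
    value_loss = 0.5 * (value_net(state) - (q_min - alpha * log_prob).detach()).pow(2).mean()

    # Policy loss: E[ alpha * log pi(a|s) - Q(s,a) ]; gradients flow through rsample()
    policy_loss = (alpha * log_prob - q_min).mean()
    return value_loss, policy_loss
```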
FIG. 4 is a schematic diagram of the structure of the neural networks involved in the algorithm of the invention. As shown in FIG. 4, the action network, the evaluation network and the value network each comprise, in sequence, an input layer, a hidden layer, a ReLU activation layer and an output layer. That is, each network has the structure: an input layer, followed by a hidden layer with ReLU activation, followed by another hidden layer with ReLU activation, and finally the output layer. In addition, the action network adopts a Gaussian stochastic policy, and its final output is split into the mean and the logarithm of the variance. In FIG. 4, D_s denotes the state-space dimension and D_a denotes the action-space dimension.
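A PyTorch sketch of this structure (two hidden layers with ReLU activations; a Gaussian actor that outputs the mean and log-variance) might look as follows; the class names are assumptions, and the hidden width of 256 follows the simulation settings below.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q / value network: input -> hidden -> ReLU -> hidden -> ReLU -> output."""
    def __init__(self, d_in, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

class GaussianActor(nn.Module):
    """Action network: outputs the mean and log-variance of a Gaussian policy."""
    def __init__(self, d_s, d_a, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_s, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, d_a)
        self.log_var = nn.Linear(hidden, d_a)

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_var(h)
```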
Simulation results corresponding to the above optimization method are given below. The basic simulation parameters are set as follows: the number of antennas M is 4, the number of reflecting elements N is 10, a stochastic Gaussian policy is adopted, the maximum number of steps is 100000, the batch size is set to 256, the discount rate is 0.99, the number of hidden-layer units in the neural networks is 256, and the Adam optimizer is adopted as the optimizer of the neural networks. With the same parameters, different random seeds are designed so that multiple experiments are carried out for each method, and the corresponding Seaborn plots are drawn from the obtained data: the dark curve is the average secrecy rate over the multiple experiments, and the light shaded regions on both sides of the corresponding curve are the confidence intervals estimating the range of the values; the narrower the confidence interval, the higher the estimation precision and the better the performance and stability of the algorithm. FIG. 5 shows the relationship between the average secrecy rate and the number of steps for different learning rates; it can be seen from the simulation plot that the optimal learning rate (LR) is 0.001. In order to demonstrate the advantages of the proposed method, comparisons with other reinforcement learning algorithms, namely the DDPG algorithm and the TD3 algorithm, are added in the simulation, as shown in FIG. 6. Compared with DDPG and TD3, the proposed algorithm (SAC) has a narrower confidence interval, better stability and convergence, a higher obtained reward value and better performance. FIG. 7 shows a comparison of the average secrecy rate of the different algorithms for different numbers of antennas M; it can be seen that as the number of antennas increases, the achieved average secrecy rate increases, and the SAC algorithm provided by the invention achieves better performance than DDPG and TD3.
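For reference, these simulation settings can be collected into a small configuration sketch; the state and action dimensions are derived from the formulas above with K = 2, and everything not stated in the text (for example, the placeholder network) is an assumption.

```python
import torch
import torch.nn as nn

# Simulation settings from the text; values not stated there are assumptions.
M, N, K = 4, 10, 2
state_dim = 2 * M * K + 2 * N * K + 2 * M * N    # 136
action_dim = 3 * N + 2 * M * K                   # 46
hidden_units, batch_size, gamma, lr = 256, 256, 0.99, 1e-3   # lr = 0.001 was best in FIG. 5

# Example: an Adam optimizer (as used in the text) attached to a placeholder network.
net = nn.Sequential(nn.Linear(state_dim, hidden_units), nn.ReLU(),
                    nn.Linear(hidden_units, action_dim))
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
```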
Finally, it is noted that the embodiments described in this specification are merely illustrative of the invention. Those of ordinary skill in the art will appreciate that: various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (9)
1. A method for optimizing resources of a deep reinforcement learning-based STAR-RIS communication system, comprising:
step 1: establishing a communication system model based on STAR-RIS assistance, wherein the communication system comprises a base station, STAR-RIS, a reflection user positioned in a reflection space, a transmission user positioned in a transmission space and an intelligent controller, the intelligent controller can adjust relevant parameters of the STAR-RIS according to the requirements of the communication system, and in the communication system model, a direct link exists between the base station and the transmission user as well as between the base station and the reflection user;
step 2: acquiring channel data, a beam forming matrix of a base station, a reflection coefficient matrix and a transmission coefficient matrix of STAR-RIS in an energy splitting mode, wherein the channel data comprises channel data from the base station to R users, channel data from the base station to T users, channel data from the base station to STAR-RIS, channel data from STAR-RIS to R users and channel data from STAR-RIS to T users;
step 3: establishing an optimization problem by taking maximization of the secrecy rate between the reflecting user and the transmitting user as the objective function, and taking the STAR-RIS energy conservation relation, the transmitting user's energy harvesting, and the maximum power of the base station as constraint conditions;
step 4: and designing a state space, an action space and a reward function of reinforcement learning, constructing a reinforcement learning environment according to the requirements of a communication system, and solving an optimization problem by adopting a SAC algorithm so as to determine the resource optimization configuration of the communication system.
2. The method for optimizing resources of deep reinforcement learning based STAR-RIS communication system of claim 1, wherein the objective function is:
s.t. P_e ≥ E_max,
P_B ≤ P_max
wherein H_BU denotes the base-station-to-reflecting-user direct channel data, H_BE denotes the base-station-to-transmitting-user direct channel data, H_BR denotes the base-station-to-STAR-RIS channel data, h_r denotes the STAR-RIS-to-reflecting-user channel data, h_t' denotes the STAR-RIS-to-transmitting-user channel data, Θ_r denotes the reflection coefficient matrix of the STAR-RIS, Θ_t' denotes the transmission coefficient matrix of the STAR-RIS, w denotes the beamforming matrix of the base station, σ² denotes the noise variance, P_e denotes the energy harvested by the transmitting user, E_max denotes the set threshold, β_n^r and β_n^t denote the reflection and transmission amplitudes of the n-th element of the STAR-RIS, N denotes the total number of elements of the STAR-RIS, P_B denotes the transmit power of the base station, and P_max denotes the maximum power that the base station must satisfy.
3. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system according to claim 2, wherein the action space consists of the beamforming matrix of the base station and the reflection and transmission coefficient matrices of the STAR-RIS in energy splitting mode, i.e., a_t = {ω_t, Θ_{r,t}, Θ_{t',t}}, wherein a_t denotes the action at time step t, ω_t denotes the beamforming matrix of the base station at time step t, and Θ_{r,t} and Θ_{t',t} denote the reflection coefficient matrix and the transmission coefficient matrix at time step t, respectively.
The state space consists of the channel data from the base station to the R user, from the base station to the T user, from the base station to the STAR-RIS, from the STAR-RIS to the R user and from the STAR-RIS to the T user, i.e., s_t = {H_{BR,t}, H_{BU,t}, H_{BE,t}, h_{r,t}, h_{t',t}}, wherein s_t denotes the state at time step t, H_{BR,t}, H_{BU,t} and H_{BE,t} denote, at time step t, the base-station-to-STAR-RIS channel data, the base-station-to-reflecting-user direct channel data and the base-station-to-transmitting-user direct channel data, respectively, and h_{r,t} and h_{t',t} denote, at time step t, the STAR-RIS-to-reflecting-user channel data and the STAR-RIS-to-transmitting-user channel data, respectively;
The reward function is the secrecy rate at time step t, i.e.,
wherein R(ω, Θ_r, Θ_t')_t denotes the reward at time step t.
4. A method for optimizing resources of a deep reinforcement learning based STAR-RIS communication system according to claim 3, wherein, for the transmitting user's energy harvesting constraint, a penalty term is constructed to enforce the constraint, and the penalty rule is:
R(ω, Θ_r, Θ_t')′_t = ξ · R(ω, Θ_r, Θ_t')_t
wherein ξ denotes the penalty factor applied to the reward, which takes the value:
wherein P_{e,t} denotes the energy harvested by the transmitting user at time step t; when the harvested energy is greater than the threshold E_max, the penalty factor is 1, i.e., no penalty is given when the constraint is satisfied; otherwise, a penalty is given, i.e., a penalty term is subtracted from the original reward.
5. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system of claim 3, wherein the SAC algorithm model comprises an action network, an evaluation network comprising two Q networks, and a value network, wherein the action network, the evaluation network and the value network each comprise, in sequence, an input layer, a hidden layer, a ReLU activation layer and an output layer.
6. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system of claim 5, wherein solving the optimization problem by SAC algorithm comprises:
the entropy is taken as a part of the reward, and the maximum-entropy objective function is defined as:
π* = arg max_π Σ_t E_{(s_t, a_t)∼ρ_π}[ r(s_t, a_t) + α·H(π(·|s_t)) ]
wherein π* denotes the optimal solution of policy π, ρ_π denotes the distribution of state-action pairs, r(s_t, a_t) denotes the return of the trajectory, π(·|s_t) denotes the probability distribution over actions in state s_t, H(π(·|s_t)) is the entropy term, α is the temperature parameter, and E denotes the expectation.
7. The resource optimization method of deep reinforcement learning-based STAR-RIS communication system of claim 6, wherein the two soft Q networks of the critic, with parameters θ, are updated in the SAC algorithm using the following formula:
∇̂_θ L_Q(θ) = ∇_θ Q_θ(a_t, s_t) · ( Q_θ(s_t, a_t) − r(s_t, a_t) − γ·V_ψ̄(s_{t+1}) )
wherein ∇_θ denotes the gradient operator with respect to θ, L_Q(θ) denotes the soft Q network loss function with parameters θ, ∇̂_θ L_Q(θ) denotes the gradient of the soft Q network Bellman loss function, ∇_θ Q_θ(a_t, s_t) denotes the gradient with respect to θ of the Q-value function evaluated at action a_t and state s_t, Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy, r(s_t, a_t) denotes the immediate reward obtained after taking action a_t in state s_t, V_ψ̄ is the target value network, V_ψ̄(s_{t+1}) denotes the target state value function at state s_{t+1}, and γ denotes the discount rate, which discounts future rewards.
8. The resource optimization method of deep reinforcement learning based STAR-RIS communication system according to claim 6, wherein the soft value network with parameter ψ in the SAC algorithm is updated using the following formula:
∇̂_ψ L_V(ψ) = ∇_ψ V_ψ(s_t) · ( V_ψ(s_t) − Q_θ(s_t, a_t) + log π_φ(a_t | s_t) )
wherein ∇_ψ denotes the gradient operator with respect to ψ, L_V(ψ) denotes the soft value network loss function, ∇̂_ψ L_V(ψ) denotes the gradient of the soft value network loss function, ∇_ψ V_ψ(s_t) denotes the derivative of V_ψ(s_t) with respect to ψ, V_ψ(s_t) denotes the state value function with parameter ψ; Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy with parameter θ, and log π_φ(a_t | s_t) is the log policy probability given state s_t and action a_t.
9. The method for optimizing resources of deep reinforcement learning based STAR-RIS communication system of claim 6, wherein the policy network parameter φ in the SAC algorithm is updated using the following formula:
∇̂_φ L_π(φ) = ∇_φ log π_φ(a_t | s_t) + ( ∇_{a_t} log π_φ(a_t | s_t) − ∇_{a_t} Q_θ(s_t, a_t) ) · ∇_φ f_φ(τ_t; s_t)
wherein ∇_φ denotes the gradient operator with respect to φ, L_π(φ) denotes the policy network loss function, ∇̂_φ L_π(φ) denotes the gradient of the policy network loss function, ∇_φ log π_φ(a_t | s_t) denotes the derivative with respect to φ of the log policy probability, π_φ denotes the policy function with parameter φ and, given state s_t and action a_t, its gradient is taken with respect to the parameter φ; f_φ(τ_t; s_t) denotes the neural-network re-parameterized policy, ∇_{a_t} denotes the derivative with respect to a_t, a_t is implicitly defined by a_t = f_φ(τ_t; s_t), and τ_t denotes an input noise vector; Q_θ(s_t, a_t) denotes the Q estimate of taking action a_t in state s_t, representing the expected cumulative reward of the current policy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311692409.XA CN117615393A (en) | 2023-12-11 | 2023-12-11 | Resource optimization method of STAR-RIS communication system based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311692409.XA CN117615393A (en) | 2023-12-11 | 2023-12-11 | Resource optimization method of STAR-RIS communication system based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117615393A true CN117615393A (en) | 2024-02-27 |
Family
ID=89953437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311692409.XA Pending CN117615393A (en) | 2023-12-11 | 2023-12-11 | Resource optimization method of STAR-RIS communication system based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117615393A (en) |
- 2023-12-11: Chinese application CN202311692409.XA filed; published as CN117615393A (legal status: pending)
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118054828A (en) * | 2024-04-08 | 2024-05-17 | Ut斯达康通讯有限公司 | Intelligent super-surface-oriented beam forming method, device, equipment and storage medium |
CN118138175A (en) * | 2024-04-11 | 2024-06-04 | 江苏海洋大学 | Unmanned aerial vehicle anti-eavesdropping safety communication method based on reconfigurable intelligent reflecting surface |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 