CN115002804A - Intelligent beam width optimization method based on reinforcement learning - Google Patents
- Publication number
- CN115002804A (application CN202210526035.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- time unit
- codebook
- data transmission
- reward
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B7/00—Radio transmission systems, i.e. using radiation field
- H04B7/02—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
- H04B7/04—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
- H04B7/06—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
- H04B7/0613—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
- H04B7/0615—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal
- H04B7/0617—Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal for beam forming
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/046—Wireless resource allocation based on the type of the allocated resource the resource being in the space domain, e.g. beams
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses an intelligent beam width optimization method based on reinforcement learning. The method suppresses the beam drift effect in millimeter wave communication by dynamically adjusting the beam width: the algorithm models the dynamic beam width optimization problem as a Markov decision process and selects the optimal beam width for data transmission at each decision time. The state design characterizes, from multiple angles, how severe the beam drift effect of the current system is; each action corresponds to a different beam width, and the width of the data transmission beam is adjusted by sensing how fast the environment changes. In each state, the optimal data transmission beam width is selected by an Actor-Critic (AC) algorithm; the policy network and the value network are continuously updated during training to improve the soundness and reliability of the model's selections, and a dynamic network update mechanism is introduced to reduce the computational burden of network updates. The throughput of the millimeter wave communication system under the beam drift effect is improved, so that the quality of the communication link is continuously guaranteed.
Description
Technical Field
The invention belongs to the field of wireless communication, and particularly relates to an intelligent beam width optimization method for millimeter wave communication based on the Actor-Critic (AC) reinforcement learning algorithm.
Background
Millimeter waves have attracted attention as a potential technology for meeting the demand for high-speed wireless data transmission, owing to their large available bandwidth. However, compared with conventional low frequency bands, the free-space path loss of the millimeter wave band is large and its ability to diffract around obstacles is poor, so an array antenna composed of tens or even hundreds of antenna elements is required, with beamforming used to concentrate energy and improve the signal-to-noise ratio. Generally, the element spacing of the antenna array is of the same order of magnitude as the wavelength, which leaves large room for reducing the physical volume of a millimeter wave antenna array, so that it can be made small enough for a wide range of application scenarios. Beamforming makes millimeter wave links strongly directional, so beam tracking is needed to ensure the stability and quality of the communication link. Notably, due to the high cost of RF chains and the high power consumption of ADCs/DACs, millimeter wave communication systems typically adopt a hybrid digital-analog precoding architecture in which the number of RF chains is much smaller than the number of antennas.
Beamforming is only the basis for improving the signal-to-noise ratio; selecting a reasonable direction for the directional beam is another major prerequisite for obtaining a high signal-to-noise ratio. The main beam direction after beamforming must be aligned with the user in real time; that is, beam training is needed so that the transmitter's main beam stays aligned with, and tracks, a moving receiver. Most existing schemes perform beam training within a time slot, obtain a high-gain beam direction, and use that direction for data transmission within the slot. However, this presupposes that the communication environment between the user and the base station remains unchanged within each slot. When the receiving user keeps moving, its angle of departure or angle of arrival keeps changing; even if beam training aligns the transmit and receive main beams at the beginning of each slot, the receiver's movement within the slot changes its position relative to the transmitter, reducing the beamforming/array gain. This is called the beam drift effect; when severe, it degrades the quality of the communication link and may even cause communication interruption.
Beam training is a process of balancing beam alignment success rate against training efficiency: a narrower beam usually brings higher beamforming gain, but also a larger search burden and lower search accuracy. When the beam drift effect is considered, the assumption that the environment is invariant within a slot no longer holds, and the faster the in-slot environment changes, the more severely the beamforming gain is affected. To suppress the beam drift effect, a trade-off must be made between beamforming gain and the effective data transmission time. Although shortening each slot can mitigate beam drift, doing so increases the beam training overhead and reduces the effective achievable rate. Departing from the narrow-beam-only approach: when the beam is widened, the coverage of a single beam grows; although the maximum beamforming gain is lower than with a narrow beam, the wide beam can still achieve better data transmission efficiency than a narrow beam that fails to stay aligned within the slot. Therefore, adjusting the beam width in real time according to changes in the communication environment is an effective way to combat the beam drift effect.
Disclosure of Invention
The technical problem is as follows: in order to inhibit the beam drift effect and ensure the stability of a communication link, the invention provides an intelligent beam width optimization method, which counteracts the influence of the beam drift effect on the beam forming gain by adaptively adjusting the beam width. The method can improve the throughput of the millimeter wave communication system and ensure the stability of the quality of the communication link in the whole communication process.
The technical scheme is as follows: the intelligent beam width optimization method based on reinforcement learning comprises the following steps,
step 1, modeling the beam width optimization problem as a Markov Decision Process (MDP) and adopting an Actor-Critic algorithm that combines policy learning and value learning; codebooks of different resolutions, i.e. beam widths, are defined as the action space; the policy network and the value network are each formed by fully-connected networks, the policy network being responsible for selecting the optimal action at each moment and the value network for evaluating the policy network's selection based on feedback from the objective environment; through continuous interaction with the environment, the dynamic characteristics of the environment are sensed and the optimal codebook codeword is selected for data transmission;
step 2, initialization: establishing and initializing the policy network, the value network, the multi-resolution DFT codebook, the initial data transmission codebook, the time unit counter, the channel environment, and the network update threshold, and opening up storage space;
step 3, at the beginning of each decision time slot, constructing the current state from the obtained information according to the state design rule, and determining the optimal action, i.e. the beam width, through the policy network;
step 4, combining the optimal beam width and the optimal beam direction obtained by beam alignment, selecting the optimal code word in the selected codebook and performing data transmission;
step 5, calculating reward information from the information gathered during data transmission, compensating the reward, and judging whether the network update threshold is met; if so, executing step 6, otherwise jumping directly back to step 3 to start the next cycle;
and step 6, updating the parameters of the policy network and the value network based on the reward information.
In the above method:
step 1 first characterizes the beam width optimization problem as an MDP, with effective and reasonable design of the MDP model parameters, including the state design, action space and reward design; the beam width for data transmission is adaptively adjusted according to the severity of the beam drift effect, fully balancing the relationship between beamforming gain and effective data transmission time;
Let the time unit of the beam training process that determines the optimal beam direction be a time slot, and define the time scale on which the beam width is selected as a time unit, each time unit comprising M time slots. The MDP parameters are then defined as follows:
step 1.1. State design: within M time slots, i.e. one time unit, M groups of beam training experiences (S, A, R, S_{t+1}) are obtained, and thus M beam training rewards, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}    (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e. the effective achievable rate obtained by the beam training of that slot. An indirect result set obtained by meaningful computations on the R sequence is then taken as the state, i.e.

State = {R_av, R_cv, R_kur}    (2)
The meaning of each parameter is as follows:
- R_av denotes the mean of the R sequence, which represents its average magnitude:

R_av = (1/M) * Σ_{i=1}^{M} R_i    (3)

- R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence; the dispersion is usually represented by the variance or standard deviation, but when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data are likely to differ greatly, and the standard deviation is then unsuitable for comparison; the coefficient of variation eliminates the difference in measurement scale. It is computed by first calculating the standard deviation σ of the R sequence,

σ = sqrt( (1/M) * Σ_{i=1}^{M} (R_i − R_av)^2 )    (4)

The coefficient of variation is a normalized measure of the degree of dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av:

R_cv = σ / R_av    (5)

- R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing how peaked the probability density curve is at the mean value; it is a statistic describing the steepness of the distribution of all values in the population, and is introduced to give the system more information to learn from. The kurtosis R_kur is calculated as:

R_kur = (1/M) * Σ_{i=1}^{M} ((R_i − R_av) / σ)^4    (6)
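The three state features can be computed directly from the per-slot reward sequence. A minimal sketch following the definitions above (the non-excess, fourth-standardized-moment form of the kurtosis is an assumption, as the patent's original formula did not survive extraction):

```python
import math

def state_features(rewards):
    """Compute the MDP state {R_av, R_cv, R_kur} from the M per-slot
    beam-training rewards of one time unit."""
    m = len(rewards)
    r_av = sum(rewards) / m                                       # mean of the R sequence
    sigma = math.sqrt(sum((r - r_av) ** 2 for r in rewards) / m)  # standard deviation
    r_cv = sigma / r_av                                           # coefficient of variation
    # kurtosis as the fourth standardized moment of the sequence
    r_kur = sum(((r - r_av) / sigma) ** 4 for r in rewards) / m
    return r_av, r_cv, r_kur
```

A fast-changing environment shows up as a larger R_cv (more spread in per-slot rates), which is what lets the agent infer beam drift severity without channel state information.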
step 1.2. Action design: actions are defined as codebooks of different beam widths, i.e. different resolutions. The multi-resolution codebook is defined as F = {C_1, C_2, ..., C_K}, and the corresponding action space is defined as A = {a_1, a_2, ..., a_K}, where action a_k selects sub-codebook C_k.
Step 1.3. reward design: the reward is defined as the sum rate of M time slots per time unit, i.e. the rate
Wherein R is u,i Indicating the effective transmission rate corresponding to the ith time slot in the u time unit.
The initialization method of the step 2 is as follows:
step 2.1, constructing a value network V and a policy network π, each composed of fully-connected layers, and initializing the value network parameters ω and the policy network parameters θ;
step 2.2, constructing the multi-resolution DFT codebook F and randomly selecting an initial data transmission codebook;
step 2.3, initializing the time unit counter u = 1 and empirically initializing the network update threshold l_max;
step 2.4, opening up storage space: for time unit u, the storage space holds the optimal action selected in that time unit, the reward information generated by the M time-slot beam trainings within the time unit, and the state information derived from that reward information.
Step 3 executes the following sub-steps in sequence within time unit u:
step 3.1, obtaining the reward sequence R produced by the M time slots in the time unit according to equation (1);
step 3.2, calculating the state information s_u of the time unit according to equations (3)-(6);
step 3.3, obtaining a probability distribution over all actions through the policy network π: for the action space A, the probabilities with which the policy network π(θ) selects the actions in state s sum to 1; the best action a_u is then selected, together with the corresponding data transmission codebook C_{a_u}.
Step 4 the following steps are performed in sequence in time unit u:
step 4.1, assume the beam training process aimed at finding the best beam direction employs codebook C_1; according to the beam alignment result, the best beam of the time unit and its corresponding beam center angle φ_u are obtained;
step 4.2, the best codeword for data transmission is then selected in the chosen codebook C_{a_u} = {f_1, f_2, ..., f_{N_a}}, where f_i denotes the i-th codeword and N_a the number of codewords in the codebook. Let the beam center angle corresponding to codeword f_i be φ_i; the best data transmission codeword is then

f* = f_{i*},  i* = argmin_{i ∈ {1, ..., N_a}} |φ_i − φ_u|    (7)

where {1, ..., N_a} denotes the set of natural numbers from 1 to N_a. In a specific implementation, if the index relationships between codebooks of different resolutions follow specific objective rules, a direct mapping between the indices of the different codebooks can be constructed.
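Step 4.2's nearest-beam-center codeword selection can be sketched as follows, assuming both codebooks uniformly partition the same angular sector (the normalized [0, 1) grid here is an illustrative stand-in for the actual beam center angles):

```python
def map_codeword(train_idx, n_train, n_data):
    """Map the index of the best beam-training codeword (from a codebook
    with n_train codewords) to the codeword of the selected data
    transmission codebook (n_data codewords) whose beam center is closest,
    assuming both codebooks uniformly partition the same sector."""
    # beam center of the training codeword on a normalized [0, 1) grid
    center = (train_idx + 0.5) / n_train
    # nearest beam center in the data transmission codebook
    return min(range(n_data), key=lambda i: abs((i + 0.5) / n_data - center))
```

When the codebook sizes are powers of two over the same sector, this reduces to simple index arithmetic, which is the "direct mapping between different codebook indexes" the text mentions.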
Step 5 the following steps are performed in sequence in time unit u:
step 5.1) receiving the return from the external environment and calculating the reward R_u of the time unit according to equation (1); owing to the deviation between the codebook used for beam training and the codebook used for data transmission, the reward must be compensated; with the default beam training codebook C_1, the corrected reward is denoted R_u′;
step 5.2) judging whether the network update threshold is met: a normalized critic loss L_c is designed to characterize whether the network needs to continue updating; L_c is computed over an observation window U_l of recent value-network losses, where ω denotes the value network parameters; when the threshold condition on L_c is satisfied, the neural networks are trained.
Step 6 the following steps are performed in sequence in time unit u:
step 6.1, the best action a_u is performed in time unit u: the corresponding codeword is selected for data transmission and the compensated reward R_u′ is obtained; at the same time, the state s_{u+1} of the next time unit u+1 is obtained according to equation (2), and the best action ã_{u+1} of the next time unit is obtained from the policy network; this action is only used for the subsequent calculations and is not actually performed;
step 6.2, the state s_u and action a_u of the time unit are input to the value network to obtain q_u = q(s_u, a_u; ω_u); the next-time-unit state s_{u+1} and the predicted action ã_{u+1} are input to the value network to obtain q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u); the TD-error is then calculated as δ_u = q_u − (R_u′ + γ·q_{u+1});
step 6.3, differentiating the value network gives d_{ω,u} = ∂q(s_u, a_u; ω)/∂ω evaluated at ω_u; the value network is then updated by temporal difference as ω_{u+1} = ω_u − α·δ_u·d_{ω,u};
step 6.4, differentiating the policy network gives d_{θ,u} = ∂ln π(a_u | s_u; θ)/∂θ evaluated at θ_u; gradient ascent is then used to update the policy network parameters as θ_{u+1} = θ_u + β·q_u·d_{θ,u}, where q_u may be replaced by δ_u;
step 6.5, update counter u ← u + 1.
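The updates of steps 6.1-6.4 can be sketched with linear function approximators standing in for the fully-connected policy and value networks (the linear parameterization, softmax policy, and step sizes are illustrative assumptions, not the patent's network architecture):

```python
import numpy as np

S, K = 3, 4                            # state dim {R_av, R_cv, R_kur}; 4 codebooks
alpha, beta, gamma = 0.01, 0.01, 0.9   # step sizes and discount factor

def policy(s, theta):
    """pi(a|s; theta): softmax over a linear score, so probabilities sum to 1."""
    z = theta @ s
    e = np.exp(z - z.max())
    return e / e.sum()

def ac_step(s, a, r_comp, s_next, theta, omega):
    """One time-unit Actor-Critic update: TD error from the compensated
    reward r_comp, temporal-difference step on the value net, gradient
    ascent on the policy net (the text notes q_u may be replaced by
    delta_u in the policy update)."""
    a_next = int(np.argmax(policy(s_next, theta)))  # predicted action, never executed
    q_u = omega[a] @ s                              # q(s_u, a_u; omega_u)
    q_next = omega[a_next] @ s_next                 # q(s_{u+1}, a~_{u+1}; omega_u)
    delta = q_u - (r_comp + gamma * q_next)         # TD error delta_u
    omega = omega.copy()
    omega[a] = omega[a] - alpha * delta * s         # omega_{u+1} = omega_u - alpha*delta*d_omega
    # d_theta = grad of ln pi(a|s; theta) for the softmax-linear policy
    g = (np.eye(K)[a] - policy(s, theta))[:, None] * s[None, :]
    theta = theta + beta * q_u * g                  # theta_{u+1} = theta_u + beta*q_u*d_theta
    return theta, omega, delta
```

With fully-connected networks the two gradient terms d_{ω,u} and d_{θ,u} would come from backpropagation instead of the closed-form expressions used here.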
Has the advantages that: compared with the prior art, the invention has the following advantages:
1) Whereas beam alignment training conventionally uses only a single codebook, the more realistic case of relative environmental change within a time slot is considered: a multi-resolution codebook is introduced, the codebook with the most suitable resolution is adaptively selected at each decision point by sensing the rate of change of the channel direction angle within the slot, and the most efficient codeword is selected in combination with beam training, improving system performance.
2) The algorithm requires no channel state information. Taking the real-time variability of the actual environment into account, reinforcement learning is introduced to interact with the environment in real time, continuously learning by trial and error, perceiving how the environment changes, and making the most reasonable decision by fully exploiting the available computing power through a neural network.
3) Under the beam drift effect, and unlike the traditional approach, narrow beams are used for beam training while a multi-resolution codebook is introduced for data transmission, balancing beamforming gain against the effective data transmission time. Simulation shows the algorithm outperforms one using a single codebook, and the more severe the environmental change, i.e. the beam drift effect, the more pronounced the performance advantage.
4) The AC algorithm makes action selections based on policy learning, which is more direct than other algorithms. Through a reasonable and concise state design and the introduction of a dynamic training mechanism, the algorithm greatly reduces the training burden of the neural network, and it needs no large storage space to store long histories of context information.
Drawings
FIG. 1 is a flow chart of an intelligent beam width optimization algorithm;
FIG. 2 is a diagram of the information transfer between the beam training module and the data transmission module;
FIG. 3 is a diagram illustrating average EAR performance of different codebooks at different SNR;
FIG. 4 is a diagram illustrating the relationship between the average EAR performance and the environmental change rate in different codebooks.
Detailed Description
In order to make the technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to examples and software simulation.
The beam drift effect is taken into account, i.e. the beamforming gain may suffer because the alignment azimuth between transmitter and receiver changes during a beam training time slot. Without requiring channel state information, the algorithm introduces reinforcement learning to interact with the dynamic environment, continuously senses how the environment changes within a slot, and adaptively adjusts the data transmission beam width so as to keep the communication link quality stable.
In a specific implementation scheme, a beam width optimization problem is modeled into an MDP model, an optimal beam width is selected in each time unit through reasonable and efficient action, state and reward design, an optimal code word is selected for data transmission by combining a beam training alignment result, and neural network parameters are updated according to conditions in each time unit cycle to enable the selection to be more reasonable.
1. Model building
In the MDP model, the state space, action space, and reward are defined first. Taking the time slot as the minimum time unit, every M slots are defined as one time unit, i.e. each time unit comprises M slots.
Defining the state: within M time slots, i.e. one time unit, M groups of beam training experiences (S, A, R, S_{t+1}) can be obtained, and thus M rewards, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}    (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e. the effective achievable rate obtained by the beam training of that slot. An indirect result set obtained by meaningful computations on the R sequence is then adopted as the state, i.e.

State = {R_av, R_cv, R_kur}    (2)
The meaning of each parameter is as follows:
- R_av denotes the mean of the R sequence, which represents its average magnitude:

R_av = (1/M) * Σ_{i=1}^{M} R_i    (3)

- R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence; the dispersion is often represented by the variance or standard deviation, but when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data are likely to differ greatly; the standard deviation is then unsuitable for comparison, and the coefficient of variation eliminates the difference in measurement scale. It is computed by first calculating the standard deviation σ of the R sequence,

σ = sqrt( (1/M) * Σ_{i=1}^{M} (R_i − R_av)^2 )    (4)

The coefficient of variation is a normalized measure of the degree of dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av:

R_cv = σ / R_av    (5)

- R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing how peaked the probability density curve is at the mean; it is a statistic describing the steepness of the distribution of all values in the population, and is introduced to give the system more information to learn from. The kurtosis R_kur is calculated as:

R_kur = (1/M) * Σ_{i=1}^{M} ((R_i − R_av) / σ)^4    (6)
Defining the action space: actions are defined as codebooks of different beam widths, i.e. different resolutions. The multi-resolution codebook is defined as F = {C_1, C_2, ..., C_K}, and the corresponding action space is defined as A = {a_1, a_2, ..., a_K}.
Defining the reward: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}

where R_{u,i} denotes the effective transmission rate corresponding to the i-th time slot in the u-th time unit.
The policy network selects the optimal codebook C_{a_u} based on the state; the best codeword must then be selected in combination with the beam alignment result. A direct mapping function can be established between the codeword indices of the different codebooks, as in equation (7).
After data transmission, the return of the external environment is received as the reward and the reward is compensated: with C_1 as the default beam training codebook, the corrected reward is denoted R_u′, as in equation (8).
It is then judged whether the network update threshold is met: a normalized critic loss L_c is designed to characterize whether the network needs to continue updating. L_c is computed over an observation window U_l, where ω denotes the value network parameters. When the threshold condition on L_c is satisfied, the neural networks are trained; otherwise, the learning step is skipped.
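The dynamic update gate might be sketched as below; the window size, the normalization, and the threshold value are all assumptions, since the patent does not spell them out:

```python
from collections import deque

class UpdateGate:
    """One plausible reading of the dynamic network-update mechanism:
    keep the last U_l squared TD errors, normalize the windowed average
    by the largest entry in the window, and only train when the
    normalized critic loss exceeds the threshold l_max."""
    def __init__(self, window=20, l_max=0.5):
        self.errors = deque(maxlen=window)   # observation window U_l
        self.l_max = l_max                   # network update threshold

    def should_train(self, td_error):
        self.errors.append(td_error ** 2)
        peak = max(self.errors)
        if peak == 0.0:
            return False                     # value net is already accurate
        norm_loss = (sum(self.errors) / len(self.errors)) / peak
        return norm_loss > self.l_max
```

Whatever the exact normalization, the design intent stated in the text is the same: skip gradient updates when the critic's recent loss is small, reducing the computational burden of training.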
2. Implementation of AC-based beam width optimization algorithm
Initialization: establish a value network V and a policy network π; initialize the neural network parameters ω and θ and the network update threshold l_max; open up storage space; set the initial data transmission codebook; and let the time unit counter u = 1.
A circulating body: for each time unit u, the following steps are repeatedly performed:
(1) Obtain the reward sequence R produced by the M time slots in the time unit according to equation (1).
(2) Calculate the state information s_u of the time unit according to equations (3)-(6).
(3) Select the best action a_u through the policy network π and select the corresponding transmission codebook C_{a_u}.
(4) Determine the best codeword for data transmission according to equation (7) and perform data transmission.
(5) Receive the return value of the external environment and obtain the corrected, compensated reward R_u′ according to equation (8).
(6) Calculate the normalized critic loss; when the threshold condition is satisfied, proceed to step (7); otherwise go directly to step (10).
(7) Obtain the state s_{u+1} of the next time unit u+1 according to equation (2), and obtain the best action ã_{u+1} of the next time unit from the policy network.
(8) Compute q_u = q(s_u, a_u; ω_u) and q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u) through the value network, and calculate the TD-error δ_u = q_u − (R_u′ + γ·q_{u+1}).
(9) Update the value network and policy network parameters separately: ω_{u+1} = ω_u − α·δ_u·d_{ω,u}, θ_{u+1} = θ_u + β·q_u·d_{θ,u}.
(10) Update the counter u ← u + 1.
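The loop (1)-(10) can be sketched end to end. To keep the control flow visible, this sketch replaces the Actor-Critic networks with a simple epsilon-greedy value table and uses a toy reward model in place of the simulated millimeter wave channel; neither stand-in is the patent's actual component:

```python
import random

M, K = 8, 4          # slots per time unit; 4 sub-codebooks (illustrative sizes)

def beam_training_rewards(action, drift):
    """Toy environment stand-in (an assumption, not the patent's channel
    model): narrower codebooks (larger action index) have a higher peak
    rate but lose more of it as the environment drifts within a slot."""
    peak = 1.0 + action
    return [max(0.0, peak - drift * action * random.random()) for _ in range(M)]

def run(time_units=200, drift=0.5, eps=0.2):
    """Skeleton of the per-time-unit loop (1)-(10) with an epsilon-greedy
    value table standing in for the policy and value networks."""
    random.seed(1)
    q = [0.0] * K
    total = 0.0
    for _ in range(time_units):
        a = (random.randrange(K) if random.random() < eps
             else max(range(K), key=q.__getitem__))      # step (3): pick codebook
        rewards = beam_training_rewards(a, drift)        # steps (1), (4): M slots
        r_u = sum(rewards)                               # step (5): sum-rate reward
        q[a] += 0.1 * (r_u - q[a])                       # steps (6)-(9), simplified
        total += r_u                                     # step (10) is the loop itself
    return total / time_units
```

Even this crude stand-in exhibits the qualitative trade-off the patent describes: with a larger `drift`, the learned preference shifts away from the narrowest codebook.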
3. Simulation environment and results
The following environment is adopted during simulation:
1) The simulation generates a time-varying millimeter wave channel environment in an office; the base station center frequency is 28 GHz, the transmit power is 10 dBm, the office measures 20 × 20 meters, and the number of elements N of the uniform linear array (ULA) is 64.
2) The number of channel path clusters is randomly generated between 2 and 5, and for ease of tracking each cluster contains only one path. The power allocation between the line-of-sight (LOS) and non-line-of-sight (NLOS) paths is determined by the Rician factor K, and the attenuation of the NLOS paths approximately follows an exponential distribution. Since the exponential decay is fast, in practice only the first NLOS path is likely to be significant; the power of the other paths is approximately 0.
3) The uniform linear array (ULA) covers a 180° region with azimuth φ satisfying 0 < φ < π, while codebooks of different resolutions quantize the sine of the 0-π range with different granularities. The multi-resolution codebook F comprises 4 sub-codebooks of different resolutions with 8, 16, 32, and 64 codewords respectively. For convenience, the i-th sub-codebook (i = 1, 2, 3, 4) is abbreviated as MRC_i.
4) The environmental change rate is quantified by δ, and the larger the value of δ, the faster the environmental change.
And (3) simulation result analysis:
Fig. 3 shows the relationship between EAR performance and signal-to-noise ratio for the different sub-codebooks. The average effective achievable rate (EAR) represents the effective transmission rate per unit spectrum and characterizes the throughput of the system. It can be seen that as the signal-to-noise ratio increases, the EAR performance of the different sub-codebooks also increases. However, the EAR performance achieved by the BWS algorithm is much better than that achieved with any fixed sub-codebook. The reason is that the BWS algorithm senses the rate of change of the environment and adjusts the beam width in real time, achieving a good trade-off between obtaining high array gain and mitigating the effects of beam drift.
Fig. 4 shows the average EAR performance versus the rate of environmental change for the different codebooks. As the environment changes faster, the beam drift effect becomes more severe and the EAR performance of each fixed sub-codebook decreases accordingly. However, the EAR performance of the BWS algorithm is better than that of any fixed sub-codebook at every environmental change rate, and the performance gap widens as the environment changes faster. These observations indicate that the designed BWS algorithm effectively counteracts the beam drift effect in a dynamic environment and maintains good effective achievable rate performance.
In addition, it can be observed that when δ is small enough, i.e. the environment changes slowly (e.g. when δ < 0.08), the EAR performance achieved by MRC_4 is the highest of the 4 sub-codebooks. The reason is that in this case the beam drift effect is not significant; since MRC_4 has the narrowest-beamwidth codewords, its array gain is the highest, and the EAR performance of the BWS algorithm and of MRC_4 are then almost identical. Meanwhile, as the environmental change rate increases, the beam drift effect becomes more and more pronounced; the EAR performance of MRC_4 deteriorates rapidly and becomes even worse than that of MRC_3, indicating that narrow beams are more susceptible to beam drift. MRC_1 achieves the worst EAR performance because, although the wide beam is robust to beam drift, its array gain is relatively low, which in turn degrades EAR performance. In contrast, our algorithm achieves a good balance between obtaining high array gain and mitigating the beam drift effect, sensing and adapting to the changing environment in real time.
In conclusion, the intelligent beam width optimization (BWS) algorithm achieves a good trade-off between obtaining high array gain and suppressing the beam drift effect, and senses and adapts to the constantly changing environment in real time, thereby improving system throughput.
Claims (7)
1. An intelligent beam width optimization method based on reinforcement learning is characterized by comprising the following steps,
step 1, modeling the beam width optimization problem as a Markov decision process (MDP) and adopting an Actor-Critic algorithm that combines policy learning and value learning; codebooks of different resolutions, i.e. beam widths, are defined as the action space; the policy network and the value network are both composed of fully connected layers, the policy network is responsible for selecting the optimal action at each moment, and the value network is responsible for evaluating the selection of the policy network according to the feedback of the objective environment; through continuous interaction with the environment, the dynamic characteristics of the environment are sensed and the optimal codebook codeword is selected for data transmission;
step 2, initialization: establishing and initializing the policy network, the value network, the multi-resolution DFT codebook, the initial data transmission codebook, the time unit counter, the channel environment and the network update threshold, and allocating storage space;
step 3, at the beginning of each decision time unit, constructing the current state according to the obtained information and the state design rule, and determining the optimal action, namely the beam width, through the policy network;
step 4, combining the optimal beam width with the optimal beam direction obtained by beam alignment, selecting the optimal code word in the selected codebook and performing data transmission;
step 5, calculating reward information from the data obtained in the transmission process, compensating the reward, and judging whether the network update threshold is met; if so, executing step 6, otherwise jumping directly back to step 3 to start the next iteration;
and 6, updating the parameters of the strategy network and the value network based on the reward information.
2. The intelligent beam width optimization method based on reinforcement learning according to claim 1, wherein step 1 first characterizes the beam width optimization problem as an MDP and reasonably designs the MDP model parameters, including the state design, the action space and the reward design; the beam width used for data transmission is adaptively adjusted according to the severity of the beam drift effect, fully balancing the relationship between beamforming gain and the effective time of data transmission;
the time unit of the beam training process for determining the optimal beam direction is defined as a time slot, the time scale on which the beam width is selected is defined as a time unit, and each time unit includes M time slots; the MDP parameters are then defined as follows:
step 1.1. state design: M beam training experiences (S, A, R, S_{t+1}) are obtained in the M time slots of one time unit, yielding M beam training rewards, i.e.

R = {R_t, R_{t+1}, …, R_i, …, R_{t+M-1}}   (1)
wherein R_i represents the reward of the ith time slot in the time unit, namely the effective achievable rate obtained by beam training in that slot; a set of indirect results obtained by meaningful calculations on the R sequence is then adopted as the state, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av represents the mean of the R sequence, i.e. its average size:

R_av = (1/M) Σ_{i=1}^{M} R_{t+i-1}   (3)
· R_cv represents the coefficient of variation of the R sequence; the mean and the degree of dispersion are the two main characteristics of a sequence, and the dispersion is usually measured by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, and the standard deviation is then unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale; it is calculated by first computing the standard deviation σ of the R sequence,

σ = sqrt( (1/M) Σ_{i=1}^{M} (R_{t+i-1} − R_av)^2 )   (4)

the coefficient of variation, a normalized measure of the dispersion of a probability distribution, is then defined as the ratio of the standard deviation σ to the mean R_av:

R_cv = σ / R_av   (5)
· R_kur represents the kurtosis of the R sequence; the kurtosis is the characteristic number describing the peakedness of the probability density curve at the mean, i.e. a statistic describing how steep the distribution of all values in the population is; it is introduced to give the system additional information to learn from, and R_kur is calculated as:

R_kur = (1/M) Σ_{i=1}^{M} (R_{t+i-1} − R_av)^4 / σ^4   (6)
step 1.2. action design: the actions are defined as codebooks of different beam widths, i.e. different resolutions; the multi-resolution codebook is defined as C = {C_1, C_2, …, C_K}, and the corresponding action space is defined as A = {1, 2, …, K}, where action k selects sub-codebook C_k;
step 1.3. reward design: the reward is defined as the sum rate of the M time slots of each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (7)

wherein R_{u,i} denotes the effective transmission rate of the ith time slot in the uth time unit.
3. The method of claim 1, wherein the initialization method of step 2 is as follows:
step 2.1. construct a value network V and a policy network π, each composed of fully connected layers, and initialize the value network parameter ω and the policy network parameter θ;
step 2.2. construct the multi-resolution DFT codebook C = {C_1, …, C_K} and randomly select an initial codebook for data transmission;
step 2.3. initialize the time unit count u = 1 and empirically initialize the network update threshold l_max;
4. The method according to claim 1, wherein the step 3 sequentially executes the following steps in time unit u:
step 3.1. obtain the reward sequence R of the M time slots in this time unit according to formula (1);
step 3.2. calculate the state information s_u of the time unit according to formulas (3)–(6);
step 3.3. obtain a probability distribution over all actions through the policy network π: for the action space A, Σ_{a∈A} π(a|s; θ) = 1, i.e. the probabilities with which the policy network π(θ) selects the actions in state s sum to 1; the best action a_u = argmax_{a∈A} π(a|s_u; θ) is then selected, together with the corresponding data transmission codebook C_{a_u};
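The probability constraint in step 3.3 is what a softmax output layer provides, which is the usual way a policy network produces a distribution over a discrete action space. A minimal sketch, where the logits stand in for the real network's final-layer output:

```python
# Illustrative sketch of step 3.3: a softmax over the policy network's
# output logits yields action probabilities that sum to 1, and the best
# action (sub-codebook index) is the most probable one.
import numpy as np

def select_action(logits):
    """Return (action probabilities, best action index)."""
    z = logits - logits.max()        # shift logits for numerical stability
    p = np.exp(z) / np.exp(z).sum()  # softmax: probabilities sum to 1
    return p, int(p.argmax())
```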
5. The intelligent beam width optimization method based on reinforcement learning according to claim 1, wherein step 4 is performed in sequence in time unit u as follows:
step 4.1. assume the beam training process aimed at finding the best beam direction employs a codebook C_tr; according to the beam alignment result, the best beam of the time unit and its corresponding beam center angle θ* are obtained;

step 4.2. the best codeword for data transmission is selected in the chosen codebook C_{a_u}; assume C_{a_u} = {c_1, c_2, …, c_N}, where c_i represents the ith codeword, N represents the number of codewords in the codebook, and the beam center angle corresponding to codeword c_i is θ_i; the best data transmission codeword c* is then

c* = argmin_{i∈{1,…,N}} |θ_i − θ*|

wherein {1,…,N} represents the natural numbers 1 to N; in a concrete implementation, the index relationship between codebooks of different resolutions follows a fixed objective rule, so a direct mapping between the indexes of different codebooks can be constructed;
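The codeword-selection rule of step 4.2 reduces, per the surrounding description, to picking the codeword whose beam center angle lies closest to the trained beam direction. A sketch, assuming hypothetical uniform center angles over (0, π), one per codeword:

```python
# Sketch of step 4.2's nearest-center-angle selection; the uniform
# center-angle grid is an assumption, standing in for the DFT codebook's
# actual beam centers.
import numpy as np

def best_codeword(theta_star, num_codewords):
    """0-based index of the codeword whose center angle is nearest theta_star."""
    centers = (np.arange(num_codewords) + 0.5) * np.pi / num_codewords
    return int(np.abs(centers - theta_star).argmin())
```

Because the center-angle grids of different resolutions are nested, the selected indexes exhibit the fixed index relationship between codebooks that the claim mentions, so the same selection can also be implemented as a direct index mapping.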
6. The method of claim 1, wherein step 5 is performed in sequence in time unit u as follows:
step 5.1) receive the return of the external environment and calculate the reward R_u of the time unit according to formula (1); owing to the deviation between the codebook applied in beam training and the codebook applied in data transmission, the reward needs to be compensated; with the default beam training codebook denoted C_tr, the corrected reward R_u′ is obtained by compensating R_u for this codebook deviation;
step 5.2) judge whether the network update threshold is met: a normalized critic loss is designed to characterize whether the networks need to be updated further, and this loss is compared with the threshold l_max;
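The closed form of the normalized critic loss did not survive extraction. As a stand-in only, the sketch below normalizes the mean absolute TD-error over recent time units by the mean critic magnitude and compares it with the threshold l_max of step 2.3; this realization is an assumption, not the patent's formula.

```python
# Hypothetical realization of step 5.2's update criterion: the networks
# keep updating while the (scale-normalized) TD-error remains large.
import numpy as np

def needs_update(td_errors, q_values, l_max):
    """True if the normalized critic loss still exceeds the update threshold."""
    loss_norm = np.abs(td_errors).mean() / (np.abs(q_values).mean() + 1e-12)
    return bool(loss_norm > l_max)
```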
7. The method of claim 1, wherein step 6 is performed in sequence in time unit u as follows:
step 6.1. execute the best action a_u in time unit u, select the corresponding codeword for data transmission and obtain the compensated reward R_u′; meanwhile, the state s_{u+1} of the next time unit is obtained according to formula (2), and the best action ã_{u+1} of the next time unit is obtained from the policy network; this action is only used for the subsequent calculation and is not actually executed;
step 6.2. input the state s_u and action a_u of this time unit into the value network to obtain q_u = q(s_u, a_u; ω_u); input the next state s_{u+1} and the predicted action ã_{u+1} into the value network to obtain q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u); and calculate the TD-error value δ_u = q_u − (R_u′ + γ·q_{u+1});
step 6.3. obtain d_{ω,u} = ∂q(s_u, a_u; ω)/∂ω by differentiating the value network, then update the value network by temporal difference as ω_{u+1} = ω_u − α·δ_u·d_{ω,u};
step 6.4. differentiate the policy network to obtain d_{θ,u} = ∂log π(a_u|s_u; θ)/∂θ, then update the policy network parameters by gradient ascent as θ_{u+1} = θ_u + β·q_u·d_{θ,u}, wherein q_u may be replaced by δ_u;
step 6.5, update counter u ← u + 1.
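The step-6 updates can be illustrated numerically. To keep the gradients of steps 6.3 and 6.4 in closed form, the sketch takes the critic to be linear, q(s, a; ω) = ω·x(s, a); the feature map x and all hyper-parameters are illustrative, while the update rules themselves follow the claim.

```python
# Sketch of steps 6.2-6.4: TD-error, temporal-difference critic update,
# and gradient-ascent actor update, for a linear critic where the
# gradient d_omega is simply the feature vector x_u.
import numpy as np

def critic_update(omega, x_u, x_next, R_u, gamma, alpha):
    """Steps 6.2-6.3: compute delta_u and update the critic parameters."""
    q_u = omega @ x_u                        # q(s_u, a_u; omega)
    q_next = omega @ x_next                  # q(s_{u+1}, predicted action; omega)
    delta = q_u - (R_u + gamma * q_next)     # TD-error delta_u
    omega_new = omega - alpha * delta * x_u  # omega <- omega - alpha*delta*d_omega
    return omega_new, delta, q_u

def actor_update(theta, d_theta, q_u, beta):
    """Step 6.4: gradient ascent on the policy; q_u may be replaced by delta_u."""
    return theta + beta * q_u * d_theta
```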
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210526035.3A CN115002804A (en) | 2022-05-13 | 2022-05-13 | Intelligent beam width optimization method based on reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115002804A true CN115002804A (en) | 2022-09-02 |
Family
ID=83027249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210526035.3A Pending CN115002804A (en) | 2022-05-13 | 2022-05-13 | Intelligent beam width optimization method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115002804A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||