CN115002804A - Intelligent beam width optimization method based on reinforcement learning - Google Patents

Intelligent beam width optimization method based on reinforcement learning Download PDF

Info

Publication number
CN115002804A
CN115002804A
Authority
CN
China
Prior art keywords
network
time unit
codebook
data transmission
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210526035.3A
Other languages
Chinese (zh)
Inventor
黄永明
陆昀程
胡梓炜
俞菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210526035.3A priority Critical patent/CN115002804A/en
Publication of CN115002804A publication Critical patent/CN115002804A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/06Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
    • H04B7/0613Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
    • H04B7/0615Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal
    • H04B7/0617Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal for beam forming
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/046Wireless resource allocation based on the type of the allocated resource the resource being in the space domain, e.g. beams
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Variable-Direction Aerials And Aerial Arrays (AREA)

Abstract

The invention discloses an intelligent beam width optimization method based on reinforcement learning. The method suppresses the beam drift effect in millimeter wave communication by dynamically adjusting the beam width: the algorithm models the dynamic beam width optimization problem as a Markov decision process and selects the optimal beam width for data transmission at each decision time. The state design characterizes the severity of the beam drift effect in the current system from multiple perspectives, each action corresponds to a different beam width, and the width of the data transmission beam is optimized by sensing how fast the environment changes. In each state, the optimal data transmission beam width is selected according to an AC algorithm; the policy network and the value network are continuously updated during training to improve the rationality and reliability of the model's selections, and a dynamic network update mechanism is introduced to reduce the computational burden of network updates. The throughput of the millimeter wave communication system under the beam drift effect is improved, and the quality of the communication link is therefore continuously guaranteed.

Description

Intelligent beam width optimization method based on reinforcement learning
Technical Field
The invention belongs to the field of wireless communication, and particularly relates to an intelligent beam width optimization method for millimeter wave communication based on the Actor-Critic (AC) reinforcement learning algorithm.
Background
Millimeter waves have attracted attention as a potential technology for meeting the demand for high-speed wireless data transmission owing to their large available bandwidth. However, compared with conventional low-frequency bands, the free-space path loss of the millimeter wave band is large and its ability to diffract around objects is poor, so an array antenna composed of tens or even hundreds of elements is required, and the signal-to-noise ratio is improved by concentrating energy through beamforming. The element spacing of such an array is on the same order of magnitude as the wavelength, which leaves ample room for reducing the physical volume of the millimeter wave antenna array, so it can be made small enough for a wide range of application scenarios. Beamforming gives millimeter wave links strong directivity, so beam tracking is needed to ensure the stability and quality of the communication link. Notably, due to the high cost of radio-frequency chains and the high power consumption of ADCs/DACs, millimeter wave communication systems typically employ a hybrid digital-analog precoding architecture in which the number of RF chains is much smaller than the number of antennas.
Beamforming is only the basis for improving the signal-to-noise ratio; selecting a reasonable direction for the directional beam is another major problem in obtaining a high signal-to-noise ratio. The main beam direction after beamforming needs to be aligned with the user in real time, i.e., beam training is needed to ensure that the transmitter's main beam is aligned with and tracks a moving receiver. Most existing schemes perform beam training within a time slot, obtain a high-gain beam direction, and use that direction for data transmission within the slot. However, this relies on the communication environment between the user and the base station remaining unchanged within each time slot. When the receiving user keeps moving, its angle of departure or angle of arrival changes continuously; even if beam training aligns the transmit and receive main beams at the beginning of each slot, the receiver's motion within the slot changes its position relative to the transmitter, which reduces the beamforming/array gain. This is called the beam drift effect; when severe, it degrades the quality of the communication link and may even cause communication interruption.
The beam training process is a process of balancing beam alignment success rate against beam training efficiency: a narrower beam usually brings higher beamforming gain, but at the same time it generates a larger search burden and reduces search efficiency. When the beam drift effect is considered, the relative invariance of the environment within a time slot is broken, and the faster the environment changes within the slot, the more severely the beamforming gain is affected. To suppress the beam drift effect, a trade-off must be made between the beamforming gain and the effective time available for data transmission. Although the beam drift effect can be mitigated by shortening each slot, this increases the beam training overhead and reduces the effective achievable rate. Departing from the idea of always using narrow beams, when a beam is widened the coverage area of a single beam becomes larger; although the maximum beamforming gain is lower than that of a narrow beam, better data transmission efficiency can still be achieved than when beam alignment fails within the slot. Therefore, adjusting the beam width in real time according to changes in the communication environment is an effective way to cope with the beam drift effect.
Disclosure of Invention
The technical problem is as follows: in order to inhibit the beam drift effect and ensure the stability of a communication link, the invention provides an intelligent beam width optimization method, which counteracts the influence of the beam drift effect on the beam forming gain by adaptively adjusting the beam width. The method can improve the throughput of the millimeter wave communication system and ensure the stability of the quality of the communication link in the whole communication process.
The technical scheme is as follows: the intelligent beam width optimization method based on reinforcement learning comprises the following steps,
step 1, modeling the beam width optimization problem as a Markov decision process (MDP) and adopting the Actor-Critic algorithm, which combines policy learning and value learning; codebooks of different resolutions, i.e., different beam widths, are defined as the action space, the policy network and the value network are both fully-connected networks, the policy network is responsible for selecting the optimal action at each decision time, and the value network evaluates the policy network's selections according to feedback from the objective environment; through continuous interaction with the environment, the dynamic characteristics of the environment are sensed and the optimal codebook codeword is selected for data transmission;
step 2, initialization: establishing and initializing a policy network, a value network, a multi-resolution DFT codebook, an initial data transmission codebook, a time unit counter, the channel environment, and a network update threshold, and opening up a storage space;
step 3, at the beginning of each decision period, constructing the current state according to the obtained information and the state design rule, and determining the optimal action, namely the beam width, through the policy network;
step 4, combining the optimal beam width and the optimal beam direction obtained by beam alignment, selecting the optimal code word in the selected codebook and performing data transmission;
step 5, calculating and obtaining reward information according to the information in the data transmission process, compensating the reward, judging whether a network updating threshold is met, if so, executing step 6, otherwise, directly jumping back to step 3 to start the next circulation;
and step 6, updating the parameters of the policy network and the value network based on the reward information.
Specifically,
step 1, firstly, characterizing a beam width optimization problem as MDP, effectively and reasonably designing MDP model parameters including state design, action space and reward design, adaptively adjusting the beam width for data transmission according to the severity of a beam drift effect, and fully balancing the relationship between beam forming gain and effective data transmission time;
if the time unit of the beam training process for determining the optimal beam direction is a time slot, and the time scale for selecting the beam width is defined as a time unit, and each time unit includes M time slots, the MDP parameter is defined as follows:
Step 1.1. State design: within M time slots, i.e., one time unit, M groups of beam training experiences (s_t, a_t, R_t, s_{t+1}) are obtained, and thus M beam training rewards can be obtained, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}   (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e., the effective achievable rate obtained by the beam training of that slot. An indirect result set obtained by meaningful computations on the R sequence is then taken as the state, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av denotes the mean of the R sequence, which represents the average magnitude of the R sequence;
· R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence, and the degree of dispersion is usually represented by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, in which case the standard deviation is unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale. It is calculated by first computing the standard deviation σ of the R sequence,

σ = √( (1/M) Σ_{i=1}^{M} (R_i − R_av)² )   (3)

The coefficient of variation is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av, i.e.

R_cv = σ / R_av   (4)

· R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing the height of the peak of the probability density curve at the mean, a statistic describing the steepness of the distribution of all values in the population; it is introduced to provide additional information for the system to learn. The kurtosis R_kur is calculated as

R_kur = ( (1/M) Σ_{i=1}^{M} (R_i − R_av)⁴ ) / σ⁴   (5)
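For illustration, a minimal numpy sketch of the state construction in step 1.1 is given below; it assumes population statistics (dividing by M) for the standard deviation and kurtosis, which the text does not fix explicitly.

```python
import numpy as np

def build_state(R):
    """Build the MDP state {R_av, R_cv, R_kur} from the M per-slot
    rewards (effective achievable rates) of one time unit."""
    R = np.asarray(R, dtype=float)
    R_av = R.mean()                              # mean of the reward sequence
    sigma = np.sqrt(np.mean((R - R_av) ** 2))    # standard deviation of the R sequence
    R_cv = sigma / R_av                          # coefficient of variation
    R_kur = np.mean((R - R_av) ** 4) / sigma**4  # kurtosis
    return np.array([R_av, R_cv, R_kur])

# Example: rewards of M = 5 slots in one time unit (hypothetical values)
state = build_state([3.1, 2.9, 2.7, 2.2, 1.8])
```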
Step 1.2. Action design: the actions are defined as codebooks of different beam widths, i.e., different resolutions. The multi-resolution codebook is defined as the set of sub-codebooks S = {S_1, S_2, ..., S_K}, where S_k denotes the sub-codebook of the k-th resolution, and the corresponding action space A is defined over these sub-codebooks, so that each action selects one sub-codebook (beam width) for data transmission.
Step 1.3. Reward design: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (6)

where R_{u,i} denotes the effective transmission rate of the i-th time slot in the u-th time unit.
The initialization method of the step 2 is as follows:
Step 2.1. Construct a value network V and a policy network π, both composed of fully connected layers, and initialize the value network parameters ω and the policy network parameters θ;
Step 2.2. Construct the multi-resolution DFT codebook S and randomly select an initial data transmission codebook;
Step 2.3. Initialize the time unit count u = 1 and initialize the network update threshold l_max empirically;
Step 2.4. Open up a storage space that stores, for each time unit u, the optimal action a_u^* selected in that time unit, the reward information generated by the beam training of the M time slots in the time unit, and the state information obtained from that reward information.
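A minimal PyTorch sketch of the initialization in step 2 might look as follows; the layer width of 64 and realizing the value network as one Q-value per action are illustrative assumptions (the state has 3 features from step 1.1, and there is one action per sub-codebook, e.g. 4 as in the simulation section).

```python
import torch
import torch.nn as nn

STATE_DIM = 3    # {R_av, R_cv, R_kur}
N_ACTIONS = 4    # one action per sub-codebook / beam width (assumed)

# Policy network pi(a|s; theta): fully connected, outputs a probability per action
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)

# Value network q(s, a; omega): fully connected, here one Q-value per action
value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
```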
Step 3 executes the following steps in sequence in time unit u:
Step 3.1. Obtain the reward sequence R of the M time slots in the time unit according to formula (1);
Step 3.2. Calculate the state information s_u of the time unit according to formulas (3)–(5);
Step 3.3. Obtain a probability distribution over all actions through the policy network π: for the action space A, the probabilities with which the policy network π(θ) selects the actions in state s sum to 1; the best action a_u^* is then selected accordingly, and the corresponding data transmission codebook S_u^* is chosen.
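As an illustration, the action selection in step 3.3 could be realized as below; sampling from the policy distribution during training and taking the arg-max at evaluation time is an assumption, since the text only states that the best action is selected through the policy network.

```python
import torch

def select_action(policy_net, state, greedy=False):
    """Return the index of the sub-codebook chosen by the policy network."""
    s = torch.as_tensor(state, dtype=torch.float32)
    probs = policy_net(s)                      # pi(a|s; theta), sums to 1
    if greedy:
        return int(torch.argmax(probs))        # exploit: most probable beam width
    return int(torch.distributions.Categorical(probs).sample())  # explore
```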
Step 4 executes the following steps in sequence in time unit u:
Step 4.1. Assume that the beam training process aimed at finding the best beam direction employs the codebook S_s; from the beam alignment result, the best beam f_s^* of the time unit and its corresponding beam center angle φ_s^* are obtained;
Step 4.2. The best codeword for data transmission is selected from S_u^*. Assume S_u^* = {f_1, f_2, ..., f_{N_u}}, where f_i denotes the i-th codeword and N_u denotes the number of codewords in the codebook, and let the beam center angle corresponding to codeword f_i in the codebook S_u^* be φ_i. The best data transmission codeword f_u^* is then

f_u^* = f_{i*},  i* = argmin_{i ∈ I} |φ_i − φ_s^*|   (7)

where I denotes the set of natural numbers from 1 to N_u. In a specific implementation, if the index relation between codebooks of different resolutions follows a specific objective rule, a direct mapping g(·) between the indexes of different codebooks can be constructed, so that the optimal transmission codeword f_u^* is calculated directly by the mapping function.
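A sketch of the codeword selection in step 4.2, under the assumption that each codebook stores the beam center angle of every codeword; the nested-DFT index mapping mentioned in the final comment is only one possible "objective rule".

```python
import numpy as np

def best_data_codeword(centers_u, phi_train):
    """Pick the index of the codeword in the selected data-transmission
    codebook whose beam center angle is closest to the center angle of
    the beam found by beam training (formula (7))."""
    centers_u = np.asarray(centers_u)          # beam center angles of S_u^*
    return int(np.argmin(np.abs(centers_u - phi_train)))

# With nested DFT codebooks of sizes 8/16/32/64, a direct index mapping g(.)
# is also possible, e.g. i_data = i_train * N_u // N_train (an assumption).
```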
Step 5 executes the following steps in sequence in time unit u:
Step 5.1. Receive the return of the external environment and calculate the reward R_u of the time unit according to formula (6). Because of the deviation between the codebook used for beam training and the codebook used for data transmission, the reward needs to be compensated as follows: with the default beam training codebook denoted S_s, the corrected reward R_u' of formula (8) is obtained by compensating R_u with a factor determined by S_S, the number of codewords of S_s, and N_u, the number of codewords of S_u^*;
Step 5.2. Judge whether the network update threshold is met: a normalized critic loss is designed to characterize whether the network needs to be updated further; it is computed from the outputs of the value network over an observation window U_l, where ω denotes the value network parameters. When the normalized critic loss meets the network update threshold l_max, the neural network is trained.
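The exact form of the normalized critic loss is not reproduced here; the sketch below assumes it is the TD-error magnitude averaged over the last U_l time units and normalized by the corresponding rewards, which is only one plausible realization of the gating described in step 5.2.

```python
from collections import deque

class UpdateGate:
    """Decide whether the actor/critic networks should be trained this
    time unit, based on recent TD errors (assumed form of the loss)."""
    def __init__(self, window, l_max):
        self.errors = deque(maxlen=window)   # observation window U_l
        self.l_max = l_max                   # network update threshold

    def should_update(self, td_error, reward):
        self.errors.append(abs(td_error) / max(abs(reward), 1e-8))
        loss = sum(self.errors) / len(self.errors)   # normalized critic loss
        return loss >= self.l_max
```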
Step 6 executes the following steps in sequence in time unit u:
Step 6.1. Perform the best action a_u^* in time unit u, i.e., select the corresponding codeword for data transmission, and obtain the compensated reward R_u'; at the same time, the state s_{u+1} of the next time unit u+1 is obtained according to formula (2), and the best action ã_{u+1} of the next time unit is obtained from the policy network; this action is only used for the subsequent calculations and is not actually performed;
Step 6.2. Feed the state s_u and the action a_u of the time unit into the value network to obtain q_u = q(s_u, a_u; ω_u), feed the next-time-unit state s_{u+1} and the predicted action ã_{u+1} into the value network to obtain q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u), and calculate the TD error δ_u = q_u − (R_u' + γ·q_{u+1});
Step 6.3. Differentiate the value network to obtain d_{ω,u} = ∂q(s_u, a_u; ω)/∂ω |_{ω=ω_u}, and then update the value network by the temporal difference as ω_{u+1} = ω_u − α·δ_u·d_{ω,u};
Step 6.4. Differentiate the policy network to obtain d_{θ,u} = ∂ln π(a_u|s_u; θ)/∂θ |_{θ=θ_u}, and then update the policy network parameters by gradient ascent as θ_{u+1} = θ_u + β·q_u·d_{θ,u}, where q_u may be replaced by δ_u;
Step 6.5. Update the counter: u ← u + 1.
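The update in step 6 can be sketched as follows; it assumes the networks from the initialization sketch above (one Q-value per action) and uses plain gradient steps with learning rates alpha and beta so as to match the update formulas, rather than a particular optimizer.

```python
import torch

def ac_update(policy_net, value_net, s_u, a_u, r_u, s_next, a_next,
              alpha=1e-3, beta=1e-3, gamma=0.9):
    """One Actor-Critic update for time unit u (step 6)."""
    s_u, s_next = map(lambda x: torch.as_tensor(x, dtype=torch.float32), (s_u, s_next))

    q_u = value_net(s_u)[a_u]                        # q(s_u, a_u; omega)
    with torch.no_grad():
        q_next = value_net(s_next)[a_next]           # q(s_{u+1}, a~_{u+1}; omega)
    delta = q_u.detach() - (r_u + gamma * q_next)    # TD error delta_u

    # Critic: omega <- omega - alpha * delta * d_omega q
    value_net.zero_grad()
    q_u.backward()
    with torch.no_grad():
        for p in value_net.parameters():
            p -= alpha * delta * p.grad

    # Actor: theta <- theta + beta * q_u * d_theta log pi(a_u|s_u)
    # (q_u may be replaced by delta, as noted in step 6.4)
    policy_net.zero_grad()
    log_prob = torch.log(policy_net(s_u)[a_u])
    log_prob.backward()
    with torch.no_grad():
        for p in policy_net.parameters():
            p += beta * q_u.detach() * p.grad
```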
Beneficial effects: compared with the prior art, the invention has the following advantages:
1) On the basis of beam alignment training that uses only a single codebook, the more realistic scenario, namely the relative change of the environment within a time slot, is considered. A multi-resolution codebook is introduced, the codebook with the optimal resolution is adaptively selected at each decision point by sensing the rate of change of the channel direction angle within the slot, and the most efficient codeword is selected in combination with beam training, thereby improving system performance.
2) The algorithm does not require channel state information. Taking the real-time variation of the actual environment into account, reinforcement learning is introduced to interact with the environment in real time, continuously learn by trial and error, and perceive the pattern of environmental change, making the most reasonable decision through a neural network that fully utilizes the available computing power.
3) Under the beam drift effect, unlike the traditional approach, narrow beams are used for beam training while a multi-resolution codebook is introduced for data transmission, taking both the beamforming gain and the effective data transmission time into account. Simulations show that the algorithm performs better than algorithms using a single codebook, and the performance advantage is more obvious when the environment changes more drastically, i.e., when the beam drift effect is more severe.
4) The AC algorithm makes action selections based on policy learning, which is more direct than other algorithms. Through a reasonable and concise state design and the introduction of a dynamic training mechanism, the algorithm greatly reduces the training burden of the neural network and does not require a large storage space to store long histories of context information.
Drawings
FIG. 1 is a flow chart of an intelligent beam width optimization algorithm;
FIG. 2 is a diagram of the information transfer between the beam training module and the data transmission module;
FIG. 3 is a diagram illustrating average EAR performance of different codebooks at different SNR;
FIG. 4 is a diagram illustrating the relationship between the average EAR performance and the environmental change rate in different codebooks.
Detailed Description
In order to make the technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to examples and software simulation.
Beam drift effects are taken into account, i.e. the beamforming gain may suffer due to a change in the aligned azimuth of the transmitter and receiver during the beam training time slot. The algorithm interacts with a dynamic environment by introducing reinforcement learning under the condition of not needing channel state information, continuously senses the change rule of the environment in a time slot, and adaptively adjusts the width of a data transmission beam so as to ensure the stability of the quality of a communication link.
In a specific implementation scheme, a beam width optimization problem is modeled into an MDP model, an optimal beam width is selected in each time unit through reasonable and efficient action, state and reward design, an optimal code word is selected for data transmission by combining a beam training alignment result, and neural network parameters are updated according to conditions in each time unit cycle to enable the selection to be more reasonable.
1. Model building
In the MDP model, a state space, an action space, and a reward are defined first. Assuming that the minimum time unit is a slot, every M slots are defined as one time unit, i.e., each time unit includes M slots.
Defining a state: within M time slots, i.e., one time unit, M sets of beam training experiences (s_t, a_t, R_t, s_{t+1}) can be obtained, and thus M rewards can be obtained, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}   (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e., the effective achievable rate obtained by the beam training of that slot. An indirect result set obtained by meaningful computations on the R sequence is then taken as the state, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av denotes the mean of the R sequence, which represents the average magnitude of the R sequence;
· R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence, and the degree of dispersion is often represented by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, in which case the standard deviation is unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale. It is calculated by first computing the standard deviation σ of the R sequence,

σ = √( (1/M) Σ_{i=1}^{M} (R_i − R_av)² )   (3)

The coefficient of variation is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av, i.e.

R_cv = σ / R_av   (4)

· R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing the height of the peak of the probability density curve at the mean, a statistic describing the steepness of the distribution of all values in the population; it is introduced to provide additional information for the system to learn. When the sequence has n samples (here n = M), the kurtosis R_kur is calculated as

R_kur = ( (1/n) Σ_{i=1}^{n} (R_i − R_av)⁴ ) / σ⁴   (5)
Defining the action space: the actions are defined as codebooks of different beam widths, i.e., different resolutions. The multi-resolution codebook is defined as the set of sub-codebooks S = {S_1, S_2, ..., S_K}, and the corresponding action space A is defined over these sub-codebooks, so that each action selects one sub-codebook (beam width) for data transmission.
Defining the reward: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (6)

where R_{u,i} denotes the effective transmission rate of the i-th time slot in the u-th time unit.
Based on the state, the policy network selects the optimal codebook S_u^*; the best codeword then needs to be selected in combination with the beam alignment result. A direct mapping function g(·) can be established between the codeword indexes of different codebooks, and the optimal transmission codeword f_u^* is then calculated directly by the mapping function.
After data transmission, the return of the external environment is received as the reward, and the reward is compensated. With the codebook used by default for beam training denoted S_s, the corrected reward R_u' of formula (8) is obtained by compensating R_u with a factor determined by S_S, the number of codewords of S_s, and N_u, the number of codewords of S_u^*.
It is then judged whether the network update threshold is met: a normalized critic loss is designed to characterize whether the network needs to be updated further; it is computed from the outputs of the value network over an observation window U_l, where ω denotes the value network parameters. When the normalized critic loss meets the network update threshold, the neural network is trained; otherwise, the learning process is skipped.
2. Implementation of AC-based beam width optimization algorithm
Input: the multi-resolution codebook S and the action space A.
Initialization: establish the value network V and the policy network π, initialize the neural network parameters ω and θ and the network update threshold l_max, open up the storage space, randomly select the initial data transmission codebook, and set the time unit count u = 1.
Loop body: for each time unit u, the following steps are repeatedly performed:
(1) Obtain the reward sequence R of the M time slots in the time unit according to formula (1);
(2) Calculate the state information s_u of the time unit according to formulas (3)–(5);
(3) Select the best action a_u^* through the policy network π and select the corresponding transmission codebook S_u^*;
(4) Determine the best data transmission codeword f_u^* in S_u^* according to formula (7) and perform data transmission;
(5) Receive the return value of the external environment and obtain the modified, compensated reward R_u' according to formula (8);
(6) Calculate the normalized critic loss; if it meets the network update threshold, proceed to step (7), otherwise go directly to step (10);
(7) Obtain the state s_{u+1} of the next time unit u+1 according to formula (2) and obtain the best action ã_{u+1} of the next time unit from the policy network;
(8) Compute q_u = q(s_u, a_u; ω_u) and q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u) through the value network, and calculate the TD error δ_u = q_u − (R_u' + γ·q_{u+1});
(9) Update the value network parameters and the policy network parameters separately: ω_{u+1} = ω_u − α·δ_u·d_{ω,u}, θ_{u+1} = θ_u + β·q_u·d_{θ,u};
(10) Update the counter: u ← u + 1.
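Tying the earlier sketches together, a hypothetical end-to-end loop for the algorithm above could look like this; the `env` object (returning the per-slot rewards of a time unit and the compensated reward for the chosen beam width) is an assumed interface, not something defined in the patent.

```python
import torch

def run_bws(env, policy_net, value_net, gate, num_units, gamma=0.9):
    """Hypothetical driver loop using build_state, select_action, ac_update
    and UpdateGate from the sketches above."""
    R = env.reset()                               # per-slot rewards of the first time unit
    for u in range(num_units):
        s_u = build_state(R)
        a_u = select_action(policy_net, s_u)      # index of the chosen sub-codebook
        R, r_u = env.step(a_u)                    # next per-slot rewards, compensated reward R_u'
        s_next = build_state(R)
        a_next = select_action(policy_net, s_next, greedy=True)  # used only for the TD target
        with torch.no_grad():
            q_u = float(value_net(torch.as_tensor(s_u, dtype=torch.float32))[a_u])
            q_next = float(value_net(torch.as_tensor(s_next, dtype=torch.float32))[a_next])
        delta = q_u - (r_u + gamma * q_next)      # TD error, here used only for the update gate
        if gate.should_update(delta, r_u):        # dynamic network update mechanism
            ac_update(policy_net, value_net, s_u, a_u, r_u, s_next, a_next, gamma=gamma)
```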
3. Simulation environment and results
The following environment is adopted during simulation:
1) The simulation generates a time-varying millimeter wave channel environment in an office. The base station center frequency is 28 GHz, the transmit power is 10 dBm, the office size is 20 × 20 meters, and the number of elements N of the uniform linear array (ULA) is 64.
2) The number of channel path clusters is randomly generated between 2 and 5, and each cluster contains only one path for ease of tracking. The power allocation between the line-of-sight (LOS) path and the non-line-of-sight (NLOS) paths is determined by the Rician factor K, and the attenuation of the NLOS paths approximately follows an exponential distribution. Since the exponential decay is fast, in practice only the first NLOS path is likely to be significant; the power of the other paths is approximately 0.
3) The uniform linear array (ULA) covers a 180° region with azimuth φ satisfying 0 < φ < π, and codebooks of different resolutions divide the sine of this range uniformly with different granularities. The codebook S comprises 4 sub-codebooks of different resolutions, S_1, S_2, S_3 and S_4, with 8, 16, 32 and 64 codewords respectively. For convenience, each sub-codebook S_i (i = 1, 2, 3, 4) is abbreviated as MRC_i (a construction sketch of these codebooks is given after this list).
4) The environmental change rate is quantified by μ; the larger the value of μ, the faster the environment changes.
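As referenced in setting 3), the multi-resolution DFT codebook could be generated as in the following sketch; uniform sampling of the normalized spatial frequency and the specific angle convention are assumptions, since only the codeword counts are specified.

```python
import numpy as np

def dft_codebook(num_codewords, n_antennas=64):
    """Array-response codewords of a ULA with uniformly sampled spatial
    frequencies; fewer codewords means wider beams."""
    # Uniformly sample the normalized spatial frequency (sine/cosine of the
    # beam center angle); the exact angle convention is an assumption.
    omegas = -1 + (2 * np.arange(num_codewords) + 1) / num_codewords
    n = np.arange(n_antennas)
    codebook = np.exp(1j * np.pi * np.outer(omegas, n)) / np.sqrt(n_antennas)
    return codebook   # shape: (num_codewords, n_antennas)

# Multi-resolution codebook S = {MRC_1, ..., MRC_4} with 8/16/32/64 codewords
multi_res_codebook = [dft_codebook(k) for k in (8, 16, 32, 64)]
```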
Simulation result analysis:
Fig. 3 shows the relationship between EAR performance and signal-to-noise ratio for the different sub-codebooks. The average effective achievable rate (EAR) represents the effective transmission rate per unit spectrum and characterizes the throughput of the system. It can be seen that as the signal-to-noise ratio increases, the EAR performance of the different sub-codebooks also increases. However, the EAR performance achieved by the BWS algorithm is much better than that achieved with any fixed sub-codebook. The reason is that the BWS algorithm is designed to sense the rate of change of the environment and adjust the beam width in real time, which achieves a good trade-off between obtaining high array gain and mitigating the beam drift effect.
Fig. 4 shows the average EAR performance versus the rate of change of the environment for the different codebooks. It can be seen that as the environment changes faster, the influence of the beam drift effect becomes more severe and the EAR performance achieved by the different sub-codebooks decreases accordingly. However, the EAR performance of the BWS algorithm is better than that of any fixed sub-codebook at all environmental change rates, and as the environment changes faster, the performance gap between them becomes larger. These observations indicate that the designed BWS algorithm can effectively counteract the influence of the beam drift effect in a dynamic environment and maintain good achievable EAR performance.
In addition, it can be observed that when μ is small enough, i.e., the environment changes slowly (e.g., when μ < 0.08), MRC_4 achieves the highest EAR performance among the 4 sub-codebooks. The reason is that in this case the beam drift effect is not significant; since MRC_4 has the narrowest beam-width codewords, its gain is the highest, and the EAR performance of the BWS algorithm is almost identical to that of MRC_4. Meanwhile, as the environmental change rate increases, the beam drift effect becomes more and more pronounced, and the EAR performance of MRC_4 deteriorates rapidly, becoming worse than that of MRC_3, which indicates that narrow beams are more susceptible to the beam drift effect. MRC_1 achieves the worst EAR performance because, although the wide beam is robust to the beam drift effect, its array gain is relatively low, which in turn degrades the EAR performance. In contrast, the proposed algorithm achieves a good balance between obtaining high array gain and mitigating the beam drift effect, and senses and adapts to the changing environment in real time.
In conclusion, the intelligent beam width optimization (BWS) algorithm achieves a trade-off between obtaining high array gain and suppressing the beam drift effect, and can sense and adapt to a constantly changing environment in real time, thereby improving system throughput.

Claims (7)

1. An intelligent beam width optimization method based on reinforcement learning is characterized by comprising the following steps,
step 1, modeling the beam width optimization problem as a Markov decision process (MDP) and adopting the Actor-Critic algorithm, which combines policy learning and value learning; codebooks of different resolutions, i.e., different beam widths, are defined as the action space, the policy network and the value network are both fully-connected networks, the policy network is responsible for selecting the optimal action at each decision time, and the value network evaluates the policy network's selections according to feedback from the objective environment; through continuous interaction with the environment, the dynamic characteristics of the environment are sensed and the optimal codebook codeword is selected for data transmission;
step 2, initialization: establishing and initializing a policy network, a value network, a multi-resolution DFT codebook, an initial data transmission codebook, a time unit counter, the channel environment, and a network update threshold, and opening up a storage space;
step 3, at the beginning of each decision period, constructing the current state according to the obtained information and the state design rule, and determining the optimal action, namely the beam width, through the policy network;
step 4, combining the optimal beam width with the optimal beam direction obtained by beam alignment, selecting the optimal code word in the selected codebook and performing data transmission;
step 5, calculating and obtaining reward information according to the information in the data transmission process, compensating the reward, judging whether a network updating threshold is met, if so, executing step 6, otherwise, directly jumping back to step 3 to start the next circulation;
and step 6, updating the parameters of the policy network and the value network based on the reward information.
2. The intelligent beam width optimization method based on reinforcement learning AC according to claim 1, characterized in that, step 1 firstly characterizes the beam width optimization problem as MDP, and effectively and reasonably designs MDP model parameters including state design, action space and reward design, and adaptively adjusts the beam width for data transmission according to the severity of beam drift effect, fully balances the relationship between beam forming gain and effective time of data transmission;
if the time unit of the beam training process for determining the optimal beam direction is a time slot, and the time scale for selecting the beam width is defined as a time unit, and each time unit includes M time slots, the MDP parameter is defined as follows:
step 1.1. State design: obtaining M groups of beam training experiences (s_t, a_t, R_t, s_{t+1}) in M time slots, i.e., in one time unit, so that M beam training rewards can be obtained, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}   (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e., the effective achievable rate obtained by the beam training of that slot, and further taking as the state an indirect result set obtained by meaningful computations on the R sequence, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av denotes the mean of the R sequence, which represents the average magnitude of the R sequence;
· R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence, and the degree of dispersion is usually represented by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, in which case the standard deviation is unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale. It is calculated by first computing the standard deviation σ of the R sequence,

σ = √( (1/M) Σ_{i=1}^{M} (R_i − R_av)² )   (3)

The coefficient of variation is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av, i.e.

R_cv = σ / R_av   (4)

· R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing the height of the peak of the probability density curve at the mean, a statistic describing the steepness of the distribution of all values in the population; it is introduced to provide additional information for the system to learn. The kurtosis R_kur is calculated as

R_kur = ( (1/M) Σ_{i=1}^{M} (R_i − R_av)⁴ ) / σ⁴   (5)
step 1.2. Action design: the actions are defined as codebooks of different beam widths, i.e., different resolutions. The multi-resolution codebook is defined as the set of sub-codebooks S = {S_1, S_2, ..., S_K}, where S_k denotes the sub-codebook of the k-th resolution, and the corresponding action space A is defined over these sub-codebooks, so that each action selects one sub-codebook (beam width) for data transmission.
step 1.3. Reward design: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (6)

where R_{u,i} denotes the effective transmission rate of the i-th time slot in the u-th time unit.
3. The method of claim 1, wherein the initialization method of step 2 is as follows:
step 2.1. Construct a value network V and a policy network π, both composed of fully connected layers, and initialize the value network parameters ω and the policy network parameters θ;
step 2.2. Construct the multi-resolution DFT codebook S and randomly select an initial data transmission codebook;
step 2.3. Initialize the time unit count u = 1 and initialize the network update threshold l_max empirically;
step 2.4. Open up a storage space that stores, for each time unit u, the optimal action a_u^* selected in that time unit, the reward information generated by the beam training of the M time slots in the time unit, and the state information obtained from that reward information.
4. The method according to claim 1, wherein the step 3 sequentially executes the following steps in time unit u:
step 3.1. Obtain the reward sequence R of the M time slots in the time unit according to formula (1);
step 3.2. Calculate the state information s_u of the time unit according to formulas (3)–(5);
step 3.3. Obtain a probability distribution over all actions through the policy network π: for the action space A, the probabilities with which the policy network π(θ) selects the actions in state s sum to 1; the best action a_u^* is then selected accordingly, and the corresponding data transmission codebook S_u^* is chosen.
5. The intelligent beam width optimization method based on reinforcement learning AC according to claim 1, wherein step 4 is performed in sequence in time unit u as follows:
step 4.1. Assume that the beam training process aimed at finding the best beam direction employs the codebook S_s; from the beam alignment result, the best beam f_s^* of the time unit and its corresponding beam center angle φ_s^* are obtained;
step 4.2. The best codeword for data transmission is selected from S_u^*. Assume S_u^* = {f_1, f_2, ..., f_{N_u}}, where f_i denotes the i-th codeword and N_u denotes the number of codewords in the codebook, and let the beam center angle corresponding to codeword f_i in the codebook S_u^* be φ_i. The best data transmission codeword f_u^* is then:

f_u^* = f_{i*},  i* = argmin_{i ∈ I} |φ_i − φ_s^*|   (7)

where I denotes the set of natural numbers from 1 to N_u. In a specific implementation, if the index relation between codebooks of different resolutions follows a specific objective rule, a direct mapping g(·) between the indexes of different codebooks can be constructed, so that the optimal transmission codeword f_u^* is calculated directly by the mapping function.
6. The method of claim 1, wherein step 5 is performed in sequence in time unit u as follows:
step 5.1. Receive the return of the external environment and calculate the reward R_u of the time unit according to formula (6); because of the deviation between the codebook used for beam training and the codebook used for data transmission, the reward needs to be compensated as follows: with the default beam training codebook denoted S_s, the corrected reward R_u' of formula (8) is obtained by compensating R_u with a factor determined by S_S, the number of codewords of S_s, and N_u, the number of codewords of S_u^*;
step 5.2. Judge whether the network update threshold is met: a normalized critic loss is designed to characterize whether the network needs to be updated further; it is computed from the outputs of the value network over an observation window U_l, where ω denotes the value network parameters. When the normalized critic loss meets the network update threshold l_max, the neural network is trained.
7. The method of claim 1, wherein step 6 is performed in sequence in time unit u as follows:
step 6.1. Perform the best action a_u^* in time unit u, i.e., select the corresponding codeword for data transmission, and obtain the compensated reward R_u'; at the same time, the state s_{u+1} of the next time unit u+1 is obtained according to formula (2), and the best action ã_{u+1} of the next time unit is obtained from the policy network; this action is only used for the subsequent calculations and is not actually performed;
step 6.2. Feed the state s_u and the action a_u of the time unit into the value network to obtain q_u = q(s_u, a_u; ω_u), feed the next-time-unit state s_{u+1} and the predicted action ã_{u+1} into the value network to obtain q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u), and calculate the TD error δ_u = q_u − (R_u' + γ·q_{u+1});
step 6.3. Differentiate the value network to obtain d_{ω,u} = ∂q(s_u, a_u; ω)/∂ω |_{ω=ω_u}, and then update the value network by the temporal difference as ω_{u+1} = ω_u − α·δ_u·d_{ω,u};
step 6.4. Differentiate the policy network to obtain d_{θ,u} = ∂ln π(a_u|s_u; θ)/∂θ |_{θ=θ_u}, and then update the policy network parameters by gradient ascent as θ_{u+1} = θ_u + β·q_u·d_{θ,u}, where q_u may be replaced by δ_u;
step 6.5. Update the counter: u ← u + 1.
CN202210526035.3A 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning Pending CN115002804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210526035.3A CN115002804A (en) 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210526035.3A CN115002804A (en) 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115002804A (en) 2022-09-02

Family

ID=83027249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210526035.3A Pending CN115002804A (en) 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115002804A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination