CN115002804A - Intelligent beam width optimization method based on reinforcement learning - Google Patents

Intelligent beam width optimization method based on reinforcement learning Download PDF

Info

Publication number
CN115002804A
CN115002804A
Authority
CN
China
Prior art keywords
network
time unit
codebook
data transmission
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210526035.3A
Other languages
Chinese (zh)
Inventor
黄永明
陆昀程
胡梓炜
俞菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210526035.3A priority Critical patent/CN115002804A/en
Publication of CN115002804A publication Critical patent/CN115002804A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/06Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
    • H04B7/0613Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
    • H04B7/0615Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal
    • H04B7/0617Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal for beam forming
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/046Wireless resource allocation based on the type of the allocated resource the resource being in the space domain, e.g. beams
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Variable-Direction Aerials And Aerial Arrays (AREA)

Abstract

The invention discloses an intelligent beam width optimization method based on reinforcement learning. The method suppresses the beam drift effect in millimeter wave communication by dynamically adjusting the beam width: the algorithm models the dynamic beam width optimization problem as a Markov decision process and selects the optimal beam width for data transmission at each decision time. The state design characterizes the severity of the beam drift effect in the current system from multiple perspectives, each action corresponds to a different beam width, and the width of the data transmission beam is optimized by sensing how fast the environment changes. In each state, the optimal data transmission beam width is selected according to an AC algorithm; the policy network and the value network are continuously updated during training to improve the rationality and reliability of the model's selections, and a dynamic network update mechanism is introduced to reduce the computational burden of network updates. The throughput of the millimeter wave communication system under the beam drift effect is improved, and the quality of the communication link is therefore continuously guaranteed.

Description

Intelligent beam width optimization method based on reinforcement learning
Technical Field
The invention belongs to the field of wireless communication, and particularly relates to an intelligent beam width optimization method for millimeter wave communication based on the Actor-Critic (AC) reinforcement learning algorithm.
Background
Millimeter waves have attracted attention as a potential technology for meeting the demand for high-speed wireless data transmission owing to their large available bandwidth. However, compared with conventional low-frequency bands, the free-space path loss of the millimeter wave band is large and its ability to diffract around objects is poor, so an array antenna composed of tens or even hundreds of elements is required, and the signal-to-noise ratio is improved by concentrating energy through beamforming. The element spacing of such an array is on the same order of magnitude as the wavelength, which leaves ample room for reducing the physical volume of the millimeter wave antenna array, so it can be made small enough for a wide range of application scenarios. Beamforming gives millimeter wave links strong directivity, so beam tracking is needed to ensure the stability and quality of the communication link. Notably, due to the high cost of radio-frequency chains and the high power consumption of ADCs/DACs, millimeter wave communication systems typically employ a hybrid digital-analog precoding architecture in which the number of RF chains is much smaller than the number of antennas.
Beamforming is only the basis for improving the signal-to-noise ratio; selecting a reasonable direction for the directional beam is another major problem in obtaining a high signal-to-noise ratio. The main beam direction after beamforming needs to be aligned with the user in real time, i.e., beam training is needed to ensure that the transmitter's main beam is aligned with and tracks a moving receiver. Most existing schemes perform beam training within a time slot, obtain a high-gain beam direction, and use that direction for data transmission within the slot. However, this relies on the communication environment between the user and the base station remaining unchanged within each time slot. When the receiving user keeps moving, its angle of departure or angle of arrival changes continuously; even if beam training aligns the transmit and receive main beams at the beginning of each slot, the receiver's motion within the slot changes its position relative to the transmitter, which reduces the beamforming/array gain. This is called the beam drift effect; when severe, it degrades the quality of the communication link and may even cause communication interruption.
The beam training process is a process of balancing beam alignment success rate against beam training efficiency: a narrower beam usually brings higher beamforming gain, but at the same time it generates a larger search burden and reduces search efficiency. When the beam drift effect is considered, the relative invariance of the environment within a time slot is broken, and the faster the environment changes within the slot, the more severely the beamforming gain is affected. To suppress the beam drift effect, a trade-off must be made between the beamforming gain and the effective time available for data transmission. Although the beam drift effect can be mitigated by shortening each slot, this increases the beam training overhead and reduces the effective achievable rate. Departing from the idea of always using narrow beams, when a beam is widened the coverage area of a single beam becomes larger; although the maximum beamforming gain is lower than that of a narrow beam, better data transmission efficiency can still be achieved than when beam alignment fails within the slot. Therefore, adjusting the beam width in real time according to changes in the communication environment is an effective way to cope with the beam drift effect.
Disclosure of Invention
The technical problem is as follows: in order to inhibit the beam drift effect and ensure the stability of a communication link, the invention provides an intelligent beam width optimization method, which counteracts the influence of the beam drift effect on the beam forming gain by adaptively adjusting the beam width. The method can improve the throughput of the millimeter wave communication system and ensure the stability of the quality of the communication link in the whole communication process.
The technical scheme is as follows: the intelligent beam width optimization method based on reinforcement learning comprises the following steps,
step 1, modeling the beam width optimization problem as a Markov decision process (MDP) and adopting the Actor-Critic algorithm, which combines policy learning and value learning; codebooks of different resolutions, i.e., different beam widths, are defined as the action space, the policy network and the value network are both fully-connected networks, the policy network is responsible for selecting the optimal action at each decision time, and the value network evaluates the policy network's selections according to feedback from the objective environment; through continuous interaction with the environment, the dynamic characteristics of the environment are sensed and the optimal codebook codeword is selected for data transmission;
step 2, initialization: establishing and initializing a policy network, a value network, a multi-resolution DFT codebook, an initial data transmission codebook, a time unit counter, the channel environment, and a network update threshold, and opening up a storage space;
step 3, at the beginning of each decision period, constructing the current state according to the obtained information and the state design rule, and determining the optimal action, namely the beam width, through the policy network;
step 4, combining the optimal beam width and the optimal beam direction obtained by beam alignment, selecting the optimal code word in the selected codebook and performing data transmission;
step 5, calculating and obtaining reward information according to the information in the data transmission process, compensating the reward, judging whether a network updating threshold is met, if so, executing step 6, otherwise, directly jumping back to step 3 to start the next circulation;
and step 6, updating the parameters of the policy network and the value network based on the reward information.
Specifically,
step 1, firstly, characterizing a beam width optimization problem as MDP, effectively and reasonably designing MDP model parameters including state design, action space and reward design, adaptively adjusting the beam width for data transmission according to the severity of a beam drift effect, and fully balancing the relationship between beam forming gain and effective data transmission time;
if the time unit of the beam training process for determining the optimal beam direction is a time slot, and the time scale for selecting the beam width is defined as a time unit, and each time unit includes M time slots, the MDP parameter is defined as follows:
Step 1.1. State design: within M time slots, i.e., one time unit, M groups of beam training experiences (s_t, a_t, R_t, s_{t+1}) are obtained, and thus M beam training rewards can be obtained, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}   (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e., the effective achievable rate obtained by the beam training of that slot. An indirect result set obtained by meaningful computations on the R sequence is then taken as the state, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av denotes the mean of the R sequence, which represents the average magnitude of the R sequence;
· R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence, and the degree of dispersion is usually represented by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, in which case the standard deviation is unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale. It is calculated by first computing the standard deviation σ of the R sequence,

σ = √( (1/M) Σ_{i=1}^{M} (R_i − R_av)² )   (3)

The coefficient of variation is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av, i.e.

R_cv = σ / R_av   (4)

· R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing the height of the peak of the probability density curve at the mean, a statistic describing the steepness of the distribution of all values in the population; it is introduced to provide additional information for the system to learn. The kurtosis R_kur is calculated as

R_kur = ( (1/M) Σ_{i=1}^{M} (R_i − R_av)⁴ ) / σ⁴   (5)
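For illustration, a minimal numpy sketch of the state construction in step 1.1 is given below; it assumes population statistics (dividing by M) for the standard deviation and kurtosis, which the text does not fix explicitly.

```python
import numpy as np

def build_state(R):
    """Build the MDP state {R_av, R_cv, R_kur} from the M per-slot
    rewards (effective achievable rates) of one time unit."""
    R = np.asarray(R, dtype=float)
    R_av = R.mean()                              # mean of the reward sequence
    sigma = np.sqrt(np.mean((R - R_av) ** 2))    # standard deviation of the R sequence
    R_cv = sigma / R_av                          # coefficient of variation
    R_kur = np.mean((R - R_av) ** 4) / sigma**4  # kurtosis
    return np.array([R_av, R_cv, R_kur])

# Example: rewards of M = 5 slots in one time unit (hypothetical values)
state = build_state([3.1, 2.9, 2.7, 2.2, 1.8])
```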
Step 1.2. Action design: the actions are defined as codebooks of different beam widths, i.e., different resolutions. The multi-resolution codebook is defined as the set of sub-codebooks S = {S_1, S_2, ..., S_K}, where S_k denotes the sub-codebook of the k-th resolution, and the corresponding action space A is defined over these sub-codebooks, so that each action selects one sub-codebook (beam width) for data transmission.
Step 1.3. Reward design: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (6)

where R_{u,i} denotes the effective transmission rate of the i-th time slot in the u-th time unit.
The initialization method of the step 2 is as follows:
Step 2.1. Construct a value network V and a policy network π, both composed of fully connected layers, and initialize the value network parameters ω and the policy network parameters θ;
Step 2.2. Construct the multi-resolution DFT codebook S and randomly select an initial data transmission codebook;
Step 2.3. Initialize the time unit count u = 1 and initialize the network update threshold l_max empirically;
Step 2.4. Open up a storage space that stores, for each time unit u, the optimal action a_u^* selected in that time unit, the reward information generated by the beam training of the M time slots in the time unit, and the state information obtained from that reward information.
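A minimal PyTorch sketch of the initialization in step 2 might look as follows; the layer width of 64 and realizing the value network as one Q-value per action are illustrative assumptions (the state has 3 features from step 1.1, and there is one action per sub-codebook, e.g. 4 as in the simulation section).

```python
import torch
import torch.nn as nn

STATE_DIM = 3    # {R_av, R_cv, R_kur}
N_ACTIONS = 4    # one action per sub-codebook / beam width (assumed)

# Policy network pi(a|s; theta): fully connected, outputs a probability per action
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS), nn.Softmax(dim=-1),
)

# Value network q(s, a; omega): fully connected, here one Q-value per action
value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
```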
Step 3 executes the following steps in sequence in time unit u:
Step 3.1. Obtain the reward sequence R of the M time slots in the time unit according to formula (1);
Step 3.2. Calculate the state information s_u of the time unit according to formulas (3)–(5);
Step 3.3. Obtain a probability distribution over all actions through the policy network π: for the action space A, the probabilities with which the policy network π(θ) selects the actions in state s sum to 1; the best action a_u^* is then selected accordingly, and the corresponding data transmission codebook S_u^* is chosen.
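As an illustration, the action selection in step 3.3 could be realized as below; sampling from the policy distribution during training and taking the arg-max at evaluation time is an assumption, since the text only states that the best action is selected through the policy network.

```python
import torch

def select_action(policy_net, state, greedy=False):
    """Return the index of the sub-codebook chosen by the policy network."""
    s = torch.as_tensor(state, dtype=torch.float32)
    probs = policy_net(s)                      # pi(a|s; theta), sums to 1
    if greedy:
        return int(torch.argmax(probs))        # exploit: most probable beam width
    return int(torch.distributions.Categorical(probs).sample())  # explore
```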
Step 4 executes the following steps in sequence in time unit u:
Step 4.1. Assume that the beam training process aimed at finding the best beam direction employs the codebook S_s; from the beam alignment result, the best beam f_s^* of the time unit and its corresponding beam center angle φ_s^* are obtained;
Step 4.2. The best codeword for data transmission is selected from S_u^*. Assume S_u^* = {f_1, f_2, ..., f_{N_u}}, where f_i denotes the i-th codeword and N_u denotes the number of codewords in the codebook, and let the beam center angle corresponding to codeword f_i in the codebook S_u^* be φ_i. The best data transmission codeword f_u^* is then

f_u^* = f_{i*},  i* = argmin_{i ∈ I} |φ_i − φ_s^*|   (7)

where I denotes the set of natural numbers from 1 to N_u. In a specific implementation, if the index relation between codebooks of different resolutions follows a specific objective rule, a direct mapping g(·) between the indexes of different codebooks can be constructed, so that the optimal transmission codeword f_u^* is calculated directly by the mapping function.
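A sketch of the codeword selection in step 4.2, under the assumption that each codebook stores the beam center angle of every codeword; the nested-DFT index mapping mentioned in the final comment is only one possible "objective rule".

```python
import numpy as np

def best_data_codeword(centers_u, phi_train):
    """Pick the index of the codeword in the selected data-transmission
    codebook whose beam center angle is closest to the center angle of
    the beam found by beam training (formula (7))."""
    centers_u = np.asarray(centers_u)          # beam center angles of S_u^*
    return int(np.argmin(np.abs(centers_u - phi_train)))

# With nested DFT codebooks of sizes 8/16/32/64, a direct index mapping g(.)
# is also possible, e.g. i_data = i_train * N_u // N_train (an assumption).
```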
Step 5 executes the following steps in sequence in time unit u:
Step 5.1. Receive the return of the external environment and calculate the reward R_u of the time unit according to formula (6). Because of the deviation between the codebook used for beam training and the codebook used for data transmission, the reward needs to be compensated as follows: with the default beam training codebook denoted S_s, the corrected reward R_u' of formula (8) is obtained by compensating R_u with a factor determined by S_S, the number of codewords of S_s, and N_u, the number of codewords of S_u^*;
Step 5.2. Judge whether the network update threshold is met: a normalized critic loss is designed to characterize whether the network needs to be updated further; it is computed from the outputs of the value network over an observation window U_l, where ω denotes the value network parameters. When the normalized critic loss meets the network update threshold l_max, the neural network is trained.
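The exact form of the normalized critic loss is not reproduced here; the sketch below assumes it is the TD-error magnitude averaged over the last U_l time units and normalized by the corresponding rewards, which is only one plausible realization of the gating described in step 5.2.

```python
from collections import deque

class UpdateGate:
    """Decide whether the actor/critic networks should be trained this
    time unit, based on recent TD errors (assumed form of the loss)."""
    def __init__(self, window, l_max):
        self.errors = deque(maxlen=window)   # observation window U_l
        self.l_max = l_max                   # network update threshold

    def should_update(self, td_error, reward):
        self.errors.append(abs(td_error) / max(abs(reward), 1e-8))
        loss = sum(self.errors) / len(self.errors)   # normalized critic loss
        return loss >= self.l_max
```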
Step 6 executes the following steps in sequence in time unit u:
Step 6.1. Perform the best action a_u^* in time unit u, i.e., select the corresponding codeword for data transmission, and obtain the compensated reward R_u'; at the same time, the state s_{u+1} of the next time unit u+1 is obtained according to formula (2), and the best action ã_{u+1} of the next time unit is obtained from the policy network; this action is only used for the subsequent calculations and is not actually performed;
Step 6.2. Feed the state s_u and the action a_u of the time unit into the value network to obtain q_u = q(s_u, a_u; ω_u), feed the next-time-unit state s_{u+1} and the predicted action ã_{u+1} into the value network to obtain q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u), and calculate the TD error δ_u = q_u − (R_u' + γ·q_{u+1});
Step 6.3. Differentiate the value network to obtain d_{ω,u} = ∂q(s_u, a_u; ω)/∂ω |_{ω=ω_u}, and then update the value network by the temporal difference as ω_{u+1} = ω_u − α·δ_u·d_{ω,u};
Step 6.4. Differentiate the policy network to obtain d_{θ,u} = ∂ln π(a_u|s_u; θ)/∂θ |_{θ=θ_u}, and then update the policy network parameters by gradient ascent as θ_{u+1} = θ_u + β·q_u·d_{θ,u}, where q_u may be replaced by δ_u;
Step 6.5. Update the counter: u ← u + 1.
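The update in step 6 can be sketched as follows; it assumes the networks from the initialization sketch above (one Q-value per action) and uses plain gradient steps with learning rates alpha and beta so as to match the update formulas, rather than a particular optimizer.

```python
import torch

def ac_update(policy_net, value_net, s_u, a_u, r_u, s_next, a_next,
              alpha=1e-3, beta=1e-3, gamma=0.9):
    """One Actor-Critic update for time unit u (step 6)."""
    s_u, s_next = map(lambda x: torch.as_tensor(x, dtype=torch.float32), (s_u, s_next))

    q_u = value_net(s_u)[a_u]                        # q(s_u, a_u; omega)
    with torch.no_grad():
        q_next = value_net(s_next)[a_next]           # q(s_{u+1}, a~_{u+1}; omega)
    delta = q_u.detach() - (r_u + gamma * q_next)    # TD error delta_u

    # Critic: omega <- omega - alpha * delta * d_omega q
    value_net.zero_grad()
    q_u.backward()
    with torch.no_grad():
        for p in value_net.parameters():
            p -= alpha * delta * p.grad

    # Actor: theta <- theta + beta * q_u * d_theta log pi(a_u|s_u)
    # (q_u may be replaced by delta, as noted in step 6.4)
    policy_net.zero_grad()
    log_prob = torch.log(policy_net(s_u)[a_u])
    log_prob.backward()
    with torch.no_grad():
        for p in policy_net.parameters():
            p += beta * q_u.detach() * p.grad
```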
Beneficial effects: compared with the prior art, the invention has the following advantages:
1) On the basis of beam alignment training that uses only a single codebook, the more realistic scenario, namely the relative change of the environment within a time slot, is considered. A multi-resolution codebook is introduced, the codebook with the optimal resolution is adaptively selected at each decision point by sensing the rate of change of the channel direction angle within the slot, and the most efficient codeword is selected in combination with beam training, thereby improving system performance.
2) The algorithm does not require channel state information. Taking the real-time variation of the actual environment into account, reinforcement learning is introduced to interact with the environment in real time, continuously learn by trial and error, and perceive the pattern of environmental change, making the most reasonable decision through a neural network that fully utilizes the available computing power.
3) Under the beam drift effect, unlike the traditional approach, narrow beams are used for beam training while a multi-resolution codebook is introduced for data transmission, taking both the beamforming gain and the effective data transmission time into account. Simulations show that the algorithm performs better than algorithms using a single codebook, and the performance advantage is more obvious when the environment changes more drastically, i.e., when the beam drift effect is more severe.
4) The AC algorithm makes action selections based on policy learning, which is more direct than other algorithms. Through a reasonable and concise state design and the introduction of a dynamic training mechanism, the algorithm greatly reduces the training burden of the neural network and does not require a large storage space to store long histories of context information.
Drawings
FIG. 1 is a flow chart of an intelligent beam width optimization algorithm;
FIG. 2 is a diagram of the information transfer between the beam training module and the data transmission module;
FIG. 3 is a diagram illustrating average EAR performance of different codebooks at different SNR;
FIG. 4 is a diagram illustrating the relationship between the average EAR performance and the environmental change rate in different codebooks.
Detailed Description
In order to make the technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to examples and software simulation.
Beam drift effects are taken into account, i.e. the beamforming gain may suffer due to a change in the aligned azimuth of the transmitter and receiver during the beam training time slot. The algorithm interacts with a dynamic environment by introducing reinforcement learning under the condition of not needing channel state information, continuously senses the change rule of the environment in a time slot, and adaptively adjusts the width of a data transmission beam so as to ensure the stability of the quality of a communication link.
In a specific implementation scheme, a beam width optimization problem is modeled into an MDP model, an optimal beam width is selected in each time unit through reasonable and efficient action, state and reward design, an optimal code word is selected for data transmission by combining a beam training alignment result, and neural network parameters are updated according to conditions in each time unit cycle to enable the selection to be more reasonable.
1. Model building
In the MDP model, a state space, an action space, and a reward are defined first. Assuming that the minimum time unit is a slot, every M slots are defined as one time unit, i.e., each time unit includes M slots.
Defining a state: within M time slots, i.e., one time unit, M sets of beam training experiences (s_t, a_t, R_t, s_{t+1}) can be obtained, and thus M rewards can be obtained, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}   (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e., the effective achievable rate obtained by the beam training of that slot. An indirect result set obtained by meaningful computations on the R sequence is then taken as the state, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av denotes the mean of the R sequence, which represents the average magnitude of the R sequence;
· R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence, and the degree of dispersion is often represented by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, in which case the standard deviation is unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale. It is calculated by first computing the standard deviation σ of the R sequence,

σ = √( (1/M) Σ_{i=1}^{M} (R_i − R_av)² )   (3)

The coefficient of variation is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av, i.e.

R_cv = σ / R_av   (4)

· R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing the height of the peak of the probability density curve at the mean, a statistic describing the steepness of the distribution of all values in the population; it is introduced to provide additional information for the system to learn. When the sequence has n samples (here n = M), the kurtosis R_kur is calculated as

R_kur = ( (1/n) Σ_{i=1}^{n} (R_i − R_av)⁴ ) / σ⁴   (5)
Defining the action space: the actions are defined as codebooks of different beam widths, i.e., different resolutions. The multi-resolution codebook is defined as the set of sub-codebooks S = {S_1, S_2, ..., S_K}, and the corresponding action space A is defined over these sub-codebooks, so that each action selects one sub-codebook (beam width) for data transmission.
Defining the reward: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (6)

where R_{u,i} denotes the effective transmission rate of the i-th time slot in the u-th time unit.
Based on the state, the policy network selects the optimal codebook S_u^*; the best codeword then needs to be selected in combination with the beam alignment result. A direct mapping function g(·) can be established between the codeword indexes of different codebooks, and the optimal transmission codeword f_u^* is then calculated directly by the mapping function.
After data transmission, the return of the external environment is received as the reward, and the reward is compensated. With the codebook used by default for beam training denoted S_s, the corrected reward R_u' of formula (8) is obtained by compensating R_u with a factor determined by S_S, the number of codewords of S_s, and N_u, the number of codewords of S_u^*.
It is then judged whether the network update threshold is met: a normalized critic loss is designed to characterize whether the network needs to be updated further; it is computed from the outputs of the value network over an observation window U_l, where ω denotes the value network parameters. When the normalized critic loss meets the network update threshold, the neural network is trained; otherwise, the learning process is skipped.
2. Implementation of AC-based beam width optimization algorithm
Input: the multi-resolution codebook S and the action space A.
Initialization: establish the value network V and the policy network π, initialize the neural network parameters ω and θ and the network update threshold l_max, open up the storage space, randomly select the initial data transmission codebook, and set the time unit count u = 1.
Loop body: for each time unit u, the following steps are repeatedly performed:
(1) Obtain the reward sequence R of the M time slots in the time unit according to formula (1);
(2) Calculate the state information s_u of the time unit according to formulas (3)–(5);
(3) Select the best action a_u^* through the policy network π and select the corresponding transmission codebook S_u^*;
(4) Determine the best data transmission codeword f_u^* in S_u^* according to formula (7) and perform data transmission;
(5) Receive the return value of the external environment and obtain the modified, compensated reward R_u' according to formula (8);
(6) Calculate the normalized critic loss; if it meets the network update threshold, proceed to step (7), otherwise go directly to step (10);
(7) Obtain the state s_{u+1} of the next time unit u+1 according to formula (2) and obtain the best action ã_{u+1} of the next time unit from the policy network;
(8) Compute q_u = q(s_u, a_u; ω_u) and q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u) through the value network, and calculate the TD error δ_u = q_u − (R_u' + γ·q_{u+1});
(9) Update the value network parameters and the policy network parameters separately: ω_{u+1} = ω_u − α·δ_u·d_{ω,u}, θ_{u+1} = θ_u + β·q_u·d_{θ,u};
(10) Update the counter: u ← u + 1.
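Tying the earlier sketches together, a hypothetical end-to-end loop for the algorithm above could look like this; the `env` object (returning the per-slot rewards of a time unit and the compensated reward for the chosen beam width) is an assumed interface, not something defined in the patent.

```python
import torch

def run_bws(env, policy_net, value_net, gate, num_units, gamma=0.9):
    """Hypothetical driver loop using build_state, select_action, ac_update
    and UpdateGate from the sketches above."""
    R = env.reset()                               # per-slot rewards of the first time unit
    for u in range(num_units):
        s_u = build_state(R)
        a_u = select_action(policy_net, s_u)      # index of the chosen sub-codebook
        R, r_u = env.step(a_u)                    # next per-slot rewards, compensated reward R_u'
        s_next = build_state(R)
        a_next = select_action(policy_net, s_next, greedy=True)  # used only for the TD target
        with torch.no_grad():
            q_u = float(value_net(torch.as_tensor(s_u, dtype=torch.float32))[a_u])
            q_next = float(value_net(torch.as_tensor(s_next, dtype=torch.float32))[a_next])
        delta = q_u - (r_u + gamma * q_next)      # TD error, here used only for the update gate
        if gate.should_update(delta, r_u):        # dynamic network update mechanism
            ac_update(policy_net, value_net, s_u, a_u, r_u, s_next, a_next, gamma=gamma)
```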
3. Simulation environment and results
The following environment is adopted during simulation:
1) The simulation generates a time-varying millimeter wave channel environment in an office. The base station center frequency is 28 GHz, the transmit power is 10 dBm, the office size is 20 × 20 meters, and the number of elements N of the uniform linear array (ULA) is 64.
2) The number of channel path clusters is randomly generated between 2 and 5, and each cluster contains only one path for ease of tracking. The power allocation between the line-of-sight (LOS) path and the non-line-of-sight (NLOS) paths is determined by the Rician factor K, and the attenuation of the NLOS paths approximately follows an exponential distribution. Since the exponential decay is fast, in practice only the first NLOS path is likely to be significant; the power of the other paths is approximately 0.
3) The uniform linear array (ULA) covers a 180° region with azimuth φ satisfying 0 < φ < π, and codebooks of different resolutions divide the sine of this range uniformly with different granularities. The codebook S comprises 4 sub-codebooks of different resolutions, S_1, S_2, S_3 and S_4, with 8, 16, 32 and 64 codewords respectively. For convenience, each sub-codebook S_i (i = 1, 2, 3, 4) is abbreviated as MRC_i (a construction sketch of these codebooks is given after this list).
4) The environmental change rate is quantified by μ; the larger the value of μ, the faster the environment changes.
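As referenced in setting 3), the multi-resolution DFT codebook could be generated as in the following sketch; uniform sampling of the normalized spatial frequency and the specific angle convention are assumptions, since only the codeword counts are specified.

```python
import numpy as np

def dft_codebook(num_codewords, n_antennas=64):
    """Array-response codewords of a ULA with uniformly sampled spatial
    frequencies; fewer codewords means wider beams."""
    # Uniformly sample the normalized spatial frequency (sine/cosine of the
    # beam center angle); the exact angle convention is an assumption.
    omegas = -1 + (2 * np.arange(num_codewords) + 1) / num_codewords
    n = np.arange(n_antennas)
    codebook = np.exp(1j * np.pi * np.outer(omegas, n)) / np.sqrt(n_antennas)
    return codebook   # shape: (num_codewords, n_antennas)

# Multi-resolution codebook S = {MRC_1, ..., MRC_4} with 8/16/32/64 codewords
multi_res_codebook = [dft_codebook(k) for k in (8, 16, 32, 64)]
```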
Simulation result analysis:
Fig. 3 shows the relationship between EAR performance and signal-to-noise ratio for the different sub-codebooks. The average effective achievable rate (EAR) represents the effective transmission rate per unit spectrum and characterizes the throughput of the system. It can be seen that as the signal-to-noise ratio increases, the EAR performance of the different sub-codebooks also increases. However, the EAR performance achieved by the BWS algorithm is much better than that achieved with any fixed sub-codebook. The reason is that the BWS algorithm is designed to sense the rate of change of the environment and adjust the beam width in real time, which achieves a good trade-off between obtaining high array gain and mitigating the beam drift effect.
Fig. 4 shows the average EAR performance versus the rate of change of the environment for the different codebooks. It can be seen that as the environment changes faster, the influence of the beam drift effect becomes more severe and the EAR performance achieved by the different sub-codebooks decreases accordingly. However, the EAR performance of the BWS algorithm is better than that of any fixed sub-codebook at all environmental change rates, and as the environment changes faster, the performance gap between them becomes larger. These observations indicate that the designed BWS algorithm can effectively counteract the influence of the beam drift effect in a dynamic environment and maintain good achievable EAR performance.
In addition, it can be observed that when μ is small enough, i.e., the environment changes slowly (e.g., when μ < 0.08), MRC_4 achieves the highest EAR performance among the 4 sub-codebooks. The reason is that in this case the beam drift effect is not significant; since MRC_4 has the narrowest beam-width codewords, its gain is the highest, and the EAR performance of the BWS algorithm is almost identical to that of MRC_4. Meanwhile, as the environmental change rate increases, the beam drift effect becomes more and more pronounced, and the EAR performance of MRC_4 deteriorates rapidly, becoming worse than that of MRC_3, which indicates that narrow beams are more susceptible to the beam drift effect. MRC_1 achieves the worst EAR performance because, although the wide beam is robust to the beam drift effect, its array gain is relatively low, which in turn degrades the EAR performance. In contrast, the proposed algorithm achieves a good balance between obtaining high array gain and mitigating the beam drift effect, and senses and adapts to the changing environment in real time.
In conclusion, the intelligent beam width optimization (BWS) algorithm achieves a trade-off between obtaining high array gain and suppressing the beam drift effect, and can sense and adapt to a constantly changing environment in real time, thereby improving system throughput.

Claims (7)

1. An intelligent beam width optimization method based on reinforcement learning is characterized by comprising the following steps,
step 1, modeling the beam width optimization problem as a Markov decision process (MDP) and adopting the Actor-Critic algorithm, which combines policy learning and value learning; codebooks of different resolutions, i.e., different beam widths, are defined as the action space, the policy network and the value network are both fully-connected networks, the policy network is responsible for selecting the optimal action at each decision time, and the value network evaluates the policy network's selections according to feedback from the objective environment; through continuous interaction with the environment, the dynamic characteristics of the environment are sensed and the optimal codebook codeword is selected for data transmission;
step 2, initialization: establishing and initializing a policy network, a value network, a multi-resolution DFT codebook, an initial data transmission codebook, a time unit counter, the channel environment, and a network update threshold, and opening up a storage space;
step 3, at the beginning of each decision period, constructing the current state according to the obtained information and the state design rule, and determining the optimal action, namely the beam width, through the policy network;
step 4, combining the optimal beam width with the optimal beam direction obtained by beam alignment, selecting the optimal code word in the selected codebook and performing data transmission;
step 5, calculating and obtaining reward information according to the information in the data transmission process, compensating the reward, judging whether a network updating threshold is met, if so, executing step 6, otherwise, directly jumping back to step 3 to start the next circulation;
and step 6, updating the parameters of the policy network and the value network based on the reward information.
2. The intelligent beam width optimization method based on reinforcement learning AC according to claim 1, characterized in that, step 1 firstly characterizes the beam width optimization problem as MDP, and effectively and reasonably designs MDP model parameters including state design, action space and reward design, and adaptively adjusts the beam width for data transmission according to the severity of beam drift effect, fully balances the relationship between beam forming gain and effective time of data transmission;
if the time unit of the beam training process for determining the optimal beam direction is a time slot, and the time scale for selecting the beam width is defined as a time unit, and each time unit includes M time slots, the MDP parameter is defined as follows:
step 1.1. State design: obtaining M groups of beam training experiences (s_t, a_t, R_t, s_{t+1}) in M time slots, i.e., in one time unit, so that M beam training rewards can be obtained, i.e.

R = {R_t, R_{t+1}, ..., R_i, ..., R_{t+M-2}, R_{t+M-1}}   (1)

where R_i denotes the reward of the i-th time slot in the time unit, i.e., the effective achievable rate obtained by the beam training of that slot, and further taking as the state an indirect result set obtained by meaningful computations on the R sequence, i.e.

State = {R_av, R_cv, R_kur}   (2)
The meaning of each parameter is as follows:
· R_av denotes the mean of the R sequence, which represents the average magnitude of the R sequence;
· R_cv denotes the coefficient of variation of the R sequence. The mean and the degree of dispersion are the two main characteristics of a sequence, and the degree of dispersion is usually represented by the variance or standard deviation; however, when objective conditions such as the beam alignment conditions of different time slots differ, the measurement scales of different groups of data may differ greatly, in which case the standard deviation is unsuitable for comparison, whereas the coefficient of variation eliminates the difference in measurement scale. It is calculated by first computing the standard deviation σ of the R sequence,

σ = √( (1/M) Σ_{i=1}^{M} (R_i − R_av)² )   (3)

The coefficient of variation is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean R_av, i.e.

R_cv = σ / R_av   (4)

· R_kur denotes the kurtosis of the R sequence. The kurtosis is the characteristic number describing the height of the peak of the probability density curve at the mean, a statistic describing the steepness of the distribution of all values in the population; it is introduced to provide additional information for the system to learn. The kurtosis R_kur is calculated as

R_kur = ( (1/M) Σ_{i=1}^{M} (R_i − R_av)⁴ ) / σ⁴   (5)
step 1.2. Action design: the actions are defined as codebooks of different beam widths, i.e., different resolutions. The multi-resolution codebook is defined as the set of sub-codebooks S = {S_1, S_2, ..., S_K}, where S_k denotes the sub-codebook of the k-th resolution, and the corresponding action space A is defined over these sub-codebooks, so that each action selects one sub-codebook (beam width) for data transmission.
step 1.3. Reward design: the reward is defined as the sum rate of the M time slots in each time unit, i.e.

R_u = Σ_{i=1}^{M} R_{u,i}   (6)

where R_{u,i} denotes the effective transmission rate of the i-th time slot in the u-th time unit.
3. The method of claim 1, wherein the initialization method of step 2 is as follows:
step 2.1. Construct a value network V and a policy network π, both composed of fully connected layers, and initialize the value network parameters ω and the policy network parameters θ;
step 2.2. Construct the multi-resolution DFT codebook S and randomly select an initial data transmission codebook;
step 2.3. Initialize the time unit count u = 1 and initialize the network update threshold l_max empirically;
step 2.4. Open up a storage space that stores, for each time unit u, the optimal action a_u^* selected in that time unit, the reward information generated by the beam training of the M time slots in the time unit, and the state information obtained from that reward information.
4. The method according to claim 1, wherein the step 3 sequentially executes the following steps in time unit u:
step 3.1. Obtain the reward sequence R of the M time slots in the time unit according to formula (1);
step 3.2. Calculate the state information s_u of the time unit according to formulas (3)–(5);
step 3.3. Obtain a probability distribution over all actions through the policy network π: for the action space A, the probabilities with which the policy network π(θ) selects the actions in state s sum to 1; the best action a_u^* is then selected accordingly, and the corresponding data transmission codebook S_u^* is chosen.
5. The intelligent beam width optimization method based on reinforcement learning AC according to claim 1, wherein step 4 is performed in sequence in time unit u as follows:
step 4.1. Assume that the beam training process aimed at finding the best beam direction employs the codebook S_s; from the beam alignment result, the best beam f_s^* of the time unit and its corresponding beam center angle φ_s^* are obtained;
step 4.2. The best codeword for data transmission is selected from S_u^*. Assume S_u^* = {f_1, f_2, ..., f_{N_u}}, where f_i denotes the i-th codeword and N_u denotes the number of codewords in the codebook, and let the beam center angle corresponding to codeword f_i in the codebook S_u^* be φ_i. The best data transmission codeword f_u^* is then:

f_u^* = f_{i*},  i* = argmin_{i ∈ I} |φ_i − φ_s^*|   (7)

where I denotes the set of natural numbers from 1 to N_u. In a specific implementation, if the index relation between codebooks of different resolutions follows a specific objective rule, a direct mapping g(·) between the indexes of different codebooks can be constructed, so that the optimal transmission codeword f_u^* is calculated directly by the mapping function.
6. The method of claim 1, wherein step 5 is performed in sequence in time unit u as follows:
step 5.1. Receive the return of the external environment and calculate the reward R_u of the time unit according to formula (6); because of the deviation between the codebook used for beam training and the codebook used for data transmission, the reward needs to be compensated as follows: with the default beam training codebook denoted S_s, the corrected reward R_u' of formula (8) is obtained by compensating R_u with a factor determined by S_S, the number of codewords of S_s, and N_u, the number of codewords of S_u^*;
step 5.2. Judge whether the network update threshold is met: a normalized critic loss is designed to characterize whether the network needs to be updated further; it is computed from the outputs of the value network over an observation window U_l, where ω denotes the value network parameters. When the normalized critic loss meets the network update threshold l_max, the neural network is trained.
7. The method of claim 1, wherein step 6 is performed in sequence in time unit u as follows:
step 6.1. Perform the best action a_u^* in time unit u, i.e., select the corresponding codeword for data transmission, and obtain the compensated reward R_u'; at the same time, the state s_{u+1} of the next time unit u+1 is obtained according to formula (2), and the best action ã_{u+1} of the next time unit is obtained from the policy network; this action is only used for the subsequent calculations and is not actually performed;
step 6.2. Feed the state s_u and the action a_u of the time unit into the value network to obtain q_u = q(s_u, a_u; ω_u), feed the next-time-unit state s_{u+1} and the predicted action ã_{u+1} into the value network to obtain q_{u+1} = q(s_{u+1}, ã_{u+1}; ω_u), and calculate the TD error δ_u = q_u − (R_u' + γ·q_{u+1});
step 6.3. Differentiate the value network to obtain d_{ω,u} = ∂q(s_u, a_u; ω)/∂ω |_{ω=ω_u}, and then update the value network by the temporal difference as ω_{u+1} = ω_u − α·δ_u·d_{ω,u};
step 6.4. Differentiate the policy network to obtain d_{θ,u} = ∂ln π(a_u|s_u; θ)/∂θ |_{θ=θ_u}, and then update the policy network parameters by gradient ascent as θ_{u+1} = θ_u + β·q_u·d_{θ,u}, where q_u may be replaced by δ_u;
step 6.5. Update the counter: u ← u + 1.
CN202210526035.3A 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning Pending CN115002804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210526035.3A CN115002804A (en) 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210526035.3A CN115002804A (en) 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115002804A (en) 2022-09-02

Family

ID=83027249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210526035.3A Pending CN115002804A (en) 2022-05-13 2022-05-13 Intelligent beam width optimization method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115002804A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination