CN113411110B - Millimeter wave communication beam training method based on deep reinforcement learning - Google Patents
Publication number: CN113411110B (application CN202110623890.1A); Authority: CN (China); Legal status: Active (the legal status is an assumption and is not a legal conclusion).
Classifications: H04B7/0617 (beam forming at the transmitting station), H04B7/086 (weighted combining using weights depending on external parameters, e.g. direction of arrival), G06N3/045 (combinations of neural networks).
Abstract
The invention discloses a millimeter wave communication beam training method based on deep reinforcement learning. The method tracks the millimeter wave channel by defining concrete representations, within the practical beam training problem, of the elements of a reinforcement learning model such as state, action and reward. The state is defined in image form, and the value function of reinforcement learning is approximated with a convolutional neural network; the action is defined as a triple consisting of the moving direction, the offset distance and the beam coverage range relative to the optimal beam combination of the channel at the previous moment. The reward function takes the effective data achievable rate within one time slice as the target value. During training of the neural network, the network parameters are updated with the Q-learning method. The trained deep Q-network is then used for prediction, and the action with the maximum Q value is selected, corresponding to the beam combination to be tested at the next moment.
Description
Technical Field
The invention relates to the technical field of millimeter wave wireless communication, in particular to a millimeter wave communication beam training method based on deep reinforcement learning.
Background
With the continuous development of wireless communication technology, the spectrum resources of the lower frequency bands are almost completely occupied. In order to meet communication performance requirements and obtain more spectrum resources, attention has turned to higher frequency bands, namely the millimeter wave band. This band covers frequencies in the range of 30-300 GHz; its spectrum resources are rich and its transmission rate is high, so it can meet the needs of applications with high bandwidth requirements. However, due to the propagation characteristics of millimeter wave signals, the path loss of a millimeter wave channel is high compared with a microwave channel. Considering that the wavelength of a millimeter wave signal is short compared with a microwave signal and that antenna spacing is generally positively correlated with signal wavelength, a large number of antennas can be concentrated in a small space to form a large-scale antenna array and provide high gain. Large-scale MIMO technology and millimeter wave communication are mutually complementary: millimeter wave communication solves the spectrum shortage of large-scale MIMO, while large-scale MIMO compensates for the path loss of millimeter wave communication, so the application prospect of millimeter wave large-scale MIMO communication is very broad.
In existing research work, a codebook is usually preset at both the transmitting end and the receiving end. The codebook contains a plurality of beamforming vectors (also called codewords); the transmitting end and the receiving end traverse the codewords in the codebooks to transmit and receive pilot signals, and the codeword combination with the maximum received power is used as the beamforming vector combination for formal transmission and reception. This process is called beam training. However, the use of large-scale antenna arrays and directional narrow beams makes such codebook-traversing training algorithms very time consuming. Especially in dynamic scenarios, the millimeter wave channel is constantly changing, and achieving frequent and precise beam alignment is very difficult; it remains a challenging problem. Therefore, if the beam training process can sense changes in the channel environment and adjust the trained beam in time accordingly, the training overhead can be greatly reduced and the resources of the communication system saved.
In order to reduce the beam training overhead, document [1] (Chen K, Qi C, Dobre O A, et al. Simultaneous multi-user beam training using adaptive hierarchical codebook for mmWave massive MIMO[C]//2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019.) proposes synchronous multi-user beam training using an adaptive hierarchical codebook. Except for the bottom layer, each layer of the designed adaptive hierarchical codebook has only two codewords, and no matter how many users the BS serves, each user only needs to perform beam training twice. The difficulty of this work lies in codeword design: the hierarchical codebook for beam training is not fixed in advance but is continuously constructed during the beam training process, which makes the construction of the codebook more complex and increases the training difficulty.
Document [2] (Zhang J, Huang Y, Wang J, et al. Intelligent beam training for millimeter-wave communications via deep reinforcement learning[C]//2019 IEEE Global Communications Conference. IEEE, 2019: 1-7.) proposes a deep reinforcement learning beam training algorithm based on environmental perception. The algorithm can sense changes in the environment, learn the needed latent probability information from the environment, and achieve intelligent beam training with low overhead. In addition, the algorithm needs no prior knowledge of dynamic channel modeling and is therefore suitable for a variety of complex scenarios. However, the method is only suitable for a single antenna at the receiving end, has a small application range, and does not support millimeter wave communication between similar base stations.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a millimeter wave communication beam training method based on deep reinforcement learning, which introduces a reinforcement learning framework into beam training so that the trained beam is adjusted in time as the channel changes and the state of the channel is tracked. This effectively reduces the beam training overhead while ensuring the performance of beam training, solves the technical problems of large training overhead, hardware complexity and power consumption in existing beam training methods, and supports communication scenarios in which both the transmitting end and the receiving end have multiple antennas.
In order to realize the purpose, the invention adopts the following scheme:
a millimeter wave communication beam training method based on deep reinforcement learning comprises the following steps:
step S1, constructing a millimeter wave communication channel model between the user side and the base station side;
step S2, designing codebooks of a user terminal and a base station terminal, constructing a model of a final received signal according to the designed codebooks, and carrying out mathematical modeling on a beam training process according to the model;
step S3, defining the representation of state, action and reward in the beam training;
and step S4, regarding the state defined in the step S3 as a multi-channel image, inputting the multi-channel image into the constructed convolutional neural network, and obtaining values of all actions corresponding to the state.
Further, the step S1 specifically includes:
setting a millimeter wave massive MIMO system aiming at single user, wherein, a user end has NrRoot antenna, base station end has NtRoot antenna, arrangement of antennasUniform linear arrays are adopted, and the millimeter wave communication channel model is modeled as follows:
in the formula (1), L and alphal、θlRespectively representing the number of paths, the channel gain of the ith path, the arrival angle of a channel and the departure angle of the channel; definition of ΘlAnd ΨlObey [0, π ] for both the arrival angle and departure angle of the spatial domain]Inner uniform distribution, dtAnd drRespectively representing the interval between the array antennas of the base station side and the user side, wherein lambda is the wavelength of the millimeter wave signal, and u (-) represents a channel steering vector; the variation of the steering angle of the channel in the adjacent time interval follows Gaussian distribution, and the expression is as follows:
in the formula (2), θ0U (0, pi) represents the initial channel steering angle, θ, at which time t is 0 and randomtRepresenting the channel steering angle at time t,indicating the amount of change in the channel steering angle.
Further, in step S2, the codebooks of the user end and the base station end are DFT codebooks F and W, whose codewords are expressed as:
f_n = (1/sqrt(N_t)) [1, e^{-j2π(n-1)/N_t}, …, e^{-j2π(N_t-1)(n-1)/N_t}]^T, n = 1, …, N_t   (3)
w_m = (1/sqrt(N_r)) [1, e^{-j2π(m-1)/N_r}, …, e^{-j2π(N_r-1)(m-1)/N_r}]^T, m = 1, …, N_r   (4)
The expression of the final received signal is:
y = sqrt(P) · w^H H f · x + w^H η   (5)
In formula (5), P, w, f and η denote the transmission power of the base station side, the receive codeword of the user end, the transmit codeword of the base station side, and the channel noise vector, respectively, with ‖w‖_2 = ‖f‖_2 = 1 and |x|² = 1;
Thus, the expression of the received signal matrix is:
Y = sqrt(P) · W^H H F · x + W^H N   (6)
In formula (6), W ∈ C^{N_r×N_r} and F ∈ C^{N_t×N_t} respectively represent the DFT codebooks at the receiving end and the transmitting end, H ∈ C^{N_r×N_t} represents the channel matrix, x and P respectively represent the transmitted signal and the signal power, and N represents the channel noise matrix. The element Y(m, n) in the m-th row and n-th column of the matrix is the signal obtained when the transmitting end transmits with the n-th codeword (n = 1, 2, …, N_t) in codebook F and the receiving end receives with the m-th codeword (m = 1, 2, …, N_r) in codebook W. The beam training process is represented as the following optimization problem:
(m*, n*) = argmax_{m,n} |Y(m, n)|
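The optimization problem above reduces to an argmax over the received signal strength matrix. A minimal sketch (the 2×2 matrix below is purely illustrative, not from the patent):

```python
import numpy as np

def best_beam_pair(Y):
    """Pick the (receive index m, transmit index n) whose received signal
    has the largest modulus, i.e. argmax over Z(m, n) = |Y(m, n)|."""
    Z = np.abs(Y)                      # received signal strength matrix
    return np.unravel_index(np.argmax(Z), Z.shape)

# toy received signal matrix (values illustrative only)
Y = np.array([[0.1 + 0.0j, 0.3 + 0.1j],
              [0.9 - 0.2j, 0.2 + 0.0j]])
m_star, n_star = best_beam_pair(Y)
```

In a full beam scan this argmax is taken over all N_r × N_t codeword pairs, which is exactly the overhead the invention aims to avoid.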
further, in step S3, defining the representation of the state in the beam training specifically includes:
let the channel matrix at time t be HtThe matrix of the received signals corresponding thereto is YtDefining a matrix ZtIs YtModulo of the received signal strength matrix Z for successive time instantstIs defined as a state StSpecifically, the following are shown:
St(i)=Zt+i-C,i=1,2,…,C (7)
in the formula (7), StIs a three-dimensional matrix with a third dimension of size C, C representing the number of successive time instants, Zt+i-CRepresenting the received signal strength matrix at time t + i-C.
Further, in step S3, defining the representation of the action in the beam training, specifically including:
defining the position corresponding to the largest element in said matrix Z_t as i_t = (n_t*, m_t*), where n_t* and m_t* respectively represent the indexes of the optimal transmit and receive beams at time t in the codebooks F and W;
the optimal beam combination at the current time is (f_{n_t*}, w_{m_t*});
the action at time t is defined as:
A_t = (d, o, r)   (8)
In formula (8), d, o and r respectively represent the direction, offset and coverage range of the beam search at time t + 1 relative to the optimal beam position at time t. Here d ∈ D = {0, 1, 2, 3, 4}, giving 5 selectable directions: 0 represents no movement, while 1, 2, 3, 4 respectively represent moving up, down, left and right with the position i_t as the base point; o ∈ O = {0, 1, 2, …, M-1}, giving M selectable offsets, defined as the distance between the center position of the beam search at time t + 1 and the position i_t; and r ∈ R = {1, 2, …, N}, giving N selectable radii, where the radius is defined as the coverage radius with the center position of the beam search at time t + 1 as the base point.
Further, in step S3, defining the representation of the reward in the beam training, specifically including:
R_t = ((T_s - |B_{t+1}|·t_s - t_p) / T_s) · R̄_{t+1}   (9)
In formula (9), B_{t+1} is the set of beam combinations for testing at the next moment, obtained when the agent performs action A_t at time t; t_s represents the time to test one beam combination; t_p represents the precoding phase within one time step of the beam training; T_s represents one time step of the beam training; and R̄_{t+1} represents the data achievable rate corresponding to the optimal beam combination at time t + 1.
Further, the convolutional neural network specifically comprises two convolutional layers, two pooling layers, a flatten layer, a fully connected layer and an output layer; before a state is input into the neural network it is normalized, specifically, the two-dimensional image represented by each channel is normalized.
Furthermore, the value of the input state is estimated with the convolutional neural network, and the parameters of the network are updated by taking the predicted value of Q-learning as the target; the trained network is then used for prediction, the action with the maximum Q value is selected, and the beam combination cluster corresponding to that action is used for testing, so as to reduce the training overhead.
A millimeter wave communication beam training device based on deep reinforcement learning, the device comprising:
the beam selection module is used for acquiring a receiving beam set and a transmitting beam set according to the executed action;
the channel sample generation module is used for generating a plurality of randomly changed channel matrixes and calculating the optimal receiving and transmitting beam combination of each channel matrix;
the receiving signal matrix module is used for calculating the receiving signal intensity corresponding to the receiving and transmitting beam pair set in the beam selection module;
the state updating module updates the current state by using the received signal strength matrix;
the optimal transceiving beam combination determining module is used for obtaining the optimal transceiving beam combination corresponding to the channel: after an action is executed, the beam search range at the next moment is obtained according to that action, all beam combinations within the range are tested, and the beam combination with the maximum received signal strength is selected as the optimal beam combination;
the reward calculation module is used for calculating a reward value for executing the action by using the obtained optimal beam combination and other parameters;
the parameter setting module is used for setting parameters of the neural network and other parameters in the beam training process;
the experience storage module is used for storing experiences in the beam training process into a set;
the neural network training module is used for inputting the state matrix into the neural network, outputting all action values corresponding to the state, and selecting a plurality of experiences from the memory library to update the network parameters;
a target value setting module which calculates a target value corresponding to each experience by using the updating strategy of the Q learning;
and the neural network prediction module predicts all action values corresponding to the input state by using the trained network and selects the action with the maximum Q value as the optimal action.
The beneficial effects of the invention are:
1. The invention introduces a reinforcement learning framework into the beam training, so that the trained beam is adjusted in time as the channel changes and the state of the channel is tracked, thereby more accurately predicting the optimal transceiving beam combination of an unknown channel, effectively reducing the beam training overhead and ensuring the performance of the beam training.
2. Different from the traditional training modes such as beam scanning and the like, the number of beam combinations tested each time is not fixed and is dynamically changed under different channel states, so that the overhead of beam training is effectively reduced.
3. In the design of receiving and transmitting beams, the invention only adopts narrow beams, thereby greatly reducing the complexity of hardware.
Drawings
Fig. 1 is a schematic input/output diagram of a neural network in embodiment 1;
FIG. 2 is a graphical representation of the received signal strength matrix in example 1;
FIG. 3 is a diagram illustrating states (matrices) of reinforcement learning in example 1;
FIG. 4 is a schematic diagram of the action execution process in embodiment 1, wherein FIG. 4a shows the received signal strength matrix Z_τ at time τ, and FIGS. 4b-4f respectively show the received signal strength matrices at time τ + 1 obtained by taking different actions A_τ^(j), j = 1, …, 5;
FIG. 5 is a schematic time slice diagram of beam training in example 1;
fig. 6 is a diagram illustrating a comparison of beam search success rates when the number of channel paths is different;
FIG. 7 is a diagram illustrating a comparison of the achievable rates of users when the number of channel paths is different;
fig. 8 is a schematic diagram showing the beam training method proposed in embodiment 1 compared with the beam scanning, layered codebook-based beam training method in terms of the success rate of beam search;
fig. 9 is a schematic diagram showing a comparison between the beam training method proposed in embodiment 1 and the beam scanning, layered codebook-based beam training method in terms of user reachable rate.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1 to 5, the present embodiment provides a millimeter wave communication beam training method based on deep reinforcement learning, which specifically includes:
consider a millimeter-wave massive MIMO system for a single user, with N at the userrRoot antenna, baseAt station has NtThe antennas are arranged in a Uniform Linear Array (ULA) manner. According to the widely used Saleh-Valenzuela model, the millimeter wave channel of the downlink can be modeled as:
wherein, L, alphal、θlRespectively representing the number of paths, the channel gain of the ith path, the arrival angle of the channel and the departure angle of the channel. Usually, the path of 1 is the LOS path, and the other paths are the NLOS paths. Definition of ΘlAnd ΨlObey [0, π ] for both arrival and departure angles in the spatial domain]Even distribution within. dtAnd drRespectively representing the spacing of the array antennas at the base station side and the user side, lambda is the wavelength of the millimeter wave signal, and in general,u (-) denotes a channel steering vector, defined as follows:
since the channel considered in the beam training of the present invention is time-varying, the channel needs to be dynamically modeled. In an actual communication environment, the channel variation is generally random, and in the present invention, a gaussian random walk is adopted as a variation form of the channel, that is, the variation amount of the steering angle (departure angle and arrival angle) of the channel in adjacent time intervals obeys gaussian distribution, which is specifically expressed as follows:
wherein, theta0U (0, pi) represents the initial channel steering angle, θ, at which time t is 0 and randomtIndicating the channel steering angle at time t,indicating the amount of change in the channel steering angle.
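As a minimal sketch of formulas (1) and (2) (assuming half-wavelength spacing, complex Gaussian path gains, and illustrative array sizes N_t = 16, N_r = 4; the noise variance σ_θ = 0.05 is likewise an assumption):

```python
import numpy as np

def steering_vector(n_ant, angle):
    """Channel steering vector u(N, Theta) for a half-wavelength ULA;
    `angle` is the physical steering angle in [0, pi]."""
    k = np.arange(n_ant)
    return np.exp(1j * np.pi * k * np.cos(angle)) / np.sqrt(n_ant)

def channel(n_t, n_r, aod, aoa, gains):
    """Saleh-Valenzuela model, formula (1): a sum of L rank-one paths."""
    L = len(gains)
    H = np.zeros((n_r, n_t), dtype=complex)
    for a_t, a_r, g in zip(aod, aoa, gains):
        H += g * np.outer(steering_vector(n_r, a_r),
                          steering_vector(n_t, a_t).conj())
    return np.sqrt(n_t * n_r / L) * H

def evolve(angles, sigma, rng):
    """Gaussian random walk, formula (2): theta_t = theta_{t-1} + N(0, sigma^2)."""
    return angles + rng.normal(0.0, sigma, size=angles.shape)

rng = np.random.default_rng(0)
L = 3
aod = rng.uniform(0, np.pi, L)                       # departure angles Psi_l
aoa = rng.uniform(0, np.pi, L)                       # arrival angles Theta_l
gains = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2)
H0 = channel(16, 4, aod, aoa, gains)                 # channel at time t
H1 = channel(16, 4, evolve(aod, 0.05, rng), evolve(aoa, 0.05, rng), gains)
```

Successive channels H0, H1, … differ only through the small random-walk perturbation of the angles, which is what makes tracking the optimal beam pair across time feasible.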
Before beam training, a codebook is defined at both the transmitting end and the receiving end, each codebook comprises a series of codewords, and each codeword represents a beam forming vector. In the present invention, a Discrete Fourier Transform (DFT) codebook is used as a codebook at a transmitting end, and the nature of the DFT codebook is a two-dimensional complex matrix determined according to the number of antennas, and a modulus value of each element in the matrix is constant. The DFT codebook is well suited for training of analog beams because the phase shifter network constituting the analog beamforming part only changes the phase of the transmitted signal and does not provide a gain in power.
DFT codebooks at the transmitting end and the receiving end are defined as F and W, respectively, where F ∈ C^{N_t×N_t} contains N_t codewords and W ∈ C^{N_r×N_r} contains N_r codewords. The codewords contained in the two codebooks each represent a channel steering vector pointing in a different direction of space, represented as follows:
f_n = (1/sqrt(N_t)) [1, e^{-j2π(n-1)/N_t}, …, e^{-j2π(N_t-1)(n-1)/N_t}]^T, n = 1, …, N_t,
w_m = (1/sqrt(N_r)) [1, e^{-j2π(m-1)/N_r}, …, e^{-j2π(N_r-1)(m-1)/N_r}]^T, m = 1, …, N_r.
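A sketch of the DFT codebook construction described above (the n-th column is the n-th codeword; the sizes N_t = 16 and N_r = 4 are illustrative assumptions):

```python
import numpy as np

def dft_codebook(n_ant):
    """N x N DFT codebook: each column is a constant-modulus codeword,
    a steering vector pointing in one of N spatial directions."""
    idx = np.arange(n_ant)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n_ant) / np.sqrt(n_ant)

F = dft_codebook(16)   # transmitting-end codebook, N_t codewords
W = dft_codebook(4)    # receiving-end codebook, N_r codewords
```

Every codeword has unit norm, matching ‖f‖_2 = ‖w‖_2 = 1, and each entry has constant modulus, which is why such codebooks suit a phase-shifter-only analog beamforming network.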
assuming that the sending end uses the codeword f to send the signal x, the receiving end uses the codeword w to receive the signal, and the transmission of the channel matrix H is performed in the middle, and the final received signal may be represented as:
wherein, P, Respectively representing the transmission power of the base station, the received code word of the user terminal, the transmission code word of the base station terminal and the channel noise vector. Neither the transmitted nor the received codeword provides a power gain, i.e. | w |)2=‖f‖21, the transmission signal x has normalized power, | x ∞ n2=1。
The user achievable rate can be expressed as:
R = log₂(1 + P·|w^H H f|² / σ²).
in the beam training process, the transmitting end and the receiving end respectively test each code word in the codebooks F and W to find a transmitting end beam forming vector F and a receiving end beam forming vector W which can be optimally matched with the channel H. Therefore, the beam training problem can be equivalent to the following optimization problem:
in the beam training, the transmitting power P of the signal and the variance sigma of the channel noise2Given, the above optimization problem can be simplified to:
however, in practical situations, the channel H is usually unknown and cannot be directly solved for optimal f and w. It is common practice to find the best combination of f and w by measuring the strength value of the received signal y, so the beam training process can be expressed as the following optimization problem:
the optimal solutions of the above two optimization problems may be different due to the presence of the channel noise η. If the two are the same, the beam training is successful, otherwise, the beam training is failed. Suppose that a total of N is performedtotalThe secondary beam training succeeds NsucNext, the success rate of the beam search can be expressed as:
the received signal matrix at time t is:
wherein the content of the first and second substances, DFT codebooks at the receiving end and the transmitting end are respectively indicated,a channel matrix representing time t, x, P representing the transmitted signal and the power of the signal respectively,representing the channel noise matrix at time t, defining a received signal strength matrix ZtIs YtThe following steps of (1):
Zt(m,n)=|Yt(m,n)|,
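Computing Y_t and Z_t in one matrix product could be sketched as follows (array sizes, the unit signal x = 1, and the noise level 0.01 are illustrative assumptions):

```python
import numpy as np

def received_matrix(H, F, W, power, x, noise):
    """Y_t = sqrt(P) W^H H F x + W^H N: entry (m, n) is the signal received
    with the m-th receive codeword when the n-th transmit codeword is used."""
    return np.sqrt(power) * (W.conj().T @ H @ F) * x + W.conj().T @ noise

rng = np.random.default_rng(1)
n_t, n_r = 16, 4
idx_t, idx_r = np.arange(n_t), np.arange(n_r)
F = np.exp(-2j * np.pi * np.outer(idx_t, idx_t) / n_t) / np.sqrt(n_t)
W = np.exp(-2j * np.pi * np.outer(idx_r, idx_r) / n_r) / np.sqrt(n_r)
H = rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
N = 0.01 * (rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t)))
Y = received_matrix(H, F, W, power=1.0, x=1.0, noise=N)
Z = np.abs(Y)      # received signal strength matrix Z_t(m, n) = |Y_t(m, n)|
```

In the method itself only the entries of Z inside the action-selected search window are actually measured; the full matrix is shown here only to make the definition concrete.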
as shown in FIG. 2, in the present example, Z istThe two dimensions of the image respectively represent the indexes of the transmitting and receiving end code words, and each grid in the image corresponds to one transmitting and receiving beam combination. The image describes the distribution of the corresponding received signal strength when the transceiving end uses different beams to test, and the pixel position with larger gray scale value in the image corresponds to the beam combination with high received signal strength. Image Z due to the sparsity of the millimeter wave channeltIs close to 0, the positions of those non-zero elements correspond to the distribution of the steering angles under the current channel, ZtThe position of the medium maximum element corresponds to the searched optimal beam combination. If the positions of the non-zero elements can be dynamically tracked, the change condition of the channel can be sensed in time, and the training overhead is greatly reduced. To capture the dynamically changing channel, we define several consecutive images as one state St, namely:
St(i)=Zt+i-C,i=1,2,…,C,
wherein S istIs a three-dimensional matrix with a third dimension of size C, indicating a state matrix StWhich comprises C two-dimensional matrices Z. S. thetThe ith two-dimensional matrix corresponds to a received signal strength matrix Z at the time of t + i-Ct+i-CAnd S istThe received signal strength matrix Z of which the last two-dimensional matrix is at the moment tt. As shown in FIG. 3, the state matrix S may also betThe image is viewed as a multi-channel image, so that the convolutional neural network can be used for training.
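Stacking the C most recent strength matrices into the multi-channel state image could look like the following sketch (C = 4 and the matrix sizes are illustrative assumptions):

```python
import numpy as np

def make_state(z_history, C):
    """Stack the C most recent received signal strength matrices along a
    third axis, so channel i is Z_{t+i-C} and the last channel is Z_t."""
    return np.stack(z_history[-C:], axis=-1)

rng = np.random.default_rng(2)
z_history = [rng.random((4, 16)) for _ in range(6)]   # Z_{t-5}, ..., Z_t
S_t = make_state(z_history, C=4)                      # shape (N_r, N_t, C)
```

The resulting (N_r, N_t, C) array is exactly the multi-channel image layout that a convolutional network expects as input.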
From the above, the last two-dimensional matrix contained in the state matrix S_t at time t is Z_t; the position corresponding to the largest element in Z_t is defined as:
i_t = (n_t*, m_t*),
where n_t* and m_t* respectively represent the indexes of the optimal transmit and receive beams at time t in the codebooks F and W, so the optimal beam combination at the current time is (f_{n_t*}, w_{m_t*}). To obtain the state matrix S_{t+1} at time t + 1, the action at time t is defined as a triple:
A_t = (d, o, r),
where d, o and r respectively represent the direction, offset and coverage range of the beam search at time t + 1 relative to the optimal beam position at time t. Here d ∈ D = {0, 1, 2, 3, 4}, giving 5 possible directions: 0 represents no movement, while 1, 2, 3, 4 respectively represent moving up, down, left and right with the position i_t as the base point. o ∈ O = {0, 1, 2, …, M-1}, giving M selectable offsets, defined as the distance between the center position of the beam search at time t + 1 and the position i_t. r ∈ R = {1, 2, …, N}, giving N selectable radii, where the radius is defined as the coverage radius with the center position of the beam search at time t + 1 as the base point; for example, r = 1 indicates that the coverage area of the beam search is a square area with side length 3.
The specific execution process of an action is shown in FIG. 4. Each image represents a received signal strength matrix Z at a certain time, and the gray value of each pixel represents the modulus of the received signal obtained by testing the corresponding beam combination. The positions of the colored grids correspond to the beam combinations to be trained and have gray values greater than 0; the beam combinations at other positions need no training, and their gray values are set to 0. Assume FIG. 4a represents the received signal strength matrix Z_τ at time τ; the dark grid position represents the optimal beam combination index i_τ = (4, 5) determined by the beam training. FIGS. 4b-4f respectively show the received signal strength matrices at time τ + 1 obtained by taking different actions A_τ^(j), j = 1, …, 5. For example, FIG. 4b shows the received signal strength matrix at time τ + 1 obtained by taking action A_τ^(1): according to the definition of actions above, the coverage area of the beam training is a square area with r = 1 (side length 3), whose center position (dark grid) is the previous optimal beam combination index i_τ; FIGS. 4c-4f are obtained similarly.
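A sketch of how an action triple could expand into the set B_{t+1} of beam index pairs to test (clipping at the codebook edges is an assumption of this sketch; the patent does not state the boundary rule):

```python
def beam_window(center, action, n_r, n_t):
    """Map action A_t = (d, o, r) to the beam index pairs tested at t+1.
    d: 0 = stay, 1 = up, 2 = down, 3 = left, 4 = right (relative to the
    optimal pair at time t); o: offset of the new search centre;
    r: half-width of the square search region."""
    d, o, r = action
    m, n = center
    dm, dn = {0: (0, 0), 1: (-o, 0), 2: (o, 0), 3: (0, -o), 4: (0, o)}[d]
    cm, cn = m + dm, n + dn
    # clip out-of-range indices to the codebook edges (assumption)
    return sorted({(min(max(i, 0), n_r - 1), min(max(j, 0), n_t - 1))
                   for i in range(cm - r, cm + r + 1)
                   for j in range(cn - r, cn + r + 1)})

pairs = beam_window(center=(4, 5), action=(0, 0, 1), n_r=16, n_t=16)
```

With r = 1 the window is the 3×3 square around the previous optimum, so only 9 beam combinations are tested instead of all N_r·N_t pairs.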
Assume that at time t the agent performs action AtGet the set of beam combinations for testing at the next time as Where I denotes the total number of beam combinations used for training, set Bt+1The elements contained in the test table are used as receiving and transmitting wave beams to be tested one by one, thereby obtaining a received signal intensity matrix Z at the t +1 momentt+1According to Zt+1The state matrix S at time t +1 can be constructedt+1:
Select the maximum value z_{t+1}* of the matrix Z_{t+1}; the corresponding beam combination is taken as the optimal beam combination i_{t+1} at time t+1. Action A_{t+1} is then performed, and so on.
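The sliding state construction S_t(i) = Z_{t+i−C} keeps the C most recent strength matrices as channels of one image, with the newest matrix in the last channel. A sketch of this bookkeeping (names are illustrative; start-up padding by repeating the oldest frame is an assumption, since the patent initializes the state by beam scanning):

```python
from collections import deque
import numpy as np

C, GRID = 6, 16                       # state depth and codebook grid size
history = deque(maxlen=C)             # the C most recent Z matrices

def push_measurement(Z):
    """Append the newest received-signal-strength matrix Z and return the
    state S as a C x GRID x GRID multi-channel image (newest channel last,
    matching S_t(i) = Z_{t+i-C})."""
    history.append(Z)
    pad = [history[0]] * (C - len(history))   # repeat oldest frame at start-up
    return np.stack(pad + list(history))

S = push_measurement(np.zeros((GRID, GRID)))
S = push_measurement(np.ones((GRID, GRID)))
assert S.shape == (C, GRID, GRID) and S[-1].max() == 1.0
```

Because the deque has `maxlen=C`, the oldest matrix is dropped automatically once C measurements have accumulated, so each time step costs one append and one stack.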
Considering the channel achievable rate and the beam training overhead jointly, the reward function is defined in the present invention in the following form:
where Ts represents the size of a time slice, t_d indicates the effective data transmission time within a time slice, and the data achievable rate corresponding to the optimal beam combination at time t+1 is defined as

R̃_{t+1} = log2(1 + P·|z_{t+1}*|²/σ²),

where P is the transmit signal power and σ² is the channel noise variance.
The reward function can be understood as the effective data achievable rate over a time slice (since, in addition to transmitting data, a time slice must also accommodate beam training and precoding).
FIG. 5 shows the definition of a time slice, from which t_d = Ts − t_b − t_p = Ts − I·t_s − t_p, where I = |B_{t+1}| denotes the size of the beam set for testing at time t+1 obtained by performing action A_t, and t_s denotes the time for testing one beam combination.
Thus, the reward function R_t can finally be expressed in the following form:

R_t = (t_d/Ts)·R̃_{t+1} = ((Ts − I·t_s − t_p)/Ts)·R̃_{t+1}.
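Under this definition the reward is the fraction of the slot left for data, scaled by the achievable rate of the chosen beam pair. A hedged numerical sketch: the log2 SNR form of the rate and the precoding time value `tp` are assumptions (the patent does not state a t_p value), and `z_best` stands for the modulus of the received signal of the selected beam pair:

```python
import numpy as np

def reward(z_best, n_tested, Ts=20.0, ts=0.1, tp=0.5, P=1.0, sigma2=1.0):
    """R_t = (t_d / Ts) * Rtilde_{t+1}, with t_d = Ts - I*ts - tp.
    z_best: received-signal modulus for the selected beam combination.
    n_tested: I = |B_{t+1}|, the number of beam combinations tested.
    tp is an assumed precoding time in ms (not given in the text)."""
    t_d = Ts - n_tested * ts - tp                   # effective data time
    rate = np.log2(1.0 + P * z_best**2 / sigma2)    # achievable rate
    return (t_d / Ts) * rate

# testing fewer beams leaves more of the slot for data, so the reward rises
assert reward(1.0, 9) > reward(1.0, 256)
```

With these numbers, a 9-beam window keeps 18.6 ms of a 20 ms slot for data, while an exhaustive 256-beam scan would overrun the slot entirely, which is exactly the trade-off the agent is rewarded for managing.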
Because the state S can be represented as a multi-channel image, the present invention uses a convolutional neural network to process it. The network structure is shown in FIG. 1 and comprises two convolutional layers, two pooling layers, a flatten layer, a fully-connected layer, and an output layer. The convolutional layers are the result of convolution operations, the pooling layers are the result of sampling operations, and the flatten layer converts a multidimensional matrix into a one-dimensional vector. The input to the network is the state matrix S_t, and the output is the set of all action values Q(S_t, A_t) corresponding to that state; the dimension of the output layer equals the size of the action space A.
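The layer dimensions of the 7-layer DQN described in the simulation section can be checked without a deep-learning framework. A shape-bookkeeping sketch, assuming 'same' padding for the 5 × 5 convolutions and a 16 × 16 codebook grid (both assumptions, since the patent does not state the padding):

```python
def dqn_output_dims(grid=16, channels=6, n_actions=20):
    """Trace feature-map shapes through the 7-layer DQN:
    conv(32, 5x5) -> maxpool/2 -> conv(16, 5x5) -> maxpool/2
    -> flatten -> fc(128) -> output(n_actions)."""
    h, w, c = grid, grid, channels   # input: C x 16 x 16 multi-channel image
    c = 32                   # conv1: 32 kernels, 'same' padding keeps h x w
    h, w = h // 2, w // 2    # max pooling, stride 2
    c = 16                   # conv2: 16 kernels, 'same' padding
    h, w = h // 2, w // 2    # max pooling, stride 2
    flat = c * h * w         # flatten layer: 16 * 4 * 4 = 256 features
    return flat, 128, n_actions

# action space size |A| = |D| * |O| * |R| = 5 * 2 * 2 = 20 in the simulation
assert dqn_output_dims() == (256, 128, 20)
```

The output dimension 20 follows from the simulation settings D = {0,1,2,3,4}, O = {2,4}, R = {1,3}; a different choice of O or R changes only the output layer.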
To accelerate the convergence of the model, S_t needs to be normalized before being input into the neural network; that is, the two-dimensional image represented by each channel is normalized:

S_t(i) = S_t(i)/max(S_t(i)), i = 1, 2, …, C,

where max(S_t(i)) represents the maximum gray value of the ith two-dimensional image of S_t.
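The per-channel normalization can be sketched in numpy as follows (illustrative only; the small `eps` guard against an all-zero channel is an assumption, not from the patent):

```python
import numpy as np

def normalize_state(S, eps=1e-12):
    """Divide each channel (two-dimensional image) of S by its own
    maximum gray value, so every channel lies in [0, 1]."""
    peaks = S.max(axis=(1, 2), keepdims=True)   # max(S_t(i)) per channel
    return S / np.maximum(peaks, eps)           # eps guards empty channels

S = np.random.rand(6, 16, 16) * 5.0
Sn = normalize_state(S)
assert np.allclose(Sn.max(axis=(1, 2)), 1.0)
```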
The beam training based on deep reinforcement learning mainly comprises the following steps:
Step 3: perform beam training. First, set the total number of training episodes and the number of time steps T contained in each episode. At the beginning of each episode, randomly generate C time-varying channels H_t and initialize the initial state S_0 in a beam scanning manner. The following steps are then performed in each time step in turn:
Step 3.1: suppose the state at time t is S_t; select an action A_t from the action space A according to the ε-greedy strategy.
Step 3.2: perform action A_t and determine the set B_{t+1} of beam combinations for testing at time t+1. Calculate the received signal strengths corresponding to all elements in the set, and set the received signals corresponding to untested beam combinations to 0, thereby obtaining the received signal strength matrix Z_{t+1}.
Step 3.3: update the state S_{t+1} at time t+1.
Step 3.4: select the maximum value z_{t+1}* of the matrix Z_{t+1}; the corresponding beam combination is taken as the optimal beam combination i_{t+1} at time t+1.
Step 3.6: calculate the reward function R_t.
Step 4.1: store this experience E_t = (S_t, A_t, R_t, S_{t+1}) into the memory bank.
Step 4.2: randomly select N experiences E_j = (s_j, a_j, r_j, s′_j), j = 1, 2, …, N, from the memory bank and set the target value corresponding to each experience:

y_j = r_j + γ·max_{a′} Q(s′_j, a′; θ′),

where γ is the discount factor and θ′ is the parameter of the target neural network.
Step 4.3: perform stochastic gradient descent on the parameter θ to train the neural network.
Step 4.4: update the parameter θ′ of the target neural network after T time steps.
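Steps 4.1-4.4 follow the standard DQN recipe: a replay memory, a prediction network θ, and a target network θ′ that supplies the Q-learning targets. A framework-free sketch with a toy linear Q-function standing in for the CNN, to show the minibatch target computation (all names and sizes are illustrative):

```python
import random
import numpy as np

GAMMA = 0.95
memory = []                                    # replay memory bank

def q_values(state, theta):
    """Toy linear Q-function standing in for the CNN: Q = theta @ state."""
    return theta @ state

def td_targets(batch, theta_target):
    """Q-learning targets y_j = r_j + gamma * max_a' Q(s'_j, a'; theta')."""
    return np.array([r + GAMMA * q_values(s2, theta_target).max()
                     for (s, a, r, s2) in batch])

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 8))                # prediction network parameters
theta_target = theta.copy()                    # target network parameters theta'
for _ in range(32):                            # step 4.1: store experiences E_t
    s, s2 = rng.normal(size=8), rng.normal(size=8)
    memory.append((s, int(rng.integers(4)), float(rng.normal()), s2))
batch = random.sample(memory, 8)               # step 4.2: random minibatch
y = td_targets(batch, theta_target)            # step 4.2: target values
assert y.shape == (8,)
```

In the full algorithm the gradient step of step 4.3 would minimize (y_j − Q(s_j, a_j; θ))², and step 4.4 would copy θ into θ′ periodically.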
Specifically, in the training step:
The actual optimal beam combination corresponding to the channel H_t at time t is the optimal solution of the following optimization problem:

i_t* = argmax_{f∈F, w∈W} |w^H·H_t·f|².
The above problem amounts to finding the beam combination with the largest objective function value in the codebooks F and W. Assume the optimal solution of this problem (the actual optimal beam combination) is i_t*, and the optimal beam combination obtained through beam training, as defined previously, is i_t. If i_t = i_t*, the beam training succeeds; otherwise it fails. Because the channel varies randomly, the state of the channel may not be accurately captured at some time instants, causing beam training to fail. In this case the position of the channel cannot be tracked, and if such samples are still used to update the DQN, the error will propagate continuously and cause the algorithm to fail. Therefore S_t needs to be redefined:
S_t(C) = W^H·H_t·F,
where W and F denote the DFT codebooks at the receiving and transmitting ends, respectively. According to the above definition, only the last two-dimensional matrix of the three-dimensional state matrix S_t is changed; the other positions remain unchanged. Since the last two-dimensional matrix of S_t was previously the received signal strength matrix Z_t at time t, and i_t is obtained from Z_t, Z_t is removed from S_t when necessary and replaced by the full-scan result, so that the algorithm is relocated to the state of the current channel.
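Both the actual optimal beam combination and the relocation matrix W^H H_t F come from one full codebook sweep. A numpy sketch with unit-norm DFT codebooks (the codebook construction shown is a standard assumption; the patent only names DFT codebooks):

```python
import numpy as np

def dft_codebook(n):
    """Columns are n unit-norm DFT beamforming vectors."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

def full_scan(H):
    """Return |W^H H F| and the index pair of the actual optimal beams."""
    Nr, Nt = H.shape
    W, F = dft_codebook(Nr), dft_codebook(Nt)
    Y = np.abs(W.conj().T @ H @ F)            # exhaustive received strengths
    m, n = np.unravel_index(Y.argmax(), Y.shape)
    return Y, (m, n)                          # (receive index, transmit index)

rng = np.random.default_rng(1)
H = rng.normal(size=(16, 16)) + 1j * rng.normal(size=(16, 16))
Y, best = full_scan(H)
assert Y.shape == (16, 16) and Y[best] == Y.max()
```

Comparing `best` with the index produced by the windowed search gives exactly the success/failure test described above.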
Example 2
In this embodiment, on the basis of embodiment 1, a millimeter wave communication beam training device based on deep reinforcement learning is provided, where the device includes:
a beam selection module, for obtaining, according to the action A_t performed at time t, the set B_{t+1} = {b_{t+1}^1, b_{t+1}^2, …, b_{t+1}^I} of beam combinations for testing at the next time, where I represents the total number of beam combinations used for training and b_{t+1}^i represents the ith transmit-receive beam combination;
a channel sample generation module, for generating a plurality of time-varying channel matrices H_t according to the random variation of the channel steering angle, and determining, by beam scanning, the optimal transmit-receive beam combination corresponding to each channel matrix H_t;
a received signal matrix module, which tests the beams in the set B_{t+1} in sequence to obtain the received signal strength z_{t+1} corresponding to each beam combination, and sets the received signals corresponding to the other, untested beam combinations to 0, thereby obtaining the received signal strength matrix Z_{t+1};
a state updating module, for constructing the state matrix S_{t+1} at time t+1 from the received signal strength matrix Z_{t+1} at time t+1, and updating the current state;
an optimal transmit-receive beam combination determining module, for selecting the maximum value z_{t+1}* of the matrix Z_{t+1}; the corresponding beam combination, consisting of the optimal transmit beam and the optimal receive beam (obtained through beam training), is taken as the optimal beam combination at time t+1;
a reward calculation module, for computing the reward value of performing action A_t using the obtained optimal beam combination, the transmit signal power P, and the channel noise variance σ²;
a parameter setting module, for setting the parameters of the neural network, other parameters in the beam training process, and the like;
an experience storage module, for storing the experiences E_t = (S_t, A_t, R_t, S_{t+1}) generated in the beam training process into the memory bank;
a neural network training module: the input of the neural network is the state matrix S_t and the output is all action values Q(S_t, A_t) corresponding to that state; several experiences E_j = (s_j, a_j, r_j, s′_j), j = 1, 2, …, N, are selected from the memory bank to update the prediction neural network parameter θ;
a target value setting module, which calculates a target value for each experience using the update strategy of Q-learning:

y_j = r_j + γ·max_{a′} Q(s′_j, a′; θ′);
a neural network prediction module, for predicting, using the trained network, all action values Q(S_t, a), a ∈ A, corresponding to the input state S_t, and selecting the action with the maximum Q value, A_t = argmax_{a∈A} Q(S_t, a), as the optimal action.
The invention is further described below with reference to simulation conditions and results:
The state image S_t depth is C = 6; the action triple sets are D = {0, 1, 2, 3, 4}, O = {2, 4}, and R = {1, 3}; the time slice size is Ts = 20 ms; the time to train one beam combination is t_s = 0.1 ms; the precoding time is t_p; the learning rate is α = 0.001; the discount factor is γ = 0.95; the memory bank size is D = 2000; the parameters of the prediction neural network are assigned to the target neural network every update_freq = 100 time steps; and the training batch size is batch_size = 64. The adopted DQN is a 7-layer convolutional neural network: the first convolution operation comprises 32 kernels of size 5 × 5 and the second comprises 16 kernels of size 5 × 5; both pooling operations use max pooling with stride 2; the flatten function converts a three-dimensional matrix into a one-dimensional vector; the fully-connected layer comprises 128 neurons; and the dimension of the output layer corresponds to the size of the action space.
Consider the downlink of a single-user millimeter wave massive MIMO communication system with N_t = 16 base-station antennas and N_r = 16 user antennas, all antenna arrays placed in ULA form. Assume the number of propagation paths of the millimeter wave signal is L = 3: the LOS path channel gain follows CN(0, 1), i.e., a complex Gaussian distribution with variance 1 and mean 0, and the two NLOS path channel gains follow CN(0, 0.01), i.e., a complex Gaussian distribution with variance 0.01 and mean 0. For convenience of processing, the channel noise variance is assumed to be σ² = 1, the variation of the channel steering angle follows the Gaussian distribution defined earlier, the transmit signal power is P = 1, and the transmitted signal is x = 1. FIGS. 6-7 show the simulation results of DQN-based beam training considering multiple propagation paths of the millimeter wave channel. As can be seen from the figures, as the acceptable signal-to-noise ratio increases, both the success rate and the achievable rate of the beam search show an increasing trend. When NLOS paths are added, the state change of the channel becomes more complicated, and the corresponding beam search success rate and achievable rate both decrease to some extent, but the decrease is small. This shows that the multipath effect of the millimeter wave channel has little influence on the algorithm, and the beam training algorithm based on deep reinforcement learning still maintains high performance in the multipath scenario.
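The simulated time-varying channel (one strong LOS path and two weaker NLOS paths over ULAs) can be generated as follows. The steering-vector and gain conventions are standard assumptions, since the patent's formulas (4)-(5) are not reproduced in the text, and the half-wavelength spacing is an assumption:

```python
import numpy as np

def ula_steering(n, theta, d_over_lambda=0.5):
    """Unit-norm ULA steering vector for physical angle theta
    (antenna spacing d = lambda/2 assumed)."""
    k = np.arange(n)
    return np.exp(1j * 2 * np.pi * d_over_lambda * k * np.cos(theta)) / np.sqrt(n)

def mmwave_channel(Nt=16, Nr=16, rng=None):
    """L = 3 paths: LOS gain ~ CN(0, 1), two NLOS gains ~ CN(0, 0.01);
    arrival/departure angles uniform in [0, pi]."""
    rng = rng or np.random.default_rng()
    H = np.zeros((Nr, Nt), complex)
    for std in (1.0, 0.1, 0.1):                    # per-path gain std dev
        alpha = std * (rng.normal() + 1j * rng.normal()) / np.sqrt(2)
        aoa, aod = rng.uniform(0, np.pi, size=2)
        H += alpha * np.outer(ula_steering(Nr, aoa),
                              ula_steering(Nt, aod).conj())
    return H

H = mmwave_channel(rng=np.random.default_rng(2))
assert H.shape == (16, 16)
```

Perturbing `aoa`/`aod` by a small Gaussian step between time slots, as the patent describes, yields the time-varying channel sequence H_t used for training.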
Consider the downlink of a single-user millimeter wave massive MIMO communication system with N_t = N_r = 16 antennas at the base station end and the user end, all antenna arrays placed in ULA form. Only the LOS path of the millimeter wave channel is considered, with channel gain following CN(0, 1); the channel noise variance is σ² = 1, the variation of the channel steering angle follows the Gaussian distribution defined earlier, the transmit signal power is P = 1, and the transmitted signal is x = 1. The hierarchical codebook adopts the codebook construction method in document [1], in which the codeword of the current layer is constructed according to the beam training result of the previous layer; the DFT codebook is used for beam scanning. FIGS. 8-9 compare the performance of the proposed DQN-based beam training algorithm (BT-DQN), beam scanning (BS), and the hierarchical-codebook-based beam training algorithm (BT-HC). As can be seen from FIG. 8, among the three beam training schemes, the success rate of BS is the highest under all signal-to-noise ratios. The search success rate of BT-DQN in the low and high signal-to-noise-ratio regions is close to, and slightly higher than, that of BT-HC, and its success rate in the middle region is higher than that of BT-HC. As can be seen from FIG. 9, under different signal-to-noise ratios the achievable rate of BS remains the highest, BT-DQN is second, and BT-HC is the lowest.
Although the search success rate and achievable rate of BS are the highest of the three, it requires more beams to be trained each time and takes more time, and is therefore more costly. Table 1 compares the overhead of the three beam training schemes, where t_s is the time to train one beam combination and t is the time of one beam training. As can be seen from the table, the overhead of BS is more than 10 times that of BT-HC: its higher search success rate and achievable rate come at the cost of huge overhead. The average overhead of BT-DQN is lower than that of BT-HC, a reduction of about 21%.
TABLE 1

| Algorithm name | Average overhead (t/t_s) |
| --- | --- |
| Beam Scanning (BS) | 256 |
| Beam training based on hierarchical codebook (BT-HC) | 24 |
| Beam training based on DQN (BT-DQN) | 19 |
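The figures in Table 1 can be checked directly: beam scanning tests every pair in the 16 × 16 codebooks, and the quoted ~21% saving of BT-DQN over BT-HC follows from the averages. A short arithmetic check:

```python
Nt = Nr = 16
bs = Nt * Nr                      # exhaustive scan: 256 beam combinations
bt_hc, bt_dqn = 24, 19            # average overheads from Table 1
assert bs == 256
assert bs / bt_hc > 10            # BS costs over 10x BT-HC
assert round(100 * (bt_hc - bt_dqn) / bt_hc) == 21   # ~21% reduction
```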
The simulation results show that, in a dynamic channel environment, the beam training scheme and device provided by the invention achieve a higher beam search success rate and achievable rate than the codebook-based beam training scheme, with lower training overhead. Although its performance does not match that of beam scanning, beam scanning exchanges enormous training overhead for its high success rate and achievable rate, which is not worthwhile in most cases. Therefore, in time-varying channel scenarios, the deep-Q-network-based beam training scheme provided by the invention can greatly reduce the overhead of beam training while maintaining high performance.
Details not elaborated in the present invention are well known to those skilled in the art.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the above teachings. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (4)
1. A millimeter wave communication beam training method based on deep reinforcement learning is characterized by comprising the following steps:
step S1, constructing a millimeter wave communication channel model between the user side and the base station side;
step S2, designing codebooks of a user terminal and a base station terminal, constructing a model of a final received signal according to the designed codebooks, and carrying out mathematical modeling on a beam training process according to the model;
step S3, defining the representation of state, action and reward in the beam training;
in step S3, defining the representation of the state in the beam training specifically includes:
let the channel matrix at time t be H_t and the received signal matrix corresponding to it be Y_t; define the matrix Z_t as the modulus of Y_t, and define the received signal strength matrices Z_t at successive time instants as the state S_t, specifically as follows:
S_t(i) = Z_{t+i−C}, i = 1, 2, …, C (1)
in formula (1), S_t is a three-dimensional matrix whose third dimension has size C, where C represents the number of successive time instants and Z_{t+i−C} represents the received signal strength matrix at time t+i−C;
defining the representation of the action in the beam training, specifically comprising:
defining the position corresponding to the largest element of said matrix Z_t as i_t = (i_t^F, i_t^W), wherein i_t^F and i_t^W respectively represent the indexes, in the codebooks F and W, of the optimal transmitting and receiving beams at time t;
the optimal beam combination at the current time is the pair of transmit and receive codewords indexed by i_t in the codebooks F and W;
the action at time t is defined as:
At=(d,o,r) (2)
in formula (2), d, o, and r respectively represent the direction, offset, and coverage of the beam search at time t+1 relative to the optimal beam position at time t; d ∈ D = {0, 1, 2, 3, 4}, so there are 5 possible directions: 0 represents no movement, while 1, 2, 3, 4 represent moving up, down, left, and right with the position i_t as the base point; o ∈ O = {0, 1, 2, …, M−1}, giving M optional offsets, each defined as the distance between the center position of the beam search at time t+1 and the position i_t; r ∈ R = {1, 2, …, N}, giving N optional radii, each defined as the coverage radius with the center position of the beam search at time t+1 as the base point;
defining the representation of the reward in the beam training, and specifically comprising:
R_t = ((Ts − |B_{t+1}|·t_s − t_p)/Ts)·R̃_{t+1} (3)

in formula (3), B_{t+1} denotes the set of beam combinations for testing at the next time obtained by the agent performing action A_t at time t, t_s represents the time for testing one beam combination, t_p denotes the precoding phase within one time step of the beam training, Ts denotes one time step of the beam training, and R̃_{t+1} represents the data achievable rate corresponding to the optimal beam combination at time t+1;
step S4, regarding the state defined in the step S3 as a multi-channel image, inputting the multi-channel image into the constructed convolutional neural network, and obtaining values of all actions corresponding to the state;
the method comprises the steps of updating an input state by using a convolutional neural network, updating parameters of the convolutional neural network by taking a predicted value of Q learning as a target, predicting by using a trained network, selecting an action with the maximum Q value, and testing by using a beam combination cluster corresponding to the action to reduce training overhead.
2. The method for training the millimeter wave communication beam based on deep reinforcement learning according to claim 1, wherein the step S1 specifically includes:
for a single-user millimeter wave MIMO communication system in which the user end has N_r antennas and the base station end has N_t antennas, both arranged as uniform linear arrays, the millimeter wave communication channel model is modeled as follows:
in formula (4), L, α_l, θ_l, and ψ_l respectively represent the number of paths, the channel gain of the lth path, the arrival angle of the channel, and the departure angle of the channel; Θ_l and Ψ_l, the arrival and departure angles of the spatial domain, both obey a uniform distribution in [0, π]; d_t and d_r respectively represent the spacing between array antennas at the base station end and the user end; λ is the wavelength of the millimeter wave signal; and u(·) represents the channel steering vector; the variation of the channel steering angle in adjacent time intervals follows a Gaussian distribution, expressed as:
3. The method for training millimeter wave communication beams based on deep reinforcement learning according to claim 1, wherein in the step S2, the received signal obtained with the codebooks of the user end and the base station end is expressed as:

y_t = √P·w^H·H_t·f·x + w^H·n (8)
in formula (8), P, w, f, and n respectively represent the transmit power of the base station end, the receive codeword of the user end, the transmit codeword of the base station end, and the channel noise vector, with ||w||_2 = ||f||_2 = 1 and |x|² = 1;
thus, the expression of the received signal matrix is:

Y_t = √P·W^H·H_t·F·x + W^H·N (9)
in formula (9), W and F respectively denote the DFT codebooks at the receiving end and the transmitting end, H_t represents the channel matrix, x and P respectively represent the transmitted signal and its power, N represents the channel noise matrix, and the element Y(m, n) in the mth row and nth column of the matrix represents the signal obtained when the transmitting end uses the nth (n = 1, 2, …, N_t) codeword in the codebook F and the receiving end uses the mth (m = 1, 2, …, N_r) codeword in the codebook W; the beam training process is represented as the following optimization problem:

max_{f∈F, w∈W} |w^H·H_t·f|²
4. The deep reinforcement learning-based millimeter wave communication beam training method according to claim 3, wherein the convolutional neural network specifically comprises two convolutional layers, two pooling layers, a flatten layer, a fully-connected layer, and an output layer; the states are normalized before being input to the neural network; specifically, the two-dimensional image represented by each channel is normalized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110623890.1A CN113411110B (en) | 2021-06-04 | 2021-06-04 | Millimeter wave communication beam training method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113411110A CN113411110A (en) | 2021-09-17 |
CN113411110B true CN113411110B (en) | 2022-07-22 |
Family
ID=77676276
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113411110B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113904704B (en) * | 2021-09-27 | 2023-04-07 | 西安邮电大学 | Beam prediction method based on multi-agent deep reinforcement learning |
WO2023071760A1 (en) * | 2021-10-29 | 2023-05-04 | 中兴通讯股份有限公司 | Beam domain division method and apparatus, storage medium, and electronic device |
CN114021987A (en) * | 2021-11-08 | 2022-02-08 | 深圳供电局有限公司 | Microgrid energy scheduling strategy determination method, device, equipment and storage medium |
CN114567525B (en) * | 2022-01-14 | 2023-07-28 | 北京邮电大学 | Channel estimation method and device |
CN114499605B (en) * | 2022-02-25 | 2023-07-04 | 北京京东方传感技术有限公司 | Signal transmission method, signal transmission device, electronic equipment and storage medium |
CN117035018A (en) * | 2022-04-29 | 2023-11-10 | 中兴通讯股份有限公司 | Beam measurement parameter feedback method and receiving method and device |
CN114844538B (en) * | 2022-04-29 | 2023-05-05 | 东南大学 | Millimeter wave MIMO user increment cooperative beam selection method based on wide learning |
CN115065981B (en) * | 2022-08-16 | 2022-11-01 | 新华三技术有限公司 | Beam tracking method and device |
CN115426007B (en) * | 2022-08-22 | 2023-09-01 | 电子科技大学 | Intelligent wave beam alignment method based on deep convolutional neural network |
CN115580879A (en) * | 2022-09-07 | 2023-01-06 | 重庆邮电大学 | Millimeter wave network beam management method based on federal reinforcement learning |
CN117692014B (en) * | 2024-02-01 | 2024-04-23 | 北京雷格讯电子股份有限公司 | Microwave millimeter wave communication method and communication system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110417444B (en) * | 2019-07-08 | 2020-08-04 | 东南大学 | Millimeter wave channel beam training method based on deep learning |
CN110401476B (en) * | 2019-08-05 | 2022-07-08 | 东南大学 | Codebook-based millimeter wave communication multi-user parallel beam training method |
CN110971279B (en) * | 2019-12-30 | 2021-09-21 | 东南大学 | Intelligent beam training method and precoding system in millimeter wave communication system |
CN112073106B (en) * | 2020-08-14 | 2022-04-22 | 清华大学 | Millimeter wave beam prediction method and device, electronic device and readable storage medium |
2021-06-04: CN application CN202110623890.1A / patent CN113411110B (en), status: Active
Also Published As
Publication number | Publication date |
---|---|
CN113411110A (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113411110B (en) | Millimeter wave communication beam training method based on deep reinforcement learning | |
US11626909B2 (en) | Method and device for enhancing power of signal in wireless communication system using IRS | |
KR102154481B1 (en) | Apparatus for beamforming massive mimo system using deep learning | |
CN110113088B (en) | Intelligent estimation method for wave arrival angle of separated digital-analog hybrid antenna system | |
CN113438002B (en) | LSTM-based analog beam switching method, device, equipment and medium | |
Shen et al. | Design and implementation for deep learning based adjustable beamforming training for millimeter wave communication systems | |
CN113162666B (en) | Intelligent steel-oriented large-scale MIMO hybrid precoding method and device | |
CN113193893B (en) | Millimeter wave large-scale MIMO intelligent hybrid beam forming design method | |
CN112448742A (en) | Hybrid precoding method based on convolutional neural network under non-uniform quantization | |
Zhang et al. | Intelligent beam training for millimeter-wave communications via deep reinforcement learning | |
Nguyen et al. | Deep unfolding hybrid beamforming designs for THz massive MIMO systems | |
Chafaa et al. | Federated channel-beam mapping: from sub-6ghz to mmwave | |
Elbir et al. | Cognitive learning-aided multi-antenna communications | |
Abdallah et al. | Multi-agent deep reinforcement learning for beam codebook design in RIS-aided systems | |
CN113872655A (en) | Multicast beam forming rapid calculation method | |
CN113169777A (en) | Beam alignment | |
CN114844538B (en) | Millimeter wave MIMO user increment cooperative beam selection method based on wide learning | |
CN114866126B (en) | Low-overhead channel estimation method for intelligent reflection surface auxiliary millimeter wave system | |
CN112242860B (en) | Beam forming method and device for self-adaptive antenna grouping and large-scale MIMO system | |
CN115133969A (en) | Performance improving method of millimeter wave large-scale MIMO-NOMA system | |
CN114598574A (en) | Millimeter wave channel estimation method based on deep learning | |
CN115604824A (en) | User scheduling method and system | |
CN115102590B (en) | Millimeter wave beam space hybrid beam forming method and device | |
Wang et al. | New Environment Adaptation with Few Shots for OFDM Receiver and mmWave Beamforming | |
CN113904704B (en) | Beam prediction method based on multi-agent deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||