CN113411110B - Millimeter wave communication beam training method based on deep reinforcement learning - Google Patents
Publication number: CN113411110B (application CN202110623890.1A); Authority: CN (China); Legal status: Active (the legal status is an assumption and is not a legal conclusion).
Classifications: H04B7/0617 (beam forming at the transmitting station), H04B7/086 (weighted combining using weights depending on external parameters, e.g. direction of arrival), G06N3/045 (combinations of neural networks).
Abstract
The invention discloses a millimeter wave communication beam training method based on deep reinforcement learning. The method tracks the millimeter wave channel by defining concrete representations, within the practical beam training problem, of the elements of a reinforcement learning model such as state, action and reward. The state is defined in image form, and the value function of reinforcement learning is approximated with a convolutional neural network; the action is defined as a triple consisting of the moving direction, the offset distance and the beam coverage range relative to the optimal beam combination of the channel at the previous moment. The reward function takes the effective data achievable rate within one time slice as the target value. During training of the neural network, the network parameters are updated with the Q-learning method. The trained deep Q-network is then used for prediction, and the action with the maximum Q value is selected, corresponding to the beam combination to be tested at the next moment.
Description
Technical Field
The invention relates to the technical field of millimeter wave wireless communication, in particular to a millimeter wave communication beam training method based on deep reinforcement learning.
Background
With the continuous development of wireless communication technology, the spectrum resources of the lower frequency bands are almost completely occupied. In order to meet communication performance requirements and obtain more spectrum resources, attention has turned to higher frequency bands, namely the millimeter wave band. This band covers frequencies in the range of 30-300 GHz; its spectrum resources are rich and its transmission rate is high, so it can meet the needs of applications with high bandwidth requirements. However, due to the propagation characteristics of millimeter wave signals, the path loss of a millimeter wave channel is high compared with a microwave channel. Considering that the wavelength of a millimeter wave signal is short compared with a microwave signal and that antenna spacing is generally positively correlated with signal wavelength, a large number of antennas can be concentrated in a small space to form a large-scale antenna array and provide high gain. Large-scale MIMO technology and millimeter wave communication are mutually complementary: millimeter wave communication solves the spectrum shortage of large-scale MIMO, while large-scale MIMO compensates for the path loss of millimeter wave communication, so the application prospect of millimeter wave large-scale MIMO communication is very broad.
In existing research work, a codebook is usually preset at both the transmitting end and the receiving end. The codebook contains a plurality of beamforming vectors (also called codewords); the transmitting end and the receiving end traverse the codewords in the codebooks to transmit and receive pilot signals, and the codeword combination with the maximum received power is used as the beamforming vector combination for formal transmission and reception. This process is called beam training. However, the use of large-scale antenna arrays and directional narrow beams makes such codebook-traversing training algorithms very time consuming. Especially in dynamic scenarios, the millimeter wave channel is constantly changing, and achieving frequent and precise beam alignment is very difficult; it remains a challenging problem. Therefore, if the beam training process can sense changes in the channel environment and adjust the trained beam in time accordingly, the training overhead can be greatly reduced and the resources of the communication system saved.
In order to reduce the beam training overhead, document [1] (Chen K, Qi C, Dobre O A, et al. Simultaneous multi-user beam training using adaptive hierarchical codebook for mmWave massive MIMO[C]//2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019.) proposes synchronous multi-user beam training using an adaptive hierarchical codebook. Except for the bottom layer, each layer of the designed adaptive hierarchical codebook has only two codewords, and no matter how many users the BS serves, each user only needs to perform beam training twice. The difficulty of this work lies in codeword design: the hierarchical codebook for beam training is not fixed in advance but is continuously constructed during the beam training process, which makes the construction of the codebook more complex and increases the training difficulty.
Document [2] (Zhang J, Huang Y, Wang J, et al. Intelligent beam training for millimeter-wave communications via deep reinforcement learning[C]//2019 IEEE Global Communications Conference. IEEE, 2019: 1-7.) proposes a deep reinforcement learning beam training algorithm based on environmental perception. The algorithm can sense changes in the environment, learn the needed latent probability information from the environment, and achieve intelligent beam training with low overhead. In addition, the algorithm needs no prior knowledge of dynamic channel modeling and is therefore suitable for a variety of complex scenarios. However, the method is only suitable for a single antenna at the receiving end, has a small application range, and does not support millimeter wave communication between similar base stations.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a millimeter wave communication beam training method based on deep reinforcement learning, which introduces a reinforcement learning framework into beam training so that the trained beam is adjusted in time as the channel changes and the state of the channel is tracked. This effectively reduces the beam training overhead while ensuring the performance of beam training, solves the technical problems of large training overhead, hardware complexity and power consumption in existing beam training methods, and supports communication scenarios in which both the transmitting end and the receiving end have multiple antennas.
In order to realize the purpose, the invention adopts the following scheme:
a millimeter wave communication beam training method based on deep reinforcement learning comprises the following steps:
step S1, constructing a millimeter wave communication channel model between the user side and the base station side;
step S2, designing codebooks of a user terminal and a base station terminal, constructing a model of a final received signal according to the designed codebooks, and carrying out mathematical modeling on a beam training process according to the model;
step S3, defining the representation of state, action and reward in the beam training;
and step S4, regarding the state defined in the step S3 as a multi-channel image, inputting the multi-channel image into the constructed convolutional neural network, and obtaining values of all actions corresponding to the state.
Further, the step S1 specifically includes:
setting a millimeter wave massive MIMO system aiming at single user, wherein, a user end has NrRoot antenna, base station end has NtRoot antenna, arrangement of antennasUniform linear arrays are adopted, and the millimeter wave communication channel model is modeled as follows:
in the formula (1), L and alphal、θlRespectively representing the number of paths, the channel gain of the ith path, the arrival angle of a channel and the departure angle of the channel; definition of ΘlAnd ΨlObey [0, π ] for both the arrival angle and departure angle of the spatial domain]Inner uniform distribution, dtAnd drRespectively representing the interval between the array antennas of the base station side and the user side, wherein lambda is the wavelength of the millimeter wave signal, and u (-) represents a channel steering vector; the variation of the steering angle of the channel in the adjacent time interval follows Gaussian distribution, and the expression is as follows:
in the formula (2), θ0U (0, pi) represents the initial channel steering angle, θ, at which time t is 0 and randomtRepresenting the channel steering angle at time t,indicating the amount of change in the channel steering angle.
Further, in step S2, the codebooks of the user end and the base station end are DFT codebooks F and W, whose codewords are expressed as:
f_n = (1/sqrt(N_t)) [1, e^{-j2π(n-1)/N_t}, …, e^{-j2π(N_t-1)(n-1)/N_t}]^T, n = 1, …, N_t   (3)
w_m = (1/sqrt(N_r)) [1, e^{-j2π(m-1)/N_r}, …, e^{-j2π(N_r-1)(m-1)/N_r}]^T, m = 1, …, N_r   (4)
The expression of the final received signal is:
y = sqrt(P) · w^H H f · x + w^H η   (5)
In formula (5), P, w, f and η denote the transmission power of the base station side, the receive codeword of the user end, the transmit codeword of the base station side, and the channel noise vector, respectively, with ‖w‖_2 = ‖f‖_2 = 1 and |x|² = 1;
Thus, the expression of the received signal matrix is:
Y = sqrt(P) · W^H H F · x + W^H N   (6)
In formula (6), W ∈ C^{N_r×N_r} and F ∈ C^{N_t×N_t} respectively represent the DFT codebooks at the receiving end and the transmitting end, H ∈ C^{N_r×N_t} represents the channel matrix, x and P respectively represent the transmitted signal and the signal power, and N represents the channel noise matrix. The element Y(m, n) in the m-th row and n-th column of the matrix is the signal obtained when the transmitting end transmits with the n-th codeword (n = 1, 2, …, N_t) in codebook F and the receiving end receives with the m-th codeword (m = 1, 2, …, N_r) in codebook W. The beam training process is represented as the following optimization problem:
(m*, n*) = argmax_{m,n} |Y(m, n)|
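The optimization problem above reduces to an argmax over the received signal strength matrix. A minimal sketch (the 2×2 matrix below is purely illustrative, not from the patent):

```python
import numpy as np

def best_beam_pair(Y):
    """Pick the (receive index m, transmit index n) whose received signal
    has the largest modulus, i.e. argmax over Z(m, n) = |Y(m, n)|."""
    Z = np.abs(Y)                      # received signal strength matrix
    return np.unravel_index(np.argmax(Z), Z.shape)

# toy received signal matrix (values illustrative only)
Y = np.array([[0.1 + 0.0j, 0.3 + 0.1j],
              [0.9 - 0.2j, 0.2 + 0.0j]])
m_star, n_star = best_beam_pair(Y)
```

In a full beam scan this argmax is taken over all N_r × N_t codeword pairs, which is exactly the overhead the invention aims to avoid.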
further, in step S3, defining the representation of the state in the beam training specifically includes:
let the channel matrix at time t be HtThe matrix of the received signals corresponding thereto is YtDefining a matrix ZtIs YtModulo of the received signal strength matrix Z for successive time instantstIs defined as a state StSpecifically, the following are shown:
St(i)=Zt+i-C,i=1,2,…,C (7)
in the formula (7), StIs a three-dimensional matrix with a third dimension of size C, C representing the number of successive time instants, Zt+i-CRepresenting the received signal strength matrix at time t + i-C.
Further, in step S3, defining the representation of the action in the beam training, specifically including:
defining the position corresponding to the largest element in said matrix Z_t as i_t = (n_t*, m_t*), where n_t* and m_t* respectively represent the indexes of the optimal transmit and receive beams at time t in the codebooks F and W;
the optimal beam combination at the current time is (f_{n_t*}, w_{m_t*});
the action at time t is defined as:
A_t = (d, o, r)   (8)
In formula (8), d, o and r respectively represent the direction, offset and coverage range of the beam search at time t + 1 relative to the optimal beam position at time t. Here d ∈ D = {0, 1, 2, 3, 4}, giving 5 selectable directions: 0 represents no movement, while 1, 2, 3, 4 respectively represent moving up, down, left and right with the position i_t as the base point; o ∈ O = {0, 1, 2, …, M-1}, giving M selectable offsets, defined as the distance between the center position of the beam search at time t + 1 and the position i_t; and r ∈ R = {1, 2, …, N}, giving N selectable radii, where the radius is defined as the coverage radius with the center position of the beam search at time t + 1 as the base point.
Further, in step S3, defining the representation of the reward in the beam training, specifically including:
R_t = ((T_s - |B_{t+1}|·t_s - t_p) / T_s) · R̄_{t+1}   (9)
In formula (9), B_{t+1} is the set of beam combinations for testing at the next moment, obtained when the agent performs action A_t at time t; t_s represents the time to test one beam combination; t_p represents the precoding phase within one time step of the beam training; T_s represents one time step of the beam training; and R̄_{t+1} represents the data achievable rate corresponding to the optimal beam combination at time t + 1.
Further, the convolutional neural network specifically comprises two convolutional layers, two pooling layers, a flatten layer, a fully connected layer and an output layer; before a state is input into the neural network it is normalized, specifically, the two-dimensional image represented by each channel is normalized.
Furthermore, the value of the input state is estimated with the convolutional neural network, and the parameters of the network are updated by taking the predicted value of Q-learning as the target; the trained network is then used for prediction, the action with the maximum Q value is selected, and the beam combination cluster corresponding to that action is used for testing, so as to reduce the training overhead.
A millimeter wave communication beam training device based on deep reinforcement learning, the device comprising:
the beam selection module is used for acquiring a receiving beam set and a transmitting beam set according to the executed action;
the channel sample generation module is used for generating a plurality of randomly changed channel matrixes and calculating the optimal receiving and transmitting beam combination of each channel matrix;
the receiving signal matrix module is used for calculating the receiving signal intensity corresponding to the receiving and transmitting beam pair set in the beam selection module;
the state updating module updates the current state by using the received signal strength matrix;
the optimal transceiving beam combination determining module is used for obtaining the optimal transceiving beam combination corresponding to the channel: after an action is executed, the beam search range at the next moment is obtained according to that action, all beam combinations within the range are tested, and the beam combination with the maximum received signal strength is selected as the optimal beam combination;
the reward calculation module is used for calculating a reward value for executing the action by using the obtained optimal beam combination and other parameters;
the parameter setting module is used for setting parameters of the neural network and other parameters in the beam training process;
the experience storage module is used for storing experiences in the beam training process into a set;
the neural network training module is used for inputting the state matrix into the neural network, outputting all action values corresponding to the state, and selecting a plurality of experiences from the memory library to update the network parameters;
a target value setting module which calculates a target value corresponding to each experience by using the updating strategy of the Q learning;
and the neural network prediction module predicts all action values corresponding to the input state by using the trained network and selects the action with the maximum Q value as the optimal action.
The beneficial effects of the invention are:
1. The invention introduces a reinforcement learning framework into the beam training, so that the trained beam is adjusted in time as the channel changes and the state of the channel is tracked, thereby more accurately predicting the optimal transceiving beam combination of an unknown channel, effectively reducing the beam training overhead and ensuring the performance of the beam training.
2. Different from the traditional training modes such as beam scanning and the like, the number of beam combinations tested each time is not fixed and is dynamically changed under different channel states, so that the overhead of beam training is effectively reduced.
3. In the design of receiving and transmitting beams, the invention only adopts narrow beams, thereby greatly reducing the complexity of hardware.
Drawings
Fig. 1 is a schematic input/output diagram of a neural network in embodiment 1;
FIG. 2 is a graphical representation of the received signal strength matrix in example 1;
FIG. 3 is a diagram illustrating states (matrices) of reinforcement learning in example 1;
FIG. 4 is a schematic diagram of the action execution process in embodiment 1, wherein FIG. 4a shows the received signal strength matrix Z_τ at time τ, and FIGS. 4b-4f respectively show the received signal strength matrices at time τ + 1 obtained by taking different actions A_τ^(j), j = 1, …, 5;
FIG. 5 is a schematic time slice diagram of beam training in example 1;
fig. 6 is a diagram illustrating a comparison of beam search success rates when the number of channel paths is different;
FIG. 7 is a diagram illustrating a comparison of the achievable rates of users when the number of channel paths is different;
fig. 8 is a schematic diagram showing the beam training method proposed in embodiment 1 compared with the beam scanning, layered codebook-based beam training method in terms of the success rate of beam search;
fig. 9 is a schematic diagram showing a comparison between the beam training method proposed in embodiment 1 and the beam scanning, layered codebook-based beam training method in terms of user reachable rate.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1 to 5, the present embodiment provides a millimeter wave communication beam training method based on deep reinforcement learning, which specifically includes:
consider a millimeter-wave massive MIMO system for a single user, with N at the userrRoot antenna, baseAt station has NtThe antennas are arranged in a Uniform Linear Array (ULA) manner. According to the widely used Saleh-Valenzuela model, the millimeter wave channel of the downlink can be modeled as:
wherein, L, alphal、θlRespectively representing the number of paths, the channel gain of the ith path, the arrival angle of the channel and the departure angle of the channel. Usually, the path of 1 is the LOS path, and the other paths are the NLOS paths. Definition of ΘlAnd ΨlObey [0, π ] for both arrival and departure angles in the spatial domain]Even distribution within. dtAnd drRespectively representing the spacing of the array antennas at the base station side and the user side, lambda is the wavelength of the millimeter wave signal, and in general,u (-) denotes a channel steering vector, defined as follows:
since the channel considered in the beam training of the present invention is time-varying, the channel needs to be dynamically modeled. In an actual communication environment, the channel variation is generally random, and in the present invention, a gaussian random walk is adopted as a variation form of the channel, that is, the variation amount of the steering angle (departure angle and arrival angle) of the channel in adjacent time intervals obeys gaussian distribution, which is specifically expressed as follows:
wherein, theta0U (0, pi) represents the initial channel steering angle, θ, at which time t is 0 and randomtIndicating the channel steering angle at time t,indicating the amount of change in the channel steering angle.
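As a minimal sketch of formulas (1) and (2) (assuming half-wavelength spacing, complex Gaussian path gains, and illustrative array sizes N_t = 16, N_r = 4; the noise variance σ_θ = 0.05 is likewise an assumption):

```python
import numpy as np

def steering_vector(n_ant, angle):
    """Channel steering vector u(N, Theta) for a half-wavelength ULA;
    `angle` is the physical steering angle in [0, pi]."""
    k = np.arange(n_ant)
    return np.exp(1j * np.pi * k * np.cos(angle)) / np.sqrt(n_ant)

def channel(n_t, n_r, aod, aoa, gains):
    """Saleh-Valenzuela model, formula (1): a sum of L rank-one paths."""
    L = len(gains)
    H = np.zeros((n_r, n_t), dtype=complex)
    for a_t, a_r, g in zip(aod, aoa, gains):
        H += g * np.outer(steering_vector(n_r, a_r),
                          steering_vector(n_t, a_t).conj())
    return np.sqrt(n_t * n_r / L) * H

def evolve(angles, sigma, rng):
    """Gaussian random walk, formula (2): theta_t = theta_{t-1} + N(0, sigma^2)."""
    return angles + rng.normal(0.0, sigma, size=angles.shape)

rng = np.random.default_rng(0)
L = 3
aod = rng.uniform(0, np.pi, L)                       # departure angles Psi_l
aoa = rng.uniform(0, np.pi, L)                       # arrival angles Theta_l
gains = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2)
H0 = channel(16, 4, aod, aoa, gains)                 # channel at time t
H1 = channel(16, 4, evolve(aod, 0.05, rng), evolve(aoa, 0.05, rng), gains)
```

Successive channels H0, H1, … differ only through the small random-walk perturbation of the angles, which is what makes tracking the optimal beam pair across time feasible.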
Before beam training, a codebook is defined at both the transmitting end and the receiving end, each codebook comprises a series of codewords, and each codeword represents a beam forming vector. In the present invention, a Discrete Fourier Transform (DFT) codebook is used as a codebook at a transmitting end, and the nature of the DFT codebook is a two-dimensional complex matrix determined according to the number of antennas, and a modulus value of each element in the matrix is constant. The DFT codebook is well suited for training of analog beams because the phase shifter network constituting the analog beamforming part only changes the phase of the transmitted signal and does not provide a gain in power.
DFT codebooks at the transmitting end and the receiving end are defined as F and W, respectively, where F ∈ C^{N_t×N_t} contains N_t codewords and W ∈ C^{N_r×N_r} contains N_r codewords. The codewords contained in the two codebooks each represent a channel steering vector pointing in a different direction of space, represented as follows:
f_n = (1/sqrt(N_t)) [1, e^{-j2π(n-1)/N_t}, …, e^{-j2π(N_t-1)(n-1)/N_t}]^T, n = 1, …, N_t,
w_m = (1/sqrt(N_r)) [1, e^{-j2π(m-1)/N_r}, …, e^{-j2π(N_r-1)(m-1)/N_r}]^T, m = 1, …, N_r.
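A sketch of the DFT codebook construction described above (the n-th column is the n-th codeword; the sizes N_t = 16 and N_r = 4 are illustrative assumptions):

```python
import numpy as np

def dft_codebook(n_ant):
    """N x N DFT codebook: each column is a constant-modulus codeword,
    a steering vector pointing in one of N spatial directions."""
    idx = np.arange(n_ant)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n_ant) / np.sqrt(n_ant)

F = dft_codebook(16)   # transmitting-end codebook, N_t codewords
W = dft_codebook(4)    # receiving-end codebook, N_r codewords
```

Every codeword has unit norm, matching ‖f‖_2 = ‖w‖_2 = 1, and each entry has constant modulus, which is why such codebooks suit a phase-shifter-only analog beamforming network.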
assuming that the sending end uses the codeword f to send the signal x, the receiving end uses the codeword w to receive the signal, and the transmission of the channel matrix H is performed in the middle, and the final received signal may be represented as:
wherein, P, Respectively representing the transmission power of the base station, the received code word of the user terminal, the transmission code word of the base station terminal and the channel noise vector. Neither the transmitted nor the received codeword provides a power gain, i.e. | w |)2=‖f‖21, the transmission signal x has normalized power, | x ∞ n2=1。
The user achievable rate can be expressed as:
R = log₂(1 + P·|w^H H f|² / σ²).
in the beam training process, the transmitting end and the receiving end respectively test each code word in the codebooks F and W to find a transmitting end beam forming vector F and a receiving end beam forming vector W which can be optimally matched with the channel H. Therefore, the beam training problem can be equivalent to the following optimization problem:
in the beam training, the transmitting power P of the signal and the variance sigma of the channel noise2Given, the above optimization problem can be simplified to:
however, in practical situations, the channel H is usually unknown and cannot be directly solved for optimal f and w. It is common practice to find the best combination of f and w by measuring the strength value of the received signal y, so the beam training process can be expressed as the following optimization problem:
the optimal solutions of the above two optimization problems may be different due to the presence of the channel noise η. If the two are the same, the beam training is successful, otherwise, the beam training is failed. Suppose that a total of N is performedtotalThe secondary beam training succeeds NsucNext, the success rate of the beam search can be expressed as:
the received signal matrix at time t is:
wherein the content of the first and second substances, DFT codebooks at the receiving end and the transmitting end are respectively indicated,a channel matrix representing time t, x, P representing the transmitted signal and the power of the signal respectively,representing the channel noise matrix at time t, defining a received signal strength matrix ZtIs YtThe following steps of (1):
Zt(m,n)=|Yt(m,n)|,
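Computing Y_t and Z_t in one matrix product could be sketched as follows (array sizes, the unit signal x = 1, and the noise level 0.01 are illustrative assumptions):

```python
import numpy as np

def received_matrix(H, F, W, power, x, noise):
    """Y_t = sqrt(P) W^H H F x + W^H N: entry (m, n) is the signal received
    with the m-th receive codeword when the n-th transmit codeword is used."""
    return np.sqrt(power) * (W.conj().T @ H @ F) * x + W.conj().T @ noise

rng = np.random.default_rng(1)
n_t, n_r = 16, 4
idx_t, idx_r = np.arange(n_t), np.arange(n_r)
F = np.exp(-2j * np.pi * np.outer(idx_t, idx_t) / n_t) / np.sqrt(n_t)
W = np.exp(-2j * np.pi * np.outer(idx_r, idx_r) / n_r) / np.sqrt(n_r)
H = rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
N = 0.01 * (rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t)))
Y = received_matrix(H, F, W, power=1.0, x=1.0, noise=N)
Z = np.abs(Y)      # received signal strength matrix Z_t(m, n) = |Y_t(m, n)|
```

In the method itself only the entries of Z inside the action-selected search window are actually measured; the full matrix is shown here only to make the definition concrete.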
as shown in FIG. 2, in the present example, Z istThe two dimensions of the image respectively represent the indexes of the transmitting and receiving end code words, and each grid in the image corresponds to one transmitting and receiving beam combination. The image describes the distribution of the corresponding received signal strength when the transceiving end uses different beams to test, and the pixel position with larger gray scale value in the image corresponds to the beam combination with high received signal strength. Image Z due to the sparsity of the millimeter wave channeltIs close to 0, the positions of those non-zero elements correspond to the distribution of the steering angles under the current channel, ZtThe position of the medium maximum element corresponds to the searched optimal beam combination. If the positions of the non-zero elements can be dynamically tracked, the change condition of the channel can be sensed in time, and the training overhead is greatly reduced. To capture the dynamically changing channel, we define several consecutive images as one state St, namely:
St(i)=Zt+i-C,i=1,2,…,C,
wherein S istIs a three-dimensional matrix with a third dimension of size C, indicating a state matrix StWhich comprises C two-dimensional matrices Z. S. thetThe ith two-dimensional matrix corresponds to a received signal strength matrix Z at the time of t + i-Ct+i-CAnd S istThe received signal strength matrix Z of which the last two-dimensional matrix is at the moment tt. As shown in FIG. 3, the state matrix S may also betThe image is viewed as a multi-channel image, so that the convolutional neural network can be used for training.
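Stacking the C most recent strength matrices into the multi-channel state image could look like the following sketch (C = 4 and the matrix sizes are illustrative assumptions):

```python
import numpy as np

def make_state(z_history, C):
    """Stack the C most recent received signal strength matrices along a
    third axis, so channel i is Z_{t+i-C} and the last channel is Z_t."""
    return np.stack(z_history[-C:], axis=-1)

rng = np.random.default_rng(2)
z_history = [rng.random((4, 16)) for _ in range(6)]   # Z_{t-5}, ..., Z_t
S_t = make_state(z_history, C=4)                      # shape (N_r, N_t, C)
```

The resulting (N_r, N_t, C) array is exactly the multi-channel image layout that a convolutional network expects as input.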
From the above, the last two-dimensional matrix contained in the state matrix S_t at time t is Z_t; the position corresponding to the largest element in Z_t is defined as:
i_t = (n_t*, m_t*),
where n_t* and m_t* respectively represent the indexes of the optimal transmit and receive beams at time t in the codebooks F and W, so the optimal beam combination at the current time is (f_{n_t*}, w_{m_t*}). To obtain the state matrix S_{t+1} at time t + 1, the action at time t is defined as a triple:
A_t = (d, o, r),
where d, o and r respectively represent the direction, offset and coverage range of the beam search at time t + 1 relative to the optimal beam position at time t. Here d ∈ D = {0, 1, 2, 3, 4}, giving 5 possible directions: 0 represents no movement, while 1, 2, 3, 4 respectively represent moving up, down, left and right with the position i_t as the base point. o ∈ O = {0, 1, 2, …, M-1}, giving M selectable offsets, defined as the distance between the center position of the beam search at time t + 1 and the position i_t. r ∈ R = {1, 2, …, N}, giving N selectable radii, where the radius is defined as the coverage radius with the center position of the beam search at time t + 1 as the base point; for example, r = 1 indicates that the coverage area of the beam search is a square area with side length 3.
The specific execution process of an action is shown in FIG. 4. Each image represents a received signal strength matrix Z at a certain time, and the gray value of each pixel represents the modulus of the received signal obtained by testing the corresponding beam combination. The positions of the colored grids correspond to the beam combinations to be trained and have gray values greater than 0; the beam combinations at other positions need no training, and their gray values are set to 0. Assume FIG. 4a represents the received signal strength matrix Z_τ at time τ; the dark grid position represents the optimal beam combination index i_τ = (4, 5) determined by the beam training. FIGS. 4b-4f respectively show the received signal strength matrices at time τ + 1 obtained by taking different actions A_τ^(j), j = 1, …, 5. For example, FIG. 4b shows the received signal strength matrix at time τ + 1 obtained by taking action A_τ^(1): according to the definition of actions above, the coverage area of the beam training is a square area with r = 1 (side length 3), whose center position (dark grid) is the previous optimal beam combination index i_τ; FIGS. 4c-4f are obtained similarly.
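A sketch of how an action triple could expand into the set B_{t+1} of beam index pairs to test (clipping at the codebook edges is an assumption of this sketch; the patent does not state the boundary rule):

```python
def beam_window(center, action, n_r, n_t):
    """Map action A_t = (d, o, r) to the beam index pairs tested at t+1.
    d: 0 = stay, 1 = up, 2 = down, 3 = left, 4 = right (relative to the
    optimal pair at time t); o: offset of the new search centre;
    r: half-width of the square search region."""
    d, o, r = action
    m, n = center
    dm, dn = {0: (0, 0), 1: (-o, 0), 2: (o, 0), 3: (0, -o), 4: (0, o)}[d]
    cm, cn = m + dm, n + dn
    # clip out-of-range indices to the codebook edges (assumption)
    return sorted({(min(max(i, 0), n_r - 1), min(max(j, 0), n_t - 1))
                   for i in range(cm - r, cm + r + 1)
                   for j in range(cn - r, cn + r + 1)})

pairs = beam_window(center=(4, 5), action=(0, 0, 1), n_r=16, n_t=16)
```

With r = 1 the window is the 3×3 square around the previous optimum, so only 9 beam combinations are tested instead of all N_r·N_t pairs.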
Assume that at time t the agent performs action AtGet the set of beam combinations for testing at the next time as Where I denotes the total number of beam combinations used for training, set Bt+1The elements contained in the test table are used as receiving and transmitting wave beams to be tested one by one, thereby obtaining a received signal intensity matrix Z at the t +1 momentt+1According to Zt+1The state matrix S at time t +1 can be constructedt+1:
Select the maximum value z_{t+1}* of the matrix Z_{t+1}; the corresponding beam combination is taken as the optimal beam combination i_{t+1} at time t+1. Action A_{t+1} is then performed, and so on.
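The sliding state construction S_t(i) = Z_{t+i−C} keeps the C most recent strength matrices as channels of one image, with the newest matrix in the last channel. A sketch of this bookkeeping (names are illustrative; start-up padding by repeating the oldest frame is an assumption, since the patent initializes the state by beam scanning):

```python
from collections import deque
import numpy as np

C, GRID = 6, 16                       # state depth and codebook grid size
history = deque(maxlen=C)             # the C most recent Z matrices

def push_measurement(Z):
    """Append the newest received-signal-strength matrix Z and return the
    state S as a C x GRID x GRID multi-channel image (newest channel last,
    matching S_t(i) = Z_{t+i-C})."""
    history.append(Z)
    pad = [history[0]] * (C - len(history))   # repeat oldest frame at start-up
    return np.stack(pad + list(history))

S = push_measurement(np.zeros((GRID, GRID)))
S = push_measurement(np.ones((GRID, GRID)))
assert S.shape == (C, GRID, GRID) and S[-1].max() == 1.0
```

Because the deque has `maxlen=C`, the oldest matrix is dropped automatically once C measurements have accumulated, so each time step costs one append and one stack.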
Considering the channel achievable rate and the beam training overhead jointly, the reward function is defined in the present invention in the following form:
where Ts represents the size of a time slice, t_d indicates the effective data transmission time within a time slice, and the data achievable rate corresponding to the optimal beam combination at time t+1 is defined as

R̃_{t+1} = log2(1 + P·|z_{t+1}*|²/σ²),

where P is the transmit signal power and σ² is the channel noise variance.
The reward function can be understood as the effective data achievable rate over a time slice (since, in addition to transmitting data, a time slice must also accommodate beam training and precoding).
FIG. 5 shows the definition of a time slice, from which t_d = Ts − t_b − t_p = Ts − I·t_s − t_p, where I = |B_{t+1}| denotes the size of the beam set for testing at time t+1 obtained by performing action A_t, and t_s denotes the time for testing one beam combination.
Thus, the reward function R_t can finally be expressed in the following form:

R_t = (t_d/Ts)·R̃_{t+1} = ((Ts − I·t_s − t_p)/Ts)·R̃_{t+1}.
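Under this definition the reward is the fraction of the slot left for data, scaled by the achievable rate of the chosen beam pair. A hedged numerical sketch: the log2 SNR form of the rate and the precoding time value `tp` are assumptions (the patent does not state a t_p value), and `z_best` stands for the modulus of the received signal of the selected beam pair:

```python
import numpy as np

def reward(z_best, n_tested, Ts=20.0, ts=0.1, tp=0.5, P=1.0, sigma2=1.0):
    """R_t = (t_d / Ts) * Rtilde_{t+1}, with t_d = Ts - I*ts - tp.
    z_best: received-signal modulus for the selected beam combination.
    n_tested: I = |B_{t+1}|, the number of beam combinations tested.
    tp is an assumed precoding time in ms (not given in the text)."""
    t_d = Ts - n_tested * ts - tp                   # effective data time
    rate = np.log2(1.0 + P * z_best**2 / sigma2)    # achievable rate
    return (t_d / Ts) * rate

# testing fewer beams leaves more of the slot for data, so the reward rises
assert reward(1.0, 9) > reward(1.0, 256)
```

With these numbers, a 9-beam window keeps 18.6 ms of a 20 ms slot for data, while an exhaustive 256-beam scan would overrun the slot entirely, which is exactly the trade-off the agent is rewarded for managing.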
Because the state S can be represented as a multi-channel image, the present invention uses a convolutional neural network to process it. The network structure is shown in FIG. 1 and comprises two convolutional layers, two pooling layers, a flatten layer, a fully-connected layer, and an output layer. The convolutional layers are the result of convolution operations, the pooling layers are the result of sampling operations, and the flatten layer converts a multidimensional matrix into a one-dimensional vector. The input to the network is the state matrix S_t, and the output is the set of all action values Q(S_t, A_t) corresponding to that state; the dimension of the output layer equals the size of the action space A.
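The layer dimensions of the 7-layer DQN described in the simulation section can be checked without a deep-learning framework. A shape-bookkeeping sketch, assuming 'same' padding for the 5 × 5 convolutions and a 16 × 16 codebook grid (both assumptions, since the patent does not state the padding):

```python
def dqn_output_dims(grid=16, channels=6, n_actions=20):
    """Trace feature-map shapes through the 7-layer DQN:
    conv(32, 5x5) -> maxpool/2 -> conv(16, 5x5) -> maxpool/2
    -> flatten -> fc(128) -> output(n_actions)."""
    h, w, c = grid, grid, channels   # input: C x 16 x 16 multi-channel image
    c = 32                   # conv1: 32 kernels, 'same' padding keeps h x w
    h, w = h // 2, w // 2    # max pooling, stride 2
    c = 16                   # conv2: 16 kernels, 'same' padding
    h, w = h // 2, w // 2    # max pooling, stride 2
    flat = c * h * w         # flatten layer: 16 * 4 * 4 = 256 features
    return flat, 128, n_actions

# action space size |A| = |D| * |O| * |R| = 5 * 2 * 2 = 20 in the simulation
assert dqn_output_dims() == (256, 128, 20)
```

The output dimension 20 follows from the simulation settings D = {0,1,2,3,4}, O = {2,4}, R = {1,3}; a different choice of O or R changes only the output layer.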
To accelerate the convergence of the model, S_t needs to be normalized before being input into the neural network; that is, the two-dimensional image represented by each channel is normalized:

S_t(i) = S_t(i)/max(S_t(i)), i = 1, 2, …, C,

where max(S_t(i)) represents the maximum gray value of the ith two-dimensional image of S_t.
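The per-channel normalization can be sketched in numpy as follows (illustrative only; the small `eps` guard against an all-zero channel is an assumption, not from the patent):

```python
import numpy as np

def normalize_state(S, eps=1e-12):
    """Divide each channel (two-dimensional image) of S by its own
    maximum gray value, so every channel lies in [0, 1]."""
    peaks = S.max(axis=(1, 2), keepdims=True)   # max(S_t(i)) per channel
    return S / np.maximum(peaks, eps)           # eps guards empty channels

S = np.random.rand(6, 16, 16) * 5.0
Sn = normalize_state(S)
assert np.allclose(Sn.max(axis=(1, 2)), 1.0)
```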
The beam training based on deep reinforcement learning mainly comprises the following steps:
Step 3: perform beam training. First, set the total number of training episodes and the number of time steps T contained in each episode. At the beginning of each episode, randomly generate C time-varying channels H_t and initialize the initial state S_0 in a beam scanning manner. The following steps are then performed in each time step in turn:
Step 3.1: suppose the state at time t is S_t; select an action A_t from the action space A according to the ε-greedy strategy.
Step 3.2: perform action A_t and determine the set B_{t+1} of beam combinations for testing at time t+1. Calculate the received signal strengths corresponding to all elements in the set, and set the received signals corresponding to untested beam combinations to 0, thereby obtaining the received signal strength matrix Z_{t+1}.
Step 3.3: update the state S_{t+1} at time t+1.
Step 3.4: select the maximum value z_{t+1}* of the matrix Z_{t+1}; the corresponding beam combination is taken as the optimal beam combination i_{t+1} at time t+1.
Step 3.6: calculate the reward function R_t.
Step 4.1: store this experience E_t = (S_t, A_t, R_t, S_{t+1}) into the memory bank.
Step 4.2: randomly select N experiences E_j = (s_j, a_j, r_j, s′_j), j = 1, 2, …, N, from the memory bank and set the target value corresponding to each experience:

y_j = r_j + γ·max_{a′} Q(s′_j, a′; θ′),

where γ is the discount factor and θ′ is the parameter of the target neural network.
Step 4.3: perform stochastic gradient descent on the parameter θ to train the neural network.
Step 4.4: update the parameter θ′ of the target neural network after T time steps.
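Steps 4.1-4.4 follow the standard DQN recipe: a replay memory, a prediction network θ, and a target network θ′ that supplies the Q-learning targets. A framework-free sketch with a toy linear Q-function standing in for the CNN, to show the minibatch target computation (all names and sizes are illustrative):

```python
import random
import numpy as np

GAMMA = 0.95
memory = []                                    # replay memory bank

def q_values(state, theta):
    """Toy linear Q-function standing in for the CNN: Q = theta @ state."""
    return theta @ state

def td_targets(batch, theta_target):
    """Q-learning targets y_j = r_j + gamma * max_a' Q(s'_j, a'; theta')."""
    return np.array([r + GAMMA * q_values(s2, theta_target).max()
                     for (s, a, r, s2) in batch])

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 8))                # prediction network parameters
theta_target = theta.copy()                    # target network parameters theta'
for _ in range(32):                            # step 4.1: store experiences E_t
    s, s2 = rng.normal(size=8), rng.normal(size=8)
    memory.append((s, int(rng.integers(4)), float(rng.normal()), s2))
batch = random.sample(memory, 8)               # step 4.2: random minibatch
y = td_targets(batch, theta_target)            # step 4.2: target values
assert y.shape == (8,)
```

In the full algorithm the gradient step of step 4.3 would minimize (y_j − Q(s_j, a_j; θ))², and step 4.4 would copy θ into θ′ periodically.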
Specifically, in the training step:
The actual optimal beam combination corresponding to the channel H_t at time t is the optimal solution of the following optimization problem:

i_t* = argmax_{f∈F, w∈W} |w^H·H_t·f|².
The above problem amounts to finding the beam combination with the largest objective function value in the codebooks F and W. Assume the optimal solution of this problem (the actual optimal beam combination) is i_t*, and the optimal beam combination obtained through beam training, as defined previously, is i_t. If i_t = i_t*, the beam training succeeds; otherwise it fails. Because the channel varies randomly, the state of the channel may not be accurately captured at some time instants, causing beam training to fail. In this case the position of the channel cannot be tracked, and if such samples are still used to update the DQN, the error will propagate continuously and cause the algorithm to fail. Therefore S_t needs to be redefined:
S_t(C) = W^H·H_t·F,
where W and F denote the DFT codebooks at the receiving and transmitting ends, respectively. According to the above definition, only the last two-dimensional matrix of the three-dimensional state matrix S_t is changed; the other positions remain unchanged. Since the last two-dimensional matrix of S_t was previously the received signal strength matrix Z_t at time t, and i_t is obtained from Z_t, Z_t is removed from S_t when necessary and replaced by the full-scan result, so that the algorithm is relocated to the state of the current channel.
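Both the actual optimal beam combination and the relocation matrix W^H H_t F come from one full codebook sweep. A numpy sketch with unit-norm DFT codebooks (the codebook construction shown is a standard assumption; the patent only names DFT codebooks):

```python
import numpy as np

def dft_codebook(n):
    """Columns are n unit-norm DFT beamforming vectors."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

def full_scan(H):
    """Return |W^H H F| and the index pair of the actual optimal beams."""
    Nr, Nt = H.shape
    W, F = dft_codebook(Nr), dft_codebook(Nt)
    Y = np.abs(W.conj().T @ H @ F)            # exhaustive received strengths
    m, n = np.unravel_index(Y.argmax(), Y.shape)
    return Y, (m, n)                          # (receive index, transmit index)

rng = np.random.default_rng(1)
H = rng.normal(size=(16, 16)) + 1j * rng.normal(size=(16, 16))
Y, best = full_scan(H)
assert Y.shape == (16, 16) and Y[best] == Y.max()
```

Comparing `best` with the index produced by the windowed search gives exactly the success/failure test described above.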
Example 2
In this embodiment, on the basis of embodiment 1, a millimeter wave communication beam training device based on deep reinforcement learning is provided, where the device includes:
a beam selection module, for obtaining, according to the action A_t performed at time t, the set B_{t+1} = {b_{t+1}^1, b_{t+1}^2, …, b_{t+1}^I} of beam combinations for testing at the next time, where I represents the total number of beam combinations used for training and b_{t+1}^i represents the ith transmit-receive beam combination;
a channel sample generation module, for generating a plurality of time-varying channel matrices H_t according to the random variation of the channel steering angle, and determining, by beam scanning, the optimal transmit-receive beam combination corresponding to each channel matrix H_t;
a received signal matrix module, which tests the beams in the set B_{t+1} in sequence to obtain the received signal strength z_{t+1} corresponding to each beam combination, and sets the received signals corresponding to the other, untested beam combinations to 0, thereby obtaining the received signal strength matrix Z_{t+1};
a state updating module, for constructing the state matrix S_{t+1} at time t+1 from the received signal strength matrix Z_{t+1} at time t+1, and updating the current state;
an optimal transmit-receive beam combination determining module, for selecting the maximum value z_{t+1}* of the matrix Z_{t+1}; the corresponding beam combination, consisting of the optimal transmit beam and the optimal receive beam (obtained through beam training), is taken as the optimal beam combination at time t+1;
a reward calculation module, for computing the reward value of performing action A_t using the obtained optimal beam combination, the transmit signal power P, and the channel noise variance σ²;
a parameter setting module, for setting the parameters of the neural network, other parameters in the beam training process, and the like;
an experience storage module, for storing the experiences E_t = (S_t, A_t, R_t, S_{t+1}) generated in the beam training process into the memory bank;
a neural network training module: the input of the neural network is the state matrix S_t and the output is all action values Q(S_t, A_t) corresponding to that state; several experiences E_j = (s_j, a_j, r_j, s′_j), j = 1, 2, …, N, are selected from the memory bank to update the prediction neural network parameter θ;
a target value setting module, which calculates a target value for each experience using the update strategy of Q-learning:

y_j = r_j + γ·max_{a′} Q(s′_j, a′; θ′);
a neural network prediction module, for predicting, using the trained network, all action values Q(S_t, a), a ∈ A, corresponding to the input state S_t, and selecting the action with the maximum Q value, A_t = argmax_{a∈A} Q(S_t, a), as the optimal action.
The invention is further described below with reference to simulation conditions and results:
The state image S_t depth is C = 6; the action triple sets are D = {0, 1, 2, 3, 4}, O = {2, 4}, and R = {1, 3}; the time slice size is Ts = 20 ms; the time to train one beam combination is t_s = 0.1 ms; the precoding time is t_p; the learning rate is α = 0.001; the discount factor is γ = 0.95; the memory bank size is D = 2000; the parameters of the prediction neural network are assigned to the target neural network every update_freq = 100 time steps; and the training batch size is batch_size = 64. The adopted DQN is a 7-layer convolutional neural network: the first convolution operation comprises 32 kernels of size 5 × 5 and the second comprises 16 kernels of size 5 × 5; both pooling operations use max pooling with stride 2; the flatten function converts a three-dimensional matrix into a one-dimensional vector; the fully-connected layer comprises 128 neurons; and the dimension of the output layer corresponds to the size of the action space.
Consider the downlink of a single-user millimeter wave massive MIMO communication system with N_t = 16 base-station antennas and N_r = 16 user antennas, all antenna arrays placed in ULA form. Assume the number of propagation paths of the millimeter wave signal is L = 3: the LOS path channel gain follows CN(0, 1), i.e., a complex Gaussian distribution with variance 1 and mean 0, and the two NLOS path channel gains follow CN(0, 0.01), i.e., a complex Gaussian distribution with variance 0.01 and mean 0. For convenience of processing, the channel noise variance is assumed to be σ² = 1, the variation of the channel steering angle follows the Gaussian distribution defined earlier, the transmit signal power is P = 1, and the transmitted signal is x = 1. FIGS. 6-7 show the simulation results of DQN-based beam training considering multiple propagation paths of the millimeter wave channel. As can be seen from the figures, as the acceptable signal-to-noise ratio increases, both the success rate and the achievable rate of the beam search show an increasing trend. When NLOS paths are added, the state change of the channel becomes more complicated, and the corresponding beam search success rate and achievable rate both decrease to some extent, but the decrease is small. This shows that the multipath effect of the millimeter wave channel has little influence on the algorithm, and the beam training algorithm based on deep reinforcement learning still maintains high performance in the multipath scenario.
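The simulated time-varying channel (one strong LOS path and two weaker NLOS paths over ULAs) can be generated as follows. The steering-vector and gain conventions are standard assumptions, since the patent's formulas (4)-(5) are not reproduced in the text, and the half-wavelength spacing is an assumption:

```python
import numpy as np

def ula_steering(n, theta, d_over_lambda=0.5):
    """Unit-norm ULA steering vector for physical angle theta
    (antenna spacing d = lambda/2 assumed)."""
    k = np.arange(n)
    return np.exp(1j * 2 * np.pi * d_over_lambda * k * np.cos(theta)) / np.sqrt(n)

def mmwave_channel(Nt=16, Nr=16, rng=None):
    """L = 3 paths: LOS gain ~ CN(0, 1), two NLOS gains ~ CN(0, 0.01);
    arrival/departure angles uniform in [0, pi]."""
    rng = rng or np.random.default_rng()
    H = np.zeros((Nr, Nt), complex)
    for std in (1.0, 0.1, 0.1):                    # per-path gain std dev
        alpha = std * (rng.normal() + 1j * rng.normal()) / np.sqrt(2)
        aoa, aod = rng.uniform(0, np.pi, size=2)
        H += alpha * np.outer(ula_steering(Nr, aoa),
                              ula_steering(Nt, aod).conj())
    return H

H = mmwave_channel(rng=np.random.default_rng(2))
assert H.shape == (16, 16)
```

Perturbing `aoa`/`aod` by a small Gaussian step between time slots, as the patent describes, yields the time-varying channel sequence H_t used for training.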
Consider the downlink of a single-user millimeter wave massive MIMO communication system with N_t = N_r = 16 antennas at the base station end and the user end, all antenna arrays placed in ULA form. Only the LOS path of the millimeter wave channel is considered, with channel gain following CN(0, 1); the channel noise variance is σ² = 1, the variation of the channel steering angle follows the Gaussian distribution defined earlier, the transmit signal power is P = 1, and the transmitted signal is x = 1. The hierarchical codebook adopts the codebook construction method in document [1], in which the codeword of the current layer is constructed according to the beam training result of the previous layer; the DFT codebook is used for beam scanning. FIGS. 8-9 compare the performance of the proposed DQN-based beam training algorithm (BT-DQN), beam scanning (BS), and the hierarchical-codebook-based beam training algorithm (BT-HC). As can be seen from FIG. 8, among the three beam training schemes, the success rate of BS is the highest under all signal-to-noise ratios. The search success rate of BT-DQN in the low and high signal-to-noise-ratio regions is close to, and slightly higher than, that of BT-HC, and its success rate in the middle region is higher than that of BT-HC. As can be seen from FIG. 9, under different signal-to-noise ratios the achievable rate of BS remains the highest, BT-DQN is second, and BT-HC is the lowest.
Although the search success rate and achievable rate of BS are the highest of the three, it requires more beams to be trained each time and takes more time, and is therefore more costly. Table 1 compares the overhead of the three beam training schemes, where t_s is the time to train one beam combination and t is the time of one beam training. As can be seen from the table, the overhead of BS is more than 10 times that of BT-HC: its higher search success rate and achievable rate come at the cost of huge overhead. The average overhead of BT-DQN is lower than that of BT-HC, a reduction of about 21%.
TABLE 1

| Algorithm name | Average overhead (t/t_s) |
| --- | --- |
| Beam Scanning (BS) | 256 |
| Beam training based on hierarchical codebook (BT-HC) | 24 |
| Beam training based on DQN (BT-DQN) | 19 |
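The figures in Table 1 can be checked directly: beam scanning tests every pair in the 16 × 16 codebooks, and the quoted ~21% saving of BT-DQN over BT-HC follows from the averages. A short arithmetic check:

```python
Nt = Nr = 16
bs = Nt * Nr                      # exhaustive scan: 256 beam combinations
bt_hc, bt_dqn = 24, 19            # average overheads from Table 1
assert bs == 256
assert bs / bt_hc > 10            # BS costs over 10x BT-HC
assert round(100 * (bt_hc - bt_dqn) / bt_hc) == 21   # ~21% reduction
```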
The simulation results show that, in a dynamic channel environment, the beam training scheme and device provided by the invention achieve a higher beam search success rate and achievable rate than the codebook-based beam training scheme, with lower training overhead. Although its performance does not match that of beam scanning, beam scanning exchanges enormous training overhead for its high success rate and achievable rate, which is not worthwhile in most cases. Therefore, in time-varying channel scenarios, the deep-Q-network-based beam training scheme provided by the invention can greatly reduce the overhead of beam training while maintaining high performance.
Details not elaborated in the present invention are well known to those skilled in the art.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the above teachings. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (4)
1. A millimeter wave communication beam training method based on deep reinforcement learning is characterized by comprising the following steps:
step S1, constructing a millimeter wave communication channel model between the user side and the base station side;
step S2, designing codebooks of a user terminal and a base station terminal, constructing a model of a final received signal according to the designed codebooks, and carrying out mathematical modeling on a beam training process according to the model;
step S3, defining the representation of state, action and reward in the beam training;
in step S3, defining the representation of the state in the beam training specifically includes:
let the channel matrix at time t be H_t and the received signal matrix corresponding to it be Y_t; define the matrix Z_t as the modulus of Y_t, and define the received signal strength matrices Z_t at successive time instants as the state S_t, specifically as follows:
S_t(i) = Z_{t+i−C}, i = 1, 2, …, C (1)
in formula (1), S_t is a three-dimensional matrix whose third dimension has size C, where C represents the number of successive time instants and Z_{t+i−C} represents the received signal strength matrix at time t+i−C;
defining the representation of the action in the beam training, specifically comprising:
defining the position corresponding to the largest element of said matrix Z_t as i_t = (i_t^F, i_t^W), wherein i_t^F and i_t^W respectively represent the indexes, in the codebooks F and W, of the optimal transmitting and receiving beams at time t;
the optimal beam combination at the current time is the pair of transmit and receive codewords indexed by i_t in the codebooks F and W;
the action at time t is defined as:
At=(d,o,r) (2)
in formula (2), d, o, and r respectively represent the direction, offset, and coverage of the beam search at time t+1 relative to the optimal beam position at time t; d ∈ D = {0, 1, 2, 3, 4}, so there are 5 possible directions: 0 represents no movement, while 1, 2, 3, 4 represent moving up, down, left, and right with the position i_t as the base point; o ∈ O = {0, 1, 2, …, M−1}, giving M optional offsets, each defined as the distance between the center position of the beam search at time t+1 and the position i_t; r ∈ R = {1, 2, …, N}, giving N optional radii, each defined as the coverage radius with the center position of the beam search at time t+1 as the base point;
defining the representation of the reward in the beam training, and specifically comprising:
R_t = ((Ts − |B_{t+1}|·t_s − t_p)/Ts)·R̃_{t+1} (3)

in formula (3), B_{t+1} denotes the set of beam combinations for testing at the next time obtained by the agent performing action A_t at time t, t_s represents the time for testing one beam combination, t_p denotes the precoding phase within one time step of the beam training, Ts denotes one time step of the beam training, and R̃_{t+1} represents the data achievable rate corresponding to the optimal beam combination at time t+1;
step S4, regarding the state defined in the step S3 as a multi-channel image, inputting the multi-channel image into the constructed convolutional neural network, and obtaining values of all actions corresponding to the state;
the method comprises the steps of updating an input state by using a convolutional neural network, updating parameters of the convolutional neural network by taking a predicted value of Q learning as a target, predicting by using a trained network, selecting an action with the maximum Q value, and testing by using a beam combination cluster corresponding to the action to reduce training overhead.
2. The method for training the millimeter wave communication beam based on deep reinforcement learning according to claim 1, wherein the step S1 specifically includes:
for a single-user millimeter wave MIMO communication system in which the user end has N_r antennas and the base station end has N_t antennas, both arranged as uniform linear arrays, the millimeter wave communication channel model is modeled as follows:
in formula (4), L, α_l, θ_l, and ψ_l respectively represent the number of paths, the channel gain of the lth path, the arrival angle of the channel, and the departure angle of the channel; Θ_l and Ψ_l, the arrival and departure angles of the spatial domain, both obey a uniform distribution in [0, π]; d_t and d_r respectively represent the spacing between array antennas at the base station end and the user end; λ is the wavelength of the millimeter wave signal; and u(·) represents the channel steering vector; the variation of the channel steering angle in adjacent time intervals follows a Gaussian distribution, expressed as:
3. The method for training millimeter wave communication beams based on deep reinforcement learning according to claim 1, wherein in the step S2, the received signal obtained with the codebooks of the user end and the base station end is expressed as:

y_t = √P·w^H·H_t·f·x + w^H·n (8)
in formula (8), P, w, f, and n respectively represent the transmit power of the base station end, the receive codeword of the user end, the transmit codeword of the base station end, and the channel noise vector, with ||w||_2 = ||f||_2 = 1 and |x|² = 1;
thus, the expression of the received signal matrix is:

Y_t = √P·W^H·H_t·F·x + W^H·N (9)
in formula (9), W and F respectively denote the DFT codebooks at the receiving end and the transmitting end, H_t represents the channel matrix, x and P respectively represent the transmitted signal and its power, N represents the channel noise matrix, and the element Y(m, n) in the mth row and nth column of the matrix represents the signal obtained when the transmitting end uses the nth (n = 1, 2, …, N_t) codeword in the codebook F and the receiving end uses the mth (m = 1, 2, …, N_r) codeword in the codebook W; the beam training process is represented as the following optimization problem:

max_{f∈F, w∈W} |w^H·H_t·f|²
4. The deep reinforcement learning-based millimeter wave communication beam training method according to claim 3, wherein the convolutional neural network specifically comprises two convolutional layers, two pooling layers, a flatten layer, a fully-connected layer, and an output layer; the states are normalized before being input to the neural network; specifically, the two-dimensional image represented by each channel is normalized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110623890.1A CN113411110B (en) | 2021-06-04 | 2021-06-04 | Millimeter wave communication beam training method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113411110A CN113411110A (en) | 2021-09-17 |
CN113411110B true CN113411110B (en) | 2022-07-22 |
Family
ID=77676276
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113411110B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113904704B (en) * | 2021-09-27 | 2023-04-07 | 西安邮电大学 | Beam prediction method based on multi-agent deep reinforcement learning |
WO2023071760A1 (en) * | 2021-10-29 | 2023-05-04 | 中兴通讯股份有限公司 | Beam domain division method and apparatus, storage medium, and electronic device |
CN114021987A (en) * | 2021-11-08 | 2022-02-08 | 深圳供电局有限公司 | Microgrid energy scheduling strategy determination method, device, equipment and storage medium |
CN114567525B (en) * | 2022-01-14 | 2023-07-28 | 北京邮电大学 | Channel estimation method and device |
CN114499605B (en) * | 2022-02-25 | 2023-07-04 | 北京京东方传感技术有限公司 | Signal transmission method, signal transmission device, electronic equipment and storage medium |
CN117035018A (en) * | 2022-04-29 | 2023-11-10 | 中兴通讯股份有限公司 | Beam measurement parameter feedback method and receiving method and device |
CN114844538B (en) * | 2022-04-29 | 2023-05-05 | 东南大学 | Millimeter wave MIMO user increment cooperative beam selection method based on wide learning |
CN115065981B (en) * | 2022-08-16 | 2022-11-01 | 新华三技术有限公司 | Beam tracking method and device |
CN115426007B (en) * | 2022-08-22 | 2023-09-01 | 电子科技大学 | Intelligent wave beam alignment method based on deep convolutional neural network |
CN115580879A (en) * | 2022-09-07 | 2023-01-06 | 重庆邮电大学 | Millimeter wave network beam management method based on federal reinforcement learning |
CN117692014B (en) * | 2024-02-01 | 2024-04-23 | 北京雷格讯电子股份有限公司 | Microwave millimeter wave communication method and communication system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110417444B (en) * | 2019-07-08 | 2020-08-04 | 东南大学 | Millimeter wave channel beam training method based on deep learning |
CN110401476B (en) * | 2019-08-05 | 2022-07-08 | 东南大学 | Codebook-based millimeter wave communication multi-user parallel beam training method |
CN110971279B (en) * | 2019-12-30 | 2021-09-21 | 东南大学 | Intelligent beam training method and precoding system in millimeter wave communication system |
CN112073106B (en) * | 2020-08-14 | 2022-04-22 | 清华大学 | Millimeter wave beam prediction method and device, electronic device and readable storage medium |
2021-06-04: CN application CN202110623890.1A / patent CN113411110B (en), status: Active
Also Published As
Publication number | Publication date |
---|---|
CN113411110A (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113411110B (en) | Millimeter wave communication beam training method based on deep reinforcement learning | |
US11626909B2 (en) | Method and device for enhancing power of signal in wireless communication system using IRS | |
KR102154481B1 (en) | Apparatus for beamforming massive mimo system using deep learning | |
CN110113088B (en) | Intelligent estimation method for wave arrival angle of separated digital-analog hybrid antenna system | |
CN113438002B (en) | LSTM-based analog beam switching method, device, equipment and medium | |
Shen et al. | Design and implementation for deep learning based adjustable beamforming training for millimeter wave communication systems | |
CN113162666B (en) | Intelligent steel-oriented large-scale MIMO hybrid precoding method and device | |
CN113193893B (en) | Millimeter wave large-scale MIMO intelligent hybrid beam forming design method | |
CN112448742A (en) | Hybrid precoding method based on convolutional neural network under non-uniform quantization | |
Zhang et al. | Intelligent beam training for millimeter-wave communications via deep reinforcement learning | |
Nguyen et al. | Deep unfolding hybrid beamforming designs for THz massive MIMO systems | |
Chafaa et al. | Federated channel-beam mapping: from sub-6ghz to mmwave | |
Elbir et al. | Cognitive learning-aided multi-antenna communications | |
Abdallah et al. | Multi-agent deep reinforcement learning for beam codebook design in RIS-aided systems | |
CN113872655A (en) | Multicast beam forming rapid calculation method | |
CN113169777A (en) | Beam alignment | |
CN114844538B (en) | Millimeter wave MIMO user increment cooperative beam selection method based on wide learning | |
CN114866126B (en) | Low-overhead channel estimation method for intelligent reflection surface auxiliary millimeter wave system | |
CN112242860B (en) | Beam forming method and device for self-adaptive antenna grouping and large-scale MIMO system | |
CN115133969A (en) | Performance improving method of millimeter wave large-scale MIMO-NOMA system | |
CN114598574A (en) | Millimeter wave channel estimation method based on deep learning | |
CN115604824A (en) | User scheduling method and system | |
CN115102590B (en) | Millimeter wave beam space hybrid beam forming method and device | |
Wang et al. | New Environment Adaptation with Few Shots for OFDM Receiver and mmWave Beamforming | |
CN113904704B (en) | Beam prediction method based on multi-agent deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||