CN113472472B - Multi-cell collaborative beam forming method based on distributed reinforcement learning - Google Patents

Multi-cell collaborative beam forming method based on distributed reinforcement learning

Info

Publication number
CN113472472B
CN113472472B (granted from application CN202110768826.2A; published as CN113472472A)
Authority
CN
China
Prior art keywords
base station
channel
reinforcement learning
dqn
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110768826.2A
Other languages
Chinese (zh)
Other versions
CN113472472A (en)
Inventor
高贞贞
廖学文
吴丹青
张金
罗伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Guotian Electronic Technology Co ltd
Original Assignee
Hunan Guotian Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Guotian Electronic Technology Co ltd filed Critical Hunan Guotian Electronic Technology Co ltd
Priority to CN202110768826.2A priority Critical patent/CN113472472B/en
Publication of CN113472472A publication Critical patent/CN113472472A/en
Application granted granted Critical
Publication of CN113472472B publication Critical patent/CN113472472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04JMULTIPLEX COMMUNICATION
    • H04J11/00Orthogonal multiplex systems, e.g. using WALSH codes
    • H04J11/0023Interference mitigation or co-ordination
    • H04J11/005Interference mitigation or co-ordination of intercell interference
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/06Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station
    • H04B7/0613Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission
    • H04B7/0615Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal
    • H04B7/0617Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the transmitting station using simultaneous transmission of weighted versions of same signal for beam forming
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/08Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the receiving station
    • H04B7/0837Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas at the receiving station using pre-detection combining
    • H04B7/0842Weighted combining
    • H04B7/086Weighted combining using weights depending on external parameters, e.g. direction of arrival [DOA], predetermined weights or beamforming
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/24Cell structures
    • H04W16/28Cell structures using beam steering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-cell collaborative beamforming method based on distributed reinforcement learning, which comprises the following steps: for each base station j, establishing a training DQN with weights θ_j, a target DQN with weights θ'_j, and an empty experience pool M_j; initializing the training DQN with random weights; repeating the following steps every M time slots: the base stations exchange channel state information of all users; each base station generates multiple groups of global channel samples for the future M time slots; each base station takes actions randomly and stores the corresponding experiences in its experience pool M_j; each base station performs network training. With extremely low overhead, the invention outperforms the compared greedy and random schemes, and its performance approaches that of the optimal scheme requiring global information.

Description

Multi-cell collaborative beam forming method based on distributed reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a multi-cell collaborative beam forming method based on distributed reinforcement learning.
Background
Conventional mobile communication systems typically employ a cellular architecture, which improves throughput and saves power. However, because neighboring cells share portions of the spectrum, severe inter-cell interference may arise and degrade system capacity. Multi-cell cooperative beamforming is regarded as one of the key technologies for interference management, because jointly controlling the transmit power and beamforming of neighboring base stations can mitigate inter-cell interference and maximize system capacity.
In general, the system capacity of a cellular communication system is measured by the sum of the achievable rates of all users, i.e., the sum rate. Maximizing the sum rate is an NP-hard, non-convex problem, so an optimal solution is difficult to obtain. Many suboptimal methods based on optimization techniques have been developed for this problem, such as fractional programming, weighted minimum mean square error, and branch-and-bound algorithms. These algorithms can approach optimal performance; however, they all require global channel state information and multiple iterations to compute a solution, so their overhead and computational complexity in practice are intolerable. Distributed reinforcement learning has proven to be an effective emerging technique for various problems in communications and networking, such as the Internet of Things, heterogeneous networks, and unmanned aerial vehicle networks. In these networks, agents (e.g., base stations) make their own decisions based on local information to optimize network performance. Research on reinforcement-learning-based cooperative beamforming is still at an early stage. Ying Chang et al. of the University of Electronic Science and Technology of China, for a multi-cell multiple-input single-output system, let each base station train its own deep Q-network based on distributed deep reinforcement learning, selecting a suitable beam vector and transmit power from local information and limited information exchanged between neighboring base stations. Jiang et al. of the Tokyo Institute of Technology, also for a multi-cell multiple-input single-output system, obtain the transmit power and beam vectors by feeding global channel information into distributed reinforcement learning. However, the latter approach requires every base station to know the global channel state information, which greatly increases the execution overhead.
Disclosure of Invention
In view of this, the present invention proposes a multi-cell multiple-input single-output collaborative beamforming method based on distributed reinforcement learning, in which each base station trains its own deep Q-network (DQN) that takes as input the channel information between the base station and its served user, is trained with the sum rate as the reward, and outputs the optimal transmit power and codeword index. Meanwhile, the base stations exchange channel state information once every fixed number of time slots; each base station assembles global channel state information from the channel information received from the other base stations and generates future channel samples to retrain the network, thereby improving the generalization of the network over the following fixed number of time slots and further improving network performance.
In order to achieve the above purpose, the present invention provides a low-overhead, high-performance collaborative beamforming method based on training with global information, executing with local information, and training on predicted channels. Consider a multi-cell multiple-input single-output scenario with K cells, where each base station is equipped with N_t antennas and each user with a single antenna. Each base station serves only one user on the same time-frequency resource, and each user receives a useful signal from its serving base station and interference signals from the other base stations.
Specifically, the multi-cell collaborative beamforming method based on distributed reinforcement learning disclosed by the invention comprises the following steps of:
S1: for each base station j, j∈[1,K], establish a training DQN with weights θ_j, a target DQN with weights θ'_j, and an empty experience pool M_j;
S2: initialize the training DQNs with random weights θ, letting θ_j = θ, j∈[1,K];
S3: repeat steps S4 to S7 every M time slots;
S4: the base stations exchange channel state information of all users;
S5: each base station generates multiple groups of global channel samples for the future M time slots;
S6: each base station takes actions randomly and stores the corresponding experiences ⟨s_j, a_j, r_j, s'_j⟩ in its experience pool M_j;
S7: each base station performs network training.
Further, the step of network training includes:
S70: base station j observes its state s_j(t) at time slot t, j∈[1,K];
S71: at time slot t, base station j selects an action a_j according to the ε-greedy policy based on the state s_j(t), j∈[1,K];
S72: the global reward r(t) is calculated according to s_j and a_j, j∈[1,K];
S73: base station j observes its new state s'_j(t) at time slot t+1, j∈[1,K];
S74: base station j stores the experience ⟨s_j(t), a_j(t), r_j(t), s'_j(t)⟩ in its experience pool M_j, j∈[1,K];
S75: base station j samples a mini-batch from its experience pool M_j;
S76: base station j updates its training DQN weights θ_j using backpropagation, j∈[1,K];
S77: base station j updates the target DQN weights θ'_j once every T_step time slots, j∈[1,K];
S78: the above is repeated until convergence or the maximum number of training iterations is reached; a minimal agent sketch covering these steps is given below.
Further, the base-station-to-user channel in each time slot is modeled as a Rayleigh channel, and the channels of adjacent time slots are considered correlated, which can be expressed as:
h_{j,k}(t) = ρ·h_{j,k}(t-1) + √(1-ρ²)·e_{j,k}(t)
where h_{j,k}(t) ∈ ℂ^(N_t×1) denotes the channel from the j-th base station to the k-th user in time slot t, each element of h_{j,k}(0) obeys an independent complex Gaussian distribution with mean 0 and variance 1, e_{j,k}(t) denotes channel-independent white Gaussian noise whose elements also obey an independent complex Gaussian distribution with mean 0 and variance 1, and ρ denotes the correlation coefficient of the Rayleigh fading vectors between adjacent time slots.
Further, after the base stations exchange, in time slot t, their own channel information to all users [h_{j,1}(t), h_{j,2}(t), …, h_{j,K}(t)], each base station forms the global channel state information matrix H(t) = [h_{1,1}(t), h_{1,2}(t), …, h_{1,K}(t), …, h_{K,K}(t)] and uses the correlation of adjacent channels to generate N groups of global channels for the future M time slots, n∈[1,N], where N is the number of generated global channel groups.
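As a concrete illustration of this channel-sample generation, the following Python sketch produces N groups of correlated Rayleigh channels for the future M time slots from the exchanged global channel matrix H(t); the first-order recursion mirrors the channel model above, and the array shapes are assumptions made only for this example.

import numpy as np

def complex_gaussian(shape, rng):
    # elements with mean 0 and variance 1, as in the channel model
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def generate_future_channels(H_t, rho, M, N, rng=None):
    # H_t: current global channel array of shape (K, K, Nt) with H_t[j, k] = h_{j,k}(t)
    # returns an array of shape (N, M, K, K, Nt): N groups of channels for the next M slots
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty((N, M) + H_t.shape, dtype=complex)
    for n in range(N):
        h = H_t.copy()
        for m in range(M):
            e = complex_gaussian(H_t.shape, rng)
            h = rho * h + np.sqrt(1.0 - rho ** 2) * e   # correlated Rayleigh update
            samples[n, m] = h
    return samples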
Further, global information is needed to guide the training, while only local information is needed during execution.
Further, the state of each DQN is based on the channel state information h_{j,j}(t) from base station j to its own served user j. Using an I/Q transformation, h_{j,j}(t) is split into in-phase and quadrature components, which are stacked into one real vector, so that the state of the DQN of base station j is expressed as:
s_j(t) = [Re(h_{j,j}(t))^T, Im(h_{j,j}(t))^T]^T
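A minimal sketch of this I/Q split into the DQN state vector (the shape of the channel vector is assumed for illustration):

import numpy as np

def build_state(h_jj):
    # h_jj: complex channel vector from base station j to its served user j, shape (Nt,)
    return np.concatenate([h_jj.real, h_jj.imag])   # s_j(t), a real vector of length 2*Nt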
further, the beamforming vector w transmitted by the base station j And (t) is a continuous complex vector:
Figure BDA0003151759030000043
wherein,,
Figure BDA0003151759030000044
normalized beamforming vector, p, taken for base station j j (t) is the transmit power of base station j;
discretizing the normalized beamforming vector using a selection of codewords from a codebook, defining an available transmit power set for each base station for transmit power
Figure BDA0003151759030000045
Figure BDA0003151759030000046
Wherein p is max For maximum transmit power of base station, Q pow Is an available discrete power level;
the actions of the DQN network of base station j are:
a j ={(P j ,c j ),p j ∈P,c j ∈C}
where p and c are the transmit power taken and the codeword index respectively,
Figure BDA0003151759030000047
for the codeword index set, Q code Is the number of codewords in the codebook.
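For illustration, the following sketch builds the discrete action set as the Cartesian product of Q_pow power levels and Q_code codewords; the evenly spaced power levels and the DFT codebook are assumptions, since the invention does not fix a particular discretization or codebook here.

import numpy as np

def build_action_set(p_max, Q_pow, Nt, Q_code):
    # Q_pow evenly spaced transmit-power levels up to p_max (an assumed discretization)
    powers = p_max * np.arange(1, Q_pow + 1) / Q_pow
    # a DFT codebook of Q_code normalized beamforming codewords (an assumed choice)
    n = np.arange(Nt).reshape(-1, 1)
    q = np.arange(Q_code).reshape(1, -1)
    codebook = np.exp(2j * np.pi * n * q / Q_code) / np.sqrt(Nt)   # shape (Nt, Q_code)
    actions = [(p, c) for p in powers for c in range(Q_code)]      # (p_j, c_j) pairs
    return actions, codebook

def beamforming_vector(action, codebook):
    p, c = action
    return np.sqrt(p) * codebook[:, c]   # w_j(t) = sqrt(p_j(t)) * normalized codeword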
Further, the received signal of user j at time slot t is expressed as:
y_j(t) = h_{j,j}^H(t)·w_j(t)·x_j(t) + Σ_{k≠j} h_{k,j}^H(t)·w_k(t)·x_k(t) + n_j(t)
where n_j(t) is the additive white Gaussian noise at user j and x_j(t) is the symbol transmitted by base station j;
the rate of user j is expressed as:
C_j(w_j(t)) = log(1 + SINR_j(t))
where SINR_j(t), the signal-to-interference-plus-noise ratio corresponding to base station j at time slot t, is expressed as:
SINR_j(t) = |h_{j,j}^H(t)·w_j(t)|² / ( Σ_{k≠j} |h_{k,j}^H(t)·w_k(t)|² + σ² )
where σ² is the noise variance;
the sum rate is expressed as:
C(t) = Σ_{j=1}^{K} C_j(w_j(t))
and the reward of the DQN of base station j is:
r_j(t) = C(t).
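A sketch of this reward computation from the global channel and the chosen beamforming vectors (array shapes as in the earlier sketches; the base-2 logarithm is an assumption, since the base of the log is not specified):

import numpy as np

def sum_rate(H, W, noise_var):
    # H[j, k]: channel h_{j,k}(t) from base station j to user k, shape (K, K, Nt)
    # W[j]: beamforming vector w_j(t) of base station j, shape (K, Nt)
    # returns the sum rate C(t), used as the common reward r_j(t) of every DQN
    K = H.shape[0]
    rate = 0.0
    for j in range(K):
        signal = abs(np.vdot(H[j, j], W[j])) ** 2              # |h_{j,j}^H w_j|^2
        interference = sum(abs(np.vdot(H[k, j], W[k])) ** 2    # |h_{k,j}^H w_k|^2
                           for k in range(K) if k != j)
        rate += np.log2(1.0 + signal / (interference + noise_var))
    return rate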
further, the M is an integer from 0 to 300.
The beneficial effects of the invention are as follows:
Every M time slots, the base stations exchange the channel state information from themselves to all users, generate multiple groups of channels for the future M time slots from the obtained global channel state information, and use the generated future channel samples to train the distributed reinforcement learning network, thereby improving network performance. During execution, the distributed reinforcement learning takes as input only the channel state information from each base station to its served user, so no information exchange between base stations is needed and the execution overhead is greatly reduced.
Simulations show that, with extremely low overhead, the method outperforms the compared greedy and random schemes and approaches the optimal scheme that requires global information.
Drawings
Fig. 1 is a schematic diagram of a three-cell MISO cooperative beamforming system model;
FIG. 2 is a frame diagram of the present invention;
FIG. 3 is a flow chart of the present invention;
fig. 4 is a plot of sum rate versus time slot number M for the present invention and other schemes.
Detailed Description
The invention is further described below with reference to the accompanying drawings, without limiting the invention in any way, and any alterations or substitutions based on the teachings of the invention are intended to fall within the scope of the invention.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, consider a multi-cell multiple-input single-output scenario with K cells, each base station equipped with N t A root antenna, each user is equipped with a single antenna. Each base station only serves one user on the same time-frequency resource, eachThe user receives both the useful signal from the serving base station and the interfering signal from the other base stations. The whole cooperative beamforming process is described as shown in fig. 2, and is described as follows:
first, every M time slots, the base stations exchange channel state information from the base station to all users, so that each base station generates global channel state information of the future M time slots according to the received global channel state information to train. Each slot base-to-user channel is modeled as a rayleigh channel, and the channels between adjacent slots are considered correlated and can be expressed as:
Figure BDA0003151759030000061
wherein,,
Figure BDA0003151759030000062
representing the channel from the jth base station to the kth user in the t time slot, h j,k (0) Each element in (a) obeys independent complex Gaussian distribution with mean value of 0 and variance of 1, e j,k (t) represents channel independent white gaussian noise, wherein each element is also subject to an independent complex gaussian distribution with a mean of 0 and a variance of 1, and ρ represents the correlation coefficient representing the rayleigh fading vector between adjacent time slots. When the interaction time slot t between base stations reaches the channel information h of all users 1,1 (t),h 1,2 (t),…,h 1,K (t)]Thereafter, each base station may form a global channel state information matrix, H (t) = [ H ] 1,1 (t),h 1,2 (t),…,h 1,K (t),…,h K,K (t)]. It can be seen from equation 1 that knowing H (t) and ρ, the correlation of adjacent channels can be used to generate global channels +.>
Figure BDA0003151759030000063
n∈[1,N]N is the number of global channel groups generated.
Then, to improve network performance, the invention uses the generated global channel state information to train the DQN of each base station. The invention defines the three elements of the distributed DQN, namely state, action and reward. The state of each DQN is the channel state information h_{j,j}(t) from base station j to its own served user j. Because a neural network cannot process complex numbers, the invention applies an I/Q transformation that splits the complex vector into in-phase (real) and quadrature (imaginary) components and stacks the two into one real vector. Thus, the state of the DQN of base station j is:
s_j(t) = [Re(h_{j,j}(t))^T, Im(h_{j,j}(t))^T]^T   (2)
the action of the DQN network is typically a set of discrete real values, the transmitted beamforming vector w j And (t) is a continuous complex vector. Thus, the present invention discretizes the continuous complex vector. The beamforming vector is composed of two parts, as shown in the following equation:
Figure BDA0003151759030000072
Figure BDA0003151759030000073
normalized beamforming vector, p, taken for base station j j And (t) is the transmission power of the base station j. The present invention discretizes the normalized beamforming vector by selecting codewords from the codebook, and defines the available transmit power set for each base station for transmit power +.>
Figure BDA0003151759030000074
Figure BDA0003151759030000075
Wherein p is max For maximum transmit power of base station, Q pow Is a discrete power level available. The actions of the DQN network of base station j are therefore:
a j ={(p j ,c j ),p j ∈P,c j ∈C} (4)
where p and c are the transmit power taken and the codeword index respectively,
Figure BDA0003151759030000076
for the codeword index set, Q code Is the number of codewords in the codebook.
The reward of the DQN is the sum rate. Since the sum rate is a metric computed from global information, using it as the reward realizes training with global information while executing with local information, which improves performance and makes the network converge faster. In a multi-cell multiple-input single-output scenario, each user shares the same frequency band with users in other cells, so inter-cell interference exists. The received signal of user j at time slot t can be expressed as:
y_j(t) = h_{j,j}^H(t)·w_j(t)·x_j(t) + Σ_{k≠j} h_{k,j}^H(t)·w_k(t)·x_k(t) + n_j(t)   (5)
where n_j(t) is the additive white Gaussian noise at user j and x_j(t) is the symbol transmitted by base station j. The rate of user j can be expressed as:
C_j(w_j(t)) = log(1 + SINR_j(t))   (6)
where SINR_j(t), the signal-to-interference-plus-noise ratio corresponding to base station j at time slot t, is expressed as:
SINR_j(t) = |h_{j,j}^H(t)·w_j(t)|² / ( Σ_{k≠j} |h_{k,j}^H(t)·w_k(t)|² + σ² )   (7)
The sum rate can be expressed as:
C(t) = Σ_{j=1}^{K} C_j(w_j(t))   (8)
The reward of the DQN of base station j is therefore:
r_j(t) = C(t)   (9)
as shown in fig. 3 and algorithm 1, the pseudo code of the distributed reinforcement learning method of the present invention is as follows:
algorithm 1. Distributed reinforcement learning method pseudo code
Figure BDA0003151759030000083
Figure BDA0003151759030000091
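Because the published pseudocode is available only as an image, the following Python sketch reconstructs the overall loop from steps S1 to S7 and S70 to S78. It reuses the hypothetical BaseStationAgent, generate_future_channels, build_state, build_action_set, beamforming_vector and sum_rate helpers sketched earlier, and the number of periods, ε value and bookkeeping between periods are assumptions made for illustration.

import numpy as np

def train_distributed_beamforming(K, Nt, M, N, rho, p_max, Q_pow, Q_code,
                                  noise_var, num_periods, epsilon=0.1):
    actions, codebook = build_action_set(p_max, Q_pow, Nt, Q_code)
    agents = [BaseStationAgent(state_dim=2 * Nt, num_actions=len(actions))
              for _ in range(K)]                                   # S1, S2
    rng = np.random.default_rng()
    H = (rng.standard_normal((K, K, Nt)) + 1j * rng.standard_normal((K, K, Nt))) / np.sqrt(2)
    for _ in range(num_periods):                                   # S3: every M slots
        # S4: base stations exchange CSI, so each one can assemble the global H(t)
        # S5: each base station generates N groups of future M-slot channel samples
        future = generate_future_channels(H, rho, M, N)
        for n in range(N):
            for m in range(M):
                H_sample = future[n, m]
                states = [build_state(H_sample[j, j]) for j in range(K)]        # S70
                acts = [agents[j].act(states[j], epsilon) for j in range(K)]    # S71, S6
                W = np.array([beamforming_vector(actions[a], codebook) for a in acts])
                reward = sum_rate(H_sample, W, noise_var)                       # S72
                m_next = m + 1 if m + 1 < M else m
                next_states = [build_state(future[n, m_next][j, j]) for j in range(K)]  # S73
                for j in range(K):
                    agents[j].store(states[j], acts[j], reward, next_states[j]) # S74
                    agents[j].learn()                                           # S75-S77
        H = future[0, -1]   # advance to the next M-slot period (an assumed bookkeeping choice)
    return agents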
In order to verify the performance of the collaborative beamforming scheme based on distributed reinforcement learning, the invention carries out the following simulation:
the channel parameters are assumed to follow a standard unit complex gaussian random distribution. The source node transmitting power is P, the noise variance at the destination node is
Figure BDA0003151759030000092
Assuming that the base station and the user perform accurate channel estimation, fig. 4 shows a curve of the sum rate of the greedy scheme and the random scheme according to the time slot number M, in the experiment, there are 4 base stations, 1 antenna for the user, 4 transmitting antennas are provided for each base station, the available discrete power level is 4, and the code number is 4. The greedy scheme is that each base station maximizes the throughput of its serving user, and the random scheme is that each base station randomly selects codeword and transmit power, and the optimal performance obtained by traversing is used as an upper bound. As can be seen from fig. 3, when m=1, the performance of the present invention can reach 95% of the optimal performance, and as the number of slots M increases, the performance of the present invention decreases. The number M of time slots is [0,300 ]]Throughout this range, the present invention is superior to greedy and random schemes.
The above embodiment is one implementation of the present invention, but the implementation of the present invention is not limited to this embodiment; any other changes, modifications, substitutions, combinations and simplifications made without departing from the spirit and principle of the present invention shall be regarded as equivalent substitutions and fall within the protection scope of the present invention.

Claims (7)

1. A multi-cell collaborative beamforming method based on distributed reinforcement learning, characterized by comprising the following steps:
S1: for each base station j, j∈[1,K], establishing a training DQN with weights θ_j, a target DQN with weights θ'_j, and an empty experience pool M_j;
S2: initializing the training DQNs with random weights θ, letting θ_j = θ, j∈[1,K];
S3: repeating steps S4 to S7 every M time slots;
S4: the base stations exchanging channel state information of all users;
S5: each base station generating multiple groups of global channel samples for the future M time slots;
S6: each base station taking actions randomly and storing the corresponding experiences ⟨s_j, a_j, r_j, s'_j⟩ in its experience pool M_j;
S7: each base station performing network training;
wherein the step of network training comprises:
S70: base station j observing its state s_j(t) at time slot t, j∈[1,K];
S71: at time slot t, base station j selecting an action a_j according to an ε-greedy policy based on the state s_j(t), j∈[1,K];
S72: the global reward r_j(t) being calculated according to s_j and a_j, j∈[1,K];
S73: base station j observing its new state s'_j(t) at time slot t+1, j∈[1,K];
S74: base station j storing the experience ⟨s_j(t), a_j(t), r_j(t), s'_j(t)⟩ in its experience pool M_j, j∈[1,K];
S75: base station j sampling a mini-batch from its experience pool M_j;
S76: base station j updating its training DQN weights θ_j using backpropagation, j∈[1,K];
S77: base station j updating the target DQN weights θ'_j once every T_step time slots, j∈[1,K];
S78: repeating until convergence or the maximum number of training iterations is reached;
wherein the received signal of user j at time slot t is expressed as:
y_j(t) = h_{j,j}^H(t)·w_j(t)·x_j(t) + Σ_{k≠j} h_{k,j}^H(t)·w_k(t)·x_k(t) + n_j(t)
where n_j(t) is the additive white Gaussian noise at user j, x_j(t) is the symbol transmitted by base station j, h_{j,j}(t) is the channel state information from base station j to its own served user j, and w_j(t) is the beamforming vector transmitted by base station j;
the rate of user j is expressed as:
C_j(w_j(t)) = log(1 + SINR_j(t))
where SINR_j(t), the signal-to-interference-plus-noise ratio corresponding to base station j at time slot t, is expressed as:
SINR_j(t) = |h_{j,j}^H(t)·w_j(t)|² / ( Σ_{k≠j} |h_{k,j}^H(t)·w_k(t)|² + σ² )
where σ² is the noise variance;
the sum rate is expressed as:
C(t) = Σ_{j=1}^{K} C_j(w_j(t))
and the reward of the DQN of base station j is:
r_j(t) = C(t).
2. The multi-cell collaborative beamforming method based on distributed reinforcement learning according to claim 1, wherein the base-station-to-user channel in each time slot is modeled as a Rayleigh channel, and the channels of adjacent time slots are considered correlated, expressed as:
h_{j,k}(t) = ρ·h_{j,k}(t-1) + √(1-ρ²)·e_{j,k}(t)
where h_{j,k}(t) denotes the channel from the j-th base station to the k-th user in time slot t, each element of h_{j,k}(0) obeys an independent complex Gaussian distribution with mean 0 and variance 1, e_{j,k}(t) denotes channel-independent white Gaussian noise whose elements also obey an independent complex Gaussian distribution with mean 0 and variance 1, and ρ denotes the correlation coefficient of the Rayleigh fading vectors between adjacent time slots.
3. The multi-cell collaborative beamforming method based on distributed reinforcement learning according to claim 2, wherein, after the base stations exchange, in time slot t, their own channel information to all users [h_{j,1}(t), h_{j,2}(t), …, h_{j,K}(t)], each base station forms a global channel state information matrix H(t) = [h_{1,1}(t), h_{1,2}(t), …, h_{1,K}(t), …, h_{K,K}(t)] and uses the correlation of adjacent channels to generate N groups of global channels for the future M time slots, where N is the number of generated global channel groups.
4. The multi-cell collaborative beamforming method based on distributed reinforcement learning according to claim 1, wherein global information is needed to guide the training, while only local information is needed during execution.
5. The multi-cell collaborative beamforming method based on distributed reinforcement learning according to claim 2, wherein the state of each DQN is based on the channel state information h_{j,j}(t) from base station j to its own served user j; an I/Q transformation splits h_{j,j}(t) into in-phase and quadrature components, which are stacked into one real vector, and the state of the DQN of base station j is expressed as:
s_j(t) = [Re(h_{j,j}(t))^T, Im(h_{j,j}(t))^T]^T
6. The multi-cell collaborative beamforming method based on distributed reinforcement learning according to claim 2, wherein the beamforming vector w_j(t) transmitted by base station j is a continuous complex vector:
w_j(t) = √(p_j(t))·w̄_j(t)
where w̄_j(t) is the normalized beamforming vector taken by base station j and p_j(t) is the transmit power of base station j;
the normalized beamforming vector is discretized by selecting codewords from a codebook, and for the transmit power an available transmit power set P containing Q_pow discrete levels is defined for each base station, where p_max is the maximum transmit power of the base station and Q_pow is the number of available discrete power levels;
the action of the DQN of base station j is:
a_j = {(p_j, c_j), p_j ∈ P, c_j ∈ C}
where p_j and c_j are the transmit power and codeword index taken, respectively, C = {1, 2, …, Q_code} is the codeword index set, and Q_code is the number of codewords in the codebook.
7. The distributed reinforcement learning based multi-cell collaborative beamforming method according to claim 1 wherein M is an integer from 0 to 300.
CN202110768826.2A 2021-07-07 2021-07-07 Multi-cell collaborative beam forming method based on distributed reinforcement learning Active CN113472472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110768826.2A CN113472472B (en) 2021-07-07 2021-07-07 Multi-cell collaborative beam forming method based on distributed reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110768826.2A CN113472472B (en) 2021-07-07 2021-07-07 Multi-cell collaborative beam forming method based on distributed reinforcement learning

Publications (2)

Publication Number Publication Date
CN113472472A CN113472472A (en) 2021-10-01
CN113472472B true CN113472472B (en) 2023-06-27

Family

ID=77879037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110768826.2A Active CN113472472B (en) 2021-07-07 2021-07-07 Multi-cell collaborative beam forming method based on distributed reinforcement learning

Country Status (1)

Country Link
CN (1) CN113472472B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109039412A (en) * 2018-07-23 2018-12-18 西安交通大学 A kind of safe transmission method of physical layer based on random wave bundle figuration
CN110365387A (en) * 2019-07-16 2019-10-22 电子科技大学 A kind of beam selection method of cellular communication system
CN111181619A (en) * 2020-01-03 2020-05-19 东南大学 Millimeter wave hybrid beam forming design method based on deep reinforcement learning
CN111246497A (en) * 2020-04-10 2020-06-05 卓望信息技术(北京)有限公司 Antenna adjustment method based on reinforcement learning
CN111901862A (en) * 2020-07-07 2020-11-06 西安交通大学 User clustering and power distribution method, device and medium based on deep Q network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10694526B2 (en) * 2016-09-30 2020-06-23 Drexel University Adaptive pursuit learning method to mitigate small-cell interference through directionality

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109039412A (en) * 2018-07-23 2018-12-18 西安交通大学 A kind of safe transmission method of physical layer based on random wave bundle figuration
CN110365387A (en) * 2019-07-16 2019-10-22 电子科技大学 A kind of beam selection method of cellular communication system
CN111181619A (en) * 2020-01-03 2020-05-19 东南大学 Millimeter wave hybrid beam forming design method based on deep reinforcement learning
CN111246497A (en) * 2020-04-10 2020-06-05 卓望信息技术(北京)有限公司 Antenna adjustment method based on reinforcement learning
CN111901862A (en) * 2020-07-07 2020-11-06 西安交通大学 User clustering and power distribution method, device and medium based on deep Q network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Reinforcement-learning-based anti-jamming resource scheduling algorithm for directional wireless communication networks; Xie Tian; Gao Shishun; Zhao Haitao; Lin Yi; Xiong Jun; Chinese Journal of Radio Science (04); full text *

Also Published As

Publication number Publication date
CN113472472A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
Zhang et al. Distributed optimal beamformers for cognitive radios robust to channel uncertainties
Zaher et al. Learning-based downlink power allocation in cell-free massive MIMO systems
Weeraddana et al. Multicell MISO downlink weighted sum-rate maximization: A distributed approach
JP2012134952A (en) Apparatus and method for allocating resources to nodes in communication system using update of iteration resource weights
Ghiasi et al. Energy efficient AP selection for cell-free massive MIMO systems: Deep reinforcement learning approach
Wu et al. Intelligent resource allocation for IRS-enhanced OFDM communication systems: A hybrid deep reinforcement learning approach
Chafaa et al. Self-supervised deep learning for mmWave beam steering exploiting sub-6 GHz channels
Ji et al. Reconfigurable intelligent surface enhanced device-to-device communications
Lima et al. User pairing and power allocation for UAV-NOMA systems based on multi-armed bandit framework
Chen et al. Energy-efficient cell-free massive MIMO through sparse large-scale fading processing
Muhammad et al. Optimizing age of information in RIS-empowered uplink cooperative NOMA networks
Chen et al. iPAS: A deep Monte Carlo Tree Search-based intelligent pilot-power allocation scheme for massive MIMO system
Hua et al. Learning-based reconfigurable-intelligent-surface-aided rate-splitting multiple access networks
Sun et al. Hierarchical reinforcement learning for AP duplex mode optimization in network-assisted full-duplex cell-free networks
XU et al. Resource Allocation for Two‑Tier RIS‑Assisted Heterogeneous NOMA Networks
Nimmagadda Enhancement of efficiency and performance gain of massive MIMO system using trial-based rider optimization algorithm
Sultan et al. Joint transmitter-receiver optimization and self-interference suppression in full-duplex MIMO systems
Wang et al. Deep transfer reinforcement learning for beamforming and resource allocation in multi-cell MISO-OFDMA systems
CN113472472B (en) Multi-cell collaborative beam forming method based on distributed reinforcement learning
Sandberg et al. Learning robust scheduling with search and attention
Huang et al. Self-attention reinforcement learning for multi-beam combining in mmWave 3D-MIMO systems
CN101989875B (en) Multi-cell interference suppression method and base station controller
Attaoui et al. Joint beam alignment and power allocation for multi-user NOMA-mmwave systems
Mohamed et al. Spectral Efficiency Improvement in Downlink Fog Radio Access Network With Deep-Reinforcement-Learning-Enabled Power Control
Akbarpour-Kasgari et al. Deep Reinforcement Learning in mmW-NOMA: Joint Power Allocation and Hybrid Beamforming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant