CN115460699A - Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning - Google Patents

Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning Download PDF

Info

Publication number
CN115460699A
CN115460699A (application CN202210839976.2A)
Authority
CN
China
Prior art keywords
user
reinforcement learning
users
deep reinforcement
resource allocation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210839976.2A
Other languages
Chinese (zh)
Inventor
赵军辉
张欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202210839976.2A
Publication of CN115460699A
Legal status: Pending

Classifications

    • G06N 20/00 Machine learning
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H04W 4/02 Services making use of location information
    • H04W 72/0446 Wireless resource allocation of resources in time domain, e.g. slots or frames
    • H04W 72/0453 Wireless resource allocation of resources in frequency domain, e.g. a carrier in FDMA
    • H04W 72/046 Wireless resource allocation of resources in the space domain, e.g. beams


Abstract

The invention relates to the field of wireless air interface resources, in particular to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, which comprises the following steps: first, users are clustered with a density clustering algorithm according to the users' positions and the angle information between the users and the base station, and users in different clusters are allocated spatial-domain wireless resources through different beams; then, within one time slot, deep reinforcement learning is used to allocate different frequency-band resources to different users based on the user position, the angle between the user and the base station, the user's moving speed and direction, the base-station coverage condition and the clustering condition. The invention provides a space-time-frequency multi-domain associated resource allocation method based on multiple kinds of user information: spatial-domain resources are allocated to users in different clusters by zero-forcing beamforming, and frequency resources are allocated to different users within a time slot by deep reinforcement learning. The wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning is clearly superior to random allocation and the dueling deep reinforcement learning scheme in resource allocation, and is suitable for space-time-frequency resource allocation scenarios in the field of wireless communication under dynamic conditions.

Description

Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the field of wireless air interface resources, in particular to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning.
Background
Countries around the world have not yet reached a consensus on the future development of 6G technology, but in general 6G will further integrate satellite communication, AI and big data on top of existing 5G technology to form a ubiquitous mobile communication infrastructure oriented toward 2030 and beyond. Driven by new application and technical requirements, 6G needs to introduce new performance indicators, such as higher spectral/energy/cost efficiency, higher transmission rates, lower latency, higher connection density, wider coverage, and greater intelligence and security. To meet these new requirements and performance targets, 6G adopts a new paradigm of full coverage, full spectrum, full application and strong security. 6G can therefore support ubiquitous heterogeneous scenarios and, by means of various sensors together with big data and deep learning, provide a network of interconnected things across the air, space, land and sea domains.
However, because 6G supports connections across sea, air and ground, its transmission environment is very complicated. How to analyze the scale characteristics and coupling relations of space-time-frequency multi-domain resources, and how to exploit the relevance and reciprocity of multi-domain resources so as to achieve unified orchestration and management of these resources, is an important technical challenge.
Existing wireless network configuration schemes based on Q-learning mainly use Q-network reinforcement learning to optimize the allocation of wireless network resources according to the network state, but which wireless resources are actually allocated remains unclear. Existing reinforcement-learning-based resource allocation optimization methods and system implementations mainly allocate resource blocks to user services according to downlink information such as bandwidth, the number of physical resource blocks, the amount of user traffic to be transmitted, resource block characteristics and downlink characteristics. In short, existing schemes mainly use reinforcement learning to allocate resources in a single wireless transmission domain (spatial, time or frequency domain); there is little research on space-time-frequency multi-domain resources for wireless transmission.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, and to solve the problem of space-time-frequency multi-domain associated resource allocation for wireless transmission.
In order to achieve the purpose, the invention adopts the technical scheme that:
a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning comprises the following steps:
s1, clustering users by adopting a density clustering algorithm, wherein the input of the density clustering algorithm comprises user position information and an angle sine value between a user and a base station, and the output of the density clustering algorithm is a user clustering label;
s2, configuring different beams on a space domain for different clusters in the step S1 by adopting a zero-forcing beam forming method, wherein channel state information required by zero-forcing beam forming is average channel state information of users in each cluster;
and S3, taking the user's position, driving speed and driving direction, which base station covers the user, the angle between the user and the base station, and the user's cluster label as the state, selecting a subcarrier as the action, taking the transmission rate as the reward, and learning a subcarrier resource allocation method within one time slot for different users by means of a deep reinforcement learning algorithm in order to maximize the total transmission rate of the system.
On the basis of the scheme, the initial position information of the users in step S1 is randomly generated within the coverage of the base station; vehicle users' positions are generated on the road and pedestrian users' positions are generated off the road.
On the basis of the above scheme, the density clustering algorithm in step S1 is executed by an edge server on the base station.
On the basis of the above scheme, the clustered average channel state information matrix in step S2 is $\mathbf{H}$, and the beam matrix obtained by the zero-forcing beamforming scheme is $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}$.
On the basis of the above scheme, the channel state information of the user in step S2 is collected by the base station.
On the basis of the scheme, the deep reinforcement learning algorithm in the step S3 specifically comprises an experience storage process, a Q estimation network training process and an inference process;
the experience storage process comprises the following steps:
S311, inputting the current user state information into the Q estimation network, the Q estimation network outputting the Q value of each action, and with probability ε selecting the action with the maximum Q value;
S312, applying the action to the environment to obtain a reward value and the next state;
s313, storing a tuple consisting of the current state, the action, the reward and the next state in an experience replay pool, wherein the experience replay pool is used for training the neural network;
the Q estimation network training process comprises the following steps:
S321, extracting a mini-batch of data from the experience replay pool in step S313, and inputting the current state $s_{t}$ of the data into the Q estimation network to obtain the estimated value $Q_{est}(s_{t})$ of each action in the current state;
S322, inputting the next state $s_{t+1}$ of the data into the Q target network to obtain the corresponding Q value $Q(s_{t+1})$, and obtaining the target value according to $Q_{real}(s_{t+1})=r_{t}+\gamma\max Q(s_{t+1})$, where $r_{t}$ is the reward value of the previous state and $\gamma$ is the reward discount factor; the structure of the Q target network is the same as that of the Q estimation network, and when the transmission rate of the system increases, the weight parameters of the Q estimation network are given to the Q target network;
s323, calculating loss by taking the mean square error of the Q actual value and the Q estimated value as a loss function;
s324, feeding the loss value back to the Q estimation network, and optimizing weight parameters in the Q estimation network by using an optimizer;
the reasoning process comprises the following steps:
user state information is input to the Q estimation network to select the sub-carrier with the largest Q value.
On the basis of the above scheme, the deep reinforcement learning algorithm in step S3 is run in an edge server on the base station.
On the basis of the scheme, the Q estimation network and the Q target network are both composed of three layers of neural networks, the two hidden layers each have 10 nodes, and the activation function is the ReLU function.
On the basis of the scheme, the transmission rate is calculated by the following formula:
the downlink channel on the m-th subcarrier (m = 1, …, M) from the i-th base station to the k-th user under beam c is represented as:

$$\mathbf{h}_{i,k,c}^{m}=\sqrt{\frac{1}{PL}}\sum_{p=1}^{N_{p}}\alpha_{p}\,\mathbf{a}(\theta_{p})$$

where M is the number of subcarriers, PL is the path loss, $N_{p}$ is the number of paths, $\alpha_{p}$ is the path gain, $\theta_{p}$ is the emission angle of the p-th path, and $\mathbf{a}(\theta_{p})$ is the response vector related to the emission angle;

the channel signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user is expressed as:

$$\gamma_{i,k,c}^{m}=\frac{P_{0}\bigl|(\mathbf{h}_{i,k,c}^{m})^{T}\mathbf{w}_{c}\bigr|^{2}}{I_{\mathrm{intra}}+I_{\mathrm{inter}}+N_{0}}$$

where $P_{0}$, $I_{\mathrm{intra}}$, $I_{\mathrm{inter}}$ and $N_{0}$ respectively represent the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and $\mathbf{w}_{c}$ is the component (column) of the beam matrix $\mathbf{W}$ corresponding to beam c;

the communication rate of the user is expressed as:

$$R_{k}(t)=\sum_{m=1}^{M}\frac{B}{M}\log_{2}\bigl(1+\gamma_{i,k,c}^{m}\bigr)$$

where B is the system bandwidth and M is the number of subcarriers.
The wireless transmission space-time frequency resource allocation method based on deep reinforcement learning has the following beneficial effects:
1. The invention targets a wider range of resources, including space, time and frequency resources.
2. The method of the invention is clearly superior to random allocation and the dueling deep reinforcement learning scheme in resource allocation, and is suitable for space-time-frequency resource allocation scenarios in the field of wireless communication under dynamic conditions.
Drawings
The invention has the following drawings:
fig. 1 is a general flowchart of a method for allocating radio resources based on deep reinforcement learning according to the present invention;
FIG. 2 (a) is an operational diagram of a density clustering scheme employed in the present invention;
FIG. 2 (b) is a diagram of a neural network architecture employed in the present invention;
FIG. 3 is a block diagram of the deep reinforcement learning architecture of the present invention;
FIG. 4 is a convergence comparison diagram of the present invention and the dueling deep reinforcement learning method during training;
FIG. 5 is a comparison chart of the test results of the present invention, the dueling deep reinforcement learning method and the random allocation method.
Detailed Description
The invention relates to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, which comprises the following steps:
Step one: the edge server on the base station attaches a cluster label to each user by means of density clustering. The embodiment of the invention uses the user's position and the sine of the angle between the user and the base station as the input of the density clustering method; the user's cluster label information is obtained through density clustering and serves as the basic feature for allocating spatial-domain information to the user.
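By way of illustration, the clustering step can be sketched as follows. The sketch assumes scikit-learn's DBSCAN as one possible density clustering algorithm (the text does not fix a particular implementation); the feature construction and the eps and min_samples values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_users(positions, bs_position, eps=15.0, min_samples=2):
    """Attach a cluster label to every user from its position and the
    sine of its angle to the base station (illustrative parameters)."""
    # positions: (K, 2) user x/y coordinates; bs_position: (2,) base-station location
    delta = positions - bs_position
    sin_angle = delta[:, 1] / (np.linalg.norm(delta, axis=1) + 1e-9)
    features = np.column_stack([positions, sin_angle])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    return labels  # one cluster label per user (-1 marks noise points)
```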
Step two: in order to allocate spatial-domain wireless transmission resources to users, the embodiment of the invention adopts a zero-forcing beamforming method on the multi-antenna base station side to allocate different beams to the different user clusters in the spatial domain. The channel state information required for zero-forcing beamforming is the average channel state information of the users in each cluster. According to the average channel information of the different clusters, the beam matrix $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}$ is obtained from the channel state information by zero-forcing beamforming, and different beams are allocated to the users of different clusters.
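A minimal numpy sketch of this beam construction is given below. The averaging over each cluster and the formula W = H^T(HH^T)^{-1} follow the text above; the array shapes and the handling of noise points are assumptions.

```python
import numpy as np

def zero_forcing_beams(channels, labels):
    """Zero-forcing beams from cluster-average channels.
    channels: (K, Nt) per-user channel vectors; labels: (K,) cluster labels."""
    clusters = sorted(set(labels.tolist()) - {-1})     # ignore DBSCAN noise points
    # H: (C, Nt) average channel of each cluster
    H = np.vstack([channels[labels == c].mean(axis=0) for c in clusters])
    # W = H^T (H H^T)^{-1}, one beam (column) per cluster, as in the text
    W = H.T @ np.linalg.inv(H @ H.T)
    return W, clusters
```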
Step three: based on six user features (the user's position information, the sine of the angle between the user and the base station, the user's driving speed, the user's driving direction, the base station to which the user belongs, and the user's cluster label), and in order to maximize the total transmission rate of the system, a method of allocating subcarrier resources within one time slot is learned for different users by a deep reinforcement learning algorithm running on the edge server of the base station. First, the six user features are input into the neural network as the current state. Then, an action to be performed is selected based on the Q values predicted by the neural network: with probability ε the action with the largest Q value is selected; otherwise the action to be executed is chosen at random. Next, the action is applied to the environment and, combined with the beam information, produces a reward value and the next state. The tuple {current state, action, reward, next state} is stored in the experience replay pool and used in the training process of deep reinforcement learning. Finally, the neural network is trained with the deep reinforcement learning method, so that the network can allocate space-frequency resources to users while maximizing the transmission rate within one time slot.
In step three, the embodiment of the present invention provides a three-layer Q estimation neural network architecture and a three-layer Q target network architecture. The two networks have the same structure but different parameters; the Q target network is used only to store earlier parameter values of the Q estimation network. The first layer takes the 6-dimensional feature vector as input and outputs 10 node features; the second layer takes 10 node features as input and outputs 10 node features; the third layer takes 10 node features as input and outputs 14 predicted values, one per sub-band.
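A minimal PyTorch sketch of the described three-layer architecture (6 input features, two 10-node hidden layers with ReLU activations, 14 outputs) follows; the use of PyTorch and the parameter names are assumptions, not part of the patent.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Three-layer Q network: 6 input features -> 10 -> 10 -> 14 Q values,
    one per sub-band, with ReLU activations, as described above."""
    def __init__(self, n_features=6, n_hidden=10, n_actions=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# The Q target network shares this architecture; its weights are copied from
# the Q estimation network when the system rate improves.
q_eval, q_target = QNetwork(), QNetwork()
q_target.load_state_dict(q_eval.state_dict())
```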
The air interface resource allocation method provided by the embodiment of the invention, based on deep reinforcement learning and on the space-time-frequency multi-domain association of wireless transmission, uses the user's position information, the sine of the angle between the user and the base station, the user's driving speed and direction, the base station to which the user belongs, and the user's cluster label. It adopts a flexible combination of density clustering and deep reinforcement learning, drives the reward value to converge to its maximum and the loss function to its minimum through training, and allocates wireless space-frequency resources to different users within a time slot. The method thus enables radio transmission resources to be allocated across the space, time and frequency domains.
The present invention will be described in further detail with reference to the accompanying drawings.
Firstly, the scenario is determined: suppose there are two base stations BS1 and BS2 at the roadside, each equipped with N_t antennas and an MEC server. The MEC servers are responsible for collecting user information and can communicate with each other. Within the coverage of the base stations, the initial positions of K vehicle and pedestrian users in total are randomly generated; vehicle users can only be generated on a road, and pedestrian users are not on the road. In addition, respective driving speeds and directions are set for the vehicle and pedestrian users. To maximize the total communication rate of the system, we allocate radio transmission space (i.e., beam) and frequency (i.e., subcarrier) resources to different users in one time slot.
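As one possible way to set up such a scene, the following sketch randomly places vehicle users on a straight road strip and pedestrian users elsewhere in the cell, and assigns each user a speed and heading. The road geometry, cell radius, user counts and speed ranges are all illustrative assumptions.

```python
import numpy as np

def init_users(num_vehicles=10, num_pedestrians=10, cell_radius=250.0,
               road_half_width=10.0, seed=0):
    """Generate initial user positions, speeds and headings (assumed geometry)."""
    rng = np.random.default_rng(seed)
    # Vehicles: constrained to the road strip |y| <= road_half_width
    veh_xy = np.column_stack([
        rng.uniform(-cell_radius, cell_radius, num_vehicles),
        rng.uniform(-road_half_width, road_half_width, num_vehicles)])
    veh_speed = rng.uniform(10.0, 20.0, num_vehicles)        # m/s, assumed range
    # Pedestrians: anywhere in the cell except the road strip
    ped_y = rng.uniform(road_half_width, cell_radius, num_pedestrians)
    ped_y *= rng.choice([-1, 1], num_pedestrians)
    ped_xy = np.column_stack([
        rng.uniform(-cell_radius, cell_radius, num_pedestrians), ped_y])
    ped_speed = rng.uniform(0.5, 1.5, num_pedestrians)        # m/s, assumed range
    heading = rng.uniform(0.0, 2 * np.pi, num_vehicles + num_pedestrians)
    positions = np.vstack([veh_xy, ped_xy])
    speeds = np.concatenate([veh_speed, ped_speed])
    return positions, speeds, heading
```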
The downlink channel on the m-th subcarrier (m = 1, …, M) from the i-th base station to the k-th user under beam c may be represented as:

$$\mathbf{h}_{i,k,c}^{m}=\sqrt{\frac{1}{PL}}\sum_{p=1}^{N_{p}}\alpha_{p}\,\mathbf{a}(\theta_{p})$$

where M is the number of subcarriers, PL is the path loss, $N_{p}$ is the number of paths (including both line-of-sight (LOS) and non-line-of-sight (NLOS) transmission paths), $\alpha_{p}$ is the path gain, $\theta_{p}$ is the emission angle of the p-th path, and $\mathbf{a}(\theta_{p})$ is the response vector related to the emission angle. It is worth noting that for some links only the LOS path exists.
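A hedged sketch of generating such a multipath channel is given below. The uniform-linear-array response vector, the complex Gaussian path gains and the uniform emission angles are modeling assumptions not specified in the text.

```python
import numpy as np

def ula_response(theta, n_antennas, spacing=0.5):
    """Uniform-linear-array response vector a(theta) (assumed array geometry)."""
    k = np.arange(n_antennas)
    return np.exp(1j * 2 * np.pi * spacing * k * np.sin(theta)) / np.sqrt(n_antennas)

def geometric_channel(n_antennas, n_paths, path_loss_db, rng=None):
    """Multipath channel h = sqrt(1/PL) * sum_p alpha_p * a(theta_p);
    complex Gaussian gains and uniform emission angles are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    pl = 10.0 ** (path_loss_db / 10.0)
    alphas = (rng.standard_normal(n_paths) + 1j * rng.standard_normal(n_paths)) / np.sqrt(2)
    thetas = rng.uniform(-np.pi / 2, np.pi / 2, n_paths)
    h = sum(alphas[p] * ula_response(thetas[p], n_antennas) for p in range(n_paths))
    return h / np.sqrt(pl)
```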
The allocation of spatial-domain resources assigns users to different clusters by density clustering according to the users' position information and the angle information between the users and the base station, and allocates different beams to the different clusters. The characteristic channel of each cluster is the average channel matrix $\mathbf{H}$ of the user channels in that cluster. In the present invention, zero-forcing beamforming is used to allocate different beams to different clusters, with beam matrix $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}\in\mathbb{C}^{N_{t}\times C}$, where C is the number of clusters.
The channel signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user can be expressed as:

$$\gamma_{i,k,c}^{m}=\frac{P_{0}\bigl|(\mathbf{h}_{i,k,c}^{m})^{T}\mathbf{w}_{c}\bigr|^{2}}{I_{\mathrm{intra}}+I_{\mathrm{inter}}+N_{0}}$$

where $P_{0}$, $I_{\mathrm{intra}}$, $I_{\mathrm{inter}}$ and $N_{0}$ respectively denote the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and $\mathbf{w}_{c}$ is the column of the beam matrix $\mathbf{W}$ assigned to beam c.

The communication rate of a user can be expressed as:

$$R_{k}(t)=\sum_{m=1}^{M}\frac{B}{M}\log_{2}\bigl(1+\gamma_{i,k,c}^{m}\bigr)$$

where B is the system bandwidth and M is the number of subcarriers.
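The SINR and rate computation can be sketched as follows. The aggregate interference terms are passed in as given quantities because their exact values depend on the scheduling decisions of the other users; all names are illustrative.

```python
import numpy as np

def user_rate(h, w, p0, intra_interf, inter_interf, n0, bandwidth, num_subcarriers):
    """Per-user rate R_k(t) = sum_m (B/M) * log2(1 + SINR_m).
    h: (M, Nt) channel per subcarrier; w: (Nt,) beam of the user's cluster;
    intra_interf / inter_interf: (M,) aggregate interference powers (assumed given)."""
    signal = p0 * np.abs(h @ w) ** 2                       # (M,) received signal power
    sinr = signal / (intra_interf + inter_interf + n0)     # per-subcarrier SINR
    return float(np.sum((bandwidth / num_subcarriers) * np.log2(1.0 + sinr)))
```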
The invention adopts a density clustering and deep reinforcement learning method to configure different beams and subcarriers for different users in a time slot so as to maximize the total rate of the system.
Deep reinforcement learning has the following key elements, which are set up as follows (a short code sketch of these elements is given after the list):
1. State $s_t$: the state space mainly comprises the user's position, the angle between the user and the base station, the user's moving speed and direction, the base-station coverage condition and the clustering condition.
2. Action $a_t$: selecting different subcarriers for a user.
3. Reward function $r_t$: the user communication rate $R_k(t)$.
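A small sketch of these three elements, assuming the user position is summarized by its distance to the serving base station and the angle by its sine (the exact encoding of the six state features is not fixed by the text):

```python
import numpy as np

def build_state(dist_to_bs, speed, heading, serving_bs_id, sin_angle, cluster_label):
    """State s_t: six-feature vector fed to the Q network (encoding assumed)."""
    return np.array([dist_to_bs, speed, heading, serving_bs_id,
                     sin_angle, cluster_label], dtype=np.float32)

def select_subcarrier(q_values, epsilon=0.9, rng=None):
    """Action a_t: index of the chosen subcarrier, epsilon-greedy over the Q values."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

# Reward r_t: the user communication rate R_k(t), e.g. from the rate sketch above.
```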
On this basis, the invention allocates space-time-frequency resources to users following the general flow of deep-reinforcement-learning-based radio resource allocation shown in fig. 1; the specific flow is as follows (a condensed code sketch of this loop follows the list):
S1: Initialize the experience replay pool D, the Q target network weight parameters, the Q estimation network weight parameters θ, the step counter Step = 0, and ReQP = 0;
S2: Initialize the current state $s_t$: first randomly generate the users' position data and generate user driving speed and direction data according to the actual situation; then determine which base station covers each user and calculate the angle between the user and the base station; finally, cluster the users with the density clustering method to obtain the users' cluster label information;
S3: According to the state $s_t$, with probability ε select the action that maximizes $Q_{est}(s_t)$, i.e., $a_t=\arg\max Q(s_t)$; otherwise randomly select an action $a_t$;
S4: Apply the action $a_t$ to the environment and interact with it to calculate the reward $r_t$ of each user, the next state $s_{t+1}$, and a terminator (True or False);
S5: Store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay pool D;
S6: Judge whether there is enough data in the experience replay pool and the number of running steps is a multiple of 50, i.e., whether Step > 200 and Step mod 50 = 0; if so, go to S7; otherwise go to S14;
S7: Judge whether the Q target network parameters need to be updated, i.e., whether ReQP = 1; if so, perform S8; otherwise perform S9-S13;
S8: Update the Q target network parameters to the weight parameters θ of the Q estimation network;
S9: Randomly sample a mini-batch of 128 samples from the experience replay pool D;
S10: Feed $s_t$ of the mini-batch into the Q estimation network and output the estimated value $Q_{est}(s_t)$; feed $s_{t+1}$ of the mini-batch into the Q target network and output the predicted value $Q(s_{t+1})$;
S11: Select the action $a_{t+1}$ with the maximum value in $Q(s_{t+1})$ and calculate the target value $Q_{real}(s_{t+1})=r_t+\gamma\max Q(s_{t+1})$;
S12: Calculate the loss function $loss=E[(Q_{real}(s_{t+1})-Q_{est}(s_t))^{2}]$;
S13: Optimize the weight parameters θ of the Q estimation network with the ADAM optimizer based on the loss value;
S14: Judge whether the terminator is True; if True, perform S15-S16, otherwise go to S17;
S15: Calculate the average system rate;
S16: Judge whether the system rate of this round is greater than that of the last round; if so, set ReQP = 1, otherwise set ReQP = 0, and go to S17;
S17: Replace the current state $s_t$ with the next state $s_{t+1}$, set Step = Step + 1, and return to S3;
S18: Judge whether the current number of rounds has reached the maximum number of rounds; if so, end; otherwise return to S2.
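A condensed Python sketch of this loop is given below. It assumes PyTorch, replaces the clustering/beamforming/rate computation with a stub environment, and treats γ, ε, the learning rate and the episode count as illustrative values; the batch size of 128 and the Step > 200 / every-50-steps checks follow S6 and S9 above.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

GAMMA, EPS, BATCH, N_ACTIONS = 0.9, 0.9, 128, 14     # gamma/epsilon assumed

def make_net():
    return nn.Sequential(nn.Linear(6, 10), nn.ReLU(),
                         nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, N_ACTIONS))

q_eval, q_target = make_net(), make_net()
q_target.load_state_dict(q_eval.state_dict())
optimizer = torch.optim.Adam(q_eval.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                        # experience replay pool D

def env_step(state, action):
    """Stub environment: a real implementation would apply the chosen subcarrier,
    compute the rate reward R_k(t) with the beams, and advance the users."""
    return float(np.random.rand()), np.random.rand(6).astype(np.float32), np.random.rand() < 0.05

step, reqp, best_rate = 0, 0, -np.inf
for episode in range(100):                           # S18: loop over episodes
    state = np.random.rand(6).astype(np.float32)     # S2: initial state (stubbed)
    rewards, done = [], False
    while not done:
        with torch.no_grad():
            q = q_eval(torch.from_numpy(state))      # S3: epsilon-greedy action
        action = int(q.argmax()) if random.random() < EPS else random.randrange(N_ACTIONS)
        reward, next_state, done = env_step(state, action)            # S4
        replay.append((state, action, reward, next_state))            # S5
        rewards.append(reward)
        if len(replay) > 200 and step % 50 == 0:     # S6: enough data, every 50 steps
            if reqp:                                 # S7-S8: refresh the target network
                q_target.load_state_dict(q_eval.state_dict())
            else:                                    # S9-S13: one mini-batch update
                batch = random.sample(list(replay), min(BATCH, len(replay)))
                s, a, r, s2 = (np.array(x) for x in zip(*batch))
                q_est = q_eval(torch.from_numpy(s)).gather(
                    1, torch.from_numpy(a).long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():                # S11: Q_real = r + gamma * max Q(s')
                    q_real = torch.from_numpy(r).float() + \
                             GAMMA * q_target(torch.from_numpy(s2)).max(1)[0]
                loss = nn.functional.mse_loss(q_est, q_real)          # S12
                optimizer.zero_grad(); loss.backward(); optimizer.step()  # S13
        state, step = next_state, step + 1           # S17
    mean_rate = float(np.mean(rewards))              # S15: average system rate
    reqp = int(mean_rate > best_rate)                # S16: set the target-update flag
    best_rate = max(best_rate, mean_rate)
```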
Fig. 2 (a) is an operation diagram of the density clustering scheme adopted in the present invention, wherein the input features of density clustering are the user's position information and the sine of the angle between the user and the covering base station, and the user's cluster label information is output after density clustering.
FIG. 2 (b) is a diagram of the neural network structure adopted by the present invention. The inputs of the neural network are six features: the user's position, driving speed, driving direction, which base station covers the user, the angle between the user and the base station, and the user's cluster label. Two hidden layers with 10 nodes each lie in the middle, and the output is the Q value corresponding to each action.
FIG. 3 is a structural diagram of the deep reinforcement learning adopted by the present invention, which is divided into two parts: experience storage and learning.
In the experience storage process, the Q estimation network takes the six user features as input, outputs the Q value of each action, and selects the action with the maximum Q value with probability ε. The action is then applied to the environment to obtain the reward value and the next state. Finally, the tuple consisting of the current state, the action, the reward and the next state is stored in the experience replay pool.
In the learning process, when the Q target network parameters meet the update condition (i.e., the system rate in the current step is greater than that in the previous step), the parameters in the Q target network are updated to the Q estimation network parameters. Then a mini-batch of data is selected from the experience replay pool; the current states of the data are input into the Q estimation network to estimate the Q value of each action in the current state, i.e., $Q_{est}(s_{t})$; the next states in the data are input into the Q target network to obtain the corresponding Q values, and the target value is obtained according to $Q_{real}(s_{t+1})=r_{t}+\gamma\max Q(s_{t+1})$. The loss function is then calculated as $loss=E[(Q_{real}(s_{t+1})-Q_{est}(s_{t}))^{2}]$. The loss is fed back to the Q estimation network, and the parameters in the Q estimation network are optimized with the ADAM optimizer. This continues until the maximum number of cycles is reached.
Fig. 4 is a convergence comparison diagram of the present invention and the dueling deep reinforcement learning method during training; it can be seen from the figure that the convergence result of the present invention is clearly superior to that of the dueling deep reinforcement learning method.
Fig. 5 is a comparison graph of the test results of the present invention, the dueling deep reinforcement learning method and the random allocation method; it can be seen from the figure that the method of the present invention is clearly superior to the random method and the dueling deep reinforcement learning method.
Those not described in detail in this specification are well within the skill of the art.

Claims (9)

1. A wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
s1, clustering users by adopting a density clustering algorithm, wherein the input of the density clustering algorithm comprises user position information and an angle sine value between a user and a base station, and the output of the density clustering algorithm is a user clustering label;
s2, configuring different beams on a space domain for different clusters in the step S1 by adopting a zero-forcing beam forming method, wherein channel state information required by the zero-forcing beam forming is average channel state information of users in each cluster;
and S3, taking the user's position, driving speed and driving direction, which base station covers the user, the angle between the user and the base station, and the user's cluster label as the state, selecting a subcarrier as the action, taking the transmission rate as the reward, and learning a subcarrier resource allocation method within one time slot for different users by means of a deep reinforcement learning algorithm in order to maximize the total transmission rate of the system.
2. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein in step S1, the initial position information of the users is randomly generated within the coverage of the base station, vehicle users' positions are generated on the road, and pedestrian users' positions are generated off the road.
3. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the implementation subject of the density clustering algorithm in step S1 is an edge server on a base station.
4. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the clustered average channel state information matrix in step S2 is $\mathbf{H}$, and the beam matrix obtained by the zero-forcing beamforming scheme is $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}$.
5. The method for configuring space-time-frequency resources for wireless transmission based on deep reinforcement learning of claim 1, wherein the channel state information of the users in step S2 is collected by a base station.
6. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the step S3 deep reinforcement learning algorithm specifically includes an experience storage process, a Q estimation network training process, and an inference process;
the experience storage process comprises the following steps:
S311, inputting the current user state information into the Q estimation network, the Q estimation network outputting the Q value of each action, and with probability ε selecting the action with the maximum Q value;
S312, applying the action to the environment to obtain a reward value and the next state;
s313, storing a tuple consisting of the current state, the action, the reward and the next state in an experience replay pool, wherein the experience replay pool is used for training the neural network;
the Q estimation network training process comprises the following steps:
S321, extracting a mini-batch of data from the experience replay pool in step S313, and inputting the current state $s_{t}$ of the data into the Q estimation network to obtain the estimated value $Q_{est}(s_{t})$ of each action in the current state;
S322, inputting the next state $s_{t+1}$ of the data into the Q target network to obtain the corresponding Q value $Q(s_{t+1})$, and obtaining the target value according to $Q_{real}(s_{t+1})=r_{t}+\gamma\max Q(s_{t+1})$, where $r_{t}$ is the reward value of the previous state and $\gamma$ is the reward discount factor; the structure of the Q target network is the same as that of the Q estimation network, and when the transmission rate of the system increases, the weight parameters of the Q estimation network are given to the Q target network;
s323, calculating loss by taking the mean square error of the Q actual value and the Q estimated value as a loss function;
s324, feeding the loss value back to the Q estimation network, and optimizing a weight parameter in the Q estimation network by using an optimizer;
the reasoning process comprises the following steps:
user state information is input to the Q estimation network to select the sub-carrier with the largest Q value.
7. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the deep reinforcement learning algorithm in step S3 is run in an edge server on a base station.
8. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 6, wherein the Q estimation network and the Q target network are both composed of three layers of neural networks, the two hidden layers each have 10 nodes, and the activation function is the ReLU function.
9. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the transmission rate is calculated by the following formula:
the downlink channel on the m-th subcarrier (m = 1, …, M) from the i-th base station to the k-th user under beam c is represented as:

$$\mathbf{h}_{i,k,c}^{m}=\sqrt{\frac{1}{PL}}\sum_{p=1}^{N_{p}}\alpha_{p}\,\mathbf{a}(\theta_{p})$$

where M is the number of subcarriers, PL is the path loss, $N_{p}$ is the number of paths, $\alpha_{p}$ is the path gain, $\theta_{p}$ is the emission angle of the p-th path, and $\mathbf{a}(\theta_{p})$ is the response vector related to the emission angle;

the channel signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user is expressed as:

$$\gamma_{i,k,c}^{m}=\frac{P_{0}\bigl|(\mathbf{h}_{i,k,c}^{m})^{T}\mathbf{w}_{c}\bigr|^{2}}{I_{\mathrm{intra}}+I_{\mathrm{inter}}+N_{0}}$$

where $P_{0}$, $I_{\mathrm{intra}}$, $I_{\mathrm{inter}}$ and $N_{0}$ respectively represent the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and $\mathbf{w}_{c}$ is the component (column) of the beam matrix $\mathbf{W}$ corresponding to beam c;

the communication rate of the user is expressed as:

$$R_{k}(t)=\sum_{m=1}^{M}\frac{B}{M}\log_{2}\bigl(1+\gamma_{i,k,c}^{m}\bigr)$$

where B is the system bandwidth and M is the number of subcarriers.
CN202210839976.2A 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning Pending CN115460699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210839976.2A CN115460699A (en) 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210839976.2A CN115460699A (en) 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115460699A true CN115460699A (en) 2022-12-09

Family

ID=84296721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210839976.2A Pending CN115460699A (en) 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115460699A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190268894A1 (en) * 2018-02-28 2019-08-29 Korea Advanced Institute Of Science And Technology Resource allocation method and apparatus for wireless backhaul network based on reinforcement learning
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
US20200084777A1 (en) * 2018-09-12 2020-03-12 Ambeent Wireless Bilisim ve Yazilim A.S Method and system for assigning one or more optimal wireless channels to a wi-fi access point using a cloud-based software defined network (sdn)
CN112272232A (en) * 2020-10-23 2021-01-26 北京邮电大学 Millimeter wave Internet of vehicles resource scheduling method and device, electronic equipment and storage medium
CN113727306A (en) * 2021-08-16 2021-11-30 南京大学 Decoupling C-V2X network slicing method based on deep reinforcement learning
CN114040415A (en) * 2021-11-03 2022-02-11 南京邮电大学 Intelligent reflector assisted DQN-DDPG-based resource allocation method


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JIAHANG LI等: "Deep Reinforcement Learning Based Wireless Resource Allocation for V2X Communications", 《 2021 13TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS AND SIGNAL PROCESSING (WCSP)》, 1 November 2021 (2021-11-01) *
张海波; 栾秋季; 朱江; 李方伟: "Task offloading and resource allocation based on mobile edge computing in heterogeneous vehicular networks" (in Chinese), 物联网学报, no. 03, 30 September 2018 (2018-09-30)
张雅雯: "Research on machine learning-based resource allocation algorithms in MEC systems" (in Chinese), 《信息科技》, 15 January 2022 (2022-01-15)
方维维: "Optimization of Internet of Vehicles communication resource allocation based on multi-agent deep reinforcement learning" (in Chinese), 《北京交通大学学报》, 15 April 2022 (2022-04-15)
舒锦; 卫国: "Adaptive resource allocation algorithm based on zero-forcing beamforming in multi-user MIMO-OFDM systems" (in Chinese), 中国科学院研究生院学报, no. 03, 15 May 2009 (2009-05-15)
谭俊杰; 梁应敞: "Deep reinforcement learning methods for intelligent communication" (in Chinese), 电子科技大学学报, no. 02, 30 March 2020 (2020-03-30)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination