CN115460699A - Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning - Google Patents

Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning Download PDF

Info

Publication number
CN115460699A
CN115460699A (application CN202210839976.2A)
Authority
CN
China
Prior art keywords
user
reinforcement learning
users
deep reinforcement
resource allocation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210839976.2A
Other languages
Chinese (zh)
Inventor
赵军辉
张欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202210839976.2A
Publication of CN115460699A
Legal status: Pending

Classifications

    • G06N 20/00 Machine learning
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H04W 4/02 Services making use of location information
    • H04W 72/0446 Wireless resource allocation of resources in time domain, e.g. slots or frames
    • H04W 72/0453 Wireless resource allocation of resources in frequency domain, e.g. a carrier in FDMA
    • H04W 72/046 Wireless resource allocation of resources in the space domain, e.g. beams


Abstract

The invention relates to the field of wireless air interface resources, in particular to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, which comprises the following steps: first, users are clustered with a density clustering algorithm according to the users' positions and the angle information between the users and the base station, and users in different clusters are allocated spatial-domain wireless resources through different beams; then, within one time slot, deep reinforcement learning is used to allocate different frequency-band resources to different users based on the user position, the angle between the user and the base station, the user's moving speed and direction, the base-station coverage condition and the clustering condition. The invention provides a space-time-frequency multi-domain associated resource allocation method based on multiple kinds of user information: spatial-domain resources are allocated to users in different clusters by zero-forcing beamforming, and frequency resources are allocated to different users within a time slot by deep reinforcement learning. The wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning is clearly superior to random allocation and the dueling deep reinforcement learning scheme in resource allocation, and is suitable for space-time-frequency resource allocation scenarios in the field of wireless communication under dynamic conditions.

Description

Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning
Technical Field
The invention relates to the field of wireless air interface resources, in particular to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning.
Background
Countries around the world have not yet reached a consensus on the future development of 6G technology, but in general 6G will further integrate satellite communication, AI and big data on top of existing 5G technology to form a ubiquitous mobile communication infrastructure oriented toward 2030 and beyond. Driven by new application and technical requirements, 6G needs to introduce new performance indicators, such as higher spectral/energy/cost efficiency, higher transmission rates, lower latency, higher connection density, wider coverage, and greater intelligence and security. To meet these new requirements and performance targets, 6G adopts a new paradigm of full coverage, full spectrum, full application and strong security. 6G can therefore support ubiquitous heterogeneous scenarios and, by means of various sensors together with big data and deep learning, provide a network of interconnected things across the air, space, land and sea domains.
However, because 6G supports connections across sea, air and ground, its transmission environment is very complicated. How to analyze the scale characteristics and coupling relations of space-time-frequency multi-domain resources, and how to exploit the relevance and reciprocity of multi-domain resources so as to achieve unified orchestration and management of these resources, is an important technical challenge.
Existing wireless network configuration schemes based on Q-learning mainly use Q-network reinforcement learning to optimize the allocation of wireless network resources according to the network state, but which wireless resources are actually allocated remains unclear. Existing reinforcement-learning-based resource allocation optimization methods and system implementations mainly allocate resource blocks to user services according to downlink information such as bandwidth, the number of physical resource blocks, the amount of user traffic to be transmitted, resource block characteristics and downlink characteristics. In short, existing schemes mainly use reinforcement learning to allocate resources in a single wireless transmission domain (spatial, time or frequency domain); there is little research on space-time-frequency multi-domain resources for wireless transmission.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, and to solve the problem of space-time-frequency multi-domain associated resource allocation for wireless transmission.
In order to achieve the purpose, the invention adopts the technical scheme that:
a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning comprises the following steps:
s1, clustering users by adopting a density clustering algorithm, wherein the input of the density clustering algorithm comprises user position information and an angle sine value between a user and a base station, and the output of the density clustering algorithm is a user clustering label;
s2, configuring different beams on a space domain for different clusters in the step S1 by adopting a zero-forcing beam forming method, wherein channel state information required by zero-forcing beam forming is average channel state information of users in each cluster;
and S3, taking the user's position, driving speed and driving direction, which base station covers the user, the angle between the user and the base station, and the user's cluster label as the state, selecting a subcarrier as the action, taking the transmission rate as the reward, and learning a subcarrier resource allocation method within one time slot for different users by means of a deep reinforcement learning algorithm in order to maximize the total transmission rate of the system.
On the basis of the scheme, the initial position information of the users in step S1 is randomly generated within the coverage of the base station; vehicle users' positions are generated on the road and pedestrian users' positions are generated off the road.
On the basis of the above scheme, the density clustering algorithm in step S1 is executed by an edge server on the base station.
On the basis of the above scheme, the clustered average channel state information matrix in step S2 is $\mathbf{H}$, and the beam matrix obtained by the zero-forcing beamforming scheme is $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}$.
On the basis of the above scheme, the channel state information of the user in step S2 is collected by the base station.
On the basis of the scheme, the deep reinforcement learning algorithm in the step S3 specifically comprises an experience storage process, a Q estimation network training process and an inference process;
the experience storage process comprises the following steps:
S311, inputting the current user state information into the Q estimation network, the Q estimation network outputting the Q value of each action, and with probability ε selecting the action with the maximum Q value;
S312, applying the action to the environment to obtain a reward value and the next state;
s313, storing a tuple consisting of the current state, the action, the reward and the next state in an experience replay pool, wherein the experience replay pool is used for training the neural network;
the Q estimation network training process comprises the following steps:
S321, extracting a mini-batch of data from the experience replay pool in step S313, and inputting the current state $s_{t}$ of the data into the Q estimation network to obtain the estimated value $Q_{est}(s_{t})$ of each action in the current state;
S322, inputting the next state $s_{t+1}$ of the data into the Q target network to obtain the corresponding Q value $Q(s_{t+1})$, and obtaining the target value according to $Q_{real}(s_{t+1})=r_{t}+\gamma\max Q(s_{t+1})$, where $r_{t}$ is the reward value of the previous state and $\gamma$ is the reward discount factor; the structure of the Q target network is the same as that of the Q estimation network, and when the transmission rate of the system increases, the weight parameters of the Q estimation network are given to the Q target network;
s323, calculating loss by taking the mean square error of the Q actual value and the Q estimated value as a loss function;
s324, feeding the loss value back to the Q estimation network, and optimizing weight parameters in the Q estimation network by using an optimizer;
the reasoning process comprises the following steps:
user state information is input to the Q estimation network to select the sub-carrier with the largest Q value.
On the basis of the above scheme, the deep reinforcement learning algorithm in step S3 is run in an edge server on the base station.
On the basis of the scheme, the Q estimation network and the Q target network are both composed of three layers of neural networks, the two hidden layers each have 10 nodes, and the activation function is the ReLU function.
On the basis of the scheme, the transmission rate is calculated by the following formula:
the downlink channel on the m-th subcarrier (m = 1, …, M) from the i-th base station to the k-th user under beam c is represented as:

$$\mathbf{h}_{i,k,c}^{m}=\sqrt{\frac{1}{PL}}\sum_{p=1}^{N_{p}}\alpha_{p}\,\mathbf{a}(\theta_{p})$$

where M is the number of subcarriers, PL is the path loss, $N_{p}$ is the number of paths, $\alpha_{p}$ is the path gain, $\theta_{p}$ is the emission angle of the p-th path, and $\mathbf{a}(\theta_{p})$ is the response vector related to the emission angle;

the channel signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user is expressed as:

$$\gamma_{i,k,c}^{m}=\frac{P_{0}\bigl|(\mathbf{h}_{i,k,c}^{m})^{T}\mathbf{w}_{c}\bigr|^{2}}{I_{\mathrm{intra}}+I_{\mathrm{inter}}+N_{0}}$$

where $P_{0}$, $I_{\mathrm{intra}}$, $I_{\mathrm{inter}}$ and $N_{0}$ respectively represent the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and $\mathbf{w}_{c}$ is the component (column) of the beam matrix $\mathbf{W}$ corresponding to beam c;

the communication rate of the user is expressed as:

$$R_{k}(t)=\sum_{m=1}^{M}\frac{B}{M}\log_{2}\bigl(1+\gamma_{i,k,c}^{m}\bigr)$$

where B is the system bandwidth and M is the number of subcarriers.
The wireless transmission space-time frequency resource allocation method based on deep reinforcement learning has the following beneficial effects:
1. The invention targets a wider range of resources, including space, time and frequency resources.
2. The method of the invention is clearly superior to random allocation and the dueling deep reinforcement learning scheme in resource allocation, and is suitable for space-time-frequency resource allocation scenarios in the field of wireless communication under dynamic conditions.
Drawings
The invention has the following drawings:
fig. 1 is a general flowchart of a method for allocating radio resources based on deep reinforcement learning according to the present invention;
FIG. 2 (a) is an operational diagram of a density clustering scheme employed in the present invention;
FIG. 2 (b) is a diagram of a neural network architecture employed in the present invention;
FIG. 3 is a block diagram of the deep reinforcement learning architecture of the present invention;
FIG. 4 is a convergence comparison diagram of the present invention and the dueling deep reinforcement learning method during training;
FIG. 5 is a comparison chart of the test results of the present invention, the dueling deep reinforcement learning method and the random allocation method.
Detailed Description
The invention relates to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, which comprises the following steps:
Step one: the edge server on the base station attaches a cluster label to each user by means of density clustering. The embodiment of the invention uses the user's position and the sine of the angle between the user and the base station as the input of the density clustering method; the user's cluster label information is obtained through density clustering and serves as the basic feature for allocating spatial-domain information to the user.
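By way of illustration, the clustering step can be sketched as follows. The sketch assumes scikit-learn's DBSCAN as one possible density clustering algorithm (the text does not fix a particular implementation); the feature construction and the eps and min_samples values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_users(positions, bs_position, eps=15.0, min_samples=2):
    """Attach a cluster label to every user from its position and the
    sine of its angle to the base station (illustrative parameters)."""
    # positions: (K, 2) user x/y coordinates; bs_position: (2,) base-station location
    delta = positions - bs_position
    sin_angle = delta[:, 1] / (np.linalg.norm(delta, axis=1) + 1e-9)
    features = np.column_stack([positions, sin_angle])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    return labels  # one cluster label per user (-1 marks noise points)
```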
Step two: in order to allocate spatial-domain wireless transmission resources to users, the embodiment of the invention adopts a zero-forcing beamforming method on the multi-antenna base station side to allocate different beams to the different user clusters in the spatial domain. The channel state information required for zero-forcing beamforming is the average channel state information of the users in each cluster. According to the average channel information of the different clusters, the beam matrix $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}$ is obtained from the channel state information by zero-forcing beamforming, and different beams are allocated to the users of different clusters.
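A minimal numpy sketch of this beam construction is given below. The averaging over each cluster and the formula W = H^T(HH^T)^{-1} follow the text above; the array shapes and the handling of noise points are assumptions.

```python
import numpy as np

def zero_forcing_beams(channels, labels):
    """Zero-forcing beams from cluster-average channels.
    channels: (K, Nt) per-user channel vectors; labels: (K,) cluster labels."""
    clusters = sorted(set(labels.tolist()) - {-1})     # ignore DBSCAN noise points
    # H: (C, Nt) average channel of each cluster
    H = np.vstack([channels[labels == c].mean(axis=0) for c in clusters])
    # W = H^T (H H^T)^{-1}, one beam (column) per cluster, as in the text
    W = H.T @ np.linalg.inv(H @ H.T)
    return W, clusters
```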
Step three: based on six user features (the user's position information, the sine of the angle between the user and the base station, the user's driving speed, the user's driving direction, the base station to which the user belongs, and the user's cluster label), and in order to maximize the total transmission rate of the system, a method of allocating subcarrier resources within one time slot is learned for different users by a deep reinforcement learning algorithm running on the edge server of the base station. First, the six user features are input into the neural network as the current state. Then, an action to be performed is selected based on the Q values predicted by the neural network: with probability ε the action with the largest Q value is selected; otherwise the action to be executed is chosen at random. Next, the action is applied to the environment and, combined with the beam information, produces a reward value and the next state. The tuple {current state, action, reward, next state} is stored in the experience replay pool and used in the training process of deep reinforcement learning. Finally, the neural network is trained with the deep reinforcement learning method, so that the network can allocate space-frequency resources to users while maximizing the transmission rate within one time slot.
In step three, the embodiment of the present invention provides a three-layer Q estimation neural network architecture and a three-layer Q target network architecture. The two networks have the same structure but different parameters; the Q target network is used only to store earlier parameter values of the Q estimation network. The first layer takes the 6-dimensional feature vector as input and outputs 10 node features; the second layer takes 10 node features as input and outputs 10 node features; the third layer takes 10 node features as input and outputs 14 predicted values, one per sub-band.
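A minimal PyTorch sketch of the described three-layer architecture (6 input features, two 10-node hidden layers with ReLU activations, 14 outputs) follows; the use of PyTorch and the parameter names are assumptions, not part of the patent.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Three-layer Q network: 6 input features -> 10 -> 10 -> 14 Q values,
    one per sub-band, with ReLU activations, as described above."""
    def __init__(self, n_features=6, n_hidden=10, n_actions=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# The Q target network shares this architecture; its weights are copied from
# the Q estimation network when the system rate improves.
q_eval, q_target = QNetwork(), QNetwork()
q_target.load_state_dict(q_eval.state_dict())
```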
The air interface resource allocation method provided by the embodiment of the invention, based on deep reinforcement learning and on the space-time-frequency multi-domain association of wireless transmission, uses the user's position information, the sine of the angle between the user and the base station, the user's driving speed and direction, the base station to which the user belongs, and the user's cluster label. It adopts a flexible combination of density clustering and deep reinforcement learning, drives the reward value to converge to its maximum and the loss function to its minimum through training, and allocates wireless space-frequency resources to different users within a time slot. The method thus enables radio transmission resources to be allocated across the space, time and frequency domains.
The present invention will be described in further detail with reference to the accompanying drawings.
Firstly, the scenario is determined: suppose there are two base stations BS1 and BS2 at the roadside, each equipped with N_t antennas and an MEC server. The MEC servers are responsible for collecting user information and can communicate with each other. Within the coverage of the base stations, the initial positions of K vehicle and pedestrian users in total are randomly generated; vehicle users can only be generated on a road, and pedestrian users are not on the road. In addition, respective driving speeds and directions are set for the vehicle and pedestrian users. To maximize the total communication rate of the system, we allocate radio transmission space (i.e., beam) and frequency (i.e., subcarrier) resources to different users in one time slot.
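As one possible way to set up such a scene, the following sketch randomly places vehicle users on a straight road strip and pedestrian users elsewhere in the cell, and assigns each user a speed and heading. The road geometry, cell radius, user counts and speed ranges are all illustrative assumptions.

```python
import numpy as np

def init_users(num_vehicles=10, num_pedestrians=10, cell_radius=250.0,
               road_half_width=10.0, seed=0):
    """Generate initial user positions, speeds and headings (assumed geometry)."""
    rng = np.random.default_rng(seed)
    # Vehicles: constrained to the road strip |y| <= road_half_width
    veh_xy = np.column_stack([
        rng.uniform(-cell_radius, cell_radius, num_vehicles),
        rng.uniform(-road_half_width, road_half_width, num_vehicles)])
    veh_speed = rng.uniform(10.0, 20.0, num_vehicles)        # m/s, assumed range
    # Pedestrians: anywhere in the cell except the road strip
    ped_y = rng.uniform(road_half_width, cell_radius, num_pedestrians)
    ped_y *= rng.choice([-1, 1], num_pedestrians)
    ped_xy = np.column_stack([
        rng.uniform(-cell_radius, cell_radius, num_pedestrians), ped_y])
    ped_speed = rng.uniform(0.5, 1.5, num_pedestrians)        # m/s, assumed range
    heading = rng.uniform(0.0, 2 * np.pi, num_vehicles + num_pedestrians)
    positions = np.vstack([veh_xy, ped_xy])
    speeds = np.concatenate([veh_speed, ped_speed])
    return positions, speeds, heading
```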
The downlink channel on the m-th subcarrier (m = 1, …, M) from the i-th base station to the k-th user under beam c may be represented as:

$$\mathbf{h}_{i,k,c}^{m}=\sqrt{\frac{1}{PL}}\sum_{p=1}^{N_{p}}\alpha_{p}\,\mathbf{a}(\theta_{p})$$

where M is the number of subcarriers, PL is the path loss, $N_{p}$ is the number of paths (including both line-of-sight (LOS) and non-line-of-sight (NLOS) transmission paths), $\alpha_{p}$ is the path gain, $\theta_{p}$ is the emission angle of the p-th path, and $\mathbf{a}(\theta_{p})$ is the response vector related to the emission angle. It is worth noting that for some links only the LOS path exists.
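A hedged sketch of generating such a multipath channel is given below. The uniform-linear-array response vector, the complex Gaussian path gains and the uniform emission angles are modeling assumptions not specified in the text.

```python
import numpy as np

def ula_response(theta, n_antennas, spacing=0.5):
    """Uniform-linear-array response vector a(theta) (assumed array geometry)."""
    k = np.arange(n_antennas)
    return np.exp(1j * 2 * np.pi * spacing * k * np.sin(theta)) / np.sqrt(n_antennas)

def geometric_channel(n_antennas, n_paths, path_loss_db, rng=None):
    """Multipath channel h = sqrt(1/PL) * sum_p alpha_p * a(theta_p);
    complex Gaussian gains and uniform emission angles are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    pl = 10.0 ** (path_loss_db / 10.0)
    alphas = (rng.standard_normal(n_paths) + 1j * rng.standard_normal(n_paths)) / np.sqrt(2)
    thetas = rng.uniform(-np.pi / 2, np.pi / 2, n_paths)
    h = sum(alphas[p] * ula_response(thetas[p], n_antennas) for p in range(n_paths))
    return h / np.sqrt(pl)
```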
The allocation of spatial-domain resources assigns users to different clusters by density clustering according to the users' position information and the angle information between the users and the base station, and allocates different beams to the different clusters. The characteristic channel of each cluster is the average channel matrix $\mathbf{H}$ of the user channels in that cluster. In the present invention, zero-forcing beamforming is used to allocate different beams to different clusters, with beam matrix $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}\in\mathbb{C}^{N_{t}\times C}$, where C is the number of clusters.
The channel signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user can be expressed as:

$$\gamma_{i,k,c}^{m}=\frac{P_{0}\bigl|(\mathbf{h}_{i,k,c}^{m})^{T}\mathbf{w}_{c}\bigr|^{2}}{I_{\mathrm{intra}}+I_{\mathrm{inter}}+N_{0}}$$

where $P_{0}$, $I_{\mathrm{intra}}$, $I_{\mathrm{inter}}$ and $N_{0}$ respectively denote the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and $\mathbf{w}_{c}$ is the column of the beam matrix $\mathbf{W}$ assigned to beam c.

The communication rate of a user can be expressed as:

$$R_{k}(t)=\sum_{m=1}^{M}\frac{B}{M}\log_{2}\bigl(1+\gamma_{i,k,c}^{m}\bigr)$$

where B is the system bandwidth and M is the number of subcarriers.
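The SINR and rate computation can be sketched as follows. The aggregate interference terms are passed in as given quantities because their exact values depend on the scheduling decisions of the other users; all names are illustrative.

```python
import numpy as np

def user_rate(h, w, p0, intra_interf, inter_interf, n0, bandwidth, num_subcarriers):
    """Per-user rate R_k(t) = sum_m (B/M) * log2(1 + SINR_m).
    h: (M, Nt) channel per subcarrier; w: (Nt,) beam of the user's cluster;
    intra_interf / inter_interf: (M,) aggregate interference powers (assumed given)."""
    signal = p0 * np.abs(h @ w) ** 2                       # (M,) received signal power
    sinr = signal / (intra_interf + inter_interf + n0)     # per-subcarrier SINR
    return float(np.sum((bandwidth / num_subcarriers) * np.log2(1.0 + sinr)))
```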
The invention adopts a density clustering and deep reinforcement learning method to configure different beams and subcarriers for different users in a time slot so as to maximize the total rate of the system.
Deep reinforcement learning has the following key elements, which are set up as follows (a short code sketch of these elements is given after the list):
1. State $s_t$: the state space mainly comprises the user's position, the angle between the user and the base station, the user's moving speed and direction, the base-station coverage condition and the clustering condition.
2. Action $a_t$: selecting different subcarriers for a user.
3. Reward function $r_t$: the user communication rate $R_k(t)$.
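A small sketch of these three elements, assuming the user position is summarized by its distance to the serving base station and the angle by its sine (the exact encoding of the six state features is not fixed by the text):

```python
import numpy as np

def build_state(dist_to_bs, speed, heading, serving_bs_id, sin_angle, cluster_label):
    """State s_t: six-feature vector fed to the Q network (encoding assumed)."""
    return np.array([dist_to_bs, speed, heading, serving_bs_id,
                     sin_angle, cluster_label], dtype=np.float32)

def select_subcarrier(q_values, epsilon=0.9, rng=None):
    """Action a_t: index of the chosen subcarrier, epsilon-greedy over the Q values."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

# Reward r_t: the user communication rate R_k(t), e.g. from the rate sketch above.
```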
On this basis, the invention allocates space-time-frequency resources to users following the general flow of deep-reinforcement-learning-based radio resource allocation shown in fig. 1; the specific flow is as follows (a condensed code sketch of this loop follows the list):
S1: Initialize the experience replay pool D, the Q target network weight parameters, the Q estimation network weight parameters θ, the step counter Step = 0, and ReQP = 0;
S2: Initialize the current state $s_t$: first randomly generate the users' position data and generate user driving speed and direction data according to the actual situation; then determine which base station covers each user and calculate the angle between the user and the base station; finally, cluster the users with the density clustering method to obtain the users' cluster label information;
S3: According to the state $s_t$, with probability ε select the action that maximizes $Q_{est}(s_t)$, i.e., $a_t=\arg\max Q(s_t)$; otherwise randomly select an action $a_t$;
S4: Apply the action $a_t$ to the environment and interact with it to calculate the reward $r_t$ of each user, the next state $s_{t+1}$, and a terminator (True or False);
S5: Store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay pool D;
S6: Judge whether there is enough data in the experience replay pool and the number of running steps is a multiple of 50, i.e., whether Step > 200 and Step mod 50 = 0; if so, go to S7; otherwise go to S14;
S7: Judge whether the Q target network parameters need to be updated, i.e., whether ReQP = 1; if so, perform S8; otherwise perform S9-S13;
S8: Update the Q target network parameters to the weight parameters θ of the Q estimation network;
S9: Randomly sample a mini-batch of 128 samples from the experience replay pool D;
S10: Feed $s_t$ of the mini-batch into the Q estimation network and output the estimated value $Q_{est}(s_t)$; feed $s_{t+1}$ of the mini-batch into the Q target network and output the predicted value $Q(s_{t+1})$;
S11: Select the action $a_{t+1}$ with the maximum value in $Q(s_{t+1})$ and calculate the target value $Q_{real}(s_{t+1})=r_t+\gamma\max Q(s_{t+1})$;
S12: Calculate the loss function $loss=E[(Q_{real}(s_{t+1})-Q_{est}(s_t))^{2}]$;
S13: Optimize the weight parameters θ of the Q estimation network with the ADAM optimizer based on the loss value;
S14: Judge whether the terminator is True; if True, perform S15-S16, otherwise go to S17;
S15: Calculate the average system rate;
S16: Judge whether the system rate of this round is greater than that of the last round; if so, set ReQP = 1, otherwise set ReQP = 0, and go to S17;
S17: Replace the current state $s_t$ with the next state $s_{t+1}$, set Step = Step + 1, and return to S3;
S18: Judge whether the current number of rounds has reached the maximum number of rounds; if so, end; otherwise return to S2.
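A condensed Python sketch of this loop is given below. It assumes PyTorch, replaces the clustering/beamforming/rate computation with a stub environment, and treats γ, ε, the learning rate and the episode count as illustrative values; the batch size of 128 and the Step > 200 / every-50-steps checks follow S6 and S9 above.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

GAMMA, EPS, BATCH, N_ACTIONS = 0.9, 0.9, 128, 14     # gamma/epsilon assumed

def make_net():
    return nn.Sequential(nn.Linear(6, 10), nn.ReLU(),
                         nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, N_ACTIONS))

q_eval, q_target = make_net(), make_net()
q_target.load_state_dict(q_eval.state_dict())
optimizer = torch.optim.Adam(q_eval.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                        # experience replay pool D

def env_step(state, action):
    """Stub environment: a real implementation would apply the chosen subcarrier,
    compute the rate reward R_k(t) with the beams, and advance the users."""
    return float(np.random.rand()), np.random.rand(6).astype(np.float32), np.random.rand() < 0.05

step, reqp, best_rate = 0, 0, -np.inf
for episode in range(100):                           # S18: loop over episodes
    state = np.random.rand(6).astype(np.float32)     # S2: initial state (stubbed)
    rewards, done = [], False
    while not done:
        with torch.no_grad():
            q = q_eval(torch.from_numpy(state))      # S3: epsilon-greedy action
        action = int(q.argmax()) if random.random() < EPS else random.randrange(N_ACTIONS)
        reward, next_state, done = env_step(state, action)            # S4
        replay.append((state, action, reward, next_state))            # S5
        rewards.append(reward)
        if len(replay) > 200 and step % 50 == 0:     # S6: enough data, every 50 steps
            if reqp:                                 # S7-S8: refresh the target network
                q_target.load_state_dict(q_eval.state_dict())
            else:                                    # S9-S13: one mini-batch update
                batch = random.sample(list(replay), min(BATCH, len(replay)))
                s, a, r, s2 = (np.array(x) for x in zip(*batch))
                q_est = q_eval(torch.from_numpy(s)).gather(
                    1, torch.from_numpy(a).long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():                # S11: Q_real = r + gamma * max Q(s')
                    q_real = torch.from_numpy(r).float() + \
                             GAMMA * q_target(torch.from_numpy(s2)).max(1)[0]
                loss = nn.functional.mse_loss(q_est, q_real)          # S12
                optimizer.zero_grad(); loss.backward(); optimizer.step()  # S13
        state, step = next_state, step + 1           # S17
    mean_rate = float(np.mean(rewards))              # S15: average system rate
    reqp = int(mean_rate > best_rate)                # S16: set the target-update flag
    best_rate = max(best_rate, mean_rate)
```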
Fig. 2 (a) is an operation diagram of the density clustering scheme adopted in the present invention, wherein the input features of density clustering are the user's position information and the sine of the angle between the user and the covering base station, and the user's cluster label information is output after density clustering.
FIG. 2 (b) is a diagram of the neural network structure adopted by the present invention. The inputs of the neural network are six features: the user's position, driving speed, driving direction, which base station covers the user, the angle between the user and the base station, and the user's cluster label. Two hidden layers with 10 nodes each lie in the middle, and the output is the Q value corresponding to each action.
FIG. 3 is a structural diagram of the deep reinforcement learning adopted by the present invention, which is divided into two parts: experience storage and learning.
In the experience storage process, the Q estimation network takes the six user features as input, outputs the Q value of each action, and selects the action with the maximum Q value with probability ε. The action is then applied to the environment to obtain the reward value and the next state. Finally, the tuple consisting of the current state, the action, the reward and the next state is stored in the experience replay pool.
In the learning process, when the Q target network parameters meet the update condition (i.e., the system rate in the current step is greater than that in the previous step), the parameters in the Q target network are updated to the Q estimation network parameters. Then a mini-batch of data is selected from the experience replay pool; the current states of the data are input into the Q estimation network to estimate the Q value of each action in the current state, i.e., $Q_{est}(s_{t})$; the next states in the data are input into the Q target network to obtain the corresponding Q values, and the target value is obtained according to $Q_{real}(s_{t+1})=r_{t}+\gamma\max Q(s_{t+1})$. The loss function is then calculated as $loss=E[(Q_{real}(s_{t+1})-Q_{est}(s_{t}))^{2}]$. The loss is fed back to the Q estimation network, and the parameters in the Q estimation network are optimized with the ADAM optimizer. This continues until the maximum number of cycles is reached.
Fig. 4 is a convergence comparison diagram of the present invention and the dueling deep reinforcement learning method during training; it can be seen from the figure that the convergence result of the present invention is clearly superior to that of the dueling deep reinforcement learning method.
Fig. 5 is a comparison graph of the test results of the present invention, the dueling deep reinforcement learning method and the random allocation method; it can be seen from the figure that the method of the present invention is clearly superior to the random method and the dueling deep reinforcement learning method.
Those not described in detail in this specification are well within the skill of the art.

Claims (9)

1. A wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
s1, clustering users by adopting a density clustering algorithm, wherein the input of the density clustering algorithm comprises user position information and an angle sine value between a user and a base station, and the output of the density clustering algorithm is a user clustering label;
s2, configuring different beams on a space domain for different clusters in the step S1 by adopting a zero-forcing beam forming method, wherein channel state information required by the zero-forcing beam forming is average channel state information of users in each cluster;
and S3, taking the user's position, driving speed and driving direction, which base station covers the user, the angle between the user and the base station, and the user's cluster label as the state, selecting a subcarrier as the action, taking the transmission rate as the reward, and learning a subcarrier resource allocation method within one time slot for different users by means of a deep reinforcement learning algorithm in order to maximize the total transmission rate of the system.
2. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein in step S1, the initial position information of the users is randomly generated within the coverage of the base station, vehicle users' positions are generated on the road, and pedestrian users' positions are generated off the road.
3. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the implementation subject of the density clustering algorithm in step S1 is an edge server on a base station.
4. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the clustered average channel state information matrix in step S2 is $\mathbf{H}$, and the beam matrix obtained by the zero-forcing beamforming scheme is $\mathbf{W}=\mathbf{H}^{T}(\mathbf{H}\mathbf{H}^{T})^{-1}$.
5. The method for configuring space-time-frequency resources for wireless transmission based on deep reinforcement learning of claim 1, wherein the channel state information of the users in step S2 is collected by a base station.
6. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the step S3 deep reinforcement learning algorithm specifically includes an experience storage process, a Q estimation network training process, and an inference process;
the experience storage process comprises the following steps:
S311, inputting the current user state information into the Q estimation network, the Q estimation network outputting the Q value of each action, and with probability ε selecting the action with the maximum Q value;
S312, applying the action to the environment to obtain a reward value and the next state;
s313, storing a tuple consisting of the current state, the action, the reward and the next state in an experience replay pool, wherein the experience replay pool is used for training the neural network;
the Q estimation network training process comprises the following steps:
S321, extracting a mini-batch of data from the experience replay pool in step S313, and inputting the current state $s_{t}$ of the data into the Q estimation network to obtain the estimated value $Q_{est}(s_{t})$ of each action in the current state;
S322, inputting the next state $s_{t+1}$ of the data into the Q target network to obtain the corresponding Q value $Q(s_{t+1})$, and obtaining the target value according to $Q_{real}(s_{t+1})=r_{t}+\gamma\max Q(s_{t+1})$, where $r_{t}$ is the reward value of the previous state and $\gamma$ is the reward discount factor; the structure of the Q target network is the same as that of the Q estimation network, and when the transmission rate of the system increases, the weight parameters of the Q estimation network are given to the Q target network;
s323, calculating loss by taking the mean square error of the Q actual value and the Q estimated value as a loss function;
s324, feeding the loss value back to the Q estimation network, and optimizing a weight parameter in the Q estimation network by using an optimizer;
the reasoning process comprises the following steps:
user state information is input to the Q estimation network to select the sub-carrier with the largest Q value.
7. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the deep reinforcement learning algorithm in step S3 is run in an edge server on a base station.
8. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 6, wherein the Q estimation network and the Q target network are both composed of three layers of neural networks, the two hidden layers each have 10 nodes, and the activation function is the ReLU function.
9. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the transmission rate is calculated by the following formula:
the downlink channel on the m-th subcarrier (m = 1, …, M) from the i-th base station to the k-th user under beam c is represented as:

$$\mathbf{h}_{i,k,c}^{m}=\sqrt{\frac{1}{PL}}\sum_{p=1}^{N_{p}}\alpha_{p}\,\mathbf{a}(\theta_{p})$$

where M is the number of subcarriers, PL is the path loss, $N_{p}$ is the number of paths, $\alpha_{p}$ is the path gain, $\theta_{p}$ is the emission angle of the p-th path, and $\mathbf{a}(\theta_{p})$ is the response vector related to the emission angle;

the channel signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user is expressed as:

$$\gamma_{i,k,c}^{m}=\frac{P_{0}\bigl|(\mathbf{h}_{i,k,c}^{m})^{T}\mathbf{w}_{c}\bigr|^{2}}{I_{\mathrm{intra}}+I_{\mathrm{inter}}+N_{0}}$$

where $P_{0}$, $I_{\mathrm{intra}}$, $I_{\mathrm{inter}}$ and $N_{0}$ respectively represent the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and $\mathbf{w}_{c}$ is the component (column) of the beam matrix $\mathbf{W}$ corresponding to beam c;

the communication rate of the user is expressed as:

$$R_{k}(t)=\sum_{m=1}^{M}\frac{B}{M}\log_{2}\bigl(1+\gamma_{i,k,c}^{m}\bigr)$$

where B is the system bandwidth and M is the number of subcarriers.
CN202210839976.2A 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning Pending CN115460699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210839976.2A CN115460699A (en) 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210839976.2A CN115460699A (en) 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115460699A true CN115460699A (en) 2022-12-09

Family

ID=84296721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210839976.2A Pending CN115460699A (en) 2022-07-18 2022-07-18 Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115460699A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190268894A1 (en) * 2018-02-28 2019-08-29 Korea Advanced Institute Of Science And Technology Resource allocation method and apparatus for wireless backhaul network based on reinforcement learning
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
US20200084777A1 (en) * 2018-09-12 2020-03-12 Ambeent Wireless Bilisim ve Yazilim A.S Method and system for assigning one or more optimal wireless channels to a wi-fi access point using a cloud-based software defined network (sdn)
CN112272232A (en) * 2020-10-23 2021-01-26 北京邮电大学 Millimeter wave Internet of vehicles resource scheduling method and device, electronic equipment and storage medium
CN113727306A (en) * 2021-08-16 2021-11-30 南京大学 Decoupling C-V2X network slicing method based on deep reinforcement learning
CN114040415A (en) * 2021-11-03 2022-02-11 南京邮电大学 Intelligent reflector assisted DQN-DDPG-based resource allocation method


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JIAHANG LI等: "Deep Reinforcement Learning Based Wireless Resource Allocation for V2X Communications", 《 2021 13TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS AND SIGNAL PROCESSING (WCSP)》, 1 November 2021 (2021-11-01) *
张海波; 栾秋季; 朱江; 李方伟: "Task offloading and resource allocation based on mobile edge computing in heterogeneous vehicular networks" (in Chinese), 物联网学报, no. 03, 30 September 2018 (2018-09-30)
张雅雯: "Research on machine learning-based resource allocation algorithms in MEC systems" (in Chinese), 《信息科技》, 15 January 2022 (2022-01-15)
方维维: "Optimization of Internet of Vehicles communication resource allocation based on multi-agent deep reinforcement learning" (in Chinese), 《北京交通大学学报》, 15 April 2022 (2022-04-15)
舒锦; 卫国: "Adaptive resource allocation algorithm based on zero-forcing beamforming in multi-user MIMO-OFDM systems" (in Chinese), 中国科学院研究生院学报, no. 03, 15 May 2009 (2009-05-15)
谭俊杰; 梁应敞: "Deep reinforcement learning methods for intelligent communication" (in Chinese), 电子科技大学学报, no. 02, 30 March 2020 (2020-03-30)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination