CN115460699A - Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning - Google Patents
- Publication number
- CN115460699A (application number CN202210839976.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W4/00—Services specially adapted for wireless communication networks; Facilities therefor
- H04W4/02—Services making use of location information
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0446—Resources in time domain, e.g. slots or frames
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0453—Resources in frequency domain, e.g. a carrier in FDMA
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/046—Wireless resource allocation based on the type of the allocated resource the resource being in the space domain, e.g. beams
Abstract
The invention relates to the field of wireless air-interface resources, and in particular to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, comprising the following steps: first, users are clustered with a density clustering algorithm according to their positions and the angle information between each user and the base station, and spatial-domain wireless resources are allocated by serving users in different clusters with different beams; then, within one time slot, deep reinforcement learning allocates different frequency-band resources to different users based on the user's position, the angle between the user and the base station, the user's moving speed and direction, the covering base station, and the cluster label. The invention provides a space-time-frequency multi-domain associated resource allocation method built on multiple kinds of user information: spatial-domain resources are allocated to users in different clusters via zero-forcing beamforming, and frequency resources are allocated to different users within a time slot by deep reinforcement learning. The proposed wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning is clearly superior to both random allocation and the dueling deep reinforcement learning scheme, and is suitable for space-time-frequency resource allocation scenarios in the field of wireless communication under dynamic conditions.
Description
Technical Field
The invention relates to the field of wireless air interface resources, in particular to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning.
Background
No worldwide consensus has yet been reached on the future development of 6G technology, but in general 6G will further integrate satellite communication, AI, and big data on top of existing 5G technology to form a ubiquitous mobile communication infrastructure oriented toward 2030 and beyond. Driven by new application and technical requirements, 6G needs to introduce new performance indicators, such as higher spectral/energy/cost efficiency, higher transmission rate, lower latency, and higher connection density, coverage, degree of intelligence, and security. To meet these new requirements and performance indicators, 6G adopts a new paradigm of full coverage, full spectrum, full application, and strong security. 6G can therefore support ubiquitous heterogeneous scenarios, providing a network of interconnected things across air, space, land, and sea by means of various sensors, big data, and deep learning.
However, since 6G supports connections across sea, air, and ground, its transmission environment is very complicated. How to analyze the scale characteristics and coupling relations of space-time-frequency multi-domain resources, and how to mine the relevance and reciprocity of multi-domain resources so as to realize unified orchestration and management of resources, is an important technical challenge.
In the existing wireless network configuration scheme based on Q reinforcement learning, Q-network reinforcement learning is mainly used to optimize the allocation of wireless network resources according to the network state, but how the wireless resources are allocated remains unclear. A reinforcement-learning-based resource allocation optimization method and system implementation mainly allocates resource blocks to user services according to information such as downlink bandwidth, the number of physical resource blocks, the amount of user traffic to be transmitted, resource block characteristics, and downlink characteristics. Existing schemes mainly use reinforcement learning to allocate resources in a single wireless transmission domain (spatial, time, or frequency); there is little research on space-time-frequency multi-domain resources for wireless transmission.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, solving the problem of resource allocation with wireless transmission space-time-frequency multi-domain association.
In order to achieve the purpose, the invention adopts the technical scheme that:
a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning comprises the following steps:
s1, clustering users by adopting a density clustering algorithm, wherein the input of the density clustering algorithm comprises user position information and an angle sine value between a user and a base station, and the output of the density clustering algorithm is a user clustering label;
s2, configuring different beams on a space domain for different clusters in the step S1 by adopting a zero-forcing beam forming method, wherein channel state information required by zero-forcing beam forming is average channel state information of users in each cluster;
and S3, taking the user's position, moving speed and moving direction, the base station covering the user, the angle between the user and the base station, and the user's cluster label as the state, selecting a subcarrier as the action, and taking the transmission rate as the reward; in order to maximize the total transmission rate of the system, a deep reinforcement learning algorithm learns a subcarrier resource allocation method within one time slot for different users.
On the basis of the scheme, the initial position information of each user in step S1 is randomly generated within the coverage of the base station; the positions of vehicle users are generated on the road, and the positions of pedestrian users are generated off the road.
On the basis of the above scheme, the main body of the implementation of the density clustering algorithm in the step S1 is an edge server on the base station.
On the basis of the above scheme, the average channel state information matrix of the clusters in step S2 is H, and the beam matrix obtained by the zero-forcing beamforming scheme is W = H^T(HH^T)^{-1}.
On the basis of the above scheme, the channel state information of the user in step S2 is collected by the base station.
On the basis of the scheme, the deep reinforcement learning algorithm in the step S3 specifically comprises an experience storage process, a Q estimation network training process and an inference process;
the experience storage process comprises the following steps:
s311, inputting the current user state information into a Q estimation network, outputting the Q value of each action by the Q estimation network, and selecting the action with the maximum Q value according to the probability epsilon;
s312, applying the action to the environment to obtain a reward value and the next state;
s313, storing a tuple consisting of the current state, the action, the reward and the next state in an experience replay pool, wherein the experience replay pool is used for training the neural network;
the Q estimation network training process comprises the following steps:
s321, extracting a small batch of data from the experience replay pool of step S313, and inputting the current state s_t of the data into the Q estimation network to obtain the estimated value Q_est(s_t) of each action in the current state;
S322, inputting the next state s_{t+1} of the data into the Q target network to obtain the corresponding Q value Q(s_{t+1}), and obtaining the actual Q value according to Q_real(s_{t+1}) = r_t + γ·max Q(s_{t+1}), where r_t is the reward value of the previous state and γ is the reward decay factor; the structure of the Q target network is the same as that of the Q estimation network, and when the transmission rate of the system increases, the weight parameters of the Q estimation network are copied to the Q target network;
s323, calculating loss by taking the mean square error of the Q actual value and the Q estimated value as a loss function;
s324, feeding the loss value back to the Q estimation network, and optimizing weight parameters in the Q estimation network by using an optimizer;
the reasoning process comprises the following steps:
user state information is input to the Q estimation network to select the sub-carrier with the largest Q value.
On the basis of the above scheme, the deep reinforcement learning algorithm in step S3 is run in an edge server on the base station.
On the basis of the scheme, the Q estimation network and the Q target network each consist of a three-layer neural network in which the two hidden layers have 10 nodes each and the activation function is the ReLU function.
On the basis of the scheme, the transmission rate is calculated by the following formula:
the downlink channel on the m-th subcarrier from the i-th base station to the k-th user under beam c is represented as:
h_{i,k}^{m,c} = √(PL) · Σ_{p=1}^{P} α_p · a(θ_p),
where M is the number of subcarriers, PL is the path loss, P is the number of paths, α_p is the gain of the p-th path, θ_p is the emission angle of the p-th path, and a(θ_p) is the response vector related to the emission angle;
the signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user is expressed as:
SINR_{i,k}^{m,c} = P_0 |h_{i,k}^{m,c} · w_c|² / (I_intra + I_inter + N_0),
where P_0, I_intra, I_inter, and N_0 respectively denote the transmission power, the interference between users covered by the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise, and w_c is the component (column) of the beam matrix W corresponding to beam c;
the communication rate of the user is expressed as:
R_k(t) = (B/M) · Σ_{m=1}^{M} log₂(1 + SINR_{i,k}^{m,c}),
where B is the system bandwidth and M is the number of subcarriers.
The wireless transmission space-time frequency resource allocation method based on deep reinforcement learning has the following beneficial effects:
1. the invention aims at wider resource range, including space, time and frequency resources.
2. The method of the invention is clearly superior to random allocation and to the dueling deep reinforcement learning scheme in resource allocation, and is suitable for space-time-frequency resource allocation scenarios in the field of wireless communication under dynamic conditions.
Drawings
The invention has the following drawings:
fig. 1 is a general flowchart of a method for allocating radio resources based on deep reinforcement learning according to the present invention;
FIG. 2 (a) is an operational diagram of a density clustering scheme employed in the present invention;
FIG. 2 (b) is a diagram of a neural network architecture employed in the present invention;
FIG. 3 is a block diagram of the deep reinforcement learning architecture of the present invention;
FIG. 4 is a convergence comparison between the proposed method and the dueling deep reinforcement learning method during training;
FIG. 5 is a comparison of the test results of the proposed method, the dueling deep reinforcement learning method, and the random allocation method.
Detailed Description
The invention relates to a wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning, which comprises the following steps:
step one: the edge server on the base station attaches a cluster label to each user using a density clustering method. The embodiment of the invention uses the user's position and the sine of the angle between the user and the base station as the input of the density clustering method; the cluster label of each user is output by density clustering and serves as the basic feature of the spatial-domain information allocated to the user.
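The density clustering of step one could be sketched as follows. The patent does not name a specific density clustering variant, so a minimal DBSCAN-style procedure is assumed here, with `eps` and `min_pts` as illustrative thresholds and the feature vector taken as [x, y, sin(angle to base station)]:

```python
import numpy as np

def density_cluster(features, eps=0.5, min_pts=3):
    """Minimal DBSCAN-style density clustering (step one, assumed variant).

    features: (K, 3) array of [x, y, sin(angle to base station)] per user.
    Returns an integer cluster label per user (-1 = unclustered/noise).
    """
    K = len(features)
    labels = np.full(K, -1)
    # pairwise Euclidean distances between user feature vectors
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(K)]
    cluster = 0
    for i in range(K):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        # expand a new cluster from core point i
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

Users whose features are density-reachable from one another receive the same label, which then drives the per-cluster beam assignment of step two.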
Step two: in order to allocate spatial-domain wireless transmission resources to users, the embodiment of the invention adopts a zero-forcing beamforming method at the multi-antenna base station side to allocate different beams to the different user clusters in the spatial domain. The channel state information required for zero-forcing beamforming is the average channel state information of the users in each cluster. From the average channel information of the different clusters, the zero-forcing beamforming method obtains the beam matrix W = H^T(HH^T)^{-1}, and different beams are allocated to the users of different clusters.
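The zero-forcing beam matrix W = H^T(HH^T)^{-1} of step two can be computed directly. The sketch below uses a real-valued average channel matrix for simplicity; a complex channel would use the conjugate transpose instead of the plain transpose:

```python
import numpy as np

def zero_forcing_beams(H_bar):
    """Beam matrix W = H^T (H H^T)^{-1} from per-cluster average channels.

    H_bar: (C, Nt) matrix whose c-th row is the average channel of cluster c.
    Columns of W are the beams; H_bar @ W equals the identity, so each beam
    nulls the other clusters' average channels (zero forcing).
    """
    return H_bar.T @ np.linalg.inv(H_bar @ H_bar.T)
```

The zero-interference property can be checked by verifying that H_bar @ W is the identity matrix.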
Step three: based on six features of the user (position, the sine of the angle between the user and the base station, moving speed, moving direction, the covering base station, and the cluster label), and in order to maximize the total transmission rate of the system, a deep reinforcement learning algorithm on the edge server of the base station learns a method of allocating subcarrier resources within one time slot for different users. First, the six user features are input into the neural network as the current state. Then, an action to be performed is selected based on the Q values predicted by the neural network: with probability ε the action with the largest Q value is selected, and otherwise an action is selected at random. The action is then applied to the environment and, combined with the beam information, generates a reward value and the next state. The tuple {current state, action, reward, next state} is stored in the experience replay pool and used in the training process of deep reinforcement learning. Finally, the neural network is trained by the deep reinforcement learning method so that it can allocate space-frequency resources to users while maximizing the transmission rate within one time slot.
In step three, the embodiment of the present invention provides a three-layer Q estimation network architecture and a three-layer Q target network architecture. The two networks have the same structure but different parameters; the Q target network is used only to store previous parameter values of the Q estimation network. The first layer takes the 6-dimensional feature vector as input and outputs 10-node features; the second layer takes the 10-node features as input and outputs 10-node features; the third layer takes the 10-node features as input and outputs 14 predicted values, 14 being the number of sub-bands.
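The 6 → 10 → 10 → 14 forward pass described above might look as follows. Plain NumPy is used here in place of a deep learning framework, and the weight initialization scale is an assumption not stated in the text:

```python
import numpy as np

class QNet:
    """Three-layer Q network: 6 input features -> 10 -> 10 -> 14 Q values
    (one Q value per sub-band, matching the dimensions in the description)."""

    def __init__(self, rng):
        scale = 0.1  # assumed initialization scale
        self.W1, self.b1 = scale * rng.standard_normal((6, 10)), np.zeros(10)
        self.W2, self.b2 = scale * rng.standard_normal((10, 10)), np.zeros(10)
        self.W3, self.b3 = scale * rng.standard_normal((10, 14)), np.zeros(14)

    def forward(self, s):
        h1 = np.maximum(0.0, s @ self.W1 + self.b1)   # hidden layer 1, ReLU
        h2 = np.maximum(0.0, h1 @ self.W2 + self.b2)  # hidden layer 2, ReLU
        return h2 @ self.W3 + self.b3                 # linear output: Q per action
```

The Q target network would be a second instance of the same class whose parameters are periodically overwritten with those of the Q estimation network.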
The air-interface resource allocation method with wireless transmission space-time-frequency multi-domain association based on deep reinforcement learning, provided by the embodiment of the invention, uses the user's position, the sine of the angle between the user and the base station, the user's moving speed and direction, the base station to which the user belongs, and the user's cluster label; it flexibly combines density clustering and deep reinforcement learning, maximizes the converged reward value and minimizes the loss function through training, and allocates wireless space-frequency resources to different users within a time slot. The method enables the allocation of wireless transmission resources across the space, time, and frequency dimensions.
The present invention will be described in further detail with reference to the accompanying drawings.
First, the scenario is determined: suppose there are two base stations BS1 and BS2 at the roadside, each equipped with N_t antennas and an MEC server. The MEC server is responsible for collecting user information, and the MEC servers can communicate with each other. Within the coverage of the base stations, the initial positions of K vehicle and pedestrian users in total are randomly generated; vehicle users can only be generated on a road, and pedestrian users off the road. In addition, a moving speed and direction are set for each vehicle and pedestrian user. To maximize the total communication rate of the system, wireless transmission space (i.e., beam) and frequency (i.e., subcarrier) resources are allocated to different users in one time slot.
The downlink channel on the m-th subcarrier from the i-th base station to the k-th user under beam c may be represented as:
h_{i,k}^{m,c} = √(PL) · Σ_{p=1}^{P} α_p · a(θ_p),
where M is the number of subcarriers, PL is the path loss, P is the number of paths (covering both line-of-sight (LOS) and non-line-of-sight (NLOS) transmission paths), α_p is the path gain, θ_p is the emission angle of the p-th path, and a(θ_p) is the response vector related to the emission angle. It is worth noting that in some cases only the LOS path exists.
Spatial-domain resources are allocated by assigning users to different clusters through density clustering, according to the users' positions and the angle information between the users and the base station, and allocating different beams to the different clusters. The characteristic channel of each cluster is the average channel matrix H̄ of the user channels in the cluster. In the present invention, zero-forcing beamforming technology is used to allocate different beams to different clusters, with beam matrix W = H̄^T(H̄H̄^T)^{-1}, where C is the number of clusters (one beam per cluster).
The signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user can be expressed as:
SINR_{i,k}^{m,c} = P_0 |h_{i,k}^{m,c} · w_c|² / (I_intra + I_inter + N_0),
where P_0, I_intra, I_inter, and N_0 respectively denote the transmission power, the inter-user interference within the coverage of the same base station, the interference from users covered by different base stations using the same subcarrier, and the Gaussian noise.
The communication rate of a user can be expressed as:
R_k(t) = (B/M) · Σ_{m=1}^{M} log₂(1 + SINR_{i,k}^{m,c}),
where B is the system bandwidth and M is the number of subcarriers.
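A minimal sketch of this rate computation, assuming the per-subcarrier SINR values have already been obtained from the channel and beam matrices:

```python
import numpy as np

def user_rate(sinr_per_subcarrier, bandwidth):
    """R_k = (B / M) * sum_m log2(1 + SINR_m) over the M subcarriers."""
    sinr = np.asarray(sinr_per_subcarrier, dtype=float)
    M = sinr.size
    return bandwidth / M * float(np.sum(np.log2(1.0 + sinr)))
```

The system objective of the patent is then the sum of `user_rate` over all K users, which serves as the reward signal for the deep reinforcement learning agent.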
The invention adopts a density clustering and deep reinforcement learning method to configure different beams and subcarriers for different users in a time slot so as to maximize the total rate of the system.
Deep reinforcement learning has the following key factors, which are set as follows:
1. state s t : the state space mainly comprises the information of the position of the user, the angle between the user and the base station, the moving speed of the user, the moving direction of the user, the coverage condition of the base station and the clustering condition.
2. Action a t : selecting different sub-carriers for a user;
3. reward function r t : user communication rate R k (t)。
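Given these factors, the action-selection rule described in step three can be sketched as follows. Note that, following the document's convention, the greedy (argmax) action is taken with probability ε and a random action otherwise, which is the reverse of the more common ε-greedy convention:

```python
import numpy as np

def choose_action(q_values, epsilon, rng):
    """Action selection per the document's convention: with probability
    epsilon, exploit (pick the subcarrier with the largest Q value);
    otherwise, explore by picking a subcarrier uniformly at random."""
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))
```

During inference (after training), ε is effectively 1 and the agent always selects the subcarrier with the largest Q value.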
On the basis, the invention adopts the general flow chart based on deep reinforcement learning wireless resource allocation shown in fig. 1 to allocate space-time-frequency resources to users, and the specific flow is as follows:
s1: initialize the experience replay pool D, the Q target network weight parameters, the Q estimation network weight parameters θ, the step counter Step = 0, and the update flag ReQP = 0;
s2: initialize the current state s_t. First, randomly generate the position data of the users, and generate user moving speed and direction data according to actual conditions; then determine which base station covers each user, and compute the angle between the user and the base station; finally, cluster the users by the density clustering method to obtain each user's cluster label;
s3: according to state s_t, with probability ε select the action maximizing Q_est(s_t), i.e., a_t = argmax Q(s_t); otherwise, select a random action a_t;
s4: apply action a_t to the environment and interact with it to compute the reward function r_t of each user, the next state s_{t+1}, and a termination flag (True or False);
s5: store (s_t, a_t, r_t, s_{t+1}) in the experience replay pool D;
s6: judge whether there is enough data in the experience replay pool and whether the number of program steps is a multiple of 50, i.e., whether Step > 200 and Step mod 50 = 0 both hold; if so, perform S7; otherwise, perform S14;
s7: judge whether the Q target network parameters need to be updated, i.e., whether ReQP = 1 holds; if so, perform S8; otherwise, perform S9-S13;
s8: update the parameters of the Q target network to the weight parameters θ of the Q estimation network;
s9: randomly sample a mini-batch of 128 samples from the experience replay pool D;
s10: feed s_t of the mini-batch into the Q estimation network to output the estimated value Q_est(s_t); feed s_{t+1} of the mini-batch into the Q target network to output the predicted value Q(s_{t+1});
s11: select the action a_{t+1} with the maximum value of Q(s_{t+1}) and compute the actual value Q_real(s_{t+1}) = r_t + γ·max Q(s_{t+1});
S12: compute the loss function loss = E[(Q_real(s_{t+1}) − Q_est(s_t))²];
S13: based on the loss value, optimize the weight parameters θ of the Q estimation network with the ADAM optimizer;
s14: judge whether the termination flag is True; if True, perform S15-S16, otherwise perform S17;
s15: compute the average system rate;
s16: judge whether the system rate of this round is greater than that of the previous round; if so, ReQP = 1, otherwise ReQP = 0, and perform S17.
S17: the current state s t Is replaced by the next state s t+1 Step = Step +1, return to S3;
s18: judging whether the current number of rounds reaches the maximum number of rounds, if so, ending; otherwise, return to S2.
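The target computation of S11 and the loss of S12 can be sketched as follows, batched over a mini-batch; γ = 0.9 below is an illustrative value, as the patent does not state the decay factor:

```python
import numpy as np

def td_targets(rewards, q_next, gamma=0.9):
    """Step S11: Q_real(s_{t+1}) = r_t + gamma * max_a Q(s_{t+1}, a), batched.

    rewards: (batch,) rewards r_t.
    q_next:  (batch, n_actions) outputs of the Q target network for s_{t+1}.
    """
    return np.asarray(rewards) + gamma * np.max(np.asarray(q_next), axis=1)

def mse_loss(q_real, q_est):
    """Step S12: loss = E[(Q_real - Q_est)^2] over the mini-batch."""
    return float(np.mean((np.asarray(q_real) - np.asarray(q_est)) ** 2))
```

The resulting loss value is then fed to the optimizer of step S13 to update the Q estimation network weights.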
Fig. 2 (a) is an operation diagram of the density clustering scheme adopted by the present invention: the input features of density clustering are the user's position and the sine of the angle between the user and the covering base station, and the user's cluster label is output after density clustering.
FIG. 2 (b) is a diagram of the neural network structure adopted by the present invention. The inputs of the neural network are six features: the user's position, moving speed, moving direction, the base station covering the user, the angle between the user and the base station, and the user's cluster label. Two hidden layers with 10 nodes each are in the middle, and the output is the Q value corresponding to each action.
FIG. 3 is a structural diagram of the detailed deep reinforcement learning adopted by the present invention, which is divided into two parts of experience storage and learning.
In the experience storage process, the Q estimation network takes the six user features as input, outputs the Q value of each action, and selects the action with the maximum Q value with probability ε. The action is then applied to the environment to obtain the reward value and the next state. Finally, the tuple consisting of the current state, action, reward, and next state is stored in the experience replay pool.
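The experience replay pool described above can be sketched as a bounded buffer; the capacity value is an assumption, since the patent does not state the pool size:

```python
import random
from collections import deque

class ReplayPool:
    """Bounded experience replay pool storing (state, action, reward, next_state)."""

    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)  # oldest tuples discarded when full

    def store(self, state, action, reward, next_state):
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random mini-batch for training the Q estimation network
        return random.sample(list(self.pool), batch_size)
```

Step S9 of the flow corresponds to calling `sample` with a batch size of 128 once the pool holds enough data.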
In the learning process, when the parameters of the Q target network meet the update condition (i.e., the system rate of the current step is greater than that of the previous step), the Q target network parameters are updated to the Q estimation network parameters. Then a mini-batch of data is selected from the experience replay pool; the current state of the data is input into the Q estimation network to estimate the Q value of each action in the current state, i.e., Q_est(s_t); the next state of the data is input into the Q target network to obtain the corresponding Q value, and the actual value is obtained according to Q_real(s_{t+1}) = r_t + γ·max Q(s_{t+1}). The loss function is then computed as loss = E[(Q_real(s_{t+1}) − Q_est(s_t))²]. The loss is fed back to the Q estimation network, and the parameters of the Q estimation network are optimized with the ADAM optimizer. This is repeated until the maximum number of cycles is reached.
Fig. 4 is a convergence comparison between the proposed method and the dueling deep reinforcement learning method during training; it can be seen that the convergence result of the present invention is clearly superior to that of the dueling method.
Fig. 5 is a comparison of the test results of the proposed deep reinforcement learning method, the dueling method, and the random allocation method; it can be seen that the method of the present invention is clearly superior to both the random method and the dueling deep reinforcement learning method.
Those not described in detail in this specification are well within the skill of the art.
Claims (9)
1. A wireless transmission space-time-frequency resource allocation method based on deep reinforcement learning is characterized by comprising the following steps:
s1, clustering users by adopting a density clustering algorithm, wherein the input of the density clustering algorithm comprises user position information and an angle sine value between a user and a base station, and the output of the density clustering algorithm is a user clustering label;
s2, configuring different beams on a space domain for different clusters in the step S1 by adopting a zero-forcing beam forming method, wherein channel state information required by the zero-forcing beam forming is average channel state information of users in each cluster;
and S3, taking the user's position, moving speed and moving direction, the base station covering the user, the angle between the user and the base station, and the user's cluster label as the state, selecting a subcarrier as the action, and taking the transmission rate as the reward; in order to maximize the total transmission rate of the system, a deep reinforcement learning algorithm learns a subcarrier resource allocation method within one time slot for different users.
2. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein in step S1, the initial position information of the users is randomly generated within the coverage of the base station: vehicle users are generated on roads, and pedestrian users are generated off roads.
3. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the implementation subject of the density clustering algorithm in step S1 is an edge server on a base station.
4. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the average channel state information matrix of the clusters in step S2 is H, and the beam matrix obtained with the zero-forcing beamforming scheme is W = H^T (H H^T)^(-1).
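A minimal pure-Python sketch of the claim-4 computation W = H^T (H H^T)^{-1}, written out for the two-cluster case (a real system would use a linear-algebra library and an arbitrary number of clusters):

```python
def zero_forcing_beams(H):
    """Zero-forcing beam matrix W = H^T (H H^T)^{-1} for a 2 x Nt
    average-CSI matrix H (two clusters). By construction H W = I,
    which nulls inter-cluster interference."""
    n_t = len(H[0])
    # G = H H^T (2x2 Gram matrix)
    g = [[sum(H[i][k] * H[j][k] for k in range(n_t)) for j in range(2)]
         for i in range(2)]
    # Explicit 2x2 inverse of G
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    inv = [[g[1][1] / det, -g[0][1] / det],
           [-g[1][0] / det, g[0][0] / det]]
    # W = H^T G^{-1}  (Nt x 2)
    return [[sum(H[i][k] * inv[i][j] for i in range(2)) for j in range(2)]
            for k in range(n_t)]
```

Since H W = H H^T (H H^T)^{-1} = I, each cluster's beam places a null toward the other cluster's averaged channel.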
5. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the channel state information of the users in step S2 is collected by the base station.
6. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the step S3 deep reinforcement learning algorithm specifically includes an experience storage process, a Q estimation network training process, and an inference process;
the experience storage process comprises the following steps:
s311, inputting the current user state information into a Q estimation network, outputting the Q value of each action by the Q estimation network, and selecting the action with the maximum Q value according to the probability epsilon;
s312, applying the action to the environment to obtain a reward value and the next state;
s313, storing a tuple consisting of the current state, the action, the reward and the next state in an experience replay pool, wherein the experience replay pool is used for training the neural network;
the Q estimation network training process comprises the following steps:
s321, extracting a small batch of data from the experience replay pool of step S313, inputting the current state s_t in the data into the Q estimation network, and estimating the Q value Q_est(s_t) of each action in the current state;
s322, inputting the next state s_{t+1} in the data into the Q target network to obtain the corresponding Q value Q(s_{t+1}), and obtaining the actual Q value according to Q_real(s_{t+1}) = r_t + γ·max Q(s_{t+1}), wherein r_t is the reward value of the previous state and γ is the reward decay factor; the Q target network has the same structure as the Q estimation network, and when the transmission rate of the system increases, the weight parameters of the Q estimation network are assigned to the Q target network;
s323, calculating loss by taking the mean square error of the Q actual value and the Q estimated value as a loss function;
s324, feeding the loss value back to the Q estimation network, and optimizing a weight parameter in the Q estimation network by using an optimizer;
the reasoning process comprises the following steps:
user state information is input to the Q estimation network to select the sub-carrier with the largest Q value.
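The experience-storage steps S311–S313 above can be sketched as follows (the Q network itself is omitted; note that the claim text selects the greedy action with probability ε, so that convention is kept here):

```python
import random
from collections import deque

# Experience replay pool of (state, action, reward, next_state) tuples (capacity illustrative)
replay_pool = deque(maxlen=10_000)

def select_action(q_values, epsilon):
    """S311: with probability epsilon pick the subcarrier (action) with the
    largest Q value; otherwise explore a random subcarrier."""
    if random.random() < epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return random.randrange(len(q_values))

def store(state, action, reward, next_state):
    """S313: store the transition tuple for later minibatch training (S321)."""
    replay_pool.append((state, action, reward, next_state))
```

At inference time, the same greedy rule with ε = 1 reduces to simply choosing the subcarrier with the largest Q value, as in the inference process above.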
7. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the deep reinforcement learning algorithm in step S3 is run in an edge server on a base station.
8. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 6, wherein the Q estimation network and the Q target network are both composed of three layers of neural networks, the two hidden layers have 10 nodes each, and the activation function is the ReLU function.
9. The deep reinforcement learning-based wireless transmission space-time-frequency resource allocation method according to claim 1, wherein the transmission rate is calculated by the following formula:
the downlink channel on the m-th subcarrier from the i-th base station to the k-th user under beam c is represented as:
where M is the number of subcarriers, PL is the path loss, and the remaining symbols denote the number of paths, the path gain, the emission angle of the p-th path, and the response vector related to the emission angle;
the signal-to-interference-plus-noise ratio on the m-th subcarrier under beam c from the i-th base station to the k-th user is expressed as:
in the formula, po,And N 0 Respectively representing transmission power, covered by the same base stationInter-user interference of (a), interference of users covered by different base stations using the same subcarrier, and gaussian noise,is a component of the beam matrix W;
the communication rate of the user is expressed as R = (B/M) · Σ_m log2(1 + SINR_m), summed over the M subcarriers, where B is the system bandwidth and M is the number of subcarriers.
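Assuming the standard Shannon-rate form implied by the definitions of B and M (the formula image itself is not reproduced in the text, so this form is an assumption), the per-user rate could be computed as:

```python
import math

def user_rate(sinrs, bandwidth):
    """Per-user communication rate with the bandwidth B split evenly over the
    M subcarriers: R = (B / M) * sum_m log2(1 + SINR_m).
    `sinrs` holds the per-subcarrier SINR values for this user."""
    m = len(sinrs)
    return (bandwidth / m) * sum(math.log2(1 + s) for s in sinrs)
```

Summing this quantity over all users gives the total system transmission rate that the deep reinforcement learning agent of step S3 is rewarded for maximizing.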
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210839976.2A CN115460699A (en) | 2022-07-18 | 2022-07-18 | Wireless transmission space-time frequency resource allocation method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115460699A true CN115460699A (en) | 2022-12-09 |
Family
ID=84296721
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190268894A1 (en) * | 2018-02-28 | 2019-08-29 | Korea Advanced Institute Of Science And Technology | Resource allocation method and apparatus for wireless backhaul network based on reinforcement learning |
CN110493826A (en) * | 2019-08-28 | 2019-11-22 | 重庆邮电大学 | A kind of isomery cloud radio access network resources distribution method based on deeply study |
US20200084777A1 (en) * | 2018-09-12 | 2020-03-12 | Ambeent Wireless Bilisim ve Yazilim A.S | Method and system for assigning one or more optimal wireless channels to a wi-fi access point using a cloud-based software defined network (sdn) |
CN112272232A (en) * | 2020-10-23 | 2021-01-26 | 北京邮电大学 | Millimeter wave Internet of vehicles resource scheduling method and device, electronic equipment and storage medium |
CN113727306A (en) * | 2021-08-16 | 2021-11-30 | 南京大学 | Decoupling C-V2X network slicing method based on deep reinforcement learning |
CN114040415A (en) * | 2021-11-03 | 2022-02-11 | 南京邮电大学 | Intelligent reflector assisted DQN-DDPG-based resource allocation method |
Non-Patent Citations (6)
Title |
---|
JIAHANG LI et al.: "Deep Reinforcement Learning Based Wireless Resource Allocation for V2X Communications", 2021 13TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS AND SIGNAL PROCESSING (WCSP), 1 November 2021 (2021-11-01) *
ZHANG HAIBO; LUAN QIUJI; ZHU JIANG; LI FANGWEI: "Task Offloading and Resource Allocation Based on Mobile Edge Computing in Heterogeneous Vehicular Networks", Chinese Journal on Internet of Things, no. 03, 30 September 2018 (2018-09-30) *
ZHANG YAWEN: "Research on Machine-Learning-Based Resource Allocation Algorithms in MEC Systems", Information Science and Technology, 15 January 2022 (2022-01-15) *
FANG WEIWEI: "Optimization of Vehicular Network Communication Resource Allocation Based on Multi-Agent Deep Reinforcement Learning", Journal of Beijing Jiaotong University, 15 April 2022 (2022-04-15) *
SHU JIN; WEI GUO: "Adaptive Resource Allocation Algorithm Based on Zero-Forcing Beamforming in Multi-User MIMO-OFDM Systems", Journal of the Graduate School of the Chinese Academy of Sciences, no. 03, 15 May 2009 (2009-05-15) *
TAN JUNJIE; LIANG YINGCHANG: "Deep Reinforcement Learning for Intelligent Communications", Journal of University of Electronic Science and Technology of China, no. 02, 30 March 2020 (2020-03-30) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||