CN111901862A - User clustering and power distribution method, device and medium based on deep Q network - Google Patents

User clustering and power distribution method, device and medium based on deep Q network

Info

Publication number
CN111901862A
Authority
CN
China
Prior art keywords
network
power distribution
user
training
power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010643958.8A
Other languages
Chinese (zh)
Other versions
CN111901862B (en)
Inventor
张国梅
曹艳梅
李国兵
史晔钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010643958.8A priority Critical patent/CN111901862B/en
Publication of CN111901862A publication Critical patent/CN111901862A/en
Application granted granted Critical
Publication of CN111901862B publication Critical patent/CN111901862B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04W — WIRELESS COMMUNICATION NETWORKS
    • H04W 52/00 — Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W 52/04 — TPC
    • H04W 52/30 — TPC using constraints in the total amount of available transmission power
    • H04W 52/34 — TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a user clustering and power distribution method, device and medium based on a deep Q network. The user clustering and power distribution problem is modeled as a joint optimization problem; a BP neural network is established to realize the power distribution function in the joint optimization problem; the BP neural network is trained with a training data set, the network is tested, and the BP neural network model is saved, so that power distribution results under different channel conditions are obtained and power distribution is realized; the user clustering problem is modeled as a reinforcement learning task; a deep Q network is constructed according to the reinforcement learning task; after online training of the network starts, the deep Q network is trained according to the input state and the best action is selected as the best clustering result, thereby realizing user clustering. The invention can reduce the complexity of online computation, ensure user fairness to a certain extent, and effectively improve the spectral efficiency of the system.

Description

User clustering and power distribution method, device and medium based on deep Q network
Technical Field
The invention belongs to the technical field of resource allocation in a communication system, and particularly relates to a user clustering and power allocation method, device and medium based on a deep Q network.
Background
In the face of the current situation that radio spectrum resources are severely scarce and the spectrum utilization of existing communication links is close to its limit, how to further improve the spectral efficiency and system capacity and meet requirements such as large traffic, massive connections and high reliability in the full-scenario applications of future radio communication systems is a key problem that research in the field of radio communication urgently needs to solve. Non-orthogonality and large dimensionality are considered effective ways to improve spectrum resource utilization. In 2010, NTT DoCoMo Corporation of Japan first proposed the Non-orthogonal Multiple Access (NOMA) technique based on Successive Interference Cancellation (SIC) reception; by resource multiplexing in the power domain, the system spectral efficiency and the number of user connections can be multiplied, thereby meeting the requirement of massive access. Power-domain NOMA can effectively improve the spectral efficiency and the number of user connections by virtue of its non-orthogonal nature, is easy to combine with other technologies, and is considered one of the key technologies of future wireless communication systems. The massive MIMO technology proposed in the same period as NOMA has been adopted by the 3GPP Release 15 standard; because it can fully exploit spatial-domain resources with a large-dimensional antenna array to obtain a significant improvement in spectral efficiency, it plays an important role in realizing the large capacity of 5G systems and will continue to be one of the candidate physical-layer technologies of future wireless communication systems. By combining NOMA with massive MIMO, the degrees of freedom of the power domain and the spatial domain can be exploited simultaneously, so that the peak rate and spectral efficiency of the system are further improved and the demand of explosive traffic growth can be effectively met; the combination thus becomes a key candidate technology for the physical layer of future wireless communication systems.
In the face of the future wireless communication requirements of massive-traffic access, ultra-large capacity, ultra-low latency, ultra-dense networking and ultra-high reliability, the resource management and transmission technology system of traditional wireless communication is greatly challenged. Meanwhile, the massive data of wireless communication systems generated by large connections, large dimensionality, large bandwidth and high density provide abundant data for applying Artificial Intelligence (AI) to future wireless communication. Therefore, intelligent communication is considered a mainstream development direction of wireless communication systems beyond 5G. In academia, researchers are exploring the organic integration of AI into various aspects of wireless communication systems and have preliminarily demonstrated the performance improvement that AI technology brings to wireless communication systems. In recent years, research on intelligent wireless communication has gradually advanced to the physical layer, and Deep Learning (DL) technology has been used in channel estimation, signal detection, channel feedback and reconstruction, channel decoding and other aspects, even replacing traditional baseband processing modules to directly implement end-to-end wireless communication systems. Although these research works have achieved certain performance gains, they are still far from the expected goal that intelligent communication breaks through the performance constraints of the traditional wireless communication paradigm and achieves a great improvement in wireless transmission performance, and further intensive research is needed.
After NOMA is combined with massive MIMO, the number of antennas increases markedly, the number of served users multiplies, and the user distribution becomes denser. On the one hand, the contradiction between transmission efficiency and accumulated SIC error propagation faced in user clustering becomes more prominent; on the other hand, the contradiction between intra-cluster beam coverage enhancement and inter-cluster interference suppression becomes more prominent. To resolve these two pairs of contradictions, user clustering, power allocation and beamforming need to be considered as a whole and jointly optimized from the viewpoint of improving overall system performance. However, the channel characteristics of different users are complex, and it is difficult for conventional methods to capture the potential relations between users. Meanwhile, the solution space of the optimization problem is huge, and a nonlinear search process is unavoidable. Therefore, it is difficult to obtain good user clustering and power allocation results with conventional methods, and the performance of the NOMA system remains greatly limited. Research shows that existing studies on massive MIMO-NOMA have not yet formed a comprehensive intelligent solution for the system: the research perspective is narrow and the deep learning network structures used are rigid, which is a main reason why system performance remains limited, so a further technical breakthrough is urgently needed.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, device and medium for user clustering and power allocation based on a deep Q network, which significantly improve the spectrum efficiency of a system, in view of the above disadvantages in the prior art.
The invention adopts the following technical scheme:
a user clustering and power distribution method based on a deep Q network comprises the following steps:
s1, modeling by using a user clustering and power distribution problem, wherein the optimization target is that the system and the rate are maximum, and the constraint conditions are power constraint and total user number constraint;
s2, setting minimum transmission rate constraint, obtaining a power distribution result by adopting a full search power distribution method, taking the power distribution result as a training label, forming a training data set of the network by using the channel information and the power distribution result, and establishing a BP neural network;
s3, training the BP neural network by using the training data set obtained in the step S2, and training the BP neural network until the mean square error is 10-4Testing the network and storing the BP neural network model to obtain power distribution results under different channel conditions, thereby realizing power distribution;
s4, modeling the user clustering problem into a reinforcement learning task, determining a state space as a combination of user channel information, determining an action space as all grouping conditions, and determining a reward function as a system and a rate;
s5, constructing a deep Q network according to the reinforcement learning task in the step S4, determining network input as a combination of a state space and an action space, outputting as a system and a speed, and initializing parameters and the number of hidden layers of the deep Q network and the Q label network; after the network is trained on line, the deep Q network is trained according to the input state, and the best action is selected as the best clustering result, so that user clustering is realized.
Specifically, in step S1, with the maximization of the system sum rate as the target, the joint optimization problem is established as follows:

$$\max_{\{\alpha_{n,k}\},\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

$$\mathrm{s.t.}\quad C1:\ \sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \le P_{\max}$$

$$C2:\ \text{each user is assigned to exactly one cluster (total user number constraint)}$$

wherein $\{\alpha_{n,k}\}$ is the set of power allocation factors, $\{U_{n,k}\}$ is the user set, $N$ is the number of clusters, $K$ is the number of users in a cluster, $R_{n,k}$ is the information transmission rate of user $k$ in the $n$th cluster, $p_{n,k}$ is the power allocated to user $k$ in the $n$th cluster, $P_{\max}$ is the maximum transmit power allowed at the base station, $\alpha_{n,k}$ is the power allocation factor of user $k$ in the $n$th cluster, and $0\le\alpha_{n,k}\le 1$ holds for all $n$ and $k$.
Specifically, in step S2, a BP neural network is used for power distribution, and the result of the exhaustive search power distribution method is used as the network training label. The label is obtained as follows: within a limited power interval, the power is discretized with step size $\Delta$; for a certain combination of channel information $\{h_{n,1},\dots,h_{n,K}\}$, the optimal power distribution result $\{p_{n,1},\dots,p_{n,K}\}$ is searched out of the discrete power set by the exhaustive search power distribution method, and the BP neural network takes this result as the training label; after training is finished, the network calculates the power distribution result from the input channel information and the total power limit.
Specifically, in step S3, the BP neural network includes an input layer, an output layer and hidden layers; the input of the BP neural network is the channel state information of the users in the cluster and the total power, and the output of the BP neural network is the power distribution result; the number of input and output nodes of the BP neural network equals the number of users in the cluster, and the number of hidden-layer nodes is adjusted according to the training result; the loss function is defined as the mean square error between the network output and the training label,

$$\mathrm{Loss}=\frac{1}{K}\sum_{k=1}^{K}\left(\hat{p}_{n,k}-p_{n,k}\right)^{2}$$

where $\hat{p}_{n,k}$ denotes the network output.
Specifically, in step S4, the reinforcement learning task includes an interacting agent and environment and consists of a state space $S$, an action space $A$, an immediate reward $R$, and the transition probability between the current state and the next state; the base station acts as the agent, the performance of the NOMA system is the environment, and the action $a_t$ taken by the agent is decided based on the expected reward that can be obtained; in each step, according to the system performance achievable in the current state $s_t$, the agent selects action $a_t$ from a plurality of actions according to the learned user clustering strategy; the environment then evolves to a new state; then, power distribution and beam forming are carried out according to the obtained user clusters, and the step reward $r_t$ is calculated and fed back to the agent.
Further, the state space $S$ is such that the channels $h_{n,k}(t)$ of all users in time slot $t$ form the current state $s_t$; the action space $A$ contains actions reaching all possible user allocation combinations, the effect of an action being to assign each user to one of the $N$ clusters; the return function is the system sum rate,

$$r_t=\sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

and the goal of reinforcement learning is to maximize the cumulative discounted return

$$G_t=\sum_{\tau=0}^{\infty}\gamma^{\tau} r_{t+\tau}$$

with discount factor $\gamma\in[0,1]$.
Specifically, in step S5, the neural network structure in the deep reinforcement learning network DQN is established for fitting the Q value, a Q label network is introduced to update the training labels, and training data are selected based on experience replay: the transition sample $(s_t,a_t,r_t,s_{t+1})$ obtained at each iteration is stored in the replay memory unit as training data, and during training a part of it is randomly taken out for training.
Further, the input of the neural network structure is the combination $(s_t,a_t)$ of the current state and the action, and the network output of the neural network structure is the estimated Q value of each action, i.e. $Q(s_t,a_t;\omega)$, where $\omega$ is the training parameter; two fully-connected layers are used as the hidden layers of the network; actions are selected at random initially, and a greedy algorithm is adopted in which a probability hyper-parameter is used to choose between a random action and the Q strategy.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
the user clustering and power distribution method based on the deep Q network can achieve the performance equivalent to that of traversal search. By ensuring that the offline use complexity of the power distribution network of the user performance is extremely low, the user clustering part considers the real-time interaction with the current environment, the network is used while training, and the complexity is negligible compared with the traversal search. Therefore, the invention provides the method for performing joint resource allocation by using the power allocation network based on the BP network and the user clustering network based on the DQN algorithm, which can remarkably improve the spectrum efficiency of the system and is superior to other schemes while ensuring the performance of edge users.
Furthermore, a joint optimization problem is established with the system sum rate as the maximization objective; the problem comprises the two sub-problems of user clustering and power distribution, and the optimization problem constrains the power and the number of users. Solving the optimization problem established on rate maximization guarantees the information transmission rate of the users to the greatest extent.
Further, in the power distribution part, a training data set obtained by exhaustive search is used to train the BP neural network; as long as the network is trained with a sufficiently realistic training data set containing as much data as possible, the trained network can be used offline.
Furthermore, for the specific BP network, the number of input and output nodes is determined according to the number of users in a cluster, which makes the physical meaning of the network clearer. The loss function is a mean square error loss function, whose good mathematical properties make gradient computation easier.
Furthermore, the user clustering problem is modeled into a specific reinforcement learning task, and the method aims to create a new idea for solving the user clustering problem, and after an agent and an environment are specified, the method is convenient for establishing a deep Q network.
Further, each part of the reinforcement learning task is given a physical meaning and a mathematical expression. In particular, the reward function is set to the system sum rate, which defines the training target of the deep Q network.
Furthermore, a Q label network is introduced into the deep Q network, so that the network can train and update labels at the same time, and the training is more accurate. In addition, the training data set of the part is obtained by adopting an experience playback method, the original data sequence can be disordered, and the historical data can be effectively utilized by extracting small batches for training.
Furthermore, the specific structure and the input and output of the deep Q network are determined; the greedy algorithm adopted in the process can generate a more complete training data set and improve the training speed. Step S5 describes the implementation procedure by which the network realizes the user clustering function, selecting the clustering result with the largest system sum rate and thereby improving the spectral efficiency.
In conclusion, the invention can reduce the complexity of online calculation, ensure the fairness of users to a certain extent and effectively improve the spectrum efficiency of the system.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a massive MIMO-NOMA system model of the present invention;
FIG. 2 is a block diagram of a massive MIMO-NOMA downlink transmission system according to the present invention;
FIG. 3 is a deep Q-network based joint optimization network of the present invention;
FIG. 4 is a diagram of an intra-cluster power allocation scheme of the present invention;
FIG. 5 is a diagram of a reinforcement learning based user clustering scheme of the present invention;
fig. 6 is a schematic diagram of the spectral efficiency of resource allocation in different schemes under the Ray-based channel model according to the embodiment of the present invention as a function of the transmission power;
fig. 7 is a CDF curve in different schemes under the Ray-based channel model in the embodiment of the present invention.
Detailed Description
User clustering: multiple users are divided into different groups, and the user groups do not overlap.
CDF: the cumulative distribution function (also called the distribution function) is the integral of the probability density function and completely describes the probability distribution of a real random variable X.
Referring to fig. 1, the present invention provides a method for user clustering and power allocation based on a deep Q network, which includes the following steps:
s1, under a specific considered scene, modeling user clustering and power distribution problems into a joint optimization problem, wherein the optimization target is that the system and the rate are maximum, and the constraint condition is that the power constraint and the total number of users are constrained;
the invention is based on a single-cell large-scale MIMO-NOMA system model, and mainly refers to the problems of user clustering and power allocation under the condition of multiple users for the problem of downlink resource allocation of the system. Therefore, in order to realize the joint optimization of user clustering, power distribution and beam forming, a close coupling optimization iterative structure of three functional modules is established by using a reinforcement learning technology.
Referring to fig. 1, in the user clustering stage, with the maximum system throughput as the target, the deep Q learning network is adopted to gradually adjust the clustering result. In the power distribution stage, a BP (Back propagation) neural network is designed, and the BP neural network takes a power distribution result obtained by an exhaustive search algorithm as an offline training label on the premise of ensuring the minimum transmission rate of a user, so that the complexity of online calculation is greatly reduced while the throughput of the user is ensured. In the outer loop iteration process, the power distribution and beam forming results are fed back to the reinforcement learning network, the deep reinforcement learning network intelligently adjusts the user cluster by taking the maximum system throughput as a target, and the effect of approximate ideal joint optimization can be achieved by iterating for multiple times.
Referring to fig. 2, a single-cell multi-user downlink is considered, where the base station is configured with $N_t$ transmit antennas to serve $L$ single-antenna users. All users in the cell are divided into $N$ clusters according to a certain rule, with $K$ users assumed in each cluster. A power-domain non-orthogonal multiple access transmission structure is adopted within each cluster, and all users of a cluster are served by the same beamforming vector. The system model is shown in fig. 4. In general, the beamforming vectors are designed to eliminate inter-cluster interference, and power allocation is then performed among the users scheduled in a cluster to form the NOMA transmission structure.
In this system, the channel is modeled according to a Ray-based channel model. The base station deploys a UPA (uniform planar array) in the y-z plane, with $N_v$ antennas spaced $d_1$ apart in the vertical direction and $N_t$ antennas spaced $d_2$ apart in the horizontal direction, and the channel comprises $L_u$ scattering paths. For simplicity, the mechanical downtilt angle of the array antenna is ignored; $\psi$ denotes the horizontal incidence angle, $\theta$ denotes the vertical incidence angle, $\sigma$ denotes the standard deviation of the horizontal angular spread, and $\xi$ denotes the standard deviation of the vertical angular spread. For each scattering path, the random complex gain may be expressed as $g=\alpha e^{j\varphi}$, where $\alpha$ is the amplitude and $\varphi$ is the phase. The Ray-based channel vector from the $k$th user to the base station is represented as

$$\mathbf{h}_{k}=\frac{1}{\sqrt{L_u}}\sum_{l=1}^{L_u} g_{k,l}\,\mathbf{a}(u_{k,l})\otimes\mathbf{b}(v_{k,l})$$

wherein $\mathbf{b}(v_{k,l})$ represents the vertical array response and $\mathbf{a}(u_{k,l})$ represents the horizontal array response, $\lambda$ represents the carrier wavelength, $\Delta\theta_{k,l}$ represents the vertical angular spread of the $l$th path of the $k$th user, obeying the normal distribution $\Delta\theta_{k,l}\sim N(0,\sigma)$ and being mutually independent across antenna elements, and $\Delta\psi_{k,l}$ represents the horizontal angular spread of the $l$th path of the $k$th user, obeying $\Delta\psi_{k,l}\sim N(0,\xi)$ and likewise being mutually independent across antenna elements.
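The small-scale channel equations above are reproduced only as images in the original publication, so the following numpy sketch merely illustrates one standard way to generate a Ray-based UPA channel vector consistent with the description; the steering-vector form, the angle distributions, the normalization, and the function names are assumptions rather than the patent's exact formulas.

```python
import numpy as np

def steering(n_ant, spacing_wavelengths, angle):
    # One-dimensional array response for one axis of the UPA (assumed standard form).
    return np.exp(1j * 2 * np.pi * spacing_wavelengths * np.arange(n_ant) * np.sin(angle))

def ray_based_channel(n_v, n_h, d1, d2, n_paths):
    # Sum over n_paths scattered rays; each ray has a random complex gain g = alpha * exp(j*phi)
    # and random vertical/horizontal incidence angles; the Kronecker product combines the
    # horizontal and vertical array responses (antenna spacings d1, d2 given in wavelengths).
    h = np.zeros(n_v * n_h, dtype=complex)
    for _ in range(n_paths):
        g = (np.random.randn() + 1j * np.random.randn()) / np.sqrt(2)   # random complex path gain
        theta = np.random.uniform(-np.pi / 3, np.pi / 3)                # vertical incidence angle
        psi = np.random.uniform(-np.pi / 3, np.pi / 3)                  # horizontal incidence angle
        h += g * np.kron(steering(n_h, d2, psi), steering(n_v, d1, theta))
    return h / np.sqrt(n_paths)

# Example: 8 x 8 UPA (64 antennas), half-wavelength spacing, 10 scattering paths
h_k = ray_based_channel(n_v=8, n_h=8, d1=0.5, d2=0.5, n_paths=10)
```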
Assuming the channel is a flat block-fading channel and considering large-scale fading, the channel vector from user $k$ in the cell to the base station is represented as

$$\mathbf{h}_{k}=\sqrt{\beta_{k}}\,\tilde{\mathbf{h}}_{k}$$

wherein $\tilde{\mathbf{h}}_{k}$ is the $M\times 1$ small-scale channel vector and $\beta_{k}$ is the large-scale path-loss and shadow-fading coefficient, determined by $d_{k}$, the distance from user $k$ to the base station, $d_{0}$, the cell radius, and $\lambda$, the path-loss coefficient.

It is assumed here that the distance between a user and the base station is much larger than the physical size of the base-station antenna array, so the large-scale information between the same user and the $M$ base-station antennas is considered unchanged. Therefore, the channel matrix from the $K$ users to the base station is

$$\mathbf{H}=[\mathbf{h}_{1},\mathbf{h}_{2},\dots,\mathbf{h}_{K}]\in\mathbb{C}^{M\times K}$$

Referring to fig. 3, assume $\mathbf{x}=[x_{1}\ x_{2}\ x_{3}\ \dots\ x_{N}]^{T}\in\mathbb{C}^{N\times 1}$ is the data transmitted by the base station, where

$$x_{n}=\sum_{k=1}^{K}\sqrt{\alpha_{n,k}P_{n}}\,s_{n,k}$$

is the NOMA signal of cluster $n$, $P_{n}$ is the total transmit power of the $n$th cluster, $\alpha_{n,k}$ is the power allocation factor of each user in the cluster, $s_{n,k}$ is the signal of the $k$th user $U_{n,k}$ in the $n$th cluster, and $E[|s_{n,k}|^{2}]=1$. The power-superposed signal of each cluster is preprocessed by a beamforming vector to obtain the transmitted signal

$$\mathbf{s}=\mathbf{W}\mathbf{x}=\sum_{n=1}^{N}\mathbf{w}_{n}x_{n}$$

wherein $\mathbf{W}=[\mathbf{w}_{1},\dots,\mathbf{w}_{N}]\in\mathbb{C}^{M\times N}$ is the beamforming matrix.

Assume the downlink channel matrix $\mathbf{H}_{n}=[\mathbf{h}_{n,1},\dots,\mathbf{h}_{n,K}]$ represents the channel state information of the $n$th cluster. The received signal of the $k$th user in the $n$th cluster is

$$y_{n,k}=\mathbf{h}_{n,k}^{H}\sum_{i=1}^{N}\mathbf{w}_{i}x_{i}+z_{n,k}$$

wherein $z_{n,k}$ is complex Gaussian noise with mean 0 and variance $\sigma^{2}$.
Besides the useful signal, the signal received by a user also contains inter-cluster interference, intra-cluster user interference and a noise term. The beamforming vector designed based on the channel information aims at eliminating inter-cluster interference, so approximately $\mathbf{h}_{n,k}^{H}\mathbf{w}_{i}\approx 0,\ i\neq n$; however, current algorithms can hardly achieve this ideal effect, so the interference term is difficult to ignore. Assuming that SIC at the receiving end detects and ideally cancels the interference of the previously decoded users, the achievable rate of user $U_{n,k}$ (the $k$th decoded user of the $n$th cluster) is

$$R_{n,k}=B\log_{2}\!\left(1+\frac{\alpha_{n,k}P_{n}\left|\mathbf{h}_{n,k}^{H}\mathbf{w}_{n}\right|^{2}}{\left|\mathbf{h}_{n,k}^{H}\mathbf{w}_{n}\right|^{2}P_{n}\sum_{j=k+1}^{K}\alpha_{n,j}+\sum_{i\neq n}\left|\mathbf{h}_{n,k}^{H}\mathbf{w}_{i}\right|^{2}P_{i}+\sigma^{2}}\right)$$

where $B$ is the bandwidth.
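As a hedged illustration only, the numpy sketch below shows how such a rate could be evaluated from the quantities defined above (channels, beamforming matrix, power allocation factors); the function name and argument layout are assumptions, and the SIC convention (earlier-decoded users cancelled, later-decoded users remaining as intra-cluster interference) follows the standard form, not a formula confirmed by the original images.

```python
import numpy as np

def achievable_rate(h_nk, W, alpha_n, P, sigma2, B, n, k):
    """Rate of the k-th decoded user in cluster n (standard NOMA form, assumed).

    h_nk    : channel vector of user (n, k), shape (M,)
    W       : beamforming matrix, shape (M, N); column i is w_i
    alpha_n : intra-cluster power allocation factors of cluster n, in decoding order k = 0..K-1
    P       : total transmit power of each cluster, shape (N,)
    """
    gain = np.abs(h_nk.conj() @ W[:, n]) ** 2
    # Intra-cluster interference from users decoded after user k (not yet cancelled by SIC)
    intra = gain * P[n] * np.sum(alpha_n[k + 1:])
    # Residual inter-cluster interference (zero-forcing is not perfect in practice)
    inter = sum(np.abs(h_nk.conj() @ W[:, i]) ** 2 * P[i]
                for i in range(W.shape[1]) if i != n)
    sinr = alpha_n[k] * P[n] * gain / (intra + inter + sigma2)
    return B * np.log2(1 + sinr)
```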
With the maximization of the system sum rate as the target, the joint optimization problem is established as follows:

$$\max_{\{\alpha_{n,k}\},\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

$$\mathrm{s.t.}\quad C1:\ \sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \le P_{\max},\qquad C2:\ \text{each user is assigned to exactly one cluster.}$$
the invention provides a combined optimization method based on a deep learning technology, which is used for realizing the combined optimization of user clustering and power distribution.
Referring to fig. 3, the joint optimization problem is solved by a machine learning algorithm and is divided into a power distribution scheme based on a BP neural network and a user clustering module based on a deep Q network; the deep Q network calculates the reward value according to the result of the power distribution network, so as to adjust the clustering result.
S2, for the power distribution part of the joint optimization, this functional module is implemented with a self-designed BP neural network; because of the supervised learning mode, the power distribution result obtained by the exhaustive search power distribution algorithm under the minimum transmission rate constraint is used as the training label;
in a large-scale MIMO-NOMA system, in order to ensure the effectiveness of a SIC receiver at a receiving end, the power of users in the same cluster needs to satisfy a certain relation. Different power allocation algorithms are different in pursuit of overall system throughput performance and user fairness performance, and power allocation is the key to realizing compromise between system and rate-fairness performance.
Referring to fig. 4, a typical Power Allocation algorithm includes Fixed Power Allocation (FPA), Fractional Power Allocation (FTPA), Exhaustive Search Power Allocation (ESPA), and the like. Although the FPA and FTPA algorithms are not computationally complex, the system performance is not ideal. The ESPA algorithm is an algorithm for pursuing the optimal system performance, but the online computation complexity is too high, and the ESPA algorithm is difficult to popularize and apply in an actual system. Different from the idea of the traditional optimization algorithm, the invention provides a power distribution algorithm based on a BP neural network.
The BP neural network has a strong nonlinear mapping capability, can automatically extract reasonable rules between input and output data through learning, and has high self-learning and self-adaptive capabilities. Therefore, a BP neural network is employed for power allocation. The result of the ESPA algorithm is taken as the network training label, and the label is obtained as follows: within a limited power interval, the power is discretized with step size $\Delta$; for a certain combination of channel information $\{h_{n,1},\dots,h_{n,K}\}$, the optimal power distribution result $\{p_{n,1},\dots,p_{n,K}\}$ is searched out of the discrete power set by the ESPA algorithm, and the BP neural network takes this result as the training label. The trained network can calculate the power distribution result from the input channel information and the total power limit, and the computational complexity can be greatly reduced.
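A minimal sketch of how such ESPA training labels could be generated is given below, assuming a simplified single-cluster NOMA rate model with ideal SIC; the grid granularity, the minimum-rate value, the requirement that the allocation factors sum to one, and the function name `espa_label` are illustrative assumptions, not values taken from the patent.

```python
import itertools
import numpy as np

def espa_label(h, P_total, sigma2, delta=0.05, r_min=0.1, B=1.0):
    """Exhaustive search power allocation (ESPA) label for one cluster (illustrative).

    h       : channel gains |h_{n,k}|^2 of the K in-cluster users, in decoding order
    delta   : discretization step of the power allocation factors
    r_min   : minimum transmission rate constraint per user
    Returns the power allocation {p_{n,1}, ..., p_{n,K}} with the largest sum rate.
    """
    K = len(h)
    grid = np.arange(delta, 1.0, delta)
    best_rate, best_p = -np.inf, None
    for alphas in itertools.product(grid, repeat=K):
        if abs(sum(alphas) - 1.0) > 1e-9:          # use exactly the total cluster power
            continue
        p = np.array(alphas) * P_total
        # Ideal SIC: user k only sees interference from users decoded after it.
        rates = [B * np.log2(1 + h[k] * p[k] / (h[k] * np.sum(p[k + 1:]) + sigma2))
                 for k in range(K)]
        if min(rates) < r_min:                     # enforce the minimum transmission rate
            continue
        if sum(rates) > best_rate:
            best_rate, best_p = sum(rates), p
    return best_p

# Example: one training label (channel info + total power -> optimal power split)
label = espa_label(h=np.array([0.2, 1.0]), P_total=1.0, sigma2=0.01)
```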
S3, a large amount of training data is obtained for training the BP neural network so that it yields power distribution results under different channel conditions; the network is trained until the mean square error reaches $10^{-4}$, after which the network is tested and the network model is saved for subsequent invocation;
the BP neural network consists of an input layer, an output layer and three hidden layers. The input of the BP neural network is channel state information and the total power of users in the cluster, and the BP neural network outputs a power distribution result; the number of input and output nodes of the BP neural network is the number of users in the cluster, and the number of nodes of the hidden layer is adjusted according to the training result. The loss function is defined as
Figure BDA0002572428890000121
The network parameters are updated accordingly to complete the training. Note that, to guarantee the generalization performance of the network, the training data should be as abundant as possible and traverse all possible channel conditions.
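A minimal PyTorch sketch of such a power distribution network is given below (three hidden layers, MSE loss, training until the mean square error reaches 10^-4, as described above). Since Table 1 with the actual configuration is reproduced only as an image, the hidden-layer width, learning rate, input layout (K channel values plus the total power), and the random placeholder data are assumptions.

```python
import torch
import torch.nn as nn

K = 2  # users per cluster

# Power distribution network: input = in-cluster channel state information + total power,
# output = power allocated to each of the K users (widths and input layout are assumed).
power_net = nn.Sequential(
    nn.Linear(K + 1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, K),
)
optimizer = torch.optim.Adam(power_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder training set: x = [|h_{n,1}|, ..., |h_{n,K}|, P_n], y = ESPA labels {p_{n,k}}.
x = torch.rand(1024, K + 1)
y = torch.rand(1024, K)

for epoch in range(20000):
    optimizer.zero_grad()
    loss = loss_fn(power_net(x), y)
    loss.backward()
    optimizer.step()
    if loss.item() <= 1e-4:        # train until the mean square error reaches 10^-4
        break

# After training, the network would be tested on held-out channel realizations and saved,
# e.g. torch.save(power_net.state_dict(), "power_net.pt"), for subsequent invocation.
```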
TABLE 1 Structure and parameter configuration of the power distribution network (K is the number of users)
S4, the user clustering problem is modeled as a reinforcement learning task, where the state space is determined as the combination of user channel information, the action space as all possible grouping cases, and the reward function as the system sum rate;

Based on the scenario in fig. 2, the user clustering problem with sum-rate maximization as the target is represented as

$$\max_{\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$
in the conventional optimization method, all allocation combinations are subjected to online traversal, and the implementation complexity is increased very rapidly as the number of users increases. To address this problem, a deep reinforcement learning framework is proposed to optimize the user clustering process of the NOMA system.
Referring to FIG. 5, the user clustering optimization problem is modeled as a reinforcement learning task consisting of an interacting agent and environment. A general reinforcement learning problem consists of four parts: the state space S, the action space A, the immediate reward R, and the transition probability between the current state and the next state. Specifically, the base station acts as the agent and the performance of the NOMA system is the environment; the action $a_t$ taken by the agent is decided based on the expected reward that can be obtained. In each step, according to the system performance achievable in the current state $s_t$, the agent selects action $a_t$ from a plurality of actions based on the learned user clustering policy. As the action is taken, the environment evolves to a new state. Then, according to the obtained user clusters, power distribution and beam forming are carried out, and the step reward $r_t$ is calculated and fed back to the agent. Learning often begins with a series of samples of states, actions and rewards obtained by random-policy exploration, and the algorithm improves the policy based on these samples to maximize the reward.
In conjunction with the system scenario of the present invention, the detailed representation of each part of the reinforcement learning framework is as follows:

State space S: the channels $h_{n,k}(t)$ of all users in time slot $t$ form the current state $s_t$, i.e.

$$s_t=\left\{[h_{1,1}(t),\dots,h_{1,K}(t)],\ \dots,\ [h_{N,1}(t),\dots,h_{N,K}(t)]\right\}$$

Action space A: the action space should contain actions that can reach all possible user allocation combinations; the purpose of an action is to select an appropriate grouping for the users, and the effect of a specific action is to assign each user to one of the $N$ clusters.

Return function: the return is obtained by selecting action $a_t$ in state $s_t$; the sum rate or the energy efficiency of the NOMA system can be used as the target. Here the system sum rate is used as the return function, which involves the power allocation factor $\alpha_{n,k}$ and the beamforming vector $\mathbf{w}_n$.
The invention first assumes that the conventional zero-forcing beamforming method is adopted; because each cluster contains several users, the channel of the user with good channel quality is selected as the equivalent channel for the calculation, in the specific form

$$\mathbf{W}=[\mathbf{w}_{1},\dots,\mathbf{w}_{N}]=\mathbf{H}^{H}\left(\mathbf{H}\mathbf{H}^{H}\right)^{-1}$$
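A small numpy sketch of this zero-forcing beamforming step is shown below, where the rows of the input matrix are the per-cluster equivalent channels; the unit-norm column normalization and the function name are added assumptions rather than details stated in the text.

```python
import numpy as np

def zero_forcing_beamforming(H_eq):
    """H_eq: N x M equivalent channel matrix (one row per cluster).

    Returns W = H^H (H H^H)^{-1} with unit-norm columns, i.e. one beamforming
    vector w_n per cluster.
    """
    W = H_eq.conj().T @ np.linalg.inv(H_eq @ H_eq.conj().T)
    return W / np.linalg.norm(W, axis=0, keepdims=True)

# Example: 4 clusters, 64 base-station antennas
W = zero_forcing_beamforming(np.random.randn(4, 64) + 1j * np.random.randn(4, 64))
```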
the goal of reinforcement learning is to maximize the cumulative discount return
Figure BDA0002572428890000143
The discount factor gamma is an element of 0,1]。
S5, constructing a deep Q network, and initializing parameters of the deep Q network and the Q label network and the number of hidden layers of the neural network. After the network starts on-line training, the deep Q network is trained according to the input state, so that the best action, namely the best clustering result, is selected.
Wherein, the sum rate is calculated according to the power distribution result of the step S3; the deep Q network trains and adjusts the clustering result and is used in the resource allocation process of signal transmission.
Deep Reinforcement Learning Network DQN (Deep Q-Learning Network)
At each observation time $t$, the agent determines the next action from the observed current state. There is thus a mapping between states and actions, which is the policy $\pi$. To evaluate the expected return of a policy, a value function needs to be defined; the state-action value function is given as

$$Q^{\pi}(s,a)=\mathbb{E}\!\left[\sum_{\tau=0}^{\infty}\gamma^{\tau}r_{t+\tau}\ \middle|\ s_{t}=s,\ a_{t}=a,\ \pi\right]$$

The above equation is nonlinear and has no closed-form solution. Thus, many iterative methods (e.g., Q-Learning) have been proposed and shown to converge to the optimal Q function. In Q-Learning, when the state and action spaces are discrete and low-dimensional, a Q-table can be used to store the Q value corresponding to each state-action pair; when the state and action spaces are high-dimensional or continuous, using a Q-table is impractical and the learning process becomes inefficient. One solution to this problem is to estimate the Q value with a neural network, which is the main idea of DQN.
In summary, the DQN is to design a neural network structure to fit the Q value, so as to be applied to reinforcement learning.
Deep neural networks in DQN
The neural network in DQN is designed with the combination $(s_t,a_t)$ of the current state and the action as input; the network output is the estimated Q value corresponding to each action, i.e. $Q(s_t,a_t;\omega)$, where $\omega$ is the training parameter. The role of the network is to fit the Q function, so two fully connected layers are used as the hidden layers of the network. Actions are selected at random initially, and the benefit of this disappears over time; therefore a greedy algorithm is adopted, in which a probability hyper-parameter is used to choose between a random action and the Q strategy.
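A minimal PyTorch sketch of this Q network and of the ε-greedy choice between a random action and the Q strategy is given below; the hidden-layer width, the encoding of states and actions as flat vectors, and the names `QNet` and `epsilon_greedy` are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    # Fits Q(s_t, a_t; w): input is the state-action combination, output the estimated Q value.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),   # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),                   # hidden layer 2
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def epsilon_greedy(q_net, state, candidate_actions, epsilon):
    # With probability epsilon take a random clustering action (exploration),
    # otherwise the action with the largest estimated Q value (the Q strategy).
    if random.random() < epsilon:
        return random.choice(candidate_actions)
    with torch.no_grad():
        q_values = torch.stack([q_net(state, a) for a in candidate_actions])
    return candidate_actions[int(q_values.argmax())]
```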
DQN introduces a Q label network on the basis of the original Q network, i.e. a network used to update the training labels. It has the same structure as the Q network and the same initial weights; the difference is that the Q network is updated at every iteration while the Q label network is updated at intervals. DQN determines the loss function based on Q-Learning, and the aim is to minimize the error between the Q label value and the Q estimate. The loss function in DQN is

$$L(\omega)=\mathbb{E}\!\left[\left(r_{t}+\gamma\max_{a'}Q\!\left(s_{t+1},a';\omega^{-}\right)-Q\!\left(s_{t},a_{t};\omega\right)\right)^{2}\right]$$

where $\omega^{-}$ denotes the parameters of the Q label network.
training data selection based on empirical playback
Since the samples for deep learning are independent and the target is fixed, the states before and after reinforcement learning are related. Therefore, empirical playback methods are used to select samples in DQN networks. The specific method is that the transfer sample(s) obtained by each iteration is usedt,at,rt,st+1) Stored in a playback memory unit as a trainerAnd (5) practicing data. During training, a part (Mini Batch) is randomly taken out for training. The specific flow is shown in algorithm 1:
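Algorithm 1 itself is reproduced only as an image in the original publication, so the following is a minimal sketch of the training flow as described in the text (replay memory, mini-batch sampling, Q label network synchronized at intervals). It reuses `QNet` and `epsilon_greedy` from the sketch above; `env_step` and `enumerate_actions`, standing for the power-allocation/beamforming environment and the set of clustering actions, as well as the hyper-parameter values, are hypothetical placeholders.

```python
import random
from collections import deque
import torch

def train_dqn(q_net, q_label_net, env_step, enumerate_actions, state,
              steps=500, batch_size=32, gamma=0.9, epsilon=0.1, sync_every=20):
    memory = deque(maxlen=10000)                          # replay memory unit
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    q_label_net.load_state_dict(q_net.state_dict())       # same initial weights

    for t in range(steps):
        action = epsilon_greedy(q_net, state, enumerate_actions(), epsilon)
        reward, next_state = env_step(action)             # sum rate after power allocation + beamforming
        memory.append((state, action, reward, next_state))    # transition sample (s_t, a_t, r_t, s_{t+1})
        state = next_state

        if len(memory) >= batch_size:
            batch = random.sample(memory, batch_size)      # mini-batch drawn from the replay memory
            loss = torch.zeros(1)
            for s, a, r, s_next in batch:
                with torch.no_grad():                      # Q label network provides the training label
                    target = r + gamma * max(q_label_net(s_next, a2)
                                             for a2 in enumerate_actions())
                loss = loss + (target - q_net(s, a)) ** 2
            loss = loss / batch_size
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if t % sync_every == 0:                            # Q label network updated at intervals
            q_label_net.load_state_dict(q_net.state_dict())
    return q_net
```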
in order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
A single-cell large-scale MIMO-NOMA scene is considered, in which the user clustering and power distribution of the downlink are realized by the method based on the deep Q network; detailed simulation parameters are shown in Table 2.
TABLE 2 Simulation parameters
Comparison schemes:
Comparison scheme one: a fractional power distribution algorithm is adopted, and an empirical-value clustering method is adopted for the user clustering part.
Comparison scheme two: a fractional power distribution algorithm is adopted, and a traversal search method is adopted for user clustering.
Comparison scheme three: a fractional power distribution algorithm is adopted, and the user clustering adopts the proposed DQN method.
Comparison scheme four: the proposed power distribution network is adopted, and users are clustered by a traversal search algorithm.
Referring to fig. 6, under the channel model set in this scheme, compared with comparison schemes one, two and three, the proposed algorithm greatly improves the spectral efficiency of the system when the transmission power is 0.02-1 W, nearly doubling it at a transmission power of 0.02 W; meanwhile, compared with comparison scheme four, the network designed in this scheme achieves performance comparable to traversal search. However, the computational complexity of traversal search increases exponentially with the number of users, whereas the offline-use complexity of the proposed power distribution network, which guarantees user performance, is extremely low; the user clustering part interacts with the current environment in real time, the network is used while being trained, and its complexity is negligible compared with traversal search. In summary, the proposed joint resource allocation using the BP-network-based power distribution network and the DQN-based user clustering network can significantly improve the spectral efficiency of the system and is superior to the other schemes.
Referring to fig. 7, the CDF curves of the method of the present invention and comparison scheme two are compared; the dotted line indicates the system performance that would be achieved by an ideal beamforming scheme, which is difficult to reach with the prior art. As can be seen from the figure, compared with comparison scheme two, the proposed scheme yields relatively better edge-user performance, i.e. user fairness is ensured while the spectral efficiency of the system is improved; with a better beamforming scheme, the performance of edge users would be guaranteed even better.
The deep Q network based user clustering and power allocation method of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The user clustering and power distribution method based on the deep Q network may be stored in a computer readable storage medium if it is implemented in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. Computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunication signals, in accordance with legislation and patent practice. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), Solid State Disk (SSD)), etc.
In an exemplary embodiment, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the Q-network based resource allocation method when executing the computer program. The processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
In summary, the method, device and medium for user clustering and power allocation based on the deep Q network of the present invention can effectively improve the spectrum efficiency of the system. Firstly, in a power distribution stage, a BP neural network is designed, and on the premise of ensuring the minimum transmission rate of a user, the power distribution result obtained by an exhaustive search algorithm is used as an offline training label, so that the user throughput is ensured, and the complexity of online calculation is greatly reduced. Secondly, in the user clustering stage, with the maximum system throughput as a target, a deep Q learning network is adopted to gradually adjust clustering results through the feedback of reward values, and a trained power distribution network is adopted in a cluster. In the outer loop iteration process, the power distribution and beam forming results are fed back to the reinforcement learning network, the deep reinforcement learning network intelligently adjusts the user cluster by taking the maximum system throughput as a target, and the effect of approximate ideal joint optimization can be achieved by iterating for multiple times. Finally, simulation verifies that the user clustering and power distribution method based on the deep Q network greatly improves the spectrum efficiency of the system while reducing the complexity.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A user clustering and power distribution method based on a deep Q network is characterized by comprising the following steps:
s1, modeling by using a user clustering and power distribution problem, wherein the optimization target is that the system and the rate are maximum, and the constraint conditions are power constraint and total user number constraint;
s2, setting minimum transmission rate constraint, obtaining a power distribution result by adopting a full search power distribution method, taking the power distribution result as a training label, forming a training data set of the network by using the channel information and the power distribution result, and establishing a BP neural network;
s3, training the BP neural network by using the training data set obtained in the step S2, and training the BP neural network until the mean square error is 10-4Testing the network and storing the BP neural network model to obtain power distribution results under different channel conditions, thereby realizing power distribution;
s4, modeling the user clustering problem into a reinforcement learning task, determining a state space as a combination of user channel information, determining an action space as all grouping conditions, and determining a reward function as a system and a rate;
s5, constructing a deep Q network according to the reinforcement learning task in the step S4, determining network input as a combination of a state space and an action space, outputting as a system and a speed, and initializing parameters and the number of hidden layers of the deep Q network and the Q label network; after the network is trained on line, the deep Q network is trained according to the input state, and the best action is selected as the best clustering result, so that user clustering is realized.
2. The method according to claim 1, wherein in step S1, with the maximization of the system sum rate as the target, the joint optimization problem is established as follows:

$$\max_{\{\alpha_{n,k}\},\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

$$\mathrm{s.t.}\quad C1:\ \sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \le P_{\max}$$

$$C2:\ \text{each user is assigned to exactly one cluster (total user number constraint)}$$

wherein $\{\alpha_{n,k}\}$ is the set of power allocation factors, $\{U_{n,k}\}$ is the user set, $N$ is the number of clusters, $K$ is the number of users in a cluster, $R_{n,k}$ is the information transmission rate of user $k$ in the $n$th cluster, $p_{n,k}$ is the power allocated to user $k$ in the $n$th cluster, $P_{\max}$ is the maximum transmit power allowed at the base station, $\alpha_{n,k}$ is the power allocation factor of user $k$ in the $n$th cluster, and $0\le\alpha_{n,k}\le 1$ holds for all $n$ and $k$.
3. The method according to claim 1, wherein in step S2, a BP neural network is used for power distribution and the result of the exhaustive search power distribution method is used as the network training label, the label being obtained as follows: within a limited power interval, the power is discretized with step size $\Delta$; for a certain combination of channel information $\{h_{n,1},\dots,h_{n,K}\}$, the optimal power distribution result $\{p_{n,1},\dots,p_{n,K}\}$ is searched out of the discrete power set by the exhaustive search power distribution method and taken as the training label; after training is finished, the network calculates the power distribution result from the input channel information and the total power limit.
4. The method according to claim 1, wherein in step S3, the BP neural network comprises an input layer, an output layer and hidden layers; the input of the BP neural network is the channel state information of the users in the cluster and the total power, and the output of the BP neural network is the power distribution result; the number of input and output nodes of the BP neural network equals the number of users in the cluster, and the number of hidden-layer nodes is adjusted according to the training result; the loss function is defined as the mean square error between the network output and the training label,

$$\mathrm{Loss}=\frac{1}{K}\sum_{k=1}^{K}\left(\hat{p}_{n,k}-p_{n,k}\right)^{2}$$

where $\hat{p}_{n,k}$ denotes the network output.
5. The method of claim 1, wherein in step S4, the reinforcement learning task includes an interacting agent and environment and consists of the state space $S$, the action space $A$, the immediate reward $R$, and the transition probability between the current state and the next state; the base station acts as the agent, the performance of the NOMA system is the environment, and the action $a_t$ taken by the agent is decided based on the expected reward that can be obtained; in each step, according to the system performance achievable in the current state $s_t$, the agent selects action $a_t$ from a plurality of actions based on the learned user clustering policy; the environment then evolves to a new state; then, power distribution and beam forming are carried out according to the obtained user clusters, and the step reward $r_t$ is calculated and fed back to the agent.
6. The method of claim 5, wherein the state space $S$ is such that the channels $h_{n,k}(t)$ of all users in time slot $t$ form the current state $s_t$; the action space $A$ contains actions reaching all possible user allocation combinations, the effect of an action being to assign each user to one of the $N$ clusters; the return function is the system sum rate,

$$r_t=\sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

and the goal of reinforcement learning is to maximize the cumulative discounted return

$$G_t=\sum_{\tau=0}^{\infty}\gamma^{\tau} r_{t+\tau}$$

with discount factor $\gamma\in[0,1]$.
7. The method of claim 1, wherein in step S5, the neural network structure in the deep reinforcement learning network DQN is established for fitting the Q value, a Q label network is introduced to update the training labels, and training data are selected based on experience replay: the transition sample $(s_t,a_t,r_t,s_{t+1})$ obtained at each iteration is stored in the replay memory unit as training data, and during training a part of it is randomly taken out for training.
8. The method according to claim 7, wherein the input of the neural network structure is the combination $(s_t,a_t)$ of the current state and the action, and the network output of the neural network structure is the estimated Q value of each action, i.e. $Q(s_t,a_t;\omega)$, where $\omega$ is the training parameter; two fully-connected layers are used as the hidden layers of the network; actions are selected at random initially, and a greedy algorithm is adopted in which a probability hyper-parameter is used to choose between a random action and the Q strategy.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-8.
CN202010643958.8A 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network Active CN111901862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010643958.8A CN111901862B (en) 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010643958.8A CN111901862B (en) 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network

Publications (2)

Publication Number Publication Date
CN111901862A true CN111901862A (en) 2020-11-06
CN111901862B CN111901862B (en) 2021-08-13

Family

ID=73191862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010643958.8A Active CN111901862B (en) 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network

Country Status (1)

Country Link
CN (1) CN111901862B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103974404A (en) * 2014-05-15 2014-08-06 西安电子科技大学 Power distribution scheme based on maximum effective capacity and applied to wireless multi-antenna virtual MIMO
US10588118B2 (en) * 2015-05-11 2020-03-10 Huawei Technologies Co., Ltd. Semi-orthogonal transmission-based communication method and device
US20170359754A1 (en) * 2016-06-09 2017-12-14 The Regents Of The University Of California Learning-constrained optimal enhancement of cellular networks capacity
CN108737057A (en) * 2018-04-27 2018-11-02 南京邮电大学 Multicarrier based on deep learning recognizes NOMA resource allocation methods
CN108777872A (en) * 2018-05-22 2018-11-09 中国人民解放军陆军工程大学 A kind of anti-interference model of depth Q neural networks and intelligent Anti-interference algorithm
CN109940596A (en) * 2019-04-16 2019-06-28 四川阿泰因机器人智能装备有限公司 A kind of robot displacement compensation method based on variance
CN111240836A (en) * 2020-01-06 2020-06-05 北京百度网讯科技有限公司 Computing resource management method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANMEI CAO, GUOMEI ZHANG, GUOBING LI, JIA ZHANG: "《IEEE COMMUNICATIONS LETTERS》", 28 February 2021 *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112566253A (en) * 2020-11-10 2021-03-26 北京科技大学 Wireless resource allocation joint optimization method and device
CN112243283A (en) * 2020-11-10 2021-01-19 哈尔滨工业大学 Cell-Free Massive MIMO network clustering calculation method based on successful transmission probability
CN112243283B (en) * 2020-11-10 2021-09-03 哈尔滨工业大学 Cell-Free Massive MIMO network clustering calculation method based on successful transmission probability
CN112566253B (en) * 2020-11-10 2022-09-06 北京科技大学 Wireless resource allocation joint optimization method and device
US11647468B2 (en) * 2020-11-17 2023-05-09 Industry-Academic Cooperation Foundation, Chosun University Transmission power allocation method based on user clustering and reinforcement learning
US20220159586A1 (en) * 2020-11-17 2022-05-19 Industry-Academic Cooperation Foundation, Chosun University Transmission power allocation method based on user clustering and reinforcement learning
CN112492691A (en) * 2020-11-26 2021-03-12 辽宁工程技术大学 Downlink NOMA power distribution method of deep certainty strategy gradient
CN112492691B (en) * 2020-11-26 2024-03-26 辽宁工程技术大学 Downlink NOMA power distribution method of depth deterministic strategy gradient
CN113015186A (en) * 2021-01-20 2021-06-22 重庆邮电大学 Interference control method based on reinforcement learning
CN113068146A (en) * 2021-03-22 2021-07-02 天津大学 Multi-base-station beam joint selection method in dense millimeter wave vehicle network
CN113068146B (en) * 2021-03-22 2021-11-02 天津大学 Multi-base-station beam joint selection method in dense millimeter wave vehicle network
CN113114313A (en) * 2021-04-13 2021-07-13 南京邮电大学 Method, system and storage medium for detecting pilot auxiliary signal of MIMO-NOMA system
CN113115355A (en) * 2021-04-29 2021-07-13 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113115355B (en) * 2021-04-29 2022-04-22 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113242601B (en) * 2021-05-10 2022-04-08 黑龙江大学 NOMA system resource allocation method based on optimized sample sampling and storage medium
CN113242066A (en) * 2021-05-10 2021-08-10 东南大学 Multi-cell large-scale MIMO communication intelligent power distribution method
CN113242601A (en) * 2021-05-10 2021-08-10 黑龙江大学 NOMA system resource allocation method based on optimized sample sampling and storage medium
CN113242602A (en) * 2021-05-10 2021-08-10 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113543271B (en) * 2021-06-08 2022-06-07 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113543271A (en) * 2021-06-08 2021-10-22 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113472472B (en) * 2021-07-07 2023-06-27 湖南国天电子科技有限公司 Multi-cell collaborative beam forming method based on distributed reinforcement learning
CN113472472A (en) * 2021-07-07 2021-10-01 湖南国天电子科技有限公司 Multi-cell cooperative beam forming method based on distributed reinforcement learning
CN113613301B (en) * 2021-08-04 2022-05-13 北京航空航天大学 Air-ground integrated network intelligent switching method based on DQN
CN113613301A (en) * 2021-08-04 2021-11-05 北京航空航天大学 Air-space-ground integrated network intelligent switching method based on DQN
CN114143150A (en) * 2021-12-09 2022-03-04 中央民族大学 User fairness communication transmission method
CN114423028A (en) * 2022-01-29 2022-04-29 南京邮电大学 CoMP-NOMA (coordinated multi-point-non-orthogonal multiple Access) cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN114423028B (en) * 2022-01-29 2023-08-04 南京邮电大学 CoMP-NOMA cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN114980178A (en) * 2022-06-06 2022-08-30 厦门大学马来西亚分校 Distributed PD-NOMA underwater acoustic network communication method and system based on reinforcement learning
CN115408150A (en) * 2022-06-15 2022-11-29 华为技术有限公司 Calculation strength measuring method, device and related equipment
CN115408150B (en) * 2022-06-15 2023-08-22 华为技术有限公司 Force calculation measurement method and device and related equipment
CN115103372A (en) * 2022-06-17 2022-09-23 东南大学 Multi-user MIMO system user scheduling method based on deep reinforcement learning
CN117176213A (en) * 2023-11-03 2023-12-05 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network
CN117176213B (en) * 2023-11-03 2024-01-30 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network

Also Published As

Publication number Publication date
CN111901862B (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN111901862B (en) User clustering and power distribution method, device and medium based on deep Q network
Nath et al. Deep reinforcement learning for dynamic computation offloading and resource allocation in cache-assisted mobile edge computing systems
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
Zhao et al. Deep reinforcement learning based mobile edge computing for intelligent Internet of Things
Maksymyuk et al. Deep learning based massive MIMO beamforming for 5G mobile network
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
Rajapaksha et al. Deep learning-based power control for cell-free massive MIMO networks
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN112118287B (en) Network resource optimization scheduling decision method based on alternative direction multiplier algorithm and mobile edge calculation
CN109474980A (en) A kind of wireless network resource distribution method based on depth enhancing study
CN111464465A (en) Channel estimation method based on integrated neural network model
Luo et al. Downlink power control for cell-free massive MIMO with deep reinforcement learning
CN112788605B (en) Edge computing resource scheduling method and system based on double-delay depth certainty strategy
CN105379412A (en) System and method for controlling multiple wireless access nodes
Elbir et al. Federated learning for physical layer design
CN113543342A (en) Reinforced learning resource allocation and task unloading method based on NOMA-MEC
Sadiki et al. Deep reinforcement learning for the computation offloading in MIMO-based Edge Computing
Li et al. Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach
Bhardwaj et al. Deep learning-based MIMO and NOMA energy conservation and sum data rate management system
Xu et al. Deep reinforcement learning for communication and computing resource allocation in RIS aided MEC networks
Giri et al. Deep Q-learning based optimal resource allocation method for energy harvested cognitive radio networks
CN111277308A (en) Wave width control method based on machine learning
Zhao et al. Matching-aided-learning resource allocation for dynamic offloading in mmWave MEC system
Dong et al. Optimization-Driven DRL-Based Joint Beamformer Design for IRS-Aided ITSN Against Smart Jamming Attacks
CN114826833B (en) Communication optimization method and terminal for CF-mMIMO in IRS auxiliary MEC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant