CN111901862A - User clustering and power distribution method, device and medium based on deep Q network - Google Patents

User clustering and power distribution method, device and medium based on deep Q network

Info

Publication number
CN111901862A
Authority
CN
China
Prior art keywords
network
power distribution
user
training
power
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010643958.8A
Other languages
Chinese (zh)
Other versions
CN111901862B (en)
Inventor
张国梅
曹艳梅
李国兵
史晔钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010643958.8A priority Critical patent/CN111901862B/en
Publication of CN111901862A publication Critical patent/CN111901862A/en
Application granted granted Critical
Publication of CN111901862B publication Critical patent/CN111901862B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04W — WIRELESS COMMUNICATION NETWORKS
    • H04W 52/00 — Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W 52/04 — TPC
    • H04W 52/30 — TPC using constraints in the total amount of available transmission power
    • H04W 52/34 — TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a user clustering and power distribution method, device and medium based on a deep Q network. The user clustering and power distribution problem is modeled as a joint optimization problem; a BP neural network is established to realize the power distribution function in the joint optimization problem; the BP neural network is trained with a training data set, the network is tested, and the BP neural network model is saved, so that power distribution results under different channel conditions are obtained and power distribution is realized; the user clustering problem is modeled as a reinforcement learning task; a deep Q network is constructed according to the reinforcement learning task; after online training of the network starts, the deep Q network is trained according to the input state and the best action is selected as the best clustering result, thereby realizing user clustering. The invention can reduce the complexity of online computation, ensure user fairness to a certain extent, and effectively improve the spectral efficiency of the system.

Description

User clustering and power distribution method, device and medium based on deep Q network
Technical Field
The invention belongs to the technical field of resource allocation in a communication system, and particularly relates to a user clustering and power allocation method, device and medium based on a deep Q network.
Background
In the face of the current situation that radio spectrum resources are severely scarce and the spectrum utilization of existing communication links is close to its limit, how to further improve the spectral efficiency and system capacity and meet requirements such as large traffic, massive connections and high reliability in the full-scenario applications of future radio communication systems is a key problem that research in the field of radio communication urgently needs to solve. Non-orthogonality and large dimensionality are considered effective ways to improve spectrum resource utilization. In 2010, NTT DoCoMo Corporation of Japan first proposed the Non-orthogonal Multiple Access (NOMA) technique based on Successive Interference Cancellation (SIC) reception; by resource multiplexing in the power domain, the system spectral efficiency and the number of user connections can be multiplied, thereby meeting the requirement of massive access. Power-domain NOMA can effectively improve the spectral efficiency and the number of user connections by virtue of its non-orthogonal nature, is easy to combine with other technologies, and is considered one of the key technologies of future wireless communication systems. The massive MIMO technology proposed in the same period as NOMA has been adopted by the 3GPP Release 15 standard; because it can fully exploit spatial-domain resources with a large-dimensional antenna array to obtain a significant improvement in spectral efficiency, it plays an important role in realizing the large capacity of 5G systems and will continue to be one of the candidate physical-layer technologies of future wireless communication systems. By combining NOMA with massive MIMO, the degrees of freedom of the power domain and the spatial domain can be exploited simultaneously, so that the peak rate and spectral efficiency of the system are further improved and the demand of explosive traffic growth can be effectively met; the combination thus becomes a key candidate technology for the physical layer of future wireless communication systems.
In the face of the future wireless communication requirements of massive-traffic access, ultra-large capacity, ultra-low latency, ultra-dense networking and ultra-high reliability, the resource management and transmission technology system of traditional wireless communication is greatly challenged. Meanwhile, the massive data of wireless communication systems generated by large connections, large dimensionality, large bandwidth and high density provide abundant data for applying Artificial Intelligence (AI) to future wireless communication. Therefore, intelligent communication is considered a mainstream development direction of wireless communication systems beyond 5G. In academia, researchers are exploring the organic integration of AI into various aspects of wireless communication systems and have preliminarily demonstrated the performance improvement that AI technology brings to wireless communication systems. In recent years, research on intelligent wireless communication has gradually advanced to the physical layer, and Deep Learning (DL) technology has been used in channel estimation, signal detection, channel feedback and reconstruction, channel decoding and other aspects, even replacing traditional baseband processing modules to directly implement end-to-end wireless communication systems. Although these research works have achieved certain performance gains, they are still far from the expected goal that intelligent communication breaks through the performance constraints of the traditional wireless communication paradigm and achieves a great improvement in wireless transmission performance, and further intensive research is needed.
After NOMA is combined with massive MIMO, the number of antennas increases markedly, the number of served users multiplies, and the user distribution becomes denser. On the one hand, the contradiction between transmission efficiency and accumulated SIC error propagation faced in user clustering becomes more prominent; on the other hand, the contradiction between intra-cluster beam coverage enhancement and inter-cluster interference suppression becomes more prominent. To resolve these two pairs of contradictions, user clustering, power allocation and beamforming need to be considered as a whole and jointly optimized from the viewpoint of improving overall system performance. However, the channel characteristics of different users are complex, and it is difficult for conventional methods to capture the potential relations between users. Meanwhile, the solution space of the optimization problem is huge, and a nonlinear search process is unavoidable. Therefore, it is difficult to obtain good user clustering and power allocation results with conventional methods, and the performance of the NOMA system remains greatly limited. Research shows that existing studies on massive MIMO-NOMA have not yet formed a comprehensive intelligent solution for the system: the research perspective is narrow and the deep learning network structures used are rigid, which is a main reason why system performance remains limited, so a further technical breakthrough is urgently needed.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method, device and medium for user clustering and power allocation based on a deep Q network, which significantly improve the spectrum efficiency of a system, in view of the above disadvantages in the prior art.
The invention adopts the following technical scheme:
a user clustering and power distribution method based on a deep Q network comprises the following steps:
s1, modeling by using a user clustering and power distribution problem, wherein the optimization target is that the system and the rate are maximum, and the constraint conditions are power constraint and total user number constraint;
s2, setting minimum transmission rate constraint, obtaining a power distribution result by adopting a full search power distribution method, taking the power distribution result as a training label, forming a training data set of the network by using the channel information and the power distribution result, and establishing a BP neural network;
s3, training the BP neural network by using the training data set obtained in the step S2, and training the BP neural network until the mean square error is 10-4Testing the network and storing the BP neural network model to obtain power distribution results under different channel conditions, thereby realizing power distribution;
s4, modeling the user clustering problem into a reinforcement learning task, determining a state space as a combination of user channel information, determining an action space as all grouping conditions, and determining a reward function as a system and a rate;
s5, constructing a deep Q network according to the reinforcement learning task in the step S4, determining network input as a combination of a state space and an action space, outputting as a system and a speed, and initializing parameters and the number of hidden layers of the deep Q network and the Q label network; after the network is trained on line, the deep Q network is trained according to the input state, and the best action is selected as the best clustering result, so that user clustering is realized.
Specifically, in step S1, with the maximization of the system sum rate as the target, the joint optimization problem is established as follows:

$$\max_{\{\alpha_{n,k}\},\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

$$\mathrm{s.t.}\quad C1:\ \sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \le P_{\max}$$

$$C2:\ \text{each user is assigned to exactly one cluster (total user number constraint)}$$

wherein $\{\alpha_{n,k}\}$ is the set of power allocation factors, $\{U_{n,k}\}$ is the user set, $N$ is the number of clusters, $K$ is the number of users in a cluster, $R_{n,k}$ is the information transmission rate of user $k$ in the $n$th cluster, $p_{n,k}$ is the power allocated to user $k$ in the $n$th cluster, $P_{\max}$ is the maximum transmit power allowed at the base station, $\alpha_{n,k}$ is the power allocation factor of user $k$ in the $n$th cluster, and $0\le\alpha_{n,k}\le 1$ holds for all $n$ and $k$.
Specifically, in step S2, a BP neural network is used for power distribution, and the result of the exhaustive search power distribution method is used as the network training label. The label is obtained as follows: within a limited power interval, the power is discretized with step size $\Delta$; for a certain combination of channel information $\{h_{n,1},\dots,h_{n,K}\}$, the optimal power distribution result $\{p_{n,1},\dots,p_{n,K}\}$ is searched out of the discrete power set by the exhaustive search power distribution method, and the BP neural network takes this result as the training label; after training is finished, the network calculates the power distribution result from the input channel information and the total power limit.
Specifically, in step S3, the BP neural network includes an input layer, an output layer and hidden layers; the input of the BP neural network is the channel state information of the users in the cluster and the total power, and the output of the BP neural network is the power distribution result; the number of input and output nodes of the BP neural network equals the number of users in the cluster, and the number of hidden-layer nodes is adjusted according to the training result; the loss function is defined as the mean square error between the network output and the training label,

$$\mathrm{Loss}=\frac{1}{K}\sum_{k=1}^{K}\left(\hat{p}_{n,k}-p_{n,k}\right)^{2}$$

where $\hat{p}_{n,k}$ denotes the network output.
Specifically, in step S4, the reinforcement learning task includes an interacting agent and environment and consists of a state space $S$, an action space $A$, an immediate reward $R$, and the transition probability between the current state and the next state; the base station acts as the agent, the performance of the NOMA system is the environment, and the action $a_t$ taken by the agent is decided based on the expected reward that can be obtained; in each step, according to the system performance achievable in the current state $s_t$, the agent selects action $a_t$ from a plurality of actions according to the learned user clustering strategy; the environment then evolves to a new state; then, power distribution and beam forming are carried out according to the obtained user clusters, and the step reward $r_t$ is calculated and fed back to the agent.
Further, the state space $S$ is such that the channels $h_{n,k}(t)$ of all users in time slot $t$ form the current state $s_t$; the action space $A$ contains actions reaching all possible user allocation combinations, the effect of an action being to assign each user to one of the $N$ clusters; the return function is the system sum rate,

$$r_t=\sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

and the goal of reinforcement learning is to maximize the cumulative discounted return

$$G_t=\sum_{\tau=0}^{\infty}\gamma^{\tau} r_{t+\tau}$$

with discount factor $\gamma\in[0,1]$.
Specifically, in step S5, the neural network structure in the deep reinforcement learning network DQN is established for fitting the Q value, a Q label network is introduced to update the training labels, and training data are selected based on experience replay: the transition sample $(s_t,a_t,r_t,s_{t+1})$ obtained at each iteration is stored in the replay memory unit as training data, and during training a part of it is randomly taken out for training.
Further, the input of the neural network structure is the combination $(s_t,a_t)$ of the current state and the action, and the network output of the neural network structure is the estimated Q value of each action, i.e. $Q(s_t,a_t;\omega)$, where $\omega$ is the training parameter; two fully-connected layers are used as the hidden layers of the network; actions are selected at random initially, and a greedy algorithm is adopted in which a probability hyper-parameter is used to choose between a random action and the Q strategy.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
the user clustering and power distribution method based on the deep Q network can achieve the performance equivalent to that of traversal search. By ensuring that the offline use complexity of the power distribution network of the user performance is extremely low, the user clustering part considers the real-time interaction with the current environment, the network is used while training, and the complexity is negligible compared with the traversal search. Therefore, the invention provides the method for performing joint resource allocation by using the power allocation network based on the BP network and the user clustering network based on the DQN algorithm, which can remarkably improve the spectrum efficiency of the system and is superior to other schemes while ensuring the performance of edge users.
Furthermore, a joint optimization problem is established with the system sum rate as the maximization objective; the problem comprises the two sub-problems of user clustering and power distribution, and the optimization problem constrains the power and the number of users. Solving the optimization problem established on rate maximization guarantees the information transmission rate of the users to the greatest extent.
Further, in the power distribution part, a training data set obtained by exhaustive search is used to train the BP neural network; as long as the network is trained with a sufficiently realistic training data set containing as much data as possible, the trained network can be used offline.
Furthermore, for the specific BP network, the number of input and output nodes is determined according to the number of users in a cluster, which makes the physical meaning of the network clearer. The loss function is a mean square error loss function, whose good mathematical properties make gradient computation easier.
Furthermore, the user clustering problem is modeled into a specific reinforcement learning task, and the method aims to create a new idea for solving the user clustering problem, and after an agent and an environment are specified, the method is convenient for establishing a deep Q network.
Further, each part of the reinforcement learning task is given a physical meaning and a mathematical expression. In particular, the reward function is set to the system sum rate, which defines the training target of the deep Q network.
Furthermore, a Q label network is introduced into the deep Q network, so that the network can train and update labels at the same time, and the training is more accurate. In addition, the training data set of the part is obtained by adopting an experience playback method, the original data sequence can be disordered, and the historical data can be effectively utilized by extracting small batches for training.
Furthermore, the specific structure and the input and output of the deep Q network are determined; the greedy algorithm adopted in the process can generate a more complete training data set and improve the training speed. Step S5 describes the implementation procedure by which the network realizes the user clustering function, selecting the clustering result with the largest system sum rate and thereby improving the spectral efficiency.
In conclusion, the invention can reduce the complexity of online calculation, ensure the fairness of users to a certain extent and effectively improve the spectrum efficiency of the system.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a massive MIMO-NOMA system model of the present invention;
FIG. 2 is a block diagram of a massive MIMO-NOMA downlink transmission system according to the present invention;
FIG. 3 is a deep Q-network based joint optimization network of the present invention;
FIG. 4 is a diagram of an intra-cluster power allocation scheme of the present invention;
FIG. 5 is a diagram of a reinforcement learning based user clustering scheme of the present invention;
fig. 6 is a schematic diagram of the spectral efficiency of resource allocation in different schemes under the Ray-based channel model according to the embodiment of the present invention as a function of the transmission power;
fig. 7 is a CDF curve in different schemes under the Ray-based channel model in the embodiment of the present invention.
Detailed Description
User clustering: multiple users are divided into different groups, and the user groups do not overlap.
CDF: the cumulative distribution function (also called the distribution function) is the integral of the probability density function and completely describes the probability distribution of a real random variable X.
Referring to fig. 1, the present invention provides a method for user clustering and power allocation based on a deep Q network, which includes the following steps:
s1, under a specific considered scene, modeling user clustering and power distribution problems into a joint optimization problem, wherein the optimization target is that the system and the rate are maximum, and the constraint condition is that the power constraint and the total number of users are constrained;
the invention is based on a single-cell large-scale MIMO-NOMA system model, and mainly refers to the problems of user clustering and power allocation under the condition of multiple users for the problem of downlink resource allocation of the system. Therefore, in order to realize the joint optimization of user clustering, power distribution and beam forming, a close coupling optimization iterative structure of three functional modules is established by using a reinforcement learning technology.
Referring to fig. 1, in the user clustering stage, with the maximum system throughput as the target, the deep Q learning network is adopted to gradually adjust the clustering result. In the power distribution stage, a BP (Back propagation) neural network is designed, and the BP neural network takes a power distribution result obtained by an exhaustive search algorithm as an offline training label on the premise of ensuring the minimum transmission rate of a user, so that the complexity of online calculation is greatly reduced while the throughput of the user is ensured. In the outer loop iteration process, the power distribution and beam forming results are fed back to the reinforcement learning network, the deep reinforcement learning network intelligently adjusts the user cluster by taking the maximum system throughput as a target, and the effect of approximate ideal joint optimization can be achieved by iterating for multiple times.
Referring to fig. 2, a single-cell multi-user downlink is considered, where the base station is configured with $N_t$ transmit antennas to serve $L$ single-antenna users. All users in the cell are divided into $N$ clusters according to a certain rule, with $K$ users assumed in each cluster. A power-domain non-orthogonal multiple access transmission structure is adopted within each cluster, and all users of a cluster are served by the same beamforming vector. The system model is shown in fig. 4. In general, the beamforming vectors are designed to eliminate inter-cluster interference, and power allocation is then performed among the users scheduled in a cluster to form the NOMA transmission structure.
In this system, the channel is modeled according to a Ray-based channel model. The base station deploys a UPA (uniform planar array) in the y-z plane, with $N_v$ antennas spaced $d_1$ apart in the vertical direction and $N_t$ antennas spaced $d_2$ apart in the horizontal direction, and the channel comprises $L_u$ scattering paths. For simplicity, the mechanical downtilt angle of the array antenna is ignored; $\psi$ denotes the horizontal incidence angle, $\theta$ denotes the vertical incidence angle, $\sigma$ denotes the standard deviation of the horizontal angular spread, and $\xi$ denotes the standard deviation of the vertical angular spread. For each scattering path, the random complex gain may be expressed as $g=\alpha e^{j\varphi}$, where $\alpha$ is the amplitude and $\varphi$ is the phase. The Ray-based channel vector from the $k$th user to the base station is represented as

$$\mathbf{h}_{k}=\frac{1}{\sqrt{L_u}}\sum_{l=1}^{L_u} g_{k,l}\,\mathbf{a}(u_{k,l})\otimes\mathbf{b}(v_{k,l})$$

wherein $\mathbf{b}(v_{k,l})$ represents the vertical array response and $\mathbf{a}(u_{k,l})$ represents the horizontal array response, $\lambda$ represents the carrier wavelength, $\Delta\theta_{k,l}$ represents the vertical angular spread of the $l$th path of the $k$th user, obeying the normal distribution $\Delta\theta_{k,l}\sim N(0,\sigma)$ and being mutually independent across antenna elements, and $\Delta\psi_{k,l}$ represents the horizontal angular spread of the $l$th path of the $k$th user, obeying $\Delta\psi_{k,l}\sim N(0,\xi)$ and likewise being mutually independent across antenna elements.
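The small-scale channel equations above are reproduced only as images in the original publication, so the following numpy sketch merely illustrates one standard way to generate a Ray-based UPA channel vector consistent with the description; the steering-vector form, the angle distributions, the normalization, and the function names are assumptions rather than the patent's exact formulas.

```python
import numpy as np

def steering(n_ant, spacing_wavelengths, angle):
    # One-dimensional array response for one axis of the UPA (assumed standard form).
    return np.exp(1j * 2 * np.pi * spacing_wavelengths * np.arange(n_ant) * np.sin(angle))

def ray_based_channel(n_v, n_h, d1, d2, n_paths):
    # Sum over n_paths scattered rays; each ray has a random complex gain g = alpha * exp(j*phi)
    # and random vertical/horizontal incidence angles; the Kronecker product combines the
    # horizontal and vertical array responses (antenna spacings d1, d2 given in wavelengths).
    h = np.zeros(n_v * n_h, dtype=complex)
    for _ in range(n_paths):
        g = (np.random.randn() + 1j * np.random.randn()) / np.sqrt(2)   # random complex path gain
        theta = np.random.uniform(-np.pi / 3, np.pi / 3)                # vertical incidence angle
        psi = np.random.uniform(-np.pi / 3, np.pi / 3)                  # horizontal incidence angle
        h += g * np.kron(steering(n_h, d2, psi), steering(n_v, d1, theta))
    return h / np.sqrt(n_paths)

# Example: 8 x 8 UPA (64 antennas), half-wavelength spacing, 10 scattering paths
h_k = ray_based_channel(n_v=8, n_h=8, d1=0.5, d2=0.5, n_paths=10)
```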
Assuming the channel is a flat block-fading channel and considering large-scale fading, the channel vector from user $k$ in the cell to the base station is represented as

$$\mathbf{h}_{k}=\sqrt{\beta_{k}}\,\tilde{\mathbf{h}}_{k}$$

wherein $\tilde{\mathbf{h}}_{k}$ is the $M\times 1$ small-scale channel vector and $\beta_{k}$ is the large-scale path-loss and shadow-fading coefficient, determined by $d_{k}$, the distance from user $k$ to the base station, $d_{0}$, the cell radius, and $\lambda$, the path-loss coefficient.

It is assumed here that the distance between a user and the base station is much larger than the physical size of the base-station antenna array, so the large-scale information between the same user and the $M$ base-station antennas is considered unchanged. Therefore, the channel matrix from the $K$ users to the base station is

$$\mathbf{H}=[\mathbf{h}_{1},\mathbf{h}_{2},\dots,\mathbf{h}_{K}]\in\mathbb{C}^{M\times K}$$

Referring to fig. 3, assume $\mathbf{x}=[x_{1}\ x_{2}\ x_{3}\ \dots\ x_{N}]^{T}\in\mathbb{C}^{N\times 1}$ is the data transmitted by the base station, where

$$x_{n}=\sum_{k=1}^{K}\sqrt{\alpha_{n,k}P_{n}}\,s_{n,k}$$

is the NOMA signal of cluster $n$, $P_{n}$ is the total transmit power of the $n$th cluster, $\alpha_{n,k}$ is the power allocation factor of each user in the cluster, $s_{n,k}$ is the signal of the $k$th user $U_{n,k}$ in the $n$th cluster, and $E[|s_{n,k}|^{2}]=1$. The power-superposed signal of each cluster is preprocessed by a beamforming vector to obtain the transmitted signal

$$\mathbf{s}=\mathbf{W}\mathbf{x}=\sum_{n=1}^{N}\mathbf{w}_{n}x_{n}$$

wherein $\mathbf{W}=[\mathbf{w}_{1},\dots,\mathbf{w}_{N}]\in\mathbb{C}^{M\times N}$ is the beamforming matrix.

Assume the downlink channel matrix $\mathbf{H}_{n}=[\mathbf{h}_{n,1},\dots,\mathbf{h}_{n,K}]$ represents the channel state information of the $n$th cluster. The received signal of the $k$th user in the $n$th cluster is

$$y_{n,k}=\mathbf{h}_{n,k}^{H}\sum_{i=1}^{N}\mathbf{w}_{i}x_{i}+z_{n,k}$$

wherein $z_{n,k}$ is complex Gaussian noise with mean 0 and variance $\sigma^{2}$.
Besides the useful signal, the signal received by a user also contains inter-cluster interference, intra-cluster user interference and a noise term. The beamforming vector designed based on the channel information aims at eliminating inter-cluster interference, so approximately $\mathbf{h}_{n,k}^{H}\mathbf{w}_{i}\approx 0,\ i\neq n$; however, current algorithms can hardly achieve this ideal effect, so the interference term is difficult to ignore. Assuming that SIC at the receiving end detects and ideally cancels the interference of the previously decoded users, the achievable rate of user $U_{n,k}$ (the $k$th decoded user of the $n$th cluster) is

$$R_{n,k}=B\log_{2}\!\left(1+\frac{\alpha_{n,k}P_{n}\left|\mathbf{h}_{n,k}^{H}\mathbf{w}_{n}\right|^{2}}{\left|\mathbf{h}_{n,k}^{H}\mathbf{w}_{n}\right|^{2}P_{n}\sum_{j=k+1}^{K}\alpha_{n,j}+\sum_{i\neq n}\left|\mathbf{h}_{n,k}^{H}\mathbf{w}_{i}\right|^{2}P_{i}+\sigma^{2}}\right)$$

where $B$ is the bandwidth.
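As a hedged illustration only, the numpy sketch below shows how such a rate could be evaluated from the quantities defined above (channels, beamforming matrix, power allocation factors); the function name and argument layout are assumptions, and the SIC convention (earlier-decoded users cancelled, later-decoded users remaining as intra-cluster interference) follows the standard form, not a formula confirmed by the original images.

```python
import numpy as np

def achievable_rate(h_nk, W, alpha_n, P, sigma2, B, n, k):
    """Rate of the k-th decoded user in cluster n (standard NOMA form, assumed).

    h_nk    : channel vector of user (n, k), shape (M,)
    W       : beamforming matrix, shape (M, N); column i is w_i
    alpha_n : intra-cluster power allocation factors of cluster n, in decoding order k = 0..K-1
    P       : total transmit power of each cluster, shape (N,)
    """
    gain = np.abs(h_nk.conj() @ W[:, n]) ** 2
    # Intra-cluster interference from users decoded after user k (not yet cancelled by SIC)
    intra = gain * P[n] * np.sum(alpha_n[k + 1:])
    # Residual inter-cluster interference (zero-forcing is not perfect in practice)
    inter = sum(np.abs(h_nk.conj() @ W[:, i]) ** 2 * P[i]
                for i in range(W.shape[1]) if i != n)
    sinr = alpha_n[k] * P[n] * gain / (intra + inter + sigma2)
    return B * np.log2(1 + sinr)
```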
With the maximization of the system sum rate as the target, the joint optimization problem is established as follows:

$$\max_{\{\alpha_{n,k}\},\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

$$\mathrm{s.t.}\quad C1:\ \sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \le P_{\max},\qquad C2:\ \text{each user is assigned to exactly one cluster.}$$
the invention provides a combined optimization method based on a deep learning technology, which is used for realizing the combined optimization of user clustering and power distribution.
Referring to fig. 3, the joint optimization problem is solved by a machine learning algorithm and is divided into a power distribution scheme based on a BP neural network and a user clustering module based on a deep Q network; the deep Q network calculates the reward value according to the result of the power distribution network, so as to adjust the clustering result.
S2, for the power distribution part of the joint optimization, this functional module is implemented with a self-designed BP neural network; because of the supervised learning mode, the power distribution result obtained by the exhaustive search power distribution algorithm under the minimum transmission rate constraint is used as the training label;
in a large-scale MIMO-NOMA system, in order to ensure the effectiveness of a SIC receiver at a receiving end, the power of users in the same cluster needs to satisfy a certain relation. Different power allocation algorithms are different in pursuit of overall system throughput performance and user fairness performance, and power allocation is the key to realizing compromise between system and rate-fairness performance.
Referring to fig. 4, a typical Power Allocation algorithm includes Fixed Power Allocation (FPA), Fractional Power Allocation (FTPA), Exhaustive Search Power Allocation (ESPA), and the like. Although the FPA and FTPA algorithms are not computationally complex, the system performance is not ideal. The ESPA algorithm is an algorithm for pursuing the optimal system performance, but the online computation complexity is too high, and the ESPA algorithm is difficult to popularize and apply in an actual system. Different from the idea of the traditional optimization algorithm, the invention provides a power distribution algorithm based on a BP neural network.
The BP neural network has a strong nonlinear mapping capability, can automatically extract reasonable rules between input and output data through learning, and has high self-learning and self-adaptive capabilities. Therefore, a BP neural network is employed for power allocation. The result of the ESPA algorithm is taken as the network training label, and the label is obtained as follows: within a limited power interval, the power is discretized with step size $\Delta$; for a certain combination of channel information $\{h_{n,1},\dots,h_{n,K}\}$, the optimal power distribution result $\{p_{n,1},\dots,p_{n,K}\}$ is searched out of the discrete power set by the ESPA algorithm, and the BP neural network takes this result as the training label. The trained network can calculate the power distribution result from the input channel information and the total power limit, and the computational complexity can be greatly reduced.
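A minimal sketch of how such ESPA training labels could be generated is given below, assuming a simplified single-cluster NOMA rate model with ideal SIC; the grid granularity, the minimum-rate value, the requirement that the allocation factors sum to one, and the function name `espa_label` are illustrative assumptions, not values taken from the patent.

```python
import itertools
import numpy as np

def espa_label(h, P_total, sigma2, delta=0.05, r_min=0.1, B=1.0):
    """Exhaustive search power allocation (ESPA) label for one cluster (illustrative).

    h       : channel gains |h_{n,k}|^2 of the K in-cluster users, in decoding order
    delta   : discretization step of the power allocation factors
    r_min   : minimum transmission rate constraint per user
    Returns the power allocation {p_{n,1}, ..., p_{n,K}} with the largest sum rate.
    """
    K = len(h)
    grid = np.arange(delta, 1.0, delta)
    best_rate, best_p = -np.inf, None
    for alphas in itertools.product(grid, repeat=K):
        if abs(sum(alphas) - 1.0) > 1e-9:          # use exactly the total cluster power
            continue
        p = np.array(alphas) * P_total
        # Ideal SIC: user k only sees interference from users decoded after it.
        rates = [B * np.log2(1 + h[k] * p[k] / (h[k] * np.sum(p[k + 1:]) + sigma2))
                 for k in range(K)]
        if min(rates) < r_min:                     # enforce the minimum transmission rate
            continue
        if sum(rates) > best_rate:
            best_rate, best_p = sum(rates), p
    return best_p

# Example: one training label (channel info + total power -> optimal power split)
label = espa_label(h=np.array([0.2, 1.0]), P_total=1.0, sigma2=0.01)
```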
S3, a large amount of training data is obtained for training the BP neural network so that it yields power distribution results under different channel conditions; the network is trained until the mean square error reaches $10^{-4}$, after which the network is tested and the network model is saved for subsequent invocation;
the BP neural network consists of an input layer, an output layer and three hidden layers. The input of the BP neural network is channel state information and the total power of users in the cluster, and the BP neural network outputs a power distribution result; the number of input and output nodes of the BP neural network is the number of users in the cluster, and the number of nodes of the hidden layer is adjusted according to the training result. The loss function is defined as
Figure BDA0002572428890000121
The network parameters are updated accordingly to complete the training. Note that, to guarantee the generalization performance of the network, the training data should be as abundant as possible and traverse all possible channel conditions.
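A minimal PyTorch sketch of such a power distribution network is given below (three hidden layers, MSE loss, training until the mean square error reaches 10^-4, as described above). Since Table 1 with the actual configuration is reproduced only as an image, the hidden-layer width, learning rate, input layout (K channel values plus the total power), and the random placeholder data are assumptions.

```python
import torch
import torch.nn as nn

K = 2  # users per cluster

# Power distribution network: input = in-cluster channel state information + total power,
# output = power allocated to each of the K users (widths and input layout are assumed).
power_net = nn.Sequential(
    nn.Linear(K + 1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, K),
)
optimizer = torch.optim.Adam(power_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder training set: x = [|h_{n,1}|, ..., |h_{n,K}|, P_n], y = ESPA labels {p_{n,k}}.
x = torch.rand(1024, K + 1)
y = torch.rand(1024, K)

for epoch in range(20000):
    optimizer.zero_grad()
    loss = loss_fn(power_net(x), y)
    loss.backward()
    optimizer.step()
    if loss.item() <= 1e-4:        # train until the mean square error reaches 10^-4
        break

# After training, the network would be tested on held-out channel realizations and saved,
# e.g. torch.save(power_net.state_dict(), "power_net.pt"), for subsequent invocation.
```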
TABLE 1 Structure and parameter configuration of the power distribution network (K is the number of users)
S4, the user clustering problem is modeled as a reinforcement learning task, where the state space is determined as the combination of user channel information, the action space as all possible grouping cases, and the reward function as the system sum rate;

Based on the scenario in fig. 2, the user clustering problem with sum-rate maximization as the target is represented as

$$\max_{\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$
in the conventional optimization method, all allocation combinations are subjected to online traversal, and the implementation complexity is increased very rapidly as the number of users increases. To address this problem, a deep reinforcement learning framework is proposed to optimize the user clustering process of the NOMA system.
Referring to FIG. 5, the user clustering optimization problem is modeled as a reinforcement learning task consisting of an interacting agent and environment. A general reinforcement learning problem consists of four parts: the state space S, the action space A, the immediate reward R, and the transition probability between the current state and the next state. Specifically, the base station acts as the agent and the performance of the NOMA system is the environment; the action $a_t$ taken by the agent is decided based on the expected reward that can be obtained. In each step, according to the system performance achievable in the current state $s_t$, the agent selects action $a_t$ from a plurality of actions based on the learned user clustering policy. As the action is taken, the environment evolves to a new state. Then, according to the obtained user clusters, power distribution and beam forming are carried out, and the step reward $r_t$ is calculated and fed back to the agent. Learning often begins with a series of samples of states, actions and rewards obtained by random-policy exploration, and the algorithm improves the policy based on these samples to maximize the reward.
In conjunction with the system scenario of the present invention, the detailed representation of each part of the reinforcement learning framework is as follows:

State space S: the channels $h_{n,k}(t)$ of all users in time slot $t$ form the current state $s_t$, i.e.

$$s_t=\left\{[h_{1,1}(t),\dots,h_{1,K}(t)],\ \dots,\ [h_{N,1}(t),\dots,h_{N,K}(t)]\right\}$$

Action space A: the action space should contain actions that can reach all possible user allocation combinations; the purpose of an action is to select an appropriate grouping for the users, and the effect of a specific action is to assign each user to one of the $N$ clusters.

Return function: the return is obtained by selecting action $a_t$ in state $s_t$; the sum rate or the energy efficiency of the NOMA system can be used as the target. Here the system sum rate is used as the return function, which involves the power allocation factor $\alpha_{n,k}$ and the beamforming vector $\mathbf{w}_n$.
The invention first assumes that the conventional zero-forcing beamforming method is adopted; because each cluster contains several users, the channel of the user with good channel quality is selected as the equivalent channel for the calculation, in the specific form

$$\mathbf{W}=[\mathbf{w}_{1},\dots,\mathbf{w}_{N}]=\mathbf{H}^{H}\left(\mathbf{H}\mathbf{H}^{H}\right)^{-1}$$
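A small numpy sketch of this zero-forcing beamforming step is shown below, where the rows of the input matrix are the per-cluster equivalent channels; the unit-norm column normalization and the function name are added assumptions rather than details stated in the text.

```python
import numpy as np

def zero_forcing_beamforming(H_eq):
    """H_eq: N x M equivalent channel matrix (one row per cluster).

    Returns W = H^H (H H^H)^{-1} with unit-norm columns, i.e. one beamforming
    vector w_n per cluster.
    """
    W = H_eq.conj().T @ np.linalg.inv(H_eq @ H_eq.conj().T)
    return W / np.linalg.norm(W, axis=0, keepdims=True)

# Example: 4 clusters, 64 base-station antennas
W = zero_forcing_beamforming(np.random.randn(4, 64) + 1j * np.random.randn(4, 64))
```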
the goal of reinforcement learning is to maximize the cumulative discount return
Figure BDA0002572428890000143
The discount factor gamma is an element of 0,1]。
S5, constructing a deep Q network, and initializing parameters of the deep Q network and the Q label network and the number of hidden layers of the neural network. After the network starts on-line training, the deep Q network is trained according to the input state, so that the best action, namely the best clustering result, is selected.
Wherein, the sum rate is calculated according to the power distribution result of the step S3; the deep Q network trains and adjusts the clustering result and is used in the resource allocation process of signal transmission.
Deep Reinforcement Learning Network DQN (Deep Q-Learning Network)
At each observation time $t$, the agent determines the next action from the observed current state. There is thus a mapping between states and actions, which is the policy $\pi$. To evaluate the expected return of a policy, a value function needs to be defined; the state-action value function is given as

$$Q^{\pi}(s,a)=\mathbb{E}\!\left[\sum_{\tau=0}^{\infty}\gamma^{\tau}r_{t+\tau}\ \middle|\ s_{t}=s,\ a_{t}=a,\ \pi\right]$$

The above equation is nonlinear and has no closed-form solution. Thus, many iterative methods (e.g., Q-Learning) have been proposed and shown to converge to the optimal Q function. In Q-Learning, when the state and action spaces are discrete and low-dimensional, a Q-table can be used to store the Q value corresponding to each state-action pair; when the state and action spaces are high-dimensional or continuous, using a Q-table is impractical and the learning process becomes inefficient. One solution to this problem is to estimate the Q value with a neural network, which is the main idea of DQN.
In summary, the DQN is to design a neural network structure to fit the Q value, so as to be applied to reinforcement learning.
Deep neural networks in DQN
The neural network in DQN is designed with the combination $(s_t,a_t)$ of the current state and the action as input; the network output is the estimated Q value corresponding to each action, i.e. $Q(s_t,a_t;\omega)$, where $\omega$ is the training parameter. The role of the network is to fit the Q function, so two fully connected layers are used as the hidden layers of the network. Actions are selected at random initially, and the benefit of this disappears over time; therefore a greedy algorithm is adopted, in which a probability hyper-parameter is used to choose between a random action and the Q strategy.
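A minimal PyTorch sketch of this Q network and of the ε-greedy choice between a random action and the Q strategy is given below; the hidden-layer width, the encoding of states and actions as flat vectors, and the names `QNet` and `epsilon_greedy` are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    # Fits Q(s_t, a_t; w): input is the state-action combination, output the estimated Q value.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),   # hidden layer 1
            nn.Linear(hidden, hidden), nn.ReLU(),                   # hidden layer 2
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def epsilon_greedy(q_net, state, candidate_actions, epsilon):
    # With probability epsilon take a random clustering action (exploration),
    # otherwise the action with the largest estimated Q value (the Q strategy).
    if random.random() < epsilon:
        return random.choice(candidate_actions)
    with torch.no_grad():
        q_values = torch.stack([q_net(state, a) for a in candidate_actions])
    return candidate_actions[int(q_values.argmax())]
```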
DQN introduces a Q label network on the basis of the original Q network, i.e. a network used to update the training labels. It has the same structure as the Q network and the same initial weights; the difference is that the Q network is updated at every iteration while the Q label network is updated at intervals. DQN determines the loss function based on Q-Learning, and the aim is to minimize the error between the Q label value and the Q estimate. The loss function in DQN is

$$L(\omega)=\mathbb{E}\!\left[\left(r_{t}+\gamma\max_{a'}Q\!\left(s_{t+1},a';\omega^{-}\right)-Q\!\left(s_{t},a_{t};\omega\right)\right)^{2}\right]$$

where $\omega^{-}$ denotes the parameters of the Q label network.
training data selection based on empirical playback
Since the samples for deep learning are independent and the target is fixed, the states before and after reinforcement learning are related. Therefore, empirical playback methods are used to select samples in DQN networks. The specific method is that the transfer sample(s) obtained by each iteration is usedt,at,rt,st+1) Stored in a playback memory unit as a trainerAnd (5) practicing data. During training, a part (Mini Batch) is randomly taken out for training. The specific flow is shown in algorithm 1:
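Algorithm 1 itself is reproduced only as an image in the original publication, so the following is a minimal sketch of the training flow as described in the text (replay memory, mini-batch sampling, Q label network synchronized at intervals). It reuses `QNet` and `epsilon_greedy` from the sketch above; `env_step` and `enumerate_actions`, standing for the power-allocation/beamforming environment and the set of clustering actions, as well as the hyper-parameter values, are hypothetical placeholders.

```python
import random
from collections import deque
import torch

def train_dqn(q_net, q_label_net, env_step, enumerate_actions, state,
              steps=500, batch_size=32, gamma=0.9, epsilon=0.1, sync_every=20):
    memory = deque(maxlen=10000)                          # replay memory unit
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    q_label_net.load_state_dict(q_net.state_dict())       # same initial weights

    for t in range(steps):
        action = epsilon_greedy(q_net, state, enumerate_actions(), epsilon)
        reward, next_state = env_step(action)             # sum rate after power allocation + beamforming
        memory.append((state, action, reward, next_state))    # transition sample (s_t, a_t, r_t, s_{t+1})
        state = next_state

        if len(memory) >= batch_size:
            batch = random.sample(memory, batch_size)      # mini-batch drawn from the replay memory
            loss = torch.zeros(1)
            for s, a, r, s_next in batch:
                with torch.no_grad():                      # Q label network provides the training label
                    target = r + gamma * max(q_label_net(s_next, a2)
                                             for a2 in enumerate_actions())
                loss = loss + (target - q_net(s, a)) ** 2
            loss = loss / batch_size
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if t % sync_every == 0:                            # Q label network updated at intervals
            q_label_net.load_state_dict(q_net.state_dict())
    return q_net
```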
in order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
A single-cell large-scale MIMO-NOMA scene is considered, in which the user clustering and power distribution of the downlink are realized by the method based on the deep Q network; detailed simulation parameters are shown in Table 2.
TABLE 2 Simulation parameters
Comparison schemes:
Comparison scheme one: a fractional power distribution algorithm is adopted, and an empirical-value clustering method is adopted for the user clustering part.
Comparison scheme two: a fractional power distribution algorithm is adopted, and a traversal search method is adopted for user clustering.
Comparison scheme three: a fractional power distribution algorithm is adopted, and the user clustering adopts the proposed DQN method.
Comparison scheme four: the proposed power distribution network is adopted, and users are clustered by a traversal search algorithm.
Referring to fig. 6, under the channel model set in this scheme, compared with comparison schemes one, two and three, the proposed algorithm greatly improves the spectral efficiency of the system when the transmission power is 0.02-1 W, nearly doubling it at a transmission power of 0.02 W; meanwhile, compared with comparison scheme four, the network designed in this scheme achieves performance comparable to traversal search. However, the computational complexity of traversal search increases exponentially with the number of users, whereas the offline-use complexity of the proposed power distribution network, which guarantees user performance, is extremely low; the user clustering part interacts with the current environment in real time, the network is used while being trained, and its complexity is negligible compared with traversal search. In summary, the proposed joint resource allocation using the BP-network-based power distribution network and the DQN-based user clustering network can significantly improve the spectral efficiency of the system and is superior to the other schemes.
Referring to fig. 7, the CDF curves of the method of the present invention and comparison scheme two are compared; the dotted line indicates the system performance that would be achieved by an ideal beamforming scheme, which is difficult to reach with the prior art. As can be seen from the figure, compared with comparison scheme two, the proposed scheme yields relatively better edge-user performance, i.e. user fairness is ensured while the spectral efficiency of the system is improved; with a better beamforming scheme, the performance of edge users would be guaranteed even better.
The deep Q network based user clustering and power allocation method of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The user clustering and power distribution method based on the deep Q network may be stored in a computer readable storage medium if it is implemented in the form of a software functional unit and sold or used as an independent product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. Computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunication signals, in accordance with legislation and patent practice. The computer storage medium may be any available medium or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.), optical memory (e.g., CD, DVD, BD, HVD, etc.), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), Solid State Disk (SSD)), etc.
In an exemplary embodiment, a computer device is also provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the Q-network based resource allocation method when executing the computer program. The processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
In summary, the method, device and medium for user clustering and power allocation based on the deep Q network of the present invention can effectively improve the spectrum efficiency of the system. Firstly, in a power distribution stage, a BP neural network is designed, and on the premise of ensuring the minimum transmission rate of a user, the power distribution result obtained by an exhaustive search algorithm is used as an offline training label, so that the user throughput is ensured, and the complexity of online calculation is greatly reduced. Secondly, in the user clustering stage, with the maximum system throughput as a target, a deep Q learning network is adopted to gradually adjust clustering results through the feedback of reward values, and a trained power distribution network is adopted in a cluster. In the outer loop iteration process, the power distribution and beam forming results are fed back to the reinforcement learning network, the deep reinforcement learning network intelligently adjusts the user cluster by taking the maximum system throughput as a target, and the effect of approximate ideal joint optimization can be achieved by iterating for multiple times. Finally, simulation verifies that the user clustering and power distribution method based on the deep Q network greatly improves the spectrum efficiency of the system while reducing the complexity.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A user clustering and power distribution method based on a deep Q network is characterized by comprising the following steps:
s1, modeling by using a user clustering and power distribution problem, wherein the optimization target is that the system and the rate are maximum, and the constraint conditions are power constraint and total user number constraint;
s2, setting minimum transmission rate constraint, obtaining a power distribution result by adopting a full search power distribution method, taking the power distribution result as a training label, forming a training data set of the network by using the channel information and the power distribution result, and establishing a BP neural network;
s3, training the BP neural network by using the training data set obtained in the step S2, and training the BP neural network until the mean square error is 10-4Testing the network and storing the BP neural network model to obtain power distribution results under different channel conditions, thereby realizing power distribution;
s4, modeling the user clustering problem into a reinforcement learning task, determining a state space as a combination of user channel information, determining an action space as all grouping conditions, and determining a reward function as a system and a rate;
s5, constructing a deep Q network according to the reinforcement learning task in the step S4, determining network input as a combination of a state space and an action space, outputting as a system and a speed, and initializing parameters and the number of hidden layers of the deep Q network and the Q label network; after the network is trained on line, the deep Q network is trained according to the input state, and the best action is selected as the best clustering result, so that user clustering is realized.
2. The method according to claim 1, wherein in step S1, with the maximization of the system sum rate as the target, the joint optimization problem is established as follows:

$$\max_{\{\alpha_{n,k}\},\{U_{n,k}\}}\ \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

$$\mathrm{s.t.}\quad C1:\ \sum_{n=1}^{N}\sum_{k=1}^{K} p_{n,k} \le P_{\max}$$

$$C2:\ \text{each user is assigned to exactly one cluster (total user number constraint)}$$

wherein $\{\alpha_{n,k}\}$ is the set of power allocation factors, $\{U_{n,k}\}$ is the user set, $N$ is the number of clusters, $K$ is the number of users in a cluster, $R_{n,k}$ is the information transmission rate of user $k$ in the $n$th cluster, $p_{n,k}$ is the power allocated to user $k$ in the $n$th cluster, $P_{\max}$ is the maximum transmit power allowed at the base station, $\alpha_{n,k}$ is the power allocation factor of user $k$ in the $n$th cluster, and $0\le\alpha_{n,k}\le 1$ holds for all $n$ and $k$.
3. The method according to claim 1, wherein in step S2, a BP neural network is used for power distribution and the result of the exhaustive search power distribution method is used as the network training label, the label being obtained as follows: within a limited power interval, the power is discretized with step size $\Delta$; for a certain combination of channel information $\{h_{n,1},\dots,h_{n,K}\}$, the optimal power distribution result $\{p_{n,1},\dots,p_{n,K}\}$ is searched out of the discrete power set by the exhaustive search power distribution method and taken as the training label; after training is finished, the network calculates the power distribution result from the input channel information and the total power limit.
4. The method according to claim 1, wherein in step S3, the BP neural network comprises an input layer, an output layer and hidden layers; the input of the BP neural network is the channel state information of the users in the cluster and the total power, and the output of the BP neural network is the power distribution result; the number of input and output nodes of the BP neural network equals the number of users in the cluster, and the number of hidden-layer nodes is adjusted according to the training result; the loss function is defined as the mean square error between the network output and the training label,

$$\mathrm{Loss}=\frac{1}{K}\sum_{k=1}^{K}\left(\hat{p}_{n,k}-p_{n,k}\right)^{2}$$

where $\hat{p}_{n,k}$ denotes the network output.
5. The method of claim 1, wherein in step S4, the reinforcement learning task includes an interacting agent and environment and consists of the state space $S$, the action space $A$, the immediate reward $R$, and the transition probability between the current state and the next state; the base station acts as the agent, the performance of the NOMA system is the environment, and the action $a_t$ taken by the agent is decided based on the expected reward that can be obtained; in each step, according to the system performance achievable in the current state $s_t$, the agent selects action $a_t$ from a plurality of actions based on the learned user clustering policy; the environment then evolves to a new state; then, power distribution and beam forming are carried out according to the obtained user clusters, and the step reward $r_t$ is calculated and fed back to the agent.
6. The method of claim 5, wherein the state space $S$ is such that the channels $h_{n,k}(t)$ of all users in time slot $t$ form the current state $s_t$; the action space $A$ contains actions reaching all possible user allocation combinations, the effect of an action being to assign each user to one of the $N$ clusters; the return function is the system sum rate,

$$r_t=\sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k}$$

and the goal of reinforcement learning is to maximize the cumulative discounted return

$$G_t=\sum_{\tau=0}^{\infty}\gamma^{\tau} r_{t+\tau}$$

with discount factor $\gamma\in[0,1]$.
7. The method of claim 1, wherein in step S5, the neural network structure in the deep reinforcement learning network DQN is established for fitting the Q value, a Q label network is introduced to update the training labels, and training data are selected based on experience replay: the transition sample $(s_t,a_t,r_t,s_{t+1})$ obtained at each iteration is stored in the replay memory unit as training data, and during training a part of it is randomly taken out for training.
8. The method according to claim 7, wherein the input of the neural network structure is the combination $(s_t,a_t)$ of the current state and the action, and the network output of the neural network structure is the estimated Q value of each action, i.e. $Q(s_t,a_t;\omega)$, where $\omega$ is the training parameter; two fully-connected layers are used as the hidden layers of the network; actions are selected at random initially, and a greedy algorithm is adopted in which a probability hyper-parameter is used to choose between a random action and the Q strategy.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-8.
CN202010643958.8A 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network Active CN111901862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010643958.8A CN111901862B (en) 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010643958.8A CN111901862B (en) 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network

Publications (2)

Publication Number Publication Date
CN111901862A true CN111901862A (en) 2020-11-06
CN111901862B CN111901862B (en) 2021-08-13

Family

ID=73191862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010643958.8A Active CN111901862B (en) 2020-07-07 2020-07-07 User clustering and power distribution method, device and medium based on deep Q network

Country Status (1)

Country Link
CN (1) CN111901862B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103974404A (en) * 2014-05-15 2014-08-06 西安电子科技大学 Power distribution scheme based on maximum effective capacity and applied to wireless multi-antenna virtual MIMO
US10588118B2 (en) * 2015-05-11 2020-03-10 Huawei Technologies Co., Ltd. Semi-orthogonal transmission-based communication method and device
US20170359754A1 (en) * 2016-06-09 2017-12-14 The Regents Of The University Of California Learning-constrained optimal enhancement of cellular networks capacity
CN108737057A (en) * 2018-04-27 2018-11-02 南京邮电大学 Multicarrier based on deep learning recognizes NOMA resource allocation methods
CN108777872A (en) * 2018-05-22 2018-11-09 中国人民解放军陆军工程大学 A kind of anti-interference model of depth Q neural networks and intelligent Anti-interference algorithm
CN109940596A (en) * 2019-04-16 2019-06-28 四川阿泰因机器人智能装备有限公司 A kind of robot displacement compensation method based on variance
CN111240836A (en) * 2020-01-06 2020-06-05 北京百度网讯科技有限公司 Computing resource management method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANMEI CAO, GUOMEI ZHANG, GUOBING LI, JIA ZHANG: "《IEEE COMMUNICATIONS LETTERS》", 28 February 2021 *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112566253A (en) * 2020-11-10 2021-03-26 北京科技大学 Wireless resource allocation joint optimization method and device
CN112243283A (en) * 2020-11-10 2021-01-19 哈尔滨工业大学 Cell-Free Massive MIMO network clustering calculation method based on successful transmission probability
CN112243283B (en) * 2020-11-10 2021-09-03 哈尔滨工业大学 Cell-Free Massive MIMO network clustering calculation method based on successful transmission probability
CN112566253B (en) * 2020-11-10 2022-09-06 北京科技大学 Wireless resource allocation joint optimization method and device
US11647468B2 (en) * 2020-11-17 2023-05-09 Industry-Academic Cooperation Foundation, Chosun University Transmission power allocation method based on user clustering and reinforcement learning
US20220159586A1 (en) * 2020-11-17 2022-05-19 Industry-Academic Cooperation Foundation, Chosun University Transmission power allocation method based on user clustering and reinforcement learning
CN112492691A (en) * 2020-11-26 2021-03-12 辽宁工程技术大学 Downlink NOMA power distribution method of deep certainty strategy gradient
CN112492691B (en) * 2020-11-26 2024-03-26 辽宁工程技术大学 Downlink NOMA power distribution method of depth deterministic strategy gradient
CN113015186A (en) * 2021-01-20 2021-06-22 重庆邮电大学 Interference control method based on reinforcement learning
CN113068146A (en) * 2021-03-22 2021-07-02 天津大学 Multi-base-station beam joint selection method in dense millimeter wave vehicle network
CN113068146B (en) * 2021-03-22 2021-11-02 天津大学 Multi-base-station beam joint selection method in dense millimeter wave vehicle network
CN113114313A (en) * 2021-04-13 2021-07-13 南京邮电大学 Method, system and storage medium for detecting pilot auxiliary signal of MIMO-NOMA system
CN113115355A (en) * 2021-04-29 2021-07-13 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113115355B (en) * 2021-04-29 2022-04-22 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113242601B (en) * 2021-05-10 2022-04-08 黑龙江大学 NOMA system resource allocation method based on optimized sample sampling and storage medium
CN113242066A (en) * 2021-05-10 2021-08-10 东南大学 Multi-cell large-scale MIMO communication intelligent power distribution method
CN113242601A (en) * 2021-05-10 2021-08-10 黑龙江大学 NOMA system resource allocation method based on optimized sample sampling and storage medium
CN113242602A (en) * 2021-05-10 2021-08-10 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113543271B (en) * 2021-06-08 2022-06-07 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113543271A (en) * 2021-06-08 2021-10-22 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113472472B (en) * 2021-07-07 2023-06-27 湖南国天电子科技有限公司 Multi-cell collaborative beam forming method based on distributed reinforcement learning
CN113472472A (en) * 2021-07-07 2021-10-01 湖南国天电子科技有限公司 Multi-cell cooperative beam forming method based on distributed reinforcement learning
CN113613301B (en) * 2021-08-04 2022-05-13 北京航空航天大学 Air-ground integrated network intelligent switching method based on DQN
CN113613301A (en) * 2021-08-04 2021-11-05 北京航空航天大学 Air-space-ground integrated network intelligent switching method based on DQN
CN114143150A (en) * 2021-12-09 2022-03-04 中央民族大学 User fairness communication transmission method
CN114423028A (en) * 2022-01-29 2022-04-29 南京邮电大学 CoMP-NOMA (coordinated multi-point-non-orthogonal multiple Access) cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN114423028B (en) * 2022-01-29 2023-08-04 南京邮电大学 CoMP-NOMA cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN114980178A (en) * 2022-06-06 2022-08-30 厦门大学马来西亚分校 Distributed PD-NOMA underwater acoustic network communication method and system based on reinforcement learning
CN115408150A (en) * 2022-06-15 2022-11-29 华为技术有限公司 Calculation strength measuring method, device and related equipment
CN115408150B (en) * 2022-06-15 2023-08-22 华为技术有限公司 Force calculation measurement method and device and related equipment
CN115103372A (en) * 2022-06-17 2022-09-23 东南大学 Multi-user MIMO system user scheduling method based on deep reinforcement learning
CN117176213A (en) * 2023-11-03 2023-12-05 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network
CN117176213B (en) * 2023-11-03 2024-01-30 中国人民解放军国防科技大学 SCMA codebook selection and power distribution method based on deep prediction Q network

Also Published As

Publication number Publication date
CN111901862B (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN111901862B (en) User clustering and power distribution method, device and medium based on deep Q network
Nath et al. Deep reinforcement learning for dynamic computation offloading and resource allocation in cache-assisted mobile edge computing systems
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
Zhao et al. Deep reinforcement learning based mobile edge computing for intelligent Internet of Things
Maksymyuk et al. Deep learning based massive MIMO beamforming for 5G mobile network
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
Rajapaksha et al. Deep learning-based power control for cell-free massive MIMO networks
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN112118287B (en) Network resource optimization scheduling decision method based on alternative direction multiplier algorithm and mobile edge calculation
CN109474980A (en) A kind of wireless network resource distribution method based on depth enhancing study
CN111464465A (en) Channel estimation method based on integrated neural network model
Luo et al. Downlink power control for cell-free massive MIMO with deep reinforcement learning
CN112788605B (en) Edge computing resource scheduling method and system based on double-delay depth certainty strategy
CN105379412A (en) System and method for controlling multiple wireless access nodes
Elbir et al. Federated learning for physical layer design
CN113543342A (en) Reinforced learning resource allocation and task unloading method based on NOMA-MEC
Sadiki et al. Deep reinforcement learning for the computation offloading in MIMO-based Edge Computing
Li et al. Deep learning for energy efficient beamforming in MU-MISO networks: A GAT-based approach
Bhardwaj et al. Deep learning-based MIMO and NOMA energy conservation and sum data rate management system
Xu et al. Deep reinforcement learning for communication and computing resource allocation in RIS aided MEC networks
Giri et al. Deep Q-learning based optimal resource allocation method for energy harvested cognitive radio networks
CN111277308A (en) Wave width control method based on machine learning
Zhao et al. Matching-aided-learning resource allocation for dynamic offloading in mmWave MEC system
Dong et al. Optimization-Driven DRL-Based Joint Beamformer Design for IRS-Aided ITSN Against Smart Jamming Attacks
CN114826833B (en) Communication optimization method and terminal for CF-mMIMO in IRS auxiliary MEC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant