CN115412134A - Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method - Google Patents

Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method

Info

Publication number
CN115412134A
Authority
CN
China
Prior art keywords
user
training
network
state
taking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211051651.4A
Other languages
Chinese (zh)
Inventor
李春国
孙希茜
徐澍
王东明
杨绿溪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202211051651.4A priority Critical patent/CN115412134A/en
Publication of CN115412134A publication Critical patent/CN115412134A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/0413MIMO systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • H04B17/336Signal-to-interference ratio [SIR] or carrier-to-interference ratio [CIR]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/391Modelling the propagation channel
    • H04B17/3911Fading models or fading generators
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/02Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas
    • H04B7/04Diversity systems; Multi-antenna system, i.e. transmission or reception using multiple antennas using two or more spaced independent antennas
    • H04B7/0413MIMO systems
    • H04B7/0426Power distribution
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Electromagnetism (AREA)
  • Power Engineering (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a user-centric cell-free massive MIMO power allocation method based on offline reinforcement learning, which comprises the following steps: constructing a user-centric MIMO system and establishing service relationships between each wireless access point and a subset of the users; formulating an optimization problem with the downlink power control coefficients as the optimization variables and constructing a Markov decision process model for it; building a Dueling DDQN network, training it online, and storing the state-transition data generated by the interaction between the environment and the network during online training; and taking out 20% of the online data set and introducing a regularization term into the loss function to train the network offline. The user-centric power allocation strategy lets each wireless access point serve only a subset of the users; the offline algorithm provided by the invention reduces the training cost, and the power control coefficients can be adjusted offline and in real time in a real scenario by deploying only 20% of the data volume of the online training data set for training.

Description

Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method
Technical Field
The invention belongs to the field of cell-free massive MIMO power allocation, and particularly relates to a user-centric cell-free massive MIMO power allocation method based on offline reinforcement learning.
Background
Wireless communication services permeate every industry: from ordinary phone calls and text messages to emerging fields such as autonomous driving and smart healthcare, services large and small depend on the deployment of wireless networks. To guarantee quality of service, wireless communication services must cover a large geographical area, and conventional systems deploy base stations with a cellular network topology in which each base station serves a group of user equipments. This cellular topology has been in use for decades, and user interference in this setting is reduced by shrinking cell sizes and applying advanced signal processing schemes. In recent years, a new network topology called cell-free massive MIMO has emerged in the field of wireless networking. In a cell-free massive MIMO system, the division into cells is abandoned and the number of access points far exceeds the number of users. The idea of cell-free massive MIMO is to deploy a large number of distributed single-antenna access points (APs) connected to a central processing unit (CPU). The CPU operates the system as a MIMO network without cell boundaries and serves the users jointly through cooperative transmission and reception. Compared with traditional cellular massive MIMO networks, the cell-free scheme offers strong macro-diversity, strong multi-user interference suppression, and the ability to provide users with uniformly good service quality, and it has attracted wide attention and deployment in recent years.
However, cell-free MIMO systems also have problems. Because all APs in the system are fully connected to the UEs, the large power consumption of the fronthaul links has a significant impact on the energy efficiency of a cell-free MIMO system; in multi-antenna scenarios in particular, the fronthaul power consumption grows further as the number of antennas increases, reducing the energy efficiency of the system. In addition, to further improve user transmission rates and thus the user experience, cell-free MIMO systems design power control coefficients through a power allocation strategy. Traditional power control methods need to build an accurate model of the problem and then iterate to the optimal solution; the time complexity of such algorithms is often very high and consumes a large amount of computing resources. With the development of modern computing resources, many algorithms based on deep neural networks have emerged. Existing power allocation strategies based on deep reinforcement learning all use an online training strategy: the algorithm must interact with the environment in real time while training the network in order to obtain more data. However, in practical application scenarios the agent can interact with the environment only within certain time windows, and real-time interaction is unrealistic, so these algorithms usually cannot be put into practical use.
Disclosure of Invention
The invention aims to provide a user-centric cell-free massive MIMO power allocation method based on offline reinforcement learning. For the downlink data transmission phase in a user-centric cell-free massive MIMO scenario, a power allocation method based on a Dueling DDQN network is proposed. After modeling the user-centric cell-free massive MIMO environment, establishing the MDP model, training the Dueling DDQN network online and then training the Dueling DDQN network offline, the power control coefficients of the user-centric cell-free massive MIMO system are finally obtained, which solves the technical problems mentioned in the background art.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
a user-centered cellular-free large-scale MIMO power distribution method based on offline reinforcement learning comprises the following steps:
s1, modeling a non-cellular large-scale MIMO system with a user as a center, determining a service relation between a wireless Access Point (AP) and User Equipment (UE) according to channel estimation of an uplink, taking a power control coefficient in a downlink data transmission stage as an optimization object, and taking the sum of maximized downlink rates as a target to put forward an optimization problem;
s2, according to the optimization problem in the step S1, modeling the optimization process of the power control coefficient in the downlink data transmission stage into a Markov decision process, and determining the state transition, the action space, the strategy and the reward of the Markov decision process;
s3, providing a power distribution algorithm model based on deep reinforcement learning, wherein the model comprises a large-scale MIMO system environment module and an intelligent agent module; the massive MIMO system environment module is used for simulating a channel model and a downlink data transmission model of a cellular-free massive MIMO system with a user as a center, and the agent module is used for sensing the current system state, estimating a Q value of a power distribution strategy and selecting an optimal power distribution coefficient; the core of the intelligent agent module is a deep neural network, and the training mode of the deep neural network comprises early-stage online training and offline training in an application period;
s4, training a deep neural network on line; in the online training stage, before training the deep neural network based on parameters in a data set, a state transition parameter needs to be collected first to update the data set; after a large-scale MIMO system is initialized, firstly, inputting the state of the system into the deep neural network, then selecting a power control coefficient for the current AP based on the Q value output by the deep neural network, implementing a power control strategy in an environment, thereby changing the environment state and obtaining reward, and storing the state transition information of the time; randomly extracting a batch of data from the data set, respectively calculating an accumulated reward value and an expected value by using a deep neural network, and updating parameters of the deep neural network by taking the mean square error of the minimized reward value and the expected value as a target;
s5, training a DuelingDDQN network off line based on the state transition data set collected in the step S4; and (4) taking the first 20% of the state transition data set in the step S4 as an offline training data set, taking a batch of data from the offline data set each time and inputting the data into the deep neural network, respectively calculating the cumulative reward value and the expected value by using the deep neural network, and updating the parameters of the deep neural network by taking the mean square error of the minimal reward value and the expected value as a target, so that the intelligent module can select the optimal power control coefficient.
Further, in step S1, constructing the user-centric cell-free massive MIMO system specifically includes:
step S101, firstly setting a distribution area of a scene, setting N UEs to be served by each AP, wherein M APs and K UEs are randomly distributed, and then establishing large-scale fading and small-scale fading models of channels between the APs and the UEs;
step S102, modeling is carried out on an uplink training stage, and the method specifically comprises the following steps:
firstly, distributing an orthogonal pilot frequency sequence for UE, then enabling the UE to forward the pilot frequency sequence to each AP, and after receiving data, estimating a channel coefficient between the AP and the UE based on a minimum mean square error criterion;
step S103, associating the UE needing service for each AP, which specifically includes:
for each AP, arranging the channel estimation coefficients between the AP and all UEs in descending order, selecting the N UEs with the highest channel coefficients for the AP to establish a service relationship, and forwarding the established service relationship information to the CPU;
step S104, modeling the downlink data transmission phase, specifically comprising:
and the AP regards the channel estimate obtained in step S102 as the true channel coefficient, performs conjugate beamforming on the data to be transmitted, and then sends the precoded data, at a specific power, to the UEs that have established a connection relationship with the current AP.
Further, in step S1, the optimization problem in step S1 is constructed based on the user signal-to-noise ratio, the transmission rate and the power limitation condition in the downlink data transmission phase.
Further, the user signal-to-noise ratio in the downlink data transmission phase is expressed as:
SINR_k = ρ_d ( Σ_{m∈P(k)} √(η_mk) γ_mk )² / [ ρ_d Σ_{k'≠k} ( Σ_{m∈P(k')} √(η_mk') γ_mk' β_mk / β_mk' )² |φ_k'^H φ_k|² + ρ_d Σ_{k'=1}^{K} Σ_{m∈P(k')} η_mk' γ_mk' β_mk + 1 ]
where SINR_k, k = 1, ..., K denotes the signal-to-noise ratio of the kth user, β_mk denotes the large-scale fading of the channel between the mth AP and the kth UE, ρ_d denotes the normalized signal-to-noise ratio of the downlink symbols, ρ_p denotes the normalized signal-to-noise ratio of the pilot symbols, φ_k denotes the pilot sequence of the kth UE, η_mk, m = 1, ..., M, k = 1, ..., K denotes the power control coefficient between the mth AP and the kth UE, and P(k), k = 1, ..., K denotes the set of APs serving the kth user. In the formula, γ_mk = E{|ĝ_mk|²} = √(τ_cf ρ_p) β_mk c_mk, where ĝ_mk denotes the minimum mean square error estimate of the channel between the mth AP and the kth UE, τ_cf denotes the number of uplink training samples in a coherence interval, and the expression for c_mk is:
c_mk = √(τ_cf ρ_p) β_mk / ( τ_cf ρ_p Σ_{k'=1}^{K} β_mk' |φ_k'^H φ_k|² + 1 )
further, the expression of the transmission rate in the downlink data transmission phase is:
Figure BDA0003823649220000044
in the formula, the content of the active carbon is shown in the specification,
Figure BDA0003823649220000045
indicates the transmission rate, SINR, of the k-th UE k K = 1.. K denotes the downlink signal-to-noise ratio of K UEs.
Further, the power allocation optimization problem is expressed as:
max_{η_mk} Σ_{k=1}^{K} R_k^d
s.t. Σ_{k∈T(m)} η_mk γ_mk ≤ 1, m = 1, ..., M
η_mk ≥ 0, k = 1, ..., K, m = 1, ..., M;
where T(m), m = 1, ..., M denotes the index set of UEs that have established a connection relationship with the mth AP; the set contains N indices, meaning that each AP serves N UEs.
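To make the objective concrete, the following Python sketch evaluates the downlink SINR and sum rate for a given coefficient matrix. It assumes the conjugate-beamforming SINR form reconstructed above with the pilot-contamination term dropped (orthogonal pilots), and all function and array names are illustrative rather than taken from the patent.

```python
import numpy as np

def sum_rate(eta, gamma, beta, rho_d=1.0):
    """Downlink sum rate for user-centric cell-free massive MIMO (conjugate beamforming).

    eta   : (M, K) power control coefficients; eta[m, k] = 0 whenever AP m does not serve UE k
    gamma : (M, K) mean-square channel-estimate quality gamma_mk
    beta  : (M, K) large-scale fading coefficients beta_mk
    rho_d : normalized downlink signal-to-noise ratio
    """
    coherent = (np.sqrt(eta) * gamma).sum(axis=0)     # sum over P(k): coherent beamforming gain per UE
    ap_load = (eta * gamma).sum(axis=1)               # per-AP load; the constraint requires ap_load <= 1
    interference = beta.T @ ap_load                   # beamforming-uncertainty term seen by each UE
    sinr = rho_d * coherent**2 / (rho_d * interference + 1.0)
    return np.log2(1.0 + sinr).sum(), sinr

# Toy usage with random coefficients (M = 8 APs, K = 6 UEs)
rng = np.random.default_rng(0)
beta = rng.uniform(0.1, 1.0, (8, 6))
gamma = 0.5 * beta                                    # placeholder estimate quality
eta = rng.uniform(0.0, 0.2, (8, 6))
total_rate, sinr = sum_rate(eta, gamma, beta)
```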
Further, the step S2 of modeling the optimization process of the power control coefficients in the downlink data transmission phase as a Markov decision process specifically includes:
step S201, modeling the optimization of the power allocation coefficients in the system as a sequential decision process whose elements include states, actions, a transition probability and rewards; in this process, each step selects power allocation coefficients for one AP in the user-centric massive MIMO system;
step S202, setting the system state, which describes the signal-to-noise ratio of the users under the current power allocation strategy and specifies the AP whose power control coefficients are optimized at the current moment; when the current system state indicates that the mth AP updates its power control coefficients, the parameters η_mk, k ∈ T(m) will be updated;
step S203, setting the action space, which is a finite set whose elements describe all selectable values of the power control coefficients;
step S204, setting the state transition probability, which describes the probability that the environment moves to a new state after a power allocation strategy is applied to the user-centric massive MIMO system, and takes values in [0, 1];
and S205, setting the reward, which describes the gain in the sum of the transmission rates of the K users after a power allocation strategy is applied to the user-centric massive MIMO system.
Further, the system state in step S202 is expressed as s_t = [SINR, c] ∈ S, where SINR is the user signal-to-noise ratio vector, a K-dimensional vector whose specific expression is:
SINR = [SINR_1, ..., SINR_k, ..., SINR_K],
and c is a one-hot code indicating the AP index, with c ∈ {e_1, ..., e_m, ..., e_M}, where e_m is an M-dimensional vector whose mth entry is 1 and all other entries are 0, indicating that the power control coefficients currently need to be updated for the mth AP; the agent therefore updates, at the current moment, the parameters η_mk, k ∈ T(m) of the user-centric massive MIMO environment, i.e. the power control coefficients between the current AP and the UEs that have established a service relationship with the mth AP are updated, while the power control coefficients between the mth AP and the UEs that have not established a service relationship with it are set to 0, i.e. η_mk = 0 for k ∉ T(m);
in step S203, the action space is a_t = (η_m1, η_m2, ..., η_mK), where η_mk = 0, k ∉ T(m) describes that the power coefficient of a UE that has not established a service relationship with the AP can only take the value 0, and η_mk ∈ {0.1, 0.4, 0.7, 1.0}, m = 1, ..., M, k ∈ T(m) describes the candidate values of the power control coefficient for the UEs that have established a service relationship with the AP.
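As a concrete illustration of this state and action encoding, the short Python sketch below builds the state vector s_t and enumerates the 4^N discrete coefficient combinations available when the mth AP is updated; the function names and the use of NumPy are illustrative assumptions, not part of the patent.

```python
import itertools
import numpy as np

def build_state(sinr, m, num_aps):
    """State s_t = [SINR, c]: the K-dimensional SINR vector followed by the one-hot AP index e_m."""
    c = np.zeros(num_aps)
    c[m] = 1.0                                  # e_m: the mth AP updates its coefficients this step
    return np.concatenate([sinr, c])

def action_set(serving_ues, num_ues, levels=(0.1, 0.4, 0.7, 1.0)):
    """Actions a_t = (eta_m1, ..., eta_mK): served UEs take a level from the set, all others stay 0."""
    actions = []
    for combo in itertools.product(levels, repeat=len(serving_ues)):
        eta_row = np.zeros(num_ues)
        eta_row[list(serving_ues)] = combo
        actions.append(eta_row)
    return actions

# Example: M = 8, K = 6, the 3rd AP (index 2) serves T(m) = {1, 4}, so there are 4**2 = 16 actions
state = build_state(np.random.rand(6), m=2, num_aps=8)
actions = action_set(serving_ues=[1, 4], num_ues=6)
```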
Further, the online training process of the Dueling DDQN network in step S4 specifically includes:
step S401, initializing a large-scale MIMO system environment module taking a user as a center, namely determining distribution and a channel model of an AP and UE; initializing an agent module, namely initializing parameters and a buffer area of a DuelingDDQN network;
s402, collecting state transition data; firstly, inputting a system state into the intelligent agent module, estimating a Q value of the current state by the intelligent agent module, then selecting a power distribution coefficient based on the Q value, transmitting the selected power control coefficient to the large-scale MIMO system environment module for implementation, thereby changing the environment state and obtaining a user signal-to-noise ratio gain as a reward, and finally saving the parameter of the state transition to the cache region;
step S403, training a network; randomly extracting a batch of state transition parameters from the cache region, and taking the system state before transition as the input of the intelligent agent module to enable the intelligent agent to sense the state and estimate the accumulated reward value; then the state after the state transition is used as the input of the intelligent agent module, so that the intelligent agent senses the state and obtains an expected accumulated reward value by combining the reward value information in the state transition;
step S404, updating the network parameters of the DuelingDDQN network by using a back propagation algorithm with the aim of minimizing the mean square error between the accumulated income and the expected value; and continuously and repeatedly carrying out the agent-environment interaction operation from the step S402 to the step S403, thereby continuously updating the network parameters and the data set.
Further, the offline training process of step S5 specifically includes:
further, the step S5 of the offline training process specifically includes:
step S501, initializing a large-scale MIMO system environment module with a user as a center, namely determining distribution and a channel model of an AP and UE; initializing an agent module, namely randomly initializing parameters of the DuelingDDQN network, and taking out the first 20% of data of the state transition parameter data set collected in the step S4 as a data set for offline training;
step S502, randomly extracting a batch of state transition parameters from an off-line training data set, and taking the system state before transition as the input of an agent module to enable the agent to sense the state and estimate an accumulated reward value; then the state after the state transition is used as the input of the intelligent agent module, so that the intelligent agent senses the state and obtains an expected accumulated reward value by combining the reward value information in the state transition; updating the network parameters of the DuelingDDQN network by utilizing a back propagation algorithm with the aim of minimizing the mean square error between the calculated accumulated income and the expected value;
and S503, continuously repeating the step S502, updating parameters of the Dueling DDQN network by using the off-line data set until the signal-to-noise ratio gain of the user converges to a certain value, and stopping network training.
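To make the offline stage concrete, the sketch below performs one Dueling DDQN update on a batch drawn from the fixed 20% data set, with a regularization term added to the loss as the abstract describes; the specific conservative (CQL-style) form of that term, the PyTorch framing and all names are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def offline_update(q_net, target_net, optimizer, batch, discount=0.99, alpha=1.0):
    """One offline update: TD error on logged transitions plus a conservative regularizer."""
    s, a, r, s_next = batch                                 # tensors from the offline data set
    q_all = q_net(s)                                        # (B, |A|) Q-values of all actions
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                                   # double-DQN target: online net selects, target net evaluates
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        target = r + discount * target_net(s_next).gather(1, a_star).squeeze(1)

    td_loss = F.mse_loss(q_sa, target)                      # mean square error between estimate and target
    reg = (torch.logsumexp(q_all, dim=1) - q_sa).mean()     # assumed regularizer: penalize Q of actions absent from the data
    loss = td_loss + alpha * reg

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```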
The off-line reinforcement learning-based user-centric cell-free massive MIMO power allocation method of the invention has the following advantages:
1. compared with a general cell-free massive MIMO system model, the invention uses step S1: the user-centric massive MIMO model reduces the power consumption of the system and improves its energy efficiency;
2. compared with traditional optimization-based power allocation algorithms, the invention uses steps S3 to S5: the reinforcement-learning-based algorithm reduces the time complexity and time cost of the computation;
3. compared with online reinforcement learning algorithms, the invention uses steps S3 to S5: based on the offline reinforcement learning algorithm, the size of the data set is reduced to 20% of that used for online training, and power allocation can be performed in real time for a user-centric cell-free massive MIMO system in practical application scenarios where the environment and the agent cannot interact in real time.
Drawings
Fig. 1 is a schematic flowchart of a user-centric large-scale MIMO power distribution method based on offline reinforcement learning according to embodiment 1 of the present invention;
FIG. 2 is a block diagram of a power distribution model of an offline reinforcement learning algorithm provided in embodiment 1 of the present invention;
fig. 3 is a flowchart of user-centric large-scale MIMO model establishment provided in embodiment 1 of the present invention;
fig. 4 is a schematic diagram of a user-centric large-scale MIMO system provided in embodiment 1 of the present invention;
fig. 5 is a schematic flowchart of online training of the Dueling DDQN network provided in embodiment 1 of the present invention;
fig. 6 is a schematic flowchart of offline training of the Dueling DDQN network provided in embodiment 1 of the present invention;
fig. 7 is an offline training curve of the Dueling DDQN network provided in embodiment 1 of the present invention.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the following describes a user-centric large-scale MIMO power allocation method based on offline reinforcement learning in further detail with reference to the accompanying drawings.
Example 1
Referring to fig. 1 to fig. 7, the present embodiment provides a user-centric large-scale MIMO power allocation method based on offline reinforcement learning, specifically as shown in fig. 1, the method includes the following steps:
step S1, constructing a large-scale MIMO system without cells with users as the center, which specifically comprises the following steps:
firstly, setting a distribution area of a scene, setting the number of randomly distributed wireless Access Points (APs), user Equipment (UE) and the number of UEs to be served by each AP, and then establishing a large-scale fading model and a small-scale fading model of a channel between the APs and the UEs.
Then, orthogonal pilot sequences are allocated to the UEs, the UEs forward their pilot sequences to each AP, and after receiving the data each AP estimates the channel coefficient between itself and each UE based on the minimum mean square error criterion. For each AP, the channel estimation coefficients between the AP and all UEs are sorted in descending order, the N UEs with the largest channel coefficients are selected to establish a service relationship with the AP, and the established service relationship information is forwarded to the CPU.
And the AP terminal carries out conjugate beam forming on the data to be transmitted based on channel estimation and then sends the precoded data to the UE establishing a connection relation with the current AP at a specific power.
The power control coefficients between the APs and the UEs in the downlink data transmission phase of the user-centric cell-free massive MIMO system are then taken as the optimization objects, the objective is to maximize the sum of the UE rates in the downlink phase, and the power allocation optimization problem is constructed based on the user signal-to-noise ratio, the transmission rate and the power constraint of the downlink data transmission phase.
Modeling the optimization process of the power distribution coefficient into a Markov decision process, and determining the state transition, the action space, the strategy and the reward of the Markov decision process.
And S2, modeling the optimization process of the power allocation coefficients as a Markov decision process. The MDP model can be described by a quadruple (S, A, P, R), namely the state space S, the action space A, the state transition probability P and the reward R. The method comprises the following specific steps:
1. The state space S describes the state of the user-centric cell-free massive MIMO system. s_t = [SINR, c] ∈ S, where SINR is the user signal-to-noise ratio vector, a K-dimensional vector, whose specific expression is:
SINR = [SINR_1, ..., SINR_k, ..., SINR_K],
and c is a one-hot code indicating the AP index, with c ∈ {e_1, ..., e_m, ..., e_M}, where e_m is an M-dimensional vector whose mth entry is 1 and all other entries are 0, indicating that the power control coefficients currently need to be updated for the mth AP. The agent therefore updates, at the current moment, the parameters η_mk, k ∈ T(m) of the user-centric massive MIMO environment, i.e. the power control coefficients between the current AP and the UEs that have established a service relationship with the mth AP are updated, while the power control coefficients between the mth AP and the UEs that have not established a service relationship with it are set to 0, i.e. η_mk = 0 for k ∉ T(m).
2. The action space A describes the power control coefficients that the agent can apply to the user-centric cell-free massive MIMO system. In this embodiment, a_t = (η_m1, η_m2, ..., η_mK) ∈ A, where η_mk = 0, k ∉ T(m) describes that the power coefficient of a UE that has not established a service relationship with the AP can only take the value 0, and η_mk ∈ {0.1, 0.4, 0.7, 1.0}, m = 1, ..., M, k ∈ T(m) describes the candidate values of the power control coefficient for the UEs that have established a service relationship with the AP.
3. The state transition probability P takes values in [0, 1]. In this embodiment, it is assumed that the state s_t = [SINR, c_t] transitions to the state s_{t+1} = [SINR', c_{t+1}] after the power control coefficients (η_m1, η_m2, ..., η_mK) are updated in the user-centric cell-free massive MIMO environment.
4. The reward R is expressed in this embodiment as r_t = Σ_{k=1}^{K} R_k^d(t+1) - Σ_{k=1}^{K} R_k^d(t), namely the gain in the sum of all user rates in the user-centric massive MIMO system before and after one state transition.
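A minimal sketch of this reward follows (names are illustrative; rate_fn stands for any routine, such as the sum-rate helper sketched earlier, that maps a coefficient matrix to the downlink sum rate):

```python
def step_reward(rate_fn, eta_old, eta_new):
    """Reward r_t: gain in the downlink sum rate caused by one power-coefficient update."""
    return rate_fn(eta_new) - rate_fn(eta_old)
```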
And S3, constructing a power distribution algorithm model based on deep reinforcement learning, wherein the model comprises a large-scale MIMO system environment module and an intelligent agent module which take a user as a center. The massive MIMO system environment module is used for simulating a channel model and a downlink data transmission model of a cellular-free massive MIMO system with a user as a center, and the intelligent agent module is used for sensing the current system state, estimating the Q value of a power distribution strategy and selecting the optimal power distribution coefficient; the core of the intelligent agent module is a deep neural network, and the training mode of the deep neural network comprises early-stage online training and off-line training in the application period.
And S4, training the Dueling DDQN network online. In the online training phase, state transition parameters are collected to update the data set before the network is trained on the parameters in the data set. After the massive MIMO system is initialized, the system state is first input into the deep neural network, a power control coefficient is then selected for the current AP based on the Q values output by the network, and the power control strategy is applied to the environment, thereby changing the environment state and obtaining a reward, and the state transition information of this step is stored. A batch of data is then randomly drawn from the data set, the cumulative reward value and the target value are computed with the network, and the network parameters are updated with the objective of minimizing the mean square error between them.
And S5, training the Dueling DDQN network offline based on the state transition data set collected in step S4. The first 20% of the state transition data set of step S4 is taken as the offline training data set; a batch of data is drawn from the offline data set each time and input into the network, the cumulative reward value and the target value are computed with the network, and the network parameters are updated with the objective of minimizing the mean square error between them, so that the agent module can finally select the optimal power control coefficients.
Specifically, in this embodiment, a specific structure of the power allocation algorithm model is shown in fig. 2, and more specifically, the power allocation model includes:
user-centric cellular-free massive MIMO environment module: the state transition of a large-scale fading model, a small-scale fading model, an uplink training model, a downlink data transmission model and an MDP model of a channel is simulated, wherein the state transition mode of the MDP model comprises different system states, rewards under power control coefficients and the like.
An online training module: it contains the replay buffer, the Dueling DDQN network and the action selection policy. In the online training phase, state transition parameters are collected to update the data set before the network is trained on the parameters in the data set. After the massive MIMO system is initialized, the system state is first input into the deep neural network, a power control coefficient is then selected for the current AP based on the Q values output by the network, and the power control strategy is applied to the environment, thereby changing the environment state and obtaining a reward, and the state transition information of this step is stored. A batch of data is then randomly drawn from the data set, the cumulative reward value and the target value are computed with the network, and the network parameters are updated with the objective of minimizing the mean square error between them.
An offline training module: it contains the offline training data set and the Dueling DDQN network. The first 20% of the online training buffer data set is taken as the offline training data set; a batch of data is drawn from the offline data set each time and input into the network, the cumulative reward value and the target value are computed with the network, and the network parameters are updated with the objective of minimizing the mean square error between them, so that the agent module can finally select the optimal power control coefficients. The update of the offline training module's network relies entirely on the training data sampled from the buffer and requires no additional interaction with the environment.
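For reference, a minimal PyTorch sketch of the Dueling network at the core of the agent module is given below; the hidden sizes and class name are illustrative assumptions, since the patent does not specify the architecture details.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling head: shared trunk, then separate state-value and advantage streams (batched input)."""
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, num_actions)   # A(s, a)

    def forward(self, s):                                 # s: (B, state_dim)
        h = self.trunk(s)
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)

# Example: state = [SINR (K = 6), one-hot AP index (M = 8)], 16 discrete actions when N = 2
q_net = DuelingDQN(state_dim=6 + 8, num_actions=16)
```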
Specifically, in this embodiment a concrete cell-free massive MIMO system is provided, whose model establishment process is shown in fig. 3; more specifically, the cell-free massive MIMO system is established through the following steps:
Step S101, consider a square area of 1 km², in which M APs and K UEs are randomly distributed and each AP serves N specified UEs. Fig. 4 shows the case M = 8, K = 6, N = 2, where each AP and UE has only a single antenna and the APs are connected to the CPU through an ideal backhaul network. The channel coefficient between the mth AP and the kth UE is denoted g_mk and is defined by the following equation:
g_mk = √(β_mk) h_mk
where h_mk, m = 1, ..., M, k = 1, ..., K represents small-scale fading and follows independent and identically distributed complex Gaussian distributions, and β_mk, m = 1, ..., M, k = 1, ..., K denotes large-scale fading.
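A small NumPy sketch of this channel model follows; the specific path-loss law used for β_mk is an illustrative assumption, since the embodiment only states that a large-scale fading model is established.

```python
import numpy as np

def generate_channels(M=8, K=6, side=1000.0, rng=np.random.default_rng(0)):
    """Random AP/UE drop in a 1 km x 1 km area and channels g_mk = sqrt(beta_mk) * h_mk."""
    ap_pos = rng.uniform(0.0, side, size=(M, 2))
    ue_pos = rng.uniform(0.0, side, size=(K, 2))
    dist = np.linalg.norm(ap_pos[:, None, :] - ue_pos[None, :, :], axis=2)                # (M, K) distances in metres
    beta = 10 ** (-(128.1 + 37.6 * np.log10(np.maximum(dist, 10.0) / 1000.0)) / 10)       # assumed path-loss model
    h = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)     # i.i.d. CN(0, 1) small-scale fading
    return np.sqrt(beta) * h, beta
```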
And step S102, modeling the uplink training phase. Firstly, orthogonal pilot sequences are allocated to the UEs, the UEs then forward their pilot sequences to every AP, and after receiving the data each AP estimates the channel coefficient to each UE based on the minimum mean square error criterion. The channel estimate can be expressed as:
ĝ_mk = c_mk ŷ_p,mk
where ĝ_mk represents the minimum mean square error estimate of the channel between the mth AP and the kth UE, and ŷ_p,mk = φ_k^H y_p,m is the projection of the received signal y_p,m of the mth AP onto the pilot φ_k of the kth UE. The expression for c_mk is:
c_mk = √(τ_cf ρ_p) β_mk / ( τ_cf ρ_p Σ_{k'=1}^{K} β_mk' |φ_k'^H φ_k|² + 1 )
Step S103, associating the UEs to be served with each AP. For each AP, the channel estimation coefficients between the AP and all UEs are sorted in descending order, the N UEs with the largest channel coefficients are selected to establish a service relationship with the AP, and the established service relationship information is forwarded to the CPU. For the mth AP, the users s_m1, s_m2, ..., s_mK are ordered so that
|ĝ_{m,s_m1}| ≥ |ĝ_{m,s_m2}| ≥ ... ≥ |ĝ_{m,s_mK}|;
the mth AP then serves the users s_m1, s_m2, ..., s_mN, i.e. T(m) = {s_m1, s_m2, ..., s_mN}, and there is no data transmission between the mth AP and the other users s_{m,N+1}, ..., s_mK, i.e. η_{m,s_mj} = 0 for j = N + 1, ..., K.
Step S104, in the downlink data transmission phase, the AP regards the channel estimate obtained in step S102 as the true channel coefficient, performs conjugate beamforming on the data to be transmitted, and then sends the precoded data, at a specific power, to the UEs that have established a connection relationship with the current AP. The data received by the kth UE can be expressed as:
r_d,k = √(ρ_d) Σ_{k'=1}^{K} Σ_{m∈P(k')} √(η_mk') g_mk ĝ_mk'* q_k' + w_d,k
where r_d,k denotes the data received by the kth UE in the downlink data transmission phase, ρ_d denotes the normalized signal-to-noise ratio of the downlink symbols, P(k) denotes the set of APs serving the kth user, q_k, k = 1, ..., K denotes the symbol to be transmitted to the kth UE and satisfies E{|q_k|²} = 1, and w_d,k, k = 1, ..., K is additive complex Gaussian noise with mean 0 and variance 1, i.e. w_d,k ~ CN(0, 1). The power control coefficients η_mk satisfy the constraint:
Σ_{k∈T(m)} η_mk γ_mk ≤ 1, m = 1, ..., M
where, as mentioned above, γ_mk = E{|ĝ_mk|²} = √(τ_cf ρ_p) β_mk c_mk.
Step S105, the power allocation problem in the downlink data transmission phase of the user-centric cell-free massive MIMO system is written as:
max_{η_mk} Σ_{k=1}^{K} R_k^d
s.t. Σ_{k∈T(m)} η_mk γ_mk ≤ 1, m = 1, ..., M
η_mk ≥ 0, k = 1, ..., K, m = 1, ..., M
where R_k^d = log2(1 + SINR_k) denotes the transmission rate of the kth UE and SINR_k, k = 1, ..., K denotes the downlink signal-to-noise ratio of the kth UE, which can be expressed as:
SINR_k = ρ_d ( Σ_{m∈P(k)} √(η_mk) γ_mk )² / [ ρ_d Σ_{k'≠k} ( Σ_{m∈P(k')} √(η_mk') γ_mk' β_mk / β_mk' )² |φ_k'^H φ_k|² + ρ_d Σ_{k'=1}^{K} Σ_{m∈P(k')} η_mk' γ_mk' β_mk + 1 ]
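A small sketch of the UE selection rule of step S103 above (each AP serves the N UEs with the strongest channel estimates); the function name and the use of NumPy are illustrative assumptions:

```python
import numpy as np

def associate(g_hat, N):
    """Return, for every AP m, the index set T(m): the N UEs with the largest estimate magnitudes."""
    order = np.argsort(-np.abs(g_hat), axis=1)       # per-AP descending sort of |g_hat[m, k]|
    serve_mask = np.zeros(g_hat.shape, dtype=bool)   # serve_mask[m, k] = True iff k is in T(m)
    for m in range(g_hat.shape[0]):
        serve_mask[m, order[m, :N]] = True
    return [set(order[m, :N]) for m in range(g_hat.shape[0])], serve_mask
```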
specifically, in the present embodiment, fig. 5 shows a specific flow of online training of the duelingdqn network. The method comprises the following steps:
step S401, initializing a large-scale MIMO system environment module taking a user as a center, namely determining distribution and a channel model of an AP and UE; and initializing the intelligent agent module, namely initializing parameters of the DuelingdQN network and a buffer area.
And S402, collecting state transition data. Firstly, inputting the system state into the intelligent agent module, estimating the Q value of the current state by the intelligent agent module, then selecting a power distribution coefficient based on the Q value, transmitting the selected power control coefficient to the large-scale MIMO system environment module for implementation, thereby changing the environment state and obtaining the signal-to-noise ratio gain of a user as reward, and finally saving the parameter of the state transition into the cache region.
And S403, training a network. Randomly extracting a batch of state transition parameters from the cache region, and taking the system state before transition as the input of the intelligent agent module to enable the intelligent agent to sense the state and estimate the accumulated reward value; and then the state after the state transition is used as the input of the intelligent agent module, so that the intelligent agent senses the state and obtains the expected accumulated reward value by combining the reward value information in the state transition.
And S404, updating the network parameters of the DuelingDDQN network by using a back propagation algorithm with the aim of minimizing the mean square error between the accumulated benefit and the expected value. And continuously and repeatedly carrying out the agent-environment interaction operations of S402-S403, thereby continuously updating the network parameters and the data set.
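A condensed PyTorch-style sketch of one S402-S404 iteration (collect a transition, then a double-DQN update) is given below; the epsilon-greedy exploration, the buffer handling, and env.state / env.step as stand-ins for the massive MIMO environment module are all assumptions made for illustration.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def online_iteration(env, q_net, target_net, optimizer, buffer,
                     eps=0.1, discount=0.99, batch_size=64):
    # S402: interact once with the environment and store the transition in the buffer
    s = env.state()
    if random.random() < eps:                                   # assumed epsilon-greedy action selection
        a = random.randrange(env.num_actions)
    else:
        a = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).argmax().item()
    s_next, r = env.step(a)                                     # apply the coefficients; reward = sum-rate gain
    buffer.append((s, a, r, s_next))

    # S403-S404: sample a batch and minimize the MSE between Q(s, a) and the double-DQN target
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    s_b = torch.as_tensor(np.stack([b[0] for b in batch]), dtype=torch.float32)
    a_b = torch.as_tensor([b[1] for b in batch], dtype=torch.int64)
    r_b = torch.as_tensor([b[2] for b in batch], dtype=torch.float32)
    sn_b = torch.as_tensor(np.stack([b[3] for b in batch]), dtype=torch.float32)

    q_sa = q_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = q_net(sn_b).argmax(dim=1, keepdim=True)        # online net selects the action
        target = r_b + discount * target_net(sn_b).gather(1, a_star).squeeze(1)  # target net evaluates it
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```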
Specifically, in this embodiment, fig. 6 shows the specific flow of offline training of the Dueling DDQN network, which comprises the following steps:
step S501, initializing a large-scale MIMO system environment module with a user as a center, namely determining distribution and a channel model of an AP and UE; and initializing an agent module, namely randomly initializing parameters of the DuelingDDQN network, and taking out the first 20% of the data set of the state transition parameters collected in the step S4 as a data set for off-line training.
Step S502, randomly extracting a batch of state transition parameters from an off-line training data set, and taking the system state before transition as the input of an intelligent agent module to enable the intelligent agent to sense the state and estimate an accumulated reward value; and then the state after the state transition is used as the input of the intelligent agent module, so that the intelligent agent senses the state and obtains the expected accumulated reward value by combining the reward value information in the state transition. Updating network parameters of the Dueling DDQN network by using a back propagation algorithm with the aim of minimizing the mean square error between the calculated accumulated benefit and the expected value.
And S503, continuously repeating the step S502, and updating parameters of the DuelingDDQN network by using the offline data set until the number of training steps reaches 10000 rounds.
Specifically, in this embodiment, the cumulative reward curve of the Dueling DDQN network trained with the offline reinforcement learning algorithm is shown in fig. 7, which gives the offline training curve of the network when M = 10, K = 6, N = 4. The abscissa of fig. 7 represents the number of training rounds and the ordinate represents the normalized reward value. The reward begins to level off after about 200 training rounds and converges to approximately 0.71 by round 400. This shows that the offline-reinforcement-learning-based user-centric massive MIMO power allocation algorithm provided in this embodiment obtains good convergence even when trained on only 20% of the online training buffer, and can design suitable power allocation coefficients, which helps improve the energy efficiency of a user-centric cell-free massive MIMO system.
In conclusion, the invention realizes a power allocation method for a user-centric cell-free massive MIMO system based on offline reinforcement learning. By determining the system state, action space, transition probability and reward of the optimization problem, the power allocation optimization process is modeled as an MDP; an offline learning algorithm consisting of a user-centric cell-free massive MIMO environment module, an online learning module and an offline learning module is then constructed, and the parameters of the deep neural network are continuously optimized with the back-propagation algorithm to obtain the power control coefficients that maximize the sum of the user rates in the system. The invention adopts a user-centric cell-free massive MIMO system model, which improves the energy efficiency of the system while guaranteeing quality of service; the invention further provides an algorithm based on offline reinforcement learning, in which an offline data set is used to train the agent to obtain the power allocation coefficients. With the offline training algorithm, the method needs only one round of online training before it can be deployed in a real scenario and dynamically adjust the power allocation coefficients.
Details not described in the present invention are well known to those skilled in the art.
It is to be understood that the present invention has been described with reference to certain embodiments and that various changes in form and details may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A user-centered cellular-free massive MIMO power distribution method based on offline reinforcement learning is characterized by comprising the following steps:
s1, modeling a non-cellular large-scale MIMO system with a user as a center, determining a service relation between a wireless Access Point (AP) and User Equipment (UE) according to channel estimation of an uplink, taking a power control coefficient in a downlink data transmission stage as an optimization object, and taking the sum of maximized downlink rates as a target to put forward an optimization problem;
s2, according to the optimization problem in the step S1, modeling the optimization process of the power control coefficient in the downlink data transmission stage into a Markov decision process, and determining the state transition, the action space, the strategy and the reward of the Markov decision process;
s3, providing a power distribution algorithm model based on deep reinforcement learning, wherein the model comprises a large-scale MIMO system environment module and an intelligent agent module; the massive MIMO system environment module is used for simulating a channel model and a downlink data transmission model of a cellular-free massive MIMO system with a user as a center, and the agent module is used for sensing the current system state, estimating a Q value of a power distribution strategy and selecting an optimal power distribution coefficient; the core of the intelligent agent module is a deep neural network, and the training mode of the deep neural network comprises early-stage online training and offline training in an application period;
s4, training a deep neural network on line; in the online training stage, before training the deep neural network based on parameters in a data set, a state transition parameter needs to be collected first to update the data set; after a large-scale MIMO system is initialized, firstly, inputting the state of the system into the deep neural network, then selecting a power control coefficient for the current AP based on the Q value output by the deep neural network, implementing a power control strategy in an environment, thereby changing the environment state and obtaining reward, and storing the state transition information of the time; then randomly extracting a batch of data from the data set, respectively calculating an accumulated reward value and an expected value by using a deep neural network, and updating parameters of the deep neural network by taking the mean square error of the minimized reward value and the expected value as a target;
s5, training a Dueling DDQN network in an off-line manner based on the state transition data set collected in the S4; and (4) taking the first 20% of the state transition data set in the step S4 as an offline training data set, taking a batch of data from the offline data set each time and inputting the data into the deep neural network, respectively calculating the cumulative reward value and the expected value by using the deep neural network, and updating the parameters of the deep neural network by taking the mean square error of the minimal reward value and the expected value as a target, so that the intelligent module can select the optimal power control coefficient.
2. The off-line reinforcement learning-based user-centric cellular-free massive MIMO power distribution method according to claim 1, wherein in the step S1, the constructing a user-centric massive MIMO system specifically comprises:
step S101, firstly setting a distribution area of a scene, setting N UEs to be served by each AP, wherein M APs and K UEs are randomly distributed, and then establishing large-scale fading and small-scale fading models of channels between the APs and the UEs;
step S102, modeling the uplink training phase, specifically comprising:
firstly, distributing an orthogonal pilot frequency sequence for UE, then enabling the UE to forward the pilot frequency sequence to each AP, and after receiving data, estimating a channel coefficient between the AP and the UE based on a minimum mean square error criterion;
step S103, associating the UE needing service for each AP, which specifically includes:
for each AP, arranging channel estimation coefficients between the AP and all the UEs in a descending order, selecting N UEs with the highest channel coefficients for each AP to establish a service relationship, and forwarding the established service relationship information to a CPU;
step S104, modeling the downlink data transmission phase, specifically comprising:
and the AP regards the channel estimate obtained in step S102 as the true channel coefficient, performs conjugate beamforming on the data to be transmitted, and then sends the precoded data, at a specific power, to the UEs that have established a connection relationship with the current AP.
3. The off-line reinforcement learning-based user-centric large-scale MIMO power allocation method according to claim 2, wherein in step S1, the optimization problem in step S1 is constructed based on user snr, transmission rate and power limitation condition in downlink data transmission phase.
4. The off-line reinforcement learning-based user-centric large-scale MIMO power allocation method according to claim 3, wherein the user signal-to-noise ratio in the downlink data transmission phase is expressed as:
SINR_k = ρ_d ( Σ_{m∈P(k)} √(η_mk) γ_mk )² / [ ρ_d Σ_{k'≠k} ( Σ_{m∈P(k')} √(η_mk') γ_mk' β_mk / β_mk' )² |φ_k'^H φ_k|² + ρ_d Σ_{k'=1}^{K} Σ_{m∈P(k')} η_mk' γ_mk' β_mk + 1 ]
where SINR_k, k = 1, ..., K denotes the signal-to-noise ratio of the kth user, β_mk denotes the large-scale fading of the channel between the mth AP and the kth UE, ρ_d denotes the normalized signal-to-noise ratio of the downlink symbols, ρ_p denotes the normalized signal-to-noise ratio of the pilot symbols, φ_k denotes the pilot sequence of the kth UE, η_mk, m = 1, ..., M, k = 1, ..., K denotes the power control coefficient between the mth AP and the kth UE, and P(k), k = 1, ..., K denotes the set of APs serving the kth user; in the formula, γ_mk = E{|ĝ_mk|²} = √(τ_cf ρ_p) β_mk c_mk, where ĝ_mk denotes the minimum mean square error estimate of the channel between the mth AP and the kth UE, τ_cf denotes the number of uplink training samples in a coherence interval, and the expression for c_mk is:
c_mk = √(τ_cf ρ_p) β_mk / ( τ_cf ρ_p Σ_{k'=1}^{K} β_mk' |φ_k'^H φ_k|² + 1 ).
5. The off-line reinforcement learning-based user-centric cell-free massive MIMO power allocation method according to claim 4, wherein the transmission rate in the downlink data transmission phase is expressed as:
R_k^d = log2(1 + SINR_k)
where R_k^d denotes the transmission rate of the kth UE and SINR_k, k = 1, ..., K denotes the downlink signal-to-noise ratio of the kth UE.
6. The off-line reinforcement learning-based user-centric cell-free massive MIMO power distribution method according to claim 5, wherein the power allocation optimization problem is expressed as:
max_{η_mk} Σ_{k=1}^{K} R_k^d
s.t. Σ_{k∈T(m)} η_mk γ_mk ≤ 1, m = 1, ..., M
η_mk ≥ 0, k = 1, ..., K, m = 1, ..., M;
where T(m), m = 1, ..., M denotes the index set of UEs that have established a connection relationship with the mth AP; the set contains N indices, meaning that each AP serves N UEs.
7. The off-line reinforcement learning-based user-centric large-scale MIMO power allocation method according to claim 6, wherein the step S2 of modeling the optimization process of the power control coefficients in the downlink data transmission phase as a Markov decision process specifically comprises:
step S201, modeling the optimization of the power allocation coefficients in the system as a sequential decision process whose elements include states, actions, a transition probability and rewards; in this process, each step selects power allocation coefficients for one AP in the user-centric massive MIMO system;
step S202, setting the system state, which describes the signal-to-noise ratio of the users under the current power allocation strategy and specifies the AP whose power control coefficients are optimized at the current moment; when the current system state indicates that the mth AP updates its power control coefficients, the parameters η_mk, k ∈ T(m) will be updated;
step S203, setting the action space, which is a finite set whose elements describe all selectable values of the power control coefficients;
step S204, setting the state transition probability, which describes the probability that the environment moves to a new state after a power allocation strategy is applied to the user-centric massive MIMO system, and takes values in [0, 1];
and S205, setting the reward, which describes the gain in the sum of the transmission rates of the K users after a power allocation strategy is applied to the user-centric massive MIMO system.
8. The off-line reinforcement learning-based user-centric cell-free massive MIMO power distribution method according to claim 7, wherein the system state in step S202 is expressed as s_t = [SINR, c] ∈ S, where SINR is the user signal-to-noise ratio vector, a K-dimensional vector whose specific expression is:
SINR = [SINR_1, ..., SINR_k, ..., SINR_K],
and c is a one-hot code indicating the AP index, with c ∈ {e_1, ..., e_m, ..., e_M}, where e_m is an M-dimensional vector whose mth entry is 1 and all other entries are 0, indicating that the power control coefficients currently need to be updated for the mth AP, so that the agent updates, at the current moment, the parameters η_mk, k ∈ T(m) of the user-centric massive MIMO environment, i.e. the power control coefficients between the current AP and the UEs that have established a service relationship with the mth AP are updated, while the power control coefficients between the mth AP and the UEs that have not established a service relationship with it are set to 0, i.e. η_mk = 0 for k ∉ T(m);
in step S203, the action space is a_t = (η_m1, η_m2, ..., η_mK), where η_mk = 0, k ∉ T(m) describes that the power coefficient of a UE that has not established a service relationship with the AP can only take the value 0, and η_mk ∈ {0.1, 0.4, 0.7, 1.0}, m = 1, ..., M, k ∈ T(m) describes the candidate values of the power control coefficient for the UEs that have established a service relationship with the AP.
9. The off-line reinforcement learning-based user-centric large-scale MIMO power allocation method according to claim 8, wherein the online training process of the Dueling DDQN network in step S4 specifically comprises:
step S401, initializing a large-scale MIMO system environment module taking a user as a center, namely determining distribution and a channel model of an AP and UE; initializing an agent module, namely initializing parameters and a buffer area of a Dueling DDQN network;
s402, collecting state transition data; firstly, inputting a system state into the intelligent agent module, estimating a Q value of the current state by the intelligent agent module, then selecting a power distribution coefficient based on the Q value, transmitting the selected power control coefficient to the large-scale MIMO system environment module for implementation, thereby changing the environment state and obtaining a user signal-to-noise ratio gain as a reward, and finally storing a parameter of the state transition into the cache region;
step S403, training the network: a batch of state transition parameters is randomly sampled from the replay buffer, and the pre-transition system state is fed to the agent module so that the agent perceives the state and estimates the accumulated reward value; the post-transition state is then fed to the agent module so that the agent perceives that state and, combined with the reward value recorded in the transition, obtains the expected accumulated reward value;
step S404, updating the parameters of the Dueling DDQN network by a back-propagation algorithm with the objective of minimizing the mean square error between the estimated accumulated reward and its expected value; the agent-environment interaction of steps S402 to S403 is repeated continuously, so that the network parameters and the data set are updated continuously.
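For illustration only, a minimal PyTorch sketch of a Dueling network and one double-DQN update consistent with steps S403-S404; the layer sizes, the discount factor gamma, the use of a separate target network, and the absence of a terminal-state mask are assumptions not specified in the claim:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNet(nn.Module):
    """Dueling architecture: shared trunk, then separate state-value V(s) and
    advantage A(s, a) streams combined as Q = V + A - mean(A)."""
    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=1, keepdim=True)

def ddqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One update of steps S403-S404: estimate Q for the pre-transition state,
    build the double-DQN target from the post-transition state and the reward,
    and minimize the mean-squared error by back-propagation."""
    s, a, r, s_next = batch                                   # tensors sampled from the buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) of the taken actions
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)    # online net selects the action
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)  # target net evaluates it
        target = r + gamma * q_next
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```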
10. The off-line reinforcement learning-based user-centric cell-free massive MIMO power distribution method according to claim 9, wherein the offline training process of step S5 specifically comprises:
step S501, initializing the user-centric cell-free massive MIMO system environment module, namely determining the distribution of the APs and UEs and the channel model; initializing the agent module, namely randomly initializing the parameters of the Dueling DDQN network; and taking the first 20% of the state transition parameter data set collected in step S4 as the data set for offline training;
step S502, randomly sampling a batch of state transition parameters from the offline training data set, feeding the pre-transition system state to the agent module so that the agent perceives the state and estimates the accumulated reward value, then feeding the post-transition state to the agent module so that the agent perceives that state and, combined with the reward value recorded in the transition, obtains the expected accumulated reward value; and updating the parameters of the Dueling DDQN network by a back-propagation algorithm with the objective of minimizing the mean square error between the estimated accumulated reward and its expected value;
and step S503, repeating step S502 continuously, updating the parameters of the Dueling DDQN network with the offline data set until the users' SINR gain converges to a certain value, at which point network training is stopped.
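For illustration only, a minimal Python sketch of the offline phase of claim 10, reusing ddqn_update from the sketch above on a fixed data set; the batch size, target-network synchronization period, the hypothetical to_tensors helper, and the loss-plateau stopping rule (the claim itself stops when the users' SINR gain converges) are all assumptions:

```python
import random

def offline_train(q_net, target_net, optimizer, transitions, frac=0.2,
                  batch_size=64, target_sync=100, max_steps=20000, tol=1e-3):
    """Offline phase (claim 10): q_net is assumed freshly (randomly) initialized
    per step S501, and only the first `frac` of the transitions logged during
    online interaction is reused; no new environment interaction takes place."""
    dataset = transitions[: int(frac * len(transitions))]   # first 20% of the online log
    losses = []
    for step in range(max_steps):
        batch = to_tensors(random.sample(dataset, batch_size))  # hypothetical tensor-conversion helper
        losses.append(ddqn_update(q_net, target_net, optimizer, batch))
        if step % target_sync == 0:
            target_net.load_state_dict(q_net.state_dict())   # periodic target-network sync
        # crude stand-in for "until the SINR gain converges": stop once the loss plateaus
        if len(losses) > 200 and abs(losses[-1] - losses[-200]) < tol:
            break
```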
CN202211051651.4A 2022-08-31 2022-08-31 Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method Pending CN115412134A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211051651.4A CN115412134A (en) 2022-08-31 2022-08-31 Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method

Publications (1)

Publication Number Publication Date
CN115412134A true CN115412134A (en) 2022-11-29

Family

ID=84162736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211051651.4A Pending CN115412134A (en) 2022-08-31 2022-08-31 Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method

Country Status (1)

Country Link
CN (1) CN115412134A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110798842A (en) * 2019-01-31 2020-02-14 湖北工业大学 Heterogeneous cellular network flow unloading method based on multi-user deep reinforcement learning
US20210356923A1 (en) * 2020-05-15 2021-11-18 Tsinghua University Power grid reactive voltage control method based on two-stage deep reinforcement learning
CN114268348A (en) * 2021-12-21 2022-04-01 东南大学 Honeycomb-free large-scale MIMO power distribution method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李孜恒; 孟超: "基于深度强化学习的无线网络资源分配算法" [Wireless network resource allocation algorithm based on deep reinforcement learning], 通信技术 (Communications Technology), no. 08, 10 August 2020 (2020-08-10) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032332A (en) * 2022-12-30 2023-04-28 哈尔滨工程大学 ICGAN network-based large-scale MIMO system detection model construction method
CN116032332B (en) * 2022-12-30 2024-04-12 哈尔滨工程大学 Large-scale MIMO system detection model construction method suitable for changeable channel state information
WO2024140512A1 (en) * 2022-12-30 2024-07-04 维沃移动通信有限公司 Data set determination method, information transmission method, apparatus, and communication device
CN116761150A (en) * 2023-08-18 2023-09-15 华东交通大学 High-speed rail wireless communication method based on AP and STAR-RIS unit selection
CN116761150B (en) * 2023-08-18 2023-10-24 华东交通大学 High-speed rail wireless communication method based on AP and STAR-RIS unit selection

Similar Documents

Publication Publication Date Title
Cao et al. Deep reinforcement learning for multi-user access control in non-terrestrial networks
CN115412134A (en) Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method
Huang et al. Deep learning-based sum data rate and energy efficiency optimization for MIMO-NOMA systems
CN109617584B (en) MIMO system beam forming matrix design method based on deep learning
CN108924935A (en) A kind of power distribution method in NOMA based on nitrification enhancement power domain
Chu et al. Power control in energy harvesting multiple access system with reinforcement learning
CN103763782B (en) Dispatching method for MU-MIMO down link based on fairness related to weighting users
CN111431646B (en) Dynamic resource allocation method in millimeter wave system
CN114337976A (en) Transmission method combining AP selection and pilot frequency allocation
CN114268348A (en) Honeycomb-free large-scale MIMO power distribution method based on deep reinforcement learning
Li et al. Deep reinforcement learning for energy-efficient beamforming design in cell-free networks
CN104009824B (en) Pilot aided data fusion method based on differential evolution in a kind of base station collaboration up-line system
CN114302487A (en) Energy efficiency optimization method, device and equipment based on adaptive particle swarm power distribution
Sun et al. Hierarchical reinforcement learning for AP duplex mode optimization in network-assisted full-duplex cell-free networks
Cui et al. Hierarchical learning approach for age-of-information minimization in wireless sensor networks
CN114745032B (en) Honeycomb-free large-scale MIMO intelligent distributed beam selection method
Ying et al. Heterogeneous massive MIMO with small cells
CN113595609B (en) Collaborative signal transmission method of cellular mobile communication system based on reinforcement learning
CN114710187A (en) Power distribution method for multi-cell large-scale MIMO intelligent communication under dynamic user number change scene
CN114844537A (en) Deep learning auxiliary robust large-scale MIMO transceiving combined method
CN110086591B (en) Pilot pollution suppression method in large-scale antenna system
Liu et al. A reinforcement learning approach for energy efficient beamforming in noma systems
Li et al. Distributed RIS-enhanced cell-free NOMA networks
CN113038583B (en) Inter-cell downlink interference control method, device and system
CN114884545B (en) Real-time power distribution method for multi-cell large-scale MIMO system based on intelligent optimization algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination