CN111182644A - Joint retransmission URLLC resource scheduling method based on deep reinforcement learning - Google Patents

Joint retransmission URLLC resource scheduling method based on deep reinforcement learning

Info

Publication number
CN111182644A
Authority
CN
China
Prior art keywords
urllc
resource scheduling
slot
reinforcement learning
mini
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911348750.7A
Other languages
Chinese (zh)
Other versions
CN111182644B (en)
Inventor
赵中原
李阳
高慧慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201911348750.7A
Publication of CN111182644A
Application granted
Publication of CN111182644B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/535 Allocation or scheduling criteria for wireless resources based on resource usage policies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/12 Wireless traffic scheduling
    • H04W72/1263 Mapping of traffic onto schedule, e.g. scheduled allocation or multiplexing of flows
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/56 Allocation or scheduling criteria for wireless resources based on priority criteria
    • H04W72/566 Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient
    • H04W72/569 Allocation or scheduling criteria for wireless resources based on priority criteria of the information or information source or recipient of the traffic information

Abstract

The invention discloses a combined retransmission URLLC resource scheduling method based on deep reinforcement learning, which comprises the following steps: collecting data packet information and channel information of URLLC as training data; establishing a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and training model parameters by using training data; performing performance evaluation on the obtained URLLC resource scheduling decision model of deep reinforcement learning until performance requirements are met; collecting URLLC data packet information and channel information of the current mini-slot; inputting the obtained information into a URLLC resource scheduling decision model based on deep reinforcement learning to obtain a resource scheduling decision result; and according to the resource scheduling decision result, carrying out resource allocation on the URLLC data packet. The method trains the URLLC data packet information and the channel state information based on the deep reinforcement learning method to obtain a URLLC data packet scheduling resource decision result, reasonably distributes scheduling resources according to the decision result, and solves the problem of power and time-frequency resource waste on the basis of meeting the URLLC transmission requirement.

Description

Joint retransmission URLLC resource scheduling method based on deep reinforcement learning
Technical Field
The invention relates to the field of wireless communication, in particular to a combined retransmission URLLC resource scheduling method based on deep reinforcement learning.
Background
In order to meet the requirements of different scenario services on delay, reliability, mobility and so on in the future, the ITU in 2015 formally defined three major scenarios for the future 5G network: enhanced mobile broadband (eMBB), massive machine-type communication (mMTC), and ultra-reliable and low-latency communication (URLLC). The eMBB scenario builds on existing mobile broadband services and mainly pursues further improvements in user experience and the ultimate person-to-person communication experience. mMTC and URLLC are both Internet-of-Things application scenarios, but with different emphases: mMTC mainly concerns information interaction between people and things, while URLLC mainly reflects the communication requirements between things.
In the prior art, URLLC is widely applied in emerging fields such as remote control and smart driving thanks to its low-delay, high-reliability transmission requirements, and it is a key direction of 5G research, so URLLC scenario services are a current hot topic. To meet the low-delay requirement of URLLC, one approach is to adopt a 60 kHz subcarrier spacing, which reduces the slot length to 1/4 of that of LTE; to shorten it further, URLLC adopts 4 symbols as a mini-slot, reducing the transmission length to 1/14 of the LTE slot. However, once the mini-slot transmission mode is adopted, a failed demodulation of URLLC service data also brings a large latency overhead, which challenges the low-delay requirement of URLLC.
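As a rough arithmetic check on the 1/4 and 1/14 figures above, a small sketch (assuming the usual 14 OFDM symbols per slot and ignoring cyclic-prefix details) gives:

```python
# Back-of-the-envelope numerology arithmetic; figures are approximate.
lte_slot_ms = 1.0                       # LTE TTI: 1 ms slot at 15 kHz subcarrier spacing
slot_60khz_ms = lte_slot_ms * 15 / 60   # 60 kHz spacing -> 0.25 ms, i.e. 1/4 of LTE
mini_slot_ms = slot_60khz_ms * 4 / 14   # 4-symbol mini-slot out of 14 symbols
print(slot_60khz_ms, round(mini_slot_ms, 4))   # 0.25  0.0714  (about 1/14 of 1 ms)
```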
For example, the invention patent with Chinese patent publication No. CN109561504A discloses a resource multiplexing method of URLLC and eMBB based on deep reinforcement learning, which collects data packet information, channel information and queue information of M mini-slots of URLLC and eMBB as training data; establishes a URLLC and eMBB resource multiplexing model based on deep reinforcement learning and trains the model parameters with the training data; evaluates the performance of the trained model until the performance requirement is met; collects the current mini-slot URLLC and eMBB data packet information, channel information and queue information and inputs the collected information into the trained model to obtain a resource multiplexing decision result; and, according to the resource multiplexing decision result, allocates resources to the eMBB and URLLC data packets of the current mini-slot. This enables reasonable allocation and utilization of time-frequency resources and power while meeting the transmission requirements of the eMBB and URLLC data packets.
The prior art has at least the following problems:
However, if the joint transmission mode with multiple redundant copies is adopted, the limited time-frequency resources are severely wasted. How to schedule URLLC services within limited resources, so that the URLLC transmission requirements are met while the resources are used efficiently, is therefore an urgent problem for which no effective solution has yet been proposed.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a combined retransmission URLLC resource scheduling method based on deep reinforcement learning.
The method comprises the following operation steps:
step 1, collecting data packet information and channel information of URLLC (ultra-reliable low-latency communication) as training data: the base station obtains the number of bits of the URLLC data packets arriving in M mini-slots and the corresponding channel gains, and takes the data packet information and channel information of the kth mini-slot as training data; the specific steps are as follows:
step 1.1, obtaining the downlink channel gain g_k of the current mini-slot through the Channel Quality Indicator (CQI) information periodically uploaded by the UE (User Equipment);
Step 1.2, the base station packages the URLLC service in the service queue, generates the data packet sent by the kth mini-slot URLLC service, and obtains the bit number N_k of the URLLC data packet;
Step 1.3, packaging the obtained information into the state information s_k = [g_k, N_k, Q_k^1, Q_k^2, ..., Q_k^M], where Q_k^M represents the Mth queue length of the URLLC data packets of the kth mini-slot;
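Purely as an illustration of this packaging step, a sketch is given below; the function name, array types and the example numbers are assumptions, not values taken from the patent:

```python
import numpy as np

def build_state(g_k: float, n_k: int, queue_lengths: np.ndarray) -> np.ndarray:
    """Assemble s_k = [g_k, N_k, Q_k^1, ..., Q_k^M] for one mini-slot.

    g_k: downlink channel gain reported via CQI for the current mini-slot
    n_k: number of bits in the URLLC packet generated for this mini-slot
    queue_lengths: current lengths of the M (re)transmission queues
    """
    return np.concatenate(([g_k, float(n_k)], queue_lengths)).astype(np.float32)

# Example with M = 3 queues: a fresh 256-bit packet sits in queue 1.
s_k = build_state(g_k=0.8, n_k=256, queue_lengths=np.array([256.0, 0.0, 0.0]))
```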
step 2, establishing a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and training model parameters by using training data, wherein the method specifically comprises the following steps:
step 2.1, a neural network in a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning is constructed and initialized, and the specific steps are as follows:
step 2.1.1, set the action vector space a = [bool, R_1, R_2, ..., R_M], where bool represents the transmission mode of the URLLC service in the current mini-slot (1 means redundancy-version, i.e. duplicate, transmission, 0 means single-link transmission), and R_M represents the number of bits of the Mth queue processed in the current mini-slot;
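Because step 2.1.3 below equates n_out with the number of possible values of a, it can help to see how this vector action space flattens into a set of discrete actions; the bit budgets and queue count in the sketch below are purely hypothetical:

```python
from itertools import product

BIT_CHOICES = (0, 128, 256)   # assumed R_m values (bits handled per mini-slot)
M = 3                         # assumed number of queues

# Enumerate a = [bool, R_1, ..., R_M]; the index into ACTIONS is what the DQN outputs.
ACTIONS = [(b,) + r for b in (0, 1) for r in product(BIT_CHOICES, repeat=M)]
print(len(ACTIONS))           # 2 * 3**M = 54 discrete actions
```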
step 2.1.2, constructing two identical neural networks, eval and next: the eval network is used to obtain the action-value function Q of the current state and to select the action a, while the next network takes the maximal action value of the next state, max_a' Q', to compute the target action value Q_target, which is then used to update the parameters of the eval network;
step 2.1.3, setting the eval network parameters C = [n, n_h, n_in, n_out, θ, bias, activate], where n denotes the number of hidden layers of the neural network, n_h = [n_h1, n_h2, ..., n_hn] denotes the number of neurons in each hidden layer, n_in is the number of input-layer neurons and equals the length of the state vector s, n_out is the number of output-layer neurons and equals the number of possible values of the action vector a, θ denotes the weights and is randomly initialized in the range 0 to w, bias denotes the biases and is initialized to b, and activate denotes the activation function, for which the ReLU (Rectified Linear Unit) is adopted;
step 2.1.4, initializing the next neural network parameters C' = C;
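A minimal sketch of steps 2.1.2 to 2.1.4 using PyTorch is shown below; the hidden-layer widths and the input/output dimensions are illustrative assumptions rather than values fixed by the patent:

```python
import copy
import torch.nn as nn

def make_q_network(n_in: int, n_out: int, n_h=(64, 64)) -> nn.Sequential:
    """Fully connected Q-network with ReLU activations, as described in step 2.1.3."""
    layers, prev = [], n_in
    for width in n_h:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, n_out))     # one Q-value per discrete action
    return nn.Sequential(*layers)

n_state, n_actions = 5, 54                    # assumed: len(s_k) with M = 3, and |A| from above
eval_net = make_q_network(n_state, n_actions) # "eval" network: scores states and picks actions
next_net = copy.deepcopy(eval_net)            # "next" (target) network, initialized with C' = C
```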
step 2.2, inputting data in the training data into a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, training model parameters, taking the data of the kth mini-slot in the training data as an example, and specifically comprising the following steps:
step 2.2.1, inputting the data s_k of the kth mini-slot into the eval neural network of the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and calculating the ith queue length of URLLC according to formula (1) (given as an image in the original document), where z represents the number of mini-slots between retransmission intervals;
step 2.2.2, setting a probability ε_a: with probability ε_a, select a random action a_k from the action space, and with probability (1 − ε_a), select from the eval neural network the action a_k satisfying argmax_a Q(s, a; θ);
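One possible reading of this ε-greedy rule, sketched with the eval network defined above (the tensor handling is an assumption):

```python
import random
import torch

def select_action(eval_net, s_k, epsilon: float) -> int:
    """Step 2.2.2: with probability epsilon pick a random action index,
    otherwise return argmax_a Q(s, a; theta) from the eval network."""
    n_actions = eval_net[-1].out_features
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = eval_net(torch.as_tensor(s_k, dtype=torch.float32))
    return int(q_values.argmax().item())
```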
Step 2.2.3, calculating the reward r_k obtained by the action a_k selected in step 2.2.2 and the next state s_{k+1} that is reached: according to the selected action a_k = [bool_k, R_1^k, R_2^k, ..., R_M^k], the signal-to-noise ratio of the kth mini-slot is calculated according to formula (2) (given as an image in the original document), in which σ² represents the Gaussian noise power and P_k represents the power allocated to the kth slot; when bool = 0, single-link transmission is used and the signal-to-noise ratio takes the single-link form of formula (2); when bool = 1, the duplicate (copy) transmission mode is adopted and the signal-to-noise ratio takes the corresponding multi-copy form;
for URLLC traffic, its transmission rate is calculated according to the following equation (3):
Figure BDA0002334118290000042
wherein:
Figure BDA0002334118290000049
indicating channel separation;
calculating the transmission error rate of the URLLC data at the kth mini-slot according to the following formula (4):
Figure BDA0002334118290000043
the queue length of URLLC is calculated according to equation (5) below:
Figure BDA0002334118290000044
Figure BDA0002334118290000045
wherein: z represents the corresponding mini-slot number between retransmission intervals;
calculating the time required for the current arrival service to be transmitted according to the following formula (6):
Figure BDA0002334118290000046
wherein: count (x) represents the number of retransmissions required when x is zero;
the reward obtained by action a_k is calculated according to formula (7) (given as an image in the original document), where P represents the URLLC transmit power, Q represents the URLLC queue length, flag indicates a retransmission, and ω_1, ω_2, ω_3, ω_4 are constants; the action-value function Q is calculated from the Bellman equation:
Q(s_k, a_k) = E[r_{k+1} + λ r_{k+2} + λ² r_{k+3} + ... | s_k, a_k]
            = E[r_k + λ Q(s_{k+1}, a_{k+1}) | s_k, a_k],
namely, the current Q value equals the reward r_k obtained by taking action a_k plus the discounted Q value of the next state; the parameters of the next state are then calculated according to the corresponding formula (given as an image in the original document);
Step 2.2.4, converting(s) obtained in step 2.2.3k,ak,rk,sk+1) Storing the data into a memory unit D for the next training of the model;
step 2.2.5, randomly taking F samples from the memory unit D and inputting s_{k+1} into the next neural network to obtain the maximal action value max_{a_{k+1}} Q′;
Step 2.2.6, the loss function is calculated according to the following equation (8):
Loss=(Qtarget-Q(sk,ak;θ))2
Figure BDA0002334118290000051
wherein: q represents an action valuation function, theta represents a weight parameter of the current neural network, and gamma represents a discount factor;
step 2.2.7, updating eval neural network weight parameters by adopting a gradient descent method, and calculating the gradient according to the following formula (9):
∂Loss/∂θ = −2 (Q_target − Q(s_k, a_k; θ)) ∂Q(s_k, a_k; θ)/∂θ
according to the calculated gradient, the direction of steepest descent is selected to update the weight parameter θ;
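Steps 2.2.5 to 2.2.7 together amount to one standard DQN gradient step; the sketch below assumes the sampled batch has already been stacked into tensors and is not claimed to match the patent's exact hyper-parameters:

```python
import torch
import torch.nn.functional as F

def dqn_update(eval_net, next_net, optimizer, batch, gamma: float) -> float:
    """One update: Q_target = r + gamma * max_a' Q'(s', a') from the "next"
    network (formula (8)), squared loss against Q(s, a; theta), then a
    gradient-descent step on the eval network (formula (9))."""
    s, a, r, s_next = batch                         # float, long, float, float tensors
    q_sa = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_target = r + gamma * next_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, q_target)
    optimizer.zero_grad()
    loss.backward()                                 # gradient of the squared loss
    optimizer.step()
    return loss.item()
```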
step 2.3, every I iterations, setting θ′ = θ, i.e. copying the parameters of the eval neural network into the next neural network to update it;
step 2.4, repeating steps 2.2-2.3 to train the model continuously until the loss function converges;
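Putting steps 2.2 to 2.3 together, the outer loop of step 2.4 might look as follows, reusing the select_action, ReplayMemory and dqn_update sketches above; the environment interface (reset/step), the to_tensors helper, and all hyper-parameter values are assumptions made only to keep the sketch self-contained:

```python
def train(env, eval_net, next_net, memory, optimizer, to_tensors,
          episodes=1000, epsilon=0.1, gamma=0.9, f=32, sync_every=100):
    """Repeat steps 2.2-2.3 (step 2.4); an explicit convergence test on the
    loss is omitted here for brevity."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = select_action(eval_net, s, epsilon)            # step 2.2.2
            s_next, r, done = env.step(a)                      # step 2.2.3 (reward, next state)
            memory.push(s, a, r, s_next)                       # step 2.2.4
            if len(memory) >= f:
                dqn_update(eval_net, next_net, optimizer,
                           to_tensors(memory.sample(f)), gamma)  # steps 2.2.5-2.2.7
            if step % sync_every == 0:                         # step 2.3: theta' = theta
                next_net.load_state_dict(eval_net.state_dict())
            s, step = s_next, step + 1
```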
and 3, performing performance evaluation on the obtained combined retransmission URLLC resource scheduling decision model for deep reinforcement learning until the performance requirement is met, wherein the specific steps are as follows:
step 3.1, continuing to use the data obtained in step 1 as the state vector s_k = [g_k, N_k, Q_k^1, ..., Q_k^M] and inputting it into the trained DQN (Deep Q-Network) model obtained in step 2 to obtain the resource scheduling decision result;
step 3.2, allocating resources to the URLLC data packet according to the decision result obtained in step 3.1; when the allocation result satisfies that the transmission delay of the URLLC is less than p_l and the transmission error rate is less than p_e, the performance evaluation process is complete and step 4 is performed; if the requirements are not met, return to step 2 and continue training the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning until the performance requirements are met;
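The acceptance test of step 3.2 reduces to two threshold checks, sketched below with assumed example thresholds:

```python
def meets_urllc_requirements(delays, error_rates, p_l: float, p_e: float) -> bool:
    """Return True only if every scheduled packet's transmission delay stays
    below p_l and its transmission error rate stays below p_e (step 3.2)."""
    return all(d < p_l for d in delays) and all(e < p_e for e in error_rates)

# Example thresholds in the spirit of URLLC targets (assumed values):
print(meets_urllc_requirements([0.4e-3, 0.7e-3], [1e-6, 4e-6], p_l=1e-3, p_e=1e-5))
```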
and 4, collecting URLLC data packet information and channel information of the current mini-slot, and specifically comprising the following steps of:
step 4.1, acquiring the size of a data packet encapsulated by the current mini-slot base station for the incoming URLLC service from the base station side;
step 4.2, acquiring the current mini-slot downlink channel gain g through the CQI information periodically uploaded by the UE;
step 5, inputting the URLLC data packet information and channel information obtained in step 4 into the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning to obtain the resource scheduling decision result: the obtained current state information and queue length information are combined into a state vector s_k = [g_k, N_k, Q_k^1, ..., Q_k^M], which is input into the trained combined retransmission URLLC resource scheduling decision model to obtain the resource scheduling decision result;
and 6, performing resource allocation on the URLLC data packet according to the resource scheduling decision result, and specifically comprising the following steps:
step 6.1, according to the URLLC resource scheduling result obtained in step 5, the RNC (Radio Network Controller) indicates, through the Radio Resource Control (RRC) sublayer, the power allocated to the URLLC and the number of transmission bits allocated to the URLLC data packet;
step 6.2, by configuring a pre-indication (PI) in the downlink DCI (Downlink Control Information) signalling, the single-link or multi-link transmission mode to be adopted by the current URLLC mini-slot is instantly indicated, thereby achieving reasonable allocation of time-frequency resources and power for the URLLC data packet service and efficient utilization of the limited time-frequency resources.
Compared with the prior art, the method has the following remarkable advantages:
1. The URLLC data packet information and channel state information are trained with a deep reinforcement learning method to obtain the resource scheduling decision result, so that time-frequency resources and power are reasonably allocated and utilized while the transmission requirements of the URLLC data packets are met.
2. Transmission with multiple queues allows different transmission priorities to be set for different queues, so retransmission queues can be transmitted more flexibly and on demand, reducing the retransmission delay.
Drawings
Fig. 1 is a flowchart of a URLLC resource scheduling method based on deep reinforcement learning in the present invention;
fig. 2 is a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Referring to fig. 1, the method comprises the steps of:
step 1, collecting data packet information and channel information of URLLC as training data:
specifically, the base station obtains the bit number of a URLLC data packet arriving at M mini-slots and the gain of a corresponding channel, and takes the data packet information and channel information of the kth mini-slot as training data, and the specific steps are as follows:
step 1.1, acquiring the current mini-slot downlink channel gain g_k through the CQI information periodically uploaded by the UE;
Step 1.2, the base station packages the URLLC service in the service queue, generates the data packet sent by the kth mini-slot URLLC service, and obtains the bit number N_k of the URLLC data packet;
Step 1.3, packaging the obtained information into state information
Figure BDA0002334118290000061
Wherein
Figure BDA0002334118290000062
The Mth queue length of the URLLC data packet of the kth mini-slot is represented;
step 2, establishing a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and training model parameters by using training data, wherein the method specifically comprises the following steps:
step 2.1, a neural network in a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning is constructed and initialized, and the specific steps are as follows:
step 2.1.1, set the action vector space a = [bool, R_1, R_2, ..., R_M], where bool represents the transmission mode of the URLLC service in the current mini-slot (1 means redundancy-version, i.e. duplicate, transmission, 0 means single-link transmission), and R_M represents the number of bits of the Mth queue processed in the current mini-slot;
step 2.1.2, constructing two identical neural networks, eval and next: the eval network is used to obtain the action-value function Q of the current state and to select the action a, while the next network takes the maximal action value of the next state, max_a' Q', to compute the target action value Q_target, which is then used to update the parameters of the eval network;
step 2.1.3, setting the eval network parameters C = [n, n_h, n_in, n_out, θ, bias, activate], where n denotes the number of hidden layers of the neural network, n_h = [n_h1, n_h2, ..., n_hn] denotes the number of neurons in each hidden layer, n_in is the number of input-layer neurons and equals the length of the state vector s, n_out is the number of output-layer neurons and equals the number of possible values of the action vector a, θ denotes the weights and is randomly initialized in the range 0 to w, bias denotes the biases and is initialized to b, and activate denotes the activation function, for which the ReLU (Rectified Linear Unit) is adopted;
step 2.1.4, initializing the next neural network parameters C' = C;
step 2.2, inputting data in the training data into a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, training model parameters, taking the data of the kth mini-slot in the training data as an example, and specifically training the following process:
step 2.2.1, inputting the data s_k of the kth mini-slot into the eval neural network of the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and calculating the ith queue length of URLLC according to formula (1) (given as an image in the original document), where z represents the number of mini-slots between retransmission intervals;
step 2.2.2, setting a probability ε_a: with probability ε_a, select a random action a_k from the action space, and with probability (1 − ε_a), select from the eval neural network the action a_k satisfying argmax_a Q(s, a; θ);
Step 2.2.3, calculating the reward r_k obtained by the action a_k selected in step 2.2.2 and the next state s_{k+1} that is reached: according to the selected action a_k = [bool_k, R_1^k, R_2^k, ..., R_M^k], the signal-to-noise ratio of the kth mini-slot is calculated according to formula (2) (given as an image in the original document), in which σ² represents the Gaussian noise power and P_k represents the power allocated to the kth slot; when bool = 0, single-link transmission is used and the signal-to-noise ratio takes the single-link form of formula (2); when bool = 1, the duplicate (copy) transmission mode is adopted and the signal-to-noise ratio takes the corresponding multi-copy form;
for URLLC traffic, its transmission rate is calculated according to the following equation (3):
Figure BDA0002334118290000086
wherein:
Figure BDA00023341182900000812
it is indicated that the channel is separated,
calculating the transmission error rate of the URLLC data at the kth mini-slot according to the following formula (4):
Figure BDA0002334118290000087
the queue length of URLLC is calculated according to equation (5) below:
Figure BDA0002334118290000088
Figure BDA0002334118290000089
wherein: z represents the corresponding number of mini-slots between retransmission intervals,
calculating the time required for the current arrival service to be transmitted according to the following formula (6):
Figure BDA00023341182900000810
wherein: count (x) indicates the number of retransmissions required when x is zero,
therefore, the reward obtained by action a_k is calculated according to formula (7) (given as an image in the original document), where P represents the URLLC transmit power, Q represents the URLLC queue length, flag indicates a retransmission, and ω_1, ω_2, ω_3, ω_4 are constants; according to the Bellman equation, the action-value function Q is calculated as:
Q(s_k, a_k) = E[r_{k+1} + λ r_{k+2} + λ² r_{k+3} + ... | s_k, a_k]
            = E[r_k + λ Q(s_{k+1}, a_{k+1}) | s_k, a_k],
i.e. the current Q value equals the reward r_k obtained by taking action a_k plus the discounted Q value of the next state; the parameters of the next state are then calculated according to the corresponding formula (given as an image in the original document);
Step 2.2.4, converting(s) obtained in step 2.2.3k,ak,rk,sk+1) Storing the data into a memory unit D for the next training of the model;
step 2.2.5, in order to break the correlation and the non-stationary distribution between samples, F samples are randomly taken from the memory unit D, and s_{k+1} is input into the next neural network to obtain the maximal action value max_{a_{k+1}} Q′;
step 2.2.6, calculating the loss function according to formula (8):
Loss = (Q_target − Q(s_k, a_k; θ))², with Q_target = r_k + γ max_{a_{k+1}} Q′(s_{k+1}, a_{k+1}; θ′),
where Q represents the action-value function, θ represents the weight parameters of the current (eval) neural network, θ′ those of the next network, and γ represents the discount factor;
step 2.2.7, updating eval neural network weight parameters by adopting a gradient descent method, and calculating the gradient according to the following formula (9):
∂Loss/∂θ = −2 (Q_target − Q(s_k, a_k; θ)) ∂Q(s_k, a_k; θ)/∂θ
according to the calculated gradient, the direction of steepest descent is selected to update the weight parameter θ;
step 2.3, every I iterations, setting θ′ = θ, i.e. copying the parameters of the eval neural network into the next neural network to update it;
step 2.4, repeating steps 2.2-2.3 to train the model continuously until the loss function converges;
and 3, performing performance evaluation on the obtained combined retransmission URLLC resource scheduling decision model for deep reinforcement learning until the performance requirement is met, wherein the specific steps are as follows:
step 3.1, continuing to use the data obtained in step 1 as the state vector s_k = [g_k, N_k, Q_k^1, ..., Q_k^M] and inputting it into the trained DQN model obtained in step 2 to obtain the resource scheduling decision result;
step 3.2, allocating resources to the URLLC data packet according to the decision result obtained in step 3.1; when the allocation result satisfies that the transmission delay of the URLLC is less than p_l and the transmission error rate is less than p_e, the performance evaluation process is complete and step 4 is performed; if the requirements are not met, return to step 2 and continue training the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning until the performance requirements are met;
and 4, collecting URLLC data packet information and channel information of the current mini-slot, and specifically comprising the following steps of:
step 4.1, acquiring the size of a data packet encapsulated by the current mini-slot base station for the incoming URLLC service from the base station side;
step 4.2, acquiring the current mini-slot downlink channel gain g through the CQI information periodically uploaded by the UE;
step 5, inputting the URLLC data packet information and the channel information obtained in the step 4 into a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning to obtain a resource scheduling decision result;
in particular, the obtained current state information and queue length information are combined into a state vector s_k = [g_k, N_k, Q_k^1, ..., Q_k^M], which is input into the trained combined retransmission URLLC resource scheduling decision model to obtain the resource scheduling decision result;
and 6, performing resource allocation on the URLLC data packet according to the resource scheduling decision result, and specifically comprising the following steps:
step 6.1, according to the URLLC resource scheduling result obtained in step 5, the RNC indicates, through the RRC sublayer, the power allocated to the URLLC and the number of transmission bits allocated to the URLLC data packet;
step 6.2, by configuring the pre-indication (PI) in the downlink DCI signalling, the single-link or multi-link transmission mode to be adopted by the current URLLC mini-slot is indicated, thereby achieving reasonable allocation of time-frequency resources and power for the URLLC data packet service and efficient utilization of the limited time-frequency resources.
Referring to fig. 2, a deep reinforcement learning-based joint retransmission URLLC resource scheduling decision model proposed in the present invention is specifically described.
Specifically, in order to meet the low-delay requirement of URLLC, a 60 kHz subcarrier spacing is used to reduce the slot length to 1/4 of that of LTE; to shorten it further, URLLC uses 4 symbols as a mini-slot, reducing the transmission length to 1/14 of one LTE TTI, and each mini-slot is used as one TTI for transmission. M queues are constructed: a newly arriving URLLC service first enters queue 1 for transmission, where it is transmitted successfully with probability P_{1,s} and fails with probability P_{1,f}, in which case it moves to queue 2; this continues until, after M-1 failures, the packet reaches queue M and is discarded. Transmitting with multiple queues allows different transmission priorities to be set for different queues, so retransmission queues can be transmitted more flexibly and on demand, reducing the retransmission delay.
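A toy simulation of this M-queue retransmission chain is sketched below; the success probabilities and the exact drop rule (discard once queue M is reached) are assumptions made only for illustration:

```python
import random

def simulate_queues(p_success, num_packets=10_000, seed=0):
    """Each packet starts in queue 1; a failed transmission pushes it to the
    next queue, and after M-1 failures it reaches queue M and is discarded."""
    random.seed(seed)
    delivered_at, dropped = [0] * len(p_success), 0
    for _ in range(num_packets):
        for i, p in enumerate(p_success):       # p_success[i] = P_{i+1,s}
            if random.random() < p:
                delivered_at[i] += 1            # success in queue i+1
                break
        else:
            dropped += 1                        # reached queue M: discarded
    return delivered_at, dropped

print(simulate_queues([0.9, 0.95, 0.99]))       # M = 4 queues, the last one only discards
```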
The above description is only for the preferred embodiment of the present invention and should not be construed as limiting the present invention, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the present invention, and any modifications, equivalents, improvements, etc. made therein are intended to be included within the scope of the appended claims.

Claims (6)

1. A URLLC resource scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, collecting data packet information and channel information of URLLC as training data, acquiring the bit number of URLLC data packets arriving at M mini-slots and the gain of a corresponding channel by a base station, and taking the data packet information and the channel information of the kth mini-slot as the training data;
step 2, establishing a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and training model parameters by using training data;
step 3, performing performance evaluation on the obtained combined retransmission URLLC resource scheduling decision model of the deep reinforcement learning until the performance requirement is met;
step 4, collecting URLLC data packet information and channel information of the current mini-slot;
step 5, inputting the URLLC data packet information and channel information obtained in step 4 into the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning to obtain the resource scheduling decision result: the obtained current state information and queue length information are combined into a state vector s_k = [g_k, N_k, Q_k^1, ..., Q_k^M], which is input into the trained combined retransmission URLLC resource scheduling decision model to obtain the resource scheduling decision result;
and 6, performing resource allocation on the URLLC data packet according to the resource scheduling decision result.
2. The URLLC resource scheduling method based on deep reinforcement learning of claim 1, characterized in that in step 1, the following steps are included:
step 1.1, acquiring the current mini-slot downlink channel gain g_k through the CQI information periodically uploaded by the UE;
Step 1.2, the base station packages the URLLC service in the service queue, generates the data packet sent by the kth mini-slot URLLC service, and obtains the bit number N_k of the URLLC data packet;
Step 1.3, packaging the obtained information into state information
Figure FDA0002334118280000012
Wherein
Figure FDA0002334118280000013
The Mth queue length of the k mini-slot URLLC packet is represented.
3. The URLLC resource scheduling method based on deep reinforcement learning of claim 1, characterized in that in step 2, the following steps are included:
step 2.1, a neural network in a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning is constructed and initialized, and the specific steps are as follows:
step 2.1.1, set the action vector space a = [bool, R_1, R_2, ..., R_M], where bool represents the transmission mode of the URLLC service in the current mini-slot (1 means redundancy-version, i.e. duplicate, transmission, 0 means single-link transmission), and R_M represents the number of bits of the Mth queue processed in the current mini-slot;
step 2.1.2, constructing two identical neural networks, eval and next: the eval network is used to obtain the action-value function Q of the current state and to select the action a, while the next network takes the maximal action value of the next state, max_a' Q', to compute the target action value Q_target, which is then used to update the parameters of the eval network;
step 2.1.3, setting the eval network parameters C = [n, n_h, n_in, n_out, θ, bias, activate], where n denotes the number of hidden layers of the neural network, n_h = [n_h1, n_h2, ..., n_hn] denotes the number of neurons in each hidden layer, n_in is the number of input-layer neurons and equals the length of the state vector s, n_out is the number of output-layer neurons and equals the number of possible values of the action vector a, θ denotes the weights and is randomly initialized in the range 0 to w, bias denotes the biases and is initialized to b, and activate denotes the activation function, for which ReLU is adopted;
step 2.1.4, initializing the next neural network parameters C' = C;
step 2.2, inputting data in the training data into a combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, training model parameters, taking the data of the kth mini-slot in the training data as an example, and specifically comprising the following steps:
step 2.2.1, inputting the data s_k of the kth mini-slot into the eval neural network of the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning, and calculating the ith queue length of URLLC according to formula (1) (given as an image in the original document), where z represents the number of mini-slots between retransmission intervals;
step 2.2.2, setting a probability ε_a: with probability ε_a, select a random action a_k from the action space, and with probability (1 − ε_a), select from the eval neural network the action a_k satisfying argmax_a Q(s, a; θ);
Step 2.2.3, calculating the reward r_k obtained by the action a_k selected in step 2.2.2 and the next state s_{k+1} that is reached: according to the selected action a_k = [bool_k, R_1^k, R_2^k, ..., R_M^k], the signal-to-noise ratio of the kth mini-slot is calculated according to formula (2) (given as an image in the original document), in which σ² represents the Gaussian noise power and P_k represents the power allocated to the kth slot; when bool = 0, single-link transmission is used and the signal-to-noise ratio takes the single-link form of formula (2); when bool = 1, the duplicate (copy) transmission mode is adopted and the signal-to-noise ratio takes the corresponding multi-copy form;
for URLLC traffic, its transmission rate is calculated according to the following equation (3):
Figure FDA0002334118280000034
wherein:
Figure FDA0002334118280000035
indicating channel separation;
calculating the transmission error rate of the URLLC data at the kth mini-slot according to the following formula (4):
Figure FDA0002334118280000036
the queue length of URLLC is calculated according to equation (5) below:
Figure FDA0002334118280000037
Figure FDA0002334118280000038
wherein: z represents the corresponding mini-slot number between retransmission intervals;
calculating the time required for the current arrival service to be transmitted according to the following formula (6):
Figure FDA0002334118280000039
wherein: count (x) represents the number of retransmissions required when x is zero;
the reward obtained by action a_k is calculated according to formula (7) (given as an image in the original document), where P represents the URLLC transmit power, Q represents the URLLC queue length, flag indicates a retransmission, and ω_1, ω_2, ω_3, ω_4 are constants; according to the Bellman equation, the action-value function Q is calculated as:
Q(s_k, a_k) = E[r_{k+1} + λ r_{k+2} + λ² r_{k+3} + ... | s_k, a_k]
            = E[r_k + λ Q(s_{k+1}, a_{k+1}) | s_k, a_k],
i.e. the current Q value equals the reward r_k obtained by taking action a_k plus the discounted Q value of the next state; the parameters of the next state are then calculated according to the corresponding formula (given as an image in the original document);
Step 2.2.4, converting(s) obtained in step 2.2.3k,ak,rk,sk+1) Storing the data into a memory unit D for the next training of the model;
step 2.2.5, in order to break the correlation and the non-stationary distribution between samples, F samples are randomly taken from the memory unit D, and s_{k+1} is input into the next neural network to obtain the maximal action value max_{a_{k+1}} Q′;
Step 2.2.6, the loss function is calculated according to the following equation (8):
Figure FDA0002334118280000042
wherein: q represents an action valuation function, theta represents a weight parameter of the current neural network, and gamma represents a discount factor;
step 2.2.7, updating eval neural network weight parameters by adopting a gradient descent method, and calculating the gradient according to the following formula (9):
∂Loss/∂θ = −2 (Q_target − Q(s_k, a_k; θ)) ∂Q(s_k, a_k; θ)/∂θ
according to the calculated gradient, the direction of steepest descent is selected to update the weight parameter θ;
step 2.3, every I iterations, setting θ′ = θ, i.e. copying the parameters of the eval neural network into the next neural network to update it;
step 2.4, repeating steps 2.2-2.3 to train the model continuously until the loss function converges.
4. The URLLC resource scheduling method based on deep reinforcement learning of claim 1, characterized in that in step 3, the method further includes the following steps:
step 3.1, continuing to use the data obtained in step 1 as the state vector s_k = [g_k, N_k, Q_k^1, ..., Q_k^M] and inputting it into the trained DQN model obtained in step 2 to obtain the resource scheduling decision result;
step 3.2, allocating resources to the URLLC data packet according to the decision result obtained in step 3.1; when the allocation result satisfies that the transmission delay of the URLLC is less than p_l and the transmission error rate is less than p_e, the performance evaluation is complete and step 4 is performed; if the performance requirement is not met, return to step 2 and continue training the combined retransmission URLLC resource scheduling decision model based on deep reinforcement learning until the performance requirement is met.
5. The URLLC resource scheduling method based on deep reinforcement learning of claim 1, characterized in that in step 4, the following steps are included:
step 4.1, acquiring the size of a data packet encapsulated by the current mini-slot base station for the incoming URLLC service from the base station side;
and 4.2, acquiring the current mini-slot downlink channel gain g through the CQI information periodically uploaded by the UE.
6. The URLLC resource scheduling method based on deep reinforcement learning of claim 1, characterized in that in step 6, the following steps are included:
step 6.1, according to the URLLC resource scheduling result obtained in step 5, the RNC indicates, through the RRC sublayer, the power allocated to the URLLC and the number of transmission bits allocated to the URLLC data packet;
step 6.2, by configuring the pre-indication (PI) in the downlink DCI signalling, instantly indicating whether the current URLLC mini-slot is to adopt the single-link or the multi-link transmission mode, thereby allocating the time-frequency domain resources and power of the URLLC data packet service.
CN201911348750.7A 2019-12-24 2019-12-24 Joint retransmission URLLC resource scheduling method based on deep reinforcement learning Active CN111182644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911348750.7A CN111182644B (en) 2019-12-24 2019-12-24 Joint retransmission URLLC resource scheduling method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911348750.7A CN111182644B (en) 2019-12-24 2019-12-24 Joint retransmission URLLC resource scheduling method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111182644A true CN111182644A (en) 2020-05-19
CN111182644B CN111182644B (en) 2022-02-08

Family

ID=70657926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911348750.7A Active CN111182644B (en) 2019-12-24 2019-12-24 Joint retransmission URLLC resource scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111182644B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190356446A1 (en) * 2017-01-06 2019-11-21 Electronics And Telecommunications Research Institute Uplink control information transmission method and device
CN108391143A (en) * 2018-04-24 2018-08-10 南京邮电大学 A kind of wireless network transmission of video self-adaptation control method based on Q study
US20190325304A1 (en) * 2018-04-24 2019-10-24 EMC IP Holding Company LLC Deep Reinforcement Learning for Workflow Optimization
CN109257429A (en) * 2018-09-25 2019-01-22 南京大学 A kind of calculating unloading dispatching method based on deeply study
CN109561504A (en) * 2018-11-20 2019-04-02 北京邮电大学 A kind of resource multiplexing method of URLLC and eMBB based on deeply study
CN109873869A (en) * 2019-03-05 2019-06-11 东南大学 A kind of edge cache method based on intensified learning in mist wireless access network
CN109951897A (en) * 2019-03-08 2019-06-28 东华大学 A kind of MEC discharging method under energy consumption and deferred constraint
CN110035478A (en) * 2019-04-18 2019-07-19 北京邮电大学 A kind of dynamic multi-channel cut-in method under high-speed mobile scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
H. Farès et al.: "Two-level HARQ for turbo coded cooperation: System retransmission gain and optimal time allocation", 2012 IEEE Wireless Communications and Networking Conference (WCNC), 2012 *
吴大鹏 (Wu Dapeng): "带有上行数据帧聚合的光无线融合接入网络节能机制" (Energy-saving mechanism for fiber-wireless integrated access networks with uplink data frame aggregation), Journal of Electronics & Information Technology (电子与信息学报) *
廖晓闽 等 (Liao Xiaomin et al.): "基于深度强化学习的蜂窝网资源分配算法" (Deep reinforcement learning based resource allocation algorithm for cellular networks), Journal on Communications (通信学报) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112188539A (en) * 2020-10-10 2021-01-05 南京理工大学 Interference cancellation scheduling code design method based on deep reinforcement learning
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112261725B (en) * 2020-10-23 2022-03-18 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning
CN112508172A (en) * 2020-11-23 2021-03-16 北京邮电大学 Space flight measurement and control adaptive modulation method based on Q learning and SRNN model
CN112511276A (en) * 2020-11-24 2021-03-16 广州技象科技有限公司 Data processing method and device
CN112584361A (en) * 2020-12-09 2021-03-30 齐鲁工业大学 Resource scheduling method and device based on deep reinforcement learning in M2M communication
CN112584361B (en) * 2020-12-09 2021-09-07 齐鲁工业大学 Resource scheduling method and device based on deep reinforcement learning in M2M communication
CN113316259A (en) * 2021-06-29 2021-08-27 北京科技大学 Method and device for scheduling downlink wireless resources supporting AI engine
CN114340017A (en) * 2022-03-17 2022-04-12 山东科技大学 Heterogeneous network resource slicing method with eMBB and URLLC mixed service
CN114340017B (en) * 2022-03-17 2022-06-07 山东科技大学 Heterogeneous network resource slicing method with eMBB and URLLC mixed service
CN116234047A (en) * 2023-03-16 2023-06-06 华能伊敏煤电有限责任公司 Mixed service intelligent resource scheduling method based on reinforcement learning algorithm

Also Published As

Publication number Publication date
CN111182644B (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN111182644B (en) Joint retransmission URLLC resource scheduling method based on deep reinforcement learning
CN109561504B (en) URLLC and eMMC resource multiplexing method based on deep reinforcement learning
Liu et al. A cross-layer scheduling algorithm with QoS support in wireless networks
US9509478B2 (en) Method and apparatus for data and control multiplexing
JP5259596B2 (en) Recovery from resource mismatch in wireless communication systems
Liu et al. Cross-layer scheduling with prescribed QoS guarantees in adaptive wireless networks
CN103209494B (en) A kind of real-time video traffic resource allocation methods based on importance labelling
CN101895926A (en) Method and apparatus for scheduling in a wireless network
CN112838911B (en) Method and apparatus in a node used for wireless communication
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
Chung et al. A-MPDU using fragmented MPDUs for IEEE 802.11 ac MU-MIMO WLANs
Kallel et al. A flexible numerology configuration for efficient resource allocation in 3GPP V2X 5G new radio
US20230239881A1 (en) Lower analog media access control (mac-a) layer and physical layer (phy-a) functions for analog transmission protocol stack
Asheralieva et al. A two-step resource allocation procedure for LTE-based cognitive radio network
Noh et al. Application-level qos and qoe assessment of a cross-layer packet scheduling scheme for audio-video transmission over error-prone ieee 802.11 e hcca wireless lans
KR20050083085A (en) Apparatus and method for scheduling traffic data in mobile communication system using the orthogonal frequency division multiple access
Jiang et al. The design of transport block-based ROHC U-mode for LTE multicast
WO2022017127A1 (en) Method and apparatus used in user equipment and base station for wireless communication
CN112688763B (en) Method and apparatus in a node used for wireless communication
WO2023077757A1 (en) Method and apparatus used in node for wireless communication
WO2023179470A1 (en) Method and apparatus used in node for wireless communication
WO2023193672A1 (en) Method and apparatus for wireless communication
KR101958069B1 (en) Method, apparatus and system for transmitting svc video data in radio communication environment
CN115941122A (en) Method and apparatus for wireless communication
CN117479225A (en) Method and apparatus for wireless communication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant