CN114980353A - Ordered competition large-scale access learning method for machine type communication system - Google Patents

Ordered competition large-scale access learning method for machine type communication system

Info

Publication number
CN114980353A
Authority
CN
China
Prior art keywords
preamble
devices
state
equipment
random access
Prior art date
Legal status
Pending
Application number
CN202210472683.5A
Other languages
Chinese (zh)
Inventor
孙君 (Sun Jun)
郭兴康 (Guo Xingkang)
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210472683.5A
Publication of CN114980353A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W4/00: Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/70: Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • H04W74/00: Wireless channel access
    • H04W74/002: Transmission of channel access control information
    • H04W74/006: Transmission of channel access control information in the downlink, i.e. towards the terminal
    • H04W74/02: Hybrid access
    • H04W74/08: Non-scheduled access, e.g. ALOHA
    • H04W74/0833: Random access procedures, e.g. with 4-step access
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an ordered competition large-scale access learning method for a machine type communication system, which comprises the following steps: newly accessed devices use a multi-agent reinforcement learning algorithm to cooperatively select channels that meet their own requirements; a device sends a preamble on the physical random access channel, and after receiving the request the base station sends a response containing a specific number assigned to the device under the selected preamble, the number being determined by the device's own priority; the device with the smallest number sends data on the physical uplink shared channel, the remaining numbered devices are in a waiting state, and each waiting device's number decreases by one every unit time. The invention weakens the randomness of the contention process: newly accessed devices cooperatively select suitable preambles through the multi-agent reinforcement learning algorithm and are ordered according to priority, while the minimum delay requirement of devices that have not yet been successfully accessed is guaranteed.

Description

Ordered competition large-scale access learning method for machine type communication system
Technical Field
The invention belongs to the technical field of wireless communication, and in particular relates to an ordered competition large-scale access learning method for large-scale machine type communication (M2M, Machine-to-Machine) in the Internet of Things.
Background
With the rapid development of communication technology, communication services have gradually evolved from traditional human-to-human communication to machine-to-machine communication, known as the Internet of Things (IoT). It is expected that by 2023 the number of devices performing M2M (Machine-to-Machine) communication worldwide will exceed 35 billion, which poses serious challenges to existing cellular networks; fifth-generation mobile communication (5G) technology has therefore become the focus of machine-type-communication development, with mMTC (massive Machine Type Communication) being one of the three 5G application scenarios. To meet the massive access requirements of machine type devices, further control optimization is needed on top of the current congestion control mechanisms. Meanwhile, considering the different Quality of Service (QoS) requirements of Machine Type Communication Devices (MTCDs) of different traffic types, the access requirements of MTCDs of different traffic types also differ. Therefore, the problem of partitioning preamble resources when MTCDs of multiple traffic types access the cellular network should also be solved. In a scenario where MTCDs of multiple traffic types perform random access simultaneously, the access delay, the number of collisions and the access fairness of the various traffic types must be considered while improving system throughput. However, most conventional access schemes are based on random contention and usually adopt a dynamic ACB (Access Class Barring) factor together with back-off to mitigate collisions; although this can effectively alleviate the collision problem, collisions still occur. Therefore, a new ordered competition access scheme is needed to solve the collision problem.
Disclosure of Invention
The technical problem to be solved is as follows: aiming at the collision problem of conventional contention-based random access, the invention provides a new ordered competition access scheme. Unlike conventional schemes, the contention is not blind but targeted, which weakens the randomness of the contention process. Each device has its own priority and minimum delay requirement; devices that would otherwise collide enter a queuing state, newly accessed devices cooperatively select suitable preambles through a multi-agent reinforcement learning algorithm and are ordered according to priority, and at the same time the minimum delay requirement of devices that have not yet been successfully accessed must be guaranteed.
The technical scheme is as follows:
An ordered competition large-scale access learning method for a machine type communication system comprises the following steps:
S1, before random access starts, the base station allocates the radio resources of the physical random access channel and the physical uplink shared channel and broadcasts them to all mobile devices; each physical random access channel corresponds to a unique preamble;
S2, when random access starts, the base station broadcasts the number of devices currently waiting under each preamble, and newly accessed devices use a multi-agent reinforcement learning algorithm to cooperatively select channels that meet their own requirements; specifically, a device sends a preamble on the physical random access channel, and after receiving the request the base station sends a response containing a specific number assigned to the device under the selected preamble, the number being determined by the device's own priority; the device with the smallest number sends data on the physical uplink shared channel, the remaining numbered devices are in a waiting state, and each waiting device's number decreases by one every unit time;
the multi-agent reinforcement learning algorithm is combined with the number of devices waiting for the preamble sequences, the channel requirements of the devices, the delay tolerance and the priority of the devices at the same time, so that each device is in the maximum delay tolerance of the device and each preamble sequence preamble i The lower lengths are evenly distributed.
Further, in step S2, the number of each newly accessed device is determined according to the following device priority function:
$$\mathrm{priority\_func}(t) = P_{\mathrm{MTCD}} + k\,(t - t_0)$$
wherein $t \ge t_0$; the constant $P_{\mathrm{MTCD}}$ indicates the priority level of the device itself; the symbol $k$ represents the growth rate of the priority, which differs for different devices; the symbol $t$ denotes the current time and $t_0$ the time at which the device entered the queue.
Further, in step S3, it is assumed that there are m preambles, and the device queue corresponding to the i-th preamble is $preamble_i$, i = 1, 2, 3, ..., m; the total number of devices over all preamble queues at time t is represented as
$$N_t = n_{t,\mathrm{access}} + \sum_{i=1}^{m} n_{t,i}$$
where $n_{t,i}$ is the number of devices waiting in preamble queue $preamble_i$ and $n_{t,\mathrm{access}}$ is the number of newly accessed devices; the maximum tolerated delay of the j-th device in preamble queue $preamble_i$ is expressed as $T^{\max}_{i,j}$.
The objective function of the multi-agent reinforcement learning algorithm is expressed as:
$$\min_{\{x_{t,i}\}} \; f_t = \frac{1}{m}\sum_{i=1}^{m}\Bigl(n_{t,i}+x_{t,i}-\bar{L}_t\Bigr)^2, \qquad \bar{L}_t=\frac{1}{m}\sum_{i=1}^{m}\bigl(n_{t,i}+x_{t,i}\bigr)$$
$$\text{s.t.}\quad \sum_{i=1}^{m} x_{t,i}=n_{t,\mathrm{access}}, \qquad n_{t,i}+x_{t,i}\le T^{\max}_{i,\mathrm{last}}, \quad i=1,\dots,m$$
wherein $x_{t,i}$ indicates the number of newly accessed devices that select the i-th preamble at time t, $preamble_{i,\mathrm{last}}$ indicates the last device under the i-th preamble sequence after the new devices' decisions, and $T^{\max}_{i,\mathrm{last}}$ represents the maximum tolerated delay of that last device; $f_t$ is the variance of the queue lengths under the m preamble queues at time t, i.e. the objective function is to minimize this variance.
Further, in step S3, the process by which the newly accessed devices use the multi-agent reinforcement learning algorithm to cooperatively select channels satisfying their own requirements includes the following steps:
S31, constructing a state set S: the state set S is used to represent the state of the whole access environment and is composed of t + 1 states, $S = \{s_0, s_1, \dots, s_t\}$; each state includes the device information under each preamble sequence, $s_t = \{preamble_{1,t}, preamble_{2,t}, \dots, preamble_{m,t}\}$;
S32, constructing an action set A: the action set A is used to represent the action $a_t$ taken by each agent according to the current state $s_t$ and its own decision policy $\pi$; $a_t$ is a one-dimensional array of length m with $a_{i,t} \in \{0, 1\}$, where a value of 1 means the device selects the i-th preamble and a value of 0 means it does not;
S33, constructing a reward R: after the agents take actions, the state of the current environment changes, and an environmental return with corresponding reward $r_{j,t}$ is generated, expressed as:
$$r_{j,t} = \begin{cases} 1, & n_{t,i}+x_{t,i} \le T^{\max}_{i,\mathrm{last}} \text{ for the preamble } i \text{ selected by device } j,\\ -1, & \text{otherwise}; \end{cases}$$
S34, adopting deep reinforcement learning to construct a neural network whose input is the action $a_t$ and the state $s_t$ and whose output is the Q value of the action, $Q_k(s_t, a_t)$, and using a target neural network to calculate the Q value of the next state $s_{t+1}$, $Q_k(s_{t+1}, a')$, which is updated as:
$$Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha_k \Bigl[\, r_{t+1} + \gamma \max_{a' \in A} Q_k(s_{t+1}, a') - Q_k(s_t, a_t) \Bigr]$$
wherein $\alpha_k$ and $\gamma$ are the learning rate and discount factor respectively, $s_{t+1}$ and $r_{t+1}$ denote the next state and the reward obtained after taking the action in state $s_t$, $a'$ denotes an action executable in state $s_{t+1}$, $A$ is the set of executable actions, and $\max_{a' \in A} Q_k(s_{t+1}, a')$ represents the maximum Q value over the action set $A$ in state $s_{t+1}$; an ε-greedy strategy is adopted in the search for this maximum; the loss function E is expressed as:
$$E = \mathbb{E}\Bigl[\bigl(Q_{k+1}(s_t, a_t) - Q_k(s_t, a_t)\bigr)^2\Bigr]$$
S35, updating the weight θ of the neural network by a gradient descent method.
Further, in step S35, the process of updating the weight θ of the neural network by using the gradient descent method includes the following steps:
S351, randomly initializing the weights θ of the neural network and the action $a_j$ of each agent j, and setting every preamble queue $preamble_i$ to length 0;
S352, calculating the priority of the newly accessed devices, setting the convergence threshold of the loss function E, and initializing $\alpha_k$, $\gamma$, $\varepsilon$;
S353, each agent makes a decision using an ε-greedy strategy according to the current state information;
S354, updating the state of the environment $s_{t+1}$ and the reward $r_{t+1}$;
S355, storing $s_t$, $a_t$, $s_{t+1}$, $r_{t+1}$ for experience replay;
S356, repeating steps S353 to S355 to accumulate experience; randomly extracting a certain number of samples from the accumulated experience, calculating the loss function E from these samples, and updating the weight θ;
S357, repeating steps S353 to S356 until the loss function E reaches the convergence condition or the maximum number of iterations T.
Advantageous effects:
(1) Different from the traditional random contention access mode, the ordered competition large-scale access learning method for the machine type communication system solves the collision problem by adopting ordered contention access and allows more mobile devices (MTCDs) to access at the same scale.
(2) In the ordered competition large-scale access learning method for the machine type communication system, when a mobile device (MTCD) makes a decision, a suitable preamble is cooperatively selected by means of a multi-agent reinforcement learning algorithm; the learning algorithm adapts better to environmental changes and improves the convergence rate.
Drawings
Fig. 1 is a diagram of an ordered contention based access model according to an embodiment of the present invention.
FIG. 2 is a model diagram of multi-agent reinforcement learning based on an embodiment of the present invention.
Fig. 3 is a model diagram of each preamble sequence according to an embodiment of the present invention.
FIG. 4 is a diagram of a neural network architecture for a multi-agent embodiment of the present invention.
Detailed Description
The following examples are presented to enable one of ordinary skill in the art to more fully understand the present invention and are not intended to limit the invention in any way.
This embodiment provides an ordered competition large-scale access learning method for a machine type communication system, which comprises the following steps:
S1, before random access starts, the base station allocates the radio resources of the physical random access channel and the physical uplink shared channel and broadcasts them to all mobile devices; each physical random access channel corresponds to a unique preamble.
S2, when random access starts, the base station broadcasts the number of devices currently waiting under each preamble, and newly accessed devices use a multi-agent reinforcement learning algorithm to cooperatively select channels that meet their own requirements. Specifically, a device sends a preamble on the physical random access channel; after receiving the request, the base station sends a response that contains a specific number assigned to the device under the selected preamble, the number being determined by the device's own priority. The device with the smallest number sends data on the physical uplink shared channel, the remaining numbered devices are in a waiting state, and each waiting device's number decreases by one every unit time.
The multi-agent reinforcement learning algorithm jointly considers the number of devices waiting under each preamble sequence, the channel requirements of the devices, their delay tolerances and their priorities, so that every device stays within its maximum delay tolerance and the queue lengths under the preamble sequences $preamble_i$ are evenly distributed.
In the scenario of a single Base Station (BS), there are a number of mobile devices (MTCDs). Under conventional schemes, users can be subdivided into newly accessed devices and devices that have collided and backed off; under the present invention such devices do not randomly contend for a preamble again but enter a queuing state. Before Random Access (RA) starts, the base station allocates the radio resources of the Physical Random Access Channel (PRACH) and the Physical Uplink Shared Channel (PUSCH) and broadcasts them to all MTCDs.
Referring to fig. 1, when RA starts, the base station broadcasts the number of devices waiting under each preamble, so that a newly accessed device can decide on a suitable channel according to its own requirements. Different from the traditional random access mode, the device sends a preamble on the PRACH; after receiving the request, the base station sends a response that contains a specific number assigned to the device under the selected preamble, the number being determined by the device's own priority. The device with the smallest number transmits data on the PUSCH; the remaining numbered devices are all in a waiting state, and each waiting device's number decreases by one every unit time. The priority function of a device is as follows:
$$\mathrm{priority\_func}(t) = P_{\mathrm{MTCD}} + k\,(t - t_0)$$
wherein $t \ge t_0$. The priority function is composed of two parts: the constant $P_{\mathrm{MTCD}}$ indicates the priority level of the device itself, while the symbol $k$ represents the growth rate of the priority, which differs for different devices; the symbol $t$ denotes the current time and $t_0$ the time at which the device entered the queue. The priority level of a waiting device in the queue therefore increases, to a different extent for different devices, with every unit of time.
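As a minimal illustrative sketch only (the function names priority_func and assign_numbers and the numerical values are placeholders, not taken from the disclosure), the priority function and the resulting numbering of devices under one preamble can be written as:

```python
# Illustrative sketch of priority_func(t) = P_MTCD + k*(t - t0) and of numbering devices by it;
# all names and parameter values here are placeholders, not part of the disclosure.

def priority_func(t, p_mtcd, k, t0):
    """Priority of a device at time t >= t0: its own priority level plus linear growth while waiting."""
    assert t >= t0
    return p_mtcd + k * (t - t0)

def assign_numbers(devices, t):
    """Number the devices waiting under one preamble by current priority:
    higher priority -> smaller number -> served earlier."""
    ranked = sorted(devices,
                    key=lambda d: priority_func(t, d["P_MTCD"], d["k"], d["t0"]),
                    reverse=True)
    return {d["id"]: number for number, d in enumerate(ranked, start=1)}

# Two devices with different base priorities, growth rates and queue-entry times.
devices = [
    {"id": "A", "P_MTCD": 2.0, "k": 0.5, "t0": 0},
    {"id": "B", "P_MTCD": 3.0, "k": 0.1, "t0": 4},
]
print(assign_numbers(devices, t=10))   # {'A': 1, 'B': 2}: A's priority (7.0) has grown past B's (3.6)
```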
Compared with the traditional random access mode, devices send their data on the PUSCH separately in the time dimension, so that the steps of collision and back-off are avoided. The MTCD arrival model is assumed to follow a Beta distribution, as follows:
$$p(t) = \frac{t^{\alpha-1}\,(T-t)^{\beta-1}}{T^{\alpha+\beta-1}\,\mathrm{B}(\alpha,\beta)}, \qquad 0 \le t \le T$$
wherein α = 3 and β = 4, T is the arrival period over which the MTCDs are activated, and B(α, β) is the Beta function.
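For illustration (not part of the disclosure), arrival instants following the assumed Beta(α = 3, β = 4) model over an arrival period can be generated as follows; the function and variable names are placeholders:

```python
# Illustrative sketch: drawing MTCD activation instants from the assumed Beta(alpha=3, beta=4)
# arrival model over an arrival period of length T_period (names here are placeholders).

import numpy as np

rng = np.random.default_rng(0)

def mtcd_arrival_times(n_devices, T_period, alpha=3.0, beta=4.0):
    """Activation instants of n_devices MTCDs, Beta(alpha, beta)-distributed over [0, T_period]."""
    return T_period * rng.beta(alpha, beta, size=n_devices)

arrivals = mtcd_arrival_times(n_devices=1000, T_period=10.0)
per_slot, _ = np.histogram(arrivals, bins=10, range=(0.0, 10.0))
print(per_slot)   # bursty per-unit-time arrival counts, peaking around 0.4 * T_period
```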
Referring to fig. 2, for device decision-making a multi-agent reinforcement learning algorithm is adopted in which the devices cooperate with one another; each device can be regarded as an agent, and the decision of each agent affects the remaining agents because the queue length under each preamble changes as new devices join. A device's decision takes into account the number of devices waiting under each preamble sequence, its own channel requirements, its delay tolerance and the device priority.
Referring to fig. 3, assume there are m preambles, denoted $preamble_i$ with i = 1, 2, 3, ..., m. With $n_{t,i}$ waiting devices under each preamble queue, plus the number $n_{t,\mathrm{access}}$ of newly accessed devices, the total number of devices can be expressed as
$$N_t = n_{t,\mathrm{access}} + \sum_{i=1}^{m} n_{t,i}.$$
The maximum tolerated delay of each device can be expressed as $T^{\max}_{i,j}$ for the j-th device in queue $preamble_i$.
While devices are waiting, it must be ensured that the waiting time of each device does not exceed its threshold; if the number of devices exceeds a certain scale, it can no longer be guaranteed that every device stays within its own maximum delay tolerance. Assuming that the devices at the head of each queue successfully transmit data, so that a device at position j in a queue waits j unit times, the MTCD access success rate at time t may be represented as
$$P_{\mathrm{succ},t} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_{t,i}} \mathbb{1}\bigl\{\, j \le T^{\max}_{i,j} \,\bigr\}}{\sum_{i=1}^{m} n_{t,i}},$$
i.e. the fraction of waiting devices whose queueing delay does not exceed their own maximum tolerated delay.
when a newly accessed device uses a multi-agent to make a cooperation decision, each preamble sequence preamble needs to be as uniform as possible i Length of lower, i.e. in summary, object boxThe number may be expressed as:
Figure BDA0003623553390000064
Figure BDA0003623553390000065
wherein x is t,i Indicating the number of devices selecting the ith preamble among the newly accessed devices at time t, p remableii, last denotes the last device under the ith preamble sequence after the new device decides.
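As an illustrative numerical sketch only (the helper names queue_variance and feasible and the example numbers are placeholders, and the constraint encoding follows the reading of the published formulas given above), the variance objective and the tail-device delay constraint can be evaluated as:

```python
# Illustrative sketch (placeholder names; constraint encoding follows the reconstruction above):
# evaluate the queue-length-variance objective and the tail-device delay constraint for a
# candidate allocation x of the newly accessed devices to the m preambles.

import numpy as np

def queue_variance(n_wait, x_new):
    """Variance of the post-decision queue lengths n_{t,i} + x_{t,i} over the m preambles."""
    lengths = np.asarray(n_wait) + np.asarray(x_new)
    return float(np.var(lengths))

def feasible(n_wait, x_new, t_max_last, n_access):
    """All new devices are placed and no tail device exceeds its maximum tolerated delay."""
    lengths = np.asarray(n_wait) + np.asarray(x_new)
    return int(np.sum(x_new)) == n_access and bool(np.all(lengths <= np.asarray(t_max_last)))

n_wait = [4, 1, 2]          # devices already waiting under each of m = 3 preambles
t_max_last = [8, 6, 7]      # assumed max tolerated delay of the tail device per preamble
x_even = [0, 3, 2]          # allocation of 5 new devices that evens out the queues
x_skew = [5, 0, 0]          # allocation that piles every new device onto preamble 1

print(queue_variance(n_wait, x_even), feasible(n_wait, x_even, t_max_last, 5))   # 0.0 True
print(queue_variance(n_wait, x_skew), feasible(n_wait, x_skew, t_max_last, 5))   # ~12.67 False
```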
State set S: the state set is used to represent the entire access environment and is composed of t + 1 states, i.e. $S = \{s_0, s_1, \dots, s_t\}$; each state includes the device information under each preamble sequence, $s_t = \{preamble_{1,t}, preamble_{2,t}, \dots, preamble_{m,t}\}$.
Action set A: according to the current state, each agent takes an action $a_t$ following its decision policy $\pi$; $a_t$ is a one-dimensional array of length m with $a_{i,t} \in \{0, 1\}$, where a value of 1 indicates that the device selects the i-th preamble and a value of 0 indicates that it does not.
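A minimal encoding sketch (the helper names and the use of queue lengths as the per-preamble device information are illustrative assumptions): the state vector and the length-m one-hot action can be represented as:

```python
# Illustrative encoding sketch (placeholder names): the state as a vector of per-preamble
# device information (here the queue lengths) and the action a_t as a length-m one-hot array.

import numpy as np

def make_state(queue_lengths):
    """s_t: one entry per preamble sequence; here, the number of devices waiting under it."""
    return np.asarray(queue_lengths, dtype=np.float32)

def make_action(chosen_preamble, m):
    """a_t: one-dimensional array of length m with a_{i,t} = 1 only for the selected preamble."""
    a = np.zeros(m, dtype=np.int64)
    a[chosen_preamble] = 1
    return a

s_t = make_state([4, 1, 2])                 # three preambles with 4, 1 and 2 waiting devices
a_t = make_action(chosen_preamble=1, m=3)   # the agent picks the second preamble
print(s_t, a_t)                             # [4. 1. 2.] [0 1 0]
```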
Reward R: after the agents take actions, the state of the environment changes and an environmental benefit, i.e. a return, is generated. To simplify the calculation, it is required that the device at the tail of each queue can still satisfy its own delay tolerance while waiting in the queue, so the reward $r_{j,t}$ can be expressed as
$$r_{j,t} = \begin{cases} 1, & n_{t,i}+x_{t,i} \le T^{\max}_{i,\mathrm{last}} \text{ for the preamble } i \text{ selected by device } j,\\ -1, & \text{otherwise,} \end{cases}$$
wherein $1 \le j \le n_{t,\mathrm{access}}$. Because the state set and the action set are large, deep reinforcement learning is adopted to construct a neural network whose input is the action $a_t$ and the state $s_t$ and whose output is the Q value of the action, $Q_k(s_t, a_t)$; the target neural network is used to calculate the Q value of the next state $s_{t+1}$, i.e. $Q_k(s_{t+1}, a')$, and the Q value is then updated by the following expression:
$$Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha_k \Bigl[\, r_{t+1} + \gamma \max_{a' \in A} Q_k(s_{t+1}, a') - Q_k(s_t, a_t) \Bigr],$$
wherein $\alpha_k$ and $\gamma$ are the learning rate and discount factor respectively, $s_{t+1}$ and $r_{t+1}$ denote the next state and the reward obtained after taking the action in state $s_t$, $a'$ denotes an action executable in state $s_{t+1}$, and $A$ is the set of executable actions; $\max_{a' \in A} Q_k(s_{t+1}, a')$ represents the maximum Q value over the action set $A$ in state $s_{t+1}$, and an ε-greedy strategy is adopted in the search for this maximum.
Loss function E: to minimize the difference between $Q_{k+1}(s_t, a_t)$ and $Q_k(s_t, a_t)$, i.e. to drive $Q_{k+1}(s_t, a_t) - Q_k(s_t, a_t)$ towards 0, the loss function E can be expressed as
$$E = \mathbb{E}\Bigl[\bigl(Q_{k+1}(s_t, a_t) - Q_k(s_t, a_t)\bigr)^2\Bigr].$$
The weight θ of the neural network is updated using a gradient descent method, as sketched below.
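A compact sketch, assuming a PyTorch realization (the layer sizes, the SGD learning rate and the concatenation of state and action at the network input are illustrative choices; only the update rule and the squared loss above come from the disclosure):

```python
# Illustrative DQN sketch in PyTorch; layer sizes, optimizer settings and the state||action
# input concatenation are assumptions, not specified by the disclosure.

import torch
import torch.nn as nn

m = 3                                    # number of preambles (illustrative)
gamma, lr = 0.9, 1e-3                    # discount factor and learning rate (illustrative values)

class QNet(nn.Module):
    def __init__(self, m):
        super().__init__()
        # Input: state (m queue lengths) concatenated with the one-hot action (length m).
        self.net = nn.Sequential(nn.Linear(2 * m, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)     # Q(s_t, a_t)

q_net, target_net = QNet(m), QNet(m)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=lr)             # gradient-descent update of theta

def td_target(r_next, s_next):
    """r_{t+1} + gamma * max_{a' in A} Q(s_{t+1}, a'), computed with the target network."""
    with torch.no_grad():
        actions = torch.eye(m)                                      # all m one-hot actions
        q_next = target_net(s_next.expand(m, -1), actions)
        return r_next + gamma * q_next.max()

def train_step(s_t, a_t, r_next, s_next):
    """One gradient-descent step on E = (target - Q(s_t, a_t))^2."""
    loss = (td_target(r_next, s_next) - q_net(s_t, a_t)) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

s_t, a_t = torch.tensor([4.0, 1.0, 2.0]), torch.eye(m)[1]
print(train_step(s_t, a_t, r_next=torch.tensor(1.0), s_next=torch.tensor([4.0, 2.0, 2.0])))
```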
Referring to fig. 4, according to the above mentioned technical solution, the specific implementation steps are as follows:
Step 1: randomly initialize the weights θ of the neural network and the action $a_j$ of each agent j, and set every preamble queue $preamble_i$ to length 0.
Step 2: calculate the priority of the newly accessed devices, set the convergence threshold of the loss function E, and initialize $\alpha_k$, $\gamma$, $\varepsilon$.
Step 3: each agent makes a decision based on the current state information using an ε-greedy strategy.
Step 4: update the state of the environment $s_{t+1}$ and the reward $r_{t+1}$.
Step 5: store the parameter values $s_t$, $a_t$, $s_{t+1}$, $r_{t+1}$ for experience replay.
Step 6: randomly extract a certain number of samples from the accumulated experience, calculate the loss function E from these samples, update the weight θ, and repeat from Step 3 until the loss function E reaches the convergence condition or the program reaches the maximum number of iterations T. A compact sketch of this training loop is given below.
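A compact sketch of Steps 1 to 6 (ε, the buffer and batch sizes, the convergence threshold and the toy environment below are illustrative assumptions; QNet, q_net, target_net and train_step refer to the PyTorch sketch given earlier):

```python
# Illustrative training-loop sketch for Steps 1 to 6; epsilon, buffer/batch sizes and the toy
# environment are assumptions, and q_net, target_net, train_step come from the previous sketch.

import random
from collections import deque
import torch

m = 3
epsilon, batch_size, max_iters = 0.1, 32, 1000
replay = deque(maxlen=50_000)                         # experience replay memory (Step 5)

class ToyEnv:
    """Stand-in environment: the state is the vector of queue lengths; picking the shortest
    queue is rewarded (+1), anything else penalized (-1). Only for exercising the loop."""
    def __init__(self, m):
        self.lengths = torch.zeros(m)
    def step(self, preamble_idx):
        r = 1.0 if self.lengths[preamble_idx] == self.lengths.min() else -1.0
        self.lengths[preamble_idx] += 1.0
        self.lengths = torch.clamp(self.lengths - 0.3, min=0.0)    # queues drain over time
        return self.lengths.clone(), torch.tensor(r)

def epsilon_greedy(s_t):
    """Step 3: explore with probability epsilon, otherwise pick the action with the largest Q."""
    if random.random() < epsilon:
        return random.randrange(m)
    with torch.no_grad():
        return int(q_net(s_t.expand(m, -1), torch.eye(m)).argmax())

env, s_t = ToyEnv(m), torch.zeros(m)
for k in range(max_iters):                            # outer loop (Steps 3 to 6)
    a_idx = epsilon_greedy(s_t)                       # Step 3
    s_next, r_next = env.step(a_idx)                  # Step 4
    replay.append((s_t, torch.eye(m)[a_idx], r_next, s_next))      # Step 5
    if len(replay) >= batch_size:                     # Step 6: sample a mini-batch, update theta
        batch = random.sample(list(replay), batch_size)
        avg_loss = sum(train_step(s, a, r, sn) for s, a, r, sn in batch) / batch_size
        if avg_loss < 1e-4:                           # convergence threshold on E (set in Step 2)
            break
    if k % 100 == 0:
        target_net.load_state_dict(q_net.state_dict())              # periodic target-network sync
    s_t = s_next
```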
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also be regarded as falling within the protection scope of the present invention. All components not specified in this embodiment can be implemented by the prior art.

Claims (5)

1. An ordered competition large-scale access learning method for a machine type communication system, characterized by comprising the following steps:
S1, before random access starts, the base station allocates the radio resources of the physical random access channel and the physical uplink shared channel and broadcasts them to all mobile devices; each physical random access channel corresponds to a unique preamble;
S2, when random access starts, the base station broadcasts the number of devices currently waiting under each preamble, and newly accessed devices use a multi-agent reinforcement learning algorithm to cooperatively select channels that meet their own requirements; specifically, a device sends a preamble on the physical random access channel, and after receiving the request the base station sends a response containing a specific number assigned to the device under the selected preamble, the number being determined by the device's own priority; the device with the smallest number sends data on the physical uplink shared channel, the remaining numbered devices are in a waiting state, and each waiting device's number decreases by one every unit time;
the multi-agent reinforcement learning algorithm jointly considers the number of devices waiting under each preamble sequence, the channel requirements of the devices, their delay tolerances and their priorities, so that every device stays within its maximum delay tolerance and the queue lengths under the preamble sequences $preamble_i$ are evenly distributed.
2. The ordered competition large-scale access learning method for the machine type communication system as claimed in claim 1, wherein in step S2 the number of each newly accessed device is determined according to the following device priority function:
$$\mathrm{priority\_func}(t) = P_{\mathrm{MTCD}} + k\,(t - t_0)$$
wherein $t \ge t_0$; the constant $P_{\mathrm{MTCD}}$ indicates the priority level of the device itself; the symbol $k$ represents the growth rate of the priority, which differs for different devices; the symbol $t$ denotes the current time and $t_0$ the time at which the device entered the queue.
3. The method according to claim 1, wherein in step S3 it is assumed that there are m preambles and that the device queue corresponding to the i-th preamble is $preamble_i$, i = 1, 2, 3, ..., m; the total number of devices over all preamble queues at time t is represented as
$$N_t = n_{t,\mathrm{access}} + \sum_{i=1}^{m} n_{t,i},$$
where $n_{t,i}$ is the number of devices waiting in preamble queue $preamble_i$ and $n_{t,\mathrm{access}}$ is the number of newly accessed devices; the maximum tolerated delay of the j-th device in preamble queue $preamble_i$ is expressed as $T^{\max}_{i,j}$; the objective function of the multi-agent reinforcement learning algorithm is expressed as:
$$\min_{\{x_{t,i}\}} \; f_t = \frac{1}{m}\sum_{i=1}^{m}\Bigl(n_{t,i}+x_{t,i}-\bar{L}_t\Bigr)^2, \qquad \bar{L}_t=\frac{1}{m}\sum_{i=1}^{m}\bigl(n_{t,i}+x_{t,i}\bigr),$$
$$\text{s.t.}\quad \sum_{i=1}^{m} x_{t,i}=n_{t,\mathrm{access}}, \qquad n_{t,i}+x_{t,i}\le T^{\max}_{i,\mathrm{last}}, \quad i=1,\dots,m,$$
wherein $x_{t,i}$ indicates the number of newly accessed devices that select the i-th preamble at time t, $preamble_{i,\mathrm{last}}$ indicates the last device under the i-th preamble sequence after the new devices' decisions, $T^{\max}_{i,\mathrm{last}}$ represents the maximum tolerated delay of that last device, and $f_t$ represents the variance of the queue lengths under the m preamble queues at time t.
4. The ordered competition large-scale access learning method for the machine type communication system as claimed in claim 1, wherein in step S3 the process by which the newly accessed devices use the multi-agent reinforcement learning algorithm to cooperatively select channels satisfying their own requirements comprises the following steps:
S31, constructing a state set S: the state set S is used to represent the state of the entire access environment and is composed of t + 1 states, $S = \{s_0, s_1, \dots, s_t\}$; each state includes the device information under each preamble sequence, $s_t = \{preamble_{1,t}, preamble_{2,t}, \dots, preamble_{m,t}\}$;
S32, constructing an action set A: the action set A is used to represent the action $a_t$ taken by each agent according to the current state $s_t$ and its own decision policy $\pi$; $a_t$ is a one-dimensional array of length m with $a_{i,t} \in \{0, 1\}$, where a value of 1 means the device selects the i-th preamble and a value of 0 means it does not;
S33, constructing a reward R: after the agents take actions, the state of the current environment changes, and an environmental return with corresponding reward $r_{j,t}$ is generated, expressed as:
$$r_{j,t} = \begin{cases} 1, & n_{t,i}+x_{t,i} \le T^{\max}_{i,\mathrm{last}} \text{ for the preamble } i \text{ selected by device } j,\\ -1, & \text{otherwise}; \end{cases}$$
S34, adopting deep reinforcement learning to construct a neural network whose input is the action $a_t$ and the state $s_t$ and whose output is the Q value of the action, $Q_k(s_t, a_t)$, and using a target neural network to calculate the Q value of the next state $s_{t+1}$, $Q_k(s_{t+1}, a')$, which is updated as:
$$Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha_k \Bigl[\, r_{t+1} + \gamma \max_{a' \in A} Q_k(s_{t+1}, a') - Q_k(s_t, a_t) \Bigr]$$
wherein $\alpha_k$ and $\gamma$ are the learning rate and discount factor respectively, $s_{t+1}$ and $r_{t+1}$ denote the next state and the reward obtained after taking the action in state $s_t$, $a'$ denotes an action executable in state $s_{t+1}$, $A$ is the set of executable actions, and $\max_{a' \in A} Q_k(s_{t+1}, a')$ represents the maximum Q value over the action set $A$ in state $s_{t+1}$; an ε-greedy strategy is adopted in the search for this maximum; the loss function E is expressed as:
$$E = \mathbb{E}\Bigl[\bigl(Q_{k+1}(s_t, a_t) - Q_k(s_t, a_t)\bigr)^2\Bigr];$$
S35, updating the weight θ of the neural network by a gradient descent method.
5. The ordered competition large-scale access learning method for the machine type communication system according to claim 4, wherein in step S35 the process of updating the weight θ of the neural network by the gradient descent method comprises the following steps:
S351, randomly initializing the weights θ of the neural network and the action $a_j$ of each agent j, and setting every preamble queue $preamble_i$ to length 0;
S352, calculating the priority of the newly accessed devices, setting the convergence threshold of the loss function E, and initializing $\alpha_k$, $\gamma$, $\varepsilon$;
S353, each agent makes a decision using an ε-greedy strategy according to the current state information;
S354, updating the state of the environment $s_{t+1}$ and the reward $r_{t+1}$;
S355, storing $s_t$, $a_t$, $s_{t+1}$, $r_{t+1}$ for experience replay;
S356, repeating steps S353 to S355 to accumulate experience; randomly extracting a certain number of samples from the accumulated experience, calculating the loss function E from these samples, and updating the weight θ;
S357, repeating steps S353 to S356 until the loss function E reaches the convergence condition or the maximum number of iterations T.
CN202210472683.5A 2022-04-29 2022-04-29 Ordered competition large-scale access learning method for machine type communication system Pending CN114980353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210472683.5A CN114980353A (en) 2022-04-29 2022-04-29 Ordered competition large-scale access learning method for machine type communication system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210472683.5A CN114980353A (en) 2022-04-29 2022-04-29 Ordered competition large-scale access learning method for machine type communication system

Publications (1)

Publication Number Publication Date
CN114980353A true CN114980353A (en) 2022-08-30

Family

ID=82980489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210472683.5A Pending CN114980353A (en) 2022-04-29 2022-04-29 Ordered competition large-scale access learning method for machine type communication system

Country Status (1)

Country Link
CN (1) CN114980353A (en)

Similar Documents

Publication Publication Date Title
CN113613339B (en) Channel access method of multi-priority wireless terminal based on deep reinforcement learning
Sharma et al. Collaborative distributed Q-learning for RACH congestion minimization in cellular IoT networks
CN111867139B (en) Deep neural network self-adaptive back-off strategy implementation method and system based on Q learning
CN113490184B (en) Random access resource optimization method and device for intelligent factory
Chen et al. Heterogeneous machine-type communications in cellular networks: Random access optimization by deep reinforcement learning
CN109803246B (en) Random access and data transmission method based on grouping in large-scale MTC network
CN111245541B (en) Channel multiple access method based on reinforcement learning
CN108834175B (en) Queue-driven equipment access and resource allocation joint control method in mMTC network
CN113810883A (en) Internet of things large-scale random access control method
Chou et al. Contention-based airtime usage control in multirate IEEE 802.11 wireless LANs
CN111601398B (en) Ad hoc network medium access control method based on reinforcement learning
Shoaei et al. Reconfigurable and traffic-aware MAC design for virtualized wireless networks via reinforcement learning
CN115278908A (en) Wireless resource allocation optimization method and device
CN114599117A (en) Dynamic configuration method for backspacing resources in random access of low earth orbit satellite network
CN114980353A (en) Ordered competition large-scale access learning method for machine type communication system
CN115066036A (en) Multi-base-station queuing type lead code allocation method based on multi-agent cooperation
CN113098665B (en) Processing method and device for allocating PUCCH resources for CSI
Lee et al. Multi-agent reinforcement learning for a random access game
CN113056010A (en) Reserved time slot distribution method based on LoRa network
Nwogu et al. A combined static/dynamic partitioned resource usage approach for random access in 5G cellular networks
Kim et al. Dynamic Transmission and Delay Optimization Random Access for Reduced Power Consumption
CN111935842A (en) Multi-user clustering scheduling method for non-orthogonal multiple access
Eftekhari et al. Energy and spectrum efficient retransmission scheme with RAW optimization for IEEE 802.11 ah networks
CN116915377B (en) Unauthorized access pilot frequency distribution method based on hybrid automatic request mechanism
CN113473419B (en) Method for accessing machine type communication device into cellular data network based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination