CN117241409B - Multi-type terminal random access competition solving method based on near-end policy optimization - Google Patents

Multi-type terminal random access competition solving method based on near-end policy optimization Download PDF

Info

Publication number
CN117241409B
CN117241409B CN202311504327.8A CN202311504327A CN117241409B CN 117241409 B CN117241409 B CN 117241409B CN 202311504327 A CN202311504327 A CN 202311504327A CN 117241409 B CN117241409 B CN 117241409B
Authority
CN
China
Prior art keywords
terminal
terminals
random access
access
optimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311504327.8A
Other languages
Chinese (zh)
Other versions
CN117241409A (en
Inventor
颜志
苑书豪
欧阳博
禹怀龙
段豪勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202311504327.8A priority Critical patent/CN117241409B/en
Publication of CN117241409A publication Critical patent/CN117241409A/en
Application granted granted Critical
Publication of CN117241409B publication Critical patent/CN117241409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention belongs to the technical field of wireless communication, and particularly relates to a method for solving random access contention of a multi-type terminal based on near-end policy optimization, which comprises the following steps: s1: initializing states of various types of terminals and states of data queues, states of cell base stations, quantity of competing resources and states of competing queues; carrying out priority division on various types of terminals to obtain terminals with different priorities; acquiring a current environment state; s2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool; s3: and constructing an objective function, performing deep learning on the objective function based on experience data stored in an experience pool, and performing training update on parameters by using a preset threshold value to complete allocation optimization of random access of multiple types of terminals.

Description

Multi-type terminal random access competition solving method based on near-end policy optimization
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a method for solving random access contention of a multi-type terminal based on near-end policy optimization.
Background
As human society enters the era of everything interconnection, in the fifth generation communication technology (5G), the introduction of large-scale machine type communication (mctc) makes access to large-scale terminal devices possible. The random access process is an important process for realizing uplink communication by the device, and is used for realizing initialization access of users in the cellular network, uplink resource allocation when the users perform uplink data transmission, uplink medium synchronization and the like. However, as the number of wireless communication devices increases in bursts, collision problems during random access are increasingly prominent. For this reason, it is necessary to design an efficient contention resolution mechanism to cope with access of a large number of multi-type terminals.
At present, the researches on random access protocols are mainly divided into two major categories, namely ALOHA family protocols and tree splitting protocols. ALOHA family protocols discretely process conflicting devices in the time domain with access class restriction (ACB) and Backoff (Backoff) mechanisms to reduce the probability of collisions. The distributed queue contention resolution mechanism is derived from a tree splitting protocol that combines conflicting devices into a device group for time domain discretization by introducing a Contention Resolution Queue (CRQ) to fully exploit the preamble and reduce the probability of secondary collisions by conflicting terminals in retransmissions. When solving the optimal strategy for random access, the prior stage mostly utilizes deep reinforcement learning algorithms such as DQN and AC algorithms to solve the problem of resource conflict in the random access process.
However, most of the researches at present focus on the random access procedure of only single-type users, and do not consider the coexistence of multiple-type communication devices in actual production activities; meanwhile, most of the optimization for the random access protocol focuses on the optimization of the ALOHA family protocol, the potential advantages of the distributed queue scheme are not realized, and the existing contention resolution mechanism cannot adapt to network load caused by massive terminal large-scale initialization access and periodical data uploading, so that the overall performance of a communication system is affected.
Disclosure of Invention
The invention provides a method for solving random access contention of a multi-type terminal based on near-end policy optimization, which can be based on a distributed queue mechanism, introduce a priority division idea, increase the access opportunity of a specific terminal according to requirements, reduce the probability of secondary collision of the specific terminal, and thereby improve the access success rate and stability of the whole system so as to cope with network congestion caused by initiating random access requests by mass terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced.
A random access contention resolution method for a multi-type terminal based on near-end policy optimization specifically comprises the following steps:
s1: initializing states of various types of terminals and states of data queues, states of cell base stations, quantity of competing resources and states of competing queues; the priority of each type of terminal is divided, and terminals with different priorities are obtained; acquiring a current environment state;
s2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool;
s3: constructing an objective function, and performing deep learning on the objective function based on experience data stored in an experience pool; training and updating parameters by using a preset threshold value to adjust the number of reserved exclusive resources so as to finish the allocation optimization of the random access of the multi-type terminals;
in the step S3, an objective function is constructed, and deep learning is carried out on the objective function based on experience data stored in an experience pool; the process for completing the allocation optimization of the random access of the multi-type terminals specifically comprises the following steps:
s31: judging whether experience data stored in the experience pool reaches a preset threshold value or not;
s311: when the experience data stored in the experience pool reaches a preset threshold, an objective function is constructed, a near-end strategy optimization PPO algorithm is trained, network parameters are updated, and the experience pool is emptied;
s312: when the experience data stored in the experience pool does not reach a preset threshold value, entering the next step;
s32: judging whether the iteration number reaches a preset maximum iteration number or not;
s321: when the iteration number reaches the preset maximum iteration number, entering the next step;
s322: when the iteration times do not reach the preset maximum iteration times, training the intelligent body model again based on the current environment state;
s33: judging whether the overall indexes of the system, namely the average time delay, the average preamble transmission times and the average energy consumption, can reach preset requirements after the current iteration period is finished;
s331: when any index of the overall indexes of the system does not reach the preset requirement, training the intelligent body model again based on the current environment state;
s332: when the overall indexes of the system all reach the preset requirements, outputting an optimal solution to finish the allocation optimization of the random access of the multiple types of terminals.
The method can introduce the thought of priority division based on a distributed queue mechanism, increase the access opportunity of a specific terminal according to the requirement, and reduce the probability of secondary conflict of the specific terminal, thereby improving the access success rate and the stability of the whole system and coping with network congestion caused by the initiation of random access requests by massive terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced.
Further, in the step S1, the process of prioritizing the terminals of each type to obtain terminals with different priorities and obtaining the current environmental state specifically includes the following steps:
s11: based on the sensitivity degree of the data terminal to time delay and reliability, prioritizing the access terminal, dividing the terminal with high reliability requirement for low time delay into high priority terminal, and dividing the conventional machine type communication terminal into low priority terminal;
s12: initializing environment parameters;
the parameters needed to initialize the environment include: terminal number of initiating access terminal number of contention queue, status of contention queue, total number of preamble slot number, number of exclusive resources reserved for high priority;
s13: and acquiring the current environment state, namely acquiring the number URLLC_nums of terminals with low current time delay and high reliable requirements, the number mMTC_nums of machine type communication terminals and the length CDQ_length of a competition queue.
Further, in the step S11,
the high-priority terminal comprises a control equipment data acquisition terminal and fault early warning and detecting equipment;
the low-priority terminal comprises an operation index data acquisition terminal and an environment monitoring data acquisition terminal.
Further, in the step S2, the process of obtaining the optimal selection action and the instant rewards, and forming the experience data to store in the experience pool specifically includes the following steps:
s21: setting the number of the preambles which initialize the monopolization of the high-priority terminal as;
S22: based on a distributed queue mechanism, training an agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state to obtain an optimal selection action;
s23: inputting the obtained optimal selection action into an environment state, and executing a distributed queue random access process based on random access information broadcast by a base station;
s24: updating the environment state, and calculating instant rewards based on the terminal access result;
s25: the current environmental state, the optimal selection action, the probability of the selection action, and the instant rewards are stored as a set of experience data into an experience pool.
Further, in S22, the process of obtaining the optimal selection action specifically includes the following steps:
s221: based on the distributed queue mechanism, the current environment state is calculatedInputting the strategy network of the near-end strategy optimization PPO algorithm;
s222: outputting an action space in an output layerThe action with the highest medium probability is used as the optimal selection action;
s2221: outputting each action in the output layer, and obtaining a score vector of each action by utilizing a softmax function;
s2222: sampling any action value from the formed probability distribution of each actionTo characterize the ratio of exclusive access resources and to set the probability of selecting the action as +.>
Wherein the probability of the action is selectedThe calculated expression of (2) is:
in the method, in the process of the invention,is in state->Lower action->Is selected according to the selection probability of (1); />Is the action spaceSize of the material;to select a value of an action component; />Is the%>The values of the individual motion components;
s2223: selecting an action spaceThe action with the highest medium probability is used as the optimal selection action;
wherein the action spaceThe range of (2) is: />
Further, in S23, the process of executing the distributed queue random access procedure based on the random access information broadcasted by the base station by inputting the selected action into the environment specifically includes the following steps:
s231: detecting whether the number of data packet retransmission times of the random access terminal participating in the current round reaches the tolerant number of retransmission times;
s2311: when the data packet retransmission times reach the tolerant retransmission times, the access fails, and the packet loss processing is carried out;
s2312: when the data packet retransmission times are detected to not reach the tolerant retransmission times, initiating an access process according to the priority rule of the access terminal, namely, initiating an access request by selecting a preamble by the high-priority terminal, and initiating the access request by the low-priority terminal to share part of the preamble with the high-priority terminal; respectively calculating the probability of successful access of the high-priority terminal and the low-priority terminal;
wherein, the high priority terminal access success rateThe calculated expression of (2) is:
in the method, in the process of the invention,the number of preambles exclusive for a specified high priority terminal; />For the number of terminals initiating an access request in this round, and +.>
Wherein, the low priority terminal access success rateThe calculated expression of (2) is:
in the method, in the process of the invention,is the available number of preambles;
s232: transmitting the selected preamble to a base station through a physical layer random access channel, decoding the preamble by the base station, and informing the sequence number of the terminal with access conflict; terminals with access conflicts form a conflict terminal group and are orderly discharged into the CRQ according to the sequence number of the lead code to wait for the next opportunity of random access.
Further, in S24, the process of updating the environmental status and calculating the instant rewards based on the terminal access result specifically includes the following steps:
s241: combining the conflict terminal group, counting the number of high priority terminals currently positioned at the head of the competition queue, the number of low priority terminals positioned at the head of the competition queue and the length of the competition queue, and entering a new environment state
S242: calculating instant rewards based on the access success rate of the high-priority terminal, the access success rate of the low-priority terminal, the length change of the competition queue and the packet loss condition respectively;
instant rewardsThe calculated expression of (2) is:
wherein,the method is characterized in that rewards for the success rate of the access of the high-priority terminal are calculated by the following expression:
in the method, in the process of the invention,for the access success rate of terminals with low latency and high reliability requirements participating in the access procedure +.>The lowest success rate of the terminal with low time delay and high reliability requirement is tolerable for the system;
wherein,the method is characterized in that rewards for the success rate of the access of the low-priority terminal are calculated by the following expression:
in the method, in the process of the invention,access success rate for machine type communication terminals participating in an access procedure,/->The lowest success rate of access for the machine type communication terminal which can be tolerated by the system;
wherein,for rewards of competing queue length changes, the computational expression is:
in the method, in the process of the invention,the length of the contention queue after the random access process of the present round; />The length of the historical competition queue;
wherein,the penalty for packet loss processing in the random access process of the round is calculated by the following expression:
in the method, in the process of the invention,the number of lost packets.
Further, in S311, training the near-end policy optimization PPO algorithm, and updating the network parameters specifically includes the following steps:
s3111: according to discount rateAnd settling rewards expectations and advantage estimates corresponding to each random access process in the batch of data;
wherein the reward is desiredThe calculated expression of (2) is:
in the method, in the process of the invention,is->Rewarding expectations of the secondary random access procedure; />Is->Instant rewards of the secondary random access procedure;for obtaining +.>The value of the state; />Is the total number of rounds;
wherein the dominance estimationThe calculated expression of (2) is:
in the method, in the process of the invention,is->Secondary random access procedureIs a dominant estimate of (1); />Obtained for use of preset Critic networksThe value of the state;
s3112: updating the Critic network with MSEloss as a loss function to minimize the difference between the current state value and the discounted prize:
in the method, in the process of the invention,error terms for the cost function; />Status +.>Status value of (2); />For the discount to be awarded,is the amount of data in the experience pool;
s3113: calculating a loss function of the Actor network;
wherein, the loss function of the Actor networkThe calculated expression of (2) is:
in the method, in the process of the invention,is the probability ratio of new strategy to old strategy, namely +.>,/>Probability of being a new strategy->Probability of being an old policy; />For estimating the function for advantage, i.e;/>Is a truncated function; />To cut off the super parameter to limit the network update amplitude;
s3114: constructing an objective function of a near-end strategy optimization PPO algorithm network;
wherein, the near-end strategy optimizes the objective function of the PPO algorithm networkThe expression of (2) is:
wherein,is a loss function of the Actor network; />Error terms for the cost function; />Entropy rewards for the policy model; />、/>Is a constant coefficient to adjust the weight of each part in the objective function;
s3115: optimizing the objective function of the PPO algorithm network through maximized near-end strategy, and updating network parametersAnd network parameters are performed when the data in the experience poolTAfter a second continuous update, network parameters +.>Updated to->
Further, in S33, the calculating of the overall system index after the end of the current iteration period includes:
average time delayThe calculated expression of (2) is:
in the method, in the process of the invention,the RAO sum required by all terminals to complete the random access process is represented, namely the overall average time delay of the system; "1" means the first RAO; />Representing the total number of terminals participating in the access procedure; />Representing the total number of available preambles; />A terminal participating in an access process; />The sum of RAOs required for the terminals that collide in the first RAO to complete the random access procedure in the following;
preamble transfer timesThe calculated expression of (2) is:
average energy consumptionThe calculated expression of (2) is:
in the method, in the process of the invention,、/>、/>the energy consumption of the terminal in the back-off state, the access state and the monitoring state is respectively; />、/>Respectively +.>The number of high priority terminals and low priority terminals involved in random access by the layer; />For the number of reserved exclusive preambles.
The beneficial effects of the invention are as follows:
the invention can introduce the thought of priority division based on a distributed queue mechanism, increase the access opportunity of a specific terminal according to the requirement, reduce the probability of secondary conflict, thereby improving the access success rate and the stability of the whole system and coping with network congestion caused by the random access request initiated by a mass of terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow diagram of a distributed queue contention resolution mechanism;
FIG. 3 is a schematic diagram of a terminal access state transition;
fig. 4 is a flowchart for obtaining the optimal solution in embodiment 2.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
Furthermore, in the following description, specific details are provided for the purpose of providing a thorough understanding of the examples, and the particular meaning of the terms described above in this application will be understood to those of ordinary skill in the art in the context of the present application.
Example 1
Fig. 1 shows a method for solving random access contention of multiple types of terminals based on optimization of a near-end policy, which can be based on a distributed queue mechanism, introduce a priority division idea, increase access opportunities of specific terminals according to requirements, and reduce probability of secondary collision of specific terminals, so as to improve access success rate and stability of the whole system, and cope with network congestion caused by initiating random access requests by massive terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced. The method specifically comprises the following steps:
s1: initializing states of various types of terminals and states of data queues, states of cell base stations, quantity of competing resources and states of competing queues; the priority of each type of terminal is divided, and terminals with different priorities are obtained; acquiring a current environment state;
s11: based on the sensitivity degree of the data terminal to time delay and reliability, prioritizing the access terminal, dividing the terminal with high reliability requirement for low time delay into high priority terminal, and dividing the conventional machine type communication terminal into low priority terminal;
the high-priority terminal comprises a control equipment data acquisition terminal and fault early warning and detecting equipment; the low-priority terminal comprises an operation index data acquisition terminal and an environment monitoring data acquisition terminal.
S12: initializing environment parameters;
the parameters needed to initialize the environment include: terminal number of initiating access terminal number of contention queue, status of contention queue, total number of preamble slot number, number of exclusive resources reserved for high priority;
s13: and acquiring the current environment state, namely acquiring the number URLLC_nums of terminals with low current time delay and high reliable requirements, the number mMTC_nums of machine type communication terminals and the length CDQ_length of a competition queue.
S2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool; the method specifically comprises the following steps:
s21: setting the number of the preambles which initialize the monopolization of the high-priority terminal as;
S22: based on a distributed queue mechanism, training an agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state to obtain an optimal selection action;
it should be noted that, as shown in fig. 2, before training the agent model by using the policy network of the near-end policy optimization PPO algorithm, the near-end policy optimization PPO algorithm network needs to be constructed.
The process of obtaining the optimal selection action specifically comprises the following steps:
s221: based on the distributed queue mechanism, the current environment state is calculatedInputting the strategy network of the near-end strategy optimization PPO algorithm;
s222: outputting an action space in an output layerThe action with the highest medium probability is used as the optimal selection action;
s2221: outputting each action in the output layer, and obtaining a score vector of each action by utilizing a softmax function;
s2222: sampling any action value from the formed probability distribution of each actionTo characterize the ratio of exclusive access resources and to set the probability of selecting the action as +.>
Wherein the probability of the action is selectedThe calculated expression of (2) is:
in the method, in the process of the invention,is in state->Lower action->Is selected according to the selection probability of (1); />Is the size of the action space;to select a value of an action component; />Is the%>The values of the individual motion components;
s2223: selecting an action spaceThe action with the highest medium probability is used as the optimal selection action;
wherein the action spaceThe range of (2) is: />
In the present embodiment, when the firstWhen the batch of terminal equipment requests access, the intelligent agent model performs optimal action selection according to the instant rewards of the last round of access process, so as to adjust the quantity of exclusive resources.
S23: inputting the obtained optimal selection action into an environment state, and executing a distributed queue random access process based on random access information broadcast by a base station;
the process of executing the distributed queue random access process based on the random access information broadcast by the base station specifically includes the following steps:
s231: detecting whether the number of data packet retransmission times of the random access terminal participating in the current round reaches the tolerant number of retransmission times;
s2311: when the data packet retransmission times reach the tolerant retransmission times, the access fails, and the packet loss processing is carried out;
s2312: when the data packet retransmission times are detected to not reach the tolerant retransmission times, initiating an access process according to the priority rule of the access terminal, namely, initiating an access request by selecting a preamble by the high-priority terminal, and initiating the access request by the low-priority terminal to share part of the preamble with the high-priority terminal; respectively calculating the probability of successful access of the high-priority terminal and the low-priority terminal;
wherein, the high priority terminal access success rateThe calculated expression of (2) is:
in the method, in the process of the invention,the number of preambles exclusive for a specified high priority terminal; />For the number of terminals initiating an access request in this round, and +.>
Wherein, the low priority terminal access success rateThe calculated expression of (2) is:
in the method, in the process of the invention,is the available number of preambles;
s232: transmitting the selected preamble to a base station through a physical layer random access channel, decoding the preamble by the base station, and informing the sequence number of the terminal with access conflict; terminals with access conflicts form a conflict terminal group and are orderly discharged into the CRQ according to the sequence number of the lead code to wait for the next opportunity of random access.
S24: updating the environment state, and calculating instant rewards based on the terminal access result;
the process of updating the environment state and calculating the instant rewards based on the terminal access result specifically comprises the following steps:
s241: combining the conflict terminal group, counting the number of high priority terminals currently positioned at the head of the competition queue, the number of low priority terminals positioned at the head of the competition queue and the length of the competition queue, and entering a new environment state
S242: calculating instant rewards based on the access success rate of the high-priority terminal, the access success rate of the low-priority terminal, the length change of the competition queue and the packet loss condition respectively;
instant rewardsThe calculated expression of (2) is:
wherein,the method is characterized in that rewards for the success rate of the access of the high-priority terminal are calculated by the following expression:
in the method, in the process of the invention,for the access success rate of terminals with low latency and high reliability requirements participating in the access procedure +.>The lowest success rate of the terminal with low time delay and high reliability requirement is tolerable for the system;
wherein,the method is characterized in that rewards for the success rate of the access of the low-priority terminal are calculated by the following expression:
in the method, in the process of the invention,access success rate for machine type communication terminals participating in an access procedure,/->For machine-like communication terminals tolerated by the systemThe lowest success rate of access;
wherein,for rewards of competing queue length changes, the computational expression is:
in the method, in the process of the invention,the length of the contention queue after the random access process of the present round; />The length of the historical competition queue;
wherein,the penalty for packet loss processing in the random access process of the round is calculated by the following expression:
in the method, in the process of the invention,the number of lost packets.
S25: the current environmental state, the optimal selection action, the probability of the selection action, and the instant rewards are stored as a set of experience data into an experience pool.
S3: constructing an objective function, and performing deep learning on the objective function based on experience data stored in an experience pool; and training and updating parameters by using a preset threshold value to adjust the number of reserved exclusive resources so as to finish the allocation optimization of the random access of the multi-type terminals.
S31: judging whether experience data stored in the experience pool reaches a preset threshold value or not;
s311: when the experience data stored in the experience pool reaches a preset threshold, an objective function is constructed, a near-end strategy optimization PPO algorithm is trained, network parameters are updated, and the experience pool is emptied; the method specifically comprises the following steps:
s3111: according to discount rateAnd settling rewards expectations and advantage estimates corresponding to each random access process in the batch of data;
wherein the reward is desiredThe calculated expression of (2) is:
in the method, in the process of the invention,is->Rewarding expectations of the secondary random access procedure; />Is->Instant rewards of the secondary random access procedure;for obtaining +.>The value of the state; />Is the total number of rounds;
wherein the dominance estimationThe calculated expression of (2) is:
in the method, in the process of the invention,is->Estimating the advantages of the secondary random access process; />Obtained for use of preset Critic networksThe value of the state;
s3112: updating the Critic network with MSEloss as a loss function to minimize the difference between the current state value and the discounted prize:
in the method, in the process of the invention,error terms for the cost function; />Status +.>Status value of (2); />For the discount to be awarded,is the amount of data in the experience pool;
s3113: calculating a loss function of the Actor network;
wherein, the loss function of the Actor networkThe calculated expression of (2) is:
in the method, in the process of the invention,is the probability ratio of new strategy to old strategy, namely +.>,/>Probability of being a new strategy->Probability of being an old policy; />For estimating the function for advantage, i.e;/>Is a truncated function; />To cut off the super parameter to limit the network update amplitude;
s3114: constructing an objective function of a near-end strategy optimization PPO algorithm network;
wherein, the near-end strategy optimizes the objective function of the PPO algorithm networkThe expression of (2) is:
wherein,is a loss function of the Actor network; />Error terms for the cost function; />Entropy rewards for the policy model; />、/>Is a constant coefficient to adjust the weight of each part in the objective function;
s3115: optimizing the objective function of the PPO algorithm network through maximized near-end strategy, and updating network parametersAnd network parameters are performed when the data in the experience poolTAfter a second continuous update, network parameters +.>Updated to->
S312: when the experience data stored in the experience pool does not reach a preset threshold value, entering the next step;
s32: judging whether the iteration number reaches a preset maximum iteration number or not;
s321: when the iteration number reaches the preset maximum iteration number, entering the next step;
s322: when the iteration times do not reach the preset maximum iteration times, training the intelligent body model again based on the current environment state;
s33: judging whether the overall indexes of the system, namely the average time delay, the average preamble transmission times and the average energy consumption, can reach preset requirements after the current iteration period is finished;
s331: when any index of the overall indexes of the system does not reach the preset requirement, training the intelligent body model again based on the current environment state;
s332: when the overall indexes of the system all reach the preset requirements, outputting an optimal solution to finish the allocation optimization of the random access of the multiple types of terminals.
The calculation of the system overall index after the current iteration period is finished comprises the following steps:
average time delayThe calculated expression of (2) is:
in the method, in the process of the invention,the RAO sum required by all terminals to complete the random access process is represented, namely the overall average time delay of the system; "1" means the first RAO; />Representing the total number of terminals participating in the access procedure; />Representing the total number of available preambles; />A terminal participating in an access process; />The sum of RAOs required for the terminals that collide in the first RAO to complete the random access procedure in the following;
preamble transfer timesThe calculated expression of (2) is:
average energy consumptionThe calculated expression of (2) is:
in the method, in the process of the invention,、/>、/>the energy consumption of the terminal in the back-off state, the access state and the monitoring state is respectively; />、/>Respectively +.>The number of high priority terminals and low priority terminals involved in random access by the layer; />For the number of reserved exclusive preambles.
Fig. 3 is a schematic diagram showing a terminal access state transition, which shows that the terminal is always in a sleep state and is in a monitoring state after being activated; when the length of the competition queue CRQ is detected to be empty, terminal access is carried out, namely the terminal access state is achieved; when the access is successful, data transmission is carried out, and the sleep state is returned after the data transmission is successful; when the access fails, entering a back-off state, sequentially entering a CRQ according to the selected preamble sequence number, and entering a monitoring state for the terminal group at the head of the queue in the next round to carry out the access process again; if the initialized terminal detects that the length of the contention queue CRQ is not empty, the terminal enters a back-off state, and is in a monitoring state again after the back-off is finished.
Example 2
As shown in fig. 4, in this embodiment, a method for solving random access contention of a multi-type terminal based on optimization of a near-end policy is provided, which specifically includes the following steps:
t1: acquiring initialized states of various types of terminals and data queue states, cell base station states, competition resource quantity and competition queue states; the priority of each type of terminal is divided, and terminals with different priorities are obtained; acquiring the states of all types of terminals; acquiring a current environment state;
t2: determining the length of a competition queue CRQ in the round;
t21: when the length of the competition queue CRQ in the round is not empty, namely CRQ_length is not equal to 0, continuing to execute the access process of the data terminal in the CRQ, waiting for the next RAO, and returning to T1;
t22: when the length of the competition queue CRQ in the round is empty, namely CRQ_length=0, the terminal carries out packet loss judgment, activates the terminal in the first group in the competition queue CRQ, and judges whether the retransmission times of the data packets in the round reach the tolerant retransmission times or not;
t221: when the retransmission times of the data packet in the round reach the tolerant retransmission times, the access fails, and the packet loss processing is carried out;
t22: when the retransmission times of the data packet in the round do not reach the tolerant retransmission times, a terminal access request is sent out, and a preamble is transmitted to a base station; the base station responds to the access and judges whether access conflict occurs;
t221: when no access conflict occurs, terminal access is carried out, access competition is solved, and access of multiple types of terminals is completed;
t222: when access conflict occurs, entering the next round;
t2221: the number of data packet retransmission times is increased by one;
t2222: and acquiring CRQ information of the competition queue from the T22, updating the position of the terminal in the queue, turning to the T22, and waiting for the activation of the first group of terminals.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (9)

1. A random access contention resolution method for a multi-type terminal based on near-end policy optimization is characterized by comprising the following steps:
s1: initializing states of various types of terminals and states of data queues, states of cell base stations, quantity of competing resources and states of competing queues; the priority of each type of terminal is divided, and terminals with different priorities are obtained; acquiring a current environment state;
s2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool;
s3: constructing an objective function, and performing deep learning on the objective function based on experience data stored in an experience pool; training and updating parameters by using a preset threshold value to adjust the number of reserved exclusive resources so as to finish the allocation optimization of the random access of the multi-type terminals;
in the step S3, an objective function is constructed, and deep learning is carried out on the objective function based on experience data stored in an experience pool; the process for completing the allocation optimization of the random access of the multi-type terminals specifically comprises the following steps:
s31: judging whether experience data stored in the experience pool reaches a preset threshold value or not;
s311: when the experience data stored in the experience pool reaches a preset threshold, an objective function is constructed, a near-end strategy optimization PPO algorithm is trained, network parameters are updated, and the experience pool is emptied;
s312: when the experience data stored in the experience pool does not reach a preset threshold value, entering the next step;
s32: judging whether the iteration number reaches a preset maximum iteration number or not;
s321: when the iteration number reaches the preset maximum iteration number, entering the next step;
s322: when the iteration times do not reach the preset maximum iteration times, training the intelligent body model again based on the current environment state;
s33: judging whether the overall indexes of the system, namely the average time delay, the average preamble transmission times and the average energy consumption, can reach preset requirements after the current iteration period is finished;
s331: when any index of the overall indexes of the system does not reach the preset requirement, training the intelligent body model again based on the current environment state;
s332: when the overall indexes of the system all reach the preset requirements, outputting an optimal solution to finish the allocation optimization of the random access of the multiple types of terminals.
2. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 1, wherein in S1, the priority of each type of terminal is divided to obtain terminals with different priorities, and the process of obtaining the current environmental state specifically includes the following steps:
s11: based on the sensitivity degree of the data terminal to time delay and reliability, the access terminal is prioritized, the terminal with low time delay and high reliability requirement is classified as a high priority terminal, and the conventional machine type communication terminal is classified as a low priority terminal;
s12: initializing environment parameters;
the parameters needed to initialize the environment include: terminal number of initiating access terminal number of contention queue, status of contention queue, total number of preamble slot number, number of exclusive resources reserved for high priority;
s13: and acquiring the current environment state, namely acquiring the number URLLC_nums of terminals with low current time delay and high reliable requirements, the number mMTC_nums of machine type communication terminals and the length CDQ_length of a competition queue.
3. The method for random access contention resolution of multi-type terminals based on the optimization of the near-end policy according to claim 2, wherein, in S11,
the high-priority terminal comprises a control equipment data acquisition terminal and fault early warning and detecting equipment;
the low-priority terminal comprises an operation index data acquisition terminal and an environment monitoring data acquisition terminal.
4. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 2, wherein in S2, the process of obtaining the optimal selection action and the instant rewards and forming experience data to be stored in the experience pool specifically comprises the following steps:
s21: setting the number of the preambles which initialize the monopolization of the high-priority terminal as;
S22: based on a distributed queue mechanism, training an agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state to obtain an optimal selection action;
s23: inputting the obtained optimal selection action into an environment state, and executing a distributed queue random access process based on random access information broadcast by a base station;
s24: updating the environment state, and calculating instant rewards based on the terminal access result;
s25: the current environmental state, the optimal selection action, the probability of the selection action, and the instant rewards are stored as a set of experience data into an experience pool.
5. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 4, wherein in S22, the process of obtaining the optimal selection action specifically includes the following steps:
s221: based on the distributed queue mechanism, the current environment state is calculatedInputting the strategy network of the near-end strategy optimization PPO algorithm;
s222: outputting an action space in an output layerThe action with the highest medium probability is used as the optimal selection action;
s2221: outputting each action in the output layer, and obtaining a score vector of each action by utilizing a softmax function;
s2222: sampling any action value from the formed probability distribution of each actionTo characterize the ratio of exclusive access resources and to set the probability of selecting the action as +.>
Wherein the probability of the action is selectedThe calculated expression of (2) is:
in the method, in the process of the invention,is in state->Lower action->Is selected according to the selection probability of (1); />Is the size of the action space; />To select a value of an action component; />Is the%>The values of the individual motion components;
s2223: selecting an action spaceThe action with the highest medium probability is used as the optimal selection action;
wherein the action spaceThe range of (2) is: />
6. The method for resolving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 5, wherein in S23, the selected action is input into the environment, and the process of executing the distributed queue random access procedure based on the random access information broadcasted by the base station specifically comprises the following steps:
s231: detecting whether the number of data packet retransmission times of the random access terminal participating in the current round reaches the tolerant number of retransmission times;
s2311: when the data packet retransmission times reach the tolerant retransmission times, the access fails, and the packet loss processing is carried out;
s2312: when the data packet retransmission times are detected to not reach the tolerant retransmission times, initiating an access process according to the priority rule of the access terminal, namely, initiating an access request by selecting a preamble by the high-priority terminal, and initiating the access request by the low-priority terminal to share part of the preamble with the high-priority terminal; respectively calculating the probability of successful access of the high-priority terminal and the low-priority terminal;
wherein, the high priority terminal access success rateThe calculated expression of (2) is:
in the method, in the process of the invention,the number of preambles exclusive for a specified high priority terminal; />For the number of terminals initiating an access request in this round, and +.>
Wherein, the low priority terminal access success rateThe calculated expression of (2) is:
in the method, in the process of the invention,is the available number of preambles;
s232: transmitting the selected preamble to a base station through a physical layer random access channel, decoding the preamble by the base station, and informing the sequence number of the terminal with access conflict; terminals with access conflicts form a conflict terminal group and are orderly discharged into the CRQ according to the sequence number of the lead code to wait for the next opportunity of random access.
7. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 6, wherein the process of updating the environmental status and calculating the instant prize based on the terminal access result in S24 specifically comprises the steps of:
s241: combining the conflict terminal group, counting the number of high priority terminals currently positioned at the head of the competition queue, the number of low priority terminals positioned at the head of the competition queue and the length of the competition queue, and entering a new environment state
S242: calculating instant rewards based on the access success rate of the high-priority terminal, the access success rate of the low-priority terminal, the length change of the competition queue and the packet loss condition respectively;
instant rewardsThe calculated expression of (2) is:
wherein,is high enough toThe rewards of the access success rate of the priority terminal are calculated as the following expression:
in the method, in the process of the invention,for the access success rate of terminals with low latency and high reliability requirements participating in the access procedure +.>The lowest success rate of the terminal with low time delay and high reliability requirement is tolerable for the system;
wherein,the method is characterized in that rewards for the success rate of the access of the low-priority terminal are calculated by the following expression:
in the method, in the process of the invention,access success rate for machine type communication terminals participating in an access procedure,/->The lowest success rate of access for the machine type communication terminal which can be tolerated by the system;
wherein,for rewards of competing queue length changes, the computational expression is:
in the method, in the process of the invention,the length of the contention queue after the random access process of the present round; />The length of the historical competition queue;
wherein,the penalty for packet loss processing in the random access process of the round is calculated by the following expression:
in the method, in the process of the invention,the number of lost packets.
8. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 7, wherein in S311, the process of training the near-end policy optimization PPO algorithm and updating the network parameters specifically includes the following steps:
s3111: according to discount rateAnd settling rewards expectations and advantage estimates corresponding to each random access process in the batch of data;
wherein the reward is desiredThe calculated expression of (2) is:
in the method, in the process of the invention,is->Rewarding expectations of the secondary random access procedure; />Is->Instant rewards of the secondary random access procedure; />For obtaining +.>The value of the state; />Is the total number of rounds;
wherein the dominance estimationThe calculated expression of (2) is:
in the method, in the process of the invention,is->Estimating the advantages of the secondary random access process; />For obtaining +.>The value of the state;
s3112: updating the Critic network with MSEloss as a loss function to minimize the difference between the current state value and the discounted prize:
in the method, in the process of the invention,error terms for the cost function; />Status +.>Status value of (2); />For discounts rewarding->Is the amount of data in the experience pool;
s3113: calculating a loss function of the Actor network;
wherein, the loss function of the Actor networkThe calculated expression of (2) is:
in the method, in the process of the invention,is the probability ratio of new strategy to old strategy, namely +.>,/>Probability of being a new strategy->Probability of being an old policy; />For estimating the function for advantage, i.e;/>Is a truncated function; />To cut off the super parameter to limit the network update amplitude;
s3114: constructing an objective function of a near-end strategy optimization PPO algorithm network;
wherein, the near-end strategy optimizes the objective function of the PPO algorithm networkThe expression of (2) is:
wherein,is a loss function of the Actor network; />Error terms for the cost function; />Entropy rewards for the policy model; />、/>Is a constant coefficient to adjust the weight of each part in the objective function;
s3115: optimizing the objective function of the PPO algorithm network through maximized near-end strategy, and updating network parametersAnd network parameters are performed when the data in the experience poolTAfter a second continuous update, network parameters +.>Updated to->
9. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 7, wherein in S33, the calculation of the overall system index after the end of the current iteration period includes:
average time delayThe calculated expression of (2) is:
in the method, in the process of the invention,the RAO sum required by all terminals to complete the random access process is represented, namely the overall average time delay of the system; "1" means the first RAO; />Representing ginsengTotal number of terminals with access procedure; />Representing the total number of available preambles; />A terminal participating in an access process; />The sum of RAOs required for the terminals that collide in the first RAO to complete the random access procedure in the following;
preamble transfer timesThe calculated expression of (2) is:
average energy consumptionThe calculated expression of (2) is:
in the method, in the process of the invention,、/>、/>the energy consumption of the terminal in the back-off state, the access state and the monitoring state is respectively; />、/>Respectively +.>The number of high priority terminals and low priority terminals involved in random access by the layer; />For the number of reserved exclusive preambles.
CN202311504327.8A 2023-11-13 2023-11-13 Multi-type terminal random access competition solving method based on near-end policy optimization Active CN117241409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311504327.8A CN117241409B (en) 2023-11-13 2023-11-13 Multi-type terminal random access competition solving method based on near-end policy optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311504327.8A CN117241409B (en) 2023-11-13 2023-11-13 Multi-type terminal random access competition solving method based on near-end policy optimization

Publications (2)

Publication Number Publication Date
CN117241409A CN117241409A (en) 2023-12-15
CN117241409B true CN117241409B (en) 2024-03-22

Family

ID=89098749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311504327.8A Active CN117241409B (en) 2023-11-13 2023-11-13 Multi-type terminal random access competition solving method based on near-end policy optimization

Country Status (1)

Country Link
CN (1) CN117241409B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050023701A (en) * 2003-09-02 2005-03-10 삼성전자주식회사 Method for controlling back-off of random access and its program recording record medium
US9019823B1 (en) * 2013-01-16 2015-04-28 Sprint Spectrum L.P. Random access preamble selection
CN105828450A (en) * 2016-03-11 2016-08-03 京信通信系统(广州)有限公司 Competition access method and apparatus
KR101845398B1 (en) * 2017-02-28 2018-04-04 숙명여자대학교산학협력단 Method and apparatus for setting barring factor for controlling access of user equipment based on machine learning
CN109756990A (en) * 2017-11-06 2019-05-14 中国移动通信有限公司研究院 A kind of accidental access method and mobile communication terminal
CN110139392A (en) * 2019-05-06 2019-08-16 安徽继远软件有限公司 LTE electric power wireless private network random access channel multiple conflict detection method
WO2021146828A1 (en) * 2020-01-20 2021-07-29 Oppo广东移动通信有限公司 Random access method and apparatus
CN114375066A (en) * 2022-01-08 2022-04-19 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
CN114501667A (en) * 2022-02-21 2022-05-13 清华大学 Multi-channel access modeling and distributed implementation method considering service priority
CN115278908A (en) * 2022-01-24 2022-11-01 北京科技大学 Wireless resource allocation optimization method and device
CN116801314A (en) * 2023-06-21 2023-09-22 湖南大学 Network slice resource allocation method based on near-end policy optimization

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9893774B2 (en) * 2001-04-26 2018-02-13 Genghiscomm Holdings, LLC Cloud radio access network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050023701A (en) * 2003-09-02 2005-03-10 삼성전자주식회사 Method for controlling back-off of random access and its program recording record medium
US9019823B1 (en) * 2013-01-16 2015-04-28 Sprint Spectrum L.P. Random access preamble selection
CN105828450A (en) * 2016-03-11 2016-08-03 京信通信系统(广州)有限公司 Competition access method and apparatus
KR101845398B1 (en) * 2017-02-28 2018-04-04 숙명여자대학교산학협력단 Method and apparatus for setting barring factor for controlling access of user equipment based on machine learning
CN109756990A (en) * 2017-11-06 2019-05-14 中国移动通信有限公司研究院 A kind of accidental access method and mobile communication terminal
CN110139392A (en) * 2019-05-06 2019-08-16 安徽继远软件有限公司 LTE electric power wireless private network random access channel multiple conflict detection method
WO2021146828A1 (en) * 2020-01-20 2021-07-29 Oppo广东移动通信有限公司 Random access method and apparatus
CN114375066A (en) * 2022-01-08 2022-04-19 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
CN115278908A (en) * 2022-01-24 2022-11-01 北京科技大学 Wireless resource allocation optimization method and device
CN114501667A (en) * 2022-02-21 2022-05-13 清华大学 Multi-channel access modeling and distributed implementation method considering service priority
CN116801314A (en) * 2023-06-21 2023-09-22 湖南大学 Network slice resource allocation method based on near-end policy optimization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research on Service Function Chain Deployment Algorithm Based on Proximal Policy Optimization;Peng Sun;《Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering》;全文 *
基于深度强化学习的无线网络资源分配算法;李孜恒;孟超;;通信技术(08);全文 *
适于无线多媒体传感器网络的MAC层退避算法;李瑞芳;罗娟;李仁发;;通信学报(11);全文 *
面向不确定CSI随机接入网络的深度稳健资源分配;吴伟华;柴冠华;杨清海;刘润滋;;通信学报(07);全文 *

Also Published As

Publication number Publication date
CN117241409A (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
CN104185298B (en) Network load dynamic self-adapting parameter regulation means based on priority
CN111867139B (en) Deep neural network self-adaptive back-off strategy implementation method and system based on Q learning
CN111245541B (en) Channel multiple access method based on reinforcement learning
CN105592564B (en) Adaptive access mechanism based on the estimation of live-vertex number in wireless Mesh netword
CN111050413B (en) Unmanned aerial vehicle CSMA access method based on adaptive adjustment strategy
CN101582837A (en) Service access control method, wireless access system and access control device
CN110972162B (en) Underwater acoustic sensor network saturation throughput solving method based on Markov chain
Chen et al. Contention resolution in Wi-Fi 6-enabled Internet of Things based on deep learning
CN111601398B (en) Ad hoc network medium access control method based on reinforcement learning
Le et al. LSTM-based channel access scheme for vehicles in cognitive vehicular networks with multi-agent settings
CN114024639B (en) Distributed channel allocation method in wireless multi-hop network
CN117241409B (en) Multi-type terminal random access competition solving method based on near-end policy optimization
CN110337069A (en) The frame length optimization method of information broadcast
CN113316174A (en) Intelligent access method for unlicensed spectrum
CN108513309A (en) A kind of access jamming control method of NB-IoT systems
Pei et al. MAC contention protocol based on reinforcement learning for IoV communication environments
CN106101988B (en) A method of reducing machine type communication access interference probability and time delay
CN114051280A (en) CSMA optimization method based on deep reinforcement learning
Lee et al. Multi-agent reinforcement learning for a random access game
Han et al. Multi-agent reinforcement learning for green energy powered IoT networks with random access
CN115134026B (en) Intelligent unlicensed spectrum access method based on average field
CN111541999B (en) Large-scale distributed node self-organizing access method based on biological elicitation
CN113115470B (en) Distributed channel opportunistic access method of wireless multicast network
WO2024031535A1 (en) Wireless communication method, terminal device, and network device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant