CN117241409B

CN117241409B - Multi-type terminal random access competition solving method based on near-end policy optimization

Info

Publication number: CN117241409B
Application number: CN202311504327.8A
Authority: CN
Inventors: 颜志; 苑书豪; 欧阳博; 禹怀龙; 段豪勇
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2023-11-13
Filing date: 2023-11-13
Publication date: 2024-03-22
Anticipated expiration: 2043-11-13
Also published as: CN117241409A

Abstract

The invention belongs to the technical field of wireless communication, and particularly relates to a method for solving random access contention of a multi-type terminal based on near-end policy optimization, which comprises the following steps: s1: initializing states of various types of terminals and states of data queues, states of cell base stations, quantity of competing resources and states of competing queues; carrying out priority division on various types of terminals to obtain terminals with different priorities; acquiring a current environment state; s2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool; s3: and constructing an objective function, performing deep learning on the objective function based on experience data stored in an experience pool, and performing training update on parameters by using a preset threshold value to complete allocation optimization of random access of multiple types of terminals.

Description

Multi-type terminal random access competition solving method based on near-end policy optimization

Technical Field

The invention belongs to the technical field of wireless communication, and particularly relates to a method for solving random access contention of a multi-type terminal based on near-end policy optimization.

Background

As human society enters the era of everything interconnection, in the fifth generation communication technology (5G), the introduction of large-scale machine type communication (mctc) makes access to large-scale terminal devices possible. The random access process is an important process for realizing uplink communication by the device, and is used for realizing initialization access of users in the cellular network, uplink resource allocation when the users perform uplink data transmission, uplink medium synchronization and the like. However, as the number of wireless communication devices increases in bursts, collision problems during random access are increasingly prominent. For this reason, it is necessary to design an efficient contention resolution mechanism to cope with access of a large number of multi-type terminals.

At present, the researches on random access protocols are mainly divided into two major categories, namely ALOHA family protocols and tree splitting protocols. ALOHA family protocols discretely process conflicting devices in the time domain with access class restriction (ACB) and Backoff (Backoff) mechanisms to reduce the probability of collisions. The distributed queue contention resolution mechanism is derived from a tree splitting protocol that combines conflicting devices into a device group for time domain discretization by introducing a Contention Resolution Queue (CRQ) to fully exploit the preamble and reduce the probability of secondary collisions by conflicting terminals in retransmissions. When solving the optimal strategy for random access, the prior stage mostly utilizes deep reinforcement learning algorithms such as DQN and AC algorithms to solve the problem of resource conflict in the random access process.

However, most of the researches at present focus on the random access procedure of only single-type users, and do not consider the coexistence of multiple-type communication devices in actual production activities; meanwhile, most of the optimization for the random access protocol focuses on the optimization of the ALOHA family protocol, the potential advantages of the distributed queue scheme are not realized, and the existing contention resolution mechanism cannot adapt to network load caused by massive terminal large-scale initialization access and periodical data uploading, so that the overall performance of a communication system is affected.

Disclosure of Invention

The invention provides a method for solving random access contention of a multi-type terminal based on near-end policy optimization, which can be based on a distributed queue mechanism, introduce a priority division idea, increase the access opportunity of a specific terminal according to requirements, reduce the probability of secondary collision of the specific terminal, and thereby improve the access success rate and stability of the whole system so as to cope with network congestion caused by initiating random access requests by mass terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced.

A random access contention resolution method for a multi-type terminal based on near-end policy optimization specifically comprises the following steps:

s1: initializing states of various types of terminals and states of data queues, states of cell base stations, quantity of competing resources and states of competing queues; the priority of each type of terminal is divided, and terminals with different priorities are obtained; acquiring a current environment state;

s2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool;

s3: constructing an objective function, and performing deep learning on the objective function based on experience data stored in an experience pool; training and updating parameters by using a preset threshold value to adjust the number of reserved exclusive resources so as to finish the allocation optimization of the random access of the multi-type terminals;

in the step S3, an objective function is constructed, and deep learning is carried out on the objective function based on experience data stored in an experience pool; the process for completing the allocation optimization of the random access of the multi-type terminals specifically comprises the following steps:

s31: judging whether experience data stored in the experience pool reaches a preset threshold value or not;

s311: when the experience data stored in the experience pool reaches a preset threshold, an objective function is constructed, a near-end strategy optimization PPO algorithm is trained, network parameters are updated, and the experience pool is emptied;

s312: when the experience data stored in the experience pool does not reach a preset threshold value, entering the next step;

s32: judging whether the iteration number reaches a preset maximum iteration number or not;

s321: when the iteration number reaches the preset maximum iteration number, entering the next step;

s322: when the iteration times do not reach the preset maximum iteration times, training the intelligent body model again based on the current environment state;

s33: judging whether the overall indexes of the system, namely the average time delay, the average preamble transmission times and the average energy consumption, can reach preset requirements after the current iteration period is finished;

s331: when any index of the overall indexes of the system does not reach the preset requirement, training the intelligent body model again based on the current environment state;

s332: when the overall indexes of the system all reach the preset requirements, outputting an optimal solution to finish the allocation optimization of the random access of the multiple types of terminals.

The method can introduce the thought of priority division based on a distributed queue mechanism, increase the access opportunity of a specific terminal according to the requirement, and reduce the probability of secondary conflict of the specific terminal, thereby improving the access success rate and the stability of the whole system and coping with network congestion caused by the initiation of random access requests by massive terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced.

Further, in the step S1, the process of prioritizing the terminals of each type to obtain terminals with different priorities and obtaining the current environmental state specifically includes the following steps:

s11: based on the sensitivity degree of the data terminal to time delay and reliability, prioritizing the access terminal, dividing the terminal with high reliability requirement for low time delay into high priority terminal, and dividing the conventional machine type communication terminal into low priority terminal;

s12: initializing environment parameters;

the parameters needed to initialize the environment include: terminal number of initiating access terminal number of contention queue, status of contention queue, total number of preamble slot number, number of exclusive resources reserved for high priority;

s13: and acquiring the current environment state, namely acquiring the number URLLC_nums of terminals with low current time delay and high reliable requirements, the number mMTC_nums of machine type communication terminals and the length CDQ_length of a competition queue.

Further, in the step S11,

the high-priority terminal comprises a control equipment data acquisition terminal and fault early warning and detecting equipment;

the low-priority terminal comprises an operation index data acquisition terminal and an environment monitoring data acquisition terminal.

Further, in the step S2, the process of obtaining the optimal selection action and the instant rewards, and forming the experience data to store in the experience pool specifically includes the following steps:

s21: setting the number of the preambles which initialize the monopolization of the high-priority terminal as;

S22: based on a distributed queue mechanism, training an agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state to obtain an optimal selection action;

s23: inputting the obtained optimal selection action into an environment state, and executing a distributed queue random access process based on random access information broadcast by a base station;

s24: updating the environment state, and calculating instant rewards based on the terminal access result;

s25: the current environmental state, the optimal selection action, the probability of the selection action, and the instant rewards are stored as a set of experience data into an experience pool.

Further, in S22, the process of obtaining the optimal selection action specifically includes the following steps:

s221: based on the distributed queue mechanism, the current environment state is calculatedInputting the strategy network of the near-end strategy optimization PPO algorithm;

s222: outputting an action space in an output layerThe action with the highest medium probability is used as the optimal selection action;

s2221: outputting each action in the output layer, and obtaining a score vector of each action by utilizing a softmax function;

s2222: sampling any action value from the formed probability distribution of each actionTo characterize the ratio of exclusive access resources and to set the probability of selecting the action as +.>；

Wherein the probability of the action is selectedThe calculated expression of (2) is:

；

in the method, in the process of the invention,is in state->Lower action->Is selected according to the selection probability of (1); />Is the action spaceSize of the material;to select a value of an action component; />Is the%>The values of the individual motion components;

s2223: selecting an action spaceThe action with the highest medium probability is used as the optimal selection action;

wherein the action spaceThe range of (2) is: />。

Further, in S23, the process of executing the distributed queue random access procedure based on the random access information broadcasted by the base station by inputting the selected action into the environment specifically includes the following steps:

s231: detecting whether the number of data packet retransmission times of the random access terminal participating in the current round reaches the tolerant number of retransmission times;

s2311: when the data packet retransmission times reach the tolerant retransmission times, the access fails, and the packet loss processing is carried out;

s2312: when the data packet retransmission times are detected to not reach the tolerant retransmission times, initiating an access process according to the priority rule of the access terminal, namely, initiating an access request by selecting a preamble by the high-priority terminal, and initiating the access request by the low-priority terminal to share part of the preamble with the high-priority terminal; respectively calculating the probability of successful access of the high-priority terminal and the low-priority terminal;

wherein, the high priority terminal access success rateThe calculated expression of (2) is:

；

in the method, in the process of the invention,the number of preambles exclusive for a specified high priority terminal; />For the number of terminals initiating an access request in this round, and +.>；

Wherein, the low priority terminal access success rateThe calculated expression of (2) is:

；

in the method, in the process of the invention,is the available number of preambles;

s232: transmitting the selected preamble to a base station through a physical layer random access channel, decoding the preamble by the base station, and informing the sequence number of the terminal with access conflict; terminals with access conflicts form a conflict terminal group and are orderly discharged into the CRQ according to the sequence number of the lead code to wait for the next opportunity of random access.

Further, in S24, the process of updating the environmental status and calculating the instant rewards based on the terminal access result specifically includes the following steps:

s241: combining the conflict terminal group, counting the number of high priority terminals currently positioned at the head of the competition queue, the number of low priority terminals positioned at the head of the competition queue and the length of the competition queue, and entering a new environment state；

S242: calculating instant rewards based on the access success rate of the high-priority terminal, the access success rate of the low-priority terminal, the length change of the competition queue and the packet loss condition respectively;

instant rewardsThe calculated expression of (2) is:

；

wherein,the method is characterized in that rewards for the success rate of the access of the high-priority terminal are calculated by the following expression:

；

in the method, in the process of the invention,for the access success rate of terminals with low latency and high reliability requirements participating in the access procedure +.>The lowest success rate of the terminal with low time delay and high reliability requirement is tolerable for the system;

wherein,the method is characterized in that rewards for the success rate of the access of the low-priority terminal are calculated by the following expression:

；

in the method, in the process of the invention,access success rate for machine type communication terminals participating in an access procedure,/->The lowest success rate of access for the machine type communication terminal which can be tolerated by the system;

wherein,for rewards of competing queue length changes, the computational expression is:

；

in the method, in the process of the invention,the length of the contention queue after the random access process of the present round; />The length of the historical competition queue;

wherein,the penalty for packet loss processing in the random access process of the round is calculated by the following expression:

；

in the method, in the process of the invention,the number of lost packets.

Further, in S311, training the near-end policy optimization PPO algorithm, and updating the network parameters specifically includes the following steps:

s3111: according to discount rateAnd settling rewards expectations and advantage estimates corresponding to each random access process in the batch of data;

wherein the reward is desiredThe calculated expression of (2) is:

；

in the method, in the process of the invention,is->Rewarding expectations of the secondary random access procedure; />Is->Instant rewards of the secondary random access procedure;for obtaining +.>The value of the state; />Is the total number of rounds;

wherein the dominance estimationThe calculated expression of (2) is:

；

in the method, in the process of the invention,is->Secondary random access procedureIs a dominant estimate of (1); />Obtained for use of preset Critic networksThe value of the state;

s3112: updating the Critic network with MSEloss as a loss function to minimize the difference between the current state value and the discounted prize:

；

in the method, in the process of the invention,error terms for the cost function; />Status +.>Status value of (2); />For the discount to be awarded,is the amount of data in the experience pool;

s3113: calculating a loss function of the Actor network;

wherein, the loss function of the Actor networkThe calculated expression of (2) is:

；

in the method, in the process of the invention,is the probability ratio of new strategy to old strategy, namely +.>，/>Probability of being a new strategy->Probability of being an old policy; />For estimating the function for advantage, i.e；/>Is a truncated function; />To cut off the super parameter to limit the network update amplitude;

s3114: constructing an objective function of a near-end strategy optimization PPO algorithm network;

wherein, the near-end strategy optimizes the objective function of the PPO algorithm networkThe expression of (2) is:

；

wherein,is a loss function of the Actor network; />Error terms for the cost function; />Entropy rewards for the policy model; />、/>Is a constant coefficient to adjust the weight of each part in the objective function;

s3115: optimizing the objective function of the PPO algorithm network through maximized near-end strategy, and updating network parametersAnd network parameters are performed when the data in the experience poolTAfter a second continuous update, network parameters +.>Updated to->。

Further, in S33, the calculating of the overall system index after the end of the current iteration period includes:

average time delayThe calculated expression of (2) is:

；

in the method, in the process of the invention,the RAO sum required by all terminals to complete the random access process is represented, namely the overall average time delay of the system; "1" means the first RAO; />Representing the total number of terminals participating in the access procedure; />Representing the total number of available preambles; />A terminal participating in an access process; />The sum of RAOs required for the terminals that collide in the first RAO to complete the random access procedure in the following;

preamble transfer timesThe calculated expression of (2) is:

；

average energy consumptionThe calculated expression of (2) is:

；

in the method, in the process of the invention,、/>、/>the energy consumption of the terminal in the back-off state, the access state and the monitoring state is respectively; />、/>Respectively +.>The number of high priority terminals and low priority terminals involved in random access by the layer; />For the number of reserved exclusive preambles.

The beneficial effects of the invention are as follows:

the invention can introduce the thought of priority division based on a distributed queue mechanism, increase the access opportunity of a specific terminal according to the requirement, reduce the probability of secondary conflict, thereby improving the access success rate and the stability of the whole system and coping with network congestion caused by the random access request initiated by a mass of terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow diagram of a distributed queue contention resolution mechanism;

FIG. 3 is a schematic diagram of a terminal access state transition;

fig. 4 is a flowchart for obtaining the optimal solution in embodiment 2.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.

Furthermore, in the following description, specific details are provided for the purpose of providing a thorough understanding of the examples, and the particular meaning of the terms described above in this application will be understood to those of ordinary skill in the art in the context of the present application.

Example 1

Fig. 1 shows a method for solving random access contention of multiple types of terminals based on optimization of a near-end policy, which can be based on a distributed queue mechanism, introduce a priority division idea, increase access opportunities of specific terminals according to requirements, and reduce probability of secondary collision of specific terminals, so as to improve access success rate and stability of the whole system, and cope with network congestion caused by initiating random access requests by massive terminals; meanwhile, the PPO algorithm is optimized by utilizing the near-end strategy to dynamically adjust the exclusive resources, so that the optimal resource planning under the preset condition is met, the terminal access success rate is improved, and the resource waste is reduced. The method specifically comprises the following steps:

the high-priority terminal comprises a control equipment data acquisition terminal and fault early warning and detecting equipment; the low-priority terminal comprises an operation index data acquisition terminal and an environment monitoring data acquisition terminal.

S12: initializing environment parameters;

S2: establishing an agent model at a base station side, training the agent model by utilizing a strategy network of a near-end strategy optimization PPO algorithm in combination with the current environment state based on a distributed queue mechanism, acquiring optimal selection actions and instant rewards, and forming experience data to be stored in an experience pool; the method specifically comprises the following steps:

it should be noted that, as shown in fig. 2, before training the agent model by using the policy network of the near-end policy optimization PPO algorithm, the near-end policy optimization PPO algorithm network needs to be constructed.

The process of obtaining the optimal selection action specifically comprises the following steps:

；

in the method, in the process of the invention,is in state->Lower action->Is selected according to the selection probability of (1); />Is the size of the action space;to select a value of an action component; />Is the%>The values of the individual motion components;

wherein the action spaceThe range of (2) is: />。

In the present embodiment, when the firstWhen the batch of terminal equipment requests access, the intelligent agent model performs optimal action selection according to the instant rewards of the last round of access process, so as to adjust the quantity of exclusive resources.

the process of executing the distributed queue random access process based on the random access information broadcast by the base station specifically includes the following steps:

；

the process of updating the environment state and calculating the instant rewards based on the terminal access result specifically comprises the following steps:

instant rewardsThe calculated expression of (2) is:

；

in the method, in the process of the invention,access success rate for machine type communication terminals participating in an access procedure,/->For machine-like communication terminals tolerated by the systemThe lowest success rate of access;

；

in the method, in the process of the invention,the number of lost packets.

S3: constructing an objective function, and performing deep learning on the objective function based on experience data stored in an experience pool; and training and updating parameters by using a preset threshold value to adjust the number of reserved exclusive resources so as to finish the allocation optimization of the random access of the multi-type terminals.

s311: when the experience data stored in the experience pool reaches a preset threshold, an objective function is constructed, a near-end strategy optimization PPO algorithm is trained, network parameters are updated, and the experience pool is emptied; the method specifically comprises the following steps:

wherein the reward is desiredThe calculated expression of (2) is:

；

wherein the dominance estimationThe calculated expression of (2) is:

；

in the method, in the process of the invention,is->Estimating the advantages of the secondary random access process; />Obtained for use of preset Critic networksThe value of the state;

；

s3113: calculating a loss function of the Actor network;

；

The calculation of the system overall index after the current iteration period is finished comprises the following steps:

average time delayThe calculated expression of (2) is:

；

preamble transfer timesThe calculated expression of (2) is:

；

average energy consumptionThe calculated expression of (2) is:

；

Fig. 3 is a schematic diagram showing a terminal access state transition, which shows that the terminal is always in a sleep state and is in a monitoring state after being activated; when the length of the competition queue CRQ is detected to be empty, terminal access is carried out, namely the terminal access state is achieved; when the access is successful, data transmission is carried out, and the sleep state is returned after the data transmission is successful; when the access fails, entering a back-off state, sequentially entering a CRQ according to the selected preamble sequence number, and entering a monitoring state for the terminal group at the head of the queue in the next round to carry out the access process again; if the initialized terminal detects that the length of the contention queue CRQ is not empty, the terminal enters a back-off state, and is in a monitoring state again after the back-off is finished.

Example 2

As shown in fig. 4, in this embodiment, a method for solving random access contention of a multi-type terminal based on optimization of a near-end policy is provided, which specifically includes the following steps:

t1: acquiring initialized states of various types of terminals and data queue states, cell base station states, competition resource quantity and competition queue states; the priority of each type of terminal is divided, and terminals with different priorities are obtained; acquiring the states of all types of terminals; acquiring a current environment state;

t2: determining the length of a competition queue CRQ in the round;

t21: when the length of the competition queue CRQ in the round is not empty, namely CRQ_length is not equal to 0, continuing to execute the access process of the data terminal in the CRQ, waiting for the next RAO, and returning to T1;

t22: when the length of the competition queue CRQ in the round is empty, namely CRQ_length=0, the terminal carries out packet loss judgment, activates the terminal in the first group in the competition queue CRQ, and judges whether the retransmission times of the data packets in the round reach the tolerant retransmission times or not;

t221: when the retransmission times of the data packet in the round reach the tolerant retransmission times, the access fails, and the packet loss processing is carried out;

t22: when the retransmission times of the data packet in the round do not reach the tolerant retransmission times, a terminal access request is sent out, and a preamble is transmitted to a base station; the base station responds to the access and judges whether access conflict occurs;

t221: when no access conflict occurs, terminal access is carried out, access competition is solved, and access of multiple types of terminals is completed;

t222: when access conflict occurs, entering the next round;

t2221: the number of data packet retransmission times is increased by one;

t2222: and acquiring CRQ information of the competition queue from the T22, updating the position of the terminal in the queue, turning to the T22, and waiting for the activation of the first group of terminals.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A random access contention resolution method for a multi-type terminal based on near-end policy optimization is characterized by comprising the following steps:

2. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 1, wherein in S1, the priority of each type of terminal is divided to obtain terminals with different priorities, and the process of obtaining the current environmental state specifically includes the following steps:

s11: based on the sensitivity degree of the data terminal to time delay and reliability, the access terminal is prioritized, the terminal with low time delay and high reliability requirement is classified as a high priority terminal, and the conventional machine type communication terminal is classified as a low priority terminal;

s12: initializing environment parameters;

3. The method for random access contention resolution of multi-type terminals based on the optimization of the near-end policy according to claim 2, wherein, in S11,

4. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 2, wherein in S2, the process of obtaining the optimal selection action and the instant rewards and forming experience data to be stored in the experience pool specifically comprises the following steps:

5. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 4, wherein in S22, the process of obtaining the optimal selection action specifically includes the following steps:

；

in the method, in the process of the invention,is in state->Lower action->Is selected according to the selection probability of (1); />Is the size of the action space; />To select a value of an action component; />Is the%>The values of the individual motion components;

wherein the action spaceThe range of (2) is: />。

6. The method for resolving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 5, wherein in S23, the selected action is input into the environment, and the process of executing the distributed queue random access procedure based on the random access information broadcasted by the base station specifically comprises the following steps:

；

7. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 6, wherein the process of updating the environmental status and calculating the instant prize based on the terminal access result in S24 specifically comprises the steps of:

instant rewardsThe calculated expression of (2) is:

；

wherein,is high enough toThe rewards of the access success rate of the priority terminal are calculated as the following expression:

；

in the method, in the process of the invention,the number of lost packets.

8. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 7, wherein in S311, the process of training the near-end policy optimization PPO algorithm and updating the network parameters specifically includes the following steps:

wherein the reward is desiredThe calculated expression of (2) is:

；

in the method, in the process of the invention,is->Rewarding expectations of the secondary random access procedure; />Is->Instant rewards of the secondary random access procedure; />For obtaining +.>The value of the state; />Is the total number of rounds;

wherein the dominance estimationThe calculated expression of (2) is:

；

in the method, in the process of the invention,is->Estimating the advantages of the secondary random access process; />For obtaining +.>The value of the state;

；

in the method, in the process of the invention,error terms for the cost function; />Status +.>Status value of (2); />For discounts rewarding->Is the amount of data in the experience pool;

s3113: calculating a loss function of the Actor network;

；

9. The method for solving random access contention of multiple types of terminals based on optimization of a near-end policy according to claim 7, wherein in S33, the calculation of the overall system index after the end of the current iteration period includes:

average time delayThe calculated expression of (2) is:

；

in the method, in the process of the invention,the RAO sum required by all terminals to complete the random access process is represented, namely the overall average time delay of the system; "1" means the first RAO; />Representing ginsengTotal number of terminals with access procedure; />Representing the total number of available preambles; />A terminal participating in an access process; />The sum of RAOs required for the terminals that collide in the first RAO to complete the random access procedure in the following;

preamble transfer timesThe calculated expression of (2) is:

；

average energy consumptionThe calculated expression of (2) is:

；