CN112867117B - A Q-learning-based energy-saving method in NB-IoT - Google Patents


Info

Publication number
CN112867117B
Authority
CN
China
Prior art keywords
devices
base station
action
energy consumption
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110074159.8A
Other languages
Chinese (zh)
Other versions
CN112867117A (en)
Inventor
裴二荣
王振民
朱冰冰
张茹
杨光财
荆玉琪
周礼能
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile IoT Co Ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110074159.8A priority Critical patent/CN112867117B/en
Publication of CN112867117A publication Critical patent/CN112867117A/en
Application granted granted Critical
Publication of CN112867117B publication Critical patent/CN112867117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. Transmission Power Control [TPC] or power classes
    • H04W52/02Power saving arrangements
    • H04W52/0203Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H04W52/0206Power saving arrangements in the radio access network or backbone network of wireless communication networks in access points, e.g. base stations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W74/00Wireless channel access
    • H04W74/08Non-scheduled access, e.g. ALOHA
    • H04W74/0833Random access procedures, e.g. with 4-step access
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a Q-learning-based energy-saving method in NB-IoT and belongs to the technical field of communications. In this method, the base station dynamically controls the number of devices that initiate random access in each transmission time interval according to parameters such as the network load, the repetition number, and the transmission data resources, thereby reducing the number of devices that collide during the random-access procedure and, with it, the total random-access energy consumption. The base station acts as the agent: its action is defined as the ratio of the number of devices allowed to initiate random access to the total number of active devices, and its state consists of a set of previously observed information such as the number of devices that communicated successfully and the energy consumption. The invention reduces system energy consumption and prolongs device lifetime while guaranteeing device throughput.

Description

A Q-learning-based energy-saving method in NB-IoT

Technical field

The invention belongs to the field of communication technologies and relates to a Q-learning-based energy-saving method in NB-IoT.

Background

Over the past few decades, since the start of the industrial revolution, humanity has made tremendous progress. In recent years, with the rapid development of fifth-generation (5G) mobile communications, the Internet of Things (IoT), and mobile computing, Low Power Wide Area Networks (LPWAN) have attracted increasing attention. Narrowband IoT (NB-IoT), which is based on cellular communication technology, is a promising LPWAN technology. NB-IoT requires a minimum system bandwidth of 180 kHz for both the downlink and the uplink and is a new 3rd Generation Partnership Project (3GPP) radio access technology characterized by low power consumption, narrow bandwidth, strong coverage, low cost, and massive connectivity. NB-IoT can therefore be widely applied in scenarios with poor channel conditions (such as underground parking lots) or with delay-tolerant devices (such as water meters).

NB-IoT data communication mainly takes place on the uplink, whose resources mainly comprise the Narrowband Physical Random Access Channel (NPRACH) and the Narrowband Physical Uplink Shared Channel (NPUSCH). The NPRACH is mainly used to start the random-access procedure, while the NPUSCH carries the data transmission from the device to the base station. NB-IoT mainly targets periodic devices with high delay tolerance (such as water meters and electricity meters); the problem considered here is how to complete device communication with low energy consumption while still meeting the devices' throughput requirements.

At present, existing energy-saving mechanisms lack a dynamic learning process; they mainly optimize each device's back-off time and the extended discontinuous reception mechanism to trade device energy consumption off against delay. In practice, however, many devices often transmit data at the same time. By scheduling devices sensibly and controlling the number of devices that access the network in each Transmission Time Interval (TTI), the total communication energy consumption can be optimized while throughput is still guaranteed. A scheduling mechanism based on the Q-learning algorithm is therefore designed, in which the base station allows an appropriate number of devices to perform random access in each TTI according to parameters such as the network load, the amount of uplink resources, the data size, and the repetition number, reducing system energy consumption and extending device lifetime while guaranteeing throughput.

Summary of the invention

In view of this, the purpose of the present invention is to provide a Q-learning-based energy-saving method in NB-IoT. Using the Q-learning algorithm, the base station flexibly adjusts the number of devices that initiate access requests in each TTI according to factors such as the network load, the delay, the amount of uplink resources, the preamble size, and the repetition number, saving system energy and prolonging device lifetime while guaranteeing throughput. The method is simple and efficient and, at the same time, has a certain degree of portability.

To achieve the above object, the present invention provides the following technical solution:

A Q-learning-based energy-saving method in NB-IoT, comprising the following steps (a code sketch of this loop is given after the step list):

S1: define the state set and the action set of the base station;

S2: at time t = 0, initialize the state of the base station and the action Q-values to 0;

S3: compute the state value of the base station's initial state s_t;

S4: select an action a_t(i) according to the ε-greedy policy;

S5: after executing action a_t(i), the system obtains the environment reward r_t according to the reward-function formula and then enters the next state s_{t+1};

S6: update the base station's action Q-value function according to the update formula;

S7: t ← t + 1, and return to step S2.
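As an informal illustration (not part of the claimed method), the loop in steps S1 to S7 can be sketched in Python as follows. The simulator object env, its reset()/step() interface, the exploration rate of 0.1, and the number of iterations are assumptions introduced only for the sketch; the action set and the values α = 0.01 and γ = 0.8 are taken from the description below.

import random
from collections import defaultdict

# Minimal sketch of the S1-S7 loop. The action set is the one defined in step S1
# (fraction of active devices allowed to initiate random access in a TTI).
ACTIONS = [0.2, 0.4, 0.6, 0.8, 1.0]
ALPHA, GAMMA, EPSILON = 0.01, 0.8, 0.1  # learning rate, discount factor, exploration rate

def train(env, num_ttis=10000):
    """env is an assumed NB-IoT simulator exposing reset() and step(action)."""
    q = defaultdict(lambda: [0.0] * len(ACTIONS))  # S2: Q-values initialised to 0
    state = env.reset()                            # S3: observe the initial state
    for _ in range(num_ttis):                      # S7: advance one TTI per iteration
        # S4: epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        # S5: execute the action, observe the reward and the next state
        next_state, reward = env.step(ACTIONS[a])
        # S6: update the action-value function
        q[state][a] += ALPHA * (reward + GAMMA * max(q[next_state]) - q[state][a])
        state = next_state
    return q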

Further, in step S1, the state set of the base station is represented as a series of previously observed information, namely S_t = {U_{t-1}, U_{t-2}, U_{t-3}, ..., U_1}, where

U_t = {E_{ra,t}, E_{wait,t}, E_{dt,t}, N_{wait,t}, N_{comm,t}, N_{fail,t}},

in which E_{ra,t} denotes the random-access energy consumption, E_{wait,t} the device waiting energy consumption, E_{dt,t} the data-transmission energy consumption, N_{wait,t} the number of waiting devices, N_{comm,t} the number of communicating devices, and N_{fail,t} the number of devices whose access failed.

For the action set, the ratio of the number of devices allowed to initiate random access in each TTI to the total number of active devices in the current TTI is taken as the base-station action, and, following a Markov process over a finite action set, the base-station action in any t-th TTI is defined as a_t ∈ {0.2, 0.4, 0.6, 0.8, 1.0}.
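As a concrete, non-normative illustration of these definitions, the per-TTI observation U_t and the mapping from an action ratio to the number of admitted devices might be encoded as follows; the field names, the rounding rule, and the example numbers are assumptions made only for this sketch.

from typing import NamedTuple

class Observation(NamedTuple):
    """Per-TTI observation U_t from step S1 (field names are illustrative assumptions)."""
    e_ra: float    # random-access energy consumption
    e_wait: float  # device waiting energy consumption
    e_dt: float    # data-transmission energy consumption
    n_wait: int    # number of waiting devices
    n_comm: int    # number of devices that communicated successfully
    n_fail: int    # number of devices whose access failed

ACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)  # action set: admitted fraction of active devices

def devices_admitted(action: float, active_devices: int) -> int:
    """Number of devices allowed to initiate random access in this TTI (rounded)."""
    return round(action * active_devices)

# Example: with 50 active devices, action 0.4 admits 20 of them.
print(devices_admitted(0.4, 50))  # 20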

Further, in step S2, the base station's state and Q-value matrix are initialized to zero. The goal of solving the base station's Markov decision process is to find an optimal policy π* such that the value V(s) of every state s is maximized simultaneously. The state-value function is expressed as follows:

V_π(s_t) = r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V_π(s_{t+1})

where r(s_t, a_t) denotes the reward the base station obtains from the environment, and p(s_{t+1}|s_t, a_t) denotes the probability that the base station, in state s_t, transitions to state s_{t+1} after selecting action a_t.

Further, in step S4, the goal of the base station is to obtain a high reward, so in each state it will select the action with the higher Q-value. In the initial stage of learning, however, there is little state-action experience and the Q-values cannot yet represent the true optimal values. Always taking the action with the highest Q-value makes the base station follow the same path without exploring other, possibly better, actions, so it easily falls into a local optimum. The ε-greedy policy is therefore introduced, whose main principle is as follows:

a_t = argmax_{a∈A} Q(s_t, a) with probability 1 − ε; a random action a ∈ A with probability ε

That is, the agent selects a random action with probability ε and selects the action that maximizes the Q-value with probability 1 − ε.
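A minimal sketch of this selection rule, assuming the Q-values of the current state are stored as a plain list indexed by action:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action index for the current state.

    q_values: list of Q(s_t, a) for each action a in the action set.
    With probability epsilon a random action is explored; otherwise the
    action maximising the Q-value is exploited (argmax_a Q(s_t, a)).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)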

Further, in step S5, after executing the selected action the base station obtains a reward from the environment; the reward function is defined as:

[reward function r_t (given as an image in the original document)]

where the quantity shown in the original as an image denotes the number of served devices, N denotes the total number of transmitting devices, T denotes the number of TTIs, and E_t denotes the total system energy consumption in the t-th TTI.

where

[formula given as an image in the original document]

with n_t the number of devices allowed to access in the current TTI, r the repetition number, μ the transmission data resources, Q the total uplink resources, and m_i the number of preambles.

E_t = E_{sy,t} + E_{ra,t} + E_{wait,t} + E_{dt,t}

where E_{sy,t} denotes the synchronization energy consumption, E_{ra,t} the random-access energy consumption, E_{wait,t} the device waiting energy consumption, and E_{dt,t} the data-transmission energy consumption.
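The total per-TTI energy E_t used in the reward can be accumulated directly from the four components listed above; the numeric values in the example are placeholders, since the patent gives no concrete energy figures here.

def tti_energy(e_sy, e_ra, e_wait, e_dt):
    """Total system energy in TTI t: E_t = E_sy,t + E_ra,t + E_wait,t + E_dt,t."""
    return e_sy + e_ra + e_wait + e_dt

# Example with placeholder per-component energies (in joules) for one TTI:
e_t = tti_energy(e_sy=1.0, e_ra=2.0, e_wait=0.5, e_dt=1.5)
print(e_t)  # 5.0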

Further, in step S6, after obtaining the reward from the environment, the base station updates the Q matrix; the update formula is:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r(s_t, a_t) + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α denotes the learning rate with 0 < α < 1 and γ denotes the discount factor with 0 ≤ γ < 1. The learning rate and the discount factor act together to regulate the update of the Q matrix and hence the learning performance of the Q-learning algorithm; α is set to 0.01 and γ to 0.8.
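The update in step S6 is the standard tabular Q-learning rule; a sketch with the stated values α = 0.01 and γ = 0.8 follows. The dictionary-of-lists layout of the Q table and the numbers in the worked example are assumptions of the sketch.

def q_update(q, state, action, reward, next_state, alpha=0.01, gamma=0.8):
    """One tabular Q-learning step:
    Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)).
    q maps each state to a list holding one value per action."""
    td_target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (td_target - q[state][action])

# Worked example: Q(s0, a0) moves from 0 towards the TD target 1.0 + 0.8 * 2.0 = 2.6.
q = {"s0": [0.0, 0.0], "s1": [2.0, 1.0]}
q_update(q, "s0", 0, reward=1.0, next_state="s1")
print(round(q["s0"][0], 6))  # 0.026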

The beneficial effect of the present invention is that, through the Q-learning algorithm, the system energy consumption can be reduced while the device throughput is guaranteed.

Description of the drawings

Figure 1 is a model of the interaction between Q-learning and the environment;

Figure 2 shows the steps of the Q-learning-based energy-saving algorithm;

Figure 3 is a flowchart of NB-IoT uplink communication.

Detailed description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Aiming at the energy-consumption problem of the NB-IoT system, the present invention proposes a Q-learning-based energy-saving method in NB-IoT. Compared with traditional optimization algorithms, the Q-learning algorithm in the present invention can dynamically optimize the transmitting devices, and the base station can flexibly adjust the number of access devices according to the real-time network situation. The process is shown in Figure 1: first, in a given state, the base station performs an action based on the ε-greedy policy according to the current environment; it then observes the environment to obtain the reward, updates the Q-function value according to the update formula, determines the action for the next state, and repeats these steps until convergence.

The specific algorithm steps are shown in Figure 2. In the iterative process of the Q-learning algorithm, the state set is defined as S; if the decision time is t, then s_t ∈ S denotes the state of the base station at time t. Similarly, the finite set of actions the base station may perform is defined as A, and a_t ∈ A denotes the base station's action at time t. The reward function r(s_t, a_t) denotes the reward obtained from the environment after the base station performs action a_t in state s_t; the base station then transitions from state s_t to s_{t+1}, and at the next decision time t+1 the Q function is updated. These steps are repeated until the iteration ends.

The NB-IoT uplink transmission procedure is shown in Figure 3. First, the base station sends the Narrowband Primary Synchronization Signal (NPSS) and the Narrowband Secondary Synchronization Signal (NSSS) to the devices so that the devices synchronize in time and frequency with the cell; this is the synchronization process. A device then sends an access request on the NPRACH; after receiving the request, the base station responds on the Narrowband Physical Downlink Control Channel (NPDCCH), and a connection between the base station and the device is established. After the connection is established, the base station sends the scheduling information, and data are transmitted on the NPUSCH.
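The message sequence described above (NPSS/NSSS synchronization, access request on the NPRACH, response on the NPDCCH, data on the NPUSCH) can be summarized as an ordered list of phases; this is only a descriptive sketch of the flow in Figure 3, not an implementation of the 3GPP procedures.

from enum import Enum

class UplinkPhase(Enum):
    SYNC = "NPSS/NSSS synchronisation: the device aligns time and frequency with the cell"
    RANDOM_ACCESS = "access request sent by the device on the NPRACH"
    CONTROL_RESPONSE = "base-station response on the NPDCCH; the connection is established"
    DATA_TRANSFER = "scheduling and data transmission on the NPUSCH"

def uplink_flow():
    """Yield the phases of the uplink procedure in the order shown in Figure 3."""
    yield from (UplinkPhase.SYNC, UplinkPhase.RANDOM_ACCESS,
                UplinkPhase.CONTROL_RESPONSE, UplinkPhase.DATA_TRANSFER)

for phase in uplink_flow():
    print(phase.name, "-", phase.value)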

The Q-learning algorithm is in fact a variant of the Markov Decision Process (MDP). In the energy-saving algorithm for NB-IoT, based on the working principle of the Q-learning algorithm, the state set is expressed as follows:

S_t = {U_{t-1}, U_{t-2}, U_{t-3}, ..., U_1}, where

U_t = {E_{ra,t}, E_{wait,t}, E_{dt,t}, N_{wait,t}, N_{comm,t}, N_{fail,t}},

in which E_{ra,t} denotes the random-access energy consumption, E_{wait,t} the device waiting energy consumption, E_{dt,t} the data-transmission energy consumption, N_{wait,t} the number of waiting devices, N_{comm,t} the number of communicating devices, and N_{fail,t} the number of devices whose access failed.

Taking the ratio of the number of devices allowed to initiate random access in each TTI to the total number of active devices in the current TTI as the base-station action, the base station's action set is A = {a(1), a(2), ..., a(k)}. Following a Markov process over a finite action set, the base-station action in any t-th TTI is defined as a_t ∈ {0.2, 0.4, 0.6, 0.8, 1.0}.

The task facing the base station is to determine an optimal policy that maximizes the reward obtained. Based on the current state and environment, the base station makes the best decision about the next state/action. The discounted cumulative reward function of state s_t can be expressed as:

V_π(s_t) = r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V_π(s_{t+1})

where r(s_t, a_t) denotes the instant reward obtained when the base station selects action a_t in state s_t, and γ denotes the discount factor with 0 ≤ γ < 1; a discount factor close to 0 means the base station mainly considers immediate rewards. p(s_{t+1}|s_t, a_t) denotes the probability of transitioning from state s_t to s_{t+1} when the base station selects action a_t. The goal of solving the MDP is to find an optimal policy π* such that the value V(s) of every state s is maximized simultaneously. According to Bellman's principle, when the total discounted expected reward of the base station is maximal, there exists at least one optimal policy π* such that:

V*(s_t) = max_{a_t∈A} [ r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V*(s_{t+1}) ]

where V*(s_t) denotes the maximum discounted cumulative reward obtained by the base station starting from state s_t and following the optimal policy π*. A given policy π is a function that maps the state space to the action space, i.e. π: s_t → a_t. The optimal policy can therefore be expressed as:

π*(s_t) = argmax_{a_t∈A} [ r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V*(s_{t+1}) ]

The goal of the base station is to obtain a high reward, so in each state it will select the action with the higher Q-value. In the initial stage of learning, however, there is little state-action experience and the Q-values cannot yet represent the true optimal values. Always taking the action with the highest Q-value makes the base station follow the same path without exploring other, possibly better, actions, so it easily falls into a local optimum. To overcome this drawback, the base station must sometimes select actions at random; the ε-greedy policy is therefore introduced, which reduces the chance that the base station's action-selection strategy becomes stuck in a locally optimal solution.

a_t = argmax_{a∈A} Q(s_t, a) with probability 1 − ε; a random action a ∈ A with probability ε

That is, the agent selects a random action with probability ε and selects the action that maximizes the Q-value with probability 1 − ε.

Further, in step S5, after executing the selected action the base station obtains a reward from the environment; the reward function is defined as:

[reward function r_t (given as an image in the original document)]

where the quantity shown in the original as an image denotes the number of served devices, N denotes the total number of transmitting devices, T denotes the number of TTIs, and E_t denotes the total system energy consumption in the t-th TTI.

where

[formula given as an image in the original document]

with n_t the number of devices allowed to access in the current TTI, r the repetition number, μ the transmission data resources, Q the total uplink resources, and m_i the number of preambles.

E_t = E_{sy,t} + E_{ra,t} + E_{wait,t} + E_{dt,t}.

where E_{sy,t} denotes the synchronization energy consumption, E_{ra,t} the random-access energy consumption, E_{wait,t} the device waiting energy consumption, and E_{dt,t} the data-transmission energy consumption.

In the Q-learning algorithm, under a policy π the base station computes the Q-value function recursively in each TTI as follows:

Q_π(s_t, a_t) = r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V_π(s_{t+1})

Clearly, the Q-value represents the expected discounted reward obtained when the base station, in state s_t, performs action a_t and thereafter follows policy π. The goal is therefore to evaluate the Q-values under the optimal policy π*. From the above expression, the relationship between the state-value function and the action-value function is as follows:

V*(s_t) = max_{a_t∈A} Q*(s_t, a_t)

However, in a non-deterministic environment the above Q-value function holds only under the optimal policy; that is, under a non-optimal policy the values of the Q function keep changing during Q-learning (i.e. do not converge). The formula for computing the Q value is therefore modified as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r(s_t, a_t) + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α denotes the learning rate with 0 < α < 1; the larger the learning rate, the less of the previous training is retained. If each state-action pair is visited repeatedly and the learning rate decreases according to an appropriate schedule, the Q-learning algorithm converges to the optimal policy for any finite MDP. γ denotes the discount factor with 0 ≤ γ < 1 and expresses how much weight is placed on future rewards: a higher γ captures long-term rewards, while a lower γ makes the agent focus more on immediate rewards. The learning rate and the discount factor act together to regulate the update of the Q matrix and hence the learning performance of the Q-learning algorithm; α is set to 0.01 and γ to 0.8.
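The remark that the learning rate should decrease according to an appropriate schedule when each state-action pair is visited repeatedly is commonly realized with a per-pair visit-count decay; the schedule below is one conventional choice and is not prescribed by the patent, which fixes α = 0.01.

from collections import defaultdict

visit_count = defaultdict(int)  # number of visits of each (state, action) pair

def decayed_alpha(state, action, alpha0=0.1):
    """Learning rate alpha0 / n, where n is the visit count of (state, action);
    such decreasing schedules are the usual condition for tabular Q-learning to converge."""
    visit_count[(state, action)] += 1
    return alpha0 / visit_count[(state, action)]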

Finally, it should be noted that the above preferred embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the present invention.

Claims (1)

  1. A Q-learning-based energy-saving method in NB-IoT, the method comprising the following steps:
    S1: defining a state set and an action set for the base station, the state set being defined as a series of previously observed information, i.e. S_t = {U_{t-1}, U_{t-2}, U_{t-3}, ..., U_1}, where U_t = {E_{ra,t}, E_{wait,t}, E_{dt,t}, N_{wait,t}, N_{comm,t}, N_{fail,t}}, in which E_{ra,t} represents the random-access energy consumption, E_{wait,t} the device waiting energy consumption, E_{dt,t} the data-transmission energy consumption, N_{wait,t} the number of waiting devices, N_{comm,t} the number of communicating devices, and N_{fail,t} the number of devices whose access failed; and the action set is defined as the proportion of the number of devices allowed to initiate random access in each TTI to the total number of active devices in the current TTI;
    S2: setting the state and action Q-values of the base station to a zero matrix at time t = 0;
    S3: selecting an action a_t(i) according to the ε-greedy method: in the initial stage of learning there is little state-action experience and the Q-values cannot accurately represent the correct optimal values; always taking the action with the highest Q-value causes the base station to follow the same path without searching other, better values, so that it easily falls into a local optimum; the ε-greedy policy is therefore introduced, in which the agent selects a random action with probability ε and selects the action that maximizes the Q-value with probability 1 − ε, i.e. a_t = argmax_{a∈A} Q(s_t, a) with probability 1 − ε, and a random action a ∈ A with probability ε;
    S4: after performing action a_t(i), the system obtains the environment reward R_t according to the reward formula and then enters the next state s_{t+1}; the reward function is defined as [formula given as an image in the original], where [the quantity shown as an image] represents the number of served devices, N represents the total number of transmitting devices, T the number of TTIs, and E_t the total system energy consumption in the t-th TTI; [formula given as an image in the original], where n_t represents the number of devices allowed to access in the current TTI, r the repetition number, μ the transmission data resources, Q the total uplink resources, and m_i the number of preambles; E_t = E_{sy,t} + E_{ra,t} + E_{wait,t} + E_{dt,t}, where E_{sy,t} represents the synchronization energy consumption, E_{ra,t} the random-access energy consumption, E_{wait,t} the device waiting energy consumption, and E_{dt,t} the data-transmission energy consumption;
    S5: updating the base station's action Q-value function according to the formula: the Q-matrix update formula is Q(s_t, a_t) ← Q(s_t, a_t) + α·[r(s_t, a_t) + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where r(s_t, a_t) is the reward obtained when the agent performs action a_t in state s_t, α represents the learning rate with 0 < α < 1, and γ represents the discount factor with 0 ≤ γ < 1; the learning rate and the discount factor act together to regulate the update of the Q matrix and hence the learning performance of the Q algorithm, with α = 0.01 and γ = 0.8;
    S6: t ← t + 1, go to step S2.
CN202110074159.8A 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT Active CN112867117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074159.8A CN112867117B (en) 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110074159.8A CN112867117B (en) 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT

Publications (2)

Publication Number Publication Date
CN112867117A CN112867117A (en) 2021-05-28
CN112867117B true CN112867117B (en) 2022-04-12

Family

ID=76007591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074159.8A Active CN112867117B (en) 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT

Country Status (1)

Country Link
CN (1) CN112867117B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567920B (en) * 2022-02-23 2023-05-23 重庆邮电大学 A Hybrid Discontinuous Reception Method for MTC Devices with Strategy Optimization
CN114727423B (en) * 2022-04-02 2024-11-29 北京邮电大学 Personalized access method in GF-NOMA system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110809274A (en) * 2019-10-28 2020-02-18 南京邮电大学 An enhanced network optimization method for UAV base stations for narrowband Internet of Things
CN110856234A (en) * 2019-11-20 2020-02-28 廊坊新奥燃气设备有限公司 Energy-saving method and system for NB-IoT meter based on PSM access mode
CN111970703A (en) * 2020-06-24 2020-11-20 重庆邮电大学 Method for optimizing uplink communication resources in NB-IoT (NB-IoT)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3424243B1 (en) * 2016-03-01 2019-12-25 Telefonaktiebolaget LM Ericsson (PUBL) Energy efficient operation of radio network nodes and wireless communication devices in nb-iot

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110809274A (en) * 2019-10-28 2020-02-18 南京邮电大学 An enhanced network optimization method for UAV base stations for narrowband Internet of Things
CN110856234A (en) * 2019-11-20 2020-02-28 廊坊新奥燃气设备有限公司 Energy-saving method and system for NB-IoT meter based on PSM access mode
CN111970703A (en) * 2020-06-24 2020-11-20 重庆邮电大学 Method for optimizing uplink communication resources in NB-IoT (NB-IoT)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Introduction of NB-IoT";Huawei;《3GPP TSG-RAN WG2 NB-IOT Ad-hoc#2 R2-163218》;20160429;全文 *
Energy-efficient joint power control and resource allocation for cluster-based NB-IoT cellular networks; Zhu Shuqiong, Wu Wenquan, Feng Lei, et al.; Transactions on Emerging Telecommunications Technologies; 2017-12-27; full text *
Theoretical and experimental research on enhancing the magnetoelectric effect of specially shaped magnetoelectric composite materials; 张茹; China Doctoral Dissertations Electronic Journals Database; 2019-01-15; full text *
Research on optimized resource allocation in cognitive radio networks; 裴二荣; China Doctoral Dissertations Electronic Journals Database; 2012-12-15; full text *

Also Published As

Publication number Publication date
CN112867117A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
Chen et al. Robust computation offloading and resource scheduling in cloudlet-based mobile cloud computing
Min et al. Learning-based computation offloading for IoT devices with energy harvesting
CN112867117B (en) A Q-learning-based energy-saving method in NB-IoT
CN112084025B (en) A fog computing task offloading delay optimization method based on improved particle swarm optimization
CN109194763B (en) A caching method based on self-organized cooperation of small base stations in ultra-dense networks
Chen et al. Delay guaranteed energy-efficient computation offloading for industrial IoT in fog computing
CN113115339B (en) Task unloading and resource allocation joint optimization method based on mobility awareness
CN113490184A (en) Smart factory-oriented random access resource optimization method and device
CN116260871A (en) Independent task unloading method based on local and edge collaborative caching
CN107820309B (en) Wake-up strategy and time slot optimization algorithm for low-power-consumption communication equipment
CN108282821B (en) A packet-based congestion control access method for giant connection in Internet of Things communication
Dai et al. Deep reinforcement learning for edge computing and resource allocation in 5G beyond
Yan et al. Energy-efficient content fetching strategies in cache-enabled D2D networks via an Actor-Critic reinforcement learning structure
Zhang et al. Computation cost-driven offloading strategy based on reinforcement learning for consumer devices
CN106878958A (en) Fast Propagation Method Based on Adjustable Duty Cycle in Software Custom Wireless Networks
CN112445617B (en) Load strategy selection method and system based on mobile edge calculation
Abbas et al. An efficient partial task offloading and resource allocation scheme for vehicular edge computing in a dynamic environment
Li et al. A lightweight transmission parameter selection scheme using reinforcement learning for LoRaWAN
CN105916197B (en) The power adaptive method that social credibility drives in D2D network
Chen et al. Liquid state based transfer learning for 360 image transmission in wireless VR networks
CN117610644A (en) Federal learning optimization method based on block chain
CN114786275B (en) A data transmission method and device for Internet of Things gateway
CN116542319A (en) Self-adaptive federation learning method and system based on digital twin in edge computing environment
Tong et al. Dynamic user-centric multi-dimensional resource allocation for a wide-area coverage signaling cell based on DQN
US20250031230A1 (en) Methods and apparatuses for user equipment selecting and scheduling in intelligent wireless system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230324

Address after: 401336 Yuen Road, Nanan District, Chongqing City, No. 8

Patentee after: CHINA MOBILE IOT Co.,Ltd.

Address before: 400065 No. 2, Chongwen Road, Nan'an District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

TR01 Transfer of patent right