CN112867117B - A Q-learning-based energy-saving method in NB-IoT - Google Patents


Info

Publication number
CN112867117B
Authority
CN
China
Prior art keywords
devices
base station
action
energy consumption
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110074159.8A
Other languages
Chinese (zh)
Other versions
CN112867117A (en)
Inventor
裴二荣
王振民
朱冰冰
张茹
杨光财
荆玉琪
周礼能
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile IoT Co Ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110074159.8A priority Critical patent/CN112867117B/en
Publication of CN112867117A publication Critical patent/CN112867117A/en
Application granted granted Critical
Publication of CN112867117B publication Critical patent/CN112867117B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. Transmission Power Control [TPC] or power classes
    • H04W52/02Power saving arrangements
    • H04W52/0203Power saving arrangements in the radio access network or backbone network of wireless communication networks
    • H04W52/0206Power saving arrangements in the radio access network or backbone network of wireless communication networks in access points, e.g. base stations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W74/00Wireless channel access
    • H04W74/08Non-scheduled access, e.g. ALOHA
    • H04W74/0833Random access procedures, e.g. with 4-step access
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a Q-learning-based energy-saving method in NB-IoT and belongs to the technical field of communications. In this method, the base station dynamically controls the number of devices that initiate random access in each transmission time interval according to parameters such as the network load, the repetition number, and the transmission data resources, thereby reducing the number of devices that collide during the random-access procedure and, with it, the total random-access energy consumption. The base station acts as the agent: its action is defined as the ratio of the number of devices allowed to initiate random access to the total number of active devices, and its state consists of a set of previously observed information such as the number of devices that communicated successfully and the energy consumption. The invention reduces system energy consumption and prolongs device lifetime while guaranteeing device throughput.

Description

A Q-learning-based energy-saving method in NB-IoT

Technical field

The invention belongs to the field of communication technologies and relates to a Q-learning-based energy-saving method in NB-IoT.

Background

Over the past few decades, since the start of the industrial revolution, humanity has made tremendous progress. In recent years, with the rapid development of fifth-generation (5G) mobile communications, the Internet of Things (IoT), and mobile computing, Low Power Wide Area Networks (LPWAN) have attracted increasing attention. Narrowband IoT (NB-IoT), which is based on cellular communication technology, is a promising LPWAN technology. NB-IoT requires a minimum system bandwidth of 180 kHz for both the downlink and the uplink and is a new 3rd Generation Partnership Project (3GPP) radio access technology characterized by low power consumption, narrow bandwidth, strong coverage, low cost, and massive connectivity. NB-IoT can therefore be widely applied in scenarios with poor channel conditions (such as underground parking lots) or with delay-tolerant devices (such as water meters).

NB-IoT data communication mainly takes place on the uplink, whose resources mainly comprise the Narrowband Physical Random Access Channel (NPRACH) and the Narrowband Physical Uplink Shared Channel (NPUSCH). The NPRACH is mainly used to start the random-access procedure, while the NPUSCH carries the data transmission from the device to the base station. NB-IoT mainly targets periodic devices with high delay tolerance (such as water meters and electricity meters); the problem considered here is how to complete device communication with low energy consumption while still meeting the devices' throughput requirements.

At present, existing energy-saving mechanisms lack a dynamic learning process; they mainly optimize each device's back-off time and the extended discontinuous reception mechanism to trade device energy consumption off against delay. In practice, however, many devices often transmit data at the same time. By scheduling devices sensibly and controlling the number of devices that access the network in each Transmission Time Interval (TTI), the total communication energy consumption can be optimized while throughput is still guaranteed. A scheduling mechanism based on the Q-learning algorithm is therefore designed, in which the base station allows an appropriate number of devices to perform random access in each TTI according to parameters such as the network load, the amount of uplink resources, the data size, and the repetition number, reducing system energy consumption and extending device lifetime while guaranteeing throughput.

Summary of the invention

In view of this, the purpose of the present invention is to provide a Q-learning-based energy-saving method in NB-IoT. Using the Q-learning algorithm, the base station flexibly adjusts the number of devices that initiate access requests in each TTI according to factors such as the network load, the delay, the amount of uplink resources, the preamble size, and the repetition number, saving system energy and prolonging device lifetime while guaranteeing throughput. The method is simple and efficient and, at the same time, has a certain degree of portability.

To achieve the above object, the present invention provides the following technical solution:

A Q-learning-based energy-saving method in NB-IoT, comprising the following steps (a code sketch of this loop is given after the step list):

S1: define the state set and the action set of the base station;

S2: at time t = 0, initialize the state of the base station and the action Q-values to 0;

S3: compute the state value of the base station's initial state s_t;

S4: select an action a_t(i) according to the ε-greedy policy;

S5: after executing action a_t(i), the system obtains the environment reward r_t according to the reward-function formula and then enters the next state s_{t+1};

S6: update the base station's action Q-value function according to the update formula;

S7: t ← t + 1, and return to step S2.
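As an informal illustration (not part of the claimed method), the loop in steps S1 to S7 can be sketched in Python as follows. The simulator object env, its reset()/step() interface, the exploration rate of 0.1, and the number of iterations are assumptions introduced only for the sketch; the action set and the values α = 0.01 and γ = 0.8 are taken from the description below.

import random
from collections import defaultdict

# Minimal sketch of the S1-S7 loop. The action set is the one defined in step S1
# (fraction of active devices allowed to initiate random access in a TTI).
ACTIONS = [0.2, 0.4, 0.6, 0.8, 1.0]
ALPHA, GAMMA, EPSILON = 0.01, 0.8, 0.1  # learning rate, discount factor, exploration rate

def train(env, num_ttis=10000):
    """env is an assumed NB-IoT simulator exposing reset() and step(action)."""
    q = defaultdict(lambda: [0.0] * len(ACTIONS))  # S2: Q-values initialised to 0
    state = env.reset()                            # S3: observe the initial state
    for _ in range(num_ttis):                      # S7: advance one TTI per iteration
        # S4: epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        # S5: execute the action, observe the reward and the next state
        next_state, reward = env.step(ACTIONS[a])
        # S6: update the action-value function
        q[state][a] += ALPHA * (reward + GAMMA * max(q[next_state]) - q[state][a])
        state = next_state
    return q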

Further, in step S1, the state set of the base station is represented as a series of previously observed information, namely S_t = {U_{t-1}, U_{t-2}, U_{t-3}, ..., U_1}, where

U_t = {E_{ra,t}, E_{wait,t}, E_{dt,t}, N_{wait,t}, N_{comm,t}, N_{fail,t}},

in which E_{ra,t} denotes the random-access energy consumption, E_{wait,t} the device waiting energy consumption, E_{dt,t} the data-transmission energy consumption, N_{wait,t} the number of waiting devices, N_{comm,t} the number of communicating devices, and N_{fail,t} the number of devices whose access failed.

For the action set, the ratio of the number of devices allowed to initiate random access in each TTI to the total number of active devices in the current TTI is taken as the base-station action, and, following a Markov process over a finite action set, the base-station action in any t-th TTI is defined as a_t ∈ {0.2, 0.4, 0.6, 0.8, 1.0}.
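As a concrete, non-normative illustration of these definitions, the per-TTI observation U_t and the mapping from an action ratio to the number of admitted devices might be encoded as follows; the field names, the rounding rule, and the example numbers are assumptions made only for this sketch.

from typing import NamedTuple

class Observation(NamedTuple):
    """Per-TTI observation U_t from step S1 (field names are illustrative assumptions)."""
    e_ra: float    # random-access energy consumption
    e_wait: float  # device waiting energy consumption
    e_dt: float    # data-transmission energy consumption
    n_wait: int    # number of waiting devices
    n_comm: int    # number of devices that communicated successfully
    n_fail: int    # number of devices whose access failed

ACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)  # action set: admitted fraction of active devices

def devices_admitted(action: float, active_devices: int) -> int:
    """Number of devices allowed to initiate random access in this TTI (rounded)."""
    return round(action * active_devices)

# Example: with 50 active devices, action 0.4 admits 20 of them.
print(devices_admitted(0.4, 50))  # 20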

Further, in step S2, the base station's state and Q-value matrix are initialized to zero. The goal of solving the base station's Markov decision process is to find an optimal policy π* such that the value V(s) of every state s is maximized simultaneously. The state-value function is expressed as follows:

V_π(s_t) = r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V_π(s_{t+1})

where r(s_t, a_t) denotes the reward the base station obtains from the environment, and p(s_{t+1}|s_t, a_t) denotes the probability that the base station, in state s_t, transitions to state s_{t+1} after selecting action a_t.

Further, in step S4, the goal of the base station is to obtain a high reward, so in each state it will select the action with the higher Q-value. In the initial stage of learning, however, there is little state-action experience and the Q-values cannot yet represent the true optimal values. Always taking the action with the highest Q-value makes the base station follow the same path without exploring other, possibly better, actions, so it easily falls into a local optimum. The ε-greedy policy is therefore introduced, whose main principle is as follows:

a_t = argmax_{a∈A} Q(s_t, a) with probability 1 − ε; a random action a ∈ A with probability ε

That is, the agent selects a random action with probability ε and selects the action that maximizes the Q-value with probability 1 − ε.
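A minimal sketch of this selection rule, assuming the Q-values of the current state are stored as a plain list indexed by action:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action index for the current state.

    q_values: list of Q(s_t, a) for each action a in the action set.
    With probability epsilon a random action is explored; otherwise the
    action maximising the Q-value is exploited (argmax_a Q(s_t, a)).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)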

Further, in step S5, after executing the selected action the base station obtains a reward from the environment; the reward function is defined as:

[reward function r_t (given as an image in the original document)]

where the quantity shown in the original as an image denotes the number of served devices, N denotes the total number of transmitting devices, T denotes the number of TTIs, and E_t denotes the total system energy consumption in the t-th TTI.

where

[formula given as an image in the original document]

with n_t the number of devices allowed to access in the current TTI, r the repetition number, μ the transmission data resources, Q the total uplink resources, and m_i the number of preambles.

E_t = E_{sy,t} + E_{ra,t} + E_{wait,t} + E_{dt,t}

where E_{sy,t} denotes the synchronization energy consumption, E_{ra,t} the random-access energy consumption, E_{wait,t} the device waiting energy consumption, and E_{dt,t} the data-transmission energy consumption.
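The total per-TTI energy E_t used in the reward can be accumulated directly from the four components listed above; the numeric values in the example are placeholders, since the patent gives no concrete energy figures here.

def tti_energy(e_sy, e_ra, e_wait, e_dt):
    """Total system energy in TTI t: E_t = E_sy,t + E_ra,t + E_wait,t + E_dt,t."""
    return e_sy + e_ra + e_wait + e_dt

# Example with placeholder per-component energies (in joules) for one TTI:
e_t = tti_energy(e_sy=1.0, e_ra=2.0, e_wait=0.5, e_dt=1.5)
print(e_t)  # 5.0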

Further, in step S6, after obtaining the reward from the environment, the base station updates the Q matrix; the update formula is:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r(s_t, a_t) + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α denotes the learning rate with 0 < α < 1 and γ denotes the discount factor with 0 ≤ γ < 1. The learning rate and the discount factor act together to regulate the update of the Q matrix and hence the learning performance of the Q-learning algorithm; α is set to 0.01 and γ to 0.8.
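The update in step S6 is the standard tabular Q-learning rule; a sketch with the stated values α = 0.01 and γ = 0.8 follows. The dictionary-of-lists layout of the Q table and the numbers in the worked example are assumptions of the sketch.

def q_update(q, state, action, reward, next_state, alpha=0.01, gamma=0.8):
    """One tabular Q-learning step:
    Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)).
    q maps each state to a list holding one value per action."""
    td_target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (td_target - q[state][action])

# Worked example: Q(s0, a0) moves from 0 towards the TD target 1.0 + 0.8 * 2.0 = 2.6.
q = {"s0": [0.0, 0.0], "s1": [2.0, 1.0]}
q_update(q, "s0", 0, reward=1.0, next_state="s1")
print(round(q["s0"][0], 6))  # 0.026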

The beneficial effect of the present invention is that, through the Q-learning algorithm, the system energy consumption can be reduced while the device throughput is guaranteed.

Description of the drawings

Figure 1 is a model of the interaction between Q-learning and the environment;

Figure 2 shows the steps of the Q-learning-based energy-saving algorithm;

Figure 3 is a flowchart of NB-IoT uplink communication.

Detailed description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Aiming at the energy-consumption problem of the NB-IoT system, the present invention proposes a Q-learning-based energy-saving method in NB-IoT. Compared with traditional optimization algorithms, the Q-learning algorithm in the present invention can dynamically optimize the transmitting devices, and the base station can flexibly adjust the number of access devices according to the real-time network situation. The process is shown in Figure 1: first, in a given state, the base station performs an action based on the ε-greedy policy according to the current environment; it then observes the environment to obtain the reward, updates the Q-function value according to the update formula, determines the action for the next state, and repeats these steps until convergence.

The specific algorithm steps are shown in Figure 2. In the iterative process of the Q-learning algorithm, the state set is defined as S; if the decision time is t, then s_t ∈ S denotes the state of the base station at time t. Similarly, the finite set of actions the base station may perform is defined as A, and a_t ∈ A denotes the base station's action at time t. The reward function r(s_t, a_t) denotes the reward obtained from the environment after the base station performs action a_t in state s_t; the base station then transitions from state s_t to s_{t+1}, and at the next decision time t+1 the Q function is updated. These steps are repeated until the iteration ends.

The NB-IoT uplink transmission procedure is shown in Figure 3. First, the base station sends the Narrowband Primary Synchronization Signal (NPSS) and the Narrowband Secondary Synchronization Signal (NSSS) to the devices so that the devices synchronize in time and frequency with the cell; this is the synchronization process. A device then sends an access request on the NPRACH; after receiving the request, the base station responds on the Narrowband Physical Downlink Control Channel (NPDCCH), and a connection between the base station and the device is established. After the connection is established, the base station sends the scheduling information, and data are transmitted on the NPUSCH.
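The message sequence described above (NPSS/NSSS synchronization, access request on the NPRACH, response on the NPDCCH, data on the NPUSCH) can be summarized as an ordered list of phases; this is only a descriptive sketch of the flow in Figure 3, not an implementation of the 3GPP procedures.

from enum import Enum

class UplinkPhase(Enum):
    SYNC = "NPSS/NSSS synchronisation: the device aligns time and frequency with the cell"
    RANDOM_ACCESS = "access request sent by the device on the NPRACH"
    CONTROL_RESPONSE = "base-station response on the NPDCCH; the connection is established"
    DATA_TRANSFER = "scheduling and data transmission on the NPUSCH"

def uplink_flow():
    """Yield the phases of the uplink procedure in the order shown in Figure 3."""
    yield from (UplinkPhase.SYNC, UplinkPhase.RANDOM_ACCESS,
                UplinkPhase.CONTROL_RESPONSE, UplinkPhase.DATA_TRANSFER)

for phase in uplink_flow():
    print(phase.name, "-", phase.value)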

The Q-learning algorithm is in fact a variant of the Markov Decision Process (MDP). In the energy-saving algorithm for NB-IoT, based on the working principle of the Q-learning algorithm, the state set is expressed as follows:

S_t = {U_{t-1}, U_{t-2}, U_{t-3}, ..., U_1}, where

U_t = {E_{ra,t}, E_{wait,t}, E_{dt,t}, N_{wait,t}, N_{comm,t}, N_{fail,t}},

in which E_{ra,t} denotes the random-access energy consumption, E_{wait,t} the device waiting energy consumption, E_{dt,t} the data-transmission energy consumption, N_{wait,t} the number of waiting devices, N_{comm,t} the number of communicating devices, and N_{fail,t} the number of devices whose access failed.

Taking the ratio of the number of devices allowed to initiate random access in each TTI to the total number of active devices in the current TTI as the base-station action, the base station's action set is A = {a(1), a(2), ..., a(k)}. Following a Markov process over a finite action set, the base-station action in any t-th TTI is defined as a_t ∈ {0.2, 0.4, 0.6, 0.8, 1.0}.

The task facing the base station is to determine an optimal policy that maximizes the reward obtained. Based on the current state and environment, the base station makes the best decision about the next state/action. The discounted cumulative reward function of state s_t can be expressed as:

V_π(s_t) = r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V_π(s_{t+1})

where r(s_t, a_t) denotes the instant reward obtained when the base station selects action a_t in state s_t, and γ denotes the discount factor with 0 ≤ γ < 1; a discount factor close to 0 means the base station mainly considers immediate rewards. p(s_{t+1}|s_t, a_t) denotes the probability of transitioning from state s_t to s_{t+1} when the base station selects action a_t. The goal of solving the MDP is to find an optimal policy π* such that the value V(s) of every state s is maximized simultaneously. According to Bellman's principle, when the total discounted expected reward of the base station is maximal, there exists at least one optimal policy π* such that:

V*(s_t) = max_{a_t∈A} [ r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V*(s_{t+1}) ]

where V*(s_t) denotes the maximum discounted cumulative reward obtained by the base station starting from state s_t and following the optimal policy π*. A given policy π is a function that maps the state space to the action space, i.e. π: s_t → a_t. The optimal policy can therefore be expressed as:

π*(s_t) = argmax_{a_t∈A} [ r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V*(s_{t+1}) ]

The goal of the base station is to obtain a high reward, so in each state it will select the action with the higher Q-value. In the initial stage of learning, however, there is little state-action experience and the Q-values cannot yet represent the true optimal values. Always taking the action with the highest Q-value makes the base station follow the same path without exploring other, possibly better, actions, so it easily falls into a local optimum. To overcome this drawback, the base station must sometimes select actions at random; the ε-greedy policy is therefore introduced, which reduces the chance that the base station's action-selection strategy becomes stuck in a locally optimal solution.

a_t = argmax_{a∈A} Q(s_t, a) with probability 1 − ε; a random action a ∈ A with probability ε

That is, the agent selects a random action with probability ε and selects the action that maximizes the Q-value with probability 1 − ε.

Further, in step S5, after executing the selected action the base station obtains a reward from the environment; the reward function is defined as:

[reward function r_t (given as an image in the original document)]

where the quantity shown in the original as an image denotes the number of served devices, N denotes the total number of transmitting devices, T denotes the number of TTIs, and E_t denotes the total system energy consumption in the t-th TTI.

where

[formula given as an image in the original document]

with n_t the number of devices allowed to access in the current TTI, r the repetition number, μ the transmission data resources, Q the total uplink resources, and m_i the number of preambles.

E_t = E_{sy,t} + E_{ra,t} + E_{wait,t} + E_{dt,t}.

where E_{sy,t} denotes the synchronization energy consumption, E_{ra,t} the random-access energy consumption, E_{wait,t} the device waiting energy consumption, and E_{dt,t} the data-transmission energy consumption.

In the Q-learning algorithm, under a policy π the base station computes the Q-value function recursively in each TTI as follows:

Q_π(s_t, a_t) = r(s_t, a_t) + γ·Σ_{s_{t+1}∈S} p(s_{t+1}|s_t, a_t)·V_π(s_{t+1})

Clearly, the Q-value represents the expected discounted reward obtained when the base station, in state s_t, performs action a_t and thereafter follows policy π. The goal is therefore to evaluate the Q-values under the optimal policy π*. From the above expression, the relationship between the state-value function and the action-value function is as follows:

V*(s_t) = max_{a_t∈A} Q*(s_t, a_t)

However, in a non-deterministic environment the above Q-value function holds only under the optimal policy; that is, under a non-optimal policy the values of the Q function keep changing during Q-learning (i.e. do not converge). The formula for computing the Q value is therefore modified as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α·[r(s_t, a_t) + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α denotes the learning rate with 0 < α < 1; the larger the learning rate, the less of the previous training is retained. If each state-action pair is visited repeatedly and the learning rate decreases according to an appropriate schedule, the Q-learning algorithm converges to the optimal policy for any finite MDP. γ denotes the discount factor with 0 ≤ γ < 1 and expresses how much weight is placed on future rewards: a higher γ captures long-term rewards, while a lower γ makes the agent focus more on immediate rewards. The learning rate and the discount factor act together to regulate the update of the Q matrix and hence the learning performance of the Q-learning algorithm; α is set to 0.01 and γ to 0.8.
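The remark that the learning rate should decrease according to an appropriate schedule when each state-action pair is visited repeatedly is commonly realized with a per-pair visit-count decay; the schedule below is one conventional choice and is not prescribed by the patent, which fixes α = 0.01.

from collections import defaultdict

visit_count = defaultdict(int)  # number of visits of each (state, action) pair

def decayed_alpha(state, action, alpha0=0.1):
    """Learning rate alpha0 / n, where n is the visit count of (state, action);
    such decreasing schedules are the usual condition for tabular Q-learning to converge."""
    visit_count[(state, action)] += 1
    return alpha0 / visit_count[(state, action)]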

Finally, it should be noted that the above preferred embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the present invention.

Claims (1)

  1. A Q-learning-based energy-saving method in NB-IoT, the method comprising the following steps:
    S1: defining a state set and an action set for the base station, the state set being defined as a series of previously observed information, i.e. S_t = {U_{t-1}, U_{t-2}, U_{t-3}, ..., U_1}, where U_t = {E_{ra,t}, E_{wait,t}, E_{dt,t}, N_{wait,t}, N_{comm,t}, N_{fail,t}}, in which E_{ra,t} represents the random-access energy consumption, E_{wait,t} the device waiting energy consumption, E_{dt,t} the data-transmission energy consumption, N_{wait,t} the number of waiting devices, N_{comm,t} the number of communicating devices, and N_{fail,t} the number of devices whose access failed; and the action set is defined as the proportion of the number of devices allowed to initiate random access in each TTI to the total number of active devices in the current TTI;
    S2: setting the state and action Q-values of the base station to a zero matrix at time t = 0;
    S3: selecting an action a_t(i) according to the ε-greedy method: in the initial stage of learning there is little state-action experience and the Q-values cannot accurately represent the correct optimal values; always taking the action with the highest Q-value causes the base station to follow the same path without searching other, better values, so that it easily falls into a local optimum; the ε-greedy policy is therefore introduced, in which the agent selects a random action with probability ε and selects the action that maximizes the Q-value with probability 1 − ε, i.e. a_t = argmax_{a∈A} Q(s_t, a) with probability 1 − ε, and a random action a ∈ A with probability ε;
    S4: after performing action a_t(i), the system obtains the environment reward R_t according to the reward formula and then enters the next state s_{t+1}; the reward function is defined as [formula given as an image in the original], where [the quantity shown as an image] represents the number of served devices, N represents the total number of transmitting devices, T the number of TTIs, and E_t the total system energy consumption in the t-th TTI; [formula given as an image in the original], where n_t represents the number of devices allowed to access in the current TTI, r the repetition number, μ the transmission data resources, Q the total uplink resources, and m_i the number of preambles; E_t = E_{sy,t} + E_{ra,t} + E_{wait,t} + E_{dt,t}, where E_{sy,t} represents the synchronization energy consumption, E_{ra,t} the random-access energy consumption, E_{wait,t} the device waiting energy consumption, and E_{dt,t} the data-transmission energy consumption;
    S5: updating the base station's action Q-value function according to the formula: the Q-matrix update formula is Q(s_t, a_t) ← Q(s_t, a_t) + α·[r(s_t, a_t) + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where r(s_t, a_t) is the reward obtained when the agent performs action a_t in state s_t, α represents the learning rate with 0 < α < 1, and γ represents the discount factor with 0 ≤ γ < 1; the learning rate and the discount factor act together to regulate the update of the Q matrix and hence the learning performance of the Q algorithm, with α = 0.01 and γ = 0.8;
    S6: t ← t + 1, go to step S2.
CN202110074159.8A 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT Active CN112867117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110074159.8A CN112867117B (en) 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110074159.8A CN112867117B (en) 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT

Publications (2)

Publication Number Publication Date
CN112867117A CN112867117A (en) 2021-05-28
CN112867117B true CN112867117B (en) 2022-04-12

Family

ID=76007591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110074159.8A Active CN112867117B (en) 2021-01-20 2021-01-20 A Q-learning-based energy-saving method in NB-IoT

Country Status (1)

Country Link
CN (1) CN112867117B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567920B (en) * 2022-02-23 2023-05-23 重庆邮电大学 A Hybrid Discontinuous Reception Method for MTC Devices with Strategy Optimization
CN114727423B (en) * 2022-04-02 2024-11-29 北京邮电大学 Personalized access method in GF-NOMA system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110809274A (en) * 2019-10-28 2020-02-18 南京邮电大学 An enhanced network optimization method for UAV base stations for narrowband Internet of Things
CN110856234A (en) * 2019-11-20 2020-02-28 廊坊新奥燃气设备有限公司 Energy-saving method and system for NB-IoT meter based on PSM access mode
CN111970703A (en) * 2020-06-24 2020-11-20 重庆邮电大学 Method for optimizing uplink communication resources in NB-IoT (NB-IoT)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3424243B1 (en) * 2016-03-01 2019-12-25 Telefonaktiebolaget LM Ericsson (PUBL) Energy efficient operation of radio network nodes and wireless communication devices in nb-iot

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110809274A (en) * 2019-10-28 2020-02-18 南京邮电大学 An enhanced network optimization method for UAV base stations for narrowband Internet of Things
CN110856234A (en) * 2019-11-20 2020-02-28 廊坊新奥燃气设备有限公司 Energy-saving method and system for NB-IoT meter based on PSM access mode
CN111970703A (en) * 2020-06-24 2020-11-20 重庆邮电大学 Method for optimizing uplink communication resources in NB-IoT (NB-IoT)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Introduction of NB-IoT";Huawei;《3GPP TSG-RAN WG2 NB-IOT Ad-hoc#2 R2-163218》;20160429;全文 *
Energy-efficient joint power control and resource allocation for cluster-based NB-IoT cellular networks; Zhu Shuqiong, Wu Wenquan, Feng Lei, et al.; Transactions on Emerging Telecommunications Technologies; 2017-12-27; full text *
Theoretical and experimental research on enhancing the magnetoelectric effect of specially shaped magnetoelectric composite materials; 张茹; China Doctoral Dissertations Electronic Journals Database; 2019-01-15; full text *
Research on optimized resource allocation in cognitive radio networks; 裴二荣; China Doctoral Dissertations Electronic Journals Database; 2012-12-15; full text *

Also Published As

Publication number Publication date
CN112867117A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
Chen et al. Robust computation offloading and resource scheduling in cloudlet-based mobile cloud computing
Min et al. Learning-based computation offloading for IoT devices with energy harvesting
CN112867117B (en) A Q-learning-based energy-saving method in NB-IoT
CN112084025B (en) A fog computing task offloading delay optimization method based on improved particle swarm optimization
CN109194763B (en) A caching method based on self-organized cooperation of small base stations in ultra-dense networks
Chen et al. Delay guaranteed energy-efficient computation offloading for industrial IoT in fog computing
CN113115339B (en) Task unloading and resource allocation joint optimization method based on mobility awareness
CN113490184A (en) Smart factory-oriented random access resource optimization method and device
CN116260871A (en) Independent task unloading method based on local and edge collaborative caching
CN107820309B (en) Wake-up strategy and time slot optimization algorithm for low-power-consumption communication equipment
CN108282821B (en) A packet-based congestion control access method for giant connection in Internet of Things communication
Dai et al. Deep reinforcement learning for edge computing and resource allocation in 5G beyond
Yan et al. Energy-efficient content fetching strategies in cache-enabled D2D networks via an Actor-Critic reinforcement learning structure
Zhang et al. Computation cost-driven offloading strategy based on reinforcement learning for consumer devices
CN106878958A (en) Fast Propagation Method Based on Adjustable Duty Cycle in Software Custom Wireless Networks
CN112445617B (en) Load strategy selection method and system based on mobile edge calculation
Abbas et al. An efficient partial task offloading and resource allocation scheme for vehicular edge computing in a dynamic environment
Li et al. A lightweight transmission parameter selection scheme using reinforcement learning for LoRaWAN
CN105916197B (en) The power adaptive method that social credibility drives in D2D network
Chen et al. Liquid state based transfer learning for 360 image transmission in wireless VR networks
CN117610644A (en) Federal learning optimization method based on block chain
CN114786275B (en) A data transmission method and device for Internet of Things gateway
CN116542319A (en) Self-adaptive federation learning method and system based on digital twin in edge computing environment
Tong et al. Dynamic user-centric multi-dimensional resource allocation for a wide-area coverage signaling cell based on DQN
US20250031230A1 (en) Methods and apparatuses for user equipment selecting and scheduling in intelligent wireless system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230324

Address after: 401336 Yuen Road, Nanan District, Chongqing City, No. 8

Patentee after: CHINA MOBILE IOT Co.,Ltd.

Address before: 400065 No. 2, Chongwen Road, Nan'an District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

TR01 Transfer of patent right