CN115915454A - SWIPT-assisted downlink resource allocation method and device

Info

Publication number: CN115915454A
Application number: CN202211225933.1A
Authority: CN
Inventors: 陈硕; 卫醒; 李学华
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2022-10-09
Filing date: 2022-10-09
Publication date: 2023-04-04

Abstract

The application discloses a SWIPT-assisted downlink resource allocation method and device. The method comprises: obtaining a state observation value of the current environment state of a communication link between a tolerant machine user device and a machine device gateway; based on the state observation value of the current environment state, selecting a resource allocation strategy for the tolerant machine user device by using a resource allocation model constructed based on a neural network; and allocating downlink resources to the tolerant machine user device based on the selected resource allocation strategy; wherein the tolerant machine user device is a machine user device that needs to complete the transmission task of a periodically generated payload. The method and device address the technical problems that energy-constrained machine user devices consume excessive energy when establishing a communication link, that resources are unevenly distributed among the various Internet of Things devices, and in particular that machine user devices with low quality of service (QoS) requirements receive poor resource allocations and therefore perform poorly.

Description

SWIPT-assisted downlink resource allocation method and device

Technical Field

The present application relates to the field of communications, and in particular to a SWIPT-assisted downlink resource allocation method and device.

Background Art

The rapid development of Internet of Things (IoT) technology has spawned data-driven fields such as connected industry, intelligent transportation systems and smart cities. Driven by these fields, human society is gradually becoming digital.

Machine-to-machine (M2M) communication is a major contributor to this digitalization because it requires no direct human intervention and offers near-real-time wireless connectivity. With the exponential growth in the number of networked machines, M2M communication will open up new possibilities for citizen services in home, entertainment, medical and workplace scenarios, and is expected to become a key driver in realizing the vision of an intelligent connection of all things.

Unlike traditional human-to-human (H2H) communication, M2M communication services have distinctive characteristics in terms of network energy efficiency, data transmission volume, and the highly reliable transmission required by some mission-critical services. Low-power wide-area networks have been developed within IoT technology specifically for M2M communication and can provide M2M devices with long-distance transmission at extremely low power consumption, but such networks offer only very low transmission rates and can hardly meet the QoS requirements of most critical M2M services. Cellular networks, with their larger bandwidth and stronger network scalability, can accommodate access by various services and provide high data transmission rates, and are therefore considered key to deploying M2M communication. At the same time, cellular networks themselves are too energy-consuming and too costly for many M2M communication applications.

To mitigate the negative impact of cellular networks on M2M communication, simultaneous wireless information and power transfer (SWIPT) is a promising technology that can significantly improve the energy efficiency of M2M devices in H2H/M2M coexisting cellular networks. Although traditional energy harvesting technologies allow mobile devices or base stations to draw energy from natural sources such as wind or sunlight, their efficiency is severely constrained by geographical location and weather conditions. SWIPT, as a radio-frequency-based energy harvesting technology, enables devices to obtain relatively controllable and stable energy. In general, the technology can convert interference signals into electrical energy, so a strong-interference environment reduces system throughput but also yields more harvested energy. In addition, SWIPT-assisted H2H/M2M coexisting cellular networks still face many challenges in meeting users' diverse quality of service (QoS) requirements, such as intra-layer and inter-layer interference control, spectrum sub-band selection, and the trade-off between the system information transmission rate and the amount of harvested energy.

No effective solution to the above problems has yet been proposed.

Summary of the invention

The embodiments of the present application provide a SWIPT-assisted downlink resource allocation method and device, so as to at least solve the technical problem that machine user devices receive poor resource allocations and therefore perform poorly.

According to one aspect of the embodiments of the present application, a SWIPT-assisted downlink resource allocation method is provided, including: obtaining a state observation value of the current environment state of a communication link between a tolerant machine user device and a machine device gateway; based on the state observation value of the current environment state, selecting a resource allocation strategy for the tolerant machine user device by using a resource allocation model constructed based on a neural network; and allocating downlink resources to the tolerant machine user device based on the selected resource allocation strategy; wherein the tolerant machine user device is a machine user device that needs to complete the transmission task of a periodically generated payload.

According to another aspect of the embodiments of the present application, a SWIPT-assisted downlink resource allocation device is also provided, including: an acquisition module, configured to obtain a state observation value of the current environment state of a communication link between a tolerant machine user device and a machine device gateway; a selection module, configured to select a resource allocation strategy for the tolerant machine user device based on the state observation value of the current environment state by using a resource allocation model constructed based on a neural network; and an allocation module, configured to allocate downlink resources to the tolerant machine user device based on the selected resource allocation strategy; wherein the tolerant machine user device is a machine user device that needs to complete the transmission task of a periodically generated payload.

In the embodiments of the present application, a resource allocation strategy is selected for the tolerant machine user device based on the state observation value of the current environment state by using a resource allocation model constructed based on a neural network, thereby solving the technical problem that machine user devices receive poor resource allocations and therefore perform poorly.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation on it. In the drawings:

FIG. 1 is a flowchart of a SWIPT-assisted downlink resource allocation method according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for constructing a downlink resource allocation model according to an embodiment of the present application;

FIG. 3 is a flowchart of another SWIPT-assisted downlink resource allocation method according to an embodiment of the present application;

FIG. 4 is a flowchart of yet another SWIPT-assisted downlink resource allocation method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a SWIPT-assisted downlink resource allocation device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a SWIPT-assisted downlink resource allocation system according to an embodiment of the present application;

FIG. 7 is a schematic diagram comparing the total energy efficiency of M2M devices as the number of M2M devices varies, according to an embodiment of the present application;

FIG. 8 is a schematic diagram comparing the QoS requirement satisfaction rate of H2H users as the number of tolerant M2M devices varies, according to an embodiment of the present application;

FIG. 9 is a schematic diagram comparing the outage probability of critical M2M links as the number of tolerant M2M devices varies, according to an embodiment of the present application;

FIG. 10 is a schematic diagram comparing the payload transmission completion rate as the number of tolerant M2M devices varies, according to an embodiment of the present application.

DETAILED DESCRIPTION

To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.

It should be noted that the terms "first", "second", etc. in the specification and claims of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.

Example 1

According to an embodiment of the present application, a SWIPT-assisted downlink resource allocation method is provided. As shown in FIG. 1, the method includes:

Step S102: obtaining a state observation value of the current environment state of a communication link between a tolerant machine user device and a machine device gateway.

Before the state observation value of the current environment state of the communication link between the tolerant machine user device and the machine device gateway is obtained, the machine user devices need to be divided into tolerant machine user devices and critical machine user devices based on their service types. A critical machine user device is a machine user device whose transmission reliability requirement is higher than a preset requirement threshold, and a tolerant machine user device is a machine user device that needs to complete the transmission task of a periodically generated payload.
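The division described here can be illustrated with a minimal sketch. The class and field names, the numeric reliability scale, and the rule that the reliability test takes precedence when both conditions hold are all assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass
from enum import Enum


class MtcdClass(Enum):
    CRITICAL = "critical"
    TOLERANT = "tolerant"


@dataclass
class MachineUserDevice:
    device_id: str
    reliability_requirement: float  # hypothetical scale, e.g. required delivery probability
    periodic_payload: bool          # True if the device periodically generates a payload


def classify(device, reliability_threshold):
    """Return the class of a machine user device, or None if neither rule applies."""
    # Assumption: the reliability test is checked first.
    if device.reliability_requirement > reliability_threshold:
        return MtcdClass.CRITICAL
    if device.periodic_payload:
        return MtcdClass.TOLERANT
    return None
```

For instance, with a threshold of 0.999, a device requiring 0.99999 reliability would be treated as critical, while a periodically reporting sensor requiring 0.9 would be treated as tolerant.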

In addition to the existing conventional H2H (Human-to-Human) user devices (also called human user devices or H2H users), this embodiment further subdivides machine user devices (Machine-Type Communication Devices, MTCDs) into tolerant machine user devices (also called tolerant M2M devices or tolerant M2M users) and critical machine user devices (also called critical M2M devices or critical M2M users). Different resource allocation strategies can thus be selected according to the service characteristics of different machine user devices, so that resources are allocated to machine user devices more reasonably.

In one example, the machine user device may be provided with simultaneous wireless information and power transfer (SWIPT) capability, which enables the MTCD to harvest energy from the radio-frequency environment and decode information at the same time.

Step S104: based on the state observation value of the current environment state, selecting a resource allocation strategy for the tolerant machine user device by using a resource allocation model constructed based on a neural network.

In one example, as shown in FIG. 2, the resource allocation model may be constructed as follows:

Step S1042: constructing a state function.

The state function is the set of state observation values. For example, the state function is constructed from the channel gain information of the communication link on each spectrum sub-band, the interference power received by the communication link on each spectrum sub-band, the remaining payload and remaining transmission time of the communication link, the current iteration number, and the greedy factor representing the current environment exploration rate of the communication link.
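As a minimal sketch, the listed observations can be flattened into one input vector for the neural network. The function name, argument names and ordering of the entries are illustrative assumptions; the application does not prescribe a particular layout.

```python
def build_state(channel_gains, interference_powers, payload_left, time_left,
                iteration, epsilon):
    """Flatten the per-link observations into a single state vector.

    channel_gains and interference_powers hold one value per spectrum sub-band.
    """
    if len(channel_gains) != len(interference_powers):
        raise ValueError("expected one gain and one interference value per sub-band")
    return [
        *channel_gains,
        *interference_powers,
        payload_left,      # remaining payload of the link
        time_left,         # remaining transmission time of the link
        float(iteration),  # current iteration number
        epsilon,           # greedy factor (exploration rate)
    ]
```

With K sub-bands the vector has 2K + 4 entries under this layout.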

Step S1044: constructing an action function.

The action function is the set of downlink spectrum resources, transmit power levels and power splitting ratios.
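One way to picture such an action set is as the Cartesian product of the three choices, so that each network output corresponds to one (sub-band, power level, power splitting ratio) triple. The sketch below assumes that the power splitting ratio is discretized into a finite list, which is an illustrative assumption; the function and parameter names are hypothetical.

```python
from itertools import product


def build_action_space(num_subbands, power_levels, ps_ratios):
    """Enumerate every (sub-band index, transmit power level, PS ratio) triple."""
    return list(product(range(num_subbands), power_levels, ps_ratios))
```

For example, 4 sub-bands, 3 power levels and 2 power splitting ratios give 24 discrete actions, and an action index chosen by the model is mapped back to a triple by indexing into the list.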

Step S1046: constructing a reward function.

The reward function is constructed based on a balance between the resource allocation optimization objective and the QoS constraints.

In some examples, the reward function may be constructed as follows.

First, the reward function penalty term contributed by the tolerant machine user devices is determined based on their remaining payload and remaining transmission time. Next, the reward function penalty term contributed by the critical machine user devices is determined based on their outage probability. Then, the reward function penalty term for the human user devices is determined based on their signal-to-interference-plus-noise ratio (SINR). Finally, the reward function is constructed from the total energy efficiency value of the communication links of all machine user devices, the penalty term contributed by the tolerant machine user devices, the penalty term contributed by the critical machine user devices, and the penalty term for the human user devices.
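A minimal sketch of a reward built from these four ingredients is given below. The specific penalty shapes (one unit per violating link or user), the weights, and all names are assumptions made for illustration; the application does not state the exact form of the penalty terms in this passage.

```python
def shared_reward(total_ee, tolerant_links, critical_outage_probs, h2h_sinrs,
                  outage_max, sinr_min, weights=(1.0, 1.0, 1.0)):
    """Total energy efficiency minus weighted penalty terms.

    tolerant_links: iterable of (payload_left, time_left) pairs.
    """
    # Tolerant links: penalize payload still pending when the time budget is used up.
    tolerant_penalty = sum(
        1 for payload_left, time_left in tolerant_links
        if payload_left > 0 and time_left <= 0
    )
    # Critical links: penalize outage probability above the allowed maximum.
    critical_penalty = sum(1 for p in critical_outage_probs if p > outage_max)
    # Human users: penalize SINR that is not above the minimum threshold.
    h2h_penalty = sum(1 for sinr in h2h_sinrs if sinr <= sinr_min)

    w_tolerant, w_critical, w_h2h = weights
    return (total_ee
            - w_tolerant * tolerant_penalty
            - w_critical * critical_penalty
            - w_h2h * h2h_penalty)
```

When no constraint is violated the reward reduces to the total energy efficiency, so maximizing it pursues the optimization objective while discouraging QoS violations.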

In this embodiment, for the QoS constraints, the SINR threshold to be met by H2H user devices, the outage probability of critical M2M communication links, and the payload transmission probability of M2M communication links are each explicitly modeled and solved, and the corresponding reward function penalty terms are obtained. The reward value of the reward function is therefore more accurate, which makes resource allocation more reasonable.

In some other examples, the reward function may be constructed as follows.

First, the quality of service (QoS) constraints are set. For example, the QoS constraint of a human user device is that its SINR is greater than a set minimum threshold; the QoS constraint of a tolerant machine user device is that the transmission success rate of a payload of preset size V within the time constraint T is higher than a set success rate threshold; and the QoS constraint of a critical machine user device is that its outage probability is not higher than a set outage threshold.

Next, the total energy efficiency value EE achieved by the tolerant machine user devices and the critical machine user devices is taken as the resource allocation optimization objective. This embodiment aims to maximize the EE of the M2M communication links and jointly allocates spectrum, transmit power and the power splitting (PS) ratio while taking the users' QoS constraints into account.

Finally, the reward function is constructed based on the resource allocation optimization objective and the QoS constraints.
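The objective and constraints described in this example can be summarized as a constrained optimization problem. The following is only a schematic sketch: the symbols $b$, $p$ and $\rho$ (sub-band choice, transmit power and power splitting ratio of a tolerant link), $\gamma_h$, $\gamma_{\min}$, $\eta_{\min}$ and $p_{\max}^{\mathrm{out}}$, and the sets $\mathcal{H}$ and $\mathcal{S}$ of H2H users and critical M2M links, are notation assumed here for illustration, and $EE$ is left abstract because its closed form is not given in this passage.

```latex
\begin{aligned}
\max_{\{b,\; p,\; \rho\}} \quad & EE \\
\text{s.t.} \quad
 & \gamma_h > \gamma_{\min}, && \forall h \in \mathcal{H}, \\
 & \Pr\{\text{payload of size } V \text{ delivered within } T\} > \eta_{\min},
   && \text{for each tolerant link}, \\
 & \Pr\{\text{outage on link } s\} \le p_{\max}^{\mathrm{out}}, && \forall s \in \mathcal{S}.
\end{aligned}
```

A reward function then trades off the objective in the first line against violations of the three constraint families.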

Step S1048: constructing an experience replay pool, where the experience replay pool is used to store data and to provide data for training the neural network.

An experience replay pool is constructed to store the training data used to train the resource allocation model, where the training data include the state observation values at the current moment and the next moment, the reward value, and the selected resource allocation strategy. The neural network includes a training network and a target network. In each iteration the training network is trained by stochastic gradient descent on data drawn at random from the experience replay pool. The target network is a fixed neural network whose parameters are updated only at intervals, at which point they are set to the current parameters of the training network.
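The two mechanisms described here, a replay pool sampled at random and a target network refreshed at intervals, can be sketched as follows. The fixed capacity with oldest-first eviction, the list-of-floats representation of network parameters, and all names are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # Assumption: the oldest experiences are evicted once capacity is reached.
        self._buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch without replacement."""
        items = list(self._buffer)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._buffer)


def maybe_sync_target(step, sync_interval, train_params, target_params):
    """Copy the training parameters into the target network every sync_interval steps."""
    if step % sync_interval == 0:
        target_params[:] = train_params
        return True
    return False
```

Between synchronizations the target parameters stay fixed, which keeps the regression target steady while the training network is updated.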

Step S1049: constructing and training the resource allocation model based on the state function, the action function, the reward function and the experience replay pool.

Step S106: allocating downlink resources to the tolerant machine user device based on the selected resource allocation strategy.

After the state observation value of the current environment state has been obtained, or after the downlink resources have been allocated, data related to the resource allocation may also be put into the experience replay pool corresponding to each communication link for use as training data.

For example, the total energy efficiency value of all machine user devices and the QoS level of each machine user device and human user device at the current moment are calculated. The total energy efficiency value, the QoS levels, the interference power received by the communication link on each spectrum sub-band, the selected resource allocation strategy, the state observation value, the reward output by the reward function, and the prior information received by the machine device gateway are then put into the experience replay pool of each communication link as training data for the resource allocation model.

This embodiment addresses the energy resource allocation problem in SWIPT-assisted H2H/M2M (Machine-to-Machine) coexisting cellular networks. By allocating appropriate resources (namely spectrum, power and the power splitting (PS) ratio) to tolerant machine user devices, high energy efficiency (EE) is achieved while the QoS requirements are guaranteed.

This embodiment provides a stable training process through a state design based on behavior tracking, and the shared reward function enables the agents to work together in a distributed manner. In addition, this embodiment gives exact mathematical expressions for the QoS constraints, thereby reducing the computational complexity of the reward function.

To support different QoS requirements, this embodiment models the EE optimization problem of multiple M2M communication links as a multi-agent problem, in which multiple tolerant M2M communication links share the spectrum occupied by critical M2M communication links and H2H communication links. Furthermore, by designing the state function, action function and reward function, this embodiment enables the agents to adapt to the dynamic network environment and thus achieve the optimization objective.

Example 2

According to an embodiment of the present application, a SWIPT-assisted QoS-based downlink resource allocation method is also provided. As shown in FIG. 3, the method includes:

Step S302: initializing the resource allocation system.

Based on the pre-assigned spectrum sub-bands and transmit power, prior information is obtained, such as the channel conditions of the spectrum sub-bands occupied by the H2H users and critical M2M users and their SINR or outage probability. More specifically, initialization means determining the initial state of each tolerant M2M link at time t.

Step S304: building a training neural network and a target neural network for each agent.

Each tolerant M2M link is regarded as an agent, and a training neural network $Q(s, a; \omega)$ and a target neural network $Q^{-}(s, a; \omega^{-})$ are built for each agent, where $\omega$ and $\omega^{-}$ denote the parameters of the training network and the target network, respectively. The input layer of the network is $s$, the state observation value of each tolerant link, and the output layer gives the estimated reward value corresponding to each action choice $a$, denoted as the Q value.

An experience replay pool $D$ is established for each agent to store the state observation values $S^{t}$ and $S^{t+1}$ of its own link at the current time $t$ and the next time $t+1$, the action choice $A^{t}$ at the current time $t$, and the reward value $R^{t+1}$ received at the next time. The data stored in $D$ at time $t$ is represented by the tuple $(S^{t}, A^{t}, R^{t+1}, S^{t+1})$.

Step S306: designing a state function, an action function and a reward function for the agents.

Each agent builds a link communication energy efficiency optimization model based on the prior information and on the action choices it makes according to its local state observations. This specifically includes formulating the constraints on transmit power and power splitting ratio, and the QoS constraints.

The QoS constraints are modeled according to the actual characteristics of each service as follows. H2H services are usually voice and Internet traffic, which places high demands on the data transmission rate, so the QoS constraint of H2H services is modeled as the SINR being greater than a set minimum threshold. Most M2M services are event-driven, and their traffic is usually generated periodically at different frequencies according to the needs of the M2M service itself, so the QoS constraint of tolerant M2M services is modeled as the transmission success rate of a data packet of size V within the time constraint T. Critical M2M services often have strict delay and transmission reliability requirements, so the QoS constraint of critical M2M services is modeled as the outage probability being no higher than a set outage threshold.
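The three constraint types can be turned into simple checks, as in the sketch below. The function names, the aggregation of all users of one type into a single 0/1 indicator, and the empirical estimate of the success rate from a list of delivery outcomes are illustrative assumptions rather than the exact expressions of the application.

```python
def h2h_indicator(sinrs, sinr_min):
    """1 if every H2H user's SINR is greater than the minimum threshold, else 0."""
    return int(all(sinr > sinr_min for sinr in sinrs))


def critical_indicator(outage_probs, outage_max):
    """1 if no critical M2M link exceeds the outage threshold, else 0."""
    return int(all(p <= outage_max for p in outage_probs))


def tolerant_success_rate(delivery_outcomes):
    """Fraction of payloads of size V that were delivered within the time limit T."""
    if not delivery_outcomes:
        raise ValueError("at least one delivery outcome is required")
    delivered = sum(1 for ok in delivery_outcomes if ok)
    return delivered / len(delivery_outcomes)
```

Indicators of this kind can be fed back to the agents as part of the observed QoS status and used when computing penalties in the reward.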

Step S308: training the training network Q and allocating network resources.

Based on the link communication energy efficiency optimization model, each tolerant M2M link repeatedly draws data at random from the large amount of data stored in the experience replay pool to train the training network Q, using stochastic gradient descent to reduce the error between the parameters of the training network and the target network. Once training of the resource allocation model has converged, the link can produce a resource allocation scheme that satisfies the above constraints and the QoS constraints.
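A training step of this kind can be sketched in plain Python. To stay self-contained, the sketch replaces the neural network with a linear stand-in (one weight vector per action) and uses the standard temporal-difference target with a discount factor `gamma`; the linear model, the discount factor, the learning rate and all names are assumptions for illustration and are not specified in this passage.

```python
def q_values(params, state):
    """Linear stand-in for the Q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, state)) for weights in params]


def sgd_step(train_params, target_params, batch, lr=0.01, gamma=0.9):
    """Per-sample gradient descent on the squared TD error; returns the mean squared error."""
    if not batch:
        raise ValueError("batch must not be empty")
    total_sq_error = 0.0
    for state, action, reward, next_state in batch:
        # The target is computed with the fixed target-network parameters.
        target = reward + gamma * max(q_values(target_params, next_state))
        td_error = target - q_values(train_params, state)[action]
        # Gradient of 0.5 * td_error**2 w.r.t. the chosen action's weights is -td_error * state.
        train_params[action] = [
            w + lr * td_error * x for w, x in zip(train_params[action], state)
        ]
        total_sq_error += td_error ** 2
    return total_sq_error / len(batch)
```

Only the weights of the action actually taken are adjusted for each sample, and repeated calls on data from the replay pool drive the training network's estimates toward the targets produced by the target network.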

In addition, to stabilize the training environment, this embodiment adopts a behavior tracking method that maps high-dimensional information to a low-dimensional representation, so that each tolerant M2M link can tell, when facing training data drawn at random from the experience replay pool, in which state the experience it is currently learning from was generated. Furthermore, to improve overall energy efficiency, the reward function is shared in this embodiment, which encourages the tolerant M2M links to explore the environment cooperatively and to make the intelligent decisions that most improve overall performance.

In one embodiment, the resource allocation process may specifically include the following. According to the received prior information, the machine device gateway of each tolerant M2M link selects the spectrum sub-band with the best channel conditions for reuse, and allocates an appropriate transmit power and power splitting ratio to the tolerant machine device in the link. At each time step (1 ms), the channel conditions of the spectrum sub-bands in the network change because of channel fading. In addition, the resource allocation choice made by each tolerant M2M link at each time step also affects the QoS indicators of the various types of users in the network. Each tolerant M2M link judges the quality of the resource allocation scheme chosen at that time step from these updated QoS indicators, namely the SINR of the H2H users, the outage probability of the critical M2M users, the remaining payload and remaining transmission time of the tolerant M2M users, and the achievable energy efficiency of all M2M users. It then stores the current channel conditions, QoS indicators, received interference, chosen resource allocation scheme and achievable energy efficiency in the experience replay pool as training data for the training network Q of each tolerant M2M link. As the environment data keep changing, the number of training samples keeps growing, and training on data drawn at random from the experience replay pool effectively removes the correlation between samples.

The spectrum sub-band selection, transmit power allocation and power splitting ratio allocation tasks of the tolerant M2M links are implemented in a distributed manner. At each time t, each tolerant M2M link uses the above prior information and its own training network to select the spectrum sub-band that currently has the best channel conditions together with an appropriate transmit power and power splitting ratio, and communicates using that resource allocation scheme. It then obtains the overall energy efficiency of all M2M devices and the QoS indicators of the other users in the network, which is referred to as the reward function. All tolerant M2M links in the network share this same reward function, so the resource allocation method proposed in this application encourages cooperation among the links in order to maximize the overall M2M energy efficiency.
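The per-link decision can be sketched as an epsilon-greedy choice over the Q values produced by that link's own training network. Epsilon-greedy selection is assumed here on the basis of the greedy factor described as an exploration rate; the function name and the injectable random source are illustrative.

```python
import random


def select_action(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)
```

Each link runs this selection independently on its own observations, and all links are then scored with the same reward value, which is what ties their individual choices to the common objective.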

The resource allocation task further includes the following. At each time t, the available environment data of tolerant M2M link (m, n) is updated according to the prior information received by the machine device gateway MTCG m in the cluster, and its state set is designed as follows:

[equation not reproduced]

In the state set, one term contains the channel gain of link (m, n) on each spectrum sub-band, and another describes the interference power received by link (m, n) on each spectrum sub-band. V_{m,n} and T_{m,n} denote the remaining payload and the remaining transmission time of link (m, n), respectively. {QoS_h}_{h∈H} and {QoS_s}_{s∈S} are binary variables representing the QoS indicators of the H2H users and the critical M2M users: {QoS_h}_{h∈H} = 1 if the QoS constraints of all H2H users are satisfied and 0 otherwise, and {QoS_s}_{s∈S} indicates in the same way whether the QoS constraints of all critical M2M links are satisfied.

In addition, e denotes the current iteration number and ε is the greedy factor, that is, the current environment exploration rate of link (m, n). These two items are added to the state set to track the environment state at iteration e and exploration probability ε during training. During training, the resource allocation policy of each tolerant M2M link changes as the policies of the other links change, so from the viewpoint of a single link the network environment is constantly changing. Moreover, because experiences are drawn from the experience replay pool at random, the training data drawn at a given moment do not necessarily reflect the current network environment and are very likely outdated, which makes it hard for a link to know under which environment state, and under which policies of the other links, an experience was generated. The value function that actually represents each link's policy is a high-dimensional set of neural network parameters and cannot be used as experience data for the links to learn from. The iteration number e and the greedy factor ε are therefore added to the experience data as a low-dimensional mapping of those parameters, so that a tolerant M2M link can track the environment state associated with each sampled experience and training is stabilized.
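By way of illustration only, a local state observation containing the iteration number and the exploration rate might be assembled as in the following sketch. The function name `build_observation`, the argument names, and the ordering of the entries are assumptions made for this sketch.

```python
def build_observation(channel_gains, interference, payload_left, time_left,
                      qos_h_ok, qos_s_ok, iteration, epsilon):
    """Flatten one link's local observation into a list of floats."""
    obs = list(channel_gains) + list(interference)   # per-sub-band quantities
    obs += [float(payload_left), float(time_left)]   # remaining payload and time
    obs += [1.0 if qos_h_ok else 0.0,                # binary H2H QoS indicator
            1.0 if qos_s_ok else 0.0]                # binary critical M2M QoS indicator
    obs += [float(iteration), float(epsilon)]        # low-dimensional tracking pair (e, epsilon)
    return obs
```

Storing the pair (e, ε) with every experience lets a link tell, when that experience is later sampled, roughly how far training had progressed when it was generated.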

Based on the above environment information, tolerant M2M link (m, n) applies an ε-greedy policy at time t: with probability ε it selects a spectrum sub-band, transmit power, and power split ratio at random, and with probability 1-ε it selects the best action given by the current training network. The action (resource) set is defined as:

[equation not reproduced]

In the action set, one component denotes the spectrum sub-band selection and two further components denote the transmit power selection and the power split ratio selection, respectively. The transmit power is bounded by the maximum transmit power, and L and Z denote the number of discretization levels of the transmit power and of the power split ratio, respectively.
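By way of illustration only, a discretized action set and the ε-greedy selection might be sketched as follows. The uniform grids (power levels P_max/L, ..., P_max and split ratios 1/Z, ..., 1) and the function names are assumptions made for this sketch; the application states only that L and Z are the numbers of discretization levels.

```python
import itertools
import random


def build_action_set(num_subbands, p_max, power_levels, split_levels):
    """Enumerate (sub-band index, transmit power, power split ratio) triples."""
    # Assumption: both quantities are discretized on uniform grids.
    powers = [p_max * (l + 1) / power_levels for l in range(power_levels)]
    splits = [(z + 1) / split_levels for z in range(split_levels)]
    return list(itertools.product(range(num_subbands), powers, splits))


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With K sub-bands, the action set in this sketch contains K × L × Z entries, so the output layer of the training network would have that many Q-values.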

In this embodiment, the set of spectrum sub-bands available in the network is indexed by K = H ∪ S, where H = {1, 2, ..., H} denotes the human user equipment (HUE) and S = {1, 2, ..., S} denotes the critical machine user devices (CMTCD). In addition, M = {1, 2, ..., M} denotes the MTCGs, and the tolerant machine user devices (TMTCD) managed by each MTCG are denoted by N = {1, 2, ..., N}. All TMTCDs and CMTCDs are equipped with SWIPT technology and are collectively denoted by the set D = S ∪ N.

Tolerant M2M link (m, n) establishes its communication link according to the action selected at time t. Depending on the spectrum sub-band selection, the SINR of the users with high QoS requirements is calculated in the following two cases:

[equations not reproduced]

In the above formulas, the first term in the denominator is the interference received by the H2H user or critical M2M user on spectrum sub-band k, which consists of the cross-cluster and intra-cluster interference received when the k-th spectrum sub-band is reused; a binary variable indicates the occupancy of each spectrum sub-band; and the second term σ² is the additive white Gaussian noise power. P_{BS,h} and P_{m,s} are the transmit powers allocated by the base station to the H2H user and by the machine device gateway to the critical M2M user, respectively, and ρ_{m,s} is the power split ratio allocated by the machine device gateway to the critical M2M user. The remaining symbols denote the channel gain between the base station and the H2H user, the channel gain between the machine device gateway and the H2H user, and the channel gain between the machine device gateway and the critical M2M user in the same cluster.

After the SINRs of the H2H users and the critical M2M users have been calculated, the following formula is used to determine whether their respective QoS requirements are satisfied:

[equation not reproduced]

Here, two thresholds denote the SINR thresholds for H2H services and critical M2M services, respectively, and a third parameter is the maximum tolerable outage probability. If the QoS constraint of the H2H service is satisfied, the environment state information {QoS_h}_{h∈H} = 1; otherwise {QoS_h}_{h∈H} = 0. Likewise, {QoS_s}_{s∈S} uses the same kind of indicator to express whether the QoS of all critical M2M users is satisfied.
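By way of illustration only, the SINR calculation and the binary QoS indicator for the H2H users might be sketched as follows. The function names are assumptions made for this sketch, and the SINR is written in its generic form (received signal power over interference plus noise) rather than as the exact expressions of the application, which for SWIPT devices also involve the power split ratio.

```python
def sinr(tx_power, channel_gain, interference_power, noise_power):
    """Generic SINR: received signal power over interference plus noise."""
    return tx_power * channel_gain / (interference_power + noise_power)


def qos_indicator(sinr_values, threshold):
    """Return 1 if every user meets the SINR threshold, else 0."""
    return 1 if all(v >= threshold for v in sinr_values) else 0
```

The indicator is 1 only when all users of the class meet the threshold, which matches the all-or-nothing definition of {QoS_h}_{h∈H} in the state set.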

For the tolerant M2M link itself, the SINR is given by:

[equation not reproduced]

where the interference received by tolerant M2M link (m, n) falls into one of the following two cases, depending on which type of user occupies the reused spectrum sub-band:

[equations not reproduced]

Here, P_{m,s} is the transmit power allocated by the machine device gateway to the critical M2M user. The remaining symbols denote: the channel gain of tolerant M2M link (m, n) on spectrum sub-band k; the transmit power of the m′-th machine device gateway on spectrum sub-band k; the channel gain between the m′-th machine device gateway and tolerant M2M device n on spectrum sub-band k; the spectrum sub-band allocation indicator, which equals 1 when tolerant M2M link (m, n) communicates on spectrum sub-band k and 0 when it does not; and the transmit power allocated by machine device gateway m to tolerant M2M user n. M is the total number of machine device gateways, and N is the total number of tolerant M2M devices managed by one machine device gateway.

Because the information handled by tolerant M2M services usually consists of data packets generated periodically, at a frequency determined by the characteristics of each service, the QoS of these services is modeled as the success rate of completing the packet transmission within the time budget T:

[equation not reproduced]

where the rate term is the information transmission rate of link (m, n) on spectrum sub-band k at time t, B is the bandwidth of each spectrum sub-band, V is the periodically generated M2M payload in bits, and Δ_T is the channel coherence time.
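By way of illustration only, the delivery check for one realization of the channel might be sketched as follows. The sketch assumes a Shannon-type rate B·log2(1 + SINR) and one SINR value per coherence interval; the function names are hypothetical, and the application models the QoS as the probability of this event rather than as a single check.

```python
import math


def rate_bps(bandwidth_hz, sinr):
    """Shannon-type rate of one sub-band in bits per second (assumed form)."""
    return bandwidth_hz * math.log2(1.0 + sinr)


def payload_delivered(sinr_per_slot, bandwidth_hz, slot_s, payload_bits):
    """True if the bits sent over all slots of the time budget reach the payload."""
    sent = sum(rate_bps(bandwidth_hz, s) * slot_s for s in sinr_per_slot)
    return sent >= payload_bits
```

For example, ten 1 ms slots at SINR = 1 on a 1 MHz sub-band carry about 10,000 bits, so a 9,000-bit payload is delivered within the budget and an 11,000-bit payload is not.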

The energy efficiency (EE) of all SWIPT-assisted M2M devices can be expressed as the ratio of spectral efficiency to energy consumption, with D = S ∪ N denoting all M2M devices:

[equation not reproduced]

where one quantity is the achievable spectral efficiency of all M2M devices and the other is their energy consumption. In the energy consumption, P_C denotes the power consumption of all M2M links, and the energy harvested by the M2M devices is computed with the energy conversion efficiency θ ∈ [0, 1].
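By way of illustration only, the ratio of spectral efficiency to energy consumption might be evaluated as in the following sketch. The exact expression of the application is not available here, so the net consumption is assumed to be the total transmit power plus the link power consumption P_C minus the harvested power; the function and argument names are hypothetical.

```python
import math


def energy_efficiency(sinrs, tx_powers, circuit_power, harvested_powers):
    """Spectral efficiency divided by net power consumption (assumed form)."""
    spectral_eff = sum(math.log2(1.0 + s) for s in sinrs)  # bit/s/Hz
    # Assumption: harvested power offsets transmit and circuit power.
    net_power = sum(tx_powers) + circuit_power - sum(harvested_powers)
    if net_power <= 0:
        raise ValueError("net power consumption must be positive")
    return spectral_eff / net_power
```

The guard on the denominator reflects that a ratio-type objective is only meaningful while net consumption stays positive.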

The M2M link energy efficiency optimization model of this application is thus summarized as follows:

[optimization problem with constraints (1a) to (1g) not reproduced]

The optimization variables represent the power split ratio and transmit power allocation policy and the spectrum sub-band allocation policy, respectively. Constraints (1a), (1b), and (1c) are the QoS constraints of the H2H users, the critical M2M users, and the tolerant M2M users, respectively. Constraint (1d) requires that the power split ratio of an M2M device using SWIPT be no greater than 1. The variable in (1e) is binary: it equals 1 if tolerant M2M link (m, n) is allocated spectrum sub-band k and 0 otherwise. Constraint (1f) sets the upper bound on the transmit power of the tolerant M2M links, and (1g) restricts each tolerant M2M link to at most one spectrum sub-band.

The channel gain models of the communication links and interference links (where g denotes channel gain) are summarized below; the corresponding formulas are not reproduced.

The first formula gives the channel gain on spectrum sub-band k of M2M communication link (m, n), formed between machine device gateway m and tolerant machine device n in the same cluster. In it, X_{m,n} β_{m,n} represents path loss and shadow fading, together called large-scale fading; these two terms are independent of frequency and remain constant over a relatively long time. A further factor represents the frequency-dependent fast fading.

The second formula gives the channel gain on spectrum sub-band k of M2M communication link (m, s), formed between machine device gateway m and critical machine device s in the same cluster.

The third formula gives the channel gain on spectrum sub-band k of H2H communication link (BS, h), formed between the base station BS and human-type user equipment h.

The fourth formula gives the interference channel gain on spectrum sub-band k between the base station BS and tolerant M2M link (m, n).

The fifth formula gives the interference channel gain on spectrum sub-band k between machine device gateway m′ and tolerant M2M link (m, n) in a different cluster.

The sixth formula gives the interference channel gain on spectrum sub-band k between machine device gateway m′ and critical M2M link (m, s) in a different cluster.

The seventh formula gives the interference channel gain on spectrum sub-band k between machine device gateway m and human-type user equipment h.

So that the tolerant M2M links move toward maximizing overall energy efficiency during training while still satisfying the QoS constraints of every type of user, this application incorporates the overall EE and the QoS constraints into the design of the reward function.

At each time t, the SINR of an H2H user is easily obtained from the received power and interference. The QoS constraints of the M2M users, however, are probabilities: evaluating them directly would require every agent to generate a large number of simulated channel realizations at every time step, which consumes substantial computing resources and slows the convergence of the algorithm. To solve this problem, this application converts these QoS constraints into exact explicit expressions.

First, let U_n denote the remaining transmission time of a tolerant M2M user while its payload has not been fully transmitted; once the transmission is complete, U_n is set to a constant. At time t, the reward or penalty associated with the tolerant M2M users is therefore set as:

[equation not reproduced]

With this transformation, each agent accounts for both the remaining transmission time and the payload transmission rate during training.

Second, this application uses theoretical analysis and mathematical transformation to obtain the exact outage probability of each critical M2M device. When the s-th critical M2M device shares the k-th spectrum sub-band with tolerant M2M devices in different clusters, the outage probability can be rewritten as:

[equation not reproduced]

The derivation relies on the following lemma: if z_1, ..., z_n are independent exponentially distributed random variables with given means, then

[equation not reproduced]

where c is a positive constant. By the Rayleigh fading property and the above formula, the outage probability expression can be rewritten as:

[equation not reproduced]

In this expression, the interference-source index takes the value m or m′ depending on the interference source, denoting inter-cluster interference and intra-cluster interference, respectively. It may also include both m and m′, meaning that the received interference includes both inter-cluster and intra-cluster interference. Note that when only inter-cluster interference is received, only the Π_n term of the product in the above formula is retained. Correspondingly, the interference received by the s-th CMTCD on spectrum sub-band k is expressed as:

[equation not reproduced]

In addition, when both m and m′ are included, computing the outage probability requires multiplying all the terms of the above formula in sequence, which incurs a large computational cost.
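By way of illustration only, the following sketch checks numerically a standard closed-form result of the kind invoked by the lemma. For independent exponential variables z_i with rates λ_i (means 1/λ_i), Pr{z_1 ≥ z_2 + ... + z_n + c} = exp(-λ_1 c) · Π_{i≥2} λ_i / (λ_i + λ_1). Whether the lemma of the application is stated in exactly this form is an assumption, and the function names are hypothetical.

```python
import math
import random


def prob_exceeds_closed_form(rates, c):
    """Pr{z1 >= z2 + ... + zn + c} for independent z_i ~ Exp(rate=rates[i])."""
    lam1 = rates[0]
    prod = 1.0
    for lam in rates[1:]:
        prod *= lam / (lam + lam1)
    return math.exp(-lam1 * c) * prod


def prob_exceeds_monte_carlo(rates, c, trials=200000, seed=0):
    """Estimate the same probability by simulating exponential draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = [rng.expovariate(lam) for lam in rates]
        if z[0] >= sum(z[1:]) + c:
            hits += 1
    return hits / trials
```

Under Rayleigh fading the received signal and interference powers are exponentially distributed, so an outage probability of the form Pr{signal < threshold × (interference + noise)} equals one minus a probability of this type. The closed form replaces the many simulated channel draws of the Monte Carlo estimate with a single product.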

To further accelerate the convergence of the algorithm, the above formula is converted into the following form:

[equation not reproduced]

This form yields the exact outage probability of each critical M2M device at a lower computational cost. So that each agent can better distinguish good from bad resource allocation policies during training, this application further defines an indicator of the QoS satisfaction of the critical M2M devices in time slot t:

[equation not reproduced]

where p_s is the outage probability obtained from the expression derived above. Similarly, a corresponding indicator expresses the QoS satisfaction of the H2H users in time slot t:

[equation not reproduced]

In summary, this application sets the global reward function as:

[equation not reproduced]

where μ is a positive real number that balances reward and penalty. Because all agents share the same reward function, they can cooperatively explore the unknown environment and learn the resource allocation policy that most improves overall energy efficiency, while the penalty terms ensure that each agent takes the QoS constraints of all user types into account when allocating resources.
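By way of illustration only, a shared reward that combines a main reward term with weighted penalty terms might be sketched as follows. The exact expression of the application is not available here; the additive form, the sign convention (penalties are non-negative and subtracted), and the function name are assumptions made for this sketch.

```python
def global_reward(energy_efficiency, penalties, mu):
    """Shared reward: EE minus mu times the summed QoS penalties (assumed form)."""
    if mu <= 0:
        raise ValueError("mu must be a positive real number")
    return energy_efficiency - mu * sum(penalties)
```

Every agent would receive the same value from this function at a given time step, which is what ties the individual policies to the common objective.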

After each round of iteration, each agent draws a mini-batch at random from the experience replay pool to train the training network Q, updates the training network parameters ω by stochastic gradient descent so as to approach the target network parameters ω^-, and copies its own network parameters to the target network after a number of iterations.

The target network is expressed as:

[equation not reproduced]

where 0 ≤ γ ≤ 1 is the discount factor, which determines how much weight the current reward carries during the iteration: the smaller γ is, the more weight is placed on the current reward, and vice versa. S^{t+1} is the state observation at the next time step, a′ is the optimal action selected in state S^{t+1}, and ω^- denotes the target network parameters.

In this embodiment, the loss function is expressed as:

L(θ) = E[(Q^- - Q(S^t, A^t; ω))²]

where L(θ) is the loss function, E is the mathematical expectation, Q is the value function of the training network, S^t is the state observation at time t, A^t is the action selected in state S^t, and ω denotes the parameters of the training network.
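By way of illustration only, the target value and the squared-error loss might be computed as in the following sketch. The target is assumed to take the standard deep Q-learning form (current reward plus the discounted maximum of the target network's Q-values in the next state); the function names are hypothetical.

```python
def td_target(reward, next_q_values, gamma):
    """Target value: reward plus discounted best next-state Q-value (assumed form)."""
    return reward + gamma * max(next_q_values)


def mse_loss(targets, predictions):
    """Mean squared error between target values and training-network outputs."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)
```

With γ = 0 the target reduces to the current reward alone, which is the limiting case of the statement that a smaller discount factor places more weight on the current reward.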

In the embodiments of this application, the M2M devices in an H2H/M2M coexistence cellular network are equipped with SWIPT technology. In the complex interference environment caused by spectrum sharing, allocating spectrum, transmit power, and power split ratio resources to the tolerant M2M links achieves high EE performance while satisfying the QoS requirements of all user types.

In addition, in the embodiments of this application, the SWIPT-assisted H2H/M2M coexistence cellular network contains conventional H2H services and further divides M2M services into critical services and tolerant services. The M2M devices are grouped into clusters and establish communication links with their local machine device gateway, and all M2M devices are equipped with SWIPT technology so that they can harvest energy and decode information simultaneously. The H2H users and critical M2M users in the network are pre-allocated OFDM spectrum sub-bands, and the remaining tolerant M2M users reuse the same spectrum resources.

In the resource allocation method proposed in this application, the energy efficiency optimization problem is modeled as a multi-agent problem in which each tolerant M2M link acts as an agent, and a systematic design of the state, action, and reward function is proposed. With the behavior-tracking state set and the reward function so designed, the agents interact with the unknown network environment and select suitable spectrum sub-band, transmit power, and power split ratio resources from the action set more efficiently, in a distributed and cooperative manner, so as to optimize the overall M2M energy efficiency. This application also expresses the QoS requirement (QoS constraint) of each user type as an exact mathematical formula, which further reduces the computational complexity of model training.

Example 3

According to an embodiment of this application, another SWIPT-assisted, QoS-based downlink resource allocation method is provided. As shown in FIG. 4, the method includes the following steps.

Step S402: initialization.

The resource allocation system is initialized and, based on the pre-allocated spectrum sub-bands and transmit powers, prior information is obtained, such as the channel conditions of the spectrum sub-bands occupied by the H2H links and critical M2M links and their SINR or outage probability.

Step S404: build the neural networks and create an experience replay pool for each agent.

Each tolerant M2M link is treated as an agent, a training neural network and a target neural network are built, and an experience replay pool is created for each agent.

In this embodiment, the resource allocation problem is first modeled as a multi-agent intelligent decision problem in which each tolerant M2M link is an agent that interacts with the unknown network environment and optimizes its resource allocation policy. The resource allocation method then constructs a training network, a target network, and an experience replay pool for each agent, where the experience replay pool stores the data used to train the training network.

Step S406: design the state function, action function, and reward function for the agents.

Based on the prior information and on the action selected from its local state observation, each agent builds a link communication energy efficiency optimization model and stores the data returned by that model, such as the M2M system energy efficiency, the QoS indicators, and the interference received by the agent itself, in the experience replay pool.

Designing the state function, action function, and reward function for the agents includes the following steps:

S3-1: a trajectory tracking scheme that maps high-dimensional information to a low-dimensional representation is designed in the state set. The iteration number and the greedy factor are added to the state set so that, when an agent samples training data, it can track the environment state at iteration number e and exploration probability ε during training. This addresses the non-stationarity of the training environment caused by the multi-agent setting.

S3-2: the agent builds the link communication energy efficiency optimization model from the state observation and the selected action. This includes formulating the constraints on transmit power and power split ratio and the QoS requirements, where the QoS requirements are modeled according to the characteristics of each service. H2H services are usually voice and Internet traffic with high data rate requirements, so their QoS requirement is modeled as the SINR exceeding a set minimum threshold. Most M2M services are event-driven, and their traffic is usually generated periodically at a frequency determined by the needs of the service, so the QoS requirement of tolerant M2M services is modeled as the success rate of transmitting a packet of size V within the time constraint T. Critical M2M services usually have strict latency and transmission reliability requirements, so their QoS requirement is modeled as the outage probability not exceeding a set threshold.

S3-3: the reward function is designed as a combination of a main reward term and penalty terms, and all agents share the same reward function. This encourages the agents to cooperate in exploring the optimal resource allocation policy while accounting for the QoS requirements of all user types. In addition, to reduce the consumption of computing resources and speed up convergence, each QoS requirement is given an exact mathematical expression in the corresponding penalty term of the reward function.

S3-4: the link optimization model is used to calculate the total EE of the M2M devices in the network and the QoS level of each user type in the current time slot. The calculated EE, the QoS levels, the interference received by the tolerant M2M link, the selected action, the state observation, the reward obtained, and the prior information received by the machine device gateway are stored together in the experience replay pool as training data for the training network.

Step S408: train the resource allocation model and allocate resources.

Based on the link communication energy efficiency optimization model and the experience replay pool, each agent repeatedly draws data at random from the large amount of data stored in the experience replay pool to train the training network Q, and uses stochastic gradient descent to reduce the error between the training network and the target network parameters. After training has converged, the model produces resource allocations that satisfy the constraints and QoS requirements.

Training the training network on batches drawn at random from the experience replay pool allows current and past experience to be learned at the same time and effectively removes data correlation. Updating the target network parameters asynchronously keeps them fixed for a period of time, which makes the updates of the algorithm more stable.
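By way of illustration only, the asynchronous update of the target network might be sketched as a periodic copy of the training network's parameters. The parameter dictionaries, the update period `sync_every`, and the function name are assumptions made for this sketch.

```python
def maybe_sync_target(step, sync_every, online_params, target_params):
    """Copy the training network's parameters to the target every sync_every steps."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(online_params)  # target stays frozen until the next sync
        return True
    return False
```

Between two copies the target values are computed from unchanged parameters, so the regression target does not move with every gradient step.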

SWIPT technology has two receiver designs, "time switching" and "power splitting". Time switching alternates between a period of information decoding and a period of energy harvesting, whereas power splitting uses a power split ratio to divide the received power into an information decoding part and an energy harvesting part. This embodiment uses the power splitting design, so power split ratio allocation is included in the action set of the resource allocation method. In other embodiments, time switching may be used instead, with the power split ratio allocation replaced by a time allocation.
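By way of illustration only, the two receiver designs might be contrasted as in the following sketch. It assumes that the ratio rho (or the time fraction tau) is the share devoted to information decoding and that the remainder is harvested with conversion efficiency theta; this convention and the function names are assumptions made for this sketch.

```python
def power_splitting(received_power, rho, theta):
    """Return (decoding power, harvested power) for a power-splitting receiver."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return rho * received_power, theta * (1.0 - rho) * received_power


def time_switching(received_power, tau, theta, slot_s):
    """Return (decoding time, harvested energy) for a time-switching receiver."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return tau * slot_s, theta * received_power * (1.0 - tau) * slot_s
```

In both designs a single scalar in [0, 1] is the decision variable, which is why one design can be exchanged for the other in the action set without changing its structure.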

In existing approaches for H2H/M2M coexistence cellular networks, the complex network structure and the changing resource allocation tasks cause conventional resource allocation schemes to perform poorly. These schemes also ignore the variety of service types of machine user equipment and the interference caused by spectrum sharing, and network models that consider energy harvesting (EH) do not use SWIPT to obtain a stable and reliable energy supply. In practice, the prior art does not model the QoS requirement of each user type with exact mathematical expressions, which often leaves machine user equipment, as the party with low QoS requirements, with poor performance. In addition, the conventional algorithms proposed in the prior art have high computational complexity and perform poorly on optimization problems with nonlinear constraints, while the centralized training of the intelligent algorithms proposed there converges slowly.

Compared with existing intelligent resource allocation methods, the embodiments of this application introduce SWIPT-assisted M2M devices into the network structure to obtain a stable and reliable energy supply from the radio frequency environment.

In addition, the resource allocation method is implemented with multi-agent cooperation and distributed execution. A distributed multi-agent scheme lets each agent choose the allocation policy best suited to its own situation, whereas a centrally executed single-agent scheme tends toward one policy applied to all agents in the network. This embodiment therefore greatly reduces the load on the base station, and because the policy of every agent can be updated in each round of iteration, training converges much faster and better performance is achieved.

It should be noted that, for brevity, the foregoing method embodiments are described as a series of combined actions. Those skilled in the art will appreciate, however, that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. Those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by this application.

Example 4

According to an embodiment of the present application, a SWIPT-assisted, QoS-based downlink resource allocation device is also provided. As shown in FIG. 5, the device includes an acquisition module 542, a selection module 544 and an allocation module 546.

Based on its service type, the machine user equipment is divided into tolerant machine user equipment and critical machine user equipment. Critical machine user equipment is machine user equipment whose transmission reliability requirement is higher than a preset requirement threshold, and tolerant machine user equipment is machine user equipment that needs to complete the transmission of a periodically generated payload.

The acquisition module 542 is configured to acquire a state observation value of the current environment state of the communication link between the tolerant machine user equipment and the machine device gateway.

The selection module 544 is configured to select a resource allocation strategy for the tolerant machine user equipment based on the state observation value of the current environment state, using a resource allocation model constructed based on a neural network.

The allocation module 546 allocates downlink resources to the tolerant machine user equipment based on the selected resource allocation strategy.

The downlink resource allocation device in this embodiment can implement the downlink resource allocation method of the above embodiments, so the details are not repeated here.
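As an illustration only, the division of work among the acquisition, selection and allocation modules might be sketched as follows. The class and field names (`DownlinkAllocator`, `Strategy`, `score_fn`) are hypothetical, the state vector uses only a subset of the observations the application mentions, and the epsilon-greedy selection and the `score_fn` callable are assumptions standing in for the neural-network-based resource allocation model.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    """One candidate allocation: sub-band index, power level, split-ratio level."""
    subband: int
    power_level: int
    ratio_level: int


class DownlinkAllocator:
    """Hypothetical stand-in for the device with modules 542, 544 and 546."""

    def __init__(self, score_fn: Callable[[Sequence[float]], Sequence[float]],
                 strategies: List[Strategy], epsilon: float = 0.0,
                 rng: Optional[random.Random] = None) -> None:
        self.score_fn = score_fn      # placeholder for the neural-network model
        self.strategies = strategies
        self.epsilon = epsilon        # exploration rate (assumed epsilon-greedy)
        self.rng = rng or random.Random(0)

    def acquire(self, link: Dict[str, object]) -> List[float]:
        """Acquisition module: flatten the link measurements into a state vector."""
        return [*link["channel_gain"], *link["interference"],
                float(link["remaining_load"]), float(link["remaining_time"])]

    def select(self, observation: Sequence[float]) -> Strategy:
        """Selection module: explore with probability epsilon, else take the best score."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        scores = self.score_fn(observation)
        best = max(range(len(scores)), key=lambda i: scores[i])
        return self.strategies[best]

    def allocate(self, device_id: str, strategy: Strategy) -> Dict[str, object]:
        """Allocation module: emit the downlink grant for one tolerant device."""
        return {"device": device_id, "subband": strategy.subband,
                "power_level": strategy.power_level,
                "ratio_level": strategy.ratio_level}
```

In this sketch each tolerant link would own one `DownlinkAllocator`, which matches the distributed, per-agent execution described earlier; how the real device shares information among agents is not modeled here.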

Example 5

An embodiment of the present application provides a SWIPT-assisted, QoS-based downlink resource allocation system. As shown in FIG. 6, the system includes a base station (BS) 52, human user equipment (HUE) 60, a machine device gateway (MTCG) 54 and machine user equipment, where the machine user equipment includes critical machine user equipment (CMTCD) 56 and tolerant machine user equipment (TMTCD) 58.

The machine user equipment is equipped with SWIPT. SWIPT is embedded in M2M communication and gives the machine user equipment the ability to obtain stable and reliable energy from the radio frequency environment while decoding information at the same time.
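The simultaneous harvesting and decoding described above can be illustrated with a simple power-splitting receiver model. This is a minimal sketch under stated assumptions: the split ratio `rho` is taken to be the fraction of received power sent to the information decoder, the conversion efficiency `eta` and the function names are hypothetical, and the application's own signal model may differ.

```python
def dbm_to_watt(p_dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10 ** ((p_dbm - 30) / 10)


def power_split(received_w: float, rho: float, eta: float = 0.5):
    """Split received RF power between information decoding and energy harvesting.

    rho : fraction of received power routed to the decoder (assumed convention)
    eta : RF-to-DC conversion efficiency of the harvester (assumed value)
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    decode_w = rho * received_w
    harvested_w = eta * (1.0 - rho) * received_w
    return decode_w, harvested_w


# Example: 1 microwatt received, 80% routed to decoding.
decode_w, harvested_w = power_split(1e-6, rho=0.8)
```

A larger `rho` favors the decoder (higher SINR) at the cost of harvested energy, which is the trade-off a resource allocation scheme has to balance when it picks a split ratio.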

The set of spectrum sub-bands is K = H ∪ S, meaning that each orthogonal spectrum sub-band is pre-allocated to one human user equipment or one critical machine user equipment, where H = {1, 2, ..., H} denotes the human user equipment and S = {1, 2, ..., S} denotes the critical machine user equipment. In addition, M = {1, 2, ..., M} denotes the machine device gateways (or clusters), and N = {1, 2, ..., N} denotes the tolerant machine user equipment.
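To make the index sets concrete, the pre-allocation of one orthogonal sub-band per human user equipment or critical machine user equipment can be written as a small lookup table. The function name and the `("HUE", h)` / `("CMTCD", s)` labels are illustrative assumptions, not notation from the application.

```python
from typing import Dict, Tuple


def build_subbands(num_hue: int, num_cmtcd: int) -> Dict[int, Tuple[str, int]]:
    """Map each sub-band index k in K = H ∪ S to the device that pre-occupies it."""
    owners = [("HUE", h) for h in range(1, num_hue + 1)]
    owners += [("CMTCD", s) for s in range(1, num_cmtcd + 1)]
    # Sub-bands are numbered 1..|K|, with |K| = H + S.
    return {k: owner for k, owner in enumerate(owners, start=1)}


# Two human users and two critical machine users give four sub-bands.
subbands = build_subbands(2, 2)
```

Tolerant machine user equipment holds no sub-band of its own in this model; it reuses one of the entries in this table, which is why its allocation can interfere with the pre-assigned owner.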

The base station 52 establishes H2H communication links with the human user equipment 60, pre-allocates spectrum sub-bands and transmit power, and broadcasts prior information, such as the channel gain information and QoS indicators of all H2H links, to the machine device gateways 54 within its coverage area.

The machine device gateway 54 forms a cluster with machine user equipment and establishes multiple M2M communication links, and pre-allocates spectrum sub-bands, transmit power and power split ratios for the critical M2M links. One machine device gateway and several machine user equipment form one cluster.

The machine device gateway (MTCG) 54 may be the downlink resource allocation device of the above embodiment and can implement the downlink resource allocation method of the above embodiments, so the details are not repeated here.

Example 6

An embodiment of the present application further provides a storage medium configured to store program code for executing the downlink resource allocation method of the above embodiments.

Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.

Simulation test

The following parameters are used. The base station coverage radius is 500 m and the machine device gateway coverage radius is 30 m. There are 2 H2H user equipment and 2 machine device gateways. The coverage area of each machine device gateway contains one critical M2M device, and the number of tolerant M2M devices in that coverage area increases from 0 to 5. The system bandwidth is 4 MHz. Path loss is modeled as 128 + 37.6 log₁₀ d with d in km, shadow fading as a log-normal distribution with a standard deviation of 8 dB, and fast fading as Rayleigh fading. The ambient noise power σ² is -114 dBm. The transmit power P_BS,h allocated by the base station to H2H user equipment is 30 dBm, the transmit power P_m,s allocated by the machine device gateway to critical M2M user equipment is 23 dBm, and the power split ratio ρ_m,s is 0.8. The maximum transmit power allocated to tolerant M2M user equipment is 15 dBm, and the numbers of discrete levels L and Z for the transmit power and the power split ratio are 10 and 5, respectively. The SINR threshold for H2H service is 7 dB, the SINR threshold for critical M2M service is 5 dB, and the outage probability must not exceed 0.01. For tolerant M2M service, the payload transmission time constraint T is 100 ms and the payload size V is 3×1024 bytes. In addition, to reflect the practical effect of finite-precision digital signal processing, the received SINR of each user is capped at 30 dB.

The parameters required to run the algorithm are set as follows: discount factor γ = 0.5, constant μ = 1/50, and 3000 iterations, with the greedy factor ε decaying linearly from 1 to 0.01 over the first 80% of the iterations. The neural network of each tolerant M2M link consists of 3 fully connected hidden layers, and the number of neurons in each layer equals the number of available actions, that is, K×L×Z = 200. ReLU is used as the activation function, and the RMSProp optimizer updates the network parameters with a learning rate of 0.001.
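Several of the stated settings can be checked with a few lines of arithmetic. The sketch below evaluates the path loss model, the minimum average rate implied by delivering V = 3×1024 bytes within T = 100 ms, a linear greedy-factor schedule, and one possible flat indexing of the K×L×Z = 4×10×5 = 200 actions. The function names, the exact interpolation of the schedule and the index ordering are assumptions made for illustration; the application does not specify them.

```python
import math


def path_loss_db(d_km: float) -> float:
    """Path loss model from the simulation setup: 128 + 37.6 * log10(d), d in km."""
    return 128.0 + 37.6 * math.log10(d_km)


def required_rate_bps(payload_bytes: int, deadline_s: float) -> float:
    """Minimum average rate needed to deliver the payload before the deadline."""
    return payload_bytes * 8 / deadline_s


def epsilon_at(iteration: int, total: int = 3000, start: float = 1.0,
               end: float = 0.01, decay_fraction: float = 0.8) -> float:
    """Greedy factor: linear decay over the first 80% of iterations, then constant."""
    decay_steps = round(total * decay_fraction)
    if iteration >= decay_steps:
        return end
    return start + (end - start) * iteration / decay_steps


def decode_action(index: int, levels_l: int = 10, levels_z: int = 5):
    """Unpack a flat action index into (sub-band, power level, ratio level)."""
    k, rest = divmod(index, levels_l * levels_z)
    l, z = divmod(rest, levels_z)
    return k, l, z


num_actions = 4 * 10 * 5                     # K = 2 HUE + 2 CMTCD sub-bands
rate = required_rate_bps(3 * 1024, 0.1)      # about 245.76 kbit/s
edge_loss = path_loss_db(0.03)               # loss at the 30 m gateway cell edge
```

With these settings every tolerant link must sustain roughly 246 kbit/s on average to meet its payload deadline, which gives a sense of how tight the 100 ms constraint is relative to the 4 MHz system bandwidth.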

The method proposed in the present application for allocating spectrum sub-bands, transmit power and power split ratio is named, after its characteristics, the multi-agent SWIPT-assisted adaptive spectrum-power-ratio allocation method (MA-SWIPT-ASPRA). It is compared with three efficient intelligent resource allocation methods: (1) the single-agent SWIPT-assisted adaptive spectrum-power-ratio allocation method (SA-SWIPT-ASPRA), which differs from the method disclosed in the present application only in that multiple M2M links making resource allocation decisions simultaneously and in a distributed manner are replaced by a single M2M link making decisions asynchronously; (2) the adaptive spectrum-power-ratio allocation method without SWIPT assistance (MA-Non-SWIPT-ASPRA), which differs from the method disclosed in the present application only in that the SWIPT function is removed; and (3) the Q-learning-based SWIPT-assisted spectrum-power-ratio allocation method (QL-SWIPT-ASPRA), which is the most classic intelligent resource allocation scheme based on reinforcement learning.

FIG. 7 shows how the total energy efficiency of the M2M devices changes as the number of M2M devices connected to the system increases. As the figure shows, the total energy efficiency first rises and then falls as the number of M2M devices increases. The reasons are as follows:

When only critical M2M devices are present in the network, that is, when the number of M2M devices is 2, the SWIPT and non-SWIPT schemes achieve almost the same total energy efficiency. At this point there is no interference caused by spectrum sharing in the network, and the critical M2M devices, being close to their local machine device gateway, have strong channel gains and therefore obtain very high SINR values. In addition, in the SWIPT schemes the critical M2M devices use the vast majority of the received energy for information decoding and only a small part for harvesting, and the orders-of-magnitude difference between the two makes the energy efficiency nearly identical.

When the number of M2M devices in the network increases from 2 to 6, the total energy efficiency achieved by each scheme increases and reaches its maximum at 6 devices. This is because spectrum reuse improves spectrum efficiency and thereby network energy efficiency. With 6 M2M devices, there are 4 tolerant M2M links reusing the spectrum resources occupied by 2 H2H links and 2 critical M2M links, so each scheme can assign every tolerant M2M link one pre-occupied spectrum sub-band without generating additional intra-cluster interference. The spectrum efficiency of the network is therefore highest at this point, which in turn yields the highest energy efficiency.

When the number of M2M devices in the network increases from 6 to 12, the energy efficiency achieved by each scheme decreases. Excessive spectrum reuse generates both intra-cluster and inter-cluster interference, so interfering links become widespread and power consumption rises sharply, which leads to poor energy efficiency.

In terms of the conditions under which each scheme remains effective, the energy efficiency of the non-SWIPT scheme drops rapidly once the number of M2M devices exceeds 6, while the SWIPT schemes maintain a relatively high energy efficiency until the number of M2M devices exceeds 8. Compared with the non-SWIPT scheme, a SWIPT scheme can convert a large amount of interference power into harvested energy at the cost of a small reduction in spectrum efficiency. Based on this trade-off between spectrum efficiency and power consumption, the SWIPT schemes perform well in an environment with a certain level of interference. The performance decline flattens as the number of M2M devices keeps increasing, because by then spectrum efficiency is already severely degraded, leaving less and less room for further decline.

In terms of performance, the resource allocation scheme proposed in the present application achieves the highest energy efficiency, and its energy efficiency hardly decreases when the number of M2M devices increases from 6 to 8, indicating that it strikes a better balance between spectrum efficiency and energy consumption. This is because the multi-agent setting lets the training process run in a distributed manner, fully accounting for the conditions of each agent at each moment and producing the resource allocation strategy best suited to each agent. The single-agent scheme with centralized training performs slightly worse because its resource allocation strategy is more generic: it can be applied to every agent, but to some extent ignores the situation of each agent at each moment. The Q-learning scheme performs poorly because the network environment is complex and the state and action sets are large, so the table lookup of traditional reinforcement learning is computationally inefficient and overlooks some well-performing resource allocation strategies during iteration.

FIG. 8 shows how the QoS requirement satisfaction rate of H2H users changes as the number of tolerant M2M devices connected to the system increases. As the figure shows, the satisfaction rate decreases as the number of tolerant M2M devices increases. Adding tolerant M2M devices increases the number of links that reuse the spectrum occupied by the H2H links, which raises the interference power at the H2H receivers and lowers their SINR. In addition, the satisfaction rate in the non-SWIPT scheme reacts more sharply to the increase in tolerant M2M devices, dropping quickly even when only a few M2M devices are present. There are two reasons for this. First, in the non-SWIPT scheme, tolerant M2M links prefer to share spectrum resources with the H2H links: the M2M receiver of a tolerant link is in most cases far from the base station and so suffers little interference from it, allowing a high energy efficiency, while the critical M2M links managed by the machine device gateway also achieve high energy efficiency when no other user reuses their sub-bands. Second, in the SWIPT schemes, tolerant M2M links prefer to share spectrum resources with the critical M2M links, because SWIPT can convert received interference into energy and thus improve energy efficiency. Among all schemes, the resource allocation method proposed in the present application achieves the best H2H QoS requirement satisfaction rate, confirming that it performs better in guaranteeing user QoS requirements.

FIG. 9 shows how the outage probability of the critical M2M links changes as the number of tolerant M2M devices connected to the system increases. As FIG. 9 shows, the outage probability of the critical M2M links rises as the number of tolerant M2M devices increases, because more spectrum access brings a higher outage probability. In addition, the non-SWIPT scheme performs best here. Besides the spectrum reuse priority described for FIG. 2, this is because the critical M2M devices in the non-SWIPT scheme use all the energy they receive for information decoding, so they obtain higher SINR values and a lower outage probability. The SWIPT schemes, by contrast, sacrifice some link transmission reliability to obtain higher energy efficiency. Among the SWIPT schemes, the resource allocation method proposed in the present application performs best, confirming the high reliability of the proposed scheme.

Referring to FIG. 4, the following describes how the payload transmission success rate of tolerant M2M users changes as the number of tolerant M2M devices connected to the system increases. As the figure shows, the payload transmission success rate decreases as the number of tolerant M2M devices increases. Excessive spectrum reuse brings more interference and power consumption, which reduces the capacity each link has available for transmitting its payload. The non-SWIPT scheme performs worst because, without energy harvesting, each tolerant M2M link can maintain a high energy efficiency only by choosing a lower transmit power, which also reduces capacity. In the SWIPT schemes, with energy harvesting available, each tolerant M2M link is more inclined to accept a certain amount of interference in exchange for higher energy efficiency, so it is willing to choose a higher transmit power level, and link capacity rises accordingly. The resource allocation method proposed in the present application performs best among all schemes, further confirming its effectiveness.

The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present application.

Claims

1. A SWIPT-assisted downlink resource allocation method, characterized by comprising the following steps:

obtaining a state observation value of a current environment state of a communication link between tolerant machine user equipment and a machine device gateway;

based on the state observation value of the current environment state, selecting a resource allocation strategy for the tolerant machine user equipment by using a resource allocation model constructed based on a neural network;

allocating downlink resources to the tolerant machine user equipment based on the selected resource allocation strategy;

wherein the tolerant machine user equipment is machine user equipment that needs to complete a transmission task of a periodically generated payload.

2. The method of claim 1, wherein, prior to obtaining the state observation value of the current environment state of the communication link between the tolerant machine user equipment and the machine device gateway, the method further comprises:

dividing the machine user equipment into tolerant machine user equipment and critical machine user equipment based on the service type of the machine user equipment;

wherein the critical machine user equipment is machine user equipment whose transmission reliability requirement is higher than a preset requirement threshold.

3. The method of claim 2, wherein the resource allocation model is constructed based on the following method:

constructing a state function, wherein the state function is a set of state observations;

constructing an action function, wherein the action function is a set of downlink spectrum resources, a transmission power level and a power split ratio;

constructing a reward function based on a balance of resource allocation optimization objectives and QoS constraints;

and constructing the resource allocation model based on the state function, the action function and the reward function.

4. The method of claim 3, wherein constructing the state function comprises: constructing the state function based on channel gain information of the communication link on each spectrum sub-band, the interference power of the communication link on each spectrum sub-band, the remaining payload and remaining transmission time of the communication link, the current iteration number, and a greedy factor representing the current environment exploration rate of the communication link.

5. The method of claim 3, wherein constructing a reward function comprises:

determining a reward function penalty term provided by the tolerant machine user equipment based on the remaining payload and the remaining transmission time of the tolerant machine user equipment;

determining a reward function penalty term provided by the critical machine user equipment based on an outage probability of the critical machine user equipment;

determining a reward function penalty term of the human user equipment based on the signal-to-interference-and-noise ratio of the human user equipment;

constructing the reward function based on the total energy efficiency of the communication links of all machine user equipment, the reward function penalty term provided by the tolerant machine user equipment, the reward function penalty term provided by the critical machine user equipment, and the reward function penalty term of the human user equipment.

6. The method of claim 4, wherein constructing the reward function based on a balance of resource allocation optimization objectives and QoS constraints comprises:

setting QoS constraint conditions;

taking the total energy efficiency achieved by the tolerant machine user equipment and the critical machine user equipment as the resource allocation optimization objective;

building the reward function based on the resource allocation optimization objective and the QoS constraints.

7. The method of claim 6, wherein setting the QoS constraints comprises:

setting the QoS constraint condition of the human user equipment to be that a signal to interference plus noise ratio (SINR) is larger than a set lowest threshold;

setting the QoS constraint condition of the tolerant machine user equipment to be that the transmission success rate of the load with the preset size V in the time constraint T is higher than a set success rate threshold;

setting the QoS constraint for a critical machine user equipment to be an outage probability not higher than a set outage threshold.

8. The method of claim 3, wherein after obtaining the state observation value of the current environment state of the communication link between the tolerant machine user equipment and the machine device gateway, the method further comprises:

constructing an experience replay pool for storing training data for training the resource allocation model, wherein the training data comprises state observation values, reward values and selected resource allocation strategies at the current time and the next time;

wherein the neural network comprises a training network and a target network; in each iteration the training network is trained by stochastic gradient descent using data randomly sampled from the experience replay pool; and the target network is a fixed neural network whose parameters are updated at intervals to the parameters of the training network at the current moment.

9. The method of any one of claims 2 to 8, wherein the machine user equipment is provided with simultaneous wireless information and power transfer (SWIPT) for enabling the machine user equipment to obtain energy from a radio frequency environment and simultaneously decode information.

10. A SWIPT-assisted downlink resource allocation device, characterized by comprising:

an acquisition module configured to acquire a state observation value of a current environment state of a communication link between tolerant machine user equipment and a machine device gateway;

a selection module configured to select a resource allocation strategy for the tolerant machine user equipment based on the state observation value of the current environment state, using a resource allocation model constructed based on a neural network;

an allocation module configured to allocate downlink resources to the tolerant machine user equipment based on the selected resource allocation strategy;