WO2023226183A1 - Multi-base-station queuing preamble allocation method based on multi-agent collaboration - Google Patents

A multi-base-station queuing preamble allocation method based on multi-agent collaboration Download PDF

Info

Publication number
WO2023226183A1
WO2023226183A1 PCT/CN2022/107420
Authority
WO
WIPO (PCT)
Prior art keywords
agent
action
agents
preamble
state
Prior art date
Application number
PCT/CN2022/107420
Other languages
English (en)
French (fr)
Inventor
孙君
过萌竹
陆音
Original Assignee
南京邮电大学
Priority date
Filing date
Publication date
Application filed by 南京邮电大学 filed Critical 南京邮电大学
Publication of WO2023226183A1 publication Critical patent/WO2023226183A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W74/00Wireless channel access
    • H04W74/08Non-scheduled access, e.g. ALOHA
    • H04W74/0833Random access procedures, e.g. with 4-step access
    • H04W74/0841Random access procedures, e.g. with 4-step access with collision treatment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/086Load balancing or load distribution among access entities
    • H04W28/0861Load balancing or load distribution among access entities between base stations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/09Management thereof
    • H04W28/0958Management thereof based on metrics or performance parameters
    • H04W28/0967Quality of Service [QoS] parameters
    • H04W28/0975Quality of Service [QoS] parameters for reducing delays

Definitions

  • the invention belongs to the field of wireless communication technology, and specifically relates to a preamble allocation method for random access of large-scale machine equipment in the Internet of Things.
  • Massive machine type communications is one of the three major application scenarios of the fifth generation mobile communication technology.
  • Machine type communication is a key technology of the fifth generation new radio (5G NR), playing a huge role in important and critical application scenarios such as telemedicine, autonomous driving, and intelligent transportation.
  • Machine type communication is also called M2M communication.
  • Unlike human-to-human (H2H) communication, M2M communication mainly occurs on the uplink, involves a huge number of terminals, and consists of short but frequent transmissions.
  • The purpose of the present invention is to provide a multi-base-station queuing preamble allocation method based on multi-agent cooperation.
  • Aiming at the congestion that occurs when massive numbers of agents perform random access in a multi-base-station, multi-cell scenario, a non-competitive preamble allocation method is proposed.
  • the target area includes a network composed of at least two base stations, and each base station includes a preamble pool.
  • For each agent that accesses the network, the following steps S1 to S3 are performed to complete the preamble allocation of each agent;
  • According to the service type of each agent, the agents accessing the network are grouped, the average delay tolerance is calculated for each group of agents, and the average delay tolerances of the groups are arranged in ascending order to obtain a priority set;
  • each preamble corresponds to a queue.
  • The state space S is constructed from the maximum queue length of each queue.
  • The action space A is constructed from the action of an agent selecting a preamble to queue on. Taking the state space S as input, based on a deep neural network combined with the Q-learning method, the agent follows a greedy strategy with the goal of maximizing its return, selects an action in the action space A as its executable action, and takes the Q value of the executable action as the output to build the local agent preamble allocation model;
  • In step S1, the agents accessing the network are grouped according to service type, the average delay tolerance is calculated for each group of agents, and the average delay tolerances of the groups are arranged in ascending order; the specific steps to obtain the priority set are as follows:
  • S11: According to the delay requirements of the agents' services, the similarity between the agents' services is calculated, where c(i,j) is the similarity between service i and service j, t_i is the delay requirement of service i, t_j is the delay requirement of service j, σ is the similarity coefficient, and 0 ≤ c(i,j) ≤ 1;
  • The services of agents whose similarity difference is less than a preset value are regarded as services of the same type, and the corresponding agents are placed in the same group;
  • S12: For each group of agents, the average delay tolerance is calculated, where N_k denotes the number of agents in the k-th group and the resulting value denotes the average delay tolerance of the k-th group of agents;
  • S13: The average delay tolerance of each group of agents is calculated separately, where n is the number of agent groups.
  • The average delay tolerances of the groups are arranged in ascending order and assigned priorities.
  • The priority order is that the agent group with the smallest average delay tolerance is given the highest priority and the agent group with the largest average delay tolerance is given the lowest priority, yielding a priority set consisting of the priorities of the agent groups.
  • In step S2, the specific steps are as follows:
  • S21: Each preamble corresponds to a queue, and the state is constructed from the maximum queue length of each queue at time t as s_t = {p_1, p_2, ..., p_i, ..., p_M}, where s_t is the state at time t, p_i is the maximum queue length of the i-th queue, i ∈ {1, 2, ..., M}, and M is the total number of queues.
  • The state space S is constructed from the states from the initial time to time t as S = {s_0, s_1, ..., s_t}, where s_0, s_1, ..., s_t denote the states from the initial time to time t and s_0 is the state at the initial time;
  • S22: When an agent accesses the network, it selects one of the queues corresponding to the M preambles to queue on.
  • The action space A is constructed from the action of an agent selecting a preamble to queue on as A = {a_1, a_2, ..., a_i, ..., a_M}, where a_i denotes an action strategy of the agent, that is, the action of selecting the i-th preamble to queue on;
  • S23: In the reward function, f_i(a_1, a_2, ..., a_n) denotes the priority of agent i and g_i(a_1, a_2, ..., a_n) denotes the variance of the queues;
  • S24: A local agent preamble allocation model is constructed, taking the state space S as the input and the Q values of the agent's executable actions as the output. Each action of the agent in state s_t corresponds to a Q value Q(s_t, a_t), where a denotes all executable actions in state s_t;
  • According to the Q-learning algorithm, the Q value Q_{k+1}(s_t, a_t) at the next moment is updated, where α_k and γ are the learning rate and discount factor respectively, s_{t+1} denotes the state at the next moment, r_{t+1} denotes the reward obtained by the agent's executable action in state s_{t+1}, a′ denotes an executable action of the agent in state s_{t+1}, A is the action space, and Q_k(s_t, a_t) denotes the Q value in state s_t;
  • In the loss function, θ denotes the weights of the online network, a′_i denotes the action that maximizes the Q value of the target network in state s′, and θ⁻ denotes the target network weights;
  • In step S2, the local agent preamble allocation model is trained a preset number of times before the state is updated.
  • Each agent uses the ε-greedy strategy to select an action a_i: with probability equal to the exploration factor ε it selects an action strategy from the action space A at random, and with probability (1-ε) it selects the optimal action strategy in the action space A.
  • In step S3, based on the federated learning method, the specific steps for training the global agent preamble allocation model are as follows:
  • Each agent inputs the current state into the deep neural network of its respective local agent preamble allocation model for learning, obtains the parameters of its local agent preamble allocation model, and sends them to the federated agent;
  • The federated agent uses an aggregation-averaging algorithm to learn from the parameters of the local agent preamble allocation models and obtains the global agent preamble allocation model.
  • In the parameters of the global agent preamble allocation model, θ_g is the weight of the global agent preamble allocation model, θ_l is the weight of a local agent preamble allocation model, D is the total amount of training data, and D_k denotes the amount of data owned by the k-th participant.
  • the advantages of the present invention include:
  • non-competitive queuing access can solve the collision problem and allow more agents to access under the same conditions.
  • the agent in the present invention uses a multi-agent reinforcement learning algorithm to collaboratively select an appropriate preamble.
  • This learning algorithm can better adapt to environmental changes and make optimal decisions.
  • Figure 1 is a schematic diagram of agent grouping according to an embodiment of the present invention.
  • Figure 2 is a schematic diagram of an intelligent agent access network provided according to an embodiment of the present invention.
  • Figure 3 is a structural diagram of an agent neural network provided according to an embodiment of the present invention.
  • Figure 4 is a diagram of a federated training model provided according to an embodiment of the present invention.
  • the embodiment of the present invention provides a multi-base station queuing preamble allocation method based on multi-agent collaboration.
  • the target area includes a network composed of at least two base stations, and each base station includes a preamble pool.
  • For each agent that accesses the network, the following steps S1 to S3 are performed to complete the preamble allocation of each agent;
  • The agent is an MTC device, and each agent has its own service type. According to the service type of each agent, the agents accessing the network are grouped, the average delay tolerance is calculated for each group of agents, and the average delay tolerances of the groups are arranged in ascending order to obtain the priority set.
  • A schematic diagram of agent grouping is shown in Figure 1.
  • The specific steps of step S1 are as follows:
  • each type of service is divided into delay-tolerant services and delay-sensitive services according to its sensitivity to delay.
  • the QoS requirements of each agent need to be considered. Since there are many agents accessing the network at the same time, the types of services accessed at the same time are also different.
  • the delay requirements of each service type are used to measure the correlation of service types. According to the delay requirements of each agent's service, the similarity of each agent's service is calculated as follows:
  • where c(i,j) is the similarity between service i and service j, t_i is the delay requirement of service i, t_j is the delay requirement of service j, σ is the similarity coefficient, and 0 ≤ c(i,j) ≤ 1; the larger c(i,j) is, the more similar the two services are;
  • The services of agents whose similarity difference is less than a preset value are regarded as services of the same type, and the corresponding agents are placed in the same group;
  • S12: For each group of agents, the average delay tolerance is calculated, where N_k denotes the number of agents in the k-th group and the resulting value denotes the average delay tolerance of the k-th group of agents;
  • S13: The average delay tolerance of each group of agents is calculated separately, where n is the number of agent groups.
  • The average delay tolerances of the groups are arranged in ascending order and assigned priorities.
  • The priority order is that the agent group with the smallest average delay tolerance is given the highest priority and the agent group with the largest average delay tolerance is given the lowest priority, yielding a priority set consisting of the priorities of the agent groups.
  • Reinforcement learning is used to solve Markov decision process problems.
  • In reinforcement learning, an agent can periodically learn to take actions, observe the maximum return, and automatically adjust its action strategy to obtain the optimal action strategy. Because the agents are grouped, multiple agents learn through interaction with the network. In a competitive game setting, the multiple agents can reach a local optimum but cannot maximize the overall network performance. To reach the goal of the optimization problem, the multi-agent problem is transformed into a cooperative game, and the same reward function is used for all agents.
  • Each preamble corresponds to a queue.
  • The state space S is constructed from the maximum queue length of each queue.
  • The action space A is constructed from the action of an agent selecting a preamble to queue on.
  • Following a greedy strategy with the goal of maximizing its return, the agent selects an action in the action space A as its executable action, and the Q value of the agent's executable action is taken as the output to build the local agent preamble allocation model.
  • the schematic diagram of the agent access network is shown in Figure 2.
  • R_1, R_2, ..., R_{M-1}, and R_M denote the preambles;
  • The specific steps of step S2 are as follows:
  • S21: Each preamble corresponds to a queue, and the state is constructed from the maximum queue length of each queue at time t as s_t = {p_1, p_2, ..., p_i, ..., p_M}, where s_t is the state at time t, p_i is the maximum queue length of the i-th queue, i ∈ {1, 2, ..., M}, and M is the total number of queues.
  • The state space S is constructed from the states from the initial time to time t as S = {s_0, s_1, ..., s_t}, where s_0, s_1, ..., s_t denote the states from the initial time to time t and s_0 is the state at the initial time;
  • S22: When an agent accesses the network, it selects one of the queues corresponding to the M preambles to queue on.
  • The action space A is constructed from the action of an agent selecting a preamble to queue on as A = {a_1, a_2, ..., a_i, ..., a_M}, where a_i denotes an action strategy of the agent, that is, the action of selecting the i-th preamble to queue on;
  • S23: In the reward function, f_i(a_1, a_2, ..., a_n) denotes the priority of agent i; the agent with the highest priority obtains the largest reward when entering a queue, and g_i(a_1, a_2, ..., a_n) denotes the variance of the queues;
  • S24: In the Q-value expression, a denotes all executable actions in state s_t;
  • According to the Q-learning algorithm, the Q value Q_{k+1}(s_t, a_t) at the next moment is updated, where α_k and γ are the learning rate and discount factor respectively, s_{t+1} denotes the state at the next moment, r_{t+1} denotes the reward obtained by the agent's executable action in state s_{t+1}, a′ denotes an executable action of the agent in state s_{t+1}, A is the action space, and Q_k(s_t, a_t) denotes the Q value in state s_t;
  • In the loss function, θ denotes the weights of the online network, a′_i denotes the action that maximizes the Q value of the target network in state s′, and θ⁻ denotes the target network weights;
  • In step S2, the local agent preamble allocation model is trained a preset number of times before the state is updated.
  • Each agent uses the ε-greedy strategy to select an action a_i: with probability equal to the exploration factor ε it selects an action strategy from the action space A at random, and with probability (1-ε) it selects the best action strategy in the action space A.
  • A federated training method is used to synchronously optimize each agent's neural network through averaged optimization of the neural-network gradients.
  • Each agent optimizes its own neural network using its local experience and the neural-network gradients from other collaborating agents.
  • A federated agent is designed to collect the local gradients of the participating agents and perform the averaged optimization.
  • This federated agent has the same neural network structure as the other agents but takes no actions.
  • In step S3, based on the federated learning method, the specific steps for training the global agent preamble allocation model are as follows:
  • Each agent inputs the current state into the deep neural network of its respective local agent preamble allocation model for learning, obtains the parameters of its local agent preamble allocation model, and sends them to the federated agent;
  • The federated agent uses an aggregation-averaging algorithm to learn from the parameters of the local agent preamble allocation models and obtains the global agent preamble allocation model.
  • In the parameters of the global agent preamble allocation model, θ_g is the weight of the global agent preamble allocation model, θ_l is the weight of a local agent preamble allocation model, D is the total amount of training data, and D_k denotes the amount of data owned by the k-th participant.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-base-station queuing preamble allocation method based on multi-agent collaboration. Aiming at the congestion that occurs when massive numbers of agents perform random access, a non-competitive preamble allocation method is proposed for the multi-base-station, multi-cell scenario. Devices select preambles in a queued manner based on deep reinforcement learning, and a federated learning training method is adopted, effectively solving the congestion that arises under contention-based access. First, newly accessing agents are grouped and priorities are set according to their delay tolerance times; next, a multi-agent reinforcement learning algorithm assigns the agents to idle queues in a reasonable way; finally, a federated training method synchronously optimizes each agent's neural network through averaged optimization of the neural-network gradients, completing the preamble allocation of each agent.

Description

A Multi-Base-Station Queuing Preamble Allocation Method Based on Multi-Agent Collaboration
Technical Field
The invention belongs to the field of wireless communication technology, and specifically relates to a preamble allocation method for the random access of massive machine-type devices in the Internet of Things.
Background Art
Massive machine-type communication (mMTC) is one of the three major application scenarios of fifth-generation mobile communication technology. Machine-type communication is a key technology of fifth-generation New Radio (5G NR) and plays a major role in important and critical application scenarios such as telemedicine, autonomous driving, and intelligent transportation. Machine-type communication (MTC) is also called M2M communication. Unlike human-to-human (H2H) communication, M2M communication mainly occurs on the uplink, involves a huge number of terminals, and consists of short but frequent transmissions. Under traditional access methods, MTC devices always select the evolved Node B with the best signal quality for access, so large numbers of MTC devices collide, causing network congestion and severely reducing the device access success rate. How to design a reasonable scheme for the random access of massive MTC devices (MTCDs) has therefore become a key issue for 5G mobile communication systems. The most promising solutions use reinforcement learning to formulate preamble allocation schemes in which the devices make their own decisions and select suitable preambles, minimizing the collisions that occur during random access. In these schemes, however, devices compete with one another for preambles; as the number of devices keeps growing, collisions become unavoidable and the access success rate keeps falling. A reasonable preamble allocation scheme is therefore needed to reduce or even avoid collisions for massive MTCD random access.
Summary of the Invention
Object of the invention: to provide a multi-base-station queuing preamble allocation method based on multi-agent collaboration. Aiming at the congestion that occurs when massive numbers of agents perform random access, a non-competitive preamble allocation method is proposed for the multi-base-station, multi-cell scenario.
To realize the above functions, the invention designs a multi-base-station queuing preamble allocation method based on multi-agent collaboration. The target area contains a network composed of at least two base stations, and each base station includes a preamble pool. For each agent accessing the network, the following steps S1 to S3 are performed to complete the preamble allocation of each agent;
S1. Group the agents accessing the network according to their service types, calculate the average delay tolerance of each group of agents, and arrange the average delay tolerances of the groups in ascending order to obtain a priority set;
S2. For each group of agents, allocate preambles to the agents in the group based on a reinforcement learning algorithm;
Each preamble corresponds to a queue. The state space S is constructed from the maximum queue length of each queue, and the action space A is constructed from the action of an agent selecting a preamble to queue on. Taking the state space S as input, based on a deep neural network combined with the Q-learning method, each agent, following a greedy strategy with the goal of maximizing its return, selects an action in the action space A as its executable action; taking the Q value of the agent's executable action as output, a local agent preamble allocation model is constructed;
S3. Based on the local agent preamble allocation models of the agents and a federated agent, construct a global agent preamble allocation model; train the global agent preamble allocation model with a federated learning method to obtain a trained global agent preamble allocation model; and apply the global agent preamble allocation model to complete the preamble allocation of each agent accessing the network.
As a preferred technical solution of the invention, the specific steps in step S1 of grouping the agents accessing the network according to service type, calculating the average delay tolerance of each group of agents, and arranging the average delay tolerances in ascending order to obtain the priority set are as follows:
S11: According to the delay requirements of the agents' services, the similarity between the agents' services is calculated by the following formula:
Figure PCTCN2022107420-appb-000001
where c(i,j) is the similarity between service i and service j, t_i is the delay requirement of service i, t_j is the delay requirement of service j, σ is the similarity coefficient, and 0 ≤ c(i,j) ≤ 1;
According to the similarities of the agents' services, services whose similarity difference is less than a preset value are regarded as services of the same type, and the corresponding agents are placed in the same group;
S12: For each group of agents, the average delay tolerance is calculated by the following formula:
Figure PCTCN2022107420-appb-000002
where N_k denotes the number of agents in the k-th group, and
Figure PCTCN2022107420-appb-000003
denotes the average delay tolerance of the k-th group of agents;
S13: The average delay tolerance of each group of agents is calculated separately, denoted as
Figure PCTCN2022107420-appb-000004
where n is the number of agent groups. The average delay tolerances of the groups are arranged in ascending order and priorities are assigned in turn: the agent group with the smallest average delay tolerance is given the highest priority and the agent group with the largest average delay tolerance is given the lowest priority, yielding a priority set composed of the priorities of the agent groups.
As a preferred technical solution of the invention, the specific steps of step S2 are as follows:
S21: Each preamble corresponds to a queue, and the state is constructed from the maximum queue length of each queue at time t as follows:
s_t = {p_1, p_2, ..., p_i, ..., p_M}
where s_t is the state at time t, p_i is the maximum queue length of the i-th queue, i ∈ {1, 2, ..., M}, and M is the total number of queues;
The state space S is constructed from the states from the initial time to time t as follows:
S = {s_0, s_1, ..., s_t}
where s_0, s_1, ..., s_t denote the states from the initial time to time t, and s_0 is the state at the initial time;
S22: When an agent accesses the network, it selects one of the queues corresponding to the M preambles to queue on, and the action space A is constructed from the action of an agent selecting a preamble to queue on as follows:
A = {a_1, a_2, ..., a_i, ..., a_M}
where a_i denotes an action strategy of the agent, that is, the action of selecting the i-th preamble to queue on;
S23: The action strategies a_1, a_2, ..., a_n selected and executed by the agents correspond to rewards r_1, r_2, ..., r_n respectively, and the reward function R is constructed as follows:
R = r_i(r_1, r_2, ..., r_n)
Introducing the agents' priorities and the variance of the queues, the reward function R is transformed into the following form:
Figure PCTCN2022107420-appb-000005
where f_i(a_1, a_2, ..., a_n) denotes the priority of agent i and g_i(a_1, a_2, ..., a_n) denotes the variance of the queues;
S24: Based on a deep neural network combined with the Q-learning method, the local agent preamble allocation model is constructed, with the state space S as input and the Q values of the agent's executable actions as output. Each action of the agent in state s_t corresponds to a Q value Q(s_t, a_t), where a_t is given by the following formula:
Figure PCTCN2022107420-appb-000006
where a denotes all executable actions in state s_t;
According to the Q-learning algorithm, the Q value Q_{k+1}(s_t, a_t) at the next moment is updated by the following formula:
Figure PCTCN2022107420-appb-000007
where α_k and γ are the learning rate and discount factor respectively, s_{t+1} denotes the state at the next moment, r_{t+1} denotes the reward obtained by the agent's executable action in state s_{t+1}, a′ denotes an executable action of the agent in state s_{t+1}, A is the action space, Q_k(s_t, a_t) denotes the Q value in state s_t, and
Figure PCTCN2022107420-appb-000008
denotes the maximum Q value over the executable actions in the action space A in state s_{t+1};
S25: Update the state s_{t+1} and its corresponding reward r_{t+1}, construct the experience sample (s_t, a_t, s_{t+1}, r_{t+1}), and store it in the experience replay buffer;
S26: The loss function L_i(θ) of each agent is as follows:
L_i(θ) = E[(y_DQN - Q_k(s_t, a_t; θ))^2]
where θ denotes the weights of the online network;
and y_DQN is computed as follows:
Figure PCTCN2022107420-appb-000009
where a′_i denotes the action that maximizes the Q value of the target network in state s′, and θ⁻ denotes the target network weights;
S27: Randomly sample the experience samples in the replay buffer to train the local agent preamble allocation model.
As a preferred technical solution of the invention, in step S2 the local agent preamble allocation model is trained a preset number of times before the state is updated.
As a preferred technical solution of the invention, in step S2 each agent uses the ε-greedy strategy to select an action a_i: with probability equal to the exploration factor ε it selects an action strategy from the action space A at random, and with probability (1-ε) it selects the best action strategy in the action space A.
As a preferred technical solution of the invention, the specific steps in step S3 of training the global agent preamble allocation model based on the federated learning method are as follows:
S31: All agents select action strategies according to the current state and obtain the corresponding rewards;
S32: Each agent inputs the current state into the deep neural network of its own local agent preamble allocation model for learning, obtains the parameters of its local agent preamble allocation model, and sends them to the federated agent;
S33: The federated agent uses an aggregation-averaging algorithm to learn from the parameters of the local agent preamble allocation models and obtains the global agent preamble allocation model, whose parameters are given by the following formula:
Figure PCTCN2022107420-appb-000010
where θ_g is the weight of the global agent preamble allocation model, θ_l is the weight of a local agent preamble allocation model, D is the total amount of training data, and D_k denotes the amount of data owned by the k-th participant.
Beneficial effects: compared with the prior art, the advantages of the invention include:
(1) Unlike the traditional contention-based preamble approach, non-competitive queued access solves the collision problem and allows more agents to access the network under the same conditions.
(2) When making decisions, the agents in the invention use a multi-agent reinforcement learning algorithm to collaboratively select suitable preambles; this learning algorithm adapts better to environmental changes and makes optimal decisions.
(3) Using federated learning for training improves the performance of reinforcement learning and produces a more robust model.
Brief Description of the Drawings
Figure 1 is a schematic diagram of agent grouping provided according to an embodiment of the invention;
Figure 2 is a schematic diagram of agents accessing the network provided according to an embodiment of the invention;
Figure 3 is a structural diagram of an agent's neural network provided according to an embodiment of the invention;
Figure 4 is a diagram of the federated training model provided according to an embodiment of the invention.
Detailed Description of the Embodiments
The invention is further described below with reference to the drawings. The following embodiments are only intended to illustrate the technical solution of the invention more clearly and shall not be used to limit the scope of protection of the invention.
An embodiment of the invention provides a multi-base-station queuing preamble allocation method based on multi-agent collaboration. The target area contains a network composed of at least two base stations, and each base station includes a preamble pool. For each agent accessing the network, the following steps S1 to S3 are performed to complete the preamble allocation of each agent;
S1. The agents are MTC devices, and each agent has its own service type. Group the agents accessing the network according to their service types, calculate the average delay tolerance of each group of agents, and arrange the average delay tolerances of the groups in ascending order to obtain the priority set; a schematic diagram of agent grouping is shown in Figure 1;
The specific steps of step S1 are as follows:
S11: Different service types exist in the network; according to their sensitivity to delay, they are divided into delay-tolerant services and delay-sensitive services. In addition, the QoS requirements of each agent must be considered: since many agents access the network simultaneously, the service types accessed at the same moment also differ. In view of the requirements that current networks place on MTC applications, the delay requirements of the service types are used to measure the correlation between service types. According to the delay requirements of the agents' services, the similarity between the agents' services is calculated by the following formula:
Figure PCTCN2022107420-appb-000011
where c(i,j) is the similarity between service i and service j, t_i is the delay requirement of service i, t_j is the delay requirement of service j, σ is the similarity coefficient, and 0 ≤ c(i,j) ≤ 1; the larger c(i,j) is, the more similar the two services are;
According to the similarities of the agents' services, services whose similarity difference is less than a preset value are regarded as services of the same type, and the corresponding agents are placed in the same group;
S12: For each group of agents, the average delay tolerance is calculated by the following formula:
Figure PCTCN2022107420-appb-000012
where N_k denotes the number of agents in the k-th group, and
Figure PCTCN2022107420-appb-000013
denotes the average delay tolerance of the k-th group of agents;
S13: The average delay tolerance of each group of agents is calculated separately, denoted as
Figure PCTCN2022107420-appb-000014
where n is the number of agent groups. The average delay tolerances of the groups are arranged in ascending order and priorities are assigned in turn: the agent group with the smallest average delay tolerance is given the highest priority and the agent group with the largest average delay tolerance is given the lowest priority, yielding a priority set composed of the priorities of the agent groups.
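As a non-limiting illustration of steps S11 to S13, the sketch below groups agents by the similarity of their service delay requirements and then assigns group priorities in ascending order of average delay tolerance. The Gaussian similarity kernel, the threshold `preset_diff`, and all names used here are assumptions for the example, since the patent gives the exact similarity formula only as an image.

```python
import math

def gaussian_similarity(t_i, t_j, sigma=1.0):
    # Assumed form of c(i, j): a kernel of the delay-requirement gap bounded
    # in [0, 1]; the exact expression in the patent is shown only as an image.
    return math.exp(-((t_i - t_j) ** 2) / (2 * sigma ** 2))

def group_and_prioritize(delay_requirements, preset_diff=0.1, sigma=1.0):
    """delay_requirements: {agent_id: delay requirement t_i of its service}."""
    agents = sorted(delay_requirements, key=delay_requirements.get)
    groups, current = [], [agents[0]]
    for prev, nxt in zip(agents, agents[1:]):
        # Services whose similarity difference is below the preset value are
        # treated as the same type, so their agents land in the same group.
        if 1.0 - gaussian_similarity(delay_requirements[prev],
                                     delay_requirements[nxt], sigma) < preset_diff:
            current.append(nxt)
        else:
            groups.append(current)
            current = [nxt]
    groups.append(current)

    # S12/S13: average delay tolerance per group; ascending order gives the
    # priority, with priority 1 (highest) for the smallest average tolerance.
    avg_tol = [sum(delay_requirements[a] for a in g) / len(g) for g in groups]
    order = sorted(range(len(groups)), key=lambda k: avg_tol[k])
    return {tuple(groups[k]): rank + 1 for rank, k in enumerate(order)}

# Example: four agents with different service delay requirements (seconds).
print(group_and_prioritize({"ua": 0.05, "ub": 0.06, "uc": 1.0, "ud": 1.1}))
```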
S2. For each group of agents, allocate preambles to the agents in the group based on a reinforcement learning algorithm. Applying the idea of reinforcement learning, each agent continuously interacts with the network and, based on the network, selects the action that maximizes its return;
Reinforcement learning is used to solve Markov decision process problems. In reinforcement learning, an agent can periodically learn to take actions, observe the maximum return, and automatically adjust its action strategy to obtain the optimal action strategy. Because the agents are grouped, multiple agents learn through interaction with the network. In a competitive game setting, the multiple agents can reach a local optimum but cannot maximize the overall network performance. To reach the goal of the optimization problem, the multi-agent problem is transformed into a cooperative game, and the same reward function is used for all agents.
Each preamble corresponds to a queue. The state space S is constructed from the maximum queue length of each queue, and the action space A is constructed from the action of an agent selecting a preamble to queue on. Taking the state space S as input, based on a deep neural network combined with the Q-learning method, each agent, following a greedy strategy with the goal of maximizing its return, selects an action in the action space A as its executable action; taking the Q values of the agent's executable actions as output, the local agent preamble allocation model is constructed. A schematic diagram of agents accessing the network is shown in Figure 2, in which R_1, R_2, ..., R_{M-1}, R_M denote the preambles;
The specific steps of step S2 are as follows:
S21: Each preamble corresponds to a queue, and the state is constructed from the maximum queue length of each queue at time t as follows:
s_t = {p_1, p_2, ..., p_i, ..., p_M}
where s_t is the state at time t, p_i is the maximum queue length of the i-th queue, i ∈ {1, 2, ..., M}, and M is the total number of queues;
The state space S is constructed from the states from the initial time to time t as follows:
S = {s_0, s_1, ..., s_t}
where s_0, s_1, ..., s_t denote the states from the initial time to time t, and s_0 is the state at the initial time;
S22: When an agent accesses the network, it selects one of the queues corresponding to the M preambles to queue on, and the action space A is constructed from the action of an agent selecting a preamble to queue on as follows:
A = {a_1, a_2, ..., a_i, ..., a_M}
where a_i denotes an action strategy of the agent, that is, the action of selecting the i-th preamble to queue on;
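For clarity, the following minimal sketch shows one way to represent the state s_t (the maximum queue length of each of the M preamble queues, steps S21 and S22) and the action (choosing one of the M queues). The class and attribute names are illustrative only and are not taken from the patent.

```python
import random

class PreambleQueueEnv:
    """Toy environment: M preambles, each backed by one queue (S21/S22)."""

    def __init__(self, num_preambles: int):
        self.M = num_preambles
        self.queue_lengths = [0] * num_preambles  # p_1 ... p_M

    def state(self):
        # s_t = {p_1, p_2, ..., p_M}: current maximum queue length per queue.
        return tuple(self.queue_lengths)

    def step(self, action: int):
        # Action a_i: the agent joins the queue of the i-th preamble.
        assert 0 <= action < self.M
        self.queue_lengths[action] += 1
        return self.state()

env = PreambleQueueEnv(num_preambles=4)
print(env.step(random.randrange(env.M)))  # e.g. (0, 1, 0, 0)
```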
S23: The action strategies a_1, a_2, ..., a_n selected and executed by the agents correspond to rewards r_1, r_2, ..., r_n respectively, and the reward function R is constructed as follows:
R = r_i(r_1, r_2, ..., r_n)
When the numbers of agents queued in the queues tend to be equal, no queue is idle and therefore no preamble sits unused, so access is more efficient. When the number of agents is large, high-priority agents enter the queues sooner and gain access within their delay tolerance time, which guarantees the agents' access success rate.
Introducing the agents' priorities and the variance of the queues, the reward function R is transformed into the following form:
Figure PCTCN2022107420-appb-000015
where f_i(a_1, a_2, ..., a_n) denotes the priority of agent i, the agent with the highest priority obtains the largest reward when entering a queue, and g_i(a_1, a_2, ..., a_n) denotes the variance of the queues;
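One possible reading of the reward in step S23 is sketched below: a term that grows with the agent's priority (the highest-priority agent entering a queue gets the largest reward) minus a term proportional to the variance of the queue lengths, which pushes the queues toward equal occupancy. The exact forms of f_i and g_i and their weighting are assumptions for this example, since the patent presents the reward formula only as an image.

```python
def shared_reward(priority_rank, num_groups, queue_lengths, var_weight=1.0):
    # f_i: higher reward for higher priority (rank 1 = highest priority).
    priority_term = (num_groups - priority_rank + 1) / num_groups
    # g_i: variance of the queue lengths; low variance means no idle preambles.
    mean = sum(queue_lengths) / len(queue_lengths)
    variance = sum((p - mean) ** 2 for p in queue_lengths) / len(queue_lengths)
    # All agents share the same reward function (cooperative game).
    return priority_term - var_weight * variance
```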
S24: Based on a deep neural network combined with the Q-learning method, the local agent preamble allocation model is constructed, with the state space S as input and the Q values of the agent's executable actions as output; the structure of the agent's neural network is shown in Figure 3. Each action of the agent in state s_t corresponds to a Q value Q(s_t, a_t), where a_t is given by the following formula:
Figure PCTCN2022107420-appb-000016
where a denotes all executable actions in state s_t;
According to the Q-learning algorithm, the Q value Q_{k+1}(s_t, a_t) at the next moment is updated by the following formula:
Figure PCTCN2022107420-appb-000017
where α_k and γ are the learning rate and discount factor respectively, s_{t+1} denotes the state at the next moment, r_{t+1} denotes the reward obtained by the agent's executable action in state s_{t+1}, a′ denotes an executable action of the agent in state s_{t+1}, A is the action space, Q_k(s_t, a_t) denotes the Q value in state s_t, and
Figure PCTCN2022107420-appb-000018
denotes the maximum Q value over the executable actions in the action space A in state s_{t+1};
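The update described above is the standard Q-learning rule; a minimal tabular sketch consistent with the quantities defined here (learning rate α_k, discount factor γ, reward r_{t+1}, and the maximum Q value over A in state s_{t+1}) follows. In the patent the Q values are produced by the deep network rather than a lookup table, so this is only an illustration of the update itself.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)], initialized to 0

def q_update(s_t, a_t, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    # Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t)
    #   + alpha * (r_{t+1} + gamma * max_{a' in A} Q_k(s_{t+1}, a') - Q_k(s_t, a_t))
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s_t, a_t)] += alpha * (r_next + gamma * best_next - Q[(s_t, a_t)])
```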
S25: Update the state s_{t+1} and its corresponding reward r_{t+1}, construct the experience sample (s_t, a_t, s_{t+1}, r_{t+1}), and store it in the experience replay buffer;
S26: The loss function L_i(θ) of each agent is as follows:
L_i(θ) = E[(y_DQN - Q_k(s_t, a_t; θ))^2]
where θ denotes the weights of the online network;
and y_DQN is computed as follows:
Figure PCTCN2022107420-appb-000019
where a′_i denotes the action that maximizes the Q value of the target network in state s′, and θ⁻ denotes the target network weights;
S27: Randomly sample the experience samples in the replay buffer to train the local agent preamble allocation model.
In one embodiment, in step S2 the local agent preamble allocation model is trained a preset number of times before the state is updated.
In one embodiment, in step S2 each agent uses the ε-greedy strategy to select an action a_i: with probability equal to the exploration factor ε it selects an action strategy from the action space A at random, and with probability (1-ε) it selects the best action strategy in the action space A.
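Steps S24 to S27 together describe a DQN-style learner: an online network with weights θ, a target network with weights θ⁻, an experience replay buffer, ε-greedy action selection, and a squared-error loss against y_DQN. A compact PyTorch-flavoured sketch under those assumptions is given below; the layer sizes, buffer size, optimizer settings, and class names are illustrative only, and periodic refreshing of the target network from the online network is omitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)

class LocalAgent:
    """Local agent preamble allocation model (S24-S27), sketched as a DQN."""

    def __init__(self, state_dim, num_actions, eps=0.1, gamma=0.9, lr=1e-3):
        self.online = QNet(state_dim, num_actions)   # weights theta
        self.target = QNet(state_dim, num_actions)   # weights theta^-
        self.target.load_state_dict(self.online.state_dict())
        self.opt = torch.optim.Adam(self.online.parameters(), lr=lr)
        self.buffer = deque(maxlen=10_000)           # experience replay (S25)
        self.eps, self.gamma, self.num_actions = eps, gamma, num_actions

    def act(self, state):
        # epsilon-greedy: explore with probability eps, otherwise greedy on Q.
        if random.random() < self.eps:
            return random.randrange(self.num_actions)
        with torch.no_grad():
            return int(self.online(torch.tensor(state, dtype=torch.float32)).argmax())

    def remember(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))        # (s_t, a_t, s_{t+1}, r_{t+1})

    def train_step(self, batch_size=32):
        if len(self.buffer) < batch_size:
            return
        batch = random.sample(self.buffer, batch_size)   # random draw (S27)
        s, a, s_next, r = map(lambda x: torch.tensor(x, dtype=torch.float32), zip(*batch))
        a = a.long()
        q = self.online(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # y_DQN = r + gamma * Q(s', a'_i; theta^-), with a'_i the action that
            # maximizes the target network's Q value in state s'.
            y = r + self.gamma * self.target(s_next).max(dim=1).values
        loss = nn.functional.mse_loss(q, y)  # L_i(theta) = E[(y_DQN - Q(s,a;theta))^2]
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```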
S3. Based on the local agent preamble allocation models of the agents and a federated agent, construct the global agent preamble allocation model; train it with the federated learning method to obtain a trained global agent preamble allocation model; and apply the global agent preamble allocation model to complete the preamble allocation of each agent accessing the network. The federated training model is shown in Figure 4.
Because the individual agents in a multi-agent system face different tasks or situations, the experience samples stored in the replay buffer cannot adapt to the changes. A federated training method is therefore adopted, which synchronously optimizes each agent's neural network through averaged optimization of the neural-network gradients. In this federated training method, each agent optimizes its own neural network using its local experience together with neural-network gradients from the other collaborating agents. A federated agent is designed to collect the local gradients of the participating agents and perform the averaged optimization. This federated agent has the same neural network structure as the other agents but takes no actions.
The specific steps in step S3 of training the global agent preamble allocation model based on the federated learning method are as follows:
S31: All agents select action strategies according to the current state and obtain the corresponding rewards;
S32: Each agent inputs the current state into the deep neural network of its own local agent preamble allocation model for learning, obtains the parameters of its local agent preamble allocation model, and sends them to the federated agent;
S33: The federated agent uses an aggregation-averaging algorithm to learn from the parameters of the local agent preamble allocation models and obtains the global agent preamble allocation model, whose parameters are given by the following formula:
Figure PCTCN2022107420-appb-000020
where θ_g is the weight of the global agent preamble allocation model, θ_l is the weight of a local agent preamble allocation model, D is the total amount of training data, and D_k denotes the amount of data owned by the k-th participant.
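The aggregation in step S33 is, in spirit, a data-size-weighted average of the local model parameters (the formula itself is shown above only as an image). A small sketch of such an aggregation over PyTorch state dicts follows; the function name and interfaces are illustrative assumptions.

```python
import torch

def federated_average(local_state_dicts, data_sizes):
    """Weight each local agent's parameters by its share D_k / D of the data."""
    total = float(sum(data_sizes))
    global_state = {}
    for name in local_state_dicts[0]:
        global_state[name] = sum(
            (d_k / total) * sd[name].float()
            for sd, d_k in zip(local_state_dicts, data_sizes)
        )
    return global_state

# Usage: the federated agent collects each local model's state_dict and the
# amount of data D_k it trained on, then pushes the averaged weights back:
# global_weights = federated_average([m.state_dict() for m in local_models], D_list)
# for m in local_models: m.load_state_dict(global_weights)
```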
The embodiments of the invention have been described in detail above with reference to the drawings, but the invention is not limited to the above embodiments; within the scope of knowledge possessed by those of ordinary skill in the art, various changes may also be made without departing from the spirit of the invention.

Claims (6)

  1. A multi-base-station queuing preamble allocation method based on multi-agent collaboration, characterized in that the target area contains a network composed of at least two base stations, each base station includes a preamble pool, and for each agent accessing the network the following steps S1 to S3 are performed to complete the preamble allocation of each agent;
    S1. Group the agents accessing the network according to their service types, calculate the average delay tolerance of each group of agents, and arrange the average delay tolerances of the groups in ascending order to obtain a priority set;
    S2. For each group of agents, allocate preambles to the agents in the group based on a reinforcement learning algorithm;
    Each preamble corresponds to a queue. The state space S is constructed from the maximum queue length of each queue, and the action space A is constructed from the action of an agent selecting a preamble to queue on. Taking the state space S as input, based on a deep neural network combined with the Q-learning method, each agent, following a greedy strategy with the goal of maximizing its return, selects an action in the action space A as its executable action; taking the Q value of the agent's executable action as output, a local agent preamble allocation model is constructed;
    S3. Based on the local agent preamble allocation models of the agents and a federated agent, construct a global agent preamble allocation model; train the global agent preamble allocation model with a federated learning method to obtain a trained global agent preamble allocation model; and apply the global agent preamble allocation model to complete the preamble allocation of each agent accessing the network.
  2. The multi-base-station queuing preamble allocation method based on multi-agent collaboration according to claim 1, characterized in that the specific steps in step S1 of grouping the agents accessing the network according to service type, calculating the average delay tolerance of each group of agents, and arranging the average delay tolerances in ascending order to obtain the priority set are as follows:
    S11: According to the delay requirements of the agents' services, the similarity between the agents' services is calculated by the following formula:
    Figure PCTCN2022107420-appb-100001
    where c(i,j) is the similarity between service i and service j, t_i is the delay requirement of service i, t_j is the delay requirement of service j, σ is the similarity coefficient, and 0 ≤ c(i,j) ≤ 1;
    According to the similarities of the agents' services, services whose similarity difference is less than a preset value are regarded as services of the same type, and the corresponding agents are placed in the same group;
    S12: For each group of agents, the average delay tolerance is calculated by the following formula:
    Figure PCTCN2022107420-appb-100002
    where N_k denotes the number of agents in the k-th group, and
    Figure PCTCN2022107420-appb-100003
    denotes the average delay tolerance of the k-th group of agents;
    S13: The average delay tolerance of each group of agents is calculated separately, denoted as
    Figure PCTCN2022107420-appb-100004
    where n is the number of agent groups. The average delay tolerances of the groups are arranged in ascending order and priorities are assigned in turn: the agent group with the smallest average delay tolerance is given the highest priority and the agent group with the largest average delay tolerance is given the lowest priority, yielding a priority set composed of the priorities of the agent groups.
  3. The multi-base-station queuing preamble allocation method based on multi-agent collaboration according to claim 2, characterized in that the specific steps of step S2 are as follows:
    S21: Each preamble corresponds to a queue, and the state is constructed from the maximum queue length of each queue at time t as follows:
    s_t = {p_1, p_2, ..., p_i, ..., p_M}
    where s_t is the state at time t, p_i is the maximum queue length of the i-th queue, i ∈ {1, 2, ..., M}, and M is the total number of queues;
    The state space S is constructed from the states from the initial time to time t as follows:
    S = {s_0, s_1, ..., s_t}
    where s_0, s_1, ..., s_t denote the states from the initial time to time t, and s_0 is the state at the initial time;
    S22: When an agent accesses the network, it selects one of the queues corresponding to the M preambles to queue on, and the action space A is constructed from the action of an agent selecting a preamble to queue on as follows:
    A = {a_1, a_2, ..., a_i, ..., a_M}
    where a_i denotes an action strategy of the agent, that is, the action of selecting the i-th preamble to queue on;
    S23: The action strategies a_1, a_2, ..., a_n selected and executed by the agents correspond to rewards r_1, r_2, ..., r_n respectively, and the reward function R is constructed as follows:
    R = r_i(r_1, r_2, ..., r_n)
    Introducing the agents' priorities and the variance of the queues, the reward function R is transformed into the following form:
    Figure PCTCN2022107420-appb-100005
    where f_i(a_1, a_2, ..., a_n) denotes the priority of agent i and g_i(a_1, a_2, ..., a_n) denotes the variance of the queues;
    S24: Based on a deep neural network combined with the Q-learning method, the local agent preamble allocation model is constructed, with the state space S as input and the Q values of the agent's executable actions as output. Each action of the agent in state s_t corresponds to a Q value Q(s_t, a_t), where a_t is given by the following formula:
    Figure PCTCN2022107420-appb-100006
    where a denotes all executable actions in state s_t;
    According to the Q-learning algorithm, the Q value Q_{k+1}(s_t, a_t) at the next moment is updated by the following formula:
    Figure PCTCN2022107420-appb-100007
    where α_k and γ are the learning rate and discount factor respectively, s_{t+1} denotes the state at the next moment, r_{t+1} denotes the reward obtained by the agent's executable action in state s_{t+1}, a′ denotes an executable action of the agent in state s_{t+1}, A is the action space, Q_k(s_t, a_t) denotes the Q value in state s_t, and
    Figure PCTCN2022107420-appb-100008
    denotes the maximum Q value over the executable actions in the action space A in state s_{t+1};
    S25: Update the state s_{t+1} and its corresponding reward r_{t+1}, construct the experience sample (s_t, a_t, s_{t+1}, r_{t+1}), and store it in the experience replay buffer;
    S26: The loss function L_i(θ) of each agent is as follows:
    L_i(θ) = E[(y_DQN - Q_k(s_t, a_t; θ))^2]
    where θ denotes the weights of the online network;
    and y_DQN is computed as follows:
    Figure PCTCN2022107420-appb-100009
    where a′_i denotes the action that maximizes the Q value of the target network in state s′, and θ⁻ denotes the target network weights;
    S27: Randomly sample the experience samples in the replay buffer to train the local agent preamble allocation model.
  4. The multi-base-station queuing preamble allocation method based on multi-agent collaboration according to claim 3, characterized in that in step S2 the local agent preamble allocation model is trained a preset number of times before the state is updated.
  5. The multi-base-station queuing preamble allocation method based on multi-agent collaboration according to claim 3, characterized in that in step S2 each agent uses the ε-greedy strategy to select an action a_i: with probability equal to the exploration factor ε it selects an action strategy from the action space A at random, and with probability (1-ε) it selects the best action strategy in the action space A.
  6. The multi-base-station queuing preamble allocation method based on multi-agent collaboration according to claim 3, characterized in that the specific steps in step S3 of training the global agent preamble allocation model based on the federated learning method are as follows:
    S31: All agents select action strategies according to the current state and obtain the corresponding rewards;
    S32: Each agent inputs the current state into the deep neural network of its own local agent preamble allocation model for learning, obtains the parameters of its local agent preamble allocation model, and sends them to the federated agent;
    S33: The federated agent uses an aggregation-averaging algorithm to learn from the parameters of the local agent preamble allocation models and obtains the global agent preamble allocation model, whose parameters are given by the following formula:
    Figure PCTCN2022107420-appb-100010
    where θ_g is the weight of the global agent preamble allocation model, θ_l is the weight of a local agent preamble allocation model, D is the total amount of training data, and D_k denotes the amount of data owned by the k-th participant.
PCT/CN2022/107420 2022-05-24 2022-07-22 Multi-base-station queuing preamble allocation method based on multi-agent collaboration WO2023226183A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210570855.2A CN115066036A (zh) 2022-05-24 2022-05-24 Multi-base-station queuing preamble allocation method based on multi-agent collaboration
CN202210570855.2 2022-05-24

Publications (1)

Publication Number Publication Date
WO2023226183A1 true WO2023226183A1 (zh) 2023-11-30

Family

ID=83198743

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107420 WO2023226183A1 (zh) 2022-05-24 2022-07-22 Multi-base-station queuing preamble allocation method based on multi-agent collaboration

Country Status (2)

Country Link
CN (1) CN115066036A (zh)
WO (1) WO2023226183A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392483A (zh) * 2023-12-06 2024-01-12 山东大学 基于增强学习的相册分类模型训练加速方法、系统及介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465151A (zh) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) 一种基于深度强化学习的多智能体联邦协作方法
US20210150417A1 (en) * 2019-11-14 2021-05-20 NEC Laboratories Europe GmbH Weakly supervised reinforcement learning
CN113114581A (zh) * 2021-05-14 2021-07-13 南京大学 基于多智能体深度强化学习的tcp拥塞控制方法及装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210150417A1 (en) * 2019-11-14 2021-05-20 NEC Laboratories Europe GmbH Weakly supervised reinforcement learning
CN112465151A (zh) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) 一种基于深度强化学习的多智能体联邦协作方法
CN113114581A (zh) * 2021-05-14 2021-07-13 南京大学 基于多智能体深度强化学习的tcp拥塞控制方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENYU ZHOU, ZEHAN JIA, LIAO HAIJUN, XIONGWEN ZHAO, LEI ZHANG: "Context-aware learning-based access control method for power IoT", JOURNAL ON COMMUNICATIONS, vol. 42, no. 3, 4 March 2021 (2021-03-04), pages 150 - 159, XP093111245, DOI: 10.11959/j.issn.1000−436x.2021062 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392483A (zh) * 2023-12-06 2024-01-12 山东大学 基于增强学习的相册分类模型训练加速方法、系统及介质
CN117392483B (zh) * 2023-12-06 2024-02-23 山东大学 基于增强学习的相册分类模型训练加速方法、系统及介质

Also Published As

Publication number Publication date
CN115066036A (zh) 2022-09-16

Similar Documents

Publication Publication Date Title
CN112737837B (zh) 一种高动态网络拓扑下无人机群带宽资源分配方法
CN110809306B (zh) 一种基于深度强化学习的终端接入选择方法
Shi et al. Drone-cell trajectory planning and resource allocation for highly mobile networks: A hierarchical DRL approach
Yoshida et al. MAB-based client selection for federated learning with uncertain resources in mobile networks
Ding et al. Trajectory design and access control for air–ground coordinated communications system with multiagent deep reinforcement learning
US20220217792A1 (en) Industrial 5g dynamic multi-priority multi-access method based on deep reinforcement learning
Wei et al. Computation offloading over multi-UAV MEC network: A distributed deep reinforcement learning approach
CN111629380A (zh) 面向高并发多业务工业5g网络的动态资源分配方法
Li et al. A multi-agent deep reinforcement learning based spectrum allocation framework for D2D communications
Balakrishnan et al. Resource management and fairness for federated learning over wireless edge networks
WO2023226183A1 (zh) 一种基于多智能体协作的多基站排队式前导码分配方法
CN109831790B (zh) 雾无线接入网中基于头脑风暴优化算法的协作缓存方法
Tan et al. Resource allocation in MEC-enabled vehicular networks: A deep reinforcement learning approach
CN112202847B (zh) 一种基于移动边缘计算的服务器资源配置方法
US20240039788A1 (en) Deep reinforcement learning for adaptive network slicing in 5g for intelligent vehicular systems and smart cities
CN115065678A (zh) 一种基于深度强化学习的多智能设备任务卸载决策方法
CN113821346B (zh) 基于深度强化学习的边缘计算中计算卸载与资源管理方法
Sarlak et al. Diversity maximized scheduling in roadside units for traffic monitoring applications
Liu et al. Distributed asynchronous learning for multipath data transmission based on P-DDQN
CN114501667A (zh) 一种考虑业务优先级的多信道接入建模及分布式实现方法
CN114938372A (zh) 一种基于联邦学习的微网群请求动态迁移调度方法及装置
Wang et al. Modeling on resource allocation for age-sensitive mobile edge computing using federated multi-agent reinforcement learning
Sun et al. A resource allocation scheme for edge computing network in smart city based on attention mechanism
CN115529604A (zh) 一种基于服务器协作的联合资源分配与多元任务卸载方法
Huang et al. Deep reinforcement learning based collaborative optimization of communication resource and route for UAV cluster

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22943374

Country of ref document: EP

Kind code of ref document: A1