CN117221122B - Asynchronous layered joint learning training method based on bandwidth pre-allocation - Google Patents
Asynchronous layered joint learning training method based on bandwidth pre-allocation
- Publication number
- CN117221122B (application CN202311172306.0A)
- Authority
- CN
- China
- Prior art keywords
- client
- clients
- training
- server
- edge server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Computer And Data Communications (AREA)
Abstract
Description
Technical field
The invention relates to an asynchronous hierarchical federated learning training method based on bandwidth pre-allocation, and belongs to the technical field of federated learning model training.
Background
The emergence of digital technology has driven significant progress in transformative technologies such as big data and artificial intelligence. Machine-learning-powered mobile applications are revolutionizing many aspects of modern life. However, machine learning training tasks usually require large amounts of data from terminals with widely differing computing capabilities. The traditional approach is to upload the data to a remote cloud server for processing, but this approach faces challenges such as privacy violations, network congestion, and transmission delays, which prevent the data from being fully exploited.
In 2016, Google Research proposed the concept of federated learning to alleviate network bandwidth constraints and address data privacy vulnerabilities. Federated learning is a collaborative training method that eliminates the need to access raw data: only model updates are shared with a central server or coordinator. This decentralized approach follows the data minimization principle and removes the need for centralized data collection; the raw data remains securely stored on individual devices and cannot be accessed directly. Federated learning performs model training locally on each device and uploads only the aggregated updates or model parameters to the central server.
Hierarchical federated learning is a federated learning framework in which edge servers are typically deployed at base stations and serve as intermediaries between mobile devices and the cloud server. These edge servers aggregate the local models received from nearby devices. By enabling the cloud server to handle data from more terminals efficiently, hierarchical federated learning addresses the challenge of uploading data to the cloud.
However, the standard hierarchical federated learning framework uses synchronous aggregation of the global model, in which the server waits for all client parameters to be uploaded before performing global aggregation. This synchronous approach causes the "straggler effect": the latency of global aggregation is determined by the slowest-uploading client, increasing the delay of the whole training process. Furthermore, delayed aggregation of client parameters hinders the convergence of the global model and may affect the accuracy and performance of the trained model.
Therefore, it is necessary to propose an asynchronous hierarchical federated learning framework that eliminates the need for the server to wait for all client parameters before global aggregation, thereby shortening the training delay of federated learning.
Summary of the invention
In order to solve the above problems, the present invention discloses an asynchronous hierarchical federated learning training method based on bandwidth pre-allocation. The specific technical solution is as follows:
An asynchronous hierarchical federated learning training method based on bandwidth pre-allocation, comprising the following steps:
Step 1: The cloud server selects clients
In this step, the cloud server is responsible for coordinating and managing the federated learning task. The cloud server selects a corresponding number of clients within the range of each edge server according to its policy. If more clients are available than are required, the best clients are selected to participate in the next round of training according to the quality of their local training data and their computing power, and the model parameters are sent to these clients;
Step 2: Local training on clients
After receiving the model parameters, each selected client trains the model locally, executing an optimization algorithm suited to the task type and model architecture to update the parameters of its local model. This process may run for multiple iterations, and the model parameters are updated in every iteration;
Step 3: Local parameter upload
After the client has iterated its local model to the preset accuracy, it selects, based on its current geographical location, the nearest edge server that still has remaining bandwidth, associates with it, and uploads its local model parameters to that edge server;
Step 4: Edge aggregation
Model parameter aggregation is a federated learning technique that produces a better global model by averaging, weighted averaging, or other aggregation strategies over the parameters from different clients. Unlike other hierarchical federated learning frameworks, the edge server does not need to collect the model parameters from all associated clients before aggregating; instead, after every interval T_aggregation it aggregates the model parameters collected so far. Since the collected model parameters may be outdated data produced by an earlier round of training, they are weighted according to the time at which their local iteration started: the earlier the parameters, the smaller the weight. After edge aggregation, each edge server sends the aggregated model parameters to the cloud server;
Step 5: Cloud aggregation
After every interval T_aggregation, the cloud server receives the model parameters sent by all edge servers at the same time and aggregates them. If the aggregated model reaches the specified accuracy, the asynchronous hierarchical federated learning training based on bandwidth pre-allocation stops; otherwise the method returns to step 1.
Further, in step 1, the clients may move over a wide range during training, so the movement of the clients selected by the cloud server in each round cannot be predicted in advance, and these clients will be associated with different edge servers when training ends. Because the bandwidth of each edge server is limited, too many associated clients would prevent some clients from uploading their model parameters smoothly, while too few would degrade the quality of edge model aggregation. Therefore, when selecting clients before training, a corresponding number of clients has to be selected within the range of each edge server;
The specific client selection method is as follows:
Historical statistical patterns can be found for client movement during training. For example, for a client in a residential area within the range of edge server A, the probabilities of staying there during training, moving to a business district within the range of edge server B, or commuting to a company within the range of edge server C can all be estimated. In this case, the probability of transferring from one edge server's range to another during training can be represented by a matrix, which is obtained by actual investigation of client transfers within the cloud server's coverage. Based on this matrix and the number of clients that each edge server should be associated with, the number of clients that the cloud server initially selects under each edge server is computed so that the number of clients associated with each edge server is even when training ends;
In this matrix, the element in row j and column l is the probability that a client moves from the range of edge server j to the range of edge server l during the training process.
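For illustration, the following sketch shows how the transfer matrix is used: given an initial per-server allocation x, the expected number of clients associated with each edge server at the end of training is y = xM. The 3-server matrix and the allocation values here are hypothetical, not taken from the patent.

```python
import numpy as np

# Hypothetical transfer matrix for |K| = 3 edge servers: M[j, l] is the probability
# that a client starting under edge server j finishes training under edge server l.
M = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

x = np.array([10, 10, 10])   # hypothetical initial allocation under each edge server
y = x @ M                    # expected end-of-training distribution y = xM
print(y)                     # [ 9. 12.  9.] -> the second server would end up overloaded
```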
Each client participating in training has an initial bandwidth allocation ratio β_device. From the remaining bandwidth allocation ratio of each edge server j, the cloud server determines how many clients can be associated with that server and upload to it when training ends. Let the number of clients that the |K| edge servers should be associated with in the next round be (n1, n2, n3, ..., n|K|), and let the number of clients initially assigned by the cloud server under the |K| edge servers be X = (x1, x2, x3, ..., x|K|), so that XM = (y1, y2, y3, ..., y|K|). Matching the end-of-training distribution XM to the target allocation yields the following optimization problem,
subject to an equality constraint on the total number of selected clients and, for each edge server j, inequality constraints bounding x_j by the number of remaining (available) clients under server j.
This is a general constrained problem and is solved by the multiplier method (PHR algorithm). Given the initial point (n1, n2, n3, ..., n|K|), a penalty factor σ, an amplification coefficient c > 1, a control error ε > 0, and a constant θ ∈ (0, 1), let k = 1; the solution process is as follows:
Step 1: Taking x_{k-1} as the initial point, solve the unconstrained subproblem (minimization of the augmented Lagrangian function) to obtain the optimal solution x_k;
Step 2: If the constraint violation at x_k is within the control error ε, then x_k is the optimal solution and the algorithm stops; otherwise, go to Step 3;
Step 3: If the constraint violation has decreased by at least the factor θ relative to the previous iterate, go to Step 4; otherwise set σ_{k+1} = c·σ_k and go to Step 4;
Step 4: Update the multiplier vector:
(λ_{k+1})_1 = (λ_k)_1 − σ·c_1(x_k)
(λ_{k+1})_i = max[0, (λ_k)_i − σ·c_i(x_k)], i = 2, 3, ..., 2|K|+1,
then set k = k + 1 and go to Step 2;
Here c_i(x) denotes the i-th constraint: the 1st is the equality constraint and the 2nd to (2|K|+1)-th are the inequality constraints. Iterating according to this algorithm, an approximate optimal solution is sufficient — the precision ε does not need to be very high — and the resulting X is rounded to integers. This solution is the optimal number of clients that the cloud server should select within each edge server's range.
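The following is a minimal sketch of the PHR multiplier iteration described above, using `scipy.optimize.minimize` as the inner unconstrained solver. The objective is an assumption (the squared deviation between the expected end-of-training distribution xM and the target allocation n), since the patent text does not reproduce the objective formula; the equality constraint fixes the total number of selected clients and the 2|K| inequality constraints keep each x_j between 0 and the number of available clients. The function name `phr_allocate` and all numeric values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def phr_allocate(M, n_target, n_avail, sigma=1.0, c_amp=2.0, eps=1e-3, theta=0.5, max_outer=50):
    """Approximate client allocation X via the PHR multiplier method (sketch)."""
    K = len(n_target)

    def f(x):  # assumed objective: match the end-of-training distribution xM to the target n
        return np.sum((x @ M - n_target) ** 2)

    def constraints(x):
        eq = np.array([np.sum(x) - np.sum(n_target)])    # c_1(x) = 0: total clients fixed
        ineq = np.concatenate([x, n_avail - x])          # c_i(x) >= 0: 0 <= x_j <= available
        return eq, ineq

    lam_eq, lam_ineq = np.zeros(1), np.zeros(2 * K)
    x, prev_viol = np.array(n_target, dtype=float), np.inf

    for _ in range(max_outer):
        def aug_lagrangian(x, sig=sigma, le=lam_eq, li=lam_ineq):
            eq, ineq = constraints(x)
            term_eq = -le @ eq + 0.5 * sig * np.sum(eq ** 2)
            shifted = np.maximum(0.0, li - sig * ineq)   # PHR term for inequality constraints
            term_ineq = np.sum(shifted ** 2 - li ** 2) / (2.0 * sig)
            return f(x) + term_eq + term_ineq

        x = minimize(aug_lagrangian, x, method="BFGS").x      # Step 1: inner unconstrained solve
        eq, ineq = constraints(x)
        viol = np.sqrt(np.sum(eq ** 2) + np.sum(np.minimum(0.0, ineq) ** 2))
        if viol <= eps:                                        # Step 2: stopping test
            break
        if viol > theta * prev_viol:                           # Step 3: enlarge the penalty factor
            sigma *= c_amp
        prev_viol = viol
        lam_eq = lam_eq - sigma * eq                           # Step 4: multiplier update
        lam_ineq = np.maximum(0.0, lam_ineq - sigma * ineq)

    return np.rint(x).astype(int)                              # round the solution to integers

# Hypothetical 3-server example: target allocation, availability, and transfer matrix.
M = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
print(phr_allocate(M, n_target=np.array([10.0, 10.0, 10.0]), n_avail=np.array([40.0, 40.0, 40.0])))
```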
Further, in step 2, a client participating in training has received the model parameters from the cloud server. Client i uses its own dataset D_i to find the model parameters ω that minimize its loss function; denoting the loss function on client i by F_i(ω), the average of the per-sample losses over D_i, the goal is to find the optimal model parameters ω* = argmin[F_i(ω)];
This problem generally has no closed-form solution, so the client performs gradient descent over multiple iterations to approach the optimum step by step. To reach a predetermined local accuracy θ ∈ (0, 1), the client needs L(θ) = μ·log(1/θ) rounds of local iteration, where the constant μ depends on the scale of the training task. Each local iteration updates the parameters as ω ← ω − η∇F_i(ω), and iteration continues until the local accuracy θ is reached, at which point local training is complete; here η is the learning rate;
The computation delay of model training on a client depends on the number of iterations L(θ) needed to reach the accuracy, the computing power P_i of the client, and the size |D_i| of its training dataset, and can be expressed as τ_i = L(θ)·|D_i| / P_i, where the computing power P is the number of samples a client can process per unit time. The size of the training dataset can be set in advance, so selecting clients with strong computing power effectively reduces the computation delay.
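A minimal sketch of the client-side step described above: run L(θ) = μ·log(1/θ) gradient-descent iterations and estimate the computation delay as L(θ)·|D_i|/P_i. The least-squares loss, the value of μ, and all numbers are placeholders rather than values from the patent.

```python
import numpy as np

def local_train(w, X, y, eta=0.01, theta=0.1, mu=20.0):
    """Local gradient descent to reach local accuracy theta (sketch)."""
    L = int(np.ceil(mu * np.log(1.0 / theta)))        # L(theta) = mu * log(1 / theta)
    for _ in range(L):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)       # gradient of a least-squares loss
        w = w - eta * grad                            # w <- w - eta * grad F_i(w)
    return w, L

def compute_delay(L, dataset_size, P_i):
    """tau_i = L(theta) * |D_i| / P_i, with P_i = samples processed per unit time."""
    return L * dataset_size / P_i

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)   # hypothetical local dataset D_i
w, L = local_train(np.zeros(5), X, y)
print(L, compute_delay(L, dataset_size=len(y), P_i=5000.0))
```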
Further, in step 3, once a client's local training has reached the predetermined accuracy, it immediately uploads its model parameters to an edge server. Under the framework of asynchronous hierarchical federated learning based on bandwidth pre-allocation, client i is not associated with an edge server initially; instead, at the end of training it computes the distance s_ij to each edge server j from its current location, selects the nearest edge server j (i.e., min s_ij) among those that still have remaining bandwidth, associates with it, and uploads its model parameters to that edge server.
The associated edge server gives every client an initial bandwidth allocation ratio β_ji. The parameter upload rate r_i of client i is then given by the channel-capacity formula in terms of β_ji, the channel gain h_i of client i, the noise N_0, and the receiving power of edge server j for client i, which in turn is determined by the maximum receiving power p_j of edge server j and a constant c. The client's parameter upload delay is then |d_i| / r_i, where |d_i| is the size of the model parameters to be uploaded.
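The sketch below illustrates the upload-delay computation, assuming the rate takes the standard Shannon form r = β·B·log2(1 + p·h/N0); the exact rate expression and the numeric values are not reproduced in the patent and are assumptions here.

```python
import numpy as np

def upload_delay(model_size_bits, beta_ji, bandwidth_hz, rx_power, channel_gain, noise):
    """t_i = |d_i| / r_i with an assumed Shannon-capacity upload rate (sketch)."""
    rate = beta_ji * bandwidth_hz * np.log2(1.0 + rx_power * channel_gain / noise)
    return model_size_bits / rate

# Hypothetical values: a 10 Mbit model over 10% of a 20 MHz channel.
print(upload_delay(10e6, beta_ji=0.1, bandwidth_hz=20e6,
                   rx_power=1.0, channel_gain=0.5, noise=0.1))
```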
Further, in step 4, in a synchronous aggregation algorithm the cloud server must receive the parameters of all clients participating in training before it can perform a round of global aggregation, so the delay of global aggregation is determined by the last client to finish training and upload its parameters, causing the "straggler effect". Clients differ greatly in local dataset size and computing performance, and the clients initially selected by the cloud server are not associated with edge servers, so synchronous aggregation at the edge servers could be stalled waiting for particular clients to upload, seriously affecting training progress. Therefore edge aggregation is performed asynchronously: after every interval T_aggregation, each edge server aggregates the model parameters collected so far and uploads the aggregated parameters to the cloud server.
Further, when asynchronous aggregation is used, the model parameters uploaded by a client may be stale. During edge aggregation, the staleness function of each client therefore depends on the number of cloud aggregation rounds n_round that had already been performed when the client received the model and on the latest cloud aggregation round number n_CurrentRound: the staleness function of the parameters uploaded by client i decreases with this round difference, governed by a given decay coefficient λ ∈ (0, 1). The parameter update on edge server j is a staleness-weighted aggregation over S_j, the set of clients participating in this edge aggregation.
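A sketch of one possible form of this staleness-weighted edge aggregation. The exponential staleness g = λ^(n_CurrentRound − n_round) and the additional data-size weighting are assumptions consistent with, but not copied from, the patent's unreproduced formulas.

```python
import numpy as np

def staleness(n_current_round, n_round, lam=0.5):
    """Assumed staleness function: lam ** (round difference), lam in (0, 1)."""
    return lam ** (n_current_round - n_round)

def edge_aggregate(client_params, client_sizes, client_rounds, n_current_round, lam=0.5):
    """Staleness- and data-size-weighted average over the clients in S_j (sketch)."""
    weights = np.array([staleness(n_current_round, r, lam) * s
                        for r, s in zip(client_rounds, client_sizes)], dtype=float)
    weights /= weights.sum()
    return (weights[:, None] * np.stack(client_params)).sum(axis=0)

# Hypothetical: three clients; the third trained on parameters from an older round.
params = [np.ones(4), 2.0 * np.ones(4), 3.0 * np.ones(4)]
print(edge_aggregate(params, client_sizes=[100, 100, 200],
                     client_rounds=[5, 5, 3], n_current_round=5))
```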
Further, in step 5, since the edge servers and the cloud server have strong communication capabilities, the communication delay for an edge server to upload parameters to the cloud server is negligible; as soon as the cloud server receives the parameters uploaded by the edge servers, it performs a round of cloud aggregation;
Likewise, because the quality of the parameters received for cloud aggregation is uneven, a staleness function is introduced as a hyperparameter to reduce the influence of outdated models on global model training. The staleness function of the parameters uploaded by each edge server is related to the overall staleness of the clients that participated in that edge server's last round of aggregation, and can simply be set to the average staleness of those clients. The model parameter update on the cloud server is then a weighted aggregation of the latest model parameters uploaded by each edge server j, where D_s is the total collection of the datasets of the clients that have participated in training. These model parameters are aggregated; if the aggregated model reaches the specified accuracy, the asynchronous hierarchical federated learning training based on bandwidth pre-allocation stops, otherwise the method returns to step 1.
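Correspondingly, a sketch of the cloud-aggregation step: the latest parameters of each edge server are weighted by the average staleness of its participating clients and by their share of the total data D_s, and mixed into the global model. This weighting and the mixing with the previous global model are assumed forms, since the patent does not reproduce the update formula.

```python
import numpy as np

def cloud_aggregate(edge_params, edge_data_sizes, edge_staleness, global_params):
    """Cloud aggregation over edge servers with staleness weighting (sketch)."""
    D_s = float(sum(edge_data_sizes))                 # total data of participating clients
    update = np.zeros_like(global_params)
    total_weight = 0.0
    for w_j, d_j, g_j in zip(edge_params, edge_data_sizes, edge_staleness):
        weight = g_j * d_j / D_s                      # staleness x data-share weight
        update += weight * w_j
        total_weight += weight
    # Assumed mixing rule: stale edges pull the global model less.
    return (1.0 - total_weight) * global_params + update

# Hypothetical: two edge servers; the second aggregated fresher clients.
edges = [np.ones(4), 3.0 * np.ones(4)]
print(cloud_aggregate(edges, edge_data_sizes=[300, 500],
                      edge_staleness=[0.25, 1.0], global_params=np.zeros(4)))
```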
Further, when the communication capabilities of the edge servers and the cloud server are strong enough, the edge servers can skip edge aggregation after receiving the parameters uploaded by the clients and instead forward these parameters directly to the cloud server for cloud aggregation; in that case the influence of the staleness function on the model parameter weights is more accurate.
Further, when the cloud server selects clients it needs to consider both the computing performance of the clients, to reduce the local computation delay, and the data quality of the datasets on the clients, to obtain a better local training effect, because representative and diverse data improve model training. To this end the following technique is adopted: when more clients are available for training than are needed in one round, the computing performance of the clients and the entropy weights of their data are considered together, and the best clients within each edge server's range are selected to participate in the next round of training;
The data quality of the dataset on a client is defined with the entropy weight method. m samples are drawn from client i; the indices 1, 2, ..., k_i denote the attribute features on client i and 1, 2, ..., m denote the samples. For each attribute, a standardized value is computed for every sample, the information entropy of the attribute is computed from these standardized values, and an entropy weight is assigned to the attribute based on its information entropy;
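A sketch of the entropy-weight computation under the usual conventions of the entropy weight method (min–max standardization, per-attribute information entropy, weights proportional to 1 − e_j); the way the weights are collapsed into a single quality score Q_i is an assumption, since that formula is not reproduced in the patent.

```python
import numpy as np

def entropy_weights(samples, eps=1e-12):
    """Entropy weights for the attribute columns of an m-by-k_i sample matrix (sketch)."""
    m, _ = samples.shape
    z = (samples - samples.min(axis=0)) / (np.ptp(samples, axis=0) + eps)  # standardize attributes
    p = z / (z.sum(axis=0) + eps)                          # per-attribute sample shares
    e = -(p * np.log(p + eps)).sum(axis=0) / np.log(m)     # information entropy e_j per attribute
    return (1.0 - e) / (1.0 - e).sum()                     # entropy weight w_j per attribute

def data_quality(samples):
    """Assumed quality score Q_i: entropy-weighted mean of the standardized attributes."""
    w = entropy_weights(samples)
    z = (samples - samples.min(axis=0)) / (np.ptp(samples, axis=0) + 1e-12)
    return float(z.mean(axis=0) @ w)

# Hypothetical: 50 samples with 4 attribute features drawn from client i.
rng = np.random.default_rng(1)
print(data_quality(rng.normal(size=(50, 4))))
```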
Under each server j, the clients with the best computing power P and local model quality Q are selected. It is assumed that, for a given client, P and Q change little over multiple rounds of training. The overall strategy is that, under each edge server, clients with a larger combined score φ = γP + (1 − γ)Q are selected preferentially from the set N_selected of clients that have already participated in training, and the remaining required clients are selected from the total client set N_sum. Suppose the required number of clients is to be selected within the range of edge server j, with a parameter δ; the specific steps are as follows:
S1: Update N_selected. If |N_selected| is greater than the set threshold θ_client, the δ·|N_selected| clients in N_selected with the largest φ values are added to N_best, and then, among the clients that are both in N_best and within the range of edge server j, up to the required number are selected;
S2: If the required number has not yet been reached, the remaining clients are chosen at random from N_sum − N_selected among those satisfying s_ji < R_j, i.e., within the range of edge server j; their φ values are computed and they are added to N_selected. All clients in N_selected are then the clients participating in the next round of training. A sketch of this selection procedure is given below.
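A minimal sketch of this two-stage selection by combined score φ = γP + (1 − γ)Q. The data structures (dictionaries keyed by client id), quotas, and thresholds are placeholders for illustration.

```python
import random

def select_clients(quota, phi, server_of, n_selected, delta=0.5, theta_client=20):
    """Per-edge-server client selection by phi = gamma*P + (1-gamma)*Q (sketch)."""
    chosen = {j: [] for j in quota}

    # S1: if enough clients have trained before, keep the top delta*|N_selected| by phi.
    n_best = []
    if len(n_selected) > theta_client:
        ranked = sorted(n_selected, key=lambda i: phi[i], reverse=True)
        n_best = ranked[: int(delta * len(n_selected))]
    for i in n_best:
        j = server_of[i]
        if len(chosen[j]) < quota[j]:
            chosen[j].append(i)

    # S2: fill the remaining quota of each server randomly from unused clients in range.
    for j in quota:
        pool = [i for i in server_of if server_of[i] == j and i not in n_selected]
        random.shuffle(pool)
        while len(chosen[j]) < quota[j] and pool:
            chosen[j].append(pool.pop())
    return chosen

# Hypothetical: 2 edge servers, 8 candidate clients, 2 to be selected under each server.
server_of = {i: (0 if i < 4 else 1) for i in range(8)}
phi = {i: random.random() for i in range(8)}
print(select_clients({0: 2, 1: 2}, phi, server_of, n_selected={0, 5}))
```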
Further, the operation in which the cloud server determines the number of clients to select takes place after an edge server has performed edge aggregation and uploaded its parameters. That edge server certainly has remaining bandwidth at this point, and other edge servers may also have remaining bandwidth, because some clients with stronger computing power and a shorter distance to their edge server have already completed local training, uploaded their parameters, and released their bandwidth; this bandwidth is likewise allocated by the cloud server to the clients participating in the next round of training;
After the cloud server has selected the clients, the remaining bandwidth ratio of every edge server is set to 0. Since some clients may be unable to finish training and release their bandwidth for a long time because of abnormal conditions, and the model parameters of such clients would be outdated by the time they finish, bandwidth that has been allocated but not released is automatically released after a set number n_overtime of cloud aggregation rounds and is used for the next round of client parameter uploads.
The beneficial effects of the present invention are:
Compared with the prior art, the beneficial effects of the present invention are as follows. After a client finishes training, it is associated, according to its current location, with the nearest edge server that has remaining bandwidth, which shortens the model upload delay. To avoid an uneven distribution of clients, before each round of training the number of clients initially assigned within each edge server's range is calculated from the remaining bandwidth of each edge server and the client transfer matrix. When selecting clients, entropy weights are used to pick clients whose data helps model training and whose computing power is strong, which shortens the local training delay. During uploading, the upload is accelerated according to the remaining bandwidth of the edge servers, further shortening the upload delay.
Description of drawings
Figure 1 is the training flow chart of the present invention.
Figure 2 is the cloud–edge interaction diagram of the present invention.
Figure 3 shows the model accuracy on the test set in the comparative experiment of the embodiment.
Figure 4 shows the model loss on the test set in the comparative experiment of the embodiment.
Figure 5 compares the time required for clients to upload data under the greedy algorithm and under the hierarchical federated learning algorithm in the embodiment.
Detailed description of embodiments
The present invention is further explained below with reference to the accompanying drawings and specific embodiments. It should be understood that the following embodiments are only intended to illustrate the present invention and not to limit its scope.
As can be seen from Figures 1–2, the process of the present invention is:
1) The cloud server selects clients
A corresponding number of clients is selected under each edge server, and the latest model parameters are distributed to these clients, which train locally on their own datasets.
2) Local training on clients
In this step, the clients participating in training receive the model parameters from the cloud server and use their local datasets to solve for the optimal model parameters.
3) Local parameter upload
After local training reaches the predetermined accuracy, the client immediately uploads its model parameters to an edge server.
4) Edge aggregation
In a synchronous aggregation algorithm, the server must receive the parameters of all clients participating in training before performing a round of global aggregation, so the delay of global aggregation is determined by the last client to finish training and upload its parameters, causing the "straggler effect". In the scenario of the present invention, clients differ greatly in local dataset size and computing performance, and the clients initially selected by the cloud server are not associated with edge servers, so synchronous edge aggregation could be stalled waiting for particular clients to upload, seriously affecting training progress. The present invention therefore differs from other hierarchical federated learning frameworks: the edge server does not need to collect the model parameters from all associated clients before aggregating, but aggregates the model parameters collected so far after every interval T_aggregation. Since the collected model parameters may be outdated data produced by an earlier round of training, they are weighted according to the time at which their local iteration started — the earlier the parameters, the smaller the weight. After edge aggregation, each edge server sends the aggregated model parameters to the cloud server.
When asynchronous aggregation is used, the model parameters uploaded by a client may be stale. During edge aggregation, the staleness function of each client therefore depends on the number of cloud aggregation rounds n_round that had already been performed when the client received the model and on the latest cloud aggregation round number n_CurrentRound: the staleness function of the parameters uploaded by client i decreases with this round difference, governed by a given decay coefficient λ ∈ (0, 1). The parameter update on edge server j is a staleness-weighted aggregation over S_j, the set of clients participating in this edge aggregation.
5) Cloud aggregation
Since the edge servers and the cloud server have strong communication capabilities, the communication delay for an edge server to upload parameters to the cloud server is negligible. As soon as the cloud server receives the parameters uploaded by the edge servers, it performs a round of cloud aggregation. Likewise, because the quality of the parameters received for cloud aggregation is uneven, a staleness function is introduced as a hyperparameter to reduce the influence of outdated models on global model training. The staleness function of the parameters uploaded by each edge server is related to the overall staleness of the clients that participated in that edge server's last round of aggregation, and can simply be set to the average staleness of those clients. The model parameter update on the cloud server is then a weighted aggregation of the latest model parameters uploaded by each edge server j, where D_s is the total collection of the datasets of the clients that have participated in training. These model parameters are aggregated; if the aggregated model reaches the specified accuracy, the asynchronous hierarchical federated learning training based on bandwidth pre-allocation stops; otherwise the cloud server selects the next round of clients and distributes the latest parameters to them.
If the communication capabilities of the edge servers and the cloud server are strong enough, the edge servers can skip edge aggregation after receiving the parameters uploaded by the clients and forward these parameters directly to the cloud server for cloud aggregation; in that case the influence of the staleness function on the model parameter weights is more accurate.
In addition, the operation in which the cloud server determines the number of clients to select takes place after an edge server has performed edge aggregation and uploaded its parameters. That edge server certainly has remaining bandwidth at this point, and other edge servers may also have remaining bandwidth, because some clients with stronger computing power and a shorter distance to their edge server have already completed local training, uploaded their parameters, and released their bandwidth; this bandwidth is likewise allocated by the cloud server to the clients participating in the next round of training. After the cloud server has selected the clients, the remaining bandwidth ratio of every edge server is set to 0. Since some clients may be unable to finish training and release their bandwidth for a long time because of abnormal conditions, and the model parameters of such clients would be outdated by the time they finish, bandwidth that has been allocated but not released is automatically released after a period of time and used for the next round of client parameter uploads.
To verify the application of this patent, a specific experiment is described below.
Experimental environment:
The experiment considers training under a distributed framework consisting of one cloud server, 5 edge servers, and 250 candidate clients. Each edge server covers 50 clients. In edge-level federated learning, 10 clients under each edge server are randomly selected to participate in training; in cloud-level federated learning, the parameters of every edge server participate in cloud aggregation.
For local training, a LeNet-style convolutional neural network is used as the model and verified on the MNIST dataset. The model consists of two convolutional layers, one dropout layer, and two fully connected layers. The first convolutional layer has 3 input channels, 10 output channels, and a kernel size of 5; the second convolutional layer has 10 input channels, 20 output channels, and a kernel size of 5. The first fully connected layer has 320 input nodes and 50 output nodes; the second fully connected layer has 50 input nodes and 10 output nodes. The input passes through the first convolutional layer, followed by max pooling and ReLU activation; then through the second convolutional layer and the dropout layer, again followed by max pooling and ReLU activation; finally the data is flattened and passed through the two fully connected layers, and the output is returned.
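A sketch of the described network in PyTorch; the layer sizes follow the description above (3→10 and 10→20 convolutions with kernel size 5, a dropout layer, and 320→50→10 fully connected layers), but this is a reconstruction for illustration, not the reference code of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 10, kernel_size=5)   # 3 input channels, 10 output channels
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 10 input channels, 20 output channels
        self.drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)                  # 320 = 20 * 4 * 4 for 28x28 inputs
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))              # conv1 -> max pool -> ReLU
        x = F.relu(F.max_pool2d(self.drop(self.conv2(x)), 2))   # conv2 -> dropout -> pool -> ReLU
        x = x.view(x.size(0), -1)                               # flatten
        return self.fc2(F.relu(self.fc1(x)))                    # two fully connected layers

# Hypothetical input: a batch of 8 three-channel 28x28 images.
print(LeNetLike()(torch.randn(8, 3, 28, 28)).shape)             # torch.Size([8, 10])
```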
In this embodiment, 1000 images are randomly selected from the 60000 labeled images and assigned to the cloud, 5000 images are divided evenly into 5 parts and assigned to the 5 edge servers for testing the accuracy of the model at the edge and in the cloud, and the remaining 54000 images are distributed evenly among the 250 clients. The distribution is independent and identically distributed (IID): samples are allocated uniformly at random to both the edge servers and the clients.
This embodiment simulates the time required to upload the model parameters to the edge server after local training is completed. The distances from all clients to their edge servers are normalized to (0, 1), the signal strength is assumed to depend only on free-space path loss, and the signal-to-noise ratio of the farthest client within an edge server's range is set to 5. According to the maximum information transfer rate (channel capacity) formula, the upload delay is then the parameter size divided by the achievable rate.
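A sketch of this simulation, assuming the received SNR follows free-space path loss (∝ 1/s²) and equals 5 for the farthest client at normalized distance s = 1, with the rate given by the channel-capacity formula; the bandwidth and model-size values are placeholders.

```python
import numpy as np

def simulated_upload_delay(model_size_bits, distance, bandwidth_hz=1e6, snr_at_edge=5.0):
    """Upload delay via channel capacity with assumed 1/s^2 SNR scaling (sketch)."""
    snr = snr_at_edge / max(distance, 1e-3) ** 2       # SNR = 5 at the range edge (s = 1)
    rate = bandwidth_hz * np.log2(1.0 + snr)           # maximum information transfer rate
    return model_size_bits / rate

for s in (0.2, 0.5, 1.0):                              # normalized client-server distances
    print(s, simulated_upload_delay(1e6, s))
```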
In the experiment, the learning rate on the MNIST dataset is 0.01 and decays to 0.995 times its previous value in each round; in each round of local training, the model is iterated locally for 40 epochs.
Comparative experiment settings:
· Hierarchical federated learning (baseline): before training starts and after each round of cloud aggregation, each of the 5 edge servers randomly selects 10 clients within its range to participate in training. Clients may move during training; when a client finishes local training and is still within the range of its associated edge server, it uploads its model parameters to that edge server. When an edge server has collected parameters from all of its clients, or the maximum waiting time is exceeded, it performs a round of edge aggregation and uploads the aggregated model parameters to the cloud server. After collecting the parameters from all edge servers, the cloud server performs a round of cloud aggregation.
· Client associates with the nearest server when uploading model parameters (baseline): before training starts and after each round of cloud aggregation, 10 clients are randomly selected within the range of each edge server for training. Clients may move during training; after local training ends, each client associates with the edge server closest to it and uploads its model parameters. After a fixed interval, each edge server performs edge aggregation on the parameters whose upload has finished; parameters that were not uploaded in time are placed in a buffer and participate in later edge aggregations according to their upload delay, and the parameters in the buffer are corrected in each round according to the staleness function. The cloud server receives the model parameters uploaded by all edge servers and the numbers of participating clients almost simultaneously, and performs cloud aggregation.
· Asynchronous hierarchical federated learning based on bandwidth pre-allocation: before training starts and after each round of cloud aggregation, the number of clients to select within each edge server's range is computed from the remaining bandwidth of each edge server and the transfer matrix, and the remaining bandwidth of all edge servers is set to 0. Clients may move during training; after local training ends, each client computes the upload delay from the remaining bandwidth and its distance to each edge server whose range it is in, and selects the edge server with the smallest delay for association and upload of its model parameters (allocated bandwidth = min(self-defined maximum allocatable bandwidth, remaining bandwidth + 1)); after the upload completes, the remaining bandwidth of that edge server is increased by 1. If a client fails to upload to its pre-allocated edge server within a certain time, that edge server automatically releases the bandwidth. After a fixed interval, each edge server performs edge aggregation on the parameters whose upload has finished and uploads them to the cloud server. The cloud server receives the model parameters uploaded by all edge servers and the numbers of participating clients, and performs cloud aggregation.
Analysis of the results:
The experimental results are shown in Figures 3 and 4. Since the duration of each aggregation round in hierarchical federated learning is not fixed, the experiment uses training time as the abscissa, taking the duration of one round of cloud aggregation in the algorithm of this embodiment as one time unit. The ordinates are the accuracy and loss of the model on the test set after cloud aggregation at the corresponding time.
As can be seen from the figures, although the algorithm of this embodiment uses asynchronous aggregation and is therefore inferior to the synchronous algorithm in terms of model convergence and stability, in this scenario hierarchical federated learning needs 113 time units for the model to reach an accuracy of 0.9, whereas the algorithm of this embodiment needs only 33 time units; it converges quickly and reaches an accuracy of 0.95 at 108 time units, achieving a better overall result. The experiment does not account for the local iteration delay, so the actual advantage would be less pronounced.
For the hierarchical federated learning algorithm in this scenario, because clients move during training, model parameters may not be uploaded to the corresponding edge server on time. With a maximum waiting time set, some clients end up not participating in the edge model aggregation, which makes the curve unstable and degrades the final aggregation result.
Associating each client with the nearest server when uploading model parameters can greatly shorten the upload delay within the hierarchical federated learning framework, as shown in Figure 5, where the abscissa is the time a client needs to upload its model to its originally associated edge server and the ordinate is the time needed to upload to the nearest edge server, with the maximum value normalized to 1. However, this method may cause the initially assigned clients to cluster within the range of a few edge servers after training, so that those edge servers have to accept additional clients; if bandwidth cannot be allocated, clients are blocked in the upload stage, which ultimately affects model convergence and makes the curve unstable.
Asynchronous hierarchical federated learning based on bandwidth pre-allocation builds on the greedy association by adjusting the initial client assignment so that, when local training completes, the number of clients under each edge server is relatively even, which makes the convergence process more stable. After collecting the parameters from clients with short upload delays, the edge server allocates more bandwidth to clients with long upload delays, accelerating the upload of model parameters so that the parameters of more clients can participate in edge aggregation in time.
The technical means disclosed in the solution of the present invention are not limited to those disclosed above, but also include technical solutions composed of any combination of the above technical features.
Guided by the above ideal embodiments of the present invention and the above description, those skilled in the art can make various changes and modifications without departing from the technical idea of the present invention. The technical scope of the present invention is not limited to the content of the description and must be determined according to the scope of the claims.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311172306.0A CN117221122B (en) | 2023-09-12 | 2023-09-12 | Asynchronous layered joint learning training method based on bandwidth pre-allocation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311172306.0A CN117221122B (en) | 2023-09-12 | 2023-09-12 | Asynchronous layered joint learning training method based on bandwidth pre-allocation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117221122A CN117221122A (en) | 2023-12-12 |
CN117221122B true CN117221122B (en) | 2024-02-09 |
Family
ID=89034669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311172306.0A Active CN117221122B (en) | 2023-09-12 | 2023-09-12 | Asynchronous layered joint learning training method based on bandwidth pre-allocation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117221122B (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11244242B2 (en) * | 2018-09-07 | 2022-02-08 | Intel Corporation | Technologies for distributing gradient descent computation in a heterogeneous multi-access edge computing (MEC) networks |
US20210073639A1 (en) * | 2018-12-04 | 2021-03-11 | Google Llc | Federated Learning with Adaptive Optimization |
US20230196092A1 (en) * | 2021-12-21 | 2023-06-22 | Beijing Wodong Tianjun Information Technology Co., Ltd. | System and method for asynchronous multi-aspect weighted federated learning |
US20230214642A1 (en) * | 2022-01-05 | 2023-07-06 | Google Llc | Federated Learning with Partially Trainable Networks |
- 2023-09-12: Application CN202311172306.0A filed in China; patent CN117221122B — status: Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111447083A (en) * | 2020-03-10 | 2020-07-24 | 中国人民解放军国防科技大学 | Federal learning framework under dynamic bandwidth and unreliable network and compression algorithm thereof |
CN113469325A (en) * | 2021-06-09 | 2021-10-01 | 南京邮电大学 | Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control |
CN115526333A (en) * | 2022-08-31 | 2022-12-27 | 电子科技大学 | Federal learning method for dynamic weight under edge scene |
CN115766475A (en) * | 2022-10-31 | 2023-03-07 | 国网湖北省电力有限公司信息通信公司 | Semi-asynchronous power federated learning network and its communication method based on communication efficiency |
CN116484976A (en) * | 2023-04-26 | 2023-07-25 | 吉林大学 | A method for asynchronous federated learning in wireless networks |
CN116702881A (en) * | 2023-05-15 | 2023-09-05 | 湖南科技大学 | Multilayer federal learning scheme based on sampling aggregation optimization |
CN116681126A (en) * | 2023-06-06 | 2023-09-01 | 重庆邮电大学空间通信研究院 | Asynchronous weighted federation learning method capable of adapting to waiting time |
Non-Patent Citations (1)
Title |
---|
Joint multi-task resource optimization for edge computing based on asynchronous-reward deep deterministic policy gradient; Zhou Heng; Li Lijun; Dong Zengshou; Application Research of Computers; Vol. 40, No. 05; 1491-1496 *
Also Published As
Publication number | Publication date |
---|---|
CN117221122A (en) | 2023-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109684075A (en) | Method for unloading computing tasks based on edge computing and cloud computing cooperation | |
CN113141317B (en) | Streaming media server load balancing method, system, computer equipment, and terminal | |
CN114143891A (en) | Multidimensional resource collaborative optimization method based on FDQL in mobile edge network | |
CN112118312B (en) | A network burst load evacuation method for edge servers | |
CN110505644B (en) | User task unloading and resource allocation joint optimization method | |
CN112020103A (en) | Content cache deployment method in mobile edge cloud | |
CN106101196B (en) | A task scheduling system for cloud rendering platform based on probability model | |
CN110519776A (en) | Balanced cluster and federated resource distribution method in a kind of mist computing system | |
Qi et al. | Deep reinforcement learning based task scheduling in edge computing networks | |
CN117076132B (en) | Resource allocation and aggregation optimization method and device for hierarchical federal learning system | |
CN116384504A (en) | Federal migration learning system | |
CN106792995B (en) | User access method for guaranteeing low-delay content transmission in 5G network | |
CN113988160A (en) | Semi-asynchronous layered federal learning updating method based on timeliness | |
CN115278708A (en) | A mobile edge computing resource management method for federated learning | |
CN114222371A (en) | Flow scheduling method for coexistence of eMBB (enhanced multimedia broadcast/multicast service) and uRLLC (unified radio link control) equipment | |
Feng et al. | Feddd: Toward communication-efficient federated learning with differential parameter dropout | |
CN109889573B (en) | NGSA multi-target-based copy placement method in hybrid cloud | |
CN117221122B (en) | Asynchronous layered joint learning training method based on bandwidth pre-allocation | |
CN114022731A (en) | Federal learning node selection method based on DRL | |
CN112084034A (en) | MCT scheduling method based on edge platform layer adjustment coefficient | |
WO2024036909A1 (en) | Fair load unloading and migration method for edge service network | |
CN115329987A (en) | User selection method in federated learning system | |
Xu et al. | A ring topology-based communication-efficient scheme for d2d wireless federated learning | |
CN115766475A (en) | Semi-asynchronous power federated learning network and its communication method based on communication efficiency | |
CN114970823A (en) | A real-time edge-cloud collaborative convolutional neural network reasoning method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||