CN113923605B

CN113923605B - Distributed edge learning system and method for industrial internet

Info

Publication number: CN113923605B
Application number: CN202111240693.8A
Authority: CN
Inventors: 江智慧; 余官定; 袁建涛; 刘胜利
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-10-25
Filing date: 2021-10-25
Publication date: 2022-08-09
Anticipated expiration: 2041-10-25
Also published as: CN113923605A

Abstract

The invention discloses a distributed edge learning system and method facing to an industrial internet, which comprises a base station and computing equipment, wherein the two computing equipment transmit data through a D2D link; after training a local model according to local data, the computing equipment shares the local model with all other adjacent computing equipment through broadcasting according to the optimal broadcast data rate and the optimal bandwidth, estimates a global model according to the local model shared by other adjacent computing equipment, and uploads the optimal broadcast data rate and the relevant information of the local model to a base station; and the base station determines the model deviation between the estimated global model and the real global model of each computing device according to the local model related information, the optimal broadcast data rate and the device network information, defines the deviation reduction rate as the model deviation reduction amount of all the computing devices per increasing unit bandwidth, and performs bandwidth allocation by taking the deviation reduction rates of all the computing devices as the same target so as to determine the optimal bandwidth allocated to the computing devices.

Description

Distributed edge learning system and method for industrial internet

Technical Field

The invention relates to the field of artificial intelligence and communication, in particular to a distributed edge learning system and method for industrial internet.

Background

In recent years, machine learning has been considered as a promising driver to inject Artificial Intelligence (AI) into the network edge. As the data generated by massive distributed edge devices has increased explosively, traditional centralized learning algorithms have been replaced by distributed learning algorithms. It has resulted in the emergence of a new computing paradigm, edge learning, that can quickly access distributed data and utilize the computing resources of various edge devices.

In distributed Learning, Federal Learning (FL) and parameter-server training are two major frameworks. They employ a distributed random Gradient Descent (SGD) algorithm and are designed for different scenarios. Since data privacy is critical today, federal learning is widely studied. In the federal learning framework, devices do not need to transmit raw data to a data center. In order to utilize rich data local to the device, in federal learning, the device only needs to upload its local gradient or local model to the center periodically, and then the center aggregates them to obtain a high-quality global model. Unlike federal learning, the purpose of parameter-server training is to solve the large-scale learning problem, and data privacy is not a major concern. In this framework, the model is chunked by the server. Each device downloads a portion of the data from the server and trains only a specified portion of the model. The server then collects the patch models from the devices to update the global model.

Both of the above frameworks are centralized and both rely on a central node to collect model parameters or gradients. However, such an architecture may cause central node congestion. Therefore, the centralized framework is not always suitable in some scenarios, such as scenarios with high requirements on robustness and security of the system. Therefore, decentralized learning frameworks that rely on point-to-point communication replace the centralized framework in these scenarios. But considering the decentralized edge learning framework in wireless systems will encounter some problems. In this scenario, a Device-to-Device (D2D) link is used for data transmission. Due to random channel fading and noise, wireless D2D communication is always unreliable. It cannot be simply assumed that the communication between the devices is perfect and transmission errors of the D2D link need to be taken into account. Transmission errors may cause some devices to fail to receive the local models of all other devices, affecting the quality of the global model, and thus affecting the convergence rate of the model.

This therefore prompted the study of a decentralized edge learning framework based on unreliable D2D communication. How to improve the convergence rate of the framework and to propose a method to realize the framework is an urgent problem to be solved.

Disclosure of Invention

In view of the foregoing, it is an object of the present invention to provide a distributed edge learning system and method for industrial internet. The system and the method reduce the model deviation and improve the convergence rate of the model in a given time delay by jointly optimizing the broadcast data rate and the bandwidth allocation, and moreover, the computing equipment does not need to transmit local original data, so that the privacy and the safety of a user can be well protected.

In order to achieve the purpose, the invention provides the following technical scheme:

in a first aspect, an embodiment provides an industrial internet-oriented distributed edge learning system, which includes a base station and a plurality of computing devices, wherein data is transmitted between the computing devices through D2D links;

the computing device is used for local model training and estimating a global model, and comprises: after a local model is trained according to local data, the local model is shared with all other adjacent computing equipment through broadcasting according to the optimal broadcast data rate and the optimal bandwidth, a global model is estimated according to the local model shared by other adjacent computing equipment, and the optimal broadcast data rate and the relevant information of the local model are uploaded to a base station, wherein the optimal broadcast data rate is determined according to the size of the local model, the total time delay of each round of training and the local computation time delay;

the base station is used for network coordination and bandwidth allocation, and comprises: determining the model deviation between the estimated global model and the real global model of each computing device according to the relevant information of the local model, the optimal broadcast data rate and the device network information, defining the deviation reduction rate as the model deviation reduction amount of all the computing devices per increasing unit bandwidth, and performing bandwidth allocation by taking the deviation reduction rates of all the computing devices as the same target to determine the optimal bandwidth among the computing devices.

In one embodiment, when the computing device trains the local model from the local data, the model is updated according to a gradient descent method using the following formula:

wherein k represents a calculationDevice index,. l denotes the number of training rounds,. eta ^(l-1) The learning rate of the l-1 st round is expressed,

a local model representing the ith computing device update,

a local model representing the l-1 th round of the kth computing device,

representing the gradient vector of the k-th computing device.

In one embodiment, the optimal broadcast data rate determined according to the size of the local model, the total time delay of each round of training, and the local computation time delay is expressed as:

wherein the content of the first and second substances,

represents the optimal broadcast data rate, S represents the local model size, T represents the total delay of a given round of training,

and representing local calculation time delay, wherein the total time delay of each round of training comprises two parts, one part is the local calculation time delay, namely the time required by the calculation device to update the local model, and the other part is communication time delay, namely the time required by the calculation device to broadcast and share to the adjacent calculation devices.

In one embodiment, the global model is estimated from local models shared from neighboring other computing devices using the following formula:

wherein the content of the first and second substances,

a global model representing the ith computing device's round estimate, i and k being indices of the computing devices,

local model, α, representing the ith computing device update _k,i Is a binary index, when the instantaneous channel capacity from the k-th computing device to the i-th computing device is greater than or equal to the optimal broadcast data rate, the local model of the k-th computing device can be successfully shared with the i-th computing device, and then alpha is _k,i 1 is ═ 1; otherwise, α _k,i K denotes the number of all computing devices adjacent to the ith computing device.

In one embodiment, the local model related information includes a modulus of the local model of the current round, a modulus of a difference between the local model of the current round and the local model of the previous round, a modulus of a gradient of the previous round, a local loss reduction amount, a size of the local model of the current round, a total time delay of each round of training, and a local computation time delay.

In one embodiment, the determining a model bias between the estimated global model and the true global model for each computing device as a function of the local model-related information, the optimal broadcast data rate, and the device network information is:

wherein the content of the first and second substances,

represents the optimal broadcast data rate for the k-th computing device, B _k Representing the bandwidth, P, allocated to the kth computing device for broadcast _k Representing the transmission power, σ, of the k-th computing device _k,i Represents the variance of the k-th to i-th computing device link channel power gain, N ₀ Representing noise power, K representing the number of computing devices, Delta _k,i Is represented byModel bias of the i-th computing device due to unreliable transmission from the k-th computing device to the i-th computing device, A _k Local model related information representing the kth computing device includes a modulus of the local model of the current round, a modulus of a difference between the local model of the current round and the local model of the previous round, a modulus of a gradient of the previous round, and a local loss reduction amount.

In one embodiment, when bandwidth allocation is performed with the target of making the deviation reduction rates of all the computing devices the same, the bandwidth allocation algorithm adopted is as follows:

(a) inputting information about the local model, the transmission power P _k Variance σ of power gain of link channel _k,i Noise power N ₀ ；

(b) Initializing lagrange parameter lambda ⁽⁰⁾ And the number of training rounds l is 0;

(c) using a gradient descent method to obtain values of the Lagrangian parameters of the next round, i.e.

Wherein eta is _λ Denotes the step size, L ₂ ({B _k }, λ) represents the corresponding lagrange function;

(d) initializing bandwidth allocation

And the number of iterations j is 0;

(e) obtaining bandwidth allocation after a new iteration by using gradient descent method, i.e.

Wherein the content of the first and second substances,

denotes the step size, Δ _k,i Representing a model bias for the ith computing device resulting from unreliable transmission between the kth computing device to the ith computing device;

(f) the number of iterations increases, i.e., j + 1;

(g) repeating the steps (e) to (f) until convergence;

(h) obtaining current optimal bandwidth allocation, wherein l is l + 1;

(i) repeating the steps (c) to (h) until convergence;

(j) and outputting the optimal bandwidth allocation.

In one embodiment, the base station acquires long time scale information of the computing device during network coordination, and performs channel estimation based on the long time scale information.

In a second aspect, an embodiment provides an industrial internet-oriented distributed edge learning method, where the distributed edge learning method employs the distributed edge learning system of the first aspect, and the distributed edge learning method includes the following steps:

step 1, a base station assists a computing device in network coordination;

step 2, the computing equipment feeds back the relevant information of the local model to the base station;

step 3, the base station performs bandwidth allocation based on the received local model related information;

step 4, the computing device updates the local model by using the local data set of the computing device;

and 5, sharing the local model to the adjacent computing equipment by the computing equipment based on the allocated optimal bandwidth and the allocated optimal broadcast data rate, and receiving the local model of the adjacent computing equipment to estimate the global model.

Compared with the prior art, the invention has the beneficial effects that at least:

compared with a central edge learning framework, the decentralized edge learning framework does not need a central node to collect a model and aggregate a global model, so that the congestion of the central node is avoided, and the method is suitable for scenes with high requirements on the robustness and the safety of a system; considering a decentralized edge learning framework based on D2D communication is more consistent with a real system than the hypothetical communication ideal; compared with direct data transmission, the transmission model can fully protect the privacy and the security of user data; the overall deviation of the estimated global model and the real global model can be reduced by jointly optimizing the broadcast data rate and the bandwidth allocation, and the accuracy and the convergence rate of the model are improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of an industrial Internet-oriented distributed edge learning system provided by an embodiment;

FIG. 2 is a diagram illustrating an iterative process of computing and sharing by a computing device, according to an embodiment;

FIG. 3 is a flow diagram of bandwidth allocation provided by an embodiment;

fig. 4 is a flowchart of a distributed edge learning method according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

Example 1

Fig. 1 is a schematic diagram of an industrial internet-oriented distributed edge learning system provided in embodiment 1, and as shown in fig. 1, the distributed edge learning system provided in the embodiment includes a base station and a plurality of computing devices, and is intended to cooperatively train a learning model. Each computing device may be a single antenna device in communication with a base station. To take advantage of the proximity of computing devices and reduce data traffic in the system, data is transmitted between two computing devices and the computing devices over D2D links.

The computing device may be a mobile terminal disposed at an edge terminal, communicating with the base station via mobile communication technology. Of course, the computing device may employ other wireless communication means. In the system, a plurality of computing devices adopt parallel computing to improve learning efficiency, namely, each computing device trains the whole model on local data, specifically comprising the training of the local model and the estimation of the global model.

As shown in fig. 2, each computing device contains two phases in each round of training, the first phase being a computing phase that updates the local model with a subset of local data. Updating the model according to the gradient descent method using the following formula:

where k denotes the computing device index, l denotes the number of training rounds, η ^(l-1) The learning rate of the l-1 st round is expressed,

a local model representing the ith computing device update,

a local model representing the l-1 th round of the kth computing device,

representing the gradient vector of the k-th computing device.

The second phase is a communication phase in which each computing device shares its local model with all neighboring other computing devices through broadcast. On the one hand, each computing device broadcasts its local model to all other computing devices, and on the other hand, each computing device receives a local model shared by the other computing devices for estimating a global model. In an embodiment, the global model is estimated from local models shared from other neighboring computing devices using the following formula:

wherein the content of the first and second substances,

indicating the ith computing deviceThe global model of the l rounds of estimation, i and k are both indices of the computing device,

And the two stages are iteratively executed until the model converges to obtain a final global model.

In the second phase, each computing device shares the local model with all neighboring other computing devices via broadcast according to the optimal broadcast data rate and optimal bandwidth. The optimal broadcast data rate is determined according to the size of the local model, the total time delay of each round of training and the local calculation time delay. Preferably, the optimum broadcast data rate may be determined using the following equation:

wherein the content of the first and second substances,

and expressing local computation time delay, wherein the total time delay of each round of training comprises two parts, one part is the local computation time delay, namely the time required by the computing equipment to update the local model according to the self learning capability and the size of the local data set, and the other part is the communication time delay, namely the time required by the computing equipment to broadcast and share the local model to the adjacent computing equipment according to the size of the model and the broadcast data rate of the computing equipment.

After one round of training is carried out by each computing device, the relevant information of the local model comprising the modulus of the local model of the current round, the modulus of the difference between the local model of the current round and the local model of the previous round, the modulus of the gradient of the previous round, the local loss reduction amount, the size of the local model of the current round, the total time delay of each round of training and the local calculation time delay is uploaded to the base station, so that the base station can carry out bandwidth allocation. Because the optimal broadcast data rate is obtained by calculating the size of the local model of the current round, the total time delay of each round of training and the local calculation time delay, the relevant information of the local model containing the current round local model, the model of the difference between the current round local model and the previous round local model, the model of the previous round gradient, the local loss reduction and the optimal broadcast data rate can be uploaded to the base station.

In the embodiment, the base station is not a server for collecting the model parameters or the gradient parameters, and mainly performs network coordination of the device, that is, mainly performs channel estimation. Since D2D channel estimation is expensive, the instantaneous channel state information of all D2D links is unknown. The path loss of all links depends mainly on stable information such as position, and the change is slow and known. To reduce the signaling overhead of the D2D channel estimation, the base station obtains long time scale information, such as link distance between devices, instead of instantaneous channel state information for all D2D links, and performs channel estimation based on the long time scale information.

In this embodiment, the ultimate goal is to reduce the model bias caused by unreliable transmission by jointly optimizing the broadcast data rate and bandwidth allocation under the constraint of given bandwidth and delay, so as to improve the convergence rate of model training. Therefore, the base station is further configured to allocate a bandwidth, which specifically includes: determining the model deviation between the estimated global model and the real global model of each computing device according to the relevant information of the local model, the optimal broadcast data rate and the device network information, defining the deviation reduction rate as the model deviation reduction amount of all the computing devices per increasing unit bandwidth, and performing bandwidth allocation by taking the deviation reduction rates of all the computing devices as the same target to determine the optimal bandwidth among the computing devices.

In an embodiment, the absence of instantaneous channel state information results in unreliable transmission of data, and therefore the global model estimated locally by the computing devices may differ from the true global model, and to this end, the model bias between the estimated global model and the true global model for each computing device is determined, which may be determined by the following formula:

wherein the content of the first and second substances,

represents the optimal broadcast data rate for the k-th computing device, B _k Representing the bandwidth, P, allocated to the kth computing device for broadcast _k Representing the transmission power, σ, of the k-th computing device _k,i Represents the variance of the k-th to i-th computing device link channel power gain, N ₀ Representing noise power, K representing the number of computing devices, Delta _k,i Representing the model deviation of the i-th computing device resulting from unreliable transmission from the k-th computing device to the i-th computing device, A _k Local model related information representing the kth computing device includes a modulus of the local model of the current round, a modulus of a difference between the local model of the current round and the local model of the previous round, a modulus of a gradient of the previous round, and a local loss reduction amount.

In an embodiment, the bandwidth allocation is targeted such that the rate of droop reduction is the same for all computing devices, where the rate of droop reduction is defined as the amount of model droop reduction for all devices per unit bandwidth increase. As shown in fig. 3, the bandwidth allocation algorithm adopted is:

(a) inputting information about the local model, the transmission power P _k Variance σ of power gain of link channel _k,i Noise power N ₀ Wherein, the local model related information includes a module of the local model of the current round, a module of the difference between the local model of the current round and the local model of the previous round, a module of the gradient of the previous round, a local loss reduction amount, the size S of the local model of the current round, and the size S of the local model of each roundTotal time delay T, local calculation time delay

(d) initializing bandwidth allocation

And the number of iterations j is 0;

Wherein the content of the first and second substances,

(f) the number of iterations increases, i.e., j + 1;

(g) repeating the steps (e) to (f) until convergence;

(h) obtaining current optimal bandwidth allocation, wherein l is l + 1;

(i) repeating the steps (c) to (h) until convergence;

(j) and outputting the optimal bandwidth allocation.

Compared with a central edge learning frame, the distributed edge learning system provided in embodiment 1 does not need a central node to collect models and aggregate global models by a decentralized edge learning frame, avoids congestion of the central node, and is suitable for scenes with high requirements on robustness and safety of the system; considering a decentralized edge learning framework based on D2D communication is more consistent with a real system than the hypothetical communication ideal; compared with direct data transmission, the transmission model can fully protect the privacy and the security of user data; the overall deviation of the estimated global model and the real global model can be reduced by jointly optimizing the broadcast data rate and the bandwidth allocation, and the accuracy and the convergence rate of the model are improved while the training time delay is ensured.

Example 2

Fig. 4 is a flowchart of a distributed edge learning method according to an embodiment. The distributed edge learning method adopts the distributed edge learning system provided by embodiment 1, and the distributed edge learning method comprises the following steps:

step 1, a base station assists a computing device in network coordination;

According to the distributed edge learning method, the convergence rate of the model is improved within the specified learning time by adjusting the broadcast data rate and bandwidth allocation, and the computing equipment at the edge end does not need to transmit original data, so that the privacy and the safety of a user can be better protected.

In embodiments 1 and 2, the wireless communication method may be an existing mobile communication network, that is, an LTE (Long-term Evolution) or 5G network, or a WiFi network. The computing equipment can be a mobile terminal which can support model training, such as a modern smart phone, a tablet computer, a notebook computer, an automatic driving automobile and the like, is provided with a wireless communication system, and can be accessed to mainstream wireless communication networks such as a mobile communication network and WiFi.

The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. The distributed edge learning system facing the industrial Internet is characterized by comprising a base station and a plurality of computing devices, wherein data are transmitted between the computing devices through D2D links;

the base station is used for network coordination and bandwidth allocation, and comprises: determining a model deviation between an estimated global model and a real global model of each computing device according to the relevant information of the local model, the optimal broadcast data rate and the device network information, defining a deviation reduction rate as the model deviation reduction amount of all computing devices per increasing unit bandwidth, and performing bandwidth allocation by taking the deviation reduction rates of all computing devices as the same target to determine the optimal bandwidth allocated to the computing devices;

when the computing equipment trains the local model according to the local data, the model is updated according to a gradient descent method by adopting the following formula:

a local model representing the ith computing device update,

a local model representing the l-1 th round of the kth computing device,

a gradient vector representing a kth computing device;

the optimal broadcast data rate determined according to the size of the local model, the total time delay of each round of training and the local calculation time delay is represented as follows:

wherein the content of the first and second substances,

the method comprises the steps of representing local calculation time delay, wherein the total time delay of each round of training comprises two parts, one part is the local calculation time delay, and the time required by the calculation equipment for updating a local model is calculated; the other part is communication delay, and the computing equipment broadcasts and shares the time required by the adjacent computing equipment;

wherein the global model is estimated from local models shared from other neighboring computing devices using the following formula:

wherein the content of the first and second substances,

local model, α, representing the ith computing device update _k,i Is a binary index, when the instantaneous channel capacity from the k-th computing device to the i-th computing device is greater than or equal to the optimal broadcast data rate, the local model of the k-th computing device can be successfully shared with the i-th computing device, and then alpha is _k,i 1 is ═ 1; otherwise, α _k,i K represents the number of all computing devices adjacent to the ith computing device;

the relevant information of the local model comprises a module of a local model of the current round, a module of the difference between the local model of the current round and the local model of the previous round, a module of the gradient of the previous round, the local loss reduction, the size of the local model of the current round, the total time delay of each round of training and the local calculation time delay;

wherein the determining of the model bias between the estimated global model and the true global model for each computing device based on the local model related information, the optimal broadcast data rate, and the device network information is:

wherein the content of the first and second substances,

represents the optimal broadcast data rate for the k-th computing device, B _k Representing the bandwidth, P, allocated to the kth computing device for broadcast _k Representing the transmission power, σ, of the k-th computing device _k,i Represents the variance of the k-th to i-th computing device link channel power gain, N ₀ Representing the noise power, K representing the calculation deviceNumber of standby, Δ _k,i Representing the model deviation of the i-th computing device resulting from unreliable transmission from the k-th computing device to the i-th computing device, A _k Local model related information representing the kth computing device comprises a modulus of a local model of the current round, a modulus of a difference between the local model of the current round and the local model of the previous round, a modulus of a gradient of the previous round, and a local loss reduction amount;

wherein, when the bandwidth allocation is performed by taking the deviation reduction rates of all the computing devices as the same target, the adopted bandwidth allocation algorithm is as follows:

(c) obtaining the value of the Lagrangian parameter of the next round by adopting a gradient descent method,

(d) initializing bandwidth allocation

And the number of iterations j is 0;

(e) the bandwidth allocation after a new iteration is obtained by adopting a gradient descent method,

wherein eta is _Bk Denotes the step size, Δ _k,i Representing a model bias for the ith computing device resulting from unreliable transmission between the kth computing device to the ith computing device;

(f) the iteration number is increased, j equals j + 1;

(g) repeating the steps (e) to (f) until convergence;

(h) obtaining current optimal bandwidth allocation, wherein l is l + 1;

(i) repeating the steps (c) to (h) until convergence;

(j) and outputting the optimal bandwidth allocation.

2. The industrial internet-oriented distributed edge learning system of claim 1, wherein the base station obtains long time scale information of the computing device during network coordination, and performs channel estimation based on the long time scale information.

3. An industrial internet-oriented distributed edge learning method, which employs the distributed edge learning system of claim 1 or 2, and includes the steps of:

step 1, a base station assists a computing device in network coordination;