CN116029393A - Federal learning method, apparatus, electronic device, and computer-readable storage medium - Google Patents

Federal learning method, apparatus, electronic device, and computer-readable storage medium

Info

Publication number
CN116029393A
Authority
CN
China
Prior art keywords
model
neighbor
local
current model
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310082825.1A
Other languages
Chinese (zh)
Inventor
刘吉
马北辰
窦德景
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310082825.1A priority Critical patent/CN116029393A/en
Publication of CN116029393A publication Critical patent/CN116029393A/en
Withdrawn legal-status Critical Current

Classifications

    • Y General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02 Technologies or applications for mitigation or adaptation against climate change
    • Y02T Climate change mitigation technologies related to transportation
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a federal learning method, apparatus, electronic device, and computer-readable storage medium, and relates to the technical field of artificial intelligence, in particular to the fields of federal learning, image processing, and the like. The specific implementation scheme is as follows: parameter optimization is performed based on the current model of the device to obtain a local update model; the local update model is pruned based on the local mask of the device to obtain a local pruning model; based on the local mask corresponding to each neighbor model in a plurality of neighbor models and the local mask of the device, the plurality of neighbor models and the local pruning model are aggregated to obtain a new current model; and related information of the current model is sent to a neighbor device of the device, wherein the related information is used for representing the current model and the local mask of the device. The present disclosure reduces the workload of a single device and greatly reduces the communication overhead in federal learning.

Description

Federal learning method, apparatus, electronic device, and computer-readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the fields of federal learning, image processing, and the like.
Background
In recent years, significant progress has been made in federal learning (Federated Learning, FL) which can utilize distributed data on edge devices for collaborative model training. Federal learning typically utilizes a distributed architecture to transfer gradients or models, rather than raw data, between devices and centralized servers to address security or privacy concerns. Distributed federal learning using centralized servers can create significant communication overhead on the servers.
Disclosure of Invention
The present disclosure provides a federal learning method, apparatus, electronic device, and computer-readable storage medium.
According to an aspect of the present disclosure, there is provided a federal learning method comprising:
parameter optimization is carried out based on the current model of the equipment, so that a local update model is obtained;
pruning the local updating model based on the local mask of the equipment to obtain a local pruning model;
based on the local mask corresponding to each neighbor model in the plurality of neighbor models and the local mask of the device, aggregating the plurality of neighbor models and the local pruning model to obtain a new current model;
and sending relevant information of the current model to neighbor equipment of the equipment, wherein the relevant information is used for representing the current model and the local mask of the equipment.
According to another aspect of the present disclosure, there is provided a federal learning apparatus comprising:
the local updating module is used for carrying out parameter optimization based on the current model of the equipment to obtain a local updating model;
the trimming module is used for trimming the local updating model based on the local mask of the device to obtain a local trimming model;
the aggregation module is used for aggregating the plurality of neighbor models and the local pruning model based on the local mask corresponding to each neighbor model in the plurality of neighbor models and the local mask of the device to obtain a new current model;
and the sending module is used for sending the related information of the current model to the neighbor equipment of the equipment, wherein the related information is used for representing the current model and the local mask of the equipment.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
According to the technical scheme, each device can prune the local update model and then aggregate the local update model with the neighbor model, so that communication overhead is greatly reduced.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary application scenario of a federal learning method of an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an alternative approach to distributed federal learning;
FIG. 3 is a flow diagram of a federal learning method according to an embodiment of the present disclosure;
FIG. 4A is a schematic diagram of a locally updated model in an embodiment of the present disclosure;
FIG. 4B is a schematic diagram of a partial mask in an embodiment of the present disclosure;
FIG. 4C is a schematic diagram of a local pruning model in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of one example application of the federal learning method in an embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of a federal learning device provided in an embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a federal learning apparatus provided in accordance with another embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a federal learning apparatus provided in accordance with another embodiment of the present disclosure;
FIG. 9 is a schematic block diagram of a federal learning apparatus provided in accordance with another embodiment of the present disclosure;
fig. 10 is a schematic block diagram of an electronic device for implementing the federal learning method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to facilitate understanding of the technical solutions of the present disclosure, the following description will explain related technologies of the embodiments of the present disclosure, and the following related technologies may be optionally combined with the technical solutions of the embodiments of the present disclosure as an alternative, which all belong to the protection scope of the embodiments of the present disclosure.
The distributed FL training process typically consists of local model training on each of a plurality of edge devices and model aggregation on a server. The server may select the available devices and broadcast the global model to the selected devices. Each selected device then updates the model according to the local data on the device, which is referred to as local training. After receiving the updated models from the selected devices, the server aggregates them with the global model and generates a new global model, which is referred to as model aggregation.
The training process described above may be synchronous or asynchronous. Under the synchronous FL mechanism, the server performs model aggregation after receiving the updated models of all selected devices. Under the asynchronous FL mechanism, the server may aggregate the models when it has received only part of the models. Since edge devices are typically heterogeneous, with different computing or communication capabilities, synchronous distributed FL is inefficient. Asynchronous FL may result in poor accuracy, and may even fail to converge on non-independent and identically distributed (non-IID) data. Moreover, centralized FL creates a severe communication or computation workload on the server, which becomes a bottleneck and easily leads to inefficiency and a single point of failure. In particular, in the field of image processing, image or video data is often distributed over a plurality of devices; for example, video data of a plurality of cameras used for object detection is acquired by different edge devices, respectively. Collaboratively training the image processing model through FL allows each device to avoid uploading image or video data, which reduces communication cost and protects the privacy data of users. However, distributed FL is limited by the bottleneck it creates on the server side, and the desired reduction of communication cost cannot be achieved in practice.
The federal learning method of embodiments of the present disclosure is used to solve at least one of the above-described technical problems. Fig. 1 is a schematic diagram of an exemplary application scenario of a federal learning method according to an embodiment of the present disclosure. As shown in fig. 1, the FL environment includes a plurality of electronic devices 110 for collaborative training, and devices 0 to 7 are taken as examples in fig. 1. The electronic device 110 may be a terminal, a server cluster, or other processing device. The terminal may be a personal computer, a mobile device, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, an in-vehicle device, a wearable device, or other User Equipment (UE). The server or cluster of servers may be a cloud-centric server or an edge server, etc. In this FL environment, each electronic device 110 implements the cooperative training process without a central server. Wherein each electronic device 110 may communicate with a neighboring device. For example, in the example of fig. 1, device 0 may transmit information to device 1, device 2, or device 4; device 5 may transmit information to device 6, device 7, or device 1.
Optionally, a coordinator 120 is also included in the FL environment. Although a centralized coordinator 120 is deployed, it only manages the index and heartbeat of each device, not participating in the training process of the FL. For example, each electronic device 110 is connected to the coordinator 120 and obtains a device index from the coordinator for federal learning. The coordinator may also be a terminal, a server cluster, or other processing device, for example.
In the above application scenario, each electronic device 110 updates the local model based on the local data set and the models received from the neighboring devices, and after updating, sends the current model to one or more neighboring devices, so as to realize diffusion of the model of each electronic device 110, thereby realizing collaborative training between the electronic devices 110. This approach may be referred to as decentralized federal learning. Fig. 2 shows a schematic diagram of an alternative approach to decentralized federal learning. In this approach, as shown in FIG. 2, taking the $t_i$-th local model update of device i as an example, the process includes:

S21, local training. Illustratively, the model $m_i^{t_i-1}$ in FIG. 2 represents the current model obtained by the $(t_i-1)$-th update in device i; local training is performed on $m_i^{t_i-1}$ to obtain the model $\hat{m}_i^{t_i}$.

S22, model aggregation. Illustratively, device i aggregates the model $\hat{m}_i^{t_i}$ with the cached neighbor models $m_j^{t_j}$ to obtain the model $m_i^{t_i}$, i.e., the new current model obtained by the $t_i$-th update of the model.

S23, model pruning. Illustratively, the model $m_i^{t_i}$ is pruned to obtain the model to be transmitted to the neighbor devices, so as to reduce the model size and thereby reduce the communication overhead.

It will be appreciated that after S21-S23 are executed, a copy of the new current model may be created, and S21-S23 may be executed again to continue updating the current model. In this alternative approach to decentralized federal learning, pruning is performed on the aggregated model, which greatly affects the model accuracy.
FIG. 3 shows a flow diagram of a federal learning method according to an embodiment of the present disclosure. Illustratively, the method may be applied to various electronic devices in a federal learning system, such as various electronic devices 110 in FIG. 1. As shown in fig. 3, the method may include:
s310, performing parameter optimization based on a current model of the equipment to obtain a local update model;
s320, pruning the local update model based on the local mask of the equipment to obtain a local pruning model;
s330, based on the local mask corresponding to each neighbor model in the plurality of neighbor models and the local mask of the device, aggregating the plurality of neighbor models and the local pruning model to obtain a new current model;
and S340, sending relevant information of the current model to neighbor equipment of the equipment, wherein the relevant information is used for representing the current model and local masks of the equipment.
Illustratively, in the embodiments of the present disclosure, the current model refers to a training object that is trained locally in the electronic device, and the neighbor model refers to a model cached by the electronic device that corresponds to other devices in the federal learning system.
Illustratively, in the above step S310, the current model may be parameter-optimized by using stochastic gradient descent (Stochastic Gradient Descent, SGD) based on the local data set, to obtain a local update model.
Illustratively, the partial mask in step S320 described above may refer to a mask corresponding to a specific certain device. The mask is used to indicate whether parameters in the locally updated model of the device are pruned. Illustratively, the mask may include a mask value that corresponds one-to-one to each parameter in the locally updated model, the mask value being used to indicate whether the corresponding parameter is pruned.
Referring to fig. 4A and 4B, fig. 4A shows a schematic diagram of a local update model, and fig. 4B shows a schematic diagram of a corresponding local mask. The model in fig. 4A may include a plurality of parameters, for example 16 parameters, each of which takes a value of 2. The partial mask in fig. 4B includes 16 mask values. From the location of each mask value in fig. 4B, a parameter corresponding to that mask value may be determined in fig. 4A; for example, the mask value of row 1 and column 1 in fig. 4B corresponds to the parameters of row 1 and column 1 in fig. 4A. A mask value of 1 indicates that the parameter is preserved and a mask value of 0 indicates that the parameter is pruned. The mask value in the local mask is multiplied by the corresponding parameter value in the local update model to obtain each parameter value in the local trimming model; for example, based on the model of fig. 4A and the partial mask of fig. 4B, a local pruning model as shown in fig. 4C may be obtained.
Illustratively, in the embodiments of the present disclosure, the local mask includes mask values corresponding to all parameters in the complete model structure. Here, the complete model structure may refer to the structure containing all parameters of the model at the time of initialization, and each model obtained by subsequent updating based on the initialized model is represented in the complete model structure. For example, in the case where a part of the parameters in fig. 4A is pruned, the local pruning model obtained after pruning is still represented in the 4-row, 4-column structure of fig. 4A. As shown in fig. 4C, in the local pruning model, the pruned parameters take a value of 0.
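As a concrete numerical illustration of the masking operation described above, the following minimal NumPy sketch applies a local mask to a local update model by element-wise multiplication; the mask pattern is an assumed example, and the exact pattern of FIG. 4B is not reproduced here:

```python
import numpy as np

# Local update model of FIG. 4A: 16 parameters, each with value 2.
local_update_model = np.full((4, 4), 2.0)

# Local mask: 1 keeps the parameter, 0 prunes it (example pattern, not FIG. 4B).
local_mask = np.array([
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
])

# Local pruning model: pruned positions become 0,
# but the complete 4-row, 4-column model structure is preserved.
local_pruning_model = local_update_model * local_mask
print(local_pruning_model)
```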
According to step S330, in the embodiment of the present disclosure, based on the local mask corresponding to each neighbor model and the local mask of the present device, the plurality of neighbor models and the local pruning model are aggregated to obtain a new current model. Because pruning is not performed after aggregation of the neighbor model and the local model, not all neighbor models are uniformly pruned, and differentiated information among the neighbor models is preserved. And because the local mask of the neighbor model is utilized in the aggregation process, the local mask can embody the pruning structure of the neighbor model obtained after repeated pruning and updating, thereby being beneficial to aggregation of heterogeneous neighbor models and local pruning models and improving the accuracy of the aggregated models.
In step S340, each device may obtain related information of the current model based on the current model and the local mask of the device, and transmit the related information to the neighbor device. Here, the neighbor device may be determined from the topology of the federal learning system. Each device may send relevant information of the current model to one or more neighbor devices, e.g., randomly select one neighbor device as a target neighbor device, to which relevant information of the current model is sent. In this way, the target neighbor device is used as another electronic device in the federal learning system, and according to the received related information, a neighbor model can be determined and used for combining with the local pruning model of the electronic device to update the current model of the electronic device.
According to the method, in one updating process of the current model, the electronic equipment performs parameter optimization on the current model to obtain the local updating model. And trimming the local updated model to obtain a local trimming model. Then, the electronic device obtains a new current model based on the cached plurality of neighbor models and the local pruning model. In practical application, the electronic device may update the current model multiple times, and after each update to obtain a new current model, send relevant information of the current model to the neighboring device, and execute the next update. That is, the above steps S310 to S340 may be iteratively performed.
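For orientation, the per-device flow of steps S310 to S340 can be sketched as the following toy Python example; the update rule, the weights, and all numerical values are illustrative assumptions rather than the disclosed implementation:

```python
import numpy as np

def local_update(model, lr=0.1):
    # S310 (toy stand-in): one gradient-like step; a real device would run SGD on local data.
    return model - lr * model

def prune(model, mask):
    # S320: element-wise multiplication with the local mask (cf. FIGS. 4A-4C).
    return model * mask

def aggregate(local_pruned, local_mask, neighbor_models, neighbor_masks, weights):
    # S330: per-parameter weighted sum divided by the weight sum of the relevant (unpruned) models.
    num = weights[0] * local_pruned + sum(w * m for w, m in zip(weights[1:], neighbor_models))
    den = weights[0] * local_mask + sum(w * o for w, o in zip(weights[1:], neighbor_masks))
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)

# Toy data for one update round of a single device (all values are illustrative).
current_model = np.full((4, 4), 2.0)
local_mask = np.ones((4, 4))
neighbor_models = [np.full((4, 4), 1.0)]
neighbor_masks = [np.ones((4, 4))]
weights = [0.5, 0.5]  # [weight of current model, weight of neighbor model]

updated = local_update(current_model)                 # S310
pruned = prune(updated, local_mask)                   # S320
current_model = aggregate(pruned, local_mask,         # S330
                          neighbor_models, neighbor_masks, weights)
print(current_model)                                  # S340 would send this model and the local mask
```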
The iterative implementation of the federal learning method is described below using a specific application example. This application example may be implemented in the scenario shown in fig. 1. Specifically, the local data set of device i in the federal learning system is denoted $\mathcal{D}_i=\{(x_i,y_i)\}$, where $(x_i,y_i)$ represents a sample, e.g., $x_i$ represents a sample image and $y_i$ represents the supervision label corresponding to the sample image; $s_i$ represents the number of samples on device i.

In this application example, the model training target of federal learning can be expressed by the following formula (1):

$$\min_{m} F(m) \triangleq \sum_{i=1}^{n}\frac{s_i}{s}F_i(m_i), \qquad F_i(m_i) \triangleq \frac{1}{s_i}\sum_{(x_i,y_i)\in\mathcal{D}_i} f(m_i,x_i,y_i) \tag{1}$$

where $f(m_i,x_i,y_i)$, as the local loss function of device i, measures the error of the current model $m_i$ in device i on the data $\{x_i,y_i\}$; $\triangleq$ denotes a definition, i.e., $F_i(m_i)$ is defined as the average loss of the current model in device i over its samples. Furthermore, $m$ represents the global model, e.g., $m$ may comprise the parameters of the global model; $s$ represents the total number of samples in the system; and $n$ is the number of devices in the system. That is, $F(m)$ is defined as the (sample-weighted) average loss over the devices. The model training target described above can be understood as minimizing the global loss.
In this application example, the electronic devices may be connected by an exponential topology, in which each device has O(log(n)) neighbor devices, where n is the number of devices in the system. Let $w_{i,j}$ denote the weight of transmitting information from device j to device i; in the exponential topology, $w_{i,j}$ can be determined based on the following formula (2):

$$w_{i,j}\;\begin{cases} >0, & \text{if } i=j \text{ or } i\equiv j+2^{k}\ (\mathrm{mod}\ n),\ k\in\mathbb{Z}_{\ge 0} \\ =0, & \text{otherwise} \end{cases} \tag{2}$$

where $\mathbb{Z}_{\ge 0}$ represents the set of non-negative integers. Formula (2) indicates that when $i\equiv j+2^{k}\ (\mathrm{mod}\ n)$, $w_{i,j}>0$, i.e., device j may transmit information to device i; accordingly, device i may receive the related information of the current model of device j transmitted by device j and determine the neighbor model corresponding to device j. When $i\ne j$ and $i\not\equiv j+2^{k}$, $w_{i,j}=0$, indicating that device j does not transmit information to device i.

Further, a topology matrix $W=[w_{i,j}]\in\mathbb{R}^{n\times n}$ may be defined, where $\mathbb{R}^{n\times n}$ represents the set of matrices with n rows and n columns, and $w_{i,j}$, the element in row i and column j of the matrix, represents the weight for transmitting information from device j to device i. The topology matrix indicates whether information can be transmitted between any two devices in the federal learning system.
According to the definition above, each device obtains the device index from the coordinator, then each device (device i for example) performs a model update, after which each device sends its updated model to one randomly selected neighbor device. The model updating and transmitting steps (corresponding to steps S310 to S340) of each device are repeatedly performed until a predetermined condition is satisfied, and the training target is considered to be reached. The predetermined condition is, for example, that the number of model updates (i.e. the number of iterations) on each device reaches a first predetermined value, or that the consensus distance of the current model of each device is smaller than a second predetermined value. The consensus distance may be determined based on differences between the current model and the neighbor model of the device. For example, consensus distance is defined as the average difference of the current model between any two devices.
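Under the reading of formula (2) given above (device j transmits to device i when i is congruent to j + 2^k mod n, with n assumed to be a power of two), the neighbor sets can be enumerated with the following sketch; the printed sets match the FIG. 1 examples for devices 0 and 5:

```python
import math

def targets_of(j, n):
    """Devices i that device j may transmit to: i = (j + 2**k) mod n."""
    ks = range(int(math.log2(n)))            # O(log(n)) hops: 2**0, 2**1, ...
    return sorted({(j + 2 ** k) % n for k in ks})

def senders_to(i, n):
    """Devices j whose current models device i may receive and cache as neighbor models."""
    ks = range(int(math.log2(n)))
    return sorted({(i - 2 ** k) % n for k in ks})

n = 8
print(targets_of(0, n))   # [1, 2, 4] -- matches the FIG. 1 example for device 0
print(targets_of(5, n))   # [1, 6, 7] -- matches the FIG. 1 example for device 5
print(senders_to(0, n))   # [4, 6, 7] -- devices whose models device 0 caches
```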
According to the method, each device can prune the local update model and then aggregate with the neighbor model, and aggregate the heterogeneous models by combining the local masks of each model, so that sparse training in each device is facilitated, and communication expenditure is greatly reduced under the condition that model accuracy is not affected.
On the basis of the method, the current model of each device may be an image processing model. Optionally, step S310, performing parameter optimization based on the current model of the device to obtain a local update model may include: processing the sample image based on the current model to obtain an image processing result; determining loss corresponding to the current model based on the image processing result; and carrying out parameter optimization on the current model of the equipment based on the loss to obtain a local updated model.
For example, the current model may be a target detection model, and the corresponding image processing result is the position information of the target detection frame. Based on the image processing result and the supervision label corresponding to the sample image, the loss corresponding to the current model can be determined. Based on the loss, stochastic gradient descent (SGD) can be adopted to perform parameter optimization on the current model of the device, so as to obtain the local update model. This process can be expressed by the following formula (3):

$$\hat{m}_i^{t_i} = m_i^{t_i-1} - \eta\,\nabla F_i\!\left(m_i^{t_i-1}\right) \tag{3}$$

where $m_i^{t_i-1}$ represents the current model obtained by the $(t_i-1)$-th update; $\eta$ represents the learning rate; $F_i\!\left(m_i^{t_i-1}\right)$ represents the loss corresponding to the current model $m_i^{t_i-1}$; $\nabla F_i\!\left(m_i^{t_i-1}\right)$ represents the gradient in SGD; and $\hat{m}_i^{t_i}$ represents the local update model obtained by performing parameter optimization on $m_i^{t_i-1}$. That is, formula (3) performs parameter optimization on the current model based on the learning rate and the gradient of the current model, so as to obtain the local update model.
In the above optional embodiment, the devices perform federal learning of the image processing model, so that each device does not need to upload image or video data to a central server, which reduces communication cost and protects the privacy data of users. Moreover, sparse training is realized by pruning the model before transmission, which further reduces the communication cost and helps achieve the expected effect.
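As a hedged illustration of the local update of formula (3) for an image processing model, the following PyTorch sketch performs one SGD step on a placeholder model and a random batch; the architecture, data, and loss are stand-ins for illustration, not the disclosed target detection model:

```python
import torch
import torch.nn as nn

# Toy image classifier standing in for the current model m_i (placeholder architecture).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate eta of formula (3)
criterion = nn.CrossEntropyLoss()

# Random stand-ins for a batch of sample images x_i and supervision labels y_i.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)  # loss corresponding to the current model
loss.backward()                          # gradient used by SGD
optimizer.step()                         # m_hat = m - eta * grad -> local update model
```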
Optionally, in some embodiments of the present disclosure, the federal learning method may further include determining a manner of local masking of the present device. Illustratively, the federal learning method described above may further include: determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model and the training information of the current model; obtaining a target pruning rate of the equipment based on the weight of each neighbor model and the weight of the current model; based on the target pruning rate, a local mask of the device is obtained.
That is, in the embodiment of the present disclosure, the target pruning rate of the present apparatus is determined according to the weight of each model. Because the weight of each model can represent the importance degree of the model in the device, the accuracy of model pruning can be improved by determining the target pruning rate corresponding to the local mask based on the weight of each model. And the weight of each model can be determined based on the training information of each model, namely the weight of each model is dynamically adjusted according to the training of the model, so that the accuracy of model pruning is further improved, and the communication cost is reduced under the condition of ensuring the model accuracy.
For example, the training information of a first neighbor model of the plurality of neighbor models may include at least one of: the difference between the number of model updates of the first neighbor model and the number of model updates of the current model; loss corresponding to the first neighbor model; the number of samples corresponding to the first neighbor model. Wherein the first neighbor model may be determined based on related information from the first neighbor device. The related information may include parameters of the first neighbor model, and may further include information such as model update times, loss, sample number, and local mask corresponding to the first neighbor model.
Accordingly, the training information of the current model may include at least one of: loss corresponding to the current model; the number of samples of the device. In practical applications, a fixed value, for example, 1, may be used as the model update number difference corresponding to the current model. The weights of the current model and the weights of the neighbor models can thus be determined in the same manner.
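A minimal sketch of the kind of per-model related information and training information discussed above; the field names and values are assumptions for illustration, not the disclosure's wording:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ModelInfo:
    """Related information exchanged between neighbor devices (illustrative fields)."""
    parameters: List[float]   # parameters of the (pruned) current model
    local_mask: List[int]     # local mask of the sending device (1 = kept, 0 = pruned)
    update_count: int         # number of model updates performed on the sender
    loss: float               # loss corresponding to the sender's current model
    num_samples: int          # number of local samples on the sender

# The receiver caches this as a neighbor model together with its training information.
msg = ModelInfo(parameters=[1.5, 0.0, 2.0], local_mask=[1, 0, 1],
                update_count=12, loss=0.42, num_samples=500)
print(msg.update_count, msg.num_samples)
```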
Because the training process is asynchronous, the multiple neighbor models cached within each device have different versions. For example, suppose device i is one of the neighbor devices of devices j and j'. Then the models cached on device i for devices j and j' are $m_j^{t_j}$ and $m_{j'}^{t_{j'}}$, where $t_j$ and $t_{j'}$ represent the numbers of model updates corresponding to $m_j^{t_j}$ and $m_{j'}^{t_{j'}}$, respectively, when device i receives them. The impact of models of different versions on model aggregation is different. For example, with $t_i$ representing the number of model updates on device i, when $t_j \ll t_i$, the cached model $m_j^{t_j}$ may provide no benefit to model aggregation on device i. Based on this, the weight of each model in the model aggregation stage is determined according to the difference of the model update counts, so that the weight can better represent the degree of influence of each model on model aggregation. In addition, the loss and the sample number of each model also influence the accuracy of the model; therefore, the weight is determined by combining the loss and the sample number of each model, which improves the accuracy of the weight and thereby the accuracy of the target pruning rate.
Illustratively, determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model and the training information of the current model may include: updating the control variable based on the loss corresponding to the current model; and determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model, the training information of the current model and the control variable.
That is, when determining the weight, a control variable which can be dynamically adjusted according to the loss in the model training process can be set, so that the control variable is gradually optimized in the model training process, thereby optimizing the accuracy of the weight and being beneficial to improving the accuracy of the target pruning rate.
For example, the importance of each neighbor model may be calculated using formula (4):

$$I_{i,j}^{t_i} = \frac{s_j\,\mathcal{F}_j}{\left(t_i - t_j\right)^{\lambda_i^{t_i}}} \tag{4}$$

where $I_{i,j}^{t_i}$ represents, at the $t_i$-th model update, the importance of the neighbor model corresponding to neighbor device j on device i; $s_j$ is the number of samples; $\lambda_i^{t_i}$ is a dynamically updated control variable; $t_i - t_j$ is the difference between the model update count $t_i$ of the current model and the model update count $t_j$ of the neighbor model corresponding to device j when it was updated on device j, which can well represent the staleness of that neighbor model on device i; and $\mathcal{F}_j$ represents the model loss on device j.

Accordingly, the importance of the current model may also be calculated according to formula (4), with a preset fixed value adopted as the model update count difference; that is, when $j=i$, $t_i - t_j$ is taken to be a preset fixed value, for example 1.

The importance of each model is normalized according to formula (5) to obtain the weight of each model:

$$w_{i,j}^{t_i} = \frac{I_{i,j}^{t_i}}{\sum_{j'\in\mathcal{N}_i\cup\{i\}} I_{i,j'}^{t_i}} \tag{5}$$

where $w_{i,j}^{t_i}$ denotes, in the $t_i$-th local update, the weight on device i of the neighbor model from device j, and $\mathcal{N}_i$ represents the set of neighbor models cached on device i (indexed by the corresponding neighbor devices). When $j=i$, $w_{i,i}^{t_i}$ represents the weight of the local update model.
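Assuming the reconstructed reading of formula (4) above (importance grows with the sample count and the loss and is discounted by the staleness raised to the control variable) and the normalization of formula (5), the aggregation weights could be computed as in the following sketch; the exact functional form and all numbers are assumptions:

```python
def importance(num_samples, loss, staleness, control_var):
    # Assumed form of formula (4): samples and loss increase importance,
    # staleness (model update count difference) discounts it via the control variable.
    return num_samples * loss / (staleness ** control_var)

def aggregation_weights(local_info, neighbor_infos, control_var):
    # local_info / neighbor_infos: (num_samples, loss, staleness) tuples.
    # Formula (5): normalize the importances so the weights sum to 1.
    scores = [importance(*local_info, control_var)] + \
             [importance(*info, control_var) for info in neighbor_infos]
    total = sum(scores)
    return [s / total for s in scores]

# Device i: a fixed staleness value of 1 is used for its own current model.
local_info = (500, 0.40, 1)
neighbor_infos = [(300, 0.55, 2), (800, 0.35, 5)]
weights = aggregation_weights(local_info, neighbor_infos, control_var=0.5)
print(weights)  # first entry: weight of the local update model
```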
Illustratively, the update of the above control variable $\lambda_i$ can be expressed by the following formula (6):

$$\lambda_i^{t_i} = \lambda_i^{t_i-1} - \eta_{\lambda}\,\nabla_{\lambda}F_i\!\left(m_i^{t_i}\right) \tag{6}$$

where $\eta_{\lambda}$ represents the learning rate of the control variable; $\lambda_i^{t_i-1}$ represents the control variable employed at the $(t_i-1)$-th model update; $\lambda_i^{t_i}$ represents the control variable used in the $t_i$-th model update; and $F_i\!\left(m_i^{t_i}\right)$ represents the loss corresponding to the current model. The gradient $\nabla_{\lambda}F_i\!\left(m_i^{t_i}\right)$ of the loss with respect to the control variable can be calculated according to formula (7), which is computed from the number of samples $s_j$ of each device j, the latest loss of device j before the current round of updating, and the gradient $g_i$ of the model $m_i^{t_i}$.
Illustratively, in some embodiments of the present disclosure, a specific manner of determining the target pruning rate based on the weights of the models is also provided. Specifically, obtaining the target pruning rate of the device based on the weight of each neighbor model and the weight of the current model includes: obtaining a reference pruning rate of the equipment based on a preset algorithm; and calculating a plurality of target pruning rates corresponding to the plurality of neighbor models and the reference pruning rate of the equipment based on the weight of each neighbor model and the weight of the current model to obtain the target pruning rate of the equipment.
The preset algorithm may be a lossless pruning algorithm that uses only the information of a single model, such as a lossless pruning algorithm based on the lottery ticket hypothesis. In the embodiment of the disclosure, after the reference pruning rate is obtained based on the preset algorithm, the target pruning rates of the neighbor models and the reference pruning rate of the device are weighted and summed based on the weight of each model to obtain the target pruning rate of the device, so that the reference pruning rate is adjusted based on the weight of each model and the target pruning rates of the other models.
According to the above exemplary manner, various pruning algorithms for single model information can be conveniently combined, and the target pruning rate of the device can be determined based on the weight of each model, so that the accuracy of the target pruning rate can be improved by utilizing the advantages of the pruning algorithm.
For example, the target pruning rate may be determined according to the following formula (8):

$$p_i = \sum_{j\in\mathcal{N}_i} w_{i,j}^{t_i}\,p_j + w_{i,i}^{t_i}\,\hat{p}_i \tag{8}$$

where $p_i$ represents the target pruning rate of the device; $\mathcal{N}_i$ represents the set of neighbor models; $w_{i,j}^{t_i}$ represents the weight of the neighbor model corresponding to neighbor device j; $p_j$ represents the target pruning rate of that neighbor model; $w_{i,i}^{t_i}$ represents the weight of the current model; and $\hat{p}_i$ represents the reference pruning rate.
Alternatively, the process of determining the target pruning rate may be performed once at intervals according to the information such as loss and weight, for example once every N model updates, where N is a positive integer.
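A small sketch of the weighted combination of formula (8); the reference pruning rate, the neighbor pruning rates, and the weights are illustrative numbers, and the preset algorithm producing the reference rate is not shown:

```python
def target_pruning_rate(reference_rate, local_weight, neighbor_rates, neighbor_weights):
    # Formula (8): weighted sum of the neighbor models' target pruning rates
    # and this device's reference pruning rate.
    return local_weight * reference_rate + sum(
        w * p for w, p in zip(neighbor_weights, neighbor_rates))

p_i = target_pruning_rate(reference_rate=0.5, local_weight=0.3,
                          neighbor_rates=[0.6, 0.4], neighbor_weights=[0.6, 0.1])
print(p_i)  # 0.3*0.5 + 0.6*0.6 + 0.1*0.4 = 0.55
```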
Optionally, in some embodiments of the present disclosure, a way of how to determine the local mask based on the target pruning rate is also provided. Illustratively, obtaining the local mask of the device based on the target pruning rate includes: determining the influence degree of each parameter of the current model on the loss corresponding to the current model; based on the influence degree, the order of all parameters of the current model is obtained; determining parameters to be pruned based on the sequence of the parameters of the current model and the target pruning rate; and obtaining the local mask of the equipment based on the parameters to be trimmed and the complete model structure.
For example, the parameters may be sorted in ascending order of the above-mentioned degree of influence, i.e., parameters with small influence are ranked earlier and parameters with large influence are ranked later. Assuming that the number of parameters is K and the target pruning rate is $p_i$, the first $K\times p_i$ parameters in the order can be determined as the parameters to be pruned. In the mask corresponding to the complete model structure, the mask values of these $K\times p_i$ parameters are set to a predetermined value, for example 0, so that the local mask can be obtained.
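The mask construction described above can be sketched as follows; the per-parameter influence scores are illustrative inputs (in the disclosure they come from the Taylor-expansion-based estimate described later):

```python
import numpy as np

def build_local_mask(influence, target_pruning_rate):
    # Sort parameters by their influence on the loss (ascending) and prune
    # the first K * p_i of them; the mask keeps the complete model structure.
    flat = influence.ravel()
    k_pruned = int(flat.size * target_pruning_rate)
    prune_idx = np.argsort(flat)[:k_pruned]   # smallest influence -> pruned
    mask = np.ones(flat.size, dtype=int)
    mask[prune_idx] = 0                       # predetermined value 0 marks pruned parameters
    return mask.reshape(influence.shape)

influence = np.array([[0.90, 0.10, 0.40, 0.80],
                      [0.20, 0.70, 0.30, 0.60],
                      [0.50, 0.05, 0.95, 0.35],
                      [0.25, 0.15, 0.85, 0.45]])
print(build_local_mask(influence, target_pruning_rate=0.5))
```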
For each parameter, an algorithm may be used to calculate the difference between the loss after pruning that parameter and the loss of the current model, and based on the difference, the degree of influence of each parameter on the loss corresponding to the current model may be obtained.
For example, let $\Delta m_i$ represent the model difference caused by pruning parameter $p$ from the model $m_i$ on device i, and let $\delta F_i(\Delta m_i)$ represent the impact of $\Delta m_i$ on the current local loss function (i.e., the loss local to device i). Based on a Taylor expansion, $\delta F_i(\Delta m_i)$ can be calculated by formula (9):

$$\delta F_i(\Delta m_i) = F_i(m_i+\Delta m_i) - F_i(m_i) \approx g_i^{T}\Delta m_i + \frac{1}{2}\Delta m_i^{T} H_i\,\Delta m_i \tag{9}$$

where $g_i$ is the gradient corresponding to the loss, $T$ represents the transpose, $H_i$ represents the Hessian matrix of the model, and the higher-order terms of the expansion are negligible.

Denoting the pruned parameters (or channels) with the subscript $p$ and the remaining parameters with the subscript $r$, formula (9) can be extended to formula (10), in which the gradient, the Hessian matrix, and the model difference $\Delta m_i$ are each decomposed into the parts corresponding to the pruned parameters and the remaining parameters. Since pruning is performed during training, the gradient $g_i$ cannot be ignored, i.e., $g_i \neq 0$; the Lagrangian method is therefore used to extend formula (10) while minimizing the current impact $\delta F_i(\Delta m_i)$, which yields formula (11).

Since formula (9) only considers the effect on the current loss, the gradient magnitude is further inserted so that the effect of a pruned parameter on the next several rounds of updating is also covered; the degree of influence of parameter $p$ on the model can then be calculated according to formula (12):

$$\rho_p = \lambda\left|\delta F_i(\Delta m_i)\right| + \lambda_g\left\|\Delta g_i\right\| \tag{12}$$

where $|\cdot|$ represents the absolute value and $0\le\lambda\le 1$ is a hyperparameter. The added term $\lambda_g\|\Delta g_i\|$ represents the exploration of the training process. The exploration should be significant at the beginning of the training process and small at the end of training; when the magnitude of $g_i$ is large, the model is poorly trained and exploration should be emphasized. Accordingly, $\lambda_g$ can be set as $C\,\|g_i\|_2/G_{\max}$, where $C$ is a hyperparameter and $G_{\max}$ is the maximum L2 norm of the gradient during the previous training; in practical application, $C$ may take a value of 2. For a pruned model, the pruned parameters are still updated in the neighbor models and are also transferred and updated in the local update. Therefore, during each pruning, the whole model is represented with the complete model structure, so as to preserve parameters that become important after some iterations.
The above-mentioned exemplary manner trims the locally updated model based on the influence degree of each parameter on the loss corresponding to the current model, so that the trimming accuracy can be improved.
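A simplified, assumption-marked sketch of the per-parameter influence estimate: it keeps only the first-order term of formula (9) plus a gradient-magnitude exploration term in the spirit of formula (12), and omits the Hessian term and the Lagrangian refinement of formulas (10) and (11), so it is an approximation rather than the disclosed computation:

```python
import numpy as np

def parameter_effect(params, grads, lam=0.5, c=2.0, g_max_prev=1.0):
    # First-order Taylor term: pruning parameter p changes it by -params[p],
    # so the loss change is approximately -grads * params (taken per parameter, in magnitude).
    first_order = np.abs(grads * params)
    # Exploration term: lambda_g scales with the current gradient magnitude
    # relative to the largest gradient norm seen so far (assumed form).
    lambda_g = c * np.linalg.norm(grads) / max(g_max_prev, 1e-12)
    return lam * first_order + lambda_g * np.abs(grads)

params = np.array([0.8, -0.2, 1.5, 0.05])
grads = np.array([0.1, -0.4, 0.02, 0.3])
print(parameter_effect(params, grads))
```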
Optionally, in some embodiments of the present disclosure, a specific manner of model aggregation is also provided. Illustratively, aggregating the plurality of neighbor models and the local pruning model based on the local mask corresponding to each of the plurality of neighbor models and the local mask of the device may include:
for each parameter in the complete model structure, carrying out weighted summation on the corresponding parameter in each neighbor model and the corresponding parameter in the local pruning model based on the weight of each neighbor model and the weight of the current model to obtain a first updating result of each parameter;
based on the local mask corresponding to each neighbor model and the local mask of the device, determining a related model of each parameter in a plurality of neighbor models and a local pruning model;
obtaining a second updating result of each parameter based on the weight of the related model and the first updating result of each parameter;
and obtaining a new current model based on the second updating result of each parameter.
Wherein the relevant model for each parameter may refer to a model for which the mask value for that parameter is not 0, i.e. the model contributes to the updating of that parameter.
Illustratively, the second update result of each parameter may be determined according to the following formula (13):

$$m_{i,p}^{t_i} = \frac{\sum_{j\in\mathcal{N}_i\cup\{i\}} w_{i,j}^{t_i}\,m_{j,p}}{\sum_{j\in\mathcal{N}_i\cup\{i\}} w_{i,j}^{t_i}\,o_{j,p}} \tag{13}$$

where $o_{j,p}$ is the mask value of the parameter $\mu_p$ in the local mask of the neighbor model from neighbor device j: a value of 1 indicates that the parameter $\mu_p$ is not pruned, and a value of 0 indicates that $\mu_p$ is pruned. $m_{j,p}$ represents the corresponding parameter in the neighbor model from device j; when $j=i$, $m_{i,p}$ represents the corresponding parameter in the local pruning model and $o_{i,p}$ the mask value in the local mask of the device. Thus, the numerator $\sum_{j} w_{i,j}^{t_i}\,m_{j,p}$ is the first update result of the parameter $\mu_p$, obtained by weighting and summing the corresponding parameters of the models based on the model weights. In the denominator, the weight of each model is multiplied by its mask value for the parameter, so that the weight sum of the relevant models is obtained. Dividing the first update result by this weight sum gives the second update result, i.e., the final value of the parameter $\mu_p$.
The above steps are explained below in connection with a specific application example. Fig. 5 shows a schematic diagram of an application example of the federal learning method according to the embodiment of the present disclosure. As shown in FIG. 5, in this application example, the complete model structure includes 16 parameters arranged in 4 rows and 4 columns. The local update model and the local pruning model of the device (taking device 0 as an example) are shown in FIG. 5; the device caches a neighbor model from device 4 and a neighbor model from device 7, together with the corresponding local masks $O_4$ and $O_7$, as well as the local mask $O_0$ of the device. The weight of the current model is 0.3, the weight of the neighbor model from device 4 is 0.6, and the weight of the neighbor model from device 7 is 0.1. Taking the parameter in row 1, column 1 as an example, the local pruning model and the neighbor models are weighted and summed to obtain the first update result: 2×0.3+1×0.6+3×0.1=1.5. According to the local masks, the mask value of this parameter is 1 in each local mask, i.e., every model is a relevant model, and the weight sum of the relevant models is 0.3+0.6+0.1=1. Based on the weight sum and the first update result, the second update result of the parameter is obtained: 1.5/1=1.5. By analogy, the second update result of each parameter can be obtained, thereby obtaining the new current model of the device.
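The per-parameter aggregation of formula (13) can be checked numerically with the following sketch, which reproduces the row-1, column-1 computation of the FIG. 5 example; the remaining entries are placeholders, so only that single entry is meaningful:

```python
import numpy as np

def aggregate(models, masks, weights):
    # Formula (13): per parameter, divide the weighted sum of parameter values
    # by the weighted sum of mask values of the relevant (unpruned) models.
    numerator = sum(w * m for w, m in zip(weights, models))
    denominator = sum(w * o for w, o in zip(weights, masks))
    return np.where(denominator > 0, numerator / np.maximum(denominator, 1e-12), 0.0)

# Only the row-1, column-1 entries are taken from FIG. 5; other entries are placeholders.
local_pruned = np.array([[2.0, 0.0], [0.0, 0.0]])
neighbor_4 = np.array([[1.0, 0.0], [0.0, 0.0]])
neighbor_7 = np.array([[3.0, 0.0], [0.0, 0.0]])
mask_0 = np.array([[1, 0], [0, 0]])
mask_4 = np.array([[1, 0], [0, 0]])
mask_7 = np.array([[1, 0], [0, 0]])

new_model = aggregate([local_pruned, neighbor_4, neighbor_7],
                      [mask_0, mask_4, mask_7],
                      weights=[0.3, 0.6, 0.1])
print(new_model[0, 0])  # (2*0.3 + 1*0.6 + 3*0.1) / (0.3 + 0.6 + 0.1) = 1.5
```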
By adopting the mode, the influence of the pruning structure of the neighbor model on aggregation can be considered, so that the accuracy of aggregation is improved, and the communication overhead is reduced on the basis of ensuring the accuracy of the model.
Optionally, the federal learning method may further include: and obtaining the model related information of the current model based on the parameters of the current model, the training information of the current model and the local mask of the equipment.
Accordingly, the neighbor model cached in the device, the local mask corresponding to the neighbor model, training information of the neighbor model and the like are determined based on the received related information from other devices.
According to the alternative mode, the related information of the current model can comprise information such as loss corresponding to the current model, sample number corresponding to the current model, model updating times and the like, so that the neighbor model can conveniently determine the weight of the current model of the device based on training information, and the accuracy and federal learning efficiency of the aggregated model can be improved. The related information of the current model also comprises a local mask of the device so as to facilitate aggregation of the neighbor model based on the pruning structure of the device, and improve the accuracy of aggregation, thereby reducing communication overhead on the basis of ensuring the model accuracy.
Optionally, in an embodiment of the present disclosure, the federal learning method may further include: storing a plurality of neighbor models in a memory of a Graphics Processing Unit (GPU); and storing the plurality of neighbor models in the memory of the equipment under the condition that the memory of the GPU reaches the preset condition.
Since one model may require a large amount of memory space, the cached neighbor models may be placed either in the memory of the GPU or in the memory of the device. When a model is placed in the memory of the device, additional time may be required to move the model onto the GPU during model aggregation. Thus, in the above alternative, the GPU memory is selected by default to store the multiple neighbor models. When the GPU memory reaches the preset condition, for example when its available memory space is less than a predetermined value, i.e., the GPU memory cannot fully accommodate the cached models, the memory space of the device is used to cache the models. In this way, both model training efficiency and resource utilization can be taken into account.
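A hedged PyTorch sketch of the caching policy described above; the free-memory threshold and the use of torch.cuda.mem_get_info are one possible realization assumed for illustration, not the disclosed implementation:

```python
import torch

def cache_neighbor_model(model, min_free_bytes=2 * 1024 ** 3):
    """Keep a cached neighbor model in GPU memory by default; fall back to host memory."""
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        if free_bytes >= min_free_bytes:   # preset GPU memory condition not reached
            return model.to("cuda")
    return model.to("cpu")                 # otherwise store in the device (host) memory

cached = cache_neighbor_model(torch.nn.Linear(16, 16))
print(next(cached.parameters()).device)
```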
According to an embodiment of the present disclosure, the present disclosure further provides a federal learning device. Fig. 6 shows a schematic block diagram of a federal learning device provided by an embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:
the local updating module 610 is configured to perform parameter optimization based on a current model of the device to obtain a local updating model;
a pruning module 620, configured to prune the local update model based on the local mask of the device to obtain a local pruning model;
the aggregation module 630 is configured to aggregate the plurality of neighbor models and the local pruning model based on the local mask corresponding to each neighbor model in the plurality of neighbor models and the local mask of the device, so as to obtain a new current model;
and the sending module 640 is configured to send, to a neighboring device of the present device, relevant information of the present model, where the relevant information is used to characterize the present model and a local mask of the present device.
Optionally, as shown in fig. 7, the local update module 610 includes:
an image processing unit 710, configured to process the sample image based on the current model, to obtain an image processing result;
a loss calculation unit 720, configured to determine a loss corresponding to the current model based on the image processing result;
And the parameter optimization unit 730 is configured to perform parameter optimization on the current model of the device based on the loss, so as to obtain a local update model.
Optionally, as shown in fig. 8, the apparatus further includes:
the weight determining module 810 is configured to determine a weight of each neighbor model and a weight of the current model based on training information of each neighbor model and training information of the current model;
the pruning rate determining module 820 is configured to obtain a target pruning rate of the present device based on the weight of each neighboring model and the weight of the current model;
the mask determining module 830 is configured to obtain a local mask of the device based on the target pruning rate.
Optionally, the pruning rate determination module 820 is configured to:
obtaining a reference pruning rate of the equipment based on a preset algorithm;
and calculating a plurality of target pruning rates corresponding to the plurality of neighbor models and the reference pruning rate of the equipment based on the weight of each neighbor model and the weight of the current model to obtain the target pruning rate of the equipment.
Optionally, the training information of the first neighbor model of the plurality of neighbor models includes at least one of:
the difference between the number of model updates of the first neighbor model and the number of model updates of the current model;
Loss corresponding to the first neighbor model;
the number of samples corresponding to the first neighbor model.
Optionally, the weight determining module 810 is configured to:
updating the control variable based on the loss corresponding to the current model;
and determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model, the training information of the current model and the control variable.
Optionally, the mask determination module 830 is used to:
determining the influence degree of each parameter of the current model on the loss corresponding to the current model;
based on the influence degree, the order of all parameters of the current model is obtained;
determining parameters to be pruned based on the sequence of the parameters of the current model and the target pruning rate;
and obtaining the local mask of the equipment based on the parameters to be trimmed and the complete model structure.
Alternatively, as shown in fig. 9, the aggregation module 630 includes:
a first updating unit 910, configured to weight and sum, for each parameter in the complete model structure, the corresponding parameter in each neighbor model and the corresponding parameter in the local pruning model based on the weight of each neighbor model and the weight of the current model, to obtain a first updating result of each parameter;
A model screening unit 920, configured to determine a relevant model of each parameter from the plurality of neighbor models and the local pruning model based on the local mask corresponding to each neighbor model and the local mask of the device;
a second updating unit 930, configured to obtain a second updating result of each parameter based on the weight of the correlation model and the first updating result of each parameter;
and a parameter aggregation unit 940, configured to obtain a new current model based on the second update result of each parameter.
Optionally, as shown in fig. 9, the apparatus further includes:
the information determining module 950 is configured to obtain model related information of the current model based on parameters of the current model, training information of the current model, and a local mask of the device.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, such as the federal learning method. For example, in some embodiments, the federal learning method can be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the federal learning method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the federal learning method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A federal learning method, comprising:
performing parameter optimization based on a current model of a device to obtain a local update model;
pruning the local update model based on a local mask of the device to obtain a local pruning model;
aggregating a plurality of neighbor models and the local pruning model, based on a local mask corresponding to each neighbor model in the plurality of neighbor models and the local mask of the device, to obtain a new current model;
and sending related information of the current model to a neighbor device of the device, wherein the related information is used for representing the current model and the local mask of the device.
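Purely as an illustrative, non-limiting sketch of the four steps recited above, one round on a single device could be organized as follows in Python; models and masks are assumed to be dictionaries mapping parameter names to arrays, and the `local_update`, `aggregate`, and `send_to_neighbors` callables are hypothetical placeholders supplied by the caller.

```python
# Illustrative only: one federated learning round on a device, following the
# four steps of claim 1. Helper callables are supplied by the caller.
def training_round(current_model, local_mask, neighbor_models, neighbor_masks,
                   local_update, aggregate, send_to_neighbors):
    """current_model / neighbor_models: dicts mapping parameter names to arrays;
    local_mask / neighbor_masks: dicts of 0/1 arrays over the complete structure."""
    # 1. Parameter optimization based on the current model -> local update model.
    local_update_model = local_update(current_model)

    # 2. Prune the local update model with the device's local mask.
    local_pruning_model = {name: weights * local_mask[name]
                           for name, weights in local_update_model.items()}

    # 3. Aggregate the neighbor models and the local pruning model into a new current model.
    new_current_model = aggregate(neighbor_models, neighbor_masks,
                                  local_pruning_model, local_mask)

    # 4. Send the related information (new current model and local mask) to the neighbors.
    send_to_neighbors(new_current_model, local_mask)
    return new_current_model
```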
2. The method of claim 1, wherein the performing parameter optimization based on the current model of the device to obtain a local update model comprises:
processing a sample image based on the current model to obtain an image processing result;
determining a loss corresponding to the current model based on the image processing result;
and performing parameter optimization on the current model of the device based on the loss to obtain the local update model.
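A minimal sketch of this image-based local update, assuming a PyTorch classification model and a cross-entropy loss (both assumptions; the disclosure does not fix the model type or loss function):

```python
# Illustrative only: optimize the current model on sample images to obtain
# the local update model, as in claim 2.
import torch
import torch.nn.functional as F


def local_update(current_model, sample_images, labels, lr=0.01, steps=5):
    """Process sample images with the current model, compute the loss, and
    optimize the parameters; returns the updated model and the last loss."""
    optimizer = torch.optim.SGD(current_model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        outputs = current_model(sample_images)      # image processing result
        loss = F.cross_entropy(outputs, labels)     # loss corresponding to the current model
        loss.backward()
        optimizer.step()                            # parameter optimization
    return current_model, loss.item()
```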
3. The method of claim 1 or 2, further comprising:
determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model and the training information of the current model;
obtaining a target pruning rate of the device based on the weight of each neighbor model and the weight of the current model;
and obtaining the local mask of the device based on the target pruning rate.
4. The method of claim 3, wherein the obtaining the target pruning rate of the device based on the weight of each neighbor model and the weight of the current model comprises:
obtaining a reference pruning rate of the device based on a preset algorithm;
and performing a weighted calculation on a plurality of target pruning rates corresponding to the plurality of neighbor models and on the reference pruning rate of the device, based on the weight of each neighbor model and the weight of the current model, to obtain the target pruning rate of the device.
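The weighted calculation could, for example, take the form of a normalized weighted average, as in the sketch below; this specific form is an assumption for illustration only.

```python
# Illustrative only: combine the neighbors' target pruning rates and the
# device's reference pruning rate into the device's target pruning rate.
import numpy as np


def target_pruning_rate(neighbor_rates, neighbor_weights,
                        reference_rate, current_weight):
    """neighbor_rates / neighbor_weights: one entry per neighbor model;
    reference_rate: from a preset algorithm; current_weight: weight of the current model."""
    rates = np.asarray(list(neighbor_rates) + [reference_rate], dtype=float)
    weights = np.asarray(list(neighbor_weights) + [current_weight], dtype=float)
    weights = weights / weights.sum()          # normalize so the result stays in [0, 1]
    return float(np.dot(weights, rates))
```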
5. The method of claim 3 or 4, wherein the training information of a first neighbor model of the plurality of neighbor models comprises at least one of:
the difference between the number of model updates of the first neighbor model and the number of model updates of the current model;
loss corresponding to the first neighbor model;
and the number of samples corresponding to the first neighbor model.
6. The method of any one of claims 3-5, wherein the determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model and the training information of the current model comprises:
updating a control variable based on the loss corresponding to the current model;
and determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model, the training information of the current model and the control variable.
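The claim leaves the exact formulas open; the sketch below assumes an exponential-moving-average update of the control variable and a heuristic score combining staleness, loss, and sample count, purely for illustration.

```python
# Illustrative only: the score and the control-variable update below are
# assumptions; the disclosure only requires that the weights depend on the
# training information and on a control variable updated from the current loss.
import numpy as np


def update_control_variable(control_variable, current_loss, momentum=0.9):
    """Assumed exponential-moving-average update driven by the current model's loss."""
    return momentum * control_variable + (1.0 - momentum) * current_loss


def compute_weights(neighbor_infos, current_info, control_variable):
    """Each info object exposes .staleness, .loss, and .num_samples (see claim 5)."""
    infos = list(neighbor_infos) + [current_info]
    scores = np.array([
        info.num_samples / (1.0 + abs(info.staleness))          # favor fresh, data-rich models
        * np.exp(-info.loss / max(control_variable, 1e-8))      # favor low-loss models
        for info in infos
    ])
    weights = scores / scores.sum()
    return weights[:-1], weights[-1]   # (weights of the neighbor models, weight of the current model)
```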
7. The method according to any one of claims 3-6, wherein the obtaining the local mask of the device based on the target pruning rate comprises:
determining an influence degree of each parameter of the current model on the loss corresponding to the current model;
obtaining an ordering of the parameters of the current model based on the influence degree;
determining parameters to be pruned based on the ordering of the parameters of the current model and the target pruning rate;
and obtaining the local mask of the device based on the parameters to be pruned and the complete model structure.
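One way to realize these steps, assuming the influence degree is approximated by the gradient-times-weight magnitude (a common saliency heuristic, not mandated by the claim):

```python
# Illustrative only: build the local mask by ranking parameters by their
# influence on the loss and pruning the least influential fraction.
import numpy as np


def build_local_mask(parameters, gradients, target_pruning_rate):
    """parameters / gradients: dict[str, np.ndarray] over the complete model
    structure; target_pruning_rate: fraction to prune, assumed in [0, 1)."""
    # Assumed influence degree of each parameter on the loss: |w * grad|.
    influence = {n: np.abs(parameters[n] * gradients[n]) for n in parameters}
    flat = np.concatenate([v.ravel() for v in influence.values()])
    # Order all parameters by influence and locate the pruning threshold.
    k = int(target_pruning_rate * flat.size)
    threshold = np.sort(flat)[k] if k > 0 else -np.inf
    # Parameters below the threshold are pruned (mask entry 0), the rest kept (1).
    return {n: (influence[n] >= threshold).astype(np.float32) for n in influence}
```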
8. The method of any of claims 1-7, wherein the aggregating the plurality of neighbor models and the local pruning model based on the local mask corresponding to each of the plurality of neighbor models and the local mask of the device comprises:
for each parameter in the complete model structure, carrying out weighted summation on the corresponding parameter in each neighbor model and the corresponding parameter in the local pruning model based on the weight of each neighbor model and the weight of the current model to obtain a first updating result of each parameter;
determining a relevant model of each parameter from the plurality of neighbor models and the local pruning model based on the local mask corresponding to each neighbor model and the local mask of the device;
obtaining a second updating result of each parameter based on the weight of the relevant model and the first updating result of each parameter;
and obtaining a new current model based on the second updating result of each parameter.
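An illustrative rendering of this two-stage aggregation in Python; the renormalization by the total weight of the relevant models is an assumption about how their weights enter the second updating result.

```python
# Illustrative only: for each parameter of the complete model structure,
# (i) weighted sum over all models (first updating result), (ii) selection of
# the relevant models whose mask keeps that parameter, (iii) rescaling by the
# relevant models' total weight (second updating result).
import numpy as np


def aggregate(neighbor_models, neighbor_masks, neighbor_weights,
              local_pruning_model, local_mask, current_weight, eps=1e-12):
    """All models and masks are dict[str, np.ndarray] over the complete structure."""
    models = list(neighbor_models) + [local_pruning_model]
    masks = list(neighbor_masks) + [local_mask]
    weights = list(neighbor_weights) + [current_weight]

    new_current_model = {}
    for name in local_pruning_model:
        # First updating result: weighted sum of the corresponding parameters.
        first = sum(w * m[name] for w, m in zip(weights, models))
        # Relevant models: those whose local mask retains this parameter.
        relevant_weight = sum(w * msk[name] for w, msk in zip(weights, masks))
        # Second updating result: rescale by the total weight of the relevant models.
        new_current_model[name] = first / (relevant_weight + eps)
    return new_current_model
```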
9. The method of any of claims 1-8, further comprising:
and obtaining model-related information of the current model based on the parameters of the current model, the training information of the current model, and the local mask of the device.
10. A federal learning apparatus, comprising:
the local update module is used for performing parameter optimization based on the current model of the device to obtain a local update model;
the pruning module is used for pruning the local update model based on the local mask of the device to obtain a local pruning model;
the aggregation module is used for aggregating the plurality of neighbor models and the local pruning model, based on the local mask corresponding to each neighbor model in the plurality of neighbor models and the local mask of the device, to obtain a new current model;
and the sending module is used for sending related information of the current model to a neighbor device of the device, wherein the related information is used for representing the current model and the local mask of the device.
11. The apparatus of claim 10, wherein the local update module comprises:
the image processing unit is used for processing the sample image based on the current model to obtain an image processing result;
the loss calculation unit is used for determining the loss corresponding to the current model based on the image processing result;
and the parameter optimization unit is used for performing parameter optimization on the current model of the device based on the loss to obtain the local update model.
12. The apparatus of claim 10 or 11, further comprising:
the weight determining module is used for determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model and the training information of the current model;
the pruning rate determining module is used for obtaining a target pruning rate of the device based on the weight of each neighbor model and the weight of the current model;
and the mask determining module is used for obtaining the local mask of the device based on the target pruning rate.
13. The apparatus of claim 12, wherein the pruning rate determining module is used for:
obtaining a reference pruning rate of the device based on a preset algorithm;
and performing a weighted calculation on a plurality of target pruning rates corresponding to the plurality of neighbor models and on the reference pruning rate of the device, based on the weight of each neighbor model and the weight of the current model, to obtain the target pruning rate of the device.
14. The apparatus of claim 12 or 13, wherein the training information of a first neighbor model of the plurality of neighbor models comprises at least one of:
the difference between the number of model updates of the first neighbor model and the number of model updates of the current model;
loss corresponding to the first neighbor model;
and the number of samples corresponding to the first neighbor model.
15. The apparatus of any one of claims 12-14, wherein the weight determining module is used for:
updating a control variable based on the loss corresponding to the current model;
and determining the weight of each neighbor model and the weight of the current model based on the training information of each neighbor model, the training information of the current model and the control variable.
16. The apparatus of any one of claims 12-15, wherein the mask determining module is used for:
determining an influence degree of each parameter of the current model on the loss corresponding to the current model;
obtaining an ordering of the parameters of the current model based on the influence degree;
determining parameters to be pruned based on the ordering of the parameters of the current model and the target pruning rate;
and obtaining the local mask of the device based on the parameters to be pruned and the complete model structure.
17. The apparatus of any of claims 10-16, wherein the aggregation module comprises:
the first updating unit is used for performing, for each parameter in the complete model structure, weighted summation on the corresponding parameter in each neighbor model and the corresponding parameter in the local pruning model based on the weight of each neighbor model and the weight of the current model, to obtain a first updating result of each parameter;
the model screening unit is used for determining a relevant model of each parameter from the plurality of neighbor models and the local pruning model based on the local mask corresponding to each neighbor model and the local mask of the device;
the second updating unit is used for obtaining a second updating result of each parameter based on the weight of the relevant model and the first updating result of each parameter;
and the parameter aggregation unit is used for obtaining a new current model based on the second updating result of each parameter.
18. The apparatus of any of claims 10-17, further comprising:
and the information determining module is used for obtaining model-related information of the current model based on the parameters of the current model, the training information of the current model, and the local mask of the device.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
Application CN202310082825.1A (priority date 2023-01-19, filing date 2023-01-19) — Federal learning method, apparatus, electronic device, and computer-readable storage medium — published as CN116029393A (en), withdrawn

Priority Applications (1)

CN202310082825.1A — Federal learning method, apparatus, electronic device, and computer-readable storage medium (priority date 2023-01-19, filing date 2023-01-19)


Publications (1): CN116029393A — publication date 2023-04-28

Family ID: 86079385

Family Applications (1): CN202310082825.1A — Federal learning method, apparatus, electronic device, and computer-readable storage medium (priority date 2023-01-19, filing date 2023-01-19)

Country Status (1): CN — CN116029393A (en)


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
WW01 — Invention patent application withdrawn after publication (application publication date: 2023-04-28)