CN115719092A - Model training method based on federal learning and federal learning system - Google Patents

Model training method based on federated learning and federated learning system

Info

Publication number
CN115719092A
Authority
CN
China
Prior art keywords
gradient
training
samples
server
privacy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211198906.XA
Other languages
Chinese (zh)
Inventor
范洺源
周文猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211198906.XA
Publication of CN115719092A
Legal status: Pending


Abstract

A model training method based on federated learning and a federated learning system are disclosed. The method is applied to a federated learning system comprising a server and a plurality of nodes and comprises the following steps: in the t-th operation of model training, performing: the server issues the model parameter set to n nodes; the n nodes each use local training samples to perform gradient computation to obtain original gradients, perform joint optimization using a utility index and a privacy index to obtain transformed samples, and generate upload data based on the transformed samples; and the server acquires the upload data and updates the model parameter set. Thus, the transformed samples are implicitly extracted for generating the perturbation gradient, so that utility and privacy information is injected into the perturbation gradient. A utility index that computes the gradient distance in a weighted manner is designed, with the weights determined by the element weights and layer weights of the parameters. Preferably, a trained evaluation network is used as the privacy metric; this network is able to learn how to evaluate the differences in a manner consistent with human cognition.

Description

Model training method based on federated learning and federated learning system
Technical Field
The present disclosure relates to the field of machine learning, and in particular to a federated learning-based model training method and a federated learning system.
Background
In recent years, artificial intelligence has entered a new wave of development, with machine learning playing a core role. Training a well-performing machine learning model requires collecting a large amount of high-quality data. In many application scenarios, however, privacy protection prevents private data from being collected from users to a server for centralized model training, which is an obstacle to the widespread use of machine learning.
Multi-party joint modeling has therefore been proposed, which enables participants to cooperatively train a model without revealing their data, thereby addressing the data privacy problem. One important and common scenario in multi-party joint modeling is Federated Learning (FL). Under federated learning, a user uses data locally to compute update values (i.e., gradients) of the model according to a prescribed algorithm and feeds these back to the server, so that the local training data is never exposed. However, it has been found that the server can reconstruct the local training data of a particular user from the gradient uploaded by that user.
For this reason, a model training method capable of protecting the security of the local training data of federated learning users is required.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a federated learning-based model training method that departs from the conventional prior-art approach of directly perturbing the gradient to be uploaded, and instead produces a transformed sample from the training sample, where the transformed sample contains minimal private information while yielding utility closest to that of the original sample. In this way, the difference between the transformed sample and the original training sample can be obtained intuitively to accurately estimate the private information contained in the uploaded data, avoiding the accurate gradient-to-sample derivation that is difficult to achieve due to neural network non-linearity.
According to a first aspect of the disclosure, a federated learning-based model training method is provided, applied to a federated learning system comprising a server and N nodes, where N > 1. The method comprises: in the t-th operation of model training, performing the following steps: the server issues the model parameter set to n nodes, where n ≤ N; the n nodes each perform gradient computation using local training samples to obtain original gradients, perform joint optimization using a utility index and a privacy index to obtain transformed samples x*_i, and generate upload data based on the transformed samples x*_i, where i = 1, 2, …, n, the utility index characterizing the difference between the transformed gradient of the transformed sample and the original gradient, and the privacy index characterizing the difference between the transformed sample and the training sample; and the server acquires the upload data and updates the model parameter set.
Optionally, the upload data comprises the transformed samples x*_i, and the server trains on the transformed samples from each of the n nodes to update the model parameter set; or the upload data comprises transformed gradients g*_i computed from the transformed samples x*_i, and the server acquires the transformed gradients g*_i and updates the model parameter set.
Optionally, the utility index is computed in a weighted manner, the weights being associated with at least one of: the value of the parameter; the magnitude of the gradient corresponding to the parameter; and the layer in which the parameter resides.
Optionally, the privacy index characterizes the distance of the transformed sample to a noise distribution. Optionally, a trained evaluation neural network is used to evaluate the distance of the transformed sample to a uniform noise distribution.
Optionally, the output of the evaluation neural network characterizes the distance of the transformed sample to the uniform noise distribution in a monotonically increasing or decreasing manner, and training data-label pairs are created for the evaluation neural network in an interpolating manner so as to inject this monotonically increasing or decreasing knowledge into the evaluation neural network.
Optionally, performing joint optimization using the utility index and the privacy index to obtain the transformed sample comprises: iteratively computing an optimized solution of the transformed sample subject to the targets of the utility index and the privacy index.
According to a second aspect of the present disclosure, a federated learning system is provided, comprising a server and N nodes, where N > 1. In the t-th operation of model training, the server issues the model parameter set to n nodes, where n ≤ N; the n nodes each perform gradient computation using local training samples to obtain original gradients, perform joint optimization using a utility index and a privacy index to obtain transformed samples x*_i, and generate upload data based on the transformed samples x*_i, where i = 1, 2, …, n; the utility index characterizes the difference between the transformed gradient of the transformed sample and the original gradient, and the privacy index characterizes the difference between the transformed sample and the training sample; and the server further acquires the upload data and updates the model parameter set based on the upload data.
According to a third aspect of the present disclosure, a federated learning-based model training method is provided, applied to a node in a federated learning system, where the federated learning system comprises N nodes and a server, and N > 1. The method comprises: in the t-th operation of model training, performing the following steps: acquiring the model parameter set issued by the server; performing gradient computation using local training samples to obtain an original gradient; performing joint optimization using a utility index and a privacy index to obtain a transformed sample x*_i, and generating upload data based on the transformed sample x*_i, where i = 1, 2, …, n; the utility index characterizes the difference between the transformed gradient of the transformed sample and the original gradient, and the privacy index characterizes the difference between the transformed sample and the training sample; and sending the upload data to the server so that the server updates the model parameter set together with upload data acquired from other nodes, wherein in the t-th operation the server issues the model parameter set to n nodes including the current node and the other nodes, where n ≤ N.
According to a fourth aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method according to the third aspect.
According to a fifth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method of the third aspect as described above.
Thus, the transformed samples are implicitly extracted for generating the perturbation gradient, which greatly facilitates injecting utility and privacy information into the perturbation gradient. A utility index is designed as the utility measure, computing the gradient distance in a weighted manner, where the weights are determined by the element weights and layer weights of the parameters. Preferably, a trained evaluation network is used as the privacy measure; the evaluation network is able to learn how to evaluate the differences in a manner consistent with human cognition.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic diagram of the federated learning training process.
Fig. 2 shows an example of a gradient leakage attack.
Fig. 3 shows a schematic diagram of the federated learning training process when the client performs defense.
FIG. 4 shows a schematic flow diagram of a federated learning-based model training method in accordance with one embodiment of the present invention.
Fig. 5 shows an effect diagram after the original graph is perturbed by using different methods.
FIG. 6 shows a schematic diagram of the components of a federated learning system that implements the model training method of the present invention.
FIG. 7 is a schematic structural diagram of a computing device that may be used to implement the above federated learning-based model training method in accordance with an embodiment of the present invention.
Fig. 8 shows a schematic illustration of the execution of the inventive method on the client side.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Deep neural networks are typically built by stacking a series of layers, each consisting of a linear function and a simple nonlinear activation function. A deep neural network is denoted F(x, θ), where x is the input to the network and θ are the learnable parameters. For the network to generalize well, common practice is to train it on a dataset D consisting of pairs of data x and corresponding labels y, using a loss function L(F(x, θ), y) and a gradient-descent optimization method (the back-propagation algorithm). The optimization in the t-th iteration can be expressed succinctly as:

θ_t = θ_{t-1} − η · ∇_{θ_{t-1}} L(F(x, θ_{t-1}), y)    (1)

where the subscript of θ denotes the timestamp, η is the learning rate, θ_0 is a randomly initialized parameter, and ∇_{θ_{t-1}} L(F(x, θ_{t-1}), y) is the gradient obtained from the preceding computation.
The success of deep learning depends largely on the large amount of available data, which typically needs to be concentrated in one machine or a single data center for training the model. For privacy protection, in application scenarios such as Mobile Edge Computing (MEC), a server cannot collect private data from users (various forms of clients, e.g., various IoT devices) for centralized model training. Thus, to protect user privacy in deep learning model training, federated Learning (FL) that enables cross-device coordination model training may be used.
FIG. 1 shows a schematic diagram of the federated learning training process. Under federated learning, all users share a complete machine learning model. As shown, the model structure itself is the same at the server and at each client. The server may be, for example, an edge server of an MEC system, or another server performing the parameter aggregation task in federated learning. In different application scenarios, the client may take various forms, such as IoT devices of various kinds, including smartphones, robotic devices, smart medical devices, gaming devices, etc., and may even itself take the form of a server (while acting as a client in the FL model). In conventional FL, the server coordinates the clients to co-train the globally shared model by alternately performing the following steps for a total of T rounds:
the server first receives fromA small fraction of N clients are extracted from the N clients to participate in the t-th round, while the other clients wait for the next round. F (-, θ) t-1 ) Is assigned to the selected client.
These clients model-train their local data in a pre-specified manner to obtain true gradients. The true gradient generated in the ith selected client is determined by
Figure BDA0003871676780000051
Is shown, and
Figure BDA0003871676780000052
is uploaded to the server.
Is receiving
Figure BDA0003871676780000053
The server then aggregates these gradients equally to update the global model, i.e. to update the parameter set of the model, and thus obtain F (-, θ) for the next round of delivery and training t ) I.e. by
Figure BDA0003871676780000054
The server may then send the updated parameter set to the users to begin the next round of training, and so on, until the number of iterations reaches a predetermined value or the model is judged to have converged.
Briefly, federated learning divides each conventional iteration (equation (1)) into multiple steps: the steps involving gradient computation are handed to the clients, and the server is responsible only for the model update step. Since the clients' private data is used only for local computation, this yields a privacy-centric approach to model training.
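A minimal sketch of one such round, assuming each client object exposes a hypothetical local_gradient helper that implements the local training step; the sampling and averaging rules below are the simplest choices consistent with the description above.

```python
import random
import torch

def federated_round(server_params, clients, n, lr):
    """One conventional FL round: distribute parameters, collect gradients, aggregate."""
    selected = random.sample(clients, n)             # a small fraction of the N clients
    gradients = [c.local_gradient(server_params) for c in selected]  # true gradients
    # Equal aggregation of the uploaded gradients, then one model update.
    avg = [torch.stack(layer).mean(dim=0) for layer in zip(*gradients)]
    return [p - lr * g for p, g in zip(server_params, avg)]
```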
Under federated learning, a user does not need to upload training data; instead, the user performs model training locally and uploads only the update values (i.e., gradients) of the parameters. This appears to protect the user's data privacy, but it has been found that the server can reconstruct the local training data of a specific user from the update values uploaded by that user, and the reconstructed data can be almost identical to the real data. This causes indirect data leakage and calls the security of federated learning into question.
Fig. 2 shows an example of a gradient leakage attack. When the server seeks to obtain a user's local data, the server may be regarded as an attacker; the client may then be regarded as a defender. As shown on the left side of the dashed line, the user has completed one round of local model training using the image on the left as training data, and the obtained true gradient is provided directly to the server. When the true gradient is uploaded directly, the user mounts no defense against the attack. In this case, as shown on the right side of the dashed line, the attacker can initialize virtual (dummy) data and, with the model parameters fixed, optimize the dummy data so that the parameter update values computed from it gradually approach the true gradient uploaded by the user, thereby inverting the data used in the user's local training.
A gradient leakage attack means that data can be reconstructed by solving a gradient matching problem. Specifically, in federated learning, users (defenders) upload local gradients to the server (attacker). Through an optimization method, the server reconstructs the client's data by forcing the gradient of randomly initialized seed data x' with label y' to be as close as possible to the uploaded gradient:

x'* = argmin_{x'} || ∇_θ L(F(x', θ), y') − g ||²    (2)

where g is the uploaded gradient and x'* is the reconstructed data. It has been found that the label can be recovered explicitly by observing the gradient of the fully connected layer, and that attack efficiency can be significantly improved by replacing y' with the true label. Further, the search space can be reduced by restarting, improved initialization, or collecting data similar to the client's, so as to further improve attack efficiency.
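A minimal sketch of the gradient matching attack in (2), with the true label substituted for y' as described above; the step count, learning rate and optimizer choice are illustrative assumptions.

```python
import torch

def gradient_matching_attack(model, loss_fn, uploaded_grad, y_true, x_shape,
                             steps=300, lr=0.1):
    """Optimize dummy seed data so that its gradient matches the uploaded gradient."""
    x_dummy = torch.randn(x_shape, requires_grad=True)   # randomly initialized seed data x'
    opt = torch.optim.Adam([x_dummy], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x_dummy), y_true)            # y' replaced by the true label
        dummy_grad = torch.autograd.grad(loss, list(model.parameters()),
                                         create_graph=True)
        match = sum(((dg - ug) ** 2).sum()
                    for dg, ug in zip(dummy_grad, uploaded_grad))
        match.backward()
        opt.step()
    return x_dummy.detach()                               # reconstructed data x'*
```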
Assume the client has been alerted to the potential privacy-disclosure hazard; as a remedy, the client is allowed to send a perturbation gradient instead of the true gradient. In the t-th round, the perturbation gradient of the i-th selected client is denoted g̃_i^t. In the remainder of this document, superscripts and subscripts are omitted for conceptual simplicity, except where necessary.
In order to defend against gradient leakage attacks, the client needs to adjust the uploaded gradient. Fig. 3 shows a schematic diagram of the federated learning training process when the client performs defense. The training process shown in Fig. 3 is similar to that shown in Fig. 1, except that after local training is performed to obtain a gradient, the client does not upload the gradient directly; instead, it applies some transformation to obtain a transformed gradient and uploads that transformed gradient in place of the original gradient, thereby making it more difficult for the server, acting as an attacker, to invert the local data from the uploaded gradient.
A common defense against gradient leakage attacks is differential privacy: the gradient is perturbed by adding noise, without unduly degrading performance, so that the attacker receives an inaccurate gradient and the attack is weakened. Such defenses have limited utility against strong attacks, since their effectiveness is built on the unrealistic assumption that the deep neural network can be simplified into a linear model.
In view of the above, the present invention provides a new defense method that does not require simplifying the deep neural network into a linear model for computation. In the gradient-attack defense scheme of the present invention, rather than perturbing the gradient, the training samples (i.e., the real training data) are refined to produce robust data (i.e., transformed samples, hereinafter "transformed samples" or "robust data") that is sufficiently useful but carries a minimal amount of private information; the gradient of the robust data is then uploaded, or the robust data is uploaded directly. The invention encourages the gradients of key parameters computed on the robust data to stay close to those of the real data, while the gradients of trivial parameters undergo larger transformation to protect privacy. Furthermore, to exploit the gradients of the trivial parameters, the evaluation network designed by the present invention guides the robust data away from the real data, thereby mitigating the risk of privacy leakage.
FIG. 4 shows a schematic flow diagram of a federated learning-based model training method according to one embodiment of the present invention. The method is applied to a federated learning system comprising a server and N nodes, N > 1, as shown in FIG. 6 below.
As can be seen from FIGS. 1 and 3 above, federated learning accomplishes model training through repeated rounds of training. Steps S410-S430 shown in FIG. 4 and described below may be regarded as being performed in the t-th operation of model training; the model may be trained for a total of T operations. In one embodiment, each round of training may correspond to one batch of training. That is, the cycle of issuing, updating, transforming, uploading, and aggregating is performed per batch, and the t-th operation may be the t-th batch training operation of the model training. As is known from the background of machine learning, model training usually requires multiple epochs, each epoch comprising multiple batch computations. Assuming training a model requires 10 epochs and each epoch uses 6 batches of data, 60 batch training operations are required to complete the model training, i.e., T = 60, t = 1, 2, …, 60. In other embodiments, each round of training may be performed in units other than batches.
In step S410, the server issues the model parameter set to n nodes, where n ≤ N. In federated learning, the server may issue the model parameter set to all nodes in the system each time, or may select some of the N nodes for model training based on certain conditions. The chosen nodes may be the same or different in each operation.
In step S420, the n nodes each perform gradient computation using local training samples to obtain original gradients, perform joint optimization using the utility index and the privacy index to obtain transformed samples x*_i, and generate upload data based on the transformed samples x*_i, where i = 1, 2, …, n.
The utility metric (UM) characterizes the utility of the transformed sample x*, i.e., to what extent the transformed gradient obtained using the transformed sample retains the valid information of the original gradient. To this end, the utility index may characterize the difference between the transformed gradient of the transformed sample and the original gradient.
The privacy index (PM) characterizes the privacy protection of the transformed sample, i.e. how far the transformed sample is from the training sample. To this end, the privacy index may characterize the difference between the transformed samples and the original training samples.
Here, the "original gradient" refers to the gradient computed by the node, for the model parameter set issued in the current t-th operation (for example, the t-th batch training operation), as the update values of the parameters, i.e., the true gradient. "Original" here is in contrast to the subsequent "transformed" gradient. As previously mentioned, the original (true) gradient may be denoted g_i^t = ∇_{θ_{t-1}} L(F(x_i, θ_{t-1}), y_i).
Here, each of the n nodes may use its own local training samples (i.e., local training data) to compute its own true gradient, and each jointly optimizes its own utility index and privacy index to find its transformed sample. The utility index is related to the computed true gradient, for example, as a function of the computed true gradient. The privacy index characterizes the difference between the transformed sample and the original training sample.
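To make the client side of steps S410-S420 concrete, the following is a minimal sketch; the jointly_optimize helper is a placeholder for the joint optimization detailed in the application example below, and uploading the transformed gradient is only one of the two upload variants described after step S430.

```python
import torch

def client_step(model, loss_fn, x_local, y_local, jointly_optimize):
    """Client side of one round: true gradient, transformed sample, upload data."""
    loss = loss_fn(model(x_local), y_local)
    original_grad = torch.autograd.grad(loss, list(model.parameters()))  # true gradient

    # Joint optimization of the utility index and privacy index yields x*.
    x_star = jointly_optimize(x_local, y_local, original_grad, model)

    # Upload data: here, the transformed gradient computed on x*.
    loss_star = loss_fn(model(x_star), y_local)
    transformed_grad = torch.autograd.grad(loss_star, list(model.parameters()))
    return [g.detach() for g in transformed_grad]
```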
Subsequently, in step S430, the server collects the uploaded data uploaded by each of the n nodes and updates the model parameter set accordingly.
In the present invention, after the transformed samples x*_i are obtained, upload data may be prepared based on the transformed samples x*_i for uploading to the server. In one embodiment, the upload data comprises transformed gradients g*_i computed from the transformed samples x*_i, and the server acquires the transformed gradients g*_i and updates the model parameter set by, for example, directly averaging the n transformed gradients from the n clients. This is also the conventional paradigm adopted by federated learning.
In another embodiment, since the transformed samples x*_i of the present invention are themselves sufficiently useful and sufficiently different from the client-side raw training data, the upload data comprises the transformed samples x*_i, and the server trains on the transformed samples x*_i from each of the n nodes to update the model parameter set. In this case, because the server directly obtains the samples uploaded by the clients (processed samples whose privacy is protected but which still contain sufficient valid gradient information), training can be performed on the server side. This is distinguished from the classic federated learning scenario in which only gradients are uploaded.
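A minimal sketch of the server-side update (step S430) for the two upload variants just described; the uploads dictionary layout, plain averaging and per-sample update below are illustrative assumptions.

```python
import torch

def server_update(model, loss_fn, uploads, lr):
    """Step S430: update the model parameter set from the collected upload data."""
    params = list(model.parameters())
    if uploads["kind"] == "gradients":
        # Variant 1: average the n uploaded transformed gradients, apply one update.
        avg = [torch.stack(layer).mean(dim=0) for layer in zip(*uploads["grads"])]
        with torch.no_grad():
            for p, g in zip(params, avg):
                p -= lr * g
    else:
        # Variant 2: train directly on the uploaded transformed samples.
        for x_star, y in uploads["samples"]:
            loss = loss_fn(model(x_star), y)
            grads = torch.autograd.grad(loss, params)
            with torch.no_grad():
                for p, g in zip(params, grads):
                    p -= lr * g
```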
Therefore, when a client node uploads data, it does not directly upload the true original gradient; instead, it uploads data derived from the transformed sample obtained by joint optimization of the utility index (UM) and the privacy index (PM). The utility index measures the usefulness of the transformed sample, while the privacy index measures the privacy disclosure risk. By jointly optimizing these two indexes, the invention can find an optimized transformed sample (for example, the optimal transformed sample), thereby improving the training effect while protecting the privacy of the client's local data.
To find an optimal point between utility maintenance and privacy protection, the utility index and the privacy index must be reasonably constructed.
The utility index is expected to accurately evaluate the model-utility gap between the transformed sample x* (i.e., the robust data) and the training sample x (i.e., the raw data). As with identifying the gradient-to-data mapping, this utility-gap relationship is complex. The inventors noted that the parameters of the model are updated by the gradient of the loss function associated with the particular data. Based on this, the gradients of the loss function at x* and at x can be taken as a suitable proxy for the true utility measure. Thus, in the present invention, the utility index characterizes the difference between the gradient of the transformed sample x* and the original gradient. It should be understood that the gradient corresponding to a training sample is in fact the set of delta values of the parameters of each layer of the deep model after the sample is input to the model and the parameters are adjusted by back-propagating the loss of the model output. Thus, the difference between the gradient of the transformed sample and the original gradient corresponds to the difference between the delta values induced at the parameters of each layer of the model by the original training sample and by the transformed sample.
The most common distance metric may be the Mean Square Error (MSE). Although the utility index may be calculated directly using conventional MSEs (i.e., unweighted MSEs), conventional MSEs treat equally any gradient element involved, indicating that applying perturbations of the same magnitude to any gradient element will result in the same level of loss. However, even if different gradient elements are perturbed by the same magnitude, gradients of different utility are produced. This is because the parameters corresponding to different gradient elements have significantly different effects on the model performance. In more detail, perturbing the gradient elements of the key parameters results in higher losses than the trivial parameters. In view of the above, in one embodiment, the present invention uses weighted MSE as a utility indicator to better approximate the true utility. Higher weights can be given to the gradients of important parameters to reduce the loss of model performance.
The importance of a parameter is highly related to its value and its gradient. Intuitively, parameters with high-magnitude values are more critical because, compared with parameters with low-magnitude values, they greatly amplify their input, and a higher output may have a greater impact on the model prediction; in addition, the nature of the gradient is to estimate the effect of a small change in a parameter on the final output, so parameters with large gradients also play a more critical role in model prediction. To this end, the utility index characterizing the difference between the gradient of the transformed sample and the original gradient comprises: computing the mean square error between each gradient element of the transformed sample and the corresponding original gradient element, and assigning a weight to each mean square error value so as to compute the utility index in a weighted manner, the weight being associated with the value of the parameter and/or the magnitude of the corresponding gradient of the parameter.
Further, since a neural network is composed of many layers, each layer receives input from its adjacent upstream layer (except for the input layer) and passes its output to the adjacent downstream layer (except for the output layer). Thus, earlier layers (i.e., layers closer to the input layer) may be more important than later layers. On the one hand, errors caused by learning processes that interfere with early layers are likely to be strongly amplified with forward propagation. On the other hand, early layers are typically focused on identifying bottom features shared by various samples, so the learning process that destroys early layers will result in a large degradation of model performance. To this end, the weight is also associated with the layer in which the parameter resides.
The privacy index is a distance measure used to quantify how much of the privacy of x is exposed by x*, or the degree of difference between x and x*. The conventional distance measure MSE is not suitable as the privacy index of the present invention, because it can only guarantee high similarity between two inputs when the value it produces falls in a low range, whereas the converse does not hold: two highly similar images may be treated as different by the MSE measure. Fig. 5 shows the effect of perturbing an original image with different methods. Specifically, relative to the original image (a), (b) shifts the original image by one unit toward the upper left, (c) scales the original image by one pixel, and (d) adds random noise to the original image. The MSE values of the three methods are 0.025, 0.106 and 0.004, respectively. The MSE for shifting and scaling is higher than that for adding noise, yet the shifted and scaled images are indistinguishable from the original to the human eye, i.e., their privacy-preserving effect is negligible.
To this end, the invention instead uses the distance of the transformed sample x* to a noise distribution as the privacy index. This is because the noise distribution can be regarded as containing no information of the original training sample, so the derivation of the transformed sample can be greatly simplified. In a preferred embodiment, rather than selecting a fixed noise sample and moving the transformed sample x* toward that fixed noise sample, the distance from x* to the noise distribution (i.e., the distribution distance) is defined as the privacy index. The distribution distance can be regarded as expanding the number of available reference samples during optimization and automatically guiding x* toward the reference sample with the larger amount of information. In one embodiment, a trained evaluation neural network may be used to evaluate the distance of the transformed sample x* to the uniform noise distribution. The output of the evaluation neural network characterizes, in a monotonically increasing or decreasing manner, the distance of the transformed sample x* to the uniform noise distribution, and training data-label pairs are created for the evaluation neural network in an interpolating manner so as to inject this monotonically increasing or decreasing knowledge into the evaluation neural network.
The joint optimization problem corresponding to the transformed sample x* is solved by iterative computation. That is, iterative computation is required to find an optimized solution (for example, the optimal solution) of the transformed sample x* satisfying the constraints of the utility index and the privacy index.
In addition, the transformed sample x* may be initialized using a mixture of the training sample and random noise, thereby preventing the optimization from yielding an adversarial sample.
the present invention may also be implemented as a federated learning system. FIG. 6 shows a schematic diagram of the components of a federated learning system that implements the model training method of the present invention. As shown, system 600 includes a server 610 and N nodes 620 1-N
In the t-th operation of model training, the server 610 issues the model parameter set to n nodes, where n ≤ N. The n nodes each perform gradient computation using local training samples to obtain original gradients, perform joint optimization using the utility index and the privacy index to obtain transformed samples x*_i, and generate upload data based on the transformed samples x*_i, where i = 1, 2, …, n. The server then collects the upload data uploaded by each of the n nodes and updates the model parameter set accordingly.
The invention may also be implemented as a federated learning-based model training method applied to a node in a federated learning system, where the federated learning system comprises N nodes and a server, N > 1. The method comprises: in the t-th operation of model training, performing the following steps: acquiring the model parameter set issued by the server; performing gradient computation using local training samples to obtain an original gradient; performing joint optimization using the utility index and the privacy index to obtain a transformed sample, and generating upload data based on the transformed sample; and sending the upload data to the server so that the server updates the model parameter set together with data uploaded by other nodes, wherein in the t-th operation the server issues the model parameter set to n nodes including the current node and the other nodes, where n ≤ N.
FIG. 7 is a schematic structural diagram of a computing device that may be used to implement the above federated learning-based model training method in accordance with an embodiment of the present invention. The computing device may be particularly useful as a node in a federated learning system for obtaining raw gradients and jointly optimizing a utility index and a privacy index to find transformed samples.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special purpose coprocessors such as a Graphics Processor (GPU), digital Signal Processor (DSP), or the like. In some embodiments, processor 720 may be implemented using custom circuitry, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 720 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered down. In some embodiments, the permanent storage is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, miniSD card, Micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted by wireless or wired means.
Memory 710 has stored thereon executable code that, when processed by processor 720, causes processor 720 to perform the above-described federated learning-based model training method.
Application example
Fig. 8 shows a schematic illustration of the execution of the inventive method on the client side. One specific embodiment of the present invention will be described below with reference to Fig. 8. As shown at the top of Fig. 8, the real data (i.e., the original training sample) is fed into the model. The model makes a prediction, computes the loss against the corresponding label, and obtains the true gradient (i.e., the original gradient) through the back-propagation algorithm. Then, the robust data is initialized (for example, using a mixture of the training sample and random noise), and the final robust data is obtained as the transformed sample through joint optimization of the utility index and the privacy index; the transformed sample can then be fed into the model for prediction, the loss computed against the corresponding label, the transformed gradient obtained through the back-propagation algorithm, and the transformed gradient uploaded as the upload data.
As shown in the figure, the optimization of the utility index can be regarded as a process in which the transformed gradient gradually approaches the original gradient in gradient space, while the optimization of the privacy index can be regarded as a process in which the robust data gradually moves away from the original data in data space, a process carried out under the guidance of the evaluation network.
In one particular implementation, the general objective of the present invention is to find a suitable upload gradient g̃ that can confuse the server while retaining good utility. However, without simplifying the DNN, it is very difficult to directly derive an analytical mapping from perturbed gradients to the corresponding reconstructed data, owing to the highly non-linear and non-convex nature of DNNs. Furthermore, without an explicit gradient-to-data mapping, it is unclear how much privacy protection a generated perturbation provides, i.e., the optimal perturbation direction is unknown. In other words, the inventors found that directly optimizing the gradient, as in the prior art, is not feasible in practice because an analyzable gradient-to-data mapping is difficult to derive. To cope with this, the present invention instead adjusts the real data to produce robust data, and uploads the gradient of the robust data in place of the true gradient. If there is a significant difference between the robust data and the raw data, the server cannot recover the raw data. To this end, given a privacy preference level β and real data x labeled y, the problem can be expressed as:

x* = argmin_{x*} UM(x*, x)  subject to  PM(x*, x) ≥ β    (3)

where x* is the robust data produced, and UM(·) and PM(·) denote the utility index and the privacy index of the present invention, respectively. The utility index evaluates how much utility x* loses relative to x (UM should be as small as possible, i.e., x* should be as close as possible to x in utility), and the privacy index indicates how much of the privacy of x has been removed from x* (PM should be as large as possible, i.e., x* should contain as little of the privacy of x as possible). Under the guidance of these two indexes, the finally obtained x* not only has a usable gradient (i.e., can be effectively used for model training), but also can effectively confuse the server. These two indexes are described in detail below.
A. Utility index (UM)
It is desirable that UM accurately evaluate the difference in model utility between x* and x. As with identifying the gradient-to-data mapping, this utility-gap relationship is complex. The inventors observed that the parameters of the model are updated by the gradient of the loss function associated with the particular data. Based on this, the gradients of the loss function at x* and at x can be taken as a suitable proxy for the true utility measure.
Using the conventional MSE to define UM is not appropriate, because perturbing the gradient elements of the key parameters causes higher losses than perturbing those of the trivial parameters. For this reason, a weighted MSE may be adopted as the utility index; the general idea of the weighted MSE is to give higher weights to the gradients of important parameters so as to reduce the loss of model performance. Two factors are considered in determining the weights: statistical indicators of the parameter, such as its value and gradient, and the location information of the parameter. More specifically, the present invention defines element-level weights by fusing the statistical indicators of the parameters, and defines hierarchical weights using the location information. The product of the element-level weight and the hierarchical weight serves as the final weight of the corresponding gradient element.
The importance of a parameter is highly related to the value and gradient of that parameter. Parameters with high amplitude values are more critical, and parameters with high gradients also play a more critical role in model prediction. Here, the element-level weight may be defined as an absolute value of a product of a value of the parameter and the gradient. In effect, the values of the parameters reflect their importance to the upstream layers, while the gradients of the parameters indicate their importance to the downstream layers, neither of which should be omitted when evaluating the weights.
To further understand the element-level weights, some mathematical explanation is provided here. Let Q(μ) = L(F(x, μ·θ), y). Then Q'(μ) = Σ_j ∂L/∂(μ·θ[j]) · θ[j]. Note that the optimum θ is usually reached when ∇_θ L(F(x, θ), y) = 0. Setting μ = 1 gives Q'(1) = Σ_j grad(θ[j]) · value(θ[j]), so if θ is optimal, Q'(1) = 0. Therefore, whether the optimum has been reached can be judged by checking the value of Q'(1). In other words, if both the magnitude and the gradient of a parameter are small, that parameter has little effect on optimality. Here, Q(μ) is a mathematical device commonly used to analyze the optimality condition of a nonlinear, non-convex function in a high-dimensional space, making the optimality condition of a complex function more intuitive. Note that Q'(1) = 0 is only a necessary condition for an optimal solution, not a sufficient one, but this does not affect its use here.
A neural network is composed of a number of layers, each receiving input from its adjacent upstream layer (except for the input layer) and passing its output to the adjacent downstream layer (except for the output layer). Thus, earlier layers (i.e., layers closer to the input layer) may be more important than later layers. On the one hand, errors caused by the learning process that interferes with early layers are likely to be strongly amplified with forward propagation. On the other hand, early layers are typically focused on identifying the bottom features shared by the various samples, so the learning process that destroys early layers will result in a large degradation of model performance.
To this end, the penalty for gradient perturbations belonging to earlier layers is magnified by the hierarchical weights, which assign more attention to preserving the gradients of earlier layers. Assuming that F(x, θ) has K layers and the i-th layer parameters of F(x, θ) are θ[i] (θ = {θ[1], θ[2], …, θ[K]}), the hierarchical weight of the gradient elements of the i-th layer is defined as power(τ, i), where τ is an attenuation factor (0 ≤ τ ≤ 1) and power(·, ·) is a power function.
The final weight of θ[i], which is the product of the element-level weight and the hierarchical weight, can be expressed as:

weight(θ[i]) = |grad(θ[i]) · value(θ[i])| · power(τ, i),    (4)

where grad(θ[i]) and value(θ[i]) denote the gradient and the value extracted for the input parameter, respectively. UM can now be defined as the weighted mean square error between the transformed gradient and the original gradient:

UM(x*, x) = Σ_i weight(θ[i]) · (grad*(θ[i]) − grad(θ[i]))²,    (5)

where grad*(θ[i]) is the gradient of θ[i] computed on x*, grad(θ[i]) is the corresponding original gradient, and the product and sum are taken element-wise over the gradient elements of each layer.
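A minimal sketch of weights (4) and the weighted-MSE utility index (5), treating the gradients as a list of per-layer tensors; the value of tau and the exact reduction are illustrative assumptions.

```python
import torch

def utility_metric(transformed_grads, original_grads, params, tau=0.8):
    """Weighted MSE between transformed and original gradients, per (4)-(5)."""
    um = 0.0
    for i, (g_star, g, theta) in enumerate(
            zip(transformed_grads, original_grads, params), start=1):
        element_w = (g * theta).detach().abs()   # |grad * value|: element-level weight
        layer_w = tau ** i                       # power(tau, i): hierarchical weight
        um = um + (element_w * layer_w * (g_star - g) ** 2).sum()
    return um
```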
B. Privacy index (PM)
The privacy index is a distance measure used to quantify how much of the privacy of x is exposed by x*, or the degree of difference between x and x*. PM should have the following property: as the PM value increases, the private information contained in x* should monotonically decrease. An intuitive idea for designing PM is to construct a reference sample for x that contains none of the private information of x; the distance between the reference sample and x* can then be used as PM. The smaller the distance between x* and the reference sample, the less of the privacy of x is contained in x*. However, searching for a reference sample for every data item is cumbersome. A better choice is to construct a reference sample that suits all data, and noise data is obviously a suitable choice. In particular, noise data may be sampled from a uniform distribution as a general reference sample.
Although the above method is feasible, the search space is limited to the path from x to the sampled fixed reference sample, so the resulting x* may simply be close to that sampled reference sample, whereas a better solution may in fact exist around other noise. In other words, it does not matter which random noise is sampled as the reference sample, because no random noise involves the private information of x. Therefore, it is better to define the distance from x* to the noise distribution (i.e., the distribution distance) as PM. The distribution distance can be regarded as expanding the number of available reference samples during optimization and can automatically guide x* toward the reference sample with the larger amount of information.
In practice, an evaluation network is used to evaluate how close x* is to the noise distribution. The evaluation network is a neural network whose output ranges from 0 to 1; an output of 1 indicates that the input belongs entirely to the noise distribution, and vice versa. The theory behind the network is the Maximum Mean Discrepancy (MMD), a statistical concept often used to test whether two samples come from the same distribution. Formally, the MMD is:

MMD(p, q) = sup_{||f||_H ≤ 1} ( E_{x~q(x)}[f(x)] − E_{x~p(x)}[f(x)] )

where f is the evaluation network, q(x) is the noise (uniform) distribution (the uniform distribution being the no-information distribution in information theory), and p(x) is the data distribution of the IoT devices. This indicates that the network f needs to be trained to maximize the expected output difference between p(x) and q(x). Furthermore, since the output of the network is limited to 0-1, expanding the expected output difference is essentially equivalent to encouraging the network to output 0 for the real data x and 1 for noise, respectively.

Further, during training of the network, training data-label pairs may be created for the network in an interpolating manner, i.e., ((1 − r)·t1 + r·t2, r), with r ~ U(0, 1), t1 ~ p, t2 ~ q. These data-label pairs are provided as supervision signals to the evaluation network, whereby the monotonicity principle is explicitly incorporated into the training process, since the knowledge that the output should increase as r (the proportion of noise) increases is injected into the evaluation network. In addition, a gradient penalty can be added during training of the evaluation network to enforce ||f||_H ≤ 1. Finally, the trained evaluation network can score x*, and this score serves as PM, i.e., PM(x*, x) = f(x*).
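A simplified sketch of training the evaluation network on the interpolated data-label pairs and then using it as PM; the architecture, optimizer and step count are illustrative, and the MMD term and gradient penalty from the description above are omitted here for brevity.

```python
import torch
import torch.nn as nn

class Evaluator(nn.Module):
    """Network with output in [0, 1] scoring how noise-like the input is."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x.flatten(1)).squeeze(1)

def train_evaluator(f, real_loader, steps=1000, lr=1e-3):
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _, (x, _) in zip(range(steps), real_loader):
        noise = torch.rand_like(x)                           # t2 ~ q: uniform noise
        r = torch.rand(x.size(0), *([1] * (x.dim() - 1)))    # mixing ratio r ~ U(0, 1)
        mixed = (1 - r) * x + r * noise                      # interpolated training data
        loss = ((f(mixed) - r.flatten()) ** 2).mean()        # label is r: monotonicity signal
        opt.zero_grad(); loss.backward(); opt.step()
    return f

# At optimization time the trained evaluator serves as the privacy index:
# PM(x_star, x) = f(x_star)
```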
C. Solving optimization tasks
After the specific forms of UM and PM are determined, a gradient descent algorithm may be used to iteratively solve equation (3) for a total number of iterations T. However, directly applying the gradient descent algorithm is problematic because of the adversarial vulnerability of neural networks. "Adversarial vulnerability" refers to the phenomenon that neural networks are susceptible to adversarial examples, which cause large changes in model predictions by adding subtle, human-imperceptible noise to the original sample. The x* obtained by directly applying a conventional gradient descent algorithm is semantically very similar to x, i.e., x* is an adversarial example for the evaluation network f. An adversarial example can be viewed as a local extremum around x, and one way to solve this problem is to perturb the initialization point so as to avoid the region containing that local extremum. Thus, x* can be initialized using a mixture of x and random noise, i.e., x* = (1 − α)·x + α·v, v ~ q(x), α ∈ [0, 1], where α is a mixing factor. If α is set large, x* is initialized in a noise-dominated region and the points closest to x* are still noise, thus avoiding the adversarial vulnerability.
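Putting the pieces together, the following is a minimal sketch of the client-side production of robust data corresponding to the flow of Fig. 8: initialize x* as a noise mixture, then run gradient descent on the penalized surrogate of (3) sketched earlier; lam, alpha, the step count and learning rate are illustrative assumptions, and evaluator is a trained evaluation network as in the privacy-index section.

```python
import torch

def make_robust_data(x, y, model, loss_fn, evaluator, utility_metric,
                     alpha=0.8, lam=1.0, steps=100, lr=0.05):
    """Produce robust data x* from real data x; its gradient is then uploaded."""
    params = list(model.parameters())

    # True (original) gradient, used by the utility index.
    loss = loss_fn(model(x), y)
    original_grad = [g.detach() for g in torch.autograd.grad(loss, params)]

    # Initialize x* as a mixture of x and uniform noise to dodge adversarial extrema.
    v = torch.rand_like(x)
    x_star = ((1 - alpha) * x + alpha * v).detach().requires_grad_(True)
    opt = torch.optim.Adam([x_star], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x_star), y)
        transformed_grad = torch.autograd.grad(loss, params, create_graph=True)
        um = utility_metric(transformed_grad, original_grad, params)
        pm = evaluator(x_star).mean()          # PM(x*, x) = f(x*)
        (um - lam * pm).backward()             # keep utility, push toward the noise distribution
        opt.step()
    return x_star.detach()
```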
The federated learning-based model training method and the federated learning system of the present invention have been described above. Compared with existing methods that directly process the true gradient, the method of creating the perturbation gradient in the present invention implicitly extracts robust data for generating the perturbation gradient, thereby greatly facilitating the injection of utility and privacy information into the perturbation gradient. Specifically, an effective utility metric is designed that computes the gradient distance in a weighted manner, with the weights determined by the element weights and layer weights of the parameters; and, having identified the shortcoming of common metrics in quantifying the difference between two images, the present invention proposes using an evaluation network as the privacy metric, which is able to learn how to evaluate the difference in a manner consistent with human cognition.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A federated learning-based model training method, applied to a federated learning system comprising a server and N nodes, wherein N > 1, the method comprising:
in the t-th operation of model training, the following steps are performed:
the server issues the model parameter set to n nodes, wherein n ≤ N;
the n nodes each perform gradient computation using local training samples to obtain original gradients, perform joint optimization using a utility index and a privacy index to obtain transformed samples x*_i, and generate upload data based on the transformed samples x*_i, wherein i = 1, 2, …, n, the utility index characterizes a difference between a transformed gradient of the transformed sample and the original gradient, and the privacy index characterizes a difference between the transformed sample and the training sample; and
the server obtains the uploaded data and updates the model parameter set based on the uploaded data.
2. The method of claim 1, wherein:
the upload data comprises the transformed samples x̃_i, and the server trains on the transformed samples x̃_i obtained from each of the n nodes to update the model parameter set; or
the upload data comprises a transformed gradient computed from the transformed samples x̃_i, and the server obtains the transformed gradient and updates the model parameter set based thereon.
3. The method of claim 1, wherein the utility index characterizing the difference between the gradient of the transformed samples and the original gradient comprises:
computing a mean square error between each gradient element of the transformed samples and the corresponding original gradient element, and assigning a weight to each mean square error value so as to compute the utility index in a weighted manner, the weight being associated with at least one of:
the value of the parameter;
the magnitude of the gradient corresponding to the parameter; and
the layer in which the parameter is located.
4. The method of claim 1, wherein the privacy index characterizes a distance of the transformed samples to a noise distribution.
5. The method of claim 4, wherein the distance of the transformed samples to a noise distribution is estimated using a trained evaluation neural network.
6. The method of claim 5, wherein the output of the evaluation neural network characterizes the distance of the transformed samples to the noise distribution in a monotonically increasing or decreasing manner, and training data-label pairs are created for the evaluation neural network by interpolation so as to inject this monotonically increasing or decreasing knowledge into the evaluation neural network.
7. The method of claim 1, wherein performing joint optimization using the utility index and the privacy index to obtain the transformed samples comprises:
iteratively computing an optimal solution for the transformed samples under the condition that the objectives of the utility index and the privacy index are satisfied.
8. The method of claim 1, wherein the transformed samples are initialized using a mixture of training samples and random noise.
9. A federated learning system, comprising a server and N nodes, where N > 1, wherein:
the server issues a model parameter set to n nodes in the t-th operation of model training, wherein n is less than or equal to N;
the n nodes each perform gradient computation using local training samples to obtain original gradients, perform joint optimization using a utility index and a privacy index to obtain transformed samples x̃_i, and make upload data based on the transformed samples x̃_i, wherein i = 1, 2, …, n, the utility index characterizes the ability of the transformed samples to maintain convergence of model training, and the privacy index characterizes the degree to which the transformed samples protect the privacy of the training samples; and
the server obtains the upload data and updates the model parameter set based on the upload data.
10. A model training method based on federated learning, applied to a node in a federated learning system, the federated learning system comprising N nodes and a server, where N > 1;
the method comprises the following steps:
in the t-th operation of model training, the following steps are performed:
acquiring a model parameter set issued by the server;
performing gradient calculations using local training samples to obtain raw gradients;
performing joint optimization using a utility index and a privacy index to obtain transformed samples, and making upload data based on the transformed samples, wherein the utility index characterizes the ability of the transformed samples to maintain convergence of model training, and the privacy index characterizes the degree to which the transformed samples protect the privacy of the training samples; and
sending the upload data to the server so that the server updates the model parameter set using the upload data together with upload data acquired from other nodes, wherein in the t-th operation the server issues the model parameter set to n nodes including the current node and the other nodes, wherein n is less than or equal to N.
11. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of claim 10.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of claim 10.
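For orientation only, a hedged sketch of the server side of one training operation, corresponding to claims 1, 2 and 9, is given below; the plain-averaging aggregation rule, the learning rate, and the function names are assumptions made here for illustration rather than elements of the claims.

```python
import torch

def server_operation(global_params, node_uploads, lr=0.01):
    """global_params: list of parameter tensors issued to the n nodes;
    node_uploads: per-node lists of (perturbation) gradient tensors."""
    n = len(node_uploads)
    updated = []
    for k, p in enumerate(global_params):
        # Average the gradients reported by the n participating nodes and
        # apply one gradient-descent step to the global parameter set.
        g_avg = sum(upload[k] for upload in node_uploads) / n
        updated.append(p - lr * g_avg)
    return updated
```

When nodes instead upload the transformed samples themselves, as allowed by claim 2, the server would run an ordinary training step on the pooled transformed samples rather than averaging gradients.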
CN202211198906.XA 2022-09-29 2022-09-29 Model training method based on federal learning and federal learning system Pending CN115719092A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211198906.XA CN115719092A (en) 2022-09-29 2022-09-29 Model training method based on federal learning and federal learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211198906.XA CN115719092A (en) 2022-09-29 2022-09-29 Model training method based on federal learning and federal learning system

Publications (1)

Publication Number Publication Date
CN115719092A true CN115719092A (en) 2023-02-28

Family

ID=85253383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211198906.XA Pending CN115719092A (en) 2022-09-29 2022-09-29 Model training method based on federal learning and federal learning system

Country Status (1)

Country Link
CN (1) CN115719092A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629388A (en) * 2023-07-25 2023-08-22 京东科技信息技术有限公司 Differential privacy federal learning training method, device and computer readable storage medium
CN116629388B (en) * 2023-07-25 2023-12-05 京东科技信息技术有限公司 Differential privacy federal learning training method, device and computer readable storage medium
CN118035389A (en) * 2024-04-11 2024-05-14 电子科技大学 Large language model training data recovery method in federal learning system

Similar Documents

Publication Publication Date Title
Chen et al. Fedgraph: Federated graph learning with intelligent sampling
Mo et al. PPFL: Privacy-preserving federated learning with trusted execution environments
CN113408743B (en) Method and device for generating federal model, electronic equipment and storage medium
Tang et al. # exploration: A study of count-based exploration for deep reinforcement learning
CN115719092A (en) Model training method based on federal learning and federal learning system
Akter et al. Edge intelligence: Federated learning-based privacy protection framework for smart healthcare systems
Wang et al. Underwater sonar image detection: A combination of non-local spatial information and quantum-inspired shuffled frog leaping algorithm
CN115345315A (en) Model training method based on federal learning and federal learning system
Huang et al. [Retracted] AFLPC: An Asynchronous Federated Learning Privacy‐Preserving Computing Model Applied to 5G‐V2X
JP7361928B2 (en) Privacy-preserving machine learning via gradient boosting
EP3863002A1 (en) Hidden sigmoid function calculation system, hidden logistic regression calculation system, hidden sigmoid function calculation device, hidden logistic regression calculation device, hidden sigmoid function calculation method, hidden logistic regression calculation method, and program
CN114239860A (en) Model training method and device based on privacy protection
CN118018426B (en) Training method, detecting method and device for network anomaly intrusion detection model
Akter et al. Edge intelligence-based privacy protection framework for iot-based smart healthcare systems
CN114626511B (en) Neural network training method, reasoning method and related products
Yang et al. Privacy‐preserving generative framework for images against membership inference attacks
Qu et al. Improving the reliability for confidence estimation
Qiu et al. [Retracted] Blockchain and K‐Means Algorithm for Edge AI Computing
Mo et al. Ppfl: Enhancing privacy in federated learning with confidential computing
Asad et al. Moreau envelopes-based personalized asynchronous federated learning: Improving practicality in network edge intelligence
Arevalo et al. Task-Agnostic Privacy-Preserving Representation Learning for Federated Learning against Attribute Inference Attacks
Zhou et al. A Concurrent Federated Reinforcement Learning for IoT Resources Allocation With Local Differential Privacy
Xia et al. Cascade Vertical Federated Learning Towards Straggler Mitigation and Label Privacy over Distributed Labels
Zheng et al. Infocensor: an information-theoretic framework against sensitive attribute inference and demographic disparity
CN117216788A (en) Video scene identification method based on federal learning privacy protection of block chain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination