WO2021259090A1 - Method, device and chip for federated learning - Google Patents

Method, device and chip for federated learning

Info

Publication number
WO2021259090A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
parameters
model
distribution
local
Prior art date
Application number
PCT/CN2021/100098
Other languages
English (en)
French (fr)
Inventor
邵云峰
郭凯洋
莫恩斯文森特
汪军
杨春春
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP21829619.2A (EP4156039A4)
Publication of WO2021259090A1
Priority to US18/080,523 (US20230116117A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N 3/08 - Learning methods
    • G06N 3/098 - Distributed learning, e.g. federated learning
    • G06N 20/00 - Machine learning
    • G06N 3/0475 - Generative networks

Definitions

  • This application relates to the field of artificial intelligence, in particular to a method, device and chip for federated learning.
  • Data islands pose a new challenge to artificial intelligence (AI) based on massive amounts of data, namely: how can a machine learning model be trained without the authority to obtain sufficient training data?
  • the present application provides a method and device for federated learning, which can support federated learning of a machine learning model whose parameters follow a distribution, thereby reducing the training time and communication overhead of federated learning.
  • A method for federated learning is provided, including: a first node receives a prior distribution of parameters of a federated model from a second node, where the federated model is a machine learning model whose parameters follow a distribution; and the first node trains, according to the prior distribution of the parameters of the federated model and the local training data of the first node, to obtain a posterior distribution of the parameters of a local model of the first node.
  • the federated learning of the machine learning model whose parameters follow the distribution is realized.
  • the machine learning model whose parameters obey the distribution can give the possibility of various values of the parameters in advance, and the possibility of various values of the parameters can characterize the pros and cons of the various possible improvement directions of the machine learning model. Therefore, performing federated learning on a machine learning model whose parameters follow a distribution can help nodes participating in federated learning to find a better direction for improvement of the machine learning model, thereby reducing the training time of federated learning and the communication overhead between nodes.
  • The method further includes: the first node determines the uncertainty of the local model according to the posterior distribution of the parameters of the local model; and when the uncertainty of the local model satisfies a first preset condition, the first node sends the posterior distribution of the parameters of the local model to the second node.
  • The uncertainty of the local model can be a good measure of the degree of matching between the local training data and the federated model, which in turn can indicate the importance of the first node for federated learning. Therefore, using the uncertainty of the local model as an index to decide whether the first node feeds back the training result to the second node can make the training process of the federated model more controllable. For example, when the federated model is expected to converge quickly, first nodes whose local models have higher uncertainty can be prohibited from feeding back their local training results; for another example, when the capacity of the federated model is to be increased, first nodes whose local models have higher uncertainty can be requested to feed back their local training results. In addition, a local model whose uncertainty does not meet the first preset condition is not sent to the second node, which can reduce the communication overhead between nodes.
  • The uncertainty of the local model is measured based on at least one of the following information: the variance of the posterior distribution of the parameters of the local model, the convergence rate of the posterior distribution of the parameters of the local model, or the inference accuracy of the posterior distribution of the parameters of the local model.
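  • As an illustration of this idea, the following minimal Python sketch measures the model-level uncertainty as the average posterior variance over all parameters and gates the upload on a threshold. The Gaussian-posterior assumption, the averaging choice, and the threshold value are not prescribed by the text and are only illustrative.

```python
import numpy as np

def model_uncertainty(posterior_variances):
    """Average posterior variance over all parameters; one simple proxy for the
    uncertainty of the local model (assumes independent Gaussian posteriors
    described by per-parameter variances)."""
    return float(np.mean([np.mean(v) for v in posterior_variances]))

def should_upload(posterior_variances, max_uncertainty=0.1):
    """Hypothetical 'first preset condition': upload the local posterior only
    when the model-level uncertainty stays below a threshold."""
    return model_uncertainty(posterior_variances) <= max_uncertainty

# Example: posterior variances of two layers of a local Bayesian model.
layer_variances = [np.full((4, 4), 0.05), np.full(4, 0.20)]
print(model_uncertainty(layer_variances))   # 0.125
print(should_upload(layer_variances))       # False with the 0.1 threshold
```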
  • The method further includes: the first node determines the uncertainty of a first parameter according to the posterior distribution of the first parameter of the local model, where the parameters of the local model include at least one parameter and the first parameter is any one of the at least one parameter; and when the uncertainty of the first parameter satisfies a second preset condition, the first node sends the posterior distribution of the first parameter to the second node.
  • the uncertainty of the parameter in the local model can be a good measure of the importance of the parameter to the local model.
  • the first node can upload only the training results of the parameters that are important to the local model, which can reduce the communication overhead between nodes and improve the communication efficiency.
  • the uncertainty of the first parameter is measured based on the variance of the posterior distribution of the first parameter.
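  • A sketch of the corresponding parameter-level selection is given below; whether parameters with low or high posterior variance are kept depends on the second preset condition, so the direction of the comparison and the threshold are assumptions.

```python
import numpy as np

def select_parameters(posterior, var_threshold=0.05, keep_low_variance=True):
    """posterior: {name: (mean, variance)} for each parameter of the local model.
    Returns, per parameter, the element-wise mask of entries whose posterior
    distributions would be uploaded, plus the selected means and variances."""
    selected = {}
    for name, (mean, var) in posterior.items():
        mask = var <= var_threshold if keep_low_variance else var > var_threshold
        selected[name] = {"mean": mean[mask], "var": var[mask], "mask": mask}
    return selected

posterior = {
    "layer1.weight": (np.zeros(6), np.array([0.01, 0.20, 0.03, 0.50, 0.02, 0.04])),
}
print(select_parameters(posterior)["layer1.weight"]["mask"])
```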
  • The method further includes: the first node determines the uncertainty of the local model according to the posterior distribution of the parameters of the local model; when the uncertainty of the local model satisfies the first preset condition, the first node determines the uncertainty of a first parameter according to the posterior distribution of the first parameter of the local model, where the local model includes at least one parameter and the first parameter is any one of the at least one parameter; and when the uncertainty of the first parameter satisfies a second preset condition, the first node sends the posterior distribution of the first parameter to the second node.
  • the first node selectively sends all or part of the results obtained from local training to the second node according to the uncertainty of the local model and the uncertainty of the parameters in the local model, which can reduce communication overhead between nodes and improve communication efficiency.
  • The prior distribution of the parameters of the federated model includes multiple local prior distributions, and the multiple local prior distributions correspond one-to-one to multiple Bayesian models. In this case, that the first node trains to obtain the posterior distribution of the parameters of the local model of the first node according to the prior distribution of the parameters of the federated model and the local training data of the first node includes: the first node determines the prior distribution of the parameters of the local model of the first node according to the degree of matching between the multiple local prior distributions and the local training data; and the first node trains to obtain the posterior distribution of the parameters of the local model according to the prior distribution of the parameters of the local model and the local training data.
  • the multiple local prior distributions may be implicit in the prior distribution of the parameters of the federation model.
  • the prior distributions of the parameters of the federation model may be decomposed into multiple local priors in a certain manner.
  • the prior distribution of the parameters of the federation model can be randomly sampled, so as to decompose the prior distribution of the parameters of the federation model into multiple local prior distributions.
  • the second node maintains a larger federated model containing multiple local prior distributions, and the first node selects a local prior distribution matching the local training data for local training, which can speed up the convergence speed of the local training process.
  • the federated learning includes multiple rounds of iteration, and the posterior distribution of the parameters of the local model is the posterior distribution of the parameters of the local model obtained through the current round of iterations .
  • That the first node determines the prior distribution of the parameters of the local model of the first node according to the degree of matching between the multiple local prior distributions and the local training data includes: the first node determines the prior distribution of the parameters of the local model of the first node according to the differences between the multiple local prior distributions and a historical posterior distribution, where the historical posterior distribution is the posterior distribution of the parameters of the local model obtained by the first node before the current round of iteration.
  • The prior distribution of the parameters of the local model may be the prior distribution with the smallest difference from the historical posterior distribution among the multiple local prior distributions.
  • Alternatively, the prior distribution of the parameters of the local model may be a weighted sum of the multiple local prior distributions, where the weight of each local prior distribution in the weighted sum is determined by the difference between that local prior distribution and the historical posterior distribution.
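  • The following sketch illustrates both options for diagonal Gaussian distributions: picking the local prior closest (in KL divergence) to the historical posterior, or weighting all local priors by their closeness. The KL-divergence measure, the softmax-style weighting, and the moment matching of the weighted mixture are illustrative choices, not requirements of the text.

```python
import numpy as np

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between diagonal Gaussians given as mean/variance arrays."""
    return 0.5 * float(np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

def local_prior_from_candidates(local_priors, hist_mu, hist_var, weighted=True):
    """local_priors: list of (mean, variance) candidate local prior distributions.
    hist_mu/hist_var: historical posterior of the local model's parameters."""
    kls = np.array([kl_gaussian(hist_mu, hist_var, mu, var) for mu, var in local_priors])
    if not weighted:                       # option 1: smallest difference wins
        return local_priors[int(np.argmin(kls))]
    w = np.exp(-kls)                       # option 2: weights shrink with the difference
    w /= w.sum()
    mu_mix = sum(wi * mu for wi, (mu, _) in zip(w, local_priors))
    var_mix = sum(wi * (var + mu ** 2) for wi, (mu, var) in zip(w, local_priors)) - mu_mix ** 2
    return mu_mix, var_mix
```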
  • the method further includes: the first node sends a posterior distribution of the parameters of the local model to the second node.
  • the machine learning model is a neural network.
  • the federation model is a Bayesian neural network.
  • the parameters of the federation model are random variables.
  • the local model is a neural network.
  • the local model is a Bayesian neural network.
  • the parameters of the local model are random variables.
  • The prior distribution of the parameters of the federated model is the probability distribution of the parameters of the federated model, or is the probability distribution of the probability distribution of the parameters of the federated model.
  • the first node and the second node are respectively a client and a server in the network.
  • A method for federated learning is provided, which includes: a second node receives a posterior distribution of parameters of a local model of at least one first node; and the second node updates the prior distribution of the parameters of the federated model according to the posterior distribution of the parameters of the local model of the at least one first node, where the federated model is a machine learning model whose parameters follow a distribution.
  • the federated learning of the machine learning model whose parameters follow the distribution is realized.
  • the machine learning model whose parameters obey the distribution can give the possibility of various values of the parameters in advance, and the possibility of various values of the parameters can characterize the pros and cons of the various possible improvement directions of the machine learning model. Therefore, the federated learning of the machine learning model whose parameters follow the distribution will help the nodes participating in the federated learning to find a better improvement direction of the machine learning model, thereby reducing the training time of federated learning and the communication overhead between nodes.
  • Before the second node receives the posterior distribution of the parameters of the local model of the at least one first node, the method further includes: the second node selects the at least one first node from candidate nodes, where the federated learning includes multiple rounds of iteration, the at least one first node is a node that participates in the current round of iteration, and the candidate nodes are nodes that have participated in the federated learning; and the second node sends the prior distribution of the parameters of the federated model to the at least one first node.
  • the second node selects the first node participating in this round of training from the candidate nodes, which can make the training process of federated learning more targeted and more flexible.
  • That the second node selects the at least one first node from candidate nodes includes: the second node selects the at least one first node from the candidate nodes according to evaluation information sent by the candidate nodes to the second node, where the evaluation information is used to indicate the degree of matching between the prior distribution of the parameters of the federated model and the local training data of the candidate node, or the evaluation information is used to indicate the degree of matching between the posterior distribution trained by the candidate node according to the prior distribution of the parameters of the federated model and the local training data of the candidate node, or the evaluation information is used to indicate the degree of matching between the prior distribution of the parameters of the federated model and the posterior distribution obtained by the candidate node through training according to the prior distribution of the parameters of the federated model.
  • In this way, the second node can accurately grasp the degree of matching between the candidate node's local model (or local training data) and the federated model, so that it can better select, according to actual needs, the first nodes participating in the federated learning.
  • That the second node selects the at least one first node from candidate nodes includes: the second node selects the at least one first node from the candidate nodes according to the difference between the historical posterior distribution of the candidate node and the prior distribution of the parameters of the federated model, where the historical posterior distribution is the posterior distribution of the parameters of the local model obtained by the candidate node before the current round of iteration.
  • The second node can grasp the degree of matching between the candidate node's local model (or local training data) and the federated model by calculating the difference between the historical posterior distribution of the candidate node and the prior distribution of the parameters of the federated model, so that it can better select, according to actual needs, the first nodes participating in federated learning.
  • the local model does not include a parameter whose uncertainty does not meet a preset condition.
  • the uncertainty of the parameter in the local model can be a good measure of the importance of the parameter to the local model.
  • the nodes selectively interact with important parameters, which can reduce the communication overhead between the nodes and improve the communication efficiency.
  • The at least one first node includes multiple first nodes, and the posterior distributions of the parameters of the local models of the multiple first nodes all include a posterior distribution of a first parameter. That the second node updates the prior distribution of the parameters of the federated model according to the posterior distribution of the parameters of the local model of the at least one first node includes: if the difference between the posterior distributions of the first parameter of the multiple first nodes is greater than a preset threshold, the second node, when updating the prior distribution of the parameters of the federated model, splits the first parameter into multiple parameters.
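  • A sketch of this splitting rule for a single parameter with Gaussian posteriors is shown below; measuring the disagreement by the maximum pairwise KL divergence, and merging by moment matching when the disagreement is small, are illustrative assumptions.

```python
import numpy as np

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def update_first_parameter(client_posteriors, split_threshold=1.0):
    """client_posteriors: list of (mean, variance) posteriors of the same parameter
    reported by different first nodes. If they disagree too much, keep them as
    separate components (the parameter is 'split'); otherwise merge them into a
    single Gaussian."""
    max_kl = max(kl_gaussian(m1, v1, m2, v2)
                 for (m1, v1) in client_posteriors for (m2, v2) in client_posteriors)
    if max_kl > split_threshold:
        return list(client_posteriors)                 # split into multiple parameters
    means = np.array([m for m, _ in client_posteriors])
    vars_ = np.array([v for _, v in client_posteriors])
    mu = means.mean()
    return [(float(mu), float((vars_ + means ** 2).mean() - mu ** 2))]

print(update_first_parameter([(0.0, 0.05), (2.5, 0.05)]))   # large disagreement -> split
```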
  • The prior distribution of the parameters of the federated model includes multiple local prior distributions, and the multiple local prior distributions correspond one-to-one to multiple Bayesian models.
  • the second node maintains a larger federation model that contains multiple local prior distributions, so that the first node can select a matching local prior distribution according to its own situation, which helps to speed up the convergence speed of the first node's local training process .
  • the machine learning model is a neural network.
  • the federation model is a Bayesian neural network.
  • the parameters of the federation model are random variables.
  • the local model is a neural network.
  • the local model is a Bayesian neural network.
  • the parameters of the local model are random variables.
  • The prior distribution of the parameters of the federated model is the probability distribution of the parameters of the federated model, or is the probability distribution of the probability distribution of the parameters of the federated model.
  • the first node and the second node are respectively a client and a server in the network.
  • A method for federated learning is provided, including: a first node receives a federated model from a second node, where the federated model includes multiple machine learning models (such as multiple neural networks); the first node selects a target machine learning model from the multiple machine learning models; and the first node trains the local model of the first node according to the target machine learning model and the local training data of the first node.
  • the first node can select a machine learning model for local training according to its own situation, which helps to shorten the time-consuming local calculation of the first node and improve local calculation efficiency.
  • That the first node selects a target machine learning model from the multiple machine learning models includes: the first node selects the target machine learning model from the multiple machine learning models according to the degree of matching between the multiple machine learning models and the local training data.
  • the first node selects a machine learning model that matches the local training data for local training, which can improve the training efficiency of local training.
  • A method for federated learning is provided, including: a second node sends a federated model to a first node, where the federated model includes multiple machine learning models (such as multiple neural networks); the second node receives, from the first node, a local model corresponding to a target machine learning model among the multiple machine learning models; and the second node optimizes the target machine learning model according to the local model.
  • the first node can select a machine learning model for local training according to its own situation, which helps to shorten the time-consuming local calculation of the first node and improve local calculation efficiency.
  • a device for federated learning includes a module for executing the method of any one of the first to fourth aspects.
  • A device for federated learning includes: a memory for storing a program; and a processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is configured to execute the method of any one of the first to fourth aspects.
  • A computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the method of any one of the first to fourth aspects.
  • A computer program product containing instructions is provided.
  • When the computer program product runs on a computer, the computer is caused to execute the method of any one of the first to fourth aspects.
  • In a ninth aspect, a chip is provided, which includes a processor and a data interface.
  • the processor reads instructions stored in a memory through the data interface and executes the method of any one of the first to fourth aspects .
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is used to execute the method in the first aspect.
  • an electronic device which includes the federated learning device in any one of the above-mentioned fifth aspect to the sixth aspect.
  • Figure 1 is an example diagram of an application scenario of federated learning.
  • Figure 2 is a flowchart of federated learning.
  • FIG. 3 is a diagram of a chip hardware structure provided by an embodiment of the application.
  • Fig. 4 is a schematic flowchart of a method for federated learning provided by an embodiment of the application.
  • FIG. 5 is a schematic flowchart of a possible implementation manner of step S420 in FIG. 4.
  • Fig. 6 is a schematic flowchart of a method for selecting a first node participating in federated learning provided by an embodiment of the application.
  • Fig. 7 is a schematic structural diagram of a federated learning device provided by an embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of a federated learning device provided by another embodiment of the present application.
  • Fig. 9 is a schematic structural diagram of a federated learning device provided by another embodiment of the present application.
  • the scenario of federated learning may include multiple first nodes 102 and second nodes 105.
  • the first node 102 and the second node 105 may be any nodes (such as network nodes) that support data transmission.
  • the first node 102 may be a client, such as a mobile terminal or a personal computer.
  • the second node 105 may be a server, or a parameter server.
  • the first node may be referred to as the owner of the training data, and the second node may also be referred to as the coordinator of the federated learning process.
  • the second node 105 can be used to maintain the federation model.
  • The first node 102 can obtain the federated model from the second node 105 and perform local training in combination with the local training data to obtain a local model. After the local model is obtained through training, the first node 102 may send the local model to the second node 105 so that the second node 105 can update or optimize the federated model. This process repeats over multiple rounds of iteration until the federated model converges or a preset iteration stop condition is reached.
  • the second node 105 constructs a federation model.
  • The second node 105 can construct a general machine learning model, or can construct a specific machine learning model according to requirements. Taking the image recognition task as an example, the second node 105 may construct a convolutional neural network (CNN) as the federated model.
  • the second node 105 selects the first node 102.
  • The first node 102 selected by the second node 105 will get the federated model issued by the second node 105.
  • the second node 105 may select the first node 102 randomly, or select the first node 102 according to a certain strategy. For example, the second node 105 may select the first node 102 with a higher degree of matching between the local model and the federated model to speed up the convergence of the federated model.
  • the first node 102 obtains or receives the federation model from the second node 105.
  • the first node 102 may actively request the second node 105 to issue a federated model.
  • the second node 105 actively delivers the federation model to the first node 102.
  • the client can download the federation model from the server.
  • step S240 the first node 102 uses the local training data to train the federated model to obtain the local model.
  • the first node 102 may use the federated model as the initial model of the local model, and then use the local training data to train the initial model in one or more steps to obtain the local model.
  • the local training process can be regarded as the optimization process of the local model.
  • The optimization goal of the local training can be expressed by the following formula: $\omega_k = \arg\min_{\omega} F_k(\omega)$, where $\omega_k$ represents the local model of the k-th first node, $\omega_t$ represents the federated model in the t-th iteration, and $F_k(\omega)$ represents the loss function of the local model on the local training data. The optimization can use $\omega_t$ as the initial value of $\omega$, or use the local model obtained in the previous iteration as the initial value.
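  • A minimal sketch of this conventional fixed-value local update (a few gradient steps on the local loss $F_k$ starting from the federated model of the current iteration) might look as follows; the learning rate, the number of steps, and the quadratic toy loss are illustrative.

```python
import numpy as np

def local_update(w_t, grad_fk, lr=0.1, steps=10):
    """Starting from the federated model w_t of the t-th iteration, the k-th first
    node runs a few gradient steps to approximately minimize its local loss F_k
    (grad_fk returns the gradient of F_k at the current weights)."""
    w = np.array(w_t, dtype=float)
    for _ in range(steps):
        w -= lr * grad_fk(w)
    return w

# Toy example: F_k(w) = ||w - target||^2 / 2, so grad F_k(w) = w - target.
target = np.array([1.0, -2.0])
print(local_update(np.zeros(2), lambda w: w - target))
```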
  • The second node 105 aggregates the local models trained by the first nodes 102 to obtain an updated federated model.
  • the second node 105 may perform a weighted summation of the parameters of the local models of the multiple first nodes 102, and use the weighted summation result as the updated federated model.
  • steps S220-S250 can be regarded as an iteration of the federated learning process.
  • the second node 105 and the first node 102 may repeat steps S220-S250 until the federated model converges or reaches a preset effect.
  • Federated learning can be used to train machine learning models.
  • the most common machine learning model is a neural network.
  • the related concepts of the neural network and some terms related to the embodiments of the present application will be explained first.
  • a neural network can be composed of neural units.
  • A neural unit can refer to an arithmetic unit that takes inputs $x_s$ and an intercept of 1. The output of the arithmetic unit can be: $h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. Dividing a DNN according to the locations of different layers, the layers inside the DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in the middle are all hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated. The parameters in a DNN are defined as follows. Take the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W_{24}^{3}$, where the superscript 3 represents the layer in which the coefficient W is located, and the subscript corresponds to the output third-layer index 2 and the input second-layer index 4.
  • In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as $W_{jk}^{L}$. It should be noted that there is no W parameter in the input layer. In a deep neural network, more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
  • Taking the loss function as an example: the higher the output value (loss) of the loss function, the greater the difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
  • a neural network whose parameters follow a distribution is one of the machine learning models whose parameters follow a distribution.
  • the parameters of the traditional neural network (such as the weight of the neuron mentioned above) are fixed values.
  • this type of neural network has the problem of overfitting, that is, this type of neural network often gives overconfident predictions in areas where the training data is missing, and cannot accurately measure the uncertainty of the prediction results.
  • the parameters of some neural networks obey a certain distribution.
  • the parameters of a Bayesian neural network are random variables that obey a certain distribution, such as a random variable that obeys a Gaussian distribution.
  • the training process of a neural network whose parameters obey a probability distribution is not intended to obtain a fixed value of the parameter, but to optimize the probability distribution of the parameter.
  • the distribution of the parameters can be sampled, and each sampled value can correspond to a neural network with a fixed value of the parameter.
  • If, for a given input, the predictions of the neural networks corresponding to different sampled values are consistent, the neural network has small uncertainty in its prediction for that input; otherwise, the neural network has large uncertainty in its prediction for that input.
  • the neural network whose parameters obey the probability distribution can characterize the uncertainty of prediction due to missing data, thereby avoiding the problem of overfitting.
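  • The following toy sketch shows what this means for a single-layer network whose weights follow Gaussian distributions: every sampled weight vector yields a fixed-value network, and the spread of their predictions for a given input serves as the prediction uncertainty. The layer shape, the sigmoid output, and the number of samples are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_with_uncertainty(x, w_mu, w_sigma, b_mu, b_sigma, n_samples=50):
    """Sample fixed-value networks from the Gaussian weight distributions and
    report the mean prediction and its spread (uncertainty) per input row."""
    preds = []
    for _ in range(n_samples):
        w = w_mu + w_sigma * rng.standard_normal(w_mu.shape)
        b = b_mu + b_sigma * rng.standard_normal()
        preds.append(sigmoid(x @ w + b))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

x = np.array([[0.5, -1.2],
              [6.0, 6.0]])     # two example inputs
w_mu, w_sigma = np.array([0.8, -0.3]), np.array([0.1, 0.5])
mean, std = predict_with_uncertainty(x, w_mu, w_sigma, b_mu=0.0, b_sigma=0.1)
print(mean, std)               # per-input mean prediction and uncertainty
```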
  • the training problem of the machine learning model whose parameters obey the probability distribution can be regarded as the problem of estimating the probability distribution of the parameters based on the Bayesian formula.
  • prior distribution, posterior distribution and likelihood estimation are three important concepts.
  • the prior distribution of parameters is a pre-hypothesis of the posterior distribution, that is, the prior distribution of the parameters refers to the assumption of the posterior distribution of the parameters before the training data is observed.
  • the prior distribution of parameters can be manually specified, or it can be obtained through data learning.
  • the posterior distribution of the parameters is the description of the distribution of the parameters after the training data is observed.
  • the prior distribution and/or the posterior distribution of the parameters may adopt a parameterized distribution description method.
  • For example, the prior distribution and/or the posterior distribution of a parameter can describe a Gaussian distribution by its mean and variance.
  • the prior distribution and/or the posterior distribution may also adopt a non-parametric distribution description method.
  • the prior distribution and/or the posterior distribution of the parameters may use probability histograms, probability density, cumulative function curves, etc. to describe the distribution of the parameters.
  • the prior distribution of the model parameters can be the probability distribution of the model parameters, or the probability distribution of the probability distribution of the model parameters.
  • the prior distribution can be regarded as a pre-description of the posterior distribution, that is, a hypothetical description before the training data is observed. If the prior distribution of model parameters is the probability distribution of model parameters, then this type of prior distribution can be understood as a "point description" of the posterior distribution; if the prior distribution of model parameters is the probability distribution of model parameters Probability distribution, this type of prior distribution can be understood as a "distribution description" of the posterior distribution.
  • For example, when the prior distribution of the model parameters is the probability distribution of the model parameters, the prior distribution may be given as the mean and variance of the distribution of the model parameters. From the perspective of the prior distribution describing the posterior distribution, this is equivalent to using a single point [mean, variance] in the prior distribution to give a "point description" of the posterior distribution.
  • When the prior distribution of the model parameters is the probability distribution of the probability distribution of the model parameters, the prior distribution does not directly give the mean and variance of the distribution of the model parameters; instead, it describes the probability that the mean and variance of the model parameter distribution take different values. From the perspective of the prior distribution describing the posterior distribution, this is equivalent to the prior distribution using the probabilities (or the penalties or rewards) of different values of the mean and variance of the posterior distribution to give a "distribution description" of the posterior distribution.
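  • The two kinds of description can be pictured with the following small sketch; the concrete hyper-distribution used for the "distribution description" (a Gaussian over the mean and a discrete set of variances with probabilities) is only an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Point description": the prior directly states one (mean, variance) pair
# for the Gaussian distribution of a parameter.
point_prior = {"mean": 0.0, "var": 1.0}

# "Distribution description": the prior instead states how likely different
# (mean, variance) values are.
dist_prior = {
    "mean": {"mu": 0.0, "sigma": 0.5},                      # mean ~ N(0, 0.5^2)
    "var": {"values": [0.5, 1.0, 2.0], "probs": [0.2, 0.5, 0.3]},
}

def sample_point_description(dist_prior):
    """Sampling a 'distribution description' yields one 'point description'."""
    mean = dist_prior["mean"]["mu"] + dist_prior["mean"]["sigma"] * rng.standard_normal()
    var = rng.choice(dist_prior["var"]["values"], p=dist_prior["var"]["probs"])
    return {"mean": float(mean), "var": float(var)}

print(point_prior, sample_point_description(dist_prior))
```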
  • Certain embodiments of the present application may involve the measurement of the difference between the prior distribution and the posterior distribution.
  • There can be many ways to measure the difference between the prior distribution and the posterior distribution, and different distribution difference measurement functions can be designed, according to the way the prior distribution describes the posterior distribution, to measure the difference between the two distributions. A few examples are given below.
  • For example, the difference between the prior distribution and the posterior distribution can be measured by the Kullback-Leibler (KL) divergence between the two distributions; that is, the KL divergence of the prior distribution and the posterior distribution can be used as the distribution difference measurement function of the two distributions.
  • For another example, the difference between the prior distribution and the posterior distribution can be measured by calculating the similarity of the histograms (or probability density curves) corresponding to the two distributions.
  • the similarity of the histogram (or probability density curve) corresponding to the prior distribution and the posterior distribution can be used as the distribution difference measurement function of the two distributions.
  • the similarity of the histograms (or probability density curves) corresponding to two distributions can be obtained by calculating the difference or cosine distance between the areas of the two histograms (or probability density curves).
  • For another example, the probability of the prior distribution at the value taken by the posterior distribution can be used as a description, or measure, of the difference between the two distributions.
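  • The last two measures can be sketched as follows; the bin layout of the histograms and the Gaussian hyper-prior used to score the posterior's mean are illustrative assumptions.

```python
import numpy as np

def histogram_cosine_similarity(hist_a, hist_b):
    """Cosine similarity of two probability histograms defined over the same bins;
    values close to 1 mean the two distributions look alike."""
    a, b = np.asarray(hist_a, float), np.asarray(hist_b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def prior_probability_at_posterior(posterior_mean, hyper_prior):
    """'Distribution description' case: how probable the trained posterior's mean is
    under the prior's hyper-distribution over the mean (a full version would also
    score the posterior's variance)."""
    return float(gaussian_pdf(posterior_mean, hyper_prior["mu"], hyper_prior["sigma"]))

print(histogram_cosine_similarity([0.1, 0.4, 0.5], [0.2, 0.5, 0.3]))
print(prior_probability_at_posterior(0.3, {"mu": 0.0, "sigma": 0.5}))
```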
  • FIG. 3 is a chip hardware structure provided by an embodiment of the application.
  • the chip includes a neural network processor 50.
  • the chip can be set in the first node 102 as shown in FIG. 1 and used for the first node 102 to complete the training work of the local model.
  • the chip can also be set in the second node 105 as shown in FIG. 1 for the second node 105 to complete the maintenance and update of the federated model.
  • the neural network processor 50 is mounted on a main central processing unit (host central processing unit, host CPU) as a coprocessor, and the main CPU distributes tasks.
  • the core part of the neural network processor 50 is the arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data from the memory (weight memory or input memory) and perform calculations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches matrix A data and matrix B from the input memory 501 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 508.
  • the vector calculation unit 507 can perform further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 507 can be used for network calculations in the non-convolutional/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 507 stores the processed output vector to the unified buffer 506.
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 507 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in a subsequent layer in a neural network.
  • the unified memory 506 is used to store input data and output data.
  • A direct memory access controller (DMAC) 505 is used to transfer the input data in the external memory to the input memory 501 and/or the unified memory 506, store the weight data in the external memory into the weight memory 502, and store the data in the unified memory 506 into the external memory.
  • the bus interface unit (BIU) 510 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through the bus.
  • An instruction fetch buffer 509 connected to the controller 504 is used to store instructions used by the controller 504.
  • the controller 504 is used to call the instructions cached in the memory 509 to control the working process of the computing accelerator.
  • the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all on-chip memories.
  • The external memory is a memory external to the neural network processor, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • The existing federated learning can only train machine learning models whose parameters (such as weights) are fixed values, but cannot train machine learning models whose parameters follow a distribution. Since the data distribution of the local training data is often inconsistent with the data distribution of the overall training data (the overall training data refers to the data set formed by all local training data), the federated learning process of a machine learning model with fixed-value parameters often exhibits the problem of model oscillation (that is, during the training process, the values of the model parameters swing back and forth instead of continuously converging in one direction), which leads to a long training time and high communication overhead in the federated learning process.
  • this application provides a method of federated learning, which can realize federated learning of a machine learning model whose parameters follow a distribution. It should be understood that the distribution mentioned in this application refers to a probability distribution. The following describes in detail the federated learning method provided by the embodiment of the present application in conjunction with FIG. 4.
  • the method in FIG. 4 includes steps S410-S440.
  • the first node in FIG. 4 may be any one of the first nodes 102 in FIG. 1, and the second node in FIG. 4 may be the second node 105 in FIG. 1.
  • the federated model mentioned in the embodiment of FIG. 4 is a machine learning model whose parameters follow a distribution.
  • the federated model is a neural network whose parameters follow a distribution, and the parameters of the federated model may refer to the parameters of neurons in the neural network.
  • the federation model can be a Bayesian neural network. Further, in some embodiments, the parameters in the Bayesian neural network may obey Gaussian distribution.
  • the local model mentioned in the embodiment of FIG. 4 may also be a machine learning model whose parameters follow a distribution.
  • the local model is a neural network whose parameters follow a distribution, and the parameters of the local model may refer to the parameters of neurons in the neural network.
  • the local model can be a Bayesian neural network.
  • the parameters in the Bayesian neural network obey Gaussian distribution, delta distribution or other distributions.
  • the federated model and the local model may be machine learning models with the same structure.
  • the federated model may include multiple Bayesian models (such as multiple Bayesian neural networks), and the local model may have the same structure as one of the Bayesian models.
  • the first node receives the prior distribution of the parameters of the federated model from the second node.
  • the first node may actively request the second node to issue the prior distribution of the parameters of the federation model.
  • the second node may actively deliver the prior distribution of the parameters of the federation model to the first node.
  • step S420 the first node trains to obtain the posterior distribution of the parameters of the local model of the first node according to the prior distribution of the parameters of the federation model and the local training data of the first node.
  • step S420 can also be described as: the first node optimizes and obtains the posterior distribution of the parameters of the local model of the first node according to the prior distribution of the parameters of the federation model and the local training data of the first node.
  • the posterior distribution of the parameters of the local model of the first node can be inferred by Bayesian optimization according to the prior distribution of the parameters of the federation model.
  • step S430 the second node receives at least one posterior distribution of the parameters of the local model of the first node.
  • the first node actively sends the posterior distribution of the parameters of the local model to the second node.
  • the first node may send the posterior distribution of the parameters of the local model to the second node at the request of the second node.
  • the posterior distribution of the parameters of the local model sent by the first node to the second node may be the posterior distribution of all the parameters of the local model, or may be the posterior distribution of some parameters of the local model.
  • The first node may send, to the second node, the difference between the posterior distribution of the local model parameters and the prior distribution of the federated model parameters, as a way of sending the posterior distribution of the parameters of the local model to the second node.
  • the first node may directly send the posterior distribution itself of the parameters of the local model to the second node.
  • the posterior distribution of the parameters of the local model sent by the first node to the second node may be the posterior distribution of the parameters of the local model after encryption, or the posterior distribution of the parameters of the local model without encryption.
  • the first node may also send local training data to the second node.
  • The second node updates the prior distribution of the parameters of the federated model according to the posterior distribution of the parameters of the local model of the at least one first node. For example, the second node may receive the posterior distribution of the parameters of the local model sent by at least one first node; then, the second node may perform a weighted summation of the posterior distributions of the parameters of the local models of the at least one first node to obtain the prior distribution of the parameters of the updated federated model.
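  • For Gaussian posteriors, one possible form of this weighted summation is sketched below; weighting clients by their amount of local data and moment-matching the weighted mixture are assumptions, since the text only calls for a weighted summation.

```python
import numpy as np

def aggregate_posteriors(client_posteriors, client_weights=None):
    """client_posteriors: list of (mean, variance) arrays, one pair per first node,
    all for the same set of parameters. Returns the updated prior (mean, variance)."""
    n = len(client_posteriors)
    w = np.ones(n) / n if client_weights is None else np.asarray(client_weights, float)
    w = w / w.sum()
    means = np.stack([m for m, _ in client_posteriors])
    vars_ = np.stack([v for _, v in client_posteriors])
    mu = np.tensordot(w, means, axes=1)                      # weighted mean
    second = np.tensordot(w, vars_ + means ** 2, axes=1)     # weighted second moment
    return mu, second - mu ** 2

new_prior = aggregate_posteriors(
    [(np.array([0.1, -0.2]), np.array([0.05, 0.10])),
     (np.array([0.3, 0.0]), np.array([0.08, 0.20]))],
    client_weights=[100, 300],   # e.g. number of local training samples per node
)
print(new_prior)
```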
  • step S410 to step S440 can be executed once, or repeated multiple times.
  • step S410 to step S440 may be iteratively performed multiple times until the iteration stop condition is satisfied.
  • the iteration stopping condition may be that a preset number of iterations is reached, or the federated model may have converged, for example.
  • the embodiment of the application realizes the federated learning of the machine learning model whose parameters follow the distribution by interacting the prior distribution and the posterior distribution of the model parameters between nodes.
  • the machine learning model whose parameters obey the distribution can give the possibility of various values of the parameters in advance, and the possibility of various values of the parameters can characterize the pros and cons of the various possible improvement directions of the machine learning model. Therefore, performing federated learning on a machine learning model whose parameters follow a distribution can help nodes participating in federated learning to find a better direction for improvement of the machine learning model, thereby reducing training time and communication overhead between nodes.
  • There may be multiple implementations of step S420 in FIG. 4, which are described below with reference to FIG. 5 as an example.
  • step S420 further includes step S422 and step S424.
  • step S422 the first node determines the prior distribution of the parameters of the local model according to the prior distribution of the parameters of the federation model.
  • step S424 the first node trains to obtain the posterior distribution of the parameters of the local model of the first node according to the prior distribution of the parameters of the local model and the local training data of the first node.
  • There are many ways to implement step S422. For example, if the federated model and the local model correspond to machine learning models of the same structure, the first node may directly use the prior distribution of the parameters of the federated model as the prior distribution of the parameters of the local model.
  • For another example, the prior distribution of the parameters of the federated model can include multiple local prior distributions (each local prior distribution can correspond to a Bayesian model). After the first node receives the prior distribution of the parameters of the federated model, the prior distribution of the parameters of the local model can be determined according to the degree of matching between the multiple local prior distributions and the local training data.
  • The multiple local prior distributions may be explicitly included in the prior distribution of the parameters of the federated model; or, in some embodiments, the multiple local prior distributions may also be implicitly included in the prior distribution of the parameters of the federated model, in which case they need to be decomposed from the prior distribution of the parameters of the federated model by a certain method (such as random sampling). A few examples are given below.
  • the federated model includes multiple Bayesian models with the same structure, where each parameter of each Bayesian model contains only one distribution.
  • In this case, the prior distribution of the federated model parameters provides a "point description" of the posterior distribution.
  • the prior distributions provided by multiple Bayesian models may be different, that is, a parameter may have multiple possible distributions.
  • After the first node receives the prior distribution of the federated model parameters, it can sample the multiple possible distributions of each parameter (for example, by random sampling), and combine the sampling results of the distributions of different parameters in multiple ways to form multiple local prior distributions.
  • the first node may select the local prior distribution that best matches the local training data from the multiple local prior distributions according to the degree of matching between the multiple local prior distributions and the local training data of the first node, as The prior distribution of the parameters of the local model.
  • the first node may use a weighted summation method to obtain the prior distribution of the parameters of the local model according to the difference in the degree of matching between the multiple local prior distributions and the local training data.
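  • A sketch of this sampling-and-selection procedure is given below; representing the federated prior as a few candidate (mean, variance) pairs per parameter, and scoring a candidate by a user-supplied matching function (for example, validation log-likelihood after a short training run), are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_candidate_local_priors(per_parameter_candidates, n_candidates=8):
    """per_parameter_candidates: {param_name: [(mean, var), ...]} - several possible
    distributions per parameter taken from the federated prior. Each candidate
    local prior picks one distribution per parameter at random."""
    return [{name: options[rng.integers(len(options))]
             for name, options in per_parameter_candidates.items()}
            for _ in range(n_candidates)]

def pick_best_local_prior(candidates, match_score):
    """match_score(candidate) measures how well the candidate prior matches the
    local training data; the candidate with the highest score is used as the
    prior distribution of the local model's parameters."""
    scores = [match_score(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```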
  • the federated model includes only one machine learning model, but each parameter of the machine learning model includes multiple distributions (that is, the distribution of the parameters is a mixed distribution).
  • In this case, the prior distribution of the federated model parameters still provides a "point description" of the posterior distribution, and there are still multiple possible distributions for each parameter of the machine learning model.
  • After the first node receives the prior distribution of the federated model parameters, it can sample the multiple possible distributions of each parameter (for example, by random sampling), and combine the sampling results of the distributions of different parameters in multiple ways to form multiple local prior distributions.
  • the first node may select the local prior distribution that best matches the local training data from the multiple local prior distributions according to the degree of matching between the multiple local prior distributions and the local training data of the first node, as The prior distribution of the parameters of the local model.
  • the first node may use a weighted summation method to obtain the prior distribution of the parameters of the local model according to the difference in the degree of matching between the multiple local prior distributions and the local training data.
  • the federated model maintained by the second node may also be a combination of the above two situations, that is, the second node maintains multiple machine learning models, and one parameter of one machine learning model includes multiple distributions.
  • the value of the distribution of each parameter has more possibilities, which can provide a richer selection range for the sampling of the first node.
  • Case 1: the federated model maintains only one Bayesian neural network, and each parameter of the Bayesian neural network contains only one Gaussian distribution;
  • Case 2: the federated model maintains multiple Bayesian neural networks, where each parameter of each Bayesian neural network contains only one Gaussian distribution, and the distributions of the parameters of the multiple Bayesian neural networks are different;
  • Case 3: the federated model maintains only one Bayesian neural network, where each parameter contains multiple Gaussian distributions;
  • Case 4: the federated model maintains multiple Bayesian neural networks, where each parameter of each Bayesian neural network contains multiple Gaussian distributions, and the distributions of the parameters of the multiple Bayesian neural networks are different.
  • The first node can first sample it to obtain the parameters of a Bayesian neural network in which each parameter contains only one Gaussian distribution.
  • If the prior distribution of the federated model parameters uses a "distribution description" for the posterior distribution, the values of the prior distribution can first be sampled according to the probabilities of the distribution values given by the "distribution description", to obtain multiple values of the prior distribution. After this sampling operation, the "distribution description" of the posterior distribution given by the prior distribution is equivalently converted into multiple "point descriptions" of the posterior distribution, and each "point description" is equivalent to a local prior distribution decomposed from the prior distribution of the federated model parameters.
  • the first node may select a local prior distribution that matches the local training data from the multiple local prior distributions according to the degree of matching between the multiple local prior distributions and the local training data of the first node, as the local model The prior distribution of the parameters.
  • the first node may use a weighted summation method to obtain the prior distribution of the parameters of the local model according to the difference in the degree of matching between the multiple local prior distributions and the local training data.
  • each local prior distribution can be used as the prior distribution of the local model parameters in turn, and the local training data can be combined for training. Then, according to the training effect of each local prior distribution, the matching degree between the local prior distribution and the local training data of the first node is measured.
  • the matching degree between the local prior distribution and the local training data of the first node may be measured according to the difference between the local prior distribution and the historical posterior distribution of the local model parameters. Then, the prior distribution of the parameters of the local model can be determined based on the difference between the multiple local prior distributions and the historical posterior distribution. For example, the prior distribution with the smallest difference from the historical posterior distribution among the multiple local prior distributions may be used as the prior distribution of the parameters of the local model.
  • a weighted summation may be performed on a plurality of local prior distributions based on the difference between a plurality of local prior distributions and a historical posterior distribution, and the result of the weighted summation may be used as the prior distribution of the parameters of the local model.
  • the historical posterior distribution mentioned in this embodiment refers to the posterior distribution of the parameters of the local model obtained by the first node before the current iteration, such as the posterior distribution of the parameters of the local model obtained in the previous iteration.
  • the measurement method of the difference between the two distributions is described in the previous section and will not be detailed here.
  • the scheme of maintaining multiple machine learning models by the federated model can also be applied to federated learning of machine learning models whose parameters are fixed values.
  • The first node receives a federated model including multiple machine learning models from the second node; then, the first node selects a target machine learning model from the multiple machine learning models, and trains the local model of the first node according to the target machine learning model and the local training data of the first node.
  • the target machine learning model may be the machine learning model with the highest degree of matching with the local training data among the plurality of machine learning models, or the target machine learning model may be the machine learning model with the highest accuracy among the plurality of machine learning models.
  • Correspondingly, the second node sends a federated model including multiple machine learning models to the first node; then, the second node can receive a local model from the first node (that is, a local model obtained by training the target machine learning model); the second node optimizes the target machine learning model according to the local model (that is, the second node optimizes the corresponding machine learning model in the federated model according to the local model).
  • Step S422 in FIG. 5 is described in detail above, and step S424 in FIG. 5 is described in detail below, that is, how to generate the posterior distribution of the parameters of the local model according to the prior distribution of the parameters of the local model.
  • the process of generating the posterior distribution of the parameters of the local model according to the prior distribution of the parameters of the local model is the process of using the local training data to perform local training on the local model.
  • the prior distribution of the parameters of the local model can be used in multiple ways.
  • the prior distribution of the parameters of the local model can be used as the constraint condition in the optimization target of the local training.
  • the initial value of the posterior distribution of the parameters of the local model may be determined according to the prior distribution of the parameters of the local model.
  • Method 1: the prior distribution of the parameters of the local model is used as a constraint condition in the optimization objective of the local training.
  • The optimization goal of local training can be set as follows: the loss function of the posterior distribution of the parameters of the local model on the local training data is as small as possible (or the likelihood function is as large as possible), and at the same time the distribution difference measurement function (or penalty term) between the prior distribution and the posterior distribution of the local model parameters is as small as possible.
  • initial values can be set for the posterior distribution of the parameters of the local model.
  • the initial value of the posterior distribution of the parameters of the local model may be set to the value of the posterior distribution of the parameters of the local model before the current iteration (such as the previous iteration), or may be a randomized initial value.
  • the initial value of the posterior distribution of the parameters of the local model may be determined according to the prior distribution of the parameters of the local model.
  • The initial value of the posterior distribution of the parameters of the local model can be the value of the prior distribution of the parameters of the local model; the following takes the case where the prior distribution adopts a "distribution description" of the posterior distribution as an example.
  • the initial value of the posterior distribution of the parameters of the local model can be the sampling value of the prior distribution of the parameters of the local model.
  • a score function or reparameterization method can be used for local training until the posterior distribution of the parameters of the local model converges.
  • Method 2: determine the initial value of the posterior distribution of the parameters of the local model according to the prior distribution of the parameters of the local model.
  • the prior distribution of the parameters of the local model can be used as the initial value of the posterior distribution of the parameters of the local model in the local training process. If the prior distribution of the parameters of the local model adopts the "distribution description" for the posterior distribution, then in the local training process, the initial value of the posterior distribution of the parameters of the local model can be the sampling value of the prior distribution of the parameters of the local model .
  • the optimization objective of the local training can be set as: when the local training data is trained, the loss function of the posterior distribution of the parameters of the local model is as small as possible or the likelihood function is as large as possible.
  • the first node uses local training data to carry out local training.
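  • The following is a minimal sketch of such a local training loop, combining Method 1 (the prior acting as a KL regularization term in the optimization objective), Method 2 (initializing the posterior from the prior), and the reparameterization technique mentioned above. It assumes diagonal Gaussian priors and posteriors and PyTorch-style components; model_fn, loss_fn and all hyperparameters are illustrative placeholders rather than elements of the original method.

```python
import torch
from torch.distributions import Normal, kl_divergence

def local_training(prior_mu, prior_sigma, data_loader, loss_fn, model_fn, steps=100, lam=1.0):
    """Variational local training: minimize expected loss + lam * KL(posterior || prior).
    prior_mu / prior_sigma: tensors holding the prior of the local model parameters.
    model_fn(theta, x): forward pass of the local model with sampled parameters theta."""
    # posterior is a diagonal Gaussian, initialized from the prior (Method 2 initialization)
    post_mu = prior_mu.clone().requires_grad_(True)
    post_rho = torch.log(torch.expm1(prior_sigma)).clone().requires_grad_(True)  # sigma = softplus(rho)
    opt = torch.optim.Adam([post_mu, post_rho], lr=1e-3)
    for _ in range(steps):
        for x, y in data_loader:
            sigma = torch.nn.functional.softplus(post_rho)
            theta = Normal(post_mu, sigma).rsample()            # reparameterization trick
            nll = loss_fn(model_fn(theta, x), y)                # loss on local training data
            kl = kl_divergence(Normal(post_mu, sigma), Normal(prior_mu, prior_sigma)).sum()
            loss = nll + lam * kl                               # Method 1: prior as regularizer
            opt.zero_grad()
            loss.backward()
            opt.step()
    return post_mu.detach(), torch.nn.functional.softplus(post_rho).detach()
```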
  • the first node can send the posterior distribution of the parameters of the local model obtained by the training to the second node, so that the second node can update the parameters of the federated model according to the posterior distribution of the parameters of the local model received.
  • Alternatively, the first node may decide whether to feed back the result of the local training to the second node according to certain conditions; and/or, the first node may decide whether to feed back all of the results of the local training to the second node or only part of them according to certain conditions. The decision-making methods of the first node are described below with reference to specific embodiments.
  • Before sending the posterior distribution of the parameters of the local model to the second node, the first node may determine the uncertainty of the local model according to the posterior distribution of the parameters of the local model. When the uncertainty of the local model meets the first preset condition, the first node sends the posterior distribution of the parameters of the local model to the second node; when the uncertainty of the local model does not meet the first preset condition, the first node does not send the posterior distribution of the parameters of the local model to the second node.
  • the uncertainty of the local model can be used to express the stability of the local model.
  • the uncertainty of the local model may indicate the importance of the local training data of the first node to the federated model (or the importance of federated learning).
  • For example, when the federated model is expected to converge as soon as possible, a high uncertainty of the local model means that the local training data of the first node is not important to the federated model; taking the posterior distribution of such a local model into account would reduce the convergence speed of the federated model.
  • Conversely, when the capacity of the federated model is to be increased, a high uncertainty of the local model means that the local training data of the first node is important for the federated model; in this case, taking the posterior distribution of the parameters of the local model into account will improve the reliability of the federated model's inference on data that is the same as or close to the local training data.
  • The uncertainty of the local model can be measured based on at least one of the following: the variance of the posterior distribution of the parameters of the local model, the convergence speed (or convergence effect) of the posterior distribution of the parameters of the local model, or the inference accuracy rate of the posterior distribution of the parameters of the local model.
  • the embodiment of the application does not limit the specific content of the first preset condition, and can be selected according to actual needs.
  • In one example, when the uncertainty of the local model does not satisfy the first preset condition, the first node does not send the posterior distribution of the parameters of the local model to the second node. For instance, when the variance of the local model is greater than a preset threshold or the convergence speed of the local model is less than a preset threshold, the first node does not send the posterior distribution of the parameters of the local model to the second node (for example, when fast convergence of the federated model is desired).
  • In another example, when the variance of the local model is greater than a preset threshold or the convergence speed of the local model is less than a preset threshold, the first node sends the posterior distribution of the parameters of the local model to the second node (for example, when the capacity of the federated model is to be increased), as sketched below.
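  • A minimal sketch of this upload decision, assuming the uncertainty of the local model is measured by the mean posterior variance and the first preset condition is a simple threshold; the mode names are illustrative, not part of the original method.

```python
import numpy as np

def should_upload(post_var, var_threshold=0.1, mode="fast_convergence"):
    """Decide whether the first node uploads its posterior, based on model uncertainty.
    post_var: variances of the posterior distribution of all local model parameters."""
    uncertainty = float(np.mean(post_var))        # one possible uncertainty measure
    if mode == "fast_convergence":
        return uncertainty <= var_threshold       # skip unstable (high-variance) local models
    return uncertainty > var_threshold            # capacity mode: favor informative, uncertain models
```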
  • Alternatively, before sending the posterior distribution of the parameters of the local model to the second node, the first node may choose whether to send the posterior distribution of the parameters of the local model to the second node according to the difference between the posterior distribution and the prior distribution of the parameters of the local model.
  • For example, when the difference between the posterior distribution of the parameters of the local model and the prior distribution of the parameters of the local model is small, the first node may not send the posterior distribution of the parameters of the local model to the second node. A small difference indicates that the difference between the local model and the federated model is small, so even if the posterior distribution of the parameters of the local model were sent to the second node, it would not have much impact on the update of the prior distribution of the parameters of the federated model.
  • the first node does not upload the posterior distribution of the parameters of the local model, which can save bandwidth between nodes and improve communication efficiency between nodes.
  • the above describes in detail how the first node decides whether to send the results of the local training to the second node.
  • the following describes in detail how the first node decides whether to send partial results of the local training results to the second node. It should be noted that these two decisions can be independent of each other or can be combined with each other.
  • the first node may decide which of the local training results to feed back to the second node after deciding to feed back the local training results to the second node.
  • The first node may determine the uncertainty of the first parameter in the local model according to the posterior distribution of the first parameter, where the local model may include at least one parameter and the first parameter is any one of the at least one parameter; when the uncertainty of the first parameter satisfies the second preset condition, the first node sends the posterior distribution of the first parameter to the second node.
  • the uncertainty of the first parameter can be used to indicate the importance of the first parameter to the local model of the first node. If the uncertainty of the first parameter is high (for example, the distribution of the first parameter is relatively flat), the parameter usually does not have much influence on the final prediction or inference result of the local model. In this case, the first node may consider not sending the posterior distribution of the first parameter to the second node.
  • the uncertainty of the first parameter may be based on the mean value, variance of the posterior distribution of the first parameter, or a combined measure of the two.
  • the first node may compare the variance of the first parameter with a fixed threshold. When the variance is less than the fixed threshold, the first node sends the posterior distribution of the first parameter to the second node; when the variance is greater than or equal to the fixed threshold, the first node does not send the first parameter to the second node The posterior distribution.
  • Alternatively, the first node may first generate a random number according to the variance of the first parameter, and then compare the random number with a fixed threshold. When the random number is less than the fixed threshold, the first node sends the posterior distribution of the first parameter to the second node; when the random number is greater than or equal to the fixed threshold, the first node does not send the posterior distribution of the first parameter to the second node.
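  • The following sketch illustrates per-parameter selection under these rules, assuming the uncertainty of a parameter is measured by its posterior variance; the threshold value and the randomized variant are illustrative choices.

```python
import numpy as np

def select_parameters_to_upload(post_mu, post_var, var_threshold=0.05, randomized=False, rng=None):
    """Return the subset of local-model parameters whose posterior will be uploaded.
    Parameters with high variance (high uncertainty) are considered unimportant and skipped."""
    rng = rng or np.random.default_rng()
    if randomized:
        # draw one random value per parameter, scaled by its variance
        score = rng.random(post_var.shape) * post_var
    else:
        score = post_var
    mask = score < var_threshold
    return {"mu": post_mu[mask], "var": post_var[mask], "mask": mask}
```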
  • the embodiments of the present application do not limit the specific content of the second preset condition mentioned above, and can be set according to actual needs.
  • the second preset condition can be set according to the uncertainty of the first parameter, or can be set according to the order of the uncertainty of the first parameter among the uncertainties of all the parameters of the local model.
  • The first parameter mentioned above is any parameter in the local model, and the first node can process some or all of the parameters in the local model in a manner similar to the processing of the first parameter. If the first node processes all the parameters in this way, it can identify all parameters whose uncertainty does not meet the second preset condition; when the local training result is fed back to the second node, the posterior distributions of these parameters are not fed back.
  • the first node may send the overall distribution of the parameters of the local model to the second node, and may also send one or more sample values of the overall distribution of the parameters of the local model to the second node.
  • Correspondingly, the second node can estimate the overall distribution of a parameter according to the multiple received sampling values of the overall distribution of that parameter, and update the estimated result into the federated model as the prior distribution of the parameter.
  • the first node sends the overall distributed sampling value to the second node, which can improve the communication efficiency between nodes and reduce the communication bandwidth.
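  • A minimal sketch of how the second node might estimate a parameter's distribution from the uploaded sampling values, assuming a diagonal Gaussian estimate; the small variance floor is an illustrative safeguard, not part of the original method.

```python
import numpy as np

def estimate_prior_from_samples(samples):
    """samples: array of shape (num_samples, num_params) -- sampling values of the
    overall distribution of the local-model parameters uploaded by first nodes.
    Returns a diagonal Gaussian estimate used as the new prior of each parameter."""
    mu = samples.mean(axis=0)
    var = samples.var(axis=0) + 1e-8          # floor to avoid a degenerate variance
    return mu, var
```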
  • Optionally, the second node may perform the steps shown in FIG. 6. That is, the second node can select one or more first nodes from the candidate nodes according to certain rules, and send the prior distribution of the parameters of the federated model only to the selected first nodes, instead of sending it to the unselected nodes.
  • Federated learning usually includes multiple rounds of iteration. The at least one first node in Figure 4 can be a node participating in the current round of iteration, and the candidate node mentioned above can be a node that participated in the federated learning before the current round of iteration, for example a node that participated in the previous round of federated learning.
  • the second node can choose the same first node in different iteration rounds, or choose a different first node.
  • step S610 may randomly select the first node participating in the current iteration.
  • the second node may select the first node participating in the current iteration according to the evaluation information fed back by the candidate node.
  • the evaluation information can be used to indicate the matching degree between the prior distribution of the parameters of the federated model and the local training data of the candidate node.
  • the evaluation information can be used to indicate the degree of matching between the candidate node's posterior distribution trained according to the prior distribution of the parameters of the federated model and the candidate node's local training data.
  • the evaluation information may be used to indicate the degree of matching between the prior distribution of the parameters of the federation model and the posterior distribution obtained by training the candidate node according to the prior distribution of the parameters of the federation model.
  • the degree of matching between the prior distribution or the posterior distribution and the local training data can be evaluated using the value of the loss function obtained by the local model in the local test.
  • For example, if the capacity of the federated model is to be improved, the second node can choose candidate nodes with a lower matching degree to participate in federated learning; if the convergence speed of the federated model is to be accelerated, the second node can choose candidate nodes with a higher matching degree to participate in federated learning.
  • the second node may select at least one first node from the candidate nodes according to the difference between the historical posterior distribution of the candidate nodes and the prior distribution of the parameters of the federated model.
  • Similarly, if the capacity of the federated model is to be improved, the second node can choose candidate nodes with larger differences to participate in federated learning; if the convergence speed of the federated model is to be accelerated, the second node can choose candidate nodes with smaller differences to participate in federated learning.
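  • As an illustration of this node-selection step, the following sketch ranks candidate nodes by a matching score (for example, the negative local test loss) and picks either the best- or worst-matching candidates depending on the goal; the score definition and the goal names are assumptions made for illustration only.

```python
import numpy as np

def select_first_nodes(candidate_scores, num_select, goal="fast_convergence"):
    """candidate_scores: dict node_id -> matching degree between the federated prior
    and that candidate's local training data (e.g. negative local test loss).
    Returns the node ids chosen to participate in the current round."""
    ids = np.array(list(candidate_scores.keys()))
    scores = np.array([candidate_scores[i] for i in ids], dtype=float)
    order = np.argsort(scores)
    if goal == "fast_convergence":
        chosen = ids[order[::-1][:num_select]]   # highest matching degree first
    else:                                        # e.g. goal == "improve_capacity"
        chosen = ids[order[:num_select]]         # lowest matching degree first
    return list(chosen)
```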
  • Step S440 in FIG. 4 describes the process of updating the prior distribution of the parameters of the federated model by the second node.
  • the update process can also be understood as the process of optimizing the prior distribution of the parameters of the federation model by the second node, or the process of calculating the optimal solution of the prior distribution of the parameters of the federation model.
  • the process of updating the parameters of the federation model will be described in detail in conjunction with specific embodiments.
  • For each parameter, the second node can calculate the prior distribution of the parameter from the posterior distributions of that parameter reported by the first nodes, for example by finding the prior distribution that minimizes the mean value (or weighted average) of the differences between the prior distribution of the parameter and the posterior distributions of the parameter.
  • Alternatively, the second node can synthesize the histograms or probability density curves of the same parameter reported by different first nodes to obtain the prior distribution of this parameter.
  • Alternatively, the second node can estimate the probability distribution of the posterior distributions of the parameter according to the different posterior distributions received for the same parameter, and take the estimated probability distribution as the prior distribution of the parameter.
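  • One concrete instance of these aggregation ideas, assuming each posterior is a diagonal Gaussian: the Gaussian prior that minimizes the (weighted) average KL divergence from the posteriors is their moment-matched Gaussian, sketched below. This is one possible realization, not necessarily the update rule of the original formulas.

```python
import numpy as np

def aggregate_gaussian_posteriors(mus, variances, weights=None):
    """mus, variances: arrays of shape (K, num_params) -- posterior means/variances
    from K first nodes.  Returns the diagonal Gaussian prior that minimizes the
    (weighted) average KL(q_k || prior), i.e. moment matching of the mixture."""
    K = mus.shape[0]
    w = np.full(K, 1.0 / K) if weights is None else np.asarray(weights, dtype=float) / np.sum(weights)
    prior_mu = np.einsum("k,kp->p", w, mus)
    second_moment = np.einsum("k,kp->p", w, variances + mus ** 2)
    prior_var = np.maximum(second_moment - prior_mu ** 2, 1e-12)  # guard against round-off
    return prior_mu, prior_var
```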
  • If the prior distribution of the parameters of the federated model of the second node contains, or can be split into, multiple local prior distributions, and the local training process of a certain first node is based on only one of these local prior distributions, then the posterior distribution of the parameters of the local model of that first node can only be used to update the corresponding local prior distribution.
  • the structure of the federation model can also be adjusted.
  • For example, when a certain parameter of the federated model is a superposition of many distributions, a superposition of a smaller number of distributions may be used to approximate the current superposition of distributions of that parameter, so as to simplify the federated model.
  • Specifically, a component reduction technique can be used to approximate a superposition of a larger number of distributions by a superposition of a smaller number of distributions.
  • the second node can update the prior distribution of the federated model parameters to split the first parameter into multiple parameters.
  • this technique is referred to as a model splitting technique.
  • When the second node maintains multiple machine learning models, the second node can merge machine learning models with smaller differences between them, or generate new machine learning models from the existing machine learning models (for example, generating new models in a random way).
  • the second node can also initialize the federated model first.
  • the embodiment of the present application does not specifically limit the content of initialization.
  • the second node can set the network structure of the federation model.
  • the second node can also set initial values for the prior distribution of the parameters of the federation model.
  • the second node can set hyperparameters in the federated learning process.
  • In Example 1, the federated model maintained by the second node is a single neural network, and the prior distribution of the parameters of the federated model adopts a "distribution description" of the posterior distribution.
  • the first node directly uses the prior distribution of the parameters of the federation model as the prior distribution of the parameters of the local model for local training.
  • the prior distribution and the posterior distribution of the parameters of the local model correspond to the neural network of the same size, and in the local training process, the first node uses the Gaussian distribution as the likelihood function for Bayesian optimization.
  • Specifically, the prior distribution of the parameters of the federated model maintained by the second node adopts the Gaussian inverse gamma distribution as the "distribution description" of the posterior distribution, and the posterior distribution adopts the Gaussian distribution.
  • the Gaussian inverse gamma distribution can also be called normal inverse gamma (normal inverse gamma) distribution, which can be expressed by the following formula (1).
  • N-Γ⁻¹ represents the Gaussian inverse gamma distribution.
  • μ0, ν, α, and β are the four parameters of the Gaussian inverse gamma distribution. These four parameters determine the distribution of the mean μ and the variance σ² of the posterior distribution (Gaussian distribution).
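  • Formula (1) itself is not reproduced in this text; for reference, a standard textbook form of the normal inverse gamma density with parameters μ0, ν, α, β is shown below (the exact notation of the original may differ).

```latex
N\text{-}\Gamma^{-1}\!\left(\mu,\sigma^{2}\mid\mu_{0},\nu,\alpha,\beta\right)
  = \frac{\sqrt{\nu}}{\sigma\sqrt{2\pi}}\,
    \frac{\beta^{\alpha}}{\Gamma(\alpha)}
    \left(\frac{1}{\sigma^{2}}\right)^{\alpha+1}
    \exp\!\left(-\,\frac{2\beta+\nu\left(\mu-\mu_{0}\right)^{2}}{2\sigma^{2}}\right)
```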
  • the probability that the local training data is generated by the federated model can be expressed by formula (2):
  • K represents the number of first nodes participating in federated learning
  • k represents the k-th first node among the K first nodes.
  • D represents a complete data set composed of the local training data of the K first nodes
  • D k represents a data set composed of the local training data of the k-th first node.
  • θk represents the parameters of the local model of the k-th first node.
  • P(Dk | θk) represents the probability that the data set Dk appears given the parameters θk.
  • N(·) represents the Gaussian distribution; the mean μk and the variance σk² of the Gaussian distribution determine the distribution of θk.
  • Each factor of equation (2) represents the probability that the data set Dk of the k-th first node appears under the given parameters μ0, ν, α, β. Since it is assumed in advance that the first nodes are independent of each other, the probability that the data set D appears under the given parameters μ0, ν, α, β is the product of the probabilities of the individual data sets Dk.
  • the local training process can actually be an optimization process.
  • the optimization goal can be defined by formula (3):
  • The meaning of the optimization objective is: given the prior distribution μ0, ν, α, β of the parameters of the local model, find the optimal model parameters that maximize the value of formula (3).
  • The optimal model parameters obtained by the optimization can be used as the posterior distribution of the parameters of the local model.
  • The first term of formula (3) indicates the probability that the data set Dk composed of the local training data appears under the condition of the given model parameters; the optimization makes this probability as large as possible.
  • The second term of formula (3) indicates the probability that the model parameters appear under the condition of the given parameters μ0, ν, α, β; the optimization goal is to make this probability as large as possible as well.
  • This second term acts as a regular term: it makes the difference between the posterior distribution of the parameters of the local model and the prior distribution of the parameters of the local model as small as possible, so that the posterior distribution of the local model parameters does not deviate too far from the prior distribution of the federated model parameters. That is, it ensures that the federated learning process is a continuous learning process and that no model oscillation problems occur.
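  • Formula (3) is likewise not reproduced here; a plausible form consistent with the surrounding description (a likelihood term on Dk plus a prior term acting as the regular term) would be the following MAP-style objective, given only as an assumed illustration, not as the original formula.

```latex
% assumed illustration of an objective consistent with the description of formula (3)
\left(\mu_{k}^{*},\sigma_{k}^{*2}\right)
  = \arg\max_{\mu_{k},\,\sigma_{k}^{2}}
    \;\log p\!\left(D_{k}\mid\mu_{k},\sigma_{k}^{2}\right)
    \;+\;\log N\text{-}\Gamma^{-1}\!\left(\mu_{k},\sigma_{k}^{2}\mid\mu_{0},\nu,\alpha,\beta\right)
```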
  • the second node can update the prior distribution of the parameters of the federated model according to formula (4):
  • The second node can obtain the optimal solution of the prior distribution of the parameters of the federated model by maximizing formula (4), that is, the optimal solution of μ0, ν, α, β.
  • In Example 2, the federated model maintained by the second node is a single neural network (such as a Bayesian neural network).
  • A single parameter of the federated model may contain multiple distributions (such as a mixture of Gaussian distributions), and the prior distribution of the parameters of the federated model adopts a "point description" of the posterior distribution.
  • the first node directly uses the prior distribution of the parameters of the federation model as the prior distribution of the parameters of the local model for local training.
  • the prior distribution and the posterior distribution of the parameters of the local model correspond to the neural network of the same size, and in the local training process, the first node uses the Gaussian distribution as the likelihood function for Bayesian optimization.
  • the second node initializes a neural network as a federation model.
  • P(θ|α) represents the prior distribution of the parameters of the federated model, where θ represents the model parameters and α represents the prior value describing the distribution of θ.
  • The first node selected by the second node obtains the prior distribution P(θ|α) of the parameters of the federated model from the second node.
  • The first node uses P(θ|α) as the prior distribution of the parameters of the local model for local training.
  • the training process of the posterior distribution of the parameters of the local model is an optimization process.
  • the optimization goal can be defined by formula (5):
  • qk(θ) represents the posterior distribution of the parameter θ of the local model. If the posterior distribution of the parameters of the local model adopts a parametric description (rather than a non-parametric description such as a histogram or probability density curve), the posterior distribution of the parameters of the local model can also be expressed in an explicitly parameterized form of qk(θ).
  • the second node can update the prior distribution of the parameters of the federated model according to formula (6):
  • P(α) in formula (6) represents the distribution of α, which can be set manually in advance.
  • In Example 3, the federated model maintained by the second node includes multiple neural networks.
  • the local model of the first node is a single neural network.
  • the second node initializes the prior distribution of the parameters of the federation model.
  • the prior distribution of the parameters of the federated model includes N local prior distributions (N is an integer greater than 1).
  • N local prior distributions correspond to N neural networks one to one.
  • the N local prior distributions are respectively the prior distributions of the parameters of the N neural networks.
  • the structures of the N neural networks can be the same or different.
  • For example, the first neural network among the N neural networks has 5 fully connected layers with 50 neurons in each layer; the second neural network also has 5 fully connected layers with 50 neurons in each layer; the third neural network has 4 fully connected layers with 40 neurons in each layer; and the fourth neural network has 4 convolutional layers and 1 fully connected layer.
  • the second node may send the N local prior distributions to multiple first nodes.
  • the second node can send different local prior distributions to different first nodes.
  • the second node can send the local prior distribution corresponding to the first neural network to the first node 1, 2, 3; send the local prior distribution corresponding to the second neural network to the first node 4, 5, 6; Send the local prior distribution corresponding to the third neural network to the first node 7, 8, 9; send the local prior distribution corresponding to the fourth neural network to the first node 9, 10, 11.
  • the second node may also send the same local prior distribution to different first nodes.
  • For a first node that receives the local prior distribution corresponding to the i-th neural network, this local prior distribution can be used as the prior distribution of the local model parameters; the local training data is then used for one or more rounds of training to obtain the posterior distribution of the local model parameters.
  • the first node can use formula (7) as the loss function of the local training process:
  • the first node sends the posterior distribution of the local model parameters obtained through training to the second node.
  • the second node updates the prior distribution of the parameters of the federation model in a weighted average manner according to formula (8):
  • Ni represents the number of posterior distributions of local model parameters obtained after local training based on the local prior distribution corresponding to the i-th neural network. The weight of each such posterior distribution in the weighted average can be determined by the ratio of the amount of local training data used to obtain that posterior distribution to the total amount of local training data used to obtain the Ni posterior distributions.
  • Alternatively, the first node selected by the second node can obtain the prior distribution of the parameters of the federated model from the second node. Then, the first node can use the local training data to test the matching degree between each local prior distribution in the prior distribution of the parameters of the federated model and the local training data, and select the local prior distribution with the highest matching degree.
  • After determining the local prior distribution that best matches the local training data, the first node can use this local prior distribution as the initial value of the prior distribution of the parameters of the local model. Then, the first node can use the local training data to perform one or more rounds of training to obtain the posterior distribution of the parameters of the local model.
  • The training process can use a loss function defined on the local training data of the selected (i*-th) first node.
  • Alternatively, the first node can add the selected local prior distribution to the loss function of the local training process as a regularization term, and then use the local training data to train on the basis of this loss function to obtain the posterior distribution of the parameters of the local model.
  • the second node can update the prior distribution of the parameters of each neural network in the federated model according to formula (9):
  • Ni represents the number of posterior distributions of local model parameters obtained after local training based on the local prior distribution corresponding to the i-th neural network. The weight of the n-th such posterior distribution can be determined by the ratio of the amount of local training data used to obtain it to the total amount of local training data used to obtain the Ni posterior distributions, as sketched below.
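  • A minimal sketch of this per-sub-model update, assuming diagonal Gaussian posteriors and using a simple data-amount-weighted average of the distribution parameters as a stand-in for formulas (8) and (9), which are not reproduced in this text; the field names are illustrative.

```python
import numpy as np

def update_submodel_priors(uploads):
    """uploads: list of dicts {"model_idx": i, "mu": ..., "var": ..., "n_data": ...},
    one per first node that trained on the local prior of the i-th neural network.
    Returns {i: (prior_mu, prior_var)} -- a data-amount-weighted average per sub-model."""
    by_model = {}
    for u in uploads:
        by_model.setdefault(u["model_idx"], []).append(u)
    priors = {}
    for i, group in by_model.items():
        weights = np.array([u["n_data"] for u in group], dtype=float)
        weights /= weights.sum()
        mu = sum(w * u["mu"] for w, u in zip(weights, group))
        var = sum(w * u["var"] for w, u in zip(weights, group))
        priors[i] = (mu, var)
    return priors
```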
  • In Example 4, the federated model maintained by the second node includes multiple neural networks (such as multiple Bayesian neural networks), and the parameters of each neural network are described by Gaussian distributions.
  • The prior distribution of the federated model parameters includes multiple local prior distributions corresponding one-to-one to the multiple neural networks, and each local prior distribution adopts a "point description" of the posterior distribution.
  • the first node uses a certain local prior distribution in the prior distribution of the federated model parameters as the prior distribution of the local model parameters to perform local training. For example, the first node selects a local prior distribution that best matches the local training data from the multiple local prior distributions maintained by the second node, and uses the local prior distribution as the prior distribution of the local model parameters.
  • the prior distribution and the posterior distribution of the local model parameters correspond to the neural network of the same size, and in the local training process, the first node uses the Gaussian distribution as the likelihood function for Bayesian optimization.
  • the second node initializes the prior distribution of the parameters of the federation model.
  • the prior distribution of the parameters of the federated model includes N local prior distributions (N is an integer greater than 1).
  • N local prior distributions correspond to N neural networks one to one.
  • For example, each local prior distribution can be expressed as the [mean, variance] of a Gaussian distribution.
  • the second node sends the N local prior distributions to different first nodes. If data privacy protection is considered, the second node can also send different local prior distributions to the same first node.
  • the local training process is essentially an optimization process, and formula (10) can be used as the optimization goal:
  • The first node can perform this optimization by reparameterization to obtain the posterior distribution of the parameters of the local model.
  • After training, the first node sends the posterior distribution of the parameters of the trained local model to the second node.
  • the second node can use formula (11) to update (or optimize) the prior distribution of the parameters of the federated model according to the posterior distribution of the parameters of the local model provided by each first node:
  • Alternatively, the first node selected by the second node can obtain the prior distribution of the parameters of the federated model from the second node. Then, the first node can test the matching degree between each local prior distribution in the prior distribution of the parameters of the federated model and the local training data, and select the local prior distribution with the highest matching degree.
  • The first node can then use the selected local prior distribution as the prior distribution of the parameters of the local model, and use the local training data to train the posterior distribution of the parameters of the local model.
  • the local training process can use formula (12) as the optimization goal:
  • the first node can be optimized by reparameterization to determine the optimal solution of the posterior distribution of the parameters of the local model.
  • After the first nodes participating in the federated learning send the posterior distributions of the parameters of their local models to the second node, the second node can update each neural network according to formula (13):
  • In Example 5, the federated model maintained by the second node is a single neural network.
  • Each parameter of the neural network is described by a distribution.
  • The prior distribution of the parameters of the neural network adopts a "point description" of the posterior distribution.
  • the federated model is a Bayesian neural network, and each parameter of the Bayesian neural network is described by a Gaussian distribution.
  • the first node uses the local prior distribution in the prior distribution of the federated model parameters as the prior distribution of the local model parameters.
  • the local model of the first node has the same size as the federated model, and the posterior distribution of the local model parameters adopts the delta distribution.
  • the second node initializes a neural network as a federation model.
  • P(θ|α) represents the prior distribution of the parameters of the federated model, where θ represents the model parameters and α represents the prior value describing the distribution of θ.
  • The first node selected by the second node obtains the prior distribution P(θ|α) of the parameters of the federated model from the second node.
  • The first node uses P(θ|α) as the prior distribution of the parameters of the local model for local training.
  • the training process of the posterior distribution of the parameters of the local model is an optimization process.
  • the optimization goal can be defined by formula (14):
  • θk represents the parameters of the local model.
  • P(Dk | θk) represents the likelihood function corresponding to the given model parameters θk.
  • The gradient descent method can be used to train the local model parameters θk, thereby obtaining the posterior distribution δ(θk).
  • δ(θk) indicates that the posterior distribution is a delta distribution.
  • the second node can update each neural network according to formula (15):
  • P(α) in formula (15) represents the distribution of α, which can be set manually in advance.
  • Example 6 aims to provide a solution for measuring the importance of each first node, so that the first nodes participating in the federated learning can be selected according to their importance during the federated learning process, making the stability of the entire training process of the federated learning as good as possible.
  • For example, the weight of a first node can be set according to the variance of the posterior distribution of the parameters of the local model of that first node; the first nodes participating in the federated learning can then be selected according to their weights, or the weights can be used to decide whether a certain first node needs to upload the posterior distribution of the parameters of its local model.
  • the corresponding weight r(D k ) can be set for different first nodes.
  • D k represents the local training data of the k-th first node. Therefore, the weight of the first node can also be understood as a measure of the importance of the local training data of the first node.
  • the second node can minimize the variance of the posterior distribution of the local model parameters fed back by each first node according to formula (16):
  • Pdata(Dk) in formula (16) represents the probability of Dk appearing in the data set formed by the local training data of all the first nodes. Considering that the sum of the weights should be 1, formula (16) can be converted into the following formula (17):
  • From this, the relationship between the weight r(Dk) of a first node and the posterior distribution of its local model parameters can be obtained. If the posterior distribution of the local model adopts a Gaussian distribution, the weight of the first node can be expressed in terms of the variances of the posterior distributions of its parameters (j is the number of parameters in the local model).
  • the second node can select the first node that needs to upload the posterior distribution of the local model parameters according to r(D k ).
  • the first node can also make a decision based on the r(D k ) whether it needs to send the local training result to the second node. For example, r(D k ) can be compared with a fixed threshold to determine whether the first node needs to send the local training result to the second node.
  • the probability of the first node being selected can be calculated according to r(D k ), and then whether to send the local training result to the second node is determined according to the probability.
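  • Since formulas (16) and (17) are not reproduced in this text, the following sketch only illustrates the idea: a node weight derived from the posterior variances of its local model parameters, and a probabilistic upload decision based on that weight. The proportionality of the weight to the total posterior variance is an assumption made for illustration, not the relationship derived in the original formulas.

```python
import numpy as np

def node_weights_from_variance(post_vars):
    """post_vars: dict node_id -> array of posterior variances of that node's local model.
    Illustrative assumption: a node's unnormalized weight is the total posterior variance
    of its parameters; weights are then normalized to sum to 1."""
    raw = {k: float(np.sum(v)) for k, v in post_vars.items()}
    total = sum(raw.values())
    return {k: r / total for k, r in raw.items()}

def upload_decision(weight, rng=None):
    """Probabilistic decision: upload the local training result with a probability
    proportional to the node's weight (capped at 1)."""
    rng = rng or np.random.default_rng()
    return rng.random() < min(1.0, weight)
```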
  • Example 7 aims to provide a scheme for simplifying the federated model, which approximates a superposition of a larger number of distributions with a superposition of a smaller number of distributions when a certain parameter of the federated model is a superposition of many distributions.
  • the second node updates the prior distribution of the federated model parameters according to the following formula (18):
  • D KL represents the KL divergence
  • P(θ|α) represents the prior distribution of the parameters of the federated model.
  • the prior distribution of the federated model parameters in formula (19) obeys the mixed Gaussian distribution, where each parameter contains the mixed Gaussian distribution of K components. It can be seen that the scale of the federated model parameters is K times larger than that of the local model parameters, which will cause greater communication overhead.
  • formula (20) can be used to optimize the parameters of the federation model.
  • the parameters of the federation model are defined as a mixture of Gaussian distributions containing at most M components (M ⁇ K):
  • πm represents the proportion of the m-th component among the M components.
  • μm and Σm are respectively the mean and the covariance matrix of the corresponding Gaussian component.
  • By adjusting the Dirichlet prior placed on the component proportions, the optimized πm can be made sparse (that is, it contains more zero elements), so that the final mixture distribution of each parameter contains at most M components. It can be seen that, by adjusting the parameters of the Dirichlet distribution, a compromise can be made between the accuracy and the complexity of the federated model (that is, how many components each parameter contains, which determines the communication overhead of the federated learning process).
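  • As a generic illustration of component reduction (not the Dirichlet-based optimization of formula (20)), the following sketch greedily merges the two closest components of a one-dimensional Gaussian mixture by moment matching until at most M components remain.

```python
import numpy as np

def reduce_mixture(weights, mus, variances, max_components):
    """Greedy component reduction for a 1-D Gaussian mixture (one federated-model parameter).
    Repeatedly merges the two components with the closest means by moment matching
    until at most max_components remain."""
    weights, mus, variances = list(weights), list(mus), list(variances)
    while len(weights) > max_components:
        # find the pair of components with the closest means
        i, j = min(((a, b) for a in range(len(mus)) for b in range(a + 1, len(mus))),
                   key=lambda p: abs(mus[p[0]] - mus[p[1]]))
        w = weights[i] + weights[j]
        mu = (weights[i] * mus[i] + weights[j] * mus[j]) / w
        var = (weights[i] * (variances[i] + (mus[i] - mu) ** 2)
               + weights[j] * (variances[j] + (mus[j] - mu) ** 2)) / w
        for idx in sorted((i, j), reverse=True):
            del weights[idx]; del mus[idx]; del variances[idx]
        weights.append(w); mus.append(mu); variances.append(var)
    return np.array(weights), np.array(mus), np.array(variances)
```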
  • Example 8 aims to give a specific posterior distribution.
  • the posterior distribution of the local model parameters can obey the distribution shown in formula (21):
  • In formula (21), the left-hand side is the posterior distribution of the local model parameters, μ0 is the mean value of the prior distribution, and μk is the mean value of the posterior distribution.
  • It can be seen from formula (21) that when the mean value of the posterior distribution of the local model parameters is far from the mean value of the prior distribution, the variance of the posterior distribution is also relatively large. In this way, using the distribution form shown in formula (21) makes the posterior distribution and the prior distribution of the local model parameters overlap as much as possible, so that the local training process becomes more reliable.
  • Fig. 7 is a schematic structural diagram of a federated learning device provided by an embodiment of the present application.
  • the device 700 for federated learning corresponds to the first node mentioned above, and the device 700 is in communication connection with the second node.
  • the device 700 includes a receiving module 701 and a training module 702.
  • the receiving module 701 may be used to receive the prior distribution of the parameters of the federated model from the second node, where the federated model is a machine learning model whose parameters follow the distribution.
  • the training module 702 may be used to train to obtain the posterior distribution of the parameters of the local model of the device according to the prior distribution of the parameters of the federated model and the local training data of the device.
  • Optionally, the apparatus 700 may further include: a first determining module, configured to determine the uncertainty of the local model according to the posterior distribution of the parameters of the local model; and a first sending module, configured to send the posterior distribution of the parameters of the local model to the second node when the uncertainty of the local model satisfies the first preset condition.
  • Optionally, the apparatus 700 may further include: a second determining module, configured to determine the uncertainty of the first parameter according to the posterior distribution of the first parameter of the local model, where the local model includes at least one parameter and the first parameter is any one of the at least one parameter; and a second sending module, configured to send the posterior distribution of the first parameter to the second node when the uncertainty of the first parameter satisfies the second preset condition.
  • Optionally, the apparatus 700 may further include: a third determining module, configured to determine the uncertainty of the local model according to the posterior distribution of the parameters of the local model and, when the uncertainty of the local model satisfies the first preset condition, to determine the uncertainty of the first parameter according to the posterior distribution of the first parameter of the local model, where the local model includes at least one parameter and the first parameter is any one of the at least one parameter; and a third sending module, configured to send the posterior distribution of the first parameter to the second node when the uncertainty of the first parameter satisfies the second preset condition.
  • Fig. 8 is a schematic structural diagram of a federated learning device provided by another embodiment of the present application.
  • the device 800 for federated learning corresponds to the second node above, and the device 800 is in communication connection with the first node.
  • the device 800 includes a receiving module 801 and an updating module 802.
  • the receiving module 801 may be used to receive the posterior distribution of the parameters of the local model of at least one first node.
  • the update module 802 may be configured to update the prior distribution of the parameters of the federated model according to the posterior distribution of the parameters of the local model of the at least one first node, where the federated model is a machine learning model whose parameters follow the distribution.
  • Optionally, the device 800 may further include: a selection module, configured to select at least one first node from candidate nodes before the device receives the posterior distribution of the parameters of the local model of the at least one first node, where the federated learning includes multiple rounds of iteration, the at least one first node is a node participating in the current round of iteration, and the candidate nodes are nodes that participated in the federated learning before the current round of iteration; and a first sending module, configured to send the prior distribution of the parameters of the federated model to the at least one first node before the device receives the posterior distribution of the parameters of the local model of the at least one first node.
  • the selection module is configured to select at least one first node from the candidate nodes according to the evaluation information sent by the candidate nodes to the device, where the evaluation information is used to indicate the prior distribution of the parameters of the federation model and The matching degree of the local training data of the candidate node, or the evaluation information is used to indicate the matching degree of the posterior distribution obtained by the candidate node training according to the prior distribution of the parameters of the federated model and the local training data of the candidate node, or the evaluation information is used to indicate The degree of matching between the prior distribution of the parameters of the federation model and the posterior distribution obtained by training the candidate nodes according to the prior distribution of the parameters of the federation model.
  • the selection module is configured to select at least one first node from the candidate nodes according to the difference between the historical posterior distribution of the candidate node and the prior distribution of the parameters of the federation model, wherein the historical posterior distribution It is the posterior distribution of the parameters of the local model obtained by the candidate node before the current iteration.
  • the local model does not include parameters whose uncertainty does not meet a preset condition.
  • Fig. 9 is a schematic diagram of the hardware structure of a federated learning device provided by an embodiment of the present application.
  • the device 900 for federal learning shown in FIG. 9 includes a memory 901, a processor 902, a communication interface 903, and a bus 904.
  • the memory 901, the processor 902, and the communication interface 903 implement communication connections between each other through the bus 904.
  • the memory 901 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 901 may store a program. When the program stored in the memory 901 is executed by the processor 902, the processor 902 and the communication interface 903 are used to execute each step of the federated learning method in the embodiment of the present application.
  • The processor 902 may be a general-purpose CPU, a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits for executing related programs, so as to realize the functions required by the modules in the federated learning apparatus of the embodiment of the present application, or to execute the federated learning method of the method embodiment of the present application.
  • the processor 902 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the federated learning method of the present application can be completed by the integrated logic circuit of hardware in the processor 902 or instructions in the form of software.
  • The above-mentioned processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • The storage medium is located in the memory 901, and the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions required by the modules included in the federated learning apparatus of the embodiment of the present application, or executes the federated learning method of the method embodiment of the present application.
  • the communication interface 903 uses a transceiving device such as but not limited to a transceiver to implement communication between the device 900 and other devices or a communication network.
  • the bus 904 may include a path for transferring information between various components of the device 900 (for example, the memory 901, the processor 902, and the communication interface 903).
  • the receiving module 701 in the federated learning device 700 is equivalent to the communication interface 903 in the federated learning device 900, and the training module 702 may be equivalent to the processor 902.
  • the receiving module 801 in the device 800 for federated learning is equivalent to the communication interface 903 in the device 900 for federated learning
  • the update module 802 may be equivalent to the processor 902.
  • Although the device 900 shown in FIG. 9 only shows a memory, a processor, and a communication interface, in a specific implementation process those skilled in the art should understand that the device 900 also includes other devices necessary for normal operation.
  • the apparatus 900 may also include hardware devices that implement other additional functions.
  • the device 900 may also include only the components necessary to implement the embodiments of the present application, and does not necessarily include all the components shown in FIG. 9.
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read only memory (read only memory, ROM), random access memory (random access memory, RAM), magnetic disk or optical disk and other media that can store program codes.


Abstract

一种联邦学习的方法、装置和芯片,涉及人工智能领域。该方法适于处理参数服从分布的机器学习模型。该方法包括:第二节点向至少一个第一节点发送联邦模型的参数的先验分布(S410)。在收到联邦模型的参数的先验分布之后,该至少一个第一节点根据联邦模型的参数的先验分布和第一节点的本地训练数据,训练得到第一节点的本地模型的参数的后验分布(S420)。本地训练结束后,该至少一个第一节点向第二节点反馈本地模型的参数的后验分布(S430),以便第二节点根据该至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新(S440)。通过节点之间交互先验分布和后验分布,实现了参数服从分布的机器学习模型的联邦学习,从而有助于减少联邦学习的训练时间和节点之间的通信开销。

Description

联邦学习的方法、装置和芯片 技术领域
本申请涉及人工智能领域,具体涉及一种联邦学习的方法、装置和芯片。
背景技术
随着用户对个人隐私数据的保护意愿日益提升,数据拥有者之间的用户数据无法互通,形成了大大小小的“数据孤岛”。“数据孤岛”对基于海量数据的人工智能(artificial intelligence,AI)提出了新的挑战,即在没有权限获得足够多的训练数据的情况下,如何对机器学习模型进行训练?
针对“数据孤岛”的存在,联邦学习(federated learning)被提出。但是,传统的联邦学习仅能用于训练参数为固定值的机器学习模型,导致联邦学习的训练时间较长、通信开销较大。
发明内容
本申请提供一种联邦学习的方法和装置,能够支持参数服从分布的机器学习模型的联邦学习,从而减少联邦学习的训练时间和通信开销。
第一方面,提供一种联邦学习的方法,包括:第一节点从第二节点接收联邦模型的参数的先验分布,其中所述联邦模型为参数服从分布的机器学习模型;所述第一节点根据所述联邦模型的参数的先验分布和所述第一节点的本地训练数据,训练得到所述第一节点的本地模型的参数的后验分布。
通过在节点之间交互模型参数的先验分布和后验分布,实现了参数服从分布的机器学习模型的联邦学习。参数服从分布的机器学习模型能够预先给出参数的各种取值的可能性,而参数的各种取值的可能性能够表征机器学习模型的各种可能的改进方向之间的优劣。因此,对参数服从分布的机器学习模型进行联邦学习,有助于参与联邦学习的节点找到机器学习模型的较优的改进方向,从而减少联邦学习的训练时间和节点之间的通信开销。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:所述第一节点根据所述本地模型的参数的后验分布,确定本地模型的不确定度;当所述本地模型的不确定度满足第一预设条件时,所述第一节点向所述第二节点发送所述本地模型的参数的后验分布。
本地模型的不确定度能够很好地衡量本地训练数据与联邦模型的匹配度,进而可以表明第一节点对于联邦学习的重要性。因此,以本地模型的不确定度作为衡量第一节点是否向第二节点反馈训练结果的指标,能够使得联邦模型的训练过程更加可控。例如,当希望联邦模型快速收敛,则可以禁止本地模型的不确定度较高的第一节点反馈本地训 练结果;又如,当希望提升联邦模型的容量时,则可以要求本地模型的不确定度较高的第一节点反馈本地训练结果。此外,不向第二节点发送不确定度不满足第一预设条件的本地模型,可以降低节点之间的通信开销。
结合第一方面,在第一方面的某些实现方式中,所述本地模型的不确定度是基于以下信息中的至少一种度量的:所述本地模型的参数的后验分布的方差,所述本地模型的参数的后验分布的收敛速度,或者所述本地模型的参数的后验分布的推断准确率。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:所述第一节点根据所述本地模型的第一参数的后验分布,确定所述第一参数的不确定度,其中所述本地模型的参数包括至少一个参数,所述第一参数为所述至少一个参数中的任意一个;当所述第一参数的不确定度满足第二预设条件时,所述第一节点向所述第二节点发送所述第一参数的后验分布。
本地模型中的参数的不确定度可以很好地衡量该参数对该本地模型的重要性。通过计算参数的不确定度,第一节点可以仅上传对本地模型重要的参数的训练结果,这样可以降低节点间的通信开销,提高通信效率。
结合第一方面,在第一方面的某些实现方式中,所述第一参数的不确定度是基于所述第一参数的后验分布的方差度量的。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:所述第一节点根据所述本地模型的参数的后验分布,确定所述本地模型的不确定度;当所述本地模型的不确定度满足第一预设条件时,所述第一节点根据所述本地模型的第一参数的后验分布,确定所述第一参数的不确定度,其中,所述本地模型包括至少一个参数,所述第一参数为所述至少一个参数中的任意一个;当所述第一参数的不确定度满足第二预设条件时,所述第一节点向所述第二节点发送所述第一参数的后验分布。
第一节点根据本地模型的不确定度以及本地模型中的参数的不确定度有选择地向第二节点发送本地训练得到的全部或部分结果,可以降低节点间的通信开销,提高通信效率。
结合第一方面,在第一方面的某些实现方式中,所述联邦模型的参数的先验分布包括多个局部先验分布,所述多个局部先验分布一一对应多个贝叶斯模型,所述第一节点根据所述联邦模型的参数的先验分布和所述第一节点的本地训练数据,训练得到所述第一节点的本地模型的参数的后验分布,包括:所述第一节点根据所述多个局部先验分布与所述本地训练数据的匹配度,确定所述第一节点的本地模型的参数的先验分布;所述第一节点根据所述本地模型的参数的先验分布和所述本地训练数据,训练得到所述本地模型的参数的后验分布。
可选地,该多个局部先验分布可以隐含在联邦模型的参数的先验分布中,换句话说,所述联邦模型的参数的先验分布可以按照一定的方式分解成多个局部先验分布,如可以对联邦模型的参数的先验分布进行随机采样,从而将联邦模型的参数的先验分布分解成多个局部先验分布。
第二节点维护较大的、包含多个局部先验分布的联邦模型,第一节点从中选取与本地训练数据匹配的局部先验分布进行本地训练,这样可以加快本地训练过程的收敛速度。
结合第一方面,在第一方面的某些实现方式中,所述联邦学习包括多轮迭代,所述 本地模型的参数的后验分布为经过本轮迭代得到的本地模型的参数的后验分布,所述第一节点根据所述多个局部先验分布与所述本地训练数据的匹配度,确定所述第一节点的本地模型的参数的先验分布,包括:所述第一节点根据所述多个局部先验分布与历史后验分布之间的差异,确定所述第一节点的本地模型的参数的先验分布,其中所述历史后验分布为所述第一节点在所述本轮迭代之前得到的本地模型的参数的后验分布。
结合第一方面,在第一方面的某些实现方式中,所述本地模型的参数的先验分布为所述多个局部先验分布中的与所述历史后验分布差异最小的先验分布;或者,所述本地模型的参数的先验分布为所述多个局部先验分布的加权和,其中所述多个局部先验分布在所述加权和中分别所占的权重由所述多个局部先验分布与所述历史后验分布之间的差异确定。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:所述第一节点向所述第二节点发送所述本地模型的参数的后验分布。
结合第一方面,在第一方面的某些实现方式中,所述机器学习模型为神经网络。
结合第一方面,在第一方面的某些实现方式中,所述联邦模型为贝叶斯神经网络。
结合第一方面,在第一方面的某些实现方式中,所述联邦模型的参数为随机变量。
结合第一方面,在第一方面的某些实现方式中,所述本地模型为神经网络。
结合第一方面,在第一方面的某些实现方式中,所述本地模型为贝叶斯神经网络。
结合第一方面,在第一方面的某些实现方式中,所述本地模型的参数为随机变量。
结合第一方面,在第一方面的某些实现方式中,所述联邦模型的参数的先验分布为所述联邦模型的参数的概率分布,或者为所述联邦模型的参数的概率分布的概率分布。
结合第一方面,在第一方面的某些实现方式中,所述第一节点和所述第二节点分别为网络中的客户端和服务器。
第二方面,提供一种联邦学习的方法,包括:第二节点接收至少一个第一节点的本地模型的参数的后验分布;所述第二节点根据所述至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新,其中所述联邦模型为参数服从分布的机器学习模型。
通过在节点之间交互模型参数的先验分布和后验分布,实现了参数服从分布的机器学习模型的联邦学习。参数服从分布的机器学习模型能够预先给出参数的各种取值的可能性,而参数的各种取值的可能性能够表征机器学习模型的各种可能的改进方向之间的优劣。因此,对参数服从分布的机器学习模型进行联邦学习,有助于参与联邦学习的节点找到机器学习模型的较优的改进方向,从而减少联邦学习的训练时间和节点之间的通信开销。
结合第二方面,在第二方面的某些实现方式中,在所述第二节点接收至少一个第一节点的本地模型的参数的后验分布之前,所述方法还包括:所述第二节点从候选节点中选取所述至少一个第一节点,所述联邦学习包括多轮迭代,所述至少一个第一节点为参与本轮迭代的节点,所述候选节点为在所述本轮迭代之前参与所述联邦学习的节点;所述第二节点向所述至少一个第一节点发送所述联邦模型的参数的先验分布。
第二节点从候选节点中选择参与本轮训练的第一节点,能够使得联邦学习的训练过程更有针对性,也更加灵活。
结合第二方面,在第二方面的某些实现方式中,所述第二节点从候选节点中选取所述至少一个第一节点,包括:所述第二节点根据所述候选节点向所述第二节点发送的评价信息,从所述候选节点中选取所述至少一个第一节点,其中所述评价信息用于表示所述联邦模型的参数的先验分布与所述候选节点的本地训练数据的匹配度,或者所述评价信息用于表示所述候选节点根据所述联邦模型的参数的先验分布训练得到的后验分布与所述候选节点的本地训练数据的匹配度,或者所述评价信息用于表示所述联邦模型的参数的先验分布与所述候选节点根据所述联邦模型的参数的先验分布训练得到的后验分布的匹配度。
通过候选节点反馈的评价信息,第二节点能够准确掌握候选节点的本地模型(或本地训练数据)与联邦模型的匹配度,从而可以根据实际需要,更好地对参与联邦学习的第一节点进行选择。
结合第二方面,在第二方面的某些实现方式中,所述第二节点从候选节点中选取所述至少一个第一节点,包括:所述第二节点根据所述候选节点的历史后验分布与所述联邦模型的参数的先验分布的差异,从所述候选节点中选取所述至少一个第一节点,其中所述历史后验分布为所述候选节点在所述本轮迭代之前得到的本地模型的参数的后验分布。
第二节点通过计算候选节点的历史后验分布与联邦模型的参数的先验分布的差异,能够掌握候选节点的本地模型(或本地训练数据)与联邦模型之间的匹配度,从而可以根据实际需要,更好地对参与联邦学习的第一节点进行选择。
结合第二方面,在第二方面的某些实现方式中,所述本地模型不包含不确定度不满足预设条件的参数。
本地模型中的参数的不确定度可以很好地衡量该参数对该本地模型的重要性。节点之间根据参数的不确定度,有选择性地交互重要的参数,能够降低节点间的通信开销,提高通信效率。
结合第二方面,在第二方面的某些实现方式中,所述至少一个第一节点包括多个第一节点,且所述多个第一节点的本地模型的参数的后验分布均包括第一参数的后验分布,所述第二节点根据所述至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新,包括:如果所述多个第一节点的所述第一参数的后验分布之间的差异大于预设阈值,所述第二节点对所述联邦模型的参数的先验分布进行更新,以将所述第一参数拆分成多个参数。
结合第二方面,在第二方面的某些实现方式中,所述联邦模型的参数的先验分布包括多个局部先验分布,所述多个局部先验分布一一对应多个贝叶斯模型。
第二节点维护较大的、包含多个局部先验分布的联邦模型,使得第一节点可以根据自身的情况选择匹配的局部先验分布,有助于加快第一节点的本地训练过程的收敛速度。
结合第二方面,在第二方面的某些实现方式中,所述机器学习模型为神经网络。
结合第二方面,在第二方面的某些实现方式中,所述联邦模型为贝叶斯神经网络。
结合第二方面,在第二方面的某些实现方式中,所述联邦模型的参数为随机变量。
结合第二方面,在第二方面的某些实现方式中,所述本地模型为神经网络。
结合第二方面,在第二方面的某些实现方式中,所述本地模型为贝叶斯神经网络。
结合第二方面,在第二方面的某些实现方式中,所述本地模型的参数为随机变量。
结合第二方面,在第二方面的某些实现方式中,所述联邦模型的参数的先验分布为所述联邦模型的参数的概率分布,或者为所述联邦模型的参数的概率分布的概率分布。
结合第二方面,在第二方面的某些实现方式中,所述第一节点和所述第二节点分别为网络中的客户端和服务器。
第三方面,提供一种联邦学习的方法,包括:第一节点从第二节点接收联邦模型,所述联邦模型包括多个机器学习模型(如多个神经网络);所述第一节点从所述多个机器学习模型中选取目标机器学习模型;所述第一节点根据所述目标机器学习模型和所述第一节点的本地训练数据,训练所述第一节点的本地模型。
在第二节点处维护多个机器学习模型,第一节点能够根据自身情况从中选择一个机器学习模型进行本地训练,这样有助于缩短第一节点本地计算的耗时,提升本地的计算效率。
结合第三方面,在第三方面的某些实现方式中,所述第一节点从所述多个机器学习模型中选取目标机器学习模型,包括:所述第一节点根据所述多个机器学习模型与所述本地训练数据的匹配度,从所述多个模型中选取所述目标机器学习模型。
第一节点选择与本地训练数据相匹配的机器学习模型进行本地训练,能够提升本地训练的训练效率。
第四方面,提供一种联邦学习的方法,包括:所述第二节点向第一节点发送联邦模型,联邦模型包括多个机器学习模型(如多个神经网络);第二节点接收第一节点发送的与所述多个机器学习模型中的目标机器学习模型对应的本地模型;所述第二节点根据所述本地模型优化所述目标机器学习模型。
在第二节点处维护多个机器学习模型,第一节点能够根据自身情况从中选择一个机器学习模型进行本地训练,这样有助于缩短第一节点本地计算的耗时,提升本地的计算效率。
第五方面,提供一种联邦学习的装置,该装置包括用于执行第一方面至第四方面中任意一个方面的方法的模块。
第六方面,提供一种联邦学习的装置,该装置包括:存储器,用于存储程序;处理器,用于执行存储器存储的程序,当存储器存储的程序被执行时,处理器用于执行第一方面至第四方面中任意一个方面的方法。
第七方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面至第四方面中任意一个方面的方法。
第八方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行第一方面至第四方面中任意一个方面的方法。
第九方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行第一方面至第四方面中任意一个方面的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面中的方法。
第十方面,提供一种电子设备,该电子设备包括上述第五方面至第六方面中的任意 一个方面中的联邦学习的装置。
附图说明
图1为联邦学习的应用场景的示例图。
图2为联邦学习的流程图。
图3为本申请实施例提供的一种芯片硬件结构图。
图4为本申请实施例提供的联邦学习的方法的示意性流程图。
图5为图4中的步骤S420的一种可能的实现方式的示意性流程图。
图6为本申请实施例提供的参与联邦学习的第一节点的选择方式的示意性流程图。
图7是本申请一个实施例提供的联邦学习的装置的结构示意图。
图8是本申请另一实施例提供的联邦学习的装置的结构示意图。
图9是本申请又一实施例提供的联邦学习的装置的结构示意图。
具体实施方式
为了便于理解,先结合图1和图2,对联邦学习的场景和过程进行示例性说明。
参见图1,联邦学习的场景中可以包括多个第一节点102和第二节点105。第一节点102和第二节点105可以是支持数据传输的任意节点(如网络节点)。例如,第一节点102可以是客户端,如移动终端或个人电脑。第二节点105可以是服务器,或称参数服务器。在某些实施例中,第一节点可以称为训练数据的拥有者,第二节点也可称为联邦学习过程的协调者。
第二节点105可用于维护联邦模型。第一节点102可以从第二节点105获取联邦模型,并结合本地训练数据进行本地训练,得到本地模型。在训练得到本地模型之后,第一节点102可以将该本地模型发送给第二节点105,以便第二节点105更新或优化该联邦模型。如此往复,经过多轮迭代,直到联邦模型收敛或达到预设的迭代停止条件。
下面结合图2,对联邦学习的一般过程进行介绍。
在步骤S210,第二节点105构建联邦模型。第二节点105可以构建通用的机器学习模型,也可以根据需求构建特定的机器学习模型。以图像识别任务为例,第二节点105可以构建一个卷积神经网络(convolutional neural network,CNN),作为联邦模型。
在步骤S220,第二节点105选择第一节点102。第二节点105选择出的第一节点102会得到第二节点105下发的联邦模型。第二节点105可以随机选择第一节点102,也可以根据一定的策略选择第一节点102。例如,第二节点105可以选择本地模型与联邦模型匹配度较高的第一节点102,以加快联邦模型的收敛速度。
在步骤S230,第一节点102从第二节点105获取或接收联邦模型。例如,在一种实现方式中,第一节点102可以主动请求第二节点105下发联邦模型。或者,在另一种实现方式中,第二节点105主动向第一节点102下发联邦模型。以第一节点102为客户端,第二节点105为服务器为例,则客户端可以从服务器下载联邦模型。
在步骤S240,第一节点102利用本地训练数据对联邦模型进行训练,得到本地模型。第一节点102可以将联邦模型作为本地模型的初始模型,然后利用本地训练数据对该初 始模型进行一步或多步训练,得到本地模型。
本地训练过程可以看成是本地模型的优化过程,其优化目标可以通过下式表示:

$$\min_{\omega}\;F_{k}(\omega)$$

其中,ω表示本地模型,ω_t表示第t轮迭代时的联邦模型,ω可以采用ω_t作为初始值,也可以使用上一轮迭代得到的本地模型作为初始值。k表示第k个第一节点。F_k(ω)表示本地模型在本地训练数据上的损失函数。
在步骤S250,第二节点105将第一节点102训练得到的本地模型进行汇聚,得到更新后的联邦模型。例如,在其中一种实现方式中,第二节点105可以将多个第一节点102的本地模型的参数进行加权求和,并将加权求和的结果作为更新后的联邦模型。
步骤S220-S250描述的过程可以看成是联邦学习过程中的一轮迭代。第二节点105和第一节点102可以重复执行步骤S220-S250,直到联邦模型收敛或达到预设效果。
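为便于理解上述步骤S220-S250的一轮迭代流程,下面给出一个极简的示意性代码草图(仅为示意,并非本申请的实现;其中的函数名 local_train、aggregate、节点数量与数据量均为假设,本地训练用随机扰动代替真实梯度)。该草图以参数为固定值的联邦平均为例,展示"下发模型—本地训练—加权汇聚"的骨架。

```python
import numpy as np

def local_train(global_params, local_data, epochs=1, lr=0.1):
    # 示意:以全局参数为初始值,在本地数据上做若干步训练(此处用随机扰动占位真实梯度)
    params = {k: v.copy() for k, v in global_params.items()}
    for _ in range(epochs):
        for k in params:
            params[k] -= lr * 0.01 * np.random.randn(*params[k].shape)
    return params

def aggregate(local_params_list, data_sizes):
    # 对应步骤S250:按各第一节点本地数据量对本地模型参数做加权求和
    total = float(sum(data_sizes))
    return {k: sum(n / total * p[k] for p, n in zip(local_params_list, data_sizes))
            for k in local_params_list[0]}

# 一轮迭代的骨架(对应步骤S220-S250),假设3个第一节点
global_params = {"w": np.zeros((4, 4)), "b": np.zeros(4)}
clients = [{"data": None, "n": n} for n in (100, 200, 50)]
locals_ = [local_train(global_params, c["data"]) for c in clients]
global_params = aggregate(locals_, [c["n"] for c in clients])
```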
联邦学习可用于训练机器学习模型。最常见的机器学习模型为神经网络。为了便于理解,先对神经网络的相关概念以及本申请实施例涉及的一些用语进行解释。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
$$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式: $\vec{y}=\alpha(W\vec{x}+\vec{b})$,其中,$\vec{x}$是输入向量,$\vec{y}$是输出向量,$\vec{b}$是偏移向量,W是权重矩阵(也称系数),α(·)为激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。

由于DNN层数多,则系数W和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W^{3}_{24}$,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W^{L}_{jk}$。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的过程也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(4)参数服从分布的神经网络
参数服从分布的神经网络是参数服从分布的机器学习模型中的一种。具体而言,传统的神经网络的参数(如前文中提及的神经元的权重)为固定值。但是,这种类型的神经网络存在过拟合的问题,即这种类型的神经网络在训练数据缺失的区域往往给出过于自信的预测,无法对预测结果的不确定性进行准确度量。
相比参数为固定值的神经网络,有些神经网络的参数服从一定的分布。例如,贝叶斯神经网络的参数即为服从某种分布的随机变量,如服从高斯分布的随机变量。参数服从概率分布的神经网络的训练过程并非希望得到参数的固定值,而是旨在优化参数的概率分布。在训练完成后,可以对参数的分布进行采样,每个采样值可以对应一个参数为固定值的神经网络。如果采样得到的大量神经网络对某个输入的预测是相似的,则可以认为该神经网络对该输入对应的预测有较小的不确定度,否则该神经网络对输入对应的预测的不确定度较大。通过这样的方式,这种参数服从概率分布的神经网络可以表征由于数据缺失带来的预测的不确定性,从而避免过拟合的问题。
(5)先验分布、后验分布和似然估计
参数服从概率分布的机器学习模型的训练问题可以看成是基于贝叶斯公式,对参数的概率分布的估计问题。在贝叶斯公式中,先验分布、后验分布和似然估计是三个重要的概念。
参数的先验分布是对后验分布的预先假设,也就是说,参数的先验分布指的是在未观测到训练数据之前对参数的后验分布的假设。参数的先验分布可以由人工指定,也可以通过数据学习得到。相对而言,参数的后验分布是在观测到训练数据之后,对参数的分布的描述。换句话说,参数的后验分布是在已知训练数据的条件下,对参数的分布的描述。根据贝叶斯公式,参数的先验分布、后验分布和似然估计之间满足如下关系:后验分布=(先验分布×似然估计)/训练数据出现的概率。
(6)参数的分布的参数化描述和非参数化描述
无论是参数的先验分布还是后验分布,都是在描述参数的分布,但参数的分布的具体描述方式可以有多种,本申请实施例对此并不限定。在一些实施例中,参数的先验分 布和/或后验分布可以采用参数化的分布描述方式。例如,假设参数的分布为高斯分布,则参数的先验分布和/或后验分布可以通过均值和方差描述高斯分布。在另一些实施例中,先验分布和/或后验分布也可以采用非参数化的分布描述方式。例如,参数的先验分布和/或后验分布可以采用概率直方图、概率密度、累计函数曲线等方式描述参数的分布。
(7)先验分布对后验分布的“点描述”和“分布描述”
模型参数的先验分布可以是模型参数的概率分布,也可以是模型参数的概率分布的概率分布。
先验分布和后验分布之间是存在关联的,即先验分布可以视为对后验分布的一种预先的描述,即在观测到训练数据之前的一种假设性的描述。如果模型参数的先验分布为模型参数的概率分布,则这种类型的先验分布可以理解为在对后验分布进行“点描述”;如果模型参数的先验分布为模型参数的概率分布的概率分布,则这种类型的先验分布可以理解为在对后验分布进行“分布描述”。
例如,假设模型参数服从高斯分布,当模型参数的先验分布为模型参数的概率分布时,该模型参数的先验分布可以是模型参数的分布的均值和方差。从先验分布对后验分布进行描述这个角度来看,相当于先验分布采用[均值,方差]这样一个点对后验分布进行了“点描述”。
又如,假设模型参数服从高斯分布,当模型参数的先验分布为模型参数的概率分布的概率分布时,该模型参数的先验分布并非给出模型参数的分布的均值和方差,而是对模型参数的分布的均值和方差取不同值的概率进行描述。从先验分布对后验分布进行描述这个角度来看,相当于先验分布采用概率分布对后验分布的均值和方差取不同值的概率(或取不同值的惩罚或奖励)进行了“分布描述”。
(8)两个分布之间的差异的度量
本申请的某些实施例会涉及先验分布与后验分布之间的差异的度量。先验分布和后验分布之间的差异度量方式可以有多种,且可以根据先验分布对后验分布的描述方式的不同,设计出不同的分布差异度量函数,以度量两个分布之间的差异。下面给出几个示例。
作为一个示例,如果先验分布对后验分布采用的是“点描述”,且先验分布采用参数化的分布描述方式,则先验分布与后验分布之间的差异可以采用两个分布的KL散度(Kullback-Leibler散度)进行度量。换句话说,可以采用先验分布与后验分布的KL散度作为两个分布的分布差异度量函数。
作为另一示例,如果先验分布采用的是“点描述”,且先验分布采用非参数化的分布描述方式(如基于直方图、概率密度曲线等进行描述),则先验分布与后验分布之间的差异可以通过计算两个分布对应的直方图(或概率密度曲线)的相似性,对两个分布之间的差异进行度量。换句话说,可以将先验分布与后验分布对应的直方图(或概率密度曲线)的相似性作为两个分布的分布差异度量函数。两个分布对应的直方图(或概率密度曲线)的相似性可以通过计算两个直方图(或概率密度曲线)的面积的差异或者余弦距离得到。
作为又一示例,如果先验分布对后验分布采用“分布描述”,则可以使用先验分布在后验分布的取值处的概率作为两个分布的差异的描述。换句话说,可以使用先验分布 在后验分布的取值处的概率作为两个分布的分布差异度量函数。
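下面给出上述几种分布差异度量方式的一个示意性实现草图(仅为示意,并非本申请的实现;假设分布为一维高斯或概率直方图,函数名均为示例性命名)。其中KL散度采用两个一维高斯分布的闭式解,直方图相似性采用余弦相似度,“分布描述”情形用先验分布在后验分布取值处的对数概率。

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    # 一维高斯 q=N(mu_q,var_q) 与 p=N(mu_p,var_p) 的KL散度闭式解 KL(q||p)
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def hist_cosine_similarity(h1, h2):
    # 非参数化描述(概率直方图)之间的余弦相似度
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))

def score_under_prior(prior_logpdf, posterior_params):
    # “分布描述”情形:用先验分布在后验分布取值处的(对数)概率作为差异度量
    return prior_logpdf(posterior_params)

print(kl_gauss(0.0, 1.0, 0.5, 2.0))
print(hist_cosine_similarity([0.1, 0.4, 0.5], [0.2, 0.3, 0.5]))
```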
下面介绍本申请实施例提供的一种芯片硬件结构。
图3为本申请实施例提供的一种芯片硬件结构。该芯片包括神经网络处理器50。该芯片可以被设置在如图1所示的第一节点102中,用于第一节点102完成本地模型的训练工作。该芯片也可以被设置在如图1所示的第二节点105中,用于第二节点105完成联邦模型的维护和更新工作。
神经网络处理器50作为协处理器挂载到主中央处理单元(host central processing unit,host CPU)上,由主CPU分配任务。神经网络处理器50的核心部分为运算电路503,控制器504控制运算电路503提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路503内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路503是二维脉动阵列。运算电路503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路503是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器502中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器501中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)508中。
向量计算单元507可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元507可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元507将经处理的输出的向量存储到统一缓存器506。例如,向量计算单元507可以将非线性函数应用到运算电路503的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元507生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路503的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器506用于存放输入数据以及输出数据。
存储单元访问控制器505(direct memory access controller,DMAC)用于将外部存储器中的输入数据搬运到输入存储器501和/或统一存储器506,将外部存储器中的权重数据存入权重存储器502,以及将统一存储器506中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)510,用于通过总线实现主CPU、DMAC和取指存储器509之间进行交互。
与控制器504连接的取指存储器(instruction fetch buffer)509,用于存储控制器504使用的指令。
控制器504用于调用取指存储器509中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器506,输入存储器501,权重存储器502以及取指存储器509均为片上(on-chip)存储器,外部存储器为该神经网络处理器外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
目前,现有的联邦学习仅能对参数(如权重)为固定值的机器学习模型进行训练,而无法对参数服从分布的机器学习模型进行训练。由于本地训练数据的数据分布与整体训练数据(整体训练数据指的是所有本地训练数据形成的数据集)的数据分布经常不一致,所以参数为固定值的机器学习模型的联邦学习过程经常出现模型震荡的问题(即训练过程中,模型参数的取值来回摆动,而不是朝着一个方向持续收敛),导致联邦学习过程的训练时间长、通信开销大。
为了解决该问题,本申请提供一种联邦学习的方法,能够实现参数服从分布的机器学习模型的联邦学习。应理解,本申请提及的分布指的是概率分布。下面结合图4,对本申请实施例提供的联邦学习的方法进行详细介绍。
图4的方法包括步骤S410-S440。图4中的第一节点可以是图1中的第一节点102中的任意一个,图4中的第二节点可以是图1中的第二节点105。
图4实施例中提及的联邦模型是参数服从分布的机器学习模型。在某些实施例中,联邦模型是参数服从分布的神经网络,联邦模型的参数可以指神经网络中的神经元的参数。例如,联邦模型可以为贝叶斯神经网络。进一步地,在一些实施例中,该贝叶斯神经网络中的参数可以服从高斯分布。
图4实施例中提及的本地模型也可以是参数服从分布的机器学习模型。在某些实施例中,本地模型是参数服从分布的神经网络,本地模型的参数可以指神经网络中的神经元的参数。例如,本地模型可以为贝叶斯神经网络。进一步地,在一些实施例中,该贝叶斯神经网络中的参数服从高斯分布、delta分布或其他分布。在一些实施例中,联邦模型和本地模型可以是结构相同的机器学习模型。在另一些实施例中,联邦模型可以包括多个贝叶斯模型(如多个贝叶斯神经网络),而本地模型可以与其中的一个贝叶斯模型结构相同。
在步骤S410,第一节点从第二节点接收联邦模型的参数的先验分布。例如,在一种实现方式中,第一节点可以主动请求第二节点下发联邦模型的参数的先验分布。或者,在另一种实现方式中,第二节点可以主动向第一节点下发联邦模型的参数的先验分布。
在步骤S420,第一节点根据联邦模型的参数的先验分布和第一节点的本地训练数据,训练得到第一节点的本地模型的参数的后验分布。
可替换地,步骤S420也可描述成:第一节点根据联邦模型的参数的先验分布和第一节点的本地训练数据,优化得到第一节点的本地模型的参数的后验分布。在具体实现时,可以根据联邦模型参数的先验分布,通过贝叶斯优化方式,推断得到第一节点的本地模型的参数的后验分布。
在步骤S430,第二节点接收至少一个第一节点的本地模型的参数的后验分布。
例如,在一种实现方式中,第一节点主动向第二节点发送本地模型的参数的后验分布。或者,在另一种实现方式中,第一节点可以应第二节点的要求向第二节点发送本地模型的参数的后验分布。
第一节点向第二节点发送的本地模型的参数的后验分布可以是本地模型的所有参数的后验分布,也可以是本地模型的部分参数的后验分布。
第一节点可以采用向第二节点发送本地模型参数的后验分布与联邦模型参数的先验分布之间的差异的方式,向第二节点发送本地模型的参数的后验分布。或者,第一节点也可以直接向第二节点发送本地模型的参数的后验分布本身。
第一节点向第二节点发送的本地模型的参数的后验分布可以是加密后的本地模型的参数的后验分布,也可以是不加密的本地模型的参数的后验分布。
此外,在某些实现方式中,第一节点也可以向第二节点发送本地训练数据。
在步骤S440,第二节点根据该至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新。例如,第二节点可以接收至少一个第一节点发送的本地模型的参数的后验分布;然后,第二节点可以对该至少第一节点的本地模型的参数的后验分布进行加权求和,得到更新后的联邦模型的参数的先验分布。
在联邦学习的过程中,步骤S410至步骤S440可以执行一次,也可以重复执行多次。例如,步骤S410至步骤S440可以迭代执行多次,直到满足迭代停止条件。迭代停止条件例如可以是达到预设的迭代次数,或者可以是联邦模型已经收敛。
本申请实施例通过在节点之间交互模型参数的先验分布和后验分布,实现了参数服从分布的机器学习模型的联邦学习。参数服从分布的机器学习模型能够预先给出参数的各种取值的可能性,而参数的各种取值的可能性能够表征机器学习模型的各种可能的改进方向之间的优劣。因此,对参数服从分布的机器学习模型进行联邦学习,有助于参与联邦学习的节点找到机器学习模型的较优的改进方向,从而减少训练时间和节点之间的通信开销。
此外,参数服从分布的机器学习模型的训练过程仍然存在需要保护数据隐私的需求,但现有技术并不支持参数服从分布的机器学习模型的联邦学习。针对参数服从分布的机器学习模型,现有技术需要将各节点的训练数据汇聚在一端共同进行训练,利用汇聚后的数据进行训练一方面容易泄露用户隐私,另一方面对汇聚数据的节点的计算力要求较高。采用本申请实施例提供的方案可以实现参数服从分布的机器学习模型的联邦学习,从而可以避免泄露用户隐私,也可以降低对执行训练任务的节点的计算力的要求。
图4中的步骤S420的实现方式可以有多种,下面结合图5进行举例说明。
如图5所示,步骤S420进一步包括步骤S422和步骤S424。在步骤S422,第一节点根据联邦模型的参数的先验分布,确定本地模型的参数的先验分布。在步骤S424,第一节点根据本地模型的参数的先验分布和第一节点的本地训练数据,训练得到第一节点的本地模型的参数的后验分布。
步骤S422的实现方式有多种。例如,如果联邦模型与本地模型对应相同结构的机器学习模型,则第一节点可以直接将联邦模型的参数的先验分布作为本地模型的参数的先验分布。
或者,如果联邦模型的参数的先验分布可以包括多个局部先验分布(每个局部先验分布可以对应一个贝叶斯模型),则第一节点在收到该联邦模型的参数的先验分布之后,可以根据该多个局部先验分布与本地训练数据的匹配度,确定本地模型的参数的先验分布。
需要说明的是,多个局部先验分布可以以显式的方式包含在联邦模型的参数的先验分布中;或者,在一些实施例中,该多个局部先验分布也可以隐含在联邦模型的参数的先验分布中,并需要通过一定的方式(如随机采样的方式)从联邦模型的参数的先验分布中将其分解出来。下面给出几个示例。
作为一个示例,联邦模型包括结构相同的多个贝叶斯模型,其中每个贝叶斯模型的每个参数仅包含一个分布。此外,联邦模型参数的先验分布对后验分布进行“点描述”。在这种情况下,针对一个参数,多个贝叶斯模型提供的先验分布可能是不同的,即一个参数可能存在多种可能的分布。第一节点接收到联邦模型参数的先验分布之后,可以对每个参数的多种可能的分布进行采样(如随机采样),并将不同参数的分布的采样结果按照多种方式进行组合,形成多个局部先验分布。然后,第一节点可以根据该多个局部先验分布与第一节点的本地训练数据的匹配度,从该多个局部先验分布中选取与该本地训练数据最匹配的局部先验分布,作为本地模型的参数的先验分布。或者,第一节点可以根据该多个局部先验分布与该本地训练数据匹配度的差异,采用加权求和的方式得到本地模型的参数的先验分布。
作为另一示例,联邦模型仅包括一个机器学习模型,但该机器学习模型的每个参数包括多个分布(即该参数的分布为混合分布)。此外,联邦模型参数的先验分布对后验分布进行“点描述”。在这种情况下,该机器学习模型的每个参数仍然存在多种可能的分布。第一节点接收到联邦模型参数的先验分布之后,可以对每个参数的多种可能的分布进行采样(如随机采样),并将不同参数的分布的采样结果按照多种方式进行组合,形成多个局部先验分布。然后,第一节点可以根据该多个局部先验分布与第一节点的本地训练数据的匹配度,从该多个局部先验分布中选取与该本地训练数据最匹配的局部先验分布,作为本地模型的参数的先验分布。或者,第一节点可以根据该多个局部先验分布与该本地训练数据匹配度的差异,采用加权求和的方式得到本地模型的参数的先验分布。
作为又一示例,第二节点维护的联邦模型还可以是上述两种情况的组合,即第二节点维护多个机器学习模型,其中一个机器学习模型的一个参数包含多种分布。在这种情况下,每个参数的分布的取值具有更多的可能性,可以为第一节点的采样提供更丰富的选择范围。
以参数服从高斯分布的贝叶斯神经网络为例,假设第二节点维护的联邦模型是以下几种情况中的一种:
情况一:联邦模型仅维护一个贝叶斯神经网络,该贝叶斯神经网络的每个参数仅包含一个高斯分布;
情况二:联邦模型维护多个贝叶斯神经网络,其中每个贝叶斯神经网络的每个参数仅包含一个高斯分布,且该多个贝叶斯神经网络的参数的分布不同;
情况三:联邦模型仅维护一个贝叶斯神经网络,其中每个参数包含多个高斯分布。
情况四:联邦模型维护多个贝叶斯神经网络,其中每个贝叶斯神经网络的每个参数包含多个高斯分布,该多个贝叶斯神经网络的参数的分布不同。
面对情况二至四,第一节点在收到联邦模型参数的先验分布之后,均可以先对其进行采样,以获得一个贝叶斯神经网络的参数,且使得该贝叶斯神经网络的一个参数仅包 含一个高斯分布。
如果联邦模型参数的先验分布对后验分布采用的是“分布描述”,则可以先依据该“分布描述”给出的分布取值的概率,对先验分布的取值进行采样,从而得到先验分布的多种取值。经过上述采样操作,相当于将先验分布对后验分布的“分布描述”转换成先验分布对后验分布的多个“点描述”,其中每个“点描述”就相当于从联邦模型参数的先验分布中分解出的一个局部先验分布。然后,第一节点可以根据多个局部先验分布与第一节点的本地训练数据的匹配度,从该多个局部先验分布中选取与该本地训练数据匹配的局部先验分布,作为本地模型的参数的先验分布。或者,第一节点可以根据该多个局部先验分布与该本地训练数据匹配度的差异,采用加权求和的方式得到本地模型的参数的先验分布。
局部先验分布与第一节点的本地训练数据的匹配度的度量方式可以有多种。
例如,可以依次将各个局部先验分布作为本地模型参数的先验分布,并结合本地训练数据进行训练。然后,根据各个局部先验分布的训练效果,对局部先验分布与第一节点的本地训练数据的匹配度进行度量。
或者,在某些实施例中,可以根据局部先验分布与本地模型参数的历史后验分布之间的差异对局部先验分布与第一节点的本地训练数据的匹配度进行度量。然后,可以根据多个局部先验分布与历史后验分布之间的差异,确定本地模型的参数的先验分布。例如,可以将多个局部先验分布中的与历史后验分布差异最小的先验分布作为本地模型的参数的先验分布。或者,可以根据多个局部先验分布与历史后验分布之间的差异,对多个局部先验分布进行加权求和,并将加权求和的结果作为本地模型的参数的先验分布。
本实施例提及的历史后验分布指的是第一节点在本轮迭代之前得到的本地模型的参数的后验分布,如上一轮迭代得到的本地模型的参数的后验分布。两个分布之间的差异的度量方式在前文有描述,此处不再详述。
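下面给出“根据多个局部先验分布与历史后验分布之间的差异,确定本地模型的参数的先验分布”的一个示意性草图(仅为示意,并非本申请的实现;假设各分布均为逐参数的一维高斯,差异用KL散度度量,softmax(-KL)仅是众多可能加权方式中的一种假设)。

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def select_or_mix_priors(local_priors, hist_posterior, mode="select"):
    # local_priors: 多个局部先验分布,每个为(mu, var)向量; hist_posterior: 历史后验分布(mu, var)
    mu_h, var_h = hist_posterior
    divs = np.array([np.sum(kl_gauss(mu_h, var_h, mu_p, var_p)) for mu_p, var_p in local_priors])
    if mode == "select":                       # 取与历史后验分布差异最小的局部先验分布
        return local_priors[int(np.argmin(divs))]
    w = np.exp(-divs); w /= w.sum()            # 差异越小,权重越大
    mu = sum(wi * m for wi, (m, _) in zip(w, local_priors))
    var = sum(wi * v for wi, (_, v) in zip(w, local_priors))
    return mu, var                             # "mix"模式:按权重对局部先验分布做加权和(逐参数近似)

priors = [(np.zeros(3), np.ones(3)), (np.ones(3), 0.5 * np.ones(3))]
hist = (0.2 * np.ones(3), np.ones(3))
print(select_or_mix_priors(priors, hist, mode="mix"))
```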
需要说明的是,联邦模型维护多个机器学习模型的方案也可应用于参数为固定值的机器学习模型的联邦学习中。
例如,第一节点从第二节点接收包括多个机器学习模型的联邦模型;然后,第一节点从多个机器学习模型中选取目标机器学习模型,并根据目标机器学习模型和第一节点的本地训练数据,训练第一节点的本地模型。该目标机器学习模型可以是多个机器学习模型中的与本地训练数据匹配度最高的机器学习模型,或者,该目标机器学习模型可以是多个机器学习模型中的精度最高的机器学习模型。
对应地,第二节点向第一节点发送包括多个机器学习模型的联邦模型;然后,第二节点可以接收第一节点发送的与多个机器学习模型中的目标机器学习模型对应的本地模型(即该本地模型是通过训练该目标机器学习模型得到的);第二节点根据本地模型优化目标机器学习模型(即第二节点根据本地模型优化联邦模型中的对应的机器学习模型)。
前文对图5中的步骤S422进行了详细描述,下文对图5中的步骤S424进行详细描述,即对如何根据本地模型的参数的先验分布生成本地模型的参数的后验分布进行详细描述。
根据本地模型的参数的先验分布生成本地模型的参数的后验分布的过程也就是利用 本地训练数据对本地模型进行本地训练的过程。该本地训练的过程中,本地模型的参数的先验分布的使用方式可以有多种。例如,可以将本地模型的参数的先验分布作为本地训练的优化目标中的约束条件。或者,可以根据本地模型的参数的先验分布确定本地模型的参数的后验分布的初始值。下面对这两种使用方式各自对应的本地训练过程进行详细描述。
方式一:本地模型的参数的先验分布作为本地训练的优化目标中的约束条件
首先,可以将本地训练的优化目标设定为:本地模型的参数的后验分布在本地训练数据上的损失函数尽可能小(或者似然函数尽可能大),同时,本地模型参数的先验分布和后验分布的分布差异度量函数尽可能小或惩罚尽可能小。
其次,在本地训练开始前,可以先为本地模型的参数的后验分布设置初始值。该初始值的设定方式可以有多种。例如,本地模型的参数的后验分布的初始值可以设定为本轮迭代之前(如上一轮迭代)的本地模型的参数的后验分布的值,也可以是随机化的初始值。在某些实施例中,本地模型的参数的后验分布的初始值可以根据本地模型的参数的先验分布确定。以本地模型的参数的先验分布对后验分布采用“点描述”为例,本地模型的参数的后验分布的初始值可以是本地模型的参数的先验分布的值;以本地模型的参数的先验分布对后验分布采用“分布描述”为例,本地模型的参数的后验分布的初始值可以是本地模型的参数的先验分布的采样值。
接着,在确定本地模型的参数的后验分布的初始值,以及优化目标之后,可以采用评分函数(score function)或者重参数化的方式进行本地训练,直到本地模型的参数的后验分布收敛。
方式二:根据本地模型的参数的先验分布确定本地模型的参数的后验分布的初始值
如果本地模型的参数的先验分布对后验分布采用“点描述”,则在本地训练过程中,可以将本地模型的参数的先验分布作为本地模型的参数的后验分布的初始值。如果本地模型的参数的先验分布对后验分布采用“分布描述”,则在本地训练过程中,本地模型的参数的后验分布的初始值可以采用本地模型的参数的先验分布的采样值。
本地训练的优化目标可以设定为:在本地训练数据训练时,本地模型的参数的后验分布的损失函数尽可能小或者似然函数尽可能大。
接着,在确定本地模型的参数的后验分布的初始值以及本地训练的优化目标之后,可以采用评分函数(score function)或者重参数化的方式进行训练,直到本地模型的参数的后验分布收敛。
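下面给出本地训练过程(涵盖方式一与方式二)的一个示意性草图(仅为示意,并非本申请的实现;使用 PyTorch,假设本地模型只有一个服从高斯分布的权重向量,似然取高斯,数据为人造的线性回归样本)。其中KL项对应方式一中“先验分布作为优化目标中的约束条件”;若采用方式二,可将该项系数置零,仅以先验分布作为后验分布的初始值,并通过重参数化进行优化。

```python
import torch

# 假设的先验分布:逐参数高斯 N(mu0, exp(logvar0)),以及假设的本地训练数据
mu0, logvar0 = torch.zeros(3), torch.zeros(3)
x = torch.randn(64, 3)
y = x @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(64)

# 后验分布 q(θ)=N(mu, exp(logvar)),以先验分布作为初始值
mu = mu0.clone().requires_grad_(True)
logvar = logvar0.clone().requires_grad_(True)
opt = torch.optim.Adam([mu, logvar], lr=1e-2)

for step in range(200):
    eps = torch.randn_like(mu)                       # 重参数化: θ = mu + σ·ε
    theta = mu + torch.exp(0.5 * logvar) * eps
    nll = 0.5 * ((y - x @ theta) ** 2).mean()        # 负对数似然(高斯似然,噪声方差取常数)
    kl = 0.5 * (torch.exp(logvar) / torch.exp(logvar0)
                + (mu - mu0) ** 2 / torch.exp(logvar0)
                - 1.0 + logvar0 - logvar).sum()       # KL(q || 先验) 的闭式解,对应方式一的约束项
    loss = nll + 1e-3 * kl                            # 方式二可将 1e-3 置为 0
    opt.zero_grad(); loss.backward(); opt.step()

print("后验均值:", mu.data, "后验方差:", torch.exp(logvar).data)
```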
上文结合图5,详细描述了第一节点如何利用本地训练数据开展本地训练。在本地训练结束之后,第一节点可以将训练得到的本地模型的参数的后验分布发送至第二节点,以便第二节点根据接收到的本地模型的参数的后验分布更新联邦模型的参数的先验分布。但是,在某些实施例中,在向第二节点反馈本地训练的结果之前,第一节点也可以根据一定的条件对是否向第二节点反馈本地训练的结果进行决策;和/或,第一节点也可以根据一定的条件对向第二节点反馈本地训练的全部结果还是部分结果进行决策。下面结合具体的实施例,对第一节点的决策方式进行举例说明。
在向第二节点发送本地模型的参数的后验分布之前,第一节点可以根据本地模型的参数的后验分布,确定本地模型的不确定度。当本地模型的不确定度满足第一预设条件 时,第一节点向第二节点发送本地模型的参数的后验分布;当本地模型的不确定度不满足第一预设条件时,第一节点不向第二节点发送本地模型的参数的后验分布。
本地模型的不确定度可用于表示本地模型的稳定性。在某些实施例中,本地模型的不确定度可以表示第一节点的本地训练数据对联邦模型的重要性(或对联邦学习的重要性)。
例如,当希望联邦模型尽快收敛时,如果本地模型的不确定度较高,则说明第一节点的本地训练数据对联邦模型不重要,在优化联邦模型参数的先验分布时,将该本地模型的后验分布考虑在内,会降低联邦模型的收敛速度。
又如,当希望增大联邦模型的容量时,如果本地模型的不确定度较高,则说明第一节点的本地训练数据对联邦模型重要,在优化联邦模型参数的先验分布时,将该本地模型的参数的后验分布考虑在内,会提升联邦模型在与本地训练数据相同或接近的数据上进行推断的可靠性。
本地模型的不确定度可以基于以下信息中的至少一种度量:本地模型的参数的后验分布的方差,本地模型的参数的后验分布的收敛速度(或称收敛效果),或者本地模型的参数的后验分布的推断准确率。
本申请实施例对第一预设条件的具体内容不做限定,可以根据实际需要选取。
作为一个例子,如果希望加快联邦模型的收敛速度,则当本地模型的不确定度较高时,第一节点可以不向第二节点发送本地模型的参数的后验分布。例如,当本地模型的方差大于预设阈值或本地模型的收敛速度小于预设阈值时,第一节点不向第二节点发送本地模型的参数的后验分布。
作为另一个例子,如果希望提高联邦模型的容量,则当本地模型的不确定度较高时,第一节点向第二节点发送本地模型的参数的后验分布。例如,当本地模型的方差大于预设阈值或本地模型的收敛速度小于预设阈值时,第一节点向第二节点发送本地模型的参数的后验分布。
在向第二节点发送本地模型的参数的后验分布之前,第一节点也可以根据本地模型的参数的后验分布与本地模型的参数的先验分布之间的差异,选择是否将本地模型的参数的后验分布发送至第二节点。
例如,如果希望提高节点之间的通信效率,则当本地模型的参数的后验分布与本地模型的参数的先验分布之间的差异较小(如小于预设阈值)时,第一节点可以不向第二节点发送本地模型的参数的后验分布。这是因为,当本地模型的参数的后验分布与本地模型的参数的先验分布之间的差异较小时,表明本地模型与联邦模型之间的差异较小,即使将本地模型的参数的后验分布发送给第二节点,也不会对联邦模型的参数的先验分布的更新带来多大影响。在这种情况下,第一节点不上传本地模型的参数的后验分布可以节省节点之间的带宽,提升节点之间的通信效率。
上文详细描述了第一节点如何决策是否向第二节点发送本地训练的结果。下文详细描述第一节点如何决策是否向第二节点发送本地训练结果中的部分结果。需要说明的是,这两种决策可以相互独立,也可以相互组合。例如,第一节点可以在决定向第二节点反馈本地训练结果之后,再决定向第二节点反馈本地训练结果中的哪些结果。
可选地,在一些实施例中,第一节点可以根据该第一参数的后验分布,确定本地模 型中的第一参数的不确定度,其中本地模型可以包括至少一个参数,第一参数是该至少一个参数中的任意一个参数;当第一参数的不确定度满足第二预设条件时,第一节点向第二节点发送该第一参数的后验分布。
第一参数的不确定度可用于表示第一参数对第一节点的本地模型的重要性。如果第一参数的不确定度较高(如第一参数的分布比较平坦),该参数通常对本地模型最终的预测或推断结果没有太大影响。在这种情况下,第一节点可以考虑不向第二节点发送该第一参数的后验分布。
上文提及的第一参数的不确定度的度量方式可以有多种。例如,第一参数的不确定度可以基于第一参数的后验分布的均值、方差或二者的结合度量。例如,第一节点可以将第一参数的方差与固定阈值进行比较。当该方差小于该固定阈值时,第一节点向第二节点发送该第一参数的后验分布;当该方差大于或等于该固定阈值时,第一节点不向第二节点发送该第一参数的后验分布。又如,第一节点可以先根据第一参数的方差生成随机数,然后将该随机数与固定阈值进行比较。当该随机数小于该固定阈值时,第一节点向第二节点发送该第一参数的后验分布;当该随机数大于或等于该固定阈值时,第一节点不向第二节点发送该第一参数的后验分布。
本申请实施例对上文提及的第二预设条件的具体内容不做限定,可以根据实际需要设定。例如,第二预设条件可以根据第一参数的不确定度的大小设定,也可以根据第一参数的不确定度在本地模型的所有参数的不确定度中的排序设定。
应理解,上文提及的第一参数是本地模型中的任意一个参数,第一节点可以按照第一参数的处理方式类似的方式处理本地模型中的部分或全部参数。如果第一节点按照与第一参数的处理方式类似的方式对本地模型中的所有参数进行处理,则可以找到本地模型中的参数的不确定度不满足第二预设条件的所有参数,并在向第二节点反馈本地训练结果时,不向第二节点反馈这些参数的后验分布。
第一节点向第二节点发送本地模型的参数的后验分布的方式也可以有多种。例如,第一节点可以向第二节点发送本地模型的参数的整体分布,也可以向第二节点发送本地模型的参数的整体分布的一个或多个采样值。当第一节点向第二节点发送本地模型的参数的整体分布的一个或多个采样值时,第二节点可以根据接收到的针对同一参数的整体分布的多个采样值,对该参数的整体分布进行估计,并将估计结果作为该参数的先验分布更新至联邦模型中。第一节点向第二节点发送整体分布的采样值,可以提升节点之间的通信效率,降低通信带宽。
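下面给出“按参数的不确定度(此处以后验方差度量)有选择地上传后验分布,或上传整体分布的采样值”的一个示意性草图(仅为示意,并非本申请的实现;其中的阈值、参数名与采样数量均为假设,实际可按第二预设条件配置)。

```python
import numpy as np

def select_params_to_upload(posterior, var_threshold=1.0):
    # posterior: {参数名: (mu, var)};方差(不确定度)低于阈值的参数才上传其后验分布
    return {name: (mu, var) for name, (mu, var) in posterior.items() if var < var_threshold}

def sample_posterior(posterior, n_samples=3, seed=0):
    # 另一种上传方式:上传整体分布的若干采样值,由第二节点据此重新估计分布
    rng = np.random.default_rng(seed)
    return {name: rng.normal(mu, np.sqrt(var), size=n_samples)
            for name, (mu, var) in posterior.items()}

posterior = {"w1": (0.3, 0.2), "w2": (-1.2, 3.5), "b": (0.1, 0.05)}
print(select_params_to_upload(posterior))   # w2 方差过大(不确定度高),不上传
print(sample_posterior(posterior))
```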
在执行图4中的步骤S410之前,第二节点可以执行如图6所示的步骤。即第二节点可以按照一定的规则从候选节点中选择一个或多个第一节点,并向被选择的第一节点发送联邦模型的参数的先验分布,而不向未被选择的节点发送联邦模型的参数的先验分布。联邦学习通常包括多轮迭代,图4中的至少一个第一节点可以是参与本轮迭代的节点,则上文提及的候选节点可以为在本轮迭代之前参与该联邦学习的节点,如可以是参与联邦学习的上一轮迭代的节点。第二节点在不同的迭代轮次可以选择相同的第一节点,也可以选择不同的第一节点。
步骤S610的实现方式可以有多种,下面给出几种可能的实现方式。例如,在某些实施例中,第二节点可以随机选择参与本轮迭代的第一节点。
或者,在某些实施例中,第二节点可以按照候选节点反馈的评价信息,选择参与本轮迭代的第一节点。评价信息可用于表示联邦模型的参数的先验分布与候选节点的本地训练数据的匹配度。或者,评价信息可用于表示候选节点根据联邦模型的参数的先验分布训练得到的后验分布与候选节点的本地训练数据的匹配度。或者,评价信息可用于表示联邦模型的参数的先验分布与候选节点根据联邦模型的参数的先验分布训练得到的后验分布的匹配度。先验分布或后验分布与本地训练数据的匹配度可以采用本地模型在本地测试时得到的损失函数的取值进行评价。
如果希望提升联邦模型的容量,第二节点可以选择匹配度较低的候选节点参与联邦学习。如果希望加快联邦模型的收敛速度,第二节点可以选择匹配度较高的候选节点参与联邦学习。
或者,在某些实施例中,第二节点可以根据候选节点的历史后验分布与联邦模型的参数的先验分布之间的差异,从候选节点中选取至少一个第一节点。
如果希望提升联邦模型的容量,第二节点可以选择差异较大的候选节点参与联邦学习。如果希望加快联邦模型的收敛速度,第二节点可以选择差异较小的候选节点参与联邦学习。
重新参见图4。图4中的步骤S440描述的是第二节点对联邦模型的参数的先验分布的更新过程。该更新过程也可理解为第二节点对联邦模型的参数的先验分布的优化过程,或计算联邦模型的参数的先验分布的最优解的过程。下面结合具体的实施例,对联邦模型参数的更新过程进行详细描述。
如果联邦模型的参数的先验分布对后验分布采用参数化的“点描述”,则在更新过程中,针对同一参数,第二节点可以利用该参数的后验分布的差异计算该参数的先验分布,使得该参数的先验分布与该参数的各个后验分布的差异的均值(或加权平均值)最小。
如果联邦模型的参数的先验分布对后验分布采用非参数化的“点描述”(如直方图或概率密度曲线),第二节点可以将对同一参数的直方图或概率密度曲线进行合成,以得到该参数的先验分布。
如果联邦模型的参数的先验分布对后验分布采用“分布描述”,则第二节点可以根据针对同一参数的不同的后验分布,估计该参数的后验分布的概率分布,并将估计出该参数的后验分布的概率分布作为该参数的先验分布。
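下面给出第二节点根据多个后验分布更新某一参数的先验分布的一个示意性草图(仅为示意,并非本申请的实现;假设各后验分布为一维高斯且采用“点描述”,更新方式取“最小化先验与各后验的加权平均KL散度”,其高斯最优解等价于对后验分布的混合做矩匹配,权重可按本地训练数据量设置)。

```python
import numpy as np

def update_prior(posteriors, weights=None):
    # posteriors: [(mu_k, var_k), ...],各第一节点对同一参数反馈的高斯后验分布
    mus = np.array([m for m, _ in posteriors], float)
    vars_ = np.array([v for _, v in posteriors], float)
    if weights is None:
        weights = np.ones(len(posteriors))
    w = np.asarray(weights, float); w = w / w.sum()
    mu = float(w @ mus)                                  # 混合分布的均值
    var = float(w @ (vars_ + mus ** 2) - mu ** 2)        # 混合分布的方差(矩匹配)
    return mu, var

# 假设3个第一节点,本地数据量分别为100/200/50,据此加权
print(update_prior([(0.2, 0.5), (0.4, 0.3), (-0.1, 0.8)], weights=[100, 200, 50]))
```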
如果第二节点的联邦模型的参数的先验分布包含或可以拆分出多个局部先验分布,而某个第一节点的本地训练过程仅基于其中某个局部先验分布,则该第一节点的本地模型的参数的后验分布可以仅用于更新其对应的局部先验分布。
可选地,在更新过程中,还可以对联邦模型的结构进行调整。
例如,假设联邦模型中的一个参数的当前分布由较多分布叠加而成,则可以采用较少数量的分布的叠加对该参数的当前分布的叠加进行近似,以简化联邦模型。具体可以采用成分减少(component reduction)技术实现较少数量的分布的叠加对较多分布的叠加的近似。
或者,假设第二节点接收到的多个第一节点的本地模型参数的后验分布均包括第一参数的后验分布,且该多个第一节点的第一参数的后验分布之间的差异大于预设阈值, 则第二节点可以对联邦模型参数的先验分布进行更新,以将该第一参数拆分成多个参数。本申请实施例将该技术称为模型拆分技术。
或者,当第二节点维护多个机器学习模型时,第二节点可以将差异较小的机器学习模型合并在一起,也可以从已有的机器学习模型中产生新的机器学习模型(如采用随机的方式产生新的模型)。
在联邦学习的起始阶段,第二节点还可以先对联邦模型进行初始化。本申请实施例对初始化的内容不做具体限定。例如,第二节点可以设置联邦模型的网络结构。又如,第二节点还可以为联邦模型的参数的先验分布设置初始值。又如,第二节点可以设置联邦学习过程中的超参数。
下面结合具体例子,更加详细地描述本申请实施例。应注意,下文给出的例子仅仅是为了帮助本领域技术人员理解本申请实施例,而非要将本申请实施例限于所例示的具体数值或具体场景。本领域技术人员根据所给出的例子,显然可以进行各种等价的修改或变化,这样的修改或变化也落入本申请实施例的范围内。
示例1:
1.1、应用场景介绍
第二节点维护的联邦模型为单神经网络,且该联邦模型的参数的先验分布对后验分布进行“分布描述”。在本地训练过程中,第一节点直接将联邦模型的参数的先验分布作为本地模型的参数的先验分布,进行本地训练。本地模型的参数的先验分布和后验分布均对应相同大小的神经网络,且本地训练过程中,第一节点采用高斯分布作为似然函数进行贝叶斯优化。
例如,第二节点维护的联邦模型的参数的先验分布采用高斯逆伽玛分布对后验分布进行“分布描述”,而后验分布采用的是高斯分布。高斯逆伽玛分布也可称为正态逆伽玛(normal inverse gamma)分布,其可以通过如下的公式(1)表示。
$$N\text{-}\Gamma^{-1}(\mu,\sigma^{2}\mid\mu_{0},v,\alpha,\beta)\qquad(1)$$
公式(1)中,N-Γ^{-1}表示高斯逆伽玛分布;μ_0,v,α,β为高斯逆伽玛分布的4个参数。该4个参数决定了后验分布(高斯分布)的均值μ和方差σ^2的分布。
本地训练数据由联邦模型生成的概率可以通过公式(2)表示:

$$p(D\mid\mu_{0},v,\alpha,\beta)=\prod_{k=1}^{K}p(D_{k}\mid\mu_{0},v,\alpha,\beta)=\prod_{k=1}^{K}\int p(D_{k}\mid\theta_{k})\,N(\theta_{k}\mid\mu_{k},\sigma_{k}^{2})\,N\text{-}\Gamma^{-1}(\mu_{k},\sigma_{k}^{2}\mid\mu_{0},v,\alpha,\beta)\,\mathrm{d}\theta_{k}\,\mathrm{d}\mu_{k}\,\mathrm{d}\sigma_{k}^{2}\qquad(2)$$

公式(2)中,K表示参与联邦学习的第一节点的数量,k表示K个第一节点中的第k个第一节点。D表示K个第一节点的本地训练数据构成的完整的数据集,D_k表示第k个第一节点的本地训练数据所构成的数据集。θ_k表示第k个第一节点的本地模型的参数,p(D_k|θ_k)表示在给定参数θ_k的情况下,出现数据集D_k的概率。N(·)表示高斯分布,其均值μ_k和方差σ_k^2决定了θ_k的分布。

进一步地,公式(2)中,p(D_k|μ_0,v_0,α,β)表示第k个第一节点在给定参数μ_0,v_0,α,β下出现数据集D_k的概率。由于预先假设各第一节点是相互独立的,因此,在给定参数μ_0,v_0,α,β下,出现数据集D的概率是出现各个数据集D_k的概率的联乘。
1.2、本地训练过程
本地训练过程实际上可以是一个优化过程。优化目标可以采用公式(3)定义:

$$(\mu_{k}^{*},\sigma_{k}^{2*})=\arg\max_{\mu_{k},\sigma_{k}^{2}}\;p(D_{k}\mid\mu_{k},\sigma_{k}^{2})\,p(\mu_{k},\sigma_{k}^{2}\mid\mu_{0},v,\alpha,\beta)\qquad(3)$$

该优化目标的含义是给定本地模型的参数的先验分布μ_0,v,α,β的条件下,寻找最优的模型参数(μ_k^*,σ_k^{2*}),使得公式(3)的取值最大。优化得到的最优模型参数(μ_k^*,σ_k^{2*})即可作为本地模型的参数的后验分布。

公式(3)中的p(D_k|μ_k,σ_k^2)表示在给定模型参数(μ_k,σ_k^2)的条件下,出现本地训练数据组成的数据集D_k的概率,通过优化(μ_k,σ_k^2)使得这个概率尽可能大。公式(3)中的p(μ_k,σ_k^2|μ_0,v,α,β)表示在给定参数μ_0,v,α,β的条件下,出现(μ_k,σ_k^2)的概率,优化目标是希望出现(μ_k,σ_k^2)的概率尽可能大。可以将p(μ_k,σ_k^2|μ_0,v,α,β)理解为p(D_k|μ_k,σ_k^2)的正则项,该正则项可以使得本地模型的参数的后验分布(μ_k,σ_k^2)与本地模型的参数的先验分布之间的差异尽可能小,从而使得本地模型参数的后验分布不偏离联邦模型参数的先验分布太远,即保证联邦学习过程是一个持续学习的过程,不会出现模型震荡问题。

在定义了优化目标之后,可以采用重参数化的方式进行优化,得到本地模型的参数的后验分布,即得到(μ_k^*,σ_k^{2*})。
1.3、联邦模型的参数的先验分布的更新(或优化)过程
当参与联邦学习的第一节点将本地模型的参数的后验分布发送给第二节点之后,第二节点可以根据公式(4)更新联邦模型的参数的先验分布:
$$(\mu_{0}^{*},v^{*},\alpha^{*},\beta^{*})=\arg\max_{\mu_{0},v,\alpha,\beta}\prod_{k=1}^{K}p\big(\mu_{k}^{*},\sigma_{k}^{2*}\mid\mu_{0},v,\alpha,\beta\big)\qquad(4)$$

例如,第二节点可以通过最大化公式(4),得到联邦模型的参数的先验分布的最优解,即μ_0,v,α,β的最优解。
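下面给出“最大化各第一节点后验分布在高斯逆伽玛先验下的概率,以求解μ_0,v,α,β”的一个示意性草图(仅为示意,并非本申请的实现;上文公式(4)本身为带假设的重构,此处采用 scipy 对负对数似然做数值优化,对 v,α,β 取对数变换保证正性,数据为假设的后验分布)。

```python
import numpy as np
from scipy.stats import invgamma, norm
from scipy.optimize import minimize

# 各第一节点反馈的后验分布 (mu_k, sigma2_k),此处为假设数据
post = np.array([[0.2, 0.5], [0.4, 0.3], [-0.1, 0.8], [0.3, 0.6]])

def neg_log_lik(params):
    mu0, log_v, log_a, log_b = params
    v, a, b = np.exp(log_v), np.exp(log_a), np.exp(log_b)
    mu_k, s2_k = post[:, 0], post[:, 1]
    # 高斯逆伽玛: sigma2 ~ InvGamma(a, b), mu | sigma2 ~ N(mu0, sigma2/v)
    ll = invgamma.logpdf(s2_k, a, scale=b) + norm.logpdf(mu_k, loc=mu0, scale=np.sqrt(s2_k / v))
    return -np.sum(ll)

res = minimize(neg_log_lik, x0=np.zeros(4), method="Nelder-Mead")
mu0, v, a, b = res.x[0], *np.exp(res.x[1:])
print("更新后的先验分布参数 μ0, v, α, β =", mu0, v, a, b)
```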
示例2:
2.1、应用场景介绍
第二节点维护的联邦模型为单神经网络(如一个贝叶斯神经网络)。该联邦模型的一个参数具有多个分布(如混合高斯分布),且联邦模型的参数的先验分布对后验分布进行“点描述”。本地训练过程中,第一节点直接将联邦模型的参数的先验分布作为本地模型的参数的先验分布,进行本地训练。本地模型的参数的先验分布和后验分布均对应相同大小的神经网络,且本地训练过程中,第一节点采用高斯分布作为似然函数进行贝叶斯优化。
2.2、初始化过程
第二节点初始化一个神经网络,作为联邦模型。P(θ|η)表示该联邦模型的参数的先验分布,其中θ表示模型参数,η表示描述θ的分布的先验值。以一个参数包含两个高斯分布为例,η=[均值1,方差1,均值2,方差2]。
2.3、本地训练过程
首先,被第二节点选定的第一节点从第二节点处获取联邦模型的参数的先验分布P(θ|η)。
其次,第一节点以P(θ|η)作为本地模型的参数的先验分布,使用本地训练数据,训练得到本地模型的参数的后验分布。
具体地,本地模型的参数的后验分布的训练过程是一个优化过程。优化目标可以采用公式(5)定义:
$$\max_{q_{k}(\theta)}\;\mathbb{E}_{q_{k}(\theta)}\big[\log p(D_{k}\mid\theta)\big]-D_{KL}\big(q_{k}(\theta)\,\|\,P(\theta\mid\eta)\big)\qquad(5)$$

公式(5)中,q_k(θ)表示本地模型的参数θ的后验分布。如果本地模型的参数的后验分布采用的是参数化的描述方式(而非直方图、概率密度曲线等非参数化描述方式),则本地模型的参数的后验分布也可以通过q_k(θ|η_k)表示,即通过参数η_k描述本地模型的参数θ。log p(D_k|θ)表示本地模型的参数θ对应的似然函数。D_KL表示KL散度。
在定义了优化目标之后,采用重参数化的方式对q_k(θ)进行优化,即可得到优化后的q_k(θ)。
2.4、联邦模型的参数的先验分布的更新(或优化)过程
当参与联邦学习的第一节点将本地模型的参数的后验分布发送给第二节点之后,第二节点可以根据公式(6)更新联邦模型的参数的先验分布:
$$\eta^{*}=\arg\max_{\eta}\Big[\log P(\eta)+\sum_{k=1}^{K}\mathbb{E}_{q_{k}(\theta)}\big[\log P(\theta\mid\eta)\big]\Big]\qquad(6)$$

公式(6)中的P(η)表示η的分布,该分布可以事先由人为设定。
示例3:
3.1、应用场景介绍
第二节点维护的联邦模型包括多个神经网络。第一节点的本地模型为单个神经网络。
3.2、初始化过程
第二节点对联邦模型的参数的先验分布进行初始化。该联邦模型的参数的先验分布包括N个局部先验分布(N为大于1的整数)。N个局部先验分布一一对应N个神经网络。换句话说,该N个局部先验分布分别为该N个神经网络的参数的先验分布。该N个神经网络的结构可以相同,也可以不同。例如,N个神经网络中的第1个神经网络具有5个全连接层,其中每层设置50个神经元;第2个神经网络也具有5个全连接层,其中每层设置50个神经元;第3个神经网络具有4个全连接层,其中每层设置40个神经元;第4个神经网络具有4个卷积层和1个全连接层。
然后,第二节点可以将N个局部先验分布发送给多个第一节点。第二节点可以向不同的第一节点发送不同的局部先验分布。例如,第二节点可以将第1个神经网络对应的局部先验分布发送给第一节点1,2,3;将第2个神经网对应的局部先验分布发送给第一节点4,5,6;将第3个神经网络对应的局部先验分布发送给第一节点7,8,9;将第4个神经网 络对应的局部先验分布发送给第一节点9,10,11。当然,为了对数据进行隐私保护,第二节点也可以向不同的第一节点发送相同的局部先验分布。
收到第i个神经网络对应的局部先验分布的第一节点,可以以该第i个神经网络对应的局部先验分布P_i(θ)作为本地模型参数的先验分布的初始值,然后使用本地训练数据进行1次或多次训练,以得到本地模型参数的后验分布。第一节点可以采用公式(7)作为本地训练过程的损失函数:

$$L_{k}(\theta)=-\log p(D_{k}\mid\theta)+D_{KL}\big(q_{k}(\theta)\,\|\,P_{i}(\theta)\big)\qquad(7)$$

第一节点将训练得到的本地模型参数的后验分布发送至第二节点。第二节点根据公式(8),以加权平均的方式对联邦模型的参数的先验分布进行更新:

$$P_{i}(\theta)\leftarrow\sum_{n=1}^{N_{i}}w_{n}^{(i)}\,q_{n}^{(i)}(\theta)\qquad(8)$$

公式(8)中,N_i表示基于第i个神经网络对应的局部先验分布进行本地训练后得到的本地模型参数的后验分布的数量,w_n^{(i)}表示该N_i个本地模型的参数的后验分布中的第n个本地模型的参数的后验分布的权重,该权重可以根据该第n个本地模型的参数的后验分布的本地训练数据的数据量占该N_i个本地模型的参数的后验分布的本地训练数据的数据总量的比重确定。
3.3、本地训练过程
第二节点选定的第一节点可以从第二节点获取到联邦模型的参数的先验分布{P_1(θ),…,P_N(θ)}。然后,第一节点可以使用本地训练数据测试联邦模型的参数的先验分布中的各个局部先验分布与本地训练数据的匹配度,并从中选择匹配度最高的局部先验分布P_{i*}(θ)。

在确定了与本地训练数据最匹配的局部先验分布P_{i*}(θ)之后,第一节点可以将该局部先验分布P_{i*}(θ)作为本地模型的参数的先验分布的初始值。然后,第一节点可以使用本地训练数据进行1次或多次训练,以获取本地模型的参数的后验分布。该训练过程可以以

$$L(\theta)=-\log p(D_{i^{*}}\mid\theta)$$

作为损失函数,其中D_{i*}表示该第i*个第一节点的本地训练数据。

或者,第一节点可以将该局部先验分布P_{i*}(θ)添加到本地训练过程的损失函数的正则化项中:

$$L(\theta)=-\log p(D_{i^{*}}\mid\theta)+D_{KL}\big(q(\theta)\,\|\,P_{i^{*}}(\theta)\big)$$

然后基于该损失函数,利用本地训练数据进行训练,以获取本地模型的参数的后验分布。
3.4、联邦模型的参数的先验分布的更新(或优化)过程
当参与联邦学习的第一节点将本地模型的参数的后验分布发送给第二节点之后,第二节点可以根据公式(9)更新联邦模型中的各个神经网络的参数的先验分布:
$$P_{i}(\theta)\leftarrow\sum_{n=1}^{N_{i}}w_{n}^{(i)}\,q_{n}^{(i)}(\theta)\qquad(9)$$

公式(9)中,N_i表示基于第i个神经网络对应的局部先验分布进行本地训练后得到的本地模型参数的后验分布的数量,w_n^{(i)}表示该N_i个本地模型的参数的后验分布中的第n个本地模型的参数的后验分布的权重,该权重可以根据该第n个本地模型的参数的后验分布的本地训练数据的数据量占该N_i个本地模型的参数的后验分布的本地训练数据的数据总量的比重确定。
示例4:
4.1、应用场景介绍
第二节点维护的联邦模型包括多个神经网络(如多个贝叶斯神经网络),其中每个神经网络的参数采用一个高斯分布进行描述。
联邦模型参数的先验分布包括与该多个神经网络一一对应的多个局部先验分布,且每个局部先验分布对后验分布进行“点描述”。
本地训练过程中,第一节点使用联邦模型参数的先验分布中的某个局部先验分布作为本地模型参数的先验分布,进行本地训练。例如,第一节点从第二节点维护的多个局部先验分布中选择与本地训练数据最匹配的一个局部先验分布,并将该局部先验分布作为本地模型参数的先验分布。
本地模型参数的先验分布和后验分布均对应相同大小的神经网络,且本地训练过程中,第一节点采用高斯分布作为似然函数进行贝叶斯优化。
4.2、初始化过程
第二节点对联邦模型的参数的先验分布进行初始化。该联邦模型的参数的先验分布包括N个局部先验分布(N为大于1的整数)。N个局部先验分布一一对应N个神经网络。
P_i(θ|η_i)表示联邦模型中的与第i个神经网络对应的局部先验分布,θ表示该第i个神经网络的参数,η_i用于描述θ的分布的先验值。以高斯分布为例,η_i可以是高斯分布的[均值,方差]。
第二节点将N个局部先验分布发送给不同的第一节点。如果考虑数据的隐私保护,第二节点也可以将不同的局部先验分布发送给相同的第一节点。
收到第i个神经网络对应的局部先验分布的第一节点,可以以P_i(θ|η_i)作为本地模型的参数的先验分布,使用本地训练数据训练得到本地模型的参数的后验分布。
本地训练过程本质上是一个优化过程,可以采用公式(10)作为优化目标:

$$\max_{q_{k}(\theta)}\;\mathbb{E}_{q_{k}(\theta)}\big[\log p(D_{k}\mid\theta)\big]-D_{KL}\big(q_{k}(\theta)\,\|\,P_{i}(\theta\mid\eta_{i})\big)\qquad(10)$$

其中,q_k(θ)表示本地模型的参数的后验分布,log p(D_k|θ)表示给定模型参数对应的似然函数,D_KL表示KL散度。
第一节点可以使用重参数化的方式进行优化,以得到本地模型的参数的后验分布q_k(θ)。训练结束后,第一节点可以将训练好的本地模型的参数的后验分布q_k(θ)发送给第二节点。第二节点可以根据各个第一节点提供的本地模型的参数的后验分布,采用公式(11)更新(或优化)联邦模型的参数的先验分布:

$$\eta_{i}^{*}=\arg\max_{\eta_{i}}\sum_{k\in S_{i}}\mathbb{E}_{q_{k}(\theta)}\big[\log P_{i}(\theta\mid\eta_{i})\big]\qquad(11)$$

其中S_i表示基于第i个局部先验分布进行本地训练的第一节点的集合。
4.3、本地训练过程
被第二节点选定的第一节点可以从第二节点获取到联邦模型的参数的先验分布{P_1(θ|η_1),…,P_N(θ|η_N)}。然后,第一节点可以测试联邦模型的参数的先验分布中的各个局部先验分布与本地训练数据的匹配度,并选择匹配度最高的局部先验分布P_{i*}(θ|η_{i*})。然后,第一节点可以以P_{i*}(θ|η_{i*})作为本地模型的参数的先验分布,使用本地训练数据,训练本地模型的参数的后验分布。
本地训练过程可以采用公式(12)作为优化目标:

$$\max_{q_{k}(\theta)}\;\mathbb{E}_{q_{k}(\theta)}\big[\log p(D_{k}\mid\theta)\big]-D_{KL}\big(q_{k}(\theta)\,\|\,P_{i^{*}}(\theta\mid\eta_{i^{*}})\big)\qquad(12)$$

其中,q_k(θ)表示本地模型的参数的后验分布,log p(D_k|θ)表示本地模型参数对应的似然函数,D_KL表示KL散度。
第一节点可以采用重参数化的方式进行优化,以确定本地模型的参数的后验分布的最优解。
4.4、联邦模型的参数的先验分布的更新(或优化)过程
当参与联邦学习的第一节点将本地模型的参数的后验分布发送给第二节点之后,第二节点可以根据公式(13)更新各个神经网络:
$$\eta_{i}^{*}=\arg\max_{\eta_{i}}\sum_{k\in S_{i}}\mathbb{E}_{q_{k}(\theta)}\big[\log P_{i}(\theta\mid\eta_{i})\big]\qquad(13)$$

其中S_i表示基于第i个局部先验分布进行本地训练的第一节点的集合。
示例5:
5.1、应用场景介绍
第二节点维护的联邦模型维护一个神经网络。该神经网络的每个参数采用一个分布进行描述。该神经网络的参数的先验分布对后验分布进行点描述。例如,联邦模型为一个贝叶斯神经网络,且该贝叶斯神经网络的每个参数采用一个高斯分布进行描述。
第一节点使用联邦模型参数的先验分布中的局部先验分布作为本地模型参数的先验分布。
第一节点的本地模型与联邦模型的尺寸相同,且本地模型参数的后验分布采用的是delta分布。
5.2、初始化过程
第二节点初始化一个神经网络,作为联邦模型。P(θ|η)表示该联邦模型的参数的先验分布,其中θ表示模型参数,η表示描述θ的分布的先验值。以高斯分布为例,η=[均值,方差]。
5.3、本地训练过程
首先,被第二节点选定的第一节点从第二节点处获取联邦模型的参数的先验分布P(θ|η)。
其次,第一节点以P(θ|η)作为本地模型的参数的先验分布,使用本地训练数据,训练得到本地模型的参数的后验分布。
具体地,本地模型的参数的后验分布的训练过程是一个优化过程。优化目标可以采用公式(14)定义:
$$\max_{\theta_{k}}\;\log p(D_{k}\mid\theta_{k})+\log P(\theta_{k}\mid\eta)\qquad(14)$$

公式(14)中,θ_k表示本地模型的参数。log p(D_k|θ_k)表示给定模型的参数对应的似然函数。可以采用梯度下降法训练本地模型参数θ_k,其后验分布为δ(θ_k)。δ(θ_k)表示该后验分布为delta分布。
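下面给出示例5中“后验分布为delta分布,采用梯度下降训练本地模型参数θ_k”的一个示意性草图(仅为示意,并非本申请的实现;使用 PyTorch,高斯先验项在对数域体现为L2式正则,模型、数据与正则系数均为假设)。

```python
import torch

# 联邦模型下发的先验 P(θ|η): 逐参数高斯 N(mu_prior, var_prior)
mu_prior, var_prior = torch.zeros(3), torch.ones(3)
x = torch.randn(64, 3)
y = x @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(64)

theta = mu_prior.clone().requires_grad_(True)        # delta后验:只学习一个点估计
opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(200):
    nll = 0.5 * ((y - x @ theta) ** 2).mean()         # -log p(D_k|θ_k),高斯似然
    log_prior = -0.5 * ((theta - mu_prior) ** 2 / var_prior).sum()
    loss = nll - 1e-2 * log_prior                     # 最大后验(MAP):似然 + 先验
    opt.zero_grad(); loss.backward(); opt.step()
print("本地训练得到的 θ_k =", theta.data)
```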
5.4、联邦模型的参数的先验分布的更新(或优化)过程
当参与联邦学习的第一节点将本地模型的参数的后验分布发送给第二节点之后,第二节点可以根据公式(15)更新各个神经网络:
$$\eta^{*}=\arg\max_{\eta}\Big[\log P(\eta)+\sum_{k=1}^{K}\log P(\theta_{k}\mid\eta)\Big]\qquad(15)$$
公式(15)中的P(η)表示η的分布,该分布可以事先由人为设定。
示例6:
6.1应用场景介绍
示例6旨在提供一种衡量各个第一节点的重要性的方案,从而能够在联邦学习过程中根据第一节点的重要性对参与联邦学习的第一节点进行选择,使得联邦学习的整个训练过程的稳定性最优。
例如,可以根据第一节点的本地模型的参数的模型的方差,为第一节点设置权重,并根据第一节点的权重对参与联邦学习的第一节点进行选择,或者根据第一节点的权重选择是否需要某个第一节点上传本地模型的参数的后验分布。
首先,可以为不同的第一节点设置各自对应的权重r(D k)。D k表示第k个第一节点的本地训练数据,因此,第一节点的权重也可以理解为对第一节点的本地训练数据的重要性的一种衡量。
然后,第二节点可以根据公式(16)最小化各个第一节点反馈的本地模型参数的后验分布的方差:
$$\min_{r}\;\mathbb{E}_{p_{data}(D_{k})}\Big[\frac{\mathrm{Var}\big(q_{k}(\theta)\big)}{r(D_{k})}\Big]\qquad(16)$$

公式(16)中的p_data(D_k)表示D_k在所有的第一节点的本地训练数据形成的数据集上出现的概率,考虑到权重之和应该为1,所以,公式(16)可以转换成如下的公式(17):

$$\min_{r}\;\mathbb{E}_{p_{data}(D_{k})}\Big[\frac{\mathrm{Var}\big(q_{k}(\theta)\big)}{r(D_{k})}\Big]\quad\text{s.t.}\;\sum_{k}r(D_{k})=1\qquad(17)$$

通过求解上式,可以得到第一节点的权重与本地模型参数的后验分布之间的关系为

$$r(D_{k})\propto p_{data}(D_{k})\sqrt{\mathrm{Var}\big(q_{k}(\theta)\big)}$$

如果本地模型的后验分布采用高斯分布,则第一节点的权重与本地模型参数的后验分布之间的关系可以表示为:

$$r(D_{k})\propto p_{data}(D_{k})\sqrt{\sum_{j}\sigma_{k,j}^{2}}$$

(j为本地模型中的参数的数量,σ_{k,j}^2为第k个第一节点的本地模型中第j个参数的后验方差)。
第二节点可以根据r(D k)对需要上传本地模型参数的后验分布的第一节点进行选择。第一节点也可以根据该r(D k)对其是否需要向第二节点发送本地的训练结果进行决策。例如,可以将r(D k)与固定阈值进行比较,以决定第一节点是否需要向第二节点发送本地的训练结果。或者,可以根据r(D k)计算第一节点被选择的概率,然后根据该概率决定是否需要向第二节点发送本地的训练结果。
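下面给出示例6中“根据本地模型参数后验分布的方差为第一节点设置权重r(D_k),并据此选择节点”的一个示意性草图(仅为示意,并非本申请的实现;权重的闭式关系以上文带假设的重构公式为准,此处按“权重与p_data(D_k)及后验方差之和的平方根成正比”实现,阈值与随机种子均为假设)。

```python
import numpy as np

def node_weights(p_data, posterior_vars):
    # p_data: 各节点本地数据量占比; posterior_vars: 各节点本地模型全部参数的后验方差
    raw = np.array([p * np.sqrt(np.sum(v)) for p, v in zip(p_data, posterior_vars)])
    return raw / raw.sum()

def select_nodes(weights, threshold=None, seed=0):
    # 方式一:与固定阈值比较;方式二:以权重作为被选择概率进行随机抽取
    rng = np.random.default_rng(seed)
    if threshold is not None:
        return [k for k, w in enumerate(weights) if w >= threshold]
    return [k for k, w in enumerate(weights) if rng.random() < w]

p_data = [0.3, 0.5, 0.2]
post_vars = [np.array([0.2, 0.1]), np.array([1.5, 0.8]), np.array([0.05, 0.03])]
w = node_weights(p_data, post_vars)
print(w, select_nodes(w, threshold=0.2))
```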
示例7:
示例7旨在提供一种联邦模型的简化方案,旨在当联邦模型的某个参数为较多分布的叠加时,以较少数量的分布的叠加对该较多分布的叠加进行近似。
例如,假设K个第一节点向第二节点上传本地模型参数的后验分布q k(θ)之后,第二节点根据如下的公式(18)更新联邦模型参数的先验分布:
$$\min_{\eta}\;\sum_{k=1}^{K}D_{KL}\big(q_{k}(\theta)\,\|\,p(\theta\mid\eta)\big)\qquad(18)$$

公式(18)中,D_KL表示KL散度,p(θ|η)表示联邦模型参数的先验分布。
公式(18)的最优解可以通过公式(19)表示:

$$p(\theta\mid\eta^{*})=\frac{1}{K}\sum_{k=1}^{K}q_{k}(\theta)\qquad(19)$$

假设本地模型参数的后验分布服从高斯分布,则公式(19)中的联邦模型参数的先验分布服从混合高斯分布,其中每个参数包含K个成分的混合高斯分布。由此可见,联邦模型参数的规模与本地模型参数相比,扩大了K倍,这样会造成较大的通信开销。
为了限制通信开销,可以采用公式(20)对联邦模型参数进行优化,将联邦模型的参数定义为最多包含M个成分(M<K)的混合高斯分布:

$$p(\theta\mid\eta)=\sum_{m=1}^{M}\rho_{m}\,N(\theta\mid\mu_{m},\Sigma_{m})\qquad(20)$$

公式(20)中,ρ_m表示M个成分中的第m个成分的比例,μ_m,Σ_m分别为高斯分布的均值和协方差矩阵。然后,通过引入ρ_m的先验分布—狄利克雷分布(Dirichlet分布),可以使得优化后的ρ_m变得稀疏(即包含较多的0元素),从而使得最终的参数的混合分布最多包含M个成分。由此可见,通过调整Dirichlet分布的参数,可以在联邦模型的精度与复杂度(即每个参数包含多少成分,这决定了联邦学习过程的通信开销)之间进行折中。
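下面给出示例7中“通过引入ρ_m的Dirichlet先验,将某一参数的混合分布限制为最多M个成分”的一个示意性草图(仅为示意,并非本申请的实现;借助 scikit-learn 的 BayesianGaussianMixture,其 weight_concentration_prior 即对应Dirichlet先验的参数,数据为假设的K=8个高斯后验的采样)。

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# 假设K=8个第一节点的后验均为一维高斯,先从"K个成分的混合"中采样,再拟合最多M个成分
rng = np.random.default_rng(0)
mus = rng.normal(0, 2, size=8)
samples = np.concatenate([rng.normal(m, 0.3, size=200) for m in mus]).reshape(-1, 1)

M = 3
gmm = BayesianGaussianMixture(
    n_components=M,
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-2,   # 较小的Dirichlet参数会使多余成分的权重趋于0(稀疏)
    max_iter=500, random_state=0,
).fit(samples)
print("成分比例 ρ_m:", gmm.weights_)
print("成分均值 μ_m:", gmm.means_.ravel())
```

通过增大或减小 weight_concentration_prior,即可在拟合精度与实际保留的成分数(即通信开销)之间进行折中,这与上文的讨论一致。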
示例8:
本申请对本地模型参数的后验分布的类型不做具体限定。示例8旨在给出一种特定的后验分布。
具体地,本地模型参数的后验分布可以服从如公式(21)所示的分布:
$$\theta\sim N\big(\mu_{k},\;\lambda(\mu_{k}-\mu)^{2}\big)\qquad(21)$$
公式(21)中,θ为本地模型参数的后验分布,μ为先验分布的均值,μ_k为该后验分布的均值。
公式(21)可以看出:当本地模型参数的后验分布的均值距离先验分布的均值较远时,该后验分布的方差也比较大。这样一来,采用公式(21)所示的分布形式可以使得本地模型参数的后验分布与先验分布之间尽量存在较大重叠,从而使得本地训练过程变得更加可靠。
上文结合图1至图6,详细描述了本申请的方法实施例,下面结合图7至图8,详细描述本申请的装置实施例。应理解,方法实施例的描述与装置实施例的描述相互对应,因此,未详细描述的部分可以参见前面方法实施例。
图7是本申请一个实施例提供的联邦学习的装置的结构示意图。该联邦学习的装置 700对应于上文中的第一节点,且装置700与第二节点通信连接。
如图7所示,装置700包括接收模块701和训练模块702。接收模块701可用于从第二节点接收联邦模型的参数的先验分布,其中联邦模型为参数服从分布的机器学习模型。训练模块702可用于根据联邦模型的参数的先验分布和装置的本地训练数据,训练得到装置的本地模型的参数的后验分布。
可选地,在一些实施例中,装置700还可包括:第一确定模块,用于根据本地模型的参数的后验分布,确定本地模型的不确定度;第一发送模块,用于当本地模型的不确定度满足第一预设条件时,向第二节点发送本地模型的参数的后验分布。
可选地,在一些实施例中,装置700还可包括:第二确定模块,用于根据本地模型的第一参数的后验分布,确定第一参数的不确定度,其中本地模型包括至少一个参数,第一参数为该至少一个参数中的任意一个;第二发送模块,用于当第一参数的不确定度满足第二预设条件时,向第二节点发送第一参数的后验分布。
可选地,在一些实施例中,装置700还可包括:第三确定模块,用于根据本地模型的参数的后验分布,确定本地模型的不确定度;当本地模型的不确定度满足第一预设条件时,根据本地模型的第一参数的后验分布,确定第一参数的不确定度,其中,本地模型包括至少一个参数,第一参数为至少一个参数中的任意一个;第三发送模块,用于当第一参数的不确定度满足第二预设条件时,向第二节点发送第一参数的后验分布。
图8是本申请另一实施例提供的联邦学习的装置的结构示意图。该联邦学习的装置800对应于上文中的第二节点,且装置800与第一节点通信连接。
如图8所示,装置800包括接收模块801和更新模块802。接收模块801可用于接收至少一个第一节点的本地模型的参数的后验分布。更新模块802可用于根据至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新,其中联邦模型为参数服从分布的机器学习模型。
可选地,在一些实施例中,装置800还可包括:选取模块,用于在装置接收至少一个第一节点的本地模型的参数的后验分布之前,从候选节点中选取至少一个第一节点,联邦学习包括多轮迭代,该至少一个第一节点为参与本轮迭代的节点,候选节点为在本轮迭代之前参与联邦学习的节点;第一发送模块,用于在装置接收至少一个第一节点的本地模型的参数的后验分布之前,向至少一个第一节点发送联邦模型的参数的先验分布。
可选地,在一些实施例中,选取模块用于根据候选节点向装置发送的评价信息,从候选节点中选取至少一个第一节点,其中评价信息用于表示联邦模型的参数的先验分布与候选节点的本地训练数据的匹配度,或者评价信息用于表示候选节点根据联邦模型的参数的先验分布训练得到的后验分布与候选节点的本地训练数据的匹配度,或者评价信息用于表示联邦模型的参数的先验分布与候选节点根据联邦模型的参数的先验分布训练得到的后验分布的匹配度。
可选地,在一些实施例中,选取模块用于根据候选节点的历史后验分布与联邦模型的参数的先验分布的差异,从候选节点中选取至少一个第一节点,其中历史后验分布为候选节点在本轮迭代之前得到的本地模型的参数的后验分布。
可选地,在一些实施例中,本地模型不包含不确定度不满足预设条件的参数。
图9是本申请实施例提供的一种联邦学习的装置的硬件结构示意图。图9所示的联 邦学习的装置900(该装置900具体可以是一种计算机设备)包括存储器901、处理器902、通信接口903以及总线904。其中,存储器901、处理器902、通信接口903通过总线904实现彼此之间的通信连接。
存储器901可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器901可以存储程序,当存储器901中存储的程序被处理器902执行时,处理器902和通信接口903用于执行本申请实施例的联邦学习的方法的各个步骤。
处理器902可以采用通用的CPU,微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的联邦学习的装置中的模块所需执行的功能,或者执行本申请方法实施例的联邦学习的方法。
处理器902还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的联邦学习的方法的各个步骤可以通过处理器902中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器902还可以是通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器901,处理器902读取存储器901中的信息,结合其硬件完成本申请实施例的联邦学习的装置中包括的模块所需执行的功能,或者执行本申请方法实施例的联邦学习的方法。
通信接口903使用例如但不限于收发器一类的收发装置,来实现装置900与其他设备或通信网络之间的通信。
总线904可包括在装置900各个部件(例如,存储器901、处理器902、通信接口903)之间传送信息的通路。
应理解,联邦学习的装置700中的接收模块701相当于联邦学习的装置900中的通信接口903,训练模块702可以相当于处理器902。或者,联邦学习的装置800中的接收模块801相当于联邦学习的装置900中的通信接口903,更新模块802可以相当于处理器902。
应注意,尽管图9所示的装置900仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置900还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置900还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置900也可仅仅包括实现本申请实施例所必须的器件,而不必包括图9所示的全部器件。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (50)

  1. 一种联邦学习的方法,其特征在于,包括:
    第一节点从第二节点接收联邦模型的参数的先验分布,其中所述联邦模型为参数服从分布的机器学习模型;
    所述第一节点根据所述联邦模型的参数的先验分布和所述第一节点的本地训练数据,训练得到所述第一节点的本地模型的参数的后验分布。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述第一节点根据所述本地模型的参数的后验分布,确定所述本地模型的不确定度;
    当所述本地模型的不确定度满足第一预设条件时,所述第一节点向所述第二节点发送所述本地模型的参数的后验分布。
  3. 根据权利要求2所述的方法,其特征在于,所述本地模型的不确定度是基于以下信息中的至少一种度量的:所述本地模型的参数的后验分布的方差,所述本地模型的参数的后验分布的收敛速度,或者所述本地模型的参数的后验分布的推断准确率。
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述第一节点根据所述本地模型的第一参数的后验分布,确定所述第一参数的不确定度,其中,所述本地模型包括至少一个参数,所述第一参数为所述至少一个参数中的任意一个;
    当所述第一参数的不确定度满足第二预设条件时,所述第一节点向所述第二节点发送所述第一参数的后验分布。
  5. 根据权利要求4所述的方法,其特征在于,所述第一参数的不确定度是基于所述第一参数的后验分布的方差度量的。
  6. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述第一节点根据所述本地模型的参数的后验分布,确定所述本地模型的不确定度;
    当所述本地模型的不确定度满足第一预设条件时,所述第一节点根据所述本地模型的第一参数的后验分布,确定所述第一参数的不确定度,其中,所述本地模型包括至少一个参数,所述第一参数为所述至少一个参数中的任意一个;
    当所述第一参数的不确定度满足第二预设条件时,所述第一节点向所述第二节点发送所述第一参数的后验分布。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,所述联邦模型的参数的先验分布包括多个局部先验分布,所述多个局部先验分布一一对应多个贝叶斯模型,
    所述第一节点根据所述联邦模型的参数的先验分布和所述第一节点的本地训练数据,训练得到所述第一节点的本地模型的参数的后验分布,包括:
    所述第一节点根据所述多个局部先验分布与所述本地训练数据的匹配度,确定所述第一节点的本地模型的参数的先验分布;
    所述第一节点根据所述本地模型的参数的先验分布和所述本地训练数据,训练得到所述本地模型的参数的后验分布。
  8. 根据权利要求7所述的方法,其特征在于,所述联邦学习包括多轮迭代,所述本地模型的参数的后验分布为经过本轮迭代得到的本地模型的参数的后验分布,
    所述第一节点根据所述多个局部先验分布与所述本地训练数据的匹配度,确定所述 第一节点的本地模型的参数的先验分布,包括:
    所述第一节点根据所述多个局部先验分布与历史后验分布之间的差异,确定所述第一节点的本地模型的参数的先验分布,其中所述历史后验分布为所述第一节点在所述本轮迭代之前得到的本地模型的参数的后验分布。
  9. 根据权利要求8所述的方法,其特征在于,所述本地模型的参数的先验分布为所述多个局部先验分布中的与所述历史后验分布差异最小的先验分布;或者,所述本地模型的参数的先验分布为所述多个局部先验分布的加权和,其中所述多个局部先验分布在所述加权和中分别所占的权重由所述多个局部先验分布与所述历史后验分布之间的差异确定。
  10. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述第一节点向所述第二节点发送所述本地模型的参数的后验分布。
  11. 根据权利要求1-10中任一项所述的方法,其特征在于,所述机器学习模型为神经网络。
  12. 根据权利要求1-11中任一项所述的方法,其特征在于,所述联邦模型的参数的先验分布为所述联邦模型的参数的概率分布,或者为所述联邦模型的参数的概率分布的概率分布。
  13. 根据权利要求1-12中任一项所述的方法,其特征在于,所述第一节点和所述第二节点分别为网络中的客户端和服务器。
  14. 一种联邦学习的方法,其特征在于,包括:
    第二节点接收至少一个第一节点的本地模型的参数的后验分布;
    所述第二节点根据所述至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新,其中所述联邦模型为参数服从分布的机器学习模型。
  15. 根据权利要求14所述的方法,其特征在于,在所述第二节点接收至少一个第一节点的本地模型的参数的后验分布之前,所述方法还包括:
    所述第二节点从候选节点中选取所述至少一个第一节点,所述联邦学习包括多轮迭代,所述至少一个第一节点为参与本轮迭代的节点,所述候选节点为在所述本轮迭代之前参与所述联邦学习的节点;
    所述第二节点向所述至少一个第一节点发送所述联邦模型的参数的先验分布。
  16. 根据权利要求15所述的方法,其特征在于,所述第二节点从候选节点中选取所述至少一个第一节点,包括:
    所述第二节点根据所述候选节点向所述第二节点发送的评价信息,从所述候选节点中选取所述至少一个第一节点,其中所述评价信息用于表示所述联邦模型的参数的先验分布与所述候选节点的本地训练数据的匹配度,或者所述评价信息用于表示所述候选节点根据所述联邦模型的参数的先验分布训练得到的后验分布与所述候选节点的本地训练数据的匹配度,或者所述评价信息用于表示所述联邦模型的参数的先验分布与所述候选节点根据所述联邦模型的参数的先验分布训练得到的后验分布的匹配度。
  17. 根据权利要求15所述的方法,其特征在于,所述第二节点从候选节点中选取所述至少一个第一节点,包括:
    所述第二节点根据所述候选节点的历史后验分布与所述联邦模型的参数的先验分布 的差异,从所述候选节点中选取所述至少一个第一节点,其中所述历史后验分布为所述候选节点在所述本轮迭代之前得到的本地模型的参数的后验分布。
  18. 根据权利要求14-17中任一项所述的方法,其特征在于,所述本地模型不包含不确定度不满足预设条件的参数。
  19. 根据权利要求14-18中任一项所述的方法,其特征在于,所述至少一个第一节点包括多个第一节点,且所述多个第一节点的本地模型的参数的后验分布均包括第一参数的后验分布,
    所述第二节点根据所述至少一个第一节点的本地模型的参数的后验分布,对联邦模型的参数的先验分布进行更新,包括:
    如果所述多个第一节点的所述第一参数的后验分布之间的差异大于预设阈值,所述第二节点对所述联邦模型的参数的先验分布进行更新,以将所述第一参数拆分成多个参数。
  20. 根据权利要求14-19中任一项所述的方法,其特征在于,所述联邦模型的参数的先验分布包括多个局部先验分布,所述多个局部先验分布一一对应多个贝叶斯模型。
  21. 根据权利要求14-20中任一项所述的方法,其特征在于,所述机器学习模型为神经网络。
  22. 根据权利要求14-21中任一项所述的方法,其特征在于,所述联邦模型的参数的先验分布为所述联邦模型的参数的概率分布,或者为所述联邦模型的参数的概率分布的概率分布。
  23. 根据权利要求14-22中任一项所述的方法,其特征在于,所述第一节点和所述第二节点分别为网络中的客户端和服务器。
  24. 一种联邦学习的装置,其特征在于,所述装置为与第二节点通信连接的第一节点,所述装置包括:
    接收模块,用于从所述第二节点接收联邦模型的参数的先验分布,其中所述联邦模型为参数服从分布的机器学习模型;
    训练模块,用于根据所述联邦模型的参数的先验分布和所述装置的本地训练数据,训练得到所述装置的本地模型的参数的后验分布。
  25. 根据权利要求24所述的装置,其特征在于,所述装置还包括:
    第一确定模块,用于根据所述本地模型的参数的后验分布,确定本地模型的不确定度;
    第一发送模块,用于当所述本地模型的不确定度满足第一预设条件时,向所述第二节点发送所述本地模型的参数的后验分布。
  26. 根据权利要求25所述的装置,其特征在于,所述本地模型的不确定度是基于以下信息中的至少一种度量的:所述本地模型的参数的后验分布的方差,所述本地模型的参数的后验分布的收敛速度,或者所述本地模型的参数的后验分布的推断准确率。
  27. 根据权利要求24所述的装置,其特征在于,所述装置还包括:
    第二确定模块,用于根据所述本地模型的第一参数的后验分布,确定所述第一参数的不确定度,所述本地模型包括至少一个参数,所述第一参数为所述至少一个参数中的任意一个;
    第二发送模块,用于当所述第一参数的不确定度满足第二预设条件时,向所述第二节点发送所述第一参数的后验分布。
  28. 根据权利要求27所述的装置,其特征在于,所述第一参数的不确定度是基于所述第一参数的后验分布的方差度量的。
  29. 根据权利要求24所述的装置,其特征在于,所述装置还包括:
    第三确定模块,用于根据所述本地模型的参数的后验分布,确定所述本地模型的不确定度;当所述本地模型的不确定度满足第一预设条件时,根据所述本地模型的第一参数的后验分布,确定所述第一参数的不确定度,其中,所述本地模型包括至少一个参数,所述第一参数为所述至少一个参数中的任意一个;
    第三发送模块,用于当所述第一参数的不确定度满足第二预设条件时,向所述第二节点发送所述第一参数的后验分布。
  30. 根据权利要求24-29中任一项所述的装置,其特征在于,所述联邦模型的参数的先验分布包括多个局部先验分布,所述多个局部先验分布一一对应多个贝叶斯模型,
    所述训练模块用于根据所述多个局部先验分布与所述本地训练数据的匹配度,确定所述装置的本地模型的参数的先验分布;根据所述本地模型的参数的先验分布和所述本地训练数据,训练得到所述本地模型的参数的后验分布。
  31. 根据权利要求30所述的装置,其特征在于,所述联邦学习包括多轮迭代,所述本地模型的参数的后验分布为经过本轮迭代得到的本地模型的参数的后验分布,
    所述训练模块用于根据所述多个局部先验分布与历史后验分布之间的差异,确定所述装置的本地模型的参数的先验分布,其中所述历史后验分布为所述装置在所述本轮迭代之前得到的本地模型的参数的后验分布。
  32. 根据权利要求31所述的装置,其特征在于,所述本地模型的参数的先验分布为所述多个局部先验分布中的与所述历史后验分布差异最小的先验分布;或者,所述本地模型的参数的先验分布为所述多个局部先验分布的加权和,其中所述多个局部先验分布在所述加权和中分别所占的权重由所述多个局部先验分布与所述历史后验分布之间的差异确定。
  33. 根据权利要求24所述的装置,其特征在于,所述装置还包括:
    第三发送模块,用于向所述第二节点发送所述本地模型的参数的后验分布。
  34. 根据权利要求24-33中任一项所述的装置,其特征在于,所述机器学习模型为神经网络。
  35. 根据权利要求24-34中任一项所述的装置,其特征在于,所述联邦模型的参数的先验分布为所述联邦模型的参数的概率分布,或者为所述联邦模型的参数的概率分布的概率分布。
  36. 根据权利要求24-35中任一项所述的装置,其特征在于,所述装置和所述第二节点分别为网络中的客户端和服务器。
  37. 一种联邦学习的装置,其特征在于,所述装置为与第一节点通信连接的第二节点,所述装置包括:
    接收模块,用于接收至少一个第一节点的本地模型的参数的后验分布;
    更新模块,用于根据所述至少一个第一节点的本地模型的参数的后验分布,对联邦 模型的参数的先验分布进行更新,其中所述联邦模型为参数服从分布的机器学习模型。
  38. 根据权利要求37所述的装置,其特征在于,所述装置还包括:
    选取模块,用于在所述装置接收所述至少一个第一节点的本地模型的参数的后验分布之前,从候选节点中选取所述至少一个第一节点,所述联邦学习包括多轮迭代,所述至少一个第一节点为参与本轮迭代的节点,所述候选节点为在所述本轮迭代之前参与所述联邦学习的节点;
    第一发送模块,用于在所述装置接收所述至少一个第一节点的本地模型的参数的后验分布之前,向所述至少一个第一节点发送所述联邦模型的参数的先验分布。
  39. 根据权利要求38所述的装置,其特征在于,所述选取模块用于根据所述候选节点向所述装置发送的评价信息,从所述候选节点中选取所述至少一个第一节点,其中所述评价信息用于表示所述联邦模型的参数的先验分布与所述候选节点的本地训练数据的匹配度,或者所述评价信息用于表示所述候选节点根据所述联邦模型的参数的先验分布训练得到的后验分布与所述候选节点的本地训练数据的匹配度,或者所述评价信息用于表示所述联邦模型的参数的先验分布与所述候选节点根据所述联邦模型的参数的先验分布训练得到的后验分布的匹配度。
  40. 根据权利要求38所述的装置,其特征在于,所述选取模块用于根据所述候选节点的历史后验分布与所述联邦模型的参数的先验分布的差异,从所述候选节点中选取所述至少一个第一节点,其中所述历史后验分布为所述候选节点在所述本轮迭代之前得到的本地模型的参数的后验分布。
  41. 根据权利要求37-40中任一项所述的装置,其特征在于,所述本地模型不包含不确定度不满足预设条件的参数。
  42. 根据权利要求37-41中任一项所述的装置,其特征在于,所述至少一个第一节点包括多个第一节点,且所述多个第一节点的本地模型的参数的后验分布均包括第一参数的后验分布,所述更新模块用于在所述多个第一节点的所述第一参数的后验分布之间的差异大于预设阈值的情况下,对所述联邦模型的参数的先验分布进行更新,以将所述第一参数拆分成多个参数。
  43. 根据权利要求37-42中任一项所述的装置,其特征在于,所述联邦模型的参数的先验分布包括多个局部先验分布,所述多个局部先验分布一一对应多个贝叶斯模型。
  44. 根据权利要求37-43中任一项所述的装置,其特征在于,所述机器学习模型为神经网络。
  45. 根据权利要求37-44中任一项所述的装置,其特征在于,所述联邦模型的参数的先验分布为所述联邦模型的参数的概率分布,或者为所述联邦模型的参数的概率分布的概率分布。
  46. 根据权利要求37-45中任一项所述的装置,其特征在于,所述第一节点和所述装置分别为网络中的客户端和服务器。
  47. 一种芯片,其特征在于,包括:所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行如权利要求1-23中任一项所述的方法。
  48. 根据权利要求47所述的芯片,其特征在于,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执 行时,所述处理器用于执行如权利要求1-23中任一项所述的方法。
  49. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1-23任意一项所述的方法。
  50. 一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行如权利要求1-23任意一项所述的方法。
PCT/CN2021/100098 2020-06-23 2021-06-15 联邦学习的方法、装置和芯片 WO2021259090A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21829619.2A EP4156039A4 (en) 2020-06-23 2021-06-15 METHOD AND APPARATUS FOR FEDERATED LEARNING AND CHIP
US18/080,523 US20230116117A1 (en) 2020-06-23 2022-12-13 Federated learning method and apparatus, and chip

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010593841.3A CN111898764A (zh) 2020-06-23 2020-06-23 联邦学习的方法、装置和芯片
CN202010593841.3 2020-06-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/080,523 Continuation US20230116117A1 (en) 2020-06-23 2022-12-13 Federated learning method and apparatus, and chip

Publications (1)

Publication Number Publication Date
WO2021259090A1 true WO2021259090A1 (zh) 2021-12-30

Family

ID=73207076

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100098 WO2021259090A1 (zh) 2020-06-23 2021-06-15 联邦学习的方法、装置和芯片

Country Status (4)

Country Link
US (1) US20230116117A1 (zh)
EP (1) EP4156039A4 (zh)
CN (1) CN111898764A (zh)
WO (1) WO2021259090A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662340A (zh) * 2022-04-29 2022-06-24 烟台创迹软件有限公司 称重模型方案的确定方法、装置、计算机设备及存储介质
CN115277555A (zh) * 2022-06-13 2022-11-01 香港理工大学深圳研究院 异构环境的网络流量分类方法、装置、终端及存储介质
CN115905648A (zh) * 2023-01-06 2023-04-04 北京锘崴信息科技有限公司 基于高斯混合模型的用户群和金融用户群分析方法及装置

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111898764A (zh) * 2020-06-23 2020-11-06 华为技术有限公司 联邦学习的方法、装置和芯片
US11790039B2 (en) * 2020-10-29 2023-10-17 EMC IP Holding Company LLC Compression switching for federated learning
US20220156633A1 (en) * 2020-11-19 2022-05-19 Kabushiki Kaisha Toshiba System and method for adaptive compression in federated learning
CN112686388A (zh) * 2020-12-10 2021-04-20 广州广电运通金融电子股份有限公司 一种在联邦学习场景下的数据集划分方法及系统
CN112804304B (zh) * 2020-12-31 2022-04-19 平安科技(深圳)有限公司 基于多点输出模型的任务节点分配方法、装置及相关设备
CN113822436A (zh) * 2021-03-12 2021-12-21 京东科技控股股份有限公司 联邦学习模型训练的通信方法、装置和电子设备
CN113033823B (zh) * 2021-04-20 2022-05-10 支付宝(杭州)信息技术有限公司 一种模型训练方法、系统及装置
CN113609785B (zh) * 2021-08-19 2023-05-09 成都数融科技有限公司 基于贝叶斯优化的联邦学习超参数选择系统及方法
CN113420335B (zh) * 2021-08-24 2021-11-12 浙江数秦科技有限公司 一种基于区块链的联邦学习系统
CN116419257A (zh) * 2021-12-29 2023-07-11 华为技术有限公司 一种通信方法及装置
GB202214033D0 (en) * 2022-09-26 2022-11-09 Samsung Electronics Co Ltd Method and system for federated learning
CN116187430A (zh) * 2023-01-31 2023-05-30 华为技术有限公司 一种联邦学习方法及相关装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876038A (zh) * 2018-06-19 2018-11-23 中国原子能科学研究院 大数据、人工智能、超算协同的材料性能预测方法
CN110490335A (zh) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 一种计算参与者贡献率的方法及装置
CN111190487A (zh) * 2019-12-30 2020-05-22 中国科学院计算技术研究所 一种建立数据分析模型的方法
CN111898764A (zh) * 2020-06-23 2020-11-06 华为技术有限公司 联邦学习的方法、装置和芯片

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109189825B (zh) * 2018-08-10 2022-03-15 深圳前海微众银行股份有限公司 横向数据切分联邦学习建模方法、服务器及介质
CN110442457A (zh) * 2019-08-12 2019-11-12 北京大学深圳研究生院 基于联邦学习的模型训练方法、装置及服务器
CN111222646B (zh) * 2019-12-11 2021-07-30 深圳逻辑汇科技有限公司 联邦学习机制的设计方法、装置和存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876038A (zh) * 2018-06-19 2018-11-23 中国原子能科学研究院 大数据、人工智能、超算协同的材料性能预测方法
CN110490335A (zh) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 一种计算参与者贡献率的方法及装置
CN111190487A (zh) * 2019-12-30 2020-05-22 中国科学院计算技术研究所 一种建立数据分析模型的方法
CN111898764A (zh) * 2020-06-23 2020-11-06 华为技术有限公司 联邦学习的方法、装置和芯片

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUROCHKIN MIKHAIL, AGARWAL MAYANK, GHOSH SOUMYA, GREENEWALD KRISTJAN, HOANG TRONG NGHIA, KHAZAENI YASAMAN: "Bayesian Nonparametric Federated Learning of Neural Networks", PROCEEDINGS OF THE 36TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, 28 May 2019 (2019-05-28), XP055885615, Retrieved from the Internet <URL:http://proceedings.mlr.press/v97/yurochkin19a/yurochkin19a.pdf> [retrieved on 20220131] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114662340A (zh) * 2022-04-29 2022-06-24 烟台创迹软件有限公司 称重模型方案的确定方法、装置、计算机设备及存储介质
CN114662340B (zh) * 2022-04-29 2023-02-28 烟台创迹软件有限公司 称重模型方案的确定方法、装置、计算机设备及存储介质
CN115277555A (zh) * 2022-06-13 2022-11-01 香港理工大学深圳研究院 异构环境的网络流量分类方法、装置、终端及存储介质
CN115277555B (zh) * 2022-06-13 2024-01-16 香港理工大学深圳研究院 异构环境的网络流量分类方法、装置、终端及存储介质
CN115905648A (zh) * 2023-01-06 2023-04-04 北京锘崴信息科技有限公司 基于高斯混合模型的用户群和金融用户群分析方法及装置

Also Published As

Publication number Publication date
US20230116117A1 (en) 2023-04-13
EP4156039A1 (en) 2023-03-29
CN111898764A (zh) 2020-11-06
EP4156039A4 (en) 2023-11-08

Similar Documents

Publication Publication Date Title
WO2021259090A1 (zh) 联邦学习的方法、装置和芯片
US11783199B2 (en) Image description information generation method and apparatus, and electronic device
CN113067873B (zh) 基于深度强化学习的边云协同优化方法
CN109902546B (zh) 人脸识别方法、装置及计算机可读介质
WO2021254114A1 (zh) 构建多任务学习模型的方法、装置、电子设备及存储介质
KR20190068255A (ko) 고정 소수점 뉴럴 네트워크를 생성하는 방법 및 장치
US20210342696A1 (en) Deep Learning Model Training Method and System
US20210224692A1 (en) Hyperparameter tuning method, device, and program
WO2022267036A1 (zh) 神经网络模型训练方法和装置、数据处理方法和装置
WO2024160216A1 (zh) 一种联邦学习方法及相关装置
WO2022088063A1 (zh) 神经网络模型的量化方法和装置、数据处理的方法和装置
CN114004383A (zh) 时间序列预测模型的训练方法、时间序列预测方法及装置
WO2023179609A1 (zh) 一种数据处理方法及装置
Hemmat et al. $\text {Edge}^{n} $ AI: Distributed Inference with Local Edge Devices and Minimal Latency
Hu et al. Content-Aware Adaptive Device–Cloud Collaborative Inference for Object Detection
CN113705724A (zh) 基于自适应l-bfgs算法的深度神经网络的批量学习方法
CN114091652A (zh) 脉冲神经网络模型训练方法、处理芯片以及电子设备
CN113657592B (zh) 一种软件定义卫星自适应剪枝模型压缩方法
CN114449536B (zh) 一种基于深度强化学习的5g超密集网络多用户接入选择方法
WO2023123275A1 (zh) 确定分布式训练算法框架配置方法、装置及系统
EP4109374A1 (en) Data processing method and device
CN112329404A (zh) 基于事实导向的文本生成方法、装置和计算机设备
Tang et al. Digital Twin-Enabled Efficient Federated Learning for Collision Warning in Intelligent Driving
CN112396069B (zh) 基于联合学习的语义边缘检测方法、装置、系统及介质
CN115510593B (zh) 一种基于lstm的mr阻尼器逆向映射模型计算方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21829619

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021829619

Country of ref document: EP

Effective date: 20221222

NENP Non-entry into the national phase

Ref country code: DE