CN114429219A - Long-tail heterogeneous data-oriented federal learning method

Publication number: CN114429219A
Application number: CN202111502142.4A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 卢杨; 尚心怡; 黄刚; 华炜; 王菡子
Assignees: Xiamen University; Zhejiang Lab
Application filed by Xiamen University and Zhejiang Lab
Priority date / Filing date: 2021-12-09
Publication date: 2022-05-03
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The invention discloses a federated learning method oriented to long-tail heterogeneous data, which comprises the following steps: step one, a server side randomly initializes a global model w and sends the model parameters to each client side; each client side updates its local model using the received model parameters and uploads the updated model parameters to the server side; step two, the server side aggregates the received local model parameters to obtain a teacher model and a student model; step three, the server side calibrates the teacher model obtained in step two, so that the teacher model learns unbiased knowledge and can teach a good student model; step four, the unbiased knowledge of the teacher model is transferred to the student model through knowledge distillation, and the student model is then sent to each client side to start the next round of federated training.

Description

Long-tail heterogeneous data-oriented federal learning method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a federated learning method for long-tailed heterogeneous data.
Background
With the further development of big data, attention to data privacy and security has become a worldwide trend. At the same time, most industrial data exhibit a data-island phenomenon, so how to carry out cross-organization data cooperation while satisfying user privacy protection, data security and regulatory requirements has become a major problem troubling artificial intelligence practitioners. Federated learning is expected to become a key technology for solving this industry problem.
Federated learning was originally proposed by Google in 2016. It is a machine learning paradigm whose objective is to realize joint modeling and improve the effect of artificial intelligence models on the basis of guaranteeing data privacy, security and legal compliance. One of the core challenges of federated learning is the heterogeneity of the data distributions among the participating parties, i.e., heterogeneous data, which can greatly degrade the performance of federated learning. Meanwhile, the overall data distribution of the parties participating in the training tends to exhibit a long-tailed distribution rather than a balanced distribution, which results in the model performing well on the categories with many samples (head categories) and poorly on the categories with few samples (tail categories). Therefore, research on the joint problem of long-tailed distribution and heterogeneous data in federated learning is of great significance.
Existing methods for handling heterogeneous data in federated learning can mainly be divided into client-side methods and server-side methods. The first type, client-side methods, mainly regularizes the local model update of each client, using the knowledge of the global model to limit the training direction of the client's local model so that local models that differ greatly from the global model do not drag the whole system off course, thereby reducing the influence caused by heterogeneous data. The second type, server-side methods, mainly employs special aggregation strategies to mitigate the negative impact caused by heterogeneous data.
These methods solve the problem of heterogeneous data in federated learning to a certain extent, but they are all designed on the premise that the global data distribution is balanced. In a real scene, the global data distribution is almost never balanced and instead tends to approach a long-tailed distribution; the performance of the global model trained by these methods on the tail classes is still poor, so a good effect cannot be achieved.
Existing methods for long-tailed data mainly fall into three categories: re-sampling, re-weighting and representation learning. The first category mainly over-samples the classes with few samples and under-samples the classes with many samples, aiming to reconstruct a balanced data set. The second category modifies the loss function to make it more favorable for training the tail classes. The third category uses the characteristics of deep learning to focus on representation learning of the input data.
The above methods for handling long-tailed data distributions gather the data together for centralized model training, which is not suitable for a real federated learning environment. To solve the data imbalance problem under federated learning, Duan, M et al propose the Astraea method, which first performs data sampling before training the model to construct a balanced training data set, thereby alleviating the global imbalance; it then uses several mediators, assigns the client group for which each mediator is responsible according to the KL divergence between mediators, and reschedules the model training of the clients. By selectively combining clients with heterogeneous data, a new local balance may be achieved. However, this results in some clients never being selected, i.e., never participating in the federated training process, so their own information cannot be utilized. Wang, L et al propose the Ratio Loss method, which realizes monitoring of the data imbalance in federated learning, and a novel loss function, Ratio Loss, is proposed to reduce the influence caused by the imbalance. However, the performance of this method drops sharply as the degree of data heterogeneity increases.
Disclosure of Invention
In order to overcome the defects of the prior art, the present invention aims to improve model performance under federated learning while satisfying user privacy protection and data security, so as to improve image recognition efficiency, and adopts the following technical scheme:
a federal learning method for long-tailed heterogeneous data comprises the following steps:
S1, the server side randomly initializes the global model w and sends the model parameters to each client side; each client side updates its local model using the received model parameters and uploads the updated local model parameters to the server side;
S2, the server side aggregates the received local model parameters to obtain a teacher model and a student model;
S3, the server side calibrates the teacher model so that the teacher model learns on unbiased knowledge and can therefore teach a good student model;
and S4, transmitting the unbiased knowledge of the teacher model to the student model through knowledge distillation, and then sending the student model to each client side to start the next round of federated training (an illustrative sketch of one such round follows).
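For illustration only, the following minimal PyTorch-style sketch summarizes one server round implementing steps S1 to S4; the helper names (local_update, aggregate_student, build_teacher, calibrate_teacher, distill) are assumed placeholders, not part of the claimed method, and they correspond to the more detailed sketches given in the embodiment below.

    # Minimal sketch of one federated round (steps S1 to S4). All helper names
    # are illustrative placeholders, not part of the claimed method.
    import copy
    import random

    def federated_round(global_model, clients, server, frac=0.4):
        # S1: broadcast global parameters; selected clients update locally and upload.
        selected = random.sample(clients, k=max(1, int(frac * len(clients))))
        local_states, local_sizes = [], []
        for client in selected:
            local_model = copy.deepcopy(global_model)
            client.local_update(local_model)            # SGD on the client's own data
            local_states.append(local_model.state_dict())
            local_sizes.append(client.num_samples)

        # S2: aggregate into a student (parameter average) and a teacher (logits ensemble).
        student = server.aggregate_student(local_states, local_sizes)
        teacher = server.build_teacher(local_states)

        # S3: calibrate the teacher on the server's balanced labeled set.
        teacher = server.calibrate_teacher(teacher)

        # S4: distill the calibrated teacher into the student and broadcast it.
        student = server.distill(teacher, student)
        return student                                  # sent to the clients next round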
Further, in step S1, the server side initializes the global model parameters w, randomly selects the set S of clients participating in the current round of training, and broadcasts the model parameters to each client in S. Each client in S performs stochastic gradient descent (SGD) using the received global model parameters w and its local data to update its local model; the local model parameters obtained by client k after updating are denoted w_k. After the update, each client sends its updated model parameters back to the server side.
Further, step S2 includes the following steps:
S21, the server side averages the local model parameters to obtain a student model, and the calculation formula is as follows:

w = Σ_{k=1}^{K} (|D_k| / |D|) · w_k  (formula 1)

φ_s(x) = φ_w(x)  (formula 2)

wherein |D_k| represents the amount of image data owned by the k-th client, |D| represents the total amount of image data owned by all clients, K represents the number of clients, x represents the input image data, φ_w(·) represents the network of the federated averaging model, and φ_s(·) represents the network of the student model;

S22, the server side performs a weighted aggregation of the local models to obtain a teacher model, and the calculation formula is as follows:

φ_t(x) = Σ_{k=1}^{K} e_k · φ_{w_k}(x)  (formula 3)

wherein φ_t(·) represents the network of the teacher model, e_k represents the weight assigned to client k, and φ_{w_k}(·) represents the network of the k-th client.
Further, in step S3, since the local models are trained on local data with different distributions, each local model may behave differently on the tail classes, so we need to assign higher weights to the local models that perform better on the tail classes. However, the server does not know which image classes are tail classes and which client local models perform well on them, so we do not give each client a fixed weight; instead, we propose a client-based weight assignment strategy to calculate the weight e_k of each client local model, and finally e_k is normalized so that the weights sum to 1, giving the final weight. The calculation formula of the weight e_k is as follows:

e_k = a_e^T · φ_{w_k}(x) + b_e  (formula 4)

wherein a_e ∈ R^C and b_e represent learnable network parameters, R^C represents a C-dimensional vector, and T denotes the transpose. The client-based calibration works like a self-attention mechanism: weights are computed for the local models from their original output logits, and the weights are then multiplied back onto the original output logits.
Further, in step S3, if none of the client local models can handle the tail classes well, the teacher model obtained by the weighted ensemble is still biased toward the head classes. To solve this problem, a class-based calibration strategy for the original output logits is proposed to further improve the performance of the model on the tail classes. Let the calibrated model output logits be z_cl; the calculation formula is as follows:

z_cl = a_z ⊙ φ_t(x) + b_z  (formula 5)

wherein a_z and b_z represent learnable network parameters, and ⊙ denotes the Hadamard product.
Further, in step S3, the premise that the above logits calibration strategy is valid is that the representation information extracted by the local models from the input image data is good enough; if the feature extraction of the client local models is seriously affected by the long-tailed distribution, it is not enough to calibrate only the output logits, so we need to update the feature extractor to further improve the model performance. We use additional images with balanced labels at the server side to form a balanced labeled data set D^b and fine-tune the global model w on it to obtain a fine-tuned model ŵ. Because D^b is balanced, the fine-tuned model ŵ can provide an unbiased feature extractor, and we can then obtain the fine-tuned model output logits for the input image data x as

z_ft = φ_ŵ(x)

wherein z_ft represents the output of the fine-tuned model for x, and φ_ŵ(·) represents the network of the fine-tuned model.
Further, the fine-tuned model is obtained by

ŵ = w - η · ∇L(w; D^b)

wherein η represents the learning rate, L(·) represents the loss function, and ∇ represents the derivative.
Further, in step S3, z_cl and z_ft calibrate the teacher model at two different levels: z_cl calibrates the teacher model at the output-logits level while the model's feature extractor is fixed, whereas z_ft is the result of fine-tuning the feature extractor, thereby improving the feature extraction capability of the model. To fully combine the advantages of the two, a calibration gating network is proposed to trade off z_cl and z_ft. The calibration gating network takes the ensemble feature as input and outputs a weight through a nonlinear layer, so that each sample obtains a different weight according to its own features. The weight calculation formula is as follows:

σ = sigmoid(u^T · v)  (formula 6)

wherein v = Σ_{k=1}^{K} e_k · f_{w_k}(x) represents the ensemble feature, f_{w_k}(·) represents the feature extractor of the k-th client, u ∈ R^d represents a learnable network parameter, and R^d represents a d-dimensional vector. Thus, the final calibrated model output logits produced by the calibration gating network are z', and the calculation formula is as follows:

z' = σ · z_cl + (1 - σ) · z_ft  (formula 7)

wherein σ ∈ (0,1) is used to trade off the two model output logits z_cl and z_ft.
Further, all learnable parameters in the whole ensemble-calibration process are updated with the cross-entropy loss on D^b, as follows:

L = - Σ_{j=1}^{C} y_j · log( exp(z'_j) / Σ_{i=1}^{C} exp(z'_i) )  (formula 8)

wherein C represents the number of categories, y_j represents the j-th dimension of the true label y of the input image data, exp(·) represents the exponential function with the natural constant e as base, z'_j represents the value of the j-th dimension of the final calibration z', and z'_i represents the value of the i-th dimension of the final calibration z'.
Further, in step S4, the unbiased knowledge of the teacher model is transferred to the student model by knowledge distillation. Specifically, to better teach the unbiased knowledge of the teacher model to the student model, we train the student model by combining labeled-data training and unlabeled-data distillation, and the loss function is as follows:

L' = (1 - λ) · L_CE + λ · L_KL  (formula 9)

wherein L_CE represents the cross-entropy loss between the student model's output logits and the image's true label (ground truth), and L_KL represents the Kullback-Leibler (KL) divergence between the teacher and student models' output logits; the balanced labeled data set D^b is used to calculate L_CE, and another set of unlabeled images is used to construct an unlabeled data set D^u for calculating L_KL, so as to further improve the knowledge distillation performance; λ ∈ [0,1] represents a hyper-parameter that trades off L_CE and L_KL.
The invention has the advantages and beneficial effects that:
the invention researches the joint problem of heterogeneous data and long tail distribution in federal learning, fully utilizes the diversity of a local model of a client to process the heterogeneous data, provides a new model calibration strategy and a gating network to effectively solve the long tail problem, and further improves the model performance under the federal learning.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a diagram of a client data distribution in the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1, a federated learning method oriented to long-tail heterogeneous data includes the following steps:
step one, preparing a data set, initializing a network, distributing the data set to each client side, and updating a model.
Step 1.1, the data sets used are CIFAR-10, CIFAR-100 and ImageNet-LT.
The CIFAR-10 data set has 60000 color images of size 32 × 32, divided into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. There are 6000 images per class; 50000 images are used for training, forming 5 training batches of 10000 images each, and the other 10000 images are used for testing and form a single batch. The test batch contains 1000 images randomly taken from each of the 10 classes, and the remaining images are arranged in random order to form the training batches. Note that an individual training batch does not necessarily contain the same number of images from each class, but across all training batches there are 5000 images per class. In addition, 100 images are randomly selected from each class of CIFAR-10 to form the additional balanced data set D^b, and knowledge distillation is performed using CIFAR-100 as the unlabeled data.
CIFAR-100 has 100 classes, each containing 600 images: 500 training images and 100 test images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses, and each image carries a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). For CIFAR-100, 10 images are randomly selected from each class to form the additional balanced data set D^b, and knowledge distillation is performed using down-sampled ImageNet (image size 32 × 32) as the unlabeled data.
ImageNet-LT is a large image classification data set, a long-tailed version of ImageNet obtained by sampling a subset following a Pareto distribution with α = 6. It contains 115800 images in 1000 categories; the largest and smallest categories contain 1280 and 5 images, respectively. The balanced data set D^b is obtained from the balanced validation data, and knowledge distillation is performed using oversampled CIFAR-100 (image size 224 × 224) as the unlabeled data.
Each of the three data sets is partitioned across the different clients as their local data according to a Dirichlet distribution with heterogeneity degree 0.1; the resulting data distribution on CIFAR-10 is shown in fig. 2.
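For illustration, a minimal sketch of such a Dirichlet-based partition is given below; the helper name and its interface are assumptions, not part of the claimed method.

    # Minimal sketch of partitioning a labeled data set across clients with a
    # Dirichlet distribution (heterogeneity degree 0.1), as used in step 1.1.
    import numpy as np

    def dirichlet_partition(labels, num_clients=20, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        client_indices = [[] for _ in range(num_clients)]
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            # Per-class client proportions drawn from Dirichlet(alpha, ..., alpha).
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            splits = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
            for client_id, part in enumerate(np.split(idx, splits)):
                client_indices[client_id].extend(part.tolist())
        return client_indices  # list of sample indices assigned to each client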
Step 1.2, building a federated learning environment and initializing the network.
Training was performed on CIFAR-10-LT and CIFAR-100-LT using a ResNet-8 network and on ImageNet-LT using a ResNet-50 network. All our experiments were run with PyTorch on two NVIDIA GeForce RTX 3080 GPUs. Typically, we design 20 clients for 200 rounds of training, with 40% of the clients selected per round to participate in federated training. For client training, the batch size is set to 128, the learning rate is 0.1, and SGD is used as the optimizer. For server-side global model training, we set the calibration epochs to 100 and the distillation epochs to 100, and perform knowledge distillation with an Adam optimizer and a learning rate of 0.001.
Step 1.3, updating the client model.
The server side initializes the global model parameters w, randomly selects a client set S participating in the current round of training, and broadcasts the model parameters to the clients in S. Each client in S performs stochastic gradient descent (SGD) using the received global model parameters w and its local data to update its local model; the model parameters obtained by client k after updating are denoted w_k. After the update, each client sends its updated model parameters back to the server side.
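A minimal PyTorch sketch of this client update is given below; the number of local epochs is not specified in this embodiment and is an assumed parameter, while batch size and learning rate follow step 1.2.

    # Minimal sketch of the client update in step 1.3: load the broadcast global
    # parameters and run SGD on the client's local data.
    import torch
    import torch.nn.functional as F

    def client_update(model, global_state, loader, local_epochs=1, lr=0.1, device="cpu"):
        model.load_state_dict(global_state)            # start from the global model w
        model.to(device).train()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for images, targets in loader:
                images, targets = images.to(device), targets.to(device)
                optimizer.zero_grad()
                loss = F.cross_entropy(model(images), targets)
                loss.backward()
                optimizer.step()
        return model.state_dict()                      # w_k, sent back to the server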
Step two, the server side obtains a teacher model and a student model, and the specific process comprises the following substeps:
step 2.1, firstly, the server side carries out average weighting on the received model parameters to obtain a student model, and the calculation formula is as follows:
Figure BDA0003402166730000061
φs(x)=φw(x) (formula 2)
Step 2.2, then the server side carries out weighting aggregation on the model parameters obtained from the client side to obtain a teacher model, and the calculation formula is as follows:
Figure BDA0003402166730000062
wherein e iskIs the weight assigned to client k.
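For illustration, a minimal sketch of the two aggregations is given below: the student parameters follow formula 1, and the teacher is an ensemble whose logits follow formula 3. The class and function names are assumptions, not part of the claimed method.

    # Minimal sketch of step 2: parameter-averaged student and logits-ensemble teacher.
    import copy
    import torch

    def aggregate_student(local_states, local_sizes):
        total = float(sum(local_sizes))
        avg_state = copy.deepcopy(local_states[0])
        for key in avg_state:
            # Data-size-weighted average of each parameter (buffers averaged as floats).
            avg_state[key] = sum(
                (n / total) * state[key].float() for state, n in zip(local_states, local_sizes)
            )
        return avg_state                               # parameters w of the student phi_s

    class TeacherEnsemble(torch.nn.Module):
        """phi_t(x) = sum_k e_k * phi_{w_k}(x); e_k comes from the calibration step."""
        def __init__(self, client_models):
            super().__init__()
            self.client_models = torch.nn.ModuleList(client_models)

        def forward(self, x, weights):                 # weights: (K,) or (batch, K)
            logits = torch.stack([m(x) for m in self.client_models], dim=1)  # (B, K, C)
            if weights.dim() == 1:
                weights = weights.expand(logits.size(0), -1)
            return (weights.unsqueeze(-1) * logits).sum(dim=1)               # (B, C)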
Step three, the server side calibrates the teacher model obtained in step two.
Since the global data has a long-tailed distribution, the model obtained by each client is biased toward the head classes and performs poorly on the tail classes, and the teacher model obtained by weighting the client models is likewise biased toward the head classes and ignores the tail classes. Because the teacher model is biased toward the head classes, the knowledge it contains is heavily biased, which causes the taught student model to be biased and seriously impairs its performance. Therefore, the teacher model is calibrated to obtain an unbiased teacher model, and the unbiased knowledge is then taught to the student model. The specific method comprises the following substeps:
step 3.1, because the local models are trained on local data with different distributions, and each local model may behave differently on the tail class, we assign higher weights to the local models that behave better on the tail class. However, the server does not know which classes are tail classes and which local models perform well on top, so we do not give each client a fixed weight, but instead we propose a client-based weight assignment strategy to compute the weight e for each client local modelkAnd e is combinedkNormalization makes the sum equal to 1, which is the final weight. The calculation formula is as follows:
Figure BDA0003402166730000063
wherein, ae∈RCAnd beAre parameters that can be learned. Client-based calibration just like the self-attention mechanism, weights are computed for the local model from the original logits, which are then multiplied back to the original logits.
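A minimal sketch of this client-based weighting module is shown below; softmax is used here as one way to normalize the weights so that they sum to 1, and the module and parameter names are assumptions.

    # Minimal sketch of step 3.1 (formula 4): e_k = a_e^T phi_{w_k}(x) + b_e,
    # computed from each client model's logits and normalized to sum to 1.
    import torch
    import torch.nn as nn

    class ClientWeighting(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.a_e = nn.Parameter(torch.randn(num_classes) * 0.01)  # a_e in R^C
            self.b_e = nn.Parameter(torch.zeros(1))                   # b_e

        def forward(self, client_logits):              # (B, K, C): logits of the K clients
            e = client_logits @ self.a_e + self.b_e    # (B, K), self-attention-like score
            e = torch.softmax(e, dim=1)                # normalize so the weights sum to 1
            return e                                   # multiplied back onto the logits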
Step 3.2, if no local model can handle the tail classes well, the teacher model obtained by the weighted ensemble may still be biased toward the head classes. To solve this problem, we propose a class-based logits calibration strategy to further improve the performance of the model on the tail classes. Let the calibrated model output logits be z_cl; the calculation formula is as follows:

z_cl = a_z ⊙ φ_t(x) + b_z  (formula 5)

wherein a_z and b_z are learnable network parameters, and ⊙ denotes the Hadamard product.
Step 3.3, the premise that the above logits calibration strategy is effective is that the representation information extracted by the local models from the input data is good enough; if the local models' feature extraction is seriously affected by the long-tailed distribution, calibrating only the logits is not enough. Therefore, we need to update the feature extractor to further improve the model performance. We use the additional balanced labeled data D^b on the server side to fine-tune the global model w, obtaining a fine-tuned model ŵ. Because D^b is balanced, the fine-tuned model ŵ yields an unbiased feature extractor. Then, we can obtain the fine-tuned logits for the input x as

z_ft = φ_ŵ(x)
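A minimal sketch of this fine-tuning step on D^b is given below; the optimizer, learning rate and epoch count in the sketch are assumptions, not values fixed by this embodiment.

    # Minimal sketch of step 3.3: fine-tune the global model w on the balanced
    # labeled server set D^b to obtain an unbiased feature extractor; the logits
    # of the fine-tuned model are z_ft.
    import copy
    import torch
    import torch.nn.functional as F

    def finetune_on_balanced(global_model, balanced_loader, lr=0.1, epochs=10, device="cpu"):
        model = copy.deepcopy(global_model).to(device).train()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for images, targets in balanced_loader:    # D^b is class-balanced
                images, targets = images.to(device), targets.to(device)
                optimizer.zero_grad()
                F.cross_entropy(model(images), targets).backward()   # w <- w - eta * grad L
                optimizer.step()
        return model                                   # z_ft = model(x)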
Step 3.4, from the above steps it can be seen that z_cl and z_ft calibrate the teacher model at two different levels: z_cl calibrates the teacher model at the output-logits level while the model's feature extractor is fixed, whereas z_ft is the result of fine-tuning the feature extractor, thereby improving the feature extraction capability of the model. To fully combine the advantages of the two, we propose a calibration gating network to trade off z_cl and z_ft. The gating network takes the ensemble feature as input and outputs a weight through a nonlinear layer, so that each sample obtains a different weight according to its own features. The weight calculation formula is as follows:

σ = sigmoid(u^T · v)  (formula 6)

wherein v = Σ_{k=1}^{K} e_k · f_{w_k}(x) is the ensemble feature, f_{w_k}(·) is the feature extractor of the k-th client, and u ∈ R^d is a learnable network parameter. Thus, the final calibrated logits produced by the calibration gating network are z', and the calculation formula is as follows:

z' = σ · z_cl + (1 - σ) · z_ft  (formula 7)

wherein σ ∈ (0,1) is used to trade off the two logits.
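A minimal sketch combining the class-based calibration of formula 5 with the gating of formulas 6 and 7 is given below; the module name and parameter initializations are assumptions.

    # Minimal sketch of steps 3.2 and 3.4: z_cl = a_z ⊙ phi_t(x) + b_z, then
    # sigma = sigmoid(u^T v) with v = sum_k e_k f_{w_k}(x), and
    # z' = sigma * z_cl + (1 - sigma) * z_ft.
    import torch
    import torch.nn as nn

    class CalibrationGate(nn.Module):
        def __init__(self, num_classes, feat_dim):
            super().__init__()
            self.a_z = nn.Parameter(torch.ones(num_classes))     # a_z, element-wise scale
            self.b_z = nn.Parameter(torch.zeros(num_classes))    # b_z, element-wise shift
            self.u = nn.Parameter(torch.randn(feat_dim) * 0.01)  # u in R^d

        def forward(self, teacher_logits, z_ft, client_feats, client_weights):
            # Class-based calibration of the teacher logits (Hadamard product).
            z_cl = self.a_z * teacher_logits + self.b_z
            # Ensemble feature v = sum_k e_k * f_{w_k}(x), then the gate sigma.
            v = (client_weights.unsqueeze(-1) * client_feats).sum(dim=1)   # (B, d)
            sigma = torch.sigmoid(v @ self.u).unsqueeze(-1)                # (B, 1)
            return sigma * z_cl + (1.0 - sigma) * z_ft                     # z'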
Step 3.5, all learnable parameters in the whole ensemble-calibration process are updated with the cross-entropy loss on D^b, as follows:

L = - Σ_{j=1}^{C} y_j · log( exp(z'_j) / Σ_{i=1}^{C} exp(z'_i) )  (formula 8)
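For illustration, a minimal sketch of this calibration training loop is given below. It assumes each client model exposes a features(x) method returning its penultimate-layer features; that interface, like the other names, is an assumption rather than part of the claimed method. The Adam optimizer and epoch count follow step 1.2, and the weighting and gate modules are the sketches above.

    # Minimal sketch of step 3.5: update a_e, b_e, a_z, b_z and u with the
    # cross-entropy loss of the final calibrated logits z' on D^b (formula 8).
    import torch
    import torch.nn.functional as F

    def train_calibration(weighting, gate, client_models, finetuned, balanced_loader,
                          epochs=100, lr=1e-3, device="cpu"):
        params = list(weighting.parameters()) + list(gate.parameters())
        optimizer = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for images, targets in balanced_loader:
                images, targets = images.to(device), targets.to(device)
                logits = torch.stack([m(images) for m in client_models], dim=1)   # (B, K, C)
                e = weighting(logits)                       # client-based weights e_k
                teacher_logits = (e.unsqueeze(-1) * logits).sum(dim=1)
                feats = torch.stack([m.features(images) for m in client_models], dim=1)
                z_prime = gate(teacher_logits, finetuned(images), feats, e)
                loss = F.cross_entropy(z_prime, targets)    # formula 8
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()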
and step four, transmitting unbiased knowledge of the teacher model to the student model by using knowledge distillation.
To better teach the unbiased knowledge of the teacher model (i.e., the calibrated ensemble model) to the student model (i.e., the global model), we train the student model using a combination of labeled-data training and unlabeled-data distillation, with a loss function comprising two parts: (1) L_CE is the cross-entropy loss between the student model's logits and the ground-truth labels; (2) L_KL is the Kullback-Leibler (KL) divergence between the teacher and student models' logits. We use the balanced labeled data set D^b to calculate L_CE and another unlabeled data set D^u to calculate L_KL, so as to further improve the knowledge distillation performance. The final loss function is traded off by the hyper-parameter λ ∈ [0,1]:

L' = (1 - λ) · L_CE + λ · L_KL  (formula 9)
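A minimal sketch of one distillation step implementing formula 9 is given below; the temperature-free KL form and the value of λ are illustrative choices, not fixed by this embodiment.

    # Minimal sketch of step 4: cross-entropy on the labeled balanced set D^b plus
    # KL distillation from the calibrated teacher logits z' on the unlabeled set D^u.
    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher_logits_fn, labeled_batch, unlabeled_images,
                          lam=0.5):
        images, targets = labeled_batch
        loss_ce = F.cross_entropy(student(images), targets)             # L_CE on D^b
        with torch.no_grad():
            t_logits = teacher_logits_fn(unlabeled_images)              # calibrated z'
        s_log_prob = F.log_softmax(student(unlabeled_images), dim=1)
        loss_kl = F.kl_div(s_log_prob, F.softmax(t_logits, dim=1),
                           reduction="batchmean")                       # L_KL on D^u
        return (1.0 - lam) * loss_ce + lam * loss_kl                    # formula 9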
Table 1 compares the accuracy (%) of the present invention with several other federated learning methods on the CIFAR-10-LT and CIFAR-100-LT data sets with imbalance ratios of 100, 50 and 10. The bolded results in the table are the optimal results for each setting.
It can be seen from the results in Table 1 that the method of the present invention can solve the joint problem of long-tailed distribution and heterogeneous data in federated learning, and it achieves the highest test accuracy at all degrees of imbalance.
TABLE 1
Table 2 compares the accuracy (%) of the present invention with several federated learning methods on the ImageNet-LT data set. The bolded results in the table are the optimal results for each index.
Table 2 compares the accuracy of the methods on three groups of classes: head classes (more than 100 samples), middle classes (between 20 and 100 samples), and tail classes (fewer than 20 samples). Compared with the other methods, the method of the present invention achieves the best results. Meanwhile, the accuracy of the method on the tail classes reaches 15.91%; the method solves the joint problem of long-tailed distribution and heterogeneous data in federated learning, and greatly improves the performance of the model on the tail classes while improving its overall performance.
TABLE 2
In tables 1 and 2:
FedAvg corresponds to the method proposed by McMahan, B et al (McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273-);
FedAvgM corresponds to the method proposed by Hsu, T.-M. H et al (Hsu, T.-M. H.; Qi, H.; and Brown, M. 2019. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.);
FedProx corresponds to the method proposed by Li, T et al (Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2020b. Federated optimization in heterogeneous networks. In Machine Learning and Systems, 429-450.); FedProx improves model accuracy and the stability of convergence by adding a proximal-term correction to the loss function of the client update;
FedNova corresponds to the method proposed by Wang, J et al (Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; and Poor, H. V. 2020b. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems, 7611-7623.);
FedDF corresponds to the method proposed by Lin, T et al (Lin, T.; Kong, L.; Stich, S. U.; and Jaggi, M. 2020. Ensemble distillation for robust model fusion in federated learning. In Advances in Neural Information Processing Systems, 2351-);
FedBE corresponds to the method proposed by Chen, H.-Y et al (Chen, H.-Y.; and Chao, W.-L. 2021. FedBE: Making Bayesian model ensemble applicable to federated learning. In International Conference on Learning Representations.); FedBE realizes robust aggregation from the perspective of Bayesian inference by sampling higher-quality global models and combining them through a Bayesian model ensemble;
Fed-Focal Loss corresponds to the method proposed by Sarkar, D et al (Sarkar, D.; Narang, A.; and Rai, S. 2020. Fed-Focal Loss for imbalanced data classification in federated learning. arXiv preprint arXiv:2011.06283.);
Ratio Loss corresponds to the method proposed by Wang, L et al (Wang, L.; Xu, S.; Wang, X.; and Zhu, Q. 2021a. Addressing class imbalance in federated learning. In AAAI Conference on Artificial Intelligence, 10165-);
cRT, τ-norm and LWS correspond to the methods proposed by Kang, B et al (Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; and Kalantidis, Y. 2020. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations.), which show that data imbalance does not prevent learning high-quality representations of the input data and that strong long-tailed recognition capability can be achieved by adjusting only the classifier.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A federated learning method for long-tail heterogeneous data, characterized by comprising the following steps:
S1, the server side randomly initializes the global model w and sends the model parameters to the client sides; the client sides update their local models using the received model parameters and upload the updated local model parameters to the server side;
S2, the server side aggregates the local model parameters to obtain a teacher model and a student model;
S3, the server side calibrates the teacher model so that the teacher model learns on unbiased knowledge;
and S4, transmitting the unbiased knowledge of the teacher model to the student model through knowledge distillation, and then sending the student model to the clients to start the next round of federated training.
2. The federated learning method for long-tail heterogeneous data according to claim 1, wherein in step S1, the server side initializes the global model parameters w, randomly selects a set S of clients participating in the current round of training, and broadcasts the model parameters to the clients in the set S; each client in S performs stochastic gradient descent using the received global model parameters w and its local data to update its local model, the local model parameters obtained by client k being w_k; after the update, each client sends the updated model parameters back to the server side.
3. The federated learning method for long-tail heterogeneous data according to claim 2, wherein step S2 comprises the following steps:
S21, the server side averages the local model parameters to obtain a student model, and the calculation formula is as follows:

w = Σ_{k=1}^{K} (|D_k| / |D|) · w_k  (formula 1)

φ_s(x) = φ_w(x)  (formula 2)

wherein |D_k| represents the amount of data owned by the k-th client, |D| represents the total amount of data owned by all clients, K represents the number of clients, x represents the input data, φ_w(·) represents the network of the federated averaging model, and φ_s(·) represents the network of the student model;
S22, the server side performs a weighted aggregation of the local models to obtain a teacher model, and the calculation formula is as follows:

φ_t(x) = Σ_{k=1}^{K} e_k · φ_{w_k}(x)  (formula 3)

wherein φ_t(·) represents the network of the teacher model, e_k represents the weight of client k, and φ_{w_k}(·) represents the network of the k-th client.
4. The federated learning method for long-tail heterogeneous data according to claim 3, wherein in step S3, a client-based weight assignment strategy is proposed to calculate the weight e_k of each client local model, and e_k is normalized so that the weights sum to 1, giving the final weight; the calculation formula of the weight e_k is as follows:

e_k = a_e^T · φ_{w_k}(x) + b_e  (formula 4)

wherein a_e ∈ R^C and b_e represent learnable network parameters, R^C represents a C-dimensional vector, and T denotes the transpose; the weight is calculated for each local model from the model's original output and is then multiplied back onto the original output.
5. The federated learning method for long-tail heterogeneous data according to claim 4, wherein in step S3, a class-based calibration strategy for the original output is proposed, and the calibrated model output is z_cl, calculated as follows:

z_cl = a_z ⊙ φ_t(x) + b_z  (formula 5)

wherein a_z and b_z represent learnable network parameters, and ⊙ denotes the Hadamard product.
6. The federated learning method for long-tail heterogeneous data according to claim 5, wherein in step S3, an additional balanced labeled data set D^b on the server side is used to fine-tune the global model w, obtaining a fine-tuned model ŵ; the fine-tuned model output for the input data x is

z_ft = φ_ŵ(x)

wherein z_ft represents the output of the fine-tuned model for x, and φ_ŵ(·) represents the network of the fine-tuned model.
7. The federated learning method for long-tail heterogeneous data according to claim 6, wherein the fine-tuned model is obtained by

ŵ = w - η · ∇L(w; D^b)

wherein η represents the learning rate, L(·) represents the loss function, and ∇ represents the derivative.
8. The federated learning method for long-tail heterogeneous data according to claim 6, wherein in step S3, a calibration gating network is used to trade off z_cl and z_ft; the calibration gating network takes the ensemble feature as input and outputs a weight through a nonlinear layer, and the weight calculation formula is as follows:

σ = sigmoid(u^T · v)  (formula 6)

wherein v = Σ_{k=1}^{K} e_k · f_{w_k}(x) represents the ensemble feature, f_{w_k}(·) represents the feature extractor of the k-th client, u ∈ R^d represents a learnable network parameter, and R^d represents a d-dimensional vector; the final calibrated model output produced by the calibration gating network is z', and the calculation formula is as follows:

z' = σ · z_cl + (1 - σ) · z_ft  (formula 7)

wherein σ ∈ (0,1) is used to trade off the outputs z_cl and z_ft of the two models.
9. The federated learning method for long-tail heterogeneous data according to claim 8, wherein the learnable parameters of the whole ensemble-calibration process are updated with the cross-entropy loss on D^b, as follows:

L = - Σ_{j=1}^{C} y_j · log( exp(z'_j) / Σ_{i=1}^{C} exp(z'_i) )  (formula 8)

wherein C represents the number of categories, y_j represents the j-th dimension of the true label y of the input data, exp(·) represents the exponential function, z'_j represents the value of the j-th dimension of the final calibration z', and z'_i represents the value of the i-th dimension of the final calibration z'.
10. The federated learning method for long-tail heterogeneous data according to claim 1, wherein in step S4, the unbiased knowledge of the teacher model is transferred to the student model through knowledge distillation; specifically, the student model is trained with a combination of labeled-data training and unlabeled-data distillation, and the loss function is as follows:

L' = (1 - λ) · L_CE + λ · L_KL  (formula 9)

wherein L_CE represents the cross-entropy loss between the model output of the student model and the true label, L_KL represents the Kullback-Leibler divergence between the model outputs of the teacher and student models, a balanced labeled data set D^b is used to calculate L_CE, an unlabeled data set D^u is used to calculate L_KL, and λ ∈ [0,1] represents a hyper-parameter trading off L_CE and L_KL.

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271033A (en) * 2022-07-05 2022-11-01 西南财经大学 Medical image processing model construction and processing method based on federal knowledge distillation
CN115511108A (en) * 2022-09-27 2022-12-23 河南大学 Data set distillation-based federal learning personalized method
CN115907001A (en) * 2022-11-11 2023-04-04 中南大学 Knowledge distillation-based federal diagram learning method and automatic driving method
CN116701939A (en) * 2023-06-09 2023-09-05 浙江大学 Classifier training method and device based on machine learning
CN117010534A (en) * 2023-09-27 2023-11-07 中国人民解放军总医院 Dynamic model training method, system and equipment based on annular knowledge distillation and meta federal learning
CN117236421A (en) * 2023-11-14 2023-12-15 湘江实验室 Large model training method based on federal knowledge distillation
WO2024027164A1 (en) * 2022-08-01 2024-02-08 浙江大学 Adaptive personalized federated learning method supporting heterogeneous model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination