CN116306969A - Federated learning method and system based on self-supervised learning


Info

Publication number: CN116306969A
Application number: CN202310189525.3A
Authority: CN (China)
Prior art keywords: domain, model, data set, global model, self
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王德健, 林博, 董科雄, 王慧东, 赵冲
Current assignee: Hangzhou Yikang Huilian Technology Co ltd
Original assignee: Hangzhou Yikang Huilian Technology Co ltd
Application filed by: Hangzhou Yikang Huilian Technology Co ltd
Priority to: CN202310189525.3A

Classifications

    • G06N 20/00: Machine learning
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present application relates to a federated learning method and system based on self-supervised learning. The federated learning method is implemented between a plurality of participants and a central node and comprises: each participant trains a local model using its private data set and, during training, predicts an intra-domain data set to obtain predicted values; the central node trains a global model using the intra-domain data set and the predicted values corresponding to it; and a domain classifier is trained using the global model, the domain classifier extracting the intra-domain data set from an open data set. Because the global model is not obtained by simple linear combination, it has better global performance. In addition, the global model is trained on an intra-domain data set rather than on an open data set in the traditional sense, which weakens the participants' dependency on the open data set and reduces the negative influence of noise in the open data set on the global model.

Description

Federated learning method and system based on self-supervised learning
Technical Field
The present application relates to the field of deep learning, and in particular to a federated learning method and system based on self-supervised learning.
Background
Training a deep neural network model with high accuracy and good generalization typically requires large-scale and diverse data sets, but this requirement becomes difficult to meet when the data involve user privacy and personal information. With increased awareness of personal privacy protection, users may prefer to store their private data locally and refuse requests from internet companies to collect it. In other scenarios, when a model needs to be trained using data that span enterprises or departments, the law may require enterprises to clearly list the parties responsible for data protection and the scope in which the data may be used. All of these scenarios pose challenges for the application of artificial intelligence in real life.
To overcome this problem, Federated Learning (FL) provides a solution to the data island problem described above. It requires all participants to train a deep model locally using their private data sets and to aggregate the local models through a specific central node to obtain a global model with a consistent objective. Although federated learning has been applied effectively in scenarios of joint training on large-scale private data sets, it has certain limitations, and the following two problems remain to be solved in conventional federated learning.
(1) Non-independent and identically distributed participant data: federated learning assumes that each participant's private data are independently and identically distributed (IID). This assumption holds in small-scale federated learning, where multiple participants gather data from similar scenarios in the same way. However, when the scope of the problem extends to multiple geographic locations or multiple application scenarios, the participants' private data sets tend to be non-IID. In this case, the local models trained by the participants differ in their feature extraction capability, and a global model obtained by mere linear combination has weaker global performance.
(2) Model heterogeneity: traditional federated learning requires each participant to train a local model of the same architecture. This requirement is workable when the participants are equipped with the same hardware and software, but when the participants span a wide range of devices (from smart wearables to mobile terminals to data-center servers), federated learning can only compromise between model performance and training time. Owing to memory limitations, a barrel effect often occurs during training: the size of the model can only be set according to the participant with the weakest hardware.
The existing research direction for solving these problems is to aggregate the knowledge of multiple local models into a global model through transfer learning, so as to address the non-IID data problem. Specifically, the knowledge learned by each local model is uniformly quantified on an open data set, and the knowledge is then aggregated at the central node so as to combine the knowledge of all participants.
Such knowledge-distillation-based solutions to the non-IID problem need a shared open data set as the medium for knowledge transfer, which places high requirements on the feature distribution of the open data set; inconsistency between the feature distributions of the open data set and the private data sets can cause the participants to transfer misleading knowledge, thereby harming the generalization performance of the global model.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a federated learning method based on self-supervised learning.
The federated learning method based on self-supervised learning of the present application is implemented between a plurality of participants and a central node and comprises the following steps:
each participant trains a local model using its private data set and, during training, predicts an intra-domain data set to obtain predicted values;
the central node trains a global model using the intra-domain data set and the predicted values corresponding to the intra-domain data set;
a domain classifier is trained using the global model, the domain classifier extracting the intra-domain data set from an open data set.
Optionally, the global model, the domain classifier, and each of the local models are iteratively updated during the training process.
Optionally, when each local model predicts the intra-domain data set in the current round, it uses the intra-domain data set extracted by the domain classifier in the previous round.
Optionally, in the first round, the intra-domain data set is randomly extracted from the open data set.
Optionally, each local model serves as a teacher model and the global model as a student model, and the global model is iteratively updated by knowledge distillation;
and the average of the predicted values obtained by each local model is used to train the global model.
Optionally, the local models are of the same structural category, and each local model is iterated by distributing the global model to each participant.
Optionally, training a domain classifier using the global model includes:
The global model generates output layer information for an input sample;
the domain classifier obtains the input sample and the output layer information of the input sample;
and the domain classifier obtains a score from the output layer information and, according to the score, places the input samples that meet expectations into the intra-domain data set.
Optionally, placing the input samples that meet expectations into the intra-domain data set according to the score specifically includes: after the scores are sorted, an absolute number or a proportion of the corresponding input samples are extracted in order and placed into the intra-domain data set.
Optionally, training a domain classifier using the global model includes:
training the domain classifier in a self-supervised manner using middle layer information of the global model, wherein the middle layer information is derived from the feature maps before each batch normalization layer in the middle layers of the global model.
Optionally, the domain classifier includes a base model and a multi-layer perceptron, wherein the base model is the global model of each iteration and the multi-layer perceptron serves as a detection head.
Optionally, the domain classifier includes a multi-layer perceptron, and the training process includes:
performing data enhancement on input samples to obtain comparison samples, wherein the input samples correspond to the comparison samples one to one;
obtaining first-level features based on the input samples and second-level features based on the comparison samples, wherein the first-level features and the second-level features are in one-to-one correspondence;
and training the domain classifier using the first-level features, the second-level features, and the correspondence between them.
Optionally, the domain classifier extracts the intra-domain data set from an open data set, including:
the domain classifier receives the first-level features and the concatenation of the feature means stored in the batch normalization layers, and outputs the relative distance between them, wherein the relative distance is used to divide the open data set into the intra-domain data set and an out-of-domain data set.
Optionally, the domain classifier extracts the intra-domain data set from an open data set, including:
the domain classifier receives the first-level features and the concatenation of the feature means in the batch normalization layers, projects both into an embedding space, the relative distance being the cosine distance between the projections in the embedding space, and selects, according to the relative distance, the input samples that meet expectations and correspond to the first-level features, placing them into the intra-domain data set.
Optionally, the second-level features are obtained from the comparison samples in the same manner as the first-level features are obtained from the input samples;
the first-level features are obtained from an input sample using the following formula:
v(x) = GAP(f_1(x)) ⊕ GAP(f_2(x)) ⊕ … ⊕ GAP(f_n(x))
wherein x is an input sample;
v(x) is the first-level feature of the input sample;
f_i(x) denotes the feature map of the global model before the i-th batch normalization layer for the input sample x, and n is the number of batch normalization layers;
GAP denotes global average pooling of a two-dimensional feature map to obtain a scalar value;
⊕ denotes the series (concatenation) operation.
The present application also provides a federated learning system based on self-supervised learning, which comprises a plurality of participants and a central node and implements the above federated learning method based on self-supervised learning.
Optionally, the structural categories of the participants' local models include at least two kinds, and the federated learning method comprises the following steps:
the structural category of the global model is the same as that of the local model of one of the participants; the global model, the domain classifier, and each local model are iteratively updated during training, and in the federated aggregation stage of each iterative update:
for a local model with the same structural category as the global model, the local model is replaced and iterated using the global model;
for a local model whose structural category differs from that of the global model, the local model is updated and iterated by knowledge distillation using the intra-domain data set and the predicted values corresponding to the intra-domain data set.
Optionally, the local model is updated and iterated by knowledge distillation in either of the following ways:
the global model predicts the intra-domain data set, and the local models whose structural category differs from that of the global model are trained according to the predicted values;
or the average of the predicted values obtained by the local models is used to train the local models whose structural category differs from that of the global model.
The federated learning method and system based on self-supervised learning have at least the following effects:
The global model is not obtained by simple linear combination, so it has better global performance. In addition, the global model is trained on an intra-domain data set rather than on an open data set in the traditional sense, which weakens the participants' dependency on the open data set and reduces the negative influence of noise in the open data set on the global model.
Because the domain classifier is used to narrow the open data set, the method is applicable both when all local models are of the same structural category and when the participants' local models span at least two structural categories; in the latter case the global model not only has good global performance, but the heterogeneity problem of the local models is also solved.
The domain classifier is trained in a self-supervised manner and includes a base model, which is the global model of each round of iteration; this greatly improves the training speed of the domain classifier and accelerates the federated learning method.
Drawings
FIG. 1 is a flow chart of a federated learning method based on self-supervised learning in an embodiment of the present application;
FIG. 2 is an internal block diagram of a computer device used by a participant or the central node in a federated learning system based on self-supervised learning according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
An embodiment of the present application provides a federated learning method based on self-supervised learning, implemented between a plurality of participants and a central node.
For example, N participants are denoted {P_1, P_2, …, P_N}, and their private local data sets are denoted {D_1, D_2, …, D_N}. Each data set D_i consists of three parts, [X_i, Y_i, I_i], representing a feature space, a label space, and a sample number space, respectively. The private data of the N participants are non-independently and identically distributed, and the embodiments of the present application aim to train a global model M_fed under this non-IID premise so that it reaches as high a test accuracy as possible on the aggregate data set of all participants, D_agg = {D_1, D_2, …, D_N}, while each participant P_i must guarantee the privacy of its local data set D_i during training. The federated learning method based on self-supervised learning includes steps S100 to S300.
Step S100: each participant trains a local model using its private data set and, during training, predicts the intra-domain data set to obtain predicted values;
Step S200: the central node trains the global model using the intra-domain data set and the predicted values corresponding to the intra-domain data set;
Step S300: a domain classifier is trained using the global model, and the domain classifier extracts the intra-domain data set from the open data set.
Steps S100 to S300 constitute the training loop of the federated learning method: the intra-domain data set used in step S100 comes from step S300, the domain classifier trained in step S300 is obtained from the global model of step S200, and the predicted values used in step S200 come from step S100.
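For illustration only, the round structure of this loop can be sketched as follows in Python; all helper names (train_local, distill_global, train_domain_classifier, filter_in_domain) are hypothetical placeholders rather than part of the disclosed algorithm, and the averaged-prediction distillation they stand for is described in the following paragraphs.

```python
# A minimal sketch of the S100-S300 training loop; component functions are
# injected as callables and are illustrative assumptions, not the patent's code.
import random

def federated_round(participants, global_model, domain_classifier,
                    open_dataset, in_domain_set, train_local, distill_global,
                    train_domain_classifier, filter_in_domain):
    # S100: each participant trains locally and predicts the in-domain set
    local_models, predictions = [], []
    for p in participants:
        local_model = train_local(p, global_model)            # private data stays local
        local_models.append(local_model)
        predictions.append([local_model(x) for x in in_domain_set])

    # S200: the central node distills the averaged predictions into the global model
    avg_predictions = [sum(ps) / len(ps) for ps in zip(*predictions)]
    global_model = distill_global(global_model, in_domain_set, avg_predictions)

    # S300: the domain classifier is retrained from the new global model and
    # extracts the next round's in-domain set from the open dataset
    domain_classifier = train_domain_classifier(global_model, open_dataset)
    next_in_domain = filter_in_domain(domain_classifier, open_dataset)
    return global_model, domain_classifier, next_in_domain

def first_round_in_domain(open_dataset, k):
    # First round: the in-domain set is drawn at random from the open dataset
    return random.sample(list(open_dataset), k)
```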
In this embodiment, step S200 does not obtain the global model by simple linear combination, so the global model has better global performance. In addition, both step S100 and step S200 use the intra-domain data set rather than an open data set in the traditional sense, which weakens the participants' dependency on the open data set and reduces the negative impact of noise in the open data set on the global model. The training of the domain classifier in step S300 is based on the open (unlabeled) data set and is performed in a self-supervised manner with the global model as the base model.
Why the dependency on the open data set is weakened can be understood through intra-domain data detection, i.e., detecting intra-domain data and out-of-domain (OOD) samples in an unlabeled open data set. For example, when an image classification model faces out-of-domain data, it should ideally output a high-entropy distribution (low prediction certainty) with roughly equal class probabilities, because an input picture from out-of-domain data does not belong to any of the target classes. In practice, however, an OOD sample may share some characteristics with a target class, so that the predicted probability for that class becomes high and a misjudgment is caused. An intuitive way to distinguish intra-domain from out-of-domain data is to examine the entropy of the output probabilities: for example, a validation set is used to calibrate an entropy threshold separating intra-domain and out-of-domain samples, and all samples whose prediction entropy is above the threshold are regarded as out-of-domain. However, this approach is often difficult to use in the open world, where data of unknown categories, and even noise data, can make the model output confident (high prediction certainty) low-entropy predictions.
The domain classifier filters the intra-domain data out of the unlabeled open data set; the intra-domain data D_ID will serve as the intermediary through which the knowledge of multiple participants is aggregated. The FedAvg algorithm directly adopts linear averaging when aggregating models, but this is unsuitable when the data are heterogeneous, and most existing solutions use a knowledge distillation algorithm to distill the knowledge of the local models into the global model.
In this embodiment, model aggregation is performed by ensemble distillation. Each participant uploads its model to the central node, and the central node performs knowledge distillation directly on the models' prediction outputs over the intra-domain data set D_ID filtered out of D_U; each participant then replaces its local model with the global model for the next round of training. The whole process is similar to the FedAvg algorithm, except that model aggregation is performed by knowledge distillation. In this way, the federated learning method can build a more accurate model over the global data set D_agg (the collection of all private data sets).
In step S200, the global model, the domain classifier, and each local model are iteratively updated during training. When the local models predict the intra-domain data set in the current round, they use the intra-domain data set extracted by the domain classifier in the previous round (step S300). In the first round, the intra-domain data set is randomly extracted from the open data set.
The domain classifier comprises a base model and a multi-layer perceptron; the base model is the global model of each round of iteration, and the multi-layer perceptron serves as the detection head. The base model needs to have a strong feature extraction capability and to treat all private data sets as intra-domain data. Using the global model of each iteration as the base model of the domain classifier improves the training speed of the domain classifier.
In step S200, each local model serves as a teacher model and the global model as the student model; the global model is iteratively updated by knowledge distillation, and the average of the predicted values obtained by the local models is used to train the global model.
When selecting the student model for ensemble distillation, the global model obtained after a new round of local parameter averaging is selected as the target of distillation, and this global model is distributed to all participants (serving as their local models). Although linear combination means that the averaged model may not match the previous round's global model in test accuracy, parameter averaging is considered to introduce new feature extraction capability into the model and is therefore more suitable as the initialization of the new round's student model.
The parameters of the domain classifier are updated in each round of communication with reference to formula (7). As the model is trained on more samples, its feature extraction capability gradually increases, and the averaged parameters introduced above change the statistics recorded by the BN layers, so updating the domain classifier in real time is necessary.
After a round of communication ends, the domain classifier is trained, with the global model of that round on the central node selected as the base model. This arrangement allows the central node to train the domain classifier in parallel while the participants train their local models, avoiding excessive idle time for the participants during model aggregation.
If the local models are of the same structural category, each local model is iterated by distributing the global model to all participants.
The above embodiment reduces the dependency on the open data set in the self-supervised-learning-based federated learning method when the local models are of the same structural category. On this basis, a further embodiment addresses the problem of model heterogeneity. For the model heterogeneity problem, each participant trains a local model {M_1, M_2, …, M_N} subject to its hardware limitations; these are heterogeneous models with inconsistent architectures and different depths and widths. Besides expecting the global model M_fed to achieve good accuracy on the global data set D_agg, the algorithm also pays attention to the generalization performance of each local model M_i and its accuracy on its own data set D_i.
If the structural categories of the participants' local models include at least two kinds, the structural category of the global model is the same as that of the local model of one of the participants; the global model, the domain classifier, and each local model are iteratively updated during training, and in the federated aggregation stage of each iterative update:
(1) For a local model whose structural category differs from that of the global model, the local model is updated and iterated by knowledge distillation using the intra-domain data set and the predicted values corresponding to the intra-domain data set. The global model serves as the teacher model and the corresponding local model as the student model to complete the iteration of each such local model, with the training data selected from the intra-domain data set.
(2) For a local model of the same structural category as the global model, the global model directly replaces the local model for iteration.
For a local model whose structural category differs from that of the global model, updating it by knowledge distillation can be carried out in either of the following ways:
Way one: the global model predicts the intra-domain data set, and the local models whose structural category differs from that of the global model are trained according to its predicted values;
Way two: the average of the predicted values obtained by the local models is used to train the local models whose structural category differs from that of the global model.
Step S300 includes step S310, training the domain classifier using the global model, and step S320, the domain classifier extracting the intra-domain data set from the open data set.
Step S310, training the domain classifier using the global model, uses output layer information of the global model and/or middle layer information of the global model.
When the output layer information of the global model is used, step S310 includes steps S311 to S313. Step S311: the global model generates output layer information for an input sample. Step S312: the domain classifier obtains the input sample and its output layer information. Step S313: the domain classifier obtains a score (the result of the scoring function l(x)) from the output layer information and, according to the score, places the input samples that meet expectations into the intra-domain data set. Specifically, after the scores are sorted, an absolute number or a proportion of the corresponding input samples are extracted in order and placed into the intra-domain data set.
The problem of intra-domain data detection can be formalized as a classification problem. Specifically, to filter the intra-domain data set D_ID out of the unlabeled open data set D_U, a scoring function l(x) can be defined that takes a sample x as the input sample and outputs a domain confidence as the output layer information; this yields a simple intra-domain data classifier based on a predefined threshold:
D_ID = {x | l(x) > γ, x ∈ D_U} #(1)
where γ is the predefined confidence threshold. Formula (1) only gives the basic idea of the scoring function l(x); its concrete form is not elaborated here. In this embodiment, the basic idea of formula (1) is realized through the prediction entropy of the output: in an image classification task, the logits produced by the deep learning model at the output layer correspond to the model's scores for each class, and the subsequent Softmax function serves as a normalization that converts these scores into class prediction probabilities. An entropy-based algorithm can be understood as matching the output distribution against a one-hot encoding taken as the prior distribution. This matching involves only the last layer of the model and places no constraint on the rich intermediate features.
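For illustration only, a minimal sketch of the threshold filter of formula (1) with an entropy-based scoring function l(x) is given below; the concrete score and the threshold γ are illustrative choices, since the embodiment later replaces the explicit threshold with per-class ranking.

```python
# A sketch of the formula (1) filter with an entropy-based score, assuming a
# PyTorch image classifier; function names are illustrative, not the patent's.
import torch
import torch.nn.functional as F

def entropy_score(model, x):
    """Negative prediction entropy: higher means more confident / more in-domain."""
    with torch.no_grad():
        probs = F.softmax(model(x.unsqueeze(0)), dim=1)          # x: a CHW image tensor
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return -entropy.item()

def filter_by_threshold(model, open_dataset, gamma):
    # D_ID = {x | l(x) > gamma, x in D_U}
    return [x for x in open_dataset if entropy_score(model, x) > gamma]
```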
When the domain classifier is trained using middle layer information of the global model, the middle layer information is derived from the feature maps before each batch normalization layer within the middle layers of the global model.
Using intermediate-layer features can help the model better judge both intra-domain data and out-of-domain noise. Deep learning models such as ResNet are often composed of multiple blocks, and the stacking of these structurally similar blocks enables matching of image features from low to high dimensions. Structurally, the low-dimensional stages of a neural network have few convolution channels, low dimensionality, high-resolution feature maps, and small neuron receptive fields, so the low-dimensional features are more local features related to the graphic image. The high-dimensional stages have many convolution channels, high dimensionality, small low-resolution feature maps, and large neuron receptive fields, so they extract more global, more task-relevant features.
In this embodiment, when the intermediate layer information of the global model is used, the training process of step S310 includes steps S314 to S316. That is, step S310 includes steps S311 to S313 (using the output layer information of the global model) or steps S314 to S316 (using the intermediate layer information of the global model).
Step S314: data enhancement is performed on the input samples to obtain comparison samples, the input samples corresponding to the comparison samples one to one. Step S315: first-level features are obtained from the input samples and second-level features from the comparison samples, the first-level features corresponding to the second-level features one to one. Step S316: the domain classifier is trained using the first-level features, the second-level features, and the correspondence between them.
The comparison samples are constructed by data enhancement, such as random cropping, horizontal flipping, and random alteration of image properties (brightness, contrast, saturation, and hue). Specifically, for a given unlabeled open data set D_U and a model f(x; θ), data enhancement is first applied randomly to each sample x_i to obtain a pair of positive samples {x_i, x'_i}, i.e., the input samples and the comparison samples are in one-to-one correspondence. Batches of 2N samples are then input into the domain classifier, and the hierarchical features {v_1, v'_1, …, v_N, v'_N} of each sample are calculated using formula (2), with the first-level features and the second-level features in one-to-one correspondence.
In step S315, the second-level features are obtained from the comparison samples in the same manner as the first-level features are obtained from the input samples. The first-level features are obtained from an input sample using the following formula:
v(x) = GAP(f_1(x)) ⊕ GAP(f_2(x)) ⊕ … ⊕ GAP(f_n(x)) #(2)
wherein x is an input sample;
v(x) is the first-level (hierarchical) feature of the input sample, used for detecting intra-domain data;
f_i(x) denotes the feature map of the global model before the i-th batch normalization layer for the input sample x, and n is the number of batch normalization layers;
GAP denotes global average pooling of a two-dimensional feature map to obtain a scalar value;
⊕ denotes the series (concatenation) operation.
The GAP operator refers to global average pooling (Global Average Pooling). It globally averages a two-dimensional feature map into a scalar value and was originally designed to replace the fully connected layer at the model output: a GAP layer strengthens the connection between class scores and the corresponding convolutions and also retains better spatial information. This embodiment performs global average pooling on the outputs of the convolution layers so as to aggregate the features of each channel and align them with the semantic and statistical information of the BN layers.
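For illustration, the hierarchical feature v(x) of formula (2) and the prior v* can be computed as in the following sketch, which assumes a PyTorch convolutional network and uses forward pre-hooks on the BN layers; the hook-based extraction is an implementation assumption.

```python
# A sketch of formula (2): global average pooling of the feature map entering
# every BN layer, concatenated in series; v* is the concatenated BN running means.
import torch
import torch.nn as nn
import torch.nn.functional as F

def hierarchical_feature(model, x):
    pooled = []

    def pre_bn_hook(module, inputs):
        fmap = inputs[0]                                           # feature map before the BN layer
        pooled.append(F.adaptive_avg_pool2d(fmap, 1).flatten(1))   # GAP -> (B, C)

    handles = [m.register_forward_pre_hook(pre_bn_hook)
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return torch.cat(pooled, dim=1)                                # series concatenation over BN layers

def bn_prior(model):
    # v*: concatenation of the feature means stored in the BN layers
    means = [m.running_mean for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    return torch.cat(means)
```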
It will be appreciated that during training, first-level features must be obtained from the input samples and second-level features from the comparison samples to complete the training of the domain classifier. During application (step S320), however, it is no longer necessary to obtain second-level features from comparison samples.
In neural networks, features of different channels often have strong interdependencies; a naive distance measure focuses on channels with large amplitude variations but cannot capture the structured information across multiple channels. The scoring function used by the domain classifier of this embodiment solves this problem. This embodiment performs self-supervised learning based on the input samples and the comparison samples.
This embodiment uses the batch normalization layers (Batch Normalization Layer, BN layers) as the prior distribution of the intra-domain data detection algorithm, so as to exploit intermediate layer information. Almost all modern neural networks contain BN layers: as neural networks become deeper and wider, small perturbations in the features of one layer can have a tremendous impact on subsequent layers, so to eliminate the feature variance caused by differences between batches of data, BN layers are added after the convolution layers to normalize the feature maps. The mean and variance of the feature maps during training are stored in the BN layers, and these statistics can be used to construct a prior distribution for detecting intra-domain data; this extra information makes the algorithm more robust than algorithms that use only classification probabilities.
When the middle layer information of the global model is used, the domain classifier extracts the intra-domain data set from the open data set as follows: the domain classifier receives the first-level features and the concatenation of the feature means in the batch normalization layers, and outputs the relative distance between them, i.e., d(v(x), v*); the relative distance is used to divide the open data set into the intra-domain data set and an out-of-domain data set.
The formal definition of the intra-domain classifier presented herein can now be derived. Assuming that the feature means stored in the BN layers are concatenated as v*, intra-domain data can be extracted from the unlabeled data set by:
D_ID = {x | d(v(x), v*) > ε, x ∈ D_U} #(3)
where v(x) is given by formula (2), and d(v(x), v*) measures the confidence between the hierarchical features and the prior distribution. There are many possible measures, such as the simple distance function ‖v(x) − v*‖. However, a common distance function may not be suitable, because the importance of the hierarchical features in different dimensions is unknown, and different types of noise samples differ in how the model judges them.
The extracted intra-domain data are then placed into the intra-domain data set D_ID. The central node traverses the unlabeled data set D_U and selects the intra-domain data set D_ID for this round of communication according to formula (3), controlling the number of samples per class; this per-class filtering of D_U is used herein instead of explicitly setting the threshold ε.
The relative distance used to divide the open data set into the intra-domain and out-of-domain data sets is applied as follows: input samples in the open data set are extracted in ranked order, up to an absolute number or a proportion, and placed into the intra-domain data set. This facilitates generating a class-balanced intra-domain data set D_ID.
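For illustration, the class-balanced selection described above can be sketched as follows; the per-class budget and the sample/score representation are assumptions.

```python
# A sketch of class-balanced filtering: instead of a fixed threshold, keep the
# top-k highest-scoring samples per predicted class. Names are illustrative.
from collections import defaultdict

def filter_per_class(samples, predicted_classes, scores, per_class_k):
    by_class = defaultdict(list)
    for sample, cls, score in zip(samples, predicted_classes, scores):
        by_class[cls].append((score, sample))

    in_domain = []
    for cls, scored in by_class.items():
        scored.sort(key=lambda t: t[0], reverse=True)   # rank by domain confidence
        in_domain.extend(sample for _, sample in scored[:per_class_k])
    return in_domain
```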
Step S320 specifically includes: the domain classifier receives the first-level features and the concatenation of the feature means in the batch normalization layers, projects both into the embedding space, and, according to the relative distance, retains the input samples whose corresponding first-level features meet expectations, the relative distance being the cosine distance between the projections in the embedding space; these input samples are then placed into the intra-domain data set.
The multi-layer perceptron serves as the detection head of the domain classifier: it receives a first-level feature v(x) as input and projects it into the embedding space as h(v(x)); it likewise receives the concatenation of the feature means in the batch normalization layers, v*, and projects it into the same embedding space as h(v*); the distance between the two projections in the embedding space is then measured by the cosine distance.
The multi-layer perceptron consists of fully connected layers and a nonlinear ReLU activation function; during projection, the importance of features and the correlations between channels can be reflected in the high-dimensional embedding space. On the other hand, the cosine distance ignores the absolute magnitude of the features, achieving an effect similar to projecting the input onto a hypersphere, so a simple linear distance can be used to group similar samples together.
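For illustration, the detection head and the cosine measure can be sketched as follows; the layer sizes are assumptions.

```python
# A sketch of the detection head: a small MLP h(.) that projects hierarchical
# features into an embedding space, with cosine similarity as the measure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, v):
        return F.normalize(self.net(v), dim=-1)   # project onto the unit hypersphere

def cosine_similarity(z1, z2):
    # Inputs are already normalized, so the dot product is the cosine similarity
    return (z1 * z2).sum(dim=-1)
```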
After the hierarchical features {v_1, v'_1, …, v_N, v'_N} of each sample have been calculated using formula (2), the self-supervised contrastive loss is calculated as:
L_con = −Σ_i log[ exp(sim(v_i, v'_i)/τ) / Σ_k 1[k≠i]·exp(sim(v_i, v_k)/τ) ] #(4)
where the sums run over the 2N features, 1[k≠i] is an indicator function whose value is 1 when k ≠ i, and τ is a temperature parameter. The similarity function sim(v_i, v_j) is computed as follows:
sim(v_i, v_j) = ⟨h(v_i), h(v_j)⟩ / (‖h(v_i)‖·‖h(v_j)‖) #(5)
where h denotes the multi-layer perceptron presented in this section, which projects the hierarchical features onto the hypersphere.
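As a non-limiting illustration, formulas (4) and (5) in the standard pairwise-contrastive form reconstructed above can be computed as in the following sketch; the pairing convention, normalization, and mean reduction are assumptions.

```python
# A sketch of the contrastive loss over 2N projected features; the exact
# reduction used in the patent may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(z, tau=0.1):
    """z: (2N, d) projections ordered as [v_1, v'_1, ..., v_N, v'_N]."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                       # pairwise cosine similarities / temperature
    sim.fill_diagonal_(float('-inf'))           # indicator 1[k != i]
    n2 = z.size(0)
    pos = torch.arange(n2) ^ 1                  # positive of index i is its augmented pair
    return F.cross_entropy(sim, pos.to(z.device))
```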
The above procedure makes it possible to train the projection head in a self-supervised way using contrastive learning, but the multi-layer perceptron is also required to be strongly related to the local features and to the task, i.e., the scoring function d(v(x), v*) of formula (3) computes the distance between a hierarchical feature and the BN-layer statistics. Notably, v(x) is the hierarchical feature of a single sample, whereas v* is obtained by concatenating the BN-layer statistics, so the two do not match conceptually; moreover, as described above, the multi-layer perceptron projects the hierarchical features onto the hypersphere, while v* exists in the embedding space only as a single point, so directly computing sim(v(x), v*) in this case is unsuitable.
To solve this problem, another method for constructing multiple views for contrastive learning is proposed herein: inspired by the way the BN layer aggregates the features of all samples in a batch, the features are randomly grouped and the grouping result is used as a self-supervised label for training.
Specifically, the hierarchical features {v_1, v_2, …, v_N} are randomly grouped and averaged to obtain {g_1, g_2, …, g_M}, each hierarchical feature belonging to exactly one group. These inputs are then projected onto the hypersphere by the multi-layer perceptron, and each hierarchical feature v_i is pulled closer to its own group g_j and pushed away from the other groups. The objective function can be formally described as:
L_group = −Σ_i log[ exp(d(v_i, g_j)/τ) / Σ_m 1[m≠j]·exp(d(v_i, g_m)/τ) ] #(6)
where v_i and g_j form a positive pair, i.e., v_i is assigned to the j-th group; 1[m≠j] is an indicator function whose value is 1 if and only if the group index m differs from j; τ is a temperature parameter; and d(·,·) is a similarity function computed as in formula (5).
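For illustration, one way to realize the random grouping and the group-level objective of formula (6), under the assumptions stated above (balanced random grouping, cross-entropy form), is sketched below.

```python
# A sketch of the grouping view: features are randomly grouped, each group is
# averaged into a prototype g_j, and every feature is pulled toward its own
# group's projection and pushed from the others. Grouping and loss form are assumptions.
import torch
import torch.nn.functional as F

def group_contrastive_loss(v, head, num_groups, tau=0.1):
    """v: (N, d) hierarchical features; head: projection MLP h(.); assumes N >= num_groups."""
    n = v.size(0)
    assignment = torch.randperm(n, device=v.device) % num_groups   # balanced random grouping
    prototypes = torch.stack([v[assignment == j].mean(dim=0)
                              for j in range(num_groups)])         # g_1..g_M

    z_v = F.normalize(head(v), dim=1)
    z_g = F.normalize(head(prototypes), dim=1)
    logits = z_v @ z_g.t() / tau                                    # d(v_i, g_j) / tau
    return F.cross_entropy(logits, assignment)
```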
In summary, the objective function for the self-supervised contrastive learning of the multi-layer perceptron is:
L = L_con + β·L_group #(7)
where β is a coefficient balancing the two losses. During training only the parameters of the projection head are trainable; the parameters of the model f(x; θ) are fixed.
The embodiments of the present application choose to transmit the entire model parameters during communication and to perform knowledge distillation at the central node. The distillation loss can be formally expressed as:
L_KD = D_KL( softmax(z̄_T/τ) ‖ softmax(z_S/τ) ) #(8)
where z̄_T denotes the average logits of the teacher (local) models, z_S denotes the logits of the student (global) model, D_KL is the KL divergence, which is used to measure the distance between the two distributions, and τ is the temperature coefficient.
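A minimal sketch of this ensemble-distillation loss, assuming PyTorch, a batch-mean reduction, and an illustrative temperature value, is given below.

```python
# A sketch of formula (8): KL divergence between the temperature-softened
# average teacher logits and the student (global model) logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, tau=2.0):
    teacher_avg = torch.stack(teacher_logits_list).mean(dim=0)      # average logits of local models
    teacher_prob = F.softmax(teacher_avg / tau, dim=1)
    student_log_prob = F.log_softmax(student_logits / tau, dim=1)
    # KL( teacher || student ), averaged over the batch
    return F.kl_div(student_log_prob, teacher_prob, reduction='batchmean')
```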
The federated learning method based on self-supervised learning provided in the embodiments of the present application is explained in further detail below.
A federated learning algorithm usually consists of multiple rounds of communication, and within each round the algorithm can be roughly divided into two phases: a local training phase and a federated aggregation phase. The inputs of the algorithm are the private data set of each participant and the unlabeled data set D_U. In the initialization phase, the central node needs to initialize the model parameters of the federated global model f_fed(x; θ_fed) and of the domain classifier d(v, g; θ_d); each participant synchronizes with the central node and exchanges task meta-information such as the algorithm, model architecture, training hyper-parameters, communication interfaces, and keys. This step is not detailed in the algorithm. The details of the algorithm are described below.
The local training phase is the main phase in which the local models learn knowledge from the private data sets; in this phase all participants update their model parameters in parallel using their local data. Meanwhile, in the algorithm's timeline, the central node performs self-supervised training of the domain classifier in parallel during this phase and filters the intra-domain data using the trained domain classifier.
At the beginning of each round of communication, the central node distributes the parameters of the global model obtained in the previous round (or of the initialized global model) to each participant. Each participant then replaces its local model f_i with the global model f_fed, and subsequent parameter updates are based on the global model's parameters. Because of the particularities of distributed settings, in each round of communication a few participants may drop out or miss training, and the activation rate C is used in the algorithm herein to simulate real-world conditions. A participant P_i in the active state randomly partitions its local data set into batches of a fixed size and computes the loss for each batch of data (x, y) using a cross entropy loss function.
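For illustration only, one local training step of an active participant could look like the following sketch; the optimizer, learning rate, and the way the activation rate C is simulated are assumptions, not part of the disclosure.

```python
# A sketch of local training with cross entropy, assuming a PyTorch DataLoader
# over the participant's private dataset.
import random
import torch
import torch.nn.functional as F

def local_training(local_model, global_state_dict, loader, epochs=1, lr=0.01,
                   activation_rate=1.0):
    if random.random() > activation_rate:             # participant inactive this round
        return None
    local_model.load_state_dict(global_state_dict)    # replace local model with f_fed
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    local_model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(local_model(x), y)  # cross entropy on a private batch
            loss.backward()
            optimizer.step()
    return local_model.state_dict()
```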
This stage of the training process takes place at the central node, so the domain classifier can be trained in parallel while the participants train their local models. First, paired samples x and x' are generated from the unlabeled data set D_U by random data enhancement, where corresponding samples are regarded as positive pairs, and the contrastive loss is calculated in a self-supervised manner by formula (4). The samples are passed through the previous round's global model f_fed for feature extraction, and the multi-level features are concatenated from bottom to top into a global feature expression v; the feature expressions are then randomly grouped, simulating the computation of the BN layer, and averaged to obtain g; finally, the contrastive loss is calculated with the grouping information as the self-supervised label. The domain classifier updates its parameters in every round of communication to accommodate the new feature extraction capability that the local models learn from the private data sets.
The intra-domain data filtering stage also runs in parallel with the participants' training. After the domain classifier has been trained, the statistics of the BN layers are extracted with formula (2) into the prior distribution v* for judging intra-domain samples, and the unlabeled data set D_U is then traversed and filtered. Specifically, during filtering, the model's output label is taken as the class label of the sample, and the confidence output by the domain classifier is recorded at the same time; the confidences within each class are then sorted, and the samples with the highest confidence ranks are selected as intra-domain samples and added to D_ID. No explicit confidence threshold is specified herein; instead, filtering by confidence ranking within each class generates a class-balanced intra-domain data set.
In the federated aggregation phase, the algorithm aggregates the local models trained on multiple non-IID data sets, the goal being a global model that performs well on the global data set D_agg. For the participants active in this round of communication, the central node waits for them to complete local training and collects their model parameters for knowledge distillation. The central node randomly divides the filtered intra-domain data set D_ID into batches of a fixed size, inputs a batch of samples into the local models participating in training, and distills knowledge into the global model f_fed with the average logits as the teacher; this ensemble distillation is formally expressed in formula (8). This completes the flow of the round of communication; in the next round, f_fed is distributed to each participant as the new global model. At the end of the federated learning algorithm, the participants take the final model from the central node and test and deploy it according to their downstream tasks.
Because a federated learning algorithm involves multiple participants, the participants often have different hardware facilities and software resources, and it is difficult to unify their levels of computing power. The FedAvg algorithm is based on linear combination of model parameters, so multiple models with different architectures are beyond its scope of application. Some algorithms based on self-supervised learning can implement model-agnostic federated learning systems by aggregating the models' predictions, but these algorithms also rely on an open unlabeled data set as the carrier of knowledge transfer. The preceding algorithm is therefore modified to adapt it to the model-heterogeneous scenario.
The central idea of the algorithm is to integrate the heterogeneous models from multiple participants into one global model using knowledge distillation, then update the parameters of the domain classifier using this model, and finally have each participant update its local model using the intra-domain data set.
Regarding the choice of the global model, it is desirable herein that the global model achieve the best possible performance in the current state and, at the same time, that the local model of each participant be able to learn knowledge from the global data. Thus:
arch(θ_fed) = max_count({arch(θ_i) | participant P_i is active}) #(9)
where max_count(·) returns the most frequent element and arch(θ_i) refers to the architecture of the model. Specifically, in each round of communication, the algorithm selects the model architecture that occurs most often among the currently active participants as the architecture of the global model.
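For illustration, the selection of formula (9) can be realized with a simple counting step; the helper arch_of is a hypothetical accessor for a model's architecture tag.

```python
# A sketch of formula (9): pick the architecture that occurs most often among
# the active participants, then a random local model of that architecture as
# the initial global model.
import random
from collections import Counter

def select_global_model(active_models, arch_of):
    counts = Counter(arch_of(m) for m in active_models)
    best_arch, _ = counts.most_common(1)[0]                 # max_count over architectures
    candidates = [m for m in active_models if arch_of(m) == best_arch]
    return random.choice(candidates)
```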
The federated learning algorithm suitable for the model-heterogeneous scenario can be divided into three stages: local training, federated aggregation, and knowledge migration. Local training means that each participant trains a local personalized model using its local data set and sends the model parameters to the central node; this step is similar to the local training phase of the algorithm described earlier and is not repeated here.
In the federated aggregation stage, the algorithm performs ensemble distillation using the local models, thereby obtaining a global model with better performance. Specifically, the algorithm first selects the model architecture and then randomly selects one model from the local models of that architecture as the initialized global model. The next step is to transfer knowledge using the intra-domain data set. Note that there is an ordering between training the domain classifier and transferring knowledge: for the first round, the central node must randomly initialize the intra-domain data set, i.e., randomly select samples from the unlabeled open data set and add them to D_ID. After the parameter update of the global model is completed, the algorithm trains the domain classifier using the BN-layer statistics of the global model as the prior distribution; in particular, because model heterogeneity makes the hierarchical features differ in size, the domain classifier d(v, g; θ_d) must be trained from its initialized state in each round. Finally, the algorithm filters the unlabeled data set D_U with the trained domain classifier, so that the intra-domain data set can be used for knowledge distillation in the next round of communication.
In the knowledge migration stage, the activated participants receive the new round's global model f_fed from the central node and fine-tune their local personalized models using this portion of global knowledge. Specifically, if the global model and the local model have the same architecture, the local model is directly replaced; this special case can be understood as a federated learning algorithm with isomorphic models. Otherwise, the local model must be updated by knowledge distillation using the global view contained in the global model.
When sharing the model parameters, the central node also attaches the labels of the intra-domain data set D_ID, from which the local participant restores the data set and performs knowledge distillation with the distillation loss (8). It is worth mentioning that when knowledge distillation is performed, the teacher model is f_fed, whose parameter size does not hinder the participation of parties with weaker computing power, because the teacher model is then in a fixed-parameter inference state, and neither the computation graph nor the gradient of each parameter is recorded during computation.
The federated learning method based on self-supervised learning provided by the embodiments of the present application aims to expand the applicable scenarios of federated learning. It at least alleviates the excessive dependence of self-supervised-learning-based algorithms on an open unlabeled data set for knowledge transfer: a domain classifier is trained in a self-supervised manner to project the hierarchical features of a sample into an embedding space, so that the algorithm can conveniently judge the similarity between a sample and the model's BN layers using a linear distance. Then, for the model-heterogeneous federated learning scenario, a model-heterogeneous federated learning algorithm based on self-supervised learning applicable to that scenario is provided.
It should be understood that although the steps in the flowchart of FIG. 1 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages that are not necessarily executed at the same time but may be executed at different times; the execution order of these sub-steps or stages is not necessarily sequential, and they may be executed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
An embodiment of the present application further provides a federated learning system based on self-supervised learning, which includes a plurality of participants and a central node and implements the federated learning method based on self-supervised learning provided in the embodiments of the present application.
Any one of the participants and the central node may be, for example, a computer device, which may be a terminal, a smart wearable device, a mobile terminal, or a server; its internal structure diagram may be as shown in FIG. 2. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer programs of the different computer devices, when executed by a processor, implement the federated learning method based on self-supervised learning. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, keys, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this description. When technical features of different embodiments appear in the same drawing, the drawing may also be regarded as disclosing the combination of the embodiments concerned.
The above examples merely represent a few embodiments of the present application; although they are described in comparative detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A federal learning method based on self-supervised learning, implemented between a plurality of participants and a central node, comprising:
each participant trains a local model using a private data set and, during training, predicts an intra-domain data set to obtain predicted values;
the central node trains a global model using the intra-domain data set and the predicted values corresponding to the intra-domain data set;
training a domain classifier using the global model, the domain classifier extracting the intra-domain data set from an open data set.
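For illustration only, the round structure of claim 1 could be sketched in Python as follows. The participant and central-node objects and every method name used here (train_local_model, predict, train_global_model, train_domain_classifier, extract) are hypothetical; the patent fixes only the three steps, not any particular API.

```python
def federated_round(participants, central_node, open_dataset, intra_domain_set):
    """One communication round of the method in claim 1 (illustrative sketch)."""
    predictions = []
    # 1. Each participant trains its local model on its private data set and, during
    #    training, predicts the intra-domain data set to obtain predicted values.
    for p in participants:
        p.train_local_model(p.private_dataset)
        predictions.append(p.predict(intra_domain_set))
    # 2. The central node trains the global model with the intra-domain data set and
    #    the predicted values corresponding to it.
    central_node.train_global_model(intra_domain_set, predictions)
    # 3. The global model is used to train the domain classifier, which then extracts
    #    a fresh intra-domain data set from the open data set for the next round.
    central_node.train_domain_classifier(central_node.global_model)
    return central_node.domain_classifier.extract(open_dataset)
```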
2. The self-supervised learning based federal learning method of claim 1, wherein the global model, the domain classifier, and each of the local models are iteratively updated during training;
when each local model predicts the intra-domain data set, the intra-domain data set extracted by the domain classifier in the previous round is used;
each local model serves as a teacher model and the global model serves as a student model, and the global model is iteratively updated by means of knowledge distillation;
and the average of the predicted values obtained by the local models is used to train the global model.
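A minimal PyTorch-style sketch of the student update in claim 2, assuming the central node already holds, for every intra-domain sample, the average of the participants' predicted logits; the loader format, the temperature, and all names are illustrative assumptions, not taken from the patent.

```python
import torch.nn.functional as F

def distill_global_model(global_model, intra_domain_loader, optimizer, temperature=2.0):
    """Knowledge-distillation update: the local models act as teachers (via their
    averaged predictions) and the global model acts as the student (claim 2)."""
    global_model.train()
    for x, avg_teacher_logits in intra_domain_loader:  # assumed loader format
        optimizer.zero_grad()
        student_logits = global_model(x)
        # Match the student's softened distribution to the averaged teacher predictions.
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=1),
            F.softmax(avg_teacher_logits / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2
        loss.backward()
        optimizer.step()
```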
3. The self-supervised learning based federal learning method of claim 2, wherein the local models are of the same structural class, and each local model is iterated by distributing the global model to the participants.
4. The self-supervised learning based federal learning method of claim 1, wherein training a domain classifier using the global model comprises:
the global model generates output-layer information for an input sample;
the domain classifier obtains the input sample and the output-layer information of the input sample;
and the domain classifier derives a score from the output-layer information and, according to the score, places the input samples that meet expectations into the intra-domain data set.
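One way to realize the scoring in claim 4 is to use the maximum softmax confidence of the global model's output layer; this particular score and the threshold are illustrative assumptions, since the claim only requires that a score be derived from the output-layer information.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_intra_domain(global_model, open_loader, score_threshold=0.8):
    """Score open-set samples from the global model's output-layer information and keep
    those that meet expectations (claim 4, illustrative sketch)."""
    global_model.eval()
    intra_domain_samples = []
    for x in open_loader:                                    # batches from the open data set
        logits = global_model(x)                             # output-layer information
        scores = F.softmax(logits, dim=1).max(dim=1).values  # one confidence score per sample
        intra_domain_samples.extend(x[scores >= score_threshold])
    return intra_domain_samples
```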
5. The self-supervised learning based federal learning method of claim 1, wherein training a domain classifier using the global model comprises:
and training the domain classifier in a self-supervised manner using middle-layer information of the global model, wherein the middle-layer information is derived from the feature maps before each batch normalization layer in the middle layers of the global model.
6. The self-supervised learning based federal learning method of claim 5, wherein the domain classifier comprises a multi-layer perceptron, and the training process comprises:
performing data augmentation on input samples to obtain comparison samples, wherein the input samples correspond to the comparison samples one to one;
obtaining first-level features based on the input samples, and obtaining second-level features based on the comparison samples, wherein the first-level features and the second-level features are in one-to-one correspondence;
and training the domain classifier using the first-level features, the second-level features, and the correspondence between them.
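A compact sketch of the contrastive training described in claims 5 and 6, assuming PyTorch: the domain classifier is a small MLP, and the paired first-level/second-level features of each input sample and its comparison sample supply the positive pairs. The architecture, embedding size, and InfoNCE-style loss are illustrative choices, not specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainClassifier(nn.Module):
    """Multi-layer perceptron that maps first-level features into an embedding space."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, v):
        return F.normalize(self.mlp(v), dim=1)   # unit-norm embeddings

def contrastive_step(classifier, v_first, v_second, optimizer, temperature=0.1):
    """v_first[i] and v_second[i] come from the same input sample and its comparison
    sample (claim 6); the one-to-one correspondence defines the positive pairs."""
    z1, z2 = classifier(v_first), classifier(v_second)
    logits = z1 @ z2.t() / temperature                    # pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```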
7. The self-supervised learning based federal learning method of claim 6, wherein the domain classifier extracts the intra-domain data set from an open data set, comprising:
the domain classifier receives the first-level features and the concatenation (series connection) of the feature means of the batch normalization layers, and outputs a relative distance between the first-level features and the concatenated feature means, wherein the relative distance is used for dividing the open data set into the intra-domain data set and an out-of-domain data set.
8. The self-supervised learning based federal learning method of claim 7, wherein the domain classifier extracts the intra-domain data set from an open data set, comprising:
the domain classifier receives the first-level features and the concatenation of the feature means of the batch normalization layers and projects both into an embedding space; the relative distance is the cosine distance between the two projections in the embedding space, and, according to the relative distance, the input samples whose first-level features meet expectations are retained and placed into the intra-domain data set.
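A sketch of the extraction step in claims 7 and 8, reusing the classifier above: the first-level features and the concatenated batch-normalization feature means are projected into the embedding space, and samples whose cosine distance to the means is small enough are kept. The threshold value and the way the means are concatenated are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def split_open_dataset(classifier, v_open, bn_running_means, distance_threshold=0.3):
    """Divide the open data set into intra-domain and out-of-domain parts by the relative
    (cosine) distance in the embedding space (claims 7-8, illustrative sketch)."""
    z_samples = classifier(v_open)                           # embedded first-level features
    z_reference = classifier(bn_running_means.unsqueeze(0))  # embedded concatenated BN means
    relative_distance = 1.0 - F.cosine_similarity(z_samples, z_reference)
    return relative_distance <= distance_threshold           # True -> intra-domain sample

def concatenated_bn_means(global_model):
    """Concatenate the running means stored in every BatchNorm2d layer of the global model."""
    return torch.cat([m.running_mean for m in global_model.modules()
                      if isinstance(m, nn.BatchNorm2d)])
```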
9. The self-supervised learning based federal learning method of claim 7, wherein the second-level features are obtained from the comparison samples in the same manner as the first-level features are obtained from the input samples;
obtaining the first-level features based on the input samples using the following equation:

$$v(x) = \bigoplus_{i} \mathrm{GAP}\big(f_i(x)\big)$$

wherein x is an input sample;
v(x) is the first-level feature of the input sample;
$f_i(x)$ represents the feature map of the global model before the i-th batch normalization layer for the input sample x;
GAP represents global average pooling applied to a two-dimensional feature map to obtain a scalar value;
and $\bigoplus$ denotes the series connection (concatenation) operation, taken over all batch normalization layers i.
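The feature extraction of claim 9 can be written directly with forward pre-hooks in PyTorch: global-average-pool the feature map entering each batch normalization layer and concatenate the pooled vectors. The hook-based code below is one possible realization, assuming the global model contains nn.BatchNorm2d layers.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def first_level_features(global_model, x):
    """Compute v(x): GAP of the feature map before every batch normalization layer,
    concatenated over the layers (claim 9, illustrative sketch)."""
    pooled, hooks = [], []
    for module in global_model.modules():
        if isinstance(module, nn.BatchNorm2d):
            # A pre-hook sees the input of the BN layer, i.e. f_i(x) before normalization.
            hooks.append(module.register_forward_pre_hook(
                lambda mod, inp: pooled.append(inp[0].mean(dim=(2, 3)))))
    global_model.eval()
    global_model(x)                    # one forward pass fills `pooled`
    for h in hooks:
        h.remove()
    return torch.cat(pooled, dim=1)    # series connection over all BN layers
```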
10. A federal learning system based on self-supervised learning, comprising a plurality of participants and a central node, characterized in that the system implements the federal learning method based on self-supervised learning according to any one of claims 1 to 9.
CN202310189525.3A 2023-02-21 2023-02-21 Federal learning method and system based on self-supervision learning Pending CN116306969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310189525.3A CN116306969A (en) 2023-02-21 2023-02-21 Federal learning method and system based on self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310189525.3A CN116306969A (en) 2023-02-21 2023-02-21 Federal learning method and system based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN116306969A true CN116306969A (en) 2023-06-23

Family

ID=86782754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310189525.3A Pending CN116306969A (en) 2023-02-21 2023-02-21 Federal learning method and system based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN116306969A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171628A (en) * 2023-11-01 2023-12-05 之江实验室 Graph structure data node classification method and device in heterogeneous federal environment
CN117171628B (en) * 2023-11-01 2024-03-26 之江实验室 Graph structure data node classification method and device in heterogeneous federal environment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination