CN115238908A - Data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning - Google Patents

Data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning

Info

Publication number
CN115238908A
CN115238908A (Application CN202210251482.2A)
Authority
CN
China
Prior art keywords
encoder
local
group
data
central server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210251482.2A
Other languages
Chinese (zh)
Inventor
魏森辉
高明
蔡文渊
杜蓓
刘翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hipu Intelligent Information Technology Co ltd
East China Normal University
Original Assignee
Shanghai Hipu Intelligent Information Technology Co ltd
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hipu Intelligent Information Technology Co ltd, East China Normal University filed Critical Shanghai Hipu Intelligent Information Technology Co ltd
Priority to CN202210251482.2A
Publication of CN115238908A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data generation method based on a variational autoencoder (VAE), an unsupervised clustering algorithm, and federated learning. The variational autoencoders of the local clients are trained jointly under a federated learning framework, and an unsupervised clustering algorithm is proposed to group the clients according to the differences between their data domains; federated model training is then carried out independently within each cluster. This greatly mitigates the harm that data-domain differences cause to federated training, and each group is ultimately trained into a global generative model. In the prediction stage, the trained global generative models can generate additional data that is safe to share, providing effective data support for further machine learning and deep learning tasks.

Description

Data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning
Technical Field
The invention belongs to the technical fields of data privacy security and deep-learning-based data generation, and in particular relates to a data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning.
Background
Data as the new energy source
With the vigorous development of information technologies such as big data, cloud computing, the Internet of Things, and the Internet, artificial intelligence technologies represented by machine learning and deep learning have entered a period of rapid development and opened a new technological revolution. Machine learning and deep learning are processes of mining rules from data, and two factors are crucial to them: algorithms and data. Algorithms answer the question of "how to learn", while data answers the question of "what to learn from". Thanks to the rapid development of deep learning, researchers have proposed many algorithms to solve the "how to learn" problem in a variety of scenarios. However, a saying circulates in the machine learning community: data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit. In other words, no matter how exquisite the algorithm, a model cannot perform well on practical problems without good data support. Data seems ubiquitous in modern times: with the rapid growth of the Internet, huge amounts of data are constantly being produced and stored.
In the era of the digital economy, data is viewed as a new energy source with almost unlimited value, and unlike petroleum it is reusable. Data today is by no means scarce, but it is scattered across different companies, different people, and different devices. The openness of data sharing between different systems and organizations is generally low, which gives rise to the problem of information islands: massive amounts of data are isolated from each other and can hardly be fused or combined to release their potential. Yet sharing data freely raises the problem of privacy security.
Data privacy security
Recent years have seen a string of negative events involving privacy leakage and the illegal disclosure of user data. In 2018, for example, a third-party company harvested the personal information of nearly 50 million users through an application, a number amounting to roughly a quarter of a country's electorate, so the scope involved was very large. In the same year, a software bug leaked the private photos of 68 million users. This series of events raised users' concerns about data privacy, and the relevant privacy regulators imposed huge fines. With the public paying ever more attention to data security and privacy protection, countries have begun to legislate on data security, enacting data security laws and personal information protection laws that protect personal data privacy at the legal level.
Federated learning
Against this background, it is difficult to collect enough data for machine learning and deep learning model training. Compared with the traditional approach of gathering all parties' data in one place for centralized training, having each party train independently on its own data means the data volume is small and training a good model becomes much harder. How to effectively integrate and exploit data scattered everywhere without violating user privacy is a problem researchers must consider. The concept of federated learning was proposed in 2016. Unlike traditional machine learning algorithms, which require all data to be gathered in one place for training, federated learning sends the model to each data owner, learns on the data locally, and then integrates the parties' learning results into a final model. Federated learning allows users to form a consortium that trains a global model while the data never leaves the local clients, effectively addressing the problem of data privacy security.
Federated learning aims to build a federated model on distributed datasets. During model training, model-related information can be exchanged between the parties (possibly in encrypted form), but the raw data cannot; the exchange therefore exposes no protected private portion of any party's data. The trained federated model can be deployed on each participant of the federated learning system or shared among multiple parties.
Horizontal federated learning refers to the setting where the data of different participants overlap heavily in the feature dimension (horizontally) but overlap little in the sample dimension (vertically), i.e., the samples that carry those features are largely different. For example, the participants may be two banks serving different regional markets: their customer populations barely overlap, but because their business models are similar, their customer features overlap considerably.
Variational autoencoder
Against the background of massive, mutually isolated data islands, the hope is to combine the variational autoencoder, a generative model, with a federated learning framework to generate more data that is safe to share. Variational autoencoders are widely used in the real world, for instance in image generation and style transfer. Their drawback is that, like other deep learning models, they require large amounts of data for training; otherwise the quality of the generated data is poor.
With the development of information technologies such as big data, cloud computing, and the Internet of Things, deep-learning-based generative models such as the variational autoencoder achieve good data generation results, but deep learning generally needs large amounts of data for model training. Such data does exist in real life, yet it is siloed. For example, once an enterprise develops to a certain stage, multiple business units inevitably appear, each with its own data; each unit is like an isolated island, and the data of different units cannot be connected, i.e., data islands form.
Disclosure of Invention
To address these problems, the invention innovatively provides a data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning. Different clients are grouped by an unsupervised clustering algorithm according to the differences between their data domains, and federated model training is then carried out independently within each cluster, finally yielding a federated generative model with good data generation performance. In the inference stage this federated generative model can generate more data that is safe to share, providing effective data support for further machine learning and deep learning tasks.
The specific technical solution achieving the aim of the invention is as follows:
a data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning comprises the following steps:
model training phase
Step S1: in each communication round of federated learning, the central server randomly selects a proportion K_1 of all local clients, where K_1 ranges from 10% to 50%, and then sends its encoder parameters to the selected local clients, which use them to update their own encoder parameters;
Step S2: the selected local clients train the generative model, a variational autoencoder, on their local training sets, taking a mean-square-error loss function plus the KL divergence as the optimization objective and using gradient descent as the optimization method to train the local model iteratively;
Step S3: after local training finishes, the selected clients upload the encoder parameters of their local variational autoencoders to the central server via network communication;
Step S4: the central server aggregates the encoder parameters uploaded by the local clients and updates its own encoder parameters;
Step S5: steps S1-S4 are repeated until every local client has been selected by the central server at least 3-5 times; the current encoder parameters of the central server are then sent to all clients, whose encoders update their parameters accordingly;
Step S6: each local model maps its original data to a low-dimensional space through its encoder and clusters that space with the unsupervised clustering algorithm K-means++ to obtain G_1 groups, where G_1 ranges from 3 to 5; the low-dimensional vectors of each group are averaged, and the resulting vectors are uploaded to the central server;
Step S7: after receiving all the low-dimensional vectors sent by the local clients, the central server clusters them with the unsupervised clustering algorithm K-means++ into G_2 groups, where G_2 ranges from 4 to 8; each client is assigned to the group that contains the most of its low-dimensional vectors;
Step S8: after the local clients have been grouped, federated model training is carried out independently within each group;
Step S9: in each communication round within a group, the group's central server randomly selects a proportion K_2 of the group's local clients, where K_2 ranges from 40% to 80%, and then sends the group server's encoder and decoder parameters to the selected local clients, which update their encoder and decoder parameters;
Step S10: the selected clients in each group perform local model training as in step S2;
Step S11: the selected clients in each group upload the encoder and decoder parameters of their local variational autoencoders to the group's central server via network communication;
Step S12: the central server of each group aggregates the encoder and decoder parameters uploaded by the local clients and updates its own encoder and decoder parameters;
Step S13: steps S9-S12 are repeated until each group's model converges or a fixed number of communication rounds is reached; training then stops, and each group obtains a final global generative model;
model prediction phase
Step S14: N_s random samples are drawn from a standard normal distribution, where N_s is adjusted according to the specific business scenario;
Step S15: the clients of each group map the random samples into realistic, safely shareable data using the decoder of the global generative model.
Advantages of the invention
(1) In steps S1-S5, with the client data protected and never leaving the local site, the encoders of the client models are trained locally and model-related information is exchanged among the parties via the federated learning framework; this exchange exposes none of the protected raw data. Because more data can be reached in this particular way, the encoder of the variational autoencoder acquires a stronger information compression capability than it would from purely local training.
(2) In step S6, the encoder, with its strong information compression capability, maps the original data to a low-dimensional space while carrying noise sampled from a standard normal distribution. Low-dimensional vectors that reflect the data's information can thus be obtained while guaranteeing that the central server cannot reconstruct the original data from them, so the privacy security of the data is well protected.
(3) In step S7, considering that the data of different clients are very likely not independent and identically distributed, which markedly degrades the performance of the overall federated model, the proposed data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning uses the information in the low-dimensional vectors extracted by the local clients to group the clients with the unsupervised clustering algorithm K-means++: clients with similar data distributions are placed in the same group, and clients whose data distributions differ greatly are placed in different groups.
(4) In steps S8-S13, federated model training is carried out independently within each group. Because the clients in each group perform federated training under similar data distributions, the harm that distribution differences cause to federated training is mitigated to a large extent. The encoder and decoder parameters of the variational autoencoders within each group can be optimized more effectively: the encoder's information compression capability improves while the decoder, drawing on more data information, acquires a stronger data generation capability. This helps improve the performance and generalization ability of each group's central generative model.
(5) In steps S14-S15, according to different common needs, the decoder of a particular group's central server can be used to generate large amounts of realistic, safely shareable data.
(6) The invention provides a data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning; no prior work combines the variational autoencoder with federated learning in this way. The proposed training method skillfully exploits data scattered across different sites and effectively improves the data generation capability of the variational autoencoder.
(7) At present, the generally low openness of data sharing between different systems and organizations causes the information island problem, and carelessly sharing data for model training can compromise user privacy and seriously violate laws and regulations. To solve these problems, the proposed method integrates the federated learning architecture into the variational autoencoder model so that large amounts of safe, shareable, valuable data can be generated, providing effective data support for further machine learning and deep learning tasks.
(8) The data collected by different devices may vary significantly; for example, collectors may have different preferences, and different geographical locations may yield photos of different styles. Federated learning is strongly limited by such distribution differences: when the participants' data distributions differ greatly, the performance of the federated model drops sharply. The method skillfully exploits the characteristics of the variational autoencoder, runs K-means++ clustering on the safe low-dimensional vectors extracted by the participants, which leak no original data, groups the participants accordingly, and then trains federated models within the separate groups. This mitigates the optimization problems that inconsistent data distributions cause for federated training and finally yields a federated generative model with good data generation performance. In the inference stage this model can generate more data that is safe to share, providing effective data support for further machine learning and deep learning tasks.
Drawings
FIG. 1 is a diagram of the variational autoencoder model used by the central server and the local clients in the invention;
FIG. 2 is a framework diagram of the invention;
FIG. 3 is a flow chart of the prediction phase of the invention.
Detailed Description
The invention is described in further detail below with reference to the specific embodiments and the accompanying drawings. Conditions, procedures, training methods, and the like for practicing the invention are, except where specifically mentioned below, common knowledge in the art, and the invention is not particularly limited in those respects.
The invention provides a data generation method based on a variational self-encoder, an unsupervised clustering algorithm and federal learning, which specifically comprises the following steps:
Assume there are N local clients {P_1, P_2, …, P_N} holding training data {D_1, D_2, …, D_N}; federated learning also requires a central server C. The central server C holds no dataset and mainly cooperates with the clients to complete the model training task. At the model level, participant P_i holds a variational autoencoder M_i, and the central server C holds a variational autoencoder M_g. All variational autoencoders involved in the invention have the same structure, consisting of an encoder and a decoder, each a multilayer convolutional neural network. The encoder parameters on a local client are denoted θ_e and the decoder parameters θ_d; the encoder parameters on the central server C are denoted θ_ge and the decoder parameters θ_gd. In fact, the invention places no strong restriction on the specific network structures of the encoder and the decoder; they merely need to satisfy the variational autoencoder model architecture. The concrete model structure of the variational autoencoder is shown in FIG. 1, and the framework of the method is shown in FIG. 2.
The aim of the invention is to train a federated data generation model with good data generation performance. In the prediction stage this model can generate more data that is safe to share, providing effective data support for further machine learning and deep learning tasks.
Model training phase
The encoder parameters of the central server's variational autoencoder are randomly initialized to θ_ge and the decoder parameters to θ_gd.
Step S1: in each communication round of federated learning, the central server randomly selects a proportion K_1 of all local clients, where K_1 ranges from 10% to 50%. Denote the selected set of local clients P = {P_1, P_5, …, P_{N-2}}. The central server's encoder parameters θ_ge are sent to the clients in P; upon receiving them, each local client updates the encoder parameters θ_e of its local variational autoencoder to θ_ge.
Step S2: the clients in the set P utilize the local data to train the variational self-encoder. The training process is a parameter optimization process, in which a certain client P k For example, the encoder parameter of its variational self-encoder model is θ e And decoder parameter θ d On the basis of minimizing KL divergence and reconstruction loss, the corresponding optimization targets are as follows:
Figure BDA0003546851500000091
wherein x is the input of the variational self-encoder, x gets the mean and variance through the encoder, then gets z through this normal distribution sampling, z gets the output through the decoder
Figure BDA0003546851500000092
Representing the loss of mean square error of the input and output,
Figure BDA0003546851500000093
representing a distribution
Figure BDA0003546851500000094
KL divergence distance from the standard normal distribution N (0, I);
the invention adopts a Stochastic Gradient Descent (SGD) method to optimize the objective function, the learning rate is 0.01, the batch size is 64, and the dimension of the low-dimensional vector z is 32. Training 5 epochs locally, during which the auto-encoder model encoder parameters θ are differentiated e And decoder parameter θ d The information compression capability of the encoder is improved and the data generation capability of the decoder is improved by continuous updating and optimization;
and step S3: after the client in the set P is trained by the local model in the step S2, the parameter theta of the encoder in the local variational decoder model is transmitted in a network communication mode e Uploading to the central server C;
and step S4: the central server aggregates the encoder parameters uploaded from the local client and updates the encoder parameters theta of the central server ge
Figure BDA0003546851500000101
Wherein
Figure BDA0003546851500000102
Representing clients s i Is derived from the encoder parameters of the encoder,
Figure BDA0003546851500000103
representing a client s i The number of data set samples;
step S5: repeating steps S1-S4 until all local clients are selected by the central server at least 3-5 times. Because each round of clients is randomly selected according to a certain proportion, in order to accurately group N clients in a subsequent process, each client needs to be ensured to be selected at least 3-5 times. The encoder parameter theta of the central server at the moment ge Sending the parameters to all clients, and updating the parameter theta of the encoder of the local client e =θ ge
Step S6: each client maps its local original data x to low-dimensional vectors z through its encoder. The set of all locally obtained low-dimensional vectors is denoted SetZ and is clustered with the unsupervised clustering algorithm K-means++ as follows:
(1) Randomly select one sample point from the set SetZ as the first initial cluster center;
(2) Compute each sample's shortest distance to the existing cluster centers, denoted D(z), and select the sample point in SetZ with the largest D(z) as the next cluster center;
(3) Repeat (2) until G_1 cluster centers have been selected;
(4) Assign every sample point to the class of its nearest center, then compute the mean of all sample points in each of the G_1 classes to serve as the G_1 center points for the next iteration;
(5) Repeat (4) until the center points no longer change or the specified number of iterations is reached, completing the clustering;
By K-means++ clustering, the SetZ on each client is divided into G_1 groups; all low-dimensional vectors in each group are averaged, and the resulting G_1 average low-dimensional vectors are uploaded to the central server, where G_1 ranges from 3 to 5.
Step S7: after receiving all the low-dimensional vectors sent by the local clients, the central server clusters them with the unsupervised clustering algorithm K-means++ into G_2 groups, where G_2 ranges from 4 to 8. Each client P_i is assigned to the group containing the most of its low-dimensional vectors. In the operations of S6-S7, the encoder, with its strong information compression capability, maps the original data to low-dimensional vectors that carry noise sampled from a standard normal distribution. Low-dimensional vectors reflecting the data's information are thus obtained while guaranteeing that the central server cannot reconstruct the original data from them, so the privacy security of the data is well protected;
Step S8: after the local clients have been grouped, federated model training is carried out independently within each group;
Step S9: in each communication round within a group, the group's central server C randomly selects a proportion K_2 of the group's local clients, where K_2 ranges from 40% to 80%, and then sends the encoder parameters θ_ge and decoder parameters θ_gd of the group server's variational autoencoder to the selected local clients, which use them to update their encoder parameters θ_e and decoder parameters θ_d;
Step S10: the selected clients in each group perform local model training as in step S2;
Step S11: the selected clients in each group upload the trained encoder parameters θ_e and decoder parameters θ_d of their local variational autoencoders to the group's central server via network communication;
Step S12: the central server of each group aggregates the encoder parameters θ_e and decoder parameters θ_d uploaded by the local clients, using the same aggregation scheme as step S4, and updates the group server's encoder parameters θ_ge and decoder parameters θ_gd;
Step S13: steps S9-S12 are repeated until each group's model converges or a fixed number of communication rounds is reached. Because each client undergoes federated training together with clients whose data distributions are similar to its own, the harm that data-domain differences cause to federated training is mitigated to a large extent, and each group can be trained into a final global generative model;
Model prediction phase
Step S14: N_s random samples are drawn from a standard normal distribution, where N_s is adjusted according to the specific business scenario;
Step S15: the clients of each group use the global generative model M_g to map the sample set Z into a realistic, safely shareable dataset X. The prediction process is shown in FIG. 3.

Claims (2)

1. A data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning, characterized by comprising the following steps:
model training phase
step S1: in each communication round of federated learning, the central server randomly selects a proportion K_1 of all local clients and then sends its encoder parameters to the selected local clients, which use them to update their own encoder parameters; wherein K_1 ranges from 10% to 50%;
step S2: the selected local clients train the generative model, a variational autoencoder, on their local training sets, taking a mean-square-error loss function plus the KL divergence as the optimization objective and using gradient descent as the optimization method to train the local model iteratively;
step S3: after local training finishes, the selected clients upload the encoder parameters of their local variational autoencoders to the central server via network communication;
step S4: the central server aggregates the encoder parameters uploaded by the local clients and updates its own encoder parameters;
step S5: steps S1 to S4 are repeated until every local client has been selected by the central server at least 3-5 times; the current encoder parameters of the central server are then sent to all clients, whose encoders update their parameters accordingly;
step S6: each local model maps its original data to a low-dimensional space through its encoder and clusters that space with the unsupervised clustering algorithm K-means++ to obtain G_1 groups; the low-dimensional vectors of each group are averaged, and the resulting vectors are uploaded to the central server; wherein G_1 ranges from 3 to 5;
step S7: after receiving all the low-dimensional vectors sent by the local clients, the central server clusters them with the unsupervised clustering algorithm K-means++ into G_2 groups, and each client is assigned to the group containing the most of its low-dimensional vectors; wherein G_2 ranges from 4 to 8;
step S8: after the local clients have been grouped, federated model training is carried out independently within each group;
step S9: in each communication round within a group, the group's central server randomly selects a proportion K_2 of the group's local clients and then sends the group server's encoder and decoder parameters to the selected local clients, which update their encoder and decoder parameters; wherein K_2 ranges from 40% to 80%;
step S10: the selected clients in each group perform local model training as in step S2;
step S11: the selected clients in each group upload the encoder and decoder parameters of their local variational autoencoders to the group's central server via network communication;
step S12: the central server of each group aggregates the encoder and decoder parameters uploaded by the local clients and updates its own encoder and decoder parameters;
step S13: steps S9 to S12 are repeated until each group's model converges or a fixed number of communication rounds is reached; training then stops, and each group obtains a final global generative model;
model prediction phase
step S14: N_s random samples are drawn from a standard normal distribution, where N_s is adjusted according to the specific business scenario;
step S15: the clients of each group map the random samples into realistic, safely shareable data using the decoder of the global generative model.
2. The data generation method based on a variational autoencoder, an unsupervised clustering algorithm, and federated learning according to claim 1, characterized in that step S6 specifically comprises: each client maps its local original data x to low-dimensional vectors z through its encoder; the set of all locally obtained low-dimensional vectors is denoted SetZ and is clustered with the unsupervised clustering algorithm K-means++ as follows:
(1) Randomly select one sample point from the set SetZ as the first initial cluster center;
(2) Compute each sample's shortest distance to the existing cluster centers, denoted D(z), and select the sample point in SetZ with the largest D(z) as the next cluster center;
(3) Repeat (2) until G_1 cluster centers have been selected;
(4) Assign every sample point to the class of its nearest center, then compute the mean of all sample points in each of the G_1 classes to serve as the G_1 center points for the next iteration;
(5) Repeat (4) until the center points no longer change or the specified number of iterations is reached, completing the clustering;
by K-means++ clustering, the SetZ on each client is divided into G_1 groups; all low-dimensional vectors in each group are averaged, and the resulting G_1 average low-dimensional vectors are uploaded to the central server, where G_1 ranges from 3 to 5.
CN202210251482.2A 2022-03-15 2022-03-15 Data generation method based on a variational autoencoder, an unsupervised clustering algorithm and federated learning Pending CN115238908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210251482.2A CN115238908A (en) 2022-03-15 2022-03-15 Data generation method based on a variational autoencoder, an unsupervised clustering algorithm and federated learning


Publications (1)

Publication Number Publication Date
CN115238908A true CN115238908A (en) 2022-10-25

Family

ID=83667889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210251482.2A 2022-03-15 2022-03-15 Data generation method based on a variational autoencoder, an unsupervised clustering algorithm and federated learning

Country Status (1)

Country Link
CN (1) CN115238908A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860116A (en) * 2022-12-02 2023-03-28 广州图灵科技有限公司 Federal learning method based on generative model and deep transfer learning
CN115881306A (en) * 2023-02-22 2023-03-31 中国科学技术大学 Networked ICU intelligent medical decision-making method based on federal learning and storage medium
CN115881306B (en) * 2023-02-22 2023-06-16 中国科学技术大学 Networked ICU intelligent medical decision-making method based on federal learning and storage medium
CN116108491A (en) * 2023-04-04 2023-05-12 杭州海康威视数字技术股份有限公司 Data leakage early warning method, device and system based on semi-supervised federal learning
CN116108491B (en) * 2023-04-04 2024-03-22 杭州海康威视数字技术股份有限公司 Data leakage early warning method, device and system based on semi-supervised federal learning
CN116522228B (en) * 2023-04-28 2024-02-06 哈尔滨工程大学 Radio frequency fingerprint identification method based on feature imitation federal learning
CN116522228A (en) * 2023-04-28 2023-08-01 哈尔滨工程大学 Radio frequency fingerprint identification method based on feature imitation federal learning
CN116578674A (en) * 2023-07-07 2023-08-11 北京邮电大学 Federal variation self-coding theme model training method, theme prediction method and device
CN116578674B (en) * 2023-07-07 2023-10-31 北京邮电大学 Federal variation self-coding theme model training method, theme prediction method and device
CN116741388A (en) * 2023-08-14 2023-09-12 中国人民解放军总医院 Method for constructing cardiovascular critical severe disease large model based on federal learning
CN116741388B (en) * 2023-08-14 2023-11-21 中国人民解放军总医院 Method for constructing cardiovascular critical severe disease large model based on federal learning
CN117094412A (en) * 2023-08-18 2023-11-21 之江实验室 Federal learning method and device aiming at non-independent co-distributed medical scene
CN117077817B (en) * 2023-10-13 2024-01-30 之江实验室 Personalized federal learning model training method and device based on label distribution
CN117077817A (en) * 2023-10-13 2023-11-17 之江实验室 Personalized federal learning model training method and device based on label distribution

Similar Documents

Publication Publication Date Title
CN115238908A (en) Data generation method based on a variational autoencoder, an unsupervised clustering algorithm and federated learning
CN109523463A (en) A kind of face aging method generating confrontation network based on condition
CN112770291B (en) Distributed intrusion detection method and system based on federal learning and trust evaluation
CN112329940A (en) Personalized model training method and system combining federal learning and user portrait
CN115510494B (en) Multiparty safety data sharing method based on block chain and federal learning
CN114092769B (en) Transformer substation multi-scene inspection analysis method based on federal learning
CN110263928A (en) Protect the mobile device-based distributed deep learning training method of data-privacy
CN113518007B (en) Multi-internet-of-things equipment heterogeneous model efficient mutual learning method based on federal learning
CN113806735A (en) Execution and evaluation dual-network personalized federal learning intrusion detection method and system
CN113191530B (en) Block link point reliability prediction method and system with privacy protection function
CN114997420B (en) Federal learning system and method based on segmentation learning and differential privacy fusion
CN115905978A (en) Fault diagnosis method and system based on layered federal learning
CN111192206A (en) Method for improving image definition
Mei et al. Fedvf: Personalized federated learning based on layer-wise parameter updates with variable frequency
CN113972012A (en) Infectious disease prevention and control cooperative system based on alliance chain and public chain
CN116862022A (en) Efficient privacy protection personalized federal learning method for communication
CN117371555A (en) Federal learning model training method based on domain generalization technology and unsupervised clustering algorithm
CN114048838A (en) Knowledge migration-based hybrid federal learning method
CN116502709A (en) Heterogeneous federal learning method and device
CN116719607A (en) Model updating method and system based on federal learning
CN116843069A (en) Commuting flow estimation method and system based on crowd activity intensity characteristics
CN115730267A (en) Multi-source unbalanced credit data fusion method and system based on federal distillation learning
CN115908600A (en) Massive image reconstruction method based on prior regularization
CN116010832A (en) Federal clustering method, federal clustering device, central server, federal clustering system and electronic equipment
CN115345320A (en) Method for realizing personalized model under layered federal learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination