CN117592584A - Random multi-model privacy protection method based on federated learning

Random multi-model privacy protection method based on federated learning

Info

Publication number
CN117592584A
CN117592584A
Authority
CN
China
Prior art keywords
model
data
server
participants
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311689003.6A
Other languages
Chinese (zh)
Inventor
张泽飞
惠蓉
王崇文
董银环
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West Yunnan University Of Applied Sciences
Original Assignee
West Yunnan University Of Applied Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West Yunnan University Of Applied Sciences filed Critical West Yunnan University Of Applied Sciences
Priority to CN202311689003.6A priority Critical patent/CN117592584A/en
Publication of CN117592584A publication Critical patent/CN117592584A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a random multi-model privacy protection method based on federated learning, comprising a participant selection step, a model sharing pool construction step, a randomization step, and a model evaluation step. Compared with the prior art, the invention has the following advantages: it not only addresses the imbalanced data distribution among the parties but also strengthens data privacy protection through a differentiation mechanism; the shared model pool concept avoids hidden dangers such as model attacks; the differentiation concept lets different models be selected for training according to each participant's actual situation; enhanced randomization confuses the server's memory of each participant's data, thereby protecting participant data privacy; and a heuristic selection strategy selects participants according to the symmetry of data from the same application across different users.

Description

Random multi-model privacy protection method based on federated learning
Technical Field
The invention relates to the field of privacy protection in federated learning, and in particular to a random multi-model privacy protection method based on federated learning.
Background
Federated learning offers many benefits: because the parties involved in training do not need to exchange raw data directly, it protects user privacy and data security to some extent. However, as internet technology develops, users' demands for privacy protection grow, and federated learning models face new challenges in data privacy and security protection.
As an important branch of artificial intelligence, machine learning has achieved notable results in fields such as intelligent transportation, financial analysis, recommendation systems, and intelligent healthcare. However, in mainstream machine learning training and use, data constantly faces the risk of leakage, posing great challenges to personal privacy and data security. For example, events such as the Facebook and Yahoo data leaks have attracted considerable attention in industry and academia, making privacy protection and data security key issues for machine learning applications. With the rapid development of artificial intelligence, the demand for effective data sharing and fusion keeps growing; federated learning emerged to address the challenges of data privacy and data silos, and it has been widely applied in fields such as healthcare and financial analysis.
Federated learning is a distributed machine learning framework oriented to multi-user scenarios, designed to address the data silo and privacy protection problems faced by artificial intelligence. Under this technique, a server coordinates multiple participants to jointly train a global model without requiring the participants to upload their raw data, so that the data remains available but invisible. Training data in federated learning is distributed across the data owners' devices; the owners train locally and share only model parameters with the service provider, which aggregates the owners' updates with some algorithm (e.g., FedAvg or FedProx) to train the global model.
Typical federated learning generally comprises the following steps: (1) all participants download the latest model from the server; (2) each participant computes gradients using its local data, encrypts them, and uploads them to the server, which aggregates the gradients to update the model parameters; (3) the server distributes the updated model to each participant; (4) all participants update their local models. Steps (1)-(4) are iterated in turn until the model converges or a specified termination condition is reached.
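Steps (1)-(4) above can be sketched as a minimal simulation. Everything here is an illustrative assumption, not the patented method: the toy "gradient" (the mean residual toward a local target), the function names, and the use of plain lists as model parameters; aggregation is sample-weighted averaging in the style of FedAvg.

```python
# Illustrative sketch of one federated learning loop with FedAvg-style
# aggregation. Names, the toy gradient, and the data layout are assumptions.

def local_update(global_params, data, lr=0.1):
    """Step 2: a participant computes a toy 'gradient' from local data
    (here: the residual toward a local target) and updates the parameters."""
    grad = [p - d for p, d in zip(global_params, data)]
    return [p - lr * g for p, g in zip(global_params, grad)]

def fed_avg(updates, weights):
    """Server aggregates participant parameters by weighted average."""
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total
            for i in range(dim)]

def training_round(global_params, participants):
    """One iteration of steps (1)-(4): download, local update, aggregate."""
    updates = [local_update(global_params, data) for data, _ in participants]
    weights = [n for _, n in participants]  # weight by local sample count
    return fed_avg(updates, weights)

# participants: (local data target, number of local samples)
participants = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
params = [0.0, 0.0]
for _ in range(5):  # iterate until convergence or a termination condition
    params = training_round(params, participants)
```

In a real deployment the uploaded gradients would be encrypted, as the step description notes; that is omitted here for brevity.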
At present, federated learning privacy protection still faces the following problems:
(1) Privacy protection mechanisms are costly
In recent years, researchers have proposed a large number of privacy protection methods based on techniques such as differential privacy, secret sharing, homomorphic encryption, and secure multiparty computation. However, these methods tend to carry large computational and communication overheads, which hurt the usability and real-time performance of applications. Designing efficiency-balanced privacy protection tailored to the specific requirements of internet services, especially a mechanism that can satisfy real-time data queries, therefore remains a challenge for privacy protection scheme design.
(2) Privacy protection mechanisms are not robust
Differential privacy, homomorphic encryption, secure multiparty computation, and similar techniques can improve the security of federated learning, but these privacy protection mechanisms also affect model performance and accuracy, reducing the robustness of federated learning and, to some extent, the applicability of the model. Data in real federated learning scenarios is often not independent and identically distributed (non-IID), yet existing federated learning privacy protection mechanisms assume IID data to some degree. Designing data security and privacy protection mechanisms under non-IID data is therefore another major challenge for federated learning.
(3) Lack of an effective incentive mechanism
Ensuring that the server and the local participants are honest rather than malicious is a further challenge in privacy protection.
Disclosure of Invention
The technical problem to be solved by the invention is that, for more and more federated learning applications to be deployed safely, designing an efficiency-balanced privacy protection mechanism is imperative. The invention combines a heuristic participant selection method, a shared model pool, randomization, and related techniques to provide a random multi-model privacy protection mechanism based on federated learning, so as to improve on existing federated learning privacy protection mechanisms and provide better privacy protection for federated learning applications.
In order to solve the above technical problems, the invention provides the following technical scheme: a random multi-model privacy protection method based on federated learning, comprising the following steps:
step one, a participant selection step, in which the server sends heuristic information to clients, confirms each client's communication condition, data quality, and willingness to participate from the clients' feedback, and selects some of the clients as participants in model training;
step two, a model sharing pool construction step, in which a model sharing pool containing multiple models is built, and different models are selected to participate in training according to the participants' data distributions;
step three, a randomization step, used to confuse the server's memory of the participants' parameter information so that the server cannot associate parameter information with particular participants;
and step four, a model evaluation step, used to evaluate each participant's contribution and data quality and to provide a participant selection mechanism for training.
Compared with the prior art, the invention has the following advantages: it not only addresses the imbalanced data distribution among the parties but also strengthens data privacy protection through a differentiation mechanism; the shared model pool concept avoids hidden dangers such as model attacks; the differentiation concept lets different models be selected for training according to each participant's actual situation; enhanced randomization confuses the server's memory of each participant's data, thereby protecting participant data privacy; and a heuristic selection strategy selects participants according to the symmetry of data from the same application across different users.
Further, in the method, the server provides a shared model pool, and different participants can select training models from the pool according to their data volume and data distribution.
Further, in step four, the symmetry principle of training results on the same type of data is used to evaluate the quality of model training; through repeated iteration, poorly performing participants are periodically removed and new participants are randomly added.
Further, the method adopts a hypothesis-test approach: let M_ALL be the model trained by centralized machine learning on the combined data of all participants, and M_ONLY the model trained independently by a participant. If P_ALL and P_ONLY denote the performance of M_ALL and M_ONLY respectively, and there exists a non-negative real number L_P such that |P_ALL - P_ONLY| < L_P, then the participant may continue to participate in the next round of training.
Further, the participant selection in the method uses a heuristic selection strategy: if the server wants to randomly select N clients to participate in model training, it sends heuristic messages to more than N devices, and the clients reply to the server after receiving them; the server selects N participants from the replying clients based on the response speed and data information returned with the messages. Concretely, the server sends exploration information and issues invitations to the explored clients; a client that receives an invitation and intends to participate in training returns an acknowledgement together with a self-evaluation to the server, and the server selects a suitable number of clients for model training from the acknowledging clients according to the returned data. The exploration information sent by the server contains model information, and participants are then selected according to the symmetry of data from the same application across different users.
Drawings
FIG. 1 is a schematic diagram of the architecture of the federated-learning-based random multi-model privacy protection method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
In a specific implementation of the invention, as shown in FIG. 1, the invention provides a random multi-model privacy protection method based on federated learning. During training, the server assigns different models to participants with different data distribution characteristics; each participant uploads its trained model parameters to the server, and the server integrates the feature parameters by weighted averaging. In the end the server knows only each participant's parameters, not each participant's model or data. Because the training models selected by the participants differ, the participants do not need to perform operations such as enforcing data consistency. The algorithm also adds a participant training-quality evaluation mechanism: new participants are continuously added during training to replace poorly performing ones, so that a well-performing model can genuinely be trained from small amounts of data. The algorithm therefore both addresses the imbalanced data distribution among the parties and strengthens data privacy protection through the differentiation mechanism.
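The server-side integration described above can be sketched as follows. The grouping of uploads by a model identifier, the tuple layout, and the sample-count weighting are all illustrative assumptions; the point is that the server receives only (model id, parameters, sample count) and averages parameters per model type, never seeing raw data.

```python
# Sketch of weighted-average integration when participants train
# different models. Names and the upload format are assumptions.
from collections import defaultdict

def integrate(uploads):
    """uploads: list of (model_id, params, n_samples).
    Returns {model_id: sample-weighted average of the parameters}."""
    groups = defaultdict(list)
    for model_id, params, n in uploads:
        groups[model_id].append((params, n))
    merged = {}
    for model_id, items in groups.items():
        total = sum(n for _, n in items)
        dim = len(items[0][0])
        merged[model_id] = [
            sum(n * p[i] for p, n in items) / total for i in range(dim)
        ]
    return merged

uploads = [
    ("cnn", [1.0, 1.0], 10),  # two participants trained the "cnn" model
    ("cnn", [3.0, 3.0], 30),
    ("mlp", [5.0], 20),       # a third trained a different model
]
merged = integrate(uploads)   # the server learns parameters only, per model
```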
In one embodiment of the invention, as shown in FIG. 1, the invention introduces the concept of a shared model pool. In existing federated learning mechanisms, all participants train the same model. That scheme is poorly suited to participants whose data is not independent and identically distributed, and it also opens the possibility of model attacks by malicious participants: the trained global model may leak to other participants or even to devices that never participated in training. Meanwhile, statistical heterogeneity significantly increases the complexity of problem modeling, theoretical analysis, and evaluation of solutions. In the invention, the server provides a shared model pool from which different participants select training models according to their own conditions, such as data volume and data distribution. This strategy greatly alleviates the adverse effects of maldistributed data, such as the omission of data-starved participants. In addition, because the server only knows which training model each participant has currently selected, hidden dangers such as model attacks are well avoided.
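The shared model pool idea can be sketched as below. The pool contents, the minimum-sample thresholds, and the label-entropy heuristic for measuring data balance are all assumptions invented for illustration; the patent only specifies that participants choose models according to data volume and distribution.

```python
# Sketch of a shared model pool: the server publishes candidate models
# with data requirements, and each participant picks one that fits its
# local data. Pool entries and thresholds are illustrative assumptions.
import math

MODEL_POOL = {
    "small_linear": {"min_samples": 0},
    "medium_tree":  {"min_samples": 100},
    "large_net":    {"min_samples": 10_000},
}

def label_entropy(labels):
    """Shannon entropy of the local label distribution (a balance measure)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def pick_model(n_samples, labels):
    """Choose a model the participant's data volume qualifies for;
    data-starved participants still get a model instead of being omitted."""
    eligible = [name for name, req in MODEL_POOL.items()
                if n_samples >= req["min_samples"]]
    if label_entropy(labels) < 0.5:
        return eligible[0]  # skewed data -> the simplest eligible model
    # balanced data -> the largest model the data volume supports
    return max(eligible, key=lambda m: MODEL_POOL[m]["min_samples"])

choice = pick_model(500, [0, 1, 0, 1])  # balanced labels, 500 samples
```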
In one embodiment of the invention, as shown in FIG. 1, the invention introduces the concept of differentiation. Differentiation refers to the differences among the participants' devices and among their data distributions in federated learning. Because of these differences, having different participants train the same model, as in classical federated learning, has certain drawbacks; different models should therefore be chosen for training according to each participant's actual situation.
In one embodiment of the invention, as shown in FIG. 1, the invention enhances the concept of randomization. A curious server may learn private information about client training data, for example by means of generative adversarial networks. The shared model pool and differentiated training provided by the invention already improve training efficiency and participant data privacy to a great extent; however, because the participants' model selections and parameter updates can be recorded by the server, a malicious or curious server could still obtain each participant's private data. The invention therefore enhances the randomization technique. This differs from prior-art federated learning, where the server merely selects participants at random: the randomization here aims to confuse the server's memory of individual participant data, thereby protecting participant data privacy.
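One plausible way to realize "confusing the server's memory", sketched below, is to strip client identities and shuffle the upload order before the server sees a round's updates. This mixing step is an assumption about how the randomization could work, not a construction given in the patent.

```python
# Sketch of enhanced randomization: client ids are removed and the
# upload order is shuffled, so the server cannot build a per-participant
# record of model choices and parameter updates. The mixing scheme and
# names are illustrative assumptions.
import random

def anonymize_round(uploads, seed=None):
    """uploads: list of (client_id, model_id, params).
    Returns the same updates with client ids removed, in random order."""
    rng = random.Random(seed)
    stripped = [(model_id, params) for _, model_id, params in uploads]
    rng.shuffle(stripped)
    return stripped

uploads = [("alice", "cnn", [1.0]), ("bob", "mlp", [2.0]), ("carol", "cnn", [3.0])]
anon = anonymize_round(uploads, seed=7)
# the server can still aggregate per model_id, but cannot tell which
# participant produced which update
```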
In one embodiment of the invention, as shown in FIG. 1, the invention evaluates the quality of model training by the symmetry principle of training results on the same type of data. Through repeated iteration, poorly performing participants are periodically removed and new participants are randomly added. The invention adopts a hypothesis-test approach: let M_ALL be the model trained by centralized machine learning on the combined data of all participants, and M_ONLY the model trained independently by a participant. If P_ALL and P_ONLY denote the performance of M_ALL and M_ONLY respectively, and there exists a non-negative real number L_P such that |P_ALL - P_ONLY| < L_P, then the participant may continue to participate in the next round of training.
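The criterion |P_ALL - P_ONLY| < L_P can be expressed directly in code. The performance numbers and participant names below are placeholders; only the comparison rule and the prune-then-replace loop structure come from the description above.

```python
# Sketch of the evaluation rule: a participant stays in the next round
# only if the gap between the centralized model's performance P_ALL and
# its own model's performance P_ONLY is below the tolerance L_P.

def keeps_participating(p_all, p_only, l_p):
    """Return True iff |P_ALL - P_ONLY| < L_P (the symmetry criterion)."""
    if l_p < 0:
        raise ValueError("L_P must be a non-negative real number")
    return abs(p_all - p_only) < l_p

def prune_round(p_all, participants, l_p):
    """Drop poorly performing participants; the caller then randomly
    recruits replacements, per the repeated-iteration scheme above."""
    return [name for name, p_only in participants
            if keeps_participating(p_all, p_only, l_p)]

survivors = prune_round(0.90, [("a", 0.88), ("b", 0.60)], l_p=0.05)
# participant "b" is removed: |0.90 - 0.60| = 0.30 >= 0.05
```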
In one embodiment of the invention, as shown in FIG. 1, participant selection uses a heuristic selection strategy. If the server wants to randomly select N clients to participate in model training, it sends heuristic messages to more than N devices, and the clients reply to the server after receiving them; the server selects N participants from the replying clients based on the response speed and data information returned with the messages. Concretely, the server sends exploration information and issues invitations to the explored clients; a client that receives an invitation and intends to participate in training returns an acknowledgement together with a self-evaluation to the server, and the server selects a suitable number of clients for model training from the acknowledging clients according to the returned data. The exploration information sent by the server contains model information, and participants are then selected according to the symmetry of data from the same application across different users.
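The probe-then-select flow above can be sketched as follows. The scoring rule (prefer higher self-reported data quality, then lower latency) is an illustrative assumption; the patent only says the server uses the returned speed and data information.

```python
# Sketch of the heuristic selection strategy: probe more than N devices,
# collect replies from willing clients, keep the N best responders.
# The reply format and scoring formula are assumptions.

def select_participants(replies, n):
    """replies: list of (client_id, latency_ms, data_quality in [0, 1])
    from clients that acknowledged willingness to train.
    Returns the ids of the n most suitable clients."""
    if len(replies) < n:
        raise ValueError("probe more than N devices before selecting")
    # higher self-reported data quality first, then lower latency
    scored = sorted(replies, key=lambda r: (-r[2], r[1]))
    return [client_id for client_id, _, _ in scored[:n]]

replies = [
    ("c1", 120, 0.9),  # acknowledging clients: (id, latency, data quality)
    ("c2", 40,  0.9),
    ("c3", 30,  0.4),
    ("c4", 200, 0.7),
]
chosen = select_participants(replies, n=2)  # -> ["c2", "c1"]
```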
While the fundamental principles, main features, and advantages of the invention have been shown and described, it will be understood by those skilled in the art that the invention is not limited to the foregoing embodiments; the foregoing description merely illustrates the principles of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A random multi-model privacy protection method based on federated learning, characterized by comprising the following steps:
step one, a participant selection step, in which the server sends heuristic information to clients, confirms each client's communication condition, data quality, and willingness to participate from the clients' feedback, and selects some of the clients as participants in model training;
step two, a model sharing pool construction step, in which a model sharing pool containing multiple models is built, and different models are selected to participate in training according to the participants' data distributions;
step three, a randomization step, used to confuse the server's memory of the participants' parameter information so that the server cannot associate parameter information with particular participants;
and step four, a model evaluation step, used to evaluate each participant's contribution and data quality and to provide a participant selection mechanism for training.
2. The random multi-model privacy protection method based on federated learning as claimed in claim 1, wherein: the server provides a shared model pool, and different participants can select training models from the pool according to their data volume and data distribution.
3. The random multi-model privacy protection method based on federated learning as claimed in claim 1, wherein: in step four, the symmetry principle of training results on the same type of data is used to evaluate the quality of model training; through repeated iteration, poorly performing participants are periodically removed and new participants are randomly added.
4. The random multi-model privacy protection method based on federated learning as claimed in claim 1, wherein: the method adopts a hypothesis-test approach: let M_ALL be the model trained by centralized machine learning on the combined data of all participants, and M_ONLY the model trained independently by a participant; if P_ALL and P_ONLY denote the performance of M_ALL and M_ONLY respectively, and there exists a non-negative real number L_P such that |P_ALL - P_ONLY| < L_P, then the participant may continue to participate in the next round of training.
5. The random multi-model privacy protection method based on federated learning as claimed in claim 1, wherein: participant selection uses a heuristic selection strategy: if the server wants to randomly select N clients to participate in model training, it sends heuristic messages to more than N devices, and the clients reply to the server after receiving them; the server selects N participants from the replying clients based on the response speed and data information returned with the messages; the server sends exploration information and issues invitations to the explored clients; a client that receives an invitation and intends to participate in training returns an acknowledgement together with a self-evaluation to the server, and the server selects a suitable number of clients for model training from the acknowledging clients according to the returned data; the exploration information sent by the server contains model information, and participants are then selected according to the symmetry of data from the same application across different users.
CN202311689003.6A 2023-12-11 2023-12-11 Random multi-model privacy protection method based on federal learning Pending CN117592584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311689003.6A CN117592584A (en) 2023-12-11 2023-12-11 Random multi-model privacy protection method based on federal learning

Publications (1)

Publication Number Publication Date
CN117592584A true CN117592584A (en) 2024-02-23

Family

ID=89915114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311689003.6A Pending CN117592584A (en) 2023-12-11 2023-12-11 Random multi-model privacy protection method based on federal learning

Country Status (1)

Country Link
CN (1) CN117592584A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051608A (en) * 2021-03-11 2021-06-29 佳讯飞鸿(北京)智能科技研究院有限公司 Method for transmitting virtualized sharing model for federated learning
WO2021158313A1 (en) * 2020-02-03 2021-08-12 Intel Corporation Systems and methods for distributed learning for wireless edge dynamics
CN113536382A (en) * 2021-08-09 2021-10-22 北京理工大学 Block chain-based medical data sharing privacy protection method by using federal learning
CN114462090A (en) * 2022-02-18 2022-05-10 北京邮电大学 Tightening method for differential privacy budget calculation in federal learning
CN114841364A (en) * 2022-04-14 2022-08-02 北京理工大学 Federal learning method capable of meeting personalized local differential privacy requirements
CN115099421A (en) * 2022-04-07 2022-09-23 中国联合网络通信集团有限公司 Group-oriented federal learning system
CN115775010A (en) * 2022-11-23 2023-03-10 国网江苏省电力有限公司信息通信分公司 Electric power data sharing method based on horizontal federal learning
CN115906162A (en) * 2022-11-17 2023-04-04 重庆邮电大学 Privacy protection method based on heterogeneous representation and federal factorization machine
CN115964706A (en) * 2021-10-12 2023-04-14 中国电信股份有限公司 Training data poisoning defense method under federal learning scene
CN116796832A (en) * 2023-06-27 2023-09-22 西安电子科技大学 Federal learning method, system and equipment with high availability under personalized differential privacy scene
EP4266220A1 (en) * 2022-04-21 2023-10-25 E-GROUP ICT SOFTWARE Informatikai Zrt. Method for efficient machine learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAVID BYRD et al.: "Differentially private secure multi-party computation for federated learning in financial applications", ICAIF '20: Proceedings of the First ACM International Conference on AI in Finance, no. 16, 7 October 2021 (2021-10-07), pages 1-9 *
SHULAI ZHANG et al.: "Dubhe: Towards Data Unbiasedness with Homomorphic Encryption in Federated Learning Client Selection", ICPP '21: Proceedings of the 50th International Conference on Parallel Processing, no. 83, 5 October 2021 (2021-10-05), pages 1-10 *
LI Cong: "Resource optimization allocation for multi-model federated learning", China Masters' Theses Full-text Database, Information Science and Technology, no. 07, 15 July 2023 (2023-07-15), pages 140-24 *

Similar Documents

Publication Publication Date Title
Liu et al. A secure federated learning framework for 5G networks
CN111935156B (en) Data privacy protection method for federated learning
Ma et al. Transparent contribution evaluation for secure federated learning on blockchain
Kim A model and case for supporting participatory public decision making in e-democracy
Azouvi et al. Betting on blockchain consensus with fantomette
Yang et al. A practical cross-device federated learning framework over 5g networks
Chen et al. Secure collaborative deep learning against GAN attacks in the Internet of Things
CN112765631B (en) Safe multi-party computing method based on block chain
CN112990987B (en) Information popularization method and device, electronic equipment and storage medium
CN115795518B (en) Block chain-based federal learning privacy protection method
CN112597542B (en) Aggregation method and device of target asset data, storage medium and electronic device
Alwen et al. Collusion-free multiparty computation in the mediated model
CN111966976A (en) Anonymous investigation method based on zero knowledge proof and block chain
Petruzzi et al. Experiments with social capital in multi-agent systems
CN117592584A (en) Random multi-model privacy protection method based on federal learning
Zhou et al. Mobile augmented reality with federated learning in the metaverse
CN110365671A (en) A kind of intelligent perception incentive mechanism method for supporting secret protection
CN115361196A (en) Service interaction method based on block chain network
Filippov et al. Online protest mobilization: building a computational model
CN113657616A (en) Method and device for updating federal learning model
Xu et al. Privacy-preserving task-matching and multiple-submissions detection in crowdsourcing
CN113657614B (en) Updating method and device of federal learning model
Chandran et al. Comparison-based MPC in Star Topology.
CN116614273B (en) Federal learning data sharing system and model construction method in peer-to-peer network based on CP-ABE
Chen et al. Practical multi-party private set intersection cardinality and intersection-sum protocols under arbitrary collusion 1

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination