CN115577360A - Gradient-independent clustering federal learning method and system - Google Patents
Gradient-independent clustering federal learning method and system
- Publication number
- CN115577360A (application CN202211422140.9A)
- Authority
- CN
- China
- Prior art keywords
- client
- cluster
- clients
- malicious
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Mathematical Analysis (AREA)
- Computer Hardware Design (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Virology (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a gradient-independent clustering federal learning method and system. In the method, each client calculates its own data distribution information and the intersection similarity between that information and the data distribution information of the other clients, and constructs an intersection similarity vector; the server collects the intersection similarity vectors, constructs a similarity matrix, clusters the clients and executes model training; after the server detects that model accuracy has dropped and determines a malicious cluster, it selects clients to form a verification committee, which verifies and votes to exclude the malicious models and retain the benign models. The server does not rely on client gradient information for clustering but instead clusters according to the intersection similarity between the clients' data distributions, thereby avoiding leakage of client gradient information, protecting the gradient security of the clients, enhancing the safety and reliability of the clustered federal learning process, and improving training accuracy.
Description
Technical Field
The invention relates to the technical field of artificial intelligence clustering federal learning, in particular to a gradient-independent clustering federal learning method and system.
Background
Although information has become increasingly abundant with the development of informatization, much of it remains locked in isolated islands because it is highly sensitive. A very typical application field is medicine. Medical data is extremely sensitive because it may involve important patient privacy, and it is usually kept separately by different hospitals. The emphasis of the data owned by each hospital may also differ (for example, some hospitals specialize in heart disease, others in kidney disease), which gives rise to the problem of non-independent and identically distributed (non-IID) data. In recent years, federal learning has attracted attention for resolving the conflict between model training and data privacy protection. Traditional federal learning, however, does not handle non-IID data among clients well. To address this, the prior art proposes clustered federal learning, which measures the similarity of data distributions among clients using gradients and clusters them accordingly. However, recent studies have shown that a client's private information, and even its original training data, can be recovered from gradients, and that gradient dimensionality tends to explode as model complexity increases. Meanwhile, existing clustered federal learning schemes cannot assign clients with diverse data to multiple clusters, so the diverse data owned by some clients cannot be fully utilized. In addition, compared with ordinary federal learning, the cluster structure gives malicious clients the opportunity to collude into one cluster and poison the aggregated cluster model by launching local model poisoning attacks, leading to training failure. Therefore, how to protect client privacy in clustered federal learning while fully exploiting the diversity and availability of client data has a crucial influence on the development of the industry. Likewise, how to improve the detection efficiency of malicious models, reduce detection overhead, and improve safety during training are important problems to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems in the prior art, the invention provides a gradient-independent clustering federal learning method and system.
In order to solve the technical problems, the invention adopts the technical scheme that:
a gradient-independent clustered federation learning method, comprising:
S1, each client respectively calculates the data distribution information of its own label samples, obtains the intersection similarity between its data distribution information and that of the other clients, and constructs an intersection similarity vector;
S2, the server collects the intersection similarity vectors of the clients and constructs a similarity matrix;
S3, based on the similarity matrix, the server clusters the clients using a clustering method that ensures diversity and executes the model training step, jumping to the next step when the server detects that the accuracy of the model has dropped;
S4, the server detects the malicious cluster, and after the malicious cluster is determined, selects clients that are not in the malicious cluster but whose data distributions are most similar to those of the clients in the malicious cluster to form a verification committee;
S6, the members of the verification committee verify the models of the members of the malicious cluster and vote to classify them as benign models or malicious models, and the malicious models are excluded while the benign models are retained.
Optionally, in step S1, the functional expression by which each client calculates the data distribution information of its own label samples is as follows:
In the above formula, X_1, X_2, …, X_i respectively represent the single-tag data distribution information of the 1st, 2nd, …, i-th clients, and the calculation function expression of the single-tag data distribution information of any i-th client is as follows:
In the above formula, the data quantity term represents the number of samples for the j-th index of the i-th client, idx_i represents the index of tag i, Q_max represents a predefined maximum that the data quantity of any tag does not exceed, X_i represents the data distribution information of the i-th client, and j is the sequence number of the j-th index of the i-th client.
Optionally, the functional expression of the intersection similarity vector constructed in step S1 is:
In the above formula, ISM_i is the intersection similarity vector of the i-th client, ISM_i[1]~ISM_i[j] represent the data distribution similarities of the i-th client to the 1st through j-th clients, |X_i ∩ X_j| represents the size of the intersection of the data distributions of the i-th and j-th clients, X_i represents the data distribution information of the i-th client, and X_j represents the data distribution information constructed by the j-th client.
Optionally, the functional expression of the similarity matrix constructed in step S2 is:
In the above formula, M_sim is the similarity matrix, and its i-th row is the intersection similarity vector formed by the data distribution similarities of the i-th client to the 1st through n-th clients.
Optionally, the clustering, by the server, the clients using a clustering method for ensuring diversity based on the similarity matrix in step S3 includes: aggregating all clients with intersection similarity higher than a threshold value alpha into a candidate cluster set and removing duplication, so that the candidate cluster set comprises all possible clustering results; and calculating the weight of each candidate cluster in the candidate cluster set, and adding the candidate cluster with the minimum load into the final cluster set by using a greedy algorithm in each selection until all the clients are distributed into the final cluster set.
Optionally, the functional expression for calculating the weight of each candidate cluster is:
In the above formula, cost(S'_i) represents the cost of candidate cluster S'_i, ID is the number of a client, and M_sim[i][ID] is the element in the i-th row and ID-th column of the similarity matrix; the calculation function expression of the load is:
In the above formula, payload represents the load, S' represents a candidate cluster, I represents the set of clients that have already been selected into the final cluster set, and S'\I represents the clients of the candidate cluster that have not yet been selected into the final cluster set.
Optionally, the step S4 of detecting the malicious cluster by the server means that the server detects the precision of each cluster in the final cluster set according to local data of the server, and selects a cluster with the lowest precision as the malicious cluster.
Optionally, in step S6, using the members of the verification committee to verify the models of the members of the malicious cluster and voting to classify them as benign models and malicious models comprises: each member of the verification committee verifies the model accuracy of every member of the malicious cluster on its own local data, regards the models whose accuracy is lower than the average as malicious models and votes against them; finally all voting results are summed, and a model whose number of votes is higher than the average number of votes is determined by the verification committee to be a malicious model, otherwise it is determined by the verification committee to be a benign model.
In addition, the invention also provides a gradient-independent clustering federal learning system, which comprises a plurality of interconnected clients, wherein each client comprises an interconnected microprocessor and a memory, and the microprocessor is programmed or configured to execute the gradient-independent clustering federal learning method.
Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program for being programmed or configured by a microprocessor to perform the gradient-independent clustering federal learning method.
Compared with the prior art, the invention mainly has the following advantages:
1. in the clustering process, the server does not need to cluster according to the gradient information of the client, but clusters according to the intersection similarity between the data distributions of the client, so that the problem of gradient information leakage of the client is avoided, the gradient safety of the client is protected, the safety and the reliability in the clustering federal learning process are enhanced, and the training precision is improved.
2. In the clustering process, the invention innovatively allows the same client to appear in a plurality of clusters, so that the most suitable cluster can be found for each client, and the diversity of the client data is fully utilized to enhance the model accuracy.
3. Different from the ex-ante detection commonly used in existing federal learning, the invention innovatively uses a post-hoc detection mechanism to detect malicious clusters in the system, allowing detection after an attack has started, which saves overhead and further improves the security of the system.
Drawings
FIG. 1 is a schematic diagram of a basic process flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the principle of the embodiment of the present invention.
Fig. 3 shows the test accuracy on the MNIST data set according to an embodiment of the present invention.
FIG. 4 is a graph of the test accuracy on the FMNIST data set according to an embodiment of the present invention.
FIG. 5 shows the accuracy of the test on the CIFAR10 data set according to the embodiment of the present invention.
FIG. 6 shows a comparison of test accuracy in three methods according to embodiments of the present invention.
FIG. 7 is a diagram illustrating the main overhead of calculating the intersection similarity according to the embodiment of the present invention.
FIG. 8 is a diagram illustrating the test accuracy of each cluster when a malicious client performs malicious activities according to an embodiment of the present invention.
Fig. 9 shows the test accuracy of the global model after the detection by using different detection methods according to the embodiment of the present invention.
FIG. 10 shows the accuracy of the global model in different guard noises according to an embodiment of the present invention.
Detailed Description
The method can be applied to medical industry scenarios: each hospital can act as a federal learning client, so that the diversity of hospitals' sensitive data is used efficiently, the availability of the sensitive data is ensured, the non-IID problem among hospitals is solved, and multiple cluster models suited to each hospital are obtained efficiently. The present invention will be further described below with reference to the drawings and specific preferred embodiments, taking each hospital as a federal learning client and machine learning for cancer cell identification based on CT images as an example, without limiting the scope of the present invention. In this example, the system includes a server and a plurality of clients, and the server communicates with the clients through secure channels to exchange information and data. Without being limited to the federal learning system in this embodiment, one of ordinary skill in the art can deploy the federal learning system according to the actual situation.
As shown in fig. 1, the gradient-independent clustering federal learning method of this embodiment includes:
S1, each client respectively calculates the data distribution information of its own label samples, obtains the intersection similarity between its data distribution information and that of the other clients, and constructs an intersection similarity vector;
S2, the server collects the intersection similarity vectors of the clients and constructs a similarity matrix;
S3, based on the similarity matrix, the server clusters the clients using a clustering method that ensures diversity and executes the model training step, jumping to the next step when the server detects that the accuracy of the model has dropped;
S4, the server detects the malicious cluster, and after the malicious cluster is determined, selects clients that are not in the malicious cluster but whose data distributions are most similar to those of the clients in the malicious cluster to form a verification committee;
S6, the members of the verification committee verify the models of the members of the malicious cluster and vote to classify them as benign models or malicious models, and the malicious models are excluded while the benign models are retained.
Based on the disclosure of the present embodiment, those skilled in the art can understand that data transmission between the server and the clients, and between clients, is carried out through secure channels, covering processes such as issuing the global model, uploading local models, uploading similarity vectors, and computing intersections with the RSA-PSI scheme. In step S1 of this embodiment, the method by which a client obtains the intersection of data distributions between itself and other clients is: the RSA-PSI scheme is used to obtain the intersection between itself and the other clients. As described above, existing clustered federal learning schemes rely on gradients for clustering, which may reveal a client's private information and even its original training data. To avoid this problem, this embodiment uses the RSA-PSI scheme to obtain the intersection with the data of other clients without revealing the training data, and the intersection similarity vector between the client and the other clients is computed from this intersection.
In this embodiment, each client needs to count the number of samples of each tag in its local data and process these counts so that the client's real data distribution information is not leaked; in step S1, the functional expression by which each client calculates the data distribution information of its own label samples is as follows:
In the above formula, X_1, X_2, …, X_i respectively represent the single-tag data distribution information of the 1st, 2nd, …, i-th clients, and the calculation function expression of the single-tag data distribution information of any i-th client is as follows:
In the above formula, the data quantity term represents the number of samples for the j-th index of the i-th client, idx_i represents the index of tag i, Q_max represents a predefined maximum that the data quantity of any tag does not exceed, X_i represents the data distribution information of the i-th client, and j is the serial number of the j-th index of the i-th client. After this transformation, a client can obtain the intersection without exposing its real data distribution information.
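For illustration, a minimal Python sketch of this label-counting step is given below. The patent's exact encoding formula is given only as an image, so the packing `idx * q_max + count` used here is an assumption that merely matches the surrounding description (every tag count stays below the predefined maximum Q_max); the privacy of the resulting set relies on the PSI exchange, not on this encoding alone.

```python
from collections import Counter

def label_distribution_info(labels, q_max):
    """Build the set X_i summarizing one client's label distribution.

    Hypothetical encoding: each (tag index, sample count) pair is packed into
    a single integer idx * q_max + count.  This packing is an assumption, not
    the patented formula, which is not reproduced in the text.
    """
    counts = Counter(labels)                                   # samples per tag on this client
    assert all(c < q_max for c in counts.values()), "no tag may exceed Q_max"
    return {idx * q_max + cnt for idx, cnt in counts.items()}
```

For example, a client holding labels [0, 0, 1, 2, 2, 2] with q_max = 1000 would obtain the set {2, 1001, 2003} under this assumed encoding.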
In this embodiment, the method by which a client obtains the intersection of its data distribution with those of other clients in step S1 is: the intersection is computed using the RSA-PSI scheme (an RSA-based private set intersection scheme) and a similarity vector is constructed. As an optional implementation, the functional expression of the intersection similarity vector constructed in step S1 in this embodiment is as follows:
In the above formula, ISM_i is the intersection similarity vector of the i-th client, ISM_i[1]~ISM_i[j] represent the data distribution similarities of the i-th client to the 1st through j-th clients, |X_i ∩ X_j| represents the size of the intersection of the data distributions of the i-th and j-th clients, X_i represents the data distribution information of the i-th client, and X_j represents the data distribution information constructed by the j-th client. Based on the disclosure of this embodiment, a person skilled in the art may also use a different private set intersection scheme to compute the intersection, and the technical scope claimed in this application is not limited by this specific embodiment.
In this embodiment, the functional expression of the similarity matrix constructed in step S2 is:
In the above formula, M_sim is the similarity matrix, and its i-th row is the intersection similarity vector formed by the data distribution similarities of the i-th client to the 1st through n-th clients.
In this embodiment, the clustering, performed by the server based on the similarity matrix, of the clients using the clustering method for ensuring diversity in step S3 includes: aggregating all clients with intersection similarity higher than a threshold value alpha into a candidate cluster set and removing duplication, so that the candidate cluster set comprises all possible clustering results; and calculating the weight of each candidate cluster in the candidate cluster set, and adding the candidate cluster with the minimum load into the final cluster set by using a greedy algorithm for each selection until all the clients are distributed into the final cluster set.
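The candidate-construction-plus-greedy-selection procedure described above can be written compactly as below. The `cost` callable corresponds to the candidate-cluster weight whose formula is given next, and the load ratio (cost divided by the number of newly covered clients) is an assumption consistent with the description, since the patent's own formulas are reproduced only as images.

```python
def candidate_clusters(m_sim, alpha):
    """One candidate cluster per client i: i together with every client whose
    similarity to i exceeds the threshold alpha; duplicates are removed."""
    n = len(m_sim)
    cands = {frozenset([i] + [j for j in range(n) if m_sim[i][j] > alpha]) for i in range(n)}
    return [set(c) for c in cands]

def greedy_select(cands, cost):
    """Greedy selection: repeatedly add the candidate with the smallest load
    (assumed here to be cost / number of newly covered clients) until every
    client belongs to at least one final cluster.  A client may appear in
    several final clusters, which is what lets diverse data be reused."""
    all_clients = set().union(*cands)
    covered, final = set(), []
    while covered != all_clients:
        best = min((c for c in cands if c - covered),
                   key=lambda c: cost(c) / len(c - covered))
        final.append(best)
        covered |= best
    return final
```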
In this embodiment, the functional expression for calculating the weight of each candidate cluster is:
In the above formula, cost(S'_i) represents the cost of candidate cluster S'_i, ID is the number of a client, and M_sim[i][ID] is the element in the i-th row and ID-th column of the similarity matrix; the calculation function expression of the load is:
In the above formula, payload represents the load, S' represents a candidate cluster, I represents the set of clients that have already been selected into the final cluster set, and S'\I represents the clients of the candidate cluster that have not yet been selected into the final cluster set.
In this embodiment, the step S4 of detecting the malicious cluster by the server means that the server detects the precision of each cluster in the final cluster set according to local data of the server, and selects a cluster with the lowest precision as the malicious cluster.
In this embodiment, the method for detecting whether a malicious cluster exists is to check whether the accuracy of the global model, as measured by the server, has degraded significantly (i.e., beyond a set value). The method for determining the malicious cluster in this embodiment is: the server tests the accuracy of each cluster on its own local data and selects the cluster with the lowest accuracy as the malicious cluster. Of course, one of ordinary skill in the art may choose a different method to detect malicious clusters as needed, for example checking whether the cluster model meets a predetermined criterion. The technical scope claimed in the present application is not limited by this specific embodiment.
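The post-hoc detection just described can be sketched as follows; `evaluate` is a placeholder for testing a cluster model on the server's small IID data set, the 5% accuracy-drop threshold is an assumed "set value", and taking the mean cluster accuracy as the global signal is likewise an assumption.

```python
def detect_malicious_cluster(cluster_models, server_data, evaluate,
                             prev_global_acc, drop_threshold=0.05):
    """Return the index of the suspected malicious cluster, or None if the
    global accuracy has not degraded beyond the set threshold."""
    accs = [evaluate(model, server_data) for model in cluster_models]
    global_acc = sum(accs) / len(accs)
    if prev_global_acc - global_acc <= drop_threshold:
        return None                                        # no significant degradation
    return min(range(len(accs)), key=lambda k: accs[k])    # lowest-accuracy cluster is suspect
```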
In this embodiment, in step S6, using the members of the verification committee to verify the models of the members of the malicious cluster and voting comprises: each member of the verification committee verifies the model accuracy of every member of the malicious cluster on its own local data, regards the models whose accuracy is lower than the average as malicious models and votes against them; finally all voting results are summed, a model whose number of votes is higher than the average number of votes is identified by the verification committee as a malicious model, and otherwise it is identified as a benign model. The method for selecting the verification committee in step S6 of this embodiment is: for each client in the malicious cluster, the server takes the client whose data distribution is most similar to it but which is not in the malicious cluster.
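A sketch of the committee selection and voting rule follows; `evaluate` again stands in for local accuracy testing, and all function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def form_committee(m_sim, malicious_cluster, n_clients):
    """For each client in the suspect cluster, take the most similar client
    outside the cluster; the union of these picks forms the committee."""
    outside = [j for j in range(n_clients) if j not in malicious_cluster]
    return {max(outside, key=lambda j: m_sim[i][j]) for i in malicious_cluster}

def committee_vote(committee, suspect_models, local_data, evaluate):
    """Each member votes against every model whose accuracy on that member's
    local data is below the member's own average; models collecting more than
    the average number of votes are judged malicious."""
    votes = np.zeros(len(suspect_models))
    for member in committee:
        accs = [evaluate(m, local_data[member]) for m in suspect_models]
        avg = sum(accs) / len(accs)
        votes += np.array([a < avg for a in accs], dtype=float)
    return [k for k, v in enumerate(votes) if v > votes.mean()]   # indices of malicious models
```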
In this embodiment, the scheme is verified through a specific simulation experiment, with federal learning simulated based on PyTorch. In the simulation experiment, 1 server and 100 clients were set up. The server aggregates the local models in each cluster using the FederatedAveraging algorithm, and each client iterates once over its local data when training. The server holds a portion of independently and identically distributed data for detecting malicious clusters. Three different training methods are set: GCFL, FedAvg, and GICFL. In the GCFL method, the system clusters clients using a gradient-based clustering method. In the FedAvg method, the system does not cluster clients and performs only the most basic federal learning training. In GICFL, the clients are clustered using the method described above. For each training method, the MNIST, FMNIST and CIFAR-10 data sets are used as benchmark data sets. For each data set, the client data is initialized with two different methods: ill-conditioned non-IID and Dirichlet non-IID. In the ill-conditioned non-IID initialization method, each client obtains random data of two tags, i.e., the ill-conditioned non-IID parameter is set to k = 2. In the Dirichlet non-IID initialization method, the data of each client obeys a Dirichlet distribution with β = 0.5. The test accuracy of the model under the three training methods is shown in Figs. 3, 4 and 5; in each of these figures, (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Figs. 3, 4 and 5, under the GICFL training method of this embodiment, the test accuracy of the model is better than that of the other two training methods for the various data sets and data distributions.
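The non-IID partitions used in the simulation can be reproduced roughly as below. The patent does not give partitioning code, so this Dirichlet split is only a conventional reconstruction of the stated settings; the ill-conditioned k = 2 setting would instead give each client samples of only two randomly chosen labels.

```python
import numpy as np

def dirichlet_partition(labels, n_clients=100, beta=0.5, seed=0):
    """Split sample indices among clients so that, for every class, the share
    each client receives is drawn from Dirichlet(beta).  beta = 0.5 matches
    the main experiments and beta = 0.1 the more extreme setting."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet([beta] * n_clients)            # class share per client
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx
```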
In the simulation experiment of this embodiment, the efficiency of the technical solution of the invention is also tested under a more extreme data initialization. When the parameter of the Dirichlet initialization method is set to β = 0.1, the test accuracies after convergence of the three training methods on the MNIST and FMNIST data sets are shown in Fig. 6, where (a) is the test accuracy on the MNIST data set and (b) is the test accuracy on the FMNIST data set. As can be seen from Fig. 6, the test accuracy of this embodiment still performs well under this more extreme data initialization.
In the simulation experiment of this embodiment, Fig. 7 shows the main overhead required to calculate the intersection of the data distributions between clients under the different data distribution initialization methods, where (a) is the overhead on the ill-conditioned initialization distribution and (b) is the overhead on the Dirichlet initialization distribution. As can be seen from Fig. 7, some time is required to calculate the intersection similarity, but since this phase precedes the training phase, it is negligible compared with the training time. In addition, a new member can join without recalculating the intersections among the original members. Considering the time required for training and the need for privacy protection, the overhead of this protocol is acceptable.
In the simulation experiment of this embodiment, the case where malicious clients are present is also simulated. Each malicious client adds Gaussian noise with variance 1 to its local model, i.e., δ = 1. The test accuracy of each cluster model after the malicious client attack is shown in Fig. 8, where (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Fig. 8, after the malicious clients act maliciously, the server can still detect the malicious cluster using its own local data.
In the simulation experiment of this embodiment, three methods were used to construct the verification committee. In the first method, the server selects the clients outside the malicious cluster whose data distributions are most similar to those of the clients in the cluster to form the verification committee; this is the method adopted in this embodiment. In the second method, the server selects random clients outside the malicious cluster to form the verification committee. In the third method, the server selects the clients outside the malicious cluster that are least similar to the clients in the cluster to form the verification committee. After the verification committee votes, the clients identified as malicious are removed and the remaining clients are re-clustered. The accuracy of the re-clustered model is shown in Fig. 9, where (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Fig. 9, the detection method of this embodiment achieves an excellent detection effect on malicious clusters and malicious clients.
In the simulation experiment of this embodiment, the robustness of the detection method is also verified by gradually adding different degrees of Gaussian noise to the clients' models, as shown in Fig. 10, where (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Fig. 10, under the Dirichlet initialization distribution a malicious model in a malicious cluster can be accurately identified even when the added Gaussian noise has a variance of only 0.006, and under the ill-conditioned initialization distribution even Gaussian noise with a variance of 0.01 can be tolerated.
In summary, in the clustering process of the clustering federal learning method independent of the gradient, the server does not need to cluster by relying on the gradient information of the client, but clusters according to the intersection similarity between the data distributions of the client, so that the problem of leakage of the gradient information of the client is avoided, the gradient safety of the client is protected, the safety and reliability in the clustering federal learning process are enhanced, and the training precision is improved. In the clustering process, the gradient-independent clustering federal learning method innovatively allows the same client to appear in a plurality of clusters, so that the most suitable cluster can be found for each client, the diversity of client data is fully utilized, and the model precision is enhanced. The gradient-independent clustering federated learning method is different from the prior detection commonly used in the existing federated learning, and innovatively uses a post detection mechanism to detect the malicious clusters existing in the system, so that the detection is allowed after the attack starts, the expenditure is saved, and the safety of the system is further improved. The gradient-independent clustering federal learning method can be applied to medical industry scenes, and each hospital can efficiently utilize the diversity of hospital sensitive data and ensure the availability of the sensitive data by using the method as a client for federal learning, so that the problem of non-independent and same distribution among hospitals is solved, and a plurality of cluster models suitable for each hospital are efficiently obtained.
In addition, the present embodiment also provides a gradient-independent clustering federal learning system, which includes a plurality of interconnected clients, each of which includes an interconnected microprocessor and a memory, wherein the microprocessor is programmed or configured to execute the gradient-independent clustering federal learning method. Furthermore, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured by a microprocessor to execute the foregoing gradient-independent clustering federal learning method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiments, and all technical solutions that belong to the idea of the present invention belong to the scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
Claims (10)
1. A gradient-independent clustered federation learning method, comprising:
S1, each client respectively calculates the data distribution information of its own label samples, obtains the intersection similarity between its data distribution information and that of the other clients, and constructs an intersection similarity vector;
S2, the server collects the intersection similarity vectors of the clients and constructs a similarity matrix;
S3, based on the similarity matrix, the server clusters the clients using a clustering method that ensures diversity and executes the model training step, jumping to the next step when the server detects that the accuracy of the model has dropped;
S4, the server detects the malicious cluster, and after the malicious cluster is determined, selects clients that are not in the malicious cluster but whose data distributions are most similar to those of the clients in the malicious cluster to form a verification committee;
S6, the members of the verification committee verify the models of the members of the malicious cluster and vote to classify them as benign models or malicious models, and the malicious models are excluded while the benign models are retained.
2. The gradient-independent clustering federation learning method of claim 1, wherein in step S1, the functional expression by which each client calculates the data distribution information of its own label samples is as follows:
In the above formula, X_1, X_2, …, X_i respectively represent the single-tag data distribution information of the 1st, 2nd, …, i-th clients, and the calculation function expression of the single-tag data distribution information of any i-th client is as follows:
3. The gradient-independent clustering federated learning method of claim 2, wherein the functional expression of the intersection similarity vector constructed in step S1 is:
In the above formula, ISM_i is the intersection similarity vector of the i-th client, ISM_i[1]~ISM_i[j] represent the data distribution similarities of the i-th client to the 1st through j-th clients, |X_i ∩ X_j| represents the size of the intersection of the data distributions of the i-th and j-th clients, X_i represents the data distribution information constructed by the i-th client, and X_j represents the data distribution information constructed by the j-th client.
4. The gradient-independent clustering federated learning method according to claim 3, wherein the functional expression of the similarity matrix constructed in step S2 is:
In the above formula, M_sim is the similarity matrix, and its i-th row is the intersection similarity vector formed by the data distribution similarities of the i-th client to the 1st through n-th clients.
5. The gradient-independent clustering federated learning method according to claim 4, wherein the server clustering the clients using a diversity-guaranteed clustering method based on the similarity matrix in step S3 includes: aggregating all clients with intersection similarity higher than a threshold value alpha into a candidate cluster set and removing duplication, so that the candidate cluster set comprises all possible clustering results; and calculating the weight of each candidate cluster in the candidate cluster set, and adding the candidate cluster with the minimum load into the final cluster set by using a greedy algorithm in each selection until all the clients are distributed into the final cluster set.
6. The gradient-independent clustering federal learning method as claimed in claim 5, wherein the functional expression for calculating the weight of each candidate cluster is:
In the above formula, cost(S'_i) represents the cost of candidate cluster S'_i, ID is the number of a client, and M_sim[i][ID] is the element in the i-th row and ID-th column of the similarity matrix; the calculation function expression of the load is:
In the above formula, payload represents the load, S' represents a candidate cluster, I represents the set of clients that have already been selected into the final cluster set, and S'\I represents the clients of the candidate cluster that have not yet been selected into the final cluster set.
7. The gradient-independent cluster federal learning method as claimed in claim 6, wherein the step S4 of detecting the malicious cluster by the server means that the server detects the accuracy of each cluster in the final cluster set according to local data of the server, and selects a cluster with the lowest accuracy as the malicious cluster.
8. The gradient-independent cluster federated learning method of claim 7, wherein in step S6, using the members of the verification committee to verify the models of the members of the malicious cluster and voting to classify them as benign models and malicious models comprises: each member of the verification committee verifies the model accuracy of every member of the malicious cluster on its own local data, regards the models whose accuracy is lower than the average as malicious models and votes against them; finally all voting results are summed, and a model whose number of votes is higher than the average number of votes is identified by the verification committee as a malicious model, otherwise it is identified by the verification committee as a benign model.
9. A gradient-independent clustered federated learning system, comprising a plurality of interconnected clients, the clients comprising an interconnected microprocessor and memory, wherein the microprocessor is programmed or configured to perform the gradient-independent clustered federated learning method of any one of claims 1 to 8.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program is adapted to be programmed or configured by a microprocessor to perform the gradient independent clustering federal learning method as claimed in any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211422140.9A CN115577360A (en) | 2022-11-14 | 2022-11-14 | Gradient-independent clustering federal learning method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211422140.9A CN115577360A (en) | 2022-11-14 | 2022-11-14 | Gradient-independent clustering federal learning method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115577360A (en) | 2023-01-06
Family
ID=84588580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211422140.9A Pending CN115577360A (en) | 2022-11-14 | 2022-11-14 | Gradient-independent clustering federal learning method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115577360A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117077817A (en) * | 2023-10-13 | 2023-11-17 | 之江实验室 | Personalized federal learning model training method and device based on label distribution |
CN117094412A (en) * | 2023-08-18 | 2023-11-21 | 之江实验室 | Federal learning method and device aiming at non-independent co-distributed medical scene |
CN117436078A (en) * | 2023-12-18 | 2024-01-23 | 烟台大学 | Bidirectional model poisoning detection method and system in federal learning |
CN117640253A (en) * | 2024-01-25 | 2024-03-01 | 济南大学 | Federal learning privacy protection method and system based on homomorphic encryption |
CN118250098A (en) * | 2024-05-27 | 2024-06-25 | 泉城省实验室 | Method and system for resisting malicious client poisoning attack based on packet aggregation |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117094412A (en) * | 2023-08-18 | 2023-11-21 | 之江实验室 | Federal learning method and device aiming at non-independent co-distributed medical scene |
CN117094412B (en) * | 2023-08-18 | 2024-06-28 | 之江实验室 | Federal learning method and device aiming at non-independent co-distributed medical scene |
CN117077817A (en) * | 2023-10-13 | 2023-11-17 | 之江实验室 | Personalized federal learning model training method and device based on label distribution |
CN117077817B (en) * | 2023-10-13 | 2024-01-30 | 之江实验室 | Personalized federal learning model training method and device based on label distribution |
CN117436078A (en) * | 2023-12-18 | 2024-01-23 | 烟台大学 | Bidirectional model poisoning detection method and system in federal learning |
CN117436078B (en) * | 2023-12-18 | 2024-03-12 | 烟台大学 | Bidirectional model poisoning detection method and system in federal learning |
CN117640253A (en) * | 2024-01-25 | 2024-03-01 | 济南大学 | Federal learning privacy protection method and system based on homomorphic encryption |
CN117640253B (en) * | 2024-01-25 | 2024-04-05 | 济南大学 | Federal learning privacy protection method and system based on homomorphic encryption |
CN118250098A (en) * | 2024-05-27 | 2024-06-25 | 泉城省实验室 | Method and system for resisting malicious client poisoning attack based on packet aggregation |
CN118250098B (en) * | 2024-05-27 | 2024-08-09 | 泉城省实验室 | Method and system for resisting malicious client poisoning attack based on packet aggregation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||