CN115577360A - Gradient-independent clustering federal learning method and system - Google Patents
Gradient-independent clustering federal learning method and system
- Publication number
- CN115577360A (application CN202211422140.9A)
- Authority
- CN
- China
- Prior art keywords
- client
- cluster
- clients
- malicious
- gradient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Mathematical Analysis (AREA)
- Computer Hardware Design (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Virology (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a gradient-independent clustering federal learning method and system. In the method, each client calculates its own data distribution information and the intersection similarity between that information and the data distribution information of the other clients, and constructs an intersection similarity vector; the server collects the intersection similarity vectors, constructs a similarity matrix, clusters the clients and executes model training; after the server detects that model accuracy has dropped and determines a malicious cluster, it selects clients to form a verification committee, which verifies and votes to exclude the malicious models and retain the benign models. The server does not rely on client gradient information for clustering but instead clusters according to the intersection similarity between the clients' data distributions, thereby avoiding leakage of client gradient information, protecting the gradient security of the clients, enhancing the safety and reliability of the clustered federal learning process, and improving training accuracy.
Description
Technical Field
The invention relates to the technical field of artificial intelligence clustering federal learning, in particular to a gradient-independent clustering federal learning method and system.
Background
Although information has become increasingly abundant with the development of informatization, much of it remains locked in isolated islands because it is highly sensitive. A very typical application field is medicine. Medical data is extremely sensitive because it may involve important patient privacy, and it is usually kept separately by different hospitals. The emphasis of the data owned by each hospital may also differ (for example, some hospitals specialize in heart disease, others in kidney disease), which gives rise to the problem of non-independent and identically distributed (non-IID) data. In recent years, federal learning has attracted attention for resolving the conflict between model training and data privacy protection. Traditional federal learning, however, does not handle non-IID data among clients well. To address this, the prior art proposes clustered federal learning, which measures the similarity of data distributions among clients using gradients and clusters them accordingly. However, recent studies have shown that a client's private information, and even its original training data, can be recovered from gradients, and that gradient dimensionality tends to explode as model complexity increases. Meanwhile, existing clustered federal learning schemes cannot assign clients with diverse data to multiple clusters, so the diverse data owned by some clients cannot be fully utilized. In addition, compared with ordinary federal learning, the cluster structure gives malicious clients the opportunity to collude into one cluster and poison the aggregated cluster model by launching local model poisoning attacks, leading to training failure. Therefore, how to protect client privacy in clustered federal learning while fully exploiting the diversity and availability of client data has a crucial influence on the development of the industry. Likewise, how to improve the detection efficiency of malicious models, reduce detection overhead, and improve safety during training are important problems to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems in the prior art, the invention provides a gradient-independent clustering federal learning method and system.
In order to solve the technical problems, the invention adopts the technical scheme that:
a gradient-independent clustered federation learning method, comprising:
S1, each client respectively calculates the data distribution information of its own label samples, obtains the intersection similarity between its data distribution information and that of the other clients, and constructs an intersection similarity vector;
S2, the server collects the intersection similarity vectors of the clients and constructs a similarity matrix;
S3, based on the similarity matrix, the server clusters the clients using a clustering method that ensures diversity and executes the model training step, jumping to the next step when the server detects that the accuracy of the model has dropped;
S4, the server detects the malicious cluster, and after the malicious cluster is determined, selects clients that are not in the malicious cluster but whose data distributions are most similar to those of the clients in the malicious cluster to form a verification committee;
S6, the members of the verification committee verify the models of the members of the malicious cluster and vote to classify them as benign models or malicious models, and the malicious models are excluded while the benign models are retained.
Optionally, in step S1, the functional expression by which each client calculates the data distribution information of its own label samples is as follows:
In the above formula, X_1, X_2, …, X_i respectively represent the single-tag data distribution information of the 1st, 2nd, …, i-th clients, and the calculation function expression of the single-tag data distribution information of any i-th client is as follows:
In the above formula, the data quantity term represents the number of samples for the j-th index of the i-th client, idx_i represents the index of tag i, Q_max represents a predefined maximum that the data quantity of any tag does not exceed, X_i represents the data distribution information of the i-th client, and j is the sequence number of the j-th index of the i-th client.
Optionally, the functional expression of the intersection similarity vector constructed in step S1 is:
In the above formula, ISM_i is the intersection similarity vector of the i-th client, ISM_i[1]~ISM_i[j] represent the data distribution similarities of the i-th client to the 1st through j-th clients, |X_i ∩ X_j| represents the size of the intersection of the data distributions of the i-th and j-th clients, X_i represents the data distribution information of the i-th client, and X_j represents the data distribution information constructed by the j-th client.
Optionally, the functional expression of the similarity matrix constructed in step S2 is:
In the above formula, M_sim is the similarity matrix, and its i-th row is the intersection similarity vector formed by the data distribution similarities of the i-th client to the 1st through n-th clients.
Optionally, the clustering, by the server, the clients using a clustering method for ensuring diversity based on the similarity matrix in step S3 includes: aggregating all clients with intersection similarity higher than a threshold value alpha into a candidate cluster set and removing duplication, so that the candidate cluster set comprises all possible clustering results; and calculating the weight of each candidate cluster in the candidate cluster set, and adding the candidate cluster with the minimum load into the final cluster set by using a greedy algorithm in each selection until all the clients are distributed into the final cluster set.
Optionally, the functional expression for calculating the weight of each candidate cluster is:
In the above formula, cost(S'_i) represents the cost of candidate cluster S'_i, ID is the number of a client, and M_sim[i][ID] is the element in the i-th row and ID-th column of the similarity matrix; the calculation function expression of the load is:
In the above formula, payload represents the load, S' represents a candidate cluster, I represents the set of clients that have already been selected into the final cluster set, and S'\I represents the clients of the candidate cluster that have not yet been selected into the final cluster set.
Optionally, the step S4 of detecting the malicious cluster by the server means that the server detects the precision of each cluster in the final cluster set according to local data of the server, and selects a cluster with the lowest precision as the malicious cluster.
Optionally, in step S6, using the members of the verification committee to verify the models of the members of the malicious cluster and voting to classify them as benign models and malicious models comprises: each member of the verification committee verifies the model accuracy of every member of the malicious cluster on its own local data, regards the models whose accuracy is lower than the average as malicious models and votes against them; finally all voting results are summed, and a model whose number of votes is higher than the average number of votes is determined by the verification committee to be a malicious model, otherwise it is determined by the verification committee to be a benign model.
In addition, the invention also provides a gradient-independent clustering federal learning system, which comprises a plurality of interconnected clients, wherein each client comprises an interconnected microprocessor and a memory, and the microprocessor is programmed or configured to execute the gradient-independent clustering federal learning method.
Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program for being programmed or configured by a microprocessor to perform the gradient-independent clustering federal learning method.
Compared with the prior art, the invention mainly has the following advantages:
1. in the clustering process, the server does not need to cluster according to the gradient information of the client, but clusters according to the intersection similarity between the data distributions of the client, so that the problem of gradient information leakage of the client is avoided, the gradient safety of the client is protected, the safety and the reliability in the clustering federal learning process are enhanced, and the training precision is improved.
2. In the clustering process, the invention innovatively allows the same client to appear in a plurality of clusters, so that the most suitable cluster can be found for each client, and the diversity of the client data is fully utilized to enhance the model accuracy.
3. Different from the ex-ante detection commonly used in existing federal learning, the invention innovatively uses a post-hoc detection mechanism to detect malicious clusters in the system, allowing detection after an attack has started, which saves overhead and further improves the security of the system.
Drawings
FIG. 1 is a schematic diagram of a basic process flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the principle of the embodiment of the present invention.
Fig. 3 shows the test accuracy on the MNIST data set according to an embodiment of the present invention.
FIG. 4 is a graph of the test accuracy on the FMNIST data set according to an embodiment of the present invention.
FIG. 5 shows the accuracy of the test on the CIFAR10 data set according to the embodiment of the present invention.
FIG. 6 shows a comparison of test accuracy in three methods according to embodiments of the present invention.
FIG. 7 is a diagram illustrating the main overhead of calculating the intersection similarity according to the embodiment of the present invention.
FIG. 8 is a diagram illustrating the test accuracy of each cluster when a malicious client performs malicious activities according to an embodiment of the present invention.
Fig. 9 shows the test accuracy of the global model after the detection by using different detection methods according to the embodiment of the present invention.
FIG. 10 shows the accuracy of the global model in different guard noises according to an embodiment of the present invention.
Detailed Description
The method can be applied to medical industry scenarios: each hospital can act as a federal learning client, so that the diversity of hospitals' sensitive data is used efficiently, the availability of the sensitive data is ensured, the non-IID problem among hospitals is solved, and multiple cluster models suited to each hospital are obtained efficiently. The present invention will be further described below with reference to the drawings and specific preferred embodiments, taking each hospital as a federal learning client and machine learning for cancer cell identification based on CT images as an example, without limiting the scope of the present invention. In this example, the system includes a server and a plurality of clients, and the server communicates with the clients through secure channels to exchange information and data. Without being limited to the federal learning system in this embodiment, one of ordinary skill in the art can deploy the federal learning system according to the actual situation.
As shown in fig. 1, the gradient-independent clustering federal learning method of this embodiment includes:
S1, each client respectively calculates the data distribution information of its own label samples, obtains the intersection similarity between its data distribution information and that of the other clients, and constructs an intersection similarity vector;
S2, the server collects the intersection similarity vectors of the clients and constructs a similarity matrix;
S3, based on the similarity matrix, the server clusters the clients using a clustering method that ensures diversity and executes the model training step, jumping to the next step when the server detects that the accuracy of the model has dropped;
S4, the server detects the malicious cluster, and after the malicious cluster is determined, selects clients that are not in the malicious cluster but whose data distributions are most similar to those of the clients in the malicious cluster to form a verification committee;
S6, the members of the verification committee verify the models of the members of the malicious cluster and vote to classify them as benign models or malicious models, and the malicious models are excluded while the benign models are retained.
Based on the disclosure of the present embodiment, those skilled in the art can understand that data transmission between the server and the clients, and between clients, is carried out through secure channels, covering processes such as issuing the global model, uploading local models, uploading similarity vectors, and computing intersections with the RSA-PSI scheme. In step S1 of this embodiment, the method by which a client obtains the intersection of data distributions between itself and other clients is: the RSA-PSI scheme is used to obtain the intersection between itself and the other clients. As described above, existing clustered federal learning schemes rely on gradients for clustering, which may reveal a client's private information and even its original training data. To avoid this problem, this embodiment uses the RSA-PSI scheme to obtain the intersection with the data of other clients without revealing the training data, and the intersection similarity vector between the client and the other clients is computed from this intersection.
In this embodiment, each client needs to count the number of samples of each tag in its local data and process these counts so that the client's real data distribution information is not leaked; in step S1, the functional expression by which each client calculates the data distribution information of its own label samples is as follows:
In the above formula, X_1, X_2, …, X_i respectively represent the single-tag data distribution information of the 1st, 2nd, …, i-th clients, and the calculation function expression of the single-tag data distribution information of any i-th client is as follows:
In the above formula, the data quantity term represents the number of samples for the j-th index of the i-th client, idx_i represents the index of tag i, Q_max represents a predefined maximum that the data quantity of any tag does not exceed, X_i represents the data distribution information of the i-th client, and j is the serial number of the j-th index of the i-th client. After this transformation, a client can obtain the intersection without exposing its real data distribution information.
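For illustration, a minimal Python sketch of this label-counting step is given below. The patent's exact encoding formula is given only as an image, so the packing `idx * q_max + count` used here is an assumption that merely matches the surrounding description (every tag count stays below the predefined maximum Q_max); the privacy of the resulting set relies on the PSI exchange, not on this encoding alone.

```python
from collections import Counter

def label_distribution_info(labels, q_max):
    """Build the set X_i summarizing one client's label distribution.

    Hypothetical encoding: each (tag index, sample count) pair is packed into
    a single integer idx * q_max + count.  This packing is an assumption, not
    the patented formula, which is not reproduced in the text.
    """
    counts = Counter(labels)                                   # samples per tag on this client
    assert all(c < q_max for c in counts.values()), "no tag may exceed Q_max"
    return {idx * q_max + cnt for idx, cnt in counts.items()}
```

For example, a client holding labels [0, 0, 1, 2, 2, 2] with q_max = 1000 would obtain the set {2, 1001, 2003} under this assumed encoding.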
In this embodiment, the method by which a client obtains the intersection of its data distribution with those of other clients in step S1 is: the intersection is computed using the RSA-PSI scheme (an RSA-based private set intersection scheme) and a similarity vector is constructed. As an optional implementation, the functional expression of the intersection similarity vector constructed in step S1 in this embodiment is as follows:
In the above formula, ISM_i is the intersection similarity vector of the i-th client, ISM_i[1]~ISM_i[j] represent the data distribution similarities of the i-th client to the 1st through j-th clients, |X_i ∩ X_j| represents the size of the intersection of the data distributions of the i-th and j-th clients, X_i represents the data distribution information of the i-th client, and X_j represents the data distribution information constructed by the j-th client. Based on the disclosure of this embodiment, a person skilled in the art may also use a different private set intersection scheme to compute the intersection, and the technical scope claimed in this application is not limited by this specific embodiment.
In this embodiment, the functional expression of the similarity matrix constructed in step S2 is:
In the above formula, M_sim is the similarity matrix, and its i-th row is the intersection similarity vector formed by the data distribution similarities of the i-th client to the 1st through n-th clients.
In this embodiment, the clustering, performed by the server based on the similarity matrix, of the clients using the clustering method for ensuring diversity in step S3 includes: aggregating all clients with intersection similarity higher than a threshold value alpha into a candidate cluster set and removing duplication, so that the candidate cluster set comprises all possible clustering results; and calculating the weight of each candidate cluster in the candidate cluster set, and adding the candidate cluster with the minimum load into the final cluster set by using a greedy algorithm for each selection until all the clients are distributed into the final cluster set.
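The candidate-construction-plus-greedy-selection procedure described above can be written compactly as below. The `cost` callable corresponds to the candidate-cluster weight whose formula is given next, and the load ratio (cost divided by the number of newly covered clients) is an assumption consistent with the description, since the patent's own formulas are reproduced only as images.

```python
def candidate_clusters(m_sim, alpha):
    """One candidate cluster per client i: i together with every client whose
    similarity to i exceeds the threshold alpha; duplicates are removed."""
    n = len(m_sim)
    cands = {frozenset([i] + [j for j in range(n) if m_sim[i][j] > alpha]) for i in range(n)}
    return [set(c) for c in cands]

def greedy_select(cands, cost):
    """Greedy selection: repeatedly add the candidate with the smallest load
    (assumed here to be cost / number of newly covered clients) until every
    client belongs to at least one final cluster.  A client may appear in
    several final clusters, which is what lets diverse data be reused."""
    all_clients = set().union(*cands)
    covered, final = set(), []
    while covered != all_clients:
        best = min((c for c in cands if c - covered),
                   key=lambda c: cost(c) / len(c - covered))
        final.append(best)
        covered |= best
    return final
```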
In this embodiment, the functional expression for calculating the weight of each candidate cluster is:
In the above formula, cost(S'_i) represents the cost of candidate cluster S'_i, ID is the number of a client, and M_sim[i][ID] is the element in the i-th row and ID-th column of the similarity matrix; the calculation function expression of the load is:
In the above formula, payload represents the load, S' represents a candidate cluster, I represents the set of clients that have already been selected into the final cluster set, and S'\I represents the clients of the candidate cluster that have not yet been selected into the final cluster set.
In this embodiment, the step S4 of detecting the malicious cluster by the server means that the server detects the precision of each cluster in the final cluster set according to local data of the server, and selects a cluster with the lowest precision as the malicious cluster.
In this embodiment, the method for detecting whether a malicious cluster exists is to check whether the accuracy of the global model, as measured by the server, has degraded significantly (i.e., beyond a set value). The method for determining the malicious cluster in this embodiment is: the server tests the accuracy of each cluster on its own local data and selects the cluster with the lowest accuracy as the malicious cluster. Of course, one of ordinary skill in the art may choose a different method to detect malicious clusters as needed, for example checking whether the cluster model meets a predetermined criterion. The technical scope claimed in the present application is not limited by this specific embodiment.
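The post-hoc detection just described can be sketched as follows; `evaluate` is a placeholder for testing a cluster model on the server's small IID data set, the 5% accuracy-drop threshold is an assumed "set value", and taking the mean cluster accuracy as the global signal is likewise an assumption.

```python
def detect_malicious_cluster(cluster_models, server_data, evaluate,
                             prev_global_acc, drop_threshold=0.05):
    """Return the index of the suspected malicious cluster, or None if the
    global accuracy has not degraded beyond the set threshold."""
    accs = [evaluate(model, server_data) for model in cluster_models]
    global_acc = sum(accs) / len(accs)
    if prev_global_acc - global_acc <= drop_threshold:
        return None                                        # no significant degradation
    return min(range(len(accs)), key=lambda k: accs[k])    # lowest-accuracy cluster is suspect
```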
In this embodiment, in step S6, using the members of the verification committee to verify the models of the members of the malicious cluster and voting comprises: each member of the verification committee verifies the model accuracy of every member of the malicious cluster on its own local data, regards the models whose accuracy is lower than the average as malicious models and votes against them; finally all voting results are summed, a model whose number of votes is higher than the average number of votes is identified by the verification committee as a malicious model, and otherwise it is identified as a benign model. The method for selecting the verification committee in step S6 of this embodiment is: for each client in the malicious cluster, the server takes the client whose data distribution is most similar to it but which is not in the malicious cluster.
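A sketch of the committee selection and voting rule follows; `evaluate` again stands in for local accuracy testing, and all function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def form_committee(m_sim, malicious_cluster, n_clients):
    """For each client in the suspect cluster, take the most similar client
    outside the cluster; the union of these picks forms the committee."""
    outside = [j for j in range(n_clients) if j not in malicious_cluster]
    return {max(outside, key=lambda j: m_sim[i][j]) for i in malicious_cluster}

def committee_vote(committee, suspect_models, local_data, evaluate):
    """Each member votes against every model whose accuracy on that member's
    local data is below the member's own average; models collecting more than
    the average number of votes are judged malicious."""
    votes = np.zeros(len(suspect_models))
    for member in committee:
        accs = [evaluate(m, local_data[member]) for m in suspect_models]
        avg = sum(accs) / len(accs)
        votes += np.array([a < avg for a in accs], dtype=float)
    return [k for k, v in enumerate(votes) if v > votes.mean()]   # indices of malicious models
```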
In this embodiment, the scheme is verified through a specific simulation experiment, with federal learning simulated based on PyTorch. In the simulation experiment, 1 server and 100 clients were set up. The server aggregates the local models in each cluster using the FederatedAveraging algorithm, and each client iterates once over its local data when training. The server holds a portion of independently and identically distributed data for detecting malicious clusters. Three different training methods are set: GCFL, FedAvg, and GICFL. In the GCFL method, the system clusters clients using a gradient-based clustering method. In the FedAvg method, the system does not cluster clients and performs only the most basic federal learning training. In GICFL, the clients are clustered using the method described above. For each training method, the MNIST, FMNIST and CIFAR-10 data sets are used as benchmark data sets. For each data set, the client data is initialized with two different methods: ill-conditioned non-IID and Dirichlet non-IID. In the ill-conditioned non-IID initialization method, each client obtains random data of two tags, i.e., the ill-conditioned non-IID parameter is set to k = 2. In the Dirichlet non-IID initialization method, the data of each client obeys a Dirichlet distribution with β = 0.5. The test accuracy of the model under the three training methods is shown in Figs. 3, 4 and 5; in each of these figures, (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Figs. 3, 4 and 5, under the GICFL training method of this embodiment, the test accuracy of the model is better than that of the other two training methods for the various data sets and data distributions.
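The non-IID partitions used in the simulation can be reproduced roughly as below. The patent does not give partitioning code, so this Dirichlet split is only a conventional reconstruction of the stated settings; the ill-conditioned k = 2 setting would instead give each client samples of only two randomly chosen labels.

```python
import numpy as np

def dirichlet_partition(labels, n_clients=100, beta=0.5, seed=0):
    """Split sample indices among clients so that, for every class, the share
    each client receives is drawn from Dirichlet(beta).  beta = 0.5 matches
    the main experiments and beta = 0.1 the more extreme setting."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        props = rng.dirichlet([beta] * n_clients)            # class share per client
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx
```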
In the simulation experiment of this embodiment, the efficiency of the technical solution of the invention is also tested under a more extreme data initialization. When the parameter of the Dirichlet initialization method is set to β = 0.1, the test accuracies after convergence of the three training methods on the MNIST and FMNIST data sets are shown in Fig. 6, where (a) is the test accuracy on the MNIST data set and (b) is the test accuracy on the FMNIST data set. As can be seen from Fig. 6, the test accuracy of this embodiment still performs well under this more extreme data initialization.
In the simulation experiment of this embodiment, Fig. 7 shows the main overhead required to calculate the intersection of the data distributions between clients under the different data distribution initialization methods, where (a) is the overhead on the ill-conditioned initialization distribution and (b) is the overhead on the Dirichlet initialization distribution. As can be seen from Fig. 7, some time is required to calculate the intersection similarity, but since this phase precedes the training phase, it is negligible compared with the training time. In addition, a new member can join without recalculating the intersections among the original members. Considering the time required for training and the need for privacy protection, the overhead of this protocol is acceptable.
In the simulation experiment of this embodiment, the case where malicious clients are present is also simulated. Each malicious client adds Gaussian noise with variance 1 to its local model, i.e., δ = 1. The test accuracy of each cluster model after the malicious client attack is shown in Fig. 8, where (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Fig. 8, after the malicious clients act maliciously, the server can still detect the malicious cluster using its own local data.
In the simulation experiment of this embodiment, three methods were used to construct the verification committee. In the first method, the server selects the clients outside the malicious cluster whose data distributions are most similar to those of the clients in the cluster to form the verification committee; this is the method adopted in this embodiment. In the second method, the server selects random clients outside the malicious cluster to form the verification committee. In the third method, the server selects the clients outside the malicious cluster that are least similar to the clients in the cluster to form the verification committee. After the verification committee votes, the clients identified as malicious are removed and the remaining clients are re-clustered. The accuracy of the re-clustered model is shown in Fig. 9, where (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Fig. 9, the detection method of this embodiment achieves an excellent detection effect on malicious clusters and malicious clients.
In the simulation experiment of this embodiment, the robustness of the detection method is also verified by gradually adding different degrees of Gaussian noise to the clients' models, as shown in Fig. 10, where (a) is the test accuracy on the ill-conditioned initialization distribution and (b) is the test accuracy on the Dirichlet initialization distribution. As can be seen from Fig. 10, under the Dirichlet initialization distribution a malicious model in a malicious cluster can be accurately identified even when the added Gaussian noise has a variance of only 0.006, and under the ill-conditioned initialization distribution even Gaussian noise with a variance of 0.01 can be tolerated.
In summary, in the clustering process of the clustering federal learning method independent of the gradient, the server does not need to cluster by relying on the gradient information of the client, but clusters according to the intersection similarity between the data distributions of the client, so that the problem of leakage of the gradient information of the client is avoided, the gradient safety of the client is protected, the safety and reliability in the clustering federal learning process are enhanced, and the training precision is improved. In the clustering process, the gradient-independent clustering federal learning method innovatively allows the same client to appear in a plurality of clusters, so that the most suitable cluster can be found for each client, the diversity of client data is fully utilized, and the model precision is enhanced. The gradient-independent clustering federated learning method is different from the prior detection commonly used in the existing federated learning, and innovatively uses a post detection mechanism to detect the malicious clusters existing in the system, so that the detection is allowed after the attack starts, the expenditure is saved, and the safety of the system is further improved. The gradient-independent clustering federal learning method can be applied to medical industry scenes, and each hospital can efficiently utilize the diversity of hospital sensitive data and ensure the availability of the sensitive data by using the method as a client for federal learning, so that the problem of non-independent and same distribution among hospitals is solved, and a plurality of cluster models suitable for each hospital are efficiently obtained.
In addition, the present embodiment also provides a gradient-independent clustering federal learning system, which includes a plurality of interconnected clients, each of which includes an interconnected microprocessor and a memory, wherein the microprocessor is programmed or configured to execute the gradient-independent clustering federal learning method. Furthermore, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, the computer program being programmed or configured by a microprocessor to execute the foregoing gradient-independent clustering federal learning method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiments, and all technical solutions that belong to the idea of the present invention belong to the scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
Claims (10)
1. A gradient-independent clustered federation learning method, comprising:
S1, each client respectively calculates the data distribution information of its own label samples, obtains the intersection similarity between its data distribution information and that of the other clients, and constructs an intersection similarity vector;
S2, the server collects the intersection similarity vectors of the clients and constructs a similarity matrix;
S3, based on the similarity matrix, the server clusters the clients using a clustering method that ensures diversity and executes the model training step, jumping to the next step when the server detects that the accuracy of the model has dropped;
S4, the server detects the malicious cluster, and after the malicious cluster is determined, selects clients that are not in the malicious cluster but whose data distributions are most similar to those of the clients in the malicious cluster to form a verification committee;
S6, the members of the verification committee verify the models of the members of the malicious cluster and vote to classify them as benign models or malicious models, and the malicious models are excluded while the benign models are retained.
2. The gradient-independent clustering federation learning method of claim 1, wherein in step S1, the functional expression by which each client calculates the data distribution information of its own label samples is as follows:
In the above formula, X_1, X_2, …, X_i respectively represent the single-tag data distribution information of the 1st, 2nd, …, i-th clients, and the calculation function expression of the single-tag data distribution information of any i-th client is as follows:
3. The gradient-independent clustering federated learning method of claim 2, wherein the functional expression of the intersection similarity vector constructed in step S1 is:
In the above formula, ISM_i is the intersection similarity vector of the i-th client, ISM_i[1]~ISM_i[j] represent the data distribution similarities of the i-th client to the 1st through j-th clients, |X_i ∩ X_j| represents the size of the intersection of the data distributions of the i-th and j-th clients, X_i represents the data distribution information constructed by the i-th client, and X_j represents the data distribution information constructed by the j-th client.
4. The gradient-independent clustering federated learning method according to claim 3, wherein the functional expression of the similarity matrix constructed in step S2 is:
In the above formula, M_sim is the similarity matrix, and its i-th row is the intersection similarity vector formed by the data distribution similarities of the i-th client to the 1st through n-th clients.
5. The gradient-independent clustering federated learning method according to claim 4, wherein the server clustering the clients using a diversity-guaranteed clustering method based on the similarity matrix in step S3 includes: aggregating all clients with intersection similarity higher than a threshold value alpha into a candidate cluster set and removing duplication, so that the candidate cluster set comprises all possible clustering results; and calculating the weight of each candidate cluster in the candidate cluster set, and adding the candidate cluster with the minimum load into the final cluster set by using a greedy algorithm in each selection until all the clients are distributed into the final cluster set.
6. The gradient-independent clustering federal learning method as claimed in claim 5, wherein the functional expression for calculating the weight of each candidate cluster is:
In the above formula, cost(S'_i) represents the cost of candidate cluster S'_i, ID is the number of a client, and M_sim[i][ID] is the element in the i-th row and ID-th column of the similarity matrix; the calculation function expression of the load is:
In the above formula, payload represents the load, S' represents a candidate cluster, I represents the set of clients that have already been selected into the final cluster set, and S'\I represents the clients of the candidate cluster that have not yet been selected into the final cluster set.
7. The gradient-independent cluster federal learning method as claimed in claim 6, wherein the step S4 of detecting the malicious cluster by the server means that the server detects the accuracy of each cluster in the final cluster set according to local data of the server, and selects a cluster with the lowest accuracy as the malicious cluster.
8. The gradient-independent cluster federated learning method of claim 7, wherein in step S6, using the members of the verification committee to verify the models of the members of the malicious cluster and voting to classify them as benign models and malicious models comprises: each member of the verification committee verifies the model accuracy of every member of the malicious cluster on its own local data, regards the models whose accuracy is lower than the average as malicious models and votes against them; finally all voting results are summed, and a model whose number of votes is higher than the average number of votes is identified by the verification committee as a malicious model, otherwise it is identified by the verification committee as a benign model.
9. A gradient-independent clustered federated learning system, comprising a plurality of interconnected clients, the clients comprising an interconnected microprocessor and memory, wherein the microprocessor is programmed or configured to perform the gradient-independent clustered federated learning method of any one of claims 1 to 8.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program is adapted to be programmed or configured by a microprocessor to perform the gradient independent clustering federal learning method as claimed in any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211422140.9A CN115577360A (en) | 2022-11-14 | 2022-11-14 | Gradient-independent clustering federal learning method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211422140.9A CN115577360A (en) | 2022-11-14 | 2022-11-14 | Gradient-independent clustering federal learning method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115577360A (en) | 2023-01-06
Family
ID=84588580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211422140.9A Pending CN115577360A (en) | 2022-11-14 | 2022-11-14 | Gradient-independent clustering federal learning method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115577360A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117077817A (en) * | 2023-10-13 | 2023-11-17 | 之江实验室 | Personalized federal learning model training method and device based on label distribution |
CN117094412A (en) * | 2023-08-18 | 2023-11-21 | 之江实验室 | Federal learning method and device aiming at non-independent co-distributed medical scene |
CN117436078A (en) * | 2023-12-18 | 2024-01-23 | 烟台大学 | Bidirectional model poisoning detection method and system in federal learning |
CN117640253A (en) * | 2024-01-25 | 2024-03-01 | 济南大学 | Federal learning privacy protection method and system based on homomorphic encryption |
CN118250098A (en) * | 2024-05-27 | 2024-06-25 | 泉城省实验室 | Method and system for resisting malicious client poisoning attack based on packet aggregation |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117094412A (en) * | 2023-08-18 | 2023-11-21 | 之江实验室 | Federal learning method and device aiming at non-independent co-distributed medical scene |
CN117094412B (en) * | 2023-08-18 | 2024-06-28 | 之江实验室 | Federal learning method and device aiming at non-independent co-distributed medical scene |
CN117077817A (en) * | 2023-10-13 | 2023-11-17 | 之江实验室 | Personalized federal learning model training method and device based on label distribution |
CN117077817B (en) * | 2023-10-13 | 2024-01-30 | 之江实验室 | Personalized federal learning model training method and device based on label distribution |
CN117436078A (en) * | 2023-12-18 | 2024-01-23 | 烟台大学 | Bidirectional model poisoning detection method and system in federal learning |
CN117436078B (en) * | 2023-12-18 | 2024-03-12 | 烟台大学 | Bidirectional model poisoning detection method and system in federal learning |
CN117640253A (en) * | 2024-01-25 | 2024-03-01 | 济南大学 | Federal learning privacy protection method and system based on homomorphic encryption |
CN117640253B (en) * | 2024-01-25 | 2024-04-05 | 济南大学 | Federal learning privacy protection method and system based on homomorphic encryption |
CN118250098A (en) * | 2024-05-27 | 2024-06-25 | 泉城省实验室 | Method and system for resisting malicious client poisoning attack based on packet aggregation |
CN118250098B (en) * | 2024-05-27 | 2024-08-09 | 泉城省实验室 | Method and system for resisting malicious client poisoning attack based on packet aggregation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||