CN116233954A - Clustered data sharing method and device based on federal learning system and storage medium - Google Patents


Info

Publication number
CN116233954A
Authority
CN
China
Prior art keywords
cluster
distributed
training
cluster head
nodes
Prior art date
Legal status
Pending
Application number
CN202211575350.1A
Other languages
Chinese (zh)
Inventor
滕颖蕾
余思聪
胡刚
满毅
滕俊杰
王楠
牛涛
杜韬
金磊
马仕君
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202211575350.1A priority Critical patent/CN116233954A/en
Publication of CN116233954A publication Critical patent/CN116233954A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/04Communication route or path selection, e.g. power-based or shortest path routing based on wireless node resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W12/00Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/03Protecting confidentiality, e.g. by encryption
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/70Services for machine-to-machine communication [M2M] or machine type communication [MTC]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a clustered data sharing method, a clustered data sharing device and a storage medium based on a federal learning system, wherein the federal learning system comprises K distributed devices and a central server, and K is an integer greater than 1. The method comprises the following steps: dividing the K distributed devices into M clusters based on a preset clustering algorithm, wherein M is an integer less than K and at least one of the M clusters comprises a cluster head device and intra-cluster member devices; controlling the cluster head device in each cluster to share training data with the intra-cluster member devices; and, based on a preset federal learning algorithm, training a preset initial model through cooperative iteration of the training data of each distributed device and the central server to obtain the target model after federal learning training. According to the invention, the distributed devices are clustered and the cluster head devices share training data with the intra-cluster member devices, so that the degree of data heterogeneity is mitigated, the communication overhead of federal learning training is reduced, and the accuracy of the finally trained target model is improved.

Description

Clustered data sharing method and device based on federal learning system and storage medium
Technical Field
The present invention relates to the field of wireless communications technologies, and in particular, to a clustered data sharing method and apparatus based on a federal learning system, and a storage medium.
Background
Federal Learning (Federated Learning, FL) is a Machine Learning (ML) technique capable of completing model training without revealing private data, and the core idea is to perform distributed model training among a plurality of devices having local data, specifically, to train a model locally through the local data by using a plurality of distributed devices, and to aggregate model parameters after training of each distributed device by a server, so as to obtain a federal learned model. Compared with the traditional centralized ML technology, the FL has the advantages that local data of a user side are not required to be uploaded to a server, and the data privacy of the user can be well protected.
In an actual application scenario, due to the limited geographical environment and limited observation capability of each device, the data generated by the distributed devices often have an unbalanced distribution of characteristics; that is, the data generated by the distributed devices are non-independent and identically distributed (Non-IID). It can be understood that the training data sources of federal learning currently suffer from the problem of data heterogeneity.
The problem of data heterogeneity reduces the convergence rate of the federal learning algorithm and increases the communication overhead of federal learning training. In addition, data heterogeneity causes the stochastic gradient descent (SGD) updates to deviate from the target direction, so that the accuracy of the finally trained model is low.
Disclosure of Invention
The invention provides a clustered data sharing method, device and storage medium based on a federal learning system, which are used for solving the problems of low convergence rate of federal learning algorithm, high communication overhead of federal learning training and low accuracy of a final training model in the prior art.
The invention provides a clustered data sharing method based on a federal learning system, wherein the federal learning system comprises K distributed devices and a central server, and K is an integer greater than 1;
the method comprises the following steps:
dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
controlling cluster head equipment in each cluster to share training data to member equipment in the cluster;
Based on a preset federal learning algorithm, a preset initial model is trained through cooperative iteration of training data of each distributed device and the central server, and a target model after federal learning training is obtained.
According to the clustered data sharing method based on the federal learning system provided by the invention, the K distributed devices are divided into M clusters based on a preset clustering algorithm, and the method comprises the following steps:
establishing privacy constraint graphs of the K distributed devices; the privacy constraint graph comprises K nodes corresponding to the K distributed devices and edges used for connecting the nodes;
calculating the affinity relation value between each distributed device and other distributed devices in the K distributed devices, and taking the affinity relation value as a first attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; the affinity relation value is used for representing the trust degree between the distributed devices;
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph to obtain a privacy communication constraint graph;
calculating the earth mover's distance (EMD) in the system corresponding to the K distributed devices, and taking the in-system EMD as the attribute values of the K nodes; wherein the in-system EMD is used for representing the difference between the distribution of training data of each distributed device and the global data distribution of the K distributed devices;
In the privacy communication constraint graph, selecting corresponding nodes as cluster head nodes according to the sequence from the large attribute value to the small attribute value of the nodes until edges exist between other nodes and at least one cluster head node, so as to obtain M cluster head nodes; wherein the other nodes are nodes except the cluster head node in the K nodes;
calculating inter-device EMD between each distributed device and other distributed devices in the K distributed devices as a second attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; wherein the inter-device EMD is used to characterize differences in distribution of training data between the distributed devices;
dividing the other nodes into clusters where the cluster head nodes with edges exist between the other nodes and the other nodes under the condition that the edges exist between the other nodes and only one cluster head node;
dividing the other nodes into clusters where the cluster head nodes with the largest second attribute values corresponding to the edges between the other nodes are located under the condition that the edges exist between the other nodes and at least two cluster head nodes;
and taking the distributed equipment corresponding to the M cluster head nodes as cluster head equipment of the M clusters, and taking the distributed equipment corresponding to other nodes in the cluster where each cluster head node is located as intra-cluster member equipment of the M clusters.
According to the clustered data sharing method based on the federal learning system provided by the invention, the edges corresponding to the first attribute values smaller than the privacy degree threshold are deleted in the privacy constraint map, and before the privacy communication constraint map is obtained, the method further comprises:
calculating the data transmission rate between each distributed device and the other distributed devices, and taking the data transmission rate as a third attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices;
deleting the edge corresponding to the first attribute value smaller than the privacy degree threshold in the privacy constraint map to obtain a privacy communication constraint map, wherein the method comprises the following steps:
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph, and deleting edges corresponding to the third attribute values smaller than the communication rate threshold to obtain the privacy communication constraint graph.
According to the clustered data sharing method based on the federal learning system provided by the invention, before the cluster head equipment in each cluster is controlled to share training data to the member equipment in the cluster, the method further comprises the following steps:
calculating the sharing delay t_s of each cluster head device for sharing training data to the intra-cluster member devices, and the training delay t_FL for performing one round of model training based on the federal learning algorithm;
based on t_s and t_FL, calculating a target shared data amount N_S of each cluster head device and a target central processing unit (CPU) frequency f for each distributed device to train the model by adopting formula (1), wherein Ω(·) represents the number of iterations of the federal learning system training the model;
the controlling the cluster head devices in each cluster to share training data to the member devices in the cluster includes:
controlling each cluster head device to share training data of data amount N_S with its intra-cluster member devices;
the training data of each distributed device and the central server are used for collaborative iterative training of a preset initial model based on a preset federal learning algorithm to obtain a target model after federal learning training, and the method comprises the following steps:
based on the federation learning algorithm, f is used as the CPU frequency of each distributed equipment training model, and a preset initial model is trained through collaborative iteration of training data of each distributed equipment and the central server to obtain a target model after federation learning training.
According to the clustered data sharing method based on the federal learning system provided by the invention, the calculating of the sharing delay t_s for sharing training data by each cluster head device to the intra-cluster member devices and the training delay t_FL for performing one round of model training based on the federal learning algorithm comprises:
calculating the sharing delay t_s by adopting formula (2):
t_s = max_{m∈M} max_{c∈C_m} ( A · N_S^m / v_{m,c} )    (2)
wherein M represents the set of cluster head devices, m represents the m-th cluster head device in the set of cluster head devices, C_m represents the set of intra-cluster member devices in the cluster where the m-th cluster head device is located, c represents the c-th intra-cluster member device in that set, A represents the number of bits occupied by one sample of training data, N_S^m characterizes the data amount of training data shared by the m-th cluster head device with its intra-cluster member devices, and v_{m,c} characterizes the data transmission rate between the m-th cluster head device and the c-th intra-cluster member device;
calculating the training delay t_FL by adopting formula (3):
t_FL = max_k ( t_k^D + t_k^L + t_k^U )    (3)
wherein t_k^D characterizes the downlink delay, t_k^L characterizes the update delay, and t_k^U characterizes the uplink delay.
According to the clustered data sharing method based on the federal learning system provided by the invention, the cluster head equipment in each cluster is controlled to share training data to the member equipment in the cluster, and the method comprises the following steps:
and controlling cluster head equipment in each cluster to share training data with the member equipment in the cluster in a device-to-device (D2D) multicast mode.
The invention also provides a clustered data sharing device based on the federal learning system, wherein the federal learning system comprises K distributed devices and a central server, and K is an integer greater than 1;
the device comprises:
the clustering module is used for dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
the control module is used for controlling the cluster head equipment in each cluster to share training data to the member equipment in the cluster;
the federation learning training module is used for obtaining a target model after federation learning training by cooperatively and iteratively training a preset initial model through training data of each distributed device and the central server based on a preset federation learning algorithm.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the clustered data sharing method based on the federal learning system when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a clustered data sharing method based on a federal learning system as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a clustered data sharing method based on a federal learning system as described in any one of the above.
According to the clustered data sharing method, device and storage medium based on the federal learning system, the K distributed devices of the federal learning system are divided into M clusters based on a preset clustering algorithm, and the cluster head device in each cluster shares training data with the intra-cluster member devices, which mitigates the degree of data heterogeneity among the training data of the distributed devices; a preset initial model is then trained through cooperative iteration of the training data of each distributed device and the central server based on the federal learning algorithm, so that the target model after federal learning training is obtained. Compared with the model training process of federal learning in the related art, clustering the distributed devices and having the cluster head devices share training data with the intra-cluster member devices mitigates the degree of data heterogeneity, which in turn improves the convergence speed of the subsequent federal learning algorithm, reduces the communication overhead of federal learning training, and effectively improves the accuracy of the finally trained target model.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a clustered data sharing method based on a federal learning system provided by the invention;
FIG. 2 is a schematic diagram of a privacy constraint graph in a clustered data sharing method based on a federal learning system provided by the invention;
FIG. 3 is a schematic diagram of a privacy communication constraint graph in a clustered data sharing method based on a federal learning system provided by the invention;
FIG. 4 is a second flow chart of a clustered data sharing method based on a federal learning system according to the present invention;
FIG. 5 is a schematic diagram of a clustered data sharing device based on a federal learning system according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The clustered data sharing method, apparatus and storage medium based on the federal learning system of the present invention are described below with reference to the accompanying drawings.
The federal learning system according to the embodiment of the present invention includes K distributed devices, such as terminals, and a central server, where K is an integer greater than 1.
FIG. 1 is a schematic flow chart of a clustered data sharing method based on a federal learning system according to the present invention, as shown in FIG. 1, the method includes steps 101 to 103; wherein:
step 101, dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
Step 102, controlling cluster head equipment in each cluster to share training data to member equipment in the cluster;
step 103, training a preset initial model in a collaborative iteration mode through training data of each distributed device and the central server based on a preset federal learning algorithm to obtain a target model after federal learning training.
Specifically, in the related art, due to the limited geographical environment and limited observation capability of each device, the characteristics of the data generated by the distributed devices are often unevenly distributed. This unbalanced data distribution can be understood as data heterogeneity, which reduces the convergence speed of the federal learning algorithm and increases the communication overhead of federal learning training; in addition, it causes the SGD updates to deviate from the target direction, so that the accuracy of the finally trained model is low.
Currently, there is a great deal of work addressing the data heterogeneity challenge of federal learning. Schemes designed around more efficient algorithms increase the computational overhead of the local devices and do not significantly improve the model accuracy of federal learning under data heterogeneity. In addition, many studies make the data distribution across devices more consistent through data augmentation methods; however, these methods require building a usable, public, ideal data set at the central server side. Given the large number of devices and the amount of private data in an Internet of Things (IoT) scenario, the building process may consume a large amount of cost and present security risks, so such methods cannot be well applied to large-scale federal learning with many users.
Aiming at the above defects in the prior art, the embodiment of the invention provides a clustered data sharing method based on a federal learning system, which can mitigate the degree of data heterogeneity without relying on an additional data set at the central server, thereby reducing the federal learning training cost (communication overhead) and improving the accuracy of the trained model.
In the embodiment of the invention, firstly, based on a preset clustering algorithm, the K distributed devices of the federal learning system are divided into M clusters, wherein at least one of the M clusters comprises a cluster head device and intra-cluster member devices; it is noted that a cluster of the M clusters may also comprise only a cluster head device. Within one cluster, only one cluster head device is set, and all other devices in the cluster are intra-cluster member devices.
After clustering, the cluster head device in each cluster shares training data with the intra-cluster member devices, so that the degree of data heterogeneity among the training data of the distributed devices is mitigated to a certain extent; further, a preset initial model is trained through cooperative iteration of the training data of each distributed device and the central server based on the federal learning algorithm, and the target model after federal learning training is obtained.
It should be noted that the federal learning algorithm may be an existing federal learning algorithm; the focus of the present invention is that, within each cluster, all or part of the training data of the cluster head device is shared with the intra-cluster member devices, so as to mitigate the degree of data heterogeneity among the devices in the cluster, and the distributed devices with mitigated data heterogeneity then iteratively train the initial model on their training data using the federal learning algorithm, obtaining the target model after federal learning training.
Optionally, the iterative training process of the model is as follows:
the one-round training process of federal learning consists of four parts: model downloading, local updating, model uploading and model aggregation:
1) Model downloading: the central server of the federal learning system first selects the set S_μ of distributed devices that participate in FL training in the μ-th round and broadcasts the initialized global model w_g (i.e., the initial model) to the selected distributed devices;
2) Local updating: after receiving the initialized global model w_g from the central server, each distributed device updates the model w_g on its own local data set via SGD using its training data, obtaining an updated model w_k, specifically:
w_k = w_g − η · ∇F_k(w_g)
wherein ∇F_k(w_g) represents the gradient of the local loss function and η represents the learning rate;
3) Model uploading: after the local model is updated, the distributed device uploads the local model to a Base Station (BS) through an uplink, and the uplink adopts an orthogonal frequency division multiple access technology (Orthogonal Frequency Division Multiple Access, OFDMA);
4) Model aggregation: after completing local model training, the distributed devices send their local models to the BS for synchronous aggregation; the global model is aggregated as a weighted average of the uploaded models, specifically formula (4):
w_g = Σ_k ( n_k / n ) · w_k    (4)
wherein n_k characterizes the data volume of the training data of the kth distributed device, and n characterizes the sum of the data volumes of the K distributed devices in the federal learning system;
After the global model aggregation is completed, the aggregated global model can be fed back to the users selected in the next round, and the model training process is repeated for multiple rounds until convergence, so that the target model after federal learning training is obtained.
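As an illustration of the four steps above, the following Python sketch simulates one FL round with models represented as NumPy parameter vectors. It is a minimal sketch under assumed interfaces: the gradient function, toy data and device list are hypothetical and stand in for the real local data sets and wireless transmission; only the broadcast, the local SGD update and the weighted aggregation of formula (4) are shown.

```python
import numpy as np

def local_sgd_update(w_g, X, y, grad_fn, lr=0.01, epochs=1):
    """Local update: start from the broadcast global model w_g and run SGD
    on the device's own data, returning the updated local model w_k."""
    w = w_g.copy()
    for _ in range(epochs):
        for i in range(len(y)):
            w -= lr * grad_fn(w, X[i], y[i])   # w_k = w_g - eta * grad F_k
    return w

def fl_round(w_g, devices, grad_fn, lr=0.01, epochs=1):
    """One FL round: broadcast w_g, collect local models, aggregate by data volume."""
    local_models, sizes = [], []
    for X, y in devices:                       # each (X, y) plays the role of one selected device
        local_models.append(local_sgd_update(w_g, X, y, grad_fn, lr, epochs))
        sizes.append(len(y))
    n = float(sum(sizes))
    # formula (4): weighted average of the uploaded models with weights n_k / n
    return sum((n_k / n) * w_k for n_k, w_k in zip(sizes, local_models))

if __name__ == "__main__":                     # toy usage with a squared-error gradient
    rng = np.random.default_rng(0)
    grad = lambda w, x, y: 2 * (x @ w - y) * x
    devices = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]
    w_g = np.zeros(3)
    for _ in range(20):                        # repeat rounds until convergence
        w_g = fl_round(w_g, devices, grad)
```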
According to the clustered data sharing method based on the federal learning system, the K distributed devices of the federal learning system are divided into M clusters based on a preset clustering algorithm, and the cluster head device in each cluster shares training data with the intra-cluster member devices, which mitigates the degree of data heterogeneity among the training data of the distributed devices; a preset initial model is then trained through cooperative iteration of the training data of each distributed device and the central server based on the federal learning algorithm, so that the target model after federal learning training is obtained. Compared with the model training process of federal learning in the related art, clustering the distributed devices and having the cluster head devices share training data with the intra-cluster member devices mitigates the degree of data heterogeneity, which in turn improves the convergence speed of the subsequent federal learning algorithm, reduces the communication overhead of federal learning training, and effectively improves the accuracy of the finally trained target model.
Optionally, the implementation manner of dividing the K distributed devices into M clusters based on the preset clustering algorithm may include:
establishing privacy constraint graphs of the K distributed devices; the privacy constraint graph comprises K nodes corresponding to the K distributed devices and edges used for connecting the nodes;
Calculating the affinity relation value between each distributed device and other distributed devices in the K distributed devices, and taking the affinity relation value as a first attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; the affinity relation value is used for representing the trust degree between the distributed devices;
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph to obtain a privacy communication constraint graph;
calculating the earth mover's distance (EMD) in the system corresponding to the K distributed devices as the attribute values of the K nodes; wherein the in-system EMD is used for representing the difference between the distribution of training data of each distributed device and the global data distribution of the K distributed devices;
in the privacy communication constraint graph, selecting corresponding nodes as cluster head nodes according to the sequence from big to small of attribute values (EMD in a system) of the nodes until edges exist between other nodes and at least one cluster head node, so as to obtain M cluster head nodes; wherein the other nodes are nodes except the cluster head node in the K nodes;
Calculating inter-device EMD between each distributed device and other distributed devices in the K distributed devices as a second attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; wherein the inter-device EMD is used to characterize differences in distribution of training data between the distributed devices;
dividing the other nodes into clusters where the cluster head nodes with edges exist between the other nodes and the other nodes under the condition that the edges exist between the other nodes and only one cluster head node;
dividing the other nodes into clusters where cluster head nodes with the largest second attribute value (inter-device EMD) corresponding to edges between the other nodes are located under the condition that edges exist between the other nodes and at least two cluster head nodes;
and taking the distributed equipment corresponding to the M cluster head nodes as cluster head equipment of the M clusters, and taking the distributed equipment corresponding to other nodes in the cluster where each cluster head node is located as intra-cluster member equipment of the M clusters.
Specifically, a privacy constraint graph G = (K, ε) of the K distributed devices is established first, wherein K represents the vertex set, comprising the K nodes corresponding to the K distributed devices, and ε represents the edge set, comprising the edges used for connecting the nodes.
For example, fig. 2 is a schematic diagram of a privacy constraint graph in the clustered data sharing method based on the federal learning system, where K is 7 and corresponds to nodes 1-7 as shown in fig. 2.
Firstly, calculating an affinity relation value e between each distributed device and other distributed devices in the K distributed devices, and taking the affinity relation value e as a first attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; the affinity relation value is used for representing the trust degree between the distributed devices;
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph to obtain a privacy communication constraint graph;
as shown in fig. 2, assuming that the affinity relationship value between the node No. 1 and the node No. 2 is calculated to be equal to 0.5, the first attribute value of the edge connecting the node No. 1 and the node No. 2 is set to 0.5, denoted as (0.5) in the figure; similarly, calculating and setting a first attribute value of an edge between every two nodes, and assuming that the first attribute value of the edge between the node 1 and the node 6 is 1, and representing the first attribute value as (1); the first attribute value of the edge between node 2 and node 6 is 2, denoted as (2); the first attribute value of the edge between node 2 and node 7 is 2, denoted as (2); the first attribute value of the edge between node No. 5 and node No. 7 is 2, denoted as (2); the first attribute value of the edge between node 3 and node 7 is 2, denoted as (2); the first attribute value of the edge between node No. 4 and node No. 7 is 2, denoted as (2); only a portion of the edges and their corresponding first attribute values are shown in fig. 2, and the non-shown edges may consider their first attribute values to be less than a privacy threshold, e.g., 1.
FIG. 3 is a schematic diagram of a privacy communication constraint graph in a clustered data sharing method based on a federal learning system, where as shown in FIG. 3, compared with FIG. 2, the edges between the No. 1 node and the No. 2 node are deleted because the first attribute value of the edges between the No. 1 node and the No. 2 node is smaller than the privacy threshold;
alternatively, the affinity relationship value may also be referred to as social affinity, and is used to characterize the degree of trust between two distributed devices in the D2D network, in particular, the affinity relationship value may be equal to 1, which represents a trust relationship with itself; the affinity value may also be equal to 0, indicating no trust relationship with other distributed devices; the higher the affinity value, the tighter the relationship between the two distributed devices;
the affinity relationship value e (k, j) between the kth and jth distributed devices can be calculated by the following formula:
Figure BDA0003989198410000131
wherein ,φk,j Characterizing the number of interactions between the kth distributed device and the jth distributed device, which may be obtained from a priori information in the environment; when one distributed device communicates frequently with another distributed device for a long period of time, the connection between the distributed devices may be considered reliable, when the communication distance d between the kth distributed device and the jth distributed device k,j Exceeding threshold d th When there is no D2D connection between two distributed devices, it can be considered.
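Because the exact affinity formula appears only as an image in the original, the sketch below uses an assumed piecewise form that matches the description: 1 for a device with itself, 0 when the communication distance exceeds d_th, and otherwise a normalized interaction count φ_{k,j}; the normalization is an illustrative choice, not the patent's formula.

```python
def affinity(k, j, interactions, distance, d_th, max_interactions):
    """Assumed social-affinity value e(k, j) in [0, 1].

    interactions[k][j]: number of past interactions phi_{k,j} (prior information);
    distance[k][j]:     communication distance d_{k,j}.
    The normalization by max_interactions is illustrative only."""
    if k == j:
        return 1.0                              # full trust relationship with itself
    if distance[k][j] > d_th:
        return 0.0                              # too far apart: no D2D connection, no trust
    return min(1.0, interactions[k][j] / max_interactions)
```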
The edges whose first attribute values are smaller than the privacy degree threshold are deleted from the privacy constraint graph, for example the edge between node 1 and node 2, so as to obtain the privacy communication constraint graph. It can be understood that two nodes connected by an edge retained in the privacy communication constraint graph have a closer relationship; when the cluster head device for each intra-cluster member device is subsequently selected, the selection can be made using the privacy communication constraint graph, so as to ensure as far as possible that the private data of the distributed devices is not revealed when the cluster head device shares data.
After the privacy communication constraint graph is obtained, the in-system EMDs corresponding to the K distributed devices are first calculated as the attribute values of the K nodes, wherein the in-system EMD is used for representing the difference between the distribution of training data of each distributed device and the global data distribution of the K distributed devices. As shown in fig. 3, the in-system EMDs corresponding to nodes 1-7 are assumed to be 1-7, respectively; that is, the attribute values corresponding to nodes 1-7 are 1-7, respectively.
then, calculating the inter-device EMD between each distributed device and other distributed devices in the K distributed devices, wherein the inter-device EMD is used as a second attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices, namely, is used as a second attribute value of an edge between each node and the other nodes; wherein the inter-device EMD is used to characterize differences in the distribution of training data between the distributed devices;
It should be noted that only the second attribute value corresponding to the edge existing in the privacy communication constraint graph may be calculated, and the second attribute value of the edge between the node No. 1 and the node No. 6 is calculated as 1, the second attribute value of the edge between the node No. 2 and the node No. 6 is calculated as 2, the second attribute value of the edge between the node No. 2 and the node No. 7 is calculated as 4, the second attribute value of the edge between the node No. 5 and the node No. 7 is calculated as 3, the second attribute value of the edge between the node No. 3 and the node No. 7 is calculated as 5, and the second attribute value of the edge between the node No. 4 and the node No. 7 is calculated as 6.
After the in-system EMD and the inter-device EMD are computed, clustering is performed using the privacy communication constraint graph, and the cluster head devices and intra-cluster member devices are selected as follows:
in the privacy communication constraint graph, selecting corresponding nodes as cluster head nodes according to the sequence from big to small of EMD in the system until edges exist between other nodes except the cluster head nodes in the K nodes and at least one cluster head node, so as to obtain M cluster head nodes;
For example, as shown in fig. 3, node 7, which has the largest in-system EMD, is selected first as a cluster head node; at this time, edges exist between nodes 2, 3, 4, 5 and node 7, but no edges exist between nodes 1, 6 and node 7, so node 6, which has the second largest in-system EMD, is also selected as a cluster head node. At this point every other node has an edge to at least one cluster head node, and nodes 6 and 7 are determined as the final cluster head nodes;
The following begins to select the intra-cluster member nodes corresponding to the cluster head nodes 6 and 7, and specifically, the following two cases are divided:
1) Dividing other nodes into clusters where the cluster head nodes with edges exist between the other nodes under the condition that the edges exist between the other nodes and one cluster head node;
for example, as shown in fig. 3, there is an edge between the No. 1 node and the No. 6 cluster head node, so the No. 1 node is directly divided into the clusters where the No. 6 cluster head node is located, and the No. 1 node is used as an intra-cluster member node in the cluster where the No. 6 cluster head node is located; similarly, the nodes 3, 4 and 5 are only connected with the cluster head node 7 by edges, so that the nodes 3, 4 and 5 are directly divided into clusters where the cluster head node 7 is located, and the nodes 3, 4 and 5 are used as intra-cluster member nodes in the cluster where the cluster head node 7 is located.
2) And under the condition that edges exist between other nodes and at least two cluster head nodes, dividing the other nodes into clusters where the cluster head nodes with the largest inter-device EMD corresponding to the edges between the other nodes are located.
For example, as shown in fig. 3, when an edge exists between node 2 and cluster head node 6 and an edge also exists between node 2 and cluster head node 7, the second attribute values (inter-device EMDs) of the two edges are compared: the inter-device EMD of the edge between node 2 and cluster head node 7 is 4, while that of the edge between node 2 and cluster head node 6 is 2, so the cluster of cluster head node 7, which has the larger inter-device EMD, is selected as the cluster of node 2; that is, node 2 becomes an intra-cluster member node in the cluster where cluster head node 7 is located.
After the cluster head nodes and the corresponding intra-cluster member nodes are selected, the distributed devices corresponding to the M cluster head nodes are used as cluster head devices of the M clusters, and the distributed devices corresponding to other nodes in the cluster where each cluster head node is located are used as intra-cluster member devices of the M clusters.
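The cluster-head selection and member assignment described above can be sketched as follows. This is a simplified illustration, not the patent's exact implementation: the pruned privacy communication constraint graph is given as an edge set, the per-node in-system EMD as the node attribute, and the inter-device EMD as the edge attribute; heads are picked in descending in-system EMD until every remaining node has an edge to at least one head, and each remaining node joins the adjacent head whose connecting edge has the largest inter-device EMD.

```python
def cluster(nodes, edges, system_emd, pair_emd):
    """Greedy clustering on the privacy communication constraint graph.

    nodes      : list of node ids
    edges      : set of frozenset({u, v}) kept after privacy/rate pruning
    system_emd : {node: in-system EMD}                  (node attribute values)
    pair_emd   : {frozenset({u, v}): inter-device EMD}  (edge second attribute)
    Returns {cluster_head: [member, ...]}."""
    adj = {u: {v for v in nodes if v != u and frozenset((u, v)) in edges} for u in nodes}
    heads = []
    # pick heads in descending in-system EMD until every other node reaches a head
    for u in sorted(nodes, key=lambda n: system_emd[n], reverse=True):
        heads.append(u)
        others = [v for v in nodes if v not in heads]
        if all(adj[v] & set(heads) for v in others):
            break
    clusters = {h: [] for h in heads}
    for v in nodes:
        if v in heads:
            continue
        candidates = adj[v] & set(heads)        # heads connected to v in the constraint graph
        if candidates:
            # join the head whose connecting edge has the largest inter-device EMD
            best = max(candidates, key=lambda h: pair_emd[frozenset((v, h))])
            clusters[best].append(v)
    return clusters

# toy usage mirroring the example of Figs. 2-3
nodes = [1, 2, 3, 4, 5, 6, 7]
edges = {frozenset(e) for e in [(1, 6), (2, 6), (2, 7), (5, 7), (3, 7), (4, 7)]}
system_emd = {i: i for i in nodes}              # node attribute values 1..7
pair_emd = {frozenset((1, 6)): 1, frozenset((2, 6)): 2, frozenset((2, 7)): 4,
            frozenset((5, 7)): 3, frozenset((3, 7)): 5, frozenset((4, 7)): 6}
print(cluster(nodes, edges, system_emd, pair_emd))   # {7: [2, 3, 4, 5], 6: [1]}
```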
Optionally, before deleting the edge corresponding to the first attribute value smaller than the privacy degree threshold in the privacy constraint map to obtain the privacy communication constraint map, calculating a data transmission rate between each of the distributed devices and the other distributed devices, and using the data transmission rate as a third attribute value of the edge between the node corresponding to each of the distributed devices and the node corresponding to the other distributed devices;
deleting the edge corresponding to the first attribute value smaller than the privacy degree threshold in the privacy constraint map, and obtaining the implementation manner of the privacy communication constraint map may include:
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph, and deleting edges corresponding to the third attribute values smaller than the communication rate threshold to obtain the privacy communication constraint graph.
In particular, considering that the amount of shared data also affects the optimization objective, a rate threshold v_th may be set as the communication rate threshold; when the intra-cluster member nodes of a cluster head node are selected, nodes whose data transmission rate is smaller than the communication rate threshold are not considered, so as to avoid sharing too little data. Specifically, the data transmission rate between each distributed device and the other distributed devices can be calculated and used as the third attribute value of the edge between the node corresponding to each distributed device and the nodes corresponding to the other distributed devices. When the privacy communication constraint graph is generated, the edges whose first attribute values are smaller than the privacy degree threshold and the edges whose third attribute values are smaller than the communication rate threshold are both deleted from the privacy constraint graph, yielding the privacy communication constraint graph, and the selection of cluster head devices and intra-cluster member devices can then be performed based on this graph.
In the embodiment of the invention, the data transmission rate constraint is considered in order to avoid too little data being shared among the distributed devices, which further alleviates the data heterogeneity problem in the federal learning system, improves the convergence speed of the subsequent federal learning algorithm, reduces the communication overhead of federal learning training, and improves the accuracy of the finally trained target model.
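A corresponding sketch of constructing the pruned privacy communication constraint graph is shown below (illustrative only; the symbol e_th for the privacy degree threshold is introduced here for convenience, while v_th follows the text): an edge is kept only when its affinity (first attribute value) reaches the privacy degree threshold and its data transmission rate (third attribute value) reaches the communication rate threshold.

```python
def build_constraint_graph(nodes, affinity, rate, e_th, v_th):
    """Build the pruned privacy communication constraint graph.

    affinity[(k, j)] : first attribute value (affinity / trust)
    rate[(k, j)]     : third attribute value (data transmission rate)
    Edges failing either threshold are deleted, as described above."""
    edges = set()
    for k in nodes:
        for j in nodes:
            if j <= k:
                continue                        # undirected graph: visit each pair once
            a = affinity.get((k, j), affinity.get((j, k), 0.0))
            v = rate.get((k, j), rate.get((j, k), 0.0))
            if a >= e_th and v >= v_th:
                edges.add(frozenset((k, j)))
    return edges
```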
Optionally, before the cluster head devices in each cluster are controlled to share training data to the intra-cluster member devices, the sharing delay t_s of each cluster head device for sharing training data to the intra-cluster member devices and the training delay t_FL for performing one round of model training based on the federal learning algorithm may be calculated;
based on t_s and t_FL, the target shared data amount N_S of each cluster head device and the target central processing unit (CPU) frequency f for each distributed device to train the model are calculated by adopting formula (1), wherein Ω(·) represents the number of iterations of the federal learning system training the model;
the controlling the cluster head devices in each cluster to share training data to the member devices in the cluster includes:
controlling each cluster head device to share training data of data amount N_S with its intra-cluster member devices;
the implementation manner of obtaining the target model after the federation learning training based on the preset federation learning algorithm through collaborative iterative training of the training data of each distributed device and the central server by the preset initial model may include:
based on the federation learning algorithm, f is used as the CPU frequency of each distributed equipment training model, and a preset initial model is trained through collaborative iteration of training data of each distributed equipment and the central server to obtain a target model after federation learning training.
Specifically, the sharing delay t_s for each cluster head device to share training data with its intra-cluster member devices and the training delay t_FL for performing one round of model training based on the federal learning algorithm can be calculated; then, based on t_s and t_FL, the target shared data amount N_S of each cluster head device and the target CPU frequency f for each distributed device to train the model are calculated. The N_S and f solved here can be understood as the optimal combination for training the model in the federal learning system, comprehensively considering the training delay and the model accuracy after training.
For the solution of formula (1), the following constraints exist: θ ≥ θ_th; γ_k ≥ γ_th for k ∈ K; and 0 ≤ f ≤ f_max.
The difficulty of this sub-problem is that the expression of the number of communication rounds Ω(N_S) with respect to the shared data amount is unknown, so the expression of the number of communication rounds is determined by data fitting, from which a basic functional form of Ω(N_S) can be obtained.
Through the fitted Ω(N_S), the sub-problem can be judged to be a convex problem, and the optimal solution can be obtained directly with existing algorithms, for example, by solving with the gradient descent method, the interior point method, the KKT conditions and the like to obtain N_S and f.
After N_S and f are solved, each cluster head device is controlled to share training data of data amount N_S with its intra-cluster member devices; based on the federal learning algorithm, f is used as the CPU frequency for each distributed device to train the model, and the preset initial model is trained through cooperative iteration of the training data of each distributed device and the central server, obtaining the target model after federal learning training, so as to achieve a better model training effect.
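Formula (1) and the fitted form of Ω(N_S) appear only as images in the original, so the sketch below merely illustrates the kind of search described in the text. It assumes, purely for illustration, an additive total-latency objective t_s(N_S) + Ω(N_S)·t_FL(N_S, f) and a generic feasibility check for the constraints, and it uses a plain grid search in place of the gradient-descent / interior-point / KKT solvers mentioned above.

```python
import itertools

def choose_ns_f(omega, t_s, t_fl, ns_grid, f_grid, feasible):
    """Grid-search sketch for the target shared data amount N_S and CPU frequency f.

    omega(ns)       : fitted number of communication rounds Omega(N_S)
    t_s(ns)         : sharing delay of the data preparation stage, cf. formula (2)
    t_fl(ns, f)     : delay of one FL training round, cf. formula (3)
    feasible(ns, f) : checks the constraints (accuracy / energy / 0 <= f <= f_max)
    The additive objective below is an assumption made for illustration only."""
    best, best_cost = None, float("inf")
    for ns, f in itertools.product(ns_grid, f_grid):
        if not feasible(ns, f):
            continue
        cost = t_s(ns) + omega(ns) * t_fl(ns, f)   # assumed total-latency objective
        if cost < best_cost:
            best, best_cost = (ns, f), cost
    return best
```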
Optionally, the implementation of calculating the sharing delay t_s for each cluster head device to share training data to the intra-cluster member devices and the training delay t_FL for performing one round of model training based on the federal learning algorithm may include:
calculating the sharing delay t_s by adopting formula (2):
t_s = max_{m∈M} max_{c∈C_m} ( A · N_S^m / v_{m,c} )    (2)
wherein M represents the set of cluster head devices, m represents the m-th cluster head device in the set of cluster head devices, C_m represents the set of intra-cluster member devices in the cluster where the m-th cluster head device is located, c represents the c-th intra-cluster member device in that set, A represents the number of bits occupied by one sample of training data, N_S^m characterizes the data amount of training data shared by the m-th cluster head device with its intra-cluster member devices, and v_{m,c} characterizes the data transmission rate between the m-th cluster head device and the c-th intra-cluster member device;
calculating the training delay t_FL by adopting formula (3):
t_FL = max_k ( t_k^D + t_k^L + t_k^U )    (3)
wherein t_k^D characterizes the downlink delay, t_k^L characterizes the update delay, and t_k^U characterizes the uplink delay.
Specifically, the sharing delay t_s can be calculated using formula (2), and the training delay t_FL can be calculated using formula (3).
The downlink delay t_k^D, the update delay t_k^L and the uplink delay t_k^U in formula (3) are explained below.
1) Downlink delay t_k^D: this is the delay of the BS broadcasting the global model to the distributed devices, and can be calculated by formula (5):
t_k^D = D_w / ( B_D · log2( 1 + P_B · h_k / N_0 ) )    (5)
wherein D_w characterizes the number of bits occupied by the global model (since the global model and the local model in FL use the same architecture, D_w is also the number of bits of the local model), B_D characterizes the broadcast bandwidth of the BS, P_B characterizes the transmit power of the base station, h_k characterizes the channel gain between the BS and the kth distributed device, and N_0 characterizes the noise power spectral density.
2) Update delay t_k^L: typically, the local update runs the SGD algorithm to minimize the loss function. Setting the number of epochs of the update algorithm to E, the computation delay of the kth distributed device in one local update round is as shown in formula (6):
t_k^L = E · L_k · n_k / f_k    (6)
wherein L_k characterizes the number of CPU cycles required to train one sample on the kth distributed device, and f_k characterizes the CPU frequency of the kth distributed device;
the energy consumption of each user in a local update can be calculated using formula (7):
γ_k^L = ρ_k · E · L_k · n_k · f_k^2    (7)
wherein ρ_k characterizes the energy consumption coefficient, which depends on the hardware properties of the kth distributed device.
3) Uplink delay t_k^U: the delay for the kth distributed device to upload its local model to the BS can be calculated by formula (8):
t_k^U = D_w / ( B_U · log2( 1 + P_k · h_k / N_0 ) )    (8)
wherein B_U characterizes the bandwidth allocated to a participant for communication with the BS, and P_k characterizes the transmit power of the kth distributed device;
the energy consumption generated by transmitting the model can be calculated using formula (9):
γ_k^U = P_k · t_k^U    (9)
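The per-device delays and energies of formulas (5) to (9), as reconstructed above, can be combined into the round delay of formula (3) with a small helper such as the sketch below; since the formulas themselves are reconstructions from the variable definitions, the exact SNR terms and the energy coefficient should be treated as assumptions.

```python
import math

def round_delays(D_w, B_D, B_U, P_B, P_k, h_k, N0, E, L_k, n_k, f_k, rho_k):
    """Downlink, update and uplink delays plus energies for one device in one FL round."""
    t_down = D_w / (B_D * math.log2(1 + P_B * h_k / N0))   # formula (5), reconstructed
    t_loc  = E * L_k * n_k / f_k                            # formula (6)
    e_loc  = rho_k * E * L_k * n_k * f_k ** 2               # formula (7), coefficient assumed
    t_up   = D_w / (B_U * math.log2(1 + P_k * h_k / N0))    # formula (8), reconstructed
    e_up   = P_k * t_up                                     # formula (9)
    return t_down, t_loc, t_up, e_loc, e_up

def t_fl(per_device):
    """Formula (3): round delay = max over devices of (downlink + update + uplink)."""
    return max(td + tl + tu for td, tl, tu, _, _ in per_device)
```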
Optionally, the implementation manner of controlling the cluster head devices in each cluster to share training data to the member devices in the cluster may include:
controlling cluster head devices in each cluster to share training data to the intra-cluster member devices by means of Device-to-Device (D2D) multicast.
Specifically, the cluster head equipment shares training data to the member equipment in the cluster in a D2D multicast mode, and compared with the mode of sharing the training data through a base station, the data security of the distributed equipment is improved.
The clustered data sharing method based on the federal learning system provided by the embodiment of the invention is exemplified as follows.
1. Suppose that in a wireless edge computing cell system there are K intelligent mobile devices serving as distributed devices and one BS, with one intelligent mobile device per user. The local samples collected by each distributed device are D_k = {(x_i, y_i)}, i = 1, ..., n_k, where x_i is the input feature vector of a sample, y_i is the output feature vector of the sample, and n_k is the sample size of the user's distributed device. The users and the BS cooperate under a client-server architecture to accomplish Deep Learning (DL) tasks. This edge environment not only contains the usual uplink and downlink transmissions, but also allows nearby users to use device-to-device (D2D) communication.
Local data generated by the users' distributed devices may vary significantly in statistical distribution, which results in non-IID data. For a clearer description, the mathematical definitions of IID and non-IID are given first.
IID is a common assumption in Distributed Learning: the data distribution of all users obeys one and the same global distribution P_g(x, y).
However, due to geographic environment and limited observation capability, the local data of the users' distributed devices no longer satisfies the IID assumption. That is, the data of each user's distributed device obeys a different distribution P_k(x, y), which can also be written as P_k(y)·P_k(x|y).
One common non-IID type in federal learning is label distribution skew. In this case, the conditional distributions P_k(x|y) of all users' distributed devices are identical, but the marginal distributions P_k(y) differ. The EMD, i.e. D_EMD(k), is used to quantify the degree of data heterogeneity on a distributed device, see formula (10):
D_EMD(k) = Σ_y | P_k(y) − P_g(y) |    (10)
Furthermore, the weighted average of the EMD values of all users' distributed devices is defined as the EMD value of the system as a whole, see formula (11):
D_EMD = Σ_k ( n_k / n ) · D_EMD(k)    (11)
Research and experiments show that the smaller the system EMD value, the lower the loss value of FL training and the faster the convergence speed, which indicates that reducing the system EMD value can accelerate FL training and improve model accuracy.
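Formulas (10) and (11), in the reconstructed form given above, can be evaluated directly from the label lists of the devices; the sketch below builds the empirical label distributions and returns the data-volume-weighted system EMD.

```python
from collections import Counter

def label_dist(labels):
    """Empirical label distribution P_k(y) from a list of labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    return {y: c / n for y, c in counts.items()}

def device_emd(p_k, p_g):
    """Formula (10): D_EMD(k) = sum_y |P_k(y) - P_g(y)| (reconstructed form)."""
    labels = set(p_k) | set(p_g)
    return sum(abs(p_k.get(y, 0.0) - p_g.get(y, 0.0)) for y in labels)

def system_emd(device_labels):
    """Formula (11): data-volume-weighted average of the per-device EMD values."""
    all_labels = [y for labels in device_labels for y in labels]
    p_g = label_dist(all_labels)
    n = len(all_labels)
    return sum(len(labels) / n * device_emd(label_dist(labels), p_g)
               for labels in device_labels)

# e.g. two fully label-skewed devices: system_emd([[0]*100, [1]*100]) -> 1.0
```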
In order to alleviate the data imbalance characteristic in the FL, the embodiment of the present invention proposes a clustered data sharing framework that can reduce communication overhead, and can make the distribution of training data more similar among distributed devices by exchanging a small amount of training data in a cluster with efficient communication and privacy protection characteristics.
Fig. 4 is a second flow chart of the clustered data sharing method based on the federation learning system according to the present invention, as shown in fig. 4, the framework includes two stages:
stage (one) of data processing
In this stage, clustering is performed according to factors such as the data distribution, channel state and reliability of the users' distributed devices. Within each cluster, the Cluster Head (CH) device shares a part of its own training data with the other intra-cluster member devices (CMs) by means of D2D multicast. A cluster head device is denoted m ∈ M, and the corresponding set of intra-cluster member devices is C_m. Note that, in order to avoid conflicts, a node joins at most one cluster, i.e. C_m ∩ C_{m'} = ∅ for m ≠ m'.
After the data sharing is completed, the local data volume on the kth distributed device is ñ_k, given by formula (12):
ñ_k = n_k + N_S^m for k ∈ C_m    (12)
wherein N_S^m characterizes the amount of data shared by cluster head device m.
In particular, when node k is selected as cluster head m, P_m(y) = P_k(y) is set, so the new data distribution on node k can be written as:

P'_k(y) = ( n_k · P_k(y) + N_S^m · P_m(y) ) / ( n_k + N_S^m )

Intuitively, the data sharing changes the original data distribution on the distributed devices and buffers the differences in the data characteristics, and the system EMD value after data sharing becomes formula (13):

D'_EMD = Σ_{k∈K} ( n'_k / n' ) · D'_EMD(k),  where D'_EMD(k) = Σ_y | P'_k(y) − P_g(y) | and n' = Σ_{j∈K} n'_j    (13)
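The effect of intra-cluster sharing on a member device's label distribution, as captured by formulas (12) and (13), can be illustrated with the sketch below; it assumes the shared samples follow the cluster head's own label distribution P_m(y), and the helper name distribution_after_sharing and the toy numbers are hypothetical.

import numpy as np

def distribution_after_sharing(counts_member, counts_head, n_shared):
    # Expected label counts on a member device after it receives n_shared
    # samples multicast by its cluster head; the shared samples are assumed
    # to follow the head's own label distribution P_m(y).
    counts_member = np.asarray(counts_member, dtype=float)
    counts_head = np.asarray(counts_head, dtype=float)
    p_head = counts_head / counts_head.sum()
    return counts_member + n_shared * p_head

# toy example: a member holding mostly class 0 receives 50 samples from a
# head whose data is close to the global distribution
member = [90, 5, 3, 2]
head = [30, 25, 25, 20]
new_counts = distribution_after_sharing(member, head, n_shared=50)
print(new_counts / new_counts.sum())   # noticeably flatter than the original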
(II) Federal learning training phase
FL co-trains a global model through cooperation between the BS and multiple users; the goal is to minimize the global loss function, which can be expressed by equation (14):

min_w F(w) = Σ_{k∈K} ( n_k / n ) · F_k(w)    (14)

where F_k(·) characterizes the loss function of the k-th distributed device.
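For illustration only, the sketch below evaluates the weighted objective of formula (14) and the matching sample-size-weighted aggregation of local models; global_loss and fedavg_aggregate are assumed helper names, and the aggregation shown is the standard weighted average rather than a definitive restatement of the claimed training procedure.

import numpy as np

def global_loss(local_losses, sample_sizes):
    # Global FL objective of formula (14): sample-size-weighted average of
    # the per-device losses F_k(w).
    w = np.asarray(sample_sizes, dtype=float)
    w /= w.sum()
    return float(w @ np.asarray(local_losses, dtype=float))

def fedavg_aggregate(local_models, sample_sizes):
    # Weighted aggregation of local model parameters, using the same
    # weights n_k / n as the global objective.
    w = np.asarray(sample_sizes, dtype=float)
    w /= w.sum()
    return sum(wi * mi for wi, mi in zip(w, local_models))

print(global_loss([0.8, 0.4, 0.6], [100, 300, 200]))
models = [np.array([1.0, 2.0]), np.array([2.0, 0.0]), np.array([0.0, 1.0])]
print(fedavg_aggregate(models, [100, 300, 200]))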
It is well known that data sharing is inevitably accompanied by privacy risks and communication costs. To mitigate these effects, the present invention defines two metrics to further design the clustering algorithm.
1. Social affinity: social affinity is an important index in social networks and reflects the closeness between users' distributed devices. In short, users are only willing to exchange data with other users of sufficiently high social affinity, so as to ensure the privacy of their data. In the clustered data sharing framework, a certain social affinity is therefore required between the intra-cluster member devices and the cluster head device, namely e_{m,c} ≥ e_th. A privacy constraint graph G_p = (K, E_p) is established to describe the social affinity relationships between the users' distributed devices more clearly, where E_p is the set of edges of the graph, representing the social affinity values between the users' distributed devices.
2. Shared latency (Transmission Delay): the cluster head device shares a portion of its training data set with the intra-cluster member devices via D2D multicast communication. The transmission rate from cluster head device m to each intra-cluster member device c ∈ C_m is v_{m,c}, calculated as in formula (15):

v_{m,c} = B · log2( 1 + p_m · h_{m,c} / ( I_m + B · N_0 ) )    (15)

where B characterizes the communication bandwidth used by the cluster head device to broadcast, h_{m,c} characterizes the channel gain between cluster head device m and intra-cluster member device c, p_m characterizes the transmission power of cluster head device m, I_m characterizes the interference caused by cluster head devices located in other service areas, and N_0 characterizes the noise power spectral density.
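A possible numerical reading of formula (15), assuming the usual Shannon-capacity form for the quantities listed above, is sketched below; the function name d2d_rate and the toy link parameters are illustrative assumptions.

import math

def d2d_rate(bandwidth_hz, tx_power_w, channel_gain, interference_w, n0_w_per_hz):
    # Achievable rate of the D2D multicast link from cluster head m to
    # member c, with interference I_m and noise power spectral density N_0.
    sinr = tx_power_w * channel_gain / (interference_w + n0_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + sinr)   # bits per second

# toy numbers: 1 MHz bandwidth, 100 mW transmit power, -70 dB channel gain
print(d2d_rate(1e6, 0.1, 1e-7, 1e-10, 4e-21) / 1e6, "Mbit/s")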
In multicast mode, the data sharing delay within a cluster is determined by the worst link, while the total data sharing time cost t_s depends on the maximum delay over all clusters; the specific calculation is as in equation (2). Besides the sharing delay of the data preparation stage, the framework also involves the model download delay (downlink delay), the model update delay (update delay) and the model upload delay (uplink delay); that is, the delay of one round of FL training is determined by the maximum, over the distributed devices, of the sum of the downlink, update and uplink delays:

t_FL = max_{k∈K} ( t_k^down + t_k^update + t_k^up )    (3)

The energy consumption γ_k of one FL training round consists of two parts, computation and uplink transmission, namely,

γ_k = γ_k^cmp + γ_k^up
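The delay and energy bookkeeping described above can be sketched as follows: the worst-link multicast rule for t_s, the straggler rule of formula (3) for t_FL and the two-part energy split follow the text, while the function names and the toy numbers are assumptions.

def sharing_delay(a_bits, n_shared, member_rates):
    # Formula (2): within a cluster the multicast is limited by the worst
    # link, and the system-level t_s is the maximum over all clusters.
    return max(a_bits * n_s / min(rates)
               for n_s, rates in zip(n_shared, member_rates))

def fl_round_delay(t_down, t_update, t_up):
    # Formula (3): one FL round waits for the straggler, i.e. the maximum
    # over devices of downlink plus local update plus uplink delay.
    return max(d + u + up for d, u, up in zip(t_down, t_update, t_up))

def round_energy(e_comp, e_uplink):
    # Per-device energy of one round: local computation plus uplink part.
    return [c + u for c, u in zip(e_comp, e_uplink)]

# toy example: 2 clusters sharing 200 and 150 samples of 8000 bits each,
# and 3 devices with per-round delays (seconds) and energies (joules)
print(sharing_delay(8e3, [200, 150], [[2e6, 1.5e6], [3e6]]))
print(fl_round_delay([0.10, 0.12, 0.09], [0.80, 1.10, 0.95], [0.30, 0.25, 0.40]))
print(round_energy([1.2, 1.5, 0.9], [0.4, 0.3, 0.5]))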
A proper clustered data sharing method is critical to federal learning. First, embodiments of the present invention formulate an optimization problem that minimizes the delays from data sharing and FL training while keeping the model accuracy as high as possible. The influence of the clustered data sharing method on this problem is then discussed in detail, and the problem is divided into two sub-problems that are solved respectively.
The problem of minimizing training delay can be expressed as equation (16):
min_{M, C, N_S, f}  t_s + Ω · t_FL    (16)
the constraint conditions include:
(1) θ ≥ θ_th;
(2) C_m ∩ C_{m'} = ∅, ∀m ≠ m' ∈ M;
(3) 0 ≤ N_S^m ≤ n_m, m ∈ M;
(4) e_{m,c} ≥ e_th, ∀m ∈ M, c ∈ C_m;
(5) t_s ≤ T_th;
(6) γ_k ≤ γ_th, k ∈ K;
(7)0≤f≤f max
wherein C = { C_1, …, C_m, … } is the set of the intra-cluster member device sets, N_S = { N_S^1, …, N_S^M } is the set of data amounts shared by the cluster head devices, and Ω is the number of communication rounds required by the federal learning system to reach the target accuracy.
Constraint (1) characterizes that the target accuracy θ of FL is required to reach the accuracy threshold θ_th; constraint (2) is the clustering constraint; constraint (3) is used to limit the upper bound of the shared data; constraint (4) characterizes the privacy requirements among users; with the maximum limit of the transmission delay set to T_th, constraint (5) is used to maintain an appropriate sharing preparation time; constraint (6) is the energy consumption requirement for a user to participate in one round of FL training; constraint (7) is the requirement on the CPU operating frequency of the distributed devices.
Since the optimization objective is a coupled problem, it involves both the amount of shared data and the computing power of the devices, and it also depends on the clustering scheme. However, once the cluster head devices and the intra-cluster member devices are selected, the original problem can be easily solved.
The invention breaks down the original problem into two sub-problems: 1) How to select a cluster head device and an intra-cluster member device; 2) How to optimize the amount of shared data and CPU frequency.
A. Sub-problem of determining the clustering:
It can be derived from experiments and analysis that a clustered data sharing method that reduces the communication delay and improves the model accuracy is equivalent to minimizing D'_EMD. Therefore, through the joint design of the data sharing and clustering strategies, the invention designs an optimization framework that minimizes the system EMD while ensuring user privacy and low communication cost:

min_{M, C}  D'_EMD
The constraint conditions are as follows:
(1) C_m ∩ C_{m'} = ∅, ∀m ≠ m' ∈ M;
(2) e_{m,c} ≥ e_th, ∀m ∈ M, c ∈ C_m;
(3) t_s ≤ T_th.
Solving this optimization problem has the following difficulties. First, the joint decisions on clustering and on the shared data volumes are coupled together, and the relationship between the variables and the objective is agnostic because there is no explicit expression for it. Second, clustering and cluster head device selection constitute an NP-hard problem that is difficult to solve, and existing algorithms rely on heuristics. In addition, the formation of clusters is also subject to privacy protection and transmission efficiency limitations, which further complicates the problem. To address these issues, the constraints may be temporarily ignored while the impact of the clustering strategy on the optimization objective is considered. The invention designs the following three conditions and analyzes the objective benefits obtained by the clustering algorithm from the individual, intra-cluster and inter-cluster angles.
Condition 1 (individual angle): after data sharing, the EMD value of any user is as small as possible, i.e.,
min D'_EMD(k), ∀k ∈ K
Without considering any constraint, condition 1 is equivalent to the original optimization problem. It is still difficult to satisfy this condition directly, and thus the target can be translated into the following two extended conditions 2 and 3.
condition 2 (intra cluster angle): within one cluster, if optimal data sharing effect is to be ensured, EMDs of the cluster head device and the intra-cluster member devices need to be as different as possible, that is,
D_EMD(m) ≤ D_EMD(c), ∀c ∈ C_m
The EMD values of the intra-cluster member devices are known, so a distributed device with as small an EMD as possible should be selected as the cluster head device to maximize the sharing effect. In short, the higher the data quality of a distributed device, the greater its likelihood of being selected as a cluster head device. While condition 2 ensures the sharing benefit within a cluster, clustering still cannot be accomplished because the cluster boundaries, i.e. how many cluster head devices should be selected, remain uncertain; therefore condition 3 is required to define the inter-cluster relationship.
Condition 3 (inter-cluster angle): given the clustering results M and C, an ideal clustering result should satisfy, for any member c ∈ C_m, that the distribution distance to its current cluster head device m is greater than that to any other cluster head device m':

D_EMD(c, m) ≥ D_EMD(c, m'), ∀m' ∈ M, m' ≠ m

where D_EMD(c, m) is the EMD between the intra-cluster member device and the cluster head device. For critical nodes that may belong to multiple clusters, this condition can be taken as a separation criterion: when the data distribution of a critical node differs greatly from that of a cluster head device, the critical node is added into that cluster. Through the analysis of the two conditions, the following theorem holds:
Theorem 1: without consideration of the constraints, if M* and C* are the optimal solution to the optimization problem, then M* and C* satisfy both condition 2 and condition 3.
However, because of the constraints, some nodes may have no connection, so that theorem 1 cannot be used directly. At the same time, the optimization target is also influenced when the shared data quantity is taken into account, so the rate threshold v_th is set to avoid transmitting too little data.
Considering the above constraints, a graph G_pc = (K, E_pc) (privacy communication constraint graph) can be reconstructed, where the edge set E_pc = { (k, j) : e_{k,j} ≥ e_th and v_{k,j} ≥ v_th }. Applying condition 2 and condition 3 to the graph G_pc, an adaptive clustering algorithm based on the data distribution is provided. The specific steps of the algorithm can be divided into two parts, namely a cluster head device selection process and a cluster member device connection process.
For cluster head device selection, D_EMD(k) is calculated for each node and the nodes are sorted; cluster head nodes are then selected in descending order until all nodes are covered.
For the connection of cluster member devices, each unassigned node is repeatedly assigned to the cluster head device with which it has the largest edge value D_EMD(k, m), until all nodes have been assigned. The algorithm can adaptively determine the number of clusters and has low complexity. Furthermore, the algorithm is applicable to all existing non-IID FL algorithms.
Input: the privacy constraint graph G_p and the constraint thresholds e_th, v_th, T_th; Output: M and C.
The specific flow is as follows:
step 1, for all the distributed devices in K, respectively calculate:
(1) the transmission rate v_{k,j} between adjacent distributed devices;
(2) the EMD distance D_EMD(k, j) between the data distributions of adjacent distributed devices;
(3) the EMD distance D_EMD(k) between the device's own data distribution and the global data distribution.
Step 2, establishing a privacy communication constraint graph
Figure BDA0003989198410000261
wherein ,/>
Figure BDA0003989198410000262
Figure BDA0003989198410000263
Step 3, according to D EMD (k) Selecting cluster head equipment M in descending order of values so that all nodes can be covered by the cluster head equipment;
step 4, from the unassigned nodes, select the node having the largest edge value D_EMD(k, m) and assign it to the cluster of cluster head device m;
and step 5, repeat step 4 until all the distributed devices are clustered.
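A compact Python sketch of steps 1 to 5 is given below under simplifying assumptions: the pairwise affinities, transmission rates and EMD distances are supplied as precomputed matrices, and adaptive_clustering is a hypothetical name; it illustrates the selection and connection rules rather than being the definitive claimed algorithm.

import numpy as np

def adaptive_clustering(d_emd, d_emd_pair, affinity, rate, e_th, v_th):
    # Build the privacy communication constraint graph, pick cluster heads in
    # descending order of D_EMD(k) until every node is covered, then attach
    # each remaining node to the reachable head whose data distribution
    # differs most from its own (largest pairwise EMD).
    K = len(d_emd)
    # step 2: an edge (k, j) survives only if both thresholds are met
    adj = (np.asarray(affinity) >= e_th) & (np.asarray(rate) >= v_th)
    np.fill_diagonal(adj, False)

    # step 3: cluster head selection in descending order of D_EMD(k)
    heads = []
    for k in np.argsort(-np.asarray(d_emd)):
        covered = set(heads) | {int(j) for m in heads for j in np.flatnonzero(adj[m])}
        if len(covered) == K:
            break
        heads.append(int(k))

    # steps 4-5: assign every non-head node to a covering head
    clusters = {m: [] for m in heads}
    for k in range(K):
        if k in clusters:
            continue
        candidates = [m for m in heads if adj[m, k]]
        if candidates:
            clusters[max(candidates, key=lambda m: d_emd_pair[m][k])].append(k)
    return clusters

# toy example with 5 devices and symmetric random affinity/rate/EMD matrices
rng = np.random.default_rng(0)
aff = rng.uniform(0, 1, (5, 5)); aff = (aff + aff.T) / 2
vel = rng.uniform(0, 10, (5, 5)); vel = (vel + vel.T) / 2
pair = rng.uniform(0, 2, (5, 5)); pair = (pair + pair.T) / 2
print(adaptive_clustering([0.9, 0.2, 1.4, 0.7, 0.3], pair, aff, vel, 0.3, 2.0))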
B. Sub-problem of joint optimization of data volume and frequency
After the clustering result is determined, the original problem reduces to

min_{N_S, f}  t_s + Ω · t_FL

which can then be solved by the gradient descent method, the interior point method, the KKT conditions or similar techniques to obtain N_S and f.
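Since the reduced problem only involves N_S and f, even a brute-force scan conveys the idea; the sketch below assumes toy stand-in models for the delay, round-count and energy terms, and joint_search is a hypothetical helper rather than the gradient descent, interior point or KKT solver mentioned above.

import numpy as np

def joint_search(n_s_grid, f_grid, round_delay_fn, rounds_fn, sharing_delay_fn,
                 energy_fn, energy_budget):
    # Scan candidate shared-data amounts N_S and CPU frequencies f and keep
    # the pair minimizing t_s + Omega * t_FL under the energy budget.
    best, best_cost = None, np.inf
    for n_s in n_s_grid:
        for f in f_grid:
            if energy_fn(n_s, f) > energy_budget:
                continue                         # energy constraint violated
            cost = sharing_delay_fn(n_s) + rounds_fn(n_s) * round_delay_fn(n_s, f)
            if cost < best_cost:
                best, best_cost = (n_s, f), cost
    return best, best_cost

# toy stand-ins for the delay, round-count and energy models
best, cost = joint_search(
    n_s_grid=np.arange(0, 501, 50),
    f_grid=np.linspace(0.5e9, 2.0e9, 7),
    round_delay_fn=lambda n_s, f: 0.2 + (1e7 + 20.0 * n_s) / f,
    rounds_fn=lambda n_s: max(20, 100 - 0.1 * n_s),
    sharing_delay_fn=lambda n_s: 8e3 * n_s / 2e6,
    energy_fn=lambda n_s, f: 1e-27 * (1e7 + 20.0 * n_s) * f**2 + 0.5,
    energy_budget=5.0)
print(best, cost)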
2. The embodiment of the invention provides a clustered data sharing method based on a federal learning system, which, by quantifying the degree of heterogeneity of the data distribution in federal learning, provides a federal learning framework with clustered data sharing for estimating and eliminating the influence of distribution deviation on federal learning and on model precision.
Specifically, in the federal learning framework based on clustered data sharing: in the first step, clustered data sharing is performed, the users in the system are clustered according to the designed clustering algorithm, and the selected cluster head devices share part of their data to the intra-cluster member devices; in the second step, the base station broadcasts the model, selects the devices participating in this round of FL training, and broadcasts the global model to all the distributed devices; in the third step, the local devices update the model, training it with the data sets on the distributed devices after receiving the global model; in the fourth step, the model is uploaded, the distributed devices uploading the updated models to the base station through the wireless communication network; in the fifth step, the base station aggregates and synchronizes the model, performing weighted aggregation on all the received local models and feeding the result back to the distributed devices for the next training round. The second to fifth steps are repeated until the FL training is completed.
3. The embodiment of the invention provides a clustering algorithm for minimizing the degree of heterogeneity of the system data: an optimization problem of minimizing the distribution distance under privacy and communication constraints is established, and the optimization target is analyzed from the individual, intra-cluster and inter-cluster angles, thereby designing a novel clustering algorithm.
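A skeleton of the five-step training flow summarized above, with scalar stand-ins for the model parameters, is sketched below; federated_training, the dict-based device records and the toy local_update rule are all illustrative assumptions rather than the claimed implementation.

import numpy as np

def federated_training(devices, clusters, init_model, local_update,
                       share_data, num_rounds):
    # Step 1: each cluster head multicasts part of its data to its members.
    for head, members in clusters.items():
        share_data(devices[head], [devices[c] for c in members])

    model = init_model
    for _ in range(num_rounds):
        # Steps 2-4: broadcast the global model, update locally, upload.
        local_models = [local_update(dev, model) for dev in devices]
        # Step 5: sample-size-weighted aggregation at the base station.
        sizes = np.array([dev["n"] for dev in devices], dtype=float)
        weights = sizes / sizes.sum()
        model = sum(w * m for w, m in zip(weights, local_models))
    return model

# toy usage: scalar "models", three devices, one cluster {head 1 -> members 0, 2}
devs = [{"n": 100, "data": 1.0}, {"n": 200, "data": 2.0}, {"n": 50, "data": 4.0}]
out = federated_training(
    devices=devs,
    clusters={1: [0, 2]},
    init_model=0.0,
    local_update=lambda dev, m: 0.5 * m + 0.5 * dev["data"],
    share_data=lambda head, members: None,   # data exchange stubbed out
    num_rounds=5)
print(out)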
The clustered data sharing device based on the federal learning system provided by the invention is described below, and the clustered data sharing device based on the federal learning system described below and the clustered data sharing method based on the federal learning system described above can be correspondingly referred to each other.
The federal learning system according to an embodiment of the present invention includes K distributed devices and a central server, K is an integer greater than 1, and fig. 5 is a schematic structural diagram of a clustered data sharing device based on the federal learning system provided by the present invention, as shown in fig. 5, a clustered data sharing device 500 based on the federal learning system includes:
a clustering module 501, configured to divide K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
A control module 502, configured to control a cluster head device in each cluster to share training data to an intra-cluster member device;
the federal learning training module 503 is configured to obtain a target model after federal learning training by performing collaborative iterative training on training data of each of the distributed devices and the central server based on a preset federal learning algorithm.
According to the clustered data sharing device based on the federal learning system, the clustering module divides the K distributed devices of the federal learning system into M clusters based on a preset clustering algorithm, so that the control module can control the cluster head device in each cluster to share training data to the intra-cluster member devices, which reduces the degree of data heterogeneity among the training data of the distributed devices; the federal learning training module then trains a preset initial model through collaborative iteration of the training data of each distributed device and the central server based on the federal learning algorithm, obtaining the target model after federal learning training. Compared with the model training process of federal learning in the related art, the method and the device of the embodiments cluster the distributed devices so that the cluster head devices share training data to the intra-cluster member devices, thereby reducing the degree of data heterogeneity, improving the convergence speed of the subsequent federal learning algorithm, reducing the communication overhead of federal learning training, and effectively improving the accuracy of the finally trained target model.
Optionally, the clustering module 501 is specifically configured to:
establishing privacy constraint graphs of the K distributed devices; the privacy constraint graph comprises K nodes corresponding to the K distributed devices and edges used for connecting the nodes;
calculating the affinity relation value between each distributed device and other distributed devices in the K distributed devices, and taking the affinity relation value as a first attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; the affinity relation value is used for representing the trust degree between the distributed devices;
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph to obtain a privacy communication constraint graph;
calculating the in-system earth mover's distance (EMD) corresponding to the K distributed devices, and taking the EMD as the attribute values of the K nodes; wherein the in-system EMD is used for representing the difference between the distribution of the training data of each distributed device and the distribution of the global data of the K distributed devices;
in the privacy communication constraint graph, selecting corresponding nodes as cluster head nodes according to the sequence from the large attribute value to the small attribute value of the nodes until edges exist between other nodes and at least one cluster head node, so as to obtain M cluster head nodes; wherein the other nodes are nodes except the cluster head node in the K nodes;
Calculating inter-device EMD between each distributed device and other distributed devices in the K distributed devices as a second attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; wherein the inter-device EMD is used to characterize differences in distribution of training data between the distributed devices;
dividing the other nodes into clusters where the cluster head nodes with edges exist between the other nodes and the other nodes under the condition that the edges exist between the other nodes and only one cluster head node;
dividing the other nodes into clusters where the cluster head nodes with the largest second attribute values corresponding to the edges between the other nodes are located under the condition that the edges exist between the other nodes and at least two cluster head nodes;
and taking the distributed equipment corresponding to the M cluster head nodes as cluster head equipment of the M clusters, and taking the distributed equipment corresponding to other nodes in the cluster where each cluster head node is located as intra-cluster member equipment of the M clusters.
Optionally, the clustered data sharing apparatus 500 based on the federal learning system further includes:
the processing module is used for calculating the data transmission rate between each distributed device and the other distributed devices and taking the data transmission rate as a third attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices;
Clustering module 501 is also specifically configured to:
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph, and deleting edges corresponding to the third attribute values smaller than the communication rate threshold to obtain the privacy communication constraint graph.
Optionally, the processing module is further configured to:
calculating the sharing delay t_s for the cluster head devices to share training data to the intra-cluster member devices, and the training delay t_FL for performing one round of model training based on the federal learning algorithm;
based on t_s and t_FL, calculating, by adopting formula (1), the target shared data amount N_S of each cluster head device and the target central processing unit (CPU) frequency f for each distributed device to train the model:

(N_S, f) = argmin_{N_S, f} [ t_s + Ω(·) · t_FL ]    (1)
wherein Ω () represents the number of iterations of the federal learning system training model;
the control module 502 is specifically configured to control each of the cluster head devices to share training data of data amount N_S to the intra-cluster member devices;
the federal learning training module 503 is specifically configured to: based on the federation learning algorithm, f is used as the CPU frequency of each distributed equipment training model, and a preset initial model is trained through collaborative iteration of training data of each distributed equipment and the central server to obtain a target model after federation learning training.
Optionally, the processing module is further specifically configured to:
calculating the sharing delay t_s by adopting formula (2):

t_s = max_{m∈M} max_{c∈C_m}  a · N_S^m / v_{m,c}    (2)

wherein M characterizes the set of cluster head devices, m characterizes the m-th cluster head device in the set of cluster head devices, C_m characterizes the set of intra-cluster member devices in the cluster where the m-th cluster head device is located, c characterizes the c-th intra-cluster member device in the set of intra-cluster member devices, a characterizes the number of bits occupied by one sample of training data, N_S^m characterizes the data amount of training data shared by the m-th cluster head device to the intra-cluster member devices, and v_{m,c} characterizes the data transmission rate between the m-th cluster head device and the c-th intra-cluster member device;
calculating the training delay t_FL by adopting formula (3):

t_FL = max_{k∈K} ( t_k^down + t_k^update + t_k^up )    (3)

wherein t_k^down characterizes the downlink delay, t_k^update characterizes the update delay, and t_k^up characterizes the uplink delay.
Optionally, the control module 502 is further specifically configured to: and controlling cluster head equipment in each cluster to share training data with the member equipment in the cluster in a device-to-device (D2D) multicast mode.
Fig. 6 is a schematic structural diagram of an electronic device provided by the present invention, and as shown in fig. 6, the electronic device 600 may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a clustered data sharing method based on a federated learning system comprising K distributed devices and a central server, K being an integer greater than 1;
The method comprises the following steps: dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
controlling cluster head equipment in each cluster to share training data to member equipment in the cluster;
based on a preset federal learning algorithm, a preset initial model is trained through cooperative iteration of training data of each distributed device and the central server, and a target model after federal learning training is obtained.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, where the computer program, when executed by a processor, can perform the clustered data sharing method based on a federal learning system provided by the methods above, where the federal learning system includes K distributed devices and a central server, and K is an integer greater than 1;
the method comprises the following steps: dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
controlling cluster head equipment in each cluster to share training data to member equipment in the cluster;
based on a preset federal learning algorithm, a preset initial model is trained through cooperative iteration of training data of each distributed device and the central server, and a target model after federal learning training is obtained.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, is implemented to perform the clustered data sharing method based on a federal learning system provided by the above methods, the federal learning system including K distributed devices and a central server, K being an integer greater than 1;
The method comprises the following steps: dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
controlling cluster head equipment in each cluster to share training data to member equipment in the cluster;
based on a preset federal learning algorithm, a preset initial model is trained through cooperative iteration of training data of each distributed device and the central server, and a target model after federal learning training is obtained.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The clustered data sharing method based on the federal learning system is characterized in that the federal learning system comprises K distributed devices and a central server, wherein K is an integer greater than 1;
the method comprises the following steps:
dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
controlling cluster head equipment in each cluster to share training data to member equipment in the cluster;
based on a preset federal learning algorithm, a preset initial model is trained through cooperative iteration of training data of each distributed device and the central server, and a target model after federal learning training is obtained.
2. The clustered data sharing method based on the federal learning system according to claim 1, wherein the partitioning K distributed devices into M clusters based on a preset clustering algorithm includes:
establishing privacy constraint graphs of the K distributed devices; the privacy constraint graph comprises K nodes corresponding to the K distributed devices and edges used for connecting the nodes;
calculating the affinity relation value between each distributed device and other distributed devices in the K distributed devices, and taking the affinity relation value as a first attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; the affinity relation value is used for representing the trust degree between the distributed devices;
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph to obtain a privacy communication constraint graph;
calculating the in-system earth mover's distance (EMD) corresponding to the K distributed devices, and taking the EMD as the attribute values of the K nodes; wherein the in-system EMD is used for representing the difference between the distribution of the training data of each distributed device and the distribution of the global data of the K distributed devices;
In the privacy communication constraint graph, selecting corresponding nodes as cluster head nodes according to the sequence from the large attribute value to the small attribute value of the nodes until edges exist between other nodes and at least one cluster head node, so as to obtain M cluster head nodes; wherein the other nodes are nodes except the cluster head node in the K nodes;
calculating inter-device EMD between each distributed device and other distributed devices in the K distributed devices as a second attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices; wherein the inter-device EMD is used to characterize differences in distribution of training data between the distributed devices;
dividing the other nodes into clusters where the cluster head nodes with edges exist between the other nodes and the other nodes under the condition that the edges exist between the other nodes and only one cluster head node;
dividing the other nodes into clusters where the cluster head nodes with the largest second attribute values corresponding to the edges between the other nodes are located under the condition that the edges exist between the other nodes and at least two cluster head nodes;
and taking the distributed equipment corresponding to the M cluster head nodes as cluster head equipment of the M clusters, and taking the distributed equipment corresponding to other nodes in the cluster where each cluster head node is located as intra-cluster member equipment of the M clusters.
3. The clustered data sharing method based on the federal learning system according to claim 2, wherein before deleting the edge corresponding to the first attribute value smaller than the privacy degree threshold in the privacy constraint map to obtain the privacy communication constraint map, the method further comprises:
calculating the data transmission rate between each distributed device and the other distributed devices, and taking the data transmission rate as a third attribute value of an edge between a node corresponding to each distributed device and a node corresponding to the other distributed devices;
deleting the edge corresponding to the first attribute value smaller than the privacy degree threshold in the privacy constraint map to obtain a privacy communication constraint map, wherein the method comprises the following steps:
deleting edges corresponding to the first attribute values smaller than the privacy degree threshold in the privacy constraint graph, and deleting edges corresponding to the third attribute values smaller than the communication rate threshold to obtain the privacy communication constraint graph.
4. The clustered data sharing method based on the federal learning system according to claim 1, wherein before controlling the cluster head devices in each of the clusters to share training data to the intra-cluster member devices, the method further comprises:
calculating the sharing delay t_s for each of the cluster head devices to share training data to the intra-cluster member devices, and the training delay t_FL for performing one round of model training based on the federal learning algorithm;
based on t_s and t_FL, calculating, by adopting formula (1), the target shared data amount N_S of each cluster head device and the target central processing unit (CPU) frequency f for each of the distributed devices to train the model:

(N_S, f) = argmin_{N_S, f} [ t_s + Ω(·) · t_FL ]    (1)
wherein Ω () represents the number of iterations of the federal learning system training model;
the controlling the cluster head devices in each cluster to share training data to the member devices in the cluster includes:
controlling each cluster head device to share training data of data quantity N_S to the intra-cluster member devices;
the training data of each distributed device and the central server are used for collaborative iterative training of a preset initial model based on a preset federal learning algorithm to obtain a target model after federal learning training, and the method comprises the following steps:
based on the federation learning algorithm, f is used as the CPU frequency of each distributed equipment training model, and a preset initial model is trained through collaborative iteration of training data of each distributed equipment and the central server to obtain a target model after federation learning training.
5. The clustered data sharing method based on the federal learning system according to claim 4, wherein the calculating of the sharing delay t_s for each of the cluster head devices to share training data to the intra-cluster member devices and of the training delay t_FL for performing one round of model training based on the federal learning algorithm comprises:
calculating the sharing delay t_s by adopting formula (2):

t_s = max_{m∈M} max_{c∈C_m}  a · N_S^m / v_{m,c}    (2)

wherein M characterizes the set of cluster head devices, m characterizes the m-th cluster head device in the set of cluster head devices, C_m characterizes the set of intra-cluster member devices in the cluster where the m-th cluster head device is located, c characterizes the c-th intra-cluster member device in the set of intra-cluster member devices, a characterizes the number of bits occupied by one sample of training data, N_S^m characterizes the data amount of training data shared by the m-th cluster head device to the intra-cluster member devices, and v_{m,c} characterizes the data transmission rate between the m-th cluster head device and the c-th intra-cluster member device;
calculating the training delay t_FL by adopting formula (3):

t_FL = max_{k∈K} ( t_k^down + t_k^update + t_k^up )    (3)

wherein t_k^down characterizes the downlink delay, t_k^update characterizes the update delay, and t_k^up characterizes the uplink delay.
6. The clustered data sharing method based on the federal learning system according to any one of claims 1 to 5, wherein the controlling the cluster head devices in each of the clusters to share training data to the intra-cluster member devices includes:
and controlling cluster head equipment in each cluster to share training data with the member equipment in the cluster in a device-to-device (D2D) multicast mode.
7. The clustered data sharing device based on the federal learning system is characterized in that the federal learning system comprises K distributed devices and a central server, wherein K is an integer greater than 1;
the device comprises:
the clustering module is used for dividing K distributed devices into M clusters based on a preset clustering algorithm; wherein M is an integer smaller than K, and at least one cluster comprising cluster head equipment and intra-cluster member equipment exists in the M clusters;
the control module is used for controlling the cluster head equipment in each cluster to share training data to the member equipment in the cluster;
the federation learning training module is used for obtaining a target model after federation learning training by cooperatively and iteratively training a preset initial model through training data of each distributed device and the central server based on a preset federation learning algorithm.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the clustered data sharing method based on the federal learning system as claimed in any one of claims 1 to 6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the clustered data sharing method based on the federal learning system according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements a clustered data sharing method based on a federal learning system as claimed in any one of claims 1 to 6.
CN202211575350.1A 2022-12-08 2022-12-08 Clustered data sharing method and device based on federal learning system and storage medium Pending CN116233954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211575350.1A CN116233954A (en) 2022-12-08 2022-12-08 Clustered data sharing method and device based on federal learning system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211575350.1A CN116233954A (en) 2022-12-08 2022-12-08 Clustered data sharing method and device based on federal learning system and storage medium

Publications (1)

Publication Number Publication Date
CN116233954A true CN116233954A (en) 2023-06-06

Family

ID=86568647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211575350.1A Pending CN116233954A (en) 2022-12-08 2022-12-08 Clustered data sharing method and device based on federal learning system and storage medium

Country Status (1)

Country Link
CN (1) CN116233954A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591886A (en) * 2024-01-15 2024-02-23 杭州海康威视数字技术股份有限公司 Method and device for quickly aggregating heterogeneous data based on cluster federation learning
CN117591886B (en) * 2024-01-15 2024-04-05 杭州海康威视数字技术股份有限公司 Method and device for quickly aggregating heterogeneous data based on cluster federation learning
CN117808128A (en) * 2024-02-29 2024-04-02 浪潮电子信息产业股份有限公司 Image processing method, federal learning method and device under heterogeneous data condition
CN117808127A (en) * 2024-02-29 2024-04-02 浪潮电子信息产业股份有限公司 Image processing method, federal learning method and device under heterogeneous data condition
CN117834297A (en) * 2024-02-29 2024-04-05 浪潮电子信息产业股份有限公司 Attack detection method, device, system, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
Liu et al. Cooperative offloading and resource management for UAV-enabled mobile edge computing in power IoT system
Tang et al. Computational intelligence and deep learning for next-generation edge-enabled industrial IoT
Luo et al. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning
CN116233954A (en) Clustered data sharing method and device based on federal learning system and storage medium
CN112070240B (en) Layered federal learning framework for efficient communication and optimization method and system thereof
Xue et al. Joint task offloading and resource allocation for multi-task multi-server NOMA-MEC networks
Wu et al. Task offloading for vehicular edge computing with imperfect CSI: A deep reinforcement approach
CN112105062B (en) Mobile edge computing network energy consumption minimization strategy method under time-sensitive condition
Wang et al. Content-based vehicle selection and resource allocation for federated learning in IoV
Tang et al. Systematic resource allocation in cloud RAN with caching as a service under two timescales
CN114051254B (en) Green cloud edge collaborative computing unloading method based on star-ground fusion network
WO2013170635A1 (en) Systems and methods facilitating joint channel and routing assignment for wireless mesh networks
Nguyen et al. FedFog: Network-aware optimization of federated learning over wireless fog-cloud systems
Song et al. Wireless distributed edge learning: How many edge devices do we need?
CN114357676A (en) Aggregation frequency control method for hierarchical model training framework
Ruby et al. Energy-efficient multiprocessor-based computation and communication resource allocation in two-tier federated learning networks
Kuang et al. Client selection with bandwidth allocation in federated learning
Sun et al. Coded computation across shared heterogeneous workers with communication delay
Wu et al. Delay-aware edge-terminal collaboration in green Internet of Vehicles: A multi-agent soft actor-critic approach
Mestoukirdi et al. User-centric federated learning
You et al. Semi-synchronous personalized federated learning over mobile edge networks
Zhai et al. Fedleo: An offloading-assisted decentralized federated learning framework for low earth orbit satellite networks
Li et al. Energy-constrained D2D assisted federated learning in edge computing
Liu et al. Ensemble distillation based adaptive quantization for supporting federated learning in wireless networks
Hu et al. Clustered Data Sharing for Non-IID Federated Learning over Wireless Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination