CN113627540A - Data set construction system and method for non-IID federated learning


Info

Publication number
CN113627540A
Authority
CN
China
Prior art keywords
state sequence, data set, initial, module, state
Legal status
Pending
Application number
CN202110928436.7A
Other languages
Chinese (zh)
Inventors
Li Kan
Li Yang
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202110928436.7A
Publication of CN113627540A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention provides a data set construction system and method for non-IID (not independent and identically distributed) federated learning. The system comprises an initial module, a selection module, and a state sequence group extraction module. The initial module receives a data set and generates an initial state sequence according to an initial probability distribution matrix and an initial probability transition matrix; the selection module receives the initial state sequence group and generates a data set for non-IID federated learning; the state sequence group extraction module generates a state sequence group from a received state sequence and group number. With this method, a real data set is divided into a number of smaller subsets to form a distributed non-IID data set, so that different numbers of participants are easy to configure, the balance of local data is convenient to quantify, and non-IID scenarios can be constructed for research.

Description

Data set construction system and method for non-IID federated learning
Technical Field
The invention relates to the field of computer technology, and in particular to a data set construction system and method for non-independent, identically distributed (non-IID) federated learning.
Background
Federated learning provides a modeling method that guarantees data security while addressing the data-silo problem and protecting privacy. According to whether participants share the same features across different users, the same users across different features, or differ in both users and features, it is divided into horizontal federated learning, vertical federated learning, and federated transfer learning. It helps participants cooperate to accomplish an overall objective and is regarded as a technology with great application prospects. For example, Google's Gboard project, based on federated learning, applies the federated averaging algorithm on mobile handsets and monitors statistical data across large fleets of devices.
In general, the training data for federated learning are assumed to be independent and identically distributed (IID), and common algorithms such as neural networks and deep learning are studied under this assumption. However, in application scenarios that do not satisfy the IID assumption, the trained model has low accuracy and the global model cannot converge, because the local data set of any particular user cannot represent the overall distribution. With business integration across industries, more and more scenarios involve substantial overlap of user features. Taking intelligent retail as an example, machine learning is used to bring high-quality product recommendation and sales services to users; the data features involved include user purchasing power, personal preference, and product characteristics, and in real life these three kinds of features may be scattered across three or more different departments or enterprises. For example, banks hold users' purchasing features, shopping platforms and social networks hold customer preference features, and shopping websites hold product features. Owing to user-privacy protection and regulatory supervision, data scattered across different locations cannot simply be aggregated and modeled on one side. In the telecommunications field, operators deploy a large number of intelligent devices in the network that generate large amounts of data at every moment; the data are stored locally and need not be uploaded to a central server, but the data volume and feature distribution collected by each device differ greatly, and the generalization ability of the trained model is poor.
The federated learning architecture comprises a central server and K dispersed clients, whose data sets are D_1, …, D_K; the goal is to train a global model together. The central server issues a modeling task and seeks participating clients, which respond to the joint modeling according to their own needs; the server then sends initial parameters w_t to each participating client. Each client trains the received parameters on its local data set and uploads the resulting weights to the central server, which securely aggregates the uploaded updates; these steps constitute one update of the global model. The aggregated parameters are distributed to the clients again and the next round of local computation begins, iterating in this way until the model converges and the joint modeling ends. Throughout the process, each participant's data stay local during training; all participating clients model jointly, neither the clients nor the server is authorized to access or control other participants' information, and no data resources need to be shared, yet the goal of training the overall model is achieved. Currently, federated learning is applied to image classification, linear regression, and the like.
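To make the loop concrete, the following is a minimal Python sketch of one such federated-averaging process; the helper names (local_train, fedavg), the squared-loss local objective, and the size-weighted aggregation are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def local_train(w, X, y, lr=0.1, epochs=1):
    """One client's local update: a few gradient steps on squared loss
    (loss and optimizer are illustrative assumptions)."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def fedavg(w, clients, rounds=10):
    """clients: list of local data sets (X_k, y_k), i.e. D_1, ..., D_K."""
    for _ in range(rounds):
        updates = [local_train(w, Xk, yk) for Xk, yk in clients]
        sizes = np.array([len(Xk) for Xk, _ in clients], dtype=float)
        # the server aggregates the uploaded weights (size-weighted average)
        w = sum((n / sizes.sum()) * wk for n, wk in zip(sizes, updates))
    return w
```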
In the experimental stage of federated learning, sampling the training data independently and identically is important to ensure that the stochastic gradient is an unbiased estimate of the full gradient. In other words, having IID data at the clients means that each mini-batch used for a client's local update is statistically identical to a sample drawn uniformly from the entire training data set (i.e., the union of all local data sets at the clients). In practice, it is unrealistic to assume that the local data on each edge device are always IID. More specifically:
Violation of independence: if the order in which the data are processed is not sufficiently random (e.g., ordered by device or by time), independence is violated.
Violation of identical distribution: because devices are tied to particular geographic regions, the label distribution varies across regions; furthermore, different devices (partitions) may hold very different amounts of data.
Therefore:
the data of each node is generated by a distinct distribution;
the amount of data on each node may also vary greatly;
and there may be an underlying structure capturing the relationships between nodes and their relative distributions.
Most empirical studies on synthesizing non-IID data sets have focused on label distribution skew, i.e., forming a non-IID data set by partitioning an existing "flat" data set according to its labels.
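As an illustration of this common construction (not the patent's own method), a label-skew partition can be sketched as follows, with each client drawing samples from only a few label classes; all names here are hypothetical.

```python
import numpy as np

def label_skew_partition(labels, n_clients, classes_per_client=2, seed=0):
    """Give each client samples from only a few label classes -- the
    label-distribution-skew construction mentioned above (illustrative)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    parts = {k: [] for k in range(n_clients)}
    for k in range(n_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        for c in chosen:
            idx = np.flatnonzero(labels == c)
            take = rng.choice(idx, size=len(idx) // n_clients, replace=False)
            parts[k].extend(take.tolist())
    return parts
```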
It is also important that the distribution may change over time, which introduces another dimension of non-IID behavior.
Recent work has shown that most decentralized learning algorithms suffer from a loss of model quality (or even divergence) when run on non-IID data partitions. Although several solutions have been proposed to deal with highly skewed non-IID data (e.g., data sharing and model migration), none of them is satisfactory. For example, some existing work proposes heuristic-based methods that share local device data or create server-side proxy data. However, these methods may be impractical: besides burdening network bandwidth, sending local data to the server violates a key privacy assumption of federated learning, and sending globally shared proxy data to all devices requires the careful generation or collection of such auxiliary data.
Disclosure of Invention
In view of the defects of the prior art, and to meet the needs of the non-IID federated learning experimental process, the invention constructs non-IID data sets suited to federated learning, with multiple rules and cross-correlation between data, by adjusting the correlation probabilities among the data in a data set and by using a quantity adjuster to control the number of samples distributed to each client and their degree of balance.
To achieve the above object, the present invention adopts the following technical solutions.
A data set construction system for non-IID federated learning comprises an initial module, a selection module, and a state sequence group extraction module, wherein:
the initial module is used for receiving the data set and generating an initial state sequence according to the initial probability distribution matrix and the initial probability transition matrix;
the selection module is used for receiving the initial state sequence group and generating a data set for non-IID federated learning;
and the state sequence group extraction module is used for receiving the state sequences and the group numbers and generating the state sequence groups.
Further, in the initial module, the initial probability distribution matrix is π = (π_i), where π_i = p(i_j = q_i), j = 1, 2, …, T, with T denoting the number of samples contained in a sampled state sequence and p denoting the probability that i_j = q_i, 0 < p < 1;
and an initial state sequence is extracted according to the initial probability distribution matrix and the initial state transition probability matrix.
Further, in the initial module, the initial state sequence, the initial state transition probability matrix and the group number are transmitted to a state sequence group extraction module, then the state sequence group returned by the state sequence group extraction module is received as the initial state sequence group, and the initial state sequence group is transmitted to a selection module.
Further, in the initial module, the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
Further, in the state sequence group extraction module, a state sequence group {I_1, I_2, …, I_S} is extracted according to the received state sequence, transition probability matrix, and group number. The specific steps are as follows: for each state value i_j of the state sequence I_1 = {i_1, i_2, …, i_T}, obtain from the transition probability matrix the transition probability distribution {p_1j, …, p_Tj} corresponding to that state value, take the label corresponding to the maximum probability value, and store it in the corresponding position of the next state sequence, thereby obtaining the next fixed-length state sequence I_2; proceed in the same way until I_S is obtained, generating the state sequence group {I_1, I_2, …, I_S}, which is then returned.
Further, in the selection module, the steps of generating the data set for non-IID federated learning are as follows:
(1) set a threshold N for changing the group number, used to adjust the uniformity of the data;
(2) set a transition probability matrix A with entries p_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n, where p_ij denotes the transition probability and each column satisfies ∑_{i=1}^{T} p_ij = 1,
(the family of transition probability matrices A_1, …, A_m is shown as an image in the original publication) where m is W/N rounded up and W is the number of samples of the data set;
(3) if the number of state sequence groups Z lies in [(k-1)N, kN), 1 < k ≤ m, send the last sequence of the state sequence group, the transition probability matrix A_{k-1}, and the group number kN-Z to the state sequence group extraction module to generate a state sequence group {I_{Z+1}, I_{Z+2}, …, I_{kN}}, which together with the existing groups forms a new state sequence group {I_1, I_2, …, I_{kN}}; if k > m, stop;
(4) repeating the step (3) until the whole data set is traversed;
the finally formed state sequence group is the non-IID data set.
Further, the system comprises an allocation module, connected to the selection module, for allocating data samples to the federated learning nodes. First, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) corresponding to the nodes is set, with ∑_{k=1}^{K} C_k = 1, where K denotes the number of nodes. For even allocation, the final state sequence group is distributed to the K nodes in turn in the proportion 1/K, so that each node receives W/K samples. When the K nodes are allocated different proportions of the data samples, C = (C_1, C_2, C_3, …, C_K) with ∑_{k=1}^{K} C_k = 1, and the numbers of samples allocated to the nodes are C_1·W, C_2·W, C_3·W, …, C_K·W respectively: node 1 receives C_1·W samples, node 2 receives C_2·W samples, …, and node K receives C_K·W samples. (The per-node expressions appear as images in the original publication.)
further, the method comprises:
s1, the data set is transmitted to an initialization module, the data set is sampled according to the initial distribution probability matrix, and an initial state sequence group is generated according to the initial transition probability matrix;
and S2, transmitting the initial state sequence group and the transition probability matrix into the selection module, and generating the non-IID data set.
Further, the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
In the invention, a real data set is divided into a number of smaller subsets to form a distributed non-IID data set, so that different numbers of participants are easy to configure and the balance of local data is convenient to quantify.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic block diagram of a data set building system according to one embodiment of the present invention;
FIG. 2 is a flow diagram of a selection module according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides a data set construction system for non-IID federated learning comprising an initial module, a selection module, and a state sequence group extraction module, as shown in FIG. 1.
Initial module: receives a data set with W samples in total and converts it into a matrix in which each row represents one sample and each column represents a feature of the sample data; the features X = {x_0, x_1, …, x_m} and the label data Q = {q_0, q_1, q_2, …, q_n} form the complete training data (X, Q), as shown in Table 1.
Table 1. Training data matrix structure (the table is reproduced as an image in the original publication).
Set the initial state distribution probability matrix of the central node, π = (π_i), where π_i = p(i_j = q_i), j = 1, 2, …, T; T can be set manually and equals the number of samples contained in a state sequence, and p is the probability that i_j = q_i (0 < p < 1). Sampling according to the initial state distribution probability π_i = (p_1, …, p_n) yields an initial state i_j, i_j ∈ {q_1, q_2, …, q_n}; through its sequence number, the sample data corresponding to i_j can be found in the training data (this will not be repeated below), i.e., i_j can also stand for the corresponding sample data.
Set the initial state transition probability matrix A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, p_ij = p(i_{t+1} = q_j | i_t = q_i), where p_ij denotes the transition probability. The transition probability matrix controls the balance of the data sample distribution and the correlation among data; it may be concentrated or dispersed and can be set freely, but must satisfy that each column sums to 1, ∑_{i=1}^{T} p_ij = 1. According to the initial state transition probability matrix A_0, an initial state sequence of length T is obtained, I = {i_1, i_2, …, i_T}, i_j ∈ {q_1, q_2, …, q_n}, j = 0, 1, 2, …, T.
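As a rough sketch of this sampling step, the following assumes the transition matrix is an n×n row-stochastic matrix (a simplification of the T×n convention above) and uses illustrative names throughout.

```python
import numpy as np

def initial_state_sequence(pi, A0, labels, T=20, seed=0):
    """Sample an initial state sequence I = {i_1, ..., i_T}: i_1 from the
    initial distribution pi, then each next label from the transition
    probabilities conditioned on the current one.  A0 is treated here as an
    n x n row-stochastic matrix (a simplification of the patent's T x n
    convention); labels is the label set {q_1, ..., q_n}."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    states = [int(rng.choice(n, p=pi))]
    for _ in range(T - 1):
        states.append(int(rng.choice(n, p=A0[states[-1]])))
    return [labels[s] for s in states]
```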
In the initial module, the initial distribution state sequence, the initial state transition probability matrix and the group number are transmitted to a state sequence group extraction module, then the generated state sequence group is received as an initial state sequence group, and the initial state sequence group is transmitted to a selection module.
The state sequence group extraction module extracts a state sequence group {I_1, I_2, …, I_S} according to the received state sequence, the transition probability matrix A, and the group number S. The specific process is: for each value i_j of the state sequence I_1 = {i_1, i_2, …, i_T}, obtain its transition probability distribution {p_1j, …, p_Tj} from the transition probability matrix A_0, take the label corresponding to the maximum probability value, and store it in the corresponding position of the next state sequence, yielding the next fixed-length state sequence I_2; proceed in the same way until I_S is obtained, thereby generating the state sequence group {I_1, I_2, …, I_S}, which is then returned.
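A sketch of this extraction rule in code, under the simplifying assumption that the matrix is n×n with one column per state value; the function name is illustrative.

```python
import numpy as np

def extract_state_sequence_group(I1, A, labels, S):
    """Generate the group {I_1, ..., I_S} as described above: for each state
    value of the current sequence, look up its transition distribution (a
    column of A, treated here as an n x n matrix) and write the label with
    the largest probability into the same position of the next sequence."""
    col = {q: j for j, q in enumerate(labels)}
    group = [list(I1)]
    for _ in range(S - 1):
        group.append([labels[int(np.argmax(A[:, col[q]]))] for q in group[-1]])
    return group
```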
Selection module: receives the state sequence group sent by the initial module and, as shown in FIG. 2, divides the data set into the final state sequence group according to the following steps:
(1) Set a threshold N for replacing the group number; it is set manually and adjusts the uniformity of the data.
(2) Set a transition probability matrix A in the same way as A_0, with entries p_ij = p(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n, where p_ij denotes the transition probability; it can be set freely but must satisfy that each column sums to 1, ∑_{i=1}^{T} p_ij = 1.
(The family of transition probability matrices A_1, …, A_m is shown as an image in the original publication.) Here m is W/N rounded up.
(3) If the number of state sequence groups Z lies in [(k-1)N, kN), 1 < k ≤ m, send the last sequence of the state sequence group, the transition probability matrix A_{k-1}, and the group number kN-Z to the state sequence group extraction module to generate a state sequence group {I_{Z+1}, I_{Z+2}, …, I_{kN}}, which together with the existing groups forms a new state sequence group {I_1, I_2, …, I_{kN}}; if k > m, stop.
(4) Repeat step (3) until the whole data set has been traversed.
By the above method, the conditional probability correlation between labeled sample data is adjusted, i.e., the transition probability matrix A = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n. The idea of the whole process is that after every N state sequence groups are generated, the transition probability matrix is changed and N further groups are generated, finally yielding the state sequence group {I_1, I_2, …, I_{mN}}, which is the non-IID data set.
The role of the transition probability matrix is that once the state of the previous step of the sequence is determined, it determines the probability of the next state; that is, the previous state influences the next one, so the correlation among the data is continuously adjusted:
p_ij = p(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n
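The selection loop can then be sketched as follows, reusing the extract_state_sequence_group() sketch above and assuming the caller supplies the matrix family A_1, …, A_m, one matrix per batch of N groups.

```python
def selection_module(I1, matrices, N, labels):
    """Sketch of the selection loop: after every N state sequence groups the
    transition probability matrix is swapped (matrices supplies A_1, ..., A_m),
    so the correlation between samples keeps changing.  Builds on the
    extract_state_sequence_group() sketch above."""
    groups = [list(I1)]
    for A in matrices:                      # one matrix per batch of N groups
        batch = extract_state_sequence_group(groups[-1], A, labels, N + 1)
        groups.extend(batch[1:])            # drop the duplicated seed sequence
    return groups
```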
In the selection module, the final state sequence group {I_1, I_2, …, I_{mN}} can further be reordered; as shown in Table 2, the reordered data set still satisfies the non-IID property, and data can be distributed to the federated learning nodes more clearly.
Table 2. Mixed data set (the table is reproduced as an image in the original publication).
In one embodiment, the system further comprises an allocation module, connected to the selection module, for allocating the data samples according to the number of federated learning nodes. First, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) is set, with ∑_{k=1}^{K} C_k = 1, where K denotes the number of nodes. For even allocation, the final state sequence group is distributed to the K nodes in turn in the proportion 1/K, each node receiving W/K samples; node 1, node 2, …, node K each receive one such share. When the K nodes are allocated different proportions of the data samples, C = (C_1, C_2, C_3, …, C_K) with ∑_{k=1}^{K} C_k = 1, and the numbers of samples distributed to the nodes are C_1·W, C_2·W, C_3·W, …, C_K·W, i.e., node 1 receives C_1·W samples, node 2 receives C_2·W samples, …, and node K receives C_K·W samples. (The per-node data set expressions appear as images in the original publication.) The quantity adjuster C is set manually.
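A minimal sketch of this allocation rule, assuming only that C sums to 1; the helper name is illustrative.

```python
def allocate(samples, C):
    """Allocation-module sketch: split the W samples among K nodes in the
    proportions C = (C_1, ..., C_K) with sum(C) == 1; node k receives about
    C_k * W samples.  With C_k = 1/K this is the even split described above."""
    assert abs(sum(C) - 1.0) < 1e-9
    W = len(samples)
    parts, start = [], 0
    for k, c in enumerate(C):
        end = W if k == len(C) - 1 else start + int(c * W)  # remainder to last node
        parts.append(samples[start:end])
        start = end
    return parts
```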
According to another aspect of the present invention, a data set construction method for non-IID federated learning is provided, which uses the above data set construction system and comprises:
s1, the data set is transmitted to an initialization module, the data set is sampled according to an initial distribution probability matrix, and an initial state sequence group is generated according to an initial probability transition matrix;
and S2, transmitting the initial state sequence group and the initial transition matrix into the selection module, and generating the non-IID data set.
The initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = p(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of data set labels; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
Example 1
Taking the MNIST data set as an example, it is constructed into a non-IID data set. MNIST consists of handwritten digits 0, 1, 2, 3, …, 9 from 250 different writers, 50% high-school students and 50% Census Bureau staff, with labels attached. Each picture in MNIST consists of 28×28 pixels, each pixel represented by a gray value; the 28×28 pixels are unrolled into a one-dimensional row vector, and these row vectors form the rows of the picture array. The class labels are the handwritten digits (integers 0 to 9). The training set contains 60,000 samples and the test set contains 10,000 samples. As shown in FIG. 2, the method specifically comprises the following steps:
(1) The original data set MNIST is passed into the initial module. Let Q be the set of all possible initial states, with label set Q = {q_1, q_2, …, q_N}, q_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, i = 1, …, N; the number of samples corresponding to each label differs. The initial distribution probability matrix is π = (π_i), where π_i = P(i_j = q_i), i_j ∈ {0, 1, …, 9}. Assuming the initial distribution probability matrix is π = (0.2, 0.4, 0.1, 0.04, 0.05, 0.05, 0.06, 0.01, 0, 0.09), an initial state i_1 = 1 is generated; according to the initial transition probability matrix A_0 = (p_ij)_{T×n} with T = 20 and n = 10, i_2 = 1 is extracted. The state sequence is denoted I = {i_1, i_2, …, i_T}, i_j ∈ {0, 1, 2, 3, …, 9}; the generated initial state sequence is I_1 = {1,1,1,4,2,1,0,5,6,1,2,1,3,4,5,6,1,2,3,4}, T = 20;
(2) S state sequence groups are generated, denoted {I_1, …, I_S}, and transmitted to the selection module. In the selection module, an initial transition matrix A_0 is set; the effect of the transition matrix is to adjust the correlation between samples so that the inter-sample distribution is dependent, and the transition matrix satisfies that the sum of each column is 1. Following the Markov chain transfer process, a probability transition matrix A = (p_ij)_{20×10} is set. If the number of obtained state sequence groups does not reach the replacement threshold, the initial transition probability matrix A_0 is still used for sampling to obtain the initial state sequences; if the number of state sequence groups S reaches the set replacement (group number) threshold N (N is set according to the number of state sequences: the larger N is, the more state sequence groups S are included; likewise, the smaller N is, the fewer), that is, when S > N, the transition probability matrix is changed to A = (p_ij)_{T×n}, which satisfies a column probability sum of 1, with T = 20 and n = 10.
In the selection module, according to the set number of state sequence groups (here N = 30): when S < 30, the initial transition probability matrix A_0 continues to be used for the initial state sequence group {I_1, …, I_S}; when S ≥ 30, the transition probability matrix is replaced by A, generating the state sequence groups {I_{N+1}, …, I_{2N}, I_{2N+1}, …, I_{3N}, …, I_{(m-1)N}, …, I_{mN}}, e.g. I_35. The transition probability matrix is changed once every N groups, and each sampled state affects the sampling probability of the next step, dynamically adjusting the conditional probability of the sample data; traversing the whole data set according to this strategy yields the state sequence group {I_1, …, I_{mN}}.
The sequence numbers of the state sequence group {I_1, …, I_{mN}} may then be rearranged.
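As a worked illustration of this example, the earlier sketches (initial_state_sequence, selection_module, allocate) can be chained end to end; the random matrices A_0 and A_1 are placeholders, π is taken from step (1), and the allocation proportions C anticipate step (3) below.

```python
import numpy as np

labels = list(range(10))                          # q_i in {0, ..., 9}
pi = np.array([0.2, 0.4, 0.1, 0.04, 0.05, 0.05, 0.06, 0.01, 0.0, 0.09])

rng = np.random.default_rng(0)
A0 = rng.random((10, 10)); A0 /= A0.sum(axis=0)   # columns sum to 1
A1 = rng.random((10, 10)); A1 /= A1.sum(axis=0)   # a second matrix to swap in

I1 = initial_state_sequence(pi, A0.T, labels, T=20)  # A0.T is row-stochastic
groups = selection_module(I1, [A0, A1], N=30, labels=labels)
flat = [q for g in groups for q in g]             # label stream, T=20 per group
parts = allocate(flat, C=[0.1, 0.1, 0.4, 0.2, 0.2])
print(len(groups), [len(p) for p in parts])       # 61 groups; per-node counts
```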
(3) In the allocation module, a quantity adjustment parameter C = (C_1, C_2, C_3, …, C_K) is set, with ∑_{k=1}^{K} C_k = 1, where K denotes the number of nodes; C adjusts the number of samples distributed to each node, and the sample set generated by the mixing module is evenly distributed to the nodes in turn in the proportion 1/K. For example, when the number of nodes is K = 10, each node gets 6000 samples. When K = 5 and the nodes are allocated different proportions of the data samples, the quantity adjuster is set to C = (C_1, C_2, C_3, C_4, C_5) = (0.1, 0.1, 0.4, 0.2, 0.2), allocating 6000, 6000, 24000, 12000, and 12000 samples to the respective nodes; the sample set passed through the mixing module is distributed to the 5 nodes according to the adjuster. (The per-node data set expressions appear as images in the original publication.)
Classification performance is then tested with the federated averaging algorithm under a federated learning framework.
Example 2
Taking a banking credit card default data set as an example, a non-IID data set is constructed. The credit card default data set has 30,000 samples, 23 features, and 1 label, including the credit amount of the consumer's individual and family card x_1, card holder gender x_2, education level x_3, marital status x_4, age x_5, historical payment records x_6~x_11 (April to September 2005), bill amounts (April to September 2005) x_12~x_17, and previous payment amounts x_18~x_23, with the label denoted Y. The goal of processing the data is to estimate the probability of default rather than simply to predict whether the user is trustworthy. The specific steps are as follows:
(1) The raw data set is passed into the initial module. Let X be the set of all possible variables and the label set Y = {0, 1}; the number of samples under each label differs. The credit card limit x_1 in the data set ranges from a minimum of 10,000 to a maximum of 1,000,000, and the samples are sorted by credit card amount into 10 intervals: [10000~100000], [100000~200000], …, [900000~1000000]. Assuming the initial distribution probability matrix is π = (0.3, 0.1, 0.1, 0.04, 0.05, 0.05, 0.06, 0.1, 0.1, 0.1), an initial state i_1 = 20000 is generated; according to the initial distribution probability matrix π and the initial state transition probability matrix A_0, the initial state sequence is obtained by sampling, denoted I = {i_1, i_2, …, i_T}, e.g. I_1 = {20000, 50000, 140000, 20000, 70000, 630000, 320000, 50000, 130000, 20000, 380000, 60000, 240000, 30000, 250000, 80000, 10000, 20000, 150000, 10000}, T = 20.
(2) The initial state sequence, the initial state transition probability matrix, and the group number are transmitted to the state sequence group extraction module; the generated state sequence group is then received as the initial state sequence group, denoted {I_1, …, I_S}, and transmitted to the selection module. If the number of state sequence groups S does not reach the threshold N (here set to N = 30), the initial transition probability matrix A_0 is still used; when the set threshold N is reached, that is, when S > N, the transition probability matrix is replaced by A, for which the sum of each column is 1, with T = 20 and n = 10.
(3) A state sequence group is obtained according to the selection strategy of step (2), sampling yielding state sequences such as I_45 = (20000, 120000, 240000, 30000, 50000, 70000, 320000, 70000, 40000, 150000, 20000, 310000, 210000, 40000, 240000, 130000, 30000, 450000, 670000, 20000). The transition probability matrix is replaced every 30 groups, i.e. every 600 samples, and is adjusted for 50 rounds in total, dynamically adjusting the data distribution;
All the state sequence groups {I_1, …, I_1500} may also be merged and their sequence numbers rearranged;
(4) In the allocation module, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) is set for the number of clients K, with ∑_{k=1}^{K} C_k = 1. All state sequences are evenly distributed to the clients in turn in the proportion 1/K, and the credit card default data set passed through the mixing module is distributed to the nodes; for example, when the number of nodes is K = 10, each node is allocated an equal share of the data set. Alternatively, when the number of nodes is 5, the quantity adjuster is set to C = (C_1, C_2, C_3, C_4, C_5) = (0.1, 0.3, 0.2, 0.3, 0.1), and each node is assigned a different proportion of the state sequence groups. (The per-node data set expressions appear as images in the original publication.)
The default probability of credit card users is then predicted with the federated averaging algorithm under a federated learning framework.
The above examples are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (9)

1. A data set construction system for non-IID federated learning, characterized by comprising an initial module, a selection module, and a state sequence group extraction module, wherein:
the initial module is used for receiving the data set and generating an initial state sequence according to the initial probability distribution matrix and the initial probability transition matrix;
the selection module is used for receiving the initial state sequence group and generating a data set for non-IID federated learning;
and the state sequence group extraction module is used for receiving the state sequences and the group numbers and generating the state sequence groups.
2. The data set construction system of claim 1, wherein, in the initial module, the initial probability distribution matrix is π = (π_i), where π_i = p(i_j = q_i), j = 1, 2, …, T, with T denoting the number of samples contained in a sampled state sequence and p denoting the probability that i_j = q_i, 0 < p < 1;
and an initial state sequence is extracted according to the initial probability distribution matrix and the initial state transition probability matrix.
3. The data set building system according to claim 2, wherein in the initial module, the initial state sequence, the initial state transition probability matrix and the group number are transmitted to a state sequence group extraction module, and then the state sequence group returned by the state sequence group extraction module is received as an initial state sequence group, and the initial state sequence group is transmitted to a selection module.
4. The data set construction system of claim 1, wherein, in the initial module, the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
5. The data set construction system of claim 1, wherein, in the state sequence group extraction module, a state sequence group {I_1, I_2, …, I_S} is extracted according to the received state sequence, transition probability matrix, and group number, the specific steps being: for each state value i_j of the state sequence I_1 = {i_1, i_2, …, i_T}, obtain from the transition probability matrix the transition probability distribution {p_1j, …, p_Tj} corresponding to that state value, take the label corresponding to the maximum probability value, and store it in the corresponding position of the next state sequence, thereby obtaining the next fixed-length state sequence I_2; proceed in the same way until I_S is obtained, generating the state sequence group {I_1, I_2, …, I_S}, which is then returned.
6. The data set construction system of claim 1, wherein, in the selection module, the steps of generating the data set for non-IID federated learning are as follows:
(1) set a threshold N for changing the group number, used to adjust the uniformity of the data;
(2) set a transition probability matrix A with entries p_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n, where p_ij denotes the transition probability and each column satisfies ∑_{i=1}^{T} p_ij = 1 (the family of matrices A_1, …, A_m is shown as an image in the original publication), where m is W/N rounded up and W is the number of samples of the data set;
(3) if the number of state sequence groups Z lies in [(k-1)N, kN), 1 < k ≤ m, send the last sequence of the state sequence group, the transition probability matrix A_{k-1}, and the group number kN-Z to the state sequence group extraction module to generate a state sequence group {I_{Z+1}, I_{Z+2}, …, I_{kN}}, which together with the existing groups forms a new state sequence group {I_1, I_2, …, I_{kN}}; if k > m, stop;
(4) repeat step (3) until the whole data set is traversed;
the finally formed state sequence group is the non-IID data set.
7. The data set construction system of claim 1, further comprising an allocation module, connected to the selection module, for allocating data samples to the federated learning nodes; first, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) corresponding to the nodes is set, with ∑_{k=1}^{K} C_k = 1, K denoting the number of nodes; for even allocation, the final state sequence is distributed to the K nodes in turn in the proportion 1/K, each node receiving W/K samples; when the K nodes are allocated different proportions of the data samples, C = (C_1, C_2, C_3, …, C_K) with ∑_{k=1}^{K} C_k = 1, and the numbers of samples distributed to the nodes are C_1·W, C_2·W, C_3·W, …, C_K·W respectively: node 1 receives C_1·W samples, node 2 receives C_2·W samples, …, and node K receives C_K·W samples (the per-node expressions appear as images in the original publication).
8. A method for constructing a data set for non-IID federated learning, the method comprising:
S1, the data set is transmitted to the initial module, sampled according to the initial distribution probability matrix, and an initial state sequence group is generated according to the initial transition probability matrix;
and S2, the initial state sequence group and the transition probability matrix are transmitted into the selection module, generating the non-IID data set.
9. The data set construction method of claim 8, wherein the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
CN202110928436.7A 2021-08-13 2021-08-13 Data set construction system and method for non-IID federated learning Pending CN113627540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110928436.7A CN113627540A (en) 2021-08-13 2021-08-13 Data set construction system and method for non-IID federated learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110928436.7A CN113627540A (en) 2021-08-13 2021-08-13 Data set construction system and method for non-IID federated learning

Publications (1)

Publication Number Publication Date
CN113627540A 2021-11-09

Family

ID=78385116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110928436.7A Pending CN113627540A (en) Data set construction system and method for non-IID federated learning

Country Status (1)

Country Link
CN (1) CN113627540A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723071A (en) * 2022-04-26 2022-07-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Federal learning method and device based on client classification and information entropy



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination