CN113627540A - Data set construction system and method for non-IID federated learning


Info

Publication number
CN113627540A
Authority
CN
China
Prior art keywords
state sequence, data set, initial, module, state
Legal status
Pending
Application number
CN202110928436.7A
Other languages
Chinese (zh)
Inventors
Li Kan
Li Yang
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202110928436.7A
Publication of CN113627540A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention provides a data set construction system and method for non-IID (not independent and identically distributed) federated learning. The system comprises an initial module, a selection module, and a state sequence group extraction module. The initial module receives a data set and generates an initial state sequence according to an initial probability distribution matrix and an initial probability transition matrix; the selection module receives the initial state sequence group and generates a data set for non-IID federated learning; the state sequence group extraction module generates a state sequence group from a received state sequence and group number. With this method, a real data set is divided into a number of smaller subsets to form a distributed non-IID data set, so that different numbers of participants are easy to configure, the balance of local data is convenient to quantify, and non-IID scenarios can be constructed for research.

Description

Data set construction system and method for non-IID federated learning
Technical Field
The invention relates to the field of computer technology, and in particular to a data set construction system and method for non-independent, identically distributed (non-IID) federated learning.
Background
Federated learning provides a modeling method that guarantees data security while addressing the data-silo problem and protecting privacy. According to whether participants share the same features across different users, the same users across different features, or differ in both users and features, it is divided into horizontal federated learning, vertical federated learning, and federated transfer learning. It helps participants cooperate to accomplish an overall objective and is regarded as a technology with great application prospects. For example, Google's Gboard project, based on federated learning, applies the federated averaging algorithm on mobile handsets and monitors statistical data across large fleets of devices.
In general, the training data for federated learning are assumed to be independent and identically distributed (IID), and common algorithms such as neural networks and deep learning are studied under this assumption. However, in application scenarios that do not satisfy the IID assumption, the trained model has low accuracy and the global model cannot converge, because the local data set of any particular user cannot represent the overall distribution. With business integration across industries, more and more scenarios involve substantial overlap of user features. Taking intelligent retail as an example, machine learning is used to bring high-quality product recommendation and sales services to users; the data features involved include user purchasing power, personal preference, and product characteristics, and in real life these three kinds of features may be scattered across three or more different departments or enterprises. For example, banks hold users' purchasing features, shopping platforms and social networks hold customer preference features, and shopping websites hold product features. Owing to user-privacy protection and regulatory supervision, data scattered across different locations cannot simply be aggregated and modeled on one side. In the telecommunications field, operators deploy a large number of intelligent devices in the network that generate large amounts of data at every moment; the data are stored locally and need not be uploaded to a central server, but the data volume and feature distribution collected by each device differ greatly, and the generalization ability of the trained model is poor.
The federated learning architecture comprises a central server and K dispersed clients, whose data sets are D_1, …, D_K; the goal is to train a global model together. The central server issues a modeling task and seeks participating clients, which respond to the joint modeling according to their own needs; the server then sends initial parameters w_t to each participating client. Each client trains the received parameters on its local data set and uploads the resulting weights to the central server, which securely aggregates the uploaded updates; these steps constitute one update of the global model. The aggregated parameters are distributed to the clients again and the next round of local computation begins, iterating in this way until the model converges and the joint modeling ends. Throughout the process, each participant's data stay local during training; all participating clients model jointly, neither the clients nor the server is authorized to access or control other participants' information, and no data resources need to be shared, yet the goal of training the overall model is achieved. Currently, federated learning is applied to image classification, linear regression, and the like.
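To make the loop concrete, the following is a minimal Python sketch of one such federated-averaging process; the helper names (local_train, fedavg), the squared-loss local objective, and the size-weighted aggregation are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def local_train(w, X, y, lr=0.1, epochs=1):
    """One client's local update: a few gradient steps on squared loss
    (loss and optimizer are illustrative assumptions)."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

def fedavg(w, clients, rounds=10):
    """clients: list of local data sets (X_k, y_k), i.e. D_1, ..., D_K."""
    for _ in range(rounds):
        updates = [local_train(w, Xk, yk) for Xk, yk in clients]
        sizes = np.array([len(Xk) for Xk, _ in clients], dtype=float)
        # the server aggregates the uploaded weights (size-weighted average)
        w = sum((n / sizes.sum()) * wk for n, wk in zip(sizes, updates))
    return w
```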
In the experimental stage of federated learning, sampling the training data independently and identically is important to ensure that the stochastic gradient is an unbiased estimate of the full gradient. In other words, having IID data at the clients means that each mini-batch used for a client's local update is statistically identical to a sample drawn uniformly from the entire training data set (i.e., the union of all local data sets at the clients). In practice, it is unrealistic to assume that the local data on each edge device are always IID. More specifically:
Violation of independence: if the order in which the data are processed is not sufficiently random (e.g., ordered by device or by time), independence is violated.
Violation of identical distribution: because devices are tied to particular geographic regions, the label distribution varies across regions; furthermore, different devices (partitions) may hold very different amounts of data.
Therefore:
the data of each node is generated by a distinct distribution;
the amount of data on each node may also vary greatly;
and there may be an underlying structure capturing the relationships between nodes and their relative distributions.
Most empirical studies on synthesizing non-IID data sets have focused on label distribution skew, i.e., forming a non-IID data set by partitioning an existing "flat" data set according to its labels.
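As an illustration of this common construction (not the patent's own method), a label-skew partition can be sketched as follows, with each client drawing samples from only a few label classes; all names here are hypothetical.

```python
import numpy as np

def label_skew_partition(labels, n_clients, classes_per_client=2, seed=0):
    """Give each client samples from only a few label classes -- the
    label-distribution-skew construction mentioned above (illustrative)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    parts = {k: [] for k in range(n_clients)}
    for k in range(n_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        for c in chosen:
            idx = np.flatnonzero(labels == c)
            take = rng.choice(idx, size=len(idx) // n_clients, replace=False)
            parts[k].extend(take.tolist())
    return parts
```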
It is also important that the distribution may change over time, which introduces another dimension of non-IID behavior.
Recent work has shown that most decentralized learning algorithms suffer from a loss of model quality (or even divergence) when run on non-IID data partitions. Although several solutions have been proposed to deal with highly skewed non-IID data (e.g., data sharing and model migration), none of them is satisfactory. For example, some existing work proposes heuristic-based methods that share local device data or create server-side proxy data. However, these methods may be impractical: besides burdening network bandwidth, sending local data to the server violates a key privacy assumption of federated learning, and sending globally shared proxy data to all devices requires the careful generation or collection of such auxiliary data.
Disclosure of Invention
In view of the defects of the prior art, and to meet the needs of the non-IID federated learning experimental process, the invention constructs non-IID data sets suited to federated learning, with multiple rules and cross-correlation between data, by adjusting the correlation probabilities among the data in a data set and by using a quantity adjuster to control the number of samples distributed to each client and their degree of balance.
To achieve the above object, the present invention adopts the following technical solutions.
A data set construction system for non-IID federated learning comprises an initial module, a selection module, and a state sequence group extraction module, wherein:
the initial module is used for receiving the data set and generating an initial state sequence according to the initial probability distribution matrix and the initial probability transition matrix;
the selection module is used for receiving the initial state sequence group and generating a data set for non-IID federated learning;
and the state sequence group extraction module is used for receiving the state sequences and the group numbers and generating the state sequence groups.
Further, in the initial module, the initial probability distribution matrix is π = (π_i), where π_i = p(i_j = q_i), j = 1, 2, …, T, with T denoting the number of samples contained in a sampled state sequence and p denoting the probability that i_j = q_i, 0 < p < 1;
and an initial state sequence is extracted according to the initial probability distribution matrix and the initial state transition probability matrix.
Further, in the initial module, the initial state sequence, the initial state transition probability matrix and the group number are transmitted to a state sequence group extraction module, then the state sequence group returned by the state sequence group extraction module is received as the initial state sequence group, and the initial state sequence group is transmitted to a selection module.
Further, in the initial module, the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
Further, in the state sequence group extraction module, a state sequence group {I_1, I_2, …, I_S} is extracted according to the received state sequence, transition probability matrix, and group number. The specific steps are as follows: for each state value i_j of the state sequence I_1 = {i_1, i_2, …, i_T}, obtain from the transition probability matrix the transition probability distribution {p_1j, …, p_Tj} corresponding to that state value, take the label corresponding to the maximum probability value, and store it in the corresponding position of the next state sequence, thereby obtaining the next fixed-length state sequence I_2; proceed in the same way until I_S is obtained, generating the state sequence group {I_1, I_2, …, I_S}, which is then returned.
Further, in the selection module, the steps of generating the data set for non-IID federated learning are as follows:
(1) set a threshold N for changing the group number, used to adjust the uniformity of the data;
(2) set a transition probability matrix A with entries p_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n, where p_ij denotes the transition probability and each column satisfies ∑_{i=1}^{T} p_ij = 1,
(the family of transition probability matrices A_1, …, A_m is shown as an image in the original publication) where m is W/N rounded up and W is the number of samples of the data set;
(3) if the number of state sequence groups Z lies in [(k-1)N, kN), 1 < k ≤ m, send the last sequence of the state sequence group, the transition probability matrix A_{k-1}, and the group number kN-Z to the state sequence group extraction module to generate a state sequence group {I_{Z+1}, I_{Z+2}, …, I_{kN}}, which together with the existing groups forms a new state sequence group {I_1, I_2, …, I_{kN}}; if k > m, stop;
(4) repeating the step (3) until the whole data set is traversed;
the finally formed state sequence group is the non-IID data set.
Further, the system comprises an allocation module, connected to the selection module, for allocating data samples to the federated learning nodes. First, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) corresponding to the nodes is set, with ∑_{k=1}^{K} C_k = 1, where K denotes the number of nodes. For even allocation, the final state sequence group is distributed to the K nodes in turn in the proportion 1/K, so that each node receives W/K samples. When the K nodes are allocated different proportions of the data samples, C = (C_1, C_2, C_3, …, C_K) with ∑_{k=1}^{K} C_k = 1, and the numbers of samples allocated to the nodes are C_1·W, C_2·W, C_3·W, …, C_K·W respectively: node 1 receives C_1·W samples, node 2 receives C_2·W samples, …, and node K receives C_K·W samples. (The per-node expressions appear as images in the original publication.)
further, the method comprises:
s1, the data set is transmitted to an initialization module, the data set is sampled according to the initial distribution probability matrix, and an initial state sequence group is generated according to the initial transition probability matrix;
and S2, transmitting the initial state sequence group and the transition probability matrix into the selection module, and generating the non-IID data set.
Further, the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
In the invention, a real data set is divided into a number of smaller subsets to form a distributed non-IID data set, so that different numbers of participants are easy to configure and the balance of local data is convenient to quantify.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic block diagram of a data set building system according to one embodiment of the present invention;
FIG. 2 is a flow diagram of a selection module according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention provides a data set construction system for non-IID federated learning comprising an initial module, a selection module, and a state sequence group extraction module, as shown in FIG. 1.
Initial module: receives a data set with W samples in total and converts it into a matrix in which each row represents one sample and each column represents a feature of the sample data; the features X = {x_0, x_1, …, x_m} and the label data Q = {q_0, q_1, q_2, …, q_n} form the complete training data (X, Q), as shown in Table 1.
Table 1. Training data matrix structure (the table is reproduced as an image in the original publication).
Set the initial state distribution probability matrix of the central node, π = (π_i), where π_i = p(i_j = q_i), j = 1, 2, …, T; T can be set manually and equals the number of samples contained in a state sequence, and p is the probability that i_j = q_i (0 < p < 1). Sampling according to the initial state distribution probability π_i = (p_1, …, p_n) yields an initial state i_j, i_j ∈ {q_1, q_2, …, q_n}; through its sequence number, the sample data corresponding to i_j can be found in the training data (this will not be repeated below), i.e., i_j can also stand for the corresponding sample data.
Set the initial state transition probability matrix A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, p_ij = p(i_{t+1} = q_j | i_t = q_i), where p_ij denotes the transition probability. The transition probability matrix controls the balance of the data sample distribution and the correlation among data; it may be concentrated or dispersed and can be set freely, but must satisfy that each column sums to 1, ∑_{i=1}^{T} p_ij = 1. According to the initial state transition probability matrix A_0, an initial state sequence of length T is obtained, I = {i_1, i_2, …, i_T}, i_j ∈ {q_1, q_2, …, q_n}, j = 0, 1, 2, …, T.
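As a rough sketch of this sampling step, the following assumes the transition matrix is an n×n row-stochastic matrix (a simplification of the T×n convention above) and uses illustrative names throughout.

```python
import numpy as np

def initial_state_sequence(pi, A0, labels, T=20, seed=0):
    """Sample an initial state sequence I = {i_1, ..., i_T}: i_1 from the
    initial distribution pi, then each next label from the transition
    probabilities conditioned on the current one.  A0 is treated here as an
    n x n row-stochastic matrix (a simplification of the patent's T x n
    convention); labels is the label set {q_1, ..., q_n}."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    states = [int(rng.choice(n, p=pi))]
    for _ in range(T - 1):
        states.append(int(rng.choice(n, p=A0[states[-1]])))
    return [labels[s] for s in states]
```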
In the initial module, the initial distribution state sequence, the initial state transition probability matrix and the group number are transmitted to a state sequence group extraction module, then the generated state sequence group is received as an initial state sequence group, and the initial state sequence group is transmitted to a selection module.
The state sequence group extraction module extracts a state sequence group {I_1, I_2, …, I_S} according to the received state sequence, the transition probability matrix A, and the group number S. The specific process is: for each value i_j of the state sequence I_1 = {i_1, i_2, …, i_T}, obtain its transition probability distribution {p_1j, …, p_Tj} from the transition probability matrix A_0, take the label corresponding to the maximum probability value, and store it in the corresponding position of the next state sequence, yielding the next fixed-length state sequence I_2; proceed in the same way until I_S is obtained, thereby generating the state sequence group {I_1, I_2, …, I_S}, which is then returned.
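A sketch of this extraction rule in code, under the simplifying assumption that the matrix is n×n with one column per state value; the function name is illustrative.

```python
import numpy as np

def extract_state_sequence_group(I1, A, labels, S):
    """Generate the group {I_1, ..., I_S} as described above: for each state
    value of the current sequence, look up its transition distribution (a
    column of A, treated here as an n x n matrix) and write the label with
    the largest probability into the same position of the next sequence."""
    col = {q: j for j, q in enumerate(labels)}
    group = [list(I1)]
    for _ in range(S - 1):
        group.append([labels[int(np.argmax(A[:, col[q]]))] for q in group[-1]])
    return group
```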
Selection module: receives the state sequence group sent by the initial module and, as shown in FIG. 2, divides the data set into the final state sequence group according to the following steps:
(1) Set a threshold N for replacing the group number; it is set manually and adjusts the uniformity of the data.
(2) Set a transition probability matrix A in the same way as A_0, with entries p_ij = p(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n, where p_ij denotes the transition probability; it can be set freely but must satisfy that each column sums to 1, ∑_{i=1}^{T} p_ij = 1.
(The family of transition probability matrices A_1, …, A_m is shown as an image in the original publication.) Here m is W/N rounded up.
(3) If the number of state sequence groups Z lies in [(k-1)N, kN), 1 < k ≤ m, send the last sequence of the state sequence group, the transition probability matrix A_{k-1}, and the group number kN-Z to the state sequence group extraction module to generate a state sequence group {I_{Z+1}, I_{Z+2}, …, I_{kN}}, which together with the existing groups forms a new state sequence group {I_1, I_2, …, I_{kN}}; if k > m, stop.
(4) Repeat step (3) until the whole data set has been traversed.
By the above method, the conditional probability correlation between labeled sample data is adjusted, i.e., the transition probability matrix A = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n. The idea of the whole process is that after every N state sequence groups are generated, the transition probability matrix is changed and N further groups are generated, finally yielding the state sequence group {I_1, I_2, …, I_{mN}}, which is the non-IID data set.
The role of the transition probability matrix is that once the state of the previous step of the sequence is determined, it determines the probability of the next state; that is, the previous state influences the next one, so the correlation among the data is continuously adjusted:
p_ij = p(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n
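The selection loop can then be sketched as follows, reusing the extract_state_sequence_group() sketch above and assuming the caller supplies the matrix family A_1, …, A_m, one matrix per batch of N groups.

```python
def selection_module(I1, matrices, N, labels):
    """Sketch of the selection loop: after every N state sequence groups the
    transition probability matrix is swapped (matrices supplies A_1, ..., A_m),
    so the correlation between samples keeps changing.  Builds on the
    extract_state_sequence_group() sketch above."""
    groups = [list(I1)]
    for A in matrices:                      # one matrix per batch of N groups
        batch = extract_state_sequence_group(groups[-1], A, labels, N + 1)
        groups.extend(batch[1:])            # drop the duplicated seed sequence
    return groups
```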
In the selection module, the final state sequence group {I_1, I_2, …, I_{mN}} can further be reordered; as shown in Table 2, the reordered data set still satisfies the non-IID property, and data can be distributed to the federated learning nodes more clearly.
Table 2. Mixed data set (the table is reproduced as an image in the original publication).
In one embodiment, the system further comprises an allocation module, connected to the selection module, for allocating the data samples according to the number of federated learning nodes. First, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) is set, with ∑_{k=1}^{K} C_k = 1, where K denotes the number of nodes. For even allocation, the final state sequence group is distributed to the K nodes in turn in the proportion 1/K, each node receiving W/K samples; node 1, node 2, …, node K each receive one such share. When the K nodes are allocated different proportions of the data samples, C = (C_1, C_2, C_3, …, C_K) with ∑_{k=1}^{K} C_k = 1, and the numbers of samples distributed to the nodes are C_1·W, C_2·W, C_3·W, …, C_K·W, i.e., node 1 receives C_1·W samples, node 2 receives C_2·W samples, …, and node K receives C_K·W samples. (The per-node data set expressions appear as images in the original publication.) The quantity adjuster C is set manually.
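A minimal sketch of this allocation rule, assuming only that C sums to 1; the helper name is illustrative.

```python
def allocate(samples, C):
    """Allocation-module sketch: split the W samples among K nodes in the
    proportions C = (C_1, ..., C_K) with sum(C) == 1; node k receives about
    C_k * W samples.  With C_k = 1/K this is the even split described above."""
    assert abs(sum(C) - 1.0) < 1e-9
    W = len(samples)
    parts, start = [], 0
    for k, c in enumerate(C):
        end = W if k == len(C) - 1 else start + int(c * W)  # remainder to last node
        parts.append(samples[start:end])
        start = end
    return parts
```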
According to another aspect of the present invention, a data set construction method for non-IID federated learning is provided, which uses the above data set construction system and comprises:
s1, the data set is transmitted to an initialization module, the data set is sampled according to an initial distribution probability matrix, and an initial state sequence group is generated according to an initial probability transition matrix;
and S2, transmitting the initial state sequence group and the initial transition matrix into the selection module, and generating the non-IID data set.
The initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = p(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of data set labels; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
Example 1
Taking the MNIST data set as an example, it is constructed into a non-IID data set. MNIST consists of handwritten digits 0, 1, 2, 3, …, 9 from 250 different writers, 50% high-school students and 50% Census Bureau staff, with labels attached. Each picture in MNIST consists of 28×28 pixels, each pixel represented by a gray value; the 28×28 pixels are unrolled into a one-dimensional row vector, and these row vectors form the rows of the picture array. The class labels are the handwritten digits (integers 0 to 9). The training set contains 60,000 samples and the test set contains 10,000 samples. As shown in FIG. 2, the method specifically comprises the following steps:
(1) The original data set MNIST is passed into the initial module. Let Q be the set of all possible initial states, with label set Q = {q_1, q_2, …, q_N}, q_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, i = 1, …, N; the number of samples corresponding to each label differs. The initial distribution probability matrix is π = (π_i), where π_i = P(i_j = q_i), i_j ∈ {0, 1, …, 9}. Assuming the initial distribution probability matrix is π = (0.2, 0.4, 0.1, 0.04, 0.05, 0.05, 0.06, 0.01, 0, 0.09), an initial state i_1 = 1 is generated; according to the initial transition probability matrix A_0 = (p_ij)_{T×n} with T = 20 and n = 10, i_2 = 1 is extracted. The state sequence is denoted I = {i_1, i_2, …, i_T}, i_j ∈ {0, 1, 2, 3, …, 9}; the generated initial state sequence is I_1 = {1,1,1,4,2,1,0,5,6,1,2,1,3,4,5,6,1,2,3,4}, T = 20;
(2) S state sequence groups are generated, denoted {I_1, …, I_S}, and transmitted to the selection module. In the selection module, an initial transition matrix A_0 is set; the effect of the transition matrix is to adjust the correlation between samples so that the inter-sample distribution is dependent, and the transition matrix satisfies that the sum of each column is 1. Following the Markov chain transfer process, a probability transition matrix A = (p_ij)_{20×10} is set. If the number of obtained state sequence groups does not reach the replacement threshold, the initial transition probability matrix A_0 is still used for sampling to obtain the initial state sequences; if the number of state sequence groups S reaches the set replacement (group number) threshold N (N is set according to the number of state sequences: the larger N is, the more state sequence groups S are included; likewise, the smaller N is, the fewer), that is, when S > N, the transition probability matrix is changed to A = (p_ij)_{T×n}, which satisfies a column probability sum of 1, with T = 20 and n = 10.
In the selection module, according to the set number of state sequence groups (here N = 30): when S < 30, the initial transition probability matrix A_0 continues to be used for the initial state sequence group {I_1, …, I_S}; when S ≥ 30, the transition probability matrix is replaced by A, generating the state sequence groups {I_{N+1}, …, I_{2N}, I_{2N+1}, …, I_{3N}, …, I_{(m-1)N}, …, I_{mN}}, e.g. I_35. The transition probability matrix is changed once every N groups, and each sampled state affects the sampling probability of the next step, dynamically adjusting the conditional probability of the sample data; traversing the whole data set according to this strategy yields the state sequence group {I_1, …, I_{mN}}.
The sequence numbers of the state sequence group {I_1, …, I_{mN}} may then be rearranged.
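As a worked illustration of this example, the earlier sketches (initial_state_sequence, selection_module, allocate) can be chained end to end; the random matrices A_0 and A_1 are placeholders, π is taken from step (1), and the allocation proportions C anticipate step (3) below.

```python
import numpy as np

labels = list(range(10))                          # q_i in {0, ..., 9}
pi = np.array([0.2, 0.4, 0.1, 0.04, 0.05, 0.05, 0.06, 0.01, 0.0, 0.09])

rng = np.random.default_rng(0)
A0 = rng.random((10, 10)); A0 /= A0.sum(axis=0)   # columns sum to 1
A1 = rng.random((10, 10)); A1 /= A1.sum(axis=0)   # a second matrix to swap in

I1 = initial_state_sequence(pi, A0.T, labels, T=20)  # A0.T is row-stochastic
groups = selection_module(I1, [A0, A1], N=30, labels=labels)
flat = [q for g in groups for q in g]             # label stream, T=20 per group
parts = allocate(flat, C=[0.1, 0.1, 0.4, 0.2, 0.2])
print(len(groups), [len(p) for p in parts])       # 61 groups; per-node counts
```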
(3) In the allocation module, a quantity adjustment parameter C = (C_1, C_2, C_3, …, C_K) is set, with ∑_{k=1}^{K} C_k = 1, where K denotes the number of nodes; C adjusts the number of samples distributed to each node, and the sample set generated by the mixing module is evenly distributed to the nodes in turn in the proportion 1/K. For example, when the number of nodes is K = 10, each node gets 6000 samples. When K = 5 and the nodes are allocated different proportions of the data samples, the quantity adjuster is set to C = (C_1, C_2, C_3, C_4, C_5) = (0.1, 0.1, 0.4, 0.2, 0.2), allocating 6000, 6000, 24000, 12000, and 12000 samples to the respective nodes; the sample set passed through the mixing module is distributed to the 5 nodes according to the adjuster. (The per-node data set expressions appear as images in the original publication.)
Classification performance is then tested with the federated averaging algorithm under a federated learning framework.
Example 2
Taking a banking credit card default data set as an example, a non-IID data set is constructed. The credit card default data set has 30,000 samples, 23 features, and 1 label, including the credit amount of the consumer's individual and family card x_1, card holder gender x_2, education level x_3, marital status x_4, age x_5, historical payment records x_6~x_11 (April to September 2005), bill amounts (April to September 2005) x_12~x_17, and previous payment amounts x_18~x_23, with the label denoted Y. The goal of processing the data is to estimate the probability of default rather than simply to predict whether the user is trustworthy. The specific steps are as follows:
(1) The raw data set is passed into the initial module. Let X be the set of all possible variables and the label set Y = {0, 1}; the number of samples under each label differs. The credit card limit x_1 in the data set ranges from a minimum of 10,000 to a maximum of 1,000,000, and the samples are sorted by credit card amount into 10 intervals: [10000~100000], [100000~200000], …, [900000~1000000]. Assuming the initial distribution probability matrix is π = (0.3, 0.1, 0.1, 0.04, 0.05, 0.05, 0.06, 0.1, 0.1, 0.1), an initial state i_1 = 20000 is generated; according to the initial distribution probability matrix π and the initial state transition probability matrix A_0, the initial state sequence is obtained by sampling, denoted I = {i_1, i_2, …, i_T}, e.g. I_1 = {20000, 50000, 140000, 20000, 70000, 630000, 320000, 50000, 130000, 20000, 380000, 60000, 240000, 30000, 250000, 80000, 10000, 20000, 150000, 10000}, T = 20.
(2) The initial state sequence, the initial state transition probability matrix, and the group number are transmitted to the state sequence group extraction module; the generated state sequence group is then received as the initial state sequence group, denoted {I_1, …, I_S}, and transmitted to the selection module. If the number of state sequence groups S does not reach the threshold N (here set to N = 30), the initial transition probability matrix A_0 is still used; when the set threshold N is reached, that is, when S > N, the transition probability matrix is replaced by A, for which the sum of each column is 1, with T = 20 and n = 10.
(3) A state sequence group is obtained according to the selection strategy of step (2), sampling yielding state sequences such as I_45 = (20000, 120000, 240000, 30000, 50000, 70000, 320000, 70000, 40000, 150000, 20000, 310000, 210000, 40000, 240000, 130000, 30000, 450000, 670000, 20000). The transition probability matrix is replaced every 30 groups, i.e. every 600 samples, and is adjusted for 50 rounds in total, dynamically adjusting the data distribution;
All the state sequence groups {I_1, …, I_1500} may also be merged and their sequence numbers rearranged;
(4) In the allocation module, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) is set for the number of clients K, with ∑_{k=1}^{K} C_k = 1. All state sequences are evenly distributed to the clients in turn in the proportion 1/K, and the credit card default data set passed through the mixing module is distributed to the nodes; for example, when the number of nodes is K = 10, each node is allocated an equal share of the data set. Alternatively, when the number of nodes is 5, the quantity adjuster is set to C = (C_1, C_2, C_3, C_4, C_5) = (0.1, 0.3, 0.2, 0.3, 0.1), and each node is assigned a different proportion of the state sequence groups. (The per-node data set expressions appear as images in the original publication.)
The default probability of credit card users is then predicted with the federated averaging algorithm under a federated learning framework.
The above examples are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (9)

1. A data set construction system for non-IID federated learning, characterized by comprising an initial module, a selection module, and a state sequence group extraction module, wherein:
the initial module is used for receiving the data set and generating an initial state sequence according to the initial probability distribution matrix and the initial probability transition matrix;
the selection module is used for receiving the initial state sequence group and generating a data set for non-IID federated learning;
and the state sequence group extraction module is used for receiving the state sequences and the group numbers and generating the state sequence groups.
2. The data set construction system of claim 1, wherein, in the initial module, the initial probability distribution matrix is π = (π_i), where π_i = p(i_j = q_i), j = 1, 2, …, T, with T denoting the number of samples contained in a sampled state sequence and p denoting the probability that i_j = q_i, 0 < p < 1;
and an initial state sequence is extracted according to the initial probability distribution matrix and the initial state transition probability matrix.
3. The data set building system according to claim 2, wherein in the initial module, the initial state sequence, the initial state transition probability matrix and the group number are transmitted to a state sequence group extraction module, and then the state sequence group returned by the state sequence group extraction module is received as an initial state sequence group, and the initial state sequence group is transmitted to a selection module.
4. The data set construction system of claim 1, wherein, in the initial module, the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
5. The data set construction system of claim 1, wherein, in the state sequence group extraction module, a state sequence group {I_1, I_2, …, I_S} is extracted according to the received state sequence, transition probability matrix, and group number, the specific steps being: for each state value i_j of the state sequence I_1 = {i_1, i_2, …, i_T}, obtain from the transition probability matrix the transition probability distribution {p_1j, …, p_Tj} corresponding to that state value, take the label corresponding to the maximum probability value, and store it in the corresponding position of the next state sequence, thereby obtaining the next fixed-length state sequence I_2; proceed in the same way until I_S is obtained, generating the state sequence group {I_1, I_2, …, I_S}, which is then returned.
6. The data set construction system of claim 1, wherein, in the selection module, the steps of generating the data set for non-IID federated learning are as follows:
(1) set a threshold N for changing the group number, used to adjust the uniformity of the data;
(2) set a transition probability matrix A with entries p_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, …, n; j = 1, 2, …, n, where p_ij denotes the transition probability and each column satisfies ∑_{i=1}^{T} p_ij = 1 (the family of matrices A_1, …, A_m is shown as an image in the original publication), where m is W/N rounded up and W is the number of samples of the data set;
(3) if the number of state sequence groups Z lies in [(k-1)N, kN), 1 < k ≤ m, send the last sequence of the state sequence group, the transition probability matrix A_{k-1}, and the group number kN-Z to the state sequence group extraction module to generate a state sequence group {I_{Z+1}, I_{Z+2}, …, I_{kN}}, which together with the existing groups forms a new state sequence group {I_1, I_2, …, I_{kN}}; if k > m, stop;
(4) repeat step (3) until the whole data set is traversed;
the finally formed state sequence group is the non-IID data set.
7. The data set construction system of claim 1, further comprising an allocation module, connected to the selection module, for allocating data samples to the federated learning nodes; first, a quantity adjuster C = (C_1, C_2, C_3, …, C_K) corresponding to the nodes is set, with ∑_{k=1}^{K} C_k = 1, K denoting the number of nodes; for even allocation, the final state sequence is distributed to the K nodes in turn in the proportion 1/K, each node receiving W/K samples; when the K nodes are allocated different proportions of the data samples, C = (C_1, C_2, C_3, …, C_K) with ∑_{k=1}^{K} C_k = 1, and the numbers of samples distributed to the nodes are C_1·W, C_2·W, C_3·W, …, C_K·W respectively: node 1 receives C_1·W samples, node 2 receives C_2·W samples, …, and node K receives C_K·W samples (the per-node expressions appear as images in the original publication).
8. A method for constructing a data set for non-IID federated learning, the method comprising:
S1, the data set is transmitted to the initial module, sampled according to the initial distribution probability matrix, and an initial state sequence group is generated according to the initial transition probability matrix;
and S2, the initial state sequence group and the transition probability matrix are transmitted into the selection module, generating the non-IID data set.
9. The data set construction method of claim 8, wherein the initial state transition probability matrix is A_0 = (p_ij)_{T×n}, i = 1, …, T, j = 1, …, n, with p_ij = P(i_{t+1} = q_j | i_t = q_i) denoting the transition probability and n the number of labels in the data set; each column satisfies ∑_{i=1}^{T} p_ij = 1, j = 1, …, n.
CN202110928436.7A 2021-08-13 2021-08-13 Data set construction system and method for non-IID federated learning Pending CN113627540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110928436.7A CN113627540A (en) 2021-08-13 2021-08-13 Data set construction system and method for non-IID federated learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110928436.7A CN113627540A (en) 2021-08-13 2021-08-13 Data set construction system and method for non-IID federated learning

Publications (1)

Publication Number Publication Date
CN113627540A 2021-11-09

Family

ID=78385116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110928436.7A Pending CN113627540A (en) Data set construction system and method for non-IID federated learning

Country Status (1)

Country Link
CN (1) CN113627540A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723071A (en) * 2022-04-26 2022-07-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Federal learning method and device based on client classification and information entropy



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination