CN114861217A - Data synchronization method and device in multi-party combined training - Google Patents

Data synchronization method and device in multi-party combined training

Info

Publication number
CN114861217A
CN114861217A (application CN202210302973.5A)
Authority
CN
China
Prior art keywords
training
sample
sequence
samples
participant devices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210302973.5A
Other languages
Chinese (zh)
Inventor
郑龙飞
王力
张本宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210302973.5A priority Critical patent/CN114861217A/en
Publication of CN114861217A publication Critical patent/CN114861217A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning


Abstract

The embodiments of this specification provide a data synchronization method and device in multi-party joint training. The server shuffles and samples the sample identifiers common to the participant devices and sends the resulting first sample identifier sequence to the participant devices, so that the data privacy of the participants is not revealed in the process. Any participant device obtains, from its own samples, a plurality of samples arranged according to the first sample identifier sequence to obtain a training set, batches the samples in the training set according to their existing arrangement order to obtain a plurality of batches and corresponding batch orders, determines the training order of the current model training when joint model training is required, determines the corresponding batch order and batch of samples based on the training order, and determines the output result of its own model based on that batch. The plurality of participant devices perform data interaction and data synchronization based on the training orders and the corresponding output results to update their respective models.

Description

Data synchronization method and device in multi-party combined training
Technical Field
One or more embodiments of the present disclosure relate to the field of data processing technologies, and in particular, to a method and an apparatus for data synchronization in multi-party joint training.
Background
With the development of artificial intelligence technology, machine learning models have gradually been applied in fields such as risk assessment, speech recognition, face recognition, and natural language processing. Achieving better model performance requires more training data. In fields such as healthcare and finance, different enterprises or institutions hold different sample data, and jointly training on this data with a multi-party machine learning algorithm can greatly improve model accuracy and stability. When the sample data of multiple data parties is used to jointly train a model, the data of these parties needs to be synchronized. However, the sample data of each data party is private and cannot be sent out in plaintext.
Therefore, an improved scheme is desired that can align the related data during multi-party joint model training while protecting data privacy and improve the efficiency of data synchronization between devices.
Disclosure of Invention
One or more embodiments of this specification describe a data synchronization method and apparatus in multi-party joint training, so as to align the related data during multi-party joint model training while protecting data privacy and to improve the efficiency of data synchronization between devices. The specific technical solutions are as follows.
In a first aspect, an embodiment provides a data synchronization method in multi-party joint training, where a network model to be trained is trained by multiple participant devices, the network model includes models owned by the multiple participant devices, and the method includes:
the server acquires the sample identifiers common to the plurality of participant devices, determines a first sample identifier sequence for model training based on the common sample identifiers, and sends the first sample identifier sequence to the plurality of participant devices;
any one participant device receives the first sample identifier sequence sent by the server, obtains, from its own samples, a plurality of samples arranged according to the first sample identifier sequence to obtain a training set, and batches the plurality of samples in the training set according to their existing arrangement order to obtain a plurality of batches of samples and corresponding batch orders;
when joint model training is required, any one participant device determines the training order of the current model training, determines the corresponding batch order and batch of samples based on the training order, and determines the output result of its own model based on that batch of samples;
and the plurality of participant devices perform data interaction and data synchronization based on the training orders and the corresponding output results, so as to update their respective models.
In one embodiment, the network model is jointly trained using a plurality of training cycles;
the server, based on the common sample identification, determines a first sample identification sequence for model training, including:
at the beginning of a training period, the first sample identification sequence is determined based on the common sample identification.
In one embodiment, the sample identifier is a number corresponding to the hash value of the original identifier of the sample, and the server sends, to the plurality of participant devices in advance, the correspondence between the hash values of the original sample identifiers shared by the plurality of participant devices and the numbers;
the step in which any participant device obtains a plurality of samples arranged according to the first sample identifier sequence from its own samples includes:
acquiring, from its own samples, a plurality of samples arranged according to the first sample identifier sequence based on the correspondence between the hash values of the original sample identifiers and the numbers.
In one embodiment, the step of the server determining a first sample identification sequence for model training based on the common sample identification comprises:
and performing operation of disordering sequence and/or sampling on the common sample identification to obtain the first sample identification sequence.
In one embodiment, the step of the server determining a first sample identification sequence for model training based on the common sample identification comprises:
randomly selecting a first number of sample identifiers from the common sample identifiers as the first sample identifier sequence;
the method further comprises the following steps:
the server randomly selects a second number of sample identifiers from the remaining common sample identifiers as a second sample identifier sequence for model validation, and randomly selects a third number of sample identifiers as a third sample identifier sequence for model testing.
In one embodiment, the step in which any one participant device batches the plurality of samples in the training set according to the existing arrangement order includes:
batching the plurality of samples in the training set according to their existing arrangement order, based on the number of training rounds and the sample batch size common to the plurality of participant devices.
In one embodiment, the plurality of participant devices obtain a common sample batch size in the following manner:
the plurality of participant devices perform data interaction with the server using their respective original batch sizes, so that the server determines the sample batch size obtained by fusing the original batch sizes;
the server sends the sample batch size to the plurality of participant devices;
and the plurality of participant devices respectively receive the sample batch size sent by the server.
In one embodiment, after updating the model of the participant device, the method further includes:
updating the training order, and returning to performing the step of determining the corresponding batch order and batch samples based on the training order.
In one embodiment, the step of performing data interaction and data synchronization by the plurality of participant devices based on the training sequence and the corresponding output result includes:
the plurality of participant devices respectively send the training sequence and the corresponding output result to the server;
the server receives training sequences and corresponding output results sent by the multiple participant devices, performs data synchronization and fusion on the output results based on the training sequences to obtain fusion results, and interacts with the multiple participant devices based on the fusion results to update models of the multiple participant devices.
In a second aspect, an embodiment provides a data synchronization method in multi-party joint training, where a network model to be trained is trained by multiple participant devices, the network model includes models owned by the multiple participant devices, and the method is performed by any one of the participant devices, and includes:
receiving a first sample identification sequence sent by the server; wherein the first sample identification sequence is used for model training and is determined based on sample identifications shared by a plurality of participant devices;
obtaining, from its own samples, a plurality of samples arranged according to the first sample identifier sequence to obtain a training set;
batching the plurality of samples in the training set according to their existing arrangement order to obtain a plurality of batches of samples and corresponding batch orders;
when joint model training is required, determining the training order of the current model training, determining the corresponding batch order and batch of samples based on the training order, and determining an output result of its own model based on the batch of samples;
and performing data interaction and data synchronization with the other participant devices based on the training order and the corresponding output result, so as to update the respective models.
In a third aspect, an embodiment provides a data synchronization method in multiparty joint training, where a network model to be trained is trained by multiple participant devices, the network model includes models owned by the multiple participant devices, and the method is executed by a server and includes:
obtaining the sample identifiers shared by the plurality of participant devices, and determining a first sample identifier sequence for model training based on the shared sample identifiers;
sending the first sample identifier sequence to the plurality of participant devices, so that any one participant device obtains, from its own samples, a plurality of samples arranged according to the first sample identifier sequence to obtain a training set, batches the samples according to their existing arrangement order in the training set, and, when joint model training is required, determines the training order of the current model training and the output result of its own model;
receiving training sequences and corresponding output results sent by a plurality of participant devices;
performing data synchronization and fusion on the output result based on the training sequence to obtain a fusion result;
and performing data interaction with the plurality of participant devices based on the fusion result so as to update the models of the plurality of participant devices.
In a fourth aspect, an embodiment provides a data synchronization system in multiparty joint training, where the system includes a server and multiple participant devices, a network model to be trained is trained by the multiple participant devices, and the network model includes models owned by the multiple participant devices respectively;
the server is used for obtaining a sample identifier shared by a plurality of participant devices, determining a first sample identifier sequence for model training based on the shared sample identifier, and sending the first sample identifier sequence to the plurality of participant devices;
any one participant device, configured to receive the first sample identification sequence sent by the server, and obtain, from a sample owned by the participant device, a plurality of samples arranged according to the first sample identification sequence, to obtain a training set; batching a plurality of samples in the training set according to the existing arrangement sequence to obtain a plurality of batched samples and corresponding batching orders;
any one participant device is used for determining the training sequence of the model training when the model combined training is needed, determining the corresponding batch sequence and batch samples based on the training sequence, and determining the output result of the model based on the batch samples;
and the plurality of participant devices are used for performing data interaction and data synchronization based on the training sequence and the corresponding output result so as to update the respective models.
In a fifth aspect, an embodiment provides a data synchronization apparatus in multi-party joint training, where a network model to be trained is trained by multiple pieces of participant equipment, the network model includes models owned by the multiple pieces of participant equipment, and the apparatus is deployed in any one piece of participant equipment, and includes:
the first receiving module is configured to receive a first sample identification sequence sent by the server; wherein the first sample identification sequence is used for model training and is determined based on sample identifications shared by a plurality of participant devices;
the arrangement module is configured to obtain a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set;
the batching module is configured to carry out batching according to the existing arrangement sequence aiming at the plurality of samples in the training set to obtain a plurality of batched samples and corresponding batching orders;
the determining module is configured to determine the training order of the current model training when joint model training is required, determine the corresponding batch order and batch of samples based on the training order, and determine an output result of its own model based on the batch of samples;
a first interaction module configured to perform data interaction and data synchronization with other participant devices based on the training order and corresponding output results to update respective models.
In a sixth aspect, an embodiment provides a data synchronization apparatus in multi-party joint training, where a network model to be trained is trained by multiple participant devices, the network model includes models owned by the multiple participant devices, and the apparatus is deployed in a server, and includes:
the acquisition module is configured to acquire a sample identifier shared by a plurality of participant devices, and determine a first sample identifier sequence for model training based on the shared sample identifier;
the sending module is configured to send the first sample identification sequence to a plurality of participant devices, so that any one of the participant devices obtains a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set, batches the samples according to the existing arrangement sequence of the samples in the training set, and determines the training sequence of the model training and the output result of the model when the model joint training is required;
the second receiving module is configured to receive the training sequences and the corresponding output results sent by the plurality of participant devices;
the fusion module is configured to perform data synchronization and fusion on the output result based on the training sequence to obtain a fusion result;
and the second interaction module is configured to perform data interaction with the plurality of participant devices based on the fusion result so as to update the models of the plurality of participant devices.
In a seventh aspect, embodiments provide a computer-readable storage medium, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method of any one of the first to third aspects.
In an eighth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first to third aspects.
In the method and apparatus provided in the embodiments of this specification, the server sends the determined sample identifier sequence to the multiple participant devices, and each participant device obtains multiple samples from its own samples according to the sample identifier sequence to obtain a training set. The multiple participant devices therefore obtain the same training set, that is, the samples in the training set and their arrangement order are the same, and the participants do not need to exchange samples containing private data, which protects the private data from leakage and realizes data synchronization of the training sets. Meanwhile, the participant devices each batch their respective training sets and determine the batch orders, and during joint model training the model output results of the multiple participants are aligned based on the training order. In the joint training process of the model, the related data can be aligned between the server and the participant devices, and among the participant devices, through simple data interaction, which improves the efficiency of data synchronization between devices while protecting data privacy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
fig. 2 is a schematic flowchart of a data synchronization method in multi-party joint training according to an embodiment;
FIG. 3 is a schematic diagram of batching samples in a training set;
FIG. 4 is a schematic block diagram of a data synchronization system in multiparty joint training provided by an embodiment;
FIG. 5 is a schematic block diagram of a data synchronization apparatus in multi-party joint training according to an embodiment;
fig. 6 is a schematic block diagram of a data synchronization apparatus in multi-party joint training according to an embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The server may shuffle and sample the identifiers of the samples shared by the multiple participant devices to obtain a new sample identifier sequence, and send the sample identifier sequence to the multiple participant devices. Each participant device can obtain a training set from its own samples using this uniform sample identifier sequence and perform joint model training with the training set. Meanwhile, in order to achieve data alignment when batches of samples are used for joint training, each participant device also aligns the output results of the multiple models using the training order (the j-th iteration), so that data synchronization is achieved. The 3 participant devices shown in Fig. 1 are merely an example; in practical applications, 2 or more participant devices may participate in the joint training.
A participant device is a device of a participant that takes part in the joint model training. The participant is also the owner of the sample data, so a participant may also be called a data party. A participant may be a service organization such as a bank, a hospital, or a physical examination institution, which performs joint training with its own business data through its own device. The sample data of a participant may be its business data, which is private data and cannot be transmitted outside the internal secure environment in which the participant device is located.
A participant's samples may be business data of objects. For example, an object may be, but is not limited to, one of a user, a commodity, a transaction, or an event. The business data of an object may include object feature data, which may include, for example but not exclusively, at least one of the following attribute features: basic attribute features of the object, historical behavior features of the object, association relationship features of the object, interaction features of the object, and physical indicators of the object.
Different participants hold large amounts of different business data of the same objects; that is, the sample spaces of the participants are the same while their feature spaces differ. For example, a third-party payment company and a bank share a large group of the same users, and their business data for those users contain different attribute features, or different feature values of the same attribute features. This scenario corresponds to vertical partitioning of the samples.
Multiple participants may employ different network architectures for joint training. For example, a peer-to-peer network architecture or a client-server architecture may be adopted. The peer-to-peer architecture includes a plurality of participant devices and no server; joint training of the model is realized among the participant devices through a preset data transmission mode. The client-server architecture includes a plurality of participant devices and a server; the participant devices perform data transmission and data fusion through the server to realize joint training of the model. In the client-server architecture, joint training may specifically be performed in a federated learning mode or a split learning mode.
Regardless of which network architecture and learning mode are adopted, during joint model training the participant devices need to align their own samples, that is, each time the model is trained, the same batch of samples is input into the participants' own models. In particular, in order to reduce overfitting and improve training precision, multi-party joint training may be performed over a plurality of training periods. In different training periods, the participants' samples need to be shuffled and resampled, in which case even more frequent sample alignment and data synchronization are required.
In order to improve the efficiency of data synchronization between devices and achieve alignment of related data, embodiments of the present specification provide a data synchronization method in multi-party joint training. In the method, in step S210, a server obtains a sample identifier common to a plurality of participant devices, determines a first sample identifier sequence for model training based on the common sample identifier, and sends the first sample identifier sequence to the plurality of participant devices. Step S220, any participant device obtains a plurality of samples arranged according to the first sample identification sequence from its own sample, to obtain a training set. Step S230, any participant device batches a plurality of samples in the training set according to the existing arrangement order to obtain a plurality of batch samples and corresponding batch orders. Step S240, when the model joint training is required, any one of the participant devices determines a training sequence of the current model training, determines a corresponding batch sequence and a batch sample based on the training sequence, and determines an output result of the model itself based on the batch sample. And step S250, the plurality of participant devices perform data interaction and data synchronization based on the training sequence and the corresponding output result so as to update the respective models.
In this embodiment, the server determines a new sample identifier sequence after shuffling and sampling, and the multiple participant devices each determine a sample-aligned training set based on that sequence. The participant devices batch the samples in their own training sets and, using the one-to-one correspondence between the training order and the batch order, perform data interaction and data synchronization based on the training order and the output results of the models. Over the whole model update process, a series of alignment operations is performed: aligning the sample sets, aligning the batches of samples, aligning the batch orders, and aligning the multiple output results using the training order. Data synchronization in the joint training process is thereby realized with little interaction between devices for the purpose of data alignment and a small amount of interaction data, so the efficiency of data synchronization between devices can be improved. The present embodiment is described in detail below with reference to Fig. 2.
Fig. 2 is a flowchart illustrating a data synchronization method in multi-party joint training according to an embodiment. The network model to be trained is trained at least by a plurality of participant devices, and the network model includes the models respectively owned by the plurality of participant devices. The model owned by a participant device may be a model including a feature extraction layer and a regression or classification layer, or may be a partial model including only part of the feature extraction layer. The models configured on the participant devices differ depending on the network architecture and the joint training mode.
The network model may be a business prediction model used for performing business prediction on an object, and may be implemented using a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Graph Neural Network (GNN), or the like.
The plurality of participant devices may be implemented by any means, device, platform, cluster of devices, etc. having computing, processing capabilities. The server is in communication connection with the plurality of participant devices, and alignment of the training set is achieved through interaction with the plurality of participant devices. A server may be implemented by any device, apparatus, platform, cluster of devices, etc. having computing, processing capabilities. For clarity of description, the following description will take the participating devices represented by device a and device B as examples. The method of the present embodiment includes the following steps S210 to S250.
In step S210, the server obtains the sample identifiers common to the plurality of participant devices, determines a first sample identifier sequence S1 for model training based on the common sample identifiers, and sends the first sample identifier sequence S1 to the plurality of participant devices. The plurality of participant devices respectively receive the first sample identifier sequence S1 sent by the server.
The plurality of participant devices each hold a large amount of sample data, which belongs to the participants' private data and cannot be sent out in plaintext. The multiple participant devices may find the intersection of their sample spaces based on privacy-preserving set intersection (PSI), that is, determine the samples common to the sample data of the multiple participants. The common sample identifiers typically comprise a plurality of sample identifiers, which are the identifiers of the samples shared by the multiple participant devices. For example, if device A and device B both hold business data of the same 10,000 users, the sample identifiers common to the two devices are the identifiers of those 10,000 users.
In a specific embodiment, the multiple participant devices may each determine the hash values of the original identifiers of their own samples and send the hash values to the server. After receiving the hash values of the original sample identifiers from the multiple participant devices, the server determines the intersection of the hash values and sends the intersection to the multiple participant devices. In this way, the server may determine the intersection of the hash values as the sample identifiers common to the multiple participant devices.
Of course, the intersection of the sample data of the multiple participant devices may also be determined by another service device together with the multiple participant devices in a PSI manner, so as to obtain the intersection of the hash values of the original sample identifiers. The server then obtains the sample identifiers common to the plurality of participant devices from that service device.
The original sample identifier identifies the corresponding sample data and contains private information. Hashing the original sample identifier yields a hash value that can still identify the corresponding sample data while erasing the specific meaning of the original identifier, so the private information in the original sample identifier is protected from being revealed.
The sample identifier may be a hash value of the original sample identifier, or the hash value may be replaced by a simple number, and the number is used as the sample identifier. There are many implementations of sample identification, as long as the implementations can identify sample data and do not reveal privacy.
The server determines the first sample identifier sequence S1 based on the common sample identifiers in order to fix a uniform set and ordering of samples, so as to align the sample data of the plurality of participant devices. The first sample identifier sequence S1 may specifically be obtained by performing a shuffle operation and/or a sampling operation on the common sample identifiers.
The first sample identifier sequence is used for model training and is determined based on the sample identifiers common to the plurality of participant devices. The first sample identifier sequence S1 includes sample identifiers selected from the common sample identifiers and the arrangement order among them. It may include all of the common sample identifiers or only some of them. The ordering of the identifiers in the first sample identifier sequence S1 is obtained by shuffling and therefore differs from their existing order.
In the joint training of the network model, in order to avoid the overfitting or training instability caused by a fixed sample ordering, a shuffle and/or sampling operation can be performed periodically on the samples used for model training.
In one type of model training, the network model may be jointly trained over multiple training periods (epochs). At the beginning of a training period, the server determines the first sample identifier sequence S1 based on the common sample identifiers. A new sample identifier sequence may be re-determined at the beginning of each training period, or the sample identifier sequence may be updated multiple times in other manners, so as to realize shuffling and sampling of the samples used for model training. The server may also determine the sample identifier sequence of the current training period based on the sample identifier sequence determined in the previous training period.
When the server determines the sample identification sequence periodically or multiple times, the first sample identification sequence S1 may be any one of a plurality of sample identification sequences. The currently determined first sample identification sequence S1 is different from the historical sample identification sequence.
In this embodiment, the server determines the shuffled and sampled sample identifier sequence and sends it to the multiple participant devices, so that the multiple participant devices synchronously obtain uniformly shuffled and sampled samples. This improves the shuffling and sampling efficiency among the multiple devices while also protecting data privacy.
When the server needs to shuffle and sample the samples multiple times, in order to reduce the amount of data transmitted, the server may represent the hash values of the original sample identifiers shared by the multiple participant devices by numbers and obtain the correspondence between the hash values and the numbers. The server may send this correspondence to the plurality of participant devices in advance. In this embodiment, the sample identifier is represented by a number, i.e., the number is the sample identifier. The first sample identifier sequence then includes a plurality of numbers and their arrangement order. The server directly sends the sequence containing the numbers and their order to the plurality of participant devices, which greatly reduces the amount of data transmitted and improves transmission efficiency.
For example, when the original sample identifier is hashed using the Secure Hash Algorithm (SHA-256), the resulting hash value is a 256-bit value. Replacing the hash value with a much shorter number greatly reduces the number of digits and therefore the data volume.
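The following is a minimal sketch of how such a hash/number correspondence might be built, assuming SHA-256 hashing and a plain set intersection for illustration; a production deployment would use a proper PSI protocol rather than exchanging raw hashes, and the identifiers shown are hypothetical.

    import hashlib

    def id_hash(original_id: str) -> str:
        # SHA-256 erases the literal meaning of the original identifier.
        return hashlib.sha256(original_id.encode("utf-8")).hexdigest()

    # Each participant hashes the original identifiers of its own samples.
    hashes_a = {id_hash(x) for x in ["user_001", "user_002", "user_003"]}
    hashes_b = {id_hash(x) for x in ["user_002", "user_003", "user_004"]}

    # The server intersects the hash sets, replaces each 256-bit hash with a
    # short number, and distributes the correspondence to all participants.
    common_hashes = sorted(hashes_a & hashes_b)
    hash_to_number = {h: n for n, h in enumerate(common_hashes)}
    print(hash_to_number)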
When shuffling and/or sampling the common sample identifiers, the server may proceed as follows. The plurality of participant devices each count their local number of common samples Ni, an integer, and send it to the server. The server may check whether the received values of Ni from the plurality of participant devices are all equal. If they are equal, the check passes; if not, the server may interact with the participant devices so that the corresponding participant device further modifies its own samples.
When the check passes, the server may randomly select a first number of sample identifiers from the common sample identifiers as the first sample identifier sequence S1. Meanwhile, the server may randomly select a second number of sample identifiers from the remaining common sample identifiers as a second sample identifier sequence S2 for model validation, and randomly select a third number of sample identifiers as a third sample identifier sequence S3 for model testing.
The server may send the second sample identification sequence S2 and the third sample identification sequence S3 to a plurality of participant devices. The second sample identification sequence S2 is used to cause the participant device to determine the validation set and the third sample identification sequence S3 is used to cause the participant device to determine the test set.
The sum of the first number, the second number, and the third number may be equal to the total number Ni of common sample identifiers. These three numbers may be preset, or may be determined based on preset ratios and the total Ni. For example, the ratios of the numbers of samples in the training set, the validation set, and the test set may be predetermined as α, β, and γ respectively, where α + β + γ = 1. Multiplying these ratios by Ni gives the three values: the first number Ni×α, the second number Ni×β, and the third number Ni×γ.
In this embodiment, the effect of shuffling and sampling is achieved simultaneously through random selection, so that the sample identifier sequence can be determined more conveniently and quickly.
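As an illustration of this random-selection step, the sketch below shuffles the common identifiers and cuts them into the three sequences; the ratio values, the seed, and the use of Python's random module are assumptions made for the example, not requirements of the scheme.

    import random

    def split_identifiers(common_ids, alpha=0.8, beta=0.1, seed=None):
        rng = random.Random(seed)
        ids = list(common_ids)
        rng.shuffle(ids)                       # shuffling and sampling in one pass
        n_train = int(len(ids) * alpha)
        n_val = int(len(ids) * beta)
        s1 = ids[:n_train]                     # first sequence: model training
        s2 = ids[n_train:n_train + n_val]      # second sequence: model validation
        s3 = ids[n_train + n_val:]             # third sequence: model testing
        return s1, s2, s3

    s1, s2, s3 = split_identifiers(range(10), seed=42)
    print(s1, s2, s3)                          # e.g. 8 / 1 / 1 identifiers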
In step S220, after receiving the first sample identifier sequence S1 sent by the server, any one of the participant devices, for example device A, obtains, from its own samples, a plurality of samples arranged according to the first sample identifier sequence S1, and obtains a training set TS1. All participant devices execute this step to obtain their corresponding training sets.
The samples in training set TS1 are the samples of participant device A that are identified by the sample identifiers contained in the first sample identifier sequence S1; the order of the samples in training set TS1 is the arrangement order of the sample identifiers in the first sample identifier sequence S1.
The plurality of samples arranged according to the first sample identifier sequence S1 are obtained from the samples owned by device A; they may be obtained from device A's original sample set, or from the common sample set owned by device A, where the common sample set is the set of samples shared by device A and the other participant devices, that is, the intersection obtained through PSI processing.
For example, suppose the samples owned by device A include samples 1 to 10 and that the first sample identifier sequence S1 contains the identifiers, in order, "hash1, hash4, hash3, hash5, hash6, hash7, hash9, hash2, hash8, hash10". Device A can then derive from its own samples a sample set containing the following samples in the following order: sample 1, sample 4, sample 3, sample 5, sample 6, sample 7, sample 9, sample 2, sample 8, and sample 10.
When the sample identifier is a number corresponding to the hash value of the original sample identifier, any participant device, for example device A, may obtain the plurality of samples arranged according to the first sample identifier sequence from its own samples based on the previously obtained correspondence between the hash values of the original sample identifiers and the numbers.
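A minimal sketch of this lookup on the participant side is shown below; the dictionary-based data layout and the field names are assumptions for illustration only.

    def build_training_set(first_sequence, number_to_hash, local_samples_by_hash):
        """first_sequence: numbers sent by the server, already in training order.
        number_to_hash: correspondence table received from the server in advance.
        local_samples_by_hash: this participant's own samples keyed by ID hash."""
        return [local_samples_by_hash[number_to_hash[n]] for n in first_sequence]

    number_to_hash = {0: "hash1", 1: "hash2", 2: "hash3"}
    local_samples = {"hash1": [0.2, 1.1], "hash2": [0.7, 0.3], "hash3": [1.5, 0.9]}
    training_set = build_training_set([2, 0, 1], number_to_hash, local_samples)
    print(training_set)   # samples arranged as hash3, hash1, hash2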
When any one of the participant devices, for example device A, receives the second sample identifier sequence S2 and the third sample identifier sequence S3 sent by the server, it may obtain, from its own samples, a plurality of samples arranged according to the second sample identifier sequence S2 to obtain the validation set VS1, and a plurality of samples arranged according to the third sample identifier sequence S3 to obtain the test set Tes1.
In step S230, any one of the participant devices, for example device A, batches the plurality of samples in training set TS1 according to the existing arrangement order to obtain a plurality of batches of samples and corresponding batch orders. The existing arrangement order refers to the order of the samples in training set TS1. All participant devices perform this step and batch their training sets. After all participant devices have batched their respective training sets, the resulting batches of samples are identical, as are the batch orders.
Specifically, device A may batch training set TS1 in the existing arrangement order of its samples based on the number of training rounds and the sample batch size common to the plurality of participant devices.
The common number of training rounds (epochs) is denoted es, and the common sample batch size (batchsize) is denoted bs. The plurality of participant devices jointly use all Ni×α samples in their respective training sets once to train the model for one round. The sample batch size is the number of samples in one batch when the samples are batched. Inputting one batch of samples into the model yields label prediction values, a prediction loss is determined based on the label prediction values, and the model is updated once based on the prediction loss, which constitutes one model training. That is, one batch of samples corresponds to one model update, which is also called one model training or one model iteration. In one round of model training, the total number of trainings is Ni×α/bs, and over es rounds the total number of trainings (i.e., the total number of iterations) is es×Ni×α/bs. When the batch order is counted from 1, the maximum batch order equals the total number of trainings, stepN = es×Ni×α/bs.
In this step, when training set TS1 is batched based on the common number of training rounds es and the sample batch size bs, various specific implementations are possible. For example, device A may repeat the Ni×α samples in its own training set TS1 according to the number of training rounds es to obtain the total training samples, then split the total training samples in order with the sample batch size bs as the batch unit, determining the batch order of each batch in turn. Device A may also split the samples in training set TS1 in order with bs as the batch unit without first repeating them; when the last sample of TS1 is reached, the splitting continues again from the first sample of TS1 and the batch order keeps increasing.
FIG. 3 is a schematic diagram of batching the samples in a training set. As an example, the number of training rounds es is 2, the sample batch size bs is 5, and the number of samples in training set TS1 is Ni×α. The result of batching training set TS1 is shown in FIG. 3: vertical dashed lines separate the individual batches, and j denotes the batch order, which increases by 1 starting from 1. Since the number of training rounds es is 2, after the Ni×α samples in TS1 have been divided into j = Ni×α/5 batches, a second round of batching continues and the batch order keeps increasing; the largest batch order is j = 2Ni×α/5, where 2 is the number of training rounds. In this example, the number of samples Ni×α is assumed to be an integer multiple of 5.
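The batching described above and illustrated in FIG. 3 can be sketched as follows, assuming for simplicity that the training-set size is an integer multiple of bs; the function and variable names are illustrative.

    def make_batches(training_set, es, bs):
        batches = {}                      # batch order j (counted from 1) -> samples
        j = 0
        for _ in range(es):               # repeat the same ordering in every round
            for start in range(0, len(training_set), bs):
                j += 1
                batches[j] = training_set[start:start + bs]
        return batches                    # largest order stepN = es * len(training_set) / bs

    batches = make_batches(list(range(10)), es=2, bs=5)
    print(len(batches))                   # 4 batches in total, orders 1..4
    print(batches[3])                     # round 2 starts again from the first sample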
The number of training rounds and the sample batch size may be original hyperparameters of the plurality of participant devices, that is, the original numbers of rounds of the participant devices are the same and their original batch sizes are also the same.
When the original numbers of rounds or the original batch sizes of the plurality of participant devices differ, the common number of training rounds and the common sample batch size can be determined by the server and sent to the plurality of participant devices respectively.
The server may determine the common number of training rounds and the common sample batch size based on the respective original numbers of rounds and original batch sizes of the plurality of participant devices. For example, the participant devices may send their original numbers of rounds or original batch sizes directly to the server, and the server averages the original numbers of rounds to obtain the common number of training rounds, or averages the original batch sizes to obtain the common sample batch size.
In order to protect data privacy, the participant devices may instead perform privacy-preserving data interaction with the server using their respective original batch sizes, so that the server determines the sample batch size obtained by fusing the original batch sizes as the common sample batch size. Similarly, the server may determine the fused number of training rounds in the same manner: the participant devices perform privacy-preserving data interaction with the server using their respective original numbers of rounds, so that the server determines the number of training rounds obtained by fusing them as the common number of training rounds. The privacy-preserving data interaction may use secret sharing, homomorphic encryption, or similar techniques.
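As a toy illustration of such a privacy-preserving fusion, the sketch below averages the participants' original batch sizes with a simplified additive secret-sharing scheme, so that the server only learns the fused value; a real deployment would rely on a vetted MPC or secure-aggregation library, and the concrete numbers are assumed.

    import random

    MOD = 2 ** 32

    def share(value, n_parties):
        # Split a value into n additive shares modulo MOD.
        shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
        shares.append((value - sum(shares)) % MOD)
        return shares

    original_batch_sizes = [32, 64, 48]            # one per participant device
    n = len(original_batch_sizes)

    # Each participant splits its value into shares; the server only ever sees
    # per-position sums of shares, never an individual original batch size.
    all_shares = [share(b, n) for b in original_batch_sizes]
    column_sums = [sum(col) % MOD for col in zip(*all_shares)]

    fused_total = sum(column_sums) % MOD           # equals the sum of the true values
    common_bs = fused_total // n                   # fused (averaged) batch size
    print(common_bs)                               # 48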
When a shuffle operation needs to be performed on the training sets owned by the plurality of participant devices, it can be carried out by changing the value of the number of training rounds es and/or the value of the sample batch size bs.
When the network model is jointly trained over a plurality of training periods, the sample batch size and the number of training rounds for model training can be re-determined at the beginning of each training period. The sample batch size determined in different training periods may differ, and so may the number of training rounds.
In step S240, when joint model training is required, any one of the participant devices, for example device A, determines the training order of the current model training, determines the corresponding batch order and batch of samples based on the training order, and determines the output result of its own model based on the batch of samples. All participant devices execute this step and respectively determine the training order and the corresponding output result.
In step S250, the plurality of participant devices perform data interaction and data synchronization based on the training order and the corresponding output results, so as to update their respective models.
Device A may determine the training order of the current model training according to a preset counting rule: the training order is 1 for the first model training, 2 for the second model training, and so on. When the batch order is counted from 1, the training order equals the batch order and the two correspond one-to-one; both may be denoted as the j-th order.
Device A may obtain the corresponding batch of samples based on the batch order and input the batch into its own model to obtain the model's output. For example, if this is the j-th iteration, device A may select the j-th batch from the multiple batches and input the samples in the j-th batch into its own model one by one to obtain the corresponding output results.
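A minimal sketch of this step on a participant device is given below; the stand-in linear "model" and the numeric values are assumptions used purely to show how the training order j selects the batch.

    def local_model(sample):
        # Stand-in for a participant's own model (e.g. a few network layers).
        weights = [0.5, -0.2]
        return sum(w * x for w, x in zip(weights, sample))

    def output_for_order(j, batches):
        batch = batches[j]                       # batch order corresponds 1:1 to j
        return [local_model(s) for s in batch]   # one output per sample in the batch

    batches = {1: [[1.0, 2.0], [0.5, 0.5]], 2: [[2.0, 1.0], [1.0, 1.0]]}
    print(output_for_order(2, batches))          # outputs sent to the server with j=2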
After the output results of the models of the multiple participant devices in the j-th iteration have been determined, joint training of the model can be completed through data interaction. The specific joint training procedure differs according to the network architecture established among the multiple participant devices. Next, step S250 is described taking the client-server network architecture as an example. Step S250 may include the following steps 1 and 2.
Step 1: the plurality of participant devices respectively send their training orders and corresponding output results Li to the server. For example, in the j-th training (j = 1, 2, …, N, where N is an integer), device A sends the training order step = j and the output result Li of its own model M_A to the server.
Step 2: the server receives the training orders and the corresponding output results Li sent by the plurality of participant devices, performs data synchronization and fusion on the output results Li based on the training order j to obtain a fusion result Ls, and interacts with the plurality of participant devices based on the fusion result Ls to update the models of the plurality of participant devices.
The server checks the training order j corresponding to each received output result Li and selects the output results Li corresponding to the current training order j for fusion. The fusion operation may be summing, averaging, or the like.
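The server-side selection and fusion of step 2 can be sketched as follows, using averaging as the fusion operation; the message format and the example numbers are assumptions for illustration.

    def fuse_outputs(messages, current_j):
        # Keep only outputs whose reported training order matches the current order.
        selected = [out for (j, out) in messages if j == current_j]
        assert selected, "no output matched the current training order"
        # Element-wise average over the participant devices (summing works the same way).
        return [sum(vals) / len(selected) for vals in zip(*selected)]

    messages = [
        (3, [0.8, 0.3]),    # (training order j, output result Li) from device A
        (3, [0.6, 0.5]),    # from device B
    ]
    print(fuse_outputs(messages, current_j=3))   # fusion result Ls = [0.7, 0.4]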
In step 2, the specific way of interacting with the multiple participant devices based on the fusion result Ls depends on the models that have been built and the learning method adopted.
In this embodiment, how the models are built is related to the learning mode selected for the multi-party joint training. The multi-party joint training may be carried out in a federated learning mode or a split learning mode. In the federated learning mode, the plurality of participant devices perform joint training with the assistance of the server, and no model is set up in the server. In the split learning mode, the plurality of participant devices hold the first few layers of the whole network model, the server holds the remaining layers, and the participant devices and the server train the network model together.
When the federated learning mode is adopted, the output result Li determined by a participant device is gradient data determined based on the prediction loss, and the fusion result Ls determined by the server is a fused gradient. The server sends the fused gradient to the plurality of participant devices, and after receiving it, the participant devices update their models using the fused gradient, thereby completing the j-th model training.
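For the federated-learning variant, the fusion and update for one training order j might look like the sketch below; the gradients, the learning rate, and the parameter vectors are made-up illustrative values.

    def fuse_gradients(gradients):
        # Element-wise average of the gradients reported by the participant devices.
        return [sum(g) / len(gradients) for g in zip(*gradients)]

    def apply_gradient(params, grad, lr=0.1):
        # One local model update using the fused gradient sent back by the server.
        return [p - lr * g for p, g in zip(params, grad)]

    grads = [[0.2, -0.4], [0.4, 0.0]]            # output results Li from devices A and B
    fused = fuse_gradients(grads)                # fusion result Ls = [0.3, -0.2]
    params_a = apply_gradient([0.5, -0.2], fused)
    print(params_a)                              # j-th model update completed locally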
When the split learning mode is adopted, the output result Li is an intermediate result, that is, intermediate features extracted from the samples, and the fusion result Ls determined by the server is the fused intermediate result. The server may determine the output data of its own model M_s using the fused intermediate result, and interact with the plurality of participant devices based on that output data.
In one embodiment, the output data determined by the server is a label prediction value. In this case, the server may send the label prediction value to the label owner device, so that the label owner device determines a loss function using the label prediction value and the label values of the samples, determines gradient data based on the loss function, performs back propagation based on the gradient data, and sequentially updates the model in the server and the models in the plurality of participant devices. The label owner device may be any one of the plurality of participant devices.
In another embodiment, the output data determined by the server is not a label prediction value. In this case, the server may send the output data to the label owner device, so that the label owner device determines the label prediction value of the samples using the output data and its own model, determines a loss function using the label prediction value and the label values of the samples, determines gradient data based on the loss function, performs back propagation based on the gradient data, and sequentially updates the model in the server and the models in the plurality of participant devices.
During back propagation, the multiple participant devices and the server may select the model gradient corresponding to the current training order j and perform the model update.
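A toy numerical sketch of the split-learning case for one training order j is given below: a participant's layer produces the intermediate result, the server's layer turns the fused intermediate result into a label prediction, and the gradients flow back through both parts. The scalar weights, the squared loss, and the single-participant fusion are all assumptions made to keep the arithmetic visible.

    def participant_forward(x, w_p):       # first layers, held by a participant device
        return w_p * x

    def server_forward(h, w_s):            # remaining layers, held by the server
        return w_s * h

    x, label = 2.0, 1.0
    w_a, w_s = 0.3, 0.5                    # participant A's weight and the server's weight

    h_a = participant_forward(x, w_a)      # intermediate result Li for order j
    h_fused = h_a                          # fusion over participants (only one here)
    y_pred = server_forward(h_fused, w_s)  # label prediction value

    loss = (y_pred - label) ** 2           # computed at the label owner device
    dL_dy = 2 * (y_pred - label)
    dL_dws = dL_dy * h_fused               # gradient for the server's model M_s
    dL_dwa = dL_dy * w_s * x               # gradient propagated back to participant A
    print(loss, dL_dws, dL_dwa)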
For any one of the participant apparatuses, after updating the model of itself, the training sequence may be updated, and the operation of determining the corresponding batch sequence and batch sample based on the training sequence in step S240 is performed. And updating the training sequence, namely adding 1 to the last training sequence and the like. And repeating the model training process until the model converges, and finishing the model training.
The server can judge whether the training process meets the end condition, and when the training process meets the end condition of the training, the training is ended. The ending condition may include that the number of training times of the model reaches a preset number threshold, or that the prediction loss is less than a preset loss threshold.
In this embodiment, each time a shuffle needs to be performed on training data, the server issues a uniform sample identification sequence, so that multiple participant devices can determine a training set after the shuffle and the data are aligned, and interaction between the devices is simple and easy to implement. In the training process, the output result of the model is aligned by adopting the training sequence j, the data volume of the training sequence is small, and the data volume transmitted between devices can be reduced, so that the transmission efficiency is improved.
In this specification, terms such as "first" in "first sample identification sequence" and "first number", and the corresponding "second" and so on, are used only for distinction and convenience of description, and carry no limiting meaning.
The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 4 is a schematic block diagram of a data synchronization system in multi-party joint training according to an embodiment. The system 400 includes a server 410 and a plurality of participant devices 420. The server 410 and the participant devices 420 may be implemented by any means, device, platform, cluster of devices, etc. having computing, processing capabilities. The network model to be trained is trained at least by the plurality of participant devices 420, and the network model includes models that the plurality of participant devices 420 respectively possess. This embodiment of the system corresponds to the embodiment of the method shown in fig. 2.
The server 410 is configured to obtain a sample identifier common to the plurality of participant devices 420, determine a first sample identifier sequence for model training based on the common sample identifier, and send the first sample identifier sequence to the plurality of participant devices 420;
any one of the participant devices 420, configured to receive the first sample identifier sequence sent by the server 410, and obtain, from a sample owned by the participant device, a plurality of samples arranged according to the first sample identifier sequence, to obtain a training set; batching a plurality of samples in the training set according to the existing arrangement sequence to obtain a plurality of batched samples and corresponding batching orders;
any one of the participant devices 420 is configured to, when model joint training is required, determine a training order of the current model training, determine a corresponding batch order and batch samples based on the training order, and determine an output result of the model itself based on the batch samples;
a plurality of participant devices 420 for performing data interaction and data synchronization based on the training order and corresponding output results to update respective models.
In one embodiment, the system jointly trains the network model using a plurality of training cycles;
the server 410, when determining the first sample identification sequence for model training based on the common sample identification, includes:
at the beginning of a training period, the first sample identification sequence is determined based on the common sample identification.
In one embodiment, the sample identifier is a number corresponding to a hash value of the original identifier of the sample; the server 410 is further configured to send, to the multiple participant devices 420 in advance, the correspondence between the hash values of the sample original identifiers shared by the multiple participant devices 420 and the numbers;
when obtaining a plurality of samples arranged according to the first sample identification sequence from the own sample, any one of the participant devices 420 includes:
and acquiring, from its own samples, a plurality of samples arranged according to the first sample identification sequence, based on the correspondence between the hash values of the original sample identifiers and the numbers.
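The correspondence between hash values and numbers, and the sample arrangement based on it, could be sketched as follows; hashing with SHA-256 and numbering the hashes in sorted order are assumptions of this example, not requirements of the embodiment.

```python
import hashlib

def build_correspondence(common_original_ids):
    """Server side: map the hash value of each common original identifier
    to a compact number (sorted order assumed for determinism)."""
    hashes = sorted(hashlib.sha256(oid.encode()).hexdigest() for oid in common_original_ids)
    return {h: number for number, h in enumerate(hashes)}

def arrange_own_samples(own_samples, correspondence, first_id_sequence):
    """Participant side: own_samples is keyed by the hash of the original
    identifier; the numbered sequence selects and orders the samples."""
    number_to_hash = {number: h for h, number in correspondence.items()}
    return [own_samples[number_to_hash[n]] for n in first_id_sequence]
```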
In one embodiment, the server 410, when determining the first sample identification sequence for model training based on the common sample identification, comprises:
and performing a shuffling and/or sampling operation on the common sample identifications to obtain the first sample identification sequence.
In one embodiment, the server 410, when determining the first sample identification sequence for model training based on the common sample identification, comprises:
randomly selecting a first number of sample identifiers from the common sample identifiers as the first sample identifier sequence;
the server 410 is further configured to randomly select, from the remaining sample identifiers of the common sample identifiers, a second number of sample identifiers as a second sample identifier sequence for model verification, and a third number of sample identifiers as a third sample identifier sequence for model testing.
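One possible way to draw the three identification sequences, assuming disjoint random draws without replacement (an assumption of this sketch):

```python
import random

def split_sample_identifiers(common_ids, first_num, second_num, third_num, seed=0):
    """Draw the first/second/third sample identification sequences for
    training, verification and testing from the common identifiers."""
    rng = random.Random(seed)
    ids = list(common_ids)
    rng.shuffle(ids)
    first_seq = ids[:first_num]                                    # training
    second_seq = ids[first_num:first_num + second_num]             # verification
    third_seq = ids[first_num + second_num:first_num + second_num + third_num]  # testing
    return first_seq, second_seq, third_seq
```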
In one embodiment, when any one of the participant devices 420 batches the plurality of samples in the training set according to their existing arrangement order, this includes:
based on the number of training rounds and the number of sample batches common to the plurality of participant devices 420, the plurality of samples in the training set are batched in their existing order of arrangement.
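A minimal batching sketch, under the assumption that the agreed sample batch number divides the training set into equal contiguous slices (leftover samples, if any, are simply dropped in this illustration):

```python
def batch_training_set(training_set, sample_batch_number):
    """Split the already-ordered training set into the common number of
    batches; the batch order is simply the index of each slice."""
    batch_size = len(training_set) // sample_batch_number
    return [training_set[i * batch_size:(i + 1) * batch_size]
            for i in range(sample_batch_number)]
```

Over multiple training rounds, the total number of training orders would then be the number of rounds multiplied by the sample batch number.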
In one embodiment, the multiple participant devices 420 are further configured to perform data interaction with the server 410 using their respective original batch numbers, so that the server 410 determines a sample batch number obtained by fusing the original batch numbers;
the server 410, further configured to send the sample batch number to the plurality of participant devices 420;
the plurality of participant devices 420 are further configured to receive the sample batch number sent by the server 410, respectively.
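The fusion rule for the original batch numbers is not fixed by the text; as one possible rule (an assumption of this sketch), the server could take the maximum, so that the resulting batches are small enough for every participant.

```python
def fuse_batch_numbers(original_batch_numbers):
    """Server: derive the common sample batch number from the participants'
    original batch numbers (maximum assumed, i.e. the smallest batch size)."""
    return max(original_batch_numbers)
```

Other rules, such as averaging the proposals, would fit the description equally well.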
In one embodiment, any of the participant devices 420 is further configured to update the training sequence after updating its own model, and return to performing the determining the corresponding batch sequence and batch samples based on the training sequence.
In one embodiment, the plurality of participant devices 420 are specifically configured to send the training sequence and the corresponding output result to the server 410, respectively;
the server 410 is further configured to receive the training sequences and the corresponding output results sent by the multiple participant devices 420, perform data synchronization and fusion on the output results based on the training sequences to obtain fusion results, and interact with the multiple participant devices 420 based on the fusion results to update models of the multiple participant devices 420.
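Server-side synchronization keyed by the training order could be sketched as below; the buffering strategy and the class name are assumptions of this illustration.

```python
from collections import defaultdict

class TrainingOrderSyncBuffer:
    """Buffer output results keyed by training order j and fuse them once
    every participant device has reported for that order."""

    def __init__(self, num_participants, fuse_fn):
        self.num_participants = num_participants
        self.fuse_fn = fuse_fn            # e.g. gradient or intermediate-feature fusion
        self.pending = defaultdict(list)

    def submit(self, training_order, output_result):
        self.pending[training_order].append(output_result)
        if len(self.pending[training_order]) == self.num_participants:
            return self.fuse_fn(self.pending.pop(training_order))
        return None                       # still waiting for other devices
```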
Fig. 5 is a schematic block diagram of a data synchronization apparatus in multi-party joint training according to an embodiment. The network model to be trained is trained by a plurality of participant devices, and the network model comprises models respectively owned by the plurality of participant devices. A participant device may be implemented by any apparatus, device, platform, cluster of devices, etc. having computing, processing capabilities. The apparatus embodiment corresponds to the part of the method performed by the participant device in the method embodiment shown in fig. 2. The apparatus 500 is deployed in any one of the participant devices, and includes:
a first receiving module 510, configured to receive a first sample identification sequence sent by the server; wherein the first sample identification sequence is used for model training and is determined based on sample identifications shared by a plurality of participant devices;
a ranking module 520, configured to obtain a plurality of samples ranked according to the first sample identification sequence from own samples, so as to obtain a training set;
a batching module 530 configured to batch the plurality of samples in the training set according to an existing arrangement order to obtain a plurality of batched samples and a corresponding batching order;
the determining module 540 is configured to determine a training sequence of the current model training when the model joint training is required; determining a corresponding batch order and batch samples based on the training order, and determining an output result of a self model based on the batch samples;
a first interaction module 550 configured to perform data interaction and data synchronization with other participant devices based on the training order and corresponding output results to update respective models.
In one embodiment, the sample identifier is a number corresponding to a hash value of the original sample identifier; the arrangement module 520 is specifically configured to:
acquiring, from its own samples, a plurality of samples arranged according to the first sample identification sequence, based on the correspondence between the hash values of the original sample identifiers shared by the plurality of participant devices and the numbers; the correspondence is obtained from the server in advance.
In one embodiment, the batch module 530 is specifically configured to:
and batching the plurality of samples in the training set according to their existing arrangement order, based on the number of training rounds and the sample batch number common to the plurality of participant devices.
In one embodiment, the apparatus 500 further comprises a first parameter module (not shown) for obtaining a common sample batch number by:
performing data interaction with the server and the other participant devices using its own original batch number, so that the server determines the sample batch number obtained by fusing the plurality of original batch numbers;
and receiving the sample batch quantity sent by the server.
In one embodiment, the apparatus 500 further comprises a circulation module (not shown) for:
after updating its own model, updating the training order, and returning to perform the operation of determining the corresponding batch order and batch samples based on the training order.
In one embodiment, the first interaction module 550 is specifically configured to:
sending the training sequence and the corresponding output result to the server so that the server performs data synchronization and fusion on the plurality of output results based on the training sequence sent by the plurality of participant devices to obtain a fusion result;
and interacting with the server and other participant equipment based on the fusion result so as to update the self model.
Fig. 6 is a schematic block diagram of a data synchronization apparatus in multi-party joint training according to an embodiment. The network model to be trained is trained by a plurality of participant devices, and the network model comprises models respectively owned by the plurality of participant devices. The server is communicatively coupled to a plurality of participant devices. A server may be implemented by any device, apparatus, platform, cluster of devices, etc. having computing, processing capabilities. The embodiment of the device corresponds to the part of the method executed by the server in the embodiment of the method shown in fig. 2. The apparatus 600 is deployed in a server, and includes:
an obtaining module 610 configured to obtain a sample identifier common to a plurality of participant devices, and determine a first sample identifier sequence for model training based on the common sample identifier;
a sending module 620, configured to send the first sample identifier sequence to multiple participant devices, so that any one of the participant devices obtains multiple samples arranged according to the first sample identifier sequence from its own sample, obtains a training set, batches the samples in the training set according to an existing arrangement order of the samples, and determines a training order of the current model training and an output result of its own model when model joint training needs to be performed;
a second receiving module 630, configured to receive the training sequences and the corresponding output results sent by the multiple participant devices;
a fusion module 640 configured to perform data synchronization and fusion on the output result based on the training sequence to obtain a fusion result;
a second interaction module 650 configured to perform data interaction with the plurality of participant devices based on the fusion result to update models of the plurality of participant devices.
In one embodiment, the network model is jointly trained using a plurality of training cycles;
the obtaining module 610, when determining the first sample identification sequence for model training based on the common sample identification, includes:
at the beginning of a training period, the first sample identification sequence is determined based on the common sample identification.
In one embodiment, the sample identifier is a number corresponding to a hash value of the original identifier of the sample; the apparatus 600 further comprises:
and a pre-module (not shown in the figure) configured to send a correspondence between the hash value of the original sample identifier and the number, which is common to the plurality of participant devices, to the plurality of participant devices in advance, so that any one of the participant devices obtains a plurality of samples arranged according to the first sample identifier sequence from the own samples based on the correspondence between the hash value of the original sample identifier and the number.
In one embodiment, the obtaining module 610, when determining the first sample identification sequence for model training based on the common sample identification, includes:
and performing a shuffling and/or sampling operation on the common sample identifications to obtain the first sample identification sequence.
In one embodiment, the obtaining module 610, when determining the first sample identification sequence for model training based on the common sample identification, includes:
randomly selecting a first number of sample identifiers from the common sample identifiers as the first sample identifier sequence;
the apparatus 600 further comprises:
a selection module (not shown in the figure) configured to randomly select, from the remaining sample identifications of the common sample identifications, a second number of sample identifiers as a second sample identification sequence for model verification, and a third number of sample identifiers as a third sample identification sequence for model testing.
In one embodiment, the apparatus 600 further comprises:
a second parameter module (not shown in the figure) configured to perform data interaction with the multiple participant devices using their respective original batch numbers, determine a sample batch number obtained by fusing the original batch numbers, and send the sample batch number to the multiple participant devices.
The above apparatus embodiments correspond to the method embodiments; for specific descriptions, reference may be made to the method embodiments, which are not repeated here. Each apparatus embodiment is derived from the corresponding method embodiment and has the same technical effects.
Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any one of fig. 1 to 3.
The embodiment of the present specification further provides a computing device, which includes a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method described in any one of fig. 1 to 3.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be understood with reference to one another, and each embodiment focuses on its differences from the others. In particular, since the storage medium and computing device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, reference may be made to the descriptions of the method embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe in detail the objectives, technical solutions, and advantages of the embodiments of the present invention. It should be understood that the above description provides only examples of embodiments of the present invention and is not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (16)

1. A data synchronization method in multi-party joint training is provided, a network model to be trained is trained through a plurality of participant devices, the network model comprises models respectively owned by the plurality of participant devices, and the method comprises the following steps:
the server acquires a common sample identifier of the plurality of participant devices, determines a first sample identifier sequence for model training based on the common sample identifier, and sends the first sample identifier sequence to the plurality of participant devices;
any one participant device receives the first sample identification sequence sent by the server, and obtains a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set; batching a plurality of samples in the training set according to the existing arrangement sequence to obtain a plurality of batched samples and corresponding batching orders;
any one participant device determines the training sequence of the model training when the model combined training is needed, determines the corresponding batch sequence and the batch samples based on the training sequence, and determines the output result of the model based on the batch samples;
and the plurality of participant devices perform data interaction and data synchronization based on the training sequence and the corresponding output result so as to update the respective models.
2. The method of claim 1, jointly training the network model using a plurality of training cycles;
the server, based on the common sample identification, determines a first sample identification sequence for model training, including:
at the beginning of a training period, the first sample identification sequence is determined based on the common sample identification.
3. The method according to claim 2, wherein the sample identification is a number corresponding to a hash value of the original sample identification; the server sends the corresponding relation between the hash value and the serial number of the sample original identification shared by the plurality of participant devices to the plurality of participant devices in advance;
any participant device, obtaining a plurality of samples arranged according to the first sample identification sequence from its own samples, comprising:
and acquiring a plurality of samples arranged according to the first sample identification sequence from the own samples based on the corresponding relation between the hash value of the original sample identification and the serial number.
4. The method of claim 1, the server, determining a first sequence of sample identifications for model training based on the common sample identifications, comprising:
and performing a shuffling and/or sampling operation on the common sample identifications to obtain the first sample identification sequence.
5. The method of claim 1, the server, determining a first sequence of sample identifications for model training based on the common sample identifications, comprising:
randomly selecting a first number of sample identifiers from the common sample identifiers as the first sample identifier sequence;
the method further comprises the following steps:
the server randomly selects, from the remaining sample identifications of the common sample identifications, a second number of sample identifiers as a second sample identification sequence for model verification, and a third number of sample identifiers as a third sample identification sequence for model testing.
6. The method of claim 1, wherein the step in which any one of the participant devices batches the plurality of samples in the training set according to the existing arrangement order comprises:
and batching the plurality of samples in the training set according to the existing arrangement sequence of the plurality of samples in the training set based on the common training round number and sample batch number of the plurality of participant devices.
7. The method of claim 6, causing a plurality of participant devices to obtain a common sample batch number by:
the plurality of participant devices perform data interaction with the server by using their respective original batch numbers, so that the server determines the sample batch number obtained by fusing the original batch numbers;
the server sends the sample batch number to a plurality of participant devices;
and the plurality of participant devices respectively receive the sample batch number sent by the server.
8. The method of claim 1, wherein, after any one of the participant devices updates its own model, the method further comprises:
updating the training order, and returning to performing the step of determining the corresponding batch order and batch samples based on the training order.
9. The method of claim 1, the plurality of participant devices, the step of data interaction and data synchronization based on the training order and corresponding output results, comprising:
the plurality of participant devices respectively send the training sequences and the corresponding output results to the server;
the server receives training sequences and corresponding output results sent by the multiple participant devices, performs data synchronization and fusion on the output results based on the training sequences to obtain fusion results, and interacts with the multiple participant devices based on the fusion results to update models of the multiple participant devices.
10. A data synchronization method in multi-party joint training is provided, a network model to be trained is trained through a plurality of participant devices, the network model comprises models respectively owned by the plurality of participant devices, and the method is executed through any one of the participant devices and comprises the following steps:
receiving a first sample identification sequence sent by the server; wherein the first sample identification sequence is used for model training and is determined based on sample identifications shared by a plurality of participant devices;
obtaining a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set;
batching a plurality of samples in the training set according to the existing arrangement sequence to obtain a plurality of batched samples and corresponding batching orders;
when model combined training is required, determining the training sequence of the model training; determining a corresponding batch order and batch samples based on the training order, and determining an output result of a self model based on the batch samples;
and performing data interaction and data synchronization with other participant equipment based on the training sequence and the corresponding output result so as to update the respective models.
11. A data synchronization method in multi-party joint training is disclosed, wherein a network model to be trained is trained through a plurality of participant devices, the network model comprises models respectively owned by the plurality of participant devices, and the method is executed through a server and comprises the following steps:
obtaining a sample identifier shared by a plurality of participant devices, and determining a first sample identifier sequence for model training based on the shared sample identifier;
sending the first sample identification sequence to a plurality of participant devices, so that any one participant device obtains a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set, batching the samples according to the existing arrangement sequence of the samples in the training set, and determining the training sequence of the model training and the output result of the model when the model joint training is required;
receiving training sequences and corresponding output results sent by a plurality of participant devices;
performing data synchronization and fusion on the output result based on the training sequence to obtain a fusion result;
and performing data interaction with the plurality of participant devices based on the fusion result so as to update the models of the plurality of participant devices.
12. A data synchronization system in multi-party combined training comprises a server and a plurality of participant devices, wherein a network model to be trained is trained through the plurality of participant devices, and the network model comprises models respectively owned by the plurality of participant devices;
the server is used for obtaining a sample identifier shared by a plurality of participant devices, determining a first sample identifier sequence for model training based on the shared sample identifier, and sending the first sample identifier sequence to the plurality of participant devices;
any one participant device, configured to receive the first sample identification sequence sent by the server, and obtain, from a sample owned by the participant device, a plurality of samples arranged according to the first sample identification sequence, to obtain a training set; batching a plurality of samples in the training set according to the existing arrangement sequence to obtain a plurality of batched samples and corresponding batching orders;
any one participant device is used for determining the training sequence of the model training when the model combined training is needed, determining the corresponding batch sequence and batch samples based on the training sequence, and determining the output result of the model based on the batch samples;
and the plurality of participant devices are used for performing data interaction and data synchronization based on the training sequence and the corresponding output result so as to update the respective models.
13. A data synchronization device in multi-party joint training, a network model to be trained is trained by a plurality of participant devices, the network model includes models owned by the plurality of participant devices respectively, the device is deployed in any one of the participant devices, and the device includes:
the first receiving module is configured to receive a first sample identification sequence sent by the server; wherein the first sample identification sequence is used for model training and is determined based on sample identifications shared by a plurality of participant devices;
the arrangement module is configured to obtain a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set;
the batching module is configured to carry out batching according to the existing arrangement sequence aiming at the plurality of samples in the training set to obtain a plurality of batched samples and corresponding batching orders;
the determining module is configured to determine a training sequence of the model training when the model joint training is required; determining a corresponding batch order and batch samples based on the training order, and determining an output result of a self model based on the batch samples;
a first interaction module configured to perform data interaction and data synchronization with other participant devices based on the training order and corresponding output results to update respective models.
14. A data synchronization device in multi-party joint training, wherein a network model to be trained is trained by a plurality of participant devices, the network model comprises models respectively owned by the participant devices, and the device is deployed in a server and comprises:
the acquisition module is configured to acquire a sample identifier shared by a plurality of participant devices, and determine a first sample identifier sequence for model training based on the shared sample identifier;
the sending module is configured to send the first sample identification sequence to a plurality of participant devices, so that any one of the participant devices obtains a plurality of samples arranged according to the first sample identification sequence from own samples to obtain a training set, batches the samples according to the existing arrangement sequence of the samples in the training set, and determines the training sequence of the model training and the output result of the model when the model joint training is required;
the second receiving module is configured to receive the training sequences and the corresponding output results sent by the plurality of participant devices;
the fusion module is configured to perform data synchronization and fusion on the output result based on the training sequence to obtain a fusion result;
and the second interaction module is configured to perform data interaction with the plurality of participant devices based on the fusion result so as to update the models of the plurality of participant devices.
15. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
16. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-11.
CN202210302973.5A 2022-03-25 2022-03-25 Data synchronization method and device in multi-party combined training Pending CN114861217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210302973.5A CN114861217A (en) 2022-03-25 2022-03-25 Data synchronization method and device in multi-party combined training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210302973.5A CN114861217A (en) 2022-03-25 2022-03-25 Data synchronization method and device in multi-party combined training

Publications (1)

Publication Number Publication Date
CN114861217A true CN114861217A (en) 2022-08-05

Family

ID=82629570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210302973.5A Pending CN114861217A (en) 2022-03-25 2022-03-25 Data synchronization method and device in multi-party combined training

Country Status (1)

Country Link
CN (1) CN114861217A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024094058A1 (en) * 2022-11-03 2024-05-10 华为技术有限公司 Model training method and related apparatus
CN116561229A (en) * 2023-07-03 2023-08-08 厦门泛卓信息科技有限公司 Data synchronization method, device and storage medium based on graphic neural network
CN116561229B (en) * 2023-07-03 2023-09-08 厦门泛卓信息科技有限公司 Data synchronization method, device and storage medium based on graphic neural network
CN117273086A (en) * 2023-11-17 2023-12-22 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of graph neural network
CN117273086B (en) * 2023-11-17 2024-03-08 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of graph neural network
CN117892355A (en) * 2024-03-14 2024-04-16 蓝象智联(杭州)科技有限公司 Multiparty data joint analysis method and system based on privacy protection
CN117892355B (en) * 2024-03-14 2024-05-24 蓝象智联(杭州)科技有限公司 Multiparty data joint analysis method and system based on privacy protection

Similar Documents

Publication Publication Date Title
CN114861217A (en) Data synchronization method and device in multi-party combined training
Sattler et al. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints
CN110929886B (en) Model training and predicting method and system
CN113312641B (en) Multi-point multi-party data interaction method, system, electronic device and storage medium
CN108428132B (en) Fraud transaction identification method, device, server and storage medium
Fiez et al. A SUPER* algorithm to optimize paper bidding in peer review
CN112364943A (en) Federal prediction method based on federal learning
TW202123052A (en) Coding model training method and device for preventing private data leakage
CN113689003B (en) Mixed federal learning framework and method for safely removing third party
Joseph et al. Exponential separations in local differential privacy
CN115547437A (en) Medical federal party-based training system and method
CN113722987B (en) Training method and device of federal learning model, electronic equipment and storage medium
CN113344221A (en) Federal learning method and system based on neural network architecture search
Zhang et al. Fed-cbs: A heterogeneity-aware client sampling mechanism for federated learning via class-imbalance reduction
Chen et al. Memory bounds for continual learning
CN113569151B (en) Data recommendation method, device, equipment and medium based on artificial intelligence
Wang et al. Causal falling rule lists
CN114330673A (en) Method and device for performing multi-party joint training on business prediction model
Wang The prophet inequality can be solved optimally with a single set of samples
Khan et al. Vertical federated learning: A structured literature review
Zhang et al. FedDCSR: Federated cross-domain sequential recommendation via disentangled representation learning
CN112992367B (en) Smart medical interaction method based on big data and smart medical cloud computing system
CN114662148A (en) Multi-party combined training method and device for protecting privacy
CN117436510A (en) Multi-party joint training method, system and device
Zhang et al. Generating Chain-of-Thoughts with a Direct Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination