CN114925744B - Combined training method and device - Google Patents


Info

Publication number
CN114925744B
CN114925744B (application CN202210391540.1A)
Authority
CN
China
Prior art keywords
training
model
result
local
training member
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210391540.1A
Other languages
Chinese (zh)
Other versions
CN114925744A (en)
Inventor
郑龙飞 (Zheng Longfei)
张本宇 (Zhang Benyu)
王力 (Wang Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210391540.1A
Publication of CN114925744A
Application granted
Publication of CN114925744B
Legal status: Active

Abstract

The present disclosure provides a joint training method and apparatus. The joint training involves a plurality of training members that jointly train a neural network model. The training members include a first training member and a second training member: the local sample data of the first training member does not include tag data, and the local model of the first training member is the first N layers of the neural network model; the local sample data of the second training member includes tag data, and the local model of the second training member is the full neural network model. The method, applied at the second training member, comprises: receiving a first training result from the first training member, the first training result being a training result of the local model of the first training member; and training the local model of the second training member according to the local sample data of the second training member and the first training result.

Description

Combined training method and device
Technical Field
The disclosure relates to the technical field of data security, in particular to a method and a device for joint training.
Background
In the big data age, joint training can greatly improve the accuracy of a neural network model, but the raw data held by each participant may contain a great deal of private user information and business secrets, and a leak of that information would cause irreparable harm.
In order to protect the private data of each participant in joint training, the related art introduces split learning algorithms (such as the U-shape split learning algorithm) based on a server-client (C/S) architecture, in which the overall neural network model is split: part of the model is trained by the clients of the participants and part by the server. During training of the neural network model, the clients and the server must exchange data many times, which incurs high communication overhead.
Disclosure of Invention
In view of the above problems, embodiments of the present disclosure provide a method and apparatus for joint training.
In a first aspect, a method of joint training is provided, the joint training including a plurality of training members for training a neural network model, the plurality of training members including a first training member and a second training member, local sample data of the first training member not including tag data, the local model of the first training member being a first N-layer of the neural network model, local sample data of the second training member including tag data, the local model of the second training member being the neural network model, the method being applied to the second training member, the method comprising: receiving a first training result from the first training member, wherein the first training result is a training result of a local model of the first training member; and training the local model of the second training member according to the local sample data of the second training member and the first training result.
In a second aspect, a method of joint training is provided, the joint training including a plurality of training members for training a neural network model, the plurality of training members including a first training member and a second training member, local sample data of the first training member not including tag data, the local model of the first training member being a first N-layer of the neural network model, local sample data of the second training member including tag data, the local model of the second training member being the neural network model, the method being applied to the first training member, the method comprising: training the local model of the first training member according to the local sample data of the first training member to obtain a first training result; and sending the first training result to the second training member, wherein the first training result is used for updating the local model of the second training member by the second training member.
In a third aspect, an apparatus for joint training is provided, the joint training including a plurality of training members for training a neural network model, the plurality of training members including a first training member and a second training member, local sample data of the first training member not including tag data, the local model of the first training member being a first N-layer of the neural network model, local sample data of the second training member including tag data, the local model of the second training member being the neural network model, the apparatus belonging to the second training member, the apparatus comprising: a first receiving unit configured to receive a first training result from the first training member, the first training result being a training result of a local model of the first training member; and the first training unit is configured to train the local model of the second training member according to the local sample data of the second training member and the first training result.
In a fourth aspect, an apparatus for joint training is provided, the joint training including a plurality of training members for training a neural network model, the plurality of training members including a first training member and a second training member, local sample data of the first training member not including tag data, the local model of the first training member being a first N-layer of the neural network model, local sample data of the second training member including tag data, the local model of the second training member being the neural network model, the apparatus belonging to the first training member, the apparatus comprising: the second training unit is configured to train the local model of the first training member according to the local sample data of the first training member to obtain a first training result; and the sending unit is used for sending the first training result to the second training member, wherein the first training result is used for updating the local model of the second training member by the second training member.
In a fifth aspect, there is provided a joint training apparatus comprising a memory having executable code stored therein, and a processor configured to execute the executable code to implement the method of the first or second aspect.
In a sixth aspect, there is provided a computer readable storage medium having stored thereon executable code which when executed is capable of carrying out the method of the first or second aspect.
In a seventh aspect, a computer program product is provided comprising executable code which, when executed, is capable of implementing the method according to the first or second aspect.
The embodiment of the present disclosure provides a joint training method that removes the server as an intermediate training node: the training members without tag data transmit their training results for the first N layers directly to the training member holding tag data, which then completes training of the remaining layers. Compared with the traditional scheme, in which a server acts as an intermediate training node to complete model training, the scheme in the embodiment of the present disclosure eliminates the intermediate communication between the training members and a server, and can therefore reduce communication overhead.
Drawings
Fig. 1 is a schematic structural diagram of a neural network according to an embodiment of the disclosure.
Fig. 2 is a schematic structural diagram of a conventional distributed training system.
Fig. 3 is a schematic diagram of a sample space and a feature space provided by an embodiment of the present disclosure.
Fig. 4 is a schematic flow chart of a method of joint training provided by an embodiment of the present disclosure.
Fig. 5 is a schematic structural view of a joint training apparatus according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural view of another joint training apparatus provided by an embodiment of the present disclosure.
Fig. 7 is a schematic structural view of yet another joint training apparatus provided by an embodiment of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments.
In recent years, artificial intelligence research represented by neural networks has achieved very great results in many fields, and will also play an important role in the production and life of people for a long time in the future.
A neural network to which embodiments of the present disclosure are applicable is described below in conjunction with fig. 1. The structure of the neural network shown in fig. 1 includes an input layer 110, a hidden layer 120, and an output layer 130. In general, the first layer is the input layer 110, the last layer is the output layer 130, and all intermediate layers between them are hidden layers 120. The input layer 110 inputs data, the hidden layer 120 processes the input data, and the output layer 130 outputs the processed result. It will be appreciated that the hidden layer 120 may comprise one or more intermediate layers. A neural network with more intermediate layers may also be referred to as a deep neural network (DNN); the additional layers make a DNN better able to characterize complex real-world situations. In theory, the more parameters a model has, the higher its complexity and the greater its "capacity", meaning it can accomplish more complex learning tasks. Deep neural network models are widely used in pattern recognition, signal processing, combinatorial optimization, anomaly detection, and the like.
As shown in fig. 1, the neural network includes a plurality of layers, each layer includes a plurality of neurons, and the neurons between layers may be fully connected or partially connected. For connected neurons, the output of the upper layer of neurons may serve as the input to the lower layer of neurons.
Typically, the output layer 130 is also provided with a loss function for calculating the prediction error, i.e., for evaluating the degree of difference between the result output by the neural network model (also called the predicted value) and the ideal result (also called the true value).
To minimize the loss function, the neural network model needs to be trained. In some possible implementations, the model may be trained with the back propagation (BP) algorithm, whose training process consists of a forward propagation pass and a backward propagation pass. During forward propagation (in fig. 1, propagation from input layer 110 to output layer 130), input data is fed into the model, processed layer by layer, and passed to the output layer 130. If the difference between the result output at the output layer 130 and the ideal result is large, training switches to backward propagation (in fig. 1, propagation from output layer 130 back to input layer 110), with minimization of the loss function as the optimization target: the partial derivatives of the loss with respect to the neuron weights are computed layer by layer to form the gradient of the loss with respect to the weight vector, and the model weights are modified according to this gradient. The training process of the neural network model ends when the error reaches a desired value, or when the model converges.
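As a hedged illustration of this forward/backward cycle (not from the patent), the following minimal sketch trains a single linear neuron with squared-error loss by repeating the forward pass, the gradient computation, and the weight update:

```python
def train_step(w, x, y_true, lr=0.1):
    """One BP iteration for a one-weight linear neuron."""
    y_pred = w * x                       # forward propagation
    loss = (y_pred - y_true) ** 2        # loss function (squared error)
    grad = 2 * (y_pred - y_true) * x     # backward propagation: dLoss/dw
    return w - lr * grad, loss           # gradient-based weight update

w = 0.0
for _ in range(50):
    w, loss = train_step(w, x=1.0, y_true=2.0)
```

With target y_true = 2.0 at input x = 1.0, repeated updates drive the weight toward 2.0 and the loss toward zero, mirroring the convergence criterion described above.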
It should be noted that the neural network model shown in fig. 1 is only an example. In specific applications, the neural network may take the form of other network models, for example a convolutional neural network (CNN) or a recurrent neural network (RNN). The embodiments of the present disclosure are not particularly limited in this regard.
With the development of artificial intelligence technology, neural networks have gradually been applied to fields such as risk assessment, speech recognition, face recognition, and natural language processing. In fact, the network structure of a neural network model is generally relatively fixed across different application scenarios; achieving better model performance requires more training data. For example, in fields such as medical treatment and finance, different enterprises or institutions hold different data samples, and joint training through data cooperation would greatly improve model precision and bring great economic benefits. However, this raw training data may contain a great deal of private user information and business secrets, and a leak would cause irreparable harm. Enterprises and institutions holding application data are therefore often reluctant, or lack suitable means, to cooperate with each other, and it is difficult to make the data they each hold work together. This dilemma in data sharing and collaboration is known as the "data silo" phenomenon.
To solve the problem of cross-industry, cross-organization data collaboration, especially the key problems of privacy protection and data security, the concept of federated learning (FL) was proposed. Federated learning, also called joint learning, can make data "usable but not visible" on the premise of protecting user privacy and data security; that is, the training task of a machine learning model is completed through multi-party collaboration. Machine learning is the core of artificial intelligence and generally includes neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, federated learning, and other techniques.
Taxonomically, based on the distribution characteristics of the data, federated learning can be divided into horizontal federated learning (HFL), vertical federated learning (VFL), and federated transfer learning (FTL). Horizontal federated learning, also referred to as sample-based federated learning, applies to scenarios where the datasets share the same feature space but differ in sample space. Vertical federated learning, also known as feature-based federated learning, applies to scenarios where the datasets share the same sample space but differ in feature space. Federated transfer learning applies to scenarios where the datasets differ in both sample space and feature space. A scenario in which the datasets share the same feature space but differ in sample space is also called a horizontally sliced data scenario; a scenario in which the datasets share the same sample space but differ in feature space is also called a vertically sliced data scenario.
In some embodiments, the data processing method provided by the embodiments of the present disclosure mainly relates to a vertical federation learning method applicable to a vertically segmented data scene, and the vertically segmented data scene and the vertical federation learning are described in detail below.
In some embodiments, vertically sliced sample data may be understood as sample data owned by the participants in the joint training that has identical identities (IDs) but contains different features. The ID of a piece of sample data is a globally unique identification of that sample across the participant devices. It may be, for example, a unique identifier held by each party such as the user's ID-card number or mobile phone number, or a value computed from such an identifier (for example, a hash of the user identifier).
In practice, different parties may hold data generated by the same users (i.e., data with the same IDs), but if those parties operate different businesses or provide different services, the features contained in that data differ substantially. Take Alipay and Douyin as an example: they operate in different industries but may share the same user group. Since Alipay holds mostly payment data while Douyin holds mostly entertainment and social data, the intersection of the feature spaces they collect is small. In this scenario, the datasets held by Alipay and Douyin share sample data with the same sample space but different feature spaces; such sample data may also be called vertically sliced sample data.
Vertical federated learning refers to a method in which, when two or more datasets have a large overlap in users but a small overlap in user features, the datasets are split vertically (i.e., along the feature dimension), and the portions of data belonging to the same users, whose user features are not entirely the same, are extracted for joint training.
For example, consider a bank, an insurance company, and a shopping mall: the user groups of these three institutions in the same area are likely to contain most residents of that area, so the intersection of their users is large. However, the bank records the user's deposit amount, loan amount, credit rating, and similar information; the insurance company holds the user's insurance-purchase and claim records; and the shopping mall holds the user's daily consumption data, so the intersection of their features is small. Vertical federated learning aggregates these different features in an encrypted state to enhance model capability.
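The vertically split setting above can be pictured as joining each party's feature columns on a shared user ID. The tables and field names below are illustrative only, and are joined in plaintext; a real deployment would aggregate these features under encryption as described:

```python
# Hypothetical feature tables keyed by user ID (our own toy data).
bank = {"U1": {"deposit": 5000, "loan": 1200, "label": "good"}}
insurer = {"U1": {"claims": 2}}
mall = {"U1": {"spend": 300}}

def join_features(uid, *tables):
    """Assemble one training row by merging each party's columns for a user."""
    row = {}
    for t in tables:
        row.update(t[uid])
    return row

row = join_features("U1", bank, insurer, mall)
```

The merged row holds the same user's features from all three parties plus the bank's label, which is exactly the shape of one vertically sliced training example.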
The system in which model training is completed by one party alone in the conventional manner may be referred to as a stand-alone training system, and the training model in the stand-alone training system may be referred to as a stand-alone model. In the federal learning process, two or more participants cooperate to train one or more models. The training system through the multiparty collaboration completion model may be referred to as a distributed training system. The distributed system based on the C/S architecture is widely applied because of the advantages of strong interactivity, safe access form, high response speed, convenience for processing a large amount of data and the like. Taking fig. 2 as an example, fig. 2 is a block diagram of a conventional distributed training system.
As shown in fig. 2, a distributed joint training system based on the C/S architecture includes a server and a plurality of training member devices. The server and each training member device, as well as the training member devices among themselves, need to maintain communication connections. Based on locally stored sample data, the training member devices may jointly train a machine learning model by way of data sharing.
The server and the training member device shown in fig. 2 are divided according to whether there is sample data, the server is a device without sample data, and the training member device is a device with sample data. Of course, the server and training member devices may also be collectively referred to as training devices.
The specific type of the communication connection is not particularly limited in the present specification; it may be, for example, a transmission control protocol (TCP) connection. The server and each training member device, as well as the training member devices among themselves, may create a TCP connection by performing the TCP three-way handshake.
For vertical federated learning, the related art introduces split learning algorithms (e.g., the U-shape split learning algorithm). Taking a neural network model as an example, the overall model is split: part of the model is computed by the training member devices and part by the server. Each training member device transmits its output hidden layer to the server for fusion; the server continues forward propagation on the fused hidden layer and transmits its training result to the training member holding the tag data, which computes the loss function. After the loss function is obtained, each training member and the server perform back propagation to update their models, thereby completing the training of the model.
With continued reference to fig. 1, the neural network model may include multiple layers. When split, the model is divided into the first N layers and the remaining layers other than the first N layers, where N is an integer greater than or equal to 2. For convenience of description, the model formed by the first N layers is hereinafter referred to as the first model, the remaining layers as the second model, and the loss function calculation model of the neural network as the third model.
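A minimal sketch of this split, treating the model as an ordered list of layers (the layer names here are placeholders of our own, not the patent's):

```python
def split_model(layers, n):
    """Split a layer list into the first model (first n layers) and the rest."""
    assert n >= 2, "the description requires N to be an integer >= 2"
    return layers[:n], layers[n:]

layers = ["input", "hidden1", "hidden2", "hidden3", "output"]
first_model, second_model = split_model(layers, n=2)
```

With N = 2, the first model holds the input layer and the first hidden layer, and the second model holds everything after them, matching the first-model/second-model naming above.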
The training process of vertical federated learning is described below with reference to fig. 3, using a neural network model as an example. The training members include training member 1, training member 2, and training member 3. The feature space of the sample data of training member 1 is feature X1; that of training member 2 is features X2 and X3; and that of training member 3 is features X4 and X5 plus the tag data Y. Since training member 1 and training member 2 hold only feature data and no tag data, they perform training of the first model only. Since the server holds neither feature data nor tag data, the server performs training of the second model. Since training member 3 holds both feature data and tag data, it can perform training of the first model and the third model.
For example, the training members, sample spaces, and feature spaces shown in fig. 3 will be described with respect to banks, insurers, and shopping malls located in the same area. Training member 1 may be a shopping mall, training member 2 may be an insurance company, and training member 3 may be a bank. At this time, the feature X1 of the feature space of the sample data held by the training member 1 may refer to daily consumption data of the user at the shopping center, the features X2 and X3 of the feature space of the sample data held by the training member 2 may refer to insurance purchase and claim settlement data of the user at the insurance company, the features X4 and X5 of the feature space of the sample data held by the training member 3 may refer to deposit amount and loan amount of the user at the bank, and the tag data Y may refer to credit rating of the user at the bank. U1-U13 are sample spaces for sample data, i.e., users that may be included by training member 1, training member 2, and training member 3.
Before training, each training member may also perform data alignment on its local data. Specifically, each training member may read sample data from its locally stored sample data set and, based on a secret sharing algorithm, perform secret sharing operations on the data fragments split from its own sample data together with the data fragments sent by the other training member devices, to obtain shared sample data. The secret sharing algorithm may be, for example, an algorithm based on private set intersection (PSI).
Taking fig. 3 as an example, the aligned sample data of training member 1, training member 2, and training member 3 includes users with the same IDs. After data alignment, the shared sample data of each of the three training members includes the data of U1-U4 (the data in the dashed boxes).
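The alignment step can be illustrated with a plaintext ID intersection; note that a real PSI protocol computes this intersection cryptographically, so that no member learns the IDs outside the overlap:

```python
def align_ids(*member_ids):
    """Plaintext stand-in for PSI: intersect each member's sample-ID list."""
    shared = set(member_ids[0])
    for ids in member_ids[1:]:
        shared &= set(ids)
    return sorted(shared)

# Toy ID lists loosely patterned on the U1-U13 sample space of fig. 3.
m1 = ["U1", "U2", "U3", "U4", "U5"]
m2 = ["U1", "U2", "U3", "U4", "U8"]
m3 = ["U1", "U2", "U3", "U4", "U13"]
shared = align_ids(m1, m2, m3)
```

The result is the U1-U4 overlap, i.e., the shared sample data that all three members train on after alignment.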
The training process of the neural network model may include a forward propagation process and a backward propagation process, which are described below in conjunction with fig. 3, respectively.
Forward propagation process:
Training member 1 and training member 2 each take the local data of U1-U4 as input data, perform the training calculation of the first model, and send the result to the server. Training member 3 likewise takes the local data of U1-U4 as input, performs the training calculation of the first model, and sends the result to the server. The server takes the first-model results sent by training members 1, 2, and 3 as input data of the second model and performs the training calculation of the second model. The server then sends the second-model result to training member 3 (the training member holding the tag data). Training member 3 takes the result sent by the server as input to the third model and performs the training calculation of the third model to obtain the loss function.
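The server's fusion step can be sketched as follows; the text does not fix the fusion operator, so the concatenation used here is purely an assumption for illustration:

```python
def fuse_hidden(*hidden_outputs):
    """Fuse the members' hidden-layer results by concatenation (assumed)."""
    fused = []
    for h in hidden_outputs:
        fused.extend(h)
    return fused

# Toy hidden-layer outputs from training members 1, 2, and 3.
h1, h2, h3 = [0.1, 0.2], [0.3], [0.4, 0.5]
fused = fuse_hidden(h1, h2, h3)
```

The fused vector is what the server would feed into the second model to continue forward propagation.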
The back propagation process:
Training member 3 may update the third model (e.g., its parameters) based on the loss function, obtain the gradient values of the output layer of the second model, and send them to the server. The server back-propagates according to these gradient values, updates the second model, obtains the gradient values of the output layer of the first model, and sends them to training member 1, training member 2, and training member 3 respectively. Training members 1, 2, and 3 then back-propagate according to the gradient values of the output layer of the first model and update the first model.
The forward propagation process and the backward propagation process are repeated until the model converges, completing the training of the model.
According to the above description, the split-learning-based distributed joint training system uses a server as an intermediate training node, and multiple data interactions between the training members and the server are needed during forward and backward propagation to complete model training. The communication volume is proportional to the sample size; for large-data-volume distributed tasks in particular, distributed joint training with a server as the intermediate node suffers from large communication volume and slow training.
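A rough, hypothetical way to see the per-round message count of the server-based scheme of fig. 3 (the message labels are ours, not the patent's): each member sends a hidden-layer result up, the server forwards the fused result to the label-holding member, that member returns the second-model gradient, and the server fans the first-model gradients back out.

```python
def server_round_messages(num_members):
    """List the (sender, receiver) messages in one server-mediated round."""
    msgs = []
    for i in range(1, num_members + 1):
        msgs.append((f"member{i}", "server"))   # forward: hidden results up
    msgs.append(("server", "member3"))          # fused result to label holder
    msgs.append(("member3", "server"))          # gradient of the second model
    for i in range(1, num_members + 1):
        msgs.append(("server", f"member{i}"))   # gradients fanned back down
    return msgs

msgs = server_round_messages(3)
```

Every one of these messages is repeated per batch, which is why cutting the server out of the loop, as this disclosure proposes, shrinks the communication volume.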
To solve the above problem, the embodiment of the present disclosure provides a joint training method that removes the server as an intermediate training node: the training members without tag data transmit their training results for the first N layers directly to the training member holding tag data, which then completes training of the remaining layers. This eliminates the intermediate communication between the training members and a server, and can therefore reduce communication overhead.
The distributed training system in embodiments of the present disclosure may include a plurality of training members that may be partitioned according to the feature dimensions of the data set they hold. For example, training members may be partitioned according to whether they hold feature data and/or tag data.
As one example, the plurality of training members of embodiments of the present disclosure may include a first training member and a second training member. The local training data of the first training member includes characteristic data but does not include tag data. The local data of the second training member includes the characteristic data and also includes the tag data. It will be appreciated that one or more first training members may be included in the training system of embodiments of the present disclosure, as well as one or more second training members. The number of the first training member and the second training member is not particularly limited in the embodiments of the present disclosure. For ease of description, the training system of the embodiments of the present disclosure will be described as including a first training member and a second training member.
As one example, the local model of the first training member in embodiments of the present disclosure may be the first N layers of the neural network model, i.e., the first model. The local model of the second training member may be the entire neural network model; for convenience of joint training, it may be split into the first N layers and the remaining layers, i.e., into the first model and the second model. In addition, the local model of the second training member may also include the loss function calculation model, i.e., the third model.
In some embodiments, the sample data set is typically also grouped before training the neural network model, i.e., divided into batches. A batch refers to the data used for one training iteration; the sample data set may be divided into one or more batches according to its size. The embodiments of the present disclosure are not particularly limited in this regard.
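A minimal sketch of dividing a sample set into batches (the batch size here is an arbitrary example):

```python
def make_batches(samples, batch_size):
    """Split a sample list into consecutive batches of at most batch_size."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

batches = make_batches(list(range(10)), batch_size=4)
```

Ten samples with a batch size of 4 yield three batches, the last one partial; each batch then drives one forward/backward iteration.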
The method of the embodiments of the present disclosure is described in detail below in conjunction with fig. 4. The method shown in fig. 4 may be applied to a first training member as well as to a second training member. The method shown in fig. 4 includes steps S420 to S460.
In step S420, the local model of the first training member is trained.
The first training member can train the local model of the first training member according to the local sample data of the first training member to obtain a first training result.
Taking the first training member as an insurance company as an example, the local sample data of the first training member may refer to insurance purchase and claim data of the user at the insurance company.
The local model of the first training member may be, for example, an embedding model, where embedding may refer to a way of converting discrete variables into continuous vector representations. In neural networks, embedding can be used, for example, to reduce the spatial dimension of discrete variables.
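The embedding lookup described here can be sketched as follows. This is a toy illustration with assumed names (`EmbeddingModel`, `forward`) and random initialization, not the actual model of the disclosure:

```python
import random

class EmbeddingModel:
    """Toy embedding model: maps each discrete id to a dense continuous vector."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def forward(self, ids):
        # The output of the model's last layer is the hidden-layer result
        # that the first training member sends to the second training member.
        return [self.table[i] for i in ids]

model = EmbeddingModel(vocab_size=100, dim=4)
hidden = model.forward([3, 17, 3])
```

Note that the same discrete id always maps to the same vector, which is what makes the representation learnable.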
In step S440, the first training result is transmitted.
It will be appreciated that the first training result is the result of training the first model by the first training member using its local sample data. The first training result is the output of the last layer of the first model. Since the last layer of the first model is an intermediate hidden layer of the neural network model, the first training result may also be referred to as a hidden-layer result.
In step S460, the local model of the second training member is trained.
The first training result may be received from the first training member prior to training the local model of the second training member. And then training the local model of the second training member according to the local sample data of the second training member and the first training result.
The manner in which the second training member receives the first training result is not specifically limited in this disclosure. For example, the second training member may receive the first training result over the TCP protocol; in this case, the first training result is sent by the first training member over the TCP protocol.
The local sample data of the second training member includes both feature data and tag data. Taking a bank as the second training member as an example, the feature data of the second training member may refer to deposit amount and loan amount data of a user at the bank, and the tag data may refer to the credit rating of the user at the bank.
To facilitate joint training, the local neural network model of the second training member may be split into two parts: the first N layers of the neural network (i.e., the first model) and the remaining layers other than the first N layers (i.e., the second model). In some embodiments, the second training member may further include a loss function model (i.e., a third model), which may be, for example, a mean squared error (MSE) function model or a mean absolute error (MAE) function model; this is not specifically limited in this disclosure.
When the second training member trains the local model, the first model can be trained by utilizing the local sample data to obtain a second training result. The second training result may be a hidden layer output result of the first model in the second training member, that is, an output result of the last layer in the first model.
Then, the first training result and the second training result are used as input data to train the second model, obtaining the output result of the second model, i.e., the output result of the whole neural network model. In some embodiments, before being used as input data, the first training result and the second training result may be fused to obtain a third training result; that is, the third training result may be used as the input data of the second model to train the second model. The fusion manner of the first training result and the second training result is not particularly limited in the embodiments of the present disclosure; for example, the fusion may be averaging (mean) or concatenation (concat).
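The two fusion manners named above can be sketched as follows; the function name `fuse` is illustrative:

```python
def fuse(h1, h2, mode="mean"):
    """Fuse two hidden-layer results into a third training result."""
    if mode == "mean":
        # element-wise average of the two hidden-layer vectors
        return [(a + b) / 2 for a, b in zip(h1, h2)]
    if mode == "concat":
        # concatenate the two hidden-layer vectors into one longer vector
        return h1 + h2
    raise ValueError(f"unknown fusion mode: {mode}")

fused_mean = fuse([1.0, 3.0], [3.0, 5.0], "mean")      # [2.0, 4.0]
fused_concat = fuse([1.0, 3.0], [3.0, 5.0], "concat")  # [1.0, 3.0, 3.0, 5.0]
```

Mean fusion keeps the input width of the second model fixed, while concat fusion preserves each member's hidden layer but widens the second model's input.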
After the output result of the neural network model is obtained, forward propagation can be continued, the output result of the neural network model is used as input data of a third model, and training calculation of the third model is performed to obtain a loss function.
The process described above is the forward propagation of the neural network model training. If the loss error meets the requirement, the training of the neural network model can be completed without a back-propagation process. If the loss error does not meet the requirement, training of the neural network model also requires a back-propagation process, which is described below in connection with the joint training method of fig. 4.
The second training member may further update the third model based on the loss function and obtain an error parameter for the output layer of the second model. The second training member may back-propagate this error parameter, update the second model, obtain an error parameter for updating the first model, and send that error parameter to the first training member. The first training member and the second training member may then each back-propagate and update their respective first models according to the update error parameter of the first model. For example, when the neural network model is updated using the gradient descent method, the error parameter may refer to a gradient value.
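The back-propagation flow can be illustrated with a scalar toy split network (hypothetical, with manually derived gradients; the disclosed models are multi-layer networks):

```python
def split_train_step(x, label, w1, w2, lr=0.1):
    """One forward/backward pass of a toy split network:
    first model h = w1 * x (at a first training member),
    second model y = w2 * h with squared-error loss (at the second member)."""
    h = w1 * x                 # forward pass at the first training member
    y = w2 * h                 # forward pass through the second model
    loss = (y - label) ** 2    # third (loss-function) model
    g_y = 2 * (y - label)      # gradient from the loss model
    g_w2 = g_y * h             # gradient that updates the second model
    g_h = g_y * w2             # error parameter sent back to the first member
    g_w1 = g_h * x             # first member back-propagates its own model
    return w1 - lr * g_w1, w2 - lr * g_w2, loss

w1, w2 = 0.5, 0.5
losses = []
for _ in range(20):
    w1, w2, loss = split_train_step(x=1.0, label=1.0, w1=w1, w2=w2)
    losses.append(loss)
```

The only value that crosses the split in the backward direction is `g_h`, matching the error parameter exchanged between members.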
In some embodiments, to reduce the number of communications between the first training member and the second training member, the second training member may train the second model multiple times using the third training result. The second model is subjected to multiple training updates locally, so that the second model achieves higher accuracy, the training times of the neural network model are reduced as a whole, and therefore the communication traffic between the first training member and the second training member is reduced.
When the second training member updates the second model multiple times, multiple error parameters (e.g., multiple gradient values for the output layer of the first model) for updating the first model may be obtained. In some embodiments, the first training member and the second training member may back-propagate and update the first model using any one of the multiple gradient values. Alternatively, the average of the gradient values may be calculated first, and the gradient value for updating the first model may then be calculated according to the fusion mode of the first training result and the second training result; this reduces the randomness of the gradient in the back-propagation process and improves the robustness of the training model. The randomness of the gradient in the back-propagation process refers to the phenomenon that, when the gradients of the neural network layers are solved in reverse using the gradient descent method, the descent direction is random, so that training may converge to a local minimum instead of the global minimum.
In some embodiments, when the fusion mode of the first training result and the second training result is mean, the average of the multiple gradient values may be divided by the number of training members to obtain the gradient value for updating the first model. When the fusion mode of the first training result and the second training result is concat, the gradient value for updating the first model may be calculated from the average of the multiple gradient values and the proportion of the first training result and the second training result in the fused matrix.
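Under the two fusion modes described here, deriving the first-model gradient from the collected gradient values can be sketched as follows (names and slicing convention are illustrative assumptions):

```python
def first_model_gradient(fused_grads, mode, n_members, offset=0, dim=None):
    """Derive the update gradient for one member's first model from the
    gradients of the fused hidden layer collected over t local updates."""
    # average over the t gradient values to reduce gradient randomness
    avg = [sum(col) / len(fused_grads) for col in zip(*fused_grads)]
    if mode == "mean":
        # each member's hidden layer entered the fusion with weight 1/n_members
        return [g / n_members for g in avg]
    if mode == "concat":
        # each member takes back the slice of the fused gradient that
        # corresponds to its own hidden layer in the fused matrix
        return avg[offset:offset + dim]
    raise ValueError(f"unknown fusion mode: {mode}")

# t = 2 gradient values for a fused layer of width 2, mean fusion, 2 members:
g_mean = first_model_gradient([[2.0, 4.0], [4.0, 6.0]], "mean", n_members=2)
# t = 1 gradient for a concat-fused layer of width 4; this member owns dims 2-3:
g_concat = first_model_gradient([[1.0, 2.0, 3.0, 4.0]], "concat",
                                n_members=2, offset=2, dim=2)
```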
The method of the embodiments of the present disclosure is described in full detail below in conjunction with the joint training method of fig. 4. The method may include four processes: data privacy-protecting intersection, model definition and initialization, model training, and iterative training, which are described below in turn.
Suppose there are k training members in total, and the data set of training member i is Xi, where i = 1, 2, ..., k; only Xk contains the tag data Y.
1. Data privacy protection intersection
The intersection of the sample spaces between the participating joint training members i may be determined based on PSI techniques to obtain the corresponding sample data set Xi and tag data set Y. A common PSI technique is, for example, to calculate hash values of the sample-space IDs and obtain the intersection of the hash values through communication and matching among the participating joint training members i, thereby obtaining the intersection of the sample data set Xi and the tag data set Y.
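The hash-matching idea can be sketched as follows. Note that this plain-hash sketch only illustrates the matching step; production PSI protocols add cryptographic protection so that IDs outside the intersection are not leaked:

```python
import hashlib

def id_hash(sample_id):
    """Hash a sample-space ID so raw IDs need not be exchanged directly."""
    return hashlib.sha256(str(sample_id).encode("utf-8")).hexdigest()

def intersect_samples(local_ids, remote_hashes):
    """Keep only the local sample IDs whose hashes also appear remotely."""
    local = {id_hash(i): i for i in local_ids}
    return sorted(local[h] for h in set(local) & set(remote_hashes))

member_a_ids = ["u1", "u2", "u3"]
member_b_ids = ["u2", "u3", "u4"]
shared = intersect_samples(member_a_ids, {id_hash(i) for i in member_b_ids})
# shared holds the sample-space intersection used to align the data sets
```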
2. Model definition and initialization
According to the feature dimension of the local data of each training member i, a first model Mi local to training member i is defined, along with a second model Mk and a loss function model Ml on training member k. The first model Mi may be an embedding model.
Then, all models are initialized.
3. Model training
The sample data set Xi is also typically grouped (i.e., divided into Batches) prior to model training. The following description takes dividing the data set into W Batches as an example.
(1) In the j-th training iteration (where j = 1, ..., W, and W is a positive integer), training member i uses the j-th Batch to perform forward propagation of model Mi, obtains the output layer Li of Mi, and transmits Li to training member k. Training member k fuses the received hidden layers Li to obtain a fused hidden layer Ls.
(2) Training member k continues forward propagation of model Mk using the hidden layer Ls, finally obtains a predicted value of the tag data Y, and continues forward propagation to obtain the loss function of model Ml. From this loss function, the gradient values of the layers in model Mk can be calculated in turn. Model Mk can be updated according to the gradient values of its layers, and the gradient value of the hidden layer Ls can be obtained.
(3) Step (2) is repeated t times, i.e., model Mk is trained multiple times using the hidden layer Ls, and the average value Gs of the gradient values of the hidden layer Ls over the t training passes is obtained, where t is a positive integer greater than 1.
(4) According to the fusion mode of the hidden layer Ls, the gradient average value Gs can be used to obtain the gradient Gi of each hidden layer Li, and Gi is transmitted to each training member i (i ≠ k), where model Mi can be updated according to the gradient Gi.
4. Iterative training
The model training steps of section 3 are repeated on the split-learning neural network model until the model converges, thereby completing the training of the model.
Method embodiments of the present disclosure are described above in detail in connection with figs. 1-4, and apparatus embodiments of the present disclosure are described below in detail in connection with figs. 5-7. It is to be understood that the description of the method embodiments corresponds to that of the apparatus embodiments; for parts not described in detail, reference may be made to the preceding method embodiments.
Fig. 5 is a schematic structural view of a device for joint training according to an embodiment of the present disclosure. The apparatus 500 in fig. 5 may be located at (or mounted at) the second training member. The apparatus 500 may include a first receiving unit 520 and a first training unit 540. These units are described in detail below.
The first receiving unit 520 may be configured to receive a first training result from the first training member, the first training result being a training result of a local model of the first training member;
The first training unit 540 may be configured to train the local model of the second training member based on the local sample data of the second training member and the first training result.
In some embodiments, the first training unit 540 is configured to: train the first N layers of the local model of the second training member according to the local sample data of the second training member to obtain a second training result; fuse the first training result and the second training result to obtain a third training result; and train the remaining layers of the local model of the second training member other than the first N layers according to the third training result.
In some embodiments, the first training unit 540 is configured to: train the remaining layers of the local model of the second training member other than the first N layers multiple times according to the third training result to obtain an error parameter for updating the local model of the first training member; and send the error parameter to the first training member.
In some embodiments, the first training unit 540 is configured to: train the remaining layers multiple times according to the third training result to obtain multiple gradient values corresponding to the output layer of the local model of the first training member; and calculate the average of the gradient values to obtain the error parameter.
Fig. 6 is a schematic structural view of a joint training apparatus provided in another embodiment of the present disclosure. The apparatus 600 in fig. 6 may be located at (or mounted at) the first training member. The apparatus 600 may include a second training unit 620 and a transmitting unit 640. These units are described in detail below.
The second training unit 620 may be configured to train the local model of the first training member according to the local sample data of the first training member, so as to obtain a first training result;
The transmitting unit 640 may be configured to send the first training result to the second training member, the first training result being used by the second training member to update the local model of the second training member.
In some embodiments, the apparatus 600 further comprises: a second receiving unit 660 and an updating unit 680, the second receiving unit 660 may be configured to receive error parameters from the second training member; the updating unit 680 may be configured to update the local model of the first training member according to the error parameter.
In some embodiments, the error parameter is an average of a plurality of gradient values corresponding to an output layer of a local model of the first training member.
Fig. 7 is a schematic structural view of a device for joint training according to still another embodiment of the present disclosure. The apparatus 700 may be, for example, a computer, a server, etc. The apparatus 700 may include a processor 720, a memory 740, and a bus 780. Processor 720 and memory 740 are coupled via bus 780, processor 720 being configured to execute executable modules, such as computer programs, stored in memory 740.
Processor 720 may be, for example, an integrated circuit chip with signal processing capability. In implementation, the steps of the joint training method may be performed by integrated logic circuits of hardware in processor 720 or by instructions in the form of software. Processor 720 may also be a general-purpose processor, including a CPU, a network processor (NP), or the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Memory 740 may comprise, for example, high-speed random access memory (RAM), or may comprise non-volatile memory, such as at least one disk memory.
Bus 780 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Only one double-headed arrow is shown in fig. 7, but this does not mean that there is only one bus 780 or only one type of bus 780.
The memory 740 is used to store programs, such as the program corresponding to the joint training apparatus. The apparatus 700 may include software function modules stored in the memory 740 in the form of at least one piece of software or firmware, or solidified in the operating system (OS) of the apparatus 700. Upon receiving an execution instruction, processor 720 executes the program to implement the joint training method described above.
In some embodiments, the apparatus 700 provided by the present disclosure may further include a communication interface 760. Communication interface 760 is coupled to processor 720 via a bus.
It should be understood that the structure shown in fig. 7 is a schematic diagram of only a portion of the apparatus 700, and that the apparatus 700 may also include more or fewer components than shown in fig. 7, or have a different configuration than shown in fig. 7. The components shown in fig. 7 may be implemented in hardware, software, or a combination thereof.
The joint training method provided by the embodiment of the present disclosure may be applied to, but not limited to, the apparatus for joint training shown in fig. 7.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state drive (SSD)), or the like.
It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method of joint training comprising a plurality of training members for training a neural network model, the plurality of training members comprising a first training member and a second training member, local sample data of the first training member not comprising tag data, the local model of the first training member being a first N-layer of the neural network model, local sample data of the second training member comprising tag data, the local model of the second training member being the neural network model, the method being applied to the second training member,
The method comprises the following steps:
receiving a first training result from the first training member, wherein the first training result is a training result of a local model of the first training member;
Training the local model of the second training member according to the local sample data of the second training member and the first training result;
The training the local model of the second training member according to the local sample data of the second training member and the first training result comprises the following steps:
training the front N layers of the local model of the second training member according to the local sample data of the second training member to obtain a second training result;
fusing the first training result and the second training result to obtain a third training result;
And training the rest layers of the local model of the second training member except the front N layers according to the third training result.
2. The method of claim 1, the training the remaining layers of the local model of the second training member, other than the first N layers, according to the third training result, comprising:
training the rest layers except the front N layers of the local model of the second training member for multiple times according to the third training result to obtain error parameters for updating the local model of the first training member;
and sending the error parameter to the first training member.
3. The method according to claim 2, wherein the training the remaining layers of the local model of the second training member, except the first N layers, multiple times according to the third training result, to obtain error parameters for updating the local model of the first training member, includes:
Training the residual layer for multiple times according to the third training result to obtain multiple gradient values corresponding to the output layer of the local model of the first training member;
and calculating the average value of the gradient values to obtain the error parameter.
4. A method of joint training comprising a plurality of training members for training a neural network model, the plurality of training members comprising a first training member and a second training member, local sample data of the first training member not comprising tag data, the local model of the first training member being a first N-layer of the neural network model, local sample data of the second training member comprising tag data, the local model of the second training member being the neural network model, the method being applied to the first training member,
The method comprises the following steps:
Training the local model of the first training member according to the local sample data of the first training member to obtain a first training result;
the first training result is sent to the second training member, and the first training result is used for updating the local model of the second training member by the second training member;
The first training result is used for the second training member to obtain a third training result, the third training result is used for the second training member to train the rest layers except the front N layer of the local model of the second training member, the third training result is obtained by the second training member according to the fusion of the first training result and the second training result, and the second training result is obtained by the second training member according to the local sample data of the second training member to train the front N layer of the local model of the second training member.
5. The method of claim 4, the method further comprising:
receiving an error parameter from the second training member;
and updating the local model of the first training member according to the error parameter.
6. The method of claim 5, wherein the error parameter is an average of a plurality of gradient values corresponding to an output layer of a local model of the first training member.
7. A joint training apparatus, the joint training comprising a plurality of training members for training a neural network model, the plurality of training members comprising a first training member and a second training member, local sample data of the first training member not including tag data, the local model of the first training member being a front N-layer of the neural network model, local sample data of the second training member including tag data, the local model of the second training member being the neural network model, the apparatus belonging to the second training member,
The device comprises:
A first receiving unit configured to receive a first training result from the first training member, the first training result being a training result of a local model of the first training member;
A first training unit configured to train a local model of the second training member according to the local sample data of the second training member and the first training result;
the first training unit is configured to:
training the front N layers of the local model of the second training member according to the local sample data of the second training member to obtain a second training result;
fusing the first training result and the second training result to obtain a third training result;
And training the rest layers of the local model of the second training member except the front N layers according to the third training result.
8. The apparatus of claim 7, the first training unit configured to:
training the rest layers except the front N layers of the local model of the second training member for multiple times according to the third training result to obtain error parameters for updating the local model of the first training member;
and sending the error parameter to the first training member.
9. The apparatus of claim 8, the first training unit configured to:
Training the residual layer for multiple times according to the third training result to obtain multiple gradient values corresponding to the output layer of the local model of the first training member;
and calculating the average value of the gradient values to obtain the error parameter.
10. A joint training apparatus comprising a plurality of training members for training a neural network model, the plurality of training members comprising a first training member and a second training member, local sample data of the first training member not including tag data, the local model of the first training member being a front N-layer of the neural network model, local sample data of the second training member including tag data, the local model of the second training member being the neural network model, the apparatus belonging to the first training member,
The device comprises:
The second training unit is configured to train the local model of the first training member according to the local sample data of the first training member to obtain a first training result;
the sending unit is used for sending the first training result to the second training member, wherein the first training result is used for updating the local model of the second training member by the second training member;
The first training result is used for the second training member to obtain a third training result, the third training result is used for the second training member to train the rest layers except the front N layer of the local model of the second training member, the third training result is obtained by the second training member according to the fusion of the first training result and the second training result, and the second training result is obtained by the second training member according to the local sample data of the second training member to train the front N layer of the local model of the second training member.
11. The apparatus of claim 10, the apparatus further comprising:
a second receiving unit configured to receive an error parameter from the second training member;
And the updating unit is configured to update the local model of the first training member according to the error parameter.
12. The apparatus of claim 11, the error parameter being an average of a plurality of gradient values corresponding to an output layer of a local model of the first training member.
13. A joint training apparatus comprising a memory having executable code stored therein and a processor configured to execute the executable code to implement the method of any of claims 1-6.
CN202210391540.1A 2022-04-14 Combined training method and device Active CN114925744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210391540.1A CN114925744B (en) 2022-04-14 Combined training method and device


Publications (2)

Publication Number Publication Date
CN114925744A CN114925744A (en) 2022-08-19
CN114925744B true CN114925744B (en) 2024-07-02


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862112A (en) * 2019-11-26 2021-05-28 深圳先进技术研究院 Federal learning method, storage medium, terminal, server, and federal learning system
CN113935469A (en) * 2021-10-26 2022-01-14 城云科技(中国)有限公司 Model training method based on decentralized federal learning


Similar Documents

Publication Publication Date Title
CN113688855B (en) Data processing method, federal learning training method, related device and equipment
US11836583B2 (en) Method, apparatus and system for secure vertical federated learning
He et al. Neural factorization machines for sparse predictive analytics
CN111401558B (en) Data processing model training method, data processing device and electronic equipment
CN111695415B (en) Image recognition method and related equipment
CN114936650A (en) Method and device for jointly training business model based on privacy protection
CN112989399B (en) Data processing system and method
CN113128701A (en) Sample sparsity-oriented federal learning method and system
Liu et al. D2MIF: A malicious model detection mechanism for federated learning empowered artificial intelligence of things
Ma et al. Trusted AI in multiagent systems: An overview of privacy and security for distributed learning
CN114004363B (en) Method, device and system for jointly updating model
CN112613618A (en) Safe federal learning logistic regression algorithm
CN112446310A (en) Age identification system, method and device based on block chain
EP4386579A1 (en) Retrieval model training method and apparatus, retrieval method and apparatus, device and medium
CN116403290A (en) Living body detection method based on self-supervision domain clustering and domain generalization
CN114282692A (en) Model training method and system for longitudinal federal learning
Yang et al. A general steganographic framework for neural network models
CN114925744B (en) Combined training method and device
CN113051608A (en) Method for transmitting virtualized sharing model for federated learning
CN113630476B (en) Communication method and communication device applied to computer cluster
CN114925744A (en) Joint training method and device
CN115481415A (en) Communication cost optimization method, system, device and medium based on longitudinal federal learning
CN114723012B (en) Calculation method and device based on distributed training system
CN114638376B (en) Multi-party joint model training method and device in composite sample scene
CN114338093B (en) Method for transmitting multi-channel secret information through capsule network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant