CN114298321A - Joint modeling method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114298321A
CN114298321A
Authority
CN
China
Prior art keywords
participant
sample
sample set
value
sub
Prior art date
Legal status
Pending
Application number
CN202111607572.2A
Other languages
Chinese (zh)
Inventor
张铁钢
许文彬
Current Assignee
Welab Information Technology Shenzhen Ltd
Original Assignee
Welab Information Technology Shenzhen Ltd
Priority date
Filing date
Publication date
Application filed by Welab Information Technology Shenzhen Ltd
Priority to CN202111607572.2A
Publication of CN114298321A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of data processing and discloses a joint modeling method comprising the following steps: performing common sample ID identification processing on a first sample set and each second sample set respectively, and splitting the first sample set into first sub-sample sets corresponding to each second participant based on the identification results; calculating a gradient value corresponding to each first sub-sample set based on the initial parameters of the preset model corresponding to each first sub-sample set and the second sample set of the corresponding second participant; determining a first parameter corresponding to each first sub-sample set based on the gradient values; receiving a second parameter sent by each second participant and a third parameter sent by the other first participants; and when the preset model is judged to have converged, determining a target parameter based on the first parameter, the second parameter, and the third parameter, and sending the target parameter to the other participants to complete the joint modeling. The invention also provides a joint modeling device, an electronic device, and a storage medium. The invention improves the accuracy of the joint model.

Description

Joint modeling method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a joint modeling method and apparatus, an electronic device, and a storage medium.
Background
To eliminate data silos while ensuring data security, federated learning is widely applied to joint modeling. During federated learning, the participants do not share raw data: each trains a model on its local data, and the models are updated by exchanging encrypted model parameters to complete the modeling.
Federated learning comprises horizontal federated learning and vertical federated learning. Generally, if the participants share the same sample features but have insufficient samples, a horizontal federated learning scheme is adopted; if the participants have enough samples but lack sample features, a vertical federated learning scheme is adopted. However, when both sample features and sample numbers are lacking, the accuracy of the constructed joint model is low whether a horizontal or a vertical federated learning scheme is used. Therefore, a joint modeling method is needed that improves the accuracy of the joint model when both sample features and sample numbers are insufficient.
Disclosure of Invention
In view of the above, there is a need to provide a joint modeling method, aiming at improving the accuracy of the joint model.
The joint modeling method provided by the invention is applied to any first participant in a joint modeling system. The joint modeling system comprises a plurality of first participants and a plurality of second participants that are communicatively connected; each first participant shares the same sample objects but different sample features with each second participant, and the second participants share the same sample features but different sample objects with one another. The method comprises the following steps:
receiving a public key in a homomorphic encryption key pair sent by each second participant in the joint modeling system, respectively executing public sample ID identification processing on a first sample set which is locally stored and does not contain label information and a second sample set which carries the label information of each second participant based on the public key, and splitting the first sample set into first sub-sample sets corresponding to each second participant based on a public sample ID identification result;
acquiring initial parameters of a preset model corresponding to each first sub-sample set, and calculating a gradient value corresponding to each first sub-sample set based on the public key, the initial parameters and a second sample set of a corresponding second participant;
performing parameter updating processing on a preset model corresponding to each first sub-sample set based on the gradient values to obtain a first parameter corresponding to each first sub-sample set;
receiving, from each second participant, a second parameter corresponding to its second sample set and a loss value, both processed with a secure aggregation algorithm, and receiving, from the other first participants, third parameters corresponding to their sub-sample sets, likewise processed with the secure aggregation algorithm;
and judging whether the preset model is converged or not based on the loss value, and if so, determining a target parameter based on the first parameter, the second parameter and the third parameter, and respectively sending the target parameter to other participants in the joint modeling system to complete joint modeling.
Optionally, the performing, based on the public key, a common sample ID identification process on the locally stored first sample set without the tag information and the second sample set carrying the tag information of each second participant respectively includes:
selecting a second participant, calculating a first hash value of each sample ID in the first sample set, encrypting the first hash value by adopting a public key in a homomorphic encryption key pair corresponding to the selected second participant to obtain a first ciphertext, and establishing a mapping relation between the first ciphertext and the sample ID;
receiving a second ciphertext sent by the selected second participant, wherein the second ciphertext is obtained by encrypting a second hash value of each sample ID in a second sample set of the selected second participant by using a public key in the same homomorphic encryption key pair;
and calculating the intersection of the first ciphertext and the second ciphertext to obtain a common sample ID ciphertext, and determining plaintext data of the common sample ID ciphertext based on the mapping relation.
Optionally, the calculating a gradient value corresponding to each first sub-sample set based on the public key, the initial parameter, and a second sample set of a corresponding second participant includes:
starting a plurality of processes according to the number of the first sub-sample sets, and calculating a first feature matrix corresponding to each first sub-sample set by each process according to the corresponding first sub-sample set and the initial parameters thereof;
sending the first characteristic matrix to a corresponding second participant, and receiving an error value sent by the corresponding second participant and encrypted by using the public key, wherein the error value is obtained by the corresponding second participant through calculation according to a second characteristic matrix of a second sample set of the second participant and the first characteristic matrix;
and substituting the encrypted error values into a gradient value calculation formula to obtain the encrypted gradient value corresponding to each first subsample set, and sending the encrypted gradient value to the corresponding second participant to obtain the plaintext data of the encrypted gradient value.
Optionally, the process of calculating an error value according to the second feature matrix of the second sample set and the first feature matrix of the corresponding second participant includes:
the corresponding second participant calculating the feature values of its second sample set based on the second feature matrix of the second sample set and the first feature matrix;
inputting the feature values into the preset model to obtain the predicted values of the second sample set;
and determining the real values of the second sample set based on the label information, calculating error values based on the real values and the predicted values, encrypting the error values with the public key of the corresponding homomorphic encryption key pair, and then sending the encrypted error values to the corresponding first participant.
Optionally, the sending the encrypted gradient value to a corresponding second party to obtain plaintext data of the encrypted gradient value includes:
generating a third random number for each second party, encrypting the corresponding third random number by adopting a corresponding public key, calculating the sum of the encrypted gradient value and the encrypted third random number to obtain an encrypted sum, and sending the encrypted sum to the corresponding second party;
and receiving the value obtained when the corresponding second party decrypts the encrypted sum, and subtracting the corresponding third random number from that value to obtain the decrypted gradient value corresponding to the corresponding first sub-sample set.
Optionally, each process calculates a first feature matrix corresponding to each first sub-sample set according to the corresponding first sub-sample set and the initial parameter thereof, and includes:
selecting a process, acquiring a first sub-sample set and initial parameters corresponding to the process, determining an initial feature matrix of the acquired first sub-sample set, and calculating a first feature matrix corresponding to the acquired first sub-sample set based on the initial feature matrix and the initial parameters.
Optionally, the calculation formula of the loss value is:
L_i = -(1/n) Σ_{j=1}^{n} [ y_{ij} log h_θ(x_{ij}) + (1 - y_{ij}) log(1 - h_θ(x_{ij})) ]
where L_i is the loss value of the ith second participant, y_{ij} is the true value of the jth sample in the second sample set of the ith second participant, h_θ(x_{ij}) is the predicted value of the jth sample in the second sample set of the ith second participant, and n is the total number of samples in the second sample set of the ith second participant.
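Assuming the loss is the standard logistic (cross-entropy) loss over one second participant's sample set, which is consistent with the sigmoid-style predicted values used in this scheme, the per-participant loss can be sketched as follows (a minimal illustration, not the patent's implementation):

```python
import numpy as np

def participant_loss(y_true, y_pred):
    # mean cross-entropy loss over one second participant's sample set:
    # L_i = -(1/n) * sum_j [ y_ij*log(h) + (1-y_ij)*log(1-h) ]
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    return -np.sum(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred)) / n

loss = participant_loss([1, 0], [0.9, 0.1])  # both samples predicted well
```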
In order to solve the above problems, the present invention also provides a joint modeling apparatus, including:
the receiving module is used for receiving a public key in a homomorphic encryption key pair sent by each second participant in the joint modeling system, respectively executing public sample ID identification processing on a first sample set which is locally stored and does not contain label information and a second sample set which carries the label information of each second participant based on the public key, and splitting the first sample set into first sub-sample sets corresponding to each second participant based on a public sample ID identification result;
the calculation module is used for acquiring initial parameters of a preset model corresponding to each first subsample set and calculating a gradient value corresponding to each first subsample set based on the public key, the initial parameters and a second subsample set of a corresponding second participant;
the updating module is used for performing parameter updating processing on the preset model corresponding to each first sub-sample set based on the gradient values to obtain a first parameter corresponding to each first sub-sample set;
a receiving module for receiving, from each second participant, the second parameter corresponding to its second sample set and the loss value, both processed with a secure aggregation algorithm, and for receiving, from the other first participants, the third parameters corresponding to their sub-sample sets, likewise processed with the secure aggregation algorithm;
and the determining module is used for judging whether the preset model is converged or not based on the loss value, and when the preset model is converged, determining a target parameter based on the first parameter, the second parameter and the third parameter, and respectively sending the target parameter to other participants in the joint modeling system to complete joint modeling.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores a joint modeling program executable by the at least one processor to enable the at least one processor to perform the joint modeling method described above.
In order to solve the above problems, the present invention also provides a computer-readable storage medium having stored thereon a joint modeling program executable by one or more processors to implement the joint modeling method described above.
Compared with the prior art, the invention first performs common sample ID identification processing on the local first sample set and the second sample set of each second participant, and splits the first sample set into first sub-sample sets corresponding to each second participant based on the identification results; it then calculates a gradient value for each first sub-sample set based on the initial parameters of the corresponding preset model and the second sample set of the corresponding second participant, and determines a first parameter for each first sub-sample set based on the gradient values; it then receives a second parameter sent by each second participant and third parameters, corresponding to their sub-sample sets, sent by the other first participants; finally, when the preset model is judged to have converged, it determines a target parameter based on the first, second, and third parameters and sends the target parameter to the other participants to complete the joint modeling. The method models jointly with horizontal participants (same sample features, different sample objects) and vertical participants (same sample objects, different sample features), thereby improving the accuracy of the joint model.
Drawings
FIG. 1 is a schematic flow chart diagram of a joint modeling method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a joint modeling apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing a joint modeling method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions from different embodiments may be combined, provided the combination can be realized by a person skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
The invention provides a joint modeling method which is applied to any one first participant in a joint modeling system, wherein the joint modeling system comprises a plurality of first participants and a plurality of second participants which are in communication connection. Fig. 1 is a schematic flow chart of a joint modeling method according to an embodiment of the present invention. The method may be performed by an electronic device (the electronic device corresponding to the first party performing the scheme), which may be implemented by software and/or hardware.
In this embodiment, each first participant shares the same sample objects but different sample features with each second participant, and the second participants share the same sample features but different sample objects with one another. The joint modeling method includes:
s1, receiving a public key in a homomorphic encryption key pair sent by each second participant in the joint modeling system, respectively executing public sample ID recognition processing on a first sample set which is locally stored and does not contain label information and a second sample set which carries label information of each second participant based on the public key, and splitting the first sample set into first sub-sample sets corresponding to each second participant based on a public sample ID recognition result.
A homomorphic encryption algorithm has the following property: if a preset calculation formula is applied to homomorphically encrypted data and the operation result is then decrypted, the decrypted result is the same as the result of applying the same formula to the unencrypted original data. By adopting a homomorphic encryption algorithm, the data can be operated on normally without revealing the original data.
In this embodiment, each second participant generates a plurality of homomorphic encryption key pairs according to the number of first participants and sends the public key of each key pair to the corresponding first participant. The second participant holds both the public key and the private key, while the first participant holds only the public key; the public key is used for encryption and the private key for decryption, so the first participant cannot decrypt ciphertext data sent by the second participant, fully ensuring data security.
In this embodiment, the first participant and each second participant share the same sample objects but different sample features, the second participants share the same sample features but different sample objects, and the samples in the second sample set of each second participant carry label information.
For example, assume the first participants include a social security bureau and a shopping platform, and the second participants include bank 1, bank 2, and bank 3. The first sample set of the social security bureau contains sample data with sample IDs 1-10000, whose sample features include the social security payment base, the social security payment company, the social security payment duration, and so on; the first sample set of the shopping platform contains sample data with sample IDs 1-9500, whose sample features include the user's number of purchases, purchase types, and purchase amounts over the last half year. The second sample set of bank 1 contains sample data with sample IDs 1-1000, including the user's number of deposits, number of borrowings, and the number and types of purchased financial products over the last half year; the second sample sets of bank 2 and bank 3 contain sample data with sample IDs 1001-6000 and 6001-9000 respectively, with the same sample features as bank 1.
If the first participant executing the scheme is the social security bureau, it takes the sample data with sample IDs 1-1000 (the sample IDs shared with bank 1) in its first sample set as the 1st first sub-sample set, the sample data with sample IDs 1001-6000 (shared with bank 2) as the 2nd first sub-sample set, and the sample data with sample IDs 6001-9000 (shared with bank 3) as the 3rd first sub-sample set. The remaining sample data in the first sample set cannot be used in the modeling process and may be transferred out of the first sample set for storage elsewhere.
For the social security bureau, the correspondence between the first sub-sample sets and the second participants is:
the 1st first sub-sample set corresponds to bank 1;
the 2nd first sub-sample set corresponds to bank 2;
the 3rd first sub-sample set corresponds to bank 3.
In the same way, each other first participant splits its local sample set into sub-sample sets corresponding to each second participant.
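The splitting step above can be sketched as a simple partition of the first sample set by the common-ID sets, shown here at a reduced scale of the bank example (the dictionary layout and party names are assumptions made for the sketch):

```python
def split_first_sample_set(first_set, common_ids_by_party):
    """first_set: {sample_id: features};
    common_ids_by_party: {party_name: set of common sample IDs}."""
    # one first sub-sample set per second participant
    subsets = {party: {sid: first_set[sid] for sid in ids if sid in first_set}
               for party, ids in common_ids_by_party.items()}
    # samples shared with no second participant are set aside
    used = set().union(*common_ids_by_party.values())
    leftover = {sid: f for sid, f in first_set.items() if sid not in used}
    return subsets, leftover

# reduced-scale version of the social security bureau example
first_set = {sid: f"features-{sid}" for sid in range(1, 11)}  # IDs 1-10
common = {"bank1": {1, 2}, "bank2": {3, 4, 5, 6}, "bank3": {7, 8, 9}}
subsets, leftover = split_first_sample_set(first_set, common)
print(sorted(leftover))  # -> [10]: the unused sample is stored elsewhere
```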
The performing, based on the public key, a common sample ID identification process on a locally stored first sample set that does not contain tag information and a second sample set that carries tag information of each second party, respectively, includes:
a11, selecting a second participant, calculating a first hash value of each sample ID in the first sample set, encrypting the first hash value by using a public key in a homomorphic encryption key pair corresponding to the selected second participant to obtain a first ciphertext, and establishing a mapping relation between the first ciphertext and the sample ID;
the sample ID may be a user ID, and the user ID may be a mobile phone number, an identification number, or other information for identifying the user identity.
The sample ID is subjected to a hash operation and encryption processing, so the security of the sample ID is fully guaranteed.
A12, receiving a second ciphertext sent by the selected second participant, wherein the second ciphertext is obtained by encrypting a second hash value of each sample ID in a second sample set of the selected second participant by using a public key in the same homomorphic encryption key pair;
the selected second participant also performs a hash operation and encryption process on its second sample set.
And A13, calculating the intersection of the first ciphertext and the second ciphertext to obtain a common sample ID ciphertext, and determining plaintext data of the common sample ID ciphertext based on the mapping relation.
The mapping relation comprises the mapping relation between each sample ID in the first sample set and the corresponding first ciphertext, and plaintext data of the ciphertext of the common sample ID can be inquired from the mapping relation.
The first participant executing the scheme also sends the common sample ID ciphertext to the selected second participant, so that the selected second participant can likewise obtain the common sample IDs.
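Steps A11-A13 can be sketched as follows. Note one simplifying assumption: a deterministic keyed hash stands in for the patent's public-key encryption, since intersecting ciphertexts by equality requires the "encryption" of a given hash to be deterministic; everything else (hash, mapping, intersection, plaintext lookup) follows the steps above.

```python
import hashlib
import hmac

KEY = b"shared-demo-key"  # stand-in for the selected second participant's key

def encrypt_id(sample_id):
    # A11: hash the sample ID, then "encrypt" the hash (keyed hash stand-in)
    h = hashlib.sha256(str(sample_id).encode()).hexdigest()
    return hmac.new(KEY, h.encode(), hashlib.sha256).hexdigest()

def common_sample_ids(first_ids, second_ciphertexts):
    # A11: build the ciphertext -> sample ID mapping for the first sample set
    mapping = {encrypt_id(i): i for i in first_ids}
    # A13: intersect ciphertexts, then recover plaintext IDs via the mapping
    common = set(mapping) & set(second_ciphertexts)
    return sorted(mapping[c] for c in common)

# A12: the selected second participant sends ciphertexts of its own IDs
second_ct = {encrypt_id(i) for i in [3, 4, 5, 6]}
print(common_sample_ids([1, 2, 3, 4], second_ct))  # -> [3, 4]
```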
And S2, obtaining initial parameters of the preset model corresponding to each first sub-sample set, and calculating a gradient value corresponding to each first sub-sample set based on the public key, the initial parameters and a second sample set of a corresponding second participant.
In this embodiment, a plurality of preset models (each preset model has the same structure) are configured according to the number of the first sub-sample sets, and the preset models are initialized to obtain initial parameters corresponding to each first sub-sample set.
Assuming the first participant executing the scheme is the social security bureau, it configures three preset models and generates three initial parameters θ_A1, θ_A2, θ_A3. In this embodiment, the dimension of an initial parameter is determined by the feature dimensions of the samples in the sample set: if each sample has 10 features, the initial parameter is a 10 × 1 vector.
For the shopping platform, three preset models are configured, generating three initial parameters θ_B1, θ_B2, θ_B3; bank 1, bank 2, and bank 3 each configure only one preset model, generating the initial parameters θ_C1, θ_C2, θ_C3 respectively.
In this embodiment, each first participant starts a plurality of processes according to the number of the sub-sample sets thereof, each process performs gradient value operation based on the local sub-sample set and the second sample set of the second participant having the same sample ID, and the following describes a calculation process of the gradient value corresponding to each first sub-sample set thereof by executing the first participant of the present scheme.
Said computing a gradient value corresponding to each first subsample set based on said public key, initial parameters and a corresponding second set of samples of a second participant, comprising steps B11-B13:
b11, starting a plurality of processes according to the number of the first sub-sample sets, and calculating a first feature matrix corresponding to each first sub-sample set by each process according to the corresponding first sub-sample sets and initial parameters thereof;
for example, for the social security, three processes are started, and each process correspondingly processes one first subsample set.
Each process calculates a first feature matrix corresponding to each first subsample set according to the corresponding first subsample set and the initial parameters thereof, and the method comprises the following steps:
selecting a process, acquiring a first sub-sample set and initial parameters corresponding to the process, determining an initial feature matrix of the acquired first sub-sample set, and calculating a first feature matrix corresponding to the acquired first sub-sample set based on the initial feature matrix and the initial parameters.
Assuming the selected first sub-sample set is the 1st first sub-sample set A1 of the social security bureau, if A1 contains 1000 samples and each sample has 10 features, the initial feature matrix of the acquired first sub-sample set is a 1000 × 10 matrix, and the product of the initial feature matrix and the initial parameters is calculated to obtain the first feature matrix corresponding to the selected first sub-sample set.
The calculation formula of the first feature matrix is:
(θx)_{A1-1} = x_{A1} θ_{A1}
where (θx)_{A1-1} is the first feature matrix corresponding to the selected first sub-sample set, x_{A1} is the initial feature matrix corresponding to the selected first sub-sample set, and θ_{A1} is the initial parameter corresponding to the selected first sub-sample set.
Since θ_{A1} is a 10 × 1 vector (a column vector of 10 rows) and x_{A1} is a 1000 × 10 matrix, (θx)_{A1-1} is a 1000 × 1 vector.
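The dimensions work out as stated; a minimal numpy sketch of the first-feature-matrix computation (with random placeholder data standing in for the real sample features and parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
x_A1 = rng.normal(size=(1000, 10))     # initial feature matrix: 1000 samples, 10 features
theta_A1 = rng.normal(size=(10, 1))    # initial parameters: 10 x 1 column vector

# first feature matrix: product of the feature matrix and the parameters
theta_x_A1_1 = x_A1 @ theta_A1
assert theta_x_A1_1.shape == (1000, 1)  # a 1000 x 1 vector, as stated
```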
B12, sending the first feature matrix to a corresponding second participant, and receiving an error value sent by the corresponding second participant and encrypted by using the public key, wherein the error value is obtained by the corresponding second participant through calculation according to a second feature matrix of a second sample set of the second participant and the first feature matrix;
since the first feature matrix is a product and the initial feature matrix and the initial parameters of the corresponding first subsample set are not revealed, encrypted transmission is not required.
If the second participant corresponding to A1 is bank 1, the calculation formula of the second feature matrix of the second sample set of bank 1 is:
(θx)_{A1-2} = x_{A2} θ_{A2}
where (θx)_{A1-2} is the second feature matrix of the second sample set of the corresponding second participant, x_{A2} is the initial feature matrix of the second sample set of the corresponding second participant, and θ_{A2} is the initial parameter of the preset model of the corresponding second participant.
The process by which the corresponding second participant calculates error values from the second feature matrix of its second sample set and the first feature matrix, includes steps C11-C13:
C11, the corresponding second participant calculates the feature values of its second sample set based on the second feature matrix of its second sample set and the first feature matrix;
The calculation formula of the feature values is:
(θx)_{A1} = (θx)_{A1-1} + (θx)_{A1-2}
where (θx)_{A1} is the feature value of the second sample set of the corresponding second participant, (θx)_{A1-1} is the first feature matrix of the corresponding first sub-sample set, and (θx)_{A1-2} is the second feature matrix of the second sample set of the corresponding second participant.
C12, inputting the feature values into the preset model to obtain the predicted values of the second sample set;
The calculation formula of the predicted value is:
h_{A1}(x) = 1 / (1 + e^{-(θx)_{A1}})
where h_{A1}(x) is the predicted value of the second sample set of the corresponding second participant and (θx)_{A1} is the feature value of the second sample set of the corresponding second participant.
And C13, determining a real value of the second sample set based on the label information, calculating an error value based on the real value and the predicted value, encrypting the error value by adopting a public key in a corresponding homomorphic encryption key pair, and then sending the encrypted error value to a corresponding first participant.
In this embodiment, the difference between the real value and the predicted value is used as the error value, and the error value is encrypted with the public key of the homomorphic encryption key pair and then sent to the first participant executing the scheme (i.e., the social security bureau).
Since (θx)_{A1-1} and (θx)_{A1-2} are both 1000 × 1 column vectors, (θx)_{A1} and h_{A1}(x) are also 1000 × 1 column vectors. The value in each row of the column vector h_{A1}(x) is the predicted value of one sample in the second sample set.
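Steps C11-C13 on the second participant's side can be sketched in numpy, assuming the preset model applies a sigmoid to the feature values (random placeholder data stands in for the two partial feature matrices and the labels):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
theta_x_1 = rng.normal(size=(n, 1))    # partial result received from the first participant
theta_x_2 = rng.normal(size=(n, 1))    # partial result computed locally by the bank
y = rng.integers(0, 2, size=(n, 1))    # label information (true values)

theta_x = theta_x_1 + theta_x_2        # C11: feature values of the second sample set
h = 1.0 / (1.0 + np.exp(-theta_x))     # C12: predicted values (sigmoid assumed)
error = y - h                          # C13: error value, one entry per sample
assert error.shape == (n, 1)           # still a 1000 x 1 column vector
```

In the scheme itself the error vector would then be homomorphically encrypted before being returned to the first participant.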
B13, substituting the encrypted error value into the gradient value calculation formula to obtain the encrypted gradient value corresponding to each first sub-sample set, and sending the encrypted gradient value to the corresponding second participant to obtain the plaintext data of the encrypted gradient value.
The gradient value calculation formula is as follows:

g_i = (1/n) · Σ_{j=1}^{n} (y_{ij} − h_θ(x_{ij})) · x_{ij}

where g_i is the gradient value corresponding to the ith first sub-sample set, x_{ij} is the original value of the jth sample in the ith first sub-sample set, y_{ij} − h_θ(x_{ij}) is the error value of the jth sample, and n is the total number of samples in the ith first sub-sample set.
In this embodiment, the encrypted error value is substituted into the gradient value calculation formula, and a homomorphic encryption algorithm is adopted to calculate and obtain the encrypted gradient value corresponding to each first sub-sample set.
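The plaintext form of this gradient computation can be sketched as follows (toy numbers; in the scheme the same sum of products is evaluated over encrypted error values, which an additively homomorphic scheme supports because only additions and plaintext-scalar multiplications are needed):

```python
def gradient(errors, X):
    """Plaintext form of g = (1/n) * sum_j (y_j - h(x_j)) * x_j,
    one component per model parameter (feature dimension)."""
    n = len(errors)
    dim = len(X[0])
    return [sum(e * x[k] for e, x in zip(errors, X)) / n for k in range(dim)]

# Hypothetical first sub-sample set: three samples, two features each.
X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
# Hypothetical error values y_j - h(x_j) received (decrypted here for clarity).
errs = [0.2, -0.1, 0.4]
g = gradient(errs, X)
```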
The sending of the encrypted gradient value to the corresponding second participant to obtain the plaintext data of the encrypted gradient value includes:
D11, generating a third random number for each second participant, encrypting the corresponding third random number with the corresponding public key, calculating the sum of the encrypted gradient value and the encrypted third random number to obtain an encrypted sum, and sending the encrypted sum to the corresponding second participant;
Here the corresponding third random number is encrypted with the public key in the homomorphic encryption key pair of the corresponding second participant.
D12, receiving the value obtained by the corresponding second participant decrypting the encrypted sum, and subtracting the corresponding third random number from that value to obtain the decrypted gradient value corresponding to the first sub-sample set.
In this embodiment, the purpose of generating the random number is to ensure data security: after the corresponding second participant receives the encrypted sum, it decrypts it to obtain the sum of the gradient value and the corresponding third random number, but since the specific value of the third random number is unknown to it, it cannot learn the gradient value itself.
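Steps D11-D12 can be illustrated with a deliberately insecure stand-in for the homomorphic ciphertext (the `FakeCiphertext` class below simply carries the plaintext so the protocol flow can run; a real deployment would use an additively homomorphic scheme such as Paillier):

```python
import random

class FakeCiphertext:
    """Insecure stand-in for an additively homomorphic ciphertext:
    the 'ciphertext' carries the plaintext so that Enc(a) + Enc(b) = Enc(a + b)
    holds trivially. Only the protocol flow is being illustrated."""
    def __init__(self, value):
        self.value = value
    def __add__(self, other):
        return FakeCiphertext(self.value + other.value)

def encrypt(v):   # simulates encryption under the second participant's public key
    return FakeCiphertext(v)

def decrypt(c):   # simulates decryption with the second participant's private key
    return c.value

# D11: the first participant masks the encrypted gradient with an encrypted random number.
enc_gradient = encrypt(0.3166)        # Enc(g), obtained homomorphically (hypothetical value)
r = random.uniform(1.0, 100.0)        # third random number, known only to the first participant
enc_sum = enc_gradient + encrypt(r)   # Enc(g + r), sent to the second participant

# D12: the second participant decrypts and returns g + r; the first subtracts r.
g_plus_r = decrypt(enc_sum)           # second participant only ever sees g + r
g = g_plus_r - r                      # plaintext gradient recovered by the first participant
```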
S3, performing parameter updating processing on the preset model corresponding to each first sub-sample set based on the gradient values to obtain the first parameter corresponding to each first sub-sample set.
After obtaining the plaintext gradient values, each process can update the parameters of the preset model corresponding to its first sub-sample set to obtain the first parameter corresponding to that first sub-sample set.
S4, receiving, from each second participant, the second parameter and the loss value corresponding to its second sample set, both processed with the secure aggregation algorithm, and receiving, from the other first participants, the third parameter corresponding to each of their sub-sample sets, also processed with the secure aggregation algorithm.
Each second participant inputs the predicted value of each sample in its second sample set and the true value from the label information into the gradient value calculation formula to obtain the corresponding gradient value, and updates the parameters of its local preset model according to that gradient value to obtain the second parameter. Meanwhile, the loss value is obtained by inputting the predicted values and the true values into the loss function.
The calculation formula of the loss value is as follows:

L_i = −(1/n) · Σ_{j=1}^{n} [ y_{ij} · log(h_θ(x_{ij})) + (1 − y_{ij}) · log(1 − h_θ(x_{ij})) ]

where L_i is the loss value of the ith second participant, y_{ij} is the true value of the jth sample in the second sample set of the ith second participant, h_θ(x_{ij}) is the predicted value of the jth sample in the second sample set of the ith second participant, and n is the total number of samples in the second sample set of the ith second participant.
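A plaintext sketch of this loss computation (hypothetical predicted and true values; the formula is the standard cross-entropy consistent with a logistic preset model):

```python
import math

def loss(y_true, y_pred):
    """Cross-entropy loss for one second participant:
    L = -(1/n) * sum_j [ y_j*log(h_j) + (1-y_j)*log(1-h_j) ]."""
    n = len(y_true)
    return -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for y, h in zip(y_true, y_pred)) / n

# Three hypothetical samples: true labels and model predictions.
l = loss([1, 0, 1], [0.9, 0.2, 0.8])
```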
The other first participants can obtain the third parameter corresponding to each of their sub-sample sets according to steps S1-S4, process the third parameter with the secure aggregation algorithm, and send it to the first participant executing the scheme.
The secure aggregation algorithm is a technique in which different participants use the Diffie-Hellman key exchange to generate paired random numbers that are added and subtracted, hiding the plaintext information from a third party. The random numbers cancel out during aggregation and do not affect the result. That is, the other participants add or subtract random numbers generated from the Diffie-Hellman key exchange in the data they send, and the first participant executing the scheme cancels all the random numbers when aggregating (i.e., summing).
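A toy sketch of this cancellation (the pairwise masks below are drawn from a shared random source standing in for Diffie-Hellman-derived secrets):

```python
import random

def pairwise_masks(party_ids, shared_secret):
    """Each unordered pair (i, j), i < j, shares a mask s_ij; the lower-id
    party adds it and the higher-id party subtracts it, so the masks cancel
    when all masked values are summed."""
    masks = {}
    for i in party_ids:
        total = 0.0
        for j in party_ids:
            if i == j:
                continue
            s = shared_secret(min(i, j), max(i, j))
            total += s if i < j else -s
        masks[i] = total
    return masks

# Hypothetical parameter values held by three participants.
values = {1: 0.2, 2: 0.5, 3: 0.8}

_shared = {}
def shared_secret(a, b):
    # Stand-in for a mask derived from the (a, b) Diffie-Hellman shared key.
    if (a, b) not in _shared:
        _shared[(a, b)] = random.uniform(-10.0, 10.0)
    return _shared[(a, b)]

masks = pairwise_masks(values.keys(), shared_secret)
sent = {p: values[p] + masks[p] for p in values}  # what the aggregator receives
aggregate = sum(sent.values())                    # masks cancel: 0.2 + 0.5 + 0.8
```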
S5, judging whether the preset model has converged based on the loss values; if so, determining the target parameter based on the first parameter, the second parameter and the third parameter, and sending the target parameter to the other participants in the joint modeling system to complete the joint modeling.
In this embodiment, the average of the loss values of the second participants is calculated to obtain a loss average (the loss average is exact because the values are summed first and then divided, and the random numbers cancel during the summation); if the difference between the obtained loss average and the loss average of the previous iteration is smaller than a preset threshold (for example, 0.01%), the preset model has converged.
If the preset model is judged to have converged, the average of the first parameter, the second parameter and the third parameter is calculated (again summing then averaging, with the random numbers canceling during the summation) to obtain the target parameter, which is distributed to the other participants in the joint modeling system to complete the joint modeling.
In this embodiment, step S5 may also be performed by a third party, where the third party may be a public server or a cloud server communicatively connected to each participant in the joint modeling system.
As can be seen from the foregoing embodiments, in the joint modeling method provided by the present invention, common sample ID identification processing is first performed on the local first sample set and the second sample set of each second participant, and the first sample set is split into first sub-sample sets corresponding to each second participant based on the identification result; then, the gradient value corresponding to each first sub-sample set is calculated based on the initial parameters of the preset model corresponding to each first sub-sample set and the second sample set of the corresponding second participant, and the first parameter corresponding to each first sub-sample set is determined based on the gradient value; next, the second parameter sent by each second participant is received, along with the third parameter corresponding to each sub-sample set sent by the other first participants; finally, when the preset model is judged to have converged, the target parameter is determined based on the first parameter, the second parameter and the third parameter, and sent to the other participants to complete the joint modeling. The method combines horizontal participants, which share the same sample features but different sample objects, with vertical participants, which share the same sample objects but different sample features, for modeling, thereby improving the accuracy of the joint model.
Fig. 2 is a schematic block diagram of a joint modeling apparatus according to an embodiment of the present invention.
The joint modeling apparatus 100 of the present invention can be installed in an electronic device. Depending on the implemented functionality, the joint modeling apparatus 100 may include a receiving module 110, a calculating module 120, an updating module 130, a receiving module 140, and a determining module 150. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
a receiving module 110, configured to receive a public key in a homomorphic encryption key pair sent by each second participant in the joint modeling system, perform, based on the public key, public sample ID identification processing on a first sample set that is locally stored and does not contain tag information and a second sample set that carries tag information of each second participant, and split the first sample set into first sub-sample sets corresponding to each second participant based on a public sample ID identification result.
The performing, based on the public key, common sample ID identification processing on the locally stored first sample set that does not contain label information and the second sample set that carries label information of each second participant, respectively, includes:
A21, selecting a second participant, calculating a first hash value of each sample ID in the first sample set, encrypting the first hash value with the public key in the homomorphic encryption key pair corresponding to the selected second participant to obtain a first ciphertext, and establishing a mapping relation between the first ciphertext and the sample ID;
A22, receiving a second ciphertext sent by the selected second participant, wherein the second ciphertext is obtained by encrypting a second hash value of each sample ID in the second sample set of the selected second participant with the public key in the same homomorphic encryption key pair;
A23, calculating the intersection of the first ciphertext and the second ciphertext to obtain the common sample ID ciphertext, and determining the plaintext data of the common sample ID ciphertext based on the mapping relation.
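Steps A21-A23 amount to a private set intersection over keyed digests. A toy sketch (HMAC-SHA256 stands in for the hash-then-encrypt construction; the key here is hypothetical shared material for one second participant, chosen so both sides produce matching tokens):

```python
import hashlib
import hmac

def id_tokens(sample_ids, key):
    """Map each sample ID to a deterministic keyed digest, keeping the
    token -> ID mapping so matches can be translated back to plaintext IDs."""
    return {hmac.new(key, sid.encode(), hashlib.sha256).hexdigest(): sid
            for sid in sample_ids}

key = b"shared-key-for-this-pair"  # hypothetical per-pair key material

# A21: the first participant tokenizes its sample IDs and keeps the mapping.
first = id_tokens(["u1", "u2", "u3", "u5"], key)

# A22: tokens received from the selected second participant (same key).
second = set(id_tokens(["u2", "u3", "u4"], key))

# A23: intersect the token sets and map back to plaintext common sample IDs.
common_ids = sorted(first[t] for t in first.keys() & second)
```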
The calculating module 120 is configured to obtain the initial parameters of the preset model corresponding to each first sub-sample set, and calculate the gradient value corresponding to each first sub-sample set based on the public key, the initial parameters, and the second sample set of the corresponding second participant.
The calculating of the gradient value corresponding to each first sub-sample set based on the public key, the initial parameters and the second sample set of the corresponding second participant includes steps B21-B23:
B21, starting a plurality of processes according to the number of first sub-sample sets, each process calculating the first feature matrix corresponding to a first sub-sample set according to the corresponding first sub-sample set and its initial parameters;
each process calculates a first feature matrix corresponding to each first subsample set according to the corresponding first subsample set and the initial parameters thereof, and the method comprises the following steps:
selecting a process, acquiring a first sub-sample set and initial parameters corresponding to the process, determining an initial feature matrix of the acquired first sub-sample set, and calculating a first feature matrix corresponding to the acquired first sub-sample set based on the initial feature matrix and the initial parameters.
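A minimal sketch of the per-process computation (hypothetical features and initial parameters; each score is the linear combination θ·x over the first participant's own features, and in the actual scheme each process handles one first sub-sample set, e.g. via multiprocessing):

```python
def first_feature_matrix(X, theta):
    """(θx)_{A1-1}: per-sample linear score over the first
    participant's features for one first sub-sample set."""
    return [sum(t * x for t, x in zip(theta, row)) for row in X]

# Hypothetical first sub-sample set: two samples, two features each.
X = [[1.0, 2.0], [0.5, -1.0]]
theta = [0.3, -0.2]               # initial parameters of the preset model
scores = first_feature_matrix(X, theta)
```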
B22, sending the first feature matrix to a corresponding second participant, and receiving an error value sent by the corresponding second participant and encrypted by using the public key, wherein the error value is obtained by the corresponding second participant through calculation according to a second feature matrix of a second sample set of the second participant and the first feature matrix;
the process by which the corresponding second participant calculates error values from the second feature matrix of its second sample set and the first feature matrix, includes steps C21-C23:
c21, the corresponding second participant calculating the eigenvalues of its second sample set based on its second eigenmatrix of the second sample set and the first eigenmatrix;
c22, inputting the characteristic value into a preset model to obtain a predicted value of a second sample set of the preset model;
and C23, determining a real value of the second sample set based on the label information, calculating an error value based on the real value and the predicted value, encrypting the error value by adopting a public key in a corresponding homomorphic encryption key pair, and then sending the encrypted error value to a corresponding first participant.
B23, substituting the encrypted error value into the gradient value calculation formula to obtain the encrypted gradient value corresponding to each first sub-sample set, and sending the encrypted gradient value to the corresponding second participant to obtain the plaintext data of the encrypted gradient value.
The sending of the encrypted gradient value to the corresponding second participant to obtain the plaintext data of the encrypted gradient value includes:
D21, generating a third random number for each second participant, encrypting the corresponding third random number with the corresponding public key, calculating the sum of the encrypted gradient value and the encrypted third random number to obtain an encrypted sum, and sending the encrypted sum to the corresponding second participant;
D22, receiving the value obtained by the corresponding second participant decrypting the encrypted sum, and subtracting the corresponding third random number from that value to obtain the decrypted gradient value corresponding to the first sub-sample set.
An updating module 130, configured to perform parameter updating processing on the preset model corresponding to each first sub-sample set based on the gradient value, so as to obtain a first parameter corresponding to each first sub-sample set.
The receiving module 140 is configured to receive, from each second participant, the second parameter and the loss value corresponding to its second sample set, both processed with the secure aggregation algorithm, and to receive, from the other first participants, the third parameter corresponding to each of their sub-sample sets, also processed with the secure aggregation algorithm.
The calculation formula of the loss value is as follows:

L_i = −(1/n) · Σ_{j=1}^{n} [ y_{ij} · log(h_θ(x_{ij})) + (1 − y_{ij}) · log(1 − h_θ(x_{ij})) ]

where L_i is the loss value of the ith second participant, y_{ij} is the true value of the jth sample in the second sample set of the ith second participant, h_θ(x_{ij}) is the predicted value of the jth sample in the second sample set of the ith second participant, and n is the total number of samples in the second sample set of the ith second participant.
And the determining module 150 is configured to determine whether the preset model converges based on the loss value, determine a target parameter based on the first parameter, the second parameter and the third parameter when the preset model converges, and send the target parameter to other participants in the joint modeling system respectively to complete joint modeling.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a joint modeling method according to an embodiment of the present invention.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with commands that are set or stored in advance. The electronic device 1 may be a computer, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In the present embodiment, the electronic device 1 includes, but is not limited to, a memory 11, a processor 12, and a network interface 13, which are communicatively connected to each other through a system bus, wherein the memory 11 stores a joint modeling program 10 executable by the processor 12. While FIG. 3 shows only the electronic device 1 with the components 11-13 and the joint modeling program 10, those skilled in the art will appreciate that the configuration shown in FIG. 3 does not constitute a limitation of the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
The memory 11 includes a memory and at least one type of readable storage medium. The memory provides a cache for the operation of the electronic device 1; the readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1; in other embodiments, the readable storage medium may also be an external storage device of the electronic device 1, such as a plug-in hard disk provided on the electronic device 1, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card). In this embodiment, the readable storage medium of the memory 11 is generally used for storing the operating system and various application software installed in the electronic device 1, for example, the code of the joint modeling program 10 in an embodiment of the present invention. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is configured to execute the program code stored in the memory 11 or process data, for example, execute the joint modeling program 10.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is used for establishing a communication connection between the electronic device 1 and a client (not shown).
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The joint modeling program 10 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 12, implement the steps of the joint modeling method described above.
Specifically, the processor 12 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the joint modeling program 10, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The computer readable storage medium has stored thereon a joint modeling program 10, and the joint modeling program 10 is executable by one or more processors to implement the steps in the joint modeling method described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A joint modeling method is applied to any one first participant in a joint modeling system, the joint modeling system comprises a plurality of first participants and a plurality of second participants which are in communication connection, each first participant and each second participant comprise the same sample object and different sample characteristics, each second participant comprises the same sample characteristics and different sample objects, and the method comprises the following steps:
receiving a public key in a homomorphic encryption key pair sent by each second participant in the joint modeling system, respectively executing public sample ID identification processing on a first sample set which is locally stored and does not contain label information and a second sample set which carries the label information of each second participant based on the public key, and splitting the first sample set into first sub-sample sets corresponding to each second participant based on a public sample ID identification result;
acquiring initial parameters of a preset model corresponding to each first sub-sample set, and calculating a gradient value corresponding to each first sub-sample set based on the public key, the initial parameters and a second sample set of a corresponding second participant;
performing parameter updating processing on a preset model corresponding to each first sub-sample set based on the gradient values to obtain a first parameter corresponding to each first sub-sample set;
receiving a second parameter and a loss value which are sent by each second participant, correspond to its second sample set and are processed by adopting a secure aggregation algorithm, and receiving a third parameter which is sent by other first participants, corresponds to each of their sub-sample sets and is processed by adopting the secure aggregation algorithm;
and judging whether the preset model is converged or not based on the loss value, and if so, determining a target parameter based on the first parameter, the second parameter and the third parameter, and respectively sending the target parameter to other participants in the joint modeling system to complete joint modeling.
2. The joint modeling method according to claim 1, wherein the performing, based on the public key, common sample ID identification processing on the locally stored first sample set that does not contain label information and the second sample set that carries label information of each second participant respectively comprises:
selecting a second participant, calculating a first hash value of each sample ID in the first sample set, encrypting the first hash value by adopting a public key in a homomorphic encryption key pair corresponding to the selected second participant to obtain a first ciphertext, and establishing a mapping relation between the first ciphertext and the sample ID;
receiving a second ciphertext sent by the selected second participant, wherein the second ciphertext is obtained by encrypting a second hash value of each sample ID in a second sample set of the selected second participant by using a public key in the same homomorphic encryption key pair;
and calculating the intersection of the first ciphertext and the second ciphertext to obtain a common sample ID ciphertext, and determining plaintext data of the common sample ID ciphertext based on the mapping relation.
3. The joint modeling method of claim 1, wherein the computing a gradient value corresponding to each first subsample set based on the public key, initial parameters, and a corresponding second sample set of a second participant comprises:
starting a plurality of processes according to the number of the first sub-sample sets, and calculating a first feature matrix corresponding to each first sub-sample set by each process according to the corresponding first sub-sample set and the initial parameters thereof;
sending the first characteristic matrix to a corresponding second participant, and receiving an error value sent by the corresponding second participant and encrypted by using the public key, wherein the error value is obtained by the corresponding second participant through calculation according to a second characteristic matrix of a second sample set of the second participant and the first characteristic matrix;
and substituting the encrypted error values into a gradient value calculation formula to obtain the encrypted gradient value corresponding to each first subsample set, and sending the encrypted gradient value to the corresponding second participant to obtain the plaintext data of the encrypted gradient value.
4. The joint modeling method of claim 3, wherein the process of calculating the error value by the corresponding second participant from the second feature matrix of the second sample set and the first feature matrix comprises:
the corresponding second participant calculates a feature value of the second sample set based on the second feature matrix of the second sample set and the first feature matrix;
inputting the feature value into a preset model to obtain a predicted value of the second sample set output by the preset model;
and determining a real value of the second sample set based on the label information, calculating an error value based on the real value and the predicted value, encrypting the error value by adopting a public key in a corresponding homomorphic encryption key pair, and then sending the encrypted error value to a corresponding first participant.
5. The joint modeling method of claim 3, wherein the sending the encrypted gradient values to a corresponding second participant to obtain plaintext data for the encrypted gradient values comprises:
generating a third random number for each second participant, encrypting the corresponding third random number by adopting the corresponding public key, calculating the sum of the encrypted gradient value and the encrypted third random number to obtain an encrypted sum, and sending the encrypted sum to the corresponding second participant;
and receiving the value obtained by the corresponding second participant decrypting the encrypted sum, and subtracting the corresponding third random number from the obtained value to obtain the decrypted gradient value corresponding to the corresponding first sub-sample set.
6. The joint modeling method of claim 3, wherein each process calculates a first feature matrix corresponding to each first subsample set according to the corresponding first subsample set and its initial parameters, comprising:
selecting a process, acquiring a first sub-sample set and initial parameters corresponding to the process, determining an initial feature matrix of the acquired first sub-sample set, and calculating a first feature matrix corresponding to the acquired first sub-sample set based on the initial feature matrix and the initial parameters.
7. The joint modeling method of claim 1, wherein the loss value is calculated by the formula:
L_i = −(1/n) · Σ_{j=1}^{n} [ y_{ij} · log(h_θ(x_{ij})) + (1 − y_{ij}) · log(1 − h_θ(x_{ij})) ]

wherein L_i is the loss value of the ith second participant, y_{ij} is the true value of the jth sample in the second sample set of the ith second participant, h_θ(x_{ij}) is the predicted value of the jth sample in the second sample set of the ith second participant, and n is the total number of samples in the second sample set of the ith second participant.
8. A joint modeling apparatus, the apparatus comprising:
the receiving module is used for receiving a public key of a homomorphic encryption key pair sent by each second participant in the joint modeling system, performing common sample ID identification processing, based on the public key, on a locally stored first sample set containing no label information and the second sample set of each second participant carrying label information, and splitting the first sample set into first sub-sample sets corresponding to each second participant based on the common sample ID identification result;
the calculation module is used for acquiring initial parameters of the preset model corresponding to each first sub-sample set, and calculating a gradient value corresponding to each first sub-sample set based on the public key, the initial parameters and the second sample set of the corresponding second participant;
the updating module is used for performing parameter updating processing on the preset model corresponding to each first sub-sample set based on the gradient values to obtain a first parameter corresponding to each first sub-sample set;
the receiving module is further used for receiving the second parameter and the loss value, processed by a security aggregation algorithm, corresponding to the second sample set of each second participant, and receiving the third parameter, processed by the security aggregation algorithm, corresponding to each sub-sample set sent by other first participants;
and the determining module is used for judging, based on the loss value, whether the preset model has converged, and, when the preset model has converged, determining a target parameter based on the first parameter, the second parameter and the third parameter, and sending the target parameter respectively to the other participants in the joint modeling system to complete the joint modeling.
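The claims name only "a security aggregation algorithm" without specifying one; a pairwise-mask scheme, assumed here, is a common choice: each pair of participants shares a random mask that one adds and the other subtracts, so masks cancel in the aggregate while no single masked parameter reveals its plaintext. The function name and values are hypothetical:

```python
import random

def mask_updates(updates, seed=0):
    # Pairwise-mask secure aggregation sketch: for each pair (i, j),
    # participant i adds a shared random mask and participant j subtracts it.
    # Individual masked values are blinded, but their sum is unchanged.
    rng = random.Random(seed)  # stands in for pairwise-agreed shared seeds
    k = len(updates)
    masked = list(updates)
    for i in range(k):
        for j in range(i + 1, k):
            m = rng.random()
            masked[i] += m   # participant i adds the shared mask
            masked[j] -= m   # participant j subtracts the same mask
    return masked

# Second/third parameters contributed by three participants.
params = [0.2, 0.5, 0.3]
masked = mask_updates(params)
aggregate = sum(masked)      # masks cancel: equals sum of the true parameters
```

Each transmitted value is blinded, yet the aggregator still recovers the exact sum needed to form the target parameter.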
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a joint modeling program executable by the at least one processor to enable the at least one processor to perform a joint modeling method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a joint modeling program executable by one or more processors to implement the joint modeling method of any one of claims 1 to 7.
CN202111607572.2A 2021-12-24 2021-12-24 Joint modeling method and device, electronic equipment and storage medium Pending CN114298321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111607572.2A CN114298321A (en) 2021-12-24 2021-12-24 Joint modeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111607572.2A CN114298321A (en) 2021-12-24 2021-12-24 Joint modeling method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114298321A true CN114298321A (en) 2022-04-08

Family

ID=80969649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111607572.2A Pending CN114298321A (en) 2021-12-24 2021-12-24 Joint modeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114298321A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996733A (en) * 2022-06-07 2022-09-02 光大科技有限公司 Aggregation model updating processing method and device
CN114996733B (en) * 2022-06-07 2023-10-20 光大科技有限公司 Aggregation model updating processing method and device

Similar Documents

Publication Publication Date Title
CN112491551B (en) Data verification method and device based on block chain and electronic equipment
CN111989893B (en) Method, system and computer readable device for generating and linking zero knowledge proofs
Tang et al. Protecting genomic data analytics in the cloud: state of the art and opportunities
WO2019144612A1 (en) Zero-knowledge multi-account-book exchange transfer method and apparatus based on blockchain, and storage medium
US20210203481A1 (en) Systems and methods for storage, generation and verification of tokens used to control access to a resource
US20150089574A1 (en) Columnar Table Data Protection
CN109844783A (en) The database that the ledger of immutable cryptoguard is supported
CN111666460A (en) User portrait generation method and device based on privacy protection and storage medium
CN112804218B (en) Block chain-based data processing method, device, equipment and storage medium
CN112949760A (en) Model precision control method and device based on federal learning and storage medium
CN113420049B (en) Data circulation method, device, electronic equipment and storage medium
US20230216682A1 (en) Managing the consistency of digital assets in a metaverse
CN111027981A (en) Method and device for multi-party joint training of risk assessment model for IoT (Internet of things) machine
CN112163635A (en) Image classification method, device, server and medium based on deep learning
CN114298321A (en) Joint modeling method and device, electronic equipment and storage medium
Geva et al. Collaborative privacy-preserving analysis of oncological data using multiparty homomorphic encryption
CN114386058A (en) Model file encryption and decryption method and device
CN113656497A (en) Data verification method and device based on block chain
CN112966309A (en) Service implementation method and device based on block chain
CN114422105A (en) Joint modeling method and device, electronic equipment and storage medium
WO2019191579A1 (en) System and methods for recording codes in a distributed environment
CN107911220B (en) Signature method, signature device and terminal equipment
CN112286703B (en) User classification method and device, client device and readable storage medium
CN114629663A (en) Block chain-based digital commodity transaction method and device
CN114298211A (en) Feature binning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination