CN114723070A - Method and system for generating machine learning model and method for processing user data - Google Patents

Method and system for generating machine learning model and method for processing user data Download PDF

Info

Publication number
CN114723070A
CN114723070A (application CN202210404748.2A)
Authority
CN
China
Prior art keywords
user
ciphertext
provider
user data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210404748.2A
Other languages
Chinese (zh)
Inventor
徐武兴
杨恺
范昊
黄志翔
彭南博
Current Assignee
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202210404748.2A
Publication of CN114723070A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008 Cryptographic mechanisms or cryptographic arrangements involving homomorphic encryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords

Abstract

The disclosure relates to a method and a system for generating a machine learning model and a method for processing user data, in the technical field of computers. The data provider generates a label prediction value ciphertext for each user data sample from the parameter information ciphertext of the machine learning model sent by the label provider, then calculates gradient information ciphertexts of the different user classifications from the label prediction value ciphertexts, the label ciphertexts of the samples, and the user classification to which each user data sample belongs; the label provider calculates the information gain of the different user classifications from these gradient information ciphertexts and instructs the data provider to generate the machine learning model according to those information gains. The technical scheme of the disclosure preserves the privacy of both the data provider and the label provider, thereby improving security.

Description

Method and system for generating machine learning model and method for processing user data
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and a system for generating a machine learning model, a method and a device for processing user data, an electronic device, and a non-transitory computer-readable storage medium.
Background
Data-driven artificial intelligence technology plays a major role across industries and creates great value; in China, data has been listed as a new factor of production. In recent years, however, growing emphasis on privacy protection and data compliance has made free data exchange difficult, giving rise to data silos. These data silos prevent the potential of data from being exploited.
To address this problem, the concept of federated learning has emerged. Federated learning enables multiple participants to build a model jointly while each party's data never leaves its local environment, unlocking the potential of each party's data while preserving data privacy.
In the related art, a machine learning model may be built using methods such as SecureBoost.
Disclosure of Invention
The inventors of the present disclosure found that the following problems exist in the above-described related art: privacy of the data provider and the tag provider cannot be guaranteed, resulting in poor security.
In view of this, the present disclosure provides a technical solution for generating a machine learning model, which can ensure privacy of a data provider and a tag provider, thereby improving security.
According to some embodiments of the present disclosure, there is provided a method of generating a machine learning model, including: the data provider generates a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model transmitted by the label provider; the data provider calculates gradient information ciphertexts of different user classifications according to the label predicted value ciphertexts, the label ciphertexts of each user data sample sent by the label provider and the user classification to which each user data sample belongs; the label provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications; and the label provider instructs the data provider to generate a machine learning model according to the information gain of different user classifications.
In some embodiments, the generating method includes multiple rounds of generation, and generating the tag prediction value ciphertext of each user data sample includes: in the T-th round, the data provider generates the tag prediction value ciphertext of the (T-1)-th round from the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
In some embodiments, generating the tag prediction value ciphertext of the (T-1)-th round comprises: the data provider generates it as the homomorphic addition of the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
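The disclosure does not fix a concrete cryptosystem; Paillier is a common additively homomorphic choice for this kind of ciphertext addition. A minimal toy sketch (deliberately tiny, insecure parameters, fixed randomness — illustration only) of updating the prediction ciphertext by homomorphic addition:

```python
from math import gcd

# Toy Paillier cryptosystem (tiny, insecure parameters -- illustration only).
p, q = 17, 19
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)  # valid decryption constant because g = n + 1

def encrypt(m, r=7):
    # E(m) = g^m * r^n mod n^2 ; E(a) * E(b) mod n^2 decrypts to a + b
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Round T: the prediction ciphertext of round T-1 is the homomorphic sum of
# the parameter-information ciphertext of round T-1 and the prediction
# ciphertext of round T-2; the data provider never decrypts either term.
y_hat_T2 = 5        # plaintext prediction value, round T-2 (label provider side)
w_T1 = 3            # plaintext parameter information, round T-1
c_sum = (encrypt(y_hat_T2, r=7) * encrypt(w_T1, r=11)) % n2  # homomorphic add
print(decrypt(c_sum))  # 8
```

In the protocol only the label provider holds the secret values; the data provider performs just the multiplication modulo n² shown on the `c_sum` line.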
In some embodiments, the generating method further comprises: the data provider divides the user data samples into different user classification sets according to the plurality of pieces of user characteristic information contained in each sample. The generating method includes multiple rounds, and calculating the gradient information ciphertexts of the different user classifications includes: the data provider calculates the gradient information ciphertext of each user data sample from the tag prediction value ciphertext and the tag ciphertext of each sample in the (T-1)-th round, and then calculates, for each user classification set, the weighted sum of the gradient information ciphertexts of the samples in that set as the gradient information ciphertext of that user classification.
In some embodiments, the gradient information ciphertext includes a first order gradient ciphertext and a second order gradient ciphertext.
In some embodiments, calculating the gradient information ciphertext of each user data sample comprises: the data provider calculates the homomorphic multiplication of a first preset value and the tag ciphertext of each user data sample, then calculates the homomorphic addition of that result and the tag prediction value ciphertext of the sample in the (T-1)-th round to determine the first-order gradient ciphertext.
In some embodiments, calculating the gradient information ciphertext of each user data sample comprises: the data provider processes the tag prediction value ciphertext of each user data sample in the (T-1)-th round with a sigmoid function to determine a function processing result; the data provider calculates the homomorphic multiplication of the first preset value and the tag ciphertext of each sample; and the homomorphic addition of the function processing result and the homomorphic multiplication result determines the first-order gradient ciphertext.
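In plaintext form, these are the standard logistic-loss gradients used in XGBoost-style boosting: with prediction ŷ and label y, g = sigmoid(ŷ) − y and h = sigmoid(ŷ)(1 − sigmoid(ŷ)); reading the "first preset value" as the constant −1 multiplied homomorphically into [y] is an assumption. A plaintext sketch (the homomorphic version replaces the subtraction by ciphertext operations):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logistic_gradients(y_hat: float, y: int):
    """First- and second-order gradients of logistic loss, XGBoost-style.

    Homomorphically: [g] = sigmoid(y_hat) (+) (-1) (*) [y], where (+) / (*)
    denote homomorphic addition and scalar multiplication on ciphertexts.
    """
    p = sigmoid(y_hat)
    g = p - y          # first-order gradient
    h = p * (1.0 - p)  # second-order gradient
    return g, h

g, h = logistic_gradients(0.0, 1)
print(g, h)  # -0.5 0.25
```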
In some embodiments, calculating the information gains of the different user classifications includes: the tag provider calculates the information gain of each user classification from the weighted sum of the first-order gradient ciphertexts of the samples in that user classification set, the weighted sum of their second-order gradient ciphertexts, the weighted sum of the first-order gradient ciphertexts of all user data samples, and the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, calculating the information gain of each user classification comprises: the tag provider calculates it from the difference between the weighted sum of a first gain, a second gain and a third gain, and a threshold used to instruct the data provider whether to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the samples in the user classification set and negatively correlated with the weighted sum of their second-order gradient ciphertexts; the second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the samples in the set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the samples in the set; the third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
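The three correlations described above match the standard XGBoost split gain; under that reading (an assumption, since the claims state only monotonic relations), with $G_L, H_L$ the weighted sums of first- and second-order gradients in a candidate user classification set, $G, H$ the totals over all samples, and $\lambda, \gamma$ regularization constants:

```latex
\mathrm{Gain} \;=\; \frac{1}{2}\left[
  \underbrace{\frac{G_L^2}{H_L + \lambda}}_{\text{first gain}}
\;+\; \underbrace{\frac{(G - G_L)^2}{(H - H_L) + \lambda}}_{\text{second gain}}
\;-\; \underbrace{\frac{G^2}{H + \lambda}}_{\text{third gain}}
\right] \;-\; \gamma
```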
In some embodiments, partitioning the user data samples into different sets of user classifications includes: the data provider sets a plurality of classification threshold values according to a plurality of user characteristic information contained in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
In some embodiments, instructing the data provider to generate the machine learning model according to the information gains of the different user classifications includes: the tag provider determines the maximum gain information among the information gains of the different user classifications, and instructs the data provider to generate the machine learning model according to whether the maximum gain information is greater than a threshold.
In some embodiments, the machine learning model is a tree model, and instructing the data provider to generate the machine learning model according to whether the maximum gain information is greater than the threshold comprises: when the maximum gain information is greater than the threshold, the tag provider instructs the data provider to split the current node of the tree model, generating child nodes of the current node that serve as the current nodes of the next iteration; when the maximum gain information is less than or equal to the threshold, the tag provider calculates the weight parameter of the node as parameter information plaintext of the tree model. Determining the maximum gain information among the information gains of the different user classifications includes: the tag provider re-determines the maximum gain information for the current nodes of the next iteration, and the above steps are repeated until generation of the tree model is finished.
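A plaintext sketch of the split decision just described (function names, the XGBoost-style gain formula, and the leaf-weight formula −G/(H+λ) are assumptions; in the protocol the sums would be decrypted aggregates, never per-sample values):

```python
LAMBDA, GAMMA = 1.0, 0.1  # regularization constants (assumed)

def split_gain(GL, HL, G, H):
    """XGBoost-style gain for routing one candidate set to the left child."""
    left = GL * GL / (HL + LAMBDA)
    right = (G - GL) ** 2 / (H - HL + LAMBDA)
    whole = G * G / (H + LAMBDA)
    return 0.5 * (left + right - whole) - GAMMA

def decide(candidates, G, H, threshold=0.0):
    """Tag provider: take the maximum gain over candidate classifications;
    instruct a split if it exceeds the threshold, else emit a leaf weight."""
    best = max(split_gain(GL, HL, G, H) for GL, HL in candidates)
    if best > threshold:
        return ("split", best)
    return ("leaf", -G / (H + LAMBDA))  # leaf weight parameter (plaintext)

# Aggregated gradient sums (GL, HL) for two candidate user classifications:
print(decide([(4.0, 2.0), (1.0, 3.0)], G=5.0, H=5.0))  # -> ("split", gain)
```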
According to another embodiment of the present disclosure, there is provided a user data processing method, including: the users are classified according to the user data using a machine learning model generated according to the method of generating a machine learning model in any of the above embodiments.
According to still further embodiments of the present disclosure, there is provided a system for generating a machine learning model, including: the data provider is used for generating a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the label provider, and calculating gradient information ciphertexts of different user classifications according to the label predicted value ciphertext, the label ciphertext of each user data sample sent by the label provider and the user classification to which each user data sample belongs; and the label provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications, and instructs the data provider to generate a machine learning model according to the information gain of different user classifications.
In some embodiments, the generation comprises multiple rounds, and in the T-th round the data provider generates the tag prediction value ciphertext of the (T-1)-th round from the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
In some embodiments, the data provider generates the tag prediction value ciphertext of the T-1 generation process according to the homomorphic addition result of the parameter information ciphertext of the T-1 generation process and the tag prediction value ciphertext of the T-2 generation process.
In some embodiments, the data provider divides the user data samples into different user classification sets according to the plurality of pieces of user characteristic information contained in each sample; the generation comprises multiple rounds, and the data provider calculates the gradient information ciphertext of each user data sample from the tag prediction value ciphertext and the tag ciphertext of each sample in the (T-1)-th round, then calculates, for each user classification set, the weighted sum of the gradient information ciphertexts of the samples in that set as the gradient information ciphertext of that user classification.
In some embodiments, the gradient information ciphertext includes a first order gradient ciphertext and a second order gradient ciphertext.
In some embodiments, the data provider calculates the homomorphic multiplication of the first preset value and the tag ciphertext of each user data sample, then calculates the homomorphic addition of that result and the tag prediction value ciphertext of each sample in the (T-1)-th round to determine the first-order gradient ciphertext.
In some embodiments, the data provider processes the tag prediction value ciphertext of each user data sample in the (T-1)-th round with a sigmoid function to determine a function processing result; the data provider calculates the homomorphic multiplication of the first preset value and the tag ciphertext of each sample; and the homomorphic addition of the function processing result and the homomorphic multiplication result determines the first-order gradient ciphertext.
In some embodiments, the tag provider calculates the information gain of each user class according to the weighted sum of the first-order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the second-order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the first-order gradient ciphertexts of all user data samples, and the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the tag provider calculates the information gain of each user classification from the difference between the weighted sum of a first gain, a second gain and a third gain, and a threshold used to instruct the data provider whether to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the samples in the user classification set and negatively correlated with the weighted sum of their second-order gradient ciphertexts; the second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the samples in the set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the samples in the set; the third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the data provider sets a plurality of classification thresholds according to a plurality of user characteristic information contained in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
In some embodiments, the tag provider determines the maximum gain information among the information gains for different user categories; and instructing the data provider to generate a machine learning model according to whether the maximum gain information is larger than a threshold value.
In some embodiments, the machine learning model is a tree model. When the maximum gain information is greater than the threshold, the tag provider instructs the data provider to split the current node of the tree model, generating child nodes of the current node that serve as the current nodes of the next iteration; when the maximum gain information is less than or equal to the threshold, the tag provider calculates the weight parameter of the node as parameter information plaintext of the tree model; the tag provider then re-determines the maximum gain information for the current nodes of the next iteration, and the above steps are repeated until generation of the tree model is finished.
According to still further embodiments of the present disclosure, there is provided a processing apparatus of user data, including: a classification unit configured to classify the user using a machine learning model based on the user data, the machine learning model being generated according to the method for generating a machine learning model according to any one of the embodiments.
In some embodiments, there is provided an electronic device comprising: a memory; and a processor coupled to the memory, the processor configured to perform, based on instructions stored in the memory, the method of generating a machine learning model or the method of processing user data in any of the above embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a machine learning model or the method of processing user data in any of the above embodiments.
In the above embodiment, the data provider and the tag provider realize the secure calculation of the information gain through the interaction of the parameter information ciphertext, the tag prediction value ciphertext and the gradient information ciphertext. Thus, the privacy of the data provider and the tag provider can be ensured, and the security is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of some embodiments of a method of generation of a machine learning model of the present disclosure;
FIG. 2 illustrates a flow diagram of some embodiments of a method of processing user data of the present disclosure;
FIG. 3 illustrates a schematic diagram of some embodiments of a method of generating a machine learning model of the present disclosure;
FIG. 4a illustrates a block diagram of some embodiments of a generation system of a machine learning model of the present disclosure;
FIG. 4b illustrates a block diagram of some embodiments of a generation system of a machine learning model of the present disclosure;
FIG. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure;
fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
As previously mentioned, the present disclosure contemplates joint modeling between two parties in a new scenario where one party holds only the data sample labels and the other party holds the data sample features. The AP (Active Party), as the label provider, i.e., the business party, holds only the data labels and hopes to model jointly with the data of other organizations; it is usually the active party of federated learning. The organization that provides data to the AP side is the PP (Passive Party), i.e., the data provider, usually the passive party of federated learning. This application scenario is referred to as the single-side submodel scenario for short.
Compared with existing application scenarios, the single-side submodel scenario places a new requirement on the interpretability of the jointly constructed federated XGB model.
On the one hand, to protect the data privacy of the PP side, the AP side cannot obtain the predicted value ŷᵢ corresponding to a specific sample. The reason is that, to obtain interpretability, the AP side needs to know the characteristic meaning of the model nodes; if the AP side then also obtained the predicted value ŷᵢ of a sample, it would obtain the prediction path of that sample through the model, which exposes the characteristic information of each sample and thus allows the PP side's sample privacy information to be deduced.
On the other hand, to protect the data privacy of the AP side, the PP side cannot obtain the AP side's private data, namely the sample labels and the leaf node weight information of the model, because the model's leaf node weights contain the AP side's sample label information.
However, methods such as SecureBoost cannot satisfy both requirements in the single-side submodel scenario. Although SecureBoost and similar methods keep the model weight parameters on the AP side, the AP side still needs the prediction value of each individual sample to calculate the information gain. Since the AP side knows the splitting features of the whole model, it can deduce the PP side's sample data from individual sample prediction values, so the data privacy of the AP and PP sides cannot both be guaranteed, which limits the application scenarios of federated learning.
That is to say, when methods such as SecureBoost are used for interpretable joint XGB (eXtreme Gradient Boosting) modeling in the single-side submodel scenario, one side needs the individual sample label y and the individual sample prediction value ŷ to calculate the information gain, and then uses the information gain for model splitting and training. If the AP side calculates the information gain, it can infer the PP side's sample privacy information from individual sample prediction values, violating the requirement to protect the PP side's data feature privacy; if the PP side calculates the information gain, it obtains both the sample prediction values and the sample labels, violating the requirement to protect the privacy of the AP side's data labels.
Aiming at the problem that the related art cannot achieve interpretable joint modeling while protecting the private information of both parties in the single-side submodel scenario, the present disclosure provides an interpretable privacy-preserving federated learning scheme for the single-side submodel.
In summary, the information gain formula is decomposed, and a secure multi-party information gain calculation protocol for the single-side submodel scenario is designed; an interpretable single-side submodel federated learning scheme is then built on this protocol. Model splitting and training that remain interpretable are thus completed in the single-side submodel scenario without disclosing either party's privacy. For example, the technical solution of the present disclosure can be realized by the following embodiments.
Fig. 1 illustrates a flow diagram of some embodiments of a method of generation of a machine learning model of the present disclosure.
As shown in fig. 1, in step 110, the data provider generates a tag prediction value ciphertext of each user data sample according to a parameter information ciphertext of the machine learning model transmitted by the tag provider (i.e., the service provider).
For example, let the set of user data samples be X ∈ R^(n×d), and the label set of the user data samples be y = {y₁, y₂, …, y_n} ∈ R^n. The T-th round of the generation process generates the T-th tree model Tree_T; that is, each round generates one tree model, together with a set of label prediction values ŷ^T = {ŷ₁, ŷ₂, …, ŷ_n}. The label provider encrypts y to generate the label ciphertext set [y] = {[y₁], [y₂], …, [y_n]} and sends [y] to the data provider.
In some embodiments, the data provider obtains the encryption public key pk generated by the label provider and the correspondence between sample prediction values and model leaf nodes; the label provider generates an encryption key pair (sk, pk) and holds the parameter information W^(T-1) of the (T-1)-th tree and the sample labels y = {y₁, y₂, …, y_n}, where W^(T-1) is the weight information of the leaf nodes of the (T-1)-th tree.
For example, the label provider uses the public key pk to generate the sample label ciphertext [y] = Enc_pk(y) and the parameter information ciphertext [W^(T-1)] = Enc_pk(W^(T-1)) of the (T-1)-th tree, and sends [y] and [W^(T-1)] to the data provider.
In some embodiments, the initialized Tree_T contains only the root node Node-0, and Node-i is generated in the i-th iteration of the T-th round. Each Node-i maintains the following information: its characteristic meaning k_i, its splitting policy split_i(), and its sample set R_i. The sample set R_0 of Node-0 is initialized to contain all user data samples, and the set of nodes to be split is L = {Node-0}.
For example, the characteristic meaning k_i is disclosed to the label provider, while the splitting policy split_i() and the sample set R_i are kept by the data provider and not disclosed to the label provider.
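The per-node bookkeeping just described can be sketched as a small data structure (field names are illustrative assumptions); only the characteristic meaning k_i would be shared with the label provider:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    node_id: int
    feature_meaning: Optional[str] = None           # k_i: disclosed to the label provider
    split: Optional[Callable[[list], bool]] = None  # split_i(): kept by the data provider
    samples: List[int] = field(default_factory=list)  # R_i: kept by the data provider

# Initialization for round T: the root Node-0 holds all user data samples
# and is the only node in the to-split set L.
n_samples = 6
root = Node(node_id=0, samples=list(range(n_samples)))
to_split = [root]
print(len(to_split[0].samples))  # 6
```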
In some embodiments, the generation method comprises multiple rounds, and in the T-th round the data provider generates the tag prediction value ciphertext of the (T-1)-th round from the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
For example, the data provider generates the tag prediction value ciphertext of the (T-1)-th round as the homomorphic addition of the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
For example, the data provider calculates the sample prediction value ciphertext [ŷ^(T-1)] using the above correspondence. For example, the tree model is a binary tree, and the correspondence between sample prediction values and model leaf nodes consists of the weight parameters of the left and right child nodes of the current node together with the child node into which each sample is finally classified; from this correspondence the sample prediction value can be obtained. In this way, the data provider can perform the computation using the correspondence and the ciphertexts sent by the label provider without obtaining the plaintext weight parameters of the model's nodes.
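A sketch of how the data provider can assemble prediction values from this correspondence without seeing plaintext weights: the encrypted child weights are treated as opaque tokens (plain strings here) that the provider's own split function routes to each sample (all names are illustrative):

```python
def route_predictions(samples, split, enc_left_w, enc_right_w):
    """Data provider: assign each sample the ciphertext of the weight of the
    child node it falls into, without ever decrypting those weights."""
    return [enc_left_w if split(x) else enc_right_w for x in samples]

# The split rule (feature 0 <= 3.0) is held by the data provider; the
# encrypted child weights come from the label provider and stay opaque.
samples = [[1.0], [5.0], [2.5]]
preds = route_predictions(samples, lambda x: x[0] <= 3.0,
                          enc_left_w="[w_left]", enc_right_w="[w_right]")
print(preds)  # ['[w_left]', '[w_right]', '[w_left]']
```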
In step 120, the data provider calculates gradient information ciphertexts of different user classifications according to the tag prediction value ciphertexts, the tag ciphertexts of each user data sample sent by the tag provider, and the user classification to which each user data sample belongs.
In some embodiments, the data provider divides each user data sample into different user classification sets according to a plurality of user characteristic information contained in each user data sample.
In some embodiments, the data provider sets a plurality of classification thresholds according to a plurality of user characteristic information contained in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
For example, for each user characteristic information dimension k = 1, …, d of the data provider, the set of user data samples X is divided into q-1 user classification sets (which may be referred to as binning) according to q classification thresholds. In this way, the user classification sets I_1, …, I_{q-1} corresponding to each user characteristic information are obtained.
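The binning step above can be sketched in Python. The function name and list-based representation are illustrative, not from the original scheme; q sorted thresholds (with the outermost two acting as range bounds) yield q-1 sets, as stated in the text:

```python
def bin_samples(feature_values, thresholds):
    """Divide sample indices into q-1 user classification sets
    using q sorted classification thresholds."""
    bins = [[] for _ in range(len(thresholds) - 1)]
    for i, v in enumerate(feature_values):
        for j in range(len(thresholds) - 1):
            if thresholds[j] <= v < thresholds[j + 1]:
                bins[j].append(i)   # sample i falls into set I_{j+1}
                break
    return bins
```

For example, `bin_samples([1, 5, 9], [0, 4, 8, 12])` groups the three samples into three index sets I_1, I_2, I_3.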
In some embodiments, the generation method comprises multiple rounds of generation. In the T-th round, the data provider calculates the gradient information ciphertext of each user data sample according to the tag prediction value ciphertext of each user data sample in the (T-1)-th round and the tag ciphertext of each user data sample; and respectively calculates the weighted sums of the gradient information ciphertexts of the user data samples in each user classification set, as the gradient information ciphertexts of the different user classifications.
In some embodiments, the gradient information ciphertext includes a first-order gradient ciphertext and a second-order gradient ciphertext. For example, the data provider uses [y] and [ŷ_i^(T-1)] to compute the first-order gradient ciphertext [g_i] and the second-order gradient ciphertext [h_i] of each user data sample X_i.
In some embodiments, the data provider calculates a homomorphic multiplication result of a first preset value (e.g., -1) and the tag ciphertext of each user data sample; and the data provider calculates homomorphic addition results of the label predicted value ciphertext and homomorphic multiplication results of the user data samples in the T-1 generation process, and determines a first-order gradient ciphertext.
For example, if the mean square error is taken as the loss function, the first-order gradient and the second-order gradient are respectively:

g_i = ŷ_i^(T-1) − y_i,  h_i = 1

The data provider can then calculate the first-order gradient ciphertext and the second-order gradient ciphertext:

[g_i] = [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1]

where ⊖ denotes homomorphic subtraction and ⊗ denotes homomorphic multiplication (multiplication of a ciphertext by a plaintext number).
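The homomorphic operations ⊖ and ⊗ used above can be illustrated with a toy Paillier implementation. This is a sketch with insecurely small primes and illustrative function names, purely to show the mechanics; a real deployment would use a vetted cryptographic library:

```python
import random
from math import gcd

# Toy Paillier parameters -- tiny primes, NOT secure.
p, q = 61, 53
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1

def enc(m):
    """Encrypt plaintext m with the public key (n, g)."""
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt with the private key lam; L(x) = (x - 1) // n."""
    l = (pow(c, lam, n2) - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (l * mu) % n

def h_add(c1, c2):
    """Homomorphic addition: Enc(a) * Enc(b) mod n^2 = Enc(a + b)."""
    return (c1 * c2) % n2

def h_mul_const(c, k):
    """Number multiplication: Enc(a)^k mod n^2 = Enc(k * a)."""
    return pow(c, k % n, n2)
```

With this, [g_i] = [ŷ_i] ⊖ [y_i] reduces to `h_add(ct_pred, h_mul_const(ct_label, -1))`: the data provider never sees the plaintexts.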
In some embodiments, the data provider processes the tag prediction value ciphertext of each user data sample in the (T-1)-th generation round by using a sigmoid function, and determines a function processing result; the data provider calculates the homomorphic multiplication result of the first preset value and the tag ciphertext of each user data sample; and calculates the homomorphic addition result of the function processing result and the homomorphic multiplication result to determine the first-order gradient ciphertext.
For example, if the cross entropy is used as the loss function, the first-order gradient and the second-order gradient obtained from the above calculation are:

g_i = Sigmoid(ŷ_i^(T-1)) − y_i,  h_i = Sigmoid(ŷ_i^(T-1)) · (1 − Sigmoid(ŷ_i^(T-1)))

The data provider can approximate the Sigmoid function by its Taylor expansion at 0, Sigmoid(x) ≈ 1/2 + x/4, to find the first-order and second-order gradient ciphertexts:

[g_i] = [1/2] ⊕ (1/4) ⊗ [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1/4]
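The quality of the Taylor approximation of the sigmoid near 0 can be checked numerically. A small illustrative sketch (function names are not from the original scheme):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_taylor(x):
    # First-order Taylor expansion of the sigmoid around 0.
    return 0.5 + x / 4.0
```

Near 0 the two agree closely, which is why the encrypted gradient g_i ≈ 1/2 + ŷ/4 − y can be evaluated with only homomorphic addition and number multiplication.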
In step 130, the tag provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications.
In some embodiments, the tag provider calculates the information gain of each user classification according to the weighted sum of the first-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the second-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the first-order gradient ciphertexts of all user data samples, and the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the data provider computes, for the user classification set I_j, the sub first-order gradient aggregation value ciphertext [G_j] = Σ_{i∈I_j}[g_i] and the sub second-order gradient aggregation value ciphertext [H_j] = Σ_{i∈I_j}[h_i], as well as the total first-order gradient aggregation value ciphertext [G] = Σ_{i∈I}[g_i] and the total second-order gradient aggregation value ciphertext [H] = Σ_{i∈I}[h_i], and sends them to the tag provider.
In some embodiments, the tag provider calculates the information gain of each user classification according to the difference between the weighted sum of a first gain, a second gain, and a third gain and a threshold, where the threshold is used for instructing the data provider to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the user data samples in the user classification set, and negatively correlated with the weighted sum of their second-order gradient ciphertexts. The second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set. The third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples, and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
For example, the tag provider decrypts the sub first-order gradient aggregation value ciphertext by using the private key sk to obtain G_j = Dec_sk([G_j]); decrypts the sub second-order gradient aggregation value ciphertext to obtain H_j = Dec_sk([H_j]); decrypts the total first-order gradient aggregation value ciphertext to obtain G; and decrypts the total second-order gradient aggregation value ciphertext to obtain H. Using the decryption results, the tag provider calculates the first gain G_j²/(H_j+λ), the second gain (G−G_j)²/(H−H_j+λ), and the third gain G²/(H+λ), and further calculates the information gain of the user classification set j:

Gain_j = (1/2) · [G_j²/(H_j+λ) + (G−G_j)²/(H−H_j+λ) − G²/(H+λ)]
λ is an adjustable parameter. The maximum gain information MaxGain is determined from the individual Gain_j values.
In step 140, the tag provider instructs the data provider to generate the machine learning model based on the information gains of the different user classifications.
In some embodiments, the tag provider determines the maximum gain information among the information gains for different user categories; and instructing the data provider to generate a machine learning model according to whether the maximum gain information is larger than a threshold value gamma.
In some embodiments, the machine learning model is a tree model, and the tag provider instructs the data provider to split a current node of the tree model to generate child nodes of the current node as a current node of a next iteration when the maximum gain information is greater than a threshold; the label provider calculates the weight parameters of the child nodes as parameter information plaintext of the tree model under the condition that the maximum gain information is less than or equal to the threshold value; the label provider re-determines the maximum gain information according to the current node of the next iteration; and repeating the steps until the generation of the tree model is finished.
For example, the iterative process in the T-th round of the generation process can be realized as follows:

While L is not NULL do
    Select any Node-i ∈ L, and obtain the maximum information gain MaxGain by using the secure multi-party information gain calculation protocol
    If MaxGain > γ then
        Split node i according to the splitting rule corresponding to the maximum information gain, obtaining two child nodes i_L and i_R and their corresponding sample sets R_{i_L} and R_{i_R}
        Update the tree model Tree_T, remove node i from L, and add i_L and i_R into the set L to be split
    else
        Do not split, and remove node i from L
    end
end
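A minimal Python sketch of this splitting loop, assuming a `best_split` callback that stands in for the secure multi-party gain protocol (all names and the dict-based tree representation are illustrative):

```python
def grow_tree(root_samples, best_split, gamma, max_nodes=64):
    """Iteratively split nodes while the maximum gain exceeds gamma.

    best_split(samples) is assumed to run the secure gain protocol and
    return (max_gain, left_samples, right_samples).
    """
    tree = {0: root_samples}      # node id -> sample set R_i
    to_split = [0]                # the set L of nodes to be split
    next_id = 1
    while to_split and next_id < max_nodes:
        node = to_split.pop()
        max_gain, left, right = best_split(tree[node])
        if max_gain > gamma:
            tree[next_id], tree[next_id + 1] = left, right
            to_split += [next_id, next_id + 1]
            next_id += 2
        # else: node becomes a leaf and is simply dropped from L
    return tree
```

With a toy `best_split` that halves a sample list and reports its length as the gain, a four-sample root grows into a full depth-2 tree of seven nodes.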
In some embodiments, the tag provider calculates the parameter information of the T-th round according to G_j and H_j. For example, the optimal weight of each leaf node is:

w*_j = −G_j / (H_j + λ)
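The leaf weight formula is a one-liner over the decrypted aggregates; an illustrative sketch with assumed names:

```python
def leaf_weight(G_j, H_j, lam):
    """Optimal leaf weight w*_j = -G_j / (H_j + lam), computed by the
    tag provider from the decrypted gradient aggregation values."""
    return -G_j / (H_j + lam)
```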
In the above embodiment, the calculation processes of the first-order gradient g (the first-order gradient term of the loss function) and the second-order gradient h (the second-order gradient term of the loss function), on which the information gain in XGB depends, are decomposed.
By using homomorphic encryption, in the scenario where the tag provider only has the data labels and the data provider has the data features, the tag provider provides the sample label ciphertexts and the model weight parameter ciphertexts required for the calculation; the data provider calculates the sample prediction values using the weight ciphertexts, and further calculates the g and h ciphertext aggregation values of the different splitting schemes. In this manner, the tag provider and the data provider jointly and securely calculate the information gain (Gain); finally, the tree is split in this way to complete the modeling process.
Therefore, under the condition that privacy information of both parties is not exposed in the model training and predicting stage, the label provider can acquire the split feature meaning and the model weight parameter, model interpretability and data privacy are guaranteed, and a safe and reliable interpretable federal learning scheme is provided for a new practical application scene.
In the above, a two-party joint scenario is taken as an example for explanation; the technical solution of the present disclosure may also be applied to multi-party joint scenarios with more than two parties.
Fig. 2 illustrates a flow diagram of some embodiments of a method of processing user data of the present disclosure.
As shown in fig. 2, in step 210, user data is obtained. In step 220, the users are classified according to the user data using a machine learning model, which is generated according to the generation method of the machine learning model in any of the above embodiments.
The tag provider encrypts its sample label set y = {y_1, y_2, ..., y_n} and sends the encryption result [y] = {[y_1], [y_2], ..., [y_n]} to the data provider.
In some embodiments, during the training of the tree model, the tag provider also encrypts the leaf node weights w_j^(T) of the model (taking a single-layer binary tree as an example, where w_j^(T) represents the weight of the j-th leaf node of the T-th tree), and sends the encryption results [w_1^(T)], [w_2^(T)], … in sequence to the data provider.
Since the data provider holds all the data features in the single-side model scenario, it can know which leaf node weight the predicted value of each training sample is equal to. According to the weight ciphertexts of the leaf nodes sent by the tag provider, the tag prediction value ciphertext [ŷ_i] of each training sample on each tree model can be obtained.
In some embodiments, the data provider may combine the sample tag ciphertext [y] provided by the tag provider and the obtained tag prediction value ciphertext [ŷ_i^(T-1)], and obtain the first-order gradient ciphertext and the second-order gradient ciphertext according to the first-order and second-order gradient calculation formulas of the different loss functions.
For example, for complex loss functions involving operations such as exponentiation that exceed the capability of the homomorphic encryption algorithm, techniques such as Taylor expansion are used to approximately convert them into simple linear function calculations.
And finally, the data provider aggregates the first-order gradient ciphertext and the second-order gradient ciphertext according to different splitting schemes, and sends the ciphertext to the tag provider. The label provider decrypts and calculates the information gain of different splitting schemes; and selecting an optimal splitting scheme according to the information gain, and informing a data provider to split according to the scheme to complete single tree node splitting training iteration. And repeating the leaf node splitting process to complete the construction of the tree model.
According to the embodiment, the encryption first-order gradient and the encryption second-order gradient can be calculated under the condition that the data provider does not know the weight plaintext of the leaf node, and the data privacy safety of the tag provider is guaranteed. Moreover, the tag provider can only obtain the aggregation value, and cannot obtain a single predicted value, so that the data privacy of the data provider is also ensured.
In some embodiments, the protocol first has two participants: the tag provider only owns a label set Y = {y_1, …, y_{n_A}} containing n_A data labels, and hopes to borrow data from other organizations (data providers) for joint modeling, and is therefore usually the initiator of federated learning; the data provider owns a data set X = {X_1, …, X_{n_P}} containing n_P pieces of data, each with d-dimensional features X_i = (x_1, x_2, x_3, …, x_d).
In some embodiments, the technical solution of the present disclosure comprises two stages.
For example, in the first stage, encrypted sample alignment is performed. Since the participants in a vertical federated scenario perform federated learning using different data features of common samples, sample alignment is required first; that is, the common sample IDs owned by the tag provider and the data provider are found. This part can be implemented by existing mature private set intersection protocols. After the private set intersection, the two parties obtain an intersection of n data samples.
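Plain salted-hash alignment of common IDs can be sketched as follows. Note this shows only the alignment logic, not a privacy-preserving PSI protocol (a real PSI, e.g. a DH-based one, would prevent either side from learning non-intersecting items); all names are illustrative:

```python
import hashlib

def blind(ids, salt):
    """Hash each ID with a shared salt; maps digest -> original ID."""
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

def align(tag_ids, data_ids, salt="shared-salt"):
    """Return the sorted common sample IDs of the two parties."""
    a, b = blind(tag_ids, salt), blind(data_ids, salt)
    return sorted(a[h] for h in a.keys() & b.keys())
```

After alignment, both parties index their labels and features by the same n common IDs.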
For example, in the second stage, federated model training is performed. Federated modeling starts after the encrypted sample alignment step is completed.
Fig. 3 illustrates a schematic diagram of some embodiments of a method of generating a machine learning model of the present disclosure.
As shown in fig. 3, in step 0, the tag provider generates a homomorphic encryption (e.g., Paillier algorithm) public key pk and private key sk, and broadcasts the public key pk. The tag provider encrypts the n data labels y in the intersection by using the public key pk, Enc_pk(y), to generate the ciphertext [y], and sends the public key pk and the generated ciphertext to the data provider.
In step 1, after the data provider determines all the sample IDs of the node to be split (the current node), q classification thresholds S_k = {s_k1, …, s_kq} are determined according to the d-dimensional features of the samples; and all samples are divided into q-1 user classification sets according to the q classification thresholds, which serve as the screening basis for the tree splitting nodes.
Taking the T-th round of the generation process, which constructs the T-th tree, as an example, the training stage mainly comprises the following steps.
In step 2, the tag provider uses the public key pk to encrypt the leaf node weights w_j^(T-1) of the tree model obtained in the previous round of the generation process (taking a single-layer binary tree as an example), and sends the encryption results [w_1^(T-1)], [w_2^(T-1)], … in sequence to the data provider.
In step 3, the data provider obtains the correspondence between the sample prediction values and the leaf nodes of the tree model according to the tree model construction process in the previous round of training.
In step 4, the data provider performs a homomorphic addition operation on the encrypted prediction value result of the previous ((T-2)-th) round and the encryption results received in step 2, and, combining the correspondence between the sample prediction values and the leaf nodes of the tree model, obtains the encrypted prediction value [ŷ_i^(T-1)] of each sample in the current round. For example, the calculation is as follows:

[ŷ_i^(T-1)] = [ŷ_i^(T-2)] ⊕ [w_j^(T-1)]

where j is the leaf node into which sample i falls, and ⊕ denotes homomorphic addition, i.e., an addition operation on ciphertexts whose result is equal to the encryption of the sum of the plaintexts.
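The round update can be mirrored with a toy additive-homomorphic stand-in. This is purely illustrative (a fixed offset is not encryption); real schemes such as Paillier behave analogously for addition:

```python
KEY = 10_000  # shared secret offset standing in for real encryption

def enc(m):
    return m + KEY

def dec(c):
    return c - KEY

def h_add(c1, c2):
    # Enc(a) "⊕" Enc(b) = Enc(a + b), mirroring additive homomorphism
    return c1 + c2 - KEY

y_prev = enc(3)           # [ŷ^(T-2)]: previous round's encrypted prediction
w = enc(2)                # [w_j^(T-1)]: encrypted leaf weight from the tag provider
y_new = h_add(y_prev, w)  # [ŷ^(T-1)]: current round's encrypted prediction
```

The data provider performs the update entirely on "ciphertexts"; only the tag provider, holding the key, could recover the plaintext prediction.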
In step 5, the data provider uses the encrypted sample labels [y] and the encrypted sample prediction values [ŷ_i^(T-1)] to calculate the first-order gradient ciphertext [g_i] and the second-order gradient ciphertext [h_i] of each sample i.
In some embodiments, since only addition, subtraction, and number multiplication operations appear in the above calculation expressions, the data provider can directly perform the calculation by using the additive homomorphism and the number-multiplication homomorphism of the ciphertexts to obtain the corresponding ciphertexts.
For example, if the mean square error is taken as the loss function, the first-order gradient and the second-order gradient are respectively:

g_i = ŷ_i^(T-1) − y_i,  h_i = 1

The data provider can then calculate the first-order gradient ciphertext and the second-order gradient ciphertext:

[g_i] = [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1]

where ⊖ denotes homomorphic subtraction and ⊗ denotes homomorphic multiplication (multiplication of a ciphertext by a plaintext number).
In some embodiments, for more complex operations such as power operations, rational number division operations, etc., the expression may be subjected to second-order approximation by using a taylor expansion, and converted into a homomorphic encryption operable form.
For example, if the cross entropy is used as the loss function, the first-order gradient and the second-order gradient obtained from the above calculation are:

g_i = Sigmoid(ŷ_i^(T-1)) − y_i,  h_i = Sigmoid(ŷ_i^(T-1)) · (1 − Sigmoid(ŷ_i^(T-1)))

The data provider can approximate the Sigmoid function by its Taylor expansion at 0, Sigmoid(x) ≈ 1/2 + x/4, to find the first-order and second-order gradient ciphertexts:

[g_i] = [1/2] ⊕ (1/4) ⊗ [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1/4]
In step 6, based on the user classification sets obtained in step 1 and the encrypted first-order and second-order gradients of the individual samples obtained in step 5, the encrypted aggregation values of the first-order and second-order gradients of each user classification set, [G_j] = Σ_{i∈I_j}[g_i] and [H_j] = Σ_{i∈I_j}[h_i], are calculated by utilizing the additive homomorphism of the encryption algorithm, and the results are sent back to the tag provider for decryption.
In step 7, the tag provider calculates the information gain Gain of each candidate split of each feature of the data provider according to the decrypted first-order gradient aggregation value G_j = Dec_sk([G_j]) and second-order gradient aggregation value H_j = Dec_sk([H_j]) of each user classification set of the data provider.
The tag provider returns the scheme serial number corresponding to the maximum information gain to the data provider, and the data provider recovers the optimal splitting feature and the feature threshold from that serial number. This data is kept opaque to the tag provider during the training process to ensure the security of the data provider's private data.
If the information gain of the optimal split is smaller than the threshold γ, the current node is not split; otherwise, the feature corresponding to the maximum gain information value is the optimal splitting feature, and the current node is split according to the corresponding feature and threshold. Finally, the data provider (the passive party, PP) needs to calculate the sample sets of the two resulting leaf nodes, according to which the next leaf nodes are split.
In step 8, for each leaf node of the T-th tree, step 7 is executed in a loop until a stop splitting condition is reached, i.e. all leaf nodes can no longer be split or the depth of the tree reaches a set maximum depth.
Since the tag provider (the active party, AP) holds the first-order and second-order gradient aggregation values of the optimal split of each node, the optimal weight of each leaf node can be calculated:

w*_j = −G_j / (H_j + λ)

At this point, the construction of the T-th tree is completed.
In step 9, after the tag provider completes the training of the T-th tree, it encrypts the leaf node weights w_j^(T) of the T-th tree and sends them to the data provider, and the single-tree training process from step 1 to step 9 is repeated until all M trees are built. The resulting complete gradient boosting tree model is:

ŷ_i = Σ_{t=1}^{M} f_t(X_i)
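The final ensemble prediction is a plain sum over the per-tree predictions, as in the formula above. An illustrative sketch with stand-in trees (the lambda leaf functions are hypothetical, not from the scheme):

```python
def predict(trees, x):
    """Gradient-boosted prediction: ŷ = sum of f_t(x) over all M trees."""
    return sum(tree(x) for tree in trees)

# Three stand-in trees, each mapping a feature value to a leaf weight.
trees = [lambda x: 0.5, lambda x: 0.25 * x, lambda x: -0.1]
```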
The single-side model is a very important business scenario. For this scenario, the present disclosure provides a secure and interpretable federated XGB modeling scheme.
In the above embodiment, the tag provider is enabled to maintain the feature meanings and the model weight parameters, while it is ensured that the tag provider cannot infer the data of the data provider. The model weight parameters are kept at the tag provider and kept secret from the data provider, so that the data provider cannot infer the private information of the tag provider through the model weight parameters. The tag provider cannot learn the sample prediction values of the data provider, so it cannot infer the private information of the data provider through the sample prediction values. Thereby, a reliable interpretable model is achieved.
In the above embodiment, a secure multiparty information gain calculation protocol is designed. Through a calculation formula of first-order gradient and second-order gradient in the information gain, a label provider provides a sample label and a model weight parameter ciphertext required by calculation; and the data provider calculates the sample prediction value by using the weight ciphertext and further calculates the information gain jointly and safely in a mode of g ciphertext aggregation value and h ciphertext aggregation value of different splitting schemes.
In this way, the calculation can be performed for different loss functions; functions that cannot be directly calculated with additive homomorphic encryption can be handled with approximation techniques such as Taylor expansion, and the protocol can be extended to multi-party scenarios.
Fig. 4a illustrates a block diagram of some embodiments of a generation system of a machine learning model of the present disclosure.
As shown in fig. 4a, the generation system 4a of the machine learning model includes: the data provider 41a, configured to generate the tag prediction value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the tag provider, and to calculate the gradient information ciphertexts of the different user classifications according to the tag prediction value ciphertexts, the tag ciphertexts of the user data samples sent by the tag provider, and the user classification to which each user data sample belongs; and the tag provider 42a, configured to calculate the information gains of the different user classifications according to their gradient information ciphertexts, and to instruct the data provider to generate the machine learning model according to these information gains.
In some embodiments, the generation method comprises multiple rounds of generation. In the T-th round, the data provider 41a generates the tag prediction value ciphertext of the (T-1)-th round according to the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
In some embodiments, the data provider 41a generates the tag prediction value ciphertext of the T-1 th round of generation process according to the result of homomorphic addition of the parameter information ciphertext of the T-1 th round of generation process and the tag prediction value ciphertext of the T-2 th round of generation process.
In some embodiments, the data provider 41a divides each user data sample into different user classification sets according to a plurality of user characteristic information included in each user data sample; the generation method comprises multiple generation processes, and a data provider 41a calculates gradient information ciphertexts of user data samples according to the label predicted value ciphertexts of the user data samples in the T-1 generation process and the label ciphertexts of the user data samples in the T-1 generation process; and respectively calculating the weighted sum of the gradient information ciphertexts of the user data samples in each user classification set to serve as the gradient information ciphertexts of different user classifications.
In some embodiments, the gradient information ciphertext includes a first order gradient ciphertext and a second order gradient ciphertext.
In some embodiments, the data provider 41a calculates a homomorphic multiplication result of the first preset value and the tag ciphertext of each user data sample; and the data provider calculates homomorphic addition results of the label predicted value ciphertext and homomorphic multiplication results of the user data samples in the T-1 generation process, and determines a first-order gradient ciphertext.
In some embodiments, the data provider 41a processes the tag prediction value ciphertext of each user data sample in the (T-1)-th generation round by using a sigmoid function, and determines a function processing result; the data provider calculates the homomorphic multiplication result of the first preset value and the tag ciphertext of each user data sample; and calculates the homomorphic addition result of the function processing result and the homomorphic multiplication result to determine the first-order gradient ciphertext.
In some embodiments, the tag provider 42a calculates the information gain for each user class based on the weighted sum of the first order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the second order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the first order gradient ciphertexts of all user data samples, and the weighted sum of the second order gradient ciphertexts of all user data samples.
In some embodiments, the tag provider 42a calculates the information gain of each user classification according to the difference between the weighted sum of a first gain, a second gain, and a third gain and a threshold, where the threshold is used for instructing the data provider to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the user data samples in the user classification set, and negatively correlated with the weighted sum of their second-order gradient ciphertexts. The second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set. The third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples, and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the data provider 41a sets a plurality of classification thresholds according to a plurality of user characteristic information included in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
In some embodiments, the tag provider 42a, determines the maximum gain information among the information gains for different user categories; and instructing the data provider to generate a machine learning model according to whether the maximum gain information is larger than a threshold value.
In some embodiments, the machine learning model is a tree model, and the tag provider 42a instructs the data provider to split the current node of the tree model to generate child nodes of the current node as the current node of the next iteration if the maximum gain information is greater than the threshold; the label provider calculates the weight parameters of the child nodes as parameter information plaintext of the tree model under the condition that the maximum gain information is less than or equal to the threshold value; the label provider 42a determines the maximum gain information again according to the current node of the next iteration; and repeating the steps until the generation of the tree model is finished.
Fig. 4b illustrates a block diagram of some embodiments of a processing apparatus of user data of the present disclosure.
As shown in fig. 4b, the user data processing apparatus 4b includes: a classification unit 41b for classifying the user according to the user data using a machine learning model, the machine learning model being generated according to the generation method of the machine learning model in any of the above embodiments.

In some embodiments, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method of generating a machine learning model, or the method of processing user data, in any of the above embodiments, based on instructions stored in the memory.
Fig. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure.
As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute a method of generating a machine learning model or a method of processing user data in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, a database, and other programs.
Fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
As shown in fig. 6, the electronic apparatus 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, wherein the processor 620 is configured to execute a method of generating a machine learning model or a method of processing user data in any of the above embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, and other programs.
The electronic device 6 may also include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, through a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, a microphone, and a speaker. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
As will be appreciated by one of skill in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media having computer-usable program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
So far, a generation method of a machine learning model, a generation system of a machine learning model, a processing method of user data, a processing apparatus of user data, an electronic device, and a nonvolatile computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (17)

1. A method of generating a machine learning model, comprising:
the data provider generates a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the label provider;
the data provider calculates gradient information ciphertexts of different user classifications according to the label predicted value ciphertexts, the label ciphertexts of the user data samples sent by the label provider and the user classifications to which the user data samples belong;
the label provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications;
and the label provider instructs the data provider to generate the machine learning model according to the information gain of the different user classifications.
2. The generation method of claim 1, wherein the generation method comprises a plurality of generation processes,
the generating of the tag prediction value ciphertext of each user data sample comprises:
in the T-th generation process, the data provider generates the tag prediction value ciphertext of the T-1 generation process according to the parameter information ciphertext of the T-1 generation process and the tag prediction value ciphertext of the T-2 generation process.
3. The generation method according to claim 2, wherein the generating of the tag prediction value ciphertext of the T-1 th round generation process comprises:
and the data provider generates the tag prediction value ciphertext of the T-1 generation process according to the homomorphic addition result of the parameter information ciphertext of the T-1 generation process and the tag prediction value ciphertext of the T-2 generation process.
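The homomorphic addition in claim 3 can be illustrated with a minimal, deliberately insecure Paillier-style sketch (the toy primes, key size, and function names are assumptions for illustration; a real deployment would use a vetted additively homomorphic library and full-size keys):

```python
import math
import random

# Toy Paillier parameters (insecure key size, for illustration only)
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def hom_add(c1, c2):
    # homomorphic addition: Dec(c1 * c2 mod n^2) = m1 + m2
    return (c1 * c2) % n2

# Claim 3: the prediction ciphertext of round T-1 is the homomorphic sum of the
# parameter information ciphertext and the prediction ciphertext of round T-2.
pred_prev = encrypt(30)   # tag prediction value ciphertext, round T-2
param = encrypt(12)       # parameter information ciphertext, round T-1
pred_new = hom_add(pred_prev, param)
assert decrypt(pred_new) == 42
```

Because only additions (and scalar multiplications) are needed, the data provider can update encrypted predictions without ever learning the label provider's plaintext parameters.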
4. The generation method of claim 1, further comprising:
the data provider divides each user data sample into different user classification sets according to a plurality of user characteristic information contained in each user data sample;
the generation method comprises a plurality of generation processes, and the step of calculating gradient information ciphertexts of different user classifications comprises the following steps:
the data provider calculates gradient information ciphertexts of the user data samples according to the label predicted value ciphertexts of the user data samples in the T-1 generation process and the label ciphertexts of the user data samples in the T-1 generation process;
and respectively calculating the weighted sum of the gradient information ciphertexts of the user data samples in each user classification set to serve as the gradient information ciphertexts of different user classifications.
5. The generation method of claim 4, wherein the gradient information ciphertext comprises a first order gradient ciphertext and a second order gradient ciphertext.
6. The generation method according to claim 5, wherein the calculating the gradient information ciphertext of each user data sample comprises:
the data provider calculates homomorphic multiplication results of a first preset value and the label ciphertext of each user data sample;
and the data provider calculates homomorphic addition results of the label predicted value ciphertext of each user data sample in the T-1 generation process and the homomorphic multiplication results, and determines the first-order gradient ciphertext.
7. The generation method according to claim 5, wherein the calculating the gradient information ciphertext of each user data sample comprises:
the data provider processes the label predicted value ciphertext of each user data sample in the T-1 th generation process by using a sigmoid function, and determines a function processing result;
the data provider calculates homomorphic multiplication results of the first preset value and the label ciphertext of each user data sample;
and calculating a homomorphic addition result of the function processing result and the homomorphic multiplication result, and determining the first-order gradient ciphertext.
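For reference, on plaintext values the gradients that claims 6 and 7 compute under encryption have a simple closed form. The sketch below assumes the underlying loss is the logistic loss (the claims describe only the ciphertext operations, so the loss function and names here are assumptions):

```python
import math

def logistic_gradients(pred, label):
    """First- and second-order gradients of the logistic loss.

    pred: current additive prediction (margin); label: 0 or 1.
    Claim 6 corresponds to g = pred + (-1) * label computed homomorphically;
    claim 7 first applies a sigmoid, matching the logistic loss below.
    """
    prob = 1.0 / (1.0 + math.exp(-pred))  # sigmoid of the prediction
    g = prob - label                      # first-order gradient
    h = prob * (1.0 - prob)               # second-order gradient
    return g, h
```

At `pred = 0.0` and `label = 1` this yields `g = -0.5` and `h = 0.25`, the familiar starting gradients of boosted logistic regression.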
8. The generation method of claim 5, wherein the calculating information gains for different user classifications comprises:
and the tag provider calculates the information gain of each user classification according to the weighted sum of the first-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the second-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the first-order gradient ciphertexts of all user data samples and the weighted sum of the second-order gradient ciphertexts of all user data samples.
9. The generation method of claim 5, wherein the calculating the information gain for each user category comprises:
the label provider calculates the information gain of each user classification according to the difference value of the weighted sum of the first gain, the second gain and the third gain and a threshold value, wherein the threshold value is used for instructing the data provider to generate the machine learning model,
the first gain is positively correlated with the weighted sum of the first order gradient ciphertext of each user data sample in each user classification set, and the second order gradient ciphertext of each user data sample in each user classification set is negatively correlated with the weighted sum of the second order gradient ciphertext,
the second gain is positively correlated with a difference between a weighted sum of the first order gradient ciphertexts of all user data samples and a weighted sum of the first order gradient ciphertexts of each user data sample in each user classification set, and the second gain is positively correlated with a difference between a weighted sum of the second order gradient ciphertexts of all user data samples and a weighted sum of the second order gradient ciphertexts of each user data sample in each user classification set,
the third gain is inversely related to the weighted sum of the first order gradient ciphertexts of all user data samples and positively related to the weighted sum of the second order gradient ciphertexts of all user data samples.
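The three gains in claim 9 mirror the XGBoost-style split-gain formula; a plaintext sketch follows, where the regularization names `lam` and `gamma` are assumptions playing the role of the claim's threshold:

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Information gain of a candidate user classification (split).

    GL/HL: weighted sums of first/second-order gradients in the classification
    set (the first gain); GR/HR: the complements over the remaining samples
    (the second gain); the last term uses the totals over all samples
    (the third gain); gamma plays the role of the threshold in claim 9.
    """
    first = GL * GL / (HL + lam)
    second = GR * GR / (HR + lam)
    third = (GL + GR) ** 2 / (HL + HR + lam)
    return 0.5 * (first + second - third) - gamma
```

The label provider, holding the decryption key, can decrypt the aggregated gradient sums and evaluate this gain in plaintext without seeing any individual sample.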
10. The generation method of claim 4, wherein the partitioning the user data samples into different sets of user classifications comprises:
the data provider sets a plurality of classification thresholds according to a plurality of user characteristic information contained in each user data sample;
and dividing the user data samples into different user classification sets according to the classification threshold values.
11. The generation method of any of claims 1-10, wherein the gain of information according to the different user classifications, instructing the data provider to generate the machine learning model comprises:
the label provider determines the maximum gain information in the information gains of the different user categories;
instructing the data provider to generate the machine learning model according to whether the maximum gain information is greater than a threshold.
12. The generation method of claim 11, wherein the machine learning model is a tree model,
instructing the data provider to generate the machine learning model according to whether the maximum gain information is greater than a threshold comprises:
the label provider instructs the data provider to split the current node of the tree model under the condition that the maximum gain information is larger than the threshold value, and child nodes of the current node are generated and serve as the current node of the next iteration;
the label provider calculates the weight parameter of the child node as the parameter information plaintext of the tree model under the condition that the maximum gain information is less than or equal to the threshold value;
the determining the maximum gain information among the information gains of the different user classes comprises:
the label provider re-determines the maximum gain information according to the current node of the next iteration;
the generation method further comprises the following steps:
and repeating the steps until the tree model is generated.
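The loop in claim 12 can be sketched in plaintext: split the current node while the maximum gain exceeds the threshold, otherwise emit the leaf weight w* = -G / (H + lam) as the parameter information. The single-feature sample representation and helper names below are assumptions for illustration:

```python
def leaf_weight(G, H, lam=1.0):
    # parameter information plaintext of a leaf node: w* = -G / (H + lam)
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0):
    whole = (GL + GR) ** 2 / (HL + HR + lam)
    return 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - whole)

def build(samples, threshold=0.0, lam=1.0, depth=0, max_depth=3):
    """samples: list of (feature_value, g, h) tuples for a single feature."""
    G = sum(g for _, g, _ in samples)
    H = sum(h for _, _, h in samples)
    if depth >= max_depth or len(samples) < 2:
        return ('leaf', leaf_weight(G, H, lam))
    best_gain, best_t = float('-inf'), None
    for t in sorted({v for v, _, _ in samples})[:-1]:  # candidate thresholds
        GL = sum(g for v, g, _ in samples if v <= t)
        HL = sum(h for v, _, h in samples if v <= t)
        gain = split_gain(GL, HL, G - GL, H - HL, lam)
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_gain <= threshold:  # claim 12: stop splitting, emit the leaf weight
        return ('leaf', leaf_weight(G, H, lam))
    left = [s for s in samples if s[0] <= best_t]
    right = [s for s in samples if s[0] > best_t]
    return ('split', best_t,
            build(left, threshold, lam, depth + 1, max_depth),
            build(right, threshold, lam, depth + 1, max_depth))
```

In the claimed protocol the same control flow is driven by the label provider, which decrypts only aggregated gradient sums and instructs the data provider whether to split each current node.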
13. A method for processing user data comprises the following steps:
classifying users according to user data, using a machine learning model generated by the method for generating a machine learning model according to any one of claims 1 to 12.
14. A system for generating a machine learning model, comprising:
the data provider is used for generating a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the label provider, and calculating gradient information ciphertexts of different user classifications according to the label predicted value ciphertext, the label ciphertext of each user data sample sent by the label provider and the user classification to which each user data sample belongs;
and the label provider calculates the information gains of different user classifications according to the gradient information ciphertexts of different user classifications, and instructs the data provider to generate the machine learning model according to the information gains of different user classifications.
15. An apparatus for processing user data, comprising:
a classification unit configured to classify a user using a machine learning model generated according to the method for generating a machine learning model according to any one of claims 1 to 12, based on user data.
16. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of generating a machine learning model of any of claims 1-12, or the method of processing user data of claim 13, based on instructions stored in the memory.
17. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a machine learning model according to any one of claims 1 to 12, or the method of processing user data according to claim 13.
CN202210404748.2A 2022-04-18 2022-04-18 Method and system for generating machine learning model and method for processing user data Pending CN114723070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404748.2A CN114723070A (en) 2022-04-18 2022-04-18 Method and system for generating machine learning model and method for processing user data


Publications (1)

Publication Number Publication Date
CN114723070A (en) 2022-07-08



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination