CN114723070A - Method and system for generating machine learning model and method for processing user data - Google Patents

Method and system for generating machine learning model and method for processing user data Download PDF

Info

Publication number
CN114723070A
CN114723070A (application CN202210404748.2A)
Authority
CN
China
Prior art keywords
user
ciphertext
provider
user data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210404748.2A
Other languages
Chinese (zh)
Inventor
徐武兴
杨恺
范昊
黄志翔
彭南博
Current Assignee
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202210404748.2A
Publication of CN114723070A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008 Cryptographic mechanisms or cryptographic arrangements involving homomorphic encryption
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords

Abstract

The disclosure relates to a method and a system for generating a machine learning model and a method for processing user data, in the technical field of computers. The data provider generates a label prediction value ciphertext for each user data sample from the parameter information ciphertext of the machine learning model sent by the label provider, then calculates gradient information ciphertexts of the different user classifications from the label prediction value ciphertexts, the label ciphertexts of the samples, and the user classification to which each user data sample belongs; the label provider calculates the information gain of the different user classifications from these gradient information ciphertexts and instructs the data provider to generate the machine learning model according to those information gains. The technical scheme of the disclosure preserves the privacy of both the data provider and the label provider, thereby improving security.

Description

Method and system for generating machine learning model and method for processing user data
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and a system for generating a machine learning model, a method and a device for processing user data, an electronic device, and a non-transitory computer-readable storage medium.
Background
Data-driven artificial intelligence technology plays a major role across industries and creates great value; in China, data has been listed as a new factor of production. In recent years, however, growing emphasis on privacy protection and data compliance has made free data exchange difficult, giving rise to data silos. These data silos prevent the potential of data from being exploited.
To address this problem, the concept of federated learning has emerged. Federated learning enables multiple participants to build a model jointly while each party's data never leaves its local environment, unlocking the potential of each party's data while preserving data privacy.
In the related art, a machine learning model may be built using methods such as SecureBoost.
Disclosure of Invention
The inventors of the present disclosure found that the following problems exist in the above-described related art: privacy of the data provider and the tag provider cannot be guaranteed, resulting in poor security.
In view of this, the present disclosure provides a technical solution for generating a machine learning model, which can ensure privacy of a data provider and a tag provider, thereby improving security.
According to some embodiments of the present disclosure, there is provided a method of generating a machine learning model, including: the data provider generates a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model transmitted by the label provider; the data provider calculates gradient information ciphertexts of different user classifications according to the label predicted value ciphertexts, the label ciphertexts of each user data sample sent by the label provider and the user classification to which each user data sample belongs; the label provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications; and the label provider instructs the data provider to generate a machine learning model according to the information gain of different user classifications.
In some embodiments, the generating method includes multiple rounds of generation, and generating the tag prediction value ciphertext of each user data sample includes: in the T-th round, the data provider generates the tag prediction value ciphertext of the (T-1)-th round from the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
In some embodiments, generating the tag prediction value ciphertext of the (T-1)-th round comprises: the data provider generates it as the homomorphic addition of the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
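The disclosure does not fix a concrete cryptosystem; Paillier is a common additively homomorphic choice for this kind of ciphertext addition. A minimal toy sketch (deliberately tiny, insecure parameters, fixed randomness — illustration only) of updating the prediction ciphertext by homomorphic addition:

```python
from math import gcd

# Toy Paillier cryptosystem (tiny, insecure parameters -- illustration only).
p, q = 17, 19
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1
mu = pow(lam, -1, n)  # valid decryption constant because g = n + 1

def encrypt(m, r=7):
    # E(m) = g^m * r^n mod n^2 ; E(a) * E(b) mod n^2 decrypts to a + b
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Round T: the prediction ciphertext of round T-1 is the homomorphic sum of
# the parameter-information ciphertext of round T-1 and the prediction
# ciphertext of round T-2; the data provider never decrypts either term.
y_hat_T2 = 5        # plaintext prediction value, round T-2 (label provider side)
w_T1 = 3            # plaintext parameter information, round T-1
c_sum = (encrypt(y_hat_T2, r=7) * encrypt(w_T1, r=11)) % n2  # homomorphic add
print(decrypt(c_sum))  # 8
```

In the protocol only the label provider holds the secret values; the data provider performs just the multiplication modulo n² shown on the `c_sum` line.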
In some embodiments, the generating method further comprises: the data provider divides the user data samples into different user classification sets according to the plurality of pieces of user characteristic information contained in each sample. The generating method includes multiple rounds, and calculating the gradient information ciphertexts of the different user classifications includes: the data provider calculates the gradient information ciphertext of each user data sample from the tag prediction value ciphertext and the tag ciphertext of each sample in the (T-1)-th round, and then calculates, for each user classification set, the weighted sum of the gradient information ciphertexts of the samples in that set as the gradient information ciphertext of that user classification.
In some embodiments, the gradient information ciphertext includes a first order gradient ciphertext and a second order gradient ciphertext.
In some embodiments, calculating the gradient information ciphertext of each user data sample comprises: the data provider calculates the homomorphic multiplication of a first preset value and the tag ciphertext of each user data sample, then calculates the homomorphic addition of that result and the tag prediction value ciphertext of the sample in the (T-1)-th round to determine the first-order gradient ciphertext.
In some embodiments, calculating the gradient information ciphertext of each user data sample comprises: the data provider processes the tag prediction value ciphertext of each user data sample in the (T-1)-th round with a sigmoid function to determine a function processing result; the data provider calculates the homomorphic multiplication of the first preset value and the tag ciphertext of each sample; and the homomorphic addition of the function processing result and the homomorphic multiplication result determines the first-order gradient ciphertext.
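In plaintext form, these are the standard logistic-loss gradients used in XGBoost-style boosting: with prediction ŷ and label y, g = sigmoid(ŷ) − y and h = sigmoid(ŷ)(1 − sigmoid(ŷ)); reading the "first preset value" as the constant −1 multiplied homomorphically into [y] is an assumption. A plaintext sketch (the homomorphic version replaces the subtraction by ciphertext operations):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logistic_gradients(y_hat: float, y: int):
    """First- and second-order gradients of logistic loss, XGBoost-style.

    Homomorphically: [g] = sigmoid(y_hat) (+) (-1) (*) [y], where (+) / (*)
    denote homomorphic addition and scalar multiplication on ciphertexts.
    """
    p = sigmoid(y_hat)
    g = p - y          # first-order gradient
    h = p * (1.0 - p)  # second-order gradient
    return g, h

g, h = logistic_gradients(0.0, 1)
print(g, h)  # -0.5 0.25
```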
In some embodiments, calculating the information gains of the different user classifications includes: the tag provider calculates the information gain of each user classification from the weighted sum of the first-order gradient ciphertexts of the samples in that user classification set, the weighted sum of their second-order gradient ciphertexts, the weighted sum of the first-order gradient ciphertexts of all user data samples, and the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, calculating the information gain of each user classification comprises: the tag provider calculates it from the difference between the weighted sum of a first gain, a second gain and a third gain, and a threshold used to instruct the data provider whether to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the samples in the user classification set and negatively correlated with the weighted sum of their second-order gradient ciphertexts; the second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the samples in the set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the samples in the set; the third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
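The three correlations described above match the standard XGBoost split gain; under that reading (an assumption, since the claims state only monotonic relations), with $G_L, H_L$ the weighted sums of first- and second-order gradients in a candidate user classification set, $G, H$ the totals over all samples, and $\lambda, \gamma$ regularization constants:

```latex
\mathrm{Gain} \;=\; \frac{1}{2}\left[
  \underbrace{\frac{G_L^2}{H_L + \lambda}}_{\text{first gain}}
\;+\; \underbrace{\frac{(G - G_L)^2}{(H - H_L) + \lambda}}_{\text{second gain}}
\;-\; \underbrace{\frac{G^2}{H + \lambda}}_{\text{third gain}}
\right] \;-\; \gamma
```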
In some embodiments, partitioning the user data samples into different sets of user classifications includes: the data provider sets a plurality of classification threshold values according to a plurality of user characteristic information contained in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
In some embodiments, instructing the data provider to generate the machine learning model according to the information gains of the different user classifications includes: the tag provider determines the maximum gain information among the information gains of the different user classifications, and instructs the data provider to generate the machine learning model according to whether the maximum gain information is greater than a threshold.
In some embodiments, the machine learning model is a tree model, and instructing the data provider to generate the machine learning model according to whether the maximum gain information is greater than the threshold comprises: when the maximum gain information is greater than the threshold, the tag provider instructs the data provider to split the current node of the tree model, generating child nodes of the current node that serve as the current nodes of the next iteration; when the maximum gain information is less than or equal to the threshold, the tag provider calculates the weight parameter of the node as parameter information plaintext of the tree model. Determining the maximum gain information among the information gains of the different user classifications includes: the tag provider re-determines the maximum gain information for the current nodes of the next iteration, and the above steps are repeated until generation of the tree model is finished.
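A plaintext sketch of the split decision just described (function names, the XGBoost-style gain formula, and the leaf-weight formula −G/(H+λ) are assumptions; in the protocol the sums would be decrypted aggregates, never per-sample values):

```python
LAMBDA, GAMMA = 1.0, 0.1  # regularization constants (assumed)

def split_gain(GL, HL, G, H):
    """XGBoost-style gain for routing one candidate set to the left child."""
    left = GL * GL / (HL + LAMBDA)
    right = (G - GL) ** 2 / (H - HL + LAMBDA)
    whole = G * G / (H + LAMBDA)
    return 0.5 * (left + right - whole) - GAMMA

def decide(candidates, G, H, threshold=0.0):
    """Tag provider: take the maximum gain over candidate classifications;
    instruct a split if it exceeds the threshold, else emit a leaf weight."""
    best = max(split_gain(GL, HL, G, H) for GL, HL in candidates)
    if best > threshold:
        return ("split", best)
    return ("leaf", -G / (H + LAMBDA))  # leaf weight parameter (plaintext)

# Aggregated gradient sums (GL, HL) for two candidate user classifications:
print(decide([(4.0, 2.0), (1.0, 3.0)], G=5.0, H=5.0))  # -> ("split", gain)
```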
According to another embodiment of the present disclosure, there is provided a user data processing method, including: the users are classified according to the user data using a machine learning model generated according to the method of generating a machine learning model in any of the above embodiments.
According to still further embodiments of the present disclosure, there is provided a system for generating a machine learning model, including: the data provider is used for generating a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the label provider, and calculating gradient information ciphertexts of different user classifications according to the label predicted value ciphertext, the label ciphertext of each user data sample sent by the label provider and the user classification to which each user data sample belongs; and the label provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications, and instructs the data provider to generate a machine learning model according to the information gain of different user classifications.
In some embodiments, the generation comprises multiple rounds, and in the T-th round the data provider generates the tag prediction value ciphertext of the (T-1)-th round from the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
In some embodiments, the data provider generates the tag prediction value ciphertext of the T-1 generation process according to the homomorphic addition result of the parameter information ciphertext of the T-1 generation process and the tag prediction value ciphertext of the T-2 generation process.
In some embodiments, the data provider divides the user data samples into different user classification sets according to the plurality of pieces of user characteristic information contained in each sample; the generation comprises multiple rounds, and the data provider calculates the gradient information ciphertext of each user data sample from the tag prediction value ciphertext and the tag ciphertext of each sample in the (T-1)-th round, then calculates, for each user classification set, the weighted sum of the gradient information ciphertexts of the samples in that set as the gradient information ciphertext of that user classification.
In some embodiments, the gradient information ciphertext includes a first order gradient ciphertext and a second order gradient ciphertext.
In some embodiments, the data provider calculates the homomorphic multiplication of the first preset value and the tag ciphertext of each user data sample, then calculates the homomorphic addition of that result and the tag prediction value ciphertext of each sample in the (T-1)-th round to determine the first-order gradient ciphertext.
In some embodiments, the data provider processes the tag prediction value ciphertext of each user data sample in the (T-1)-th round with a sigmoid function to determine a function processing result; the data provider calculates the homomorphic multiplication of the first preset value and the tag ciphertext of each sample; and the homomorphic addition of the function processing result and the homomorphic multiplication result determines the first-order gradient ciphertext.
In some embodiments, the tag provider calculates the information gain of each user class according to the weighted sum of the first-order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the second-order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the first-order gradient ciphertexts of all user data samples, and the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the tag provider calculates the information gain of each user classification from the difference between the weighted sum of a first gain, a second gain and a third gain, and a threshold used to instruct the data provider whether to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the samples in the user classification set and negatively correlated with the weighted sum of their second-order gradient ciphertexts; the second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the samples in the set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the samples in the set; the third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the data provider sets a plurality of classification thresholds according to a plurality of user characteristic information contained in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
In some embodiments, the tag provider determines the maximum gain information among the information gains for different user categories; and instructing the data provider to generate a machine learning model according to whether the maximum gain information is larger than a threshold value.
In some embodiments, the machine learning model is a tree model. When the maximum gain information is greater than the threshold, the tag provider instructs the data provider to split the current node of the tree model, generating child nodes of the current node that serve as the current nodes of the next iteration; when the maximum gain information is less than or equal to the threshold, the tag provider calculates the weight parameter of the node as parameter information plaintext of the tree model; the tag provider then re-determines the maximum gain information for the current nodes of the next iteration, and the above steps are repeated until generation of the tree model is finished.
According to still further embodiments of the present disclosure, there is provided a processing apparatus of user data, including: a classification unit configured to classify the user using a machine learning model based on the user data, the machine learning model being generated according to the method for generating a machine learning model according to any one of the embodiments.
In some embodiments, there is provided an electronic device comprising: a memory; and a processor coupled to the memory, the processor configured to perform, based on instructions stored in the memory, the method of generating a machine learning model or the method of processing user data in any of the above embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a machine learning model or the method of processing user data in any of the above embodiments.
In the above embodiment, the data provider and the tag provider realize the secure calculation of the information gain through the interaction of the parameter information ciphertext, the tag prediction value ciphertext and the gradient information ciphertext. Thus, the privacy of the data provider and the tag provider can be ensured, and the security is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of some embodiments of a method of generation of a machine learning model of the present disclosure;
FIG. 2 illustrates a flow diagram of some embodiments of a method of processing user data of the present disclosure;
FIG. 3 illustrates a schematic diagram of some embodiments of a method of generating a machine learning model of the present disclosure;
FIG. 4a illustrates a block diagram of some embodiments of a generation system of a machine learning model of the present disclosure;
FIG. 4b illustrates a block diagram of some embodiments of a generation system of a machine learning model of the present disclosure;
FIG. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure;
fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
As previously mentioned, the present disclosure contemplates joint modeling between two parties in a new scenario where one party holds only the data sample labels and the other party holds the data sample features. The AP (Active Party), as the label provider, i.e., the business party, holds only the data labels and hopes to model jointly with the data of other organizations; it is usually the active party of federated learning. The organization that provides data to the AP side is the PP (Passive Party), i.e., the data provider, usually the passive party of federated learning. This application scenario is referred to as the single-side submodel scenario for short.
Compared with existing application scenarios, the single-side submodel scenario places a new requirement on the interpretability of the jointly constructed federated XGB model.
On the one hand, to protect the data privacy of the PP side, the AP side cannot obtain the predicted value ŷᵢ corresponding to a specific sample. The reason is that, to obtain interpretability, the AP side needs to know the characteristic meaning of the model nodes; if the AP side then also obtained the predicted value ŷᵢ of a sample, it would obtain the prediction path of that sample through the model, which exposes the characteristic information of each sample and thus allows the PP side's sample privacy information to be deduced.
On the other hand, to protect the data privacy of the AP side, the PP side cannot obtain the AP side's private data, namely the sample labels and the leaf node weight information of the model, because the model's leaf node weights contain the AP side's sample label information.
However, methods such as SecureBoost cannot satisfy both requirements in the single-side submodel scenario. Although SecureBoost and similar methods keep the model weight parameters on the AP side, the AP side still needs the prediction value of each individual sample to calculate the information gain. Since the AP side knows the splitting features of the whole model, it can deduce the PP side's sample data from individual sample prediction values, so the data privacy of the AP and PP sides cannot both be guaranteed, which limits the application scenarios of federated learning.
That is to say, when methods such as SecureBoost are used for interpretable joint XGB (eXtreme Gradient Boosting) modeling in the single-side submodel scenario, one side needs the individual sample label y and the individual sample prediction value ŷ to calculate the information gain, and then uses the information gain for model splitting and training. If the AP side calculates the information gain, it can infer the PP side's sample privacy information from individual sample prediction values, violating the requirement to protect the PP side's data feature privacy; if the PP side calculates the information gain, it obtains both the sample prediction values and the sample labels, violating the requirement to protect the privacy of the AP side's data labels.
Aiming at the problem that the related art cannot achieve interpretable joint modeling while protecting the private information of both parties in the single-side submodel scenario, the present disclosure provides an interpretable privacy-preserving federated learning scheme for the single-side submodel.
In summary, the information gain formula is decomposed, and a secure multi-party information gain calculation protocol for the single-side submodel scenario is designed; an interpretable single-side submodel federated learning scheme is then built on this protocol. Model splitting and training that remain interpretable are thus completed in the single-side submodel scenario without disclosing either party's privacy. For example, the technical solution of the present disclosure can be realized by the following embodiments.
Fig. 1 illustrates a flow diagram of some embodiments of a method of generation of a machine learning model of the present disclosure.
As shown in fig. 1, in step 110, the data provider generates a tag prediction value ciphertext of each user data sample according to a parameter information ciphertext of the machine learning model transmitted by the tag provider (i.e., the service provider).
For example, let the set of user data samples be X ∈ R^(n×d), and the label set of the user data samples be y = {y₁, y₂, …, y_n} ∈ R^n. The T-th round of the generation process generates the T-th tree model Tree_T; that is, each round generates one tree model, together with a set of label prediction values ŷ^T = {ŷ₁, ŷ₂, …, ŷ_n}. The label provider encrypts y to generate the label ciphertext set [y] = {[y₁], [y₂], …, [y_n]} and sends [y] to the data provider.
In some embodiments, the data provider obtains the encryption public key pk generated by the label provider and the correspondence between sample prediction values and model leaf nodes; the label provider generates an encryption key pair (sk, pk) and holds the parameter information W^(T-1) of the (T-1)-th tree and the sample labels y = {y₁, y₂, …, y_n}, where W^(T-1) is the weight information of the leaf nodes of the (T-1)-th tree.
For example, the label provider uses the public key pk to generate the sample label ciphertext [y] = Enc_pk(y) and the parameter information ciphertext [W^(T-1)] = Enc_pk(W^(T-1)) of the (T-1)-th tree, and sends [y] and [W^(T-1)] to the data provider.
In some embodiments, the initialized Tree_T contains only the root node Node-0, and Node-i is generated in the i-th iteration of the T-th round. Each Node-i maintains the following information: its characteristic meaning k_i, its splitting policy split_i(), and its sample set R_i. The sample set R_0 of Node-0 is initialized to contain all user data samples, and the set of nodes to be split is L = {Node-0}.
For example, the characteristic meaning k_i is disclosed to the label provider, while the splitting policy split_i() and the sample set R_i are kept by the data provider and not disclosed to the label provider.
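The per-node bookkeeping just described can be sketched as a small data structure (field names are illustrative assumptions); only the characteristic meaning k_i would be shared with the label provider:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    node_id: int
    feature_meaning: Optional[str] = None           # k_i: disclosed to the label provider
    split: Optional[Callable[[list], bool]] = None  # split_i(): kept by the data provider
    samples: List[int] = field(default_factory=list)  # R_i: kept by the data provider

# Initialization for round T: the root Node-0 holds all user data samples
# and is the only node in the to-split set L.
n_samples = 6
root = Node(node_id=0, samples=list(range(n_samples)))
to_split = [root]
print(len(to_split[0].samples))  # 6
```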
In some embodiments, the generation method comprises multiple rounds, and in the T-th round the data provider generates the tag prediction value ciphertext of the (T-1)-th round from the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
For example, the data provider generates the tag prediction value ciphertext of the (T-1)-th round as the homomorphic addition of the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
For example, the data provider calculates the sample prediction value ciphertext [ŷ^(T-1)] using the above correspondence. For example, the tree model is a binary tree, and the correspondence between sample prediction values and model leaf nodes consists of the weight parameters of the left and right child nodes of the current node together with the child node into which each sample is finally classified; from this correspondence the sample prediction value can be obtained. In this way, the data provider can perform the computation using the correspondence and the ciphertexts sent by the label provider without obtaining the plaintext weight parameters of the model's nodes.
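A sketch of how the data provider can assemble prediction values from this correspondence without seeing plaintext weights: the encrypted child weights are treated as opaque tokens (plain strings here) that the provider's own split function routes to each sample (all names are illustrative):

```python
def route_predictions(samples, split, enc_left_w, enc_right_w):
    """Data provider: assign each sample the ciphertext of the weight of the
    child node it falls into, without ever decrypting those weights."""
    return [enc_left_w if split(x) else enc_right_w for x in samples]

# The split rule (feature 0 <= 3.0) is held by the data provider; the
# encrypted child weights come from the label provider and stay opaque.
samples = [[1.0], [5.0], [2.5]]
preds = route_predictions(samples, lambda x: x[0] <= 3.0,
                          enc_left_w="[w_left]", enc_right_w="[w_right]")
print(preds)  # ['[w_left]', '[w_right]', '[w_left]']
```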
In step 120, the data provider calculates gradient information ciphertexts of different user classifications according to the tag prediction value ciphertexts, the tag ciphertexts of each user data sample sent by the tag provider, and the user classification to which each user data sample belongs.
In some embodiments, the data provider divides each user data sample into different user classification sets according to a plurality of user characteristic information contained in each user data sample.
In some embodiments, the data provider sets a plurality of classification thresholds according to a plurality of user characteristic information contained in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
For example, for each user characteristic information dimension k = 1, …, d of the data provider, the set of user data samples X is divided into q-1 user classification sets (which may be referred to as binning) according to q classification thresholds. In this way, the user classification sets I_1, …, I_{q-1} corresponding to each user characteristic information are obtained.
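The binning step above can be sketched in Python. The function name and list-based representation are illustrative, not from the original scheme; q sorted thresholds (with the outermost two acting as range bounds) yield q-1 sets, as stated in the text:

```python
def bin_samples(feature_values, thresholds):
    """Divide sample indices into q-1 user classification sets
    using q sorted classification thresholds."""
    bins = [[] for _ in range(len(thresholds) - 1)]
    for i, v in enumerate(feature_values):
        for j in range(len(thresholds) - 1):
            if thresholds[j] <= v < thresholds[j + 1]:
                bins[j].append(i)   # sample i falls into set I_{j+1}
                break
    return bins
```

For example, `bin_samples([1, 5, 9], [0, 4, 8, 12])` groups the three samples into three index sets I_1, I_2, I_3.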
In some embodiments, the generation method comprises multiple rounds of generation. In the T-th round, the data provider calculates the gradient information ciphertext of each user data sample according to the tag prediction value ciphertext of each user data sample in the (T-1)-th round and the tag ciphertext of each user data sample; and respectively calculates the weighted sums of the gradient information ciphertexts of the user data samples in each user classification set, as the gradient information ciphertexts of the different user classifications.
In some embodiments, the gradient information ciphertext includes a first-order gradient ciphertext and a second-order gradient ciphertext. For example, the data provider uses [y] and [ŷ_i^(T-1)] to compute the first-order gradient ciphertext [g_i] and the second-order gradient ciphertext [h_i] of each user data sample X_i.
In some embodiments, the data provider calculates a homomorphic multiplication result of a first preset value (e.g., -1) and the tag ciphertext of each user data sample; and the data provider calculates homomorphic addition results of the label predicted value ciphertext and homomorphic multiplication results of the user data samples in the T-1 generation process, and determines a first-order gradient ciphertext.
For example, if the mean square error is taken as the loss function, the first-order gradient and the second-order gradient are respectively:

g_i = ŷ_i^(T-1) − y_i,  h_i = 1

The data provider can then calculate the first-order gradient ciphertext and the second-order gradient ciphertext:

[g_i] = [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1]

where ⊖ denotes homomorphic subtraction and ⊗ denotes homomorphic multiplication (multiplication of a ciphertext by a plaintext number).
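The homomorphic operations ⊖ and ⊗ used above can be illustrated with a toy Paillier implementation. This is a sketch with insecurely small primes and illustrative function names, purely to show the mechanics; a real deployment would use a vetted cryptographic library:

```python
import random
from math import gcd

# Toy Paillier parameters -- tiny primes, NOT secure.
p, q = 61, 53
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = n + 1

def enc(m):
    """Encrypt plaintext m with the public key (n, g)."""
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt with the private key lam; L(x) = (x - 1) // n."""
    l = (pow(c, lam, n2) - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (l * mu) % n

def h_add(c1, c2):
    """Homomorphic addition: Enc(a) * Enc(b) mod n^2 = Enc(a + b)."""
    return (c1 * c2) % n2

def h_mul_const(c, k):
    """Number multiplication: Enc(a)^k mod n^2 = Enc(k * a)."""
    return pow(c, k % n, n2)
```

With this, [g_i] = [ŷ_i] ⊖ [y_i] reduces to `h_add(ct_pred, h_mul_const(ct_label, -1))`: the data provider never sees the plaintexts.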
In some embodiments, the data provider processes the tag prediction value ciphertext of each user data sample in the (T-1)-th generation round by using a sigmoid function, and determines a function processing result; the data provider calculates the homomorphic multiplication result of the first preset value and the tag ciphertext of each user data sample; and calculates the homomorphic addition result of the function processing result and the homomorphic multiplication result to determine the first-order gradient ciphertext.
For example, if the cross entropy is used as the loss function, the first-order gradient and the second-order gradient obtained from the above calculation are:

g_i = Sigmoid(ŷ_i^(T-1)) − y_i,  h_i = Sigmoid(ŷ_i^(T-1)) · (1 − Sigmoid(ŷ_i^(T-1)))

The data provider can approximate the Sigmoid function by its Taylor expansion at 0, Sigmoid(x) ≈ 1/2 + x/4, to find the first-order and second-order gradient ciphertexts:

[g_i] = [1/2] ⊕ (1/4) ⊗ [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1/4]
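The quality of the Taylor approximation of the sigmoid near 0 can be checked numerically. A small illustrative sketch (function names are not from the original scheme):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_taylor(x):
    # First-order Taylor expansion of the sigmoid around 0.
    return 0.5 + x / 4.0
```

Near 0 the two agree closely, which is why the encrypted gradient g_i ≈ 1/2 + ŷ/4 − y can be evaluated with only homomorphic addition and number multiplication.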
In step 130, the tag provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications.
In some embodiments, the tag provider calculates the information gain of each user classification according to the weighted sum of the first-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the second-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the first-order gradient ciphertexts of all user data samples, and the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the data provider computes, for the user classification set I_j, the sub first-order gradient aggregation value ciphertext [G_j] = Σ_{i∈I_j}[g_i] and the sub second-order gradient aggregation value ciphertext [H_j] = Σ_{i∈I_j}[h_i], as well as the total first-order gradient aggregation value ciphertext [G] = Σ_{i∈I}[g_i] and the total second-order gradient aggregation value ciphertext [H] = Σ_{i∈I}[h_i], and sends them to the tag provider.
In some embodiments, the tag provider calculates the information gain of each user classification according to the difference between the weighted sum of a first gain, a second gain, and a third gain and a threshold, where the threshold is used for instructing the data provider to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the user data samples in the user classification set, and negatively correlated with the weighted sum of their second-order gradient ciphertexts. The second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set. The third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples, and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
For example, the tag provider decrypts the sub first-order gradient aggregation value ciphertext by using the private key sk to obtain G_j = Dec_sk([G_j]); decrypts the sub second-order gradient aggregation value ciphertext to obtain H_j = Dec_sk([H_j]); decrypts the total first-order gradient aggregation value ciphertext to obtain G; and decrypts the total second-order gradient aggregation value ciphertext to obtain H. Using the decryption results, the tag provider calculates the first gain G_j²/(H_j+λ), the second gain (G−G_j)²/(H−H_j+λ), and the third gain G²/(H+λ), and further calculates the information gain of the user classification set j:

Gain_j = (1/2) · [G_j²/(H_j+λ) + (G−G_j)²/(H−H_j+λ) − G²/(H+λ)]
λ is an adjustable parameter. The maximum gain information MaxGain is determined from the individual Gain_j values.
In step 140, the tag provider instructs the data provider to generate the machine learning model based on the information gains of the different user classifications.
In some embodiments, the tag provider determines the maximum gain information among the information gains for different user categories; and instructing the data provider to generate a machine learning model according to whether the maximum gain information is larger than a threshold value gamma.
In some embodiments, the machine learning model is a tree model, and the tag provider instructs the data provider to split a current node of the tree model to generate child nodes of the current node as a current node of a next iteration when the maximum gain information is greater than a threshold; the label provider calculates the weight parameters of the child nodes as parameter information plaintext of the tree model under the condition that the maximum gain information is less than or equal to the threshold value; the label provider re-determines the maximum gain information according to the current node of the next iteration; and repeating the steps until the generation of the tree model is finished.
For example, the iterative process in the T-th round of the generation process can be realized as follows:

While L is not NULL do
    Select any Node-i ∈ L, and obtain the maximum information gain MaxGain by using the secure multi-party information gain calculation protocol
    If MaxGain > γ then
        Split node i according to the splitting rule corresponding to the maximum information gain, obtaining two child nodes i_L and i_R and their corresponding sample sets R_{i_L} and R_{i_R}
        Update the tree model Tree_T, remove node i from L, and add i_L and i_R into the set L to be split
    else
        Do not split, and remove node i from L
    end
end
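A minimal Python sketch of this splitting loop, assuming a `best_split` callback that stands in for the secure multi-party gain protocol (all names and the dict-based tree representation are illustrative):

```python
def grow_tree(root_samples, best_split, gamma, max_nodes=64):
    """Iteratively split nodes while the maximum gain exceeds gamma.

    best_split(samples) is assumed to run the secure gain protocol and
    return (max_gain, left_samples, right_samples).
    """
    tree = {0: root_samples}      # node id -> sample set R_i
    to_split = [0]                # the set L of nodes to be split
    next_id = 1
    while to_split and next_id < max_nodes:
        node = to_split.pop()
        max_gain, left, right = best_split(tree[node])
        if max_gain > gamma:
            tree[next_id], tree[next_id + 1] = left, right
            to_split += [next_id, next_id + 1]
            next_id += 2
        # else: node becomes a leaf and is simply dropped from L
    return tree
```

With a toy `best_split` that halves a sample list and reports its length as the gain, a four-sample root grows into a full depth-2 tree of seven nodes.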
In some embodiments, the tag provider calculates the parameter information of the T-th round according to G_j and H_j. For example, the optimal weight of each leaf node is:

w*_j = −G_j / (H_j + λ)
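The leaf weight formula is a one-liner over the decrypted aggregates; an illustrative sketch with assumed names:

```python
def leaf_weight(G_j, H_j, lam):
    """Optimal leaf weight w*_j = -G_j / (H_j + lam), computed by the
    tag provider from the decrypted gradient aggregation values."""
    return -G_j / (H_j + lam)
```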
In the above embodiment, the calculation processes of the first-order gradient g (the first-order gradient term of the loss function) and the second-order gradient h (the second-order gradient term of the loss function), on which the information gain in XGB depends, are decomposed.
By using homomorphic encryption, in the scenario where the tag provider only has the data labels and the data provider has the data features, the tag provider provides the sample label ciphertexts and the model weight parameter ciphertexts required for the calculation; the data provider calculates the sample prediction values using the weight ciphertexts, and further calculates the g and h ciphertext aggregation values of the different splitting schemes. In this manner, the tag provider and the data provider jointly and securely calculate the information gain (Gain); finally, the tree is split in this way to complete the modeling process.
Therefore, under the condition that privacy information of both parties is not exposed in the model training and predicting stage, the label provider can acquire the split feature meaning and the model weight parameter, model interpretability and data privacy are guaranteed, and a safe and reliable interpretable federal learning scheme is provided for a new practical application scene.
In the above, a two-party joint scenario is taken as an example for explanation; the technical solution of the present disclosure may also be applied to multi-party joint scenarios with more than two parties.
Fig. 2 illustrates a flow diagram of some embodiments of a method of processing user data of the present disclosure.
As shown in fig. 2, in step 210, user data is obtained. In step 220, the users are classified according to the user data using a machine learning model, which is generated according to the generation method of the machine learning model in any of the above embodiments.
The tag provider encrypts its sample label set y = {y_1, y_2, ..., y_n} and sends the encryption result [y] = {[y_1], [y_2], ..., [y_n]} to the data provider.
In some embodiments, during the training of the tree model, the tag provider also encrypts the leaf node weights w_j^(T) of the model (taking a single-layer binary tree as an example, where w_j^(T) represents the weight of the j-th leaf node of the T-th tree), and sends the encryption results [w_1^(T)], [w_2^(T)], … in sequence to the data provider.
Since the data provider holds all the data features in the single-side model scenario, it can know which leaf node weight the predicted value of each training sample is equal to. According to the weight ciphertexts of the leaf nodes sent by the tag provider, the tag prediction value ciphertext [ŷ_i] of each training sample on each tree model can be obtained.
In some embodiments, the data provider may combine the sample tag ciphertext [y] provided by the tag provider and the obtained tag prediction value ciphertext [ŷ_i^(T-1)], and obtain the first-order gradient ciphertext and the second-order gradient ciphertext according to the first-order and second-order gradient calculation formulas of the different loss functions.
For example, for complex loss functions involving operations such as exponentiation that exceed the capability of the homomorphic encryption algorithm, techniques such as Taylor expansion are used to approximately convert them into simple linear function calculations.
And finally, the data provider aggregates the first-order gradient ciphertext and the second-order gradient ciphertext according to different splitting schemes, and sends the ciphertext to the tag provider. The label provider decrypts and calculates the information gain of different splitting schemes; and selecting an optimal splitting scheme according to the information gain, and informing a data provider to split according to the scheme to complete single tree node splitting training iteration. And repeating the leaf node splitting process to complete the construction of the tree model.
According to the embodiment, the encryption first-order gradient and the encryption second-order gradient can be calculated under the condition that the data provider does not know the weight plaintext of the leaf node, and the data privacy safety of the tag provider is guaranteed. Moreover, the tag provider can only obtain the aggregation value, and cannot obtain a single predicted value, so that the data privacy of the data provider is also ensured.
In some embodiments, the protocol first has two participants: the tag provider only owns a label set Y = {y_1, …, y_{n_A}} containing n_A data labels, and hopes to borrow data from other organizations (data providers) for joint modeling, and is therefore usually the initiator of federated learning; the data provider owns a data set X = {X_1, …, X_{n_P}} containing n_P pieces of data, each with d-dimensional features X_i = (x_1, x_2, x_3, …, x_d).
In some embodiments, the technical solution of the present disclosure comprises two stages.
For example, in the first stage, encrypted sample alignment is performed. Since the participants in a vertical federated scenario perform federated learning using different data features of common samples, sample alignment is required first; that is, the common sample IDs owned by the tag provider and the data provider are found. This part can be implemented by existing mature private set intersection protocols. After the private set intersection, the two parties obtain an intersection of n data samples.
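Plain salted-hash alignment of common IDs can be sketched as follows. Note this shows only the alignment logic, not a privacy-preserving PSI protocol (a real PSI, e.g. a DH-based one, would prevent either side from learning non-intersecting items); all names are illustrative:

```python
import hashlib

def blind(ids, salt):
    """Hash each ID with a shared salt; maps digest -> original ID."""
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

def align(tag_ids, data_ids, salt="shared-salt"):
    """Return the sorted common sample IDs of the two parties."""
    a, b = blind(tag_ids, salt), blind(data_ids, salt)
    return sorted(a[h] for h in a.keys() & b.keys())
```

After alignment, both parties index their labels and features by the same n common IDs.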
For example, in the second stage, federated model training is performed. Federated modeling starts after the encrypted sample alignment step is completed.
Fig. 3 illustrates a schematic diagram of some embodiments of a method of generating a machine learning model of the present disclosure.
As shown in fig. 3, in step 0, the tag provider generates a homomorphic encryption (e.g., Paillier algorithm) public key pk and private key sk, and broadcasts the public key pk. The tag provider encrypts the n data labels y in the intersection by using the public key pk, Enc_pk(y), to generate the ciphertext [y], and sends the public key pk and the generated ciphertext to the data provider.
In step 1, after the data provider determines all the sample IDs of the node to be split (the current node), q classification thresholds S_k = {s_k1, …, s_kq} are determined according to the d-dimensional features of the samples; and all samples are divided into q-1 user classification sets according to the q classification thresholds, which serve as the screening basis for the tree splitting nodes.
Taking the T-th round of the generation process, which constructs the T-th tree, as an example, the training stage mainly comprises the following steps.
In step 2, the tag provider uses the public key pk to encrypt the leaf node weights w_j^(T-1) of the tree model obtained in the previous round of the generation process (taking a single-layer binary tree as an example), and sends the encryption results [w_1^(T-1)], [w_2^(T-1)], … in sequence to the data provider.
In step 3, the data provider obtains the correspondence between the sample prediction values and the leaf nodes of the tree model according to the tree model construction process in the previous round of training.
In step 4, the data provider performs a homomorphic addition operation on the encrypted prediction value result of the previous ((T-2)-th) round and the encryption results received in step 2, and, combining the correspondence between the sample prediction values and the leaf nodes of the tree model, obtains the encrypted prediction value [ŷ_i^(T-1)] of each sample in the current round. For example, the calculation is as follows:

[ŷ_i^(T-1)] = [ŷ_i^(T-2)] ⊕ [w_j^(T-1)]

where j is the leaf node into which sample i falls, and ⊕ denotes homomorphic addition, i.e., an addition operation on ciphertexts whose result is equal to the encryption of the sum of the plaintexts.
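The round update can be mirrored with a toy additive-homomorphic stand-in. This is purely illustrative (a fixed offset is not encryption); real schemes such as Paillier behave analogously for addition:

```python
KEY = 10_000  # shared secret offset standing in for real encryption

def enc(m):
    return m + KEY

def dec(c):
    return c - KEY

def h_add(c1, c2):
    # Enc(a) "⊕" Enc(b) = Enc(a + b), mirroring additive homomorphism
    return c1 + c2 - KEY

y_prev = enc(3)           # [ŷ^(T-2)]: previous round's encrypted prediction
w = enc(2)                # [w_j^(T-1)]: encrypted leaf weight from the tag provider
y_new = h_add(y_prev, w)  # [ŷ^(T-1)]: current round's encrypted prediction
```

The data provider performs the update entirely on "ciphertexts"; only the tag provider, holding the key, could recover the plaintext prediction.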
In step 5, the data provider uses the encrypted sample labels [y] and the encrypted sample prediction values [ŷ_i^(T-1)] to calculate the first-order gradient ciphertext [g_i] and the second-order gradient ciphertext [h_i] of each sample i.
In some embodiments, since only addition, subtraction, and number multiplication operations appear in the above calculation expressions, the data provider can directly perform the calculation by using the additive homomorphism and the number-multiplication homomorphism of the ciphertexts to obtain the corresponding ciphertexts.
For example, if the mean square error is taken as the loss function, the first-order gradient and the second-order gradient are respectively:

g_i = ŷ_i^(T-1) − y_i,  h_i = 1

The data provider can then calculate the first-order gradient ciphertext and the second-order gradient ciphertext:

[g_i] = [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1]

where ⊖ denotes homomorphic subtraction and ⊗ denotes homomorphic multiplication (multiplication of a ciphertext by a plaintext number).
In some embodiments, for more complex operations such as power operations, rational number division operations, etc., the expression may be subjected to second-order approximation by using a taylor expansion, and converted into a homomorphic encryption operable form.
For example, if the cross entropy is used as the loss function, the first-order gradient and the second-order gradient obtained from the above calculation are:

g_i = Sigmoid(ŷ_i^(T-1)) − y_i,  h_i = Sigmoid(ŷ_i^(T-1)) · (1 − Sigmoid(ŷ_i^(T-1)))

The data provider can approximate the Sigmoid function by its Taylor expansion at 0, Sigmoid(x) ≈ 1/2 + x/4, to find the first-order and second-order gradient ciphertexts:

[g_i] = [1/2] ⊕ (1/4) ⊗ [ŷ_i^(T-1)] ⊖ [y_i],  [h_i] = [1/4]
In step 6, based on the user classification sets obtained in step 1 and the encrypted first-order and second-order gradients of the individual samples obtained in step 5, the encrypted aggregation values of the first-order and second-order gradients of each user classification set, [G_j] = Σ_{i∈I_j}[g_i] and [H_j] = Σ_{i∈I_j}[h_i], are calculated by utilizing the additive homomorphism of the encryption algorithm, and the results are sent back to the tag provider for decryption.
In step 7, the tag provider calculates the information gain Gain of each candidate split of each feature of the data provider according to the decrypted first-order gradient aggregation value G_j = Dec_sk([G_j]) and second-order gradient aggregation value H_j = Dec_sk([H_j]) of each user classification set of the data provider.
The tag provider returns the scheme serial number corresponding to the maximum information gain to the data provider, and the data provider recovers the optimal splitting feature and the feature threshold from that serial number. This data is kept opaque to the tag provider during the training process to ensure the security of the data provider's private data.
If the information gain of the optimal split is smaller than the threshold γ, the current node is not split; otherwise, the feature corresponding to the maximum gain information value is the optimal splitting feature, and the current node is split according to the corresponding feature and threshold. Finally, the data provider (the passive party, PP) needs to calculate the sample sets of the two resulting leaf nodes, according to which the next leaf nodes are split.
In step 8, for each leaf node of the T-th tree, step 7 is executed in a loop until a stop splitting condition is reached, i.e. all leaf nodes can no longer be split or the depth of the tree reaches a set maximum depth.
Since the tag provider (the active party, AP) holds the first-order and second-order gradient aggregation values of the optimal split of each node, the optimal weight of each leaf node can be calculated:

w*_j = −G_j / (H_j + λ)

At this point, the construction of the T-th tree is completed.
In step 9, after the tag provider completes the training of the T-th tree, it encrypts the leaf node weights w_j^(T) of the T-th tree and sends them to the data provider, and the single-tree training process from step 1 to step 9 is repeated until all M trees are built. The resulting complete gradient boosting tree model is:

ŷ_i = Σ_{t=1}^{M} f_t(X_i)
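The final ensemble prediction is a plain sum over the per-tree predictions, as in the formula above. An illustrative sketch with stand-in trees (the lambda leaf functions are hypothetical, not from the scheme):

```python
def predict(trees, x):
    """Gradient-boosted prediction: ŷ = sum of f_t(x) over all M trees."""
    return sum(tree(x) for tree in trees)

# Three stand-in trees, each mapping a feature value to a leaf weight.
trees = [lambda x: 0.5, lambda x: 0.25 * x, lambda x: -0.1]
```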
The single-side model is a very important business scenario. For this scenario, the present disclosure provides a secure and interpretable federated XGB modeling scheme.
In the above embodiment, the tag provider is enabled to maintain the feature meanings and the model weight parameters, while it is ensured that the tag provider cannot infer the data of the data provider. The model weight parameters are kept at the tag provider and kept secret from the data provider, so that the data provider cannot infer the private information of the tag provider through the model weight parameters. The tag provider cannot learn the sample prediction values of the data provider, so it cannot infer the private information of the data provider through the sample prediction values. Thereby, a reliable interpretable model is achieved.
In the above embodiment, a secure multiparty information gain calculation protocol is designed. Through a calculation formula of first-order gradient and second-order gradient in the information gain, a label provider provides a sample label and a model weight parameter ciphertext required by calculation; and the data provider calculates the sample prediction value by using the weight ciphertext and further calculates the information gain jointly and safely in a mode of g ciphertext aggregation value and h ciphertext aggregation value of different splitting schemes.
In this way, the calculation can be performed for different loss functions; functions that cannot be directly calculated with additive homomorphic encryption can be handled with approximation techniques such as Taylor expansion, and the protocol can be extended to multi-party scenarios.
Fig. 4a illustrates a block diagram of some embodiments of a generation system of a machine learning model of the present disclosure.
As shown in fig. 4a, the generation system 4a of the machine learning model includes: the data provider 41a, configured to generate the tag prediction value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the tag provider, and to calculate the gradient information ciphertexts of the different user classifications according to the tag prediction value ciphertexts, the tag ciphertexts of the user data samples sent by the tag provider, and the user classification to which each user data sample belongs; and the tag provider 42a, configured to calculate the information gains of the different user classifications according to their gradient information ciphertexts, and to instruct the data provider to generate the machine learning model according to these information gains.
In some embodiments, the generation method comprises multiple rounds of generation. In the T-th round, the data provider 41a generates the tag prediction value ciphertext of the (T-1)-th round according to the parameter information ciphertext of the (T-1)-th round and the tag prediction value ciphertext of the (T-2)-th round.
In some embodiments, the data provider 41a generates the tag prediction value ciphertext of the T-1 th round of generation process according to the result of homomorphic addition of the parameter information ciphertext of the T-1 th round of generation process and the tag prediction value ciphertext of the T-2 th round of generation process.
In some embodiments, the data provider 41a divides each user data sample into different user classification sets according to a plurality of user characteristic information included in each user data sample; the generation method comprises multiple generation processes, and a data provider 41a calculates gradient information ciphertexts of user data samples according to the label predicted value ciphertexts of the user data samples in the T-1 generation process and the label ciphertexts of the user data samples in the T-1 generation process; and respectively calculating the weighted sum of the gradient information ciphertexts of the user data samples in each user classification set to serve as the gradient information ciphertexts of different user classifications.
In some embodiments, the gradient information ciphertext includes a first order gradient ciphertext and a second order gradient ciphertext.
In some embodiments, the data provider 41a calculates a homomorphic multiplication result of the first preset value and the tag ciphertext of each user data sample; and the data provider calculates homomorphic addition results of the label predicted value ciphertext and homomorphic multiplication results of the user data samples in the T-1 generation process, and determines a first-order gradient ciphertext.
In some embodiments, the data provider 41a processes the tag prediction value ciphertext of each user data sample in the (T-1)-th generation round by using a sigmoid function, and determines a function processing result; the data provider calculates the homomorphic multiplication result of the first preset value and the tag ciphertext of each user data sample; and calculates the homomorphic addition result of the function processing result and the homomorphic multiplication result to determine the first-order gradient ciphertext.
In some embodiments, the tag provider 42a calculates the information gain for each user class based on the weighted sum of the first order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the second order gradient ciphertexts of each user data sample in each user class set, the weighted sum of the first order gradient ciphertexts of all user data samples, and the weighted sum of the second order gradient ciphertexts of all user data samples.
In some embodiments, the tag provider 42a calculates the information gain of each user classification according to the difference between the weighted sum of a first gain, a second gain, and a third gain and a threshold, where the threshold is used for instructing the data provider to generate the machine learning model. The first gain is positively correlated with the weighted sum of the first-order gradient ciphertexts of the user data samples in the user classification set, and negatively correlated with the weighted sum of their second-order gradient ciphertexts. The second gain is positively correlated with the difference between the weighted sum of the first-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set, and negatively correlated with the difference between the weighted sum of the second-order gradient ciphertexts of all user data samples and that of the user data samples in the user classification set. The third gain is negatively correlated with the weighted sum of the first-order gradient ciphertexts of all user data samples, and positively correlated with the weighted sum of the second-order gradient ciphertexts of all user data samples.
In some embodiments, the data provider 41a sets a plurality of classification thresholds according to a plurality of user characteristic information included in each user data sample; and dividing each user data sample into different user classification sets according to a plurality of classification threshold values.
In some embodiments, the tag provider 42a, determines the maximum gain information among the information gains for different user categories; and instructing the data provider to generate a machine learning model according to whether the maximum gain information is larger than a threshold value.
In some embodiments, the machine learning model is a tree model, and the tag provider 42a instructs the data provider to split the current node of the tree model to generate child nodes of the current node as the current node of the next iteration if the maximum gain information is greater than the threshold; the label provider calculates the weight parameters of the child nodes as parameter information plaintext of the tree model under the condition that the maximum gain information is less than or equal to the threshold value; the label provider 42a determines the maximum gain information again according to the current node of the next iteration; and repeating the steps until the generation of the tree model is finished.
Fig. 4b illustrates a block diagram of some embodiments of a processing apparatus of user data of the present disclosure.
As shown in fig. 4b, the user data processing apparatus 4b includes: a classification unit 41b for classifying the user according to the user data using a machine learning model, the machine learning model being generated according to the generation method of the machine learning model in any of the above embodiments.

In some embodiments, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method of generating a machine learning model, or the method of processing user data, in any of the above embodiments, based on instructions stored in the memory.
Fig. 5 illustrates a block diagram of some embodiments of an electronic device of the present disclosure.
As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute a method of generating a machine learning model or a method of processing user data in any one of the embodiments of the present disclosure based on instructions stored in the memory 51.
The memory 51 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, a database, and other programs.
Fig. 6 shows a block diagram of further embodiments of the electronic device of the present disclosure.
As shown in fig. 6, the electronic apparatus 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, wherein the processor 620 is configured to execute a method of generating a machine learning model or a method of processing user data in any of the above embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, and other programs.
The electronic device 6 may also include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, through a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, a microphone, and a speaker. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
As will be appreciated by one of skill in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media having computer-usable program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
So far, a generation method of a machine learning model, a generation system of a machine learning model, a processing method of user data, a processing apparatus of user data, an electronic device, and a nonvolatile computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (17)

1. A method of generating a machine learning model, comprising:
the data provider generates a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the label provider;
the data provider calculates gradient information ciphertexts of different user classifications according to the label predicted value ciphertexts, the label ciphertexts of the user data samples sent by the label provider and the user classifications to which the user data samples belong;
the label provider calculates the information gain of different user classifications according to the gradient information ciphertexts of different user classifications;
and the label provider instructs the data provider to generate the machine learning model according to the information gain of the different user classifications.
2. The generation method of claim 1, wherein the generation method comprises a plurality of generation processes,
the generating of the tag prediction value ciphertext of each user data sample comprises:
in the T-th generation process, the data provider generates the tag prediction value ciphertext of the T-1 generation process according to the parameter information ciphertext of the T-1 generation process and the tag prediction value ciphertext of the T-2 generation process.
3. The generation method according to claim 2, wherein the generating of the tag prediction value ciphertext of the T-1 th round generation process comprises:
and the data provider generates the tag prediction value ciphertext of the T-1 generation process according to the homomorphic addition result of the parameter information ciphertext of the T-1 generation process and the tag prediction value ciphertext of the T-2 generation process.
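The homomorphic addition in claim 3 can be illustrated with a minimal, deliberately insecure Paillier-style sketch (the toy primes, key size, and function names are assumptions for illustration; a real deployment would use a vetted additively homomorphic library and full-size keys):

```python
import math
import random

# Toy Paillier parameters (insecure key size, for illustration only)
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def hom_add(c1, c2):
    # homomorphic addition: Dec(c1 * c2 mod n^2) = m1 + m2
    return (c1 * c2) % n2

# Claim 3: the prediction ciphertext of round T-1 is the homomorphic sum of the
# parameter information ciphertext and the prediction ciphertext of round T-2.
pred_prev = encrypt(30)   # tag prediction value ciphertext, round T-2
param = encrypt(12)       # parameter information ciphertext, round T-1
pred_new = hom_add(pred_prev, param)
assert decrypt(pred_new) == 42
```

Because only additions (and scalar multiplications) are needed, the data provider can update encrypted predictions without ever learning the label provider's plaintext parameters.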
4. The generation method of claim 1, further comprising:
the data provider divides each user data sample into different user classification sets according to a plurality of user characteristic information contained in each user data sample;
the generation method comprises a plurality of generation processes, and the step of calculating gradient information ciphertexts of different user classifications comprises the following steps:
the data provider calculates gradient information ciphertexts of the user data samples according to the label predicted value ciphertexts of the user data samples in the T-1 generation process and the label ciphertexts of the user data samples in the T-1 generation process;
and respectively calculating the weighted sum of the gradient information ciphertexts of the user data samples in each user classification set to serve as the gradient information ciphertexts of different user classifications.
5. The generation method of claim 4, wherein the gradient information ciphertext comprises a first order gradient ciphertext and a second order gradient ciphertext.
6. The generation method according to claim 5, wherein the calculating the gradient information ciphertext of each user data sample comprises:
the data provider calculates homomorphic multiplication results of a first preset value and the label ciphertext of each user data sample;
and the data provider calculates homomorphic addition results of the label predicted value ciphertext of each user data sample in the T-1 generation process and the homomorphic multiplication results, and determines the first-order gradient ciphertext.
7. The generation method according to claim 5, wherein the calculating the gradient information ciphertext of each user data sample comprises:
the data provider processes the label predicted value ciphertext of each user data sample in the T-1 th generation process by using a sigmoid function, and determines a function processing result;
the data provider calculates homomorphic multiplication results of the first preset value and the label ciphertext of each user data sample;
and calculating a homomorphic addition result of the function processing result and the homomorphic multiplication result, and determining the first-order gradient ciphertext.
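For reference, on plaintext values the gradients that claims 6 and 7 compute under encryption have a simple closed form. The sketch below assumes the underlying loss is the logistic loss (the claims describe only the ciphertext operations, so the loss function and names here are assumptions):

```python
import math

def logistic_gradients(pred, label):
    """First- and second-order gradients of the logistic loss.

    pred: current additive prediction (margin); label: 0 or 1.
    Claim 6 corresponds to g = pred + (-1) * label computed homomorphically;
    claim 7 first applies a sigmoid, matching the logistic loss below.
    """
    prob = 1.0 / (1.0 + math.exp(-pred))  # sigmoid of the prediction
    g = prob - label                      # first-order gradient
    h = prob * (1.0 - prob)               # second-order gradient
    return g, h
```

At `pred = 0.0` and `label = 1` this yields `g = -0.5` and `h = 0.25`, the familiar starting gradients of boosted logistic regression.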
8. The generation method of claim 5, wherein the calculating information gains for different user classifications comprises:
and the tag provider calculates the information gain of each user classification according to the weighted sum of the first-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the second-order gradient ciphertexts of each user data sample in each user classification set, the weighted sum of the first-order gradient ciphertexts of all user data samples and the weighted sum of the second-order gradient ciphertexts of all user data samples.
9. The generation method of claim 5, wherein the calculating the information gain for each user category comprises:
the label provider calculates the information gain of each user classification according to the difference value of the weighted sum of the first gain, the second gain and the third gain and a threshold value, wherein the threshold value is used for instructing the data provider to generate the machine learning model,
the first gain is positively correlated with the weighted sum of the first order gradient ciphertext of each user data sample in each user classification set, and the second order gradient ciphertext of each user data sample in each user classification set is negatively correlated with the weighted sum of the second order gradient ciphertext,
the second gain is positively correlated with a difference between a weighted sum of the first order gradient ciphertexts of all user data samples and a weighted sum of the first order gradient ciphertexts of each user data sample in each user classification set, and the second gain is positively correlated with a difference between a weighted sum of the second order gradient ciphertexts of all user data samples and a weighted sum of the second order gradient ciphertexts of each user data sample in each user classification set,
the third gain is inversely related to the weighted sum of the first order gradient ciphertexts of all user data samples and positively related to the weighted sum of the second order gradient ciphertexts of all user data samples.
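The three gains in claim 9 mirror the XGBoost-style split-gain formula; a plaintext sketch follows, where the regularization names `lam` and `gamma` are assumptions playing the role of the claim's threshold:

```python
def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Information gain of a candidate user classification (split).

    GL/HL: weighted sums of first/second-order gradients in the classification
    set (the first gain); GR/HR: the complements over the remaining samples
    (the second gain); the last term uses the totals over all samples
    (the third gain); gamma plays the role of the threshold in claim 9.
    """
    first = GL * GL / (HL + lam)
    second = GR * GR / (HR + lam)
    third = (GL + GR) ** 2 / (HL + HR + lam)
    return 0.5 * (first + second - third) - gamma
```

The label provider, holding the decryption key, can decrypt the aggregated gradient sums and evaluate this gain in plaintext without seeing any individual sample.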
10. The generation method of claim 4, wherein the partitioning the user data samples into different sets of user classifications comprises:
the data provider sets a plurality of classification thresholds according to a plurality of user characteristic information contained in each user data sample;
and dividing the user data samples into different user classification sets according to the classification threshold values.
11. The generation method of any of claims 1-10, wherein the gain of information according to the different user classifications, instructing the data provider to generate the machine learning model comprises:
the label provider determines the maximum gain information in the information gains of the different user categories;
instructing the data provider to generate the machine learning model according to whether the maximum gain information is greater than a threshold.
12. The generation method of claim 11, wherein the machine learning model is a tree model,
instructing the data provider to generate the machine learning model according to whether the maximum gain information is greater than a threshold comprises:
the label provider instructs the data provider to split the current node of the tree model under the condition that the maximum gain information is larger than the threshold value, and child nodes of the current node are generated and serve as the current node of the next iteration;
the label provider calculates the weight parameter of the child node as the parameter information plaintext of the tree model under the condition that the maximum gain information is less than or equal to the threshold value;
the determining the maximum gain information among the information gains of the different user classes comprises:
the label provider re-determines the maximum gain information according to the current node of the next iteration;
the generation method further comprises the following steps:
and repeating the steps until the tree model is generated.
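The loop in claim 12 can be sketched in plaintext: split the current node while the maximum gain exceeds the threshold, otherwise emit the leaf weight w* = -G / (H + lam) as the parameter information. The single-feature sample representation and helper names below are assumptions for illustration:

```python
def leaf_weight(G, H, lam=1.0):
    # parameter information plaintext of a leaf node: w* = -G / (H + lam)
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0):
    whole = (GL + GR) ** 2 / (HL + HR + lam)
    return 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam) - whole)

def build(samples, threshold=0.0, lam=1.0, depth=0, max_depth=3):
    """samples: list of (feature_value, g, h) tuples for a single feature."""
    G = sum(g for _, g, _ in samples)
    H = sum(h for _, _, h in samples)
    if depth >= max_depth or len(samples) < 2:
        return ('leaf', leaf_weight(G, H, lam))
    best_gain, best_t = float('-inf'), None
    for t in sorted({v for v, _, _ in samples})[:-1]:  # candidate thresholds
        GL = sum(g for v, g, _ in samples if v <= t)
        HL = sum(h for v, _, h in samples if v <= t)
        gain = split_gain(GL, HL, G - GL, H - HL, lam)
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_gain <= threshold:  # claim 12: stop splitting, emit the leaf weight
        return ('leaf', leaf_weight(G, H, lam))
    left = [s for s in samples if s[0] <= best_t]
    right = [s for s in samples if s[0] > best_t]
    return ('split', best_t,
            build(left, threshold, lam, depth + 1, max_depth),
            build(right, threshold, lam, depth + 1, max_depth))
```

In the claimed protocol the same control flow is driven by the label provider, which decrypts only aggregated gradient sums and instructs the data provider whether to split each current node.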
13. A method for processing user data comprises the following steps:
classifying users according to user data, using a machine learning model generated by the method for generating a machine learning model according to any one of claims 1 to 12.
14. A system for generating a machine learning model, comprising:
the data provider is used for generating a label predicted value ciphertext of each user data sample according to the parameter information ciphertext of the machine learning model sent by the label provider, and calculating gradient information ciphertexts of different user classifications according to the label predicted value ciphertext, the label ciphertext of each user data sample sent by the label provider and the user classification to which each user data sample belongs;
and the label provider calculates the information gains of different user classifications according to the gradient information ciphertexts of different user classifications, and instructs the data provider to generate the machine learning model according to the information gains of different user classifications.
15. An apparatus for processing user data, comprising:
a classification unit configured to classify a user using a machine learning model generated according to the method for generating a machine learning model according to any one of claims 1 to 12, based on user data.
16. An electronic device, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of generating a machine learning model of any of claims 1-12, or the method of processing user data of claim 13, based on instructions stored in the memory.
17. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a machine learning model according to any one of claims 1 to 12, or the method of processing user data according to claim 13.
CN202210404748.2A 2022-04-18 2022-04-18 Method and system for generating machine learning model and method for processing user data Pending CN114723070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210404748.2A CN114723070A (en) 2022-04-18 2022-04-18 Method and system for generating machine learning model and method for processing user data


Publications (1)

Publication Number Publication Date
CN114723070A (en) 2022-07-08



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination