WO2021114585A1 - Model training method, device and electronic equipment - Google Patents

Model training method, device and electronic equipment

Info

Publication number
WO2021114585A1
WO2021114585A1 · PCT/CN2020/094664 · CN2020094664W
Authority
WO
WIPO (PCT)
Prior art keywords
ciphertext
eigenvalue
subset
sample
gradient value
Prior art date
Application number
PCT/CN2020/094664
Other languages
English (en)
French (fr)
Inventor
李漓春
赵原
周亚顺
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司 filed Critical 支付宝(杭州)信息技术有限公司
Publication of WO2021114585A1 publication Critical patent/WO2021114585A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes

Definitions

  • the embodiments of this specification relate to the field of computer technology, in particular to a model training method, device and electronic equipment.
  • the embodiments of this specification provide a model training method, device, and electronic equipment to enhance data privacy protection in the process of multi-party cooperative modeling.
  • a model training method is provided, which is applied to a first party, and the first party holds characteristic data of a sample.
  • the method includes: segmenting a sample identification set into multiple subsets according to the characteristic data, the sample identification set including multiple sample identifications; receiving a first gradient value ciphertext and a second gradient value ciphertext corresponding to each sample identification, the first gradient value ciphertext and the second gradient value ciphertext being calculated by a homomorphic encryption algorithm; in each subset, homomorphically adding the first gradient value ciphertexts of the multiple sample identifications to obtain the first eigenvalue ciphertext of the subset, and homomorphically adding the second gradient value ciphertexts of the multiple sample identifications to obtain the second eigenvalue ciphertext of the subset; masking the first eigenvalue ciphertext and the second eigenvalue ciphertext respectively with random numbers to obtain a masked first eigenvalue ciphertext and a masked second eigenvalue ciphertext; and sending the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to each subset to the second party, so that the non-leaf nodes of the data processing model can be trained.
  • a model training method is provided, which is applied to a second party, and the second party holds label data of a sample.
  • the method includes: receiving the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to a subset, the subset being obtained by segmenting the sample identification set, the sample identification set including multiple sample identifications;
  • decrypting the masked first feature value ciphertext and the masked second feature value ciphertext respectively, to obtain the masked first feature value and the masked second feature value;
  • using the masked first feature value and the masked second feature value to calculate a segmentation gain factor, the segmentation gain factor being used to calculate the segmentation gain of the subset, and the segmentation gain being used to train the non-leaf nodes of the data processing model.
  • a model training device which is applied to a first party, and the first party holds characteristic data of a sample.
  • the device includes: a segmentation unit, configured to segment the sample identification set into multiple subsets according to the characteristic data, the sample identification set including multiple sample identifications; a receiving unit, configured to receive the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identification, the first gradient value ciphertext and the second gradient value ciphertext being calculated by a homomorphic encryption algorithm; an adding unit, configured to homomorphically add, in each subset, the first gradient value ciphertexts of the multiple sample identifications to obtain the first eigenvalue ciphertext of the subset, and homomorphically add the second gradient value ciphertexts of the multiple sample identifications to obtain the second eigenvalue ciphertext of the subset; a masking unit, configured to mask the first eigenvalue ciphertext and the second eigenvalue ciphertext respectively with random numbers to obtain the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext; and a sending unit, configured to send the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to each subset to the second party, so that the non-leaf nodes of the data processing model can be trained.
  • a model training device which is applied to a second party, and the second party holds label data of a sample.
  • the device includes: a receiving unit, configured to receive the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to a subset, the subset being obtained by segmenting a sample identification set, the sample identification set including multiple sample identifications;
  • a decryption unit, configured to decrypt the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext respectively, to obtain the masked first eigenvalue and the masked second eigenvalue;
  • a calculation unit, configured to use the masked first eigenvalue and the masked second eigenvalue to calculate a segmentation gain factor, the segmentation gain factor being used to calculate the segmentation gain of the subset, and the segmentation gain being used to train the non-leaf nodes of the data processing model.
  • an electronic device including a memory and a processor; the memory is used to store computer instructions; the processor is used to execute the method steps described in the first aspect.
  • an electronic device including a memory and a processor; the memory is used to store computer instructions; the processor is used to execute the method steps described in the second aspect.
  • Fig. 1 is a schematic diagram of a decision tree model according to an embodiment of the specification
  • Figure 2 is a flowchart of a model training method according to an embodiment of this specification
  • Fig. 3 is a flowchart of a model training method according to an embodiment of the specification
  • Fig. 4 is a flowchart of a model training method according to an embodiment of the specification.
  • Fig. 5 is a functional structure diagram of a model training device according to an embodiment of the specification.
  • Fig. 6 is a functional structure diagram of a model training device according to an embodiment of the specification.
  • FIG. 7 is a functional structure diagram of an electronic device according to an embodiment of this specification.
  • Tree model: a supervised machine learning model.
  • the tree model may be, for example, a binary tree or the like.
  • the tree model may include a decision tree model, and the decision tree model may include a regression decision tree, a classification decision tree, and the like.
  • the tree model includes multiple nodes. Each node may correspond to a location identifier, and the location identifier may be used to identify the position of the node in the tree model, and may be specifically, for example, the number of the node.
  • the multiple nodes can form multiple predicted paths.
  • the start node of the predicted path is the root node of the tree model, and the end node is the leaf node of the tree model.
  • Leaf node: when a node in the tree model cannot be split further downward, the node can be called a leaf node.
  • the leaf node corresponds to a leaf value.
  • the leaf values corresponding to different leaf nodes of the tree model can be the same or different.
  • Each leaf value can represent a prediction result.
  • the leaf value can be a numeric value or a vector.
  • Non-leaf node: when a node in the tree model can be split downward, the node can be called a non-leaf node.
  • the non-leaf node may specifically include a root node and other nodes (hereinafter referred to as internal nodes) except the leaf node and the root node.
  • the non-leaf node corresponds to a split condition, and the split condition can be used to select a predicted path.
  • One or more tree models can constitute a forest model.
  • the forest model may be a supervised machine learning model, and may specifically include a regression decision forest and a classification decision forest.
  • Algorithms for integrating multiple tree models into a forest model may include Random Forest, Extreme Gradient Boosting (XGBoost), Gradient Boosting Decision Tree (GBDT), and other algorithms.
  • the tree model Tree1 may include 11 nodes such as nodes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11.
  • node 1 is the root node; nodes 2, 3, 4, and 5 are internal nodes; and nodes 6, 7, 8, 9, 10, and 11 are leaf nodes.
  • Nodes 1, 2, 4, 8 can form a predicted path
  • nodes 1, 2, 4, and 9 can form a predicted path
  • nodes 1, 2, 5, and 10 can form a predicted path
  • nodes 1, 2, 5, 11 A predicted path can be formed
  • nodes 1, 3, and 6 can form a predicted path
  • nodes 1, 3, and 7 can form a predicted path.
  • the split conditions corresponding to the nodes may be: Node 1: age greater than 30; Node 2: annual income greater than 50,000; Node 3: owns a house; Node 4: owns a car; Node 5: married.
  • the split conditions "age greater than 30", "annual income greater than 50,000", "owns a house", "owns a car", and "married" can be used to select the prediction path.
  • when the splitting condition is not met (that is, the judgment result is 0), the prediction path on the left can be selected;
  • when the splitting condition is satisfied (that is, the judgment result is 1), the prediction path on the right can be selected.
  • Loss function: can be used to measure the degree of inconsistency between the predicted value of the data processing model and the true value. The smaller the value of the loss function, the better the robustness of the data processing model.
  • the loss function includes, but is not limited to, a logarithmic loss function (Logarithmic Loss Function), a square loss function (Square Loss), and the like.
  • This specification provides an embodiment of a model training system.
  • the model training system may include a first party and a second party.
  • the first party may be a device such as a server, a mobile phone, a tablet computer, or a personal computer; or, it may also be a system composed of multiple devices, such as a server cluster composed of multiple servers.
  • the second party may be a device such as a server, a mobile phone, a tablet computer, or a personal computer; or, it may also be a system composed of multiple devices, such as a server cluster composed of multiple servers.
  • the first party holds the characteristic data of the sample, but does not hold the label data of the sample.
  • the second party holds the label data of the sample.
  • the second party may not hold the characteristic data of the sample, or may also hold part of the characteristic data of the sample.
  • the first party and the second party may perform cooperative security modeling. In the process of cooperative security modeling, for the protection of data privacy, the first party cannot leak the characteristic data of the sample to the second party, and the second party cannot leak the label data of the sample to the first party.
  • the model obtained through cooperative security modeling may include a forest model, and the forest model may include at least one tree model.
  • the first party and the second party may perform recursive training on nodes in the forest model.
  • Algorithms used for recursive training include but are not limited to XGBoost algorithm, ID3 algorithm, C4.5 algorithm, C5.0 algorithm, and so on.
  • the non-leaf node 1 may correspond to a sample identification set, and the sample corresponding to each sample identification in the sample identification set is used to train the non-leaf node 1.
  • the first party may hold the characteristic data of the samples corresponding to the respective sample identifications, and the second party may hold the label data of the samples corresponding to the respective sample identifications.
  • the first party may train the non-leaf node 1 according to the feature data held by itself, and the second party may train the non-leaf node 1 according to the label data held by itself to obtain the split condition of the non-leaf node 1.
  • the split condition corresponding to the non-leaf node 1 can be obtained, and the sample identification set is divided into a first subset and a second subset.
  • the first subset may correspond to non-leaf node 2.
  • the sample corresponding to each sample identifier in the first subset is used to train the non-leaf node 2.
  • the first party may hold the characteristic data of the samples corresponding to the respective sample identifications, and the second party may hold the label data of the samples corresponding to the respective sample identifications.
  • the first party may train the non-leaf node 2 according to the feature data held by itself, and the second party may train the non-leaf node 2 according to the label data held by itself to obtain the split condition of the non-leaf node 2.
  • the split condition corresponding to non-leaf node 2 can be obtained.
  • the first subset is further divided into two subsets, so as to further train non-leaf node 4 and non-leaf node 5. The subsequent process will not be repeated here.
  • the second subset may correspond to non-leaf nodes 3.
  • the sample corresponding to each sample identifier in the second subset is used to train the non-leaf node 3.
  • the first party may hold the characteristic data of the samples corresponding to the respective sample identifications, and the second party may hold the label data of the samples corresponding to the respective sample identifications.
  • the first party may train the non-leaf node 3 according to the feature data held by itself, and the second party may train the non-leaf node 3 according to the label data held by itself to obtain the split condition of the non-leaf node 3.
  • the split condition corresponding to the non-leaf node 3 can be obtained, and the second subset is further divided into two subsets, so as to further train leaf node 6 and leaf node 7 and obtain the leaf value of leaf node 6 and the leaf value of leaf node 7.
  • the sample identification can be used to identify the sample.
  • the sample can be the data of the business object, and the sample ID can be the ID of the business object.
  • the sample may be user data, and the sample identifier may be the user's identity identifier.
  • the sample may be product data, and the sample identifier may be the identifier of the product.
  • the sample may include feature data and label data.
  • the feature data may include P sub-data in P dimensions, and P is a positive integer.
  • the sample x1 can be expressed as a vector [x1_1, x1_2, ..., x1_i, ..., x1_P, Y1].
  • x1_1, x1_2, ..., x1_i, ..., x1_P are feature data, including P sub-data in P dimensions.
  • Y1 is label data.
  • the characteristic data includes: loan amount data in the loan amount dimension, social insurance base data in the social security base dimension, married or not in the marriage dimension, and whether there is a house in the real estate dimension.
  • the label data includes: whether the user is a dishonest person.
  • the first party is a big data company
  • the second party is a credit agency.
  • the big data company holds data such as the user's loan amount, the user's social security payment base, whether the user is married, and whether the user has a house
  • the credit reporting agency holds data such as whether the user is a dishonest person.
  • the big data company and the credit reporting agency may perform cooperative security modeling based on the user data held by each to obtain a forest model.
  • the forest model can be used to predict whether the user is a dishonest person.
  • the big data company cannot leak the data it holds to the credit reporting agency, and the credit reporting agency cannot leak the data it holds to the big data company.
  • This specification provides an embodiment of the model training method.
  • the model training method may be used to train a non-leaf node in the forest model, and the non-leaf node may be a root node or an internal node.
  • in practical applications, by applying the model training method in a recursive manner, each non-leaf node in the forest model can be trained, thereby realizing cooperative security modeling.
  • the model training method may include the following steps.
  • Step S101 The first party divides the sample identification set into multiple subsets according to the characteristic data.
  • the sample identification set may include a plurality of sample identifications.
  • the sample corresponding to each sample identifier in the sample identifier set is used to train non-leaf nodes.
  • the sample identification set may be an original sample identification set, and the original sample identification set may include sample identifications of samples used for training the forest model.
  • the sample identification set may be a subset obtained after training the previous non-leaf node.
  • the first party may hold characteristic data of samples corresponding to each sample identifier in the sample identifier set.
  • the feature data may include P sub-data in P dimensions, and P is a positive integer.
  • the first party may divide the sample identification set into multiple subsets based on the sub-data in at least one dimension. In practical applications, according to the sub-data in each dimension, the first party may divide the sample identification set into multiple subsets.
  • the sample identifier set may include the sample identifiers of N samples x1, x2, ..., xi, ..., xN, and the feature data of each sample may include P sub-data in P dimensions.
  • the sub-data of the samples x1, x2, ..., xi, ..., xN in the i-th dimension are x1_i, x2_i, ..., xi_i, ..., xN_i respectively.
  • according to the sub-data x1_i, x2_i, ..., xi_i, ..., xN_i, the first party can divide the sample identifications of the samples x1, x2, ..., xi, ..., xN into multiple subsets.
  • the i-th dimension may be age.
  • the sub-data of the samples x1, x2, ..., xi, ..., xN in the age dimension may be x1_i = 30, x2_i = 35, ..., xi_i = 20, ..., xN_i = 50; the first party can then divide the sample identifiers of the samples x1, x2, ..., xi, ..., xN into three subsets T1, T2 and T3.
  • the sub-data in the age dimension of the samples corresponding to the sample identifiers in subset T1 is 0-20 years old, the sub-data in the age dimension of the samples corresponding to the sample identifiers in subset T2 is 21-30 years old, and the sub-data in the age dimension of the samples corresponding to the sample identifiers in subset T3 is 31-50 years old.
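  • As a concrete illustration of this splitting step, the short sketch below partitions sample identifiers into the subsets T1, T2 and T3 by binning the age sub-data as in the example; the specific values, bin edges and variable names are assumptions made only for illustration.

```python
# Illustrative sketch: split a sample identifier set into subsets according to
# the sub-data of one dimension (here: age), as in the T1/T2/T3 example above.

age_by_sample_id = {"x1": 30, "x2": 35, "x3": 20, "x4": 50, "x5": 18}   # assumed values

bins = {"T1": (0, 20), "T2": (21, 30), "T3": (31, 50)}                  # assumed bin edges

subsets = {name: [] for name in bins}
for sample_id, age in age_by_sample_id.items():
    for name, (low, high) in bins.items():
        if low <= age <= high:
            subsets[name].append(sample_id)
            break

print(subsets)   # {'T1': ['x3', 'x5'], 'T2': ['x1'], 'T3': ['x2', 'x4']}
```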
  • Step S103 The second party calculates the first gradient value ciphertext and the second gradient value ciphertext corresponding to the sample identifier.
  • the first gradient value ciphertext and the second gradient value ciphertext may be calculated from the loss function of the forest model.
  • the second party may hold label data of samples corresponding to each sample identifier in the sample identifier set.
  • the second party may calculate the first gradient value and the second gradient value corresponding to each sample identifier in the sample identifier set.
  • the first gradient value may be a first-order gradient value of the loss function,
  • and the second gradient value may be a second-order gradient value of the loss function. It is worth noting that the second party may hold the label data of the sample, but not the characteristic data of the sample.
  • the second party may only calculate the first gradient value and the second gradient value corresponding to each sample identifier in the sample identifier set based on the label data.
  • the second party may hold the label data and part of the characteristic data of the sample. Therefore, the second party can calculate the first gradient value and the second gradient value corresponding to each sample identifier in the sample identifier set according to the label data and part of the characteristic data.
  • taking the XGBoost algorithm as an example, the second party can calculate the first gradient value corresponding to a sample identifier as g = ∂l(y, ŷ^(t-1)) / ∂ŷ^(t-1), and the second gradient value corresponding to the sample identifier as h = ∂²l(y, ŷ^(t-1)) / ∂(ŷ^(t-1))².
  • g represents the first gradient value, h represents the second gradient value, l represents the loss function, y represents the label data, ŷ represents the predicted value of the label data, t represents the current iteration round, and ŷ^(t-1) represents the predicted value after the (t-1)-th iteration round.
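  • For readers who want to see these gradients in code, the sketch below computes first-order and second-order gradients for one common concrete choice, the logarithmic loss used for binary classification in XGBoost-style training. The closed forms g = p - y and h = p(1 - p) follow from differentiating that loss with respect to the raw prediction of round t-1; the choice of loss and the variable names are assumptions for illustration, not something the embodiment prescribes.

```python
import math

def log_loss_gradients(y, raw_score_prev):
    """First- and second-order gradients of the logarithmic loss with respect to
    the raw prediction of iteration round t-1 (a sigmoid link is assumed)."""
    p = 1.0 / (1.0 + math.exp(-raw_score_prev))   # predicted probability
    g = p - y                                     # first gradient value
    h = p * (1.0 - p)                             # second gradient value
    return g, h

# The second party holds only the label data: one gradient pair per sample identifier.
labels = {"x1": 1, "x2": 0, "x3": 1}
raw_scores_prev = {"x1": 0.2, "x2": -0.4, "x3": 1.1}      # predictions from round t-1
gradients = {sid: log_loss_gradients(labels[sid], raw_scores_prev[sid]) for sid in labels}
print(gradients["x1"])
```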
  • the second party may encrypt the first gradient value and the second gradient value to obtain the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier in the sample identifier set .
  • the second party may use a homomorphic encryption algorithm to encrypt the first gradient value and the second gradient value.
  • the homomorphic encryption algorithm may include Paillier algorithm, Okamoto-Uchiyama algorithm, Damgard-Jurik algorithm, and the like.
  • Homomorphic Encryption is an encryption technology. It allows direct operations on the ciphertext data to obtain the encrypted result, and the result obtained by decrypting it is the same as the result of the same operation on the plaintext data.
  • the homomorphic encryption algorithm may include additive homomorphic encryption algorithm, multiplicative homomorphic encryption algorithm, and the like.
  • the second party may generate a public-private key pair for homomorphic encryption; the public key in the public-private key pair may be used to encrypt the first gradient value and the second gradient value.
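  • A minimal sketch of this encryption step is shown below, using the open-source python-paillier (`phe`) package as one possible additively homomorphic scheme. The embodiment only requires some homomorphic encryption algorithm such as Paillier, Okamoto-Uchiyama or Damgard-Jurik, so the package choice, the key length and the sample values are assumptions made for illustration.

```python
# pip install phe   (python-paillier, one possible additively homomorphic scheme)
from phe import paillier

# Second party: generate the public/private key pair used for homomorphic encryption.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Second party: encrypt the first and second gradient values of each sample identifier
# with the public key; the resulting ciphertexts are sent to the first party (step S105).
gradients = {"x1": (0.55, 0.25), "x2": (-0.40, 0.24), "x3": (0.25, 0.19)}   # (g, h), assumed
encrypted_gradients = {
    sid: (public_key.encrypt(g), public_key.encrypt(h)) for sid, (g, h) in gradients.items()
}
```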
  • Step S105 The second party sends the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier to the first party.
  • Step S107 The first party receives the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier.
  • Step S109: In each subset, the first party homomorphically adds the first gradient value ciphertexts of the multiple sample identifications to obtain the first eigenvalue ciphertext of the subset, and homomorphically adds the second gradient value ciphertexts of the multiple sample identifications to obtain the second eigenvalue ciphertext of the subset.
  • the first party may obtain multiple subsets, and each subset may include multiple sample identifiers.
  • the first party may homomorphically add the first gradient value ciphertext corresponding to multiple sample identities in the subset to obtain the first eigenvalue ciphertext of the subset;
  • the second gradient value ciphertexts corresponding to the multiple sample identifiers in the set are homomorphically added to obtain the second eigenvalue ciphertext of the subset.
  • a certain subset may include m sample identifiers such as sample identifiers x1, x2,...,xi,...,xm.
  • the ciphertexts of the first gradient values corresponding to the sample identifiers x1, x2,...,xi,...,xm are E(g(x1)), E(g(x2)),...,E(g(xi)),...,E(g(xm)) respectively.
  • the ciphertexts of the second gradient values corresponding to the sample identifiers x1, x2,...,xi,...,xm are E(h(x1)), E(h(x2)),...,E(h(xi)),...,E(h(xm)) respectively.
  • the first party can then calculate E(g(x1))+E(g(x2))+...+E(g(xi))+...+E(g(xm)) = E(g(x1)+g(x2)+...+g(xi)+...+g(xm)) as the first eigenvalue ciphertext of the subset, and E(h(x1))+E(h(x2))+...+E(h(xi))+...+E(h(xm)) = E(h(x1)+h(x2)+...+h(xi)+...+h(xm)) as the second eigenvalue ciphertext of the subset.
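  • Continuing the same illustrative python-paillier sketch, the first party can add the received gradient ciphertexts of a subset without decrypting them; by the additive homomorphism, the sum decrypts to g(x1)+...+g(xm). The key length, sample values and subset are assumptions; the decryption at the end is only a local check, since in the protocol the first party does not hold the private key.

```python
# Illustrative sketch of step S109 with python-paillier (an assumed implementation).
from functools import reduce
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
gradients = {"x1": (0.55, 0.25), "x2": (-0.40, 0.24), "x3": (0.25, 0.19)}          # (g, h)
encrypted_gradients = {sid: (public_key.encrypt(g), public_key.encrypt(h))
                       for sid, (g, h) in gradients.items()}

# First party: homomorphically add the gradient ciphertexts of one subset.
subset_ids = ["x1", "x3"]
enc_G = reduce(lambda a, b: a + b, (encrypted_gradients[sid][0] for sid in subset_ids))
enc_H = reduce(lambda a, b: a + b, (encrypted_gradients[sid][1] for sid in subset_ids))

# enc_G is E(g(x1)+g(x3)), the subset's first eigenvalue ciphertext; enc_H is the second.
assert abs(private_key.decrypt(enc_G) - (0.55 + 0.25)) < 1e-9   # check only; see note above
```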
  • Step S111 the first party separately masks the first eigenvalue ciphertext and the second eigenvalue ciphertext by using a random number to obtain the first eigenvalue ciphertext after the concealment and the second eigenvalue ciphertext after the concealment.
  • by masking the first eigenvalue ciphertext and the second eigenvalue ciphertext, the second party can be prevented from obtaining the first eigenvalue ciphertext and the second eigenvalue ciphertext, and therefore from recovering the first eigenvalue and the second eigenvalue from them, which enhances privacy protection.
  • for the first eigenvalue ciphertext and the second eigenvalue ciphertext of each subset, the first party can use any one of the following masking methods to obtain the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to the subset.
  • in this way, the second party can calculate the segmentation gain factor in the subsequent step S119.
  • the first party can use the homomorphic encryption algorithm to encrypt the random number and obtain a random number ciphertext; the random number ciphertext can then be homomorphically operated with the first eigenvalue ciphertext and the second eigenvalue ciphertext respectively, to obtain the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext.
  • the homomorphic operations may include homomorphic addition operations, homomorphic multiplication operations, and any combination thereof. For example, the first party can use the public key of the second party to homomorphically encrypt the random number.
  • the first eigenvalue ciphertext may be E(g), and the masked first eigenvalue ciphertext may be E(g·r).
  • the second eigenvalue ciphertext may be E(h), and the masked second eigenvalue ciphertext may be E((h+λ)·r²).
  • r represents a random number, and λ represents the coefficient of the regularization term.
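  • The masking above can also be expressed with the same hypothetical python-paillier objects: multiplying a ciphertext by a plaintext scalar and adding a plaintext constant are both homomorphic operations, so the first party never needs to decrypt. The regularization coefficient λ, the random number range and the subset sums below are illustrative values only.

```python
# Illustrative sketch of step S111 (masking with a random number), using python-paillier.
import random
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

g, h, lam = 0.80, 0.44, 1.0                       # subset sums and regularization coefficient
enc_G, enc_H = public_key.encrypt(g), public_key.encrypt(h)

# First party: mask the eigenvalue ciphertexts with a random number r, entirely on ciphertexts.
r = random.SystemRandom().uniform(1.0, 100.0)
masked_G = enc_G * r                              # E(g * r)
masked_H = (enc_H + lam) * (r ** 2)               # E((h + lam) * r^2)

# Second party (step S117): decrypt the masked values with its private key.
print(private_key.decrypt(masked_G), private_key.decrypt(masked_H))
```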
  • the first noise data may be a random number with a small value.
  • the second party can calculate the segmentation gain factor with limited accuracy. It is worth noting that because the first noise data is a random number with a small value, the segmentation gain factor with limited precision can meet service requirements.
  • the first noise data may be a random number with a smaller value
  • the second noise data may be another random number with a smaller value.
  • the second party can calculate the segmentation gain factor with limited accuracy. It is worth noting that because the first noise data is a random number with a small value, and the second noise data is another random number with a small value, the segmentation gain factor with limited precision can meet service requirements.
  • the first eigenvalue ciphertext may be E(g), and the masked first eigenvalue ciphertext may be E(g·r+s1).
  • the second eigenvalue ciphertext may be E(h), and the masked second eigenvalue ciphertext may be E((h+λ)·r²+s2).
  • r represents a random number, λ represents the regularization term coefficient, s1 represents the first noise data, and s2 represents the second noise data.
  • the second noise data may be a random number with a small value.
  • the second party can calculate the segmentation gain factor with limited accuracy. It is worth noting that because the second noise data is a random number with a small value, the segmentation gain factor with limited precision can still meet service requirements.
  • Step S113 the first party sends to the second party the ciphertext of the first eigenvalue after concealment and the ciphertext of the second eigenvalue corresponding to each subset.
  • Step S115 The second party receives the ciphertext of the first eigenvalue after concealment and the ciphertext of the second eigenvalue corresponding to each subset.
  • Step S117 The second party respectively decrypts the masked first feature value ciphertext and the masked second feature value ciphertext to obtain the masked first feature value and the masked second feature value.
  • the second party may decrypt the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to each subset, to obtain the masked first eigenvalue and the masked second eigenvalue corresponding to that subset.
  • for example, the second party may use the private key to decrypt the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext.
  • Step S119: The second party uses the masked first feature value and the masked second feature value to calculate a segmentation gain factor, where the segmentation gain factor is used to calculate the segmentation gain, and the segmentation gain is used to train the non-leaf nodes of the data processing model.
  • for each subset, the second party may perform operations on the masked first eigenvalue and the masked second eigenvalue corresponding to the subset according to a preset algorithm, to obtain the segmentation gain factor of the subset.
  • the segmentation gain factor may be used to calculate the segmentation gain, and the segmentation gain may be used to measure the degree of order of a plurality of specific samples, and the plurality of specific samples may include samples corresponding to the sample identifiers in the subset.
  • the segmentation gain may include at least one of the following: information gain, information gain rate, and Gini coefficient. Those skilled in the art should be able to understand that the segmentation gain is not limited to the information gain, information gain rate, and Gini coefficient listed above. In practice, the segmentation gain may also be different according to different training algorithms.
  • the masked first eigenvalue ciphertext corresponding to a certain subset can be E(g·r), and the masked second eigenvalue ciphertext corresponding to this subset can be E((h+λ)·r²).
  • after decryption, the masked first eigenvalue corresponding to the subset may be g·r, and the masked second eigenvalue corresponding to the subset may be (h+λ)·r².
  • the second party can then calculate the segmentation gain factor (g·r)² / ((h+λ)·r²) = g² / (h+λ).
  • as another example, the masked first eigenvalue ciphertext corresponding to a certain subset can be E(g·r+s1), and the masked second eigenvalue ciphertext corresponding to this subset can be E((h+λ)·r²+s2).
  • after decryption, the masked first feature value corresponding to the subset may be g·r+s1, and the masked second feature value corresponding to the subset may be (h+λ)·r²+s2.
  • the second party can then calculate the segmentation gain factor (g·r+s1)² / ((h+λ)·r²+s2). Since the first noise data s1 and the second noise data s2 are both random numbers with small values, this factor is approximately equal to g² / (h+λ).
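  • The point of the masking is that the random number r cancels when the second party forms the factor g²/(h+λ) from the decrypted masked values. The short numeric sketch below demonstrates that cancellation with plain Python floats; all values are assumed for illustration.

```python
# Numeric sketch: the random mask r cancels out of the segmentation gain factor.
g, h, lam = 0.80, 0.44, 1.0
r, s1, s2 = 37.5, 1e-6, 1e-6                      # random mask and small noise values (assumed)

masked_plain = (g * r, (h + lam) * r ** 2)               # decrypted values, masking without noise
masked_noisy = (g * r + s1, (h + lam) * r ** 2 + s2)     # decrypted values, masking with noise

factor_exact = masked_plain[0] ** 2 / masked_plain[1]    # equals g**2 / (h + lam)
factor_noisy = masked_noisy[0] ** 2 / masked_noisy[1]    # approximately the same

print(round(factor_exact, 6), round(factor_noisy, 6), round(g ** 2 / (h + lam), 6))
```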
  • the second party may also calculate the partition gain of each subset according to the partition gain factor of the subset.
  • the second party may select a subset according to the segmentation gain of each subset, and then may determine the split condition of the non-leaf node according to the selected subset. For example, the second party may select the subset with the largest segmentation gain.
  • the second party may also calculate the division gain of the subset together with the first party according to the division gain factor of each subset.
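  • The embodiments above leave the exact form of the segmentation gain open (information gain, information gain rate, Gini coefficient, or others depending on the training algorithm). As one illustrative possibility consistent with XGBoost-style training, the sketch below scores candidate splits from left/right gain factors of the form G²/(H+λ) and picks the split with the largest gain; the formula, the candidate values and the names are assumptions, not the claimed method.

```python
# Illustrative XGBoost-style split gain built from gain factors G^2 / (H + lam).
# This is one possible concrete choice, not a formula required by the embodiments.

def split_gain(G_left, H_left, G_right, H_right, lam=1.0):
    def factor(G, H):
        return G * G / (H + lam)
    return factor(G_left, H_left) + factor(G_right, H_right) \
        - factor(G_left + G_right, H_left + H_right)

candidate_splits = {                    # (G_left, H_left, G_right, H_right), assumed values
    "age <= 30":     (1.8, 2.1, -0.9, 1.4),
    "income <= 50k": (0.6, 1.7,  0.3, 1.8),
}
best = max(candidate_splits, key=lambda name: split_gain(*candidate_splits[name]))
print(best, round(split_gain(*candidate_splits[best]), 4))
```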
  • the model training method of some embodiments of the present specification can enhance the privacy protection of data in the process of multi-party cooperative modeling by using random numbers to cover the feature value ciphertext.
  • This specification provides another embodiment of the model training method.
  • the model training method may be used to train a non-leaf node in the forest model, and the non-leaf node may be a root node or an internal node.
  • in practical applications, by applying the model training method in a recursive manner, each non-leaf node in the forest model can be trained, thereby realizing cooperative security modeling.
  • the model training method can be applied to a first party, and the first party can hold characteristic data of a sample.
  • the model training method may include the following steps.
  • Step S21 According to the characteristic data, the sample identification set is divided into multiple subsets.
  • Step S23 Receive the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier.
  • Step S25: In each subset, homomorphically add the first gradient value ciphertexts of the multiple sample identifications to obtain the first feature value ciphertext of the subset, and homomorphically add the second gradient value ciphertexts of the multiple sample identifications to obtain the second feature value ciphertext of the subset.
  • Step S27 Use random numbers to cover the first eigenvalue ciphertext and the second eigenvalue ciphertext, respectively, to obtain the first eigenvalue ciphertext after the concealment and the second eigenvalue ciphertext after the concealment;
  • Step S29 Send the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset to the second party, so as to train the non-leaf nodes of the data processing model.
  • the model training method of some embodiments of this specification can enhance the privacy protection of data in the process of multi-party cooperative modeling by using random numbers to cover the ciphertext of the characteristic value according to the homomorphic encryption algorithm.
  • the model training method may be used to train a non-leaf node in the forest model, and the non-leaf node may be a root node or an internal node.
  • in practical applications, by applying the model training method in a recursive manner, each non-leaf node in the forest model can be trained, thereby realizing cooperative security modeling.
  • the model training method can be applied to a second party, and the second party can hold the label data of the sample.
  • the model training method may include the following steps.
  • Step S31: Receive the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to the subset, where the subset is obtained by segmenting the sample identification set, and the sample identification set includes multiple sample identifications.
  • Step S33 Decrypt the ciphertext of the first eigenvalue after the concealment and the ciphertext of the second eigenvalue after the concealment respectively to obtain the first eigenvalue after the concealment and the second eigenvalue after the concealment.
  • Step S35: Calculate a segmentation gain factor using the masked first feature value and the masked second feature value, where the segmentation gain factor is used to calculate the segmentation gain of the subset, and the segmentation gain is used to train the non-leaf nodes of the data processing model.
  • the model training method of some embodiments of this specification can enhance the privacy protection of data in the process of multi-party cooperative modeling by using random numbers to mask the feature value ciphertexts according to the homomorphic encryption algorithm.
  • the device may include the following units.
  • the segmentation unit 41 is configured to segment the sample identification set into a plurality of subsets according to the characteristic data, the sample identification set including the identifications of the plurality of samples;
  • the receiving unit 43 is configured to receive the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier, where the first gradient value ciphertext and the second gradient value ciphertext are encrypted by a homomorphic encryption algorithm Encrypting the first gradient value and the second gradient value of the loss function respectively to obtain;
  • the adding unit 45 is configured to homomorphically add, in each subset, the first gradient value ciphertexts of the multiple sample identifications to obtain the first eigenvalue ciphertext of the subset, and homomorphically add the second gradient value ciphertexts of the multiple sample identifications to obtain the second eigenvalue ciphertext of the subset;
  • the masking unit 47 is configured to separately mask the first eigenvalue ciphertext and the second eigenvalue ciphertext by using a random number to obtain the first eigenvalue ciphertext after the concealment and the second eigenvalue ciphertext after the concealment;
  • the sending unit 49 is configured to send the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to each subset to the second party, so as to train the non-leaf nodes of the data processing model.
  • the device may include the following units.
  • the receiving unit 51 is configured to receive the first masked eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to the subset, the subset is obtained by segmenting the sample identification set, the sample identification The set includes multiple sample IDs;
  • the decryption unit 53 is configured to respectively decrypt the ciphertext of the first eigenvalue after the concealment and the ciphertext of the second eigenvalue after the concealment, to obtain the first eigenvalue after the concealment and the second eigenvalue after the concealment;
  • the calculation unit 55 is configured to use the masked first feature value and the masked second feature value to calculate a segmentation gain factor, where the segmentation gain factor is used to calculate the segmentation gain of the subset, and the segmentation gain is used for The non-leaf nodes of the data processing model are trained.
  • Fig. 7 is a schematic diagram of the hardware structure of the electronic device in this embodiment.
  • the electronic device may include one or more (only one is shown in the figure) processor, memory, and transmission module.
  • the hardware structure shown in FIG. 7 is only for illustration, and it does not limit the hardware structure of the above electronic device.
  • the electronic device may also include more or fewer component units than shown in FIG. 7; or, it may have a configuration different from that shown in FIG. 7.
  • the memory may include a high-speed random access memory; or, may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the storage may also include a remotely set network storage.
  • the remotely set network storage can be connected to the blockchain client through a network such as the Internet, an intranet, a local area network, a mobile communication network, and the like.
  • the memory may be used to store program instructions or modules of application software, for example, the program instructions or modules of the embodiment corresponding to FIG. 3 or FIG. 4 of this specification.
  • the processor can be implemented in any suitable way.
  • the processor may take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, an embedded microcontroller, and so on.
  • the processor can read and execute program instructions or modules in the memory.
  • the transmission module can be used for data transmission via a network, for example, data transmission via a network such as the Internet, an intranet, a local area network, a mobile communication network, and the like.
  • the computer storage medium includes but is not limited to random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), cache (Cache), hard disk (Hard Disk Drive, HDD), memory card ( Memory Card) and so on.
  • the computer storage medium stores computer program instructions. When the computer program instructions are executed, the program instructions or modules of the embodiments corresponding to FIG. 3 or FIG. 4 of this specification are implemented.
  • for an improvement of a technology, it used to be possible to clearly distinguish between a hardware improvement (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) and a software improvement (an improvement to a method flow).
  • however, with the development of technology, the improvement of many method flows today can be regarded as a direct improvement of the hardware circuit structure.
  • designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that the improvement of a method flow cannot be realized by a hardware entity module.
  • a programmable logic device (Programmable Logic Device, PLD), for example a Field Programmable Gate Array (FPGA), is such an integrated circuit whose logic function is determined by the user programming the device.
  • the method flow can be programmed in a Hardware Description Language (HDL); there are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), the most commonly used being VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog.
  • a typical implementation device is a computer.
  • the computer can be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or Any combination of these devices.
  • This manual can be used in many general-purpose or special-purpose computer system environments or configurations.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network.
  • program modules can be located in local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A model training method, device and electronic equipment. The method includes: splitting a sample identifier set into multiple subsets according to feature data (S21); receiving the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier (S23); within each subset, homomorphically adding the first gradient value ciphertexts of the multiple sample identifiers to obtain the first feature value ciphertext of the subset, and homomorphically adding the second gradient value ciphertexts of the multiple sample identifiers to obtain the second feature value ciphertext of the subset (S25); masking the first feature value ciphertext and the second feature value ciphertext with random numbers to obtain the masked first feature value ciphertext and the masked second feature value ciphertext (S27); and sending the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset to the second party, so that the non-leaf nodes of the data processing model can be trained (S29). By masking the feature value ciphertexts with random numbers based on a homomorphic encryption algorithm, the method can enhance the privacy protection of data in the process of multi-party cooperative modeling.

Description

Model training method, device and electronic equipment

TECHNICAL FIELD

The embodiments of this specification relate to the field of computer technology, and in particular to a model training method, a model training device and an electronic device.

BACKGROUND

In actual business, the data held by a single data party is often incomplete, and it is usually necessary to rely on the data of other data parties to jointly complete the training of a data processing model. In the process of multi-party cooperative modeling, the problem of privacy leakage often arises.

SUMMARY

The embodiments of this specification provide a model training method, device and electronic equipment, so as to enhance the privacy protection of data in the process of multi-party cooperative modeling.

To achieve the above objective, one or more embodiments of this specification provide the following technical solutions.

According to a first aspect of one or more embodiments of this specification, a model training method is provided, applied to a first party, the first party holding the feature data of samples. The method includes: splitting a sample identifier set into multiple subsets according to the feature data, the sample identifier set including multiple sample identifiers; receiving a first gradient value ciphertext and a second gradient value ciphertext corresponding to each sample identifier, the first gradient value ciphertext and the second gradient value ciphertext being computed by a homomorphic encryption algorithm; within each subset, homomorphically adding the first gradient value ciphertexts of the multiple sample identifiers to obtain the first feature value ciphertext of the subset, and homomorphically adding the second gradient value ciphertexts of the multiple sample identifiers to obtain the second feature value ciphertext of the subset; masking the first feature value ciphertext and the second feature value ciphertext respectively with random numbers to obtain a masked first feature value ciphertext and a masked second feature value ciphertext; and sending the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset to a second party, so that the non-leaf nodes of a data processing model can be trained.

According to a second aspect of one or more embodiments of this specification, a model training method is provided, applied to a second party, the second party holding the label data of the samples. The method includes: receiving the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to a subset, the subset being obtained by splitting a sample identifier set, the sample identifier set including multiple sample identifiers; decrypting the masked first feature value ciphertext and the masked second feature value ciphertext respectively to obtain a masked first feature value and a masked second feature value; and calculating a segmentation gain factor using the masked first feature value and the masked second feature value, the segmentation gain factor being used to calculate the segmentation gain of the subset, and the segmentation gain being used to train the non-leaf nodes of a data processing model.

According to a third aspect of one or more embodiments of this specification, a model training device is provided, applied to a first party, the first party holding the feature data of samples. The device includes: a segmentation unit, configured to segment a sample identifier set into multiple subsets according to the feature data, the sample identifier set including the identifiers of multiple samples; a receiving unit, configured to receive the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier, the first gradient value ciphertext and the second gradient value ciphertext being computed by a homomorphic encryption algorithm; an adding unit, configured to homomorphically add, within each subset, the first gradient value ciphertexts of the multiple sample identifiers to obtain the first feature value ciphertext of the subset, and homomorphically add the second gradient value ciphertexts of the multiple sample identifiers to obtain the second feature value ciphertext of the subset; a masking unit, configured to mask the first feature value ciphertext and the second feature value ciphertext respectively with random numbers to obtain the masked first feature value ciphertext and the masked second feature value ciphertext; and a sending unit, configured to send the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset to a second party, so that the non-leaf nodes of a data processing model can be trained.

According to a fourth aspect of one or more embodiments of this specification, a model training device is provided, applied to a second party, the second party holding the label data of the samples. The device includes: a receiving unit, configured to receive the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to a subset, the subset being obtained by splitting a sample identifier set, the sample identifier set including multiple sample identifiers; a decryption unit, configured to decrypt the masked first feature value ciphertext and the masked second feature value ciphertext respectively, to obtain the masked first feature value and the masked second feature value; and a calculation unit, configured to calculate a segmentation gain factor using the masked first feature value and the masked second feature value, the segmentation gain factor being used to calculate the segmentation gain of the subset, and the segmentation gain being used to train the non-leaf nodes of a data processing model.

According to a fifth aspect of one or more embodiments of this specification, an electronic device is provided, including a memory and a processor; the memory is configured to store computer instructions; the processor is configured to execute the method steps described in the first aspect.

According to a sixth aspect of one or more embodiments of this specification, an electronic device is provided, including a memory and a processor; the memory is configured to store computer instructions; the processor is configured to execute the method steps described in the second aspect.

As can be seen from the technical solutions provided by the above embodiments, in the embodiments of this specification, by masking the feature value ciphertexts with random numbers based on a homomorphic encryption algorithm, the privacy protection of data in the process of multi-party cooperative modeling can be enhanced.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.

Fig. 1 is a schematic diagram of a decision tree model according to an embodiment of this specification;

Fig. 2 is a flowchart of a model training method according to an embodiment of this specification;

Fig. 3 is a flowchart of a model training method according to an embodiment of this specification;

Fig. 4 is a flowchart of a model training method according to an embodiment of this specification;

Fig. 5 is a functional structure diagram of a model training device according to an embodiment of this specification;

Fig. 6 is a functional structure diagram of a model training device according to an embodiment of this specification;

Fig. 7 is a functional structure diagram of an electronic device according to an embodiment of this specification.

DETAILED DESCRIPTION

The technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in the embodiments of this specification. Obviously, the described embodiments are only a part of the embodiments of this specification, rather than all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this specification. In addition, it should be understood that although the terms first, second, third and so on may be used in this specification to describe various kinds of information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
The technical terms used in the embodiments of this specification are explained below.

Tree model: a supervised machine learning model. The tree model may be, for example, a binary tree or the like. The tree model may include a decision tree model, and the decision tree model may include a regression decision tree, a classification decision tree, and the like. The tree model includes multiple nodes. Each node may correspond to a position identifier, and the position identifier may be used to identify the position of the node in the tree model; specifically, it may be, for example, the number of the node. The multiple nodes can form multiple prediction paths. The start node of a prediction path is the root node of the tree model, and its end node is a leaf node of the tree model.

Leaf node: when a node in the tree model cannot be split further downward, the node can be called a leaf node. A leaf node corresponds to a leaf value. The leaf values corresponding to different leaf nodes of the tree model can be the same or different. Each leaf value can represent a prediction result. The leaf value can be a numerical value, a vector, or the like.

Non-leaf node: when a node in the tree model can be split downward, the node can be called a non-leaf node. The non-leaf nodes may specifically include the root node and the nodes other than the leaf nodes and the root node (hereinafter referred to as internal nodes). A non-leaf node corresponds to a split condition, and the split condition can be used to select a prediction path.

One or more tree models can constitute a forest model. The forest model may be a supervised machine learning model, and may specifically include a regression decision forest and a classification decision forest. Algorithms for integrating multiple tree models into a forest model may include Random Forest, Extreme Gradient Boosting (XGBoost), Gradient Boosting Decision Tree (GBDT), and other algorithms.

A scenario example of a tree model is introduced below.

Please refer to Fig. 1. In this scenario example, the tree model Tree1 may include 11 nodes: nodes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11. Node 1 is the root node; nodes 2, 3, 4 and 5 are internal nodes; nodes 6, 7, 8, 9, 10 and 11 are leaf nodes. Nodes 1, 2, 4 and 8 can form a prediction path; nodes 1, 2, 4 and 9 can form a prediction path; nodes 1, 2, 5 and 10 can form a prediction path; nodes 1, 2, 5 and 11 can form a prediction path; nodes 1, 3 and 6 can form a prediction path; nodes 1, 3 and 7 can form a prediction path.

The split conditions corresponding to nodes 1, 2, 3, 4 and 5 may be as shown in Table 1 below.

Table 1

Node     Split condition
Node 1   Age greater than 30
Node 2   Annual income greater than 50,000
Node 3   Owns a house
Node 4   Owns a car
Node 5   Married

The split conditions "age greater than 30", "annual income greater than 50,000", "owns a house", "owns a car" and "married" can be used to select a prediction path. When the split condition is not satisfied (that is, the judgment result is 0), the prediction path on the left can be selected; when the split condition is satisfied (that is, the judgment result is 1), the prediction path on the right can be selected.

The leaf values corresponding to nodes 6, 7, 8, 9, 10 and 11 are as shown in Table 2 below.

Table 2

Node      Leaf value
Node 6    20
Node 7    40
Node 8    80
Node 9    100
Node 10   200
Node 11   250
Loss function (Loss Function): a loss function can be used to measure the degree of inconsistency between the predicted value of a data processing model and the true value. The smaller the value of the loss function, the better the robustness of the data processing model. The loss function includes, but is not limited to, a logarithmic loss function (Logarithmic Loss Function), a square loss function (Square Loss), and the like.

This specification provides an embodiment of a model training system.

The model training system may include a first party and a second party. The first party may be a device such as a server, a mobile phone, a tablet computer or a personal computer; or it may be a system composed of multiple devices, for example a server cluster composed of multiple servers. The second party may be a device such as a server, a mobile phone, a tablet computer or a personal computer; or it may be a system composed of multiple devices, for example a server cluster composed of multiple servers.

In some embodiments, the first party holds the feature data of the samples but does not hold the label data of the samples. The second party holds the label data of the samples. The second party may not hold the feature data of the samples, or may hold part of the feature data of the samples. The first party and the second party may perform cooperative security modeling. In the process of cooperative security modeling, for the sake of protecting data privacy, the first party cannot leak the feature data of the samples to the second party, and the second party cannot leak the label data of the samples to the first party.

The model obtained through cooperative security modeling may include a forest model, and the forest model may include at least one tree model. In practical applications, the first party and the second party may perform recursive training on the nodes of the forest model. The algorithms used for the recursive training include, but are not limited to, the XGBoost algorithm, the ID3 algorithm, the C4.5 algorithm, the C5.0 algorithm, and so on.

Taking the tree model shown in Fig. 1 as an example, non-leaf node 1 may correspond to a sample identifier set, and the samples corresponding to the sample identifiers in the sample identifier set are used to train non-leaf node 1. The first party may hold the feature data of the samples corresponding to those sample identifiers, and the second party may hold the label data of the samples corresponding to those sample identifiers. The first party, according to the feature data it holds, and the second party, according to the label data it holds, may train non-leaf node 1 to obtain the split condition of non-leaf node 1. After the training of non-leaf node 1 is completed, the split condition corresponding to non-leaf node 1 is obtained, and the sample identifier set is split into a first subset and a second subset.

The first subset may correspond to non-leaf node 2. The samples corresponding to the sample identifiers in the first subset are used to train non-leaf node 2. The first party may hold the feature data of the samples corresponding to those sample identifiers, and the second party may hold the label data of the samples corresponding to those sample identifiers. The first party, according to the feature data it holds, and the second party, according to the label data it holds, may train non-leaf node 2 to obtain the split condition of non-leaf node 2. After the training of non-leaf node 2 is completed, the split condition corresponding to non-leaf node 2 is obtained, and the first subset is further split into two subsets, so that non-leaf node 4 and non-leaf node 5 can be further trained. The subsequent process will not be repeated here.

The second subset may correspond to non-leaf node 3. The samples corresponding to the sample identifiers in the second subset are used to train non-leaf node 3. The first party may hold the feature data of the samples corresponding to those sample identifiers, and the second party may hold the label data of the samples corresponding to those sample identifiers. The first party, according to the feature data it holds, and the second party, according to the label data it holds, may train non-leaf node 3 to obtain the split condition of non-leaf node 3. After the training of non-leaf node 3 is completed, the split condition corresponding to non-leaf node 3 is obtained, and the second subset is further split into two subsets, so that leaf node 6 and leaf node 7 can be further trained to obtain the leaf value of leaf node 6 and the leaf value of leaf node 7.

In some embodiments, a sample identifier can be used to identify a sample. For example, a sample can be the data of a business object, and the sample identifier can be the identifier of the business object. Specifically, for example, the sample can be user data, and the sample identifier can be the identity identifier of the user. For another example, the sample can be commodity data, and the sample identifier can be the identifier of the commodity.

A sample may include feature data and label data. The feature data may include P sub-data in P dimensions, where P is a positive integer. For example, the sample x1 can be expressed as a vector [x1_1, x1_2, ..., x1_i, ..., x1_P, Y1]. x1_1, x1_2, ..., x1_i, ..., x1_P are the feature data, including P sub-data in P dimensions; Y1 is the label data. For example, for the sample x1, the feature data includes: loan amount data in the loan amount dimension, social insurance base data in the social insurance base dimension, married-or-not data in the marriage dimension, and house-ownership data in the real estate dimension; the label data includes: whether the user is a dishonest person.

A scenario example is introduced below. In this scenario example, the first party is a big data company, and the second party is a credit reporting agency. The big data company holds data such as the user's loan amount, the base on which the user pays social insurance, whether the user is married, and whether the user owns a house; the credit reporting agency holds data such as whether the user is a dishonest person. The big data company and the credit reporting agency can perform cooperative security modeling based on the user data each of them holds, to obtain a forest model. The forest model can be used to predict whether a user is a dishonest person. In the process of cooperative security modeling, for the sake of protecting data privacy, the big data company cannot leak the data it holds to the credit reporting agency, and the credit reporting agency cannot leak the data it holds to the big data company.
This specification provides an embodiment of a model training method.

The model training method may be used to train one non-leaf node in the forest model, and the non-leaf node may be a root node or an internal node. In practical applications, by applying the model training method in a recursive manner, each non-leaf node in the forest model can be trained, thereby realizing cooperative security modeling.

Please refer to Fig. 2. The model training method may include the following steps.

Step S101: the first party splits the sample identifier set into multiple subsets according to the feature data.

In some embodiments, the sample identifier set may include multiple sample identifiers. The samples corresponding to the sample identifiers in the sample identifier set are used to train the non-leaf node. Specifically, when the non-leaf node is the root node, the sample identifier set may be the original sample identifier set, and the original sample identifier set may include the sample identifiers of the samples used to train the forest model. When the non-leaf node is an internal node, the sample identifier set may be a subset obtained by splitting after the previous non-leaf node was trained.

In some embodiments, the first party may hold the feature data of the samples corresponding to the sample identifiers in the sample identifier set. The feature data may include P sub-data in P dimensions, where P is a positive integer. The first party may split the sample identifier set into multiple subsets according to the sub-data in at least one dimension. In practical applications, the first party may split the sample identifier set into multiple subsets according to the sub-data in each dimension.

For example, the sample identifier set may include the sample identifiers of N samples x1, x2, ..., xi, ..., xN, and the feature data of each sample may include P sub-data in P dimensions. The sub-data of the samples x1, x2, ..., xi, ..., xN in the i-th dimension are x1_i, x2_i, ..., xi_i, ..., xN_i respectively. Then, according to the sub-data x1_i, x2_i, ..., xi_i, ..., xN_i, the first party can divide the sample identifiers of the samples x1, x2, ..., xi, ..., xN into multiple subsets. Specifically, for example, the i-th dimension may be age. The sub-data of the samples x1, x2, ..., xi, ..., xN in the age dimension are x1_i = 30, x2_i = 35, ..., xi_i = 20, ..., xN_i = 50 respectively. The first party can then divide the sample identifiers of the samples x1, x2, ..., xi, ..., xN into three subsets T1, T2 and T3. The sub-data in the age dimension of the samples corresponding to the sample identifiers in subset T1 is 0-20 years old, the sub-data in the age dimension of the samples corresponding to the sample identifiers in subset T2 is 21-30 years old, and the sub-data in the age dimension of the samples corresponding to the sample identifiers in subset T3 is 31-50 years old.

Step S103: the second party calculates the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier.

In some embodiments, the first gradient value ciphertext and the second gradient value ciphertext may be calculated from the loss function of the forest model. Specifically, the second party may hold the label data of the samples corresponding to the sample identifiers in the sample identifier set. According to the label data, the second party may calculate the first gradient value and the second gradient value corresponding to each sample identifier in the sample identifier set. The first gradient value may be a first-order gradient value of the loss function, and the second gradient value may be a second-order gradient value of the loss function. It is worth noting that the second party may hold the label data of the samples but not the feature data of the samples, in which case the second party may calculate the first gradient value and the second gradient value corresponding to each sample identifier based on the label data alone. Alternatively, the second party may hold the label data and part of the feature data of the samples, in which case the second party may calculate the first gradient value and the second gradient value corresponding to each sample identifier based on the label data and the partial feature data.

Taking the XGBoost algorithm as an example, the second party may calculate the first gradient value corresponding to a sample identifier as

g = ∂l(y, ŷ^(t-1)) / ∂ŷ^(t-1)

and may calculate the second gradient value corresponding to the sample identifier as

h = ∂²l(y, ŷ^(t-1)) / ∂(ŷ^(t-1))²

where g denotes the first gradient value, h denotes the second gradient value, l denotes the loss function, y denotes the label data, ŷ denotes the predicted value of the label data, t denotes the current iteration round, and ŷ^(t-1) denotes the predicted value after the (t-1)-th iteration round. Those skilled in the art should understand that the formulas used here to calculate the first gradient value and the second gradient value are only examples, and other variations or changes are possible in practice. In addition, the XGBoost algorithm here is also only an example; other training algorithms may also be used in practice.

In some embodiments, the second party may encrypt the first gradient value and the second gradient value to obtain the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier in the sample identifier set. Specifically, the second party may encrypt the first gradient value and the second gradient value with a homomorphic encryption algorithm. The homomorphic encryption algorithm may include the Paillier algorithm, the Okamoto-Uchiyama algorithm, the Damgard-Jurik algorithm, and the like. Homomorphic encryption (Homomorphic Encryption) is an encryption technology. It allows operations to be performed directly on ciphertext data to obtain a result that is still encrypted, and the result obtained by decrypting it is the same as the result of performing the same operations on the plaintext data. The homomorphic encryption algorithm may include an additively homomorphic encryption algorithm, a multiplicatively homomorphic encryption algorithm, and the like. For example, the second party may generate a public-private key pair for homomorphic encryption, and may encrypt the first gradient value and the second gradient value with the public key of the public-private key pair.

Step S105: the second party sends the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier to the first party.

Step S107: the first party receives the first gradient value ciphertext and the second gradient value ciphertext corresponding to each sample identifier.

Step S109: within each subset, the first party homomorphically adds the first gradient value ciphertexts of the multiple sample identifiers to obtain the first feature value ciphertext of the subset, and homomorphically adds the second gradient value ciphertexts of the multiple sample identifiers to obtain the second feature value ciphertext of the subset.

In some embodiments, after step S101 the first party may obtain multiple subsets, and each subset may include multiple sample identifiers. For each subset, the first party may homomorphically add the first gradient value ciphertexts corresponding to the multiple sample identifiers in the subset to obtain the first feature value ciphertext of the subset, and may homomorphically add the second gradient value ciphertexts corresponding to the multiple sample identifiers in the subset to obtain the second feature value ciphertext of the subset.

For example, a certain subset may include m sample identifiers x1, x2, ..., xi, ..., xm. The first gradient value ciphertexts corresponding to the sample identifiers x1, x2, ..., xi, ..., xm are E(g(x1)), E(g(x2)), ..., E(g(xi)), ..., E(g(xm)) respectively, and the second gradient value ciphertexts corresponding to the sample identifiers x1, x2, ..., xi, ..., xm are E(h(x1)), E(h(x2)), ..., E(h(xi)), ..., E(h(xm)) respectively. Then, the first party can calculate E(g(x1))+E(g(x2))+...+E(g(xi))+...+E(g(xm)) = E(g(x1)+g(x2)+...+g(xi)+...+g(xm)) as the first feature value ciphertext of the subset, and can calculate E(h(x1))+E(h(x2))+...+E(h(xi))+...+E(h(xm)) = E(h(x1)+h(x2)+...+h(xi)+...+h(xm)) as the second feature value ciphertext of the subset.
Step S111: the first party masks the first feature value ciphertext and the second feature value ciphertext respectively with random numbers, to obtain the masked first feature value ciphertext and the masked second feature value ciphertext.

In some embodiments, by masking the first feature value ciphertext and the second feature value ciphertext, the second party can be prevented from obtaining the first feature value ciphertext and the second feature value ciphertext, and therefore from recovering the first feature value and the second feature value from them, which enhances privacy protection.

In some embodiments, for the first feature value ciphertext and the second feature value ciphertext of each subset, the first party may perform the masking in any one of the following manners to obtain the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to the subset.

Manner 1:

The first feature value ciphertext is masked with only a random number to obtain the masked first feature value ciphertext; the second feature value ciphertext is masked with only a random number to obtain the masked second feature value ciphertext. In this way, the second party can calculate the segmentation gain factor in the subsequent step S119.

The first party may homomorphically encrypt the random number to obtain a random number ciphertext, and may perform homomorphic operations on the random number ciphertext with the first feature value ciphertext and the second feature value ciphertext respectively, to obtain the masked first feature value ciphertext and the masked second feature value ciphertext. The homomorphic operations may include homomorphic addition operations, homomorphic multiplication operations, and any combination thereof. For example, the first party may homomorphically encrypt the random number with the public key of the second party.

For example, the first feature value ciphertext may be E(g), and the masked first feature value ciphertext may be E(g·r). The second feature value ciphertext may be E(h), and the masked second feature value ciphertext may be E((h+λ)·r²). Here r denotes the random number, and λ denotes the coefficient of the regularization term.

Manner 2:

The first feature value ciphertext is masked with a random number and first noise data to obtain the masked first feature value ciphertext; the second feature value ciphertext is masked with a random number to obtain the masked second feature value ciphertext. The first noise data may be a random number with a small value. In this way, the second party can calculate a segmentation gain factor with limited accuracy in the subsequent step S119. It is worth noting that, because the first noise data is a random number with a small value, the segmentation gain factor with limited precision can still meet business requirements.

The specific masking process is similar to Manner 1 above and will not be repeated here.

Manner 3:

The first feature value ciphertext is masked with a random number and first noise data to obtain the masked first feature value ciphertext; the second feature value ciphertext is masked with a random number and second noise data to obtain the masked second feature value ciphertext. The first noise data may be one random number with a small value, and the second noise data may be another random number with a small value. In this way, the second party can calculate a segmentation gain factor with limited accuracy in the subsequent step S119. It is worth noting that, because the first noise data is a random number with a small value and the second noise data is another random number with a small value, the segmentation gain factor with limited precision can still meet business requirements.

The specific masking process is similar to Manner 1 above and will not be repeated here.

For example, the first feature value ciphertext may be E(g), and the masked first feature value ciphertext may be E(g·r+s1). The second feature value ciphertext may be E(h), and the masked second feature value ciphertext may be E((h+λ)·r²+s2). Here r denotes the random number, λ denotes the regularization term coefficient, s1 denotes the first noise data, and s2 denotes the second noise data.

Manner 4:

The first feature value ciphertext is masked with a random number to obtain the masked first feature value ciphertext; the second feature value ciphertext is masked with a random number and second noise data to obtain the masked second feature value ciphertext. The second noise data may be a random number with a small value. In this way, the second party can calculate a segmentation gain factor with limited accuracy in the subsequent step S119. It is worth noting that, because the second noise data is a random number with a small value, the segmentation gain factor with limited precision can still meet business requirements.

The specific masking process is similar to Manner 1 above and will not be repeated here.

Step S113: the first party sends the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset to the second party.

Step S115: the second party receives the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset.

Step S117: the second party decrypts the masked first feature value ciphertext and the masked second feature value ciphertext respectively, to obtain the masked first feature value and the masked second feature value.

In some embodiments, the second party may decrypt the masked first feature value ciphertext and the masked second feature value ciphertext corresponding to each subset, to obtain the masked first feature value and the masked second feature value corresponding to the subset. For example, the second party may decrypt the masked first feature value ciphertext and the masked second feature value ciphertext with the private key.

Step S119: the second party calculates a segmentation gain factor using the masked first feature value and the masked second feature value, where the segmentation gain factor is used to calculate the segmentation gain, and the segmentation gain is used to train the non-leaf nodes of the data processing model.

In some embodiments, for each subset, the second party may operate on the masked first feature value and the masked second feature value corresponding to the subset according to a preset algorithm, to obtain the segmentation gain factor of the subset. The segmentation gain factor may be used to calculate the segmentation gain, and the segmentation gain may be used to measure the degree of order of multiple specific samples; the multiple specific samples may include the samples corresponding to the sample identifiers in the subset. The segmentation gain may include at least one of the following: information gain, information gain rate, and Gini coefficient. Those skilled in the art should understand that the segmentation gain is not limited to the information gain, information gain rate and Gini coefficient listed above; in practice, the segmentation gain may also differ according to different training algorithms.

For example, the masked first feature value ciphertext corresponding to a certain subset may be E(g·r), and the masked second feature value ciphertext corresponding to this subset may be E((h+λ)·r²). By decryption, the masked first feature value corresponding to the subset is g·r, and the masked second feature value corresponding to the subset is (h+λ)·r². The second party can then calculate the segmentation gain factor (g·r)² / ((h+λ)·r²) = g² / (h+λ).

For another example, the masked first feature value ciphertext corresponding to a certain subset may be E(g·r+s1), and the masked second feature value ciphertext corresponding to this subset may be E((h+λ)·r²+s2). By decryption, the masked first feature value corresponding to the subset is g·r+s1, and the masked second feature value corresponding to the subset is (h+λ)·r²+s2. The second party can then calculate the segmentation gain factor (g·r+s1)² / ((h+λ)·r²+s2). Since the first noise data s1 and the second noise data s2 are both random numbers with small values, (g·r+s1)² / ((h+λ)·r²+s2) and g² / (h+λ) are approximately equal.

In some embodiments, the second party may further calculate the segmentation gain of each subset according to the segmentation gain factor of the subset. The second party may select a subset according to the segmentation gains of the subsets, and may then determine the split condition of the non-leaf node according to the selected subset. For example, the second party may select the subset with the largest segmentation gain. Of course, the second party may also calculate the segmentation gain of each subset jointly with the first party according to the segmentation gain factor of the subset.

The model training method of some embodiments of this specification can enhance the privacy protection of data in the process of multi-party cooperative modeling by masking the feature value ciphertexts with random numbers.
本说明书提供模型训练方法的另一个实施例。
所述模型训练方法可以用于对森林模型中的一个非叶子节点进行训练,所述非叶子节点可以为根节点或内部节点。在实际应用中,利用所述模型训练方法,采用递归的方式,可以实现对森林模型中的各个非叶子节点进行训练,从而实现合作安全建模。所述模型训练方法可以应用于第一方,所述第一方可以持有样本的特征数据。
请参阅图3。所述模型训练方法可以包括以下步骤。
步骤S21:根据特征数据,将样本标识集分割为多个子集。
步骤S23:接收每个样本标识所对应的第一梯度值密文和第二梯度值密文。
步骤S25:在每个子集内,将多个样本标识的第一梯度值密文同态相加,得到该子集的第一特征值密文,将多个样本标识的第二梯度值密文同态相加,得到该子集的第二特征值密文。
步骤S27:利用随机数分别对第一特征值密文和第二特征值密文进行掩盖,得到掩盖后的第一特征值密文和掩盖后的第二特征值密文;
步骤S29:向第二方发送每个子集所对应的掩盖后的第一特征值密文、以及掩盖后的第二特征值密文,以便于对数据处理模型的非叶子节点进行训练。
本说明书一些实施例的模型训练方法,通过根据同态加密算法,利用随机数对特征值密文进行掩盖,可以增强在多方合作建模过程中数据的隐私保护。
本说明书提供模型训练方法的另一个实施例。所述模型训练方法可以用于对森林模型中的一个非叶子节点进行训练,所述非叶子节点可以为根节点或内部节点。在实际应用中,利用所述模型训练方法,采用递归的方式,可以实现对森林模型中的各个非叶子节点进行训练,从而实现合作安全建模。所述模型训练方法可以应用于第二方,所述第二方可以持有样本的标签数据。
请参阅图4。所述模型训练方法可以包括以下步骤。
步骤S31:接收子集所对应的掩盖后的第一特征值密文、以及掩盖后的第二特征值密文,所述子集通过对样本标识集进行分割得到,所述样本标识集包括多个样本标识。
步骤S33:分别对掩盖后的第一特征值密文和掩盖后的第二特征值密文进行解密,得到掩盖后的第一特征值和掩盖后的第二特征值。
步骤S35:利用掩盖后的第一特征值和掩盖后的第二特征值,计算分割增益因子,所述分割增益因子用于计算该子集的分割增益,所述分割增益用于对数据处理模型的非叶子节点进行训练。
本说明书一些实施例的模型训练方法,通过根据同态加密算法,利用随机数对特征值密文进行掩盖,可以增强在多方合作建模过程中数据的隐私保护。
本说明书提供模型训练装置的一个实施例,应用于第一方,所述第一方持有样本的特征数据。请参阅图5。该装置可以包括以下单元。
分割单元41,用于根据特征数据,将样本标识集分割为多个子集,所述样本标识集包括多个样本的标识;
接收单元43,用于接收每个样本标识所对应的第一梯度值密文和第二梯度值密文,所述第一梯度值密文和所述第二梯度值密文由同态加密算法分别对损失函数的第一梯度值和第二梯度值加密得到;
相加单元45,用于在每个子集内,将多个样本标识的第一梯度值密文同态相加,得到该子集的第一特征值密文,将多个样本标识的第二梯度值密文同态相加,得到该子集的第二特征值密文;
掩盖单元47,用于利用随机数分别对第一特征值密文和第二特征值密文进行掩 盖,得到掩盖后的第一特征值密文和掩盖后的第二特征值密文;
发送单元49,用于向第二方发送每个子集所对应的掩盖后的第一特征值密文、以及掩盖后的第二特征值密文,以便于对数据处理模型的非叶子节点进行训练。
本说明书提供模型训练装置的一个实施例,应用于第二方,所述第二方持有样本的标签数据。请参阅图6。该装置可以包括以下单元。
接收单元51,用于接收子集所对应的掩盖后的第一特征值密文、以及掩盖后的第二特征值密文,所述子集通过对样本标识集进行分割得到,所述样本标识集包括多个样本标识;
解密单元53,用于分别对掩盖后的第一特征值密文和掩盖后的第二特征值密文进行解密,得到掩盖后的第一特征值和掩盖后的第二特征值;
计算单元55,用于利用掩盖后的第一特征值和掩盖后的第二特征值,计算分割增益因子,所述分割增益因子用于计算该子集的分割增益,所述分割增益用于对数据处理模型的非叶子节点进行训练。
An embodiment of an electronic device of this specification is described below. FIG. 7 is a schematic diagram of the hardware structure of the electronic device in this embodiment. As shown in FIG. 7, the electronic device may include one or more processors (only one is shown in the figure), a memory, and a transmission module. Of course, those of ordinary skill in the art will appreciate that the hardware structure shown in FIG. 7 is merely illustrative and does not limit the hardware structure of the electronic device. In practice, the electronic device may include more or fewer component units than shown in FIG. 7, or may have a configuration different from that shown in FIG. 7.
The memory may include a high-speed random access memory, or may further include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. Of course, the memory may also include a remotely located network memory. The remotely located network memory may be connected to the blockchain client through a network such as the Internet, an intranet, a local area network, or a mobile communication network. The memory may be used to store program instructions or modules of application software, such as the program instructions or modules of the embodiments corresponding to FIG. 3 or FIG. 4 of this specification.
The processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, a logic gate, a switch, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and the like. The processor may read and execute the program instructions or modules in the memory.
The transmission module may be used for data transmission via a network, for example via a network such as the Internet, an intranet, a local area network, or a mobile communication network.
This specification also provides an embodiment of a computer storage medium. The computer storage medium includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a cache, a hard disk drive (HDD), a memory card, and the like. The computer storage medium stores computer program instructions. When the computer program instructions are executed, the program instructions or modules of the embodiments corresponding to FIG. 3 or FIG. 4 of this specification are implemented.
It should be noted that the embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the single-sided method embodiments (such as the embodiments corresponding to FIG. 3 and FIG. 4), the device embodiments, the electronic device embodiment, and the computer storage medium embodiment, since they are substantially similar to the method embodiments, the descriptions are relatively brief, and reference may be made to the relevant parts of the method embodiments. In addition, it can be understood that, after reading this specification, those skilled in the art can conceive of combining some or all of the embodiments listed in this specification in any manner without creative effort, and such combinations also fall within the scope of disclosure and protection of this specification.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement in a method flow). However, with the development of technology, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is such an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by performing a little logic programming on the method flow with the above hardware description languages and programming it into an integrated circuit.
The systems, devices, modules, or units described in the above embodiments may be specifically implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
From the description of the above embodiments, those skilled in the art can clearly understand that this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of this specification, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this specification or in certain parts of the embodiments.
This specification can be applied in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
Although this specification has been described through embodiments, those of ordinary skill in the art will appreciate that there are many variations and changes to this specification without departing from its spirit, and it is intended that the appended claims include these variations and changes without departing from the spirit of this specification.

Claims (14)

  1. A model training method, applied to a first party, wherein the first party holds feature data of samples, the method comprising:
    splitting a sample identifier set into multiple subsets according to the feature data, wherein the sample identifier set comprises multiple sample identifiers;
    receiving a first gradient value ciphertext and a second gradient value ciphertext corresponding to each sample identifier, wherein the first gradient value ciphertext and the second gradient value ciphertext are obtained by encrypting a first gradient value and a second gradient value of a loss function, respectively, with a homomorphic encryption algorithm;
    within each subset, homomorphically adding the first gradient value ciphertexts of the multiple sample identifiers to obtain a first eigenvalue ciphertext of the subset, and homomorphically adding the second gradient value ciphertexts of the multiple sample identifiers to obtain a second eigenvalue ciphertext of the subset;
    masking the first eigenvalue ciphertext and the second eigenvalue ciphertext with random numbers, respectively, to obtain a masked first eigenvalue ciphertext and a masked second eigenvalue ciphertext; and
    sending the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to each subset to a second party, so as to train a non-leaf node of a data processing model.
  2. The method according to claim 1, wherein the data processing model comprises a forest model, the forest model comprises at least one tree model, and the tree model comprises at least two non-leaf nodes;
    the first gradient value is a first-order gradient value, and the second gradient value is a second-order gradient value.
  3. The method according to claim 1, wherein the feature data of a sample comprises multiple pieces of sub-data, each piece of sub-data corresponding to one dimension; and splitting the sample identifier set into multiple subsets comprises:
    splitting the sample identifier set into multiple subsets according to the sub-data of at least one dimension.
  4. The method according to claim 1, wherein masking the first eigenvalue ciphertext and the second eigenvalue ciphertext with random numbers, respectively, comprises:
    homomorphically encrypting a random number to obtain a random number ciphertext; and performing homomorphic operations on the random number ciphertext with the first eigenvalue ciphertext and the second eigenvalue ciphertext, respectively, to obtain the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext;
    wherein the homomorphic operations comprise one or any combination of the following: a homomorphic addition operation and a homomorphic multiplication operation.
  5. The method according to claim 1, wherein the first eigenvalue ciphertext and the second eigenvalue ciphertext are respectively masked with random numbers in any one of the following modes:
    Mode 1:
    masking the first eigenvalue ciphertext with only a random number to obtain the masked first eigenvalue ciphertext; and masking the second eigenvalue ciphertext with only a random number to obtain the masked second eigenvalue ciphertext;
    Mode 2:
    masking the first eigenvalue ciphertext with a random number and first noise data to obtain the masked first eigenvalue ciphertext; and masking the second eigenvalue ciphertext with a random number to obtain the masked second eigenvalue ciphertext;
    Mode 3:
    masking the first eigenvalue ciphertext with a random number and first noise data to obtain the masked first eigenvalue ciphertext; and masking the second eigenvalue ciphertext with a random number and second noise data to obtain the masked second eigenvalue ciphertext;
    Mode 4:
    masking the first eigenvalue ciphertext with a random number to obtain the masked first eigenvalue ciphertext; and masking the second eigenvalue ciphertext with a random number and second noise data to obtain the masked second eigenvalue ciphertext.
  6. The method according to claim 1, wherein the first eigenvalue ciphertext is E(g) and the masked first eigenvalue ciphertext is E(gr); the second eigenvalue ciphertext is E(h) and the masked second eigenvalue ciphertext is E((h+λ)×r²);
    wherein r is a random number and λ is a regularization coefficient.
  7. The method according to claim 1, wherein the first eigenvalue ciphertext is E(g) and the masked first eigenvalue ciphertext is E(gr+s1); the second eigenvalue ciphertext is E(h) and the masked second eigenvalue ciphertext is E((h+λ)×r²+s2);
    wherein r is a random number, λ is a regularization coefficient, s1 is first noise data, and s2 is second noise data.
  8. A model training method, applied to a second party, wherein the second party holds label data of samples, the method comprising:
    receiving a masked first eigenvalue ciphertext and a masked second eigenvalue ciphertext corresponding to a subset, wherein the subset is obtained by splitting a sample identifier set, and the sample identifier set comprises multiple sample identifiers;
    decrypting the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext, respectively, to obtain a masked first eigenvalue and a masked second eigenvalue; and
    computing a split gain factor using the masked first eigenvalue and the masked second eigenvalue, wherein the split gain factor is used to compute a split gain of the subset, and the split gain is used to train a non-leaf node of a data processing model.
  9. The method according to claim 8, wherein the data processing model comprises a forest model, the forest model comprises at least one tree model, and the tree model comprises at least two non-leaf nodes.
  10. The method according to claim 8, wherein the masked first eigenvalue ciphertext corresponding to the subset is computed from the first gradient value ciphertexts of the sample identifiers in the subset, and the masked second eigenvalue ciphertext corresponding to the subset is computed from the second gradient value ciphertexts of the sample identifiers in the subset; the first gradient value ciphertext and the second gradient value ciphertext are obtained by encrypting a first gradient value and a second gradient value of a loss function, respectively, with a homomorphic encryption algorithm; and the first gradient value and the second gradient value of the loss function are computed from the label data of the samples.
  11. The method according to claim 8, wherein the split gain is used to measure the degree of order of multiple specific samples, and the specific samples comprise the samples corresponding to the sample identifiers in the subset.
  12. A model training device, applied to a first party, wherein the first party holds feature data of samples, the device comprising:
    a splitting unit, configured to split a sample identifier set into multiple subsets according to the feature data, wherein the sample identifier set comprises the identifiers of multiple samples;
    a receiving unit, configured to receive a first gradient value ciphertext and a second gradient value ciphertext corresponding to each sample identifier, wherein the first gradient value ciphertext and the second gradient value ciphertext are obtained by encrypting a first gradient value and a second gradient value of a loss function, respectively, with a homomorphic encryption algorithm;
    an adding unit, configured to, within each subset, homomorphically add the first gradient value ciphertexts of the multiple sample identifiers to obtain a first eigenvalue ciphertext of the subset, and homomorphically add the second gradient value ciphertexts of the multiple sample identifiers to obtain a second eigenvalue ciphertext of the subset;
    a masking unit, configured to mask the first eigenvalue ciphertext and the second eigenvalue ciphertext with random numbers, respectively, to obtain a masked first eigenvalue ciphertext and a masked second eigenvalue ciphertext; and
    a sending unit, configured to send the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext corresponding to each subset to a second party, so as to train a non-leaf node of a data processing model.
  13. A model training device, applied to a second party, wherein the second party holds label data of samples, the device comprising:
    a receiving unit, configured to receive a masked first eigenvalue ciphertext and a masked second eigenvalue ciphertext corresponding to a subset, wherein the subset is obtained by splitting a sample identifier set, and the sample identifier set comprises multiple sample identifiers;
    a decryption unit, configured to decrypt the masked first eigenvalue ciphertext and the masked second eigenvalue ciphertext, respectively, to obtain a masked first eigenvalue and a masked second eigenvalue; and
    a computing unit, configured to compute a split gain factor using the masked first eigenvalue and the masked second eigenvalue, wherein the split gain factor is used to compute a split gain of the subset, and the split gain is used to train a non-leaf node of a data processing model.
  14. An electronic device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, and the program instructions comprise instructions for performing the method according to any one of claims 1 to 11.
PCT/CN2020/094664 2019-12-13 2020-06-05 Model training method, device and electronic equipment WO2021114585A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911282429.3 2019-12-13
CN201911282429.3A CN111144576A (zh) 2019-12-13 2019-12-13 Model training method, device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2021114585A1 true WO2021114585A1 (zh) 2021-06-17

Family

ID=70518163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/094664 WO2021114585A1 (zh) 2019-12-13 2020-06-05 模型训练方法、装置和电子设备

Country Status (2)

Country Link
CN (1) CN111144576A (zh)
WO (1) WO2021114585A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144576A (zh) * 2019-12-13 2020-05-12 支付宝(杭州)信息技术有限公司 Model training method, device and electronic equipment
CN113824546B (zh) * 2020-06-19 2024-04-02 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN111930948B (zh) * 2020-09-08 2021-01-26 平安国际智慧城市科技股份有限公司 Information collection and grading method and device, computer equipment and storage medium
CN112381307B (zh) * 2020-11-20 2023-12-22 平安科技(深圳)有限公司 Meteorological event prediction method, device and related equipment
CN112700031B (zh) * 2020-12-12 2023-03-31 同济大学 XGBoost prediction model training method for protecting multi-party data privacy
CN113824677B (zh) * 2020-12-28 2023-09-05 京东科技控股股份有限公司 Federated learning model training method and device, electronic equipment and storage medium
CN114692717A (zh) * 2020-12-31 2022-07-01 华为技术有限公司 Tree model training method, device and system
CN113088359A (zh) * 2021-03-30 2021-07-09 重庆大学 Process-parameter-driven online prediction method for triethylene glycol loss of a triethylene glycol dehydration unit
CN115021900B (zh) * 2022-05-11 2024-05-03 电子科技大学 Method for achieving comprehensive privacy protection with a distributed gradient boosting decision tree

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016089710A1 (en) * 2014-12-02 2016-06-09 Microsoft Technology Licensing, Llc Secure computer evaluation of decision trees
CN109165515A (zh) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Model parameter acquisition method and system based on federated learning, and readable storage medium
US20190026489A1 (en) * 2015-11-02 2019-01-24 LeapYear Technologies, Inc. Differentially private machine learning using a random forest classifier
CN109325584A (zh) * 2018-08-10 2019-02-12 深圳前海微众银行股份有限公司 Federated modeling method and device based on a neural network, and readable storage medium
WO2019173851A1 (en) * 2018-03-06 2019-09-12 KenSci Inc. Cryptographically secure machine learning
CN110457912A (zh) * 2019-07-01 2019-11-15 阿里巴巴集团控股有限公司 Data processing method, device and electronic equipment
CN110535622A (zh) * 2019-08-01 2019-12-03 阿里巴巴集团控股有限公司 Data processing method, device and electronic equipment
CN111144576A (zh) * 2019-12-13 2020-05-12 支付宝(杭州)信息技术有限公司 Model training method, device and electronic equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165725B (zh) * 2018-08-10 2022-03-29 深圳前海微众银行股份有限公司 Neural network federated modeling method and device based on transfer learning, and storage medium
CN109299728B (zh) * 2018-08-10 2023-06-27 深圳前海微众银行股份有限公司 Sample joint prediction method, system and medium based on building a gradient tree model
CN109002861B (zh) * 2018-08-10 2021-11-09 深圳前海微众银行股份有限公司 Federated modeling method, device and storage medium
CN109492420B (zh) * 2018-12-28 2021-07-20 深圳前海微众银行股份有限公司 Model parameter training method based on federated learning, terminal, system and medium
CN109886417B (zh) * 2019-03-01 2024-05-03 深圳前海微众银行股份有限公司 Model parameter training method, device, equipment and medium based on federated learning
CN110276210B (zh) * 2019-06-12 2021-04-23 深圳前海微众银行股份有限公司 Method and device for determining model parameters based on federated learning
CN110427969B (zh) * 2019-07-01 2020-11-27 创新先进技术有限公司 Data processing method, device and electronic equipment
CN110414567B (zh) * 2019-07-01 2020-08-04 阿里巴巴集团控股有限公司 Data processing method, device and electronic equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016089710A1 (en) * 2014-12-02 2016-06-09 Microsoft Technology Licensing, Llc Secure computer evaluation of decision trees
US20190026489A1 (en) * 2015-11-02 2019-01-24 LeapYear Technologies, Inc. Differentially private machine learning using a random forest classifier
WO2019173851A1 (en) * 2018-03-06 2019-09-12 KenSci Inc. Cryptographically secure machine learning
CN109165515A (zh) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Model parameter acquisition method and system based on federated learning, and readable storage medium
CN109325584A (zh) * 2018-08-10 2019-02-12 深圳前海微众银行股份有限公司 Federated modeling method and device based on a neural network, and readable storage medium
CN110457912A (zh) * 2019-07-01 2019-11-15 阿里巴巴集团控股有限公司 Data processing method, device and electronic equipment
CN110535622A (zh) * 2019-08-01 2019-12-03 阿里巴巴集团控股有限公司 Data processing method, device and electronic equipment
CN111144576A (zh) * 2019-12-13 2020-05-12 支付宝(杭州)信息技术有限公司 Model training method, device and electronic equipment

Also Published As

Publication number Publication date
CN111144576A (zh) 2020-05-12

Similar Documents

Publication Publication Date Title
WO2021114585A1 (zh) Model training method, device and electronic equipment
TWI745861B (zh) Data processing method, device and electronic equipment
CN109002861B (zh) Federated modeling method, device and storage medium
TWI730622B (zh) Data processing method, device and electronic equipment
TWI682304B (zh) Abnormal account prevention and control method, device and equipment based on a graph structure model
Ateniese et al. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers
Zheng et al. Privacy-preserving image denoising from external cloud databases
CN111125727B (zh) 混淆电路生成方法、预测结果确定方法、装置和电子设备
US10341103B2 (en) Data analytics on encrypted data elements
WO2021000572A1 (zh) Data processing method, device and electronic equipment
US20200175426A1 (en) Data-based prediction results using decision forests
CN111428887B (zh) Model training control method, device and system based on multiple computing nodes
WO2021017424A1 (zh) Data preprocessing method, ciphertext data acquisition method, device and electronic equipment
WO2020011200A1 (zh) Cross-domain data fusion method, system and storage medium
WO2020233137A1 (zh) Method and device for determining a value of a loss function, and electronic device
US20220237323A1 (en) Compatible anonymization of data sets of different sources
CN114186263A (zh) Data regression method based on vertical federated learning, and electronic device
US20200293911A1 (en) Performing data processing based on decision tree
WO2021098385A1 (zh) Method, device and apparatus for training a GBDT model in a trusted execution environment
US20200293908A1 (en) Performing data processing based on decision tree
US11502856B2 (en) Method for providing information to be stored and method for providing a proof of retrievability
CN112507323A (zh) Model training method and device based on a one-way network, and computing device
CN111737756A (zh) XGB model prediction method, device and system performed via two data owners
WO2021000573A1 (zh) Data processing method, device and electronic equipment
CN113849837A (zh) Training method, device and equipment for a security model, and data processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20898004

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20898004

Country of ref document: EP

Kind code of ref document: A1