WO2021114927A1 - Method and device for multi-party joint feature evaluation to protect privacy and security - Google Patents

Method and device for multi-party joint feature evaluation to protect privacy and security

Info

Publication number
WO2021114927A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
encrypted
sample set
exchange information
feature
Prior art date
Application number
PCT/CN2020/124454
Other languages
English (en)
French (fr)
Inventor
陆梦倩
汲小溪
王维强
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2021114927A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • One or more embodiments of this specification relate to the field of computer information processing, and in particular to a method and device for multi-party joint feature evaluation to protect privacy and security.
  • The data needed for machine learning often spans multiple fields. For example, an electronic payment platform owns a merchant's transaction flow data, an e-commerce platform stores the merchant's sales data, and a banking institution owns the merchant's loan data.
  • Data often exists in the form of isolated islands. Because of industry competition, data security, user privacy and other issues, data integration faces great resistance, and it is difficult to integrate data scattered across platforms to train machine learning models. Using multi-party data to jointly train machine learning models while ensuring that the data is not leaked has therefore become a major challenge. Federated learning has been proposed to address it.
  • the first step of federated learning is to perform feature screening.
  • a commonly used feature screening scheme is to calculate the information value (IV) of a feature to evaluate the correlation between the feature and the label.
  • Calculating the information value of a feature requires both label data and feature data.
  • calculating the information value of a feature held by the non-label holder requires the label data of the label holder, but the label holder is usually unwilling to directly disclose the correspondence between labels and users (that is, the black and white list library) to the non-label holder.
  • likewise, the non-label holder is unwilling to disclose its users and feature data to the label holder.
  • the users, and the correspondence between users and labels (or features), are all private data. Therefore, there is a need for a solution that can calculate the information value of a feature when neither party knows the other party's users and the label data and feature data are isolated from each other.
  • One or more embodiments of this specification describe a method and device for multi-party joint feature evaluation to protect privacy and security, which can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated from each other.
  • the multi-party includes at least a first device and a second device.
  • the first device stores a first sample set and a label of each sample therein.
  • the second device stores a second sample set, and the method is applied to the first device. The method includes: using a first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set; sending first exchange information to the second device, which includes at least the first encrypted ID and label of each sample in the first sample set; receiving second exchange information and third exchange information from the second device, wherein the second exchange information includes the second encrypted ID, obtained by the second device using a second key to perform secondary encryption on the first encrypted ID of each sample in the first sample set, and the corresponding label, and the relative order of the samples in the second exchange information has been shuffled by the second device; the third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device en
  • the method further includes: before sending the first exchange information to the second device, dividing the first sample set into a plurality of second bins based on the feature value of the second feature of each sample in the first sample set, and including in the first exchange information the identification of the second bin in which each sample in the first sample set is located; after the first encrypted set is obtained, shuffling the relative order of the samples of the second sample set to obtain fourth exchange information; and sending the fourth exchange information to the second device, so that the second device determines the shared samples based on the second encrypted IDs in the fourth exchange information and the second encrypted ID of each sample in a second encrypted set, and determines the information value of the second feature based on the label of each shared sample and the identification of the second bin in which it is located, wherein the second encrypted set is obtained by the second device using the second key to re-encrypt the first encrypted IDs in the first exchange information.
  • dividing the first sample set into a plurality of second bins based on the feature value of the second feature of each sample in the first sample set includes: dividing the first sample set into the plurality of second bins according to any one of equal-frequency binning, equal-distance binning, and chi-square binning.
  • the initial ID of each sample in the first sample set and the initial ID of each sample in the second sample set are both positive integers; before the first key is used to encrypt the initial ID of each sample in the first sample set, the method further includes: determining a first prime number that is greater than the largest initial ID in the first sample set and greater than the largest initial ID in the second sample set; and determining a first positive integer that is relatively prime to the first prime number as the first key.
  • using the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set includes: for each sample, determining the remainder of the product of the sample's initial ID and the first key divided by the first prime number as the first encrypted ID of the sample.
  • the first sample set includes a plurality of samples with positive labels and a plurality of samples with negative labels; determining the information value of the first feature based on the label of each shared sample and the identification of the first bin in which it is located includes: for the first bin with a given identification, determining a first proportion, namely the number of shared samples that fall into that bin and have a positive label relative to the total number of shared samples with positive labels; determining a second proportion, namely the number of shared samples that fall into that bin and have a negative label relative to the total number of shared samples with negative labels; and determining the information value of the first feature of the shared samples based on the first proportion and the second proportion corresponding to the first bin of each identification.
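  • The passage above names the two per-bin proportions but does not spell out how they are combined. A minimal sketch, assuming the conventional information-value formula IV = Σ (p_i − q_i)·ln(p_i / q_i), where p_i and q_i are the first and second proportions of bin i. The function name `information_value` and the smoothing constant `eps` are illustrative assumptions and do not come from the specification:

```python
import math
from collections import Counter

def information_value(samples, eps=1e-6):
    """samples: iterable of (label, bin_id) pairs; label 1 = positive, 0 = negative."""
    samples = list(samples)
    pos = Counter(b for y, b in samples if y == 1)   # positives per bin
    neg = Counter(b for y, b in samples if y == 0)   # negatives per bin
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    iv = 0.0
    for b in set(pos) | set(neg):
        p = max(pos[b] / total_pos, eps)   # first proportion (eps avoids log(0); an assumption)
        q = max(neg[b] / total_neg, eps)   # second proportion
        iv += (p - q) * math.log(p / q)
    return iv

# Made-up shared samples: (label, bin identification)
shared_samples = [(1, "a"), (1, "a"), (1, "b"), (0, "a"), (0, "b"), (0, "b")]
iv = information_value(shared_samples)
```

  • A feature whose bins contain positive and negative samples in identical proportions yields an information value of 0 under this formula; larger values indicate a stronger association between the bins and the label.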
  • the samples in the first sample set include user samples, and the machine learning model is a user classification model; or, the samples in the first sample set include business samples, and the machine learning model is a business processing model.
  • the multi-party includes at least a first device and a second device; the first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the method is applied to the second device. The method includes: receiving first exchange information from the first device, which includes at least the first encrypted ID, obtained by the first device using a first key to encrypt the initial ID of each sample in the first sample set, and the corresponding label; using a second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffling the relative order of the samples in the second encrypted set; sending second exchange information to the first device, the second exchange information including the second encrypted ID and label of each sample in the first sample set in the shuffled relative order; using the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID of each sample in the second sample set; and sending third exchange information to the first device, the third exchange information including the first encrypted ID of each sample in the second sample set and the identification of the first bin in which it is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted IDs in the second exchange information, and determines the information value of the first feature based on the label of each common sample and the identification of the first bin in which it is located, the information value being used for feature selection for a machine learning model.
  • the first exchange information further includes the identification of the second bin in which each sample in the first sample set is located, the identification of the second bin being obtained by the first device by binning the first sample set based on the feature value of the second feature of each sample; the method further includes: receiving fourth exchange information from the first device, where the fourth exchange information includes the second encrypted ID of each sample in the second sample set, and the relative order of the samples in the fourth exchange information has been shuffled by the first device; determining the common samples of the first sample set and the second sample set based on the second encrypted IDs in the second encrypted set and the second encrypted IDs in the fourth exchange information; and determining the information value of the second feature based on the label of each common sample and the identification of the second bin in which it is located, the information value being used for feature selection for the machine learning model.
  • an apparatus for multi-party joint feature evaluation to protect privacy and security, where the multi-party includes at least a first device and a second device.
  • the first device stores a first sample set and a label of each sample therein.
  • the second device stores a second sample set, and the apparatus is configured in the first device;
  • the apparatus includes: a first encryption unit for encrypting the initial ID of each sample in the first sample set using a first key to obtain the first encrypted ID of each sample in the first sample set;
  • a first sending unit for sending first exchange information to the second device, which includes at least the first encrypted ID and label of each sample in the first sample set;
  • a first receiving unit for receiving second exchange information and third exchange information from the second device, wherein the second exchange information includes the second encrypted ID, obtained by the second device using a second key to perform secondary encryption on the first encrypted ID of each sample in the first sample set, and the corresponding label, and the relative order of the samples in the second exchange information has been shuffled by the second device;
  • the third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting its initial ID with the second key, and the identification of the first bin in which the sample is located, the identification of the first bin being obtained by the second device by binning based on the feature value of the first feature of each sample in the second sample set;
  • a second encryption unit for using the first key to perform secondary encryption on the first encrypted ID of each sample in the third exchange information to obtain a first encrypted set;
  • a first determination unit configured to determine, based on the second encrypted IDs in the second exchange information and the second encrypted IDs in the first encrypted set, the common samples of the
  • an apparatus for multi-party joint feature evaluation to protect privacy and security, where the multi-party includes at least a first device and a second device, and the first device stores a first sample set and the label of each sample therein.
  • the second device stores a second sample set, and the apparatus is configured in the second device;
  • the apparatus includes: a second receiving unit for receiving first exchange information from the first device, which includes at least the first encrypted ID, obtained by the first device using a first key to encrypt the initial ID of each sample in the first sample set, and the corresponding label;
  • a third encryption unit for using a second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffling the relative order of the samples in the second encrypted set;
  • a second sending unit for sending second exchange information to the first device, where the second exchange information includes the second encrypted ID and label of each sample in the first sample set in the shuffled relative order;
  • a fourth encryption unit for using the second key to encrypt the initial ID of each sample
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect or the method of the sixth aspect.
  • a computing terminal including a memory and a processor, where the memory stores executable code, and when the processor executes the executable code, the method of the first aspect or the method of the sixth aspect is implemented.
  • the method and device provided by the embodiments of this specification can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated from each other, with relatively high security.
  • FIG. 1A shows a schematic diagram of data of data party A according to an embodiment
  • FIG. 1B shows a schematic diagram of data of data party B according to an embodiment
  • Fig. 2 shows a flow chart of jointly calculating the information value of a feature according to an embodiment
  • FIG. 3 shows a flowchart of a method for multi-party joint feature evaluation to protect privacy and security according to an embodiment
  • FIG. 4 shows a flowchart of encrypting ID according to an embodiment
  • Fig. 5 shows a flowchart of a method for multi-party joint feature evaluation to protect privacy and security according to an embodiment
  • Fig. 6 shows a flowchart of a method for multi-party joint feature evaluation to protect privacy and security according to an embodiment
  • Fig. 7 shows a schematic block diagram of an apparatus for multi-party joint feature evaluation to protect privacy and security according to an embodiment
  • Fig. 8 shows a schematic block diagram of an apparatus for multi-party joint feature evaluation to protect privacy and security according to an embodiment.
  • FIG. 1A shows the data owned by the data party A disclosed in the embodiment of this specification.
  • Figure 1B shows the data owned by the data party B disclosed in the embodiment of this specification.
  • Each ID (Identity Document, identity identification number) in FIG. 1A and FIG. 1B may be a digital code that uniquely identifies a user, such as a mobile phone number.
  • ID1, ID2, and ID3 are IDs shared by data party A and data party B.
  • Each ID in FIG. 1A has a tag and a characteristic value of the characteristic Fa.
  • tags can be classified into two types: positive tags and negative tags.
  • Each ID in FIG. 1B has the characteristic value of the characteristic Fb.
  • the data party A may be an electronic payment platform (for example, Alipay), and the label may be a mark of a fraudulent merchant or a mark of a non-fraudulent merchant.
  • the feature Fa may be transaction flow data.
  • the data party B can be a banking institution, and the feature Fb can be loan data.
  • the feature value of the transaction flow data or of the loan data corresponding to each ID can be calculated through feature engineering. For details, refer to the prior art; they are not repeated here.
  • the data party A may be an e-commerce platform (such as Taobao), the label may be a mark of a normal buyer or a mark of an abnormal buyer, and the feature Fa may be sales data.
  • Data party B can be a banking institution, and feature Fb can be loan data.
  • Multi-party joint training of machine learning models requires the features of the users shared by data party A and data party B. To train a machine learning model effectively, it is necessary to evaluate the correlation between features and labels.
  • the feature screening can be performed through the scheme shown in Figure 2.
  • multiple IDs (ID set) in data party A can be called set_A.
  • The multiple IDs (ID set) in data party B can be called set_B.
  • data party A can send set_A and the labels of set_A to data party B.
  • the data party B can determine the shared ID of set_A and set_B, and then calculate the information value of the feature Fb of the shared ID to evaluate the correlation between the feature Fb and the label.
  • Data party B can send set_B to data party A.
  • the data party A can determine the shared ID of set_A and set_B, and then calculate the information value of the feature Fa of the shared ID to evaluate the correlation between the feature Fa and the label.
  • the data parties need to exchange plaintext IDs.
  • Another solution for evaluating the correlation between features and labels is to build a trusted execution environment (for example, using Intel's SGX technology). The data of data party A (set_A, the labels of set_A, and the feature Fa of set_A) and the data of data party B (set_B and the feature Fb of set_B) can be encrypted with a public key and then transmitted to the trusted execution environment.
  • The data is decrypted with the private key inside the trusted execution environment, where the information value of the feature is calculated, and the information value calculation result of the feature is transmitted to the trusted environment.
  • Another solution for evaluating the correlation between features and labels is that the data of data party A (set_A, the labels of set_A, and the feature Fa of set_A) and the data of data party B (set_B and the feature Fb of set_B) are sent to a third party, and the third party completes the calculation of the information value of the feature.
  • the embodiment of this specification provides a method for multi-party joint feature evaluation, which can calculate the information value of the features of the users shared by both parties when neither party knows the other party's users and the label data and feature data are isolated from each other.
  • the method may include the steps shown in FIG. 3. It should be noted that although FIG. 3 shows step 300a-step 310a and step 300b-step 310b in sequence, it does not limit the execution order of these steps 300-310.
  • step 300a to step 310a and step 300b to step 310b may be performed in the order shown in FIG. 3.
  • step 300a-step 310a and step 300b-step 310b may be performed in a different order from that shown in FIG. 3.
  • two or more steps of step 300a to step 310a and step 300b to step 310b may be performed in parallel.
  • the data party A and the data party B may be devices, equipment, platforms, and equipment clusters with computing and processing capabilities, and may cooperate with each other to execute the method shown in FIG. 3.
  • the data party A and the data party B can cooperate with each other to perform the initialization operation.
  • the data party A and the data party B can determine the upper limit of the value of the ID they own. Taking the ID as a mobile phone number as an example, it is an integer composed of 11 digits, that is, each ID is an integer.
  • the upper limit of the ID of either party is the ID with the largest value among the IDs owned by that party.
  • the data party A may determine an integer C1 that is greater than or equal to data party A's largest ID. For example, if each ID is an 11-digit mobile phone number, the integer C1 may be a 12-digit integer.
  • Data party A can send its integer C1 to data party B.
  • the data party B can determine a prime number P that is greater than data party B's largest ID and greater than the integer C1, and send the prime number P to data party A.
  • Alternatively, the data party B may determine an integer C2 that is greater than or equal to data party B's largest ID. For example, if each ID is an 11-digit mobile phone number, the integer C2 may be a 12-digit integer.
  • Data party B can send its integer C2 to data party A.
  • the data party A can determine a prime number P that is greater than data party A's largest ID and greater than the integer C2, and send the prime number P to data party B.
  • the data party A can randomly generate a positive integer keyA that is relatively prime to the prime number P.
  • keyA can also be called the first key.
  • the data party B can randomly generate a positive integer keyB that is relatively prime to the prime number P.
  • keyB can also be called the second key.
  • the data party A and the data party B complete the initialization in the above-mentioned manner, and obtain their respective keys.
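  • The initialization above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`is_prime`, `next_prime_above`, `generate_key`), the trial-division primality test, the small bounds C1 and C2, and the choice of the key range [2, P-1] are all hypothetical and are not taken from the specification, which only requires a prime P above both parties' largest IDs and keys relatively prime to P.

```python
import secrets

def is_prime(n):
    """Trial-division primality test; adequate for a sketch, slow for very large inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True

def next_prime_above(bound):
    """Smallest prime strictly greater than bound."""
    candidate = bound + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

def generate_key(p):
    """Random integer in [2, p-1]; every such integer is relatively prime to a prime p."""
    return 2 + secrets.randbelow(p - 2)

# Hypothetical upper bounds exchanged by the two parties (real IDs would be far larger).
C1, C2 = 1000, 1500
P = next_prime_above(max(C1, C2))
keyA = generate_key(P)   # first key, kept secret by data party A
keyB = generate_key(P)   # second key, kept secret by data party B
```

  • Because P is prime, any integer between 1 and P-1 is relatively prime to it, so no explicit gcd check is needed in this sketch; excluding 1 simply avoids a key that leaves IDs unchanged.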
  • the data party A and the data party B each use their own key to encrypt their IDs for the first time, obtaining their first encrypted IDs.
  • the other party then uses its own key to perform the second encryption.
  • set_A: The set of IDs owned by data party A, that is, the set of IDs of each sample in the sample set of data party A
  • set_B The set of IDs owned by data party B, that is, the set of IDs of each sample in the sample set of data party B
  • each ID in set_A and set_B can be referred to as the initial ID of the sample.
  • In step 302a, data party A uses keyA to encrypt each ID (initial ID) of set_A for the first time to obtain a first encrypted ID.
  • The first encryption calculates the product of the ID and keyA and divides the product by the prime number P; the remainder is the first encrypted ID corresponding to the ID.
  • the first encryption ID can be recorded as Encry(ID, keyA).
  • the ID to be encrypted can be each ID in set_A.
  • The initialization parameter p is the prime number P above.
  • max(ID) is the ID with the largest value held by data party A. The ID to be encrypted is multiplied by keyA to obtain TMP. The remainder E of TMP modulo the prime number p (that is, the remainder obtained by dividing TMP by p) is then used as the encryption result of the ID to be encrypted.
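  • As a small worked sketch of the encryption Encry(ID, keyA) described above, the code below multiplies each ID by the key and keeps the remainder modulo the prime. The function name `encry` and the tiny values of P, key_a and set_a are illustrative assumptions chosen so the arithmetic can be checked by hand:

```python
def encry(id_value, key, p):
    """Encrypt an integer ID as (id_value * key) mod p."""
    tmp = id_value * key   # TMP in the description
    return tmp % p         # remainder E, used as the encrypted ID

P = 101               # illustrative prime, greater than every ID below
key_a = 7             # illustrative first key, relatively prime to P
set_a = [3, 15, 42]   # made-up initial IDs

first_encrypted = [encry(i, key_a, P) for i in set_a]
print(first_encrypted)   # [21, 4, 92]
```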
  • Data party A can perform feature binning on set_A according to the feature value of feature Fa, so as to split the first encrypted ID in set_A into multiple bins.
  • the feature Fa can be a feature set including multiple features such as feature Fa1, feature Fa2, etc.
  • Feature Fa1, feature Fa2 can be collectively referred to as Fai, that is, in Fai, i can be 1, or 2, and so on.
  • each sample has the feature value of the feature Fai (the feature value of the feature Fai may also be referred to as the value of the feature Fai).
  • data party A can perform feature binning according to the feature value of feature Fai corresponding to each ID in set_A, so as to divide the first encrypted ID of ID in set_A into multiple bins corresponding to feature Fai .
  • Each bin has a bin identification.
  • Taking feature Fa1 as an example, its bin identification can be recorded as Fa1_bin.
  • Taking feature Fa2 as an example, its bin identification can be recorded as Fa2_bin.
  • Fa1_bin, Fa2_bin, etc. can be collectively referred to as Fai_bin, which means that the ID is sorted into the Fai_bin bin according to the feature value of the feature Fai.
  • an equal frequency binning algorithm can be used to perform feature binning.
  • the equidistant binning algorithm can be used for feature binning.
  • the chi-square binning algorithm can be used for feature binning.
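  • Two of the binning options listed above can be sketched in a few lines. The functions below are an illustration only, assuming that bin identifications are integer indices starting at 0 and that "equidistant" means equal-width intervals; the names `equal_width_bins` and `equal_frequency_bins` are hypothetical, and chi-square binning is omitted because it additionally needs the labels:

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land at index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

feature_values = [10, 1, 7, 3]   # made-up feature values
by_width = equal_width_bins(feature_values, 2)
by_frequency = equal_frequency_bins(feature_values, 2)
```

  • Note that this rank-based equal-frequency variant may place tied values in different bins; a production implementation would usually bin on quantile boundaries instead.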
  • the first encrypted ID and label of each sample of set_A can be associated with the identifications of the bins assigned according to the feature values of the features Fai, to obtain the associated information of the first encrypted ID of each sample of set_A, which can be recorded as (Encry(ID, keyA), label, Fa1_bin, Fa2_bin, ...). The associated information of all first encrypted IDs of set_A constitutes the first exchange information.
  • the data party A can send the first exchange information to the data party B.
  • each bin may include multiple IDs, for example, K IDs.
  • The feature binning information of A obtained by B is therefore K-anonymized: for any ID, there are at least K IDs whose feature binning information is the same. It is thus difficult for data party B to infer the correspondence between an ID and its feature information from the binning information of that ID.
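  • One way to check the K-anonymity property described above before sending binning information is to count how many IDs share each bin identification (or each combination of bin identifications when several features are binned). This is a minimal sketch; the function names and the sample data are assumptions for illustration:

```python
from collections import Counter

def smallest_group_size(bin_info):
    """bin_info: one hashable entry per ID, e.g. a bin id or a tuple of bin ids."""
    return min(Counter(bin_info).values())

def is_k_anonymous(bin_info, k):
    """True if every distinct binning pattern is shared by at least k IDs."""
    return smallest_group_size(bin_info) >= k

# Made-up (Fa1_bin, Fa2_bin) pairs for five IDs.
bin_info = [(0, 1), (0, 1), (1, 0), (1, 0), (1, 0)]
ok_for_2 = is_k_anonymous(bin_info, 2)
ok_for_3 = is_k_anonymous(bin_info, 3)
```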
  • In step 302b, data party B uses keyB to encrypt each ID (initial ID) of set_B for the first time to obtain a first encrypted ID.
  • The first encryption calculates the product of the ID and keyB and divides the product by the prime number P; the remainder is the first encrypted ID corresponding to the ID.
  • the first encryption ID can be recorded as Encry(ID, keyB).
  • Data party B can perform feature binning on set_B according to the feature value of feature Fb, so as to bin the first encrypted ID in set_B into multiple bins.
  • the feature Fb may be a feature set including multiple features such as feature Fb1 and feature Fb2.
  • Feature Fb1 and feature Fb2 can be collectively referred to as Fbi, that is, i in Fbi can be 1, or 2, and so on. Each sample has a feature value of the feature Fbi.
  • the set_B can be binned according to the feature value of the feature Fbi. For details, reference may be made to the above description of the embodiment shown in step 302a, which will not be repeated here.
  • the first encrypted ID of each sample in set_B can be associated with the identifications of the bins assigned according to the feature values of Fbi, to obtain the associated information of the first encrypted ID of each sample in set_B, which can be recorded as (Encry(ID, keyB), Fb1_bin, Fb2_bin, ...).
  • the associated information of all the first encrypted IDs of set_B constitutes the third exchange information.
  • the data party B can send the third exchange information to the data party A.
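  • As an illustration of what the third exchange information might look like, the sketch below pairs each first encrypted ID of set_B with a bin identification for a single feature Fb1. The prime, the key, the feature values and the fixed-threshold binning function `fb1_bin` are all made-up assumptions; the specification leaves the binning algorithm open:

```python
P, key_b = 101, 5                      # illustrative prime and second key
set_b = {15: 2.0, 42: 9.5, 60: 4.1}    # initial ID -> value of feature Fb1 (made-up data)

def fb1_bin(value):
    """Hypothetical two-bin split of Fb1 at a fixed threshold."""
    return 0 if value < 5.0 else 1

# Each record is (Encry(ID, keyB), Fb1_bin); no label and no plaintext ID is included.
third_exchange = [((i * key_b) % P, fb1_bin(v)) for i, v in set_b.items()]
print(third_exchange)   # [(75, 0), (8, 1), (98, 0)]
```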
  • In step 304a, after data party A receives the third exchange information, it can use keyA to perform secondary encryption on each first encrypted ID of set_B in the third exchange information, obtaining the second encrypted ID corresponding to each first encrypted ID of set_B.
  • Specifically, the product of the first encrypted ID and keyA is calculated, and the remainder obtained by dividing the product by the prime number P is used as the second encrypted ID corresponding to the first encrypted ID, which can be recorded as Encry(Encry(ID, keyB), keyA). Together with the bin identifications, it can be recorded as (Encry(Encry(ID, keyB), keyA), Fb1_bin, Fb2_bin, ...), and this information constitutes the first encrypted set.
  • In step 306a, the relative order of the second encrypted IDs of set_B is shuffled, and the shuffled second encrypted IDs of set_B are sent to data party B as the fourth exchange information.
  • The first encrypted IDs of set_B in the third exchange information have a relative order. After each first encrypted ID of set_B is encrypted a second time using the first key, the relative order of the resulting second encrypted IDs is the same as the relative order of the first encrypted IDs of set_B.
  • If the second encrypted IDs of set_B were sent to data party B without shuffling, data party B could use their relative order to determine the one-to-one correspondence between each second encrypted ID and each first encrypted ID of set_B, derive the first key from it, and then determine the IDs in set_A, so that the IDs and the black and white list of data party A would be leaked.
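  • The risk described above can be made concrete with modular arithmetic. If one second encrypted ID can be paired with its first encrypted ID, the first key follows from a single modular inverse. The sketch below uses made-up small values and Python's built-in `pow(x, -1, p)` (available from Python 3.8) to compute the inverse; it illustrates the attack that shuffling is meant to prevent and is not part of the protocol itself:

```python
P = 101
key_a = 7    # data party A's secret first key (unknown to B)
key_b = 5    # data party B's own second key

first = [(i * key_b) % P for i in (15, 42, 60)]        # B's first encrypted IDs, as sent to A
second_unshuffled = [(e * key_a) % P for e in first]   # hypothetically returned in the same order

# With the order preserved, B pairs position 0 of both lists and solves
# second = first * keyA (mod P) for keyA.
recovered = (second_unshuffled[0] * pow(first[0], -1, P)) % P

# Knowing keyA, B could then invert any first encrypted ID of set_A, e.g. 21 = 3 * 7 mod 101.
leaked_id = (21 * pow(recovered, -1, P)) % P
```

  • Here `recovered` equals the secret key 7 and `leaked_id` equals the initial ID 3, which is why the order is shuffled before the second encrypted IDs are returned.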
  • In addition, the fourth exchange information does not carry the identification of the bin in which each ID of set_B is located. This prevents data party B from using the bin identifications to infer the correspondence between the second encrypted ID and the initial ID (or first encrypted ID) of each sample, obtain the first key, and then determine the IDs in set_A, which would leak the IDs and the black and white list of data party A.
  • In step 304b, after receiving the first exchange information, data party B can use keyB to encrypt each first encrypted ID of set_A in the first exchange information a second time, obtaining the second encrypted ID corresponding to each first encrypted ID of set_A. Specifically, the product of the first encrypted ID and keyB is calculated, and the remainder of dividing that product by the prime number P is used as the second encrypted ID, which can be written as Encry(Encry(ID, keyA), keyB). Together with the label and bin identifiers, each record can be written as (Encry(Encry(ID, keyA), keyB), label, Fa1_bin, Fa2_bin, ...), and these records constitute the second encrypted set.
  • In step 306b, the relative order of the second encrypted IDs of set_A is shuffled, and the shuffled second encrypted IDs of set_A, together with their respective labels, are sent to data party A as the second exchange information.
  • Because the order is shuffled in step 306b and the identifiers of the bins in which the IDs of set_A are located are not sent back to data party A, data party A cannot infer the second key.
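  • Steps 304b and 306b can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `second_encrypt_and_shuffle`, the tuple layout of the records, and the numeric values are hypothetical and chosen only for the example; `encry` follows the product-then-remainder rule described above.

```python
import random


def encry(value, key, p):
    """Encry(value, key): remainder of value * key divided by the prime p."""
    return (value * key) % p


def second_encrypt_and_shuffle(first_exchange, key_b, p, rng=random):
    """Step 304b and 306b on data party B's side.

    first_exchange: list of (first_encrypted_id, label, bin_ids) records
    received from data party A (hypothetical layout).
    """
    # Step 304b: second encryption, keeping label and bin ids locally.
    second_encrypted_set = [
        (encry(enc_id, key_b, p), label, bin_ids)
        for enc_id, label, bin_ids in first_exchange
    ]
    # Step 306b: drop the bin ids, keep the labels, and shuffle the order
    # before anything is sent back to data party A.
    second_exchange = [(enc_id, label) for enc_id, label, _ in second_encrypted_set]
    rng.shuffle(second_exchange)
    return second_encrypted_set, second_exchange


P, key_b = 10_007, 4_321  # illustrative prime and second key
first_exchange = [
    (964, 1, ("Fa1_bin2",)),
    (1_790, 0, ("Fa1_bin0",)),
    (5_130, 1, ("Fa1_bin1",)),
]
second_encrypted_set, second_exchange = second_encrypt_and_shuffle(first_exchange, key_b, P)
```

  • Data party B keeps `second_encrypted_set` for its own later matching and sends only `second_exchange` to data party A.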
  • At this point, every initial ID in set_A and set_B has been encrypted twice.
  • Each initial ID in set_A is first encrypted by data party A using the first key and then encrypted a second time by data party B using the second key.
  • Each initial ID in set_B is first encrypted by data party B using the second key and then encrypted a second time by data party A using the first key.
  • Data parties A and B exchange the results of their respective second encryptions, so that both hold the second encrypted IDs corresponding to the initial IDs in set_A and set_B.
  • x mod(y), also written x mod y, denotes the remainder obtained by dividing x by y.
  • Residue systems have the following properties.
  • Any two numbers in a complete residue system modulo n have different remainders modulo n, and every positive integer has the same remainder modulo n as exactly one number in the complete residue system modulo n.
  • Within the complete residue system modulo n, the set of representatives that are coprime to n is called the reduced residue system modulo n.
  • Encry(Encry(ID, keyA), keyB) = Encry(Encry(ID, keyB), keyA).
  • a*m mod(p) = a*n mod(p)
  • a*m - k1*p = a*n - k2*p
  • When the IDs are equal, their encryption results are the same; when the IDs are not equal, their encryption results must differ.
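  • The two properties above can be checked numerically. The sketch below assumes a small prime and two arbitrary keys chosen only for illustration. The reasoning behind uniqueness is that a*m mod(p) = a*n mod(p) implies a*(m - n) = (k1 - k2)*p; since p is prime and a is coprime with p, p would have to divide m - n, which is impossible when m and n are distinct positive integers smaller than p.

```python
def encry(value, key, p):
    """Encry(value, key): remainder of value * key divided by the prime p."""
    return (value * key) % p


P = 10_007                    # illustrative prime
key_a, key_b = 1_234, 4_321   # illustrative keys, both coprime with P
ids = range(1, P)             # every positive integer ID smaller than P

# Interchangeability: the order in which the two keys are applied is irrelevant.
commutes = all(
    encry(encry(i, key_a, P), key_b, P) == encry(encry(i, key_b, P), key_a, P)
    for i in ids
)

# Uniqueness: distinct IDs never collide after double encryption.
double_encrypted = {encry(encry(i, key_a, P), key_b, P) for i in ids}
unique = len(double_encrypted) == len(ids)
```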
  • In step 308a, data party A can determine the IDs shared by set_A and set_B.
  • The second exchange information carries the label of each ID, and the third exchange information provides the identifier of the bin into which each shared ID falls when binned by the feature value of each feature Fbi (Fb1, Fb2, etc.).
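  • Step 308a amounts to joining two collections on the doubly encrypted ID. The sketch below is one possible way to do this; the function name `common_samples`, the record layouts, and all numeric values are assumptions made for illustration.

```python
def encry(value, key, p):
    """Encry(value, key): remainder of value * key divided by the prime p."""
    return (value * key) % p


def common_samples(second_exchange, first_encrypted_set):
    """Join (second_encrypted_id, label) pairs from the second exchange
    information with (second_encrypted_id, bin_id) records from the first
    encrypted set, keeping only IDs present in both."""
    labels = dict(second_exchange)
    return [
        (enc_id, labels[enc_id], bin_id)
        for enc_id, bin_id in first_encrypted_set
        if enc_id in labels
    ]


P, key_a, key_b = 10_007, 1_234, 4_321        # illustrative prime and keys
labels_a = {11: 1, 22: 0, 33: 1, 44: 0}       # initial ID -> label (data party A)
bins_b = {22: "b0", 33: "b1", 55: "b1"}       # initial ID -> Fb1 bin (data party B)

# What data party A holds after the exchanges (shuffling omitted for brevity).
second_exchange = [
    (encry(encry(i, key_a, P), key_b, P), y) for i, y in labels_a.items()
]
first_encrypted_set = [
    (encry(encry(i, key_b, P), key_a, P), b) for i, b in bins_b.items()
]

shared = common_samples(second_exchange, first_encrypted_set)
```

  • Data party A learns a label and a bin identifier for each shared sample, but only under a doubly encrypted ID.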
  • In step 310a, the information value of each feature Fbi can be calculated from the information obtained in step 308a, using the formula shown in FIG. 3.
  • Precall_k denotes the ratio of the number of positively labeled IDs in bin k to the total number of positively labeled samples among the common samples.
  • Nrecall_k denotes the ratio of the number of negatively labeled IDs in bin k to the total number of negatively labeled samples among the common samples.
  • IV denotes the information value.
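  • The formula itself is given only in the figure. The sketch below assumes it is the commonly used information value definition, IV = sum over bins k of (Precall_k - Nrecall_k) * ln(Precall_k / Nrecall_k). The function name `information_value` and the choice to skip bins where either ratio is zero are assumptions, not part of the specification.

```python
import math
from collections import Counter


def information_value(samples):
    """samples: iterable of (label, bin_id) pairs for the common samples,
    with label 1 for positive and 0 for negative."""
    samples = list(samples)
    pos = Counter(b for y, b in samples if y == 1)
    neg = Counter(b for y, b in samples if y == 0)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    if total_pos == 0 or total_neg == 0:
        raise ValueError("both positive and negative samples are required")
    iv = 0.0
    for b in set(pos) | set(neg):
        precall = pos[b] / total_pos   # share of positives falling in bin b
        nrecall = neg[b] / total_neg   # share of negatives falling in bin b
        if precall == 0 or nrecall == 0:
            continue  # assumption: skip bins where the log term is undefined
        iv += (precall - nrecall) * math.log(precall / nrecall)
    return iv


samples = [(1, "a")] * 3 + [(0, "a")] + [(1, "b")] + [(0, "b")] * 3
iv = information_value(samples)  # equals ln(3) for this toy data
```

  • Only labels and bin identifiers enter the calculation, so the order of the bins and the names used for them do not affect the result.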
  • In step 308b, data party B can determine the IDs shared by set_A and set_B.
  • The first exchange information carries the label of each ID and the identifier of the bin in which it is located. Therefore, in step 310b, the information value of each feature Fai can be calculated.
  • The method provided in the embodiments of this specification can securely compute the information value of features while each party's data remains isolated, without leaking any party's data. The details are as follows.
  • Data party A obtains data party B's IDs encrypted with keyB, together with the corresponding Fb feature bins, but this data is sufficiently confidential with respect to data party A, because: 1) the IDs obtained by data party A are encrypted with keyB, so data party A cannot learn the original IDs behind them and therefore cannot match the Fb binning results to real IDs; 2) the binning information used when calculating the information value is independent of the order of the bins, so the bin identifiers that data party B transmits to data party A can be out of order (this can be done when the order of the second encrypted IDs is shuffled), or can be mere code names, so that data party A cannot learn the ordering of feature magnitudes across bins; 3) each bin of a feature contains K IDs, so the information data party A obtains about data party B's features is effectively K-anonymized: the information for any ID is identical for at least K IDs.
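  • Points 2) and 3) can be illustrated with a small helper that replaces ordered bin indices with shuffled code names and refuses to proceed if any bin is too small. The name `anonymize_bins`, the `bin_<n>` naming scheme, and the interpretation of K as a minimum bin size are assumptions made for this sketch.

```python
import random
from collections import Counter


def anonymize_bins(bin_ids, k, rng=random):
    """Map each bin to an opaque code name and require at least k IDs per bin."""
    counts = Counter(bin_ids)
    if min(counts.values()) < k:
        raise ValueError("a bin holds fewer than k IDs")
    # Code names carry no information about the ordering of feature values.
    names = [f"bin_{n}" for n in range(len(counts))]
    rng.shuffle(names)
    mapping = dict(zip(sorted(counts), names))
    return [mapping[b] for b in bin_ids]


ordered_bins = [0, 0, 1, 1, 2, 2]           # bin index reflects feature magnitude
opaque_bins = anonymize_bins(ordered_bins, k=2)
```

  • Samples that shared a bin still share a code name, which is all the information value calculation needs.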
  • Data party A also obtains the second encryption results of its own IDs. These encrypted IDs have been shuffled by data party B and carry no additional identifying information. Data party A therefore knows only that they are, one to one, the results of encrypting its own IDs, without knowing which result corresponds to which ID.
  • Similarly, the data available to data party B is not sufficient for data party B to derive data party A's data.
  • An embodiment of this specification provides a privacy-preserving method for joint feature evaluation by multiple parties.
  • The multiple parties include at least a first device and a second device.
  • The first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the method is applied to the first device.
  • The method includes the following steps.
  • Step 501: Use the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set.
  • For details, refer to the above description of step 302a in FIG. 3.
  • The remainder encryption algorithm requires little computation and offers high security, which makes it a preferred encryption algorithm. It should be understood that it is not the only option: any encryption algorithm that is superimposable, commutative, and unique can be used to encrypt the sample IDs in step 302a and step 302b.
  • Data party A and data party B may negotiate another encryption algorithm in advance.
  • The encryption algorithm can be any algorithm that encrypts the target data with the same set of keys such that the order in which the keys are applied does not affect the encryption result.
  • For example, the encryption algorithm can be any one of an exclusive OR (XOR) algorithm, a DH algorithm, an ECC-DH algorithm, and the like.
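  • As a small illustration of the order-independence requirement, the sketch below applies XOR with two keys in both orders. The function name `xor_encry` and the key values are hypothetical, and the example is meant to show commutativity and uniqueness only.

```python
def xor_encry(value: int, key: int) -> int:
    """Encrypt an integer ID by XOR with an integer key."""
    return value ^ key


key_a, key_b = 0b1011_0110, 0b0110_1101   # illustrative keys
ids = [17, 42, 200]

double_ab = [xor_encry(xor_encry(i, key_a), key_b) for i in ids]
double_ba = [xor_encry(xor_encry(i, key_b), key_a) for i in ids]
```

  • Both orders give the same doubly encrypted values, and distinct IDs stay distinct, which are the properties the matching step relies on.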
  • Step 503: Send first exchange information to the second device, the first exchange information including at least the first encrypted ID and label of each sample in the first sample set.
  • Step 505: Receive second exchange information and third exchange information from the second device. The second exchange information includes, for each sample in the first sample set, the second encrypted ID obtained by the second device performing secondary encryption on the sample's first encrypted ID with a second key, together with the corresponding label; the relative order of the samples in the second exchange information has been shuffled by the second device. The third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting the sample's initial ID with the second key, together with the identifier of the first bin in which the sample is located; the first-bin identifiers are obtained by the second device binning on the feature value of a first feature of each sample in the second sample set.
  • Step 507: Use the first key to perform secondary encryption on the first encrypted ID of each sample in the third exchange information to obtain a first encrypted set.
  • For details, refer to the above description of step 304a in FIG. 3, which is not repeated here.
  • Step 509: Determine the common samples of the first sample set and the second sample set based on the second encrypted IDs in the second exchange information and the second encrypted IDs in the first encrypted set. For details, refer to the above description of step 308a in FIG. 3, which is not repeated here.
  • Step 511: Determine the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.
  • In some embodiments, the method further includes: before sending the first exchange information to the second device, dividing the first sample set into a plurality of second bins based on the feature value of a second feature of each sample in the first sample set, and including in the first exchange information the identifier of the second bin in which each sample in the first sample set is located; after obtaining the first encrypted set, shuffling the relative order of the samples of the second sample set to obtain fourth exchange information; and sending the fourth exchange information to the second device, so that the second device determines the common samples based on the second encrypted IDs in the fourth exchange information and the second encrypted IDs in a second encrypted set, and determines the information value of the second feature based on the label of each common sample and the identifier of the second bin in which it is located, where the second encrypted set is obtained by using the second key to perform secondary encryption on the first encrypted IDs in the first exchange information. For details, refer to the above description of steps 302a, 306a, 308b, and 310b in FIG. 3.
  • In some embodiments, dividing the first sample set into a plurality of second bins based on the feature value of the second feature of each sample includes: dividing the first sample set into the plurality of second bins by any one of equal-frequency binning, equal-width binning, and chi-square binning.
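  • Of the three binning options, equal-frequency binning is the simplest to sketch. The function name `equal_frequency_bins` is hypothetical, and this minimal version assigns bins by rank, so tied feature values may land in adjacent bins.

```python
def equal_frequency_bins(values, n_bins):
    """Return one bin index per value so that bins hold nearly equal counts."""
    # Indices of the samples, ordered by ascending feature value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # The sample with the given rank goes to bin floor(rank * n_bins / n).
        bins[i] = rank * n_bins // len(values)
    return bins


feature_values = [5, 1, 9, 3, 7, 2]
bin_ids = equal_frequency_bins(feature_values, n_bins=3)  # [1, 0, 2, 1, 2, 0]
```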
  • In some embodiments, the initial ID of each sample in the first sample set and the initial ID of each sample in the second sample set are positive integers. Before the first key is used to encrypt the initial ID of each sample in the first sample set, the method further includes: determining a first prime number that is greater than the largest initial ID in the first sample set and greater than the largest initial ID in the second sample set; and determining a first positive integer that is coprime with the first prime number as the first key.
  • In some embodiments, using the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample includes: for each sample in the first sample set, determining the remainder of dividing the product of the sample's initial ID and the first key by the first prime number as the first encrypted ID of the sample. For details, refer to the above description of step 302a in FIG. 3, which is not repeated here.
  • In some embodiments, the first sample set includes a plurality of samples with positive labels and a plurality of samples with negative labels, and determining the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located includes: determining a first ratio of the number of common samples that fall into the first bin with a first identifier and have a positive label to the total number of common samples with a positive label; determining a second ratio of the number of common samples that fall into the first bin with the first identifier and have a negative label to the total number of common samples with a negative label; and determining the information value of the first feature of the common samples based on the first ratio and the second ratio corresponding to each first bin.
  • In some embodiments, the samples in the first sample set include user samples and the machine learning model is a user classification model; or the samples in the first sample set include business samples and the machine learning model is a business processing model.
  • The method provided in the embodiments of this specification can calculate the information value of features of the users shared by both parties while neither party learns the other party's users and while labels and feature data remain isolated, and therefore offers high security.
  • An embodiment of this specification provides a privacy-preserving method for joint feature evaluation by multiple parties.
  • The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the method is applied to the second device.
  • The method includes the following steps.
  • Step 601: Receive first exchange information from the first device, the first exchange information including at least, for each sample in the first sample set, the first encrypted ID obtained by the first device encrypting the sample's initial ID with a first key, together with the corresponding label.
  • Step 603: Use the second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set.
  • For details, refer to the above description of steps 304b and 306b in FIG. 3, which is not repeated here.
  • Step 605: Send second exchange information to the first device, the second exchange information including the second encrypted ID and label of each sample in the first sample set, in shuffled relative order.
  • Step 607: Use the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID of each sample in the second sample set.
  • Step 609: Based on the feature value of the first feature of each sample in the second sample set, divide the second sample set into a plurality of first bins. For details, refer to the above description of step 302b in FIG. 3, which is not repeated here.
  • Step 611: Send third exchange information to the first device, the third exchange information including the first encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted IDs in the second exchange information, and determines the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.
  • For details, refer to the above description of step 302b in FIG. 3, which is not repeated here.
  • In some embodiments, the first exchange information further includes the identifier of the second bin in which each sample in the first sample set is located, the second-bin identifiers being obtained by the first device binning on the feature value of a second feature of each sample in the first sample set. The method further includes: receiving fourth exchange information from the first device, the fourth exchange information including the second encrypted ID of each sample in the second sample set, the relative order of the samples in the fourth exchange information having been shuffled by the first device; determining the common samples of the first sample set and the second sample set based on the second encrypted IDs in the second encrypted set and the second encrypted IDs in the fourth exchange information; and determining the information value of the second feature based on the label of each common sample and the identifier of the second bin in which it is located, for use in feature selection for a machine learning model.
  • The method provided in the embodiments of this specification can calculate the information value of features of the users shared by both parties while neither party learns the other party's users and while labels and feature data remain isolated, and therefore offers high security.
  • An embodiment of this specification provides a privacy-preserving apparatus 700 for joint feature evaluation by multiple parties.
  • The multiple parties include at least a first device and a second device.
  • The first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the apparatus is configured in the first device.
  • The apparatus 700 includes the following units.
  • the first encryption unit 710 is configured to use the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set.
  • The first sending unit 720 is configured to send first exchange information to the second device, the first exchange information including at least the first encrypted ID and label of each sample in the first sample set.
  • The first receiving unit 730 is configured to receive second exchange information and third exchange information from the second device. The second exchange information includes, for each sample in the first sample set, the second encrypted ID obtained by the second device performing secondary encryption on the sample's first encrypted ID with a second key, together with the corresponding label; the relative order of the samples in the second exchange information has been shuffled by the second device. The third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting the sample's initial ID with the second key, together with the identifier of the first bin in which the sample is located; the first-bin identifiers are obtained by the second device binning on the feature value of the first feature of each sample in the second sample set.
  • the second encryption unit 740 is configured to perform secondary encryption on the first encrypted ID of each sample in the third exchange information based on the first key to obtain the second encrypted ID of each sample in the second sample set .
  • the first determining unit 750 is configured to determine the common samples of the first sample set and the second sample set based on the second encryption ID of each sample in the first sample set and the second encryption ID of each sample in the second sample set .
  • the second determining unit 760 is configured to determine the information value of the first feature based on the label of each sample in the shared sample and the identification of the first bin in which it is located, so as to perform feature selection for the machine learning model.
  • Each functional unit of the apparatus 700 can be implemented with reference to the method embodiment shown in FIG. 5; details are not repeated here.
  • The apparatus provided in the embodiments of this specification can calculate the information value of features of the users shared by both parties while neither party learns the other party's users and while labels and feature data remain isolated, and therefore offers high security.
  • An embodiment of this specification provides a privacy-preserving apparatus 800 for joint feature evaluation by multiple parties.
  • The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the apparatus is configured in the second device. The apparatus includes the following units.
  • The second receiving unit 810 is configured to receive first exchange information from the first device, the first exchange information including at least, for each sample in the first sample set, the first encrypted ID obtained by the first device encrypting the sample's initial ID with a first key, together with the corresponding label.
  • The third encryption unit 820 is configured to use a second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set.
  • The second sending unit 830 is configured to send second exchange information to the first device, the second exchange information including the second encrypted ID and label of each sample in the first sample set, in shuffled relative order.
  • the fourth encryption unit 840 is configured to use the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID in the second sample set.
  • the second binning unit 850 is configured to divide the second sample set into a plurality of first bins based on the feature value of the first feature of each sample in the second sample set.
  • The second sending unit 830 is also configured to send third exchange information to the first device. The third exchange information includes the first encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted ID of each sample in the second exchange information, and determines the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.
  • Each functional unit of the apparatus 800 can be implemented with reference to the method embodiment shown in FIG. 6; details are not repeated here.
  • The apparatus provided in the embodiments of this specification can calculate the information value of features of the users shared by both parties while neither party learns the other party's users and while labels and feature data remain isolated, and therefore offers high security.
  • The embodiments of this specification provide a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed in a computer, the computer is caused to execute the method shown in FIG. 5 or the method shown in FIG. 6.
  • The embodiments of this specification provide a computing terminal including a memory and a processor.
  • The memory stores executable code.
  • When the processor executes the executable code, the method shown in FIG. 5 or the method shown in FIG. 6 is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

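A privacy-preserving method and apparatus for joint feature evaluation by multiple parties. The multiple parties include at least a first device storing a first sample set and a second device storing a second sample set, and the method is applied to the first device. The method includes: encrypting the initial ID of each sample in the first sample set, and sending the resulting first encrypted IDs of the first sample set, together with the labels, to the second device; receiving from the second device the first encrypted IDs of the second sample set and the identifiers of the bins in which the samples are located, as well as the second encrypted IDs and labels of the first sample set; encrypting the first encrypted IDs of the second sample set to obtain the second encrypted IDs of the second sample set; determining the common samples from the second encrypted IDs of the second sample set and the second encrypted IDs of the first sample set; and calculating the information value of a feature from the labels of the common samples and the identifiers of the bins in which they are located, for use in feature selection for a machine learning model.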

Description

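Privacy-preserving method and apparatus for joint feature evaluation by multiple parties
Technical Field
One or more embodiments of this specification relate to the field of computer information processing, and in particular to a privacy-preserving method and apparatus for joint feature evaluation by multiple parties.
Background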
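The data required for machine learning often spans multiple domains. For example, in a machine-learning-based merchant classification scenario, an electronic payment platform holds merchants' transaction records, an e-commerce platform stores merchants' sales data, and a banking institution holds merchants' loan data. Data often exists in isolated silos. Because of industry competition, data security, user privacy, and similar concerns, data integration faces considerable resistance, and pooling the data scattered across platforms to train a machine learning model is difficult to achieve. Jointly training a machine learning model on multi-party data while ensuring that no data is leaked has become a major challenge. Federated learning has been proposed to address it.
Typically, training a machine learning model with a federated learning algorithm requires features that are correlated with the label, so the first step of federated learning is feature selection. A commonly used approach is to calculate the information value (IV) of a feature in order to evaluate the correlation between the feature and the label. Calculating a feature's information value requires both label data and feature data. Calculating the information value of a feature held by the party without labels requires the label data of the label holder, but the label holder is usually unwilling to reveal directly to the other party the correspondence between labels and users (that is, its black/white list). Likewise, the party without labels is unwilling to reveal its users and feature data to the label holder.
In addition, federated learning requires the users that the platforms have in common, in order to perform joint training.
For any party, however, its users and the correspondence between users and labels (or features) are private data. A scheme is therefore needed that can calculate the information value of features while no party learns the other parties' users and while labels and feature data remain isolated.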
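Summary
One or more embodiments of this specification describe a privacy-preserving method and apparatus for joint feature evaluation by multiple parties, which can calculate the information value of features of the users shared by both parties while neither party learns the other party's users and while labels and feature data remain isolated.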
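According to a first aspect, a privacy-preserving method for joint feature evaluation by multiple parties is provided. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the method is applied to the first device. The method includes: using a first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set; sending first exchange information to the second device, the first exchange information including at least the first encrypted ID and label of each sample in the first sample set; receiving second exchange information and third exchange information from the second device, where the second exchange information includes, for each sample in the first sample set, the second encrypted ID obtained by the second device performing secondary encryption on the sample's first encrypted ID with a second key, together with the corresponding label, the relative order of the samples in the second exchange information having been shuffled by the second device, and where the third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting the sample's initial ID with the second key, together with the identifier of the first bin in which the sample is located, the first-bin identifiers being obtained by the second device binning on the feature value of a first feature of each sample in the second sample set; using the first key to perform secondary encryption on the first encrypted ID of each sample in the third exchange information to obtain a first encrypted set; determining the common samples of the first sample set and the second sample set based on the second encrypted IDs in the second exchange information and the second encrypted IDs in the first encrypted set; and determining the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.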
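In some embodiments, the method further includes: before sending the first exchange information to the second device, dividing the first sample set into a plurality of second bins based on the feature value of a second feature of each sample in the first sample set, and including in the first exchange information the identifier of the second bin in which each sample in the first sample set is located; after obtaining the first encrypted set, shuffling the relative order of the samples of the second sample set to obtain fourth exchange information; and sending the fourth exchange information to the second device, so that the second device determines the common samples based on the second encrypted IDs in the fourth exchange information and the second encrypted ID of each sample in a second encrypted set, and determines the information value of the second feature based on the label of each common sample and the identifier of the second bin in which it is located, where the second encrypted set is obtained by using the second key to perform secondary encryption on the first encrypted IDs in the first exchange information.
In some embodiments, dividing the first sample set into a plurality of second bins based on the feature value of the second feature of each sample in the first sample set includes: dividing the first sample set into the plurality of second bins by any one of equal-frequency binning, equal-width binning, and chi-square binning.
In some embodiments, the initial ID of each sample in the first sample set and the initial ID of each sample in the second sample set are positive integers. Before the first key is used to encrypt the initial ID of each sample in the first sample set, the method further includes: determining a first prime number that is greater than the largest initial ID in the first sample set and greater than the largest initial ID in the second sample set; and determining a first positive integer that is coprime with the first prime number as the first key.
In some embodiments, using the first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set includes: for each sample in the first sample set, determining the remainder of dividing the product of the sample's initial ID and the first key by the first prime number as the first encrypted ID of the sample.
In some embodiments, the first sample set includes a plurality of samples with positive labels and a plurality of samples with negative labels, and determining the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located includes: determining a first ratio of the number of common samples that fall into the first bin with a first identifier and have a positive label to the total number of common samples with a positive label; determining a second ratio of the number of common samples that fall into the first bin with the first identifier and have a negative label to the total number of common samples with a negative label; and determining the information value of the first feature of the common samples based on the first ratio and the second ratio corresponding to each first bin.
In some embodiments, the samples in the first sample set include user samples and the machine learning model is a user classification model; or the samples in the first sample set include business samples and the machine learning model is a business processing model.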
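According to a second aspect, a privacy-preserving method for joint feature evaluation by multiple parties is provided. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the method is applied to the second device. The method includes: receiving first exchange information from the first device, the first exchange information including at least, for each sample in the first sample set, the first encrypted ID obtained by the first device encrypting the sample's initial ID with a first key, together with the corresponding label; using a second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffling the relative order of the samples in the second encrypted set; sending second exchange information to the first device, the second exchange information including the second encrypted ID and label of each sample in the first sample set, in shuffled relative order; using the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID of each sample in the second sample set; dividing the second sample set into a plurality of first bins based on the feature value of a first feature of each sample in the second sample set; and sending third exchange information to the first device, the third exchange information including the first encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted IDs in the second exchange information, and determines the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.
In some embodiments, the first exchange information further includes the identifier of the second bin in which each sample in the first sample set is located, the second-bin identifiers being obtained by the first device binning on the feature value of a second feature of each sample in the first sample set. The method further includes: receiving fourth exchange information from the first device, the fourth exchange information including the second encrypted ID of each sample in the second sample set, the relative order of the samples in the fourth exchange information having been shuffled by the first device; determining the common samples of the first sample set and the second sample set based on the second encrypted IDs in the second encrypted set and the second encrypted IDs in the fourth exchange information; and determining the information value of the second feature based on the label of each common sample and the identifier of the second bin in which it is located, for use in feature selection for a machine learning model.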
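According to a third aspect, a privacy-preserving apparatus for joint feature evaluation by multiple parties is provided. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the apparatus is configured in the first device. The apparatus includes: a first encryption unit, configured to use a first key to encrypt the initial ID of each sample in the first sample set to obtain the first encrypted ID of each sample in the first sample set; a first sending unit, configured to send first exchange information to the second device, the first exchange information including at least the first encrypted ID and label of each sample in the first sample set; a first receiving unit, configured to receive second exchange information and third exchange information from the second device, where the second exchange information includes, for each sample in the first sample set, the second encrypted ID obtained by the second device performing secondary encryption on the sample's first encrypted ID with a second key, together with the corresponding label, the relative order of the samples in the second exchange information having been shuffled by the second device, and where the third exchange information includes, for each sample in the second sample set, the first encrypted ID obtained by the second device encrypting the sample's initial ID with the second key, together with the identifier of the first bin in which the sample is located, the first-bin identifiers being obtained by the second device binning on the feature value of a first feature of each sample in the second sample set; a second encryption unit, configured to use the first key to perform secondary encryption on the first encrypted ID of each sample in the third exchange information to obtain a first encrypted set; a first determining unit, configured to determine the common samples of the first sample set and the second sample set based on the second encrypted IDs in the second exchange information and the second encrypted IDs in the first encrypted set; and a second determining unit, configured to determine the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.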
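According to a fourth aspect, a privacy-preserving apparatus for joint feature evaluation by multiple parties is provided. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample in it, the second device stores a second sample set, and the apparatus is configured in the second device. The apparatus includes: a second receiving unit, configured to receive first exchange information from the first device, the first exchange information including at least, for each sample in the first sample set, the first encrypted ID obtained by the first device encrypting the sample's initial ID with a first key, together with the corresponding label; a third encryption unit, configured to use a second key to perform secondary encryption on the first encrypted ID of each sample in the first exchange information to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set; a second sending unit, configured to send second exchange information to the first device, the second exchange information including the second encrypted ID and label of each sample in the first sample set, in shuffled relative order; a fourth encryption unit, configured to use the second key to encrypt the initial ID of each sample in the second sample set to obtain the first encrypted ID of each sample in the second sample set; and a second binning unit, configured to divide the second sample set into a plurality of first bins based on the feature value of a first feature of each sample in the second sample set. The second sending unit is further configured to send third exchange information to the first device, the third exchange information including the first encrypted ID of each sample in the second sample set and the identifier of the first bin in which it is located, so that the first device uses the first key to perform secondary encryption on the first encrypted IDs in the third exchange information to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second encrypted IDs in the first encrypted set and the second encrypted ID of each sample in the second exchange information, and determines the information value of the first feature based on the label of each common sample and the identifier of the first bin in which it is located, for use in feature selection for a machine learning model.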
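According to a fifth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of the first aspect or the method of the second aspect.
According to a sixth aspect, a computing terminal is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method of the first aspect or the method of the second aspect is implemented.
The method and apparatus provided in the embodiments of this specification can calculate the information value of features of the users shared by both parties while neither party learns the other party's users and while labels and feature data remain isolated, and therefore offer high security.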
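Brief Description of the Drawings
To describe the technical solutions of the embodiments of this specification more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below are merely some embodiments of this specification, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1A is a schematic diagram of the data of data party A according to one embodiment;
FIG. 1B is a schematic diagram of the data of data party B according to one embodiment;
FIG. 2 is a flowchart of jointly calculating the information value of features according to one embodiment;
FIG. 3 is a flowchart of a privacy-preserving method for joint feature evaluation by multiple parties according to one embodiment;
FIG. 4 is a flowchart of encrypting IDs according to one embodiment;
FIG. 5 is a flowchart of a privacy-preserving method for joint feature evaluation by multiple parties according to one embodiment;
FIG. 6 is a flowchart of a privacy-preserving method for joint feature evaluation by multiple parties according to one embodiment;
FIG. 7 is a schematic block diagram of a privacy-preserving apparatus for joint feature evaluation by multiple parties according to one embodiment;
FIG. 8 is a schematic block diagram of a privacy-preserving apparatus for joint feature evaluation by multiple parties according to one embodiment.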
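Detailed Description
The solutions provided in this specification are described below with reference to the drawings.
FIG. 1A shows the data held by data party A as disclosed in an embodiment of this specification. FIG. 1B shows the data held by data party B as disclosed in an embodiment of this specification. Each ID (identity document, that is, an identification number) in FIG. 1A and FIG. 1B can be a numeric code that uniquely identifies a user, such as a mobile phone number. As shown in FIG. 1A and FIG. 1B, ID1, ID2, and ID3 are IDs shared by data party A and data party B. Each ID in FIG. 1A has a label and a feature value of feature Fa. For example, as shown in FIG. 1A, labels can be of two kinds, positive and negative. Each ID in FIG. 1B has a feature value of feature Fb.
In one example scenario, data party A can be an electronic payment platform (such as Alipay), and the label can mark a merchant as fraudulent or non-fraudulent. Feature Fa can be transaction record data. Data party B can be a banking institution, and feature Fb can be loan data. The feature value of the transaction record data or loan data for each ID can be computed through feature engineering; for details, refer to the prior art, which is not repeated here.
In one example scenario, data party A can be an e-commerce platform (such as Taobao), the label can mark a buyer as normal or abnormal, and feature Fa can be sales data. Data party B can be a banking institution, and feature Fb can be loan data.
Joint training of a machine learning model by multiple parties uses the features of the users shared by data party A and data party B. To train the model effectively, the correlation between features and labels needs to be evaluated.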
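Feature screening can be performed with the scheme shown in FIG. 2. The set of IDs held by data party A is called set_A, and the set of IDs held by data party B is called set_B. During joint computation, data party A may send set_A and the labels of set_A to data party B. Data party B can then determine the IDs common to set_A and set_B and compute the information value of feature Fb over the common IDs, to evaluate the correlation between feature Fb and the label. Data party B may send set_B to data party A. Data party A can then determine the IDs common to set_A and set_B and compute the information value of feature Fa over the common IDs, to evaluate the correlation between feature Fa and the label. In this scheme, the two parties have to exchange plaintext IDs.
Another scheme for evaluating the correlation between features and labels is to build a trusted execution environment (for example, using Intel SGX). The data of data party A (set_A, the labels of set_A, and feature Fa of set_A) and the data of data party B (set_B and feature Fb of set_B) are each encrypted with a public key and passed into the trusted execution environment. Inside the trusted execution environment, the data is decrypted with the private key, the information value of the features is computed, and the result is passed out of the trusted environment.
Yet another scheme is to send the data of data party A (set_A, the labels of set_A, and feature Fa of set_A) and the data of data party B (set_B and feature Fb of set_B) to a third-party institution, which computes the information value of the features.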
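To further strengthen the security of private data, the embodiments of this specification provide a method for multi-party joint feature evaluation that computes the information value of the features of the users common to both parties while neither party learns the other party's users and while labels and feature data remain isolated. In one embodiment, the method may include the steps shown in FIG. 3. Note that although FIG. 3 shows steps 300a to 310a and steps 300b to 310b in sequence, this does not limit the order in which these steps are executed. In some examples, steps 300a to 310a and steps 300b to 310b are executed in the order shown in FIG. 3. In other examples, they are executed in an order different from that shown in FIG. 3. In still other examples, two or more of steps 300a to 310a and steps 300b to 310b are executed in parallel.
Next, the method for privacy-preserving multi-party joint feature evaluation provided in this specification is described by way of example with reference to FIG. 3.
Data party A and data party B may each be an apparatus, device, platform, or device cluster with computing and processing capabilities, and they cooperate to execute the method shown in FIG. 3.
In step 300a and step 300b, data party A and data party B cooperate to perform initialization. Specifically, each party determines the upper bound of the IDs it holds. Taking mobile phone numbers as IDs, each ID is an integer of 11 digits. The upper bound of a party's IDs is the numerically largest ID held by that party.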
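In one example, data party A determines an integer C1 greater than or equal to its numerically largest ID. For example, when IDs are 11-digit mobile phone numbers, C1 may be a 12-digit integer. Data party A sends C1 to data party B. Data party B determines a prime P that is greater than data party B's numerically largest ID and greater than C1, and sends P to data party A.
In another example, data party B determines an integer C2 greater than or equal to its numerically largest ID. For example, when IDs are 11-digit mobile phone numbers, C2 may be a 12-digit integer. Data party B sends its integer C2 to data party A. Data party A determines a prime P that is greater than data party A's numerically largest ID and greater than C2, and sends P to data party B.
Data party A randomly generates a positive integer keyA that is coprime with the prime P; keyA is also called the first key. Data party B randomly generates a positive integer keyB that is coprime with the prime P; keyB is also called the second key.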
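The initialization in steps 300a and 300b can be sketched as follows. This is a minimal illustration rather than the implementation of the specification: the function names `next_prime_above`, `choose_prime`, and `generate_key` are hypothetical, trial division is used only because it is easy to read, and excluding the key 1 (which would leave every ID unchanged) is an added assumption.

```python
import math
import secrets


def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for the 12-digit range used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def next_prime_above(n: int) -> int:
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def choose_prime(c_from_peer: int, own_max_id: int) -> int:
    """Prime P greater than both the peer's bound C and the local largest ID."""
    return next_prime_above(max(c_from_peer, own_max_id))


def generate_key(p: int) -> int:
    """Random key in [2, p-1]; every such integer is coprime with the prime p."""
    return 2 + secrets.randbelow(p - 2)
```

With a peer bound of 1000 and a local largest ID of 997, `choose_prime` returns 1009, and each party then calls `generate_key(1009)` independently to obtain its own key.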
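Through the above, data party A and data party B complete initialization and each obtain a key. Next, each party encrypts its own IDs a first time with its own key to obtain first-encryption IDs, and sends them to the other party, which encrypts them a second time with its own key. IDs with the same value still have the same value after the two encryptions, so data party A and data party B can each obtain the IDs common to both without revealing unencrypted IDs (also called initial IDs) to the other. Details follow.
For convenience, the set of IDs held by data party A, that is, the set of IDs of the samples in data party A's sample set, is called set_A. The set of IDs held by data party B, that is, the set of IDs of the samples in data party B's sample set, is called set_B. Samples and IDs are in one-to-one correspondence. Before the encryption described below, each ID in set_A and set_B is called the initial ID of a sample.
In step 302a, data party A uses keyA to encrypt each ID (initial ID) in set_A a first time, obtaining a first-encryption ID. For example, for each ID in set_A, the first encryption computes the product of the ID and keyA and takes the remainder of the product divided by the prime P as the first-encryption ID of that ID. The first-encryption ID is denoted Encry(ID,keyA).
As shown in FIG. 4, the ID to be encrypted may be any ID in set_A. The initialized p is the prime P described above, and max(ID) is the numerically largest ID of data party A. The ID to be encrypted is multiplied by the key to obtain TMP. The remainder E of TMP modulo the prime p (that is, the remainder of TMP divided by p) is then taken as the encryption result of the ID.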
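The encryption flow of FIG. 4 amounts to one modular multiplication. The sketch below assumes the hypothetical function name `encry`, chosen to mirror the notation Encry(ID,key); the range check reflects the requirement that max(ID) is less than p.

```python
def encry(id_value: int, key: int, p: int) -> int:
    """Return (id_value * key) mod p, the remainder-based encryption of one ID."""
    if not 0 < id_value < p:
        raise ValueError("ID must lie in 1..p-1")
    tmp = id_value * key   # TMP in FIG. 4
    return tmp % p         # E in FIG. 4


# Example with a small prime: 456 * 123 = 56088, and 56088 mod 1009 = 593.
print(encry(456, 123, 1009))
```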
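Data party A may bin set_A by the values of feature Fa, assigning the first-encryption IDs in set_A to multiple bins. Referring to FIG. 3, feature Fa may be a feature set containing several features such as Fa1 and Fa2, collectively denoted Fai, where i may be 1, 2, and so on. Each sample has a value for feature Fai (also called the value taken by feature Fai). For feature Fai, data party A bins on the Fai values of the IDs in set_A, assigning the first-encryption IDs of set_A to the bins of feature Fai. Each bin has a bin identifier: for feature Fa1 it is denoted Fa1_bin, and for feature Fa2 it is denoted Fa2_bin. Each first-encryption ID is associated with its Fa1_bin, Fa2_bin, and so on, written (Encry(ID,keyA),Fa1_bin,Fa2_bin,…). Fa1_bin, Fa2_bin, and so on are collectively denoted Fai_bin, meaning that the ID was assigned to bin Fai_bin according to its value of feature Fai.
In one example, equal-frequency binning is used for feature binning. In another example, equal-width binning is used. In yet another example, chi-square binning is used.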
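Two of the binning options named above can be sketched in plain Python. The function names and the tie handling are illustrative assumptions: the equal-frequency version assigns bins by rank, so equal feature values may land in adjacent bins, and chi-square binning is omitted because it additionally needs the labels.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins
```

The returned integers play the role of the bin identifiers Fai_bin; because only bin membership is used later, they could equally be replaced by opaque code names.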
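The first-encryption ID of each sample in set_A, its label, and the identifiers of the bins it falls into after binning on the Fai values are associated to form the association information of that first-encryption ID, written (Encry(ID,keyA),label,Fa1_bin,Fa2_bin,…). The association information of all first-encryption IDs of set_A constitutes the first exchange information, which data party A sends to data party B.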
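Assembling the first exchange information can be sketched as below. The record type `FirstExchangeRecord` and the function `build_first_exchange` are hypothetical names, and the in-memory list stands in for whatever wire format the parties actually agree on.

```python
from typing import NamedTuple, Sequence


class FirstExchangeRecord(NamedTuple):
    encrypted_id: int   # Encry(ID, keyA)
    label: int          # 1 for positive, 0 for negative
    bin_ids: tuple      # (Fa1_bin, Fa2_bin, ...)


def build_first_exchange(ids: Sequence[int], labels: Sequence[int],
                         bins_per_feature: Sequence[Sequence[int]],
                         key_a: int, p: int) -> list:
    """Associate each first-encryption ID with its label and per-feature bin IDs."""
    records = []
    for idx, raw_id in enumerate(ids):
        bin_ids = tuple(bins[idx] for bins in bins_per_feature)
        records.append(FirstExchangeRecord((raw_id * key_a) % p, labels[idx], bin_ids))
    return records
```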
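Each bin may contain multiple IDs, for example K IDs. The feature-binning information that data party B obtains about data party A is therefore K-anonymized: for any ID, at least K IDs share the same feature-binning information. It is thus difficult for data party B to infer from the binning information which ID corresponds to which feature information.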
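One way a party might check the K-anonymity property before sending is to count how many samples share each combination of bin identifiers. This is an illustrative check that the specification does not prescribe, and `min_group_size` is a hypothetical name.

```python
from collections import Counter


def min_group_size(bin_tuples):
    """Smallest number of samples sharing one combination of bin identifiers."""
    return min(Counter(bin_tuples).values())


# Two samples share (0, 1) and three share (1, 1), so K is 2 for this data.
print(min_group_size([(0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]))
```

When several features are sent together, the combination of bins is what matters, so K can be smaller than the size of any single bin.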
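In step 302b, data party B uses keyB to encrypt each ID (initial ID) in set_B a first time, obtaining a first-encryption ID. For example, for each ID in set_B, the first encryption computes the product of the ID and keyB and takes the remainder of the product divided by the prime P as the first-encryption ID of that ID. The first-encryption ID is denoted Encry(ID,keyB).
Data party B may bin set_B by the values of feature Fb, assigning the first-encryption IDs in set_B to multiple bins. Referring to FIG. 3, feature Fb may be a feature set containing several features such as Fb1 and Fb2, collectively denoted Fbi, where i may be 1, 2, and so on. Each sample has a value for feature Fbi, and set_B is binned on the Fbi values. For details, refer to the description of step 302a above, which is not repeated here.
The first-encryption ID of each sample in set_B and the identifiers of the bins it falls into after binning on the Fbi values are associated to form the association information of that first-encryption ID, written (Encry(ID,keyB),Fb1_bin,Fb2_bin,…). The association information of all first-encryption IDs of set_B constitutes the third exchange information, which data party B sends to data party A.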
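In step 304a, after receiving the third exchange information, data party A uses keyA to encrypt each first-encryption ID of set_B in the third exchange information a second time, obtaining the second-encryption ID of each first-encryption ID of set_B. Specifically, the product of the first-encryption ID and keyA is computed, and the remainder of the product divided by the prime P is taken as the second-encryption ID of that first-encryption ID, denoted Encry(Encry(ID,keyB),keyA). Together with the bin identifiers, this is written (Encry(Encry(ID,keyB),keyA),Fb1_bin,Fb2_bin,…), and this information constitutes the first encrypted set.
In step 306a, the relative order of the second-encryption IDs of set_B is shuffled, and the shuffled second-encryption IDs of set_B are sent to data party B as the fourth exchange information.
Note that the first-encryption IDs of set_B in the third exchange information have a relative order, and the second-encryption IDs obtained by encrypting them with the first key have the same relative order. If the second-encryption IDs of set_B were sent to data party B without shuffling, data party B could use the relative order to determine the one-to-one correspondence between the second-encryption IDs and the first-encryption IDs of set_B, derive the first key from it, and then determine the IDs in set_A, leaking data party A's IDs and its blacklist and whitelist.
In addition, the fourth exchange information does not carry the bin identifiers of the IDs of set_B. This prevents data party B from using the bin identifiers attached to the second-encryption IDs of set_B to infer the correspondence between each sample's second-encryption ID and its initial ID (or first-encryption ID), which would likewise yield the first key, allow the IDs in set_A to be determined, and leak data party A's IDs and its blacklist and whitelist.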
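Steps 304a and 306a on data party A's side can be sketched as follows. The function names are hypothetical, and using `random.SystemRandom` for the shuffle is an assumption made here so that the permutation does not come from a predictable generator.

```python
import random


def second_encrypt_set_b(third_exchange, key_a, p):
    """Step 304a: re-encrypt set_B's first-encryption IDs, keeping bin IDs locally."""
    return [((enc * key_a) % p, bin_ids) for enc, bin_ids in third_exchange]


def build_fourth_exchange(first_encrypted_set):
    """Step 306a: send only the second-encryption IDs, in shuffled order."""
    ids_only = [enc2 for enc2, _bin_ids in first_encrypted_set]  # bin IDs are dropped
    random.SystemRandom().shuffle(ids_only)
    return ids_only
```

Data party A keeps the unshuffled first encrypted set, with bin identifiers, for its own intersection in step 308a; only the shuffled, bin-free list leaves the device.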
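In step 304b, after receiving the first exchange information, data party B uses keyB to encrypt each first-encryption ID of set_A in the first exchange information a second time, obtaining the second-encryption ID of each first-encryption ID of set_A. Specifically, the product of the first-encryption ID and keyB is computed, and the remainder of the product divided by the prime P is taken as the second-encryption ID of that first-encryption ID, denoted Encry(Encry(ID,keyA),keyB). Together with the label and bin identifiers, this is written (Encry(Encry(ID,keyA),keyB),label,Fa1_bin,Fa2_bin,…), and this information constitutes the second encrypted set.
In step 306b, the relative order of the second-encryption IDs of set_A is shuffled, and the shuffled second-encryption IDs of set_A, together with their labels, are sent to data party A as the second exchange information. In step 306b, the relative order of the second-encryption IDs of set_A is shuffled, and the bin identifiers of the IDs in set_A are not sent to data party A, so that data party A cannot infer the second key.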
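Data party B's side of steps 304b and 306b differs in that labels travel with the IDs while bin identifiers stay behind. A minimal sketch, with the hypothetical function name `build_second_exchange`:

```python
import random


def build_second_exchange(first_exchange, key_b, p):
    """Return (second encrypted set kept by B, shuffled second exchange sent to A).

    first_exchange is an iterable of (first_encryption_id, label, bin_ids).
    """
    second_encrypted_set = [((enc * key_b) % p, label, bin_ids)
                            for enc, label, bin_ids in first_exchange]
    # Labels are kept with the IDs; bin identifiers are not sent back.
    outgoing = [(enc2, label) for enc2, label, _bin_ids in second_encrypted_set]
    random.SystemRandom().shuffle(outgoing)
    return second_encrypted_set, outgoing
```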
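Through the above steps, every initial ID in set_A and set_B is encrypted twice. An initial ID in set_A is first encrypted by data party A with the first key and then encrypted a second time by data party B with the second key. An initial ID in set_B is first encrypted by data party B with the second key and then encrypted a second time by data party A with the first key. Data party A and data party B exchange their second-encryption results, so that both hold the second-encryption IDs of the initial IDs in set_A and set_B. The first key and the second key are both coprime with the prime p, and both the first and the second encryption take the remainder of the product of the key and the ID divided by p as the encrypted ID. By the properties of residue systems, this encryption has the following properties.
Superposability: an ID has the same value range before and after encryption, so encryption can be applied repeatedly.
Commutativity: encryption obeys the commutative law; when the same ID is encrypted twice with two different keys, swapping the order of encryption yields the same ciphertext, that is, Encry(Encry(ID,keyA),keyB)=Encry(Encry(ID,keyB),keyA).
Resistance to decryption: when the encryption key is unknown, decryption is extremely difficult.
Uniqueness: the encryption results of two IDs (integers) are the same if and only if the IDs are equal.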
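Next, these properties of the encryption described in the embodiments of this specification are proved using the properties of residue systems.
In the embodiments of this specification, x mod(y), read as x modulo y, denotes the remainder of x divided by y. Residue systems have the following properties.
Any two numbers in a complete residue system modulo n have different remainders modulo n, and every positive integer has the same remainder modulo n as some number in the complete residue system modulo n. Within a complete residue system modulo n, the set of representatives coprime with n is called a reduced residue system modulo n.
For a prime p and any positive integer a coprime with p, multiplying every element of the least reduced residue system modulo p, S={1,2,3,…,(p-1)}, by a gives the new set a*S={a,2a,3a,…,(p-1)a}, which satisfies a*S mod(p)=S. The proof is as follows.
If x belongs to S, then by the properties of remainders a*x mod(p) is either 0 or belongs to S. Suppose a*x mod(p)=0; then a*x is an integer multiple of p. Since p is prime and x is not divisible by p, a must be divisible by p, which contradicts the condition that a is coprime with p. The assumption therefore fails, a*x mod(p) is not 0, and a*x mod(p) belongs to S.
If x1 and x2 both belong to S with x1>x2, suppose a*x1 and a*x2 are congruent modulo p, that is, a*x1 mod(p)=a*x2 mod(p). Then a*x1-k1*p=a*x2-k2*p for some integers k1 and k2, so a*(x1-x2)=(k1-k2)*p. Since 0<x1-x2<p and p is prime, p does not divide x1-x2, so for the equality to hold a must be an integer multiple of p, which contradicts the condition that a is coprime with p. Hence a*x1 and a*x2 are not congruent modulo p. It follows that the p-1 elements of a*S have remainders modulo p that lie in S and are pairwise distinct, so every element of S is the remainder modulo p of some element of a*S. That is, the set a*S mod(p) equals S.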
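The claim a*S mod(p)=S can be checked numerically for a small prime. The helper name `multiply_residues` is hypothetical; the check complements the proof and does not replace it.

```python
def multiply_residues(a: int, p: int) -> set:
    """Return the set {a*x mod p : x in 1..p-1}."""
    return {(a * x) % p for x in range(1, p)}


p = 13
s = set(range(1, p))
# Every multiplier coprime with the prime 13 merely permutes S.
print(all(multiply_residues(a, p) == s for a in range(1, p)))
```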
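In the embodiments of this specification, max(ID)<p, so every ID belongs to the set S={1,2,3,…,(p-1)}. This proves superposability: an element of S encrypted in the manner provided in the embodiments of this specification still belongs to S and can therefore be encrypted again.
For a prime p and any positive integers a and b coprime with p, the commutative law b*(a*x mod(p))mod(p)=a*(b*x mod(p))mod(p) holds. The proof is as follows.
It is easy to show that x*y mod(z)=[(x mod(z))*(y mod(z))]mod(z). Hence b*(a*x mod(p))mod(p)=[(b mod(p))*(a*x mod(p))]mod(p)=(b*a*x)mod(p), and likewise a*(b*x mod(p))mod(p)=(a*b*x)mod(p). Since b*a*x=a*b*x, it follows that b*(a*x mod(p))mod(p)=a*(b*x mod(p))mod(p).
In the embodiments of this specification, when the same ID is encrypted twice with two different keys, swapping the order of encryption yields the same ciphertext, that is, Encry(Encry(ID,keyA),keyB)=Encry(Encry(ID,keyB),keyA). This proves commutativity.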
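Commutativity is what lets the two parties compare doubly encrypted IDs, and it is easy to confirm exhaustively for a small prime. The prime 1009 and the keys 123 and 456 below are arbitrary illustrative values.

```python
def encry(id_value: int, key: int, p: int) -> int:
    """Remainder-based encryption: (id_value * key) mod p."""
    return (id_value * key) % p


p, key_a, key_b = 1009, 123, 456
# Encrypting with keyA then keyB equals encrypting with keyB then keyA.
commutes = all(
    encry(encry(x, key_a, p), key_b, p) == encry(encry(x, key_b, p), key_a, p)
    for x in range(1, p)
)
print(commutes)
```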
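Given the prime p and the value v of a*x mod(p), where x is known to belong to the set {1,2,3,…,(p-1)} and a is a positive integer coprime with p, finding x is hard. Proof: there are two unknowns, a and x; a ranges from 1 to positive infinity and x ranges from 1 to (p-1), so there are infinitely many candidate solutions, and for every candidate x some key a produces the observed v. The value of x therefore cannot be determined from v alone. That is, when the encryption key is unknown, decryption is extremely difficult. This proves resistance to decryption.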
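The counting argument above can be made concrete: for a fixed ciphertext v, every candidate x in S is explained by some key. The helper `key_explaining` is a hypothetical name, and the three-argument `pow` with exponent -1 (modular inverse) assumes Python 3.8 or later.

```python
def key_explaining(v: int, x: int, p: int) -> int:
    """Return a key a in 1..p-1 such that (a * x) mod p == v."""
    return (v * pow(x, -1, p)) % p


p, v = 13, 7
# Each candidate plaintext x is consistent with the observed ciphertext v
# under some key, so v alone does not single out one x.
candidates = {x: key_explaining(v, x, p) for x in range(1, p)}
print(len(candidates))
```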
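For a prime p and any positive integer a coprime with p, if m and n are two different elements of the set S={1,2,3,…,(p-1)}, then a*m mod(p) is never equal to a*n mod(p). The proof is as follows.
Suppose a*m mod(p)=a*n mod(p). Then a*m-k1*p=a*n-k2*p, where k1 and k2 are integers, so a*(m-n)=(k1-k2)*p. Since a is coprime with p, m-n must be divisible by p. Because m and n both belong to S, the only possibility is m-n=0, that is, m equals n, which contradicts the condition that they are different. Hence a*m mod(p) is not equal to a*n mod(p).
Therefore, with the encryption provided in this specification, the encryption results of two IDs are the same if and only if the IDs are equal; when the IDs differ, their encryption results always differ.
From the above, when set_A and set_B contain the same ID, the result of encrypting that ID in set_A in the manner described above equals the result of encrypting that ID in set_B in the same manner.
Thus, in step 308a, data party A can determine the IDs common to set_A and set_B. The second exchange information carries the label of each ID, and the third exchange information provides, for each common ID, the identifiers of the bins obtained by binning on the values of features Fbi (Fb1, Fb2, and so on).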
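Step 308a is a join on the second-encryption ID. In the sketch below, `join_common` is a hypothetical name; the second exchange information supplies labels and the first encrypted set supplies bin identifiers.

```python
def join_common(second_exchange, first_encrypted_set):
    """Return (label, bin_ids) for every ID present in both sample sets.

    second_exchange:     iterable of (second_encryption_id, label) for set_A
    first_encrypted_set: iterable of (second_encryption_id, bin_ids) for set_B
    """
    label_by_id = dict(second_exchange)
    return [(label_by_id[enc2], bin_ids)
            for enc2, bin_ids in first_encrypted_set
            if enc2 in label_by_id]
```

The result contains only labels and bin identifiers, which is all that the information value computation in step 310a needs.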
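In step 310a, the information value of each feature Fbi is computed from the information obtained in step 308a, using the formula shown in FIG. 3. Here label=1 denotes a positive label and label=0 denotes a negative label. For any feature Fbi, Precall k is the ratio of the number of positive-label IDs in bin k to the total number of positive-label samples among the common samples, Nrecall k is the ratio of the number of negative-label IDs in bin k to the total number of negative-label samples among the common samples, and IV denotes the information value.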
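The formula itself appears only in FIG. 3 and is not reproduced in the text, so the sketch below assumes the conventional definition of information value, the sum over bins k of (Precall_k - Nrecall_k) * ln(Precall_k / Nrecall_k). Skipping bins in which either ratio is zero is a further assumption made here to avoid a division by zero; the specification does not say how such bins are treated.

```python
import math
from collections import Counter


def information_value(labels, bin_ids):
    """Information value of one feature from per-sample labels and bin IDs."""
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = sum(1 for y in labels if y == 0)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both positive and negative samples are required")
    pos = Counter(b for y, b in zip(labels, bin_ids) if y == 1)
    neg = Counter(b for y, b in zip(labels, bin_ids) if y == 0)
    iv = 0.0
    for b in set(bin_ids):
        p_recall = pos[b] / pos_total   # Precall_k
        n_recall = neg[b] / neg_total   # Nrecall_k
        if p_recall == 0 or n_recall == 0:
            continue                    # assumption: skip bins with an empty class
        iv += (p_recall - n_recall) * math.log(p_recall / n_recall)
    return iv
```

For two bins holding 3 positive and 1 negative sample and 1 positive and 3 negative samples respectively, each bin contributes 0.5 * ln 3, giving an information value of ln 3.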
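In step 308b, data party B can determine the IDs common to set_A and set_B. The first exchange information carries the label and bin identifiers of each ID, so the information value of each feature Fai can be computed in step 310b.
The method provided in the embodiments of this specification completes the secure computation of feature information values while each party's data remains isolated, without leaking either party's data. The reasons are as follows.
During the information value computation, data party A obtains data party B's IDs encrypted with keyB together with the corresponding Fb feature bins, but this data is sufficiently concealed from data party A because: 1) the IDs obtained by data party A are encrypted with keyB, so data party A cannot learn the original IDs behind them and therefore cannot link the Fb binning results to real IDs; 2) the binning information used to compute the information value does not depend on the order of the bins, so the bin identifiers that data party B passes to data party A may be shuffled (this can be done when shuffling the order of the second-encryption IDs) or may be mere code names, and data party A cannot learn the ordering of feature magnitudes across bins; 3) each bin of a feature contains K IDs, so the information that data party A obtains about data party B's features is K-anonymized, and the information of any ID is identical to that of at least K IDs. Data party A also obtains the second-encryption results of its own IDs. Because these encrypted IDs have been shuffled by data party B and carry no other identifying information, data party A knows only that they are the encryptions of its own IDs in one-to-one correspondence, but not which corresponds to which. After obtaining the two data sets, data party A matches them, takes the intersection, and computes on it. These operations take place in an encrypted ID space whose mapping to the original space is unknown (the mapping can be known only with both keyA and keyB), so the computation is secure. By similar reasoning, the data available to data party B is insufficient for data party B to derive data party A's data.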
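Referring to FIG. 5, the embodiments of this specification provide a method for privacy-preserving multi-party joint feature evaluation. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the method is applied to the first device. The method includes the following steps.
Step 501: encrypt the initial ID of each sample in the first sample set with a first key to obtain the first-encryption ID of each sample in the first sample set. For details, refer to the description of step 302a in FIG. 3 above, which is not repeated here.
Note that step 302a was described with the remainder-based encryption algorithm. That algorithm requires little computation and offers high security, and is a preferred encryption algorithm. It is not the only possible one: any encryption algorithm that satisfies superposability, commutativity, and uniqueness can be used to encrypt sample IDs in step 302a and step 302b. In the embodiments of this specification, data party A and data party B may agree on another encryption algorithm in advance. The algorithm may be any algorithm for which, when the target data is encrypted with the same group of keys, the order in which the keys are used does not affect the result. Besides the remainder-based algorithm described in the embodiment of FIG. 3, it may be any of an exclusive-or (XOR) algorithm, a DH algorithm, an ECC-DH algorithm, and the like.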
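As one illustration of an alternative that meets the three requirements, the sketch below uses modular exponentiation in the style of Diffie-Hellman. The text does not define what it means by the DH algorithm, so this construction, the names `exp_encry` and `is_valid_exp_key`, and the key condition gcd(key, p-1)=1 (which makes the map one-to-one on 1..p-1) are all assumptions.

```python
import math


def is_valid_exp_key(key: int, p: int) -> bool:
    """A key coprime with p-1 makes x -> x**key mod p a bijection on 1..p-1."""
    return math.gcd(key, p - 1) == 1


def exp_encry(id_value: int, key: int, p: int) -> int:
    """Exponentiation-based encryption: id_value**key mod p."""
    return pow(id_value, key, p)


p, key_a, key_b = 1009, 5, 11   # 1008 = 2**4 * 3**2 * 7, so 5 and 11 are valid keys
swapped_equal = all(
    exp_encry(exp_encry(x, key_a, p), key_b, p)
    == exp_encry(exp_encry(x, key_b, p), key_a, p)
    for x in range(1, p)
)
print(swapped_equal)
```

The result stays in 1..p-1 (superposability), the order of the keys does not matter (commutativity), and distinct IDs give distinct ciphertexts when the key is valid (uniqueness).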
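Step 503: send first exchange information to the second device, the first exchange information including at least the first-encryption ID and the label of each sample in the first sample set. For details, refer to the description of step 302a in FIG. 3 above, which is not repeated here.
Step 505: receive second exchange information and third exchange information from the second device, where the second exchange information includes the second-encryption IDs obtained by the second device by encrypting, with a second key, the first-encryption ID of each sample in the first sample set a second time, together with the corresponding labels, and the relative order of the samples in the second exchange information has been shuffled by the second device; the third exchange information includes, for each sample in the second sample set, the first-encryption ID obtained by the second device by encrypting the sample's initial ID with the second key, and the identifier of the first bin containing the sample, the first-bin identifiers being obtained by the second device by binning on the values of a first feature of the samples in the second sample set.
For details, refer to the description of steps 302b, 304b, and 306b in FIG. 3 above, which is not repeated here.
Step 507: encrypt, with the first key, the first-encryption ID of each sample in the third exchange information a second time to obtain a first encrypted set. For details, refer to the description of step 304a in FIG. 3 above, which is not repeated here.
Step 509: determine the common samples of the first sample set and the second sample set based on the second-encryption IDs in the second exchange information and the second-encryption IDs in the first encrypted set. For details, refer to the description of step 308a in FIG. 3 above, which is not repeated here.
Step 511: determine the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them. For details, refer to the description of step 310a in FIG. 3 above, which is not repeated here.
In some embodiments, the method further includes: before sending the first exchange information to the second device, dividing the first sample set into multiple second bins based on the values of a second feature of the samples in the first sample set, and including in the first exchange information the identifier of the second bin containing each sample in the first sample set; after obtaining the first encrypted set, shuffling the relative order of the samples in the second sample set to obtain fourth exchange information; and sending the fourth exchange information to the second device, so that the second device determines the common samples based on the second-encryption IDs in the fourth exchange information and the second-encryption IDs in a second encrypted set, and determines the information value of the second feature based on the labels of the common samples and the identifiers of the second bins containing them, where the second encrypted set is obtained by encrypting the first-encryption IDs in the first exchange information a second time with the second key. For details, refer to the description of steps 302a, 306a, 308b, and 310b in FIG. 3 above, which is not repeated here.
In one example of this embodiment, dividing the first sample set into multiple second bins based on the values of the second feature of the samples in the first sample set includes: dividing the first sample set into the multiple second bins according to any one of equal-frequency binning, equal-width binning, and chi-square binning.
In some embodiments, the initial IDs of the samples in the first sample set and the initial IDs of the samples in the second sample set are all positive integers. Before encrypting the initial ID of each sample in the first sample set with the first key, the method further includes: determining a first prime that is greater than the largest initial ID among the samples in the first sample set and greater than the largest initial ID among the samples in the second sample set; and determining a first positive integer coprime with the first prime as the first key. For details, refer to the description of steps 300a and 300b in FIG. 3 above, which is not repeated here.
In some embodiments, encrypting the initial ID of each sample in the first sample set with the first key to obtain the first-encryption ID of each sample in the first sample set includes: for each sample in the first sample set, determining the remainder of the product of the sample's initial ID and the first key divided by the first prime as the first-encryption ID of the sample. For details, refer to the description of step 302a in FIG. 3 above, which is not repeated here.
In some embodiments, the first sample set includes multiple samples with positive labels and multiple samples with negative labels. Determining the information value of the first feature based on the labels of the common samples and the identifiers of the first bins containing them includes: determining a first ratio of the number of common samples that fall into the first bin with a first identifier and have a positive label to the total number of common samples with a positive label; determining a second ratio of the number of common samples that fall into the first bin with the first identifier and have a negative label to the total number of common samples with a negative label; and determining the information value of the first feature of the common samples based on the first ratio and the second ratio corresponding to the first bin of each identifier. For details, refer to the description of step 310a in FIG. 3 above, which is not repeated here.
In some embodiments, the samples in the first sample set include user samples and the machine learning model is a user classification model; or the samples in the first sample set include service samples and the machine learning model is a service processing model.
The method provided in the embodiments of this specification can compute the information value of the features of the users common to both parties while neither party learns the other party's users and while labels and feature data remain isolated, and offers high security.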
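Referring to FIG. 6, the embodiments of this specification provide a method for privacy-preserving multi-party joint feature evaluation. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the method is applied to the second device. As shown in FIG. 6, the method includes the following steps.
Step 601: receive first exchange information from the first device, the first exchange information including at least the first-encryption ID, obtained by the first device by encrypting the initial ID of each sample in the first sample set with a first key, and the corresponding label. For details, refer to the description of step 302a in FIG. 3 above, which is not repeated here.
Step 603: encrypt, with a second key, the first-encryption ID of each sample in the first exchange information a second time to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set. For details, refer to the description of steps 304b and 306b in FIG. 3 above, which is not repeated here.
Step 605: send second exchange information to the first device, the second exchange information including the second-encryption IDs and labels of the samples in the first sample set in shuffled relative order. For details, refer to the description of step 306b in FIG. 3 above, which is not repeated here.
Step 607: encrypt the initial ID of each sample in the second sample set with the second key to obtain the first-encryption IDs of the second sample set. For details, refer to the description of step 302b in FIG. 3 above, which is not repeated here.
Step 609: divide the second sample set into multiple first bins based on the values of a first feature of the samples in the second sample set. For details, refer to the description of step 302b in FIG. 3 above, which is not repeated here.
Step 611: send third exchange information to the first device, the third exchange information including the first-encryption ID of each sample in the second sample set and the identifier of the first bin containing it, so that the first device encrypts the first-encryption IDs in the third exchange information a second time with the first key to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second-encryption IDs in the first encrypted set and the second-encryption IDs in the second exchange information, and determines the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
For details, refer to the description of step 302b in FIG. 3 above, which is not repeated here.
In some embodiments, the first exchange information further includes the identifier of the second bin containing each sample in the first sample set, the second-bin identifiers being obtained by the first device by binning on the values of a second feature of the samples in the first sample set. The method further includes: receiving fourth exchange information from the first device, the fourth exchange information including the second-encryption ID of each sample in the second sample set, the relative order of the samples in the fourth exchange information having been shuffled by the first device; determining the common samples of the first sample set and the second sample set based on the second-encryption IDs of the second encrypted set and the second-encryption IDs in the fourth exchange information; and determining the information value of the second feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the second bins containing them. For details, refer to the description of steps 302a, 304a, 306a, 308b, and 310b in FIG. 3 above, which is not repeated here.
The method provided in the embodiments of this specification can compute the information value of the features of the users common to both parties while neither party learns the other party's users and while labels and feature data remain isolated, and offers high security.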
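Referring to FIG. 7, the embodiments of this specification provide an apparatus 700 for privacy-preserving multi-party joint feature evaluation. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the apparatus is deployed on the first device. As shown in FIG. 7, the apparatus 700 includes the following units.
A first encryption unit 710, configured to encrypt the initial ID of each sample in the first sample set with a first key to obtain the first-encryption ID of each sample in the first sample set.
A first sending unit 720, configured to send first exchange information to the second device, the first exchange information including at least the first-encryption ID and the label of each sample in the first sample set.
A first receiving unit 730, configured to receive second exchange information and third exchange information from the second device, where the second exchange information includes the second-encryption IDs obtained by the second device by encrypting, with a second key, the first-encryption ID of each sample in the first sample set a second time, together with the corresponding labels, and the relative order of the samples in the second exchange information has been shuffled by the second device; the third exchange information includes, for each sample in the second sample set, the first-encryption ID obtained by the second device by encrypting the sample's initial ID with the second key, and the identifier of the first bin containing the sample, the first-bin identifiers being obtained by the second device by binning on the values of a first feature of the samples in the second sample set.
A second encryption unit 740, configured to encrypt, based on the first key, the first-encryption ID of each sample in the third exchange information a second time to obtain the second-encryption ID of each sample in the second sample set.
A first determining unit 750, configured to determine the common samples of the first sample set and the second sample set based on the second-encryption IDs of the samples in the first sample set and the second-encryption IDs of the samples in the second sample set.
A second determining unit 760, configured to determine the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
The functions of the functional units of the apparatus 700 can be implemented with reference to the method embodiment shown in FIG. 5 and are not repeated here.
The apparatus provided in the embodiments of this specification can compute the information value of the features of the users common to both parties while neither party learns the other party's users and while labels and feature data remain isolated, and offers high security.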
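Referring to FIG. 8, the embodiments of this specification provide an apparatus 800 for privacy-preserving multi-party joint feature evaluation. The multiple parties include at least a first device and a second device; the first device stores a first sample set and the label of each sample therein, the second device stores a second sample set, and the apparatus is deployed on the second device. The apparatus 800 includes the following units.
A second receiving unit 810, configured to receive first exchange information from the first device, the first exchange information including at least the first-encryption ID, obtained by the first device by encrypting the initial ID of each sample in the first sample set with a first key, and the corresponding label.
A third encryption unit 820, configured to encrypt, with a second key, the first-encryption ID of each sample in the first exchange information a second time to obtain a second encrypted set, and then shuffle the relative order of the samples in the second encrypted set.
A second sending unit 830, configured to send second exchange information to the first device, the second exchange information including the second-encryption IDs and labels of the samples in the first sample set in shuffled relative order.
A fourth encryption unit 840, configured to encrypt the initial ID of each sample in the second sample set with the second key to obtain the first-encryption IDs of the second sample set.
A second binning unit 850, configured to divide the second sample set into multiple first bins based on the values of a first feature of the samples in the second sample set.
The second sending unit 830 is further configured to send third exchange information to the first device, the third exchange information including the first-encryption ID of each sample in the second sample set and the identifier of the first bin containing it, so that the first device encrypts the first-encryption IDs in the third exchange information a second time with the first key to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second-encryption IDs in the first encrypted set and the second-encryption IDs of the samples in the second exchange information, and determines the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
The functions of the functional units of the apparatus 800 can be implemented with reference to the method embodiment shown in FIG. 6 and are not repeated here.
The apparatus provided in the embodiments of this specification can compute the information value of the features of the users common to both parties while neither party learns the other party's users and while labels and feature data remain isolated, and offers high security.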
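In another aspect, the embodiments of this specification provide a computer-readable storage medium storing a computer program that, when executed in a computer, causes the computer to perform the method shown in FIG. 5 or the method shown in FIG. 6.
In another aspect, the embodiments of this specification provide a computing terminal, including a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method shown in FIG. 5 or the method shown in FIG. 6 is implemented.
A person skilled in the art will appreciate that, in one or more of the above examples, the functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and do not limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.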

Claims (20)

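  1. A method for privacy-preserving multi-party joint feature evaluation, the multiple parties including at least a first device and a second device, the first device storing a first sample set and the label of each sample therein, the second device storing a second sample set, the method being applied to the first device; the method comprising:
    encrypting the initial ID of each sample in the first sample set with a first key to obtain the first-encryption ID of each sample in the first sample set;
    sending first exchange information to the second device, the first exchange information including at least the first-encryption ID and the label of each sample in the first sample set;
    receiving second exchange information and third exchange information from the second device, wherein the second exchange information includes the second-encryption IDs obtained by the second device by encrypting, with a second key, the first-encryption ID of each sample in the first sample set a second time, together with the corresponding labels, and the relative order of the samples in the second exchange information has been shuffled by the second device; the third exchange information includes, for each sample in the second sample set, the first-encryption ID obtained by the second device by encrypting the sample's initial ID with the second key, and the identifier of the first bin containing the sample, the first-bin identifiers being obtained by the second device by binning on the values of a first feature of the samples in the second sample set;
    encrypting, with the first key, the first-encryption ID of each sample in the third exchange information a second time to obtain a first encrypted set;
    determining the common samples of the first sample set and the second sample set based on the second-encryption IDs in the second exchange information and the second-encryption IDs in the first encrypted set;
    determining the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
  2. The method according to claim 1, wherein the method further comprises:
    before sending the first exchange information to the second device, dividing the first sample set into multiple second bins based on the values of a second feature of the samples in the first sample set, and including in the first exchange information the identifier of the second bin containing each sample in the first sample set;
    after obtaining the first encrypted set, shuffling the relative order of the samples in the second sample set to obtain fourth exchange information;
    sending the fourth exchange information to the second device, so that the second device determines the common samples based on the second-encryption IDs in the fourth exchange information and the second-encryption IDs in a second encrypted set, and determines the information value of the second feature based on the labels of the common samples and the identifiers of the second bins containing them, wherein the second encrypted set is obtained by encrypting the first-encryption IDs in the first exchange information a second time with the second key.
  3. The method according to claim 2, wherein dividing the first sample set into multiple second bins based on the values of the second feature of the samples in the first sample set comprises:
    dividing the first sample set into the multiple second bins according to any one of equal-frequency binning, equal-width binning, and chi-square binning.
  4. The method according to claim 1, wherein the initial IDs of the samples in the first sample set and the initial IDs of the samples in the second sample set are all positive integers; before encrypting the initial ID of each sample in the first sample set with the first key, the method further comprises:
    determining a first prime that is greater than the largest initial ID among the samples in the first sample set and greater than the largest initial ID among the samples in the second sample set;
    determining a first positive integer coprime with the first prime as the first key.
  5. The method according to claim 4, wherein encrypting the initial ID of each sample in the first sample set with the first key to obtain the first-encryption ID of each sample in the first sample set comprises:
    for each sample in the first sample set, determining the remainder of the product of the sample's initial ID and the first key divided by the first prime as the first-encryption ID of the sample.
  6. The method according to claim 1, wherein the first sample set includes multiple samples with positive labels and multiple samples with negative labels; determining the information value of the first feature based on the labels of the common samples and the identifiers of the first bins containing them comprises:
    determining a first ratio of the number of common samples that fall into the first bin with a first identifier and have a positive label to the total number of common samples with a positive label;
    determining a second ratio of the number of common samples that fall into the first bin with the first identifier and have a negative label to the total number of common samples with a negative label;
    determining the information value of the first feature of the common samples based on the first ratio and the second ratio corresponding to the first bin of each identifier.
  7. The method according to claim 1, wherein the samples in the first sample set include user samples and the machine learning model is a user classification model; or,
    the samples in the first sample set include service samples and the machine learning model is a service processing model.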
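  8. A method for privacy-preserving multi-party joint feature evaluation, the multiple parties including at least a first device and a second device, the first device storing a first sample set and the label of each sample therein, the second device storing a second sample set, the method being applied to the second device; the method comprising:
    receiving first exchange information from the first device, the first exchange information including at least the first-encryption ID, obtained by the first device by encrypting the initial ID of each sample in the first sample set with a first key, and the corresponding label;
    encrypting, with a second key, the first-encryption ID of each sample in the first exchange information a second time to obtain a second encrypted set, and then shuffling the relative order of the samples in the second encrypted set;
    sending second exchange information to the first device, the second exchange information including the second-encryption IDs and labels of the samples in the first sample set in shuffled relative order;
    encrypting the initial ID of each sample in the second sample set with the second key to obtain the first-encryption IDs of the second sample set;
    dividing the second sample set into multiple first bins based on the values of a first feature of the samples in the second sample set;
    sending third exchange information to the first device, the third exchange information including the first-encryption ID of each sample in the second sample set and the identifier of the first bin containing it, so that the first device encrypts the first-encryption IDs in the third exchange information a second time with the first key to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second-encryption IDs in the first encrypted set and the second-encryption IDs in the second exchange information, and determines the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
  9. The method according to claim 8, wherein the first exchange information further includes the identifier of the second bin containing each sample in the first sample set, the second-bin identifiers being obtained by the first device by binning on the values of a second feature of the samples in the first sample set;
    the method further comprises:
    receiving fourth exchange information from the first device, the fourth exchange information including the second-encryption ID of each sample in the second sample set, the relative order of the samples in the fourth exchange information having been shuffled by the first device;
    determining the common samples of the first sample set and the second sample set based on the second-encryption IDs of the second encrypted set and the second-encryption IDs in the fourth exchange information;
    determining the information value of the second feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the second bins containing them.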
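  10. An apparatus for privacy-preserving multi-party joint feature evaluation, the multiple parties including at least a first device and a second device, the first device storing a first sample set and the label of each sample therein, the second device storing a second sample set, the apparatus being deployed on the first device; the apparatus comprising:
    a first encryption unit, configured to encrypt the initial ID of each sample in the first sample set with a first key to obtain the first-encryption ID of each sample in the first sample set;
    a first sending unit, configured to send first exchange information to the second device, the first exchange information including at least the first-encryption ID and the label of each sample in the first sample set;
    a first receiving unit, configured to receive second exchange information and third exchange information from the second device, wherein the second exchange information includes the second-encryption IDs obtained by the second device by encrypting, with a second key, the first-encryption ID of each sample in the first sample set a second time, together with the corresponding labels, and the relative order of the samples in the second exchange information has been shuffled by the second device; the third exchange information includes, for each sample in the second sample set, the first-encryption ID obtained by the second device by encrypting the sample's initial ID with the second key, and the identifier of the first bin containing the sample, the first-bin identifiers being obtained by the second device by binning on the values of a first feature of the samples in the second sample set;
    a second encryption unit, configured to encrypt, with the first key, the first-encryption ID of each sample in the third exchange information a second time to obtain a first encrypted set;
    a first determining unit, configured to determine the common samples of the first sample set and the second sample set based on the second-encryption IDs in the second exchange information and the second-encryption IDs in the first encrypted set;
    a second determining unit, configured to determine the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
  11. The apparatus according to claim 10, wherein the apparatus further comprises a first binning unit and a first shuffling unit;
    the first binning unit is configured to, before the first exchange information is sent to the second device, divide the first sample set into multiple second bins based on the values of a second feature of the samples in the first sample set, and include in the first exchange information the identifier of the second bin containing each sample in the first sample set;
    the first shuffling unit is configured to, after the first encrypted set is obtained, shuffle the relative order of the samples in the second sample set to obtain fourth exchange information;
    the first sending unit is further configured to send the fourth exchange information to the second device, so that the second device determines the common samples based on the second-encryption IDs in the fourth exchange information and the second-encryption IDs in a second encrypted set, and determines the information value of the second feature based on the labels of the common samples and the identifiers of the second bins containing them, wherein the second encrypted set is obtained by encrypting the first-encryption IDs in the first exchange information a second time with the second key.
  12. The apparatus according to claim 11, wherein the first binning unit is configured to divide the first sample set into the multiple second bins according to any one of equal-frequency binning, equal-width binning, and chi-square binning.
  13. The apparatus according to claim 10, wherein the initial IDs of the samples in the first sample set and the initial IDs of the samples in the second sample set are all positive integers; the apparatus further comprises a third determining unit and a fourth determining unit;
    the third determining unit is configured to determine a first prime that is greater than the largest initial ID among the samples in the first sample set and greater than the largest initial ID among the samples in the second sample set;
    the fourth determining unit is configured to determine a first positive integer coprime with the first prime as the first key.
  14. The apparatus according to claim 13, wherein the first encryption unit is further configured to, for each sample in the first sample set, determine the remainder of the product of the sample's initial ID and the first key divided by the first prime as the first-encryption ID of the sample.
  15. The apparatus according to claim 10, wherein the second determining unit is further configured to determine a first ratio of the number of common samples that fall into the first bin with a first identifier and have a positive label to the total number of common samples with a positive label;
    the second determining unit is further configured to determine a second ratio of the number of common samples that fall into the first bin with the first identifier and have a negative label to the total number of common samples with a negative label;
    the second determining unit is further configured to determine the information value of the first feature of the common samples based on the first ratio and the second ratio corresponding to the first bin of each identifier.
  16. The apparatus according to claim 10, wherein the samples in the first sample set include user samples and the machine learning model is a user classification model; or,
    the samples in the first sample set include service samples and the machine learning model is a service processing model.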
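  17. An apparatus for privacy-preserving multi-party joint feature evaluation, the multiple parties including at least a first device and a second device, the first device storing a first sample set and the label of each sample therein, the second device storing a second sample set, the apparatus being deployed on the second device; the apparatus comprising:
    a second receiving unit, configured to receive first exchange information from the first device, the first exchange information including at least the first-encryption ID, obtained by the first device by encrypting the initial ID of each sample in the first sample set with a first key, and the corresponding label;
    a third encryption unit, configured to encrypt, with a second key, the first-encryption ID of each sample in the first exchange information a second time to obtain a second encrypted set, and then shuffle the relative order of the samples in the first sample set;
    a second sending unit, configured to send second exchange information to the first device, the second exchange information including the second-encryption IDs and labels of the samples in the first sample set in shuffled relative order;
    a fourth encryption unit, configured to encrypt the initial ID of each sample in the second sample set with the second key to obtain the first-encryption IDs of the second sample set;
    a second binning unit, configured to divide the second sample set into multiple first bins based on the values of a first feature of the samples in the second sample set;
    the second sending unit being further configured to send third exchange information to the first device, the third exchange information including the first-encryption ID of each sample in the second sample set and the identifier of the first bin containing it, so that the first device encrypts the first-encryption IDs in the third exchange information a second time with the first key to obtain a first encrypted set, determines the common samples of the first sample set and the second sample set based on the second-encryption IDs in the first encrypted set and the second-encryption IDs of the samples in the second exchange information, and determines the information value of the first feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the first bins containing them.
  18. The apparatus according to claim 17, wherein the first exchange information further includes the identifier of the second bin containing each sample in the first sample set, the second-bin identifiers being obtained by the first device by binning on the values of a second feature of the samples in the first sample set;
    the apparatus further comprises a fifth unit and a sixth unit;
    the second receiving unit is configured to receive fourth exchange information from the first device, the fourth exchange information including the second-encryption ID of each sample in the second sample set, the relative order of the samples in the fourth exchange information having been shuffled by the first device;
    the fifth unit is configured to determine the common samples of the first sample set and the second sample set based on the second-encryption IDs of the second encrypted set and the second-encryption IDs in the fourth exchange information;
    the sixth unit is configured to determine the information value of the second feature, for feature selection for a machine learning model, based on the labels of the common samples and the identifiers of the second bins containing them.
  19. A computer-readable storage medium storing a computer program that, when executed in a computer, causes the computer to perform the method according to any one of claims 1-7 or the method according to any one of claims 8-9.
  20. A computing terminal, comprising a memory and a processor, the memory storing executable code, wherein when the processor executes the executable code, the method according to any one of claims 1-7 or the method according to any one of claims 8-9 is implemented.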
PCT/CN2020/124454 2019-12-11 2020-10-28 Method and apparatus for privacy-preserving multi-party joint feature evaluation WO2021114927A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911269227.5A CN110990857B (zh) 2019-12-11 2019-12-11 Method and apparatus for privacy-preserving multi-party joint feature evaluation
CN201911269227.5 2019-12-11

Publications (1)

Publication Number Publication Date
WO2021114927A1 true WO2021114927A1 (zh) 2021-06-17

Family

ID=70092518

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124454 WO2021114927A1 (zh) 2019-12-11 2020-10-28 Method and apparatus for privacy-preserving multi-party joint feature evaluation

Country Status (3)

Country Link
CN (1) CN110990857B (zh)
TW (1) TWI738333B (zh)
WO (1) WO2021114927A1 (zh)

Also Published As

Publication number Publication date
TWI738333B (zh) 2021-09-01
TW202123049A (zh) 2021-06-16
CN110990857A (zh) 2020-04-10
CN110990857B (zh) 2021-04-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20900124

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20900124

Country of ref document: EP

Kind code of ref document: A1