WO2023045503A1 - Method and apparatus for feature processing based on differential privacy - Google Patents

Method and apparatus for feature processing based on differential privacy

Info

Publication number
WO2023045503A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
samples
party
encryption
feature
Prior art date
Application number
PCT/CN2022/105052
Other languages
English (en)
French (fr)
Inventor
杜健
段普
张本宇
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2023045503A1
Priority to US18/394,978 (published as US20240152643A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Definitions

  • One or more embodiments of this specification relate to the technical field of data processing, and in particular to a method and device for feature processing based on differential privacy.
  • Federated learning, also known as federated machine learning, joint learning, alliance learning, etc., is a machine learning framework designed to help multiple parties use data and build machine learning models jointly while meeting the requirements of data privacy protection and legal compliance.
  • Federated learning can be divided into horizontal federated learning and vertical federated learning.
  • Vertical federated learning is also called sample-aligned federated learning.
  • In vertical federated learning, multiple parties each hold different sample features for the same sample IDs, and one party (denoted as party B in Figure 1) holds the sample labels.
  • One or more embodiments of this specification describe a method and device for feature processing based on differential privacy.
  • With this method and device, the data holders can jointly complete feature transformation processing while each ensuring the security of its own data.
  • According to a first aspect, a method for feature processing based on differential privacy is provided. The method involves a first party and a second party, wherein the first party stores a first feature part of a plurality of samples, and the second party stores binary classification labels of the plurality of samples. The method is executed by the second party and includes: encrypting the plurality of binary classification labels corresponding to the plurality of samples to obtain a plurality of encrypted labels; sending the plurality of encrypted labels to the first party; receiving from the first party a first positive-sample encrypted noise-added count and a first negative-sample encrypted noise-added count corresponding to each first bin among a plurality of first bins, and decrypting them to obtain the corresponding first positive-sample noise-added count and first negative-sample noise-added count, wherein the encrypted noise-added counts are determined based on the plurality of encrypted labels and first differential-privacy noise, and the plurality of first bins are obtained by binning the plurality of samples with respect to any feature in the first feature part; and determining, based on the first positive-sample noise-added count and the first negative-sample noise-added count, a first noise-added index of the corresponding first bin.
  • the business objects targeted by the multiple samples are any of the following: users, commodities, and business events.
  • Encrypting the plurality of binary classification labels corresponding to the plurality of samples to obtain the plurality of encrypted labels includes: encrypting the plurality of binary classification labels based on a homomorphic encryption algorithm.
  • Determining the first noise-added index of the corresponding first bin includes: summing the first positive-sample noise-added counts corresponding to the plurality of first bins to obtain a first positive-sample noise-added total; summing the first negative-sample noise-added counts corresponding to the plurality of first bins to obtain a first negative-sample noise-added total; and determining the first noise-added index based on the first positive-sample noise-added total, the first negative-sample noise-added total, the first positive-sample noise-added count, and the first negative-sample noise-added count.
  • In one embodiment, the first noise-added index is a first noise-added weight of evidence.
  • Determining the first noise-added index then includes: dividing the first positive-sample noise-added count by the first positive-sample noise-added total to obtain a first positive-sample proportion; dividing the first negative-sample noise-added count by the first negative-sample noise-added total to obtain a first negative-sample proportion; and subtracting the logarithm of the first negative-sample proportion from the logarithm of the first positive-sample proportion to obtain the first noise-added weight of evidence.
  • In one embodiment, the second party further stores a second feature part of the plurality of samples. The method further includes: binning the plurality of samples with respect to any feature in the second feature part to obtain a plurality of second bins; and determining, based on a differential privacy mechanism, a second noise-added index of each second bin among the plurality of second bins. After the first noise-added index of the corresponding first bin is determined, the method further includes: performing feature screening on the first feature part and/or the second feature part based on the first noise-added index and the second noise-added index.
  • Determining the second noise-added index of each second bin among the plurality of second bins includes: determining, based on the binary classification labels, the true positive-sample count and the true negative-sample count in each second bin; adding second differential-privacy noise to the true positive-sample count and the true negative-sample count, respectively, to obtain a second positive-sample noise-added count and a second negative-sample noise-added count; and determining, based on the second positive-sample noise-added count and the second negative-sample noise-added count, a second noise-added index corresponding to the second bin.
  • In one embodiment, the second differential-privacy noise is Gaussian noise. Before the second differential-privacy noise is added, the method further includes: determining a noise power based on the set privacy budget parameters and the number of bins corresponding to each feature in the second feature part; generating a Gaussian noise distribution with the noise power as the variance and 0 as the mean; and sampling the Gaussian noise from the Gaussian noise distribution.
  • Determining the noise power includes: determining the sum of the numbers of bins corresponding to the respective features; obtaining a variable value of a mean variable, the variable value being determined based on the parameter values of the privacy budget parameters and the constraint relationship between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy; and calculating the noise power as the product of the sum of the numbers of bins and the reciprocal of the squared variable value.
  • The privacy budget parameters include a budget item parameter and a slack item parameter.
  • In one embodiment, the method further includes: for the plurality of second bins, correspondingly sampling a plurality of groups of noises from the differential-privacy noise distribution. Adding the second differential-privacy noise then includes: adding one noise of the corresponding group to the true positive-sample count, and adding another noise of the group to the true negative-sample count.
  • Determining the second noise-added index corresponding to the second bin includes: summing the second positive-sample noise-added counts corresponding to the plurality of second bins to obtain a second positive-sample noise-added total; summing the second negative-sample noise-added counts corresponding to the plurality of second bins to obtain a second negative-sample noise-added total; and determining the second noise-added index based on the second positive-sample noise-added total, the second negative-sample noise-added total, the second positive-sample noise-added count, and the second negative-sample noise-added count.
  • In one embodiment, the second noise-added index is a second noise-added weight of evidence.
  • Determining the second noise-added index then includes: dividing the second positive-sample noise-added count by the second positive-sample noise-added total to obtain a second positive-sample proportion; dividing the second negative-sample noise-added count by the second negative-sample noise-added total to obtain a second negative-sample proportion; and subtracting the logarithm of the second negative-sample proportion from the logarithm of the second positive-sample proportion to obtain the second noise-added weight of evidence.
  • According to a second aspect, a method for feature processing based on differential privacy is provided. The method involves a first party and a second party, wherein the first party stores a first feature part of a plurality of samples, and the second party stores a second feature part and binary classification labels of the plurality of samples. The method is executed by the first party and includes: receiving from the second party a plurality of encrypted labels, which are obtained by encrypting the plurality of binary classification labels corresponding to the plurality of samples; binning the plurality of samples with respect to any feature in the first feature part to obtain a plurality of first bins; determining, based on the plurality of encrypted labels and differential-privacy noise, a first positive-sample encrypted noise-added count and a first negative-sample encrypted noise-added count corresponding to each first bin; and sending the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count to the second party, so that the second party decrypts them to obtain the first positive-sample noise-added count and the first negative-sample noise-added count and determines the first noise-added index of the corresponding first bin.
  • the business objects targeted by the multiple samples are any of the following: users, commodities, and business events.
  • Determining the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin includes: for each first bin, determining the multiplication result of the encrypted labels corresponding to its samples; multiplying the multiplication result by the encrypted noise obtained by encrypting the differential-privacy noise to obtain the first positive-sample encrypted noise-added count; and subtracting the first positive-sample encrypted noise-added count from the encrypted total obtained by encrypting the total number of samples in the first bin to obtain the first negative-sample encrypted noise-added count.
  • In one embodiment, before the multiplication result is multiplied by the encrypted noise, the method further includes: for the plurality of first bins, correspondingly sampling a plurality of noises from the differential-privacy noise distribution. Multiplying the multiplication result by the encrypted noise then includes: encrypting the noise corresponding to the multiplication result among the plurality of noises to obtain the encrypted noise, and multiplying the multiplication result by the encrypted noise.
  • In one embodiment, the differential-privacy noise is Gaussian noise. Before determining the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin, the method further includes: determining a noise power based on the privacy budget parameters set for the plurality of samples and the number of bins corresponding to each feature in the first feature part; generating a Gaussian noise distribution with the noise power as the variance and 0 as the mean; and sampling the Gaussian noise from the Gaussian noise distribution.
  • Determining the noise power includes: determining the sum of the numbers of bins corresponding to the respective features; obtaining a variable value of a mean variable, the variable value being determined based on the parameter values of the privacy budget parameters and the constraint relationship between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy; and calculating the noise power as the product of the sum of the numbers of bins and the reciprocal of the squared variable value.
  • The privacy budget parameters include a budget item parameter and a slack item parameter.
  • According to another aspect, an apparatus for feature processing based on differential privacy is provided, involving a first party and a second party, wherein the first party stores a first feature part of a plurality of samples and the second party stores binary classification labels of the plurality of samples. The apparatus is integrated in the second party and includes: a label encryption unit configured to encrypt the plurality of binary classification labels corresponding to the plurality of samples to obtain a plurality of encrypted labels; an encrypted-label sending unit configured to send the plurality of encrypted labels to the first party; an encrypted-count processing unit configured to receive from the first party a first positive-sample encrypted noise-added count and a first negative-sample encrypted noise-added count corresponding to each first bin among a plurality of first bins, and to decrypt them to obtain the corresponding first positive-sample noise-added count and first negative-sample noise-added count, wherein the encrypted noise-added counts are determined based on the plurality of encrypted labels and first differential-privacy noise, and the plurality of first bins are obtained by binning the plurality of samples with respect to any feature in the first feature part; and a first index calculation unit configured to determine, based on the first positive-sample noise-added count and the first negative-sample noise-added count, a first noise-added index corresponding to the first bin.
  • the second party further stores a second feature portion of the plurality of samples;
  • The device further includes: a binning processing unit configured to, for any feature in the second feature part, perform binning processing on the plurality of samples to obtain a plurality of second bins;
  • a second index calculation unit configured to determine, based on a differential privacy mechanism, the second noise-added index of each second bin among the plurality of second bins; and
  • a feature screening unit configured to perform feature screening on the first feature part and/or the second feature part based on the first noise-added index and the second noise-added index.
  • According to yet another aspect, an apparatus for feature processing based on differential privacy is provided, involving a first party and a second party, wherein the first party stores a first feature part of a plurality of samples and the second party stores a second feature part and binary classification labels of the plurality of samples. The apparatus is integrated in the first party and includes: an encrypted-label receiving unit configured to receive from the second party a plurality of encrypted labels, which are obtained by encrypting the plurality of binary classification labels corresponding to the plurality of samples; a binning processing unit configured to bin the plurality of samples with respect to any feature in the first feature part to obtain a plurality of first bins; an encryption and noise-adding unit configured to determine, based on the plurality of encrypted labels and differential-privacy noise, a first positive-sample encrypted noise-added count and a first negative-sample encrypted noise-added count corresponding to each first bin; and an encrypted-count sending unit configured to send the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count to the second party, so that the second party decrypts them to obtain the first positive-sample noise-added count and the first negative-sample noise-added count and determines the first noise-added index of the corresponding first bin.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed in a computer, it causes the computer to execute the method provided in the first aspect or the second aspect.
  • A computing device is also provided, including a memory and a processor, wherein executable code is stored in the memory; when the processor executes the executable code, the method provided in the first aspect or the second aspect above is implemented.
  • each data holder can jointly complete the feature transformation process while ensuring the security of its own data.
  • FIG. 1 shows a schematic diagram of a data distribution scenario of vertical federated learning according to an embodiment
  • FIG. 2 shows a multi-party interaction diagram for feature processing based on differential privacy according to an embodiment
  • FIG. 3 shows a flowchart of a method for feature processing based on differential privacy according to an embodiment
  • Fig. 4 shows a schematic structural diagram of a device for feature processing based on differential privacy according to an embodiment
  • Fig. 5 shows a schematic structural diagram of an apparatus for performing feature processing based on differential privacy according to another embodiment.
  • Evaluation indicators of sample features, such as the weight of evidence (WoE) and the information value (IV), can be calculated based on sample labels, so as to realize feature screening, feature encoding, or to provide related data query services.
  • To calculate the WoE of a feature i, the samples are usually divided into bins according to the distribution of the values of feature i, and the WoE of each bin is then calculated separately.
  • For the j-th bin under the i-th sample feature, referred to simply as a feature bin, the WoE value is calculated by the following formula:
  • WoE_{i,j} = ln(y_{i,j} / y) - ln(n_{i,j} / n)
  • where WoE_{i,j} represents the weight of evidence of the feature bin; y_{i,j} and n_{i,j} represent the number of positive samples and the number of negative samples in the feature bin, respectively; and y and n represent the number of positive samples and the number of negative samples in the total sample set.
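  • For illustration only (not part of the original disclosure), the following Python sketch computes the plaintext WoE of each bin from per-bin positive and negative counts according to the formula above; the counts are hypothetical.

```python
import math

def woe_per_bin(bin_pos, bin_neg):
    """Plaintext weight of evidence per bin: WoE_ij = ln(y_ij / y) - ln(n_ij / n),
    where y and n are the total positive/negative counts over the whole sample set."""
    total_pos = sum(bin_pos)   # y: total number of positive samples
    total_neg = sum(bin_neg)   # n: total number of negative samples
    woe = []
    for y_ij, n_ij in zip(bin_pos, bin_neg):
        # ln(positive share) - ln(negative share); assumes non-zero counts
        woe.append(math.log(y_ij / total_pos) - math.log(n_ij / total_neg))
    return woe

# Toy counts for three bins of one feature (illustrative numbers only).
print(woe_per_bin(bin_pos=[30, 50, 20], bin_neg=[60, 30, 10]))
```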
  • FIG. 2 shows a multi-party interaction diagram for feature processing based on differential privacy according to an embodiment. At least two parties are involved. For conciseness, the participant that holds the sample labels is called the second party, any other participant that does not store the sample labels but holds sample feature data is called the first party, and the part of the sample features held by the first party is referred to as the first feature part. It should be understood that FIG. 2 only shows the interaction between the second party and one first party, and that both the first party and the second party may be implemented as any device, platform, or device cluster with computing and processing capabilities.
  • the interaction process includes the following steps:
  • First, in step S201, the second party encrypts the multiple binary classification labels corresponding to the multiple samples to obtain multiple encrypted labels.
  • the business objects targeted by the multiple samples may be users, commodities, or business events, and the binary classification labels (or class labels of binary classification) may include risk category labels or exception level labels.
  • the business object is an individual user, and correspondingly, the binary classification labels corresponding to the individual user sample may include high-consumption groups and low-consumption groups, or may be low-risk users or high-risk users.
  • In another example, the business object is an enterprise user, and correspondingly, the binary classification labels corresponding to the enterprise user sample may include trustworthy enterprises and untrustworthy enterprises.
  • the business object is a commodity
  • the binary classification labels corresponding to the commodity sample may include popular commodities and unpopular commodities.
  • the business object is a business event, such as a registration event, an access event, a login event, or a payment event.
  • the binary classification labels corresponding to the event samples may include abnormal events and normal events.
  • Specifically, this step may include: encrypting the above-mentioned plurality of binary classification labels respectively based on a homomorphic encryption algorithm to obtain the plurality of encrypted labels.
  • The homomorphic encryption algorithm satisfies additive homomorphism. In one example, it satisfies the condition that the decryption result of the product of ciphertexts equals the sum of the corresponding plaintexts.
  • This condition can be expressed as: Dec(Enc(m1) * Enc(m2)) = m1 + m2, where Enc(.) and Dec(.) denote encryption and decryption, and m1 and m2 denote plaintexts.
  • In another example, the condition satisfied by the preset encryption algorithm may also be that, after the product of the ciphertexts is reduced modulo a preset value n and then decrypted, the result equals the sum of the corresponding plaintexts.
  • This condition can be expressed as: Dec((Enc(m1) * Enc(m2)) mod n) = m1 + m2.
  • In this way, the second party can obtain the multiple encrypted labels. It should be noted that although the binary classification label takes only two values, a non-deterministic encryption algorithm maps the same label value to different random numbers in different encryptions. Using such random numbers as the encrypted labels therefore ensures that other parties cannot recover the real labels from the encrypted labels.
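  • As an illustrative aid, the sketch below shows label encryption with an additively homomorphic scheme. The specification does not name a concrete algorithm; the Paillier implementation of the `phe` package is used here only as an example that satisfies the stated condition (the package exposes the ciphertext combination as `+`, which internally multiplies ciphertexts).

```python
# A sketch of label encryption with an additively homomorphic scheme (assumed library: phe).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

labels = [1, 0, 1, 1, 0]                                      # binary labels held by the second party
encrypted_labels = [public_key.encrypt(y) for y in labels]    # randomized ciphertexts

# Identical label values yield different ciphertexts (non-deterministic encryption).
assert encrypted_labels[0].ciphertext() != encrypted_labels[2].ciphertext()

# Additive homomorphism: combining ciphertexts decrypts to the sum of the plaintext labels,
# i.e. the number of positive samples.
enc_sum = sum(encrypted_labels[1:], encrypted_labels[0])
assert private_key.decrypt(enc_sum) == sum(labels)
```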
  • In step S202, the first party receives the above-mentioned plurality of encrypted labels from the second party.
  • In addition, the first party may perform step S203 before, at the same time as, or after step S202: for any feature in the first feature part it holds, the first party performs binning processing on the above-mentioned multiple samples to obtain multiple first bins.
  • When the business objects targeted by the multiple samples are individual users, the first feature part may include at least one of the following individual-user features: age, gender, occupation, place of residence, income, transaction frequency, transaction amount, transaction details, and the like.
  • When the business objects targeted by the multiple samples are enterprise users, the first feature part may include at least one of the following enterprise-user features: time of establishment, business scope, recruitment information, and the like.
  • When the business object targeted by the multiple samples is a commodity, the first feature part may include one or more of the following commodity features: cost, name, place of origin, category, sales volume, inventory, gross profit, and the like.
  • When the business object targeted by the plurality of samples is a business event, the first feature part may include one or more of the following event features: event occurrence time, network environment (such as IP address), geographical location, duration, and the like.
  • Binning processing, in simple terms, discretizes continuous variables and merges multi-valued discrete variables into a small number of states.
  • binning methods including equal-frequency binning, equidistant binning, clustering binning, Best-KS binning, and chi-square binning.
  • this step may include: for any first feature in the first feature part, first determine a plurality of equidistant intervals according to the value space of the first feature, corresponding to a plurality of binning categories; then, for the above For any one of the multiple samples, determine the equidistant interval where the eigenvalue corresponding to the first feature is located, so as to classify the sample into the bins of the corresponding category.
  • For example, assume the first feature is annual income, and the feature values of the annual income of the multiple samples include 12, 20, 32, 45, 55, and 60 (unit: 10,000). By dividing these values into equidistant bins, the binning results shown in Table 1 below can be obtained.
  • The binning result includes the sample IDs corresponding to each of the multiple bins.
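  • The following sketch (illustrative only) reproduces the equidistant binning of the annual-income example with pandas; the choice of 3 bins is an assumption, since the number of bins is not fixed here and Table 1 is not reproduced in this text.

```python
import pandas as pd

# Annual incomes (unit: 10,000) from the example above; 3 equal-width bins is an
# illustrative choice and not mandated by the specification.
samples = pd.DataFrame({
    "sample_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "annual_income": [12, 20, 32, 45, 55, 60],
})
samples["bin"] = pd.cut(samples["annual_income"], bins=3)

# Binning result: each bin maps to the IDs of the samples that fall into it.
print(samples.groupby("bin", observed=True)["sample_id"].apply(list))
```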
  • As described above, the first party can obtain multiple first bins under any first feature through the binning processing of step S203, and receives the above-mentioned multiple encrypted labels from the second party in step S202. Based on this, the first party can execute step S204: based on the plurality of encrypted labels and the first differential-privacy noise, determine the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin among the plurality of first bins.
  • the above-mentioned first differential privacy noise is noise sampled by the first party based on a Differential Privacy (DP for short) mechanism.
  • random noise is added to the original data or the original data calculation results, so that the noise-added data is usable while effectively preventing its publication from causing privacy leakage of the original data.
  • the above-mentioned first differentially private noise can be Gaussian noise, Laplacian noise or exponential noise, etc.
  • the above-mentioned first differential privacy noise is Gaussian noise as an example to illustrate the noise determination process.
  • the Gaussian noise is sampled from the Gaussian noise distribution of differential privacy.
  • the key parameters of the Gaussian noise distribution include mean and variance.
  • the noise power determined based on the differential privacy budget parameters is used as the variance of the Gaussian distribution, with 0 as the mean value.
  • the first party may determine the noise power based on the privacy budget parameters it sets for the above multiple samples, and the number of bins corresponding to each feature in the first feature part it holds.
  • the first party determines the sum of binning quantities corresponding to each feature in the first feature part.
  • For ease of description, the number of bins corresponding to the i-th feature in the first feature part is denoted as K_i, so that the sum of the numbers of bins can be expressed as Σ_i K_i, where the summation runs over the features of the first feature part.
  • In addition to determining the above sum, the first party also calculates the variable value of a mean variable, which is determined based on the parameter values of the above privacy budget parameters and the constraint relationship between the privacy budget parameters and the mean variable.
  • This constraint relationship exists under the Gaussian mechanism of differential privacy and, for a mean variable χ, can be expressed as the following formula: Φ(χ/2 - ε/χ) - e^ε · Φ(-χ/2 - ε/χ) ≤ δ
  • where ε and δ represent the budget item parameter and the slack item parameter of the above-mentioned privacy budget parameters, respectively, whose parameter values can be set by the staff according to actual needs; χ represents the above-mentioned mean variable; and Φ(t) represents the probability distribution function (cumulative distribution function) of the standard Gaussian distribution.
  • Then, the above-mentioned noise power may be calculated based on the determined sum of the numbers of bins and the variable value of the mean variable. Specifically, the noise power can be calculated as the product of the following factors: the above sum of the numbers of bins, and the reciprocal of the squared variable value of the mean variable.
  • That is, the noise power can be calculated by the following formula: σ_A² = (Σ_i K_i) / χ_A²
  • where the subscript A indicates that the variable corresponds to the first party; σ_A² and χ_A denote the noise power and the variable value of the mean variable, respectively; and K_i indicates the number of bins corresponding to the i-th feature.
  • In this way, the first party can determine the noise power, use the noise power as the variance of a Gaussian distribution with mean 0 to generate the Gaussian noise distribution N(0, σ_A²), and then randomly sample Gaussian noise from it.
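  • The sketch below is an illustration under stated assumptions, not the patented implementation: it solves the analytic-Gaussian-mechanism constraint Φ(χ/2 - ε/χ) - e^ε·Φ(-χ/2 - ε/χ) = δ for the mean variable χ with SciPy and then sets σ² = (Σ_i K_i)/χ² as described above. The privacy budget values and bin counts are illustrative.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def noise_power(epsilon, delta, bins_per_feature):
    """Noise power sigma^2 = (sum of bin counts) / chi^2, where chi is assumed to satisfy
    the analytic Gaussian constraint Phi(chi/2 - eps/chi) - exp(eps)*Phi(-chi/2 - eps/chi) = delta."""
    def constraint(chi):
        return (norm.cdf(chi / 2 - epsilon / chi)
                - np.exp(epsilon) * norm.cdf(-chi / 2 - epsilon / chi)
                - delta)
    chi = brentq(constraint, 1e-6, 100.0)        # solve for the mean variable
    return sum(bins_per_feature) / chi ** 2

# Illustrative privacy budget (epsilon: budget item, delta: slack item) and bin counts K_i.
bins_per_feature = [3, 5, 4]
sigma2 = noise_power(epsilon=1.0, delta=1e-5, bins_per_feature=bins_per_feature)

rng = np.random.default_rng(0)
gaussian_noise = rng.normal(0.0, np.sqrt(sigma2), size=sum(bins_per_feature))  # one noise per bin
print(sigma2, gaussian_noise[:3])
```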
  • Gaussian noise is mainly taken as an example to illustrate the determination of the noise distribution of differential privacy.
  • random noise sampling may be performed separately for different objects to be noised.
  • Specifically, for the above-mentioned multiple first bins, the first party may correspondingly sample multiple noises from the differential-privacy noise distribution; for example, the above Gaussian noise distribution may be randomly sampled multiple times to obtain multiple Gaussian noise values.
  • From the above, the first differential-privacy noise can be obtained by sampling, and it is then combined with the above multiple encrypted labels to determine the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin.
  • Specifically, the first positive-sample encrypted noise-added count may be determined first.
  • In one embodiment, for any first bin, the multiplication result of the encrypted labels corresponding to its samples is determined, and this multiplication result is then multiplied by the encrypted noise obtained by encrypting the first differential-privacy noise, thereby obtaining the first positive-sample encrypted noise-added count.
  • Exemplarily, this calculation can be expressed as: Enc(ỹ_{i,j}) = Enc(z_{i,j}) * Π_{s∈S_{i,j}} Enc(y_s)    (7)
  • where the subscript 'i,j' represents the j-th bin under the i-th feature, corresponding to any first bin; Enc(ỹ_{i,j}) indicates the first positive-sample encrypted noise-added count corresponding to the first bin; z_{i,j} indicates the differential-privacy noise corresponding to the first bin, and Enc(z_{i,j}) indicates the corresponding encrypted noise; S_{i,j} indicates the sample set corresponding to the first bin; y_s represents the label of sample s in the set S_{i,j}, Enc(y_s) indicates its encrypted label, and Π_{s∈S_{i,j}} Enc(y_s) indicates the multiplication result of the encrypted labels.
  • In another embodiment, a modulo operation may also be performed on the result of the above product, so as to obtain the first positive-sample encrypted noise-added count.
  • Exemplarily, this calculation can be expressed as: Enc(ỹ_{i,j}) = (Enc(z_{i,j}) * Π_{s∈S_{i,j}} Enc(y_s)) mod n    (8)
  • where n represents a preset value.
  • In this way, the first positive-sample encrypted noise-added count corresponding to the first bin can be determined. Further, the first negative-sample encrypted noise-added count corresponding to the first bin can be determined. Specifically, the encrypted total obtained by encrypting the total number of samples in the first bin with the homomorphic encryption algorithm is used, and the first positive-sample encrypted noise-added count is subtracted from it on the ciphertexts, so as to obtain the first negative-sample encrypted noise-added count. Exemplarily, this calculation can be expressed as: Enc(ñ_{i,j}) = Enc(N_{i,j}) - Enc(ỹ_{i,j})    (9)
  • where Enc(·) represents the homomorphic encryption algorithm, which satisfies additive homomorphism; Enc(ñ_{i,j}) indicates the first negative-sample encrypted noise-added count corresponding to a certain first bin; N_{i,j} represents the total number of samples in that first bin, and Enc(N_{i,j}) represents the encrypted total obtained by encrypting this total; and Enc(ỹ_{i,j}) indicates the first positive-sample encrypted noise-added count corresponding to the first bin.
  • It should be noted that, in the above description, the first positive-sample encrypted noise-added count Enc(ỹ_{i,j}) is determined first and the first negative-sample encrypted noise-added count Enc(ñ_{i,j}) is then derived from it. In fact, the calculation result of formula (7) or (8) can also be designed to correspond to the first negative-sample encrypted noise-added count; following the same idea as formula (9), the encrypted total Enc(N_{i,j}) is then used to subtract it, which yields the first positive-sample encrypted noise-added count.
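  • Continuing the `phe`-based illustration, the sketch below shows how the first party could aggregate encrypted labels per bin and add encrypted noise, so that only the key-holding second party can decrypt the noised counts. The bin assignments, the noise scale, and the integer rounding of the Gaussian noise are illustrative simplifications.

```python
# Per-bin homomorphic aggregation at the first party (assumed library: phe).
import numpy as np
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)  # private key stays with the second party

labels = [1, 0, 1, 1, 0, 1]
enc_labels = [public_key.encrypt(y) for y in labels]          # encrypted labels received from the second party
bins = {"bin_0": [0, 1, 2], "bin_1": [3, 4, 5]}               # sample indices of each first bin (illustrative)

rng = np.random.default_rng(0)
enc_noised = {}
for name, idx in bins.items():
    z = int(round(rng.normal(0.0, 2.0)))                      # first differential-privacy noise (illustrative scale)
    # The "product of encrypted labels times encrypted noise" on raw ciphertexts is exposed
    # by phe as addition of EncryptedNumber objects.
    enc_pos = sum((enc_labels[i] for i in idx), public_key.encrypt(z))
    enc_neg = public_key.encrypt(len(idx)) - enc_pos          # Enc(N_ij) minus the encrypted noised positive count
    enc_noised[name] = (enc_pos, enc_neg)

# Only the second party, which holds the private key, can recover the noised counts.
for name, (enc_pos, enc_neg) in enc_noised.items():
    print(name, private_key.decrypt(enc_pos), private_key.decrypt(enc_neg))
```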
  • the encryption algorithm used by the first party when encrypting the above-mentioned differential privacy noise and the total number of binned samples is the same as the encryption algorithm used by the second party when encrypting the sample labels.
  • In this way, the first party can determine the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin among the multiple first bins under any feature in the first feature part, and then, in step S205, sends them to the second party.
  • Accordingly, the second party performs step S206: it decrypts the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin to obtain the corresponding first positive-sample noise-added count ỹ_{i,j} and first negative-sample noise-added count ñ_{i,j}.
  • Exemplarily, the decryption can be expressed as: ỹ_{i,j} = Dec(Enc(ỹ_{i,j})), ñ_{i,j} = Dec(Enc(ñ_{i,j}))
  • where the first negative-sample encrypted noise-added count Enc(ñ_{i,j}) is calculated based on the above formula (9).
  • In addition, based on the homomorphism of the encryption algorithm, the first negative-sample encrypted noise-added count can likewise be decrypted to obtain the first negative-sample noise-added count.
  • In short, the decryption method is compatible with the encryption method, and the possibilities are not exhaustively listed here.
  • In this way, the second party can decrypt to obtain the first positive-sample noise-added count ỹ_{i,j} and the first negative-sample noise-added count ñ_{i,j}. Further, the second party executes step S207: based on the first positive-sample noise-added count and the first negative-sample noise-added count, it determines the first noise-added index corresponding to the first bin.
  • Specifically, the first positive-sample noise-added counts of the multiple first bins under a certain first feature are summed to obtain the first positive-sample noise-added total ỹ_i.
  • This summation can be expressed as: ỹ_i = Σ_j ỹ_{i,j}
  • Similarly, the first negative-sample noise-added counts of the multiple first bins are summed to obtain the first negative-sample noise-added total ñ_i, which can be expressed as: ñ_i = Σ_j ñ_{i,j}
  • On this basis, the first noise-added index of the first bin is determined.
  • In one embodiment, the above first noise-added index is the first noise-added weight of evidence. Its calculation may include: dividing the first positive-sample noise-added count ỹ_{i,j} by the first positive-sample noise-added total ỹ_i to obtain the first positive-sample proportion; dividing the first negative-sample noise-added count ñ_{i,j} by the first negative-sample noise-added total ñ_i to obtain the first negative-sample proportion; and then subtracting the logarithm of the first negative-sample proportion from the logarithm of the first positive-sample proportion to obtain the first noise-added weight of evidence.
  • Exemplarily, this calculation can be expressed as: WoE~_{i,j} = ln(ỹ_{i,j} / ỹ_i) - ln(ñ_{i,j} / ñ_i)
  • This first noise-added weight of evidence WoE~_{i,j} is equivalent to the result of adding differential-privacy noise to the corresponding original weight of evidence WoE_{i,j}.
  • In another embodiment, the above first noise-added index is the first noise-added information value. Its calculation may include: calculating the first positive-sample proportion and the first negative-sample proportion as above; then calculating the difference between the first positive-sample proportion and the first negative-sample proportion, as well as the difference between the logarithm of the first positive-sample proportion and the logarithm of the first negative-sample proportion; and finally calculating the product of the two differences as the first noise-added information value.
  • Exemplarily, this calculation can be expressed as: IV~_{i,j} = (ỹ_{i,j} / ỹ_i - ñ_{i,j} / ñ_i) * (ln(ỹ_{i,j} / ỹ_i) - ln(ñ_{i,j} / ñ_i))
  • In this way, the second party can determine the first noise-added information value IV~_{i,j} corresponding to any first bin. Understandably, this first noise-added information value is equivalent to the result of adding differential-privacy noise to the corresponding original information value IV_{i,j}.
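  • For illustration, the following sketch turns the decrypted noised per-bin counts of one feature into the noised WoE and noised IV according to the formulas above; the input counts are hypothetical.

```python
import math

def noised_woe_iv(noised_pos, noised_neg):
    """Noised WoE/IV per bin from the decrypted noised counts of one feature:
    WoE~_ij = ln(y~_ij / y~_i) - ln(n~_ij / n~_i)
    IV~_ij  = (y~_ij / y~_i - n~_ij / n~_i) * WoE~_ij
    where y~_i and n~_i are the per-feature totals of the noised counts."""
    total_pos = sum(noised_pos)
    total_neg = sum(noised_neg)
    woe, iv = [], []
    for y_ij, n_ij in zip(noised_pos, noised_neg):
        p, q = y_ij / total_pos, n_ij / total_neg
        w = math.log(p) - math.log(q)
        woe.append(w)
        iv.append((p - q) * w)
    return woe, iv

# Decrypted noised counts for three first bins (illustrative values).
print(noised_woe_iv([31, 48, 22], [58, 32, 9]))
```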
  • Thus, feature evaluation indices such as the weight of evidence or the IV value can be calculated for the feature data of the first party, which does not hold the sample labels.
  • After the second party performs the above step S207, it can also perform step S208 to send the first noise-added indices back to the first party, so that the first party can use the noise-added index corresponding to each first bin under each first feature of the first feature part to screen its features.
  • For example, if the noise-added indices corresponding to the bins of a certain feature are very close to one another, that feature can be judged to be a redundant feature and discarded. Alternatively, feature encoding can be performed: for the above multiple samples, the feature value of any sample for any first feature can be encoded as the noise-added index of the first bin of that feature to which the sample belongs. The encoded feature values can then be used as input to the machine learning model in federated learning, thereby effectively preventing leakage of the privacy of the training data caused by the release of the model parameters or the open use of the model.
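  • A minimal sketch of these two uses follows: discarding a feature whose per-bin noised WoE values are nearly identical, and WoE-encoding a feature column. The redundancy criterion and its threshold are illustrative assumptions; the specification leaves the concrete screening rule open.

```python
import numpy as np

def is_redundant(noised_woe_per_bin, tol=0.05):
    """Treat a feature as redundant when its per-bin noised WoE values are nearly equal
    (illustrative criterion and threshold only)."""
    return float(np.ptp(noised_woe_per_bin)) < tol

def woe_encode(bin_index_per_sample, noised_woe_per_bin):
    """Encode each sample's feature value as the noised WoE of the bin it falls into."""
    return [noised_woe_per_bin[j] for j in bin_index_per_sample]

woe = [-0.61, 0.45, 0.92]                  # noised WoE of three bins of one feature
print(is_redundant(woe))                   # False: the feature is kept
print(woe_encode([0, 2, 1, 1], woe))       # WoE-encoded feature column for four samples
```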
  • the second party may introduce a differential privacy mechanism to calculate feature evaluation indicators for the feature data held by itself.
  • the feature data held by the second party for the above-mentioned multiple samples is called the second feature part.
  • For a description of the second feature part, reference may be made to the above description of the first feature part; it should be noted that the two feature parts correspond to different features of the same sample IDs.
  • On this basis, the second party calculates the second noise-added index of each second bin under a second feature, instead of the original index.
  • The second noise-added index and the above-mentioned first noise-added index can then be used together for subsequent feature processing.
  • Fig. 3 shows a flow chart of a method for feature processing based on differential privacy according to an embodiment, and the method is executed by a second party. As shown in Figure 3, the method may include the following steps:
  • Step S310: for any feature in the second feature part, perform binning processing on the multiple samples to obtain multiple second bins. Step S320: based on the binary classification labels, determine the true positive-sample count and the true negative-sample count in each second bin. Step S330: add the second differential-privacy noise to the true positive-sample count and the true negative-sample count, respectively, to obtain the corresponding second positive-sample noise-added count and second negative-sample noise-added count. Step S340: based on the second positive-sample noise-added count and the second negative-sample noise-added count, determine the second noise-added index corresponding to the second bin.
  • First, in step S310, for any second feature in the second feature part, binning processing is performed on the plurality of samples to obtain a plurality of second bins.
  • Each second bin may include the sample IDs of the corresponding samples.
  • Next, in step S320, the true positive-sample count and the true negative-sample count in each second bin are determined based on the binary classification labels. Specifically, for any second bin, the number of positive samples and the number of negative samples in that bin can be counted according to the binary classification label of each sample; the counted numbers here are the true numbers.
  • Table 2 below shows statistics of the sample distribution, including the numbers of samples corresponding to the different label values, namely the low-consumption group and the high-consumption group, under each second bin.
  • Then, in step S330, the second differential-privacy noise is added to the true positive-sample count and the true negative-sample count, respectively, to obtain the corresponding second positive-sample noise-added count and second negative-sample noise-added count.
  • The second differential-privacy noise is noise sampled by the second party based on the DP mechanism; the DP mechanism used by the second party is usually the same as the DP mechanism used by the first party to determine the above first differential-privacy noise, but it may also be different.
  • In one embodiment, the second differential-privacy noise is Gaussian noise and is sampled from a Gaussian noise distribution.
  • Specifically, the second party can determine the noise power based on the privacy budget parameters it sets for the multiple samples and the number of bins corresponding to each feature in the second feature part it holds, then use the noise power as the variance of a Gaussian distribution with mean 0 to determine the Gaussian noise distribution, and sample Gaussian noise from it.
  • For a further description of how the second party determines the Gaussian noise distribution, see the above description of the first party's determination of the Gaussian noise distribution; the relevant details are not repeated here.
  • random noise sampling may be performed separately for different objects to be noised.
  • multiple noises may be correspondingly sampled from the differentially private noise distribution.
  • multiple groups of noises may be correspondingly sampled from the differentially private noise distribution for the above-mentioned multiple second bins, and two noises in each group of noises correspond to positive samples and negative samples in the bins respectively.
  • the second differential privacy noise can be obtained by sampling, so as to add noise to the real number of positive and negative samples.
  • In one embodiment, the second differential-privacy noise corresponding to a certain second bin can be added to both the true positive-sample count and the true negative-sample count of that second bin; that is, the same noise is added to the positive-sample count and the negative-sample count of the same bin, thereby obtaining the corresponding second positive-sample noise-added count and second negative-sample noise-added count.
  • Exemplarily, this noise-adding process can be expressed as: ỹ_{i,j} = y_{i,j} + z_{i,j},  ñ_{i,j} = n_{i,j} + z_{i,j}, with y_{i,j} = Σ_{s∈S_{i,j}} y_s
  • where the subscript 'i,j' represents the j-th bin under the i-th feature, corresponding to any second bin; S_{i,j} indicates the sample set corresponding to the second bin and y_s the label of sample s in it; z_{i,j} represents the differential-privacy noise corresponding to the second bin; y_{i,j} and n_{i,j} respectively represent the true positive-sample count and the true negative-sample count of the second bin; and ỹ_{i,j} and ñ_{i,j} respectively represent the second positive-sample noise-added count and the second negative-sample noise-added count of that second bin.
  • In another embodiment, in which a group of noises is sampled for each second bin, one noise of the corresponding group can be added to the true positive-sample count and another noise of the group to the true negative-sample count, so as to obtain the corresponding second positive-sample noise-added count and second negative-sample noise-added count.
  • Exemplarily, this noise-adding process can be expressed as: ỹ_{i,j} = y_{i,j} + z_{i,j}^{(1)},  ñ_{i,j} = n_{i,j} + z_{i,j}^{(2)}, where z_{i,j}^{(1)} and z_{i,j}^{(2)} denote the two noises of the group corresponding to the second bin.
  • After that, step S340 can be executed: based on the second positive-sample noise-added count ỹ_{i,j} and the second negative-sample noise-added count ñ_{i,j}, the second noise-added index corresponding to the second bin is determined.
  • For a description of this step, reference may be made to the above description of determining the first noise-added index of the first bin in step S207. Only the formulas for calculating the second noise-added weight of evidence are listed below for illustration; for the rest, see the relevant description of step S207.
  • ỹ_i = Σ_j ỹ_{i,j}    (20)
  • ñ_i = Σ_j ñ_{i,j}    (21)
  • WoE~_{i,j} = ln(ỹ_{i,j} / ỹ_i) - ln(ñ_{i,j} / ñ_i)    (22)
  • In formulas (20), (21), and (22), j ranges over the multiple second bins under the i-th second feature; ỹ_i and ñ_i denote the second positive-sample noise-added total and the second negative-sample noise-added total, respectively; and WoE~_{i,j} indicates the second noise-added weight of evidence corresponding to the j-th second bin under the i-th second feature.
  • In this way, by introducing the differential privacy mechanism, the second party holding the labels can determine the second noise-added weight of evidence corresponding to any second bin.
  • Combined with the first noise-added weights of evidence determined above, further feature processing, such as feature screening, evaluation, or encoding, can be performed on the first feature part and/or the second feature part.
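  • The sketch below illustrates the local branch of Fig. 3 for one second feature: counting true positives and negatives per second bin and adding Gaussian noise, here with a separate noise per count (the second variant described above). The noise scale and data are illustrative, and the resulting noised counts would then feed the noised WoE/IV computation sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                           # square root of the noise power (illustrative)

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])           # binary labels held by the second party
bin_of_sample = np.array([0, 0, 1, 1, 1, 2, 2, 2])    # second bin of each sample for one second feature

noised_pos, noised_neg = [], []
for j in range(3):
    mask = bin_of_sample == j
    y_true = int(labels[mask].sum())                  # true positive-sample count of bin j
    n_true = int(mask.sum()) - y_true                 # true negative-sample count of bin j
    z_pos, z_neg = rng.normal(0.0, sigma, size=2)     # a pair of noises per bin (second variant);
                                                      # the first variant would reuse one z for both counts
    noised_pos.append(y_true + z_pos)
    noised_neg.append(n_true + z_neg)

print(noised_pos, noised_neg)                         # inputs for the noised WoE/IV computation above
```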
  • Fig. 4 shows a schematic structural diagram of a device for feature processing based on differential privacy according to an embodiment, the participants of the federated learning include a first party and a second party, wherein the first party stores the first feature parts of multiple samples, The second party stores the binary classification labels of the plurality of samples; the device is integrated in the second party.
  • the device 400 includes:
  • The label encryption unit 410 is configured to encrypt the plurality of binary classification labels corresponding to the plurality of samples respectively to obtain a plurality of encrypted labels; the encrypted-label sending unit 420 is configured to send the plurality of encrypted labels to the first party; the encrypted-count processing unit 430 is configured to receive from the first party the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin among a plurality of first bins, and to decrypt them to obtain the corresponding first positive-sample noise-added count and first negative-sample noise-added count, wherein the encrypted noise-added counts are determined based on the plurality of encrypted labels and the first differential-privacy noise, and the plurality of first bins are obtained by binning the plurality of samples with respect to any feature in the first feature part; and the first index calculation unit 440 is configured to determine, based on the first positive-sample noise-added count and the first negative-sample noise-added count, a first noise-added index corresponding to the first bin.
  • the business objects targeted by the multiple samples are any of the following: users, commodities, and business events.
  • the tag encryption unit 410 is specifically configured to: respectively encrypt the plurality of binary classification tags based on a preset encryption algorithm to obtain the plurality of encrypted tags; wherein, the preset encryption algorithm The following conditions are met: the decryption result of multiplication of ciphertexts is equal to the addition of corresponding plaintexts.
  • the first index calculation unit 440 includes: a total number determination subunit configured to sum the multiple first positive sample noise addition quantities corresponding to the multiple first bins to obtain the first A total number of noises added to positive samples; and, summing the numbers of noises added to the first negative samples corresponding to the plurality of first bins to obtain the total number of noises added to the first negative samples; the index determination subunit is configured as The first noise addition index is determined based on the first total number of positive samples with noise, the first total number of negative samples with noise, the first number of positive samples with noise, and the first number of negative samples with noise.
  • the first noise addition index is the first noise addition evidence weight
  • the index determination subunit is specifically configured to: divide the noise addition amount of the first positive samples by the total number of noise additions to the first positive samples , to obtain the proportion of the first positive sample; divide the number of noise added to the first negative sample by the total number of noise added to the first negative sample to obtain the proportion of the first negative sample; take the proportion of the first positive sample Subtracting the logarithmic result of the proportion of the first negative sample from the logarithmic result to obtain the first noise-added evidence weight.
  • the second party also stores the second characteristic parts of the plurality of samples;
  • The apparatus 400 further includes: a binning processing unit 450 configured to, for any feature in the second feature part, perform binning processing on the plurality of samples to obtain a plurality of second bins;
  • a second index calculation unit 460 configured to determine, based on a differential privacy mechanism, the second noise-added index of each second bin among the plurality of second bins;
  • and a feature screening unit 470 configured to perform feature screening on the first feature part and/or the second feature part based on the first noise-added index and the second noise-added index.
  • the second indicator calculation unit 460 includes: a real number determination subunit configured to determine the real number of positive samples and the real number of negative samples in each second binning based on the two classification labels;
  • the subunit for determining the number of noise additions is configured to add second differential privacy noise to the real number of positive samples and the real number of negative samples, respectively, to obtain the second number of positive samples with noise and the second number of negative samples with noise
  • the denoising index determining subunit is configured to determine a second denoising index corresponding to the second binning based on the second positive sample denoising quantity and the second negative sample denoising quantity.
  • the second differential privacy noise is Gaussian noise
  • the apparatus 400 further includes: a noise determination unit 480 configured to: based on the privacy budget parameters set for the plurality of samples, and The number of bins corresponding to each feature in the second feature part determines the noise power; using the noise power as the variance of the Gaussian distribution and taking 0 as the mean value to generate a Gaussian noise distribution; sampling the Gaussian noise distribution from the Gaussian noise distribution Gaussian noise mentioned above.
  • The noise determination unit 480 being configured to determine the noise power specifically includes: determining the sum of the numbers of bins corresponding to the respective features; obtaining a variable value of the mean variable, the variable value being determined based on the parameter values of the privacy budget parameters and the constraint relationship between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy; and calculating the noise power as the product of the sum of the numbers of bins and the reciprocal of the squared variable value.
  • the privacy budget parameters include budget item parameters and slack item parameters.
  • In one embodiment, the device further includes: a noise sampling unit configured to correspondingly sample a plurality of groups of noises from the differential-privacy noise distribution for the plurality of second bins.
  • The second index calculation unit 460 being configured to add the differential-privacy noise respectively specifically includes: adding one noise of the corresponding group of noises to the true positive-sample count, and adding another noise of that group of noises to the true negative-sample count.
  • the noise addition index determination subunit in the second index calculation unit 460 is specifically configured to: add noise amounts to the plurality of second positive samples corresponding to the plurality of second binning Perform a summation process to obtain the second total number of positive samples with noise; perform a summation process on the multiple second negative samples corresponding to the second sub-bins to obtain the second total number of negative samples with noise; based on the The second noise-added total number of positive samples, the second total number of noise-added negative samples, the second noise-added amount of positive samples, and the second noise-added amount of negative samples are used to determine the second noise-added index.
  • the second noise addition index is the second noise addition evidence weight
  • the noise addition index determination subunit is configured to determine the second noise addition index, which specifically includes: adding noise to the second positive sample Dividing by the total number of noises added to the second positive samples to obtain a second proportion of positive samples; dividing the number of noises added to the second negative samples by the total number of noises added to the second negative samples to obtain a second proportion of negative samples; Subtracting the logarithmic result of the second positive sample proportion from the logarithmic result of the second negative sample proportion to obtain the second noise-added evidence weight.
  • Fig. 5 shows a schematic structural diagram of a device for feature processing based on differential privacy
  • the participants of the federated learning include a first party and a second party, wherein the first party stores the first feature parts of multiple samples , the second party stores the second characteristic parts and binary classification labels of the plurality of samples, and the device is integrated with the first party.
  • the device 500 includes:
  • The encrypted-label receiving unit 510 is configured to receive from the second party a plurality of encrypted labels, which are obtained by encrypting the plurality of binary classification labels corresponding to the plurality of samples; the binning processing unit 520 is configured to perform binning processing on the plurality of samples for any feature in the first feature part to obtain a plurality of first bins; the encryption and noise-adding unit 530 is configured to determine, based on the plurality of encrypted labels and the differential-privacy noise, the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count corresponding to each first bin; and the encrypted-count sending unit 540 is configured to send the first positive-sample encrypted noise-added count and the first negative-sample encrypted noise-added count to the second party, so that the second party decrypts them to obtain the first positive-sample noise-added count and the first negative-sample noise-added count, and determines the first noise-added index of the corresponding first bin.
  • the business objects targeted by the multiple samples are any of the following: users, commodities, and business events.
  • the encryption and noise adding unit 530 is specifically configured to: for each first bin, determine the multiplication result of the encrypted labels corresponding to the samples therein; multiply the multiplication result by the encrypted noise obtained by encrypting the differential privacy noise, to obtain the first positive-sample encrypted noise-added quantity; and subtract the first positive-sample encrypted noise-added quantity from the encrypted total obtained by encrypting the total number of samples in the first bin, to obtain the first negative-sample encrypted noise-added quantity.
  • the apparatus 500 further includes: a noise sampling unit 550 configured to sample, for the plurality of first bins, a corresponding plurality of noises from a differential privacy noise distribution; the encryption and noise adding unit 530 is configured to perform the product processing, which specifically includes: encrypting, among the plurality of noises, the noise corresponding to the multiplication result to obtain the encrypted noise; and multiplying the multiplication result by the encrypted noise.
  • the differential privacy noise is Gaussian noise
  • the apparatus 500 further includes: a noise determination unit 550 configured to: determine a noise power based on the privacy budget parameters set for the plurality of samples and the number of bins corresponding to each feature in the first feature part; generate a Gaussian noise distribution using the noise power as the variance of the Gaussian distribution and 0 as the mean; and sample the Gaussian noise from the Gaussian noise distribution.
  • the noise determination unit 550 is configured to determine the noise power, which specifically includes: determining the sum of the numbers of bins corresponding to the respective features; obtaining a variable value of a mean variable, the variable value being determined based on the parameter values of the privacy budget parameters and the constraint relation between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy; and calculating the noise power as the product of the following factors: the sum of the numbers of bins, and the reciprocal of the square of the variable value.
  • the privacy budget parameters include a budget term parameter and a relaxation term parameter.
  • a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with FIG. 2 or FIG. 3.
  • a computing device including a memory and a processor, where executable code is stored in the memory; when the processor executes the executable code, the method described in conjunction with FIG. 2 or FIG. 3 is implemented.
  • the functions described in the present invention may be implemented by hardware, software, firmware or any combination thereof.
  • the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.


Abstract

Embodiments of this specification provide a method and apparatus for feature processing based on differential privacy. The method involves a first party and a second party, where the first party stores first feature parts of a plurality of samples and the second party stores binary classification labels of the plurality of samples. The method includes: the second party separately encrypts the plurality of binary classification labels corresponding to the plurality of samples to obtain a plurality of encrypted labels; based on the plurality of encrypted labels and differential privacy noise, the first party determines, for each of a plurality of bins, a positive-sample encrypted noise-added quantity and a negative-sample encrypted noise-added quantity, where the plurality of bins are obtained by binning the plurality of samples with respect to any feature in the first feature part; the second party decrypts the positive-sample encrypted noise-added quantity and the negative-sample encrypted noise-added quantity to obtain a positive-sample noise-added quantity and a negative-sample noise-added quantity, and thereby determines a noise-added index of the corresponding bin.

Description

Method and apparatus for feature processing based on differential privacy
This application claims priority to Chinese Patent Application No. 202111133642.5, entitled "Method and apparatus for feature processing based on differential privacy", filed with the China National Intellectual Property Administration on September 27, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
One or more embodiments of this specification relate to the technical field of data processing, and in particular to a method and apparatus for feature processing based on differential privacy.
Background
In most industries, due to problems such as industry competition and privacy security, data often exists in the form of isolated islands; even centralized integration of data between different departments of the same company faces great resistance.
The emergence of federated learning has made it possible to break down data silos. Federated learning, also known as federated machine learning or alliance learning, is a machine learning framework intended to effectively help multiple parties use data and build machine learning models while meeting the requirements of data privacy protection and legal compliance. Depending on how the data is distributed among the parties, federated learning can be divided into horizontal federated learning, vertical federated learning, and so on. Vertical federated learning, also called sample-aligned federated learning, is illustrated in Figure 1: multiple parties each hold different sample features for the same sample IDs, and one party (party B in Figure 1) holds the sample labels.
In scenarios such as vertical federated learning, when a data party performs feature processing such as screening on its sample feature data, it may need the sample labels held by another data party. A scheme is therefore needed that uses one party's sample label information to process another party's feature data without leaking the data privacy of either party.
发明内容
本说明书一个或多个实施例描述了一种基于差分隐私进行特征处理的方法及装置,通过引入差分隐私机制和数据加密算法等,使得各个数据持有方可以在保证己方数据安全的情况下,联合完成特征变换处理。
根据第一方面,提供一种基于差分隐私进行特征处理的方法,所述方法涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的二分类标签;所述方法由所述第二方执行,包括:对所述多个样本对应的多个二分类标签分别进行加密,得到多个加密标签;将所述多个加密标签发送至所述第一方;从所述第一方接收多个第一分箱中每个第一分箱对应的第一正样本加密加噪数量以及第一负样本加密加噪数 量,并对其进行解密,得到对应的第一正样本加噪数量和第一负样本加噪数量;其中,所述第一正样本加密加噪数量和第一负样本加密加噪数量基于所述多个加密标签以及第一差分隐私噪声而确定;所述多个第一分箱是针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理而得到;基于所述第一正样本加噪数量和第一负样本加噪数量,确定相对应的第一分箱的第一加噪指标。
在一个实施例中,所述多个样本针对的业务对象为以下中的任一种:用户、商品、业务事件。
在一个实施例中,对所述多个样本对应的多个二分类标签分别进行加密,得到多个加密标签,包括:基于同态加密算法,对所述多个二分类标签分别进行加密,得到所述多个加密标签。
在一个实施例中,基于所述第一正样本加噪数量和第一负样本加噪数量,确定相对应的第一分箱的第一加噪指标,包括:对所述多个第一分箱对应的多个第一正样本加噪数量进行求和处理,得到第一正样本加噪总数;对所述多个第一分箱对应的多个第一负样本加噪数量进行求和处理,得到第一负样本加噪总数;基于所述第一正样本加噪总数、第一负样本加噪总数、第一正样本加噪数量、第一负样本加噪数量,确定所述第一加噪指标。
在一个具体的实施例中,所述第一加噪指标为第一加噪证据权重,基于此,上述确定所述第一加噪指标,包括:将所述第一正样本加噪数量除以所述第一正样本加噪总数,得到第一正样本占比;将所述第一负样本加噪数量除以所述第一负样本加噪总数,得到第一负样本占比;将所述第一正样本占比的取对数结果减去所述第一负样本占比的取对数结果,得到所述第一加噪证据权重。
在一个实施例中,所述第二方还存储所述多个样本的第二特征部分;所述方法还包括:针对所述第二特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第二分箱;基于差分隐私机制,确定多个第二分箱中每个第二分箱的第二加噪指标;其中,在确定相对应的第一分箱的第一加噪指标之后,所述方法还包括:基于所述第一加噪指标和第二加噪指标,对所述第一特征部分和/或第二特征部分进行特征筛选处理。
在一个具体的实施例中,基于差分隐私机制,确定多个第二分箱中每个第二分箱的第二加噪指标,包括:基于所述二分类标签,确定每个第二分箱中正样本的真实数量和负样本的真实数量;在所述正样本的真实数量和负样本的真实数量上,分别添加第二差分隐私噪声,对应得到第二正样本加噪数量和第二负样本加噪数量;基于所述第二正样本加噪数量和第二负样本加噪数量,确定相对应的第二分箱的第二加噪指标。
一方面,在一个更具体的实施例中,所述第二差分隐私噪声为高斯噪声;在所述分别添加第二差分隐私噪声之前,所述方法还包括:基于针对所述多个样本设定的隐私预算参数,以及所述第二特征部分中各个特征所对应的分箱数量,确定噪声功率;以所述噪声功率作为高斯分布的方差,以0为均值,生成高斯噪声分布;从所述高斯噪声分布中采样所 述高斯噪声。
进一步,在一个例子中,其中确定噪声功率包括:确定所述各个特征所对应分箱数量的和值;获取均值变量的变量值,该变量值基于所述隐私预算参数的参数值,以及差分隐私的高斯机制下所述隐私预算参数和均值变量的约束关系而确定;基于以下因子的乘积计算得到所述噪声功率:所述分箱数量的和值,以及所述变量值进行平方运算后的倒数。
更进一步地,在一个具体的例子中,所述隐私预算参数包括预算项参数和松弛项参数。
另一方面,在一个更具体的实施例中,所述方法还包括:针对所述多个第二分箱,从差分隐私的噪声分布中对应采样多组噪声;其中,所述分别添加差分隐私噪声包括:在所述正样本的真实数量上,添加对应组别噪声中的一个噪声,并且,在所述负样本的真实数量上,添加该组噪声中的另一个噪声。
在又一个更具体的实施例中,基于所述第二正样本加噪数量和第二负样本加噪数量,确定相对应的第二分箱的第二加噪指标,包括:对所述多个第二分箱对应的多个第二正样本加噪数量进行求和处理,得到第二正样本加噪总数;对所述多个第二分箱对应的多个第二负样本加噪数量进行求和处理,得到第二负样本加噪总数;基于所述第二正样本加噪总数、第二负样本加噪总数、第二正样本加噪数量、第二负样本加噪数量,确定所述第二加噪指标。
进一步,在一个例子中,所述第二加噪指标为第二加噪证据权重,基于此,上述确定所述第二加噪指标,包括:将所述第二正样本加噪数量除以所述第二正样本加噪总数,得到第二正样本占比;将所述第二负样本加噪数量除以所述第二负样本加噪总数,得到第二负样本占比;将所述第二正样本占比的取对数结果减去所述第二负样本占比的取对数结果,得到所述第二加噪证据权重。
根据第二方面,提供一种基于差分隐私进行特征处理的方法,所述方法涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的第二特征部分和二分类标签;所述方法由所述第一方执行,包括:从所述第二方接收多个加密标签,其是对所述多个样本对应的多个二分类标签分别进行加密而得到;针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第一分箱;基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量;将所述第一正样本加密加噪数量和第一负样本加密加噪数量发送至所述第二方,以使得所述第二方对其解密得到第一正样本加噪数量和第一负样本加噪数量,并基于该解密的结果确定相对应的第一分箱的第一加噪指标。
在一个实施例中,所述多个样本针对的业务对象为以下中的任一种:用户、商品、业务事件。
在一个实施例中,基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量,包括:针对所述每个第一分箱, 确定其中各个样本所对应的加密标签之间的连乘结果;对所述连乘结果以及加密所述差分隐私噪声而得到的加密噪声进行乘积处理,得到所述第一正样本加密加噪数量;利用加密该第一分箱中样本的总数而得到的加密总数,减去所述第一正样本加密噪声数量,得到所述第一负样本加密加噪数量。
在一个具体的实施例中,在对所述连乘结果以及加密所述差分隐私噪声而得到的加密噪声进行乘积处理,得到所述第一正样本加密加噪数量之前,所述方法还包括:针对所述多个第一分箱,从差分隐私的噪声分布中对应采样多个噪声;其中,对所述连乘结果以及加密所述差分隐私噪声而得到的加密噪声进行乘积处理,包括:对所述多个噪声中对应所述连乘结果的噪声进行加密,得到所述加密噪声;对所述连乘结果和所述加密噪声进行乘积处理。
在一个实施例中,所述差分隐私噪声为高斯噪声;在基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量之前,所述方法还包括:基于针对所述多个样本设定的隐私预算参数,以及所述第一特征部分中各个特征所对应的分箱数量,确定噪声功率;以所述噪声功率作为高斯分布的方差,以0为均值,生成高斯噪声分布;从所述高斯噪声分布中采样所述高斯噪声。
在一个具体的实施例中,确定噪声功率包括:确定所述各个特征所对应的分箱数量的和值;获取均值变量的变量值,该变量值基于所述隐私预算参数的参数值,以及差分隐私的高斯机制下所述隐私预算参数和均值变量的约束关系而确定;基于以下因子的乘积计算得到所述噪声功率:所述分箱数量的和值,以及所述变量值进行平方运算后的倒数。
在一个例子中,所述隐私预算参数包括预算项参数和松弛项参数。
根据第三方面,提供一种基于差分隐私进行特征处理的装置,所述特征处理涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的二分类标签;所述装置集成于所述第二方,包括:标签加密单元,配置为对所述多个样本对应的多个二分类标签分别进行加密,得到多个加密标签;加密标签发送单元,配置为将所述多个加密标签发送至所述第一方;加密数量处理单元,配置为从所述第一方接收多个第一分箱中每个第一分箱对应的第一正样本加密加噪数量以及第一负样本加密加噪数量,并对其进行解密,得到对应的第一正样本加噪数量和第一负样本加噪数量;其中,所述第一正样本加密加噪数量和第一负样本加密加噪数量基于所述多个加密标签以及第一差分隐私噪声而确定;所述多个第一分箱是针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理而得到;第一指标计算单元,配置为基于所述第一正样本加噪数量和第一负样本加噪数量,确定相对应的第一分箱的第一加噪指标。
在一个实施例中,所述第二方还存储所述多个样本的第二特征部分;所述装置还包括:分箱处理单元,配置为针对所述第二特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第二分箱;第二指标计算单元,配置为基于差分隐私机制,确定多个第二分 箱中每个第二分箱的第二加噪指标;所述装置还包括:特征筛选单元,配置为基于所述第一加噪指标和第二加噪指标,对所述第一特征部分和/或第二特征部分进行特征筛选处理。
根据第四方面,提供一种基于差分隐私进行特征处理的装置,所述特征处理涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的第二特征部分和二分类标签;所述装置集成于所述第一方,包括:加密标签接收单元,配置为从所述第二方接收多个加密标签,其是对所述多个样本对应的多个二分类标签分别进行加密而得到;分箱处理单元,配置为针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第一分箱;加密加噪单元,配置为基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量;加密数量发送单元,配置为将所述第一正样本加密加噪数量和第一负样本加密加噪数量发送至所述第二方,以使得所述第二方对其解密得到第一正样本加噪数量和第一负样本加噪数量,并基于该解密的结果确定相对应的第一分箱的第一加噪指标。
根据第五方面,提供了一种计算机可读存储介质,其上存储有计算机程序,当该计算机程序在计算机中执行时,令计算机执行上述第一方面或第二方面提供的方法。
根据第六方面,提供了一种计算设备,包括存储器和处理器,存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现上述第一方面或第二方面提供的方法。
采用本说明书实施例提供的方法和装置,通过引入差分隐私机制和数据加密算法等,使得各个数据持有方可以在保证己方数据安全的情况下,联合完成特征变换处理。
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of a data distribution scenario in vertical federated learning according to an embodiment;
Figure 2 is a multi-party interaction diagram of feature processing based on differential privacy according to an embodiment;
Figure 3 is a flowchart of a method for feature processing based on differential privacy according to an embodiment;
Figure 4 is a schematic structural diagram of an apparatus for feature processing based on differential privacy according to an embodiment;
Figure 5 is a schematic structural diagram of an apparatus for feature processing based on differential privacy according to another embodiment.
Detailed Description
The solutions provided in this specification are described below with reference to the accompanying drawings.
As mentioned above, when a data party performs feature processing on sample feature data, it sometimes needs the sample labels. In a typical scenario, evaluation indexes of sample features, such as the weight of evidence (WoE) and the information value (IV), can be computed based on the sample labels, so as to carry out feature screening, feature encoding, or to provide related data query services. For example, to compute the WoE of sample feature i, the samples are usually first binned according to the distribution of the values of feature i, and the WoE of each bin is then computed separately. For simplicity and clarity, for the j-th bin under the i-th sample feature (or simply, a feature bin), the WoE value is computed by the following formula:

$$\mathrm{WoE}_{i,j}=\ln\frac{y_{i,j}/y}{n_{i,j}/n}=\ln\frac{y_{i,j}}{y}-\ln\frac{n_{i,j}}{n}\tag{1}$$

In the above formula, $\mathrm{WoE}_{i,j}$ denotes the weight of evidence of the feature bin, $y_{i,j}$ and $n_{i,j}$ denote the numbers of positive samples and negative samples in the feature bin, respectively, and $y$ and $n$ denote the numbers of positive samples and negative samples in the whole sample set, respectively.

It can be seen that, in the course of computing WoE, determining the values of the variables $y_{i,j}$, $n_{i,j}$, $y$ and $n$ all requires the sample labels indicating whether each sample is a positive sample or a negative sample. However, in scenarios such as vertical federated learning, some data parties hold only sample feature data and do not hold the sample labels.
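For intuition, the per-bin WoE computation of formula (1) can be sketched in plain Python as below. The function name and the toy data are illustrative only and are not part of the patent text.

```python
import math
from collections import defaultdict

def woe_per_bin(bin_ids, labels, eps=1e-9):
    """Compute WoE_{i,j} = ln(y_ij / y) - ln(n_ij / n) for each bin.

    bin_ids: bin index of each sample (for one feature)
    labels:  binary label of each sample (1 = positive, 0 = negative)
    eps:     small constant to avoid log(0) on empty cells
    """
    y = sum(labels)                 # total positives
    n = len(labels) - y             # total negatives
    pos, neg = defaultdict(int), defaultdict(int)
    for b, t in zip(bin_ids, labels):
        if t == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    return {
        b: math.log((pos[b] + eps) / y) - math.log((neg[b] + eps) / n)
        for b in set(bin_ids)
    }

# Toy example: 6 samples spread over 3 bins
print(woe_per_bin([0, 0, 1, 2, 2, 2], [1, 0, 0, 1, 1, 0]))
```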
Based on the above, the embodiments of this specification disclose a scheme that enables a data party without sample labels to compute feature evaluation indexes such as WoE for its own feature data with the help of the sample label information of the label-holding party, while ensuring that the data privacy of no party is leaked. Specifically, Figure 2 shows a multi-party interaction diagram of feature processing based on differential privacy according to an embodiment. It should be noted that there are at least two parties; for brevity, the participant holding the sample labels is referred to herein as the second party, any other participant that does not store the sample labels but holds sample feature data is referred to as the first party, and the partial features of the samples held by the first party are referred to as the first feature part. It should be understood that Figure 2 only shows the interaction between the second party and one first party, and that both the first party and the second party can be implemented as any apparatus, platform or device cluster with computing and processing capabilities.
As shown in Figure 2, the interaction process includes the following steps.
In step S201, the second party separately encrypts a plurality of binary classification labels corresponding to a plurality of samples to obtain a plurality of encrypted labels. It should be noted that the business objects to which the plurality of samples relate may be users, commodities or business events, and the binary classification labels (or class labels of a binary classification) may include risk category labels, anomaly level labels, and the like. In one example, the business objects are individual users; accordingly, the binary classification labels corresponding to the individual user samples may distinguish high-consumption users from low-consumption users, or low-risk users from high-risk users. In another example, the business objects are enterprise users; accordingly, the binary classification labels corresponding to the enterprise user samples may distinguish creditworthy enterprises from discredited enterprises. In yet another example, the business objects are commodities; accordingly, the binary classification labels corresponding to the commodity samples may distinguish hot-selling commodities from unpopular commodities. In still another example, the business objects are business events, such as registration events, access events, login events or payment events; accordingly, the binary classification labels corresponding to the event samples may distinguish abnormal events from normal events.
In one embodiment, this step may include: separately encrypting the plurality of binary classification labels based on a homomorphic encryption algorithm to obtain the plurality of encrypted labels. In a specific embodiment, the homomorphic encryption algorithm is additively homomorphic; further, in one example, it satisfies the condition that the decryption of the product of ciphertexts equals the sum of the corresponding plaintexts. Illustratively, this condition can be expressed as:

$$\mathrm{Dec}\Big[\prod_i \mathrm{Enc}(t_i)\Big]=\sum_i t_i\tag{2}$$

In the above formula, $t_i$ denotes the i-th plaintext, Enc(·) denotes the encryption operation, and Dec(·) denotes the decryption operation. It should be understood that the two label values involved in binary classification usually take the values 0 and 1. On this basis, in the scenario disclosed in the embodiments of this specification, the above condition can be refined as: for any number T of binary classification labels $t_1, t_2, \dots, t_T$ among the plurality of binary classification labels, the product of the corresponding encrypted labels $\mathrm{Enc}(t_1), \mathrm{Enc}(t_2), \dots, \mathrm{Enc}(t_T)$, multiplied by an encrypted value $\mathrm{Enc}(m_1)$ obtained by encrypting a certain first value $m_1$, decrypts to the sum of the first value $m_1$ and the number $m_2$ of labels whose value is 1 among the T binary classification labels. Illustratively, this refined condition can be expressed as:

$$\mathrm{Dec}\Big[\mathrm{Enc}(m_1)\cdot\prod_{k=1}^{T}\mathrm{Enc}(t_k)\Big]=m_1+m_2\tag{3}$$

It should be noted that formula (3) further involves a quantity g, where g is a value designed in the preset encryption algorithm.

In another specific embodiment, the condition satisfied by the preset encryption algorithm may also be that the result of taking the product of ciphertexts modulo a preset value n decrypts to the sum of the corresponding plaintexts. Illustratively, this condition can be expressed as:

$$\mathrm{Dec}\Big[\prod_i \mathrm{Enc}(t_i)\bmod n\Big]=\sum_i t_i\tag{4}$$

From the above, the second party can obtain a plurality of encrypted labels $\{\mathrm{Enc}(t_1), \mathrm{Enc}(t_2), \dots\}$ by encryption. It should be noted that although a binary classification label takes only two values, by adopting a non-deterministic encryption algorithm, encrypting the same label value multiple times yields different random numbers. Using the resulting random numbers as the corresponding encrypted labels therefore guarantees that no other party can recover the true labels from the encrypted labels.
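As a concrete illustration of the additively homomorphic, non-deterministic encryption described above, the sketch below uses the open-source python-paillier library (`phe`). Note that `phe` exposes the homomorphic operation as ciphertext addition, which plays the role of the ciphertext multiplication written in formulas (2) and (3); the variable names and the key size are illustrative assumptions.

```python
from phe import paillier

# Party B: generate a key pair and encrypt the binary labels.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
labels = [1, 0, 1, 1, 0]
encrypted_labels = [public_key.encrypt(t) for t in labels]  # non-deterministic ciphertexts

# Party A (holding only ciphertexts): homomorphically aggregate a bin's labels
# together with an encrypted value m1, without seeing any plaintext.
m1 = 2
aggregate = public_key.encrypt(m1)
for c in encrypted_labels:
    aggregate = aggregate + c          # corresponds to Enc(m1) * prod_k Enc(t_k)

# Party B: decryption yields m1 + (number of labels equal to 1), as in formula (3).
assert private_key.decrypt(aggregate) == m1 + sum(labels)
```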
Thereafter, in step S202, the first party receives the plurality of encrypted labels from the second party.

Before, at the same time as, or after step S202, the first party may perform step S203: for any feature in the first feature part it holds, binning the plurality of samples to obtain a plurality of first bins.

Regarding the first feature part: in one embodiment, the business objects of the plurality of samples are individual users, and accordingly the first feature part may include at least one of the following individual user features: age, gender, occupation, place of residence, income, transaction frequency, transaction amount, transaction details, and the like. In another embodiment, the business objects are enterprise users, and accordingly the first feature part may include at least one of the following enterprise user features: time of establishment, business scope, recruitment information, and the like. In yet another embodiment, the business objects are commodities, and accordingly the first feature part may include one or more of the following commodity features: cost, name, place of origin, category, sales volume, inventory, gross profit, and the like. In still another embodiment, the business objects are business events, and accordingly the first feature part may include one or more of the following event features: time of occurrence, network environment (such as IP address), geographic location, duration, and the like.

As for binning, in short, it discretizes continuous variables and merges multi-state discrete variables into fewer states. There are many binning methods, including equal-frequency binning, equal-width binning, clustering-based binning, Best-KS binning, chi-square binning, and so on.
For ease of understanding, equal-width binning is taken as an example. In one embodiment, this step may include: for an arbitrary first feature in the first feature part, first determining a plurality of equal-width intervals according to the value space of the first feature, corresponding to a plurality of bin categories; then, for any one of the plurality of samples, determining the equal-width interval in which its value of the first feature falls, thereby assigning the sample to the bin of the corresponding category. In one example, assume the first feature is annual income and the values of annual income of the plurality of samples include 12, 20, 32, 45, 55 and 60 (unit: ten thousand); applying equal-width binning to these values yields the binning result shown in Table 1 below.

Table 1 (binning result: each bin and the sample IDs of the samples assigned to it)

As shown in Table 1, the binning result includes, for each of the plurality of bins, the sample IDs corresponding to that bin.
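A minimal sketch of the equal-width binning step on the annual-income example is given below; the number of bins is an assumption made for illustration, since Table 1 itself is not reproduced here.

```python
import numpy as np

incomes = np.array([12, 20, 32, 45, 55, 60])   # annual income, unit: 10k
num_bins = 3                                    # illustrative choice

# Equal-width interval edges over the value space of the feature.
edges = np.linspace(incomes.min(), incomes.max(), num_bins + 1)

# Assign each sample to a bin index in [0, num_bins - 1].
bin_ids = np.clip(np.digitize(incomes, edges[1:-1], right=False), 0, num_bins - 1)

# Collect the sample IDs (here, positions 0..5) belonging to each bin.
bins = {b: np.where(bin_ids == b)[0].tolist() for b in range(num_bins)}
print(bins)   # {0: [0, 1], 1: [2], 2: [3, 4, 5]}
```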
From the above, the first party obtains, in step S203, a plurality of first bins under an arbitrary first feature through binning, and receives, in step S202, the plurality of encrypted labels from the second party. On this basis, the first party may perform step S204: based on the plurality of encrypted labels and first differential privacy noise, determining, for each of the plurality of first bins, a first positive-sample encrypted noise-added quantity and a first negative-sample encrypted noise-added quantity.
It should be noted that the first differential privacy noise is noise sampled by the first party based on a differential privacy (DP) mechanism. In implementations of DP, random noise is usually added to the original data or to results computed from the original data, so that the noise-added data remains useful while effectively preventing its publication from leaking the privacy of the original data.

There are various DP mechanisms, such as the Gaussian mechanism, the Laplace mechanism and the exponential mechanism; accordingly, the first differential privacy noise may be Gaussian noise, Laplace noise, exponential noise, or the like. For ease of understanding, the determination of the noise is illustrated below taking the case where the first differential privacy noise is Gaussian noise.
Gaussian noise is sampled from a Gaussian noise distribution for differential privacy, whose key parameters are the mean and the variance. In one embodiment, the noise power determined from the differential privacy budget parameters is used as the variance of the Gaussian distribution, and 0 is used as the mean, to generate the Gaussian noise distribution. Specifically, the first party may determine the noise power based on the privacy budget parameters it sets for the plurality of samples and the numbers of bins corresponding to the respective features in the first feature part it holds.

Further, in a specific embodiment, the first party determines the sum of the numbers of bins corresponding to the respective features in the first feature part. Illustratively, denoting the feature set corresponding to the first feature part by $\mathcal{F}_A$ and the number of bins corresponding to its i-th feature by $K_i$, the sum of the numbers of bins can be written as $\sum_{i=1}^{|\mathcal{F}_A|} K_i$.

In addition to determining this sum, the first party also solves for the variable value of a mean variable. This variable value is determined based on the parameter values of the privacy budget parameters and the constraint relation between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy. The constraint relation is known from the Gaussian mechanism of differential privacy and can be expressed as:

$$\delta=\Phi\Big(-\frac{\varepsilon}{\mu}+\frac{\mu}{2}\Big)-e^{\varepsilon}\,\Phi\Big(-\frac{\varepsilon}{\mu}-\frac{\mu}{2}\Big)\tag{5}$$

In the above formula, ε and δ denote the budget term parameter and the relaxation term parameter of the privacy budget parameters, respectively, whose values may be set manually according to actual needs; μ denotes the mean variable; Φ(t) denotes the probability distribution function of the standard Gaussian distribution, i.e., $\Phi(t)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t}e^{-x^2/2}\,\mathrm{d}x$.

Further, the noise power can be computed based on the sum of the numbers of bins and the variable value of the mean variable determined above. Specifically, the noise power can be computed as the product of the following factors: the sum of the numbers of bins, and the reciprocal of the square of the variable value. Illustratively, the noise power can be computed by the following formula:

$$\sigma_A^2=\frac{1}{\mu_A^2}\sum_{i=1}^{|\mathcal{F}_A|}K_i\tag{6}$$

In the above formula, the subscript A indicates that a variable corresponds to the first party; $\sigma_A^2$ and $\mu_A$ denote the noise power and the variable value of the mean variable, respectively; $\mathcal{F}_A$ denotes the feature set corresponding to the first feature part; $|\mathcal{F}_A|$ denotes the number of feature elements in the set; and $K_i$ denotes the number of bins corresponding to the i-th feature.

In this way, the first party can determine the noise power, generate a Gaussian noise distribution $\mathcal{N}(0,\sigma_A^2)$ with this noise power as the variance of the Gaussian distribution and 0 as the mean, and then randomly sample Gaussian noise from it.
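The following sketch shows one way the noise calibration just described could be carried out numerically: μ is solved from the (ε, δ) constraint reconstructed in formula (5) with a root finder, and the noise power then follows from formula (6). The use of scipy, the specific solver and the search bracket are implementation assumptions rather than anything prescribed by the text.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def solve_mu(eps, delta):
    """Solve delta = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2) for mu."""
    f = lambda mu: norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2) - delta
    return brentq(f, 1e-6, 50.0)   # assumed search range for the root

def gaussian_noise_for_party(eps, delta, bins_per_feature, rng=None):
    """Sample one Gaussian DP noise per (feature, bin) pair, as party A might do."""
    rng = rng or np.random.default_rng()
    mu = solve_mu(eps, delta)
    sigma2 = sum(bins_per_feature) / mu ** 2      # formula (6)
    sigma = np.sqrt(sigma2)
    return {(i, j): rng.normal(0.0, sigma)
            for i, k in enumerate(bins_per_feature) for j in range(k)}

noises = gaussian_noise_for_party(eps=1.0, delta=1e-5, bins_per_feature=[3, 5, 4])
```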
The above description, which mainly takes Gaussian noise as an example, explains how the noise distribution for differential privacy is determined. On the other hand, regarding the number of samples of the first differential privacy noise, random noise is usually sampled separately for each object to which noise is to be added. In one embodiment, the first party may sample, for the plurality of first bins, a corresponding plurality of noises from the noise distribution of differential privacy; for example, the Gaussian noise distribution may be randomly sampled multiple times to obtain a plurality of Gaussian noises.

From the above, the first differential privacy noise can be obtained by sampling and then combined with the plurality of encrypted labels to determine, for each first bin, the first positive-sample encrypted noise-added quantity and the first negative-sample encrypted noise-added quantity.
In one embodiment, the first positive-sample encrypted noise-added quantity may be determined first. In a specific embodiment, for each first bin, the product of the encrypted labels corresponding to the samples in the bin is determined, and this product is then multiplied by the encrypted noise obtained by encrypting the first differential privacy noise, yielding the first positive-sample encrypted noise-added quantity. Illustratively, this computation can be expressed as:

$$c^{+}_{i,j}=\mathrm{Enc}\big(z^{A}_{i,j}\big)\cdot\prod_{s\in\Omega^{A}_{i,j}}\mathrm{Enc}(t_s)\tag{7}$$

In the above formula, the subscript "i,j" denotes the j-th bin under the i-th feature, i.e., an arbitrary first bin; $c^{+}_{i,j}$ denotes the first positive-sample encrypted noise-added quantity corresponding to that first bin; $z^{A}_{i,j}$ denotes the differential privacy noise corresponding to that first bin, and $\mathrm{Enc}(z^{A}_{i,j})$ denotes the corresponding encrypted noise; $\Omega^{A}_{i,j}$ denotes the set of samples in that first bin, and $t_s$ ($s\in\Omega^{A}_{i,j}$) denotes the label of a sample in the set; $\mathrm{Enc}(t_s)$ denotes the encrypted label corresponding to the sample label $t_s$, and $\prod_{s\in\Omega^{A}_{i,j}}\mathrm{Enc}(t_s)$ denotes the product of the encrypted labels.

In another specific embodiment, a modulo operation may further be performed on the result of the above product, yielding the first positive-sample encrypted noise-added quantity. Illustratively, this computation can be expressed as:

$$c^{+}_{i,j}=\Big[\mathrm{Enc}\big(z^{A}_{i,j}\big)\cdot\prod_{s\in\Omega^{A}_{i,j}}\mathrm{Enc}(t_s)\Big]\bmod n\tag{8}$$

In the above formula (8), n denotes the preset value, as in formula (4).

In this way, the first positive-sample encrypted noise-added quantity corresponding to the first bin can be determined. Further, the first negative-sample encrypted noise-added quantity corresponding to the first bin can be determined: specifically, the first positive-sample encrypted noise-added quantity is subtracted from the encrypted total obtained by encrypting, with the homomorphic encryption algorithm, the total number of samples in the first bin, yielding the first negative-sample encrypted noise-added quantity. Illustratively, this computation can be expressed as:

$$c^{-}_{i,j}=\mathrm{Enc}\big(N_{i,j}\big)-c^{+}_{i,j}\tag{9}$$

In the above formula, Enc(·) denotes the homomorphic encryption algorithm, which is additively homomorphic; $c^{-}_{i,j}$ denotes the first negative-sample encrypted noise-added quantity corresponding to the first bin; $N_{i,j}$ denotes the total number of samples in the first bin, and $\mathrm{Enc}(N_{i,j})$ denotes the encrypted total obtained by encrypting this total; $c^{+}_{i,j}$ denotes the first positive-sample encrypted noise-added quantity corresponding to the first bin.

Thus, the first positive-sample encrypted noise-added quantity $c^{+}_{i,j}$ of a first bin can be determined first, and its first negative-sample encrypted noise-added quantity $c^{-}_{i,j}$ determined afterwards. In fact, the computation result of formula (7) or (8) may instead be designed to correspond to the first negative-sample encrypted noise-added quantity $c^{-}_{i,j}$; following the same idea as formula (9), subtracting $c^{-}_{i,j}$ from the encrypted total $\mathrm{Enc}(N_{i,j})$ then yields the first positive-sample encrypted noise-added quantity $c^{+}_{i,j}$.

It should also be noted that, in one embodiment, the encryption algorithm used by the first party to encrypt the differential privacy noise and the total number of samples in a bin is the same as the encryption algorithm used by the second party to encrypt the sample labels.

From the above, the first party can determine, for each of the plurality of first bins under any feature in the first feature part, the corresponding first positive-sample encrypted noise-added quantity $c^{+}_{i,j}$ and first negative-sample encrypted noise-added quantity $c^{-}_{i,j}$.
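The per-bin computation of formulas (7) and (9) by the first party can be sketched as follows, again using python-paillier as a stand-in for the additively homomorphic scheme (so homomorphic addition replaces the ciphertext multiplication written in the formulas). The variable names, bin layout and noise scale are assumptions for illustration.

```python
from phe import paillier
import numpy as np

def encrypted_bin_counts(public_key, enc_labels_by_bin, noise_by_bin):
    """Party A: compute, per first bin, the encrypted noised positive count
    (formula (7)) and the encrypted noised negative count (formula (9))."""
    out = {}
    for bin_id, enc_labels in enc_labels_by_bin.items():
        z = noise_by_bin[bin_id]
        c_pos = public_key.encrypt(z)           # Enc(z_ij)
        for c in enc_labels:                     # homomorphic aggregation of the labels
            c_pos = c_pos + c
        c_neg = public_key.encrypt(len(enc_labels)) - c_pos   # Enc(N_ij) minus c_pos
        out[bin_id] = (c_pos, c_neg)
    return out

# Illustrative wiring: party B's keys and labels, party A's bins and noises.
public_key, private_key = paillier.generate_paillier_keypair()
labels = [1, 0, 1, 1, 0, 1]
enc_labels = [public_key.encrypt(t) for t in labels]
bins = {0: [0, 1, 2], 1: [3, 4, 5]}             # sample indices per first bin
enc_by_bin = {b: [enc_labels[s] for s in ids] for b, ids in bins.items()}
noise_by_bin = {b: float(np.random.normal(0, 2.0)) for b in bins}
enc_counts = encrypted_bin_counts(public_key, enc_by_bin, noise_by_bin)

# Party B would later decrypt, e.g. for bin 0:
pos_noised = private_key.decrypt(enc_counts[0][0])   # equals 2 + noise_by_bin[0]
```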
Then, in step S205, the first party sends them to the second party.

Next, the second party performs step S206: decrypting the first positive-sample encrypted noise-added quantity $c^{+}_{i,j}$ and the first negative-sample encrypted noise-added quantity $c^{-}_{i,j}$ corresponding to each first bin, to obtain the corresponding first positive-sample noise-added quantity $\tilde{y}_{i,j}$ and first negative-sample noise-added quantity $\tilde{n}_{i,j}$.

In one embodiment, assume that the first positive-sample encrypted noise-added quantity $c^{+}_{i,j}$ is computed according to formula (7) and that the encryption algorithm used by the second party satisfies formula (3). On this basis, the decryption of $c^{+}_{i,j}$ can be written as:

$$\tilde{y}_{i,j}=\mathrm{Dec}\big(c^{+}_{i,j}\big)=\mathrm{Dec}\Big[\mathrm{Enc}\big(z^{A}_{i,j}\big)\cdot\prod_{s\in\Omega^{A}_{i,j}}\mathrm{Enc}(t_s)\Big]=z^{A}_{i,j}+y_{i,j}\tag{10}$$

where $y_{i,j}$ is the true number of positive samples in the first bin. Meanwhile, assume that the first negative-sample encrypted noise-added quantity $c^{-}_{i,j}$ is computed according to formula (9); then, using the homomorphism of the encryption algorithm, the first negative-sample noise-added quantity $\tilde{n}_{i,j}$ can be decrypted from $c^{-}_{i,j}$. It should be noted that the decryption manner is adapted to the encryption manner and is not exhaustively enumerated here.

In this way, the second party can decrypt to obtain the first positive-sample noise-added quantity $\tilde{y}_{i,j}$ and the first negative-sample noise-added quantity $\tilde{n}_{i,j}$. Further, the second party performs step S207: determining, based on the first positive-sample noise-added quantity $\tilde{y}_{i,j}$ and the first negative-sample noise-added quantity $\tilde{n}_{i,j}$, a first noise-added index of the corresponding first bin.
Specifically, on the one hand, the plurality of first positive-sample noise-added quantities $\tilde{y}_{i,j}$ corresponding to the plurality of first bins under a certain first feature are summed to obtain a first total number of noise-added positive samples $\tilde{y}_{i}$. Illustratively, this summation can be written as:

$$\tilde{y}_{i}=\sum_{j\in\mathcal{B}_{i}}\tilde{y}_{i,j}\tag{11}$$

On the other hand, the first total number of noise-added negative samples $\tilde{n}_{i}$ can be obtained by subtracting the first total number of noise-added positive samples $\tilde{y}_{i}$ from the total number N of the plurality of samples. Illustratively:

$$\tilde{n}_{i}=N-\tilde{y}_{i}\tag{12}$$

Alternatively, the plurality of first negative-sample noise-added quantities $\tilde{n}_{i,j}$ corresponding to the plurality of first bins may be summed to obtain the first total number of noise-added negative samples $\tilde{n}_{i}$. Illustratively:

$$\tilde{n}_{i}=\sum_{j\in\mathcal{B}_{i}}\tilde{n}_{i,j}\tag{13}$$

In the above formulas, $\mathcal{B}_{i}$ denotes the set of the plurality of first bins under the i-th feature.

Further, the first noise-added index of an arbitrary first bin can be determined based on the obtained first total number of noise-added positive samples $\tilde{y}_{i}$ and first total number of noise-added negative samples $\tilde{n}_{i}$, together with the first positive-sample noise-added quantity $\tilde{y}_{i,j}$ and first negative-sample noise-added quantity $\tilde{n}_{i,j}$ corresponding to that first bin.

In one embodiment, the first noise-added index is a first noise-added weight of evidence $\widetilde{\mathrm{WoE}}_{i,j}$, and its computation may include: dividing the first positive-sample noise-added quantity $\tilde{y}_{i,j}$ by the first total number of noise-added positive samples $\tilde{y}_{i}$ to obtain a first positive-sample proportion; dividing the first negative-sample noise-added quantity $\tilde{n}_{i,j}$ by the first total number of noise-added negative samples $\tilde{n}_{i}$ to obtain a first negative-sample proportion; and then subtracting the logarithm of the first negative-sample proportion from the logarithm of the first positive-sample proportion to obtain the first noise-added weight of evidence. Illustratively, this computation can be written as:

$$\widetilde{\mathrm{WoE}}_{i,j}=\ln\frac{\tilde{y}_{i,j}}{\tilde{y}_{i}}-\ln\frac{\tilde{n}_{i,j}}{\tilde{n}_{i}}\tag{14}$$

In this way, the second party can determine the first noise-added weight of evidence $\widetilde{\mathrm{WoE}}_{i,j}$ corresponding to an arbitrary first bin. It can be understood that this first noise-added weight of evidence is equivalent to the noise-added quantity obtained by adding the differential privacy noise $z^{A}_{i,j}$ into the corresponding original weight of evidence $\mathrm{WoE}_{i,j}$.

In another embodiment, the first noise-added index is a first information value $\widetilde{\mathrm{IV}}_{i,j}$, and its computation may include: computing the first positive-sample proportion and the first negative-sample proportion described above; then computing the difference between the first positive-sample proportion and the first negative-sample proportion, as well as the difference between the logarithm of the first positive-sample proportion and the logarithm of the first negative-sample proportion; and then taking the product of these two differences as the first information value $\widetilde{\mathrm{IV}}_{i,j}$. Illustratively, this computation can be written as:

$$\widetilde{\mathrm{IV}}_{i,j}=\Big(\frac{\tilde{y}_{i,j}}{\tilde{y}_{i}}-\frac{\tilde{n}_{i,j}}{\tilde{n}_{i}}\Big)\cdot\Big(\ln\frac{\tilde{y}_{i,j}}{\tilde{y}_{i}}-\ln\frac{\tilde{n}_{i,j}}{\tilde{n}_{i}}\Big)\tag{15}$$

In this way, the second party can determine the first information value $\widetilde{\mathrm{IV}}_{i,j}$ corresponding to an arbitrary first bin. It can be understood that this first information value is equivalent to the noise-added quantity obtained by adding the differential privacy noise $z^{A}_{i,j}$ into the corresponding original information value $\mathrm{IV}_{i,j}$.
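Putting steps S206 and S207 together, the second party's post-processing of the decrypted noised counts into formulas (14) and (15) might look like the sketch below, continuing the notation and the python-paillier stand-in from the earlier snippets (`enc_counts`, `private_key` and the label list are assumed to be available from there).

```python
import math

def noised_woe_iv(enc_counts, private_key, total_samples):
    """Party B: decrypt per-bin noised counts (step S206) and compute the
    noised WoE of formula (14) and the noised IV of formula (15) (step S207)."""
    y = {b: private_key.decrypt(c_pos) for b, (c_pos, c_neg) in enc_counts.items()}
    n = {b: private_key.decrypt(c_neg) for b, (c_pos, c_neg) in enc_counts.items()}
    y_total = sum(y.values())              # formula (11)
    n_total = total_samples - y_total      # formula (12)
    woe, iv = {}, {}
    for b in enc_counts:
        # In practice the noised counts may need clipping to stay positive before the logs.
        p_pos = y[b] / y_total
        p_neg = n[b] / n_total
        woe[b] = math.log(p_pos) - math.log(p_neg)
        iv[b] = (p_pos - p_neg) * woe[b]
    return woe, iv

# Assumes enc_counts, private_key and labels from the previous sketch.
woe, iv = noised_woe_iv(enc_counts, private_key, total_samples=len(labels))
```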
The above realizes, on the premise of protecting the data privacy of all parties, the computation of feature evaluation indexes such as the weight of evidence or the IV value for the feature data of the first party, which does not hold sample labels, with the help of the sample label information held by the second party.

According to an embodiment of another aspect, after performing step S207, the second party may further perform step S208 of sending the first noise-added index to the first party. The first party can then screen features according to the noise-added indexes corresponding to the first bins under the first features in the first feature part it holds; for example, if the noise-added indexes of all first bins under a certain feature are very close to each other, the feature can be judged to be redundant and discarded. Alternatively, feature encoding can be performed; for example, for the plurality of samples, the value of any sample on any first feature can be encoded as the noise-added index of the first bin to which the sample belongs under that first feature, as sketched below. Further, the encoded feature values can be used as inputs of a machine learning model in federated learning, thereby effectively preventing the publication of model parameters or the open use of the model from leaking the privacy of the training data.
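A simple sketch of the screening and encoding uses just mentioned is given below; the redundancy threshold, the spread measure (range of per-bin noised WoE) and the `bin_of_value` helper are illustrative assumptions rather than elements fixed by the text.

```python
def screen_redundant_features(woe_by_feature, min_spread=0.05):
    """Drop features whose bins' noised WoE values are all very close to each other."""
    kept = {}
    for feat, woe_by_bin in woe_by_feature.items():
        values = list(woe_by_bin.values())
        if max(values) - min(values) >= min_spread:   # spread as a redundancy proxy
            kept[feat] = woe_by_bin
    return kept

def woe_encode(feature_values, bin_of_value, woe_by_bin):
    """Encode raw feature values as the noised WoE of the bin they fall into.

    bin_of_value: hypothetical callable mapping a raw value to its bin id.
    """
    return [woe_by_bin[bin_of_value(v)] for v in feature_values]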
According to an embodiment of yet another aspect, the second party may introduce a differential privacy mechanism to compute feature evaluation indexes for the feature data it holds itself. For ease of distinction, the feature data held by the second party for the plurality of samples is referred to herein as the second feature part; the description of the second feature part can refer to the foregoing description of the first feature part, noting that the two correspond to different features of the same sample IDs.

By computing second noise-added indexes of the second bins under the second features, rather than the original indexes, the second party can, in one application scenario, combine the second noise-added indexes with the above first noise-added indexes to perform feature screening on the second feature part and/or the first feature part, thereby obtaining more accurate feature screening results while protecting the privacy of all parties; in another application scenario, feature encoding of the second feature part can also be performed based on the second noise-added indexes.
The following describes the process in which the second party, based on the binary classification labels and the feature data it holds, introduces a differential privacy mechanism to compute feature evaluation indexes such as WoE and IV values. Figure 3 shows a flowchart of a method for feature processing based on differential privacy according to an embodiment, the method being performed by the second party. As shown in Figure 3, the method may include the following steps:

Step S310: for any feature in the second feature part, binning the plurality of samples to obtain a plurality of second bins. Step S320: determining, based on the binary classification labels, the true number of positive samples and the true number of negative samples in each second bin. Step S330: adding second differential privacy noise to the true number of positive samples and to the true number of negative samples, respectively, to obtain a second positive-sample noise-added quantity and a second negative-sample noise-added quantity. Step S340: determining, based on the second positive-sample noise-added quantity and the second negative-sample noise-added quantity, a second noise-added index of the corresponding second bin.

These steps are described in detail as follows.

First, in step S310, for an arbitrary second feature in the second feature part, the plurality of samples are binned to obtain a plurality of second bins. It should be noted that each second bin may include the sample IDs of the corresponding samples; for an introduction to binning, see the related description in the foregoing embodiments, which is not repeated here.

Then, in step S320, the true number of positive samples and the true number of negative samples in each second bin are determined based on the binary classification labels. Specifically, for any second bin, the number of positive samples and the number of negative samples in that second bin can be counted according to the binary classification labels of the samples therein; the numbers counted here are true numbers.

In one example, Table 2 below illustrates the counted sample distribution, namely the numbers of samples corresponding to the different label values (low-consumption users and high-consumption users) in each second bin.

Table 2 (sample distribution: for each second bin, the number of samples with each of the two label values)

From the above, the true number of positive samples and the true number of negative samples in each second bin can be determined. Then, in step S330, second differential privacy noise is added to the true number of positive samples and to the true number of negative samples, respectively, to obtain the corresponding second positive-sample noise-added quantity and second negative-sample noise-added quantity.
It should be noted that the second differential privacy noise is noise sampled by the second party based on a DP mechanism, and that the DP mechanism used by the second party is usually the same as the DP mechanism used by the first party to determine the first differential privacy noise, although it may also be different. In one embodiment, the second differential privacy noise is Gaussian noise sampled from a Gaussian noise distribution. Specifically, the second party may determine a noise power $\sigma_B^2$ based on the privacy budget parameters it sets for the plurality of samples and the numbers of bins corresponding to the respective features in the second feature part it holds, and then determine a Gaussian noise distribution $\mathcal{N}(0,\sigma_B^2)$ with this noise power as the variance of the Gaussian distribution and 0 as the mean, from which the Gaussian noise is sampled. For a further description of how the second party determines $\mathcal{N}(0,\sigma_B^2)$, reference may be made to the foregoing description of how the first party determines the Gaussian noise distribution $\mathcal{N}(0,\sigma_A^2)$, which is not repeated here.

On the other hand, regarding the number of samples of the second differential privacy noise, random noise is usually sampled separately for each object to which noise is to be added. In one embodiment, for the plurality of second bins, a corresponding plurality of noises may be sampled from the noise distribution of differential privacy. In another embodiment, for the plurality of second bins, a corresponding plurality of groups of noises may be sampled from the noise distribution of differential privacy, the two noises in each group corresponding respectively to the positive samples and the negative samples in the bin.
From the above, the second differential privacy noise can be obtained by sampling, and noise can then be added to the true numbers of positive and negative samples. In one embodiment, the second differential privacy noise corresponding to a certain second bin may be added to both the true number of positive samples and the true number of negative samples of that bin; that is, the same noise is added to the numbers of positive and negative samples of the same bin, yielding the corresponding second positive-sample noise-added quantity and second negative-sample noise-added quantity. Illustratively, this noise addition can be written as:

$$\tilde{y}^{B}_{i,j}=y^{B}_{i,j}+z_{i,j}=\sum_{s\in\Omega^{B}_{i,j}}t_s+z_{i,j}\tag{16}$$

$$\tilde{n}^{B}_{i,j}=n^{B}_{i,j}+z_{i,j}\tag{17}$$

In formulas (16) and (17), the subscript "i,j" denotes the j-th bin under the i-th feature, i.e., an arbitrary second bin; $\Omega^{B}_{i,j}$ denotes the set of samples in that second bin, and $t_s$ ($s\in\Omega^{B}_{i,j}$) denotes the label of a sample in the set; $z_{i,j}$ denotes the differential privacy noise corresponding to that second bin; $y^{B}_{i,j}$ and $n^{B}_{i,j}$ denote the true number of positive samples and the true number of negative samples of that second bin, respectively; and $\tilde{y}^{B}_{i,j}$ and $\tilde{n}^{B}_{i,j}$ denote the corresponding second positive-sample noise-added quantity and second negative-sample noise-added quantity, respectively.

In another embodiment, for the true number of positive samples and the true number of negative samples of a certain second bin, one noise of the corresponding group of differential privacy noises may be added to the former, and the other noise of that group may be added to the latter, yielding the corresponding second positive-sample noise-added quantity and second negative-sample noise-added quantity. Illustratively, this noise addition can be written as:

$$\tilde{y}^{B}_{i,j}=y^{B}_{i,j}+z^{(1)}_{i,j}\tag{18}$$

$$\tilde{n}^{B}_{i,j}=n^{B}_{i,j}+z^{(2)}_{i,j}\tag{19}$$

In formulas (18) and (19), the symbols $z^{(1)}_{i,j}$ and $z^{(2)}_{i,j}$ denote the two different noises in the group of differential privacy noises corresponding to the second bin.
From the above, the second positive-sample noise-added quantity $\tilde{y}^{B}_{i,j}$ and the second negative-sample noise-added quantity $\tilde{n}^{B}_{i,j}$ corresponding to each second bin can be obtained, where the superscript B indicates the second feature part. On this basis, step S340 can be performed: determining, based on the second positive-sample noise-added quantity $\tilde{y}^{B}_{i,j}$ and the second negative-sample noise-added quantity $\tilde{n}^{B}_{i,j}$, the second noise-added index of the corresponding second bin.

It should be understood that, for a description of this step, reference may be made to the foregoing description of determining the first noise-added index of a first bin in step S207; only the formulas for computing the second noise-added weight of evidence are listed below for illustration, and the rest can refer to the related description in step S207.

$$\tilde{y}^{B}_{i}=\sum_{j\in\mathcal{B}^{B}_{i}}\tilde{y}^{B}_{i,j}\tag{20}$$

$$\tilde{n}^{B}_{i}=\sum_{j\in\mathcal{B}^{B}_{i}}\tilde{n}^{B}_{i,j}\tag{21}$$

$$\widetilde{\mathrm{WoE}}^{B}_{i,j}=\ln\frac{\tilde{y}^{B}_{i,j}}{\tilde{y}^{B}_{i}}-\ln\frac{\tilde{n}^{B}_{i,j}}{\tilde{n}^{B}_{i}}\tag{22}$$

In formulas (20), (21) and (22) above, $\mathcal{B}^{B}_{i}$ denotes the set of the plurality of second bins under the i-th second feature; j denotes the j-th second bin among the plurality of second bins; $\tilde{y}^{B}_{i}$ and $\tilde{n}^{B}_{i}$ denote the second total number of noise-added positive samples and the second total number of noise-added negative samples, respectively; and $\widetilde{\mathrm{WoE}}^{B}_{i,j}$ denotes the second noise-added weight of evidence corresponding to the j-th second bin under the i-th second feature.

In this way, the second party, which holds the labels, can determine the second noise-added weight of evidence $\widetilde{\mathrm{WoE}}^{B}_{i,j}$ corresponding to an arbitrary second bin by introducing a differential privacy mechanism, and then, based on this second noise-added weight of evidence and/or the first noise-added weight of evidence $\widetilde{\mathrm{WoE}}_{i,j}$ determined above, perform further feature processing, such as feature screening, evaluation or encoding, on the first feature part and/or the second feature part.
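The second party's local computation in Figure 3 can be sketched as follows, reusing the Gaussian-noise calibration from the earlier snippet. The grouped-noise variant of formulas (18) and (19) is the one shown, and the per-bin true counts and the noise scale are treated as given, illustrative inputs.

```python
import math
import numpy as np

def party_b_noised_woe(true_pos, true_neg, sigma, rng=None):
    """Party B: add Gaussian DP noise to per-bin true counts (formulas (18)-(19))
    and compute the second noised WoE of formula (22)."""
    rng = rng or np.random.default_rng()
    y = {b: c + rng.normal(0.0, sigma) for b, c in true_pos.items()}
    n = {b: c + rng.normal(0.0, sigma) for b, c in true_neg.items()}
    y_total, n_total = sum(y.values()), sum(n.values())
    # In practice the noised counts may need clipping to stay positive.
    return {b: math.log(y[b] / y_total) - math.log(n[b] / n_total) for b in true_pos}

# Illustrative per-bin true counts for one second feature with 3 bins.
woe_b = party_b_noised_woe({0: 40, 1: 25, 2: 35}, {0: 10, 1: 30, 2: 60}, sigma=2.0)
```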
Corresponding to the above feature processing method, embodiments of this specification further disclose a feature processing apparatus. Figure 4 shows a schematic structural diagram of an apparatus for feature processing based on differential privacy according to an embodiment, where the participants in the federated learning include a first party and a second party, the first party stores first feature parts of a plurality of samples, the second party stores binary classification labels of the plurality of samples, and the apparatus is integrated in the second party. As shown in Figure 4, the apparatus 400 includes:

a label encryption unit 410, configured to separately encrypt a plurality of binary classification labels corresponding to the plurality of samples to obtain a plurality of encrypted labels; an encrypted label sending unit 420, configured to send the plurality of encrypted labels to the first party; an encrypted quantity processing unit 430, configured to receive, from the first party, a first positive-sample encrypted noise-added quantity and a first negative-sample encrypted noise-added quantity corresponding to each of a plurality of first bins, and decrypt them to obtain a corresponding first positive-sample noise-added quantity and first negative-sample noise-added quantity, where the first positive-sample encrypted noise-added quantity and the first negative-sample encrypted noise-added quantity are determined based on the plurality of encrypted labels and first differential privacy noise, and the plurality of first bins are obtained by binning the plurality of samples with respect to any feature in the first feature part; and a first index calculation unit 440, configured to determine, based on the first positive-sample noise-added quantity and the first negative-sample noise-added quantity, a first noise-added index of the corresponding first bin.

In one embodiment, the business objects to which the plurality of samples relate are any one of the following: users, commodities, business events.

In one embodiment, the label encryption unit 410 is specifically configured to separately encrypt the plurality of binary classification labels based on a preset encryption algorithm to obtain the plurality of encrypted labels, where the preset encryption algorithm satisfies the condition that the decryption of the product of ciphertexts equals the sum of the corresponding plaintexts.

In a specific embodiment, the first index calculation unit 440 includes: a total determination subunit, configured to sum the plurality of first positive-sample noise-added quantities corresponding to the plurality of first bins to obtain a first total number of noise-added positive samples, and to sum the plurality of first negative-sample noise-added quantities corresponding to the plurality of first bins to obtain a first total number of noise-added negative samples; and an index determination subunit, configured to determine the first noise-added index based on the first total number of noise-added positive samples, the first total number of noise-added negative samples, the first positive-sample noise-added quantity and the first negative-sample noise-added quantity.

In a specific embodiment, the first noise-added index is a first noise-added weight of evidence, and the index determination subunit is specifically configured to: divide the first positive-sample noise-added quantity by the first total number of noise-added positive samples to obtain a first positive-sample proportion; divide the first negative-sample noise-added quantity by the first total number of noise-added negative samples to obtain a first negative-sample proportion; and subtract the logarithm of the first negative-sample proportion from the logarithm of the first positive-sample proportion to obtain the first noise-added weight of evidence.

In one embodiment, the second party further stores second feature parts of the plurality of samples, and the apparatus 400 further includes: a binning processing unit 450, configured to bin the plurality of samples with respect to any feature in the second feature part to obtain a plurality of second bins; and a second index calculation unit 460, configured to determine, based on a differential privacy mechanism, a second noise-added index of each of the plurality of second bins. The apparatus 400 further includes a feature screening unit 470, configured to perform feature screening on the first feature part and/or the second feature part based on the first noise-added index and the second noise-added index.

In a specific embodiment, the second index calculation unit 460 includes: a true quantity determination subunit, configured to determine, based on the binary classification labels, the true number of positive samples and the true number of negative samples in each second bin; a noise-added quantity determination subunit, configured to add second differential privacy noise to the true number of positive samples and to the true number of negative samples, respectively, to obtain a second positive-sample noise-added quantity and a second negative-sample noise-added quantity; and a noise-added index determination subunit, configured to determine, based on the second positive-sample noise-added quantity and the second negative-sample noise-added quantity, a second noise-added index of the corresponding second bin.

In a more specific embodiment, the second differential privacy noise is Gaussian noise, and the apparatus 400 further includes a noise determination unit 480 configured to: determine a noise power based on the privacy budget parameters set for the plurality of samples and the numbers of bins corresponding to the respective features in the second feature part; generate a Gaussian noise distribution using the noise power as the variance of the Gaussian distribution and 0 as the mean; and sample the Gaussian noise from the Gaussian noise distribution.

Further, in one example, the noise determination unit 480 is configured to determine the noise power, which specifically includes: determining the sum of the numbers of bins corresponding to the respective features; obtaining a variable value of a mean variable, the variable value being determined based on the parameter values of the privacy budget parameters and the constraint relation between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy; and calculating the noise power as the product of the following factors: the sum of the numbers of bins, and the reciprocal of the square of the variable value.

Still further, in a specific example, the privacy budget parameters include a budget term parameter and a relaxation term parameter.

In another more specific embodiment, the apparatus further includes a noise sampling unit configured to sample, for the plurality of second bins, a corresponding plurality of groups of noises from the noise distribution of differential privacy; the second index calculation unit 460 is configured to add the differential privacy noises respectively, which specifically includes: adding one noise of the corresponding group of noises to the true number of positive samples, and adding the other noise of that group of noises to the true number of negative samples.

In yet another more specific embodiment, the noise-added index determination subunit in the second index calculation unit 460 is specifically configured to: sum the plurality of second positive-sample noise-added quantities corresponding to the plurality of second bins to obtain a second total number of noise-added positive samples; sum the plurality of second negative-sample noise-added quantities corresponding to the plurality of second bins to obtain a second total number of noise-added negative samples; and determine the second noise-added index based on the second total number of noise-added positive samples, the second total number of noise-added negative samples, the second positive-sample noise-added quantity and the second negative-sample noise-added quantity.

Further, in one example, the second noise-added index is a second noise-added weight of evidence, and the noise-added index determination subunit is configured to determine the second noise-added index, which specifically includes: dividing the second positive-sample noise-added quantity by the second total number of noise-added positive samples to obtain a second positive-sample proportion; dividing the second negative-sample noise-added quantity by the second total number of noise-added negative samples to obtain a second negative-sample proportion; and subtracting the logarithm of the second negative-sample proportion from the logarithm of the second positive-sample proportion to obtain the second noise-added weight of evidence.
Figure 5 shows a schematic structural diagram of an apparatus for feature processing based on differential privacy according to another embodiment, where the participants in the federated learning include a first party and a second party, the first party stores first feature parts of a plurality of samples, the second party stores second feature parts and binary classification labels of the plurality of samples, and the apparatus is integrated in the first party. As shown in Figure 5, the apparatus 500 includes:

an encrypted label receiving unit 510, configured to receive, from the second party, a plurality of encrypted labels obtained by separately encrypting a plurality of binary classification labels corresponding to the plurality of samples; a binning processing unit 520, configured to bin the plurality of samples with respect to any feature in the first feature part to obtain a plurality of first bins; an encryption and noise adding unit 530, configured to determine, based on the plurality of encrypted labels and differential privacy noise, a first positive-sample encrypted noise-added quantity and a first negative-sample encrypted noise-added quantity corresponding to each first bin; and an encrypted quantity sending unit 540, configured to send the first positive-sample encrypted noise-added quantity and the first negative-sample encrypted noise-added quantity to the second party, so that the second party decrypts them to obtain a first positive-sample noise-added quantity and a first negative-sample noise-added quantity and determines, based on the decryption result, a first noise-added index of the corresponding first bin.

In one embodiment, the business objects to which the plurality of samples relate are any one of the following: users, commodities, business events.

In one embodiment, the encryption and noise adding unit 530 is specifically configured to: for each first bin, determine the product of the encrypted labels corresponding to the samples therein; multiply this product by the encrypted noise obtained by encrypting the differential privacy noise, to obtain the first positive-sample encrypted noise-added quantity; and subtract the first positive-sample encrypted noise-added quantity from the encrypted total obtained by encrypting the total number of samples in the first bin, to obtain the first negative-sample encrypted noise-added quantity.

In a specific embodiment, the apparatus 500 further includes a noise sampling unit 550 configured to sample, for the plurality of first bins, a corresponding plurality of noises from the noise distribution of differential privacy; the encryption and noise adding unit 530 is configured to perform the product processing, which specifically includes: encrypting, among the plurality of noises, the noise corresponding to the product of encrypted labels to obtain the encrypted noise; and multiplying the product of encrypted labels by the encrypted noise.

In one embodiment, the differential privacy noise is Gaussian noise, and the apparatus 500 further includes a noise determination unit 550 configured to: determine a noise power based on the privacy budget parameters set for the plurality of samples and the numbers of bins corresponding to the respective features in the first feature part; generate a Gaussian noise distribution using the noise power as the variance of the Gaussian distribution and 0 as the mean; and sample the Gaussian noise from the Gaussian noise distribution.

In a specific embodiment, the noise determination unit 550 is configured to determine the noise power, which specifically includes: determining the sum of the numbers of bins corresponding to the respective features; obtaining a variable value of a mean variable, the variable value being determined based on the parameter values of the privacy budget parameters and the constraint relation between the privacy budget parameters and the mean variable under the Gaussian mechanism of differential privacy; and calculating the noise power as the product of the following factors: the sum of the numbers of bins, and the reciprocal of the square of the variable value.

In one example, the privacy budget parameters include a budget term parameter and a relaxation term parameter.

According to an embodiment of another aspect, there is further provided a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with Figure 2 or Figure 3.

According to an embodiment of yet another aspect, there is further provided a computing device including a memory and a processor, where executable code is stored in the memory; when the processor executes the executable code, the method described in conjunction with Figure 2 or Figure 3 is implemented.

Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the present invention may be implemented by hardware, software, firmware or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.

The specific embodiments described above further explain in detail the objectives, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention; any modification, equivalent replacement or improvement made on the basis of the technical solutions of the present invention shall be included within the protection scope of the present invention.

Claims (25)

  1. 一种基于差分隐私进行特征处理的方法,所述方法涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的二分类标签;所述方法由所述第二方执行,包括:
    对所述多个样本对应的多个二分类标签分别进行加密,得到多个加密标签;
    将所述多个加密标签发送至所述第一方;
    从所述第一方接收多个第一分箱中每个第一分箱对应的第一正样本加密加噪数量以及第一负样本加密加噪数量,并对其进行解密,得到对应的第一正样本加噪数量和第一负样本加噪数量;其中,所述第一正样本加密加噪数量和第一负样本加密加噪数量基于所述多个加密标签以及第一差分隐私噪声而确定;所述多个第一分箱是针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理而得到;
    基于所述第一正样本加噪数量和第一负样本加噪数量,确定相对应的第一分箱的第一加噪指标。
  2. 根据权利要求1所述的方法,其中,所述多个样本针对的业务对象为以下中的任一种:用户、商品、业务事件。
  3. 根据权利要求1所述的方法,其中,对所述多个样本对应的多个二分类标签分别进行加密,得到多个加密标签,包括:
    基于同态加密算法,对所述多个二分类标签分别进行加密,得到所述多个加密标签。
  4. 根据权利要求1所述的方法,其中,基于所述第一正样本加噪数量和第一负样本加噪数量,确定相对应的第一分箱的第一加噪指标,包括:
    对所述多个第一分箱对应的多个第一正样本加噪数量进行求和处理,得到第一正样本加噪总数;
    对所述多个第一分箱对应的多个第一负样本加噪数量进行求和处理,得到第一负样本加噪总数;
    基于所述第一正样本加噪总数、第一负样本加噪总数、第一正样本加噪数量、第一负样本加噪数量,确定所述第一加噪指标。
  5. 根据权利要求4所述的方法,所述第一加噪指标为第一加噪证据权重,其中,确定所述第一加噪指标,包括:
    将所述第一正样本加噪数量除以所述第一正样本加噪总数,得到第一正样本占比;
    将所述第一负样本加噪数量除以所述第一负样本加噪总数,得到第一负样本占比;
    将所述第一正样本占比的取对数结果减去所述第一负样本占比的取对数结果,得到所述第一加噪证据权重。
  6. 根据权利要求1所述的方法,所述第二方还存储所述多个样本的第二特征部分;所述方法还包括:
    针对所述第二特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第二分箱;
    基于差分隐私机制,确定多个第二分箱中每个第二分箱的第二加噪指标;
    其中,在确定相对应的第一分箱的第一加噪指标之后,所述方法还包括:
    基于所述第一加噪指标和第二加噪指标,对所述第一特征部分和/或第二特征部分进行特征筛选处理。
  7. 根据权利要求6所述的方法,其中,基于差分隐私机制,确定多个第二分箱中每个第二分箱的第二加噪指标,包括:
    基于所述二分类标签,确定每个第二分箱中正样本的真实数量和负样本的真实数量;
    在所述正样本的真实数量和负样本的真实数量上,分别添加第二差分隐私噪声,对应得到第二正样本加噪数量和第二负样本加噪数量;
    基于所述第二正样本加噪数量和第二负样本加噪数量,确定相对应的第二分箱的第二加噪指标。
  8. 根据权利要求7所述的方法,其中,所述第二差分隐私噪声为高斯噪声;在所述分别添加第二差分隐私噪声之前,所述方法还包括:
    基于针对所述多个样本设定的隐私预算参数,以及所述第二特征部分中各个特征所对应的分箱数量,确定噪声功率;
    以所述噪声功率作为高斯分布的方差,以0为均值,生成高斯噪声分布;
    从所述高斯噪声分布中采样所述高斯噪声。
  9. 根据权利要求8所述的方法,其中,确定噪声功率包括:
    确定所述各个特征所对应分箱数量的和值;
    获取均值变量的变量值,该变量值基于所述隐私预算参数的参数值,以及差分隐私的高斯机制下所述隐私预算参数和均值变量的约束关系而确定;
    基于以下因子的乘积计算得到所述噪声功率:所述分箱数量的和值,以及所述变量值进行平方运算后的倒数。
  10. 根据权利要求8或9所述的方法,其中,所述隐私预算参数包括预算项参数和松弛项参数。
  11. 根据权利要求7所述的方法,其中,在所述分别添加第二差分隐私噪声之前,所述方法还包括:
    针对所述多个第二分箱,从差分隐私的噪声分布中对应采样多组噪声;
    其中,所述分别添加差分隐私噪声包括:
    在所述正样本的真实数量上,添加对应组别噪声中的一个噪声,并且,在所述负样本 的真实数量上,添加该组噪声中的另一个噪声。
  12. 根据权利要求7所述的方法,其中,基于所述第二正样本加噪数量和第二负样本加噪数量,确定相对应的第二分箱的第二加噪指标,包括:
    对所述多个第二分箱对应的多个第二正样本加噪数量进行求和处理,得到第二正样本加噪总数;
    对所述多个第二分箱对应的多个第二负样本加噪数量进行求和处理,得到第二负样本加噪总数;
    基于所述第二正样本加噪总数、第二负样本加噪总数、第二正样本加噪数量、第二负样本加噪数量,确定所述第二加噪指标。
  13. 根据权利要求12所述的方法,所述第二加噪指标为第二加噪证据权重,其中,确定所述第二加噪指标,包括:
    将所述第二正样本加噪数量除以所述第二正样本加噪总数,得到第二正样本占比;
    将所述第二负样本加噪数量除以所述第二负样本加噪总数,得到第二负样本占比;
    将所述第二正样本占比的取对数结果减去所述第二负样本占比的取对数结果,得到所述第二加噪证据权重。
  14. 一种基于差分隐私进行特征处理的方法,所述方法涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的二分类标签;所述方法由所述第一方执行,包括:
    从所述第二方接收多个加密标签,其是对所述多个样本对应的多个二分类标签分别进行加密而得到;
    针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第一分箱;
    基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量;
    将所述第一正样本加密加噪数量和第一负样本加密加噪数量发送至所述第二方,以使得所述第二方对其解密得到第一正样本加噪数量和第一负样本加噪数量,并基于该解密的结果确定相对应的第一分箱的第一加噪指标。
  15. 根据权利要求14所述的方法,其中,所述多个样本针对的业务对象为以下中的任一种:用户、商品、业务事件。
  16. 根据权利要求14所述的方法,其中,基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量,包括:
    针对所述每个第一分箱,确定其中各个样本所对应的加密标签之间的连乘结果;
    对所述连乘结果以及加密所述差分隐私噪声而得到的加密噪声进行乘积处理,得到所 述第一正样本加密加噪数量;
    利用加密该第一分箱中样本的总数而得到的加密总数,减去所述第一正样本加密噪声数量,得到所述第一负样本加密加噪数量。
  17. 根据权利要求16所述的方法,其中,在对所述连乘结果以及加密所述差分隐私噪声而得到的加密噪声进行乘积处理,得到所述第一正样本加密加噪数量之前,所述方法还包括:
    针对所述多个第一分箱,从差分隐私的噪声分布中对应采样多个噪声;
    其中,对所述连乘结果以及加密所述差分隐私噪声而得到的加密噪声进行乘积处理,包括:
    对所述多个噪声中对应所述连乘结果的噪声进行加密,得到所述加密噪声;
    对所述连乘结果和所述加密噪声进行乘积处理。
  18. 根据权利要求14所述的方法,其中,所述差分隐私噪声为高斯噪声;在基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量之前,所述方法还包括:
    基于针对所述多个样本设定的隐私预算参数,以及所述第一特征部分中各个特征所对应的分箱数量,确定噪声功率;
    以所述噪声功率作为高斯分布的方差,以0为均值,生成高斯噪声分布;
    从所述高斯噪声分布中采样所述高斯噪声。
  19. 根据权利要求18所述的方法,其中,确定噪声功率包括:
    确定所述各个特征所对应的分箱数量的和值;
    获取均值变量的变量值,该变量值基于所述隐私预算参数的参数值,以及差分隐私的高斯机制下所述隐私预算参数和均值变量的约束关系而确定;
    基于以下因子的乘积计算得到所述噪声功率:所述分箱数量的和值,以及所述变量值进行平方运算后的倒数。
  20. 根据权利要求18或19所述的方法,其中,所述隐私预算参数包括预算项参数和松弛项参数。
  21. 一种基于差分隐私进行特征处理的装置,所述特征处理涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的二分类标签;所述装置集成于所述第二方,包括:
    标签加密单元,配置为对所述多个样本对应的多个二分类标签分别进行加密,得到多个加密标签;
    加密标签发送单元,配置为将所述多个加密标签发送至所述第一方;
    加密数量处理单元,配置为从所述第一方接收多个第一分箱中每个第一分箱对应的第 一正样本加密加噪数量以及第一负样本加密加噪数量,并对其进行解密,得到对应的第一正样本加噪数量和第一负样本加噪数量;其中,所述第一正样本加密加噪数量和第一负样本加密加噪数量基于所述多个加密标签以及第一差分隐私噪声而确定;所述多个第一分箱是针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理而得到;
    第一指标计算单元,配置为基于所述第一正样本加噪数量和第一负样本加噪数量,确定相对应的第一分箱的第一加噪指标。
  22. 根据权利要求21所述的装置,所述第二方还存储所述多个样本的第二特征部分;所述装置还包括:
    分箱处理单元,配置为针对所述第二特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第二分箱;
    第二指标计算单元,配置为基于差分隐私机制,确定多个第二分箱中每个第二分箱的第二加噪指标;
    所述装置还包括:
    特征筛选单元,配置为基于所述第一加噪指标和第二加噪指标,对所述第一特征部分和/或第二特征部分进行特征筛选处理。
  23. 一种基于差分隐私进行特征处理的装置,所述特征处理涉及第一方和第二方,其中第一方存储多个样本的第一特征部分,第二方存储所述多个样本的二分类标签;所述装置集成于所述第一方,包括:
    加密标签接收单元,配置为从所述第二方接收多个加密标签,其是对所述多个样本对应的多个二分类标签分别进行加密而得到;
    分箱处理单元,配置为针对所述第一特征部分中的任一特征,对所述多个样本进行分箱处理,得到多个第一分箱;
    加密加噪单元,配置为基于所述多个加密标签以及差分隐私噪声,确定每个第一分箱对应的第一正样本加密加噪数量和第一负样本加密加噪数量;
    加密数量发送单元,配置为将所述第一正样本加密加噪数量和第一负样本加密加噪数量发送至所述第二方,以使得所述第二方对其解密得到第一正样本加噪数量和第一负样本加噪数量,并基于该解密的结果确定相对应的第一分箱的第一加噪指标。
  24. 一种计算机可读存储介质,其上存储有计算机程序,其中,当所述计算机程序在计算机中执行时,令计算机执行权利要求1-20中任一项所述的方法。
  25. 一种计算设备,包括存储器和处理器,其中,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现权利要求1-20中任一项所述的方法。
PCT/CN2022/105052 2021-09-27 2022-07-12 基于差分隐私进行特征处理的方法及装置 WO2023045503A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/394,978 US20240152643A1 (en) 2021-09-27 2023-12-22 Differential privacy-based feature processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111133642.5A CN113591133B (zh) 2021-09-27 2021-09-27 基于差分隐私进行特征处理的方法及装置
CN202111133642.5 2021-09-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/394,978 Continuation US20240152643A1 (en) 2021-09-27 2023-12-22 Differential privacy-based feature processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023045503A1 true WO2023045503A1 (zh) 2023-03-30

Family

ID=78242244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/105052 WO2023045503A1 (zh) 2021-09-27 2022-07-12 基于差分隐私进行特征处理的方法及装置

Country Status (3)

Country Link
US (1) US20240152643A1 (zh)
CN (1) CN113591133B (zh)
WO (1) WO2023045503A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116451275A (zh) * 2023-06-15 2023-07-18 北京电子科技学院 一种基于联邦学习的隐私保护方法及计算设备

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591133B (zh) * 2021-09-27 2021-12-24 支付宝(杭州)信息技术有限公司 基于差分隐私进行特征处理的方法及装置
CN114398671B (zh) * 2021-12-30 2023-07-11 翼健(上海)信息科技有限公司 基于特征工程iv值的隐私计算方法、系统和可读存储介质
CN114329127B (zh) * 2021-12-30 2023-06-20 北京瑞莱智慧科技有限公司 特征分箱方法、装置及存储介质
CN114154202B (zh) * 2022-02-09 2022-06-24 支付宝(杭州)信息技术有限公司 基于差分隐私的风控数据探查方法和系统
CN114401079B (zh) * 2022-03-25 2022-06-14 腾讯科技(深圳)有限公司 多方联合信息价值计算方法、相关设备及存储介质
CN115329898B (zh) * 2022-10-10 2023-01-24 国网浙江省电力有限公司杭州供电公司 基于差分隐私策略的多属性数据发布方法及系统
CN115809473B (zh) * 2023-02-02 2023-04-25 富算科技(上海)有限公司 一种纵向联邦学习的信息价值的获取方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180101697A1 (en) * 2016-10-11 2018-04-12 Palo Alto Research Center Incorporated Method for differentially private aggregation in a star topology under a realistic adversarial model
CN112163896A (zh) * 2020-10-19 2021-01-01 科技谷(厦门)信息技术有限公司 一种联邦学习系统
CN112749749A (zh) * 2021-01-14 2021-05-04 深圳前海微众银行股份有限公司 基于分类决策树模型的分类方法、装置及电子设备
CN113362048A (zh) * 2021-08-11 2021-09-07 腾讯科技(深圳)有限公司 数据标签分布确定方法、装置、计算机设备和存储介质
WO2021179839A1 (zh) * 2020-03-11 2021-09-16 支付宝(杭州)信息技术有限公司 保护用户隐私的用户分类系统的构建方法及装置
CN113591133A (zh) * 2021-09-27 2021-11-02 支付宝(杭州)信息技术有限公司 基于差分隐私进行特征处理的方法及装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990857B (zh) * 2019-12-11 2021-04-06 支付宝(杭州)信息技术有限公司 保护隐私安全的多方联合进行特征评估的方法及装置
CN112199702A (zh) * 2020-10-16 2021-01-08 鹏城实验室 一种基于联邦学习的隐私保护方法、存储介质及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180101697A1 (en) * 2016-10-11 2018-04-12 Palo Alto Research Center Incorporated Method for differentially private aggregation in a star topology under a realistic adversarial model
WO2021179839A1 (zh) * 2020-03-11 2021-09-16 支付宝(杭州)信息技术有限公司 保护用户隐私的用户分类系统的构建方法及装置
CN112163896A (zh) * 2020-10-19 2021-01-01 科技谷(厦门)信息技术有限公司 一种联邦学习系统
CN112749749A (zh) * 2021-01-14 2021-05-04 深圳前海微众银行股份有限公司 基于分类决策树模型的分类方法、装置及电子设备
CN113362048A (zh) * 2021-08-11 2021-09-07 腾讯科技(深圳)有限公司 数据标签分布确定方法、装置、计算机设备和存储介质
CN113591133A (zh) * 2021-09-27 2021-11-02 支付宝(杭州)信息技术有限公司 基于差分隐私进行特征处理的方法及装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116451275A (zh) * 2023-06-15 2023-07-18 北京电子科技学院 一种基于联邦学习的隐私保护方法及计算设备
CN116451275B (zh) * 2023-06-15 2023-08-22 北京电子科技学院 一种基于联邦学习的隐私保护方法及计算设备

Also Published As

Publication number Publication date
US20240152643A1 (en) 2024-05-09
CN113591133A (zh) 2021-11-02
CN113591133B (zh) 2021-12-24

Similar Documents

Publication Publication Date Title
WO2023045503A1 (zh) 基于差分隐私进行特征处理的方法及装置
CN110622165B (zh) 用于确定隐私集交集的安全性措施
US20230006809A1 (en) Homomorphic computations on encrypted data within a distributed computing environment
US11394773B2 (en) Cryptographic currency block chain based voting system
WO2021197037A1 (zh) 双方联合进行数据处理的方法及装置
US10701100B2 (en) Threat intelligence management in security and compliance environment
Bassily et al. Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy
US20230017374A1 (en) Secure multi-party computation of differentially private heavy hitters
US20100014657A1 (en) Privacy preserving social network analysis
Yigitbasioglu Modelling the intention to adopt cloud computing services: a transaction cost theory perspective
WO2020053854A1 (en) Systems and methods for secure prediction using an encrypted query executed based on encrypted data
HU231270B1 (hu) Adatkezelő eljárás és regisztrációs eljárás anonim adatmegosztó rendszerhez, valamint adatkezelő és azt tartalmazó anonim adatmegosztó rendszer
CN114401079A (zh) 多方联合信息价值计算方法、相关设备及存储介质
US11853461B2 (en) Differential privacy security for benchmarking
CN115242371A (zh) 差分隐私保护的集合交集及其基数计算方法、装置及系统
Liu et al. Quantum private set intersection cardinality based on bloom filter
Rafi et al. Fairness and privacy preserving in federated learning: A survey
US20200021496A1 (en) Method, apparatus, and computer-readable medium for data breach simulation and impact analysis in a computer network
SM et al. Improving security with federated learning
WO2022110716A1 (zh) 冷启动推荐方法、装置、计算机设备及存储介质
EP3596648B1 (en) Assessing data leakage risks
Trujillo et al. A traffic analysis attack to compute social network measures
Sumana et al. Privacy preserving naive bayes classifier for horizontally partitioned data using secure division
US11792167B2 (en) Flexible data security and machine learning system for merging third-party data
US20110093472A1 (en) Systems and methods to determine aggregated social relationships

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871562

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE