WO2023040429A1 - Data processing method, apparatus, and device for federated feature engineering, and medium - Google Patents
- Publication number
- WO2023040429A1 (application PCT/CN2022/104178)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sum
- label
- labels
- data
- party
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Definitions
- The present disclosure relates to the field of computer technology, specifically to deep learning and data processing, and in particular to a data processing method, apparatus, device and medium for federated feature engineering.
- To address data silos and data privacy and security, the current mainstream approach is federated learning, which jointly trains on data held by different parties to obtain better models for practical problems.
- Federated learning can be divided into horizontal federated learning, vertical federated learning, and transfer learning. Among these, vertical federated learning is widely used.
- The disclosure provides a data processing method, apparatus, device and medium for federated feature engineering.
- A data processing method for federated feature engineering, including: receiving a first sample identifier of sample data sent by the business party, a ciphertext label corresponding to the first sample identifier, and a second sample identifier of sample data sent by the data party, where the ciphertext label includes a first label and a second label; determining a target sample identifier according to the first sample identifier and the second sample identifier, and sending it to the data party; determining a target ciphertext label according to the ciphertext label and the target sample identifier, and sending it to the data party; and, in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after feature bucketing based on the target sample identifier and the target ciphertext label, calculating and outputting a parameter corresponding to the target sample identifier based on the target ciphertext label and the per-bucket sums of the first and second labels.
- A data processing device for federated feature engineering, including:
- a data receiving unit configured to receive a first sample identifier of sample data sent by the business party, a ciphertext label corresponding to the first sample identifier, and a second sample identifier of sample data sent by the data party, where the ciphertext label includes a first label and a second label;
- an identifier sending unit configured to determine a target sample identifier according to the first sample identifier and the second sample identifier, and send it to the data party;
- a label sending unit configured to determine a target ciphertext label according to the ciphertext label and the target sample identifier, and send it to the data party;
- an information output unit configured to, in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output the parameters corresponding to the target sample identifier based on the target ciphertext label and the per-bucket sums of the first and second labels.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described in the first aspect.
- a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method described in the first aspect.
- a computer program product includes a computer program, and when executed by a processor, the computer program implements the method as described in the first aspect.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
- FIG. 2 is a flowchart of an embodiment of a data processing method for federated feature engineering according to the present disclosure;
- FIG. 3 is a schematic diagram of an application scenario of a data processing method for federated feature engineering according to the present disclosure;
- FIG. 4 is a flowchart of another embodiment of a data processing method for federated feature engineering according to the present disclosure;
- FIG. 5 is a schematic diagram of the tripartite interaction process in the embodiment shown in FIG. 4;
- FIG. 6 is a schematic structural diagram of an embodiment of a data processing device for federated feature engineering according to the present disclosure;
- FIG. 7 is a block diagram of an electronic device for implementing the data processing method for federated feature engineering according to an embodiment of the present disclosure.
- feature engineering is an essential part.
- Common feature engineering steps include data preprocessing, feature selection, and feature dimensionality reduction. After data preprocessing, meaningful features must be selected to train the model, usually with common indicators such as WOE (Weight of Evidence) and IV (Information Value) that measure each feature's ability to predict the label.
- The data needs to be binned, that is, continuous variables are discretized, so that the model can iterate quickly and the risk of overfitting is reduced. Binning is a commonly used data preprocessing method.
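- As an illustration of the binning step described above, the following minimal Python sketch discretizes a continuous feature with equal-width bins. The function names (`equal_width_edges`, `assign_bins`), the sample values, and the choice of equal-width rather than quantile binning are assumptions made for illustration; the disclosure does not prescribe a particular binning scheme.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points of an equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def assign_bins(values, edges):
    """Map each value to a bin index in the range 0..len(edges)."""
    # bisect_right counts how many cut points are <= the value.
    return [bisect_right(edges, v) for v in values]

# Hypothetical continuous feature (for example, an age column).
ages = [18, 22, 25, 31, 38, 44, 52, 60, 67, 78]
edges = equal_width_edges(ages, 3)
bins = assign_bins(ages, edges)
print(edges)  # [38.0, 58.0]
print(bins)   # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

- A value equal to a cut point falls into the higher bin here; that boundary convention is a design choice of this sketch.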
- In federated feature engineering there are usually two parties.
- The labeled party is the Guest party (business party).
- The unlabeled party, which only provides feature data, is the Host party (data party).
- The Guest party hopes to expand the feature dimensions of its data through federation.
- The Host cannot obtain the labels, and the Guest does not learn the Host's feature values, so the feature engineering calculation is completed while the security and privacy of both parties are preserved.
- data sharing can be realized while ensuring the security of the data of both parties.
- FIG. 1 shows an exemplary system architecture 100 to which the embodiment of the data processing method for federated feature engineering or the data processing device for federated feature engineering of the present disclosure can be applied.
- The system architecture 100 may include a business party 101, a third party 102 and a data party 103. Communication between the business party 101 and the third party 102, and between the third party 102 and the data party 103, can be carried out over a network.
- The network may include various connection types, such as wired or wireless communication links, or fiber optic cables.
- the business party 101 and the data party 103 may have relevant data for the same object, for example, the business party 101 may be a clothing production factory, and the data party 103 may be a clothing sales website.
- the third party 102 may be a party independent from the business party 101 and the data party 103, and is a trusted party. In order to avoid data security risks that may be caused by the business party 101 and the data party 103 during data interaction, in this embodiment, both the business party 101 and the data party 103 can send data to the third party 102 to improve data security.
- The business party 101, the third party 102 and the data party 103 may each be hardware or software.
- When the business party 101, the third party 102, and the data party 103 are hardware, they can be implemented as a distributed server cluster composed of multiple electronic devices, or as a single server.
- When the business party 101, the third party 102, and the data party 103 are software, they can be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
- The data processing method for federated feature engineering provided by the embodiments of the present disclosure is generally executed by the third party 102.
- The data processing device for federated feature engineering is generally set in the third party 102.
- FIG. 2 shows a flow 200 of an embodiment of a data processing method for federated feature engineering according to the present disclosure.
- the data processing method for federated feature engineering in this embodiment includes the following steps:
- Step 201 receiving the first sample identifier of the sample data sent by the business party, the ciphertext label corresponding to the first sample identifier, and the second sample identifier of the sample data sent by the data party, where the ciphertext label includes a first label and a second label.
- the execution subject of the data processing method for federated feature engineering can receive data from the business party and the data party respectively.
- the execution subject may receive the first sample identifier of the sample data from the business party, and receive the second sample identifier of the sample data from the data party.
- both the first sample identifier and the second sample identifier are character strings used to identify sample data.
- the business party and the data party can encrypt the identification of the original sample data to obtain the first sample identification and the second sample identification.
- the encryption here can adopt the homomorphic encryption method.
- Homomorphic encryption is a form of encryption that allows people to perform specific forms of algebraic operations on ciphertext, allowing people to perform operations such as retrieval and comparison in encrypted data without decrypting the data.
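- To make the idea of operating on ciphertext concrete, the sketch below implements a toy additively homomorphic scheme in Python. The disclosure does not name a specific scheme; the use of Paillier encryption, the function names, and the small fixed primes are all assumptions for illustration, and the key size is far too small to be secure.

```python
import math
import secrets

def keygen(p=2**31 - 1, q=2**61 - 1):
    """Build a toy Paillier key pair from two known Mersenne primes.

    The sizes are far too small for real use; this is a demonstration only.
    """
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)          # modular inverse (Python 3.8+)
    return n, (phi, mu)           # public key n, private key (phi, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n, using g = n + 1."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:    # r must be invertible modulo n
        r = secrets.randbelow(n - 1) + 1
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c with the private key."""
    phi, mu = private_key
    n2 = n * n
    return (pow(c, phi, n2) - 1) // n * mu % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)

n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 3), encrypt(n, 4))
total = decrypt(n, private_key, c_sum)
print(total)  # 7
```

- The party that adds the two ciphertexts never sees 3, 4, or 7; only the holder of the private key can decrypt the result.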
- the execution subject can also receive the ciphertext label from the business party.
- the ciphertext label includes a first label and a second label.
- The first label can represent positive samples, and the second label can represent negative samples.
- Step 202 Determine the target sample ID according to the first sample ID and the second sample ID, and send the target sample ID to the data party.
- The execution subject can process the first sample ID and the second sample ID in various ways to determine the sample IDs they have in common, use those as the target sample IDs, and send the target sample IDs only to the data party.
- the above processing may be processing such as decryption and hash operation.
- the execution subject directly sends only the target sample ID to the data party, which can prevent the business party from guessing the original data of the data party based on the target sample ID, thereby improving data security.
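- One way to picture the hash-based variant of this alignment is the Python sketch below. It assumes, purely for illustration, that both parties blind their raw IDs with SHA-256 and a salt they share with each other but not with the third party; the names `blind_ids`, `align` and `SALT` are hypothetical and do not come from the disclosure.

```python
import hashlib

def blind_ids(raw_ids, salt):
    """Run by each party: replace raw IDs with salted SHA-256 digests."""
    return [hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
            for raw_id in raw_ids]

def align(first_sample_ids, second_sample_ids):
    """Run by the third party: keep the identifiers present on both sides."""
    return sorted(set(first_sample_ids) & set(second_sample_ids))

SALT = "demo-salt"  # hypothetical value agreed between the two parties
first_ids = blind_ids(["u001", "u002", "u003"], SALT)   # business party
second_ids = blind_ids(["u002", "u003", "u004"], SALT)  # data party
target_ids = align(first_ids, second_ids)
print(len(target_ids))  # 2
```

- The third party compares digests only. Plain hashing of low-entropy IDs can be brute-forced by anyone who knows the salt, which is one reason the stronger alignment schemes exist.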
- Step 203 Determine the target ciphertext label according to the ciphertext label and the target sample ID, and send the target ciphertext label to the data party.
- the execution subject may also determine the label corresponding to the target sample identifier from the above-mentioned ciphertext labels as the target ciphertext label after determining the target sample identifier.
- The ciphertext label includes the correspondence between each label and the first sample identifier. Using this correspondence, the execution subject can search the ciphertext labels to determine the target ciphertext label, and then send the target ciphertext label to the data party.
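- The lookup described here can be sketched as a simple dictionary filter. In the Python fragment below, the ciphertexts are stood in for by opaque placeholder strings and the identifiers are invented; `select_target_labels` is a hypothetical helper name.

```python
def select_target_labels(ciphertext_labels, target_ids):
    """Keep only the ciphertext label pairs whose sample ID was aligned."""
    return {sid: ciphertext_labels[sid]
            for sid in target_ids if sid in ciphertext_labels}

# Each sample ID maps to a (first label, second label) ciphertext pair.
# The strings are placeholders for real ciphertexts.
ciphertext_labels = {
    "id_a": ("enc_y_a", "enc_1_minus_y_a"),
    "id_b": ("enc_y_b", "enc_1_minus_y_b"),
    "id_c": ("enc_y_c", "enc_1_minus_y_c"),
}
target_ids = ["id_b", "id_c"]
target_ciphertext_labels = select_target_labels(ciphertext_labels, target_ids)
print(sorted(target_ciphertext_labels))  # ['id_b', 'id_c']
```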
- Step 204 in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample ID and the target ciphertext label, calculate and output the parameters corresponding to the target sample identifier based on the target ciphertext label and the per-bucket sums of the first and second labels.
- The data party can perform feature bucketing on its original data, that is, divide the original data into multiple buckets (bins).
- The data in each bucket corresponds to sample IDs, so the data party can perform calculations based on the sample IDs and ciphertext labels corresponding to the data in each bucket to obtain the sum of the first labels and the sum of the second labels of each bucket.
- the execution subject can combine the target ciphertext label to calculate the parameters corresponding to the target sample ID.
- the above parameters may include a WOE value and an IV value.
- the execution subject can determine the number of positive labels and the number of negative labels from the target ciphertext labels. According to the calculation formula of WOE value and IV value, the above parameters are calculated.
- FIG. 3 shows a schematic diagram of an application scenario of the data processing method for federated feature engineering according to the present disclosure.
- The bank 301 sends the sample IDs of its users' sample data and the encrypted credit labels to the third party 302.
- The e-commerce platform 303 sends the sample IDs of its users' consumption data to the third party 302.
- The third party 302 calculates the WOE value and IV value after performing the processing of steps 201 to 204.
- Meaningful features are then selected to train a model for credit risk prediction.
- The data processing method for federated feature engineering provided by the above embodiments of the present disclosure can improve data security by introducing a third party between the business party and the data party.
- FIG. 4 shows a flow 400 of another embodiment of the data processing method for federated feature engineering according to the present disclosure.
- the method of this embodiment may include the following steps:
- Step 401 receiving the first sample ID of the sample data sent by the business party, the ciphertext label corresponding to the first sample ID, and the second sample ID of the sample data sent by the data party.
- Step 402 aligning the first sample ID and the second sample ID, determining the sample ID shared by the business party and the data party as the target sample ID and sending it to the data party.
- the execution subject may align the first sample identifier and the second sample identifier.
- the execution subject can use existing sample ID alignment schemes, such as encrypted sample ID alignment based on RSA encryption/decryption algorithm and hash algorithm, encrypted sample ID alignment based on Diffie-Hellman, and so on.
- The execution subject may determine the sample identifiers shared by the first sample identifier and the second sample identifier as the target sample identifier, and send the target sample ID to the data party.
- Step 403 according to the ciphertext label and the target sample identification, determine the target ciphertext label and send it to the data party.
- The business party can generate a public-private key pair before sending the ciphertext label, and send the public key to the execution subject. It then uses the public key to encrypt the original labels to obtain the ciphertext label.
- The ciphertext label can be represented by {<y>, <1-y>}.
- <y> may be called the first label.
- <1-y> may be called the second label.
- The execution subject can determine the target ciphertext label from the above ciphertext labels {<y>, <1-y>}; it is represented by {<y_n>, <1-y_n>}.
- Step 404 receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample ID and the target ciphertext label.
- the data party can perform feature bucketing, and calculate the sum of the first label and the sum of the second label of each bucket respectively.
- The sum of the first labels can be recorded as sum(<y_bin_i>), and the sum of the second labels can be recorded as sum(<1-y_bin_i>).
- The data party can send {sum(<y_bin_i>), sum(<1-y_bin_i>)} to the execution subject.
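- The per-bucket summation performed by the data party can be sketched as follows. The Python code assumes a toy Paillier scheme (the disclosure only says homomorphic encryption) with an insecure key size, and the names `enc`, `dec`, `add`, `bucket_sums` and the sample data are hypothetical. The final decryption is shown only to check the result; in the described flow the data party holds no private key.

```python
import math
import secrets

# Toy Paillier setup (assumed scheme, insecure key size; demonstration only).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)
MU = pow(PHI, -1, N)

def enc(m):
    """Encrypt integer m with the public key N."""
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (1 + m * N) * pow(r, N, N2) % N2

def dec(c):
    """Decrypt with the private key (PHI, MU)."""
    return (pow(c, PHI, N2) - 1) // N * MU % N

def add(c1, c2):
    """Homomorphic addition of two ciphertexts."""
    return c1 * c2 % N2

def bucket_sums(bucket_of, enc_labels):
    """Data party: add the first and second labels inside each bucket."""
    sums = {}
    for sid, (c_y, c_not_y) in enc_labels.items():
        b = bucket_of[sid]
        if b in sums:
            sums[b] = (add(sums[b][0], c_y), add(sums[b][1], c_not_y))
        else:
            sums[b] = (c_y, c_not_y)
    return sums

# Business party: encrypt y and 1 - y for each aligned sample.
labels = {"u1": 1, "u2": 0, "u3": 1, "u4": 0, "u5": 1}
enc_labels = {sid: (enc(y), enc(1 - y)) for sid, y in labels.items()}

# Data party: its own feature bucketing of the same samples.
bucket_of = {"u1": 0, "u2": 0, "u3": 1, "u4": 1, "u5": 1}
sums = bucket_sums(bucket_of, enc_labels)

# Check only: decrypt the per-bucket sums.
decrypted = {b: (dec(cp), dec(cn)) for b, (cp, cn) in sums.items()}
print(decrypted)  # {0: (1, 1), 1: (2, 1)}
```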
- Step 405 according to the target ciphertext label, determine the sum of positive labels and the sum of negative labels.
- the execution subject can perform split statistics on the target ciphertext label, and determine the sum of samples with the same label. After parsing, the execution subject can determine the sum of positive labels and the sum of negative labels.
- the execution subject can calculate the sum of positive labels and the sum of negative labels through the following steps:
- Step 4051 determine the sum of the first label and the sum of the second label in the target ciphertext label.
- Step 4052 Add the sum of the first label to the randomly generated first mask, add the sum of the second label to the randomly generated second mask, encrypt the two obtained sums and send them to the business party.
- Step 4053 receiving the first data obtained by decrypting the two encrypted sum values from the business party, and determining the sum of the positive label and the sum of the negative label according to the first data, the first mask, and the second mask.
- The execution subject can first determine the first labels and the second labels in the target ciphertext label, and then calculate the sum of the first labels, sum(<y_n>), and the sum of the second labels, sum(<1-y_n>). It can then randomly generate two masks, denoted the first mask <mask_a> and the second mask <mask_b>. It adds the sum of the first labels to the first mask and the sum of the second labels to the second mask, encrypted with the public key, to obtain the data {sum(<y_n>)+<mask_a>, sum(<1-y_n>)+<mask_b>}, and sends this data to the business party.
- The business party can decrypt the above data to obtain Dec(sum(<y_n>)+<mask_a>) and Dec(sum(<1-y_n>)+<mask_b>).
- the business party can use the private key paired with the above public key to decrypt.
- the business party sends the above data to the execution subject.
- the execution subject subtracts the corresponding first mask and second mask from the above data to obtain the sum of positive labels pos_total and the sum of negative labels neg_total.
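- The mask-and-decrypt exchange in steps 4051 to 4053 can be sketched in Python as below. It assumes a toy Paillier scheme with an insecure key size, 40-bit random masks, and hypothetical function names for the three roles; none of these specifics come from the disclosure.

```python
import math
import secrets

# Toy Paillier setup (assumed scheme, insecure key size; demonstration only).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)      # private key material, held by the business party
MU = pow(PHI, -1, N)

def enc(m):
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (1 + m * N) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, PHI, N2) - 1) // N * MU % N

def add(c1, c2):
    return c1 * c2 % N2

def third_party_mask(enc_pairs):
    """Third party: sum the ciphertext labels, then add encrypted random masks."""
    c_pos, c_neg = enc(0), enc(0)
    for c_y, c_not_y in enc_pairs:
        c_pos, c_neg = add(c_pos, c_y), add(c_neg, c_not_y)
    mask_a, mask_b = secrets.randbelow(2**40), secrets.randbelow(2**40)
    masked = (add(c_pos, enc(mask_a)), add(c_neg, enc(mask_b)))
    return masked, (mask_a, mask_b)

def business_party_decrypt(masked):
    """Business party: decrypt with the private key; it only sees masked sums."""
    return tuple(dec(c) for c in masked)

def third_party_unmask(first_data, masks):
    """Third party: subtract its masks to recover the plaintext totals."""
    return tuple((d - m) % N for d, m in zip(first_data, masks))

labels = [1, 0, 1, 1, 0, 0, 1]
enc_pairs = [(enc(y), enc(1 - y)) for y in labels]  # done by the business party
masked, masks = third_party_mask(enc_pairs)
first_data = business_party_decrypt(masked)
pos_total, neg_total = third_party_unmask(first_data, masks)
print(pos_total, neg_total)  # 4 3
```

- Because the masks are known only to the third party, the values the business party decrypts do not reveal how many aligned samples are positive or negative.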
- Step 406 according to the sum of the first label and the sum of the second label of each bucket, determine the number of positive labels and the number of negative labels of each bucket.
- the execution subject may directly use the sum of the first labels of each bucket as the number of positive labels of each bucket, and the sum of the second labels of each bucket as the number of negative labels of each bucket.
- the execution subject can calculate the number of positive labels and the number of negative labels of each bucket through the following steps:
- Step 4061 Add the sum of the first labels of each bucket to the randomly generated third mask, add the sum of the second labels of each bucket to the randomly generated fourth mask, encrypt the two resulting sums, and send them to the business party.
- Step 4062 receiving the second data obtained by the business party by decrypting the two encrypted sums, and determining the number of positive labels and negative labels for each bucket according to the second data, the third mask, and the fourth mask.
- The execution subject may first generate multiple third masks and multiple fourth masks, denoted <mask_c> and <mask_d> respectively. It can then add the sum of the first labels of each bucket received from the data party, sum(<y_bin_i>), to a third mask <mask_c>, and add the sum of the second labels of each bucket, sum(<1-y_bin_i>), to a fourth mask <mask_d>, encrypted with the public key, to obtain the data {sum(<y_bin_i>)+<mask_c>, sum(<1-y_bin_i>)+<mask_d>}.
- The execution subject can send the above data to the business party.
- The business party can use the private key paired with the above public key to decrypt, obtaining the data Dec(sum(<y_bin_i>)+<mask_c>) and Dec(sum(<1-y_bin_i>)+<mask_d>).
- the business party can send the above data to the execution subject, and the execution subject can subtract the corresponding mask from the above data to obtain the number of positive labels npos_i and the number of negative labels nneg_i for each bucket.
- Step 407 Calculate and output parameters corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and negative labels in each bucket.
- the execution subject can use the above parameter values to calculate the parameters corresponding to the target sample identification, such as WOE value and IV value.
- the execution subject may calculate the above parameters through the following steps:
- Step 4071 according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels in each bucket, and at least two preset parameters, calculate and output the parameters corresponding to the target sample identifier.
- npos_i is the number of positive samples in the i-th bin
- nneg_i is the number of negative samples in the i-th bin
- pos_total is the total number of positive samples
- neg_total is the total number of negative samples.
- The two preset parameters are preset values, the first lying between 0 and 1 and the second between 0 and 0.02.
- Formula (3) does not affect the WOE value in the normal case, and it also yields a WOE value in special cases, so the calculation of subsequent IV values is not affected.
- Formula (4) does not affect the IV value when IV can be calculated normally, and it also applies when the feature data provided by the data party all belong to the same category. In that case, the calculated IV value is the preset value bounded by 0.02.
- When the IV value is less than 0.02, the characteristic variable has almost no predictive ability. This not only prevents the data party from inferring from the IV value that the data it provided belongs to a single category, but also shows that the predictive ability of the variable is extremely small, so that the Guest party's labels are not disclosed.
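- The calculation in step 407 can be illustrated with the conventional WOE and IV definitions. The Python sketch below is not the disclosure's formulas (3) and (4), which are not reproduced in this text; the additive smoothing constant `alpha`, the fallback value `beta`, and the rule of returning `beta` when only one bucket exists are stand-in assumptions chosen to mimic the behaviour described above.

```python
import math

def woe_iv(buckets, pos_total, neg_total, alpha=0.5, beta=0.01):
    """Compute per-bucket WOE and the feature IV.

    buckets: list of (npos_i, nneg_i) counts per bucket.
    alpha:   hypothetical smoothing constant that keeps the logarithm
             defined when a bucket has zero positives or negatives.
    beta:    hypothetical IV returned when all samples fall into a
             single bucket (value below 0.02).
    """
    if len(buckets) < 2:
        return [0.0] * len(buckets), beta
    woes, iv = [], 0.0
    for npos, nneg in buckets:
        p = (npos + alpha) / pos_total   # share of positives in this bucket
        q = (nneg + alpha) / neg_total   # share of negatives in this bucket
        w = math.log(p / q)              # conventional WOE_i = ln(p / q)
        woes.append(w)
        iv += (p - q) * w                # conventional IV contribution
    return woes, iv

# Unsmoothed check against the conventional formulas.
woes, iv = woe_iv([(30, 10), (10, 30)], pos_total=40, neg_total=40, alpha=0.0)
print(round(iv, 4))  # 1.0986

# Degenerate feature: every sample lands in one bucket.
print(woe_iv([(40, 40)], 40, 40)[1])  # 0.01
```

- With `alpha=0.0` the function reduces to the textbook definitions, so a bucket with a zero count would fail; a positive `alpha` trades a small bias for a value that is always defined.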
- The execution subject can output the obtained parameters to the data party, so that the business party cannot learn the WOE value and the IV value, which prevents the business party from obtaining information about the value of the data party's data.
- FIG. 5 shows a schematic diagram of the interaction process of the three parties (Guest party, third party and Host party) in this embodiment.
- the specific steps of federated feature engineering can be as follows:
- Step 1: the Guest and the Host transmit the encrypted sample IDs to a trusted third party, and the trusted third party aligns the encrypted samples to obtain the sample IDs shared by both parties.
- Encryption methods include but are not limited to asymmetric encryption, hash algorithms, homomorphic encryption, etc.
- The trusted third party can distribute the public key to the Guest and the Host.
- The Guest and the Host use the public key to encrypt the sample IDs and transmit them to the trusted third party.
- The trusted third party uses the private key to decrypt and align the samples.
- The Guest and the Host can also perform a hash operation on the sample IDs, allowing the trusted third party to calculate and compare the hash values to obtain aligned samples.
- Step 2: the trusted third party sends the aligned sample IDs to the Host, and the Host performs feature binning on the aligned sample features.
- Step 3: the Guest encrypts the labels and sends them to the trusted third party, and the trusted third party sends the aligned ciphertext labels to the Host.
- Step 4: the trusted third party adds different masks to the aligned ciphertext positive and negative labels, sends them to the Guest for decryption, and subtracts the masks from the returned result to obtain the sums of positive and negative labels.
- Step 5: the Host calculates the number of positive and negative labels of the ciphertext for each feature bucket, and sends it to the trusted third party.
- Step 6: the trusted third party adds different masks to the results of the feature bucket calculation, sends them to the Guest for decryption, and subtracts the masks from the returned result to obtain the number of positive and negative labels under each feature bucket.
- Step 7: the trusted third party calculates the WOE and IV values based on the results of steps 4 and 6, and finally transmits the IV value of each feature to the Host for storage.
- a trusted third party is responsible for aligning data, and only sends the aligned sample ID to the Host.
- Adding a mask makes it impossible for the Guest to learn the proportion of labels among the Host's data samples.
- The third party calculates the WOE and IV values of the features provided by the Host, which guarantees that the Guest cannot obtain information about the value of the Host's data.
- The present disclosure provides an embodiment of a data processing device for federated feature engineering, which corresponds to the method embodiment shown in FIG. 2.
- the device can be specifically applied to various electronic devices.
- The data processing device 600 for federated feature engineering in this embodiment includes: a data receiving unit 601, an identifier sending unit 602, a label sending unit 603 and an information output unit 604.
- The data receiving unit 601 is configured to receive the first sample identifier of the sample data sent by the business party, the ciphertext label corresponding to the first sample identifier, and the second sample identifier of the sample data sent by the data party, where the ciphertext label includes the first label and the second label.
- the identifier sending unit 602 is configured to determine the target sample identifier and send it to the data party according to the first sample identifier and the second sample identifier.
- the label sending unit 603 is configured to determine the target ciphertext label and send it to the data party according to the ciphertext label and the target sample identification.
- The information output unit 604 is configured to, in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output the parameters corresponding to the target sample identifier based on the target ciphertext label and the per-bucket sums of the first and second labels.
- the identifier sending unit 602 may be further configured to: align the first sample identifier and the second sample identifier, and determine that the sample identifier shared by the business party and the data party is the target sample The identification is sent to the data party.
- the information output unit 604 may be further configured to: determine the sum of positive labels and the sum of negative labels according to the target ciphertext label; and the sum of the second labels to determine the number of positive labels and the number of negative labels in each bucket; according to the sum of positive labels, the sum of negative labels, and the number of positive labels and negative labels in each bucket, calculate and output the corresponding parameters.
- the information output unit 604 may be further configured to: determine the sum of the first label and the sum of the second label in the target ciphertext label; respectively combine the sum of the first label and the random Add the generated first mask, add the sum of the second tag and the randomly generated second mask, encrypt the two obtained sums and send them to the business party; the receiving business party will encrypt the encrypted two sums
- the first data obtained by decrypting is determined according to the first data and the first mask and the second mask to determine the sum of positive labels and the sum of negative labels.
- the information output unit 604 may be further configured to: add a randomly generated third mask to the sum of first labels of each bucket and a randomly generated fourth mask to the sum of second labels of each bucket, encrypt the two resulting sums, and send them to the business party; receive second data obtained by the business party decrypting the two encrypted sums; and determine the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.
- the information output unit 604 may be further configured to: calculate and output the parameters corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels in each bucket, and at least two preset parameters.
- the information output unit 604 may be further configured to: output the calculated at least one parameter to the data party.
- the device 600 may further include an encryption unit, not shown in FIG. 6, which is configured to: receive the public key sent by the business party; and encrypt using the public key, so that the business party can decrypt using the private key paired with the public key.
- the units 601 to 605 recorded in the data processing apparatus 600 for federated feature engineering correspond to the steps in the method described with reference to FIG. 2 . Therefore, the operations and features described above for the data processing method for federated feature engineering are also applicable to the device 600 and the units contained therein, and will not be repeated here.
- the acquisition, storage and application of the user's personal information involved are in compliance with relevant laws and regulations, and do not violate public order and good customs.
- the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
- FIG. 7 shows a block diagram of an electronic device 700 performing a data processing method for federated feature engineering according to an embodiment of the present disclosure.
- Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
- Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
- an electronic device 700 includes a processor 701 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or loaded from a memory 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data necessary for the operation of the electronic device 700.
- the processor 701, ROM 702, and RAM 703 are connected to each other through a bus 704.
- An I/O interface (input/output interface) 705 is also connected to the bus 704 .
- the I/O interface 705 includes: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a memory 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver.
- the communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
- Processor 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of processor 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc.
- the processor 701 executes various methods and processes described above, for example, a data processing method for federated feature engineering.
- data processing methods for federated feature engineering may be implemented as a computer software program tangibly embodied on a machine-readable storage medium, such as memory 708.
- part or all of the computer program can be loaded and/or installed on the electronic device 700 via the ROM 702 and/or the communication unit 709.
- when the computer program is loaded into RAM 703 and executed by processor 701, one or more steps of the data processing method for federated feature engineering described above can be performed.
- the processor 701 may be configured in any other appropriate way (for example, by means of firmware) to execute a data processing method for federated feature engineering.
- Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
- the programmable processor can be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
- Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages.
- the above program code can be packaged into a computer program product.
- These program codes or computer program products may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, so that the program codes, when executed by the processor 701, cause the functions/operations specified in the flow diagrams and/or block diagrams to be implemented.
- the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
- a machine-readable storage medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
- the machine-readable storage medium may be a machine-readable signal storage medium or a machine-readable storage medium.
- a machine-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
- the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
- Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).
- the systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.
- a computer system may include clients and servers.
- Clients and servers are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
- the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and Virtual Private Server (VPS) services.
- the server can also be a server of a distributed system, or a server combined with a blockchain.
- steps may be reordered, added or deleted using the various forms of flow shown above.
- the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order; as long as the desired result of the technical solution of the present disclosure can be achieved, no limitation is imposed herein.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Economics (AREA)
- Technology Law (AREA)
- Strategic Management (AREA)
- Artificial Intelligence (AREA)
- Marketing (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- General Business, Economics & Management (AREA)
- Computing Systems (AREA)
- Development Economics (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Storage Device Security (AREA)
Abstract
The present disclosure provides a data processing method, apparatus, and device for federated feature engineering, and a medium, and relates to the field of deep learning. The specific implementation solution is as follows: receiving a first sample identifier of sample data sent by a service party as well as a ciphertext label corresponding to the first sample identifier, and receiving a second sample identifier of sample data sent by a data party; according to the first sample identifier and the second sample identifier, determining a target sample identifier and sending same to the data party; according to the ciphertext label and the target sample identifier, determining a target ciphertext label and sending same to the data party; in response to receiving the sum of first labels and the sum of second labels of sub-buckets obtained by calculation after the data party performs feature bucketing on the basis of the target sample identifier and target ciphertext label, on the basis of the target ciphertext label and the sum of first labels and the sum of the second labels of the sub-buckets, calculating and outputting a parameter corresponding to the target sample identifier.
Description
This application claims priority to application No. 202111078529.1, filed on September 15, 2021, the entirety of which is incorporated herein by reference.
The present disclosure relates to the field of computer technology, specifically to the fields of deep learning and data processing, and in particular to a data processing method, apparatus, and device for federated feature engineering, and a medium.
To address data silos and data privacy and security, the current mainstream approach is to use federated learning to jointly train on data from different sources and obtain better models for solving practical problems. According to how the data is distributed, federated learning is divided into horizontal federated learning, vertical federated learning, and transfer learning. Of these, vertical federated learning is widely used.
Summary of the Invention
The present disclosure provides a data processing method, apparatus, and device for federated feature engineering, and a medium.
According to a first aspect, a data processing method for federated feature engineering is provided, including: receiving a first sample identifier of sample data sent by a business party and a ciphertext label corresponding to the first sample identifier, and receiving a second sample identifier of sample data sent by a data party, the ciphertext label including a first label and a second label; determining a target sample identifier according to the first sample identifier and the second sample identifier, and sending the target sample identifier to the data party; determining a target ciphertext label according to the ciphertext label and the target sample identifier, and sending the target ciphertext label to the data party; and in response to receiving a sum of first labels and a sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculating and outputting a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.
According to a second aspect, a data processing apparatus for federated feature engineering is provided, including: a data receiving unit configured to receive a first sample identifier of sample data sent by a business party and a ciphertext label corresponding to the first sample identifier, and to receive a second sample identifier of sample data sent by a data party, the ciphertext label including a first label and a second label; an identifier sending unit configured to determine a target sample identifier according to the first sample identifier and the second sample identifier, and to send the target sample identifier to the data party; a label sending unit configured to determine a target ciphertext label according to the ciphertext label and the target sample identifier, and to send the target ciphertext label to the data party; and an information output unit configured to, in response to receiving a sum of first labels and a sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.
According to a third aspect, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the first aspect.
According to a fourth aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being used to cause a computer to perform the method described in the first aspect.
According to a fifth aspect, a computer program product is provided, including a computer program which, when executed by a processor, implements the method described in the first aspect.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
The accompanying drawings are provided for a better understanding of the solution and do not limit the present disclosure. In the drawings:
FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;
FIG. 2 is a flowchart of an embodiment of a data processing method for federated feature engineering according to the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of a data processing method for federated feature engineering according to the present disclosure;
FIG. 4 is a flowchart of another embodiment of a data processing method for federated feature engineering according to the present disclosure;
FIG. 5 is a schematic diagram of the three-party interaction process in the embodiment shown in FIG. 4;
FIG. 6 is a schematic structural diagram of an embodiment of a data processing apparatus for federated feature engineering according to the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing the data processing method for federated feature engineering according to an embodiment of the present disclosure.
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features of the embodiments may be combined with one another. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
In traditional machine learning, feature engineering is an essential step. Common feature engineering techniques include data preprocessing, feature selection, and feature dimensionality reduction. After data preprocessing, meaningful features must be selected to train the model, typically by using common indicators such as WOE (Weight of Evidence) and IV (Information Value) to analyze each feature's ability to predict the label. Calculating WOE and IV values requires binning the data, that is, discretizing continuous variables. Binning speeds up model iteration and can reduce the risk of overfitting, and it is a commonly used data preprocessing method.
In federated feature engineering there are usually two parties: the party that holds the labels is the Guest (business party), and the party that has no labels and provides only feature data is the Host (data party). The Guest wishes to expand the feature dimensions of its data through federation. No plaintext data is transmitted during the interaction between the two parties: the Host cannot obtain the labels, and the Guest does not learn the Host's feature values, so the feature engineering calculation is completed while protecting the security and privacy of both parties.
In the related art, for example in a financial scenario, a bank or other financial institution holds credit labels, while an e-commerce platform holds users' consumption data, and the two have users in common. The bank could use the e-commerce platform's data for credit risk prediction, but the two parties cannot share data with each other; vertical federated learning can be used to solve this kind of problem.
The technology of the present disclosure enables data sharing while ensuring the security of both parties' data.
FIG. 1 shows an exemplary system architecture 100 to which embodiments of the data processing method for federated feature engineering or the data processing apparatus for federated feature engineering of the present disclosure can be applied.
As shown in FIG. 1, the system architecture 100 may include a business party 101, a third party 102, and a data party 103. The business party 101 and the third party 102, and the third party 102 and the data party 103, may be communicatively connected through a network. The network may include various connection types, such as wired or wireless communication links or fiber optic cables.
The business party 101 and the data party 103 may hold related data about the same objects; for example, the business party 101 may be a clothing factory and the data party 103 may be a clothing sales website. The third party 102 may be a trusted party that is independent of the business party 101 and the data party 103. To avoid the data security risks that could arise if the business party 101 and the data party 103 exchanged data directly, in this embodiment both the business party 101 and the data party 103 may send their data to the third party 102, which improves data security.
It should be noted that the business party 101, the third party 102, and the data party 103 may be hardware or software. When they are hardware, each may be implemented as a distributed server cluster composed of multiple electronic devices or as a single server. When they are software, each may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
It should be noted that the data processing method for federated feature engineering provided by the embodiments of the present disclosure is generally executed by the third party 102. Correspondingly, the data processing apparatus for federated feature engineering is generally provided in the third party 102.
It should be understood that the numbers of business parties, third parties, and data parties in FIG. 1 are merely illustrative. There may be any number of business parties, third parties, and data parties, as required by the implementation.
With continued reference to FIG. 2, a flow 200 of an embodiment of a data processing method for federated feature engineering according to the present disclosure is shown. The data processing method for federated feature engineering of this embodiment includes the following steps:
Step 201: receive a first sample identifier of sample data sent by the business party and a ciphertext label corresponding to the first sample identifier, and receive a second sample identifier of sample data sent by the data party, the ciphertext label including a first label and a second label.
In this embodiment, the execution subject of the data processing method for federated feature engineering (for example, the third party 102 shown in FIG. 1) may receive data from the business party and from the data party. Specifically, the execution subject may receive the first sample identifier of the sample data from the business party and the second sample identifier of the sample data from the data party. Here, the first sample identifier and the second sample identifier are both character strings used to identify sample data. The business party and the data party may each encrypt the identifiers of their original sample data to obtain the first sample identifier and the second sample identifier. Homomorphic encryption may be used for this encryption. Homomorphic encryption is a form of encryption that allows specific kinds of algebraic operations to be performed on ciphertext, so that operations such as retrieval and comparison can be carried out on encrypted data without decrypting it. The execution subject may also receive the ciphertext label from the business party. Here, the ciphertext label includes a first label and a second label. The first label may indicate a positive sample, and the second label may indicate a negative sample.
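To make the homomorphic property concrete, the following is a minimal sketch of an additively homomorphic scheme. It assumes a toy Paillier cryptosystem with small fixed primes; the disclosure only says that homomorphic encryption may be used and does not name a scheme, so Paillier, the function names, and the encoding of the first and second labels as the indicators `y` and `1 - y` are all illustrative assumptions.

```python
import math
import secrets
from functools import reduce

# Toy parameters (assumption): two Mersenne primes. Real keys use far larger random primes.
P, Q = 2**31 - 1, 2**61 - 1

def keygen(p=P, q=Q):
    """Return the public modulus n and the private pair (lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # with generator g = n + 1, mu is the inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) as (1 + m*n) * r^n mod n^2."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2

def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)

def decrypt(n, priv, c):
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

# Hypothetical labels: 1 marks a positive sample, 0 a negative sample.
n, priv = keygen()
labels = [1, 0, 1, 1]
enc_first = [encrypt(n, y) for y in labels]        # assumed "first label" ciphertexts
enc_second = [encrypt(n, 1 - y) for y in labels]   # assumed "second label" ciphertexts
sum_first = reduce(lambda a, b: add(n, a, b), enc_first)
sum_second = reduce(lambda a, b: add(n, a, b), enc_second)
```

Only the holder of the private key can turn `sum_first` and `sum_second` back into counts, which is why a party without the key can add up ciphertext labels without learning any individual label.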
Step 202: determine a target sample identifier according to the first sample identifier and the second sample identifier, and send the target sample identifier to the data party.
In this embodiment, after receiving the first sample identifier and the second sample identifier, the execution subject may perform various kinds of processing on each of them to determine the sample identifiers that appear in both the first sample identifier and the second sample identifier as the target sample identifier, and send the target sample identifier only to the data party. Specifically, the processing may be decryption, a hash operation, or the like. Because the execution subject sends the target sample identifier only to the data party, the business party cannot use the target sample identifier to guess at the data party's original data, which improves data security.
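The alignment in step 202 can be sketched as a set intersection over blinded identifiers. The sketch below assumes that both parties blind their raw identifiers with SHA-256 and a salt they share with each other but not with the third party; the salt, the function names `blind_ids` and `align`, and the sample identifiers are hypothetical and are not taken from the disclosure, which only mentions decryption or hashing as possible processing.

```python
import hashlib

def blind_ids(raw_ids, salt):
    """Run by each party locally: hash every raw identifier with the shared salt."""
    return {hashlib.sha256(salt + rid.encode("utf-8")).hexdigest() for rid in raw_ids}

def align(first_ids, second_ids):
    """Run by the third party: identifiers present on both sides become the target identifiers."""
    return sorted(set(first_ids) & set(second_ids))

salt = b"shared-between-business-and-data-party"  # hypothetical value
first_ids = blind_ids(["u1", "u2", "u3"], salt)   # sent by the business party
second_ids = blind_ids(["u2", "u3", "u4"], salt)  # sent by the data party
target_ids = align(first_ids, second_ids)         # sent only to the data party
```

Sorting the intersection is a design choice made here so that the result is deterministic; any stable ordering would do.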
Step 203: determine a target ciphertext label according to the ciphertext label and the target sample identifier, and send the target ciphertext label to the data party.
After determining the target sample identifier, the execution subject may also determine, from the ciphertext labels, the labels corresponding to the target sample identifier as the target ciphertext label. Specifically, the ciphertext label includes the correspondence between labels and first sample identifiers; using this correspondence, the execution subject can search the ciphertext labels to determine the target ciphertext label, and then send the target ciphertext label to the data party.
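The lookup in step 203 amounts to filtering a mapping by key. The sketch below assumes the ciphertext labels arrive as a dictionary from first sample identifier to a pair of opaque ciphertexts; that layout, the function name, and the placeholder strings standing in for ciphertexts are hypothetical.

```python
def select_target_labels(ciphertext_labels, target_ids):
    """Keep only the ciphertext labels whose sample identifier is a target identifier."""
    missing = [sid for sid in target_ids if sid not in ciphertext_labels]
    if missing:
        # Target identifiers come from the business party's own list, so this should not happen.
        raise KeyError(f"no ciphertext label for identifiers: {missing}")
    return {sid: ciphertext_labels[sid] for sid in target_ids}

# Placeholder strings stand in for (first label, second label) ciphertexts.
ciphertext_labels = {
    "id1": ("c1_first", "c1_second"),
    "id2": ("c2_first", "c2_second"),
    "id3": ("c3_first", "c3_second"),
}
target_labels = select_target_labels(ciphertext_labels, ["id2", "id3"])
```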
步骤204,响应于接收到数据方基于目标样本标识以及目标密文标签进行特征分桶后计算得到的各分桶的第一标签之和以及第二标签之和,基于目标密文标签、各分桶的第一标签之和以及第二标签之和,计算以及输出目标样本标识对应的参数。 Step 204, in response to receiving the sum of the first label and the sum of the second label of each bucket calculated by the data party after performing feature bucketing based on the target sample ID and the target ciphertext label, based on the target ciphertext label, each bucket The sum of the first label and the sum of the second label of the bucket, calculate and output the parameters corresponding to the target sample identifier.
After receiving the target sample ID and the target ciphertext label, the data party may perform feature bucketing on its original data, that is, divide the original data into multiple buckets (bins). The data in each bucket correspond to sample IDs, so the data party can operate on the sample IDs and ciphertext labels corresponding to the data in each bucket to obtain the sum of first labels and the sum of second labels for that bucket. It then sends the calculated data to the execution subject. After receiving this data, the execution subject can combine it with the target ciphertext label to calculate the parameters corresponding to the target sample ID. Specifically, the parameters may include a WOE value and an IV value. The execution subject can determine the number of positive labels and the number of negative labels from the target ciphertext label, and then obtain the parameters from the calculation formulas for the WOE value and the IV value.
Continuing to refer to FIG. 3, a schematic diagram of an application scenario of the data processing method for federated feature engineering according to the present disclosure is shown. In the application scenario of FIG. 3, the bank 301 sends the sample IDs of its users' sample data and the encrypted credit labels to the third party 302, and the e-commerce platform 303 sends the sample IDs of its users' consumption data to the third party 302. After performing steps 201 to 204, the third party 302 calculates the WOE value and the IV value, and meaningful features are selected according to these two values to train a model for credit risk prediction.
The data processing method for federated feature engineering provided by the above embodiment of the present disclosure improves data security by introducing a third party between the business party and the data party.
Continuing to refer to FIG. 4, a flow 400 of another embodiment of the data processing method for federated feature engineering according to the present disclosure is shown. As shown in FIG. 4, the method of this embodiment may include the following steps:
Step 401: receive the first sample ID of the sample data and the ciphertext label corresponding to the first sample ID sent by the business party, and receive the second sample ID of the sample data sent by the data party.
Step 402: align the first sample ID and the second sample ID, determine the sample IDs shared by the business party and the data party as the target sample ID, and send the target sample ID to the data party.
In this embodiment, the execution subject may align the first sample ID and the second sample ID. Specifically, the execution subject may use an existing sample ID alignment scheme, for example encrypted sample ID alignment based on the RSA encryption/decryption algorithm and a hash algorithm, or encrypted sample ID alignment based on Diffie-Hellman. After alignment, the execution subject may determine the sample IDs shared by the first sample ID and the second sample ID as the target sample ID, and send the target sample ID to the data party.
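As a rough illustration of alignment on blinded IDs, the sketch below uses a salted SHA-256 digest as a simple stand-in. It is not the RSA-based or Diffie-Hellman-based scheme named above; the salt shared by the two parties, the function names, and the sample IDs are all assumptions made for the example.

```python
import hashlib


def blind(sample_id: str, salt: bytes) -> str:
    """Hash a sample ID with a salt known to both parties but not to the third party."""
    return hashlib.sha256(salt + sample_id.encode("utf-8")).hexdigest()


def align(first_hashes, second_hashes):
    """Executed by the third party: intersect the two sets of blinded IDs."""
    return sorted(set(first_hashes) & set(second_hashes))


salt = b"hypothetical-shared-salt"
business_ids = ["u1", "u2", "u3"]
data_ids = ["u2", "u3", "u4"]

first = [blind(i, salt) for i in business_ids]   # sent by the business party
second = [blind(i, salt) for i in data_ids]      # sent by the data party
common = align(first, second)                    # computed by the third party

# The data party maps the returned digests back to its own sample IDs.
lookup = {blind(i, salt): i for i in data_ids}
target_ids = sorted(lookup[h] for h in common)
```

A plain salted hash is vulnerable to dictionary attacks when the ID space is small, which is one reason the stronger schemes mentioned in the text exist.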
Step 403: determine the target ciphertext label according to the ciphertext label and the target sample ID, and send it to the data party.
In this embodiment, before sending the ciphertext label, the business party may generate a public/private key pair and send the public key to the execution subject. It then encrypts the original labels with the public key to obtain the ciphertext label, which may be written {<y>, <1-y>}, where <y> is called the first label and <1-y> is called the second label. The execution subject can then determine the target ciphertext label from the ciphertext label {<y>, <1-y>}; the target ciphertext label is written {<y_n>, <1-y_n>}.
Step 404: receive the sum of first labels and the sum of second labels of each bucket, calculated by the data party after feature bucketing based on the target sample ID and the target ciphertext label.
After receiving the target sample ID and the target ciphertext label, the data party may perform feature bucketing and calculate, for each bucket, the sum of first labels and the sum of second labels. The sum of first labels may be written sum(<y_bin_i>) and the sum of second labels sum(<1-y_bin_i>). The data party may send {sum(<y_bin_i>), sum(<1-y_bin_i>)} to the execution subject.
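The per-bucket summation on the data party's side can be sketched as below. The sketch assumes fixed bin edges and takes the ciphertext addition as a parameter, because the text does not name the encryption scheme; in the demonstration, plain integers and `operator.add` stand in for ciphertexts and homomorphic addition. The function name `bucket_sums` and all sample values are hypothetical.

```python
from bisect import bisect_right
from operator import add


def bucket_sums(features, enc_y, enc_not_y, edges, he_add, zero):
    """Accumulate sum(<y_bin_i>) and sum(<1-y_bin_i>) for every bucket.

    features:  sample ID -> feature value held by the data party
    enc_y:     sample ID -> first label <y_n>
    enc_not_y: sample ID -> second label <1-y_n>
    edges:     sorted bin boundaries; len(edges) + 1 buckets result
    he_add:    addition on ciphertexts; zero: its neutral element
    """
    n_bins = len(edges) + 1
    sums_y = [zero] * n_bins
    sums_not_y = [zero] * n_bins
    for sid, value in features.items():
        b = bisect_right(edges, value)  # index of the bucket containing value
        sums_y[b] = he_add(sums_y[b], enc_y[sid])
        sums_not_y[b] = he_add(sums_not_y[b], enc_not_y[sid])
    return sums_y, sums_not_y


# Plaintext stand-ins: in the real flow the data party only sees ciphertexts.
features = {"u1": 12.0, "u2": 35.0, "u3": 58.0, "u4": 40.0}
y = {"u1": 1, "u2": 0, "u3": 1, "u4": 1}
not_y = {sid: 1 - label for sid, label in y.items()}

sums_y, sums_not_y = bucket_sums(features, y, not_y, [20.0, 50.0], add, 0)
```

With a real additively homomorphic scheme, `he_add` and `zero` would be replaced by that scheme's ciphertext addition and an encryption of zero.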
Step 405: determine the sum of positive labels and the sum of negative labels according to the target ciphertext label.
In this embodiment, the execution subject may split the target ciphertext label and count its parts to determine the total number of samples having the same label. Through this analysis, the execution subject can determine the sum of positive labels and the sum of negative labels.
In some optional implementations of this embodiment, the execution subject may calculate the sum of positive labels and the sum of negative labels through the following steps:
Step 4051: determine the sum of first labels and the sum of second labels in the target ciphertext label.
Step 4052: add the sum of first labels to a randomly generated first mask and the sum of second labels to a randomly generated second mask, and send the two resulting sums, in encrypted form, to the business party.
Step 4053: receive the first data obtained by the business party by decrypting the two encrypted sums, and determine the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.
In this implementation, the execution subject may first identify the first labels and the second labels in the target ciphertext label and compute the sum of first labels, sum(<y_n>), and the sum of second labels, sum(<1-y_n>). It may then randomly generate two masks, denoted the first mask <mask_a> and the second mask <mask_b>, add the first mask to the sum of first labels and the second mask to the sum of second labels, and, using the public key for encryption, obtain the data {sum(<y_n>)+<mask_a>, sum(<1-y_n>)+<mask_b>}, which it sends to the business party. The business party decrypts this data with the private key paired with the public key, obtaining Dec(sum(<y_n>)+<mask_a>) and Dec(sum(<1-y_n>)+<mask_b>), and returns the decrypted values to the execution subject. The execution subject subtracts the corresponding first mask and second mask from these values to obtain the sum of positive labels pos_total and the sum of negative labels neg_total.
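The masking exchange can be illustrated with a toy example. The text does not name the encryption scheme, so this sketch assumes an additively homomorphic one and uses a minimal Paillier implementation with two small, fixed Mersenne primes; it is for illustration only and offers no real security. All function names are hypothetical.

```python
import secrets

# Toy parameters: 2**31 - 1 and 2**61 - 1 are prime but far too small for real use.
P, Q = 2**31 - 1, 2**61 - 1


def keygen():
    """Business party: return the public key n and the private key (phi, mu)."""
    n = P * Q
    phi = (P - 1) * (Q - 1)
    return n, (phi, pow(phi, -1, n))  # generator g = n + 1 is implied


def encrypt(n, m):
    r = secrets.randbelow(n - 1) + 1
    return (1 + m * n) * pow(r, n, n * n) % (n * n)


def decrypt(n, priv, c):
    phi, mu = priv
    return (pow(c, phi, n * n) - 1) // n * mu % n


def he_add(n, c1, c2):
    """Multiplying Paillier ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def masked_total(n, priv, ciphertexts):
    # Execution subject: sum the ciphertexts and add an encrypted random mask.
    total = encrypt(n, 0)
    for c in ciphertexts:
        total = he_add(n, total, c)
    mask = secrets.randbelow(2**32)
    masked = he_add(n, total, encrypt(n, mask))
    # Business party: decrypts and sees only (true total + mask).
    revealed = decrypt(n, priv, masked)
    # Execution subject: removes the mask it generated.
    return (revealed - mask) % n


n, priv = keygen()
labels = [1, 0, 1, 1, 0]                            # known only to the business party
enc_y = [encrypt(n, y) for y in labels]             # first labels <y_n>
enc_not_y = [encrypt(n, 1 - y) for y in labels]     # second labels <1-y_n>

pos_total = masked_total(n, priv, enc_y)
neg_total = masked_total(n, priv, enc_not_y)
```

In a real deployment the two roles inside `masked_total` would run on different machines, and `priv` would never leave the business party.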
Step 406: determine the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket.
In this embodiment, the execution subject may directly take the sum of first labels of each bucket as that bucket's number of positive labels, and the sum of second labels of each bucket as that bucket's number of negative labels.
In some optional implementations of this embodiment, the execution subject may calculate the number of positive labels and the number of negative labels of each bucket through the following steps:
Step 4061: add the sum of first labels of each bucket to a randomly generated third mask and the sum of second labels of each bucket to a randomly generated fourth mask, and send the two resulting sums, in encrypted form, to the business party.
Step 4062: receive the second data obtained by the business party by decrypting the two encrypted sums, and determine the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.
In this implementation, the execution subject may first generate multiple third masks and multiple fourth masks, denoted <mask_c> and <mask_d> respectively. It may then add the sum of first labels of each bucket received from the data party, sum(<y_bin_i>), to a third mask <mask_c>, add the sum of second labels of each bucket, sum(<1-y_bin_i>), to a fourth mask <mask_d>, and, using the public key for encryption, obtain the data {sum(<y_bin_i>)+<mask_c>, sum(<1-y_bin_i>)+<mask_d>}. The execution subject sends this data to the business party, which decrypts it with the private key paired with the public key to obtain Dec(sum(<y_bin_i>)+<mask_c>) and Dec(sum(<1-y_bin_i>)+<mask_d>). The business party returns the decrypted values to the execution subject, which subtracts the corresponding masks to obtain the number of positive labels npos_i and the number of negative labels nneg_i of each bucket.
Step 407: calculate and output the parameters corresponding to the target sample ID according to the sum of positive labels, the sum of negative labels, and the number of positive labels and number of negative labels of each bucket.
After obtaining the sum of positive labels pos_total, the sum of negative labels neg_total, and the number of positive labels npos_i and number of negative labels nneg_i of each bucket, the execution subject may use these values to calculate the parameters corresponding to the target sample ID, for example the WOE value and the IV value.
In some optional implementations of this embodiment, the execution subject may calculate the above parameters through the following step:
Step 4071: calculate and output the parameters corresponding to the target sample ID according to the sum of positive labels, the sum of negative labels, the number of positive labels and number of negative labels of each bucket, and at least two preset parameters.
In the prior art, the WOE value can be calculated with the following formula (1), and the IV value with the following formula (2):
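Formulas (1) and (2) are not reproduced in this text. Assuming they follow the conventional definitions of weight of evidence and information value, they would take the form below; the orientation of the ratio (positive share over negative share) is an assumption, since some references invert it.

```latex
\begin{align}
\mathrm{WOE}_i &= \ln\!\left(\frac{npos_i / pos_{total}}{nneg_i / neg_{total}}\right) \tag{1} \\
\mathrm{IV} &= \sum_i \left(\frac{npos_i}{pos_{total}} - \frac{nneg_i}{neg_{total}}\right)\mathrm{WOE}_i \tag{2}
\end{align}
```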
Here, npos_i is the number of positive samples in the i-th bin, nneg_i is the number of negative samples in the i-th bin, pos_total is the total number of positive samples, and neg_total is the total number of negative samples. When the business party's data all belong to one class, pos_total = 0 or neg_total = 0, and neither the WOE value nor the IV value can be calculated. The data party can then tell that the data provided by the business party all belong to one class and infer the data labels, which creates a risk of data leakage.
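The failure mode described above can be seen in a small sketch. It assumes the conventional definitions, WOE_i = ln((npos_i / pos_total) / (nneg_i / neg_total)) and IV = the sum over i of (npos_i / pos_total - nneg_i / neg_total) * WOE_i; the function name `woe_iv` and the sample counts are illustrative and not taken from the disclosure.

```python
import math


def woe_iv(npos, nneg, pos_total, neg_total):
    """Return the per-bin WOE values and the IV under the conventional definitions."""
    woe = []
    iv = 0.0
    for p, q in zip(npos, nneg):
        pos_rate = p / pos_total   # share of all positives that fall in this bin
        neg_rate = q / neg_total   # share of all negatives that fall in this bin
        w = math.log(pos_rate / neg_rate)
        woe.append(w)
        iv += (pos_rate - neg_rate) * w
    return woe, iv


# Two bins, 40 positive and 60 negative samples in total.
woe, iv = woe_iv([10, 30], [40, 20], 40, 60)

# Single-class case: pos_total = 0, so the unmodified formulas break down.
try:
    woe_iv([0, 0], [40, 20], 0, 60)
    single_class_ok = True
except ZeroDivisionError:
    single_class_ok = False
```

An error or undefined result of this kind is itself a signal that all labels belong to one class, which is the leak the modified formulas are meant to close.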
In this implementation, formulas (1) and (2) may be improved to obtain formulas (3) and (4) as follows:
Here, ε and δ are both preset values, with 0 < ε < 1 and 0 < δ < 0.02.
During feature binning, a bin may contain samples of only one class, and the labels associated with the feature data provided by the data party may all belong to one class. Taking both situations into account, formula (3) does not affect the WOE value in cases where WOE can be calculated normally, and still yields a WOE value in the special cases, so the subsequent calculation of the IV value is not affected. Likewise, when the total numbers of positive and negative samples are large enough, formula (4) does not affect the IV value in cases where IV can be calculated normally, and yields an IV value of δ when the labels associated with the data party's feature data all belong to one class. In practice, a feature variable whose IV value is below 0.02 has almost no predictive power. As a result, the data party cannot tell from the IV value that the data it provided all belong to one class; the value merely indicates that the feature variable has very little predictive power, so the labels of the Guest party are not disclosed.
In some optional implementations of this embodiment, the execution subject may output the obtained parameters to the data party. The business party then cannot learn the WOE value and the IV value, which prevents it from obtaining information about the value of the data party's data.
Continuing to refer to FIG. 5, a schematic diagram of the interaction among the three parties (the Guest party, the third party, and the Host party) in this embodiment is shown. As shown in FIG. 5, the specific steps of federated feature engineering may be as follows:
1. The Guest party and the Host party transmit their encrypted sample IDs to the trusted third party, which aligns the encrypted samples to obtain the sample IDs shared by both parties. Encryption methods include, but are not limited to, asymmetric encryption, hash algorithms, and homomorphic encryption. The trusted third party may distribute a public key to the Guest party and the Host party; the Guest party and the Host party encrypt their sample IDs with the public key and transmit them to the trusted third party, which decrypts them with the private key and aligns the samples. Alternatively, the Guest party and the Host party may hash their sample IDs and let the trusted third party compare the hash values to obtain the aligned samples.
2. The trusted third party sends the aligned sample IDs to the Host party, and the Host party performs feature binning on the features of the aligned samples.
3. The Guest party encrypts the labels and sends them to the trusted third party, and the trusted third party sends the aligned ciphertext labels to the Host party.
4. The trusted third party adds different masks to the sums of the aligned ciphertext positive labels and negative labels, sends the results to the Guest party for decryption, and subtracts the masks from the returned results to obtain the total numbers of positive and negative labels.
5. The Host party calculates the ciphertext counts of positive and negative labels for each feature bucket and sends them to the trusted third party.
6. The trusted third party adds different masks to the per-bucket results, sends them to the Guest party for decryption, and subtracts the masks from the returned results to obtain the numbers of positive and negative labels in each feature bucket.
7. The trusted third party calculates the WOE and IV values from the results of step 4 and step 6, and finally sends the IV value of each feature to the Host party for storage.
In the data processing method for federated feature engineering provided by the above embodiment of the present disclosure, a trusted third party is responsible for aligning the data and sends the aligned sample IDs only to the Host party. When the sums of positive and negative labels are calculated over the aligned data samples, masks are added so that the Guest party cannot learn the proportion of labels among the Host party's data samples. The same masking method is used when calculating the numbers of positive and negative labels in each feature bucket, so the Guest party cannot obtain the Host party's data information. Finally, the third party calculates the WOE and IV values of the features provided by the Host party, which ensures that the Guest party cannot obtain information about the value of the Host party's data.
With further reference to FIG. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a data processing apparatus for federated feature engineering. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied in various electronic devices.
As shown in FIG. 6, the data processing apparatus 600 for federated feature engineering of this embodiment includes a data receiving unit 601, an identifier sending unit 602, a label sending unit 603, and an information output unit 604.
The data receiving unit 601 is configured to receive the first sample ID of the sample data and the ciphertext label corresponding to the first sample ID sent by the business party, and to receive the second sample ID of the sample data sent by the data party, where the ciphertext label includes a first label and a second label.
The identifier sending unit 602 is configured to determine the target sample ID according to the first sample ID and the second sample ID, and send it to the data party.
The label sending unit 603 is configured to determine the target ciphertext label according to the ciphertext label and the target sample ID, and send it to the data party.
The information output unit 604 is configured to, in response to receiving the sum of first labels and the sum of second labels of each bucket, calculated by the data party after feature bucketing based on the target sample ID and the target ciphertext label, calculate and output the parameters corresponding to the target sample ID based on the target ciphertext label and the per-bucket sums of first labels and second labels.
In some optional implementations of this embodiment, the identifier sending unit 602 may be further configured to: align the first sample ID and the second sample ID, determine the sample IDs shared by the business party and the data party as the target sample ID, and send the target sample ID to the data party.
In some optional implementations of this embodiment, the information output unit 604 may be further configured to: determine the sum of positive labels and the sum of negative labels according to the target ciphertext label; determine the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket; and calculate and output the parameters corresponding to the target sample ID according to the sum of positive labels, the sum of negative labels, and the number of positive labels and number of negative labels of each bucket.
In some optional implementations of this embodiment, the information output unit 604 may be further configured to: determine the sum of first labels and the sum of second labels in the target ciphertext label; add the sum of first labels to a randomly generated first mask and the sum of second labels to a randomly generated second mask, and send the two resulting sums, in encrypted form, to the business party; and receive the first data obtained by the business party by decrypting the two encrypted sums, and determine the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.
In some optional implementations of this embodiment, the information output unit 604 may be further configured to: add the sum of first labels of each bucket to a randomly generated third mask and the sum of second labels of each bucket to a randomly generated fourth mask, and send the two resulting sums, in encrypted form, to the business party; and receive the second data obtained by the business party by decrypting the two encrypted sums, and determine the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.
In some optional implementations of this embodiment, the information output unit 604 may be further configured to: calculate and output the parameters corresponding to the target sample ID according to the sum of positive labels, the sum of negative labels, the number of positive labels and number of negative labels of each bucket, and at least two preset parameters.
In some optional implementations of this embodiment, the information output unit 604 may be further configured to: output at least one of the calculated parameters to the data party.
In some optional implementations of this embodiment, the apparatus 600 may further include an encryption unit, not shown in FIG. 6, configured to: receive the public key sent by the business party; and encrypt with the public key, so that the business party can decrypt with the private key paired with the public key.
It should be understood that the units 601 to 605 described in the data processing apparatus 600 for federated feature engineering correspond respectively to the steps of the method described with reference to FIG. 2. The operations and features described above for the data processing method for federated feature engineering therefore also apply to the apparatus 600 and the units it contains, and are not repeated here.
In the technical solution of the present disclosure, the acquisition, storage, and use of users' personal information all comply with relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 7 shows a block diagram of an electronic device 700 that performs the data processing method for federated feature engineering according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 7, the electronic device 700 includes a processor 701, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a memory 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the electronic device 700. The processor 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Multiple components of the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a memory 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the electronic device 700 to exchange information and data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The processor 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The processor 701 executes the methods and processing described above, for example the data processing method for federated feature engineering. For example, in some embodiments, the data processing method for federated feature engineering may be implemented as a computer software program tangibly embodied in a machine-readable storage medium, such as the memory 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the processor 701, one or more steps of the data processing method for federated feature engineering described above can be performed. Alternatively, in other embodiments, the processor 701 may be configured in any other appropriate way (for example, by means of firmware) to execute the data processing method for federated feature engineering.
Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be packaged as a computer program product. The program code or computer program product may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor 701, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present disclosure, a machine-readable storage medium may be a tangible medium that can contain or store a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. The machine-readable storage medium may be a machine-readable signal storage medium or a machine-readable storage medium. A machine-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations described above do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.
Claims (19)
- A data processing method for federated feature engineering, applied to a third party, comprising: receiving a first sample identifier of sample data sent by a business party, a ciphertext label corresponding to the first sample identifier, and a second sample identifier of sample data sent by a data party, the ciphertext label comprising a first label and a second label; determining a target sample identifier according to the first sample identifier and the second sample identifier, and sending the target sample identifier to the data party; determining a target ciphertext label according to the ciphertext label and the target sample identifier, and sending the target ciphertext label to the data party; and in response to receiving a sum of first labels and a sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculating and outputting a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.
- The method according to claim 1, wherein determining the target sample identifier according to the first sample identifier and the second sample identifier and sending it to the data party comprises: aligning the first sample identifier and the second sample identifier, determining the sample identifier shared by the business party and the data party as the target sample identifier, and sending it to the data party.
- The method according to claim 1, wherein calculating and outputting the parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket comprises: determining a sum of positive labels and a sum of negative labels according to the target ciphertext label; determining a number of positive labels and a number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket; and calculating and outputting the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket.
- The method according to claim 3, wherein determining the sum of positive labels and the sum of negative labels according to the target ciphertext label comprises: determining a sum of first labels and a sum of second labels in the target ciphertext label; adding the sum of first labels to a randomly generated first mask and the sum of second labels to a randomly generated second mask, encrypting the two resulting sums, and sending them to the business party; and receiving first data obtained by the business party by decrypting the two encrypted sums, and determining the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.
- The method according to claim 3, wherein determining the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket comprises: adding the sum of first labels of each bucket to a randomly generated third mask and the sum of second labels of each bucket to a randomly generated fourth mask, encrypting the two resulting sums, and sending them to the business party; and receiving second data obtained by the business party by decrypting the two encrypted sums, and determining the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.
- The method according to claim 3, wherein calculating and outputting the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket comprises: calculating and outputting the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels of each bucket, and at least two preset parameters.
- The method according to any one of claims 1-6, wherein calculating and outputting the parameter corresponding to the target sample identifier comprises: outputting at least one calculated parameter to the data party.
- The method according to claim 4 or 5, further comprising: receiving a public key sent by the business party; and encrypting with the public key, so that the business party can decrypt with the private key paired with the public key.
- A data processing apparatus for federated feature engineering, comprising: a data receiving unit configured to receive a first sample identifier of sample data sent by a business party, a ciphertext label corresponding to the first sample identifier, and a second sample identifier of sample data sent by a data party, the ciphertext label comprising a first label and a second label; an identifier sending unit configured to determine a target sample identifier according to the first sample identifier and the second sample identifier and send it to the data party; a label sending unit configured to determine a target ciphertext label according to the ciphertext label and the target sample identifier and send it to the data party; and an information output unit configured to, in response to receiving a sum of first labels and a sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.
- The apparatus according to claim 9, wherein the identifier sending unit is further configured to: align the first sample identifier and the second sample identifier, determine the sample identifier shared by the business party and the data party as the target sample identifier, and send it to the data party.
- The apparatus according to claim 9, wherein the information output unit is further configured to: determine a sum of positive labels and a sum of negative labels according to the target ciphertext label; determine a number of positive labels and a number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket; and calculate and output the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket.
- The apparatus according to claim 11, wherein the information output unit is further configured to: determine a sum of first labels and a sum of second labels in the target ciphertext label; add the sum of first labels to a randomly generated first mask and the sum of second labels to a randomly generated second mask, encrypt the two resulting sums, and send them to the business party; and receive first data obtained by the business party by decrypting the two encrypted sums, and determine the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.
- The apparatus according to claim 11, wherein the information output unit is further configured to: add the sum of first labels of each bucket to a randomly generated third mask and the sum of second labels of each bucket to a randomly generated fourth mask, encrypt the two resulting sums, and send them to the business party; and receive second data obtained by the business party by decrypting the two encrypted sums, and determine the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.
- The apparatus according to claim 11, wherein the information output unit is further configured to: calculate and output the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels of each bucket, and at least two preset parameters.
- The apparatus according to any one of claims 9-14, wherein the information output unit is further configured to: output at least one calculated parameter to the data party.
- The apparatus according to claim 14 or 15, further comprising an encryption unit configured to: receive a public key sent by the business party; and encrypt with the public key, so that the business party can decrypt with the private key paired with the public key.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-8.
- A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method according to any one of claims 1-8.
- A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
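The flow recited in the method claims can be sketched as follows. This is a minimal single-process illustration under several assumptions that the claims do not state: the first and second labels are taken to be encryptions of y and 1 - y for a binary label y, the encryption is a toy Paillier scheme with tiny primes (not secure), the output parameter is taken to be a weight-of-evidence (WOE) value per bucket, and the bucket threshold of 30 and the smoothing constant 0.5 are hypothetical. All function and variable names are illustrative.

```python
import math
import random

# Toy Paillier keypair with tiny primes: illustration only, NOT secure.
P, Q = 1009, 1013
N, PHI = P * Q, (P - 1) * (Q - 1)
N2 = N * N
MU = pow(PHI, -1, N)  # modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt 0 <= m < N with the public key N."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (1 + m * N) % N2 * pow(r, N, N2) % N2


def decrypt(c):
    """Decrypt with the private values PHI and MU (business party only)."""
    return (pow(c, PHI, N2) - 1) // N * MU % N


def hsum(ciphertexts):
    """Homomorphic sum: multiplying Paillier ciphertexts adds plaintexts."""
    total = encrypt(0)
    for c in ciphertexts:
        total = total * c % N2
    return total


def unmask(c_sum):
    """Third party masks an encrypted sum, the business party decrypts the
    masked value, and the third party removes the mask again."""
    mask = random.randrange(1, 1000)
    masked = hsum([c_sum, encrypt(mask)])   # third party
    masked_plain = decrypt(masked)          # business party
    return masked_plain - mask              # third party


# Business party: sample IDs with binary labels, sent as ciphertext pairs
# (assumed here: first label = Enc(y), second label = Enc(1 - y)).
labels = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 1, "f": 0}
cipher_labels = {i: (encrypt(y), encrypt(1 - y)) for i, y in labels.items()}

# Data party: sample IDs with one feature value each.
features = {"b": 12, "c": 35, "d": 18, "e": 40, "f": 33, "g": 50}

# Third party: align IDs and select the matching ciphertext labels.
target_ids = sorted(cipher_labels.keys() & features.keys())
target_labels = {i: cipher_labels[i] for i in target_ids}

# Data party: bucket the feature and sum the ciphertext labels per bucket.
buckets = {}
for i in target_ids:
    buckets.setdefault(0 if features[i] < 30 else 1, []).append(i)
bucket_sums = {
    b: (hsum(target_labels[i][0] for i in ids),
        hsum(target_labels[i][1] for i in ids))
    for b, ids in buckets.items()
}

# Third party: recover totals and per-bucket counts through masking.
totals = (unmask(hsum(c[0] for c in target_labels.values())),
          unmask(hsum(c[1] for c in target_labels.values())))
bucket_counts = {b: (unmask(s1), unmask(s2))
                 for b, (s1, s2) in bucket_sums.items()}

# Third party: one WOE value per bucket (0.5 is a hypothetical smoothing term).
woe = {
    b: math.log(((pos + 0.5) / totals[0]) / ((neg + 0.5) / totals[1]))
    for b, (pos, neg) in bucket_counts.items()
}
```

In this sketch the data party only ever handles ciphertexts, and the business party only ever decrypts masked sums, so neither sees the other's raw labels or per-bucket counts. A real deployment would use a vetted homomorphic encryption library and keys of appropriate size rather than the toy scheme above.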
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111078529.1A CN113722744B (en) | 2021-09-15 | 2021-09-15 | Data processing method, device, equipment and medium for federal feature engineering |
CN202111078529.1 | 2021-09-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023040429A1 (en) | 2023-03-23 |
Family
ID=78683794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/104178 WO2023040429A1 (en) | 2021-09-15 | 2022-07-06 | Data processing method, apparatus, and device for federated feature engineering, and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113722744B (en) |
WO (1) | WO2023040429A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116842561A (en) * | 2023-06-29 | 2023-10-03 | 上海零数众合信息科技有限公司 | Privacy intersection system and method capable of dynamically adding and deleting data sets |
CN117278199A (en) * | 2023-10-18 | 2023-12-22 | 上海零数众合信息科技有限公司 | Federal learning feature screening method and system based on homomorphic encryption |
CN117579215A (en) * | 2024-01-17 | 2024-02-20 | 杭州世平信息科技有限公司 | Longitudinal federal learning differential privacy protection method and system based on tag sharing |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722744B (en) * | 2021-09-15 | 2024-09-24 | 京东科技信息技术有限公司 | Data processing method, device, equipment and medium for federal feature engineering |
CN114818972A (en) * | 2022-05-19 | 2022-07-29 | 北京瑞莱智慧科技有限公司 | Model construction method and device and storage medium |
CN115659381B (en) * | 2022-12-26 | 2023-03-10 | 北京数牍科技有限公司 | Federal learning WOE encoding method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111563267A (en) * | 2020-05-08 | 2020-08-21 | 京东数字科技控股有限公司 | Method and device for processing federal characteristic engineering data |
CN112257876A (en) * | 2020-11-15 | 2021-01-22 | 腾讯科技(深圳)有限公司 | Federal learning method, apparatus, computer device and medium |
CN112632045A (en) * | 2021-03-10 | 2021-04-09 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and computer readable storage medium |
CN112818374A (en) * | 2021-03-02 | 2021-05-18 | 深圳前海微众银行股份有限公司 | Joint training method, device, storage medium and program product of model |
CN112861939A (en) * | 2021-01-26 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Feature selection method, device, readable storage medium and computer program product |
CN113722744A (en) * | 2021-09-15 | 2021-11-30 | 京东科技信息技术有限公司 | Data processing method, device, equipment and medium for federal characteristic engineering |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210233615A1 (en) * | 2018-04-22 | 2021-07-29 | Viome, Inc. | Systems and methods for inferring scores for health metrics |
US10885203B2 (en) * | 2019-08-01 | 2021-01-05 | Advanced New Technologies Co., Ltd. | Encrypted data exchange |
CN110535622A (en) * | 2019-08-01 | 2019-12-03 | 阿里巴巴集团控股有限公司 | Data processing method, device and electronic equipment |
CN111695674B (en) * | 2020-05-14 | 2024-04-09 | 平安科技(深圳)有限公司 | Federal learning method, federal learning device, federal learning computer device, and federal learning computer readable storage medium |
CN111931216B (en) * | 2020-09-16 | 2021-03-30 | 支付宝(杭州)信息技术有限公司 | Method and system for obtaining joint training model based on privacy protection |
CN113159327B (en) * | 2021-03-25 | 2024-04-09 | 深圳前海微众银行股份有限公司 | Model training method and device based on federal learning system and electronic equipment |
- 2021-09-15: CN application CN202111078529.1A, published as CN113722744B (en), status: Active
- 2022-07-06: WO application PCT/CN2022/104178, published as WO2023040429A1 (en), status: unknown
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111563267A (en) * | 2020-05-08 | 2020-08-21 | 京东数字科技控股有限公司 | Method and device for processing federal characteristic engineering data |
CN112257876A (en) * | 2020-11-15 | 2021-01-22 | 腾讯科技(深圳)有限公司 | Federal learning method, apparatus, computer device and medium |
CN112861939A (en) * | 2021-01-26 | 2021-05-28 | 深圳前海微众银行股份有限公司 | Feature selection method, device, readable storage medium and computer program product |
CN112818374A (en) * | 2021-03-02 | 2021-05-18 | 深圳前海微众银行股份有限公司 | Joint training method, device, storage medium and program product of model |
CN112632045A (en) * | 2021-03-10 | 2021-04-09 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and computer readable storage medium |
CN113722744A (en) * | 2021-09-15 | 2021-11-30 | 京东科技信息技术有限公司 | Data processing method, device, equipment and medium for federal characteristic engineering |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116842561A (en) * | 2023-06-29 | 2023-10-03 | 上海零数众合信息科技有限公司 | Privacy intersection system and method capable of dynamically adding and deleting data sets |
CN116842561B (en) * | 2023-06-29 | 2024-05-24 | 上海零数众合信息科技有限公司 | Privacy intersection system and method capable of dynamically adding and deleting data sets |
CN117278199A (en) * | 2023-10-18 | 2023-12-22 | 上海零数众合信息科技有限公司 | Federal learning feature screening method and system based on homomorphic encryption |
CN117579215A (en) * | 2024-01-17 | 2024-02-20 | 杭州世平信息科技有限公司 | Longitudinal federal learning differential privacy protection method and system based on tag sharing |
CN117579215B (en) * | 2024-01-17 | 2024-03-29 | 杭州世平信息科技有限公司 | Longitudinal federal learning differential privacy protection method and system based on tag sharing |
Also Published As
Publication number | Publication date |
---|---|
CN113722744A (en) | 2021-11-30 |
CN113722744B (en) | 2024-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023040429A1 (en) | Data processing method, apparatus, and device for federated feature engineering, and medium | |
US20200311257A1 (en) | Processing and storing blockchain data under a trusted execution environment | |
CN110414987B (en) | Account set identification method and device and computer system | |
CN110245510A (en) | Method and apparatus for predictive information | |
CN112287379B (en) | Service data using method, device, equipment, storage medium and program product | |
CN111310204B (en) | Data processing method and device | |
CN111563267B (en) | Method and apparatus for federal feature engineering data processing | |
JPWO2015155896A1 (en) | Support vector machine learning system and support vector machine learning method | |
WO2022156594A1 (en) | Federated model training method and apparatus, electronic device, computer program product, and computer-readable storage medium | |
US20230050771A1 (en) | Method for determining risk level of instance on cloud server, and electronic device | |
WO2023071105A1 (en) | Method and apparatus for analyzing feature variable, computer device, and storage medium | |
CN113051239A (en) | Data sharing method, use method of model applying data sharing method and related equipment | |
WO2024082514A1 (en) | Service index prediction method and apparatus, and device and storage medium | |
WO2023216494A1 (en) | Federated learning-based user service strategy determination method and apparatus | |
US20230206133A1 (en) | Model parameter adjusting method and device, storage medium and program product | |
CN115150063A (en) | Model encryption method and device and electronic equipment | |
CN114139450A (en) | Scoring card modeling method and device based on privacy protection | |
CN112598127B (en) | Federal learning model training method and device, electronic equipment, medium and product | |
WO2024139666A1 (en) | Training method and apparatus for dual-target domain recommendation model | |
US20180365687A1 (en) | Fraud detection | |
CN116781425B (en) | Service data acquisition method, device, equipment and storage medium | |
US20220360459A1 (en) | Method of querying data, method of writing data, electronic device, and readable storage medium | |
CN116405199A (en) | Encryption method, device, equipment and medium based on NTRU algorithm and SM2 algorithm | |
CN115599959A (en) | Data sharing method, device, equipment and storage medium | |
CN115346668A (en) | Training method and device of health risk grade evaluation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22868806 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.06.2024) |