WO2023040429A1

WO2023040429A1 - Data processing method, apparatus, and device for federated feature engineering, and medium

Info

Publication number: WO2023040429A1
Application number: PCT/CN2022/104178
Authority: WO
Inventors: 尹靖雯; 孙中伟; 张钧皓; 曹雨晨; 姬艳鑫; 张新; 刘永平; 宋红花; 赵国梁
Original assignee: 京东科技信息技术有限公司
Priority date: 2021-09-15
Filing date: 2022-07-06
Publication date: 2023-03-23
Also published as: CN113722744A; CN113722744B

Abstract

The present disclosure provides a data processing method, apparatus, and device for federated feature engineering, and a medium, and relates to the field of deep learning. The specific implementation solution is as follows: receiving a first sample identifier of sample data sent by a service party as well as a ciphertext label corresponding to the first sample identifier, and receiving a second sample identifier of sample data sent by a data party; according to the first sample identifier and the second sample identifier, determining a target sample identifier and sending same to the data party; according to the ciphertext label and the target sample identifier, determining a target ciphertext label and sending same to the data party; in response to receiving the sum of first labels and the sum of second labels of sub-buckets obtained by calculation after the data party performs feature bucketing on the basis of the target sample identifier and target ciphertext label, on the basis of the target ciphertext label and the sum of first labels and the sum of the second labels of the sub-buckets, calculating and outputting a parameter corresponding to the target sample identifier.

Description

Data processing method, apparatus, device, and medium for federated feature engineering

This application claims priority to application number 202111078529.1, filed September 15, 2021, the entirety of which is incorporated herein by reference.

Technical field

The present disclosure relates to the field of computer technology, specifically to deep learning and data processing, and in particular to a data processing method, apparatus, device, and medium for federated feature engineering.

Background

To address data silos and data privacy, the current mainstream approach is federated learning, in which different parties jointly train on their data to obtain better models for practical problems. According to how the data is distributed, federated learning can be divided into horizontal federated learning, vertical federated learning, and federated transfer learning. Among these, vertical federated learning is widely used.

Summary of the invention

The present disclosure provides a data processing method, apparatus, device, and medium for federated feature engineering.

According to a first aspect, a data processing method for federated feature engineering is provided, including: receiving a first sample identifier of sample data sent by a business party and a ciphertext label corresponding to the first sample identifier, and receiving a second sample identifier of sample data sent by a data party, where the ciphertext label includes a first label and a second label; determining a target sample identifier according to the first sample identifier and the second sample identifier and sending it to the data party; determining a target ciphertext label according to the ciphertext label and the target sample identifier and sending it to the data party; and, in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculating and outputting a parameter corresponding to the target sample identifier based on the target ciphertext label, the sum of the first labels, and the sum of the second labels of each bucket.

According to a second aspect, a data processing apparatus for federated feature engineering is provided, including: a data receiving unit configured to receive a first sample identifier of sample data sent by a business party and a ciphertext label corresponding to the first sample identifier, and to receive a second sample identifier of sample data sent by a data party, where the ciphertext label includes a first label and a second label; an identifier sending unit configured to determine a target sample identifier according to the first sample identifier and the second sample identifier and send it to the data party; a label sending unit configured to determine a target ciphertext label according to the ciphertext label and the target sample identifier and send it to the data party; and an information output unit configured to, in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output a parameter corresponding to the target sample identifier based on the target ciphertext label, the sum of the first labels, and the sum of the second labels of each bucket.

According to a third aspect, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to execute the method described in the first aspect.

According to a fourth aspect, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method described in the first aspect.

According to a fifth aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described in the first aspect.

It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood through the following description.

Description of drawings

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:

FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present disclosure can be applied;

FIG. 2 is a flowchart of an embodiment of a data processing method for federated feature engineering according to the present disclosure;

FIG. 3 is a schematic diagram of an application scenario of a data processing method for federated feature engineering according to the present disclosure;

FIG. 4 is a flowchart of another embodiment of a data processing method for federated feature engineering according to the present disclosure;

FIG. 5 is a schematic diagram of the tripartite interaction process in the embodiment shown in FIG. 4;

FIG. 6 is a schematic structural diagram of an embodiment of a data processing device for federated feature engineering according to the present disclosure;

FIG. 7 is a block diagram of an electronic device for implementing the data processing method for federated feature engineering according to an embodiment of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and they should be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that, in the case of no conflict, the embodiments in the present disclosure and the features in the embodiments can be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings and embodiments.

In traditional machine learning, feature engineering is an essential step. Common feature engineering methods include data preprocessing, feature selection, and feature dimensionality reduction. After data preprocessing, meaningful features must be selected to train the model, and common indicators such as WOE (Weight of Evidence) and IV (Information Value) are typically used to analyze how well each feature predicts the label. Calculating the WOE and IV values requires binning the data, that is, discretizing continuous variables. Binning is a commonly used preprocessing method that lets the model iterate quickly and reduces the risk of overfitting.
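As a minimal sketch of the binning step, the following discretizes a continuous variable into equal-width bins. The equal-width strategy, the function names, and the sample values are assumptions chosen for illustration; the disclosure does not prescribe a particular binning scheme.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points of equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]

# Hypothetical continuous feature (for example, an age column).
ages = [18, 22, 25, 31, 40, 47, 52, 66]
edges = equal_width_edges(ages, 3)
bins = assign_bins(ages, edges)
```

Each sample is now represented by a small integer bin index, which is the form on which per-bin label counts are later computed.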

In federated feature engineering there are usually two parties. The party holding the labels is the Guest (business party), and the unlabeled party that provides only feature data is the Host (data party). The Guest hopes to expand the feature dimensions of its data through federation. During the interaction no plaintext data is transmitted: the Host cannot obtain the labels, and the Guest does not learn the Host's feature values, so feature engineering is completed while the security and privacy of both parties are preserved.

In related technologies, for example in financial scenarios, banks and other financial institutions hold credit labels, e-commerce platforms hold user consumption data, and the users of the two parties overlap. A bank could use the e-commerce data to predict credit risk, but the two parties cannot share data directly. Vertical federated learning can be used to solve such problems.

According to the technology disclosed in the present disclosure, data sharing can be realized while ensuring the security of the data of both parties.

FIG. 1 shows an exemplary system architecture 100 to which the embodiment of the data processing method for federated feature engineering or the data processing device for federated feature engineering of the present disclosure can be applied.

As shown in FIG. 1, the system architecture 100 may include a business party 101, a third party 102, and a data party 103. Communication between the business party 101 and the third party 102, and between the third party 102 and the data party 103, can be carried out through a network. The network may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

The business party 101 and the data party 103 may have relevant data for the same object, for example, the business party 101 may be a clothing production factory, and the data party 103 may be a clothing sales website. The third party 102 may be a party independent from the business party 101 and the data party 103, and is a trusted party. In order to avoid data security risks that may be caused by the business party 101 and the data party 103 during data interaction, in this embodiment, both the business party 101 and the data party 103 can send data to the third party 102 to improve data security.

It should be noted that the business party 101, the third party 102 and the data party 103 may be hardware or software. When the business party 101, the third party 102, and the data party 103 are hardware, they can be implemented as a distributed server cluster composed of multiple electronic devices, or as a single server. When the business party 101, the third party 102, and the data party 103 are software, they can be implemented as multiple software or software modules (for example, for providing distributed services), or as a single software or software module. No specific limitation is made here.

It should be noted that the data processing method for federated feature engineering provided by the embodiments of the present disclosure is generally executed by the third party 102. Correspondingly, the data processing device for federated feature engineering is generally set in the third party 102.

It should be understood that the numbers of business parties, third parties, and data parties in FIG. 1 are only illustrative. According to implementation requirements, there can be any number of business parties, third parties, and data parties.

Continuing to refer to FIG. 2, a flow 200 of an embodiment of a data processing method for federated feature engineering according to the present disclosure is shown. The data processing method for federated feature engineering in this embodiment includes the following steps:

Step 201, receiving the first sample identifier of the sample data sent by the business party and the ciphertext label corresponding to the first sample identifier, and receiving the second sample identifier of the sample data sent by the data party, where the ciphertext label includes a first label and a second label.

In this embodiment, the execution subject of the data processing method for federated feature engineering (for example, the third party 102 shown in FIG. 1) can receive data from the business party and the data party respectively. Specifically, the execution subject may receive the first sample identifier of the sample data from the business party, and receive the second sample identifier of the sample data from the data party. Here, both the first sample identifier and the second sample identifier are character strings used to identify sample data. The business party and the data party can encrypt the identifiers of the original sample data to obtain the first sample identifier and the second sample identifier. The encryption here can use homomorphic encryption, a form of encryption that allows specific algebraic operations to be performed on ciphertext, so that operations such as retrieval and comparison can be carried out on encrypted data without decrypting it. The execution subject can also receive the ciphertext label from the business party. Here, the ciphertext label includes a first label and a second label. The first label can represent positive samples, and the second label can represent negative samples.
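To illustrate what performing algebraic operations on ciphertext means, the following is a toy sketch of an additively homomorphic scheme. The choice of Paillier, the tiny primes, and all function names are assumptions made for illustration; the disclosure does not name a specific scheme, and these parameters are far too small to be secure.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n with g = n + 1."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext using the private key (lam, mu)."""
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n

def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)

n, priv = keygen()
total = decrypt(n, priv, add(n, encrypt(n, 3), encrypt(n, 4)))
```

A party that holds only `n` can combine ciphertexts with `add` but cannot read them; only the holder of `priv` can decrypt the combined result.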

Step 202: Determine the target sample ID according to the first sample ID and the second sample ID, and send the target sample ID to the data party.

In this embodiment, after receiving the first sample ID and the second sample ID, the execution subject can process them in various ways to determine the sample IDs that appear in both, take these as the target sample ID, and send the target sample ID only to the data party. Specifically, the processing may include decryption, hash operations, and the like. Because the execution subject sends the target sample ID only to the data party, the business party cannot guess the data party's original data from the target sample ID, which improves data security.

Step 203: Determine the target ciphertext label according to the ciphertext label and the target sample ID, and send the target ciphertext label to the data party.

After determining the target sample identifier, the execution subject may also determine, from the above ciphertext labels, the label corresponding to the target sample identifier as the target ciphertext label. Specifically, the ciphertext label includes the correspondence between each label and its first sample identifier; using this correspondence, the execution subject can search the ciphertext labels to determine the target ciphertext label, and then send the target ciphertext label to the data party.

Step 204, in response to receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample ID and the target ciphertext label, calculating and outputting the parameters corresponding to the target sample identifier based on the target ciphertext label and the sum of the first labels and the sum of the second labels of each bucket.

After receiving the above target sample identifier and target ciphertext label, the data party can perform feature bucketing on the original data, that is, divide the original data into multiple buckets (bins). The data in each bucket corresponds to sample IDs, and the data party can perform calculations based on the sample IDs and ciphertext labels corresponding to the data in each bucket to obtain the sum of the first labels and the sum of the second labels of each bucket. The data party then sends the calculated data to the execution subject. After receiving this data, the execution subject can combine it with the target ciphertext label to calculate the parameters corresponding to the target sample ID. Specifically, the parameters may include a WOE value and an IV value. The execution subject can determine the number of positive labels and the number of negative labels from the target ciphertext labels, and then calculate the parameters according to the formulas for the WOE value and the IV value.

Continuing to refer to FIG. 3, a schematic diagram of an application scenario of the data processing method for federated feature engineering according to the present disclosure is shown. In the application scenario of FIG. 3, the bank 301 sends the sample IDs of its users' sample data and the encrypted credit labels to the third party 302, and the e-commerce platform 303 sends the sample IDs of its users' consumption data to the third party 302. The third party 302 calculates the WOE value and the IV value after performing steps 201 to 204. Based on these two parameter values, meaningful features are selected to train a model for credit risk prediction.

The data processing method for federated feature engineering provided by the above embodiments of the present disclosure can improve data security by introducing a third party between the business party and the data party.

Continuing to refer to FIG. 4, a flow 400 of another embodiment of the data processing method for federated feature engineering according to the present disclosure is shown. As shown in FIG. 4, the method of this embodiment may include the following steps:

Step 401, receiving the first sample ID of the sample data sent by the business party and the ciphertext label corresponding to the first sample ID, and receiving the second sample ID of the sample data sent by the data party.

Step 402, aligning the first sample ID and the second sample ID, determining the sample ID shared by the business party and the data party as the target sample ID and sending it to the data party.

In this embodiment, the execution subject may align the first sample identifier and the second sample identifier. Specifically, the execution subject can use existing sample ID alignment schemes, such as encrypted sample ID alignment based on the RSA encryption/decryption algorithm and a hash algorithm, or encrypted sample ID alignment based on Diffie-Hellman. After the alignment, the execution subject may determine the sample identifiers shared by the first sample identifier and the second sample identifier as the target sample identifier, and send the target sample ID to the data party.
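The simplest alignment variant, in which both parties hash their IDs and the third party intersects the hashes, can be sketched as follows. The use of SHA-256, the shared salt, and the names `hash_id` and `align` are assumptions for illustration only; the RSA-based and Diffie-Hellman-based schemes mentioned above are stronger and are not shown here.

```python
import hashlib

def hash_id(raw_id, salt):
    """Business/data party: hash a raw sample ID before sending it out."""
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()

def align(first_ids, second_ids):
    """Third party: return the hashed IDs present on both sides,
    preserving the order of the first list."""
    second = set(second_ids)
    return [h for h in first_ids if h in second]

SALT = "shared-salt"  # hypothetical value agreed between the two parties
guest_ids = [hash_id(u, SALT) for u in ["u1", "u2", "u3", "u4"]]
host_ids = [hash_id(u, SALT) for u in ["u3", "u4", "u5"]]

# The third party only ever sees hashes, never the raw IDs.
target_ids = align(guest_ids, host_ids)
```

In this sketch `target_ids` plays the role of the target sample identifier that is forwarded to the data party.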

Step 403, according to the ciphertext label and the target sample identification, determine the target ciphertext label and send it to the data party.

In this embodiment, the business party can generate a public-private key pair before sending the ciphertext label and send the public key to the execution subject. The business party then uses the public key to encrypt the original labels to obtain the ciphertext label, which can be represented as {<y>, <1-y>}. Here <y> may be called the first label and <1-y> the second label. The execution subject can then determine the target ciphertext label, represented as {<y_n>, <1-y_n>}, from the ciphertext labels {<y>, <1-y>}.

Step 404, receiving the sum of the first labels and the sum of the second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample ID and the target ciphertext label.

After receiving the target sample ID and the target ciphertext label, the data party can perform feature bucketing, and calculate the sum of the first label and the sum of the second label of each bucket respectively. The sum of the first labels can be recorded as sum(<y_bin_i>), and the sum of the second labels can be recorded as sum(<1-y_bin_i>). The data party can send {sum(<y_bin_i>), sum(<1-y_bin_i>)} to the execution subject.
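The data party's per-bucket aggregation can be sketched as below. The function only needs a ciphertext addition operator, so plain integers stand in for ciphertexts in the demo; the function name, argument layout, and sample data are hypothetical.

```python
from collections import defaultdict
from functools import reduce
import operator

def bucket_label_sums(bucket_of, enc_labels, add):
    """Data party: combine <y> and <1-y> per bucket without decrypting.

    bucket_of:  target sample ID -> bucket index (from feature bucketing)
    enc_labels: target sample ID -> (<y>, <1-y>) ciphertext pair
    add:        homomorphic addition on two ciphertexts
    """
    firsts, seconds = defaultdict(list), defaultdict(list)
    for sid, b in bucket_of.items():
        y, one_minus_y = enc_labels[sid]
        firsts[b].append(y)
        seconds[b].append(one_minus_y)
    # One (sum(<y_bin_i>), sum(<1-y_bin_i>)) pair per bucket.
    return {b: (reduce(add, firsts[b]), reduce(add, seconds[b]))
            for b in sorted(firsts)}

# Demo only: plain integers stand in for ciphertexts, operator.add for
# homomorphic addition. A real data party would never see these values.
bucket_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
enc_labels = {"a": (1, 0), "b": (0, 1), "c": (1, 0), "d": (1, 0), "e": (0, 1)}
sums = bucket_label_sums(bucket_of, enc_labels, operator.add)
```

With a real additively homomorphic scheme, `add` would be the scheme's ciphertext addition and the returned pairs would remain encrypted.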

Step 405, according to the target ciphertext label, determine the sum of positive labels and the sum of negative labels.

In this embodiment, the execution subject can split the target ciphertext label into its first labels and second labels and total the samples that carry the same kind of label. From these totals, the execution subject can determine the sum of positive labels and the sum of negative labels.

In some optional implementations of this embodiment, the execution subject can calculate the sum of positive labels and the sum of negative labels through the following steps:

Step 4051, determine the sum of the first label and the sum of the second label in the target ciphertext label.

Step 4052: Add the sum of the first label to the randomly generated first mask, add the sum of the second label to the randomly generated second mask, encrypt the two resulting sums, and send them to the business party.

Step 4053, receiving the first data obtained by decrypting the two encrypted sum values from the business party, and determining the sum of the positive label and the sum of the negative label according to the first data, the first mask, and the second mask.

In this implementation, the execution subject can first determine the first labels and the second labels in the target ciphertext label, and then calculate the sum of the first labels sum(<y_n>) and the sum of the second labels sum(<1-y_n>). The execution subject can then randomly generate two masks, recorded as the first mask <mask_a> and the second mask <mask_b>. It adds the first mask to the sum of the first labels and the second mask to the sum of the second labels, encrypting with the public key, to obtain the data {sum(<y_n>)+<mask_a>, sum(<1-y_n>)+<mask_b>}, and sends this data to the business party. The business party decrypts the data with the private key paired with the public key to obtain Dec(sum(<y_n>)+<mask_a>) and Dec(sum(<1-y_n>)+<mask_b>), and sends the decrypted values to the execution subject. The execution subject subtracts the first mask and the second mask from the corresponding values to obtain the sum of positive labels pos_total and the sum of negative labels neg_total.
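The masking round trip of steps 4051 to 4053 can be traced end to end with a toy additively homomorphic scheme. Paillier with tiny primes is assumed here purely for illustration (the disclosure does not name the scheme), and all three parties are simulated in one process, so the comments mark which party would run each part.

```python
import math
import random

# Toy Paillier with insecure demo parameters; the scheme is an assumption.
P, Q = 1009, 1013
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # modular inverse, Python 3.8+

def enc(m):  # anyone holding the public key N
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m, N2) * pow(r, N, N2) % N2

def dec(c):  # business party only: needs the private key (LAM, MU)
    return (pow(c, LAM, N2) - 1) // N * MU % N

def hom_add(c1, c2):
    return c1 * c2 % N2

def hom_sum(ciphertexts):
    total = enc(0)
    for c in ciphertexts:
        total = hom_add(total, c)
    return total

# Business party: encrypt each label y as the pair (<y>, <1-y>).
labels = [1, 0, 1, 1, 0]
enc_y = [enc(y) for y in labels]
enc_1my = [enc(1 - y) for y in labels]

# Third party: sum the ciphertexts, then add a random mask to each sum.
mask_a, mask_b = random.randrange(1, 1000), random.randrange(1, 1000)
masked_pos = hom_add(hom_sum(enc_y), enc(mask_a))
masked_neg = hom_add(hom_sum(enc_1my), enc(mask_b))

# Business party: decrypts, but sees only masked totals.
dec_pos, dec_neg = dec(masked_pos), dec(masked_neg)

# Third party: removes the masks to recover the true totals.
pos_total = (dec_pos - mask_a) % N
neg_total = (dec_neg - mask_b) % N
```

The business party decrypts only `sum + mask`, so it learns nothing about how many of its labeled samples overlap with the data party, while the third party never holds the private key.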

Step 406, according to the sum of the first label and the sum of the second label of each bucket, determine the number of positive labels and the number of negative labels of each bucket.

In this embodiment, the execution subject may directly use the sum of the first labels of each bucket as the number of positive labels of each bucket, and the sum of the second labels of each bucket as the number of negative labels of each bucket.

In some optional implementations of this embodiment, the execution subject can calculate the number of positive labels and the number of negative labels of each bucket through the following steps:

Step 4061: Add the sum of the first labels of each bucket to the randomly generated third mask, add the sum of the second labels of each bucket to the randomly generated fourth mask, encrypt the two resulting sums, and send them to the business party.

Step 4062, receiving the second data obtained by decrypting the two encrypted sum values, and determining the number of positive labels and negative labels for each bucket according to the second data, the third mask, and the fourth mask.

In this implementation, the execution subject may first generate multiple third masks and multiple fourth masks, denoted <mask_c> and <mask_d> respectively. The execution subject can then add the third mask <mask_c> to the sum of the first labels sum(<y_bin_i>) of each bucket received from the data party, add the fourth mask <mask_d> to the sum of the second labels sum(<1-y_bin_i>) of each bucket, and encrypt with the public key to obtain the data {sum(<y_bin_i>)+<mask_c>, sum(<1-y_bin_i>)+<mask_d>}. The execution subject can then send this data to the business party. After obtaining the data, the business party can decrypt it with the private key paired with the public key to obtain Dec(sum(<y_bin_i>)+<mask_c>) and Dec(sum(<1-y_bin_i>)+<mask_d>). The business party can send the decrypted values to the execution subject, and the execution subject can subtract the corresponding masks to obtain the number of positive labels npos_i and the number of negative labels nneg_i for each bucket.
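The per-bucket bookkeeping, one fresh mask pair per bucket, can be sketched independently of the encryption scheme. The function names and the mask bound are hypothetical, and identity functions stand in for encryption and decryption in the demo; with a real modular scheme the final subtraction would be taken modulo the plaintext modulus.

```python
import secrets

def mask_buckets(bucket_sums, enc, hom_add, bound=2**32):
    """Third party: add a fresh (<mask_c>, <mask_d>) pair to each bucket."""
    masks, masked = {}, {}
    for b, (s_pos, s_neg) in bucket_sums.items():
        c, d = secrets.randbelow(bound), secrets.randbelow(bound)
        masks[b] = (c, d)
        masked[b] = (hom_add(s_pos, enc(c)), hom_add(s_neg, enc(d)))
    return masks, masked

def unmask_buckets(decrypted, masks):
    """Third party: subtract each bucket's masks from the decrypted values."""
    return {b: (p - masks[b][0], n - masks[b][1])
            for b, (p, n) in decrypted.items()}

# Demo only: identity stand-ins for the encryption scheme.
enc = dec = lambda x: x
hom_add = lambda a, b: a + b

bucket_sums = {0: (1, 1), 1: (2, 1)}  # as received from the data party
masks, masked = mask_buckets(bucket_sums, enc, hom_add)
# Business party: decrypts the masked values and returns them.
decrypted = {b: (dec(p), dec(n)) for b, (p, n) in masked.items()}
counts = unmask_buckets(decrypted, masks)  # {bucket: (npos_i, nneg_i)}
```

Using an independent mask per bucket keeps the business party from comparing buckets with one another after decryption.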

Step 407: Calculate and output parameters corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and negative labels in each bucket.

After obtaining the sum of positive labels pos_total, the sum of negative labels neg_total, the number of positive labels npos_i and the number of negative labels nneg_i of each bucket, the execution subject can use the above parameter values to calculate the parameters corresponding to the target sample identification, such as WOE value and IV value.
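Given these four quantities, the conventional WOE and IV computation can be sketched as follows. The orientation of the ratio (positive share over negative share) is a common convention assumed here, the function name and sample counts are hypothetical, and the sketch does not handle a zero count or zero total.

```python
import math

def woe_iv(bucket_counts, pos_total, neg_total):
    """Return ({bucket: WOE_i}, IV) from per-bucket (npos_i, nneg_i) counts."""
    woe, iv = {}, 0.0
    for b, (npos, nneg) in bucket_counts.items():
        p = npos / pos_total   # share of all positives falling in bucket b
        q = nneg / neg_total   # share of all negatives falling in bucket b
        woe[b] = math.log(p / q)
        iv += (p - q) * woe[b]
    return woe, iv

# Hypothetical counts for three buckets.
counts = {0: (10, 30), 1: (30, 30), 2: (60, 40)}
woe, iv = woe_iv(counts, pos_total=100, neg_total=100)
```

A bucket whose positive and negative shares are equal contributes a WOE of zero and nothing to the IV, while buckets with skewed shares raise the IV.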

In some optional implementations of this embodiment, the execution subject may calculate the above parameters through the following steps:

Step 4071, according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels in each bucket, and at least two preset parameters, calculate and output the parameters corresponding to the target sample identifier.

In the prior art, the calculation of the WOE value can be realized by the following formula (1), and the calculation of the IV value can be realized by the following formula (2):

Here, npos_i is the number of positive samples in the i-th bin, nneg_i is the number of negative samples in the i-th bin, pos_total is the total number of positive samples, and neg_total is the total number of negative samples. When the business party's data all belong to the same class, pos_total = 0 or neg_total = 0, and the WOE and IV values cannot be calculated. The data party can then easily tell that the data provided by the business party all belong to one class and infer the data labels, which creates a risk of data leakage.
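For reference, the conventional definitions of WOE and IV written with the symbols defined above are given below. They are the commonly used textbook forms, stated here as an assumption; the orientation of the ratio is a convention and may differ from the formulas in the original filing.

```latex
\mathrm{WOE}_i = \ln\!\left(
  \frac{\mathrm{npos}_i / \mathrm{pos\_total}}
       {\mathrm{nneg}_i / \mathrm{neg\_total}}
\right)
\qquad
\mathrm{IV} = \sum_i \left(
  \frac{\mathrm{npos}_i}{\mathrm{pos\_total}}
  - \frac{\mathrm{nneg}_i}{\mathrm{neg\_total}}
\right) \mathrm{WOE}_i
```

Both expressions divide by pos_total and neg_total, which is why they are undefined when either total is zero.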

In this implementation mode, the above formula (1) and formula (2) can be improved, and formula (3) and formula (4) can be obtained as follows:

Wherein, ε and δ are preset values, 0<ε<1, 0<δ<0.02.

Considering that a bin may contain samples of only one class during feature binning, and that the labels corresponding to the feature data provided by the data party may all belong to the same class, formula (3) leaves the WOE value unaffected in the normal case while still yielding a WOE value in these special cases, so that the subsequent IV calculation is not affected. Similarly, when the total numbers of positive and negative samples are large enough, formula (4) leaves the IV value unaffected when the IV can be calculated normally, and yields an IV value of δ when the labels of the feature data provided by the data party all belong to the same class. In practice, a characteristic variable with an IV value below 0.02 is usually considered to have almost no predictive ability. As a result, the data party cannot infer from the IV value that the data it provided belongs to a single class; it sees only that the predictive ability of the characteristic variable is extremely small, so the labels of the Guest are not disclosed.

In some optional implementations of this embodiment, the execution subject can output the obtained parameters to the data party, so that the business party cannot learn the WOE value and the IV value, which prevents the business party from obtaining information about the value of the data party's data.

Continuing to refer to FIG. 5, a schematic diagram of the interaction process of the three parties (Guest, third party, and Host) in this embodiment is shown. As shown in FIG. 5, the specific steps of federated feature engineering can be as follows:

1. The Guest and the Host transmit the encrypted sample ID to a trusted third party, and the trusted third party aligns the encrypted samples to obtain the sample ID shared by both parties. Encryption methods include but are not limited to asymmetric encryption, hash algorithm, homomorphic encryption, etc. The trusted third party can distribute the public key to the Guest and the Host. The Guest and the Host use the public key to encrypt and transmit the sample ID to the trusted third party. The trusted third party uses the private key to decrypt and align the samples. The Guest and the Host can also perform a hash operation on the sample ID, allowing a trusted third party to calculate and compare the hash values to obtain aligned samples.

2. The trusted third party sends the aligned sample ID to the Host, and the Host performs feature binning on the aligned sample features.

3. The Guest encrypts the label and sends it to the trusted third party, and the trusted third party sends the aligned ciphertext label to the Host.

4. The trusted third party adds different masks to the sums of the aligned ciphertext positive and negative labels, sends the results to the Guest for decryption, and subtracts the masks from the returned results to obtain the sums of positive and negative labels.

5. The Host calculates the ciphertext sums of positive and negative labels for each feature bucket and sends them to the trusted third party.

6. The trusted third party adds different masks to the per-bucket results, sends them to the Guest for decryption, and subtracts the masks from the returned results to obtain the number of positive and negative labels in each feature bucket.

7. The trusted third party calculates the WOE and IV values based on the results of steps 4 and 6, and finally transmits the IV value of each feature to the Host for storage.

In the data processing method of federated feature engineering provided by the above-mentioned embodiments of the present disclosure, a trusted third party is responsible for aligning data, and only sends the aligned sample ID to the Host. In the process of obtaining the aligned data samples and calculating the sum of positive and negative labels, adding a mask makes it impossible for the Guest to know the proportion of the host data sample labels. And when calculating the number of positive and negative tags under each feature bucket, the same method of adding a mask is used, so that the Guest cannot obtain the data information of the Host, and finally the third party calculates the WOE and IV values of the features provided by the Host. Guarantees that the Guest cannot obtain information about the value of the Host's data.

Further referring to FIG. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a data processing device for federated feature engineering. This device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can be applied to various electronic devices.

As shown in FIG. 6, the data processing device 600 for federated feature engineering in this embodiment includes: a data receiving unit 601, an identifier sending unit 602, a label sending unit 603, and an information output unit 604.

The data receiving unit 601 is configured to receive a first sample identifier of sample data sent by the business party and a ciphertext label corresponding to the first sample identifier, and to receive a second sample identifier of sample data sent by the data party, where the ciphertext label includes a first label and a second label.

The identifier sending unit 602 is configured to determine a target sample identifier according to the first sample identifier and the second sample identifier, and to send the target sample identifier to the data party.

The label sending unit 603 is configured to determine a target ciphertext label according to the ciphertext label and the target sample identifier, and to send the target ciphertext label to the data party.

The information output unit 604 is configured to, in response to receiving the sum of first labels and the sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.
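The data party's side of this exchange, feature bucketing followed by per-bucket label sums, can be sketched as follows. This is an illustration under stated assumptions: equal-width bucketing is chosen only for simplicity (the passage above does not fix a bucketing method), plain integers stand in for ciphertext labels, and the names `equal_width_bucket` and `bucket_label_sums` are hypothetical. With a homomorphic scheme, the `add` and `zero` arguments would be replaced by ciphertext addition and an encryption of zero.

```python
def equal_width_bucket(value, lo, hi, n_buckets):
    """Map a feature value to a bucket index in [0, n_buckets)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_buckets)
    # Clamp so that the maximum value falls into the last bucket.
    return min(max(idx, 0), n_buckets - 1)


def bucket_label_sums(features, first_labels, second_labels, n_buckets,
                      add=lambda a, b: a + b, zero=0):
    """Return a list of (sum of first labels, sum of second labels) per bucket."""
    lo, hi = min(features), max(features)
    sums = [[zero, zero] for _ in range(n_buckets)]
    for x, y1, y2 in zip(features, first_labels, second_labels):
        b = equal_width_bucket(x, lo, hi, n_buckets)
        sums[b][0] = add(sums[b][0], y1)
        sums[b][1] = add(sums[b][1], y2)
    return [tuple(s) for s in sums]


# Illustrative data: one Host feature over six aligned samples.
ages = [18, 22, 35, 41, 58, 63]
first = [1, 1, 0, 0, 0, 1]    # stand-ins for ciphertext first labels
second = [0, 0, 1, 1, 1, 0]   # stand-ins for ciphertext second labels
print(bucket_label_sums(ages, first, second, n_buckets=3))
```

The data party never needs to see the labels in the clear: it only groups samples by its own feature values and adds the label ciphertexts within each group.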

In some optional implementations of this embodiment, the identifier sending unit 602 may be further configured to: align the first sample identifier and the second sample identifier, determine the sample identifier shared by the business party and the data party as the target sample identifier, and send the target sample identifier to the data party.
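The alignment step can be sketched with the hash-based variant mentioned earlier in the description. This is a minimal illustration, assuming both parties hash their identifiers with SHA-256 and a salt agreed between them; the salt value, the function names `hash_id` and `align`, and the sample identifiers are hypothetical.

```python
import hashlib


def hash_id(sample_id, salt=b"shared-salt"):
    """Hash one sample identifier; the salt is a hypothetical shared value."""
    return hashlib.sha256(salt + sample_id.encode("utf-8")).hexdigest()


def align(first_hashes, second_hashes):
    """Return the hashed identifiers present in both lists, in first-list order."""
    second_set = set(second_hashes)
    return [h for h in first_hashes if h in second_set]


# Each party hashes locally and sends only the hashes to the third party.
business_ids = ["u1", "u2", "u3", "u5"]
data_ids = ["u2", "u3", "u4"]
target = align([hash_id(i) for i in business_ids],
               [hash_id(i) for i in data_ids])
print(len(target))
```

The third party compares hash values only, so it can determine the shared identifiers without receiving the identifiers themselves in the clear.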

In some optional implementations of this embodiment, the information output unit 604 may be further configured to: determine the sum of positive labels and the sum of negative labels according to the target ciphertext label; determine the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket; and calculate and output the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket.

In some optional implementations of this embodiment, the information output unit 604 may be further configured to: determine the sum of first labels and the sum of second labels in the target ciphertext label; add the sum of first labels to a randomly generated first mask, add the sum of second labels to a randomly generated second mask, and encrypt the two resulting sums and send them to the business party; and receive first data obtained by the business party decrypting the two encrypted sums, and determine the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.

In some optional implementations of this embodiment, the information output unit 604 may be further configured to: add the sum of first labels of each bucket to a randomly generated third mask, add the sum of second labels of each bucket to a randomly generated fourth mask, and encrypt the two resulting sums and send them to the business party; and receive second data obtained by the business party decrypting the two encrypted sums, and determine the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.

In some optional implementations of this embodiment, the information output unit 604 may be further configured to: calculate and output the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels of each bucket, and at least two preset parameters.

In some optional implementations of this embodiment, the information output unit 604 may be further configured to: output at least one calculated parameter to the data party.

In some optional implementations of this embodiment, the device 600 may further include an encryption unit, not shown in FIG. 6, which is configured to: receive a public key sent by the business party; and perform encryption using the public key, so that the business party can decrypt according to the private key paired with the public key.

It should be understood that the units 601 to 605 described in the data processing device 600 for federated feature engineering correspond to the steps in the method described with reference to FIG. 2. Therefore, the operations and features described above for the data processing method for federated feature engineering also apply to the device 600 and the units contained therein, and are not repeated here.

In the technical solution of the present disclosure, the acquisition, storage and application of the user's personal information involved are in compliance with relevant laws and regulations, and do not violate public order and good customs.

According to the embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

FIG. 7 shows a block diagram of an electronic device 700 for performing a data processing method for federated feature engineering according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 7, the electronic device 700 includes a processor 701 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or loaded from a memory 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data necessary for the operation of the electronic device 700. The processor 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Multiple components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a memory 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The processor 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the processor 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller. The processor 701 executes the various methods and processes described above, for example, the data processing method for federated feature engineering. For example, in some embodiments, the data processing method for federated feature engineering may be implemented as a computer software program tangibly embodied on a machine-readable storage medium, such as the memory 708. In some embodiments, part or all of the computer program can be loaded and/or installed on the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the processor 701, one or more steps of the data processing method for federated feature engineering described above can be performed. Alternatively, in other embodiments, the processor 701 may be configured in any other appropriate way (for example, by means of firmware) to execute the data processing method for federated feature engineering.

Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code can be packaged into a computer program product. The program code or computer program product may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor 701, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present disclosure, a machine-readable storage medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. The machine-readable storage medium may be a machine-readable signal storage medium or a machine-readable storage medium. A machine-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server can also be a server of a distributed system, or a server combined with a blockchain.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of the present disclosure can be achieved; no limitation is imposed herein.

The specific implementation manners described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims

A data processing method for federated feature engineering, applied to a third party, comprising:

receiving a first sample identifier of sample data sent by a business party and a ciphertext label corresponding to the first sample identifier, and receiving a second sample identifier of sample data sent by a data party, wherein the ciphertext label includes a first label and a second label;

determining a target sample identifier according to the first sample identifier and the second sample identifier, and sending the target sample identifier to the data party;

determining a target ciphertext label according to the ciphertext label and the target sample identifier, and sending the target ciphertext label to the data party;

in response to receiving a sum of first labels and a sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculating and outputting a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.

The method according to claim 1, wherein determining the target sample identifier according to the first sample identifier and the second sample identifier and sending the target sample identifier to the data party comprises:

aligning the first sample identifier and the second sample identifier, determining the sample identifier shared by the business party and the data party as the target sample identifier, and sending the target sample identifier to the data party.

The method according to claim 1, wherein calculating and outputting the parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket comprises:

determining a sum of positive labels and a sum of negative labels according to the target ciphertext label;

determining the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket;

calculating and outputting the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket.

The method according to claim 3, wherein determining the sum of positive labels and the sum of negative labels according to the target ciphertext label comprises:

determining a sum of first labels and a sum of second labels in the target ciphertext label;

adding the sum of first labels to a randomly generated first mask, adding the sum of second labels to a randomly generated second mask, and encrypting the two resulting sum values and sending them to the business party;

receiving first data obtained by the business party decrypting the two encrypted sum values, and determining the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.

The method according to claim 3, wherein determining the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket comprises:

adding the sum of first labels of each bucket to a randomly generated third mask, adding the sum of second labels of each bucket to a randomly generated fourth mask, and encrypting the two resulting sum values and sending them to the business party;

receiving second data obtained by the business party decrypting the two encrypted sum values, and determining the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.

The method according to claim 3, wherein calculating and outputting the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket comprises:

calculating and outputting the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels of each bucket, and at least two preset parameters.

The method according to any one of claims 1-6, wherein calculating and outputting the parameter corresponding to the target sample identifier comprises:

outputting at least one calculated parameter to the data party.

The method according to claim 4 or 5, wherein the method further comprises:

receiving a public key sent by the business party;

performing encryption using the public key, so that the business party can decrypt according to a private key paired with the public key.

A data processing device for federated feature engineering, comprising:

a data receiving unit configured to receive a first sample identifier of sample data sent by a business party and a ciphertext label corresponding to the first sample identifier, and to receive a second sample identifier of sample data sent by a data party, wherein the ciphertext label includes a first label and a second label;

an identifier sending unit configured to determine a target sample identifier according to the first sample identifier and the second sample identifier, and to send the target sample identifier to the data party;

a label sending unit configured to determine a target ciphertext label according to the ciphertext label and the target sample identifier, and to send the target ciphertext label to the data party;

an information output unit configured to, in response to receiving a sum of first labels and a sum of second labels of each bucket, calculated by the data party after performing feature bucketing based on the target sample identifier and the target ciphertext label, calculate and output a parameter corresponding to the target sample identifier based on the target ciphertext label and the sum of first labels and the sum of second labels of each bucket.

The device according to claim 9, wherein the identifier sending unit is further configured to:

align the first sample identifier and the second sample identifier, determine the sample identifier shared by the business party and the data party as the target sample identifier, and send the target sample identifier to the data party.

The device according to claim 9, wherein the information output unit is further configured to:

determine a sum of positive labels and a sum of negative labels according to the target ciphertext label;

determine the number of positive labels and the number of negative labels of each bucket according to the sum of first labels and the sum of second labels of each bucket;

calculate and output the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, and the number of positive labels and the number of negative labels of each bucket.

The device according to claim 11, wherein the information output unit is further configured to:

determine a sum of first labels and a sum of second labels in the target ciphertext label;

add the sum of first labels to a randomly generated first mask, add the sum of second labels to a randomly generated second mask, and encrypt the two resulting sum values and send them to the business party;

receive first data obtained by the business party decrypting the two encrypted sum values, and determine the sum of positive labels and the sum of negative labels according to the first data, the first mask, and the second mask.

The device according to claim 11, wherein the information output unit is further configured to:

add the sum of first labels of each bucket to a randomly generated third mask, add the sum of second labels of each bucket to a randomly generated fourth mask, and encrypt the two resulting sum values and send them to the business party;

receive second data obtained by the business party decrypting the two encrypted sum values, and determine the number of positive labels and the number of negative labels of each bucket according to the second data, the third mask, and the fourth mask.

The device according to claim 11, wherein the information output unit is further configured to:

calculate and output the parameter corresponding to the target sample identifier according to the sum of positive labels, the sum of negative labels, the number of positive labels and the number of negative labels of each bucket, and at least two preset parameters.

The device according to any one of claims 9-14, wherein the information output unit is further configured to:

output at least one calculated parameter to the data party.

The device according to claim 14 or 15, wherein the device further comprises an encryption unit configured to:

receive a public key sent by the business party;

perform encryption using the public key, so that the business party can decrypt according to a private key paired with the public key.

An electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-8.

A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1-8.

A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.